Skip to content

katia9848/DSB9_Pandas_SQL_Data_Visual

Repository files navigation

Pandas, SQL and Data Visualization

Summary: Today we will help you with data visualization in Matplotlib, Seaborn, and Plotly.

💡 Tap here to leave your feedback on the project. It's anonymous and will help our team make your educational experience better. We recommend completing the survey immediately after the project.

Contents

  1. Chapter I
    1.1. Foreword
  2. Chapter II
    2.1. Instructions
  3. Chapter III
    3.1. Specific instructions for the day
  4. Chapter IV Mandatory part
    4.1. Exercise 00: Line chart
  5. Chapter V
    5.1. Exercise 01: Line chart with styles
  6. Chapter VI
    6.1. Exercise 02: Bar
  7. Chapter VII
    7.1. Exercise 03: Bar charts
  8. Chapter VIII
    8.1. Exercise 04: Histogram
  9. Chapter IX
    9.1. Exercise 05: Boxplot
  10. Chapter X
    10.1. Part 1
  11. Chapter XI Bonus part
    11.1. Exercise 07: Heatmap
  12. Chapter XII
    12.1. Exercise 08: Seaborn
  13. Chapter XIII
    13.1. Exercise 09: Plotly

Chapter I

Foreword

There are two reasons why visualization is good. First, it is useful for communicating your results to others, such as your employer, colleagues, or customers. Second, visualization helps you develop a better understanding of the data. Here is an example that illustrates its importance.

Try to imagine, or use a piece of paper, to graph the distribution of two variables with the following characteristics:

1

Do you think there is only one way to place the dots?

No, there are several ways. This is called Anscombe’s quartet.

2

All of the graphs above have the same characteristics. Can you believe it?

Looking at the characteristics alone can be misleading. Use graphs to better understand the data.

Chapter II

Instructions

  • Use this page as your only reference. Do not pay attention to rumors or speculation about how to prepare your solution.
  • Here and throughout, we use Python 3 as the only correct version of Python.
  • The python files for python exercises (module01, module02, module03) must have the following block at the end: if __name__ == ‘__main__’.
  • Pay attention to the permissions of your files and directories.
  • To be assessed your solution must be in your GIT repository.
  • Your solutions will be evaluated by your peers in the bootcamp.
  • You should not leave any other files in your directory other than those explicitly specified in the exercise instructions. It is recommended that you modify your .gitignore to avoid any accidents.
  • Your solution must be in your GIT repository for evaluation. Always push only to the develop branch! The master branch will be ignored. Work in the src directory.
  • When you need to get precise output in your programs, it is forbidden to display a precalculated output instead of performing the exercise correctly.
  • Have a question? Ask your neighbor on the right. If that fails, try your neighbor on the left.
  • Your reference materials are your peers, the internet, and Google.
  • Read the examples carefully. They may require information that is not specified elsewhere in the subject.
  • May the Force be with you!

Chapter III

Specific instructions for the day

  • Use the Jupyter notebook to work with your code.
  • No imports are allowed except those mentioned in the "Authorized Functions" section of the title block of each exercise.
  • You can use any built-in function as long as it is not prohibited by the exercise.
  • Save and load all the required data in the subfolder data/.

Chapter IV Mandatory part

Exercise 00: Line chart

  • Turn-in directory: ex00/.
  • Files to turn in: 00_line_chart.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3.

Today, you will work with the same datasets that you used the previous day.

Our goal is to gain a better understanding of how students of the educational company behave. You will use Pandas and SQL again to refine your skills, as well as various Python libraries for data visualization: Matplotlib, Seaborn, and Plotly.

As usual, we will start with something simple. If you have never drawn a graph in Python before, now is the time.

Remember when we analyzed the newsfeed page? Did you ever wonder how often the page was visited over time?

  • Make a connection to the database (it is the same as the previous day).
  • Run a query that gets the datetime from the pageviews table, selecting only users and not admins.
  • Using Pandas, create a new dataframe where the visits are counted and grouped by date.
  • Using the Pandas method .plot(), create a graph:
    • The size of the font should be 8.
    • The size of the figure is (15, 8).
    • The graph must have the title "Views per Day".
    • Notice the rotation of the xticks on the graph below.
  • Close the connection to the database.

3

Chapter V

Exercise 01: Line chart with styles

  • Turn-in directory: ex01/.
  • Files to turn in: 01_line_chart_styles.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3.

Cool! Do you remember that we have data on the commits? Wouldn't it be interesting to plot both metrics over time on the same graph? What if we saw some patterns?

Create exactly the same graph as the one below, including both values and style:

4

  • Analyze only the users, not the admins.
  • Analyze only the dates when there were both views and checker commits.
  • The font size should be 8.
  • The size of the figure should be (15, 8).
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the following question: "How many times was the number of views larger than 150?" Insert: "The answer is ____." Replace the underline with the actual number.

Chapter VI

Exercise 02: Bar

  • Turn-in directory: ex02/.
  • Files to turn in: 02_bar_chart.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3.

We have another question for you: When do our users usually complete the labs: in the night, morning, afternoon, or evening? How has this changed over time?

Create a graph like this one:

5

  • Analyze only the users, not the admins.
  • The font size and fig size are still the same.
  • Night is from 0:00:00 to 03:59:59, morning is from 04:00:00 to 09:59:59, afternoon is from 10:00:00 to 16:59:59, and evening is from 17:00:00 to 23:59:59.
  • Choose a palette you really enjoy. You do not have to replicate it from the graph above.
  • At the end of your Jupyter notebook, create a Markdown cell and insert the following questions:
    • When do our users usually commit to the labs: in the night, morning, afternoon, or evening? The answer is the two most common periods.
    • Which day has:
      • the most number of commits?
      • and at the same time, is the number of commits higher in the evening than in the afternoon?

The answer is the date of that day.

Chapter VII

Exercise 03: Bar charts

  • Turn-in directory: ex03/.
  • Files to turn in: 03_bar_charts.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3.

What if the average number of commits differs between weekdays and weekends?

Create a graph like this by doing what you need to do:

6

  • Analyze only the users, not the admins.
  • The font size and fig size should remain the same.
  • For each hour, calculate the average number of commits on working days and weekends. If there were no commits in an hour, do not use it to calculate the average. Use these values for your graph. For example: Monday, 17–18: 5 commits; Tuesday, 17–18: 6 commits; Wednesday, 17–18: 7 commits.
  • Choose a palette you really enjoy. You do not have to replicate the one from the graph above.
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the following question:
    • "Is the dynamic different on working days and weekends?" For the answer, include the hour when the number of commits is largest during working days and the hour when it is largest during the weekend.

Chapter VIII

Exercise 04: Histogram

  • Turn-in directory: ex04/.
  • Files to turn in: 04_histogram.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3, import matplotlib.pyplot as plt.

In the previous exercise, you had to use Pandas to draw a distribution grouping the values. Wouldn't it be nice to draw it automatically? Well, we can.

However, we have to use a different type of visualization: histograms. This time, we won't use averages. Instead, we will use the absolute number of commits and compare them during working days and weekends.

Create a graph like this:

7

  • Analyze only the users, not the admins.
  • Create two lists of values for the histogram input: one for working days and one for weekends.
  • The fig size remains the same. You can also choose the font size and color palette.
  • Use a level of transparency equal to 0.7 for the histogram in front.
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the following question: "Are there hours when the total number of commits was higher on weekends than on working days?" In your answer, include the top four examples.

Chapter IX

Exercise 05: Boxplot

  • Turn-in directory: ex05/.
  • Files to turn in: 05_boxplot.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3, import matplotlib.pyplot as plt.

Do you remember when we tried to determine whether the newsfeed affected the behavior of the test and control users? Last time, we only calculated the average values. But what about the variances? What if they changed, too? What if there were outliers?

Drawing a boxplot may help answer these questions.

Create a graph like this:

8

  • Use the data from the file, read it into a dataframe, and make any modifications you find useful to solve the task.
  • The fig size is still the same; you can choose whatever font size you like.
  • The color palette should be the same as in the example.
  • The font size of the title is 15.
  • The width of the box lines is 3 and the width of the median lines is 2.
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the following question: "What was the IQR of the control group before the newsfeed?" In your answer, provide an approximate value that can be obtained simply by looking at the graph and round it to the nearest 10.

Chapter X

Exercise 06: Scatter Matrix

  • Turn-in directory: ex06/.
  • Files to turn in: 06_scatter_matrix.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3, from pandas.plotting import scatter_matrix.

Do you remember when we tried to determine if there was a correlation between the number of Newsfeed visits and the average difference between the first commit and the lab deadline? The problem is that the correlation coefficient only shows whether there is a linear relationship between two variables. But what if the relationship is not linear? How can we tell? That's right—by drawing graphs!

Create a graph like this:

9

  • Create a dataframe where each user in the test group has the average difference, the number of pageviews, and the number of commits.
  • Do not take project1 into account when calculating the average difference and number of commits.
  • Take the number of commits from the checker table.
  • The fig size remains the same. You can choose whatever font size and color palette you like.
  • The size of the dots should be 200.
  • The width of the lines of the diagonal graphs (KDE) should be 3.
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the following questions:
    • "Can we say that if a user has a low number of pageviews, then they likely have a low number of commits?" The answer is yes or no.
    • "Can we say that if a user has a low number of pageviews, then they likely have a small average difference between the first commit and the lab deadline?" The answer is yes or no.
    • "Can we say that there are many users with a low number of commits and a few with a high number of commits?" The answer is yes or no.
    • "Can we say that there are many users with a small average difference and a few with a large average difference?" The answer is yes or no.

Chapter XI Bonus part

Exercise 07: Heatmap

  • Turn-in directory: ex07/.
  • Files to turn in: 07_heatmap.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3, import matplotlib.pyplot as plt, from mpl_toolkits.axes_grid1 import make_axes_locatable.

In an earlier exercise, we wanted to see if there were different patterns for users on weekdays versus weekends. In this exercise, let's find out if users have different patterns on different weekdays and at different times.

  • Analyze only the users, not the admins.
  • You can choose a color palette for both graphs.
  • Use the table checker for your query.
  • Use the absolute values of the commits, not the averages.
  • Sort the dataframes by the total number of commits made by a user.
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the questions. Answer them by looking only at the graphs.
    • "Which user has the most commits on Tuesday?" The answer is user_*.
    • "Which user has the most commits on Thursday?" The answer is user_*.
    • "On which weekday do users not like to make a lot of commits?" The answer, for example, is Monday.
    • "Which user made the largest number of commits at which hour?" The answer is user_1, 15.

Create two graphs like these:

10

Chapter XII

Exercise 08: Seaborn

  • Turn-in directory: ex08/.
  • Files to turn in: 08_seaborn.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3, import matplotlib.pyplot as plt, import seaborn as sns.

Okay, in previous exercises, we sometimes ignored project1 in our calculations. The project was a competition. It had longer deadlines and many more commits than ordinary labs. Now, let's look at the commit activity in this project per user. This time, we will use Seaborn, another library for data visualization in Python. In general, it is much easier to create beautiful graphs with that library.

Do what you need to do to create a graph like this:

11

  • Analyze only the users, not the admins.
  • Take into account only logs from the table checker where the status is "ready".
  • You can choose the palette you like.
  • The line width should be 3.
  • The background of the graph is gray.
  • The height should be 10 and the width should be 1.5 times the height.
  • The font size of the title should be 30.
  • The font size of the axis labels should be 15.
  • At the end of your Jupyter Notebook, create a Markdown cell and insert the following questions. Answer them by looking only at the graphs:
    • "Which user was the leader in the number of commits almost all of the time?" The answer is user_*.
    • "Which user was the leader for only a short period of time?" The answer is user_*.

Chapter XIII

Exercise 09: Plotly

  • Turn-in directory: ex09/.
  • Files to turn in: 09_plotly.ipynb.
  • Allowed functions: import pandas as pd, import sqlite3, import plotly.graph_objects as go, import numpy as np.

Matplotlib and Seaborn are powerful libraries that can be used for most DataViz tasks. However, they do not offer the ability to create interactive charts and animations. Plotly can help you with that. In this exercise, you will create an animation of the graph from the previous exercise.

Do whatever is necessary to create a graph like this:

12

This is not an easy task, and good, clear tutorials are hard to find, so use this link as a reference.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors