# Lab 00 - B - Data Visualization In Python

Visualization is an important step in data analysis and machine learning. It provides insights into the data itself, the models and the training processes. These insights are sometimes hard to achieve otherwise.

In the following labs we will use the [Plotly](https://plotly.com/python/) graphics library. We will practice creating some basic graphs. You are encouraged to take a deeper look at the [Plotly-Python website](https://plotly.com/python/).

We will be using two datasets:
- The first one, [based on this dataset](https://www.kaggle.com/spscientist/students-performance-in-exams), presents the results of students in exams in different disciplines. In addition to the scores in the different exams, the student's gender is given as a categorical column, as are their ethnicity, their parents' level of education and whether the student took a preparation course before the test.

- The second dataset is of [temperature measurements taken in different major cities](https://www.kaggle.com/sudalairajkumar/daily-temperature-of-major-cities).

In [1]:
import sys
sys.path.append("../")
from utils import *

The following `plotly` modules are already imported in the `utils` file, but we import them again here for educational reasons.

In [5]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## Loading Datasets

In [7]:
students_df = pd.read_csv("../datasets/Students_Performance.csv")
students_df.head()

Unnamed: 0.1,Unnamed: 0,gender,race.ethnicity,parental.level.of.education,lunch,test.preparation.course,math.score,reading.score,writing.score,science.score
0,1,female,group B,bachelor's degree,standard,none,88.8,86.506024,88.444444,84.438076
1,2,female,group C,some college,standard,completed,87.6,95.180723,94.666667,62.911799
2,3,female,group B,master's degree,standard,none,96.0,97.590361,96.888889,58.740116
3,4,male,group A,associate's degree,free/reduced,none,78.8,79.277108,75.111111,79.26114
4,5,male,group C,some college,standard,none,90.4,89.39759,88.888889,87.83354


In [8]:
temperature_df = pd.read_csv("../datasets/City_Temperature.csv")
temperature_df.head()

Unnamed: 0,Country,City,Date,Year,Month,Day,Temp
0,South Africa,Capetown,1995-01-01,1995,1,1,19.333333
1,South Africa,Capetown,1995-01-02,1995,1,2,19.888889
2,South Africa,Capetown,1995-01-03,1995,1,3,19.388889
3,South Africa,Capetown,1995-01-04,1995,1,4,20.833333
4,South Africa,Capetown,1995-01-05,1995,1,5,21.444444


In [10]:
temperature_df = pd.read_csv("../datasets/City_Temperature.csv")
temperature_df = temperature_df.loc[temperature_df["Temp"] != -99]
temperature_df.head()

Unnamed: 0,Country,City,Date,Year,Month,Day,Temp
0,South Africa,Capetown,1995-01-01,1995,1,1,19.333333
1,South Africa,Capetown,1995-01-02,1995,1,2,19.888889
2,South Africa,Capetown,1995-01-03,1995,1,3,19.388889
3,South Africa,Capetown,1995-01-04,1995,1,4,20.833333
4,South Africa,Capetown,1995-01-05,1995,1,5,21.444444
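
The second loading cell drops every row whose `Temp` equals -99. Presumably this value marks a missing measurement; that is an assumption read off the filter itself, not something the dataset description here confirms. A minimal sketch on made-up data (the `toy_df` frame is hypothetical) shows two ways such a marker could be handled:

```python
import numpy as np
import pandas as pd

# Toy data; -99 plays the role of the assumed "missing measurement" marker
toy_df = pd.DataFrame({"City": ["A", "A", "B", "B"],
                       "Temp": [19.3, -99.0, 21.4, -99.0]})

# Option 1: drop the marked rows, as the cell above does
filtered_df = toy_df.loc[toy_df["Temp"] != -99]

# Option 2: keep the rows but turn the marker into NaN,
# so that pandas statistics skip it by default
masked_df = toy_df.replace({"Temp": {-99: np.nan}})

print(len(filtered_df), masked_df["Temp"].isna().sum())
```

Dropping rows is simpler for plotting; masking keeps the date index complete, which can matter for time series.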


## Distribution Of Features

When you first look at a new dataset, it is important to understand how the different features "behave". Generally, we distinguish between 3 kinds of features: categorical (e.g. gender, country, etc.), discrete (e.g. years in the range 1950 to 1960) or continuous (e.g. price, weight, height, etc.). We will explore different visualization options for different features, both on their own and together with other features.
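
As a rough illustration of the three kinds of features, the sketch below guesses the kind of each column from its pandas dtype. The `feature_kind` helper and the toy columns are hypothetical, and the rule is only a rule of thumb: an integer column can just as well encode categories.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gender": ["female", "male", "female"],  # categorical
    "year": [1950, 1951, 1950],              # discrete
    "weight": [61.5, 80.2, 55.0],            # continuous
})

def feature_kind(column):
    """Guess the kind of a feature from its dtype (heuristic only)."""
    if pd.api.types.is_float_dtype(column):
        return "continuous"
    if pd.api.types.is_integer_dtype(column):
        return "discrete"
    return "categorical"

kinds = {name: feature_kind(toy_df[name]) for name in toy_df.columns}
print(kinds)
```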

### Categorical Features

One of the simplest and most used plots to visualize categorical or discrete data is a bar plot. We use it to visualize how many items there are per category of a given feature.

We use the `px.bar` function of `plotly` to do so. The first two parameters we pass after the data are `x` and `y`, where `x` specifies the different categories and `y` specifies the number of occurrences of each of those categories. Thus, we need to build a `pandas.DataFrame` object that fits this structure.

In [12]:
df_count_ethnicities = students_df.groupby(['race.ethnicity']).size().reset_index(name='Count')
df_count_ethnicities

Unnamed: 0,race.ethnicity,Count
0,group A,89
1,group B,190
2,group C,319
3,group D,262
4,group E,140


In [13]:
px.bar(df_count_ethnicities, x="race.ethnicity", y="Count", height=200).show()

To see whether each ethnic group contains an equal proportion of men and women, we can color the bars by gender.

In [14]:
df_count_ethnicities = students_df.groupby(['race.ethnicity', 'gender']).size().reset_index(name='Count')
px.bar(df_count_ethnicities, x="race.ethnicity", y="Count", color = "gender", height=200).show()
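
Stacked bars show counts, so proportions have to be read by eye. A possible complement, sketched here on made-up data (the `toy_df` frame is hypothetical), is to compute the share of each gender within each group with `pd.crosstab`:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "race.ethnicity": ["group A", "group A", "group A", "group A", "group B", "group B"],
    "gender": ["female", "male", "male", "male", "female", "male"],
})

# normalize="index" divides each row by its total,
# so every row holds the gender shares within one ethnicity group
proportions = pd.crosstab(toy_df["race.ethnicity"], toy_df["gender"], normalize="index")
print(proportions)
```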

Now, let us check if the parental educational degree influences whether the students had a test preparation course. To keep it simple, we make independent plots for each level of education.

In [15]:
colored_by = "test.preparation.course"
split_by = 'parental.level.of.education'

for level in students_df[split_by].unique():
    df = students_df.loc[students_df[split_by] == level].groupby([colored_by]).size().reset_index(name='Count')
    px.pie(df, values='Count', names = colored_by, title = level, height=150).show()

### Continuous Features
Next, we deal with features on a continuous scale. Suppose we want to know the distribution of the grades. We can do so by plotting a histogram of the grades for each subject. The cell below plots absolute counts; relative counts are available through the `histnorm` argument of `go.Histogram`.
- Observe that for all three subjects we get a noisy bell-like shape.
- Most students achieved grades around 85.
- Some students achieved very high grades, while others achieved lower grades of around 60-70.

Unlike the separate figures we used for the pie charts, here we use Plotly's `make_subplots` function to create all plots in a single figure.

In [16]:
fig = make_subplots(rows=1, cols=3,
                    subplot_titles=("Math score distribution", "Writing score distribution", "Reading score distribution"))

for i, label in enumerate(["math.score", "writing.score", "reading.score"]):
    # add_trace replaces the deprecated append_trace
    fig.add_trace(go.Histogram(x=students_df[label], showlegend=False), row=1, col=i+1)
    fig.update_xaxes(title_text=label.capitalize(), row=1, col=i+1)

fig.update_layout(height=300).show()

Often we want to know how two different continuous features influence one another. It could be interesting to see if they correlate. Here we will check if there is a correlation between a student's grade in math and in reading or science. We do so by plotting a scatter plot with the `x` values being the one feature and the `y` values the other feature. We can also color the dots by some category like gender or some continuous value like the score.

In the figure below, we see that the math score is strongly correlated with the reading score. For the science exam, the picture is more complex: for the female students (in blue), the higher the grade in math, the lower the grade in science.

In [17]:
students_df["gender.cat"] = pd.Categorical(students_df["gender"]).codes

fig = make_subplots(rows=1, cols=2, start_cell="bottom-left")

fig.add_traces([go.Scatter(x=students_df["math.score"], y=students_df["reading.score"], mode="markers", 
                           marker = dict(color = students_df["gender.cat"], colorscale="Bluered"), showlegend = False),
                go.Scatter(x=students_df["math.score"], y=students_df["science.score"], mode="markers", 
                           marker = dict(color = students_df["gender.cat"], colorscale="Bluered"), showlegend = False)],
               rows=[1,1], cols=[1,2])
fig.add_trace(go.Scatter(x = [None], y = [None], mode = 'markers',
                        marker = dict(color="Blue"), legendgroup = "female", name = "female"), row = 1, col =1)
fig.add_trace(go.Scatter(x = [None], y = [None], mode = 'markers',
                        marker = dict(color="Red"), legendgroup = "male", name = "male"), row = 1, col =1)
# The math score is on the x-axis of both subplots; the y-axis holds the other subject
fig.update_xaxes(title_text="Math Score")
fig.update_yaxes(title_text="Reading Score", row=1, col=1)
fig.update_yaxes(title_text="Science Score", row=1, col=2)
fig.show()
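
A scatter plot suggests a correlation visually; a number can back it up. The sketch below computes Pearson correlation coefficients with pandas, overall and per gender. The data is made up purely for illustration (the `toy_df` frame is hypothetical) and does not reproduce the lab's dataset:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "math.score": [60, 70, 80, 90, 60, 70, 80, 90],
    "reading.score": [62, 71, 83, 88, 58, 69, 82, 93],
    "science.score": [90, 80, 75, 60, 65, 70, 78, 85],
})

# Pearson correlation between two columns over all students
overall = toy_df["math.score"].corr(toy_df["reading.score"])

# The same measure computed separately inside each gender group
per_gender = {gender: group["math.score"].corr(group["science.score"])
              for gender, group in toy_df.groupby("gender")}

print(round(overall, 3), per_gender)
```

Values close to 1 or -1 indicate a strong linear relationship, and the sign gives its direction. Splitting by group can reveal opposite trends that a single overall coefficient would hide.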


## Visualizing Combinations Of Features

When we have two categorical features and we want to know the distribution of samples across those two features, we often use heatmaps. A heatmap is also useful to represent any continuous feature across different categorical features.

In [18]:
df_count_ethnicities = students_df.groupby(['race.ethnicity', 'gender']).size().reset_index(name='Count')

x_ = np.unique(df_count_ethnicities["race.ethnicity"].tolist())
y_ = np.unique(df_count_ethnicities["gender"].tolist())

values = np.array(df_count_ethnicities["Count"]).reshape(5, 2)
values_norm_row = (values.T/values.sum(axis = 1)).T
values_norm_col = values/values.sum(axis = 0)

for title, z in [["Counts Heatmap", values], ["Row Normalized", values_norm_row], ["Column Normalized", values_norm_col]]:
    go.Figure(go.Heatmap(x=y_, y=x_,z=z), layout=go.Layout(title=title, height=300, width=200)).show()
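
The `reshape(5, 2)` call above relies on the rows being ordered by ethnicity and then by gender, with no missing combination. A possible alternative, sketched on made-up counts (the `counts_long` frame is hypothetical), is to pivot the table and normalize it with `DataFrame.div`:

```python
import pandas as pd

counts_long = pd.DataFrame({
    "race.ethnicity": ["group A", "group A", "group B", "group B"],
    "gender": ["female", "male", "female", "male"],
    "Count": [10, 30, 20, 40],
})

# Rows are ethnicity groups and columns are genders, whatever the input row order
counts = counts_long.pivot(index="race.ethnicity", columns="gender", values="Count")

# Row normalization: every row sums to 1
row_normalized = counts.div(counts.sum(axis=1), axis=0)

# Column normalization: every column sums to 1
col_normalized = counts.div(counts.sum(axis=0), axis=1)

print(row_normalized, col_normalized, sep="\n\n")
```

The row-normalized table answers "within a group, what share is each gender?", while the column-normalized table answers "within a gender, what share belongs to each group?".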

In order to visualize the distribution of a continuous feature across the values of another feature, we can use boxplots. In the graphs below, we display the distribution of the temperature for each month separately, colored by city. We first observe that the summer months (6-8) are warmer than the other months. When the selected country contains more than one city, the same plot also lets us compare the cities month by month.

In [37]:
temperature_df['Country'].unique()

array(['South Africa', 'The Netherlands', 'Israel', 'Jordan'],
      dtype=object)

In [43]:
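# Map month numbers (1-12) to short month names
di = {i+1: m_ for i, m_ in enumerate(['Jan', 'Feb', 'March', 'Apr', 'May', 'June', 'July', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec'])}

# Note: despite its name, france_temperature holds the rows for Israel
france_temperature = temperature_df[temperature_df['Country'] == "Israel"]
# replace returns a new DataFrame, so only the displayed output shows month names;
# france_temperature itself keeps the numeric months used in the plots below
france_temperature.replace({"Month": di})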

Unnamed: 0,Country,City,Date,Year,Month,Day,Temp
18532,Israel,Tel Aviv,1995-01-01,1995,Jan,1,14.055556
18533,Israel,Tel Aviv,1995-01-02,1995,Jan,2,13.388889
18534,Israel,Tel Aviv,1995-01-03,1995,Jan,3,13.277778
18535,Israel,Tel Aviv,1995-01-04,1995,Jan,4,13.833333
18536,Israel,Tel Aviv,1995-01-05,1995,Jan,5,13.666667
...,...,...,...,...,...,...,...
23168,Israel,Tel Aviv,2007-09-11,2007,Sept,11,26.388889
23169,Israel,Tel Aviv,2007-09-12,2007,Sept,12,26.500000
23170,Israel,Tel Aviv,2007-09-13,2007,Sept,13,26.500000
23171,Israel,Tel Aviv,2007-09-14,2007,Sept,14,26.444444


In [44]:
px.scatter(france_temperature, x="Month", y="Temp", color="City", facet_col = "City").show()

In [45]:
px.box(france_temperature, x="Month", y="Temp", color="City").show()
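
Each box summarizes one month by its quartiles. As a rough sketch of the numbers behind such a plot, the quartiles can be computed per month with pandas. The data is made up (the `toy_df` frame is hypothetical), and Plotly's own quartile convention may differ slightly from the pandas default used here:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 7, 7, 7, 7, 7],
    "Temp": [12.0, 13.0, 14.0, 15.0, 16.0, 24.0, 25.0, 26.0, 27.0, 28.0],
})

# One row per month, one column per quantile
quartiles = toy_df.groupby("Month")["Temp"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["Q1", "median", "Q3"]

print(quartiles)
```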