# Data Science and Visualization (RUC F2023)

## Lecture 3: Data Visualization

# Basic Visualizations

* ### Histogram
* ### Bar chart
* ### Boxplot
* ### Scatterplot
* ### Line chart

We demonstrate each plot type with a number of small synthetic datasets about kids' health. This notebook contains both demo code and exercises (with blank cells). We use **matplotlib.pyplot** to make the basic plots.

## 0. Setup and construct the data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame([['Alis', 'Female', 2, 90, 13, 'Negative', 'Yes'],
                   ['Alex', 'Male', 6, 100, 20, 'Negative', 'Yes'],
                   ['Bo', 'Male', 2, 80, 10, 'Negative', 'Yes'],
                   ['Chris', 'Male', 2, 90, 17, 'Positive', 'Yes'],
                   ['Daisy', 'Female', 2, 90, 17, 'Positive', 'No'],
                   ['John', 'Male', 3, 96, 15, 'Negative', 'Yes'],
                   ['Kate', 'Female', 4, 100, 19, 'Negative', 'No'],
                   ['Sebastian', 'Male', 5, 110, 19, 'N/A', 'Yes'],
                   ['Mads', 'Male', 3, 100, None, 'Positive', 'No'],
                   ['Emil', 'Male', 5, None, 18, 'Negative', 'No'],
                   ['Kelly', 'Female', 4, 100, 18, 'Positive', 'Yes'],
                   ['Karin', 'Female', 5, 90, 15, 'Positive', 'No'],
                   ['Sarah', 'Female', 3, 90, 13, 'Negative', 'No']], 
                  columns=["Name", "Gender", "Age", "Height", "Weight", "Test", 'Fever'])

df

## 1. Histogram

We can call plt.hist(.) directly to get the histogram for a column.

### 1.0 We create an array of numbers following a normal distribution

In [None]:
# A normal distribution with mean=0 and std=10. The number of values is 275.
x = np.random.normal(0, 10, 275)

# If we print x, it will look like this:
"""
print(x)
[-12.86467624  -1.55637358   2.04774988  -2.78593052   7.0037208
  -2.2541189   -8.15792232 -16.54421057   4.7852369  -17.48727612
 ... ...
 -9.86399721 -15.79293859   6.11728893   9.6049965    9.6579244
  14.4480663   -4.02353426  -0.49568545   1.29181534  -4.13082611]
"""

# Let's plot the histogram of all values in x
plt.hist(x)
plt.show() 
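
By default, `plt.hist` groups the values into 10 equal-width bins. The sketch below passes the `bins` parameter explicitly and keeps the values that `plt.hist` returns, namely the count in each bin and the bin edges. It assumes a seeded random generator, which is not part of the demo above and is used only to make the result repeatable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: a seeded generator, used here only to make the example repeatable.
rng = np.random.default_rng(42)
x = rng.normal(0, 10, 275)

# bins=20 asks for 20 equal-width bins instead of the default 10.
counts, edges, patches = plt.hist(x, bins=20)
plt.title('Histogram of x with 20 bins')

# Every value falls into exactly one bin, so the counts add up to len(x).
print(int(counts.sum()), len(edges))
```

More bins show finer detail but make each bar noisier, so it is worth trying a few values.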

### 1.1 We create a histogram for the Fever column in the children's health data.

In [None]:
plt.hist(df.Fever)

plt.title('Histogram of Fever')

### 1.2 We create a function to create a histogram for a given column.

A histogram like the one above needs some 'decorations', e.g., Y ticks and a title. These decorations can be the same for all such histograms. Therefore, we define a function that takes an arbitrary column, plots its histogram, and automates some things, e.g., choosing the Y ticks from the value counts.

In [None]:
def plotHistgram(data, title='Histogram'):
    plt.hist(data)
    plt.title(title)
    
    # We obtain the largest count and use it to decide the Y ticks.
    data_counted = data.value_counts()
    upper = data_counted.max()
    plt.yticks(range(0, upper+1))
    
    # We could do likewise for the X axis, but it would be more complicated.
    # A much easier way is to let the caller of this function decide the X ticks,
    # e.g., plt.xticks([3, 4, 5], ha='center').
    # The 'ha' parameter sets the horizontal alignment of the x tick labels.
    plt.xticks(ha='center')

We use the function to create a histogram for the Fever column.

In [None]:
plotHistgram(df.Fever, 'Histogram of Fever')

We use the function to create a histogram for the Weight column.

In [None]:
plotHistgram(df.Weight)

For the Age column.

In [None]:
plotHistgram(df.Age)
plt.xticks([2, 3, 4, 5, 6], ha='left')

### 1.3 More decorations

If the x ticks are (long) strings, we may specify their horizontal alignment (**ha**) and rotation mode (**rotation_mode**) to make them look better.
* ha: 'left', 'center', 'right'
* va: "top", "center", "baseline", "bottom" (vertical alignment, for y ticks)
* rotation_mode:
    * 'default' (or None): first rotates the text and then aligns the bounding box of the rotated text.
    * 'anchor': aligns the unrotated text and then rotates the text around the point of alignment.

It may take some time to show ticks in the best way through alignment and rotation, and it is not always possible to get them the way you want if there is not much room for maneuver.

In [None]:
plotHistgram(df.Test)
plt.xticks(rotation=45, ha='right', rotation_mode='default')
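
To see the difference between the two rotation modes, the following sketch draws the same bar chart twice and changes only `rotation_mode`. The labels and counts are made up for illustration and are not taken from the kids' health data.

```python
import matplotlib.pyplot as plt

# Hypothetical long labels and counts, used only for illustration.
labels = ['Negative result', 'Positive result', 'Not available']
counts = [7, 5, 1]
positions = list(range(len(labels)))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, mode in zip(axes, ['default', 'anchor']):
    ax.bar(positions, counts)
    ax.set_xticks(positions)
    # Same rotation and alignment in both panels; only rotation_mode differs.
    ax.set_xticklabels(labels, rotation=45, ha='right', rotation_mode=mode)
    ax.set_title("rotation_mode='" + mode + "'")
fig.tight_layout()
```

With `ha='right'`, the 'anchor' mode usually places the end of each rotated label directly under its tick, which is often what you want for long labels.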

### 1.4 (Exercise) Make a histogram for Age *without* using the function defined above.

There are many tricks to beautify such a histogram, e.g., positioning the bars closer to or farther from each other, or customizing the X ticks. You can find such low-level details in online examples if you need them. NB: They can be time-consuming to implement.
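
As a minimal sketch of two such tricks, the code below centers each bar on an integer value by choosing the bin edges explicitly, and leaves gaps between the bars with `rwidth`. The values are made up for illustration, so this is a starting point and not a solution to the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical integer values, used only for illustration.
values = [2, 2, 3, 3, 3, 4, 5, 5, 6]

# Bin edges at 1.5, 2.5, ..., 6.5 put each integer in the middle of its own bin.
edges = np.arange(1.5, 7.0, 1.0)

# rwidth=0.8 draws each bar at 80% of the bin width, leaving gaps between bars.
counts, _, _ = plt.hist(values, bins=edges, rwidth=0.8)
plt.xticks(range(2, 7))
plt.yticks(range(0, int(counts.max()) + 1))
plt.title('Histogram with centered bins')
```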

### 1.5 (Exercise) Make a histogram for Height using the function defined above.

## 2. Bar chart

### 2.1 Make a bar chart about the age of all kids.

In [None]:
plt.bar(df.Name, df.Age)
plt.xlabel('Names')
plt.xticks(ha='right', rotation_mode='anchor', rotation=45)
plt.ylabel('Age')
plt.title('Age of each kid')

### 2.2 Make a bar chart about the number of kids per gender group.

In [None]:
series_1 = df.groupby(['Gender'])['Name'].count()
series_1

In [None]:
plt.bar(series_1.index, series_1.values)
plt.title('Number of kids per gender')
plt.yticks(series_1.values)
plt.xlabel('Gender')
plt.ylabel('#Kids')

### 2.3 Make a bar chart about the number of Positives and Negatives of each gender group.

In [None]:
series_3 = df.groupby(['Gender', 'Test'])['Test'].count()
series_3
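
The grouped result above is a Series with a two-level index (Gender, Test). As a minimal sketch on a small made-up table (the names `toy`, `counted`, and `table` are illustrative and not part of the notebook), `unstack()` moves the inner index level into columns. This gives one row per gender and one column per test result, which is the shape a grouped bar chart needs.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({'Gender': ['Female', 'Female', 'Male', 'Male', 'Male'],
                    'Test': ['Negative', 'Positive', 'Negative', 'Negative', 'Positive']})

# One count per (Gender, Test) pair, stored in a Series with a two-level index.
counted = toy.groupby(['Gender', 'Test'])['Test'].count()

# unstack() pivots the inner level (Test) into columns.
table = counted.unstack()
print(table)
```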

In [None]:
# Below rot=0 is needed to make the x ticks horizontal
series_3.unstack().plot(kind='bar', rot=0)
plt.yticks(series_3.values)

We may also show the bars in a stacked way, by setting 'stacked' to True.

In [None]:
# Below the width parameter specifies the width of the bars
series_3.unstack().plot(kind='bar', stacked=True, rot=0, width=0.2)

# We should set the y ticks differently from above
# We get one count for each gender group
counts = df.groupby(['Gender'])['Test'].count()
# We use the larger count as the upper limit for the y ticks
plt.yticks(range(0, counts.values.max()+1))

### 2.4 (**Exercise**) Make a bar chart about the height (or weight) of all kids.

In [None]:
plt.bar(df.Name, df.Weight)
plt.xlabel('Names')
plt.xticks(ha='right', rotation_mode='anchor', rotation=45)
plt.ylabel('Weight (kg)')
plt.title('Weight of each kid')

### 2.5 (**Exercise**) Make a bar chart about the number of kids per test result group.

### 2.6 (**Exercise**) Make a bar chart about the number of fever cases and no-fever cases of each gender group.

## 3. Boxplot

### 3.1 Get the statistics of Age and make a boxplot for it

In [None]:
df['Age'].describe()

In [None]:
import matplotlib.pyplot as plt

plt.boxplot(df['Age'])
plt.xticks([1], ['Age'])
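
A boxplot draws the same quartiles that `describe()` reports. The sketch below recomputes them for the Age values from the table in Section 0 and derives the limits for the whiskers, assuming matplotlib's default rule of 1.5 times the interquartile range (IQR).

```python
import pandas as pd

# The Age column from the kids' health table in Section 0.
ages = pd.Series([2, 6, 2, 2, 2, 3, 4, 5, 3, 5, 4, 5, 3])

q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# The whiskers reach the most extreme data points inside these limits.
# Points outside the limits would be drawn as individual outlier markers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(q1, median, q3)            # box bottom, middle line, box top
print(lower_limit, upper_limit)  # limits for the whiskers
print(len(outliers))             # number of outliers
```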

### 3.2 Get the statistics of Age for each gender group and make a boxplot for each

In [None]:
# We prepare a list of series objects, each having the age values for a specific gender.
data2plot = [df['Age'][df.Gender == 'Male'],
             df['Age'][df.Gender == 'Female']]

# We pass the list to boxplot() which will make a boxplot for each series object in the list
plt.boxplot(data2plot)

# Below, the first argument specifies the positions on the x axis to place the ticks,
# and the second argument gives a list of labels for the x ticks
plt.xticks(range(1, 3), ['Male', 'Female'], ha='center')

# We can also save the graph to a file on disk (uncomment the line below)
#plt.savefig('boxes.jpg', dpi=300, bbox_inches='tight')

### 3.3 (Exercise) Make a boxplot for Weight

You need to fill in the missing value in Weight first; otherwise nothing will be plotted. You may use forward filling (ffill) on the data.
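
As a minimal sketch of forward filling, the code below uses a short made-up Series (not the Weight column itself) with one missing entry, fills the gap, and then draws a boxplot of the result.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values with one missing entry, used only for illustration.
s = pd.Series([13.0, 20.0, None, 18.0])

# ffill() copies the last valid value forward into each gap.
filled = s.ffill()
print(filled.tolist())

result = plt.boxplot(filled)
plt.xticks([1], ['Filled values'])
```

Note that forward filling cannot fill a gap at the very start of a column, because there is no earlier value to copy.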

### 3.4 (Exercise) Make two boxplots of Weight each for a gender group, in a single figure.

You also need to fill in the missing value first.

### 3.5 (Exercise) Make two boxplots of Height each for a gender group, in a single figure.

You also need to fill in the missing value first.

## 4. Scatterplot

### 4.1 Age-weight scatterplot

In [None]:
age_weight = df[['Age', 'Weight']]
plt.scatter(age_weight.Age, age_weight.Weight)
plt.xlabel('Age')
plt.ylabel('Weight')

### 4.2 Age-weight scatterplot for each gender group

In [None]:
groups = df.groupby('Gender')

# Plot
fig, ax = plt.subplots()
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.Age, group.Weight, marker='o', linestyle='', label=name)
ax.legend()

plt.xlabel('Age')
plt.ylabel('Weight')
plt.show()
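
The loop above uses `ax.plot` with `linestyle=''` so that only the markers are drawn. A possible alternative, sketched here on made-up data, calls `ax.scatter` once per group. Each call takes the next color from the default color cycle and adds one legend entry.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female', 'Male'],
                    'Age': [2, 3, 4, 5],
                    'Weight': [13, 15, 19, 18]})

fig, ax = plt.subplots()
for name, group in toy.groupby('Gender'):
    # One scatter call per group, labeled with the group name.
    ax.scatter(group.Age, group.Weight, label=name)
ax.legend()
ax.set_xlabel('Age')
ax.set_ylabel('Weight')

handles, labels = ax.get_legend_handles_labels()
```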

### 4.3 (Exercise) Make a scatterplot for Weight and Height

### 4.4 (Exercise) Make a scatterplot for Weight and Height for each gender group

## 5. Line chart

Let's create another dataset about children's weights over a number of years.

In [None]:
weights = pd.DataFrame([[2010, 5, 6, 4, 5],
                        [2011, 6, 7, 5, 7],
                        [2012, 6, 7, 7, 9],
                        [2013, 7, 9, 9, 10],
                        [2014, 9, 11, 10, 12],
                        [2015, 11, 12, 12, 13],
                        [2016, 13, 14, 14, 15],
                        [2017, 14, 15, 15, 16],
                        [2018, 17, 18, 18, 17],
                        [2019, 19, 20, 20, 19],
                        [2020, 20, 21, 20, 21]], 
                       columns=["Year", "Alex", "Emma", "Noah", "Will"])

weights

### 5.1 Plot a line for each child over the years

In [None]:
# The format string '.-' draws point markers joined by a solid line; the color comes from the keyword.
plt.plot(weights.Year, weights.Alex, '.-', label='Alex', color='green')
plt.plot(weights.Year, weights.Emma, '.-', label='Emma', color='pink')
#plt.plot(weights.Year, weights.Noah, '.-', label='Noah', color='blue')
#plt.plot(weights.Year, weights.Will, '.-', label='Will', color='gold')

plt.legend()
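
The third positional argument of `plt.plot` is a format string that can combine a marker, a line style, and a color, e.g. `'b.-'` for blue point markers joined by a solid line. If a `color` keyword is passed as well, the keyword takes precedence (recent matplotlib versions also warn about the redundancy), so a simpler option is to leave the color out of the format string. The sketch below does this with made-up values.

```python
import matplotlib.pyplot as plt

# Hypothetical values, used only for illustration.
years = [2010, 2011, 2012]
values = [5, 6, 6]

# '.-' means: point markers ('.') joined by a solid line ('-').
# The color is given separately through the color keyword.
lines = plt.plot(years, values, '.-', color='green', label='Example')
plt.xticks(years)
plt.legend()

# plt.plot returns a list of line objects whose properties can be inspected.
line = lines[0]
print(line.get_marker(), line.get_linestyle(), line.get_color())
```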

### 5.2 Plot the largest weight values over the years

In [None]:
plt.plot(weights.Year, weights.iloc[:, 1:].max(axis=1))

# We get the year range from the Year column
yearRange = range(weights.min(axis=0)['Year'], weights.max(axis=0)['Year']+1, 1)
# We set the x axis to the year range
plt.xticks(yearRange)
plt.xlabel('Year')

plt.ylabel('Largest weight')
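
The `axis` argument decides the direction of the aggregation. As a minimal sketch on a made-up table with the same layout as the weights data (the column names `A` and `B` are illustrative), `axis=1` aggregates across the columns and gives one value per row (year), while `axis=0` aggregates down the rows and gives one value per column (child).

```python
import pandas as pd

# Hypothetical toy table in the same layout as the weights data.
toy = pd.DataFrame({'Year': [2010, 2011], 'A': [5, 6], 'B': [6, 9]})

# iloc[:, 1:] drops the Year column so that it is not included in the maximum.
per_year = toy.iloc[:, 1:].max(axis=1)   # one value per row (year)
per_child = toy.iloc[:, 1:].max(axis=0)  # one value per column (child)

print(per_year.tolist())
print(per_child.to_dict())
```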

### 5.3 (Exercise) Plot the average weight values over the years

### 5.4 (Exercise) Plot the smallest weight values over the years

### 5.5 (Exercise) Plot both the average and smallest weight values over the years in a single plot

## 6. Advanced visualization

### 6.1 (Exercise) Create a Heatmap for the kid health data set

What do you see from it?

### 6.2 (Exercise) Create a Heatmap for the children weight data set

What do you see from it?