# Unit 6 - Visualize the data
---

1. [Boxplots](#section1)
2. [Side note: Pickle](#section2)
3. [Histograms](#section3)
4. [Same stats, different graphs](#section4)



This unit introduces an additional library, [seaborn](https://seaborn.pydata.org/), for statistical data visualization.\
Behind the scenes, seaborn uses matplotlib to draw its plots.\
[matplotlib.pyplot](https://matplotlib.org/stable/api/pyplot_summary.html#module-matplotlib.pyplot) is matplotlib's state-based interface; we use it to create and manage the figure (for example, to set its size).


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #for reshaping graph size
import seaborn as sns  # for creating the graphs

<a id='section1'></a>

## 1. Boxplots 

What are they good for? Let's look at an example with the Titanic dataset

#### Titanic dataset

In [2]:
titanic_df = sns.load_dataset('titanic')

In [None]:
titanic_df.shape

In [None]:
titanic_df.head()

#### We would like to visualize the passengers' `age`

##### Attempt #1: With `scatterplot`

Here the figure size is set using matplotlib, but there are other ways. See [this](https://stackoverflow.com/questions/31594549/how-to-change-the-figure-size-of-a-seaborn-axes-or-figure-level-plot) highly voted question on Stack Overflow.


In [None]:
plt.figure(figsize=(4,3))  #figure size
sns.scatterplot(data = titanic_df[['age']])

This is the raw data:

x-axis - the row index of each of the 891 passengers

y-axis - the age of each passenger

This is not informative.

##### Attempt #2: With `lineplot`

In [None]:
plt.figure(figsize=(6,3))
g = sns.lineplot(data = titanic_df[['age']])

##### Attempt #3: With `boxplot`

In [None]:
plt.figure(figsize=(2,5))
g = sns.boxplot(data = titanic_df[['age' ]])
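To make clear what a boxplot summarizes, here is a minimal sketch that computes the same quantities by hand. It assumes the common 1.5 × IQR whisker rule (seaborn's default, `whis=1.5`), and the ages are made-up numbers, not the Titanic data.

```python
import numpy as np

# Made-up ages for illustration only (not the Titanic data)
ages = np.array([2, 18, 21, 24, 28, 30, 35, 40, 47, 80], dtype=float)

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Fences: points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

# Whiskers end at the most extreme data points that are still inside the fences
whisker_low = ages[ages >= lower_fence].min()
whisker_high = ages[ages <= upper_fence].max()

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

In this toy sample only the value 80 falls outside the fences, so it would be drawn as a single point above the top whisker.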

We can save the figure with [savefig](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html) if we want to use it later.\
`bbox_inches` controls which portion of the figure is saved. With `'tight'`, matplotlib fits a tight bounding box around the figure's contents. Try removing `bbox_inches='tight'` and see the difference.

In [8]:
#g.figure.savefig("boxplot_no_tight.png")  # same figure without bbox_inches='tight', for comparison

In [None]:
g.figure.savefig("boxplot.png", bbox_inches='tight')
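The comparison suggested above can be sketched as a standalone script. This is only an illustration under assumptions: it draws matplotlib's own `ax.boxplot` on made-up numbers instead of the seaborn plot, writes to a temporary folder, and selects the `Agg` backend so it runs without a display (in a notebook you would omit that line).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; not needed inside a notebook
import matplotlib.pyplot as plt

# A small stand-in figure (made-up values, not the Titanic ages)
fig, ax = plt.subplots(figsize=(2, 5))
ax.boxplot([1, 2, 3, 4, 10])
ax.set_ylabel("value")

out_dir = tempfile.mkdtemp()
default_path = os.path.join(out_dir, "boxplot_no_tight.png")
tight_path = os.path.join(out_dir, "boxplot.png")

fig.savefig(default_path)                    # saves the full figure canvas
fig.savefig(tight_path, bbox_inches="tight") # crops or extends to fit the drawn content
plt.close(fig)

# Compare the pixel dimensions of the two saved images
print(plt.imread(default_path).shape)
print(plt.imread(tight_path).shape)
```

The two printed shapes show how much of the canvas each call kept.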

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/boxplot.png" width="600"/>
</div>

The data seems fine. What would we think if we had the outliers under the bottom whisker?

Data from a project in 2022. Israel vs. the world.

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/graze%20%20footprints.png" width="800"/>
</div>



<a id='section2'></a>

## 2. Side note: Pickle

We want to go back to our vaccinations data. But it is getting rather tedious to read and wrangle it every time (perhaps also fill missing values):

In [None]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url)
# Drop the listed aggregate rows (continents, income groups, world)
aggregates = ["Europe", "High income", "World", "European Union", "North America",
              "Upper middle income", "Lower middle income", "Asia", "South America"]
vacc_df = vacc_df[~vacc_df.location.isin(aggregates)]

[Pickle](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_pickle.html) the file:

In [None]:
vacc_df.to_pickle("treated_vacc")

Read the file:

In [None]:
vacc_df = pd.read_pickle("treated_vacc")

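Pickling the DataFrame saves it together with all the changes we made in Python (filtered rows, dtypes, index), so we don't have to repeat the wrangling. Reading and writing a pickle is also generally faster than re-parsing a CSV.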
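A minimal sketch of what "saved with all of the changes" means, using a small made-up table (the column names are only illustrative): a datetime column survives a pickle round trip unchanged, while a plain CSV round trip reads it back as text unless we parse it again.

```python
import os
import tempfile

import pandas as pd

# Made-up table with a column that was already converted to datetime
toy_df = pd.DataFrame({
    "location": ["A", "B"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "total_vaccinations": [10.0, 20.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "toy.pkl")
    csv_path = os.path.join(tmp_dir, "toy.csv")

    toy_df.to_pickle(pickle_path)
    toy_df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)  # no parse_dates: dates come back as text

print(from_pickle.dtypes["date"])  # same dtype as in toy_df
print(from_csv.dtypes["date"])     # a text dtype, not datetime
```

Note that a pickle should only be loaded from a source you trust, since unpickling can execute arbitrary code.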

### Back to boxplots:

Use a `groupby` and look at part of the data, by location:

In [None]:
Fix the NaNs, or else the graphs will just ignore them.

#### Sort the values using `sort_values()`
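The three steps (group by location, fix the NaNs, sort) can be sketched on a small made-up table. This is not the notebook's own code: the table `toy_df`, the choice of `max()` as the aggregation, and filling missing values with 0 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the vaccinations data
toy_df = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C"],
    "total_boosters_per_hundred": [10.0, 25.0, np.nan, 5.0, np.nan],
    "people_fully_vaccinated_per_hundred": [50.0, 60.0, 30.0, 35.0, 70.0],
})

# 1. One row per location (assumption: keep the largest reported value)
toy_grouped = toy_df.groupby("location").max()

# 2. Fix the NaNs (assumption: treat "never reported" as 0)
toy_grouped = toy_grouped.fillna(0)

# 3. Sort by one column, then turn the location index back into a regular column
toy_grouped = (
    toy_grouped
    .sort_values("total_boosters_per_hundred", ascending=False)
    .reset_index()
)

print(toy_grouped)
```

Location C has no booster values at all, so it stays NaN after `max()` and only becomes 0 in the `fillna` step.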

### <span style="color:blue"> Exercise:</span>
> For the data in `grouped_df`:
>
> Display a scatterplot for `total_boosters_per_hundred`
>
> Display a boxplot for both `total_boosters_per_hundred` and `people_fully_vaccinated_per_hundred` in the **same** boxplot

The boxplot's outlier detection is not perfect; otherwise, any value over 100 would have been marked as an outlier.

<a id='section3'></a>

## 3. Histograms

Why use histograms? \
Boxplots display summary statistics, but they don't tell us much about the distribution shape. \
We use histograms to show the shape. 

In [19]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/DatasaurusDozen.tsv'
df = pd.read_csv(url, sep='\t')

In [None]:
df

In [None]:
fig, ax = plt.subplots(1,2, figsize = (5,4))
plt.subplots_adjust(wspace = 0.5)

sns.boxplot(data = df[df.dataset == 'slant_up'], y = 'x', ax = ax[0])
sns.boxplot(data = df[df.dataset == 'h_lines'], y = 'x', ax = ax[1])

plt.show()

In [None]:
fig, ax = plt.subplots(1,2, figsize = (12,4))
plt.subplots_adjust(wspace = 0.5)

sns.histplot(data = df[df.dataset == 'slant_up'], x = 'x', ax = ax[0], bins=3)
sns.histplot(data = df[df.dataset == 'h_lines'], x = 'x', ax = ax[1], bins=30)

plt.show()

Histograms can show the number (count), percentage, probability or density

In [None]:
fig, ax = plt.subplots(2,2, figsize = (10,5))
plt.subplots_adjust(wspace = 0.5)

sns.histplot(data=titanic_df, x ='age', ax = ax[0,0] )
sns.histplot(data=titanic_df, x='age', stat='percent', ax = ax[0,1])
sns.histplot(data=titanic_df, x='age', stat='probability', ax = ax[1,0])
sns.histplot(data=titanic_df, x='age', stat='density', ax = ax[1,1])
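How the four options relate can be sketched with plain NumPy on a few made-up values. This is a hand computation under the usual definitions (probabilities sum to 1, percents sum to 100, density has total area 1), not seaborn's internal code.

```python
import numpy as np

# Made-up values, binned into 4 equal bins between 0 and 8
values = np.array([1, 1, 3, 3, 3, 5, 5, 7])
counts, edges = np.histogram(values, bins=4, range=(0, 8))
widths = np.diff(edges)  # width of each bin

probability = counts / counts.sum()  # bar heights sum to 1
percent = 100 * probability          # bar heights sum to 100
density = probability / widths       # bar areas (height * width) sum to 1

print("count:      ", counts)
print("probability:", probability)
print("percent:    ", percent)
print("density:    ", density)
```

With equal-width bins every version is the count multiplied by a constant, which is why only the y-axis scale differs between the four plots.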

The shape is the same for every `stat` as long as the number of bins doesn't change; only the y-axis scale differs.\
Change the number of bins:

In [None]:
plt.figure(figsize=(4,3))
sns.histplot(data=titanic_df, x='age', stat='percent', bins=30)

Histograms of males and females:

In [None]:
plt.figure(figsize=(4,3))
sns.histplot(data=titanic_df[titanic_df.sex == 'male'], x='age', stat='percent')

---
### <span style="color:blue"> Exercise:</span>
>
>Create a histogram for the age of female passengers on the Titanic:
>

---



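These two histograms don't have the same number of bins. By default, seaborn chooses the number of bins automatically from the data, and the two groups differ in size: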

In [None]:
len(titanic_df[titanic_df.sex == 'male'])

In [None]:
len(titanic_df[titanic_df.sex == 'female'])
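Why the group size matters can be sketched with NumPy's automatic binning rule. The seaborn documentation describes `bins` as being passed to `numpy.histogram_bin_edges`; the two samples below are made-up, evenly spaced values of different sizes, not the Titanic ages.

```python
import numpy as np

# Two made-up samples over the same range, differing only in size
large_sample = np.linspace(0, 1, 500)
small_sample = np.linspace(0, 1, 200)

# 'auto' picks the bin width from the data, so the bin count depends on the sample size
edges_large = np.histogram_bin_edges(large_sample, bins="auto")
edges_small = np.histogram_bin_edges(small_sample, bins="auto")

n_bins_large = len(edges_large) - 1
n_bins_small = len(edges_small) - 1

print(n_bins_large, n_bins_small)
```

Passing an explicit number (or explicit bin edges) to `bins` removes this dependence on the sample size.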

---
### <span style="color:blue"> Exercise:</span>
> Create two histograms, one for males and one for females, with the **same** number of bins
>

---

Both sexes on the same graph:

In [None]:
titanic_df.columns

In [None]:
plt.figure(figsize=(4,3))
sns.histplot(data=titanic_df, x='age', stat='percent', hue='sex', multiple = 'layer' )
plt.show()

---
### <span style="color:blue"> Exercise:</span>
>
> try other options:
>
> `multiple{“layer”, “dodge”, “stack”, “fill”}`
>
> what is the default?
>
> create a histogram for `total_boosters_per_hundred` for our `grouped_df` dataframe


---

<a id='section4'></a>

## 4. Same stats, different graphs

In [33]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/DatasaurusDozen.tsv'

In [34]:
df = pd.read_csv(url, sep='\t')

In [None]:
df.head(15)

Dataset names:

In [None]:
df['dataset'].unique()

Dataset statistics

In [None]:
df.groupby('dataset').agg(['count', 'mean', 'std'])

What can you say about the mean, std, and number of points in each dataset?

In [None]:
fig, ax = plt.subplots(1,3, figsize = (12,4))
plt.subplots_adjust(wspace = 0.5)

sns.histplot(data = df[df.dataset == 'slant_down'], x = 'x', ax = ax[0])
sns.histplot(data = df[df.dataset == 'star'], x = 'x', ax = ax[1])
sns.histplot(data = df[df.dataset == 'circle'], x = 'x', ax = ax[2])

plt.show()
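The idea of this section can be reproduced on a tiny scale. The sketch below uses two made-up samples (not taken from the Datasaurus file) that share the same mean and standard deviation but are distributed differently.

```python
import numpy as np

# Two made-up samples with identical summary statistics
two_clusters = np.array([-1.0, -1.0, 1.0, 1.0])               # all points at the extremes
center_heavy = np.array([-np.sqrt(2), 0.0, 0.0, np.sqrt(2)])  # half of the points at 0

for name, sample in [("two_clusters", two_clusters), ("center_heavy", center_heavy)]:
    # np.std uses the population formula (ddof=0) by default
    print(name, "mean:", sample.mean(), "std:", sample.std())

# Same statistics, different values: only a plot (or the raw data) reveals the difference
print(np.sort(two_clusters))
print(np.sort(center_heavy))
```

Summary statistics alone cannot tell these two samples apart, which is the same effect the Datasaurus datasets show at a larger scale.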

---
### <span style="color:blue"> Exercise:</span>
>
> Create 3 boxplot figures for the above datasets

`FacetGrid` splits the data into several categories and plots the same relationship, with the same plotting function, across all categories for easy comparison.

In [None]:
grid_histplots = sns.FacetGrid(df, col="dataset", hue="dataset", col_wrap=4, height = 2)
grid_histplots.map_dataframe(sns.histplot, x = 'x')
#plt.show()

In [None]:
grid_scatterplots = sns.FacetGrid(df, col="dataset", hue="dataset", col_wrap=4, height =2)
grid_scatterplots.map_dataframe(sns.scatterplot, x="x", y="y")

---
>### Functions covered in this unit:
>
> `scatterplot` - (x,y) points on the graph
>
> `lineplot` - simple line plot
>
> `plt.figure(figsize=(m,n))` - set the size of the graph/figure to (m,n)
>
> `boxplot` - create a boxplot
>
> `reset_index` - reset index to a numerical index beginning at 0
>
> `sort_values()` - sorts values 
>
> `histplot` - create a histogram
>
> `std()` - standard deviation
>
> `to_pickle`, `read_pickle` - serialize dataframe to file, read from file
---