# Unit 3 - Descriptive visualization of the data
---

1. [Boxplots](#section1)
2. [Histograms](#section2)
3. [Why is this important](#section3)
4. [What to use when](#section4)
5. [Same stats, different graphs](#section5)



This unit introduces an additional library, [seaborn](https://seaborn.pydata.org/), for statistical data visualization.\
Behind the scenes, seaborn uses matplotlib to draw its plots.\
[matplotlib.pyplot](https://matplotlib.org/stable/api/pyplot_summary.html#module-matplotlib.pyplot) is matplotlib's plotting interface; here we use it mainly to create figures and set their size.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #for reshaping graph size
import seaborn as sns  # for creating the graphs

<a id='section1'></a>
## 1. Boxplots 

What are they good for? Let's look at an example with the Titanic dataset

#### Titanic dataset

In [None]:
titanic_df = sns.load_dataset('titanic')

In [None]:
titanic_df.shape

In [None]:
titanic_df.head()

#### We would like to visualize the passengers' `age`

##### Attempt #1: With `scatterplot`

Here the figure size is set using matplotlib, but there are other ways. See [this](https://stackoverflow.com/questions/31594549/how-to-change-the-figure-size-of-a-seaborn-axes-or-figure-level-plot) highly voted question on Stack Overflow.


In [None]:
plt.figure(figsize=(4,3))  #figure size
sns.scatterplot(data = titanic_df[['age']])

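This is the raw data:

x-axis - the row index of each of the 891 passengers

y-axis - the age of each passenger

This is not informative: the row order carries no meaning, so the plot says little about how ages are distributed.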

##### Attempt #2: With `lineplot`

In [None]:
plt.figure(figsize=(6,3))
g = sns.lineplot(data = titanic_df[['age']])

##### Attempt #3: With `boxplot`

In [None]:
plt.figure(figsize=(2,5))
g = sns.boxplot(data = titanic_df[['age' ]])

We can save the figure using [savefig](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.savefig.html) if we want to use it later.\
`bbox_inches` - the portion of the figure to save. If `'tight'`, matplotlib crops the saved image to a tight box around the figure's contents. Try removing `bbox_inches='tight'` and see the difference.

In [None]:
g.figure.savefig("boxplot.png", bbox_inches='tight')

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/boxplot.png" width="600"/>
</div>
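
To make the whiskers and outliers concrete, here is a minimal sketch of the 1.5 × IQR rule, which is the default whisker setting (`whis=1.5`) in seaborn and matplotlib boxplots. The ages below are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Made-up ages (illustration only, not the Titanic data)
ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 45, 80])

q1 = ages.quantile(0.25)   # lower edge of the box
q3 = ages.quantile(0.75)   # upper edge of the box
iqr = q3 - q1              # height of the box

# Points beyond these fences are drawn individually as outliers;
# each whisker ends at the most extreme data point inside its fence
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers.tolist())
```

Note that the fences depend only on the quartiles, so the rule knows nothing about which values are plausible for the quantity being measured.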

The data seems fine. What would we think if we had the outliers under the bottom whisker?

Data from a project in 2022. Israel vs. the world.

<div>
<img src="https://raw.githubusercontent.com/nlihin/data-analytics/main/images/graze%20%20footprints.png" width="800"/>
</div>




### Vaccination data:

In [None]:
url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv'
vacc_df = pd.read_csv(url) 
vacc_df.head()

We will use `groupby` to view the data by location  
Don't worry about this now, you will learn more on `groupby` in the next Unit

In [None]:
grouped_df = vacc_df.groupby('location')[['daily_vaccinations',
                                          'people_fully_vaccinated_per_hundred',
                                          'total_boosters_per_hundred']].max()
grouped_df = grouped_df.reset_index()
grouped_df
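
As a rough preview of what `groupby(...).max()` does, here is a sketch on a tiny made-up table. The name `toy_df` and its values are hypothetical and only mirror the shape of the vaccination data (several rows per location).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: several rows (dates) per location
toy_df = pd.DataFrame({
    'location': ['A', 'A', 'B', 'B', 'B'],
    'daily_vaccinations': [100, 250, 40, np.nan, 90],
})

# One row per location, holding the largest value seen for that location.
# Missing values (NaN) are skipped when taking the maximum.
toy_max = toy_df.groupby('location')[['daily_vaccinations']].max().reset_index()
print(toy_max)
```

Each location collapses to a single row, which is why `grouped_df` has one row per location.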

#### Sort the values using `sort_values()`

In [None]:
grouped_df.sort_values('people_fully_vaccinated_per_hundred', ascending = False).head(10)

### <span style="color:blue"> Exercise:</span>
> For the data in `grouped_df`:
>
> display a scatterplot for `total_boosters_per_hundred`
>
> display a boxplot for both `total_boosters_per_hundred` and `people_fully_vaccinated_per_hundred` in the **same** plot

In [None]:
#YOUR CODE HERE

**The boxplot's outlier rule is not perfect. Otherwise, every value over 100 would have been marked as an outlier.**

<a id='section2'></a>
## 2. Histograms

Why use histograms? \
Boxplots display summary statistics, but they don't tell us much about the distribution shape. \
We use histograms to show the shape. 

<div>
<center><img src="https://github.com/nlihin/data-analytics/blob/main/images/distribution%20shapes.png?raw=true" width="500"/>
    <p style="text-align: center;"><em>O'Neil and Schutt. Doing Data Science: Straight Talk from the Frontline. O'Reilly Media, Inc. 2013. </em></p></center>
</div>


Histograms can show the number (count), percentage, probability or density

In [None]:
fig, ax = plt.subplots(2,2, figsize = (10,5))
plt.subplots_adjust(wspace = 0.5)

sns.histplot(data=titanic_df, x='age', ax = ax[0,0])  # default: stat='count'
sns.histplot(data=titanic_df, x='age', stat='percent', ax = ax[0,1])
sns.histplot(data=titanic_df, x='age', stat='probability', ax = ax[1,0])
sns.histplot(data=titanic_df, x='age', stat='density', ax = ax[1,1])
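
The four `stat` options differ only in how the bar heights are scaled. The sketch below shows the relationship on made-up numbers, using `numpy.histogram` rather than seaborn itself, so it illustrates the definitions and is not seaborn's actual implementation.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 6, 10])   # made-up data

counts, edges = np.histogram(values, bins=3)   # stat='count'
widths = np.diff(edges)                        # width of each bin

probability = counts / counts.sum()   # stat='probability': heights sum to 1
percent = 100 * probability           # stat='percent': heights sum to 100
density = probability / widths        # stat='density': total bar AREA is 1

print("count:      ", counts)
print("probability:", probability)
print("percent:    ", percent)
print("density:    ", density)
```

All four are the same counts divided by a constant (when the bins have equal width), which is why only the y-axis changes between the four panels.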

The shape won't change as long as the number of bins doesn't change. 
Change the number of bins:

In [None]:
plt.figure(figsize=(4,3))
sns.histplot(data=titanic_df, x='age', stat='percent', bins = 2)
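
The effect of the bin count can also be checked numerically. This is a small sketch on made-up data with two clusters, again using `numpy.histogram` as a stand-in for the plot.

```python
import numpy as np

# Made-up data: one cluster around 10 and another around 50
values = np.array([8, 9, 10, 10, 11, 12, 48, 49, 50, 50, 51, 52])

# Count the points per bin for two different bin counts
counts_by_bins = {}
for n_bins in (2, 4):
    counts, _ = np.histogram(values, bins=n_bins)
    counts_by_bins[n_bins] = counts
    print(f"bins={n_bins}: {counts.tolist()}")
```

With two bins the histogram looks flat, while four bins reveal the empty gap between the clusters: too few bins can hide the shape of the data.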

Histogram of the male passengers only:

In [None]:
plt.figure(figsize=(4,3))
sns.histplot(data=titanic_df[titanic_df.sex == 'male'], x='age', stat='percent')

---
### <span style="color:blue"> Exercise:</span>
>
> Create a histogram for the age of female passengers on the Titanic
>
> This histogram doesn't have the same number of bins as the histogram of males. Any idea why?

In [None]:
# first - filter so you have the females:
#female_df = #COMPLETE THIS CODE

In [None]:
# then plot
#YOUR CODE HERE

---
### <span style="color:blue"> Exercise:</span>
> Create two histograms, one for males and one for females, with the **same** number of bins
>

In [None]:
fig, ax = plt.subplots(1,2, figsize = (8,3))
#YOUR CODE HERE

Both sexes on the same graph:

In [None]:
plt.figure(figsize=(4,3))
sns.histplot(data=titanic_df, x='age', stat='percent', hue='sex', multiple = 'layer' )
plt.show()

---
### <span style="color:blue"> Exercise:</span>
>
> Try other options for `histplot`:
>
> `multiple: {"layer", "dodge", "stack", "fill"}`
>
> What is the default?
>
> Create a histogram of `total_boosters_per_hundred` from our `grouped_df` dataframe


In [None]:
plt.figure(figsize=(4,3))
#sns.histplot(data=titanic_df, x='age', stat='percent', hue='sex', multiple = '...' ) #COMPLETE THIS CODE
plt.show()

---

<a id='section3'></a>

## 3. Why is this important?  
#### **⚠️ BEWARE!!** An example from a student project:  



#### Students worked on a depression dataset

In [None]:
dep_df = pd.read_csv("https://raw.githubusercontent.com/nlihin/EDA-course/refs/heads/main/datasets/Student%20Depression%20Dataset.csv")

In [None]:
dep_df.info()

#### They plotted a scatter plot of `Work Pressure` versus `Job Satisfaction` without first examining the data distribution:  

In [None]:
plt.figure(figsize=(4,3))
sns.scatterplot(data=dep_df, x = "Work Pressure", y = "Job Satisfaction")

#### It looked a bit weird so they moved to a lineplot:

In [None]:
plt.figure(figsize=(4,3))
sns.lineplot(data=dep_df, x = "Work Pressure", y = "Job Satisfaction")

#### However, the data is actually highly skewed:

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(12, 3))

sns.boxplot(data=dep_df[["Work Pressure", "Job Satisfaction"]], ax=axs[0])
sns.histplot(dep_df["Work Pressure"], ax=axs[1])
sns.histplot(dep_df["Job Satisfaction"], ax=axs[2])

plt.tight_layout()


#### This can also be seen using `value_counts`

In [None]:
print(dep_df["Work Pressure"].value_counts())
print("")
print(dep_df["Job Satisfaction"].value_counts())
#dep_df[["Work Pressure", "Job Satisfaction"]].value_counts()
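
To see why a heavily concentrated column makes a scatterplot misleading, consider this sketch with made-up numbers (the series `pressure` is hypothetical and is not the course dataset). When almost every row holds the same value, most points are drawn on top of each other and the few rare values dominate the picture.

```python
import pandas as pd

# Made-up column: almost every row is 0, only a few rows differ
pressure = pd.Series([0] * 97 + [2, 5, 5], name='pressure')

# Share of rows per distinct value
shares = pressure.value_counts(normalize=True)
print(shares)

# A large positive skewness signals a long right tail
print('skewness:', pressure.skew())
```

Passing `normalize=True` turns the counts into shares, which makes the imbalance easy to read at a glance.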

<a id='section4'></a>
## 4. What to use when?

In [None]:
url1 = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes%202018-2020.csv'
url2 = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/aircraft%20wildlife%20strikes%202021-2023.csv'
strike_df_18_20 = pd.read_csv(url1)
strike_df_21_23 = pd.read_csv(url2)
strike_df = pd.concat([strike_df_18_20 ,strike_df_21_23]).reset_index(drop = True)

In [None]:
strike_df.head(2)

In [None]:
strike_df.shape

#### numeric features:

In [None]:
numeric_features = ['HEIGHT', 'SPEED', 'AC_MASS']
target_features = ['AircraftOutOfService','people_impact','struck_parts', 'damaged_parts']

#### categorical features:

In [None]:
categorical_features = ['WARNED','PHASE_OF_FLIGHT','SKY','TIME_OF_DAY']

### What to use when?
* **Countplots** are good for describing **categorical features**  
* **Histograms** and **boxplots** are good for describing **numeric features**
* **Scatterplots** are good for describing the interaction between two **numeric features**
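
The first two rules above can be sketched as a small helper. `suggest_plot` is a hypothetical function written for this illustration (it is not part of seaborn or pandas), and the toy values are made up; only the column names follow the strike data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def suggest_plot(column: pd.Series) -> str:
    """Hypothetical helper: suggest a plot type from the column's dtype."""
    if is_numeric_dtype(column):
        return 'histplot / boxplot'
    return 'countplot'

# Made-up rows using two column names from the strike data
toy_df = pd.DataFrame({
    'HEIGHT': [0, 500, 1200, 3000],
    'SKY': ['No Cloud', 'Overcast', 'No Cloud', 'Some Cloud'],
})

suggestions = {name: suggest_plot(toy_df[name]) for name in toy_df.columns}
print(suggestions)
```

A dtype check is only a first guess: a category stored as numbers (for example, codes 1 to 5) would be reported as numeric even though a countplot suits it better.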

In [None]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4))
sns.countplot(data=strike_df, x = "WARNED", ax = axes[0])
sns.countplot(data=strike_df, x = "TIME_OF_DAY", ax = axes[1])
sns.countplot(data=strike_df, x = "SKY", ax = axes[2])
plt.tight_layout()
plt.show()

#### Let's look at some of the numeric features:

In [None]:
fig, axes = plt.subplots(figsize=(12, 3), ncols=3)
sns.boxplot(data=strike_df[['HEIGHT']], ax = axes[0])
sns.boxplot(data=strike_df[['SPEED']], ax = axes[1])
sns.boxplot(data=strike_df[['AC_MASS']], ax = axes[2])
plt.tight_layout()
plt.show()

In [None]:
fig, axs = plt.subplots(ncols=3, figsize=(12, 3))
for i, feature in enumerate(numeric_features):
    sns.histplot(data=strike_df[feature], ax=axs[i], bins=20)
   
plt.tight_layout()
plt.show()

In [None]:
sns.scatterplot(data = strike_df, y = "SPEED", x = "HEIGHT")

---
### <span style="color:blue"> Exercise:</span>
> Remove outliers so that:  `SPEED<600` & `HEIGHT<20000`
>


<a id='section5'></a>
## 5. Same stats, different graphs

In [None]:
url = 'https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/DatasaurusDozen.tsv'

In [None]:
df = pd.read_csv(url, sep='\t')

In [None]:
df.head(15)

Dataset names:

In [None]:
df['dataset'].unique()

Dataset statistics

What can you say about the mean, std, and number of points in each dataset?

In [None]:
datasets = df['dataset'].unique()
result = []

for dataset in datasets:
    data = df.loc[df['dataset'] == dataset, 'x']
    result.append({
        'dataset': dataset,
        'count': len(data),
        'mean': data.mean(),
        'std': data.std()
    })

result_df = pd.DataFrame(result)
result_df



A more efficient way to achieve this, which we will learn in the next lesson:

In [None]:
df.groupby('dataset')[["x"]].agg(['count', 'mean', 'std']).reset_index()
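
The idea behind this section can be reproduced with made-up data: any two samples can be rescaled to share the same mean and standard deviation while keeping very different shapes. The helper `rescale` and the two samples below are assumptions for illustration and are unrelated to how the Datasaurus datasets were built.

```python
import numpy as np

def rescale(values, mean, std):
    """Hypothetical helper: shift and scale values to a given mean and std."""
    z = (values - values.mean()) / values.std(ddof=1)
    return z * std + mean

rng = np.random.default_rng(0)

# Two differently shaped samples: one flat, one with two separate peaks
flat = rng.uniform(size=200)
two_peaks = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])

uniform = rescale(flat, mean=50, std=15)
bimodal = rescale(two_peaks, mean=50, std=15)

for name, sample in [('uniform', uniform), ('bimodal', bimodal)]:
    counts, _ = np.histogram(sample, bins=5)
    print(name, round(sample.mean(), 2), round(sample.std(ddof=1), 2), counts.tolist())
```

The printed means and standard deviations match, yet the bin counts do not, which is exactly what summary statistics alone cannot show.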

---
### <span style="color:blue"> Exercise:</span>
>
> Create 3 boxplots, one for each of: `dataset == 'slant_down'`, `dataset == 'star'`, `dataset == 'circle'`

In [None]:
fig, ax = plt.subplots(1,3, figsize = (12,4))
plt.subplots_adjust(wspace = 0.5)

# your code here

Let's display the histograms, but in a better way, using `FacetGrid`.

`FacetGrid` splits the data into several categories and applies the same plotting function to each category, so the panels are easy to compare.

In [None]:
g = sns.FacetGrid(df, col="dataset", hue="dataset", col_wrap=4, height = 2)
g.map(sns.histplot, 'x')
plt.show()

The same, but with scatterplots instead of histograms:

In [None]:
g = sns.FacetGrid(df, col="dataset", hue="dataset", col_wrap=4, height =2)
g.map(sns.scatterplot, "x", "y")
plt.show()

---
>### Functions covered in this unit:
>
> `scatterplot` - (x,y) points on the graphs
>
> `lineplot` - simple lineplot
>
> `histplot` - create a histogram
>
> `boxplot` - create a boxplot
>
> `countplot` - count categorical data
>
> `plt.figure(figsize=(m,n))` - set the size of the graph/figure to (m,n)
>
> `sort_values()` - sorts values 
>
> `std()` - standard deviation
>

---