## YBLL Workshop 5

### Hands-on: Exploratory Data Analysis with Pandas
<i> By Amos Changcoco </i>

In this notebook, we will use Pandas to perform one of the most common data science techniques: <b><i>exploratory data analysis (EDA)</i></b>. This technique helps us understand the data we are handling, gain insights from it, and determine which data preprocessing (or data cleaning) steps are needed before feeding it to our machine learning model (<i>oops, that's a spoiler</i>).

In [None]:
## Install required libraries for this hands-on
!pip install pandas
!pip install numpy
!pip install seaborn
!pip install matplotlib
!pip install scipy

In [None]:
## Importing required libraries to the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

For the first part of the hands-on, you will be performing a guided EDA on one of the well-known datasets used in learning machine learning (<i>sorry for the word repetition ✌</i>), the Titanic Dataset.

In [None]:
df_titanic = pd.read_csv("titanic.csv")
df_titanic.head(5)

To provide more context to this dataset, below is the data dictionary derived from <a href="https://www.kaggle.com/c/titanic/data">https://www.kaggle.com/c/titanic/data</a>:

| Attribute | Description |
|:--|:--|
| `survived` | indicates whether the passenger survived (`1` if yes, `0` otherwise) |
| `pclass` | ticket class represented in ordinal numbers |
| `sex` | sex of the passenger |
| `sibsp` | number of siblings and spouses accompanying the passenger |
| `parch` | number of parents and children accompanying the passenger |
| `fare` | passenger fare |
| `embarked` | first letter of the port of embarkation |
| `class` | ticket class represented in words |
| `who` | another representation for sex |
| `adult_male` | indicates whether the passenger is an adult male |
| `deck` | unknown |
| `embark_town` | full name of the port of embarkation |
| `alive` | indicates whether the passenger is alive |
| `alone` | indicates whether the passenger was alone during the trip |

Time to put our knowledge of <i>descriptive statistics</i> to good use. Finding the counts, minimum and maximum values, measures of central tendency, and measures of variability is usually the first step in EDA.
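
Before filling in the cells below, here is a minimal sketch of how Pandas can report several of these statistics at once. The `ages` values are made-up toy data (an assumption for illustration, not rows from the Titanic dataset).

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic dataset)
ages = pd.Series([22, 38, 26, 35, None, 54], name="age")

# describe() skips missing values and reports count, mean, std,
# min, the 25%/50%/75% quantiles, and max in one call
summary = ages.describe()
print(summary)

# Individual statistics are also available as methods
print("Min:", ages.min())
print("Max:", ages.max())
```

`describe()` is handy for a first look, while the individual methods are more convenient when a single value needs to be stored in a variable.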

#### Min and Max

In [None]:
## Replace None values to find the min and max values for age
min_age = None
max_age = None

print("Minimum value for age:", min_age)
print("Maximum value for age:", max_age)

In [None]:
## Replace None values to find the min and max values for fare
min_fare = None
max_fare = None

print("Minimum value for fare:", min_fare)
print("Maximum value for fare:", max_fare)

#### Quartiles
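
Quartiles are the 25%, 50%, and 75% quantiles of a column. A small sketch on toy values (assumed for illustration only) shows one way to compute them with `Series.quantile`, which takes a fraction between 0 and 1:

```python
import pandas as pd

# Toy values (hypothetical, not from the Titanic dataset)
ages = pd.Series([10, 20, 30, 40, 50], name="age")

# quantile(q) expects a fraction between 0 and 1;
# values between data points are linearly interpolated by default
q1 = ages.quantile(0.25)
q2 = ages.quantile(0.50)
q3 = ages.quantile(0.75)

print(q1, q2, q3)
```

Note that the second quartile is the same value as the median.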

In [None]:
## Replace None values to find the three quartiles (Q1, Q2, Q3)
q1_age = None
q2_age = None
q3_age = None

print("First quartile for age:", q1_age)
print("Second quartile for age:", q2_age)
print("Third quartile for age:", q3_age)

In [None]:
## Plotting the distribution of age and its quartiles
fig, ax = plt.subplots(figsize=(10, 5))
# Draw the density curve on the axes created above
df_titanic['age'].dropna().plot.kde(ax=ax, c="#00aeef")
# Use a list (not a set) so the labels keep the same order as the lines
labels = [f'First quartile for age: {q1_age}',
          f'Second quartile for age: {q2_age}',
          f'Third quartile for age: {q3_age}',
          ]
# One dashed vertical line per quartile; only these lines go in the legend
handles = [ax.axvline(x=q, c="#99cc33", linestyle='dashed')
           for q in (q1_age, q2_age, q3_age)]
ax.legend(handles=handles, labels=labels)

#### Measures of Central Tendency (Mean, Median, Mode)
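
As a quick refresher, the sketch below computes the three measures on toy values (assumed for illustration). One detail worth knowing: `mode()` returns a Series rather than a single number, because a column can have more than one most-frequent value.

```python
import pandas as pd

# Toy values (hypothetical, not from the Titanic dataset)
ages = pd.Series([18, 22, 22, 30, 48], name="age")

mean_value = ages.mean()      # arithmetic average
median_value = ages.median()  # middle value of the sorted data
modes = ages.mode()           # Series of the most frequent value(s)

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode(s):", modes.tolist())
```

To keep a single mode, one option is to take the first element with `modes.iloc[0]`.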

In [None]:
## Replace None to get the mean for the age
mean_age = None

## Replace None to get the median for the age
median_age = None

## Replace None to get the mode for the age
mode_age = None

print(f"Mean age: {mean_age}")
print(f"Median age: {median_age}")
print(f"Mode age: {mode_age}")

#### Measures of Variability (Range, IQR, Variance/Standard Deviation)
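
The sketch below shows one way to compute these measures on toy values (assumed for illustration). Pandas computes the sample variance and standard deviation by default (`ddof=1`), which may differ slightly from the population formulas.

```python
import pandas as pd

# Toy values (hypothetical, not from the Titanic dataset)
ages = pd.Series([10, 20, 30, 40, 50], name="age")

# Range: distance between the largest and smallest value
value_range = ages.max() - ages.min()

# IQR: distance between the third and first quartiles
iqr = ages.quantile(0.75) - ages.quantile(0.25)

# Sample variance and standard deviation (ddof=1 is the pandas default)
variance = ages.var()
std_dev = ages.std()

print(value_range, iqr, variance, std_dev)
```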

In [None]:
# Replace None to get the range of age
range_age = None

## Replace None to get the IQR for age
iqr_age = None

## Replace None to get the variance for age
var_age = None

## Replace None to get the standard deviation for age
std_age = None

print(f"Range of age: {range_age}")
print(f"IQR of age: {iqr_age}")
print(f"Variance of age: {var_age}")
print(f"Standard Deviation of age: {std_age}")

For the next cells, the notebook will provide some questions to be answered through EDA. In practice, one of the skills a data scientist must have is being curious enough to come up with these questions (*don't be afraid to ask dumb questions as long as the data can answer them*).

#### Q1. Did being rich give passengers the upper hand in survival?

While the dataset does not specifically say which passengers were rich, the ability to purchase a first-class ticket might be a good indicator of social status. But even so, were they able to make it out of the incident alive?
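
Questions like this usually come down to counting rows per combination of two columns. Below is a minimal sketch of that pattern on a toy table; the column names `group` and `outcome` are hypothetical stand-ins, not columns of the Titanic dataset.

```python
import pandas as pd

# Toy table with hypothetical column names
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c"],
    "outcome": ["yes", "no", "yes", "no", "no", "yes"],
})

# Count the rows for each (group, outcome) pair
counts = toy.groupby(["group", "outcome"]).size()

# Pivot the outcome level into columns; missing pairs become 0
table = counts.unstack(fill_value=0)
print(table)
```

Reading the result as a table (one row per group, one column per outcome) often makes the comparison easier than reading the stacked counts.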

In [None]:
## Place code for grouping by 'pclass' and 'alive'


#### Q2. Were there more young passengers alive than old ones?

How are we able to say that the passenger is young or old? Perhaps we can utilize our descriptive statistics to build the definition of young and old out of the data.
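
One way to turn cutoffs into a label column is `np.select`, sketched below on toy values. The thresholds `lower_cut` and `upper_cut` are hypothetical numbers chosen for illustration; in the exercise they would come from the descriptive statistics instead.

```python
import numpy as np
import pandas as pd

# Toy values, including one missing age (hypothetical data)
toy = pd.DataFrame({"age": [5, 20, 33, 61, np.nan]})

# Hypothetical cutoffs, chosen only for this illustration
lower_cut, upper_cut = 18, 60

# Each condition is paired with the label at the same position
conditions = [toy["age"] < lower_cut, toy["age"] >= upper_cut]
labels = ["young", "old"]

# Rows matching no condition (including missing ages) get the default
toy["age_group"] = np.select(conditions, labels, default="other")
print(toy)
```

Passing a string `default` keeps the labels and the fallback value the same type.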

In [None]:
# Assumption one: Minimum value for age and first quartile can determine the age range for the young
lower_young = None
upper_young = None

# Assumption two: Maximum value for age and third quartile can determine the age range for old
lower_old = None
upper_old = None

# Setting conditions and values for age group
conditions = [
    (df_titanic['age'] >= lower_young) & (df_titanic['age'] < upper_young),
    # <= so that the oldest passenger (age equal to upper_old) is included
    (df_titanic['age'] >= lower_old) & (df_titanic['age'] <= upper_old),
]
values = ['young', 'old']
# Passengers outside both ranges (or with a missing age) are labeled 'other'
df_titanic['age_group'] = np.select(conditions, values, default='other')

# Place code to perform groupby on `age_group` and `alive`


#### Q3. Is it possible to know the estimated number of families that were onboard the ship?

The dataset indicated the number of siblings, spouses, parents, and children for each passenger. Maybe we could also utilize `embark_town`, `pclass`, and `alone` as well.

*Note: The below solution is just one approach to answer this question. You are free to make your own assumptions to come up with another estimate.*
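
To make the idea concrete, here is a rough sketch of the same style of estimate on a toy table. The column names `town` and `companions` are hypothetical, and the estimate rests on the assumption that passengers sharing a town and family size can be split into families of that size.

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical columns: each row is one passenger
toy = pd.DataFrame({
    "town": ["X", "X", "X", "X", "Y", "Y", "Y"],
    "companions": [1, 1, 2, 2, 1, 1, 1],
})

# Family size counts the passenger plus their companions
toy["fam_size"] = toy["companions"] + 1

# Number of passengers for each (town, fam_size) combination
grouped = (
    toy.groupby(["town", "fam_size"])
    .size()
    .reset_index(name="num_passengers")
)

# Passengers divided by family size, rounded up, estimates the families
grouped["total_fam"] = np.ceil(grouped["num_passengers"] / grouped["fam_size"])

print(grouped)
print("Estimated number of families:", grouped["total_fam"].sum())
```

Rounding up treats an incomplete group (for example, two passengers who each report a family of three) as one family whose other members are missing from the data.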

In [None]:
# Assumption one: a passenger with a family should not be alone.
# Replace None to get all passengers that are not alone
# (append .copy() so adding a column below does not raise a SettingWithCopyWarning).
df_alone = None

# Assumption two: the family size is the number of all people the passenger is with
# Replace None to determine family size (including passenger)
df_alone['fam_size'] = None

# Assumption three: the family has the same port of embarkation and is of the same class
# Replace None to group by embark_town, pclass, and fam_size (reset index)
df = None
df.rename(columns={0: 'num_passengers'}, inplace=True)

# Assumption four: we can use the unique fam size per embark_town and pclass to get the total families
# Replace None to divide number of passengers per row by the fam_size
df['total_fam'] = (None).apply(np.ceil)

print(df)
print("\n Estimated number of families:", df.total_fam.sum())

#### Homework

Use the other dataset, the Data Science Salaries dataset, to perform EDA. Again, curiosity is key to having a fulfilling EDA journey.

In [None]:
df_salaries = pd.read_csv("data_science_salaries.csv")
df_salaries.head(5)