# Unit 3: EDA (Exploratory Data Analysis)

Welcome to Unit 3 of the Machine Learning course! In this unit, we will dive into Exploratory Data Analysis (EDA).

---

Before we begin the lesson on Pandas, let's recall from Lesson 1.5 why EDA is important for all data scientists.

If you jump straight into training a model without looking at your data, you might:
- Use broken or missing values
- Forget to clean up weird categories
- Miss obvious patterns

**Good models start with good data.** EDA helps you find out:
- What’s useful
- What’s not
- What might need fixing

## Lesson 3.1: Basic Pandas for EDA

In [2]:
# First we will load all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

To help understand Pandas and other parts of EDA, we will be using the famous Titanic dataset for the rest of this unit.

In [4]:
# Loading the Dataset
df = pd.read_csv("Data/titanic_leapcode_train.csv")

## 1. .shape

Returns the shape (dimensionality) of the DataFrame as a tuple: (number of rows, number of columns).

In [7]:
df.shape

(891, 12)

There are 891 rows and 12 columns.
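Because `.shape` is a plain tuple, it can be unpacked into two variables. The sketch below uses a small hypothetical DataFrame (`toy`, not the course dataset) so that it runs on its own:

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# .shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows and {n_cols} columns")
```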

## 2. .head()

Returns a DataFrame containing the first n rows. By default, it shows the first 5.

In [12]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [13]:
df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
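To see how the `n` argument behaves, here is a minimal sketch on a hypothetical eight-row DataFrame (`toy` is made up for illustration). It also shows that asking for more rows than exist simply returns everything:

```python
import pandas as pd

# Hypothetical toy data with 8 rows
toy = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
})

first_five = toy.head()     # default: n=5
first_two = toy.head(2)     # explicit n
all_rows = toy.head(100)    # n larger than the frame returns every row

print(len(first_five), len(first_two), len(all_rows))
```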


## 3. .info()

Prints a concise summary of the DataFrame: its index, columns, non-null counts, data types, and memory usage.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


- **Index info** - Number of rows            
- **Column names** - Names of each column in your dataset                         
- **Non-null count** - How many entries are **not missing** in each column          
- **Dtype** - Data type: `int64`, `float64`, `object` (text), `bool`, etc. 
- **Memory usage** - How much memory the DataFrame is using                       


Using this, we can see that there are missing values in the Age, Cabin, and Embarked columns, because their Non-Null Count is less than the total number of entries (891).

Additionally, we can see that 5 columns (Name, Sex, Ticket, Cabin, and Embarked) have the `object` dtype, which means they hold text rather than numbers.
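The same facts that `.info()` prints can also be pulled out as values you can work with. The sketch below is one way to do it, using a hypothetical four-row DataFrame (`toy`): `count()` gives the non-null count per column, and `select_dtypes` picks out columns by data type.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Moran"],
    "Age": [22.0, 38.0, 26.0, np.nan],
    "Pclass": [3, 1, 3, 3],
})

# Non-null count per column (the "Non-Null Count" shown by .info())
non_null = toy.count()

# Missing values = total rows minus non-null count
missing = len(toy) - non_null

# Names of the numeric columns
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print(missing)
print(numeric_columns)
```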

## 4. .describe()

Returns description of the numerical data in the DataFrame.

In [16]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


- **count** - The number of non-missing values
- **mean** - The average value
- **std** - The standard deviation (a measure of how spread out the values are around the mean)
- **min** - The minimum value
- **25%** - The 25th percentile (a quarter of the values fall at or below it)
- **50%** - The 50th percentile, also known as the median
- **75%** - The 75th percentile
- **max** - The maximum value
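To make the `std` and `50%` rows concrete, the sketch below recomputes them by hand for a short Series. The five values are the first five fares shown by `df.head()`; the variable names are made up for illustration. By default, pandas computes the sample standard deviation (it divides by n - 1), which is what the manual formula mirrors.

```python
import numpy as np
import pandas as pd

# The first five Fare values from df.head(), used here as a tiny example
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

summary = fares.describe()

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by (n - 1)
deviations = fares - fares.mean()
manual_std = np.sqrt((deviations ** 2).sum() / (len(fares) - 1))

# The 50% row is the median
manual_median = fares.median()

print(summary)
print(manual_std, manual_median)
```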

## 5. .columns

Returns the column labels of the DataFrame

In [18]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
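The result is a pandas `Index` object, not a plain list. A small sketch, again on a hypothetical `toy` DataFrame, of two common things to do with it: convert it to a list, and check whether a column exists.

```python
import pandas as pd

# Hypothetical toy data with two columns
toy = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.2833]})

# Convert the Index of labels to a regular Python list
column_names = toy.columns.tolist()

# Check whether a column is present before using it
has_cabin = "Cabin" in toy.columns

print(column_names, has_cabin)
```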

## 6. .value_counts()

Returns the counts of unique values in a column

In [19]:
df['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

In [20]:
df["Embarked"].value_counts()


Embarked
S    644
C    168
Q     77
Name: count, dtype: int64
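Notice that the Embarked counts add up to 889, not 891: by default, `value_counts()` leaves out missing values. The sketch below, on a hypothetical six-value Series, shows the `dropna=False` option for counting missing values too, and `normalize=True` for getting proportions instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S"], name="Embarked")

counts = embarked.value_counts()                       # missing values excluded
counts_with_nan = embarked.value_counts(dropna=False)  # missing values counted
shares = embarked.value_counts(normalize=True)         # proportions, not counts

print(counts)
print(counts_with_nan)
print(shares)
```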

## 7. .isnull().sum()

Returns the number of missing (null) values in each column.

In [21]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
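Raw counts are easier to judge as percentages. Since `isnull()` produces True/False values, taking the mean of each column gives the fraction that is missing. A minimal sketch on a hypothetical four-row DataFrame (`toy`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in Age and Cabin
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Percentage of missing values per column (mean of True/False, times 100)
missing_percent = toy.isnull().mean() * 100

print(missing_counts)
print(missing_percent)
```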

## 8. .duplicated().sum()

Returns number of duplicate rows in the dataset.

In [24]:
df.duplicated().sum()

np.int64(0)

There are no duplicate rows in this dataset.
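Since the Titanic data has no duplicates, here is a sketch on a hypothetical DataFrame (`toy`) that does. A row counts as a duplicate only when every column matches an earlier row. The `drop_duplicates()` call is not covered in this lesson; it is shown as one possible follow-up step.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "Name": ["Allen", "Moran", "Allen"],
    "Pclass": [3, 3, 3],
})

# duplicated() marks each repeat of an earlier row as True
n_duplicates = toy.duplicated().sum()

# One possible follow-up: keep only the first copy of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```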

## 9. .nunique()

Returns the number of unique values in each column

In [26]:
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
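Like `value_counts()`, `nunique()` ignores missing values by default, which is why Embarked shows 3 unique values even though two entries are missing. A small sketch on a hypothetical DataFrame (`toy`) showing the `dropna=False` option:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Embarked value
toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S"],
    "Pclass": [3, 1, 3, 3],
})

unique_counts = toy.nunique()                 # missing values not counted
unique_with_nan = toy.nunique(dropna=False)   # missing counts as its own value

print(unique_counts)
print(unique_with_nan)
```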