# Unit 3: EDA (Exploratory Data Analysis)

Welcome to Unit 3 of the Machine Learning course! In this unit, we will dive into Exploratory Data Analysis (EDA).

---

Before we begin the lesson on Pandas, let's recall from Lesson 1.5 why EDA is important for all data scientists.

If you jump straight into training a model without looking at your data, you might:
- Use broken or missing values
- Forget to clean up weird categories
- Miss obvious patterns

**Good models start with good data.** EDA helps you find out:
- What’s useful
- What’s not
- What might need fixing

## Lesson 3.1: Basic Pandas for EDA

In [None]:
# First we will load all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

To help you understand Pandas and other parts of EDA, we will be using the famous Titanic dataset for the rest of this unit.

In [None]:
# Loading the Dataset
df = pd.read_csv("Data/titanic_leapcode_train.csv")

## 1. .shape

Returns the dimensions of the DataFrame as a tuple: (number of rows, number of columns).

In [None]:
df.shape

The dataset has 891 rows and 12 columns.
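Because `.shape` is a plain tuple, it can be unpacked into separate variables. Here is a minimal sketch using a small made-up DataFrame (`toy_df` and its values are illustrative, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate .shape
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Age": [22, 35, 58],
})

# .shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows x {n_cols} columns")
```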

## 2. .head()

Returns the first n rows of the DataFrame. By default, it shows the first 5.

In [None]:
df.head()

In [None]:
df.head(10)

## 3. .info()

Prints basic information about the dataset.

In [None]:
df.info()

- **Index info** - Number of rows            
- **Column names** - Names of each column in your dataset                         
- **Non-null count** - How many entries are **not missing** in each column          
- **Dtype** - Data type: `int64`, `float64`, `object` (text), `bool`, etc. 
- **Memory usage** - How much memory the DataFrame is using                       


Using this, we can see that there are missing values in the Age, Cabin, and Embarked columns because their Non-Null Count is less than the total number of entries.

Additionally, we can see that 5 columns have the `object` dtype, which usually means they hold text or categorical data.
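The two checks described above can also be done in code instead of by reading the printed summary. This is a rough sketch on a made-up DataFrame (`toy_df` is hypothetical). It selects non-numeric columns with `select_dtypes(exclude="number")`, which is one possible way to find text columns, not necessarily the approach used later in this course:

```python
import pandas as pd

# Hypothetical toy data with one missing Age value
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara"],
    "Sex": ["female", "male", "female"],
    "Age": [22.0, None, 58.0],
    "Survived": [1, 0, 1],
})

# Non-null count per column (the same numbers .info() prints)
non_null = toy_df.count()

# Columns whose non-null count is below the number of rows have missing values
cols_with_missing = non_null[non_null < len(toy_df)].index.tolist()

# Columns that are not numeric (text / categorical candidates)
text_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(cols_with_missing)
print(text_cols)
```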

## 4. .describe()

Returns summary statistics for the numerical columns in the DataFrame.

In [None]:
df.describe()

- **count** - The number of non-missing values
- **mean** - The average value
- **std** - The standard deviation (a measure of how spread out the values are around the mean)
- **min** - The minimum value
- **25%** - The 25th percentile
- **50%** - The 50th percentile (the median)
- **75%** - The 75th percentile
- **max** - The maximum value
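To make these statistics concrete, the sketch below runs `.describe()` on a tiny made-up Series (`ages` is hypothetical) and recomputes the standard deviation by hand. Note that pandas reports the sample standard deviation, which divides by n - 1 rather than n:

```python
import pandas as pd

# Hypothetical ages, chosen so the arithmetic is easy to follow
ages = pd.Series([20.0, 30.0, 40.0], name="Age")

summary = ages.describe()

# Sample standard deviation by hand: squared distances from the mean,
# summed, divided by (n - 1), then square-rooted
squared_dev = (ages - ages.mean()) ** 2
manual_std = (squared_dev.sum() / (len(ages) - 1)) ** 0.5

print(summary)
print(manual_std)
```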

## 5. .columns

Returns the column labels of the DataFrame

In [None]:
df.columns

## 6. .value_counts()

Returns the counts of unique values in a column

In [None]:
df['Survived'].value_counts()

In [None]:
df["Embarked"].value_counts()
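Two optional parameters of `value_counts()` are often handy during EDA: `normalize=True` returns proportions instead of counts, and `dropna=False` also counts missing values. A minimal sketch on a made-up Series (`embarked` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical port codes with one missing entry
embarked = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

# Default: counts per unique value, missing values are skipped
counts = embarked.value_counts()

# Proportions instead of raw counts
shares = embarked.value_counts(normalize=True)

# Include missing values as their own category
with_missing = embarked.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```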


## 7. .isnull().sum()

Returns the number of missing (null) values in each column.

In [None]:
df.isnull().sum()
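Raw counts are easier to judge when compared to the size of the dataset. Since `isnull()` returns True/False values, taking the mean instead of the sum gives the fraction of missing values per column. A small sketch, assuming a made-up DataFrame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy_df = pd.DataFrame({
    "Age": [22.0, None, 58.0, None],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Number of missing values per column
missing_count = toy_df.isnull().sum()

# Percentage of missing values per column (True counts as 1, False as 0)
missing_pct = toy_df.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```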

## 8. .duplicated().sum()

Returns the number of duplicate rows in the dataset.

In [None]:
df.duplicated().sum()

There are no duplicates in this dataset.
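For a dataset that does contain duplicates, the sketch below shows what the check looks like and one common follow-up, `drop_duplicates()`. The data is made up for illustration (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data where the third row repeats the first
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Age": [22, 35, 22],
})

# duplicated() marks every repeat of an earlier row as True
n_dupes = toy_df.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
deduped = toy_df.drop_duplicates()

print(n_dupes)
print(deduped)
```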

## 9. .nunique()

Returns the number of unique values in each column.

In [None]:
df.nunique()
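One detail worth knowing: by default, `nunique()` does not count missing values as a unique value. Passing `dropna=False` includes them. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing Embarked value
toy_df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Default: missing values are ignored
unique_counts = toy_df.nunique()

# Count the missing value as its own category
unique_with_missing = toy_df.nunique(dropna=False)

print(unique_counts)
print(unique_with_missing)
```

Columns with only a few unique values, like `Sex` here, are usually good candidates to treat as categorical.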