# **Pandas Day 4**

In [None]:
import pandas as pd  
import seaborn as sns  

df = sns.load_dataset('titanic')
df.head()

## **Handling the missing data of titanic dataset**

In [None]:
# find the percentage of missing values in each column

df.isnull().sum() / len(df) * 100
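Sorting the same percentages in descending order makes the largest gaps easy to spot. A minimal sketch on a small hypothetical frame (`toy` and its values are assumptions for illustration, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull().mean() is the fraction of missing values per column,
# which is equivalent to isnull().sum() / len(toy)
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Here `deck` comes first because three of its four values are missing.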

In [None]:
# drop a column when a large share of its values is missing (e.g. 50%, 60%, 70% or more)

# drop() removes specific columns or rows manually, by name or index label

df = df.drop(columns='deck') # remove the deck column
df = df.drop(2, axis=0) # delete the row with index label 2
df.columns
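Instead of naming the column by hand, the threshold rule can be applied programmatically. The sketch below assumes a 50% cut-off and uses a hypothetical `toy` frame; both are illustrative choices, not values from the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only 'deck' is mostly missing
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

threshold = 50  # percent; an assumed cut-off

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Select the column names whose missing share exceeds the threshold
cols_to_drop = missing_pct[missing_pct > threshold].index

reduced = toy.drop(columns=cols_to_drop)
print(list(reduced.columns))
```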

In [None]:
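# dropna() removes the rows or columns that contain NaN values.
# The results are not assigned back to df here, so the missing values
# are still available for the imputation steps that follow.

df.dropna() # drop all rows that contain NaN
df.dropna(axis=1) # drop all columns that contain NaN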
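`dropna` also accepts parameters that make the deletion less drastic. A short sketch of `subset`, `thresh`, and `how` on a hypothetical frame (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in different columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "embarked": ["S", "C", np.nan],
    "fare": [7.25, 71.28, 7.92],
})

# Drop rows only when 'age' is missing; gaps in other columns are kept
rows_with_age = toy.dropna(subset=["age"])

# Keep rows that have at least 3 non-missing values
complete_rows = toy.dropna(thresh=3)

# Drop rows only when every value in the row is missing
not_all_missing = toy.dropna(how="all")

print(len(rows_with_age), len(complete_rows), len(not_all_missing))
```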

In [None]:
df.isnull().sum()

# handle missing values in the age column using the mean

df['age'] = df['age'].fillna(df['age'].mean())
df.isnull().sum()

# handle missing values in the embarked column using the mode (most frequent value)

df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df.isnull().sum()

# handle missing values in the embark_town column using the mode (most frequent value)

df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df.isnull().sum()
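The three techniques give different fill values. A minimal sketch comparing them on hypothetical series (the numbers are chosen so the difference is visible; they are not Titanic values):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value (80) and one gap
ages = pd.Series([2.0, 24.0, 30.0, np.nan, 80.0])

# The mean is pulled toward the extreme value; the median is not
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

# For categorical data, the mode (most frequent value) is the usual choice
ports = pd.Series(["S", "C", "S", np.nan])
mode_filled = ports.fillna(ports.mode()[0])

print(mean_filled[3], median_filled[3], mode_filled[3])
```

When a numeric column is skewed, the median is often the safer choice because a few extreme values do not move it much.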

## **Assignment No. 3: What happens if you do not impute the missing values?**

If you do not impute (fill) the missing values in your dataset, it can cause the following problems:

### 1. ❌ Errors in Machine Learning Models
- Many ML algorithm implementations (Linear Regression, for example) do **not accept missing values (`NaN`)**.
- Such a model will raise an error or fail to train/predict.

### 2. ❗ Loss of Data
- If you drop missing rows using `df.dropna()`, you may lose **valuable information**.
- This can reduce the **size and quality** of your dataset.
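The amount of data lost to `df.dropna()` can be measured before committing to it. A small sketch on a hypothetical frame where each gap sits in a different row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three of four rows have one missing value each
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, "E"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

rows_before = len(toy)
rows_after = len(toy.dropna())

# Share of rows that would be lost by dropping every incomplete row
lost_pct = (rows_before - rows_after) / rows_before * 100
print(rows_before, rows_after, lost_pct)
```

Only 3 of the 12 cells are missing, yet dropping incomplete rows removes 3 of the 4 rows.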

### 3. 📉 Inaccurate Results
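- pandas skips `NaN` by default, so statistics like **mean, median, and standard deviation** describe only the observed values, not the full dataset.
- If the values are not missing at random, this can lead to **biased or misleading analysis**.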
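How a missing value affects a statistic depends on the tool. The sketch below, on a hypothetical series, contrasts the pandas default (skip `NaN`) with NumPy's plain `np.mean` (propagate `NaN`):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# pandas ignores NaN by default: the mean covers only the 3 observed values
mean_skipping_nan = ages.mean()

# With skipna=False the missing value propagates into the result
mean_with_nan = ages.mean(skipna=False)

# Plain NumPy also propagates NaN
numpy_mean = np.mean(ages.to_numpy())

print(mean_skipping_nan, mean_with_nan, numpy_mean, ages.count())
```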

### 4. 📊 Visualization Problems
- Plots and charts may break or show incomplete trends.
- Missing data may create **gaps or distortions** in visualizations.

### 5. 🧪 Wrong Decisions
- Decisions based on incomplete data can be **unreliable**.
- Models built on missing data may perform poorly in real-world situations.

### ✅ Conclusion:
> It is important to handle missing values using proper methods like **mean/median/mode imputation** or **row/column deletion**.


## **Binning Concept**

In [None]:
sns.histplot(df['age'], kde=True)
# `kde=True` adds a smooth curve over the histogram (called Kernel Density Estimate)

In [None]:
# Age bin boundaries
bins = [0, 1, 3, 12, 19, 35, 59, 100]

# Corresponding age group labels
labels = ['Infant', 'Toddler', 'Child', 'Teen', 'Young Adult', 'Adult', 'Senior']


## **Feature Engineering**

In [None]:
# add a new column to the dataset using the binning concept

df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True) 
# right=True means upper edge is included (e.g., 12 goes in "Child")
df.head()
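The effect of `right` is easiest to see on values that sit exactly on a bin edge. A minimal sketch using the same `bins` and `labels` with a hypothetical `ages` series:

```python
import pandas as pd

bins = [0, 1, 3, 12, 19, 35, 59, 100]
labels = ['Infant', 'Toddler', 'Child', 'Teen', 'Young Adult', 'Adult', 'Senior']

# Hypothetical ages; 3, 12 and 19 lie exactly on bin edges
ages = pd.Series([0.5, 3, 12, 19, 40, 80])

# right=True: intervals are (lower, upper], so 12 falls in (3, 12] -> Child
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

# right=False: intervals are [lower, upper), so 12 falls in [12, 19) -> Teen
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "right=True": right_closed, "right=False": left_closed}))
```

With `right=True` the lowest edge is excluded, so an age of exactly 0 would fall outside every bin and become `NaN` unless `include_lowest=True` is passed.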

## **📊 Histogram and Data Distribution**

### **1. What is Data Distribution?**
Data distribution describes how the values in a dataset are spread across their range.
It tells you:

- Where most values are concentrated

- Whether the data is symmetric or skewed

- If there are any outliers or unusual values

### **2. What is a Histogram?**
A histogram is a type of bar chart that shows how frequently data falls within certain ranges (called bins).
It helps visualize the distribution of continuous data.

#### **Example using Pandas:**

```python
df['age'].hist(bins=10)
```

### **3. What is histplot in Seaborn?**
`histplot` is a Seaborn function that creates histograms with extra features like smooth curves (KDE).

#### **Example using Seaborn:**
```python
import seaborn as sns  
sns.histplot(data=df, x='age', bins=10, kde=True)
```

- `bins=10` means the range of the data will be divided into 10 equal-width intervals

- `kde=True` adds a smooth curve over the histogram (called Kernel Density Estimate)
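The equal-width idea behind an integer `bins` value can be inspected without plotting. The sketch below uses NumPy's `np.histogram` on hypothetical ages as a stand-in for the counting step; it is an illustration of the concept, not a description of Seaborn's internals:

```python
import numpy as np

# Hypothetical ages ranging from 2 to 80
ages = np.array([2, 8, 15, 22, 25, 31, 38, 45, 60, 80])

# bins=3 splits the range [2, 80] into 3 intervals of width 26
counts, edges = np.histogram(ages, bins=3)

print(edges)   # bin boundaries
print(counts)  # number of values falling in each bin
```

Each bar in a histogram corresponds to one entry of `counts`, drawn between two neighbouring entries of `edges`.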

### **4. What is Binning?**
Binning means dividing continuous numeric values into fixed intervals or groups called bins.
Each bin represents a range of values.

#### **Example:**
If ages range from `0 to 100`, you can create bins like:

- 0–20

- 21–40

- 41–60

- 61–80

- 81–100

Code example using Pandas:
```python
pd.cut(df['age'], bins=[0, 20, 40, 60, 80, 100])
# Or let Seaborn draw 5 equal-width bins over the observed age range:
sns.histplot(data=df, x='age', bins=5)
```
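To check how many values land in each bin, the result of `pd.cut` can be counted. A small sketch with hypothetical ages, using the same edges as above:

```python
import pandas as pd

# Hypothetical ages, several of them exactly on a bin edge
ages = pd.Series([5, 20, 21, 40, 41, 67, 95])

# Without labels, pd.cut returns intervals such as (0, 20]
binned = pd.cut(ages, bins=[0, 20, 40, 60, 80, 100])

# Count the values per bin, in bin order rather than by frequency
counts = binned.value_counts().sort_index()
print(counts)
```

Because the intervals are closed on the right by default, an age of 20 belongs to `(0, 20]` and an age of 21 to `(20, 40]`.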

### **5. What is Normal Distribution (also called Gaussian Distribution)?**
Normal distribution is a special type of data distribution that looks like a bell curve.

#### **Characteristics:**

- Data is symmetrically distributed

- Mean = Median = Mode

- Most values are close to the center

- Fewer values at the edges (called tails)

#### **Shape:**


A single symmetric peak at the mean, with the curve falling away evenly toward both tails (the classic bell shape).

#### **Example with Seaborn:**
```python
sns.histplot(df['age'], kde=True)
```
If the KDE curve is smooth, symmetric, and bell-shaped, the data is likely to be approximately normally distributed. A plot alone does not prove normality.
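A quick numeric check can back up the visual impression: for roughly symmetric data the skewness is near 0 and the mean is close to the median. A minimal sketch on two hypothetical series (this is a rough heuristic, not a formal normality test):

```python
import pandas as pd

# Hypothetical data: one symmetric series, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
skewed = pd.Series([1, 1, 2, 2, 3, 4, 30])

# Skewness near 0 suggests symmetry; a positive value means a long right tail
print(symmetric.skew(), skewed.skew())

# In skewed data the mean is pulled toward the tail, away from the median
print(symmetric.mean(), symmetric.median())
print(skewed.mean(), skewed.median())
```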

In [None]:
# rename the column

df.rename(columns={'age_group' : 'age_binned'}, inplace=True)
df.head()

## **Filtering the dataset**

In [None]:
# filter by gender

# All female passengers
df[df['sex'] == 'female']

# All male passengers
df[df['sex'] == 'male']

In [None]:
# combined conditions 

# Female passengers who survived
df[(df['sex'] == 'female') & (df['survived'] == 1)]

# Male passengers in 2nd class who did not survive
df[(df['sex'] == 'male') & (df['pclass'] == 2) & (df['survived'] == 0)]
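Besides `&` (and), boolean masks can be combined with `|` (or) and negated with `~` (not); each condition needs its own parentheses because these operators bind more tightly than `==`. A sketch on a hypothetical frame that also shows `isin` and `query` as alternatives:

```python
import pandas as pd

# Hypothetical toy passengers
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "pclass": [1, 2, 3, 2],
    "survived": [1, 0, 0, 1],
})

# OR: female passengers or anyone in 1st class
female_or_first = toy[(toy["sex"] == "female") | (toy["pclass"] == 1)]

# NOT: passengers who did not survive
not_survived = toy[~(toy["survived"] == 1)]

# isin: passengers in 1st or 2nd class
upper_classes = toy[toy["pclass"].isin([1, 2])]

# query: the same kind of combined condition written as a string
male_second_lost = toy.query("sex == 'male' and pclass == 2 and survived == 0")

print(len(female_or_first), len(not_survived), len(upper_classes), len(male_second_lost))
```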

In [None]:
# filter by age 

# Passengers under 18 (children)
df[df['age'] < 18]

# Adult passengers (age 18+)
df[df['age'] >= 18]

# Passengers with age missing (NaN)
df[df['age'].isnull()]

In [None]:
# filter the top N oldest and youngest passengers in the dataset

# Top 5 oldest passengers
df.sort_values('age', ascending=False).head(5)

# Top 5 youngest passengers
df.sort_values('age', ascending=True).head(5)
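pandas also provides `nlargest` and `nsmallest`, which express the same top-N selection in one call. A short sketch on a hypothetical frame that includes a missing age:

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers; passenger C has no recorded age
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [80.0, 0.42, np.nan, 35.0],
})

# Two oldest and two youngest passengers by 'age'
oldest = toy.nlargest(2, "age")
youngest = toy.nsmallest(2, "age")

print(oldest["name"].tolist(), youngest["name"].tolist())
```

With `sort_values`, rows whose age is `NaN` are placed last by default for both ascending and descending order, so they do not appear in a small `head()`.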


In [None]:
# filter rows with missing data 

# All rows with any missing values
df[df.isnull().any(axis=1)]

# Rows where 'age' is missing
df[df['age'].isnull()]
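The same masks can be inverted to keep only complete rows, or summed to count the gaps per row. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in row 1 and one in row 2
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "embarked": ["S", "C", np.nan],
    "fare": [7.25, 71.28, 7.92],
})

# Rows with at least one missing value
any_missing = toy[toy.isnull().any(axis=1)]

# Rows with no missing values at all
complete = toy[toy.notnull().all(axis=1)]

# Number of missing values in each row
missing_per_row = toy.isnull().sum(axis=1)

print(list(any_missing.index), list(complete.index), missing_per_row.tolist())
```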
