# **Pandas Day 3**

In [None]:
# Primary Libraries for EDA 

import pandas as pd  
import seaborn as sns  
import matplotlib.pyplot as plt  
import numpy as np  

In [None]:
# Load the Titanic dataset bundled with seaborn into a DataFrame

df = sns.load_dataset('titanic')

## **Basic Information :**

In [None]:
# metadata about the dataset

# df.info  # without parentheses this only references the info() method; it does not run it
df.info()  # the parentheses actually call the method

# summary statistics of the dataset
df.describe()

# list the column names
df.columns

### 🔍 Difference Between `object` and `category` in Pandas :

#### ✅ **`object` dtype :**
- General-purpose type for text or mixed data.
- Commonly used for **strings**.
- Usually **slower** and **uses more memory**, because every value is stored as a separate Python object.
- Not optimized for repeated values.

#### ✅ **`category` dtype :**
- Used for **categorical data** with fixed and repeated values.
- Internally stored as **integers with labels (lookup table)**.
- **Faster**, more efficient, and uses **less memory**.
- Great for grouping, filtering, and saving memory.

### 📊 Comparison Table :

| Feature           | `object`                      | `category`                          |
|------------------|-------------------------------|-------------------------------------|
| Type             | General-purpose               | Optimized for categorical data      |
| Memory Usage     | High                          | Low when few distinct values repeat |
| Speed            | Slower                        | Faster for filtering/grouping       |
| Ideal Use        | Free-form text, mixed values  | Repeating fixed categories          |
| Encoding         | Plain text                    | Encoded as integers with labels     |

### 💡 Example :

```python
import pandas as pd

# Small standalone frame, so the column length matches the data
demo = pd.DataFrame({'gender': ['Male', 'Female', 'Male']})
print(demo['gender'].dtype)  # object (default for strings in pandas 2.x)

# Convert to category type
demo['gender'] = demo['gender'].astype('category')
print(demo['gender'].dtype)  # Output: category
```
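
To make the memory claim concrete, here is a minimal sketch that measures both dtypes on a synthetic column. The series `labels` and its size are assumptions chosen for illustration, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Synthetic column: two labels repeated many times (illustrative data)
labels = pd.Series(['Male', 'Female'] * 5000, dtype=object)
as_category = labels.astype('category')

# deep=True also counts the Python string objects behind an object column
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("object bytes:  ", object_bytes)
print("category bytes:", category_bytes)

# The lookup table and the integer codes that back the category column
print(list(as_category.cat.categories))
print(as_category.cat.codes.head().tolist())
```

The saving shrinks as the number of distinct values approaches the number of rows, so a column of mostly unique strings gains little from `category`.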


In [None]:
# find missing values (nulls / NaNs) in each column
df.isnull() # False = value present, True = value missing
df.isnull().sum() # number of missing values in each column
df.isnull().sum() / len(df) * 100 # percentage of missing values in each column

In [None]:
# Another way to check for missing values: visualize them with a heatmap
sns.heatmap(df.isnull()) # with color bar 
sns.heatmap(df.isnull(), cbar=False) # without color bar

# 📘 Assignment No. 1: What is Mean, Median, and Mode?

---

## 📌 1. Mean (Average)

**Definition:**  
Mean is the **average** of a list of numbers.  
It is calculated by adding all the values and dividing by the number of values.

**Formula:**
```
Mean = (Sum of all values) / (Total number of values)
```

**Example:**
```
Values = [4, 7, 10, 5, 6]
Mean = (4 + 7 + 10 + 5 + 6) / 5 = 32 / 5 = 6.4
```

---

## 📌 2. Median

**Definition:**  
Median is the **middle value** in a sorted list of numbers.  
It separates the data into two equal halves.

- If the number of values is **odd**, median is the middle value.
- If **even**, median is the average of the two middle values.

**Example 1 (Odd number of values):**
```
Values = [3, 5, 7]
Median = 5 (middle value)
```

**Example 2 (Even number of values):**
```
Values = [2, 4, 6, 8]
Median = (4 + 6) / 2 = 5.0
```

---

## 📌 3. Mode

**Definition:**  
Mode is the value that appears **most frequently** in a dataset.

- A dataset can have **one mode**, **more than one mode**, or **no mode**.

**Example 1:**
```
Values = [3, 5, 7, 5, 9]
Mode = 5 (occurs twice)
```

**Example 2 (No mode):**
```
Values = [1, 2, 3, 4]
Mode = None (all values appear once)
```
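
The "no mode" case is a convention that code does not apply automatically. The sketch below, which assumes Python 3.8+ for `statistics.multimode`, shows what the standard library and pandas return for the two example lists above.

```python
from statistics import multimode

import pandas as pd

one_mode = [3, 5, 7, 5, 9]
all_unique = [1, 2, 3, 4]

# multimode returns every value that shares the highest count
modes_single = multimode(one_mode)
modes_unique = multimode(all_unique)
print(modes_single)
print(modes_unique)

# pandas behaves the same way: Series.mode() returns all tied values
pandas_modes = pd.Series(all_unique).mode().tolist()
print(pandas_modes)
```

When every value appears once, both tools return all of the values, so check the length of the result if you want to treat that situation as "no mode".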

---

## ✅ Summary Table

| Term   | Meaning                         | Use Case                           |
|--------|----------------------------------|-------------------------------------|
| Mean   | Average of all values            | Roughly symmetric data, no outliers |
| Median | Middle value in sorted data      | For skewed data or with outliers    |
| Mode   | Most frequent value              | For categorical or repeating data   |

---

## 💡 Extra: Python Example

```python
import numpy as np
from statistics import mode

data = [4, 7, 10, 5, 6]

mean = np.mean(data)
median = np.median(data)
mode_value = mode(data)

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode_value)
```
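
The summary table recommends the median when outliers are present. As a small illustration, the sketch below adds one extreme value (`100`, chosen arbitrarily) to the same list and compares how far each statistic moves.

```python
import numpy as np

data = [4, 7, 10, 5, 6]
with_outlier = data + [100]  # one extreme value added for illustration

mean_before = np.mean(data)
median_before = np.median(data)
mean_after = np.mean(with_outlier)
median_after = np.median(with_outlier)

print("Mean:  ", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)
```

The mean jumps from 6.4 to 22.0, while the median only moves from 6.0 to 6.5.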


---

# 📘 Assignment No. 2: How to Impute Missing Values?

---

## ❓ What are Missing Values?

Missing values are the empty or null entries in a dataset.  
In Pandas, they are usually represented as `NaN` (Not a Number); `None` and `NaT` (for dates) are treated as missing too.

---

## 🔧 What is Imputation?

**Imputation** means filling in missing values using some technique, so that the dataset becomes complete and can be used for analysis or modeling.

---

## 🔍 Methods to Impute Missing Values

### ✅ 1. **Removing Missing Values**
- Use when missing data is small and random.
- You can remove rows or columns with missing data.

```python
df.dropna()         # Removes rows with any missing value
df.dropna(axis=1)   # Removes columns with missing values
```
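
`dropna()` can also be limited to certain columns or to fully empty rows. The sketch below uses a small made-up frame (`people` and its values are assumptions for illustration) to compare the options.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 3 is missing in every column
people = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan],
    'city': ['Lahore', 'Karachi', None, None],
})

complete_rows = people.dropna()              # keep rows with no missing value at all
has_age = people.dropna(subset=['age'])      # only require 'age' to be present
not_empty = people.dropna(how='all')         # drop rows where every column is missing

print(complete_rows.index.tolist())
print(has_age.index.tolist())
print(not_empty.index.tolist())
```

None of these calls change `people` itself; each returns a new DataFrame that you assign to a name.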

---

### ✅ 2. **Mean Imputation**
- Replace missing values with the **mean** (average) of the column.
- Best for **numeric data** with no extreme outliers.

```python
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
df['age'] = df['age'].fillna(df['age'].mean())
```

---

### ✅ 3. **Median Imputation**
- Use the **median** value to fill missing data.
- Good when data has **outliers or is skewed**.

```python
df['salary'] = df['salary'].fillna(df['salary'].median())
```

---

### ✅ 4. **Mode Imputation**
- Use the **most frequent value** (mode).
- Common for **categorical data** (e.g., gender, city, etc.)

```python
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
```

---

### ✅ 5. **Constant/Fixed Value Imputation**
- Fill missing values with a specific value like `"Unknown"` or `0`.

```python
df['city'] = df['city'].fillna('Unknown')
```

---

### ✅ 6. **Forward Fill (ffill)**
- Fill missing value with the **previous value** in the column.

```python
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas
```

---

### ✅ 7. **Backward Fill (bfill)**
- Fill missing value with the **next value** in the column.

```python
df.bfill(inplace=True)  # fillna(method='bfill') is deprecated in recent pandas
```
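
Forward and backward fill each leave one edge case unfilled. The sketch below shows this on a short made-up series (`readings` is an assumed name) that starts and ends with a missing value.

```python
import numpy as np
import pandas as pd

readings = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = readings.ffill()   # copies the previous value forward
backward = readings.bfill()  # copies the next value backward

print(forward.tolist())
print(backward.tolist())
```

Forward fill cannot fill a leading gap because there is no previous value, and backward fill cannot fill a trailing gap, so a follow-up step may still be needed.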

---

### ✅ 8. **Advanced Techniques (Optional for ML)**
- **KNN Imputation**
- **Regression Imputation**
- **Multivariate Imputation by Chained Equations (MICE)**
> These are used in machine learning when more accuracy is needed.
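
As one example of these techniques, here is a minimal sketch of KNN imputation. It assumes scikit-learn is installed and uses its `KNNImputer`; the tiny `features` array is made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative data: the second row is missing its second feature
features = np.array([
    [1.0, 2.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Each missing entry is replaced by the mean of that feature
# over the 2 nearest rows that have a value for it
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(features)
print(imputed)
```

Here the two other rows are the only possible neighbours, so the gap is filled with the average of 2 and 6. On real data the choice of `n_neighbors` and feature scaling affects the result.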

---

## ✅ Summary Table

| Method               | Description                                | Best For                |
|----------------------|--------------------------------------------|--------------------------|
| Drop Missing          | Remove rows or columns with NaN            | When few values are missing |
| Mean Imputation       | Fill with average                          | Numeric, no outliers     |
| Median Imputation     | Fill with median                           | Numeric, skewed data     |
| Mode Imputation       | Fill with most common value                | Categorical data         |
| Constant Value        | Fill with fixed value                      | Custom cases             |
| Forward Fill (ffill)  | Use previous value                         | Time series              |
| Backward Fill (bfill) | Use next value                             | Time series              |

---

## 💡 Python Example

```python
import pandas as pd
import numpy as np

data = {
    'age': [25, np.nan, 30, 22, np.nan],
    'gender': ['Male', 'Female', np.nan, 'Female', 'Male']
}

df = pd.DataFrame(data)

# Fill missing age with mean (assign the result back to the column)
df['age'] = df['age'].fillna(df['age'].mean())

# Fill missing gender with mode
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])

print(df)
```

---

## 📝 Conclusion

- Imputation helps make datasets usable and prevents errors in analysis or machine learning.
- Choose the method based on **data type**, **distribution**, and **use case**.
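
Choosing by data type can be sketched as a small helper. `impute_simple` is a hypothetical function written for these notes, not a pandas API: it fills numeric columns with the median and all other columns with the mode, and it assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd


def impute_simple(frame):
    """Return a copy with numeric gaps filled by median and other gaps by mode."""
    filled = frame.copy()
    for col in filled.columns:
        if not filled[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() returns tied values in sorted order; take the first one
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled


sample = pd.DataFrame({
    'age': [25, np.nan, 30, 22, np.nan],
    'gender': ['Male', 'Female', np.nan, 'Female', 'Male'],
})
result = impute_simple(sample)
print(result)
```

The original `sample` is left unchanged, which makes it easy to compare the data before and after imputation.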


In [None]:
# First technique: remove missing values

print(df.shape)
# compare the shape after dropping rows or columns that contain missing values
df.dropna().shape # drops rows with any missing value
df.dropna(axis=1).shape # drops columns with any missing value

In [None]:
# Second technique: handle missing values with mean imputation

df['age'] = df['age'].fillna(df['age'].mean()) # assign the result back to the 'age' column
# or call fillna on the whole DataFrame with a {column: value} mapping
df.fillna({'age' : df['age'].mean()}, inplace=True)



In [None]:
# Third technique: handle missing values with median imputation

df['age'] = df['age'].fillna(df['age'].median()) # assign the result back to the 'age' column
# or call fillna on the whole DataFrame with a {column: value} mapping
df.fillna({'age' : df['age'].median()}, inplace=True)

In [31]:
# Fourth technique: handle missing values with mode imputation

df['sex'] = df['sex'].fillna(df['sex'].mode()[0]) # assign the result back to the 'sex' column
# or call fillna on the whole DataFrame with a {column: value} mapping
df.fillna({'sex' : df['sex'].mode()[0]}, inplace=True)

In [32]:
# Fifth technique: handle missing values with a constant (fixed) value

df['embarked'] = df['embarked'].fillna('unknown') # assign the result back to the 'embarked' column
# or call fillna on the whole DataFrame with a {column: value} mapping
df.fillna({'embarked' : 'unknown'}, inplace=True)
# the same mapping can fill more than one column at once
df.fillna({'embarked': 'unknown', 'age': df['age'].median()}, inplace=True)

In [34]:
# Sixth technique: handle missing values with forward fill

df['age'] = df['age'].ffill() # forward fill only 'age'
# or apply to the whole DataFrame
df.ffill(inplace=True)

In [35]:
# Seventh technique: handle missing values with backward fill

df['age'] = df['age'].bfill() # backward fill only 'age'
# or apply to the whole DataFrame
df.bfill(inplace=True)

In [None]:
# select specific columns from the dataset
df.sex # select one column (attribute access: not the recommended style)
df['sex'] # select one column (bracket access: the recommended style)
df[['sex', 'age']] # select two columns from the dataset
df[['sex', 'age', 'embarked']] # select more than two columns from the dataset

In [None]:
# unique() works only on a Series (one-dimensional data)

# find the unique values in a column
df.sex.unique() # unique values in the column (attribute access: not recommended)
df['sex'].unique() # unique values in the column (bracket access: recommended)
df['sex'].nunique() # number of unique values in the column

df.nunique() # number of unique values in every column

In [None]:
# count how many times each value occurs in a column with value_counts()
df['embark_town'].value_counts()

In [None]:
# basic summary 
df.describe()

# mean fare for males and females using groupby
df.groupby('sex')['fare'].mean()

# mean fare by sex and passenger class using groupby
df.groupby(['sex','pclass'])['fare'].mean()

# number of rows in each (sex, embarked) group using groupby
df.groupby(['sex', 'embarked']).size()

In [None]:
# build the correlation matrix
correlaton_df = df[['fare', 'age', 'parch', 'sibsp']].corr()
correlaton_df

In [None]:
# make the heatmap of correlation dataframe 
sns.heatmap(correlaton_df, annot=True)

In [None]:
# make pairplot of dataset
sns.pairplot(df)

## 📊 What is EDA (Exploratory Data Analysis)?

**EDA (Exploratory Data Analysis)** is the process of examining your dataset before applying any machine learning or statistical modeling. The goal is to understand:
- What the data contains
- Its structure
- Patterns, trends, and anomalies
- Whether the data is clean or needs preprocessing

## 🔍 Questions to Ask in EDA

### 1. **Who Collected the Data?**
- Understand the **source** of data.
- Was it collected by a **reliable organization**, a **survey**, **sensor**, **government**, or **scraped from a website**?

### 2. **What is the Data About?**
- What is the **context** of the data?
- What is the **main topic** — e.g., health records, sales data, weather, population, etc.?
- What is the **goal** of analyzing it?

---

## 🗂️ What is Metadata?

**Metadata = Data about Data**

It gives information **about** the dataset, such as:
- Column names
- Data types (int, float, string, etc.)
- Units (e.g., cm, $, °C)
- Description of fields
- When and how it was collected

> For example:  
> In a CSV file of student records, metadata tells us that  
> `age` is a number, `name` is a string, and `grade` is a float.

---

## 📐 Data Dimensions

Data dimensions refer to the **shape** of the dataset:

- **Rows = Observations** (Each record or sample)
- **Columns = Features/Variables** (Each attribute of the observation)

You can find the dimensions using:
```python
df.shape  # (rows, columns)
```

### Example:
If `df.shape` returns `(100, 5)`:
- There are **100 records (rows)**.
- Each record has **5 attributes (columns)**.

---

## ✅ Summary Table

| Term             | Meaning                                                       |
|------------------|---------------------------------------------------------------|
| EDA              | Explore and understand dataset                                |
| Data Source      | Who collected or provided the data                            |
| Data Meaning     | What the dataset represents                                   |
| Metadata         | Info about the data (e.g., types, descriptions, units)        |
| Data Dimensions  | Shape of data (rows = samples, columns = features)            |


---

## 📊 Understanding Key EDA Concepts

---

### 🔹 1. **Data**
- Data is a **collection of raw facts and figures**.
- In EDA, data is usually organized in rows (records) and columns (features).
- Example:
  ```plaintext
  Name     Age   Gender   Salary
  Alice    28    Female   50000
  Bob      35    Male     62000
  ```

---

### 🔹 2. **Compositions**
- Shows **what the data is made of** — the **proportions or categories** within a feature.
- Useful for analyzing **categorical columns** like gender, region, product type.
- Example tools:
  - Bar chart
  - Pie chart
  - Value counts

```python
df['gender'].value_counts()
df['region'].value_counts(normalize=True)
```

---

### 🔹 3. **Correlations**
- Measure of how **two numerical variables are related**.
- Value ranges from **-1 to 1**:
  - `+1`: perfect positive relationship
  - `0`: no linear relationship
  - `-1`: perfect negative relationship
- Useful for understanding **dependencies** or **multicollinearity**.

```python
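# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
df.corr(numeric_only=True)
sns.heatmap(df.corr(numeric_only=True), annot=True)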
```

---

### 🔹 4. **Comparisons**
- Comparing **two or more groups** based on a feature.
- Example: Average income of males vs females.
- Useful for finding differences or trends between categories.

```python
df.groupby('gender')['income'].mean()
sns.boxplot(x='gender', y='income', data=df)
```

---

### 🔹 5. **Distributions**
- Shows how data values are **spread out**.
- Helps identify:
  - Skewness
  - Outliers
  - Central tendency
- Tools:
  - Histogram
  - KDE plot
  - Boxplot

```python
sns.histplot(df['age'], kde=True)
sns.boxplot(y=df['salary'])
```

---

## ✅ Summary Table

| Concept       | Purpose                                       | Visual Tools                      |
|---------------|-----------------------------------------------|-----------------------------------|
| Data          | Raw information (rows × columns)              | `.head()`, `.info()`, `.shape`    |
| Composition   | Breakdown of categories/proportions           | Bar chart, Pie chart              |
| Correlation   | Relationship between numerical variables       | Correlation matrix, Heatmap       |
| Comparison    | Difference between groups/categories           | Boxplot, GroupBy + Mean           |
| Distribution  | Spread of values in a column                   | Histogram, KDE plot, Boxplot      |


---

## 📈 Pearson vs Spearman Correlation

---

### 🔹 1. **Pearson Correlation**
- Measures **linear relationship** between two **continuous numeric variables**.
- Assumes:
  - Relationship is **linear**
  - Approximate **normality** matters mainly for significance tests on the coefficient, not for computing it
- Value range: `-1` to `+1`
  - `+1`: Perfect positive linear relationship
  - `0`: No linear correlation
  - `-1`: Perfect negative linear relationship

#### ✅ Use When:
- Data is **continuous**
- Relationship is **linear**
- No major outliers

```python
df.corr(method='pearson', numeric_only=True)
```

---

### 🔹 2. **Spearman Correlation**
- Measures **monotonic relationship** using **ranked data**.
- Does **not require linearity** or normal distribution.
- Value range: `-1` to `+1` (based on rank correlation)

#### ✅ Use When:
- Data is **ordinal**
- Relationship is **non-linear but monotonic**
- There are **outliers**

```python
df.corr(method='spearman', numeric_only=True)
```

---

### 📊 Comparison Table

| Feature              | Pearson                         | Spearman                        |
|----------------------|----------------------------------|----------------------------------|
| Measures             | Linear relationship              | Monotonic relationship (ranks)  |
| Data type            | Continuous numeric               | Ordinal, ranked, or numeric     |
| Sensitive to outliers| Yes                              | Less so (uses ranks)             |
| Requires normality   | Only for significance tests      | No                               |
| Method used          | Raw values                       | Ranked values                   |
| Use case             | Linear data                      | Non-linear or ranked data       |

---

### 💡 Example in Code

```python
# Pearson correlation (numeric columns only)
df.corr(method='pearson', numeric_only=True)

# Spearman correlation (numeric columns only)
df.corr(method='spearman', numeric_only=True)
```
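
To see the difference between the two coefficients, the sketch below builds a made-up relationship that is monotonic but not linear (`y = x ** 3`; the frame and column names are assumptions for illustration).

```python
import pandas as pd

# y always increases with x, but along a curve rather than a straight line
frame = pd.DataFrame({'x': range(1, 11)})
frame['y'] = frame['x'] ** 3

pearson_r = frame.corr(method='pearson').loc['x', 'y']
spearman_r = frame.corr(method='spearman').loc['x', 'y']

print("Pearson: ", round(pearson_r, 3))
print("Spearman:", round(spearman_r, 3))
```

Spearman reports a perfect rank relationship because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a straight line.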

---

### ✅ Summary:
- Use **Pearson** for **linear, clean numeric** data.
- Use **Spearman** when data is **ordinal**, **non-linear**, or has **outliers**.


---

## 📊 Key Steps in Data Handling

---

### 🔍 1. **Data Exploration**
- First step of understanding the dataset.
- Involves checking:
  - Structure of data (rows, columns)
  - Data types
  - Summary statistics
  - Missing values, duplicates, or anomalies
- Tools: `df.info()`, `df.describe()`, `.head()`, `.value_counts()`

```python
df.info()
df.describe()
df['gender'].value_counts()
```

---

### 🔄 2. **Data Wrangling**
- Also called **data transformation** or **reshaping**.
- Process of **converting raw data into a usable format**.
- Includes:
  - Merging datasets
  - Splitting columns
  - Restructuring (pivoting/unpivoting)
  - Handling nested or messy data

```python
df = df.pivot_table(index='city', columns='year', values='population')
```
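
The pivot above assumes a frame with `city`, `year`, and `population` columns. As a self-contained sketch, the example below builds such a frame with made-up numbers, reshapes it from long to wide, and then back to long with `melt`.

```python
import pandas as pd

# Illustrative long-format data: one row per (city, year)
long_df = pd.DataFrame({
    'city': ['Lahore', 'Lahore', 'Karachi', 'Karachi'],
    'year': [2020, 2021, 2020, 2021],
    'population': [100, 110, 200, 215],  # made-up values
})

# Long -> wide: one row per city, one column per year
wide_df = long_df.pivot_table(index='city', columns='year', values='population')
print(wide_df)

# Wide -> long again (unpivoting)
back_to_long = wide_df.reset_index().melt(
    id_vars='city', var_name='year', value_name='population'
)
print(back_to_long)
```

`pivot_table` aggregates with the mean by default, so the wide values come back as floats even when each cell holds a single number.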

---

### 🔁 3. **Data Munging**
- Similar to **data wrangling** — often used interchangeably.
- Focuses on **converting complex or unstructured data into clean format**.
- Useful when data comes from web scraping, logs, APIs, or inconsistent sources.
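
A typical munging step is normalizing inconsistent text and converting strings to numbers. The sketch below uses a made-up `raw` frame (names and values are assumptions) to show one way to do this.

```python
import pandas as pd

# Illustrative messy input: stray spaces, mixed case, a non-numeric reading
raw = pd.DataFrame({
    'city': [' Lahore', 'lahore ', 'KARACHI'],
    'temp': ['31', 'n/a', '29.5'],
})

clean = raw.assign(
    # trim whitespace and use one consistent capitalization
    city=raw['city'].str.strip().str.title(),
    # convert to numbers; values that cannot be parsed become NaN
    temp=pd.to_numeric(raw['temp'], errors='coerce'),
)
print(clean)
```

After this step the `NaN` in `temp` can be handled with any of the imputation methods described earlier.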

---

### 🧼 4. **Data Cleaning**
- Removing or fixing **errors**, **incomplete**, **duplicate**, or **inconsistent** data.
- Steps may include:
  - Handling missing values
  - Removing duplicates
  - Correcting typos or formatting issues

```python
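df.dropna()             # returns a copy without rows that have missing values
df.duplicated().sum()   # count duplicate rows
# regex=False treats '$' as a literal character, not as the regex end-of-string anchor
df['salary'] = df['salary'].str.replace('$', '', regex=False).astype(float)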
```

---

### ⚙️ 5. **Data Preprocessing**
- Preparing data for **machine learning or statistical modeling**.
- Combines:
  - Cleaning
  - Encoding categorical variables
  - Scaling/normalizing data
  - Feature engineering

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
```
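
Encoding categorical variables, listed above, can be sketched with `pd.get_dummies`. The `passengers` frame below is made up for illustration; one-hot encoding is only one of several possible encodings.

```python
import pandas as pd

# Illustrative data with one categorical and one numeric column
passengers = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'fare': [7.25, 71.28, 8.05],
})

# One-hot encoding: each category becomes its own indicator column
encoded = pd.get_dummies(passengers, columns=['sex'])
print(encoded)
```

The original `sex` column is replaced by `sex_female` and `sex_male`, which most machine learning models can consume directly.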

---

## ✅ Summary Table

| Step              | Purpose                                                    |
|-------------------|------------------------------------------------------------|
| Data Exploration  | Understand the structure, types, and overview of the data  |
| Data Wrangling    | Reshape and reformat raw data for analysis                 |
| Data Munging      | Clean and convert messy/unstructured data                  |
| Data Cleaning     | Fix errors, missing values, duplicates, typos              |
| Data Preprocessing| Prepare clean data for ML models (scaling, encoding, etc.) |

