![alt text](<../images/just enough.png>)
# Just Enough Python for AI/Data Science
## Module 4: Pandas - From Raw Data to Elegant Tables
>Pandas is every data scientist's best friend for wrangling rows and columns of messy real-world data.
### Day 10 - Pandas DataFrames: Excel on Steroids
----

##### Overview:

Data cleaning is one of the most critical steps in any data analysis workflow, because real-world data is often incomplete, inconsistent, or downright messy. Today, you'll learn **basic cleaning tricks** with Pandas to take your dataset from chaos to order.

By the end of this session, you will know how to:

- Identify and handle **missing values**.
- Rename and reformat **columns**.
- Remove irrelevant rows and columns.
- Filter and **sort data** for better usability.
- Handle duplicate rows.

#### 1. Why Clean Data?

Real-world data often has issues like:

- **Missing values**: Blank or `NaN` entries in a dataset make calculations inaccurate.
- **Duplicate rows**: Redundant rows waste memory and skew results.
- **Inconsistent formats**: Misleading column names, mixed data types, or invalid entries lead to errors during analysis.
- **Irrelevant data**: Columns or rows not contributing to the goals of analysis clutter the dataset.
    
**Pandas to the Rescue:**

Pandas provides simple, intuitive tools that allow you to clean and manipulate messy data.



#### 2. Handling Missing Values
Pandas treats both `NaN` (Not a Number) and `None` as missing values. The first step is **detecting** where they occur; you can then either **replace** or **remove** them.

**Detect Missing Values**

Use `.isnull()` or `.notnull()` to find missing values.

In [9]:
import pandas as pd
import numpy as np

# Sample dataset
data = {'Name': ['Alice', 'Bob', None, 'Diana'],
        'Age': [24, np.nan, 15, 28],
        'City': ['New York', 'San Francisco', 'Chicago', None]}

df = pd.DataFrame(data)
df.head()


|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 24.0 | New York |
| 1 | Bob | NaN | San Francisco |
| 2 | None | 15.0 | Chicago |
| 3 | Diana | 28.0 | None |


In [10]:
# Detect missing values
print("Missing Values:\n", df.isnull())  # True for NaN entries
print("\nTotal Missing Values per Column:\n", df.isnull().sum())

Missing Values:
     Name    Age   City
0  False  False  False
1  False   True  False
2   True  False  False
3  False  False   True

Total Missing Values per Column:
 Name    1
Age     1
City    1
dtype: int64
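
Beyond raw counts, it often helps to know what share of each column is missing and which rows contain a gap. The sketch below rebuilds the same sample data so it runs on its own; the names `missing_share` and `rows_with_gaps` are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana"],
    "Age": [24, np.nan, 15, 28],
    "City": ["New York", "San Francisco", "Chicago", None],
})

# Fraction of missing entries per column (the mean of a boolean column)
missing_share = df.isnull().mean()
print(missing_share)

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Looking at the share rather than the count makes it easier to decide later whether dropping or filling is the more reasonable choice for a column.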


**Dropping Missing Values**

- You can remove rows or columns that contain missing values using `.dropna()`.

In [11]:

# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDrop Rows with Missing Values:\n", df_dropped_rows)


Drop Rows with Missing Values:
     Name   Age      City
0  Alice  24.0  New York


In [12]:
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)  # axis=1 drops columns
print("\nDrop Columns with Missing Values:\n", df_dropped_cols)




Drop Columns with Missing Values:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [13]:
# Drop a row only if ALL of its columns are null (rows with at least one value are kept)
df_keep_all = df.dropna(how="all")
print("\nDrop Rows Only if All Columns Null:\n", df_keep_all)


Drop Rows Only if All Columns Null:
     Name   Age           City
0  Alice  24.0       New York
1    Bob   NaN  San Francisco
2   None  15.0        Chicago
3  Diana  28.0           None
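
`.dropna()` can also be limited to specific columns or to a minimum number of valid values per row. The following is a small sketch using the `subset` and `thresh` parameters; the extra fifth row (with only a city) is made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data plus one hypothetical, mostly empty row
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana", None],
    "Age": [24, np.nan, 15, 28, np.nan],
    "City": ["New York", "San Francisco", "Chicago", None, "Boston"],
})

# Drop rows only when 'Age' is missing; gaps in other columns are ignored
df_age_known = df.dropna(subset=["Age"])
print(df_age_known)

# Keep rows that have at least 2 non-null values
df_thresh = df.dropna(thresh=2)
print(df_thresh)
```

With this data, `subset=["Age"]` keeps rows 0, 2 and 3, while `thresh=2` removes only the last row, which has a single non-null value.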


**Filling Missing Values**

Instead of dropping rows or columns, you can fill missing values with specific values or statistical measures (e.g., mean or median).

In [14]:
# Fill missing values with a constant
# (note: the string also lands in the numeric Age column, so Age is no longer purely numeric)
df_filled_constant = df.fillna("Unknown")
print("\nFill Missing Values with Constant:\n", df_filled_constant)


Fill Missing Values with Constant:
       Name      Age           City
0    Alice     24.0       New York
1      Bob  Unknown  San Francisco
2  Unknown     15.0        Chicago
3    Diana     28.0        Unknown
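
Filling every column with the same string leaves `Age` with a mix of numbers and text. One way around this, sketched below, is to pass a dictionary to `.fillna()` so each column gets its own fill value; using the median for `Age` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana"],
    "Age": [24, np.nan, 15, 28],
    "City": ["New York", "San Francisco", "Chicago", None],
})

# One fill value per column: text for text columns, a number for Age
fill_values = {
    "Name": "Unknown",
    "City": "Unknown",
    "Age": df["Age"].median(),  # median of 15, 24, 28
}
df_filled = df.fillna(fill_values)

print(df_filled)
print(df_filled.dtypes)  # Age stays numeric
```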


In [15]:
# Fill missing numerical values with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())
print("\nFill Missing Ages with Mean:\n", df)




Fill Missing Ages with Mean:
     Name        Age           City
0  Alice  24.000000       New York
1    Bob  22.333333  San Francisco
2   None  15.000000        Chicago
3  Diana  28.000000           None


In [44]:
data = {'Name': ['Alice', 'Bob', None, 'Diana'],
        'Age': [24, np.nan, 15, 28],
        'City': ['New York', 'San Francisco', 'Chicago', None]}

df = pd.DataFrame(data)
df.head()

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 24.0 | New York |
| 1 | Bob | NaN | San Francisco |
| 2 | None | 15.0 | Chicago |
| 3 | Diana | 28.0 | None |


In [45]:
# Forward fill: propagate the last valid value downward
# (fillna(method="ffill") is deprecated in recent pandas versions; .ffill() replaces it)
df_filled_forward = df.ffill()


print("\nForward Fill:\n", df_filled_forward)



Forward Fill:
     Name   Age           City
0  Alice  24.0       New York
1    Bob  24.0  San Francisco
2    Bob  15.0        Chicago
3  Diana  28.0        Chicago


In [46]:
# Backward fill: pull the next valid value upward
# (replaces the deprecated fillna(method="bfill"))
df_filled_backward = df.bfill()

print("\nBackward Fill:\n", df_filled_backward)
 



Backward Fill:
     Name   Age           City
0  Alice  24.0       New York
1    Bob  15.0  San Francisco
2  Diana  15.0        Chicago
3  Diana  28.0           None
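
Forward and backward fill make the most sense for ordered data, and long runs of gaps can be capped with the `limit` parameter. The sketch below uses a made-up series called `readings` to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered measurements with gaps
readings = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan])

# Unlimited forward fill: every gap takes the last valid value
filled_all = readings.ffill()
print(filled_all.tolist())

# Fill at most one consecutive gap; the second NaN in a run stays missing
filled_limited = readings.ffill(limit=1)
print(filled_limited.tolist())
```

Note that a gap at the very start (for forward fill) or the very end (for backward fill) has no neighbor to copy from, which is why the last `City` above is still `None` after the backward fill.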


#### 3. Renaming and Reformatting Columns
Clear and consistent column names are essential for readability and usability.

**Renaming Columns**
- Use `.rename()` to rename columns individually:

In [47]:
# Rename a single column
df_renamed = df.rename(columns={"Name": "Full Name"})
print("\nRenamed Columns:\n", df_renamed)



Renamed Columns:
   Full Name   Age           City
0     Alice  24.0       New York
1       Bob   NaN  San Francisco
2      None  15.0        Chicago
3     Diana  28.0           None


In [48]:
# Rename multiple columns
df_renamed_all = df.rename(columns={"Name": "Full Name", "Age": "Passenger Age", "City": "Hometown"})
print("\nRenamed Multiple Columns:\n", df_renamed_all)


Renamed Multiple Columns:
   Full Name  Passenger Age       Hometown
0     Alice           24.0       New York
1       Bob            NaN  San Francisco
2      None           15.0        Chicago
3     Diana           28.0           None


**Standardizing Column Names**

- Apply `.str` methods to `df.columns` to standardize all column names at once.

In [49]:
df_renamed_all.columns = df_renamed_all.columns.str.lower()  # Convert column names to lowercase
print("\nStandardized Column Names:\n", df_renamed_all)




Standardized Column Names:
   full name  passenger age       hometown
0     Alice           24.0       New York
1       Bob            NaN  San Francisco
2      None           15.0        Chicago
3     Diana           28.0           None


In [50]:
df_renamed_all.columns = df_renamed_all.columns.str.replace(" ", "_")  # Replace spaces with underscores
print("\nStandardized Column Names:\n", df_renamed_all)


Standardized Column Names:
   full_name  passenger_age       hometown
0     Alice           24.0       New York
1       Bob            NaN  San Francisco
2      None           15.0        Chicago
3     Diana           28.0           None
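
The two steps above can also be chained, and stray whitespace around names can be stripped in the same pass. A minimal sketch, assuming some hypothetical messy headers:

```python
import pandas as pd

# Hypothetical headers with inconsistent case and stray spaces
df_messy = pd.DataFrame(columns=[" Full Name ", "Passenger Age", "Hometown"])

# Strip outer whitespace, lowercase, then replace inner spaces with underscores
df_messy.columns = (
    df_messy.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)

print(list(df_messy.columns))
```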


#### 4. Removing Irrelevant Data
- Removing irrelevant rows or columns can help streamline analysis.

**Dropping Columns**
- Use `.drop()` to remove unnecessary columns.

In [51]:
df.head() 

|   | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 24.0 | New York |
| 1 | Bob | NaN | San Francisco |
| 2 | None | 15.0 | Chicago |
| 3 | Diana | 28.0 | None |


In [52]:
# Drop a single column
df_dropped = df.drop("City", axis=1)  # axis=1 means columns
print("\nDrop 'City' Column:\n", df_dropped)



Drop 'City' Column:
     Name   Age
0  Alice  24.0
1    Bob   NaN
2   None  15.0
3  Diana  28.0


In [53]:
# Drop multiple columns
df_dropped_multi = df.drop(["Age", "City"], axis=1)
print("\nDrop Multiple Columns:\n", df_dropped_multi)


Drop Multiple Columns:
     Name
0  Alice
1    Bob
2   None
3  Diana


**Dropping Rows**
- Use `.drop()` to remove rows based on their index.

In [54]:
# Drop rows by index
df_dropped_rows = df.drop(index=1)  # Drop row at index 1
print("\nDrop Row 1:\n", df_dropped_rows)



Drop Row 1:
     Name   Age      City
0  Alice  24.0  New York
2   None  15.0   Chicago
3  Diana  28.0      None
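
After dropping rows, the index keeps its original labels (note the jump from 0 to 2 above). The sketch below drops several rows at once and then renumbers the index with `.reset_index(drop=True)`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana"],
    "Age": [24, np.nan, 15, 28],
    "City": ["New York", "San Francisco", "Chicago", None],
})

# Drop several rows by passing a list of index labels
df_dropped_many = df.drop(index=[1, 3])
print(df_dropped_many)

# Renumber the remaining rows 0..n-1 and discard the old index
df_renumbered = df_dropped_many.reset_index(drop=True)
print(df_renumbered)
```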


#### 5. Removing Duplicate Rows
Duplicate rows are often introduced during data collection or when merging datasets. Use `.duplicated()` to identify duplicates and `.drop_duplicates()` to remove them.

In [55]:
# Sample dataset with duplicates
data = {
    "Name": ["Alice", "Bob", "Alice", "Charlie", "Bob"],
    "Age": [24, 30, 24, 35, 30]
}
df_duplicates = pd.DataFrame(data)
print("\nDataset with Duplicates:\n", df_duplicates)




Dataset with Duplicates:
       Name  Age
0    Alice   24
1      Bob   30
2    Alice   24
3  Charlie   35
4      Bob   30


In [56]:
# Identify duplicates
print("\nDuplicates:\n", df_duplicates.duplicated())





Duplicates:
 0    False
1    False
2     True
3    False
4     True
dtype: bool


In [57]:
# Drop duplicates
df_no_duplicates = df_duplicates.drop_duplicates()
print("\nDataset Without Duplicates:\n", df_no_duplicates)


Dataset Without Duplicates:
       Name  Age
0    Alice   24
1      Bob   30
3  Charlie   35
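
By default, a row counts as a duplicate only when every column matches. The `subset` and `keep` parameters adjust that, as in this sketch with a small made-up dataset where the two "Alice" rows differ in age:

```python
import pandas as pd

# Hypothetical data: same name twice, but different ages
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [24, 30, 25],
})

# Compare on 'Name' only; the first occurrence is kept by default
by_name_first = people.drop_duplicates(subset=["Name"])
print(by_name_first)

# Keep the last occurrence instead
by_name_last = people.drop_duplicates(subset=["Name"], keep="last")
print(by_name_last)
```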


#### 6. Sorting Data
Sorting puts rows in a meaningful order, which makes extremes and patterns easier to spot.

**Sorting by a Column**
- Use `.sort_values()` to sort rows based on a specific column.

In [59]:
# Sort by Age
df_sorted_age = df.sort_values("Age")  # Ascending order by default
print("\nSort by Age:\n", df_sorted_age)



Sort by Age:
     Name   Age           City
2   None  15.0        Chicago
0  Alice  24.0       New York
3  Diana  28.0           None
1    Bob   NaN  San Francisco


In [60]:
# Sort by Age in descending order
df_sorted_desc = df.sort_values("Age", ascending=False)
print("\nSort by Age (Descending):\n", df_sorted_desc)



Sort by Age (Descending):
     Name   Age           City
3  Diana  28.0           None
0  Alice  24.0       New York
2   None  15.0        Chicago
1    Bob   NaN  San Francisco


**Sorting by Multiple Columns**

In [62]:
# Sort by Age (ascending); rows with equal Age are ordered by Name (descending)
df_sorted_multiple = df.sort_values(["Age", "Name"], ascending=[True, False])
print("\nSort by Age and Name:\n", df_sorted_multiple)




Sort by Age and Name:
     Name   Age           City
2   None  15.0        Chicago
0  Alice  24.0       New York
3  Diana  28.0           None
1    Bob   NaN  San Francisco
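
In the outputs above, the row with a missing `Age` always ends up last, regardless of sort direction. A short sketch of two optional `sort_values()` parameters that control this: `na_position` moves missing values to the top, and `ignore_index` renumbers the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana"],
    "Age": [24, np.nan, 15, 28],
    "City": ["New York", "San Francisco", "Chicago", None],
})

# Put rows with a missing Age first and renumber the index from 0
df_sorted_na_first = df.sort_values("Age", na_position="first", ignore_index=True)
print(df_sorted_na_first)
```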


#### 7. Filtering Rows
Filtering is one of the most common operations when cleaning datasets. Use conditions to subset rows based on specific rules.

**Filter Rows Based on Conditions**


In [63]:
# Filter rows where Age > 20
filtered = df[df["Age"] > 20]
print("\nRows Where Age > 20:\n", filtered)



Rows Where Age > 20:
     Name   Age      City
0  Alice  24.0  New York
3  Diana  28.0      None


In [65]:
# Filter rows where Age is between 20 and 25 (both bounds excluded)
filtered_between = df[(df["Age"] > 20) & (df["Age"] < 25)]
print("\nRows Where Age is Between 20 and 25:\n", filtered_between)



Rows Where Age is Between 20 and 25:
     Name   Age      City
0  Alice  24.0  New York
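
A few other filtering patterns come up often: `.between()` for ranges, `.isin()` for membership in a list, and `|` for "or" conditions. The sketch below applies them to the same sample data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana"],
    "Age": [24, np.nan, 15, 28],
    "City": ["New York", "San Francisco", "Chicago", None],
})

# Range filter; unlike the > / < version, between() includes both bounds by default
in_range = df[df["Age"].between(20, 25)]
print(in_range)

# Membership filter: keep rows whose City is in a given list
in_cities = df[df["City"].isin(["New York", "Chicago"])]
print(in_cities)

# "Or" condition: younger than 20 OR city missing (each condition in parentheses)
young_or_no_city = df[(df["Age"] < 20) | (df["City"].isnull())]
print(young_or_no_city)
```

Rows with a missing `Age` never satisfy a comparison, so they drop out of the numeric filters unless they are selected explicitly with `.isnull()`.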


#### 8. Combining Cleaning Operations
Let's perform multiple cleaning steps in succession. Here’s an example:

In [66]:
# Dataset with issues
data = {'Name': ["Alice", "Bob", "Charlie", "Diana", "Alice"],
        'Age': [24, None, 35, 28, None],
        'City': ["New York", "San Francisco", "Chicago", None, "New York"]}
df = pd.DataFrame(data)

# Cleaning Step-by-Step
df = df.dropna(subset=["City"])  # Drop rows where 'City' is missing
df["Age"] = df["Age"].fillna(df["Age"].mean())  # Fill missing 'Age' values with the mean
df = df.drop_duplicates()  # Remove duplicate rows
df["Name"] = df["Name"].str.upper()  # Convert names to uppercase
df = df.sort_values(by="Age", ascending=True)  # Sort by Age
print("\nCleaned Dataset:\n", df)



Cleaned Dataset:
       Name   Age           City
0    ALICE  24.0       New York
1      BOB  29.5  San Francisco
4    ALICE  29.5       New York
2  CHARLIE  35.0        Chicago
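
The same steps can be written as a single method chain, which avoids reassigning `df` after every step. This is only one possible style, sketched under the assumption that the input is the same small dataset; `raw` and `cleaned` are illustrative names.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Alice"],
    "Age": [24, None, 35, 28, None],
    "City": ["New York", "San Francisco", "Chicago", None, "New York"],
})

cleaned = (
    raw
    .dropna(subset=["City"])  # drop rows where City is missing
    .assign(
        # fill missing ages with the mean of the remaining rows
        Age=lambda d: d["Age"].fillna(d["Age"].mean()),
        # convert names to uppercase
        Name=lambda d: d["Name"].str.upper(),
    )
    .drop_duplicates()  # remove rows that are identical in every column
    .sort_values("Age")  # ascending by default
)
print(cleaned)
```

Note that both "Alice" rows survive `drop_duplicates()` here, because after the mean fill their ages differ (24.0 and 29.5). Passing `subset=["Name", "City"]` would treat them as duplicates.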


----
#### Quick Exercises
1. Explore the Titanic Dataset:

    - Download the Titanic dataset from <https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv>.
    - Load it into a Pandas DataFrame.

2. Clean the Titanic Dataset:

    - Drop rows where Age or Embarked are missing.
    - Fill in missing Fare values with the average fare.
    - Standardize column names (e.g., make them lowercase and replace spaces with underscores).
    - Add a new column named Family_Size which is the sum of SibSp and Parch.

3. Sort and Filter:

    - Sort passengers by Fare in descending order.
    - Filter and print all passengers who paid more than $100 and were older than 30.

4. Identify Duplicates:

    - Check if the dataset has duplicates and remove them, if any.

**Please Note:** The solutions to the above questions will be provided at the end of the next session's (Day 11) notebook.


---- 


### Day 9 Exercise Solution

1. Download the Titanic dataset.

2. Load the dataset into a Pandas DataFrame.

In [67]:
# Load Titanic dataset
df = pd.read_csv("../data/titanic_dataset.csv")  # Replace with the path to your dataset
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


3. Answer the following:
    - What are the column names?
    - How many rows and columns are there?
    - What are the data types of the columns?

In [69]:
print("Columns:", df.columns)

Columns: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


In [None]:
print("Shape:", df.shape) 

Shape: (891, 12)


In [71]:
print("Data Types:\n", df.dtypes)

Data Types:
 PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


4. Perform these operations:
    - Access the first 5 rows of the dataset.
    - Select and print the "Name" and "Age" columns.
    - Filter and print passengers above the age of 40.

In [73]:
df.head()




|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [76]:
df[["Name", "Age"]]





|   | Name | Age |
|---|------|-----|
| 0 | Braund, Mr. Owen Harris | 22.0 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 |
| 2 | Heikkinen, Miss. Laina | 26.0 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 |
| 4 | Allen, Mr. William Henry | 35.0 |
| ... | ... | ... |
| 886 | Montvila, Rev. Juozas | 27.0 |
| 887 | Graham, Miss. Margaret Edith | 19.0 |
| 888 | Johnston, Miss. Catherine Helen "Carrie" | NaN |
| 889 | Behr, Mr. Karl Howell | 26.0 |


In [78]:
older_passengers = df[df["Age"] > 40]
older_passengers


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
| 33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | NaN | S |
| 35 | 36 | 0 | 1 | Holverson, Mr. Alexander Oskar | male | 42.0 | 1 | 0 | 113789 | 52.0000 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 862 | 863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S |
| 865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S |
| 871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S |
| 873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S |


5. (Bonus) Add a new column called `Family_Size` that is the sum of `SibSp` and `Parch` (siblings/spouses + parents/children).

In [83]:
df["Family_Size"] = df["SibSp"] + df["Parch"]
df.head()


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 |


# HAPPY LEARNING