# ai03bTasks
# Machine Learning: Decision Trees
## Data Cleaning

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to your GitHub "Machine Learning" Repo under the appropriate folder.

---
## Setup: Import Libraries and Load Data

Run this cell first to set up your environment.

In [1]:


import pandas as pd

# Load the Titanic dataset from the CSV file
df = pd.read_csv("Titanic Dataset.csv")

print("✓ Data loaded successfully!")
print(f"Original shape: {df.shape}")
print(f"\nOriginal columns: {df.columns.tolist()}")



✓ Data loaded successfully!
Original shape: (1309, 14)

Original columns: ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']


---
## Task 1: Understand the Original Data

Before cleaning, let's see what we're working with.

### 1a. How many rows and columns are in the original dataset?

In [2]:
# TODO: Print the shape of the DataFrame
print(f"Rows: {len(df)}")
print(f"Columns: {len(df.columns)}")

Rows: 1309
Columns: 14


### 1b. Display the first few rows

In [3]:
# TODO: Use .head() to display the first 5 rows
df.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |


### 1c. Check for missing values in the original data

In [4]:
# TODO: Use .isnull().sum() to count missing values per column
print("Missing values in original data:")
print(df.isnull().sum())

Missing values in original data:
pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64


**Q: Which columns have the most missing values?**

A: body (1188 missing), followed by cabin (1014) and boat (823).
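To make the ranking easier to read off, the missing-value counts can be sorted and expressed as percentages. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not the Titanic data) to show the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "cabin": ["B5", np.nan, np.nan, np.nan],
    "fare": [211.34, 151.55, 151.55, 151.55],
})

# Count missing values per column and sort from most to fewest
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Share of rows missing in each column, as a percentage
missing_pct = (missing_counts / len(toy) * 100).round(1)

print(missing_counts)
print(missing_pct)
```

On the real dataset the same two lines would be applied to `df` instead of `toy`.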

---
## Task 2: Select Useful Features

We'll keep only the columns that help predict survival.
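Keeping a list of columns and dropping the unwanted ones are two ways to reach the same result. The sketch below illustrates this on a hypothetical four-column DataFrame (`toy`, `keep`, and the passenger values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with two useful and two unwanted columns
toy = pd.DataFrame({
    "pclass": [1, 3],
    "survived": [1, 0],
    "name": ["Passenger A", "Passenger B"],
    "ticket": ["0001", "0002"],
})

# Option 1: select the columns to keep (.copy() gives an independent DataFrame)
keep = ["pclass", "survived"]
selected = toy[keep].copy()

# Option 2: drop the columns that are not needed
dropped = toy.drop(columns=["name", "ticket"])

print(selected.columns.tolist())
print(selected.equals(dropped))
```

Selecting by name is usually easier to read when only a few columns are kept, as in this task.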

### 2a. Keep only these 8 columns: pclass, survived, sex, age, sibsp, parch, fare, embarked

In [5]:
# TODO: Select only the useful columns
# Hint: df = df[['column1', 'column2', ...]]
df = df[['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

print("✓ Columns selected!")
print(f"New shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

✓ Columns selected!
New shape: (1309, 8)

Columns: ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']


### 2b. Explain why we dropped certain columns

**Q: Why did we drop the 'name' column?** (Refer to the lesson if you don't recall, or ask a neighbor!)

A: A passenger's name does not help predict survival.

**Q: Why did we drop the 'cabin' column?**

A: It has too many missing values (1014 of 1309 rows), and the cabin codes are complex to use as a feature.

**Q: Why did we drop the 'boat' column?**

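A: It records the lifeboat number, which is known only for passengers who made it into a lifeboat, so it effectively reveals the outcome we are trying to predict.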

---
## Task 3: Check for Missing Values

Now let's see which of our selected columns have missing values.

In [6]:
# TODO: Check for missing values in the cleaned dataset
print("Missing values after feature selection:")
print(df.isnull().sum())

Missing values after feature selection:
pclass        0
survived      0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64


**Q: How many missing values are in the 'age' column?**

A: 263

**Q: How many missing values are in the 'fare' column?**

A:  1

**Q: How many missing values are in the 'embarked' column?**

A: 2

---
## Task 4: Handle Missing Age Values

Age has many missing values. We'll fill them with the median age.
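As a minimal sketch of median imputation, the example below fills the gaps in a small hypothetical `age` column (`toy` and its values are made up). It assigns the result of `fillna` back to the column rather than calling it with `inplace=True` on a selected column, which recent pandas versions warn about:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, 28.0, np.nan]})

# median() skips NaN values by default
median_age = toy["age"].median()

# Pass the number itself (not a string) so the column stays numeric
toy["age"] = toy["age"].fillna(median_age)

print(f"Median age: {median_age}")
print(toy["age"].tolist())
print(toy["age"].dtype)
```

Filling with a string such as `f'{median_age}'` would turn the column into text, and it would then need converting back to numbers before computing statistics.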

### 4a. Calculate the median age

In [7]:
# TODO: Calculate the median age
median_age = df['age'].median()
print(f"Median age: {median_age}")

Median age: 28.0


### 4b. Fill missing ages with the median

In [8]:
# TODO: Fill missing ages with median_age
# Hint: Assign the result back, e.g. df['col'] = df['col'].fillna(value)
df['age'] = df['age'].fillna(median_age)

print("✓ Missing ages filled with median!")
print(f"Missing ages now: {df['age'].isnull().sum()}")

✓ Missing ages filled with median!
Missing ages now: 0


### 4c. Compare median vs mean for age

In [10]:
# Ensure the age column is numeric; any non-numeric values become NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

print(f"Median age: {df['age'].median():.2f}")
print(f"Mean age:   {df['age'].mean():.2f}")

Median age: 28.00
Mean age:   29.50


**Q: Which is larger, the median or mean? Why might this be?**

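A: The mean (29.50) is larger than the median (28.00). The age distribution is skewed to the right: a small number of much older passengers pull the mean upward, while the median is the middle value and is barely affected by them.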
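To see how a few large values pull the mean above the median, here is a small sketch with made-up ages (the list is hypothetical, not taken from the Titanic data):

```python
from statistics import mean, median

# Hypothetical ages: mostly young adults plus two much older passengers
ages = [20, 22, 24, 26, 28, 30, 32, 71, 80]

mean_age = mean(ages)
median_age = median(ages)

print(f"Mean age:   {mean_age:.2f}")
print(f"Median age: {median_age:.2f}")

# Without the two oldest passengers, the mean moves much closer to the median
trimmed = ages[:-2]
print(f"Mean without the two oldest:   {mean(trimmed):.2f}")
print(f"Median without the two oldest: {median(trimmed):.2f}")
```

Removing the two largest values changes the mean far more than the median, which is why the median is often described as robust to outliers.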

---
## Task 5: Handle Missing Fare Values

Fare has only 1 missing value. We'll fill it with the median fare.

In [11]:
# TODO: Calculate median fare
median_fare = df['fare'].median()
print(f"Median fare: ${median_fare:.2f}")

# TODO: Fill missing fare with median
df['fare'] = df['fare'].fillna(median_fare)

print("✓ Missing fare filled!")
print(f"Missing fares now: {df['fare'].isnull().sum()}")

Median fare: $14.45
✓ Missing fare filled!
Missing fares now: 0


---
## Task 6: Handle Missing Embarked Values

Embarked has only 2 missing values. Since this is so few, we'll drop those rows.
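The `subset` argument is what limits the drop to rows missing one particular column. A minimal sketch on hypothetical data (`toy` and its values are made up) shows that gaps in other columns are left alone:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing fare and one missing embarked value
toy = pd.DataFrame({
    "fare": [7.25, 80.0, np.nan, 8.05],
    "embarked": ["S", np.nan, "C", "Q"],
})

rows_before = len(toy)

# Drop only the rows where 'embarked' is missing; the row with a missing fare is kept
toy = toy.dropna(subset=["embarked"])

rows_dropped = rows_before - len(toy)
print(f"Rows dropped: {rows_dropped}")
print(f"Missing fares still present: {toy['fare'].isnull().sum()}")
```

Calling `dropna()` with no `subset` would instead remove every row that has a missing value in any column.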

### 6a. How many rows before dropping?

In [12]:
rows_before = len(df)
print(f"Rows before dropping: {rows_before}")

Rows before dropping: 1309


### 6b. Drop rows with missing embarked values

In [13]:
# TODO: Drop rows where 'embarked' is missing
# Hint: Use .dropna(subset=['column_name'], inplace=True) - also in the lesson material
df.dropna(subset=['embarked'], inplace=True)

rows_after = len(df)
rows_dropped = rows_before - rows_after

print("✓ Rows with missing embarked dropped!")
print(f"Rows after dropping: {rows_after}")
print(f"Rows dropped: {rows_dropped}")
print(f"Missing embarked now: {df['embarked'].isnull().sum()}")

✓ Rows with missing embarked dropped!
Rows after dropping: 1307
Rows dropped: 2
Missing embarked now: 0


**Q: What percentage of data did we lose by dropping these rows?**

A: About 0.15% (2 of 1309 rows)
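The percentage follows directly from the row counts reported above. As a quick check of the arithmetic:

```python
# Row counts taken from the outputs of Task 6
rows_before = 1309
rows_dropped = 2

# Percentage of rows lost = dropped rows / original rows * 100
pct_lost = rows_dropped / rows_before * 100

print(f"Data lost: {pct_lost:.2f}%")
```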

---
## Task 7: Verify All Missing Values Are Gone

Let's do a final check to make sure our data is completely clean.

In [14]:
# TODO: Check for any remaining missing values
print("Final missing value check:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

Final missing value check:
pclass      0
survived    0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
dtype: int64

Total missing values: 0


**Q: Are there any missing values remaining? (Should be 0!)**

A: No, 0 missing values remain.

---
## Task 8: Summary Statistics

Now that our data is clean, let's look at summary statistics.

In [15]:
# Display summary statistics
print("Summary statistics after cleaning:")
df.describe()

Summary statistics after cleaning:


| | pclass | survived | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 1307.0 | 1307.0 | 1307.0 | 1307.0 | 1307.0 | 1307.0 |
| mean | 2.296863 | 0.381025 | 29.471821 | 0.499617 | 0.385616 | 33.209595 |
| std | 0.836942 | 0.485825 | 12.881592 | 1.042273 | 0.866092 | 51.748768 |
| min | 1.0 | 0.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 22.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 3.0 | 0.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 3.0 | 1.0 | 35.0 | 1.0 | 0.0 | 31.275 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 |


**Q: What is the average age after filling missing values?**

A: About 29.5 years (29.47)

**Q: What is the average fare?**

A: About 33.21

---
## Task 9: Save the Cleaned Data
Save your cleaned data to a new CSV file so you can use it in the next lesson.

In [16]:
# Save cleaned data
# Writes the DataFrame to a new CSV file without adding extra row numbers (index)
df.to_csv("Titanic_Cleaned.csv", index=False)     

# Prints a confirmation message so the user knows the save was successful
print("✓ Cleaned data saved to 'Titanic_Cleaned.csv'")   

✓ Cleaned data saved to 'Titanic_Cleaned.csv'
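To illustrate what `index=False` changes, the sketch below writes a small hypothetical DataFrame to an in-memory buffer (standing in for a file on disk) with and without the index, then reads each version back:

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the cleaned DataFrame
toy = pd.DataFrame({
    "pclass": [1, 3],
    "sex": ["female", "male"],
    "age": [29.0, 28.0],
})

# Save without the index, then read the CSV text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Save with the index (the default), then read it back
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(reloaded.columns.tolist())
print(reloaded_with_index.columns.tolist())
```

When the index is written, reading the file back produces an extra `Unnamed: 0` column, which is why `index=False` is used here.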


---
## Reflection Questions

Answer these questions based on your work:

**1. Why is it important to check for missing values before building a model?**

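Answer: Because many models either raise errors on missing values or handle them in ways that skew the results, so we need to know where the gaps are before deciding how to deal with them.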

**2. When should you fill missing values vs. drop rows?**

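Answer: Fill missing values when dropping the rows would discard a large share of the dataset (for example, the 263 missing ages). Drop rows when only a few are affected (for example, the 2 missing embarked values).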

**3. Why did we use median instead of mean to fill missing ages?**

Answer: Because the median is less affected by outliers and skew than the mean, so it gives a more typical age to fill in.

**4. What could happen if we trained a model on data with missing values?**

Answer: Training might fail with an error, or the model might make inaccurate predictions.

**5. Name one real-world scenario where missing data might occur.**

Answer: A study involves interviewing participants, and the interviewer forgets to record some of the answers in the dataset.

---
## Lesson Complete!

You've successfully cleaned the Titanic dataset!

**Summary of what you did:**
- Selected 8 useful columns from the original 14
- Filled missing ages with median
- Filled missing fares with median
- Dropped 2 rows with missing embarked values
- Verified all missing values are gone

Save this notebook and push to GitHub.

**Next lesson**: Convert categorical data to numbers