# ai03cTasks
# Machine Learning: Decision Trees
## Data Preparation

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to the appropriate folder in your GitHub "Machine Learning" repo.

---
## Setup: Import Libraries and Load Cleaned Data

Run this cell first. We'll start with the cleaned data from Lesson 2.

In [None]:
import pandas as pd

# Load the cleaned data (or load and clean again)
df = pd.read_csv("Titanic_Cleaned.csv")

print("✓ Cleaned data loaded")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

---
## Task 1: Understand Categorical vs Numerical Data

Let's identify which columns are categorical and which are numerical.
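
As a quick illustration before working with the real data, here is a minimal sketch on a made-up three-row table (the `toy` DataFrame and its values are assumptions, not the Titanic file). It uses `select_dtypes` to separate number columns from everything else, which is one possible way to list the two groups.

```python
import pandas as pd

# Toy data, not the Titanic file: one text column and two numeric columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "age": [22.0, 38.0, 26.0],
    "survived": [0, 1, 1],
})

# "number" matches integer and float columns; here we treat everything else as categorical.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Reading `df.dtypes` by eye, as in 1a, gives the same information; this approach is mainly convenient when a table has many columns.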

### 1a. Check data types

In [None]:
# Display data types of each column
print("Data types:")
print(df.dtypes)

### 1b. Identify categorical columns

**Q: Which columns have 'object' data type? (These are categorical)**

A: 

**Q: Which columns have 'int64' or 'float64'? (These are numerical)**

A: 

### 1c. View unique values in categorical columns

In [None]:
# Check unique values in 'sex'
print("Unique values in 'sex':")
print(df['sex'].unique())               # All distinct values that appear in the 'sex' column
print(f"Count: {df['sex'].nunique()}")  # Number of distinct values (how many categories)


# Check unique values in 'embarked'
print("\nUnique values in 'embarked':")
print(df['embarked'].unique())
print(f"Count: {df['embarked'].nunique()}")

---
## Task 2: Convert 'sex' to Dummy Variables

Convert the 'sex' column from text to numbers using one-hot encoding.
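
To see what one-hot encoding produces, here is a small sketch on a made-up four-row table (the `toy` DataFrame is an assumption for illustration). It encodes without `drop_first`, so both dummy columns are visible, and passes `dtype=int` so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# dtype=int forces 0/1 output; without it, recent pandas versions return True/False.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded)

# Every row has exactly one 1 across the two dummy columns.
row_totals = encoded.sum(axis=1)
print(row_totals.tolist())
```

Keep the row totals in mind when you answer the question about `drop_first` later in this task.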

### 2a. Preview the data before encoding

In [None]:
# Show first few rows with 'sex' column
print("Before encoding:")
print(df[['sex', 'age', 'survived']].head())

### 2b. Convert 'sex' to dummy variables

In [None]:
# TODO: Use pd.get_dummies() to convert 'sex' to dummy variables
# Hint: df = pd.get_dummies(df, columns=['sex'], drop_first=True)
df = pd.get_dummies(df, columns=[________], drop_first=________)

print("✓ 'sex' converted to dummy variables!")
print(f"New columns: {df.columns.tolist()}")

### 2c. Preview the data after encoding

In [None]:
# Show first few rows with new 'sex_male' column
print("After encoding:")
print(df[['sex_male', 'age', 'survived']].head(10))

**Q: What does sex_male = 1 mean?** (pandas 2.0 and later displays dummy values as True/False; True corresponds to 1 and False to 0.)

A: 

**Q: What does sex_male = 0 mean?**

A: 

**Q: Why do we only have one column (sex_male) instead of two (sex_male and sex_female)?**

A: 

---
## Task 3: Convert 'embarked' to Dummy Variables

Now convert the 'embarked' column (which has 3 categories: C, Q, S).
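
The same idea with three categories can be sketched on a made-up column (`ticket_type` and its values are hypothetical and not part of the Titanic data). With `drop_first=True`, pandas drops the dummy for the first category in sorted order, so that category is represented by a row of all zeros.

```python
import pandas as pd

# Hypothetical column with three categories (not part of the Titanic data).
toy = pd.DataFrame({"ticket_type": ["first", "third", "second", "first"]})

encoded = pd.get_dummies(toy, columns=["ticket_type"], drop_first=True, dtype=int)
print(encoded)

# Rows where every dummy is 0 belong to the dropped category ("first" sorts earliest).
is_dropped_category = (encoded.sum(axis=1) == 0)
print(is_dropped_category.tolist())
```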

### 3a. Preview before encoding

In [None]:
# Show first few rows with 'embarked' column
print("Before encoding:")
print(df[['embarked', 'fare', 'survived']].head())

### 3b. Convert 'embarked' to dummy variables

In [None]:
# TODO: Use pd.get_dummies() to convert 'embarked' to dummy variables
df = pd.get_dummies(________, columns=[________], drop_first=________)

print("✓ 'embarked' converted to dummy variables!")
print(f"New columns: {df.columns.tolist()}")

### 3c. Preview after encoding

In [None]:
# Show first few rows with new dummy columns
print("After encoding:")
print(df[['embarked_Q', 'embarked_S', 'fare', 'survived']].head(10))

**Q: How many dummy columns were created for 'embarked'?**

A: 

**Q: What does embarked_Q = 1 mean?**

A: 

**Q: If both embarked_Q = 0 and embarked_S = 0, where did the passenger embark?**

A: 

---
## Task 4: Understanding the Encoding Results

Let's verify our encoding makes sense.
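
One way to check an encoding, sketched here on made-up data (the `toy` DataFrame is an assumption), is to record the category counts before encoding and confirm that the dummy column gives the same totals afterwards.

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["male", "female", "male", "male"]})

# Record the category counts before encoding.
counts_before = toy["sex"].value_counts()

encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True, dtype=int)

# After encoding, the number of 1s should equal the "male" count,
# and the number of 0s should equal the "female" count.
males_after = int(encoded["sex_male"].sum())
females_after = int((encoded["sex_male"] == 0).sum())

print(males_after == counts_before["male"], females_after == counts_before["female"])
```

If the totals did not match, that would suggest rows were lost or a category was mislabeled during encoding.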

### 4a. Check value counts for dummy variables

In [None]:
# Count how many males vs females
print("Sex distribution:")
print(df['sex_male'].value_counts())
print(f"\nMales: {df['sex_male'].sum()}")
print(f"Females: {(df['sex_male'] == 0).sum()}")

In [None]:
# Count embarked locations
print("Embarked distribution:")
print(f"Embarked at Q: {df['embarked_Q'].sum()}")
print(f"Embarked at S: {df['embarked_S'].sum()}")
# Rows where both dummies are 0 are counted as C; this assumes 'embarked' has no missing values
print(f"Embarked at C: {((df['embarked_Q'] == 0) & (df['embarked_S'] == 0)).sum()}")

---
## Task 5: Final Dataset Review

Let's look at our fully prepared dataset.

In [None]:
# Display dataset info
print("Final dataset after encoding:")
df.info()  # info() prints its report directly and returns None, so no print() is needed
print(f"\nColumns: {df.columns.tolist()}")

**Q: How many columns do we have now?**

A: 

**Q: Are all columns now numerical (int64 or float64)? Dummy columns may appear as bool (True/False) in pandas 2.0 and later; these still behave as 1/0.**

A: 

---
## Task 6: Separate Features (X) and Target (y)

Now we'll split our data into features (X) and target (y).
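
The split can be sketched on a made-up table first (the `toy` DataFrame and its `label` column are hypothetical names chosen for illustration). Dropping a column returns a new DataFrame and leaves the original unchanged, so the target column is still available for `y` afterwards.

```python
import pandas as pd

# Toy data with a hypothetical target column named "label".
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.93],
    "label": [0, 1, 1],
})

X_toy = toy.drop(columns=["label"])  # same effect as toy.drop("label", axis=1)
y_toy = toy["label"]

print(X_toy.shape, y_toy.shape)
```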

### 6a. Create X (features) - all columns except 'survived'

In [None]:
# TODO: Create X by dropping the 'survived' column
# Hint: X = df.drop('survived', axis=1)
X = df.drop(________, axis=________)

print("✓ X (features) created!")
print(f"X shape: {X.shape}")
print(f"X columns: {X.columns.tolist()}")

### 6b. Create y (target) - just the 'survived' column

In [None]:
# TODO: Create y by selecting only the 'survived' column
# Hint: y = df['survived']
y = df[________]

print("✓ y (target) created!")
print(f"y shape: {y.shape}")
print(f"y type: {type(y)}")

### 6c. Verify X and y

In [None]:
# Display first few rows of X
print("First 5 rows of X (features):")
print(X.head())

print("\nFirst 10 values of y (target):")
print(y.head(10).tolist())

**Q: How many features (columns) are in X?**

A: 

**Q: Do X and y have the same number of rows?**

A: 

**Q: Why is it important that X and y have the same number of rows?**

A: 

---
## Task 7: Save Prepared Data (Optional)

Save X and y for use in the next lesson.
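
When the saved target is loaded again later, `pd.read_csv` returns a one-column DataFrame rather than a Series. The sketch below shows one possible way to get a Series back, using a made-up target and a temporary file (the file name `y_example.csv` is hypothetical).

```python
import os
import tempfile

import pandas as pd

y_toy = pd.Series([0, 1, 1, 0], name="survived")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_example.csv")  # hypothetical file name
    y_toy.to_csv(path, index=False)

    # read_csv always returns a DataFrame; squeeze("columns") turns a
    # one-column DataFrame back into a Series.
    y_loaded = pd.read_csv(path).squeeze("columns")

print(type(y_loaded).__name__, y_loaded.tolist())
```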

In [None]:
# Save X and y to CSV files
X.to_csv("Titanic_X_features.csv", index=False)
y.to_csv("Titanic_y_target.csv", index=False)

print("✓ X and y saved!")

---
## Reflection Questions

Answer these questions based on your work:

**1. Why do machine learning models need numerical data instead of text?**

Answer: 

**2. What is one-hot encoding and why is it useful?**

Answer: 

**3. Why do we use drop_first=True when creating dummy variables?**

Answer: 

**4. What is the difference between X (features) and y (target)?**

Answer: 

**5. Give an example of a real-world categorical variable that would need to be encoded.**

Answer: 

---
## Lesson Complete!

You've successfully prepared your data for machine learning!

**Summary of what you did:**
- Converted 'sex' from text to a dummy variable (sex_male)
- Converted 'embarked' from text to dummy variables (embarked_Q, embarked_S)
- All data is now numerical
- Separated features (X) from target (y)
- Data is ready for model training!

Save this notebook and push to GitHub.

**Next lesson**: Train/test split and building our decision tree model!