# ai03cTasks
# Machine Learning: Decision Trees
## Data Preparation

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to your GitHub "Machine Learning" Repo under the appropriate folder.

---
## Setup: Import Libraries and Load Cleaned Data

Run this cell first. We'll start with the cleaned data from Lesson 2.

In [19]:
import pandas as pd

# Load the cleaned data (or load and clean again)
df = pd.read_csv("Titanic_Cleaned.csv")

print("✓ Cleaned data loaded")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

✓ Cleaned data loaded
Shape: (1307, 8)
Columns: ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']


---
## Task 1: Understand Categorical vs Numerical Data

Let's identify which columns are categorical and which are numerical.

### 1a. Check data types

In [20]:
# Display data types of each column
print("Data types:")
print(df.dtypes)

Data types:
pclass        int64
survived      int64
sex          object
age         float64
sibsp         int64
parch         int64
fare        float64
embarked     object
dtype: object


### 1b. Identify categorical columns

**Q: Which columns have 'object' data type? (These are categorical)**

A: 'sex' and 'embarked'.

**Q: Which columns have 'int64' or 'float64'? (These are numerical)**

A: pclass, survived, age, sibsp, parch, and fare.
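
The same split can be made in code instead of by reading the `dtypes` output. The sketch below is a minimal example on a small hand-made DataFrame (`toy` and its values are illustrative, not rows from the Titanic file). It assumes that every non-numeric column should be treated as categorical.

```python
import pandas as pd

# Toy data for illustration only (not the real Titanic rows)
toy = pd.DataFrame({
    "pclass": [1, 3, 2],
    "sex": ["female", "male", "male"],
    "age": [29.0, 22.0, 40.5],
    "embarked": ["S", "C", "Q"],
})

# Columns with a numeric dtype (int64, float64)
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical here
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Note that a numeric dtype does not always mean a true quantity: 'pclass' is stored as int64 but represents a class label.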

### 1c. View unique values in categorical columns

In [21]:
# Check unique values in 'sex'
print("Unique values in 'sex':")        
print(df['sex'].unique())               # Shows all different (unique) values that appear in the 'sex' column
print(f"Count: {df['sex'].nunique()}")  # Prints how many unique values there are in total (like how many categories)


# Check unique values in 'embarked'
print("\nUnique values in 'embarked':")
print(df['embarked'].unique())
print(f"Count: {df['embarked'].nunique()}")

Unique values in 'sex':
['female' 'male']
Count: 2

Unique values in 'embarked':
['S' 'C' 'Q']
Count: 3


---
## Task 2: Convert 'sex' to Dummy Variables

Convert the 'sex' column from text to numbers using one-hot encoding.

### 2a. Preview the data before encoding

In [22]:
# Show first few rows with 'sex' column
print("Before encoding:")
print(df[['sex', 'age', 'survived']].head())

Before encoding:
      sex    age  survived
0  female  29.00         1
1    male   0.92         1
2  female   2.00         0
3    male  30.00         0
4  female  25.00         0


### 2b. Convert 'sex' to dummy variables

In [23]:
# TODO: Use pd.get_dummies() to convert 'sex' to dummy variables
# Hint: df = pd.get_dummies(df, columns=['sex'], drop_first=True)
df = pd.get_dummies(df, columns=['sex'], drop_first=True)

print("✓ 'sex' converted to dummy variables!")
print(f"New columns: {df.columns.tolist()}")

✓ 'sex' converted to dummy variables!
New columns: ['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'sex_male']


### 2c. Preview the data after encoding

In [24]:
# Show first few rows with new 'sex_male' column
print("After encoding:")
print(df[['sex_male', 'age', 'survived']].head(10))

After encoding:
   sex_male    age  survived
0     False  29.00         1
1      True   0.92         1
2     False   2.00         0
3      True  30.00         0
4     False  25.00         0
5      True  48.00         1
6     False  63.00         1
7      True  39.00         0
8     False  53.00         1
9      True  71.00         0


**Q: What does sex_male = 1 mean?**

A: 1 (shown as True in the output above) means the passenger is male.

**Q: What does sex_male = 0 mean?**

A: 0 (shown as False in the output above) means the passenger is female.

**Q: Why do we only have one column (sex_male) instead of two (sex_male and sex_female)?**

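A: Because 'sex' has only two categories in this dataset, a sex_female column would be the exact opposite of sex_male and would add no new information. Setting drop_first=True drops the first category ('female') and keeps sex_male.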
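
To see the redundancy directly, the sketch below encodes a small illustrative column (`toy` is made up for this example) with and without `drop_first`. It passes `dtype=int` so the dummy columns print as 0/1 rather than True/False; that option is an addition for readability and is not used in the cells above.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"sex": ["female", "male", "female", "male"]})

# Keep both categories; dtype=int gives 0/1 instead of True/False
both = pd.get_dummies(toy, columns=["sex"], drop_first=False, dtype=int)
print(both)

# Each row has exactly one of the two columns set, so they always sum to 1
row_sums = both["sex_female"] + both["sex_male"]
print(row_sums.tolist())

# With drop_first=True only sex_male remains,
# and sex_female can be recovered as 1 - sex_male
one = pd.get_dummies(toy, columns=["sex"], drop_first=True, dtype=int)
recovered_female = 1 - one["sex_male"]
print(recovered_female.tolist())
```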

---
## Task 3: Convert 'embarked' to Dummy Variables

Now convert the 'embarked' column (which has 3 categories: C, Q, S).

### 3a. Preview before encoding

In [25]:
# Show first few rows with 'embarked' column
print("Before encoding:")
print(df[['embarked', 'fare', 'survived']].head())

Before encoding:
  embarked      fare  survived
0        S  211.3375         1
1        S  151.5500         1
2        S  151.5500         0
3        S  151.5500         0
4        S  151.5500         0


### 3b. Convert 'embarked' to dummy variables

In [26]:
# TODO: Use pd.get_dummies() to convert 'embarked' to dummy variables
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)

print("✓ 'embarked' converted to dummy variables!")
print(f"New columns: {df.columns.tolist()}")

✓ 'embarked' converted to dummy variables!
New columns: ['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare', 'sex_male', 'embarked_Q', 'embarked_S']


### 3c. Preview after encoding

In [27]:
# Show first few rows with new dummy columns
print("After encoding:")
print(df[['embarked_Q', 'embarked_S', 'fare', 'survived']].head(10))

After encoding:
   embarked_Q  embarked_S      fare  survived
0       False        True  211.3375         1
1       False        True  151.5500         1
2       False        True  151.5500         0
3       False        True  151.5500         0
4       False        True  151.5500         0
5       False        True   26.5500         1
6       False        True   77.9583         1
7       False        True    0.0000         0
8       False        True   51.4792         1
9       False       False   49.5042         0


**Q: How many dummy columns were created for 'embarked'?**

A: 2 (embarked_Q and embarked_S).

**Q: What does embarked_Q = 1 mean?**

A: The passenger embarked at port Q (Queenstown).

**Q: If both embarked_Q = 0 and embarked_S = 0, where did the passenger embark?**

A: At port C (Cherbourg). It is the category removed by drop_first=True, so it is represented by both dummy columns being 0.
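
The dropped category can be recovered from the remaining dummy columns. The sketch below is a minimal illustration on made-up data (`toy` and the helper column `is_C` are hypothetical names). It uses `dtype=int` so the values print as 0/1.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# Categories are sorted (C, Q, S), so drop_first=True removes 'C'
encoded = pd.get_dummies(toy, columns=["embarked"], drop_first=True, dtype=int)

# A row belongs to the dropped category when every remaining dummy is 0
encoded["is_C"] = (
    (encoded["embarked_Q"] == 0) & (encoded["embarked_S"] == 0)
).astype(int)

print(encoded)
```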

---
## Task 4: Understanding the Encoding Results

Let's verify our encoding makes sense.

### 4a. Check value counts for dummy variables

In [28]:
# Count how many males vs females
print("Sex distribution:")
print(df['sex_male'].value_counts())
print(f"\nMales: {df['sex_male'].sum()}")
print(f"Females: {(df['sex_male'] == 0).sum()}")

Sex distribution:
sex_male
True     843
False    464
Name: count, dtype: int64

Males: 843
Females: 464


In [29]:
# Count embarked locations
print("Embarked distribution:")
print(f"Embarked at Q: {df['embarked_Q'].sum()}")
print(f"Embarked at S: {df['embarked_S'].sum()}")
print(f"Embarked at C: {((df['embarked_Q'] == 0) & (df['embarked_S'] == 0)).sum()}")

Embarked distribution:
Embarked at Q: 123
Embarked at S: 914
Embarked at C: 270
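
A quick sanity check on these numbers: if the encoding is consistent, the category counts for each variable should add up to the total number of rows. The sketch below only re-adds the counts printed above; the variable names are illustrative.

```python
# Counts copied from the outputs above
total_rows = 1307
sex_counts = {"male": 843, "female": 464}
embarked_counts = {"Q": 123, "S": 914, "C": 270}

sex_total = sum(sex_counts.values())
embarked_total = sum(embarked_counts.values())

print(f"Sex total: {sex_total}, matches rows: {sex_total == total_rows}")
print(f"Embarked total: {embarked_total}, matches rows: {embarked_total == total_rows}")
```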


---
## Task 5: Final Dataset Review

Let's look at our fully prepared dataset.

In [30]:
# Display dataset info
print("Final dataset after encoding:")
df.info()  # info() prints its report itself and returns None, so print() is not needed
print(f"\nColumns: {df.columns.tolist()}")

Final dataset after encoding:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1307 entries, 0 to 1306
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   pclass      1307 non-null   int64  
 1   survived    1307 non-null   int64  
 2   age         1307 non-null   float64
 3   sibsp       1307 non-null   int64  
 4   parch       1307 non-null   int64  
 5   fare        1307 non-null   float64
 6   sex_male    1307 non-null   bool   
 7   embarked_Q  1307 non-null   bool   
 8   embarked_S  1307 non-null   bool   
dtypes: bool(3), float64(2), int64(4)
memory usage: 65.2 KB

Columns: ['pclass', 'survived', 'age', 'sibsp', 'parch', 'fare', 'sex_male', 'embarked_Q', 'embarked_S']


**Q: How many columns do we have now?**

A: 9.

**Q: Are all columns now numerical (int64 or float64)?**

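A: No. sex_male, embarked_Q, and embarked_S are bool (True/False); the other six columns are int64 or float64. Boolean values behave as 1/0 in arithmetic, so no text columns remain.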
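
If explicit 0/1 integers are preferred over True/False, the boolean columns can be converted. The sketch below is one possible approach on made-up data (`toy` and `numeric` are illustrative names); it is not a required step in this lesson.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [29.0, 0.92, 2.0],
    "sex_male": [False, True, False],
    "embarked_S": [True, True, False],
})

# Booleans already behave like 0/1 in arithmetic: True counts as 1
print(toy["sex_male"].sum())

# Convert only the bool columns to integers, leaving other columns unchanged
bool_cols = toy.select_dtypes(include="bool").columns
numeric = toy.astype({col: int for col in bool_cols})

print(numeric)
print(numeric.dtypes)
```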

---
## Task 6: Separate Features (X) and Target (y)

Now we'll split our data into features (X) and target (y).

### 6a. Create X (features) - all columns except 'survived'

In [31]:
# TODO: Create X by dropping the 'survived' column
# Hint: X = df.drop('survived', axis=1)
X = df.drop('survived', axis=1)

print("✓ X (features) created!")
print(f"X shape: {X.shape}")
print(f"X columns: {X.columns.tolist()}")

✓ X (features) created!
X shape: (1307, 8)
X columns: ['pclass', 'age', 'sibsp', 'parch', 'fare', 'sex_male', 'embarked_Q', 'embarked_S']


### 6b. Create y (target) - just the 'survived' column

In [32]:
# TODO: Create y by selecting only the 'survived' column
# Hint: y = df['survived']
y = df['survived']

print("✓ y (target) created!")
print(f"y shape: {y.shape}")
print(f"y type: {type(y)}")

✓ y (target) created!
y shape: (1307,)
y type: <class 'pandas.core.series.Series'>


### 6c. Verify X and y

In [33]:
# Display first few rows of X
print("First 5 rows of X (features):")
print(X.head())

print("\nFirst 10 values of y (target):")
print(y.head(10).tolist())

First 5 rows of X (features):
   pclass    age  sibsp  parch      fare  sex_male  embarked_Q  embarked_S
0       1  29.00      0      0  211.3375     False       False        True
1       1   0.92      1      2  151.5500      True       False        True
2       1   2.00      1      2  151.5500     False       False        True
3       1  30.00      1      2  151.5500      True       False        True
4       1  25.00      1      2  151.5500     False       False        True

First 10 values of y (target):
[1, 1, 0, 0, 0, 1, 1, 0, 1, 0]


**Q: How many features (columns) are in X?**

A: 8.

**Q: Do X and y have the same number of rows?**

A: Yes.

**Q: Why is it important that X and y have the same number of rows?**

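A: Each row of X describes one passenger, and the matching row of y is that passenger's outcome. If the row counts differed, some feature rows would have no label (or some labels no features), and the model could not pair inputs with the correct answers.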
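
The pairing between X and y can be checked in code. The sketch below is a minimal example on made-up data (`toy`, `X_toy`, `y_toy`, and `X_short` are illustrative names). It compares both the lengths and the row indexes.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [29.0, 0.92, 2.0, 30.0],
    "sex_male": [0, 1, 0, 1],
    "survived": [1, 1, 0, 0],
})

X_toy = toy.drop("survived", axis=1)
y_toy = toy["survived"]

# Same number of rows, and the same row labels in the same order
same_length = len(X_toy) == len(y_toy)
same_index = X_toy.index.equals(y_toy.index)
print(same_length, same_index)

# Removing a row from X alone breaks the pairing
X_short = X_toy.iloc[:-1]
still_matches = len(X_short) == len(y_toy)
print(still_matches)
```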

---
## Task 7: Save Prepared Data (Optional)

Save X and y for use in the next lesson.

In [34]:
# Save X and y to CSV files
X.to_csv("Titanic_X_features.csv", index=False)
y.to_csv("Titanic_y_target.csv", index=False)

print("✓ X and y saved!")

✓ X and y saved!
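
When the files are loaded again, `pd.read_csv` returns a DataFrame even for the one-column target file, so the column has to be selected to get a Series back. The sketch below shows the round trip on made-up data written to a temporary folder. The file names and the `X_toy`/`y_toy` variables are illustrative, not the real saved files.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only
X_toy = pd.DataFrame({"age": [29.0, 0.92], "sex_male": [False, True]})
y_toy = pd.Series([1, 0], name="survived")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_features.csv")
    y_path = os.path.join(tmp, "y_target.csv")

    # Same save pattern as above: no index column in the files
    X_toy.to_csv(x_path, index=False)
    y_toy.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    # The target file has one column named 'survived'; select it to get a Series
    y_back = pd.read_csv(y_path)["survived"]

print(X_back.dtypes)
print(y_back.tolist())
```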


---
## Reflection Questions

Answer these questions based on your work:

**1. Why do machine learning models need numerical data instead of text?**

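Answer: Models learn by doing math on their inputs (for example, comparing values against thresholds), and that math is not defined for text such as 'male' or 'S'. Encoding turns each category into numbers the model can work with.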

**2. What is one-hot encoding and why is it useful?**

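Answer: One-hot encoding creates a separate 0/1 column for each category of a categorical variable. It is useful because it turns text into numbers without implying an order among the categories.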

**3. Why do we use drop_first=True when creating dummy variables?**

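Answer: Because one of the dummy columns is redundant. When all the remaining dummy columns are 0, the row must belong to the dropped category, so dropping the first column loses no information.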

**4. What is the difference between X (features) and y (target)?**

Answer: 
    Features (X): What we use to make predictions
    Target (y): What we're trying to predict

**5. Give an example of a real-world categorical variable that would need to be encoded.**

Answer: Attendance records, where each person is marked as either absent or present.

---
## Lesson Complete! 

You've successfully prepared your data for machine learning!

**Summary of what you did:**
- Converted 'sex' from text to dummy variable (sex_male)
- Converted 'embarked' from text to dummy variables (embarked_Q, embarked_S)
- All columns are now numeric or boolean (no text columns remain)
- Separated features (X) from target (y)
- Data is ready for model training!

Save this notebook and push to GitHub.

**Next lesson**: Train/test split and building our decision tree model!