# Part 2: Basic Machine Learning
**Time Allocation:** 20-25 minutes  
**Points:** 35 points total  

**Prerequisites:** Complete Part 1 and have `cleaned_customer_data.csv` ready

**Instructions:** Complete all tasks in order. Add your code in the empty cells provided.

---

## Overview
In this part, you will use the cleaned dataset from Part 1 to build and evaluate two basic machine learning models that predict customer churn. The tasks cover core ML skills expected in internships: data preparation, model training, evaluation, and interpretation.

---
## Task 2.1: Data Preparation for ML (8 points)

### Instructions:
1. **Load your cleaned data**
   - Import the cleaned dataset from Part 1
   - Verify it has no missing values

2. **Prepare features and target**
   - Separate features (X) from target variable (y)
   - Target: `churn` column (what we want to predict)
   - Features: All other columns except `customer_id` and `churn`

3. **Handle categorical data**
   - Convert categorical features to numbers using Label Encoding
   - For example: Female=0, Male=1 for gender (`LabelEncoder` assigns codes in sorted label order)
   - Apply to: `gender`, `contract_type`, `internet_service`

4. **Train-test split**
   - Split data into 80% training, 20% testing
   - Use `random_state=42` for reproducible results
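
The four steps above can be sketched on a tiny invented table. This is only an illustration under stated assumptions: the rows and the `monthly_charges` column are made up, and only `gender` is encoded here, so the real `cleaned_customer_data.csv` will need the same loop applied to `contract_type` and `internet_service` as well.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy rows invented for illustration; they are NOT the assignment data.
toy = pd.DataFrame({
    "customer_id": list(range(1, 11)),
    "gender": ["Male", "Female"] * 5,
    "monthly_charges": [29.9, 56.5, 70.7, 42.3, 99.7, 89.1, 20.2, 104.8, 55.0, 64.8],
    "churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Step 1: confirm there are no missing values.
print(toy.isnull().sum().sum())

# Step 2: separate features (X) from the target (y).
X = toy.drop(columns=["customer_id", "churn"])
y = toy["churn"]

# Step 3: label-encode each categorical feature, keeping the fitted
# encoder so the code-to-label mapping can be looked up later.
encoders = {}
for col in ["gender"]:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

gender_codes = {str(label): code
                for code, label in enumerate(encoders["gender"].classes_)}
print(gender_codes)

# Encode the target the same way (labels are sorted, so No -> 0, Yes -> 1).
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(y)

# Step 4: 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Keeping the encoders in a dictionary is one way to "keep track" of the mapping; printing `classes_` for each column would serve the same purpose.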

In [None]:
# Import necessary libraries
# Your code here


In [None]:
# Load cleaned dataset from Part 1
# Your code here


In [None]:
# Verify no missing values
# Your code here


In [None]:
# Separate features (X) and target (y)
# X = all columns except customer_id and churn
# y = churn column
# Your code here


In [None]:
# Apply Label Encoding to categorical columns: gender, contract_type, internet_service
# Note: Keep track of feature names for later interpretation
# Your code here

In [None]:
# Encode target variable (churn: Yes=1, No=0)
# Your code here


In [None]:
# Train-test split (80% train, 20% test, random_state=42)
# Your code here


---
## Task 2.2: Build Basic ML Models (12 points)

### Instructions:
Build **2 different models** using scikit-learn:

1. **Logistic Regression Model**
   - Import `LogisticRegression` from `sklearn.linear_model`
   - Create and train the model on training data
   - Make predictions on test data

2. **Decision Tree Model**
   - Import `DecisionTreeClassifier` from `sklearn.tree`
   - Create and train the model on training data
   - Make predictions on test data

### Code Pattern to Follow:
```python
# For each model:
# 1. Import the model
# 2. Create model instance: model = ModelName()
# 3. Train: model.fit(X_train, y_train)
# 4. Predict: predictions = model.predict(X_test)
```
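
As a rough illustration of this pattern, the sketch below runs both model types on synthetic data from `make_classification` rather than on the churn dataset. The `_demo` variable names, `max_iter=1000`, and the tree's `random_state=42` are choices made for this example and are not requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 5 numeric features, binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2 of the pattern: create one instance per model.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Steps 3 and 4: train on the training split, predict on the test split.
predictions = {}
for name, model in models.items():
    model.fit(X_train_demo, y_train_demo)
    predictions[name] = model.predict(X_test_demo)

for name, preds in predictions.items():
    print(name, preds[:10])
```

Both estimators expose the same `fit` / `predict` interface, which is why a single loop can handle them.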

In [None]:
# Import ML models
# Your code here


In [None]:
# Create and train Logistic Regression model
# Your code here


In [None]:
# Make predictions with Logistic Regression
# Your code here


In [None]:
# Create and train Decision Tree model
# Your code here


In [None]:
# Make predictions with Decision Tree
# Your code here


---
## Task 2.3: Model Evaluation & Understanding (10 points)

### Instructions:
1. **Calculate Basic Metrics**
   - Calculate accuracy for both models
   - Create confusion matrix for both models
   - Use `accuracy_score` and `confusion_matrix` from sklearn.metrics

2. **Compare Models**
   - Which model performed better?
   - What is the accuracy difference?
   - Display results in a clear format

3. **Understanding Confusion Matrix**
   - Explain what each number in the confusion matrix means
   - Calculate by hand: how many customers who actually churned did the model correctly predict as churn (true positives)?
   - How many actual churners did it miss (false negatives)?
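
To make the layout of the matrix concrete, here is a small worked example on ten hand-written labels (invented for illustration, with 1 = churn and 0 = no churn). For binary labels, scikit-learn puts actual classes on the rows and predicted classes on the columns, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Ten invented customers: 1 = churned, 0 = stayed.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[5 1]
#  [1 3]]

# Unpack the four cells in row-major order.
tn, fp, fn, tp = cm.ravel()
print("Correctly predicted churn (TP):", tp)  # 3
print("Missed churners (FN):", fn)            # 1
print("False alarms (FP):", fp)               # 1
print("Correctly predicted stay (TN):", tn)   # 5

# Accuracy is the share of correct predictions: (TP + TN) / total.
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.8
```

The same arithmetic applied to each model's matrix gives the "by hand" numbers the task asks for.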

In [None]:
# Import evaluation metrics
# Your code here


In [None]:
# Calculate accuracy for Logistic Regression
# Your code here


In [None]:
# Calculate accuracy for Decision Tree
# Your code here


In [None]:
# Create confusion matrix for Logistic Regression
# Your code here


In [None]:
# Create confusion matrix for Decision Tree
# Your code here


In [None]:
# Compare model performance - display results clearly
# Your code here


### Interpretation Section
**Instructions:** Write your answers in the markdown cell below

**Your Model Comparison & Confusion Matrix Interpretation:**

1. Which model performed better and by how much?
   - Your answer here...

2. Explain what each number in the confusion matrix means:
   - Your explanation here...

3. How many customers did your best model correctly predict would churn? How many did it miss?
   - Your analysis here...

---
## Task 2.4: Basic Feature Importance (5 points)

### Instructions:
1. **Extract Feature Importance** (Decision Tree only)
   - Use `.feature_importances_` attribute of the trained decision tree
   - Display which features are most important for predicting churn

2. **Simple Interpretation**
   - Write 2-3 sentences explaining what the most important features mean
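   - For example: "Monthly charges is the most important feature, meaning the tree relies on it most to separate churners from non-churners." Note that importance scores show how much a feature is used, not the direction of its effect, so check the data before claiming that higher bills increase churn.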

3. **Business Insight**
   - Based on the important features, suggest ONE simple action the company could take to reduce churn
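
One possible way to pair importance scores with column names is a pandas `Series` indexed by the feature columns, as sketched below. The data is synthetic and the four feature names are hypothetical placeholders, so the ranking printed here says nothing about the real churn dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names attached to synthetic numeric data.
feature_names = ["tenure", "monthly_charges", "contract_type", "gender"]
X_arr, y_arr = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_arr)

# feature_importances_ follows the column order of the training data,
# so indexing by X_demo.columns lines each score up with its name.
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The scores are non-negative and sum to 1, so each value can be read as that feature's share of the tree's total impurity reduction.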

In [None]:
# Extract feature importances from Decision Tree
# Your code here


In [None]:
# Display feature names with their importance scores
# Hint: You'll need the column names from your features (X) to match with importance values
# Your code here

### Feature Importance Interpretation
**Instructions:** Write your analysis in the markdown cell below

**Your Feature Importance Analysis:**

1. What are the most important features for predicting churn?
   - Your answer here...

2. What do these important features tell us about why customers leave?
   - Your explanation here... (2-3 sentences)

3. Business Recommendation: Based on these results, what is ONE action the telecom company should take to reduce churn?
   - Your recommendation here...

---
## Submission Checklist for Part 2

Before submitting, verify you have completed:

- [ ] Loaded cleaned data and verified no missing values
- [ ] Separated features and target correctly
- [ ] Applied label encoding to categorical variables
- [ ] Created train-test split with random_state=42
- [ ] Built and trained Logistic Regression model
- [ ] Built and trained Decision Tree model
- [ ] Calculated accuracy for both models
- [ ] Created confusion matrices for both models
- [ ] Compared model performance with written interpretation
- [ ] Extracted and displayed feature importances
- [ ] Provided business insights and recommendations
- [ ] All code cells run without errors

**Time Check:** Part 2 should take approximately 20-25 minutes  
**Total Time:** Both parts should be completed in 40-50 minutes

---
## Helpful Libraries Reference

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
```