## Objective
In this lab, you will analyze a customer dataset to identify key factors influencing customer churn, create visualizations to explore the data, and build a predictive model using machine learning. The goal is to extract actionable insights and present your findings in a comprehensive report.
## Scenario
You are a data analyst at a fast-growing subscription-based service company. The company is concerned about customer churn—customers canceling their subscriptions—and has tasked you with analyzing customer data. Your objectives are to identify key factors that influence churn and build a predictive model to identify customers at risk of leaving.
## Materials Provided
- A dataset (`customer_churn.csv`), which you will load into a pandas DataFrame named `df` in Step 1.1.
- Python environment with essential libraries such as pandas, Scikit-Learn, and Matplotlib pre-installed.

## High-Level Tasks
1. **Load and Explore the Data**
2. **Data Cleaning and Preprocessing**
3. **Exploratory Data Analysis (EDA) and Visualization**
4. **Machine Learning Model Building and Evaluation**
5. **Presenting Findings in a Comprehensive Report**

## Lab Instructions
### 1. Load and Explore the Data (5 minutes)
#### Step 1.1: Import the Required Python Library and Load the Dataset
Code is provided.

In [None]:
import pandas as pd 
df = pd.read_csv("customer_churn.csv")

# Display the first 5 rows of the DataFrame
df.head()

#### Step 1.2: Examine Column Names and Data Types
Inspect the column names and data types using `df.info()`. (code provided)

In [None]:
# Display column names and data types
df.info()

#### Step 1.3: Get Summary Statistics
Get summary statistics of numerical columns using `df.describe()`. (code provided)

In [None]:
# Get summary statistics of numerical columns
df.describe()

#### Step 1.4: Remove CSV Index Column
The index from the CSV turned into a column and should be dropped. Use `df.drop` to get rid of the `Unnamed: 0` column. Then use `df.describe()` again to confirm the column is removed.
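
As a rough illustration of where such a column comes from, the sketch below reads a tiny in-memory CSV whose first header cell is empty, then drops the resulting column. The data, the `toy` variable, and the `plan` and `minutes` column names are made up for this example and are not part of the lab dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV whose first column is a saved index with no header.
csv_text = ",plan,minutes\n0,basic,120\n1,premium,300\n2,basic,95\n"

toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns.tolist())  # ['Unnamed: 0', 'plan', 'minutes']

# drop() returns a new DataFrame by default, so assign the result back.
toy = toy.drop(columns=["Unnamed: 0"])
print(toy.shape)  # (3, 2)
```

Because `drop` does not modify the DataFrame in place by default, forgetting to store the result is a common reason the column appears to "come back".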

In [None]:
# Drop the "Unnamed: 0" column
### YOUR CODE HERE ###

# Use df.describe() to confirm the column was removed (code provided)
df.describe()

# Expected shape of DataFrame is (3333,11) after dropping column. 
# Ensure the results are stored in the df variable
print(f"Shape: {df.shape}. Expected is (3333, 11)")

#### Check Your Results:

In [None]:
# Checking DataFrame (df) shape
print(f"Shape: {df.shape}.")

#### Step 1.5: Identify Potential Features and Target Variable, and Encode ContractRenewal
Select every column except the target as your features (e.g., `"AccountWeeks"`, `"DataPlan"`, `"Data Usage"`, etc.) and set the target variable (`'churn'`).

The code for one-hot encoding the `ContractRenewal` column is provided. This column currently holds text values (`"Yes"` or `"No"`). `pd.get_dummies()` replaces it with two new numeric columns, `ContractRenewal_No` and `ContractRenewal_Yes`. A `"Yes"` becomes a 1 in the `ContractRenewal_Yes` column and a 0 in the `ContractRenewal_No` column, and vice versa. Because most machine learning models and charts expect numbers rather than text, this makes the information usable in both.
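
To see the encoding on something small before running it on the real data, here is a minimal sketch using a made-up three-row DataFrame. The `toy` and `encoded` variables and the `weeks` and `renewal` column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: one numeric column and one Yes/No text column.
toy = pd.DataFrame({"weeks": [10, 52, 31], "renewal": ["Yes", "No", "Yes"]})

# The text column is replaced by one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, columns=["renewal"], dtype=int)
print(encoded.columns.tolist())  # ['weeks', 'renewal_No', 'renewal_Yes']
print(encoded)
```

The new columns are appended after the untouched ones, which is why the encoded columns show up last in the output.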

In [None]:
# Select all features and set target variable
### YOUR CODE HERE ###
features = # YOUR CODE HERE
target_variable = # YOUR CODE HERE


# One-hot encoding for 'ContractRenewal' feature (provided; do not change)
features = pd.get_dummies(features,columns=['ContractRenewal'],dtype=int)
# See results with one-hot encoding (Notice last 2 columns)
features.head()

# Expected shape of features DataFrame is (3333,11) after one-hot encoding. 
print(f"features shape: {features.shape}. Expected is (3333, 11)")
# Expected shape of target_variable Series is (3333,).
print(f"target_variable shape: {target_variable.shape}. Expected is (3333,)")

#### Check Your Results:

In [None]:
# Checking DataFrame (features and target_variable) shapes

### 2. Data Cleaning and Preprocessing (5 minutes)
#### Step 2.1: Split the Data
Split the data into training and testing sets (70% train, 30% test) using `train_test_split` from Scikit-Learn. 

Make sure to set the `random_state` parameter to 42 to ensure reproducibility and obtain the same results as the expected solution.
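
The effect of `random_state` can be checked on a small synthetic array. The sketch below is only an illustration: `X_toy`, `y_toy`, and the `a_`/`b_` variable names are made up, and it deliberately avoids the lab's own variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two separate calls with the same random_state.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(a_train.shape, a_test.shape)     # (7, 2) (3, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same rows
```

Without a fixed `random_state`, each call shuffles differently, so metrics computed later would vary from run to run.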

In [None]:
from sklearn.model_selection import train_test_split
# Assume "x" is features and "y" is target_variable
x = features 
y = target_variable

# Split the data
x_train, x_test, y_train, y_test = # YOUR CODE HERE

print(x_train.shape) # Expected (2333,11)
print(x_test.shape) # Expected (1000,11)
print(y_train.shape) # Expected (2333,)
print(y_test.shape) # Expected (1000,)


#### Check Your Results:

In [None]:
# Checking train/test split shapes

### 3. Exploratory Data Analysis (EDA) and Visualization (20 minutes)
#### Step 3.1: Summary Statistics for Relevant Features
Calculate and print summary statistics for relevant features (average tenure for churned vs. non-churned customers).
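
One possible pattern for comparing a numeric column across two groups is boolean filtering followed by `.mean()`. The sketch below uses a made-up five-row DataFrame; the `toy` variable and the `left` and `weeks` column names are hypothetical and do not come from the lab dataset.

```python
import pandas as pd

# Hypothetical data: 1 means the customer left, 0 means they stayed.
toy = pd.DataFrame({"left": [1, 0, 0, 1, 0], "weeks": [20, 80, 100, 40, 60]})

# Boolean masks select the rows belonging to each group.
left_rows = toy[toy["left"] == 1]
stayed_rows = toy[toy["left"] == 0]

print(left_rows["weeks"].mean())    # 30.0
print(stayed_rows["weeks"].mean())  # 80.0

# groupby produces the same two averages in a single call.
print(toy.groupby("left")["weeks"].mean())
```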

In [None]:
# Summary statistics for churned vs. non-churned customers
churned = # YOUR CODE HERE
non_churned = # YOUR CODE HERE

# Print average tenure
### YOUR CODE HERE ###

#### Step 3.2: Create Visualizations
Create visualizations (bar chart, histogram, and box plot) to explore the relationships between features and the target variable (`'churn'`). The titles, labels, and commands to show the plots have been provided; you will just need to set up the plots in each cell below.
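
A stacked bar chart needs a table of counts with one row per category and one column per outcome. As a sketch of one way to build such a table, the example below uses `pd.crosstab` on made-up data; the `toy` and `counts` variables and the `renewed` and `left` column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: whether a contract was renewed and whether the customer left.
toy = pd.DataFrame({
    "renewed": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "left": [0, 0, 1, 0, 1, 1],
})

# Rows are the "renewed" categories, columns are the "left" outcomes,
# and each cell counts how many rows fall into that combination.
counts = pd.crosstab(toy["renewed"], toy["left"])
print(counts)
```

A table in this shape can be handed to `.plot(kind='bar', stacked=True)`, which draws one bar per row and one stacked segment per column.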

In [None]:
import matplotlib.pyplot as plt

# Bar chart for contract renewal vs churn
churn_counts = # YOUR CODE HERE

# Chart options provided
churn_counts.plot(kind='bar', stacked=True)
plt.title('Contract Renewal vs. Churn')
plt.xlabel('Contract Renewal')
plt.ylabel('Count')
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Histogram for tenure distribution
plt.hist(# YOUR CODE HERE)

# Chart options provided
plt.title('Tenure Distribution by Churn Status')
plt.xlabel('Account Weeks')
plt.ylabel('Frequency')
plt.legend()
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Box plot for monthly charges
df.boxplot(# YOUR CODE HERE)

# Chart options provided
plt.title('Monthly Charges vs. Churn')
plt.xlabel('Churn')
plt.ylabel('Monthly Charge')
plt.suptitle('')  # Remove the default suptitle
plt.show()

#### Step 3.3: Interpret Visualizations
Interpret the visualizations and identify key insights about factors influencing churn. Enter your observations in the cell below. These will not be graded, but they may be useful if you want to add this lab to your portfolio.

Enter your observations about the visualizations here:

- Observation 1: 
- Observation 2: 
- Observation 3: 

### 4. Machine Learning Model Building and Evaluation (20 minutes)
#### Step 4.1: Choose a Classification Algorithm and Train the Model
Import a suitable classification algorithm (`LogisticRegression` in this case) and create an instance of it (provided). 

Setting `max_iter=1000` caps the solver at 1000 iterations while it searches for the model's coefficients. Scikit-Learn's default is 100, which can be too few for the solver to converge on data that has not been scaled. A limit of 1000 usually leaves enough room to converge without excessive training time, though the ideal number depends on the specific dataset.
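
To see that `max_iter` is an upper bound rather than a fixed amount of work, the sketch below fits a model on synthetic data and inspects the fitted `n_iter_` attribute, which reports how many iterations the solver actually used. The data comes from `make_classification`, and the `X_toy`, `y_toy`, and `clf` names are made up so they do not overwrite the lab's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 200 samples, 5 features, 2 classes.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ holds the number of iterations the solver actually ran;
# it stops early once it converges, so this is at most max_iter.
print(clf.n_iter_)
```

If the solver reaches the limit without converging, Scikit-Learn emits a `ConvergenceWarning`, which is usually the signal to raise `max_iter` or scale the features.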

In [None]:
from sklearn.linear_model import LogisticRegression

# Create an instance of the Logistic Regression model (provided)
model = LogisticRegression(max_iter=1000)

# Train the model
### YOUR CODE HERE ###

#### Step 4.2: Make Predictions
Use the trained model to make predictions on the testing data.

In [None]:
# Make predictions on the test set
y_pred = # YOUR CODE HERE

#### Step 4.3: Evaluate the Model
Evaluate the model's performance using appropriate metrics (`accuracy`, `precision`, `recall`, `f1`).

**Note:** For grading purposes, calculate and store each of these metrics in the following variables:
- `accuracy`
- `precision`
- `recall`
- `f1`
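
To make the four metrics concrete, here is a small worked example on hand-made labels, where 1 is the positive class. The `y_true_toy` and `y_pred_toy` lists and the `_toy` variable names are made up for illustration and are kept separate from the graded variables above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
# Confusion counts: TP = 2, FN = 2, FP = 1, TN = 5

acc_toy = round(accuracy_score(y_true_toy, y_pred_toy), 3)    # (TP + TN) / total = 7/10
prec_toy = round(precision_score(y_true_toy, y_pred_toy), 3)  # TP / (TP + FP) = 2/3
rec_toy = round(recall_score(y_true_toy, y_pred_toy), 3)      # TP / (TP + FN) = 2/4
f1_toy = round(f1_score(y_true_toy, y_pred_toy), 3)           # 2PR / (P + R) = 4/7

print(acc_toy, prec_toy, rec_toy, f1_toy)  # 0.7 0.667 0.5 0.571
```

Note that accuracy stays fairly high here even though half of the actual positives are missed, which is why recall is worth reporting alongside accuracy when the positive class is the minority.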

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluate the model
# Round all values to 3 decimal places

accuracy = # YOUR CODE HERE
precision = # YOUR CODE HERE
recall = # YOUR CODE HERE
f1 = # YOUR CODE HERE

print(f"Accuracy: {accuracy}") # Expected: approximately 0.867
print(f"Precision: {precision}") # Expected: approximately 0.604
print(f"Recall: {recall}") # Expected: approximately 0.203
print(f"F1 Score: {f1}") # Expected: approximately 0.304

#### Check Your Results:

In [None]:
# Checking accuracy
print(f"Accuracy: {accuracy}")

In [None]:
# Checking precision
print(f"Precision: {precision}")

In [None]:
# Checking recall
print(f"Recall: {recall}")

In [None]:
# Checking f1
print(f"F1 Score: {f1}")

### 5. Presenting Findings in a Comprehensive Report
#### Step 5.1: Compile the Results
Compile your analysis, visualizations, and model evaluation results into a comprehensive report. Fill in each section of the template as directed below. This will not be graded, but it may be useful if you want to add this lab to your portfolio.
- `Introduction:` Write a sentence or two describing the purpose of this analysis.
- `Data Exploration:` Write a sentence or two highlighting the key factors in customer churn.
- `Model Building and Evaluation:` Write a sentence or two describing how your model was trained, and the accuracy, precision, and recall rates.
- `Key Insights:` Add two or three bullet points summarizing your findings.
- `Recommendations:` Add two or three bullet points with the recommendations you would make based on this analysis.

# Customer Churn Analysis Report

## Introduction
- 


## Data Exploration
- 


## Model Building and Evaluation
- 


## Key Insights
- 
- 
- 


## Recommendations
- 
- 

## Hints & Tips
- Use the "pandas cheat sheet" for quick syntax reference on DataFrame operations.
- Check the "Scikit-Learn documentation" for examples and explanations of classification models.
- Use Matplotlib to create informative visualizations. Refer to the materials in Course 2 for examples.

Good luck with your customer churn analysis!