# Customer Churn Prediction
## Overview
You have been hired by a telecommunications company to develop a machine-learning model that predicts customer churn. The company wants to identify customers who are likely to cancel their service so they can take proactive steps to retain them. The model you develop will be integrated into the company’s customer relationship management (CRM) system and used by the marketing team to target at-risk customers with retention offers.

## Project requirements
1. Predictive accuracy: The model must accurately predict whether a customer is likely to churn based on historical data.

2. Scalability: The model should be able to handle a large volume of data, as the company has millions of customers.

3. Integration: The model needs to be easily integrated into the company’s existing CRM system, which is built on a Python-based backend.

4. Efficiency: The model should be optimized for real-time or near-real-time predictions to allow timely interventions by the marketing team.

## Load the provided dataset
Load the customer churn dataset
Download the file from the link below and import it into your notebook. Make sure the file name matches the one used in the loading code and that the file is a CSV. The dataset is deliberately small and simple so that the workflow is easy to follow. This topic is explored in greater depth later in the “Advanced AI and Machine Learning Techniques and Capstone” course.

https://d3c33hcgiwev3.cloudfront.net/ZXZr6KfMQ1-zgKttC_LdOA_15745db8af41431c9159e7a07eb292a1_Coursera-AIML_0754---Dataset.csv?Expires=1765024874&Signature=SJBNzTxS-1vbpxBWPhyEOpfBL1Rxx1V1DEi~A4gnfZJ7BIpVJMenaG5eOtppQLVDzOjqD6tce-Urom5VgAKwtmPg-TH~VAiIGrrUBqsWEYyxCv8RQsX~mkEKxVDUKm0lk~1JZHq5PM5w1Wl2mikfMNQo1oKBUoHqIN~Zka-a-GY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A

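A dataset containing historical customer data, including features such as customer tenure, service usage, contract type, and customer satisfaction scores, will be provided. The sample file loaded below has seven columns: CustomerID, Tenure, MonthlyCharges, TotalCharges, Contract, PaymentMethod, and Churn.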
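Before loading the file, it can help to confirm that it has the expected name, type, and columns. The following is a minimal sketch, not part of the original notebook: the helpers `looks_like_csv` and `missing_columns` are hypothetical, and the in-memory CSV is a stand-in built from the five rows shown in the notebook's `data.head()` output.

```python
import io
from pathlib import Path

import pandas as pd

# Column names taken from the notebook's data.head() output.
EXPECTED_COLUMNS = [
    "CustomerID", "Tenure", "MonthlyCharges", "TotalCharges",
    "Contract", "PaymentMethod", "Churn",
]


def looks_like_csv(filename):
    """Hypothetical helper: True if the file name ends in .csv (case-insensitive)."""
    return Path(filename).suffix.lower() == ".csv"


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Hypothetical helper: list the expected columns that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Stand-in for the downloaded file (assumption: same layout as the real CSV).
sample_csv = io.StringIO(
    "CustomerID,Tenure,MonthlyCharges,TotalCharges,Contract,PaymentMethod,Churn\n"
    "1001,5,70.0,350.0,Month-to-month,Electronic check,1\n"
    "1002,10,85.5,850.5,Two year,Mailed check,0\n"
    "1003,3,55.3,165.9,One year,Electronic check,1\n"
    "1004,8,90.0,720.0,Month-to-month,Credit card,0\n"
    "1005,2,65.2,130.4,One year,Electronic check,1\n"
)

name_ok = looks_like_csv("coursera-customer-churn-dataset.csv")
sample = pd.read_csv(sample_csv)
absent = missing_columns(sample)

print(f"File name looks like a CSV: {name_ok}")
print(f"Missing columns: {absent}")
```

With the real file, `sample_csv` would be replaced by the path to the downloaded CSV.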

In [9]:
# Import necessary libraries
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


In [None]:
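# Load the dataset (example file name; use the name of your downloaded file)
data = pd.read_csv('coursera-customer-churn-dataset.csv')

# Explore the dataset
print(data.head())
data.info()  # info() prints its own summary and returns None, so no print() is needed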

   CustomerID  Tenure  MonthlyCharges  TotalCharges        Contract  \
0        1001       5            70.0         350.0  Month-to-month   
1        1002      10            85.5         850.5        Two year   
2        1003       3            55.3         165.9        One year   
3        1004       8            90.0         720.0  Month-to-month   
4        1005       2            65.2         130.4        One year   

      PaymentMethod  Churn  
0  Electronic check      1  
1      Mailed check      0  
2  Electronic check      1  
3       Credit card      0  
4  Electronic check      1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   CustomerID      5 non-null      int64  
 1   Tenure          5 non-null      int64  
 2   MonthlyCharges  5 non-null      float64
 3   TotalCharges    5 non-null      float64
 4   Contract        5 non-nu

## Preprocess the data
Handle missing values and simplify the dataset
Identify and handle any missing values in the dataset, for example by filling them, dropping the affected rows or columns, or applying an imputation technique. Also remove columns that carry no predictive information, such as CustomerID, which is only an identifier. With a dataset this small, dropping such columns matters, because the model can otherwise pick up spurious patterns in the identifier values instead of learning from the meaningful features.
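As an alternative to dropping rows, missing values can be filled in. The sketch below is illustrative only: `impute_simple` is a hypothetical helper, and the toy frame reuses values from the notebook's sample rows with two entries blanked out on purpose. It fills numeric columns with the median and other columns with the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy data: notebook sample values with two entries removed for illustration.
toy = pd.DataFrame({
    "Tenure": [5, 10, 3, 8, 2],
    "MonthlyCharges": [70.0, np.nan, 55.3, 90.0, 65.2],
    "Contract": ["Month-to-month", "Two year", None, "Month-to-month", "One year"],
    "Churn": [1, 0, 1, 0, 1],
})


def impute_simple(df):
    """Hypothetical helper: median for numeric columns, mode for the rest."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; take the first most frequent value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


imputed = impute_simple(toy)
print(imputed)
```

In a train/test workflow, the fill values would normally be computed on the training split only and then reused on the test split, so that no information from the test data leaks into training.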



In [2]:
data = data.drop(columns=['CustomerID'])  # Drop the identifier column; it has no predictive value
data = data.dropna()  # Simple approach: drop any rows that contain missing values
print(data.head())

   Tenure  MonthlyCharges  TotalCharges        Contract     PaymentMethod  \
0       5            70.0         350.0  Month-to-month  Electronic check   
1      10            85.5         850.5        Two year      Mailed check   
2       3            55.3         165.9        One year  Electronic check   
3       8            90.0         720.0  Month-to-month       Credit card   
4       2            65.2         130.4        One year  Electronic check   

   Churn  
0      1  
1      0  
2      1  
3      0  
4      1  


In [10]:
# Encode categorical variables
# One-hot encode the categorical columns (Contract, PaymentMethod).
# drop_first=True drops one level per column to avoid redundant indicator columns.

data = pd.get_dummies(data, drop_first=True)

# Split the dataset
X = data.drop('Churn', axis=1)
y = data['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a machine learning model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Test accuracy: {accuracy}')

# Simplify the model by limiting the depth of each tree (max_depth=10).
# max_features='sqrt' is the classifier's default in current scikit-learn, stated here explicitly.
pruned_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10, max_features='sqrt')

pruned_model.fit(X_train, y_train)
pruned_predictions = pruned_model.predict(X_test)
pruned_accuracy = accuracy_score(y_test, pruned_predictions)
print(f'Pruned Test accuracy: {pruned_accuracy}')

# Save the first (depth-unlimited) model for future use
joblib.dump(model, 'churn_model.pkl')

Test accuracy: 1.0
Pruned Test accuracy: 1.0


['churn_model.pkl']
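Once the model is saved, a backend such as the CRM can load it and score new customers. The sketch below is one possible way to do this, under stated assumptions: it retrains on the five sample rows so that it is self-contained, it stores the training column list next to the model (the notebook saves only the model), and the `new_customer` values are made up for illustration. The key step is reindexing the encoded input to the training columns, because one-hot encoding a single row produces only the indicator columns for the categories present in that row.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The notebook's five sample rows, without CustomerID.
train = pd.DataFrame({
    "Tenure": [5, 10, 3, 8, 2],
    "MonthlyCharges": [70.0, 85.5, 55.3, 90.0, 65.2],
    "TotalCharges": [350.0, 850.5, 165.9, 720.0, 130.4],
    "Contract": ["Month-to-month", "Two year", "One year",
                 "Month-to-month", "One year"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check",
                      "Credit card", "Electronic check"],
    "Churn": [1, 0, 1, 0, 1],
})

X = pd.get_dummies(train.drop(columns=["Churn"]), drop_first=True)
y = train["Churn"]
feature_columns = list(X.columns)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Assumption: persist the column list together with the model.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_model.pkl")
    joblib.dump({"model": model, "columns": feature_columns}, path)
    bundle = joblib.load(path)

# Hypothetical new customer (values invented for this example).
new_customer = pd.DataFrame([{
    "Tenure": 4,
    "MonthlyCharges": 60.0,
    "TotalCharges": 240.0,
    "Contract": "One year",
    "PaymentMethod": "Electronic check",
}])

# Encode without drop_first, then align to the training columns:
# absent indicator columns are filled with 0, unknown ones are dropped.
encoded = pd.get_dummies(new_customer)
encoded = encoded.reindex(columns=bundle["columns"], fill_value=0)

# Column 1 of predict_proba corresponds to class 1 (churn).
churn_probability = bundle["model"].predict_proba(encoded)[:, 1]
print(f"Estimated churn probability: {churn_probability[0]:.2f}")
```

Using `predict_proba` rather than `predict` gives the marketing team a score to rank customers by, instead of only a yes/no label.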