# Reducing Churn

After extensive discussion of the churn problem at TelCo, you have finally defined an adequate target variable for churn and gathered relevant data to predict it. Moreover, the marketing department has come up with an amazing retention offer: any customer who receives it is guaranteed to extend their contract for an extra year. Unfortunately, the offer is expensive: each offer costs $200.

You've been authorized to give the retention offer to up to 25% of the customers whose contract is expiring. It is your job to use data from previous contract non-renewals to build a predictive model and make a recommendation of whom to target with the offers. The historical data includes:

- Gender: Whether the customer is male or female
- SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
- Partner: Whether the customer has a partner or not (Yes, No)
- Dependents: Whether the customer has dependents or not (Yes, No)
- Tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has a phone service or not (Yes, No)
- MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
- InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
- TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
- Contract: The contract term of the customer (Month-to-month, One year, Two year)
- PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
- PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: The amount charged to the customer monthly
- Churn: Whether the customer churned or not shortly after contract expiration (Yes or No)

To open this notebook in Colab, click 
<a href="https://colab.research.google.com/github/powenfang/Data-Science-for-Business-2021Fall-Elkan/blob/main/Homeworks/HW3-Fall2021.ipynb" target="_parent"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /> </a>
Then save the notebook to your personal Google Drive and run the following cell.

In [None]:
!git clone https://github.com/powenfang/Data-Science-for-Business-2021Fall-Elkan
%cd Data-Science-for-Business-2021Fall-Elkan/Homeworks/

Cloning into 'Data-Science-for-Business-2021Fall-Elkan'...
remote: Enumerating objects: 61, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 61 (delta 13), reused 41 (delta 4), pack-reused 0
Unpacking objects: 100% (61/61), done.
/Users/celkan/Data-Science-for-Business-2021Fall-Elkan/Homeworks/Data-Science-for-Business-2021Fall-Elkan/Homeworks


Load the churn data. The code below will also transform your categorical variables into dummy variables. No points for this. This is just meant to help you get started.

In [None]:
import numpy as np
import pandas as pd

# If necessary, change the path below so that it points to your file.
data_path = "./data/data-hw3.csv" 

df = pd.read_csv(data_path)
df = pd.get_dummies(df, drop_first=True)

__1. Split the data into 80% training data and 20% test data (more precisely called evaluation data).__

In [None]:
from sklearn.model_selection import train_test_split
# Your code goes here
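
As a rough illustration of the mechanics, here is a minimal sketch of an 80/20 split with `train_test_split`. It runs on a small synthetic table (the `toy` frame is an assumption for illustration, not the homework data), and the choices of `stratify` and `random_state=42` are optional suggestions rather than requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded churn table (NOT the homework data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Tenure": rng.integers(1, 72, size=200),
    "MonthlyCharges": rng.uniform(20, 120, size=200).round(2),
    "Churn_Yes": rng.random(200) < 0.27,
})

X = toy.drop(columns="Churn_Yes")
y = toy["Churn_Yes"]

# 80/20 split. stratify=y keeps the churn rate similar in both parts,
# and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real data, `X` and `y` would be `df[predictors]` and `df[target]` once those names are defined.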


__2. Train the best model that you can for each of the following three model types:__
- __A decision tree classifier. Try different values for the parameter min_samples_leaf.__
- __Logistic regression. Try different values for the regularization parameter _C_.__
- __A nearest-neighbor model. Try different numbers _k_ of neighbors.__

__Find the value of the complexity parameter that optimizes each model for F1 score measured using cross-validation with 5 folds. 
Use only the training data; do not use the test data now.__

__If a model type has a "predict_proba" method, then use that and also find a threshold for these predictions that optimizes F1 score.
Note that the best threshold may be different for different values of the complexity parameter.__

__For each model type, 
report the best complexity parameter value (and threshold if applicable) and the corresponding cross-validated F1 score. 
Your code should show the process you went through to compare different parameter values for each model type.
Pick one model to move forward with. Why did you select this one?__

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

target = "Churn_Yes"
predictors = df.columns[df.columns != target]
example_model = DecisionTreeClassifier()
# Remember to use only the training data here

In [None]:
# Your code goes here
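
One possible way to combine the parameter search with the threshold search is sketched below for logistic regression. It uses synthetic data from `make_classification` (an assumption, not the homework data) and takes a shortcut: F1 is computed once on the pooled out-of-fold predictions from `cross_val_predict`, which is close to, but not identical to, the mean of the five per-fold F1 scores. The grids for `C` and for the threshold are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the training split (NOT the homework data).
X_train, y_train = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)

thresholds = np.linspace(0.1, 0.9, 17)
results = []
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    # Out-of-fold churn probabilities from 5-fold cross-validation.
    probs = cross_val_predict(
        model, X_train, y_train, cv=5, method="predict_proba"
    )[:, 1]
    # F1 of the pooled out-of-fold predictions at each candidate threshold.
    scores = [
        f1_score(y_train, (probs >= t).astype(int), zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))
    results.append((C, thresholds[best], scores[best]))
    print(f"C={C:<5} best threshold={thresholds[best]:.2f} F1={scores[best]:.3f}")

# Best (C, threshold) pair over the whole grid.
best_C, best_threshold, best_f1 = max(results, key=lambda r: r[2])
```

The same loop structure should carry over to `min_samples_leaf` for a decision tree and to the number of neighbors for a nearest-neighbor model, since both expose `predict_proba`.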

__3. Use your test data to display a probability calibration plot for each of the three best models you found in the previous question.
Would you consider changing the selected model after looking at these plots? Explain why yes or why no.__

In [None]:
import matplotlib.pyplot as plt

# Remember to use the TRAINING data here: 
example_model.fit(df[predictors], df[target])
# And to use the TEST data here:
probs = example_model.predict_proba(df[predictors])[:, 1]

In [None]:
# Your code goes here 
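
The numbers behind a calibration plot can be obtained with `sklearn.calibration.calibration_curve`. The sketch below uses synthetic data and a logistic regression purely for illustration (both are assumptions, not the homework setup). The choices `n_bins=5` and `strategy="quantile"` are arbitrary.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (NOT the homework data).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit on the training part, predict probabilities on the test part.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# For each bin: observed fraction of positives vs. mean predicted probability.
frac_pos, mean_pred = calibration_curve(
    y_test, probs, n_bins=5, strategy="quantile"
)
for p, f in zip(mean_pred, frac_pos):
    print(f"mean predicted={p:.2f}  observed={f:.2f}")
```

To turn this into a plot, you could call `plt.plot(mean_pred, frac_pos, marker="o")` and add the diagonal with `plt.plot([0, 1], [0, 1], "--")`. A well-calibrated model stays close to that diagonal.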

__4. What is the potential benefit of stopping a customer from leaving? 
Use unit cost and unit revenue information to compute and then print a 2x2 benefit matrix. 
HINT: Look at the description of the data and of the retention offer.__

In [None]:
# Your code goes here
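
One possible reading of the case, sketched below, measures every outcome relative to doing nothing. It assumes that a retained customer pays their monthly charge for the 12 extra months, that a customer who would have stayed anyway brings no incremental revenue, and that the only cost is the $200 offer. The monthly charge of 70.0 is a hypothetical value for illustration; other interpretations of the case are defensible.

```python
import pandas as pd

OFFER_COST = 200.0      # from the case description
MONTHS_RETAINED = 12    # the offer extends the contract by one year
monthly_charge = 70.0   # hypothetical customer, an assumption for illustration

# Revenue kept by retaining a customer who would otherwise have churned.
retained_revenue = MONTHS_RETAINED * monthly_charge

# Rows: action taken. Columns: what the customer would have done without an offer.
benefit = pd.DataFrame(
    [[retained_revenue - OFFER_COST, -OFFER_COST],
     [0.0, 0.0]],
    index=["offer sent", "no offer"],
    columns=["would churn", "would stay"],
)
print(benefit)
```

Because `MonthlyCharges` differs across customers, the top-left entry is really customer-specific. Using an average charge gives a single matrix, while keeping the per-customer value gives a more precise analysis.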

__5. Split your training data into two sets, one with 90% of the data (the "sub-training" set) and another with 10% of the data (the validation set). Train the model you selected on the "sub-training" set, apply it to the validation set, and plot a profit curve by ranking customers according to their probability of churning. Make a recommendation of which customers to target with the retention incentive according to this profit curve.
HINT: Utilize the following functions, filling in your code to calculate profits.__

In [None]:
def build_cumulative_curve(model, X_validation, y_validation, scale=100):
    # Get the probability of Y_test records being = 1
    Y_test_probability_1 = model.predict_proba(X_validation)[:, 1]

    # Sort these probabilities and the true value in descending order of probability
    order = np.argsort(Y_test_probability_1)[::-1]    
    
    Y_test_probability_1_sorted = Y_test_probability_1[order]
    Y_test_sorted = np.array(y_validation)[order]

    # Build the cumulative response curve
    x_cumulative = np.arange(len(Y_test_probability_1_sorted)) + 1
    y_cumulative = np.cumsum(Y_test_sorted)

    # Rescale
    x_cumulative = np.array(x_cumulative)/float(x_cumulative.max()) * scale
    y_cumulative = np.array(y_cumulative)/float(y_cumulative.max()) * scale
    
    return x_cumulative, y_cumulative

def plot_profit_curve(model, model_name, sub_X_train, sub_y_train, X_validation, y_validation):
    ##############################
    # input
    #   model: a sklearn model object like DecisionTreeClassifier, LogisticRegression, etc.
    #   model_name: string, the name of the model to serve as the label in the plot. e.g. "Logistic Regression"
    #   sub_X_train: a dataframe, the training set with features only
    #   sub_y_train: a dataframe, the training set with target variable only
    #   X_validation: a dataframe, the validation set with features only
    #   y_validation: a dataframe, the validation set with target variable only
    #
    # output
    #   Profit curve: a matplotlib plot
    ##############################

    # Totals refer to the validation set, on which the curve is built
    total_obs = len(y_validation)
    total_pos = y_validation.sum()

    model.fit(sub_X_train, sub_y_train)

    x_cumulative, y_cumulative = build_cumulative_curve(model, X_validation, y_validation, scale=1)

    #########################
    # Fill in your code here to calculate "profits"
    #########################

    plt.plot(x_cumulative*100, profits, label = model_name)
    # Plot other details
    plt.xlabel("Percentage of users targeted")
    plt.ylabel("Profit")
    plt.title("Profit curve")
    plt.legend()
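
Because `build_cumulative_curve` is called with `scale=1`, `x_cumulative` is the fraction of customers targeted and `y_cumulative` is the fraction of all churners reached so far. A minimal sketch of how these fractions could be turned into profits follows. The helper name `profits_from_curve` and the parameters `benefit_tp` and `benefit_fp` are hypothetical, and the sketch assumes a single average benefit per churner reached and a single cost per wasted offer.

```python
import numpy as np

def profits_from_curve(x_cumulative, y_cumulative, total_obs, total_pos,
                       benefit_tp, benefit_fp):
    """Cumulative profit after targeting each top fraction of the ranking."""
    n_targeted = x_cumulative * total_obs   # customers who receive the offer
    n_churners = y_cumulative * total_pos   # of those, how many would have churned
    n_stayers = n_targeted - n_churners     # offers spent on customers who would stay
    return n_churners * benefit_tp + n_stayers * benefit_fp

# Toy ranking of 5 customers with 2 churners, ranked first and third.
x = np.arange(1, 6) / 5
y = np.cumsum([1, 0, 1, 0, 0]) / 2
profits = profits_from_curve(
    x, y, total_obs=5, total_pos=2, benefit_tp=640.0, benefit_fp=-200.0
)
best_fraction = x[int(np.argmax(profits))]
print(profits, best_fraction)
```

In this toy ranking the profit peaks once the second churner has been reached, which is the point a profit curve is meant to reveal.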

In [None]:
# Your code goes here

__6. Can you think of a better process than ranking according to the probability of churning? If so, explain it, and plot a profit curve according to this new ranking. Compare the results of the new ranking with the results you got in the previous question. Are the results any better? Are the selected customers different?__

In [None]:
# Your code goes here
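
One candidate alternative is to rank by expected profit rather than by churn probability alone, so that a likely churner with a small bill can rank below a less likely churner with a large bill. The sketch below uses five hypothetical customers and assumes the same economics as above: 12 months of retained revenue and a $200 offer cost. The arrays are made up for illustration.

```python
import numpy as np

OFFER_COST = 200.0
MONTHS_RETAINED = 12

# Hypothetical validation customers: predicted churn probability and monthly charge.
churn_prob = np.array([0.90, 0.80, 0.40, 0.30, 0.10])
monthly_charge = np.array([20.0, 25.0, 110.0, 100.0, 90.0])

# Expected profit of sending the offer:
# P(churn) * revenue retained over the extra year, minus the offer cost.
expected_profit = churn_prob * MONTHS_RETAINED * monthly_charge - OFFER_COST

rank_by_prob = np.argsort(churn_prob)[::-1]
rank_by_value = np.argsort(expected_profit)[::-1]
print("by probability:    ", rank_by_prob)
print("by expected profit:", rank_by_value)
```

Here the two rankings disagree: the most likely churners have the smallest bills, so they fall behind customers whose retention is worth more. The last customer has a negative expected profit and would not be worth targeting at all under these assumptions.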

__7. Now that you have chosen a process, evaluate the potential benefit of this solution. Use the entire training data to train your preferred model, and then use the model on the held-out test data to decide which customers to target with the retention offer. How much profit do you estimate that your process would have achieved?  HINT: Read the case description at the beginning carefully.__

In [None]:
# Your code goes here
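
The 25% cap from the case description can be applied after ranking. The following sketch assumes you already have an expected profit per test customer. The helper `select_targets` and all arrays are hypothetical, and realized profit is computed under the assumption that each retained churner pays their monthly charge for 12 more months.

```python
import numpy as np

OFFER_COST = 200.0
MONTHS_RETAINED = 12

def select_targets(expected_profit, max_fraction=0.25):
    """Indices of customers to target: best expected profit first,
    at most max_fraction of all customers, positive expected profit only."""
    budget = int(np.floor(max_fraction * len(expected_profit)))
    order = np.argsort(expected_profit)[::-1][:budget]
    return order[expected_profit[order] > 0]

# Hypothetical test customers (NOT the homework data).
expected_profit = np.array([50.0, -10.0, 300.0, 120.0, 5.0, -80.0, 90.0, 10.0])
monthly_charge = np.array([60.0, 30.0, 100.0, 80.0, 45.0, 20.0, 70.0, 50.0])
churned = np.array([1, 0, 1, 0, 0, 0, 1, 0])

targets = select_targets(expected_profit)

# Realized profit: revenue kept from targeted customers who really would
# have churned, minus the cost of every offer sent.
realized = np.sum(
    churned[targets] * MONTHS_RETAINED * monthly_charge[targets] - OFFER_COST
)
print(targets, realized)
```

With 8 customers the cap allows 2 offers. One of them goes to a customer who would have stayed anyway, which is why the realized figure is lower than the sum of the two expected profits might suggest.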

__8. Discuss whether the analysis and process above have distinguished correctly between revenue and profit, and between fixed costs and variable costs.__