# <font color= Purple> 0. Problem statement: Telecom Churn Group study
    
                                                                                   Team- 1. Varun Shenoy
                                                                                         2. Binay Yadab

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business
goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, we analyze customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn, and identify the main indicators of churn.

### Objective   
The goal is to build a machine learning model that is able to predict churning customers based on the features provided for their usage.

- Identify customers at high risk of churn by building a predictive ML model
- Identify important churn predictors
- Improve the overall accuracy by comparing different models, and explain the results in terms of the business objectives
- Recommend strategies to contain churn based on observations from the models

**The Data**

Training data in a CSV file along with metadata is provided. The data has 172 columns highlighting the customer behavior, usage, payment, and other patterns that might be relevant. The target variable is "churn_probability".

Steps followed in solving the Telecom churn case study

<b>Step-0 Understanding problem</b> 

<b>Step-1 Data Understanding (EDA) and visualization:</b> Missing value imputation, null value treatment, univariate and bivariate analysis, visualizing the data with appropriate plots

<b>Step-2 Data Preparation and modeling:</b> Class imbalance, dummy variables, scaling, train-test split, and building different models such as logistic regression, decision tree classifier, random forest etc.

<b>Step-3 Model Development and Evaluation:</b> identify the important features and best model.

<b>Step-4 Prediction:</b> Predictions on unseen test data  

## <font color= Purple> Step-1 Data Understanding (EDA) and visualization:
## <font color= Blue> 1.1 Loading dependencies & datasets

In [None]:
#Data Structures
import pandas as pd
import numpy as np
import re
import os

# To hide warnings
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',10000)

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
%matplotlib inline


# ML libraries from scikit-learn and statsmodels

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer
import statsmodels.api as sm
from imblearn.over_sampling import SMOTE

In [None]:
df_telco = pd.read_csv('train.csv')
print("Telecom training data :",len(df_telco))
df_telco.info(verbose=True)

In [None]:
df_telco.head(10)

## <font color= Blue> 1.2 Understanding the Data (EDA)

In [None]:
# data description
df_telco.describe(percentiles=[0.25, 0.50, 0.75, 0.95])

In [None]:
# check for duplicate rows
df_telco.duplicated().sum()

In [None]:
# capturing columns with few unique values, and those with a single value
j=0
single_entry_list=[]
for i in df_telco.columns:
    if df_telco[i].nunique() <= 10:
        j+=1
        print('\n', df_telco[i].value_counts(),sep="")
        if df_telco[i].nunique()==1:
            single_entry_list.append(i)
print ("Total columns with less than 11 unique values:",j)
print (single_entry_list)

In [None]:
# to understand the meaning of the terms
df_data_dict = pd.read_csv("data_dictionary.csv")
df_data_dict.style #to view all text

In [None]:
# % Missing value in all columns
round(df_telco.isnull().sum()/len(df_telco),4)*100

## <font color= Orange> Observations
1. The training data has 172 columns and 69999 rows.
2. All data is numeric except the date columns, which are in MM/DD/YYYY format.
3. Outliers are present in many columns, as seen from `describe`, but they need further analysis.
4. No duplicate rows are observed in the dataset.
5. There are 24 columns with fewer than 11 unique values, many of which have a single value.
6. Many columns contain null values, and a pattern is visible: columns have either under 6% or above 70% missing values. Further study is needed to decide whether to impute or remove them.
7. There is an ID column, which is not needed for modeling.

## <font color= Blue> 1.3 Missing value treatment (EDA)

Columns with more than 60% missing values are dropped, as they cannot be imputed reliably and imputing them would skew the data.

In [None]:
missing_val_60_list=df_telco.columns[100*(df_telco.isnull().sum()/len(df_telco)) > 60]
print("Number of columns with more than 60% missing values = ",len(missing_val_60_list))
print(missing_val_60_list)

In [None]:
# columns to drop from the dataset
drop_list=list(missing_val_60_list)+single_entry_list
drop_list

In [None]:
# dropping the listed columns along with id
df_telco_v1 = df_telco.drop(drop_list,axis=1)
df_telco_v1 = df_telco_v1.drop('id',axis=1)
print("Telecom dataset before dropping columns:",df_telco.shape)
print("Telecom dataset after dropping columns:",df_telco_v1.shape)

In [None]:
df_telco_v1.info(verbose=True)

In [None]:
df_telco_v1.head()

Dates can be dropped as we have the monthly recharges and their amounts.

In [None]:
# capturing date columns
date_drop=[]
for i in df_telco_v1.columns:
    if df_telco_v1[i].dtype=='object':
        date_drop.append(i)

df_telco_v1 = df_telco_v1.drop(date_drop,axis=1)
df_telco_v1.shape

## <font color= Blue> 1.4 Null value imputation (EDA)

In [None]:
# getting columns with null values
cols_w_null = df_telco_v1.columns[100*(df_telco_v1.isnull().sum()/len(df_telco_v1)) > 0]
print("Total Columns with missing values in it = ",len(cols_w_null))
print(cols_w_null)

In [None]:
# understanding the distribution for columns with null
plt.figure(figsize=(35, 100))
for i in range (0,len(cols_w_null)):
    plt.subplot(17,5,i+1)
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    grp= sns.histplot(df_telco_v1[cols_w_null[i]], kde=True, bins=50)
plt.show()

## <font color= Orange> Observations
1. Most of the variables have a wide range, but the bulk of their values are zero or within 100.
2. They must not be imputed with their means, as the means are heavily skewed by outliers.
3. Imputation is done with the median.
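
To illustrate why the median is preferred for skewed columns, here is a minimal sketch on made-up numbers. The values and the names `usage`, `mean_fill` and `median_fill` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical usage minutes: mostly small values, one extreme outlier, two gaps
usage = pd.Series([0.0, 0.0, 5.0, 12.0, 30.0, np.nan, 45.0, np.nan, 5000.0])

mean_fill = usage.mean()      # pulled upwards by the single 5000.0 entry
median_fill = usage.median()  # stays close to the bulk of the data

imputed = usage.fillna(median_fill)
print("mean:", round(mean_fill, 2), "| median:", median_fill)
print("missing after imputation:", imputed.isnull().sum())
```

A single extreme value moves the mean far away from typical usage, while the median stays at a value that most customers could plausibly have.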

In [None]:
#imputing with median
median_imputation = SimpleImputer(strategy='median', missing_values=np.nan)
df_telco_v1[cols_w_null] = median_imputation.fit_transform(df_telco_v1[cols_w_null])

In [None]:
df_telco_v1.describe()

In [None]:
# % Missing value in all columns
round(df_telco_v1.isnull().sum()/len(df_telco_v1),4)*100

No missing values are observed in any columns

In [None]:
df_telco_v1.head()

## <font color= Orange> Observations
1. After null imputation and dropping columns, the dataset covers three months: June, July and August.
2. A customer profile must be created from this data to identify high-value customers. These can be identified from a combination of high ARPU (average revenue per user) and AON (age on network).

In [None]:
df_data_dict.style

In [None]:
df_telco_v1['aon_months'] = df_telco_v1['aon'].apply(lambda x: round((x/365)*12,2))
df_telco_v1['aon_months'].describe()

In [None]:
df_telco_v1['mean_arpu'] = round((df_telco_v1['arpu_6']+df_telco_v1['arpu_7']+df_telco_v1['arpu_8'])/3,2)
df_telco_v1['mean_arpu'].describe()

## <font color= Orange> Observations
1. Age on network in months shows that the shortest tenure is ~5 months and the longest is ~142 months.
2. Mean revenue per user over the 3 months is ~280 units, with a median of ~200 units.
3. Multiplying aon_months by mean_arpu gives avg_customer_Spend, an estimate of total spend over the customer's tenure. All other columns used for this derivation can then be dropped.

In [None]:
df_telco_v1['avg_customer_Spend'] = df_telco_v1['aon_months']*df_telco_v1['mean_arpu']
df_telco_v1['avg_customer_Spend'].describe()

In [None]:
df_telco_v1.drop(columns = ['aon','arpu_6','arpu_7','arpu_8','mean_arpu','aon_months'],axis=1,inplace=True)
df_telco_v1.head()

In [None]:
df_telco_v1.shape

In [None]:
HVC = df_telco_v1['avg_customer_Spend'].quantile(0.65) #High value customers
LVC = df_telco_v1['avg_customer_Spend'].quantile(0.2) #low value customers
df_telco_v1['customer_value'] = df_telco_v1['avg_customer_Spend'].apply(
    lambda x: 'HVC' if x > HVC else ('LVC' if x < LVC else 'MVC')) # MVC- Medium value customers
df_telco_v1['customer_value'].value_counts()

## <font color= Orange> Observations
1. We now have a customer value segment that shows which customers the telecom operator should focus on for maximum retention.
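
The quantile-based segmentation can also be written without a row-wise `apply`. The sketch below uses `np.select` on a small hypothetical series; `spend`, `hvc_cut` and `lvc_cut` are illustrative names, and the 0.65 and 0.20 cut-offs mirror the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical spend values; in the notebook this would be avg_customer_Spend
spend = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)

hvc_cut = spend.quantile(0.65)  # above this: high value
lvc_cut = spend.quantile(0.20)  # below this: low value

# conditions are checked in order; anything matching neither is medium value
segment = pd.Series(
    np.select([spend > hvc_cut, spend < lvc_cut], ["HVC", "LVC"], default="MVC"),
    index=spend.index,
)
print(segment.value_counts())
```

With these cut-offs roughly the top 35% of customers land in HVC and the bottom 20% in LVC, which is a modeling choice rather than a fixed rule.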

## <font color= Blue> 1.5 Outlier treatment and Visualization (EDA)

In [None]:
from pandas.api.types import is_numeric_dtype
for col in df_telco_v1.columns: 
    if is_numeric_dtype(df_telco_v1[col]):
        plt.title(col)
        sns.boxplot(x = df_telco_v1['customer_value'],y=df_telco_v1[col], data = df_telco_v1)
        print ("maximum value for",col, "is:",df_telco_v1[col].max(),
               "\nminimum value for",col, "is:", df_telco_v1[col].min(),
              "\nmedian value for",col, "is:", df_telco_v1[col].quantile(0.50))
        plt.show()

## <font color= Orange> Observations
1. Almost all columns have outliers.
2. Outliers will not be removed from all columns, e.g. columns like sachet, monthly_3g, spl etc. have a very small difference between max and median.
3. Many columns, like std recharge and roaming, are left out of outlier treatment as they drive revenue, and dissatisfaction there may lead to more churn.
4. Hence outliers are dropped only from columns with a large mean-median difference, observed especially in total minutes of usage together with onnet and offnet usage.
5. Values above the 99th percentile (an assumed threshold) are considered outliers.

In [None]:
outlier_cols=['total_ic_mou_6','total_og_mou_6','onnet_mou_6','offnet_mou_6','total_rech_amt_6','total_ic_mou_7','total_og_mou_7','onnet_mou_7','offnet_mou_7','total_rech_amt_7','total_ic_mou_8','total_og_mou_8','onnet_mou_8','offnet_mou_8','total_rech_amt_8',]


In [None]:
df_telco_v1[outlier_cols].describe(percentiles=[0,0.25,0.5,0.75,0.99])

In [None]:
df_telco_v2=df_telco_v1.copy()
for col in outlier_cols:
    if is_numeric_dtype(df_telco_v2[col]):
        if df_telco_v2[col].max()/df_telco_v2[col].quantile(0.99)>6:
            df_telco_v2=df_telco_v2[df_telco_v2[col]<df_telco_v2[col].quantile(0.99)]

In [None]:
df_telco_v2.shape

## <font color= Orange> Observations
1. Among the identified outlier columns, rows are removed only in the very high deviation cases, i.e. where the ratio between the maximum and the 99th percentile is more than 6.
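
The trimming rule can be seen on a toy frame. This is a sketch with hypothetical columns `spiky` and `calm`; only the column whose maximum is more than 6 times its 99th percentile loses rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical columns: "spiky" has one extreme value, "calm" does not
toy = pd.DataFrame({
    "spiky": np.append(rng.uniform(0, 100, 999), 10_000.0),
    "calm": rng.uniform(0, 100, 1000),
})

trimmed = toy.copy()
for col in ["spiky", "calm"]:
    p99 = trimmed[col].quantile(0.99)
    # trim only when the maximum dwarfs the 99th percentile
    if trimmed[col].max() / p99 > 6:
        trimmed = trimmed[trimmed[col] < p99]

print(toy.shape, "->", trimmed.shape)
```

Note that the rule removes the whole top 1% of the triggering column, not just the single extreme row, so each triggered column costs about 1% of the remaining data.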

In [None]:
# % of rows removed by the outlier treatment
round(((len(df_telco_v1)-df_telco_v2.shape[0])/len(df_telco_v1))*100,2)

In [None]:
df_telco_v2.describe()

## <font color= Orange> Observations
1. A few key columns were identified and the outliers in those columns were removed, trimming the dataset by ~3%, which is acceptable.

In [None]:
sns.histplot(df_telco_v2['churn_probability'])
plt.title("Churn Probability")
plt.show()
df_telco_v2['churn_probability'].value_counts()

In [None]:
# churned customers as a percentage of active customers (churn-to-active ratio) in the filtered dataset
round((df_telco_v2['churn_probability'].value_counts()[1]/df_telco_v2['churn_probability'].value_counts()[0])*100,2)

## <font color= Orange> Observations
1. A big class imbalance is seen between churn (~12%) and active cases (~88%).
2. A correlation matrix is computed to see if there are any visible patterns correlating with churn and spend.
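
Class balance can be expressed in two ways that are easy to confuse: the share of churners among all customers, and the ratio of churners to active customers. A small sketch on a hypothetical 0/1 target (`target` and the 12/88 split are made up for illustration):

```python
import pandas as pd

# Hypothetical 0/1 target: 12 churners and 88 active customers
target = pd.Series([1] * 12 + [0] * 88)

counts = target.value_counts()
churn_share = counts[1] / counts.sum() * 100   # churners as % of all customers
churn_to_active = counts[1] / counts[0] * 100  # churners per 100 active customers

print(round(churn_share, 2), round(churn_to_active, 2))
```

The two numbers diverge more as the minority class grows, so it helps to state which one is being reported.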

In [None]:
# correlation plot of whole dataset (numeric_only skips the text column customer_value)
plt.figure(figsize = (40, 40))
sns.heatmap(df_telco_v2.corr(numeric_only=True), cmap="YlGnBu")
plt.show()

## <font color= Orange> Observations

1. Since the graph is not clear, the study is carried out month by month to see if there are any visible patterns.

In [None]:
June_month =[]
July_month = []
August_month = []
others=[]
for i in df_telco_v2.columns:
    if '6' in i or 'jun' in i.lower():
        June_month.append(i)
    elif '7' in i or 'jul' in i.lower():
        July_month.append(i)
    elif '8' in i or 'aug' in i.lower():
        August_month.append(i)
    else:
        others.append(i)

print(June_month, len(June_month))
print(July_month, len(July_month))
print(August_month, len(August_month))
print(others)

Month-wise analysis

In [None]:
df_telco_v2.head()

In [None]:
# one figure with a grid of panels for all June columns (created once, shown once)
plt.figure(figsize=(80, 60))
n_rows = int(np.ceil(len(June_month)/4))
for i in range(len(June_month)):
    plt.subplot(n_rows,4,i+1)
    sns.scatterplot(x = df_telco_v2[June_month[i]], y = 'avg_customer_Spend', data = df_telco_v2, hue='customer_value',style='churn_probability',size='churn_probability')
plt.show()

In [None]:
# one figure with a grid of panels for all July columns (created once, shown once)
plt.figure(figsize=(80, 60))
n_rows = int(np.ceil(len(July_month)/4))
for i in range(len(July_month)):
    plt.subplot(n_rows,4,i+1)
    sns.scatterplot(x = df_telco_v2[July_month[i]], y = 'avg_customer_Spend', data = df_telco_v2, hue='customer_value',style='churn_probability',size='churn_probability')
plt.show()

In [None]:
# one figure with a grid of panels for all August columns (created once, shown once)
plt.figure(figsize=(80, 60))
n_rows = int(np.ceil(len(August_month)/4))
for i in range(len(August_month)):
    plt.subplot(n_rows,4,i+1)
    sns.scatterplot(x = df_telco_v2[August_month[i]], y = 'avg_customer_Spend', data = df_telco_v2, hue='customer_value',style='churn_probability',size='churn_probability')
plt.show()

## <font color= Orange> Observations

1. Visually, more churn can be seen among high-value customers in std outgoing calls, followed by roaming.

In [None]:
def corr(month):
    plt.figure(figsize = (40, 40))
    # numeric_only skips the text column customer_value
    sns.heatmap(df_telco_v2[month].corr(numeric_only=True), annot=True, cmap="YlGnBu")
    plt.show()

In [None]:
June=June_month+others
corr(June)

In [None]:
July=July_month+others
corr(July)

In [None]:
Aug=August_month+others
corr(Aug)

## <font color= Orange> Observations

1. There are small patches where a variable is positively correlated with both customer spend and churn probability. Although the correlation is very slight, these variables are important.
2. Also, higher spending is associated with a lower churn probability (negative correlation).
3. From the scatter plots it is also evident that std_og is an important variable.

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x="customer_value",hue="churn_probability", data=df_telco_v2)
plt.show()

##  <font color= Green> Key insights from EDA -
 - Most customers who are high value as per our analysis have a lower churn rate.
 - Targeting the medium-value (MVC) and high-value (HVC) customers is beneficial to the company.
 - Churn probability is negatively correlated with most variables, and no single predictor variable shows a clear pattern with the target variable.
 - There is a big class imbalance: about 88% of the data is active users.
 - Only a few variables (like std_og, roam etc.) give a clear indication of how they relate to revenue and churn.
 - In this dataset June sees the highest churn.

## <font color= Purple> Step-2 Data Preparation and modeling:

In [None]:
df_telco_v2.info(verbose=True)

Handling class imbalance by resampling

In [None]:
df_telco_v2['customer_value'] = df_telco_v2['customer_value'].apply(lambda x:2 if x=='HVC' else (0 if x=='LVC' else 1))

In [None]:
# Separating the target (y) from the features (x)
y = df_telco_v2.loc[:,'churn_probability']
x = df_telco_v2.drop(columns=['churn_probability'],axis=1)

In [None]:
# named smote (not sm) so the statsmodels alias imported above is not overwritten
smote = SMOTE(random_state=15)
X_rsamp, y_rsamp = smote.fit_resample(x, y)
print(X_rsamp.shape, y_rsamp.shape)

In [None]:
#df_telco_v3 = df_telco_v2
df_telco_v3 = pd.concat([X_rsamp,y_rsamp],axis=1)
df_telco_v3.head()

In [None]:
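# decode back to labels; SMOTE interpolates, so any synthetic value other than exactly 0 or 2 falls into 'MCV'
df_telco_v3['customer_value'] = df_telco_v3['customer_value'].apply(lambda x:'HVC' if x==2 else ('LCV' if x==0 else 'MCV'))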
df_telco_v3['customer_value'].value_counts()

In [None]:
df_telco_v3.shape

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x="customer_value",hue="churn_probability", data=df_telco_v3)
plt.show()

In [None]:
# churned customers as a percentage of active customers in the resampled dataset (100 means balanced)
round((df_telco_v3['churn_probability'].value_counts()[1]/df_telco_v3['churn_probability'].value_counts()[0])*100,2)

## <font color= Orange> Observations

1. Now the churn classes are equally distributed and the imbalance is eliminated.
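
In this notebook the resampling happens before the train/test split, so the held-out set also contains synthetic rows. An alternative order, sketched below under stated assumptions, is to split first and oversample only the training part. Plain random oversampling via `sklearn.utils.resample` stands in for SMOTE here to keep the sketch light on dependencies; the frame `demo` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(1)
# Hypothetical imbalanced data: roughly 10% positives
demo = pd.DataFrame({"f1": rng.normal(size=500), "f2": rng.normal(size=500)})
demo["churn"] = (rng.uniform(size=500) < 0.1).astype(int)

# 1) hold out a test set first so it only contains real rows
train, test = train_test_split(demo, test_size=0.2, stratify=demo["churn"], random_state=100)

# 2) oversample the minority class inside the training part only
majority = train[train["churn"] == 0]
minority = train[train["churn"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=100)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["churn"].value_counts().to_dict(), len(test))
```

With this order the test metrics are measured on the original class distribution, which is usually closer to what the model will see on unseen data.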

In [None]:
hot_encode = pd.get_dummies(df_telco_v3['customer_value'], drop_first=True, prefix='customer_value')
df_telco_v3.drop(columns=['customer_value'],axis=1,inplace=True)
df_telco_v3 = pd.concat([df_telco_v3,hot_encode],axis=1)

In [None]:
df_telco_v3.shape

In [None]:
df_train, df_test = train_test_split(df_telco_v3,train_size=0.80,test_size=0.20,random_state=100)

In [None]:
y_train = df_train.pop('churn_probability')
X_train = df_train
y_test = df_test.pop('churn_probability')
X_test = df_test

In [None]:
print(df_telco_v3.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# Scaling the train & test datasets
scaler = StandardScaler()
var=X_train.columns
X_train[var] = scaler.fit_transform(X_train[var])
X_test[var] = scaler.transform(X_test[var])

In [None]:
X_train.describe()

In [None]:
X_test.describe()

## <font color= Orange> Observations

1. All features now appear to be on a similar scale.
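
As a small illustration of the scaling step, the sketch below fits a `StandardScaler` on a hypothetical training matrix and reuses the learned statistics on a hypothetical test matrix. The names `train_part`, `test_part` and `demo_scaler` are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical feature matrices on very different scales
train_part = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(200, 2))
test_part = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(50, 2))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_part)  # learns mean and std from train only
test_scaled = demo_scaler.transform(test_part)        # reuses the train statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(2))
```

The training columns end up with mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is expected.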

## <font color= Purple> Step-3 Model Development and Evaluation

### <font color= Blue> 3.1 Logistic regression is chosen as base model

In [None]:
model_logistic = LogisticRegression()
model_logistic.fit(X_train, y_train)
y_pred_log =  model_logistic.predict(X_test)

In [None]:
def get_confusion_matrix(y_test,y_pred):
    cm = confusion_matrix(y_test, y_pred)
    print('The confusion Matrix : \n',cm)
    
    TP = cm[1,1] # true positives 
    TN = cm[0,0] # true negatives
    FP = cm[0,1] # false positives
    FN = cm[1,0] # false negatives

    accuracy = metrics.accuracy_score(y_true=y_test,y_pred=y_pred)
    recall = TP/(FN+TP)
    specificity = TN/(TN+FP)
    precision = TP/(FP+TP)

    print("Accuracy = {:.2f}".format(accuracy))
    print("Sensitivity/Recall = {:.2f}".format(recall))
    print("Specificity = {:.2f}".format(specificity))
    print("Precision = {:.2f}".format(precision))


In [None]:
get_confusion_matrix(y_test,y_pred_log)

## <font color= Orange> Observations

1. The base model itself achieves high accuracy.
2. All other scores are similar, at about 87%.
3. To get a fair idea of multicollinearity, VIF is computed and some columns are dropped to see the influence on the result.
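
For reference, the scores reported above all derive from the four cells of the confusion matrix. A worked sketch on hypothetical labels (`y_true` and `y_hat` are made up, with 1 = churn and 0 = active):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = churn, 0 = active
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# ravel() on a 2x2 matrix returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)        # share of actual churners that were caught
specificity = tn / (tn + fp)   # share of active customers left alone
precision = tp / (tp + fp)     # share of churn flags that were right
print(recall, specificity, precision)
```

Here 3 of the 4 churners are caught (recall 0.75), 4 of the 6 active customers are correctly left alone, and 3 of the 5 churn flags are correct (precision 0.6).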

In [None]:
## eliminating features by performing VIF

def VIF_calc(dataframe):
    vif = pd.DataFrame()
    vif['Features'] = dataframe.columns
    vif['VIF'] = [variance_inflation_factor(dataframe.values, i) for i in range(dataframe.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return vif

In [None]:
VIF_calc(X_train[var])

## <font color= Orange> Observations

1. VIF shows more than 25 columns with a high value.
2. Resolving this column by column would take many iterations.
3. As a quick check, 5 columns with high VIF are dropped to understand their influence.
4. The columns dropped are [std_og_mou_7,total_og_mou_7,loc_og_mou_7,loc_og_t2m_mou_7,std_og_t2m_mou_7]
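
The VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the others. The sketch below computes it by hand on hypothetical data in which `c` is nearly the sum of `a` and `b`. The helper `vif_by_hand` is illustrative; it fits an intercept, whereas the statsmodels helper uses the matrix as given, so the numbers can differ on data that is not centred.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.1, size=300)  # nearly a linear combination of a and b
features = np.column_stack([a, b, c])

def vif_by_hand(matrix, j):
    """VIF of column j: regress it on the remaining columns, then 1 / (1 - R^2)."""
    others = np.delete(matrix, j, axis=1)
    r2 = LinearRegression().fit(others, matrix[:, j]).score(others, matrix[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif_by_hand(features, j) for j in range(features.shape[1])]
print([round(v, 1) for v in vifs])
```

All three columns get a large VIF, not just `c`, because each one can be reconstructed from the other two. This is why dropping a handful of columns from a highly collinear set often changes little.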

In [None]:
var_1=var.drop(['std_og_mou_7','total_og_mou_7','loc_og_mou_7','loc_og_t2m_mou_7','std_og_t2m_mou_7'])

In [None]:
model_logistic.fit(X_train[var_1], y_train)
y_pred_log_2 =  model_logistic.predict(X_test[var_1])

In [None]:
get_confusion_matrix(y_test,y_pred_log_2)

## <font color= Orange> Observations

1. No change is observed, which suggests the variables have high multicollinearity. PCA is needed.

### <font color= Blue> 3.2 Decision Tree base model

In [None]:
dec_tree = DecisionTreeClassifier(max_depth=3,random_state=100)
dec_tree.fit(X_train[var], y_train)

In [None]:
from IPython.display import Image  
from io import StringIO  
from sklearn.tree import export_graphviz
import pydotplus, graphviz

def get_graph(classifier):
    dot_data = StringIO()  
    # class_names must follow ascending class order: 0 = Active, 1 = Churn
    export_graphviz(classifier, out_file=dot_data, filled=True, rounded=True,
                    feature_names=list(X_train.columns), 
                    class_names=['Active', 'Churn'])

    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return Image(graph.create_png())

In [None]:
get_graph(dec_tree)

In [None]:
y_pred_dec_tree_base = dec_tree.predict(X_test)

In [None]:
get_confusion_matrix(y_test,y_pred_dec_tree_base)

## <font color= Orange> Observations

1. The decision tree gives the same accuracy as logistic regression.
2. Recall has reduced by 4%, while precision and specificity have improved.
3. Dimensionality is reduced using PCA in the next steps, and a decision tree classifier is then applied.

### <font color= Blue> 3.3 Principal Component analysis

In [None]:
pca = PCA(random_state=100)

In [None]:
pca.fit(X_train)

In [None]:
pca.components_.shape

In [None]:
# cumulative explained variance (in %) for the scree plot: variance v/s number of components
cumsum_vars = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=3)*100)

In [None]:
fig = plt.figure(figsize=[10,6])
plt.vlines(x=70, ymax=100, ymin=0, colors="g", linestyles="--")
plt.hlines(y=95, xmax=125, xmin=0, colors="r", linestyles="-.")
plt.plot(cumsum_vars)
plt.ylabel("Cumulative variance")
plt.xlabel("Number of principal components")
plt.show()

## <font color= Orange> Observations

1. From the scree plot we can see that 70 principal components capture about 95% of the variance in the data, i.e. just over half the number of original features.
2. This will be taken forward in the next step.
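
The component count can also be read off numerically instead of from the plot. The sketch below, on hypothetical data, finds the first count whose cumulative explained variance reaches 95% and compares it with letting `PCA` choose by passing a float `n_components`. The names `observed`, `full_pca` and `auto_pca` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: 20 observed columns driven by 5 latent factors plus noise
latent = rng.normal(size=(400, 5))
mixing = rng.normal(size=(5, 20))
observed = latent @ mixing + rng.normal(scale=0.1, size=(400, 20))

full_pca = PCA(random_state=100).fit(observed)
cum_var = np.cumsum(full_pca.explained_variance_ratio_)
n_needed = int(np.argmax(cum_var >= 0.95) + 1)  # first count reaching 95%

# Passing a float in (0, 1) lets PCA pick the component count itself
auto_pca = PCA(n_components=0.95, random_state=100).fit(observed)
print(n_needed, auto_pca.n_components_)
```

Both routes should agree, and the float form avoids hard-coding a number that would need updating if the feature set changes.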

In [None]:
pca_70_var = PCA(n_components=70,random_state=100)

In [None]:
X_train_pca_70 = pca_70_var.fit_transform(X_train)
X_test_pca_70 = pca_70_var.transform(X_test)

In [None]:
print(X_train_pca_70.shape)
print(X_test_pca_70.shape)
X_train_pca_70

## <font color= Orange> Observations

1. The dataset has been reduced to an array of 70 components after PCA.
2. Decision tree and random forest models are fitted on this data to calculate the metrics.

### <font color= Blue> 3.4 Decision tree classifier with PCA and hyperparameter tuning

In [None]:
dec_tree_cla = DecisionTreeClassifier(random_state=100)

In [None]:
# Parameter grid for the exhaustive grid search
params_dec_tree = {
    'max_depth': [ 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100, 1000],
    'criterion': ["gini", "entropy"]
}

In [None]:
grid_search = GridSearchCV(estimator=dec_tree_cla, 
                           param_grid=params_dec_tree, 
                           cv=4, n_jobs=-1, verbose=1, scoring = "accuracy")

In [None]:
grid_search.fit(X_train_pca_70, y_train)

In [None]:
score_dec_tree = pd.DataFrame(grid_search.cv_results_)
score_dec_tree.head()

In [None]:
score_dec_tree.nlargest(5,"mean_test_score")

In [None]:
grid_search.best_estimator_

In [None]:
dec_tree_final = DecisionTreeClassifier(criterion='entropy', max_depth=20, min_samples_leaf=5, random_state=100)
dec_tree_final.fit(X_train_pca_70, y_train)

In [None]:
y_pred_dec_tree = dec_tree_final.predict(X_test_pca_70)

In [None]:
get_confusion_matrix(y_test,y_pred_dec_tree)

## <font color= Orange> Observations

1. Accuracy has marginally reduced from 87% to 86%.
2. The other metrics have also marginally decreased, except recall.
3. Hence, in this case, the base decision tree model performs slightly better on all counts except recall.

### <font color= Blue> 3.5 Random forest classifier with PCA and hyperparameter tuning

In [None]:
params_rf = {
    'max_depth': [ 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'n_estimators': [10, 25, 50, 100,200]
}
# Instantiate the grid search model
rf = RandomForestClassifier(random_state=100)
grid_search = GridSearchCV(estimator=rf, 
                           param_grid=params_rf, 
                           cv=4, n_jobs=-1, verbose=1, scoring = "recall")

grid_search.fit(X_train_pca_70, y_train)

In [None]:
rf_pca_final = grid_search.best_estimator_

In [None]:
rf_pca_final.fit(X_train_pca_70,y_train)

In [None]:
y_pred_rf = rf_pca_final.predict(X_test_pca_70)
get_confusion_matrix(y_test,y_pred_rf)

## <font color= Orange> Observations

1. Random forest classifier gives better results compared to decision tree on all counts

## <font color= Green> Key Insights Model Development and Evaluation  -
1. The base logistic regression model was itself good in terms of prediction, with 87% accuracy and about 87% recall, precision and specificity.
2. Note that the telecom dataset was resampled to eliminate class imbalance before fitting logistic regression. Without resampling the accuracy was good but the other metrics performed poorly.
3. Hence the resampled dataset was used in all model development and hyperparameter tuning exercises.
4. Different models were also tried: logistic regression with VIF-based column dropping, decision tree (base and tuned), and random forest (tuned).
5. The best results were observed with the random forest classifier, with the following metrics:
    - Accuracy = 0.92
    - Sensitivity/Recall = 0.92
    - Specificity = 0.91
    - Precision = 0.91

   This means the classifier detects 92% of the actual churn cases (recall) and correctly labels 91% of the active cases as active (specificity). Precision indicates that 91% of the cases predicted as churn are actually churn.
6. Random forest with tuned hyperparameters (max depth 20, min samples per leaf 5, 200 estimators) is used for the final submission.

## <font color= Purple> FINAL PREDICTION (UNSEEN DATA)

In [None]:
#read test csv
df_telco_test = pd.read_csv("test.csv")

In [None]:
# capture customer id
Final_pred=pd.DataFrame()
Final_pred['id']=df_telco_test.id
Final_pred.head(20)

Applying all the pre-processing steps to the test data
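
One way to make sure unseen data goes through exactly the transformations fitted on the training data is to chain them in a scikit-learn `Pipeline`. This is a sketch on hypothetical arrays, not the notebook's actual flow; `X_fit`, `y_fit`, `X_new` and `demo_pipe` are made-up names, and the step settings are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Hypothetical training and unseen matrices with a few missing entries
X_fit = rng.normal(size=(300, 8))
y_fit = (X_fit[:, 0] + X_fit[:, 1] > 0).astype(int)
X_fit[rng.uniform(size=X_fit.shape) < 0.05] = np.nan
X_new = rng.normal(size=(40, 8))
X_new[rng.uniform(size=X_new.shape) < 0.05] = np.nan

demo_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=100)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=100)),
])
demo_pipe.fit(X_fit, y_fit)          # every step is fitted on training data only
new_pred = demo_pipe.predict(X_new)  # the same fitted steps are replayed, never refitted
print(new_pred[:10])
```

Because `predict` only calls `transform` on the earlier steps, the medians, scaling statistics and principal components all come from the training data.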

In [None]:
df_telco_test_v1 = df_telco_test.drop(drop_list,axis=1)
df_telco_test_v1 = df_telco_test_v1.drop('id',axis=1)

In [None]:
df_telco_test_v1 = df_telco_test_v1.drop(date_drop,axis=1)

In [None]:
# imputing with the medians learned from the training data (transform only, no refit)
df_telco_test_v1[cols_w_null] = median_imputation.transform(df_telco_test_v1[cols_w_null])

In [None]:
df_telco_test_v1['aon_months'] = df_telco_test_v1['aon'].apply(lambda x: round((x/365)*12,2))
df_telco_test_v1['mean_arpu'] = round((df_telco_test_v1['arpu_6']+df_telco_test_v1['arpu_7']+df_telco_test_v1['arpu_8'])/3,2)
df_telco_test_v1['avg_customer_Spend'] = df_telco_test_v1['aon_months']*df_telco_test_v1['mean_arpu']
df_telco_test_v1.drop(columns = ['aon','arpu_6','arpu_7','arpu_8','mean_arpu','aon_months'],axis=1,inplace=True)

In [None]:
df_telco_test_v1['customer_value'] = df_telco_test_v1['avg_customer_Spend'].apply(
    lambda x: 'HVC' if x > HVC else ('LCV' if x < LVC else 'MCV')) # same labels as the resampled training data
df_telco_test_v1['customer_value'].value_counts()

In [None]:
df_telco_test_v1.shape

In [None]:
hot_encode = pd.get_dummies(df_telco_test_v1['customer_value'], drop_first=True, prefix='customer_value')
df_telco_test_v1.drop(columns=['customer_value'],axis=1,inplace=True)
df_telco_test_v1 = pd.concat([df_telco_test_v1,hot_encode],axis=1)

In [None]:
df_telco_test_v1.info(verbose=True)

In [None]:
df_telco_test_v2=df_telco_test_v1.copy()

In [None]:
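# align column order with the training features before applying the fitted scaler
df_telco_test_v2 = df_telco_test_v2[X_train.columns]
var=df_telco_test_v2.columns
df_telco_test_v2[var] = scaler.transform(df_telco_test_v2[var])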

In [None]:
df_telco_test_v2_pca_70 = pca_70_var.transform(df_telco_test_v2)

In [None]:
Final_pred['Predictions_decision_tree'] = dec_tree_final.predict(df_telco_test_v2_pca_70)

In [None]:
Final_pred.to_csv("Unseen_Pred_test_Dec_Tree.csv",index=False)

In [None]:
Final_pred=Final_pred.drop('Predictions_decision_tree',axis=1)

In [None]:
Final_pred

In [None]:
Final_pred['Predictions_Random_Forest'] = rf_pca_final.predict(df_telco_test_v2_pca_70)

In [None]:
Final_pred.to_csv("Unseen_Pred_test_Random_forest.csv",index=False)

## <font color= Green> Suggestions to the company
 - Most customers who are high value as per our analysis have a lower churn rate.
 - However, high-value and medium-value customers together account for more than 60% of churn, so targeting them is beneficial to the company.
 - Better offers around roaming and outgoing STD calls will help reduce churn in this segment.
 - Based on the given dataset, the company can target the month of June for releasing new offers.