# Customer Churn Dataset

<code>
- **customerID: Customer ID**

- **gender: Whether the customer is a male or a female**

- **SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)**

- **Partner: Whether the customer has a partner or not (Yes, No)**

- **Dependents: Whether the customer has dependents or not (Yes, No)**

- **tenure: Number of months the customer has stayed with the company**

- **PhoneService: Whether the customer has a phone service or not (Yes, No)**

- **MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)**

- **InternetService: Customer’s internet service provider (DSL, Fiber optic, No)**

- **OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)**

- **OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)**

- **DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)**

- **TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)**

- **StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)**

- **StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)**

- **Contract: The contract term of the customer (Month-to-month, One year, Two year)**

- **PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)**

- **PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))**

- **MonthlyCharges: The amount charged to the customer monthly**

- **TotalCharges: The total amount charged to the customer**

- **Churn: Whether the customer churned or not (Yes or No)**
</code>
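
The category values listed above can be checked against a loaded frame before any modeling. A minimal sketch, assuming a small made-up frame (`toy`) and a hand-written `expected_values` mapping covering three of the columns; neither name comes from the notebook, and the deliberate misspelling `Fibre optic` is only there to show what a mismatch looks like:

```python
import pandas as pd

# Allowed values copied from the data dictionary (three columns only).
expected_values = {
    "Partner": {"Yes", "No"},
    "InternetService": {"DSL", "Fiber optic", "No"},
    "Contract": {"Month-to-month", "One year", "Two year"},
}

# Invented rows for illustration; "Fibre optic" is an intentional typo.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "No"],
    "InternetService": ["DSL", "Fibre optic", "No"],
    "Contract": ["Month-to-month", "Two year", "One year"],
})

# For each column, collect observed values that the dictionary does not list.
unexpected = {
    col: set(toy[col].dropna().unique()) - allowed
    for col, allowed in expected_values.items()
}
unexpected = {col: vals for col, vals in unexpected.items() if vals}
print(unexpected)
```

An empty dictionary would mean every observed value matches the documented categories.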


In [None]:
!pip install mlxtend
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(rc={'figure.figsize': [12, 10]}, font_scale=1.2)

In [None]:
from pandas.plotting import scatter_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [None]:
data=pd.read_csv("dataset.csv")
data.head()

In [None]:
data.drop("customerID",axis=1,inplace=True)

In [None]:
data.info()

In [None]:
data.describe(include='all')

In [None]:
data["SeniorCitizen"].value_counts()

In [None]:
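# Fill missing flags with the most frequent value (assignment avoids chained inplace on a column)
data["SeniorCitizen"]=data["SeniorCitizen"].fillna(data["SeniorCitizen"].mode()[0])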

In [None]:
# Fill missing tenure with the median, which is robust to the skewed distribution
data["tenure"]=data["tenure"].fillna(data["tenure"].median())

In [None]:
data.isna().sum()
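
The two `fillna` cells above use the mode for a 0/1 flag and the median for a numeric column. A minimal sketch of the same pattern on an invented frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; values are made up.
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 1, np.nan, 0],
    "tenure": [1.0, 5.0, np.nan, 60.0, 12.0],
})

# Mode for the binary flag, median for the numeric column.
toy["SeniorCitizen"] = toy["SeniorCitizen"].fillna(toy["SeniorCitizen"].mode()[0])
toy["tenure"] = toy["tenure"].fillna(toy["tenure"].median())

missing_after = int(toy.isna().sum().sum())
print(toy)
print("missing values left:", missing_after)
```

The median is computed from the non-missing values only, so a single large tenure such as 60 does not pull the filled value upward the way a mean would.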

## EDA

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt  
plt.style.use('bmh')

In [None]:
sns.countplot(data=data,x="gender",hue="Churn");

Whether the customer is male or female makes no visible difference to churn.

In [None]:
sns.countplot(data=data,x="SeniorCitizen",hue="Churn");

Compare the proportion of customers who stayed with the proportion who left in each group.

In [None]:
# internet service, churn 

sns.countplot(data=data,x="InternetService",hue="Churn");

Customers who stay rely more on DSL for internet service.
That is, customers on fiber optic leave more often than customers on DSL.

In [None]:
sns.countplot(data=data,x="Partner",hue="Churn");

Customers without a partner are more likely to leave the company than customers with one.

In [None]:
sns.countplot(data=data,x="PaperlessBilling",hue="Churn");

Customers who have paperless billing leave the company more often than those who do not.
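
Count plots show absolute counts, which can hide differences in rate when the groups differ in size. A hedged sketch, on invented data, of computing the churn rate per category with `pd.crosstab(..., normalize="index")`; the frame `toy` is hypothetical:

```python
import pandas as pd

# Invented example: 4 paperless customers, 4 non-paperless customers.
toy = pd.DataFrame({
    "PaperlessBilling": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Churn":            ["Yes", "Yes", "No",  "No",  "Yes", "No", "No", "No"],
})

# normalize="index" makes each row sum to 1, so cells are within-group shares.
rates = pd.crosstab(toy["PaperlessBilling"], toy["Churn"], normalize="index")
print(rates)
```

The same call works for any of the categorical columns plotted above by swapping the first argument.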

In [None]:
sns.kdeplot(data["tenure"]);

The tenure density peaks at roughly 3 to 8 months.


In [None]:
sns.kdeplot(data[data["Churn"]=='No']["tenure"],color='blue',label='Churn: No')
sns.kdeplot(data[data["Churn"]=='Yes']["tenure"],color='red',label='Churn: Yes')
plt.legend()
plt.show()

Customers who leave most often do so after about 3 to 8 months,
while customers who stay most often have tenures of about 60 to 70 months.

In [None]:
sns.kdeplot(data[data["Churn"]=='No']["MonthlyCharges"],color='blue',label='Churn: No')
sns.kdeplot(data[data["Churn"]=='Yes']["MonthlyCharges"],color='red',label='Churn: Yes')
plt.legend()
plt.title("Monthly charges for each category")
plt.xlabel("Monthly charges")

Customers who leave the company mostly pay between 70 and 90 per month,
while customers who stay mostly pay between 20 and 25.


In [None]:
sns.kdeplot(data[data["Churn"]=='No']["TotalCharges"],color='blue',label='Churn: No')
sns.kdeplot(data[data["Churn"]=='Yes']["TotalCharges"],color='red',label='Churn: Yes')
plt.legend()
plt.show()

Customers who leave spend more money.
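
Readings taken from KDE curves can be cross-checked numerically with a grouped summary. A minimal sketch on made-up numbers (the frame `toy` is hypothetical and only mimics the ranges mentioned in the notes above):

```python
import pandas as pd

# Invented rows: churners with short tenure and higher monthly charges.
toy = pd.DataFrame({
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "tenure": [3, 5, 8, 60, 65, 70],
    "MonthlyCharges": [70.0, 80.0, 90.0, 20.0, 22.0, 25.0],
})

# Median per churn group; the median is less sensitive to outliers than the mean.
summary = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()
print(summary)
```

On the real data, replacing `median()` with `describe()` would give quartiles as well, which helps judge how much the two distributions overlap.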

In [None]:
# List the numeric columns (candidates for scaling)
for i in data.columns:
    if pd.api.types.is_numeric_dtype(data[i]):
        print(i)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
sc=StandardScaler()
data[["tenure","MonthlyCharges","TotalCharges"]]=sc.fit_transform(data[["tenure","MonthlyCharges","TotalCharges"]])
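
The cell above scales the whole frame before any train/test split. A common alternative, sketched here on random stand-in data (not the churn dataset), fits the scaler on the training rows only so that test-set statistics do not influence the transform. All names below (`X_toy`, `y_toy`, `scaler`) are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in for three numeric columns and a binary target.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # mean/std learned from training rows only
X_te_scaled = scaler.transform(X_te)      # same statistics reused on the test rows

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1; the test columns will be close to that but not exact, which is expected.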

In [None]:
data.head()

In [None]:
from sklearn.preprocessing import  LabelEncoder

In [None]:
lb=LabelEncoder()
data["gender"]=lb.fit_transform(data["gender"])
data.head()

In [None]:
# l_objects=["gender","Partner",........]
# for i in l_objects:
#     lb=LabelEncoder()
#     data[i]=lb.fit_transform(data[i])

In [None]:
for i in data.columns:
    if data[i].dtype=='object':
        lb=LabelEncoder()
        data[i]=lb.fit_transform(data[i])
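
`LabelEncoder` assigns integer codes in sorted order of the labels, which implies an ordering that nominal columns may not have. A small sketch, using a made-up `Contract` series, that prints the mapping and shows one-hot encoding with `pd.get_dummies` as one possible alternative (a suggestion, not what the notebook does):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented values drawn from the documented Contract categories.
contract = pd.Series(
    ["Month-to-month", "One year", "Two year", "One year"], name="Contract"
)

enc = LabelEncoder()
codes = enc.fit_transform(contract)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(enc.classes_)}
print(mapping)

# One-hot alternative: one indicator column per category, no implied order.
one_hot = pd.get_dummies(contract, prefix="Contract")
print(one_hot.columns.tolist())
```

Tree-based models are usually tolerant of integer codes, while linear models and distance-based models such as KNN tend to be more sensitive to the implied order.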

In [None]:
data.head()

In [None]:
# data["degree"]=data["degree"].apply(map({"A":5,"b":4,"c":3}))

In [None]:
data.corr()

In [None]:
plt.figure(figsize=(15,7))
sns.heatmap(data.corr(),annot=True)

In [None]:
data["Churn"].value_counts()

In [None]:
X=data.drop("Churn",axis=1)
y=data["Churn"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30)

## Sampling data

In [None]:
training_data = pd.concat([X_train, y_train], axis=1)
training_data.head()

In [None]:
from sklearn.utils import resample

In [None]:
not_Churn = training_data[training_data['Churn'] == 0]
Churn = training_data[training_data['Churn'] == 1]

# upsample minority
Churn_upsampled = resample(Churn,
                          replace=True, # sample with replacement
                          n_samples=len(not_Churn), # match number in majority class
                          random_state=27) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([not_Churn, Churn_upsampled])

# check new class counts
upsampled['Churn'].value_counts()

In [None]:
X_train = upsampled.drop('Churn', axis=1)
y_train = upsampled['Churn']
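
The upsampling above is applied to the training split only, so the test set keeps the original class balance. A minimal sketch of the same `resample` call on an invented 8-to-2 frame (`toy` is hypothetical):

```python
import pandas as pd
from sklearn.utils import resample

# Invented imbalanced frame: 8 rows of class 0, 2 rows of class 1.
toy = pd.DataFrame({"x": range(10), "Churn": [0] * 8 + [1] * 2})

majority = toy[toy["Churn"] == 0]
minority = toy[toy["Churn"] == 1]

# Draw minority rows with replacement until both classes have equal size.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=27
)

balanced = pd.concat([majority, minority_up])
counts = balanced["Churn"].value_counts()
print(counts)
```

Upsampled rows are exact duplicates of existing minority rows, which is why this step belongs after the split: duplicates leaking into the test set would inflate the scores.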

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr=LogisticRegression()
lr.fit(X_train,y_train)
lr_pred=lr.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix,f1_score

In [None]:
f1_score(y_test,lr_pred)

In [None]:
lr.score(X_test,y_test)
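
`lr.score` reports accuracy, while `f1_score` combines precision and recall for the positive class; on imbalanced data the two can differ noticeably. A hedged sketch with hand-written labels (not model output) showing how the metrics relate to the confusion matrix:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels: 6 negatives, 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # tp / (tp + fp)
rec = recall_score(y_true, y_pred)      # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(cm)
print(acc, prec, rec, f1)
```

For churn, recall on the positive class is often the number to watch, since a missed churner is usually costlier than a false alarm.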

In [None]:
from mlxtend.plotting import plot_confusion_matrix

In [None]:
plot_confusion_matrix(confusion_matrix(y_test,lr_pred))

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
rf_pred=rf.predict(X_test)

In [None]:
plot_confusion_matrix(confusion_matrix(y_test,rf_pred))

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)
dt_pred=dt.predict(X_test)

In [None]:
plot_confusion_matrix(confusion_matrix(y_test,dt_pred))

In [None]:
import pickle 

In [None]:
import pickle
# Persist the fitted decision tree; the context manager closes the file
with open("saved_model.sav","wb") as save_model:
    pickle.dump(dt,save_model)

In [None]:
len(X_test.iloc[1,:].values)

In [None]:
# import pickle
# model=pickle.load(open("saved_model.sav", 'rb'))

In [None]:
import numpy as np

In [None]:
# .copy() gives an independent frame, so assigning to X["gender"] below is safe
X=data[["tenure","MonthlyCharges","TotalCharges","gender"]].copy()
y=data["Churn"]

In [None]:
lb=LabelEncoder()
X["gender"]=lb.fit_transform(X["gender"])

In [None]:
X.head()

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.30)

In [None]:
rf=RandomForestClassifier()
rf.fit(X_train,y_train)

In [None]:
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(rf, f)


In [None]:
import pickle
with open("finalized_model.sav", 'rb') as f:
    model=pickle.load(f)

In [None]:
np.array([[1.        , 50       , 500       ,1500]])

In [None]:
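# Column order must match the training frame: tenure, MonthlyCharges, TotalCharges, gender.
# The numeric columns were standardized before training, so raw values should be scaled first.
model.predict(np.array([[1.        , 50       , 500       ,1500]]))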

In [None]:
# Spot-check several algorithms with 10-fold stratified cross-validation
models = []
# multi_class is omitted: the target is binary, so 'ovr' adds nothing
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

In [None]:
# cv_results holds the fold scores of the last model evaluated (SVM)
print(cv_results)
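
Printing `cv_results` after the loop shows only the last model's fold scores. One way to compare all candidates side by side is to collect the mean and standard deviation into a DataFrame. A hedged sketch on synthetic data from `make_classification`, with two of the models for brevity; `candidates` and `summary` are illustrative names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the churn features.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=1)

# One shared splitter so every model is scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    "LR": LogisticRegression(solver="liblinear"),
    "CART": DecisionTreeClassifier(random_state=1),
}

rows = []
for name, est in candidates.items():
    scores = cross_val_score(est, X_toy, y_toy, cv=cv, scoring="accuracy")
    rows.append({"model": name, "mean": scores.mean(), "std": scores.std()})

summary = (
    pd.DataFrame(rows).set_index("model").sort_values("mean", ascending=False)
)
print(summary)
```

Since the notebook's classes are imbalanced, passing `scoring="f1"` instead of `"accuracy"` may give a more informative ranking.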