### Problem Statement
This project addresses customer churn at a telecommunications company. Customer churn refers to customers leaving the company or discontinuing their services. Reducing churn is critical to the company's profitability and growth.

A dataset on customer churn for Telco Customer Services is provided. The task is to build a model that predicts customer churn, exploring both the decision tree and the random forest approaches.

In [1]:
# Importing the required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [2]:
# Reading the data
df=pd.read_csv("Telco-Customer-Churns.csv")
df

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [4]:
# getting the target variable from our dataset
target=df["Churn"].values

In [5]:
# dropping the Churn column from the feature dataset
df=df.drop(["Churn"],axis=1)

In [6]:
# dropping the customerID column from the feature dataset
df=df.drop(["customerID"],axis=1)

In [7]:
# Getting a view of the unique values in each column
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Unique values in '{column}':")
    print(unique_values)
    print("\n")

Unique values in 'gender':
['Female' 'Male']


Unique values in 'SeniorCitizen':
[0 1]


Unique values in 'Partner':
['Yes' 'No']


Unique values in 'Dependents':
['No' 'Yes']


Unique values in 'tenure':
[ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26  0
 39]


Unique values in 'PhoneService':
['No' 'Yes']


Unique values in 'MultipleLines':
['No phone service' 'No' 'Yes']


Unique values in 'InternetService':
['DSL' 'Fiber optic' 'No']


Unique values in 'OnlineSecurity':
['No' 'Yes' 'No internet service']


Unique values in 'OnlineBackup':
['Yes' 'No' 'No internet service']


Unique values in 'DeviceProtection':
['No' 'Yes' 'No internet service']


Unique values in 'TechSupport':
['No' 'Yes' 'No internet service']


Unique values in 'StreamingTV':
['No' 'Yes' 'No internet service']


Unique values in 'StreamingMovies'

#### As the output above shows, some columns have more than two unique values and others have exactly two. To make them all numerical, we will apply one-hot encoding to every categorical variable.
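
To make the encoding step concrete, here is a minimal sketch on a toy frame. The three rows are hypothetical and only the column names are borrowed from the dataset; the categorical columns are listed explicitly here rather than detected by dtype.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the dataset above.
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "tenure": [1, 34, 2],
})

# Encode only the categorical columns; numeric columns pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["Partner", "InternetService"])

print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

Each category becomes its own indicator column, so a two-valued column such as `Partner` yields two columns and a three-valued column such as `InternetService` yields three.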




In [8]:
# converting TotalCharges to float; blank strings are replaced with "0" first
df["TotalCharges"] = df["TotalCharges"].replace(" ", "0").astype(float)
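
An alternative way to do the same conversion is sketched below, assuming the only non-numeric entries are blank strings as in the cell above. It uses `pd.to_numeric` with `errors="coerce"`, which also makes it possible to count how many values were blank before filling them. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw TotalCharges strings, including one blank entry.
raw = pd.Series(["29.85", " ", "1889.5"], name="TotalCharges")

# Unparseable entries (such as the blank string) become NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")

n_blank = int(parsed.isna().sum())  # how many values could not be parsed
cleaned = parsed.fillna(0.0)        # same end result as replacing " " with "0"

print(n_blank, cleaned.tolist())
```

Whether 0 is the right fill value is a modelling choice; this sketch simply reproduces what the notebook does.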

In [9]:
# selecting columns that are categorical
categorical_columns = df.select_dtypes(include=['object']).columns

# Apply one-hot encoding to all categorical columns
df_encoded = pd.get_dummies(df, columns=categorical_columns)


In [10]:
# splitting our data into test and train datasets
X_train,X_test,y_train,y_test=train_test_split(df_encoded.values,target,test_size=0.2,random_state=2)
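
As a quick check of what `test_size=0.2` means for this dataset, the sketch below splits placeholder arrays with the same row count reported by `df.info()` (7043). The feature and label values are hypothetical; only the sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7043                   # row count reported by df.info() above
X = np.zeros((n_rows, 3))       # placeholder features (hypothetical)
y = np.arange(n_rows) % 2       # placeholder binary labels (hypothetical)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

print(len(X_tr), len(X_te))
```

The test set holds 1409 rows, which matches the total support shown in the classification report further down.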

In [11]:
# initializing the classifier
classifier=DecisionTreeClassifier()

In [12]:
# fitting the classifier to our train datasets
classifier.fit(X_train,y_train)

In [13]:
# predicting the values
y_pred=classifier.predict(X_test)

In [14]:
# Obtaining the accuracy of the model
accuracy=accuracy_score(y_test,y_pred)
print(accuracy)

0.7409510290986515


### We will now try the random forest model to see whether it achieves higher accuracy

In [15]:
# initializing the classifier
classifier=RandomForestClassifier()

In [16]:
# fitting the classifier to our train datasets
classifier.fit(X_train,y_train)

In [17]:
# predicting the values
y_pred=classifier.predict(X_test)

In [18]:
# Obtaining the accuracy of the model
accuracy=accuracy_score(y_test,y_pred)
print(accuracy)

0.7955997161107168


In [19]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

          No       0.84      0.90      0.87      1061
         Yes       0.61      0.46      0.53       348

    accuracy                           0.80      1409
   macro avg       0.73      0.68      0.70      1409
weighted avg       0.78      0.80      0.79      1409



#### With the Random Forest classifier, test accuracy improved from about 0.74 (decision tree) to about 0.80, although recall for the churn class ('Yes') is only 0.46
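
Neither classifier above sets `random_state`, so rerunning the notebook may give slightly different accuracies. The sketch below shows one way to compare the two models reproducibly. It runs on synthetic data from `make_classification` rather than the Telco dataset, and the seeds and `n_estimators=50` are illustrative assumptions, so the printed scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the Telco dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Fixing random_state makes each model's result repeatable across runs.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Keeping the models in a dictionary avoids reusing a single `classifier` variable, so both fitted models remain available for further evaluation.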