# Overview
This project builds a model that predicts which customers are likely to stop doing business with Syria Tel, a telecommunications company that is losing significant revenue to customer churn.

# Business problem
Syria Tel is losing revenue because customers are leaving.

# Objective
1. Predict whether a customer will leave so that Syria Tel can intervene early and reduce revenue loss.
2. Build a model with at least 92% accuracy.
3. Build a model that misses few churners (few false negatives).
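The two measurable objectives can be written down as a small check. The sketch below is illustrative only: the helper name meets_objectives and the max_missed limit are assumptions, since the notebook does not say how many missed churns count as "few", and the labels are toy values rather than the project data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def meets_objectives(y_true, y_pred, min_accuracy=0.92, max_missed=None):
    """Report accuracy, missed churns (false negatives) and whether targets are met.

    max_missed is a hypothetical limit; pass None to skip that check.
    """
    acc = accuracy_score(y_true, y_pred)
    # With labels=[0, 1] the matrix is laid out as [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    ok = acc >= min_accuracy and (max_missed is None or fn <= max_missed)
    return {"accuracy": float(acc), "missed_churns": int(fn), "meets_targets": bool(ok)}

# Toy labels: one of the two churners (class 1) is missed.
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
result = meets_objectives(y_true, y_pred, max_missed=1)
print(result)
```

In this toy case the accuracy is 9 out of 10, which is below the 92% target even though only one churner is missed.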

# Importing libraries

In [194]:
# imports

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.linear_model import LogisticRegression,LinearRegression,Ridge
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


# Loading the dataset using pandas and cleaning the dataset
The cell below loads the CSV file with pandas.
We then use .head() to view the first 5 rows of the dataset.

In [195]:
df=pd.read_csv("bigml_59c28831336c6604c800002a.csv")

In [196]:
df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


### inspecting the dataset with .info()
The output below shows that there are 3333 rows and 21 columns in total.
There are no missing values in this dataset, since every column has a non-null count of 3333. The output also shows the data type of each column.

In [197]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   
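The observation that no values are missing can also be checked directly instead of reading the non-null counts by eye. A minimal sketch on a made-up frame (the toy data below is an assumption for illustration, not the churn data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data; one value is deliberately missing.
toy = pd.DataFrame({
    "account length": [128, 107, np.nan],
    "international plan": ["no", "yes", "no"],
})

missing_per_column = toy.isna().sum()        # number of missing values in each column
has_missing = bool(toy.isna().any().any())   # True if any cell at all is missing
print(missing_per_column)
print("any missing:", has_missing)
```

Applied to df, the same two calls would be expected to report zero for every column if the .info() reading above is right.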

### Dropping the non-useful columns

In [198]:
df.drop(columns=["phone number"], inplace=True)

converting churn to binary

In [199]:
df['churn'] = df['churn'].map({True: 1 , False: 0})

### encoding the categorical variables
The categorical columns are one-hot encoded with get_dummies(); drop_first=True drops the first level of each variable.

In [200]:
df=pd.get_dummies(df, columns=["state","international plan","voice mail plan"], drop_first=True)

In [201]:
df.head()

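| | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | ... | state_TX | state_UT | state_VA | state_VT | state_WA | state_WI | state_WV | state_WY | international plan_yes | voice mail plan_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 415 | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 107 | 415 | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 137 | 415 | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 84 | 408 | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 75 | 415 | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |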
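To see what get_dummies() with drop_first=True does, it can help to run it on a tiny frame. The sketch below uses made-up rows (the toy frame is an assumption, not the churn data) with two of the same column names:

```python
import pandas as pd

# Toy frame with two categorical columns.
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "international plan": ["no", "no", "no", "yes"],
})

# Each column becomes one indicator column per level, minus the first level
# (in sorted order), which serves as the reference category.
encoded = pd.get_dummies(toy, columns=["state", "international plan"], drop_first=True)
print(encoded.columns.tolist())
```

Here "KS" and "no" are the dropped reference levels, so a row with zeros in every state column is a KS row.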


### identifying the features and the target
Splitting the data into the features X and the target y.

In [202]:
X = df.drop(columns=["churn"])
y = df["churn"]

### train-test split
The test set is 20% of the data and the training set is 80%, with random_state=42 for reproducibility. stratify=y ensures that each class appears in the same proportion in y_train and y_test.

In [203]:
X_train ,X_test ,y_train ,y_test = train_test_split(X,y,test_size=0.2, random_state=42, stratify=y)
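The effect of stratify=y can be checked on a small example. The sketch below uses a made-up target with 15% positives (an assumption chosen to resemble an imbalanced churn label) and compares the positive rate in the two parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 15 positives and 85 negatives, with a dummy single-column feature matrix.
y_toy = np.array([1] * 15 + [0] * 85)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Both parts keep the 15% positive rate of the full toy target; without stratify the two rates could drift apart by chance.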

### scaling
The scaler is fitted on the training data only, and the same scaling is then applied to the test data. This prevents data leakage and keeps the model's evaluation trustworthy.

In [204]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
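One way to make the fit-on-training-data-only rule harder to break is to bundle the scaler and the model in a scikit-learn pipeline. This is a sketch on synthetic data, not what the notebook does; X_demo and y_demo are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the churn features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Inside cross-validation the whole pipeline is refit on the training part of
# each fold, so the scaler never sees the held-out rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean())
```

The same pipeline object could be fitted on X_train and scored on X_test in place of the separate scaler and model steps.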

### logistic regression.

In [205]:
model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)

### making prediction using the model.
0 being customers who stay and 1 being customers who leave

In [206]:
y_pred = model.predict(X_test_scaled)
y_pred[:10]



array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0], dtype=int64)

### evaluating the model.

In [207]:
print("classification Report:\n",classification_report(y_test, y_pred))
print("accuracy:",accuracy_score(y_test, y_pred))
print("confusion matrix:\n",confusion_matrix(y_test, y_pred))

classification Report:
               precision    recall  f1-score   support

           0       0.94      0.75      0.84       570
           1       0.33      0.70      0.45        97

    accuracy                           0.75       667
   macro avg       0.63      0.73      0.64       667
weighted avg       0.85      0.75      0.78       667

accuracy: 0.7466266866566716
confusion matrix:
 [[430 140]
 [ 29  68]]


# Results
### class 0 = stayed
1. Precision: 94% of the customers predicted to stay actually stayed.
2. Recall: the model correctly identifies 75% of the customers who stayed.
3. F1-score: 0.84, good overall performance.

### class 1 = left
1. Precision: only 33% of the customers predicted to leave actually left.
2. Recall: the model identifies 70% of the customers who left.
3. F1-score: 0.45, poor performance.

### in the confusion matrix
1. 68 customers who left were correctly identified.
2. 29 customers were predicted to stay but actually left. These missed churners represent lost revenue that the model did not anticipate.
3. 140 customers were predicted to leave but actually stayed.
4. 430 customers were predicted to stay and actually stayed.
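The class 1 figures in the classification report follow directly from the logistic regression confusion matrix. The sketch below recomputes them from the four counts reported above; only the variable names are new.

```python
import numpy as np

# Logistic regression confusion matrix from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[430, 140],
               [29, 68]])
tn, fp, fn, tp = cm.ravel()

precision_churn = tp / (tp + fp)   # share of predicted churners who really left
recall_churn = tp / (tp + fn)      # share of actual churners the model caught
accuracy = (tp + tn) / cm.sum()    # share of all predictions that were correct

print(round(precision_churn, 2), round(recall_churn, 2), round(accuracy, 4))
```

The low precision comes from the 140 false alarms, which outnumber the 68 correctly flagged churners by about two to one.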

### probability prediction

In [208]:
y_pred = model.predict_proba(X_test_scaled)[:,1]
y_pred[:10]

array([0.65087501, 0.51124094, 0.52263669, 0.41977728, 0.81813187,
       0.30830243, 0.66074109, 0.37146537, 0.18929149, 0.38182787])
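Predicted probabilities make it possible to choose a cut-off other than 0.5. The sketch below reuses the ten probabilities printed above; the 0.4 cut-off is a hypothetical value chosen for illustration, not a tuned threshold.

```python
import numpy as np

# The first ten churn probabilities from the output above.
proba = np.array([0.65087501, 0.51124094, 0.52263669, 0.41977728, 0.81813187,
                  0.30830243, 0.66074109, 0.37146537, 0.18929149, 0.38182787])

default_labels = (proba >= 0.5).astype(int)   # labels at the usual 0.5 cut-off
lenient_labels = (proba >= 0.4).astype(int)   # hypothetical lower cut-off
print(default_labels)
print(lenient_labels)
```

At 0.5 the labels match the first ten class predictions shown earlier. Lowering the cut-off flags one more customer as a likely churner, which would tend to reduce missed churns at the cost of more false alarms.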

### ROC-AUC score

In [209]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)

0.798046663049376
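The ROC-AUC can be read as the chance that a randomly chosen churner receives a higher predicted probability than a randomly chosen non-churner. The sketch below checks this reading on four made-up labels and scores (assumptions for illustration, not the test set):

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Same quantity by brute force: the fraction of (churner, non-churner) pairs
# in which the churner has the higher score, counting ties as half.
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
manual_auc = sum(pairs) / len(pairs)
print(auc, manual_auc)
```

Under this reading, a score of about 0.80 means the logistic model ranks a churner above a non-churner in roughly four out of five such pairs.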

### logistic regression coefficient

In [210]:
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
coefficients

| | Feature | Coefficient |
|---|---|---|
| 15 | customer service calls | 0.851779 |
| 66 | international plan_yes | 0.743154 |
| 2 | number vmail messages | 0.668164 |
| 3 | total day minutes | 0.340258 |
| 5 | total day charge | 0.336771 |
| ... | ... | ... |
| 61 | state_VT | -0.096853 |
| 60 | state_VA | -0.125560 |
| 13 | total intl calls | -0.172732 |
| 26 | state_HI | -0.186585 |


A positive coefficient increases the churn probability.
A negative coefficient decreases the churn probability.
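Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to interpret. The sketch below uses three of the coefficients listed above; because the features were standardised, each ratio refers to a one-standard-deviation increase in that feature.

```python
import numpy as np
import pandas as pd

# Three coefficients copied from the table above.
coefs = pd.Series({
    "customer service calls": 0.851779,
    "international plan_yes": 0.743154,
    "state_HI": -0.186585,
})

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-standard-deviation increase in the (scaled) feature.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

A ratio above 1 corresponds to a positive coefficient and a ratio below 1 to a negative one.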


# Decision tree model

### importing the decision tree classifier

In [211]:
from sklearn.tree import DecisionTreeClassifier

In [212]:
X = df.drop(columns=["churn"])
y = df["churn"]

In [213]:
X_train ,X_test ,y_train ,y_test = train_test_split(X,y,test_size=0.2, random_state=42, stratify=y)

### creating the model

In [214]:
df_model = DecisionTreeClassifier(criterion="gini",max_depth=5,class_weight='balanced',random_state=42)

### train(fit) the model

In [215]:
df_model.fit(X_train, y_train)  # a decision tree can capture non-linear relationships

DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=42)

### making predictions

In [216]:
y_pred2=df_model.predict(X_test)
y_pred2[:10]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)

### evaluating the model

In [217]:
print("classification Report:\n",classification_report(y_test, y_pred2))
print("accuracy:",accuracy_score(y_test, y_pred2))
print("confusion matrix:\n",confusion_matrix(y_test, y_pred2))

classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94       570
           1       0.66      0.72      0.69        97

    accuracy                           0.91       667
   macro avg       0.81      0.83      0.82       667
weighted avg       0.91      0.91      0.91       667

accuracy: 0.9055472263868066
confusion matrix:
 [[534  36]
 [ 27  70]]


# Results
### class 0 = stayed
1. Precision: 95% of the customers predicted to stay actually stayed.
2. Recall: the model correctly identifies 94% of the customers who stayed.
3. F1-score: 0.94, excellent performance.

### class 1 = left
1. Precision: 66% of the customers predicted to leave actually left.
2. Recall: the model correctly identifies 72% of the customers who left.
3. F1-score: 0.69, a good balance between precision and recall.

### confusion matrix
1. 70 customers were predicted to leave and actually left.
2. 36 customers were predicted to leave but actually stayed.
3. 27 customers were predicted to stay but actually left.
4. 534 customers were predicted to stay and actually stayed.

Compared with the logistic regression model, the decision tree flags slightly more churners (70 versus 68) while raising far fewer false alarms (36 versus 140), which keeps unnecessary retention effort low.
It is also more accurate than the logistic model: about 91% versus 75%.

# Tuning the decision tree model.

In [218]:
from sklearn.model_selection import GridSearchCV

### defining the grid parameters

In [219]:
param_grid = {
    'max_depth': [3, 5,7,10, None],
    'min_samples_split': [2,5,10],
    'min_samples_leaf': [1,2,4],
    'criterion': ['gini','entropy']
}

### Running GridSearchCV

In [220]:
dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 7, 10, None],
                         'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10]},
             scoring='accuracy')
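The size of the search follows from the grid: every combination of the four parameter lists is one candidate, and each candidate is fitted once per cross-validation fold. The sketch below counts them with ParameterGrid, using the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 3 * 3 * 2 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```

With the default refit=True, GridSearchCV also fits the best candidate once more on the full training set.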

### The best model

In [221]:
best_model = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)

Best parameters: {'criterion': 'entropy', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 10}
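Because one objective is to miss few churners, the search could also be scored on recall for class 1 rather than on accuracy. This is an alternative sketch on synthetic data, not what the notebook ran; X_demo, y_demo and the shortened grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the churn set (about 15% positives).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85], random_state=42
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid={"max_depth": [3, 5, 7]},
    cv=5,
    scoring="recall",   # select the candidate with the best recall on class 1
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Optimising recall alone can favour trees that raise many false alarms, so a combined metric such as scoring="f1" is another option to consider.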


### Evaluating the tuned tree

In [222]:
y_pred_tuned = best_model.predict(X_test)
print("Tuned Decision Tree accuracy:", accuracy_score(y_test, y_pred_tuned))
print("classification Report:\n",classification_report(y_test, y_pred_tuned))
print("confusion matrix:\n",confusion_matrix(y_test, y_pred_tuned))

Tuned Decision Tree accuracy: 0.9220389805097451
classification Report:
               precision    recall  f1-score   support

           0       0.95      0.96      0.95       570
           1       0.76      0.67      0.71        97

    accuracy                           0.92       667
   macro avg       0.85      0.82      0.83       667
weighted avg       0.92      0.92      0.92       667

confusion matrix:
 [[550  20]
 [ 32  65]]


# Results
### class 0 = stayed
1. Precision: 95% of the customers predicted to stay actually stayed.
2. Recall: the model correctly identifies 96% of the customers who stayed.
3. F1-score: 0.95, stable performance.

### class 1 = left
1. Precision: 76% of the customers predicted to leave actually left.
2. Recall: the model correctly identifies 67% of the customers who left.
3. F1-score: 0.71, a good balance.

### confusion matrix
1. 65 customers were predicted to leave and actually left.
2. 32 customers were predicted to stay but actually left.
3. 20 customers were predicted to leave but actually stayed.
4. 550 customers were predicted to stay and actually stayed.

In [223]:
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': best_model.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
feature_importances.head(10)

| | Feature | Importance |
|---|---|---|
| 3 | total day minutes | 0.264661 |
| 15 | customer service calls | 0.154859 |
| 66 | international plan_yes | 0.118831 |
| 8 | total eve charge | 0.097312 |
| 12 | total intl minutes | 0.092751 |
| 13 | total intl calls | 0.06442 |
| 2 | number vmail messages | 0.045838 |
| 6 | total eve minutes | 0.040561 |
| 9 | total night minutes | 0.031924 |
| 67 | voice mail plan_yes | 0.022424 |


The tuned model reaches 92% accuracy, up from about 91% for the untuned tree, and raises fewer false alarms (20 versus 36). It misses slightly more churners (32 versus 27), so the gain in accuracy comes with somewhat lower churn recall.

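# Recommendation
1. Syria Tel should use the model with the highest accuracy, which is the tuned decision tree (92%).
2. They should use a model that predicts churn correctly at least 75% of the time; when the tuned decision tree predicts churn, it is right 76% of the time.
3. They should prefer a model that misses few churners. The tuned decision tree misses 32 of the 97 churners in the test set, compared with 27 for the untuned tree, so this trade-off should be weighed against the gain in accuracy.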
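To weigh these criteria side by side, the three test-set confusion matrices reported earlier can be summarised in one table. The sketch below uses only those reported counts; the variable names are new.

```python
import numpy as np
import pandas as pd

# Test-set confusion matrices from the outputs above, each as [[TN, FP], [FN, TP]].
matrices = {
    "logistic regression": [[430, 140], [29, 68]],
    "decision tree": [[534, 36], [27, 70]],
    "tuned decision tree": [[550, 20], [32, 65]],
}

rows = []
for name, cm in matrices.items():
    tn, fp, fn, tp = np.array(cm).ravel()
    rows.append({
        "model": name,
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "churn precision": tp / (tp + fp),
        "churn recall": tp / (tp + fn),
        "missed churns": int(fn),
    })

summary = pd.DataFrame(rows).set_index("model").round(3)
print(summary)
```

The table makes the trade-off explicit: the tuned tree leads on accuracy and churn precision, while the untuned tree misses the fewest churners.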

# Conclusion
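The tuned decision tree achieves the highest accuracy of the three models (92%) and meets the accuracy objective while still identifying about two thirds of the customers who leave, making it the most effective option for balancing revenue protection against customer retention cost.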