# Customer churn analysis

Submitted by:
<br>Nabila Tajrin Bristy
<br>Dhaka, Bangladesh

#### Objectives
- Decision Tree Classification
- Cross-Validation
- Grid Search
- Confusion Matrix, Classification report, and ROC-AUC
- Accuracy, precision, recall, F1 score

#### Tasks
1. Load the dataset and perform data preprocessing
2. Which approach works better for this dataset? A. One-hot Encoding or B. Label Encoding
3. Perform Data Transformation (StandardScaler or MinMaxScaler). Does Data Transformation improve model performance? Is it necessary to standardize or normalize data for tree-based machine learning models?
4. Perform Grid Search and Cross-Validation with Decision Tree Classifier
5. Show a tree diagram of the Decision Tree
6. Show the Confusion Matrix, Classification report, and ROC-AUC
7. Explain accuracy, precision, recall, f1 score


#### References
Dataset: https://github.com/SKawsar/machine_learning_with_python/blob/main/Churn.csv <br>
Original dataset source: https://learn.datacamp.com/courses/marketing-analytics-predicting-customer-churn-in-python

### Import required libraries and packages

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

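from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import roc_curve, roc_auc_score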

### Load the dataset and perform data preprocessing

In [5]:
# read the Churn.csv file
df = pd.read_csv('Churn.csv')

display(df.head(10))
print(df.shape)

Unnamed: 0,Account_Length,Vmail_Message,Day_Mins,Eve_Mins,Night_Mins,Intl_Mins,CustServ_Calls,Churn,Intl_Plan,Vmail_Plan,...,Day_Charge,Eve_Calls,Eve_Charge,Night_Calls,Night_Charge,Intl_Calls,Intl_Charge,State,Area_Code,Phone
0,128,25,265.1,197.4,244.7,10.0,1,no,no,yes,...,45.07,99,16.78,91,11.01,3,2.7,KS,415,382-4657
1,107,26,161.6,195.5,254.4,13.7,1,no,no,yes,...,27.47,103,16.62,103,11.45,3,3.7,OH,415,371-7191
2,137,0,243.4,121.2,162.6,12.2,0,no,no,no,...,41.38,110,10.3,104,7.32,5,3.29,NJ,415,358-1921
3,84,0,299.4,61.9,196.9,6.6,2,no,yes,no,...,50.9,88,5.26,89,8.86,7,1.78,OH,408,375-9999
4,75,0,166.7,148.3,186.9,10.1,3,no,yes,no,...,28.34,122,12.61,121,8.41,3,2.73,OK,415,330-6626
5,118,0,223.4,220.6,203.9,6.3,0,no,yes,no,...,37.98,101,18.75,118,9.18,6,1.7,AL,510,391-8027
6,121,24,218.2,348.5,212.6,7.5,3,no,no,yes,...,37.09,108,29.62,118,9.57,7,2.03,MA,510,355-9993
7,147,0,157.0,103.1,211.8,7.1,0,no,yes,no,...,26.69,94,8.76,96,9.53,6,1.92,MO,415,329-9001
8,117,0,184.5,351.6,215.8,8.7,1,no,no,no,...,31.37,80,29.89,90,9.71,4,2.35,LA,408,335-4719
9,141,37,258.6,222.0,326.4,11.2,0,no,yes,yes,...,43.96,111,18.87,97,14.69,5,3.02,WV,415,330-8173


(3333, 21)


#### Checking categorical variables

In [25]:
df['CustServ_Calls'].value_counts()

1    1181
2     759
0     697
3     429
4     166
5      66
6      22
7       9
9       2
8       2
Name: CustServ_Calls, dtype: int64

In [26]:
df['Churn'].value_counts()

no     2850
yes     483
Name: Churn, dtype: int64

In [27]:
df['Intl_Plan'].value_counts()

no     3010
yes     323
Name: Intl_Plan, dtype: int64

In [28]:
df['Vmail_Plan'].value_counts()

no     2411
yes     922
Name: Vmail_Plan, dtype: int64

In [29]:
df['Intl_Calls'].value_counts()

3     668
4     619
2     489
5     472
6     336
7     218
1     160
8     116
9     109
10     50
11     28
0      18
12     15
13     14
15      7
14      6
18      3
16      2
19      1
17      1
20      1
Name: Intl_Calls, dtype: int64

#### Checking for missing values and data types

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
Account_Length    3333 non-null int64
Vmail_Message     3333 non-null int64
Day_Mins          3333 non-null float64
Eve_Mins          3333 non-null float64
Night_Mins        3333 non-null float64
Intl_Mins         3333 non-null float64
CustServ_Calls    3333 non-null int64
Churn             3333 non-null object
Intl_Plan         3333 non-null object
Vmail_Plan        3333 non-null object
Day_Calls         3333 non-null int64
Day_Charge        3333 non-null float64
Eve_Calls         3333 non-null int64
Eve_Charge        3333 non-null float64
Night_Calls       3333 non-null int64
Night_Charge      3333 non-null float64
Intl_Calls        3333 non-null int64
Intl_Charge       3333 non-null float64
State             3333 non-null object
Area_Code         3333 non-null int64
Phone             3333 non-null object
dtypes: float64(8), int64(8), object(5)
memory usage: 546.9+ KB


#### Target variable: 'Churn'

In [31]:
print(df['Churn'].value_counts())

no     2850
yes     483
Name: Churn, dtype: int64
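
The counts above show an imbalanced target: roughly one customer in seven churns. As a minimal sketch, using only the counts printed above, the snippet below computes the accuracy that a trivial model would reach by always predicting "no". The variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"no": 2850, "yes": 483}, name="Churn")

# Share of each class in the dataset
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = shares.max()

print(shares.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data should be judged against this baseline, which is one reason precision, recall, and ROC-AUC are listed in the objectives alongside accuracy.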


In [18]:
# reload the data, treating "?" as a missing value, and drop incomplete rows
df = pd.read_csv("Churn.csv", na_values="?")
df = df.dropna()

display(df.sample(10))
print(df.shape)

Unnamed: 0,Account_Length,Vmail_Message,Day_Mins,Eve_Mins,Night_Mins,Intl_Mins,CustServ_Calls,Churn,Intl_Plan,Vmail_Plan,...,Day_Charge,Eve_Calls,Eve_Charge,Night_Calls,Night_Charge,Intl_Calls,Intl_Charge,State,Area_Code,Phone
1280,58,0,112.2,209.6,260.9,13.9,0,yes,no,no,...,19.07,108,17.82,78,11.74,1,3.75,NC,510,375-4107
3035,88,0,85.7,221.6,190.6,11.6,4,yes,no,no,...,14.57,70,18.84,75,8.58,3,3.13,ME,415,405-5513
206,122,0,243.8,83.9,179.8,13.7,2,no,no,no,...,41.45,72,7.13,84,8.09,8,3.7,IN,415,344-3388
293,96,37,172.7,120.1,216.1,10.3,5,yes,no,yes,...,29.36,116,10.21,86,9.72,5,2.78,CT,415,387-5860
856,104,0,183.6,120.7,215.1,12.7,1,no,no,no,...,31.21,98,10.26,112,9.68,2,3.43,WY,408,366-3917
2314,43,35,200.2,244.4,207.2,11.6,3,no,no,yes,...,34.03,88,20.77,97,9.32,4,3.13,VA,408,387-5411
2407,139,31,203.5,200.3,214.0,13.4,1,yes,yes,yes,...,34.6,72,17.03,112,9.63,6,3.62,TX,510,388-2240
1094,115,0,245.2,159.0,229.9,7.2,0,no,no,no,...,41.68,109,13.52,74,10.35,8,1.94,AK,415,333-3704
716,57,30,234.5,195.2,268.8,11.4,2,no,yes,yes,...,39.87,116,16.59,94,12.1,4,3.08,GA,408,410-3782
2705,101,0,253.2,237.9,154.3,9.7,4,no,no,no,...,43.04,114,20.22,85,6.94,7,2.62,HI,415,400-5511


(3333, 21)
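
The shape is still (3333, 21) after `dropna()`, so no rows contained a "?" placeholder. The sketch below shows what `na_values="?"` and `dropna()` would do if such entries existed. It runs on a hypothetical three-row CSV held in memory, not on Churn.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV where "?" marks a missing value
raw = io.StringIO(
    "Day_Mins,Churn\n"
    "265.1,no\n"
    "?,yes\n"
    "161.6,no\n"
)

# "?" is parsed as NaN, so Day_Mins stays a float column
toy = pd.read_csv(raw, na_values="?")
print(toy.isna().sum())      # count of missing entries per column

clean = toy.dropna()         # drops the row that contains NaN
print(toy.shape, "->", clean.shape)
```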


#### Exploratory Data Analysis

In [24]:
df.describe()

Unnamed: 0,Account_Length,Vmail_Message,Day_Mins,Eve_Mins,Night_Mins,Intl_Mins,CustServ_Calls,Day_Calls,Day_Charge,Eve_Calls,Eve_Charge,Night_Calls,Night_Charge,Intl_Calls,Intl_Charge,Area_Code
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,8.09901,179.775098,200.980348,200.872037,10.237294,1.562856,100.435644,30.562307,100.114311,17.08354,100.107711,9.039325,4.479448,2.764581,437.182418
std,39.822106,13.688365,54.467389,50.713844,50.573847,2.79184,1.315491,20.069084,9.259435,19.922625,4.310668,19.568609,2.275873,2.461214,0.753773,42.37129
min,1.0,0.0,0.0,0.0,23.2,0.0,0.0,0.0,0.0,0.0,0.0,33.0,1.04,0.0,0.0,408.0
25%,74.0,0.0,143.7,166.6,167.0,8.5,1.0,87.0,24.43,87.0,14.16,87.0,7.52,3.0,2.3,408.0
50%,101.0,0.0,179.4,201.4,201.2,10.3,1.0,101.0,30.5,100.0,17.12,100.0,9.05,4.0,2.78,415.0
75%,127.0,20.0,216.4,235.3,235.3,12.1,2.0,114.0,36.79,114.0,20.0,113.0,10.59,6.0,3.27,510.0
max,243.0,51.0,350.8,363.7,395.0,20.0,9.0,165.0,59.64,170.0,30.91,175.0,17.77,20.0,5.4,510.0


#### Checking for missing values and data types

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3333 entries, 0 to 3332
Data columns (total 21 columns):
Account_Length    3333 non-null int64
Vmail_Message     3333 non-null int64
Day_Mins          3333 non-null float64
Eve_Mins          3333 non-null float64
Night_Mins        3333 non-null float64
Intl_Mins         3333 non-null float64
CustServ_Calls    3333 non-null int64
Churn             3333 non-null object
Intl_Plan         3333 non-null object
Vmail_Plan        3333 non-null object
Day_Calls         3333 non-null int64
Day_Charge        3333 non-null float64
Eve_Calls         3333 non-null int64
Eve_Charge        3333 non-null float64
Night_Calls       3333 non-null int64
Night_Charge      3333 non-null float64
Intl_Calls        3333 non-null int64
Intl_Charge       3333 non-null float64
State             3333 non-null object
Area_Code         3333 non-null int64
Phone             3333 non-null object
dtypes: float64(8), int64(8), object(5)
memory usage: 572.9+ KB


#### Create feature set and target

In [33]:
X = df.drop('Churn', axis=1)
y = df[['Churn']]

print(X.shape, y.shape)

(3333, 20) (3333, 1)
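
Task 2 asks whether one-hot or label encoding suits this dataset. The sketch below contrasts the two on a hypothetical four-row frame: one-hot encoding adds one indicator column per category, while label encoding keeps a single integer column. The toy values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with one nominal column
toy = pd.DataFrame({"State": ["KS", "OH", "NJ", "OH"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=["State"])

# Label encoding: a single integer column (codes follow sorted order)
label = toy.copy()
label["State"] = LabelEncoder().fit_transform(toy["State"])

print(one_hot.columns.tolist())
print(label["State"].tolist())
```

scikit-learn documents `LabelEncoder` for target labels; `OrdinalEncoder` is the feature-side equivalent. Label encoding keeps the feature matrix narrow for a high-cardinality column such as `State`, but it imposes an arbitrary order on the categories. Which one works better here is an empirical question best settled by comparing cross-validated scores.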


#### One-hot encoding

In [34]:
# 'Phone' is an identifier-like string column; drop it rather than encode it
X = X.drop('Phone', axis=1)

# encode the categorical (object) columns, not the numeric charge columns
X = pd.get_dummies(X, columns=['Intl_Plan', 'Vmail_Plan', 'State'], drop_first=True)

display(X.head())
print(X.shape)


#### Split the dataset into train and test sets

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 42, 
                                                    stratify=y)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


#### Decision Tree classifier, no grid search

In [37]:
model_DT = DecisionTreeClassifier(random_state = 42)
# pass the target as a 1-D Series rather than a one-column DataFrame
model_DT = model_DT.fit(X_train, y_train['Churn'])
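
Task 4 calls for grid search with cross-validation. The following is a minimal sketch of how that could look with `GridSearchCV`. It runs on synthetic data from `make_classification` rather than on the churn features, and the parameter grid is a hypothetical starting point, not a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the encoded churn features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Hypothetical search grid; the ranges are illustrative, not tuned
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `X_demo` and `y_demo` would be replaced by the training split, so that the test set stays untouched until the final evaluation.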

#### Tree Diagram

In [38]:
plt.figure(figsize = (15, 15))
plot_tree(model_DT, 
          filled=True,
          rounded=True,
          class_names = ["No Churn", "Yes Churn"],
          feature_names = list(X.columns),
          max_depth=2, 
          fontsize=15)

plt.show()
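
When a figure is inconvenient, the same tree structure can also be printed as text with `export_text`. The sketch below assumes a small synthetic dataset and hypothetical feature names; it is not the tree fitted on the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the churn features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(4)]  # hypothetical names

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Text rendering of the split rules, one line per node
rules = export_text(tree, feature_names=feature_names)
print(rules)
```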

### ROC: Receiver Operating Characteristic and AUC: Area Under the Curve

In [40]:
# predicted probability of the positive class ('yes' is the second entry of model_DT.classes_)
y_prob = model_DT.predict_proba(X_test)[:, 1]

# encode the target as 1 = churn, 0 = no churn
y_true = (y_test['Churn'] == 'yes').astype(int)

fpr, tpr, thr = roc_curve(y_true, y_prob)
auc = np.round(roc_auc_score(y_true, y_prob), 2)

plt.figure(figsize=(10, 8))
plt.plot(fpr, 
         tpr, 
         color='green', 
         lw=2, 
         label="Curve Area = " +str(auc))

plt.plot([0, 1], [0, 1], color='blue', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.show()
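
Tasks 6 and 7 ask for the confusion matrix, the classification report, and an explanation of accuracy, precision, recall, and F1 score. The worked example below derives each metric by hand from the four cells of a confusion matrix. The ten labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: 1 = churn, 0 = no churn
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # share of all predictions that are right
precision = tp / (tp + fp)                          # of predicted churners, how many churned
recall = tp / (tp + fn)                             # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
print(classification_report(y_true, y_pred, target_names=["No Churn", "Yes Churn"]))
```

Here the matrix holds 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, which gives an accuracy of 0.8 and a precision, recall, and F1 score of 0.75 each for the churn class.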

#### Data Transformation

In [51]:
# normalize the numeric feature columns to the [0, 1] range
scaler = MinMaxScaler()

# 'features' holds only the numeric columns of X
features = X.select_dtypes(include='number')
features = pd.DataFrame(scaler.fit_transform(features),
                        columns=features.columns,
                        index=features.index)

display(features.sample(10))

In [52]:
features.describe()
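
Task 3 asks whether scaling is necessary for tree-based models. A decision tree splits on thresholds, so a monotonic rescaling such as `MinMaxScaler` preserves the ordering of values and is expected to leave the learned partitions unchanged. The sketch below checks this on synthetic data (an assumption, not the churn set) by comparing the predictions of a tree trained on raw features with those of one trained on scaled features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the churn features
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, to avoid leaking test statistics
scaler = MinMaxScaler().fit(X_tr)

# Same model and seed, once on raw and once on scaled features
raw_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
scaled_tree = DecisionTreeClassifier(random_state=42).fit(scaler.transform(X_tr), y_tr)

raw_pred = raw_tree.predict(X_te)
scaled_pred = scaled_tree.predict(scaler.transform(X_te))

# Share of test rows where both trees give the same prediction
agreement = np.mean(raw_pred == scaled_pred)
print(f"Share of identical predictions: {agreement:.3f}")
```

If the two trees agree on essentially every row, as expected up to floating-point effects, scaling neither helps nor hurts the decision tree. It still matters for distance-based or gradient-based models.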