# Customer churn analysis

Submitted by:
<br>Nabila Tajrin Bristy
<br>Dhaka, Bangladesh

#### Objectives
- Decision Tree Classification
- Cross-Validation
- Grid Search
- Confusion Matrix, Classification report, and ROC-AUC
- Accuracy, precision, recall, and F1 score

#### Tasks
1. Load the dataset and perform data preprocessing
2. Which approach works better for this dataset? A. One-hot Encoding or B. Label Encoding
3. Perform Data Transformation (StandardScaler or MinMaxScaler). Does Data Transformation improve model performance? Is it necessary to standardize or normalize data for tree-based machine learning models?
4. Perform Grid Search and Cross-Validation with Decision Tree Classifier
5. Show a tree diagram of the Decision Tree
6. Show the Confusion Matrix, Classification report, and ROC-AUC
7. Explain accuracy, precision, recall, f1 score
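
For task 7, the four metrics can all be read off the counts in a binary confusion matrix. The sketch below uses hypothetical counts (they are illustrative and are not results from this notebook) to show how each metric is computed, treating churn as the positive class.

```python
# Hypothetical confusion-matrix counts for a binary churn classifier
# (illustrative numbers only, not produced by this notebook).
tp, fp, fn, tn = 60, 20, 40, 880

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted churners who really churned
recall = tp / (tp + fn)                      # share of actual churners the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts accuracy is high (0.94) while recall is only 0.60, which is why accuracy alone can be misleading when one class is rare.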


#### References
dataset: https://github.com/SKawsar/machine_learning_with_python/blob/main/Churn.csv <br>
Actual dataset source: https://learn.datacamp.com/courses/marketing-analytics-predicting-customer-churn-in-python

### Import required libraries and packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Load the dataset and perform data preprocessing

In [15]:
df = pd.read_csv("Churn.csv", na_values="?")
df = df.dropna()
display(df.head())
print(df.shape)

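| | Account_Length | Vmail_Message | Day_Mins | Eve_Mins | Night_Mins | Intl_Mins | CustServ_Calls | Churn | Intl_Plan | Vmail_Plan | ... | Day_Charge | Eve_Calls | Eve_Charge | Night_Calls | Night_Charge | Intl_Calls | Intl_Charge | State | Area_Code | Phone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 25 | 265.1 | 197.4 | 244.7 | 10.0 | 1 | no | no | yes | ... | 45.07 | 99 | 16.78 | 91 | 11.01 | 3 | 2.7 | KS | 415 | 382-4657 |
| 1 | 107 | 26 | 161.6 | 195.5 | 254.4 | 13.7 | 1 | no | no | yes | ... | 27.47 | 103 | 16.62 | 103 | 11.45 | 3 | 3.7 | OH | 415 | 371-7191 |
| 2 | 137 | 0 | 243.4 | 121.2 | 162.6 | 12.2 | 0 | no | no | no | ... | 41.38 | 110 | 10.3 | 104 | 7.32 | 5 | 3.29 | NJ | 415 | 358-1921 |
| 3 | 84 | 0 | 299.4 | 61.9 | 196.9 | 6.6 | 2 | no | yes | no | ... | 50.9 | 88 | 5.26 | 89 | 8.86 | 7 | 1.78 | OH | 408 | 375-9999 |
| 4 | 75 | 0 | 166.7 | 148.3 | 186.9 | 10.1 | 3 | no | yes | no | ... | 28.34 | 122 | 12.61 | 121 | 8.41 | 3 | 2.73 | OK | 415 | 330-6626 |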


(3333, 21)


#### Exploratory Data Analysis

In [4]:
df.describe()

| | Account_Length | Vmail_Message | Day_Mins | Eve_Mins | Night_Mins | Intl_Mins | CustServ_Calls | Day_Calls | Day_Charge | Eve_Calls | Eve_Charge | Night_Calls | Night_Charge | Intl_Calls | Intl_Charge | Area_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 8.09901 | 179.775098 | 200.980348 | 200.872037 | 10.237294 | 1.562856 | 100.435644 | 30.562307 | 100.114311 | 17.08354 | 100.107711 | 9.039325 | 4.479448 | 2.764581 | 437.182418 |
| std | 39.822106 | 13.688365 | 54.467389 | 50.713844 | 50.573847 | 2.79184 | 1.315491 | 20.069084 | 9.259435 | 19.922625 | 4.310668 | 19.568609 | 2.275873 | 2.461214 | 0.753773 | 42.37129 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 23.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 33.0 | 1.04 | 0.0 | 0.0 | 408.0 |
| 25% | 74.0 | 0.0 | 143.7 | 166.6 | 167.0 | 8.5 | 1.0 | 87.0 | 24.43 | 87.0 | 14.16 | 87.0 | 7.52 | 3.0 | 2.3 | 408.0 |
| 50% | 101.0 | 0.0 | 179.4 | 201.4 | 201.2 | 10.3 | 1.0 | 101.0 | 30.5 | 100.0 | 17.12 | 100.0 | 9.05 | 4.0 | 2.78 | 415.0 |
| 75% | 127.0 | 20.0 | 216.4 | 235.3 | 235.3 | 12.1 | 2.0 | 114.0 | 36.79 | 114.0 | 20.0 | 113.0 | 10.59 | 6.0 | 3.27 | 510.0 |
| max | 243.0 | 51.0 | 350.8 | 363.7 | 395.0 | 20.0 | 9.0 | 165.0 | 59.64 | 170.0 | 30.91 | 175.0 | 17.77 | 20.0 | 5.4 | 510.0 |


#### Checking categorical variables

In [5]:
df['Churn'].value_counts()

no     2850
yes     483
Name: Churn, dtype: int64

In [6]:
df['Intl_Plan'].value_counts()

no     3010
yes     323
Name: Intl_Plan, dtype: int64

In [7]:
df['Vmail_Plan'].value_counts()

no     2411
yes     922
Name: Vmail_Plan, dtype: int64

#### Checking for missing values and data types

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
Account_Length    3333 non-null int64
Vmail_Message     3333 non-null int64
Day_Mins          3333 non-null float64
Eve_Mins          3333 non-null float64
Night_Mins        3333 non-null float64
Intl_Mins         3333 non-null float64
CustServ_Calls    3333 non-null int64
Churn             3333 non-null object
Intl_Plan         3333 non-null object
Vmail_Plan        3333 non-null object
Day_Calls         3333 non-null int64
Day_Charge        3333 non-null float64
Eve_Calls         3333 non-null int64
Eve_Charge        3333 non-null float64
Night_Calls       3333 non-null int64
Night_Charge      3333 non-null float64
Intl_Calls        3333 non-null int64
Intl_Charge       3333 non-null float64
State             3333 non-null object
Area_Code         3333 non-null int64
Phone             3333 non-null object
dtypes: float64(8), int64(8), object(5)
memory usage: 546.9+ KB


#### Target variable: 'Churn'

In [9]:
print(df['Churn'].value_counts())

no     2850
yes     483
Name: Churn, dtype: int64
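
The class counts above show that the target is imbalanced. As a quick sanity check, the sketch below computes the accuracy a trivial model would reach by always predicting the majority class, using only the two counts printed above. Any classifier trained later should be judged against this baseline rather than against 50%.

```python
# Class counts copied from the value_counts() output above
counts = {"no": 2850, "yes": 483}
total = sum(counts.values())

# Accuracy of a model that always predicts the majority class ("no")
baseline_accuracy = max(counts.values()) / total
# Share of customers who churned
churn_rate = counts["yes"] / total

print(f"rows={total} baseline accuracy={baseline_accuracy:.3f} churn rate={churn_rate:.3f}")
```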


In [10]:
df = pd.read_csv("Churn.csv", na_values="?")
df = df.dropna()

display(df.sample(10))
print(df.shape)

| | Account_Length | Vmail_Message | Day_Mins | Eve_Mins | Night_Mins | Intl_Mins | CustServ_Calls | Churn | Intl_Plan | Vmail_Plan | ... | Day_Charge | Eve_Calls | Eve_Charge | Night_Calls | Night_Charge | Intl_Calls | Intl_Charge | State | Area_Code | Phone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1114 | 108 | 15 | 165.1 | 267.0 | 250.7 | 10.9 | 1 | no | no | yes | ... | 28.07 | 93 | 22.7 | 114 | 11.28 | 4 | 2.94 | TN | 408 | 352-1127 |
| 2807 | 52 | 0 | 217.0 | 152.3 | 134.3 | 11.8 | 2 | no | no | no | ... | 36.89 | 83 | 12.95 | 109 | 6.04 | 4 | 3.19 | AK | 408 | 375-5562 |
| 2240 | 78 | 0 | 147.1 | 199.7 | 160.7 | 13.7 | 0 | no | no | no | ... | 25.01 | 100 | 16.97 | 106 | 7.23 | 7 | 3.7 | WY | 415 | 399-6259 |
| 3313 | 127 | 0 | 102.8 | 143.7 | 191.4 | 10.0 | 1 | no | no | no | ... | 17.48 | 95 | 12.21 | 97 | 8.61 | 5 | 2.7 | ID | 408 | 392-5090 |
| 1393 | 170 | 0 | 246.4 | 228.1 | 166.4 | 9.1 | 0 | no | no | no | ... | 41.89 | 124 | 19.39 | 95 | 7.49 | 8 | 2.46 | NC | 415 | 366-4444 |
| 1416 | 27 | 0 | 177.6 | 296.8 | 192.9 | 7.6 | 3 | no | no | no | ... | 30.19 | 92 | 25.23 | 106 | 8.68 | 3 | 2.05 | NV | 510 | 398-7414 |
| 127 | 61 | 27 | 187.5 | 146.6 | 225.7 | 6.4 | 4 | yes | no | yes | ... | 31.88 | 103 | 12.46 | 129 | 10.16 | 6 | 1.73 | MS | 510 | 414-8718 |
| 2871 | 125 | 0 | 212.3 | 215.4 | 186.8 | 11.3 | 2 | no | no | no | ... | 36.09 | 127 | 18.31 | 73 | 8.41 | 2 | 3.05 | NC | 408 | 412-7020 |
| 487 | 76 | 0 | 204.2 | 292.6 | 244.3 | 10.5 | 0 | no | no | no | ... | 34.71 | 139 | 24.87 | 105 | 10.99 | 2 | 2.84 | IN | 415 | 363-3911 |
| 1668 | 98 | 0 | 171.7 | 174.8 | 189.6 | 7.8 | 1 | no | no | no | ... | 29.19 | 87 | 14.86 | 130 | 8.53 | 6 | 2.11 | NY | 408 | 403-4917 |


(3333, 21)



#### Create feature set and target

In [12]:
X = df.drop('Churn', axis=1)
y = df[['Churn']]

print(X.shape, y.shape)

(3333, 20) (3333, 1)


## One-hot encoding

In [13]:
# Assumption: Phone is an identifier and State is high-cardinality, so both are dropped here
X = X.drop(columns=['Phone', 'State'])

# One-hot encode the categorical plan columns; the charge columns are continuous and stay numeric
X = pd.get_dummies(X, columns=['Intl_Plan', 'Vmail_Plan'], drop_first=True)

display(X.head())
print(X.shape)
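
To make the one-hot versus label encoding question from task 2 concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` is hypothetical and only borrows the column names `Intl_Plan` and `State` from the dataset. Label encoding is done with pandas category codes, which is one of several ways to do it.

```python
import pandas as pd

# Hypothetical toy frame reusing two categorical column names from the dataset
toy = pd.DataFrame({"Intl_Plan": ["no", "yes", "no"],
                    "State": ["KS", "OH", "NJ"]})

# One-hot encoding: one indicator column per category (first level dropped)
one_hot = pd.get_dummies(toy, columns=["Intl_Plan", "State"], drop_first=True)

# Label encoding: each category is replaced by an integer code (alphabetical order)
label = toy.copy()
for col in label.columns:
    label[col] = label[col].astype("category").cat.codes

print(one_hot.columns.tolist())
print(label)
```

One-hot encoding adds a column per category and imposes no order, while label encoding keeps a single column but introduces an arbitrary numeric order. Which one works better for this dataset is an empirical question that cross-validation can answer.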


## Split the dataset into train and test set

In [14]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows; stratify so both sets keep the same churn ratio
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2, 
                                                    random_state = 42, 
                                                    stratify=y)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## Decision Tree classifier, no grid search

In [None]:
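from sklearn.tree import DecisionTreeClassifier
# Baseline: an unpruned tree with default hyperparameters; pass y as a 1-D Series
model_DT = DecisionTreeClassifier(random_state = 42)
model_DT = model_DT.fit(X_train, y_train['Churn'])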
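
Task 4 asks for grid search with cross-validation, which this notebook does not yet run. The following is a minimal sketch of how that could look with `GridSearchCV`. It uses synthetic data from `make_classification` instead of `Churn.csv`, and the parameter grid, the `roc_auc` scoring choice, and the names `X_demo`, `y_demo`, and `search` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.85, 0.15], random_state=42)

# Illustrative grid: limit depth and leaf size to control overfitting
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 20]}

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=cv, scoring="roc_auc")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `X_train` and `y_train['Churn']` would take the place of the synthetic arrays, and `search.best_estimator_` would be the tuned tree to evaluate on the test set.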

## Tree Diagram

In [None]:
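from sklearn.tree import plot_tree

# Draw only the top levels of the fitted tree to keep the figure readable
plt.figure(figsize = (15, 15))
plot_tree(model_DT, 
          filled=True,
          rounded=True,
          class_names = ["No Churn", "Yes Churn"],
          feature_names = list(X.columns),
          max_depth=2, 
          fontsize=15)

plt.show()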

## ROC: Receiver Operating Characteristic and AUC: Area Under the Curve

In [16]:

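from sklearn.metrics import roc_curve, roc_auc_score

# Encode the true labels as 0/1 and take the predicted probability of "yes"
# (classes_ is sorted alphabetically, so "yes" is the second column)
y_true = (y_test['Churn'] == 'yes').astype(int)
y_prob = model_DT.predict_proba(X_test)[:, 1]

fpr, tpr, thr = roc_curve(y_true, y_prob)
# AUC is computed from the predicted probabilities rather than hard class labels
auc = np.round(roc_auc_score(y_true, y_prob), 2)

plt.figure(figsize=(10, 8))
plt.plot(fpr, 
         tpr, 
         color='green', 
         lw=2, 
         label="Curve Area = " +str(auc))

plt.plot([0, 1], [0, 1], color='blue', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.show()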
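
Task 6 also calls for a confusion matrix and a classification report. The sketch below shows the two scikit-learn calls on a small hand-made pair of label lists. The lists `y_true_demo` and `y_pred_demo` are hypothetical and are not predictions from the model above.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels (not output of the notebook's model)
y_true_demo = ["no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred_demo = ["no", "no", "yes", "no", "yes", "no", "yes", "no"]

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["no", "yes"])
print(cm)

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true_demo, y_pred_demo, labels=["no", "yes"]))
```

For the fitted tree, the same calls would take `y_test['Churn']` and `model_DT.predict(X_test)` as arguments.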

## Data Transformation

In [23]:
# normalize the numeric feature columns to the [0, 1] range
# (when scaling for modeling, fit the scaler on the training split only)
features = X.select_dtypes(include='number').copy()
scaler = MinMaxScaler()
features = pd.DataFrame(scaler.fit_transform(features),
                        columns=features.columns,
                        index=features.index)

display(features.sample(10))

In [18]:
features.describe()
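
Task 3 asks whether scaling is necessary for tree-based models. Decision trees split on thresholds, so a monotonic rescaling such as min-max scaling should leave the chosen splits, and therefore the predictions, unchanged. The sketch below checks this on synthetic integer-valued data. The data and the names `X_demo`, `y_demo`, `tree_raw`, and `tree_scaled` are assumptions for illustration, and the result is a demonstration on this sample, not a proof.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-valued features and a label that depends on two of them
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 100, size=(400, 5)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

# Same data after min-max scaling to [0, 1]
X_scaled = MinMaxScaler().fit_transform(X_demo)

# Two trees with identical settings, one per version of the data
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_scaled, y_demo)

# Compare which feature each node splits on, and the resulting predictions
same_features = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
same_predictions = np.array_equal(tree_raw.predict(X_demo), tree_scaled.predict(X_scaled))

print("same split features:", same_features)
print("same predictions:", same_predictions)
```

Only the thresholds change, by the same rescaling as the features. Scaling matters for distance-based or gradient-based models, but for a single decision tree it is generally not needed.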