# Classification models

### The Dataset

The dataset is a collection of used cars for sale in the US.

The goal is to analyze the data and build a classification model that predicts the price category of a used car from its characteristics.

NAVIGATION

[<- Data Analysis](retro_cars_analysis.ipynb)

-
-
-

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('data/df_encoded.csv', sep=',')
df.shape

(9997, 153)

### Splitting the dataset into training and test sets

In [3]:
x = df.drop(['price_cat_num', 'scaled_price'], axis=1)
y = df['price_cat_num']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [4]:
x_train.shape, x_test.shape

((6997, 151), (3000, 151))
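The split above is purely random. If the price categories are imbalanced (an assumption, since the class counts are not shown here), passing `stratify=y` keeps the class proportions roughly equal in both parts. A minimal sketch on synthetic data standing in for `df_encoded.csv`; the names `x_demo`, `y_demo`, `x_tr`, and `x_te` are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded car data (assumption: three imbalanced classes).
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42
)

# stratify=y_demo preserves the class proportions in the training and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Share of each class in both parts; the two vectors should be nearly identical.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3), test_share.round(3))
```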

-
-
-

### Decision Tree Classifier

In [24]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
dtc_predicted_test = dtc.predict(x_test)
dtc_accuracy = (accuracy_score(y_test, dtc_predicted_test)*100).round(2)
print(f'Accuracy of Decision Tree Classifier: {dtc_accuracy}%')

Accuracy of Decision Tree Classifier: 70.47%


-
-
-

### Random Forest Classifier

In [25]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfc_predicted_test = rfc.predict(x_test)
rfc_accuracy = (accuracy_score(y_test, rfc_predicted_test)*100).round(2)
print(f'Accuracy of Random Forest Classifier: {rfc_accuracy}%')

Accuracy of Random Forest Classifier: 77.7%


-

#### Hyperparameter optimization of the Random Forest Classifier with randomized search

In [7]:
from sklearn.model_selection import RandomizedSearchCV

In [8]:
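param_grid_rfc = {
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': list(range(1, 5)),
    'min_samples_split': list(range(2, 5)),
    # NOTE: random_state is a seed, not a real hyperparameter;
    # searching over it mostly measures run-to-run noise.
    'random_state': list(range(30, 50)),
    'n_estimators': list(range(100, 401, 100))
}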

In [9]:
randomized_search_rfc = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=param_grid_rfc,
    n_iter=10,
    scoring='accuracy',
    verbose=1,
    n_jobs=-1
)

In [10]:
randomized_search_rfc.fit(x_train, y_train)
best_params_rfc = randomized_search_rfc.best_params_
best_params_rfc

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'random_state': 48,
 'n_estimators': 300,
 'min_samples_split': 3,
 'min_samples_leaf': 1,
 'max_features': 'sqrt'}

In [26]:
rfc_tuned = RandomForestClassifier(
    # random_state = best_params_rfc['random_state'],  # the search selected 48
    random_state = 40,  # a fixed seed is used instead of the searched value
    n_estimators = best_params_rfc['n_estimators'], 
    min_samples_split = best_params_rfc['min_samples_split'], 
    min_samples_leaf = best_params_rfc['min_samples_leaf'], 
    max_features = best_params_rfc['max_features']
    )

rfc_tuned.fit(x_train, y_train)
rfc_tuned_predicted_test = rfc_tuned.predict(x_test)
rfc_tuned_accuracy = (accuracy_score(y_test, rfc_tuned_predicted_test)*100).round(2)
print(f'Accuracy of Tuned Random Forest Classifier: {rfc_tuned_accuracy}%')

Accuracy of Tuned Random Forest Classifier: 78.23%
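With the default `refit=True`, `RandomizedSearchCV` refits the best candidate on the whole training set, so the tuned model can be taken from `best_estimator_` rather than rebuilt by hand. The sketch below also fixes the forest's seed instead of searching over it and sets `random_state` on the search so the sampled candidates are reproducible. It runs on synthetic data with a reduced search space, so its numbers say nothing about the car dataset; `x_demo`, `y_demo`, `search`, and `best_rfc` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the encoded car data (assumption: three classes).
x_demo, y_demo = make_classification(
    n_samples=400, n_features=15, n_informative=5, n_classes=3, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),  # seed fixed, not tuned
    param_distributions={
        'max_features': ['sqrt', 'log2'],
        'min_samples_leaf': [1, 2, 3, 4],
        'min_samples_split': [2, 3, 4],
        'n_estimators': [50, 100],
    },
    n_iter=5,
    scoring='accuracy',
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(x_tr, y_tr)

# best_estimator_ is already refitted on x_tr with the best parameters found.
best_rfc = search.best_estimator_
acc = round(accuracy_score(y_te, best_rfc.predict(x_te)) * 100, 2)
print(search.best_params_, acc)
```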


-
-
-

### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression
logr = LogisticRegression()
logr.fit(x_train, y_train)
logr_predicted_test = logr.predict(x_test)
logr_accuracy = (accuracy_score(y_test, logr_predicted_test)*100).round(2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [28]:
print(f'Accuracy of Logistic Regression: {logr_accuracy}%')

Accuracy of Logistic Regression: 68.1%


In [14]:
logr.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
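The convergence warning above suggests two remedies: raising `max_iter` (the default is 100, as the parameters show) or scaling the features. A minimal sketch of both combined in one pipeline, on synthetic data rather than the car dataset; `logr_scaled` and the other names are illustrative, and the result on the real data may differ:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded car data (assumption: three classes).
x_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training part only, then applied to the test part.
logr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logr_scaled.fit(x_tr, y_tr)

acc = round(accuracy_score(y_te, logr_scaled.predict(x_te)) * 100, 2)
# n_iter_ holds the number of solver iterations actually used.
n_iter_used = logr_scaled.named_steps['logisticregression'].n_iter_
print(f'Accuracy: {acc}%, iterations used: {n_iter_used}')
```

If `n_iter_used` stays below `max_iter`, the solver stopped because it converged rather than because it hit the iteration limit.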

-

#### Hyperparameter optimization of Logistic Regression with randomized search

In [18]:
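param_grid_logr = {
    'C': list(range(1, 5)),
    # NOTE: the default 'lbfgs' solver supports only the 'l2' penalty,
    # so candidates sampled with 'l1' fail to fit.
    'penalty': ['l1', 'l2'],
    'max_iter': list(range(100, 501, 100)),
    # NOTE: random_state has no effect with the 'lbfgs' solver.
    'random_state': list(range(20, 60, 10)),
}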

In [19]:
randomized_search_logr = RandomizedSearchCV(
    estimator=logr,
    param_distributions=param_grid_logr,
    n_iter=10,
    scoring='accuracy',
    verbose=1,
    n_jobs=-1
)

In [20]:
randomized_search_logr.fit(x_train, y_train)
best_params_logr = randomized_search_logr.best_params_
best_params_logr

Fitting 5 folds for each of 10 candidates, totalling 50 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

{'random_state': 40, 'penalty': 'l2', 'max_iter': 400, 'C': 2}
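The search above still hits the iteration limit, and part of its budget goes to parameters that do not help: `'l1'` is incompatible with the default `lbfgs` solver, and `random_state` is ignored by it. A hedged alternative is to search only over `C`, drawn from a log-uniform distribution, inside a pipeline that scales the features. This is a sketch on synthetic data, not the notebook's method; `pipe`, `search`, and the step names `scale` and `logr` are illustrative.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded car data (assumption: three classes).
x_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=6, n_classes=3, random_state=42
)

# Scaling inside the pipeline means each CV fold is scaled on its own training part.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('logr', LogisticRegression(max_iter=1000)),
])

search = RandomizedSearchCV(
    estimator=pipe,
    # 'logr__C' addresses parameter C of the pipeline step named 'logr'.
    param_distributions={'logr__C': loguniform(1e-2, 1e2)},
    n_iter=8,
    scoring='accuracy',
    cv=5,
    random_state=42,  # reproducible sampling of candidates
)
search.fit(x_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```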

In [29]:
logr_tuned = LogisticRegression(
    C = best_params_logr['C'],
    random_state = best_params_logr['random_state'],
    penalty = best_params_logr['penalty'],
    max_iter = best_params_logr['max_iter']
    )

logr_tuned.fit(x_train, y_train)
logr_tuned_predicted_test = logr_tuned.predict(x_test)
logr_tuned_accuracy = (accuracy_score(y_test, logr_tuned_predicted_test)*100).round(2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [30]:
print(f'Accuracy of Tuned Logistic Regression: {logr_tuned_accuracy}%')

Accuracy of Tuned Logistic Regression: 69.23%


-
-
-

### MLPClassifier

In [72]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42, max_iter=500) 
mlp.fit(x_train, y_train)
mlp_predicted_test = mlp.predict(x_test)
mlp_accuracy = (accuracy_score(y_test, mlp_predicted_test)*100).round(2)
print(f'Accuracy of MLPClassifier: {mlp_accuracy}%')

Accuracy of MLPClassifier: 69.03%


In [43]:
mlp.n_layers_

3

In [47]:
mlp.get_params()

{'activation': 'relu',
 'alpha': 0.0001,
 'batch_size': 'auto',
 'beta_1': 0.9,
 'beta_2': 0.999,
 'early_stopping': False,
 'epsilon': 1e-08,
 'hidden_layer_sizes': (100,),
 'learning_rate': 'constant',
 'learning_rate_init': 0.001,
 'max_fun': 15000,
 'max_iter': 500,
 'momentum': 0.9,
 'n_iter_no_change': 10,
 'nesterovs_momentum': True,
 'power_t': 0.5,
 'random_state': 42,
 'shuffle': True,
 'solver': 'adam',
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': False,
 'warm_start': False}

-
-
-

### Voting Ensemble

In [76]:
pred_df = pd.DataFrame({'Decision Tree Classifier': dtc_predicted_test,
                        'Random Forest Classifier': rfc_predicted_test,
                        'Tuned Random Forest Classifier': rfc_tuned_predicted_test,
                        'Logistic Regression': logr_predicted_test,
                        'Tuned Logistic Regression': logr_tuned_predicted_test,
                        'MLP Classifier': mlp_predicted_test})

pred_df.head()

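|   | Decision Tree Classifier | Random Forest Classifier | Tuned Random Forest Classifier | Logistic Regression | Tuned Logistic Regression | MLP Classifier |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2 | 2 | 2 | 2 |
| 1 | 3 | 2 | 2 | 2 | 2 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 2 | 2 | 2 | 3 | 3 | 3 |
| 4 | 3 | 3 | 3 | 3 | 3 | 3 |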


In [77]:
pred_df.shape

(3000, 6)

In [78]:
pred_df['voting_result'] = [pred_df.iloc[x, :].mode()[0] for x in range(len(pred_df))]
pred_df.head()

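|   | Decision Tree Classifier | Random Forest Classifier | Tuned Random Forest Classifier | Logistic Regression | Tuned Logistic Regression | MLP Classifier | voting_result |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| 1 | 3 | 2 | 2 | 2 | 2 | 1 | 2 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 2 | 2 | 2 | 3 | 3 | 3 | 2 |
| 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |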


In [81]:
accuracy_voting = (accuracy_score(y_test, pred_df['voting_result'])*100).round(2)
print(f'Accuracy of Voting Ensemble on test data: {accuracy_voting}%')

Accuracy of Voting Ensemble on test data: 74.37%
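The manual row-wise mode above is a form of hard (majority) voting. scikit-learn provides the same idea as `VotingClassifier`, which fits the base models itself and, like `mode()[0]`, resolves ties in favor of the smallest class label. A minimal sketch on synthetic data; the choice of three base models and all variable names are assumptions for illustration, not the ensemble used above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded car data (assumption: three classes).
x_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=42
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42
)

# voting='hard' predicts the class chosen by the majority of base models.
voting = VotingClassifier(
    estimators=[
        ('dtc', DecisionTreeClassifier(random_state=42)),
        ('rfc', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('logr', LogisticRegression(max_iter=1000)),
    ],
    voting='hard',
)
voting.fit(x_tr, y_tr)
voting_pred = voting.predict(x_te)

voting_acc = round(accuracy_score(y_te, voting_pred) * 100, 2)
print(f'Accuracy of VotingClassifier on synthetic data: {voting_acc}%')
```

An odd number of distinct base models makes ties less frequent than with six voters, two pairs of which are tuned and untuned versions of the same algorithm.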


-
-
-
-
-

In [82]:
# Map model name -> accuracy, so two models with equal accuracy cannot overwrite each other.
models_dict = {'Decision Tree Classifier': dtc_accuracy,
               'Random Forest Classifier': rfc_accuracy,
               'Tuned Random Forest Classifier': rfc_tuned_accuracy,
               'Logistic Regression': logr_accuracy,
               'Tuned Logistic Regression': logr_tuned_accuracy,
               'MLP Classifier': mlp_accuracy,
               'Voting Ensemble': accuracy_voting}
models_dict

{'Decision Tree Classifier': 70.47,
 'Random Forest Classifier': 77.7,
 'Tuned Random Forest Classifier': 78.23,
 'Logistic Regression': 68.1,
 'Tuned Logistic Regression': 69.23,
 'MLP Classifier': 69.03,
 'Voting Ensemble': 74.37}

In [83]:
# Pick the model name whose accuracy is highest.
best_model = max(models_dict, key=models_dict.get)
print(f'The best model is {best_model}, accuracy = {models_dict[best_model]}')

The best model is Tuned Random Forest Classifier, accuracy = 78.23

