In [7]:
import pandas as pd

# Load the data and drop the leftover index column from the CSV export
df = pd.read_csv('data/car-sales.csv').drop(columns=['Unnamed: 0'])
df.head()

       price  sold  models_age  km_per_year
0   30941.02     1          18  35085.22134
1   40557.96     1          20  12622.05362
2    89627.5     0          12  11440.79806
3   95276.14     0           3  43167.32682
4  117384.68     1           4   12770.1129


In [40]:
import numpy as np
from sklearn.model_selection import train_test_split

# Separating into labels and features
y = df['sold']
x = df[['price', 'models_age', 'km_per_year']]

# Separating the training and testing sets
SEED = 158020
np.random.seed(SEED)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, stratify=y)

print('We will train with {} elements and test with {} elements.'.format(len(train_x), len(test_x)))

We will train with 7500 elements and test with 2500 elements.
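
The `stratify=y` argument used above asks `train_test_split` to keep the class proportions of `y` roughly equal in both subsets. A minimal sketch of that effect follows; the labels and the placeholder feature column are hypothetical and are not taken from the car-sales data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 58% positive, 42% negative (not the real dataset)
y = np.array([1] * 580 + [0] * 420)
x = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=0
)

# With stratify=y, both subsets keep (approximately) the original class ratio
train_ratio = train_y.mean()
test_ratio = test_y.mean()
print('Positive share: full={:.3f}, train={:.3f}, test={:.3f}'.format(
    y.mean(), train_ratio, test_ratio))
```

Without `stratify`, the two shares could drift apart from split to split, which would add one more source of variation to the accuracy.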


First, let's create a dummy classifier to serve as a baseline for our model.

In [41]:
from sklearn.dummy import DummyClassifier

dummy_stratified = DummyClassifier()
dummy_stratified.fit(train_x, train_y)
accuracy = dummy_stratified.score(test_x, test_y) * 100

print("The dummy's accuracy is {:.1f}%".format(accuracy))


The dummy's accuracy is 58.0%
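
A score like this would be consistent with a dummy that always predicts the most frequent class, which is an assumption here: it depends on the default `strategy` of the installed scikit-learn version. Under that assumption, the baseline is simply the share of the majority class in the test labels. The helper `majority_baseline` and the example labels below are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy (in %) of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100 * most_common_count / len(labels)

# Hypothetical test labels: 58 sold (1) and 42 not sold (0)
example_labels = [1] * 58 + [0] * 42
baseline = majority_baseline(example_labels)
print("Majority-class baseline: {:.1f}%".format(baseline))
```

Any real model should beat this number; otherwise it has learned nothing beyond the class frequencies.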


Now let's create our decision tree model.

In [43]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

SEED = 158020
np.random.seed(SEED)

model = DecisionTreeClassifier(max_depth=2)
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions) * 100

print("The model's accuracy is {:.2f}%".format(accuracy))

The model's accuracy is 71.92%


However, what happens if we change our seed?

In [44]:
SEED = 5
np.random.seed(SEED)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, stratify=y)

print('We will train with {} elements and test with {} elements.'.format(len(train_x), len(test_x)))

model = DecisionTreeClassifier(max_depth=2)
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions) * 100

print("The model's accuracy is {:.2f}%".format(accuracy))

We will train with 7500 elements and test with 2500 elements.
The model's accuracy is 76.84%


Look how drastically it varied! Changing only the seed moved the accuracy from 71.92% to 76.84%. If, for example, our threshold for a good model were 75%, the same model would fail with one seed and make the cut with the other. We can't make important decisions based on randomness, so we must try to minimize its effects.
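
One way to see how much of the score is due to the seed is to repeat the split and training over several seeds and look at the spread. The sketch below does that on synthetic data from `make_classification`, used here as a stand-in because it is self-contained; the helper `accuracy_for_seed` is a hypothetical name, and the numbers it prints will not match the car-sales results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the car-sales data (an assumption for this sketch)
x, y = make_classification(n_samples=2000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

def accuracy_for_seed(x, y, seed):
    """Split, train and score a depth-2 tree using one specific seed."""
    train_x, test_x, train_y, test_y = train_test_split(
        x, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=2, random_state=seed)
    model.fit(train_x, train_y)
    return model.score(test_x, test_y) * 100

# Repeat the whole experiment for ten different seeds
scores = np.array([accuracy_for_seed(x, y, seed) for seed in range(10)])
print('Accuracy over {} seeds: mean {:.2f}%, std {:.2f}, min {:.2f}%, max {:.2f}%'.format(
    len(scores), scores.mean(), scores.std(), scores.min(), scores.max()))
```

Reporting the mean together with the spread gives a more honest picture than a single split, since a decision then no longer hinges on one lucky or unlucky seed.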