In [32]:
import pandas as pd

df = pd.read_csv('data/car-sales.csv').drop(columns=['Unnamed: 0'], axis=1)
df.head()

Unnamed: 0,price,sold,models_age,km_per_year
0,30941.02,1,18,35085.22134
1,40557.96,1,20,12622.05362
2,89627.5,0,12,11440.79806
3,95276.14,0,3,43167.32682
4,117384.68,1,4,12770.1129


In [33]:
import numpy as np
from sklearn.model_selection import train_test_split

# Separating into labels and features
y = df['sold']
x = df[['price', 'models_age', 'km_per_year']]

# Separating the training and testing sets
SEED = 158020
np.random.seed(SEED)
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, stratify=y)

print('We will train with {} elements and train with {} elements.'.format(len(train_x), len(test_x)))

We will train with 7500 elements and train with 2500 elements.


First let's create a dummy classifier to compare to our model

In [34]:
from sklearn.dummy import DummyClassifier

dummy_stratified = DummyClassifier()
dummy_stratified.fit(train_x, train_y)
accuracy = dummy_stratified.score(test_x, test_y) * 100

print("The dummy's accuracy is {:.1f}%".format(accuracy))


The dummy's accuracy is 58.0%


Now creating our model

TO understand how the Decision Tree Classifier works, here's a very good video explaining it: https://www.youtube.com/watch?v=ZVR2Way4nwQ

In [35]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

SEED = 158020
np.random.seed(SEED)

model = DecisionTreeClassifier(max_depth=2)
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions) * 100

print("The model's accuracy is {:.2f}%".format(accuracy))

The model's accuracy is 71.92%


However, what happens if we change our seed?

In [36]:
SEED = 5 # Changing our seed to another number
np.random.seed(SEED)

# Spliting our data into training and testing sets
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, stratify=y)

print('We will train with {} elements and train with {} elements.'.format(len(train_x), len(test_y)))

# Creating our model (Tree Classifier) and training it
model = DecisionTreeClassifier(max_depth=2)
model.fit(train_x, train_y)

# Making predictions with our model
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions) * 100

print("The model's accuracy is {:.2f}%".format(accuracy))

We will train with 7500 elements and train with 2500 elements.
The model's accuracy is 76.84%


Look how it varied drastically! If, for example, our baseline for a good model was 75%, simply changing the seed would mke the cut! We can't make important decisions based on randomness, so we must try to minimize its effects.

## Cross Validation

A good approach is to split our data into training and testing sets various times and in different places. The more we do this, the more we minimize the effects of randomness.

In [37]:
from sklearn.model_selection import cross_validate

SEED = 158020 # Changing our seed to another number
np.random.seed(SEED)

model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x, y, cv=3) # cv=3 means we're separating our dataset into 3 subsets and in each iteration using one of them as the test set
results['test_score']

array([0.75704859, 0.7629763 , 0.75337534])

In [38]:
mean = results['test_score'].mean()
std = results['test_score'].std()

print("Accuracy with cross validation (3) -> [%.2f, %.2f]" % ((mean - 2*std) * 100, (mean + 2*std) * 100))

Accuracy with cross validation (3) -> [74.99, 76.57]


Note that if we change the seed value, our accuracy will remain consistent!

What about the value 3 we chose for spliting our data? What happens if we choose another number?

In [39]:
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x, y, cv=10)

mean = results['test_score'].mean()
std = results['test_score'].std()

print("Accuracy with cross validation (10) -> [%.2f, %.2f]" % ((mean - 2*std) * 100, (mean + 2*std) * 100))

Accuracy with cross validation (10) -> [74.24, 77.32]


Some scientific papers show that choosing a cv between 5 and 10 often results in good accuracy, so we'll be choosing 5 

In [40]:
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x, y, cv=5)

mean = results['test_score'].mean()
std = results['test_score'].std()

print("Accuracy with cross validation (5) -> [%.2f, %.2f]" % ((mean - 2*std) * 100, (mean + 2*std) * 100))

Accuracy with cross validation (5) -> [75.21, 76.35]


However, *cross_validate()* is deterministic, and once again it is advantageous to test with random spliting in order to have a better grasp of our model's true accuracy

## Randomness in Cross Validation

In [41]:
# First of all let's create a function to print our results
def print_results(results):
    mean = results['test_score'].mean()
    std = results['test_score'].std()

    print("Mean accuracy -> %.2f" % (mean * 100))
    print("Accuracy Interval -> [%.2f, %.2f]" % ((mean - 2*std) * 100, (mean + 2*std) * 100))

In [42]:
from sklearn.model_selection import KFold

SEED = 301
np.random.seed(SEED)

# KFold() provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds. Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
cv = KFold(n_splits=10, shuffle=True) # "shuffle" shuffles the data before splitting into batches
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x, y, cv = cv)
print_results(results)

Mean accuracy -> 75.76
Accuracy Interval -> [73.26, 78.26]


We're still exposed to a big problem: what if our data is split with a class disbalance? That is, if, by chance, there's a higher proportion of a given class in the train set than in the test set, our model could wrongfully learn that the given class is more frequent than it truly is.

When we were using the *train_test_split()* method, we solved this problem by stratifying the data. However, there's no "stratify" parameter in *cross_validate()*. What do we do then?

Let's simulate a case where we're unlucky enough to have a disbalanced split:

## Simulating an unlucky disbalanced split

In [43]:
bad_df = df.sort_values("sold", ascending=True) # First 0 (not sold), then 1 (sold)
bad_df

Unnamed: 0,price,sold,models_age,km_per_year
4999,74023.29,0,12,24812.80412
5322,84843.49,0,13,23095.63834
5319,83100.27,0,19,36240.72746
5316,87932.13,0,16,32249.56426
5315,77937.01,0,15,28414.50704
...,...,...,...,...
5491,71910.43,1,9,25778.40812
1873,30456.53,1,6,15468.97608
1874,69342.41,1,11,16909.33538
5499,70520.39,1,16,19622.68262


In [44]:
bad_x = bad_df[['price', 'models_age', 'km_per_year']]
bad_y = bad_df[['sold']]

#### Without Shuffle

In [45]:
from sklearn.model_selection import KFold

SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits=10)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, bad_x, bad_y, cv=cv)
print_results(results)

Mean accuracy -> 57.84
Accuracy Interval -> [34.29, 81.39]


#### With Shuffle

In [46]:
from sklearn.model_selection import KFold

SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits=10, shuffle=True)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, bad_x, bad_y, cv=cv)
print_results(results)

Mean accuracy -> 75.78
Accuracy Interval -> [72.30, 79.26]


Look how much better it is! Our data was arranged with the labels in ascending order, so of course any split would result in vary different subsets. With shuffle, we greatly minimize this issue.

Yet, shuffling is very different from stratifying: the former helps randomizing our data, but it doesn't guarantee a stratified split.

## Stratified KFold

It's that simple: instead of using KFold, we can use StratifiedKFold! 

In [47]:
from sklearn.model_selection import StratifiedKFold

SEED = 301
np.random.seed(SEED)

cv = StratifiedKFold(n_splits=10, shuffle=True)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, bad_x, bad_y, cv=cv)
print_results(results)

Mean accuracy -> 75.78
Accuracy Interval -> [73.55, 78.01]


## Grouping data before spliting into sets

Imagine a dataset containing data of patients in a hospital, and that in the dataset there could be multiple lines belonging to one patient. In this case, it's important that we group these data so the test set only sees new patients (instead of splitting the same patient into both sets).

Since we're dealing with cars, and not patients, a better example would be a car's model - it's best that our algorithm is tested on new car models. However, there's no *model* column in our dataset. That should be no problem: let's create one! We're going through this trouble because it is important to know how to group them in case we encounter a real life scenario where this is needed.

This new *model* column should be strongly correlated to the model's age - that is, two cars with roughly the same age might be the same model, while two with very different ages won't. We could just make the model be the same number as its age, but that would mean that all cars with the same age are the same model. That's no fun. So we can just add or subtract a random number to the model's age to get a new number that will represent the model to give it some extra variety and randomness.

In [48]:
np.random.seed(SEED)

df['model'] = df['models_age'] + np.random.randint(-2, 3, size=df.shape[0]) # Adding a number from -2 to 2 to the model's age
df.head()

Unnamed: 0,price,sold,models_age,km_per_year,model
0,30941.02,1,18,35085.22134,16
1,40557.96,1,20,12622.05362,22
2,89627.5,0,12,11440.79806,12
3,95276.14,0,3,43167.32682,4
4,117384.68,1,4,12770.1129,3


In [49]:
df['model'].unique()

array([16, 22, 12,  4,  3, 11, 18, 17, 13,  0, 15, 10,  9, 14,  1,  5, 19,
       21,  8,  7, 20,  6,  2, -1], dtype=int64)

Let's get rid of the negative numbers

In [50]:
df['model'] = df['model'] + abs(df['model'].min()) + 1 # We make the lowest number become 1, and from that all others are moved up
df['model'].unique() 

array([18, 24, 14,  6,  5, 13, 20, 19, 15,  2, 17, 12, 11, 16,  3,  7, 21,
       23, 10,  9, 22,  8,  4,  1], dtype=int64)

In [51]:
df.head()

Unnamed: 0,price,sold,models_age,km_per_year,model
0,30941.02,1,18,35085.22134,18
1,40557.96,1,20,12622.05362,24
2,89627.5,0,12,11440.79806,14
3,95276.14,0,3,43167.32682,6
4,117384.68,1,4,12770.1129,5


Now that our dataset includes the cars' models, using *shuffle* will not do us good, for it won't group the cars by their model. For that, we should use *GroupKFold*.

## Testing cross validation with GroupKFold

In [52]:
from sklearn.model_selection import GroupKFold

SEED = 301
np.random.seed(SEED)

cv = GroupKFold(n_splits=10)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, bad_x, bad_y, cv=cv, groups=df['model'])
print_results(results)

Mean accuracy -> 75.78
Accuracy Interval -> [73.67, 77.90]


# Cross validation with SVC (and StandardScaler)

### Without *cross_validate*

In [53]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 301
np.random.seed(SEED)

# Scaling the data
scaler = StandardScaler()
scaler.fit(train_x)
scaled_train_x = scaler.transform(train_x)
scaled_test_x = scaler.transform(test_x)

# Creating and training the model
model = SVC()
model.fit(scaled_train_x, train_y)
predictions = model.predict(scaled_test_x)

accuracy = accuracy_score(test_y, predictions) * 100
print("The model's accuracy was %.2f%%" % accuracy)

The model's accuracy was 77.48%


### With *cross_validate*

In [56]:
SEED = 301
np.random.seed(SEED)

scaler = StandardScaler()
scaler.fit(bad_x)
scaled_bad_x = scaler.transform(bad_x)

cv = GroupKFold(n_splits=10)
model = SVC()
results = cross_validate(model, scaled_bad_x, bad_y, cv=cv, groups=df['model'])

print_results(results)

  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)


Mean accuracy -> 76.70
Accuracy Interval -> [74.30, 79.10]


However, that is wrong. The reason is simple: since we're using *cross_validate*, we should be fitting the scaler once per fold, not once overall. For that, we're going to be using a pipeline.

In [57]:
from sklearn.pipeline import Pipeline

SEED = 301
np.random.seed(SEED)

scaler = StandardScaler()
model = SVC()

pipeline = Pipeline([('Transformer', scaler), ('Estimator', model)])

cv = GroupKFold(n_splits=10)
results = cross_validate(pipeline, bad_x, bad_y, cv=cv, groups=df['model'])
print_results(results)

  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)
  return f(*args, **kwargs)


Mean accuracy -> 76.68
Accuracy Interval -> [74.28, 79.08]
