<h1>Predicting car sales based on mileage, model year and price</h1>

In [25]:
import pandas as pd

data = pd.read_csv('data/car-prices.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,mileage_per_year,model_year,price,sold
0,0,21801,2000,30941.02,yes
1,1,7843,1998,40557.96,yes
2,2,7109,2006,89627.5,no
3,3,26823,2015,95276.14,no
4,4,7935,2014,117384.68,yes


In [26]:
# Map 'yes' to 1 and 'no' to 0 so the target column is numeric
change = {
    'yes': 1,
    'no' : 0
}

data['sold'] = data.sold.map(change)
data.head()

Unnamed: 0.1,Unnamed: 0,mileage_per_year,model_year,price,sold
0,0,21801,2000,30941.02,1
1,1,7843,1998,40557.96,1
2,2,7109,2006,89627.5,0
3,3,26823,2015,95276.14,0
4,4,7935,2014,117384.68,1


Now we're going to use the model year to get the car's age. This matters because model years differ very little in relative terms: 1998 and 2000, for example, are only 0.1% apart. The corresponding ages differ by a much larger proportion, which should make it easier for our model to make accurate predictions.
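To make the magnitude argument concrete, here is a small sketch comparing the relative gap between two model years with the relative gap between the matching ages. The reference year 2022 is an assumption, chosen to match the ages shown in the outputs of this notebook; `relative_gap` is a hypothetical helper, not part of pandas.

```python
# Assumption: a fixed reference year, so the result does not depend on today's date.
REFERENCE_YEAR = 2022


def relative_gap(a, b):
    """Return the absolute difference between a and b as a percentage of the larger value."""
    return abs(a - b) / max(a, b) * 100


year_a, year_b = 1998, 2000
age_a, age_b = REFERENCE_YEAR - year_a, REFERENCE_YEAR - year_b

year_gap = relative_gap(year_a, year_b)  # gap between the raw model years
age_gap = relative_gap(age_a, age_b)     # gap between the derived ages

print("Relative gap between model years: %.2f%%" % year_gap)
print("Relative gap between ages: %.2f%%" % age_gap)
```

The absolute difference is two years in both cases; only the scale it is measured against changes.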

In [27]:
from datetime import datetime

current_year = datetime.today().year
data['models_age'] = current_year - data.model_year
data.head()

Unnamed: 0.1,Unnamed: 0,mileage_per_year,model_year,price,sold,models_age
0,0,21801,2000,30941.02,1,22
1,1,7843,1998,40557.96,1,24
2,2,7109,2006,89627.5,0,16
3,3,26823,2015,95276.14,0,7
4,4,7935,2014,117384.68,1,8


In [28]:
# I prefer using kilometers (1 mile = 1.60934 km)
data['km_per_year'] = data.mileage_per_year * 1.60934
data.head()

Unnamed: 0.1,Unnamed: 0,mileage_per_year,model_year,price,sold,models_age,km_per_year
0,0,21801,2000,30941.02,1,22,35085.22134
1,1,7843,1998,40557.96,1,24,12622.05362
2,2,7109,2006,89627.5,0,16,11440.79806
3,3,26823,2015,95276.14,0,7,43167.32682
4,4,7935,2014,117384.68,1,8,12770.1129


In [29]:
# Now let's get rid of the unwanted columns
# (`columns=` already selects the axis, so `axis=1` is redundant)
data = data.drop(columns=["Unnamed: 0", "mileage_per_year", "model_year"])
data.head()

Unnamed: 0,price,sold,models_age,km_per_year
0,30941.02,1,22,35085.22134
1,40557.96,1,24,12622.05362
2,89627.5,0,16,11440.79806
3,95276.14,0,7,43167.32682
4,117384.68,1,8,12770.1129


In [30]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

x = data[['price', 'models_age', 'km_per_year']]
y = data['sold']

SEED = 10
np.random.seed(SEED)

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.25, stratify = y)

print("We will train with %d elements and test with %d elements" % (len(train_x), len(test_x)))

model = LinearSVC()
model.fit(train_x, train_y)
predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions) * 100
print("The model's accuracy was %.2f%%" % accuracy)



We will train with 7500 elements and test with 2500 elements
The model's accuracy was 58.00%




<h3>Dummy Classifiers</h3>

To judge whether our model's accuracy is meaningful, we must compare it to dummy classifiers: baselines that make predictions without looking at the input features.
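As a rough sketch of what such a comparison could look like, the cell below uses scikit-learn's `DummyClassifier` with two of its strategies: `most_frequent` (always predict the majority class) and `stratified` (guess at random following the class proportions). The data here is a synthetic stand-in with an arbitrary 60/40 class balance, not the car dataset, and `baseline_accuracy` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 10

# Assumption: synthetic stand-in data, 1000 rows with 3 random features
# and an arbitrary 60/40 split between the two classes.
rng = np.random.default_rng(SEED)
x = rng.normal(size=(1000, 3))
y = np.array([1] * 600 + [0] * 400)

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=SEED
)


def baseline_accuracy(strategy):
    """Fit a dummy classifier with the given strategy and return its test accuracy in percent."""
    dummy = DummyClassifier(strategy=strategy, random_state=SEED)
    dummy.fit(train_x, train_y)
    return accuracy_score(test_y, dummy.predict(test_x)) * 100


most_frequent_acc = baseline_accuracy("most_frequent")
stratified_acc = baseline_accuracy("stratified")

print("Most-frequent baseline accuracy: %.2f%%" % most_frequent_acc)
print("Stratified baseline accuracy: %.2f%%" % stratified_acc)
```

A real model is only useful if it beats these baselines by a clear margin; a score close to the most-frequent baseline suggests the model has learned little beyond the class balance.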