# <font color=purple> Machine Learning</font>

<font color=purple> **The objective of this project is twofold: to study machine learning libraries and to gain further practice with Jupyter Notebook and Python. The project covers several datasets and offers a range of exercises based on them.** </font>

This study follows a few conventions:

1- Some commands are followed by a short summary of what they do.

2- All text is written in English.

3- The data comes from exercises on the Alura platform.

4- Each dataset is accompanied by a short description.

# About

The main dataset is a fictional database in which each row represents a car for sale in an online store. The cars belong to different owners, and some of them have been sold while others have not. The "sold" column records this: "yes" for sold cars and "no" for unsold ones. Each car has three features: "mileage per year" is the number of miles the car has traveled per year; "model year" is the year of the model (which is different from the year of manufacture); and "price" is the selling price of the car.

The objective of this project is to validate the models from a past project, which you can find on my GitHub at "ML/ML-SKLearn".

In [1]:
# Imports

import pandas as pd
import numpy as np
from datetime import datetime

# SKLearn imports

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Datasets

url_1 = 'https://gist.githubusercontent.com/guilhermesilveira/4d1d4a16ccbf6ea4e0a64a38a24ec884/raw/afd05cb0c796d18f3f5a6537053ded308ba94bf7/car-prices.csv'

# Readers

data_1 = pd.read_csv(url_1)
data_1

Unnamed: 0.1,Unnamed: 0,mileage_per_year,model_year,price,sold
0,0,21801,2000,30941.02,yes
1,1,7843,1998,40557.96,yes
2,2,7109,2006,89627.50,no
3,3,26823,2015,95276.14,no
4,4,7935,2014,117384.68,yes
...,...,...,...,...,...
9995,9995,15572,2006,97112.86,no
9996,9996,13246,2002,107424.63,yes
9997,9997,13018,2014,93856.99,no
9998,9998,10464,2011,51250.57,yes


In [2]:
# Mapping for the "sold" column: "yes" -> 1, "no" -> 0

replace = {
    'yes': 1,
    'no': 0
}

# Car age in years, relative to the year in which the notebook is run

data_1['car_age'] = datetime.today().year - data_1.model_year

# Creating a new column called "km_per_year": since we are in Brazil, we use kilometers rather than miles

data_1['km_per_year'] = data_1.mileage_per_year * 1.60934 # Miles to km

data_1['sold'] = data_1.sold.map(replace)

# Drop unnecessary columns

data_1 = data_1.drop(columns=['Unnamed: 0', 'mileage_per_year', 'model_year'])

data_1.head()

Unnamed: 0,price,sold,car_age,km_per_year
0,30941.02,1,24,35085.22134
1,40557.96,1,26,12622.05362
2,89627.5,0,18,11440.79806
3,95276.14,0,9,43167.32682
4,117384.68,1,10,12770.1129
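
The feature engineering above can be sketched for a single row without pandas. The helper `engineer_features` and its `reference_year` parameter are hypothetical names introduced for illustration; fixing the reference year (assumed here to be 2024, which reproduces the `car_age` of 24 shown for the first row) keeps the result from changing when the notebook is rerun in a later year.

```python
MILES_TO_KM = 1.60934  # conversion factor used in the notebook


def engineer_features(mileage_per_year, model_year, reference_year):
    """Return (km_per_year, car_age) for a single car."""
    km_per_year = mileage_per_year * MILES_TO_KM
    car_age = reference_year - model_year
    return km_per_year, car_age


# First row of the dataset; reference_year=2024 is an assumption
km, age = engineer_features(21801, 2000, reference_year=2024)
print(round(km, 5), age)
```

In the notebook itself, the same idea would amount to replacing `datetime.today().year` with a fixed constant.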


In [3]:
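# Splitting the data into training and test sets

x = data_1[['km_per_year','car_age','price']]
y = data_1['sold']

# train_test_split draws from NumPy's global random state when random_state is not set,
# so seeding NumPy makes the split reproducible

SEED = 158020
np.random.seed(SEED)

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, stratify=y)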
print(f'Training with: {len(train_x)} elements and Testing with: {len(test_x)} elements')

Training with: 7500 elements and Testing with: 2500 elements
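
As a small illustration of what `stratify=y` does, the sketch below splits synthetic data (not the car dataset) and compares the share of positive labels in each part. The arrays `X_demo` and `y_demo` and the 58/42 class mix are assumptions made up for the example; `random_state` is passed explicitly instead of seeding NumPy globally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 58% positive labels (hypothetical proportions)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 580 + [0] * 420)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# With stratification, both parts keep roughly the same class balance
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(X_tr), len(X_te), train_rate, test_rate)
```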


In [4]:
# Creating a dummy classifier as a baseline

dummy_stratified = DummyClassifier(strategy='stratified')
dummy_stratified.fit(train_x, train_y)
accuracy = dummy_stratified.score(test_x, test_y) * 100

print(f'Accuracy: {accuracy} %')

Accuracy: 50.96000000000001 %
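
A rough way to see why this baseline lands near 50%: the `stratified` strategy guesses labels at random following the class frequencies seen in training, ignoring the features. Assuming the test set has the same positive share `p` as the training set, the expected accuracy is p² + (1 − p)². The function name below is hypothetical, and the values of `p` are illustrative, not taken from the dataset.

```python
def expected_stratified_accuracy(p):
    """Expected accuracy of random guessing that follows class share p."""
    # Correct when both guess and truth are positive, or both are negative
    return p * p + (1 - p) * (1 - p)


for p in (0.5, 0.6, 0.9):
    print(p, expected_stratified_accuracy(p))
```

For class shares close to one half, this stays close to 0.5, so any useful model should score clearly above it.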


In [5]:
# Creating Decision Tree model

model = DecisionTreeClassifier(max_depth=2)
model.fit(train_x, train_y)
prediction = model.predict(test_x)

accuracy = accuracy_score(test_y, prediction) * 100
print(f'Accuracy: {accuracy} %')

Accuracy: 71.92 %


In [6]:
# Using cross-validation instead of relying on a single holdout split

model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x, y, cv=5, return_train_score=False)
mean = results['test_score'].mean()
std = results['test_score'].std()

print(f'Accuracy Cross Validation 5: {((mean - 2 * std) * 100 , (mean + 2 * std) * 100)}')

Accuracy Cross Validation 5: (75.20868572571656, 76.35131427428342)
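
The printed range is the mean of the fold scores plus and minus two standard deviations. A minimal sketch of that arithmetic, using only the standard library and hypothetical fold scores (not the ones produced above), is shown here; `pstdev` is used because NumPy's `.std()` defaults to the population standard deviation.

```python
from statistics import fmean, pstdev


def accuracy_range(scores):
    """Return (mean - 2*std, mean + 2*std) in percent."""
    mean = fmean(scores)
    std = pstdev(scores)  # population std, like numpy's default .std()
    return (mean - 2 * std) * 100, (mean + 2 * std) * 100


fold_scores = [0.74, 0.76, 0.75, 0.77, 0.73]  # hypothetical fold accuracies
low, high = accuracy_range(fold_scores)
print(low, high)
```

If the fold scores were roughly normally distributed, about 95% of them would be expected to fall inside this range; with only a handful of folds it is best read as a rough indication.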


In [7]:
# Helper that prints the mean accuracy and the range of mean ± 2 standard deviations

def print_result(results):
    mean = results['test_score'].mean()
    std = results['test_score'].std()
    print(f'Mean: {mean * 100}')
    print(f'Accuracy Range: {((mean - 2 * std) * 100 , (mean + 2 * std) * 100)}')

# Using KFold with shuffling so that the folds are drawn at random

cv = KFold(n_splits=10, shuffle=True)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x, y, cv=cv, return_train_score=False)

print_result(results)

Mean: 75.79
Accuracy Range: (72.84592119670685, 78.73407880329314)


In [8]:
# Using StratifiedKFold on an unlucky ordering: the rows are sorted by class

data_unlucky = data_1.sort_values("sold", ascending=True)
x_un = data_unlucky[['km_per_year','car_age','price']]
y_un = data_unlucky['sold']

cv = StratifiedKFold(n_splits=10, shuffle=True)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x_un, y_un, cv=cv, return_train_score=False)

print_result(results)

Mean: 75.77999999999999
Accuracy Range: (73.72126252280675, 77.83873747719323)
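
To see why sorted data is the unlucky case, the sketch below compares plain `KFold` without shuffling against `StratifiedKFold` on a toy label vector that is sorted by class. The label vector and its 40/60 split are assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 40 zeros followed by 60 ones (hypothetical)
y_sorted = np.array([0] * 40 + [1] * 60)
X_dummy = np.zeros((len(y_sorted), 1))  # features are irrelevant here

# Share of positive labels in each test fold
kfold_rates = [
    y_sorted[test].mean() for _, test in KFold(n_splits=5).split(X_dummy)
]
strat_rates = [
    y_sorted[test].mean()
    for _, test in StratifiedKFold(n_splits=5).split(X_dummy, y_sorted)
]

print(kfold_rates)  # unshuffled folds contain a single class each
print(strat_rates)  # stratified folds keep the overall class share
```

Shuffling, as done in the cell above, also breaks up the sorted order; stratification additionally keeps the class share of every fold close to that of the full dataset.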


In [9]:
# Generating random car model data to simulate groups for group-aware cross-validation

SEED = 301
np.random.seed(SEED)

data_1['model'] = data_1.car_age + np.random.randint(-2,3, size=10000)
data_1.head()

Unnamed: 0,price,sold,car_age,km_per_year,model
0,30941.02,1,24,35085.22134,22
1,40557.96,1,26,12622.05362,28
2,89627.5,0,18,11440.79806,18
3,95276.14,0,9,43167.32682,10
4,117384.68,1,10,12770.1129,9


In [10]:
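# Applying GroupKFold: all cars of the same model stay in the same fold
# Note: groups are matched to rows by position, and data_1.model keeps the original row order while x_un is sorted by "sold"

cv = GroupKFold(n_splits=10)
model = DecisionTreeClassifier(max_depth=2)
results = cross_validate(model, x_un, y_un, cv=cv, groups=data_1.model, return_train_score=False)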

print_result(results)

Mean: 75.78421883757397
Accuracy Range: (73.67184753288377, 77.89659014226416)
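
`GroupKFold` matches group labels to rows by position rather than by pandas index, so when the feature frame has been reordered the group column needs the same row order. The following is a minimal sketch on a toy frame (all column names and values are hypothetical) that aligns the groups with `.loc` and then checks that no group appears on both sides of a split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy frame: 12 rows, 6 groups of 2 rows each (hypothetical data)
demo = pd.DataFrame({
    "feature": np.arange(12, dtype=float),
    "label": [0, 1] * 6,
    "group": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# Reorder the rows, then bring the group column into the same order
reordered = demo.sort_values("label")
aligned_groups = demo["group"].loc[reordered.index]

cv = GroupKFold(n_splits=3)
overlaps = []
for train_idx, test_idx in cv.split(
    reordered[["feature"]], reordered["label"], groups=aligned_groups
):
    train_groups = set(aligned_groups.iloc[train_idx])
    test_groups = set(aligned_groups.iloc[test_idx])
    overlaps.append(train_groups & test_groups)

print(overlaps)  # each entry is the set of groups shared by train and test
```

Applied to this notebook, the equivalent alignment would be something like `data_1.model.loc[x_un.index]`; the accuracy figures would then differ somewhat from the ones printed above.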


In [11]:
# Holdout evaluation of an SVC with StandardScaler (the scaler is fit on the training set only)

scaler = StandardScaler()
scaler.fit(train_x)
train_x_scaler = scaler.transform(train_x)
test_x_scaler = scaler.transform(test_x)

model = SVC()
model.fit(train_x_scaler, train_y)
prediction = model.predict(test_x_scaler)

accuracy = accuracy_score(test_y, prediction) * 100
print(f'Accuracy: {accuracy} %')

Accuracy: 74.4 %
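
What the scaler does here can be sketched by hand with NumPy: the mean and standard deviation are computed from the training rows only and then reused for the test rows. The arrays below are made-up values for illustration; the population standard deviation (`ddof=0`) is used, which is what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical training and test rows with two features
train_demo = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
test_demo = np.array([[40.0, 7.0]])

# Statistics come from the training data only
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)  # population std (ddof=0)

train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test rows may fall outside that range because they played no part in computing the statistics.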


In [13]:
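# Scaling all of x_un at once (this fits the scaler on every row, including rows that later serve as validation data)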

scaler = StandardScaler()
scaler.fit(x_un)
x_un_scaler = scaler.transform(x_un)

In [16]:
# Using a pipeline so that the scaler is refit on the training part of each fold

SEED = 301
np.random.seed(SEED)

scaler = StandardScaler()
model = SVC()
pipeline = Pipeline([('scaler', scaler), ('model', model)])

cv = GroupKFold(n_splits=10)
results = cross_validate(pipeline, x_un, y_un, cv=cv, groups=data_1.model, return_train_score=False)

print_result(results)

Mean: 76.67947835217086
Accuracy Range: (74.27639884416052, 79.0825578601812)
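
The same pattern can be tried on synthetic data to see it in isolation. This is a sketch under stated assumptions: the data comes from `make_classification` rather than the car dataset, and a stratified splitter is used instead of `GroupKFold` to keep the example short. `cross_validate` clones the pipeline for each fold, so the scaler only ever sees that fold's training rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (not the car dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Scaling and the model are bundled, so both are fit inside each fold
pipeline_demo = Pipeline([("scaler", StandardScaler()), ("model", SVC())])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results_demo = cross_validate(pipeline_demo, X_demo, y_demo, cv=cv_demo)

scores = results_demo["test_score"]
print(scores.mean(), scores.std())
```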
