# Introduction to scikit-learn

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:


0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problem
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together!

In [1]:
# Let's listify the contents
what_were_covering = [
    "0. An end-to-end Scikit-Learn workflow",
    "1. Getting the data ready",
    "2. Choose the right estimator/algorithm for our problems",
    "3. Fit the model/algorithm and use it to make predictions on our data",
    "4. Evaluating a model",
    "5. Improve a model",
    "6. Save and load a trained model",
    "7. Putting it all together!"]

## 0. An end-to-end Scikit-Learn workflow

In [64]:
# 1. Get the data ready
import pandas as pd
import numpy as np

heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [65]:
# Create X (features matrix)
X =  heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

In [66]:
# 2. Choose the right model and hyperparameters
# This is a classification problem: we want to predict whether or not a patient has heart disease

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 90)

# We'll keep the default hyperparameters, apart from n_estimators (set to 90 above)
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 90,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [67]:
# 3. Fit the model to the data
# First, split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [68]:
clf.fit(X_train, y_train);
X_train.head()

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
173   58    1   2       132   224    0        0      173      0      3.2      2   2     3
156   47    1   2       130   253    0        1      179      0      0.0      2   0     2
285   46    1   0       140   311    0        1      120      1      1.8      1   2     3
110   64    0   0       180   325    0        1      154      1      0.0      2   0     2
45    52    1   1       120   325    0        1      172      0      0.2      2   0     2


In [69]:
# Make a prediction (this call fails: predict() expects a 2D array with one row per sample and one column per feature)
y_label = clf.predict(np.array([0, 2, 3, 4]))



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
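
The error above comes from passing a 1D array to `predict`. Below is a minimal sketch of a call that works, assuming synthetic data from `make_classification` in place of the heart disease CSV (the data and the names `X_demo`, `demo_clf` and `single_sample` are illustrative, not part of the notebook). A single sample is reshaped to `(1, n_features)` so it has the same number of columns the model was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart disease data: 100 samples, 13 features
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=0)
demo_clf.fit(X_demo, y_demo)

# One sample is a 1D array of shape (13,); predict() expects (n_samples, n_features)
single_sample = X_demo[0]
single_pred = demo_clf.predict(single_sample.reshape(1, -1))

print(single_sample.shape, single_pred.shape)
```

Passing `[0, 2, 3, 4]` would still fail after reshaping, because it has 4 values rather than the 13 features the classifier was fitted on.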

In [70]:
y_preds = clf.predict(X_test)

In [71]:
y_preds

array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1], dtype=int64)

In [72]:
y_test.head()

79     1
120    1
61     1
215    0
192    0
Name: target, dtype: int64

In [73]:
# 4. Evaluate the model on the training data and test data
clf.score(X_train, y_train)

1.0

In [74]:
clf.score(X_test, y_test)

0.8360655737704918

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

In [None]:
confusion_matrix(y_test, y_preds)

In [None]:
accuracy_score(y_test, y_preds)
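
The three evaluation calls above are closely related. As a rough sketch on synthetic data (the dataset and all variable names here are assumptions for illustration), `accuracy_score` gives the same number as `clf.score` for a classifier, and the same value can be read off the confusion matrix as the diagonal (correct predictions) divided by the total.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the heart disease set
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
demo_preds = demo_clf.predict(X_te)

# Accuracy computed three ways
acc_from_score = demo_clf.score(X_te, y_te)
acc_from_metric = accuracy_score(y_te, demo_preds)

cm = confusion_matrix(y_te, demo_preds)  # rows: true labels, columns: predicted labels
acc_from_cm = np.trace(cm) / cm.sum()

print(acc_from_score, acc_from_metric, acc_from_cm)
```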

In [None]:
# 5. Improve a model
# Try different numbers of estimators (n_estimators)

np.random.seed(10)
for i in range(10, 100, 10):
    print("Trying model with {} estimators".format(i))
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    # score() returns a fraction between 0 and 1, so multiply by 100 to report a percentage
    print("Model accuracy on test set: {:.2f}%".format(clf.score(X_test, y_test) * 100))
    print()

In [None]:
# 6. Save a model and load it
import pickle

with open("random-forest-model1.pkl", "wb") as f:  # the with block closes the file for us
    pickle.dump(clf, f)

In [None]:
with open("random-forest-model1.pkl", "rb") as f:  # same filename (and case) used when saving
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)
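
As a quick check that saving and loading preserves the model, here is a small round-trip sketch. It assumes synthetic data and a temporary directory (the file name `demo-model.pkl` and all variable names are made up for the example) and compares predictions before and after the reload.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo-model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should make exactly the same predictions
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
print(same_predictions)
```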

# Retry

In [None]:
heart_data = pd.read_csv("heart-disease.csv")
heart_data.head()

In [None]:
X = heart_data.drop("target", axis=1)
X

In [None]:
y = heart_data["target"]
y

In [None]:
clf = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
clf.fit(X_train, y_train);

In [None]:
y_pred = clf.predict(X_test)
y_pred

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

# Retry

In [None]:
heart_data = pd.read_csv("heart-disease.csv")
X = heart_data.drop("target", axis=1)
y = heart_data["target"]

clf = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

clf.score(X_test, y_test)

In [None]:
what_were_covering

## 1. Getting the data ready

Three main things we have to do:

1. Split the data into features and labels (usually "X" and "y")
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values into numerical values (also called feature encoding)
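
The three steps can be sketched on a tiny hand-made DataFrame. The data, the fill values and the variable names below are assumptions for illustration only, and `pd.get_dummies` is used here as one simple way to encode a text column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing make and a missing door count
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", None, "Toyota"],
    "Doors": [4.0, 5.0, 4.0, np.nan],
    "Price": [15323, 19943, 28343, 13434],
})

# 1. Split into features (X) and labels (y)
X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# 2. Fill missing values, one fill value per column
X_toy = X_toy.fillna({"Make": "missing", "Doors": 4.0})

# 3. Convert the non-numerical column into numerical (one-hot) columns
X_toy_encoded = pd.get_dummies(X_toy, columns=["Make"])

print(X_toy_encoded.columns.tolist())
```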

In [75]:
heart_disease.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [76]:
X = heart_disease.drop("target", axis=1)
X.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2


In [77]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [78]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [79]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape # Important, make sure shapes match.

((242, 13), (61, 13), (242,), (61,))
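
A minimal sketch of where the 242 and 61 come from, assuming placeholder arrays with the same 303 rows and 13 columns (the arrays and the `random_state` value are illustrative assumptions). With `test_size=0.2`, scikit-learn rounds the test set up to 61 rows and leaves the remaining 242 for training. Passing a fixed `random_state` also makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the heart disease features and labels
X_demo = np.arange(303 * 13).reshape(303, 13)
y_demo = np.arange(303) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# The same random_state gives the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
repeatable = np.array_equal(X_te, X_te2) and np.array_equal(y_te, y_te2)
```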

### 1.1 Make sure it's all numerical

In [80]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head()

     Make Colour  Odometer (KM)  Doors  Price
0   Honda  White          35431      4  15323
1     BMW   Blue         192714      5  19943
2   Honda  White          84714      4  28343
3  Toyota  White         154365      4  13434
4  Nissan   Blue         181577      3  14043


In [81]:
len(car_sales)

1000

In [82]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [83]:
# Split the data into X and y
X = car_sales.drop("Price", axis=1)
X.head()

     Make Colour  Odometer (KM)  Doors
0   Honda  White          35431      4
1     BMW   Blue         192714      5
2   Honda  White          84714      4
3  Toyota  White         154365      4
4  Nissan   Blue         181577      3


In [84]:
y = car_sales["Price"]
y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [85]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 4), (200, 4), (800,), (200,))

In [86]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train) # Fit on training data (fails here: "Make" and "Colour" are still strings)
model.score(X_test, y_test) # Evaluate on test data

ValueError: could not convert string to float: 'Toyota'

In [87]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder="passthrough")
transformed_x = transformer.fit_transform(X)
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [88]:
X.head()

     Make Colour  Odometer (KM)  Doors
0   Honda  White          35431      4
1     BMW   Blue         192714      5
2   Honda  White          84714      4
3  Toyota  White         154365      4
4  Nissan   Blue         181577      3


In [89]:
pd.DataFrame(transformed_x).head()

     0    1    2    3    4    5    6    7    8    9   10   11        12
0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   35431.0
1  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  192714.0
2  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   84714.0
3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  154365.0
4  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  181577.0


In [90]:
# dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
# dummies.head()

In [91]:
# Let's try refitting the model
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_x,
                                                   y,
                                                   test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 13), (200, 13), (800,), (200,))

In [92]:
model.fit(X_train, y_train);

In [93]:
model.score(X_test, y_test)

0.3235867221569877
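
For a regressor, `score` returns the coefficient of determination (R^2) rather than an accuracy: 1.0 is a perfect fit, and 0 is no better than always predicting the mean of the labels. The sketch below, on synthetic data from `make_regression` (the data and variable names are assumptions for illustration), shows that `model.score` matches `r2_score` and the formula computed by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_tr, y_tr)
demo_preds = demo_model.predict(X_te)

# R^2 from the estimator, from sklearn.metrics, and by hand
score_r2 = demo_model.score(X_te, y_te)
metric_r2 = r2_score(y_te, demo_preds)
manual_r2 = 1 - np.sum((y_te - demo_preds) ** 2) / np.sum((y_te - y_te.mean()) ** 2)

print(score_r2, metric_r2, manual_r2)
```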

### 1.2 What if there were missing values?

1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether

In [94]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [95]:
# Import car sales missing data
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head()
missing_data.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

In [96]:
missing_data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [97]:
# Create X and y
X = missing_data.drop("Price", axis=1)
X.head()

     Make Colour  Odometer (KM)  Doors
0   Honda  White        35431.0    4.0
1     BMW   Blue       192714.0    5.0
2   Honda  White        84714.0    4.0
3  Toyota  White       154365.0    4.0
4  Nissan   Blue       181577.0    3.0


In [98]:
y = missing_data["Price"]
y.head()

0    15323.0
1    19943.0
2    28343.0
3    13434.0
4    14043.0
Name: Price, dtype: float64

In [99]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

transformed_x = transformer.fit_transform(X)

In [100]:
X.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
dtype: int64

#### Option 1: Fill missing data with pandas

In [101]:
# Fill the "Make" column
# (assigning the result back avoids relying on inplace=True on a column selection,
# which newer pandas versions warn about)
missing_data["Make"] = missing_data["Make"].fillna("missing")

# Fill the "Colour" column
missing_data["Colour"] = missing_data["Colour"].fillna("missing")

# Fill missing "Odometer (KM)" values with the column mean
missing_data["Odometer (KM)"] = missing_data["Odometer (KM)"].fillna(missing_data["Odometer (KM)"].mean())

# Fill the "Doors" column with 4, the most common value according to value_counts()
missing_data["Doors"].value_counts()
missing_data["Doors"] = missing_data["Doors"].fillna(4)

# Check our dataframe again
missing_data.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [102]:
# Remove rows with missing price value
missing_data.dropna(inplace=True)
missing_data.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [103]:
len(missing_data)

950

In [104]:
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [105]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>
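
This time the result is a sparse matrix rather than a NumPy array. `ColumnTransformer` stacks its outputs as a sparse matrix when any transformer returns sparse data and the overall density falls below `sparse_threshold` (0.3 by default). Here 3800 stored values out of 950 x 15 = 14250 is a density of about 0.27. A small sketch on toy data (the data, the helper `encode` and the variable names are assumptions for illustration) shows both outcomes and how to get a dense array back:

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Five rows with five distinct makes and colours, so the one-hot output is mostly zeros
toy_X = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan", "Toyota", "missing"],
    "Colour": ["White", "Blue", "Red", "Black", "Green"],
    "Odometer (KM)": [35431.0, 192714.0, 84714.0, 154365.0, 181577.0],
})

def encode(sparse_threshold):
    """One-hot encode the text columns and pass the odometer column through."""
    transformer = ColumnTransformer(
        [("one_hot", OneHotEncoder(), ["Make", "Colour"])],
        remainder="passthrough",
        sparse_threshold=sparse_threshold,
    )
    return transformer.fit_transform(toy_X)

default_out = encode(0.3)  # default threshold: density is 15/55, so the output is sparse
dense_out = encode(0)      # a threshold of 0 always gives a dense array

print(sparse.issparse(default_out), sparse.issparse(dense_out))
print((default_out.toarray() == dense_out).all())  # same values either way
```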

In [1]:
import pandas as pd
import numpy as np

In [3]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [4]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

In [7]:
heart_disease.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [14]:
clf.fit(X_train, y_train);

In [15]:
clf.score(X_train, y_train)

1.0

In [16]:
clf.score(X_test, y_test)

0.8360655737704918

# Try again: making the data numerical and fitting a model

In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Get the data ready
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head(), len(car_sales), car_sales.dtypes

(     Make Colour  Odometer (KM)  Doors  Price
 0   Honda  White          35431      4  15323
 1     BMW   Blue         192714      5  19943
 2   Honda  White          84714      4  28343
 3  Toyota  White         154365      4  13434
 4  Nissan   Blue         181577      3  14043,
 1000,
 Make             object
 Colour           object
 Odometer (KM)     int64
 Doors             int64
 Price             int64
 dtype: object)

In [12]:
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

In [13]:
feature_data = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 feature_data)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
X_pd = pd.DataFrame(data=transformed_X)
X_pd.head()
# model = RandomForestRegressor()
# transformed_X

     0    1    2    3    4    5    6    7    8    9   10   11        12
0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   35431.0
1  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  192714.0
2  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   84714.0
3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0  154365.0
4  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  181577.0


In [35]:
X_train, X_test, y_train, y_test = train_test_split(X_pd, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((800, 13), (200, 13), (800,), (200,))

In [37]:
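# Create the model before fitting; it is not defined anywhere else in this run
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_train, y_train)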

0.9014751043312839

In [38]:
model.score(X_test, y_test)

0.21967647766614984

# Restart

In [2]:
what_were_covering

['0. An end-to-end Scikit-Learn workflow',
 '1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make predictions on our data',
 '4. Evaluating a model',
 '5. Improve a model',
 '6. Save and load a trained model',
 '7. Putting it all together!']

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Getting the data ready
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head(), len(missing_data), missing_data.dtypes

(     Make Colour  Odometer (KM)  Doors    Price
 0   Honda  White        35431.0    4.0  15323.0
 1     BMW   Blue       192714.0    5.0  19943.0
 2   Honda  White        84714.0    4.0  28343.0
 3  Toyota  White       154365.0    4.0  13434.0
 4  Nissan   Blue       181577.0    3.0  14043.0,
 1000,
 Make              object
 Colour            object
 Odometer (KM)    float64
 Doors            float64
 Price            float64
 dtype: object)

In [3]:
missing_data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [4]:
# Fill missing "Make" and "Colour" values with "missing", "Odometer (KM)" with the column mean
# and "Doors" with 4, then drop the rows that are missing a price

missing_data["Make"] = missing_data["Make"].fillna("missing")
missing_data["Colour"] = missing_data["Colour"].fillna("missing")
missing_data["Odometer (KM)"] = missing_data["Odometer (KM)"].fillna(missing_data["Odometer (KM)"].mean())
missing_data["Doors"] = missing_data["Doors"].fillna(4)

missing_data.dropna(inplace=True)

In [5]:
missing_data.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [6]:
# Now that the missing data has been handled, split the data into X and y
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [7]:
X.head(), y.head()

(     Make Colour  Odometer (KM)  Doors
 0   Honda  White        35431.0    4.0
 1     BMW   Blue       192714.0    5.0
 2   Honda  White        84714.0    4.0
 3  Toyota  White       154365.0    4.0
 4  Nissan   Blue       181577.0    3.0,
 0    15323.0
 1    19943.0
 2    28343.0
 3    13434.0
 4    14043.0
 Name: Price, dtype: float64)

In [8]:
# Turn the data into numerical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

feature_data = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 feature_data)],
                               remainder="passthrough",
                               sparse_threshold=0)

transformed_X = transformer.fit_transform(X)
X_pd = pd.DataFrame(transformed_X)
X_pd.head()

     0    1    2    3    4    5    6    7    8    9   10   11   12   13        14
0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0   35431.0
1  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  192714.0
2  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0   84714.0
3  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  154365.0
4  0.0  0.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  181577.0


In [48]:
# Set up the model and split the data. We are trying to predict the price
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=60)
X_train, X_test, y_train, y_test = train_test_split(X_pd, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((760, 15), (190, 15), (760,), (190,))

In [49]:
# Train the model on the training data
model.fit(X_train, y_train);

In [50]:
# Score the model on the training data
model.score(X_train, y_train)

0.8759014850757179

In [51]:
# Score the model on the test data
model.score(X_test, y_test)

0.05824340430011321

In [52]:
for i in range(10, 100, 25):
    model = RandomForestRegressor(n_estimators=i)
    # The data is re-split on every iteration, so the scores below are not directly comparable
    X_train, X_test, y_train, y_test = train_test_split(X_pd, y, test_size=0.2)
    model.fit(X_train, y_train)
    print("Testing {} estimators: Result = {}".format(i, model.score(X_test, y_test)))

Testing 10 estimators: Result = 0.19502700616183477
Testing 35 estimators: Result = 0.16641690130045628
Testing 60 estimators: Result = 0.12887545623799446
Testing 85 estimators: Result = 0.2519028247610945
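
Because the loop above draws a new random train/test split on every iteration, part of the variation in these scores comes from the split rather than from `n_estimators`. One possible way to compare settings on equal footing is cross-validation, sketched here on synthetic data (the dataset, `random_state=0` and the variable names are assumptions for illustration). Every candidate is scored on the same five folds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv_results = {}
for n in range(10, 100, 25):
    demo_model = RandomForestRegressor(n_estimators=n, random_state=0)
    # 5-fold cross-validation; for a regressor the default score is R^2
    scores = cross_val_score(demo_model, X_demo, y_demo, cv=5)
    cv_results[n] = scores.mean()
    print("{} estimators: mean R^2 = {:.3f}".format(n, cv_results[n]))
```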


#### Option 2: Fill missing values with scikit-learn

In [54]:
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [55]:
# Drop the rows with no labels
missing_data.dropna(subset=["Price"], inplace=True)
missing_data.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [56]:
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [66]:
# Fill missing values with scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing", "Doors" with 4 and numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_feature = ["Make", "Colour"]
door_feature = ["Doors"]
num_feature = ["Odometer (KM)"]

# Create an imputer (Something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_feature),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_feature)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [68]:
X_pd = pd.DataFrame(filled_X,
                    columns=["Make", "Colour", "Doors", "Odometer (KM)"])
X_pd.head()

     Make Colour Doors Odometer (KM)
0   Honda  White   4.0       35431.0
1     BMW   Blue   5.0      192714.0
2   Honda  White   4.0       84714.0
3  Toyota  White   4.0      154365.0
4  Nissan   Blue   3.0      181577.0


In [69]:
X_pd.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [75]:
# Convert categorical values to numerical
from sklearn.preprocessing import OneHotEncoder


cat_data = ["Make", "Colour", "Doors"]
hot_one = OneHotEncoder()
transformer = ColumnTransformer([("hot_one", hot_one, cat_data)],
                                remainder="passthrough",
                               sparse_threshold=0)

transform_X = transformer.fit_transform(X_pd)
transform_X

array([[0.0, 1.0, 0.0, ..., 1.0, 0.0, 35431.0],
       [1.0, 0.0, 0.0, ..., 0.0, 1.0, 192714.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 84714.0],
       ...,
       [0.0, 0.0, 1.0, ..., 1.0, 0.0, 66604.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 215883.0],
       [0.0, 0.0, 0.0, ..., 1.0, 0.0, 248360.0]], dtype=object)

In [79]:
# Now that our data is numerical and has no missing values, fit a model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transform_X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((760, 15), (190, 15), (760,), (190,))

In [80]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.20991493163258768
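
The steps above fill missing values on the whole dataset before splitting, so the imputation statistics (for example the odometer mean) also see the test rows. One common way to avoid that is to wrap imputation, encoding and the model in a single `Pipeline` and fit it on the training set only. The sketch below is an illustration under stated assumptions, not the notebook's method: the toy data, the price formula and every variable name are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data shaped like the car sales features
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota"], size=n), dtype=object),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n).astype(float),
})
toy["Price"] = 30_000 - 0.05 * toy["Odometer (KM)"]  # made-up price rule

# Knock out a few feature values to imitate missing data
toy.loc[[1, 7], "Make"] = np.nan
toy.loc[[3, 11], "Odometer (KM)"] = np.nan

# Impute then one-hot encode the categorical columns; impute the numeric column with its mean
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
door_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", cat_pipe, ["Make"]),
    ("door", door_pipe, ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

demo_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Imputers and encoders are fitted on the training rows only
demo_pipeline.fit(X_tr, y_tr)
test_r2 = demo_pipeline.score(X_te, y_te)
print(test_r2)
```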

# Retry

In [2]:
what_were_covering

['0. An end-to-end Scikit-Learn workflow',
 '1. Getting the data ready',
 '2. Choose the right estimator/algorithm for our problems',
 '3. Fit the model/algorithm and use it to make predictions on our data',
 '4. Evaluating a model',
 '5. Improve a model',
 '6. Save and load a trained model',
 '7. Putting it all together!']

In [3]:
# Getting the data ready
# Import the data
# Check for missing data
# Make categorical data numerical
# Choose the right model
# fit the model
# Make predictions

In [4]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

In [5]:
# 1. Import the data
missing_data = pd.read_csv("car-sales-extended-missing-data.csv")
missing_data.head()

     Make Colour  Odometer (KM)  Doors    Price
0   Honda  White        35431.0    4.0  15323.0
1     BMW   Blue       192714.0    5.0  19943.0
2   Honda  White        84714.0    4.0  28343.0
3  Toyota  White       154365.0    4.0  13434.0
4  Nissan   Blue       181577.0    3.0  14043.0


In [6]:
# 2. Check for missing data
missing_data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [10]:
# Split the data into X and y
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]

In [14]:
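# Fix missing data in y: drop the rows with a missing price from the whole dataframe,
# then rebuild X and y so they keep the same rows
missing_data = missing_data.dropna(subset=["Price"])
X = missing_data.drop("Price", axis=1)
y = missing_data["Price"]
y.isna().sum()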

0

In [None]:
# Fix the missing data in X
# fill_value is the replacement value; missing_values (default np.nan) says what counts as missing
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
mean_imputer = SimpleImputer(strategy="mean")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

cat_feature = ["Make", "Colour"]
mean_feature = ["Odometer (KM)"]
door_feature = ["Doors"]

imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_feature),
    ("mean_imputer", mean_imputer, mean_feature),
    ("door_imputer", door_imputer, door_feature)
], remainder="passthrough", sparse_threshold=0)

filled_X = imputer.fit_transform(X)
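
The two `SimpleImputer` arguments `missing_values` and `fill_value` are easy to mix up. A short sketch on made-up door counts (the arrays, the `-1` sentinel and the variable names are assumptions for illustration): `fill_value` is what gets written into the gaps, while `missing_values` tells the imputer which existing value marks a gap (`np.nan` by default).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Gaps marked with NaN (the default missing_values): replace them with 4
doors_nan = np.array([[4.0], [np.nan], [5.0], [np.nan]])
nan_imputer = SimpleImputer(strategy="constant", fill_value=4)
filled_nan = nan_imputer.fit_transform(doors_nan)

# Gaps marked with a sentinel value of -1 instead of NaN: say so via missing_values
doors_sentinel = np.array([[4.0], [-1.0], [5.0]])
sentinel_imputer = SimpleImputer(missing_values=-1, strategy="constant", fill_value=4)
filled_sentinel = sentinel_imputer.fit_transform(doors_sentinel)

print(filled_nan.ravel(), filled_sentinel.ravel())
```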