# Regression: Car Price Predictions

In this notebook I'll work on a regression problem: predicting the price of a car from its features. The dataset has the following columns: **Make**, the brand of the car; **Colour**; **Odometer**, the car's mileage in kilometers; **Doors**; and **Price**, the value to predict. The problem with this dataset is that **Make** and **Colour** aren't numerical, and since a regression model needs numerical input, I'll have to convert them into numbers using `OneHotEncoder` and `ColumnTransformer` from the scikit-learn library.

The **Doors** column already holds numbers, but it should be transformed too, since the number of doors acts as a category rather than a quantity.
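
To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper `one_hot_encode` and the sample values are hypothetical and only illustrate the principle; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    # Sorted unique categories define the column order
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Doors is stored as a number but is treated as a category here
doors = [4, 5, 4, 3]
categories, encoded = one_hot_encode(doors)

print(categories)  # [3, 4, 5]
for row in encoded:
    print(row)
```

Each door count gets its own 0/1 column, so the model no longer reads 5 doors as "more" than 4 doors.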

## Part 1.1 : Checking and setting the data

In this part of the notebook I'll load and inspect the car sales data.

In [1]:
# Importing needed libraries
import pandas as pd
import numpy as np

# Getting the data
car_sales = pd.read_csv("car-sales-extended.csv")

In [2]:
# Displaying the dataset
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [3]:
# Checking Doors column of provided data
car_sales["Doors"].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

**note**: The cell above counts how many cars in the dataset have 4, 5 and 3 doors.

In [4]:
# Checking datatypes of every column. 
# This helps to identify which columns should be 
# transformed with OneHotEncoder and ColumnTransformer
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object
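
As a sketch, the non-numeric columns could also be picked out programmatically with `DataFrame.select_dtypes`. The small DataFrame below is made-up sample data with the same column names, not the real file. `Doors` is an integer column, so it still has to be added to the list by hand.

```python
import pandas as pd

# Made-up rows that mimic the structure of the car sales data
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Odometer (KM)": [35431, 192714, 154365],
    "Doors": [4, 5, 4],
    "Price": [15323, 19943, 13434],
})

# Every column that is not numeric needs encoding
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

# Doors is numeric but categorical in meaning, so it is appended manually
categorical_features = text_columns + ["Doors"]
print(categorical_features)
```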

In [5]:
# Splitting the Price column from the rest of data
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

### Trying to train the model without transforming the data
In this part of the notebook I'll try to fit the original, untransformed data to the model. This raises a `ValueError`, because the regression model can't work with `object` (string) columns.

In [6]:
# Splitting the data into training and test sets.
# With test_size=0.2, train_test_split keeps 80% of the data for training
# and 20% for testing.
from sklearn.model_selection import train_test_split

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [7]:
#### This won't work
# Importing the selected model and fitting the data
from sklearn.ensemble import RandomForestRegressor

# Initializing the regression model
model = RandomForestRegressor()

# Training the model on training data
model.fit(X_train, y_train)

# Evaluating the model on the testing data
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Honda'

## Part 1.2 : Transforming the data using OneHotEncoder and ColumnTransformer

Because of the `ValueError: could not convert string to float: 'Honda'` above, I'll have to transform the **Make** and **Colour** columns (together with **Doors**) into numerical values. I'll do this with `OneHotEncoder` and `ColumnTransformer`, which turn each category of a column into its own binary (0/1) column, a form the regression model can work with.

In [8]:
# Importing libraries and transforming selected columns
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

tranformed_X = transformer.fit_transform(X)
tranformed_X


array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [9]:
# Putting the transformed data into a DataFrame
eks = pd.DataFrame(tranformed_X)

### Values before transformation

In [10]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


### Values after transformation

In [11]:
eks.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0


In [12]:
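# Alternative: one-hot encoding with pandas.
# Doors is an integer column, so get_dummies leaves it as it is
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]])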

In [13]:
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
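
In the output above `Doors` stays a single numeric column, because `pd.get_dummies` only encodes string-like columns by default. One possible fix, shown on made-up sample rows, is to name the columns explicitly through the `columns` parameter.

```python
import pandas as pd

# Made-up rows with the three categorical columns
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Colour": ["White", "Blue", "Blue"],
    "Doors": [4, 5, 3],
})

# Listing the columns forces Doors to be encoded as a category too
dummies = pd.get_dummies(sample, columns=["Make", "Colour", "Doors"])
print(dummies.columns.tolist())
```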


## Part 2 : Fitting the transformed data (training the model)

In [14]:
# Setting the random seed for reproducible results
np.random.seed(42)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tranformed_X, y, test_size=0.2)

# Fitting the data to the model
model.fit(X_train, y_train)

RandomForestRegressor()

In [15]:
# Evaluating the model on the test data
model.score(X_test, y_test)

0.3235867221569877
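
For regressors, `score` returns the coefficient of determination R², so a value of about 0.32 suggests the model explains roughly a third of the variance in price on the test set. The following plain-Python sketch shows how R² is computed; the function name `r_squared` and the numbers are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [10.0, 20.0, 30.0]
perfect = r_squared(y_true, [10.0, 20.0, 30.0])   # 1.0
baseline = r_squared(y_true, [20.0, 20.0, 20.0])  # 0.0
print(perfect, baseline)
```

A score of 1.0 means perfect predictions, while 0.0 is no better than always predicting the mean price.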

# Regression with missing values

In this part of the notebook I'll be using the same car sales dataset but with some missing values.

In [16]:
# Getting the car sales data with missing values
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")

In [17]:
# Checking how many missing values there are in each column
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [18]:
# Splitting the data
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [19]:
### Transforming categorical data

# Importing needed libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Defining parts of the data that need to be transformed
categorical_features = ["Make","Colour","Doors"]

# Initializing OneHotEncoder 
one_hot = OneHotEncoder()

# Initializing the transformer with the columns to encode
transformer = ColumnTransformer([("one_hot",
                               one_hot,
                               categorical_features)],
                               remainder="passthrough")

# Fitting the transformer and encoding the dataset X
transformed_X = transformer.fit_transform(X)


In [20]:
# Displaying encoded data
# (this shows `tranformed_X` from In [8]; the cell above stored its result as `transformed_X`)
tranformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

### Filling missing data with Pandas
In this part, I'll fill in the missing values. Missing entries in the `Make` and `Colour` columns will be filled with the string `missing`. The `Odometer (KM)` column will be filled with the mean of its existing values. The `Doors` column will be filled with `4`, since most cars in the dataset have 4 doors.

In [21]:
# Filling missing values in the Make column with the string "missing"
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Filling missing values in the Colour column with the string "missing"
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Filling missing values in the Odometer column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Filling missing values in the Doors column with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [22]:
# Checking if all provided columns have been filled
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [23]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,missing,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


**Now only the 50 rows with a missing price remain. Since the dataset has 1000 rows, I'll delete those 50 rows with a missing `Price` value.**

In [24]:
# Deleting rows with missing Price
car_sales_missing.dropna(inplace=True)

In [25]:
# Final check for any remaining missing values
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [26]:
# Splitting the Price column of updated dataset
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [28]:
### Transforming categorical data

# Importing needed libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Defining parts of the data that need to be transformed
categorical_features = ["Make","Colour","Doors"]

# Initializing OneHotEncoder 
one_hot = OneHotEncoder()

# Initializing the transformer with the columns to encode
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

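# Fitting the transformer and encoding the features X (without the Price target)
transformed_X = transformer.fit_transform(X)
# The name below is spelled `tranformed_X`, so this displays the array from In [8]
tranformed_X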

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

**note**: I had to run the column transformer step again because filling the missing rows changed the data.

### Filling missing values with Scikit-Learn

This is an example of filling missing values with the scikit-learn library.

In [34]:
# Getting dataset with missing values
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")

# Checking which columns have missing values
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [35]:
# Deleting rows that have no value in the Price column
car_sales_missing.dropna(subset=["Price"], inplace=True)

# Checking if rows were deleted
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [36]:
# Splitting the Price column from the rest of the dataset
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

#### Filling missing data with SimpleImputer and ColumnTransformer from Scikit-Learn

In [29]:
# Importing needed libraries
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

### Defining what values have to be filled in certain rows of selected columns
# Defining what value will be filled in categorical object type columns
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# Defining what number has to be filled in door column
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# Defining what value has to be filled in Odometer column
num_imputer = SimpleImputer(strategy="mean")

# Grouping the columns for later use in ColumnTransformer
cat_features = ["Make","Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Putting it all together
imputer = ColumnTransformer([
    ("cat_features", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, num_features)])

# Fitting the imputers and filling the missing values in X
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [30]:
# Building a DataFrame from the filled values
# (the column order follows the order of the imputers above)
car_sales_filled = pd.DataFrame(filled_X, columns=["Make","Colour","Doors","Odometer (KM)"])

car_sales_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0


In [31]:
# Checking if there are any missing values left
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [32]:
### Transforming object types and Doors column into numerical values
# Importing libraries
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Using OneHotEncoder and ColumnTransformer to get numerical values
categorical_features = ["Make","Colour","Doors"]

# Initializing OneHotEncoder
one_hot = OneHotEncoder()

# Initializing ColumnTransformer and providing needed values
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")

# Fitting the data
tranformed_X = transformer.fit_transform(car_sales_filled)
tranformed_X


<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>
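
The transformer returns a SciPy sparse matrix here, which stores only the non-zero entries: 950 rows with four non-zero values each (one per encoded column plus the odometer reading) give the 3800 stored elements. A small sketch, using a made-up matrix, of how such an object can be inspected or converted to a dense array:

```python
import numpy as np
from scipy import sparse

# Made-up 2x5 matrix: two one-hot blocks followed by a numeric column
dense = np.array([
    [1.0, 0.0, 0.0, 1.0, 35431.0],
    [0.0, 1.0, 1.0, 0.0, 192714.0],
])
matrix = sparse.csr_matrix(dense)

print(matrix.shape)  # (2, 5)
print(matrix.nnz)    # 6 stored (non-zero) elements

# toarray() gives back an ordinary NumPy array for inspection
restored = matrix.toarray()
```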

In [33]:
# Setting the random seed for reproducible results
np.random.seed(42)

# Importing machine learning model RandomForestRegressor 
# with train_test_split for splitting the data
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tranformed_X, y, test_size=0.2)

# Initializing the model
model = RandomForestRegressor()

# Training the model on training data
model.fit(X_train, y_train)

# Evaluating the performance of the model on the testing data
model.score(X_test, y_test)

0.22011714008302485
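
The imputation and encoding steps above are fitted on the whole dataset before it is split. A possible alternative, sketched here on randomly generated stand-in data, chains the same steps in a `Pipeline` so that they are fitted on the training split only. The generated data, the step names and the resulting score are illustrative and unrelated to the real car sales file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated stand-in data with the same column names
rng = np.random.default_rng(42)
n_rows = 200
data = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota", "Nissan"], n_rows), dtype=object),
    "Colour": pd.Series(rng.choice(["White", "Blue", "Black"], n_rows), dtype=object),
    "Odometer (KM)": rng.integers(10_000, 250_000, n_rows).astype(float),
    "Doors": rng.choice([3.0, 4.0, 5.0], n_rows),
})
price = 30_000 - 0.08 * data["Odometer (KM)"] + rng.normal(0, 1_000, n_rows)

# Knock out a few values to imitate the missing data
data.loc[::15, "Make"] = np.nan
data.loc[::20, "Odometer (KM)"] = np.nan
data.loc[::25, "Doors"] = np.nan

# Same fill rules as in the notebook, each followed by one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
doors = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("categorical", categorical, ["Make", "Colour"]),
    ("doors", doors, ["Doors"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    data, price, test_size=0.2, random_state=42)

# Imputers and encoders learn their values from the training split only
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"R^2 on the stand-in data: {score:.2f}")
```

With this layout the odometer mean used for filling is computed from the training rows alone, so no information from the test rows reaches the model.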

# Making predictions on the California housing dataset

In this part of the exercise, I'll predict house values using the California housing data from the scikit-learn library.

## Part 1 : Data preparation

In [34]:
# Getting the dataset from the sklearn library
from sklearn.datasets import fetch_california_housing

# Initializing the dataset
housing = fetch_california_housing()

# Checking the dataset
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [35]:
# Creating a DataFrame
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])

# Showing the dataset
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [36]:
# Adding the target values to housing_df as a new column
housing_df["target"] = housing["target"]

In [37]:
# Setting the random seed
np.random.seed(42)

# Importing Ridge linear regression model
from sklearn.linear_model import Ridge

# Splitting the DataFrame into features (X) and target (y)
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initializing the model
model = Ridge()

# Training the model on training set
model.fit(X_train, y_train)

# Evaluating the model's results
model.score(X_test, y_test)

0.5758549611440126

## Choosing a different model

Since the initial score of the Ridge model was fairly low (R² of about 0.58), I decided to try `RandomForestRegressor`.

In [38]:
# Importing the model
from sklearn.ensemble import RandomForestRegressor

# Setting the random seed
np.random.seed(42)

# Splitting the data into measured values and results 
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Initializing the model
model = RandomForestRegressor()

# Training the model on training data
model.fit(X_train, y_train)

# Testing the model's score
model.score(X_test, y_test)

0.8066196804802649
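
A single train/test split can make one model look better or worse by chance. As a hedged sketch, `cross_val_score` averages the R² over several splits. Synthetic data from `make_regression` stands in for the housing data so the example runs without a download, which also means its numbers say nothing about the real comparison.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 300 samples with 8 features, like the housing set
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

models = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

mean_scores = {}
for name, estimator in models.items():
    # Five-fold cross-validation; regressors are scored with R^2 by default
    scores = cross_val_score(estimator, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean R^2 = {mean_scores[name]:.3f}")
```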

### Regression model evaluation metrics

In [61]:
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [39]:
# Making predictions with the X_test set
y_preds = model.predict(X_test)

In [40]:
# Importing metric functions from scikit-learn
from sklearn.metrics import mean_absolute_error, mean_squared_error

m_a_e = mean_absolute_error(y_test, y_preds)

m_s_e = mean_squared_error(y_test, y_preds)


print(f"Mean absolute error of Random Forest Regressor: {m_a_e}")
print(f"Mean squared error of Random Forest Regressor: {m_s_e}")

Mean absolute error of Random Forest Regressor: 0.3265721842781009
Mean squared error of Random Forest Regressor: 0.2534073069137548


In [41]:
# Displaying differences between actual values and predicted values
df = pd.DataFrame(data={"actual values": y_test,
                       "predicted values": y_preds})

df["differences"] = df["predicted values"] - df["actual values"]
df.head(10)

Unnamed: 0,actual values,predicted values,differences
20046,0.477,0.49384,0.01684
3024,0.458,0.75494,0.29694
15663,5.00001,4.928596,-0.071414
20484,2.186,2.54029,0.35429
9814,2.78,2.33176,-0.44824
13311,1.587,1.65497,0.06797
7113,1.982,2.34323,0.36123
7668,1.575,1.66182,0.08682
18246,3.4,2.47489,-0.92511
5723,4.466,4.834478,0.368478
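
The `differences` column holds the per-row errors; mean absolute error and mean squared error summarise such errors as averages. A plain-Python sketch of both formulas, applied to the first three rows of the table above (values copied from it as displayed). The helper names `mae` and `mse` are illustrative and are not the scikit-learn functions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, ignoring sign."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error, which punishes large misses."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# First three rows of the table above
actual = [0.477, 0.458, 5.00001]
predicted = [0.49384, 0.75494, 4.928596]

error_mae = mae(actual, predicted)
error_mse = mse(actual, predicted)
# The square root of MSE is back in the same units as the target
error_rmse = math.sqrt(error_mse)

print(error_mae, error_mse, error_rmse)
```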


In [42]:
# Saving the model
import pickle

# The with block closes the file once the model has been written
with open("trained_reg_model.pkl", "wb") as file:
    pickle.dump(model, file)
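
To reuse the saved model later, it can be read back with `pickle.load`. The sketch below round-trips a stand-in dictionary through a temporary file rather than the real model, so that it runs on its own; with the notebook's file, the path would be `trained_reg_model.pkl`. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model
stand_in = {"name": "RandomForestRegressor", "n_estimators": 100}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "stand_in.pkl")

    # Context managers close the file even if an error occurs
    with open(path, "wb") as file:
        pickle.dump(stand_in, file)

    with open(path, "rb") as file:
        loaded = pickle.load(file)

print(loaded == stand_in)  # True
```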