## Overview

Mr. K is a man who knows no bounds when it comes to achieving his goals. One of these goals is to publish an immensely famous mobile game on the App Store. After working for months on end on his new, original game "Clash of Royales", Mr. K was shocked to learn that Apple had rejected it for supposed "unoriginality".

Enraged, Mr. K went to the Apple Campus and attempted to bribe an Apple employee to accept his game. However, the employee stated he would only accept the bribe if Mr. K gave him 1,000 cars, each with its exact price tag, that he could resell. So Mr. K, not understanding sarcasm, obtained his 1,000 cars through various illegal means, but he doesn't know the price of any of them.

So, he has now decided to go to the place with the world's brightest minds, Tinovation, to get the prices for his cars. This is where you come in. Give Mr. K an accurate estimation of the prices for each car, and get his mobile game to the App Store!


#### Description

You have been provided a dataset, in the form of a CSV file, of over 4,000 cars and their prices, along with various features of each car. This is your training set, on which you will train a machine learning model to predict "Price" from the rest of the features.

You will then run your model on the testing set, which has all the same features except for the "Price" column. Collect the predictions your model makes for each of the cars there, and submit them to the competition.

#### Evaluation
Submissions are evaluated using the mean absolute error (MAE), a common metric for regression problems. Learn more here: https://deepchecks.com/glossary/mean-absolute-error/
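
As a quick illustration of the metric, MAE is the average of the absolute differences between actual and predicted prices. The sketch below computes it by hand; the three prices are made up for illustration and do not come from the dataset, and the helper name `mean_absolute_error_manual` is hypothetical.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices, for illustration only
actual = [10000, 20000, 30000]
predicted = [12000, 19000, 33000]

example_mae = mean_absolute_error_manual(actual, predicted)
print(example_mae)  # (2000 + 1000 + 3000) / 3 = 2000.0
```

Because the errors are not squared, a single badly mispriced car affects MAE less than it would affect a squared-error metric.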

#### Submission File
Your submission should have the same format as the sample submission file, but hopefully with more accurate values for the "Price" column.

In [2]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from joblib import load, dump
from sklearn.metrics import mean_absolute_error

In [3]:
df = pd.read_csv("../data/Help Mr K The car thief/car_prices_train.csv")
df.head()

| | Brand | Model | Model Year | Mileage | Fuel Type | Engine | Liter | Transmission | Speed | Exterior Color | Interior Color | Accident | Clean Title | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford | Utility Police Interceptor Base | 2013 | 51000 | E85 Flex Fuel | 300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capa... | 3.7 | 6-Speed A/T | 6.0 | Black | Black | At least 1 accident or damage reported | Yes | 10300 |
| 1 | Hyundai | Palisade SEL | 2021 | 34742 | Gasoline | 3.8L V6 24V GDI DOHC | 3.8 | 8-Speed Automatic | 8.0 | Moonlight Cloud | Gray | At least 1 accident or damage reported | Yes | 38005 |
| 2 | Lexus | RX 350 RX 350 | 2022 | 22372 | Gasoline | 3.5 Liter DOHC | 3.5 | Automatic | NaN | Blue | Black | None reported | NaN | 54598 |
| 3 | INFINITI | Q50 Hybrid Sport | 2015 | 88900 | Hybrid | 354.0HP 3.5L V6 Cylinder Engine Gas/Electric H... | 3.5 | 7-Speed A/T | 7.0 | Black | Black | None reported | Yes | 15500 |
| 4 | Audi | Q3 45 S line Premium Plus | 2021 | 9835 | Gasoline | 2.0L I4 16V GDI DOHC Turbo | 2.0 | 8-Speed Automatic | 8.0 | Glacier White Metallic | Black | None reported | NaN | 34999 |


In [4]:
df.isna().sum()

Brand                0
Model                0
Model Year           0
Mileage              0
Fuel Type          170
Engine               0
Liter              238
Transmission         0
Speed             1833
Exterior Color       0
Interior Color       0
Accident           113
Clean Title        596
Price                0
dtype: int64

In [5]:
df.dtypes

Brand              object
Model              object
Model Year          int64
Mileage             int64
Fuel Type          object
Engine             object
Liter             float64
Transmission       object
Speed             float64
Exterior Color     object
Interior Color     object
Accident           object
Clean Title        object
Price               int64
dtype: object

In [66]:
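# Clean the data: impute missing values and encode categorical columns as numbers.

# Categorical columns: fill gaps with the literal "missing", then one-hot encode.
cat_imputer = make_pipeline(SimpleImputer(strategy="constant", fill_value="missing"),
                            OneHotEncoder(handle_unknown="ignore"))
# Numerical columns: fill gaps with the column mean.
numerical_imputer = make_pipeline(SimpleImputer(strategy="mean"))

# Note: columns not listed here ("Model Year", "Mileage", "Accident", "Clean Title")
# are dropped, because the transformer's default is remainder="drop".
preprocessing = make_column_transformer(
    (cat_imputer, ["Brand", "Model", "Fuel Type", "Engine", "Transmission",
                   "Exterior Color", "Interior Color"]),
    (numerical_imputer, ["Speed", "Liter"]))

model = make_pipeline(preprocessing, RandomForestRegressor())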
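
A column transformer built with `make_column_transformer` discards any column it is not told about (its default is `remainder="drop"`), so the pipeline above never sees "Model Year", "Mileage", "Accident" or "Clean Title". The sketch below is one possible variation that routes those columns through the same kind of imputers. It is not the notebook's method: the names `wider_preprocessing` and `wider_model`, the median strategy, the forest settings, and the four-row toy frame are all assumptions chosen to keep the example small and runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column grouping; a subset of the dataset's columns for brevity.
categorical_cols = ["Brand", "Fuel Type", "Accident", "Clean Title"]
numeric_cols = ["Model Year", "Mileage", "Liter", "Speed"]

cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
num_pipe = make_pipeline(SimpleImputer(strategy="median"))

wider_preprocessing = make_column_transformer(
    (cat_pipe, categorical_cols),
    (num_pipe, numeric_cols),
)
wider_model = make_pipeline(
    wider_preprocessing,
    RandomForestRegressor(n_estimators=10, random_state=0),
)

# Tiny toy frame that mimics the dataset's columns, with a few gaps.
toy = pd.DataFrame({
    "Brand": ["Ford", "Audi", "INFINITI", "Lexus"],
    "Fuel Type": ["E85 Flex Fuel", np.nan, "Hybrid", "Gasoline"],
    "Accident": ["At least 1 accident or damage reported", "None reported",
                 np.nan, "None reported"],
    "Clean Title": ["Yes", np.nan, "Yes", np.nan],
    "Model Year": [2013, 2021, 2015, 2022],
    "Mileage": [51000, 9835, 88900, 22372],
    "Liter": [3.7, 2.0, np.nan, 3.5],
    "Speed": [6.0, 8.0, 7.0, np.nan],
})
toy_prices = [10300, 34999, 15500, 54598]

wider_model.fit(toy, toy_prices)
toy_preds = wider_model.predict(toy)
print(len(toy_preds))
```

Whether the extra columns actually lower the error on this dataset would need to be checked on a held-out split.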

In [6]:
X = df.drop("Price", axis=1)
y = df["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
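
Without a fixed seed, `train_test_split` shuffles differently on every run, so the scores below will vary from one execution to the next. A minimal sketch of a reproducible split, assuming an arbitrary seed of 42 and toy lists in place of the real `X` and `y`:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and target
features = list(range(10))
targets = [v * 2 for v in features]

# The same random_state gives the same partition every time
split_a = train_test_split(features, targets, test_size=0.2, random_state=42)
split_b = train_test_split(features, targets, test_size=0.2, random_state=42)

print(split_a == split_b)                # identical partitions
print(len(split_a[0]), len(split_a[1]))  # 8 training rows, 2 test rows
```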

In [67]:
model.fit(X_train, y_train)

In [8]:
model.score(X_train, y_train)  # R^2 on the training split

0.8781681338211043

In [9]:
model.score(X_test, y_test)  # R^2 on the held-out split

0.8615278773051774
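
For scikit-learn regressors, `score` returns the coefficient of determination, R², not the competition's MAE. The sketch below computes R² by hand to show what the number measures; the helper name `r2_manual` and the three prices are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices, for illustration only
toy_actual = [10000, 20000, 30000]
toy_predicted = [12000, 19000, 33000]

toy_r2 = r2_manual(toy_actual, toy_predicted)
print(round(toy_r2, 2))  # 1 - 14e6 / 2e8, about 0.93
```

A value of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean price.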

In [10]:
y_preds = model.predict(X_test)

In [11]:
mae = mean_absolute_error(y_test, y_preds)
print(f"The mean absolute error is: {mae:.2f}")

The mean absolute error is: 6946.85
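
An MAE figure is easier to judge next to a naive baseline. The sketch below scores a predictor that always returns the median training price, using scikit-learn's `DummyRegressor`. The five training prices are the ones shown in `df.head()` above; the two held-out prices are made up, so the resulting number says nothing about the real dataset.

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

train_prices = [10300, 38005, 54598, 15500, 34999]  # from df.head()
held_out_prices = [20000, 40000]                    # made up for illustration

# DummyRegressor ignores the features, so a placeholder column is enough
baseline = DummyRegressor(strategy="median")
baseline.fit([[0]] * len(train_prices), train_prices)

baseline_preds = baseline.predict([[0]] * len(held_out_prices))
baseline_mae = mean_absolute_error(held_out_prices, baseline_preds)
print(f"Baseline MAE: {baseline_mae:.2f}")  # median is 34999, so (14999 + 5001) / 2 = 10000.00
```

Running the same comparison on the real `y_train` and `y_test` would show how much of the error the random forest removes relative to guessing the median.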


In [58]:
# dump(model, "Mr-K.joblib")  # run once to save the fitted pipeline, then commented out

['Mr-K.joblib']

In [7]:
model = load("Mr-K.joblib")

In [12]:
# Predict prices for the competition test set (it has no "Price" column)
test_df = pd.read_csv("../data/Help Mr K The car thief/car_prices_test.csv")
y_preds = model.predict(test_df)

In [13]:
test_df["Price"] = y_preds  # assign the prediction array directly, row for row
test_df.head()

| | ID | Brand | Model | Model Year | Mileage | Fuel Type | Engine | Liter | Transmission | Speed | Exterior Color | Interior Color | Accident | Clean Title | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Mercedes-Benz | C-Class C 300 4MATIC | 2013 | 15959 | Gasoline | 335.0HP 4.7L 8 Cylinder Engine Gasoline Fuel | 4.7 | 7-Speed A/T | 7.0 | Red | Black | None reported | Yes | 23833.05 |
| 1 | 1 | INFINITI | QX60 Base | 2018 | 54394 | Gasoline | 265.0HP 3.5L V6 Cylinder Engine Gasoline Fuel | 3.5 | Transmission w/Dual Shift Mode | NaN | Gray | Black | None reported | Yes | 27816.33 |
| 2 | 2 | Chevrolet | Camaro 2SS | 2020 | 37464 | Gasoline | 455.0HP 6.2L 8 Cylinder Engine Gasoline Fuel | 6.2 | Transmission w/Dual Shift Mode | NaN | Blue | Black | None reported | Yes | 42153.31 |
| 3 | 3 | RAM | 2500 Laramie | 2016 | 145185 | Diesel | 350.0HP 6.7L Straight 6 Cylinder Engine Diesel... | 6.7 | 6-Speed A/T | 6.0 | Red | Black | None reported | Yes | 49279.78 |
| 4 | 4 | Honda | Pilot EX-L | 2018 | 43509 | Gasoline | 3.5L V6 24V GDI SOHC | 3.5 | CVT Transmission | NaN | Gray | Gray | At least 1 accident or damage reported | Yes | 17575.47 |


In [14]:
submission_result = pd.DataFrame(data={"ID": test_df["ID"],
                                      "Price": test_df["Price"]})

In [15]:
submission_result.head()

| | ID | Price |
|---|---|---|
| 0 | 0 | 23833.05 |
| 1 | 1 | 27816.33 |
| 2 | 2 | 42153.31 |
| 3 | 3 | 49279.78 |
| 4 | 4 | 17575.47 |


In [16]:
submission_result.to_csv("my_submission.csv", index=False)
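
Before uploading, it can help to sanity-check the submission frame. The sketch below is a hypothetical helper, `check_submission`, that looks for the problems most likely to cause a rejected or badly scored file. The specific checks are assumptions about what the competition expects, based only on the "ID" and "Price" columns used above.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of human-readable problems; an empty list means none found."""
    if list(submission.columns) != ["ID", "Price"]:
        return [f"unexpected columns: {list(submission.columns)}"]

    problems = []
    if len(submission) != len(expected_ids):
        problems.append("row count does not match the test set")
    if set(submission["ID"]) != set(expected_ids):
        problems.append("IDs do not match the test set")
    if submission["Price"].isna().any():
        problems.append("Price contains missing values")
    if (submission["Price"] < 0).any():
        problems.append("Price contains negative values")
    return problems


# Toy frames for illustration; in the notebook these would be
# submission_result and test_df["ID"].
good = pd.DataFrame({"ID": [0, 1, 2], "Price": [23833.05, 27816.33, 42153.31]})
bad = pd.DataFrame({"ID": [0, 1, 2], "Price": [23833.05, None, 42153.31]})

good_problems = check_submission(good, [0, 1, 2])
bad_problems = check_submission(bad, [0, 1, 2])
print(good_problems)  # []
print(bad_problems)   # ['Price contains missing values']
```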