# Download and look into the data

### Project description

Rusty Bargain, a used car sales service, is developing an app to attract new
customers. In that app, you can quickly find out the market value of your car.

We have historical data that includes technical specifications, trim versions, and
prices. We need to build a model that predicts the market value of a car and compare the candidate models on the following criteria:

- the quality of the prediction
- the speed of the prediction
- the time required for training

### Import needed libraries

In [None]:
# Data tools
import pandas as pd
import numpy as np
#from pandas_profiling import ProfileReport

# others
import time

# Graphics and display
from IPython.core.interactiveshell import InteractiveShell
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
%matplotlib inline

# Ml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
!pip install lightgbm 
import lightgbm as lgb
!pip install catboost
import catboost as cb

# Statistics
from scipy import stats

print('Project libraries have been successfully imported!')

### Set environment variables

In [None]:
# display the output of every expression in a cell, not only print() or the last one
InteractiveShell.ast_node_interactivity = "all"      

## Open the file and look into the data

In [None]:
try:
    car_data = pd.read_csv('car_data.csv')
except FileNotFoundError:
    car_data = pd.read_csv('/datasets/car_data.csv')
    
print('Data has been read correctly!')

### Data description

In [None]:
# count the rows equal to zero in each column
def zero_check(df):
    for i in df:
        print(i,len(df[df[i]==0]))
        
# function to determine if columns in file have null values        
def get_percent_of_na(df, num):
    count = 0
    df = df.copy()
    s = (df.isna().sum() / df.shape[0])
    for column, percent in zip(s.index, s.values):
        num_of_nulls = df[column].isna().sum()
        if num_of_nulls == 0:
            continue
        else:
            count += 1
        print('{} has {} nulls, which is {:.{}%} percent of Nulls'.format(column, num_of_nulls, percent, num))
    if count != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(count) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")       
        
# function to display general information about the dataset
def general_info(df):
    print("\033[1m" + "\033[0m")
    display(pd.concat([df.dtypes, df.count(),df.isna().sum(),df.isna().sum()*100/len(df)], keys=['type','count','na','na%'],
                      axis=1))
    print()
    print("\033[1m" + 'Head:')  
    display(df.head())
    print()
    print("\033[1m" + 'Tail:')
    display(df.tail())
    print()
    print("\033[1m" + 'Info:')
    print()
    df.info()  # prints its own report and returns None
    print()
    print("\033[1m" + 'Describe:')
    print()
    display(df.describe())
    print()
    print("\033[1m" + 'Describe include: all :')
    print()
    display(df.describe(include='all'))
    print()
    print("\033[1m" + 'nulls in the columns:')
    print()
    get_percent_of_na(df, 4)  # prints its own report and returns None
    print()
    print("\033[1m" + 'Zeros in the columns:') 
    print()
    zero_check(df)  # prints its own report and returns None
    print()
    print("\033[1m" + 'Shape:', df.shape)
    print()
    print()
    print('Duplicated:',"\033[1m" + 'We have {} duplicated rows\n'.format(df.duplicated().sum()) + "\033[0m")
    print()
    print("\033[1m" + 'Dtypes:')  
    display(df.dtypes)
    print()

In [None]:
#print our data
print('information about the dataset:')
general_info(car_data)

Notes:

- We have missing values in some fields
    - VehicleType
    - Gearbox
    - Model
    - FuelType
    - NotRepaired
- Some fields contain categorical data, some numerical, some dates and some booleans
- The mean price is much higher than its median, and the variance is large.
- With mileage it is the opposite: the median is much higher than the mean.
- We see zeroes in Price, Power, RegistrationMonth and NumberOfPictures
- There are 354,369 observations, of which 262 are duplicates

In [None]:
## change field names to lower case with _ separator

In [None]:
car_data.columns = [x.lower() for x in car_data.columns]

In [None]:
car_data.columns = ['date_crawled', 'price', 'vehicle_type', 'registration_year', 'gearbox',
       'power', 'model', 'mileage', 'registration_month', 'fuel_type', 'brand',
       'not_repaired', 'date_created', 'number_of_pictures', 'postal_code',
       'last_seen']

## Columns info

Features
- DateCrawled — date profile was downloaded from the database
- VehicleType — vehicle body type
- RegistrationYear — vehicle registration year
- Gearbox — gearbox type
- Power — power (hp)
- Model — vehicle model
- Mileage — mileage (measured in km due to dataset's regional specifics)
- RegistrationMonth — vehicle registration month
- FuelType — fuel type
- Brand — vehicle brand
- NotRepaired — vehicle repaired or not
- DateCreated — date of profile creation
- NumberOfPictures — number of vehicle pictures
- PostalCode — postal code of profile owner (user)
- LastSeen — date of the last activity of the user


Target
- Price — price (Euro)

## EDA with data cleaning

In [None]:
car_data_original = car_data.copy(deep=True)

### Univariate EDA
Let's look at the distributions for each numeric variable in the dataset:

In [None]:
car_data.hist(edgecolor='black', linewidth=1.2, figsize=(15,10));

The price distribution looks reasonable.

Registration year, power and number of pictures appear to contain outliers, with most of the data concentrated in a very narrow range. I will explore this.

Mileage behaves abnormally. Counts are low and fairly flat at lower mileages, there are many cars at roughly 135,000-150,000 km, and not a single car above that. There is no gradual decline, which I would have expected. I will explore this.

#### registration_year

In [None]:
px.box(car_data, x='registration_year').show()

This is the registration year. The values below the central region either predate the invention of the car or are implausibly old, and the values above it lie in the future. Both groups are irrelevant and will be dropped.

In [None]:
# compute the lower and upper fences (1.5 * IQR) used to remove extreme outliers
def fences_of_column(df, column_name):
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    lower_fence = Q1 - 1.5*IQR
    upper_fence = Q3 + 1.5*IQR
    return lower_fence, upper_fence

In [None]:
lower_fence, upper_fence = fences_of_column(car_data, 'registration_year')
print('The lower fence is {} and the upper fence is {}'.format(lower_fence, upper_fence))

In [None]:
car_data = car_data.loc[
    (car_data['registration_year'] > lower_fence) & (car_data['registration_year'] < upper_fence), :]
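
The fence computation above follows the common 1.5 × IQR rule. As a minimal sketch of how it behaves, the toy series below (hypothetical values, not taken from the dataset) contains two obvious entry errors that fall outside the fences:

```python
import pandas as pd

# Hypothetical toy data: plausible years plus two obvious entry errors
years = pd.Series([1000, 1998, 2001, 2003, 2005, 2008, 2010, 9999])

q1 = years.quantile(0.25)
q3 = years.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values strictly inside the fences
kept = years[(years > lower_fence) & (years < upper_fence)]
print(lower_fence, upper_fence)
print(kept.tolist())
```

Because the fences depend only on the quartiles, a handful of extreme values does not move them much, which is why the rule works even when the raw range is absurd.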

In [None]:
px.histogram(car_data, x='registration_year').show()

ok

#### power

In [None]:
px.box(car_data, x='power').show()

In [None]:
lower_fence, upper_fence = fences_of_column(car_data, 'power')
print('The lower fence is {} and the upper fence is {}'.format(lower_fence, upper_fence))

In [None]:
car_data = car_data.loc[
    (car_data['power'] > lower_fence) & (car_data['power'] < upper_fence), :]

In [None]:
px.histogram(car_data, x='power').show()

A car cannot have zero power, so we will replace the zeros with the mean.

In [None]:
car_data['power'] = car_data['power'].replace(to_replace=0, value=car_data['power'].mean())
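
One detail worth checking here: a mean computed over the whole column still includes the zeros it is meant to replace, which pulls the fill value down. The sketch below uses a hypothetical toy series to compare the two options; whether the difference matters for the real data is an assumption that would need to be verified.

```python
import pandas as pd

# Hypothetical toy values; 0.0 stands for "power not filled in"
power = pd.Series([0.0, 0.0, 100.0, 150.0, 200.0])

mean_with_zeros = power.mean()                 # zeros drag the mean down
mean_without_zeros = power[power > 0].mean()   # mean of the valid values only

# Replace the placeholder zeros with the mean of the valid values
filled = power.replace(0, mean_without_zeros)
print(mean_with_zeros, mean_without_zeros)
print(filled.tolist())
```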

In [None]:
px.histogram(car_data, x='power').show()

The spikes probably appear because power is usually rated in round values.

#### number_of_pictures

In [None]:
car_data['number_of_pictures'].value_counts()

Every value of number_of_pictures is 0, so this column is useless.

In [None]:
car_data.drop(['number_of_pictures'], axis=1, inplace=True)

#### mileage

In [None]:
px.histogram(car_data, x='mileage').show()

I can see two things here:

1. Values are given in round numbers.
2. The counts rise with mileage and then stop abruptly at 150K, which holds far more observations than any other value. My guess is that users cannot enter a value above that, so the 150K bin contains every car with 150K km or more.
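
A quick way to test the capping hypothesis is to measure how many rows sit exactly at the maximum. The sketch below uses a hypothetical toy series; on the real data the same two lines would be applied to `car_data['mileage']`.

```python
import pandas as pd

# Hypothetical toy mileage values in km
mileage = pd.Series([5000, 70000, 125000, 150000, 150000, 150000, 150000, 150000])

cap = mileage.max()
share_at_cap = (mileage == cap).mean()  # fraction of rows equal to the maximum
print(f'{share_at_cap:.1%} of rows sit exactly at the maximum of {cap} km')
```

A large share at a single maximum value, with nothing above it, is consistent with a capped input field rather than a natural distribution.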

#### Non-numeric data

In [None]:
# split the data into numeric and non-numeric columns
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numerics_car_data = car_data.select_dtypes(include=numerics)
non_numeric_car_data = car_data.select_dtypes(exclude=numerics)

In [None]:
# print value_counts for the categorical features
missing_categorical = car_data[non_numeric_car_data.columns.to_list()].isna().sum() 
list_of_missing_categorical = missing_categorical[missing_categorical > 0].index.to_list()
car_data[list_of_missing_categorical]
for column_name in list_of_missing_categorical:
    print(column_name)
    print(car_data[column_name].value_counts(dropna=False))
    print()

In [None]:
car_data[list_of_missing_categorical].isnull().mean() * 100

In [None]:
car_data.isnull().mean() * 100

All the missing values are in the categorical fields.

Let's remove rows with more than one missing value and see how that changes the missing-value percentages.


In [None]:
car_data =  car_data.dropna(thresh=(len(car_data.columns) - 1))
car_data.isnull().mean() * 100
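
The `thresh` argument of `dropna` is the minimum number of non-missing values a row needs in order to be kept, so `len(columns) - 1` keeps rows with at most one missing value. A minimal sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 is complete, row 1 has one gap, row 2 has two
toy = pd.DataFrame({
    'model': ['golf', np.nan, np.nan],
    'gearbox': ['manual', 'auto', np.nan],
    'price': [500, 1500, 900],
})

# thresh = minimum number of non-missing values required to keep a row
kept = toy.dropna(thresh=len(toy.columns) - 1)
print(kept.index.tolist())
```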

I will impute the remaining missing values with the most common value of each column.

In [None]:
# define an imputer to treat the categorical missing values
imputer_categorical = SimpleImputer(strategy='most_frequent')

In [None]:
# helper that imputes the missing values of a single column
def impute_values_in_column(df: pd.DataFrame, column_name, my_imputer) -> pd.DataFrame:
    # the imputer expects 2D input and returns a 2D array, so flatten the result
    df[column_name] = my_imputer.fit_transform(df[[column_name]]).ravel()
    return df

In [None]:
for column_name in list_of_missing_categorical:
    car_data = impute_values_in_column(car_data, column_name, imputer_categorical)

In [None]:
car_data.isnull().mean() * 100

No missing values
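
To make the imputation step concrete, here is a minimal sketch of most-frequent imputation on a hypothetical one-column frame. It mirrors the approach used in this section; note that in a stricter setup the imputer would be fit on the training split only, which this notebook does not do.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing entry
toy = pd.DataFrame({
    'gearbox': pd.Series(['manual', 'manual', 'auto', np.nan], dtype=object)
})

imputer = SimpleImputer(strategy='most_frequent')
# fit_transform expects 2D input and returns a 2D array; ravel() flattens it
toy['gearbox'] = imputer.fit_transform(toy[['gearbox']]).ravel()
print(toy['gearbox'].tolist())
```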

### Bivariate EDA
Exploring the relationships between pairs of features

In [None]:
px.scatter_matrix(numerics_car_data).show()

I can't see anything in particular, so I will use a correlation matrix.

In [None]:
numerics_car_data.corr()

A few fields have an above-medium correlation (|r| > 0.3) with price. This is a good sign for making predictions on this set. At the same time, no feature is strongly explained by another feature. This is good because we don't want our features to correlate too much with each other.
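
The 0.3 threshold can also be applied programmatically instead of reading the matrix by eye. The sketch below uses a hypothetical toy frame in which one feature tracks the target perfectly and the other is uncorrelated with it:

```python
import pandas as pd

# Hypothetical toy data: power follows price, postal_code does not
toy = pd.DataFrame({
    'price': [1, 2, 3, 4, 5],
    'power': [2, 4, 6, 8, 10],
    'postal_code': [3, 1, 3, 1, 3],
})

# Pearson correlation of every feature with the target
corr_with_price = toy.corr()['price'].drop('price')

# Keep the features whose absolute correlation exceeds the threshold
strong = corr_with_price[corr_with_price.abs() > 0.3]
print(strong.index.tolist())
```

Pearson correlation only captures linear relationships, so a low value here does not prove that a feature is useless to a tree-based model.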

### Treating dates

In [None]:
car_data

### Duplicates

In [None]:
car_data

In [None]:
print(f'There are {car_data[car_data.duplicated()].shape[0]} duplicates in the data')

I will remove them

In [None]:
car_data.drop_duplicates(inplace=True)

In [None]:
print(f'There are {car_data[car_data.duplicated()].shape[0]} duplicates in the data')

ok

## Data encoding

We need to encode our categorical values

In [None]:
car_data[non_numeric_car_data.columns.to_list()]

### One hot encoding

I will use one-hot encoding (OHE) for not_repaired.

In [None]:
# create dummies for not_repaired
data_ohe = pd.get_dummies(car_data['not_repaired'], drop_first=True)

In [None]:
data_ohe.columns = ['not_repaired_yes']

In [None]:
# join the data with the ohe data of the car_data
car_data = car_data.drop('not_repaired', axis=1).join(data_ohe)

In [None]:
car_data.sample(3)

ok
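
For a two-level column, `drop_first=True` leaves a single indicator column, which is why one renamed column is enough above. A minimal sketch on a hypothetical toy series:

```python
import pandas as pd

# Hypothetical toy values for the not_repaired column
not_repaired = pd.Series(['no', 'yes', 'no'], name='not_repaired')

# drop_first=True drops the alphabetically first level ('no'),
# leaving one indicator column for 'yes'
dummies = pd.get_dummies(not_repaired, drop_first=True)
dummies.columns = ['not_repaired_yes']

# Cast to int so the output looks the same across pandas versions
indicator = dummies['not_repaired_yes'].astype(int).tolist()
print(indicator)
```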

### Use OrdinalEncoder to encode multiple columns at once

In [None]:
non_numeric_car_data.columns.to_list()

In [None]:
columns_for_Ordinal_encoder = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand']

In [None]:
encoder = OrdinalEncoder()
car_data[columns_for_Ordinal_encoder] = encoder.fit_transform(
    car_data[columns_for_Ordinal_encoder])

In [None]:
car_data.sample(3)

ok
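
As a sketch of what the encoder does, the toy frame below (hypothetical values) shows that each column gets its own integer codes, assigned in sorted order of the categories. The resulting order is arbitrary: tree-based models can split on it freely, while a linear model will read it as a numeric scale.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    'gearbox': ['manual', 'auto', 'manual'],
    'fuel_type': ['petrol', 'gasoline', 'petrol'],
}, dtype=object)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[['gearbox', 'fuel_type']])

# categories_ holds the sorted categories learned for each column
print(encoder.categories_)
print(encoded)
```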

## Features selection

I will keep all features from the dataset except the dates, which look irrelevant in this context.

In [None]:
car_data = car_data.drop(['date_crawled', 'date_created', 'last_seen'], axis=1)

In [None]:
car_data

# Train different models with various hyperparameters

## Split the data

In [None]:
# create sample data for fast test of models
car_data_sample = car_data.sample(n=car_data.shape[0] // 100, random_state=12345)

In [None]:
X = car_data.drop(['price'], axis=1)
y = car_data['price']
X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=12345
)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

In [None]:
X_sample = car_data_sample.drop(['price'], axis=1)
y_sample = car_data_sample['price']
X_train_sample, X_valid_sample, y_train_sample, y_valid_sample = train_test_split(
        X_sample, y_sample, test_size=0.25, random_state=12345
)
print(X_train_sample.shape, X_valid_sample.shape, y_train_sample.shape, y_valid_sample.shape)

In [None]:
# create df to store results
results_df = pd.DataFrame(columns=['model_type', 'time_sec', 'RMSE'])

### Create RMSE function for scoring

In [None]:
def calc_RMSE(y_true,y_pred):
    RMSE_temp = (mean_squared_error(y_true,y_pred))**0.5
    return RMSE_temp

In [None]:
RMSE_score = make_scorer(calc_RMSE, greater_is_better=False)
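
With `greater_is_better=False`, the scorer returns the negated RMSE so that scikit-learn can always maximize the score. That is why grid search results based on this scorer have to be negated before they are reported. A minimal sketch on hypothetical toy data (the names `calc_rmse` and `rmse_scorer` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

def calc_rmse(y_true, y_pred):
    # root mean squared error
    return mean_squared_error(y_true, y_pred) ** 0.5

rmse_scorer = make_scorer(calc_rmse, greater_is_better=False)

# Hypothetical toy regression problem
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 2.0, 4.0])
model = LinearRegression().fit(X_toy, y_toy)

score = rmse_scorer(model, X_toy, y_toy)       # negated RMSE
rmse = calc_rmse(y_toy, model.predict(X_toy))  # plain RMSE
print(score, rmse)
```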

## Model creation and training

### Linear regression

In [None]:
lr = LinearRegression()

In [None]:
t = time.time() # start time
lr.fit(X_train, y_train)
elapsed = (time.time() - t) # elapsed time in seconds
print(f'It took {elapsed} seconds for the model to train')

In [None]:
prediction = lr.predict(X_valid)
RMSE = calc_RMSE(y_valid, prediction)
print(f'The RMSE is: {RMSE} euros')

In [None]:
def append_to_score_df(model_name, time_in_sec, RMSE):
    row_val = [model_name, time_in_sec, round(RMSE, 2)]
    results_df.loc[len(results_df)] = row_val
    print(results_df)

In [None]:
append_to_score_df(model_name='lr', time_in_sec=elapsed, RMSE=RMSE)

### Random forest Regression

In [None]:
# create function for grid search
def gs_evaluate(gs, X_train_gs, X_valid_gs, y_train_gs, y_valid_gs, X_gs, y_gs):

    t = time.time() # start time
    gs.fit(X_train_gs, y_train_gs)
    elapsed = (time.time() - t) # elapsed time in seconds
    best_score = -gs.best_score_
    score_train = -gs.score(X_train_gs, y_train_gs)
    score_valid = -gs.score(X_valid_gs, y_valid_gs)
    score_full = -gs.score(X_gs, y_gs)
    best_params = gs.best_params_
    # split the elapsed time into whole hours, minutes and seconds
    hour, remainder = divmod(round(elapsed), 3600)
    minutes, seconds = divmod(remainder, 60)
    
    
    print('The best score in the cross validation is {:0.3f}, \n\
    the best score on all training set is {:0.3f}, \n\
    the best score on the valid set is {:0.3f} \n\
    the best score on the full set is {:0.3f} \n\
    and the parameters obtained from the GridSearchCV are: \n{}'.format(
    best_score, score_train, score_valid, score_full, best_params
    ))
    print()
    print('It took {} hours, {} minutes and {} seconds for the model to train'.format(
        hour, minutes, seconds
    ))
    
    return elapsed, score_valid

In [None]:
# create model
rf = RandomForestRegressor(random_state=12345)

In [None]:
param_grid = {
    'max_depth': [2, 6, 8],
    'n_estimators': [2, 3, 6]
}

In [None]:
# create grid search
gs = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring=RMSE_score, verbose=0)
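
The cost of a grid search grows with the number of parameter combinations times the number of folds. As a rough sketch, the number of model fits for the grid above can be counted with `ParameterGrid` (the variable names below are illustrative):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'max_depth': [2, 6, 8], 'n_estimators': [2, 3, 6]}
cv_folds = 5

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per fold, plus one final refit on all
# training data (GridSearchCV uses refit=True by default)
n_fits = n_candidates * cv_folds + 1
print(n_candidates, n_fits)
```

This count explains why the training time reported for the grid search is not directly comparable with the time of a single `fit` call on another model.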

In [None]:
elapsed_1, best_score_1 = gs_evaluate(
    gs, X_train_sample, X_valid_sample, y_train_sample, y_valid_sample, X_sample, y_sample
)

In [None]:
elapsed_2, best_score_2 = gs_evaluate(
    gs, X_train, X_valid, y_train, y_valid, X, y
)

In [None]:
append_to_score_df(model_name='rf', time_in_sec=elapsed_2, RMSE=best_score_2)

### LightGBM Regression

Here I will test different numbers of iterations in the LightGBM model.

In [None]:
# loading data
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)


In [None]:
def lgbm_iterations_check(num_of_iterations):
    # defining parameters
    params = {
        'task': 'train',
        'boosting': 'gbdt',
        'objective': 'regression',
        'num_leaves': 10,
        'learning_rate': 0.05,
        'metric': ['l2', 'l1'],
        'verbose': -1,
        'num_iterations': num_of_iterations
    }

    # fitting the model
    t = time.time() # start time
    # early stopping is passed as a callback; the early_stopping_rounds
    # keyword of lgb.train() was removed in LightGBM 4.0
    lgbm = lgb.train(params,
                     train_set=lgb_train,
                     valid_sets=[lgb_eval],
                     callbacks=[lgb.early_stopping(stopping_rounds=30)])
    elapsed_lgbm = (time.time() - t) # elapsed time in seconds
    print(f'It took {elapsed_lgbm} seconds for the model to train')

    # prediction
    y_pred = lgbm.predict(X_valid)

    # error metrics
    mse = mean_squared_error(y_valid, y_pred)
    rmse = mse**(0.5)
    print("MSE: %.2f" % mse)
    print("RMSE: %.2f" % rmse)

    return round(rmse), round(elapsed_lgbm, 4)
    

In [None]:
rmse_list = []
elapsed_lgbm_list = []
x = []
for i in range(100, 1000, 100):
    rmse_1, elapsed_lgbm = lgbm_iterations_check(i)
    rmse_list.append(rmse_1)
    elapsed_lgbm_list.append(elapsed_lgbm)
    x.append(i)
    print(rmse_1, elapsed_lgbm, i)

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(x, elapsed_lgbm_list, c='b')
ax2.plot(x, rmse_list, c='r')

ax1.set_xlabel('Num of iterations')
ax1.set_ylabel('Time in seconds', color='b')
ax2.set_ylabel('RMSE', color='r')

    

I will choose 500 iterations because beyond that there is no significant change in RMSE, while training time keeps rising steadily.
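
The "no significant change" criterion can be made explicit with a tolerance: pick the smallest iteration count whose RMSE is within a small margin of the best one. The sketch below is only an illustration. The RMSE values are hypothetical placeholders, not measurements from this notebook, and the 0.5% tolerance is an assumption.

```python
# Hypothetical placeholder values, NOT results from this notebook
iterations = [100, 200, 300, 400, 500, 600]
rmse_values = [1900, 1820, 1780, 1760, 1750, 1748]

tolerance = 0.005  # assumed: accept anything within 0.5% of the best RMSE
best = min(rmse_values)

# First (smallest) iteration count that is close enough to the best RMSE
chosen = next(
    n for n, r in zip(iterations, rmse_values)
    if r <= best * (1 + tolerance)
)
print(chosen)
```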

In [None]:
rmse_lgbm, elapsed_lgbm = lgbm_iterations_check(500)

In [None]:
append_to_score_df(model_name='lgbm', time_in_sec=elapsed_lgbm, RMSE=rmse_lgbm)

### CatBoost regression

Here I will train the CatBoost regressor model with a tree depth of 10.

In [None]:
train_dataset = cb.Pool(X_train, y_train) 
test_dataset = cb.Pool(X_valid, y_valid)

In [None]:
cbr = cb.CatBoostRegressor(loss_function='RMSE', depth=10)
t = time.time() # start time
cbr.fit(X_train, y_train)
elapsed_cbr = (time.time() - t) # elapsed time in seconds
pred = cbr.predict(X_valid)
rmse = (np.sqrt(mean_squared_error(y_valid, pred)))

print('The RMSE is: {}. It took {} seconds to train the model'.format(rmse, elapsed_cbr))

In [None]:
append_to_score_df(model_name='cbr', time_in_sec=elapsed_cbr, RMSE=rmse)

# Analyze the speed and quality of the models

In [None]:
print('The df with RMSE score and learning time')
print()
results_df

We see that the two best algorithms in terms of RMSE are the LightGBM regressor and the CatBoost regressor. The CatBoost regressor achieved the better score, but its training took about 20 times longer. For larger datasets, despite the better accuracy of the CatBoost regressor, the LightGBM regressor may be the better choice.
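
The project description also lists prediction speed as a criterion, while the table above records training time only. A minimal sketch of how both could be timed separately is shown below. It uses synthetic data and `LinearRegression` purely as a stand-in; the same pattern would apply to any of the fitted models.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data, not the car dataset
rng = np.random.default_rng(12345)
X_toy = rng.normal(size=(1000, 5))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + rng.normal(size=1000)

model = LinearRegression()

# Time the training step
t = time.perf_counter()
model.fit(X_toy, y_toy)
fit_seconds = time.perf_counter() - t

# Time the prediction step separately
t = time.perf_counter()
pred = model.predict(X_toy)
predict_seconds = time.perf_counter() - t

print(f'fit: {fit_seconds:.4f} s, predict: {predict_seconds:.4f} s')
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.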