# Objective
Predict resale prices of BMW cars. This could for instance be used by someone who wants to sell their car, to get an idea about how much it is worth, similar to how Kelley Blue Book works.

# Thinking about the problem
From the readme of the dataset available here <https://github.com/datacamp/careerhub-data/tree/master/BMW%20Used%20Car%20Sales>, one can see that the dataset contains information about price, transmission, mileage, fuel type, road tax, miles per gallon (mpg), and engine size. Upon inspection of the dataset (see below), it turned out to additionally contain the car model and year (I'm assuming this means production year). First I want to describe my initial expectations for the relationships between these quantities, and formulate different levels of complexity for including the data.

The five quantities model, year, transmission, fuel type, and engine size collectively describe the car configuration at the time of initial purchase. The quantity mileage describes how much the car has been used, and therefore worn, since that point. The quantities miles per gallon and road tax should be determined by the new car configuration quantities.

I suspect that the price will strongly depend on the mileage and age of the car, and a first simple model could therefore just consider these two variables.

In [None]:
# This requires the file draw_diagrams.py to be in the same directory as this notebook
import draw_diagrams
draw_diagrams.data_model1()

An improvement on this would be to include the new car configuration variables. In addition to the price, mpg and road tax could also be inferred from these.

In [None]:
draw_diagrams.data_model2()

Finally the last two variables, mpg and road tax, can be included. These could affect the resale price of the car, since they would probably influence how much a buyer is willing to pay, but I suspect this connection will be less strong than the connection between the other variables and price.

In [None]:
draw_diagrams.data_model3()

Before any of this, though, I want to take a closer look at the data.


# Loading and inspecting data
First I load and inspect the data. I downloaded the data from [here](https://raw.githubusercontent.com/datacamp/careerhub-data/master/BMW%20Used%20Car%20Sales/bmw.csv) and saved it in the `datasets/bmw.csv` file.

In [None]:
import numpy as np
import pandas as pd

In [None]:
bmw = pd.read_csv("datasets/bmw.csv")
bmw.head()

In [None]:
bmw.info()

In [None]:
bmw.model.unique()

In [None]:
bmw.transmission.unique()

In [None]:
bmw.fuelType.unique()

In [None]:
bmw.describe()

In [None]:
for col in bmw:
    print(col, len(bmw[col].unique()))

# Data exploration
<a id = "data-exploration"></a>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

Let's look at how the price depends on all the continuous variables using a pair plot:

In [None]:
sns.pairplot(
    bmw,  # hue='transmission',
    x_vars=["price", "year", "mileage", "tax", "mpg", "engineSize"],
    y_vars=["price"],
)

There appears to be a definite relationship between price and both mileage and year. The relationship looks like it might be exponential, so let's look at the logarithm of the price:

In [None]:
bmw_log = bmw.copy()
bmw_log['log price'] = np.log10(bmw_log['price'])
bmw_log = bmw_log.drop('price', axis='columns')

sns.pairplot(bmw_log, #hue='transmission', 
             x_vars=['log price', 'year', 'mileage',  'tax', 'mpg', 'engineSize'],
             y_vars=['log price']) #, hue='transmission')

These plots reveal that there appears to be a linear relationship between the logarithm of the price, and both year and mileage. There is no obvious relationship between the price and the remaining variables, whether we consider logarithm or not. Going forward in the analysis, we will be using the logarithm of the price as the target variable.
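
To make the log-linear claim concrete: if the logarithm of the price is linear in year with slope `b`, then each additional year multiplies the price by a constant factor `10**b`. The following is a minimal sketch on synthetic data. All numbers in it (the base price, the 12% factor, the noise level) are assumptions made up for illustration and are not taken from the BMW dataset.

```python
import numpy as np

# Synthetic illustration only: every number here is an assumption, not BMW data.
rng = np.random.default_rng(0)
year = rng.integers(2005, 2021, size=500)   # years 2005..2020
years_since_2005 = year - 2005
true_factor_per_year = 1.12                 # assumed: each newer year is worth 12% more
base_price = 5000.0                         # assumed price of a 2005 car
noise = rng.normal(0.0, 0.02, size=500)     # scatter in log10 units
price = base_price * true_factor_per_year ** years_since_2005 * 10 ** noise

# Fit a straight line to log10(price) as a function of year.
slope, intercept = np.polyfit(years_since_2005, np.log10(price), deg=1)

# A slope b in log10 space corresponds to a multiplicative factor 10**b per year.
estimated_factor_per_year = 10 ** slope
print(f"estimated factor per year: {estimated_factor_per_year:.3f}")
```

The straight-line fit in log space recovers the multiplicative factor used to generate the data, which is why a linear model on the logarithm of the price is a natural choice when the raw relationship looks exponential.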

# Data cleaning
## Categorical variables
Let us take a closer look at the categorical columns. First we print the number of values in each category

In [None]:
categorical_columns = ["model", "fuelType", "transmission"]


def print_categorical_counts(df, columns):
    for col in columns:
        display(df.groupby(col)[col].count())


print_categorical_counts(bmw_log, categorical_columns)

There are a number of categories with very few records. For instance, the `fuelType` `Electric` has only three. With so few observations for this category, and no natural ordering that relates it to the other categories as one has for numeric columns, I wouldn't expect it to be possible to make reliable predictions of the selling price for this category. I therefore choose to drop any category with fewer than 10 records.

In [None]:
def drop_almost_empty_categories(df, col, nmin=10):
    df = df.copy()  # To avoid modifying the input dataframe
    category_count = df.groupby(col)[col].count()
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for category_name, count in category_count.items():
        if count < nmin:
            df = df[df[col] != category_name]
    return df


bmw_cat = bmw_log.copy()
for col in categorical_columns:
    bmw_cat = drop_almost_empty_categories(bmw_cat, col)
bmw_cat[categorical_columns] = bmw_cat[categorical_columns].astype('category')
# print_categorical_counts(bmw_cat, categorical_columns)
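
The same filtering can be written without an explicit loop over categories. The sketch below uses `value_counts` and `isin` on a small toy dataframe. The function name `drop_rare_categories` and the toy counts are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd


def drop_rare_categories(df, col, nmin=10):
    """Return the rows of df whose value in `col` occurs at least nmin times."""
    counts = df[col].value_counts()
    frequent = counts[counts >= nmin].index
    return df[df[col].isin(frequent)].copy()


# Toy data (made-up counts): "Electric" has only three records.
toy = pd.DataFrame(
    {"fuelType": ["Diesel"] * 12 + ["Petrol"] * 11 + ["Electric"] * 3}
)
filtered = drop_rare_categories(toy, "fuelType")
print(filtered["fuelType"].value_counts())
```

If the column already has the `category` dtype before filtering, the dropped categories stay registered as unused levels. Calling `.cat.remove_unused_categories()` on the column removes them.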

In [None]:
new_car_config_cols = ['model', 'transmission', 'fuelType', 'engineSize']
new_car_cols = new_car_config_cols + ['year']

In [None]:
sns.pairplot(
    bmw.sort_values("engineSize"),  # hue='transmission',
    x_vars=new_car_cols,
    y_vars=new_car_cols,
)

In [None]:
bmw_cat[bmw_cat.engineSize==0].head()

In [None]:
with pd.option_context("display.max_rows", None):
    new_car_grouped = bmw.groupby(new_car_cols)[["tax", "mpg", "price"]]
    display(new_car_grouped.nunique())
    # display(bmw.groupby(new_car_config_cols)['tax'].nunique())

In [None]:
choices = (
    (bmw.model == " 1 Series")
    & (bmw.transmission == "Automatic")
    & (bmw.fuelType == "Diesel")
    & (bmw.engineSize == 2.0)
    & (bmw.year == 2016)
)
bmw[choices][["mileage", "tax", "mpg", "price"]].sort_values("mileage")

## A bit more data cleaning

From the plots we can see that `mpg` has a group of values above 400, far from the nearest other values, which are all below 200. Let's see how many different values are present there:

In [None]:
bmw_cat[bmw_cat["mpg"]>400]["mpg"].unique()

All the values of `mpg` in the group above 400 are identical, which looks very suspicious. I suspect this data is wrong, and since such high values could seriously skew a model, I should eliminate them (either impute with e.g. the average, or drop the records altogether).

Let's also check the remaining two continuous variables:

In [None]:
# display(sorted(bmw_dropped["engineSize"].unique()))
display(bmw_cat.groupby("engineSize")["engineSize"].count())
bmw_cat.groupby("tax")["tax"].count()

They both contain zeros, which seems odd for both tax and engine size. The skewing effect is probably smaller than for the `mpg` outliers, since zero is closer to the other values of tax and engine size, but I should still either impute or drop these records.

In [None]:
to_be_dropped = (bmw_cat.mpg > 400) | (bmw_cat.engineSize == 0) | (bmw_cat.tax == 0)
bmw_cleaned = bmw_cat[~to_be_dropped]
# bmw_cleaned = bmw_log
# bmw_cleaned.info()
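
Dropping is the option used in this notebook. For comparison, here is a minimal sketch of the imputation alternative mentioned in the text, using the column median instead of the average. The toy values are made up for illustration and are not rows from the BMW dataset.

```python
import numpy as np
import pandas as pd

# Toy data (assumed values, not from the BMW dataset).
toy = pd.DataFrame(
    {
        "mpg": [55.4, 450.0, 62.8, 48.7],
        "engineSize": [2.0, 1.5, 0.0, 3.0],
        "tax": [145, 0, 30, 125],
    }
)

imputed = toy.astype(float)
# Mark the suspicious entries as missing ...
imputed.loc[imputed["mpg"] > 400, "mpg"] = np.nan
imputed.loc[imputed["engineSize"] == 0, "engineSize"] = np.nan
imputed.loc[imputed["tax"] == 0, "tax"] = np.nan
# ... and fill each column with the median of its remaining values.
imputed = imputed.fillna(imputed.median())
print(imputed)
```

In a real modelling workflow the medians should be computed on the training split only, so that no information from the test records leaks into the features.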

# Regression models
Here I train models on the data.

## Linear regression models
For the first model, I only want to consider the dependency of price on build year and mileage. From the plots in the [data exploration](#data-exploration) section we see that the logarithm of the price appears to depend linearly on year and mileage.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


def fit_and_test_linear_model(
    df_selected, model=None, dependent="log price", features="all", 
    print_coeffs=True,
    **kwargs
):
    if model is not None:
        linreg = model
    else:
        linreg = LinearRegression()
    linreg = fit_and_test_model(
        df_selected, linreg, dependent=dependent, features=features, **kwargs
    )
    # print(features)
    if print_coeffs:
        if features == "all":
            features = every_column_name_but(df_selected, dependent)
        std = df_selected[features].std()
        print_linear_coeffs(features, linreg, std)
    return linreg


def print_linear_coeffs(features, linreg, std):
    coeffs = pd.DataFrame(
        {
            "observable": features,
            "coef": linreg.coef_,
            "10^coef": np.power(10, linreg.coef_),
        }
    )
    coeffs["std"] = std.values
    coeffs['coef*std'] = coeffs['std'] * coeffs['coef']
    coeffs = coeffs.sort_values("coef*std", key=np.abs, ascending=False)
    display(coeffs.set_index("observable"))


def normalize_df(
    df,
    columns=["year"],
    # columns=['year', 'mileage', 'tax', 'mpg', 'engineSize']
):
    # Only normalize the requested columns that actually exist in the dataframe
    columns = [col for col in columns if col in df.columns]
    df = df.copy()
    if columns:
        # print(df)
        df_normed = pd.DataFrame(
            data=StandardScaler().fit_transform(df[columns].values),
            index=df.index,
            columns=columns,
        )
        for col in columns:
            df[col] = df_normed[col]
        # print(df)
    return df


def every_column_name_but(df, dependent):
    features = [col for col in df.columns if col != dependent]
    return features


def split_dependent(df, features, dependent="log price"):
    if features == "all":
        features = every_column_name_but(df, dependent)
    else:
        features = [col for col in features if col in df.columns and col != dependent]
    return df[features], df[dependent]


def fit_and_test_model(
    df, model, dependent="log price", features="all", plot_test=False
):

    # df_selected = normalize_df(df_selected)
    df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

    X_train, y_train = split_dependent(df_train, features, dependent=dependent)
    X_test, y_test = split_dependent(df_test, features, dependent=dependent)

    model.fit(X_train, y_train)

    y_predict_train = model.predict(X_train)
    y_predict = model.predict(X_test)

    # scikit-learn metrics take (y_true, y_pred); the order matters for r2_score
    print(
        "Mean squared error, test: {:.2g}, train: {:.2g}".format(
            mean_squared_error(y_test, y_predict),
            mean_squared_error(y_train, y_predict_train),
        )
    )
    print(
        "R^2 coefficient, test: {:.2f}, train: {:.2f}".format(
            r2_score(y_test, y_predict), r2_score(y_train, y_predict_train)
        )
    )
    if plot_test:
        sns.scatterplot(x=y_test, y=y_predict, alpha=0.5)
        plt.show()
    return model
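
Because the target is the base-10 logarithm of the price, the mean squared error reported by `fit_and_test_model` is in squared log units, which is hard to read directly. A root mean squared error of `s` in log10 space corresponds to a typical multiplicative error of `10**s` on the price itself. The helper below is a hypothetical illustration of that conversion, and the input value is an assumed example, not a result from this notebook.

```python
import numpy as np


def typical_error_factor(mse_log10):
    """Convert an MSE on log10(price) into a typical multiplicative error on price."""
    rmse_log10 = np.sqrt(mse_log10)
    return 10 ** rmse_log10


# Assumed example value: an MSE of 0.0025 is an RMSE of 0.05 in log10 units.
factor = typical_error_factor(0.0025)
print(f"typical error factor: {factor:.3f}")
```

A factor of about 1.12 would mean that predictions are typically off by roughly 12% in either direction.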

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

cat_transformer_tuple = (
    OneHotEncoder(),
    make_column_selector(dtype_include="category"),
)
ohe = make_column_transformer(cat_transformer_tuple, remainder="passthrough")

linreg = Pipeline((("one_hot", ohe), ("regressor", LinearRegression())))
fit_and_test_model(
    bmw_cleaned, linreg, dependent="log price", features="all", plot_test=True
)
#linreg.named_steps["one_hot"].get_feature_names()

In [None]:
X, y = split_dependent(bmw_cleaned, features='all')
cross_validate(linreg, X, y, return_train_score=True)
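
One thing to watch with cross-validation and one-hot encoding: the default `OneHotEncoder()` raises an error if a validation fold contains a category that was not present in the corresponding training fold. The sketch below shows one possible safeguard, `handle_unknown="ignore"`, on a synthetic dataset. The toy columns and coefficients are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: log price depends on a categorical model and on mileage.
rng = np.random.default_rng(0)
n = 200
model_names = rng.choice(["A", "B", "C"], size=n)
mileage = rng.uniform(0, 100_000, size=n)
offsets = pd.Series(model_names).map({"A": 4.0, "B": 4.2, "C": 4.4}).to_numpy()
toy = pd.DataFrame(
    {
        "model": pd.Categorical(model_names),
        "mileage": mileage,
        "log price": offsets - 3e-6 * mileage + rng.normal(0, 0.02, size=n),
    }
)

# Unknown categories are encoded as all zeros instead of raising an error.
preprocess = make_column_transformer(
    (
        OneHotEncoder(handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LinearRegression())

X = toy.drop(columns="log price")
y = toy["log price"]
scores = cross_validate(pipeline, X, y, cv=5, return_train_score=True)
print("test R^2:", scores["test_score"].mean())
print("train R^2:", scores["train_score"].mean())
```

Since rare categories have already been dropped from `bmw_cleaned`, an unseen category in a fold is unlikely here, but the option makes the pipeline more robust at no extra cost.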

In [None]:
def mean_and_std(array):
    return np.mean(array), np.std(array), np.min(array)
new_car_cols2 = ['model', 'year', 'engineSize', 'transmission', 'fuelType'] + ["mpg", "tax"]
linreg = Pipeline((("one_hot", ohe), ("regressor", LinearRegression())))
for i in range(0,len(new_car_cols2)+1):
    new_car_cols_sub = new_car_cols2[:i]
    cols = ["log price", "mileage"] + new_car_cols_sub
    print(cols)
    bmw_selected = bmw_cleaned[cols]
    X, y = split_dependent(bmw_cleaned[cols], features='all')
    scores = cross_validate(linreg, X, y, return_train_score=True)
    print("train", *mean_and_std(scores["train_score"]))
    print("test", *mean_and_std(scores["test_score"]))
    print()

In [None]:
ohe.fit_transform(bmw_cleaned)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# already includes the original column names, so no renaming is needed
features = list(ohe.get_feature_names_out())
features

In [None]:
ohe = OneHotEncoder()
ohe.fit_transform(bmw_cleaned[categorical_columns])
# get_feature_names() was removed in scikit-learn 1.2
ohe.get_feature_names_out(categorical_columns)

In [None]:
bmw_selected = bmw_cleaned[["log price", "mileage"]]
fit_and_test_linear_model(bmw_selected)

In [None]:
bmw_selected = bmw_cleaned[["log price", "year"]]
fit_and_test_linear_model(bmw_selected)

In [None]:
bmw_selected = bmw_cleaned[["log price", "year", "mileage"]]
fit_and_test_linear_model(bmw_selected, plot_test=True)

In [None]:
bmw_selected = bmw_cleaned[["log price", "year", "mileage", "tax", "mpg"]]
fit_and_test_linear_model(bmw_selected, plot_test=True)

In [None]:
bmw_selected = bmw_cleaned[["log price", "mileage"] + new_car_cols]
bmw_selected = pd.get_dummies(bmw_selected, drop_first=True)
fit_and_test_linear_model(bmw_selected, plot_test=True)

In [None]:
for i in range(0, len(new_car_cols) + 1):
    new_car_cols_sub = new_car_cols[:i]
    print(new_car_cols_sub)
    bmw_selected = bmw_cleaned[["log price", "mileage"] + new_car_cols_sub]
    bmw_selected = pd.get_dummies(bmw_selected, drop_first=True)
    fit_and_test_linear_model(bmw_selected, plot_test=False, print_coeffs=False)

In [None]:
bmw_selected = bmw_cleaned[["log price", "mileage", "model", "year"]]
bmw_selected = pd.get_dummies(bmw_selected, drop_first=True)
fit_and_test_linear_model(bmw_selected, plot_test=False, print_coeffs=False)

In [None]:
bmw_selected = pd.get_dummies(bmw_cleaned, drop_first=True)
fit_and_test_linear_model(bmw_selected)

In [None]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
bmw_selected = bmw_cleaned[["log price", "mileage"] + new_car_cols]
bmw_selected = pd.get_dummies(bmw_selected, drop_first=True)
linreg = fit_and_test_linear_model(bmw_selected, model=lasso)
#print_linear_coeffs(features, linreg, std)

## Tree models

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
bmw_selected = bmw_cleaned[["log price", "mileage"] + new_car_cols]
bmw_selected = pd.get_dummies(bmw_selected, drop_first=True)
fit_and_test_model(bmw_selected, gbr)

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
bmw_selected = bmw_cleaned[["log price", "mileage"] + new_car_cols]
bmw_selected = pd.get_dummies(bmw_selected, drop_first=True)
fit_and_test_model(bmw_selected, rfr, plot_test=True)