# Data Science Project: car price prediction

In this project you will analyze a data set and train a model on it. The data set contains the year, price, transmission, mileage, fuel type and engine size of used cars. The goal is to train a model that predicts the market price of a used car.

Steps:

1.	Upload the csv file to the workspace and load it into a data frame.
2.	Look for null values and outliers. Remove, keep or impute them and explain why.
3.	Show the main statistics (mean, standard deviation…) of the numerical columns of the data set. Are any of the variables skewed?
4.	Train a model for the prediction of the price. Explain why you chose the model that you trained.
5.	Test the model and obtain some performance metrics from it. Does it have a good performance? Why?
6.	Would you say that you have enough information to predict the price of an Electric Vehicle of the same class? Why?

## 0. Imports
These are the needed imports for the notebook:

In [2]:
# Imports
import pandas as pd
import typing
from typing import List
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn import tree
from xgboost import XGBRegressor

ModuleNotFoundError: No module named 'xgboost'

## 1.	Upload the csv file to the workspace and load it into a data frame

Read the data from a local folder:

In [None]:
df: pd.DataFrame = pd.read_csv('cclass.csv')
df

Check the data types, in case pandas has inferred an unsuitable type for any column:

In [None]:
df.dtypes

The data types are correct!

## 2.	Look for null values and outliers. Remove, keep or impute them and explain why

### 2.1. Null values
Most machine learning models cannot handle null values directly, so they need to be dealt with before training any model. Hence, the dataset needs to be preprocessed.

In general, there are 4 main strategies to handle null values:
  1. *Drop Columns with Missing Values*: in case there is a null value or more in a column, the entire column is removed from the dataset
  2. *Drop Rows with Missing Values*: in case there is a null value or more in a row, the entire row is removed from the dataset
  3. *Imputation*: using a class such as SimpleImputer, fill the null values with a statistic computed from the column (for example the mean, the median or the most frequent value), so that the dataset shape (rows and columns) stays the same.
  4. *Extended Imputation*: the same as the previous Imputation, but a new column is added to flag the rows whose value has been imputed. The model will then also consider this column during training.

First, I will check which columns have null values:

In [None]:
# Get names of columns with null values
columns_with_null_values = [col for col in df.columns
                     if df[col].isnull().any()]
print('These columns have null values:', columns_with_null_values)

I would like to display the null values before performing any further analysis:

In [None]:
null_data = df[df.isnull().any(axis=1)]
print(null_data)
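
A per-column count is often easier to read than the full rows. The following is a minimal sketch on a toy frame; the values are hypothetical and only the column names mirror the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not the real car data).
toy = pd.DataFrame({
    "year": [2017, 2019, np.nan, 2015],
    "transmission": ["Manual", None, "Automatic", "Manual"],
    "price": [15000, 21000, 18000, 9000],
})

# Number of nulls per column and their share of all rows.
null_counts = toy.isnull().sum()
null_share = toy.isnull().mean()

print(null_counts)
print(null_share)
```

The share of nulls per column helps to judge whether dropping rows or columns would throw away too much data.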

#### 2.1.0 Setting up the scenario for the benchmark

In this step, I will set up a scenario for comparing the 4 mentioned strategies for handling null values.

This reference scenario will have the following characteristics:
- **Random Forest Model**: for this regression problem it is good enough for making a benchmark for the null value handling strategies.
- **Mean Absolute Error (MAE)**: a metric widely used for regression problems. I am using it because it shows how far off the predictions are on average. Hence, the smaller the MAE, the better the result.
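
As a small worked example, the MAE is the mean of the absolute differences between the true and the predicted values. The prices below are hypothetical and only serve to show the arithmetic:

```python
import numpy as np

y_true = np.array([20000.0, 15000.0, 30000.0])  # hypothetical real prices
y_pred = np.array([21000.0, 14000.0, 27000.0])  # hypothetical predictions

# Absolute errors are 1000, 1000 and 3000; the MAE is their mean.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```

The result is expressed in the same unit as the target, so an MAE computed on **price** can be read directly as an average error in currency units.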

I will also perform some changes in the dataset so that it can be introduced in the model.

First, I will remove the **model** column: it has the same value in every row, so it gives the model nothing to learn from:

In [None]:
print("Unique values for the model:", df['model'].unique())
df_without_model_column = df.drop('model', axis=1)
df_without_model_column.head()

As the models only accept numerical data, I will have to transform the categorical columns **transmission** and **fuelType**.

I will use One-Hot encoding since there is no order in these two variables:

In [None]:
# Get list of categorical variables
s = (df_without_model_column.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:", object_cols)

In [None]:
# Apply one-hot encoder to each column with categorical data.
# The encoder returns a sparse matrix by default; .toarray() makes it dense
OH_encoder = OneHotEncoder(handle_unknown='ignore')
OH_cols = pd.DataFrame(OH_encoder.fit_transform(df_without_model_column[object_cols]).toarray())

# Name the new columns after their category (e.g. 'fuelType_Diesel'),
# so that all column names are strings
OH_cols.columns = OH_encoder.get_feature_names_out(object_cols)

# One-hot encoding removed index; put it back
OH_cols.index = df_without_model_column.index

# Remove categorical columns (will replace with one-hot encoding)
numerical_df: pd.DataFrame = df_without_model_column.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
encoded_df: pd.DataFrame = pd.concat([numerical_df, OH_cols], axis=1)
encoded_df.head()

Now that the dataset is ready, I will create a function that performs the train-test split and returns the MAE. 80% of the data will be used for training and 20% for testing, which is a common practice in machine learning:

In [None]:
# Function for comparing different approaches
def score_dataset(df:pd.DataFrame, target_name:str, train_size:float=0.8, test_size:float=0.2, n_estimators:int=10)->float:
    target_column = target_name
    target = df[target_column]
    predictors = df.drop([target_column], axis=1)
  
    X_train, X_valid, y_train, y_valid = train_test_split(predictors,
                                                          target,
                                                          train_size=train_size,
                                                          test_size=test_size,
                                                          random_state=0)
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
  
    return mean_absolute_error(y_valid, predictions)

#### 2.1.1 Drop Columns with Missing Values

This is the code for dropping columns that contain missing values:

In [None]:
# Get names of columns with missing values
cols_with_missing:List = [col for col in encoded_df.columns
                     if encoded_df[col].isnull().any()]

# Drop those columns from the whole dataset (the train-test split happens later, in score_dataset)
drop_columns_df:pd.DataFrame = encoded_df.drop(cols_with_missing, axis=1)

Let's get the score for this approach:

In [None]:
MAE_drop_columns = score_dataset(df=drop_columns_df, target_name='price')
print("MAE when dropping columns that contain missing values:", MAE_drop_columns)

#### 2.1.2 Drop Rows with Missing Values

This is the code for dropping rows that contain missing values:

In [None]:
# Drop the rows with missing values
drop_rows_df:pd.DataFrame = encoded_df.dropna(axis=0) 

# Get the score
MAE_drop_rows = score_dataset(df=drop_rows_df, target_name='price')
print("MAE when dropping rows that contain missing values:", MAE_drop_rows)

#### 2.1.3 Imputation

Impute the missing values with the most frequent value of each column, since the columns that have missing values are categorical:

In [None]:
# Imputation
my_imputer = SimpleImputer(strategy='most_frequent')
imputed_df = pd.DataFrame(my_imputer.fit_transform(encoded_df))


# Imputation removed column names; put them back
imputed_df.columns = encoded_df.columns

Get the score:

In [None]:
MAE_imputation = score_dataset(df=imputed_df, target_name='price')
print("MAE when imputing missing values:", MAE_imputation)

#### 2.1.4 Extended Imputation

It is similar to the previous imputation, but extra columns will be added to state if the measurement (row) has been imputed. In this way, the model will have more details:

In [None]:
# Make copy to avoid changing original data (when imputing)
encoded_df_plus = encoded_df.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    encoded_df_plus[col + '_was_missing'] = encoded_df_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer(strategy='most_frequent')
imputed_df_plus = pd.DataFrame(my_imputer.fit_transform(encoded_df_plus))

# Imputation removed column names; put them back
imputed_df_plus.columns = encoded_df_plus.columns
imputed_df_plus.head()

In [None]:
MAE_extended_imputation = score_dataset(df=imputed_df_plus, target_name='price')
print("MAE when performing extended imputation for missing values:", MAE_extended_imputation)
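
As an alternative sketch, `SimpleImputer` has an `add_indicator` parameter that appends the "was missing" columns by itself. The toy array below is hypothetical and only illustrates the output layout:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numerical data with one missing value in the second column.
X = np.array([
    [2017.0, 1.5],
    [2019.0, np.nan],
    [2019.0, 2.0],
    [2015.0, 2.0],
])

# add_indicator=True appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_plus = imputer.fit_transform(X)
print(X_plus)
```

The missing value is replaced by the most frequent value of its column (2.0), and the last column flags the row that was imputed.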

#### 2.1.5 Summary for missing values

Let's print again the MAEs for each case:

In [None]:
print("MAE when dropping columns that contain missing values:", MAE_drop_columns)
print("MAE when dropping rows that contain missing values:", MAE_drop_rows)
print("MAE when imputing missing values:", MAE_imputation)
print("MAE when performing extended imputation for missing values:", MAE_extended_imputation)

The best strategy for handling missing values for this particular dataset is **Imputation**, since it obtains the smallest MAE.

### 2.2. Outliers

Outliers will be only considered for numerical variables: **year**, **price**, **mileage**, and **engineSize**.

To understand better what to do with outliers, their distributions will be obtained first. The dataframe to be used will be the one from before the missing values section that had removed the **model** column.

I will perform the analysis by variable:

#### 2.2.1 year column

Let's make a histogram for the **year** column:

In [None]:
df_without_model_column.year.hist(bins=30)

Although the distribution is left-skewed, it looks like there are no outliers in this column. Just to see if the years are OK, I will check the minimum and the maximum:

In [None]:
print("Min year:", df_without_model_column.year.min())
print("Max year:", df_without_model_column.year.max())

The numbers seem reasonable as they make reference to years of the C-class model.

#### 2.2.2 price column

Again, I will plot a histogram to look at the data distribution for the **price** column:

In [None]:
df_without_model_column.price.hist(bins=100)

In the case of the **price**, it looks like the distribution is right-skewed. Therefore, for outlier detection the Inter-Quartile Range (IQR) proximity rule can be used. Before that, let's make a box-plot for more details:

In [None]:
sns.boxplot(y=df_without_model_column.price)

From the box-plot it is possible to see that there are outliers on the upper side but not on the lower side. Nevertheless, the next function, which is based on the Inter Quartile Range, will be helpful for finding both types of outliers:

In [None]:
def lower_upper_iqr(df , column):
    # Return the lower and upper outlier limits of a column (IQR proximity rule)
    q25, q75 = np.nanquantile(df[column], 0.25), np.nanquantile(df[column], 0.75)
    # calculate the IQR
    iqr = q75 - q25
    # calculate the outlier cutoff
    cut_off = iqr * 1.5
    # calculate the lower and upper bound value
    lower_iqr_limit = q25 - cut_off
    upper_iqr_limit = q75 + cut_off
    
    return lower_iqr_limit, upper_iqr_limit
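
As a worked example of the IQR proximity rule, the sketch below applies the same arithmetic to a small array of hypothetical values:

```python
import numpy as np

# Hypothetical values with one obvious outlier.
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])

q25, q75 = np.quantile(values, [0.25, 0.75])  # 12.5 and 17.5
iqr = q75 - q25                               # 5.0
lower = q25 - 1.5 * iqr                       # 5.0
upper = q75 + 1.5 * iqr                       # 25.0

# Anything outside [lower, upper] counts as an outlier.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Only the value 100.0 falls outside the limits, so it is the only one flagged.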

With the function defined, I will obtain the Inter Quartile limits:

In [None]:
lower_iqr_limit, upper_iqr_limit = lower_upper_iqr(df_without_model_column, 'price')

For better understanding, the next plot will show the outliers in red color:

In [None]:
plt.figure(figsize = (10,6))
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(df_without_model_column.price, kde=False)
plt.axvspan(xmin = lower_iqr_limit,xmax= df_without_model_column.price.min(),alpha=0.2, color='red')
plt.axvspan(xmin = upper_iqr_limit,xmax= df_without_model_column.price.max(),alpha=0.2, color='red')

##### 2.2.2.0 Setting up the scenario for the benchmark of outliers

As with the missing values, there are two main strategies for handling outliers:
1. *Drop the outlier rows*: remove the rows that contain outliers
2. *Impute the outliers*: replace the outlier value with the median of the distribution

For the evaluation of these two scenarios the *score_dataset* function will be reused. In addition, two steps from the previous section on missing values will be wrapped into functions:

- *One-Hot-Encoding*: to convert the categorical variables into numerical
- *Imputation*: as imputation obtained the lowest MAE in the missing value section, this strategy will be used

Finally, a third function chains those two steps and computes the score:

In [None]:
def one_hot_encoding(df: pd.DataFrame) -> pd.DataFrame:
    # Get list of categorical variables
    s = (df.dtypes == 'object')
    object_cols = list(s[s].index)

    # Apply one-hot encoder to each column with categorical data.
    # The encoder returns a sparse matrix by default; .toarray() makes it dense
    OH_encoder = OneHotEncoder(handle_unknown='ignore')
    OH_cols = pd.DataFrame(OH_encoder.fit_transform(df[object_cols]).toarray())

    # Name the new columns after their category (e.g. 'fuelType_Diesel'),
    # so that all column names are strings
    OH_cols.columns = OH_encoder.get_feature_names_out(object_cols)

    # One-hot encoding removed index; put it back
    OH_cols.index = df.index

    # Remove categorical columns (will replace with one-hot encoding)
    numerical_df: pd.DataFrame = df.drop(object_cols, axis=1)

    # Add one-hot encoded columns to numerical features
    one_hot_encoded_df: pd.DataFrame = pd.concat([numerical_df, OH_cols], axis=1)
    
    return one_hot_encoded_df

In [None]:
def imputation(df: pd.DataFrame) -> pd.DataFrame:
    # Imputation
    my_imputer = SimpleImputer(strategy='most_frequent')
    imputed_df = pd.DataFrame(my_imputer.fit_transform(df))
    # Imputation removed column names; put them back
    imputed_df.columns = df.columns
  
    return imputed_df

In [None]:
def score_dataset_outliers(df:pd.DataFrame, target_name:str, train_size:float=0.8, test_size:float=0.2, n_estimators:int=10)->float:
    one_hot_encoding_df = one_hot_encoding(df)
    imputed_df = imputation(one_hot_encoding_df)
    mae = score_dataset(df=imputed_df, target_name=target_name, train_size=train_size, test_size=test_size, n_estimators=n_estimators)
    
    return mae  

##### 2.2.2.1 Drop the outlier rows

Once the outliers have been identified, their rows will be removed from the dataset in this approach:

In [None]:
# Create a copy of the data frame
drop_price_outliers_df: pd.DataFrame = df_without_model_column.copy()
# Remove the outliers
drop_price_outliers_df = drop_price_outliers_df[(drop_price_outliers_df['price'] > lower_iqr_limit) & (drop_price_outliers_df['price'] < upper_iqr_limit)]

print("Shape of the original dataframe:", df_without_model_column.shape)
print("Shape of the dataframe after dropping price outliers:", drop_price_outliers_df.shape)

Get the MAE score:

In [None]:
MAE_drop_price_outliers = score_dataset_outliers(df=drop_price_outliers_df, target_name='price')
print("MAE when dropping outliers for the price column:", MAE_drop_price_outliers)

##### 2.2.2.2 Impute the outliers

In this strategy the identified outliers will be imputed using the median value of the distribution:

In [None]:
# Create a copy of the data frame
impute_price_outliers_df: pd.DataFrame = df_without_model_column.copy()
#Imputation
median = df_without_model_column['price'].median()
impute_price_outliers_df['price'] = np.where(df_without_model_column['price'] > upper_iqr_limit, median, df_without_model_column['price'])

Get the MAE score:

In [None]:
MAE_impute_price_outliers = score_dataset_outliers(df=impute_price_outliers_df, target_name='price')
print("MAE when imputing outliers for the price column:", MAE_impute_price_outliers)

##### 2.2.2.3 Summary

This is the summary for both outlier strategies for the **price** column:

In [None]:
print("MAE when dropping outliers for the price column:", MAE_drop_price_outliers)
print("MAE when imputing outliers for the price column:", MAE_impute_price_outliers)

In this case dropping the rows with outliers in **price** is a better solution. This is its distribution after dropping the outliers:

In [None]:
drop_price_outliers_df.price.hist(bins=100)

#### 2.2.3 mileage column

Again, I will plot a histogram to look at the data distribution for the **mileage** column:

In [None]:
drop_price_outliers_df.mileage.hist(bins=100)

In the case of the **mileage**, it looks like the distribution is right-skewed. Therefore, for outlier detection the Inter-Quartile Range (IQR) proximity rule can be used. Before that, let's make a box-plot for more details:

In [None]:
sns.boxplot(y=drop_price_outliers_df.mileage)

From the box-plot it is possible to see that there are outliers on the upper side but not on the lower side. The Inter Quartile Range function gives the limits beyond which values count as outliers:

In [None]:
lower_iqr_limit, upper_iqr_limit = lower_upper_iqr(drop_price_outliers_df, 'mileage')

For better understanding, the next plot will show the outliers in red color:

In [None]:
plt.figure(figsize = (10,6))
# histplot replaces distplot, which is deprecated in recent seaborn versions
sns.histplot(drop_price_outliers_df.mileage, kde=False)
plt.axvspan(xmin = lower_iqr_limit,xmax= drop_price_outliers_df.mileage.min(),alpha=0.2, color='red')
plt.axvspan(xmin = upper_iqr_limit,xmax= drop_price_outliers_df.mileage.max(),alpha=0.2, color='red')

As before, the same two strategies will be compared using the MAE to find the best option (dropping the rows that contain outliers or imputing the outliers).

##### 2.2.3.1 Drop the outlier rows

In [None]:
# Create a copy of the data frame
drop_mileage_outliers_df: pd.DataFrame = drop_price_outliers_df.copy()
# Remove the outliers
drop_mileage_outliers_df = drop_mileage_outliers_df[(drop_mileage_outliers_df['mileage'] > lower_iqr_limit) & (drop_mileage_outliers_df['mileage'] < upper_iqr_limit)]

print("Shape of the original dataframe:", df_without_model_column.shape)
print("Shape of the dataframe after dropping mileage outliers:", drop_mileage_outliers_df.shape)

In [None]:
MAE_drop_mileage_outliers = score_dataset_outliers(df=drop_mileage_outliers_df, target_name='price')
print("MAE when dropping outliers for the mileage column:", MAE_drop_mileage_outliers)

##### 2.2.3.2 Impute the outliers

In [None]:
# Create a copy of the data frame
impute_mileage_outliers_df: pd.DataFrame = drop_price_outliers_df.copy()
#Imputation
median = drop_price_outliers_df['mileage'].median()
impute_mileage_outliers_df['mileage'] = np.where(drop_price_outliers_df['mileage'] > upper_iqr_limit, median, drop_price_outliers_df['mileage'])

In [None]:
MAE_impute_mileage_outliers = score_dataset_outliers(df=impute_mileage_outliers_df, target_name='price')
print("MAE when imputing outliers for the mileage column:", MAE_impute_mileage_outliers)

##### 2.2.3.3 Summary

These are the MAE for both strategies for mileage outliers:

In [None]:
print("MAE when dropping outliers for the mileage column:", MAE_drop_mileage_outliers)
print("MAE when imputing outliers for the mileage column:", MAE_impute_mileage_outliers)

In this case imputing the outliers in **mileage** is a better solution. This is its distribution after imputing the outliers:

In [None]:
impute_mileage_outliers_df.mileage.hist(bins=100)

#### 2.2.4 engineSize column

Again, I will plot a histogram to look at the data distribution for the **engineSize** column:

In [None]:
impute_mileage_outliers_df.engineSize.hist(bins=100)

In this case the **engineSize** variable has a more limited range:

In [None]:
max_engine_size = impute_mileage_outliers_df.engineSize.max()
min_engine_size = impute_mileage_outliers_df.engineSize.min()
print("Max engine size:", max_engine_size)
print("Min engine size:", min_engine_size)

Having an engine size of 0.0 looks strange. Let's check the rows that meet that condition:

In [None]:
impute_mileage_outliers_df[impute_mileage_outliers_df['engineSize'] == 0]

This is an error, since it is a Diesel engine and 0.0 engineSize is not possible. Before doing anything else, the unique values for engineSize will be also checked:

In [None]:
unique_engine_size = impute_mileage_outliers_df['engineSize'].unique()
print("Sorted unique engine sizes:", sorted(set(unique_engine_size)))

Verifying the remaining **engineSize** values would require documentation on which engine sizes exist for the C-Class model. Hence, I will only remove the rows with an engineSize of 0.0:

In [None]:
drop_engine_size_outliers_df = impute_mileage_outliers_df[impute_mileage_outliers_df['engineSize'] != 0]

## 3.	Main statistics

For this section I will be using the original dataset. These are the main statistics per column:

In [None]:
df.describe()

To know if the variables are skewed, it is necessary to plot a histogram per numerical variable.
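
As a complement to the histograms, skewness can also be expressed as a number. The sketch below uses `Series.skew()` from pandas on hypothetical toy samples: a positive value suggests a right-skewed distribution and a negative value a left-skewed one:

```python
import pandas as pd

# Hypothetical samples, chosen only to show the sign of the skewness.
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])        # long tail to the right
left_skewed = pd.Series([-20, -4, -3, -3, -3, -2, -2, -1])  # mirrored: long tail to the left
symmetric = pd.Series([1, 2, 3, 4, 5])

skew_right = right_skewed.skew()
skew_left = left_skewed.skew()
skew_symmetric = symmetric.skew()

print(skew_right, skew_left, skew_symmetric)
```

On the real data the same call could be applied to every numerical column at once, for example with `df.skew(numeric_only=True)`.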

### 3.1	year

The year variable is left-skewed, as can be seen in the next plot:

In [None]:
df.year.hist(bins=50)

### 3.2	price

The price variable is right-skewed, as can be seen in the next plot:

In [None]:
df.price.hist(bins=50)

### 3.3 mileage

The mileage variable is right-skewed, as can be seen in the next plot:

In [None]:
df.mileage.hist(bins=50)

### 3.4	engineSize

The engineSize variable can be considered slightly right-skewed, although its distribution looks different because engineSize only takes a few specific values (it is not a truly continuous variable):

In [None]:
df.engineSize.hist(bins=50)

## 4.	Train and test a model

In this section I will cover steps 4 and 5 together, since I think that training and testing go together in a Data Science project. Testing makes it possible to check how the training went, so keeping them together makes more sense to me.

The dataset that will be used is the last one from the outliers section:

In [None]:
preprocessed_df = drop_engine_size_outliers_df
preprocessed_df.head()

I will show the shape of the original dataset and the preprocessed, to see how it has changed. Remember that the **model** column has been removed and that some rows also were removed when handling outliers:

In [None]:
print("Original dataset shape:", df.shape)
print("Preprocessed dataset shape:", preprocessed_df.shape)

As mentioned before during the missing value section, **imputation** will be performed as this strategy obtained the lowest Mean Absolute Error (MAE):

In [None]:
imputed_df = imputation(df=preprocessed_df)
imputed_df.head()

I will check again the data types of the dataset:

In [None]:
imputed_df.dtypes

The types need to be adjusted:

In [None]:
imputed_correct_types_df = imputed_df.astype({'year': 'int', 'price': 'float', 'mileage': 'float', 'engineSize': 'float'}, copy=False)
imputed_correct_types_df.dtypes
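
The reason the types change is, as far as I can tell, that the imputer returns a NumPy array, and an array mixing numbers and strings gets the generic `object` dtype. The sketch below reproduces that effect with pandas alone on hypothetical toy data:

```python
import pandas as pd

# Toy frame with one numerical and one categorical column (hypothetical values).
toy = pd.DataFrame({"year": [2017, 2019], "fuelType": ["Diesel", "Petrol"]})

# Going through a NumPy array, as the imputer does, loses the per-column types.
as_array = toy.to_numpy()
roundtrip = pd.DataFrame(as_array, columns=toy.columns)

# infer_objects() lets pandas re-detect the numerical columns.
restored = roundtrip.infer_objects()

print(roundtrip.dtypes)
print(restored.dtypes)
```

`infer_objects()` is an alternative to listing every column in `astype`; the explicit `astype` call used above has the advantage of documenting the expected type of each column.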

Now it is possible to perform one hot encoding:

In [None]:
encoded_df = one_hot_encoding(imputed_correct_types_df)
encoded_df.head()

The dataset is ready for starting the training-test phase. 

So far I have used a single train-test split. Cross-validation usually gives a more reliable estimate of the error, because every row is used for validation once. It requires more computation, but as this dataset is not so big, it makes sense to try it.

In the following cells I will compare the MAE obtained with cross-validation and with the train-test split:

In [None]:
# Function for comparing different approaches
def cross_validation_mae(df:pd.DataFrame, target_name:str, model)->float:
    target_column = target_name
    target = df[target_column]
    predictors = df.drop([target_column], axis=1)

    scores = cross_val_score(model, predictors, target, scoring='neg_mean_absolute_error')
    
    return -1 * scores.mean()
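
To make the behaviour of `cross_val_score` more concrete, here is a minimal sketch on synthetic data (not the car dataset). It uses a plain linear regression as a stand-in model and an explicit `KFold` with shuffling, which is an assumption of this example: the function above relies on the default folds, which, as far as I know, are 5 unshuffled folds for regressors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: y = 3x + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# Shuffled 5-fold split; shuffling matters if the rows are stored in some order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")

# scikit-learn reports the negated MAE (higher is better), so flip the sign.
fold_maes = -scores
print("MAE per fold:", fold_maes)
print("Mean:", fold_maes.mean(), "Std:", fold_maes.std())
```

The spread between folds gives an idea of how much a single train-test split could vary by chance.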

Let's obtain the MAE for a Random Forest model that uses cross-validation:

In [None]:
random_forest = RandomForestRegressor(n_estimators=10, random_state=0)
MAE_cross_validation = cross_validation_mae(df=encoded_df, target_name='price', model=random_forest)
print("MAE for cross validation with Random Forest:", MAE_cross_validation)

To compare, let's try the same Random Forest model but with train-test split:

In [None]:
MAE_train_test_split = score_dataset(df=encoded_df, target_name='price')
print("MAE for train-test split with Random Forest:", MAE_train_test_split)

**Train-test split** gives the smaller MAE here, so I will use train-test split from now on. Note that this value comes from a single split, so it is a noisier estimate than the cross-validated one.

Now, I will make some experiments with different models that usually are good for regression problems like this one. These will be the models I will be using:
- Lasso
- Decision Tree
- Random Forest
- XGBoost

I will redesign the score_dataset function so that different models can be easily added:

In [None]:
def score_dataset_by_model(df:pd.DataFrame, target_name:str, model, train_size:float=0.8, test_size:float=0.2)->float:
    target_column = target_name
    target = df[target_column]
    predictors = df.drop([target_column], axis=1)
  
    X_train, X_valid, y_train, y_valid = train_test_split(predictors,
                                                          target,
                                                          train_size=train_size,
                                                          test_size=test_size,
                                                          random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
  
    return mean_absolute_error(y_valid, predictions)
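
The same idea can be sketched as a loop over a dictionary of models, which keeps each label next to its model. This is only an illustration on synthetic data (not the car dataset), and XGBoost is left out so that the example depends on scikit-learn alone:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the encoded dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

models = {
    "Lasso": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

# Fit every model on the same split and store its MAE under its own name.
maes = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    maes[name] = mean_absolute_error(y_valid, model.predict(X_valid))

# Print from best (lowest MAE) to worst.
for name, mae in sorted(maes.items(), key=lambda item: item[1]):
    print(f"MAE with {name}: {mae:.3f}")
```

Because the label comes from the dictionary key, a printed MAE cannot end up under the wrong model name.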

### 4.1 Lasso model

In [None]:
lasso = linear_model.Lasso(alpha=0.1)
lasso_mae = score_dataset_by_model(df=encoded_df, target_name='price', model=lasso)
print("MAE with Lasso:", lasso_mae)

### 4.2 Decision Tree model

In [None]:
decision_tree = tree.DecisionTreeRegressor()
decision_tree_mae = score_dataset_by_model(df=encoded_df, target_name='price', model=decision_tree)
print("MAE with Decision Tree:", decision_tree_mae)

### 4.3 Random Forest model

In [None]:
random_forest = RandomForestRegressor()
random_forest_mae = score_dataset_by_model(df=encoded_df, target_name='price', model=random_forest)
print("MAE with Random Forest:", random_forest_mae)

### 4.4 XGBoost model

In [None]:
xgboost_model = XGBRegressor()
xgboost_model_mae = score_dataset_by_model(df=encoded_df, target_name='price', model=xgboost_model)
print("MAE with XGBoost:", xgboost_model_mae)


### 4.5 Model summary

These are the MAEs for different models:

In [None]:
print("MAE with Lasso:", lasso_mae)
print("MAE with Decision Tree:", decision_tree_mae)
print("MAE with Random Forest:", random_forest_mae)
print("MAE with XGBoost:", xgboost_model_mae)

As the lowest Mean Absolute Error (MAE) is obtained by the **XGBoost model**, this one will be used to obtain more accurate results.

## 6. Would you say that you have enough information to predict the price of an Electric Vehicle of the same class? Why?

As the dataset does not contain any Electric Vehicle, it will be difficult to predict its price, because several factors differ from those of a regular car:
- Materials
- Production processes
- Maintenance
- Durability
- Type of clients
- Others

It is true that the dataset contains a **fuelType** value of *Other*:

In [None]:
df.fuelType.unique()

That *Other* might include electric cars, but it may also cover other technologies, such as gas and hydrogen engines, that would pollute the results. Therefore, I think that it is not safe to predict the price of an Electric Vehicle from this dataset.
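
A quick way to back this up is to count the rows per fuel type. The sketch below uses a toy column with hypothetical categories and counts; on the real data the same calls would be applied to `df['fuelType']`:

```python
import pandas as pd

# Toy column; the categories and their frequencies are hypothetical.
toy = pd.DataFrame({
    "fuelType": ["Diesel", "Petrol", "Diesel", "Hybrid", "Other", "Petrol", "Diesel"]
})

# Rows available per fuel type.
counts = toy["fuelType"].value_counts()
print(counts)

# Is there an explicit electric category at all?
has_electric = "Electric" in counts.index
print("Explicit 'Electric' category present:", has_electric)
```

If there is no explicit electric category and *Other* only holds a handful of rows, the model has essentially nothing to learn electric prices from.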