___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# #Determines

The **Auto Scout** data used in this project was scraped from the online car trading company AutoScout24 (https://www.autoscout24.com) in 2019 and contains many features of 9 different car models. In this project, you will use a data set that has already been preprocessed and prepared for the algorithms.

The aim of this project is to understand machine learning algorithms. Therefore, you will not need any EDA process, as you will be working on the edited data.

---

In this scenario, you will estimate the prices of cars using regression algorithms.

When starting, you should import the necessary modules and load the provided data file. You will also need to do some pre-processing before moving on to modelling. After that, you will implement the ***Linear Regression, Ridge Regression, Lasso Regression, and Elastic-Net algorithms***, in that order. (After completing the Unsupervised Learning section, you can also add bagging and boosting algorithms such as ***Random Forest and XG Boost*** to this notebook to develop the project.) You can measure the success of your models with regression evaluation metrics as well as with the cross validation method.

For better results, you should try to increase the success of your models by performing hyperparameter tuning. Determine the feature importances for the model. To save resources, you can build your model with only the most important features. You should try to apply this especially to the Random Forest and XG Boost algorithms. Unlike the others, you will perform hyperparameter tuning for Random Forest and XG Boost using the ***GridSearchCV*** method.

Finally, you can compare the performance of the algorithms and work further on the algorithm with the most successful predictions.






# #Tasks

#### 1. Import Modules, Load Data and Data Review
#### 2. Data Pre-Processing
#### 3. Implement Linear Regression 
#### 4. Implement Ridge Regression
#### 5. Implement Lasso Regression 
#### 6. Implement Elastic-Net
#### 7. Visually Compare Models Performance In a Graph

## 1. Import Modules, Load Data and Data Review

In [None]:
import pandas as pd      
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from scipy.stats import skew

from sklearn.model_selection import cross_validate

plt.rcParams["figure.figsize"] = (10,6)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [None]:
#!pip install -U scikit-learn

In [None]:
import sklearn
sklearn.__version__

In [None]:
df = pd.read_csv("final_scout_not_dummy2.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

## Feature Engineering

In [None]:
df_object = df.select_dtypes(include="object").head()
df_object

# select_dtypes(include="object") to only include the columns of object dtype.

In [None]:
for col in df_object:
    print(f"{col:<30}:", df[col].nunique())

# <30 here left-aligns the column name in a field of 30 characters before the colon (:) so that all of the outputs are aligned.

### Converting the Extras feature from object to numeric

In [None]:
for i in df.Extras:
    print(i)

# We want to preprocess this feature, as from our domain knowledge, the number of extras a car has plays an important role in its price.
# Keep in mind that this variable does not consist of atomic values.

In [None]:
for i in df.Extras:
    print(len(i.split(",")))

# print out the number of extras in each record.

In [None]:
df.Extras.apply(lambda x: len(x.split(',')))

# We can also do it like this. Remember that whenever you find yourself looping over a dataframe, you're probably making a mistake, as
# there will almost always be a better way of doing it in pandas than looping over the DF.
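The same count can also be obtained with pandas string methods instead of a lambda. The following is a minimal sketch on a made-up series; the sample values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; the real Extras column holds comma-separated strings like these.
extras = pd.Series([
    "Alloy wheels,Roof rack",
    "Alloy wheels",
    "Catalytic Converter,Voice Control,Touch screen",
])

# Option 1: split each string on commas and take the length of each resulting list.
count_split = extras.str.split(",").str.len()

# Option 2: count the separators and add one (equivalent for non-empty strings).
count_sep = extras.str.count(",") + 1

print(count_split.tolist())
print(count_sep.tolist())
```

Both options stay inside pandas' vectorized string accessor, so they read the same way whether the column has ten rows or ten thousand.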

In [None]:
df["Extras"] = df.Extras.apply(lambda x: len(x.split(',')))

# Replacing Extras with the number of extras each record has.

In [None]:
df.head()  # confirm the change.

In [None]:
df.Extras.nunique()  # Got to 10 unique values from 659. Huge reduction in cardinality.

In [None]:
df.Extras.unique()

In [None]:
for i in df.select_dtypes("object"):
    print(f"{i:<30}:", df[i].unique())
    
# We can see that the first 8 features should be encoded with onehotencoder, whereas the last 3 ones with ordinalencoder.
# OneHotEncoder:
# It encodes nominal categorical variables as dummy variables. We use this in case of nominal variables to avoid hinting our model that
# there's an underlying logical order to our values when in reality that's not the case.

# OrdinalEncoder:
# If there's a logical order to your values (good, bad, worse) (warm, hot, hottest), we will use
# OrdinalEncoder to encode our variables numerically such that we preserve their logical order.
# See statistical variable types for more info.

In [None]:
df.make_model.value_counts()

In [None]:
#!conda install "matplotlib>=3.4"

In [None]:
ax = df.make_model.value_counts().plot(kind ="bar")

ax.bar_label(ax.containers[0]);

# Axis object has a bar_label method to annotate the bar heights.
# Make sure that your matplotlib version is >= 3.4

In [None]:
df2 = df.copy()

In [None]:
sns.histplot(df.price, bins=50, kde=True);

# We are checking the distribution of our target to see if we have any outliers, since linear
# models are highly susceptible to outliers. In the context of ML, outliers don't necessarily
# indicate an underlying problem in the data gathering process. It means that we don't have
# enough records that represent a certain area in the target which is called "underrepresented".
# And the model will most likely not learn enough about these underrepresented areas.

# In the histogram above, we can see that above 40k EUR, we don't have enough observation
# points to represent that longer right tail.

# Train one model without the outliers and train one with the outliers to see which one works better.

# It also might be a better idea to group the observations by their car make to check if they have outliers within their price ranges.

In [None]:
skew(df.price)

In [None]:
df_numeric = df.select_dtypes(include="number")
df_numeric

# select_dtypes(include="number") to get only the numeric features.

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df_numeric.corr(), annot=True, vmin=-1, vmax=1, cmap='coolwarm');

## multicollinearity control

In [None]:
df_numeric.corr()[(df_numeric.corr()>= 0.9) & (df_numeric.corr() < 1)].any().any()

# Check if any 2 features have a correlation between 0.9 and 1 which indicates high positive correlation.

In [None]:
df_numeric.corr()[(df_numeric.corr()<= -0.9) & (df_numeric.corr() > -1)].any().any()

# Check if any 2 features have a correlation between -0.9 and -1 which indicates high negative correlation.

In [None]:
df_numeric.corr()[(abs(df_numeric.corr())>= 0.9) & (abs(df_numeric.corr()) < 1)].any().any()

# Combine them both into a one-liner.
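The checks above only return True or False. If the answer is True, it may help to see which pairs are involved. The sketch below is one possible way to list them, shown on synthetic data; the column names `a`, `b`, `c` and the 0.9 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame (illustrative only): b is almost a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Stack into (feature_1, feature_2) -> |corr| and keep the pairs at or above the threshold.
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs >= 0.9]
print(high_pairs)
```

Applied to the notebook's data, `toy` would be replaced by `df_numeric`.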

## Outliers in Price Column

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
plt.figure(figsize=(10,6))

plt.subplot(211)
sns.boxplot(x=df.price)

plt.subplot(212)
sns.stripplot(x=df.price);

# Can check both boxplot and stripplot to see if we have any outliers and where they start from. If you have too many outliers (too long of a tail on either side), you can play
# with the whisker distance to accommodate some of the so-called outliers if you don't want to lose too many records.

In [None]:
plt.figure(figsize=(16,10))
plt.subplot(211)
sns.boxplot(x="make_model", y="price", data=df, whis=1.5)

plt.subplot(212)
sns.boxplot(x="make_model", y="price", data=df, whis=1.5)
sns.stripplot(x="make_model", y="price", data=df);

# We can set individual whisker values per group, as they have different outlier situations. It's up to you.

In [None]:
df.make_model.unique()

# Unique values in make_model

In [None]:
df[df["make_model"] == "Audi A1"]["price"]

# The price values for Audi A1 records.

In [None]:
total_outliers = []

for model in df.make_model.unique():
    
    car_prices = df[df["make_model"] == model]["price"]
    
    Q1 = car_prices.quantile(0.25)
    Q3 = car_prices.quantile(0.75)
    IQR = Q3-Q1
    lower_lim = Q1 - 1.5 * IQR
    upper_lim = Q3 + 1.5 * IQR
    
    count_of_outliers = (car_prices[(car_prices < lower_lim) | (car_prices > upper_lim)]).count()
    
    total_outliers.append(count_of_outliers)
    
    print(f" The count of outlier for {model:<15} : {count_of_outliers:<5}, \
          The rate of outliers : {(count_of_outliers/len(df[df['make_model']== model])).round(3)}")
print()    
print("Total_outliers : ", sum(total_outliers), "The rate of total outliers :", (sum(total_outliers)/len(df)).round(3))


# Getting potential outliers per make_model based on a 1.5 whisker range.
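The per-model loop above can also be written with `groupby` and `transform`, which yields a boolean mask that can be reused later for filtering. This is a sketch on a small made-up frame, not the notebook's own method; the toy prices are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two models, each with one obviously extreme price.
toy = pd.DataFrame({
    "make_model": ["A"] * 6 + ["B"] * 6,
    "price": [10, 11, 12, 13, 14, 100, 50, 51, 52, 53, 54, 5],
})

grouped = toy.groupby("make_model")["price"]

# transform() broadcasts each group's quantile back to the rows of that group.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Boolean mask marking rows outside the 1.5 * IQR whiskers of their own group.
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)

# Outlier count per model.
outliers_per_model = is_outlier.groupby(toy["make_model"]).sum()
print(outliers_per_model)
```

With the mask in hand, `toy[~is_outlier]` would keep only the rows inside the whiskers.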

## 2. Data Pre-Processing

As you know, the data set must be processed before proceeding to the implementation of the model. As the last step before model fitting, you need to split the data set into train and test. Then, you should train the model with the training data and evaluate the performance of the model on the test data. You can use the train and test data you have created for all algorithms.

You must also separate your target variable, the column you are trying to predict, from the features.

You can use many [performance metrics for regression](https://medium.com/analytics-vidhya/evaluation-metrics-for-regression-problems-343c4923d922) to measure the performance of the regression model you train. You can define a function to view different metric results together.

You can also use the [cross validation](https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85) method to measure the estimator performance. Cross validation splits the training set into several folds, trains the model on all but one fold, and scores it on the held-out fold, repeating this once for each fold. You can calculate the final performance of your estimator by averaging these scores.
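As a rough illustration of the idea, the sketch below runs 5-fold cross validation on synthetic data. The data, the number of folds, and the names `X_demo`, `y_demo`, and `cv_scores` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data (illustrative only): y depends linearly on 3 features plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold CV: each fold is held out once for validation while the model is fit on the other four.
cv_scores = cross_validate(
    LinearRegression(), X_demo, y_demo,
    scoring=["r2", "neg_root_mean_squared_error"],
    cv=5, return_train_score=True,
)

# One score per fold; averaging them gives the cross-validated estimate.
mean_val_r2 = cv_scores["test_r2"].mean()
mean_train_r2 = cv_scores["train_r2"].mean()
print(round(mean_val_r2, 3), round(mean_train_r2, 3))
```

Note that the keys prefixed with `test_` hold the scores on the held-out validation folds, not on a separate test set. A large gap between the train and validation averages would hint at overfitting.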

### Train | Test Split

In [None]:
X = df.drop(columns="price")
y = df.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

### OneHotEncoder

#### Example

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore", sparse=False)


# OneHotEncoder encodes nominal categorical variables as binary dummy variables, which is important to avoid
# hinting to the model that there is an underlying logical order to the nominal categories. We don't use the get_dummies() function
# for this purpose because pandas functions are not well suited to the scikit-learn workflow (fitting on one part of the
# dataset, such as train, and transforming another, such as test). pandas is a data analysis library and was not made with machine learning in mind.

# As always, we will fit the encoder on the train set, and transform both the train and test sets using what the encoder
# has learnt from the train set. handle_unknown="ignore" is essential here, in case
# your test set has a categorical value that was not seen by the encoder in the train set. It will handle
# this exception by "ignoring" it. The default value for this parameter is "error", which will raise an error.

# Note: from scikit-learn 1.2 the sparse parameter is named sparse_output, and the old name was removed in 1.4.

In [None]:
train = {"train":['good','bad','worst','good', 'good', 'bad', 'bed']}
test = {"test": ['bad','worst','good', 'good', 'bad', "bed", "resume", "car"]}
train = pd.DataFrame(train)
test = pd.DataFrame(test)
train

In [None]:
test

In [None]:
train.value_counts()

In [None]:
test.value_counts()

In [None]:
enc.fit_transform(train[["train"]])

# The fitting is done ONLY on the train set. As always.

In [None]:
enc.transform(test[["test"]])


# And then we transform the test set with the unique values learnt from the train set.
# Notice here that the test set had 2 values that were not seen in the train set, yet the encoder
# handles this exception gracefully by "ignoring" those values (putting all 0 to those observation points).

In [None]:
enc.get_feature_names_out(["train"])

# OneHotEncoder encodes the nominal categorical variables as dummy variables. For each categorical variable
# being encoded, it adds as many new dummy/binary variables as that variable has unique values.
# The original categorical variable is not part of the output. The new dummy feature names follow
# the convention <name of the old categorical feature>_<name of the categorical value>

In [None]:
pd.DataFrame(enc.fit_transform(train[["train"]]), columns = enc.get_feature_names_out(["train"]))

In [None]:
pd.DataFrame(enc.transform(test[["test"]]), columns = enc.get_feature_names_out(["train"]))

### OrdinalEncoder

#### Example

In [None]:
train2 = {"train":['good','bad','worst','good', 'good', 'bad']}
test2 = {"test": ['bad','worst','good', 'good', 'bad']}
train2 = pd.DataFrame(train2)
test2 = pd.DataFrame(test2)
train2

# If there's an underlying logical order to a categorical variable (which is called ordinal variable), we need to handle it
# accordingly by using OrdinalEncoder in order to avoid losing the logical order.

# There's an exception to this approach: In tree based models, we will encode all categorical variables with onehotencoder
# regardless of the difference between nominal and ordinal variables. We will talk about that more in the future.

In [None]:
test2

In [None]:
from sklearn.preprocessing import OrdinalEncoder

categories = ['worst','bad','good']

enc_2 = OrdinalEncoder(categories=[categories])

# If there's a logical order to the categorical values, we will use ordinalencoder. One important note here is that
# by default, ordinalencoder will order the categorical values in their alphabetical order. So, by default,
# it will encode bad:0, good:1, worst:2 which is not the correct logical order they should be.
# This is why we are specifying the categories variable in the exact logical order that they should be.
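To make the difference concrete, here is a small sketch comparing the default alphabetical ordering with an explicit order. The column name `quality` and the variable names are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = pd.DataFrame({"quality": ["good", "bad", "worst"]})

# Default: categories are sorted alphabetically -> bad=0, good=1, worst=2.
default_codes = OrdinalEncoder().fit_transform(levels).ravel()

# Explicit order: worst=0, bad=1, good=2.
ordered_codes = OrdinalEncoder(categories=[["worst", "bad", "good"]]).fit_transform(levels).ravel()

print(default_codes.tolist())   # [1.0, 0.0, 2.0]
print(ordered_codes.tolist())   # [2.0, 1.0, 0.0]
```

With the default ordering, "worst" ends up with the highest code, which would tell a linear model that it ranks above "good".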

In [None]:
enc_2.fit_transform(train2[["train"]])

In [None]:
enc_2.transform(test2[["test"]])

In [None]:
enc_2.get_feature_names_out(["train"])

# OrdinalEncoder doesn't change the feature name. It just encodes the values of the feature.

### OneHotEncoder  and OrdinalEncoder for X_train

#### OneHotEncoder

In [None]:
for i in df.select_dtypes("object"):
    print(f"{i:<30}:", df[i].unique())

# First 8 features should be encoded with onehotencoder as they don't have logical orders.
# The last 3 features should be encoded with ordinalencoder as they have a logical order.

In [None]:
cat = X_train.select_dtypes("object").columns
cat

# Names of the features that need to be encoded.

In [None]:
cat_onehot = ['make_model', 'body_type', 'Type', 'Fuel', 'Paint_Type','Upholstery_type', 'Gearing_Type', 'Drive_chain']
cat_ordinal = ['Comfort_Convenience_Package', 'Entertainment_Media_Package', 'Safety_Security_Package']

# Creating 2 separate lists. One for variables that will be onehotencoded, one for variables that will be ordinalencoded.

In [None]:
X_train[cat_onehot].head()

# X_train features that will be onehotencoded.

In [None]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore", sparse=False)

enc.fit_transform(X_train[cat_onehot])

# Did the onehotencoder transformation to the relevant features of X_train

In [None]:
enc.get_feature_names_out(cat_onehot)

# These are the new feature names that we got after doing the OneHotEncoder

In [None]:
X_train_onehot = pd.DataFrame(enc.fit_transform(X_train[cat_onehot]), index=X_train.index, 
                           columns=enc.get_feature_names_out(cat_onehot))
X_train_onehot

# By default, scikit-learn transformers return NumPy arrays (or sparse matrices), not pandas objects. To keep working with pandas, we have to
# turn the result back into a dataframe. We can get the new column names directly from the encoder with enc.get_feature_names_out(cat_onehot).
# The returned array also carries no index information, so we pass the original index of X_train explicitly.
# This is very important to keep track of our records.
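The effect of dropping the index can be seen on a tiny example. The sketch below uses a made-up frame with a shuffled index, similar to what `train_test_split` produces; the array `encoded` is a stand-in for an encoder's output and all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is shuffled, as it would be after train_test_split.
part = pd.DataFrame({"num": [1.0, 2.0, 3.0]}, index=[7, 2, 5])
encoded = np.array([[10.0], [20.0], [30.0]])  # stand-in for an encoder's ndarray output

# Without index=..., the new frame gets a fresh 0..n-1 index and the join misaligns.
lost = part.join(pd.DataFrame(encoded, columns=["enc"]))

# Passing the original index keeps every encoded row attached to its own record.
kept = part.join(pd.DataFrame(encoded, columns=["enc"], index=part.index))

print(lost)
print(kept)
```

In `lost`, two rows have no match at all and the one row that does match (index 2) receives the value that belongs to a different record, which is why the silent misalignment is more dangerous than an error.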

#### OrdinalEncoder

In [None]:
cat_ordinal = ['Comfort_Convenience_Package', 'Entertainment_Media_Package', 'Safety_Security_Package']

# Features to OrdinalEncode

In [None]:
for i in cat_ordinal:
    print(f"{i:<27}:", df[i].unique())

# Check their unique values

In [None]:
from sklearn.preprocessing import OrdinalEncoder

cat_for_comfort = ['Standard', 'Premium', 'Premium Plus']
cat_for_ent = ['Standard', 'Plus']
cat_for_safety = ['Safety Standard Package', 'Safety Premium Package', 'Safety Premium Plus Package']

enc2 = OrdinalEncoder(categories=[cat_for_comfort, cat_for_ent, cat_for_safety])

# Manually arrange the unique category names in the correct logical order.

# Also, make sure that the order of the "categories" variable values are the same as the order
# of the columns that you want to OrdinalEncode like below:

In [None]:
X_train[cat_ordinal]

# Getting the features to OrdinalEncode

In [None]:
enc2.fit_transform(X_train[cat_ordinal])

In [None]:
enc2.get_feature_names_out(cat_ordinal)

# Feature names are the same, as OrdinalEncoder doesn't change the feature names.

In [None]:
X_train_ordinal = pd.DataFrame(enc2.fit_transform(X_train[cat_ordinal]), index = X_train.index, 
                           columns = enc2.get_feature_names_out(cat_ordinal))

X_train_ordinal

# turn it back into a dataframe.

### Joining All Features of X_train

In [None]:
X_train_numeric = X_train.select_dtypes("number")
X_train_numeric.head()

# We are getting the numerical features of our X_train which we did not encode in any way.
# We will combine all of them back into a one big dataframe (numeric_df + ordinal_df + onehot_df)

# Since we retained the original index values of our data points after we did OrdinalEncoder and OneHotEncoder, their indexes
# are the same now with the indexes of X_train_numeric, which makes it very easy to join them back together.

In [None]:
X_train_new = X_train_numeric.join([X_train_onehot, X_train_ordinal])
X_train_new

# Joining them back together on the same index numbers.

### OneHotEncoder  and OrdinalEncoder for X_test

#### OneHotEncoder

In [None]:
X_test_onehot = pd.DataFrame(enc.transform(X_test[cat_onehot]), index = X_test.index, 
                             columns = enc.get_feature_names_out(cat_onehot))
X_test_onehot

# We will do the same process on the test set as well. One important difference, though, is that
# we ONLY TRANSFORM the test set. NOT FIT it.

#### OrdinalEncoder

In [None]:
X_test_ordinal = pd.DataFrame(enc2.transform(X_test[cat_ordinal]), index = X_test.index, 
                           columns = enc2.get_feature_names_out(cat_ordinal))

X_test_ordinal

### Joining All Features of X_test

In [None]:
X_test_numeric = X_test.select_dtypes("number")
X_test_numeric.head()

In [None]:
X_test_new = X_test_numeric.join([X_test_onehot, X_test_ordinal])
X_test_new

### Converting Object Features into Numerical Features Using Make Column Transformer

In [None]:
cat_onehot = ['make_model', 'body_type', 'Type', 'Fuel', 'Paint_Type','Upholstery_type', 'Gearing_Type', 'Drive_chain']
cat_ordinal = ['Comfort_Convenience_Package', 'Entertainment_Media_Package', 'Safety_Security_Package']
    
cat_for_comfort = ['Standard', 'Premium', 'Premium Plus']
cat_for_ent = ['Standard', 'Plus']
cat_for_safety = ['Safety Standard Package', 'Safety Premium Package', 'Safety Premium Plus Package']

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

column_trans = make_column_transformer((OneHotEncoder(handle_unknown="ignore", sparse=False), cat_onehot), 
                                       (OrdinalEncoder(categories= [cat_for_comfort, cat_for_ent, cat_for_safety]), cat_ordinal),
                                       remainder='passthrough') # MinMaxScaler()


# As you can see, doing this manually is long and cumbersome. There's a saviour, though.
# ColumnTransformer from scikit-learn was created exactly for this purpose: to automate this long and tedious process.
# Doing it manually is very error prone. Automating it makes things much more reliable and consistent.

# It applies the transformers to the features in the order specified when we instantiate the class.

# remainder="passthrough" passes through the remaining columns that we are not transforming.
# The default value for remainder is "drop", which drops all of the remaining columns.
# You can also pass an estimator, such as remainder=MinMaxScaler(), which will
# scale the remaining columns. It accepts any scikit-learn transformer.

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
X_train.shape, X_test.shape

In [None]:
column_trans.fit_transform(X_train)

In [None]:
X_train_trans = column_trans.fit_transform(X_train)
X_test_trans = column_trans.transform(X_test)

In [None]:
X_train_trans.shape, X_test_trans.shape

In [None]:
X_train_trans

In [None]:
column_trans.get_feature_names_out()

In [None]:
features = column_trans.get_feature_names_out()

In [None]:
X_train= pd.DataFrame(X_train_trans, columns=features, index=X_train.index)
X_train.head()

In [None]:
X_test= pd.DataFrame(X_test_trans, columns=features, index=X_test.index)
X_test.head()

In [None]:
X_train.join(y_train).corr()

In [None]:
corr_by_price = X_train.join(y_train).corr()["price"].sort_values()[:-1]
corr_by_price

# We want to check the correlation of all of our independent features with the target. This is why
# we are joining the target with the independent features here. The correlation of the target
# with itself is 1 and sorts last, so we exclude it with [:-1].

In [None]:
plt.figure(figsize = (10,14))
sns.barplot(y = corr_by_price.index, x = corr_by_price)
plt.xticks(rotation=90)
plt.tight_layout();

# Visualise the correlations to make it more readable.

### Scaling

In [None]:
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## 3. Implement Linear Regression

 - Import the module
 - Fit the model 
 - Predict the test set
 - Determine feature coefficients
 - Evaluate model performance (use performance metrics for regression and cross_val_score)
 - Compare different evaluation metrics
 
*Note: You can use the [dir()](https://www.geeksforgeeks.org/python-dir-function/) function to see the methods you need.*

In [None]:
def train_val(model, X_train, y_train, X_test, y_test):
    
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    
    scores = {"train": {"R2" : r2_score(y_train, y_train_pred),
    "mae" : mean_absolute_error(y_train, y_train_pred),
    "mse" : mean_squared_error(y_train, y_train_pred),                          
    "rmse" : np.sqrt(mean_squared_error(y_train, y_train_pred))},
    
    "test": {"R2" : r2_score(y_test, y_pred),
    "mae" : mean_absolute_error(y_test, y_pred),
    "mse" : mean_squared_error(y_test, y_pred),
    "rmse" : np.sqrt(mean_squared_error(y_test, y_pred))}}
    
    return pd.DataFrame(scores)

# We will use this function to compare train and test metrics.

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train_scaled, y_train)

# Instantiate the linear model into an object and train it on the train set.

In [None]:
pd.options.display.float_format = '{:.3f}'.format

# This way, pandas will show floats with 3 decimal places.

In [None]:
train_val(lm, X_train_scaled, y_train, X_test_scaled, y_test)

### Adjusted R2 Score

In [None]:
def adj_r2(y_test, y_pred, X):
    r2 = r2_score(y_test, y_pred)
    n = X.shape[0]   # number of observations
    p = X.shape[1]   # number of independent variables 
    adj_r2 = 1 - (1-r2)*(n-1)/(n-p-1)
    return adj_r2

# When the data has many features relative to the number of rows, R2 can look better than it really is.
# This is especially likely when many new features (such as dummy variables) have been added to the data.
# The adjusted R2 score corrects for this and gives a more realistic picture.

# Adjusted R2 accounts for the trade-off between the number of rows (n) and the number of features (p).
# When p is large relative to n, the adjusted R2 drops well below the plain R2.
# The function above computes the adjusted R2 score.
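A quick worked example may make the penalty easier to see. The numbers below are hypothetical and the helper `adjusted_r2` simply restates the formula used above, taking R2 directly instead of the predictions.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p predictors evaluated on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: the same R2 of 0.90 on 100 observations, with few vs. many features.
few_features = adjusted_r2(0.90, n=100, p=10)
many_features = adjusted_r2(0.90, n=100, p=60)

print(round(few_features, 3), round(many_features, 3))
```

With 10 features the adjustment is small, whereas with 60 features on the same 100 rows the score falls noticeably, even though the plain R2 is identical.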

In [None]:
y_pred = lm.predict(X_test_scaled)

In [None]:
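adj_r2(y_test, y_pred, X_test)

# n and p must come from the data the predictions were made on, so we pass X_test rather than the full X.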


### Cross Validate

In [None]:
model = LinearRegression()
scores = cross_validate(model, X_train_scaled, y_train, scoring=['r2', 
            'neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'], cv =10,
             return_train_score=True)

# As we learned in our previous lessons, we check for overfitting through cross validation.
# We do this by comparing train and validation scores.

In [None]:
pd.DataFrame(scores)

In [None]:
pd.DataFrame(scores).iloc[:, 2:].mean()

# Train and validation scores are close, so there is no overfitting.

In [None]:
train_val(lm, X_train_scaled, y_train, X_test_scaled, y_test)

In [None]:
2622/df.price.mean()

# The model's average error is about 14.5% of the mean price.

### Prediction Error with Outliers

In [None]:
from yellowbrick.regressor import PredictionError
from yellowbrick.features import RadViz

visualizer = RadViz(size=(720, 3000))
model = LinearRegression()
visualizer = PredictionError(model)
visualizer.fit(X_train_scaled, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test_scaled, y_test)  # Evaluate the model on the test data
visualizer.show();

# The prediction error plot shows how well the model's predictions match the actual prices. Cars priced at around 40 thousand EUR and above pull the best fit line down.
# If these high-priced observations are what degrade the scores, removing the outlier-priced cars
# and retraining the model on the remaining data may give better scores.

# In this notebook, we will continue by removing the outliers from our data. Outliers are determined per make_model, not by a flat 40 thousand EUR cutoff.

### Residual Plot with Outliers

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = RadViz(size=(1000, 720))
model = LinearRegression()
visualizer = ResidualsPlot(model)

visualizer.fit(X_train_scaled, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test_scaled, y_test)  # Evaluate the model on the test data
visualizer.show();       

### Dropping outliers that worsen my predictions from the dataset

In [None]:
for model in df2.make_model.unique():
    
    car_prices = df2[df2["make_model"]== model]["price"]
    
    Q1 = car_prices.quantile(0.25)
    Q3 = car_prices.quantile(0.75)
    
    IQR = Q3-Q1
    
    lower_lim = Q1-1.5*IQR
    upper_lim = Q3+1.5*IQR

    drop_index = df2[df2["make_model"]== model][(car_prices < lower_lim) | (car_prices > upper_lim)].index
    df2.drop(index = drop_index, inplace=True)
    df2.reset_index(drop=True, inplace=True)
df2

# Here we remove the outlier observations from our data. For each make_model, we first determine the lower and upper limits,
# then find the indexes of the car prices that fall outside these limits and drop them from our data.
# We use reset_index so that the index is consecutive again after the drop.

In [None]:
15493 + 416

In [None]:
df2.shape

In [None]:
df3 = df2.copy()

# df3 is the new dataset cleaned of outliers. We keep a copy in case we need it later.

In [None]:
X = df2.drop(columns = "price")
y = df2.price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


In [None]:
X_train= pd.DataFrame(column_trans.fit_transform(X_train), columns=features, index=X_train.index)
X_test= pd.DataFrame(column_trans.transform(X_test), columns=features, index=X_test.index)

In [None]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
lm2 = LinearRegression()
lm2.fit(X_train_scaled, y_train)

### Prediction Error without Outliers

In [None]:
visualizer = RadViz(size=(720, 3000))
model = LinearRegression()
visualizer = PredictionError(model)
visualizer.fit(X_train_scaled, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test_scaled, y_test)  # Evaluate the model on the test data
visualizer.show();

# After removing the outlier values, we see that the angle between the best fit line and the identity line narrows even more.

### Residual Plot without Outliers

In [None]:
from yellowbrick.regressor import ResidualsPlot

visualizer = RadViz(size=(1000, 720))
model = LinearRegression()
visualizer = ResidualsPlot(model)

visualizer.fit(X_train_scaled, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test_scaled, y_test)  # Evaluate the model on the test data
visualizer.show(); 

In [None]:
train_val(lm2, X_train_scaled, y_train, X_test_scaled, y_test)

# After the outliers are out, we can see that the results are getting better.

In [None]:
2256/df2.price.mean()

# Without outliers, the average prediction error decreased from 14.55% to 12.83%.

In [None]:
2622/df.price.mean()

In [None]:
model = LinearRegression() #normalize=True
scores = cross_validate(model, X_train_scaled, y_train,
                        scoring=['r2', 'neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'], 
                        cv=10, return_train_score=True)

#overfitting check with new dataset

In [None]:
scores = pd.DataFrame(scores, index = range(1, 11))
scores.iloc[:,2:]

In [None]:
scores = pd.DataFrame(scores, index = range(1, 11))
scores.iloc[:,2:].mean()

# Train and validation scores are close, so there is no overfitting.

In [None]:
train_val(lm2, X_train_scaled, y_train, X_test_scaled, y_test)

# Since the test scores and the validation scores we got from the CV are close to each other,
# we can say that the scores we got from the test (hold out) set are consistent.

In [None]:
y_pred = lm2.predict(X_test_scaled)

lm_R2 = r2_score(y_test, y_pred)
lm_mae = mean_absolute_error(y_test, y_pred)
lm_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# To compare the scores we get from the linear model, we assign the scores to the variables.

In [None]:
lm2.coef_  # The coefficients of the one-hot encoded features are very high: the dummy variable trap.

# https://geoffruddock.com/one-hot-encoding-plus-linear-regression-equals-multi-collinearity/
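The dummy variable trap can be demonstrated on a tiny example: when every category gets its own dummy column and the model also has an intercept, the dummy columns sum to the intercept column, so the design matrix is rank deficient. The sketch below is illustrative only; the `colors` frame and the helper `design_rank` are made up, and `drop="first"` is shown as one common remedy rather than what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "red", "green", "blue"]})

def design_rank(encoder):
    """Encode `colors`, prepend an intercept column, and return (n_columns, matrix_rank)."""
    dummies = encoder.fit_transform(colors)
    dummies = dummies.toarray() if hasattr(dummies, "toarray") else dummies
    design = np.column_stack([np.ones(len(colors)), dummies])
    return design.shape[1], np.linalg.matrix_rank(design)

# All dummies + intercept: the dummy columns sum to the intercept column, so one column is redundant.
full_cols, full_rank = design_rank(OneHotEncoder())

# drop="first" removes one dummy per feature, which makes the columns linearly independent.
drop_cols, drop_rank = design_rank(OneHotEncoder(drop="first"))

print(full_cols, full_rank, drop_cols, drop_rank)
```

When the rank is lower than the number of columns, the least squares solution is not unique, which is one reason the coefficients of the dummy features can become very large.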

In [None]:
pd.DataFrame(lm2.coef_, index = X_train.columns, columns=["Coef"])

## Pipeline

In [None]:
df2

In [None]:
X = df2.drop(columns = ["price"])
y = df2.price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

# After dropping outlier observations, we divide the remaining data into X and y again and divide it into train and test sets.

In [None]:
X_train.head()

In [None]:
cat_onehot = ['make_model', 'body_type', 'Type', 'Fuel', 'Paint_Type','Upholstery_type', 'Gearing_Type', 'Drive_chain']
cat_ordinal = ['Comfort_Convenience_Package', 'Entertainment_Media_Package', 'Safety_Security_Package']
    
cat_for_comfort = ['Standard', 'Premium', 'Premium Plus']
cat_for_ent = ['Standard', 'Plus']
cat_for_safety = ['Safety Standard Package', 'Safety Premium Package', 'Safety Premium Plus Package']

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

enc_onehot = OneHotEncoder(handle_unknown="ignore", sparse=False)
enc_ordinal = OrdinalEncoder(categories= [cat_for_comfort, cat_for_ent, cat_for_safety])

column_trans = make_column_transformer((enc_onehot, cat_onehot), 
                                       (enc_ordinal, cat_ordinal),
                                       remainder='passthrough') # MinMaxScaler()

# make_column_transformer bundles the per-column transformations into a single transformer.
# Each transformer is applied to its own list of columns, and the outputs are concatenated
# in the order in which the transformers are passed to the function.

# (OneHotEncoder(handle_unknown="ignore", sparse=False), cat_onehot)
# One-hot encodes every feature in the cat_onehot list. With handle_unknown="ignore",
# a category that appears in the test set but not in the train set is encoded as all zeros.
# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`.

# (OrdinalEncoder(categories=categories), cat_ordinal)
# Ordinal-encodes every feature in the cat_ordinal list, following the hierarchical order
# of the categories given in the categories list.

# remainder='passthrough' leaves all other features as they are.
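As a minimal sketch of the two encoders described above, the toy frames below (made-up values, not the project data) show how `handle_unknown="ignore"` treats a category unseen during fit and how `OrdinalEncoder` follows an explicit category order. The sparse output is converted with `.toarray()` so the example does not depend on the `sparse`/`sparse_output` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical train/test frames; "Electric" never appears in train.
train = pd.DataFrame({"Fuel": ["Diesel", "Benzine", "Diesel"]})
test = pd.DataFrame({"Fuel": ["Benzine", "Electric"]})

onehot = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_test = onehot.transform(test).toarray()
print(onehot.categories_[0])  # categories learned from train
print(encoded_test)           # the unseen category becomes a row of zeros

# Ordinal encoding with an explicit order: Standard < Premium < Premium Plus.
packages = pd.DataFrame({"Comfort_Convenience_Package": ["Premium", "Standard", "Premium Plus"]})
ordinal = OrdinalEncoder(categories=[["Standard", "Premium", "Premium Plus"]])
encoded_packages = ordinal.fit_transform(packages)
print(encoded_packages.ravel())
```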

In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Ridge", Ridge())]

ridge_pipe = Pipeline(steps=operations)

ridge_pipe.fit(X_train, y_train)

# What ridge_pipe.fit(X_train, y_train) does, in order:
# The first step in the pipeline is column_trans:
# 1. The features in cat_onehot are one-hot encoded.
# 2. The features in cat_ordinal are ordinal encoded.
# 3. The remaining features are passed through unchanged.
# The second step in the pipeline is MinMaxScaler():
# 4. The transformed, now fully numeric X is min-max scaled.
#    MinMaxScaler is used so that the dummy columns consisting of 0 and 1
#    remain 0 and 1 after scaling.
# The third step in the pipeline is Ridge():
# 5. The Ridge model is trained on the transformed and scaled X together with y.
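The sequence of steps a pipeline performs can be checked by running the same steps by hand. The sketch below uses randomly generated toy data (an assumption for illustration) and a shortened pipeline with only a scaler and a Ridge model; the predictions from the pipeline and from the manual steps should match.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
X_new = rng.normal(size=(5, 3))

# Pipeline: fit() fits and applies the scaler, then fits the model on the scaled data.
pipe = Pipeline(steps=[("scaler", MinMaxScaler()), ("Ridge", Ridge())])
pipe.fit(X_toy, y_toy)
pipe_pred = pipe.predict(X_new)

# The same steps done manually.
toy_scaler = MinMaxScaler().fit(X_toy)
toy_ridge = Ridge().fit(toy_scaler.transform(X_toy), y_toy)
manual_pred = toy_ridge.predict(toy_scaler.transform(X_new))

print(np.allclose(pipe_pred, manual_pred))
```

Note that at prediction time the scaler only transforms the new data with the min and max learned during fit; it is not refitted.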

## 4. Implement Ridge Regression

- Import the module
- Do not forget to scale the data (older scikit-learn versions also accepted `normalize=True`; this parameter was removed in version 1.2)
- Fit the model
- Predict the test set
- Evaluate model performance (use performance metrics for regression)
- Tune the alpha hyperparameter by using [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) and determine the optimal alpha value.
- Fit the model and predict again with the new alpha value.

## Ridge

In [None]:
from sklearn.linear_model import Ridge

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Ridge", Ridge())]

ridge_model = Pipeline(steps=operations)

ridge_model.fit(X_train, y_train)

In [None]:
train_val(ridge_model, X_train, y_train, X_test, y_test)

## Cross Validation

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Ridge", Ridge())]
pipe_model = Pipeline(steps=operations)

scores = cross_validate(pipe_model, X_train, y_train,
                        scoring=['r2', 'neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'], 
                        cv=10, return_train_score=True)

In [None]:
scores = pd.DataFrame(scores, index = range(1, 11))
scores.iloc[:,2:].mean()
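A short sketch of what the `cross_validate` output looks like may help when reading the table above. The data here is randomly generated (an assumption for illustration). The first two columns hold timing information, which is why `iloc[:, 2:]` is used, and the error metrics are returned as negative values, so they are negated to read them as errors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Hypothetical toy regression data.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

cv_scores = cross_validate(Ridge(), X_toy, y_toy,
                           scoring=["r2", "neg_root_mean_squared_error"],
                           cv=5, return_train_score=True)
cv_df = pd.DataFrame(cv_scores, index=range(1, 6))
print(list(cv_df.columns))  # fit_time and score_time come first

# Skip the two timing columns and average the metrics over the folds.
summary = cv_df.iloc[:, 2:].mean()

# "neg_" scorers return negative errors; flip the sign to get the RMSE.
rmse_val = -summary["test_neg_root_mean_squared_error"]
print(rmse_val)
```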

## Finding best alpha for Ridge

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
alpha_space = np.linspace(0.01, 100, 100)
alpha_space

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Ridge", Ridge())]
pipe_model = Pipeline(steps=operations)

param_grid = {'Ridge__alpha':alpha_space}  # Parameter names should be used with the model name defined in the pipeline.

ridge_grid_model = GridSearchCV(estimator=pipe_model,
                          param_grid=param_grid,
                          scoring='neg_root_mean_squared_error',
                          cv=10,
                          n_jobs = -1,
                          return_train_score=True)

In [None]:
pipe_model.get_params()  #To see the parameters of the model defined with the pipeline

In [None]:
ridge_grid_model.fit(X_train, y_train)

In [None]:
ridge_grid_model.best_estimator_


In [None]:
ridge_grid_model.best_params_

In [None]:
pd.DataFrame(ridge_grid_model.cv_results_)

In [None]:
ridge_grid_model.best_index_

In [None]:
pd.DataFrame(ridge_grid_model.cv_results_).loc[ridge_grid_model.best_index_, ["mean_test_score", "mean_train_score"]]

In [None]:
ridge_grid_model.best_score_
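The relationship between `best_index_`, `cv_results_` and `best_score_` can be sketched on toy data (randomly generated here, an assumption for illustration). The grid key uses the `<step name>__<parameter>` form, and `best_score_` is the mean validation score of the row at `best_index_`; with a `neg_` scorer it has to be negated to read it as an RMSE.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy regression data.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.3, size=80)

pipe = Pipeline(steps=[("scaler", MinMaxScaler()), ("Ridge", Ridge())])

# The key "Ridge__alpha" means: parameter alpha of the step named "Ridge".
grid = GridSearchCV(pipe, param_grid={"Ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
                    scoring="neg_root_mean_squared_error", cv=5,
                    return_train_score=True)
grid.fit(X_toy, y_toy)

results = pd.DataFrame(grid.cv_results_)
best_row = results.loc[grid.best_index_]  # row of the winning parameter setting

print(grid.best_params_)
print(best_row[["mean_test_score", "mean_train_score"]])
print(-grid.best_score_)  # validation RMSE of the best setting
```

Looking the row up with `best_index_` avoids hard-coding a row number that would silently become wrong if the grid changes.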

In [None]:
train_val(ridge_grid_model, X_train, y_train, X_test, y_test)

In [None]:
y_pred = ridge_grid_model.predict(X_test)
rm_R2 = r2_score(y_test, y_pred)
rm_mae = mean_absolute_error(y_test, y_pred)
rm_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Ridge", Ridge(alpha=1.02))]

ridge_model = Pipeline(steps=operations)

ridge_model.fit(X_train, y_train)

In [None]:
ridge_model["Ridge"].coef_  # To get the coefficients, index the pipeline with the name
                            # given to the model step ("Ridge").

In [None]:
ridge_model["OneHot_Ordinal_Encoder"].get_feature_names_out()

In [None]:
pd.DataFrame(ridge_model["Ridge"].coef_, index = ridge_model["OneHot_Ordinal_Encoder"].get_feature_names_out(), columns=["Coef"]).sort_values("Coef")

## 5. Implement Lasso Regression

- Import the module
- Do not forget to scale the data, if needed (the `normalize=True` parameter is only available in scikit-learn versions before 1.2)
- Fit the model 
- Predict the test set 
- Evaluate model performance (use performance metrics for regression) 
- Tune alpha hyperparameter by using [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) and determine the optimal alpha value.
- Fit the model and predict again with the new alpha value.
- Compare different evaluation metrics

*Note: To understand the importance of the alpha hyperparameter, you can observe the effects of different alpha values on the feature coefficients.*
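As a rough sketch of the note above, the following example fits Lasso with several alpha values on randomly generated data (an assumption for illustration, not the car data) in which only the first two features affect the target. As alpha grows, more coefficients are pushed to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical toy data: only the first two features drive the target.
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(200, 5))
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha).fit(X_toy, y_toy).coef_
    nonzero_counts[alpha] = int(np.sum(coef != 0))  # features still in the model
    print(alpha, np.round(coef, 2))

print(nonzero_counts)
```

With a very large alpha every coefficient is shrunk to zero and the model predicts only the mean, which is why alpha has to be tuned rather than simply maximized.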

In [None]:
from sklearn.linear_model import Lasso

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso())]

lasso_model = Pipeline(steps=operations)

lasso_model.fit(X_train, y_train)

In [None]:
train_val(lasso_model, X_train, y_train, X_test, y_test)

## Cross Validation

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso())]

model = Pipeline(steps=operations)
scores = cross_validate(model, X_train, y_train,
                        scoring=['r2', 'neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'],
                        cv=10, return_train_score=True)

In [None]:
scores = pd.DataFrame(scores, index = range(1, 11))
scores.iloc[:,2:].mean()

## Finding best alpha for Lasso

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso())]

model = Pipeline(steps=operations)

param_grid = {'Lasso__alpha':alpha_space}  # Parameter names should be used with the model name defined in the pipeline.

lasso_grid_model = GridSearchCV(estimator=model,
                          param_grid=param_grid,
                          scoring='neg_root_mean_squared_error',
                          cv=10,
                          n_jobs = -1,
                          return_train_score=True)

In [None]:
lasso_grid_model.fit(X_train, y_train)

In [None]:
lasso_grid_model.best_params_

In [None]:
pd.DataFrame(lasso_grid_model.cv_results_)

In [None]:
lasso_grid_model.best_index_

In [None]:
pd.DataFrame(lasso_grid_model.cv_results_).loc[lasso_grid_model.best_index_, ["mean_test_score", "mean_train_score"]]

In [None]:
lasso_grid_model.best_score_

In [None]:
train_val(lasso_grid_model, X_train, y_train, X_test, y_test)

In [None]:
y_pred = lasso_grid_model.predict(X_test)
lasm_R2 = r2_score(y_test, y_pred)
lasm_mae = mean_absolute_error(y_test, y_pred)
lasm_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso(alpha=1.02))]

lasso_model = Pipeline(steps=operations)

lasso_model.fit(X_train, y_train)

In [None]:
pd.DataFrame(lasso_model["Lasso"].coef_, index = lasso_model["OneHot_Ordinal_Encoder"].get_feature_names_out(), columns=["Coef"]).sort_values("Coef")

## 6. Implement Elastic-Net

- Import the module
- Do not forget to scale the data, if needed (the `normalize=True` parameter is only available in scikit-learn versions before 1.2)
- Fit the model
- Predict the test set
- Evaluate model performance (use performance metrics for regression)
- Tune the alpha and l1_ratio hyperparameters by using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and determine the optimal values.
- Fit the model and predict again with the new hyperparameter values.
- Compare different evaluation metrics

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("ElasticNet", ElasticNet())]

elastic_model = Pipeline(steps=operations)

elastic_model.fit(X_train, y_train)

In [None]:
train_val(elastic_model, X_train, y_train, X_test, y_test)

## Cross Validation

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("ElasticNet", ElasticNet())]

model = Pipeline(steps=operations)

scores = cross_validate(model, X_train, y_train,
                        scoring=['r2', 'neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'], 
                        cv=10, return_train_score=True)

In [None]:
scores = pd.DataFrame(scores, index = range(1, 11))
scores.iloc[:,2:].mean()

## Finding best alpha and l1_ratio for ElasticNet

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("ElasticNet", ElasticNet())]

model = Pipeline(steps=operations)

param_grid = {'ElasticNet__alpha':[1.02, 2,  3, 4, 5, 7, 10, 11],
              'ElasticNet__l1_ratio':[.5, .7, .9, .95, .99, 1]}

elastic_grid_model = GridSearchCV(estimator=model,
                          param_grid=param_grid,
                          scoring='neg_root_mean_squared_error',
                          cv=10,
                          n_jobs = -1,
                          return_train_score=True)
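The grid above includes `l1_ratio=1`, which turns Elastic-Net into a pure L1 (Lasso) penalty. A minimal sketch on randomly generated data (an assumption for illustration) shows that `ElasticNet(l1_ratio=1)` gives the same coefficients as `Lasso` with the same alpha, while a mixed penalty gives different ones.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical toy regression data.
rng = np.random.default_rng(4)
X_toy = rng.normal(size=(150, 4))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.2, size=150)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1).fit(X_toy, y_toy)     # pure L1 penalty
enet_mix = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_toy, y_toy)  # half L1, half L2

print(np.allclose(lasso.coef_, enet_l1.coef_))
print(np.round(lasso.coef_, 3))
print(np.round(enet_mix.coef_, 3))
```

So if the grid search selects `l1_ratio=1`, the tuned Elastic-Net is effectively a Lasso model.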

In [None]:
elastic_grid_model.fit(X_train, y_train)

In [None]:
elastic_grid_model.best_params_

In [None]:
elastic_grid_model.best_index_

In [None]:
pd.DataFrame(elastic_grid_model.cv_results_).loc[elastic_grid_model.best_index_, ["mean_test_score", "mean_train_score"]]

In [None]:
elastic_grid_model.best_score_

In [None]:
train_val(elastic_grid_model, X_train, y_train, X_test, y_test)

In [None]:
y_pred = elastic_grid_model.predict(X_test)
em_R2 = r2_score(y_test, y_pred)
em_mae = mean_absolute_error(y_test, y_pred)
em_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

## Feature Importance

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso(alpha=1.02))]
model = Pipeline(steps=operations)
model.fit(X_train, y_train)

In [None]:
df_feat_imp = pd.DataFrame(model["Lasso"].coef_, index = model["OneHot_Ordinal_Encoder"].get_feature_names_out(), columns=["Coef"]).sort_values("Coef")

In [None]:
plt.figure(figsize=(10,14))
sns.barplot(data= df_feat_imp, x=df_feat_imp.Coef, y=df_feat_imp.index);

In [None]:
# Yellowbrick cannot show feature importances for a pipeline, so we transform and scale X_train manually.

from yellowbrick.model_selection import FeatureImportances
from yellowbrick.features import RadViz

X_train_trans = column_trans.fit_transform(X_train)
X_train_scaled = MinMaxScaler().fit_transform(X_train_trans)  # same scaler type as in the pipeline
model = Lasso(alpha=1.02)

viz = FeatureImportances(model, labels=column_trans.get_feature_names_out())
visualizer = RadViz(size=(720, 3000))
viz.fit(X_train_scaled, y_train)
viz.show();

# Remember that we need to use the Lasso model here, since we are doing feature selection.

In [None]:
df_new = df2[["make_model", "hp_kW", "km","age", "Gearing_Type", "Gears", "Type", 'Safety_Security_Package', "price"]]

# We choose the top 7 features that have the most impact on the prediction. One may ask why
# the make_model feature was chosen. When the plot above is examined, the dummy columns derived
# from make_model (Audi A3, Audi A1, Renault Espace, etc.) are among the features
# that have the most impact on the estimate, so we keep the make_model feature as a whole.

# Although the 'Safety_Security_Package' feature does not contribute much to the estimate,
# it is included to show how the ordinal encoding is automated.

In [None]:
df_new

In [None]:
X = df_new.drop(columns = ["price"])
y = df_new.price

# We define X and y from the reduced dataset and rebuild the model,
# repeating the operations we did above.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
X_train.head()

In [None]:
cat_onehot = ['make_model', 'Type', 'Gearing_Type']
cat_ordinal = ['Safety_Security_Package']

Safety_Security_Package = ['Safety Standard Package', 'Safety Premium Package', 'Safety Premium Plus Package']
    
categories = [Safety_Security_Package]

column_trans = make_column_transformer((OneHotEncoder(handle_unknown="ignore", sparse=False), cat_onehot), 
                                       (OrdinalEncoder(categories=categories), cat_ordinal),
                                       remainder='passthrough') #MinMaxScaler()

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso(alpha=1.02))]
lasso_final_model = Pipeline(steps=operations)

lasso_final_model.fit(X_train, y_train)
train_val(lasso_final_model, X_train, y_train, X_test, y_test)

## Cross Validation

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso(alpha=1.02))]
model = Pipeline(steps=operations)

scores = cross_validate(model, X_train, y_train,
                        scoring=['r2', 'neg_mean_absolute_error','neg_mean_squared_error','neg_root_mean_squared_error'],
                        cv=10, return_train_score=True)

In [None]:
scores = pd.DataFrame(scores, index = range(1, 11))
scores.iloc[:,2:].mean()

In [None]:
2303/df_new.price.mean()  # error of 2303 (hard-coded, presumably read from the scores above) relative to the mean price

In [None]:
y_pred = lasso_final_model.predict(X_test)
fm_R2 = r2_score(y_test, y_pred)
fm_mae = mean_absolute_error(y_test, y_pred)
fm_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

## 7. Visually Compare Models Performance In a Graph

In [None]:
scores = {"linear_m": {"r2_score": lm_R2 , 
 "mae": lm_mae, 
 "rmse": lm_rmse},

 "ridge_m": {"r2_score": rm_R2, 
 "mae": rm_mae,
 "rmse": rm_rmse},
    
 "lasso_m": {"r2_score": lasm_R2, 
 "mae": lasm_mae, 
 "rmse": lasm_rmse},

 "elastic_m": {"r2_score": em_R2, 
 "mae": em_mae, 
 "rmse": em_rmse},
         
 "final_m": {"r2_score": fm_R2, 
 "mae": fm_mae , 
 "rmse": fm_rmse}}
scores = pd.DataFrame(scores).T
scores

# We collect the metrics from all models in a nested dictionary and assign it to the scores variable.
# We then transpose the DataFrame so that the model names are in the index and the metrics are in the columns.

In [None]:
compare = scores.sort_values(by="r2_score", ascending=False)
compare
#sns.barplot(x = compare[j] , y= compare.index)

# We reorder the compare df by r2_scores from largest to smallest.

In [None]:
# metrics = scores.columns

for i, j in enumerate(scores):
    plt.figure(i)
    if j == "r2_score":
        ascending = False # for r2_score the bars are sorted from largest to smallest
    else:
        ascending = True # for mae and rmse the bars are sorted from smallest to largest
    compare = scores.sort_values(by=j, ascending=ascending) # reorder the compare df by the current metric
    ax = sns.barplot(x = compare[j] , y= compare.index) # x: scores of the current metric,
                                                       # y: model names from the index of compare
    ax.bar_label(ax.containers[0], fmt="%.4f"); # annotate each bar with 4 decimal places

## Prediction of a New Observation

In [None]:
X = df_new.drop(columns = ["price"])
y = df_new.price

# After trying all models and deciding on the one with the best scores, we split the data
# used for this model into X and y. Note that there is no train/test split at this final stage:
# the final model is fit on all the data.

In [None]:
X.head()

In [None]:
operations = [("OneHot_Ordinal_Encoder", column_trans), ("scaler", MinMaxScaler()), ("Lasso", Lasso(alpha=1.02))]
final_model = Pipeline(steps=operations)

# We set up the model with the best hyper parameter we found above.

In [None]:
final_model.fit(X, y)

In [None]:
my_dict = {
    "make_model": 'Audi A3',
    "hp_kW": 66,
    "km": 17000,
    "age": 2,
    "Gearing_Type": "Automatic",
    "Gears": 7,
    "Type":"Used",
    'Safety_Security_Package':'Safety Premium Package'
}

In [None]:
X.head()

# The feature order of the observation we will predict should be the same as the feature order of the data we train.

In [None]:
new_obs = pd.DataFrame([my_dict])
new_obs

# We see that the feature order is the same as in X.

In [None]:
final_model.predict(new_obs)

In [None]:
# when feature order is different

my_dict = {
    "make_model": 'Audi A3',
    "km": 17000,
    "hp_kW": 66,
    "age": 2,
    "Gearing_Type": "Automatic",
    "Gears": 7,
    "Type":"Used",
    'Safety_Security_Package':'Safety Premium Package'
}

new_obs = pd.DataFrame([my_dict])
new_obs

In [None]:
final_model.predict(new_obs)

# The feature order of new_obs differs from that of X, but the prediction is unchanged.
# The column transformer selects columns by name, so new_obs is effectively realigned to the feature order of the training data.

In [None]:
# What does final_model.predict(new_obs) do, in order?
# The first step in the pipeline is column_trans:
# 1. The cat_onehot features of new_obs are one-hot encoded using the categories learned from X.
# 2. The cat_ordinal features of new_obs are ordinal encoded using the category order set for X.
# 3. The remaining features of new_obs are left as they are.
# The second step in the pipeline is MinMaxScaler():
# 4. The transformed, now numeric new_obs is min-max scaled
#    according to the min and max values of the X data.
# The third step in the pipeline is Lasso():
# 5. The Lasso model predicts the price from the transformed and scaled new_obs.

In [None]:
# Important: the output columns of make_column_transformer follow the order of the transformers,
# so the encoded categorical features come first and the passthrough numeric features come last.
