## Business Problem Statement Understanding

### Problem Statement

Build a car price prediction model using machine learning.

UsedCar.com, a well-known business, is an online marketplace that enables customers to sell or buy vehicles throughout Germany. When it receives a request to sell a car, one of its sales representatives visits the client's location to gather the relevant information, including brand name, model name, year, kilometers driven, and fuel type. Once the representative returns to the office, the backend staff analyze this information and estimate the car's current price. The salesperson then visits the client once more to discuss the price.

The company finds this process extremely time-consuming, so it wants to automate it with a system that predicts the car's price for the client. The expectation is that a quick response will please clients, encourage them to sell or buy cars through UsedCar.com, and raise the company's share price by 35%.

They therefore made the decision to create a website that will display the current pricing of the automobile based on user inputs such as Brand Name, Model Name, year, KMS Driven, and Fuel type. Here, they plan to employ a machine learning algorithm that will automatically forecast the price of the car based on its characteristics.

To do this, our domain experts and data analysts met with the client at Quikr's headquarters to better understand the problem and expectations, and asked the client directly for the necessary information.

    1) name        >> Name of the car (including model number & version)
    
    2) company     >> Car brand
    
    3) year        >> Year of the model
    
    4) Price       >> Present price of the car
    
    5) kms_driven  >> Kilometers driven by the car
    
    6) fuel_type   >> Whether the car runs on Petrol/Diesel/LPG

The client wants to train a machine learning pipeline that detects patterns in the dataset using the features listed above. Once a user provides new information, our ML model will automatically predict the price of the car.

Price is a continuous variable, so predicting it is a regression problem. We will therefore use regression algorithms to predict the price of the car.

### Import all libraries

In [1]:
import os


import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

 
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

Here, we have loaded three types of libraries:

a) Built-in library (os)


b) Third-party libraries (pandas, NumPy, seaborn, Matplotlib)


c) scikit-learn library

## Data Collection

Based on the client's inputs, our domain experts and data analysts created the dataset in .csv format. We will use this dataset to train our machine learning (ML) algorithm.

In [2]:
df= pd.read_csv('quikr_car.csv')

This loads the dataset from the CSV file quikr_car.csv into a pandas DataFrame.

In [3]:
df

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Santro Xing XO eRLX Euro III,Hyundai,2007,80000,"45,000 kms",Petrol
1,Mahindra Jeep CL550 MDI,Mahindra,2006,425000,40 kms,Diesel
2,Maruti Suzuki Alto 800 Vxi,Maruti,2018,Ask For Price,"22,000 kms",Petrol
3,Hyundai Grand i10 Magna 1.2 Kappa VTVT,Hyundai,2014,325000,"28,000 kms",Petrol
4,Ford EcoSport Titanium 1.5L TDCi,Ford,2014,575000,"36,000 kms",Diesel
...,...,...,...,...,...,...
887,Ta,Tara,zest,310000,,
888,Tata Zest XM Diesel,Tata,2018,260000,"27,000 kms",Diesel
889,Mahindra Quanto C8,Mahindra,2013,390000,"40,000 kms",Diesel
890,Honda Amaze 1.2 E i VTEC,Honda,2014,180000,Petrol,


In [4]:
print('No. of Rows: ',df.shape[0])
print('No. of Columns: ',df.shape[1])

No. of Rows:  892
No. of Columns:  6


### Splitting the dataset into Training dataset and Testing Dataset

In [5]:
df_train, df_test = train_test_split(df,test_size=0.2,random_state=20)

print('df_train size: ',df_train.shape)
print('df_test size: ',df_test.shape)

df_train size:  (713, 6)
df_test size:  (179, 6)


To prevent data leakage, we split the data at the very beginning of the task. In machine learning, data leakage occurs when information from outside the training data, such as the test data, is used while building the model, which makes its evaluation overly optimistic. We therefore separate the training data (used to train the model) from the testing data (used to evaluate its predictions) before any exploration or preprocessing.
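As a minimal sketch of the same idea, the snippet below splits a toy array first and then fits a scaler on the training part only. The toy data is an assumption made for illustration; only `train_test_split` and `StandardScaler` come from this notebook's imports.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative only): 20 rows, 1 column.
X = np.arange(20, dtype=float).reshape(-1, 1)

# Split first, so nothing learned later can see the test rows.
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=20)

# Fit on the training part only, then reuse the fitted object on both parts.
toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
```

The statistics stored in `toy_scaler` come from the training rows alone, so the test rows stay unseen until evaluation.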

## Data Exploration

Here, we perform the data exploration on the df_train (training) dataset.

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 713 entries, 812 to 355
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        713 non-null    object
 1   company     713 non-null    object
 2   year        713 non-null    object
 3   Price       713 non-null    object
 4   kms_driven  667 non-null    object
 5   fuel_type   665 non-null    object
dtypes: object(6)
memory usage: 39.0+ KB


df_train.info() >> This lists each attribute with its non-null count and data type, along with the memory usage of the DataFrame, which is 39.0+ KB.

We observe that all the attributes have the 'object' dtype.

Next, we look for explicit missing values (NaN) and implicit missing values (placeholder or garbage entries) in the dataset.

In [7]:
df_train['name'].value_counts()

Honda City                                11
Honda Amaze                               10
Maruti Suzuki Dzire                        9
Maruti Suzuki Alto 800 Lxi                 7
Mahindra Jeep CL550 MDI                    6
                                          ..
Maruti Suzuki Esteem LXi BS III            1
Tata Indica V2 Xeta e GLE                  1
all paper updated tata indica v2 and u     1
selling car Ta                             1
Hyundai Verna                              1
Name: name, Length: 448, dtype: int64

df_train['name'].value_counts() >> This returns how many times each distinct value occurs in the 'name' attribute.

In [8]:
df_train['name'].unique()

array(['TATA INDI', 'Mahindra Bolero DI',
       'Mitsubishi Pajero Sport Limited Edition',
       'Maruti Suzuki Ertiga ZXI Plus', 'Sale tata',
       'Ford Fiesta SXi 1.6 ABS', 'Nissan Terrano XL D Plus',
       'Hyundai i20 Magna', 'Tata Nano', 'Maruti Suzuki Wagon R',
       'Hyundai Santro Xing', 'Maruti Suzuki Swift VXi 1.2 ABS BS IV',
       'Maruti Suzuki Swift Vdi BSIII', 'Hyundai i20 Sportz 1.2',
       'Hyundai Santro AE GLS Audio', 'Datsun Redi GO', 'Toyota Corolla',
       'Maruti Suzuki Zen Estilo LXI Green CNG',
       'Maruti Suzuki Dzire LDI', 'Hyundai i20 Active 1.4L SX O',
       'Maruti Suzuki Zen LXi BSII', 'Mahindra Jeep CL550 MDI',
       'Toyota Corolla Altis', 'Hyundai i20 Active 1.2 SX',
       'Maruti Suzuki Alto 800 Lxi', 'Ta', 'Tata Indigo CS GLS',
       'Tata Indigo eCS LX TDI BS III',
       'Maruti Suzuki Swift Select Variant',
       'Tata Sumo Victa EX 10 by 7 Str BSIII', 'Honda City 1.5 V MT',
       'Hyundai Grand i10 Magna 1.2 Kappa VTVT',
       '

df_train['name'].unique() >> This returns the unique values present in the 'name' attribute.

In [9]:
df_train['year'].value_counts()

2015    91
2013    75
2014    74
2016    61
2012    56
2011    48
2017    48
2009    47
2010    34
2018    26
2006    18
2019    16
2008    15
2007    14
2003    12
2005    10
2004    10
2000     6
2002     4
2001     4
o...     3
sale     3
...      3
car      2
Zest     2
digo     1
go .     1
2 bs     1
cent     1
SELL     1
/-Rs     1
k...     1
SALE     1
r 15     1
sell     1
d Ex     1
d...     1
cab      1
zire     1
EV2      1
tion     1
r...     1
emi      1
no.      1
odel     1
e...     1
n...     1
Sumo     1
o c4     1
able     1
t xe     1
D...     1
, Ac     1
zest     1
ture     1
arry     1
Name: year, dtype: int64

In [10]:
df_train['year'].unique()

array(['EV2', '2017', '2015', '2016', 'ture', '2009', '2011', '2013',
       '2003', '2014', '2012', '2006', 'zest', '2008', '2010', '2018',
       ', Ac', '2005', 'Zest', 'D...', '2007', '2000', 't xe', '2002',
       'able', '2004', '2019', 'sale', 'tion', 'o c4', 'o...', 'car',
       'Sumo', 'n...', 'e...', '2001', 'odel', 'no.', 'emi', '...',
       'r...', 'zire', 'go .', '2 bs', 'cent', '/-Rs', 'k...', 'digo',
       'd Ex', 'r 15', 'SALE', 'cab', 'd...', 'sell', 'SELL', 'arry'],
      dtype=object)

We expect a positive relationship between year and Price: the more recent the model year, the higher the price tends to be. This cannot be checked yet, because the year attribute still contains many non-year garbage values.

In [11]:
df_train['Price'].value_counts()

Ask For Price    31
2,50,000         15
3,50,000         12
4,50,000         11
3,00,000         10
                 ..
5,44,999          1
7,25,000          1
23,90,000         1
2,39,999          1
71,000            1
Name: Price, Length: 242, dtype: int64

In [12]:
df_train['Price'].unique()

array(['1,10,000', '1,80,000', '14,75,000', '6,35,000', '1,00,000',
       '2,50,000', '4,99,999', '2,40,000', '60,000', '1,59,000',
       '1,20,000', '3,65,000', '2,44,999', '1,60,000', 'Ask For Price',
       '4,88,000', '5,35,000', '99,999', '4,25,000', '3,00,000',
       '5,00,000', '2,80,000', '3,10,000', '2,70,000', '3,20,000',
       '1,75,000', '2,85,000', '5,49,000', '3,80,000', '2,20,000',
       '2,89,999', '10,00,000', '90,000', '5,99,000', '8,30,000',
       '2,84,999', '2,00,000', '4,89,999', '3,85,000', '2,51,111',
       '4,50,000', '8,60,000', '75,000', '9,44,999', '28,00,000',
       '1,35,000', '4,00,000', '8,55,000', '3,49,999', '2,65,000',
       '2,49,999', '6,00,000', '1,78,000', '4,80,000', '4,30,000',
       '9,50,000', '3,45,000', '2,45,000', '3,90,000', '6,90,000',
       '14,00,000', '7,30,000', '5,84,999', '5,01,000', '95,000',
       '3,44,999', '5,99,999', '3,51,000', '3,40,000', '3,71,500',
       '32,000', '31,00,000', '15,25,000', '2,90,000', '2,74,99

In [13]:
df_train['kms_driven'].value_counts()

45,000 kms      24
50,000 kms      20
20,000 kms      19
35,000 kms      18
60,000 kms      18
                ..
1,29,000 kms     1
30,600 kms       1
77,000 kms       1
30,201 kms       1
24,800 kms       1
Name: kms_driven, Length: 231, dtype: int64

In [14]:
df_train['kms_driven'].unique()

array([nan, '23,452 kms', '47,000 kms', '29,000 kms', '56,400 kms',
       '60,000 kms', '42,000 kms', '6,800 kms', '27,000 kms',
       '50,000 kms', '23,000 kms', '15,487 kms', '55,000 kms',
       '70,000 kms', '22,000 kms', '40,000 kms', '16,000 kms',
       '80,000 kms', '37,000 kms', '53,000 kms', '40 kms', '1,32,000 kms',
       '18,000 kms', '6,200 kms', '1,75,430 kms', '58,000 kms',
       '65,000 kms', '39,000 kms', '30,874 kms', '4,500 kms',
       '20,000 kms', '30,000 kms', '24,530 kms', '46,000 kms',
       '28,000 kms', '59,000 kms', '10,750 kms', '35,522 kms',
       '2,500 kms', '95,000 kms', '45,000 kms', '8,500 kms', '11,000 kms',
       '1,04,000 kms', '68,000 kms', '59,910 kms', '36,000 kms',
       '51,000 kms', '73,000 kms', '25,000 kms', '4,000 kms',
       '48,000 kms', '75,000 kms', '31,000 kms', '35,000 kms',
       '1,95,000 kms', '588 kms', '7,500 kms', '38,000 kms', '41,000 kms',
       '44,005 kms', '18,500 kms', '12,516 kms', '1,20,000 kms',
       '1,16

We expect that the fewer kilometers a car has been driven, the higher its price.
In other words, 'kms_driven' and 'Price' should be inversely related.

In [15]:
df_train['fuel_type'].value_counts()

Petrol    344
Diesel    320
LPG         1
Name: fuel_type, dtype: int64

In [16]:
df_train['fuel_type'].unique()

array([nan, 'Diesel', 'Petrol', 'LPG'], dtype=object)

Below are the insights and observations on the training data that need to be addressed.

**Quality Summary >>**
    
    name, company, year, Price >> have no null values   >> data type is object
    kms_driven, fuel_type      >> have some null values >> data type is object
    
**1) name >>**

    a) Name Attribute values are inconsistent (mix of Categorical &  Numerical)
    b) We will select only first 3 words of the name    

**2) Year >>**

    a) Year attribute has lots of Non Year Values/ garbage values
    b) We need to convert 'Object' data type into 'Integer' data type.

**3) Price >>**

    a) We need to remove 'Ask For Price' from the price attributes.
    b) We need to remove commas from this.   
    c) We need to convert 'Object' data type into 'Integer' data type.
    
**4) kms_driven >>**
    
    a) We need to convert 'Object' data type into 'Integer' data type. 
    b) We need to remove commas from this.
    c) We need to remove 'kms' from the integer values. e.g. '2,875 kms'
    
**5) fuel_type >>**

    a) We need to take care of NaN values.
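The observations above can be turned into counts. The sketch below uses a small hand-made sample (an assumption, loosely modelled on the rows shown earlier) to count explicit missing values (NaN) and implicit ones (garbage years, 'Ask For Price').

```python
import pandas as pd

# Hand-made sample for illustration; not the real dataset.
sample = pd.DataFrame({
    "year": ["2007", "zest", "2018", "o..."],
    "Price": ["80,000", "3,10,000", "Ask For Price", "4,25,000"],
    "kms_driven": ["45,000 kms", None, "22,000 kms", "Petrol"],
    "fuel_type": ["Petrol", None, "Petrol", None],
})

report = {
    # Implicit missing values: strings that are not valid entries.
    "non_numeric_year": int((~sample["year"].str.isnumeric()).sum()),
    "ask_for_price": int((sample["Price"] == "Ask For Price").sum()),
    # Explicit missing values: NaN / None.
    "missing_kms_driven": int(sample["kms_driven"].isna().sum()),
    "missing_fuel_type": int(sample["fuel_type"].isna().sum()),
}
print(report)
```

Running the same checks on df_train would give the number of rows each cleaning step is expected to affect.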

## Data Pre-processing

A machine learning model cannot work with raw data directly, so we pre-process the data before feeding it to the ML algorithm.

**1) name >>**

    a) Name Attribute values are inconsistent (mix of Categorical &  Numerical)
    b) We will select only first 3 words of the name    

In [17]:
df_train['name']

812                                  TATA INDI
29                          Mahindra Bolero DI
49     Mitsubishi Pajero Sport Limited Edition
105              Maruti Suzuki Ertiga ZXI Plus
616                                  Sale tata
                        ...                   
218        Force Motors Force One LX ABS 7 STR
223                               Toyota Etios
271            Renault Duster 85 PS RxE Diesel
474                Hyundai Xcent Base 1.1 CRDi
355                              Hyundai Verna
Name: name, Length: 713, dtype: object

First, I access the 'name' attribute.

In [18]:
df_train['name'].str.split()

812                                     [TATA, INDI]
29                            [Mahindra, Bolero, DI]
49     [Mitsubishi, Pajero, Sport, Limited, Edition]
105              [Maruti, Suzuki, Ertiga, ZXI, Plus]
616                                     [Sale, tata]
                           ...                      
218     [Force, Motors, Force, One, LX, ABS, 7, STR]
223                                  [Toyota, Etios]
271           [Renault, Duster, 85, PS, RxE, Diesel]
474                [Hyundai, Xcent, Base, 1.1, CRDi]
355                                 [Hyundai, Verna]
Name: name, Length: 713, dtype: object

Here, I split each name on whitespace.

In [19]:
df_train['name'].str.split().str.slice(0,3)

812                   [TATA, INDI]
29          [Mahindra, Bolero, DI]
49     [Mitsubishi, Pajero, Sport]
105       [Maruti, Suzuki, Ertiga]
616                   [Sale, tata]
                  ...             
218         [Force, Motors, Force]
223                [Toyota, Etios]
271          [Renault, Duster, 85]
474         [Hyundai, Xcent, Base]
355               [Hyundai, Verna]
Name: name, Length: 713, dtype: object

I slice each list from index 0 (start index) up to, but not including, index 3 (stop index), i.e. elements 0, 1, and 2.

In [20]:
df_train['name'].str.split().str.slice(0,3).str.join(' ')

812                  TATA INDI
29          Mahindra Bolero DI
49     Mitsubishi Pajero Sport
105       Maruti Suzuki Ertiga
616                  Sale tata
                ...           
218         Force Motors Force
223               Toyota Etios
271          Renault Duster 85
474         Hyundai Xcent Base
355              Hyundai Verna
Name: name, Length: 713, dtype: object

Here, I join the sliced words back into a single string, which keeps only the first three words of the 'name' attribute.
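The split, slice, and join chain can be checked on a few names in isolation. In this sketch the helper `first_three_words` is a hypothetical plain-Python equivalent, added only to show what the pandas chain computes per value.

```python
import pandas as pd

names = pd.Series([
    "Hyundai Santro Xing XO eRLX Euro III",
    "Mahindra Jeep CL550 MDI",
    "TATA INDI",
])

# Same chain as in the notebook: split on whitespace, keep items 0..2, re-join.
short = names.str.split().str.slice(0, 3).str.join(" ")

def first_three_words(name: str) -> str:
    """Hypothetical per-value equivalent of the pandas chain."""
    return " ".join(name.split()[:3])

print(short.tolist())
```

Names with fewer than three words, such as 'TATA INDI', pass through unchanged because slicing past the end of a list is not an error.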

In [21]:
df_train['name']=df_train['name'].str.split().str.slice(0,3).str.join(' ')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['name']=df_train['name'].str.split().str.slice(0,3).str.join(' ')
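The message above is pandas' SettingWithCopyWarning. It appears when pandas cannot tell whether an assignment targets an independent DataFrame or a slice of another one, and the frames returned by train_test_split seem to trigger it here. One common way to avoid it, sketched below on toy data (an assumption), is to take an explicit `.copy()` before assigning. Newer pandas versions with Copy-on-Write may not show this warning at all.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Honda City 1.5 V MT", "Tata Nano"],
    "year": [2014, 2009],
})

# .copy() makes the subset an independent DataFrame, so the assignment
# below is unambiguous and does not touch the original frame.
subset = toy[toy["year"] > 2010].copy()
subset.loc[:, "name"] = subset["name"].str.split().str.slice(0, 3).str.join(" ")

print(subset["name"].tolist())
print(toy["name"].tolist())
```

In this notebook, the analogous step would be calling `.copy()` on df_train and df_test right after the split.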


In [22]:
df_train['name']

812                  TATA INDI
29          Mahindra Bolero DI
49     Mitsubishi Pajero Sport
105       Maruti Suzuki Ertiga
616                  Sale tata
                ...           
218         Force Motors Force
223               Toyota Etios
271          Renault Duster 85
474         Hyundai Xcent Base
355              Hyundai Verna
Name: name, Length: 713, dtype: object

In [23]:
df_test['name']=df_test['name'].str.split().str.slice(0,3).str.join(' ')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['name']=df_test['name'].str.split().str.slice(0,3).str.join(' ')


In [24]:
df_test['name']

347          Hyundai Grand i10
674         Maruti Suzuki Alto
791          Chevrolet Beat LS
837             Datsun Go Plus
56        Mahindra Scorpio S10
                ...           
694    Mitsubishi Pajero Sport
428         Maruti Suzuki Alto
431            Tata Manza ELAN
563        Mahindra XUV500 W10
484         Mahindra XUV500 W6
Name: name, Length: 179, dtype: object

**2) Year >>**

    a) Year attribute has lots of Non Year Values/ garbage values
    b) We need to convert 'Object' data type into 'Integer' data type.

In [25]:
df_train=df_train[df_train['year'].str.isnumeric()]
df_test=df_test[df_test['year'].str.isnumeric()]

Here, I keep only the rows whose 'year' value, stored as a string, consists entirely of digits.

In [26]:
df_train['year']=df_train['year'].astype(int)
df_test['year']=df_test['year'].astype(int)

Now, I convert the 'year' column from object to integer data type using astype(int).
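The two steps can be seen on a short sample. The sketch below uses made-up values taken from the garbage entries listed earlier, and also shows a stricter alternative mask with `str.fullmatch`, which accepts only four-digit strings; that alternative is a suggestion, not what the notebook uses.

```python
import pandas as pd

year = pd.Series(["2007", "zest", "2018", "o...", "2014"])

# Notebook approach: keep strings made up only of digits.
digits_only = year.str.isnumeric()

# Stricter alternative (assumption): require exactly four digits.
four_digits = year.str.fullmatch(r"\d{4}")

clean_year = year[digits_only].astype(int)
print(clean_year.tolist())
```

On this sample both masks select the same rows; they would differ for a value such as '15', which is numeric but not a four-digit year.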

In [27]:
print('df_train size: ',df_train.shape)
print('df_test size: ',df_test.shape)

df_train size:  (669, 6)
df_test size:  (173, 6)


In [28]:
print(df_train['year'].unique())
print(df_train['year'].dtype)

[2017 2015 2016 2009 2011 2013 2003 2014 2012 2006 2008 2010 2018 2005
 2007 2000 2002 2004 2019 2001]
int64


In [29]:
print(df_test['year'].unique())
print(df_test['year'].dtype)

[2017 2016 2014 2015 2019 2011 2012 2001 2002 2013 2010 2007 2018 2009
 1995 2006 2005 2004 2003 2000 2008]
int64

**3) Price >>**

    a) We need to remove 'Ask For Price' from the price attributes.
    b) We need to remove commas from this.   
    c) We need to convert 'Object' data type into 'Integer' data type.


In [30]:
df_train=df_train[df_train['Price'] != 'Ask For Price']
df_test=df_test[df_test['Price'] != 'Ask For Price']

Here, I keep all the rows whose Price is not equal to 'Ask For Price'.

In [31]:
df_train['Price']=df_train['Price'].str.replace(',','').astype(int)
df_test['Price']=df_test['Price'].str.replace(',','').astype(int)

I removed the commas from Price and used astype(int) to convert the column from object to integer data type.

In [32]:
print('df_train size: ',df_train.shape)
print('df_test size: ',df_test.shape)

df_train size:  (649, 6)
df_test size:  (170, 6)


In [33]:
print(df_train['Price'].unique())
print(df_train['Price'].dtype)

[ 180000 1475000  635000  250000  499999  240000   60000  159000  120000
  365000  100000  244999  160000  488000  535000   99999  425000  300000
  500000  280000  270000  320000  175000  285000  549000  380000  220000
  289999 1000000   90000  599000  830000  284999  200000  489999  385000
  251111  450000  860000   75000  944999 2800000  135000  400000  855000
  310000  349999  265000  110000  249999  600000  178000  480000  430000
  950000  345000  245000  390000  690000 1400000  730000  584999  501000
   95000  344999  599999  351000  340000   32000 3100000 1525000  290000
  274999  275000  540000  155000  401919   35999 1891111  475000  195000
  299000  145000  395000  324999  325000  525000  235000 1510000  375000
  498000   90001   70000  799999  230000  389700  699000  105000 1499000
  650000  225000  125000  610000  315000   80000   57000  260000  569999
  900000  140000  189500  524999  550000   99000  984999  800000  190000
  299999  649999  379000  278000  350000  239999  1

In [34]:
print(df_test['Price'].unique())
print(df_test['Price'].dtype)

[ 524999  270000  189000  285000  395000  200000  568500  140000  150000
  405000  699999  349999  600000  284999  335000  899000   40000  225000
  400000  170000  900000  180000  195000 1891111  250000  449999  320000
  365000   99999  750000   69999  690000  130000  160000   80000 1299000
  565000   85000  425000  749999  235000  549900  299999  610000  265000
   35000  550000  650000  501000  360000  290000  950000  700000  125000
  230000  375000  350000  560000  110000  430000  175000 1075000   95000
  530000  520000  599999  830000  240000 1025000   70000  190000  209000
  182000  199000  399000  280000  865000  199999  244999 1130000  105000
  415000  189700 1000000  215000  689999  795000  399999  100000  330000
  260000   32000  370000  149000   90000 1350000  549999  274999  372000
   60000  770000  499999  699000  220000   51999  580000  380000 1200000
  310000  165000  115000  135000  401000  590000  390000   30000  675000
  185000 1499000  123000 1074999  470000  500000  4

**4) kms_driven >>**
    
    a) We need to convert 'Object' data type into 'Integer' data type. 
    b) We need to remove commas from this.
    c) We need to remove 'kms' from the integer values. e.g. '2,875 kms'
    d) We need to take care of NaN values.

In [35]:
df_train['kms_driven']=df_train['kms_driven'].str.split().str.get(0).str.replace(',','')
df_train=df_train[df_train['kms_driven'].str.isnumeric()]
df_train['kms_driven']=df_train['kms_driven'].astype(int)

Here, I split each kms_driven value on whitespace and keep only the first token (index 0), which drops the 'kms' suffix, and then remove the commas. For example, '2,875 kms' becomes '2875'.
A few values are still not numeric (for example 'Petrol' in row 890, where the columns are shifted), so I use df_train['kms_driven'].str.isnumeric()
to keep only the rows whose value consists entirely of digits.
Then, I used astype(int) to convert the column from object to integer data type.
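One caveat: `str.isnumeric()` returns NaN for missing values, and a mask containing NaN cannot be used for filtering, so the approach relies on the NaN rows having been removed by earlier steps. A sketch of a variant that tolerates NaN, using `pd.to_numeric(errors="coerce")` on made-up sample values (an assumption, not the notebook's method):

```python
import pandas as pd

kms = pd.Series(["45,000 kms", "40 kms", None, "Petrol", "2,875 kms"])

# Keep the first token and strip the thousands separators.
first_token = kms.str.split().str.get(0).str.replace(",", "", regex=False)

# Anything that is not a number (NaN, 'Petrol') becomes NaN and is dropped.
numeric = pd.to_numeric(first_token, errors="coerce")
clean_kms = numeric.dropna().astype(int)

print(clean_kms.tolist())
```

The result keeps the three valid readings and drops both the missing value and the misplaced 'Petrol' entry.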

In [36]:
df_test['kms_driven']=df_test['kms_driven'].str.split().str.get(0).str.replace(',','')
df_test=df_test[df_test['kms_driven'].str.isnumeric()]
df_test['kms_driven']=df_test['kms_driven'].astype(int)

In [37]:
print(df_train['kms_driven'].unique())
print(df_train['kms_driven'].dtype)

[ 23452  47000  29000  56400  60000  42000   6800  27000  50000  23000
  15487  55000  70000  22000  40000  80000  37000  53000     40 132000
  18000   6200 175430  58000  65000  39000  30874   4500  20000  30000
  24530  46000  28000  59000  10750  35522   2500  95000  45000   8500
  11000  16000 104000  68000  59910  36000  51000  73000  25000   4000
  48000  75000  31000  35000 195000    588  38000  41000  44005  18500
  12516 120000 116000 170000  32000  57923  13000  34000 102563  33000
  48508   7000  15000 100000   9000  39522 160000   1800  41800     65
     73  44000  56000  56758  90000  97200  60105  99000  54500  12500
  32700  82000  48006  22134  12000 175400  16934  39700      0  97000
 166000  63000  45863  85000  43000  37458  43200   5000  54870  43222
  49000  13349  55800  10000   2100  19000 129000  30600  80200  77000
  30201  56450  74000  91200  45933   9300   6000  58559  72000  11400
    122  64000  54000  21000  14000  26000   1600   7400   9800  35550
   300

In [38]:
print(df_test['kms_driven'].unique())
print(df_test['kms_driven'].dtype)

[  6821  38000  31000  13900  35000 130000      0  65000  62000  28000
  52000  44000  60500  55000  53000  40000  35500  15000  97200  36000
  13500   2450  50000  48660  20000  45000 147000  51000   1000  66000
  49000  37000  15487  46000  33000  52800  71200  75000  12000 150000
  37518  70000   3528   3350  34000  43000   2875  62500  23000  68000
  57000  39000  21000  24530  30000  41000  47000  25000  90000  29685
  60000  48006  72000   6020  32000  48247   8000  63000 100000  48008
  11523  25500   9000  13349  85455  14000  58000  29000  88000  26500
   2000   3000  48000   3200     60   7500  13000  27000  97000  22000
  60123   5000 111111  49800]
int64


**5) fuel_type >>**

    a) We need to take care of NaN values.

In [39]:
df_train=df_train[~df_train['fuel_type'].isna()] 

Since df_train['fuel_type'] contains NaN values, I keep only the rows whose fuel_type is not NaN, using '~' to negate the isna() mask.

In [40]:
df_train[df_train['fuel_type'].isna()]

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type


In [41]:
df_test[df_test['fuel_type'].isna()]

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type


In [42]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 647 entries, 29 to 355
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        647 non-null    object
 1   company     647 non-null    object
 2   year        647 non-null    int64 
 3   Price       647 non-null    int64 
 4   kms_driven  647 non-null    int64 
 5   fuel_type   647 non-null    object
dtypes: int64(3), object(3)
memory usage: 35.4+ KB


In [43]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 169 entries, 347 to 484
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        169 non-null    object
 1   company     169 non-null    object
 2   year        169 non-null    int64 
 3   Price       169 non-null    int64 
 4   kms_driven  169 non-null    int64 
 5   fuel_type   169 non-null    object
dtypes: int64(3), object(3)
memory usage: 9.2+ KB


In [44]:
df_train.describe()

Unnamed: 0,year,Price,kms_driven
count,647.0,647.0,647.0
mean,2012.406491,409324.3,47444.238022
std,3.974394,502770.3,35986.043894
min,2000.0,30000.0,0.0
25%,2010.0,175000.0,27000.0
50%,2013.0,299000.0,41800.0
75%,2015.0,482500.0,59000.0
max,2019.0,8500003.0,400000.0


Here is what I observed in the summary statistics of df_train['Price']:

Mean value >> 4,09,324

Minimum Price value >> 30,000

25% of Price values are below >> 1,75,000

50% of Price values are below >> 2,99,000

75% of Price values are below >> 4,82,500

Maximum Price value >> 85,00,003

The maximum is far above the 75th percentile, so it appears to be an outlier.

In [45]:
df_test.describe()

Unnamed: 0,year,Price,kms_driven
count,169.0,169.0,169.0
mean,2012.591716,420880.2,41801.254438
std,4.119371,351143.8,26486.924239
min,1995.0,30000.0,0.0
25%,2011.0,180000.0,25500.0
50%,2013.0,335000.0,40000.0
75%,2015.0,550000.0,53000.0
max,2019.0,1891111.0,150000.0


In the training data, the maximum Price is about 85 lakh (85,00,003), far above the rest of the values (the test data peaks at 18,91,111), so it appears to be an outlier. We will remove it from the training dataset by keeping all the rows except this one.

In [46]:
df_train.loc[(df_train['Price']>=8.500003e+06)]

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
562,Mahindra XUV500 W6,Mahindra,2014,8500003,45000,Diesel


In [47]:
df_train=df_train.loc[(df_train['Price']<8.500003e+06)]

In [48]:
print('df_train size: ',df_train.shape)
print('df_test size: ',df_test.shape)

df_train size:  (646, 6)
df_test size:  (169, 6)
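The notebook removes one hand-picked row. A more general rule of thumb, shown here only as an alternative, is the interquartile range (IQR) rule, which flags values above Q3 + 1.5 * IQR. The five prices below are a toy sample (an assumption). On the real data this rule would be stricter than the single-row removal above, so it should be applied with care.

```python
import pandas as pd

# Toy prices for illustration; the last one mimics the 85 lakh entry.
price = pd.Series([175000, 250000, 299000, 482500, 8500003])

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # upper fence of the IQR rule

outliers = price[price > upper]
kept = price[price <= upper]

print("upper fence:", upper)
print("flagged:", outliers.tolist())
```

As in the notebook, any such filter should be derived from the training data only.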


### Checking all the integer attributes and categorical attribute

In [49]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 646 entries, 29 to 355
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        646 non-null    object
 1   company     646 non-null    object
 2   year        646 non-null    int64 
 3   Price       646 non-null    int64 
 4   kms_driven  646 non-null    int64 
 5   fuel_type   646 non-null    object
dtypes: int64(3), object(3)
memory usage: 35.3+ KB


#### We have successfully converted the "year, Price, kms_driven" attributes into integers

#### Reset Index Column for Training and Testing Data

In [50]:
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

Here, I reset the index of both datasets in place, dropping the old index.

In [51]:
df_train

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Mahindra Bolero DI,Mahindra,2017,180000,23452,Diesel
1,Mitsubishi Pajero Sport,Mitsubishi,2015,1475000,47000,Diesel
2,Maruti Suzuki Ertiga,Maruti,2016,635000,29000,Petrol
3,Ford Fiesta SXi,Ford,2009,250000,56400,Petrol
4,Nissan Terrano XL,Nissan,2015,499999,60000,Diesel
...,...,...,...,...,...,...
641,Force Motors Force,Force,2015,580000,3200,Diesel
642,Toyota Etios,Toyota,2011,275000,36000,Diesel
643,Renault Duster 85,Renault,2013,489999,27000,Diesel
644,Hyundai Xcent Base,Hyundai,2016,300000,140000,Diesel


In [52]:
df_test

Unnamed: 0,name,company,year,Price,kms_driven,fuel_type
0,Hyundai Grand i10,Hyundai,2017,524999,6821,Petrol
1,Maruti Suzuki Alto,Maruti,2016,270000,38000,Petrol
2,Chevrolet Beat LS,Chevrolet,2014,189000,31000,Diesel
3,Datsun Go Plus,Datsun,2016,285000,13900,Petrol
4,Mahindra Scorpio S10,Mahindra,2015,395000,35000,Diesel
...,...,...,...,...,...,...
164,Mitsubishi Pajero Sport,Mitsubishi,2015,1725000,37000,Diesel
165,Maruti Suzuki Alto,Maruti,2015,230000,5000,Petrol
166,Tata Manza ELAN,Tata,2010,155555,111111,Petrol
167,Mahindra XUV500 W10,Mahindra,2018,1299000,40000,Diesel


##  Feature Engineering 

### Separating Features and Target Label

In [53]:
X_train = df_train.drop(['Price'],axis=1)
y_train = df_train['Price']

X_test =df_test.drop(['Price'],axis=1)
y_test = df_test['Price']

X_train = df_train.drop(['Price'],axis=1) >> This drops the 'Price' attribute and assigns all the independent attributes to X_train.

y_train = df_train['Price'] >> This assigns only the dependent attribute ('Price') to y_train.

In [54]:
print('x_train shape: ',X_train.shape)
print('y_train shape: ',y_train.shape)

print('x_test shape: ',X_test.shape)
print('y_test shape: ',y_test.shape)

x_train shape:  (646, 5)
y_train shape:  (646,)
x_test shape:  (169, 5)
y_test shape:  (169,)


### Encoding Categorical Attributes

In [55]:
enc= OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)

X_train= enc.transform(X_train)

X_test=enc.transform(X_test)

Here, I created a OneHotEncoder object (enc) and fitted it on X_train only, not on the entire dataset or on X_test. In this way, "enc" learns how to encode the categorical attributes from the training data alone.


I then use the enc.transform(data) method to transform both X_train and X_test.

As a result, the shape of X_train changed from (646, 5) to (646, 492),

and the shape of X_test changed from (169, 5) to (169, 492).


Because handle_unknown='ignore' is set, a category in X_test that was not present in X_train is encoded as all zeros for that attribute. If handle_unknown='error' (the default) had been used, OneHotEncoder would raise an error when it meets an unknown category during transform.

Most machine learning algorithms work only with numerical features and cannot use categorical features directly. The categorical features here are nominal, meaning they have no natural order.

We therefore used OneHotEncoder, which creates one binary attribute for each unique value of a categorical attribute.

The number of new attributes depends on how many distinct values the original attribute has.

Here, the 5 input attributes were expanded into 492 attributes. Note that the encoder was fitted on all five columns, so the numeric attributes year and kms_driven were also treated as categories.
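In the cell above, the encoder receives every column of X_train, including the numeric year and kms_driven. One possible alternative, sketched below on toy data, is to one-hot encode only the categorical columns with a `ColumnTransformer` and pass the numeric ones through. The toy frames and the column selection are assumptions, not the notebook's method.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({
    "company": ["Hyundai", "Maruti", "Ford"],
    "fuel_type": ["Petrol", "Petrol", "Diesel"],
    "year": [2007, 2018, 2014],
    "kms_driven": [45000, 22000, 36000],
})
toy_test = pd.DataFrame({
    "company": ["Tata"],            # brand not seen during fit
    "fuel_type": ["Diesel"],
    "year": [2018],
    "kms_driven": [27000],
})

categorical = ["company", "fuel_type"]
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",        # keep year and kms_driven as numbers
    sparse_threshold=0,             # always return a dense array here
)
ct.fit(toy_train)                   # fit on training data only

train_enc = ct.transform(toy_train)
test_enc = ct.transform(toy_test)

print(train_enc.shape)
print(test_enc[0])
```

The unseen brand 'Tata' is encoded as zeros in all three company columns, which is the effect of handle_unknown='ignore', while year and kms_driven keep their numeric meaning.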

In [56]:
print('x_train shape: ',X_train.shape)
print('x_test shape: ',X_test.shape)

print()

print('y_train shape: ',y_train.shape)
print('y_test shape: ',y_test.shape)

x_train shape:  (646, 492)
x_test shape:  (169, 492)

y_train shape:  (646,)
y_test shape:  (169,)


### Standardization

In [57]:
scaler=StandardScaler(with_mean=False)
scaler.fit(X_train)

X_train= scaler.transform(X_train)

X_test= scaler.transform(X_test)

In [58]:
print('x_train shape: ',X_train.shape)
print('x_test shape: ',X_test.shape)

print()

print('y_train shape: ',y_train.shape)
print('y_test shape: ',y_test.shape)

x_train shape:  (646, 492)
x_test shape:  (169, 492)

y_train shape:  (646,)
y_test shape:  (169,)


Here, I've used the StandardScaler class to build a single object ("scaler"). I fit "scaler" only on X_train, not on the entire dataset or on X_test, so the scaler learns the standard deviation of each encoded feature from the training data alone.

Then, I use the scaler.transform(data) method to transform both X_train and X_test.
Because the encoded data is a sparse matrix, with_mean=False is used: the features are not centered, and each column is only divided by its training standard deviation. Each 0/1 column therefore becomes 0 or 1/std, with unit variance on the training set.


Scale-sensitive algorithms such as KNN and Lasso tend to perform worse on unscaled features, so I made the decision to standardize the data here.

Normalization (min-max scaling) >> rescales each attribute to the range 0 to 1.
age 44 to 80 >> normalization >> 0.0 to 1.0


Standardization (Z-score normalization) >> rescales each feature so that it has "mean == 0" & "standard deviation == 1". It does not change the shape of the feature's distribution.
Standardization = (each feature value - mean of feature) / (standard deviation of feature)


Normalization vs Standardization >> standardization is usually the better default option because it is less sensitive to outliers.

Here, I want to use a supervised learning algorithm (Linear Regression). The data had some outliers and most of the features were approximately normally distributed, hence we used standardization here.

Here, we applied the same data preprocessing and feature engineering steps to both the training dataset and the testing dataset.

If we had preprocessed only the training dataset and not the testing dataset, the model would receive test features on a different scale from the one it was trained on.

e.g. If we standardize the training dataset but not the testing dataset, an attribute might range from 0 to 1 in the training dataset and from 0 to 100 in the testing dataset.
The weights learned by the model would no longer match its inputs, and our accuracy would drop.

Hence every transformer is fit on the training dataset only and then applied to both the training dataset and the testing dataset.

We imputed missing values on both datasets, and we standardized both datasets.
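
As a minimal sketch of the "fit on train, transform both" rule, the snippet below standardizes a single hypothetical numeric feature with plain NumPy. The numbers are made up for illustration; the point is only that the mean and standard deviation come from the training split and are reused for the test split.

```python
import numpy as np

# Hypothetical numeric feature (e.g. kilometres driven), split into train and test.
x_train = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
x_test = np.array([25_000.0, 50_000.0])

# "fit": learn the statistics from the training data only.
mean = x_train.mean()
std = x_train.std()  # population standard deviation, which StandardScaler also uses

# "transform": apply the same statistics to both splits.
z_train = (x_train - mean) / std
z_test = (x_test - mean) / std

print(z_train.mean(), z_train.std())  # approximately 0 and 1 on the training data
print(z_test)                         # test values expressed on the training scale
```

The test split does not have mean 0 and standard deviation 1 after the transform, and that is expected: it is measured against the training statistics.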

### Dimensionality Reduction

In [59]:
pca= PCA(n_components=30)
pca.fit(X_train.toarray())

X_train= pca.transform(X_train.toarray())

X_test= pca.transform(X_test.toarray())

In [60]:
print('x_train shape: ',X_train.shape)
print('x_test shape: ',X_test.shape)

print()

print('y_train shape: ',y_train.shape)
print('y_test shape: ',y_test.shape)

x_train shape:  (646, 30)
x_test shape:  (169, 30)

y_train shape:  (646,)
y_test shape:  (169,)


I've used the PCA class to construct a single object here, called "pca". PCA finds a new set of orthogonal axes (principal components) for the 492-dimensional feature space and projects the data points onto them. With "n_components=30" I specified that I wanted to keep only the 30 principal components with the highest variance as the features of my dataset.

I fit "pca" only on X_train, not on the entire dataset or on X_test, so the "pca" object learned the projection used for dimensionality reduction from the training dataset alone.

Then, I use the pca.transform(data) method to transform both X_train and X_test.
In this case, it reduced 492 attributes to 30 attributes while keeping the directions that carry the most variance. How much of the variance is retained can be checked with pca.explained_variance_ratio_.

OneHotEncoder returns its output as a SciPy sparse matrix, a compressed format that Scikit-Learn uses to store such high-dimensional data, and StandardScaler keeps it sparse. The PCA class could not work on this sparse input: it issued an error and instructed us to convert the data into a regular array. I therefore use the X_train.toarray() method to convert the sparse matrix to a dense NumPy array before fitting and transforming.

Dimensionality reduction: after OneHotEncoding, our dataset contained 492 features in this case.

The main approaches to dimensionality reduction are feature selection, projection, and manifold learning.

Training a machine learning algorithm on such high-dimensional data runs into the "Curse of Dimensionality": processing slows down and the model becomes more likely to overfit.



We apply "Dimensionality Reduction" to get over this issue. Dimensionality reduction is a form of lossy compression of the data. For instance, when we convert a PNG file to a JPG file, the compression discards some information, but the image is still usable.

Dimensionality reduction speeds up the training of the ML algorithm. Additionally, reducing the data to two or three dimensions helps with data visualization.
However, when we reduce the dimensions we usually lose some information, which could result in slightly lower accuracy.


So I used the PCA (Principal Component Analysis) algorithm to accomplish that. It projects the data points from the original "D-dimensional space" onto a lower "d-dimensional space" in order to reduce the dimensionality of the data.


The PCA algorithm finds the principal components: mutually orthogonal directions in the original high-dimensional space, ordered by how much of the data's variance they capture after projection.

It finds the "First Principal Component", the direction with the highest variance, then the "Second Principal Component", the direction with the highest remaining variance that is orthogonal to the first, and so on up to the "ith Principal Component", which is orthogonal to all of the previous ones. This process can carry on until there are exactly as many components as there were dimensions in the original space. After finding them, a subset of these principal components is chosen (e.g. C1, C2, C3, C4), which in that example decreases the dataset's dimension from D to 4.

In other words, PCA locates the lower-dimensional line or hyperplane that preserves the maximum variance of the original dataset.


First, the axis in the original space along which the data has the greatest variance is found. This is the first principal component (C1).

The algorithm then identifies the axis that is orthogonal to the first one and captures the greatest remaining variance. This is the second principal component (C2).

Once the data points have been transformed, our model uses the new features (C1, C2,..,Ci) to represent the data points in place of the original features (X1, X2,..,Xi).

In order to limit the number of dimensions in the data, we selected only a portion of these new dimensions, i.e. the first 30 components (C1, C2,..,C30).
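
The steps above can be sketched with NumPy alone, using the singular value decomposition of the centered data. This is only an illustration on hypothetical random data, not the notebook's dataset, and it is one common way to compute the principal components rather than a description of Scikit-Learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in 3 dimensions with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.1])

# Center the data, then take the SVD; the rows of Vt are the principal components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component, largest first.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Keep only the first k components and project the data onto them.
k = 2
X_reduced = X_centered @ Vt[:k].T

print(explained_ratio)   # share of the total variance per component
print(X_reduced.shape)   # (200, 2)
```

Because the third direction carries almost no variance in this toy data, dropping it loses very little information, which is the situation in which PCA works best.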

## Model Training

### Algorithm Selection And Hyperparameter Tuning

### Model_1 >> Linear Regression

In [61]:
model_1 = LinearRegression(fit_intercept=True)
model_1.fit(X_train,y_train)

LinearRegression()

### Model_2 >>  Lasso Regression

In [62]:
model_2= Lasso()
model_2.fit(X_train,y_train)

Lasso()

### Model_3 >>  KNN Regression

In [63]:
model_3 = KNeighborsRegressor(n_neighbors=5,metric='manhattan')
model_3.fit(X_train,y_train)

KNeighborsRegressor(metric='manhattan')

### Model_4 >> Decision Tree Regressor

#### Without Hyperparameter Tuning

In [64]:
model_4= DecisionTreeRegressor(criterion='mse',max_depth=5)

In [65]:
model_4.fit(X_train,y_train)

DecisionTreeRegressor(max_depth=5)

#### With Hyperparameter Tuning

In [66]:
parameter_grids = {'criterion': ['mse', 'mae'],
                  'max_depth' : [2,4,8,10,12,None],
                   'max_features':[0.25,0.5,1.0],
                  'min_samples_split' : [0.25,0.5,1.0],
                  }

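Here, we have defined the dictionary that holds the grid of hyperparameters for the Decision Tree Regressor.
The most important hyperparameters for the Decision Tree Regressor are
criterion, max_depth, and min_samples_split (the minimum number of samples a node must contain to be split; a float value is interpreted as a fraction of the training samples). In scikit-learn 1.0 and later, the criteria 'mse' and 'mae' are named 'squared_error' and 'absolute_error'.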

In [67]:
model_5 = GridSearchCV(DecisionTreeRegressor(),
                      param_grid=parameter_grids,
                       scoring='r2',
                      cv=5,
                      n_jobs=-1)

This is a GridSearchCV (Grid Search with Cross-Validation) model. It takes the estimator (DecisionTreeRegressor()), the parameter grid, the scoring metric (the main metric I want to improve is the r2 score), cv (5-fold cross-validation), and n_jobs (the number of parallel jobs). As this method is time-consuming and we must train a model for every combination of these hyperparameters, n_jobs=-1 employs all CPU cores to run the fits in parallel.

In [68]:
model_5.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(), n_jobs=-1,
             param_grid={'criterion': ['mse', 'mae'],
                         'max_depth': [2, 4, 8, 10, 12, None],
                         'max_features': [0.25, 0.5, 1.0],
                         'min_samples_split': [0.25, 0.5, 1.0]},
             scoring='r2')
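
To get a feel for why the grid search is time-consuming, the sketch below enumerates the grid from the cell above and counts how many models have to be trained. It uses only the standard library; the variable names other than parameter_grids are introduced here for illustration.

```python
from itertools import product

# The same grid as in the notebook cell above.
parameter_grids = {
    "criterion": ["mse", "mae"],
    "max_depth": [2, 4, 8, 10, 12, None],
    "max_features": [0.25, 0.5, 1.0],
    "min_samples_split": [0.25, 0.5, 1.0],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(parameter_grids, values))
              for values in product(*parameter_grids.values())]

n_candidates = len(candidates)     # 2 * 6 * 3 * 3 combinations
n_fits = n_candidates * cv_folds   # each candidate is trained once per fold

print(n_candidates, n_fits)
print(candidates[0])
```

On top of these cross-validation fits, GridSearchCV refits the best candidate once on the whole training set, because its refit parameter defaults to True.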

## Model Assessment

In [69]:
y_test_predict_1= model_1.predict(X_test)
R2_score_1=r2_score(y_test, y_test_predict_1)
print('R2_score of Linear Regression Model: ',R2_score_1*100,'%')

R2_score of Linear Regression Model:  55.23458203757061 %


In [70]:
y_test_predict_2= model_2.predict(X_test)
R2_score_2=r2_score(y_test, y_test_predict_2)
print('R2_score of Lasso Regression  Model: ',R2_score_2*100,'%')

R2_score of Lasso Regression  Model:  55.2344234318771 %


In [71]:
y_test_predict_3=model_3.predict(X_test)
R2_score_3=r2_score(y_test, y_test_predict_3)
print('R2_score of KNN Regression  Model: ',R2_score_3*100,'%')

R2_score of KNN Regression  Model:  51.71066173911394 %


In [72]:
print("Decision Tree Regressor Without Hyperparameter Tuning\n")

y_test_predict_4= model_4.predict(X_test)
R2_score_4=r2_score(y_test, y_test_predict_4)
print('Decision Tree Regressor R2_Score: ',R2_score_4*100,'%')

Decision Tree Regressor Without Hyperparameter Tuning

Decision Tree Regressor R2_Score:  29.567111000110113 %


In [73]:
print("Decision Tree Regressor With Hyperparameter Tuning\n")

print("\nBest Hyperparameters Values:",model_5.best_params_)
# best_score_ is the mean cross-validated R2 of the best candidate, not the R2 on the training set
print("\nDecision Tree Regressor Mean CV R-Squared: ",round((model_5.best_score_)*100,2), '%')

y_test_predict_5= model_5.predict(X_test)
R2_score_5=r2_score(y_test, y_test_predict_5)
print('\nDecision Tree Regressor R2_Score on Test: ',R2_score_5*100,'%')

Decision Tree Regressor With Hyperparameter Tuning


Best Hyperparameters Values: {'criterion': 'mae', 'max_depth': 10, 'max_features': 0.5, 'min_samples_split': 0.25}

Decision Tree Regressor Mean CV R-Squared:  45.96 %

Decision Tree Regressor R2_Score on Test:  25.524443408455145 %


# Conclusion

### Feature Importance Analysis 

Feature Importance Analysis >> Here we found that only 2 features were linearly correlated with the dependent variable.

        Dependent Attribute >> Price 
        Independent Attribute >> year & kms_driven
        
   A car that was purchased more recently generally sells for a higher price. The same positive relationship between year and price is seen here, and it helps our pipeline achieve a higher R2 score and a lower error.


The price of a car generally decreases as the number of kilometres it has been driven increases. The same relationship between "kms_driven" and "Price" is found here: the two are negatively correlated.
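
A quick way to check the direction of such relationships is the Pearson correlation coefficient. The sketch below uses a handful of hypothetical records, not the real dataset, purely to show how the sign of the coefficient reflects the two relationships described above.

```python
import numpy as np

# Hypothetical used-car records, for illustration only (not the real dataset).
year = np.array([2008, 2010, 2012, 2014, 2016, 2018])
kms_driven = np.array([95_000, 80_000, 62_000, 45_000, 30_000, 12_000])
price = np.array([110_000, 150_000, 210_000, 280_000, 390_000, 520_000])

# Pearson correlation coefficient between each feature and the price.
corr_year = np.corrcoef(year, price)[0, 1]
corr_kms = np.corrcoef(kms_driven, price)[0, 1]

print(round(corr_year, 3))  # positive: newer cars sell for more
print(round(corr_kms, 3))   # negative: more kilometres, lower price
```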

### Model Explanation

#### Linear Regression Model

Linear Regression is a supervised machine learning algorithm. The model finds the best-fit line between the dependent variable and one or more independent variables. It has two types: Simple Linear Regression and Multiple Linear Regression.

Simple Linear Regression is where only one independent variable is present and the model has to find its linear relationship with the dependent variable. Multiple Linear Regression is where two or more independent variables are present.

Equation for Simple Linear Regression: y=b0+b1x

Equation for Multiple Linear Regression: y=b0+b1x1+b2x2+...+bdxd

We got the highest R2_Score (about 55.2% on the test set) with the Linear Regression algorithm. Hence we are finalizing it for deployment.
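
As a rough illustration of the multiple linear regression equation, the sketch below estimates b0, b1 and b2 by ordinary least squares with NumPy. The data is synthetic and generated from known coefficients, so it only shows the mechanics; it is not the model trained in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from y = 4 + 2*x1 - 3*x2 plus a little noise.
X = rng.normal(size=(100, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so that b0 (the intercept) is estimated as well.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: the coefficients that minimise the squared error.
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs

# R2 score: 1 - (residual sum of squares / total sum of squares).
y_pred = X_design @ coeffs
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(b0, b1, b2)  # close to 4, 2 and -3
print(r2)          # close to 1 because the data is almost perfectly linear
```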

#### Lasso Regression Model

LASSO (Least Absolute Shrinkage and Selection Operator) is also known as the L1 regularization technique. It adds a penalty on the absolute values of the coefficients to keep the equation from becoming too complex.

#### KNN Regression

The KNN algorithm can be used for both regression and classification problems. It uses ‘feature similarity’ to predict the value of a new data point: the new point is assigned a value based on how closely it resembles the points in the training set. For regression, the prediction is the average of the target values of its k nearest neighbours.
The distance between the new point and each training point can be calculated using the three techniques below:

- Manhattan distance (for continuous features).
- Hamming distance (for categorical features).
- Euclidean distance (for continuous features).
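
The idea can be sketched in a few lines of NumPy. The function name knn_predict and the toy data are hypothetical; the sketch uses the Manhattan distance, the same metric chosen for model_3 above, and averages the targets of the nearest neighbours.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict a value as the mean target of the k nearest training points."""
    # Manhattan distance: sum of the absolute coordinate differences.
    distances = np.abs(X_train - x_new).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Hypothetical toy data: a single feature with targets equal to 10 * feature.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

prediction = knn_predict(X_train, y_train, np.array([3.1]), k=3)
print(prediction)  # mean of the targets of the 3 closest points (x = 3, 4 and 2)
```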


#### DecisionTreeRegression

A decision tree is a versatile ML model capable of performing both regression and classification tasks, and it can even handle tasks with multiple outputs. Decision trees are powerful algorithms, capable of fitting complex datasets. They are also the fundamental components of Random Forests, which are among the most powerful machine learning algorithms available today.

### Bias-variance trade off


Bias-variance trade-off >> an ML model has 3 types of error, namely
    Bias Error
    Variance Error
    Irreducible Error
    
Irreducible error is not about the model, it is about the data. Our model is not accurate because our data is noisy. Here, I checked the correlation of each feature attribute with the label variable, and for most of them the relationship is not linear.

We can't reduce this error by changing the model, hence it is called "Irreducible Error".

We can, however, try to limit it through data preprocessing.


Bias error and variance error depend on the ML algorithm that we select.
Depending on the complexity of our model, we get either high bias error or high variance error.


A common way to picture the two is a target (bullseye) with the predictions drawn as points on it:

Bias >> how far the centre of the points lies from the centre of the target (and in which direction)
Variance >> how spread out the points are

Low Bias >> the points are centred on the centre of the target
High Bias >> the points are shifted in one particular direction away from the centre

Low Variance >> all the points are close to each other
High Variance >> the points are scattered far from each other.

We were aiming for a model with low bias and low variance.

Bias error >> happens when the model is too simple. If the model is too simple, it can't learn the data well: it does not have enough capacity to capture the patterns in the data.

If the feature data is not linear and we use LinearRegression(), we get high bias error, because the model makes a wrong assumption about the data. It assumes that the data is simple and linear and predicts the price accordingly, but this is not the case. Our data is complex and not linear, hence we are getting high bias error.


Variance error >> happens when the model is too complex. A model that is too complex tries to fit each and every data point, so it becomes specific to the training data and gives high accuracy on the training dataset. The problem is that such a model can't generalize: although it has learnt the training dataset almost perfectly, it can't predict accurately when new data (X_test) is fed to it.

So, we tried to build a model with the right level of complexity, one that offers both low bias error and low variance error, so that the total error (bias error, variance error and irreducible error combined) is at its minimum. This is called the "Bias-Variance Trade-Off".
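
The trade-off can be illustrated by fitting polynomials of increasing degree to the same noisy data. This is a minimal sketch on synthetic data (a sine curve with noise, chosen only for illustration): a degree-1 polynomial is too simple, while a high-degree polynomial has enough freedom to chase the noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples of a smooth non-linear function.
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=200)

def mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_error, test_error

# Degree 1 is a straight line, degree 3 is moderately flexible, degree 9 is very flexible.
results = {degree: mse(degree) for degree in (1, 3, 9)}
for degree, (train_error, test_error) in results.items():
    print(degree, round(train_error, 4), round(test_error, 4))
```

The training error keeps falling as the degree grows, whereas the test error is what reveals whether the extra complexity actually helps; the straight line shows high error on both splits, which is the high-bias case.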

### Overfitting and Underfitting Error


The terms "Bias" and "Variance" come from statistics; in machine learning the corresponding problems are called "Underfitting" and "Overfitting".

When our model is too simple, we call it "High Bias Error" or "Underfitting" >> the model underfits the training data points and is not able to learn the data well.
Here we train our model using df_train, and the model performs well neither on df_train nor on df_test. This is called "Underfitting".

To deal with this "Underfitting" issue we tried more flexible models such as "DecisionTreeRegressor" and "KNN Regressor", alongside the regularized linear model "Lasso Regressor".
We also tried to obtain features that are more informative and more strongly correlated with the target labels, and we adjusted the complexity of the model by tuning some hyperparameters.


When our model is too complex, we call it "High Variance Error" or "Overfitting" >> the model overfits the training data points.

Here we train our model using df_train, and the model performs well only on df_train; when we evaluate it on df_test, it doesn't perform well. This is called "Overfitting".

If the overfitting problem had been there, we would have used the techniques below to solve the issue:

    1) Simplify the model:- If our model is too complex (e.g. a deep learning model), we will use a simpler model (decision tree, SVM etc.).
    
    2) If the dataset has some noise/outliers, we will remove them. Otherwise a complex model will try to learn the noise/outliers and treat them as patterns.
    
    3) If possible, we will feed more training data to our model to solve the overfitting issue.
    
    4) We will do dimensionality reduction to avoid the overfitting issue.

    5) We will use "Regularization" to avoid the overfitting issue. Putting a constraint on our machine learning model so that it is not allowed to become too complex is called "Regularization".
  
    As we know, the DecisionTree algorithm has the hyperparameter "max_depth".
If we do not set an integer value for this hyperparameter, the tree keeps growing until it predicts the label of every training point exactly. This complex model will overfit the data.
    
    To avoid the overfitting issue, we set the "max_depth" hyperparameter of the DecisionTree model to regularize the tree. It thus acts as a "Regularization Hyperparameter" that puts a constraint on the model.
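
The effect of "max_depth" can be seen on a small synthetic example. The data below is hypothetical (a noisy sine curve); the sketch only compares the training R2 of an unconstrained tree with that of a depth-limited tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical noisy data: 100 distinct x values, y = sin(x) + noise.
X = np.linspace(0, 6, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Unconstrained tree: grows until every training point is fitted exactly.
deep_tree = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)

# Regularized tree: max_depth limits how far the tree may grow.
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

deep_train_r2 = deep_tree.score(X, y)        # R2 on the training data
shallow_train_r2 = shallow_tree.score(X, y)

print(deep_tree.get_depth(), deep_train_r2)
print(shallow_tree.get_depth(), shallow_train_r2)
```

A perfect training score from the unconstrained tree means it has memorised the noise as well as the signal, which is exactly the overfitting that the depth limit is meant to prevent.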

### L1 and L2 Regularization

"L1 and L2 Regularization" >> The method is to add a second term (C|w|) to the objective function to penalize the model if it uses many or large parameters.
The goal is to stop our model from becoming too complex. We can do this by shrinking the parameters of the model.

Every ML algorithm has its own formula, and we define an objective function to find the best values of the model's parameters. We find these best values by minimizing or maximizing the objective function.

second term (C|w|) >>
    C   >> regularization coefficient
    
    |w| >> L1 norm of the parameter vector >> sum of the absolute values of all weights

C >> is a hyperparameter, and it controls the strength of the regularization of the model.

L1 Regularization: 
    Loss = (sum((Yi - Yp)**2) / N) + C|w|
                      
If we set C=0, the second term is removed and we get our original objective function (the MSE).

Our goal is to put a constraint on the model and stop it from becoming too complex. To do that, we reduce the number of parameters that the model effectively uses.

Initially we have all d+1 parameters.
To simplify the model, we train it so that it does not use all of them: some weights are set to zero or close to zero. By adding the term "C|w|", the model gets penalized if it keeps many non-zero parameters.

If C|w| is "High", the model is using many (or large) parameters.

If C|w| is "Low", the model has set some of its parameters to zero. The number of effective parameters is reduced and the model has become simpler. A larger value of C penalizes the weights more strongly and therefore pushes more of them to zero.
L1 Regularization applied to linear regression is also called "Lasso Regression".

L2 Regularization: 
    Loss = (sum((Yi - Yp)**2) / N) + C||w||**2
                      
C        >> regularization coefficient
||w||**2 >> squared L2 norm of the parameter vector >> sum of the squares of all weights
L2 Regularization applied to linear regression is also called "Ridge Regression".

We should use a regularization technique to avoid "Overfitting" issues.
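
The difference between the two penalties can be sketched with Scikit-Learn's Lasso and Ridge estimators, whose alpha parameter plays the role of the coefficient C in the text. The data is synthetic and the alpha values are arbitrary choices made for illustration; note that Scikit-Learn scales its objective functions slightly differently from the formulas above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: 5 features, but only the first one influences y.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# alpha plays the role of the coefficient C in the text.
ols = LinearRegression().fit(X, y)   # no penalty
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=50.0).fit(X, y)  # L2 penalty

print(np.round(ols.coef_, 3))    # small weights on the four noise features
print(np.round(lasso.coef_, 3))  # noise features driven exactly to zero
print(np.round(ridge.coef_, 3))  # weights shrunk towards zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_lasso)
```

The L1 penalty removes the uninformative features altogether, while the L2 penalty keeps every feature and only shrinks the weights.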

### Strengths and Weaknesses of Our Pipeline & Recommendation

Reasons for low accuracy?

    1) Features are not informative

    2) Too few feature attributes
    
    3) Too few data samples
    
    4) Low correlation between independent and dependent variables
    
    5) Data is noisy
    
    6) Data errors
    
    7) Wrong model
    
    8) Hyperparameters are not well tuned
    
I carefully analyzed all of these points and experimented with each of them, and was able to achieve the best R2_score I could for this dataset.

Strengths and Weaknesses of Our Pipeline:- 

   Strength >>
        Here we have used techniques that help our model generalize to new data points.
        We performed operations like 
        
            "Encoding categorical attributes using OneHotEncoder"
            
            "Standardization of the data so that the features are on the same scale"
            
            "Dimensionality reduction using PCA (Principal Component Analysis) so that we keep the features with the highest variance"
            
            "We trained our model with different algorithms (Linear Regression, Lasso Regression, KNN Regression, DecisionTreeRegressor)"
            
            "We performed hyperparameter tuning to optimize the model"
            
            "We evaluated our model using the test dataset for all the algorithms"
            
            "We got the R2_score for each model"
            


   Weakness >>
           Here it has been observed that much of our error is "Irreducible Error". The model and the pipeline are carefully designed, but our data has some issues. We tried to mitigate these issues and succeeded with some of them. Below are the observations that I would like to highlight here,
           
            "Features are not Informative"
            
            "Less Feature Attributes"
            
            "Less data samples"
            
            "Less correlation between Independent and Dependant Variables"
            
            "Data is Noisy"
            
            "Data errors"
            
            
Recommendation >> If our domain experts and data analyst team have a more detailed discussion with the client and obtain features/data that are more relevant to the price, we expect better accuracy from this pipeline.

### References 

Below are the resources to learn more about the dataset and tools:

Dataset: https://drive.google.com/drive/folders/1fHtEcXuY-tCnbNvMNtM8t1TcE6hT7Sg7?usp=sharing

Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html

Matplotlib user guide: https://matplotlib.org/3.3.1/users/index.html

Seaborn user guide & tutorial: https://seaborn.pydata.org/tutorial.html