# Decision Trees
<li>Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.</li>
<li>It is a non-parametric learning algorithm because it does not assume a particular data distribution or a fixed set of parameters.</li>
<li>The goal of decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.</li>
<li>A decision tree has a hierarchical, tree-like structure consisting of a root node, branches, internal nodes and leaf nodes.</li>

![](images/decision_trees.png)

## How Do Decision Trees Work?
<li>The decision tree algorithm builds the tree in a recursive way, by selecting the best attribute to split the data at each node based on some criterion.</li>
<li>Common criteria for splitting a decision node are information gain and Gini impurity.</li>
<li>Information gain measures the reduction in entropy (i.e., uncertainty) of the class labels after a split.</li>
<li>Entropy is defined as a measure of randomness or disorder of a system.</li>
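<li>Information gain is the entropy of the parent node minus the weighted entropy of its child nodes, so it grows as the children become purer.</li>
<li>When a split leaves the child nodes with high entropy, information gain is low; when the child nodes have low entropy, information gain is high.</li>
<li>Gini impurity measures the probability of misclassifying a randomly chosen sample from the node if it were labelled at random according to the node's class distribution.</li>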
<li>The process continues until all the instances in a node belong to the same class or until a stopping criterion is met.</li>
<li>Typical stopping criteria are a maximum tree depth or a minimum number of instances per leaf.</li>
<li>The resulting tree can be used to classify new instances by traversing from the root to a leaf node, following the path that satisfies the tests at each node.</li>
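
The root-to-leaf traversal described above can be sketched in a few lines. This is only an illustration: the nested-dictionary `tree`, the `classify` helper and the feature values are hypothetical and hand-written, not learned from any dataset.

```python
# Hypothetical hand-written tree: internal nodes test one feature,
# leaves hold a class label.
tree = {
    "feature": "Outlook",
    "branches": {
        "Overcast": {"label": "Yes"},
        "Sunny": {
            "feature": "Humidity",
            "branches": {"High": {"label": "No"}, "Normal": {"label": "Yes"}},
        },
        "Rain": {
            "feature": "Wind",
            "branches": {"Weak": {"label": "Yes"}, "Strong": {"label": "No"}},
        },
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, following the branch that matches the instance."""
    while "label" not in node:
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node["label"]


sample = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}
print(classify(tree, sample))
```

Each loop iteration applies one node's test, so the number of tests for a prediction is at most the depth of the tree.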

![](images/working_of_dtrees.png)

## Decision Tree Inducers (Types Of Decision Tree Algorithm)
<li>A decision tree inducer is an algorithm that is used to build a decision tree from a given dataset. Here are some commonly used decision tree inducers:</li>
<ol>
    <li><b>ID3</b></li>
    <li><b>C4.5</b></li>
    <li><b>CART</b></li>
</ol>

<b>1. ID3:</b>
<li>ID3 stands for Iterative Dichotomiser 3.</li>
<li>This is one of the earliest decision tree algorithms developed by Ross Quinlan.</li>
<li>It uses the concept of entropy and information gain to select the best attribute for splitting the data at each node.</li>
<li>It cannot handle numeric features and can be used only for classification tasks.</li>

<b>2. C4.5:</b>
<li>The name C4.5 is sometimes expanded informally as "Classifier Version 4.5".</li>
<li>It is a decision tree algorithm that was developed by Ross Quinlan, and it is an extension of the earlier ID3 algorithm.</li>
<li>The C4.5 algorithm can handle both discrete and continuous data.</li>
<li>It uses <b>information gain ratio</b> as the splitting criterion.</li>
<li>It also includes a post-pruning step to reduce overfitting.</li>
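
To make the gain ratio criterion concrete, here is a minimal sketch that assumes the usual definition: information gain divided by the split information, which is the entropy of the attribute's own value distribution. The helper names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    weighted_child_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - weighted_child_entropy
    split_info = entropy(values)  # entropy of the attribute's value distribution
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: both attributes separate the classes perfectly,
# but the ID-like attribute does so by creating one branch per row.
labels = ["Yes", "Yes", "No", "No"]
useful = ["a", "a", "b", "b"]
id_like = ["r1", "r2", "r3", "r4"]
print(gain_ratio(useful, labels), gain_ratio(id_like, labels))
```

Both attributes have the same information gain here, but the ID-like attribute is penalised by its larger split information, which is the motivation for normalising the gain.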

<b>3. CART:</b>
<li>The full form of CART is Classification And Regression Trees.</li>
<li>This is a decision tree algorithm developed by Breiman, Friedman, Olshen, and Stone.</li>
<li>It can be used for both classification and regression tasks.</li>
<li>It uses the Gini impurity measure to select the best attribute for splitting the data.</li>

![](images/decision_tree_inducers.png)

## Entropy

<li>We use the concepts of entropy and information gain when splitting a node in the ID3 algorithm.</li>
<li>Entropy is defined as a measure of randomness or disorder in the system.</li>
<li>The formula to calculate entropy is given by:</li>

![](images/Entropy_formula.png)

<li>Here, c is the number of classes. For a binary classification problem, the entropy formula becomes:</li>

![](images/expanded_eqn_entropy.png)

<li>Here, p is the probability that a sample belongs to the positive class and q is the probability that it belongs to the negative class.</li>
<li>Let's say you are predicting whether an employee will get a promotion or not.</li>
<li>If only 30% of the employees in your dataset have received a promotion, then p=0.3 for the positive class and q=1-p=0.7 for the negative class.</li>
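
The promotion example can be worked through numerically. The sketch below assumes the base-2 logarithm that is conventional for entropy; `binary_entropy` is a hypothetical helper name.

```python
from math import log2


def binary_entropy(p):
    """Entropy (in bits) of a two-class node with positive-class probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    q = 1.0 - p
    return -(p * log2(p) + q * log2(q))


print(round(binary_entropy(0.3), 4))  # promotion example: p = 0.3, q = 0.7
print(binary_entropy(0.5))            # evenly split node
print(binary_entropy(1.0))            # pure node
```

For p = 0.3 the entropy is roughly 0.88 bits, close to the maximum of 1 bit reached at p = 0.5, so a 30/70 node is still quite impure.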

## Information Gain & Splitting Of Node In ID3 Algorithm
<li>One of the key steps in the ID3 algorithm is to split a node into child nodes based on the attribute that maximizes the information gain.</li>

<li>Information gain is a measure of the reduction in entropy (impurity) of the dataset after splitting the data based on an attribute.</li>
<li>Entropy is a measure of the randomness or uncertainty in the dataset.</li>

**The formula for information gain is:**
<code>
Information Gain = Entropy(parent) - ∑ [Weight(child) * Entropy(child)]
</code>
**where**

<li>Entropy(parent) is the entropy of the parent node</li>
<li>Entropy(child) is the entropy of each child node</li>
<li>Weight(child) is the proportion of the parent's data that falls into that child node, and the sum runs over all child nodes.</li>


![](images/information_gain_id3.png)
<li>First, we calculate the entropy of the parent node.</li>
<li>After calculating entropy, we calculate the information gain for each of the attributes.</li>
<li>The attribute that results in the highest information gain is selected as the splitting attribute for the node.</li> 
<li>The node is then split into child nodes based on the values of the selected attribute.</li>
<li>This process is repeated recursively until all leaf nodes are pure (contain only one class) or until some stopping criterion is met.</li>
<li>In this way, the ID3 algorithm uses information gain to select the attribute on which to split a node and to construct a decision tree from the dataset.</li>
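
The attribute-selection step can be sketched in plain Python. The column values below are copied from the `gameplay_df` used in this notebook; the `entropy` and `information_gain` helpers are hypothetical names, and this is an illustration of the root split only, not a full ID3 implementation.

```python
from collections import Counter
from math import log2

# Columns copied from the gameplay_df used in this notebook.
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast", "Sunny", "Sunny",
           "Rain", "Sunny", "Overcast", "Overcast", "Rain", "Sunny", "Overcast", "Rain"]
temperature = ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild", "Cool",
               "Mild", "Mild", "Mild", "Hot", "Mild", "Hot", "Mild", "Cool"]
humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal", "High", "Normal",
            "Normal", "Normal", "High", "Normal", "High", "Normal", "High", "Normal"]
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Weak",
        "Weak", "Strong", "Strong", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes",
        "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


attributes = {"Outlook": outlook, "Temperature": temperature,
              "Humidity": humidity, "Wind": wind}
gains = {name: information_gain(values, play) for name, values in attributes.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
print("Best attribute for the root split:", best)
```

On this data `Outlook` gives the largest gain, mainly because every `Overcast` row has the same label, so that child node is already pure.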



In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [11]:
gameplay_df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain",
               "Rain", "Overcast", "Sunny", "Sunny", "Rain",
               "Sunny", "Overcast", "Overcast", "Rain", 
               "Sunny", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool",
                   "Cool", "Cool", "Mild", "Cool", "Mild",
                   "Mild", "Mild", "Hot", "Mild",
                   "Hot", "Mild", "Cool"],
    "Humidity": ["High", "High", "High", "High", "Normal",
                "Normal", "Normal", "High", "Normal", "Normal",
                "Normal", "High", "Normal", "High",
                "Normal", "High", "Normal"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak",
            "Strong", "Strong", "Weak", "Weak", "Weak",
            "Strong", "Strong", "Weak", "Strong", 
            "Strong", "Weak", "Strong"],
    "Play" : ["No", "No", "Yes", "Yes", "Yes",
             "No", "Yes", "No", "Yes", "Yes",
             "Yes", "Yes", "Yes", "No",
             "Yes", "Yes", "No"]
})

In [12]:
gameplay_df.head()

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes


In [13]:
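# One-hot encode Outlook; dtype=int keeps 0/1 columns on newer pandas, which defaults to booleans.
outlook_df = pd.get_dummies(gameplay_df["Outlook"], dtype=int)
# Drop "Sunny" as the baseline category: it is implied when both remaining columns are 0.
outlook_df.drop('Sunny', axis=1, inplace=True)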
outlook_df.head()

Unnamed: 0,Overcast,Rain
0,0,0
1,0,0
2,1,0
3,0,1
4,0,1


In [14]:
gameplay_df = pd.concat([gameplay_df, outlook_df], axis=1)
gameplay_df.drop('Outlook', axis=1, inplace=True)
gameplay_df.head()

Unnamed: 0,Temperature,Humidity,Wind,Play,Overcast,Rain
0,Hot,High,Weak,No,0,0
1,Hot,High,Strong,No,0,0
2,Hot,High,Weak,Yes,1,0
3,Mild,High,Weak,Yes,0,1
4,Cool,Normal,Weak,Yes,0,1


In [15]:
gameplay_df.replace({"Temperature": {"Hot":2 , "Mild": 1, "Cool": 0},
                    "Humidity": {"Normal": 0, "High": 1},
                    "Wind": {"Weak": 0, "Strong": 1}}, inplace=True)
gameplay_df.head()

Unnamed: 0,Temperature,Humidity,Wind,Play,Overcast,Rain
0,2,1,0,No,0,0
1,2,1,1,No,0,0
2,2,1,0,Yes,1,0
3,1,1,0,Yes,0,1
4,0,0,0,Yes,0,1


## Separating Data & Labels

In [16]:
data = gameplay_df.drop('Play', axis=1)
labels = gameplay_df['Play']

## Train Test Split

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(data, labels, 
                                                    test_size=0.2,
                                                    random_state=42, 
                                                    stratify = labels)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(13, 5)
(13,)
(4, 5)
(4,)


In [19]:
from sklearn.tree import DecisionTreeClassifier

In [20]:
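# The default criterion is "gini"; pass criterion="entropy" to split on information gain as described above.
dt = DecisionTreeClassifier()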

In [21]:
dt.fit(X_train, y_train)

In [22]:
predictions = dt.predict(X_test)

In [23]:
predictions

array(['Yes', 'No', 'Yes', 'Yes'], dtype=object)

In [24]:
test_vs_pred_df = pd.DataFrame({"actual": y_test,
                               "pred": predictions})

test_vs_pred_df

Unnamed: 0,actual,pred
4,Yes,Yes
10,Yes,No
8,Yes,Yes
7,No,Yes
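
The table above can be summarised as a single accuracy figure. This small sketch copies the four actual and predicted values from the table; nothing else is assumed.

```python
# Values copied from the actual-vs-predicted table above.
actual = ["Yes", "Yes", "Yes", "No"]
pred = ["Yes", "No", "Yes", "Yes"]

correct = sum(a == p for a, p in zip(actual, pred))
accuracy = correct / len(actual)
print(f"{correct} of {len(actual)} correct, accuracy = {accuracy:.2f}")
```

With only four test rows, each prediction shifts accuracy by 25 percentage points, so this number is a very rough estimate. Because the classifier is created without a `random_state`, rerunning the notebook may also produce different predictions.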


### Gini Index 
<li>Gini index is a measure of impurity or diversity used to select the best split in decision trees.</li>
<li>In the context of decision trees, it is used to measure the quality of a split when determining the feature that should be used to create child nodes.</li>
<li>The main goal of measuring impurity is to create child nodes that are as pure as possible in terms of the target variable.</li>
<li>The Gini index measures the probability of misclassifying a randomly chosen element if it were labelled at random according to the class distribution in the node.</li>
<li>It ranges from 0 to 1 - 1/c for c classes (0.5 for binary classification), where 0 indicates a pure node and larger values indicate greater impurity.</li>
<li>A pure node is a node where all elements belong to the same class.</li>
<li>The most impure node is one where elements are equally distributed across all classes.</li>

**The formula for calculating the Gini index for a leaf node is:**
<code>
Gini Index(Leaf) = 1 - ∑(p_i^2)
</code>

**where p_i is the proportion of samples that belong to class i in the node.**

<li>After calculating the Gini index of each child node, the weighted Gini index of the split is calculated with the following formula:</li>

![](images/weighted_gini_index.png)

<li>When selecting a split in a decision tree, the feature that results in the lowest weighted Gini index (highest purity) is chosen.</li>
<li>The resulting split divides the dataset into two or more child nodes, which are then processed recursively to create the decision tree.</li>
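
As a rough illustration of the two formulas, the sketch below computes the Gini index of a node and the weighted Gini index of a split. It assumes the weighted index is the size-weighted average of the child impurities; the helper names and the toy labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted average of the Gini index of each child node."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)


# Hypothetical split of 10 samples into two child nodes.
left = [1, 1, 1, 1, 0]
right = [0, 0, 0, 0, 1]

print(round(gini(left + right), 2))            # parent node, evenly mixed
print(round(weighted_gini([left, right]), 2))  # after the split
print(gini([1, 1, 1]))                         # pure node
```

The parent is evenly mixed, so its Gini index is at the binary maximum of 0.5; the split lowers the weighted index to 0.32, which is why a split like this would be preferred over one that leaves the children as mixed as the parent.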



In [49]:
import os

In [50]:
path_to_admission_data = os.path.join(os.path.dirname(os.getcwd()), 
                                      'csv_data', 'Admission_data.csv')
print(path_to_admission_data)

C:\Users\srval\OneDrive\Desktop\Python_For_Data_Science\August_Lumbini_Course\csv_data\Admission_data.csv


In [51]:
admission_df = pd.read_csv(path_to_admission_data)
print(admission_df.shape)
admission_df.head()

(500, 9)


Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [52]:
admission_df.drop('Serial No.', axis=1, inplace = True)
admission_df.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337,118,4,4.5,4.5,9.65,1,0.92
1,324,107,4,4.0,4.5,8.87,1,0.76
2,316,104,3,3.0,3.5,8.0,1,0.72
3,322,110,3,3.5,2.5,8.67,1,0.8
4,314,103,2,2.0,3.0,8.21,0,0.65


In [53]:
admission_df.describe()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,316.472,107.192,3.114,3.374,3.484,8.57644,0.56,0.72174
std,11.295148,6.081868,1.143512,0.991004,0.92545,0.604813,0.496884,0.14114
min,290.0,92.0,1.0,1.0,1.0,6.8,0.0,0.34
25%,308.0,103.0,2.0,2.5,3.0,8.1275,0.0,0.63
50%,317.0,107.0,3.0,3.5,3.5,8.56,1.0,0.72
75%,325.0,112.0,4.0,4.0,4.0,9.04,1.0,0.82
max,340.0,120.0,5.0,5.0,5.0,9.92,1.0,0.97


In [54]:
admission_df['admission_chance'] = 0
admission_df.loc[admission_df['Chance of Admit '] > 0.5, 'admission_chance'] = 1

In [55]:
admission_df['admission_chance'].value_counts()

1    461
0     39
Name: admission_chance, dtype: int64

In [56]:
admission_df.drop('Chance of Admit ', axis=1, inplace=True)
admission_df.shape

(500, 8)

## Data Preparation

In [57]:
data = admission_df.drop('admission_chance', axis=1)
labels = admission_df["admission_chance"]

## Train Test Split

In [58]:
X_train, X_test, y_train, y_test = train_test_split(data, labels, 
                                                    test_size=0.2,
                                                   random_state = 42,
                                                   stratify=labels)


print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(400, 7)
(400,)
(100, 7)
(100,)


In [59]:
dt = DecisionTreeClassifier()

In [60]:
dt.fit(X_train, y_train)

In [61]:
predictions = dt.predict(X_test)

In [62]:
test_vs_pred_df = pd.DataFrame({"actual": y_test,
                               "pred": predictions})
test_vs_pred_df

Unnamed: 0,actual,pred
128,1,1
10,1,1
50,1,1
347,0,0
36,1,1
...,...,...
237,1,1
383,1,1
357,1,1
240,1,1


In [63]:
from sklearn.metrics import classification_report, confusion_matrix

In [64]:
print(confusion_matrix(y_test, predictions))

[[ 6  2]
 [ 5 87]]


In [65]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.55      0.75      0.63         8
           1       0.98      0.95      0.96        92

    accuracy                           0.93       100
   macro avg       0.76      0.85      0.80       100
weighted avg       0.94      0.93      0.93       100
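
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand, using the scikit-learn convention that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
cm = [[6, 2],
      [5, 87]]

tn, fp = cm[0]  # actual class 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual class 1: predicted 0, predicted 1

precision_0 = tn / (tn + fn)  # of everything predicted 0, the share that is really 0
recall_0 = tn / (tn + fp)     # of everything actually 0, the share that was found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print("class 0:", round(precision_0, 2), round(recall_0, 2))
print("class 1:", round(precision_1, 2), round(recall_1, 2))
print("accuracy:", round(accuracy, 2))
```

Because only 8 of the 100 test rows belong to class 0, the overall accuracy of 0.93 is dominated by class 1; the per-class precision and recall give a clearer picture of how the minority class is handled.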



### Decision Tree For Regression
<li>For classification, a decision tree splits a node by maximizing information gain in the case of ID3 or minimizing the Gini index in the case of CART.</li>
<li>But for regression, the goal is to reduce the variance of the target variable (i.e., the dependent variable).</li>
<li>Decision tree regressors work on the principle of variance reduction because the target variable is continuous.</li>
<li>This is typically done by minimizing the sum of squared differences between the target variable and the mean value of the samples in each resulting group.</li>
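
Variance reduction for a single candidate split can be computed as follows. The target values are hypothetical toy numbers, and the population variance from the standard library is used as the impurity measure.

```python
from statistics import pvariance

# Hypothetical toy targets (e.g. selling prices) before and after a candidate split.
parent = [900, 850, 800, 200, 150, 100]
left, right = [900, 850, 800], [200, 150, 100]

n = len(parent)
# Each child's variance is weighted by the share of samples it receives.
weighted_child_variance = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
variance_reduction = pvariance(parent) - weighted_child_variance

print(round(pvariance(parent), 2))
print(round(weighted_child_variance, 2))
print(round(variance_reduction, 2))
```

Since the variance of a group is its sum of squared differences from the mean divided by its size, maximizing this reduction is equivalent to minimizing the total squared error of the two groups.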

### How Splitting Of Node is Done in Decision Tree Regressor

<li>The decision tree regressor considers all possible splits for each predictor variable and selects the one that maximizes the variance reduction.</li>
<li>The process is repeated recursively for each resulting group until a stopping criterion is met.</li>
<li>Common stopping criteria include a minimum number of samples required to split a node and a maximum tree depth.</li>
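
The split search for one numeric predictor can be sketched as below. This is a simplified illustration rather than scikit-learn's implementation: `best_split` and its `min_samples_leaf` parameter (assumed to be at least 1) are hypothetical, candidate thresholds are the midpoints between consecutive sorted values, and the toy data is invented.

```python
def sse(values):
    """Sum of squared differences between each value and the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, min_samples_leaf=1):
    """Return (threshold, total SSE) of the best split, or (None, parent SSE) if none is allowed."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_sse = None, sse(ys)
    # i is the size of the left group; both groups must keep at least min_samples_leaf rows.
    for i in range(min_samples_leaf, len(pairs) - min_samples_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Hypothetical toy data: car age in years vs. selling price.
car_age = [1, 2, 3, 10, 11, 12]
price = [900, 850, 800, 200, 150, 100]

print(best_split(car_age, price))                      # split between ages 3 and 10
print(best_split(car_age, price, min_samples_leaf=4))  # no valid split: node stays a leaf
```

A full regressor would repeat this search over every predictor, pick the best one, and recurse on the two resulting groups until a stopping criterion such as the leaf-size limit shown here prevents further splits.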

In [66]:
path_to_car_df = os.path.join(os.path.dirname(os.getcwd()), 
                             'csv_data', 'car_details.csv')
print(path_to_car_df)

C:\Users\srval\OneDrive\Desktop\Python_For_Data_Science\August_Lumbini_Course\csv_data\car_details.csv


In [67]:
import pandas as pd

In [68]:
car_df = pd.read_csv(path_to_car_df)
print(car_df.shape)
car_df.head()

(4340, 8)


Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner


In [69]:
car_df['brand'] = car_df['name'].str.split(' ').str[0]
car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner,Maruti
2,Hyundai Verna 1.6 SX,2012,600000,100000,Diesel,Individual,Manual,First Owner,Hyundai
3,Datsun RediGO T Option,2017,250000,46000,Petrol,Individual,Manual,First Owner,Datsun
4,Honda Amaze VX i-DTEC,2014,450000,141000,Diesel,Individual,Manual,Second Owner,Honda


In [70]:
car_df['year'].describe()

count    4340.000000
mean     2013.090783
std         4.215344
min      1992.000000
25%      2011.000000
50%      2014.000000
75%      2016.000000
max      2020.000000
Name: year, dtype: float64

In [72]:
car_df['car_age'] = car_df['year'].max() - car_df['year']

In [81]:
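# Mean selling price per (brand, car_age). Caution: this is derived from the target over all rows,
# before the train/test split, so it leaks target information and test scores may be optimistic.
brand_age_sp_mean = car_df.groupby(['brand', 'car_age'], as_index=False)['selling_price'].mean()
brand_age_sp_mean.columns = ["brand", "car_age", "avg_sp_per_year_brand"]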
brand_age_sp_mean.head()

Unnamed: 0,brand,car_age,avg_sp_per_year_brand
0,Ambassador,8,430000.0
1,Ambassador,15,120000.0
2,Ambassador,18,50000.0
3,Audi,0,4700000.0
4,Audi,1,3256000.0


In [83]:
merge_car_df = pd.merge(car_df, brand_age_sp_mean, 
                        on = ["brand", "car_age"],
                       how = "inner")

merge_car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand,car_age,avg_sp_per_year_brand
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652
1,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652
2,Maruti Alto LX BSIII,2007,140000,125000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652
3,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652
4,Maruti Wagon R LXI Minor,2007,135000,50000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652


In [88]:
car_frequency_df = car_df['name'].value_counts().to_frame().reset_index()
car_frequency_df.columns = ['name', 'car_count_frequency']
car_frequency_df.head()

Unnamed: 0,name,car_count_frequency
0,Maruti Swift Dzire VDI,69
1,Maruti Alto 800 LXI,59
2,Maruti Alto LXi,47
3,Maruti Alto LX,35
4,Hyundai EON Era Plus,35


In [90]:
final_car_df = pd.merge(merge_car_df, car_frequency_df, how = 'inner',
                       on = "name")
final_car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand,car_age,avg_sp_per_year_brand,car_count_frequency
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23
1,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23
2,Maruti 800 AC,2007,95000,100000,Petrol,Individual,Manual,Second Owner,Maruti,13,119608.695652,23
3,Maruti 800 AC,2007,105000,60000,Petrol,Individual,Manual,Second Owner,Maruti,13,119608.695652,23
4,Maruti 800 AC,2007,80000,120000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23


In [92]:
final_car_df['fuel'].value_counts()

Diesel      2153
Petrol      2123
CNG           40
LPG           23
Electric       1
Name: fuel, dtype: int64

In [95]:
final_car_df = final_car_df.loc[final_car_df['fuel']!="Electric"]
final_car_df.shape

(4338, 12)

In [96]:
final_car_df['fuel'].value_counts()

Diesel    2152
Petrol    2123
CNG         40
LPG         23
Name: fuel, dtype: int64

In [98]:
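# One-hot encode fuel with 0/1 integer columns; "LPG" is dropped as the baseline category.
fuel_df = pd.get_dummies(final_car_df['fuel'], dtype=int)
fuel_df.drop('LPG', axis=1, inplace=True)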
fuel_df.head()

Unnamed: 0,CNG,Diesel,Petrol
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1


In [100]:
final_car_df = pd.concat([final_car_df, fuel_df], axis=1)
final_car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand,car_age,avg_sp_per_year_brand,car_count_frequency,CNG,Diesel,Petrol
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23,0,0,1
1,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23,0,0,1
2,Maruti 800 AC,2007,95000,100000,Petrol,Individual,Manual,Second Owner,Maruti,13,119608.695652,23,0,0,1
3,Maruti 800 AC,2007,105000,60000,Petrol,Individual,Manual,Second Owner,Maruti,13,119608.695652,23,0,0,1
4,Maruti 800 AC,2007,80000,120000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23,0,0,1


In [101]:
final_car_df['seller_type'].value_counts()

Individual          3243
Dealer               993
Trustmark Dealer     102
Name: seller_type, dtype: int64

In [103]:
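# One-hot encode seller_type with 0/1 integer columns; "Trustmark Dealer" is the baseline category.
dealer_df = pd.get_dummies(final_car_df["seller_type"], dtype=int)
dealer_df.drop('Trustmark Dealer', axis=1, inplace=True)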
dealer_df.head()

Unnamed: 0,Dealer,Individual
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1


In [104]:
final_car_df = pd.concat([final_car_df, dealer_df], axis=1)
final_car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand,car_age,avg_sp_per_year_brand,car_count_frequency,CNG,Diesel,Petrol,Dealer,Individual
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23,0,0,1,0,1
1,Maruti 800 AC,2007,60000,70000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23,0,0,1,0,1
2,Maruti 800 AC,2007,95000,100000,Petrol,Individual,Manual,Second Owner,Maruti,13,119608.695652,23,0,0,1,0,1
3,Maruti 800 AC,2007,105000,60000,Petrol,Individual,Manual,Second Owner,Maruti,13,119608.695652,23,0,0,1,0,1
4,Maruti 800 AC,2007,80000,120000,Petrol,Individual,Manual,First Owner,Maruti,13,119608.695652,23,0,0,1,0,1


In [106]:
final_car_df['transmission'].value_counts()

Manual       3891
Automatic     447
Name: transmission, dtype: int64

In [108]:
final_car_df.replace({"transmission": {"Manual": 0, "Automatic": 1}}, inplace=True)
final_car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand,car_age,avg_sp_per_year_brand,car_count_frequency,CNG,Diesel,Petrol,Dealer,Individual
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,0,First Owner,Maruti,13,119608.695652,23,0,0,1,0,1
1,Maruti 800 AC,2007,60000,70000,Petrol,Individual,0,First Owner,Maruti,13,119608.695652,23,0,0,1,0,1
2,Maruti 800 AC,2007,95000,100000,Petrol,Individual,0,Second Owner,Maruti,13,119608.695652,23,0,0,1,0,1
3,Maruti 800 AC,2007,105000,60000,Petrol,Individual,0,Second Owner,Maruti,13,119608.695652,23,0,0,1,0,1
4,Maruti 800 AC,2007,80000,120000,Petrol,Individual,0,First Owner,Maruti,13,119608.695652,23,0,0,1,0,1


In [110]:
final_car_df['owner'].value_counts()

First Owner             2832
Second Owner            1104
Third Owner              304
Fourth & Above Owner      81
Test Drive Car            17
Name: owner, dtype: int64

In [114]:
final_car_df['owner'] = final_car_df['owner'].str.strip()

In [119]:
final_car_df = final_car_df.loc[final_car_df['owner'] != "Test Drive Car"]
final_car_df.shape

(4321, 17)

In [120]:
final_car_df['owner'].value_counts()

First Owner             2832
Second Owner            1104
Third Owner              304
Fourth & Above Owner      81
Name: owner, dtype: int64

In [124]:
import warnings
warnings.filterwarnings('ignore')

In [125]:
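# Ordinal-encode ownership history; assigning the result back avoids relying on a chained inplace update.
final_car_df['owner'] = final_car_df['owner'].replace({"First Owner": 0, "Second Owner": 1,
                                                       "Third Owner": 2, "Fourth & Above Owner": 3})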

final_car_df.head()

Unnamed: 0,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,brand,car_age,avg_sp_per_year_brand,car_count_frequency,CNG,Diesel,Petrol,Dealer,Individual
0,Maruti 800 AC,2007,60000,70000,Petrol,Individual,0,0,Maruti,13,119608.695652,23,0,0,1,0,1
1,Maruti 800 AC,2007,60000,70000,Petrol,Individual,0,0,Maruti,13,119608.695652,23,0,0,1,0,1
2,Maruti 800 AC,2007,95000,100000,Petrol,Individual,0,1,Maruti,13,119608.695652,23,0,0,1,0,1
3,Maruti 800 AC,2007,105000,60000,Petrol,Individual,0,1,Maruti,13,119608.695652,23,0,0,1,0,1
4,Maruti 800 AC,2007,80000,120000,Petrol,Individual,0,0,Maruti,13,119608.695652,23,0,0,1,0,1


In [126]:
final_car_df.drop(["year", "name", "fuel", "brand", "seller_type"], axis = 1, inplace=True)
final_car_df.shape

(4321, 12)

In [127]:
final_car_df.head()

Unnamed: 0,selling_price,km_driven,transmission,owner,car_age,avg_sp_per_year_brand,car_count_frequency,CNG,Diesel,Petrol,Dealer,Individual
0,60000,70000,0,0,13,119608.695652,23,0,0,1,0,1
1,60000,70000,0,0,13,119608.695652,23,0,0,1,0,1
2,95000,100000,0,1,13,119608.695652,23,0,0,1,0,1
3,105000,60000,0,1,13,119608.695652,23,0,0,1,0,1
4,80000,120000,0,0,13,119608.695652,23,0,0,1,0,1


## Separating into data and labels

In [141]:
data = final_car_df.drop('selling_price', axis=1)
labels = final_car_df["selling_price"]

In [142]:
data.shape

(4321, 11)

In [143]:
labels.shape

(4321,)

## Train Test Split

In [144]:
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, 
                                                   random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3456, 11)
(865, 11)
(3456,)
(865,)


## Create a Decision Tree Object

In [146]:
from sklearn.tree import DecisionTreeRegressor

In [147]:
dt = DecisionTreeRegressor()

In [148]:
dt.fit(X_train, y_train)

In [149]:
predictions = dt.predict(X_test)

In [150]:
test_vs_pred_df = pd.DataFrame({"actuals": y_test,
                               "pred": predictions})
test_vs_pred_df

Unnamed: 0,actuals,pred
1073,110000,225000.0
856,330000,330000.0
1222,475000,390000.0
3410,350000,400000.0
2252,75000,120000.0
...,...,...
2480,99000,300000.0
3915,325000,325000.0
63,180000,250000.0
3702,890000,600000.0


In [151]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [152]:
print("Mean absolute error is ", mean_absolute_error(y_test, predictions))
print("Mean squared error is ", mean_squared_error(y_test, predictions))
# np.sqrt is used because the squared=False argument is deprecated in newer scikit-learn releases.
print("Root mean squared error is ", np.sqrt(mean_squared_error(y_test, predictions)))
print("R2 score is ", r2_score(y_test, predictions))

Mean absolute error is  113907.67090558766
Mean squared error is  47715377092.71073
Root mean squared error is  218438.49727717577
R2 score is  0.8364803515928818
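
One caveat about these scores: the `avg_sp_per_year_brand` feature was computed from `selling_price` over all rows before the train/test split, so each test row contributed to its own feature value. A possible variant, which is not what this notebook does, is to learn the group means from the training rows only and then map them onto the test rows. The sketch below uses a hypothetical toy frame and a hypothetical column name `avg_sp_brand` to show the idea.

```python
import pandas as pd

# Hypothetical toy rows standing in for the car data.
toy = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B", "B"],
    "selling_price": [100, 120, 140, 300, 320, 340],
})
train = toy.iloc[[0, 1, 3, 4]].copy()
test = toy.iloc[[2, 5]].copy()

# Learn the per-brand mean price from the training rows only.
brand_mean = train.groupby("brand")["selling_price"].mean()

# Fallback for brands that never appear in the training rows.
global_mean = train["selling_price"].mean()

train["avg_sp_brand"] = train["brand"].map(brand_mean)
test["avg_sp_brand"] = test["brand"].map(brand_mean).fillna(global_mean)

print(test)
```

With this arrangement the test prices (140 and 340) play no part in the feature values the model sees at prediction time, so the evaluation metrics are less likely to be inflated.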
