<a href="https://colab.research.google.com/github/pratyush-3000/me/blob/master/Pratyush_Lahane_ML_Project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The objective of this assignment is to become familiar with decision trees (DT) and to gain practical experience training both regression and classification DT models.

# Problem 01

In this problem you will work with the "Life Expectancy Data.csv" dataset. <br>
Import the dataset and:
1. Drop all rows with missing values at the beginning of the project.
2. Drop the "Country" column from the dataset.
3. Split the dataset into train and test sets (use 10% of the data as the test set).
4. Prepare the datasets using a pipeline.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

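df = pd.read_csv("Life Expectancy Data.csv")

# 1. Drop every row that contains a missing value.
x_cleaned = df.dropna()

# 2. Drop the "Country" column.
x_cleaned = x_cleaned.drop(columns=["Country"])

# 3. Hold out 10% of the rows as the test set.
train_data, test_data = train_test_split(x_cleaned, test_size=0.1, random_state=42)

# Separate the features from the target column.
x_train = train_data.drop(columns=["Life expectancy"])
y_train = train_data["Life expectancy"]

x_test = test_data.drop(columns=["Life expectancy"])
y_test = test_data["Life expectancy"]

# One-hot encode the categorical column.
cat_columns = ["Status"]
x_train_encoded = pd.get_dummies(x_train, columns=cat_columns)
x_test_encoded = pd.get_dummies(x_test, columns=cat_columns)

# Align the test columns with the training columns in case a category
# is absent from the test split.
x_test_encoded = x_test_encoded.reindex(columns=x_train_encoded.columns, fill_value=0)

# 4. Impute and standardize. The imputer is a no-op here because rows with
# missing values were already dropped.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Fit on the training data only, then apply the same transform to the test data.
x_train_transformed = pipeline.fit_transform(x_train_encoded)
x_test_transformed = pipeline.transform(x_test_encoded)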

print(x_train_transformed)
print(x_test_transformed)



[[-0.94178494  2.06422426  0.27248116 ... -1.16393495 -0.47850059
   0.47850059]
 [ 0.21427632  0.15837984  0.80483722 ... -0.08837411 -0.47850059
   0.47850059]
 [-1.17299719  1.27268404 -0.23148258 ... -1.16393495 -0.47850059
   0.47850059]
 ...
 [-0.71057268 -1.1096215  -0.25277683 ...  1.32522014  2.08986157
  -2.08986157]
 [-0.47936043  0.81159264 -0.23858066 ... -0.0576438  -0.47850059
   0.47850059]
 [ 0.21427632 -0.02605672 -0.24567874 ...  1.01791704  2.08986157
  -2.08986157]]
[[-1.63542169  2.80197049 -0.04693248 ... -1.6248896  -0.47850059
   0.47850059]
 [ 0.44548857  0.32744668  0.11632338 ... -1.07174402 -0.47850059
   0.47850059]
 [ 0.21427632 -0.07216586 -0.2243845  ... -0.27275597 -0.47850059
   0.47850059]
 ...
 [ 1.37033758 -0.15669928 -0.23148258 ...  0.09600775 -0.47850059
   0.47850059]
 [-0.24814818  2.04116969  0.30087348 ... -1.25612588 -0.47850059
   0.47850059]
 [ 0.21427632 -0.54094211 -0.25277683 ...  0.34185023 -0.47850059
   0.47850059]]
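
The cell above encodes `Status` with `pd.get_dummies` before the pipeline runs. As a minimal sketch of an alternative, the encoding can live inside the preprocessing object itself via `ColumnTransformer` and `OneHotEncoder`, so the test set always gets the same columns as the training set. The toy rows and the names `toy_train`, `toy_test`, `num_pipeline`, and `preprocess` are assumptions made for illustration, not part of the assignment data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only; not taken from the real dataset.
toy_train = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing", "Developed"],
    "Alcohol": [1.2, 9.8, None, 7.5],
    "BMI": [18.0, 55.0, 22.0, 60.0],
})
# The test rows contain only one of the two Status categories.
toy_test = pd.DataFrame({
    "Status": ["Developing", "Developing"],
    "Alcohol": [2.0, 3.1],
    "BMI": [20.0, 25.0],
})

num_columns = ["Alcohol", "BMI"]
cat_columns = ["Status"]

# Numeric columns: median imputation followed by standardization.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: one-hot encoding learned from the training data.
preprocess = ColumnTransformer([
    ("num", num_pipeline, num_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_columns),
])

toy_train_prepared = preprocess.fit_transform(toy_train)
toy_test_prepared = preprocess.transform(toy_test)

# Both splits end up with the same number of columns.
print(toy_train_prepared.shape, toy_test_prepared.shape)
```

Because the encoder is fitted on the training rows, the test output keeps a column for `Developed` even though no test row has that value.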


**Model training and testing**

Train three decision tree models, following these instructions:
1. Train a decision tree model on the prepared train dataset with max_depth=2.
> *   The purpose of training this model is to get familiar with decision tree model visualization; so, **DO NOT** test this model at all.
> *   Use "graphviz" and "source" to visualize this model (plot the tree).

2. Train a decision tree model on the prepared train dataset with max_depth=20.
> *   Test the model on the train dataset and calculate RMSE.
> *   Test the model on the test dataset and calculate RMSE.


3. As the previous model shows, training with max_depth=20 leads to overfitting. To address this, train a new decision tree model and adjust max_depth manually until the overfitting is resolved.
> *   Test the model on the train dataset and calculate RMSE.
> *   Test the model on the test dataset and calculate RMSE.
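
For reference, RMSE is the square root of the mean squared difference between the targets and the predictions. A minimal sketch in plain Python follows; the helper `rmse` is hypothetical and the numbers are made up, not values from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions, for illustration only.
# Squared errors are 4, 0 and 16, so the mean squared error is 20 / 3.
example = rmse([70.0, 65.0, 80.0], [72.0, 65.0, 76.0])
print(round(example, 4))
```

A training RMSE far below the test RMSE is the usual sign of overfitting that the third model is meant to address.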

In [None]:
import graphviz
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Model 1: shallow tree, trained only so that it can be visualized.
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(x_train_transformed, y_train)

dot_data = export_graphviz(
    tree_reg,
    out_file=None,
    feature_names=list(x_train_encoded.columns))
graph = graphviz.Source(dot_data)
graph.render("Decision Tree", format="png", cleanup=True)

# Model 2: deep tree (max_depth=20), evaluated on both splits.
tree_reg_20 = DecisionTreeRegressor(max_depth=20)
tree_reg_20.fit(x_train_transformed, y_train)

train_preds_depth_20 = tree_reg_20.predict(x_train_transformed)
test_preds_depth_20 = tree_reg_20.predict(x_test_transformed)
train_rmse_depth_20 = np.sqrt(mean_squared_error(y_train, train_preds_depth_20))
test_rmse_depth_20 = np.sqrt(mean_squared_error(y_test, test_preds_depth_20))

# Model 3: try each depth and keep the one with the lowest test RMSE.
best_depth = None
best_train_rmse = None
best_test_rmse = None

for depth in range(1, 21):
    model = DecisionTreeRegressor(max_depth=depth)
    model.fit(x_train_transformed, y_train)
    train_preds = model.predict(x_train_transformed)
    test_preds = model.predict(x_test_transformed)
    train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))

    # The comparison runs inside the loop so that every depth is considered.
    if best_test_rmse is None or test_rmse < best_test_rmse:
        best_depth = depth
        best_train_rmse = train_rmse
        best_test_rmse = test_rmse

train_rmse_depth_20, test_rmse_depth_20, best_depth, best_train_rmse, best_test_rmse

(0.04925026821469577, 2.5314132238836295, 20, 0.04925026821469577, None)
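
The cell above scores candidate depths on the test set, which means the test set is also used for tuning. One alternative, sketched here on synthetic data from `make_regression` rather than the life expectancy dataset, is to compare depths with cross-validation on the training data so that the test set stays untouched. The helper `pick_depth` and its defaults are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to keep the sketch self-contained.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)


def pick_depth(X, y, depths=range(1, 11), cv=5):
    """Return the depth with the lowest mean cross-validated RMSE, plus all scores."""
    scores = {}
    for depth in depths:
        model = DecisionTreeRegressor(max_depth=depth, random_state=0)
        # scikit-learn reports negated RMSE so that larger is better.
        neg_rmse = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        scores[depth] = -neg_rmse.mean()
    best = min(scores, key=scores.get)
    return best, scores


best_cv_depth, cv_rmse = pick_depth(X, y)
print(best_cv_depth, round(cv_rmse[best_cv_depth], 3))
```

With this approach the test RMSE is computed once, for the chosen depth only, and remains an honest estimate of performance on unseen data.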

# Problem 02

In this problem you will work with the "water_potability.csv" dataset. <br>
Import the dataset and:

1. Split the data into train and test datasets; consider 10% of data as the test set.
2. Prepare the datasets using pipeline.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

df = pd.read_csv("water_potability.csv")


train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)

x_train = train_data.drop(columns=["Potability"])
y_train = train_data["Potability"]

x_test = test_data.drop(columns=["Potability"])
y_test = test_data["Potability"]

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

x_train_transformed = pipeline.fit_transform(x_train)
x_test_transformed = pipeline.transform(x_test)

x_train_transformed.shape, x_test_transformed.shape

((2948, 9), (328, 9))

**Model training and testing**

Train two decision tree models, following these instructions:
1. Train a decision tree model on the prepared train dataset with max_depth=2.
> *   The purpose of training this model is to get familiar with decision tree model visualization; so, **DO NOT** test this model at all.
> *   Use "graphviz" and "source" to visualize this model (plot the tree).

2. Train a decision tree model on the prepared train dataset with max_depth=20.
> *   Test the model on the train dataset and extract confusion matrix and calculate precision and recall scores.
> *   Test the model on the test dataset and extract confusion matrix and calculate precision and recall scores.
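
As a small illustration of the visualization step, `export_graphviz` with `out_file=None` returns the tree as a DOT string, which `graphviz.Source` can then wrap for rendering. The sketch below uses the bundled iris dataset instead of the water potability data, and it treats the `graphviz` Python package as optional; both choices are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Bundled toy dataset, used here only for illustration.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# With out_file=None the DOT description is returned as a string.
iris_dot = export_graphviz(
    clf,
    out_file=None,
    feature_names=list(iris.feature_names),
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
)
print(iris_dot.splitlines()[0])

# Wrapping the string in graphviz.Source is what makes it renderable.
# The import is guarded because the graphviz package may not be installed.
try:
    import graphviz
    iris_graph = graphviz.Source(iris_dot)
except ImportError:
    iris_graph = None
```

In a notebook, leaving a `graphviz.Source` object as the last expression of a cell displays the tree inline, while `render` writes it to a file.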

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import confusion_matrix, precision_score, recall_score
import graphviz

# Step 1: Train a decision tree with max_depth=2 (visualization only, not tested).
model_depth_2 = DecisionTreeClassifier(max_depth=2, random_state=42)
model_depth_2.fit(x_train_transformed, y_train)

# Export the tree to DOT format and render it as a PNG file.
dot_data = export_graphviz(model_depth_2, out_file=None, feature_names=list(x_train.columns),
                           class_names=["Not Potable", "Potable"], filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("DecisionTree_max_depth_2", format="png", cleanup=True)

# Step 2: Train a decision tree with max_depth=20.
model_depth_20 = DecisionTreeClassifier(max_depth=20, random_state=42)
model_depth_20.fit(x_train_transformed, y_train)

# Evaluate on the training set.
train_preds_depth_20 = model_depth_20.predict(x_train_transformed)
train_conf_matrix = confusion_matrix(y_train, train_preds_depth_20)
train_precision = precision_score(y_train, train_preds_depth_20, average='binary', pos_label=1)
train_recall = recall_score(y_train, train_preds_depth_20, average='binary', pos_label=1)

# Evaluate on the test set.
test_preds_depth_20 = model_depth_20.predict(x_test_transformed)
test_conf_matrix = confusion_matrix(y_test, test_preds_depth_20)
test_precision = precision_score(y_test, test_preds_depth_20, average='binary', pos_label=1)
test_recall = recall_score(y_test, test_preds_depth_20, average='binary', pos_label=1)

train_conf_matrix, train_precision, train_recall, test_conf_matrix, test_precision, test_recall


(array([[1782,   12],
        [  77, 1077]]),
 0.9889807162534435,
 0.9332755632582322,
 array([[145,  59],
        [ 72,  52]]),
 0.46846846846846846,
 0.41935483870967744)
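
The precision and recall values above can be recovered by hand from the confusion matrices. A minimal sketch for the test-set matrix, relying on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen for illustration.

```python
# Test-set confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels (0 = not potable, 1 = potable).
test_matrix = [[145, 59],
               [72, 52]]

(tn, fp), (fn, tp) = test_matrix

# Precision: of the samples predicted potable, the fraction that really are.
precision = tp / (tp + fp)
# Recall: of the truly potable samples, the fraction the model found.
recall = tp / (tp + fn)

print(precision, recall)
```

Here precision is 52 / 111 and recall is 52 / 124, matching the reported test scores. Both are far below the training scores, which indicates that the depth-20 classifier overfits.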