# Bagging

We will use the following dataset: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

The following cells download and import it:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1X0MT5YFVOduVogJ9cZKjCJ7fs-OdVuAT' -O weather.csv

In [3]:
df = pd.read_csv("weather.csv")

In [4]:
df.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


We will drop the following columns:

- Sunshine          
- Evaporation       
- Cloud3pm          
- Cloud9am  
- Location
- Date       
- WindGustDir
- WindDir9am 

In [5]:
df.drop(columns=["Sunshine", "Evaporation", "Cloud3pm", "Cloud9am", "Location", "Date", "WindGustDir", "WindDir9am"], inplace=True)

In [6]:
df.shape

(145460, 15)

Drop every row that contains at least one null value:

In [7]:
df.dropna(axis=0, inplace=True)

In [8]:
df.shape

(119016, 15)
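The shapes above show the frame going from 145460 to 119016 rows, so 26444 rows had at least one null. Before discarding that much data it can help to check which columns the nulls come from. A minimal sketch on a toy frame (the values are made up for illustration and are not taken from weather.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names mirror the dataset.
toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "Rainfall": [0.6, 0.0, np.nan, 0.0],
    "RainToday": ["No", "No", None, "No"],
})

# Nulls per column, to see where the missing values are concentrated.
null_counts = toy.isna().sum()
print(null_counts)

# dropna(axis=0) removes every row that has at least one null.
toy_clean = toy.dropna(axis=0)
rows_dropped = len(toy) - len(toy_clean)
print(f"rows dropped: {rows_dropped} of {len(toy)}")
```

Running `df.isna().sum()` on the real frame before cell 7 would give the same per-column view.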

In [9]:
df

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindGustSpeed,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,13.4,22.9,0.6,44.0,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,16.9,21.8,No,No
1,7.4,25.1,0.0,44.0,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,17.2,24.3,No,No
2,12.9,25.7,0.0,46.0,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,21.0,23.2,No,No
3,9.2,28.0,0.0,24.0,E,11.0,9.0,45.0,16.0,1017.6,1012.8,18.1,26.5,No,No
4,17.5,32.3,1.0,41.0,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,17.8,29.7,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,3.5,21.8,0.0,31.0,E,15.0,13.0,59.0,27.0,1024.7,1021.2,9.4,20.9,No,No
145455,2.8,23.4,0.0,31.0,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,10.1,22.4,No,No
145456,3.6,25.3,0.0,22.0,N,13.0,9.0,56.0,21.0,1023.5,1019.1,10.9,24.5,No,No
145457,5.4,26.9,0.0,37.0,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,12.5,26.1,No,No


We split the data into X and y. Our goal is to predict whether it will rain tomorrow.

In [10]:
X = df.drop(columns="RainTomorrow")
y = df["RainTomorrow"]

We apply train test split. The test set holds 20% of the data, with a random state of 42 and stratification on y.

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
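To see what `stratify` buys us, the sketch below splits a synthetic imbalanced target with and without it. The 80/20 class balance and the names `X_toy` and `y_toy` are assumptions made for the illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 800 "No" and 200 "Yes" (assumed proportions).
y_toy = np.array(["No"] * 800 + ["Yes"] * 200)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split as in the notebook, with stratification.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Same split without stratification, for comparison.
_, _, _, y_test_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

strat_share = (y_test_strat == "Yes").mean()
plain_share = (y_test_plain == "Yes").mean()
print(f"share of 'Yes' in test, stratified:     {strat_share:.3f}")
print(f"share of 'Yes' in test, not stratified: {plain_share:.3f}")
```

With stratification the test set keeps the class proportions of the full data; without it the share of the minority class depends on the random draw.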

Apply a one-hot encoder to the categorical variables.

Remember to call fit on train only; on test, call transform only.

In [12]:
from sklearn.preprocessing import OneHotEncoder

# Note: the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2.
ohe = OneHotEncoder(sparse_output=False).fit(X_train[["WindDir3pm", "RainToday"]])

# Train

encoded = ohe.transform(X_train[["WindDir3pm", "RainToday"]])
encoded_df = pd.DataFrame(columns=ohe.get_feature_names_out(), data= encoded, index=X_train.index)

X_train = pd.concat([X_train, encoded_df],  axis='columns')

# Test

encoded_test = ohe.transform(X_test[["WindDir3pm", "RainToday"]])
encoded_test_df = pd.DataFrame(columns=ohe.get_feature_names_out(), data= encoded_test, index=X_test.index)

X_test = pd.concat([X_test, encoded_test_df],  axis='columns')


In [13]:
X_train.columns

Index(['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm', 'RainToday',
       'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE', 'WindDir3pm_N',
       'WindDir3pm_NE', 'WindDir3pm_NNE', 'WindDir3pm_NNW', 'WindDir3pm_NW',
       'WindDir3pm_S', 'WindDir3pm_SE', 'WindDir3pm_SSE', 'WindDir3pm_SSW',
       'WindDir3pm_SW', 'WindDir3pm_W', 'WindDir3pm_WNW', 'WindDir3pm_WSW',
       'RainToday_No', 'RainToday_Yes'],
      dtype='object')

Drop the original columns:

In [14]:
X_train.drop(columns=["WindDir3pm", "RainToday"], inplace=True)
X_test.drop(columns=["WindDir3pm", "RainToday"], inplace=True)

Convert the target to a numeric variable:

- 1 if it will rain tomorrow
- 0 if it will not rain tomorrow

In [15]:
raining_mapping_dict = {"No": 0, "Yes": 1}
# map applies the dictionary to each label and returns an integer Series
y_train = y_train.map(raining_mapping_dict)
y_test = y_test.map(raining_mapping_dict)

Train a decision tree with:
- max_depth=10
- random_state=0

and get the classification report for train and test.

In [16]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=10, random_state=0)

tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=10, random_state=0)

In [17]:
from sklearn.metrics import classification_report

print(classification_report(y_train, tree.predict(X_train)))

              precision    recall  f1-score   support

           0       0.89      0.96      0.92     74409
           1       0.79      0.56      0.65     20803

    accuracy                           0.87     95212
   macro avg       0.84      0.76      0.79     95212
weighted avg       0.86      0.87      0.86     95212



In [18]:
print(classification_report(y_test, tree.predict(X_test)))

              precision    recall  f1-score   support

           0       0.87      0.94      0.90     18603
           1       0.70      0.49      0.58      5201

    accuracy                           0.84     23804
   macro avg       0.78      0.72      0.74     23804
weighted avg       0.83      0.84      0.83     23804



Now use a Bagging Classifier built from decision trees:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

The trees used in this classifier must have the same maximum depth (max_depth) as the tree we just trained.

Also use:

- n_estimators = 200
- n_jobs = -1 -> What happens if we remove this?
- random_state = 0 -> In both the tree and the BaggingClassifier

In [19]:
from sklearn.ensemble import BaggingClassifier

# Note: the `base_estimator` argument was renamed to `estimator` in scikit-learn 1.2.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=10, random_state=0), n_estimators=200, n_jobs=-1, random_state=0)
bagging.fit(X_train, y_train)

y_train_pred = bagging.predict(X_train)
y_test_pred = bagging.predict(X_test)

bagging.score(X_train, y_train)

0.8812019493341176
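On the question of what happens without `n_jobs=-1`: the default is a single job, so the estimators are fitted one after another instead of being spread over all available cores. A rough way to see the difference is to time both settings. The sketch below uses synthetic data and a hypothetical helper `fit_bagging`; timings depend on the machine, and on small data the overhead of starting workers can outweigh the gain.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, random_state=0)


def fit_bagging(n_jobs):
    """Fit a bagging ensemble and return it with the elapsed fit time in seconds."""
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=10, random_state=0),
        n_estimators=50,
        n_jobs=n_jobs,
        random_state=0,
    )
    start = time.perf_counter()
    model.fit(X_toy, y_toy)
    return model, time.perf_counter() - start


serial_model, serial_seconds = fit_bagging(None)    # default: one job
parallel_model, parallel_seconds = fit_bagging(-1)  # all available cores
print(f"n_jobs=None: {serial_seconds:.2f}s, n_jobs=-1: {parallel_seconds:.2f}s")
```

With a fixed `random_state`, changing `n_jobs` should in principle affect only the wall-clock time, not the fitted ensemble.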

In [20]:
print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.89      0.97      0.93     74409
           1       0.85      0.56      0.67     20803

    accuracy                           0.88     95212
   macro avg       0.87      0.76      0.80     95212
weighted avg       0.88      0.88      0.87     95212

              precision    recall  f1-score   support

           0       0.87      0.96      0.91     18603
           1       0.76      0.50      0.60      5201

    accuracy                           0.86     23804
   macro avg       0.82      0.73      0.76     23804
weighted avg       0.85      0.86      0.85     23804
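Compared with the single tree, test accuracy moves from 0.84 to 0.86 and precision for class 1 from 0.70 to 0.76. To make the mechanism concrete, here is a hand-rolled sketch of bagging on synthetic data: each tree is fitted on a bootstrap sample (rows drawn with replacement) and the predicted probabilities are averaged. This illustrates the idea and is not a reimplementation of `BaggingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the weather data.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Aggregate: average the class probabilities over all trees, then pick the
# most probable class (labels are 0 and 1, so the column index is the label).
avg_proba = np.mean([t.predict_proba(X_toy) for t in trees], axis=0)
manual_pred = avg_proba.argmax(axis=1)
print("training accuracy of the hand-rolled ensemble:", (manual_pred == y_toy).mean())
```

Averaging many trees trained on slightly different samples reduces the variance of a single deep tree, which is the motivation for bagging.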



Now try a RandomForestClassifier.

The hyperparameters are the same as before:

- max_depth=10
- n_estimators=200
- n_jobs=-1
- random_state=0
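A random forest is already a bagged ensemble of trees. The main addition over bagging plain decision trees is that each split considers only a random subset of the features. The sketch below, on synthetic data with 16 features (an arbitrary choice for the illustration), compares how many features a tree looks at per split in each ensemble under scikit-learn's defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=16, random_state=0)

# Bagged decision trees: every split may consider all features.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    n_estimators=10,
    random_state=0,
).fit(X_toy, y_toy)

# Random forest: by default each split considers sqrt(n_features) features.
forest = RandomForestClassifier(
    max_depth=3, n_estimators=10, random_state=0
).fit(X_toy, y_toy)

bagged_split_features = bagged_trees.estimators_[0].max_features_
forest_split_features = forest.estimators_[0].max_features_
print("features considered per split, bagged tree:", bagged_split_features)
print("features considered per split, forest tree:", forest_split_features)
```

Restricting the features per split makes the trees less correlated with each other, which is why a forest is used directly rather than wrapped in another bagging layer.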

In [21]:
from sklearn.ensemble import RandomForestClassifier

# Train the random forest directly. Wrapping it in a BaggingClassifier would fit
# 200 whole forests, and BaggingClassifier does not expose feature_importances_.
clf = RandomForestClassifier(max_depth=10, n_estimators=200, n_jobs=-1, random_state=0)

clf.fit(X_train, y_train)

y_train_random_pred = clf.predict(X_train)
y_test_random_pred = clf.predict(X_test)



In [None]:
print(classification_report(y_train, y_train_random_pred))
print(classification_report(y_test, y_test_random_pred))

Now print the feature importances of the random forest:

In [11]:
clf.feature_importances_

Run the following code to collect its feature importances in a DataFrame:

In [None]:
fi = pd.DataFrame(columns=["FEATURE", "IMPORTANCE"])
fi["FEATURE"] = X_train.columns
fi["IMPORTANCE"] = clf.feature_importances_
fi = fi.sort_values("IMPORTANCE", ascending=False)

In [None]:
plt.figure(figsize=(5, 15))
sns.barplot(y=fi.FEATURE, x=fi.IMPORTANCE)
plt.show()
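As a check on what `feature_importances_` means for a forest, the sketch below (synthetic data, hypothetical variable names) recomputes it as the average of the per-tree importances. The same averaging gives an importance estimate for a `BaggingClassifier` of trees, which has no `feature_importances_` attribute of its own; this assumes every tree sees all features, which is the default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Forest: the built-in attribute should match the mean over its trees.
forest = RandomForestClassifier(
    max_depth=4, n_estimators=20, random_state=0
).fit(X_toy, y_toy)
manual_forest = np.mean([t.feature_importances_ for t in forest.estimators_], axis=0)
print("matches built-in:", np.allclose(forest.feature_importances_, manual_forest))

# Bagging: no built-in attribute, so average the fitted trees by hand.
bagged = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4, random_state=0),
    n_estimators=20,
    random_state=0,
).fit(X_toy, y_toy)
bagged_importances = np.mean([t.feature_importances_ for t in bagged.estimators_], axis=0)
print("bagging importances:", bagged_importances.round(3))
```

These are impurity-based importances, so they are computed from the training data and tend to favor features with many distinct values.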

Now apply a random forest to your project from the first sprint.

Find the best hyperparameters with grid search, and finally print the feature importances of the variables you used for training.
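A possible starting point for that exercise, sketched on synthetic data. The grid values, the `f1` scoring choice and the `feature_i` column names are placeholders to adapt to your own project, not recommended settings:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for your project's training set.
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_toy = pd.DataFrame(X_toy, columns=[f"feature_{i}" for i in range(6)])

# Small illustrative grid; every combination is evaluated with 3-fold CV.
param_grid = {"max_depth": [3, 6], "n_estimators": [25, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_toy, y_toy)
print("best hyperparameters:", search.best_params_)

# best_estimator_ is refitted on all the data with the best combination.
best_forest = search.best_estimator_
fi = pd.DataFrame(
    {"FEATURE": X_toy.columns, "IMPORTANCE": best_forest.feature_importances_}
).sort_values("IMPORTANCE", ascending=False)
print(fi)
```

The resulting `fi` frame has the same shape as the one built earlier, so the same `sns.barplot` call can plot it.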