https://towardsdatascience.com/random-forest-in-python-24d0893d51c0

# Regression Tree Exercise
The data are temperature readings from the city of Seattle. The goal is to predict, as accurately as possible, the maximum temperature (column `actual`) for the following day. To do so we have the temperatures of the two previous days, historical average maximum temperatures, and a rather tiresome friend who gives you his weather prediction every day.

**Data summary**:
* year: 2016 for all data points
* month: number for month of the year
* day: number for day of the month
* week: day of the week as a character string
* temp_2: max temperature 2 days prior
* temp_1: max temperature 1 day prior
* average: historical average max temperature
* actual: max temperature measurement
* friend: your friend’s prediction, a random number between 20 below the average and 20 above the average


### 1. Import the CSV "temps.csv"

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_csv("temps.csv")
df.head()

Unnamed: 0,year,month,day,week,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend
0,2016,1,1,Fri,45,45,45.6,45,43,50,44,29
1,2016,1,2,Sat,44,45,45.7,44,41,50,44,61
2,2016,1,3,Sun,45,44,45.8,41,43,46,47,56
3,2016,1,4,Mon,44,41,45.9,40,44,48,46,53
4,2016,1,5,Tues,41,40,46.0,44,46,46,46,41


### 2. Handle the categorical variables in the dataset

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 348 entries, 0 to 347
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            348 non-null    int64  
 1   month           348 non-null    int64  
 2   day             348 non-null    int64  
 3   week            348 non-null    object 
 4   temp_2          348 non-null    int64  
 5   temp_1          348 non-null    int64  
 6   average         348 non-null    float64
 7   actual          348 non-null    int64  
 8   forecast_noaa   348 non-null    int64  
 9   forecast_acc    348 non-null    int64  
 10  forecast_under  348 non-null    int64  
 11  friend          348 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 32.8+ KB


In [4]:
# Descriptive statistics for each column
df.describe()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend
count,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0,348.0
mean,2016.0,6.477011,15.514368,62.652299,62.701149,59.760632,62.543103,57.238506,62.373563,59.772989,60.034483
std,0.0,3.49838,8.772982,12.165398,12.120542,10.527306,11.794146,10.605746,10.549381,10.705256,15.626179
min,2016.0,1.0,1.0,35.0,35.0,45.1,35.0,41.0,46.0,44.0,28.0
25%,2016.0,3.0,8.0,54.0,54.0,49.975,54.0,48.0,53.0,50.0,47.75
50%,2016.0,6.0,15.0,62.5,62.5,58.2,62.5,56.0,61.0,58.0,60.0
75%,2016.0,10.0,23.0,71.0,71.0,69.025,71.0,66.0,72.0,69.0,71.0
max,2016.0,12.0,31.0,117.0,117.0,77.4,92.0,77.0,82.0,79.0,95.0


### One-Hot Encoding

In [5]:
# Use the pandas get_dummies function to one-hot encode the categorical variables
df = pd.get_dummies(df)
df.head()

Unnamed: 0,year,month,day,temp_2,temp_1,average,actual,forecast_noaa,forecast_acc,forecast_under,friend,week_Fri,week_Mon,week_Sat,week_Sun,week_Thurs,week_Tues,week_Wed
0,2016,1,1,45,45,45.6,45,43,50,44,29,1,0,0,0,0,0,0
1,2016,1,2,44,45,45.7,44,41,50,44,61,0,0,1,0,0,0,0
2,2016,1,3,45,44,45.8,41,43,46,47,56,0,0,0,1,0,0,0
3,2016,1,4,44,41,45.9,40,44,48,46,53,0,1,0,0,0,0,0
4,2016,1,5,41,40,46.0,44,46,46,46,41,0,0,0,0,0,1,0
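
To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` name and its values are illustrative, not taken from `temps.csv`). `pd.get_dummies` replaces each `object` column with one indicator column per category and leaves numeric columns untouched. Passing `dtype=int` asks for 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column
toy = pd.DataFrame({"temp_1": [45, 44, 41], "week": ["Fri", "Sat", "Fri"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, dtype=int)

print(list(encoded.columns))
print(encoded["week_Fri"].tolist())
```

The numeric column keeps its name, while `week` is replaced by `week_Fri` and `week_Sat`, following the same `column_category` naming seen in the output above.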


### 3. Split into train and test

In [6]:
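# Convert the data to arrays
# Target
y = np.array(df['actual'])
# Features: every column except the target
X = df.drop('actual', axis = 1)
# Saving feature names for later use
# NOTE: the names come from df, not X, so 'actual' is still in the list
X_list = list(df.columns)
# Convert to numpy array
# NOTE: this converts df rather than X, so the target column 'actual'
# remains among the features (target leakage)
X = np.array(df)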

In [7]:
print(X_list)

['year', 'month', 'day', 'temp_2', 'temp_1', 'average', 'actual', 'forecast_noaa', 'forecast_acc', 'forecast_under', 'friend', 'week_Fri', 'week_Mon', 'week_Sat', 'week_Sun', 'week_Thurs', 'week_Tues', 'week_Wed']
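
The printed list still contains `actual`, which means the target is also present in `X`. A minimal sketch of one way to keep the target out of the features is shown below on a hypothetical toy frame (the names `toy`, `features`, `feature_names`, `X_toy` and `y_toy` are illustrative). The key point is that both the name list and the array are taken from the frame returned by `drop`, not from the original one. Applying this to the real data would change the shapes and error figures reported in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the encoded temperature data
toy = pd.DataFrame({
    "temp_1": [45, 44, 41, 40],
    "average": [45.6, 45.7, 45.8, 45.9],
    "actual": [45, 44, 41, 40],
})

# Target vector
y_toy = np.array(toy["actual"])

# Features: drop the target first, then take names and values from the result
features = toy.drop("actual", axis=1)
feature_names = list(features.columns)
X_toy = np.array(features)

print(feature_names)
print(X_toy.shape)
```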


### 4. Train the model
Use a DecisionTreeRegressor

In [8]:
# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [9]:
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (261, 18)
y_train: (261,)
X_test: (87, 18)
y_test: (87,)


### Baseline 

We need to establish a baseline that we expect our model to beat. If the model cannot improve on the baseline, it is a failure, and we should either try a different model or accept that machine learning is not suited to our problem.

In [10]:
# The baseline predictions are the historical averages
baseline_preds = X_test[:, X_list.index('average')]
# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - y_test)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))

Average baseline error:  5.06


Now we have our target! If we cannot beat an average error of 5 degrees, we need to rethink our approach.

In [11]:
# Import the model (a random forest here, rather than a single decision tree)
from sklearn.ensemble import RandomForestRegressor
# Instantiate the model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on the training data
rf.fit(X_train, y_train)

RandomForestRegressor(n_estimators=1000, random_state=42)
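
The exercise statement asks for a `DecisionTreeRegressor`, while the cell above fits a `RandomForestRegressor`. As a hedged sketch of the single-tree variant, the snippet below fits a tree on synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration). With the notebook's own `X_train` and `y_train`, the same constructor and `fit` call would apply.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 100 samples, 3 features, target driven by feature 0
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# A single regression tree; random_state fixes tie-breaking between splits
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_demo, y_demo)

# With no depth limit and distinct rows, the tree can memorise the training targets
train_mae = np.mean(np.abs(tree.predict(X_demo) - y_demo))
print('Training MAE:', round(train_mae, 4))
```

A training error near zero says nothing about how the tree generalises, which is why the notebook evaluates on `X_test` rather than on the training data.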

### 5. Compute its MAE
Without using sklearn

In [12]:

# Compute the MAE using NumPy

predictions = rf.predict(X_test)
# Absolute error of each prediction
errors = abs(predictions - y_test)
# MAE is the mean of the absolute errors
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')

Mean Absolute Error: 0.15 degrees.


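Our predictions are off by 0.15 degrees on average, an improvement of about 4.9 degrees over the baseline. That figure is too good to take at face value: `X` was built from the full DataFrame, so the target column `actual` is one of the features (it appears in the printed `X_list`) and the model can read the answer directly.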

### 6. Compute its MAPE
Without using sklearn

To put our predictions in perspective, we can compute an accuracy as 100% minus the mean absolute percentage error (MAPE).
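
Written out, with $y_i$ the observed maximum temperature, $\hat{y}_i$ the prediction and $n$ the number of test samples (this notation is introduced here for illustration and is not part of the original notebook), the two quantities are:

```latex
\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\lvert \hat{y}_i - y_i \rvert}{y_i},
\qquad
\mathrm{Accuracy} = 100 - \mathrm{MAPE}
```

Dividing by $y_i$ without an absolute value assumes every observed temperature is positive, which holds for this dataset (the minimum of `actual` is 35).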

In [13]:
# Absolute percentage error of each prediction; its mean is the MAPE
mape = 100 * (errors / y_test)
# Accuracy as 100 minus the MAPE
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 99.77 %.


#### That looks very good! Our model predicts the next day's maximum temperature in Seattle with 99.77% accuracy on the test set, although the score is inflated because `actual` is among the features.

### 7. Plot the decision tree

![Big](big.png)

### 8. Change max_depth to 3 and retrain

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

In [15]:
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (261, 18)
y_train: (261,)
X_test: (87, 18)
y_test: (87,)


In [16]:
# Instantiate the model with 10 decision trees, each limited to depth 3
rf_depth = RandomForestRegressor(n_estimators = 10, random_state = 42, max_depth= 3)
# Train the model on the training data
rf_depth.fit(X_train, y_train)

RandomForestRegressor(max_depth=3, n_estimators=10, random_state=42)

### 9. Recompute its MAE

In [17]:

# Compute the MAE using NumPy

predictions_depth = rf_depth.predict(X_test)
# Absolute error of each prediction
errors_depth = abs(predictions_depth - y_test)
# MAE is the mean of the absolute errors
print('Mean Absolute Error:', round(np.mean(errors_depth), 2), 'degrees.')

Mean Absolute Error: 1.05 degrees.


### 10. Plot its tree again

![Small](small.png)
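
The tree figures in this notebook are included as images without the code that produced them. One possible way to inspect a tree, sketched here under assumptions, is scikit-learn's `export_text`, applied to a single member of a forest through its `estimators_` attribute. The data and names (`X_demo`, `y_demo`, `demo_names`, `forest`) are synthetic placeholders, not the notebook's variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic stand-in data with hypothetical feature names
rng = np.random.default_rng(0)
X_demo = rng.uniform(30, 90, size=(200, 2))
y_demo = 0.7 * X_demo[:, 0] + 0.3 * X_demo[:, 1]
demo_names = ["temp_1", "average"]

# Small forest with shallow trees, mirroring max_depth=3 from the step above
forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=42)
forest.fit(X_demo, y_demo)

# A forest has no single tree to draw; pick one fitted member and print its rules
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=demo_names)
print(rules)
```

`sklearn.tree.plot_tree` is the graphical counterpart when matplotlib is available. Either way, what gets shown is one tree out of the ensemble, not the forest as a whole.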

### 11. Get the `feature_importances_` of each variable in the last model

In [20]:
# Get numerical feature importances
importances = list(rf_depth.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(x, round(importance, 2)) for x, importance in zip(X_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))

Variable: actual               Importance: 1.0
Variable: year                 Importance: 0.0
Variable: month                Importance: 0.0
Variable: day                  Importance: 0.0
Variable: temp_2               Importance: 0.0
Variable: temp_1               Importance: 0.0
Variable: average              Importance: 0.0
Variable: forecast_noaa        Importance: 0.0
Variable: forecast_acc         Importance: 0.0
Variable: forecast_under       Importance: 0.0
Variable: friend               Importance: 0.0
Variable: week_Fri             Importance: 0.0
Variable: week_Mon             Importance: 0.0
Variable: week_Sat             Importance: 0.0
Variable: week_Sun             Importance: 0.0
Variable: week_Thurs           Importance: 0.0
Variable: week_Tues            Importance: 0.0
Variable: week_Wed             Importance: 0.0