# Classification problem
This exercise uses a dataset of weather observations in Australia; the goal is to predict whether it will rain the next day. The dataset contains several kinds of measurements, including temperature, rainfall, and evaporation, for a total of 23 columns.


In [1]:
import os 
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
import random
from sklearn.metrics import accuracy_score

In [2]:
data = pd.read_csv('weatherAUS.csv')

In [3]:
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


## Preprocessing
In this section we clean the dataset.

In [4]:
data.columns

Index(['Date', 'Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation',
       'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')

In [5]:
data.shape

(145460, 23)

In [6]:
data.describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
count,143975.0,144199.0,142199.0,82670.0,75625.0,135197.0,143693.0,142398.0,142806.0,140953.0,130395.0,130432.0,89572.0,86102.0,143693.0,141851.0
mean,12.194034,23.221348,2.360918,5.468232,7.611178,40.03523,14.043426,18.662657,68.880831,51.539116,1017.64994,1015.255889,4.447461,4.50993,16.990631,21.68339
std,6.398495,7.119049,8.47806,4.193704,3.785483,13.607062,8.915375,8.8098,19.029164,20.795902,7.10653,7.037414,2.887159,2.720357,6.488753,6.93665
min,-8.5,-4.8,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,980.5,977.1,0.0,0.0,-7.2,-5.4
25%,7.6,17.9,0.0,2.6,4.8,31.0,7.0,13.0,57.0,37.0,1012.9,1010.4,1.0,2.0,12.3,16.6
50%,12.0,22.6,0.0,4.8,8.4,39.0,13.0,19.0,70.0,52.0,1017.6,1015.2,5.0,5.0,16.7,21.1
75%,16.9,28.2,0.8,7.4,10.6,48.0,19.0,24.0,83.0,66.0,1022.4,1020.0,7.0,7.0,21.6,26.4
max,33.9,48.1,371.0,145.0,14.5,135.0,130.0,87.0,100.0,100.0,1041.0,1039.6,9.0,9.0,40.2,46.7


### Count the distinct values in each column

In [7]:
data.nunique()

Date             3436
Location           49
MinTemp           389
MaxTemp           505
Rainfall          681
Evaporation       358
Sunshine          145
WindGustDir        16
WindGustSpeed      67
WindDir9am         16
WindDir3pm         16
WindSpeed9am       43
WindSpeed3pm       44
Humidity9am       101
Humidity3pm       101
Pressure9am       546
Pressure3pm       549
Cloud9am           10
Cloud3pm           10
Temp9am           441
Temp3pm           502
RainToday           2
RainTomorrow        2
dtype: int64

### Drop rows with null values

In [8]:
data = data.dropna()

In [9]:
data.shape

(56420, 23)

In [10]:
data.isnull().sum()

Date             0
Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64
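
Dropping every row that has at least one null keeps 56420 of the original 145460 rows, roughly 39%. Before doing that, it can help to see which columns are responsible for most of the missing values. The following is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are hypothetical, not taken from `weatherAUS.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for weatherAUS.csv; values are made up.
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, np.nan],
    "Sunshine": [np.nan, np.nan, 8.1, 9.0],
    "RainTomorrow": ["No", "No", "Yes", "No"],
})

# Fraction of missing values per column, largest first.
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)

# Rows that survive a plain dropna(): only rows with no missing value at all.
rows_kept = len(toy.dropna())
print(rows_kept, "of", len(toy), "rows kept")
```

On the real data, the same `isnull().mean()` call would show whether a few sparsely populated columns account for most of the discarded rows.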

### Check that both classes (rain / no rain) are represented

In [11]:
data.head()
data.RainTomorrow.value_counts()

No     43993
Yes    12427
Name: RainTomorrow, dtype: int64
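
The classes are imbalanced, so accuracy figures are easier to interpret next to the accuracy of a trivial model that always predicts the majority class. A small worked calculation, using only the counts printed above:

```python
# Class counts copied from the value_counts() output.
counts = {"No": 43993, "Yes": 12427}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any classifier trained on this data should be judged by how far it improves on this baseline of about 0.78.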

In [12]:
data.shape

(56420, 23)

In [13]:
y = data.RainTomorrow
x = data.drop(['Date','RainToday', 'Location',  'WindGustDir', 'WindDir9am','RainTomorrow', 'WindDir3pm', 'Rainfall'], axis=1)

x.head()

Unnamed: 0,MinTemp,MaxTemp,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
6049,17.9,35.2,12.0,12.3,48.0,6.0,20.0,20.0,13.0,1006.3,1004.4,2.0,5.0,26.6,33.4
6050,18.4,28.9,14.8,13.0,37.0,19.0,19.0,30.0,8.0,1012.9,1012.1,1.0,1.0,20.3,27.0
6052,19.4,37.6,10.8,10.6,46.0,30.0,15.0,42.0,22.0,1012.3,1009.2,1.0,6.0,28.7,34.9
6053,21.9,38.4,11.4,12.2,31.0,6.0,6.0,37.0,22.0,1012.7,1009.1,1.0,5.0,29.1,35.6
6054,24.2,41.0,11.2,8.4,35.0,17.0,13.0,19.0,15.0,1010.7,1007.4,1.0,6.0,33.6,37.6


In [14]:
x.shape

(56420, 15)

# Dimensionality reduction with PCA

### First, scale the dataset

In [15]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x)

StandardScaler()

In [16]:
scaled_data=scaler.transform(x)
print(scaled_data)

[[ 0.69120848  1.57529783  1.75769102 ...  0.25441126  1.27818584
   1.56362087]
 [ 0.76913098  0.67150378  2.51521564 ... -1.25660337  0.31897996
   0.62746694]
 [ 0.92497598  1.91960032  1.43303762 ...  0.63216492  1.59792114
   1.78303195]
 ...
 [ 1.12757448  1.23099533  0.02620618 ... -1.63435703  1.00412702
   1.37346461]
 [ 0.94056048  1.08753596  0.18853289 ... -1.25660337  1.00412702
   0.94926986]
 [ 1.04965198  1.07319002  0.02620618 ...  0.25441126  1.09547996
   1.21256315]]
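
For reference, the transformation `StandardScaler` applies is a column-wise z-score using the population standard deviation. The sketch below reproduces it with NumPy on hypothetical data (the array `X` and its scales are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 rows, 3 features on very different scales.
X = rng.normal(loc=[20.0, 1015.0, 5.0], scale=[7.0, 7.0, 3.0], size=(200, 3))

# Column-wise z-score with the population standard deviation (ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # each column mean is ~0
print(X_scaled.std(axis=0).round(6))   # each column std is ~1
```

Scaling matters here because PCA is driven by variance: without it, a feature measured in large units such as pressure would dominate the components.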


### Then apply PCA

We keep the smallest number of components that retains 95% of the variance.

In [17]:
from sklearn.decomposition import PCA

# A float in (0, 1) keeps the fewest components that explain that share of the variance.
pca = PCA(.95)
pca.fit(scaled_data)
pca_x = pca.transform(scaled_data)

In [18]:
x = pca_x
x.shape

(56420, 9)
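
To make the 95% rule concrete, the sketch below fits a full PCA on hypothetical synthetic data, picks the smallest number of components whose cumulative explained-variance ratio passes 0.95, and compares that with what `PCA(n_components=0.95)` selects. The data and variable names are assumptions for illustration only:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical correlated data: 6 observed features driven by 3 latent factors.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

# Fit all components, then find the smallest k whose cumulative
# explained-variance ratio exceeds the 0.95 threshold.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Passing a float in (0, 1) asks PCA to make the same choice.
pca = PCA(n_components=0.95).fit(X)
print(cumulative.round(3), k, pca.n_components_)
```

In the notebook, the same rule reduces the 15 scaled features to 9 components.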

In [19]:
from sklearn.model_selection import train_test_split
# Note: this split uses scaled_data (all 15 scaled features), not the PCA-reduced x.
train_x, test_x, train_y, test_y = train_test_split(scaled_data, y, test_size= 0.2)

# Random Forest 

## Parameters
* n_estimators: the number of trees in the forest; this test tries 10, 100, and 1000.
* max_features: the number of features considered when looking for the best split; this test tries 2, 5, and 8.


In [45]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
n_estimators = [10, 100, 1000]
max_features = [2,5,8]

## Tuning hyperparameters

Every combination of the parameters is evaluated with repeated stratified cross-validation.

In [46]:
grid = dict(n_estimators=n_estimators,max_features=max_features)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(train_x, train_y)
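
The cost of this search is easy to underestimate. A quick count, using the same grid values and cross-validation settings as the cell above, shows how many models are trained (by default `GridSearchCV` also refits the best combination once on the whole training set, which is not counted here):

```python
from itertools import product

n_estimators = [10, 100, 1000]
max_features = [2, 5, 8]
n_splits, n_repeats = 10, 3  # same values passed to RepeatedStratifiedKFold

combinations = list(product(max_features, n_estimators))
fits_per_combination = n_splits * n_repeats
total_fits = len(combinations) * fits_per_combination

print(len(combinations), "combinations x", fits_per_combination, "folds =", total_fits, "fits")
```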

## Best Hyperparameters

We print every parameter combination with its score and pick the combination that gives the best result.

In [47]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.859610 using {'max_features': 5, 'n_estimators': 1000}
0.845002 (0.004831) with: {'max_features': 2, 'n_estimators': 10}
0.857372 (0.004636) with: {'max_features': 2, 'n_estimators': 100}
0.858797 (0.005323) with: {'max_features': 2, 'n_estimators': 1000}
0.846508 (0.004063) with: {'max_features': 5, 'n_estimators': 10}
0.857859 (0.004525) with: {'max_features': 5, 'n_estimators': 100}
0.859610 (0.004949) with: {'max_features': 5, 'n_estimators': 1000}
0.846944 (0.005147) with: {'max_features': 8, 'n_estimators': 10}
0.857593 (0.004484) with: {'max_features': 8, 'n_estimators': 100}
0.859299 (0.004967) with: {'max_features': 8, 'n_estimators': 1000}


## Building the model with the best parameters
The output above shows that all the tested models perform similarly; the best-performing one used:
* max_features = 5
* n_estimators = 1000
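
One way to quantify "perform similarly" is to compare the gap between configurations with the spread across folds. The sketch below does this for the three `n_estimators=1000` rows, with the numbers copied from the grid-search output:

```python
# (mean accuracy, std) copied from the grid-search output for n_estimators=1000.
results = {
    2: (0.858797, 0.005323),
    5: (0.859610, 0.004949),
    8: (0.859299, 0.004967),
}

best_features = max(results, key=lambda k: results[k][0])
best_mean, best_std = results[best_features]

# Gap between the best mean accuracy and each configuration's mean accuracy.
gaps = {features: best_mean - mean for features, (mean, _) in results.items()}

for features, gap in gaps.items():
    print(f"max_features={features}: gap to best = {gap:.6f} (best std = {best_std:.6f})")
```

Every gap is well below one standard deviation, so the choice of `max_features` among these values makes little practical difference.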


In [48]:
rfc = RandomForestClassifier(random_state= 42, max_features=5, n_estimators= 1000)

## Training with hyperparameters
Here the model with the best hyperparameters is evaluated on several different train/test splits, to avoid any bias introduced by a single split. For that reason random_state is set to a random number, so each iteration yields a different training and test partition.


In [49]:
total = 0
for i in range(5):
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size= 0.2, random_state= random.randint(0, 100))
    rfc.fit(train_x, train_y)
    pred= rfc.predict(test_x)
    total = total + accuracy_score(test_y,pred)
    print("Accuracy for Random Forest on CV data: ",accuracy_score(test_y,pred))

print("Average accuracy : ",total/5)

Accuracy for Random Forest on CV data:  0.84774902516838
Accuracy for Random Forest on CV data:  0.8458879829847572
Accuracy for Random Forest on CV data:  0.847837646224743
Accuracy for Random Forest on CV data:  0.8481921304501949
Accuracy for Random Forest on CV data:  0.8391527827011698
Average accuracy :  0.8457639135058489
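
Because the seeds come from `random.randint` without a fixed seed, the numbers above change on every run. A possible variant, sketched here on hypothetical synthetic data, uses a fixed list of seeds and reports the spread along with the mean. The `stratify=y` argument and the smaller forest are assumptions added for this example, not part of the original cell:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for the PCA-reduced features.
X, y = make_classification(n_samples=400, n_features=9, random_state=0)

seeds = [0, 1, 2, 3, 4]  # fixed seeds so every run produces the same splits
scores = []
for seed in seeds:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

print(f"accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```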


# Support Vector Machine

In [20]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size= 0.2)

## Parameters
* kernel: the function that determines the shape of the decision boundary; this test tries 'poly' and 'sigmoid'.
* C: the penalty applied when a training point ends up on the wrong side of the boundary (larger values penalize errors more heavily); this test tries 10, 1.0, and 0.1.
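
As an illustration of what the 'poly' kernel computes, the sketch below builds the kernel matrix by hand with NumPy. It assumes the `SVC` defaults of `degree=3` and `coef0=0.0`, and the value that `gamma='scale'` resolves to, `1 / (n_features * X.var())`; the data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 9))  # hypothetical: 5 samples, 9 features

degree, coef0 = 3, 0.0                 # SVC defaults
gamma = 1.0 / (X.shape[1] * X.var())   # what gamma='scale' resolves to

# Polynomial kernel matrix: K[i, j] = (gamma * <x_i, x_j> + coef0) ** degree
K = (gamma * (X @ X.T) + coef0) ** degree
print(K.round(3))
```

Each entry measures the similarity between two samples; the classifier separates the classes using these similarities rather than the raw features.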



In [21]:
from sklearn.svm import SVC
model = SVC()
kernel = ['poly', 'sigmoid']
C = [10, 1.0, 0.1]
gamma = ['scale']


## Tuning hyperparameters

Every combination of the parameters is evaluated with repeated stratified cross-validation.

In [23]:
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(train_x, train_y)

## Best Hyperparameters

We print every parameter combination with its score and pick the combination that gives the best result.

In [24]:
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.843695 using {'C': 10, 'gamma': 'scale', 'kernel': 'poly'}
0.843695 (0.006221) with: {'C': 10, 'gamma': 'scale', 'kernel': 'poly'}
0.764290 (0.012375) with: {'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'}
0.843008 (0.006212) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'poly'}
0.764644 (0.012395) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'sigmoid'}
0.840083 (0.005565) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'poly'}
0.767724 (0.013134) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid'}


## Building the model with the best parameters
The output above lists all the tested models. Performance varies somewhat, mainly between kernels; the best-performing model used:
* kernel = poly
* C = 10
* gamma = scale


In [22]:
svm = SVC(kernel='poly',C=10,gamma='scale')

## Training with hyperparameters
Here the model with the best hyperparameters is evaluated on several different train/test splits, to avoid any bias introduced by a single split. For that reason random_state is set to a random number, so each iteration yields a different training and test partition.


In [26]:
total = 0
for i in range(5):
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size= 0.2, random_state= random.randint(0, 100))
    svm.fit(train_x, train_y)
    pred= svm.predict(test_x)
    total = total + accuracy_score(test_y,pred)
    print("Accuracy for Support Vector Classification on CV data: ",accuracy_score(test_y,pred))

print("Average accuracy : ",total/5)

Accuracy for Support Vector Classification on CV data:  0.8445586671393123
Accuracy for Support Vector Classification on CV data:  0.8408365827720666
Accuracy for Support Vector Classification on CV data:  0.8429634881247784
Accuracy for Support Vector Classification on CV data:  0.8427862460120524
Accuracy for Support Vector Classification on CV data:  0.8417227933356966
Average accuracy :  0.8425735554767811


### Training with hyperparameters

This time we use sklearn's cross_val_score, which does the same as the previous code but with stratified folds, so every fold keeps the class proportions of the full dataset and the samples are less likely to be biased.


In [23]:
from sklearn.model_selection import cross_val_score

results = (cross_val_score(svm, x, y, cv= 5, n_jobs=-1))

print (results) 

[0.83463311 0.83826657 0.83419    0.8489011  0.84181141]
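
For a classifier and an integer `cv`, `cross_val_score` splits the data with `StratifiedKFold`. The sketch below checks this on hypothetical synthetic data by comparing `cross_val_score` with the same evaluation written out by hand; the dataset and its class weights are assumptions chosen to resemble the imbalance seen earlier:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Hypothetical imbalanced synthetic data (roughly 78% / 22%).
X, y = make_classification(n_samples=300, n_features=9, weights=[0.78], random_state=0)
clf = SVC(kernel="poly", C=10, gamma="scale")

scores = cross_val_score(clf, X, y, cv=5)

# The same evaluation written out by hand with StratifiedKFold.
manual = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_clf = clone(clf).fit(X[train_idx], y[train_idx])
    manual.append(fold_clf.score(X[test_idx], y[test_idx]))

print(scores.round(4), np.round(manual, 4))
```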


In [24]:
average_result = 0
for res in results:
    average_result += res

print("Average of the results with the cross_val_score method ", average_result / len(results))

Average of the results with the cross_val_score method  0.8395604395604396
