<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold">

Classification of Weather Data <br><br>
using scikit-learn
<br><br>
</p>

In [3]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [4]:
data = pd.read_csv('./weather/daily_weather.csv')

In [5]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

In [6]:
del data['number']

In [10]:
before_cleaning = data.shape[0]
print(before_cleaning)

1095


In [11]:
data = data.dropna()

In [12]:
after_cleaning = data.shape[0]
print(after_cleaning)

1064


In [13]:
before_cleaning - after_cleaning

31
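
Before dropping rows, it can help to see which columns hold the missing values. The sketch below uses a small made-up DataFrame (the values are illustrative and not read from `daily_weather.csv`) to show how `isnull().sum()` counts gaps per column and how `dropna()` changes the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the weather readings.
toy = pd.DataFrame({
    'air_pressure_9am': [918.1, np.nan, 923.0, 920.5],
    'air_temp_9am': [74.8, 71.4, np.nan, 70.1],
    'relative_humidity_3pm': [36.2, 19.4, 14.5, np.nan],
})

# Count missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one NaN.
rows_before = toy.shape[0]
toy_clean = toy.dropna()
rows_dropped = rows_before - toy_clean.shape[0]
print(rows_dropped)
```

Running the same two calls on `data` before the `dropna()` cell would show which sensors account for the 31 dropped rows.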

<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold">

Convert to a Classification Task <br><br>
<br><br>
</p>

In [14]:
# Copy the cleaned DataFrame into a new variable to work with
clean_data = data.copy()
# Add a new binary column (0 or 1) that flags high humidity at 3pm.
# The comparison returns True/False; multiplying by 1 converts it to 1/0.
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1
print(clean_data['high_humidity_label'])

0       1
1       0
2       0
3       0
4       1
5       1
6       0
7       1
8       0
9       1
10      1
11      1
12      1
13      1
14      0
15      0
17      0
18      1
19      0
20      0
21      1
22      0
23      1
24      0
25      1
26      1
27      1
28      1
29      1
30      1
       ..
1064    1
1065    1
1067    1
1068    1
1069    1
1070    1
1071    1
1072    0
1073    1
1074    1
1075    0
1076    0
1077    1
1078    0
1079    1
1080    0
1081    0
1082    1
1083    1
1084    1
1085    1
1086    1
1087    1
1088    1
1089    1
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int64
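
As a rough illustration of the thresholding step, the sketch below applies the same `> 24.99` cut to the first five `relative_humidity_3pm` values shown in this notebook. Using `astype(int)` is an assumed alternative to multiplying by 1, and `value_counts()` is one way to check how balanced the two classes are.

```python
import pandas as pd

# First five relative_humidity_3pm values as printed by the notebook.
humidity_3pm = pd.Series([36.160000, 19.426597, 14.460000, 12.742547, 76.740000])

# Same threshold as in the notebook; astype(int) replaces the "* 1" trick.
labels = (humidity_3pm > 24.99).astype(int)
print(labels.tolist())

# Class balance: how many rows fall into each class.
print(labels.value_counts().sort_index())
```

On the full `clean_data`, the same `value_counts()` call would show whether one class dominates, which matters when reading the accuracy figure later.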


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Target is stored in 'y'.
<br><br></p>


### Now our target is stored in y

In [16]:
y = clean_data[['high_humidity_label']].copy()

In [17]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [19]:
y.head()
# y holds the binary labels derived from the 3pm relative humidity

   high_humidity_label
0                    1
1                    0
2                    0
3                    0
4                    1


<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Use 9am Sensor Signals as Features to Predict Humidity at 3pm
<br><br></p>


In [20]:
# Select the morning (9am) sensor readings to use as features
morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
        'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
        'rain_duration_9am']

In [21]:
# Assign those feature columns to X
X = clean_data[morning_features].copy()

In [22]:
X.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am'],
      dtype='object')

In [23]:
y.columns

Index(['high_humidity_label'], dtype='object')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Perform Test and Train split

<br><br></p>



In [24]:
# Split the data before training the model. This produces 4 DataFrames:
# 1. X_train - features used for training
# 2. X_test - features held out for testing
# 3. y_train - labels used for training
# 4. y_test - labels held out for testing

# test_size is the fraction of rows reserved for testing. Here, 33% goes to the test set.
# random_state seeds the shuffle so the same split is reproduced on every run.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
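
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on synthetic data with the same row count as the cleaned dataset (1064). The names `X_demo`, `y_demo`, and the `f1`/`f2` columns are hypothetical stand-ins, not the weather features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the cleaned data.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(1064, 2), columns=['f1', 'f2'])
y_demo = pd.DataFrame({'label': rng.randint(0, 2, size=1064)})

# Split twice with the same seed.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324)

# Sizes of the two partitions.
print(Xa_train.shape[0], Xa_test.shape[0])

# The same seed selects the same rows in the same order.
print(Xa_test.index.equals(Xb_test.index))
```

Changing `random_state` to another integer would keep the partition sizes but select different rows.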

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Fit on Train Set
<br><br></p>


In [25]:
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
humidity_classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [26]:
type(humidity_classifier)

sklearn.tree.tree.DecisionTreeClassifier
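
Once fitted, the tree can be inspected. The sketch below is an assumption-laden illustration on synthetic data (`X_demo` and `y_demo` are hypothetical), and `get_n_leaves()` requires a newer scikit-learn release than the one that printed the output above. It shows that `max_leaf_nodes` caps the tree size and that `feature_importances_` reports which inputs the splits rely on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on the first of three features.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
clf.fit(X_demo, y_demo)

# The number of leaves never exceeds max_leaf_nodes.
print(clf.get_n_leaves())

# Importances sum to 1; the first feature should dominate here.
print(clf.feature_importances_)
```

Applied to `humidity_classifier`, pairing `feature_importances_` with `X.columns` would indicate which 9am readings drive the humidity prediction.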

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Predict on Test Set 

<br><br></p>


In [27]:
predictions = humidity_classifier.predict(X_test)

In [28]:
predictions[:10]

array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])

In [29]:
y_test['high_humidity_label'][:10].tolist()

[0, 0, 1, 1, 1, 1, 1, 0, 1, 1]

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

The model got 8 of the first 10 predictions right <br><br>
To measure accuracy over the whole test set, we can use the following function
<br><br></p>


In [30]:
accuracy_score(y_true = y_test, y_pred = predictions)

0.8153409090909091
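
As a small worked check of what `accuracy_score` computes, the sketch below reuses the first ten true labels and predictions printed above. The confusion matrix is an addition beyond the notebook, included to show which kind of error occurs in this sample.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten true labels and predictions, copied from the cells above.
y_true_10 = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1]
y_pred_10 = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]

# Accuracy by hand: share of positions where prediction equals truth.
matches = sum(t == p for t, p in zip(y_true_10, y_pred_10))
manual_accuracy = matches / len(y_true_10)
print(manual_accuracy)

# The library function gives the same value.
library_accuracy = accuracy_score(y_true=y_true_10, y_pred=y_pred_10)
print(library_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_10, y_pred_10)
print(cm)
```

In this ten-row sample both mistakes are high-humidity days predicted as low humidity; the full test set may show a different mix of errors.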