# Author: Emmanuel Rodriguez

https://emmanueljrodriguez.com/

Date: 17 May 2022

Location: West Texas, USA

## Intro to Machine Learning: Classification of Weather Data using scikit-learn

## Scope: 

An introduction to the scikit-learn machine learning library through a decision-tree-based classification of weather data.

Code source: Ilkay Altinas, edX

### Import libraries:

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split # Function to split arrays or matrices into random train and test subsets.
from sklearn.tree import DecisionTreeClassifier # Class capable of performing multi-class classification on a dataset.

In [2]:
# Read weather data
# Create a Pandas DataFrame from a CSV file:

data = pd.read_csv(r'C:\Users\ejoaq\OneDrive\1 My_Notebook\3 Engineering\2 Data Science\Python for Data Science\Notebooks\Week-7-MachineLearning\weather\daily_weather.csv')
# Prefix with 'r' to produce raw string
# https://stackoverflow.com/questions/1347791/unicode-error-unicodeescape-codec-cant-decode-bytes-cannot-open-text-file
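
The raw-string prefix matters because, in a normal Python string, a backslash followed by certain characters (such as `\U` in `C:\Users`) is read as an escape sequence. A minimal sketch of the alternatives, using a shortened, hypothetical path rather than the one above:

```python
from pathlib import PureWindowsPath

# Hypothetical, shortened path used only for illustration.
raw_path = r'C:\data\weather\daily_weather.csv'        # raw string: backslashes are kept literally
escaped_path = 'C:\\data\\weather\\daily_weather.csv'  # same text with every backslash escaped

# Both spellings produce exactly the same string.
same_text = raw_path == escaped_path

# PureWindowsPath parses Windows-style paths on any operating system.
file_name = PureWindowsPath(raw_path).name
print(same_text, file_name)
```

Either spelling can be passed to `pd.read_csv`; the raw string is simply easier to read and to paste from a file explorer.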

### Daily weather data description

The file **daily_weather.csv** is a comma-separated file that contains weather data. This data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

In [3]:
# Check labeled columns:

data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

### Data background

Each row in daily_weather.csv captures weather data for a separate day.

Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, plus one measurement, namely relative humidity, in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.

Each row, or sample, consists of the following variables:

* **number:** unique number for each row
* **air_pressure_9am:** air pressure averaged over a period from 8:55am to 9:04am (*Unit: hectopascals*)
* **air_temp_9am:** air temperature averaged over a period from 8:55am to 9:04am (*Unit: degrees Fahrenheit*)
* **avg_wind_direction_9am:** wind direction averaged over a period from 8:55am to 9:04am (*Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise*)
* **avg_wind_speed_9am:** wind speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **max_wind_direction_9am:** wind gust direction averaged over a period from 8:55am to 9:10am (*Unit: degrees, with 0 being North and increasing clockwise*)
* **max_wind_speed_9am:** wind gust speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **rain_accumulation_9am:** amount of rain accumulated in the 24 hours prior to 9am (*Unit: millimeters*)
* **rain_duration_9am:** amount of time rain was recorded in the 24 hours prior to 9am (*Unit: seconds*)
* **relative_humidity_9am:** relative humidity averaged over a period from 8:55am to 9:04am (*Unit: percent*)
* **relative_humidity_3pm:** relative humidity averaged over a period from 2:55pm to 3:04pm (*Unit: percent*)

In [4]:
# View DataFrame
data

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.060000,74.822000,271.100000,2.080354,295.400000,2.863283,0.0,0.0,42.420000,36.160000
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.040000,60.638000,51.000000,17.067852,63.700000,22.100967,0.0,20.0,8.900000,14.460000
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.160000,44.294000,277.800000,1.856660,136.500000,2.863283,8.9,14730.0,92.410000,76.740000
...,...,...,...,...,...,...,...,...,...,...,...
1090,1090,918.900000,63.104000,192.900000,3.869906,207.300000,5.212070,0.0,0.0,26.020000,38.180000
1091,1091,918.710000,49.568000,241.600000,1.811921,227.400000,2.371156,0.0,0.0,90.350000,73.340000
1092,1092,916.600000,71.096000,189.300000,3.064608,200.800000,3.892276,0.0,0.0,45.590000,52.310000
1093,1093,912.600000,58.406000,172.700000,3.825167,189.100000,4.764682,0.0,0.0,64.840000,58.280000


In [7]:
# Search for NaN elements
data[data.isnull().any(axis=1)] # isnull() flags missing values; any(axis=1) then marks each row that
# contains at least one NaN in any of its columns

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03
334,334,916.23,75.74,149.1,2.751436,187.5,4.183078,,1480.0,31.88,32.9
358,358,917.44,58.514,55.1,10.021491,,12.705819,0.0,0.0,13.88,25.93
361,361,920.444946,65.801845,49.823346,21.520177,61.886944,25.549112,,40.364018,12.278715,7.618649
381,381,918.48,66.542,90.9,3.467257,89.4,4.406772,,0.0,20.64,14.35
409,409,,67.853833,65.880616,4.328594,78.570923,5.216734,0.0,0.0,18.487385,20.356594
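
To make the row filter above easier to follow, here is a minimal sketch on a three-row toy DataFrame. The values are made up for illustration, and the names `toy`, `row_has_nan`, and `nan_per_column` are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data with two deliberately missing values (not taken from daily_weather.csv).
toy = pd.DataFrame({
    'air_temp_9am': [74.8, np.nan, 60.6],
    'relative_humidity_9am': [42.4, 24.3, np.nan],
    'relative_humidity_3pm': [36.2, 19.4, 14.5],
})

# isnull() returns a boolean DataFrame; any(axis=1) reduces it to one boolean per row.
row_has_nan = toy.isnull().any(axis=1)

# Boolean indexing keeps only the rows flagged True.
rows_with_nan = toy[row_has_nan]

# Counting missing values per column shows which sensors are affected.
nan_per_column = toy.isnull().sum()

print(rows_with_nan)
print(nan_per_column)
```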


### Data pre-processing

In [8]:
# Delete the 'number' column
del data['number']

In [9]:
# Drop NaN 
before_rows = data.shape[0] # Get rows dimension before dropping
print(before_rows)

data = data.dropna()

after_rows = data.shape[0] # Get rows dimension after dropping
print(after_rows)

1095
1064


In [10]:
# Validate NaN elements have been dropped
data[data.isnull().any(axis=1)]

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm


In [11]:
# How many rows dropped?
before_rows - after_rows

31
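
One detail worth noting: `dropna()` removes rows but does not renumber the index, which is why the row labels can still run up to 1094 after 31 rows are gone. A small sketch on made-up data (all names and values here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1 and 2 each contain one missing value.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [10.0, 20.0, np.nan, 40.0],
})

cleaned = toy.dropna()                    # returns a new DataFrame; toy itself is unchanged
rows_dropped = len(toy) - len(cleaned)

# The surviving rows keep their original labels, leaving gaps in the index.
kept_labels = cleaned.index.tolist()

# reset_index(drop=True) renumbers from 0 if contiguous labels are preferred.
renumbered = cleaned.reset_index(drop=True)

print(rows_dropped, kept_labels, renumbered.index.tolist())
```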

## Convert to a Classification Task


A classifier will be used to predict whether relative humidity at 3 PM is high or low, based on the morning weather measurements.

Binarize the relative_humidity_3pm column to either a 0 or 1.

In [12]:
clean_data = data.copy() # Create a new DataFrame containing only the clean data

# Add a new column holding a binary label: 1 if relative humidity at 3 PM is greater than 24.99%, otherwise 0.
# Multiplying the boolean result by 1 yields integer values (1 or 0); this column becomes the target variable.
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1
print(clean_data['high_humidity_label'])

0       1
1       0
2       0
3       0
4       1
       ..
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int32
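
As a small check of the binarization step, the sketch below applies the same 24.99 threshold to the first five 3 PM humidity values shown earlier (rounded to two decimals). The variable names are hypothetical, and `astype(int)` is shown only as an equivalent alternative to multiplying by 1:

```python
import pandas as pd

# First five relative_humidity_3pm values from the DataFrame view, rounded.
humidity_3pm = pd.Series([36.16, 19.43, 14.46, 12.74, 76.74], name='relative_humidity_3pm')

threshold = 24.99  # same cut-off as in the notebook cell

# Comparison gives booleans; multiplying by 1 converts them to integers.
label_by_multiplying = (humidity_3pm > threshold) * 1

# astype(int) performs the same conversion more explicitly.
label_by_astype = (humidity_3pm > threshold).astype(int)

# value_counts() is a quick way to see how balanced the two classes are.
class_counts = label_by_astype.value_counts()

print(label_by_multiplying.tolist())
print(class_counts)
```

The resulting labels for these five rows agree with the first five entries printed above.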


Following the function notation y = f(X), a new DataFrame called 'y' is created to hold the target variable, that is, the value the model will learn to predict.

In [19]:
# Target is stored in 'y', indicating a dependent variable.
y = clean_data[['high_humidity_label']] # Indexing with a list of column labels (double brackets) returns a DataFrame,
# even when the list holds a single label; indexing with a single label (single brackets) would return a
# Series instead.
print(type(y))
y

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1
...,...
1090,1
1091,1
1092,1
1093,1


### Sensor fusion

The 9AM sensor signals are used as features to predict humidity at 3 PM.

In [21]:
morning_features = clean_data.columns[0:-2] # Grab the column labels to be used as features (predictor variables) in the
# machine learning model
morning_features

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am'],
      dtype='object')
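
The slice `[0:-2]` works because the two non-feature columns happen to be the last two. A short sketch of that positional selection next to two name-based alternatives, on a reduced, illustrative column list (the alternatives are suggestions, not part of the original notebook):

```python
import pandas as pd

# Reduced column list for illustration: two morning features, then the two trailing columns.
columns = pd.Index([
    'air_pressure_9am', 'relative_humidity_9am',
    'relative_humidity_3pm', 'high_humidity_label',
])

# Positional: everything except the last two labels (what the notebook does).
by_position = columns[0:-2]

# By name: drop the two columns that must not be used as features.
by_name = columns.drop(['relative_humidity_3pm', 'high_humidity_label'])

# By naming convention: keep only the 9 AM measurements.
by_suffix = [c for c in columns if c.endswith('_9am')]

print(by_position.tolist(), by_name.tolist(), by_suffix)
```

The name-based forms keep working if the column order changes, whereas the positional slice relies on it.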

In [22]:
# Create a new DataFrame holding only the morning features, selected by their column labels.
X = clean_data[morning_features] # Named 'X' to signify independent variables
X

Unnamed: 0,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am
0,918.060000,74.822000,271.100000,2.080354,295.400000,2.863283,0.0,0.0,42.420000
1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697
2,923.040000,60.638000,51.000000,17.067852,63.700000,22.100967,0.0,20.0,8.900000
3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102
4,921.160000,44.294000,277.800000,1.856660,136.500000,2.863283,8.9,14730.0,92.410000
...,...,...,...,...,...,...,...,...,...
1090,918.900000,63.104000,192.900000,3.869906,207.300000,5.212070,0.0,0.0,26.020000
1091,918.710000,49.568000,241.600000,1.811921,227.400000,2.371156,0.0,0.0,90.350000
1092,916.600000,71.096000,189.300000,3.064608,200.800000,3.892276,0.0,0.0,45.590000
1093,912.600000,58.406000,172.700000,3.825167,189.100000,4.764682,0.0,0.0,64.840000


In [31]:
# View columns labels
print(X.columns)
print(len(X.columns)) # Number of features

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am'],
      dtype='object')
9


In [24]:
y.columns

Index(['high_humidity_label'], dtype='object')

## Perform test and train split

### Training Phase vs Testing Phase

In the **training phase**, the learning algorithm uses the training data to adjust the model's parameters to minimize errors. The output of this phase is a trained model.

<img src="ML_training_vs_testing.png" align="middle"/>

In the **testing phase**, the trained model is applied to previously unseen test data, and its performance is then evaluated. A well-fitted classifier should perform comparably on the training data and the test data; a much higher score on the training data suggests over-fitting.

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Returns: two sets of independent variables (X) and dependent variables (y); one pair is used to train the model and
# the other to test it. All four are DataFrames.
# Inputs: independent variables X and dependent variable y, both Pandas DataFrames.
# test_size=0.33 holds out 33% of the input data for testing.
# random_state controls the shuffling applied to the data before the split. 42 is a popular integer random seed.

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split

42 ref: https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#The_Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_is_42
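
To see what `test_size` and `random_state` do in practice, the sketch below splits placeholder arrays with the same number of rows as the cleaned data (1064). The arrays are arbitrary stand-ins, not the weather data, and all variable names are hypothetical:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1064  # number of rows left after dropping NaN values

# Placeholder features and labels; only their lengths matter here.
X_toy = np.arange(n_samples * 2).reshape(n_samples, 2)
y_toy = np.arange(n_samples) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

# With a float test_size, the test set size is rounded up.
expected_test_rows = math.ceil(0.33 * n_samples)

# Repeating the call with the same seed reproduces the same split.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
reproducible = np.array_equal(X_te, X_te_again)

print(len(X_tr), len(X_te), expected_test_rows, reproducible)
```

The resulting training size of 712 rows is the same as the `count` that `describe()` reports for `y_train`.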

In [29]:
print(type(X_train))
print(type(X_test))
print(type(y_train))
print(type(y_test))
print(X_train.head())
y_train.describe() # Generates descriptive statistics.

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
      air_pressure_9am  air_temp_9am  avg_wind_direction_9am  \
242         916.581504     75.224662              177.443701   
369         916.320000     63.032000               68.900000   
214         920.627327     78.669445               95.561092   
1003        918.660000     57.272000              286.300000   
607         915.600000     76.280000              188.400000   

      avg_wind_speed_9am  max_wind_direction_9am  max_wind_speed_9am  \
242             6.690026              185.467764            7.535282   
369             3.959384               79.300000            4.652835   
214             2.901136              124.866405            3.968541   
1003            3.937014              312.200000            5.480503   
607             1.856660              173.800000            2.371156   

      rain_accumulation_9am  r

Unnamed: 0,high_humidity_label
count,712.0
mean,0.5
std,0.500351
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


### Train the classifier

Training is sometimes referred to as "fitting" the model to the training data.

In [32]:
humidity_classifier = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0) # Output is a decision tree classifier object.
# max_leaf_nodes is the stopping criterion for tree induction; the default (None) places no limit on the number of
# leaf nodes, which can over-fit the tree to the training data.
# random_state is the random seed, i.e., it controls the randomness of the estimator.

humidity_classifier.fit(X_train, y_train) # Train with the fit method of the object, i.e., the classifier will tune itself.

DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)

Refs: 

https://scikit-learn.org/stable/modules/tree.html#tree

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [33]:
type(humidity_classifier)

sklearn.tree._classes.DecisionTreeClassifier
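
To illustrate why the number of leaf nodes is capped, the sketch below compares an unrestricted tree with a tree limited to 10 leaves. It uses synthetic data from `make_classification` as a stand-in, since the weather file is not bundled here; the data, the variable names, and the specific generator settings are all assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 9 features; flip_y adds label noise so over-fitting can show up.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=9, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

# Unrestricted tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Restricted tree: same settings as in the notebook.
small_tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0).fit(X_tr, y_tr)

for name, tree in [('unrestricted', full_tree), ('max_leaf_nodes=10', small_tree)]:
    # score() returns mean accuracy on the given data.
    print(name, tree.get_n_leaves(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```

A large gap between training and test accuracy for the unrestricted tree would be the sign of over-fitting described in the training vs. testing discussion.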

### Predict on test data

In [34]:
predictions = humidity_classifier.predict(X_test)

In [35]:
predictions[:10] # Display the first 10 values in those predictions

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

In [36]:
# Quick visual comparison to true values
y_test['high_humidity_label'][:10]

32      1
1084    1
423     1
767     1
818     1
583     1
594     1
606     1
87      0
421     1
Name: high_humidity_label, dtype: int32
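
The visual comparison can also be done element by element. The sketch below copies the ten predictions and ten true labels printed above into arrays (the variable names are hypothetical) and counts the matches:

```python
import numpy as np

# First ten predictions and true labels, copied from the two outputs above.
predicted_first_10 = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])
actual_first_10 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])

# Element-wise comparison gives True where the prediction matches the label.
matches = predicted_first_10 == actual_first_10
n_correct = int(matches.sum())

# Positions (within these ten rows) where the classifier was wrong.
mismatch_positions = np.flatnonzero(~matches).tolist()

print(n_correct, mismatch_positions)
```

Ten rows are too few to judge the classifier; they only show how predictions line up against the labels.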

### Measure accuracy of the classifier

In [37]:
accuracy_score(y_true = y_test, y_pred = predictions)

0.8636363636363636

Classifier accuracy is 86%.
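
Accuracy is simply the fraction of test samples predicted correctly. As a hedged illustration, the sketch below reuses only the first ten predictions and labels printed earlier, so its numbers describe those ten rows and not the full test set; `confusion_matrix` is an addition not used in the original notebook:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# First ten true labels and predictions, copied from the earlier outputs.
y_true_10 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
y_pred_10 = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 0])

# accuracy_score is the mean of the element-wise matches.
acc_sklearn = accuracy_score(y_true_10, y_pred_10)
acc_manual = float(np.mean(y_true_10 == y_pred_10))

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_10, y_pred_10)

# Assumption: the test set has 1064 - 712 = 352 rows, so the reported accuracy
# corresponds to this many correct predictions.
correct_on_test_set = round(0.8636363636363636 * 352)

print(acc_sklearn, acc_manual, cm.tolist(), correct_on_test_set)
```

A confusion matrix on the full test set would show whether the errors are mostly missed high-humidity days or false alarms, which a single accuracy figure hides.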

Is this classifier's accuracy sufficient considering it's only taking in 9 AM measurements to predict 3 PM weather?