# Classification of Weather Data using scikit-learn

## Daily Weather Data Analysis
Building a decision-tree classifier for daily weather data using scikit-learn.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import seaborn as sns
import matplotlib.pyplot as plt
import os

In [2]:
os.listdir('./weather')

['daily_weather.csv']

In [3]:
data = pd.read_csv('./weather/daily_weather.csv')

## Daily Weather Data Description
The file **daily_weather.csv** is a comma-separated file that contains weather data. This data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data from different seasons and weather conditions is captured.

Each row in daily_weather.csv captures weather data for a separate day. 

Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements taken in the morning, plus one measurement, namely relative humidity, taken in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.

Each row, or sample, consists of the following variables:

* **number**: unique number for each row
* **air_pressure_9am**: air pressure averaged over a period from 8:55am to 9:04am (Unit: hectopascals)
* **air_temp_9am**: air temperature averaged over a period from 8:55am to 9:04am (Unit: degrees Fahrenheit)
* **avg_wind_direction_9am**: wind direction averaged over a period from 8:55am to 9:04am (Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise)
* **avg_wind_speed_9am**: wind speed averaged over a period from 8:55am to 9:04am (Unit: miles per hour)
* **max_wind_direction_9am**: wind gust direction averaged over a period from 8:55am to 9:10am (Unit: degrees, with 0 being North and increasing clockwise)
* **max_wind_speed_9am**: wind gust speed averaged over a period from 8:55am to 9:04am (Unit: miles per hour)
* **rain_accumulation_9am**: amount of rain accumulated in the 24 hours prior to 9am (Unit: millimeters)
* **rain_duration_9am**: amount of time rain was recorded in the 24 hours prior to 9am (Unit: seconds)
* **relative_humidity_9am**: relative humidity averaged over a period from 8:55am to 9:04am (Unit: percent)
* **relative_humidity_3pm**: relative humidity averaged over a period from 2:55pm to 3:04pm (Unit: percent)

In [4]:
data.head()

| | number | air_pressure_9am | air_temp_9am | avg_wind_direction_9am | avg_wind_speed_9am | max_wind_direction_9am | max_wind_speed_9am | rain_accumulation_9am | rain_duration_9am | relative_humidity_9am | relative_humidity_3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 918.06 | 74.822 | 271.1 | 2.080354 | 295.4 | 2.863283 | 0.0 | 0.0 | 42.42 | 36.16 |
| 1 | 1 | 917.347688 | 71.403843 | 101.935179 | 2.443009 | 140.471548 | 3.533324 | 0.0 | 0.0 | 24.328697 | 19.426597 |
| 2 | 2 | 923.04 | 60.638 | 51.0 | 17.067852 | 63.7 | 22.100967 | 0.0 | 20.0 | 8.9 | 14.46 |
| 3 | 3 | 920.502751 | 70.138895 | 198.832133 | 4.337363 | 211.203341 | 5.190045 | 0.0 | 0.0 | 12.189102 | 12.742547 |
| 4 | 4 | 921.16 | 44.294 | 277.8 | 1.85666 | 136.5 | 2.863283 | 8.9 | 14730.0 | 92.41 | 76.74 |


In [5]:
# check missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)

dataMissing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
dataMissing.head(15)

| | Total | Percent |
|---|---|---|
| rain_accumulation_9am | 6 | 0.547945 |
| air_temp_9am | 5 | 0.456621 |
| max_wind_speed_9am | 4 | 0.365297 |
| avg_wind_direction_9am | 4 | 0.365297 |
| rain_duration_9am | 3 | 0.273973 |
| max_wind_direction_9am | 3 | 0.273973 |
| avg_wind_speed_9am | 3 | 0.273973 |
| air_pressure_9am | 3 | 0.273973 |
| relative_humidity_3pm | 0 | 0.0 |
| relative_humidity_9am | 0 | 0.0 |
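
The same kind of summary can be written a little more directly with `isna().mean()`, which returns the fraction of missing entries per column. This is a minimal sketch on a small made-up frame; the values in `toy` are illustrative and are not taken from daily_weather.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the weather data.
toy = pd.DataFrame({
    "air_temp_9am": [74.8, np.nan, 60.6, 70.1],
    "rain_accumulation_9am": [0.0, 0.0, np.nan, np.nan],
    "relative_humidity_3pm": [36.2, 19.4, 14.5, 12.7],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "Total": toy.isna().sum(),
    "Percent": toy.isna().mean() * 100,
}).sort_values("Total", ascending=False)

print(missing)
```

Applied to the real `data` frame, the same two expressions should give the Total and Percent columns shown above.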


## Cleaning Data

In [6]:
del data['number']

In [7]:
data.shape

(1095, 10)

In [8]:
data = data.dropna()

In [9]:
data.shape

(1064, 10)

Dropping rows with missing values removed 31 of the 1,095 rows, just under 3% of the DataFrame.
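
The size of that loss follows from the two `data.shape` outputs above. A short check of the arithmetic, using only the row counts printed before and after `dropna()`:

```python
rows_before = 1095  # data.shape[0] before dropna()
rows_after = 1064   # data.shape[0] after dropna()

rows_dropped = rows_before - rows_after
percent_dropped = 100 * rows_dropped / rows_before

print(f"{rows_dropped} rows dropped ({percent_dropped:.2f}%)")
# 31 rows dropped (2.83%)
```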

## Converting to a Classification Task
Binarize relative_humidity_3pm to 0 or 1: the label is 1 when the 3pm relative humidity is above 24.99 percent, and 0 otherwise.

In [10]:
cleanData = data.copy()
cleanData['high_humidity_label'] = (cleanData.relative_humidity_3pm > 24.99)*1
print(cleanData.high_humidity_label)

0       1
1       0
2       0
3       0
4       1
5       1
6       0
7       1
8       0
9       1
10      1
11      1
12      1
13      1
14      0
15      0
17      0
18      1
19      0
20      0
21      1
22      0
23      1
24      0
25      1
26      1
27      1
28      1
29      1
30      1
       ..
1064    1
1065    1
1067    1
1068    1
1069    1
1070    1
1071    1
1072    0
1073    1
1074    1
1075    0
1076    0
1077    1
1078    0
1079    1
1080    0
1081    0
1082    1
1083    1
1084    1
1085    1
1086    1
1087    1
1088    1
1089    1
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int64
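
To make the thresholding step concrete, here is a small sketch that applies the same rule to the five `relative_humidity_3pm` values shown by `data.head()`. It uses `astype(int)` instead of `* 1`; both turn the boolean comparison into 0/1. The `value_counts()` call is an addition for checking class balance and is not part of the original notebook.

```python
import pandas as pd

# relative_humidity_3pm for the first five rows shown by data.head()
humidity_3pm = pd.Series([36.16, 19.426597, 14.46, 12.742547, 76.74])

# The comparison yields booleans; astype(int) maps False/True to 0/1.
label = (humidity_3pm > 24.99).astype(int)

print(label.tolist())        # [1, 0, 0, 0, 1]
print(label.value_counts())  # number of days in each class
```

The resulting labels match the first five entries of `high_humidity_label` printed above.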


### Target is stored in 'y'

In [11]:
y = cleanData[['high_humidity_label']].copy()

In [12]:
cleanData.relative_humidity_3pm.head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [13]:
y.head()

| | high_humidity_label |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |


## Using 9am sensor signals as features to predict humidity at 3pm

In [14]:
morningFeatures = ['air_pressure_9am','air_temp_9am',
                    'avg_wind_direction_9am','avg_wind_speed_9am',
                    'max_wind_direction_9am','max_wind_speed_9am',
                    'rain_accumulation_9am','rain_duration_9am'
                   ]

In [15]:
X = cleanData[morningFeatures].copy()

In [16]:
X.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am'],
      dtype='object')

In [17]:
y.columns

Index(['high_humidity_label'], dtype='object')

## Test and train split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.33, random_state = 0)

In [19]:
#print(type(X_train))
#print(type(X_test))
#print(type(y_train))
#print(type(y_test))
#X_train.head()
#y_train.head()
#y_train.describe()
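
The sizes of the resulting train and test sets can be checked without the real data. The sketch below uses placeholder arrays (`X_demo` and `y_demo` are hypothetical stand-ins with the same number of rows and feature columns as `X` and `y`) and the same `test_size` and `random_state` as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1064                    # rows left after dropna()
X_demo = np.zeros((n_rows, 8))   # placeholder for the 8 morning features
y_demo = np.arange(n_rows) % 2   # placeholder 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

print(X_tr.shape, X_te.shape)
```

With 1,064 rows and `test_size=0.33`, this should leave 712 rows for training and 352 for testing. If the two classes were strongly imbalanced, passing `stratify=y` to `train_test_split` would keep the class proportions similar in both parts.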

## Fit on train set

In [20]:
classifier = DecisionTreeClassifier(max_leaf_nodes = 10, random_state = 42)
classifier.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')

In [21]:
type(classifier)

sklearn.tree.tree.DecisionTreeClassifier
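
Once fitted, a tree can be inspected to see which features it actually splits on. The following is a sketch on synthetic data, not on the weather set: `X_demo`, `y_demo`, and the names `feature_a` and `feature_b` are made up for illustration. It assumes a scikit-learn version that provides `export_text` and `get_n_leaves` (0.21 or later).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical stand-in data: two features, label depends only on the first.
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = (X_demo[:, 0] > 50).astype(int)

tree = DecisionTreeClassifier(max_leaf_nodes=10, random_state=42)
tree.fit(X_demo, y_demo)

print(tree.get_n_leaves())        # never more than max_leaf_nodes
print(tree.feature_importances_)  # one weight per feature
print(export_text(tree, feature_names=["feature_a", "feature_b"]))
```

On the weather data, the same calls on `classifier` with `feature_names=morningFeatures` would show which morning measurements drive the splits.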

## Predict on test set

In [22]:
predictions = classifier.predict(X_test)

In [23]:
predictions[:10]

array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1])

In [24]:
y_test['high_humidity_label'][:10]

178     1
1013    1
704     1
533     1
882     0
712     1
254     1
1036    0
642     1
207     1
Name: high_humidity_label, dtype: int64
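
Comparing the two outputs above element by element shows where the classifier went wrong on these ten rows. The arrays below are copied from the `predictions[:10]` and `y_test` outputs; the variable names are only for this sketch.

```python
import numpy as np

predicted = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1])  # predictions[:10]
actual = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 1])     # y_test labels, first 10

matches = predicted == actual
print(matches.sum(), "of", len(matches), "correct")
print("misclassified positions:", np.flatnonzero(~matches))
```

Nine of the ten predictions agree with the labels; the one miss is at position 7, which is row 1036 in `y_test`.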

## Accuracy of the classifier

In [25]:
accuracy_score(y_true = y_test, y_pred = predictions)

0.8039772727272727
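
Accuracy alone does not say which kind of error the model makes or how it compares with a trivial guess. As a sketch, the block below computes a confusion matrix and a majority-class baseline on the ten test rows printed earlier. These ten rows are only an excerpt, so the numbers illustrate the method rather than the full test-set result.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 1])  # first 10 test labels
y_pred = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1])  # first 10 predictions

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: actual 0/1, columns: predicted 0/1

# Baseline: always predict the most common class in y_true.
majority = np.bincount(y_true).argmax()
baseline = accuracy_score(y_true, np.full_like(y_true, majority))

print(acc)       # 0.9
print(cm)
print(baseline)  # 0.8
```

On the full test set, the reported accuracy of about 0.804 is consistent with 283 correct predictions out of 352 rows, assuming the 352-row test split that `test_size = 0.33` produces for 1,064 rows.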