# Classification of Weather Data using DecisionTreeClassifier

### Daily Weather Data Analysis

In this notebook, we will use scikit-learn to perform decision-tree-based classification of weather data.

#### Importing the Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#### Creating a Pandas DataFrame from a CSV file

In [2]:
data = pd.read_csv('./weather/daily_weather.csv')

### Daily Weather Data Description
The file **daily_weather.csv** is a comma-separated file that contains weather data. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

In [3]:
data.columns

Index(['number', 'air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

Each row in **daily_weather.csv** captures weather data for a separate day.
Sensor measurements from the weather station were captured at one-minute intervals. These measurements were then processed to generate values that describe daily weather. Since this dataset was created to classify low-humidity days vs. non-low-humidity days (that is, days with normal or high humidity), the variables included are weather measurements in the morning, with one measurement, namely relative humidity, in the afternoon. The idea is to use the morning weather values to predict whether the day will be low-humidity or not, based on the afternoon measurement of relative humidity.

Each row, or sample, consists of the following variables:

* **number:** unique number for each row
* **air_pressure_9am:** air pressure averaged over a period from 8:55am to 9:04am (*Unit: hectopascals*)
* **air_temp_9am:** air temperature averaged over a period from 8:55am to 9:04am (*Unit: degrees Fahrenheit*)
* **avg_wind_direction_9am:** wind direction averaged over a period from 8:55am to 9:04am (*Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise*)
* **avg_wind_speed_9am:** wind speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **max_wind_direction_9am:** wind gust direction averaged over a period from 8:55am to 9:10am (*Unit: degrees, with 0 being North and increasing clockwise*)
* **max_wind_speed_9am:** wind gust speed averaged over a period from 8:55am to 9:04am (*Unit: miles per hour*)
* **rain_accumulation_9am:** amount of rain accumulated in the 24 hours prior to 9am (*Unit: millimeters*)
* **rain_duration_9am:** amount of time rain was recorded in the 24 hours prior to 9am (*Unit: seconds*)
* **relative_humidity_9am:** relative humidity averaged over a period from 8:55am to 9:04am (*Unit: percent*)
* **relative_humidity_3pm:** relative humidity averaged over a period from 2:55pm to 3:04pm (*Unit: percent*)


In [4]:
data.head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
0,0,918.06,74.822,271.1,2.080354,295.4,2.863283,0.0,0.0,42.42,36.16
1,1,917.347688,71.403843,101.935179,2.443009,140.471548,3.533324,0.0,0.0,24.328697,19.426597
2,2,923.04,60.638,51.0,17.067852,63.7,22.100967,0.0,20.0,8.9,14.46
3,3,920.502751,70.138895,198.832133,4.337363,211.203341,5.190045,0.0,0.0,12.189102,12.742547
4,4,921.16,44.294,277.8,1.85666,136.5,2.863283,8.9,14730.0,92.41,76.74


In [5]:
data[data.isnull().any(axis=1)].head()

Unnamed: 0,number,air_pressure_9am,air_temp_9am,avg_wind_direction_9am,avg_wind_speed_9am,max_wind_direction_9am,max_wind_speed_9am,rain_accumulation_9am,rain_duration_9am,relative_humidity_9am,relative_humidity_3pm
16,16,917.89,,169.2,2.192201,196.8,2.930391,0.0,0.0,48.99,51.19
111,111,915.29,58.82,182.6,15.613841,189.0,,0.0,0.0,21.5,29.69
177,177,915.9,,183.3,4.719943,189.9,5.346287,0.0,0.0,29.26,46.5
262,262,923.596607,58.380598,47.737753,10.636273,67.145843,13.671423,0.0,,17.990876,16.461685
277,277,920.48,62.6,194.4,2.751436,,3.869906,0.0,0.0,52.58,54.03


### Data Cleaning Steps

#### We do not need the *number* column, since it only numbers the rows, so we drop it.

In [6]:
del data['number']

#### Now let's drop null values using the pandas *dropna* function.

In [7]:
before_rows = data.shape[0]
print(before_rows)

1095


In [8]:
data = data.dropna()

In [9]:
after_rows = data.shape[0]
print(after_rows)

1064


In [10]:
#How many rows dropped due to cleaning?
before_rows - after_rows

31
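Before dropping rows, it can help to see which columns the missing values come from. The following is a minimal sketch on a hypothetical miniature table (the values are made up and are not taken from **daily_weather.csv**); it counts nulls per column and confirms that the rows removed by *dropna* are exactly the rows with at least one null.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the weather table (made-up values).
toy = pd.DataFrame({
    'air_temp_9am': [74.8, np.nan, 60.6, 70.1],
    'max_wind_speed_9am': [2.9, 3.5, np.nan, 5.2],
    'rain_duration_9am': [0.0, 0.0, 20.0, 0.0],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Rows with at least one missing value are the rows that dropna() removes.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(rows_with_nulls, rows_dropped)
```

Running the same two checks on `data` would show whether the dropped rows are spread across columns or concentrated in one sensor.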

### Convert to a Classification Task
Binarize relative_humidity_3pm to 0 or 1: the label is 1 when the 3pm relative humidity is above 24.99 percent, and 0 otherwise.


In [11]:
clean_data = data.copy()
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99)*1
print(clean_data['high_humidity_label'].head(10))

0    1
1    0
2    0
3    0
4    1
5    1
6    0
7    1
8    0
9    1
Name: high_humidity_label, dtype: int32
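The thresholding step can be checked in isolation. This sketch applies the same rule to the first five relative_humidity_3pm values shown in this notebook; the names `humidity_3pm` and `THRESHOLD` are introduced here for illustration only. It uses `astype(int)` instead of multiplying by 1, which gives the same 0/1 labels.

```python
import pandas as pd

# First five relative_humidity_3pm values from the notebook's data.head() output.
humidity_3pm = pd.Series([36.16, 19.426597, 14.46, 12.742547, 76.74],
                         name='relative_humidity_3pm')

THRESHOLD = 24.99  # same cutoff as in the cell above

# Strictly greater than the threshold gives 1, everything else gives 0.
labels = (humidity_3pm > THRESHOLD).astype(int)
print(labels.tolist())  # [1, 0, 0, 0, 1]

# Class balance: worth checking before interpreting accuracy.
print(labels.value_counts())
```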


### Store the target in 'y'

In [12]:
y=clean_data[['high_humidity_label']].copy()
#y

In [13]:
clean_data['relative_humidity_3pm'].head()

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64

In [14]:
y.head()

Unnamed: 0,high_humidity_label
0,1
1,0
2,0
3,0
4,1


### Use 9am Sensor Signals as Features to Predict Humidity at 3pm

In [15]:
morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am','avg_wind_speed_9am',
        'max_wind_direction_9am','max_wind_speed_9am','rain_accumulation_9am',
        'rain_duration_9am']

In [16]:
X = clean_data[morning_features].copy()

In [17]:
X.columns

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am'],
      dtype='object')

In [18]:
y.columns

Index(['high_humidity_label'], dtype='object')

### Perform Test and Train split

#### Training and Testing Phase:

In the **training phase**, the learning algorithm uses the training data to adjust the model’s parameters to minimize errors.  At the end of the training phase, you get the trained model.

In the **testing phase**, the trained model is applied to test data.  Test data is separate from the training data, and is previously unseen by the model.  The model is then evaluated on how it performs on the test data.  The goal in building a classifier model is to have the model perform well on training as well as test data.


### sklearn.model_selection.train_test_split:
Split arrays or matrices into random train and test subsets.

**Parameters**:
* *test_size*(float, int, or None (default is None)): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size. If train size is also None, test size is set to 0.25.
* *train_size*(float, int, or None (default is None)): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
* *random_state*(int or RandomState): Pseudo-random number generator state used for random sampling.

**Returns**: list containing train-test split of inputs.

Link: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
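To see how *test_size* translates into row counts, here is a small sketch on placeholder arrays with the same number of rows as the cleaned data (1064). The arrays `X_demo` and `y_demo` are assumptions made for illustration; only their length matters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1064  # rows left after dropna() in this notebook

# Placeholder features and labels; only the number of rows matters here.
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)
y_demo = np.arange(n_samples) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324)
print(X_tr.shape, X_te.shape)  # (712, 2) (352, 2)

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324)
print(np.array_equal(X_te, X_te2))  # True
```

The test set size is rounded up (0.33 × 1064 = 351.12 gives 352 rows), which leaves 712 rows for training.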

In [20]:
#print(type(X_train))
#print(type(X_test))
#print(type(y_train))
#print(type(y_test))
#print(X_train.head())
#print(y_train.describe())

## Use GridSearchCV to find the optimum parameter values:

In [21]:
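from sklearn.metrics import make_scorer
#Note: sklearn.grid_search and sklearn.cross_validation were deprecated in scikit-learn 0.18 and removed in 0.20.
#On current releases, import GridSearchCV and ShuffleSplit from sklearn.model_selection instead
#(ShuffleSplit there takes n_splits rather than n and n_iter).
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import ShuffleSplit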



### sklearn.model_selection.ShuffleSplit:
*ShuffleSplit(n_splits=10, test_size='default', train_size=None, random_state=None)*

Random permutation cross-validator. Yields indices to split data into training and test sets.

Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

**Parameters for version 0.17**:
* *n*(int): Total number of elements in the dataset.
* *n_iter*(int (default 10)): Number of re-shuffling & splitting iterations.
* *test_size*(float (default 0.1), int, or None): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size.
* *train_size*(float, int, or None (default is None)): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
* *random_state*(int or RandomState): Pseudo-random number generator state used for random sampling.

**Parameters for version 0.19**:
* *n_splits*(int, default 10): Number of re-shuffling & splitting iterations.
* *test_size*(float, int, None, default=0.1): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default (that is, when the parameter is unspecified), the value is set to 0.1. The default will change in version 0.21. It will remain 0.1 only if train_size is unspecified, otherwise it will complement the specified train_size.
* *train_size*(float, int, or None, default=None): If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
* *random_state*(int, RandomState instance or None, optional (default=None)): If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
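With the *sklearn.model_selection* version of ShuffleSplit, the dataset size is no longer passed to the constructor; the indices come from calling *split* on the data. A minimal sketch on a ten-row placeholder array (`X_demo` is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Placeholder data: 10 rows, 2 columns.
X_demo = np.arange(20).reshape(10, 2)

# Three random 80/20 splits; random_state makes them reproducible.
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

splits = list(splitter.split(X_demo))
for fold, (train_idx, test_idx) in enumerate(splits):
    # Each split is a pair of positional index arrays.
    print(fold, train_idx, test_idx)
```

Unlike k-fold, the test sets of different iterations are drawn independently, so a row can appear in more than one test set or in none.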

### sklearn.metrics.make_scorer:

*make_scorer(score_func, greater_is_better=True, needs_proba=False, needs_threshold=False, \*\*kwargs)*

Make a scorer from a performance metric or loss function. This factory function wraps scoring functions for use in GridSearchCV and cross_val_score. It takes a score function, such as accuracy_score, mean_squared_error, adjusted_rand_index or average_precision and returns a callable that scores an estimator’s output.

**Parameters**:
* *score_func*(callable): Score function (or loss function) with signature score_func(y, y_pred, \*\*kwargs).
* *greater_is_better*(boolean, default=True): Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.
* *needs_proba*(boolean, default=False): Whether score_func requires predict_proba to get probability estimates out of a classifier.
* *needs_threshold*(boolean, default=False): Whether score_func takes a continuous decision certainty. This only works for binary classification using estimators that have either a decision_function or predict_proba method. For example average_precision or the area under the roc curve can not be computed using discrete predictions alone.
* *\*\*kwargs*(additional arguments): Additional parameters to be passed to score_func.

**Returns**:	
* *scorer*(callable): Callable object that returns a scalar score; greater is better.
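The effect of *greater_is_better* is easiest to see on a tiny example. In this sketch the four-point dataset and the names `tree`, `accuracy_scorer`, and `loss_scorer` are made up for illustration. A scorer is called as `scorer(estimator, X, y)`, and a scorer built from a loss function returns the negated loss so that larger values are always better.

```python
import numpy as np
from sklearn.metrics import accuracy_score, make_scorer, zero_one_loss
from sklearn.tree import DecisionTreeClassifier

# Tiny, perfectly separable training set (made up for illustration).
X_fit = np.array([[0], [1], [2], [3]])
y_fit = np.array([0, 0, 1, 1])
tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

# Evaluation labels that disagree with the tree's prediction on one row:
# the tree predicts [0, 0, 1, 1] for X_fit.
y_eval = np.array([0, 1, 1, 1])

accuracy_scorer = make_scorer(accuracy_score)                      # score: higher is better
loss_scorer = make_scorer(zero_one_loss, greater_is_better=False)  # loss: sign is flipped

acc = accuracy_scorer(tree, X_fit, y_eval)
neg_loss = loss_scorer(tree, X_fit, y_eval)
print(acc, neg_loss)  # 0.75 -0.25
```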

In [22]:
#Estimator to be used in GridSearchCV. Any parameter values that should stay fixed during the search must be set here.
dct=DecisionTreeClassifier(random_state=0)

#Cross-validation to be used in GridSearchCV
cv_sets = ShuffleSplit(X_train.shape[0], n_iter = 10, test_size = 0.20, random_state = 0) 

#Parameters to be searched in GridSearchCV
params = {'max_leaf_nodes': list(range(2,26)), 'max_depth':list(range(2,26))}

#Scoring function to be used in GridSearchCV.
#We can either use make_scorer like the following, or we can directly use scoring='accuracy' in GridSearchCV
def performance_metric(y_true, y_predict):
    score = accuracy_score(y_true, y_predict)
    return score
scoring_fnc = make_scorer(performance_metric)

In [23]:
print(cv_sets)

ShuffleSplit(712, n_iter=10, test_size=0.2, random_state=0)


### sklearn.model_selection.GridSearchCV

*GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2\*n_jobs', error_score='raise', return_train_score=True)*

Exhaustive search over specified parameter values for an estimator. Important members are fit, predict. GridSearchCV implements a "fit" and a "score" method. It also implements "predict", "predict_proba", "decision_function", "transform" and "inverse_transform" if they are implemented in the estimator used. The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

**Parameters**:
* ***estimator***(estimator object): This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed.
* ***param_grid***(dict or list of dictionaries): Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
* ***scoring***(string, callable, list/tuple, dict or None, default: None): A single string (see [The scoring parameter: defining model evaluation rules](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)) or a callable (see [Defining your scoring strategy from metric functions](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring)) to evaluate the predictions on the test set. For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values. NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each. If None, the estimator’s default scorer (if available) is used.
* ***cv***(int, cross-validation generator or an iterable, optional): Determines the cross-validation splitting strategy. Possible inputs for cv are:
    * *None*, to use the default 3-fold cross validation.
    * *integer*, to specify the number of folds in a (Stratified)KFold.
    * *An object* to be used as a cross-validation generator.
    * *An iterable* yielding train, test splits.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

**Attributes**:
* *cv\_results\_*(dict of numpy (masked) ndarrays): A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.
* ***best\_estimator\_***(estimator or dict): Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
* ***best\_score\_***(float): Mean cross-validated score of the best_estimator. For multi-metric evaluation, this is present only if refit is specified.
* ***best\_params\_***(dict): Parameter setting that gave the best results on the hold out data. For multi-metric evaluation, this is present only if refit is specified.
* *scorer\_*(function or a dict): Scorer function used on the held out data to choose the best parameters for the model. For multi-metric evaluation, this attribute holds the validated scoring dict which maps the scorer key to the scorer callable.
* *n\_splits\_*(int): The number of cross-validation splits (folds/iterations).

**Methods**:
* ***fit(X[, y, groups])***: Run fit with all sets of parameters.
* *get_params([deep])*: Get parameters for this estimator.
* *predict(X)*: Call predict on the estimator with the best found parameters.
* *predict_proba(X)*: Call predict_proba on the estimator with the best found parameters.
* *score(X[, y])*: Returns the score on the given data, if the estimator has been refit.
* *transform(X)*: Call transform on the estimator with the best found parameters.
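For readers on a current scikit-learn release, the same kind of search can be sketched with the *sklearn.model_selection* API. The data here is synthetic (generated with `make_classification`) and the grid is deliberately smaller than the one in this notebook, so the numbers it prints are not comparable to the notebook's results. Note that the older *grid_scores_* attribute was removed in scikit-learn 0.20; *cv_results_* carries the same information.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Five random 80/20 splits, analogous to cv_sets in the notebook.
cv_demo = ShuffleSplit(n_splits=5, test_size=0.20, random_state=0)

# A small grid: 3 x 3 = 9 parameter combinations.
param_grid = {'max_depth': [2, 4, 6], 'max_leaf_nodes': [4, 8, 16]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid=param_grid, scoring='accuracy', cv=cv_demo)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays that loads directly into a DataFrame.
results = pd.DataFrame(search.cv_results_)
print(results[['param_max_depth', 'param_max_leaf_nodes',
               'mean_test_score', 'std_test_score']].head())
print(search.best_params_, search.best_score_)
```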

In [24]:
grid = GridSearchCV(estimator=dct, param_grid=params, scoring='accuracy', cv=cv_sets)

In [25]:
grid = grid.fit(X_train, y_train)

In [26]:
grid.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
            max_features=None, max_leaf_nodes=21,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [27]:
grid.best_score_

0.76993006993007

In [28]:
grid.grid_scores_[:10]

[mean: 0.73427, std: 0.02866, params: {'max_depth': 2, 'max_leaf_nodes': 2},
 mean: 0.71119, std: 0.02168, params: {'max_depth': 2, 'max_leaf_nodes': 3},
 mean: 0.71958, std: 0.03196, params: {'max_depth': 2, 'max_leaf_nodes': 4},
 mean: 0.74126, std: 0.03620, params: {'max_depth': 2, 'max_leaf_nodes': 5},
 mean: 0.73986, std: 0.03061, params: {'max_depth': 2, 'max_leaf_nodes': 6},
 mean: 0.74755, std: 0.03531, params: {'max_depth': 2, 'max_leaf_nodes': 7},
 mean: 0.76713, std: 0.03607, params: {'max_depth': 2, 'max_leaf_nodes': 8},
 mean: 0.76713, std: 0.03607, params: {'max_depth': 2, 'max_leaf_nodes': 9},
 mean: 0.76713, std: 0.03607, params: {'max_depth': 2, 'max_leaf_nodes': 10},
 mean: 0.76713, std: 0.03607, params: {'max_depth': 2, 'max_leaf_nodes': 11}]

In [29]:
grid.best_params_

{'max_depth': 7, 'max_leaf_nodes': 21}

## Use KFold Cross Validation:

Cross-validation is a technique for evaluating predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. It is used to assess the predictive performance of models and to judge how they perform on data they were not trained on.

The motivation for cross-validation is that when we fit a model, we fit it to a training dataset. Without cross-validation, we only know how the model performs on that in-sample data. Ideally, we would like to know how accurate its predictions are on new data.

In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.

For classification problems, one typically uses stratified k-fold cross-validation, in which the folds are selected so that each fold contains roughly the same proportions of class labels.

### sklearn.model_selection.KFold

*KFold(n_splits=3, shuffle=False, random_state=None)*

K-Folds cross-validator provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.

**Parameters**:	
* ***n_splits***(int, default=3): Number of folds. Must be at least 2.
* *shuffle*(boolean, optional): Whether to shuffle the data before splitting into batches.
* *random_state*(int, RandomState instance or None, optional, default=None): If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.

**Methods**:
* *get_n_splits([X, y, groups])*: Returns the number of splitting iterations in the cross-validator
* ***split(X[, y, groups])***: Generate indices to split data into training and test set. Returns training and testing set indices for that split.
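The difference between plain and stratified k-fold shows up clearly on a six-row toy dataset (the arrays below are made up for illustration). Without shuffling, KFold cuts the rows into consecutive blocks, while StratifiedKFold keeps the class proportions roughly equal in every fold. In current scikit-learn releases the default *n_splits* is 5 rather than 3, so the sketch passes it explicitly.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Six samples, sorted by class (made-up data).
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 0, 0, 1, 1, 1])

# Plain k-fold: consecutive blocks of rows become the validation folds.
kf_demo = KFold(n_splits=3)
kfold_test_folds = [test_idx.tolist() for _, test_idx in kf_demo.split(X_demo)]
print(kfold_test_folds)  # [[0, 1], [2, 3], [4, 5]]

# Stratified k-fold: each validation fold gets one sample of each class.
skf_demo = StratifiedKFold(n_splits=3)
strat_test_folds = [test_idx.tolist() for _, test_idx in skf_demo.split(X_demo, y_demo)]
print(strat_test_folds)
```

With the plain split, the first validation fold contains only class 0, which is the situation stratification is meant to avoid.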

In [30]:
from sklearn.model_selection import KFold

In [31]:
kf=KFold(n_splits=3)

In [32]:
model_humidity_classifier = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=21, random_state=0)
fold_scores = []

#kf.split yields two arrays of positional row indices per fold: one for training, one for validation.
for train_idx, val_idx in kf.split(X_train, y_train):
    #Fit on the training part of this fold; .iloc selects rows by position.
    model_humidity_classifier.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    #Score (mean accuracy) on the held-out part of this fold.
    fold_score = model_humidity_classifier.score(X_train.iloc[val_idx], y_train.iloc[val_idx])
    #Collect the per-fold accuracy.
    fold_scores.append(fold_score)

#Average the per-fold accuracies.
np.mean(fold_scores)

0.76407356191421716
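The manual loop above can usually be replaced by *cross_val_score*, which clones the estimator, fits it on each training fold, and returns one score per fold. This is a sketch on synthetic data (generated with `make_classification`, an assumption for illustration), so its scores are unrelated to the value printed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=21, random_state=0)

# One accuracy value per fold, using the same 3-fold scheme as the loop above.
scores = cross_val_score(clf, X_demo, y_demo, cv=KFold(n_splits=3), scoring='accuracy')
print(scores, scores.mean())
```

On the notebook's data, passing `X_train` and `y_train['high_humidity_label']` in place of the synthetic arrays should give the same per-fold accuracies as the loop, since the folds and the estimator settings are identical.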

### Fit on Train Set

### sklearn.tree.DecisionTreeClassifier
#### Parameters:
* **criterion**(string, optional (default=”gini”)): The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
* **max_depth**(int or None, optional (default=None)): The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
* **min_samples_split**(int, float, optional (default=2)): The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
* **min_samples_leaf**(int, float, optional (default=1)): The minimum number of samples required to be at a leaf node. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.
* **max_leaf_nodes**(int or None, optional (default=None)): Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
* **class_weight**(dict, list of dicts, “balanced” or None, optional (default=None)): Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
* **random_state**(int, RandomState instance or None, optional (default=None)): If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
* **min_impurity_split**(float, optional (default=1e-7)): Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

#### Methods:
* **fit(X, y[, sample_weight, check_input, ...])**: Build a decision tree classifier from the training set (X, y).
* **score(X, y[, sample_weight])**: Returns the mean accuracy on the given test data and labels
* **predict(X[, check_input])**: Predict class or regression value for X.
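To illustrate what *max_depth* and *max_leaf_nodes* do, the sketch below fits an unconstrained tree and a constrained one on synthetic data (generated with `make_classification`, an assumption for illustration) and compares their size and training accuracy. It relies on `get_depth()` and `get_n_leaves()`, which are available in scikit-learn 0.21 and later.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# No limits: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Same limits as the model used in this notebook.
small_tree = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=21,
                                    random_state=0).fit(X_demo, y_demo)

for name, tree in [('full', full_tree), ('constrained', small_tree)]:
    # Depth, number of leaves, and accuracy on the data the tree was fit on.
    print(name, tree.get_depth(), tree.get_n_leaves(), tree.score(X_demo, y_demo))
```

The unconstrained tree memorizes the training data, so its training accuracy says little about how it would do on unseen rows; limiting depth and leaf count trades some training accuracy for a simpler model.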

In [33]:
model_humidity_classifier = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=21, random_state=0) #Used the .best_params_ value from GridSearchCV
model_humidity_classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=7,
            max_features=None, max_leaf_nodes=21,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [34]:
model_humidity_classifier.score(X_train, y_train)

0.8595505617977528

In [35]:
type(model_humidity_classifier)

sklearn.tree.tree.DecisionTreeClassifier

### Predict on Test Set

In [36]:
predictions = model_humidity_classifier.predict(X_test)

In [37]:
predictions[:10]

array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

In [38]:
y_test['high_humidity_label'][:10]

456     0
845     0
693     1
259     1
723     1
224     1
300     1
442     0
585     1
1057    1
Name: high_humidity_label, dtype: int32

### Measure Accuracy of the Classifier

### sklearn.metrics.accuracy_score
* **y_true**: Ground truth (correct) labels.
* **y_pred**: Predicted labels, as returned by a classifier.
* **normalize**(default=True): If *False*, returns the number of correctly classified samples. If *True*, returns the fraction of correctly classified samples.
* **Returns**: If normalize == True, returns the fraction of correctly classified samples (float); otherwise returns the number of correctly classified samples (int). The best performance is 1 with normalize == True and the number of samples with normalize == False.

In [39]:
accuracy_score(y_true = y_test, y_pred = predictions)

0.80965909090909094
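The *normalize* parameter switches between a fraction and a raw count. A minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative and unrelated to the weather data):

```python
from sklearn.metrics import accuracy_score

# Four made-up labels; the third prediction is wrong.
y_true_demo = [0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0]

fraction_correct = accuracy_score(y_true_demo, y_pred_demo)                # default normalize=True
count_correct = accuracy_score(y_true_demo, y_pred_demo, normalize=False)  # raw count

print(fraction_correct)  # 0.75
print(count_correct)     # 3
```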