### Weather Data Clustering using k-Means

In this notebook, we will learn how to perform k-means clustering using scikit-learn in Python.

We will use cluster analysis to build a big-picture model of the weather at a local station from minute-granularity data. The dataset contains about 1.6 million records. How do we group them into 12 clusters?

NOTE: The dataset we will use is in a large CSV file called minute_weather.csv. Please download it into the weather directory in your Week-7-MachineLearning folder. The download link is: https://drive.google.com/open?id=0B8iiZ7pSaSFZb3ItQ1l4LWRMTjg



#### Importing the necessary libraries

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
from itertools import cycle,islice
import matplotlib.pyplot as plt
# parallel_coordinates is used by parallel_plot() below; pandas.tools.plotting was the old location
from pandas.plotting import parallel_coordinates
%matplotlib inline

#### Creating a Pandas DataFrame from a CSV file

In [4]:
data = pd.read_csv('minute_weather.csv')


#### Minute Weather Data Description


The minute weather dataset comes from the same source as the daily weather dataset that we used in the decision-tree-based classifier notebook. The main difference between the two is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data is in the file minute_weather.csv, which is a comma-separated file.

As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:

rowID: unique number for each row (Unit: NA)
hpwren_timestamp: timestamp of the measurement (Unit: year-month-day hour:minute:second)
air_pressure: air pressure measured at the timestamp (Unit: hectopascals)
air_temp: air temperature measured at the timestamp (Unit: degrees Fahrenheit)
avg_wind_direction: wind direction averaged over the minute before the timestamp (Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise)
avg_wind_speed: wind speed averaged over the minute before the timestamp (Unit: meters per second)
max_wind_direction: highest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
max_wind_speed: highest wind speed in the minute before the timestamp (Unit: meters per second)
min_wind_direction: smallest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
min_wind_speed: smallest wind speed in the minute before the timestamp (Unit: meters per second)
rain_accumulation: amount of accumulated rain measured at the timestamp (Unit: millimeters)
rain_duration: length of time rain has fallen as measured at the timestamp (Unit: seconds)
relative_humidity: relative humidity measured at the timestamp (Unit: percent)

In [6]:
data.shape

(1587257, 13)

In [7]:
data.head()

Unnamed: 0,rowID,hpwren_timestamp,air_pressure,air_temp,avg_wind_direction,avg_wind_speed,max_wind_direction,max_wind_speed,min_wind_direction,min_wind_speed,rain_accumulation,rain_duration,relative_humidity
0,0,2011-09-10 00:00:49,912.3,64.76,97.0,1.2,106.0,1.6,85.0,1.0,,,60.5
1,1,2011-09-10 00:01:49,912.3,63.86,161.0,0.8,215.0,1.5,43.0,0.2,0.0,0.0,39.9
2,2,2011-09-10 00:02:49,912.3,64.22,77.0,0.7,143.0,1.2,324.0,0.3,0.0,0.0,43.0
3,3,2011-09-10 00:03:49,912.3,64.4,89.0,1.2,112.0,1.6,12.0,0.7,0.0,0.0,49.5
4,4,2011-09-10 00:04:49,912.3,64.4,185.0,0.4,260.0,1.0,100.0,0.1,0.0,0.0,58.8


#### Data Sampling

The dataset has many rows, so let us downsample it by keeping every 10th row.

In [8]:
sampled_df=data[(data['rowID']%10)==0]

In [9]:
sampled_df.shape

(158726, 13)
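
The filter above keeps rows whose rowID is a multiple of 10, which amounts to "every 10th row" only if rowID runs consecutively from 0. A minimal sketch on a made-up toy frame (not the real dataset; the name `toy` and its values are purely illustrative) shows that, under that assumption, the modulo filter and positional slicing select the same rows:

```python
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    "rowID": range(25),
    "air_temp": [60.0 + 0.1 * i for i in range(25)],
})

# Same filter as in the notebook: keep rows whose rowID is a multiple of 10.
by_modulo = toy[(toy["rowID"] % 10) == 0]

# Positional alternative: every 10th row, regardless of the rowID values.
by_position = toy.iloc[::10]

print(by_modulo["rowID"].tolist())
print(by_modulo.equals(by_position))
```

If rowID had gaps, the two approaches would diverge, so the modulo filter is the one that matches what the notebook actually does.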

#### Statistics

In [10]:
sampled_df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rowID,158726.0,793625.0,458203.937509,0.0,396812.5,793625.0,1190437.5,1587250.0
air_pressure,158726.0,916.830161,3.051717,905.0,914.8,916.7,918.7,929.5
air_temp,158726.0,61.851589,11.833569,31.64,52.7,62.24,70.88,99.5
avg_wind_direction,158680.0,162.1561,95.278201,0.0,62.0,182.0,217.0,359.0
avg_wind_speed,158680.0,2.775215,2.057624,0.0,1.3,2.2,3.8,31.9
max_wind_direction,158680.0,163.462144,92.452139,0.0,68.0,187.0,223.0,359.0
max_wind_speed,158680.0,3.400558,2.418802,0.1,1.6,2.7,4.6,36.0
min_wind_direction,158680.0,166.774017,97.441109,0.0,76.0,180.0,212.0,359.0
min_wind_speed,158680.0,2.134664,1.742113,0.0,0.8,1.6,3.0,31.6
rain_accumulation,158725.0,0.000318,0.011236,0.0,0.0,0.0,0.0,3.12


In [11]:
sampled_df[sampled_df['rain_accumulation']==0].shape

(157812, 13)

In [12]:
sampled_df[sampled_df['rain_duration']==0].shape

(157237, 13)

#### Drop the rain_accumulation and rain_duration Columns, then Drop Rows with Missing Values


In [13]:
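# Drop both rain columns in one step; reassigning avoids modifying a filtered slice in place
sampled_df = sampled_df.drop(columns=['rain_accumulation', 'rain_duration'])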

In [14]:
rows_before=sampled_df.shape[0]
sampled_df=sampled_df.dropna()
rows_after=sampled_df.shape[0]

#### How many rows did we drop?

In [15]:
rows_before-rows_after

46
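
The 46 dropped rows are consistent with the describe() output above, where the wind columns report 158,680 non-null values out of 158,726 rows. To see which columns contribute missing values before calling dropna(), one option is to count them per column. A minimal sketch on a made-up toy frame (the names `toy`, `missing_per_column`, and `cleaned` are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing readings; the values are made up.
toy = pd.DataFrame({
    "air_temp": [64.8, 63.9, np.nan, 64.4],
    "avg_wind_speed": [1.2, np.nan, np.nan, 1.2],
    "relative_humidity": [60.5, 39.9, 43.0, 49.5],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
rows_before = toy.shape[0]
cleaned = toy.dropna()
rows_dropped = rows_before - cleaned.shape[0]
print(rows_dropped)
```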

In [16]:
sampled_df.columns

Index(['rowID', 'hpwren_timestamp', 'air_pressure', 'air_temp',
       'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction',
       'max_wind_speed', 'min_wind_direction', 'min_wind_speed',
       'relative_humidity'],
      dtype='object')

#### Select Features of interest for Clustering

In [17]:
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction', 
        'max_wind_speed','relative_humidity']

In [18]:
select_df=sampled_df[features]

In [19]:
select_df.columns

Index(['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed',
       'max_wind_direction', 'max_wind_speed', 'relative_humidity'],
      dtype='object')

In [20]:
select_df

Unnamed: 0,air_pressure,air_temp,avg_wind_direction,avg_wind_speed,max_wind_direction,max_wind_speed,relative_humidity
0,912.3,64.76,97.0,1.2,106.0,1.6,60.5
10,912.3,62.24,144.0,1.2,167.0,1.8,38.5
20,912.2,63.32,100.0,2.0,122.0,2.5,58.3
30,912.2,62.60,91.0,2.0,103.0,2.4,57.9
40,912.2,64.04,81.0,2.6,88.0,2.9,57.4
50,912.1,63.68,102.0,1.2,119.0,1.5,51.4
60,912.0,64.04,83.0,0.7,101.0,0.9,51.4
70,911.9,64.22,82.0,2.0,97.0,2.4,62.2
80,911.9,61.70,67.0,3.3,70.0,3.5,71.5
90,911.9,61.34,67.0,3.6,75.0,4.2,72.5


#### Scale the Features using StandardScaler

In [21]:
X=StandardScaler().fit_transform(select_df)

In [22]:
X

array([[-1.48456281,  0.24544455, -0.68385323, ..., -0.62153592,
        -0.74440309,  0.49233835],
       [-1.48456281,  0.03247142, -0.19055941, ...,  0.03826701,
        -0.66171726, -0.34710804],
       [-1.51733167,  0.12374562, -0.65236639, ..., -0.44847286,
        -0.37231683,  0.40839371],
       ...,
       [-0.30488381,  1.15818654,  1.90856325, ...,  2.0393087 ,
        -0.70306017,  0.01538018],
       [-0.30488381,  1.12776181,  2.06599745, ..., -1.67073075,
        -0.74440309, -0.04948614],
       [-0.30488381,  1.09733708, -1.63895404, ..., -1.55174989,
        -0.62037434, -0.05711747]])
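
StandardScaler rescales each column to zero mean and unit variance, so that features with large raw ranges (such as wind direction in degrees) do not dominate the Euclidean distances used by k-means. A minimal sketch on made-up numbers, checking the scaler against the formula (x - mean) / std computed by hand (the array `toy` is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns standing in for air_pressure and air_temp.
toy = np.array([
    [912.3, 64.76],
    [912.2, 62.24],
    [911.9, 61.34],
    [916.7, 70.88],
])

scaled = StandardScaler().fit_transform(toy)

# Manual standardization, column by column.
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```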

#### Use K-means Clustering

In [23]:
kmeans=KMeans(n_clusters=12)
model=kmeans.fit(X)
print("model\n",model)

model
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=12, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
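
The printed parameters show random_state=None, so the cluster numbering and the exact centers can differ from one run to the next. A minimal sketch on synthetic data (not the weather data), assuming a fixed seed is acceptable for your purposes; the names `X_toy`, `km_a`, and `km_b` are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups, used only for illustration.
X_toy, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# Fitting twice with the same random_state gives the same centers.
km_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_toy)
km_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_toy)

same_centers = np.allclose(km_a.cluster_centers_, km_b.cluster_centers_)
print(same_centers)

# One center per cluster, one label per sample.
print(km_a.cluster_centers_.shape, km_a.labels_.shape)
```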


#### What are the centers of 12 clusters we formed?

In [24]:
centers=model.cluster_centers_
centers

array([[ 0.25241034, -0.99448862,  0.6598588 , -0.54736306,  0.8511741 ,
        -0.53003448,  1.15851425],
       [-0.1637492 ,  0.86368139, -1.31102399, -0.58982352, -1.1666706 ,
        -0.60515405, -0.64101285],
       [-1.18021116, -0.87592397,  0.44688481,  1.97679897,  0.5387836 ,
         1.93808868,  0.9140965 ],
       [ 1.3665042 , -0.08103762, -1.20720301, -0.04902986, -1.07617524,
        -0.02869652, -0.9778825 ],
       [-0.83974474, -1.19871999,  0.37520796,  0.35575034,  0.47370193,
         0.34365948,  1.3624654 ],
       [ 0.13087844,  0.84358567,  1.41108021, -0.63842998,  1.67510704,
        -0.58920468, -0.71419435],
       [ 0.2339313 ,  0.31909585,  1.88794143, -0.65198177, -1.55164369,
        -0.57681439, -0.28251551],
       [-0.69641987,  0.54217151,  0.17691218, -0.58410587,  0.34631293,
        -0.597491  , -0.11346421],
       [-0.211268  ,  0.63186071,  0.40850282,  0.73468993,  0.51663704,
         0.67266119, -0.1502734 ],
       [ 1.19056013, -0.2553

#### Plots


Let us first create some utility functions which will help us in plotting graphs:

In [25]:
# Function that creates a DataFrame with a column for Cluster Number

def pd_centers(featuresUsed, centers):
    colNames = list(featuresUsed)
    colNames.append('prediction')

    # Append the cluster index to each center as a 'prediction' column
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to a pandas DataFrame for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

In [26]:
# Function that creates Parallel Plots

def parallel_plot(data):
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    plt.figure(figsize=(15,8)).gca().axes.set_ylim([-3,+3])
    parallel_coordinates(data, 'prediction', color = my_colors, marker='o')

In [27]:
P = pd_centers(features, centers)
P

Unnamed: 0,air_pressure,air_temp,avg_wind_direction,avg_wind_speed,max_wind_direction,max_wind_speed,relative_humidity,prediction
0,0.25241,-0.994489,0.659859,-0.547363,0.851174,-0.530034,1.158514,0
1,-0.163749,0.863681,-1.311024,-0.589824,-1.166671,-0.605154,-0.641013,1
2,-1.180211,-0.875924,0.446885,1.976799,0.538784,1.938089,0.914096,2
3,1.366504,-0.081038,-1.207203,-0.04903,-1.076175,-0.028697,-0.977882,3
4,-0.839745,-1.19872,0.375208,0.35575,0.473702,0.343659,1.362465,4
5,0.130878,0.843586,1.41108,-0.63843,1.675107,-0.589205,-0.714194,5
6,0.233931,0.319096,1.887941,-0.651982,-1.551644,-0.576814,-0.282516,6
7,-0.69642,0.542172,0.176912,-0.584106,0.346313,-0.597491,-0.113464,7
8,-0.211268,0.631861,0.408503,0.73469,0.516637,0.672661,-0.150273,8
9,1.19056,-0.255377,-1.155048,2.124883,-1.053466,2.242024,-1.134211,9
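
The centers in this table are in standardized units, so a value such as -0.99 for air_temp means roughly one standard deviation below the mean rather than a temperature. To read centers in the original units, one option is to keep the fitted scaler and call its inverse_transform. The notebook does not keep the scaler object, so the names `scaler` and `centers_original` below are hypothetical, and the data is synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up, well-separated groups of (air_pressure, air_temp) readings.
toy = np.vstack([
    rng.normal([912.0, 60.0], [0.5, 1.0], size=(50, 2)),
    rng.normal([920.0, 80.0], [0.5, 1.0], size=(50, 2)),
])

# Keep the fitted scaler so the transformation can be undone later.
scaler = StandardScaler().fit(toy)
X_toy = scaler.transform(toy)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Map the standardized centers back to hPa and degrees Fahrenheit.
centers_original = scaler.inverse_transform(km.cluster_centers_)
print(centers_original.round(1))
```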


### Dry Days