## Feature Selection Assignment


---

This task works with the data set produced in the previous **Preprocessing assignment**. The following topics are addressed in this notebook.


*   **Part 1**: PCA on the preprocessed data
*   **Part 2**: Inconsistency Rate on the preprocessed data


Before we begin, we must resolve an issue in the preprocessed data. It contains a large amount of interpolated data, which would cause problems during feature selection and in the later parts. We therefore restrict the data to the time range covered by the main station.

time series range for station a: 2007-09-01 00:00:00 to 2017-12-31 23:00:00

time series range for station b: 2000-01-01 00:00:00 to 2017-12-31 23:00:00

time series range for station c: 2003-06-26 15:00:00 to 2017-12-31 23:00:00

time series range for station main: 2014-01-01 01:00:00 to 2017-12-31 23:00:00

We will use the date range **2014-01-01 01:00:00 to 2017-12-31 23:00:00** so that we do not have to deal with the interpolated data.
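Note that the main station's range starts at 01:00 on 2014-01-01, while a filter on the calendar year alone also keeps the 00:00 row of that day. The following is a minimal sketch of selecting the exact timestamp range with a label slice on a `DatetimeIndex`. The small frame `demo` and its contents are hypothetical stand-ins for `df_prep`, not the real data.

```python
import pandas as pd

# Hypothetical hourly frame standing in for df_prep (six rows around New Year 2014).
idx = pd.date_range("2013-12-31 22:00:00", periods=6, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"level_cm": range(6)}, index=idx)

# Label slicing on a sorted DatetimeIndex includes both endpoints.
exact = demo.loc["2014-01-01 01:00:00":"2017-12-31 23:00:00"]

# Filtering on the year alone also keeps the 2014-01-01 00:00:00 row.
by_year = demo[(demo.index.year >= 2014) & (demo.index.year <= 2017)]

print(exact.index[0])            # first timestamp kept by the exact slice
print(len(exact), len(by_year))  # the year filter keeps one extra row
```

Whether the single extra hour matters depends on how the 00:00 value of the main station was produced during preprocessing.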

In [162]:
import pandas as pd


In [163]:
df_prep = pd.read_csv('processed_data.csv',parse_dates=['time'],index_col='time')
df_prep.head()

| time | temp_c_a | status_a | rain_mm_a | temp_c_b | status_b | rain_mm_b | temp_c_c | status_c | rain_mm_c | level_cm | flow_m2_s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2000-01-01 00:00:00 | 0.360573 | 4 | -0.144319 | -1.073449 | 4 | -0.269803 | 1.271075 | 4 | -0.144319 | 0.049463 | 0.089526 |
| 2000-01-01 01:00:00 | 0.360573 | 4 | -0.144319 | -1.073449 | 4 | -0.125129 | 1.271075 | 4 | -0.144319 | 0.049463 | 0.089526 |
| 2000-01-01 02:00:00 | 0.360573 | 4 | -0.144319 | -1.062132 | 4 | 0.019546 | 1.271075 | 4 | -0.144319 | 0.049463 | 0.089526 |
| 2000-01-01 03:00:00 | 0.360573 | 4 | -0.144319 | -1.062132 | 4 | -0.125129 | 1.271075 | 4 | -0.144319 | 0.049463 | 0.089526 |
| 2000-01-01 04:00:00 | 0.360573 | 4 | -0.144319 | -1.028178 | 4 | -0.269803 | 1.271075 | 4 | -0.144319 | 0.049463 | 0.089526 |


In [164]:
df_prep_filtered = df_prep[(df_prep.index.year >= 2014) & (df_prep.index.year <= 2017)]

df_prep_filtered.head()

| time | temp_c_a | status_a | rain_mm_a | temp_c_b | status_b | rain_mm_b | temp_c_c | status_c | rain_mm_c | level_cm | flow_m2_s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2014-01-01 00:00:00 | -1.478435 | 3 | -0.144319 | -1.299806 | 4 | -0.269803 | -1.090793 | 3 | -0.144319 | 0.049463 | 0.089526 |
| 2014-01-01 01:00:00 | -1.425892 | 3 | -0.144319 | -1.446937 | 4 | -0.269803 | -1.115017 | 3 | -0.144319 | 0.049463 | 0.089526 |
| 2014-01-01 02:00:00 | -1.460921 | 3 | -0.144319 | -1.028178 | 3 | -0.269803 | -1.151353 | 3 | -0.144319 | 0.049463 | 0.089526 |
| 2014-01-01 03:00:00 | -1.548493 | 3 | -0.144319 | -0.892364 | 3 | -0.269803 | -1.224026 | 3 | -0.144319 | 0.049463 | 0.089526 |
| 2014-01-01 04:00:00 | -1.530978 | 3 | -0.144319 | -0.982907 | 3 | -0.269803 | -1.199802 | 3 | -0.144319 | 0.049463 | 0.089526 |


## PCA

We have to conduct the PCA on the water level data. Since we want to predict the water level and the water flow with a model later on, `level_cm` and `flow_m2_s` are the dependent variables here. We therefore exclude these two columns from the PCA.

In [165]:
feature_df = df_prep_filtered.copy(deep=True)
feature_df.drop(['level_cm','flow_m2_s'], axis=1,inplace=True)
feature_df.head()

| time | temp_c_a | status_a | rain_mm_a | temp_c_b | status_b | rain_mm_b | temp_c_c | status_c | rain_mm_c |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2014-01-01 00:00:00 | -1.478435 | 3 | -0.144319 | -1.299806 | 4 | -0.269803 | -1.090793 | 3 | -0.144319 |
| 2014-01-01 01:00:00 | -1.425892 | 3 | -0.144319 | -1.446937 | 4 | -0.269803 | -1.115017 | 3 | -0.144319 |
| 2014-01-01 02:00:00 | -1.460921 | 3 | -0.144319 | -1.028178 | 3 | -0.269803 | -1.151353 | 3 | -0.144319 |
| 2014-01-01 03:00:00 | -1.548493 | 3 | -0.144319 | -0.892364 | 3 | -0.269803 | -1.224026 | 3 | -0.144319 |
| 2014-01-01 04:00:00 | -1.530978 | 3 | -0.144319 | -0.982907 | 3 | -0.269803 | -1.199802 | 3 | -0.144319 |


Now that the data is filtered to the desired date range and the target columns are removed, we apply the PCA to it.

In [166]:
from sklearn.decomposition import PCA


In [167]:
pca = PCA(n_components=5)
pc = pca.fit_transform(feature_df)
pc

array([[ 1.10643176, -0.88577592, -1.32582795,  0.98343112, -0.7494891 ],
       [ 1.14569799, -0.89814812, -1.34415303,  1.00506128, -0.77464971],
       [ 0.63006016, -0.77481738, -1.78513886,  0.77970121,  0.01301309],
       ...,
       [ 1.72597895, -0.43882251,  0.61399677, -1.08964425,  0.31363828],
       [ 1.51566038,  0.24214565,  0.21009369, -0.5634909 , -0.06163653],
       [ 2.197583  ,  1.7021362 ,  0.2650064 , -0.8069806 ,  0.14059193]])

In [168]:
pca.explained_variance_ratio_


array([0.38471052, 0.28512807, 0.12556059, 0.10235067, 0.0718178 ])
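For intuition about what the explained variance ratios and the components represent, here is a minimal NumPy sketch of PCA via the singular value decomposition of the centered data. It uses a small synthetic matrix `X` as a hypothetical stand-in for `feature_df`, so its numbers are unrelated to the output above. Results from a library implementation may differ in the sign of individual components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for feature_df: 200 rows, 4 correlated features (assumption).
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# PCA works on mean-centered columns.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured along each principal axis and its share of the total.
explained_variance = S**2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

k = 2
components = Vt[:k]          # each row is a unit-length vector of feature weights
scores = Xc @ components.T   # the data projected onto the first k components

print(explained_variance_ratio.round(3))
print(scores.shape)
```

The ratios are sorted in decreasing order and sum to 1 over all components, which is why the first few entries can be read as "share of the total variance kept".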

### Choice of Dimensionality reduction

From the variance ratios we see that the first 5 principal components together explain more than 95% of the variance in the feature space (the five ratios sum to about 0.97). By keeping 5 components instead of the 9 available features, we reduce the dimensionality by about 44%.
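The choice of 5 components can be checked by accumulating the variance ratios. The sketch below copies the ratios from the `explained_variance_ratio_` output of this notebook; the variable names `target` and `n_needed` are illustrative only.

```python
import numpy as np

# Ratios copied from the explained_variance_ratio_ output of the fitted PCA.
ratios = np.array([0.38471052, 0.28512807, 0.12556059, 0.10235067, 0.0718178])
cumulative = np.cumsum(ratios)

target = 0.95  # variance threshold named in the text
# Position of the first cumulative value that reaches the target, as a count.
# (If the target were never reached, this would return len(ratios) + 1.)
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative.round(4))
print(n_needed)
```

As an alternative, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then selects the number of components needed to reach that share of variance.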

In [169]:
pca.components_

array([[-0.49399214,  0.4044924 ,  0.11526204, -0.38761832,  0.38290993,
         0.02548859, -0.33814134,  0.39233886,  0.11526204],
       [ 0.14953414,  0.04865534,  0.68618173,  0.11981471, -0.08229623,
         0.01734898,  0.10735765,  0.02503154,  0.68618173],
       [ 0.4410909 ,  0.28421595, -0.08252267,  0.2319699 ,  0.51161755,
         0.47850785,  0.3042927 ,  0.27187085, -0.08252267],
       [-0.2138098 , -0.40075035,  0.07404741, -0.19949932,  0.15457504,
         0.72257706, -0.1449682 , -0.41939441,  0.07404741],
       [-0.12096945,  0.29726692, -0.05797542,  0.14246803, -0.72053007,
         0.49662862, -0.08904437,  0.3102368 , -0.05797542]])

In [170]:
components = pd.DataFrame(pca.components_,columns=feature_df.columns,index = ['PC-1','PC-2','PC-3','PC-4','PC-5'])
components

|  | temp_c_a | status_a | rain_mm_a | temp_c_b | status_b | rain_mm_b | temp_c_c | status_c | rain_mm_c |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PC-1 | -0.493992 | 0.404492 | 0.115262 | -0.387618 | 0.38291 | 0.025489 | -0.338141 | 0.392339 | 0.115262 |
| PC-2 | 0.149534 | 0.048655 | 0.686182 | 0.119815 | -0.082296 | 0.017349 | 0.107358 | 0.025032 | 0.686182 |
| PC-3 | 0.441091 | 0.284216 | -0.082523 | 0.23197 | 0.511618 | 0.478508 | 0.304293 | 0.271871 | -0.082523 |
| PC-4 | -0.21381 | -0.40075 | 0.074047 | -0.199499 | 0.154575 | 0.722577 | -0.144968 | -0.419394 | 0.074047 |
| PC-5 | -0.120969 | 0.297267 | -0.057975 | 0.142468 | -0.72053 | 0.496629 | -0.089044 | 0.310237 | -0.057975 |


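The table above lists the loadings (weights) of each original feature from stations A, B and C on the 5 principal components that span the new feature space. The larger the absolute value, the more the feature contributes to that component.

For PC-1, the temperature and status of stations A, B and C have the largest loadings, whereas the rain features contribute little.

For PC-2, the rain readings (in mm) from stations A and C dominate, both with a loading of 0.686. The rain from station B contributes almost nothing (0.017).

For PC-3, the rain from station B has a large loading. The temperature and status of all 3 stations also contribute.

For PC-4, the rain record of station B dominates (0.72). The statuses of stations A and C have moderate negative loadings (about -0.40), and the rest contribute little.

For PC-5, the status of station B has the largest loading (-0.72), followed by the rain from station B (0.50). The statuses of stations A and C contribute moderately (about 0.30).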
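The reading of the component table can be made reproducible by listing, for each component, the features whose absolute loading exceeds a cutoff. This is a sketch only: the values are copied from the table in this notebook, and the cutoff `threshold = 0.3` is an arbitrary assumption, not a value used by the original analysis.

```python
import pandas as pd

features = ["temp_c_a", "status_a", "rain_mm_a", "temp_c_b", "status_b",
            "rain_mm_b", "temp_c_c", "status_c", "rain_mm_c"]

# Loadings copied from the components table (rows: components, columns: features).
loadings = pd.DataFrame(
    [
        [-0.493992, 0.404492, 0.115262, -0.387618, 0.382910, 0.025489, -0.338141, 0.392339, 0.115262],
        [0.149534, 0.048655, 0.686182, 0.119815, -0.082296, 0.017349, 0.107358, 0.025032, 0.686182],
        [0.441091, 0.284216, -0.082523, 0.231970, 0.511618, 0.478508, 0.304293, 0.271871, -0.082523],
        [-0.213810, -0.400750, 0.074047, -0.199499, 0.154575, 0.722577, -0.144968, -0.419394, 0.074047],
        [-0.120969, 0.297267, -0.057975, 0.142468, -0.720530, 0.496629, -0.089044, 0.310237, -0.057975],
    ],
    columns=features,
    index=["PC-1", "PC-2", "PC-3", "PC-4", "PC-5"],
)

threshold = 0.3  # hypothetical cutoff for calling a feature "impactful"

# For every component, keep the features whose absolute loading reaches the cutoff.
dominant = {
    pc: row[row.abs() >= threshold].index.tolist()
    for pc, row in loadings.iterrows()
}

for pc, names in dominant.items():
    print(pc, names)
```

A different cutoff shifts the borderline cases (for example the loadings close to 0.3 in PC-3 and PC-5), so the list should be read as a guide rather than a strict rule.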


## Inconsistency Rate Analysis

The data set has 3 columns that are ordinal: the status columns for stations A, B and C. We have encoded them with integer codes; the codes 1 to 5 appear in the selected date range.

In [171]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later.
df_ordinal = feature_df[['status_a','status_b','status_c']].copy()
df_ordinal.head()

| time | status_a | status_b | status_c |
| --- | --- | --- | --- |
| 2014-01-01 00:00:00 | 3 | 4 | 3 |
| 2014-01-01 01:00:00 | 3 | 4 | 3 |
| 2014-01-01 02:00:00 | 3 | 3 | 3 |
| 2014-01-01 03:00:00 | 3 | 3 | 3 |
| 2014-01-01 04:00:00 | 3 | 3 | 3 |


`df_ordinal` contains only the ordinal columns.

In [172]:
df_ordinal['level_cm']=df_prep_filtered.level_cm.round()
df_ordinal['level_cm'].unique()


array([ 0.,  1.,  2.,  3.,  4., -1., -2.,  5.,  6.,  7.,  8.,  9., 10.])

In [173]:
df_ordinal['flow_m2_s'] = df_prep_filtered.flow_m2_s.round()
df_ordinal['flow_m2_s'].unique()


array([ 0.,  1.,  2.,  3., -1., -2.,  4.,  5.,  6.,  7.,  8.,  9., 10.,
       11., 12., 13., 14.])

As we can see, both the rounded level and the rounded flow contain negative values, which are not physically possible, so we map them to zero.

In [174]:
# Set negative values to zero with a single .loc call (avoids chained assignment).
df_ordinal.loc[df_ordinal['level_cm'] < 0, 'level_cm'] = 0
df_ordinal['level_cm'].unique()


array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

In [175]:
# Set negative values to zero with a single .loc call (avoids chained assignment).
df_ordinal.loc[df_ordinal['flow_m2_s'] < 0, 'flow_m2_s'] = 0
df_ordinal['flow_m2_s'].unique()


array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12.,
       13., 14.])

In [176]:
df_ordinal.head()

| time | status_a | status_b | status_c | level_cm | flow_m2_s |
| --- | --- | --- | --- | --- | --- |
| 2014-01-01 00:00:00 | 3 | 4 | 3 | 0.0 | 0.0 |
| 2014-01-01 01:00:00 | 3 | 4 | 3 | 0.0 | 0.0 |
| 2014-01-01 02:00:00 | 3 | 3 | 3 | 0.0 | 0.0 |
| 2014-01-01 03:00:00 | 3 | 3 | 3 | 0.0 | 0.0 |
| 2014-01-01 04:00:00 | 3 | 3 | 3 | 0.0 | 0.0 |


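1) The ordinal status columns serve as features, and we discretize the `level_cm` and `flow_m2_s` columns by rounding so that their values can be used as classes.

2) Next, we create a pattern string by concatenating the three status values, so that each combination of statuses forms a unique pattern.

3) For each pattern, we calculate the inconsistency count (IZ) as the number of rows with that pattern minus the number of rows in its most frequent level class. The inconsistency rate (IR) is then the sum of all IZ values divided by the total number of rows.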
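The three steps above can be sketched as a single helper. This is a minimal illustration under the assumption that there are no missing values; the function name `inconsistency_rate` and the toy frame are hypothetical and are not part of the notebook's own code.

```python
import pandas as pd

def inconsistency_rate(df, feature_cols, target_col):
    """Share of rows that do not belong to the majority class of their pattern."""
    # Number of rows for every (pattern, class) combination.
    counts = df.groupby(feature_cols + [target_col]).size()
    # Group those counts by pattern only.
    per_pattern = counts.groupby(level=feature_cols)
    # Inconsistency count per pattern: all rows minus the majority class.
    iz = per_pattern.sum() - per_pattern.max()
    # Assumes no missing values, so every row of df belongs to one group.
    return iz.sum() / len(df)

# Hypothetical toy data: pattern (1, 1) is split 2:1 between two levels,
# pattern (2, 1) always maps to the same level.
toy = pd.DataFrame({
    "status_a": [1, 1, 1, 2, 2],
    "status_b": [1, 1, 1, 1, 1],
    "level":    [0, 0, 1, 3, 3],
})

ir = inconsistency_rate(toy, ["status_a", "status_b"], "level")
print(ir)
```

In the toy data only one of the five rows deviates from the majority class of its pattern, so the helper reports one inconsistent row out of five.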


In [177]:
df_ordinal['pattern'] = df_ordinal['status_a'].astype('str') + df_ordinal['status_b'].astype('str') + df_ordinal['status_c'].astype('str')
df_ordinal.head()


| time | status_a | status_b | status_c | level_cm | flow_m2_s | pattern |
| --- | --- | --- | --- | --- | --- | --- |
| 2014-01-01 00:00:00 | 3 | 4 | 3 | 0.0 | 0.0 | 343 |
| 2014-01-01 01:00:00 | 3 | 4 | 3 | 0.0 | 0.0 | 343 |
| 2014-01-01 02:00:00 | 3 | 3 | 3 | 0.0 | 0.0 | 333 |
| 2014-01-01 03:00:00 | 3 | 3 | 3 | 0.0 | 0.0 | 333 |
| 2014-01-01 04:00:00 | 3 | 3 | 3 | 0.0 | 0.0 | 333 |


In [178]:
patterns = df_ordinal.pattern.unique()
flow_class = df_ordinal.flow_m2_s.unique()
level_class = df_ordinal.level_cm.unique()
# Helper column: summing it within a group gives the number of rows in that group.
df_ordinal['pattern_count'] = 1


In [188]:
# Count how often each status pattern occurs at each discretized level.
df_ir_level = df_ordinal.groupby(['level_cm','pattern']).sum()
df_ir_level = df_ir_level[['pattern_count']]
df_ir_level.reset_index(inplace=True)
df_ir_level.head()

|  | level_cm | pattern | pattern_count |
| --- | --- | --- | --- |
| 0 | 0.0 | 111 | 159 |
| 1 | 0.0 | 112 | 30 |
| 2 | 0.0 | 114 | 2 |
| 3 | 0.0 | 121 | 128 |
| 4 | 0.0 | 122 | 51 |


In [180]:
df_ir_level[df_ir_level.pattern == '121']

|  | level_cm | pattern | pattern_count |
| --- | --- | --- | --- |
| 3 | 0.0 | 121 | 128 |
| 95 | 1.0 | 121 | 3 |


In [181]:
# Total number of rows per pattern, summed over all levels.
df_ir_level_sum = df_ir_level.groupby(['pattern']).sum()
df_ir_level_sum.drop(['level_cm'],axis=1,inplace=True)
df_ir_level_sum.rename(columns={'pattern_count':'pattern_count_sum'},inplace=True)
df_ir_level_sum

| pattern | pattern_count_sum |
| --- | --- |
| 111 | 159 |
| 112 | 30 |
| 114 | 2 |
| 121 | 131 |
| 122 | 51 |
| ... | ... |
| 544 | 1234 |
| 545 | 2055 |
| 553 | 30 |
| 554 | 1123 |


In [182]:
# Size of the largest (majority) level class per pattern.
# max() is taken per column, so level_cm below is the highest level seen for
# the pattern, not necessarily the level of the majority class.
df_ir_level_max = df_ir_level.groupby(['pattern']).max()
df_ir_level_max.rename(columns={'pattern_count':'pattern_count_max'},inplace=True)
df_ir_level_max

| pattern | level_cm | pattern_count_max |
| --- | --- | --- |
| 111 | 0.0 | 159 |
| 112 | 0.0 | 30 |
| 114 | 0.0 | 2 |
| 121 | 1.0 | 128 |
| 122 | 0.0 | 51 |
| ... | ... | ... |
| 544 | 10.0 | 961 |
| 545 | 9.0 | 1302 |
| 553 | 2.0 | 26 |
| 554 | 10.0 | 944 |


In [183]:
df_ir_level_combined = df_ir_level_sum.join(df_ir_level_max)
df_ir_level_combined.head()

| pattern | pattern_count_sum | level_cm | pattern_count_max |
| --- | --- | --- | --- |
| 111 | 159 | 0.0 | 159 |
| 112 | 30 | 0.0 | 30 |
| 114 | 2 | 0.0 | 2 |
| 121 | 131 | 1.0 | 128 |
| 122 | 51 | 0.0 | 51 |


In [184]:
# Inconsistency count per pattern: rows outside the majority level class.
df_ir_level_combined['IZ'] = df_ir_level_combined.pattern_count_sum - df_ir_level_combined.pattern_count_max
df_ir_level_combined.head()

| pattern | pattern_count_sum | level_cm | pattern_count_max | IZ |
| --- | --- | --- | --- | --- |
| 111 | 159 | 0.0 | 159 | 0 |
| 112 | 30 | 0.0 | 30 | 0 |
| 114 | 2 | 0.0 | 2 | 0 |
| 121 | 131 | 1.0 | 128 | 3 |
| 122 | 51 | 0.0 | 51 | 0 |


In [187]:
IR = df_ir_level_combined['IZ'].sum() / df_ir_level_combined['pattern_count_sum'].sum()
IR

0.22456080310289755