# Learning Notes 6 - Pandas 4

## Split Train Test

We start with two datasets:

* One holds the independent features, called `X`.

* One holds the dependent variable, called `y`.

Splitting them gives four pieces:

* `X_train` and `X_test`, `y_train` and `y_test`

Then:

* `X_train` and `y_train` are the data the machine learning algorithm uses to build a model.

* Once the model is built, feed it `X_test`; its output should be close to `y_test`.

* The closer the model output is to `y_test`, the more accurate the model is.

In [5]:
# Example

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
X

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [6]:
list(y)

[0, 1, 2, 3, 4]

In [7]:
# Then split: take 33% for the test set (the rest is used for training).

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


In [8]:
X_train

array([[4, 5],
       [0, 1],
       [6, 7]])

In [9]:
X_test

array([[2, 3],
       [8, 9]])

In [10]:
y_train

[2, 0, 3]

In [11]:
y_test

[1, 4]
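
The evaluation step described at the start of this section can be sketched as follows. This is a minimal illustration and not part of the original notebook: the synthetic data, the choice of `LinearRegression`, and the use of R² as the accuracy measure are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 33% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Build the model on the training data only.
model = LinearRegression().fit(X_train, y_train)

# Compare the predictions for X_test with the true y_test.
y_pred = model.predict(X_test)
r2 = model.score(X_test, y_test)  # R^2: 1.0 would be a perfect match
print(f"R^2 on the test set: {r2:.3f}")
```

The key point is that the model never sees `X_test` or `y_test` during fitting, so the score reflects how well it generalises to unseen rows.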

In [None]:
# Example

from sklearn.model_selection import train_test_split

# df1 is assumed to be a DataFrame loaded earlier in the notebook
X = df1['Households_log']
y = df1['RPI_log']

# 75% of the rows go to training, 25% to testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

In [None]:
from sklearn.model_selection import train_test_split

# Same split, but with many feature columns (positions 2 to 175) instead of one
X = df1.iloc[:, 2:176]
y = df1['RPI_log']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, test_size=0.25)

## Label Encoding

Label encoding transforms strings into numbers.

Since most machine learning algorithms can only process numbers, we need to convert each category label into a numerical value.

In [None]:
# scikit-learn has a built-in class for that: LabelEncoder

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

# Replace the strings in 'Quarter' with integer codes, then list the distinct codes
df1.loc[:, 'Quarter'] = label_encoder.fit_transform(df1.loc[:, 'Quarter'].values)
df1.loc[:, 'Quarter'].unique()


# better to use loc[:,'col'] syntax
# better to use .values at the end
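
To make the mapping visible, here is a small self-contained sketch. The quarter labels are made up for illustration and are not taken from `df1`. `LabelEncoder` sorts the distinct labels and assigns them the codes 0, 1, 2, and so on; `inverse_transform` maps the codes back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only for this example
quarters = ["Q3", "Q1", "Q4", "Q1", "Q2"]

encoder = LabelEncoder()
codes = encoder.fit_transform(quarters)

print(list(encoder.classes_))  # distinct labels in sorted order
print(list(codes))             # position of each label in classes_
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

Note that the scikit-learn documentation describes `LabelEncoder` as intended for target values; for input features, `OrdinalEncoder` or `OneHotEncoder` may be a better fit, depending on the model.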


## Feature Selection

Sources:

* https://hub.packtpub.com/4-ways-implement-feature-selection-python-machine-learning/
* https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
    

Feature selection is the process of automatically or manually selecting the features that contribute most to the prediction variable or output you are interested in.

What are the benefits of performing feature selection before modelling your data?

* Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.

* Improves accuracy: less misleading data means modelling accuracy improves.

* Reduces training time: fewer features reduce algorithm complexity, so algorithms train faster.

#### Methods

1. Univariate Selection

2. Feature Importance

3. Correlation Matrix with Heatmap

#### Univariate Selection

The scikit-learn library provides the SelectKBest class that can be used with a suite of different statistical tests to select a specific number of features.

Statistical tests can be used to select those features that have the strongest relationship with the output variable.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #print 10 best features
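
The cell above depends on a local CSV file, so here is a self-contained sketch of the same technique on synthetic data. The column names and the way the data is generated are assumptions made for the example. One point worth knowing: the `chi2` score function requires non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200

# Binary target, plus one feature that tracks it and two pure-noise features.
# All values are non-negative integers, as chi2 requires.
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y * 5 + rng.integers(0, 3, size=n),
    "noise_a": rng.integers(0, 10, size=n),
    "noise_b": rng.integers(0, 10, size=n),
})

# Keep only the single best-scoring feature.
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns)
selected = list(X.columns[selector.get_support()])
print(scores.sort_values(ascending=False))
print("Selected:", selected)
```

`get_support()` returns a boolean mask over the columns, which is a convenient way to recover the names of the selected features.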

#### Feature Importance

You can get the importance of each feature of your dataset from the `feature_importances_` attribute of a fitted model.
Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.
Feature importance is a built-in attribute of tree-based classifiers; here we use the Extra Trees classifier to extract the top 10 features of the dataset.

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier

import matplotlib.pyplot as plt

model = ExtraTreesClassifier()
model.fit(X,y)

print(model.feature_importances_)  # feature_importances_ is a built-in attribute of tree-based classifiers
# plot the feature importances as a bar chart for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
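
As a self-contained variant that runs without the CSV file, the sketch below fits the same classifier on synthetic data. The data, the column names, and the `n_estimators` and `random_state` values are assumptions chosen for the example. The importances are normalised, so they sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 200

# Binary target, one feature that separates the classes, two noise features.
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y * 5 + rng.integers(0, 3, size=n),
    "noise_a": rng.integers(0, 10, size=n),
    "noise_b": rng.integers(0, 10, size=n),
})

model = ExtraTreesClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# One score per column; label them with the column names and rank them.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Fixing `random_state` makes the ranking reproducible; without it the scores change slightly from run to run because the trees are built with random splits.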

#### Correlation Matrix with Heatmap

Correlation describes how the features are related to each other or to the target variable.
A heatmap makes it easy to identify which features are most related to the target variable. We will plot a heatmap of the correlated features using the seaborn library.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range

#get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))

#plot heat map
g=sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")
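
When there are many columns, the heatmap can be hard to read. A numeric ranking of each feature's correlation with the target carries the same information in compact form. The sketch below uses synthetic data with made-up column names (`strong`, `weak`, `target`), all of which are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# "strong" drives the target, "weak" is unrelated noise.
data = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
})
data["target"] = 2.0 * data["strong"] + rng.normal(scale=0.5, size=n)

# Take the target column of the correlation matrix and drop its self-correlation.
corr_with_target = data.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not overlooked.
ranking = corr_with_target.abs().sort_values(ascending=False)
print(ranking)
```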

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

X = df1.iloc[:,2:176]
y = df1['RPI_log']

features = list(X)
correlation = []
significance = []

for feature in features:
    correl = pearsonr(X[feature].values, y.values)
    correlation.append(correl[0])
    significance.append(correl[1])

df = pd.DataFrame()
df['feature'] = features
df['correlation'] = correlation
df['abs_correlation'] = np.abs(correlation)
df['significance'] = significance
df['significant'] = df['significance'] < 0.05  # flag features with p < 0.05

df.sort_values(by='abs_correlation', ascending=False, inplace=True)
df.head(30)

## Important: restack data into 1 column

In [3]:
import pandas as pd
import numpy as np

arrays = [['Amount', 'Amount', 'Amount', 'Amount', 'dwy', 'dwy', 'dwy', 'dwy', 'bmd', 'bmd', 'bmd', 'bmd'],
          ['EUR', 'GBP', 'JPY', 'USD', 'EUR', 'GBP', 'JPY', 'USD', 'EUR', 'GBP', 'JPY', 'USD']]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['Portfolio', 'Currency'])

data = [100, 200, 300, 400, -0.5, 0.5, 0, 0.8, 3.8, 3, 0, 3]

df = pd.DataFrame(data).T
df.columns = index
df.index = ['2016-05-13']
df

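| Portfolio | Amount | Amount | Amount | Amount | dwy | dwy | dwy | dwy | bmd | bmd | bmd | bmd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Currency | EUR | GBP | JPY | USD | EUR | GBP | JPY | USD | EUR | GBP | JPY | USD |
| 2016-05-13 | 100.0 | 200.0 | 300.0 | 400.0 | -0.5 | 0.5 | 0.0 | 0.8 | 3.8 | 3.0 | 0.0 | 3.0 |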


In [4]:
df.stack('Currency')

| | Currency | Amount | bmd | dwy |
|---|---|---|---|---|
| 2016-05-13 | EUR | 100.0 | 3.8 | -0.5 |
| 2016-05-13 | GBP | 200.0 | 3.0 | 0.5 |
| 2016-05-13 | JPY | 300.0 | 0.0 | 0.0 |
| 2016-05-13 | USD | 400.0 | 3.0 | 0.8 |


In [5]:
df.stack("Currency").to_records()

rec.array([('2016-05-13', 'EUR', 100., 3.8, -0.5),
           ('2016-05-13', 'GBP', 200., 3. ,  0.5),
           ('2016-05-13', 'JPY', 300., 0. ,  0. ),
           ('2016-05-13', 'USD', 400., 3. ,  0.8)],
          dtype=[('level_0', '<U10'), ('Currency', '<U3'), ('Amount', '<f8'), ('bmd', '<f8'), ('dwy', '<f8')])

In [6]:
df = pd.DataFrame(df.stack("Currency").to_records())
df

| | level_0 | Currency | Amount | bmd | dwy |
|---|---|---|---|---|---|
| 0 | 2016-05-13 | EUR | 100.0 | 3.8 | -0.5 |
| 1 | 2016-05-13 | GBP | 200.0 | 3.0 | 0.5 |
| 2 | 2016-05-13 | JPY | 300.0 | 0.0 | 0.0 |
| 3 | 2016-05-13 | USD | 400.0 | 3.0 | 0.8 |
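
A possible alternative to the `to_records()` round trip is `reset_index()`, which also turns the index levels into ordinary columns. The sketch below uses a smaller table than the one above. The index name `Date` is an assumption added so that the date column gets a readable name instead of `level_0`.

```python
import pandas as pd

# Two-level column header: Portfolio on top, Currency underneath.
columns = pd.MultiIndex.from_product(
    [["Amount", "dwy"], ["EUR", "USD"]], names=["Portfolio", "Currency"]
)
wide = pd.DataFrame(
    [[100.0, 400.0, -0.5, 0.8]], index=["2016-05-13"], columns=columns
)
wide.index.name = "Date"  # assumption: name the index so it survives as a column

# Move the Currency level from the columns into the rows,
# then flatten the (Date, Currency) index into regular columns.
long = wide.stack("Currency").reset_index()
long.columns.name = None  # drop the leftover "Portfolio" label on the header

print(long)
```

The result has one row per date and currency, with `Amount` and `dwy` as plain columns, which is usually the shape that model-fitting code expects.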
