# Feature Engineering

### Features
Feature engineering is the process of creating, transforming, and selecting the features (input variables) that are most relevant to the task.

![f1.jpg](attachment:f1.jpg)

# Feature Selection For Machine Learning

### Univariate Selection

In [None]:
# Feature Selection with Univariate Statistical Tests
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
dataframe

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
array = dataframe.values
array

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [None]:
X = array[:,0:8]
Y = array[:,8]

In [None]:
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X,Y)

In [None]:
# summarize scores
set_printoptions(precision = 3)
fit.scores_

array([ 39.67 , 213.162,   3.257,   4.304,  13.281,  71.772,  23.871,
        46.141])

In [None]:
features = fit.transform(X)
# summarize selected features
features[0:5,:]

array([[  6. , 148. ,  33.6,  50. ],
       [  1. ,  85. ,  26.6,  31. ],
       [  8. , 183. ,  23.3,  32. ],
       [  1. ,  89. ,  28.1,  21. ],
       [  0. , 137. ,  43.1,  33. ]])
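
The columns kept by `SelectKBest` can be read off the scores: the `k` highest F-scores win. The following is a minimal sketch that reuses the scores printed above and the column names from the loading cell, so it runs without scikit-learn. With a fitted selector, `fit.get_support()` should give the equivalent boolean mask.

```python
# Map the univariate scores back to column names and pick the top k.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
scores = [39.67, 213.162, 3.257, 4.304, 13.281, 71.772, 23.871, 46.141]
k = 4

# Indices of the k largest scores, reported in original column order.
by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top_idx = sorted(by_score[:k])
selected = [names[i] for i in top_idx]
print(selected)
```

The selected names line up with the four columns of the transformed array shown above (first row: 6, 148, 33.6, 50).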

### Recursive Feature Elimination

In [None]:
# Feature Selection with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
# feature extraction
model = LogisticRegression(solver='liblinear')
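# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(model, n_features_to_select=3)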
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
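
The RFE output is easier to read once the mask and ranking are mapped back to column names. This is a small sketch using the values printed above: rank 1 marks a selected feature, and larger ranks mark features that were eliminated earlier.

```python
# Interpret RFE's support mask and ranking using the column names.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 3, 5, 6, 1, 1, 4]

# Features that survived elimination (rank 1).
kept = [name for name, keep in zip(names, support) if keep]

# Eliminated features, ordered from first dropped (highest rank) to last dropped.
dropped = sorted(((r, n) for n, r in zip(names, ranking) if r > 1), reverse=True)
elimination_order = [name for _, name in dropped]

print("Kept:", kept)
print("Eliminated (first to last):", elimination_order)
```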




### Principal Component Analysis

In [None]:
# Feature Extraction with PCA
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
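
Two quick readings of the PCA output are the cumulative explained variance and the original column that dominates each component. The sketch below works directly on the numbers printed above. It is an illustration, not part of the original notebook.

```python
# Summarize the PCA output: cumulative variance and dominant column per component.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
explained = [0.889, 0.062, 0.026]
components = [
    [-2.022e-03, 9.781e-02, 1.609e-02, 6.076e-02, 9.931e-01, 1.401e-02, 5.372e-04, -3.565e-03],
    [-2.265e-02, -9.722e-01, -1.419e-01, 5.786e-02, 9.463e-02, -4.697e-02, -8.168e-04, -1.402e-01],
    [-2.246e-02, 1.434e-01, -9.225e-01, -3.070e-01, 2.098e-02, -1.324e-01, -6.400e-04, -1.255e-01],
]

# Running total of the explained variance ratios.
cumulative = []
total = 0.0
for ratio in explained:
    total += ratio
    cumulative.append(round(total, 3))

# Column with the largest absolute loading in each component.
dominant = [names[max(range(len(row)), key=lambda i: abs(row[i]))] for row in components]

print("Cumulative explained variance:", cumulative)
print("Dominant column per component:", dominant)
```

The first component is almost entirely the `test` column (loading 0.993), which suggests that column's raw scale dominates the result. Standardizing the columns before PCA is a common way to avoid this.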


### Feature Importance

In [None]:
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
# feature extraction
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X, Y)
print(model.feature_importances_)

[0.108 0.242 0.099 0.082 0.073 0.141 0.116 0.139]
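
The importances are easier to compare when sorted and paired with column names. This sketch uses the values printed above. Because the classifier above is not given a `random_state`, the exact numbers are expected to differ somewhat between runs.

```python
# Rank the columns by the importance scores printed above.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
importances = [0.108, 0.242, 0.099, 0.082, 0.073, 0.141, 0.116, 0.139]

ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("%-5s %.3f" % (name, score))

# The importances are normalized, so they should sum to about 1.
total = sum(importances)
print("Sum: %.3f" % total)
```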


### References

http://scikit-learn.org/stable/modules/feature_selection.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.



# Evaluate Machine Learning Algorithms

Model evaluation typically uses three distinct sets of labeled data: (1) a training set, (2) a validation set, and (3) a test set.
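
A three-way split can be sketched with the standard library alone. The training set fits the model, the validation set is used to tune it, and the test set is held back for the final estimate. The function name `three_way_split` and the 60/20/20 fractions are assumptions chosen for illustration, not values from this notebook.

```python
import random

def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=7):
    """Return shuffled index lists for the training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

# 768 is the number of rows in the Pima dataset used in this notebook.
train_idx, val_idx, test_idx = three_way_split(768)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index lists can then be used to slice `X` and `Y`, for example `X[train_idx]` when `X` is a NumPy array.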

![tt3.png](attachment:tt3.png)

### Split into Train and Test Sets

In [None]:
# Evaluate using a train and a test set
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 75.591%


![training.png](attachment:training.png)

### k-fold Cross-Validation

![cross1.JPG](attachment:cross1.JPG)

In [None]:
# Evaluate using Cross Validation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 77.086% (5.091%)
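
To see what the folds look like, here is a minimal sketch of how k-fold splitting partitions the row indices. The helper `kfold_indices` is hypothetical and leaves out the shuffling that `KFold(shuffle=True)` performs. Each row lands in exactly one test fold, and the model is trained on the remaining rows.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

fold_sizes = [len(test) for _, test in kfold_indices(768, 10)]
print(fold_sizes)
```

With 768 rows and 10 folds, each model is trained on roughly 90% of the data and scored on the remaining 10%. The mean and standard deviation printed above summarize those 10 scores.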


# Performance Metrics

Evaluation metrics for regression and classification

![metrics.png](attachment:metrics.png)

## a. Classification Metrics

### Classification accuracy

![accuracy1.png](attachment:accuracy1.png)

In [None]:
# Cross Validation Classification Accuracy
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
scoring = 'accuracy'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

Accuracy: 0.771 (0.051)


### Logistic loss

In [None]:
# Cross Validation Classification LogLoss
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
scoring = 'neg_log_loss'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

Logloss: -0.494 (0.042)
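
Log loss penalizes confident wrong probabilities. For binary labels it is the negative mean of `y*log(p) + (1-y)*log(1-p)`. scikit-learn's `neg_log_loss` scorer reports the negated value so that higher is better, which is why the result above is negative: it corresponds to a log loss of 0.494. The sketch below computes the metric by hand on made-up labels and probabilities.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1]          # made-up labels
p_pred = [0.9, 0.2, 0.6]    # made-up predicted probabilities of class 1
loss = log_loss(y_true, p_pred)
print("Logloss: %.3f" % loss)
print("As a scikit-learn style score: %.3f" % -loss)
```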


### Area Under ROC Curve

AUC is commonly used to compare the performance of various models while precision/recall/F-measure can help determine the appropriate threshold to use for prediction purposes.

![ROC_curve.png](attachment:ROC_curve.png)


The classifier shown by the red dashed line guesses the label at random. The closer a ROC curve gets to the top-left corner of the chart, the better the classifier. The area under the curve (AUC) is the area below the ROC curve, so AUC is a useful single-number indicator of how well a classifier separates the classes.
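
AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on made-up labels and scores. The function name `roc_auc` is an illustration, not a library call.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("AUC: %.3f" % auc)
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.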

In [None]:
# Cross Validation Classification ROC AUC
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LogisticRegression(solver='liblinear')
scoring = 'roc_auc'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))

AUC: 0.826 (0.050)


### Confusion Matrix

![confusion1.png](attachment:confusion1.png)

In [None]:
# Classification Confusion Matrix (single train/test split)
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

[[141  21]
 [ 41  51]]
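
The usual classification metrics can be derived from this matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the first row holds the true negatives and false positives and the second row the false negatives and true positives. A minimal sketch using the matrix printed above:

```python
# Derive accuracy, precision, recall and F1 for class 1 from the confusion matrix.
matrix = [[141, 21],
          [41, 51]]
tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:  %.3f" % accuracy)
print("Precision: %.3f" % precision)
print("Recall:    %.3f" % recall)
print("F1:        %.3f" % f1)
```

The accuracy agrees with the 75.591% reported for the same train/test split earlier in the notebook.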


### Classification Report

In [None]:
# Classification Report (single train/test split)
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)
print(report)

              precision    recall  f1-score   support

         0.0       0.77      0.87      0.82       162
         1.0       0.71      0.55      0.62        92

    accuracy                           0.76       254
   macro avg       0.74      0.71      0.72       254
weighted avg       0.75      0.76      0.75       254
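
The `macro avg` row is the plain mean of the per-class values, while `weighted avg` weights each class by its support. The sketch below reproduces both for precision, using per-class counts taken from the confusion matrix `[[141, 21], [41, 51]]` obtained on the same split.

```python
# Per-class precision from the confusion matrix: correct / all predicted as that class.
precision = {0: 141 / (141 + 41), 1: 51 / (51 + 21)}
support = {0: 162, 1: 92}  # number of true samples per class

macro = sum(precision.values()) / len(precision)
weighted = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print("macro avg precision:    %.2f" % macro)
print("weighted avg precision: %.2f" % weighted)
```

The two averages differ most when the classes are imbalanced and the per-class scores are far apart.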



## b. Regression Metrics

### Mean Absolute Error

![mae.png](attachment:mae.png)

In [None]:
# Cross Validation Regression MAE
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = read_csv(filename, sep=r'\s+', names=names)  # delim_whitespace is deprecated in recent pandas
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
scoring = 'neg_mean_absolute_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: %.3f (%.3f)" % (results.mean(), results.std()))

MAE: -3.387 (0.667)
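
The MAE is the average of the absolute differences between predictions and true values. scikit-learn scorers follow a "higher is better" convention, so `neg_mean_absolute_error` returns the negated MAE. The value above therefore corresponds to an MAE of 3.387 in the units of the target. A hand-computed sketch on made-up numbers:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.5]   # made-up targets
y_pred = [2.5, 5.0, 4.0]   # made-up predictions
mae = mean_absolute_error(y_true, y_pred)
neg_mae_score = -mae       # what a 'neg_mean_absolute_error' style scorer would report

print("MAE: %.3f" % mae)
print("Score: %.3f" % neg_mae_score)
```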


### Mean Squared Error

![mse.png](attachment:mse.png)

In [None]:
# Cross Validation Regression MSE
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = read_csv(filename, sep=r'\s+', names=names)  # delim_whitespace is deprecated in recent pandas
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))

MSE: -23.747 (11.143)
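
The MSE averages the squared errors, so large errors are penalized more heavily than in the MAE, and the result is in squared units of the target. Taking the square root (RMSE) brings it back to the original units. The sketch below computes both on made-up numbers and converts the mean score printed above. Note that the square root of the mean MSE is only a rough stand-in for the mean of the per-fold RMSE values.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

mse = mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])  # made-up values
rmse = math.sqrt(mse)
print("MSE: %.3f  RMSE: %.3f" % (mse, rmse))

# Flip the sign of the cross-validated score above, then take the square root.
reported_mse = 23.747
reported_rmse = math.sqrt(reported_mse)
print("Approximate RMSE from the reported MSE: %.3f" % reported_rmse)
```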


### R^2

![r2.JPG](attachment:r2.JPG)

In [None]:
# Cross Validation Regression R^2
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = read_csv(filename, sep=r'\s+', names=names)  # delim_whitespace is deprecated in recent pandas
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]

In [None]:
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
model = LinearRegression()
scoring = 'r2'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: %.3f (%.3f)" % (results.mean(), results.std()))

R^2: 0.718 (0.099)
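
R^2 compares the model's squared error with that of a baseline that always predicts the mean of the targets: a value of 1 means a perfect fit, and 0 means no better than the baseline. A hand-computed sketch on made-up numbers (the helper name `r2` is an illustration):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]   # made-up targets
y_pred = [2.5, 0.0, 2.0, 8.0]    # made-up predictions
score = r2(y_true, y_pred)
print("R^2: %.3f" % score)

# Always predicting the mean of the targets gives R^2 = 0.
baseline = r2(y_true, [sum(y_true) / len(y_true)] * len(y_true))
print("Baseline R^2: %.3f" % baseline)
```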


### Reference
http://scikit-learn.org/stable/modules/model_evaluation.html




# Machine Learning (ML)
ML is a subfield of AI and computer science. ML provides a platform for machines to learn useful things from past data and improve from experience without human intervention. ML techniques, tools, and algorithms are used to produce quality predictions and estimates. This chapter introduces some essential components, algorithms, and tools of ML.
- ML is a branch of AI and computer science.  
- ML uses data and algorithms to imitate the way humans learn in order to improve a machine’s accuracy.  
- ML allows the user to feed a computer algorithm an immense amount of data and have the computer analyse it and make data-driven recommendations and decisions based only on the input data.  
- ML is based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention.  
- ML gives systems the ability to automatically learn and improve from experience without being explicitly programmed.  
- ML depends on mathematics and statistics.

### Brief history of Machine Learning (ML)
- Arthur Samuel coined the phrase “machine learning” in the 1950s.  
- ML is used to discover patterns in data and then make predictions based on those, often complex, patterns in order to answer business questions, detect and analyse trends, and help solve problems.

### Some important algorithms used in ML

Classification in ML
- Classification in ML refers to a predictive modelling problem where a class label is predicted for a given example of input data.  
- A classification algorithm learns from previous (past) data.  
- Classification is one of the primary uses of data science and ML.  
- A common example of classification in ML is spam filtering of emails.

Regression in ML

- A regression algorithm learns from previous (past) data.  
- A regression algorithm outputs a numeric value.  
- ML regression methods allow us to predict a continuous outcome variable (y) based on the value of one or multiple predictor variables (X).  
- A common example of regression is weather forecasting (for example, predicting the amount of rain in an area).

ML clustering

- ML clustering is also called cluster analysis.  
- ML clustering is an unsupervised ML task.  
- ML clustering involves automatically discovering natural groupings in data.  
- ML clustering is used to organize data into structures that are more easily understood and manipulated.  
- ML clustering is used for identifying and grouping similar data points in larger datasets without concern for a specific outcome.

Clustering algorithm in ML 

- Clustering algorithms take data as input and output clusters of that data.  
- Clustering algorithms only interpret the input data and find natural groups or clusters in feature space.  
- A common example of a clustering algorithm is grouping houses or land in a particular area into similar price bands.
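
As an illustration of how a clustering algorithm finds natural groups without labels, here is a minimal sketch of k-means (a centroid-based method) on one-dimensional data. The function name `kmeans_1d`, the data points, and the starting centroids are all made up for this example.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Lloyd's algorithm on 1-D data with user-supplied starting centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied: the two groups emerge only from the distances between the points.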

Different types of Clustering 

- Connectivity-based clustering (hierarchical clustering)  
- Centroid-based clustering (partitioning methods)  
- Distribution-based clustering (model-based methods)  
- Density-based clustering  
- Fuzzy clustering  
- Constraint-based clustering (supervised clustering)
 

### Types of ML

Supervised ML 
- Supervised ML is the ML task of learning a function that maps an input to an output based on example input-output pairs.
- Supervised ML uses past data to make predictions.  
- Supervised ML uses classification and regression ML algorithms. 
- A common example of supervised ML is the spam filtering of e-mails.
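
To make "learning from example input-output pairs" concrete, the sketch below uses a one-nearest-neighbour rule: a new input gets the label of the closest labeled example. The features (number of links, number of exclamation marks) and the data are made up for illustration, and the helper `predict_1nn` is hypothetical.

```python
def predict_1nn(train_X, train_y, x):
    """Return the label of the training example closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]

# Made-up labeled examples: (number of links, number of exclamation marks) -> label
train_X = [(0, 0), (1, 0), (7, 5), (9, 8)]
train_y = ['ham', 'ham', 'spam', 'spam']

label_a = predict_1nn(train_X, train_y, (8, 6))
label_b = predict_1nn(train_X, train_y, (0, 1))
print(label_a, label_b)
```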

Unsupervised ML 
- Unsupervised ML refers to a type of algorithm that learns patterns from untagged data.  
- Unsupervised ML deals with unlabelled data.  
- Unsupervised ML finds hidden patterns. Through mimicry, the machine is forced to build a compact internal representation of its world and then generate imaginative content.  
- Unsupervised ML allows the model to work on its own to discover patterns and information that were previously undetected.  
- Unsupervised ML uses clustering and association ML algorithms.  
- A common example of unsupervised ML is Facebook.

Reinforcement ML 
- Reinforcement ML learns by trial and error from reward and penalty feedback, and is used for improving or increasing efficiency.

### Popular ML software tools
Scikit-learn 
- Scikit-learn is a software tool designed for ML development in Python.  
- Scikit-learn provides a library for the Python programming language.  
- Scikit-learn provides models and algorithms for classification, regression, clustering, dimensionality reduction, model selection and pre-processing.
- Scikit-learn helps in data mining and data analysis.

Other ML tools you may want to explore
- PyTorch
- TensorFlow
- TensorFlow.js
- Keras.io
- Weka
- KNIME
- Google Colab
- Accord.NET
- Apache Mahout
- Shogun
- RapidMiner

### Some common applications of ML 
- Image recognition  
- Speech recognition  
- Traffic prediction  
- Product recommendation  
- Self-driving cars  
- E-mail spam and malware filtering  
- Virtual personal assistant  
- Online fraud detection
- Data-driven drug discovery