## Loading Data

We will start off by loading our data from the **data** folder of this lecture.

Next we print the dataset, then split our **features** into **X** and our **target** into **y**.

In [48]:
import numpy as np

# load the CSV into a 2-D float array
dataset = np.loadtxt('./data/pima-indians-diabetes.data.csv', delimiter=",")
print(dataset)

# separate the data from the target attribute
# note: the slice 0:7 keeps columns 0-6 only, so column 7 is left out of X
X = dataset[:, 0:7]
y = dataset[:, 8]

[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
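
Because NumPy slices exclude the stop index, `dataset[:,0:7]` keeps columns 0 through 6 and leaves out column 7. A small sketch on a made-up array (the values and the names `toy`, `X_seven`, `X_eight` are illustrative, not taken from the dataset) shows the effect, along with the `0:8` slice that would keep every column before the target:

```python
import numpy as np

# Toy stand-in for the dataset: 3 rows, 9 columns (8 features + 1 target).
# The values are made up for illustration.
toy = np.arange(27, dtype=float).reshape(3, 9)

X_seven = toy[:, 0:7]   # columns 0..6 (stop index 7 is excluded)
X_eight = toy[:, 0:8]   # columns 0..7, i.e. every column before the target
y_toy = toy[:, 8]       # last column as the target

print(X_seven.shape, X_eight.shape, y_toy.shape)
```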


### Print our Features

In [49]:
print(X)

[[  6.    148.     72.    ...   0.     33.6     0.627]
 [  1.     85.     66.    ...   0.     26.6     0.351]
 [  8.    183.     64.    ...   0.     23.3     0.672]
 ...
 [  5.    121.     72.    ... 112.     26.2     0.245]
 [  1.    126.     60.    ...   0.     30.1     0.349]
 [  1.     93.     70.    ...   0.     30.4     0.315]]


### Print our Target Values

In [50]:
print(y)

[1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1.
 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0.
 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0.
 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0.
 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0.
 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1.
 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 1. 0.
 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0.
 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1.
 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0.
 1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1.

## Data Normalization and Standardization

Many models are sensitive to the scale of their input features. Normalization or standardization should usually be performed before running an algorithm.

**Normalization** rescales numeric values into a common range, typically 0 to 1. Note that `preprocessing.normalize`, used below, does something slightly different: it rescales each sample (row) to unit norm.

**Standardization** rescales each feature so that it has mean 0 and variance 1. It shifts and scales the data but does not make it follow a normal distribution.

Scikit-learn provides functions for both in its `preprocessing` module.

In [51]:
from sklearn import preprocessing

# normalize each sample (row) to unit L2 norm
normalized_X = preprocessing.normalize(X)

# standardize each feature (column) to zero mean and unit variance
standardized_X = preprocessing.scale(X)

In [24]:
print(normalized_X)

[[0.03494617 0.86200564 0.41935409 ... 0.         0.19569858 0.00365188]
 [0.00872683 0.74178025 0.57597054 ... 0.         0.23213358 0.00306312]
 [0.04093566 0.93640332 0.32748532 ... 0.         0.11922512 0.0034386 ]
 ...
 [0.02727338 0.66001582 0.39273669 ... 0.61092373 0.14291252 0.0013364 ]
 [0.0070043  0.8825414  0.42025781 ... 0.         0.21082934 0.0024445 ]
 [0.00804902 0.74855891 0.56343144 ... 0.         0.24469022 0.00253544]]


In [25]:
print(standardized_X)

[[ 0.63994726  0.84832379  0.14964075 ... -0.69289057  0.20401277
   0.46849198]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.69289057 -0.68442195
  -0.36506078]
 [ 1.23388019  1.94372388 -0.26394125 ... -0.69289057 -1.10325546
   0.60439732]
 ...
 [ 0.3429808   0.00330087  0.14964075 ...  0.27959377 -0.73518964
  -0.68519336]
 [-0.84488505  0.1597866  -0.47073225 ... -0.69289057 -0.24020459
  -0.37110101]
 [-0.84488505 -0.8730192   0.04624525 ... -0.69289057 -0.20212881
  -0.47378505]]
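
To make the difference between the two transformations concrete, the sketch below reproduces them with plain NumPy on a made-up array. It assumes the default settings of `preprocessing.normalize` (L2 norm, applied per row) and `preprocessing.scale` (per-column mean and population standard deviation); the variable names are illustrative.

```python
import numpy as np

# Made-up data: 3 samples (rows), 3 features (columns).
toy = np.array([[1.0, 2.0, 2.0],
                [3.0, 0.0, 4.0],
                [6.0, 8.0, 0.0]])

# Row-wise L2 normalization: divide each sample by its Euclidean length.
row_norms = np.linalg.norm(toy, axis=1, keepdims=True)
normalized = toy / row_norms

# Column-wise standardization: subtract the mean, divide by the standard deviation.
standardized = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.linalg.norm(normalized, axis=1))  # length of each row after normalization
print(standardized.mean(axis=0))           # per-column mean after standardization
print(standardized.std(axis=0))            # per-column std after standardization
```

Normalization works across a row, so it changes the relationship between a sample's features; standardization works down a column, so each feature is rescaled independently of the others.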


## Feature Selection

The most important part of solving a task is choosing, or even creating, good **features**. Choosing among existing features is called **Feature Selection**.

Good features are often more important than good methods.

There are many ready-made algorithms for **Feature Selection**. Tree-based algorithms, for example, can estimate how informative each feature is.

In [33]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)

# display the relative importance of each attribute
print(model.feature_importances_)

[0.12679597 0.272837   0.11836486 0.08862803 0.08419173 0.16895893
 0.14022348]
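
The importances are easier to read when paired with column names and sorted. The sketch below does this on synthetic data; the feature names are placeholders, and `n_estimators` and `random_state` are fixed only so that repeated runs agree.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# Synthetic target that depends only on the first column.
y_demo = (X_demo[:, 0] > 0).astype(int)

feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Sort features from most to least important.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Without a fixed `random_state`, the importances change slightly from run to run, because the trees are built with randomized splits.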


## feature_selection Module

The classes in this module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators' accuracy or to boost their performance on very high-dimensional data.

### Recursive Feature Elimination (RFE)

Given an external estimator that assigns weights to features (e.g. the coefficients of a linear model), recursive feature elimination selects features by recursively considering smaller and smaller sets of features.

First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a ``coef_`` attribute or through a ``feature_importances_`` attribute.

Then the least important features are pruned from the current set of features.

That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
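
The loop described above can be sketched in a few lines. This is a simplified illustration, not scikit-learn's implementation: it uses an ordinary least-squares fit as the estimator, ranks features by absolute coefficient, and removes one feature per round. The helper name `simple_rfe` and the synthetic data are assumptions made for the example, and the features are drawn on the same scale so that coefficient sizes are comparable.

```python
import numpy as np

def simple_rfe(X, y, n_features_to_select):
    """Return the indices of the features kept after recursive elimination."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit a linear model on the features that are still in play.
        coefs, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Drop the feature with the smallest absolute coefficient.
        weakest = int(np.argmin(np.abs(coefs)))
        del remaining[weakest]
    return remaining

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 2 influence the target; the rest are noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + 0.1 * rng.normal(size=200)

selected = simple_rfe(X_demo, y_demo, n_features_to_select=2)
print(selected)
```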



In [34]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE needs an estimator instance, not the class itself
model = LogisticRegression()

# create the RFE and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
print(rfe.ranking_)

[1 2 3 5 4 1 1]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


**Note:** The run above warns that the solver reached its iteration limit, because the data passed to it had not been normalized or standardized. Let's try the same code with standardized data.

In [40]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# create the RFE and select 3 attributes, this time on standardized data
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(standardized_X, y)
print(rfe.ranking_)

[1 1 3 5 4 1 2]
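
In `ranking_`, selected features receive rank 1, and higher numbers mark features that were eliminated earlier. A short sketch, using the ranking printed above, recovers the column indices that RFE kept (the variable names are illustrative):

```python
import numpy as np

# Ranking copied from the output of the standardized run above.
ranking = np.array([1, 1, 3, 5, 4, 1, 2])

# Rank 1 marks the selected features.
selected_columns = np.flatnonzero(ranking == 1)
print(selected_columns)
```

On a fitted `RFE` object, the boolean attribute `support_` carries the same information as `ranking_ == 1`.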


## Logistic Regression

Logistic regression is most often used for binary classification, but multi-class classification is also supported (for example with a **one-versus-rest** scheme).

In [54]:

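from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(standardized_X, y)

expected = y
# caution: the model was fitted on standardized_X but predicts on the raw X here,
# which is consistent with the report below assigning almost every sample to class 1;
# predict on standardized_X for a meaningful evaluation
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))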

              precision    recall  f1-score   support

         0.0       1.00      0.00      0.00       500
         1.0       0.35      1.00      0.52       268

    accuracy                           0.35       768
   macro avg       0.67      0.50      0.26       768
weighted avg       0.77      0.35      0.18       768
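
A model should receive inputs at prediction time that were scaled the same way as its training data. One way to guarantee this is to wrap the scaler and the classifier in a pipeline. The sketch below uses synthetic data and illustrative variable names, so its score is not comparable to the report above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
X_demo = np.column_stack([rng.normal(100.0, 20.0, size=400),
                          rng.normal(0.5, 0.1, size=400)])
y_demo = (X_demo[:, 0] > 100.0).astype(int)

# The pipeline standardizes inside both fit() and predict(),
# so the classifier never sees unscaled inputs.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_demo, y_demo)

predicted_demo = clf.predict(X_demo)
demo_accuracy = accuracy_score(y_demo, predicted_demo)
print(demo_accuracy)
```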



## Support Vector Machine

In [56]:
from sklearn import metrics
from sklearn.svm import SVC

model = SVC()
model.fit(X, y)

expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))

              precision    recall  f1-score   support

         0.0       0.77      0.92      0.84       500
         1.0       0.76      0.48      0.59       268

    accuracy                           0.77       768
   macro avg       0.76      0.70      0.71       768
weighted avg       0.76      0.77      0.75       768



## Naive Bayes

In [57]:
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)

expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))

              precision    recall  f1-score   support

         0.0       0.80      0.86      0.83       500
         1.0       0.69      0.60      0.64       268

    accuracy                           0.77       768
   macro avg       0.75      0.73      0.73       768
weighted avg       0.76      0.77      0.76       768



## k-Nearest Neighbors

In [58]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, y)

expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))

              precision    recall  f1-score   support

         0.0       0.82      0.90      0.86       500
         1.0       0.77      0.63      0.69       268

    accuracy                           0.80       768
   macro avg       0.79      0.77      0.78       768
weighted avg       0.80      0.80      0.80       768



## Decision Tree

In [59]:
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)

expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       500
         1.0       1.00      1.00      1.00       268

    accuracy                           1.00       768
   macro avg       1.00      1.00      1.00       768
weighted avg       1.00      1.00      1.00       768
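
Every report in this section is computed on the same rows the model was fitted on. An unpruned decision tree can memorize those rows, which is consistent with the perfect scores above. A hedged sketch of a held-out evaluation follows; it uses synthetic data, and the 75/25 split and `random_state` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
# Noisy target: partly driven by column 0, partly random.
y_demo = (X_demo[:, 0] + rng.normal(size=400) > 0).astype(int)

# Hold out a quarter of the rows; the tree never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Comparing the two numbers gives a rough sense of how much of the training score reflects memorization rather than patterns that carry over to unseen rows.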

