# Applied Machine Learning in Python

Most computer science problems are solved by writing an explicit series of instructions, but not every problem can be solved this way. Consider a speech-to-text conversion system: there are millions to billions of words, and pronunciation and accent differ from speaker to speaker, so it is impractical to hand-code a rule for each and every word. For this type of problem, the solution is to give the computer a learning algorithm and example data so that it can learn by itself. This concept is called machine learning.

Key types of machine learning problem
* Supervised : Learn to predict target values from labelled data
    * Classification (target values are discrete classes)
    * Regression (target values are continuous values)
* Unsupervised : Find structure in unlabelled data
    * Clustering : Find groups of similar instances in data
    * Outlier Detection : Find unusual patterns

In [53]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

fruits = pd.read_table(r'C:\Users\kkv1\Desktop\Python\DS\Applied ML in python\Dataset\fruit_data_with_colors.txt')

In [54]:
fruits.head()

Unnamed: 0,fruit_label,fruit_name,fruit_subtype,mass,width,height,color_score
0,1,apple,granny_smith,192,8.4,7.3,0.55
1,1,apple,granny_smith,180,8.0,6.8,0.59
2,1,apple,granny_smith,176,7.4,7.2,0.6
3,2,mandarin,mandarin,86,6.2,4.7,0.8
4,2,mandarin,mandarin,84,6.0,4.6,0.79


In [99]:
fruits.describe()

Unnamed: 0,fruit_label,mass,width,height,color_score
count,59.0,59.0,59.0,59.0,59.0
mean,2.542373,163.118644,7.105085,7.69322,0.762881
std,1.208048,55.018832,0.816938,1.361017,0.076857
min,1.0,76.0,5.8,4.0,0.55
25%,1.0,140.0,6.6,7.2,0.72
50%,3.0,158.0,7.2,7.6,0.75
75%,4.0,177.0,7.5,8.2,0.81
max,4.0,362.0,9.6,10.5,0.93


In [55]:
# create a mapping from fruit label value to fruit name to make results easier to interpret
lookup_fruit_name = dict(zip(fruits.fruit_label.unique(), fruits.fruit_name.unique()))   
lookup_fruit_name

{1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'}

The file contains the mass, height, and width of a selection of oranges, lemons, apples and mandarins, along with a color score for each fruit. The heights were measured along the core of the fruit. The widths were the widest width perpendicular to the height.

In any machine learning task, we split the data into two parts before training:
* Training set
* Test set

The training set is used to train the model, and the test set is used to evaluate the learned model.
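The idea behind the split can be sketched in a few lines of NumPy. This is a minimal illustration only: `simple_train_test_split` and the toy arrays are hypothetical names, and the function does not reproduce the exact rows that scikit-learn's `train_test_split` would select.

```python
import numpy as np

def simple_train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the rows, then hold out `test_fraction` of them as the test set."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))                  # random row order
    n_test = int(np.ceil(test_fraction * len(X)))    # size of the held-out part
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# toy data: 8 instances with 2 features each
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)   # 6 training rows, 2 test rows
```

Every row ends up in exactly one of the two sets, so the model is never scored on an instance it was trained on.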

Not every feature in the dataset is needed to build a model, so we keep only the features that are relevant to the model we are creating.

### Creating train_test_split

In [56]:
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [57]:
X_train.head()

Unnamed: 0,mass,width,height
42,154,7.2,7.2
48,174,7.3,10.1
7,76,5.8,4.0
14,152,7.6,7.3
32,164,7.2,7.0


In [58]:
y_train.head()

42    3
48    4
7     2
14    1
32    3
Name: fruit_label, dtype: int64

The first step in machine learning is to examine the dataset. Any visualization method can be used, or one can simply scroll through the data. Examining the data reveals:
* The type of cleaning or preprocessing that is required
* The distribution of values for each feature


In [88]:
# plotting a scatter matrix
from matplotlib import cm

X = fruits[['height', 'width', 'mass', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

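# cm.get_cmap and pd.scatter_matrix are no longer available in recent matplotlib / pandas releases
cmap = plt.get_cmap('gnuplot')
scatter = pd.plotting.scatter_matrix(X_train, c=y_train, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(9, 9), cmap=cmap)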
plt.plot()


[]

In [89]:
# plotting a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(X_train['width'], X_train['height'], X_train['color_score'], c = y_train, marker = 'o', s=100)
ax.set_xlabel('width')
ax.set_ylabel('height')
ax.set_zlabel('color_score')
plt.show()


In [61]:
X_train.head()

Unnamed: 0,height,width,mass,color_score
42,7.2,7.2,154,0.82
48,10.1,7.3,174,0.72
7,4.0,5.8,76,0.81
14,7.3,7.6,152,0.69
32,7.0,7.2,164,0.8


In [62]:
y_train.head()

42    3
48    4
7     2
14    1
32    3
Name: fruit_label, dtype: int64

### Classification
* k-NN classifiers are an example of what's called instance based or memory based supervised learning. What this means is that instance based learning methods work by memorizing the labeled examples that they see in the training set. And then they use those memorized examples to classify new objects later.
* The k in k-NN refers to the number of nearest neighbors the classifier will retrieve and use in order to make its prediction. 
#### The k-Nearest Neighbor (k-NN) classifier algorithm
* Find the most similar instances to X_test that are in X_train; call them X_NN
* Get the labels y_NN for the instances in X_NN
* Predict the label by combining the labels y_NN (for example, by majority vote)
#### A nearest neighbor algorithm needs four things specified
1. A distance metric
2. How many nearest neighbors to look at?
3. Optional weighting function on the neighbor points
4. Method of aggregating the classes of neighbor points
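The steps and the four choices above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: it assumes Euclidean distance, uniform weighting and a simple majority vote, and `knn_predict` is a hypothetical helper name. The toy rows are the five (mass, width, height) rows shown by `fruits.head()` earlier.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. distance metric: Euclidean distance to every training instance
    distances = [math.dist(x, query) for x in X_train]
    # 2. number of neighbors: keep the indices of the k smallest distances
    nearest = sorted(range(len(X_train)), key=distances.__getitem__)[:k]
    # 3. weighting: uniform, so every neighbor counts once
    # 4. aggregation: the most common label among the neighbors wins
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# (mass, width, height) for the first five fruits in the dataset
X_toy = [(192, 8.4, 7.3), (180, 8.0, 6.8), (176, 7.4, 7.2), (86, 6.2, 4.7), (84, 6.0, 4.6)]
y_toy = ['apple', 'apple', 'apple', 'mandarin', 'mandarin']

print(knn_predict(X_toy, y_toy, (90, 6.1, 4.5), k=3))   # mandarin
```

Note that mass dominates the distance here because it is on a much larger scale than width and height, which is one reason feature scaling matters for k-NN.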

### Create classifier object

In [63]:
# For this example, we use the mass, width, and height features of each fruit instance
X = fruits[['mass', 'width', 'height']]
y = fruits['fruit_label']

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [64]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 5)

#### Train the classifier (fit the estimator) using the training data

In [65]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

#### Estimate the accuracy of the classifier on future data, using the test data

In [66]:
knn.score(X_test, y_test)

0.53333333333333333

#### Use the trained k-NN classifier model to classify new, previously unseen objects

In [67]:
# first example: a small fruit with mass 20g, width 4.3 cm, height 5.5 cm
fruit_prediction = knn.predict([[20, 4.3, 5.5]])
lookup_fruit_name[fruit_prediction[0]]

'mandarin'

In [68]:
# second example: a larger, elongated fruit with mass 100g, width 6.3 cm, height 8.5 cm
fruit_prediction = knn.predict([[100, 6.3, 8.5]])
lookup_fruit_name[fruit_prediction[0]]

'lemon'

#### Plot the decision boundaries of the k-NN classifier

In [90]:
from adspy_shared_utilities import plot_fruit_knn

plot_fruit_knn(X_train, y_train, 1, 'uniform')   # k = 1 nearest neighbor
plot_fruit_knn(X_train, y_train, 5, 'uniform') 
plot_fruit_knn(X_train, y_train, 10, 'uniform')


#### How sensitive is k-NN classification accuracy to the choice of the 'k' parameter?

In [91]:
k_range = range(1,20)
scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

plt.figure()
plt.xlabel('k')
plt.ylabel('accuracy')
plt.scatter(k_range, scores)
plt.xticks([0,5,10,15,20]);
plt.show()


#### How sensitive is k-NN classification accuracy to the train/test split proportion?

In [93]:
t = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

knn = KNeighborsClassifier(n_neighbors = 5)

plt.figure()

for s in t:

    scores = []
    for i in range(1,1000):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1-s)
        knn.fit(X_train, y_train)
        scores.append(knn.score(X_test, y_test))
    plt.plot(s, np.mean(scores), 'bo')

plt.xlabel('Training set proportion')   # values are fractions (0.2 to 0.8), not percentages
plt.ylabel('accuracy');
plt.show()


In [72]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sn

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

fruits = pd.read_table(r'C:\Users\kkv1\Desktop\Python\DS\Applied ML in python\Dataset\fruit_data_with_colors.txt')

X = fruits[['height','width','mass','color_score']]
y = fruits['fruit_label']

X_train, X_test, y_train,y_test = train_test_split(X, y, random_state = 0)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train,y_train)
print('Accuracy of knn classifier on the test set:', knn.score(X_test,y_test))

example_fruit = [[5.5,2.2,10,0.70]]
print('predicted fruit type for ', example_fruit, 'is ', knn.predict(example_fruit))


Accuracy of knn classifier on the test set: 0.533333333333
predicted fruit type for  [[5.5, 2.2, 10, 0.7]] is  [2]


Supervised learning can be divided into two types:
* Classification
* Regression

Both classification and regression take a set of training instances and learn a mapping to a target value.
* For classification, the target value is a discrete class value
    * Binary: deciding whether a transaction is fraudulent or not
    * Multiclass: the target value is one of a set of discrete values (the fruit dataset is a multiclass problem)
    * Multilabel: for example, classifying web pages into multiple topics
* For regression, the target value is continuous (real-valued / floating point)
    * Example: predicting the selling price of a house from its attributes
* Looking at the target value's type will guide you on which supervised learning method to use.
* Many supervised learning methods have flavors for both classification and regression.

#### Overfitting
Overfitting typically occurs when we try to fit a complex model with an inadequate amount of training data. An overfitted model uses its ability to capture complex patterns by being great at predicting lots and lots of specific data samples or areas of local variation in the training set. But it often misses seeing global patterns in the training set that would help it generalize well on the unseen test set. 

In general, for a k-NN classifier, the risk of overfitting increases as k decreases. At the extreme, k = 1, each prediction depends on a single training point, so noise and outliers directly reshape the decision boundary.
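A small NumPy sketch makes the k = 1 case concrete. It assumes a synthetic two-class dataset in which roughly 20% of the labels are deliberately flipped as noise; `knn_accuracy` and the array names are hypothetical. With k = 1 every training point is its own nearest neighbor, so the training accuracy is perfect even though part of what was memorized is noise.

```python
import numpy as np

def knn_accuracy(X_train, y_train, X_eval, y_eval, k):
    """Accuracy of a majority-vote k-NN classifier (labels 0/1, odd k)."""
    # pairwise Euclidean distances, shape (n_eval, n_train)
    d = np.linalg.norm(X_eval[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]        # indices of the k nearest points
    votes = y_train[nearest].mean(axis=1)         # fraction of neighbors labelled 1
    return np.mean((votes > 0.5).astype(int) == y_eval)

rng = np.random.default_rng(0)
X_noisy = rng.normal(size=(60, 2))
y_noisy = (X_noisy[:, 0] + X_noisy[:, 1] > 0).astype(int)
flip = rng.random(60) < 0.2                       # mislabel roughly 20% of the points
y_noisy = np.where(flip, 1 - y_noisy, y_noisy)

for k in (1, 15):
    # scored on the training set itself
    print(k, knn_accuracy(X_noisy, y_noisy, X_noisy, y_noisy, k))
```

The training score for k = 1 is always 1.0 when the points are distinct; a larger k typically gives a lower training score because isolated noisy labels are outvoted by their neighbors.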

#### Underfitting
An underfitted model is too simple to capture the patterns in the data: it performs poorly even on the training set, so it cannot be expected to predict or classify the test values well.

#### Overfitting and Underfitting for Regression
In linear regression, an underfitted model has a high RSS (residual sum of squares), so its predictions are not accurate. An overfitted model, on the other hand, uses a higher-order curve with the intention of decreasing the RSS; it chases every training point but fails to capture the global pattern, so it generalizes poorly.
    
#### Overfitting and Underfitting for Classification
For a k-NN classifier, an underfitted model (k too large) averages over so many neighbors that the decision boundary ignores real local structure, and points are misclassified even in the training set. An overfitted model (k too small) bends the decision boundary around every individual point: in order to put an outlier inside a particular group's region it extends the boundary, which misses the obvious global pattern and leads to undesired results.

In [73]:
from sklearn.datasets import make_regression
from matplotlib.colors import ListedColormap

cmap_bold = ListedColormap(['#FFFF00', '#00FF00', '#0000FF','#000000'])
plt.figure()

plt.title('Sample regression problem with one input variable')
X_R1, y_R1 = make_regression(n_samples = 100, n_features = 1, n_informative = 1,bias = 150.0, noise = 30, random_state = 0)

plt.scatter(X_R1, y_R1, marker = 'o', s= 50)
plt.show()



In [74]:
X_R1

array([[-0.35955316],
       [ 0.97663904],
       [ 0.40234164],
       [-0.81314628],
       [-0.88778575],
       [ 0.44386323],
       [-0.97727788],
       [ 0.42833187],
       [ 0.20827498],
       [-0.31155253],
       [-0.51080514],
       [ 0.12691209],
       [-1.53624369],
       [-0.40178094],
       [ 0.6536186 ],
       [ 1.17877957],
       [-0.17992484],
       [ 1.78587049],
       [ 1.45427351],
       [-0.68481009],
       [ 0.97873798],
       [ 1.89588918],
       [-0.4380743 ],
       [ 0.3130677 ],
       [ 0.76103773],
       [ 0.77749036],
       [ 1.9507754 ],
       [ 0.33367433],
       [-0.34791215],
       [ 1.53277921],
       [-0.89546656],
       [-0.57884966],
       [-1.04855297],
       [ 0.37816252],
       [ 0.01050002],
       [ 0.46278226],
       [ 0.14404357],
       [-0.40317695],
       [ 0.0519454 ],
       [-1.25279536],
       [ 1.05445173],
       [ 0.40015721],
       [-1.70627019],
       [ 2.2408932 ],
       [ 0.17742614],
       [-0

In [75]:
y_R1

array([ 120.61202772,  131.22864089,  150.56377656,  169.90502386,
        118.15657878,  196.359452  ,   63.87424131,  166.52111726,
        196.69750568,  109.35202873,  110.89907645,  167.14847479,
         50.75815986,   99.05541869,  186.6671153 ,  231.40263398,
        149.87252122,  220.18993612,  267.54737632,  136.60775075,
        222.40878201,  221.6995233 ,  187.9326153 ,  165.9614544 ,
        220.99830581,  251.48432189,  245.10159929,  159.37894068,
        187.46117173,  225.78043885,   85.72429807,  133.31521903,
         85.72342625,  209.37368061,   83.89422598,  203.09341316,
        201.83216387,  145.006712  ,   99.17872228,  118.32635079,
        163.72617769,  168.56121251,   83.60815569,  206.67171997,
        177.05657699,   81.58089881,  242.46512421,  117.24433728,
        217.53804156,  115.11167787,  151.72729127,   54.40609236,
        112.71980976,  143.60486968,  186.76850282,  174.28756631,
        117.76643281,    8.00651601,  224.52395686,  150.68774

* n_samples = 100: the number of samples to generate.
* n_features = 1: the number of columns in the feature array (X_R1 above has 1 column; with n_features = 2 it would have 2).
* n_informative = 1: the number of informative features, i.e., the number of features used to build the linear model that generates the output.
* bias = 150.0: the bias term (intercept) of the underlying linear model, i.e., the value the noise-free output takes when every feature is 0.
* noise = 30: the standard deviation of the Gaussian noise applied to the output.
* random_state = 0: the seed for the random number generator, so that the same dataset is generated on every run.
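To see how the bias and noise parameters shape the data, here is a conceptual NumPy sketch of a one-feature linear generator. It is not scikit-learn's code: `make_linear_data` is a hypothetical helper, and the slope of 40.0 is an assumed value (`make_regression` draws its own coefficient).

```python
import numpy as np

def make_linear_data(n_samples, coef, bias, noise, seed=0):
    """Generate y = coef * x + bias + Gaussian noise for a single feature x."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, 1))                      # one feature column
    y = coef * x[:, 0] + bias + rng.normal(scale=noise, size=n_samples)
    return x, y

# same sample count, bias and noise level as the cell above; slope is assumed
x_demo, y_demo = make_linear_data(n_samples=100, coef=40.0, bias=150.0, noise=30.0)
print(x_demo.shape, y_demo.shape)

# with the noise switched off, the output at x = 0 is exactly the bias
x_clean, y_clean = make_linear_data(n_samples=100, coef=40.0, bias=150.0, noise=0.0)
print(np.allclose(y_clean - 40.0 * x_clean[:, 0], 150.0))
```

This is why the scatter plot of y_R1 is centered around 150 rather than 0.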

In [76]:
from sklearn.datasets import make_classification
plt.figure()

plt.title('Sample Classification problem with two informative features')
X_C2, y_C2 = make_classification(n_samples = 100, n_features = 2, 
                                 n_informative = 2, n_redundant = 0, 
                                 n_clusters_per_class = 1 , flip_y = 0.1, 
                                 class_sep = 0.5, random_state = 0)

plt.scatter(X_C2[:, 0], X_C2[:, 1], c=y_C2,
           marker= 'o', s=50, cmap=cmap_bold)
plt.show()


In [77]:
X_C2#[: , 0]

array([[ 0.37163989, -0.92276158],
       [-0.1617182 ,  0.51386743],
       [-1.63650855,  2.23389996],
       [ 0.62041909, -2.18941375],
       [-0.98718544,  1.93299453],
       [-0.14918509,  1.30535614],
       [ 1.61878776,  0.31495229],
       [-2.0077599 ,  1.98857017],
       [ 0.51055071, -0.12656384],
       [ 0.41002859, -0.70119016],
       [ 0.52751248,  1.0967429 ],
       [ 0.59985786,  1.28037474],
       [ 0.45312252,  0.85489986],
       [-1.00514147,  0.77186707],
       [ 0.47834537,  1.85361297],
       [ 0.29625824, -0.78978825],
       [ 0.32783482, -0.5227981 ],
       [ 0.60447379, -0.14882659],
       [ 0.50205696,  1.12159087],
       [ 0.60563412,  0.42995032],
       [-1.28273822,  1.02817583],
       [ 0.30764063,  0.93661465],
       [ 0.38097545,  1.45206938],
       [ 0.35846584, -0.9653136 ],
       [ 0.3016695 ,  0.35707675],
       [ 0.64998136, -0.17995055],
       [-0.15122279,  0.47865617],
       [ 0.45000368,  1.33240861],
       [ 0.2858526 ,

In [78]:
y_C2

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0])

In [79]:
print(len(X_C2),len(y_C2))

100 100


In [80]:
from adspy_shared_utilities import plot_two_class_knn

X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2,
                                                   random_state=0)

plot_two_class_knn(X_train, y_train, 1, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 3, 'uniform', X_test, y_test)
plot_two_class_knn(X_train, y_train, 11, 'uniform', X_test, y_test)


In the plot for k=1 the classifier is overfitting: it accounts for each and every training point, which shows up as a train score of 1 against a test score of 0.8.

As k increases to 3, the training score decreases to 0.92 and the test score decreases marginally to 0.72.

For k=11 the training score decreases further to 0.77 while the test score increases back to 0.8. The other important thing to observe is the smoother decision boundary, which means the model is now capturing the more general pattern.


#### k-NN can also be used for regression
In this case the model finds the k training points nearest to the query point and predicts the average of their target values. As with the classifier, the risk of overfitting increases as k decreases, and as k increases the model captures the more global pattern.
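The averaging step can be sketched with NumPy for a single input feature. This is an illustrative helper (`knn_regress` is a hypothetical name) using uniform weights, and the toy data is made up so that one point lies far from the rest.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k):
    """Predict the mean target of the k training points nearest to x_query."""
    d = np.abs(X_train[:, 0] - x_query)     # distance along the single feature
    nearest = np.argsort(d)[:k]             # indices of the k closest points
    return y_train[nearest].mean()

X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 50.0])

print(knn_regress(X_toy, y_toy, 2.6, k=1))   # 4.0  (only the point at x = 3)
print(knn_regress(X_toy, y_toy, 2.6, k=3))   # 3.0  (mean of targets 4, 3, 2)
print(knn_regress(X_toy, y_toy, 2.6, k=5))   # 12.0 (mean of all targets)
```

With k equal to the size of the training set, the prediction collapses to the global mean of the targets regardless of the query, which is the underfitting extreme.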

#### R-squared regression score (coefficient of determination)
Measures how well a regression model fits the given data. The score is typically between 0 and 1, but it can be negative for a model that fits worse than a constant prediction.
* A value of 0 corresponds to a constant model that predicts the mean value of all training target values.
* A value of 1 corresponds to perfect prediction.
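The score can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses a hypothetical helper `r2_score_manual` and made-up target values, and measures the total sum of squares around the mean of the targets being scored.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # error of the constant mean model
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]

print(r2_score_manual(y_true, y_true))                    # perfect prediction: 1.0
print(r2_score_manual(y_true, [np.mean(y_true)] * 4))     # always predict the mean: 0.0
print(r2_score_manual(y_true, [7.0, 2.0, -0.5, 3.0]))     # worse than the mean: negative
```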

In [81]:
from sklearn.neighbors import KNeighborsRegressor

X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1, random_state = 0)

knnreg = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)

print(knnreg.predict(X_test))
print('R-squared test score: {:.3f}'
     .format(knnreg.score(X_test, y_test)))

[ 231.70974697  148.35572605  150.58852659  150.58852659   72.14859259
  166.50590948  141.90634426  235.57098756  208.25897836  102.10462746
  191.31852674  134.50044902  228.32181403  148.35572605  159.16911306
  113.46875166  144.03646012  199.23189853  143.19242433  166.50590948
  231.70974697  208.25897836  128.01545355  123.14247619  141.90634426]
R-squared test score: 0.425


In [82]:
fig, subaxes = plt.subplots(1, 2, figsize=(8,4))
X_predict_input = np.linspace(-3, 3, 50).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X_R1[0::5], y_R1[0::5], random_state = 0)

for thisaxis, K in zip(subaxes, [1, 3]):
    knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
    y_predict_output = knnreg.predict(X_predict_input)
    thisaxis.set_xlim([-2.5, 0.75])
    thisaxis.plot(X_predict_input, y_predict_output, '^', markersize = 10,
                 label='Predicted', alpha=0.8)
    thisaxis.plot(X_train, y_train, 'o', label='True Value', alpha=0.8)
    thisaxis.set_xlabel('Input feature')
    thisaxis.set_ylabel('Target value')
    thisaxis.set_title('KNN regression (K={})'.format(K))
    thisaxis.legend()
plt.tight_layout()


In [83]:
# plot k-NN regression on sample dataset for different values of K
fig, subaxes = plt.subplots(5, 1, figsize=(5,20))
X_predict_input = np.linspace(-3, 3, 500).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                   random_state = 0)

for thisaxis, K in zip(subaxes, [1, 3, 7, 15, 55]):
    knnreg = KNeighborsRegressor(n_neighbors = K).fit(X_train, y_train)
    y_predict_output = knnreg.predict(X_predict_input)
    train_score = knnreg.score(X_train, y_train)
    test_score = knnreg.score(X_test, y_test)
    thisaxis.plot(X_predict_input, y_predict_output)
    thisaxis.plot(X_train, y_train, 'o', alpha=0.9, label='Train')
    thisaxis.plot(X_test, y_test, '^', alpha=0.9, label='Test')
    thisaxis.set_xlabel('Input feature')
    thisaxis.set_ylabel('Target value')
    thisaxis.set_title('KNN Regression (K={})\n\
Train $R^2 = {:.3f}$,  Test $R^2 = {:.3f}$'
                      .format(K, train_score, test_score))
    thisaxis.legend()
    plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)



As k increases toward its optimum, the test R^2 increases and the train R^2 decreases, as shown in the plots above for k = 1, 3, 7, 15 and 55. Up to k=15 the performance of the model keeps improving, but at k=55 the model is underfitting, i.e. it no longer fits even the training data well.

Advantages of k-NN:
* Simple and easy to understand
* Can be a reasonable baseline for comparison against more sophisticated methods

Disadvantages of k-NN:
* Performance degrades when the number of features is large
* It is a poor fit for sparse datasets (many columns, with the majority of the values being 0)

K-nearest neighbors and linear model fitting using least squares are two complementary approaches to supervised learning. K-nearest neighbors doesn't make a lot of assumptions about the structure of the data and gives potentially accurate but sometimes unstable predictions. Here, unstable means that the predictions can be sensitive to small changes in the training data.

On the other hand, linear models make strong assumptions about the structure of the data: namely, that the target value can be predicted using just a weighted sum of the input variables, i.e. a linear function.

<img src = "accuracy_vs_complexity.png" >

As the figure above shows, model accuracy initially increases with model complexity, but beyond a certain point it starts to decrease: with more complexity (for example, more features) and the same or a smaller value of k, the model starts to overfit, which reduces accuracy.

What is a model?

It's a specific mathematical or computational description that expresses the relationship between a set of input variables and one or more outcome variables that are being studied or predicted. 

In statistical terms, the input variables are called independent variables. And the outcome variables are termed dependent variables. 

In machine learning, we use the term features to refer to the input or independent variables, and target value or target label to refer to the output dependent variable.

#### Linear Models 

A linear model is a sum of weighted input variables (plus a constant intercept term) that predicts a target output value given an input data instance.

Linear models make a strong prior assumption about the relationship between the input x and output y. Linear models may seem simplistic. But for data with many features, linear models can be very effective and generalize well to new data beyond the training set. 
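As a minimal sketch of "a sum of weighted variables", the snippet below fits weights w and an intercept b by least squares with NumPy and then predicts with a weighted sum. The data is synthetic and noise-free, generated from coefficients chosen purely for illustration, and all variable names are hypothetical.

```python
import numpy as np

# toy data generated from y = 2*x1 - 1*x2 + 5 (coefficients chosen for illustration)
rng = np.random.default_rng(0)
X_lin = rng.normal(size=(50, 2))
y_lin = 2.0 * X_lin[:, 0] - 1.0 * X_lin[:, 1] + 5.0

# append a column of ones so the intercept b is learned as one extra weight
X_aug = np.column_stack([X_lin, np.ones(len(X_lin))])
params, *_ = np.linalg.lstsq(X_aug, y_lin, rcond=None)
w, b = params[:2], params[2]

x_new = np.array([1.0, 3.0])
y_hat = w @ x_new + b          # prediction = weighted sum of the features + intercept
print(np.round(w, 3), np.round(b, 3), np.round(y_hat, 3))
```

Because the toy data contains no noise, the recovered weights match the generating coefficients; on real data they would only approximate the underlying relationship.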

<img src = "Linear_regression.png">

In [84]:
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X_R1, y_R1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

linear model coeff (w): [ 45.70870465]
linear model intercept (b): 148.446
R-squared score (training): 0.679
R-squared score (test): 0.492


In [85]:
plt.figure(figsize=(5,4))
plt.scatter(X_R1, y_R1, marker= 'o', s=50, alpha=0.8)
plt.plot(X_R1, linreg.coef_ * X_R1 + linreg.intercept_, 'r-')
plt.title('Least-squares linear regression')
plt.xlabel('Feature value (x)')
plt.ylabel('Target value (y)')
plt.show()


### Ridge regression

* Ridge regression learns w, b using the same least-squares criterion but adds a penalty for large variations in the w parameters

<img src = "Ridge_regression.png">

* Once the parameters are learned, the ridge regression prediction formula is the same as ordinary least-squares.
* The addition of a parameter penalty is called regularization. Regularization prevents overfitting by restricting the model, typically to reduce its complexity.
* Ridge regression uses L2 regularization: minimize the sum of squares of the w entries.
* The influence of the regularization term is controlled by the 𝛼 parameter.
* Higher alpha means more regularization and simpler models.
* The regularization term prevents overfitting because it is added to the sum of squared residuals being minimized: a large weight is only worth keeping if it reduces the residual error by more than the penalty it adds.
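The shrinking effect of alpha can be seen with the closed-form ridge solution, w = (XᵀX + αI)⁻¹Xᵀy, computed on centered data so that the intercept is not penalized. This is a sketch under those assumptions, not scikit-learn's solver; `ridge_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X_ridge = rng.normal(size=(30, 5))
y_ridge = X_ridge @ np.array([4.0, -3.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=30)

norms = []
for alpha in (0.0, 1.0, 10.0, 100.0):
    w, b = ridge_fit(X_ridge, y_ridge, alpha)
    norms.append(np.linalg.norm(w))
    print(alpha, np.round(w, 2))
```

As alpha grows, the length of the weight vector shrinks toward zero, but individual weights generally do not become exactly zero.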

In [86]:
from adspy_shared_utilities import load_crime_dataset
(X_crime, y_crime) = load_crime_dataset()
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

linridge = Ridge(alpha=20.0).fit(X_train, y_train)

print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Crime dataset
ridge regression linear model intercept: -3352.423035845943
ridge regression linear model coeff:
[  1.95091438e-03   2.19322667e+01   9.56286607e+00  -3.59178973e+01
   6.36465325e+00  -1.96885471e+01  -2.80715856e-03   1.66254486e+00
  -6.61426604e-03  -6.95450680e+00   1.71944731e+01  -5.62819154e+00
   8.83525114e+00   6.79085746e-01  -7.33614221e+00   6.70389803e-03
   9.78505502e-04   5.01202169e-03  -4.89870524e+00  -1.79270062e+01
   9.17572382e+00  -1.24454193e+00   1.21845360e+00   1.03233089e+01
  -3.78037278e+00  -3.73428973e+00   4.74595305e+00   8.42696855e+00
   3.09250005e+01   1.18644167e+01  -2.05183675e+00  -3.82210450e+01
   1.85081589e+01   1.52510829e+00  -2.20086608e+01   2.46283912e+00
   3.29328703e-01   4.02228467e+00  -1.12903533e+01  -4.69567413e-03
   4.27046505e+01  -1.22507167e-03   1.40795790e+00   9.35041855e-01
  -3.00464253e+00   1.12390514e+00  -1.82487653e+01  -1.54653407e+01
   2.41917002e+01  -1.32497562e+01  -4.20113118e-01  -3.59710

### Need for Feature Normalization

* For some machine learning methods it is important that all features are on the same scale (e.g. faster convergence in learning, more uniform or 'fair' influence for all weights)
    * e.g. regularized regression, k-NN, support vector machines, neural networks
* One way to normalize features is the MinMaxScaler, which rescales every feature to lie between 0 and 1, with 1 for the maximum value in the column and 0 for the minimum. With this approach all features are on an equal footing.
* For each feature $x_i$: compute the min value $x_i^{MIN}$ and the max value $x_i^{MAX}$ achieved across all instances in the training set.
* For each feature: transform a given feature value $x_i$ to a scaled version $x_i'$ using the formula
        $x_i' = (x_i - x_i^{MIN}) / (x_i^{MAX} - x_i^{MIN})$

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(X_train)

X_train_scaled= scaler.transform(X_train)

In the code above, fit and transform are called separately; they can also be combined into a single call, as shown below.

scaler = MinMaxScaler()

X_train_scaled= scaler.fit_transform(X_train)

* The test set must be scaled with exactly the same parameters as the training set, so the scaler must not be fit again on the test data.
    Use fit_transform on the training set and only transform on the test set to avoid data leakage.
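What fit and transform do can be written out by hand with NumPy. This sketch uses made-up numbers and hypothetical variable names; the point is that the minimum and maximum are learned from the training rows only and then reused for the test rows.

```python
import numpy as np

X_train_demo = np.array([[100.0, 5.0],
                         [200.0, 7.0],
                         [150.0, 9.0]])
X_test_demo = np.array([[250.0, 6.0]])

# "fit": learn the per-feature min and max from the training set only
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)

# "transform": apply the same formula, with the same min and max, to both sets
X_train_demo_scaled = (X_train_demo - col_min) / (col_max - col_min)
X_test_demo_scaled = (X_test_demo - col_min) / (col_max - col_min)

print(X_train_demo_scaled)   # every column spans exactly 0 to 1
print(X_test_demo_scaled)    # [[1.5  0.25]]
```

A test value can fall outside the range 0 to 1 (here 1.5) when it lies outside the range seen during training. That is expected, and it is preferable to refitting the scaler on the test data.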
    
#### Disadvantage of doing feature Normalization.

The resulting model and the transformed features may be harder to interpret. 


In general, regularisation works especially well when you have relatively small amounts of training data compared to the number of features in your model. Regularisation becomes less important as the amount of training data you have increases.


#### Ridge regression with regularization parameter: alpha

In [96]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 1, 10, 20, 50, 100, 1000]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    r2_train = linridge.score(X_train_scaled, y_train)
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train, r2_test))

Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 88, r-squared training: 0.67, r-squared test: 0.50

Alpha = 1.00
num abs(coeff) > 1.0: 87, r-squared training: 0.66, r-squared test: 0.56

Alpha = 10.00
num abs(coeff) > 1.0: 87, r-squared training: 0.63, r-squared test: 0.59

Alpha = 20.00
num abs(coeff) > 1.0: 88, r-squared training: 0.61, r-squared test: 0.60

Alpha = 50.00
num abs(coeff) > 1.0: 86, r-squared training: 0.58, r-squared test: 0.58

Alpha = 100.00
num abs(coeff) > 1.0: 87, r-squared training: 0.55, r-squared test: 0.55

Alpha = 1000.00
num abs(coeff) > 1.0: 84, r-squared training: 0.31, r-squared test: 0.30



Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number: 1.734371679487529e-19


From the output above, the r-squared test score first increases with alpha (from 0.50 at alpha = 0 to a peak of 0.60 at alpha = 20) and then decreases as alpha increases further (down to 0.30 at alpha = 1000).

In [95]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)
    
print('Crime dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Crime dataset
ridge regression linear model intercept: 933.390638504412
ridge regression linear model coeff:
[  88.68827454   16.48947987  -50.30285445  -82.90507574  -65.89507244
   -2.27674244   87.74108514  150.94862182   18.8802613   -31.05554992
  -43.13536109 -189.44266328   -4.52658099  107.97866804  -76.53358414
    2.86032762   34.95230077   90.13523036   52.46428263  -62.10898424
  115.01780357    2.66942023    6.94331369   -5.66646499 -101.55269144
  -36.9087526    -8.7053343    29.11999068  171.25963057   99.36919476
   75.06611841  123.63522539   95.24316483 -330.61044265 -442.30179004
 -284.49744001 -258.37150609   17.66431072 -101.70717151  110.64762887
  523.13611718   24.8208959     4.86533322  -30.46775619   -3.51753937
   50.57947231   10.84840601   18.27680946   44.11189865   58.33588176
   67.08698975  -57.93524659  116.1446052    53.81163718   49.01607711
   -7.62262031   55.14288543  -52.08878272  123.39291017   77.12562171
   45.49795317  184.91229771  -91.35721

#### Lasso Regression

Lasso regression is another form of regularized linear regression; it uses an L1 regularization penalty for training.

<img src = "Lasso_regression.png">
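
For readers who cannot see the image, the standard textbook form of the lasso objective is written out below. The notation ($w$ for the weights, $b$ for the intercept, $N$ training points, $p$ features) is an assumption and may differ from the figure.

$$
RSS_{LASSO}(w, b) = \sum_{i=1}^{N} \left( y_i - (w \cdot x_i + b) \right)^2 + \alpha \sum_{j=1}^{p} |w_j|
$$

scikit-learn's `Lasso` minimizes the same objective with the squared-error term rescaled by $1/(2N)$, which changes the numerical scale of $\alpha$ but not the idea.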

* The L1 penalty has the effect of setting the parameter weights in w to zero for the least influential variables. This is called a sparse solution: a kind of feature selection.
* The parameter 𝛼 controls the amount of L1 regularization (default = 1.0).
* The prediction formula is the same as for ordinary least-squares.
* When to use ridge vs lasso regression:
    * Many small/medium sized effects: use ridge.
    * Only a few variables with medium/large effect: use lasso.
In [97]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                   random_state = 0)

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linlasso = Lasso(alpha=2.0, max_iter = 10000).fit(X_train_scaled, y_train)

print('Crime dataset')
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
     .format(linlasso.coef_))
print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

# pair each column name with its coefficient, largest |weight| first
for name, coef in sorted(zip(list(X_crime), linlasso.coef_),
                         key = lambda pair: -abs(pair[1])):
    if coef != 0:
        print('\t{}, {:.3f}'.format(name, coef))

Crime dataset
lasso regression linear model intercept: 1186.6120619985809
lasso regression linear model coeff:
[    0.             0.            -0.          -168.18346054    -0.            -0.
     0.           119.6938194      0.            -0.             0.
  -169.67564456    -0.             0.            -0.             0.             0.
     0.            -0.            -0.             0.            -0.             0.
     0.           -57.52991966    -0.            -0.             0.
   259.32889226    -0.             0.             0.             0.            -0.
 -1188.7396867     -0.            -0.            -0.          -231.42347299
     0.          1488.36512229     0.            -0.            -0.            -0.
     0.             0.             0.             0.             0.            -0.
     0.            20.14419415     0.             0.             0.             0.
     0.           339.04468804     0.             0.           459.53799903
    -0.             

#### Lasso regression with regularization parameter: alpha

In [100]:
print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    linlasso = Lasso(alpha=alpha, max_iter = 10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    
    print('Alpha = {:.2f}\nFeatures kept: {}, r-squared training: {:.2f}, \
r-squared test: {:.2f}\n'
         .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))

Lasso regression: effect of alpha regularization
parameter on number of features kept in final model

Alpha = 0.50
Features kept: 35, r-squared training: 0.65, r-squared test: 0.58

Alpha = 1.00
Features kept: 25, r-squared training: 0.64, r-squared test: 0.60

Alpha = 2.00
Features kept: 20, r-squared training: 0.63, r-squared test: 0.62

Alpha = 3.00
Features kept: 17, r-squared training: 0.62, r-squared test: 0.63

Alpha = 5.00
Features kept: 12, r-squared training: 0.60, r-squared test: 0.61

Alpha = 10.00
Features kept: 6, r-squared training: 0.57, r-squared test: 0.58

Alpha = 20.00
Features kept: 2, r-squared training: 0.51, r-squared test: 0.50

Alpha = 50.00
Features kept: 1, r-squared training: 0.31, r-squared test: 0.30



#### Polynomial regression 
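
Polynomial regression is still a linear model; it is simply fit on an expanded feature set made of products of the original features. As a minimal sketch of what `PolynomialFeatures` produces (the tiny input array and the variable names are illustrative assumptions, not part of the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# one sample with two features: x0 = 2, x1 = 3
X_small = np.array([[2.0, 3.0]])

poly_demo = PolynomialFeatures(degree=2)
X_small_poly = poly_demo.fit_transform(X_small)

# columns are: 1, x0, x1, x0^2, x0*x1, x1^2
print(X_small_poly)   # [[1. 2. 3. 4. 6. 9.]]

# with 7 input features, degree 2 yields 1 + 7 + 28 = 36 columns
n_cols = PolynomialFeatures(degree=2).fit_transform(np.zeros((1, 7))).shape[1]
print(n_cols)         # 36
```

This is why the degree-2 model fitted in this section on the 7 Friedman features reports 36 coefficients.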

In [102]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import make_friedman1

X_F1, y_F1 = make_friedman1(n_samples = 100,
                           n_features = 7, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X_F1, y_F1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('linear model coeff (w): {}'
     .format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

print('\nNow we transform the original input data to add\n\
polynomial features up to degree 2 (quadratic)\n')
poly = PolynomialFeatures(degree=2)
X_F1_poly = poly.fit_transform(X_F1)

X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                   random_state = 0)
linreg = LinearRegression().fit(X_train, y_train)

print('(poly deg 2) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2) R-squared score (test): {:.3f}\n'
     .format(linreg.score(X_test, y_test)))

print('\nAddition of many polynomial features often leads to\n\
overfitting, so we often use polynomial features in combination\n\
with regression that has a regularization penalty, like ridge\n\
regression.\n')

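# same split of the polynomial features as above;
# Ridge() uses the default regularization strength alpha=1.0
X_train, X_test, y_train, y_test = train_test_split(X_F1_poly, y_F1,
                                                   random_state = 0)
linreg = Ridge().fit(X_train, y_train)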

print('(poly deg 2 + ridge) linear model coeff (w):\n{}'
     .format(linreg.coef_))
print('(poly deg 2 + ridge) linear model intercept (b): {:.3f}'
     .format(linreg.intercept_))
print('(poly deg 2 + ridge) R-squared score (training): {:.3f}'
     .format(linreg.score(X_train, y_train)))
print('(poly deg 2 + ridge) R-squared score (test): {:.3f}'
     .format(linreg.score(X_test, y_test)))

linear model coeff (w): [  4.42036739   5.99661447   0.52894712  10.23751345   6.5507973
  -2.02082636  -0.32378811]
linear model intercept (b): 1.543
R-squared score (training): 0.722
R-squared score (test): 0.722

Now we transform the original input data to add
polynomial features up to degree 2 (quadratic)

(poly deg 2) linear model coeff (w):
[  3.40951018e-12   1.66452443e+01   2.67285381e+01  -2.21348316e+01
   1.24359227e+01   6.93086826e+00   1.04772675e+00   3.71352773e+00
  -1.33785505e+01  -5.73177185e+00   1.61813184e+00   3.66399592e+00
   5.04513181e+00  -1.45835979e+00   1.95156872e+00  -1.51297378e+01
   4.86762224e+00  -2.97084269e+00  -7.78370522e+00   5.14696078e+00
  -4.65479361e+00   1.84147395e+01  -2.22040650e+00   2.16572630e+00
  -1.27989481e+00   1.87946559e+00   1.52962716e-01   5.62073813e-01
  -8.91697516e-01  -2.18481128e+00   1.37595426e+00  -4.90336041e+00
  -2.23535458e+00   1.38268439e+00  -5.51908208e-01  -1.08795007e+00]
(poly deg 2) linear model int