<img src="https://i.postimg.cc/prCJq27D/scikit-learn.png"  width="300"/>

# scikit-learn (Sklearn)
One of the most prominent Python libraries for machine learning:

* Contains many state-of-the-art machine learning algorithms
* Builds on NumPy for speed and implements advanced techniques
* Provides a wide range of evaluation measures and techniques
* Offers [comprehensive documentation](https://scikit-learn.org/stable/index.html) for each algorithm
* Is widely used, with a wealth of [tutorials](http://scikit-learn.org/stable/user_guide.html) and code snippets available
* Works well with NumPy, SciPy, pandas, matplotlib, ...

 

## Imports

In [49]:
import pandas as pd
import numpy as np
import plotly.express as px

## Data import
Multiple options:

* A few toy datasets are included in `sklearn.datasets`
* Import 1000s of datasets via `sklearn.datasets.fetch_openml`
* You can import data files (CSV) with `pandas` or `numpy`
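
As a minimal sketch of the first option, the snippet below loads one of the bundled toy datasets. The choice of `load_diabetes` and the names `X_toy` and `y_toy` are ours for illustration; they are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_diabetes

# Load a bundled regression dataset as pandas objects (no download needed).
diabetes = load_diabetes(as_frame=True)
X_toy = diabetes.data      # DataFrame with one column per feature
y_toy = diabetes.target    # Series holding the regression target

print(X_toy.shape)
print(X_toy.columns.tolist())
```

Because `as_frame=True` returns pandas objects, the same `head()`, `info()`, and `describe()` calls used for CSV data work here too.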


We'll work with the company's Ecommerce Customers CSV file. It contains customer information such as Email, Address, and the color of their Avatar, plus the following numerical columns:

* Avg. Session Length: Average length of in-store style advice sessions.
* Time on App: Average time spent on the app, in minutes.
* Time on Website: Average time spent on the website, in minutes.
* Length of Membership: Number of years the customer has been a member.

In [2]:
customers = pd.read_csv("./dataset/Ecommerce_Customers.csv")

### Understanding the data

In [50]:
customers.head()

| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|---|---|---|
| 0 | mstephenson@fernandez.com | 835 Frank Tunnel\r\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
| 1 | hduke@hotmail.com | 4547 Archer Common\r\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
| 2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\r\nCobbborough,... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
| 3 | riverarebecca@gmail.com | 1414 David Throughway\r\nPort Jason, OH 22070-... | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
| 4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\r\nPort Jacobville, PR... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |


In [51]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB


In [52]:
customers.describe()

| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 33.053194 | 12.052488 | 37.060445 | 3.533462 | 499.314038 |
| std | 0.992563 | 0.994216 | 1.010489 | 0.999278 | 79.314782 |
| min | 29.532429 | 8.508152 | 33.913847 | 0.269901 | 256.670582 |
| 25% | 32.341822 | 11.388153 | 36.349257 | 2.93045 | 445.038277 |
| 50% | 33.082008 | 11.983231 | 37.069367 | 3.533975 | 498.887875 |
| 75% | 33.711985 | 12.75385 | 37.716432 | 4.126502 | 549.313828 |
| max | 36.139662 | 15.126994 | 40.005182 | 6.922689 | 765.518462 |


In [53]:
customers.isnull().sum()

Email                   0
Address                 0
Avatar                  0
Avg. Session Length     0
Time on App             0
Time on Website         0
Length of Membership    0
Yearly Amount Spent     0
dtype: int64

### Exploratory Data Analysis

Let's explore the data!

In [54]:
px.scatter_matrix(customers, 
                  dimensions=['Yearly Amount Spent', 'Time on Website',
                              'Time on App','Length of Membership',
                              'Avg. Session Length'],
                  width = 1000,
                  height = 1000)

>**❓ QUESTION:** Based on this plot, which feature looks most correlated with Yearly Amount Spent?

No two input features appear strongly correlated with each other. Let's verify with the correlation matrix and a heatmap.

In [8]:
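# numeric_only=True restricts the calculation to the numeric columns;
# pandas 2.0+ raises an error on the text columns otherwise
customers.corr(numeric_only=True)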

| | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|
| Avg. Session Length | 1.0 | -0.027826 | -0.034987 | 0.060247 | 0.355088 |
| Time on App | -0.027826 | 1.0 | 0.082388 | 0.029143 | 0.499328 |
| Time on Website | -0.034987 | 0.082388 | 1.0 | -0.047582 | -0.002641 |
| Length of Membership | 0.060247 | 0.029143 | -0.047582 | 1.0 | 0.809084 |
| Yearly Amount Spent | 0.355088 | 0.499328 | -0.002641 | 0.809084 | 1.0 |


In [55]:
fig = px.imshow(customers.corr(numeric_only=True),  # skip the text columns
                title='Correlations Among Training Features',
                height=700, width=700)
fig.show()

## Building models
All scikit-learn estimators follow the same interface:

* **fit():** Fit the model to the training data
* **predict():** Make predictions for new data
* **score():** Predict and compare with the true values (accuracy for classifiers, R² for regressors)
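
A minimal sketch of this shared interface, assuming synthetic data and two regressors chosen only for illustration (`X_demo`, `y_demo`, and `scores` are our names). The same three calls work unchanged for both estimators:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y depends linearly on two features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

X_fit, X_new = X_demo[:80], X_demo[80:]
y_fit, y_new = y_demo[:80], y_demo[80:]

scores = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_fit, y_fit)         # learn from the training rows
    preds = model.predict(X_new)    # one prediction per new row
    scores[type(model).__name__] = model.score(X_new, y_new)  # R^2 for regressors

print(scores)
```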


## Training and Testing Data

Now that we've explored the data a bit, let's prepare it for splitting into training and testing sets:
* Set a variable `X` equal to the numerical features of the customers.
* Set a variable `y` equal to the "Yearly Amount Spent" column.

In [60]:
y = customers['Yearly Amount Spent']
X = customers[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]

### Training and testing data
To evaluate our model, we need to test it on data it has not seen during training.  
`train_test_split` shuffles the data and splits it randomly into a training set and a test set.

**Set `test_size=0.3` and `random_state=101`**

In [61]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [62]:
X_train

| | Avg. Session Length | Time on App | Time on Website | Length of Membership |
|---|---|---|---|---|
| 202 | 31.525752 | 11.340036 | 37.039514 | 3.811248 |
| 428 | 31.862741 | 14.039867 | 37.022269 | 3.738225 |
| 392 | 33.258238 | 11.514949 | 37.128039 | 4.662845 |
| 86 | 33.877779 | 12.517666 | 37.151921 | 2.669942 |
| 443 | 33.025020 | 12.504220 | 37.645839 | 4.051382 |
| ... | ... | ... | ... | ... |
| 63 | 32.789773 | 11.670066 | 37.408748 | 3.414688 |
| 326 | 33.217188 | 10.999684 | 38.442767 | 4.243813 |
| 337 | 31.827979 | 12.461147 | 37.428997 | 2.974737 |
| 11 | 33.879361 | 11.584783 | 37.087926 | 3.713209 |


In [64]:
y_train

202    443.965627
428    556.298141
392    549.131573
86     487.379306
443    561.516532
          ...    
63     483.159721
326    505.230068
337    440.002748
11     522.337405
351    533.396554
Name: Yearly Amount Spent, Length: 350, dtype: float64

In [65]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (350, 4)
y_train shape: (350,)
X_test shape: (150, 4)
y_test shape: (150,)


## Fitting a model
We'll build a Linear Regression model

In [66]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train,y_train)

LinearRegression()

**Print out the coefficients of the model**

In [67]:
print('Coefficients: \n', lr.coef_)

Coefficients: 
 [25.98154972 38.59015875  0.19040528 61.27909654]
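
The coefficients come back in the same order as the columns of `X`, so the largest value above (about 61.28) belongs to Length of Membership. A bare array is easy to misread, so one option is to pair each coefficient with its column name. The sketch below does this on hypothetical data with known slopes; `feature_a`, `feature_b`, and `coef_table` are illustrative names, not part of the customer data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical features; the target uses known slopes 2.0 and -1.0 and intercept 5.0.
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
y_demo = 2.0 * X_demo["feature_a"] - 1.0 * X_demo["feature_b"] + 5.0

model = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order of the training data, so the index lines up.
coef_table = pd.Series(model.coef_, index=X_demo.columns, name="coefficient")
print(coef_table)
print("Intercept:", model.intercept_)
```

Each coefficient is the expected change in the target for a one-unit increase in that feature while the other features are held fixed.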


### Predicting Test Data
Now that we have fitted our model, let's evaluate its performance by predicting the target for the test set!

In [68]:
predictions = lr.predict(X_test)

In [69]:
predictions

array([456.44186104, 402.72005312, 409.2531539 , 591.4310343 ,
       590.01437275, 548.82396607, 577.59737969, 715.44428115,
       473.7893446 , 545.9211364 , 337.8580314 , 500.38506697,
       552.93478041, 409.6038964 , 765.52590754, 545.83973731,
       693.25969124, 507.32416226, 573.10533175, 573.2076631 ,
       397.44989709, 555.0985107 , 458.19868141, 482.66899911,
       559.2655959 , 413.00946082, 532.25727408, 377.65464817,
       535.0209653 , 447.80070905, 595.54339577, 667.14347072,
       511.96042791, 573.30433971, 505.02260887, 565.30254655,
       460.38785393, 449.74727868, 422.87193429, 456.55615271,
       598.10493696, 449.64517443, 615.34948995, 511.88078685,
       504.37568058, 515.95249276, 568.64597718, 551.61444684,
       356.5552241 , 464.9759817 , 481.66007708, 534.2220025 ,
       256.28674001, 505.30810714, 520.01844434, 315.0298707 ,
       501.98080155, 387.03842642, 472.97419543, 432.8704675 ,
       539.79082198, 590.03070739, 752.86997652, 558.27

In [16]:
fig = px.scatter(x=predictions, y=y_test,
                 labels={"x": "Predictions", "y": "True values"})
fig.show()

### Evaluating the model
Feeding all test examples to the model yields one prediction per example, which we then compare with the true values using error metrics:
<img src="https://i.postimg.cc/CxSzwGRL/metrics.png" width="400" />

In [70]:
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

MAE: 7.228148653430809
MSE: 79.81305165097403
RMSE: 8.93381506697861
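
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy and checks them against `sklearn.metrics`. The arrays `y_true` and `y_pred` are made-up values for illustration, not taken from the customer data.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical true values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the target's units

print("MAE:", mae, "| sklearn:", metrics.mean_absolute_error(y_true, y_pred))
print("MSE:", mse, "| sklearn:", metrics.mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
```

RMSE is often the easiest to interpret because it is expressed in the same units as the target, while squaring the errors penalizes large misses more heavily than MAE does.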


## Logistic Regression

In this project we will work with a synthetic advertising data set that indicates whether or not a particular internet user clicked on an advertisement on a company website. We will try to build a model that predicts whether a user will click on an ad, based on that user's features.

This data set contains the following features:

* **Daily Time Spent on Site:** Consumer time on site, in minutes
* **Age:** Customer age in years
* **Area Income:** Average income of the consumer's geographical area
* **Daily Internet Usage:** Average minutes per day the consumer is on the internet
* **Ad Topic Line:** Headline of the advertisement
* **City:** City of the consumer
* **Male:** Whether or not the consumer was male
* **Country:** Country of the consumer
* **Timestamp:** Time at which the consumer clicked on the ad or closed the window
* **Clicked on Ad:** 0 or 1, indicating whether the consumer clicked on the ad

### Import Libraries

In [71]:
import pandas as pd
import numpy as np
import plotly.express as px

In [72]:
ad_data = pd.read_csv('./dataset/advertising.csv')

In [73]:
ad_data.head()

| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68.95 | 35 | 61833.9 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
| 1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
| 2 | 69.47 | 26 | 59785.94 | 236.5 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
| 3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
| 4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |


In [74]:
ad_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB


In [75]:
ad_data.describe()

| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 65.0002 | 36.009 | 55000.00008 | 180.0001 | 0.481 | 0.5 |
| std | 15.853615 | 8.785562 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
| min | 32.6 | 19.0 | 13996.5 | 104.78 | 0.0 | 0.0 |
| 25% | 51.36 | 29.0 | 47031.8025 | 138.83 | 0.0 | 0.0 |
| 50% | 68.215 | 35.0 | 57012.3 | 183.13 | 0.0 | 0.5 |
| 75% | 78.5475 | 42.0 | 65470.635 | 218.7925 | 1.0 | 1.0 |
| max | 91.43 | 61.0 | 79484.8 | 269.96 | 1.0 | 1.0 |


In [76]:
ad_data.isnull().sum()

Daily Time Spent on Site    0
Age                         0
Area Income                 0
Daily Internet Usage        0
Ad Topic Line               0
City                        0
Male                        0
Country                     0
Timestamp                   0
Clicked on Ad               0
dtype: int64

### Exploratory Data Analysis

In [77]:
px.histogram(ad_data, x="Age", nbins=30, title="Histogram of Age")

In [78]:
px.scatter(ad_data, x="Age", y='Area Income',
           marginal_x='histogram', marginal_y='histogram',
           title="Area Income versus Age")

In [79]:
px.scatter(ad_data, x="Age", y='Daily Time Spent on Site', 
           marginal_x='violin', marginal_y='violin', 
           title="Daily Time Spent on Site versus Age")

In [80]:
px.scatter(ad_data, y="Daily Internet Usage", 
           x='Daily Time Spent on Site', marginal_x='histogram', 
           marginal_y='histogram', title="'Daily Time Spent on Site' vs. 'Daily Internet Usage'")

In [83]:
ad_data['Clicked on Ad'].value_counts()

0    500
1    500
Name: Clicked on Ad, dtype: int64

In [87]:
ad_data['ClickedonAd_str'] = ad_data['Clicked on Ad'].astype('str')

In [88]:
px.scatter_matrix(ad_data, 
                  dimensions=('Daily Time Spent on Site', 'Age', 'Area Income',
       'Daily Internet Usage', 'Male','Clicked on Ad'),
                  color='ClickedonAd_str', width = 1000,
                  height = 1000
                 )

In [29]:
from sklearn.model_selection import train_test_split

In [89]:
X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']

In [90]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [92]:
y_train

29     0
535    0
695    0
557    0
836    1
      ..
106    0
270    1
860    1
435    1
102    0
Name: Clicked on Ad, Length: 800, dtype: int64

In [93]:
from sklearn.linear_model import LogisticRegression

In [95]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

LogisticRegression()

In [96]:
predictions = logmodel.predict(X_test)

The `score` function computes the accuracy: the fraction of test predictions that are correct.

In [98]:
logmodel.score(X_test, y_test)

0.9
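
For a classifier, `score` is simply the share of matching labels. The sketch below checks this on synthetic data generated with `make_classification`; the data and the names `clf`, `preds`, and `manual` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary classification data (not the advertising data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_fit, X_new = X_demo[:200], X_demo[200:]
y_fit, y_new = y_demo[:200], y_demo[200:]

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
preds = clf.predict(X_new)

manual = np.mean(preds == y_new)   # fraction of predictions equal to the true label
print(manual, clf.score(X_new, y_new), accuracy_score(y_new, preds))
```

All three printed values should agree, since they compute the same quantity in different ways.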

## Confusion Matrix

A confusion matrix is a table that summarizes the performance of a classification algorithm by counting each combination of predicted and actual class.
<img src="https://i.postimg.cc/bvG8Pr9w/confusion2.png" width ="700"/>

## TP, FP, FN, TN 

> **❗ NOTE:** Positive and Negative describe the predicted class; True and False describe whether that prediction matches the actual class.

* **True Positive:**
  * Interpretation: You predicted positive and it’s true.
  * Example: You predicted that a woman is pregnant and she actually is.

* **True Negative:**
  * Interpretation: You predicted negative and it’s true.
  * Example: You predicted that a man is not pregnant and he actually is not.

* **False Positive:** (Type 1 Error)
  * Interpretation: You predicted positive and it’s false.
  * Example: You predicted that a man is pregnant but he actually is not.

* **False Negative:** (Type 2 Error)
  * Interpretation: You predicted negative and it’s false.
  * Example: You predicted that a woman is not pregnant but she actually is.
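
The four counts can be read directly from `sklearn.metrics.confusion_matrix`. The sketch below uses short hand-made label lists (`y_true` and `y_pred` are illustrative, not the ad predictions); for binary labels 0 and 1, flattening the matrix yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical predicted classes

# Rows are actual classes, columns are predicted classes (label 0 first).
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```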

## How to Calculate a Confusion Matrix for a 2-Class Classification Problem?

### Recall
<img src="https://i.postimg.cc/23Nz6DTB/recall.png" width ="300"/>
Recall answers the question: of all the actually positive examples, how many did we predict correctly?

Recall should be as high as possible.

### Precision
<img src="https://i.postimg.cc/Pr1X7Crb/precision.png" width ="300"/>
Precision answers the question: of all the examples we predicted as positive, how many are actually positive?

Precision should be as high as possible.


It is difficult to compare two models when one has low precision and high recall and the other the reverse. To make them comparable, we use the F-score, which combines recall and precision into a single number.

### F-measure
<img src="https://i.postimg.cc/dQ7YTLbR/fmeasure.png" width ="400"/>
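
A short sketch of the three metrics, assuming the standard definitions (recall = TP / (TP + FN), precision = TP / (TP + FP), F1 = harmonic mean of the two). The label lists are made up for illustration, and the hand-computed values are checked against scikit-learn's own functions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical actual classes
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # hypothetical predicted classes

# Count the outcomes for the positive class (label 1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision, "| sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "| sklearn:", recall_score(y_true, y_pred))
print("f1:", f1, "| sklearn:", f1_score(y_true, y_pred))
```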


In [99]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.84      0.96      0.89        89
           1       0.96      0.86      0.90       111

    accuracy                           0.90       200
   macro avg       0.90      0.91      0.90       200
weighted avg       0.91      0.90      0.90       200



## Learning Curves

A plot of the training and validation scores against the size of the training set is known as a learning curve.

<img src="https://i.postimg.cc/kGsmnbVL/learning-curve.png"/>  
If we plotted the error scores for each training size, we’d get two learning curves that look similar to these:
<img src="https://i.postimg.cc/GmV1hr23/learning-curve2.png"/>

**Learning curves give us an opportunity to diagnose bias and variance in supervised learning models.**
<img src="https://i.postimg.cc/BvRT4SGf/bias2.png"/>

---

<img src="https://i.postimg.cc/52RvS4yQ/variance.png"/>

In [100]:
from sklearn.model_selection import learning_curve

In [101]:
train_sizes = [200, 400, 600, 800]

In [103]:
train_sizes, train_scores, validation_scores = learning_curve(
    estimator=LogisticRegression(max_iter=1000),
    X=X,
    y=y,
    train_sizes=train_sizes,
    scoring='accuracy',  # 5-fold cross-validation is the default
)

In [104]:
print('Training scores:\n\n', train_scores)
print('\n', '-' * 70)
print('\nValidation scores:\n\n', validation_scores)

Training scores:

 [[0.87       0.925      0.93       0.93       0.93      ]
 [0.88       0.9075     0.905      0.905      0.905     ]
 [0.88833333 0.90166667 0.89666667 0.90333333 0.90333333]
 [0.89125    0.90625    0.90125    0.95875    0.9       ]]

 ----------------------------------------------------------------------

Validation scores:

 [[0.93  0.875 0.88  0.88  0.92 ]
 [0.93  0.87  0.905 0.89  0.92 ]
 [0.93  0.87  0.885 0.89  0.92 ]
 [0.925 0.87  0.89  0.935 0.915]]


In [105]:
train_scores_mean = train_scores.mean(axis = 1)
validation_scores_mean = validation_scores.mean(axis = 1)
print('Mean training scores\n\n', pd.Series(train_scores_mean, index = train_sizes))
print('\n', '-' * 20) # separator
print('\nMean validation scores\n\n',pd.Series(validation_scores_mean, index = train_sizes))

Mean training scores

 200    0.917000
400    0.900500
600    0.898667
800    0.911500
dtype: float64

 --------------------

Mean validation scores

 200    0.897
400    0.903
600    0.899
800    0.907
dtype: float64


In [106]:
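fig = px.line(x=train_sizes, y=train_scores_mean,
              labels={"x": "Training set size", "y": "Mean accuracy"})
# Name the second trace so the two curves can be told apart
fig.add_scatter(x=train_sizes, y=validation_scores_mean, name='Validation')
fig.show()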
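
One rough way to read the curves is to look at the gap between the mean training and validation scores. The sketch below reuses the mean scores printed above, copied by hand, so treat it as illustrative; the names `mean_train`, `mean_val`, and `gap` are ours.

```python
import pandas as pd

sizes = [200, 400, 600, 800]
# Mean scores copied from the printed output above.
mean_train = pd.Series([0.917000, 0.900500, 0.898667, 0.911500], index=sizes)
mean_val = pd.Series([0.897, 0.903, 0.899, 0.907], index=sizes)

# A large positive gap would hint at overfitting (high variance);
# two low, close scores would hint at underfitting (high bias).
gap = mean_train - mean_val
summary = pd.DataFrame({"train": mean_train, "validation": mean_val, "gap": gap})
print(summary)
```

Here both scores stay near 0.90 and the gap never exceeds about 0.02, which suggests that adding more rows alone is unlikely to change the accuracy much.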

### Feature Scaling

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.

* **Normalization** is the process of scaling data into the range [0, 1]. Because the minimum and maximum define the range, it is sensitive to outliers.
* **Standardization** is the process of scaling data so that each feature has a mean of 0 and a standard deviation of 1. It is a common default for linear models such as the logistic regression used here.

<img src = "https://i.postimg.cc/j5vg2QX6/scaling2.png" />
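
To see the difference between the two approaches, the sketch below applies both scalers to a tiny made-up column (`X_demo` is an illustrative array, not part of the ad data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature with five values.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

normalized = MinMaxScaler().fit_transform(X_demo)       # (x - min) / (max - min)
standardized = StandardScaler().fit_transform(X_demo)   # (x - mean) / std

print("normalized:  ", normalized.ravel())
print("standardized:", standardized.ravel())
print("mean, std after standardizing:", standardized.mean(), standardized.std())
```

The normalized values run from 0 to 1, while the standardized values are centered on 0 with unit standard deviation.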

In [107]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

In [109]:
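scaler = StandardScaler()
# Fit on the training set only, so no test-set statistics leak into the model
scaler.fit(X_train)
# Apply the same training-set mean and standard deviation to both sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)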

In [110]:
logmodel = LogisticRegression()
logmodel.fit(X_train_scaled, y_train)

LogisticRegression()

In [111]:
predictions = logmodel.predict(X_test_scaled)

In [112]:
logmodel.score(X_test_scaled, y_test)

0.96
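
An alternative to scaling by hand is to chain the scaler and the model in a scikit-learn `Pipeline`, so the scaler is always fitted on the training data only. This is a sketch on synthetic data rather than the notebook's own workflow; `X_demo`, `y_demo`, and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the advertising data set).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales X_tr and trains the model; score() reuses the fitted scaler on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print("Test accuracy:", acc)
```

A pipeline can also be passed directly to `learning_curve` or other cross-validation helpers, which then refit the scaler inside every fold.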

In [113]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96        89
           1       0.97      0.95      0.96       111

    accuracy                           0.96       200
   macro avg       0.96      0.96      0.96       200
weighted avg       0.96      0.96      0.96       200



In [114]:
X_train_scaled

array([[ 0.68151885, -1.59171433, -0.6639286 ,  0.64565463, -0.95596841],
       [ 0.85377694, -0.45872077,  1.11780512,  0.60455583,  1.04605967],
       [ 0.52246303,  0.22107537,  0.90246998,  1.43132657, -0.95596841],
       ...,
       [ 1.64654135,  1.01417086, -0.23476995,  0.02529117,  1.04605967],
       [-0.53560399,  0.67427279,  1.31753347, -0.96176486, -0.95596841],
       [ 1.5547542 ,  0.44767408,  0.33886526,  0.28969341, -0.95596841]])