## Scikit-Learn: Creating Machine Learning Models

### Scikit-Learn Introduction

Where does Scikit-Learn fit in?

![image.png](attachment:image.png)

Why Scikit-Learn (SK-Learn)?
* Built on NumPy and Matplotlib (and Python)
* Has many built-in machine learning models
* Methods to evaluate your machine learning models
* Very well-designed API

What are we going to cover:
* The Scikit-Learn workflow
    1. Get the data ready
    2. Pick a model (to suit your problem)
    3. Fit the model to the data (learning patterns)
    4. Make a prediction (using patterns)
    5. Evaluate the model
    6. Improve through experimentation
    7. Save and reload your trained model

In [24]:
Scheme = [ '1. Get data ready',
    '2. pick a model(suit your problem)',
    '3. Fit the model to the data (learning patterns)',
    '4. make a prediction(using patterns)',
    '5. Evaluate the model',
    '6. Improve through experimentation',
    '7. Save and reload your trained model']

Where can you get help?
![image.png](attachment:image.png)

### Refresher - What is Machine Learning

Typically, we give the computer the inputs and the function, and the computer computes the outputs:

![image.png](attachment:image.png)

In machine learning, we give the computer the inputs and the desired outputs, and the computer figures out what the function is.
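A minimal sketch of this idea, using made-up numbers and `LinearRegression` as a stand-in model (neither is from the course material): the code never tells the model the rule `y = 2x + 1`, only example inputs and outputs, and fitting recovers it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Inputs and desired outputs; the hidden rule (y = 2x + 1) is never given to the model
inputs = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
outputs = 2 * inputs.ravel() + 1

# Fitting lets the model work out the function from the examples
learned = LinearRegression().fit(inputs, outputs)
slope = float(learned.coef_[0])
intercept = float(learned.intercept_)
prediction = float(learned.predict(np.array([[10.0]]))[0])

print(f"learned rule: y = {slope:.2f}x + {intercept:.2f}")
print(f"prediction for x=10: {prediction:.2f}")
```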

### Scikit-Learn Cheatsheet

#### Recap of the Scikit-Learn steps

1. Get Data Ready
2. Pick a model that fits your problem
3. Fit the model to the data and make a prediction
4. Evaluate the model
5. Improve through experimentation
6. Save and reload. 

#### Official Documentation


https://scikit-learn.org/stable/user_guide.html

#### Example of a Scikit-Learn Workflow

https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/scikit-learn-workflow-example.ipynb

## Typical Scikit-Learn Workflow

1. Activate the environment in conda
    
    **conda activate**

2. Start Jupyter Notebook

### 1. Get the Data Ready

In [2]:
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# create x - the feature matrix (also called the data or feature variables)
x = heart_disease.drop("target",axis=1)

# create y (the labels)
y = heart_disease["target"]

### 2. Choose the right model and hyperparameters

In [4]:
# classification model
from sklearn.ensemble import RandomForestClassifier 

clf = RandomForestClassifier(n_estimators=100) # clf - short for classifier

# We will keep the default hyperparameters

clf.get_params()


{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3. Fit the model to the training data

In [5]:
from sklearn.model_selection import train_test_split

# train_test_split splits the features (x) and labels (y) into
# X_train, X_test, y_train and y_test.
# The training data is used to fit the model; the test data is data the model
# has never seen, used to check its performance.

X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
# test_size -> fraction of the data held out for testing (0.2 = 20%)

In [6]:
clf.fit(X_train,y_train) # fit tells the model (random forest)
# to find the patterns in the training data

In [7]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
293,67,1,2,152,212,0,0,150,0,0.8,1,0,3
118,46,0,1,105,204,0,1,172,0,0.0,2,0,2
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
169,53,1,0,140,203,1,0,155,1,3.1,0,0,3
92,52,1,2,138,223,0,1,169,0,0.0,2,4,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3
35,46,0,2,142,177,0,0,160,1,1.4,0,0,2
279,61,1,0,138,166,0,0,125,1,3.6,1,1,2
84,42,0,0,102,265,0,0,122,0,0.6,1,0,2


In [8]:
# make a prediction
import numpy as np
y_label = clf.predict(np.array([1,2,3,4])) 
# The input must be a 2D array with the same number of feature columns as X_train,
# so the 1D array above raises an error



ValueError: Expected 2D array, got 1D array instead:
array=[1. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [None]:
y_preds = clf.predict(X_test) # pass X_test to get the predicted labels (y_preds)
y_preds

array([0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0], dtype=int64)

In [None]:
y_test

179    0
235    0
237    0
72     1
96     1
      ..
168    0
301    0
58     1
120    1
199    0
Name: target, Length: 61, dtype: int64

### 4. Evaluate the model on training data and test data

In [None]:
clf.score(X_test,y_test) # mean accuracy: the fraction of test samples predicted correctly

0.819672131147541

In [None]:
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score

print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.87      0.79      0.83        33
           1       0.77      0.86      0.81        28

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61



In [None]:
confusion_matrix(y_test,y_preds)


array([[19,  7],
       [ 5, 30]], dtype=int64)

In [None]:
accuracy_score(y_test,y_preds)

0.8032786885245902
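To see how accuracy relates to a confusion matrix, here is a small sketch on made-up labels (not the heart disease predictions above). Accuracy is the count on the diagonal (correct predictions) divided by the total number of samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels purely for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = cm.ravel()

# Correct predictions sit on the diagonal
manual_accuracy = (tn + tp) / cm.sum()
library_accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(manual_accuracy, library_accuracy)
```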

### 5. Improve a model

Try different values of n_estimators

In [None]:
np.random.seed(42)
for i in range(10,100,10):
    print(f"trying model with {i} estimators..")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f"Model accuracy on test set:{clf.score(X_test,y_test)*100:0.2f}%")
    print("")

trying model with 10 estimators..
Model accuracy on test set:80.33%

trying model with 20 estimators..
Model accuracy on test set:78.69%

trying model with 30 estimators..
Model accuracy on test set:81.97%

trying model with 40 estimators..
Model accuracy on test set:80.33%

trying model with 50 estimators..
Model accuracy on test set:80.33%

trying model with 60 estimators..
Model accuracy on test set:85.25%

trying model with 70 estimators..
Model accuracy on test set:78.69%

trying model with 80 estimators..
Model accuracy on test set:80.33%

trying model with 90 estimators..
Model accuracy on test set:83.61%



### 6. Save a model and load it

In [None]:
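import pickle # a library that lets us save a model to disk and load it back

# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl","wb") as f:
    pickle.dump(clf,f)

In [None]:
with open("random_forest_model_1.pkl","rb") as f:
    loaded_model = pickle.load(f)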
loaded_model.score(X_test,y_test)

0.8360655737704918
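As an alternative sketch, `joblib` (installed alongside Scikit-Learn) offers the same dump/load pattern as `pickle`. The example assumes synthetic data from `make_classification` and a hypothetical filename `demo_model.joblib`; it is not the model saved above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the heart disease data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save to a temporary folder so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical filename
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should make exactly the same predictions
same_predictions = bool((model.predict(X_demo) == reloaded.predict(X_demo)).all())
print(same_predictions)
```

As with `pickle`, only load model files from sources you trust.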

## Optional: Debugging Warnings in Jupyter

Warnings often pop up in Jupyter.
One way to deal with them is to read each warning and fix its cause.
Another way is to import the `warnings` module and filter them out.

In [None]:
import warnings
warnings.filterwarnings("ignore")
#warnings.filterwarnings("default")  # restores the default behaviour
# Only ignore warnings when you understand what they mean and know they are safe to skip.
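
A more targeted option, sketched below, is to silence only one category of warning and only inside a block of code. `noisy_function` is a hypothetical stand-in for a library call that emits a `FutureWarning`.

```python
import warnings

def noisy_function():
    # Hypothetical function that emits a FutureWarning, as many library calls do
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# record=True collects any warnings that still get through, so we can check the filter worked
with warnings.catch_warnings(record=True) as caught:
    # Ignore FutureWarning only; the filter is undone when the block ends
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_function()

print(result, len(caught))
```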


### Upgrading a package in an existing conda environment

1. Go to the project directory
2. Activate the conda environment
3. Use *conda list* to check which packages are currently installed
![image.png](attachment:image.png)
4. Use *conda search* to find out which versions are available for download. Make sure the dependencies are compatible. If not, uninstall the package and reinstall it.

![image-2.png](attachment:image-2.png)

5. Use *conda update [packagename]* to update an individual package, or use *conda install scikit-learn=0.22* to install a specific version

## Step 1: Getting your data ready

Now we will go through each step of the workflow in detail.
The first step of any machine learning project is to get the data ready.

Three main things to do in this step:
    
    1. Split the data into features and labels (usually `X` & `y`)
    2. Fill (also called imputing) or disregard missing values
    3. Convert non-numerical values to numerical values (also called feature encoding)

In [None]:
#standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

In [None]:

heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [None]:
#drop the target column because that is the output
X = heart_disease.drop("target",axis=1) # axis 0 is the row axis, axis 1 is the column axis.


In [None]:
# label of the ML problem
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

### 1.1 Splitting Data

In ML, one of the most fundamental principles is not to evaluate or test your models with your training data. 
Therefore, we need to split the data into training set and test set.

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split 

# train_test_split returns 4 values
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2) 
# 20% of X and y goes to the test data
# 80% of X and y goes to the training data


In [None]:
X.shape,len(heart_disease)

((303, 13), 303)

In [None]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((242, 13), (61, 13), (242,), (61,))

We can see that X_train (242 rows) + X_test (61 rows) = X (303 rows).
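The split above is random, so the rows that land in each set change on every run. A sketch of two commonly used `train_test_split` parameters, on made-up data: `random_state` makes the split repeatable and `stratify` keeps the class proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # 20 samples, 2 features
y_demo = np.array([0] * 10 + [1] * 10)  # balanced labels

# random_state makes the shuffle repeatable; stratify keeps the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(int(y_te.sum()), len(y_te))  # number of class-1 samples in the test set, and its size
```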

### 1.1.1 Quick Tip


When we receive a bunch of data, we need to examine it to make sure that it is actually useful to us.

Usually it goes through the following steps:

    Clean Data -> Transform Data -> Reduce Data

- Clean -> Remove outliers and handle missing data, for example by filling it with the average

- Transform -> Transform data into a form that the computer can understand, e.g. colour to RGB values, True = 1, False = 0

- Reduce -> The more data we have, the more compute (and energy) it takes. So it is useful to reduce the data when we can get the same results with less. The process of reducing the number of columns is also called dimensionality reduction or column reduction


### 1.2. "Massaging Data" 

    1. Make sure everything is numerical
    2. Fill missing data (imputation)
    3. Feature scaling

#### 1.2.1 Make sure everything is numerical

Not all datasets come with only numerical values; take the car sales data as an example.
We need to convert the non-numerical values into numbers.

In [None]:
car_sales = pd.read_csv("data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [None]:
len(car_sales)

1000

In [None]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

We can see that `Make` and `Colour` have dtype `object` because they hold strings.

In [None]:
# Split into X/y

X = car_sales.drop("Price",axis=1)
y = car_sales["Price"]

# Split into training and test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2) 

In [None]:
# Build machine learning model
# Instead of RandomForestClassifier, we use a regression model because we are predicting a number
from sklearn.ensemble import RandomForestRegressor 

model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

ValueError: could not convert string to float: 'Toyota'

We can see that an error occurs: the model cannot work with string values.

Here is how to fix it:

In [None]:
#using sklearn module, turning the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features =["Make","Colour","Doors"] 
# notice that Doors is numerical, but we treat it as categorical (e.g. a 4-door car)
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)]
                                ,remainder="passthrough"
                                )
#Transform X into numbers
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In the code above, we first define which columns need to be turned into numbers: `categorical_features`.

Then we create a `OneHotEncoder()` instance.

![image.png](attachment:image.png)

One-hot encoding is a process of turning categories into numbers: each category gets its own column holding 1 where the row belongs to that category and 0 otherwise.

Next we use `ColumnTransformer` to create a transformer. It takes the name "one_hot" and the one-hot encoder, and applies the encoder to `categorical_features`. With `remainder="passthrough"`, the remaining columns are left unchanged.

Finally we call `fit_transform` on `X` to produce `transformed_X`.

In [None]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [None]:
pd.DataFrame(transformed_X)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


`Make`, `Colour` and `Doors` have been turned into columns of 0s and 1s.

Another way to achieve a similar effect is to use the `get_dummies` function from pandas.

In [None]:
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True
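
In the output above, `Doors` is still a single numerical column, because `get_dummies` by default only encodes string and categorical columns. A sketch on made-up data of one way to encode it too: passing the `columns` parameter.

```python
import pandas as pd

# Made-up rows, not the car sales data
cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Doors": [4, 5, 4],
})

# Naming the columns explicitly encodes them even when they hold numbers
dummies = pd.get_dummies(cars, columns=["Make", "Doors"])
dummy_columns = list(dummies.columns)
print(dummy_columns)
```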


Now that everything is numerical, we can re-run our training.

In [None]:
#Let's fit a model
np.random.seed(42)
X_train,X_test,y_train,y_test = train_test_split(transformed_X,y,test_size=0.2)

model.fit(X_train,y_train)

In [None]:
model.score(X_test,y_test)

0.3235867221569877

#### 1.2.2 Filling missing values

    1. Fill them with some value(also known as imputation)
    2. Remove the samples with missing data altogether.

In [None]:
# using car_sales_missing as an example:

car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum() # count the missing (NA) values in each column


Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [None]:
#Create X&y

X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing["Price"]

##### Option 1: Fill Missing Data with pandas

In [None]:
# Fill the "Make" column
# Assign the result back rather than calling fillna(..., inplace=True) on a
# selected column: pandas warns that the chained inplace form will stop working in pandas 3.0
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with 4, because most cars have 4 doors
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)


In [None]:
car_sales_missing["Doors"].value_counts()

Doors
4.0    811
5.0     75
3.0     64
Name: count, dtype: int64

In [None]:
#Check our dataframe again
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

We left the Price column untouched because Price is what we are trying to predict, so we don't want to fill it with made-up values. Instead we simply remove the rows with a missing price.

In [None]:
# Remove Rows with missing price value
car_sales_missing.dropna(inplace=True)

In [None]:
#Now we can see that there is no more missing value
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [None]:
len(car_sales_missing) # however, we lose some samples: we had 1000 at first

950

In [None]:
#Now we can re-split our data

X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing["Price"]


In [None]:
# Now let's convert our data to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder() # instantiate the encoder (the parentheses are required)
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],remainder = "passthrough")


##### Option 2: Fill Missing Data with Scikit-Learn

In [None]:
# Re-import the data, because previously we filled the missing values
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")

In [None]:
#check for amount of missing data
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [None]:
# drop missing price value row
car_sales_missing.dropna(subset="Price",inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [None]:
# Split into X&y
X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing["Price"]

In [None]:
# Fill missing data
from sklearn.impute import SimpleImputer # fills in missing data
from sklearn.compose import ColumnTransformer # applies transformers to chosen columns

# Define the imputers
# Fill categorical values with "missing" and numerical values with the mean

cat_imputer = SimpleImputer(strategy="constant",fill_value= "missing") # strategy="constant" fills every missing value with fill_value, here "missing"
door_imputer = SimpleImputer(strategy="constant",fill_value= 4) # fill missing values with 4
num_imputer = SimpleImputer(strategy="mean") # fill missing values with the column mean

# Define the columns
cat_features = ["Make","Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]


# Create one imputer (something that fills missing data) with ColumnTransformer
# It takes a list of tuples: a name, the imputer, and the columns it applies to
imputer = ColumnTransformer([("cat_imputer",cat_imputer,cat_features),
                             ("door_imputer",door_imputer,door_features),
                             ("num_imputer",num_imputer,num_features)
 ])

#Transform the data

filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [None]:
car_sales_filled = pd.DataFrame(filled_X,columns = ["Make","Colour","Doors","Odometer (KM)"])
car_sales_filled.isna().sum()


Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
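
To make it clearer what `SimpleImputer` does, here is a minimal sketch on made-up odometer readings. The imputer learns the fill value during `fit` and reuses it in `transform`, which means it can be fitted on training data and then applied to new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up odometer readings with one missing value
odometer = np.array([[10000.0], [np.nan], [30000.0]])

mean_imputer = SimpleImputer(strategy="mean")
filled = mean_imputer.fit_transform(odometer)

# The mean learned during fit is stored in statistics_
learned_mean = float(mean_imputer.statistics_[0])

# transform reuses the stored mean on new data without refitting
new_filled = mean_imputer.transform(np.array([[np.nan]]))

print(learned_mean)
print(filled.ravel(), new_filled.ravel())
```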

By now, we have filled the missing data using both the pandas method and the Scikit-Learn method.

Now let's convert the non-numerical columns into numbers, and then split the data and fit a model.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],remainder = "passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [None]:
# let's fit our model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(transformed_X,y,test_size=0.2)

model = RandomForestRegressor(n_estimators=100)

model.fit(X_train,y_train)
model.score(X_test,y_test)

0.21990196728583944

#### 1.2.3 Feature Scaling

Process of making sure all of your numerical data is on the same scale.

For example, say you were trying to predict the sale price of cars and the number of kilometres on their odometers varies from 6,000 to 345,000 but the median previous repair cost varies from 100 to 1,700. A machine learning algorithm may have trouble finding patterns in these wide-ranging variables.

To fix this, there are two main types of feature scaling.

1. `Normalization` (also called min-max scaling) - This rescales all the numerical values to between 0 and 1, with the lowest value becoming 0 and the highest value becoming 1. Scikit-Learn provides functionality for this in the `MinMaxScaler` class.

2. `Standardization` - This subtracts the mean value from each feature (so the resulting features have 0 mean). It then scales the features to unit variance (by dividing each feature by its standard deviation). Scikit-Learn provides functionality for this in the `StandardScaler` class.
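
A minimal sketch of both scalers on made-up odometer readings (the values are illustrative, not from the car sales data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up odometer readings (km)
odometer = np.array([[6000.0], [100000.0], [345000.0]])

# Normalization: smallest value maps to 0, largest to 1
normalized = MinMaxScaler().fit_transform(odometer)

# Standardization: result has mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(odometer)

print(normalized.ravel())
print(standardized.mean(), standardized.std())
```

In a real workflow the scaler would usually be fitted on the training data only and then applied to the test data with `transform`.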

A couple of things to note.

Feature scaling usually isn't required for your target variable(y).

Feature scaling is usually not required with tree-based models (e.g. Random Forest) since they can handle varying features.

Additional Reading:
1. https://rahul-saini.medium.com/feature-scaling-why-it-is-required-8a93df1af310
2. https://benalexkeen.com/feature-scaling-with-scikit-learn/
3. https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

## Step 2: Choosing the right estimator/algorithm for your problem (ep. 116)

Some things to note:

* Scikit-Learn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not)
    * Sometimes you'll see `clf` (short for classifier) used as a classification estimator
* Regression problem - predicting a number (selling price of a car)

If you are working on a machine learning problem with Scikit-Learn and are not sure what model to use, refer to the Scikit-Learn machine learning map:

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

![image.png](attachment:image.png)

### 2.1 Picking a machine learning model for regression problem

Let's use the California Housing dataset (one of Scikit-Learn's built-in datasets)

In [15]:
# Get California Housing Dataset

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing.feature_names
housing_df =pd.DataFrame(housing["data"],columns=housing["feature_names"])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [16]:
housing_df["target"] = housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [17]:
# This raises a KeyError: the target column was added above as "target", not "MedHouseVal"
housing_df = housing_df.drop("MedHouseVal",axis=1)

KeyError: "['MedHouseVal'] not found in axis"

In [18]:
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [None]:
# Import the algorithm
from sklearn.linear_model import Ridge

# Set up the random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target",axis=1)
y = housing_df["target"] # median house price in units of $100,000

# Split into train and test sets
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.2)

# Instantiate and fit the model (on the training set)
# Following the machine learning map, we decided to experiment with the Ridge regression model
model = Ridge()
model.fit(X_train,y_train)

# Check the score of the model (on the test set)
model.score(X_test,y_test) # R-squared


0.5758549611440128

The `score` method of a regressor returns the coefficient of determination, commonly known as R-squared. It measures how much of the variation in the target the model explains: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean of the target, and the value can be negative for a model that does worse than that.

The higher the better: it tells us how predictive the features are of the target value.

https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression


What if the `Ridge` model doesn't work well? We can try different models.
This time, let's try ensemble methods (an "ensemble" is a group that performs together).

![image.png](attachment:image.png)

Ensemble methods `combine the predictions of several base estimators` built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Sklearn's ensemble method: https://scikit-learn.org/stable/modules/ensemble.html#

In [None]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data
X = housing_df.drop("target",axis = 1)
y = housing_df["target"]

# Split into train and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train,y_train)

# Check the score of the model (on the test set)
model.score(X_test,y_test)

0.8065734772187598

We can already see that the random forest does much better than `Ridge` on this dataset (about 0.81 versus 0.58).

#### (Optional 1) Understanding R-squared (coefficient of determination)



![image.png](attachment:image.png)

R-squared describes how well the predictions match the true values, compared with simply predicting the mean of the target every time.
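A sketch of how R-squared is calculated, on made-up values, checked against Scikit-Learn's `r2_score`. It is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error left after using the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot
r2_library = r2_score(y_true, y_pred)

print(r2_manual, r2_library)
```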

#### (Optional 2) Understanding Random Forest (beginner)

1. It is a combination of multiple decision trees.

#### (Optional 3) Understanding Regression Prediction Metrics

### 2.2 Picking a machine learning model for a `classification problem`

Let's go to the map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [None]:
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Consulting the map, it says to try `LinearSVC`.

In [None]:
#import LinearSVC

from sklearn.svm import LinearSVC

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

# Split the data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC(max_iter = 1000)
clf.fit(X_train,y_train)

# Evaluate the LinearSVC

clf.score(X_test,y_test)




0.8688524590163934

In [None]:
# Let's try the random forest classifier
from sklearn.ensemble import RandomForestClassifier

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

# Split the data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

#Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)

#evaluate the RandomForest Classifier

clf.score(X_test,y_test)

0.8524590163934426

Tidbit: 

    1. If you have structured data (tables, such as Excel spreadsheets, the heart_disease data, or anything in a DataFrame), use ensemble methods.
    2. If you have unstructured data, use deep learning or transfer learning. 

### 2.3 Fitting A Model To The Data

Fit the model/algorithm to our data and use it to make predictions.

Different names for:

* X = features, feature variables, data
* y = labels, targets, target variables




#### 2.3.1 Fitting the model to the data

In [None]:
# Import the RandomForestClassifier estimator class

from sklearn.ensemble import RandomForestClassifier

#Setup random seed
np.random.seed(42)

#Make the Data
X = heart_disease.drop("target",axis =1)
y = heart_disease["target"]


#Split the data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.2)

#Instantiate the model
clf = RandomForestClassifier(n_estimators=100)

#Fit the model to the data(training the machine learning model)
clf.fit(X_train,y_train)




# Evaluate the Random Forest Classifier(use the patterns the model has learned)
clf.score(X_test,y_test)

X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


The `fit()` function is the core of the ML model. It looks for patterns in the feature values (X) that relate them to the target (y).
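As a minimal sketch of what fitting does, the toy example below uses made-up data (not the heart disease set) where the label depends only on the first column, then checks that the fitted forest has picked up on that. The names `X_toy`, `y_toy`, and `toy_clf` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: the label is 1 whenever the first feature is positive
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# fit() is where the learning happens; it returns the fitted estimator itself
fitted = toy_clf.fit(X_toy, y_toy)

# Attributes ending in an underscore only exist after fit()
print(toy_clf.classes_)
print(toy_clf.feature_importances_)
```

The first entry of `feature_importances_` should dominate, since that column is the only one that carries information about the label.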

### 2.4 Make Predictions using a machine learning model


#### 2.4.1 Making Predictions on Classification Problems.

There are 2 ways to make predictions: 

1. `predict()`
2. `predict_proba()`

When we pass in data, we need to make sure its structure (the number and order of columns) matches the data the model was trained on.

##### 1. Make predictions using the `predict()` function

In [None]:
clf.predict(np.array([1,7,9,3,2]))  # This is going to give us an error



ValueError: Expected 2D array, got 1D array instead:
array=[1. 7. 9. 3. 2.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
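The error message itself suggests the fix for a single sample: reshape the 1D array into a 2D array with one row. The sketch below shows this on a hypothetical model trained on five features (`X_demo`, `y_demo`, and `demo_clf` are made up for illustration). Note that reshaping alone would not be enough for the `clf` in this notebook, because the input must also have the same number of columns as the `X` it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical model trained on 50 samples with 5 features each
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

single_sample = np.array([1, 7, 9, 3, 2])  # 1D array, shape (5,)
as_row = single_sample.reshape(1, -1)      # 2D array, shape (1, 5): one sample, five features

pred = demo_clf.predict(as_row)
print(as_row.shape, pred)
```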

In [None]:
#Use the trained model to make predictions

clf.predict(X_test)



array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int64)

With the predicted values, we can compare them with y_test (the actual results).

In [None]:
np.array([y_test])

array([[0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
        0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]], dtype=int64)

In [None]:
# now we can compare our predictions to the truth labels to evaluate the models

y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)  # fraction of correct predictions, i.e. the accuracy (not R-squared)


0.8524590163934426

In [None]:
clf.score(X_test,y_test)

0.8524590163934426

In [None]:
# another way to evaluate the model 
from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_preds)

0.8524590163934426
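The three cells above all report the same number because, for a classifier, they compute the same thing: the fraction of correct predictions. A minimal sketch on synthetic data (generated with `make_classification`, so the names `X_syn`, `y_syn`, and `syn_clf` are assumptions, not part of the heart disease example) puts the three side by side:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

syn_clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
syn_preds = syn_clf.predict(X_te)

manual_acc = np.mean(syn_preds == y_te)       # compare predictions to truth by hand
score_acc = syn_clf.score(X_te, y_te)         # the estimator's built-in score()
metric_acc = accuracy_score(y_te, syn_preds)  # the metric function

print(manual_acc, score_acc, metric_acc)
```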

##### 2. Make predictions using the `predict_proba()` function

Previously we discussed the `predict()` function. Its output is an array of 0s and 1s. It looks like the following: 

In [11]:
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1], dtype=int64)

`predict_proba()` returns the probability of each classification label. 

In [13]:
clf.predict_proba(X_test[:10])

array([[0.06, 0.94],
       [0.27, 0.73],
       [0.37, 0.63],
       [0.43, 0.57],
       [0.03, 0.97],
       [0.49, 0.51],
       [1.  , 0.  ],
       [0.02, 0.98],
       [0.02, 0.98],
       [0.84, 0.16]])

As we can see, we don't get an exact 1 or 0. Instead, each row holds two probabilities: the first column is the probability of label 0 (no heart disease) and the second column is the probability of label 1 (heart disease). The closer a probability is to 1, the more confident the model is in that label. 

`predict()` picks the label with the higher probability, so with two classes the threshold is effectively 0.5.
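To illustrate how the probabilities relate to the hard labels, the sketch below takes the most probable class from `predict_proba()` and checks that it matches `predict()`. It runs on synthetic data, and `X_p`, `y_p`, and `proba_clf` are hypothetical names used only here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data for illustration
X_p, y_p = make_classification(n_samples=200, n_features=6, random_state=42)
proba_clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_p, y_p)

proba = proba_clf.predict_proba(X_p[:10])  # shape (10, 2): one column per class

# Columns follow the order of proba_clf.classes_, so argmax maps back to a label
labels_from_proba = proba_clf.classes_[np.argmax(proba, axis=1)]
labels_from_predict = proba_clf.predict(X_p[:10])

print(proba.sum(axis=1))  # each row sums to 1
print(labels_from_proba)
print(labels_from_predict)
```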

#### 2.4.2 Making Predictions with a Regression Model

In [19]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

#Create the data
X = housing_df.drop("target",axis=1)
y = housing_df["target"]

# Split into training and test sets.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

#Create model instance
model = RandomForestRegressor(n_estimators=100)

#Fit the model with our training data
model.fit(X_train,y_train)

# Make predictions
y_preds = model.predict(X_test)




In [20]:
y_preds[:10]

array([0.49384  , 0.75494  , 4.9285964, 2.54316  , 2.33176  , 1.6525301,
       2.34323  , 1.66182  , 2.47489  , 4.8344779])

In [21]:
np.array(y_test[:10])

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   , 1.587  , 1.982  ,
       1.575  , 3.4    , 4.466  ])

In [22]:
len(y_preds),len(y_test)

(4128, 4128)

Quick reminder: y_preds is what the model predicts, whereas y_test is the actual truth.

What we want to do is compare the predictions to the truth. 

In [23]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test,y_preds)

0.32659871732073664

The 0.33 value (the mean absolute error) tells us that, on average, each prediction is off from the ground truth (the target values) by about 0.33.
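The mean absolute error is simply the average of the absolute differences between predictions and truth. As a small worked example, the sketch below recomputes it by hand for the ten predictions and ten true values printed above (so the result covers only those ten samples, not the full test set) and compares it with `mean_absolute_error`. The variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First ten predictions and true values, copied from the outputs above
first_preds = np.array([0.49384, 0.75494, 4.9285964, 2.54316, 2.33176,
                        1.6525301, 2.34323, 1.66182, 2.47489, 4.8344779])
first_truth = np.array([0.477, 0.458, 5.00001, 2.186, 2.78,
                        1.587, 1.982, 1.575, 3.4, 4.466])

# MAE by hand: average of |truth - prediction|
manual_mae = np.mean(np.abs(first_truth - first_preds))
library_mae = mean_absolute_error(first_truth, first_preds)

print(manual_mae, library_mae)
```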



### 2.5 Evaluating A Machine Learning Model Score(Part I)

In [25]:
# Reminder: we are at the fifth step of the workflow

Scheme

['1. Get data ready',
 '2. pick a model(suit your problem)',
 '3. Fit the model to the data (learning patterns)',
 '4. make a prediction(using patterns)',
 '5. Evaluate the model',
 '6. Improve through experimentation',
 '7. Save and reload your trained model']

Three ways to evaluate Scikit-Learn models/estimators:

1. The estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions

reference: https://scikit-learn.org/stable/modules/model_evaluation.html

#### 2.5.1 Evaluating a model with the `score()` method

Example 1: using `score()` method on classification problem - heart disease

In [26]:
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [30]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

#Create X & y
X = heart_disease.drop("target",axis = 1)
y = heart_disease["target"]

#Create train/test
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Create classifier model instance

clf = RandomForestClassifier(n_estimators=100)

#Fit classifier to training data. 
clf.fit(X_train,y_train)


In [31]:
clf.score(X_train,y_train) 

1.0

We get 1.0 when we score on X_train and y_train because this is the same data the model was trained on. A random forest can effectively memorize its training samples, so a perfect training score says little about how the model handles new data. 

In this case, 1.0 means 100% accuracy on the training data. 

If we run the score on the test data:

In [32]:
clf.score(X_test,y_test)

0.8524590163934426

Now we see that the score for X_test and y_test is 0.85.
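The gap between the training score and the test score can be reproduced on any noisy data set. The sketch below is an illustration on synthetic data, where `flip_y=0.2` deliberately adds label noise; `X_noisy`, `y_noisy`, and `noisy_clf` are hypothetical names. The forest scores very highly on the data it was trained on, while the test score gives the more honest picture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic noisy data: flip_y=0.2 assigns a random class to about 20% of the samples
X_noisy, y_noisy = make_classification(
    n_samples=400, n_features=10, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y_noisy, test_size=0.2, random_state=42
)

noisy_clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

train_acc = noisy_clf.score(X_tr, y_tr)  # scored on data the model has already seen
test_acc = noisy_clf.score(X_te, y_te)   # scored on held-out data

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```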

Example 2 - using `score()` method on regression problem - housing

In [33]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)
#Steps: Create data, Split, Create model, Fit

#Create the data
X = housing_df.drop("target",axis=1)
y = housing_df["target"]
#Split the data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
#Create model instance
model = RandomForestRegressor()  # default n_estimators is 100 since scikit-learn 0.22 (it was 10 in earlier versions)
#Fit the model to the data
model.fit(X_train,y_train)

For regression algorithms, the default `score()` evaluation metric is R-squared (R², the coefficient of determination).

In [34]:
model.score(X_test,y_test)

0.8065734772187598
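R-squared compares the model's squared errors with those of a baseline that always predicts the mean of the targets: R² = 1 - SS_res / SS_tot. As a minimal sketch, the code below computes it by hand and checks it against `score()` and `r2_score`. It uses synthetic data from `make_regression` rather than the housing data, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data for illustration
X_reg, y_reg = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

reg = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
reg_preds = reg.predict(X_te)

ss_res = np.sum((y_te - reg_preds) ** 2)    # squared error of the model
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # squared error of always predicting the mean
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, reg.score(X_te, y_te), r2_score(y_te, reg_preds))
```

A value of 1.0 means perfect predictions, 0.0 means the model does no better than predicting the mean, and negative values mean it does worse.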

#### 2.5.2 Evaluating a model using the `scoring` parameter

https://scikit-learn.org/stable/modules/cross_validation.html

First, without knowing what cross-validation is, let's run the code and see what happens. 

In [35]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)

X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train,y_train);



In [36]:
clf.score(X_test,y_test) #What we learned before.

0.8524590163934426

In [37]:
#what we are learning today: cross_val_score
cross_val_score(clf,X,y,cv=5) # cross_val_score takes in the full X and y data, not the test split

array([0.81967213, 0.86885246, 0.81967213, 0.78333333, 0.76666667])

Notice that `cross_val_score` takes in the full X and y data, not the test split.

`cross_val_score` returns an array of scores, one per fold. 

##### WHAT IS CROSS_VALIDATION?

![image.png](attachment:image.png)

On the left is our normal training and testing split: we divide the data into one training set and one test set. 

What the `score()` method did previously was calculate the accuracy on that single test set.

Cross-validation usually refers to K-fold cross-validation, where K is a number we choose (here we set `cv=5`).

K-fold cross-validation runs K different splits. 
With K = 5, a different 20% of the data is held out as the test set each time, and the model is trained on the remaining 80%. This is beneficial because every sample is used for both training and testing across the K rounds. 

![image.png](attachment:image.png)

By switching up which part of the data is used for testing and which for training, cross-validation gives a more reliable estimate of how the model performs on unseen data and makes over-fitting easier to spot.

Cross-validation addresses two problems with a single split: the model is only ever evaluated on a small part of the data, and a single split can produce a lucky (or unlucky) score. 

That is why `cross_val_score()` returns K scores, where K is the number of splits. 

With the default scoring parameter (`scoring=None`), the estimator's own `score()` method is used: mean accuracy for a classifier, R-squared for a regressor.
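To make the K rounds concrete, the sketch below runs the folds by hand and compares the result with `cross_val_score`. It assumes synthetic data and a classifier with a fixed `random_state` so both runs are reproducible; `X_cv`, `y_cv`, and `base_clf` are hypothetical names. For a classifier with an integer `cv`, scikit-learn uses stratified folds, so the manual loop uses `StratifiedKFold` to match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data for illustration
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=42)
base_clf = RandomForestClassifier(n_estimators=50, random_state=42)

folds = StratifiedKFold(n_splits=5)  # 5 folds: each round holds out 20% for testing
manual_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    fold_clf = clone(base_clf)  # fresh, unfitted copy for every round
    fold_clf.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_clf.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(base_clf, X_cv, y_cv, cv=5)

print(manual_scores)
print(library_scores)
```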

In [42]:
#comparing two scores

clf_single_score = clf.score(X_test,y_test)
clf_cross_val_score = cross_val_score(clf,X,y,cv=5,scoring=None) # the cross_val_score takes in the X and y data, not the test data

clf_cross_val_score_mean = clf_cross_val_score.mean()

print(clf_single_score,clf_cross_val_score,clf_cross_val_score_mean)

0.8524590163934426 [0.81967213 0.90163934 0.86885246 0.8        0.78333333] 0.8346994535519124


Takeaways: 

1. From the results above, we can tell that splitting the data differently gives different scores. [The first fold's score is the same as in the earlier run.]

2. We also see that the single score (0.85) is slightly higher than the mean cross-validation score (0.83). 
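The `scoring` parameter also accepts the name of a metric as a string. As a hedged sketch on synthetic data (the names `X_sc`, `y_sc`, and `sc_clf` are assumptions), the code below shows that `scoring=None` and `scoring="accuracy"` agree for a classifier, and that other metrics such as `"precision"` and `"recall"` can be requested the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration; random_state keeps every run identical
X_sc, y_sc = make_classification(n_samples=300, n_features=8, random_state=42)
sc_clf = RandomForestClassifier(n_estimators=50, random_state=42)

default_scores = cross_val_score(sc_clf, X_sc, y_sc, cv=5, scoring=None)
accuracy_scores = cross_val_score(sc_clf, X_sc, y_sc, cv=5, scoring="accuracy")
precision_scores = cross_val_score(sc_clf, X_sc, y_sc, cv=5, scoring="precision")
recall_scores = cross_val_score(sc_clf, X_sc, y_sc, cv=5, scoring="recall")

print("default  :", default_scores.mean())
print("accuracy :", accuracy_scores.mean())
print("precision:", precision_scores.mean())
print("recall   :", recall_scores.mean())
```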

### 2.6 Classification Model Evaluation Metrics


Previously we only discussed one kind of evaluation metric (scoring method): accuracy for classifiers (or R-squared for regressors).

In this section, we will talk about other metrics:

1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification Report

#### 1. **Accuracy**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)
