# This is where the ML magic happens!
![](https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png)
References:
* https://scikit-learn.org/stable/
* https://scikit-learn.org/stable/user_guide.html
* You may meet many terms here for the first time. There is nothing to worry about: keep this glossary handy, it covers most of them. https://developers.google.com/machine-learning/glossary

<br>This book is awesome! Get it if possible! ![](https://learning.oreilly.com/library/cover/9781492032632/250w/)
* https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/

Aah! Just the meme I needed!
![](https://pics.esmemes.com/albert-einstein-insanity-is-doing-the-same-thing-over-and-61232420.png)

In [1]:
# first things first - the standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Our Track
#### An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together
![](https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/sklearn-workflow.png)

# 1. Getting the data ready

In [2]:
# Import dataset
hd = pd.read_csv("https://raw.githubusercontent.com/ineelhere/Machine-Learning-and-Data-Science/master/scikit-learn/heart-disease.csv")

# View the data
hd.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1


In [9]:
# creating a "features matrix" - X (age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal)
x = hd.drop("target", axis = 1)
x.head(3)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2


In [10]:
# creating a "label vector" - y (target)
y = hd["target"]
y.head(3)

0    1
1    1
2    1
Name: target, dtype: int64

# 2. Choose the right estimator/algorithm for our problems

* Our problem here is "classification": we want to predict whether someone has heart disease or not
* https://developers.google.com/machine-learning/glossary#classification-model

In [5]:
# choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)  # n_estimators is a hyperparameter; 100 is already the default since scikit-learn 0.22

<p>A random forest classifier.</p>
<p>A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the <code>max_samples</code> parameter if <code>bootstrap=True</code> (default), otherwise the whole dataset is used to build each tree.</p>
<p>Read more in the <a href="https://scikit-learn.org/stable/modules/ensemble.html#forest">User Guide</a>.</p>

* Go here for the full documentation - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* Need help with a new terminology? Go here - https://developers.google.com/machine-learning/glossary

In [6]:
# we will keep the default hyperparameters 
# let us see them first
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
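
If you do want to change a hyperparameter later, `set_params` is one way to do it. The snippet below is a minimal sketch on a fresh, unfitted classifier; the name `demo_clf` and the values 200 and 5 are illustrative choices, not settings used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier

# A fresh classifier with default hyperparameters (illustrative name)
demo_clf = RandomForestClassifier()
defaults = demo_clf.get_params()

# Override two hyperparameters in place (values chosen only for illustration)
demo_clf.set_params(n_estimators=200, max_depth=5)

# Show only the hyperparameters that now differ from the defaults
changed = {k: v for k, v in demo_clf.get_params().items() if defaults[k] != v}
print(changed)
```

Comparing against the saved defaults is a quick way to see exactly what you have altered before fitting.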

# 3. Fit the model/algorithm and use it to make predictions on our data

In [13]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.5) 
# splits our data (x and y) into training and test sets, half of the rows each
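
The split above is random, so the rows you get will change on every run. A minimal sketch of a reproducible split follows, assuming a synthetic stand-in dataset from `make_classification` (303 rows, 13 features) instead of the heart-disease table. The `random_state=42` and `stratify` arguments are optional extras that the cell above does not use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same shape as the heart-disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # number of rows and columns in each half
print(y_tr.mean(), y_te.mean())  # share of class 1 in each half
```

With a fixed `random_state`, rerunning the cell gives the same rows in each half, which makes later scores comparable between runs.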

In [14]:
x_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
103,42,1,2,120,240,1,1,194,0,0.8,0,0,3
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
277,57,1,1,124,261,0,1,141,0,0.3,2,0,3
43,53,0,0,130,264,0,0,143,0,0.4,1,0,2
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3
17,66,0,3,150,226,0,1,114,0,2.6,0,0,2
205,52,1,0,128,255,0,1,161,1,0.0,2,1,3
114,55,1,1,130,262,0,1,155,0,0.0,2,0,2


In [16]:
x_test

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
18,43,1,0,150,247,0,1,171,0,1.5,2,0,2
19,69,0,3,140,239,0,1,151,0,1.8,2,2,2
35,46,0,2,142,177,0,0,160,1,1.4,0,0,2
31,65,1,0,120,177,0,1,140,0,0.4,2,0,3
108,50,0,1,120,244,0,1,162,0,1.1,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12,49,1,1,130,266,0,1,171,0,0.6,2,0,2
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
109,50,0,0,110,254,0,0,159,0,0.0,2,0,2
34,51,1,3,125,213,0,0,125,1,1.4,2,1,2


In [15]:
y_train

103    1
298    0
277    0
43     1
25     1
      ..
7      1
17     1
205    0
114    1
246    0
Name: target, Length: 151, dtype: int64

In [18]:
y_test

18     1
19     1
35     1
31     1
108    1
      ..
12     1
1      1
109    1
34     1
56     1
Name: target, Length: 152, dtype: int64

In [11]:
#fit our model to the data - find the patterns in training data
clf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [12]:
# just a tip - get rid of the above output just by using a semicolon (;)
clf.fit(x_train, y_train);

In [19]:
# now let's make predictions on the test data
# the input must have the same feature columns as x_train (scroll up to see them);
# a pandas DataFrame or a NumPy array both work
y_preds = clf.predict(x_test)
y_preds

array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1],
      dtype=int64)
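
`predict` returns only the winning label. If you also want to know how sure the forest is, `predict_proba` gives one probability per class. This is a minimal sketch on synthetic stand-in data (an assumption, not the heart-disease table); `demo_clf` and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)

demo_clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

proba = demo_clf.predict_proba(X_te[:5])  # shape (5, 2): one column per class
labels = demo_clf.predict(X_te[:5])       # the class with the highest probability

print(demo_clf.classes_)  # column order of proba
print(np.round(proba, 2))
print(labels)
```

Each row of `proba` sums to 1, and the predicted label is the class whose column holds the largest value.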

# 4. Evaluating a model

In [21]:
# evaluating the model on the training data and the test data
clf.score(x_train, y_train)

0.9072847682119205

That is a good score! <br>
Now let us see what happens with the test data.

In [22]:
clf.score(x_test, y_test)

0.9013157894736842

In [24]:
# some more metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [30]:
print(classification_report(y_test, y_preds)) # compare test labels to the prediction labels

              precision    recall  f1-score   support

           0       0.91      0.87      0.89        69
           1       0.90      0.93      0.91        83

    accuracy                           0.90       152
   macro avg       0.90      0.90      0.90       152
weighted avg       0.90      0.90      0.90       152



In [31]:
print(accuracy_score(y_test, y_preds))

0.9013157894736842


In [32]:
print(confusion_matrix(y_test, y_preds))

[[60  9]
 [ 6 77]]
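
The numbers in the classification report can be recomputed by hand from this confusion matrix. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true label, columns = predicted label
cm = np.array([[60, 9],
               [6, 77]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of the predicted 1s, how many were really 1
recall_1 = tp / (tp + fn)         # of the real 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_1:.2f}")
print(f"recall:    {recall_1:.2f}")
print(f"f1-score:  {f1_1:.2f}")
```

The values agree with the row for class 1 in the report and with the accuracy score printed earlier.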


# 5. Improving a Model
* Try different values of `n_estimators`

In [35]:
np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test set: {clf.score(x_test, y_test)*100:.2f}")

Trying model with 10 estimators
Model accuracy on test set: 79.61
Trying model with 20 estimators
Model accuracy on test set: 81.58
Trying model with 30 estimators
Model accuracy on test set: 81.58
Trying model with 40 estimators
Model accuracy on test set: 80.92
Trying model with 50 estimators
Model accuracy on test set: 78.95
Trying model with 60 estimators
Model accuracy on test set: 80.26
Trying model with 70 estimators
Model accuracy on test set: 82.89
Trying model with 80 estimators
Model accuracy on test set: 83.55
Trying model with 90 estimators
Model accuracy on test set: 81.58


* The setting with the highest test accuracy wins: here that is 80 estimators, at about 83.55%. (The default for `n_estimators` is 100 since scikit-learn 0.22; it was 10 in earlier versions.)
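
Picking the winner by test-set score makes that score a little optimistic, because the test data has now influenced the choice. A common alternative is to compare settings with cross-validation on the training data only and touch the test set once at the end. The sketch below assumes synthetic stand-in data and a small illustrative grid of values; none of the names come from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)

# Mean 5-fold cross-validated accuracy on the training half for each setting
cv_means = {}
for n in range(10, 101, 30):  # 10, 40, 70, 100
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    cv_means[n] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

best_n = max(cv_means, key=cv_means.get)

# Refit with the chosen setting and score once on the held-out test half
final_model = RandomForestClassifier(n_estimators=best_n, random_state=42).fit(X_tr, y_tr)
test_acc = final_model.score(X_te, y_te)

print(cv_means)
print(f"best n_estimators: {best_n}, test accuracy: {test_acc:.2f}")
```

`GridSearchCV` from `sklearn.model_selection` wraps the same idea when you want to search several hyperparameters at once.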

# 6. Save and load the trained model!

In [36]:
import pickle
# the "with" block closes the file for us once the model is written
with open("random_forest_model1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [42]:
with open("random_forest_model1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.8157894736842105

* This is the score of the last model trained in the loop above (90 estimators, about 81.58%), because `clf` still pointed to that model when we saved it
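
`joblib`, which is installed together with scikit-learn, is another common way to store a fitted model. A minimal sketch follows, assuming synthetic stand-in data and a temporary directory so that nothing is left on disk; the file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small fitted model (assumptions)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save to a temporary file, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(demo_clf, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts
same = np.array_equal(demo_clf.predict(X_demo), restored.predict(X_demo))
print(same)
```

As with pickle, only load model files that you trust, and expect to need a matching scikit-learn version when loading.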

# 7. Putting it all together!
![](https://pics.me.me/ytorch-data-scientist-pandas-nunpym-mpy-sklearn-bug-data-matplotlib-54757344.png)
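
Here is one way the steps above could look in a single cell. It is a minimal sketch, not the exact notebook run: it uses synthetic stand-in data from `make_classification` rather than the heart-disease CSV, and all variable names and the file name are illustrative.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Get the data ready (synthetic stand-in, NOT the heart-disease data)
X_all, y_all = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.5, random_state=42)

# 2. Choose the estimator and its hyperparameters
full_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 3. Fit the model and make predictions
full_clf.fit(X_tr, y_tr)
preds = full_clf.predict(X_te)

# 4. Evaluate the model
acc = accuracy_score(y_te, preds)

# 5. (Improving the model would go here, e.g. trying other n_estimators values)

# 6. Save and load the trained model
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(full_clf, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

reloaded_acc = reloaded.score(X_te, y_te)
print(f"accuracy: {acc:.4f}, after reload: {reloaded_acc:.4f}")
```

To run this on the real data, replace step 1 with the `pd.read_csv` call and the `x` / `y` split from section 1.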

## A friendly piece of advice: learn statistics. It helps you understand ML better.

![](https://i.redd.it/4f71u8ti5hg31.jpg)
### Don't be this guy!

# E N J O Y !