## 4. Evaluating a machine learning model

Three ways to evaluate Scikit-Learn models/estimators: 

1. Estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions
    
You can read more about these here: https://scikit-learn.org/stable/modules/model_evaluation.html 
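
As a rough sketch of how the three approaches look side by side, the snippet below uses a small synthetic dataset from `make_classification`. The dataset and the names `X_demo`, `y_demo`, and `demo_clf` are assumptions made for illustration, not part of the heart disease example used in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (an assumption, used only for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=42)
demo_clf.fit(X_tr, y_tr)

# 1. Built-in score() method (mean accuracy for classifiers)
score_result = demo_clf.score(X_te, y_te)

# 2. The scoring parameter of a model-selection tool such as cross_val_score
cv_result = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")

# 3. A problem-specific metric function from sklearn.metrics
metric_result = accuracy_score(y_te, demo_clf.predict(X_te))

print(f"score():          {score_result:.3f}")
print(f"cross_val_score:  {cv_result.mean():.3f} (mean of 5 folds)")
print(f"accuracy_score(): {metric_result:.3f}")
```

For a classifier, approaches 1 and 3 give the same number here because `score()` computes accuracy on the predictions internally.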

### 4.1 Evaluating a model with the `score` method

In [5]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
# Get the data (be sure to click "raw") - https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/data/heart-disease.csv 
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

# Fit the model to the data (training the machine learning model)
clf.fit(X_train, y_train)

In [11]:
# The highest value for the .score() method is 1.0, the lowest is 0.0
clf.score(X_train, y_train)
# The default score() metric for a classification algorithm is mean accuracy.
# Multiply by 100 to read it as a percentage: 1.0 is 100% accuracy, 0.8 is 80%.


1.0

In [12]:
# Why does our model get 1.0 on the training data?
# It has been exposed to all of the training features and all of the training labels.
# If the model is powerful enough, it will achieve a perfect score on the training data:
# the labels here are binary (0 or 1), and the model can predict every y_train value from the X_train values.

In [13]:
clf.score(X_test , y_test)

0.8524590163934426

In [14]:
# The test data is a different scenario: the model hasn't seen the test samples.
# The goal is to learn patterns in the data we have so we can make quality predictions on data we haven't seen before.
# The score is lower on unseen data.

In [15]:
# The training score is usually higher than the test score, sometimes by quite a bit, but the two should be relatively close.
# If you ever get a perfect test score (100% accuracy), be skeptical: go back and check your data.
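
To make the gap between training and test scores concrete, here is a minimal sketch on synthetic, deliberately noisy data. The dataset, the `flip_y` noise level, and the names `X_noisy`, `y_noisy`, and `gap_clf` are assumptions chosen for illustration; the exact numbers will differ from the heart disease results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y=0.1 assigns a random class to about 10% of samples,
# so no model should be able to score perfectly on unseen data
X_noisy, y_noisy = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y_noisy, test_size=0.2, random_state=0
)

gap_clf = RandomForestClassifier(n_estimators=100, random_state=0)
gap_clf.fit(X_tr, y_tr)

train_acc = gap_clf.score(X_tr, y_tr)  # data the model has already seen
test_acc = gap_clf.score(X_te, y_te)   # data the model has not seen

# score() returns a fraction, so multiply by 100 for a percentage
print(f"Train accuracy: {train_acc * 100:.1f}%")
print(f"Test accuracy:  {test_acc * 100:.1f}%")
print(f"Gap:            {(train_acc - test_acc) * 100:.1f} percentage points")
```

A large gap suggests the model has memorised the training set more than it has learned patterns that generalise.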

Let's use the `score()` method on our regression problem.

In [20]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [22]:

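# Build a DataFrame from the feature array (housing_df is not defined earlier in this section)
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df["target"] = housing["target"]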
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [27]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create model instance
model = RandomForestRegressor(n_estimators=1000)

# Fit the model to the data
model.fit(X_train, y_train)

In [28]:
model.score(X_test, y_test)  # To see which metric score() uses here, view its docstring (Shift + Tab in Jupyter)

0.8083212359139857

In [29]:
y_test.mean()  # If the model predicted this mean for every sample, it would get an R-squared value of 0

2.0550030959302323
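
The relationship between the mean and an R-squared of zero can be checked directly. The sketch below uses `r2_score` from `sklearn.metrics` on the first five target values from the housing table above. The helper `r2_manual` is a hypothetical name written for illustration; it applies the standard definition, one minus the residual sum of squares divided by the total sum of squares.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five target values from the housing data, used only for illustration
y_true = np.array([4.526, 3.585, 3.521, 3.413, 3.422])

# Baseline: predict the mean of the targets for every sample
y_mean_pred = np.full(len(y_true), y_true.mean())
r2_mean = r2_score(y_true, y_mean_pred)

# Perfect predictions give the maximum possible score
r2_perfect = r2_score(y_true, y_true)

# Predictions worse than the mean baseline give a negative score
y_bad_pred = y_true[::-1]
r2_bad = r2_score(y_true, y_bad_pred)


def r2_manual(y_true, y_pred):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot


print(f"Mean baseline:       {r2_mean:.3f}")
print(f"Perfect predictions: {r2_perfect:.3f}")
print(f"Reversed targets:    {r2_bad:.3f}")
print(f"Manual check:        {r2_manual(y_true, y_bad_pred):.3f}")
```

So a regressor's `score()` tops out at 1.0, sits at 0.0 for a model no better than always predicting the mean, and can go below zero for a model that does worse than that.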

We've seen how to get a quick sense of how our machine learning model is doing by evaluating it with the `score()` method, which returns a default evaluation metric depending on the problem. For regression it returns the coefficient of determination (R-squared), and for classification it returns the mean accuracy.

However, when you get further into a problem, it's likely you'll want to start using some more powerful metrics to evaluate your model's performance.

### 4.2 Evaluating a model using the `scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train);
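
As a hedged sketch of where the `scoring` parameter comes in, the snippet below passes it to `cross_val_score` on a synthetic dataset. The data and the names `X_cv`, `y_cv`, and `cv_clf` are assumptions for illustration, not the heart disease setup above. With `scoring=None` the estimator's own `score()` method is used; a string such as `"accuracy"`, `"precision"`, or `"recall"` selects a specific metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (an assumption, used only for illustration)
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=42)
cv_clf = RandomForestClassifier(n_estimators=50, random_state=42)

# scoring=None (the default) falls back to the estimator's own score() method
default_scores = cross_val_score(cv_clf, X_cv, y_cv, cv=5, scoring=None)

# Named scorers select a specific metric for each of the 5 folds
accuracy_scores = cross_val_score(cv_clf, X_cv, y_cv, cv=5, scoring="accuracy")
precision_scores = cross_val_score(cv_clf, X_cv, y_cv, cv=5, scoring="precision")
recall_scores = cross_val_score(cv_clf, X_cv, y_cv, cv=5, scoring="recall")

print(f"Default (score()): {np.mean(default_scores):.3f}")
print(f"Accuracy:          {np.mean(accuracy_scores):.3f}")
print(f"Precision:         {np.mean(precision_scores):.3f}")
print(f"Recall:            {np.mean(recall_scores):.3f}")
```

Because `cross_val_score` clones and refits the estimator on each fold, the model passed in does not need to be fitted beforehand.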