## Implementing machine learning with scikit-learn

### Example 1: Classifying Heart Disease

#### Importing necessary modules

In [18]:
# Standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

#### 1. Get the data ready

As an example dataset, we'll import `heart-disease.csv`. This file contains anonymised patient medical records and whether they have heart disease or not.

### Import data

In [19]:
heart_disease = pd.read_csv('./data/heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [20]:
heart_disease.shape

(303, 14)

### Split data into features and target

Each row is a different patient and all columns except `target` are different patient characteristics. `target` indicates whether the patient has heart disease (`target` = 1) or not (`target` = 0).

In [21]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

In [22]:
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [23]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [24]:
y.value_counts()

1    165
0    138
Name: target, dtype: int64

### Split the data into training and test sets

In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)


In [26]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape


((227, 13), (227,), (76, 13), (76,))

In [27]:
type(X_train)

pandas.core.frame.DataFrame

In [28]:
type(y_train)

pandas.core.series.Series


In this example, the **default split proportion** is used, which allocates:
- **75%** of the data to training (`X_train` and `y_train`).
- **25%** of the data to testing (`X_test` and `y_test`).
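The 227/76 row counts shown above follow from the default test proportion of 0.25 applied to 303 rows (the test set is rounded up to 76). The sketch below reproduces that arithmetic on synthetic stand-in data (random numbers, not the heart disease records). It also passes `random_state`, which the cell above does not use, so that the same rows land in each set on every run. The names `X_demo` and `y_demo` are placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the heart disease features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# No test_size given, so the default 75/25 split is used
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)  # (227, 13) (76, 13)

# Repeating the call with the same random_state selects the same rows
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=42)
same_split = np.array_equal(X_te, X_te2)
```

Without `random_state`, each call shuffles differently, which is one reason scores can change between runs of the same notebook.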

Alternatively, we can explicitly specify a **70-30 split**:

```python
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

The rest of this example keeps the default split. The data is now prepared for training and testing; the next step is to choose a model.

#### 2. Choose the model
A model is often referred to as `model`, `clf` (short for classifier), or estimator (the term used in the Scikit-Learn documentation).



In [29]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()


#### 3. Train the model: fit the model to the data


Fitting the model on the data involves passing it the data and asking it to figure out the patterns. 
If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels. If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

In scikit-learn, fitting is done by calling `model.fit(X, y)`, where `X` is a feature array and `y` is a target array.

Other names for `X` include:
* Data
* Feature variables
* Features

Other names for `y` include:
* Labels
* Target variable

For supervised learning there is usually an `X` and `y`. 
For unsupervised learning, there's no `y` (no labels).

Let's revisit the example of using patient data (`X`) to predict whether or not they have heart disease (`y`).

In [54]:
# model.fit(input features, target labels(true data))
model.fit(X_train, y_train)

When you call the `fit()` method on a model instance with `X_train` and `y_train` as arguments, scikit-learn validates the inputs and, for most estimators, converts them into NumPy arrays before training. This conversion is needed because scikit-learn's algorithms are designed to work efficiently with NumPy arrays. It is also why `X_train` can be a pandas `DataFrame` and `y_train` a pandas `Series`.

#### 4. Use the model to make a prediction on unseen data

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once the model instance is trained, you can use the `predict()` method to predict a target value given a set of features. In other words, you use the model along with some unlabelled data to predict the labels.

Note that the data you predict on must have the same format as the data you trained on: a 2D array or DataFrame with the same feature columns, in the same order.
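As a minimal sketch of that constraint, the example below trains a small forest on synthetic data (not the heart disease records) and then predicts on a single new sample. All names here (`X_demo`, `demo_model`, `new_sample`) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 samples, 4 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = RandomForestClassifier(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

# A single new sample must still be 2D: shape (1, n_features)
new_sample = X_demo[0].reshape(1, -1)
single_pred = demo_model.predict(new_sample)
print(single_pred.shape)  # (1,)

# Input with a different number of feature columns is rejected
try:
    demo_model.predict(X_demo[:, :3])
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```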

In [47]:
# Use the model to make a prediction on the test data (further evaluation)
y_preds = model.predict(X_test)

In [48]:
y_preds

array([0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

In [49]:
y_test

152    1
21     1
208    0
219    0
168    0
      ..
176    0
7      1
18     1
290    0
270    0
Name: target, Length: 76, dtype: int64

In [50]:
type(y_preds)

numpy.ndarray

In [51]:
type(y_test)

pandas.core.series.Series

#### 5. Evaluate the model

Now we've made some predictions, we can start to use some more Scikit-Learn methods to figure out how good our model is. 

Each scikit-learn estimator has a built-in `score()` method. It measures how well the model learned the relationship between the features and the labels by comparing its predictions with the true labels. For classifiers, `score()` returns the mean accuracy.

#### Evaluation metric for classification models

- **Accuracy:** In a classification problem, accuracy is a widely used evaluation metric: the ratio of correctly predicted observations to the total number of observations in the dataset.

Acc = (number_of_correct_predictions / total_predictions) * 100

Expressed this way, accuracy is the percentage of correct predictions among all predictions. Note that `accuracy_score` returns the ratio as a fraction between 0 and 1 rather than as a percentage.
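The formula can be checked by hand with a few lines of plain Python. This is a sketch with made-up labels (not values from this notebook), and the helper name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for illustration
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# 6 of the 8 predictions match the true labels
acc = accuracy(y_true_demo, y_pred_demo)
print(f"{acc:.2f} as a fraction, {acc * 100:.0f}% as a percentage")
# 0.75 as a fraction, 75% as a percentage
```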

In [52]:
from sklearn.metrics import accuracy_score

In [53]:
# accuracy_score takes two arguments:
# 1. true target values (true labels)
# 2. predicted target values (predicted labels)

accuracy_score(y_test, y_preds)

0.8157894736842105

About 81.6% of the predictions on the test set are correct.

In the context of a Random Forest or any machine learning algorithm, `np.random.seed()` is used to set the random seed for NumPy's random number generator. 

Here’s what it does and why it’s useful in Random Forests:

### What is `np.random.seed()`?
- It initializes the random number generator to a fixed state.
- The argument is the seed value. The `42` used in this notebook (`np.random.seed(42)`) is arbitrary; any valid integer seed would do. Using the same seed value ensures reproducibility of random operations.

### Why is it used in Random Forests?
Random Forests rely on randomness in several parts of their construction:
1. **Bootstrap Sampling**: For each tree in the forest, a random sample (with replacement) of the training data is drawn. This is known as bootstrap sampling.
2. **Random Feature Selection**: At each split in the decision tree, a random subset of features is selected to determine the best split.

Setting a random seed ensures that:
- The same bootstrap samples are drawn.
- The same feature subsets are selected at each split.

This consistency helps when debugging or comparing models because you can reproduce the exact same results across runs.
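`np.random.seed()` influences a scikit-learn estimator only when its `random_state` is left at `None`, because the estimator then draws from NumPy's global generator. An alternative is to pass `random_state` to the estimator itself, which ties reproducibility to that one model. The sketch below uses synthetic data and illustrative names; it is not the notebook's own approach.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the heart disease records)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Two forests built with the same random_state on the same data
forest_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Same bootstrap samples and feature subsets, so the outputs agree
identical = np.allclose(forest_a.predict_proba(X_demo), forest_b.predict_proba(X_demo))
print(identical)
```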



The performance of a machine learning model can vary between different training runs due to several factors:

1. **Randomness**: Some machine learning algorithms, such as random forests and stochastic gradient descent, incorporate randomness during training. For example, random forests use bootstrap sampling and random feature selection, which can lead to variations in the resulting model's performance across different training runs.

2. **Data Variability**: If the dataset used for training contains inherent variability or noise, different training runs may result in slightly different models due to the variation in the data samples or distribution.

3. **Hyperparameters**: The performance of the model can also be affected by the choice of hyperparameters, such as the learning rate, regularization strength, or tree depth. Different hyperparameter settings can lead to different model behaviors and performances.

4. **Data Splitting**: When using techniques like cross-validation, the data is split into different subsets for training and evaluation in each run. The performance of the model may vary depending on the specific data splits used in each training run.

5. **Initialization**: Some machine learning algorithms, particularly neural networks, involve random initialization of model parameters. As a result, different initializations can lead to different model behaviors and performances.

6. **Model Complexity**: The complexity of the model, such as the number of parameters or layers in a neural network, can affect its performance. Different training runs may converge to different optimal solutions depending on the model complexity and the training data.

Overall, the variability in model performance across different training runs is a common phenomenon in machine learning. It's essential to consider this variability when interpreting model performance and to use techniques like cross-validation to estimate the model's performance more reliably.

#### 6. Improve model performance


- The first model you build is often referred to as a baseline.
- It's important to remember that this baseline model is not the final solution.

There are two main approaches to improving your model: model perspective and data perspective.

- Model perspective involves using more complex models or tuning hyperparameters.
- Data perspective involves collecting more or better quality data to enhance learning.


If you're already working on an existing dataset, it's often easier to try a series of model perspective experiments first and then turn to data perspective experiments if you aren't getting the results you're looking for.

Model Perspective Experiments

- Start with model perspective experiments, adjusting hyperparameters or trying different algorithms.
- Cross-validate all results to ensure consistency across training and test datasets.

If you're tuning a model's hyperparameters in a series of experiments, your results should always be cross-validated.

Cross-validation checks that the results you're getting are consistent across several different training and test splits of the data, rather than being down to luck in how the original training and test sets happened to be created.

Hyperparameter Tuning

- Hyperparameters are like knobs on an oven that you can tune to optimize performance.
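- Every hyperparameter setting you compare should be evaluated with cross-validation.
    * **Note:** Be careful with cross-validation for time series problems: standard k-fold splits let the model train on observations that come after the ones it is tested on, so the splits should respect time order.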
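For time-ordered data, scikit-learn provides `TimeSeriesSplit`, which only ever trains on rows that precede the test rows. The sketch below uses eight hypothetical observations (oldest first); it illustrates the idea and is not part of the heart disease example, which has no time ordering.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical time-ordered data: 8 observations, oldest first
X_time = np.arange(8).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in splitter.split(X_time)]

# In every fold, all training rows come before all test rows
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```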


## Hyperparameters vs. parameters


### Parameters (for example: weight and bias):

- Parameters are the internal variables or coefficients that the model learns from the training data during the training process.
- They directly influence the output of the model.
- In linear regression, for instance, parameters are the coefficients (weights) and the intercept (bias) learned from the training data. 
- During training, the learning algorithm adjusts these parameters to minimize the error between predicted and actual outcomes.


### Hyperparameters:

- Hyperparameters are external configuration settings that govern a model's learning process. They are set by the machine learning engineer or researcher before training and are not updated from the training data. Adjusting them usually involves experimentation or an optimization technique to improve the model's performance.

- Examples of hyperparameters include the number of layers in a neural network, learning rate, batch size, regularization strength, and the number of trees in a random forest.
- Hyperparameters influence the learning process indirectly by controlling the behavior of the algorithm or model.
- The selection of appropriate hyperparameters significantly impacts the model's performance and generalization to new data.
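The distinction can be seen on a small linear regression. In this sketch (synthetic data following y = 2x + 1, with illustrative names), `fit_intercept` is a hyperparameter chosen before training, while the weight and bias are parameters learned by `fit()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly
X_line = np.arange(10, dtype=float).reshape(-1, 1)
y_line = 2.0 * X_line.ravel() + 1.0

# Hyperparameter: chosen by us before training
line_model = LinearRegression(fit_intercept=True)
hyperparams = line_model.get_params()

# Parameters: learned from the data during fit()
line_model.fit(X_line, y_line)
weight = line_model.coef_[0]
bias = line_model.intercept_
print(hyperparams["fit_intercept"], weight, bias)
```

The learned weight and bias should come out very close to 2 and 1, the values used to generate the data.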



### Trying to improve RandomForestClassifier by tuning hyperparameters
Different models you use will have different hyperparameters you can tune. 

#### Checking hyperparameters in RandomForestClassifier

In [22]:
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

`n_estimators` determines the number of decision trees in the forest.

What is the default value of `n_estimators`?


More trees can potentially lead to better performance but also increase computational cost.

![Alt Text](data/random-forest-algorithm2.png)

In a random forest comprising 100 trees, each tree generates a prediction independently. The final prediction is determined by aggregating the individual predictions of all trees and selecting the most common one, often referred to as the majority vote.
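Majority voting itself is simple to sketch in plain Python. The votes below are hypothetical, and the helper `majority_vote` is an illustration rather than scikit-learn code; scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes, which usually leads to the same class.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among the trees' predictions."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label

# Hypothetical votes from 7 trees for one patient (1 = heart disease, 0 = none)
votes = [1, 0, 1, 1, 0, 1, 0]
final_prediction = majority_vote(votes)
print(final_prediction)  # 1 (four votes for class 1, three for class 0)
```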

The purpose of setting the seed is to ensure that any random operation inside the RandomForestClassifier remains consistent across different runs of the code.

In [23]:
# Try different numbers of estimators (trees)
np.random.seed(42)

model = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
print("")

Model accuracy on test set: 76.31578947368422%



In [24]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)

model = RandomForestClassifier(n_estimators=20).fit(X_train, y_train)
print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
print("")

Model accuracy on test set: 81.57894736842105%



In [None]:
##   10, 20, 30, ... 90,100

In [25]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)
for i in range(10, 130, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 76.31578947368422%

Trying model with 20 estimators...
Model accuracy on test set: 84.21052631578947%

Trying model with 30 estimators...
Model accuracy on test set: 81.57894736842105%

Trying model with 40 estimators...
Model accuracy on test set: 82.89473684210526%

Trying model with 50 estimators...
Model accuracy on test set: 85.52631578947368%

Trying model with 60 estimators...
Model accuracy on test set: 85.52631578947368%

Trying model with 70 estimators...
Model accuracy on test set: 82.89473684210526%

Trying model with 80 estimators...
Model accuracy on test set: 82.89473684210526%

Trying model with 90 estimators...
Model accuracy on test set: 88.1578947368421%

Trying model with 100 estimators...
Model accuracy on test set: 86.8421052631579%

Trying model with 110 estimators...
Model accuracy on test set: 84.21052631578947%

Trying model with 120 estimators...
Model accuracy on test set: 81.57894736842105%



## Evaluating model with cross validation

Cross-validation provides a more reliable estimate of a model's performance compared to a single train-test split.

By repeatedly splitting the dataset into different subsets for training and testing, and averaging the results, cross-validation offers a more robust evaluation of how well the model might perform on unseen data.

![Alt text](data/5_fold_cross.png)

- `cross_val_score` takes the model (`model`), features (`X`), and target (`y`) as input.
- `cv=5` specifies that it should perform 5-fold cross-validation.
- It splits the dataset (`X` and `y`) into 5 folds, trains a fresh copy of the model on 4 folds, and evaluates it on the remaining fold.
- This process is repeated 5 times (once for each fold), and the accuracy score is computed for each iteration.
- The `np.mean()` function is used to calculate the mean of these 5 accuracy scores, providing an overall estimate of the model's performance.

The cross-validation score is more robust than a single train-test split because the model is trained and tested on several different subsets of the data.
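The fold mechanics can be sketched without scikit-learn. The helper `k_fold_indices` below is hypothetical and uses plain contiguous folds. For classifiers, `cross_val_score` with an integer `cv` uses stratified folds that also preserve the class balance, so the exact row assignment differs from this sketch.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional row
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# 303 rows, as in the heart disease dataset, split into 5 folds
folds = list(k_fold_indices(303, 5))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)  # [61, 61, 61, 60, 60]

# Every row is used for testing exactly once across the 5 folds
all_test_rows = sorted(i for _, test in folds for i in test)
```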

In [27]:
from sklearn.model_selection import cross_val_score
# Compare a single test-set score with a 5-fold cross-validation score
np.random.seed(42)

model = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
print("")

Model accuracy on test set: 76.31578947368422%
Cross-validation score: 78.53551912568305%



In [29]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(10, 110, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
    print("")
    
    
  

Trying model with 10 estimators...
Model accuracy on test set: 76.31578947368422%
Cross-validation score: 78.53551912568305%

Trying model with 20 estimators...
Model accuracy on test set: 81.57894736842105%
Cross-validation score: 79.84699453551912%

Trying model with 30 estimators...
Model accuracy on test set: 82.89473684210526%
Cross-validation score: 80.50819672131148%

Trying model with 40 estimators...
Model accuracy on test set: 85.52631578947368%
Cross-validation score: 82.15300546448088%

Trying model with 50 estimators...
Model accuracy on test set: 82.89473684210526%
Cross-validation score: 81.1639344262295%

Trying model with 60 estimators...
Model accuracy on test set: 88.1578947368421%
Cross-validation score: 83.47540983606557%

Trying model with 70 estimators...
Model accuracy on test set: 84.21052631578947%
Cross-validation score: 81.83060109289617%

Trying model with 80 estimators...
Model accuracy on test set: 80.26315789473685%
Cross-validation score: 82.81420765027

## Improving the model: tuning hyperparameters with GridSearchCV

`GridSearchCV` is a technique for finding the best hyperparameters for a machine learning model from a set of candidate values.



1. **Parameter Grid:** You define a grid of hyperparameters that you want to optimize or tune. For a Random Forest model, these could include parameters like `n_estimators` (number of trees), `max_depth` (maximum depth of each tree), `min_samples_split`, etc.

2. **Cross-validation:** `GridSearchCV` performs cross-validation, which involves splitting the dataset into multiple subsets (folds). It iterates through each combination of hyperparameters specified in the grid and trains a model on a subset of the data (training set) while evaluating its performance on a different subset (validation set).

3. **Scoring:** After training and evaluating the model on each combination of hyperparameters, a scoring metric (like accuracy, precision, recall, etc.) is used to determine the performance of the model for each set of hyperparameters.

4. **Best Parameters:** Once all combinations have been tried, `GridSearchCV` identifies the set of hyperparameters that yielded the best score according to the specified metric.

5. **Final Model:** Finally, by default (`refit=True`), `GridSearchCV` retrains the model with the best hyperparameters on all of the data passed to `fit()`.

This process helps find a good set of hyperparameters for the Random Forest model, which often leads to better performance on unseen data.

`GridSearchCV` simplifies the process of experimenting with different hyperparameters. It combines hyperparameter tuning and cross-validation: it trains the model with every combination of hyperparameter values defined in a grid and scores each combination with cross-validation. Ultimately, it helps find the combination of hyperparameters that gives the best cross-validated performance.
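To get a feel for the cost of a grid search, the combinations can be enumerated with `itertools.product`. The sketch below assumes the grid used in this notebook's search (ten values of `n_estimators`, six of `max_depth`) with 5-fold cross-validation; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": list(range(10, 110, 10)),
    "max_depth": [None, 5, 10, 15, 20, 30],
}
cv_folds = 5

# Every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(combinations)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 60 300
```

Each extra hyperparameter multiplies the number of candidates, so grids grow quickly; the final refit on the best combination adds one more fit on top of the 300.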

In [30]:
# Another way to do it with GridSearchCV...
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over

# param_grid = {'n_estimators': [i for i in range(10, 100, 10)]}

param_grid = {
    'n_estimators': [i for i in range(10, 110, 10)],
    'max_depth': [None, 5, 10, 15, 20, 30]  # Example values for max_depth
}

# Setup the grid search
grid = GridSearchCV(RandomForestClassifier(),
                    param_grid,
                    cv=5)

# Fit the grid search to the data
grid.fit(X, y)

# Find the best parameters
grid.best_params_


{'max_depth': 5, 'n_estimators': 50}

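The `fit()` method of the `grid` object performs the grid search: it fits the model for each combination of hyperparameters and evaluates each one using cross-validation. Note that the search here is fitted on the full dataset (`X`, `y`), which includes the test rows, so the test accuracy reported below is likely to be optimistic.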

In [31]:
# Set the model to be the best estimator
model = grid.best_estimator_
model

In [32]:
# Fit the best model
model = model.fit(X_train, y_train)

In [33]:
y_pred = model.predict(X_test)

In [34]:
# Find the best model scores
accuracy_score(y_test, y_pred)

0.868421052631579

Note: There are many other hyperparameter tuning methods. Their purpose is to find hyperparameter values that give high prediction performance.
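One way to keep the test set completely out of the search is to run `GridSearchCV` on the training rows only and score the refitted best model on the held-out rows afterwards. The sketch below uses synthetic data and a deliberately small grid so it runs quickly. The data, the grid values, and the names are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the heart disease records)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Deliberately small grid so the sketch runs quickly
small_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5)

# Search on the training rows only; the test rows stay unseen until the end
search.fit(X_tr, y_tr)

best_params = search.best_params_
# score() uses the best model, refitted on all of X_tr
test_accuracy = search.score(X_te, y_te)
print(best_params, round(test_accuracy, 3))
```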

#### 7. Save a trained model for future use

When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without retraining it, which is especially helpful as your training times start to increase.

You can save a scikit-learn model using Python's built-in `pickle` module.

In [35]:
import pickle

# Save an existing model to file (the context manager closes the file handle)
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)

#### Load a saved model and make a prediction

In [36]:

# Load the saved model (the context manager closes the file handle)
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

loaded_model.score(X_test, y_test)

# y_pred = loaded_model.predict(X_test)
# accuracy_score(y_test, y_pred)


0.868421052631579
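A quick sanity check after loading is to confirm that the restored model predicts exactly what the original did. The sketch below does this round trip in a temporary directory with a small forest trained on synthetic data; the file name `demo_model.pkl` and the variable names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small model trained on synthetic stand-in data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
original = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
round_trip_ok = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)  # True
```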

## Summary: RandomForestClassifier 

In [7]:
# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

heart_disease = pd.read_csv('./data/heart-disease.csv')
heart_disease

# Setup random seed
np.random.seed(42)

# Split the data into X (features/data) and y (target/labels)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = RandomForestClassifier()
model.fit(X_train, y_train)

# y_pred = model.predict(X_test)
# accuracy_score(y_test, y_pred)

# Check the accuracy of the model (on the test set)
model.score(X_test, y_test)

0.8524590163934426

In [8]:
from sklearn.model_selection import cross_val_score

# With cross-validation

model = RandomForestClassifier().fit(X_train, y_train)
print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    
print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
print("")

Model accuracy on test set: 86.88524590163934%
Cross-validation score: 82.16393442622952%



### TASK 2: Experiment with Classification Models
Using the classification example provided earlier, perform the following steps:

1. **Model 1: LinearSVC**
   - Import and initialize the `LinearSVC` model:
     ```python
     from sklearn.svm import LinearSVC
     model = LinearSVC()
     ```
   - Train the model using your heart disease dataset and evaluate its performance.

2. **Model 2: SVM with Linear Kernel**
   - Import and initialize the `SVC` model with a linear kernel:
     ```python
     from sklearn import svm
     model = svm.SVC(kernel='linear')
     ```
   - Train the model using the same dataset and evaluate its performance.

3. **Comparison**
   - Compare the results of the two models and RandomForestClassifier in terms of:
     - Accuracy
     - Cross-validation score

