<a href="https://colab.research.google.com/github/kang25-gif/BA810/blob/main/Lab1_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with `scikit-learn` for BA810

The purpose of this guide is to illustrate some of the main features of `scikit-learn`. It assumes a very basic working knowledge of machine learning practices (model fitting, predicting, etc.).

`Scikit-learn` is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

## Fitting and predicting: estimator basics

`Scikit-learn` provides dozens of built-in machine learning algorithms and models, called `estimators`. Each estimator can be fitted to some data using its `fit` method.

Here is a simple example where we fit a `sklearn.ensemble.RandomForestClassifier` to some very basic data:

In [13]:
from sklearn.ensemble import RandomForestClassifier  # one of many classifiers provided by scikit-learn
clf = RandomForestClassifier(random_state=0)  # create a classifier object, with a random seed to make runs repeatable
X = [[ 1,  2,  3],  # 2 samples/rows, 3 features/columns
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

The `fit` method generally accepts 2 inputs:

-   The samples matrix (or design matrix) `X`. The size of `X` is typically `(n_samples, n_features)`, which means that samples are represented as rows and features are represented as columns.
-   The target values `y` which are real numbers for regression tasks, or integers for classification (or any other discrete set of values).

Both `X` and `y` are usually expected to be numpy arrays or equivalent `array-like` data types, though some estimators work with other formats such as sparse matrices.
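
As a small illustration of the `(n_samples, n_features)` convention, the toy data can be converted to numpy arrays and their shapes inspected. This is only a sketch; the names `X_arr` and `y_arr` are chosen here for illustration and do not appear elsewhere in this guide.

```python
import numpy as np

# Toy data: 2 samples (rows), 3 features (columns)
X_arr = np.asarray([[1, 2, 3],
                    [11, 12, 13]])
y_arr = np.asarray([0, 1])  # one target value per sample

n_samples, n_features = X_arr.shape
print(X_arr.shape)  # (2, 3)
print(y_arr.shape)  # (2,)

# X and y must agree on the number of samples
assert n_samples == y_arr.shape[0]
```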

Once the estimator is fitted, it can be used to predict target values for new data without being re-trained. Estimators that offer a `predict` method form a subclass of estimators known as `Predictors`. All predictive models within scikit-learn are implemented as `Predictors`.

Though you can certainly use a fitted classifier to predict on the training data ...

In [14]:
clf.predict(X)  # predict classes of the training data

array([0, 1])

we often are interested in predicting on new records:

In [15]:
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

***
**Exercise**

Copy the fitting and prediction code above into a code block below. Then import and fit a `LogisticRegression`, a Naive Bayes classifier (`GaussianNB`), and a `DecisionTreeClassifier` on the toy dataset above, and apply each one to the two new test records.

Note: the classifiers won't be in the `sklearn.ensemble` module. Find out which `sklearn` module has them and import from there.
***

In [18]:
#@title Solution (Do not unfold before you try)

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

## Transformers and pre-processors

Machine learning workflows are often composed of different parts. A typical pipeline consists of a pre-processing step that transforms the input data, a model learning step that learns how to predict, and a final predictor that predicts target values.

In `scikit-learn`, pre-processors and transformers follow the same API as the estimator objects (they actually all inherit from the same `BaseEstimator` class). The transformer objects don't have a `predict` method but rather a `transform` method that outputs a newly transformed sample matrix `X` :

In [21]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15],
     [5, -8],
     [1, -10]]
# scale data according to computed scaling values
StandardScaler().fit_transform(X)

array([[-0.9258201 ,  1.41054504],
       [ 1.38873015, -0.61711345],
       [-0.46291005, -0.79343158]])

The `fit_transform` method is shorthand for calling the `fit` method followed by the `transform` method on the `X` matrix.
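
The equivalence can be checked directly. The sketch below reuses the small matrix from the cell above under the illustrative name `X_demo`, and also recomputes the scaling by hand, relying on the fact that `StandardScaler` divides by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[0, 15],
                   [5, -8],
                   [1, -10]], dtype=float)

# One call: learn the scaling values and apply them
one_step = StandardScaler().fit_transform(X_demo)

# Two calls: fit learns the per-column mean and standard deviation,
# transform applies them
scaler = StandardScaler()
scaler.fit(X_demo)
two_step = scaler.transform(X_demo)

print(np.allclose(one_step, two_step))  # True
print(scaler.mean_)   # per-column means learned by fit
print(scaler.scale_)  # per-column standard deviations learned by fit

# The same result computed by hand (population std, i.e. ddof=0)
by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(two_step, by_hand))  # True
```

Keeping `fit` and `transform` separate matters once there is held-out data: the scaler is fitted on the training data and then only `transform` is applied to new records.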

### Pipelines: chaining pre-processors and estimators

Transformers and estimators (predictors) can be chained together into a single unifying object: a `sklearn.pipeline.Pipeline`. The pipeline offers the same API as a regular estimator. For example, when the last step is a predictor, it can be fitted and used for prediction with `fit` and `predict`.
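
To make the chaining concrete, here is a minimal sketch comparing a pipeline with the same steps carried out by hand. The data are synthetic and made up purely for illustration, and the variable names (`X_tr`, `X_new`, `pipe`, and so on) are not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic data, made up for illustration only
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(20, 3))
y_tr = X_tr @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
X_new = rng.normal(size=(5, 3))

# Pipeline: scaling and regression behind a single fit/predict interface
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_new)

# The same steps by hand: the scaler is fitted on the training data only
# and then reused, unchanged, on the new data
scaler = StandardScaler().fit(X_tr)
model = LinearRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = model.predict(scaler.transform(X_new))

print(np.allclose(pipe_pred, manual_pred))  # True
```

The pipeline version is shorter and removes the risk of forgetting to apply the training-time scaling to new data.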

### A regression model to predict house prices (numeric values)

Let's see how we'd fit a linear regression model to predict house prices, using the [California Housing dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).
Each row represents a district in California.
This dataset has the following features/predictors:

1. `MedInc`: median income in block group
1. `HouseAge`: median house age in block group
1. `AveRooms`: average number of rooms per household
1. `AveBedrms`: average number of bedrooms per household
1. `Population`: block group population
1. `AveOccup`: average number of household members
1. `Latitude`: block group latitude
1. `Longitude`: block group longitude

The target is the median house value in the district, expressed in hundreds of thousands of dollars.


In [22]:
# Import the necessary modules

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import root_mean_squared_error

from sklearn import set_config
set_config(display='diagram') # shows the pipeline structure graphically

# Fetch the dataset
X, y = fetch_california_housing(return_X_y=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create a pipeline with a standard scaler and a linear regression model
pipeline = Pipeline([
    ('scaler', StandardScaler()), # bring predictors with different ranges to roughly similar range
    ('regressor', LinearRegression())
])
pipeline  # inspect the pipeline

In [23]:
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Predict on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model
rmse = root_mean_squared_error(y_test, y_pred)
print(f'Root Mean Squared Error on test data: {rmse:.2f}')

# What would the error be if measured on the training data?
  # We are measuring this to illustrate -- rarely done in practice to assess predictor performance
train_rmse = root_mean_squared_error(y_train, pipeline.predict(X_train))
print(f'Root Mean Squared Error on training data: {train_rmse:.2f}')
  # Generally training performance would be optimistic

Root Mean Squared Error on test data: 0.73
Root Mean Squared Error on training data: 0.72


What would the errors be if we fit an intercept only model or *null model*?


In [24]:
from sklearn.base import clone  # clone builds a new, unfitted copy of the pipeline with the same parameters
p2 = clone(pipeline)  # then fit the copy on data that carries no feature information

X_train_ones = np.ones((X_train.shape[0], 1))  # Feature matrix of ones for training
X_test_ones = np.ones((X_test.shape[0], 1))  # Feature matrix of ones
p2.fit(X_train_ones, y_train)

# Predict on the testing data
y_pred_1 = p2.predict(X_test_ones)
print(y_pred_1) #what do we see?

# Evaluate the model
null_rmse = root_mean_squared_error(y_test, y_pred_1)
print(f'RMSE using scikit learn          : {null_rmse:.2f}')

# Another way to compute the same is by predicting average of y_train for each test case
y_pred_2 = np.repeat(np.mean(y_train), len(y_test)) # repeat to make prediction vector as long as test data
null_rmse2 = root_mean_squared_error(y_test, y_pred_2)
print(f'RMSE predicting training average : {null_rmse2:.2f}')

[2.07249896 2.07249896 2.07249896 ... 2.07249896 2.07249896 2.07249896]
RMSE using scikit learn          : 1.14
RMSE predicting training average : 1.14


This suggests that by using the eight features in the housing dataset, we are able to reduce error (improve prediction) from 1.14 to 0.73.
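
As an aside, scikit-learn also ships a ready-made baseline, `sklearn.dummy.DummyRegressor`, which is a different route to the same null model than the column-of-ones trick used above. The sketch below uses synthetic targets (not the housing data) so that it runs on its own; all `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic targets, made up for illustration only
rng = np.random.default_rng(0)
y_train_demo = rng.normal(loc=2.0, scale=1.0, size=100)
y_test_demo = rng.normal(loc=2.0, scale=1.0, size=25)

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((100, 1))
X_test_demo = np.zeros((25, 1))

null_model = DummyRegressor(strategy='mean').fit(X_train_demo, y_train_demo)
null_pred = null_model.predict(X_test_demo)

# Every prediction equals the mean of the training targets
print(np.allclose(null_pred, y_train_demo.mean()))  # True

null_rmse_demo = np.sqrt(mean_squared_error(y_test_demo, null_pred))
print(f'Null-model RMSE on the synthetic test targets: {null_rmse_demo:.2f}')
```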

## **Follow-On Exercise: Salary Prediction using the Hitters Dataset**



In this exercise we'll predict the salary of a baseball player from the player's various performance attributes. You can learn more about this dataset from [this page](https://islp.readthedocs.io/en/latest/datasets/Hitters.html).

Here is the table presenting the attributes of the **Hitters Dataset**:

| Variable Name | Description                                                        | Type       |
|---------------|--------------------------------------------------------------------|------------|
| AtBat         | Number of times at bat in the 1986 season                          | Numeric    |
| Hits          | Number of hits made in the 1986 season                             | Numeric    |
| HmRun         | Number of home runs hit in the 1986 season                         | Numeric    |
| Runs          | Number of runs scored in the 1986 season                           | Numeric    |
| RBI           | Number of runs batted in during the 1986 season                    | Numeric    |
| Walks         | Number of walks received in the 1986 season                        | Numeric    |
| Years         | Number of years the player has been in the major leagues           | Numeric    |
| CAtBat        | Number of times at bat during the player’s career                  | Numeric    |
| CHits         | Number of hits made during the player’s career                     | Numeric    |
| CHmRun        | Number of home runs hit during the player’s career                 | Numeric    |
| CRuns         | Number of runs scored during the player’s career                   | Numeric    |
| CRBI          | Number of runs batted in during the player’s career                | Numeric    |
| CWalks        | Number of walks received during the player’s career                | Numeric    |
| League        | League in which the player played (A: American, N: National)       | Categorical|
| Division      | Division in which the player played (E: East, W: West)             | Categorical|
| PutOuts       | Number of putouts made by the player in the 1986 season            | Numeric    |
| Assists       | Number of assists made by the player in the 1986 season            | Numeric    |
| Errors        | Number of errors made by the player in the 1986 season             | Numeric    |
| Salary        | Player’s annual salary (in thousands of dollars)                   | Numeric    |
| NewLeague     | League the player was in at the start of the 1987 season (A/N)     | Categorical|


**Objective**:  
The main tasks of this exercise are:
1. **Fit a Linear Regression Model** to predict baseball player salaries using the `Hitters` dataset.
2. **Compare it** with **RandomForestRegressor** in terms of RMSE performance.
3. Implement a **null model** and compare its RMSE with trained models.

Pair with the person next to you and work on the following steps. You may work in a 'pair-programming' model where one of you leads the task of writing while the other watches for potential errors and provides suggestions.

### **Step 1: Load the Dataset**

```python

# Import necessary libraries
import pandas as pd

# Load the dataset
url = 'https://drive.google.com/uc?export=download&id=143Mw3ZMAyrQuIvpgqsnSnE9_4KRJ32Yu'
data = pd.read_csv(url)
data = data.rename(columns={data.columns[0]: 'PlayerName'})
  # The data file doesn't have a label for the first column. It contains player name. We are naming it appropriately so that we can refer to the column easily later.

# Display the first few rows of the dataset before any changes
print(f"A few records before any changes:")
print(data.head())
print(f"Dataset size: {data.shape}")

# Remove rows with missing salary values
data = data.dropna(subset=["Salary"])

# Additionally, remove the categorical features for now -- we'll see how to handle them later.
data = data.drop(['PlayerName', 'League', 'Division', 'NewLeague'], axis=1)

# Display the cleaned dataset
print(f"\nA few records after minimal cleanup:")
print(data.head())
print(f"Dataset size after removing missing salaries: {data.shape}")


```

### **Step 2: Preprocess the Data**  
1. Handle missing values (done above).
2. Define **X** (features) and **y** (target).
3. Split the dataset into **training and testing sets** using `train_test_split` from `sklearn`.
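
The mechanics of items 2 and 3 can be sketched on a small made-up table. The column names below mirror a few Hitters columns, but the values are random and the `demo` DataFrame is purely hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up table standing in for the cleaned data (values are random)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'AtBat': rng.integers(100, 600, size=50),
    'Hits': rng.integers(20, 200, size=50),
    'Salary': rng.uniform(70, 2000, size=50),
})

X_demo = demo.drop(columns='Salary')  # every column except the target
y_demo = demo['Salary']               # the target

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
```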

### **Step 3: Fit a Linear Regression Model**  
Fit a **Linear Regression** model using the **training data** to predict the **Salary**. Do you think we need to standardize the attributes? If yes, do so in a pipeline.
Then, evaluate the model using **Root Mean Squared Error (RMSE)**.
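
One way to explore the standardization question empirically is to fit ordinary least squares with and without a scaler and compare the predictions. The sketch below does this on synthetic data from `make_regression`; the data, the scale factors, and the variable names are all assumptions made for illustration, not part of the Hitters exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, with features rescaled to very different ranges
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_syn = X_syn * np.array([1, 10, 100, 1000, 10000])

# Ordinary least squares without and with standardization
plain = LinearRegression().fit(X_syn, y_syn)
scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
]).fit(X_syn, y_syn)

# Compare predictions, then compare coefficients
same_predictions = np.allclose(plain.predict(X_syn), scaled.predict(X_syn))
print(same_predictions)
print(plain.coef_)
print(scaled.named_steps['regressor'].coef_)
```

For plain least squares with an intercept, rescaling a feature is absorbed by its coefficient, so the predictions should agree while the coefficients differ. Standardization matters more for models that are sensitive to feature scale, such as regularized or distance-based methods.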

### **Step 4: Compare with Random Forest Regression**  
Fit and evaluate **Random Forest Regression** model. Compare its performance with the **Linear Regression** model based on RMSE.

### **Step 5: Implement a Null Model**  
Create a **null model** by predicting the mean salary from the **training target values** for all instances. Evaluate this model on the test set using RMSE.


### **Discussion Questions**:
1. **Model Comparison**: Which regression model performed the best based on RMSE? Why do you think this is the case?
2. **Null Model Baseline**: How does the performance of your models compare to the null model? What does this tell you about the predictive power of the features?


In [25]:
# Import necessary libraries
import pandas as pd

# Load the dataset
url = 'https://drive.google.com/uc?export=download&id=143Mw3ZMAyrQuIvpgqsnSnE9_4KRJ32Yu'
data = pd.read_csv(url)
data = data.rename(columns={data.columns[0]: 'PlayerName'})
  # The data file doesn't have a label for the first column. It contains player name. We are naming it appropriately so that we can refer to the column easily later.

# Display the first few rows of the dataset before any changes
print(f"A few records before any changes:")
print(data.head())
print(f"Dataset size: {data.shape}")

# Remove rows with missing salary values
data = data.dropna(subset=["Salary"])

# Additionally, remove the categorical features for now -- we'll see how to handle them later.
data = data.drop(['PlayerName', 'League', 'Division', 'NewLeague'], axis=1)

# Display the cleaned dataset
print(f"\nA few records after minimal cleanup:")
print(data.head())
print(f"Dataset size after removing missing salaries: {data.shape}")



A few records before any changes:
          PlayerName  AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  \
0     -Andy Allanson    293    66      1    30   29     14      1     293   
1        -Alan Ashby    315    81      7    24   38     39     14    3449   
2       -Alvin Davis    479   130     18    66   72     76      3    1624   
3      -Andre Dawson    496   141     20    65   78     37     11    5628   
4  -Andres Galarraga    321    87     10    39   42     30      2     396   

   CHits  ...  CRuns  CRBI  CWalks  League Division PutOuts  Assists  Errors  \
0     66  ...     30    29      14       A        E     446       33      20   
1    835  ...    321   414     375       N        W     632       43      10   
2    457  ...    224   266     263       A        W     880       82      14   
3   1575  ...    828   838     354       N        E     200       11       3   
4    101  ...     48    46      33       N        E     805       40       4   

   Salary  NewLeague  

In [45]:
X, y = data.drop('Salary', axis=1), data['Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

pipeline.fit(X_train, y_train)

rmse = root_mean_squared_error(y_test, pipeline.predict(X_test))


In [49]:
from sklearn.ensemble import RandomForestRegressor

# Create a pipeline with a standard scaler and a Random Forest Regressor model
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=0))
])

rf_pipeline.fit(X_train, y_train)

y_pred_rf = rf_pipeline.predict(X_test)

rmse_rf = root_mean_squared_error(y_test, y_pred_rf)
print(f'Root Mean Squared Error for Random Forest on test data: {rmse_rf:.2f}')

print(f'Root Mean Squared Error for Linear Regression on test data: {rmse:.2f}')

Root Mean Squared Error for Random Forest on test data: 351.78
Root Mean Squared Error for Linear Regression on test data: 364.59


In [50]:
from sklearn.base import clone  # clone builds a new, unfitted copy of the pipeline with the same parameters
p3 = clone(pipeline)  # then fit the copy on data that carries no feature information

X_train_ones = np.ones((X_train.shape[0], 1))  # Feature matrix of ones for training
X_test_ones = np.ones((X_test.shape[0], 1))  # Feature matrix of ones
p3.fit(X_train_ones, y_train)

# Predict on the testing data
y_pred_1 = p3.predict(X_test_ones)

# Evaluate the model
null_rmse = root_mean_squared_error(y_test, y_pred_1)
print(f'RMSE using scikit learn          : {null_rmse:.2f}')

# Another way to compute the same is by predicting average of y_train for each test case
y_pred_2 = np.repeat(np.mean(y_train), len(y_test)) # repeat to make prediction vector as long as test data
null_rmse2 = root_mean_squared_error(y_test, y_pred_2)
print(f'RMSE predicting training average : {null_rmse2:.2f}')

RMSE using scikit learn          : 545.37
RMSE predicting training average : 545.37
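
To put the three results side by side, the short sketch below copies the test RMSE values printed in the outputs above and expresses each as a relative reduction over the null model. The dictionary and its labels are illustrative only.

```python
# Test RMSEs copied from the outputs above (salary is in thousands of dollars)
rmse_by_model = {
    'Null model (training mean)': 545.37,
    'Linear Regression': 364.59,
    'Random Forest': 351.78,
}

null_value = rmse_by_model['Null model (training mean)']

# Relative reduction in RMSE compared with the null model
reduction = {name: 1 - value / null_value for name, value in rmse_by_model.items()}

for name, value in rmse_by_model.items():
    print(f'{name:<28} RMSE = {value:7.2f}   reduction vs. null = {reduction[name]:.1%}')
```

Both trained models cut the error by roughly a third relative to the baseline, while the gap between the two models themselves is comparatively small for a single train/test split.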


## Other things to try at home



We have briefly covered estimator fitting and predicting, pre-processing
steps, and pipelines. Cross-validation tools and automatic hyper-parameter
searches, which this guide does not cover, are good next topics to explore.
This guide should give you an overview of some of the main
features of the library, but there is much more to `scikit-learn`!

Please refer to the [user guide](https://scikit-learn.org/stable/user_guide.html) for details on all
the tools that are provided. You can also find an exhaustive list of the
public API in the [API reference](https://scikit-learn.org/stable/modules/classes.html).

You can also look at numerous [examples](https://scikit-learn.org/stable/auto_examples/index.html)
that illustrate the use of `scikit-learn` in many different contexts.

Finally, we'll do our labs on Google Colab. This [overview of Colab](https://colab.research.google.com/notebooks/basic_features_overview.ipynb) is helpful. Keyboard shortcuts (accessible through the Tools/Keyboard shortcuts menu, or Cmd/Ctrl+M+h) can make your life easier too!


## Credit
This document is based on the [scikit-learn getting started guide](https://scikit-learn.org/stable/getting_started.html). It was modified by Nachiketa Sahoo for the BA810 course.
