### What does this class cover?

The Scikit-Learn library is very capable. However, learning everything off by heart isn't necessary.
Instead, this notebook focuses on some of the main use cases of the library.


## Where can I get help?

This notebook goes through a range of common and useful features of the Scikit-Learn library.

It's long, but it's called quick because of how vast the Scikit-Learn library is. Covering everything requires [full-blown documentation](https://scikit-learn.org/stable/user_guide.html), which you should read if you ever get stuck.


If you get stuck or think of something you'd like to do which this notebook doesn't cover, don't fear!

The recommended steps to take are:
1. **Try it** - Since Scikit-Learn has been designed with usability in mind, your first step should be to use what you know and try to figure out the answer to your own question (getting it wrong is part of the process). If in doubt, run your code.
2. **Press SHIFT+TAB** - You can see the docstring of a function (information on what the function does) by pressing **SHIFT + TAB** inside it. Doing this is a good habit to develop. It'll improve your research skills and give you a better understanding of the library.
3. **Search for it** - If trying it on your own doesn't work, try searching for your problem, since someone else has probably tried to do something similar. You'll likely end up in 1 of 2 places:
    * [Scikit-Learn documentation/user guide](https://scikit-learn.org/stable/user_guide.html) - the most extensive resource you'll find for Scikit-Learn information.
    * [Stack Overflow](https://stackoverflow.com/) - this is the developer's Q&A hub. It's full of questions and answers on different problems across a wide range of software development topics, and chances are there's one related to your problem.
    
An example of searching for a Scikit-Learn solution might be:

> "how to tune the hyperparameters of a sklearn model"

Searching this on Google leads to the Scikit-Learn documentation on tuning hyperparameters, which covers the `GridSearchCV` class: http://scikit-learn.org/stable/modules/grid_search.html

The next steps here are to read through the documentation, check the examples and see if they line up to the problem you're trying to solve. If they do, **rewrite the code** to suit your needs, run it, and see what the outcomes are.

4. **Ask for help** - If you've been through the above 3 steps and you're still stuck, you might want to ask your question on [Stack Overflow](https://www.stackoverflow.com). Be as specific as possible and provide details on what you've tried.
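
The docstring mentioned in step 2 can also be read from code, which is handy outside a notebook. The snippet below is a minimal sketch, assuming `train_test_split` as the function of interest (any Scikit-Learn function or class would work the same way) and using Python's standard `inspect` module:

```python
import inspect
from sklearn.model_selection import train_test_split

# inspect.getdoc returns the cleaned-up docstring of an object,
# the same documentation text a docstring tooltip is built from
doc = inspect.getdoc(train_test_split)

# Print only the first few lines to keep the output short
print("\n".join(doc.splitlines()[:5]))
```

Calling `help(train_test_split)` prints the full docstring in one go.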

Remember, you don't have to learn all of the functions off by heart to begin with. 

What's most important is continually asking yourself, **"what am I trying to do with the data?"**.

Start by answering that question and then practice finding the code which does it.


### Implementing a Regression Problem with Scikit-Learn

#### **California Housing Dataset**
- A real-world dataset provided by Scikit-Learn.
- **Goal**: Predict the median house value (in hundreds of thousands of dollars) for California districts.
- **Features**:
  - Age of the home
  - Number of rooms
  - Number of bedrooms
  - Population living in the area
  - Average income
  - Other district-level features


1. **Load the Dataset**
   - Use `fetch_california_housing` from `sklearn.datasets`.

In [21]:

import pandas as pd
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()


In [22]:
type(housing)

sklearn.utils._bunch.Bunch

In [13]:
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df["target"] = pd.Series(housing["target"])
housing_df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


Beautiful, our goal here is to use the feature columns, such as:
* `MedInc` - median income in block group
* `HouseAge` - median house age in block group
* `AveRooms` - average number of rooms per household
* `AveBedrms` - average number of bedrooms per household

We'll use these to predict the `target` column, which expresses the median house value for specific California districts in hundreds of thousands of dollars ($100,000).

In essence, each row is a different district in California (the data) and we're trying to build a model to predict the median house value in that district (the target/label) given a series of attributes about the houses in that district.

Since we have data and labels, this is a supervised learning problem. And since we're trying to predict a number, it's a regression problem.



## Evaluation metric
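- Mean Squared Error (MSE): the average of the squared differences between predicted and actual values (lower is better)
- R² score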

Imagine you're trying to predict house prices based on their size using a linear equation.

R² helps you understand how well the model explains the relationship between size and house price.

R² is like a score that tells you how good the model is at predicting price from size.
It usually falls between 0 and 1:

- 0 means the model doesn't help at all in predicting price from size (it does no better than always predicting the average price).
- 1 means the model perfectly predicts price from size.

R² can also be negative when a model predicts worse than always guessing the average, which can happen on test data.

So, if you have an R² of 0.8, it means 80% of the variation in price can be explained by the house features using the model. The higher the R², the better your model explains the relationship between the variables.

In essence, R² helps you understand how much of the variability in one variable (like price) is captured or explained by another variable (like the size of the house) through your formula.
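
To make both metrics concrete, here is a minimal sketch that computes them by hand and compares the results with Scikit-Learn's `mean_squared_error` and `r2_score`. The values in `y_true` and `y_hat` are made up for illustration and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: the average of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean of y_true gives an R^2 of 0
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(r2_mean_baseline)
```

The hand-computed values should match the library's, which shows that R² compares the model's squared errors against those of a baseline that always predicts the mean.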

## Example 1: LinearRegression 

In [18]:
# Import the LinearRegression model class from the linear_model module
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Setup random seed
# np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Instantiate and fit the model (on the training set)
model = LinearRegression()
model.fit(X_train, y_train)
          
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Check the score of the model (on the test set)
# The default score() metric of regression algorithms is R^2

r = model.score(X_test, y_test)
print(f"R^2 score: {r}")

Mean Squared Error (MSE): 0.5137848824526513
R^2 score: 0.6144594222407171


In [19]:
housing_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


Question: How many weight and bias parameters are there? 

In [47]:
weights = model.coef_

# Get intercept (bias)
bias = model.intercept_

print("Coefficients (weights) for each feature:")
for feature, weight in zip(X.columns, weights):
    print(f"{feature}: {weight}")

print("\nIntercept (bias):", bias)

Coefficients (weights) for each feature:
MedInc: 0.44318495086959414
HouseAge: 0.009720455167713159
AveRooms: -0.1149743739234838
AveBedrms: 0.7612873947087113
Population: -2.7550258373471796e-06
AveOccup: -0.004123962992775818
Latitude: -0.4251944662850576
Longitude: -0.43750899277720684

Intercept (bias): -37.2820104162937


In [48]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Perform cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')

# Print the cross-validation scores
print("Cross-Validation r2 Scores:", cv_scores)
print("Cross-Validation r2 Score:", np.mean(cv_scores))


Cross-Validation r2 Scores: [0.54866323 0.46820691 0.55078434 0.53698703 0.66051406]
Cross-Validation r2 Score: 0.5530311140279556


In machine learning, weights and biases are fundamental components of many models, especially in neural networks and certain linear models.

- **Weights:** In machine learning, weights refer to the coefficients attributed to the input features in a model. For instance, in a linear regression model, each feature has a corresponding weight that determines its contribution to the output prediction. In neural networks, weights represent the strengths of connections between neurons in different layers.

- **Biases:** Biases are additional parameters in a model that allow shifting the output, providing more flexibility and expressiveness. In a neural network, biases are added to the weighted sum of inputs and act similarly to intercepts in linear models. They allow the model to represent patterns even when all input features are zeros.

Here's a simple example in a linear regression equation:

$$y = w_1 \cdot \text{feature}_1 + w_2 \cdot \text{feature}_2 + \dots + w_n \cdot \text{feature}_n + b$$

- $w_1, w_2, \dots, w_n$ are the weights associated with each feature.
- $b$ is the bias term.

During training, these weights and biases are adjusted to minimize the error between predicted outputs and actual targets. Many models, neural networks in particular, do this iteratively with optimization algorithms like gradient descent. Scikit-Learn's `LinearRegression` instead solves the ordinary least squares problem directly, but the goal is the same: find the weights and bias that make accurate predictions on unseen data.

In neural networks, these parameters become more complex, as there are multiple layers with interconnected neurons, each having its set of weights and biases. Adjusting these parameters through techniques like backpropagation with gradient descent is crucial for training neural networks effectively.

Understanding and optimizing these parameters are essential for improving a model's performance and generalization to new, unseen data. They play a significant role in the representational power and flexibility of machine learning models.
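
As a small illustration of the linear equation above, the sketch below fits `LinearRegression` on synthetic data and rebuilds its predictions from `coef_` (the weights) and `intercept_` (the bias). The data and the names `X_demo`, `y_demo` and `demo_model` are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + 5 plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(scale=0.01, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# One weight per feature, plus a single bias (intercept)
print("weights:", demo_model.coef_)
print("bias:", demo_model.intercept_)

# A prediction is the weighted sum of the features plus the bias
manual_pred = X_demo @ demo_model.coef_ + demo_model.intercept_
print(np.allclose(manual_pred, demo_model.predict(X_demo)))
```

By the same reasoning, the housing model above has eight weights (one per feature column) and one bias.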

Next step:


What if `LinearRegression` didn't work? Or what if we wanted to improve our results?

The next step would be to try other models. One option is an ensemble model, in which multiple models are put together to make a decision.

One of the most common and useful ensemble methods is the [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), known for its fast training and prediction times and its adaptability to different problems.

The basic premise of the Random Forest is to combine a number of different decision trees, each one randomly different from the others, and make a prediction on a sample by averaging the results of the decision trees.
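
The averaging idea can be checked directly. The sketch below is a minimal illustration on synthetic data from `make_regression` (an assumption, not the housing data): it collects the predictions of the individual trees stored in a fitted forest's `estimators_` attribute and compares their mean with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Each fitted tree lives in forest.estimators_; collect their individual predictions
tree_preds = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging across trees reproduces the forest's own prediction
print(tree_preds.mean(axis=0))
print(forest.predict(X_demo[:5]))
```

The two printed arrays should agree, since a regression forest's prediction is the mean of its trees' predictions.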

An in-depth discussion of the Random Forest algorithm is beyond the scope of this notebook but if you're interested in learning more, [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen is a great read.

Since we're working with regression, we'll use Scikit-Learn's [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

We can use the exact same workflow as for linear regression, changing only the model.

## Example 2: RandomForestRegressor 

In [51]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = RandomForestRegressor()
model.fit(X_train, y_train)


y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Check the score of the model (on the test set)
# The default score() metric of regression algorithms is R^2

r = model.score(X_test, y_test)
print(f"R^2 score: {r}")



Mean Squared Error (MSE): 0.2534073069137548
R^2 score: 0.8066196804802649


In [52]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Perform cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

# Print the cross-validation scores
print("Cross-Validation r2 Scores:", cv_scores)
print("Mean r2 Score:", np.mean(cv_scores))


Cross-Validation r2 Scores: [0.80951797 0.79354052 0.81118186 0.80483983 0.80497546]
Mean r2 Score: 0.8048111291645605


## Hyperparameter tuning

In [None]:
np.random.seed(42)

# Create a second regressor with fewer trees (the default n_estimators is 100)
model = RandomForestRegressor(n_estimators=20)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Check the score of the model (on the test set)
# The default score() metric of regression algorithms is R^2

r = model.score(X_test, y_test)
print(f"R^2 score: {r}")

## Hyperparameter tuning with grid search

In [53]:
# Another way to do it with GridSearchCV...
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over
param_grid = {'n_estimators': list(range(10, 100, 10))}

# Setup the grid search
grid = GridSearchCV(RandomForestRegressor(),
                    param_grid,
                    cv=5)

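# Fit the grid search to the data
# Note: fitting on all of X and y lets the test rows influence the chosen
# hyperparameters; use X_train and y_train to keep the test set unseen
grid.fit(X, y)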

# Find the best parameters
grid.best_params_




{'n_estimators': 70}

With GridSearchCV, different versions of the model with different hyperparameters are created. For each version, cross-validation is employed to assess the model's performance.
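
What the grid search does can be approximated with a plain loop. The sketch below is a simplified illustration, not `GridSearchCV`'s actual implementation; it uses a small synthetic dataset and three hypothetical candidate values so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small synthetic dataset so the loop runs quickly (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

results = {}
for n in [5, 10, 20]:  # candidate values for n_estimators
    candidate = RandomForestRegressor(n_estimators=n, random_state=0)
    # 5-fold cross-validation; regressors are scored with R^2 by default
    scores = cross_val_score(candidate, X_demo, y_demo, cv=5)
    results[n] = scores.mean()

# Keep the candidate with the highest mean cross-validated score
best_n = max(results, key=results.get)
print(results)
print("best n_estimators:", best_n)
```

With its default settings, `GridSearchCV` also refits the winning configuration on all the data passed to `fit` and exposes it as `best_estimator_`.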

In [None]:
model = grid.best_estimator_

In [None]:
model.fit(X_train, y_train)

In [None]:
# Make predictions with the tuned model before scoring them
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")

# Check the score of the model (on the test set)
# The default score() metric of regression algorithms is R^2

r = model.score(X_test, y_test)
print(f"R^2 score: {r}")

### TASK 3: Experiment with Additional Model - Ridge Regression
In addition to the **LinearRegression** and **RandomForestRegressor** models, follow these steps to evaluate the performance of a **Ridge Regression** model:

1. **Model 3: Ridge Regression**
   - Import and initialize the Ridge Regression model:
     ```python
     from sklearn.linear_model import Ridge
     model = Ridge()
     ```
   - Train the Ridge model using the same dataset.

2. **Comparison**
   - Evaluate the Ridge Regression model using **Mean Squared Error (MSE)** and **R² Score**.
   - Compare these metrics against the evaluation results of the previous models.