# Using a linear regression model in BQML to predict penguin weight
This project used a linear regression model in BigQuery ML to predict the weight of a penguin based on the penguin's species, island of residence, culmen length and depth, flipper length, and sex.

## Objective 
1. Create a linear regression model 
1. Evaluate the linear regression model 
1. Make the penguin weight predictions using the linear regression model. 

## Key Concepts
1. Linear regression 
1. Explainable AI
1. ML.EVALUATE
1. ML.PREDICT

## Steps
1. Create the dataset and dataset table
1. Use the SELECT statement to examine the data 
1. Use the CREATE VIEW statement to compile your training data
1. Use the CREATE MODEL statement to create your linear regression model. 
1. Use the ML.EVALUATE function to evaluate the model data
1. Use the ML.PREDICT function to predict the penguin weight for a given set of data
1. Use the ML.EXPLAIN_PREDICT function to explain prediction results with Explainable AI methods.
1. Use the ML.GLOBAL_EXPLAIN function to find which features are most important in determining the weight.


## Step 1: Create the Dataset and Dataset Table

The data comes from the BigQuery public penguins table, `bigquery-public-data.ml_datasets.penguins`. Rows with a non-null `body_mass_g` were copied into a new table within the project dataset.

```sql
-- create the table and pull in data from the public data set. 

CREATE OR REPLACE TABLE
`01_bqml_linear_reg_penguin_weight_prediction.penguins_table` AS (
 SELECT * FROM
`bigquery-public-data.ml_datasets.penguins`
 WHERE
   body_mass_g IS NOT NULL);
   --342 ROWS 
```
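
The `CREATE TABLE` statement above assumes the dataset already exists. A minimal sketch of creating it with DDL follows. It assumes the default project is used and that the dataset is created in the same location as the public table it reads from, since BigQuery does not query across locations.

```sql
-- Create the dataset (schema) that will hold the table and the model.
-- Assumption: the default project is used; add OPTIONS(location = ...) if
-- the dataset must be pinned to a specific location.
CREATE SCHEMA IF NOT EXISTS `01_bqml_linear_reg_penguin_weight_prediction`;
```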

## Step 2: Use the SELECT statement to examine the data
Next, the data was examined to identify which columns to use as training data for the linear regression model.

```sql
SELECT
 species,
 island,
 culmen_length_mm,
 culmen_depth_mm,
 flipper_length_mm,
 body_mass_g,
 sex
FROM
 `bigquery-public-data.ml_datasets.penguins`
LIMIT
 100;
```
The results show that the `body_mass_g` column holds continuous numeric values, which makes it a suitable label for a linear regression model.
![01](assets/01.png "01")
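
The outline lists a `CREATE VIEW` step for compiling the training data, although the queries in this README read the public table directly. A minimal sketch of such a view is shown here. The view name `penguins_training_view` is hypothetical and does not appear elsewhere in this project.

```sql
-- Compile the training data into a reusable view.
-- Assumption: the view name is illustrative only.
CREATE OR REPLACE VIEW
  `01_bqml_linear_reg_penguin_weight_prediction.penguins_training_view` AS
SELECT
  species,
  island,
  culmen_length_mm,
  culmen_depth_mm,
  flipper_length_mm,
  sex,
  body_mass_g            -- label column
FROM
  `bigquery-public-data.ml_datasets.penguins`
WHERE
  body_mass_g IS NOT NULL;   -- rows without a label cannot be used for training
```

A view like this keeps the filtering logic in one place, so the training, evaluation, and prediction queries could all select from it instead of repeating the `WHERE` clause.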


## Step 4: Use the CREATE MODEL Statement to create your linear regression model
Next, the ```CREATE MODEL``` statement was used to train the new linear regression model with the option ```model_type='linear_reg'```. The query trains on the public penguins table, keeping only rows where ```body_mass_g``` is not null.

```sql
#standardSQL
CREATE OR REPLACE MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`
OPTIONS
 (model_type='linear_reg',
 input_label_cols=['body_mass_g']) AS
SELECT
 *
FROM
 `bigquery-public-data.ml_datasets.penguins`
WHERE
 body_mass_g IS NOT NULL

```
### Results
The training data loss was 81,838. This is the loss metric, the mean squared error, calculated on the training dataset after the model is trained.
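
The training loss can also be read with a query rather than from the console. The sketch below uses `ML.TRAINING_INFO`. It assumes the model was trained as shown above, and the selected columns are the ones relevant here.

```sql
-- Inspect the loss recorded for each training iteration.
-- eval_loss can be NULL when no rows were held out for evaluation,
-- which is likely for a table this small under the default data split.
SELECT
  iteration,
  loss,
  eval_loss,
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`)
ORDER BY
  iteration;
```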


## Step 5: Use the ML.EVALUATE function to evaluate the model

Then, the ML.EVALUATE function was used to report statistics about model performance.

```sql
# STANDARD SQL
SELECT
 *
FROM
 ML.EVALUATE(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`,
   (
   SELECT
     *
   FROM
     `bigquery-public-data.ml_datasets.penguins`
   WHERE
     body_mass_g IS NOT NULL) )
```

Because input data is passed in the query, ML.EVALUATE computes the metrics against those rows. Without input data, it would instead return the metrics generated during training.
![02](assets/02.png "02")

**Mean Absolute Error**: The average absolute difference between the actual and predicted values across all evaluated examples.

**Mean Squared Error**: The average squared loss per example, calculated by dividing the total squared loss by the number of examples.

**Mean Squared Log Error**: The average squared difference between the logarithms of the actual and predicted values. It can be interpreted as a measure of the ratio between the true and predicted values.

**Median Absolute Error**: The median of all absolute differences between the target and the prediction.

**R2**: A statistical measure of how closely the linear regression predictions approximate the actual data. ***0*** indicates that the model explains none of the variability of the response data around the mean. ***1*** indicates that the model explains all the variability of the response data around the mean.
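
To make the definitions above concrete, three of the metrics can be recomputed by hand from the output of `ML.PREDICT`. This is a sketch for illustration, not how ML.EVALUATE is implemented. The column aliases `actual`, `predicted`, and `mean_actual` are chosen here, and the results should be close to the corresponding ML.EVALUATE values for the same input rows.

```sql
-- Recompute MAE, MSE, and R2 from the model's predictions.
WITH predictions AS (
  SELECT
    body_mass_g AS actual,
    predicted_body_mass_g AS predicted,
    AVG(body_mass_g) OVER () AS mean_actual   -- mean of the label over all rows
  FROM
    ML.PREDICT(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`,
      (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.penguins`
      WHERE
        body_mass_g IS NOT NULL))
)
SELECT
  AVG(ABS(actual - predicted)) AS mean_absolute_error,
  AVG(POW(actual - predicted, 2)) AS mean_squared_error,
  -- R2 = 1 - (residual sum of squares / total sum of squares)
  1 - SUM(POW(actual - predicted, 2)) / SUM(POW(actual - mean_actual, 2)) AS r2_score
FROM
  predictions;
```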




## Step 6: Use the ML.PREDICT function to predict the penguin weight for a given set of data

Next, the ML.PREDICT function was used with the penguins_model to predict the weight of the penguins that live on Biscoe island.

```sql
#standardSQL
SELECT
 *
FROM
 ML.PREDICT(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`,
   (
   SELECT
     *
   FROM
     `bigquery-public-data.ml_datasets.penguins`
   WHERE
     body_mass_g IS NOT NULL
     AND island = "Biscoe"))
```

### Results 
When the function runs, the output includes a new column, `predicted_body_mass_g`, alongside the input columns.

![03](assets/03.png "03")
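
Since the input rows still carry the actual `body_mass_g`, the predictions can be compared with the measured weights directly. The sketch below lists the ten Biscoe penguins with the largest prediction errors. The alias `residual_g` and the limit of 10 are illustrative choices.

```sql
-- Compare predicted and actual weights, largest errors first.
SELECT
  species,
  sex,
  body_mass_g,
  predicted_body_mass_g,
  body_mass_g - predicted_body_mass_g AS residual_g   -- actual minus predicted
FROM
  ML.PREDICT(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`,
    (
    SELECT
      *
    FROM
      `bigquery-public-data.ml_datasets.penguins`
    WHERE
      body_mass_g IS NOT NULL
      AND island = "Biscoe"))
ORDER BY
  ABS(body_mass_g - predicted_body_mass_g) DESC
LIMIT
  10;
```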

## Step 7: Use the ML.EXPLAIN_PREDICT function to explain prediction results with Explainable AI methods

To understand why the model generates these prediction results, the ```ML.EXPLAIN_PREDICT``` function was used. It returns the prediction results with additional columns that explain the result for each row of data.

```sql
 # STANDARD SQL
SELECT
 *
FROM
 ML.EXPLAIN_PREDICT (MODEL `paulkamau.01_bqml_linear_reg_penguin_weight_prediction.penguins_model`, (
 SELECT
   *
 FROM
   `bigquery-public-data.ml_datasets.penguins`
 WHERE
   body_mass_g IS NOT NULL
   AND island = "Biscoe"),
 STRUCT(3 AS top_k_features));
```

### Results
When the function runs, the output includes the nested `top_feature_attributions` column with the fields `top_feature_attributions.feature` and `top_feature_attributions.attribution`. With `top_k_features` set to 3, each row lists its three largest attributions, sorted by the absolute value of the attribution in descending order. In this case, island was the top feature contributing to the body weight prediction.

![04](assets/04.png "04")
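
The attributions are returned as an array of structs, which can be awkward to read in a flat result. A minimal sketch that unnests them into one row per feature is shown below. The alias `fa` is chosen here, and the model is referenced without a project prefix on the assumption that it lives in the default project.

```sql
-- Flatten the per-row attributions: one output row per (penguin, feature).
SELECT
  species,
  predicted_body_mass_g,
  baseline_prediction_value,
  fa.feature,
  fa.attribution
FROM
  ML.EXPLAIN_PREDICT(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`,
    (
    SELECT
      *
    FROM
      `bigquery-public-data.ml_datasets.penguins`
    WHERE
      body_mass_g IS NOT NULL
      AND island = "Biscoe"),
    STRUCT(3 AS top_k_features)),
  UNNEST(top_feature_attributions) AS fa;   -- expands the array of feature attributions
```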


## Step 8: Use the ML.GLOBAL_EXPLAIN function to find which features are most important in determining the weight

Unlike ```ML.EXPLAIN_PREDICT```, which reports each feature's contribution to the prediction at the row level, ```ML.GLOBAL_EXPLAIN``` reports each feature's overall contribution to predictions across the entire dataset.

To do this, first return to the CREATE MODEL statement, add the option `enable_global_explain=TRUE`, and rerun it.

```sql
#standardSQL
CREATE OR REPLACE MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`
OPTIONS
 (model_type='linear_reg',
 input_label_cols=['body_mass_g'],
 enable_global_explain=TRUE) AS
SELECT
 *
FROM
 `bigquery-public-data.ml_datasets.penguins`
WHERE
 body_mass_g IS NOT NULL
```

Then use the ML.GLOBAL_EXPLAIN function to get the overall attribution. 

```sql
#standardSQL
SELECT
 *
FROM
 ML.GLOBAL_EXPLAIN(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`)
```

### Results 
When the function runs, it returns a table of features with their attributions in descending order. In this case, sex was the top feature contributing to the body weight prediction.



![05](assets/05.png "05")
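
To avoid relying on the default row order, the ranking can be made explicit by selecting the two output columns and sorting them. This is a small sketch that assumes the model was retrained with `enable_global_explain=TRUE` as shown above.

```sql
-- Rank features by their overall attribution, most influential first.
SELECT
  feature,
  attribution
FROM
  ML.GLOBAL_EXPLAIN(MODEL `01_bqml_linear_reg_penguin_weight_prediction.penguins_model`)
ORDER BY
  attribution DESC;
```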
