# 05: Creating a matrix factorization model to make movie recommendations

This project used a matrix factorization model in BigQuery ML to generate movie recommendations for users from the public MovieLens dataset. The dataset contains explicit feedback: each row is a rating on a 1-5 scale that a user (user ID) gave to a movie (movie ID). The model was trained on these ratings, and movie metadata such as title and genre was joined in afterwards to describe the recommended movies. The ratings dataset has 1,000,209 rows and the movie titles dataset has 3,883 rows.


## Key Concepts: 
- Matrix Factorization
- ML.RECOMMEND 
- ML.WEIGHTS

## Objective
- Create an explicit recommendations model using the CREATE MODEL statement 
- Use the ML.EVALUATE function to evaluate the model
- Use the ML.WEIGHTS function to inspect the latent factor weights generated during training 
- Use the ML.RECOMMEND function to produce recommendations for a user. 
 
## Steps
- Create the dataset & load the dataset into BQ
- Use the SELECT statement to examine the data 
- Use the CREATE MODEL statement to create the explicit recommendation model. 
- Use the ML.EVALUATE function to evaluate the model
- Use the ML.RECOMMEND function to predict the ratings and make recommendations
- Save the recommendation predictions to a table
- Generate the top 5 recommendations per user
- Use the ML.WEIGHTS function to inspect the latent factors

### Step 1: Create the dataset & load the dataset into BQ
The dataset was downloaded from GroupLens ([Movielens](http://files.grouplens.org/datasets/movielens/ml-1m.zip)), and a BigQuery dataset was created with the ID ``04_bqml_matrixFactor_movie_recc_prediction``. The data was originally in CSV format, so it was first loaded from a GCS bucket and transformed into BigQuery tables.
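
The load step can also be expressed in SQL. The following is a minimal sketch, not the exact commands used here: the bucket path is hypothetical, and it assumes a CSV file with a header row and exactly the three columns the model uses later.

```sql
-- Sketch only: the bucket path and the column list are assumptions.
CREATE SCHEMA IF NOT EXISTS `04_bqml_matrixFactor_movie_recc_prediction`;

LOAD DATA INTO `04_bqml_matrixFactor_movie_recc_prediction.movielens_1m`
  (user_id INT64, item_id INT64, rating FLOAT64)
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- assumes the file has a header row
  uris = ['gs://your-bucket/movielens/ratings.csv']  -- hypothetical path
);
```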

### Step 2: Use the SELECT statement to examine the data 

Next, the dataset was examined to identify which columns to use as training data for the matrix factorization model.

The following query retrieves the data in the `movielens_1m` table:

```sql
SELECT * FROM `04_bqml_matrixFactor_movie_recc_prediction.movielens_1m`;
```

The following query retrieves the data in the `movie_titles` table:

```sql
SELECT * FROM `04_bqml_matrixFactor_movie_recc_prediction.movie_titles`;
```
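
Beyond `SELECT *`, a small aggregate query gives a quicker picture of the training data. This is an illustrative sketch that assumes the `user_id`, `item_id`, and `rating` columns used in the model query; the output aliases are made up for the example.

```sql
-- Summarize the ratings table: size, distinct users and items, rating range.
SELECT
  COUNT(*) AS n_ratings,
  COUNT(DISTINCT user_id) AS n_users,
  COUNT(DISTINCT item_id) AS n_items,
  MIN(rating) AS min_rating,
  MAX(rating) AS max_rating,
  ROUND(AVG(rating), 2) AS avg_rating
FROM `04_bqml_matrixFactor_movie_recc_prediction.movielens_1m`;
```

If the load was complete, `n_ratings` should match the 1,000,209 rows noted in the introduction.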

### Step 3: Use the CREATE MODEL statement to create the explicit recommendation model. 

Next, we created the explicit recommendations model from the movielens data. The following query creates the model, which is later used to predict a rating for every user-item pair.

```sql
#standardSQL
CREATE OR REPLACE MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`
OPTIONS
 (model_type='matrix_factorization',
  user_col='user_id',
  item_col='item_id',
  -- l2_reg: amount of L2 regularization, which penalizes weights in proportion to
  -- the sum of the squares of the weights. It drives outlier weights (high positive
  -- or low negative values) closer to 0, but not quite to 0.
  l2_reg=9.83,
  -- num_factors: number of latent factors to use for the matrix factorization model.
  num_factors=34
  ) AS
SELECT
 user_id,
 item_id,
 rating
FROM `04_bqml_matrixFactor_movie_recc_prediction.movielens_1m`
```

#### Understanding the Training Results
Training ran for **11 iterations** to minimize the loss computed on the training data. The first iteration had a training data loss of 3.6700 and took 246.14 seconds; the last iteration had a training data loss of 0.3950 and took 87.12 seconds. For matrix factorization models, the Training Data Loss column is the mean squared error.

![01](assets/01.png "Training loss")
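
The same per-iteration figures can be read with a query instead of the console. This is a minimal sketch using the `ML.TRAINING_INFO` function; it assumes the model name created in Step 3.

```sql
-- List the loss and duration of each training iteration.
SELECT
  iteration,
  loss,
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`)
ORDER BY
  iteration;
```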

### Step 4: Use the ML.EVALUATE function to evaluate the model data
The ML.EVALUATE function was used to provide statistics about model performance. 

```sql
#standardSQL
SELECT
 *
FROM
 ML.EVALUATE(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`)
```

#### Understanding the Results. 
Because this is an explicit matrix factorization model, the results include the following columns: ***mean_absolute_error, mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score, explained_variance***

![02](assets/02.png "results")
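
Mean squared error is expressed in squared rating units, so it can help to look at its square root as well, which is on the same 1-5 scale as the ratings. A small sketch (the `root_mean_squared_error` alias is made up for this example):

```sql
-- Derive RMSE from the evaluation output so the error reads in rating units.
SELECT
  mean_absolute_error,
  mean_squared_error,
  SQRT(mean_squared_error) AS root_mean_squared_error
FROM
  ML.EVALUATE(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`);
```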

### Step 5: Use the ML.RECOMMEND function to predict the ratings and make recommendations

The `ML.RECOMMEND` function outputs a column named `predicted_<rating_column_name>`. For explicit matrix factorization models, `predicted_rating` is the estimated value of `rating`.
The following query fetches all of the predicted movie ratings for up to 5 users (the subquery takes five rows from the ratings table, so a repeated `user_id` yields fewer distinct users):

```sql
#standardSQL
SELECT
 *
FROM
 ML.RECOMMEND(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`,
   (
   SELECT
     user_id
   FROM
     `04_bqml_matrixFactor_movie_recc_prediction.movielens_1m`
   LIMIT 5))
```
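
To look at a single user rather than a sample, the subquery can supply one `user_id` directly. The sketch below is illustrative: user ID 1 and the limit of 10 rows are arbitrary choices, not values taken from this project.

```sql
-- Ten highest predicted ratings for one user (user_id 1 chosen for illustration).
SELECT
  user_id,
  item_id,
  predicted_rating
FROM
  ML.RECOMMEND(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`,
    (SELECT 1 AS user_id))
ORDER BY
  predicted_rating DESC
LIMIT 10;
```
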
The following query fetches the predicted ratings for all user-item pairs:

```sql
#standardSQL
SELECT
 *
FROM
 ML.RECOMMEND(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`)
```

![01](assets/01.png "results")

### Step 6: Save the recommendation predictions to a table
The query below was used to store the predictions for all user-item pairs in the `recommend_1m` table.

```sql
#standardSQL
CREATE OR REPLACE TABLE `04_bqml_matrixFactor_movie_recc_prediction.recommend_1m`
OPTIONS() AS
SELECT
 *
FROM
 ML.RECOMMEND(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`)
```

![01](assets/01.png "results")


### Step 7: Generate the top 5 recommendations per user
Using the saved recommendations, we ordered the items by predicted rating and output the top predicted items for each user. The following query joins the `item_id` values with the `movie_id` values in the `04_bqml_matrixFactor_movie_recc_prediction.movie_titles` table uploaded earlier and outputs the top 5 recommended movies per user.

```sql
#standardSQL
SELECT
 user_id,
 ARRAY_AGG(STRUCT(movie_title, genre, predicted_rating)
ORDER BY predicted_rating DESC LIMIT 5)
FROM (
SELECT
 user_id,
 item_id,
 predicted_rating,
 movie_title,
 genre
FROM
 `04_bqml_matrixFactor_movie_recc_prediction.recommend_1m`
JOIN
 `04_bqml_matrixFactor_movie_recc_prediction.movie_titles`
ON
 item_id = movie_id)
GROUP BY
 user_id

```

#### Understanding the Results
Because the `movie_titles` table holds metadata for each `movie_id` beyond the INT64 identifier, the results show information such as the title and genre of the top 5 recommended movies for each user.


![03](assets/03.png "results")
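
The saved predictions cover every user-item pair, so a user's top 5 can include movies they have already rated. One possible refinement, sketched here as an assumption rather than as part of the original project, is an anti-join against the ratings table so that only unrated movies are recommended (the `top_unrated` alias is made up for this example):

```sql
-- Top 5 predicted movies per user, excluding movies the user already rated.
SELECT
  r.user_id,
  ARRAY_AGG(STRUCT(t.movie_title, t.genre, r.predicted_rating)
    ORDER BY r.predicted_rating DESC LIMIT 5) AS top_unrated
FROM
  `04_bqml_matrixFactor_movie_recc_prediction.recommend_1m` AS r
JOIN
  `04_bqml_matrixFactor_movie_recc_prediction.movie_titles` AS t
ON
  r.item_id = t.movie_id
LEFT JOIN
  `04_bqml_matrixFactor_movie_recc_prediction.movielens_1m` AS seen
ON
  seen.user_id = r.user_id
  AND seen.item_id = r.item_id
WHERE
  seen.item_id IS NULL  -- keep only pairs with no existing rating
GROUP BY
  r.user_id;
```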

### Step 8: Use the ML.WEIGHTS function to inspect the latent factors
To see which genre each latent factor might correlate with, the following query was run:

```sql
#standardSQL
SELECT
 factor,
 ARRAY_AGG(STRUCT(feature, genre,
     weight)
 ORDER BY
   weight DESC
 LIMIT
   10) AS weights
FROM (
 SELECT
   * EXCEPT(factor_weights)
 FROM (
   SELECT
     *
   FROM (
     SELECT
       factor_weights,
       CAST(feature AS INT64) as feature
     FROM
       ML.WEIGHTS(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`)
     WHERE
       processed_input= 'item_id')
   JOIN
     `04_bqml_matrixFactor_movie_recc_prediction.movie_titles`
   ON
     feature = movie_id) weights
 CROSS JOIN
   UNNEST(weights.factor_weights)
 ORDER BY
   feature,
   weight DESC)
GROUP BY
 factor 

```

#### Understanding the Results
In the results, movies in the Crime | Drama genre have the greatest weights.

![04](assets/04.png "results")
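
As a quick sanity check on the `ML.WEIGHTS` output, the sketch below counts how many features were learned for each input column and how many factor weights each one carries. It assumes the `processed_input` and `factor_weights` columns used in the query above; the factor counts would be expected to match the `num_factors=34` setting from Step 3.

```sql
-- Count features per input column and check the length of each weight array.
SELECT
  processed_input,
  COUNT(*) AS n_features,
  MIN(ARRAY_LENGTH(factor_weights)) AS min_factors,
  MAX(ARRAY_LENGTH(factor_weights)) AS max_factors
FROM
  ML.WEIGHTS(MODEL `04_bqml_matrixFactor_movie_recc_prediction.my_explicit_mf_model`)
GROUP BY
  processed_input;
```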

