<h1><b><center>Cloud and Advanced Analytics: Personal Assignment 2023</center></b></h1>

**Contents**: the assignment covers 3 topics:
1. BigQuery
2. IoT 
3. Recommender Systems in Python

**Due: Apr 2, 2023, 23:59** <u>(notebook + quiz)</u>

**Clarifications**: You can post your questions in the Slack channel #assignments. If necessary, we will update the notebook accordingly (so make sure to check for updates on GitHub). 

**Grading**: The personal assignment is worth **30%** of your final grade. For your work to be graded, you must:

* Upload your completed notebook on [Moodle](https://moodle.unil.ch/mod/assign/view.php?id=1482850&forceview=1)
* Answer all questions in the Moodle Quiz. We will check that your quiz answers reflect the responses provided in the notebook. 

>Note: You can only complete the quiz one time. Have your notebook with the answers ready for answering the quiz. 

**Personal work**: This assignment is strictly *personal* work. Do not share it with your colleagues, and do as much as you can on your own. Your code will be compared to that of your colleagues. In case of statistically high similarity, you will receive a grade of zero.


Good Luck and Enjoy ☀

----

# Phase 1: BigQuery and SQL

In this first part, you will explore a dataset using Google BigQuery. Similar to week 2, you will connect to BigQuery, upload the data and access it in the notebook. Your job is to write SQL queries to answer the questions below. 

**Caution**: Do not forget that each user receives a limited amount of free data processing every month in BigQuery, which can be used to run queries on any dataset. Given the size of the datasets that you will query, performing a lot of queries can result in exceeding your free monthly quota. Therefore, you should try to avoid queries that have a big output. The solution is simple: always remember to use the **LIMIT** keyword (especially if you are not sure about the output of your query) to limit the size of the output. You can also write the query on the Google Console, and it will estimate the size of the query. Consider that you have 4TB in total.


## Connecting to BigQuery

To make things easier, we advise you to work in **Google Colab**. 

**For Google Colab users**

In [None]:
from google.colab import auth

auth.authenticate_user()
print("Authenticated")

Authenticated


**For Jupyter users**

Make sure to replace "PATH_TO_CREDENTIALS_FILE" with the *absolute* path to the JSON service account key, e.g., "C:/Users/John/credentials.json".

In [None]:
!pip install google-cloud-bigquery

In [None]:
import os
PROJECT_NAME = "cloud-analytics-init" # REPLACE WITH YOUR PROJECT NAME
# The following line will try to locate the `google_key.json` file in the parent directory of this git repository.
PATH_TO_CREDENTIALS_FILE = os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.getcwd()))), "google_key.json") # REPLACE WITH YOUR PATH TO KEY

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = PATH_TO_CREDENTIALS_FILE

**Everyone**

Make sure to replace the above `PROJECT_NAME` with the ID of one of your Google Cloud projects, where you upload the data from the data folder of the assignment.

In [None]:
import pandas as pd
from google.cloud import bigquery


# Create a client for public datasets
client = bigquery.Client(project="caa-course-sandbox")


## 1.1: Public datasets

In this very first part, we will be using public datasets available on Google Cloud. For each question, answer by writing a query and executing it on Google BigQuery.

Use the table `bigquery-public-data.samples.gsod`. 
#### Question 1.1.1
> How many rows are in the gsod dataset for the year 1929?

In [None]:
# Define the SQL query to retrieve data from a BigQuery dataset
query = """
SELECT COUNT(*) as num_rows
FROM `bigquery-public-data.samples.gsod`
WHERE year = 1929
"""
# Execute the query and get the result
result = client.query(query).to_dataframe()

# Define the correct answer
correct_answer = result['num_rows'][0]
print(correct_answer)

2081


Use the table `bigquery-public-data.world_bank_intl_education.international_education`.
#### Question 1.1.2
> How many distinct countries are represented in the international_education dataset for the year 2016?
> 
> How many rows are in the international_education dataset for the year 2016?

In [None]:

# Define the SQL query to retrieve data from a BigQuery dataset
query = """
SELECT
    COUNT(DISTINCT country_name) as num_countries,
    COUNT(*) as total_rows
FROM `bigquery-public-data.world_bank_intl_education.international_education`
WHERE year = 2016
"""

# Execute the query and get the result
result = client.query(query).to_dataframe()

# Define the correct answers
correct_num_countries = result['num_countries'][0]
correct_total_rows = result['total_rows'][0]
print(correct_num_countries)
print(correct_total_rows)

238
16460


#### Question 1.1.3
>What was the average arrival delay and the average departure delay for all flights from SFO to LAX in the years 2000 to 2015 (both included)? Use the table `bigquery-samples.airline_ontime_data.flights`.

In [None]:
# Define the SQL query to retrieve data from a BigQuery dataset
query = """
SELECT
    AVG(SAFE_CAST(arrival_delay AS FLOAT64)) as avg_arr_delay,
    AVG(SAFE_CAST(departure_delay AS FLOAT64)) as avg_dep_delay
FROM `bigquery-samples.airline_ontime_data.flights`
WHERE date >= '2000-01-01' AND date <= '2015-12-31' AND
      departure_airport = 'SFO' AND arrival_airport = 'LAX'
"""

# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user

print("Average arrival delay:", round(result['avg_arr_delay'][0], 2), "minutes")
print("Average departure delay:", round(result['avg_dep_delay'][0], 2), "minutes")


Average arrival delay: 6.72 minutes
Average departure delay: 9.45 minutes


## 1.2: Private dataset

Now, you will need to create a database on GCP, like we did in the lab of week 2. You will find all the files to upload in the "data" folder. The folder contains 7 tables.

**You will first need to upload them all, in the same BigQuery database, as different tables.**

The database describes the HR department of the company BestCompanyEver. For this exercise, you will be the main Data Scientist of BestCompanyEver, and you will need to write the queries that extract relevant data from the database. 

Once again, answer the questions by writing and executing a query on BigQuery.

The following is the structure of the database:


<img src="database-model-hr-new.gif" alt="Alternative text" />





Credits for the dataset: w3resource

#### Question 1.2.1
> Connect to your BigQuery dataset and list the tables

In [None]:
# Create a "Client" object
#client = bigquery.Client(project="YOUR-PROJECT-ID")
dataset_ref = client.dataset("hr", project="caa-course-sandbox") 
dataset = client.get_dataset(dataset_ref)
# List the tables in the dataset
tables = list(client.list_tables(dataset))
for table in tables:  
    print(table.table_id)


countries
departments
employees
job_history
jobs
locations
regions


#### Question 1.2.2
> Display the name, surname, salary and department number in descending order by salary of the employees. Limit the number of rows of the answer to 10.

In [None]:
query = """SELECT first_name, last_name, salary,  department_id
  FROM hr.employees
   ORDER BY salary DESC
   LIMIT 10;"""


# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user
result



Unnamed: 0,first_name,last_name,salary,department_id
0,Steven,King,24000,90
1,Lex,De Haan,17000,90
2,Neena,Kochhar,17000,90
3,John,Russell,14000,80
4,Karen,Partners,13500,80
5,Michael,Hartstein,13000,20
6,Shelley,Higgins,12000,110
7,Alberto,Errazuriz,12000,80
8,Nancy,Greenberg,12000,100
9,Lisa,Ozer,11500,80


#### Question 1.2.3
>Display the first name, last name, salary and department id of the employees whose first name ends with the letter 'd', 'n' or 's' and whose salary is higher than 10000. Order the result by department id in descending order.

In [None]:
query="""SELECT first_name, last_name, salary, department_id 
FROM hr.employees 
WHERE (first_name LIKE '%D%' OR first_name LIKE '%S%' OR first_name LIKE '%N%') 
  AND salary > 10000 
ORDER BY department_id DESC;"""



# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user
result



Unnamed: 0,first_name,last_name,salary,department_id
0,Shelley,Higgins,12000,110
1,Nancy,Greenberg,12000,100
2,Steven,King,24000,90
3,Neena,Kochhar,17000,90
4,Den,Raphaely,11000,30
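
The question asks for first names that *end* with a given letter, which corresponds to a `LIKE` pattern with a single leading wildcard such as `'%d'`. A pattern such as `'%D%'`, as used in the query above, instead matches an uppercase `D` anywhere in the name, so the two filters can return different rows. The sketch below illustrates the difference on a few made-up names, using Python's built-in `sqlite3` as a local stand-in for BigQuery. The table contents and the helper `names_matching` are hypothetical and not part of the assignment data.

```python
import sqlite3

# Hypothetical toy rows; these are NOT the assignment's hr.employees data.
rows = [("Den",), ("Shelley",), ("Gerald",), ("Luis",), ("Nancy",)]

conn = sqlite3.connect(":memory:")
# SQLite's LIKE ignores ASCII case by default; make it case-sensitive
# so that it behaves like BigQuery's LIKE for this comparison.
conn.execute("PRAGMA case_sensitive_like = ON")
conn.execute("CREATE TABLE employees (first_name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?)", rows)


def names_matching(where_clause):
    """Return the sorted first names that satisfy the given WHERE clause."""
    sql = f"SELECT first_name FROM employees WHERE {where_clause} ORDER BY first_name"
    return [row[0] for row in conn.execute(sql)]


# Names ending with 'd', 'n' or 's' (what the question asks for).
ends_with = names_matching(
    "first_name LIKE '%d' OR first_name LIKE '%n' OR first_name LIKE '%s'"
)
# Names containing an uppercase 'D', 'S' or 'N' anywhere.
contains_upper = names_matching(
    "first_name LIKE '%D%' OR first_name LIKE '%S%' OR first_name LIKE '%N%'"
)

print("ends with d/n/s:  ", ends_with)
print("contains D/S/N:   ", contains_upper)
```

On this toy table only "Den" satisfies both filters, which shows that the choice of pattern changes the result set.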


#### Question 1.2.4
> Display the ID for those employees who did two or more jobs in the past.

In [None]:
query="""SELECT employee_id 
	FROM hr.job_history 
		GROUP BY employee_id 
			HAVING COUNT(*) >=2;"""


# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user
result



Unnamed: 0,employee_id
0,176
1,200
2,101


#### Question 1.2.5
> Calculate the average salary of employees who have held the same job title for at least 6 months. Hint: filter out the cases in which there is no end date.

In [None]:
query = """SELECT AVG(e.salary) AS avg_salary
FROM hr.employees e
INNER JOIN hr.job_history jh ON e.employ_id = jh.employee_id
WHERE jh.end_date_ IS NOT NULL
  AND DATE_DIFF(jh.end_date_, jh.start_date, MONTH) >=6
  AND e.job_id = jh.job_id
"""


# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user
result





Unnamed: 0,avg_salary
0,6500.0


#### Question 1.2.6
> Find the top 1 country with the highest average salary for employees in management positions. Hint: consider the table "jobs" to understand which jobs are management ones. (Challenging)

In [None]:
query ="""
SELECT countries.country_name, AVG(salary) 
FROM (
  SELECT employees.job_id, salary, department_id 
  FROM hr.employees 
  INNER JOIN hr.jobs 
  ON employees.job_id = jobs.job_id 
  WHERE job_title LIKE '%Manager%'
) AS manager_salaries
INNER JOIN hr.departments 
ON manager_salaries.department_id = departments.department_id
INNER JOIN hr.locations 
ON departments.location_id = locations.location_id
INNER JOIN hr.countries 
ON locations.country_id = countries.country_id
GROUP BY countries.country_id, countries.country_name
ORDER BY AVG(salary) DESC
LIMIT 1;"""


# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user
result


Unnamed: 0,country_name,f0_
0,Canada,13000.0


#### Question 1.2.7 
> Retrieve the details (name and surname) of the employees who have a salary above their department average. Order the result alphabetically by first_name. Limit the result to the top 3. (Challenging)

In [None]:
query="""SELECT first_name, last_name from `hr.employees` AS emp
WHERE emp.salary > ( 
  SELECT AVG(emp2.salary) from hr.employees AS emp2
  WHERE emp2.department_id = emp.department_id
)
ORDER BY first_name
LIMIT 3
"""
# Execute the query and get the result
result = client.query(query).to_dataframe()

# Display the result to the user
result


Unnamed: 0,first_name,last_name
0,Adam,Fripp
1,Alberto,Errazuriz
2,Alexander,Hunold


# IoT Questions

Open flow.M5Stack and create a program that does the following:
- On setup, it checks 35 times whether the M5Stack is charging. Each time it is, it calls a function called "dosomething" that plays the tone lowA for 1 beat.
- When button A is pressed, it waits 1 second, then sets the screen background to red, then shows the TVOC reading from the CO2 sensor in the label called label0.

Paste the MicroPython code in the following cell.

In [None]:
"""from m5stack import *
from m5ui import *
from uiflow import *
import time
import unit


setScreenColor(0x222222)
tvoc_0 = unit.get(unit.TVOC, unit.PORTA)






label0 = M5TextBox(139, 107, "label0", lcd.FONT_Default, 0xFFFFFF, rotate=0)

# Describe this function...
def dosomething():
  speaker.sing(220, 1)


def buttonA_wasPressed():
  # global params
  wait(1)
  setScreenColor(0xff0000)
  label0.setText(str(tvoc_0.TVOC))
  pass
btnA.wasPressed(buttonA_wasPressed)


for count in range(35):
  if power.isCharging():
    dosomething()
  else:
    pass"""

Answer briefly in a markdown cell.

>Which kinds of Google Cloud services do you use to send data from IoT devices to BigQuery?
>
>Connect a CO2 sensor to the M5Stack. What measurements can you gather with the sensor?


# Phase 2: Recommenders in Python

## 2.1: Movielens (KNN and SVD)

Using the surprise library, with item-based collaborative filtering, find the top 10 recommended films to watch for a given user, similar to what we have seen in the lab. [Documentation about surprise is available here](http://surpriselib.com/).

We will use the 100k MovieLens dataset, which is smaller than the one we used before. Follow the guiding steps below and answer the two questions.

**First install the library and import the required packages**

In [1]:
!pip install surprise

Collecting surprise
  Using cached surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Using cached scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Collecting joblib>=1.0.0
  Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Collecting numpy>=1.17.3
  Using cached numpy-1.24.2-cp311-cp311-macosx_10_9_x86_64.whl (19.8 MB)
Collecting scipy>=1.3.2
  Downloading scipy-1.10.1-cp311-cp311-macosx_10_9_x86_64.whl (35.0 MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp311-cp311-macosx_10_9_x86_64.whl size=1129821 sha256=9e208ca59df098624401c431a57eb0020af3338933046f311cc2ebb12df42575
  Stored in directory: /Users/rob/Library/Caches/pip/wheel

In [9]:
# Import packages
from surprise import KNNBasic, KNNWithMeans
from surprise import Dataset
from surprise.model_selection import GridSearchCV
from collections import defaultdict
from surprise import get_dataset_dir
from surprise.model_selection import train_test_split
import io
import random
import numpy as np
from utils import *

my_seed = 42 # DO NOT CHANGE THIS LINE
random.seed(my_seed)
np.random.seed(my_seed)

**Load the data using the built-in dataset `ml-100k`**

In [3]:

data = Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /Users/rob/.surprise_data/ml-100k


**Use the full dataset and the `build_full_trainset()` method to build a trainset object. Then use the `build_anti_testset()` method to build the testset.**

In [4]:

trainset = data.build_full_trainset()

testset = trainset.build_anti_testset() 


There are two "dumb" recommenders that are often used as baselines: the Normal Predictor and the Baseline Estimate.
The Normal Predictor is an algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. Set the seed equal to 42. 
The baseline estimate for an unknown rating $r_{ui}$ is denoted by $b_{ui}$ and accounts for the user and item effects: $$b_{ui} = \mu + b_u + b_i$$ where $b_u$ and $b_i$ indicate the observed deviations of user u and item i, respectively, from the average. (see section 2.1 of Koren, 2010 for more details)
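
To make the two baselines concrete, here is a minimal sketch in plain Python on a handful of made-up ratings. It assumes the simplest possible reading of the formula, where $b_u$ and $b_i$ are plain differences between the user (or item) mean and the global mean $\mu$. Surprise's `BaselineOnly` instead fits regularised biases (with ALS or SGD), so its numbers will generally differ. The names `ratings`, `baseline_estimate` and `normal_estimate` are hypothetical.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical toy ratings (user, item, rating); not taken from MovieLens.
ratings = [
    ("u1", "i1", 5), ("u1", "i2", 3),
    ("u2", "i1", 4), ("u2", "i2", 2),
    ("u3", "i1", 3),
]

# Global mean rating (mu in the formula).
mu = statistics.mean(r for _, _, r in ratings)

# Collect the ratings given by each user and received by each item.
by_user, by_item = defaultdict(list), defaultdict(list)
for user, item, r in ratings:
    by_user[user].append(r)
    by_item[item].append(r)


def baseline_estimate(user, item):
    """b_ui = mu + b_u + b_i with plain (unregularised) mean deviations.

    Unknown users or items get a deviation of 0 (an assumption of this sketch).
    """
    b_u = statistics.mean(by_user[user]) - mu if user in by_user else 0.0
    b_i = statistics.mean(by_item[item]) - mu if item in by_item else 0.0
    return mu + b_u + b_i


def normal_estimate(rng):
    """Draw a random rating from N(mu, sigma^2) fitted on the toy ratings."""
    sigma = statistics.pstdev(r for _, _, r in ratings)
    return rng.gauss(mu, sigma)


print("baseline estimate for (u3, i2):", round(baseline_estimate("u3", "i2"), 2))
print("normal predictor draw:", round(normal_estimate(random.Random(42)), 2))
```

Here user u3 rates below the global mean and item i2 is rated below the global mean, so both deviations pull the estimate for (u3, i2) below $\mu$. The normal predictor ignores the user and the item entirely, which is why fixing the seed matters for reproducibility.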

**Instantiate a Normal Predictor and a Baseline Estimator. Fit them on the training set. Then, compute the Normal Predictor and the Baseline Estimate for user 140 and item 4 (you should use Surprise). Round the answer to two decimal places (e.g. 5.315 -> 5.32).**

In [5]:
from surprise import NormalPredictor, BaselineOnly

my_seed = 42 # DO NOT CHANGE THIS LINE
random.seed(my_seed)
np.random.seed(my_seed)

#YOUR CODE HERE
normalP = NormalPredictor()
normalP.fit(trainset)
estNormalP = normalP.estimate()
print(round(estNormalP, 2))


baseline = BaselineOnly()
baseline.fit(trainset)
estBaseline = baseline.estimate(140,4)
print(round(estBaseline, 2))

4.09
Estimating biases using als...
3.74


> 2.1.0: What's the prediction given by the normal predictor for user 140 and item 4? And the baseline estimate?

### 2.1.1 Item-based Collaborative filtering in KNN

**Use GridSearchCV to find the best number of neighbours (k) for a KNNWithMeans item-based algorithm. Use the complete dataset (not split).**  
Use root-mean-square error (RMSE) as the measure and the following parameter grid: `param_grid={'k': [10, 25, 45, 55], 'sim_options': {'name': ['pearson'], 'user_based': [False]}}`.  
Other parameters: `cv=4, refit=True, joblib_verbose=2, n_jobs=-1`.

In [None]:
my_seed = 42 # DO NOT CHANGE THIS LINE
random.seed(my_seed)
np.random.seed(my_seed)

KNN_grid_search = GridSearchCV(KNNWithMeans, param_grid={'k': [10, 25, 45, 55], 
                                                         'sim_options': {'name': ['pearson'], 'user_based': [False]}}, 
                               measures=['RMSE'], cv=4,
                               refit=True, joblib_verbose=2, n_jobs=-1)
KNN_grid_search.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 out of  16 | elapsed:  1.4min finished


Computing the pearson similarity matrix...
Done computing similarity matrix.


> 2.1.1: What is the optimal k for which GridSearchCV returned the best RMSE score?



In [None]:
print("best parameter:", KNN_grid_search.best_params)
print("best rmse: ", KNN_grid_search.best_score)

best parameter: {'rmse': {'k': 55, 'sim_options': {'name': 'pearson', 'user_based': False}}}
best rmse:  {'rmse': 0.9416388342386743}


**Instantiate the KNNWithMeans algorithm using the best k value retrieved above: `KNNWithMeans(k=YOUR_RETRIEVED_VALUE, min_k=1, sim_options=sim_options, verbose=False)`**  
KNNWithMeans takes into account the mean ratings of each user. You can read more about it here: [Documentation](https://surprise.readthedocs.io/en/stable/knn_inspired.html)

`sim_options` needs to be the same as before. 

**Fit the model on the training set and predict ratings on the test set. (It will take a couple of minutes)**

In [32]:
# Load data
data = Dataset.load_builtin('ml-100k') # there are a couple of famous Rec System datasets available in this library
trainset = data.build_full_trainset()

random.seed(my_seed)
np.random.seed(my_seed)


# Define options and create instance of class
sim_options = {
    'name': 'pearson', # let's use pearson similarity which can be seen as mean-centered cosine similarity
    'user_based': False 
}
knn_means = KNNWithMeans(k=KNN_grid_search.best_params["rmse"]["k"], min_k=1, sim_options=sim_options, verbose=False)



# Fit model
knn_means.fit(trainset)

# Predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset() 
predictions = knn_means.test(testset)
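
The comment in `sim_options` describes Pearson similarity as a mean-centered cosine similarity. The following sketch checks that relation in plain Python on two made-up rating vectors, assumed to hold the ratings that the same four users gave to two items. It only illustrates the formula and is not Surprise's implementation; the function names and the vectors `item_a` and `item_b` are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equally long rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Pearson correlation, written as the cosine of the mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical ratings given by the same four users to two items.
item_a = [5, 3, 4, 1]
item_b = [4, 2, 5, 1]

print("cosine :", round(cosine(item_a, item_b), 3))
print("pearson:", round(pearson(item_a, item_b), 3))
```

Centering removes each item's average rating level, so two items count as similar when users deviate from those averages in the same direction, not merely because both tend to receive high ratings.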

**Find the top 20 predictions for the user 100 (you can use helper functions used in the labs)**

In [37]:
# YOUR CODE HERE
# from utils import *
top_n = get_top_n(predictions, n=20)
top_n["100"]

[('1189', 5),
 ('1500', 5),
 ('814', 5),
 ('1536', 5),
 ('1293', 5),
 ('1599', 5),
 ('1653', 5),
 ('1467', 5),
 ('1122', 5),
 ('1201', 5),
 ('1064', 4.71676605264943),
 ('114', 4.517024072255383),
 ('169', 4.460312406226903),
 ('1642', 4.422925224093107),
 ('868', 4.406710641019919),
 ('1524', 4.385518568793411),
 ('1456', 4.379773599027038),
 ('1639', 4.333333333333333),
 ('318', 4.290533345217262),
 ('1125', 4.25)]

### 2.1.2 User-based Collaborative filtering in KNN

**Now, using the same value for k that you found above and the same training and test set, fit a KNN model with User-based collaborative filtering (all other parameters must be left unchanged).**

In [38]:
random.seed(my_seed)
np.random.seed(my_seed)


# Define options and create instance of class
sim_options = {
    'name': 'pearson', # let's use pearson similarity which can be seen as mean-centered cosine similarity
    'user_based': True 
}
knn_means = KNNWithMeans(k=KNN_grid_search.best_params["rmse"]["k"], min_k=1, sim_options=sim_options, verbose=False)



# Fit model
knn_means.fit(trainset)

# Predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset() 
predictions = knn_means.test(testset)

**Once again, find the top 20 predictions for user 100 (you can use helper functions used in the labs).**

In [40]:
# YOUR CODE HERE
from utils import *
top_n = get_top_n(predictions, n=20)
top_n["100"]

[('814', 4.9703123334399315),
 ('851', 4.857027247454124),
 ('1536', 4.7724229446890645),
 ('1467', 4.416536105967811),
 ('1293', 4.38775210870818),
 ('1653', 4.361914257228315),
 ('1429', 4.28627983051033),
 ('1500', 4.282082324455206),
 ('1642', 4.2783740735041516),
 ('113', 4.221851641773584),
 ('1189', 4.216166630439447),
 ('1449', 4.214538036861534),
 ('1175', 4.184036653281956),
 ('1367', 4.162514030162149),
 ('1585', 4.159767413089199),
 ('1651', 4.159767413089199),
 ('1639', 4.159767413089199),
 ('1650', 4.159767413089199),
 ('1631', 4.159767413089199),
 ('1636', 4.159767413089199)]

### 2.1.3: Matrix Factorization

Now, we will be using the SVD matrix factorization method on the same problem. 

**Run a 5-fold cross validation using RMSE on a singular value decomposition algorithm, on the full dataset.**

In [41]:
# YOUR CODE HERE
from surprise.model_selection import cross_validate
from surprise import SVD

random.seed(my_seed)
np.random.seed(my_seed)

algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)


Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9355  0.9351  0.9405  0.9374  0.9346  0.9366  0.0022  
Fit time          1.35    1.39    1.50    1.83    1.38    1.49    0.18    
Test time         0.16    0.15    0.25    0.17    0.23    0.19    0.04    


{'test_rmse': array([0.93552801, 0.93508199, 0.94052991, 0.9373669 , 0.93459892]),
 'fit_time': (1.3512763977050781,
  1.390824794769287,
  1.501483678817749,
  1.8326098918914795,
  1.3844380378723145),
 'test_time': (0.15918946266174316,
  0.14778661727905273,
  0.25342392921447754,
  0.1698756217956543,
  0.23023080825805664)}

**Now, fit a SVD on the trainset and compute predictions on the testset (using the same train set and test set as before). Then, find the top 20 predictions for user 100.**
You can use helper functions used in the labs

In [45]:
# YOUR CODE HERE

random.seed(my_seed)
np.random.seed(my_seed)

algo.fit(trainset)
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=20)
top_n["100"]

[('318', 4.274791300224256),
 ('22', 4.251638315493811),
 ('64', 4.228429679721632),
 ('114', 4.172495768117867),
 ('513', 4.094406975403808),
 ('59', 4.06918227791095),
 ('12', 4.04994053478945),
 ('408', 4.029916189453773),
 ('963', 4.0268618913516585),
 ('357', 4.007681982989534),
 ('178', 4.005231192275642),
 ('511', 3.9963876488736294),
 ('516', 3.995093456648734),
 ('190', 3.9787208351222882),
 ('519', 3.97633356164986),
 ('1019', 3.9502146886701826),
 ('489', 3.9462538276978),
 ('187', 3.935594437965357),
 ('694', 3.9284217922804787),
 ('199', 3.914499029656482)]

In recommender systems, we often use Recall@k and Precision@k to measure the accuracy of the recommendations. 

Recall@k= (Relevant_Items_Recommended in top-k) / (Relevant_Items)

Precision@k= (Relevant_Items_Recommended in top-k) / (k_Items_Recommended)
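
The two definitions above can be sketched for a single user as follows. This is a minimal illustration that applies the formulas literally, with the number of recommended items in the denominator of Precision@k. The helper used in the labs may differ in details, for example in which items count as recommended or in how zero denominators are handled. The function name `toy_precision_recall_at_k` and the sample pairs are hypothetical.

```python
def toy_precision_recall_at_k(user_predictions, k, threshold):
    """Precision@k and Recall@k for ONE user.

    user_predictions: list of (estimated_rating, true_rating) pairs.
    An item is relevant when its true rating is >= threshold.
    """
    # Rank the user's items by estimated rating, best first.
    ranked = sorted(user_predictions, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    n_relevant = sum(true >= threshold for _, true in ranked)
    n_relevant_in_top_k = sum(true >= threshold for _, true in top_k)

    # Return 0.0 when a denominator is zero (an assumption of this sketch).
    precision = n_relevant_in_top_k / len(top_k) if top_k else 0.0
    recall = n_relevant_in_top_k / n_relevant if n_relevant else 0.0
    return precision, recall


# Hypothetical (estimated, true) ratings for one user.
pairs = [(4.8, 5), (4.5, 3), (4.1, 4), (3.9, 5), (3.0, 2)]
for k in (2, 3):
    precision, recall = toy_precision_recall_at_k(pairs, k=k, threshold=4)
    print(f"k = {k}: precision = {precision:.3f}, recall = {recall:.3f}")
```

To obtain the averages asked for in the exercise, the per-user values would then be averaged over all users.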

**Considering as relevant the recommendations with a rating greater than or equal to 4, compute the average precision and recall at k = 2 and k = 30, using an SVD algorithm and the usual train and test sets.**
You can use the helper functions used in the labs.

BOTH VERSION 1 and 2 WILL BE ACCEPTED.

In [15]:
# Version 1
from surprise import SVD


random.seed(my_seed)
np.random.seed(my_seed)

algo = SVD()

algo.fit(trainset)
predictions = algo.test(testset)

k_vals = [2,30]


for k in k_vals:
    precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=4) # rating >= 4 -> relevant, rating < 4 -> irrelevant

    # Precision and recall can then be averaged over all users
    precision = sum(prec for prec in precisions.values()) / len(precisions) 
    recall =  sum(rec for rec in recalls.values()) / len(recalls) 
    print("k = ", k, "precision = ", precision, "recall = ", recall)


k =  2 precision =  0.030752916224814422 recall =  1.0
k =  30 precision =  0.030752916224814422 recall =  1.0


In [16]:
# Version 2
from surprise import SVD


data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.20, random_state=my_seed)


random.seed(my_seed)
np.random.seed(my_seed)

algo = SVD()

algo.fit(trainset)
predictions = algo.test(testset)

k_vals = [2,30]


for k in k_vals:
    precisions, recalls = precision_recall_at_k(predictions, k=k, threshold=4) # rating >= 4 -> relevant, rating < 4 -> irrelevant

    # Precision and recall can then be averaged over all users
    precision = sum(prec for prec in precisions.values()) / len(precisions) 
    recall =  sum(rec for rec in recalls.values()) / len(recalls) 
    print("k = ", k, "precision = ", precision, "recall = ", recall)


k =  2 precision =  0.8936170212765957 recall =  0.1681385161984644
k =  30 precision =  0.8703578316835745 recall =  0.3379017969065891


## 2.2: Recipes (baseline recommender from scratch)

For this exercise you are not allowed to use the Surprise library. You will be asked to implement a very "dumb" recommender, normally used as baseline. 

For this exercise, consider the dataset recipe_ratings.csv. It contains food recipes with the ratings given by the users of a cooking website. To know more, check: *Generating Personalized Recipes from Historical User Preferences*, Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, Julian McAuley, EMNLP 2019, https://www.aclweb.org/anthology/D19-1613/.

All the needed data cleaning has already been done. 

**Implement a baseline recommender that recommends the 10 most popular recipes to all users. Use only the training data.**

In [115]:
import pandas as pd
train_data = pd.read_csv('recipes_train_data.csv')
test_data = pd.read_csv('recipes_test_data.csv')

# Calculate the mean rating for each recipe
recipe_ratings = train_data.groupby('item_id')['rating'].mean()

# Get the most popular recipes
popular_recipes = recipe_ratings.sort_values(ascending=False).index[:10]

# Define the baseline recommender function
def baseline_recommender(user_id):
    return popular_recipes



> Fitting the recommender on the training set, what's the first recommended recipe to user 37?

In [59]:
baseline_recommender(37) # you could give whatever you want as input, nothing changes

Int64Index([86659, 140886, 69988, 69982, 34014, 69981, 34018, 140916, 34027,
            140906],
           dtype='int64', name='item_id')

**Now, instead of doing top-K recommendation, we predict the ratings. Create a simple predictor that, given a (user, item) pair, predicts the rating as the mean rating of that user. For example, say you have been to 20 restaurants and reviewed them on Tripadvisor, giving 3 stars on average. This recommender should predict 3 for each of your future ratings. PAY ATTENTION: if the test set contains a user who was not present in the training set, your recommender should predict the "global average", i.e. the average of all ratings in the training set. Fit this recommender on the train data and compute the MSE on the test data.**

In [110]:
from sklearn.metrics import mean_squared_error
import numpy as np

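global_avg = np.mean(train_data["rating"])

# Mean rating per user, computed on the training data only
predictions = train_data.groupby('user_id')['rating'].mean()


# Join the per-user predictions to the test data (new column: rating_pred)
test_data = test_data.join(predictions, on="user_id", rsuffix="_pred")
# Users unseen in training have no prediction: fall back to the global average
test_data['rating_pred'] = test_data['rating_pred'].fillna(global_avg)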

# Calculate the mean squared error between the predicted ratings and the actual ratings in the test set
mse = mean_squared_error(test_data['rating'], test_data['rating_pred'])

print(f"Mean squared error: {mse}")

Mean squared error: 5.339175716071459


> Report the MSE you obtained

# Theory

Which of the following is NOT a type of recommender system?
- [ ] Content-based filtering
- [ ] Collaborative filtering
- [ ] Hybrid filtering
- [x] Deep learning filtering

Which of the following evaluation metrics is NOT used to evaluate recommender systems?
- [ ] Mean Absolute Error (MAE)
- [ ] Root Mean Squared Error (RMSE)
- [ ] Accuracy
- [x] F1 Score

Which of the following is used in item-based collaborative filtering?
- [ ] K-means clustering
- [ ] Apriori algorithm
- [x] Cosine similarity
- [ ] None of the above

