### Codio Activity 19.3: Collaborative Filtering

**Expected Time = 90 minutes**

**Total Points = 50**

In this activity, you will use collaborative filtering to predict user ratings.  The process is iterative: starting from the simple reviews dataset, you alternate between fitting regression models for users and for artists, with the goal of filling in each user's missing ratings.  You will build the regression models with scikit-learn's `LinearRegression` estimator.

### Index


- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [20]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

### The Data

Again, you begin with data indexed by artists.  You will add random values for `F1` and `F2` and use them as features to fit a regression model for each user.  The coefficients of those user models then serve as features for a new set of models, one per artist, and the process repeats.  The goal remains to predict user ratings of unrated albums.

In [21]:
reviews = pd.read_csv("data/user_rated.csv", index_col=0).iloc[:, :-2].T

In [22]:
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,3.0,,2.0,3.0,1.0
Clint Black,4.0,9.0,5.0,,1.0
Dropdead,,,8.0,9.0,
Anti-Cimex,4.0,3.0,9.0,4.0,9.0
Cardi B,4.0,8.0,,9.0,5.0
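
The alternating procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the graded solution: the helper name `solve_factors`, the seed `0`, and the choice of ten rounds are hypothetical, and `np.linalg.lstsq` stands in for a no-intercept `LinearRegression`. The ratings are copied from the table above.

```python
import numpy as np

# Ratings from the table above: rows are artists, columns are users, NaN = unrated.
R = np.array([
    [3.0, np.nan, 2.0, 3.0, 1.0],        # Michael Jackson
    [4.0, 9.0, 5.0, np.nan, 1.0],        # Clint Black
    [np.nan, np.nan, 8.0, 9.0, np.nan],  # Dropdead
    [4.0, 3.0, 9.0, 4.0, 9.0],           # Anti-Cimex
    [4.0, 8.0, np.nan, 9.0, 5.0],        # Cardi B
])


def solve_factors(ratings, fixed):
    """For each column of `ratings`, regress its observed entries on the
    rows of `fixed` (least squares, no intercept) and return the coefficients."""
    out = np.empty((ratings.shape[1], fixed.shape[1]))
    for j in range(ratings.shape[1]):
        seen = ~np.isnan(ratings[:, j])  # keep only the rated entries
        out[j], *_ = np.linalg.lstsq(fixed[seen], ratings[seen, j], rcond=None)
    return out


rng = np.random.default_rng(0)        # hypothetical seed, not the activity's
item_f = rng.normal(size=(5, 2))      # random starting artist factors
for _ in range(10):                   # hypothetical number of rounds
    user_f = solve_factors(R, item_f)     # one model per user
    item_f = solve_factors(R.T, user_f)   # one model per artist

pred = item_f @ user_f.T              # predicted ratings, same layout as R
print(np.round(pred, 2))
```

Each pass holds one set of factors fixed and solves for the other, which is the same pattern the problems below carry out by hand with `LinearRegression`.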


[Back to top](#-Index)

### Problem 1

#### Creating F1 and F2

**5 Points**

To begin, create two randomly initialized vectors `F1` and `F2` as columns in your DataFrame.  To do so, draw numbers from a normal distribution using `np.random.normal(size = 5)`.  Set the seed first by calling `np.random.seed(42)`.  
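
Seeding is what makes the random draws repeatable. The short sketch below (the names `first` and `second` are only illustrative) shows that calling `np.random.seed(42)` before each draw yields the same five numbers. Note that the seed is set by calling the function; assigning to `np.random.seed` would overwrite the function instead.

```python
import numpy as np

np.random.seed(42)                 # call the function; do not assign to it
first = np.random.normal(size=5)

np.random.seed(42)                 # reseeding restarts the same sequence
second = np.random.normal(size=5)

print(np.array_equal(first, second))  # True
```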

In [23]:
### GRADED
np.random.seed(42)
reviews["F1"] = np.random.normal(size=5)
reviews["F2"] = np.random.normal(size=5)

### ANSWER CHECK
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.496714,-0.234137
Clint Black,4.0,9.0,5.0,,1.0,-0.138264,1.579213
Dropdead,,,8.0,9.0,,0.647689,0.767435
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.52303,-0.469474
Cardi B,4.0,8.0,,9.0,5.0,-0.234153,0.54256


[Back to top](#-Index)

### Problem 2

#### Regression models for all users

**10 Points**

As in earlier Codio activities, use `X = reviews[['F1', 'F2']]` and `y = user` for each user column, dropping the rows where `y` is missing.  Build a regression model **with no intercept** for each user, and store the coefficients in a (5, 2) NumPy array `uf`, with one row per user in column order.  
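
As a sanity check on what "no intercept" means, the sketch below fits Alfred's four rated artists two ways: with `LinearRegression(fit_intercept=False)` and with a plain least-squares solve through the origin. The feature values are copied, rounded, from the table in Problem 1, so the result should match the first row of `uf` only approximately; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Alfred's rated artists with F1/F2 copied (rounded) from the Problem 1 table.
X = np.array([
    [0.496714, -0.234137],   # Michael Jackson
    [-0.138264, 1.579213],   # Clint Black
    [1.523030, -0.469474],   # Anti-Cimex
    [-0.234153, 0.542560],   # Cardi B
])
y = np.array([3.0, 4.0, 4.0, 4.0])  # Alfred's ratings; Dropdead is dropped (NaN)

# No-intercept regression via scikit-learn.
coef_sklearn = LinearRegression(fit_intercept=False).fit(X, y).coef_

# The same problem as ordinary least squares through the origin.
coef_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef_sklearn)
print(coef_lstsq)
```

Both approaches minimize the same squared error, so the two coefficient vectors should agree to numerical precision.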

In [24]:
### GRADED
uf = np.empty((5, 2))  # placeholder array; every row is overwritten in the loop
features = ["F1", "F2"]
users = ["Alfred", "Mandy", "Lenny", "Joan", "Tino"]
for idx, user in enumerate(users):
    user_df = reviews[[user] + features].dropna()
    model = LinearRegression(fit_intercept=False).fit(user_df[features], user_df[user])
    uf[idx, :] = model.coef_

### ANSWER CHECK
display(uf.shape)  # should be (5, 2)
display(uf)

(5, 2)

array([[ 3.82095605,  3.39576219],
       [ 3.71034729,  7.00619661],
       [ 7.11326267,  3.95250165],
       [ 5.24016749, 10.03575897],
       [ 5.86328014,  2.19748154]])

[Back to top](#-Index)

### Problem 3

#### New model for artists

**10 Points**

Below, a DataFrame `ui_df` is created using the coefficients from the previous problem.  Now, use the columns `F1` and `F2` of `ui_df` as features to fit a no-intercept model for each artist, and track each *artist's* coefficients.  Assign them as a NumPy array to `ifs` below.


In [25]:
ui_df = reviews.iloc[:, :-2].T
ui_df["F1"] = uf[:, 0]
ui_df["F2"] = uf[:, 1]
ui_df

Unnamed: 0,Michael Jackson,Clint Black,Dropdead,Anti-Cimex,Cardi B,F1,F2
Alfred,3.0,4.0,,4.0,4.0,3.820956,3.395762
Mandy,,9.0,,3.0,8.0,3.710347,7.006197
Lenny,2.0,5.0,8.0,9.0,,7.113263,3.952502
Joan,3.0,,9.0,4.0,9.0,5.240167,10.035759
Tino,1.0,1.0,,9.0,5.0,5.86328,2.197482


In [26]:
### GRADED
ifs = np.empty((5, 2))  # placeholder array; every row is overwritten in the loop
features = ["F1", "F2"]
targets = ["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"]
for idx, target in enumerate(targets):
    target_df = ui_df[[target] + features].dropna()
    model = LinearRegression(fit_intercept=False).fit(
        target_df[features], target_df[target]
    )
    ifs[idx, :] = model.coef_

### ANSWER CHECK
display(ifs.shape)
display(ifs)

(5, 2)

array([[ 0.16405953,  0.24803718],
       [-0.20766573,  1.42108075],
       [ 0.88235547,  0.43607161],
       [ 1.56999825, -0.4292041 ],
       [ 0.57004139,  0.67045138]])
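
With both `uf` and `ifs` available, a predicted rating is the dot product of a user's vector and an artist's vector. The sketch below assumes the two arrays printed above (copied at their displayed precision) and uses the illustrative name `pred`; it shows one way to read off predictions, including those for unrated cells.

```python
import numpy as np

# User factors from Problem 2 (rows: Alfred, Mandy, Lenny, Joan, Tino).
uf = np.array([
    [3.82095605, 3.39576219],
    [3.71034729, 7.00619661],
    [7.11326267, 3.95250165],
    [5.24016749, 10.03575897],
    [5.86328014, 2.19748154],
])

# Artist factors from Problem 3
# (rows: Michael Jackson, Clint Black, Dropdead, Anti-Cimex, Cardi B).
ifs = np.array([
    [0.16405953, 0.24803718],
    [-0.20766573, 1.42108075],
    [0.88235547, 0.43607161],
    [1.56999825, -0.42920410],
    [0.57004139, 0.67045138],
])

# Rows are artists and columns are users, matching the layout of `reviews`.
pred = ifs @ uf.T
print(np.round(pred, 2))

# Example: Alfred (column 0) has not rated Dropdead (row 2).
print(pred[2, 0])
```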

[Back to top](#-Index)

### Problem 4

#### New model for users

**10 Points**

Again, a DataFrame, `if_df`, is created using the coefficients from the linear models for the artists.  Use this data to fit a new no-intercept model for each user, and assign the resulting array of coefficients to `uf2`.

In [27]:
if_df = reviews.copy().iloc[:, :-2]
if_df.loc[:, "F1"] = ifs[:, 0]
if_df.loc[:, "F2"] = ifs[:, 1]
if_df

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.16406,0.248037
Clint Black,4.0,9.0,5.0,,1.0,-0.207666,1.421081
Dropdead,,,8.0,9.0,,0.882355,0.436072
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.569998,-0.429204
Cardi B,4.0,8.0,,9.0,5.0,0.570041,0.670451


In [28]:
### GRADED
uf2 = np.empty((5, 2))  # placeholder array; every row is overwritten in the loop

features = ["F1", "F2"]
users = ["Alfred", "Mandy", "Lenny", "Joan", "Tino"]
for idx, user in enumerate(users):
    user_df = if_df[[user] + features].dropna()
    model = LinearRegression(fit_intercept=False).fit(user_df[features], user_df[user])
    uf2[idx, :] = model.coef_

### ANSWER CHECK
display(uf2.shape)  # should be (5, 2)
display(uf2)

(5, 2)

array([[3.53046728, 3.4336384 ],
       [4.11783667, 7.26746079],
       [6.91421806, 4.47389919],
       [5.17072815, 9.36768386],
       [6.24342403, 1.6826658 ]])
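
Since every round repeats the same loop, the pattern can be factored into a helper. The function `fit_factors` below is a hypothetical name, not part of the activity, and the demo frame holds only two user columns with factor values copied (rounded) from `if_df` above, so its first row should match the first row of `uf2` only approximately.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_factors(df, targets, features=("F1", "F2")):
    """Fit one no-intercept regression per target column, using only the
    rows where that target is rated, and stack the coefficients."""
    features = list(features)
    coefs = np.empty((len(targets), len(features)))
    for idx, target in enumerate(targets):
        rows = df[[target] + features].dropna()   # drop unrated rows
        model = LinearRegression(fit_intercept=False)
        model.fit(rows[features], rows[target])
        coefs[idx, :] = model.coef_
    return coefs


# Two user columns plus the artist factors, copied (rounded) from `if_df`.
demo = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "F1": [0.164060, -0.207666, 0.882355, 1.569998, 0.570041],
        "F2": [0.248037, 1.421081, 0.436072, -0.429204, 0.670451],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

demo_coefs = fit_factors(demo, ["Alfred", "Mandy"])
print(demo_coefs)
```

The same helper would apply to the artist rounds by passing the transposed frame and the artist names as `targets`.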

[Back to top](#-Index)

### Problem 5

#### One more iteration

**5 Points**

Again, a DataFrame `ui_df2` is created using the results of `uf2`.  Use the features `F1` and `F2` to create regression models for each artist and track the coefficients in `ifs2`.  

In [29]:
ui_df2 = reviews.copy().iloc[:, :-2].T
ui_df2["F1"] = uf2[:, 0]
ui_df2["F2"] = uf2[:, 1]
ui_df2

Unnamed: 0,Michael Jackson,Clint Black,Dropdead,Anti-Cimex,Cardi B,F1,F2
Alfred,3.0,4.0,,4.0,4.0,3.530467,3.433638
Mandy,,9.0,,3.0,8.0,4.117837,7.267461
Lenny,2.0,5.0,8.0,9.0,,6.914218,4.473899
Joan,3.0,,9.0,4.0,9.0,5.170728,9.367684
Tino,1.0,1.0,,9.0,5.0,6.243424,1.682666


In [30]:
### GRADED
ifs2 = np.empty((5, 2))  # placeholder array; every row is overwritten in the loop
features = ["F1", "F2"]
targets = ["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"]
for idx, target in enumerate(targets):
    target_df = ui_df2[[target] + features].dropna()
    model = LinearRegression(fit_intercept=False).fit(
        target_df[features], target_df[target]
    )
    ifs2[idx, :] = model.coef_

### ANSWER CHECK
display(ifs2.shape)
display(ifs2)

(5, 2)

array([[ 0.12958264,  0.29264609],
       [-0.17854626,  1.35044285],
       [ 0.8328282 ,  0.50104934],
       [ 1.57959703, -0.45644567],
       [ 0.60189159,  0.66892235]])

[Back to top](#-Index)

### Problem 6

#### Comparing Models

**10 Points**

Using the item factors from the first iteration (`if_df`) and from the last iteration (`if_df2`) as inputs to a `LinearRegression` model, compute the `mean_squared_error` of each model for Alfred.  Which item factors did a better job as inputs to the model: `if_df` or `if_df2`?  Assign your answer as a string to `ans6` below.
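
For reference, `mean_squared_error` is the average of the squared differences between the actual and predicted ratings. The sketch below checks this for Alfred using factor values copied (rounded) from `if_df` in Problem 4; the names `mse_sklearn` and `mse_manual` are illustrative. It uses the default `LinearRegression()`, which fits an intercept, and scores the model on the same four ratings it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Alfred's rated artists with F1/F2 copied (rounded) from `if_df`.
X = np.array([
    [0.164060, 0.248037],    # Michael Jackson
    [-0.207666, 1.421081],   # Clint Black
    [1.569998, -0.429204],   # Anti-Cimex
    [0.570041, 0.670451],    # Cardi B
])
y = np.array([3.0, 4.0, 4.0, 4.0])

# Default LinearRegression fits an intercept in addition to two slopes.
pred = LinearRegression().fit(X, y).predict(X)

mse_sklearn = mean_squared_error(y, pred)
mse_manual = np.mean((y - pred) ** 2)   # same quantity computed by hand
print(mse_sklearn, mse_manual)
```

With four ratings and three fitted parameters (two slopes and an intercept), the in-sample error is bound to be close to zero, so small differences between the two sets of factors should be read with some caution.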

In [31]:
if_df2 = reviews.copy().iloc[:, :-2]
if_df2.loc[:, "F1"] = ifs2[:, 0]
if_df2.loc[:, "F2"] = ifs2[:, 1]
if_df2

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.129583,0.292646
Clint Black,4.0,9.0,5.0,,1.0,-0.178546,1.350443
Dropdead,,,8.0,9.0,,0.832828,0.501049
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.579597,-0.456446
Cardi B,4.0,8.0,,9.0,5.0,0.601892,0.668922


In [50]:
from sklearn.metrics import mean_squared_error

In [64]:
user = "Alfred"
features = ["F1", "F2"]
user_df = if_df[[user] + features].dropna()
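# Note: LinearRegression() fits an intercept by default, unlike the
# no-intercept models used in Problems 2-5. Predictions are in-sample.
y1 = LinearRegression().fit(user_df[features], user_df[user]).predict(user_df[features])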

In [65]:
user_df2 = if_df2[[user] + features].dropna()
y2 = (
    LinearRegression()
    .fit(user_df2[features], user_df2[user])
    .predict(user_df2[features])
)

In [66]:
[
    mean_squared_error(user_df[user], y1),
    mean_squared_error(user_df2[user], y2),
]

[0.0008769726989889001, 0.003832373090788041]

In [67]:
### GRADED
ans6 = "if_df"

### ANSWER CHECK
ans6

'if_df'