###  Codio Activity 19.1: Regression Models for Predictions

**Expected Time = 60 minutes**

**Total Points = 50**

This activity uses regression models to predict scores for unseen content (albums). With these scores, you can recommend unheard albums to users. As in the lecture, you are also given *lofi* and *slick* scores for each artist.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [2]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

#### Our Data

This example uses a synthetic dataset of reviews from five individuals and five albums.  The dataset is loaded and displayed below. Two additional columns `lofi` and `slick` are included to rate the nature of the music. 


In [5]:
reviews = pd.read_csv('../data/sample_reviews.csv', index_col=0)

In [7]:
reviews.head()

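| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi |
|---|---|---|---|---|---|---|---|
| Michael Jackson | 3.0 | NaN | 2.0 | 3.0 | 1.0 | 8 | 2 |
| Clint Black | 4.0 | 9.0 | 5.0 | NaN | 1.0 | 8 | 2 |
| Dropdead | NaN | NaN | 8.0 | 9.0 | NaN | 2 | 9 |
| Anti-Cimex | 4.0 | 3.0 | 9.0 | 4.0 | 9.0 | 2 | 10 |
| Cardi B | 4.0 | 8.0 | NaN | 9.0 | 5.0 | 9 | 3 |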
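
Before fitting anything, it can help to see which ratings are actually missing. The sketch below is only an illustration: it re-types the table shown above into a DataFrame (an assumption, so that it runs without the CSV file), and the name `to_predict` is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: values re-typed from the table above rather than read from the CSV.
reviews = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, np.nan],
        "Joan": [3.0, np.nan, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, np.nan, 9.0, 5.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

# Count the missing ratings in every column (slick and lofi have none).
missing_per_column = reviews.isna().sum()
print(missing_per_column)

# Collect the (artist, user) pairs that still need a predicted score.
users = ["Alfred", "Mandy", "Lenny", "Joan", "Tino"]
to_predict = [
    (artist, user)
    for user in users
    for artist in reviews.index[reviews[user].isna()]
]
print(to_predict)
```

Each pair in `to_predict` corresponds to one empty cell that the regression models in the problems below are meant to fill.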


[Back to top](#-Index)

### Problem 1

#### Considering Alfred

**10 Points**

Define `X` as the `slick` and `lofi` columns of the `reviews` dataframe, keeping only the rows where the `Alfred` column is not missing. Define `y` as a series containing the non-missing values from the `Alfred` column of `reviews`.

Instantiate a new linear regression model and fit it to `X` and `y`. Assign this model to the variable `alfred_lr`.

Next, create a new dataframe `newx` that contains only the rows from the `reviews` dataframe where the `Alfred` column has missing (NaN) values. Additionally, ensure that you are selecting only the `slick` and `lofi` columns from these rows.

Finally, call the `predict` method of `alfred_lr` on `newx` to compute the prediction. Assign the result to `alfred_dd_predict`.


In [9]:
### GRADED
X = reviews[['slick', 'lofi']].loc[reviews['Alfred'].notna()]
y = reviews['Alfred'].dropna()
alfred_lr = LinearRegression().fit(X, y)

newx = reviews[['slick', 'lofi']].loc[reviews['Alfred'].isna()]

alfred_dd_predict = alfred_lr.predict(newx)

### ANSWER CHECK
alfred_dd_predict

array([3.75])
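
The selection above relies on `X` and `y` being filtered by the same condition, so their rows line up. A minimal sketch of that check follows; it assumes a re-typed subset of the table (only the `Alfred`, `slick`, and `lofi` columns) instead of the CSV, and the variable `rated` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumption: a subset of the table above, re-typed by hand.
reviews = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

# One boolean mask drives both selections, so X and y stay aligned.
rated = reviews["Alfred"].notna()
X = reviews.loc[rated, ["slick", "lofi"]]
y = reviews.loc[rated, "Alfred"]
print(X.index.equals(y.index))

# The complement of the mask picks the rows that need a prediction.
newx = reviews.loc[~rated, ["slick", "lofi"]]
print(list(newx.index))

alfred_lr = LinearRegression().fit(X, y)
alfred_dd_predict = alfred_lr.predict(newx)
print(alfred_dd_predict)
```

Because Alfred is missing only one rating, `newx` has a single row and `predict` returns a one-element array.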

[Back to top](#-Index)

### Problem 2

#### User Vector for Alfred

**10 Points**

Assign the coefficients of the linear regression model `alfred_lr` to `alfred_vector` below.


In [11]:
### GRADED
alfred_vector = alfred_lr.coef_

### ANSWER CHECK
pd.DataFrame(alfred_vector.reshape(1, 2), columns = ['slick', 'lofi'], index = ['Alfred'])

| | slick | lofi |
|---|---|---|
| Alfred | 0.25 | 0.25 |
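
One way to read the user vector is that a prediction is the model's intercept plus the dot product of the vector with the album's features. The sketch below checks this for Alfred and Dropdead. It assumes a hand-typed subset of the table rather than the CSV, and the names `manual` and `from_sklearn` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumption: a subset of the reviews table, re-typed by hand.
reviews = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

rated = reviews["Alfred"].notna()
alfred_lr = LinearRegression().fit(
    reviews.loc[rated, ["slick", "lofi"]], reviews.loc[rated, "Alfred"]
)
alfred_vector = alfred_lr.coef_

# Features of the album Alfred has not rated.
dropdead = reviews.loc["Dropdead", ["slick", "lofi"]].to_numpy(dtype=float)

# Prediction by hand: intercept + (user vector . album features).
manual = alfred_lr.intercept_ + alfred_vector @ dropdead

# The same prediction through scikit-learn.
from_sklearn = alfred_lr.predict(reviews.loc[["Dropdead"], ["slick", "lofi"]])[0]

print(manual, from_sklearn)
```

Both values should agree with the `array([3.75])` result from Problem 1.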


[Back to top](#-Index)

### Problem 3

#### Considering Tino

**10 Points**

Build a regression model `tino_lr` the same way as in Problem 1, but now for the user `Tino`. Assign the prediction to `tino_dd_predict` as a numpy array below.

In [13]:
### GRADED
X = reviews[['slick', 'lofi']].loc[reviews['Tino'].notna()]
y = reviews['Tino'].dropna()
tino_lr = LinearRegression().fit(X, y)

newx = reviews[['slick', 'lofi']].loc[reviews['Tino'].isna()]

tino_dd_predict = tino_lr.predict(newx)

### ANSWER CHECK
tino_dd_predict

array([6.71428571])

[Back to top](#-Index)

### Problem 4

#### Tino's user vector

**10 Points**

Assign the coefficients of the linear regression model `tino_lr` to `tino_vector` below.

In [15]:
### GRADED
tino_vector = tino_lr.coef_

### ANSWER CHECK
pd.DataFrame(tino_vector.reshape(1, 2), columns = ['slick', 'lofi'], index = ['Tino'])

| | slick | lofi |
|---|---|---|
| Tino | 1.714286 | 2.285714 |


[Back to top](#-Index)

### Problem 5

#### Completing the Table

**10 Points**

Write a `for` loop that iterates over each user column of `reviews` and repeats the prediction process, using the same `slick` and `lofi` columns as inputs.

Create a DataFrame called `reviews_df_full` in which each individual's missing scores are filled in with the predictions.

In [21]:
### GRADED
reviews_df_full = reviews.copy()

# Iterate over each user column (excluding slick and lofi) to fill missing values
for user in reviews.columns[:-2]:  # Exclude slick and lofi
    # Define X as the slick and lofi columns, keeping only rows where the current user's rating is not NaN
    X_user = reviews[['slick', 'lofi']].loc[reviews[user].notna()]
    # Define y as the current user's column, excluding NaN values
    y_user = reviews[user].dropna()
    
    # Instantiate and fit the Linear Regression model for the current user
    user_lr = LinearRegression().fit(X_user, y_user)
    
    # Define newx_user for rows where the user's column has NaN values, selecting only slick and lofi columns
    newx_user = reviews[['slick', 'lofi']].loc[reviews[user].isna()]
    
    # Predict and update the missing values in the user's column in reviews_df_full
    if not newx_user.empty:  # Only predict if there are missing values
        reviews_df_full.loc[reviews[user].isna(), user] = user_lr.predict(newx_user)

# Round to two decimals, matching the precision of the table shown below
reviews_df_full = reviews_df_full.round(2)

### ANSWER CHECK
reviews_df_full

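| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi |
|---|---|---|---|---|---|---|---|
| Michael Jackson | 3.0 | 9.0 | 2.0 | 3.0 | 1.0 | 8 | 2 |
| Clint Black | 4.0 | 9.0 | 5.0 | 4.66 | 1.0 | 8 | 2 |
| Dropdead | 3.75 | 3.86 | 8.0 | 9.0 | 6.71 | 2 | 9 |
| Anti-Cimex | 4.0 | 3.0 | 9.0 | 4.0 | 9.0 | 2 | 10 |
| Cardi B | 4.0 | 8.0 | 4.92 | 9.0 | 5.0 | 9 | 3 |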
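
The loop above could also be wrapped in a function that returns the user vectors alongside the completed table. This is a sketch under stated assumptions, not the graded solution: `fill_user_scores` and its `feature_cols` parameter are hypothetical names, and the data is re-typed from the original table instead of read from the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumption: values re-typed from the original reviews table.
reviews = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, np.nan],
        "Joan": [3.0, np.nan, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, np.nan, 9.0, 5.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)


def fill_user_scores(reviews, feature_cols=("slick", "lofi")):
    """Hypothetical helper: fit one regression per user and fill missing ratings."""
    feature_cols = list(feature_cols)
    filled = reviews.copy()
    vectors = {}
    # Every column that is not a feature is treated as a user.
    for user in reviews.columns.drop(feature_cols):
        rated = reviews[user].notna()
        model = LinearRegression().fit(
            reviews.loc[rated, feature_cols], reviews.loc[rated, user]
        )
        vectors[user] = model.coef_
        # Only predict when this user has unrated rows.
        if (~rated).any():
            filled.loc[~rated, user] = model.predict(reviews.loc[~rated, feature_cols])
    user_vectors = pd.DataFrame.from_dict(vectors, orient="index", columns=feature_cols)
    return filled, user_vectors


filled, user_vectors = fill_user_scores(reviews)
print(filled.round(2))
print(user_vectors)
```

Selecting users with `reviews.columns.drop(feature_cols)` avoids depending on `slick` and `lofi` being the last two columns, which the `[:-2]` slice in the loop above assumes.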
