###  Codio Activity 19.1: Regression Models for Predictions

This activity uses regression models to score unseen content (albums).  With these scores, you can recommend unheard albums to users. As in the lecture, you are also given *lofi* and *slick* scores for each artist.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [3]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot

from sklearn.linear_model import LinearRegression

#### Our Data

This example uses a synthetic dataset of reviews from five individuals for five albums.  The album covers and artists are shown below, followed by the loaded dataset.  Two additional columns, `slick` and `lofi`, rate the nature of the music.

![](images/covers.png)

In [8]:
reviews = pd.read_csv('codio_19_1_solution/data/sample_reviews.csv',index_col = 0)
reviews

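| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi |
|---|---|---|---|---|---|---|---|
| Michael Jackson | 3.0 | NaN | 2.0 | 3.0 | 1.0 | 8 | 2 |
| Clint Black | 4.0 | 9.0 | 5.0 | NaN | 1.0 | 8 | 2 |
| Dropdead | NaN | NaN | 8.0 | 9.0 | NaN | 2 | 9 |
| Anti-Cimex | 4.0 | 3.0 | 9.0 | 4.0 | 9.0 | 2 | 10 |
| Cardi B | 4.0 | 8.0 | NaN | 9.0 | 5.0 | 9 | 3 |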
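
If the CSV file is not at hand, the same data can be rebuilt directly from the table displayed above. The following is a minimal sketch; the name `reviews_inline` is only illustrative and is not part of the activity.

```python
import numpy as np
import pandas as pd

# Values copied from the table displayed above.
# NaN marks an album the reviewer has not rated yet.
reviews_inline = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, np.nan],
        "Joan": [3.0, np.nan, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, np.nan, 9.0, 5.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

# Count the missing ratings per column; these are the values to predict
print(reviews_inline.isna().sum())
```

The missing-value counts should agree with the non-null counts that `reviews.info()` reports for the loaded file.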


In [7]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Michael Jackson to Cardi B
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Alfred  4 non-null      float64
 1   Mandy   3 non-null      float64
 2   Lenny   4 non-null      float64
 3   Joan    4 non-null      float64
 4   Tino    4 non-null      float64
 5   slick   5 non-null      int64  
 6   lofi    5 non-null      int64  
dtypes: float64(5), int64(2)
memory usage: 320.0+ bytes


### Problem 1

#### Considering Alfred

To begin, create `X` and `y` based on `Alfred`.  This means dropping the row for **Dropdead**, which Alfred has not rated, and building a model from the `slick` and `lofi` scores of all other rows.  Assign the inputs to `X` and the target to `y`, name your model `alfred_lr`, and assign the predicted rating of **Dropdead** for Alfred to `alfred_dd_predict` below.

In [14]:
X = reviews.dropna(subset = ['Alfred'])[['slick','lofi']]
X

| | slick | lofi |
|---|---|---|
| Michael Jackson | 8 | 2 |
| Clint Black | 8 | 2 |
| Anti-Cimex | 2 | 10 |
| Cardi B | 9 | 3 |


In [15]:
y = reviews['Alfred'].dropna()
y

Michael Jackson    3.0
Clint Black        4.0
Anti-Cimex         4.0
Cardi B            4.0
Name: Alfred, dtype: float64

In [16]:
alfred_lr = LinearRegression().fit(X,y)

In [18]:
newx = reviews[reviews['Alfred'].isnull()][['slick','lofi']]
newx

| | slick | lofi |
|---|---|---|
| Dropdead | 2 | 9 |


In [19]:
alfred_dd_predict = alfred_lr.predict(newx)
alfred_dd_predict

array([3.75])

### Problem 2

#### User Vector for Alfred

Use your model for Alfred to construct his user vector based on the coefficients of the model. What does this tell you about Alfred's preference for slick and lofi?  Assign his user vector as a numpy array to `alfred_vector` below.

In [20]:
alfred_vector = alfred_lr.coef_
alfred_vector

array([0.25, 0.25])

In [21]:
pd.DataFrame(alfred_vector.reshape(1,2), columns = ['slick','lofi'], index = ['Alfred'])

| | slick | lofi |
|---|---|---|
| Alfred | 0.25 | 0.25 |
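
To see how the user vector relates to the prediction from Problem 1, the prediction can be recomputed by hand as the model intercept plus the dot product of the user vector with the album's `slick` and `lofi` scores. The sketch below refits the model on Alfred's four ratings copied from the tables above; the names `rated`, `model`, `user_vector`, and `dropdead` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Albums Alfred has rated, with values copied from the tables above
rated = pd.DataFrame(
    {"slick": [8, 8, 2, 9], "lofi": [2, 2, 10, 3], "Alfred": [3.0, 4.0, 4.0, 4.0]},
    index=["Michael Jackson", "Clint Black", "Anti-Cimex", "Cardi B"],
)
model = LinearRegression().fit(rated[["slick", "lofi"]], rated["Alfred"])

user_vector = model.coef_      # one weight per feature, in the order [slick, lofi]
dropdead = np.array([2, 9])    # Dropdead's slick and lofi scores

# Prediction = intercept + (user vector . album feature vector)
manual_prediction = model.intercept_ + user_vector @ dropdead
print(user_vector, model.intercept_, manual_prediction)
```

Both weights come out equal and positive, which suggests Alfred does not clearly favor one quality over the other. With only four ratings, that reading is tentative at best.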


### Problem 3

#### Considering Tino

Repeat the process above for Tino.  Use Tino's user vector to predict their rating of **Dropdead**.  Assign the prediction to `tino_dd_predict` as a numpy array below.

In [24]:
Xt = reviews.dropna(subset = ['Tino'])[['slick','lofi']]
Xt

| | slick | lofi |
|---|---|---|
| Michael Jackson | 8 | 2 |
| Clint Black | 8 | 2 |
| Anti-Cimex | 2 | 10 |
| Cardi B | 9 | 3 |


In [25]:
yt = reviews['Tino'].dropna()
yt

Michael Jackson    1.0
Clint Black        1.0
Anti-Cimex         9.0
Cardi B            5.0
Name: Tino, dtype: float64

In [26]:
tino_lr = LinearRegression().fit(Xt,yt)
tino_lr

In [28]:
newxt = reviews[reviews['Tino'].isnull()][['slick','lofi']]
newxt

| | slick | lofi |
|---|---|---|
| Dropdead | 2 | 9 |


In [29]:
tino_dd_predict = tino_lr.predict(newxt)
tino_dd_predict

array([6.71428571])

### Problem 4

#### Tino's user vector

Now, create a user vector for Tino and assign it as a numpy array to `tino_vector` below.  What does this say about their preference for *slick* versus *lofi*?

In [30]:
tino_vector = tino_lr.coef_
tino_vector

array([1.71428571, 2.28571429])

In [31]:
# Label the row with the reviewer's name; the vector describes Tino, not an album
pd.DataFrame(tino_vector.reshape(1,2), columns = ['slick','lofi'], index = ['Tino'])

| | slick | lofi |
|---|---|---|
| Tino | 1.714286 | 2.285714 |


### Problem 5

#### Completing the Table

Consider writing a function to loop over each column and perform the prediction process using the same columns of `slick` and `lofi` as inputs.  Create a DataFrame called `reviews_df_full` and complete the scores for each individual. 

In [32]:
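# Fit one model per reviewer; slick and lofi are the features, not reviewers,
# so they are dropped from the loop instead of being skipped by a bare except
for name in reviews.columns.drop(['slick','lofi']):
    # Training data: albums this reviewer has rated
    X = reviews.dropna(subset = [name])[['slick','lofi']]
    y = reviews[name].dropna()
    name_lr = LinearRegression().fit(X,y)
    # Albums this reviewer has not rated yet
    newx = reviews[reviews[name].isnull()][['slick','lofi']]
    name_predict = name_lr.predict(newx)
    print(newx.index, name, name_predict)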

Index(['Dropdead'], dtype='object') Alfred [3.75]
Index(['Michael Jackson', 'Dropdead'], dtype='object') Mandy [9.         3.85714286]
Index(['Cardi B'], dtype='object') Lenny [4.91666667]
Index(['Clint Black'], dtype='object') Joan [4.66444444]
Index(['Dropdead'], dtype='object') Tino [6.71428571]


In [33]:
reviews_df_full = reviews.copy()
reviews_df_full.loc['Dropdead', 'Alfred'] = 3.75
reviews_df_full.loc[['Michael Jackson', 'Dropdead'], 'Mandy'] = [9, 3.85]
reviews_df_full.loc[['Cardi B'], 'Lenny'] = [4.91666667]
reviews_df_full.loc[['Clint Black'], 'Joan'] = [4.66444444]
reviews_df_full.loc[['Dropdead'], 'Tino'] = [6.71428571]

In [34]:
reviews_df_full

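| | Alfred | Mandy | Lenny | Joan | Tino | slick | lofi |
|---|---|---|---|---|---|---|---|
| Michael Jackson | 3.0 | 9.0 | 2.0 | 3.0 | 1.0 | 8 | 2 |
| Clint Black | 4.0 | 9.0 | 5.0 | 4.664444 | 1.0 | 8 | 2 |
| Dropdead | 3.75 | 3.85 | 8.0 | 9.0 | 6.714286 | 2 | 9 |
| Anti-Cimex | 4.0 | 3.0 | 9.0 | 4.0 | 9.0 | 2 | 10 |
| Cardi B | 4.0 | 8.0 | 4.916667 | 9.0 | 5.0 | 9 | 3 |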
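
The problem statement suggests writing a function for this process. One possible sketch follows, assuming the same `slick` and `lofi` feature columns; the function name `complete_reviews`, its `features` parameter, and the inline `sample` data (copied from the original table) are illustrative and not part of the activity. Because it writes the predictions straight into the copy, nothing is rounded by hand, so Mandy's rating for Dropdead comes out as roughly 3.857 rather than the 3.85 entered manually above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def complete_reviews(reviews, features=("slick", "lofi")):
    """Return a copy of `reviews` with each reviewer's missing ratings predicted."""
    features = list(features)
    full = reviews.copy()
    # Every column that is not a feature is treated as a reviewer
    for name in reviews.columns.drop(features):
        missing = reviews[name].isna()
        if not missing.any():
            continue  # this reviewer has rated every album
        model = LinearRegression().fit(
            reviews.loc[~missing, features], reviews.loc[~missing, name]
        )
        # Fill only the missing cells with the model's predictions
        full.loc[missing, name] = model.predict(reviews.loc[missing, features])
    return full


# Sample data copied from the original reviews table
sample = pd.DataFrame(
    {
        "Alfred": [3.0, 4.0, np.nan, 4.0, 4.0],
        "Mandy": [np.nan, 9.0, np.nan, 3.0, 8.0],
        "Lenny": [2.0, 5.0, 8.0, 9.0, np.nan],
        "Joan": [3.0, np.nan, 9.0, 4.0, 9.0],
        "Tino": [1.0, 1.0, np.nan, 9.0, 5.0],
        "slick": [8, 8, 2, 2, 9],
        "lofi": [2, 2, 9, 10, 3],
    },
    index=["Michael Jackson", "Clint Black", "Dropdead", "Anti-Cimex", "Cardi B"],
)

completed = complete_reviews(sample)
print(completed.round(3))
```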
