# 15.774/15.780 Fall 2023
# Recitation 4 - Choice Modeling & Collaborative Filtering
--------------------------------------------------------------------------------------------------------------------------------

First, we import the packages we will be using

In [1]:
# Install the packages you do not have
# !pip install xlogit

In [2]:
import pandas as pd
import numpy as np
from xlogit.utils import wide_to_long
from xlogit import MultinomialLogit

---
## Choice Models

By the end of this recitation, you will learn:

* How to represent discrete choices in two forms: wide and long
* How to transform data from wide to long form (packages that train choice models use this format)
* How to train a Multinomial Logit (MNL) choice model in Python
* How to compute predictions and evaluate hypothetical scenarios with the trained model




### Trains

The train dataset contains individual choices between two train tickets that vary in 4 attributes:
`price` (in Netherlands Antillean Guilder), `time` (duration in minutes), `change` (number of changes), and `comfort` (comfort level; a lower value means more comfortable).
The same individual made multiple choices:
* an individual is identified by `id`,
* the different choice scenarios experienced by an individual are labeled by `choiceid`,
* the actual choice made in each scenario is stored in the column `choice`,
* the two ticket options are labeled `choice1` and `choice2` in the `choice` column.


In [3]:
# Assuming 'trains.csv' is in the same directory as your Python script or Jupyter Notebook
file_path = "trains.csv"

# Use the read_csv function to load the CSV file into a DataFrame
trains = pd.read_csv(file_path)
trains.head()

Unnamed: 0,id,choiceid,choice,price1,time1,change1,comfort1,price2,time2,change2,comfort2,income
0,1,1,choice1,2400,150,0,1,4000,150,0,1,55200
1,1,2,choice1,2400,150,0,1,3200,130,0,1,46000
2,1,3,choice1,2400,115,0,1,4000,115,0,0,21100
3,1,4,choice2,4000,130,0,1,3200,150,0,0,75800
4,1,5,choice2,2400,150,0,1,3200,150,0,0,103300


#### Wide vs long format 

In the **wide format**, each row represents one choice scenario, and all the alternatives are included in that same row. For that reason, the explanatory features (e.g., `price`, `time`) are repeated once per alternative (`price1`, `price2`, and so on). Here we have only two alternatives, but for a model with hundreds or thousands of alternatives, this representation is not recommended.

The **long format** solves this problem by representing the explanatory features of all alternatives in the same column, one row per alternative. A choice scenario is represented by a group of rows, so we need an additional column that identifies the same choice scenario. The choice value is repeated for all the new rows.

Packages that train MNL models use the long format, so we need to transform our data to the long format. We will use the function `wide_to_long` from the `xlogit` package to transform the data.
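
To make the reshaping concrete, here is a minimal sketch on a toy two-row table. It uses pandas' own `pd.wide_to_long` rather than the `xlogit` helper used in this recitation, and the toy values and the column name `scenario` are assumptions made up for illustration.

```python
import pandas as pd

# Toy wide table: one row per choice scenario, one column per (feature, alternative)
wide = pd.DataFrame({
    "scenario": [0, 1],
    "choice": ["1", "2"],
    "price1": [2400, 4000],
    "price2": [4000, 3200],
    "time1": [150, 130],
    "time2": [150, 150],
})

# Long table: one row per (scenario, alternative); the suffix becomes column `alt`
long_df = (
    pd.wide_to_long(wide, stubnames=["price", "time"], i="scenario", j="alt")
    .reset_index()
    .sort_values(["scenario", "alt"])
    .reset_index(drop=True)
)
print(long_df)
```

Each scenario now spans two rows, and the value of `choice` is repeated on both of them.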

In [4]:
# First we need an identifier to represent each choice scenario. 
# In this case, we simply use a range function to give a different number to each row
trains["custom_id"] = np.arange(len(trains))

# The function `wide_to_long` also requires the alternative names to be either suffixes or prefixes of the column names
# In this case, we have `price1` and `price2`, so the alternative suffixes are `1` and `2`. 
# However, the alternatives are `choice1` and `choice2`, so we transform these names to `1` and `2` by keeping the last character
trains["choice"] = trains["choice"].str[-1]

# Transform data from wide to long format
trains_long = wide_to_long(
    trains, 
    id_col="custom_id", # Id representing each scenario
    alt_name="alt", # The name you want to assign to the column that will identify the alternative represented on each row 
    sep="", # Separator used between the feature name and the alternative (suffix in this case)
    alt_list=["1", "2"], # Alternative names
    varying=["price", "time", "change", "comfort"], # Actual features
    alt_is_prefix=False, # The alternatives are suffixes
)
trains_long.head()

Unnamed: 0,custom_id,alt,price,time,change,comfort,id,choiceid,choice,income
0,0,1,2400,150,0,1,1,1,1,55200
1,0,2,4000,150,0,1,1,1,1,55200
2,1,1,2400,150,0,1,1,2,1,46000
3,1,2,3200,130,0,1,1,2,1,46000
4,2,1,2400,115,0,1,1,3,1,21100


We can observe that the first row of `trains` is represented by the first two rows of `trains_long` (one row per alternative). The column `custom_id` groups the rows of a choice scenario, the column `alt` identifies the alternative, and `choice` holds the alternative that was chosen (here, alternative 1), repeated on both rows.

In [5]:
# We transform the data to interpret price in USD and time in hours
trains_long["price"] = trains_long["price"]/100 * 2.20371
trains_long["time"] = trains_long["time"]/60 


In [6]:
varnames = ["price", "time", "change", "comfort"] 
y = trains_long.choice
X = trains_long.loc[:, varnames] 

model = MultinomialLogit()
model.fit(
    X, # X contains all the features in varnames
    y, # y contains the choice feature
    varnames=varnames, # Features to be used in the model
    ids=trains_long.custom_id, # Choice scenario identifier
    alts=trains_long.alt, # Alternative names
    fit_intercept=False # We won't fit intercept in general
)
model.summary()

Optimization terminated successfully.
    Message: The gradients are close to zero
    Iterations: 6
    Function evaluations: 7
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
price                  -0.0673582     0.0033933   -19.8506051      2.24e-82 ***
time                   -1.7205457     0.1603517   -10.7298276      2.26e-26 ***
change                 -0.3263459     0.0594892    -5.4858023      4.47e-08 ***
comfort                -0.9457283     0.0649455   -14.5618765      1.98e-46 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1724.150
AIC= 3456.300
BIC= 3480.230


We can now interpret the coefficients we just obtained. As with logistic regression, the magnitude of a coefficient is not directly interpretable as a change in probability, but its sign is. We can read the signs as follows:
* The higher the price of an alternative, the less likely a user is to choose it
* The higher the duration of an alternative, the less likely a user is to choose it
* The higher the number of changes of an alternative, the less likely a user is to choose it
* The higher the `comfort` value of an alternative, the less likely a user is to choose it (remember that `comfort` is flipped: 1 is the low comfort level and 0 is the high one, so a higher value means a less comfortable ticket)

Signs make sense!
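
As a rough illustration of why the signs matter, the sketch below computes MNL choice probabilities, $P_i = e^{V_i} / \sum_j e^{V_j}$ with $V_i = \sum_k \beta_k x_{ik}$, for two hypothetical tickets. The coefficients are copied (rounded) from the summary above; the ticket attributes and the helper names `utility` and `choice_probabilities` are assumptions made up for this example.

```python
import numpy as np

# Coefficients copied (rounded) from the model summary
beta = {"price": -0.0673582, "time": -1.7205457, "change": -0.3263459, "comfort": -0.9457283}

def utility(alt):
    """Linear utility V = sum_k beta_k * x_k for one alternative."""
    return sum(beta[k] * alt[k] for k in beta)

def choice_probabilities(alts):
    """MNL probabilities: exp(V_i) / sum_j exp(V_j)."""
    v = np.array([utility(a) for a in alts])
    expv = np.exp(v - v.max())  # subtracting the max avoids overflow, result is unchanged
    return expv / expv.sum()

# Hypothetical tickets: identical except that the second one costs 10 more
base = {"price": 50.0, "time": 2.0, "change": 0, "comfort": 1}
pricier = dict(base, price=60.0)

p_equal = choice_probabilities([base, base])       # identical tickets split evenly
p_pricier = choice_probabilities([base, pricier])  # the pricier ticket loses share
print(p_equal, p_pricier)
```

Because the price coefficient is negative, raising the price of one ticket lowers its utility and therefore its choice probability, while the probabilities still sum to 1.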

Let us give another interpretation of the coefficients through the ratio of each feature's coefficient to the price coefficient. These ratios represent how much a customer is willing to pay to improve each feature by one unit.

In [7]:
# Ratio of each non-price coefficient to the price coefficient (willingness to pay)
pd.DataFrame([model.coeff_[1:] / model.coeff_[0]], columns=model.coeff_names[1:])

Unnamed: 0,time,change,comfort
0,25.543244,4.844936,14.040294


The results can be interpreted as follows:
* A customer is willing to pay about $\$25$ more to shorten the trip by 1 hour
* A customer is willing to pay about $\$5$ more to avoid 1 change
* A customer is willing to pay about $\$14$ more to lower `comfort` by 1 level (in this case, to move to the high comfort level)
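
The reasoning behind these ratios: utility stays constant when $\beta_{price}\,\Delta price + \beta_{time}\,\Delta time = 0$, so a one-hour reduction ($\Delta time = -1$) is exactly offset by a price increase of $\beta_{time} / \beta_{price}$. A small sketch that redoes this arithmetic with the rounded coefficients from the first model's summary (the dictionary names are illustrative):

```python
# Coefficients copied (rounded) from the first model's summary
beta = {"price": -0.0673582, "time": -1.7205457, "change": -0.3263459, "comfort": -0.9457283}

# Willingness to pay for a one-unit improvement in each non-price feature
wtp = {name: beta[name] / beta["price"] for name in ("time", "change", "comfort")}

for name, value in wtp.items():
    print(f"{name}: {value:.2f}")
```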

What if we want to include customer features such as `income`?

In [8]:
varnames = ["price", "time", "change", "comfort", "income"]  # we add income here
y = trains_long.choice
X = trains_long.loc[:, varnames] 

model_with_income = MultinomialLogit()
model_with_income.fit(
    X,
    y,
    varnames=varnames, 
    isvars=["income"], # We also add income here
    ids=trains_long.custom_id, 
    alts=trains_long.alt, 
    fit_intercept=False 
)
model_with_income.summary()

Optimization terminated successfully.
    Message: The gradients are close to zero
    Iterations: 11
    Function evaluations: 12
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
income.2               -0.0000058     0.0000003   -20.4281658         8e-87 ***
price                  -0.0914596     0.0041975   -21.7888257      1.13e-97 ***
time                   -1.5890385     0.1920458    -8.2742667      1.94e-16 ***
change                 -0.2823665     0.0721362    -3.9143551      9.27e-05 ***
comfort                -0.8801245     0.0769455   -11.4382879      1.14e-29 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -1278.244
AIC= 2566.488
BIC= 2596.400


The negative sign of `income.2` means that the higher the income of the customer, the less likely the customer is to select the second option.
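
Why does `income` get a coefficient tied to alternative 2 instead of a single shared one? Only differences in utility matter in an MNL model, so a variable that takes the same value for every alternative cancels out unless its coefficient differs by alternative. The sketch below illustrates this for two alternatives; the utilities `v1` and `v2` and the income value are hypothetical, and treating alternative 1 as the base (coefficient 0) is an assumption read off the `income.2` label.

```python
import numpy as np

def share_of_alt2(v1, v2):
    """Two-alternative logit: probability of choosing alternative 2."""
    return 1.0 / (1.0 + np.exp(v1 - v2))

v1, v2 = -9.6, -11.4   # hypothetical utilities from the ticket attributes
income = 50_000        # hypothetical customer income
gamma = -0.0000058     # `income.2`, copied from the summary

baseline = share_of_alt2(v1, v2)

# Same income term added to both alternatives: it cancels in v1 - v2
shared = share_of_alt2(v1 + gamma * income, v2 + gamma * income)

# Income term only on alternative 2 (alternative 1 is the base)
alt_specific = share_of_alt2(v1, v2 + gamma * income)

print(baseline, shared, alt_specific)
```

The shared version leaves the share unchanged, while the alternative-specific version lowers the share of alternative 2 as income grows.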

#### Predictions and sensitivity analysis

Suppose we offer two ticket options with the following attributes:
* Ticket 1: price = 60 euros, time = 2.5 hours, no. of changes = 1, comfort level = 1
* Ticket 2: price = 100 euros, time = 2.5 hours, no. of changes = 1, comfort level = 0

What is the interpretation of the two alternatives?

What is the predicted market share of each ticket? 
***Note:** We will use the first model without `income` as we do not have information about income in the options above*

In [9]:
# Let us compute the probability (market share) of each alternative using the logit function
VT1 = np.exp(model.coeff_[0]*60 + model.coeff_[1]*2.5 + model.coeff_[2]*1 + model.coeff_[3]*1)
VT2 = np.exp(model.coeff_[0]*100 + model.coeff_[1]*2.5 + model.coeff_[2]*1 + model.coeff_[3]*0)
Share2 = VT2/(VT1 + VT2)

print(f"Ticket 2 has a market share of {np.round(Share2*100, 2)}%")

Ticket 2 has a market share of 14.82%


Suppose now we reduce Ticket 2's price to 80 euros. How much of an increase in market share would we expect?

In [10]:
# We recompute VT2 and the share
VT22 = np.exp(model.coeff_[0]*80 + model.coeff_[1]*2.5 + model.coeff_[2]*1 + model.coeff_[3]*0)
Share22 = VT22/(VT1 + VT22)

# We output the new share and the percentage improvement in market share of the second option
print(f"Ticket 2 with a price of $80 has a market share of {np.round(Share22*100, 2)}% instead")

share_improvement = (Share22/Share2 - 1)
print(f"The improvement in market share is of {np.round(share_improvement*100, 2)}% ")

Ticket 2 with a price of $80 has a market share of 40.1% instead
The improvement in market share is of 170.51% 


Which option has a better expected revenue per customer?

In [11]:
# Expected revenue per customer = sum over tickets of (market share * price), for each pricing scenario
print(f"The expected revenue of the first option is ${np.round(Share2 * 100 + (1-Share2) * 60, 2)} per customer")
print(f"The expected revenue of the second option is ${np.round(Share22 * 80 + (1-Share22) * 60, 2)} per customer")

The expected revenue of the first option is $65.93 per customer
The expected revenue of the second option is $68.02 per customer


The second pricing scenario (Ticket 2 at 80 instead of 100) leads to higher expected revenue per customer.
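
The same comparison can be extended to a range of prices for Ticket 2. The sketch below is one way to do it, assuming (as above) that every customer buys one of the two tickets and that Ticket 1 stays at 60. It uses the rounded coefficients from the first model's summary, and the function names and the price grid are illustrative choices.

```python
import numpy as np

# Coefficients copied (rounded) from the first model's summary
B_PRICE, B_TIME, B_CHANGE, B_COMFORT = -0.0673582, -1.7205457, -0.3263459, -0.9457283

def utility(price, time, change, comfort):
    return B_PRICE * price + B_TIME * time + B_CHANGE * change + B_COMFORT * comfort

def share_ticket2(price2):
    """Market share of Ticket 2 when Ticket 1 is fixed at price 60."""
    v1 = utility(60, 2.5, 1, 1)
    v2 = utility(price2, 2.5, 1, 0)
    return 1.0 / (1.0 + np.exp(v1 - v2))

def expected_revenue(price2):
    """Expected revenue per customer: share-weighted average of the two prices."""
    s2 = share_ticket2(price2)
    return s2 * price2 + (1 - s2) * 60

prices = np.arange(60, 105, 5)  # candidate prices for Ticket 2
revenues = np.array([expected_revenue(p) for p in prices])

for p, r in zip(prices, revenues):
    print(f"price {p}: share {share_ticket2(p):.3f}, revenue {r:.2f}")
print("Best price on this grid:", prices[revenues.argmax()])
```

A finer grid, or a one-dimensional optimizer, would locate the revenue-maximizing price more precisely.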

---
## Collaborative filtering 

**Technical Note**: You can assume that the data is already “bias-adjusted” (i.e. you do not need to deaverage users' ratings).


In [12]:
# Assuming 'lastfm-matrix-germany.csv' is in the same directory as your Python script or Jupyter Notebook
file_path = "lastfm-matrix-germany.csv"

# Use the read_csv function to load the CSV file into a DataFrame
music = pd.read_csv(file_path)
music.head()

Unnamed: 0,user,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,33,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,42,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,51,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,62,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# Let us analyze only some artists
artists = [
    "beyonce", "black eyed peas", "britney spears",  "justin timberlake",
    "bloc party", "franz ferdinand", "babyshambles", "the libertines", "the hives"
]
music = music[["user"] + artists]

# We keep only users who like at least two of these artists (row sum > 1)
music = music[music[artists].sum(axis=1) > 1] 

music.head()

Unnamed: 0,user,beyonce,black eyed peas,britney spears,justin timberlake,bloc party,franz ferdinand,babyshambles,the libertines,the hives
12,256,0,1,0,1,0,0,0,0,0
20,422,0,0,0,0,1,1,0,1,0
50,925,0,0,0,0,0,1,1,0,1
59,1022,0,0,0,0,0,1,1,1,0
72,1253,0,0,0,0,1,1,0,0,1


We will now focus on users `14618` and `1361` and find the users most similar to each of them in terms of cosine similarity.

In [14]:
# We keep these users in their own variables
user_14618 = music[music.user == 14618].iloc[0,:]
user_14618

user                 14618
beyonce                  0
black eyed peas          0
britney spears           0
justin timberlake        1
bloc party               0
franz ferdinand          1
babyshambles             0
the libertines           0
the hives                1
Name: 921, dtype: int64

In [15]:
user_1361 = music[music.user == 1361].iloc[0,:]
user_1361

user                 1361
beyonce                 1
black eyed peas         1
britney spears          1
justin timberlake       0
bloc party              0
franz ferdinand         0
babyshambles            0
the libertines          0
the hives               0
Name: 81, dtype: int64

In [16]:
# We define a lambda function to compute the cosine similarity of two vectors 
cosine_similarity = lambda x, y: (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

In [17]:
# Let us use it to compute the similarity between our selected users
# Note: We have to remove the user identifier before computing the similarity
cosine_similarity(user_1361[artists], user_14618[artists])

0.0

The result makes sense, as we selected two users with no liked artists in common. Let us now find the users most similar to each of these two.

### User similarity

In [18]:
# There are many ways of computing it but we will simply loop over the users and then sort the similarities
sim_with_14618 = []
sim_with_1361 = []

for user_id in music.user:
    # Get user ratings
    user = music[music.user == user_id].iloc[0,:] 

    # Compute and store similarity with 14618
    sim_with_14618.append(cosine_similarity(user[artists], user_14618[artists]))

    # Compute and store similarity with 1361
    sim_with_1361.append(cosine_similarity(user[artists], user_1361[artists]))

# Build a DataFrame with this data
sims = pd.DataFrame({"user": music.user, "sim_with_14618": sim_with_14618, "sim_with_1361": sim_with_1361})
sims

Unnamed: 0,user,sim_with_14618,sim_with_1361
12,256,0.408248,0.408248
20,422,0.333333,0.000000
50,925,0.666667,0.000000
59,1022,0.333333,0.000000
72,1253,0.666667,0.000000
...,...,...,...
1187,18669,0.408248,0.000000
1201,18816,0.333333,0.000000
1203,18870,0.408248,0.000000
1246,19565,0.000000,0.000000
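
The loop above is easy to read, but the same similarities can be computed in one shot with matrix operations. The sketch below checks this on the first three users shown in the table (256, 422, 925) against user 14618, with the rating rows copied from the outputs above; the variable names are illustrative.

```python
import numpy as np

# Rows: users 256, 422, 925; columns follow the order of `artists`
ratings = np.array([
    [0, 1, 0, 1, 0, 0, 0, 0, 0],  # user 256
    [0, 0, 0, 0, 1, 1, 0, 1, 0],  # user 422
    [0, 0, 0, 0, 0, 1, 1, 0, 1],  # user 925
])
target = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1])  # user 14618

# Cosine similarity of every row with the target: dot product / product of norms
dots = ratings @ target
norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(target)
similarities = dots / norms
print(similarities)
```

For user 256, the dot product is 1 (only `justin timberlake` is shared) and the norms are $\sqrt{2}$ and $\sqrt{3}$, which gives $1/\sqrt{6} \approx 0.408$, matching the table.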


In [19]:
# Top 5 users most similar to 14618 (we take 6 rows because user 14618 itself is included)
top5_similar_to_14618 = sims.sort_values("sim_with_14618", ascending=False).head(6)
top5_similar_to_14618

Unnamed: 0,user,sim_with_14618,sim_with_1361
921,14618,1.0,0.0
1125,17657,0.816497,0.0
116,1796,0.816497,0.0
239,3757,0.816497,0.0
908,14347,0.816497,0.0
724,11093,0.816497,0.0


In [20]:
# And their ratings
music[music.user.isin(top5_similar_to_14618.user)]

Unnamed: 0,user,beyonce,black eyed peas,britney spears,justin timberlake,bloc party,franz ferdinand,babyshambles,the libertines,the hives
116,1796,0,0,0,0,0,1,0,0,1
239,3757,0,0,0,1,0,1,0,0,0
724,11093,0,0,0,0,0,1,0,0,1
908,14347,0,0,0,0,0,1,0,0,1
921,14618,0,0,0,1,0,1,0,0,1
1125,17657,0,0,0,0,0,1,0,0,1


In [21]:
# Top 5 users most similar to 1361 (we take 6 rows because user 1361 itself is included)
top5_similar_to_1361 = sims.sort_values("sim_with_1361", ascending=False).head(6)
top5_similar_to_1361

Unnamed: 0,user,sim_with_14618,sim_with_1361
384,6163,0.0,1.0
81,1361,0.0,1.0
1070,16860,0.288675,0.866025
1149,18030,0.288675,0.866025
329,5302,0.0,0.816497
587,9200,0.0,0.816497


In [22]:
# And their ratings
music[music.user.isin(top5_similar_to_1361.user)]

Unnamed: 0,user,beyonce,black eyed peas,britney spears,justin timberlake,bloc party,franz ferdinand,babyshambles,the libertines,the hives
81,1361,1,1,1,0,0,0,0,0,0
329,5302,1,1,0,0,0,0,0,0,0
384,6163,1,1,1,0,0,0,0,0,0
587,9200,1,0,1,0,0,0,0,0,0
1070,16860,1,1,1,1,0,0,0,0,0
1149,18030,1,1,1,1,0,0,0,0,0


### Band similarity

We can use the same function to compute band similarities!

In [23]:
cosine_similarity(music["beyonce"], music["britney spears"])

0.5129891760425771

In [24]:
cosine_similarity(music["babyshambles"], music["the libertines"])

0.5291502622129182

In [25]:
cosine_similarity(music["babyshambles"], music["justin timberlake"])

0.03706246583305506

In [26]:
cosine_similarity(music["babyshambles"], music["beyonce"])

0.0

### Prediction

We have a new user who likes `babyshambles`, `bloc party`, and `justin timberlake`, and who has listened to `beyonce` and `the hives` but did not like them (they are recorded as 0 below). We want to predict this user's ratings for `britney spears` and `the libertines`.

This user's ratings can be represented as the following `pandas.Series`:


In [27]:
new_user = pd.Series({
    "beyonce": 0, 
    "justin timberlake": 1,
    "bloc party": 1,
    "babyshambles": 1, 
    "the hives": 0
})

**Important:** We compute the cosine similarities only on the artists for whom we know the new user's ratings.

In [28]:
# Define the known information 
known_rating_artists = ["beyonce", "justin timberlake", "bloc party", "babyshambles", "the hives"]

# Filter out users without ratings for these bands
known_rating_music = music[music[known_rating_artists].sum(axis=1) > 0].reset_index()

sim_with_new_user = []
for user_id in known_rating_music.user:
    # Get user ratings
    user = known_rating_music[known_rating_music.user == user_id].iloc[0,:] 
    
    # Compute and store similarity with new_user ONLY ON KNOWN RATINGS!
    sim_with_new_user.append(cosine_similarity(user[known_rating_artists], new_user[known_rating_artists]))

sim_with_new_user = pd.Series(sim_with_new_user)
sim_with_new_user

0     0.577350
1     0.577350
2     0.408248
3     0.577350
4     0.408248
        ...   
95    0.816497
96    0.577350
97    0.577350
98    0.816497
99    0.816497
Length: 100, dtype: float64

Now we compute the predictions as weighted averages of the original users' ratings for `britney spears` and `the libertines`, using the similarities as weights. For that, we use the following lambda function:

In [29]:
# Function that computes predictions
rating_predictions = lambda weights, y: (weights * y).sum() / weights.sum()

In [30]:
# Rating for `britney spears`
rating_predictions(sim_with_new_user, known_rating_music["britney spears"])

0.10741132120129308

In [31]:
# Rating for `the libertines`
rating_predictions(sim_with_new_user, known_rating_music["the libertines"])

0.27350413198343976
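
To see the whole prediction pipeline in miniature, here is a sketch on three made-up users and two known artists. All ratings below are hypothetical, and `predict_rating` is an illustrative variant of the lambda above that also guards against a zero total weight.

```python
import numpy as np
import pandas as pd

def cosine_similarity(x, y):
    return (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

def predict_rating(weights, ratings):
    """Similarity-weighted average of other users' ratings."""
    total = weights.sum()
    if total == 0:
        return np.nan  # no similar user, so no prediction
    return (weights * ratings).sum() / total

# Hypothetical data: the new user's ratings are known for these two artists only
known = ["justin timberlake", "bloc party"]
new_user = pd.Series({"justin timberlake": 1, "bloc party": 1})
others = pd.DataFrame({
    "justin timberlake": [1, 1, 0],
    "bloc party": [1, 0, 1],
    "the libertines": [1, 0, 1],
})

# Similarities are computed on the known artists only
weights = others[known].apply(lambda row: cosine_similarity(row, new_user[known]), axis=1)

# The prediction for the unknown artist is a weighted average of the others' ratings
prediction = predict_rating(weights, others["the libertines"])
print(weights.tolist(), prediction)
```

The first user agrees with the new user on both known artists and therefore carries the largest weight in the prediction.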

---
*End of Recitation 4*