# 15.774/15.780 Fall 2023
## The Analytics of Operations Management
### Problem Set 3 - Choice Modeling & Collaborative Filtering
#### Due Date:  10/13
---
Name of Student: [Joy Bhattacharya]

MIT ID Number: [922631264]

---

### Instructions:

1) Submit solutions that are your own, in your own words. You are allowed to discuss with other students in general terms, but make sure you are not copying verbatim from another student. Therefore do not read other students' solutions. If you use material from outside this class, reference it in your solution. 

2) Please download the python file attached in the assignment and complete your answers there in the same file. Read the questions carefully, and make sure you answer every part that the question asks.

3) Include relevant code in the PDF submission even if the question doesn't explicitly ask for it. Upload your solutions as a PDF file. Include your name and MIT ID on the first page.

4) To convert to PDF, you can use the "print to pdf" option in Jupyter (or the equivalent option in another IDE). There are other ways to download directly in PDF format, which might require installing additional packages. 

5) Show your work and explain your conclusions clearly and precisely. Plots should have clear titles and axis labels so that it is clear what your analysis is showing.

---

First, we import the packages we will be using

In [2]:
# Install the packages you do not have
!pip install xlogit

Collecting xlogit
  Obtaining dependency information for xlogit from https://files.pythonhosted.org/packages/5b/ab/5280d6920d59e739063effb59b54349a71ee66bdd11daa1664d08e68564a/xlogit-0.2.7-py3-none-any.whl.metadata
  Downloading xlogit-0.2.7-py3-none-any.whl.metadata (10 kB)
Downloading xlogit-0.2.7-py3-none-any.whl (36 kB)
Installing collected packages: xlogit
Successfully installed xlogit-0.2.7




In [1]:
import pandas as pd
import numpy as np
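np.float = float  # Compatibility shim: restores the `np.float` alias removed in NumPy 1.24, presumably needed by this xlogit version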
from xlogit.utils import wide_to_long
from xlogit import MultinomialLogit



---
## $\textbf{Problem 1. Choice Models}$ (40 pts)

In this problem, we want to build a model to represent a choice between the usage of two transportation systems (**bus** or **car**). We use the data stored in `CommuteFall2020.csv`, which contains the choices made by 400 commuters. The data is stored in **wide format**, and contains the following variables:
* `Car.Travel` is the travel time by car (in minutes)
* `HHIncome` is the annual household income (in dollars). **Note:** This feature is a characteristic of the individual making the choice, not of the alternatives
* `ParkingCost` is the car parking cost (in dollars)
* `BusFare` is the bus fare (in dollars)
* `Bus.Travel` is the travel time by bus (in minutes)
* `Bus.Wait` is the average waiting time for the bus (in minutes)
* `Choice` indicates whether the commuter chose CAR or BUS

We assume that for each person, the utility of taking either transportation method is **linear** in **total journey time** (travel time for a car, or *travel plus waiting time for a bus*), **daily cost** (parking cost for car or bus fare), and **household income**. This means we should define new variables `Time` and `Cost` for each option, taking care to use a consistent prefix or suffix for the alternative names. 

We will also assume that there is no outside option, i.e. a person must
choose either car or bus (meaning the model has no intercept). 
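
One way to write this specification is sketched below. The symbols $\beta_T$, $\beta_C$ and $\beta_I$ are notation introduced here, and attaching the income term to the car alternative only (with bus as the reference) is an assumption: a person-level variable with a common coefficient would shift both utilities equally and cancel out of the choice probabilities.

$$
V_{i,\text{CAR}} = \beta_T\,\text{Time}_{i,\text{CAR}} + \beta_C\,\text{Cost}_{i,\text{CAR}} + \beta_I\,\text{HHIncome}_i,
\qquad
V_{i,\text{BUS}} = \beta_T\,\text{Time}_{i,\text{BUS}} + \beta_C\,\text{Cost}_{i,\text{BUS}}
$$

$$
P_i(\text{CAR}) = \frac{e^{V_{i,\text{CAR}}}}{e^{V_{i,\text{CAR}}} + e^{V_{i,\text{BUS}}}},
\qquad
P_i(\text{BUS}) = 1 - P_i(\text{CAR})
$$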

First, we load the data

In [2]:
# Assuming 'CommuteFall2020.csv' is in the same directory as your Python script or Jupyter Notebook
file_path = "CommuteFall2020.csv"

# Use the read_csv function to load the CSV file into a DataFrame
commute = pd.read_csv(file_path)
commute.head()

Unnamed: 0,Car.Travel,ParkingCost,Bus.Travel,Bus.Wait,BusFare,HHIncome,Choice
0,40,10,80,9,1.8,75000,BUS
1,35,8,80,14,1.8,65000,CAR
2,45,10,80,9,1.8,45000,CAR
3,25,8,65,24,1.8,45000,CAR
4,35,8,50,14,1.8,75000,CAR


---
**1.** We need to do some pre-processing to get the data in the right format for `xlogit`. You will need to create some new columns and name them correctly. **(10pts)**

In [3]:
# Let us define the column `Time` for the two modes 
# (Note: we suggest the alternative names as suffixes separated by "_" as in the first line)

commute["Time_BUS"] = commute["Bus.Travel"] + commute["Bus.Wait"]  # Bus: travel plus waiting time
commute["Time_CAR"] = commute["Car.Travel"]  # Car: travel time only

# Let us define the column `Cost` for the two modes
commute["Cost_BUS"] = commute["BusFare"]
commute["Cost_CAR"] = commute["ParkingCost"]

# Remove the variables used to compute the Time and Cost
commute = commute.drop(columns=["Bus.Travel", "Bus.Wait", "BusFare", "ParkingCost", "Car.Travel"])

commute.head()

Unnamed: 0,HHIncome,Choice,Time_BUS,Time_CAR,Cost_BUS,Cost_CAR
0,75000,BUS,89,40,1.8,10
1,65000,CAR,94,35,1.8,8
2,45000,CAR,89,45,1.8,10
3,45000,CAR,89,25,1.8,8
4,75000,CAR,64,35,1.8,8


In [6]:
# Define a custom id per choice scenario
commute["custom_id"] = np.arange(len(commute))

# Note: The alternatives should be the same as the suffixes or prefixes you used in the previous step


# Change from wide to long format
commute_long = wide_to_long(
    commute, 
    id_col="custom_id", # Id representing each scenario
    alt_name="alt", # The name you want to assign to the column that will identify the alternative represented on each row 
    sep="_",# Separator used between the feature name and the alternative (suffix in this case)
    alt_list=["BUS", "CAR"], # Alternative names
    varying=['Time', 'Cost'], # Actual features
    alt_is_prefix=False, # The alternatives are suffixes
    
)
commute_long.head()


Unnamed: 0,custom_id,alt,Time,Cost,HHIncome,Choice
0,0,BUS,89,1.8,75000,BUS
1,0,CAR,40,10.0,75000,BUS
2,1,BUS,94,1.8,65000,CAR
3,1,CAR,35,8.0,65000,CAR
4,2,BUS,89,1.8,45000,CAR
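
For intuition, the same reshaping can be sketched with plain pandas. This is only an illustration on two hand-copied rows, assuming `pd.wide_to_long` with a `suffix` pattern that matches the alternative names. The names `wide`, `long_df` and `chosen` are introduced here, and the sketch is not meant to describe what `xlogit.utils.wide_to_long` does internally.

```python
import pandas as pd

# Two choice scenarios in wide format (values copied from the first two rows above)
wide = pd.DataFrame({
    "custom_id": [0, 1],
    "HHIncome": [75000, 65000],
    "Choice": ["BUS", "CAR"],
    "Time_BUS": [89, 94],
    "Time_CAR": [40, 35],
    "Cost_BUS": [1.8, 1.8],
    "Cost_CAR": [10.0, 8.0],
})

# One row per (scenario, alternative); HHIncome and Choice are repeated on each row
long_df = (
    pd.wide_to_long(
        wide,
        stubnames=["Time", "Cost"],  # features that vary across alternatives
        i="custom_id",               # scenario identifier
        j="alt",                     # new column holding the alternative name
        sep="_",
        suffix=r"[A-Z]+",            # alternatives are upper-case suffixes (BUS, CAR)
    )
    .reset_index()
    .sort_values(["custom_id", "alt"])
    .reset_index(drop=True)
)

# Hypothetical helper column: True on the row of the alternative actually chosen
long_df["chosen"] = long_df["Choice"] == long_df["alt"]
print(long_df)
```

Each scenario now contributes two rows, one per alternative, which is the layout the model fitting step expects.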


---
**2.** Fit an MNL model to the data and show the model coefficients. **(5pts)**

In [8]:
# Define the variable names, and the sets for training
varnames = ['Time', 'Cost', 'HHIncome']
y = commute_long.Choice
X = commute_long.loc[:, varnames] 

# Define and train the MNL model using xlogit.MultinomialLogit
# Hint: HHIncome is not a feature of the choices but a feature of the populations or individuals
model = MultinomialLogit()
model.fit(
    # FILL IN all parameters (Hint: 7 params including isvars)
    X, # X contains all the features in varnames
    y, # y contains the choice feature
    varnames=varnames, # Features to be used in the model
    ids=commute_long.custom_id, # Choice scenario identifier
    alts=commute_long.alt, # Alternative names
    fit_intercept=False, # We won't fit intercept in general
    isvars=["HHIncome"]  # Individual-specific variable (it describes the commuter, not the alternative)
)
model.summary()

Optimization terminated successfully.
    Message: The gradients are close to zero
    Iterations: 6
    Function evaluations: 7
Estimation time= 0.0 seconds
---------------------------------------------------------------------------
Coefficient              Estimate      Std.Err.         z-val         P>|z|
---------------------------------------------------------------------------
HHIncome.CAR            0.0000139     0.0000060     2.3033910        0.0218 *  
Time                   -0.0569383     0.0074806    -7.6114628      1.97e-13 ***
Cost                   -0.3765784     0.0541445    -6.9550616      1.44e-11 ***
---------------------------------------------------------------------------
Significance:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood= -223.766
AIC= 453.533
BIC= 465.507


In [24]:
# Report model coeffs
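# The coefficient on HHIncome.CAR is 0.0000139 (positive: higher income raises the utility of car relative to bus).
# The coefficient on Time is -0.0569383 and the coefficient on Cost is -0.3765784 (both negative: longer or costlier trips lower utility).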

---
**3.** Table 1 shows the characteristics of commutes for two subpopulations. Think of subpopulation L as
people living near the center of a city and subpopulation H as people living in the suburbs. People in
both subpopulations all work in the downtown area. We assume that the same MNL model applies to
both subpopulations. 

Based on the information in Table 1, compute the probabilities that an individual
belonging to each subpopulation will choose to take a car to work. Show what equations you used.
*(Hint: You should provide 2 numeric answers for this question)*. **(10 pts)**



|  | Segment L (CAR) | Segment L (BUS) | Segment H (CAR) | Segment H (BUS) |
| :- | :-: | :-: | :-: | :-: |
| Car travel time (min) | 25 | - | 40 | - |
| Parking cost (\$) | 3.0 | - | 8.0 | - |
| Bus travel time (min) | - | 40 | - | 60 |
| Bus waiting time (min) | - | 5 | - | 10 |
| Bus fare (\$) | - | 0.8 | - | 2.0 |
| Household annual income (\$) | 40,000 | 40,000 | 80,000 | 80,000 |

<center><b> Table 1: Characteristics of commute for two subpopulations</b> </center>

In [13]:
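# Compute the alternative utilities (or weights) of each option first (Hint: 4 in total)
# Each variable below holds exp(utility), i.e. the MNL weight of that option.
# Bus time is travel plus waiting time: 40 + 5 = 45 min (segment L) and 60 + 10 = 70 min (segment H).
# Note: the income term appears in both the car and the bus weight, so it cancels out of the shares computed below.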
seg_l_car_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*25 + model.coeff_[2]*3)
seg_l_bus_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*45 + model.coeff_[2]*0.8)
seg_h_car_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*40 + model.coeff_[2]*8)
seg_h_bus_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*70 + model.coeff_[2]*2)
# Compute and report the market shares for car usage
Share_L_CAR = seg_l_car_utility/(seg_l_car_utility+seg_l_bus_utility)
Share_H_CAR = seg_h_car_utility/(seg_h_car_utility+seg_h_bus_utility)

# FILL IN to report both
print(f"Cars in segment L have a market share of {np.round(Share_L_CAR*100, 2)}%")
print(f"Cars in segment H have a market share of {np.round(Share_H_CAR*100, 2)}%")




Cars in segment L have a market share of 57.7%
Cars in segment H have a market share of 36.56%
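
The model summary reports a single income coefficient named `HHIncome.CAR`, which suggests that income enters only the car utility, with bus as the reference alternative. The sketch below shows the binary logit formula under that assumption, using the rounded estimates printed in the summary. The function `car_share` and its `income_in_bus` flag are hypothetical names introduced here; setting the flag to `True` mirrors the cell above, where the income term is added to both utilities.

```python
import numpy as np

# Rounded estimates copied from the model summary (not the unrounded model.coeff_ values)
B_INCOME_CAR = 0.0000139
B_TIME = -0.0569383
B_COST = -0.3765784


def car_share(car_time, car_cost, bus_time, bus_cost, income, income_in_bus=False):
    """Binary logit probability of choosing car over bus."""
    v_car = B_INCOME_CAR * income + B_TIME * car_time + B_COST * car_cost
    v_bus = B_TIME * bus_time + B_COST * bus_cost
    if income_in_bus:
        # Same income term on both sides: it cancels in the difference below
        v_bus += B_INCOME_CAR * income
    # exp(v_car) / (exp(v_car) + exp(v_bus)), rewritten in terms of the utility difference
    return 1.0 / (1.0 + np.exp(v_bus - v_car))


# Segment L: car 25 min / $3.0, bus 40 + 5 min / $0.8, income $40,000
share_l = car_share(25, 3.0, 45, 0.8, 40_000)
# Segment H: car 40 min / $8.0, bus 60 + 10 min / $2.0, income $80,000
share_h = car_share(40, 8.0, 70, 2.0, 80_000)

print(f"Car share, segment L (income on car only): {share_l:.3f}")
print(f"Car share, segment H (income on car only): {share_h:.3f}")
```

Under this assumption the car shares come out at roughly 70% for segment L and 64% for segment H, compared with the 57.7% and 36.56% reported above. The exact figures depend on the unrounded coefficients.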


---
**4.** In order to increase bus ridership, the local transport authority has considered the following options:
1. Reducing the bus fare by 50 cents in both subpopulations.
2. Increasing the number of buses so that the waiting times would be cut in half for both subpopulations.
3. Doubling parking costs in both populations.

**For each subpopulation**, how will each of these strategies (1, 2, 3 above) affect the **probability of
taking the bus**, and which option will be most effective in increasing bus ridership? **(15 pts)**


In [15]:
# Compute the current Bus shares in both subpopulations 
Share_L_BUS = seg_l_bus_utility/(seg_l_car_utility+seg_l_bus_utility)
Share_H_BUS = seg_h_bus_utility/(seg_h_car_utility+seg_h_bus_utility)

# FILL IN to report both
print(f"Buses in segment L have a market share of {np.round(Share_L_BUS*100, 2)}%")
print(f"Buses in segment H have a market share of {np.round(Share_H_BUS*100, 2)}%")

Buses in segment L have a market share of 42.3%
Buses in segment H have a market share of 63.44%


In [16]:
# Compute strategy 1 shares for bus usage and report them (Hint: you might find useful to recompute utilities first)
# Strategy 1: Reducing the bus fare by 50 cents in both subpopulations
seg_l_car_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*25 + model.coeff_[2]*3)
seg_l_bus_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*45 + model.coeff_[2]*0.3)
seg_h_car_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*40 + model.coeff_[2]*8)
seg_h_bus_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*70 + model.coeff_[2]*1.5)

Share_L_BUS_1 = seg_l_bus_utility/(seg_l_bus_utility+seg_l_car_utility)
Share_H_BUS_1 = seg_h_bus_utility/(seg_h_bus_utility+seg_h_car_utility)

# FILL IN to report both
print(f"With strategy 1, buses in segment L have a market share of {np.round(Share_L_BUS_1*100, 2)}%")
print(f"With strategy 1, buses in segment H have a market share of {np.round(Share_H_BUS_1*100, 2)}%")

With strategy 1, buses in segment L have a market share of 46.95%
With strategy 1, buses in segment H have a market share of 67.69%


In [19]:
# Compute strategy 2 shares for bus usage and report them
# Strategy 2: Increasing the number of buses so that the waiting times would be cut in half for both subpopulations.
seg_l_car_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*25 + model.coeff_[2]*3)
seg_l_bus_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*42.5 + model.coeff_[2]*0.8)
seg_h_car_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*40 + model.coeff_[2]*8)
seg_h_bus_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*65 + model.coeff_[2]*2)

Share_L_BUS_2 = seg_l_bus_utility/(seg_l_bus_utility+seg_l_car_utility)
Share_H_BUS_2 = seg_h_bus_utility/(seg_h_bus_utility+seg_h_car_utility)

# FILL IN to report both
print(f"With strategy 2, buses in segment L have a market share of {np.round(Share_L_BUS_2*100, 2)}%")
print(f"With strategy 2, buses in segment H have a market share of {np.round(Share_H_BUS_2*100, 2)}%")

With strategy 2, buses in segment L have a market share of 45.81%
With strategy 2, buses in segment H have a market share of 69.76%


In [21]:
# Compute strategy 3 shares for bus usage and report them
# Strategy 3: Doubling parking costs in both populations
seg_l_car_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*25 + model.coeff_[2]*6)
seg_l_bus_utility=np.exp(model.coeff_[0]*40000 + model.coeff_[1]*45 + model.coeff_[2]*0.8)
seg_h_car_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*40 + model.coeff_[2]*16)
seg_h_bus_utility=np.exp(model.coeff_[0]*80000 + model.coeff_[1]*70 + model.coeff_[2]*2)


Share_L_BUS_3 = seg_l_bus_utility/(seg_l_bus_utility+seg_l_car_utility)
Share_H_BUS_3 = seg_h_bus_utility/(seg_h_bus_utility+seg_h_car_utility)

# FILL IN to report both
print(f"With strategy 3, buses in segment L have a market share of {np.round(Share_L_BUS_3*100, 2)}%")
print(f"With strategy 3, buses in segment H have a market share of {np.round(Share_H_BUS_3*100, 2)}%")

With strategy 3, buses in segment L have a market share of 69.41%
With strategy 3, buses in segment H have a market share of 97.25%


**Comment:** All three strategies raise the probability of taking the bus in both subpopulations. Strategy 1 (bus fare reduced by 50 cents) raises the bus share from 42.30% to 46.95% in segment L and from 63.44% to 67.69% in segment H. Strategy 2 (waiting time cut in half) raises it to 45.81% and 69.76%, respectively. Strategy 3 (parking costs doubled) raises it to 69.41% and 97.25%, respectively.

Strategy 3 is the most effective in both groups. It yields the largest increase in bus share, about 27 percentage points in segment L and about 34 in segment H, compared with at most about 6 points under strategies 1 and 2. 
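
The comparison can be tabulated directly from the shares printed by the cells above. This is a small sketch that hard-codes those reported percentages; the names `shares`, `gains` and `best` are introduced here for illustration.

```python
import pandas as pd

# Bus shares (%) as printed above: baseline and the three strategies
shares = pd.DataFrame(
    {"L": [42.30, 46.95, 45.81, 69.41], "H": [63.44, 67.69, 69.76, 97.25]},
    index=["baseline", "strategy 1", "strategy 2", "strategy 3"],
)

# Gain over the baseline, in percentage points, for each strategy and segment
gains = (shares.drop(index="baseline") - shares.loc["baseline"]).round(2)

# Strategy with the largest gain in each segment
best = gains.idxmax()

print(gains)
print(best)
```

Working in percentage-point gains rather than raw shares makes it clear that the ranking of strategies 1 and 2 differs between segments, while strategy 3 leads in both.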

---
---
## $\textbf{Problem 2. Collaborative Filtering}$ (40 pts)

Ted is gonna be DJ tonight, and he is really trying to impress one particular individual,
let’s call him User “6”. He wants to play Rebecca Black’s magnum opus (“Friday”) for User 6 because he
thinks they’ll like it. Ted got a hold of the Spotify ratings for a few songs and people, including User 6. The
data was manually created in the `pandas.DataFrame` and stored in the variable `music` (see next cell). Help him figure out if he should play “Friday” at the party.


**Technical Note:** You can assume that this matrix is already “bias-adjusted” (i.e. you do not need to deaverage
users’ ratings).

In [17]:
# Please do not modify or delete this cell as it will change your results
music = pd.DataFrame({
    "user": ["User 1", "User 2", "User 3", "User 4", "User 5"],
    "friday": [0, 0, 1, 0, 1],
    "bad_blood": [1, 1, 0, 0, 0],
    "work": [0, 0, 1, 0, 1],
    "bohemian_rhapsody": [1, 0, 0, 1, 1]
})

user_6 = pd.Series({
    "user": "User 6",
    "bad_blood": 0,
    "work": 1,
    "bohemian_rhapsody": 0,
})

music

Unnamed: 0,user,friday,bad_blood,work,bohemian_rhapsody
0,User 1,0,1,0,1
1,User 2,0,1,0,0
2,User 3,1,0,1,0
3,User 4,0,0,0,1
4,User 5,1,0,1,1


In [18]:
user_6

user                 User 6
bad_blood                 0
work                      1
bohemian_rhapsody         0
dtype: object

You may find the following two functions useful. They compute cosine similarities and rating predictions, respectively.

In [19]:
cosine_similarity = lambda x, y: (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
rating_predictions = lambda weights, y: (weights * y).sum() / weights.sum()

---

**1**. Compute the cosine similarities for all users with User 6. Which two users are most similar to User 6? Does this make sense?
*(Hint: use a for loop as in the recitation)* **(15 points)** 


In [22]:
# Define the known information
known_rating_songs = ['bad_blood','work','bohemian_rhapsody']

# No filtering of users is needed here. All of them have ratings!

# Compute similarities
sim_with_user_6 = []
for user_id in music.user:
    # Get user ratings
    user = music[music.user == user_id].iloc[0,:]
    
    # Compute and store similarity with new_user ONLY ON KNOWN RATINGS!
    sim_with_user_6.append(cosine_similarity(user[known_rating_songs], user_6[known_rating_songs]))

sim_with_user_6 = pd.DataFrame({"user": music.user, "sim_with_6": sim_with_user_6})
sim_with_user_6

Unnamed: 0,user,sim_with_6
0,User 1,0.0
1,User 2,0.0
2,User 3,1.0
3,User 4,0.0
4,User 5,0.707107


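**Comment:** User 3 and User 5 are the most similar to User 6, with similarity scores of 1.0 and about 0.71; every other user scores 0. This makes sense because both of them like the song Work, as does User 6. User 3's known ratings match User 6's exactly, while User 5 also likes Bohemian Rhapsody, which lowers the similarity. 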
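
As a check on these values, the cosine similarity can be worked out by hand for a few users. The sketch below is a standalone NumPy version with ratings copied from the table, in the order `bad_blood`, `work`, `bohemian_rhapsody`; the function name `cosine` is introduced here.

```python
import numpy as np


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))


# Known ratings, ordered as (bad_blood, work, bohemian_rhapsody)
user_1 = [1, 0, 1]
user_3 = [0, 1, 0]
user_5 = [0, 1, 1]
user_6 = [0, 1, 0]

sim_1 = cosine(user_1, user_6)  # no song in common: dot product is 0
sim_3 = cosine(user_3, user_6)  # identical vectors
sim_5 = cosine(user_5, user_6)  # dot = 1, norms = sqrt(2) and 1

print(sim_1, sim_3, sim_5)
```

A user with no nonzero known ratings would have a zero norm and an undefined similarity. That case does not occur in this data.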

---
**2**. Compute the cosine similarities for all other songs and `friday`. Which is the most similar song in terms of ratings?
*(Hint: compute them separately)* **(15 points)** 


In [26]:
# Compute similarity of `friday` and `bad_blood` and report it 
sim_with_bad_blood = cosine_similarity(music["bad_blood"], music["friday"])
sim_with_bad_blood

0.0

In [27]:
# Compute similarity of `friday` and `work` and report it 
sim_with_work = cosine_similarity(music["work"], music["friday"])
sim_with_work

1.0

In [28]:
# Compute similarity of `friday` and `bohemian_rhapsody` and report it 
sim_with_br = cosine_similarity(music["bohemian_rhapsody"], music["friday"])
sim_with_br

0.4082482904638631

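**Comment:** The most similar song to Friday is Work, with a cosine similarity of 1.0 (versus about 0.41 for Bohemian Rhapsody and 0.0 for Bad Blood). This makes sense because exactly the same users (Users 3 and 5) like Work and Friday, so the two rating columns are identical. 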
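
The three similarities can also be obtained in one pass by normalizing each song's rating column to unit length and taking dot products. This is a sketch on a copy of the ratings table; `ratings`, `unit`, `item_sim` and `sim_to_friday` are names introduced here.

```python
import numpy as np
import pandas as pd

# Copy of the ratings table (one row per user, one column per song)
ratings = pd.DataFrame({
    "friday": [0, 0, 1, 0, 1],
    "bad_blood": [1, 1, 0, 0, 0],
    "work": [0, 0, 1, 0, 1],
    "bohemian_rhapsody": [1, 0, 0, 1, 1],
})

# Scale every column to unit Euclidean norm
unit = ratings / np.sqrt((ratings ** 2).sum())

# Song-by-song cosine similarity matrix (dot products of unit columns)
item_sim = unit.T @ unit

# Similarity of every other song to `friday`
sim_to_friday = item_sim["friday"].drop("friday")
print(sim_to_friday)
print("Most similar:", sim_to_friday.idxmax())
```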

---
**3.** Use a user-based collaborative filter to find the predicted rating of User 6 for `friday`. Should Ted play that song?  **(10 pts)**

In [38]:
# Compute the weights (similarities to other users)
weights = sim_with_user_6['sim_with_6']
# Compute the prediction of Friday
rating_predictions(weights, music['friday'])

1.0

**Comment:** Ted should play the song. The predicted rating is 1, the highest possible value, because the only users with nonzero similarity to User 6 (Users 3 and 5) both rated Friday 1. 
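
The prediction is a similarity-weighted average of the other users' ratings for `friday`. The arithmetic can be spelled out as below, a sketch with the similarity weights from part 1 entered by hand; the variable names are introduced here.

```python
import numpy as np

# Similarities of Users 1-5 with User 6 (from part 1)
weights = np.array([0.0, 0.0, 1.0, 0.0, 1 / np.sqrt(2)])
# Ratings of Users 1-5 for `friday`
friday = np.array([0, 0, 1, 0, 1])

# Weighted average: only Users 3 and 5 carry weight, and both rated the song 1
numerator = (weights * friday).sum()    # 1 + 1/sqrt(2)
denominator = weights.sum()             # 1 + 1/sqrt(2)
prediction = numerator / denominator

print(prediction)
```

If every weight were zero, the denominator would be zero and the prediction undefined, so a general implementation would need to handle that case separately.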

---
*End of Homework 3*