<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_II_8_LasVegasExample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's load some libraries:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm

# Las Vegas Dataset

Let's load the dataset from the course repository.

In [2]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

Cloning into 'MSDIA_PredictiveModelingAndMachineLearning'...


In [3]:
lasvegas = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_II_8_LasVegasTripAdvisorReviews.csv')

And let's take a look:

In [4]:
lasvegas.head()

Unnamed: 0,User country,Nr. reviews,Nr. hotel reviews,Helpful votes,Score,Period of stay,Traveler type,Pool,Gym,Tennis court,Spa,Casino,Free internet,Hotel name,Hotel stars
0,USA,11,4,13,5,Dec-Feb,Friends,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3
1,USA,119,21,75,3,Dec-Feb,Business,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3
2,USA,36,9,25,5,Mar-May,Families,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3
3,UK,14,7,14,4,Mar-May,Friends,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3
4,Canada,5,5,2,4,Mar-May,Solo,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3


In [5]:
lasvegas.describe()

Unnamed: 0,Nr. reviews,Nr. hotel reviews,Helpful votes,Score,Hotel stars
count,504.0,504.0,504.0,504.0,504.0
mean,48.130952,16.02381,31.751984,4.123016,4.047619
std,74.996426,23.957953,48.520783,1.007302,0.84465
min,1.0,0.0,0.0,1.0,3.0
25%,12.0,5.0,8.0,4.0,3.0
50%,23.5,9.0,16.0,4.0,4.0
75%,54.25,18.0,35.0,5.0,5.0
max,775.0,263.0,365.0,5.0,5.0


In [6]:
lasvegas['User country'].value_counts()

User country
USA                     220
UK                       74
Canada                   65
Australia                36
Ireland                  13
India                    12
Mexico                    8
Germany                   7
New Zealand               5
Brazil                    5
Egypt                     5
Netherlands               4
Singapore                 4
Norway                    3
Finland                   3
Thailand                  3
Israel                    3
Switzerland               3
Malaysia                  3
Spain                     2
United Arab Emirates      2
Costa Rica                2
Jordan                    1
Kenya                     1
Greece                    1
China                     1
Hungary                   1
South Africa              1
Puerto Rico               1
Belgium                   1
Philippines               1
Croatia                   1
Syria                     1
France                    1
Iran                      1
Saudi A

In [7]:
lasvegas['Hotel name'].value_counts()

Hotel name
Circus Circus Hotel & Casino Las Vegas                 24
Excalibur Hotel & Casino                               24
Monte Carlo Resort&Casino                              24
Treasure Island- TI Hotel & Casino                     24
Tropicana Las Vegas - A Double Tree by Hilton Hotel    24
Caesars Palace                                         24
The Cosmopolitan Las Vegas                             24
The Palazzo Resort Hotel Casino                        24
Wynn Las Vegas                                         24
Trump International Hotel Las Vegas                    24
The Cromwell                                           24
Encore at wynn Las Vegas                               24
Hilton Grand Vacations on the Boulevard                24
Marriott's Grand Chateau                               24
Tuscany Las Vegas Suites & Casino                      24
Hilton Grand Vacations at the Flamingo                 24
Wyndham Grand Desert                                   24
The

## Data Preparation

The first issue we encounter is that there are categorical variables (Period of stay, Traveler type, etc.), continuous/numerical variables (Nr. reviews, Helpful votes), and some where it isn't clear. For instance, Hotel stars could be continuous or ordinal---and really Score as well.


We will treat our dependent variable (Score) as continuous since we are running a linear regression (we will discuss alternatives later). We will treat Stars as categorical and we will drop the 'User country' and 'Hotel name'.

In [8]:
numerics = list(lasvegas.select_dtypes(include=['int64']).columns)
numerics.remove('Hotel stars')
numerics.remove('Score')
factors = list(lasvegas.select_dtypes(include=['object']).columns)
factors.append('Hotel stars')
factors.remove('User country')
factors.remove('Hotel name')
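
The same split can be sketched on a toy frame. This is only an illustration under assumptions: the two-row `toy` frame and the use of `include="number"` / `exclude="number"` (instead of the exact `int64` and `object` dtypes used above) are not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row frame that mimics a few of the dataset's columns.
toy = pd.DataFrame({
    "Nr. reviews": [11, 119],
    "Score": [5, 3],
    "Hotel stars": [3, 3],
    "Pool": ["NO", "NO"],
    "User country": ["USA", "USA"],
})

# Numeric predictors: every numeric column except the outcome and the star rating.
numerics = [c for c in toy.select_dtypes(include="number").columns
            if c not in ("Score", "Hotel stars")]

# Categorical predictors: non-numeric columns we keep, plus the star rating.
factors = [c for c in toy.select_dtypes(exclude="number").columns
           if c != "User country"] + ["Hotel stars"]

print(numerics, factors)
```

Selecting by the broader `"number"` kind is slightly less sensitive to how the integers happen to be stored, which may matter if the data is loaded differently.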

Let's look at the numerical columns:

In [9]:
lasvegas_numcols = lasvegas[numerics]
lasvegas_numcols.head()

Unnamed: 0,Nr. reviews,Nr. hotel reviews,Helpful votes
0,11,4,13
1,119,21,75
2,36,9,25
3,14,7,14
4,5,5,2


One potentially problematic aspect is that 'Helpful votes' is correlated with 'Nr. reviews': the more reviews a user has written, the more helpful votes they can collect. So we will **engineer** a new feature, the number of helpful votes divided by the number of reviews (helpful votes per review). Note that this ratio can exceed 1, since a single review can receive several helpful votes:

In [10]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the sliced frame
lasvegas_numcols = lasvegas_numcols.assign(helpful_proportion=lasvegas_numcols['Helpful votes'] / lasvegas_numcols['Nr. reviews'])


In [11]:
lasvegas_numcols.head()

Unnamed: 0,Nr. reviews,Nr. hotel reviews,Helpful votes,helpful_proportion
0,11,4,13,1.181818
1,119,21,75,0.630252
2,36,9,25,0.694444
3,14,7,14,1.0
4,5,5,2,0.4
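
As a rough check of the overlap between the two raw counts, here is a minimal sketch that uses only the five rows displayed above. With so few rows the correlation is illustrative at best; running the same two lines on the full `lasvegas` frame would be the real check.

```python
import pandas as pd

# The five rows shown in the head() output above.
head = pd.DataFrame({
    "Nr. reviews": [11, 119, 36, 14, 5],
    "Helpful votes": [13, 75, 25, 14, 2],
})

# Pearson correlation between the two raw counts.
corr = head["Nr. reviews"].corr(head["Helpful votes"])

# Helpful votes per review: the engineered feature.
ratio = head["Helpful votes"] / head["Nr. reviews"]

print(corr)
print(ratio)
```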


Now we take the categorical columns and convert them into dummy variables:

In [12]:
lasvegas_faccols = lasvegas[factors]
dummies = pd.get_dummies(lasvegas_faccols.astype('object'), drop_first=True)

In [13]:
lasvegas_faccols

Unnamed: 0,Period of stay,Traveler type,Pool,Gym,Tennis court,Spa,Casino,Free internet,Hotel stars
0,Dec-Feb,Friends,NO,YES,NO,NO,YES,YES,3
1,Dec-Feb,Business,NO,YES,NO,NO,YES,YES,3
2,Mar-May,Families,NO,YES,NO,NO,YES,YES,3
3,Mar-May,Friends,NO,YES,NO,NO,YES,YES,3
4,Mar-May,Solo,NO,YES,NO,NO,YES,YES,3
...,...,...,...,...,...,...,...,...,...
499,Sep-Nov,Couples,YES,YES,NO,YES,YES,YES,4
500,Sep-Nov,Couples,YES,YES,NO,YES,YES,YES,4
501,Sep-Nov,Friends,YES,YES,NO,YES,YES,YES,4
502,Dec-Feb,Families,YES,YES,NO,YES,YES,YES,4


In [14]:
dummies


Unnamed: 0,Period of stay_Jun-Aug,Period of stay_Mar-May,Period of stay_Sep-Nov,Traveler type_Couples,Traveler type_Families,Traveler type_Friends,Traveler type_Solo,Pool_YES,Gym_YES,Tennis court_YES,Spa_YES,Casino_YES,Free internet_YES,Hotel stars_4,Hotel stars_5
0,False,False,False,False,False,True,False,False,True,False,False,True,True,False,False
1,False,False,False,False,False,False,False,False,True,False,False,True,True,False,False
2,False,True,False,False,True,False,False,False,True,False,False,True,True,False,False
3,False,True,False,False,False,True,False,False,True,False,False,True,True,False,False
4,False,True,False,False,False,False,True,False,True,False,False,True,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499,False,False,True,True,False,False,False,True,True,False,True,True,True,True,False
500,False,False,True,True,False,False,False,True,True,False,True,True,True,True,False
501,False,False,True,False,False,True,False,True,True,False,True,True,True,True,False
502,False,False,False,False,True,False,False,True,True,False,True,True,True,True,False
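
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame (the `toy` data is made up for illustration). The first category of each column in sorted order, here 'Dec-Feb' and 3 stars, gets no dummy column and serves as the baseline. Passing `dtype=float` is an optional variation, not what the notebook does; it yields numeric 0/1 columns instead of booleans.

```python
import pandas as pd

# Hypothetical four-row frame with two of the categorical columns.
toy = pd.DataFrame({
    "Period of stay": ["Dec-Feb", "Mar-May", "Jun-Aug", "Sep-Nov"],
    "Hotel stars": [3, 4, 5, 3],
})

# Cast to object so the numeric star rating is also treated as categorical.
toy_dummies = pd.get_dummies(toy.astype("object"), drop_first=True, dtype=float)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

The first row (a 'Dec-Feb' stay at a 3-star hotel) is all zeros: it is described entirely by the baseline categories.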


And finally we combine the numerical and the categorical columns, plus our outcome variable:

In [15]:
lasvegas_new = pd.concat([lasvegas_numcols, dummies], axis = 1)
lasvegas_new = pd.concat([lasvegas_new, lasvegas['Score']], axis =1)
lasvegas_new.head()

Unnamed: 0,Nr. reviews,Nr. hotel reviews,Helpful votes,helpful_proportion,Period of stay_Jun-Aug,Period of stay_Mar-May,Period of stay_Sep-Nov,Traveler type_Couples,Traveler type_Families,Traveler type_Friends,Traveler type_Solo,Pool_YES,Gym_YES,Tennis court_YES,Spa_YES,Casino_YES,Free internet_YES,Hotel stars_4,Hotel stars_5,Score
0,11,4,13,1.181818,False,False,False,False,False,True,False,False,True,False,False,True,True,False,False,5
1,119,21,75,0.630252,False,False,False,False,False,False,False,False,True,False,False,True,True,False,False,3
2,36,9,25,0.694444,False,True,False,False,True,False,False,False,True,False,False,True,True,False,False,5
3,14,7,14,1.0,False,True,False,False,False,True,False,False,True,False,False,True,True,False,False,4
4,5,5,2,0.4,False,True,False,False,False,False,True,False,True,False,False,True,True,False,False,4


## Run our Linear Regression

Let's run our linear regression:

In [16]:
y = lasvegas_new['Score']
X = lasvegas_new.drop(columns=['Score'])
X = sm.add_constant(X)
model_sm = sm.OLS(y, X.astype(float)).fit()  # get_dummies returns boolean columns, so cast everything to float to give statsmodels a numeric design matrix
model_sm.summary()

0,1,2,3
Dep. Variable:,Score,R-squared:,0.197
Model:,OLS,Adj. R-squared:,0.166
Method:,Least Squares,F-statistic:,6.255
Date:,"Sat, 04 Oct 2025",Prob (F-statistic):,1.78e-14
Time:,13:41:50,Log-Likelihood:,-662.98
No. Observations:,504,AIC:,1366.0
Df Residuals:,484,BIC:,1450.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.8985,0.582,3.264,0.001,0.756,3.041
Nr. reviews,-0.0036,0.001,-3.407,0.001,-0.006,-0.002
Nr. hotel reviews,-0.0022,0.003,-0.795,0.427,-0.008,0.003
Helpful votes,0.0065,0.002,3.235,0.001,0.003,0.010
helpful_proportion,-0.2992,0.055,-5.485,0.000,-0.406,-0.192
Period of stay_Jun-Aug,-0.0732,0.118,-0.619,0.536,-0.306,0.159
Period of stay_Mar-May,-0.1521,0.117,-1.303,0.193,-0.381,0.077
Period of stay_Sep-Nov,-0.1229,0.118,-1.043,0.298,-0.355,0.109
Traveler type_Couples,0.4216,0.127,3.331,0.001,0.173,0.670

0,1,2,3
Omnibus:,59.676,Durbin-Watson:,2.144
Prob(Omnibus):,0.0,Jarque-Bera (JB):,78.145
Skew:,-0.906,Prob(JB):,1.07e-17
Kurtosis:,3.662,Cond. No.,1820.0


The regression table provides insights on how features are associated with scores. For instance, hotel stars are positively associated with the predicted score, and so is having a pool.

However, we should be mindful not to attach "causal" interpretations. For instance, even though 'Spa' has a negative association with Score, that likely doesn't mean that closing your Spa and leaving everything else unchanged will positively affect scores. Possibly the mechanism is that having a Spa leads to higher prices and customers don't like paying higher prices, so closing the spa while charging the same price may not have an effect. Again, we explore when and how to draw causal inferences in more detail in another class in your program (GB 740).

However, in the spirit of this class, we can use the model to generate predictions!

## Prediction

For generating predictions, we can now take features of a traveler and their profile (how many ratings have they done, when are they traveling, are they going with their family, etc.) and of the hotel (how many stars, does it have free internet, etc.) to predict a satisfaction score for their trip. This may be helpful in recommending a hotel in a given price range, say.

For instance, consider a traveler who has written 4 total reviews, 2 of them on hotels, and received 3 helpful votes, traveling in the summer period (Jun-Aug) with their family. Paired with 'Circus Circus', entered here as a 3-star hotel with a gym and free internet but no other amenities, this traveler has a predicted Score of:

In [17]:
model_sm.predict([[1,4,2,3,.75,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]])

array([2.65839709])
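
A long positional list like the one above is easy to get wrong, so here is a sketch of building the same row by column name. The `columns` list is copied from the `lasvegas_new` output above with the constant first; in the notebook itself one would presumably use `list(X.columns)` instead. The `profile` dictionary and `x_new` are names introduced here for illustration.

```python
import pandas as pd

# Design-matrix column order: the constant, then every column of lasvegas_new except Score.
columns = [
    "const", "Nr. reviews", "Nr. hotel reviews", "Helpful votes", "helpful_proportion",
    "Period of stay_Jun-Aug", "Period of stay_Mar-May", "Period of stay_Sep-Nov",
    "Traveler type_Couples", "Traveler type_Families", "Traveler type_Friends",
    "Traveler type_Solo", "Pool_YES", "Gym_YES", "Tennis court_YES", "Spa_YES",
    "Casino_YES", "Free internet_YES", "Hotel stars_4", "Hotel stars_5",
]

# Only the non-zero entries need to be spelled out; everything else defaults to 0.
profile = {
    "const": 1.0,
    "Nr. reviews": 4,
    "Nr. hotel reviews": 2,
    "Helpful votes": 3,
    "Period of stay_Jun-Aug": 1,
    "Traveler type_Families": 1,
    "Gym_YES": 1,
    "Free internet_YES": 1,
}
profile["helpful_proportion"] = profile["Helpful votes"] / profile["Nr. reviews"]

# Align to the design-matrix order and fill the unspecified dummies with 0.
x_new = pd.DataFrame([profile]).reindex(columns=columns, fill_value=0.0).astype(float)
print(x_new.iloc[0].tolist())

# In the notebook, where model_sm exists: model_sm.predict(x_new)
```

The resulting row matches the positional list passed to `model_sm.predict` above, but each entry is now tied to a named feature.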