# 3B. Data Modeling: Seasonality
<hr>

In [0]:
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
from sklearn import linear_model
import sklearn.metrics as metrics
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as Lin_Reg
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pylab
import scipy.stats as stats
%matplotlib inline
import datetime as dt
from datetime import datetime

### Process Overview:
The general idea behind this analysis is as follows. First, we aggregate prices by weekday for each listing. Next, we normalize each listing's prices by its Monday price, which gives a multiplier for each listing and each day. We then average each day's multiplier across all listings to obtain one final multiplier per day. Lastly, we apply these multipliers to model predictions and compare the results against the actual prices for a subset of the listings.
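
The normalization and averaging steps can be sketched on a toy table. This is a minimal illustration rather than the notebook's actual data: the three listings and their prices are made up, and only the weekday column names follow the dataset used below.

```python
import pandas as pd

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Hypothetical average prices per weekday for three made-up listings.
toy = pd.DataFrame(
    [[100, 100, 100, 100, 110, 110, 100],
     [200, 200, 200, 200, 200, 200, 200],
     [50, 50, 50, 50, 60, 60, 50]],
    columns=days,
)

# Step 1: divide every weekday price by the listing's own Monday price.
toy_multiplier = toy[days].div(toy['Mon'], axis=0)

# Step 2: average each weekday's multiplier across all listings.
toy_average = toy_multiplier.mean()
print(toy_average.round(3).to_dict())
```

Because every listing is divided by its own Monday price, an expensive listing and a cheap one contribute equally to the final average.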

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import sys
sys.path.append('/content/drive/My Drive/Masters Project/datasets/clean_datasets/')

In [0]:
# Import the data file: average price per weekday for each listing
results_nona = pd.read_csv('/content/drive/My Drive/Masters Project/datasets/clean_datasets/seasonality_tomodel.csv')
# Work on a copy so the original prices stay available for the comparison later on
results_multiplier = results_nona.copy()
b = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
# Divide each weekday's price by the listing's Monday price
for i in b[1:]:
    results_multiplier[i] = results_multiplier[i]/results_multiplier['Mon']
# Monday is the reference day, so its multiplier is 1 by definition
results_multiplier['Mon'] = 1
results_multiplier.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
0,1,0.997173,0.99348,1.006224,0.868412,0.867819,0.993085,142177
1,1,1.0,1.0,1.0,1.0,1.0,1.0,51557
2,1,1.018245,1.028955,1.023534,1.028586,1.015525,0.991745,958
3,1,0.995326,0.997263,1.010948,1.116325,1.116325,1.010948,3850
4,1,1.0,1.0,1.0,1.0,1.0,1.0,51773


We see that the dataframe now contains a multiplier for each day of the week for each listing. Next, we average each day's multiplier across all listings to obtain a single average multiplier per day.

In [0]:
# Average each weekday's multiplier across all listings (computed once, not once per day)
day_means = results_multiplier[b].mean()
multiplier = {day: day_means[day] for day in b}
multiplier

{'Fri': 1.0262542633811345,
 'Mon': 1.0,
 'Sat': 1.0260984917953242,
 'Sun': 1.00036695347491,
 'Thu': 1.0009906217589972,
 'Tue': 1.0005849876877382,
 'Wed': 1.000198764920445}

## Predicting Prices Using Our Seasonality Averages

Next, we test how well these averages perform. We take the RidgeCV regression, one of the best-performing models among those we ran, and apply the seasonality multipliers to its predictions.
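
Written out, the adjustment used here can be stated as follows, where $\hat{y}_i$ is the model's predicted log price for listing $i$, $p_{i,d}$ is the listing's average observed price on weekday $d$, $m_d$ is the average multiplier for that weekday, and $N$ is the number of listings. The notation is introduced here for illustration and does not appear in the notebook code.

$$
m_d = \frac{1}{N}\sum_{i=1}^{N}\frac{p_{i,d}}{p_{i,\mathrm{Mon}}},
\qquad
\hat{p}_{i,d} = m_d \, e^{\hat{y}_i}
$$

Since $m_{\mathrm{Mon}} = 1$, the Monday prediction is simply the unadjusted model output.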

In [0]:
# Import the listing data and rerun the RidgeCV regression
data = pd.read_csv('/content/drive/My Drive/Masters Project/datasets/clean_datasets/listings_clean.csv')
# Split into x and y (note that we do not include id and host_id as predictors)
x = data.iloc[:, 2:-2]
# y is the second-to-last column; y_log (the last column) is the log-scale target used for fitting
y = data.iloc[:, -2]
y_log = data.iloc[:, -1]

In [0]:
# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas versions)
x = x.ffill()

In [0]:
# Candidate regularization strengths, log-spaced from 1e-10 to 1e5
reg_params = 10.**np.linspace(-10, 5, 10)
RidgeCV_model = RidgeCV(alphas=reg_params, fit_intercept=True, cv=5)
RidgeCV_model.fit(x, y_log)

# Draw 40% of the listings for the comparison
sample = results_nona.sample(frac=0.4, axis=0)
# Some of the IDs in the sample can't be found in the listing data, so we
# filter both dataframes down to the listings they have in common
sample_variables = data.loc[data['id'].isin(sample['listing_id'])]
sample = sample.loc[sample['listing_id'].isin(sample_variables['id'])]
X_sample = sample_variables.iloc[:, 2:-2]

# The model predicts log price, so exponentiate to get back to price
base_prediction = np.exp(RidgeCV_model.predict(X_sample))
# NOTE: the values are assigned by position, so this relies on sample and
# sample_variables listing the same IDs in the same row order
new_predictions = sample.copy()
for i in b:
    new_predictions[i] = base_prediction*multiplier[i]



In [0]:
new_predictions.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
150,99.519008,99.577226,99.538789,99.617594,102.131806,102.116304,99.555527,42577
500,284.862539,285.02918,284.919159,285.14473,292.341395,292.297021,284.96707,1104362
2229,273.10328,273.263042,273.157563,273.373822,280.273406,280.230864,273.203496,9990865
683,100.779856,100.838811,100.799887,100.87969,103.425757,103.410058,100.816837,1721354
2316,301.660119,301.836587,301.720079,301.958951,309.579984,309.532994,301.770815,10572287


In [0]:
sample.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
150,150.0,150.0,150.0,150.0,150.0,150.0,150.0,42577
500,243.769231,242.377358,245.692308,245.692308,245.692308,245.692308,242.365385,1104362
2229,680.653846,682.283019,686.115385,687.461538,687.0,686.326923,668.519231,9990865
683,418.0,416.566038,418.0,418.0,416.576923,418.0,418.0,1721354
2316,150.0,150.943396,151.019231,150.961538,150.961538,150.0,150.0,10572287


Already from the first few rows, we can see that the seasonality-adjusted predictions may not track the actual prices well. The first dataframe above holds our projections; the second holds the actual prices.

In [0]:
metrics.median_absolute_error(sample.iloc[:,:-1].values.flatten(), new_predictions.iloc[:,:-1].values.flatten())

95.4555999522433
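
One thing worth checking before reading too much into this number: the predictions are written into `new_predictions` by position, while `sample` keeps the shuffled row order from `.sample()` and `sample_variables` keeps the row order of `data`. If those orders differ, each prediction is compared against a different listing's price. Below is a minimal sketch of an ID-aligned comparison on made-up IDs and prices. The names `actual`, `predicted`, and `aligned` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

days = ['Mon', 'Tue']  # shortened to two days to keep the toy example small

# Hypothetical actual prices, in shuffled order (as after DataFrame.sample).
actual = pd.DataFrame({'listing_id': [30, 10, 20],
                       'Mon': [300.0, 100.0, 200.0],
                       'Tue': [303.0, 101.0, 202.0]})

# Hypothetical predictions, in the order of the feature table.
predicted = pd.DataFrame({'listing_id': [10, 20, 30],
                          'Mon': [110.0, 190.0, 310.0],
                          'Tue': [111.0, 192.0, 313.0]})

# Join on the listing ID so that each row pairs a listing with itself.
aligned = actual.merge(predicted, on='listing_id',
                       suffixes=('_actual', '_pred'))

actual_values = aligned[[d + '_actual' for d in days]].to_numpy().flatten()
pred_values = aligned[[d + '_pred' for d in days]].to_numpy().flatten()

# Median absolute error: the median of |actual - predicted|.
aligned_error = np.median(np.abs(actual_values - pred_values))

# For contrast: comparing row by row without aligning on the ID.
positional_error = np.median(
    np.abs(actual[days].to_numpy().flatten()
           - predicted[days].to_numpy().flatten())
)
print(aligned_error, positional_error)
```

In this toy case the positional comparison reports a much larger error than the aligned one, even though the predictions themselves are identical. Restricting the comparison to the weekday columns also keeps any non-price column out of the error.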