# Data Analysis + Predictions

It is assumed that a master csv file (named "master_individual.csv") exists containing the following information for each individual tweet:
- tweet ID, full text, sentiment score, date, hashtags
- location of origin data (city, state, place type, zip code, metropolitan area)
- data for the zip code from which the tweet originates (average Zillow Home Value Index (ZHVI), number of establishments in educational services, number of establishments in healthcare and social assistance, number of establishments in professional, scientific, and technical services, ground truth vaccine hesitancy, binarized ground truth vaccine hesitancy)

It is also assumed that a master csv file (named "master_aggregate.csv") exists containing the following information for each zip code:
- average and standard deviation of sentiment across all tweets originating from that zip code
- metropolitan area in which zip code is located
- zip code-level data (average Zillow Home Value Index (ZHVI), number of establishments in educational services, number of establishments in healthcare and social assistance, number of establishments in professional, scientific, and technical services, ground truth vaccine hesitancy, binarized ground truth vaccine hesitancy)

Note: When binarizing the ground truth vaccine hesitancy for each zip code, we used 0.70 as the cut-off (i.e. a continuous ground truth vaccine hesitancy of >= 0.70 corresponds to a 1 and all other cases results in 0). However, binarized ground truth vaccine hesitancy is ultimately not used in study.

In [18]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.metrics import mean_squared_error

In [19]:
master_individual = pd.read_csv('master_individual.csv')
master_aggregate = pd.read_csv('master_aggregate.csv')
text_only_tweets = pd.read_csv('text_only_tweets.csv')
text_and_hashtags_tweets = pd.read_csv('text_and_hashtags_tweets.csv')
hybrid_tweets = pd.read_csv('hybrid_tweets.csv')

# 1) Create ground truth vaccine hesitancy dataframe

This dataframe contains all the zip codes present in master_individual with their corresponding vaccine hesitancies.

In [20]:
baseline_vac_hes_per_zip_code = pd.DataFrame()
baseline_vac_hes_per_zip_code['zip_code'] = master_aggregate['zip_code'].values
baseline_vac_hes_per_zip_code['vac_hes'] = master_aggregate['vac_hes'].values
baseline_vac_hes_per_zip_code['vac_hes_bin'] = master_aggregate['vac_hes_bin'].values
baseline_vac_hes_per_zip_code['metropolitan_area'] = master_aggregate['metropolitan_area'].values
baseline_vac_hes_per_zip_code = baseline_vac_hes_per_zip_code.sort_values(by=['zip_code'])

In [21]:
print('Max vaccine hesitancy:', max(baseline_vac_hes_per_zip_code['vac_hes']))
print('Min vaccine hesitancy:', min(baseline_vac_hes_per_zip_code['vac_hes']))
print('Std vaccine hesitancy:', baseline_vac_hes_per_zip_code['vac_hes'].std())
print('Mean vaccine hesitancy:', baseline_vac_hes_per_zip_code['vac_hes'].mean())

Max vaccine hesitancy: 1.0
Min vaccine hesitancy: 0.0
Std vaccine hesitancy: 0.3343627664992047
Mean vaccine hesitancy: 0.24025403264754172


# 2) Create stratified split

It is necessary that that the tweets (tweet IDs) in the test set are the same across all text representations (text only, text and hashtags, and hybrid) in order to be able to compare predictive power of the models. Training will be done on different parts of the tweet (depending on the representation used), but but the testing should be done on the same tweet IDs.  

### 2.1) Text only 

In [22]:
text_only_tweets_copy = text_only_tweets.copy()
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(text_only_tweets_copy, text_only_tweets_copy['zip_code']):
    train_text_only = text_only_tweets_copy.iloc[train_index]
    test_text_only = text_only_tweets_copy.iloc[test_index]

In [23]:
train_text_only.reset_index(inplace=True, drop=True)
test_text_only.reset_index(inplace=True, drop=True)

print(train_text_only['zip_code'].nunique())
print(test_text_only['zip_code'].nunique())
print()
print(train_text_only['metropolitan_area'].value_counts() / len(train_text_only))
print(test_text_only['metropolitan_area'].value_counts() / len(test_text_only))
print()
print(train_text_only.shape)
print(test_text_only.shape)

493
493

NewYork         0.428074
LosAngeles      0.357464
Chicago         0.052406
Houston         0.051897
SanDiego        0.035942
Philadelphia    0.027837
Dallas          0.022575
Phoenix         0.016846
SanAntonio      0.006959
Name: metropolitan_area, dtype: float64
NewYork         0.428377
LosAngeles      0.357773
Chicago         0.052444
Houston         0.051935
SanDiego        0.036320
Philadelphia    0.027325
Dallas          0.022403
Phoenix         0.016802
SanAntonio      0.006619
Name: metropolitan_area, dtype: float64

(23566, 310)
(5892, 310)


### 2.2) Text + hashtags  

In [24]:
text_and_hashtags_tweets_copy = text_and_hashtags_tweets.copy()
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(text_and_hashtags_tweets_copy, text_and_hashtags_tweets_copy['zip_code']):
    train_text_and_hashtags = text_and_hashtags_tweets_copy.iloc[train_index]
    test_text_and_hashtags = text_and_hashtags_tweets_copy.iloc[test_index]

In [25]:
train_text_and_hashtags.reset_index(inplace=True, drop=True)
test_text_and_hashtags.reset_index(inplace=True, drop=True)

print(train_text_and_hashtags['zip_code'].nunique())
print(test_text_and_hashtags['zip_code'].nunique())
print()
print(train_text_and_hashtags['metropolitan_area'].value_counts() / len(train_text_and_hashtags))
print(test_text_and_hashtags['metropolitan_area'].value_counts() / len(test_text_and_hashtags))
print()
print(train_text_and_hashtags.shape)
print(test_text_and_hashtags.shape)

493
493

NewYork         0.428074
LosAngeles      0.357464
Chicago         0.052406
Houston         0.051897
SanDiego        0.035942
Philadelphia    0.027837
Dallas          0.022575
Phoenix         0.016846
SanAntonio      0.006959
Name: metropolitan_area, dtype: float64
NewYork         0.428377
LosAngeles      0.357773
Chicago         0.052444
Houston         0.051935
SanDiego        0.036320
Philadelphia    0.027325
Dallas          0.022403
Phoenix         0.016802
SanAntonio      0.006619
Name: metropolitan_area, dtype: float64

(23566, 310)
(5892, 310)


### 2.3) Hybrid  

In [26]:
hybrid_tweets_copy = hybrid_tweets.copy()
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(hybrid_tweets_copy, hybrid_tweets_copy['zip_code']):
    train_hybrid = hybrid_tweets_copy.iloc[train_index]
    test_hybrid = hybrid_tweets_copy.iloc[test_index]

In [27]:
train_hybrid.reset_index(inplace=True, drop=True)
test_hybrid.reset_index(inplace=True, drop=True)

print(train_hybrid['zip_code'].nunique())
print(test_hybrid['zip_code'].nunique())
print()
print(train_hybrid['metropolitan_area'].value_counts() / len(train_hybrid))
print(test_hybrid['metropolitan_area'].value_counts() / len(test_hybrid))
print()
print(train_hybrid.shape)
print(test_hybrid.shape)

493
493

NewYork         0.428074
LosAngeles      0.357464
Chicago         0.052406
Houston         0.051897
SanDiego        0.035942
Philadelphia    0.027837
Dallas          0.022575
Phoenix         0.016846
SanAntonio      0.006959
Name: metropolitan_area, dtype: float64
NewYork         0.428377
LosAngeles      0.357773
Chicago         0.052444
Houston         0.051935
SanDiego        0.036320
Philadelphia    0.027325
Dallas          0.022403
Phoenix         0.016802
SanAntonio      0.006619
Name: metropolitan_area, dtype: float64

(23566, 310)
(5892, 310)


### 2.4) Dataframe summary

At this point, there are 3 types of dataframes:
- text only (train_text_only and test_text_only)
- text and hashtags (train_text_and_hashtags and test_text_and_hashtags)
- hybrid (train_hybrid, test_hybrid)

The train and test sets contain the same tweet IDs, but not the same embedded information.

# 3) Create Baseline Predictions Arrays

### 3.1) Calculate mean vaccine hesitancies

In [28]:
mean_text_only_tweets = text_only_tweets['vac_hes'].mean()
mean_text_and_hashtags_tweets = text_and_hashtags_tweets['vac_hes'].mean()
mean_hybrid_tweets = hybrid_tweets['vac_hes'].mean()
mean_complete_dataset = mean_text_only_tweets

mean_baseline_vac_hes = baseline_vac_hes_per_zip_code['vac_hes'].mean()

mean_train_text_only = train_text_only['vac_hes'].mean()
mean_train_text_and_hashtags = train_text_and_hashtags['vac_hes'].mean()
mean_train_hybrid = train_hybrid['vac_hes'].mean()
mean_train_set = mean_train_text_only

print('Mean vac hes in text only:', mean_text_only_tweets)
print('Mean vac hes in text and hashtags:', mean_text_and_hashtags_tweets)
print('Mean vac hes in hybrid:', mean_hybrid_tweets)
print()
print('Mean vac hes in baseline vaccine hesitancy:', mean_baseline_vac_hes)
print()
print('Mean vac hes in train text only:', mean_train_text_only)
print('Mean vac hes in train text and hashtags:', mean_train_text_and_hashtags)
print('Mean vac hes in train hybrid:', mean_train_hybrid)

Mean vac hes in text only: 0.23486475768029166
Mean vac hes in text and hashtags: 0.23486475768029166
Mean vac hes in hybrid: 0.23486475768029166

Mean vac hes in baseline vaccine hesitancy: 0.24025403264754172

Mean vac hes in train text only: 0.23490204478068344
Mean vac hes in train text and hashtags: 0.23490204478068344
Mean vac hes in train hybrid: 0.23490204478068344


### 3.2) Create predictions arrays 

In [29]:
print(len(test_text_only), len(test_text_and_hashtags), len(test_hybrid))
pred_size = len(test_text_only)
print('pred_size:', pred_size)

pred_all_zero = np.full(pred_size, 0)
pred_all_one = np.full(pred_size, 1)
pred_all_half = np.full(pred_size, 0.5)
pred_mean_train_set = np.full(pred_size, mean_train_set)
pred_mean_entire_dataset = np.full(pred_size, mean_complete_dataset)
pred_mean_baseline_vac_hes = np.full(pred_size, mean_baseline_vac_hes)

5892 5892 5892
pred_size: 5892


# 4) Calculate Baseline Metrics

In [30]:
def calculate_rmse(true_values, pred):
    return np.sqrt(np.mean((pred-true_values)**2))

### 4.1) Individual tweets

In [31]:
test_labels = test_text_only['vac_hes'].copy()

rmse = calculate_rmse(test_labels, pred_all_zero)
rmse2 = np.sqrt(mean_squared_error(test_labels, pred_all_zero))
print('Baseline RMSE when predictions are all zero:', rmse, 'and', rmse2)

rmse = calculate_rmse(test_labels, pred_all_one)
rmse2 = np.sqrt(mean_squared_error(test_labels, pred_all_one))
print('Baseline RMSE when predictions are all one:', rmse, 'and', rmse2)

rmse = calculate_rmse(test_labels, pred_all_half)
rmse2 = np.sqrt(mean_squared_error(test_labels, pred_all_half))
print('Baseline RMSE when predictions are all 0.5:', rmse, 'and', rmse2)

rmse = calculate_rmse(test_labels, pred_mean_train_set)
rmse2 = np.sqrt(mean_squared_error(test_labels, pred_mean_train_set))
print('Baseline RMSE when predictions are all mean of train set:', rmse, 'and', rmse2)

rmse = calculate_rmse(test_labels, pred_mean_entire_dataset)
rmse2 = np.sqrt(mean_squared_error(test_labels, pred_mean_entire_dataset))
print('Baseline RMSE when predictions are all mean of entire dataset:', rmse, 'and', rmse2)

rmse = calculate_rmse(test_labels, pred_mean_baseline_vac_hes)
rmse2 = np.sqrt(mean_squared_error(test_labels, pred_mean_baseline_vac_hes))
print('Baseline RMSE when predictions are all mean of baseline vac hes:', rmse, 'and', rmse2)

Baseline RMSE when predictions are all zero: 0.39881213912013264 and 0.39881213912013264
Baseline RMSE when predictions are all one: 0.8304335484783919 and 0.8304335484783919
Baseline RMSE when predictions are all 0.5: 0.4175350289185266 and 0.4175350289185266
Baseline RMSE when predictions are all mean of train set: 0.3224278739229078 and 0.3224278739229078
Baseline RMSE when predictions are all mean of entire dataset: 0.32242785452010764 and 0.32242785452010764
Baseline RMSE when predictions are all mean of baseline vac hes: 0.32247538374608337 and 0.32247538374608337


### 4.2) Aggregated by zip code

In [32]:
pred_size = len(baseline_vac_hes_per_zip_code)
print('pred_size:', pred_size)

pred_all_zero = np.full(pred_size, 0)
pred_all_one = np.full(pred_size, 1)
pred_all_half = np.full(pred_size, 0.5)
pred_mean_train_set = np.full(pred_size, mean_train_set)
pred_mean_entire_dataset = np.full(pred_size, mean_complete_dataset)
pred_mean_baseline_vac_hes = np.full(pred_size, mean_baseline_vac_hes)

pred_size: 493


In [33]:
labels = baseline_vac_hes_per_zip_code['vac_hes']

rmse = calculate_rmse(labels, pred_all_zero)
rmse2 = np.sqrt(mean_squared_error(labels, pred_all_zero))
print('Baseline RMSE when predictions are all zero:', rmse, 'and', rmse2)

rmse = calculate_rmse(labels, pred_all_one)
rmse2 = np.sqrt(mean_squared_error(labels, pred_all_one))
print('Baseline RMSE when predictions are all one:', rmse, 'and', rmse2)

rmse = calculate_rmse(labels, pred_all_half)
rmse2 = np.sqrt(mean_squared_error(labels, pred_all_half))
print('Baseline RMSE when predictions are all 0.5:', rmse, 'and', rmse2)

rmse = calculate_rmse(labels, pred_mean_train_set)
rmse2 = np.sqrt(mean_squared_error(labels, pred_mean_train_set))
print('Baseline RMSE when predictions are all mean of train set:', rmse, 'and', rmse2)

rmse = calculate_rmse(labels, pred_mean_entire_dataset)
rmse2 = np.sqrt(mean_squared_error(labels, pred_mean_entire_dataset))
print('Baseline RMSE when predictions are all mean of entire dataset:', rmse, 'and', rmse2)

rmse = calculate_rmse(labels, pred_mean_baseline_vac_hes)
rmse2 = np.sqrt(mean_squared_error(labels, pred_mean_baseline_vac_hes))
print('Baseline RMSE when predictions are all mean of baseline vac hes:', rmse, 'and', rmse2)

Baseline RMSE when predictions are all zero: 0.4114531420478382 and 0.4114531420478382
Baseline RMSE when predictions are all one: 0.8299310952157144 and 0.8299310952157144
Baseline RMSE when predictions are all 0.5: 0.42313077819215283 and 0.42313077819215283
Baseline RMSE when predictions are all mean of train set: 0.33406635818615366 and 0.33406635818615366
Baseline RMSE when predictions are all mean of entire dataset: 0.3340669576332044 and 0.3340669576332044
Baseline RMSE when predictions are all mean of baseline vac hes: 0.3340234840510956 and 0.3340234840510956


# 5) Standardize numerical data

In [34]:
# function to split up dataframe into columns that need to be standardized
# and those that do not
def split_dataframe(df):
    # columns to be standardized
    col_names = ['avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                'num_est_prof_sci_tech_serv','sentiment']
    df_std = df[col_names]
    df_no_change = df.drop(columns=col_names)
    return df_std, df_no_change

In [35]:
# function to standardize certain columns in dataframe
def standardize(df):
    # columns to be standardized
    col_names = ['avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                'num_est_prof_sci_tech_serv','sentiment']
    df_std = StandardScaler().fit_transform(df)
    df_std = pd.DataFrame(df_std, columns=col_names)
    return df_std

### 5.1) Text only

In [36]:
train_text_only_copy = train_text_only.copy()
train_text_only_std, train_text_only_no_change = split_dataframe(train_text_only_copy)
train_text_only_std = standardize(train_text_only_std)
train_text_only = train_text_only_no_change.join(train_text_only_std)

test_text_only_copy = test_text_only.copy()
test_text_only_std, test_text_only_no_change = split_dataframe(test_text_only_copy)
test_text_only_std = standardize(test_text_only_std)
test_text_only = test_text_only_no_change.join(test_text_only_std)

### 5.2) Text + hashtags

In [37]:
train_text_and_hashtags_copy = train_text_and_hashtags.copy()
train_text_and_hashtags_std, train_text_and_hashtags_no_change = split_dataframe(train_text_and_hashtags_copy)
train_text_and_hashtags_std = standardize(train_text_and_hashtags_std)
train_text_and_hashtags = train_text_and_hashtags_no_change.join(train_text_and_hashtags_std)

test_text_and_hashtags_copy = test_text_and_hashtags.copy()
test_text_and_hashtags_std, test_text_and_hashtags_no_change = split_dataframe(test_text_and_hashtags_copy)
test_text_and_hashtags_std = standardize(test_text_and_hashtags_std)
test_text_and_hashtags = test_text_and_hashtags_no_change.join(test_text_and_hashtags_std)

### 5.3) Hybrid

In [38]:
train_hybrid_copy = train_hybrid.copy()
train_hybrid_std, train_hybrid_no_change = split_dataframe(train_hybrid_copy)
train_hybrid_std = standardize(train_hybrid_std)
train_hybrid = train_hybrid_no_change.join(train_hybrid_std)

test_hybrid_copy = test_hybrid.copy()
test_hybrid_std, test_hybrid_no_change = split_dataframe(test_hybrid_copy)
test_hybrid_std = standardize(test_hybrid_std)
test_hybrid = test_text_only_no_change.join(test_hybrid_std)

# 6) Build baseline models

Model input data can be adjusted by droppping appropriate columns in the train and test datasets.

In [39]:
train_text_only_copy = train_text_only.copy()
# train_input = train_text_only_copy.drop(columns=['avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
#                                                 'num_est_prof_sci_tech_serv', 'sentiment', 'id','zip_code',
#                                                 'metropolitan_area','vac_hes','vac_hes_bin'])
train_input = train_text_only_copy[['avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                                    'num_est_prof_sci_tech_serv', 'sentiment']]

train_labels = train_text_only_copy['vac_hes']

test_text_only_copy = test_text_only.copy()
# test_input = test_text_only_copy.drop(columns=['avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
#                                                 'num_est_prof_sci_tech_serv', 'sentiment','id','zip_code',
#                                                 'metropolitan_area','vac_hes','vac_hes_bin'])
test_input = test_text_only_copy[['avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                                    'num_est_prof_sci_tech_serv', 'sentiment']]
test_labels = test_text_only_copy['vac_hes']

Model type can be adjusted.

In [40]:
model = LinearRegression()
# model = SVR(kernel='rbf')

folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, train_input, train_labels, scoring='neg_root_mean_squared_error', cv=folds)
print('cross validation scores:', scores)
print('average cross validation score:', scores.mean())

model = model.fit(train_input, train_labels)
model_pred = model.predict(test_input)

cross validation scores: [-0.3065724  -0.3124583  -0.30723302 -0.30314838 -0.31227854]
average cross validation score: -0.308338127397242


Calculate tweet-level RMSE two ways

In [41]:
rmse = calculate_rmse(test_labels, model_pred)
print('RMSE:', rmse)
rmse = np.sqrt(mean_squared_error(test_labels, model_pred))
print('RMSE:', rmse)

RMSE: 0.30793025992382783
RMSE: 0.30793025992382783


Calculate zip code-level RMSE

In [42]:
model_pred_df = pd.DataFrame(model_pred, columns=['vac_hes_pred'])
model_pred_df['zip_code'] = test_text_only['zip_code']

# dataframe with all the predictions grouped by zip code (taking average of predictions)
model_pred_df_zip_code_mean = model_pred_df.groupby(['zip_code'], as_index=False).mean()
model_pred_df_zip_code_mean = model_pred_df_zip_code_mean.sort_values(by=['zip_code'])
print(model_pred_df_zip_code_mean.shape)

# the baseline dataframe is already sorted by zip code
# now, both the model predictions and the baseline vaccine hesitancies are sorted by zip code
true_labels = baseline_vac_hes_per_zip_code['vac_hes']
pred = model_pred_df_zip_code_mean['vac_hes_pred']

rmse = calculate_rmse(true_labels, pred)
rmse2 = np.sqrt(mean_squared_error(true_labels, pred))
print('RMSE of predictions grouped by zip code:', rmse, 'and', rmse2)

(493, 2)
RMSE of predictions grouped by zip code: 0.3445654066728429 and 0.3445654066728429


# 7) Build Machine Learning Models 

### 7.1) Text only

Model input data can be adjusted by droppping appropriate columns in the train and test datasets.

In [43]:
# train_input = train_text_only.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
#                                                     'zip_code', 'id', 'avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
#                                                     'num_est_prof_sci_tech_serv', 'sentiment'])

train_input = train_text_only.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
                                                    'zip_code', 'id', 'sentiment'])


train_labels = train_text_only['vac_hes'].copy()

# test_input = test_text_only.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
#                                                     'zip_code', 'id', 'avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
#                                                     'num_est_prof_sci_tech_serv', 'sentiment'])

test_input = test_text_only.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
                                                    'zip_code', 'id', 'sentiment'])


test_labels = test_text_only['vac_hes'].copy()

Model type can be adjusted.

In [44]:
# model = LinearRegression()
# model = LinearSVR(max_iter=4000, random_state=42)
# model = SGDRegressor(random_state=42)
model = SVR(kernel='rbf')

folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, train_input, train_labels, scoring='neg_root_mean_squared_error', cv=folds)
print('cross validation scores:', scores)
print('average cross validation score:', scores.mean())

model = model.fit(train_input, train_labels)
model_pred = model.predict(test_input)

cross validation scores: [-0.20539328 -0.2127378  -0.21148047 -0.20088998 -0.21352056]
average cross validation score: -0.2088044167729281


#### 7.1.1) Compute metrics on individual predictions

In [45]:
rmse = calculate_rmse(test_labels, model_pred)
print('RMSE:', rmse)
rmse = np.sqrt(mean_squared_error(test_labels, model_pred))
print('RMSE:', rmse)

RMSE: 0.20624460462988284
RMSE: 0.20624460462988284


#### 7.1.2) Compute metrics on predictions aggregated by zip code

In [46]:
model_pred_df = pd.DataFrame(model_pred, columns=['vac_hes_pred'])
model_pred_df['zip_code'] = test_text_only['zip_code']

In [47]:
# dataframe with all the predictions grouped by zip code (taking average of predictions)
model_pred_df_zip_code_mean = model_pred_df.groupby(['zip_code'], as_index=False).mean()
model_pred_df_zip_code_mean = model_pred_df_zip_code_mean.sort_values(by=['zip_code'])
model_pred_df_zip_code_mean['metropolitan_area'] = baseline_vac_hes_per_zip_code['metropolitan_area']
model_pred_df_zip_code_mean['vac_hes_true'] = baseline_vac_hes_per_zip_code['vac_hes']

In [48]:
# the baseline dataframe is already sorted by zip code
# now, both the model predictions and the baseline vaccine hesitancies are sorted by zip code
true_labels = baseline_vac_hes_per_zip_code['vac_hes']
pred = model_pred_df_zip_code_mean['vac_hes_pred']

rmse = calculate_rmse(true_labels, pred)
rmse2 = np.sqrt(mean_squared_error(true_labels, pred))
print('RMSE of predictions grouped by zip code:', rmse, 'and', rmse2)

RMSE of predictions grouped by zip code: 0.3121243534981443 and 0.3121243534981443


### 7.2) Text + hashtags

Model input data can be adjusted by droppping appropriate columns in the train and test datasets.

In [49]:
train_input = train_text_and_hashtags.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
                                                    'zip_code', 'id', 'avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                                                    'num_est_prof_sci_tech_serv'])
train_labels = train_text_and_hashtags['vac_hes'].copy()

test_input = test_text_and_hashtags.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
                                                    'zip_code', 'id', 'avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                                                    'num_est_prof_sci_tech_serv'])
test_labels = test_text_and_hashtags['vac_hes'].copy()

Model type can be adjusted.

In [50]:
# model = LinearRegression()
# model = LinearSVR(max_iter=4000, random_state=42)
# model = SGDRegressor(random_state=42)
model = SVR(kernel='rbf')

folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, train_input, train_labels, scoring='neg_root_mean_squared_error', cv=folds)
print('cross validation scores:', scores)
print('average cross validation score:', scores.mean())

model = model.fit(train_input, train_labels)
model_pred = model.predict(test_input)

# model = LinearRegression().fit(train_input, train_labels)
# model_pred = model.predict(test_input)
# print('total number of predictions:', len(model_pred))

cross validation scores: [-0.28644405 -0.29230373 -0.29031964 -0.28354353 -0.29331877]
average cross validation score: -0.2891859422196078


#### 7.2.1) Compute metrics on individual predictions

In [51]:
rmse = calculate_rmse(test_labels, model_pred)
print('RMSE:', rmse)
rmse = np.sqrt(mean_squared_error(test_labels, model_pred))
print('RMSE:', rmse)

RMSE: 0.2906366503544305
RMSE: 0.2906366503544305


#### 7.2.2) Compute metrics on predictions aggregated by zip code

In [52]:
model_pred_df = pd.DataFrame(model_pred, columns=['vac_hes_pred'])
model_pred_df['zip_code'] = test_text_and_hashtags['zip_code']

In [53]:
# dataframe with all the predictions grouped by zip code (taking average of predictions)
model_pred_df_zip_code_mean = model_pred_df.groupby(['zip_code'], as_index=False).mean()
model_pred_df_zip_code_mean = model_pred_df_zip_code_mean.sort_values(by=['zip_code'])

In [54]:
# the baseline dataframe is already sorted by zip code
# now, both the model predictions and the baseline vaccine hesitancies are sorted by zip code
true_labels = baseline_vac_hes_per_zip_code['vac_hes']
pred = model_pred_df_zip_code_mean['vac_hes_pred']

rmse = calculate_rmse(true_labels, pred)
rmse2 = np.sqrt(mean_squared_error(true_labels, pred))
print('RMSE of predictions grouped by zip code:', rmse, 'and', rmse2)

RMSE of predictions grouped by zip code: 0.3345441658461021 and 0.3345441658461021


### 7.3) Hybrid

Model input data can be adjusted by droppping appropriate columns in the train and test datasets.

In [55]:
train_input = train_hybrid.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
                                                    'zip_code', 'id', 'avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                                                    'num_est_prof_sci_tech_serv'])
train_labels = train_hybrid['vac_hes'].copy()

test_input = test_hybrid.copy().drop(columns=['vac_hes', 'metropolitan_area', 'vac_hes_bin',
                                                    'zip_code', 'id', 'avg_zhvi','num_est_educ_serv','num_est_healthcare_social_assist',
                                                    'num_est_prof_sci_tech_serv'])
test_labels = test_hybrid['vac_hes'].copy()

Model type can be adjusted.

In [56]:
# model = LinearRegression()
# model = LinearSVR(max_iter=4000, random_state=42)
# model = SGDRegressor(random_state=42)
model = SVR(kernel='rbf')

folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, train_input, train_labels, scoring='neg_root_mean_squared_error', cv=folds)
print('cross validation scores:', scores)
print('average cross validation score:', scores.mean())

model = model.fit(train_input, train_labels)
model_pred = model.predict(test_input)

cross validation scores: [-0.29570065 -0.299765   -0.29706717 -0.2896552  -0.30104773]
average cross validation score: -0.2966471500790989


#### 7.3.1) Compute metrics on individual predictions

In [57]:
rmse = calculate_rmse(test_labels, model_pred)
print('RMSE:', rmse)
rmse = np.sqrt(mean_squared_error(test_labels, model_pred))
print('RMSE:', rmse)

RMSE: 0.29926414379679095
RMSE: 0.29926414379679095


#### 7.3.2) Compute metrics on predictions aggregated by zip code

In [58]:
model_pred_df = pd.DataFrame(model_pred, columns=['vac_hes_pred'])
model_pred_df['zip_code'] = test_hybrid['zip_code']

In [59]:
# dataframe with all the predictions grouped by zip code (taking average of predictions)
model_pred_df_zip_code_mean = model_pred_df.groupby(['zip_code'], as_index=False).mean()
model_pred_df_zip_code_mean = model_pred_df_zip_code_mean.sort_values(by=['zip_code'])

In [60]:
# the baseline dataframe is already sorted by zip code
# now, both the model predictions and the baseline vaccine hesitancies are sorted by zip code
true_labels = baseline_vac_hes_per_zip_code['vac_hes']
pred = model_pred_df_zip_code_mean['vac_hes_pred']

rmse = calculate_rmse(true_labels, pred)
rmse2 = np.sqrt(mean_squared_error(true_labels, pred))
print('RMSE of predictions grouped by zip code:', rmse, 'and', rmse2)

RMSE of predictions grouped by zip code: 0.33104134739951124 and 0.33104134739951124
