### QUESTION 6: 

We define three time periods and their corresponding window lengths as follows:
1. Before Feb. 1, 8:00 a.m.: 1-hour window
2. Between Feb. 1, 8:00 a.m. and 8:00 p.m.: 5-minute window 
3. After Feb. 1, 8:00 p.m.: 1-hour window

For each hashtag, train 3 regression models, one for each of these time periods (the times are all in PST). Report the MSE and R-squared score for each case.
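
The period boundaries appear in the code as Unix timestamps. As a quick sanity check, the sketch below derives them from the PST wall-clock times. It assumes the year is 2015 (consistent with the constants in the code) and a fixed UTC-8 offset, since daylight saving is not in effect on Feb. 1. The names `PST`, `to_unix`, `TP1_END` and `TP2_END` are illustrative and not part of the original notebook.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is a fixed UTC-8 offset (no daylight saving on Feb. 1).
PST = timezone(timedelta(hours=-8), "PST")

def to_unix(year, month, day, hour):
    """Return the Unix timestamp of a PST wall-clock time."""
    return int(datetime(year, month, day, hour, tzinfo=PST).timestamp())

# Assumption: the year is 2015, which matches the constants used in the code.
TP1_END = to_unix(2015, 2, 1, 8)    # Feb. 1, 8:00 a.m. PST
TP2_END = to_unix(2015, 2, 1, 20)   # Feb. 1, 8:00 p.m. PST

print(TP1_END, TP2_END)
print((TP2_END - TP1_END) // 300, "five-minute windows in period 2")
```

The 12-hour middle period splits into 144 five-minute windows, which agrees with the `No. Observations: 144` lines in the period 2 output.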


In [1]:
hash_tags = ['#gohawks','#gopatriots','#nfl','#patriots','#sb49','#superbowl']

In [2]:
import pickle

def save_object(data, fileName):
    with open('pynb_data/'+fileName + ".pickle", 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        
def load_object(fileName):
    try:
        with open('pynb_data/'+fileName + ".pickle", 'rb') as f:
            data = pickle.load(f)
            return data
    except IOError:
        print("Could not read file: " + fileName)

In [3]:
import json

def getMinAndMaxTs(tag):
    filename = 'data/tweets_'+tag+'.txt'
    max_ts = 0
    min_ts = float('inf')  # sentinel: any real timestamp is smaller
    with open(filename) as f:
        for line in f:
            json_object = json.loads(line)
            timestamp = json_object['citation_date']
            if(timestamp < min_ts):                
                min_ts = timestamp
            
            if(timestamp > max_ts):
                max_ts = timestamp
                
    return [min_ts,max_ts]

tagsToMinTs = {}
tagsToMaxTs = {}
for tag in hash_tags:
    ts_list = getMinAndMaxTs(tag)
    tagsToMinTs[tag] = (ts_list[0])
    tagsToMaxTs[tag] = (ts_list[1])    

In [4]:
import math
import datetime
import pytz


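def getLocalHour(timestamp):
    # Convert the Unix timestamp directly to Pacific time. Calling
    # fromtimestamp() without a tz uses the machine's local zone instead.
    pst = pytz.timezone('America/Los_Angeles')
    d = datetime.datetime.fromtimestamp(timestamp, pst)
    return d.hour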

def getWindowNumber(start_ts, curr_ts, window):
    elapsed = (curr_ts - start_ts)/(window*1.0)
    windowNum = math.ceil(elapsed)
    return windowNum    

def getFeatures(tag,start_ts,end_ts,window):
    windowToTweets = {}
    windowToRetweets = {}
    windowToFollowerCount = {}
    windowToMaxFollowers = {}
    features = []
    labels = []
    
    filename = 'data/tweets_'+tag+'.txt'
    with open(filename) as f:
        for line in f:
            json_object = json.loads(line)
            timestamp = json_object['citation_date']
            
            if timestamp < start_ts or timestamp > end_ts:                            
                continue
                
            key = getWindowNumber(start_ts,timestamp,window)

            if key not in windowToTweets.keys():
                windowToTweets[key]=0
            windowToTweets[key]+=1
            
            retweetCount = json_object['metrics']['citations']['total']        
            
            if key not in windowToRetweets.keys():
                windowToRetweets[key]=0
            windowToRetweets[key]+=retweetCount
        
            followerCount = json_object['author']['followers']
            if key not in windowToFollowerCount.keys():
                windowToFollowerCount[key]=0
            windowToFollowerCount[key]+=followerCount
        
            if key not in windowToMaxFollowers.keys():
                windowToMaxFollowers[key]=0
            windowToMaxFollowers[key] = max(windowToMaxFollowers[key],followerCount)            
            
        for period in range(start_ts,end_ts,window):
            key = getWindowNumber(start_ts,period,window)
            tweetCount = windowToTweets.get(key, 0)
            retweetCount = windowToRetweets.get(key,0)
            followerCount = windowToFollowerCount.get(key,0)
            maxFollowers = windowToMaxFollowers.get(key,0)

            h = getLocalHour(period)
            
            feature = [tweetCount, retweetCount, followerCount, maxFollowers, h]
            features.append(feature)
                
            nextKey = getWindowNumber(start_ts, period + window, window)
            labels.append(windowToTweets.get(nextKey,0))
                
    return features,labels
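
To make the bucketing in `getWindowNumber` concrete, here is a small standalone sketch of the same ceiling rule applied to toy timestamps. The function name `window_number` and the sample values are illustrative only. Under this rule, window k collects timestamps in the interval (start + (k-1)·window, start + k·window], and a timestamp exactly equal to `start` lands in window 0.

```python
import math

def window_number(start_ts, curr_ts, window):
    """Same ceiling rule as getWindowNumber, restated for illustration."""
    return math.ceil((curr_ts - start_ts) / window)

start, window = 0, 3600  # toy values: start at 0 with 1-hour windows
samples = [0, 1, 3600, 3601, 7200]
buckets = {ts: window_number(start, ts, window) for ts in samples}
print(buckets)
```

Because the feature loop in `getFeatures` evaluates the key at `period = start_ts + k*window`, the features in row k describe the window that ends at `period`, and the label is the tweet count of the window that follows it.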

In [5]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

print("Linear Regression models for time period 1")

for tag in hash_tags:
    tp1_window_size = 3600 # 1 hour window size
    
    #find the start_ts based on minimum time for this tag
    tp1_start_ts = tp1_window_size * math.floor(tagsToMinTs[tag]/(tp1_window_size*1.0))
    tp1_end_ts = 1422806400
    features,labels = getFeatures(tag,tp1_start_ts,tp1_end_ts,tp1_window_size)
    
    print('\nLinear Regression Model for {}'.format(tag))
   
    X_orig = features
    y = labels
    
    #     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
    X = sm.add_constant(X_orig)
    
    model = sm.OLS(y,X)
    results = model.fit()
    pred_y = results.predict(X)

    print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
    print("R-squared : {}".format(results.rsquared))    
    print("P values for the features are \n {} \n\n".format(results.pvalues))

    print(results.summary())
    print('---'*20)
    print('\n\n')
    
    save_object(features, "q6_tp1_features_{}".format(tag))
    save_object(labels, "q6_tp1_labels_{}".format(tag))


Linear Regression models for time period 1

Linear Regression Model for #gohawks

MSE : 715120.8608315315
R-squared : 0.3047777689514175
P values for the features are 
 [3.36075078e-01 2.75387335e-12 3.36800976e-04 3.36224274e-03
 6.59176417e-02 3.24118770e-01] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.305
Model:                            OLS   Adj. R-squared:                  0.297
Method:                 Least Squares   F-statistic:                     38.05
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           2.28e-32
Time:                        12:06:16   Log-Likelihood:                -3590.0
No. Observations:                 440   AIC:                             7192.
Df Residuals:                     434   BIC:                             7216.
Df Model:                           5                                         
Covariance Type:            nonrobust


Linear Regression Model for #sb49

MSE : 6885.556248416906
R-squared : 0.8677596629510165
P values for the features are 
 [1.71555069e-02 7.58140847e-92 5.46570391e-06 2.08563720e-03
 3.30013260e-01 4.20294474e-03] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.868
Model:                            OLS   Adj. R-squared:                  0.866
Method:                 Least Squares   F-statistic:                     564.3
Date:                Wed, 20 Mar 2019   Prob (F-statistic):          2.40e-186
Time:                        12:08:04   Log-Likelihood:                -2545.2
No. Observations:                 436   AIC:                             5102.
Df Residuals:                     430   BIC:                             5127.
Df Model:                           5                                         
Covariance Type:            nonrobust                                 
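
The MSE and R-squared printed above are computed on the same data the model was fit to, so they are in-sample figures. The sketch below restates how the two quantities are defined, using plain Python and made-up numbers. The helper names `mse` and `r_squared` are illustrative and are not statsmodels functions.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot, with SS_tot centered on the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```

Since MSE scales with the square of the tweet counts per window, values from the 1-hour and 5-minute periods are not directly comparable, whereas R-squared is unitless.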

In [10]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

print("Linear Regression models for time period 2")

#tp2
for tag in hash_tags:
    tp2_window_size = 300 # 5 minute window size
    tp2_start_ts = 1422806400
    tp2_end_ts = 1422849600
    features,labels = getFeatures(tag,tp2_start_ts,tp2_end_ts,tp2_window_size)
    
    print('\nLinear Regression Model for {}'.format(tag))
    X_orig = features
    y = labels
    
    #     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
    X = sm.add_constant(X_orig)
    
    model = sm.OLS(y,X)
    results = model.fit()
    pred_y = results.predict(X)

    print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
    print("R-squared : {}".format(results.rsquared))    
    print("P values for the features are \n {} \n\n".format(results.pvalues))

    print(results.summary())
    print('---'*20)
    print('\n\n')
    
    save_object(features, "q6_tp2_features_{}".format(tag))
    save_object(labels, "q6_tp2_labels_{}".format(tag))
    print("Finished {}".format(tag))

Linear Regression models for time period 2

Linear Regression Model for #gohawks

MSE : 72065.96223196147
R-squared : 0.4928205826269495
P values for the features are 
 [0.2745786  0.00168241 0.82192757 0.13860422 0.17855737 0.02367774] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.493
Model:                            OLS   Adj. R-squared:                  0.474
Method:                 Least Squares   F-statistic:                     26.82
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           7.10e-19
Time:                        12:22:18   Log-Likelihood:                -1009.7
No. Observations:                 144   AIC:                             2031.
Df Residuals:                     138   BIC:                             2049.
Df Model:                           5                                         
Covariance Type:            nonrobust            


Linear Regression Model for #sb49

MSE : 1311055.597774237
R-squared : 0.8644049008009105
P values for the features are 
 [5.83113296e-01 1.71233396e-36 5.52950901e-01 7.60020290e-01
 3.73008733e-01 8.77794517e-01] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.864
Model:                            OLS   Adj. R-squared:                  0.859
Method:                 Least Squares   F-statistic:                     175.9
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           4.76e-58
Time:                        12:23:56   Log-Likelihood:                -1218.5
No. Observations:                 144   AIC:                             2449.
Df Residuals:                     138   BIC:                             2467.
Df Model:                           5                                         
Covariance Type:            nonrobust                                 

In [11]:
print("Linear Regression models for time period 3")

#tp3
for tag in hash_tags:
    print("Started building feature vector for {}".format(tag))
    tp3_window_size = 3600 # 1 hour window size
    tp3_start_ts = 1422849600
    tp3_end_ts = tp3_window_size * math.ceil(tagsToMaxTs[tag]/(tp3_window_size*1.0))
    features,labels = getFeatures(tag,tp3_start_ts,tp3_end_ts,tp3_window_size)
    
    print('\nLinear Regression Model for {}'.format(tag))
    X_orig = features
    y = labels
    
    #     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
    X = sm.add_constant(X_orig)
    
    model = sm.OLS(y,X)
    results = model.fit()
    pred_y = results.predict(X)

    print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
    print("R-squared : {}".format(results.rsquared))    
    print("P values for the features are \n {} \n\n".format(results.pvalues))

    print(results.summary())
    print('---'*20)
    print('\n\n')
    
    save_object(features, "q6_tp3_features_{}".format(tag))
    save_object(labels, "q6_tp3_labels_{}".format(tag))
    print("Finished {}".format(tag))

Linear Regression models for time period 3
Started building feature vector for #gohawks

Linear Regression Model for #gohawks

MSE : 8974.299492954375
R-squared : 0.49458221714108497
P values for the features are 
 [0.97537044 0.68370654 0.12340042 0.01034152 0.01459041 0.26720334] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.495
Model:                            OLS   Adj. R-squared:                  0.474
Method:                 Least Squares   F-statistic:                     23.68
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           1.53e-16
Time:                        12:25:34   Log-Likelihood:                -758.19
No. Observations:                 127   AIC:                             1528.
Df Residuals:                     121   BIC:                             1545.
Df Model:                           5                                         
Covariance Type:            nonrobust


Linear Regression Model for #sb49

MSE : 376815.3045315723
R-squared : 0.40977482771752705
P values for the features are 
 [0.51368866 0.001652   0.28473384 0.56274336 0.96774985 0.49516809] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.410
Model:                            OLS   Adj. R-squared:                  0.387
Method:                 Least Squares   F-statistic:                     17.91
Date:                Wed, 20 Mar 2019   Prob (F-statistic):           1.85e-13
Time:                        12:27:17   Log-Likelihood:                -1058.2
No. Observations:                 135   AIC:                             2128.
Df Residuals:                     129   BIC:                             2146.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
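
MSE is expressed in squared tweets per window, which makes the raw values above hard to read. One option, not part of the original analysis, is to take the square root (RMSE) of the figures already reported. The sketch below does this for the two hashtags whose output is shown; the names `reported_mse` and `rmse` are illustrative.

```python
import math

# MSE values copied from the outputs above: (hashtag, period) -> MSE.
reported_mse = {
    ("#gohawks", 1): 715120.8608315315,
    ("#sb49", 1): 6885.556248416906,
    ("#gohawks", 2): 72065.96223196147,
    ("#sb49", 2): 1311055.597774237,
    ("#gohawks", 3): 8974.299492954375,
    ("#sb49", 3): 376815.3045315723,
}

# RMSE is in tweets per window (1 hour for periods 1 and 3, 5 minutes for period 2).
rmse = {key: math.sqrt(value) for key, value in reported_mse.items()}

# Print grouped by period, then by hashtag.
for (tag, period), value in sorted(rmse.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print("period {} {:<10} RMSE = {:.1f}".format(period, tag, value))
```

Even as RMSE, the period 2 values refer to 5-minute windows, so they should be compared with periods 1 and 3 only with that difference in mind.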
               