### Linear regression
Create time windows from the data to extract features. Here, use 1-hour time windows (00:00 - 01:00 am, 01:00 - 02:00 am, etc.) and calculate the features in each time window, resulting in
<# of hours> data points.
For each hashtag data file, fit a linear regression model that uses the following 5 features, extracted from the tweets in the previous hour, to predict the number of tweets in the next hour.
The features you should use are:
* Number of tweets
* Total number of retweets
* Sum of the number of followers of the users posting the hashtag
* Maximum number of followers of the users posting the hashtag
* Time of the day (which could take 24 values that represent hours of the day with respect to a given time zone)
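
The windowing and the five features above can be sketched on a few toy records. This is only an illustrative variant, not the notebook's implementation: the tuple layout, the fixed UTC-8 offset, and the name `hourly_features` are assumptions, and it assigns tweets to windows by floor division and drops the final window, which has no next-hour label.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = 3600  # one hour, in seconds
PST = timezone(timedelta(hours=-8))  # assumption: fixed UTC-8 offset, no DST handling

# Hypothetical toy records: (unix timestamp, retweet count, follower count)
tweets = [
    (1421222400, 2, 100),  # 2015-01-14 00:00:00 at UTC-8
    (1421223000, 0, 50),   # ten minutes later, same hour
    (1421226500, 5, 900),  # falls in the second hour
]

def hourly_features(tweets, window=WINDOW):
    """Aggregate tweets into fixed windows and build (features, labels)."""
    start = (min(t[0] for t in tweets) // window) * window
    end = (max(t[0] for t in tweets) // window + 1) * window
    # Per-window accumulators: [tweet count, retweets, follower sum, follower max]
    stats = defaultdict(lambda: [0, 0, 0, 0])
    for ts, retweets, followers in tweets:
        s = stats[(ts - start) // window]
        s[0] += 1
        s[1] += retweets
        s[2] += followers
        s[3] = max(s[3], followers)
    n_windows = (end - start) // window
    features, labels = [], []
    for k in range(n_windows - 1):  # the last window has no "next hour" label
        hour = datetime.fromtimestamp(start + k * window, tz=PST).hour
        features.append(stats[k] + [hour])
        labels.append(stats[k + 1][0])  # target: tweet count of the following window
    return features, labels

features, labels = hourly_features(tweets)
```

Each row of `features` holds the five values in the order listed above, and the matching entry of `labels` is the tweet count of the following hour.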

In [1]:
hash_tags = ['#gohawks','#gopatriots','#nfl','#patriots','#sb49','#superbowl']

In [2]:
import pickle

def save_object(data, fileName):
    with open('pynb_data/'+fileName + ".pickle", 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        
def load_object(fileName):
    try:
        with open('pynb_data/'+fileName + ".pickle", 'rb') as f:
            data = pickle.load(f)
            return data
    except IOError:
        print("Could not read file: " + fileName)

In [3]:
import json

def getMinAndMaxTs(tag):
    filename = 'data/tweets_'+tag+'.txt'
    max_ts = 0
    min_ts = float('inf')  # sentinel: any real timestamp is smaller
    with open(filename) as f:
        for line in f:
            json_object = json.loads(line)
            timestamp = json_object['citation_date']
            if(timestamp < min_ts):                
                min_ts = timestamp
            
            if(timestamp > max_ts):
                max_ts = timestamp
                
    return [min_ts,max_ts]

tagsToMinTs = {}
tagsToMaxTs = {}
for tag in hash_tags:
    ts_list = getMinAndMaxTs(tag)
    tagsToMinTs[tag] = (ts_list[0])
    tagsToMaxTs[tag] = (ts_list[1])    

In [4]:
import math
import datetime
import pytz


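def getLocalHour(timestamp):
    # Convert the Unix timestamp directly to Pacific time. Calling localize() on a
    # naive fromtimestamp() result only relabels the machine's local time as Pacific.
    pst = pytz.timezone('America/Los_Angeles')
    d = datetime.datetime.fromtimestamp(timestamp, pst)
    return d.hour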

def getWindowNumber(start_ts, curr_ts, window):
    elapsed = (curr_ts - start_ts)/(window*1.0)
    windowNum = math.ceil(elapsed)
    return windowNum    

def getFeatures(tag,start_ts,end_ts,window):
    windowToTweets = {}
    windowToRetweets = {}
    windowToFollowerCount = {}
    windowToMaxFollowers = {}
    features = []
    labels = []
    
    filename = 'data/tweets_'+tag+'.txt'
    with open(filename) as f:
        for line in f:
            json_object = json.loads(line)
            timestamp = json_object['citation_date']
            
            if timestamp < start_ts or timestamp > end_ts:                            
                continue
                
            key = getWindowNumber(start_ts,timestamp,window)

            if key not in windowToTweets.keys():
                windowToTweets[key]=0
            windowToTweets[key]+=1
            
            retweetCount = json_object['metrics']['citations']['total']        
            
            if key not in windowToRetweets.keys():
                windowToRetweets[key]=0
            windowToRetweets[key]+=retweetCount
        
            followerCount = json_object['author']['followers']
            if key not in windowToFollowerCount.keys():
                windowToFollowerCount[key]=0
            windowToFollowerCount[key]+=followerCount
        
            if key not in windowToMaxFollowers.keys():
                windowToMaxFollowers[key]=0
            windowToMaxFollowers[key] = max(windowToMaxFollowers[key],followerCount)            
            
        for period in range(start_ts,end_ts,window):
            key = getWindowNumber(start_ts,period,window)
            tweetCount = windowToTweets.get(key, 0)
            retweetCount = windowToRetweets.get(key,0)
            followerCount = windowToFollowerCount.get(key,0)
            maxFollowers = windowToMaxFollowers.get(key,0)

            h = getLocalHour(period)
            
            feature = [tweetCount, retweetCount, followerCount, maxFollowers, h]
            features.append(feature)
                
            nextKey = getWindowNumber(start_ts, period + window, window)
            labels.append(windowToTweets.get(nextKey,0))
                
    return features,labels

In [6]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

print("Linear Regression models for time period 1")

for tag in hash_tags:
    tp1_window_size = 3600 # 1 hour window size
    
    #find the start_ts based on minimum time for this tag
    tp1_start_ts = tp1_window_size * math.floor(tagsToMinTs[tag]/(tp1_window_size*1.0))
    tp1_end_ts = tp1_window_size * math.ceil(tagsToMaxTs[tag]/(tp1_window_size*1.0))
    features,labels = getFeatures(tag,tp1_start_ts,tp1_end_ts,tp1_window_size)
    
    print('\nLinear Regression Model for {}'.format(tag))
   
    X_orig = features
    y = labels
    
    #     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
    X = sm.add_constant(X_orig)
    
    model = sm.OLS(y,X)
    results = model.fit()
    pred_y = results.predict(X)

    print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
    print("\nMSE(from mse_resid) : {}".format(results.mse_resid))
    print("R-squared : {}".format(results.rsquared))    
    print("P values for the features are \n {} \n\n".format(results.pvalues))

    print(results.summary())
    print('---'*20)
    print('\n\n')
    
#     save_object(features, "q6_tp1_features_{}".format(tag))
#     save_object(labels, "q6_tp1_labels_{}".format(tag))


Linear Regression models for time period 1

Linear Regression Model for #gohawks

MSE : 756103.4056698364

MSE(from mse_resid) : 764020.7188182118
R-squared : 0.47725202954519985
P values for the features are 
 [3.64880173e-01 2.51419939e-14 1.48842178e-03 1.74459236e-02
 7.34249076e-01 3.99406584e-01] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.473
Method:                 Least Squares   F-statistic:                     104.6
Date:                Fri, 22 Mar 2019   Prob (F-statistic):           2.38e-78
Time:                        12:53:14   Log-Likelihood:                -4740.2
No. Observations:                 579   AIC:                             9492.
Df Residuals:                     573   BIC:                             9519.
Df Model:                           5                        


Linear Regression Model for #sb49

MSE : 16150649.280861808

MSE(from mse_resid) : 16318593.640801437
R-squared : 0.8047169058052113
P values for the features are 
 [5.81313868e-01 1.14232377e-34 3.44153088e-02 3.97433051e-01
 3.07148906e-02 5.75147032e-01] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.805
Model:                            OLS   Adj. R-squared:                  0.803
Method:                 Least Squares   F-statistic:                     475.5
Date:                Fri, 22 Mar 2019   Prob (F-statistic):          6.10e-202
Time:                        12:54:56   Log-Likelihood:                -5665.4
No. Observations:                 583   AIC:                         1.134e+04
Df Residuals:                     577   BIC:                         1.137e+04
Df Model:                           5                                         
Covariance Type:           

### QUESTION 3: 
For each of your models, report your model’s Mean Squared Error (MSE) and R-squared measure. Also, analyse the significance of each feature using the t-test and p-value. You may use OLS from the statsmodels library in Python.
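
The reported quantities can be computed by hand on a one-feature toy regression, which also shows why the two MSE values printed in this notebook differ: `ste.mse` averages the squared errors over all n observations, while `results.mse_resid` divides by the residual degrees of freedom. The data and variable names below are hypothetical, and the sketch uses only the standard library rather than statsmodels.

```python
# Toy data (hypothetical): x = tweets in the previous hour, y = tweets in the next hour
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
p = 2  # estimated parameters: intercept + one slope

x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Closed-form least-squares estimates for a single predictor
slope = sxy / sxx
intercept = y_mean - slope * x_mean

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
ssr = sum(r ** 2 for r in residuals)        # sum of squared residuals
sst = sum((yi - y_mean) ** 2 for yi in y)   # total sum of squares

mse = ssr / n              # mean of squared errors, as ste.mse(pred_y, y) computes
mse_resid = ssr / (n - p)  # divides by residual degrees of freedom, as results.mse_resid
r_squared = 1 - ssr / sst

# t-statistic of the slope: estimate divided by its standard error
slope_se = (mse_resid / sxx) ** 0.5
t_slope = slope / slope_se
```

The same relation holds in the #gohawks output above: 756103.41 × 579 / 573 ≈ 764020.72, which matches the printed `mse_resid` for 579 observations and 573 residual degrees of freedom. A large absolute t-statistic corresponds to a small p-value, which is how the per-feature p-values are read.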

In [8]:
# import statsmodels.api as sm
# import statsmodels.tools.eval_measures as ste

# for tag in hash_tags:
#     print('\nLinear Regression Model for {}'.format(tag))
#     X = load_object('q2features_{}'.format(tag))
#     y = load_object('q2labels_{}'.format(tag))
    
# #     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
#     X = sm.add_constant(X)
    
#     model = sm.OLS(y,X)
#     results = model.fit()
#     pred_y = results.predict(X)
#     print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
#     print("\nMSE(from mse_resid) : {}".format(results.mse_resid))
#     print("R-squared : {}".format(results.rsquared))    
#     print("P values for the features are \n {} \n\n".format(results.pvalues))
#     print(results.summary())
#     print('---'*20)
#     print('\n\n')
    


Linear Regression Model for #gohawks

MSE : 828331.5553024177

MSE(from mse_resid) : 837852.6076622158
R-squared : 0.4771264119150189
P values for the features are 
 [2.31433584e-01 2.55595214e-13 2.42415497e-03 1.55961444e-02
 7.12324880e-01 7.74337359e-01] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.472
Method:                 Least Squares   F-statistic:                     95.27
Date:                Thu, 21 Mar 2019   Prob (F-statistic):           3.36e-71
Time:                        23:43:45   Log-Likelihood:                -4346.8
No. Observations:                 528   AIC:                             8706.
Df Residuals:                     522   BIC:                             8731.
Df Model:                           5                                         
Covariance Type:          