### Linear regression
Create time windows from the data to extract features. Use 1-hour time windows (00:00 - 01:00 am, 01:00 - 02:00 am, etc.) and calculate the features within each window, which yields one data point per hour (<# of hours> data points in total).
For each hashtag data file, fit a linear regression model that predicts the number of tweets in the next hour from the following 5 features, each extracted from the tweet data of the preceding hour.
The features you should use are:
* Number of tweets
* Total number of retweets
* Sum of the number of followers of the users posting the hashtag
* Maximum number of followers of the users posting the hashtag
* Time of the day (which could take 24 values that represent hours of the day with respect to a given time zone)
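
As a rough illustration of the windowing described above, the sketch below buckets timestamps into 1-hour windows in an explicit time zone and pairs each window with the tweet count of the following hour. It is a simplified sketch, not the notebook's implementation: it uses only two of the five features (tweet count and hour of day), the helper names `hour_window_start`, `count_per_window` and `make_pairs` are hypothetical, and the fixed UTC-8 offset (Pacific Standard Time, without daylight-saving handling) is an assumption about which time zone is wanted.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Standard Time modelled as a fixed UTC-8 offset (no DST).
PST = timezone(timedelta(hours=-8))

def hour_window_start(timestamp, tz=PST):
    """Return the start of the 1-hour window that contains a Unix timestamp."""
    d = datetime.fromtimestamp(timestamp, tz)
    return d.replace(minute=0, second=0, microsecond=0)

def count_per_window(timestamps, tz=PST):
    """Count tweets per 1-hour window, keyed by the window start."""
    counts = {}
    for ts in timestamps:
        key = hour_window_start(ts, tz)
        counts[key] = counts.get(key, 0) + 1
    return counts

def make_pairs(counts):
    """Pair [tweet count, hour of day] for each window with the next hour's count.

    Windows without tweets are filled with 0. The last observed window is
    skipped because its next hour is not observed.
    """
    first, last = min(counts), max(counts)
    pairs = []
    window = first
    while window < last:
        nxt = window + timedelta(hours=1)
        features = [counts.get(window, 0), window.hour]
        pairs.append((features, counts.get(nxt, 0)))
        window = nxt
    return pairs

# Four made-up timestamps: two at 16:00 PST, one at 17:00, one at 19:00.
timestamps = [1422748800, 1422748810, 1422752400, 1422759600]
for features, label in make_pairs(count_per_window(timestamps)):
    print(features, label)
```

Passing the time zone explicitly keeps the "time of the day" feature independent of the machine the notebook runs on, whereas `datetime.fromtimestamp` without a `tz` argument uses the local time zone.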

In [2]:
hash_tags = ['#gohawks','#gopatriots','#nfl','#patriots','#sb49','#superbowl']

In [3]:
import pickle

def save_object(data, fileName):
    with open('pynb_data/'+fileName + ".pickle", 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        
def load_object(fileName):
    try:
        with open('pynb_data/'+fileName + ".pickle", 'rb') as f:
            data = pickle.load(f)
            return data
    except IOError:
        print("Could not read file: " + fileName)

In [12]:
from datetime import date, timedelta
import datetime
import pytz
import json

def getHour(timestamp):
    d = datetime.datetime.fromtimestamp(timestamp)
    return d.hour

def getHourAsKey(timestamp):
    d = datetime.datetime.fromtimestamp(timestamp)
    return "{}:{}:{}:{}".format(d.year,d.month,d.day,d.hour)

def getDayAsKey(timestamp):
    d = datetime.datetime.fromtimestamp(timestamp)
    return "{}:{}:{}".format(d.year,d.month,d.day)

#https://stackoverflow.com/questions/2315032/how-do-i-find-missing-dates-in-a-list-of-sorted-dates
def getAllDays(dayStrList):
    # Sort the parsed days so the first and last entries bound the date range
    dayList = sorted(datetime.datetime.strptime(x, '%Y:%m:%d') for x in dayStrList)
    numDays = (dayList[-1] - dayList[0]).days
    # Include every calendar day in the range, even days without any tweets
    allDaysList = [dayList[0] + timedelta(days=x) for x in range(numDays + 1)]
    allDayStrList = ['{}:{}:{}'.format(d.year, d.month, d.day) for d in allDaysList]
    return allDayStrList

def getFeatures(hash_tag):
    hourToTweets = {}
    hourToRetweets = {}
    hourToFollowerCount = {}
    hourToMaxFollowers = {}
    dayDict = {}
    features = []
    labels = []
    
    # Use the function argument rather than the global loop variable `tag`
    filename = 'data/tweets_'+hash_tag+'.txt'
    with open(filename) as f:
        for line in f:
            json_object = json.loads(line)
            timestamp = json_object['citation_date']
            dayDict[getDayAsKey(timestamp)]=1        
            key = getHourAsKey(timestamp)
            if key not in hourToTweets.keys():
                hourToTweets[key]=0
            hourToTweets[key]+=1
            
            retweetCount = json_object['metrics']['citations']['total']        
            
            if key not in hourToRetweets.keys():
                hourToRetweets[key]=0
            hourToRetweets[key]+=retweetCount
        
            followerCount = json_object['author']['followers']
            if key not in hourToFollowerCount.keys():
                hourToFollowerCount[key]=0
            hourToFollowerCount[key]+=followerCount
        
            if key not in hourToMaxFollowers.keys():
                hourToMaxFollowers[key]=0
            hourToMaxFollowers[key] = max(hourToMaxFollowers[key],followerCount)
            
        dayList = getAllDays(list(dayDict.keys()))
            
        for day in dayList:
            for h in range(0,24):
                key=day+':'+str(h)
                tweetCount = hourToTweets.get(key, 0)
                retweetCount = hourToRetweets.get(key,0)
                followerCount = hourToFollowerCount.get(key,0)
                maxFollowers = hourToMaxFollowers.get(key,0)

                feature = [tweetCount, retweetCount, followerCount, maxFollowers, h]
                features.append(feature)
                
                nexthour = datetime.datetime.strptime(key, '%Y:%m:%d:%H') + timedelta(hours=1)
                nexthourkey = "{}:{}:{}:{}".format(nexthour.year,nexthour.month,nexthour.day,nexthour.hour)
                labels.append(hourToTweets.get(nexthourkey,0))
                
    return features,labels

In [14]:
for tag in hash_tags:
    print("Started building feature vectors for {}".format(tag))
    features,labels = getFeatures(tag)
    save_object(features,'q2features_{}'.format(tag))
    save_object(labels,'q2labels_{}'.format(tag))
    print("Completed building feature vectors for {}".format(tag))

Started building feature vectors for #gohawks
Completed building feature vectors for #gohawks
Started building feature vectors for #gopatriots
Completed building feature vectors for #gopatriots
Started building feature vectors for #nfl
Completed building feature vectors for #nfl
Started building feature vectors for #patriots
Completed building feature vectors for #patriots
Started building feature vectors for #sb49
Completed building feature vectors for #sb49
Started building feature vectors for #superbowl
Completed building feature vectors for #superbowl


### QUESTION 3: 
For each of your models, report the model's Mean Squared Error (MSE) and R-squared measure. Also, analyse the significance of each feature using the t-test and p-value. You may use the OLS class from the statsmodels library in Python.
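
To make the reported quantities concrete, here is a minimal pure-Python sketch that computes MSE, R-squared and the slope's t-statistic for a regression with a single feature. It is an illustration under simplifying assumptions, not the notebook's method: the function name `simple_ols_report` and the toy data are hypothetical, only one feature is used instead of five, and the p-value is omitted because it requires the Student's t distribution, which statsmodels evaluates for you.

```python
import math

def simple_ols_report(x, y):
    """Fit y = a + b*x by least squares and return MSE, R-squared and t(b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)          # sum of squared errors
    sst = sum((yi - mean_y) ** 2 for yi in y)     # total sum of squares

    mse = sse / n                                 # mean of squared residuals
    r_squared = 1.0 - sse / sst
    # Standard error of the slope uses n - 2 residual degrees of freedom
    # (two estimated parameters: intercept and slope).
    slope_se = math.sqrt((sse / (n - 2)) / sxx)
    t_slope = slope / slope_se
    return {"mse": mse, "r_squared": r_squared, "t_slope": t_slope}

# Toy data, made up for illustration only.
report = simple_ols_report([1, 2, 3, 4], [1, 3, 2, 5])
print(report)
```

A large absolute t-statistic (equivalently, a small p-value) suggests that the coefficient differs from zero, which is how the significance of each feature is judged in the OLS summary.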

In [20]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

for tag in hash_tags:
    print('\nLinear Regression Model for {}'.format(tag))
    X = load_object('q2features_{}'.format(tag))
    y = load_object('q2labels_{}'.format(tag))
    
    # https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
    X = sm.add_constant(X)
    
    model = sm.OLS(y,X)
    results = model.fit()
    pred_y = results.predict(X)
    print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
    print("R-squared : {}".format(results.rsquared))    
    print("P values for the features are \n {} \n\n".format(results.pvalues))
    print(results.summary())
    print('---'*20)
    print('\n\n')
    


Linear Regression Model for #gohawks

MSE : 828331.5553024177
R-squared : 0.4771264119150189
P values for the features are 
 [2.31433584e-01 2.55595214e-13 2.42415497e-03 1.55961444e-02
 7.12324880e-01 7.74337359e-01] 


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.472
Method:                 Least Squares   F-statistic:                     95.27
Date:                Sat, 16 Mar 2019   Prob (F-statistic):           3.36e-71
Time:                        15:23:01   Log-Likelihood:                -4346.8
No. Observations:                 528   AIC:                             8706.
Df Residuals:                     522   BIC:                             8731.
Df Model:                           5                                         
Covariance Type:            nonrobust                              