### QUESTION 16: 
The dataset in hands is rich as there is a lot of metadata to each tweet. Be creative and propose a new problem (something interesting that can be inferred from this dataset) other than the previous parts. You can look into the literature of Twitter data analysis to get some ideas. Implement your idea and show that it works. As a suggestion, you might provide some analysis based on changes of tweet sentiments for fans of the opponent teams participating in the match. You get full credit for briniging in novelty and full or partial implementation of your new ideas.

Use-case : Predicting number of impressions for a given tweet
* the impressions for a tweet are present in the tweet['metrics'] object inside the json
* some possible features for this could be; user-passivity, created-hour, no.of hashtags, sentiment of tweet, no. of followers of author, no. of tweets made by author

In [1]:
hash_tags = ['#gohawks','#gopatriots','#nfl','#patriots','#sb49','#superbowl']

In [2]:
import pickle

def save_object(data, fileName):
    with open('pynb_data/'+fileName + ".pickle", 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        
def load_object(fileName):
    try:
        with open('pynb_data/'+fileName + ".pickle", 'rb') as f:
            data = pickle.load(f)
            return data
    except IOError:
        print("Could not read file: " + fileName)

In [3]:
import json

def getMinAndMaxTs(tag):
    filename = 'data/tweets_'+tag+'.txt'
    max_ts = 0
    min_ts = 1552522378
    with open(filename) as f:
        for line in f:
            json_object = json.loads(line)
            timestamp = json_object['citation_date']
            if(timestamp < min_ts):                
                min_ts = timestamp
            
            if(timestamp > max_ts):
                max_ts = timestamp
                
    return [min_ts,max_ts]

tagsToMinTs = {}
tagsToMaxTs = {}
for tag in hash_tags:
    ts_list = getMinAndMaxTs(tag)
    tagsToMinTs[tag] = (ts_list[0])
    tagsToMaxTs[tag] = (ts_list[1])    

In [4]:
import math
import datetime
import pytz

# https://arxiv.org/pdf/1401.2018v2.pdf
def getUserPassivity(user,ts):
    createdDateTimeObj = datetime.datetime.strptime(user['created_at'],"%a %b %d %H:%M:%S %z %Y")
    created = datetime.datetime.fromtimestamp(createdDateTimeObj.timestamp())
    d = datetime.datetime.fromtimestamp(ts)
    td = (created - d).days
    statuses_count = user['statuses_count']
    return td/(1.0+statuses_count)

def getLocalHour(timestamp):
    d = datetime.datetime.fromtimestamp(timestamp)
    pst = pytz.timezone('America/Los_Angeles')
    d = pst.localize(d)
    return d.hour

def getWindowNumber(start_ts, curr_ts, window):
    elapsed = (curr_ts - start_ts)/(window*1.0)
    windowNum = math.ceil(elapsed)
    return windowNum    

def getFeatures(start_ts,end_ts,window):
    windowToTweets = {}
    windowToRetweets = {}
    windowToFollowerCount = {}
    windowToMaxFollowers = {}
    features = []
    labels = []
    
    for tag in hash_tags:
        filename = 'data/tweets_'+tag+'.txt'
        with open(filename) as f:
            for line in f:
                json_object = json.loads(line)
                timestamp = json_object['citation_date']

                if timestamp < start_ts or timestamp > end_ts:                            
                    continue

                impressions = json_object['metrics']['impressions']
                userPassivity = getUserPassivity(json_object['tweet']['user'],timestamp)
                h = getLocalHour(timestamp)
                hashtagCount = len(json_object['tweet']['entities']['hashtags'])
                followerCount = json_object['author']['followers']
                tweetCount = json_object['tweet']['user']['statuses_count']

                features.append([userPassivity,h,hashtagCount,followerCount,tweetCount])
                labels.append(impressions)                                
                
    return features,labels

In [5]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

min_ts = min(list(tagsToMinTs.values()))

#tp1
tp1_window_size = 3600 # 1 hour window size
tp1_start_ts = tp1_window_size * math.floor(min_ts/(tp1_window_size*1.0))
tp1_end_ts = 1422806400
features,labels = getFeatures(tp1_start_ts,tp1_end_ts,tp1_window_size)
save_object(features, "q16_tp1_features")
save_object(labels, "q16_tp1_labels")

print('\nLinear Regression Model for {}'.format(tag))
X = features
y = labels
    
#     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
X = sm.add_constant(X)
    
model = sm.OLS(y,X)
results = model.fit()
pred_y = results.predict(X)
print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
print("R-squared : {}".format(results.rsquared))    
print(results.summary())
print('---'*20)
print('\n\n')    


Linear Regression Model for #superbowl

MSE : 3615741434.205783
R-squared : 0.8770361550012772
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.877
Model:                            OLS   Adj. R-squared:                  0.877
Method:                 Least Squares   F-statistic:                 8.612e+05
Date:                Wed, 20 Mar 2019   Prob (F-statistic):               0.00
Time:                        16:20:05   Log-Likelihood:            -7.5004e+06
No. Observations:              603742   AIC:                         1.500e+07
Df Residuals:                  603736   BIC:                         1.500e+07
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------

In [6]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

min_ts = min(list(tagsToMinTs.values()))

#tp2
tp2_window_size = 300 # 5 minute window size
tp2_start_ts = 1422806400
tp2_end_ts = 1422849600
features,labels = getFeatures(tp2_start_ts,tp2_end_ts,tp2_window_size)
save_object(features, "q16_tp2_features")
save_object(labels, "q16_tp2_labels")

print('\nLinear Regression Model for {}'.format(tag))
X = features
y = labels
    
#     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
X = sm.add_constant(X)
    
model = sm.OLS(y,X)
results = model.fit()
pred_y = results.predict(X)
print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
print("R-squared : {}".format(results.rsquared))    
print(results.summary())
print('---'*20)
print('\n\n')    


Linear Regression Model for #superbowl

MSE : 67146716.77004832
R-squared : 0.9967625943023873
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.997
Model:                            OLS   Adj. R-squared:                  0.997
Method:                 Least Squares   F-statistic:                 1.213e+08
Date:                Wed, 20 Mar 2019   Prob (F-statistic):               0.00
Time:                        17:00:33   Log-Likelihood:            -2.0553e+07
No. Observations:             1970528   AIC:                         4.111e+07
Df Residuals:                 1970522   BIC:                         4.111e+07
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------

In [8]:
import statsmodels.api as sm
import statsmodels.tools.eval_measures as ste

max_ts = min(list(tagsToMaxTs.values()))

#tp3
tp3_window_size = 3600 # 1 hour window size
tp3_start_ts = 1422849600
tp3_end_ts = tp3_window_size * math.ceil(max_ts/(tp3_window_size*1.0))
features,labels = getFeatures(tp3_start_ts,tp3_end_ts,tp3_window_size)
save_object(features, "q16_tp3_features")
save_object(labels, "q16_tp3_labels")

print('\nLinear Regression Model for {}'.format(tag))
X = features
y = labels
    
#     https://becominghuman.ai/stats-models-vs-sklearn-for-linear-regression-f19df95ad99b
X = sm.add_constant(X)
    
model = sm.OLS(y,X)
results = model.fit()
pred_y = results.predict(X)
print("\nMSE : {}".format(ste.mse(pred_y, y,axis=0)))
print("R-squared : {}".format(results.rsquared))    
print(results.summary())
print('---'*20)
print('\n\n')    


Linear Regression Model for #superbowl

MSE : 4638475443.966269
R-squared : 0.8994272160994153
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.899
Model:                            OLS   Adj. R-squared:                  0.899
Method:                 Least Squares   F-statistic:                 4.454e+05
Date:                Wed, 20 Mar 2019   Prob (F-statistic):               0.00
Time:                        17:05:06   Log-Likelihood:            -3.1250e+06
No. Observations:              249046   AIC:                         6.250e+06
Df Residuals:                  249040   BIC:                         6.250e+06
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------