# TueSNLP - Assignment 2

## Linear regression
The assignment and data are available here: https://snlp2018.github.io/assignments.html.

The data (already split into training and testing sets) is a list of tweet timestamps in UNIX format. The goal of the assignment is to model the distribution of tweets over the hours of a day.

### Exercise 1
Load the data, convert it into a more informative format and count the number of tweets in each hour of each day. The goal of the exercise is to output two `numpy` arrays, one with the hours (0,1,...,23,0,1,...) and the other with the count of tweets in each hour.

In [3]:
# libraries
import gzip
import numpy as np
import pandas as pd
import time

In [4]:
# read data ("rt" mode makes sure we read as text)
with gzip.open("data/timestamps.train.gz", "rt") as input_f:
    timestamps_train_raw = input_f.read().splitlines()

It looks like this:

In [5]:
print(timestamps_train_raw[0:10])

['1522533600', '1522533600', '1522533602', '1522533603', '1522533603', '1522533604', '1522533604', '1522533604', '1522533605', '1522533606']


We can convert UNIX timestamps with `time.localtime()`, which interprets them in the local time zone of the machine running the code; for example:

In [6]:
print(time.localtime(int(timestamps_train_raw[0])))
print(time.localtime(int(timestamps_train_raw[1])))
print(time.localtime(int(timestamps_train_raw[2])))

time.struct_time(tm_year=2018, tm_mon=4, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=6, tm_yday=91, tm_isdst=1)
time.struct_time(tm_year=2018, tm_mon=4, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=6, tm_yday=91, tm_isdst=1)
time.struct_time(tm_year=2018, tm_mon=4, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=2, tm_wday=6, tm_yday=91, tm_isdst=1)
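
Because `time.localtime()` depends on the machine's time zone, the hours extracted later can shift on another computer. The output above (midnight with `tm_isdst=1`) is consistent with an offset of UTC+2. A minimal sketch that makes the offset explicit, assuming a fixed UTC+2 offset; `UTC_PLUS_2` and `to_local` are hypothetical names, not part of the assignment:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+2 offset, chosen to match the output shown above.
UTC_PLUS_2 = timezone(timedelta(hours=2))

def to_local(ts, tz=UTC_PLUS_2):
    """Convert a UNIX timestamp (seconds, str or int) to an aware datetime in tz."""
    return datetime.fromtimestamp(int(ts), tz=tz)

first = to_local("1522533600")
print(first.isoformat())  # 2018-04-01T00:00:00+02:00
print(first.hour)         # 0
```

A fixed offset ignores daylight-saving changes, so it is only suitable when all timestamps fall within one such period.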


We can see that the attribute `tm_hour` might be just what we need:

In [70]:
timestamps_train_int = [int(entry) for entry in timestamps_train_raw] # convert to integer
timestamps_train_int[0:10]

[1522533600,
 1522533600,
 1522533602,
 1522533603,
 1522533603,
 1522533604,
 1522533604,
 1522533604,
 1522533605,
 1522533606]

In [71]:
# convert with localtime and extract hour attribute
timestamps_train_converted = [time.localtime(entry) for entry in timestamps_train_int]
timestamps_train_hours = np.array([entry.tm_hour for entry in timestamps_train_converted])
timestamps_train_hours[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

We want to count tweets in each hour of each day, so we also need the year, month and day:

In [89]:
timestamps_train_keys = [(entry.tm_year, entry.tm_mon, entry.tm_mday, entry.tm_hour) 
                         for entry in timestamps_train_converted]

In [90]:
timestamps_train_keys[0:5]

[(2018, 4, 1, 0),
 (2018, 4, 1, 0),
 (2018, 4, 1, 0),
 (2018, 4, 1, 0),
 (2018, 4, 1, 0)]

In [91]:
train_df = pd.DataFrame()
train_df["key"] = timestamps_train_keys
train_df.head()

| | key |
|---|---|
| 0 | (2018, 4, 1, 0) |
| 1 | (2018, 4, 1, 0) |
| 2 | (2018, 4, 1, 0) |
| 3 | (2018, 4, 1, 0) |
| 4 | (2018, 4, 1, 0) |


In [95]:
counts_df = train_df.groupby(["key"]).size().reset_index(name="count")
counts_df["hour"] = [entry[3] for entry in counts_df["key"]]
counts_df.head()

| | key | count | hour |
|---|---|---|---|
| 0 | (2018, 4, 1, 0) | 5682 | 0 |
| 1 | (2018, 4, 1, 1) | 3480 | 1 |
| 2 | (2018, 4, 1, 2) | 1782 | 2 |
| 3 | (2018, 4, 1, 3) | 1029 | 3 |
| 4 | (2018, 4, 1, 4) | 911 | 4 |
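
The exercise asks for two `numpy` arrays rather than a data frame. As an alternative to the `groupby` above, here is a minimal sketch that counts the `(year, month, day, hour)` keys with `collections.Counter` and returns the two arrays directly. The function name `hourly_counts` and the toy keys are made up for illustration:

```python
from collections import Counter

import numpy as np

def hourly_counts(keys):
    """Return (hours, counts) arrays from (year, month, day, hour) keys."""
    tally = Counter(keys)
    ordered = sorted(tally)  # tuple order is chronological for (y, m, d, h)
    hours = np.array([key[3] for key in ordered])
    counts = np.array([tally[key] for key in ordered])
    return hours, counts

# Toy keys: three tweets at hour 0, two at hour 1, one at hour 0 of the next day.
toy_keys = [(2018, 4, 1, 0)] * 3 + [(2018, 4, 1, 1)] * 2 + [(2018, 4, 2, 0)]
hours, counts = hourly_counts(toy_keys)
print(hours)   # [0 1 0]
print(counts)  # [3 2 1]
```

As with the `groupby` version, an hour without any tweets does not appear in either array.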


### Exercise 2
Fit a linear model to predict the number of tweets in each hour of the day, using `sklearn.linear_model`. Output the `R^2` score on training and testing data, and make predictions for a few samples.

In [96]:
import sklearn.linear_model as lm

In [105]:
# initialize model
lm_model = lm.LinearRegression()

# train
lm_model.fit(X = counts_df["count"].values.reshape(-1,1), # reshape needed to obtain 2D array
             y = counts_df["hour"])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [106]:
# R^2 score on training data
lm_model.score(X = counts_df["count"].values.reshape(-1,1), # reshape needed to obtain 2D array
               y = counts_df["hour"])

0.518005943364782
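
For reference, the `score` method of a scikit-learn regressor returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`. A minimal sketch of that formula on toy values; `r2_score_manual` is a hypothetical helper, not a scikit-learn function:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

print(r2_score_manual([1, 2, 3], [1, 2, 3]))  # 1.0 (perfect predictions)
print(r2_score_manual([1, 2, 3], [2, 2, 2]))  # 0.0 (no better than the mean)
```

A score of about 0.52 therefore means the linear model removes roughly half of the squared error of a constant mean prediction.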

In [115]:
# some predictions (the model was fit with counts as X, so these inputs are interpreted as counts)
lm_model.predict(X = np.array([0,8,12,18,23]).reshape(-1,1))

array([5.47860383, 5.48506782, 5.48829982, 5.49314781, 5.4971878 ])
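
The exercise statement asks to predict the number of tweets from the hour, whereas the cells above pass `count` as `X` and `hour` as `y`. That is why the predictions for the inputs 0 to 23 all lie near 5.5. A minimal sketch of the fit in the other direction (hour as feature, count as target), using made-up toy data rather than the assignment data:

```python
import numpy as np
import sklearn.linear_model as lm

# Toy data (hypothetical): the count is exactly 10 + 2 * hour.
toy_hours = np.array([0, 1, 2, 3, 4, 5])
toy_counts = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])

model = lm.LinearRegression()
model.fit(X=toy_hours.reshape(-1, 1), y=toy_counts)  # feature: hour, target: count

# Predict the tweet count for hours 8 and 12 (about 26 and 34 on this toy data).
predicted = model.predict(np.array([8, 12]).reshape(-1, 1))
print(predicted)
```

With the real data this would amount to swapping the `X` and `y` arguments in the calls to `fit`, `score` and `predict`; the scores printed in this notebook would change accordingly.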

Do all this with test data as well :)

In [117]:
# read and process data
with gzip.open("data/timestamps.test.gz", "rt") as input_f:
    timestamps_test_raw = input_f.read().splitlines()

timestamps_test_converted = [time.localtime(int(entry)) for entry in timestamps_test_raw]
timestamps_test_hours = np.array([entry.tm_hour for entry in timestamps_test_converted])
timestamps_test_keys = [(entry.tm_year, entry.tm_mon, entry.tm_mday, entry.tm_hour) 
                         for entry in timestamps_test_converted]

In [119]:
# build the data frame and count tweets per (year, month, day, hour) key
test_df = pd.DataFrame()
test_df["key"] = timestamps_test_keys
test_counts_df = test_df.groupby(["key"]).size().reset_index(name="count")
test_counts_df["hour"] = [entry[3] for entry in test_counts_df["key"]]

In [121]:
# model: fit on the training data, then score on the test data
lm_model = lm.LinearRegression()

lm_model.fit(X = counts_df["count"].values.reshape(-1,1), # reshape needed to obtain 2D array
             y = counts_df["hour"])

lm_model.score(X = test_counts_df["count"].values.reshape(-1,1), # reshape needed to obtain 2D array
               y = test_counts_df["hour"])

0.4214312293344443
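
The training and test files go through the same preprocessing steps, so they could share one helper. A minimal sketch, assuming the same file format as above. The name `load_hourly_counts` is hypothetical, and instead of a tuple-valued `key` column it groups on separate `year`, `month`, `day` and `hour` columns. The demo runs on a temporary file with made-up timestamps:

```python
import gzip
import os
import tempfile
import time

import pandas as pd

def load_hourly_counts(path):
    """Read gzipped UNIX timestamps and count tweets per (year, month, day, hour)."""
    with gzip.open(path, "rt") as input_f:
        stamps = [int(line) for line in input_f.read().splitlines() if line]
    converted = [time.localtime(stamp) for stamp in stamps]
    frame = pd.DataFrame({
        "year": [t.tm_year for t in converted],
        "month": [t.tm_mon for t in converted],
        "day": [t.tm_mday for t in converted],
        "hour": [t.tm_hour for t in converted],
    })
    # One row per distinct hour of a distinct day, with the number of tweets in it.
    return frame.groupby(["year", "month", "day", "hour"]).size().reset_index(name="count")

# Demo: two timestamps one second apart, and a third one exactly an hour later.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.gz")
    with gzip.open(demo_path, "wt") as output_f:
        output_f.write("1522533600\n1522533601\n1522537200\n")
    demo_counts = load_hourly_counts(demo_path)

print(demo_counts["count"].tolist())  # [2, 1]
```

With the files from this notebook, it would be called once with `"data/timestamps.train.gz"` and once with `"data/timestamps.test.gz"`.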

### Exercise 3
Fit 9 polynomial regression models, increasing the order of the function (the exponent of the highest-degree term) by one each time, starting from 2. Output the `R^2` score for each model on training and testing data.

We use `PolynomialFeatures`:

In [122]:
from sklearn.preprocessing import PolynomialFeatures

In [124]:
# polynomial features
poly = PolynomialFeatures(degree = 2) # initialize constructor
poly_counts = poly.fit_transform(counts_df["count"].values.reshape(-1,1)) # add polynomial feature

In [126]:
# model
lm_model = lm.LinearRegression()
lm_model.fit(X = poly_counts, y = counts_df["hour"])
lm_model.score(X = poly_counts, y = counts_df["hour"])

0.6671070445831935
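
To make the `PolynomialFeatures` step concrete: for a single input feature `x` and `degree = 2`, the default settings expand each value into the columns `1`, `x` and `x^2`. The linear model is then fit on those columns. A small sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([2.0, 3.0]).reshape(-1, 1)  # two samples, one feature
expanded = PolynomialFeatures(degree=2).fit_transform(x)
print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

The leading column of ones comes from the default `include_bias=True`.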

As a function:

In [146]:
def polynomial_regression(n, X_train, y_train, X_test, y_test):
    poly = PolynomialFeatures(degree = n) # initialize the transformer
    poly_X_train = poly.fit_transform(X_train) # fit on the training data and add polynomial features
    lm_model = lm.LinearRegression()
    lm_model.fit(X = poly_X_train, y = y_train)
    score_train = lm_model.score(X = poly_X_train, y = y_train)
    print("R^2 score with degree "+str(n)+" on training data: "+str(score_train))
    poly_X_test = poly.transform(X_test) # reuse the fitted transformer on the test data
    score_test = lm_model.score(X = poly_X_test, y = y_test)
    print("R^2 score with degree "+str(n)+" on testing data: "+str(score_test))

For example:

In [137]:
polynomial_regression(n = 2, X_train = counts_df["count"].values.reshape(-1,1), y_train = counts_df["hour"], 
                     X_test = test_counts_df["count"].values.reshape(-1,1), y_test = test_counts_df["hour"])

R^2 score with degree 2 on training data: 0.6671070445831935
R^2 score with degree 2 on testing data: 0.6198662583600794


Let's run it for each value from 2 to 9 on both training and testing data:

In [147]:
X_train = counts_df["count"].values.reshape(-1,1)
y_train = counts_df["hour"]
X_test = test_counts_df["count"].values.reshape(-1,1)
y_test = test_counts_df["hour"]

for n in range(2,10):
    polynomial_regression(n, X_train, y_train, X_test, y_test)
    print("")

R^2 score with degree 2 on training data: 0.6671070445831935
R^2 score with degree 2 on testing data: 0.6198662583600794

R^2 score with degree 3 on training data: 0.6845080968714403
R^2 score with degree 3 on testing data: 0.6462106869714331

R^2 score with degree 4 on training data: 0.686429937799754
R^2 score with degree 4 on testing data: 0.6501344099219183

R^2 score with degree 5 on training data: 0.7040726102662854
R^2 score with degree 5 on testing data: 0.6844282943198319

R^2 score with degree 6 on training data: 0.6325925869456348
R^2 score with degree 6 on testing data: 0.5740567894636623

R^2 score with degree 7 on training data: 0.4938746612869205
R^2 score with degree 7 on testing data: 0.36389313938593926

R^2 score with degree 8 on training data: 0.19047518826271015
R^2 score with degree 8 on testing data: 0.03158609258730105

R^2 score with degree 9 on training data: 0.1478869139419121
R^2 score with degree 9 on testing data: 0.013864090396260575



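Performance peaks at degree 5 (training `R^2` of about 0.704, testing `R^2` of about 0.684) and then drops sharply, down to a testing `R^2` of about 0.014 at degree 9.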
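
In exact arithmetic, adding higher-degree terms to an ordinary least-squares fit cannot lower the training `R^2`, since every lower-degree model is a special case of the higher-degree one. The falling training scores above therefore point to numerical trouble. A plausible cause is that counts in the thousands, raised to high powers, give very badly scaled features. One common remedy is to standardize the feature before expanding it. The sketch below shows this on made-up toy data, with feature and target mirroring the cells above (count as `X`, hour as `y`). The helper `scaled_polynomial_model` is hypothetical, and nothing here has been run on the assignment data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def scaled_polynomial_model(degree):
    """Standardize the feature, expand it polynomially, then fit least squares."""
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree),
        LinearRegression(),
    )

# Toy data (hypothetical): counts in the thousands for two "days" of 24 hours.
rng = np.random.default_rng(0)
toy_hours = np.tile(np.arange(24), 2).astype(float)
toy_counts = 3000 + 2500 * np.sin((toy_hours - 6) / 24 * 2 * np.pi) + rng.normal(0, 100, 48)

X_toy = toy_counts.reshape(-1, 1)
train_scores = []
for degree in range(2, 10):
    model = scaled_polynomial_model(degree).fit(X_toy, toy_hours)
    train_scores.append(model.score(X_toy, toy_hours))
    print(degree, train_scores[-1])
```

On this toy data the training score should not decrease as the degree grows. Whether the same holds for the tweet data, and where the test score would then peak, would have to be checked by rerunning the loop above with such a pipeline.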