# TueSNLP - Assignment 2

## Linear regression
The assignment and data are available here: https://snlp2018.github.io/assignments.html.
The data (already split into training and test sets) is a list of tweet timestamps in UNIX format, i.e. seconds since 1970-01-01 00:00:00 UTC. The goal of the assignment is to model the distribution of tweets over the hours of a day.

### Exercise 1
Load the data, convert it into a more informative format, and count the number of tweets in each hour of each day. The goal of the exercise is to output two `numpy` arrays: one with the hours (0, 1, ..., 23, 0, 1, ...) and the other with the number of tweets in each hour.

In [71]:
# libraries
import gzip  # needed for gzip.open below
import numpy as np
import pandas as pd
import time

In [33]:
# read data ("rt" mode makes sure we read as text)
with gzip.open("data/timestamps.train.gz", "rt") as input_f:
    timestamps_train_raw = input_f.read().splitlines()

It looks like this:

In [35]:
print(timestamps_train_raw[0:10])

['1522533600', '1522533600', '1522533602', '1522533603', '1522533603', '1522533604', '1522533604', '1522533604', '1522533605', '1522533606']


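We can convert a UNIX timestamp into a structured local time with `time.localtime()`. Note that the result depends on the time zone of the machine running the code. For example: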

In [44]:
print(time.localtime(int(timestamps_train_raw[0])))
print(time.localtime(int(timestamps_train_raw[1])))
print(time.localtime(int(timestamps_train_raw[2])))

time.struct_time(tm_year=2018, tm_mon=4, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=6, tm_yday=91, tm_isdst=1)
time.struct_time(tm_year=2018, tm_mon=4, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=6, tm_yday=91, tm_isdst=1)
time.struct_time(tm_year=2018, tm_mon=4, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=2, tm_wday=6, tm_yday=91, tm_isdst=1)
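Because `time.localtime()` uses the time zone of the machine, the same notebook can produce different hours on a different computer. The following is a minimal sketch of a conversion with an explicit time zone, using the standard `datetime` module instead of `time.localtime()`. The fixed UTC+2 offset (named `CEST` here) and the helper `to_local` are assumptions made for illustration; the offset was chosen because it reproduces the output shown above, and a fixed offset ignores daylight saving changes.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the tweets should be interpreted at a fixed UTC+2 offset.
CEST = timezone(timedelta(hours=2))

def to_local(timestamp, tz=CEST):
    """Convert a UNIX timestamp (seconds) to a timezone-aware datetime."""
    return datetime.fromtimestamp(timestamp, tz=tz)

first = to_local(1522533600)  # first timestamp of the training data
print(first.year, first.month, first.day, first.hour)
```

With this offset, the first timestamp falls at midnight on 1 April 2018, regardless of the machine's own settings.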


We can see that the attribute `tm_hour` is just what we need. But first, we make sure the tweets in the data are in chronological order:

In [58]:
timestamps_train_sorted = [int(entry) for entry in timestamps_train_raw] # convert to integers
timestamps_train_sorted.sort() # sort in ascending order

In [65]:
# convert with localtime and extract hour attribute
timestamps_train_converted = [time.localtime(entry) for entry in timestamps_train_sorted]
timestamps_train_hours = np.array([entry.tm_hour for entry in timestamps_train_converted])

We want to count the tweets in each hour of each day, so the hour alone is not enough: we also need the year, month and day to tell the days apart:

In [77]:
# build one "year-month-day-hour" key per tweet
timestamps_train_keys = [
    f"{entry.tm_year}-{entry.tm_mon}-{entry.tm_mday}-{entry.tm_hour}"
    for entry in timestamps_train_converted
]

In [78]:
timestamps_train_keys[0:5]

['2018-4-1-0', '2018-4-1-0', '2018-4-1-0', '2018-4-1-0', '2018-4-1-0']
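These keys are not zero-padded, so sorting them as strings would not give chronological order (for example, `'2018-4-10-0'` sorts before `'2018-4-2-0'`). As a hedged alternative, the sketch below builds zero-padded keys with `time.strftime`. The helper name `hour_key` is hypothetical, and `time.gmtime` (UTC) is used only so that the example gives the same result on every machine; the notebook itself uses `time.localtime`.

```python
import time

def hour_key(struct):
    """Return a zero-padded 'YYYY-MM-DD-HH' key for a time.struct_time."""
    return time.strftime("%Y-%m-%d-%H", struct)

# Two timestamps nine days apart, converted in UTC for reproducibility.
example_keys = [hour_key(time.gmtime(ts)) for ts in (1522533600, 1523311200)]
print(example_keys)

# Zero-padded keys sort chronologically even when compared as strings.
print(sorted(example_keys) == example_keys)
```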

In [80]:
# count tweets per key; value_counts() orders the result by count (descending), not by time
counts = pd.Series(timestamps_train_keys).value_counts()

In [84]:
counts.head()

2018-4-29-18    35725
2018-4-26-20    34713
2018-4-27-20    31984
2018-4-19-21    30970
2018-4-20-19    28245
dtype: int64
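The counts above are ordered by frequency, while the exercise asks for two arrays in chronological order. The following is a minimal sketch of one way to get them directly from the integer timestamps with `numpy`, without going through string keys. It is not the notebook's method: the function `hourly_counts`, the fixed UTC+2 offset and the small `sample` list are assumptions for illustration. Hours without any tweets do not appear in the output, and a fixed offset ignores daylight saving changes.

```python
import numpy as np

# Assumption: local time is a fixed UTC+2 offset for the whole data set.
UTC_OFFSET_SECONDS = 2 * 3600

def hourly_counts(timestamps, utc_offset=UTC_OFFSET_SECONDS):
    """Return (hours, counts) in chronological order.

    hours  : hour of day (0..23) of every hour that contains a tweet
    counts : number of tweets in that hour
    """
    shifted = np.asarray(timestamps, dtype=np.int64) + utc_offset
    buckets = shifted // 3600  # number of whole hours since the epoch
    # np.unique returns the distinct buckets in ascending order
    unique_buckets, counts = np.unique(buckets, return_counts=True)
    hours = unique_buckets % 24
    return hours, counts

# Hypothetical sample: two tweets at 00h, three at 01h, one at 00h the next day.
sample = [1522533600, 1522533602, 1522537200, 1522537201, 1522537205, 1522620000]
hours, counts = hourly_counts(sample)
print(hours)
print(counts)
```

In the notebook, `timestamps_train_sorted` could be passed in place of `sample`; the input does not need to be sorted, because `np.unique` orders the hour buckets itself.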