# Classical dataset extraction

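Classical machine learning algorithms treat each row as an independent sample, whereas every point in a time series depends strongly on the values before it, so the raw series can't be fed into them directly. A typical way around this is to introduce lag variables: the values from the preceding time steps are attached to each row as extra features.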

This kernel demonstrates one way to convert time series data into a dataset suitable for classical machine learning algorithms.
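
To make the idea of lag variables concrete, here is a minimal sketch on a single toy series. The numbers are made up and `toy` is a hypothetical name; the rest of this kernel builds the same kind of columns for many series at once using a different mechanism.

```python
import pandas as pd

# Hypothetical toy series of daily visits (not taken from the real dataset).
visits = pd.Series([10, 12, 15, 11, 14], name='Visits')

# Each lag column holds the value observed `lag` steps earlier.
toy = pd.DataFrame({'Visits': visits})
for lag in range(1, 3):
    toy['lag_%d' % lag] = visits.shift(lag)

# The first rows have unknown lags (NaN), so they are dropped.
toy = toy.dropna()
print(toy)
```

Each remaining row is now a self-contained sample: the target `Visits` together with the two values that preceded it.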

In [None]:
import re

import numpy as np
import pandas as pd

First, we need to load our data. Since this is only a demonstration, we'll use a small fraction of the data (the first 256 rows), which speeds things up and makes it easier to verify that everything works properly.

Running this on the entire dataset takes considerably longer and uses quite a bit of memory.

In [None]:
data = pd.read_csv('../input/train_1.csv').iloc[:256]

data.head()

We'll extract the date columns from the data, as they will come in useful later for indexing and extracting lag variables.

In [None]:
date_columns = [c for c in data.columns if re.match(r'\d{4}-\d{2}-\d{2}', c)]

print(date_columns[:5])
print(date_columns[-5:])
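
One detail worth knowing about the pattern above: `re.match` only anchors at the start of the string, so a column that merely begins with a date would also be selected. The sketch below uses hypothetical column names to show the difference from `re.fullmatch`; whether the stricter check is needed depends on the columns actually present in the file.

```python
import re

# Hypothetical column names; only two of them are plain dates.
columns = ['Page', '2015-07-01', '2015-07-02', '2015-07-01_raw', 'notes_2015-07-01']

date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

# re.match anchors at the start only, so a trailing suffix still matches.
prefix_matches = [c for c in columns if date_pattern.match(c)]
# re.fullmatch requires the whole string to be a date.
exact_matches = [c for c in columns if date_pattern.fullmatch(c)]

print(prefix_matches)
print(exact_matches)
```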

We can specify how many lag variables we want.

In [None]:
LAG_DAYS = 7

Since every row needs `LAG_DAYS` lag values, we can't use the first `LAG_DAYS` days of each series: their lag values are unknown.

In [None]:
used_data = data[['Page'] + date_columns[LAG_DAYS:]]

We now convert the original table to individual entries, one row per page and date. Since this produces a very large number of data points, we can afford to drop any rows with NaN values.

In [None]:
flattened = pd.melt(used_data, id_vars='Page', var_name='date', value_name='Visits')
flattened.dropna(how='any', inplace=True)

flattened.head()
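
As a small illustration of what `pd.melt` does here, the sketch below reshapes a hypothetical two-page table from wide to long format. The page names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per page, one column per date.
wide = pd.DataFrame({
    'Page': ['page_a', 'page_b'],
    '2015-07-01': [3.0, np.nan],
    '2015-07-02': [5.0, 7.0],
})

# One row per (page, date) pair.
long = pd.melt(wide, id_vars='Page', var_name='date', value_name='Visits')
# Drop the pairs for which no visit count was recorded.
long = long.dropna(how='any')

print(long)
```

The four (page, date) combinations shrink to three rows, because the missing value for `page_b` is removed.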

We also build a mapping from each date to its column index, which we will later use to select the correct values from the data matrix.

In [None]:
date_indices = {d: i for i, d in enumerate(date_columns)}

We now attach to every entry the indices needed to look up its lag values in the data matrix:
- `page_indices` stores the row index of the entry's page
- `date_indices` stores the column index of the entry's date

In [None]:
# We will need the page indices to tell us which row to look at
data['page_indices'] = data.index
# We set the index to page so we can merge with `flattened` easily
data.set_index('Page', inplace=True)

flattened['date_indices'] = flattened['date'].apply(date_indices.get)
flattened = flattened.set_index('Page').join(data['page_indices']).reset_index()

flattened.iloc[538:548] # 543 happens to be the index where the second time series begins

In [None]:
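# Extract the value matrix once instead of rebuilding it on every iteration
values = data[date_columns].values
rows = flattened['page_indices'].values
cols = flattened['date_indices'].values

for lag in range(1, LAG_DAYS + 1):
    # Column index minus `lag` points at the value observed `lag` days earlier
    flattened['lag_%d' % lag] = values[rows, cols - lag]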
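
The lookup above relies on NumPy's integer array indexing: passing one array of row indices and one array of column indices selects one element per pair. Here is a minimal sketch on a hypothetical 2 x 4 matrix.

```python
import numpy as np

# Hypothetical matrix: rows are pages, columns are consecutive days.
matrix = np.array([
    [10, 11, 12, 13],
    [20, 21, 22, 23],
])

# One (row, column) pair per flattened entry.
rows = np.array([0, 0, 1])
cols = np.array([2, 3, 3])

# Pairing the two index arrays picks one element per entry;
# shifting the column index back by one gives the previous day's value.
current = matrix[rows, cols]
lag_1 = matrix[rows, cols - 1]

print(current)
print(lag_1)
```

Note that a negative column index would silently wrap around to the end of the row. That cannot happen in this kernel, because the first `LAG_DAYS` dates were excluded before flattening.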

Again, since we've got plenty of data to work with, we can drop any rows containing NaNs. This way we avoid introducing errors through a suboptimal imputation strategy.

In [None]:
flattened.dropna(how='any', inplace=True)

In [None]:
flattened.shape

In the following three cells, we verify that we got the desired results. We examine the first few data points of the first two series alongside the original rows and see that we were, in fact, successful.

In [None]:
flattened.head()

In [None]:
flattened.iloc[543:548]

In [None]:
data.iloc[:2]
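
Eyeballing the output works for a demonstration, but the same check can be automated. The sketch below is one possible approach on a hypothetical toy table (`wide`, `table`, and `LAGS` are illustrative names, not part of this kernel): it builds the lag columns by index lookup and compares them against an independent computation using `DataFrame.shift`.

```python
import numpy as np
import pandas as pd

LAGS = 2  # hypothetical lag count for this toy check

# Hypothetical wide table with one missing value.
wide = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, np.nan, 7.0, 8.0]],
    index=['page_a', 'page_b'],
    columns=['d0', 'd1', 'd2', 'd3'],
)
values = wide.values

# Build (row, column) pairs for every date that has a full lag history.
rows, cols = np.meshgrid(
    np.arange(values.shape[0]), np.arange(LAGS, values.shape[1]), indexing='ij'
)
rows, cols = rows.ravel(), cols.ravel()

table = pd.DataFrame({'Visits': values[rows, cols]})
for lag in range(1, LAGS + 1):
    table['lag_%d' % lag] = values[rows, cols - lag]
table = table.dropna(how='any')

# Independent check: shifting each row of the wide table must agree.
kept = table.index.to_numpy()
for lag in range(1, LAGS + 1):
    expected = wide.shift(lag, axis=1).values[rows, cols][kept]
    assert np.array_equal(table['lag_%d' % lag].values, expected)
```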

Finally, we can drop the helper index columns `page_indices` and `date_indices`. `flattened` now contains data that we can use with typical machine learning algorithms. We can save this to a CSV file for later use.

In [None]:
flattened.drop(['page_indices', 'date_indices'], inplace=True, axis=1)

flattened.head()
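
Saving the result could look like the sketch below. It uses a small hypothetical stand-in for `flattened`, and the file name `lagged_dataset.csv` and the temporary directory are assumptions chosen for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the final `flattened` table.
table = pd.DataFrame({
    'Page': ['page_a', 'page_a'],
    'date': ['2015-07-08', '2015-07-09'],
    'Visits': [12.0, 15.0],
    'lag_1': [11.0, 12.0],
})

# Hypothetical output location; a temporary directory keeps the sketch tidy.
out_path = os.path.join(tempfile.mkdtemp(), 'lagged_dataset.csv')

# index=False avoids writing the row index as an extra, meaningless column.
table.to_csv(out_path, index=False)

# Reading the file back should give the same columns and values.
restored = pd.read_csv(out_path)
print(restored)
```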