## Preprocessing the Data

This notebook shows how we preprocess the data from the Yahoo Dataset. 

The Python code in this notebook is also available in the file `preprocessing.py`, so it can easily be imported when needed.

Here we deliberately do not wrap the code in a function, so that we can show it working step by step. In the importable file, the same code is encapsulated in a single Python function.

**Importing Libraries**

In [1]:
import pandas as pd
import numpy as np

Open the file containing the interactions in read mode.

In [2]:
file = open("Sample", "r")

Each line in the Yahoo dataset has the following format:

```1317513580 id-563582 0 |user 1 5 12 18 20 |id-552077 |id-555224 |id-555528 |id-559744 |id-559855 |id-560290 |id-560518 |id-560620 |id-563115 |id-563582 |id-563643 |id-563787 |id-563846 |id-563938 |id-564335 |id-564418 |id-564604 |id-565364 |id-565479 |id-565515 |id-565533 |id-565561 |id-565589 |id-565648 |id-565747 |id-565822```

As their README file states, the fields delimited by spaces are: 
> 
> * timestamp: e.g., 1317513291
> * displayed_article_id: e.g., id-560620
> * user_click (0 for no-click and 1 for click): e.g., 0
> * string "|user" indicates the start of user features
> * features are 136-dimensional binary vectors; the IDs of nonzero features are listed after the string "|user"
> * The pool of available articles for recommendation for each user visit is the set of articles that appear in that line of data.  All user IDs (bcookies in our data) are replaced by a common string "user".
> 
> Note that each user is associated with a 136-dimensional binary feature vector.
Features IDs take integer values in {1,2,...,136}.  Feature #1 is the constant
(always 1) feature, and features #2-136 correspond to other user information
such as age, gender, and behavior-targeting features, etc.  Some user features
are not present, since not all users logged in to Yahoo! when they visited the
front page.
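
To make the field layout concrete, here is a minimal sketch that splits one line on the `|` delimiters described in the README. The helper name `parse_line`, the constant `N_FEATURES`, and the dictionary keys are our own illustrative choices, not part of the dataset or of `preprocessing.py`. The sample is the line shown earlier with the article pool trimmed to three entries.

```python
import numpy as np

N_FEATURES = 136  # dimensionality stated in the README


def parse_line(line):
    """Split one raw log line into its fields (hypothetical helper)."""
    # Segments are separated by '|': header, user features, then one per article.
    header, user_part, *article_parts = line.strip().split('|')

    timestamp, displayed_article, click = header.split()

    # Feature IDs are 1-based, so ID k maps to index k - 1.
    features = np.zeros(N_FEATURES, dtype=bool)
    for token in user_part.split()[1:]:  # skip the leading "user" token
        features[int(token) - 1] = True

    # Each remaining segment holds exactly one article ID of the pool.
    pool = [int(part.strip().replace('id-', '')) for part in article_parts]

    return {
        'timestamp': int(timestamp),
        'displayed_article': int(displayed_article.replace('id-', '')),
        'click': int(click),
        'features': features,
        'pool': pool,
    }


# Shortened version of the example line (pool trimmed for readability).
sample = "1317513580 id-563582 0 |user 1 5 12 18 20 |id-552077 |id-555224 |id-563582"
parsed = parse_line(sample)

print(parsed['timestamp'], parsed['displayed_article'], parsed['click'])
print(np.flatnonzero(parsed['features']) + 1)  # nonzero feature IDs, 1-based
print(parsed['pool'])
```

Splitting on `|` keeps the boundary between user features and the article pool explicit, which is why this sketch does not need to guess where the feature IDs end.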



Create the DataFrame that will hold our interactions.

In [3]:
df = pd.DataFrame(columns=['Timestamp', 'Clicked_Article', 'Click', 'User_Features', 'Article_List'])

### Reading line by line

To read the data, we first remove some elements from each line to make the parsing more straightforward.

For each line, we read the fields and store them. At the end of each line, we add the parsed record to the DataFrame.
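
As a small illustration of the cleaning step, the sketch below applies the same sequence of `replace` calls to a shortened sample line. The variable names `raw` and `clean` are only for this example, and the pool is trimmed to three articles.

```python
# Shortened sample line, including the trailing newline a file read would give.
raw = "1317513580 id-563582 0 |user 1 5 12 18 20 |id-552077 |id-555224 |id-563582\n"

clean = raw
for token in ('id-', '\n', 'user', '|'):
    clean = clean.replace(token, '')

# Removing "|user" leaves two consecutive spaces; collapse them into one.
clean = clean.replace('  ', ' ')

print(clean)
# 1317513580 563582 0 1 5 12 18 20 552077 555224 563582
```

Note that the cleaned line no longer marks where the user features end and the article pool begins. The parsing loop in this notebook tells them apart by value: feature IDs fit into the 136-element vector, while article IDs are far too large to be valid indices.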


In [4]:
rows = []  # one dict per line; DataFrame.append was removed in pandas 2.0
for line in file:
    # Remove the parts of the line string that are not data
    line = line.replace('id-', '')
    line = line.replace('\n', '')
    line = line.replace('user', '')
    line = line.replace('|', '')
    line = line.replace('  ', ' ')

    aux_str = ''
    info = 0 # 0 = time; 1 = clicked_article; 2 = click; 3 = user_features; 4 = articles_list
    features = np.zeros(136, dtype=bool)  # the np.bool alias was removed in NumPy 1.24
    articles_list = []
    for i in line:
        if i == ' ':
            if info == 0:
                timestamp = int(aux_str)
                aux_str = ''
                info+=1
            elif info == 1:
                clicked_article = int(aux_str)
                aux_str = ''
                info+=1
            elif info == 2:
                click = int(aux_str)
                aux_str = ''
                info+=1
            elif info == 3:
                try:
                    features[int(aux_str)-1] = 1
                    aux_str = ''
                except IndexError:
                    # Too large to be a feature ID, so this is the first article ID
                    articles_list.append(int(aux_str))
                    aux_str = ''
                    info+=1
            elif info == 4:
                articles_list.append(int(aux_str))
                aux_str = ''
        else:
            aux_str+=i

    # The last article ID has no trailing space, so it is stored here
    articles_list.append(int(aux_str))
    aux_str = ''
    rows.append({'Timestamp': timestamp, 'Clicked_Article': clicked_article, 'Click': click, 'User_Features': features, 'Article_List': np.asarray(articles_list)})

file.close()

# Build the DataFrame once, keeping the column order defined above
df = pd.DataFrame(rows, columns=df.columns)

Now we have our DataFrame with the columns `Timestamp`, `Clicked_Article`, `Click`, `User_Features`, and `Article_List`.

In [5]:
df.head(5)

| | Timestamp | Clicked_Article | Click | User_Features | Article_List |
|---|---|---|---|---|---|
| 0 | 1317513291 | 560620 | 0 | [True, False, False, False, False, False, Fals... | [552077, 555224, 555528, 559744, 559855, 56029... |
| 1 | 1317513291 | 565648 | 0 | [True, False, False, False, False, False, Fals... | [552077, 555224, 555528, 559744, 559855, 56029... |
| 2 | 1317513291 | 563115 | 0 | [True, False, False, False, False, False, Fals... | [552077, 555224, 555528, 559744, 559855, 56029... |
| 3 | 1317513292 | 552077 | 0 | [True, False, False, False, False, False, True... | [552077, 555224, 555528, 559744, 559855, 56029... |
| 4 | 1317513292 | 564335 | 0 | [True, False, False, False, False, False, Fals... | [552077, 555224, 555528, 559744, 559855, 56029... |
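
Because `User_Features` and `Article_List` hold one NumPy array per row, a quick sanity check can be useful before going further. The sketch below is only an illustration: the frame `demo`, the helper `make_features`, and all values in it are made up so that the example runs on its own. With the real data you would apply the same calls to `df`.

```python
import numpy as np
import pandas as pd


def make_features(ids):
    """Build a 136-dimensional binary vector from 1-based feature IDs (hypothetical helper)."""
    vec = np.zeros(136, dtype=bool)
    for k in ids:
        vec[k - 1] = True
    return vec


# Tiny stand-in frame with the same columns as df (values are invented).
demo = pd.DataFrame({
    'Timestamp': [1, 2],
    'Clicked_Article': [10, 20],
    'Click': [0, 1],
    'User_Features': [make_features([1, 5]), make_features([1, 12, 18])],
    'Article_List': [np.array([10, 20, 30]), np.array([10, 20])],
})

# Stack the per-row vectors into a single (n_rows, 136) matrix.
X = np.stack(list(demo['User_Features']))

# Fraction of rows in which the displayed article was clicked.
click_rate = demo['Click'].mean()

# Every displayed article should belong to the pool listed on the same row.
in_pool = [bool(a in pool) for a, pool in zip(demo['Clicked_Article'], demo['Article_List'])]

print(X.shape, click_rate, all(in_pool))
```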
