# Preprocessing and Cleaning of the Data

In [130]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

import plotly.express as px

from progressbar import ProgressBar

### Loading the data

In [2]:
behaviors = pd.read_csv("../../data/MINDlarge_train/behaviors.tsv", sep='\t', header=None)
news = pd.read_csv("../../data/MINDlarge_train/news.tsv", sep='\t', header=None)

Let's first give our datasets some proper column names:

In [3]:
behaviors = behaviors.rename(columns={0:'impression_id', 1 : 'user_id', 2 : 'time', 3:'history', 4 : 'impressions'})
news = news.rename(columns={0:'article_id', 1:'category', 2:'subcategory', 3:'title', 4:'abstract', 5:'url', 6:'title_entities', 7:'abstract_entities'})
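
As a side note, the same names can also be passed directly to `pd.read_csv` through its `names` parameter, which would make the separate renaming step unnecessary. The sketch below uses a tiny in-memory TSV as a stand-in for `behaviors.tsv`; the two sample rows and the variable names `sample_tsv` and `sample_behaviors` are made up for illustration.

```python
import io
import pandas as pd

# Made-up two-row stand-in for behaviors.tsv (tab-separated, no header row)
sample_tsv = (
    "1\tU1\t11/10/2019 11:30:54 AM\tN1 N2\tN3-0 N4-1\n"
    "2\tU2\t11/12/2019 1:45:29 PM\tN5\tN6-1\n"
)

behaviors_cols = ['impression_id', 'user_id', 'time', 'history', 'impressions']

# names= assigns the column names while reading; header=None tells pandas
# that the first line is data, not a header
sample_behaviors = pd.read_csv(io.StringIO(sample_tsv), sep='\t',
                               header=None, names=behaviors_cols)
print(sample_behaviors.columns.tolist())
```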


In [4]:
news.head(3)

| | article_id | category | subcategory | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| 0 | N88753 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 1 | N45436 | news | newsscienceandtechnology | Walmart Slashes Prices on Last-Generation iPads | Apple's new iPad releases bring big deals on l... | https://assets.msn.com/labs/mind/AABmf2I.html | [{"Label": "IPad", "Type": "J", "WikidataId": ... | [{"Label": "IPad", "Type": "J", "WikidataId": ... |
| 2 | N23144 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... | https://assets.msn.com/labs/mind/AAB19MK.html | [{"Label": "Adipose tissue", "Type": "C", "Wik... | [{"Label": "Adipose tissue", "Type": "C", "Wik... |


In [5]:
news.shape

(101527, 8)

We have more than 100,000 news articles in our *news* dataset with information concerning the **news category, subcategory, title, abstract and even some entity embeddings** (most of the URLs don't work anymore, so we don't have access to the full bodies). Let's check whether these are all unique articles or if we also have some duplicates:

In [6]:
print("Number of unique news articles: ", news.title.nunique())
print("Number of duplicates:             ", news.shape[0] - news.title.nunique())

Number of unique news articles:  98388
Number of duplicates:              3139


Apparently, there are **news articles with multiple IDs**. We don't just want to drop them, because this would result in a loss of useful information concerning the click behaviors and reading histories in our *behaviors* dataset, which looks like this:

In [7]:
behaviors.head(3)

| | impression_id | user_id | time | history | impressions |
|---|---|---|---|---|---|
| 0 | 1 | U87243 | 11/10/2019 11:30:54 AM | N8668 N39081 N65259 N79529 N73408 N43615 N2937... | N78206-0 N26368-0 N7578-0 N58592-0 N19858-0 N5... |
| 1 | 2 | U598644 | 11/12/2019 1:45:29 PM | N56056 N8726 N70353 N67998 N83823 N111108 N107... | N47996-0 N82719-0 N117066-0 N8491-0 N123784-0 ... |
| 2 | 3 | U532401 | 11/13/2019 11:23:03 AM | N128643 N87446 N122948 N9375 N82348 N129412 N5... | N103852-0 N53474-0 N127836-0 N47925-1 |


Let us first prepare this dataset before we get back to handling the duplicate news articles.

### Preparing the *behaviors* dataset

In [8]:
behaviors.shape

(2232748, 5)

Here, we have more than two million online sessions on MSN with information concerning the **user ID, date and time, the click history, and the recommended articles together with the user's behavior** for the respective session (each impression ends in `-1` if the article was clicked and in `-0` otherwise). Let's see how many unique users there are:

In [9]:
print(f'There are {len(behaviors.user_id.unique())} individual users in our dataset.')
print(f'The average number of sessions is: {behaviors.shape[0] / len(behaviors.user_id.unique()):.2f}')

There are 711222 individual users in our dataset.
The average number of sessions is: 3.14
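
The impressions format described above can be unpacked with a few lines of string handling. The following is only a small sketch: the helper name `parse_impressions` is hypothetical, and the example string is the impressions entry of the third row shown in `behaviors.head(3)`.

```python
def parse_impressions(impressions):
    """Split an impressions string into (article_id, clicked) pairs."""
    pairs = []
    for token in impressions.split():
        # rsplit on the last '-' separates the article ID from the 0/1 label
        article_id, label = token.rsplit('-', 1)
        pairs.append((article_id, label == '1'))
    return pairs

example = "N103852-0 N53474-0 N127836-0 N47925-1"
parsed = parse_impressions(example)
clicked = [art for art, was_clicked in parsed if was_clicked]
print(clicked)  # ['N47925']
```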



Since we have to work with the click history, which was recorded before the sessions and is the same for all sessions of a user, as these two random samples show:

In [10]:
behaviors_id = behaviors[behaviors.user_id == 'U87243']
len(behaviors_id.history), len(behaviors_id.history.unique())

(4, 1)

In [11]:
behaviors_id = behaviors[behaviors.user_id == 'U593596']
len(behaviors_id.history), len(behaviors_id.history.unique())

(8, 1)

we want to include **only users with at least five articles read** in their history. So we need to drop users without any history as well as users with too few articles read:

In [12]:
behaviors.isna().sum()

impression_id        0
user_id              0
time                 0
history          46065
impressions          0
dtype: int64

In [13]:
behaviors.dropna(inplace=True)

In [14]:
behaviors.shape

(2186683, 5)

In [15]:
behaviors['length_history'] = behaviors.history.str.split()

In [16]:
behaviors['length_history'] = behaviors.length_history.map(len)

In [17]:
behaviors = behaviors[behaviors['length_history'] >= 5]

In [18]:
behaviors.length_history.value_counts()

5      83726
6      78792
7      73876
8      67973
9      63602
       ...  
371        6
317        5
385        4
404        2
479        1
Name: length_history, Length: 411, dtype: int64

In [19]:
behaviors.shape[0]

1940992
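
For reference, the history length can also be computed and filtered in a single vectorized step. This is a sketch on a made-up three-user frame (`toy` is not part of the notebook's data), and it relies on the fact that a missing history yields `NaN`, which never satisfies `>= 5`.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': ['U1', 'U2', 'U3'],
    'history': ['N1 N2 N3 N4 N5', 'N1 N2', None],
})

# .str.split().str.len() yields NaN for missing histories, so the comparison
# below drops users without history as well as users with a short history
lengths = toy['history'].str.split().str.len()
filtered = toy[lengths >= 5].assign(length_history=lengths)
print(filtered)
```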

After dropping those users, the numbers for our *behaviors* dataset now look like this:

In [20]:
print(f'There are {len(behaviors.user_id.unique())} individual users in our dataset.')
print(f'The average number of sessions is: {behaviors.shape[0] / len(behaviors.user_id.unique()):.2f}')

There are 575603 individual users in our dataset.
The average number of sessions is: 3.37


### Dropping duplicate article IDs in *news* and remapping them in *behaviors*
Now that we have set up our *behaviors* dataset with respect to users, let's get back to the duplicate news articles. With different IDs for what are de facto the same articles, we would not be able to track similarities among users sufficiently. In the following, we will **replace every redundant article ID with the first ID for the respective article**. In order to do this, we first create a subset with all the duplicates in it:

In [21]:
duplis_title = news[news.duplicated(subset="title", keep=False)]

In [22]:
duplis_title.sort_values(by="title").head(3)

| | article_id | category | subcategory | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| 22698 | N11490 | travel | travelnews | $2 million Florida Lottery ticket sold in Jack... | | https://assets.msn.com/labs/mind/AAJLM24.html | [{"Label": "Florida Lottery", "Type": "O", "Wi... | [] |
| 43099 | N33885 | travel | travelnews | $2 million Florida Lottery ticket sold in Jack... | | https://assets.msn.com/labs/mind/AAJTMIW.html | [{"Label": "Florida Lottery", "Type": "O", "Wi... | [] |
| 53996 | N88198 | lifestyle | lifestylebuzz | $23 Million SuperLotto Plus Ticket Sold In San... | Are you holding the winning ticket worth $23 m... | https://assets.msn.com/labs/mind/AAJCjxB.html | [{"Label": "San Fernando Valley", "Type": "L",... | [] |


In [23]:
title_set = duplis_title['title'].unique()

In [24]:
article_list = []
for title in title_set:
    x = duplis_title[duplis_title['title']==title]['article_id'].to_list()
    article_list.append(x)

In [25]:
article_list[:5]

[['N93333', 'N18950'],
 ['N108072', 'N19710'],
 ['N112099', 'N50263'],
 ['N108765', 'N123453'],
 ['N1396', 'N45124']]

We now have a **list which contains article IDs for every article which has multiple IDs** in our original dataset. With this list, we can generate a dictionary called articleID_dict, which maps all the redundant IDs (keys) to a single ID (value):

In [26]:
articleID_dict = {}
articles_to_change = []
for article in article_list:
    value = article[0]
    keys = article[1:]
    for k in keys:
        articleID_dict[k] = value
        articles_to_change.append(k)
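
The same kind of mapping can also be derived without explicit loops by letting pandas look up the first ID per title. The sketch below runs on a made-up frame (`toy_news`, `toy_id_map` are illustrative names) and assumes that no title is missing; rows with a missing title would need separate handling.

```python
import pandas as pd

toy_news = pd.DataFrame({
    'article_id': ['N1', 'N2', 'N3', 'N4', 'N5'],
    'title': ['A', 'B', 'A', 'C', 'A'],
})

# For every row, look up the first article ID that shares its title
first_id = toy_news.groupby('title')['article_id'].transform('first')

# Rows whose own ID differs from that first ID are the redundant ones
redundant = toy_news['article_id'] != first_id
toy_id_map = dict(zip(toy_news.loc[redundant, 'article_id'], first_id[redundant]))
print(toy_id_map)  # {'N3': 'N1', 'N5': 'N1'}
```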

Let's make a copy of the original *behaviors* dataframe and convert it to a NumPy array, so that we can **loop through all the redundant IDs in the *behaviors* dataset and homogenize them** according to our dictionary:

In [27]:
behav = behaviors.copy()
behav = behav.to_numpy()

In [28]:
pbar = ProgressBar()
userIDs_hist_changes = []
userIDs_impr_changes = []
users_to_change = []
articles_to_change_set = set(articles_to_change)

In [29]:
for idx in pbar(range(behav.shape[0])):
    user_row = behav[idx]
    # History: replace redundant IDs token by token, so that a short ID
    # (e.g. 'N123') is never replaced inside a longer one (e.g. 'N1234')
    hist_list = user_row[3].split()
    if set(hist_list) & articles_to_change_set:
        users_to_change.append(idx)
        userIDs_hist_changes.append(user_row[1])
        behav[idx][3] = ' '.join(articleID_dict.get(art, art) for art in hist_list)
    # Impressions: tokens look like 'N78206-0', so split off the click label
    impression_list = user_row[4].split()
    if {imp.rsplit('-', 1)[0] for imp in impression_list} & articles_to_change_set:
        userIDs_impr_changes.append(user_row[1])
        new_impressions = []
        for imp in impression_list:
            art, label = imp.rsplit('-', 1)
            new_impressions.append(f"{articleID_dict.get(art, art)}-{label}")
        behav[idx][4] = ' '.join(new_impressions)


100% |########################################################################|
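
When remapping IDs inside space-separated strings, plain substring replacement can also match the prefix of a longer ID, which is why matching whole tokens is the safer approach. A small illustration with a made-up mapping (`toy_map`, `toy_history` and the helper `remap_history` are hypothetical):

```python
def remap_history(history, id_map):
    """Replace whole article IDs in a space-separated history string."""
    return ' '.join(id_map.get(art, art) for art in history.split())

toy_map = {'N12': 'N7'}          # hypothetical mapping: N12 is a duplicate of N7
toy_history = 'N12 N123 N5'

# Substring replacement also rewrites the prefix of the unrelated ID N123
substring_result = toy_history.replace('N12', toy_map['N12'])
print(substring_result)                      # N7 N73 N5

# Token-wise replacement only touches the exact ID
token_result = remap_history(toy_history, toy_map)
print(token_result)                          # N7 N123 N5
```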


Let's **check whether our method worked**. For this task, we **construct a list containing all articles from the histories in our array and compare it to the articles_to_change list** by converting both to sets and calculating their intersection, which should be empty:

In [30]:
all_articles = []
for row in behav:
    hist_list=row[3].split(' ')
    for article in hist_list:
        all_articles.append(article)
    

In [31]:
set(all_articles) & articles_to_change_set

set()
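
The check above looks at the history column. If the impressions column should be covered as well, a sketch along the following lines could be used; the helper `collect_article_ids` and the toy rows are assumptions for illustration, with the column positions 3 and 4 matching the array layout used in this notebook.

```python
def collect_article_ids(rows, hist_col=3, impr_col=4):
    """Gather every article ID found in the history and impressions columns."""
    ids = set()
    for row in rows:
        ids.update(row[hist_col].split())
        # Impression tokens carry a '-0' / '-1' click label that is stripped here
        ids.update(token.rsplit('-', 1)[0] for token in row[impr_col].split())
    return ids

toy_rows = [
    [1, 'U1', 't', 'N1 N2', 'N3-0 N4-1'],
    [2, 'U2', 't', 'N2 N5', 'N1-1'],
]
toy_redundant = {'N9'}   # made-up set of redundant IDs
print(collect_article_ids(toy_rows) & toy_redundant)  # set()
```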

Apparently, our method worked! Now let's make a **new dataframe out of our processed behavioral data**:

In [32]:
behaviors_new = pd.DataFrame(behav, columns= behaviors.columns)

and a new one for the information on **news articles without duplicates**:

In [33]:
news_new = news.drop_duplicates(subset="title", keep='first')

### Saving processed datasets
Now we want to save the processed data by writing it to CSV files:

In [34]:
behaviors_new.to_csv("../../data/MINDlarge_train/behaviors_processed.csv", index=False)

In [None]:
news_new.to_csv("../../data/MINDlarge_train/news_processed.csv", index=False)

### Preprocessing for collaborative filtering approaches
Recommender systems based on collaborative filtering (CF) rely on user-article interactions. Because CF is central to understanding modern recommender systems in general, we also want to construct and discuss different versions of this approach. To do this, it is useful to further process our data with respect to user-article interactions.

Because we **only work with the click history** when deploying CF methods, we only need one session per user:


In [36]:
behaviors_cf = behaviors_new.drop_duplicates(subset='user_id').copy()

In [37]:
behaviors_cf.shape[0] == behaviors_new.user_id.nunique()

True

In [108]:
behaviors_cf.head()

| | impression_id | user_id | time | history | impressions | length_history |
|---|---|---|---|---|---|---|
| 0 | 1 | U87243 | 11/10/2019 11:30:54 AM | N8668 N39081 N65259 N79529 N73408 N43615 N2937... | N78206-0 N26368-0 N7578-0 N58592-0 N19858-0 N5... | 16 |
| 1 | 2 | U598644 | 11/12/2019 1:45:29 PM | N56056 N8726 N70353 N67998 N83823 N111108 N107... | N47996-0 N82719-0 N117066-0 N8491-0 N123784-0 ... | 24 |
| 2 | 3 | U532401 | 11/13/2019 11:23:03 AM | N128643 N87446 N122948 N9375 N82348 N129412 N5... | N103852-0 N53474-0 N127836-0 N47925-1 | 16 |
| 3 | 4 | U593596 | 11/12/2019 12:24:09 PM | N31043 N39592 N4104 N8223 N114581 N92747 N1207... | N38902-0 N76434-0 N71593-0 N100073-0 N108736-0... | 13 |
| 4 | 5 | U239687 | 11/14/2019 8:03:01 PM | N65250 N122359 N71723 N53796 N41663 N41484 N11... | N76209-0 N48841-0 N67937-0 N62235-0 N6307-0 N3... | 339 |


Now we want to construct a NumPy array out of this smaller dataset

In [38]:
behav_cf = behaviors_cf.to_numpy(copy=True)

so that we can get **two arrays with user-article-interactions**. One for training and another one with the last article in history for testing purposes:

In [59]:
uai_train, uai_test = [], []

for row in behav_cf:
    user = row[1]
    hist = row[3].split(' ')
    for art in hist[:-1]:
        uai_train.append([user, art])
    last_art = hist[-1]
    uai_test.append([user, last_art])
     

Let's check if that worked:

In [96]:
len(behaviors.loc[0].history.split(' ')[:-1])

15

In [95]:
uai_train[0:15], uai_test[0]

([['U87243', 'N8668'],
  ['U87243', 'N39081'],
  ['U87243', 'N65259'],
  ['U87243', 'N79529'],
  ['U87243', 'N73408'],
  ['U87243', 'N43615'],
  ['U87243', 'N29379'],
  ['U87243', 'N32031'],
  ['U87243', 'N110232'],
  ['U87243', 'N101921'],
  ['U87243', 'N12614'],
  ['U87243', 'N129591'],
  ['U87243', 'N105760'],
  ['U87243', 'N60457'],
  ['U87243', 'N1229']],
 ['U87243', 'N64932'])
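
The split used here is a leave-one-out scheme: every article except the last one in a user's history goes into the training interactions, and the last one is held out for testing. As a compact sketch on made-up histories (the function name `leave_one_out` is hypothetical):

```python
def leave_one_out(user_histories):
    """Split (user, history string) pairs into train and test interactions."""
    train, test = [], []
    for user, history in user_histories:
        articles = history.split()
        # All but the last article are used for training
        train.extend([user, art] for art in articles[:-1])
        # The last article is held out for testing
        test.append([user, articles[-1]])
    return train, test

toy_train, toy_test = leave_one_out([('U1', 'N1 N2 N3'), ('U2', 'N4 N5')])
print(toy_train)  # [['U1', 'N1'], ['U1', 'N2'], ['U2', 'N4']]
print(toy_test)   # [['U1', 'N3'], ['U2', 'N5']]
```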

That looks good! Now we want to add **integer IDs for users and articles**, which we can later use for a **dictionary-of-keys (DOK) matrix**, which in turn will be **employed in a neural network**. For our train data, we can do it like this:

In [115]:
uai_train_df = pd.DataFrame(uai_train, columns=['user_id', 'article_id'])

In [117]:
uai_train_df['user_int_id'] = uai_train_df.user_id.astype('category').cat.codes
uai_train_df['article_int_id'] = uai_train_df.article_id.astype('category').cat.codes

In [122]:
uai_train_df.head(3)

| | user_id | article_id | user_int_id | article_int_id |
|---|---|---|---|---|
| 0 | U87243 | N8668 | 564162 | 67522 |
| 1 | U87243 | N39081 | 564162 | 36610 |
| 2 | U87243 | N65259 | 564162 | 53587 |
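
To see how `cat.codes` arrives at these integers: the distinct values become categories, which pandas sorts (lexicographically for strings), and each code is the position of a value in that sorted list. A toy example with made-up user IDs:

```python
import pandas as pd

toy_ids = pd.Series(['U30', 'U10', 'U30', 'U20'])
as_category = toy_ids.astype('category')

# Categories are the sorted distinct values; codes are their positions
print(as_category.cat.categories.tolist())  # ['U10', 'U20', 'U30']
print(as_category.cat.codes.tolist())       # [2, 0, 2, 1]
```

Because the codes run from 0 to the number of distinct values minus one, they can be used directly as row and column indices of a matrix.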


For our test data, we need to make sure that it contains only articles which are also in the train data, because only those have an integer ID. First, we need to find the test articles that do not appear in the train data:

In [110]:
train_articles = [elem[1] for elem in uai_train]
test_articles = [elem[1] for elem in uai_test]
articles_to_drop = set(test_articles)-set(train_articles)

With this set, we're now able to reduce our test data:

In [111]:
uai_test_red = [ele for ele in uai_test if ele[1] not in articles_to_drop]

and can thus create a test dataframe without unknown articles:

In [112]:
uai_test_df = pd.DataFrame(uai_test_red, columns=['user_id', 'article_id'])

We can then construct two dictionaries, which map the original user and article IDs to the integer IDs:

In [118]:
user_code_dict = pd.Series(uai_train_df.user_int_id.values,
                           index=uai_train_df.user_id).to_dict()

article_code_dict = pd.Series(uai_train_df.article_int_id.values,
                              index=uai_train_df.article_id).to_dict()

In [119]:
uai_test_df['user_int_id'] = [user_code_dict[user] for user in uai_test_df.user_id]
uai_test_df['article_int_id'] = [article_code_dict[art] for art in uai_test_df.article_id]

In [123]:
uai_test_df.head(3)

| | user_id | article_id | user_int_id | article_int_id |
|---|---|---|---|---|
| 0 | U87243 | N64932 | 564162 | 53345 |
| 1 | U598644 | N31055 | 448422 | 31394 |
| 2 | U532401 | N31323 | 388857 | 31569 |
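
An alternative to the list comprehensions used for the lookup is `Series.map`, which returns `NaN` for IDs that are missing from the dictionary instead of raising a `KeyError`. That makes unknown articles easy to spot. The mapping and the test frame below are made up for illustration:

```python
import pandas as pd

toy_code_dict = {'N1': 0, 'N2': 1}   # hypothetical article -> integer ID mapping
toy_test = pd.DataFrame({'article_id': ['N2', 'N1', 'N9']})

# .map returns NaN for IDs that are missing from the dictionary
toy_test['article_int_id'] = toy_test['article_id'].map(toy_code_dict)
unknown = toy_test['article_int_id'].isna()
print(toy_test.loc[unknown, 'article_id'].tolist())  # ['N9']
```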


In [124]:
uai_train_df.to_csv("../../data/MINDlarge_train/large_train.csv", index=False)

In [125]:
uai_test_df.to_csv("../../data/MINDlarge_train/large_test.csv", index=False)

Now we actually **create the DOK matrix**:

In [128]:
train_filename = "../../data/MINDlarge_train/large_train.csv"

In [127]:
num_users, num_articles = uai_train_df.user_id.nunique(), uai_train_df.article_id.nunique()
num_users, num_articles

(575603, 76236)

In [132]:
train_matrix = sp.dok_matrix((num_users, num_articles), dtype=np.float32)

with open(train_filename, "r") as f:
    header = f.readline()
    line = f.readline()
    print(header)
    print(line)
    while line:
        line_list = line.split(",")
        user, article = int(line_list[2]), int(line_list[3])
        train_matrix[user, article] = 1.0
        line = f.readline()

user_id,article_id,user_int_id,article_int_id

U87243,N8668,564162,67522
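
Since the integer IDs are already available as dataframe columns, such a matrix could also be assembled without reading the file line by line. The following is a sketch of that alternative using `scipy.sparse.coo_matrix` on a made-up interaction table; repeated pairs are dropped first because the COO format would otherwise add up their values.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

toy_interactions = pd.DataFrame({
    'user_int_id':    [0, 0, 1, 2, 2],
    'article_int_id': [0, 2, 1, 0, 0],   # the last pair is a repeated interaction
})

# Repeated (user, article) pairs would be summed by coo_matrix, so drop them first
pairs = toy_interactions.drop_duplicates()
values = np.ones(len(pairs), dtype=np.float32)

toy_matrix = sp.coo_matrix(
    (values, (pairs['user_int_id'].to_numpy(), pairs['article_int_id'].to_numpy())),
    shape=(3, 3),
).todok()

print(toy_matrix[0, 2], toy_matrix[2, 0], toy_matrix[1, 0])  # 1.0 1.0 0.0
```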



Later on, when we want to evaluate our recommender systems, we need to **compare the ranking of the known (but not learned) interactions with known non-interactions**, so that we can tell how useful our recommendations are: the higher the test interaction ranks, the better our model. In order to do this, we want to **sample 99 unread articles for every user in our test set**.

In [133]:
test_interactions = list(zip(uai_test_df.user_int_id, uai_test_df.article_int_id))

In [134]:
num_negatives = 99

In [136]:
negative_interactions = []
for u, i in test_interactions:
    negatives = []
    for t in range(num_negatives):
        j = np.random.randint(num_articles)
        # Resample while j is a known training interaction or the held-out test article itself
        while (u, j) in train_matrix.keys() or j == i:
            j = np.random.randint(num_articles)
        negatives.append(j)
    negative_interactions.append(negatives)

In [137]:
len(test_interactions), len(negative_interactions), len(negative_interactions[0])

(574780, 574780, 99)
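
A variation of this sampling step, shown here only as a sketch, keeps a per-user set of excluded articles and draws distinct negatives. Unlike the loop above, it samples without repetition; the function name `sample_negatives`, the toy inputs and the fixed seed are assumptions for illustration. The loop only terminates if enough unseen articles exist.

```python
import numpy as np

def sample_negatives(seen_articles, positive, num_articles, num_negatives, rng):
    """Draw distinct article IDs the user has neither read nor been tested on."""
    excluded = set(seen_articles) | {positive}
    negatives = set()
    while len(negatives) < num_negatives:
        candidate = int(rng.integers(num_articles))   # uniform in [0, num_articles)
        if candidate not in excluded:
            negatives.add(candidate)
    return sorted(negatives)

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
toy_negatives = sample_negatives(seen_articles=[1, 2, 3], positive=4,
                                 num_articles=50, num_negatives=10, rng=rng)
print(len(toy_negatives), set(toy_negatives) & {1, 2, 3, 4})  # 10 set()
```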

**Finally**, we want to **write each positive interaction along with its randomly sampled non-interactions into a tab-separated file**, and we're done with preprocessing the data!

In [138]:
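# Copy the inner lists as well, so that inserting the positive interaction
# below does not modify negative_interactions
output = [list(negatives) for negatives in negative_interactions]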

In [139]:
for i in range(len(test_interactions)):
    output[i].insert(0, test_interactions[i])

In [141]:
with open('../../data/MINDlarge_train/large_test_negatives.tsv', 'w') as f:
    for line in output:
        line_str = '\t'.join(str(ele) for ele in line) + "\n"
        f.write(line_str)
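
To illustrate how one positive and its sampled negatives can be used during evaluation, the sketch below computes whether the held-out article lands among the top k of the scored candidates. The hit-at-k criterion is an assumption made here for illustration, not something fixed by this notebook, and the scores are made up rather than produced by a model.

```python
def hit_at_k(scores, k=10):
    """scores[0] belongs to the held-out article, the rest to sampled negatives."""
    positive_score = scores[0]
    # Rank = number of negatives that score strictly higher than the positive
    rank = sum(score > positive_score for score in scores[1:])
    return rank < k

toy_scores = [0.8, 0.9, 0.1, 0.3]     # made-up model scores for illustration
print(hit_at_k(toy_scores, k=1))  # False
print(hit_at_k(toy_scores, k=2))  # True
```

Averaging this value over all test users would give the share of users for whom the held-out article was ranked among the top k.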