# Preprocessing and Cleaning of the Data

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

import plotly.express as px

from progressbar import ProgressBar

### Loading the data

In [2]:
behaviors = pd.read_csv("../../data/mind_small_train/behaviors.tsv", sep='\t', header=None)
news = pd.read_csv("../../data/mind_small_train/news.tsv", sep='\t', header=None)

Let's first give our datasets some proper column names:

In [3]:
behaviors = behaviors.rename(columns={0:'impression_id', 1 : 'user_id', 2 : 'time', 3:'history', 4 : 'impressions'})
news = news.rename(columns={0:'article_id', 1:'category', 2:'subcategory', 3:'title', 4:'abstract', 5:'url', 6:'title_entities', 7:'abstract_entities'})
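
As a side note, the column names could also be assigned while loading. The following is only a minimal sketch on an in-memory sample: the two rows and the name `BEHAVIOR_COLUMNS` are made up for illustration and are not part of the dataset.

```python
import io
import pandas as pd

BEHAVIOR_COLUMNS = ['impression_id', 'user_id', 'time', 'history', 'impressions']

# Hypothetical two-row sample in the tab-separated layout of behaviors.tsv
sample_tsv = (
    "1\tU1\t11/11/2019 9:05:58 AM\tN1 N2 N3\tN4-1 N5-0\n"
    "2\tU2\t11/12/2019 6:11:30 PM\t\tN6-0 N7-1\n"
)

# Passing names= labels the columns directly and treats the first line as data
sample_behaviors = pd.read_csv(io.StringIO(sample_tsv), sep='\t', names=BEHAVIOR_COLUMNS)
print(sample_behaviors.columns.tolist())
```

The empty history of the second row is read as a missing value, which is the case we deal with further below.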


In [4]:
news.head(3)

| | article_id | category | subcategory | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 1 | N19639 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... | https://assets.msn.com/labs/mind/AAB19MK.html | [{"Label": "Adipose tissue", "Type": "C", "Wik... | [{"Label": "Adipose tissue", "Type": "C", "Wik... |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId... |


In [5]:
news.shape

(51282, 8)

We have more than 50,000 news articles in our *news* dataset with information concerning the **news category, subcategory, title, abstract and even some linked entities** (most of the URLs don't work anymore, so we don't have access to the full bodies). Let's check whether these are all unique articles or if we also have some duplicates, judged by their titles:

In [6]:
print("Number of unique news articles: ", news.title.nunique())
print("Number of duplicates:             ", news.shape[0] - news.title.nunique())

Number of unique news articles:  50434
Number of duplicates:              848


Apparently, there are **news articles with multiple IDs**. We don't just want to drop them, because this would result in a loss of useful information concerning the click behaviors and reading histories in our *behaviors* dataset, which looks like this:

In [7]:
behaviors.head(3)

| | impression_id | user_id | time | history | impressions |
|---|---|---|---|---|---|
| 0 | 1 | U13740 | 11/11/2019 9:05:58 AM | N55189 N42782 N34694 N45794 N18445 N63302 N104... | N55689-1 N35729-0 |
| 1 | 2 | U91836 | 11/12/2019 6:11:30 PM | N31739 N6072 N63045 N23979 N35656 N43353 N8129... | N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N... |
| 2 | 3 | U73700 | 11/14/2019 7:01:48 AM | N10732 N25792 N7563 N21087 N41087 N5445 N60384... | N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N... |


Let us first prepare this dataset before we get back to handling the duplicate news articles.

### Preparing the *behaviors* dataset

In [8]:
behaviors.shape

(156965, 5)

Here, we have more than 150,000 online sessions on MSN with information concerning **user ID, date and time of day, the click history, and the recommended articles and user behavior** (ending on -1 = clicked, -0 = not clicked) for the respective session. Let's see how many unique users there are:

In [9]:
print(f'There are {len(behaviors.user_id.unique())} individual users in our dataset.')
print(f'The average number of sessions is: {behaviors.shape[0] / len(behaviors.user_id.unique()):.2f}')

There are 50000 individual users in our dataset.
The average number of sessions is: 3.14



We will work with the click history, which was recorded before the sessions and is the same for all sessions of a user, as these two sample users show:

In [12]:
behaviors_id = behaviors[behaviors.user_id == 'U13740']
len(behaviors_id.history), len(behaviors_id.history.unique())

(3, 1)

In [14]:
behaviors_id = behaviors[behaviors.user_id == 'U73700']
len(behaviors_id.history), len(behaviors_id.history.unique())

(2, 1)

Since the history is all we have to learn from, we want to include **only users with at least five articles read** in their history. So we need to drop users without any history as well as users with too few articles read:

In [15]:
behaviors.isna().sum()

impression_id       0
user_id             0
time                0
history          3238
impressions         0
dtype: int64

In [16]:
behaviors.dropna(inplace=True)

In [17]:
behaviors.shape

(153727, 5)

In [18]:
behaviors['length_history'] = behaviors.history.str.split()

In [19]:
behaviors['length_history'] = behaviors.length_history.map(len)

In [20]:
behaviors = behaviors[behaviors['length_history'] >= 5]

In [21]:
behaviors.length_history.value_counts()

5      6101
6      5591
7      5341
8      4636
9      4270
       ... 
198       5
293       4
283       2
193       1
301       1
Name: length_history, Length: 251, dtype: int64

In [22]:
behaviors.shape[0]

136120
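
The filtering steps above could also be wrapped into one small function. This is just a sketch on toy data; the function name `filter_min_history` and the toy rows are our own inventions for illustration.

```python
import pandas as pd

def filter_min_history(df, min_len=5):
    """Drop rows without history and rows whose history has fewer than min_len articles."""
    df = df.dropna(subset=['history']).copy()
    # str.split() splits on whitespace; str.len() counts the resulting tokens
    df['length_history'] = df['history'].str.split().str.len()
    return df[df['length_history'] >= min_len]

toy = pd.DataFrame({
    'user_id': ['U1', 'U2', 'U3'],
    'history': ['N1 N2 N3 N4 N5', None, 'N1 N2'],
})
filtered = filter_min_history(toy)
print(filtered)
```

Only the first toy user survives: the second has no history at all and the third has read too few articles.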

After dropping those users, the numbers for our *behaviors* dataset now look like this:

In [23]:
print(f'There are {len(behaviors.user_id.unique())} individual users in our dataset.')
print(f'The average number of sessions is: {behaviors.shape[0] / len(behaviors.user_id.unique()):.2f}')

There are 40331 individual users in our dataset.
The average number of sessions is: 3.38
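
Earlier we only looked at two users to see that the history does not change between sessions. A sketch of how this could be checked for every user at once, shown here on made-up toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': ['U1', 'U1', 'U2', 'U2'],
    'history': ['N1 N2', 'N1 N2', 'N3', 'N3 N4'],
})
# Number of distinct history strings per user; 1 means identical in every session
histories_per_user = toy.groupby('user_id')['history'].nunique()
inconsistent_users = histories_per_user[histories_per_user > 1].index.tolist()
print(inconsistent_users)
```

An empty list on the real data would confirm that keeping one session per user loses no history information.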


### Dropping duplicate article IDs in *news* and remapping them in *behaviors*
Now that we have set up our *behaviors* dataset with respect to users, let's get back to the duplicate news articles. With different IDs for what is de facto the same article, we would not be able to track similarities among users properly. In the following, we will **replace every redundant article ID with the first ID for the respective article**. In order to do this, we first create a subset with all the duplicates in it:

In [24]:
duplis_title = news[news.duplicated(subset="title", keep=False)]

In [25]:
duplis_title.sort_values(by="title").head(3)

| | article_id | category | subcategory | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| 35832 | N34049 | sports | football_nfl | 'A game-changer': Titans' expansion project wi... | The Titans have begun construction on a 60,000... | https://assets.msn.com/labs/mind/BBWHHH8.html | [{"Label": "Tennessee Titans", "Type": "O", "W... | [{"Label": "Tennessee Titans", "Type": "O", "W... |
| 33999 | N56680 | sports | football_nfl | 'A game-changer': Titans' expansion project wi... | The Titans have begun construction on a 60,000... | https://assets.msn.com/labs/mind/BBWH8ZY.html | [{"Label": "Tennessee Titans", "Type": "O", "W... | [{"Label": "Tennessee Titans", "Type": "O", "W... |
| 45242 | N16531 | news | newscrime | 'Baby Trump' balloon slashed at Alabama appear... | TUSCALOOSA, Ala. (AP) Organizers say a man sla... | https://assets.msn.com/labs/mind/BBWwk8L.html | [{"Label": "Donald Trump baby balloon", "Type"... | [{"Label": "Donald Trump baby balloon", "Type"... |


In [26]:
title_set = duplis_title['title'].unique()

In [27]:
article_list = []
for title in title_set:
    x = duplis_title[duplis_title['title']==title]['article_id'].to_list()
    article_list.append(x)

In [28]:
article_list[:5]

[['N61864', 'N47020'],
 ['N59709', 'N13882', 'N57732', 'N56582'],
 ['N6632', 'N39995'],
 ['N14042', 'N21933'],
 ['N37736', 'N22941', 'N60979']]

We now have a **list which contains article IDs for every article which has multiple IDs** in our original dataset. With this list, we can generate a dictionary called articleID_dict, which maps all the redundant IDs (keys) to a single ID (value):

In [29]:
articleID_dict = {}
articles_to_change = []
for article in article_list:
    value = article[0]
    keys = article[1:]
    for k in keys:
        articleID_dict[k] = value
        articles_to_change.append(k)
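
The same kind of mapping could also be derived directly with a `groupby`, without the intermediate list of lists. This is a sketch on toy data; `toy_news`, `canonical` and `id_map` are names we made up for the example.

```python
import pandas as pd

toy_news = pd.DataFrame({
    'article_id': ['N1', 'N2', 'N3', 'N4', 'N5'],
    'title': ['A', 'B', 'A', 'C', 'A'],
})
# For every row, the first article ID that appears with the same title
canonical = toy_news.groupby('title')['article_id'].transform('first')
# Keep only the rows whose own ID differs from that first ID
redundant = toy_news['article_id'] != canonical
id_map = dict(zip(toy_news.loc[redundant, 'article_id'], canonical[redundant]))
print(id_map)
```

Here `N3` and `N5` share their title with `N1`, so both are mapped to `N1`, while `N2` and `N4` do not show up in the dictionary at all.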

Let's make a copy of the original behaviors dataframe and convert it to a NumPy array, so that we can **loop through all the redundant IDs in the *behaviors* dataset and homogenize them** according to our dictionary:

In [30]:
behav = behaviors.copy()
behav = behav.to_numpy()

In [31]:
pbar = ProgressBar()
userIDs_hist_changes = []
userIDs_impr_changes = []
users_to_change = []
articles_to_change_set = set(articles_to_change)

In [32]:
for idx in pbar(range(behav.shape[0])):
    user_row = behav[idx]
    # Replace whole IDs token by token: str.replace works on substrings and would
    # also rewrite longer IDs that merely contain a redundant ID (e.g. N12 in N123).
    hist_list = user_row[3].split()
    hist_inter = set(hist_list) & articles_to_change_set
    for art in hist_inter:
        users_to_change.append(idx)
        userIDs_hist_changes.append(user_row[1])
    if hist_inter:
        behav[idx][3] = ' '.join(articleID_dict.get(art, art) for art in hist_list)
    # Impressions look like 'N55689-1': the last two characters are the click label
    impression_list = user_row[4].split()
    impression_inter = {imp[:-2] for imp in impression_list} & articles_to_change_set
    for art in impression_inter:
        userIDs_impr_changes.append(user_row[1])
    if impression_inter:
        behav[idx][4] = ' '.join(articleID_dict.get(imp[:-2], imp[:-2]) + imp[-2:]
                                 for imp in impression_list)


100% |########################################################################|
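
One thing to watch out for when remapping IDs inside strings: `str.replace` works on substrings, so a short ID can also match the beginning of a longer one. The sketch below shows the difference on a made-up mapping; the helpers `remap_history` and `remap_impressions` are hypothetical and only illustrate replacing whole tokens.

```python
def remap_history(history, id_map):
    """Replace whole article IDs in a space-separated history string."""
    return ' '.join(id_map.get(token, token) for token in history.split())

def remap_impressions(impressions, id_map):
    """Replace article IDs in 'ID-label' tokens while keeping the click label."""
    remapped = []
    for token in impressions.split():
        article, label = token.rsplit('-', 1)
        remapped.append(f"{id_map.get(article, article)}-{label}")
    return ' '.join(remapped)

id_map = {'N12': 'N7'}  # hypothetical mapping, not taken from the dataset

print(remap_history('N12 N123 N5', id_map))        # N7 N123 N5
print('N12 N123 N5'.replace('N12', 'N7'))          # N7 N73 N5 (N123 got corrupted)
print(remap_impressions('N12-1 N123-0', id_map))   # N7-1 N123-0
```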


Let's **check whether our method worked**. For this task, we **construct a list containing all articles in the history column of our array and compare it to the articles_to_change list**, by converting both to sets and calculating their intersection, which should be empty:

In [33]:
all_articles = []
for row in behav:
    hist_list=row[3].split(' ')
    for article in hist_list:
        all_articles.append(article)
    

In [34]:
set(all_articles) & articles_to_change_set

set()
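
The check above looks at the history column only. A sketch of a check that also covers the impressions column could look like this; the helper `leftover_ids` and the toy rows are assumptions for illustration, with the same column positions as in our array (history at index 3, impressions at index 4).

```python
def leftover_ids(rows, redundant_ids):
    """Return redundant IDs still present in history (index 3) or impressions (index 4)."""
    seen = set()
    for row in rows:
        seen.update(row[3].split())
        # Strip the '-0' / '-1' click label from every impression token
        seen.update(token.rsplit('-', 1)[0] for token in row[4].split())
    return seen & set(redundant_ids)

rows = [
    [1, 'U1', 't', 'N1 N2', 'N3-0 N9-1'],
    [2, 'U2', 't', 'N4', 'N5-0'],
]
print(leftover_ids(rows, {'N9', 'N8'}))
```

In the toy rows, `N9` is still found in an impression, so the returned set is not empty.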

Apparently, our method worked! Now let's make a **new dataframe out of our processed behavioral data**:

In [35]:
behaviors_new = pd.DataFrame(behav, columns= behaviors.columns)

and a new one for the information on **news articles without duplicates**:

In [36]:
news_new = news.drop_duplicates(subset="title", keep='first')

### Saving processed datasets
Now we want to save the processed data by writing it to CSV files:

In [38]:
behaviors_new.to_csv("../../data/mind_small_train/behaviors_processed_small.csv", index=False)

In [39]:
news_new.to_csv("../../data/mind_small_train/news_processed_small.csv", 
                    index=False)

### Preprocessing for collaborative filtering approaches
For the deployment of recommender systems which use collaborative filtering (CF) techniques, **user-article interactions play a pivotal role**. Because CF is of great importance for understanding modern-day recommender systems in general, we too want to construct and discuss different versions of this approach. In order to do this, it is useful to further process our data with respect to user-article interactions.

Because we **only work with the click history** when deploying CF methods, we only need one session per user:


In [40]:
behaviors_cf = behaviors_new.drop_duplicates(subset='user_id').copy()

In [41]:
behaviors_cf.shape[0] == behaviors_new.user_id.nunique()

True

In [42]:
behaviors_cf.head()

| | impression_id | user_id | time | history | impressions | length_history |
|---|---|---|---|---|---|---|
| 0 | 1 | U13740 | 11/11/2019 9:05:58 AM | N55189 N42782 N34694 N45794 N18445 N63302 N104... | N55689-1 N35729-0 | 9 |
| 1 | 2 | U91836 | 11/12/2019 6:11:30 PM | N31739 N6072 N63045 N23979 N35656 N43353 N8129... | N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N... | 82 |
| 2 | 3 | U73700 | 11/14/2019 7:01:48 AM | N10732 N25792 N7563 N21087 N41087 N5445 N60384... | N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N... | 16 |
| 3 | 4 | U34670 | 11/11/2019 5:28:05 AM | N45729 N2203 N871 N53880 N41375 N43142 N33013 ... | N35729-0 N33632-0 N49685-1 N27581-0 | 10 |
| 4 | 6 | U19739 | 11/11/2019 6:52:13 PM | N39074 N14343 N32607 N32320 N22007 N442 N19001... | N21119-1 N53696-0 N33619-1 N25722-0 N2869-0 | 36 |


Now we want to construct a numpy array out of this smaller dataset

In [43]:
behav_cf = behaviors_cf.to_numpy(copy=True)

so that we can get **two lists of user-article interactions**: one for training, and one that holds only the last article of each user's history for testing purposes:

In [44]:
uai_train, uai_test = [], []

for row in behav_cf:
    user = row[1]
    hist = row[3].split(' ')
    for art in hist[:-1]:
        uai_train.append([user, art])
    last_art = hist[-1]
    uai_test.append([user, last_art])
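
The split used here (everything but the last article for training, the last article for testing) can be sketched as a small function. The name `leave_last_out` and the toy histories are made up; the sketch also shows that an article which appears twice in a history can end up in both parts.

```python
def leave_last_out(user_histories):
    """Split (user, history string) pairs into train pairs and one test pair per user."""
    train, test = [], []
    for user, history in user_histories:
        *earlier, last = history.split()
        train.extend([user, art] for art in earlier)
        test.append([user, last])
    return train, test

train, test = leave_last_out([('U1', 'N1 N2 N3'), ('U2', 'N4 N5 N4')])
# Test pairs that already occur in the training pairs (repeated reads)
leaks = [pair for pair in test if pair in train]
print(train, test, leaks)
```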
     

Let's check if that worked:

In [45]:
len(behaviors.loc[0].history.split(' ')[:-1])

8

In [46]:
uai_train[0:15], uai_test[0]

([['U13740', 'N55189'],
  ['U13740', 'N42782'],
  ['U13740', 'N34694'],
  ['U13740', 'N45794'],
  ['U13740', 'N18445'],
  ['U13740', 'N63302'],
  ['U13740', 'N10414'],
  ['U13740', 'N19347'],
  ['U91836', 'N31739'],
  ['U91836', 'N6072'],
  ['U91836', 'N63045'],
  ['U91836', 'N23979'],
  ['U91836', 'N35656'],
  ['U91836', 'N43353'],
  ['U91836', 'N8129']],
 ['U13740', 'N31801'])

That looks good! Now we want to get some **extra integer IDs for users and articles**, which we can later use for a **dictionary-of-keys (DOK) matrix**, which in turn will be **employed in a neural network**. For our train data, we can do it like this:

In [47]:
uai_train_df = pd.DataFrame(uai_train, columns=['user_id', 'article_id'])

In [48]:
uai_train_df['user_int_id'] = uai_train_df.user_id.astype('category').cat.codes
uai_train_df['article_int_id'] = uai_train_df.article_id.astype('category').cat.codes
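
To get a feeling for what `cat.codes` does: the codes follow the sorted order of the distinct values, and the categories can be used to translate a code back into the original ID. A small sketch with made-up user IDs:

```python
import pandas as pd

users = pd.Series(['U2', 'U1', 'U2', 'U10'])
categorical = users.astype('category')

# Codes are positions in the sorted categories ('U1' < 'U10' < 'U2' as strings)
codes = categorical.cat.codes.tolist()
# Reverse lookup from integer code to original ID
code_to_user = dict(enumerate(categorical.cat.categories))
print(codes, code_to_user)
```

Note that the order is lexicographic, so `U10` sorts before `U2`.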

In [49]:
uai_train_df.head(3)

| | user_id | article_id | user_int_id | article_int_id |
|---|---|---|---|---|
| 0 | U13740 | N55189 | 1810 | 24758 |
| 1 | U13740 | N42782 | 1810 | 17976 |
| 2 | U13740 | N34694 | 1810 | 13534 |


For our test data, we need to make sure that it contains only articles which are also in the train data. First, we need to find the test articles that never occur in the train data:

In [50]:
train_articles = [elem[1] for elem in uai_train]
test_articles = [elem[1] for elem in uai_test]
articles_to_drop = set(test_articles)-set(train_articles)

With this set, we're now able to reduce our test data:

In [51]:
uai_test_red = [ele for ele in uai_test if ele[1] not in articles_to_drop]

and can thus create a test dataframe without unknown articles:

In [52]:
uai_test_df = pd.DataFrame(uai_test_red, columns=['user_id', 'article_id'])

We can then construct two dictionaries, which map the original user and article IDs to the integer IDs:

In [53]:
user_code_dict = pd.Series(uai_train_df.user_int_id.values,
                           index=uai_train_df.user_id).to_dict()

article_code_dict = pd.Series(uai_train_df.article_int_id.values,
                              index=uai_train_df.article_id).to_dict()

In [54]:
uai_test_df['user_int_id'] = [user_code_dict[user] for user in uai_test_df.user_id]
uai_test_df['article_int_id'] = [article_code_dict[art] for art in uai_test_df.article_id]
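
The list comprehensions above raise a `KeyError` as soon as an ID is missing from the dictionary. A sketch of an alternative with `Series.map`, which marks unknown IDs as missing values instead; the mapping and the test users below are made up:

```python
import pandas as pd

user_codes = {'U1': 0, 'U2': 1}          # hypothetical mapping from the train data
test_users = pd.Series(['U2', 'U3', 'U1'])

mapped = test_users.map(user_codes)       # IDs missing from the mapping become NaN
unknown = test_users[mapped.isna()].tolist()
known_codes = mapped.dropna().astype(int).tolist()
print(unknown, known_codes)
```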

Let's see how that worked:

In [58]:
uai_test_df_example = uai_test_df[uai_test_df['article_id'] == 'N31801']
uai_test_df_example.head()

| | user_id | article_id | user_int_id | article_int_id |
|---|---|---|---|---|
| 0 | U13740 | N31801 | 1810 | 11956 |
| 1357 | U21185 | N31801 | 5395 | 11956 |
| 1379 | U19379 | N31801 | 4529 | 11956 |
| 2051 | U4891 | N31801 | 18423 | 11956 |
| 3876 | U81689 | N31801 | 34165 | 11956 |


Great, now let's save those dataframes to csv files:

In [59]:
uai_train_df.to_csv("../../data/mind_small_train/small_train.csv", index=False)

In [60]:
uai_test_df.to_csv("../../data/mind_small_train/small_test.csv", index=False)

Now we actually **create the DOK matrix**:

In [61]:
train_filename = "../../data/mind_small_train/small_train.csv"

In [62]:
num_users, num_articles = uai_train_df.user_id.nunique(), uai_train_df.article_id.nunique()
num_users, num_articles

(40331, 32101)

In [63]:
train_matrix = sp.dok_matrix((num_users, num_articles), dtype=np.float32)

with open(train_filename, "r") as f:
    header = f.readline()
    line = f.readline()
    print(header)
    print(line)
    while line:  # readline() returns an empty string at the end of the file
        line_list = line.split(",")
        user, article = int(line_list[2]), int(line_list[3])
        train_matrix[user, article] = 1.0
        line = f.readline()

user_id,article_id,user_int_id,article_int_id

U13740,N55189,1810,24758
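
Instead of reading the CSV file line by line, the matrix could also be built from the integer ID columns in one go. This is a sketch on a toy dataframe, assuming the same column names as in our train dataframe; it goes through a COO matrix first and converts it to DOK afterwards.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

toy = pd.DataFrame({'user_int_id': [0, 0, 1, 1], 'article_int_id': [1, 2, 0, 0]})

# COO sums repeated (row, column) entries on conversion, so deduplicate first
pairs = toy.drop_duplicates()
data = np.ones(len(pairs), dtype=np.float32)
matrix = sp.coo_matrix(
    (data, (pairs['user_int_id'].to_numpy(), pairs['article_int_id'].to_numpy())),
    shape=(2, 3),
).todok()
print(matrix[0, 1], matrix[1, 0], matrix.nnz)
```

The repeated interaction of user 1 with article 0 is stored once with the value 1.0, just as in the loop above.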



Later on, when we want to evaluate our recommender systems, we need to **compare the ranking for the known (but not learned) interactions with known non-interactions**, so that we can tell how useful our recommendation is: the higher the held-out test interaction is ranked, the better our model! In order to do this, we want to **sample 99 unread articles for every user in our test set**.

In [64]:
test_interactions = list(zip(uai_test_df.user_int_id, uai_test_df.article_int_id))

In [65]:
num_negatives = 99

In [66]:
negative_interactions = []
for u, i in test_interactions:
    negatives = []
    for t in range(num_negatives):
        j = np.random.randint(num_articles)
        # Also skip the held-out test article itself, so it never shows up as a negative
        while (u, j) in train_matrix.keys() or j == i:
            j = np.random.randint(num_articles)
        negatives.append(j)
    negative_interactions.append(negatives)
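
A variation of this sampling step is sketched below. Unlike the loop above, it draws the negatives without replacement (so no article appears twice for one user) and always excludes the held-out article. The function `sample_negatives` and its arguments are hypothetical and only serve as an illustration.

```python
import numpy as np

def sample_negatives(seen_articles, positive, num_articles, num_negatives, rng):
    """Draw distinct article IDs the user has not read, excluding the held-out one."""
    excluded = np.fromiter(seen_articles | {positive}, dtype=np.int64)
    candidates = np.setdiff1d(np.arange(num_articles), excluded)
    return rng.choice(candidates, size=num_negatives, replace=False).tolist()

rng = np.random.default_rng(0)  # fixed seed so that the sample is reproducible
negatives = sample_negatives({0, 1, 2}, positive=3, num_articles=10,
                             num_negatives=4, rng=rng)
print(negatives)
```

Building the candidate array for every user costs more memory than rejection sampling, so for many users and a large catalogue the loop above may still be the more practical choice.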

In [67]:
len(test_interactions), len(negative_interactions), len(negative_interactions[0])

(39846, 39846, 99)
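
To make the evaluation idea more concrete, here is a sketch of how the rank of the held-out article among its negatives could be computed once a model provides scores. The scores are made up, and the hit ratio is just one common way to summarize the ranks, not necessarily the metric used later on.

```python
def rank_of_positive(positive_score, negative_scores):
    """1-based rank of the held-out article among the negatives (1 is best)."""
    return 1 + sum(score > positive_score for score in negative_scores)

def hit_ratio_at_k(ranks, k=10):
    """Share of users whose held-out article is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)

# Two hypothetical users with three negatives each
ranks = [
    rank_of_positive(0.9, [0.1, 0.95, 0.3]),
    rank_of_positive(0.2, [0.5, 0.6, 0.7]),
]
print(ranks, hit_ratio_at_k(ranks, k=2))
```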

Finally, we want to write our one positive interaction along with the randomly generated non-interactions into a TSV file and we're done with preprocessing the data!

In [68]:
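# Copy the inner lists as well, so that the inserts below leave negative_interactions untouched
output = [list(negatives) for negatives in negative_interactions]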

In [69]:
for i in range(len(test_interactions)):
    output[i].insert(0, test_interactions[i])

In [70]:
with open('../../data/mind_small_train/small_test_negatives.tsv', 'w') as f:
    for line in output:
        line_str = '\t'.join(str(ele) for ele in line) + "\n"
        f.write(line_str)
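
For completeness, a sketch of how such a file could be read back later. It assumes that the first field of every line is a tuple of plain integers written as `(user, article)`, followed by the negative article IDs; the function name `read_test_negatives` and the toy file are made up.

```python
import ast
import os
import tempfile

def read_test_negatives(path):
    """Parse lines of the form '(user, article)<TAB>neg_1<TAB>neg_2...'."""
    rows = []
    with open(path) as f:
        for line in f:
            first, *rest = line.rstrip('\n').split('\t')
            user, article = ast.literal_eval(first)  # safely parse the tuple string
            rows.append(((user, article), [int(x) for x in rest]))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_negatives.tsv')
    with open(path, 'w') as f:
        f.write('(0, 5)\t1\t2\t3\n')
    parsed = read_test_negatives(path)
print(parsed)
```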