# Preprocessing MIND dataset

This notebook preprocesses the [MIND (Small)](https://msnews.github.io/) dataset, which contains news articles and user behavior logs from Microsoft News.

In [10]:
import os
import json
import numpy as np
import pandas as pd
import ast

## Downloading data

To run this notebook, download the MINDsmall zip files from [the MIND website](https://msnews.github.io/) and place them in the `data/raw` folder.
The cell below selects the `dev` split; swap in the commented-out line to use the `train` split instead.
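
Before running the rest of the notebook, it can help to confirm that the archive is where the code expects it. The sketch below is only an illustration: `list_archive_members` is a hypothetical helper that is not part of this notebook, and the example builds a tiny stand-in zip in a temporary directory (with empty `behaviors.tsv` and `news.tsv` files) so that it runs without the real download.

```python
import os
import tempfile
import zipfile


def list_archive_members(zip_path: str) -> list:
    """Return the sorted file names inside a zip archive, or raise if it is missing."""
    if not os.path.isfile(zip_path):
        raise FileNotFoundError(f"Expected the MIND archive at {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as archive:
        return sorted(archive.namelist())


# Stand-in archive (hypothetical content) so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "MINDsmall_demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("behaviors.tsv", "")
        archive.writestr("news.tsv", "")
    members = list_archive_members(demo_zip)

print(members)  # ['behaviors.tsv', 'news.tsv']
```

With the real data, the same helper could be pointed at `input_filepath` to fail early with a clear message if the zip file is not in `data/raw`.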

In [11]:
input_filepath, output_filedir = './../data/raw/MINDsmall_dev.zip', './../data/dev/'
# input_filepath, output_filedir = './../data/raw/MINDsmall_train.zip', './../data/train/'

## Preprocessing data

In [12]:
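def unzip_file(input_filepath: str, output_filedir: str) -> None:
    """Extract every file in the zip archive into output_filedir."""
    import zipfile
    # Name the handle `archive` so it does not shadow the built-in `zip`.
    with zipfile.ZipFile(input_filepath, "r") as archive:
        archive.extractall(output_filedir)

def convert_entities(entity: str) -> list:
    """Parse a stringified entity list; missing values (NaN) become []."""
    if not isinstance(entity, str):
        return []
    return ast.literal_eval(entity)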

unzip_file(input_filepath, output_filedir)

### Behaviors File

The behaviors file contains impression logs: which news articles were shown to each user, and which of them the user clicked. Briefly, the main columns are:

- `time`: impression time
- `history`: the items that `user_id` consumed prior to the impression
- `impressions`: list of items shown to `user_id`, each followed by a flag indicating whether the user clicked the item (`1`) or not (`0`)
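
To make the `impressions` format concrete, here is a minimal sketch of how one raw value can be split into item IDs and click flags. `parse_impressions` is a hypothetical helper used only for illustration, and the input string is a short example in the `<item_id>-<flag>` shape described above.

```python
def parse_impressions(raw: str):
    """Split a raw impressions string into item IDs and 0/1 click flags."""
    item_ids, clicks = [], []
    for token in raw.split(" "):
        # Split on the last hyphen so the flag is always the final piece.
        item_id, flag = token.rsplit("-", 1)
        item_ids.append(item_id)
        clicks.append(int(flag))
    return item_ids, clicks


# Illustrative input: three items shown, the third one clicked.
item_ids, clicks = parse_impressions("N28682-0 N48740-0 N31958-1")
print(item_ids)  # ['N28682', 'N48740', 'N31958']
print(clicks)    # [0, 0, 1]
```

Keeping the two lists aligned by position means that `clicks[i]` always refers to `item_ids[i]`, which is the same layout the `impressions` and `click` columns use.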

In [30]:
columns = ["impression_id", "user_id", "time", "history", "impressions"]

df_behaviors = pd.read_csv(os.path.join(output_filedir, 'behaviors.tsv'), sep='\t', header=None)
df_behaviors.columns = columns
# Each impression is a space-separated list of "<item_id>-<clicked>" tokens.
df_behaviors["impressions"] = df_behaviors["impressions"].apply(lambda x: x.split(' '))
# Users without any history have NaN here; map those to an empty list.
df_behaviors["history"] = df_behaviors["history"].apply(lambda x: [] if not isinstance(x, str) else x.split(' '))
# The click flag is the suffix after the hyphen: "-1" means clicked.
df_behaviors["click"] = df_behaviors["impressions"].apply(lambda x: [1 if impression.endswith('-1') else 0 for impression in x])
df_behaviors["impressions"] = df_behaviors["impressions"].apply(lambda x: [impression.split('-')[0] for impression in x])
df_behaviors['time'] = pd.to_datetime(df_behaviors['time'])
# `time` is already a datetime column, so convert it directly to Unix seconds.
df_behaviors['timestamp'] = df_behaviors['time'].map(pd.Timestamp.timestamp)
df_behaviors

| | impression_id | user_id | time | history | impressions | click | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 1 | U80234 | 2019-11-15 12:37:50 | [N55189, N46039, N51741, N53234, N11276, N264,... | [N28682, N48740, N31958, N34130, N6916, N5472,... | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1.573821e+09 |
| 1 | 2 | U60458 | 2019-11-15 07:11:50 | [N58715, N32109, N51180, N33438, N54827, N2848... | [N20036, N23513, N32536, N46976, N35216, N3677... | [0, 1, 0, 0, 0, 0, 0] | 1.573802e+09 |
| 2 | 3 | U44190 | 2019-11-15 09:55:12 | [N56253, N1150, N55189, N16233, N61704, N51706... | [N36779, N62365, N58098, N5472, N13408, N55036... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ... | 1.573812e+09 |
| 3 | 4 | U87380 | 2019-11-15 15:12:46 | [N63554, N49153, N28678, N23232, N43369, N5851... | [N6950, N60215, N6074, N11930, N6916, N24802, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... | 1.573831e+09 |
| 4 | 5 | U9444 | 2019-11-15 08:25:46 | [N51692, N18285, N26015, N22679, N55556] | [N5940, N23513, N49285, N23355, N19990, N31958... | [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | 1.573806e+09 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 73147 | 73148 | U77536 | 2019-11-15 20:40:16 | [N28691, N8845, N58434, N37120, N22185, N60033... | [N496, N35159, N59856, N13270, N47213, N26485,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... | 1.573850e+09 |
| 73148 | 73149 | U56193 | 2019-11-15 13:11:26 | [N4705, N58782, N53531, N46492, N26026, N28088... | [N49285, N31958, N55237, N42844, N29862, N1999... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ... | 1.573823e+09 |
| 73149 | 73150 | U16799 | 2019-11-15 15:37:06 | [N40826, N42078, N15670, N15295, N64536, N4684... | [N7043, N512, N60215, N45057, N496, N37055, N1... | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1.573832e+09 |
| 73150 | 73151 | U8786 | 2019-11-15 08:29:26 | [N3046, N356, N20483, N46107, N44598, N18693, ... | [N23692, N19990, N20187, N5940, N13408, N31958... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1.573807e+09 |


### News File

The news file contains metadata on the items, such as their title, abstract, and URL, along with the entities detected in the title and abstract.
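
The entity columns arrive as strings and are parsed into Python lists by `convert_entities`, defined earlier. The sketch below illustrates that parsing step on a made-up value: `parse_entities` is a stand-in with the same logic, and the sample string only includes the `Label` and `Type` keys, whereas real rows carry more fields.

```python
import ast


def parse_entities(raw):
    """Parse a stringified entity list; non-string input (e.g. NaN) becomes []."""
    if not isinstance(raw, str):
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval().
    return ast.literal_eval(raw)


# Made-up sample value, shortened for illustration.
sample = '[{"Label": "Ukraine", "Type": "G"}]'
entities = parse_entities(sample)
print(entities[0]["Label"])           # Ukraine
print(parse_entities(float("nan")))   # []
```

Returning an empty list for missing values keeps the column type uniform, so downstream code can iterate over every row without checking for NaN first.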

In [31]:
columns = [
    'item_id', 'category', 'subcategory', 'title', 'abstract', 'url', 'title_entities', 'abstract_entities'
]

df_items = pd.read_csv(os.path.join(output_filedir, 'news.tsv'), sep='\t', header=None)
df_items.columns = columns
df_items["title_entities"] = df_items["title_entities"].apply(convert_entities)
df_items["abstract_entities"] = df_items["abstract_entities"].apply(convert_entities)
df_items

| | item_id | category | subcategory | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{'Label': 'Prince Philip, Duke of Edinburgh',... | [] |
| 1 | N18955 | health | medical | Dispose of unwanted prescription drugs during ... | | https://assets.msn.com/labs/mind/AAISxPN.html | [{'Label': 'Drug Enforcement Administration', ... | [] |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{'Label': 'Ukraine', 'Type': 'G', 'WikidataId... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... | https://assets.msn.com/labs/mind/AACk2N6.html | [] | [{'Label': 'National Basketball Association', ... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... | https://assets.msn.com/labs/mind/AAAKEkt.html | [{'Label': 'Skin tag', 'Type': 'C', 'WikidataI... | [{'Label': 'Skin tag', 'Type': 'C', 'WikidataI... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 42411 | N63550 | lifestyle | lifestyleroyals | Why Kate & Meghan Were on Different Balconies ... | There's no scandal here. It's all about the or... | https://assets.msn.com/labs/mind/BBWyynu.html | [{'Label': 'Meghan, Duchess of Sussex', 'Type'... | [] |
| 42412 | N30345 | entertainment | entertainment-celebrity | See the stars at the 2019 Baby2Baby gala | Stars like Chrissy Teigen and Kate Hudson supp... | https://assets.msn.com/labs/mind/BBWyz7N.html | [] | [{'Label': 'Kate Hudson', 'Type': 'P', 'Wikida... |
| 42413 | N30135 | news | newsgoodnews | Tennessee judge holds lawyer's baby as he swea... | Tennessee Court of Appeals Judge Richard Dinki... | https://assets.msn.com/labs/mind/BBWyzI8.html | [{'Label': 'Tennessee', 'Type': 'G', 'Wikidata... | [{'Label': 'Tennessee Court of Appeals', 'Type... |
| 42414 | N44276 | autos | autossports | Best Sports Car Deals for October | | https://assets.msn.com/labs/mind/BBy5rVe.html | [{'Label': 'Peugeot RCZ', 'Type': 'V', 'Wikida... | [] |


**Publication Date**

The original dataset provides each news article's URL, which could be used to scrape the article's publication date.
However, these URLs no longer work. Therefore, I estimate each article's publication date as the time of its earliest impression in the behaviors file.

*Note: articles that never appear in an impression have no estimated publication date, so their `publish_time` and `publish_timestamp` are missing after the left merge.*
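
The estimate can be illustrated on toy data before applying it to the full dataset. The sketch below uses made-up frames (`behaviors`, `items`) in place of `df_behaviors` and `df_items`: it explodes the impression lists to one row per item, takes the earliest time per item, and left-merges the result so that an item with no impressions ends up with a missing date.

```python
import pandas as pd

# Toy data (made up) standing in for df_behaviors and df_items.
behaviors = pd.DataFrame({
    "impressions": [["N1", "N2"], ["N2", "N3"]],
    "time": pd.to_datetime(["2019-11-15 10:00:00", "2019-11-15 08:00:00"]),
})
items = pd.DataFrame({"item_id": ["N1", "N2", "N3", "N4"]})

# One row per (impression, item), then the earliest time each item was shown.
first_seen = (
    behaviors.explode("impressions")
    .groupby("impressions", as_index=False)["time"]
    .min()
    .rename(columns={"impressions": "item_id", "time": "publish_time"})
)

# Left merge keeps every item; N4 was never shown, so its publish_time is NaT.
items = items.merge(first_seen, on="item_id", how="left")
print(items)
```

Here `N2` appears in both impressions and receives the earlier of the two times, while `N4` keeps a missing value because it never appears in the toy behaviors data.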

In [32]:
df_item_publication = (
    df_behaviors[['impressions', 'time', 'timestamp']]
        .explode('impressions')
        .groupby('impressions')
        .min()
        .reset_index()
        .rename({"impressions": "item_id", "time": "publish_time", "timestamp": "publish_timestamp"}, axis=1)
)
df_item_publication.tail()

| | item_id | publish_time | publish_timestamp |
|---|---|---|---|
| 5364 | N9966 | 2019-11-15 19:51:23 | 1573847000.0 |
| 5365 | N9971 | 2019-11-15 02:50:29 | 1573786000.0 |
| 5366 | N9985 | 2019-11-15 16:06:57 | 1573834000.0 |
| 5367 | N9991 | 2019-11-15 00:01:07 | 1573776000.0 |
| 5368 | N9997 | 2019-11-15 05:20:47 | 1573795000.0 |


In [33]:
df_items = df_items.merge(df_item_publication, on='item_id', how='left')

In [34]:
df_items.to_parquet(os.path.join(output_filedir, "df_items_preprocessed.parquet"), index=None)