## Data Processing
This notebook documents the data processing steps used to prepare the data for our testing.

In [1]:
import pandas as pd
import numpy as np
import data_processing_modules as dpm
# To process MIND_large instead, change the folder name in all associated functions


### Initial Processing
The raw data was stored as .tsv files without column headers. The data_to_csv function in data_processing_modules converts each file to CSV. In the calls below, the first argument is True for behaviors.tsv and False for news.tsv.

In [None]:
## Changes to csv format
# Behaviors and then news
dpm.data_to_csv(True, '../MIND_small/tsv/behaviors.tsv')
dpm.data_to_csv(False, '../MIND_small/tsv/news.tsv')
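
For readers without access to data_processing_modules, the conversion can be sketched roughly as follows. This is a minimal sketch, not the actual body of data_to_csv: the function name `tsv_to_dataframe` is hypothetical, and the column names are taken from the outputs shown later in this notebook.

```python
import io
import pandas as pd

# Column names as they appear in the outputs later in this notebook.
BEHAVIORS_COLUMNS = ["impression_id", "user_id", "time", "history", "impressions"]
NEWS_COLUMNS = ["news_id", "category", "sub_category", "title", "abstract",
                "url", "title_entities", "abstract_entities"]

def tsv_to_dataframe(source, is_behaviors):
    """Hypothetical stand-in for dpm.data_to_csv: read a header-less TSV and name its columns."""
    columns = BEHAVIORS_COLUMNS if is_behaviors else NEWS_COLUMNS
    # header=None stops pandas from treating the first data row as a header.
    return pd.read_csv(source, sep="\t", header=None, names=columns)

# Tiny in-memory example standing in for ../MIND_small/tsv/behaviors.tsv
sample = io.StringIO("1\tU1\t11/11/2019 9:05:58 AM\tN1 N2\tN3-1 N4-0\n")
behaviors_sample = tsv_to_dataframe(sample, is_behaviors=True)

# Without a path, to_csv returns the CSV text; pass a path to write a file instead.
csv_text = behaviors_sample.to_csv()
```

The index is written as the first CSV column, which matches the index_col=0 argument used when the files are read back later in this notebook.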

### Processing for popularity counts
To obtain popularity counts for both categories and articles, we wrote create_popularity_dfs and create_popularity_csvs, which extract the popularity information and write it to CSV files for later use in visualizations.

In [6]:
behaviors = pd.read_csv('../MIND_small/csv/behaviors.csv', index_col=0)
news = pd.read_csv('../MIND_small/csv/news.csv', index_col=0)
# dpm.create_popularity_csvs(news, behaviors)
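
One way such popularity counts could be derived is sketched below. It is an illustration under assumptions, not the code in create_popularity_dfs: the function name `popularity_counts` is hypothetical, popularity is assumed to mean the number of clicks, and the impressions column is assumed to hold space-separated entries of the form `newsid-1` (clicked) or `newsid-0` (not clicked).

```python
import pandas as pd

# Toy frames standing in for the real behaviors and news data.
behaviors_sample = pd.DataFrame({
    "user_id": ["U1", "U2"],
    "impressions": ["N1-1 N2-0", "N1-1 N3-1"],
})
news_sample = pd.DataFrame({
    "news_id": ["N1", "N2", "N3"],
    "category": ["sports", "news", "sports"],
})

def popularity_counts(news, behaviors):
    """Hypothetical sketch: click counts per article and per category."""
    # One row per impression entry, e.g. "N1-1".
    entries = behaviors["impressions"].str.split().explode().reset_index(drop=True)
    # Split each entry into the news id (column 0) and the click flag (column 1).
    parts = entries.str.rsplit("-", n=1, expand=True)
    clicked_ids = parts.loc[parts[1] == "1", 0]
    article_counts = (
        clicked_ids.value_counts().rename_axis("news_id").reset_index(name="clicks")
    )
    # Sum article clicks within each category.
    category_counts = (
        article_counts.merge(news[["news_id", "category"]], on="news_id")
        .groupby("category", as_index=False)["clicks"].sum()
    )
    return article_counts, category_counts

article_counts, category_counts = popularity_counts(news_sample, behaviors_sample)
```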

In [7]:
behaviors = behaviors[behaviors['history'].notna()]  # drop rows with no click history
behaviors.isna().any()

impression_id    False
user_id          False
time             False
history          False
impressions      False
dtype: bool

### TensorFlow compatibility
TensorFlow Recommenders requires the dataset to be in a specific format. decompose_interactions builds a TensorFlow-compatible dataframe from the news and behaviors data, limited to the number of rows set in num_rows.

In [8]:
num_rows = 500000 # Number of rows used by decompose_interactions; adjust to change the dataset size
tf_dataset = dpm.decompose_interactions(num_rows, news, behaviors)
tf_dataset.to_csv('../MIND_small/csv/tensorflowDataset.csv')

Index(['news_id', 'category', 'sub_category', 'title', 'abstract', 'url',
       'title_entities', 'abstract_entities'],
      dtype='object')
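
The exact layout produced by decompose_interactions is not shown here. As a rough sketch of what "decomposing" could look like, the example below turns each user's history into one row per (user, article) pair and attaches the article metadata. The function name `decompose_history`, the use of the history column, and the meaning of num_rows as a cap on behaviors rows are all assumptions for illustration.

```python
import pandas as pd

# Toy frames standing in for the real behaviors and news data.
behaviors_sample = pd.DataFrame({
    "user_id": ["U1", "U2"],
    "history": ["N1 N2", "N2"],
})
news_sample = pd.DataFrame({
    "news_id": ["N1", "N2"],
    "category": ["sports", "news"],
    "title": ["Title one", "Title two"],
})

def decompose_history(num_rows, news, behaviors):
    """Hypothetical sketch: one row per (user, article) pair with article metadata."""
    subset = behaviors.head(num_rows)
    # Split the space-separated history and give each article its own row.
    pairs = (
        subset.assign(news_id=subset["history"].str.split())
        .explode("news_id")[["user_id", "news_id"]]
    )
    # Attach the article features a recommender model would consume.
    return pairs.merge(news, on="news_id", how="inner").reset_index(drop=True)

interactions = decompose_history(2, news_sample, behaviors_sample)
```

A flat table like this, with one interaction per row, is straightforward to save with to_csv and load into a recommender training pipeline.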


### Temporal Processing
Because the behaviors data includes the timestamp of each interaction, we analyzed the popularity of articles at different times of day. To process this data we used create_interaction_counts, whose output we read back from behaviors_with_individual_counts.csv. We then applied modify_hourly, which extracts the hour from the timestamp.

In [None]:
dpm.create_interaction_counts()
behaviors = pd.read_csv('../MIND_small/csv/behaviors_with_individual_counts.csv')
behaviors = dpm.modify_hourly(behaviors)
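
The hour extraction can be sketched as below. This is not the actual body of modify_hourly: the function name `add_hour_column`, the output column name `hour`, and the timestamp format `MM/DD/YYYY H:MM:SS AM/PM` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for the behaviors data; the timestamp format is assumed.
behaviors_sample = pd.DataFrame({
    "time": ["11/11/2019 9:05:58 AM", "11/12/2019 6:11:30 PM"],
})

def add_hour_column(behaviors):
    """Hypothetical stand-in for dpm.modify_hourly: add the hour of day (0-23)."""
    result = behaviors.copy()
    timestamps = pd.to_datetime(result["time"], format="%m/%d/%Y %I:%M:%S %p")
    result["hour"] = timestamps.dt.hour
    return result

hourly = add_hour_column(behaviors_sample)

# Number of interactions recorded in each hour of the day.
interactions_per_hour = hourly.groupby("hour").size()
```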

### Clustering Processing
To reduce search spaces and make our recommenders more efficient, we used clustering. To cluster news articles, we computed embeddings of the abstracts and titles in the dataset with a pre-trained BERT model, using create_text_embeddings. In addition to BERT embeddings, we used scikit-learn's bag-of-words and TF-IDF vectorizers. The scikit-learn vectorizers need only a few lines of code, so that preprocessing is done during clustering rather than beforehand, unlike the BERT embeddings computed below.

In [None]:
dpm.preprocess_BERT_embeddings(news, small=True)
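
To illustrate why the scikit-learn vectorizers need no separate preprocessing step, here is a minimal sketch that vectorizes a few toy titles and clusters them. The texts are made up, and the choice of KMeans with two clusters is an assumption for illustration; the clustering algorithm used in our experiments is not specified in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up titles standing in for the news titles and abstracts.
texts = [
    "football match score goal",
    "football team wins match",
    "stock market shares fall",
    "market shares rise stock",
]

# Bag of words: raw term counts per document.
bow_matrix = CountVectorizer().fit_transform(texts)

# TF-IDF: term counts reweighted by how rare each term is across documents.
tfidf_matrix = TfidfVectorizer().fit_transform(texts)

# Cluster the TF-IDF vectors; n_clusters=2 is chosen only for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf_matrix)
```

Both vectorizers return sparse matrices with one row per document, so either can be passed directly to the clustering step.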

In [None]:
# Might want to consider using UMAP union on the title and abstract embeddings since currently UMAP is
# reducing all of them together which could cause a loss of data quality