# Data Processing Documentation
This document catalogs each step we took in processing our data, from start to finish.

In [2]:
import data_processing_modules as dpm
import numpy as np
import pandas as pd
import sys
sys.path.append("/home/jovyan/work")
import Modelling.matrix_modules as matrix_modules

## First Steps
When first downloaded, the MIND dataset comes as `.tsv` files without column names. Using `data_to_csv` from `data_processing_modules`, we converted each file to a CSV with the column names defined in the MIND GitHub repository.

In [None]:
# Converts the TSV file at the given filepath to a CSV.
dpm.data_to_csv(True, '../MIND_large/tsv/behaviors.tsv')
dpm.data_to_csv(False, '../MIND_large/tsv/news.tsv')
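The conversion in the cell above can be approximated with plain pandas. This is a minimal sketch, not the actual body of `data_to_csv`: the helper `tsv_to_dataframe` is hypothetical, and the snake_case column names are our assumption based on the fields listed in the MIND repository (impression ID, user ID, time, history, impressions).

```python
import io

import pandas as pd

# Field order follows the MIND repository description of behaviors.tsv;
# the snake_case spelling is an assumption made for this sketch.
BEHAVIORS_COLUMNS = ["impression_id", "user_id", "time", "history", "impressions"]


def tsv_to_dataframe(source, columns):
    """Read a headerless TSV and attach column names (hypothetical helper)."""
    return pd.read_csv(source, sep="\t", header=None, names=columns)


# A one-row stand-in for behaviors.tsv so the sketch runs without the dataset.
sample = "1\tU1\t11/15/2019 8:55:22 AM\tN1 N2\tN3-1 N4-0\n"
behaviors_sample = tsv_to_dataframe(io.StringIO(sample), BEHAVIORS_COLUMNS)

# Writing the CSV is then a single call, for example:
# behaviors_sample.to_csv("behaviors.csv")
```

Passing a file path instead of the `io.StringIO` buffer would read the real file in the same way.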

## TensorFlow Compatibility
To build recommender systems with TensorFlow, we first had to reshape our user behaviors dataset. Initially, each row represented one impression, containing a user ID, the user's history, and all interactions for that impression. TensorFlow instead needs one row per interaction, so every item in both the history and the impressions must become its own row. We create this TensorFlow-compatible dataset with `decompose_interactions`, which iterates through the behaviors CSV and expands each impression as necessary.

TensorFlow recommender systems also support a time-based train-test split, so before creating the dataset we binned the timestamps with `modify_hourly` and sorted them.

Because of the size of the resulting dataset and our desire to use Git Large File Storage (Git LFS), we split it into chunks small enough for Git LFS to manage. The dataset is stored in four chunks at `../MIND_large/csv/tensorflow_dataset_chunk{i}.csv` and is loaded with the `load_in_tensorflow_full()` function from `matrix_modules`.
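To make the row expansion concrete, here is a minimal sketch of the idea using `DataFrame.explode`. It is not the code of `decompose_interactions`: the function name `explode_interactions` is hypothetical, and we assume that the history is a space-separated list of article IDs, that impressions are space-separated `<news_id>-<clicked>` tokens, and that the click flag is used as the score.

```python
import pandas as pd


def explode_interactions(behaviors):
    """Return one row per (user, article) pair (hypothetical helper)."""
    # History: every listed article is treated as a positive interaction.
    history = behaviors[["user_id", "time", "history"]].copy()
    history["news_id"] = history["history"].fillna("").str.split()
    history = history.explode("news_id").dropna(subset=["news_id"])
    history["interaction_type"] = "history"
    history["score"] = 1

    # Impressions: split each "<news_id>-<clicked>" token into its two parts.
    imps = behaviors[["user_id", "time", "impressions"]].copy()
    imps["token"] = imps["impressions"].fillna("").str.split()
    imps = imps.explode("token").dropna(subset=["token"]).reset_index(drop=True)
    parts = imps["token"].str.rsplit("-", n=1, expand=True)
    imps["news_id"] = parts[0]
    imps["score"] = parts[1].astype(int)  # assumption: click flag used as score
    imps["interaction_type"] = "impression"

    cols = ["user_id", "time", "news_id", "interaction_type", "score"]
    return pd.concat([history[cols], imps[cols]], ignore_index=True)


example = pd.DataFrame(
    {"user_id": ["U1"], "time": [1], "history": ["N1 N2"], "impressions": ["N3-1 N4-0"]}
)
long_form = explode_interactions(example)
```

For the single example impression, `long_form` holds four rows: two from the history and two from the impression list.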

In [None]:
# Loading in necessary data  
behaviors = pd.read_csv('../MIND_large/csv/behaviors.csv', index_col=0)
news = pd.read_csv('../MIND_large/csv/news.csv', index_col=0)

# Binning timestamps.
behaviors = dpm.modify_hourly(behaviors)

# Sorting the timestamps.
behaviors = behaviors.sort_values('time')

# Creating the tensorflow dataset.
tf_dataset = dpm.decompose_interactions(news, behaviors)

# Saving it in chunks.
dpm.chunk_tf_dataset(tf_dataset)
tf_dataset.head() 
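The chunking step can be sketched as follows. This is an illustration under our own assumptions rather than the body of `chunk_tf_dataset`: the helper `split_into_chunks` is hypothetical, and we assume the chunks are contiguous row ranges of roughly equal size.

```python
import math

import pandas as pd


def split_into_chunks(df, n_chunks=4):
    """Split a DataFrame into at most n_chunks contiguous pieces (hypothetical helper)."""
    # Guard against a zero step when the DataFrame is empty.
    chunk_size = max(1, math.ceil(len(df) / n_chunks))
    return [df.iloc[start:start + chunk_size] for start in range(0, len(df), chunk_size)]


frame = pd.DataFrame({"value": range(10)})
chunks = split_into_chunks(frame, n_chunks=4)

# Each chunk could then be written on its own, for example:
# for i, chunk in enumerate(chunks):
#     chunk.to_csv(f"tensorflow_dataset_chunk{i}.csv")

# Concatenating the chunks restores the original rows in order.
restored = pd.concat(chunks)
```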

In [None]:
# Run to free up memory.
del tf_dataset

## Exploratory Data Analysis
In our exploratory data analysis we examined article and category popularity, as well as category popularity at different times of day. The visualizations for this exploration required us to modify and extract information from the behaviors and news datasets.

### Getting Popularity Counts
To get popularity counts for both categories and articles we use `create_popularity_csvs`, which iterates over the rows of the behaviors CSV and tallies category and article popularity. The resulting data is stored in `../MIND_large/csv/news_with_popularity.csv` and `../MIND_large/csv/category_with_popularity.csv`.

In [None]:
dpm.create_popularity_csvs(news, behaviors, small=False)
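For intuition, the same kind of tally can be written as a vectorized `groupby` over a table with one row per interaction. This is a sketch on toy data, not the iteration that `create_popularity_csvs` performs, and the toy column values are invented for illustration.

```python
import pandas as pd

# Toy data: one row per interaction (values invented for this sketch).
interactions = pd.DataFrame(
    {
        "news_id": ["N1", "N1", "N2", "N3"],
        "category": ["news", "news", "sports", "news"],
    }
)

# Count how many interactions each article and each category received.
article_popularity = (
    interactions.groupby("news_id").size().rename("popularity").reset_index()
)
category_popularity = (
    interactions.groupby("category").size().rename("popularity").reset_index()
)
```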

### Preparing For Temporal Analysis
To explore how time of day affects category popularity, we counted per-impression category preferences with `create_interaction_counts`, which computes the popularity of each category within a single impression. The resulting data is stored in `../MIND_large/csv/behaviors_with_individual_counts.csv`. Before visualizing category popularity at different times, `create_hourly_long` loads the stored data, bins the times with `modify_hourly`, and transforms the data into a format usable for visualization.



In [None]:
dpm.create_interaction_counts(behaviors)
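The binning and reshaping can be illustrated on toy data. The sketch below assumes one row per impression with a timestamp and one count column per category; the column names and the choice of hour-of-day bins are our assumptions, and the actual `modify_hourly` and `create_hourly_long` may differ.

```python
import pandas as pd

# Toy per-impression category counts (values invented for this sketch).
impressions = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2019-11-15 08:55:22", "2019-11-15 08:10:00", "2019-11-15 21:30:00"]
        ),
        "news": [2, 1, 0],
        "sports": [0, 1, 3],
    }
)

# Bin each timestamp into its hour of day.
impressions["hour"] = impressions["time"].dt.hour

# Sum the counts per hour, then melt to long format: one row per (hour, category).
hourly_long = (
    impressions.drop(columns="time")
    .groupby("hour", as_index=False)
    .sum()
    .melt(id_vars="hour", var_name="category", value_name="count")
)
```

The long format, with one row per hour and category, is the shape most plotting libraries expect for grouped line or bar charts.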

## Feature Extraction
Before moving on to clustering, we extracted and created features based on what we learned from our exploratory data analysis. For users, we extracted article preferences by category from the TensorFlow-compatible dataset, along with each user's median time of interaction. For items, we dummy coded their categories and used the previously extracted article popularity and, tentatively, dimension-reduced embeddings.

In [3]:
# Creating User Features
tensorflow_ds = matrix_modules.load_in_tensorflow_full()
tensorflow_ds.head()

| Unnamed: 0 | user_id | time | news_id | category | sub_category | title | abstract | interaction_type | score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | U66319 | 1 | N10721 | entertainment | entertainment-celebrity | Mike Johnson asks out Keke Palmer after Demi L... | Mike Johnson tried to ask out Keke Palmer in a... | history | 1 |
| 1 | U66319 | 1 | N128129 | movies | movies-celebrity | Brie Larson Has the Best Reaction Ever After T... | The 'Captain Marvel' star was left speechless ... | history | 1 |
| 2 | U66319 | 1 | N28406 | news | newsworld | Accused dine-and-dashers in viral video at St.... | Five young black men who posted a video of a m... | history | 1 |
| 3 | U66319 | 1 | N118998 | news | newsgoodnews | Trooper pulls over to save flag on highway | The trooper is being praised for stopping his ... | history | 1 |
| 4 | U66319 | 1 | N38884 | sports | mma | UFC champ Khabib Nurmagomedov seen training in... | Khabib Nurmagomedov doesn't mess around. | history | 1 |


In [3]:
user_features = dpm.create_user_features(tensorflow_ds)
user_features.head()


| Unnamed: 0 | user_id | news | entertainment | finance | video | tv | movies | music | autos | health | ... | cardio | olympics-videos | hollywood | autosconvertibles | smartliving | soccer_fifa_wwc | strength | newslocalpolitics | games-news | median_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | U1 | 0.444444 | 0.236111 | 0.111111 | 0.083333 | 0.097222 | 0.013889 | 0.013889 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.375 |
| 1 | U100 | 0.162791 | 0.046512 | 0.046512 | 0.046512 | 0.023256 | 0.0 | 0.046512 | 0.069767 | 0.116279 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 |
| 2 | U1000 | 0.444444 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.111111 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.375 |
| 3 | U10000 | 0.091954 | 0.022989 | 0.137931 | 0.034483 | 0.045977 | 0.022989 | 0.022989 | 0.011494 | 0.114943 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.291667 |
| 4 | U100005 | 0.354839 | 0.0 | 0.043011 | 0.010753 | 0.032258 | 0.043011 | 0.0 | 0.0 | 0.021505 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 |
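User features of the shape shown above (per-category shares plus a median time) can be derived from a one-row-per-interaction table as in the sketch below. This is an illustration on toy data, not the body of `create_user_features`; in particular, we assume the time column is already a numeric bin.

```python
import pandas as pd

# Toy interactions (values invented for this sketch).
interactions = pd.DataFrame(
    {
        "user_id": ["U1", "U1", "U1", "U1", "U2", "U2"],
        "category": ["news", "news", "sports", "news", "sports", "sports"],
        "time": [0.25, 0.5, 0.5, 0.75, 0.125, 0.375],
    }
)

# Share of each user's interactions that falls into each category.
category_share = pd.crosstab(
    interactions["user_id"], interactions["category"], normalize="index"
)

# Median interaction time per user.
median_time = interactions.groupby("user_id")["time"].median().rename("median_time")

user_features_sketch = category_share.join(median_time).reset_index()
```

Normalizing by row (`normalize="index"`) makes each user's category shares sum to one, so heavy and light readers become comparable.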


In [4]:
item_features = dpm.create_item_features()
item_features.head()

| Unnamed: 0 | news_id | title | abstract | popularity | autos | entertainment | finance | foodanddrink | games | health | ... | voices | watch | weatherfullscreenmaps | weathertopstories | weight-loss | weightloss | wellness | wines | wonder | yearinoffbeatgoodnews |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | N88753 | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | N23144 | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | N86255 | Dispose of unwanted prescription drugs during ... |  | 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | N93187 | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | 221 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | N75236 | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... | 1525 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
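The dummy coding visible in the item features above can be reproduced in miniature with `pd.get_dummies`. This is a sketch on invented rows, not the body of `create_item_features`; we assume both the category and the subcategory are dummy coded, which matches the columns in the output but is not confirmed by the code.

```python
import pandas as pd

# Toy news rows (values invented for this sketch).
news_sample = pd.DataFrame(
    {
        "news_id": ["N1", "N2", "N3"],
        "category": ["health", "sports", "health"],
        "sub_category": ["weightloss", "mma", "wellness"],
        "popularity": [5, 12, 8],
    }
)

# One 0/1 column per category and subcategory value; the empty prefix keeps
# the raw value as the column name.
dummies = pd.get_dummies(
    news_sample[["category", "sub_category"]], prefix="", prefix_sep="", dtype=float
)

item_features_sketch = pd.concat(
    [news_sample[["news_id", "popularity"]], dummies], axis=1
)
```

With an empty prefix, a category and a subcategory that share a name would produce duplicate column names, so a real implementation may want distinct prefixes.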


In [5]:
user_features.to_csv('../MIND_large/csv/user_features.csv')
item_features.to_csv('../MIND_large/csv/item_features.csv')

## Clustering
We clustered both news items and users to benefit the overall recommender system. To explore clustering parameters for news, we vectorized the abstracts and titles, reduced the vectors to two dimensions with UMAP, and then applied clustering algorithms such as HDBSCAN and k-means. Data processing for clustering items was minimal and entirely local, so we do not cover it in detail here. For users, we used the previously generated user features for clustering experiments. Because dimension reduction took much longer for users, we stored the UMAP-reduced embeddings created during this step for quicker access.
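The vectorize, reduce, and cluster sequence for news can be sketched with scikit-learn alone. Note the substitutions: we use TF-IDF as the vectorizer (the vectorizer we actually used is not stated here) and `TruncatedSVD` in place of UMAP so that the sketch has no extra dependency. The headlines are invented, and this is not our experiment code.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented headlines standing in for news titles and abstracts.
texts = [
    "Team wins championship game in overtime",
    "Star player injured before playoff game",
    "Stocks fall as markets react to rate hike",
    "Markets rally after strong earnings report",
]

# 1. Vectorize the text (TF-IDF is an assumption for this sketch).
vectors = TfidfVectorizer().fit_transform(texts)

# 2. Reduce to two dimensions (TruncatedSVD stands in for UMAP here).
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(vectors)

# 3. Cluster the reduced points with k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
```

Swapping step 2 for a UMAP reducer and step 3 for HDBSCAN would follow the same fit-and-transform pattern.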


### Generating and Storing Embeddings
Prior to and during model evaluation, we stored the embeddings that we judged to be the best and reloaded them when testing different numbers of clusters. We updated the number of clusters by calling either `user_cluster` or `item_cluster`, each of which takes our features and attaches cluster labels to them.

Importantly, we do not create and evaluate clusters on a train-test split of our data, for two reasons: clusters were not used in TensorFlow modelling, and they were only used in gradient descent and alternating least squares, which update and evaluate as matrix factorization occurs.

## Matrix Factorization Models

### Data for Gradient Descent and Alternating Least Squares 
Our implementations of factorization models such as gradient descent (GD) and alternating least squares (ALS) use much of the previously processed data: item and user features with appended cluster labels, and the full TensorFlow-compatible dataset, modified so that it is easy to transform into a ratings matrix $R$. The clustered items, clustered users, and complete TensorFlow dataset are loaded together with `load_dataset_for_matrix`. Once the data is loaded, `create_user_cluster_mat` or `create_item_cluster_mat` creates hash maps of indices for use within ALS, along with a matrix $R$ built with either user clustering or item clustering.
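As a rough illustration of the index hash maps and the ratings matrix, the sketch below builds a dense $R$ from a small interaction table. It is not the code of `create_user_cluster_mat` or `create_item_cluster_mat`: the toy data is invented, the variable names are hypothetical, and the aggregation by user or item cluster is left out.

```python
import numpy as np
import pandas as pd

# Toy interactions (values invented for this sketch).
interactions = pd.DataFrame(
    {
        "user_id": ["U1", "U1", "U2"],
        "news_id": ["N1", "N2", "N2"],
        "score": [1, 0, 1],
    }
)

# Hash maps from IDs to row and column positions in R.
user_index = {user: i for i, user in enumerate(interactions["user_id"].unique())}
item_index = {item: j for j, item in enumerate(interactions["news_id"].unique())}

# Fill R so that R[i, j] holds the score of user i for item j.
R = np.zeros((len(user_index), len(item_index)))
rows = interactions["user_id"].map(user_index).to_numpy()
cols = interactions["news_id"].map(item_index).to_numpy()
R[rows, cols] = interactions["score"].to_numpy()
```

At the scale of the full dataset a dense array is unlikely to fit in memory, so a sparse matrix (for example from `scipy.sparse`) would be the more practical container.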