## Data Processing
This notebook documents the data processing steps used to prepare the data for our testing.

In [1]:
import pandas as pd
import numpy as np
import data_processing_modules as dpm
# for MIND_large data processing change the folder name in all associated functions


### Initial Processing
The raw data was stored as .tsv files without column headers. We used data_to_csv from data_processing_modules to convert each file to a .csv with named columns.
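
The conversion described above can be sketched with pandas alone. This is a minimal sketch, not the actual body of `dpm.data_to_csv`: the function name `tsv_to_csv_sketch` and its parameters are hypothetical, and the column names are taken from the DataFrame outputs shown later in this notebook.

```python
import io

import pandas as pd

# Column names as they appear in the outputs later in this notebook.
BEHAVIORS_COLUMNS = ["impression_id", "user_id", "time", "history", "impressions"]
NEWS_COLUMNS = ["news_id", "category", "sub_category", "title", "abstract",
                "url", "title_entities", "abstract_entities"]


def tsv_to_csv_sketch(source, destination, is_behaviors):
    """Hypothetical stand-in for dpm.data_to_csv.

    Reads a headerless, tab-separated file, assigns column names,
    and writes the result as a CSV with the row index in the first column.
    """
    columns = BEHAVIORS_COLUMNS if is_behaviors else NEWS_COLUMNS
    frame = pd.read_csv(source, sep="\t", header=None, names=columns)
    frame.to_csv(destination)
    return frame


# Tiny in-memory example so the sketch runs without the dataset on disk.
sample_tsv = "1\tU1\t2019-11-15 00:00:00\tN1 N2\tN3-1 N4-0\n"
csv_buffer = io.StringIO()
sample = tsv_to_csv_sketch(io.StringIO(sample_tsv), csv_buffer, is_behaviors=True)
```

Because the index is written as the first CSV column, the file can be read back with `pd.read_csv(path, index_col=0)`, which is how the cells below load `behaviors.csv` and `news.csv`.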

In [None]:
## Changes to csv format
# Behaviors and then news
dpm.data_to_csv(True, '../MIND_large/tsv/behaviors.tsv')
dpm.data_to_csv(False, '../MIND_large/tsv/news.tsv')

### Processing for popularity counts 
To access popularity counts for both categories and articles, we wrote create_popularity_dfs and create_popularity_csvs, which extract popularity information and write it to csv files for later use in visualizations.
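
A rough sketch of how such counts could be derived is shown below. It assumes that popularity means the number of clicked impressions per article and that each impression token has the form `<news_id>-<0 or 1>`, as in the behaviors output in this notebook. The function names `count_article_clicks` and `popularity_frames` are hypothetical and are not the `dpm` implementations.

```python
from collections import Counter

import pandas as pd


def count_article_clicks(behaviors):
    """Count clicked impressions (suffix '-1') per article."""
    clicks = Counter()
    for impressions in behaviors["impressions"].dropna():
        for item in impressions.split():
            news_id, clicked = item.rsplit("-", 1)
            if clicked == "1":
                clicks[news_id] += 1
    return pd.Series(clicks, name="clicks", dtype="int64")


def popularity_frames(news, behaviors):
    """Return per-article and per-category click counts as two DataFrames."""
    article_clicks = (
        count_article_clicks(behaviors).rename_axis("news_id").reset_index()
    )
    articles = news[["news_id", "category"]].merge(
        article_clicks, on="news_id", how="left"
    )
    # Articles that were never clicked get a count of zero.
    articles["clicks"] = articles["clicks"].fillna(0).astype(int)
    categories = articles.groupby("category", as_index=False)["clicks"].sum()
    return articles, categories


# Toy data for illustration only.
toy_news = pd.DataFrame(
    {"news_id": ["N1", "N2", "N3"], "category": ["sports", "news", "sports"]}
)
toy_behaviors = pd.DataFrame({"impressions": ["N1-1 N2-0", "N1-1 N3-1", None]})
articles, categories = popularity_frames(toy_news, toy_behaviors)
```

Either frame can then be written out with `to_csv` for the visualizations.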

In [2]:
behaviors = pd.read_csv('../MIND_large/csv/behaviors.csv', index_col=0)
news = pd.read_csv('../MIND_large/csv/news.csv', index_col=0)
# dpm.create_popularity_csvs(news, behaviors, small=False)

In [3]:
behaviors = dpm.modify_hourly(behaviors)
behaviors = behaviors.sort_values('time')
behaviors.head()

| | impression_id | user_id | time | history | impressions | hour |
|---|---|---|---|---|---|---|
| 359673 | 359674 | U66319 | 2019-11-15 00:00:00 | N10721 N128129 N28406 N118998 N38884 N96764 N1... | N98015-0 N44109-0 N19586-0 N26122-0 N109853-0 ... | 1 |
| 271465 | 271466 | U714534 | 2019-11-15 00:00:01 | | N43355-0 N125161-0 N26122-1 N102958-0 N129510-... | 1 |
| 348597 | 348598 | U231191 | 2019-11-15 00:00:02 | N7742 N90185 N58034 N48215 N58477 N48215 N1167... | N63342-0 N106619-0 N43490-0 N68688-0 N9989-0 N... | 1 |
| 214469 | 214470 | U458420 | 2019-11-15 00:00:04 | N18843 N112324 N43234 N128546 N108752 N81970 N... | N74401-0 N76254-0 N67937-0 N69938-0 N102846-0 ... | 1 |
| 257816 | 257817 | U167725 | 2019-11-15 00:00:05 | N80865 N77603 N14898 N75485 | N43490-0 N100425-0 N32626-0 N129314-0 N74875-0... | 1 |


In [4]:
tf_dataset = dpm.decompose_interactions(news, behaviors)
tf_dataset.head()

Index(['news_id', 'category', 'sub_category', 'title', 'abstract', 'url',
       'title_entities', 'abstract_entities'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
Starting loop


  0%|          | 0/376471 [00:00<?, ?it/s]

100%|██████████| 376471/376471 [1:32:30<00:00, 67.83it/s]


| | user_id | time | news_id | category | sub_category | title | abstract | interaction_type | score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | U66319 | 1 | N10721 | entertainment | entertainment-celebrity | Mike Johnson asks out Keke Palmer after Demi L... | Mike Johnson tried to ask out Keke Palmer in a... | history | 1 |
| 1 | U66319 | 1 | N128129 | movies | movies-celebrity | Brie Larson Has the Best Reaction Ever After T... | The 'Captain Marvel' star was left speechless ... | history | 1 |
| 2 | U66319 | 1 | N28406 | news | newsworld | Accused dine-and-dashers in viral video at St.... | Five young black men who posted a video of a m... | history | 1 |
| 3 | U66319 | 1 | N118998 | news | newsgoodnews | Trooper pulls over to save flag on highway | The trooper is being praised for stopping his ... | history | 1 |
| 4 | U66319 | 1 | N38884 | sports | mma | UFC champ Khabib Nurmagomedov seen training in... | Khabib Nurmagomedov doesn't mess around. | history | 1 |


In [6]:
tf_dataset

| | user_id | time | news_id | category | sub_category | title | abstract | interaction_type | score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | U66319 | 01 | N10721 | entertainment | entertainment-celebrity | Mike Johnson asks out Keke Palmer after Demi L... | Mike Johnson tried to ask out Keke Palmer in a... | history | 1 |
| 1 | U66319 | 01 | N128129 | movies | movies-celebrity | Brie Larson Has the Best Reaction Ever After T... | The 'Captain Marvel' star was left speechless ... | history | 1 |
| 2 | U66319 | 01 | N28406 | news | newsworld | Accused dine-and-dashers in viral video at St.... | Five young black men who posted a video of a m... | history | 1 |
| 3 | U66319 | 01 | N118998 | news | newsgoodnews | Trooper pulls over to save flag on highway | The trooper is being praised for stopping his ... | history | 1 |
| 4 | U66319 | 01 | N38884 | sports | mma | UFC champ Khabib Nurmagomedov seen training in... | Khabib Nurmagomedov doesn't mess around. | history | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20645888 | U491432 | 00 | N87192 | finance | finance-companies | Bill Gates tops Jeff Bezos as world's richest ... | This time it's official. | impression | 0 |
| 20645889 | U491432 | 00 | N31918 | news | newsworld | Oscar Wilde's stolen ring found by Dutch 'art ... | A golden ring once given as a present by the f... | impression | 0 |
| 20645890 | U491432 | 00 | N73556 | sports | football_nfl | Report: At least 24 teams expected to attend K... | It looks like the majority of the teams in the... | impression | 0 |
| 20645891 | U491432 | 00 | N92223 | lifestyle | lifestylebuzz | This 10-Year-Old Girl's Demanding Christmas Wi... | A father recently shared his 10-year-old daugh... | impression | 0 |


In [5]:
tf_dataset.to_csv('../MIND_large/csv/tensorflowDatasetFullGood.csv')

In [18]:
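fpath = '../MIND_large/csv/tensorflow_dataset'
# Four chunks of 5161473 rows each (4 * 5161473 = 20645892 rows in total)
idx = [5161473 * i for i in range(5)]
idx[-1] += 1  # guard: keeps the last boundary past the final row index (20645891)
for i in range(len(idx) - 1):
    start, end = idx[i], idx[i + 1]
    # Select rows whose index falls in [start, end)
    chunk = tf_dataset[(tf_dataset.index >= start) & (tf_dataset.index < end)]
    chunk.to_csv(fpath + f'_chunk{i}.csv')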

In [10]:
20645892 * 0.7

14452124.399999999

In [12]:
# The boundary 14452123 comes from 20645892 * 0.7, so this is a 70/30 split
train_test = '70_30'
train = tf_dataset[(tf_dataset.index >= 0) & (tf_dataset.index < 14452123)]  # tentative
test = tf_dataset[(tf_dataset.index >= 14452123) & (tf_dataset.index < 20645893)]
# Write both splits to disk; fpath is used as a filename prefix
train.to_csv(fpath + 'train.csv')
test.to_csv(fpath + 'test.csv')


In [16]:
 
[i * 7226061.0 for i in range(3)] 

[0.0, 7226061.0, 14452122.0]

In [17]:
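# Two chunks covering the training rows (index 0 through 14452122)
idx = [i * 7226061 for i in range(3)]
idx[-1] += 1  # include the final training row in the last chunk
for i in range(len(idx) - 1):
    start, end = idx[i], idx[i + 1]
    chunk = tf_dataset[(tf_dataset.index >= start) & (tf_dataset.index < end)]
    chunk.to_csv(fpath + f'train_chunk{i}.csv')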

### Tensorflow compatibility
TensorFlow Recommenders requires the dataset to be in a specific format. Using decompose_interactions we create a DataFrame in that format, with one row per user-article interaction.
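
The long format described above can be sketched in pandas as follows. This is only an illustration of the idea, not the code in `dpm.decompose_interactions`: the name `decompose_sketch` is hypothetical, and the sketch assumes history items always receive a score of 1 while impression items take the 0 or 1 suffix as their score, which matches the output shown in this notebook. Row order may differ from the real function.

```python
import pandas as pd


def decompose_sketch(behaviors):
    """Turn one row per impression log into one row per user-article interaction."""
    # History: every article the user read earlier counts as a positive interaction.
    history = behaviors[["user_id", "history"]].dropna(subset=["history"]).copy()
    history["news_id"] = history["history"].str.split()
    history = history.explode("news_id")
    history["interaction_type"] = "history"
    history["score"] = 1

    # Impressions: tokens look like 'N123-1' (clicked) or 'N123-0' (not clicked).
    shown = behaviors[["user_id", "impressions"]].dropna(subset=["impressions"]).copy()
    shown["item"] = shown["impressions"].str.split()
    shown = shown.explode("item")
    parts = shown["item"].str.rsplit("-", n=1, expand=True)
    shown["news_id"] = parts[0]
    shown["score"] = parts[1].astype(int)
    shown["interaction_type"] = "impression"

    columns = ["user_id", "news_id", "interaction_type", "score"]
    return pd.concat([history[columns], shown[columns]], ignore_index=True)


# Toy data for illustration only.
toy_behaviors = pd.DataFrame(
    {
        "user_id": ["U1", "U2"],
        "history": ["N1 N2", None],
        "impressions": ["N3-1 N4-0", "N1-0"],
    }
)
interactions = decompose_sketch(toy_behaviors)
```

Article metadata such as category and title could then be attached with a merge on `news_id`.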

In [None]:
news.head()

Index(['news_id', 'category', 'sub_category', 'title', 'abstract', 'url',
       'title_entities', 'abstract_entities'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>


In [None]:
tf_dataset = pd.read_csv('../MIND_large/csv/tensorflowDataset.csv')
tf_dataset.head()

### Temporal Processing
Because the behaviors data includes the interaction timestamp, we analyzed the popularity of articles at different times of day. To process this data we used create_interaction_counts (output: behaviors_with_individual_counts). We then used modify_hourly, which extracts the hour from the timestamp.
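
As an illustration of the hour extraction, the sketch below parses the `time` column and groups rows by hour of day. The function name `add_hour_sketch` is hypothetical, and the 0 to 23 convention used here is an assumption that may not match what `dpm.modify_hourly` produces.

```python
import pandas as pd


def add_hour_sketch(behaviors):
    """Return a copy of the frame with an integer 'hour' column (0 to 23)."""
    out = behaviors.copy()
    out["hour"] = pd.to_datetime(out["time"]).dt.hour
    return out


# Toy timestamps for illustration only.
sample = pd.DataFrame(
    {"time": ["2019-11-15 00:00:05", "2019-11-15 13:45:00", "2019-11-15 13:59:59"]}
)
with_hour = add_hour_sketch(sample)

# Number of impression logs per hour of day.
per_hour = with_hour.groupby("hour").size()
```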

In [None]:
# dpm.create_interaction_counts()
behaviors = pd.read_csv('../MIND_large/csv/behaviors_with_individual_counts.csv', index_col=0).drop(columns='Unnamed: 0')
behaviors = dpm.modify_hourly(behaviors)

In [None]:
import pandas as pd
import numpy as np
import data_processing_modules as dpm

# Keep only the columns needed to build the user-item vectors
user_interacted = pd.read_csv('../MIND_large/csv/behaviors_grouped_with_history.csv')[['user_id', 'history', 'impressions']]

news = pd.read_csv('../MIND_large/csv/news.csv')['news_id']
# One int8 vector per article with one entry per user row:
# -1 = no recorded interaction, 0 = shown but not clicked, 1 = clicked or in history
news_data = {news_id: np.full(255990, -1, dtype='int8') for news_id in news}

def populate_dictionaries(behaviors_frame):
    """
    Populates the news_data dictionary with user preferences. Each row of
    behaviors_frame corresponds to one position (user index) in every article's vector.
    """
    # Might just want to use the popularity counts that are already found in behaviors with popularity counts? that could be a lot better imho
    for index, (history, impressions) in enumerate(zip(behaviors_frame['history'], behaviors_frame['impressions'])):

        # '-1' marks a user without history
        if history != '-1':
            for news_id in history.split():
                news_data[news_id][index] = 1

        # Missing impressions are read as NaN, which is a float
        if not isinstance(impressions, float):
            # Impressions are stored as a stringified list; strip the list syntax
            impressions = impressions.replace('[', '').replace(']', '')

            for impression_string in impressions.split(','):
                impression_string = impression_string.replace("'", "")
                for impression in impression_string.split():
                    impression_info = dpm.clean_impression(impression)
                    clicked = impression_info['score'] == '1'
                    news_data[impression_info['article_ID']][index] = 1 if clicked else 0

populate_dictionaries(user_interacted)
del user_interacted

In [None]:
# Chunk boundaries: 72023 / 7 = 10289 columns per chunk
matrix_separators = [10289 * i for i in range(8)]
matrix_separators

In [None]:
for chunk_number in range(len(matrix_separators) - 1):
    start, end = matrix_separators[chunk_number], matrix_separators[chunk_number + 1]
    # One row per user, one column per article in this chunk
    user_item_chunk = np.empty((255990, end - start), dtype='int8')
    for column, news_index in enumerate(range(start, end)):
        key = news[news_index]
        user_item_chunk[:, column] = news_data[key]
        del news_data[key]  # free memory as soon as the column is copied
    np.save(f'../MIND_large/{chunk_number + 1}user_item_mat.npy', user_item_chunk)
    del user_item_chunk



### Clustering Processing
To shrink the search space and make our recommenders more efficient, we clustered the news articles. We extracted embeddings for the abstracts and titles in the dataset from a pre-trained BERT model using create_text_embeddings. In addition to BERT embeddings, we used scikit-learn's bag-of-words and tf-idf vectorizers. Because the scikit-learn vectorizers need only a few lines of code, that preprocessing is done during clustering rather than ahead of time, unlike the BERT embeddings below.
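
To illustrate how little code the scikit-learn vectorizers need, here is a minimal sketch on a few made-up titles. The choice of `KMeans`, the number of clusters, and the `stop_words` setting are assumptions for the example and not necessarily what our clustering uses.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up titles for illustration only.
texts = [
    "Team wins championship game in overtime",
    "Star quarterback injured during game",
    "Stock market falls as tech shares drop",
    "Tech company shares rise after earnings",
]

# Bag of words: raw term counts per document.
bow = CountVectorizer(stop_words="english").fit_transform(texts)

# tf-idf: term counts reweighted by how rare each term is across documents.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Cluster the tf-idf vectors; n_clusters=2 is arbitrary for this toy data.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf)
```

Both vectorizers return sparse matrices with one row per document, which `KMeans` accepts directly.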

In [None]:
dpm.preprocess_BERT_embeddings(news, small=True)