# ALS and SGD
### Main Steps
* Initialize $U$ and $V$
* Determine accuracy metrics
* Set up max distance calculation
* Load in matrix 
* Compute solutions to $U$ and $V$
* Monitor Optimization Progress
* Visualize Results

#### Initializing U and V
Random initialization for now


In [None]:
import numpy as np
import pandas as pd

In [None]:
K = 5 # five latent factors tentatively 
np.random.seed(42)
# I is the number of users <- these will be determined by cluster size
# M is the number of items 
# Can either fill with 1's or random values
U = np.random.uniform(0, 1, size=K*I).reshape((I, K))
V = np.random.uniform(0, 1, size=K*M).reshape((M, K))
Uold = np.zeros_like(U)
Vold = np.zeros_like(V)

#### Accuracy metrics
Choosing RMSE for simplicity as well as setting up max update

In [None]:
def rmse(X, Y):
    return np.sqrt(np.nanmean((X-Y)**2))

def max_update(X, Y):
    return np.linalg.norm(((X-Y)/Y).ravel(), np.inf)

error = [(0, rmse(R, np.inner(U, V)))]

## SGD function
For stochastic gradient descent (SGD), we write separate update formulae for $U$ and $V$.

Within SGD the update functions for vectors in the user and item matrices are as follows: 

For all $m, i \in R$ where $R_{m,i}$ is an observed rating and $\alpha$ is the rate parameter,

$U_{i} = U_{i} + \alpha V_{m}(R_{m,i} - \langle V_{m}, U_{i} \rangle)$

$V_{m} = V_{m} + \alpha U_{i}(R_{m,i} - \langle V_{m}, U_{i} \rangle)$

Within the error function we pass in the inner product of $V$ and $U$, $\hat{R}$, along with ratings matrix $R$ to calculate the RMSE. RMSE was calculated as follows:

For vectors $x_i \in R$, $y_i \in \hat{R}$, $\text{RMSE}(R, \hat{R}) =  \left[\frac{1}{n}\sum_{i=1}^{n} \|x_i - y_i\|_2^2 \right]^{1/2}$
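
As a rough illustration of the update rules and the RMSE above, here is a minimal sketch on a small synthetic ratings matrix. The matrix sizes, the rate, and the helper name `sgd_epoch` are assumptions made for this example and are not part of the notebook's pipeline. `R` is indexed as `R[m, i]` to match the formulas, with `NaN` marking unobserved ratings. Unlike the cell below, this sketch applies both updates with the same residual.

```python
import numpy as np

def rmse(X, Y):
    # NaN entries (unobserved ratings) are ignored by nanmean
    return np.sqrt(np.nanmean((X - Y) ** 2))

def sgd_epoch(U, V, R, rate=0.05):
    """One pass over the observed entries of R, updating U and V in place."""
    for m, i in zip(*np.where(~np.isnan(R))):
        residual = R[m, i] - V[m] @ U[i]
        u_old = U[i].copy()              # keep the pre-update U[i] for the V[m] step
        U[i] += rate * residual * V[m]
        V[m] += rate * residual * u_old
    return U, V

rng = np.random.default_rng(0)
n_items, n_users, k = 6, 8, 2            # toy sizes (assumed)
R = rng.uniform(0, 1, (n_items, k)) @ rng.uniform(0, 1, (n_users, k)).T
R[rng.uniform(size=R.shape) < 0.3] = np.nan   # hide roughly 30% of the entries

U = rng.uniform(0, 1, (n_users, k))
V = rng.uniform(0, 1, (n_items, k))
history = [rmse(R, V @ U.T)]
for _ in range(200):
    sgd_epoch(U, V, R)
    history.append(rmse(R, V @ U.T))
print(history[0], history[-1])
```

The final RMSE should be well below the initial one; how quickly it falls depends on the rate and on how many entries are observed.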

In [None]:
def standard_SGD(U, V, R, error, update, rate=0.1, max_iterations=300, threshold=0.001):
    """
    Performs stochastic gradient descent on user and item matrices U and V optimizing the RMSE. 
    """
    # Run up to max_iterations - 1 passes over the observed ratings
    for iteration in range(1, max_iterations): # starting from one because iteration 0 is recorded before the loop
        for m, i in zip(*np.where(~np.isnan(R))): # might want to test for zero instead of NaN, as that is where no ratings are
            U[i] = U[i] + rate*V[m]*(R.iloc[m,i] - np.inner(V[m], U[i]))
            V[m] = V[m] + rate*U[i]*(R.iloc[m,i] - np.inner(V[m], U[i]))
        error += [(iteration, rmse(R, np.inner(V,U)))]
    return U, V, error


error = pd.DataFrame(error, columns=['iteration', 'rmse'])

## Matrix function testing


Next we build our matrices together with the hash maps that the ALS algorithm needs:

* a map from each item index to the indices of the users who rated it, used to optimize $V$
* a map from each user index to the indices of the items they rated, used to optimize $U$

When we create the item cluster matrix, we pass in the dataset along with the number of users or items, depending on our clustering method.

For the item-to-users map, we initialize an entry for every item index. Since users are identified by their index after grouping, we append a user's index to the list at an item's index whenever that user shows up for that item.

For the user-to-items map, inside the same loop we check whether the user already has an entry, add one if not, and then append the item index. Under item clustering the item index is just the item's cluster. Under user clustering with unclustered items, we would first have to look up each item's index, then whenever a cluster rates an item attach that item index to the cluster's list and add the cluster to the item's list in the other map. In short, this takes a lot of hash maps.
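
A minimal sketch of the two maps described above, assuming the ratings arrive as `(user_index, item_index, rating)` triples. The triples and the helper name `build_index_maps` are hypothetical; the notebook's own maps come from `matrix_modules.create_item_cluster_mat`, whose internals are not shown here.

```python
from collections import defaultdict

def build_index_maps(triples):
    """Return (user_to_items, item_to_users), both built in a single pass."""
    user_to_items = defaultdict(set)
    item_to_users = defaultdict(set)
    for user, item, _rating in triples:
        user_to_items[user].add(item)   # used to subset V when optimizing U
        item_to_users[item].add(user)   # used to subset U when optimizing V
    # Sorted lists give a stable order for indexing into the matrices
    return ({u: sorted(items) for u, items in user_to_items.items()},
            {i: sorted(users) for i, users in item_to_users.items()})

triples = [(0, 2, 1.0), (0, 0, 3.0), (1, 2, 2.0), (2, 1, 5.0)]
user_to_items, item_to_users = build_index_maps(triples)
print(user_to_items)   # {0: [0, 2], 1: [2], 2: [1]}
print(item_to_users)   # {2: [0, 1], 0: [0], 1: [2]}
```

Using sets while building avoids duplicate indices when the same user and item pair appears more than once.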

#### For testing purposes we use the top user indices to subset our matrix
Here U is set to 30 and V is set to 15. For actual subsetting we would just subset with indices pulled from the hash map. Open question: for item subsetting, do we also use the top user indices? Something to figure out when writing the actual code.

In [None]:
from tqdm import tqdm

## ALS introduction
Our implementation of ALS is based on lecture 14 of CME 323: Distributed Algorithms and Optimization (Stanford, Spring 2015). ALS is quite similar to stochastic gradient descent but differs in one key aspect: instead of updating one vector at a time, $U$ and $V$ alternate between being held fixed and being optimized. Additionally, in the Stanford formulation the user and item matrices $U$ and $V$ have dimensions $k \times n$ and $k \times m$ respectively. The complete ratings matrix $R$ is thus estimated via $\hat{R} = U^TV$.

This is formulated as the following non-convex optimization problem which seeks to minimize least squared error and handle regularization to avoid overfitting:

$$\min_{U,V} \sum_{r_{ij} \text{observed}}{(r_{ij}-u_{i}^Tv_{j})^2} + \lambda(\sum_{i}\|u_i\|^2 + \sum_{j}\|v_j\|^2)$$

While gradient descent can be used, it is slow and requires a large number of iterations, which motivates ALS. With $U$ fixed, the objective is a convex function of $V$, and vice versa. ALS therefore alternates between fixing one matrix and optimizing the other until convergence. Below is the general algorithm as described in the Stanford materials:

* Initialize $k \times n$ and $k \times m$ matrices $U$ and $V$
* Repeat the following until convergence
    * For all column vectors $i = 1, \dots, n$
    $$ u_i = \left(\sum_{r_{ij}\in r_{i *}}{v_jv_j^T} + \lambda I_k\right)^{-1} \sum_{r_{ij}\in r_{i *}}{r_{ij}v_{j}}$$

    * For all column vectors $j = 1, \dots, m$
    $$ v_j = \left(\sum_{r_{ij}\in r_{* j}}{u_iu_i^T} + \lambda I_k\right)^{-1} \sum_{r_{ij}\in r_{* j}}{r_{ij}u_{i}}$$

To break it down into pieces:
* $\sum v_jv_j^T$ and $\sum u_iu_i^T$ are sums of column vectors multiplied by their own transposes (outer products). The vectors in each sum are either the columns of $V$ for the items that user $u_i$ has rated, or the columns of $U$ for the users that have rated item $v_j$.
* $\lambda I_k$ represents the addition of a regularization term $\lambda$ to avoid overfitting.
* $\sum_{r_{ij}\in r_{i *}}{r_{ij}v_{j}}$ and $\sum_{r_{ij}\in r_{* j}}{r_{ij}u_{i}}$ represent the scaling of each column feature vector by a rating with indexing handled in the same way as $\sum v_jv_j^T$ and $\sum u_iu_i^T$
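
The pieces above can be put together in a small sketch on synthetic data. The sizes, the boolean mask `observed`, and the helper names `als_step` and `objective` are assumptions for illustration. It solves the $k \times k$ linear system rather than forming the inverse explicitly, which gives the same $u_i$ and $v_j$.

```python
import numpy as np

def als_step(U, V, R, observed, lam):
    """One ALS sweep: update every u_i with V fixed, then every v_j with U fixed."""
    k = U.shape[0]
    reg = lam * np.eye(k)
    for i in range(U.shape[1]):
        cols = np.flatnonzero(observed[i])          # items rated by user i
        Vi = V[:, cols]
        U[:, i] = np.linalg.solve(Vi @ Vi.T + reg, Vi @ R[i, cols])
    for j in range(V.shape[1]):
        rows = np.flatnonzero(observed[:, j])       # users who rated item j
        Uj = U[:, rows]
        V[:, j] = np.linalg.solve(Uj @ Uj.T + reg, Uj @ R[rows, j])
    return U, V

def objective(U, V, R, observed, lam):
    """Regularized squared error over the observed entries."""
    residual = (R - U.T @ V)[observed]
    return np.sum(residual ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

rng = np.random.default_rng(0)
n, m, k, lam = 12, 7, 3, 0.1                        # toy sizes (assumed)
R = rng.integers(0, 6, size=(n, m)).astype(float)
observed = rng.uniform(size=(n, m)) < 0.6           # boolean mask of observed ratings
U = rng.uniform(size=(k, n))
V = rng.uniform(size=(k, m))

losses = [objective(U, V, R, observed, lam)]
for _ in range(10):
    als_step(U, V, R, observed, lam)
    losses.append(objective(U, V, R, observed, lam))
print(losses)
```

Because each update is the exact minimizer of the objective in one vector with everything else fixed, the recorded losses should never increase from one sweep to the next.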

maybe I want to just discuss indexing separately at the start and then modify talking about it in each point?


## ALS implementation 
### (don't forget to discuss regularization terms)
ALS begins when the matrix $R$ is created. Since ALS requires us to subset $V$ and $U$ to the columns for the items a user has rated, or the users that have rated an item, we use several hash maps to store these indices. The hash maps are built during matrix initialization, which is efficient because we are already iterating over the items each user has rated. With the matrix created and the maps initialized, we create $U$ and $V$ as random matrices with entries drawn from a uniform distribution.

For the optimization steps we found that $\sum_{r_{ij}\in r_{i *}}{v_jv_j^T}$ and $\sum_{r_{ij}\in r_{* j}}{u_iu_i^T}$ are the same as $V_jV_j^T$ and $U_iU_i^T$, with $V_j$ and $U_i$ being the subsets of columns of $V$ and $U$ corresponding to observed ratings. The same shortcut does not carry over directly to $\sum_{r_{ij}\in r_{i *}}{r_{ij}v_{j}}$ and $\sum_{r_{ij}\in r_{* j}}{r_{ij}u_{i}}$. Instead, we found that multiplying the observed ratings, as a row vector, by $V_j^T$ or $U_i^T$ gives the same result as taking the sum.

Our final update functions for our matrices thus looked like:

$$ u_i = ({V_jV_j^T + \lambda I_k})^{-1} {R_{i*}V_{j}^T}$$

and 
$$ v_j = ({U_iU_i^T + \lambda I_k})^{-1} {R_{*j}U_{i}^T}$$
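
Both identities can be checked numerically. This is a small sketch with made-up sizes and random values; `V_j` stands for the columns of $V$ for one user's rated items and `r` for the matching observed ratings.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n_rated = 4, 6                       # toy sizes (assumed)
V_j = rng.normal(size=(k, n_rated))     # columns: feature vectors of the rated items
r = rng.integers(1, 6, size=n_rated).astype(float)   # the matching observed ratings

# Sum of outer products v_j v_j^T versus the single product V_j V_j^T
outer_sum = sum(np.outer(V_j[:, c], V_j[:, c]) for c in range(n_rated))
# Sum of rating-scaled vectors r_ij v_j versus the rating row times V_j^T
weighted_sum = sum(r[c] * V_j[:, c] for c in range(n_rated))

print(np.allclose(outer_sum, V_j @ V_j.T))      # True
print(np.allclose(weighted_sum, r @ V_j.T))     # True
```

With a 1-D NumPy array for the ratings, `r @ V_j.T` and `V_j @ r` give the same length-$k$ vector, so the row versus column question does not arise until the ratings are stored as a 2-D array.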

## Might need to include handling for when R is not a row vector and is instead a column, like for items

### Handling a subset for testing purposes
We need the $j$th rating in row $R_{i*}$ to correspond to the $j$th vector in $V^T$. Does it already do this even if we don't sort, and will it for the full matrix? I believe we will need to sort all the arrays that we have initialized.
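
On the alignment question: when the same index list is used to subset both the ratings row and the columns of $V$, position $c$ in one lines up with position $c$ in the other, and the sums in the update do not depend on the order of that list. A small check, with made-up sizes and a hypothetical index list `idx`:

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, lam = 3, 10, 0.01                 # toy sizes (assumed)
V = rng.normal(size=(k, m))
R_row = rng.integers(0, 6, size=m).astype(float)   # one user's row of R

def update_user(idx):
    """Closed-form ALS update for one user, using only the items in idx."""
    V_sub = V[:, idx]
    return np.linalg.solve(V_sub @ V_sub.T + lam * np.eye(k), R_row[idx] @ V_sub.T)

idx = [7, 1, 4, 9]                      # unsorted item indices (hypothetical)
u_unsorted = update_user(idx)
u_sorted = update_user(sorted(idx))
print(np.allclose(u_unsorted, u_sorted))   # True
```

What would break the alignment is subsetting $R$ with one ordering and $V$ with a different one, so sorting matters only if the two subsets are built from separate lists.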

In [None]:
# handles imports and sets up the matrix and hash maps

# imports
import matrix_modules
import pandas as pd
import numpy as np

# load in the data
ratings, news, users = matrix_modules.load_dataset()

# create the matrix and user and item hash maps 
matrix, item_idx, user_idx = matrix_modules.create_item_cluster_mat(ratings, news, isALS=True)
# make them into lists
item_idx = {cluster_number : list(users) for cluster_number, users in item_idx.items()}
user_idx = {user_id : list(ratings) for user_id, ratings in user_idx.items()}
# For testing the item clustered data
# M being the top M users
M = 15
# Sum across rows and columns
sum_rows = matrix.sum(axis=1)

# For users
sorted_row_indices = np.argsort(sum_rows)[-30:]  # gets the top 30 rows
top_user_indices = sorted_row_indices[-M:]  # Last M of these

# Now creates the sub-matrix
top_user_indices.sort()
sub_mat = matrix[top_user_indices, :]
df = pd.DataFrame(sub_mat, index=ratings.iloc[top_user_indices]['user_id'].index)
df

# This is to make sure that item indices are only for our users that we have in the subset, not used in the bigger example
subset_users = set(ratings.iloc[top_user_indices]['user_id'].index)
clean_item_idx = {}
for i in tqdm(range(30), total=30):
    # keep only the users of cluster i that are in the subset
    clean_item_idx[i] = [user for user in item_idx[i] if user in subset_users]

# used for the subset matrix
translator_map = {user_id : new_dx for new_dx, user_id in enumerate(ratings.iloc[top_user_indices]['user_id'].index)}
translator_inverse = {index:user_id for user_id, index in translator_map.items()}

# Sort the hash maps 
for cluster in range(30):
    clean_item_idx[cluster].sort() 
    # item_idx[cluster].sort()
for user in range(len(user_idx)):
    user_idx[user].sort()

In [2]:
full_ratings, news, full_users = matrix_modules.load_dataset(full=True)

In [3]:
full_ratings

Unnamed: 0,user_id,news_id,scores
0,U1,"[N14639, N27258, N63237, N112729, N42180, N109...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,U100,"[N99587, N61339, N129790, N12721, N100405, N10...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,U1000,"[N33446, N20131, N65823, N65823, N111503, N399...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, ..."
3,U10000,"[N34918, N81659, N128643, N97343, N103301, N92...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,U100005,"[N53796, N41484, N95178, N27038, N72493, N3850...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...
255985,U99984,"[N121592, N127001, N53018, N129416, N2110, N18...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
255986,U99989,"[N91, N108143, N26493, N15986, N82348, N71068,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
255987,U99993,"[N67397, N67770, N41129, N128503, N51724, N721...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
255988,U99994,"[N94577, N30729, N127177, N14611, N82747, N544...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, ..."


In [2]:
# Imports
from tqdm import tqdm
import matrix_modules
import pandas as pd
import numpy as np

# load in the data
ratings, news, users = matrix_modules.load_dataset()

# create the matrix and user and item hash maps 
R, item_idx, user_idx = matrix_modules.create_item_cluster_mat(ratings, news, isALS=True)

# make them into lists and sort
item_idx = {cluster_number : sorted(list(users)) for cluster_number, users in item_idx.items()}
user_idx = {user_id : sorted(list(ratings)) for user_id, ratings in user_idx.items()}

# is it possible to just filter out users that havent shown up yet? no but this is relying on the full data of them
# so well need to implement more

# initialize U and V
K = 5 # five latent factors tentatively 
I = len(user_idx) # number of users
M = 30 # number of items
np.random.seed(42)
U = np.random.uniform(0, 1, size=K*I).reshape((K, I))
V = np.random.uniform(0, 1, size=K*M).reshape((K, M))
Uold = np.zeros_like(U)
Vold = np.zeros_like(V)

# initialize a dataframe of the matrix to look at data
df = pd.DataFrame(R)
df

user_id,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
U1,1,27,0,14,2,0,0,0,3,0,...,3,0,1,0,0,2,0,0,0,2
U100,3,4,0,1,1,1,0,0,3,0,...,4,2,0,0,1,1,2,0,0,3
U1000,1,2,0,0,0,0,0,0,0,0,...,4,0,0,0,0,0,0,0,0,0
U10000,1,14,0,1,3,0,0,4,2,0,...,3,0,2,0,12,3,0,0,0,5
U100005,2,7,0,6,2,2,4,2,2,0,...,15,1,4,0,9,5,0,0,1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U99983,0,2,0,0,2,0,1,0,1,0,...,3,0,0,0,0,1,0,0,0,3
U99989,2,3,1,1,1,0,0,0,0,0,...,2,0,0,0,0,1,0,0,0,1
U99993,0,5,0,0,1,0,0,0,0,1,...,4,0,0,0,0,1,0,0,0,1
U99994,0,4,0,1,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
def rmse(X, Y):
    return np.sqrt(np.nanmean((X-Y)**2))

def max_update(X, Y):
    return np.linalg.norm(((X-Y)/Y).ravel(), np.inf)

error = [(0, rmse(R, U.T @ V))]  # U is k x n and V is k x m here, so R_hat = U^T V
update = [(0, max(max_update(Uold, U), max_update(Vold, V)))]

Maybe something else we can consider is making the full frame of all users and then only including the information for users that have ratings up until that time point, and then do the ALS then compare the two once done

In [3]:
def alternating_least_squares(U, V, R, user_map, item_map, max_iterations=10, lambda_reg=0.01,
                              error=None, update=None):
    """
    Takes in the ratings matrix and the user and item matrices, and performs alternating least squares
    optimization for max_iterations iterations, regularized by lambda_reg. U and V are updated in place.

    Args:
        U (np.ndarray) : The k x n user feature matrix.
        V (np.ndarray) : The k x m item feature matrix.
        R (np.ndarray) : The n x m ratings matrix.
        user_map (dict) : The hash map containing user ids as keys and item indices as values, gets used to subset the ratings matrix and V.
        item_map (dict) : The hash map containing item ids as keys and user indices as values, gets used to subset the ratings matrix and U.
        max_iterations (int) : The number of iterations to run alternating least squares for.
        lambda_reg (float) : The regularization term in the alternating least squares algorithm.
        error (list, optional) : If given, an (iteration, rmse) tuple is appended after every iteration.
        update (list, optional) : If given, an (iteration, max relative update) tuple is appended after every iteration.

    Returns:
        U (np.ndarray) : The optimized user feature matrix.
        V (np.ndarray) : The optimized item feature matrix.
    """
    # Initialize k and the number of columns in each matrix
    k, u_cols = U.shape
    _, v_cols = V.shape
    k_In = lambda_reg * np.eye(k)

    # Start optimizing U and V
    for iteration in tqdm(range(1, max_iterations+1), total=max_iterations, desc='Starting ALS iterations'):
        # Keep the previous iterates so the size of this iteration's update can be measured
        Uold = U.copy()
        Vold = V.copy()

        # Fix V and optimize U
        for i in tqdm(range(u_cols), total=u_cols, desc='Optimizing U', leave=False):
            # Ratings of user i and the feature vectors of the items they rated
            ratings_row = R[i, user_map[i]]
            rated_items = V[:, user_map[i]]

            # Update the ith vector of U (solving the k x k system avoids an explicit inverse)
            U[:, i] = np.linalg.solve((rated_items @ rated_items.T) + k_In, ratings_row @ rated_items.T)

        # Fix U and optimize V
        for j in tqdm(range(v_cols), total=v_cols, desc='Optimizing V', leave=False):
            # Ratings of item j and the feature vectors of the users who rated it
            ratings_row = R[item_map[j], j]
            user_features = U[:, item_map[j]]

            # Update the jth vector of V
            V[:, j] = np.linalg.solve((user_features @ user_features.T) + k_In, ratings_row @ user_features.T)

        # Record the error and the largest relative update for this iteration
        if error is not None:
            error.append((iteration, rmse(R, U.T @ V)))
        if update is not None:
            update.append((iteration, max(max_update(Uold, U), max_update(Vold, V))))

    return U, V


In [4]:
U_new, V_new = alternating_least_squares(U, V, R, user_idx, item_idx)

Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 42288.43it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 96.49it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 39249.18it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 101.15it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 38342.66it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 100.01it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 38725.48it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 91.23it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 41298.26it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 83.20it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 39293.17it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 67.74it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<00:00, 38634.19it/s]
Optimizing V: 100%|██████████| 30/30 [00:00<00:00, 71.06it/s]
Optimizing U: 100%|██████████| 187934/187934 [00:04<0

In [7]:
r_hat = U_new.T @ V_new

In [8]:
sample = pd.DataFrame(matrix[:15])
new = pd.DataFrame(r_hat[:15])

In [9]:
new

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,1.632723,27.068923,3.910091,7.498986,2.58741,3.909736,1.924279,1.861832,4.621027,1.229485,...,2.398902,2.758063,2.504717,1.149447,1.947392,2.209134,3.960078,2.246385,1.342757,3.646889
1,1.592757,3.916329,1.226662,1.55249,2.675147,0.504441,1.377265,0.670341,2.970209,0.582975,...,3.877909,1.499775,1.955401,0.591964,2.515494,2.015964,1.064654,0.635102,0.598527,3.498109
2,1.046131,1.998683,0.590506,1.02829,1.168455,0.41418,0.721337,0.340074,1.280056,0.29581,...,3.996232,0.645667,0.971729,0.301243,0.870215,1.206642,0.66225,0.342446,0.296103,1.776932
3,2.006007,13.807272,0.988146,0.819445,5.498879,-0.829227,1.9968,0.652008,5.355492,0.825071,...,1.980258,-0.25414,2.556029,0.64344,5.885691,3.061179,0.321573,0.06647,0.891056,7.755422
4,3.71914,7.002691,2.075693,4.113008,4.316897,1.369226,2.620324,1.193437,4.336246,1.045626,...,14.110907,1.970487,3.641885,1.054168,3.154874,4.341922,2.283521,1.147987,1.049688,6.420974
5,1.495362,4.023212,1.529099,-0.708498,1.409498,1.394809,1.003858,0.818345,3.982244,0.614623,...,4.205897,3.691262,0.86153,0.709016,1.771818,1.554724,1.678744,1.181111,0.611665,2.326178
6,0.962712,3.906607,1.049704,1.441345,2.200339,0.350534,1.045016,0.543376,2.353678,0.456469,...,0.952791,1.11456,1.572142,0.453436,2.20014,1.352492,0.783764,0.495597,0.480627,2.615518
7,0.766095,2.031942,0.662391,0.995682,1.092765,0.424367,0.638727,0.350842,1.284997,0.281073,...,2.179866,0.800373,0.910009,0.291886,0.939543,0.931962,0.639188,0.375108,0.287722,1.442474
8,1.159475,4.948357,2.070152,1.180039,1.313749,1.916122,1.070916,1.011533,3.55162,0.664561,...,1.916813,3.903492,1.290693,0.773917,1.498784,1.265474,2.093401,1.453159,0.679942,1.544426
9,0.911686,3.969593,-0.532324,3.631117,3.954033,-2.109354,1.225968,-0.177788,-0.162301,0.156148,...,1.108588,-4.345146,2.531708,-0.069845,3.159591,1.871439,-1.229027,-1.12871,0.202834,4.8006


In [None]:
def sum_ri_yi(R, Y):
    """
    Checks that summing the rating-scaled column vectors of Y matches the matrix product of the
    ratings row with the transposed subset of Y. Uses row 1 of R as the test case.
    """
    y_row, _ = Y.shape
    rated = user_idx[translator_inverse[1]]  # item indices rated by the user in row 1
    rating_row = R[1, rated]
    print(rating_row)
    Y_subset = Y[:, rated]

    # Explicit sum of r_ij * y_j over the observed ratings
    total = np.zeros((y_row, ))
    for i in range(len(rating_row)):
        total += rating_row[i] * Y_subset[:, i]

    # Same quantity as one product: the row vector of observed ratings times Y_subset^T
    test = rating_row @ Y_subset.T
    return total, test

total, test = sum_ri_yi(sub_mat, V)
print("Total")
print(total)
print("\n ~~~~ \n")
print("Test")
print(test)

# sub_mat[1, list(user_idx[(translator_inverse[1])])] @ V[:, list(user_idx[(translator_inverse[1])])].T

## Factorization Machines
Introduced in 2010, factorization machines combine matrix factorization methods with regression _check_. Factorization machines capture all single and pairwise interactions between variables with a closed model equation computable in linear time. This is advantageous as it allows stochastic gradient descent to be used to learn the model parameters.
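
To make the linear-time claim concrete, the sketch below compares a direct evaluation of the pairwise term with the standard reformulation $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2}\sum_f \left[ \left(\sum_i v_{i,f} x_i\right)^2 - \sum_i v_{i,f}^2 x_i^2 \right]$. The function names and the toy sizes are assumptions for this example.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation: loops over all feature pairs, O(k * n^2)."""
    n = len(x)
    pairwise = sum((V[i] @ V[j]) * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
    return w0 + w @ x + pairwise

def fm_predict_linear(x, w0, w, V):
    """Reformulated evaluation in O(k * n)."""
    s = V.T @ x                         # shape (k,): sum_i v_{i,f} x_i for each factor f
    s_sq = (V ** 2).T @ (x ** 2)        # shape (k,): sum_i v_{i,f}^2 x_i^2
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s_sq)

rng = np.random.default_rng(3)
n, k = 8, 3                             # toy sizes (assumed)
x = rng.integers(0, 2, size=n).astype(float)
w0, w, V = 0.5, rng.normal(size=n), rng.normal(size=(n, k))
print(np.isclose(fm_predict_naive(x, w0, w, V), fm_predict_linear(x, w0, w, V)))  # True
```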

What do we need to do for factorization machines?

* We need the data in a different format
    * How do we want our data to look?
    * We would want to one-hot encode our entire tensorflow dataset, which is very costly in memory
* We need gradient updates for $w$, $w_0$, and the factor matrix $V$
* We need to evaluate the gradients after updates
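
A rough sketch of what those updates could look like for one training example, assuming squared error loss and binary one-hot features stored as a list of active indices rather than a dense row. The names `fm_predict` and `fm_sgd_step`, the rate, and the toy sizes are all hypothetical.

```python
import numpy as np

def fm_predict(active, w0, w, V):
    """FM prediction for a binary feature vector given by its active indices."""
    Va = V[active]                              # (n_active, k)
    s = Va.sum(axis=0)
    return w0 + w[active].sum() + 0.5 * np.sum(s ** 2 - (Va ** 2).sum(axis=0))

def fm_sgd_step(active, y, w0, w, V, rate=0.01):
    """One SGD step on (y_hat - y)^2; w and V are updated in place, w0 is returned.

    Assumes the indices in `active` are unique.
    """
    Va = V[active]                              # copy of the active rows before the update
    s = Va.sum(axis=0)
    err = fm_predict(active, w0, w, V) - y
    w0 = w0 - rate * 2 * err                    # d y_hat / d w0 = 1
    w[active] -= rate * 2 * err                 # d y_hat / d w_i = x_i = 1
    V[active] -= rate * 2 * err * (s - Va)      # d y_hat / d v_{i,f} = s_f - v_{i,f}
    return w0

rng = np.random.default_rng(4)
n_features, k = 20, 4                           # toy sizes (assumed)
w0, w = 0.0, np.zeros(n_features)
V = rng.normal(scale=0.1, size=(n_features, k))
active, y = np.array([2, 7, 11]), 1.0           # e.g. one user, one item, one category

before = (fm_predict(active, w0, w, V) - y) ** 2
for _ in range(50):
    w0 = fm_sgd_step(active, y, w0, w, V)
after = (fm_predict(active, w0, w, V) - y) ** 2
print(before, after)
```

Only the rows of `w` and `V` for the active features are touched, so the cost of a step scales with the number of non-zero features rather than with the full one-hot width.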



In [1]:
import matrix_modules
import pandas as pd
import numpy as np

In [4]:
dataset = pd.DataFrame()
train_split = '80_20'
for i in range(2):
    df = pd.read_csv(f"../MIND_large/{train_split}/train_chunk{i}.csv", index_col=0)
    dataset = pd.concat([dataset, df])   
dataset.head()

Unnamed: 0,user_id,time,news_id,category,sub_category,title,abstract,interaction_type,score
0,U66319,1,N10721,entertainment,entertainment-celebrity,Mike Johnson asks out Keke Palmer after Demi L...,Mike Johnson tried to ask out Keke Palmer in a...,history,1
1,U66319,1,N128129,movies,movies-celebrity,Brie Larson Has the Best Reaction Ever After T...,The 'Captain Marvel' star was left speechless ...,history,1
2,U66319,1,N28406,news,newsworld,Accused dine-and-dashers in viral video at St....,Five young black men who posted a video of a m...,history,1
3,U66319,1,N118998,news,newsgoodnews,Trooper pulls over to save flag on highway,The trooper is being praised for stopping his ...,history,1
4,U66319,1,N38884,sports,mma,UFC champ Khabib Nurmagomedov seen training in...,Khabib Nurmagomedov doesn't mess around.,history,1


In [5]:
dataset['identifier'] = dataset['news_id']
previously_viewed = dataset.groupby('user_id')['news_id'].apply(list)
# maybe we just maintain a hash map of every user with all their viewed articles
# and then for every interaction we just remove the instance of the article that isnt there?


user_id
U1         [N14639, N27258, N63237, N112729, N42180, N109...
U100       [N99587, N61339, N129790, N12721, N100405, N10...
U1000      [N33446, N20131, N65823, N65823, N111503, N399...
U10000     [N34918, N81659, N128643, N97343, N103301, N92...
U100005    [N53796, N41484, N95178, N27038, N72493, N3850...
Name: news_id, dtype: object

In [7]:
user_item_map = previously_viewed.to_dict()

In [12]:
dataset.head()

Unnamed: 0,user_id,time,news_id,category,sub_category,title,abstract,interaction_type,score,identifier
0,U66319,1,N10721,entertainment,entertainment-celebrity,Mike Johnson asks out Keke Palmer after Demi L...,Mike Johnson tried to ask out Keke Palmer in a...,history,1,N10721
1,U66319,1,N128129,movies,movies-celebrity,Brie Larson Has the Best Reaction Ever After T...,The 'Captain Marvel' star was left speechless ...,history,1,N128129
2,U66319,1,N28406,news,newsworld,Accused dine-and-dashers in viral video at St....,Five young black men who posted a video of a m...,history,1,N28406
3,U66319,1,N118998,news,newsgoodnews,Trooper pulls over to save flag on highway,The trooper is being praised for stopping his ...,history,1,N118998
4,U66319,1,N38884,sports,mma,UFC champ Khabib Nurmagomedov seen training in...,Khabib Nurmagomedov doesn't mess around.,history,1,N38884


In [14]:
# look up each row's user in the map to attach that user's full viewing history
dataset['viewed_items'] = dataset['user_id'].map(user_item_map)


Unnamed: 0,user_id,time,news_id,category,sub_category,title,abstract,interaction_type,score,viewed_items
0,U66319,1,N10721,entertainment,entertainment-celebrity,Mike Johnson asks out Keke Palmer after Demi L...,Mike Johnson tried to ask out Keke Palmer in a...,history,1,"[N10721, N128129, N28406, N118998, N38884, N96..."
1,U66319,1,N128129,movies,movies-celebrity,Brie Larson Has the Best Reaction Ever After T...,The 'Captain Marvel' star was left speechless ...,history,1,"[N10721, N128129, N28406, N118998, N38884, N96..."
2,U66319,1,N28406,news,newsworld,Accused dine-and-dashers in viral video at St....,Five young black men who posted a video of a m...,history,1,"[N10721, N128129, N28406, N118998, N38884, N96..."
3,U66319,1,N118998,news,newsgoodnews,Trooper pulls over to save flag on highway,The trooper is being praised for stopping his ...,history,1,"[N10721, N128129, N28406, N118998, N38884, N96..."
4,U66319,1,N38884,sports,mma,UFC champ Khabib Nurmagomedov seen training in...,Khabib Nurmagomedov doesn't mess around.,history,1,"[N10721, N128129, N28406, N118998, N38884, N96..."
...,...,...,...,...,...,...,...,...,...,...
14452118,U33064,12,N129416,music,musicnews,"Taylor Swift Rep Hits Back at Big Machine, Cla...",Taylor Swift's team responds to Big Machine's ...,impression,0,"[N128643, N129175, N51342, N128129, N94365, N7..."
14452119,U698812,12,N76525,news,newscrime,Sean Kratz Found Guilty Of First-Degree Murder...,Sean Kratz has been found guilty in the murder...,impression,0,"[N60169, N123045, N47101, N53521, N110161, N90..."
14452120,U698812,12,N13243,foodanddrink,foodnews,Sky Bars are coming back to stores and classic...,Get your sweet tooth ready.,impression,0,"[N60169, N123045, N47101, N53521, N110161, N90..."
14452121,U698812,12,N123209,tv,tvnews,Survivor Contestants Missy Byrd and Elizabeth ...,Survivor contestants Missy Byrd and Elizabeth ...,impression,0,"[N60169, N123045, N47101, N53521, N110161, N90..."


In [15]:
dataset.drop(columns='identifier', inplace=True)

In [17]:
dataset['time'] = dataset['time'].apply(str)
labels = dataset['score'].copy()
dataset = dataset.drop(columns='score')  # assign the result so the label column is actually removed
dataset

Unnamed: 0,user_id,time,news_id,category,sub_category,title,abstract,interaction_type,viewed_items
0,U66319,1,N10721,entertainment,entertainment-celebrity,Mike Johnson asks out Keke Palmer after Demi L...,Mike Johnson tried to ask out Keke Palmer in a...,history,"[N10721, N128129, N28406, N118998, N38884, N96..."
1,U66319,1,N128129,movies,movies-celebrity,Brie Larson Has the Best Reaction Ever After T...,The 'Captain Marvel' star was left speechless ...,history,"[N10721, N128129, N28406, N118998, N38884, N96..."
2,U66319,1,N28406,news,newsworld,Accused dine-and-dashers in viral video at St....,Five young black men who posted a video of a m...,history,"[N10721, N128129, N28406, N118998, N38884, N96..."
3,U66319,1,N118998,news,newsgoodnews,Trooper pulls over to save flag on highway,The trooper is being praised for stopping his ...,history,"[N10721, N128129, N28406, N118998, N38884, N96..."
4,U66319,1,N38884,sports,mma,UFC champ Khabib Nurmagomedov seen training in...,Khabib Nurmagomedov doesn't mess around.,history,"[N10721, N128129, N28406, N118998, N38884, N96..."
...,...,...,...,...,...,...,...,...,...
14452118,U33064,12,N129416,music,musicnews,"Taylor Swift Rep Hits Back at Big Machine, Cla...",Taylor Swift's team responds to Big Machine's ...,impression,"[N128643, N129175, N51342, N128129, N94365, N7..."
14452119,U698812,12,N76525,news,newscrime,Sean Kratz Found Guilty Of First-Degree Murder...,Sean Kratz has been found guilty in the murder...,impression,"[N60169, N123045, N47101, N53521, N110161, N90..."
14452120,U698812,12,N13243,foodanddrink,foodnews,Sky Bars are coming back to stores and classic...,Get your sweet tooth ready.,impression,"[N60169, N123045, N47101, N53521, N110161, N90..."
14452121,U698812,12,N123209,tv,tvnews,Survivor Contestants Missy Byrd and Elizabeth ...,Survivor contestants Missy Byrd and Elizabeth ...,impression,"[N60169, N123045, N47101, N53521, N110161, N90..."


In [18]:
labels

0           1
1           1
2           1
3           1
4           1
           ..
14452118    0
14452119    0
14452120    0
14452121    0
14452122    0
Name: score, Length: 14452123, dtype: int64

It seems we need to build a separate dataframe of history features and then concatenate it to our main one. To do that we load our news dataset and create a column for every news ID. We can do this as a dictionary, iterate over the rows setting 1s and 0s for the features, and then turn the dictionary into a dataframe.
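
One possible way to hold these history features without allocating a dense array per news ID is to record only the (row, column) positions of the 1s. The sketch below uses tiny made-up stand-ins for `news['news_id']`, `dataset['user_id']`, and `user_item_map`; the resulting index pairs are in the form that a sparse constructor such as `scipy.sparse.coo_matrix` accepts.

```python
import numpy as np

# Toy stand-ins for the notebook's objects (all values here are made up)
news_ids = ["N1", "N2", "N3", "N4"]
row_users = ["U1", "U2", "U1"]                       # one entry per interaction row
user_item_map = {"U1": ["N1", "N3"], "U2": ["N4", "N4", "N2"]}

col_of = {news_id: c for c, news_id in enumerate(news_ids)}

rows, cols = [], []
for row, user in enumerate(row_users):
    for news_id in set(user_item_map[user]):         # set() drops repeated views
        if news_id in col_of:                        # skip IDs missing from the news table
            rows.append(row)
            cols.append(col_of[news_id])
rows, cols = np.array(rows), np.array(cols)

# Dense view, only sensible at toy scale
dense = np.zeros((len(row_users), len(news_ids)), dtype=np.int8)
dense[rows, cols] = 1
print(dense)
```

At full scale only `rows` and `cols` would be kept, since the dense array grows with the number of interaction rows times the number of news IDs.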

In [21]:
news = pd.read_csv('../MIND_large/csv/news.csv', index_col=0)
news.head()

Unnamed: 0,news_id,category,sub_category,title,abstract,url,title_entities,abstract_entities
0,N88753,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N23144,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N86255,health,medical,Dispose of unwanted prescription drugs during ...,,https://assets.msn.com/labs/mind/AAISxPN.html,"[{""Label"": ""Drug Enforcement Administration"", ...",[]
3,N93187,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
4,N75236,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."


In [26]:
new_dict = {news_id : np.full(len(dataset), 0, dtype='int8') for news_id in news['news_id']} # should not be len news, has to be len tfds 



In [23]:
new_dict

{'N88753': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N23144': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N86255': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N93187': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N75236': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N99744': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N5771': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N124534': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N51947': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N59220': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N17957': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N40259': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N42222': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N46520': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N40599': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N22273': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N30547': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N42639': array([0, 0, 0, ..., 0, 0, 0], dtype=int8),
 'N117551'

In [None]:
# then for subsetting we iterate through each user 
# user -> item history map dict -> history list
# for item in history list:
    # then counter -> new_dict[history list][counter] = 1

for num, user in enumerate(dataset['user_id']):
    history = user_item_map[user]
    for item in history:
        new_dict[item][num] = 1

new_dict

IndexError: index 72023 is out of bounds for axis 0 with size 72023