# Description of modelling practices
To cover a wide range of methods and frameworks, we evaluated recommender systems under a variety of models: stochastic gradient descent, alternating least squares, factorization machines, TensorFlow Recommenders and, most notably, NVIDIA Merlin. Using clustering, we saved on compute time by shrinking the search space that our matrix factorization needed to operate on.

## Utilizing clustering
Stochastic gradient descent and alternating least squares both operate on a user-item ratings matrix $R$, and clustering let us experiment with how $R$ is built. We used three variants of $R$ for our baseline: one using item clusters, one using user clusters and one using both. Setting up $R$ in this way increased the range of ratings, introduced collaborative, content and hybrid filtering techniques into the models that lacked them, and enhanced the models that already had them.

### Ratings calculations for clusters
#### Item and User Clusters

The rating for user $u_i$ in an item cluster $IC$ is denoted $IC_i$ and is calculated as the sum of all interactions that user $u_i$ had with the items in the cluster. Mathematically, it can be expressed as:
$$
IC_{i} = \sum_{j=1}^{n} I_{ij}
$$
The rating for item $v_{j}$ in a user cluster $UC$ is denoted $UC_{j}$ and is calculated as the sum of all interactions that users in the cluster had with item $v_{j}$. Mathematically, it can be expressed as:
$$
UC_{j} = \sum_{i=1}^{m} I_{ij}
$$

Under both user and item clustering:
- $n$ is the total number of items in the cluster,
- $m$ is the total number of users in the cluster,
- $I_{ij}$ is a binary indicator defined as follows:

$$
I_{ij} = \begin{cases} 1, & \text{if user } u_{i} \text{ has interacted with item } v_{j} \\
0, & \text{otherwise} \end{cases}
$$
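The two cluster ratings can be sketched with NumPy as follows. This is a minimal illustration rather than our production code: the toy interaction matrix, the cluster labels and the helper names `item_cluster_ratings` and `user_cluster_ratings` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: interactions[i, j] = 1 if user i interacted with item j.
interactions = np.array([
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

# Hypothetical cluster labels (not taken from our dataset).
item_cluster = np.array([0, 0, 1, 1])  # cluster label of each item
user_cluster = np.array([0, 1, 0])     # cluster label of each user


def item_cluster_ratings(interactions, item_cluster):
    """Return a (users x item clusters) matrix whose entries are IC_i."""
    clusters = np.unique(item_cluster)
    # For each cluster, sum each user's interactions over the items in it.
    return np.stack(
        [interactions[:, item_cluster == c].sum(axis=1) for c in clusters], axis=1
    )


def user_cluster_ratings(interactions, user_cluster):
    """Return a (user clusters x items) matrix whose entries are UC_j."""
    clusters = np.unique(user_cluster)
    # For each cluster, sum each item's interactions over the users in it.
    return np.stack(
        [interactions[user_cluster == c, :].sum(axis=0) for c in clusters], axis=0
    )


R_item = item_cluster_ratings(interactions, item_cluster)
R_user = user_cluster_ratings(interactions, user_cluster)
```

In this toy case `R_item` has one row per user and one column per item cluster, while `R_user` has one row per user cluster and one column per item.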

#### Using both clusters
To further experiment with clustering, we also composed $R$ from the users in a user cluster $UC$ and the items from their highest-rated item cluster $IC$.

#### How our matrices might look
Maybe a visualization or drawing of our matrices?

### Adding Features 
To make further use of richer features, we experimented with appending user and item features to $U$ and $V$ respectively. Under ALS, the respective features were concatenated to the bottom of $U$ and $V$, whereas under SGD they were concatenated to the right-hand side.

# Modelling : Ranking
Ranking is the step where we take a sparse set of ratings and utilize several different methods to predict the ratings that other users will give to currently unrated articles.

For matrix factorization techniques, predicting ratings is formulated as the following non-convex optimization problem which seeks to minimize least squared error and use regularization to avoid overfitting:

$$\min_{U,V} \sum_{r_{ij} \text{observed}}{(r_{ij}-u_{i}^Tv_{j})^2} + \lambda(\sum_{i}\|u_i\|^2 + \sum_{j}\|v_j\|^2)$$ 

## Gradient Descent
### Introduction
Under gradient descent, we take partial derivatives of the aforementioned objective with respect to $u_i$ and $v_j$, and update each vector in $U$ and $V$ at the indices corresponding to observed ratings.

The update formulae for $u_i$ and $v_m$, with $i$ and $m$ corresponding to the row and column of an observed rating $r_{im}$ and $\alpha$ denoting the learning rate, are as follows:

$$
\begin{aligned}
u_i^{\text{new}} &= u_i + 2\alpha (r_{im} - u_i^T v_m) v_m - 2\alpha\lambda u_i\\
v_m^{\text{new}} &= v_m + 2\alpha (r_{im} - u_i^T v_m) u_i - 2\alpha\lambda v_m
\end{aligned}
$$

### Implementation
Our implementation of gradient descent for ratings matrix prediction is simple. To train $U$ and $V$, we iterate over the indices of all observed ratings, updating the corresponding rows in $U$ and $V$ at each rating. Additionally, we track the updates so that once the objective starts to converge we can break out of the loop and avoid wasting compute.
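The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `matrix_modules`: the function name `sgd_matrix_factorization`, the `(user, item, rating)` triple format, the hyperparameter defaults and the loss-based stopping rule are all hypothetical choices made for the example.

```python
import numpy as np


def sgd_matrix_factorization(observed, n_users, n_items, k=2, alpha=0.01,
                             lam=0.01, max_epochs=500, tol=1e-6, seed=0):
    """Factorize ratings given as (user, item, rating) triples.

    In this sketch the rows of U and V are the latent vectors u_i and v_m.
    """
    rng = np.random.default_rng(seed)
    U = rng.uniform(0, 1, size=(n_users, k))
    V = rng.uniform(0, 1, size=(n_items, k))
    losses = []

    for _ in range(max_epochs):
        for i, m, r in observed:
            u_old = U[i].copy()
            err = r - u_old @ V[m]
            # Both updates use the values from before this step.
            U[i] = u_old + 2 * alpha * err * V[m] - 2 * alpha * lam * u_old
            V[m] = V[m] + 2 * alpha * err * u_old - 2 * alpha * lam * V[m]

        # Track the regularized squared error to detect convergence.
        loss = sum((r - U[i] @ V[m]) ** 2 for i, m, r in observed)
        loss += lam * (np.sum(U ** 2) + np.sum(V ** 2))
        losses.append(loss)
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < tol:
            break  # the objective has stopped changing, so stop early

    return U, V, losses


# Hypothetical toy ratings: (user index, item index, rating).
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0), (2, 2, 2.0)]
U, V, losses = sgd_matrix_factorization(observed, n_users=3, n_items=3)
```

The predicted ratings matrix in this layout would be `U @ V.T`. The early exit compares successive values of the objective, which is one of several reasonable ways to detect convergence.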

## Alternating Least Squares
### Introduction
Our implementation of ALS is based on lecture 14 of Stanford's CME 323: Distributed Algorithms and Optimization (Spring 2015). ALS is quite similar to stochastic gradient descent but differs in one key aspect: instead of updating one vector at a time, $U$ and $V$ alternate, with one held fixed while the other is optimized. Additionally, under the Stanford formulation the user and item matrices $U$ and $V$ are of dimension $k \times n$ and $k \times m$ respectively. The complete ratings matrix $R$ is thus estimated via $\hat{R} = U^TV$.

This is formulated as the following non-convex optimization problem which seeks to minimize least squared error and handle regularization to avoid overfitting:

$$\min_{U,V} \sum_{r_{ij} \text{observed}}{(r_{ij}-u_{i}^Tv_{j})^2} + \lambda(\sum_{i}\|u_i\|^2 + \sum_{j}\|v_j\|^2)$$

While gradient descent can be used, it is slow and requires a large number of iterations, which motivates ALS. The objective is non-convex in $U$ and $V$ jointly, but fixing $U$ yields a convex function of $V$, and vice versa. ALS therefore alternates between fixing one matrix and optimizing the other until convergence. Below is the general algorithm as described in the Stanford materials:

* Initialize the $k \times n$ and $k \times m$ matrices $U$ and $V$
* Repeat the following until convergence
    * For all column vectors $u_i$, $i = 1, \dots, n$
    $$ u_i = \left(\sum_{r_{ij}\in r_{i *}}{v_jv_j^T} + \lambda I_k\right)^{-1} \sum_{r_{ij}\in r_{i *}}{r_{ij}v_{j}}$$

    * For all column vectors $v_j$, $j = 1, \dots, m$
    $$ v_j = \left(\sum_{r_{ij}\in r_{* j}}{u_iu_i^T} + \lambda I_k\right)^{-1} \sum_{r_{ij}\in r_{* j}}{r_{ij}u_{i}}$$

To break it down into pieces:
* $\sum v_jv_j^T$ and $\sum u_iu_i^T$ are sums of column vectors multiplied by their own transposes. The vectors included are either the columns of $V$ corresponding to items that user $u_i$ has rated, or the columns of $U$ corresponding to users that have rated item $v_j$.
* $\lambda I_k$ represents the addition of a regularization term $\lambda$ to avoid overfitting.
* $\sum_{r_{ij}\in r_{i *}}{r_{ij}v_{j}}$ and $\sum_{r_{ij}\in r_{* j}}{r_{ij}u_{i}}$ scale each column vector by its rating and sum the results, with indexing handled in the same way as for $\sum v_jv_j^T$ and $\sum u_iu_i^T$.

maybe I want to just discuss indexing separately at the start and then modify talking about it in each point?


### Implementation 
ALS begins when the matrix $R$ is created. Since ALS requires us to subset $V$ and $U$ for the columns that correspond to items a user has rated, or users that have rated an item, we used several hash maps to store these indices. The hash maps were populated during matrix initialization, which was efficient because we were already iterating over the items that each user had rated. With the matrix created and our maps initialized, we created $U$ and $V$ as random matrices with entries drawn from a uniform distribution.

For the optimization steps we found that $\sum_{r_{ij}\in r_{i *}}{v_jv_j^T}$ and $\sum_{r_{ij}\in r_{* j}}{u_iu_i^T}$ are the same as $V_jV_j^T$ and $U_iU_i^T$, with $V_j$ and $U_i$ being the subsets of columns of $V$ and $U$ corresponding to observed ratings. The same shortcut did not apply directly to $\sum_{r_{ij}\in r_{i *}}{r_{ij}v_{j}}$ and $\sum_{r_{ij}\in r_{* j}}{r_{ij}u_{i}}$; instead, we found that we could multiply the observed ratings, as a row vector, by $V_j^T$ or $U_i^T$ and get the same result as taking the sum (up to a transpose). Additionally, to introduce regularization we added $\lambda$ multiplied by the $k \times k$ identity matrix to the $V_jV_j^T$ and $U_iU_i^T$ terms.

Our final update functions for our matrices thus looked like:

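$$ u_i = ({V_jV_j^T + \lambda I_k})^{-1} ({R_{i*}V_{j}^T})^T$$

$$ v_j = ({U_iU_i^T + \lambda I_k})^{-1} ({R_{*j}U_{i}^T})^T$$

Here $R_{i*}$ and $R_{*j}$ hold the observed ratings for user $i$ and item $j$ as row vectors, so each product is transposed to give a $k \times 1$ column.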
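As a sanity check on the matrix form, the sketch below computes one user update both ways and compares them. It is an illustration only: `als_update_user`, the toy `V`, the rated item indices and the value of `lam` are assumptions for the example, and it uses `np.linalg.solve` in place of an explicit inverse, which gives the same result for this system.

```python
import numpy as np


def als_update_user(V, ratings, rated_items, lam):
    """Solve for one user vector u_i with V held fixed.

    V is k x m, `rated_items` lists the item columns the user has rated and
    `ratings` holds the matching observed ratings.
    """
    k = V.shape[0]
    V_j = V[:, rated_items]                     # k x (number of rated items)
    A = V_j @ V_j.T + lam * np.eye(k)           # sum of v_j v_j^T plus lambda * I_k
    b = V_j @ np.asarray(ratings, dtype=float)  # sum of r_ij * v_j
    return np.linalg.solve(A, b)


# Hypothetical toy inputs.
rng = np.random.default_rng(0)
V = rng.uniform(0, 1, size=(3, 6))
rated_items = [0, 2, 5]
ratings = [4.0, 1.0, 3.0]
lam = 0.1

u_i = als_update_user(V, ratings, rated_items, lam)

# The same quantity computed term by term, as in the summation form.
A_sum = sum(np.outer(V[:, j], V[:, j]) for j in rated_items) + lam * np.eye(3)
b_sum = sum(r * V[:, j] for r, j in zip(ratings, rated_items))
u_i_sum = np.linalg.solve(A_sum, b_sum)
```

The item update is symmetric: swap the roles of `U` and `V` and index by the users that have rated the item.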

## Factorization Machines
### Introduction
Introduced in 2010, factorization machines combine the advantages of support vector machines with those of factorization models. They capture all single and pairwise interactions between variables with a closed model equation that is computable in linear time. This is advantageous as it allows stochastic gradient descent to be used to learn the model parameters.
Factorization machines utilize high-dimensional feature vectors along with a feature matrix denoted $V$. We implemented a factorization machine of degree 2, which per the introductory paper has the following equation:

$$ \hat{y}(x) := w_0 + \sum_{i=1}^{n}{w_i x_i} + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j$$

The estimated model parameters are $w_0$, $w$ and $V$, where $w_0$ is the global bias, $w$ holds the weights of all possible features in a feature vector $x$ and $V$ is an $n \times k$ feature matrix. Pairwise interactions are modeled by $\langle \mathbf{v}_i, \mathbf{v}_j\rangle$.

A row within the feature matrix $V$ is defined as $v_i$ which describes the i-th feature with $k$ factors where $k$ represents the dimensionality of the factorization. 
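The linear-time claim rests on rewriting the pairwise term as $\frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i} v_{i,f}x_i\right)^2 - \sum_{i} v_{i,f}^2x_i^2\right]$. The sketch below compares that form against the direct double sum on random data. The function names and toy inputs are assumptions for illustration and are not part of our implementation.

```python
import numpy as np


def pairwise_naive(x, V):
    """Sum <v_i, v_j> x_i x_j over all pairs i < j (quadratic in n)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total


def pairwise_linear(x, V):
    """Same quantity via the reformulation that is linear in n for fixed k."""
    xv = V * x[:, None]  # row i is x_i * v_i
    return 0.5 * np.sum(xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))


def fm_predict(w_0, w, V, x):
    """Degree-2 factorization machine prediction for a dense feature vector x."""
    return w_0 + np.dot(w, x) + pairwise_linear(x, V)


# Hypothetical toy parameters and feature vector.
rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.uniform(0, 1, size=n)
w = rng.uniform(0, 1, size=n)
V = rng.uniform(0, 1, size=(n, k))
w_0 = 0.5
```

With sparse feature vectors the sums only need to run over the non-zero entries of $x$, which is why the prediction stays cheap even when $n$ is large.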

### Gradient Descent
Per the introductory paper on factorization machines, model parameters $w_0$, $w$ and $V$ can all be learned via gradient descent methods on a variety of losses. As a result, we utilized stochastic gradient descent to optimize and tune our model parameters with our data. Below is the gradient vector of the function $\hat{y}$ for the estimated model parameters.
$$\frac{\partial}{\partial\theta}\hat{y}(x) = \begin{cases} 1, & \text{if } \theta \text{ is } w_0 \\ x_i, & \text{if } \theta \text{ is } w_i \\ x_i\sum_{j=1}^{n}{v_{j,f}}x_j - v_{i,f}x_i^2, & \text{if } \theta \text{ is } v_{i,f} \end{cases} $$

Additionally, to stay consistent in our judgement of our baseline models, we focused on minimizing residuals under the squared loss function, $(y - \hat{y})^2$, which is shown below: 

$$\left(y - \left(w_0 + \sum_{i=1}^{n}{w_i x_i} + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j\rangle x_i x_j\right)\right)^2$$

Given the gradient vector and standard loss we utilize the following update formulae: 

$$
\begin{aligned}
w_0^{\text{new}} &= w_0 + 2\alpha(y - \hat{y})\\
w_i^{\text{new}} &= w_i + 2\alpha(y - \hat{y})\, x_i\\
v_{i,f}^{\text{new}} &= v_{i,f} + 2\alpha(y - \hat{y})\left(x_i\sum_{j=1}^{n}{v_{j,f}}x_j - v_{i,f}x_i^2\right)
\end{aligned}
$$

### Implementation
Factorization machines facilitate the use of high-dimensional feature vectors. To manage the large number of users and items in our dataset, a sparse matrix was created from the TensorFlow-compatible dataset to hold our feature vectors efficiently in memory. Feature vectors included information about the user, their interaction, rating, previous interactions, median time of day of interactions, item features and personal taste (maybe add later on once it all works). Model parameters $V$, $w_0$ and $w$ were initialized randomly with samples from a uniform distribution. Parameters were updated at different cadences because every feature vector contains multiple $w_i$ and $v_{i,f}$: $w_0$ was updated once per feature vector, while $w_i$ and $v_{i,f}$ were updated for every feature present within a feature vector.

In [None]:
import pandas as pd
import numpy as np
import matrix_modules

In [None]:
def y_hat(w_0, x, w, features_matrix):
    """ 
    Calculates the predicted score for the given parameters: w_0, x, w and the subset of V corresponding to the features present in the feature vector.

    Args:
        w_0 (float) : w_0 represents the global bias term added at the beginning of the y_hat calculation.
        x (dict) : The feature vector corresponding to a particular row index. x has two keys, indices and scores, which signify the indices in V and w that have scores and the corresponding scores.
        w (np.ndarray) : w represents the vector of feature weights and is a 1 x num_features dimensional row vector.
        features_matrix (np.ndarray) : features_matrix represents the subsetted matrix of V corresponding to the indices containing values in x.
    
    Returns:
        y_hat (float) : A predicted score for the feature vector.
    """

    # Get the feature indices and their corresponding scores from the feature vector x.
    feature_indices = x["indices"]
    scores = np.asarray(x["scores"], dtype=float)

    # Get the \sum_{i=1}^{n}{w_i x_i} term. w is a 1 x num_features row vector, so we index
    # its single row and sum the products down to a scalar.
    linear_term = np.sum(w[0, feature_indices] * scores)
    
    # Initialize a total for the summation of inner product of pairwise rows in V with scores.
    total = 0.0

    # Get the number of rows in the subset of the original features matrix for looping.
    rows, _ = features_matrix.shape

    # Loop through each pair of rows exactly once (row_1 < row_2), matching the j = i + 1 lower
    # bound in the model equation. This skips a row paired with itself and avoids double counting.
    for row_1 in range(rows):
        for row_2 in range(row_1 + 1, rows):
            # Add the full calculation to total.
            # Subsetting scores in this way works because the number of scored items directly corresponds to the number of rows in the subset of our feature matrix V.
            total += scores[row_1] * scores[row_2] * np.inner(features_matrix[row_1, :], features_matrix[row_2, :])
                
    # Return the linear combination of what we calculate to get the score prediction.
    return w_0 + linear_term + total

In [None]:
def update_w_0(w_0, err, alpha):
    """
    Update the global bias term using the error and the learning rate alpha.
    """
    return w_0 + 2 * alpha * err

In [None]:
def update_w_i(x_i_score, w_i, err, alpha):
    """
    Update an index in the weight vector w using the score in the feature vector, its weight in w, the error and the learning rate alpha.

    Args:
        x_i_score (float) : The score in the feature vector x corresponding to the feature with weight w_i.
        w_i (float) : The weight of the feature in w.
        err (float) : The error calculated by finding y - y_hat.
        alpha (float) : The learning rate that was chosen at model creation. 

    Returns:
        w_i (float) : The updated model weight at the ith index in w.
    """
    return w_i + 2*alpha*err*x_i_score


In [None]:
def update_v_ij(x, v_ij, subset, row_i, err, alpha):
    """
    Updates the ith vector in the feature matrix V.

    Args:
        x (np.ndarray) : The row vector containing scores from a feature vector. 
        v_ij (np.ndarray) : A row vector from the feature matrix V. 
        subset (np.ndarray) : The subset of V corresponding to the indices of interacted with features.
        row_i (int) : The row index in the subset that is being updated.
        err (float) : The error calculated by finding y - y_hat.
        alpha (float) : The learning rate that was chosen at model creation. 
    
    Returns:
        Updated v_ij, the updated row vector in V.
    """
    # Get the number of rows in the subset to determine number of looping iterations.
    rows, k = subset.shape

    # Initialize a total to keep track of the sum, with the same shape as a row of V.
    total = np.zeros(k)

    # Sum v_j * x_j over the rows of the subset other than row_i. This equals
    # sum_j(v_j * x_j) - v_i * x_i, so multiplying by x_i below gives the gradient
    # x_i * sum_j(v_j * x_j) - v_i * x_i^2 with the x_i^2 term subtracted exactly once.
    for j in range(rows):
        if j != row_i:
            total += subset[j, :] * x[j]

    return v_ij + 2 * alpha * err * x[row_i] * total


In [None]:
def factorization_machine(feature_vectors, k, num_features, alpha):
    """
    Takes in a set of feature vectors, the desired number of latent factors and the learning rate and then
    performs gradient descent in the vein of a factorization machine to train model parameters w_0, w and V.

    Args:
        feature_vectors (list of dict) : The sparse representation of feature vectors, one dictionary per row with keys 'indices', 'scores' and 'rating'.
        k (int) : The desired number of latent factors. 
        num_features (int) : The maximum index in the row vectors which gets used in the creation of V.
        alpha (float) : The learning rate for the gradient descent.
    
    Returns:
        w_0, w and V which after running will be trained model weights that can be used on new observations.
    """

    # Initialize w_0.
    w_0 = 1
    
    # Initialize w.
    w = np.random.uniform(0, 1, size=num_features).reshape((1, num_features))

    # Initialize V
    V = np.random.uniform(0, 1, size=num_features*k).reshape((num_features, k))

    # Iterate through all rows provided by the feature vectors argument.
    for row in range(len(feature_vectors)):

        # Get the indices of the features that are used, their scores and the observed rating.
        x = feature_vectors[row]
        indices = x["indices"]
        scores = x["scores"]
        rating = x["rating"]
        # Subset V for the rows corresponding to rated feature indices.
        V_subset = V[indices, :]

        # Calculate y_hat from the feature vector itself (not its row number).
        rating_estimate = y_hat(w_0, x, w, V_subset)
        error = rating - rating_estimate

        for index in range(len(scores)):
            # We first update the ith weight in w.
            w[:, indices[index]] = update_w_i(scores[index], w[:, indices[index]], error, alpha)

            # We then update the ith row of the feature matrix V.
            V[indices[index], :] = update_v_ij(scores, V[indices[index], :], V_subset, index, error, alpha)

        w_0 = update_w_0(w_0, error, alpha)
    
    return w_0, w, V


## Testing
Starting off with testing we evaluate the performance of stochastic gradient descent, alternating least squares, and factorization machines. Factorization machines were implemented with an existing framework ___ in order to save time on fine tuning linear algebra for compute efficiency.

# Modelling : Making Recommendations

The other major sub-system of any recommender system is the one that acts on predicted ratings to make recommendations. What's considered a 'good' recommendation differs greatly depending on the goals in place for the recommender system as a whole. For example, if there are business incentives to promote articles of a certain category, then 'good' recommendations are ones that a user will interact with and that further those business goals. To ensure that our evaluation was comprehensive, we created a basic system for making recommendations and then improved upon it.

## Making Recommendations : Success Metrics
Recommendations can be evaluated by several success metrics: novelty, coverage and serendipity, to name a few. Novelty is a measure of the newness of a recommendation. Coverage is a measure of how much of the catalogue is represented in recommendations. Serendipity measures both the newness of recommendations and how exciting they are. An example of a serendipitous recommendation would be a user who generally reads stories about sports in the United States being recommended a story about a lesser-known sport from a different country that they were unaware of. Improving and evaluating the serendipity of recommendations is difficult, therefore we chose the following ____ (research serendipity calculations)

### Making Recommendations : Implementation of metrics

* To understand the coverage of our recommendations, we counted the items recommended across all users and then determined the proportion of the overall catalogue that they make up.
* To calculate the novelty of our recommendations, we utilized a vector similarity of item features to determine how similar recommended items were to those that were previously rated in addition to the popularity of articles.
* To calculate serendipity, we ___  

We also need to consider how to handle edge cases, such as users that have rated every cluster.

### Making Recommendations : Basic Implementation
For matrix factorization methods like gradient descent and alternating least squares, the simplest form of retrieval is to take the $n$ highest predicted ratings in a user's row of $\hat{R}$.

The drawback of this method is that it is prone to recommending items that a user has already interacted with.

### Making Recommendations : Improvements
Improving the quality of recommendations from ratings created by matrix factorization methods like ALS and SGD is a varied process. Since there's no 'best' way to recommend items we focused on improving recommendations in a few contexts: improving recommendations for a specific business goal, making recommendations more personal and adding another step to make intra-cluster recommendations. 

As mentioned previously, one of our first goals for improving recommendations was to avoid recommending items that a user has already rated. We implemented this by using the hash maps made during matrix creation to remove viewed items from the set of candidates before sorting it for the highest ratings.
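A compact way to express this filtering step is sketched below. It is an illustrative alternative rather than our implementation: `top_n_unseen` and the toy inputs are hypothetical, and the sketch assumes every predicted rating is finite so that `-inf` can safely mark seen items.

```python
import numpy as np


def top_n_unseen(ratings_row, seen_items, n):
    """Return indices of the n highest predicted ratings, skipping seen items."""
    scores = np.asarray(ratings_row, dtype=float).copy()
    if seen_items:
        scores[list(seen_items)] = -np.inf  # mark seen items so they sort last
    order = np.argsort(scores)[::-1]        # item indices, highest score first
    unseen = order[np.isfinite(scores[order])]
    return unseen[:n].tolist()


# Hypothetical row of predicted ratings and set of already-rated item indices.
ratings_row = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
seen = {0, 3}
recommended = top_n_unseen(ratings_row, seen, n=2)
```

If a user has seen every item the function returns an empty list, so the caller decides how to fall back.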




### Making Recommendations : Utilizing Item Features

Current thinking leans towards a weighted-sum score for each item, combining feature similarity, dissimilarity and predicted ratings.

We further experimented with improving recommendations by using the given user and item features alongside the latent factors generated during matrix factorization. Features were placed in several KNN models, where a vector similarity could be applied to highly rated and seen items to bring features into the similarity calculation. User features were used to find similar users and draw recommendations from them.

In [3]:
# Import necessary libraries and load functions
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
import pandas as pd
import numpy as np
import matrix_modules

def calculate_coverage(items_recommended, items_present):
    """ 
    Calculates the coverage of recommendations made.

    Args:
        items_recommended (int) : The total number of items that are recommended to users.
        items_present (int) : The total number of items present in the catalog.

    Returns:
        coverage (float) : Returns the percent coverage of the catalog.
    """

    if items_recommended > items_present:
        return 1   
    else:
        return items_recommended / items_present

def calculate_novelty(recommendations, item_features, max_popularity):
    """
    Calculates the novelty of recommendations. Since novelty can be a measure of how popular items are 
    we calculate novelty by 1 - popularity(i) / max popularity
    """
    # Here we would just want to subset the recommendations made in the news features for their popularity and do calculations off of that
    # Iterate over all recommendations that are being made
    novelty_scores = []
    for recommendation in recommendations:

        # Recommendation is an index so we subset the item features for that index and get the popularity there 
        novelty_scores.append(1 - (item_features.loc[recommendation, 'popularity'] / max_popularity))

    return novelty_scores

def filer_recommendations(sorted_indices, viewed_items):
    """
    Filters the recommendations to exclude those that have already been rated by the user
    """
    # Create an empty list to populate with recommendable indices.
    recommendable_items = []

    # Sort through all indices that are recommendable.
    for index in sorted_indices:

        # If the index corresponds with one that has not already been viewed, add it to the list of recommendable items.
        if index not in viewed_items:
            recommendable_items.append(index)
    
    # If the recommendable items list is empty, return the sorted indices. 
    if not recommendable_items:
        recommendable_items = sorted_indices
        print("User has interacted with all possible items")        
    
    # If the recommendable items list is not empty, return it.
    return recommendable_items

In [4]:
# Set up the ratings matrix
print("Loading Dataset")
full_ratings, news, users = matrix_modules.load_dataset(full=True)

print("Getting Ratings Matrix")
R, item_idx, user_idx = matrix_modules.create_item_cluster_mat(full_ratings, news, num_users=len(users), isALS=True, num_clusters=len(news['cluster'].unique()))
item_idx = {num : sorted(list(users)) for num, users in item_idx.items()}
user_idx = {user_id : sorted(list(ratings)) for user_id, ratings in user_idx.items()}
seen = {user_id : set(ratings) for user_id, ratings in user_idx.items()}
K = 5
I = len(user_idx) 
M = len(item_idx)
U = np.random.uniform(0, 1, size=K*I).reshape((K, I))
V = np.random.uniform(0, 1, size=K*M).reshape((K, M))
print("Starting ALS")
U, V, track_error, track_update = matrix_modules.alternating_least_squares(U, V, R, user_idx, item_idx, max_iterations=10)
R_hat = U.T @ V

Loading Dataset


Getting Ratings Matrix
Starting ALS


Starting ALS iterations: 100%|██████████| 10/10 [01:07<00:00,  6.70s/it]


In [None]:
i = 1
ratings_row = R_hat[i, :]
sorted_indices = np.argsort(ratings_row)[::-1]


# indices = filer_recommendations(sorted_indices, seen[0])
# ratings_row[list(indices)]

In [None]:
# Load in the user features to use for vector similarity calculations
user_features = pd.read_csv("../MIND_large/csv/full_user_clusters.csv", index_col=0)
item_features = pd.read_csv("../MIND_large/csv/full_item_features.csv", index_col=0)

In [None]:
# Select the number of neighbors
n_neighbors = 5

# Initialize the Knn models
user_features_knn = NearestNeighbors(n_neighbors=n_neighbors, metric = 'euclidean')
user_factors_knn =  NearestNeighbors(n_neighbors=n_neighbors, metric = 'euclidean')

# Fit the KNN models with the vectors
user_features_knn.fit(user_features.drop(columns=['user_id', 'cluster']))
user_factors_knn.fit(U.T);

In [None]:
# Process item features
grouped_item_features = item_features.drop(columns=['news_id', 'reduced_embeddings_1', 'reduced_embeddings_2']).groupby("cluster").agg("sum")

In [None]:
# Initialize knn models for item features / factors
item_features_knn = NearestNeighbors(n_neighbors=n_neighbors, metric = 'euclidean')
item_factors_knn =  NearestNeighbors(n_neighbors=n_neighbors, metric = 'euclidean')

# Fit the models with the item feature data, dropping the popularity column so that it doesn't skew results
gif_wo_pop = grouped_item_features.drop(columns=['popularity'])
item_features_knn.fit(gif_wo_pop)
item_factors_knn.fit(V.T);

In [None]:
# Create the lambda function that gets indices
get_factor_indices = lambda vector, knn : knn.kneighbors([vector.T])  
get_feature_indices = lambda vector, knn : knn.kneighbors([vector])  

# _, item_factor_col_indices = get_factor_indices(V[:, 1], item_factors_knn)
# _, user_factor_col_indices = get_factor_indices(U[:, 1], user_factors_knn)
# _, item_feature_row_indices = get_feature_indices(user_features.iloc[1, 1:-1], item_features_knn) # something like this 
# _, user_feature_row_indices = get_feature_indices(item_features.iloc[1, 1:-1], user_features_knn) # by subsetting with an iloc [1, 1:-1] we avoid columns that arent included in the model

In [None]:
def calculate_ratings_weights(items, user_features_row, item_features):
    """ 
    Calculates ratings weights by taking the dot product of the user feature vector with each items feature vector.
    Or we could just consider taking the product of the users preference score with the category and then do 
    something with embeddings maybe?

    item_features (rows of item feature dataframe forming a subset for interacted with items)
    user_features_row (the row of user preferences)
    """
    # need to look at the feature table for the item indices and make sure that everything is lining up
    # need to also make sure that item indices are all fine for user clustering methods 
    # then also since we have these latent factors of k dimension we can even consider doing a UMAP reduction of the K factors to
    # fit the number of features and then utilize those for recommendations as well maybe? 
    weights = []
    for row_index in range(len(item_features)):
        # user_features_row['']
        # Basically we just get the items category
        item_row = item_features.iloc[row_index, :]
        item_row = item_row[item_row != 0] # need to test this
        category = item_row['category'].columns#? # need to subset where the value is 1

        # Then the weight becomes the users preference for that category
        user_features_row[category]
        # then in regards to opportunities for embeddings, we could get all embeddings that a user prefers and then do something with that in regards
        # to the embeddings of the article? 
        # And then a popularity weighting?
        # maybe we want to include some sort of inverse weighting penalty system where we subtract the vector similarity from the weight?
        # weights.append()

    

In [None]:
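def get_similar_user_ratings(indices, R_hat):
    """
    Takes in a list of indices corresponding to users with similar features and
    returns the predicted ratings row of R_hat for each of those users.
    """
    rows = []
    for index in indices:
        rows.append(R_hat[index, :])

    return rows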


In [None]:
# Now for similar users we iterate through their highest rated items and add them to the set of filtered recommendations
# Or maybe we need to implement a function that utilizes item features and the users preferences to get the highest quality
# recommendations when fed a set of indices




In [None]:
# now we consider how we would combine user features and item features / factors
# we could always just take their feature preferences w/o the timestamp and make sure that the order of elements is the same ?
# like we could take the vector of preferences u and the vector of item features and then take their dot product and that would 
# be considered a weight on the users predicted score for the item. 
# We also might want to dimension reduce or make it so the preferences are the same