# **Guidelines**

## Process Overview
The ranking step includes 2 sub-processes:
1. Scoring: In this step, you will train a scoring model to predict 3 types of score; probability of like, probability of dislike, and engagement time.
2. Ranking: In this step, you will create a ranking policy to rank the items based on the output from previous step.


## Your Task

### 1. Modify `DataCollectorExample()`
also rename it to `DataCollector{Team}()` juts like previous assignment. All of the functions in `DataCollectorExample()` need to be modified (They're the ones with `raise NotImplementedError("you need to implement this")` in `DataCollector()`). The explaination of each function includes in their own docstring with example code.

Some important points worth noting are:
1. for `.feature_generation_user()` and `.feature_generation_content()`, you DO NOT need to apply one-hot encoding or scaler since this process will be done by our pipeline using via `.postprocess_feature()`, which will also save the `Postprocessor()` class as a pickle file to be used for testing and inference step. You will need to output nsames of numberical features and categorical features along with the feature dataframe.
2. When engineering target variables with `.get_Ys()`, you'll need to output 3 columns of target variables (`'like'`, `'dislike'`, `'engage_time'`). Be careful of how you create each target variable. What happen if a user like an item then change their mind and dislike it? What if a user see the same content twice, what would be the engagement time?
3. Be creative of how you would create your ranking policy in `.rank()`. You have 3 scores that your ranking can be based on. Which score(s) would you optimise your ranking for? How would you trade-off one with others? Can you rank base on all 3 of them, if so, how would you combine them? You can also join the score dataframe with `self.generated_content_metadata` to use the original content features as part of ranking.

### 2. Train Model
Use the cell `Training: Create your own training` to train your model. Feel free to select any model you like. In the example, I use train 1 model for each target variable, resulting in 3 models. If you can find a model that can produce 3 output values (e.g. Neural Network), feel free to do so.

Some important points worth noting are:
1. Make sure you save the model to **ONLY 1 file**. Even if you make 3 models, put them into a list or dictionary and save the whole object to 1 file.
2. Depending on your modelling approach, you'll need your own way to load the model to make predictions. Thus, modify the `.load_model()` in `DataCollectorExample()` reflect this.
3. Once you train your model, you can use the example evaluation cells to evaluate your model. There's no need to change anything about the evaluation except the thresholds variable. These thresholds decide at what probability we'll consider the prediction as like and dislike. **NOTE** that this only test your scoring model, not your ranking policy. Testing for ranking policy is much more complicated since we're not only optimizing the number of likes and thus cannot be done offline.


# Test Your Model
Once you have the scoring model and ranking policy, you can use the `Inference Example` cell to run your pipeline and see the output of recommended items from your model.

# Submit your work

1. Put `postprocessor.pkl` and your model file into `Columbia-E4579/services/backend/src/recommendation_system/recommendation_flow/model_prediction` folder on your branch. Make sure to rename them as `{team}_postprocessor.pkl` and `{team}_model.pkl` (E.g. `alpha_postprocessor.pkl` and `alpha_model.pkl`). Both `postprocessor.pkl` and your model file will be saved in `sampled_data` in your COlab work space. Since this workspace is cleared everytime you restart the Colab, please also save the postprocessor and model files on your local machine if you want to keep them.

2. Download your Colab as `.ipynb` file, rename it as `{team}_ranking.ipynb` (e.g. `alpha_ranking.ipynb`), and also place it in `Columbia-E4579/services/backend/src/recommendation_system/recommendation_flow/model_prediction` folder on your branch.

You'll then merge your branch with these files into Professor's branch.


# Imports

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive
from tqdm import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# DataCollector - Do Not Modify

In [2]:
from sqlalchemy.sql.schema import ScalarElementColumnDefault
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, Normalizer
import numpy as np
from typing import Tuple, List, Optional
import pickle


class Postprocessor:

    def __init__(self,
                 numberical_features: List[str],
                 categorical_features: List[str]):

        self.numberical_features = numberical_features
        self.categorical_features = categorical_features

        self.scaler = StandardScaler()
        self.encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        self.encode_cols = []

    def fit(self, features_df: pd.DataFrame):

        self.scaler.fit(features_df[self.numberical_features])

        if len(self.categorical_features) > 0:
            self.encoder.fit(features_df[self.categorical_features])
            self.encode_cols = list(self.encoder.get_feature_names_out())

    def transform(self, features_df: pd.DataFrame) -> pd.DataFrame:

        features_df[self.numberical_features] = self.scaler.transform(features_df[self.numberical_features])

        if len(self.categorical_features) > 0:
            features_df[self.encode_cols] = self.encoder.transform(features_df[self.categorical_features])

        return features_df

    def fit_transform(self, features_df: pd.DataFrame) -> pd.DataFrame:

        self.fit(features_df)
        features_df = self.transform(features_df)

        return features_df


class DataCollector:

    def __init__(self,
                 engagement_path=None,
                 content_meta_path=None):


        self.engagement_path = engagement_path
        self.content_meta_path = content_meta_path

        self.objects_dir = 'sample_data'  #TODO change this
        self.numerical_features = []
        self.categorical_features = []

        self.postprocessor = None
        self.model = None

    def feature_generation_user(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
        """
        Returns
          pd.DataFrame: User feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3, ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4, ...]
        """
        raise NotImplementedError("you need to implement this")

    def feature_generation_content(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
        """
        Returns
          pd.DataFrame: Content feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3, ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4, ...]
        """
        raise NotImplementedError("you need to implement this")

    def get_Ys(self) -> pd.DataFrame:
        """Engineers taget variable.
        Args
            data (pd.DataFrame): Engagement data.
        Returns
            pd.DataFrame: Dataframe of 5 columns;
                'user_id', 'content_id', 'like', 'dislike', 'engage_time'
        """

        raise NotImplementedError("you need to implement this")

    def feature_generation(self, is_train=False) -> pd.DataFrame:
        """Generate features. If is_train, will generate features for user-content pairs
        exist in self.engagement_data. Else, will generate features for
        all possible user-content pairs.

        Args:
            is_train (bool): Whether in training mode.

        Returns:
            pd.DataFrame: Feature dataframe.

        """

        user_feature_df, user_num_feats, user_cat_feats = self.feature_generation_user()
        content_feature_df, content_num_feats, content_cat_feats = self.feature_generation_content()
        self.user_feature_df = user_feature_df
        self.content_feature_df = content_feature_df

        self.numerical_features = user_num_feats + content_num_feats
        self.categorical_features = user_cat_feats + content_cat_feats

        if is_train:
            interaction_pairs = self.engagement_data[
                ['user_id', 'content_id']].drop_duplicates()

        else:
            all_users = self.engagement_data['user_id'].drop_duplicates().tolist()
            all_contents = self.generated_content_metadata['content_id'].drop_duplicates().tolist()

            interaction_pairs = [(u, c) for u in all_users for c in all_contents]
            interaction_pairs = pd.DataFrame(interaction_pairs, columns=['user_id', 'content_id'])

        features_df = pd.merge(interaction_pairs,
                               user_feature_df, on='user_id', how='left')

        features_df = pd.merge(features_df,
                               content_feature_df, on='content_id', how='left')

        return features_df


    def get_engagement_data(self, user_id=None, content_ids=None):

        if self.engagement_path is None:
            #TODO: read from database
            pass
        else:
            df = pd.read_csv(self.engagement_path, sep="\t")

        if content_ids is not None:
            df = df[df['content_id'].isin(content_ids)]

        if user_id is not None:
            df = df[df['user_id'] == user_id]

        return df

    def get_generated_content_metadata(self, content_ids=None):

        if self.content_meta_path is None:
            #TODO: read from database
            pass
        else:
            df = pd.read_csv(self.content_meta_path, sep="\t")

        if content_ids is not None:
            df = df[df['content_id'].isin(content_ids)]

        return df

    def get_user_data(self, user_id=None):

        if self.engagement_path is None:
            #TODO: read from database
            pass
        else:
            df = pd.read_csv(self.engagement_path, sep="\t")

        if user_id is not None:
            df = df[df['user_id'] == user_id]

        return df

    def gather_data(self, user_id, content_ids):
        self.engagement_data = self.get_engagement_data(user_id, content_ids)
        self.generated_content_metadata = self.get_generated_content_metadata(content_ids)
        self.user_data = self.get_user_data(user_id)

        if len(self.engagement_data) == 0:
            raise Exception("either user_id or content_ids leads to empty engagement_data")

        if len(self.generated_content_metadata) == 0:
            raise Exception("content_ids leads to empty generated_content_metadata")

        if len(self.user_data) == 0:
            raise Exception("user_id leads to empty user_data")

    def postprocess_feature(self, features_df: pd.DataFrame, is_train=False) -> pd.DataFrame:
        """Applied postprocessings (one-hot encoding & scaler) to the feature dataframe.

        Args:
            features_df (pd.DataFrame): Input feature dataframe.
            is_train (bool): Whether in training mode. If True, will fit the
                Postprocessor() and save to a pickle file. Else, will load the
                saved Postprocessor() and use it.

        Returns:
            pd.DataFrame: Output feature dataframe.
        """

        if is_train:
            self.postprocessor = Postprocessor(self.numerical_features, self.categorical_features)
            features_df = self.postprocessor.fit_transform(features_df)
            self.save_postprocessor()

        else:
            self.postprocessor = self.load_postprocessor()
            features_df = self.postprocessor.transform(features_df)

        self.all_numeric_features = self.numerical_features + self.postprocessor.encode_cols


        return features_df


    def gen_model_input(self,
                        user_id: Optional[int] = None,
                        content_ids: Optional[list] = None,
                        is_train: bool = False) -> pd.DataFrame:
        """Generates input data (X) for model.

        Args:
            user_id (Optional[int]): User ID to generate features for.
                If None, will generate features for all available users in self.engagement_data.
            content_ids (Optional[list]): List of content ID to generate features for.
                If None, will generate features for all available contents in self.engagement_data.
            is_train (bool): Whether in training mode. If True, will generate
                features for user-content pairs exist in self.engagement_data.
                Else, will generate features for all possible user-content pairs.

        Returns:
            pd.DataFrame: Dataframe of features with 2-level index of ('user_id', 'content_id').
        """

        self.gather_data(user_id, content_ids)
        features_df = self.feature_generation(is_train)
        features_df = self.postprocess_feature(features_df, is_train)

        X = features_df.set_index(['user_id', 'content_id'])
        X = X[self.all_numeric_features]
        X = X.fillna(0)

        return X


    def gen_target_vars(self,
                        engagement_data: Optional[pd.DataFrame] = None
                        ) -> pd.DataFrame:
        """Wrapper to generate target variables.

        Args:
            engagement_data (Optional[pd.DataFrame]): Engagement data. If None,
                will use self.engagement_data which is loaded for training.
                For testing, parse in the engagement_data for testing.

        Returns:
            pd.DataFrame: Dataframe of 3 columns; 'like', 'dislike', 'engage_time'
                and 2-level index of ('user_id', 'content_id').
        """

        if engagement_data is None:
            engagement_data = self.engagement_data

        target_df = self.get_Ys(engagement_data)

        return target_df.set_index(['user_id', 'content_id'])


    def save_postprocessor(self):

        with open(f'{self.objects_dir}/postprocessor.pkl', 'wb') as f:
            pickle.dump(self.postprocessor, f)

    def load_postprocessor(self):

        with open(f'{self.objects_dir}/postprocessor.pkl', 'rb') as f:
            return pickle.load(f)

    def load_model(self):
        raise NotImplementedError("you need to implement this")

    def predict(self, X) -> Tuple[list, list, list]:
        raise NotImplementedError("you need to implement this")

    def rank(self, pred_score):
        raise NotImplementedError("you need to implement this")

    def score(self,
              user_id: Optional[int] = None,
              content_ids: Optional[list] = None) -> pd.DataFrame:
        """Predict the scores.

        Args:
            user_id (Optional[int]): User ID to generate features for.
                If None, will generate features for all available users in self.engagement_data.
            content_ids (Optional[list]): List of content ID to generate features for.
                If None, will generate features for all available contents in self.engagement_data.

        Returns:
            pd.DataFrame: Predicted score dataframe with 2-level index of (user_id, content_id).
                The dataframe also comes with the original content metadata which also
                can be used for ranking.
        """

        X = self.gen_model_input(user_id, content_ids, is_train=False)

        pred_like, pred_dislike, pred_engtime = self.predict(X)

        pred_df = pd.DataFrame(np.array([pred_like, pred_dislike, pred_engtime]).T,
                               index=X.index,
                               columns=['like', 'dislike', 'engage_time']).reset_index()

        pred_df = pd.merge(self.generated_content_metadata,
                           pred_df,
                           how='right',
                           on='content_id')

        return pred_df.set_index(['user_id', 'content_id'])

    def recommend(self, user_id, content_ids=None, top_k=20):

        score_df = self.score(user_id, content_ids).reset_index()

        rank = self.rank(score_df, user_id, content_ids)

        return rank[:top_k]




In [3]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import mean_squared_error


def evaluate(true_df: pd.DataFrame,
             pred_df: pd.DataFrame,
             thres_like: float = 0.5,
             thres_dislike: float = 0.5
             ) -> dict:

    """Compute evaluation metrics.

    Args:
        true_df (pd.DataFrame): Dataframe of true target variables.
        pred_df (pd.DataFrame): Dataframe of predicted target variables.
        thres_like (float): Probability threshold to consider a prediction as like.
        thres_dislike (float): Probability threshold to consider as a prediction dislike.

    Returns:
        dict: Dictionary of metrics.
    """

    true_df = true_df.reset_index()
    pred_df = pred_df[['like', 'dislike', 'engage_time']].reset_index()

    pred_df['like'] = (pred_df['like'] > thres_like).astype(int)
    pred_df['dislike'] = (pred_df['dislike'] > thres_dislike).astype(int)

    actual_user_content = true_df[['user_id', 'content_id']]
    pred_user_content = pred_df[['user_id', 'content_id']]

    common_user_content = pd.merge(actual_user_content,
                                   pred_user_content,
                                   how='inner',
                                   on=['user_id', 'content_id'])

    true_df = pd.merge(common_user_content,
                         true_df,
                         how='left',
                         on=['user_id', 'content_id'])


    pred_df = pd.merge(common_user_content,
                       pred_df,
                       how='left',
                       on=['user_id', 'content_id'])


    metrics = {}
    for col in ['like', 'dislike', 'engage_time']:
        metrics[col] = {}

        if col == 'engage_time':
            metrics[col]['rmse'] = np.sqrt(mean_squared_error(true_df[col], pred_df[col]))
        else:
            metrics[col]['precision'] = precision_score(true_df[col], pred_df[col])
            metrics[col]['recall'] = recall_score(true_df[col], pred_df[col])

    return metrics

# Your Implementation - Example Here, Must Modify

In [92]:
class DataCollectorDelta(DataCollector):

    def feature_generation_user(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
        """Generates user features. Keep all the categorical variables as is,
        since the one-hot encoding will be done by our own pipeline. Along with
        the feature dataframe, you'll need to output lists of numberical features
        and categorical features as well.

        Returns
          pd.DataFrame: User feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3, ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4, ...]
        """

        # Filtering like/dislike engagements
        like_data = self.user_data[engagement_data['engagement_type'] == 'Like']

        # Grouping by 'user_id' and 'content_id' and getting the latest engagement for each pair
        latest_like_data = like_data.sort_values('created_date').groupby(['user_id', 'content_id']).tail(1)

        # Getting total likes for each user
        like_engagements = latest_like_data[(latest_like_data['engagement_value']==1)].copy()
        like_feature_df = like_engagements.groupby('user_id')['engagement_value'].sum().reset_index()
        like_feature_df.rename(columns={'engagement_value': 'user_likes'}, inplace=True)
        # Fill NaN values with 0 (users with no "like" engagements)
        like_feature_df['user_likes'].fillna(0, inplace=True)


        # Getting total dislikes for each user
        dislike_engagements = latest_like_data[(latest_like_data['engagement_value']==-1)].copy()
        dislike_feature_df = dislike_engagements.groupby('user_id')['engagement_value'].sum().reset_index()
        dislike_feature_df.rename(columns={'engagement_value': 'user_dislikes'}, inplace=True)
        # Fill NaN values with 0 (users with no "dislike" engagements)
        dislike_feature_df['user_dislikes'].fillna(0, inplace=True)

        # Getting average engage time for each user
        time_engagements = self.user_data[self.user_data['engagement_type'] == 'MillisecondsEngagedWith'].copy()

        # consider each user's max engagement time with each content
        time_engagements = time_engagements.groupby(['user_id', 'content_id'])['engagement_value'].max().reset_index()
        engage_feature_df = time_engagements.groupby('user_id')['engagement_value'].mean().reset_index()
        engage_feature_df.rename(columns={'engagement_value': 'user_engagetime'}, inplace=True)
        # fill NaN values with avg_engage (users with no engagment time data)
        avg_engage = engage_feature_df['user_engagetime'].mean()
        engage_feature_df['user_engagetime'].fillna(avg_engage, inplace=True)

        feature_df = pd.merge(like_feature_df, dislike_feature_df , on='user_id', how='left')
        feature_df = pd.merge(feature_df, engage_feature_df , on='user_id', how='left')

        return feature_df, ['user_likes', 'user_dislikes', 'user_engagetime'], []


    def feature_generation_content(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
        """Generates content features. Keep all the categorical variables as is,
        since the one-hot encoding will be done by our own pipeline. Along with
        the feature dataframe, you'll need to output lists of numberical features
        and categorical features as well.

        Returns
          pd.DataFrame: User feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3, ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4, ...]
        """

        feature_df = self.generated_content_metadata.copy()

        # numerical feature 1: (average) guidance scale
        mean_engage = feature_df["guidance_scale"].mean()
        guide_df = feature_df.groupby('content_id')['guidance_scale'].mean().reset_index()
        guide_df = guide_df.rename(columns={'guidance_scale': 'content_guidance_scale'})
        feature_df = pd.merge(feature_df, guide_df, on='content_id', how='left')
        feature_df['content_guidance_scale'].fillna(mean_engage, inplace=True)

        # numerical feature 2: num inference steps
        mean_inf = feature_df["num_inference_steps"].mean()
        inf_df = feature_df.groupby('content_id')['num_inference_steps'].mean().reset_index()
        inf_df = inf_df.rename(columns={'num_inference_steps': 'content_inference_steps'})
        feature_df = pd.merge(feature_df, inf_df, on='content_id', how='left')
        feature_df['content_inference_steps'].fillna(mean_inf, inplace=True)


        # categorical feature 1: source
        feature_df['content_source'] = 'other'
        feature_df.loc[feature_df['source'] == 'human_prompts', 'content_source'] = 'human_prompts'
        feature_df.loc[feature_df['source'] == 'r/Showerthoughts', 'content_source'] = 'r/Showerthoughts'


        # categorical feature 2: artist style
        style_list = [
            'studio',
            'medieval',
            'anime',
            'kerry_james_marshall',
            'gta_v',
            'scifi',
            'van_gogh',
            'salvador_dali',
            'jean-michel_basquiat',
            'face_and_lighting'
        ]
        #style_list = ['movie', 'empty']
        feature_df['content_style'] = feature_df['artist_style']
        feature_df['content_style'].fillna("empty", inplace=True)
        feature_df.loc[feature_df['content_style'].str.startswith('movie:'), 'content_style'] = 'movie'
        feature_df.loc[~feature_df['content_style'].isin(style_list), 'content_style'] = 'other'


        return feature_df, ['content_inference_steps'], ['content_source', 'content_style']


    def get_Ys(self, engagement_data) -> pd.DataFrame:
        """Engineers taget variable that you are predicting.
        Args
            engagement_data (pd.DataFrame): Engagement data.
        Returns
            pd.DataFrame: Dataframe of 5 columns;
                'user_id', 'content_id', 'like', 'dislike', 'engage_time'
        """
        # Filtering Like-type engagements
        like_data = engagement_data[engagement_data['engagement_type'] == 'Like']

        # Grouping by 'user_id' and 'content_id' and getting the latest engagement for each pair
        latest_engagements = like_data.sort_values('created_date').groupby(['user_id', 'content_id']).tail(1)

        # Creating the target DataFrame with unique pairs of user_id and content_id
        target_df = engagement_data[['user_id', 'content_id']].drop_duplicates()

        # Merging latest engagements to update 'like' and 'dislike' columns
        target_df = pd.merge(target_df, latest_engagements[['user_id', 'content_id', 'engagement_value']],
                            on=['user_id', 'content_id'], how='left')

        # Updating 'like' and 'dislike' columns based on the latest engagement values
        target_df['like'] = (target_df['engagement_value'] == 1).astype(int)
        target_df['dislike'] = (target_df['engagement_value'] == -1).astype(int)

        # Filling NaN values with 0 for pairs without like/dislike
        target_df.fillna(0, inplace=True)


        # Set "engage_time" based on engagement_type and engagement_value
        # assign existing engagement time if doesn't have that data, assign zero
        engage_times = engagement_data[engagement_data['engagement_type'] == 'MillisecondsEngagedWith']

        #considering max engagetime for each user and content pair
        engage_times = engage_times.groupby(['user_id', 'content_id'])['engagement_value'].max().reset_index()

        engage_times.rename(columns={'engagement_value': 'engage_time'}, inplace=True)

        target_df = pd.merge(target_df, engage_times[['user_id', 'content_id', 'engage_time']],
                            on=['user_id', 'content_id'], how='left')


        # Filling NaN values with 0 for pairs without engage_time (doesn't exist but double check)
        target_df['engage_time'].fillna(0, inplace=True)

        # Select and rename the required columns
        target_df = target_df[['user_id', 'content_id', 'like', 'dislike', 'engage_time']].copy()

        return target_df


    def predict(self, X: pd.DataFrame) -> Tuple[list, list, list]:
        """Predicts the 3 target variables by using the model that you trained.
        Make sure you load the model properly.

        Args:
            X (pd.DataFrame): Feature dataframe with 2-level index of (user_id, content_id)

        Returns:
            (list, list, list): (predicted prbability of like,
                                 predicted probability of dislike,
                                 predicted engagement time)
        """

        model = self.load_model()

        pred_like = model['like'].predict(X).flatten()
        pred_dislike = model['dislike'].predict(X).flatten()
        pred_engtime = model['engage_time'].predict(X).flatten()

        return pred_like, pred_dislike, pred_engtime

    def rank(self,
             score_df: pd.DataFrame,
             user_id: int,
             content_ids: Optional[list] = None) -> list:

        """Ranks the items for a given user based on your own criteria.

        Args:
            score_df (pd.DataFrame): Predicted-score Dataframe of columns;
                'user_id', 'content_id', 'like', 'dislike', 'engage_time', and
                also columns for content metadata.
            user_id (int): User ID to rank the items for.
            content_ids (Optional[list]): List of content ids to be considered for ranking.
        """

        user_df = score_df[score_df['user_id'] == user_id]

        def select_artist_style(style):
            if pd.isna(style) or str(style).startswith('movie:'):
                return 'other'
            else:
                return style

        user_df['selected_artiststyle'] = user_df['artist_style'].apply(select_artist_style)

        # value function formula: like - dislike + int(engagetime/1000) (1 point for every 4 seconds)
        user_df['value'] = user_df['like'] - user_df['dislike'] + (user_df['engage_time'] / 4000).astype(int)
        user_df_sorted = user_df.sort_values(by='value', ascending=False)

        # additional ordering: if not have specific art styles, order so no same styles seen consequently
        sorted_content_ids = []
        last_artist_style = 1
        keep_styles = {'other', 'gta_v', 'medieval', 'detailed_portrait', 'van_gogh', 'unreal_engine', 'face_and_lighting', 'scifi', 'oil_on_canvas', 'anime', 'studio'}

        while not user_df_sorted.empty:
            selected_rows = user_df_sorted.loc[(user_df_sorted['selected_artiststyle'] != last_artist_style) | (user_df_sorted['selected_artiststyle'].isin(keep_styles))]
            if selected_rows.empty:
                break
            selected_row = selected_rows.iloc[0]
            sorted_content_ids.append(selected_row['content_id'])
            last_artist_style = selected_row['selected_artiststyle']
            user_df_sorted = user_df_sorted.drop(selected_row.name)

        return sorted_content_ids

    def load_model(self) -> object:
        """Loads your model. Since different ML frameworks requires different
        ways to load the model. Change this to reflect your choice of framework.

        Returns:
            object: Model object
        """



        with open('delta_model.pkl', 'rb') as file:
            loaded_models_dict = pickle.load(file)

        #for neural network
        #from tensorflow.keras.models import model_from_json
        #models = {model_name: model_from_json(model_data) for model_name, model_data in loaded_models_dict.items()}
        return loaded_models_dict

# Training

## Train Test Split

In [93]:
engagement_data = pd.read_csv('sample_data/engagement.csv', sep="\t")
content_meta = pd.read_csv('sample_data/generated_content_metadata.csv', sep="\t")

interactions = engagement_data[
    ['user_id', 'content_id']].drop_duplicates()

interactions_train, interactions_test = train_test_split(interactions, test_size=0.2, random_state=42)

engagement_train = pd.merge(interactions_train, engagement_data, how='left', on=['user_id', 'content_id'])
engagement_test = pd.merge(interactions_test, engagement_data, how='left', on=['user_id', 'content_id'])

engagement_train.to_csv('sample_data/engagement_train.csv', sep="\t")
engagement_test.to_csv('sample_data/engagement_test.csv', sep="\t")

In [95]:
#@title get training data
data_collector = DataCollectorDelta(
    engagement_path='sample_data/engagement_train.csv',
    content_meta_path='sample_data/generated_content_metadata.csv'
    )

X_train = data_collector.gen_model_input(is_train=True)
y_train = data_collector.gen_target_vars()

# ensure that each row of y_train corresponds to the correct user-content in X_train
y_train = y_train.reindex(index=X_train.index)

  like_data = self.user_data[engagement_data['engagement_type'] == 'Like']


## Training: Create your own training
Make sure you save the model somewhere so you can send the model file to the professor later.

Sample Experimented Neural Network

In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
import pandas as pd
import numpy as np
import pickle
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

tf.random.set_seed(1)


# Model for 'like'

# Oversample minority class to 60% of majority class
smote = SMOTE(random_state=42, sampling_strategy=0.66)
X_like, y_like = smote.fit_resample(X_train, y_train['like'])

# Creating a Sequential model
model_like = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),  # Adding dropout for regularization
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # Output layer with 1 neuron for regression
])
model_like.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_like.fit(X_like, y_like, epochs=20)


# Model for 'dislike'

# Oversample minority class to 60% of majority class
smote = SMOTE(random_state=42, sampling_strategy=0.66)
X_dislike, y_dislike = smote.fit_resample(X_train, y_train['dislike'])

# Creating a Sequential model
model_dislike = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.2),  # Adding dropout for regularization
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
model_dislike.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model_dislike.fit(X_dislike, y_dislike, epochs=20)


# Creating a Sequential model
model_engtime = Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.2),  # Adding dropout for regularization
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1)
])
model_engtime.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
model_engtime.fit(X_train, y_train['engage_time'], epochs=30, batch_size=32)


Main Model (Random Forest)

In [97]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
import pandas as pd
import numpy as np
import pickle

from imblearn.over_sampling import SMOTE

# Model for 'like'
smote = SMOTE(random_state=42, sampling_strategy=0.66)
X_like, y_like = smote.fit_resample(X_train, y_train['like'])

model_like = RandomForestClassifier(random_state=42)
model_like.fit(X_like, y_like)


# Model for 'dislike'
smote = SMOTE(random_state=42, sampling_strategy=0.66)
X_dislike, y_dislike = smote.fit_resample(X_train, y_train['dislike'])

model_dislike = RandomForestClassifier(random_state=42)
model_dislike.fit(X_dislike, y_dislike)


# Model for 'engage_time'
model_engtime = RandomForestRegressor(random_state=42)
model_engtime.fit(X_train, y_train['engage_time'])

Computing train accuracy, recall, and precision

In [80]:
# Save the models to a file
models_dict = {
    'like': model_like,
    'dislike': model_dislike,
    'engage_time': model_engtime
}

with open('delta_model.pkl', 'wb') as f:
    pickle.dump(models_dict, f)

In [65]:
# Save the models to a file (neural network)
# models_dict = {
#     'like': model_like,
#     'dislike': model_dislike,
#     'engage_time': model_engtime
# }
# for model_name, model in models_dict.items():
#     models_dict[model_name] = model.to_json()

# with open('delta_model.pkl', 'wb') as f:
#     pickle.dump(models_dict, f)

In [82]:
from tensorflow.keras.models import model_from_json

with open('delta_model.pkl', 'rb') as file:
    models = pickle.load(file)

In [None]:
#reading neural network
# from tensorflow.keras.models import model_from_json

# with open('delta_model.pkl', 'rb') as file:
#     loaded_models_dict = pickle.load(file)

# models = {model_name: model_from_json(model_data) for model_name, model_data in loaded_models_dict.items()}

In [83]:
models

{'like': RandomForestClassifier(random_state=42),
 'dislike': RandomForestClassifier(random_state=42),
 'engage_time': RandomForestRegressor(random_state=42)}

In [84]:
preds = models['like'].predict(X_train).flatten()

In [85]:
like_preds = np.where(preds > 0.55, 1, 0)
like_accuracy = accuracy_score(y_train['like'], like_preds)
like_accuracy

0.7661705449611717

In [86]:
precision_score(y_train['like'], like_preds)

0.5718637156402762

In [87]:
recall_score(y_train['like'], like_preds)

0.5798689814930131

In [88]:
preds = models['dislike'].predict(X_train).flatten()

In [89]:
dislike_preds = np.where(preds > 0.6, 1, 0)
dislike_accuracy = accuracy_score(y_train['dislike'], dislike_preds)
dislike_accuracy

0.7752589128048489

In [90]:
precision_score(y_train['dislike'], dislike_preds)

0.5027167789268098

In [91]:
recall_score(y_train['dislike'], dislike_preds)

0.6050934297769741

# Evaluation

Achieved the following train accuracy: like 76%, dislike 78%


In [98]:
data_collector.load_model()

{'like': RandomForestClassifier(random_state=42),
 'dislike': RandomForestClassifier(random_state=42),
 'engage_time': RandomForestRegressor(random_state=42)}

In [99]:
train_preds = data_collector.predict(X_train)

In [100]:
train_preds

(array([0, 0, 0, ..., 0, 0, 0]),
 array([1, 1, 0, ..., 0, 0, 0]),
 array([  6063.02727906,  75468.25916918, 102384.92249932, ...,
          1299.90410653,   2160.87754436,   3073.85165864]))

In [101]:
dislike_preds = np.where(train_preds[1] > 0.5, 1, 0)
dislike_accuracy = accuracy_score(y_train['dislike'], dislike_preds)
dislike_accuracy

0.7752589128048489

In [102]:
precision_score(y_train['dislike'], dislike_preds)

0.5027167789268098

In [103]:
recall_score(y_train['dislike'], dislike_preds)

0.6050934297769741

In [104]:
like_preds = np.where(train_preds[0] > 0.5, 1, 0)
like_accuracy = accuracy_score(y_train['like'], like_preds)
like_accuracy

0.7661705449611717

In [106]:
recall_score(y_train['like'], like_preds)

0.5798689814930131

In [107]:
precision_score(y_train['like'], like_preds)

0.5718637156402762

In [108]:
# Simulates contents filtered from previous stage.
# Feel free to change this to reflect your previous stage.

sample_contents = content_meta['content_id'].sample(frac=0.01)

In [109]:
# Get true target variables
y_true = data_collector.gen_target_vars(engagement_test)

# Make predictions
y_pred = data_collector.score(content_ids = sample_contents)

  like_data = self.user_data[engagement_data['engagement_type'] == 'Like']


In [110]:
thres_like = 0.5
thres_dislike = 0.5
evaluate(y_true, y_pred, thres_like, thres_dislike)

{'like': {'precision': 0.6024096385542169, 'recall': 0.5434782608695652},
 'dislike': {'precision': 0.45161290322580644, 'recall': 0.5454545454545454},
 'engage_time': {'rmse': 1604929.7365899044}}

In [101]:
#experiment with neural network
thres_like = 0.5
thres_dislike = 0.5
evaluate(y_true, y_pred, thres_like, thres_dislike)

{'like': {'precision': 0.5520833333333334, 'recall': 0.5145631067961165},
 'dislike': {'precision': 0.4457831325301205, 'recall': 0.46835443037974683},
 'engage_time': {'rmse': 229312.72951327276}}

# Inference Example

In [None]:
sample_contents = content_meta['content_id'].sample(frac=0.01)  # simulated contents filtered from previous stage


data_collector = DataCollectorDelta(
    engagement_path='sample_data/engagement_train.csv',  # will be None in real production
    content_meta_path='sample_data/generated_content_metadata.csv'  # will be None in real production
    )

recs = data_collector.recommend(user_id=8, content_ids=sample_contents, top_k=20)

In [None]:
recs

[134914,
 114789,
 93707,
 132663,
 90348,
 116224,
 68745,
 57675,
 133872,
 132480,
 40220,
 80477,
 91759,
 86365,
 101628,
 114154,
 86020,
 30778,
 35987,
 130466]