# **Guidelines**

## Process Overview
The ranking step consists of 2 sub-processes:
1. Scoring: In this step, you will train a scoring model to predict 3 scores: probability of like, probability of dislike, and engagement time.
2. Ranking: In this step, you will create a ranking policy that ranks the items based on the output of the scoring step.


## Your Task

### 1. Modify `DataCollectorExample()`
Also rename it to `DataCollector{Team}()`, just like in the previous assignment. All of the functions in `DataCollectorExample()` need to be modified (they're the ones with `raise NotImplementedError("you need to implement this")` in `DataCollector()`). Each function is explained in its own docstring, with example code.

Some important points worth noting are:
1. For `.feature_generation_user()` and `.feature_generation_content()`, you DO NOT need to apply one-hot encoding or scaling, since our pipeline does this via `.postprocess_feature()`, which also saves the `Postprocessor()` object as a pickle file to be used in the testing and inference steps. You will need to output the names of the numerical features and categorical features along with the feature dataframe.
2. When engineering target variables with `.get_Ys()`, you'll need to output 3 target columns (`'like'`, `'dislike'`, `'engage_time'`). Be careful about how you create each target variable. What happens if a user likes an item, then changes their mind and dislikes it? If a user sees the same content twice, what should the engagement time be?
3. Be creative about how you create your ranking policy in `.rank()`. You have 3 scores that your ranking can be based on. Which score(s) would you optimise your ranking for? How would you trade one off against the others? Can you rank based on all 3 of them and, if so, how would you combine them? You can also join the score dataframe with `self.generated_content_metadata` to use the original content features as part of ranking.
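
One way to answer the questions in point 2 is to let the most recent reaction win and to sum viewing time across repeated views. The sketch below is only an illustration: it assumes the `engagement_type` / `engagement_value` layout used by the example `get_Ys()` and, hypothetically, a `created_date` timestamp column. That column name is an assumption, so check it against your own data.

```python
import pandas as pd

# Toy engagement log. `created_date` is a hypothetical timestamp column.
log = pd.DataFrame({
    "user_id":          [1, 1, 1, 1, 2],
    "content_id":       [10, 10, 10, 10, 10],
    "engagement_type":  ["Like", "Like", "MillisecondsEngagedWith",
                         "MillisecondsEngagedWith", "MillisecondsEngagedWith"],
    "engagement_value": [1, -1, 1200, 800, 500],
    "created_date":     pd.to_datetime(["2023-11-01 10:00", "2023-11-01 10:05",
                                        "2023-11-01 09:59", "2023-11-02 08:00",
                                        "2023-11-01 12:00"]),
})

keys = ["user_id", "content_id"]

# Latest reaction wins: user 1 liked and then disliked item 10, so it counts as a dislike.
reactions = (log[log["engagement_type"] == "Like"]
             .sort_values("created_date")
             .groupby(keys)["engagement_value"]
             .last()
             .rename("reaction"))

# Repeated views: total time across all views (max or mean are equally defensible).
engage_time = (log[log["engagement_type"] == "MillisecondsEngagedWith"]
               .groupby(keys)["engagement_value"]
               .sum()
               .rename("engage_time"))

targets = (log[keys].drop_duplicates()
           .merge(engage_time.reset_index(), on=keys, how="left")
           .merge(reactions.reset_index(), on=keys, how="left")
           .fillna({"engage_time": 0, "reaction": 0}))
targets["like"] = (targets["reaction"] > 0).astype(int)
targets["dislike"] = (targets["reaction"] < 0).astype(int)
targets = targets.drop(columns="reaction")
print(targets)
```

Whether "latest wins" or "any like counts" is the better rule depends on what you want the model to learn; the point is to make the choice explicit rather than leave it to an aggregation default.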

### 2. Train Model
Use the cell `Training: Create your own training` to train your model. Feel free to select any model you like. In the example, I train 1 model for each target variable, resulting in 3 models. If you can find a model that produces 3 output values (e.g. a neural network), feel free to use it.

Some important points worth noting are:
1. Make sure you save the model to **ONLY 1 file**. Even if you make 3 models, put them into a list or dictionary and save the whole object to 1 file.
2. Depending on your modelling approach, you'll need your own way to load the model to make predictions. Thus, modify `.load_model()` in `DataCollectorExample()` to reflect this.
3. Once you have trained your model, you can use the example evaluation cells to evaluate it. There's no need to change anything about the evaluation except the threshold variables. These thresholds decide at what probability we consider a prediction a like or a dislike. **NOTE** that this only tests your scoring model, not your ranking policy. Testing the ranking policy is much more complicated, since we're not only optimizing the number of likes, and thus it cannot be done offline.
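
The like/dislike thresholds trade precision against recall, so it can help to sweep a few values before fixing them. A minimal sketch on made-up labels and probabilities follows; the arrays and the candidate thresholds are illustrative only and are not taken from the course data.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and predicted probabilities of 'like'.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_like = np.array([0.9, 0.6, 0.55, 0.4, 0.3, 0.2, 0.8, 0.45])

results = {}
for thres in (0.3, 0.5, 0.7):
    # Same rule as evaluate(): a prediction is positive when strictly above the threshold.
    y_pred = (p_like > thres).astype(int)
    results[thres] = {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }
    print(thres, results[thres])
```

Raising the threshold usually buys precision at the cost of recall; which side matters more is a product decision that also feeds into your ranking policy.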


# Test Your Model
Once you have the scoring model and ranking policy, you can use the `Inference Example` cell to run your pipeline and see the output of recommended items from your model.

# Submit your work

1. Put `postprocessor.pkl` and your model file into the `Columbia-E4579/services/backend/src/recommendation_system/recommendation_flow/model_prediction` folder on your branch. Make sure to rename them as `{team}_postprocessor.pkl` and `{team}_model.pkl` (e.g. `alpha_postprocessor.pkl` and `alpha_model.pkl`). Both `postprocessor.pkl` and your model file will be saved in `sample_data` in your Colab workspace. Since this workspace is cleared every time you restart Colab, please also save the postprocessor and model files on your local machine if you want to keep them.

2. Download your Colab as `.ipynb` file, rename it as `{team}_ranking.ipynb` (e.g. `alpha_ranking.ipynb`), and also place it in `Columbia-E4579/services/backend/src/recommendation_system/recommendation_flow/model_prediction` folder on your branch.

You'll then merge your branch with these files into the professor's branch.


# Imports

In [None]:
from google.colab import drive
import os

def mount_drive():
  # mounting ggdrive regardless of who's running the code
  drive.mount('/content/drive', force_remount=True)

  # custom data paths - each person should add their own path to their documents folder if they want to use ggcolab :D
  paths = {'luqman': '/content/drive/MyDrive/recomsys3',
           'max': '/content/drive/MyDrive/1_Fall 2023/2_MRS/final assignment',
           'ivan': '',
           'ritesh': '',
           'jack': '',
          }

  # setting os chdir
  runner = input("Who's running the notebook? ")
  os.chdir(paths[runner])
  print("Drive mounted for %s!" %runner)

mount_drive()

Mounted at /content/drive
Who's running the notebook? max
Drive mounted for max!


In [None]:
import pandas as pd
import numpy as np
from google.colab import drive
from tqdm import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# DataCollector - Do Not Modify

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler, Normalizer
import numpy as np
from typing import Tuple, List, Optional
import pickle
import math
from sklearn.preprocessing import MinMaxScaler

class Postprocessor:

    def __init__(self,
                 numberical_features: List[str],
                 categorical_features: List[str]):

        self.numberical_features = numberical_features
        self.categorical_features = categorical_features

        self.scaler = StandardScaler()
        self.encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        self.encode_cols = []

    def fit(self, features_df: pd.DataFrame):

        self.scaler.fit(features_df[self.numberical_features])

        if len(self.categorical_features) > 0:
            self.encoder.fit(features_df[self.categorical_features])
            self.encode_cols = list(self.encoder.get_feature_names_out())

    def transform(self, features_df: pd.DataFrame) -> pd.DataFrame:

        features_df[self.numberical_features] = self.scaler.transform(features_df[self.numberical_features])

        if len(self.categorical_features) > 0:
            features_df[self.encode_cols] = self.encoder.transform(features_df[self.categorical_features])

        return features_df

    def fit_transform(self, features_df: pd.DataFrame) -> pd.DataFrame:

        self.fit(features_df)
        features_df = self.transform(features_df)

        return features_df


class DataCollector:

    def __init__(self,
                 engagement_path=None,
                 content_meta_path=None):


        self.engagement_path = engagement_path
        self.content_meta_path = content_meta_path

        self.objects_dir = 'sample_data'  #TODO change this
        self.numerical_features = []
        self.categorical_features = []

        self.postprocessor = None
        self.model = None

    def feature_generation_user(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
        """
        Returns
          pd.DataFrame: User feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3', ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4', ...]
        """
        raise NotImplementedError("you need to implement this")

    def feature_generation_content(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
        """
        Returns
          pd.DataFrame: Content feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3', ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4', ...]
        """
        raise NotImplementedError("you need to implement this")

    def get_Ys(self, engagement_data: pd.DataFrame) -> pd.DataFrame:
        """Engineers target variables.
        Args
            engagement_data (pd.DataFrame): Engagement data.
        Returns
            pd.DataFrame: Dataframe of 5 columns;
                'user_id', 'content_id', 'like', 'dislike', 'engage_time'
        """

        raise NotImplementedError("you need to implement this")

    def feature_generation(self, is_train=False) -> pd.DataFrame:
        """Generate features. If is_train, will generate features for user-content pairs
        exist in self.engagement_data. Else, will generate features for
        all possible user-content pairs.

        Args:
            is_train (bool): Whether in training mode.

        Returns:
            pd.DataFrame: Feature dataframe.

        """

        user_feature_df, user_num_feats, user_cat_feats = self.feature_generation_user()
        content_feature_df, content_num_feats, content_cat_feats = self.feature_generation_content()
        self.user_feature_df = user_feature_df
        self.content_feature_df = content_feature_df

        self.numerical_features = user_num_feats + content_num_feats
        self.categorical_features = user_cat_feats + content_cat_feats

        if is_train:
            interaction_pairs = self.engagement_data[
                ['user_id', 'content_id']].drop_duplicates()

        else:
            all_users = self.engagement_data['user_id'].drop_duplicates().tolist()
            all_contents = self.generated_content_metadata['content_id'].drop_duplicates().tolist()

            interaction_pairs = [(u, c) for u in all_users for c in all_contents]
            interaction_pairs = pd.DataFrame(interaction_pairs, columns=['user_id', 'content_id'])

        features_df = pd.merge(interaction_pairs,
                               user_feature_df, on='user_id', how='left')

        features_df = pd.merge(features_df,
                               content_feature_df, on='content_id', how='left')

        return features_df


    def get_engagement_data(self, user_id=None, content_ids=None):

        if self.engagement_path is None:
            #TODO: read from database
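            raise NotImplementedError("reading engagement data from the database is not implemented; pass engagement_path")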
        else:
            df = pd.read_csv(self.engagement_path, sep="\t")

        if content_ids is not None:
            df = df[df['content_id'].isin(content_ids)]

        if user_id is not None:
            df = df[df['user_id'] == user_id]

        return df

    def get_generated_content_metadata(self, content_ids=None):

        if self.content_meta_path is None:
            #TODO: read from database
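            raise NotImplementedError("reading content metadata from the database is not implemented; pass content_meta_path")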
        else:
            df = pd.read_csv(self.content_meta_path, sep="\t")

        if content_ids is not None:
            df = df[df['content_id'].isin(content_ids)]

        return df

    def get_user_data(self, user_id=None):

        if self.engagement_path is None:
            #TODO: read from database
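            raise NotImplementedError("reading user data from the database is not implemented; pass engagement_path")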
        else:
            df = pd.read_csv(self.engagement_path, sep="\t")

        if user_id is not None:
            df = df[df['user_id'] == user_id]

        return df

    def gather_data(self, user_id, content_ids):
        self.engagement_data = self.get_engagement_data(user_id, content_ids)
        self.generated_content_metadata = self.get_generated_content_metadata(content_ids)
        self.user_data = self.get_user_data(user_id)

        if len(self.engagement_data) == 0:
            raise Exception("either user_id or content_ids leads to empty engagement_data")

        if len(self.generated_content_metadata) == 0:
            raise Exception("content_ids leads to empty generated_content_metadata")

        if len(self.user_data) == 0:
            raise Exception("user_id leads to empty user_data")

    def postprocess_feature(self, features_df: pd.DataFrame, is_train=False) -> pd.DataFrame:
        """Applies postprocessing (one-hot encoding & scaling) to the feature dataframe.

        Args:
            features_df (pd.DataFrame): Input feature dataframe.
            is_train (bool): Whether in training mode. If True, will fit the
                Postprocessor() and save to a pickle file. Else, will load the
                saved Postprocessor() and use it.

        Returns:
            pd.DataFrame: Output feature dataframe.
        """

        if is_train:
            self.postprocessor = Postprocessor(self.numerical_features, self.categorical_features)
            features_df = self.postprocessor.fit_transform(features_df)
            self.save_postprocessor()

        else:
            self.postprocessor = self.load_postprocessor()
            features_df = self.postprocessor.transform(features_df)

        self.all_numeric_features = self.numerical_features + self.postprocessor.encode_cols


        return features_df


    def gen_model_input(self,
                        user_id: Optional[int] = None,
                        content_ids: Optional[list] = None,
                        is_train: bool = False) -> pd.DataFrame:
        """Generates input data (X) for model.

        Args:
            user_id (Optional[int]): User ID to generate features for.
                If None, will generate features for all available users in self.engagement_data.
            content_ids (Optional[list]): List of content ID to generate features for.
                If None, will generate features for all available contents in self.engagement_data.
            is_train (bool): Whether in training mode. If True, will generate
                features for user-content pairs exist in self.engagement_data.
                Else, will generate features for all possible user-content pairs.

        Returns:
            pd.DataFrame: Dataframe of features with 2-level index of ('user_id', 'content_id').
        """

        self.gather_data(user_id, content_ids)
        features_df = self.feature_generation(is_train)
        features_df = self.postprocess_feature(features_df, is_train)

        X = features_df.set_index(['user_id', 'content_id'])
        X = X[self.all_numeric_features]
        X = X.fillna(0)

        return X


    def gen_target_vars(self,
                        engagement_data: Optional[pd.DataFrame] = None
                        ) -> pd.DataFrame:
        """Wrapper to generate target variables.

        Args:
            engagement_data (Optional[pd.DataFrame]): Engagement data. If None,
                will use self.engagement_data which is loaded for training.
                For testing, pass in the engagement_data for testing.

        Returns:
            pd.DataFrame: Dataframe of 3 columns; 'like', 'dislike', 'engage_time'
                and 2-level index of ('user_id', 'content_id').
        """

        if engagement_data is None:
            engagement_data = self.engagement_data

        target_df = self.get_Ys(engagement_data)

        return target_df.set_index(['user_id', 'content_id'])


    def save_postprocessor(self):

        with open(f'{self.objects_dir}/postprocessor.pkl', 'wb') as f:
            pickle.dump(self.postprocessor, f)

    def load_postprocessor(self):

        with open(f'{self.objects_dir}/postprocessor.pkl', 'rb') as f:
            return pickle.load(f)

    def load_model(self):
        raise NotImplementedError("you need to implement this")

    def predict(self, X) -> Tuple[list, list, list]:
        raise NotImplementedError("you need to implement this")

    def rank(self, score_df, user_id, content_ids=None):
        raise NotImplementedError("you need to implement this")

    def score(self,
              user_id: Optional[int] = None,
              content_ids: Optional[list] = None) -> pd.DataFrame:
        """Predict the scores.

        Args:
            user_id (Optional[int]): User ID to generate features for.
                If None, will generate features for all available users in self.engagement_data.
            content_ids (Optional[list]): List of content ID to generate features for.
                If None, will generate features for all available contents in self.engagement_data.

        Returns:
            pd.DataFrame: Predicted score dataframe with 2-level index of (user_id, content_id).
                The dataframe also comes with the original content metadata which also
                can be used for ranking.
        """

        X = self.gen_model_input(user_id, content_ids, is_train=False)

        pred_like, pred_dislike, pred_engtime = self.predict(X)

        pred_df = pd.DataFrame(np.array([pred_like, pred_dislike, pred_engtime]).T,
                               index=X.index,
                               columns=['like', 'dislike', 'engage_time']).reset_index()

        pred_df = pd.merge(self.generated_content_metadata,
                           pred_df,
                           how='right',
                           on='content_id')

        return pred_df.set_index(['user_id', 'content_id'])

    def recommend(self, user_id, content_ids=None, top_k=20):

        score_df = self.score(user_id, content_ids).reset_index()

        rank = self.rank(score_df, user_id, content_ids)

        return rank[:top_k]




In [None]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import mean_squared_error


def evaluate(true_df: pd.DataFrame,
             pred_df: pd.DataFrame,
             thres_like: float = 0.5,
             thres_dislike: float = 0.5
             ) -> dict:

    """Compute evaluation metrics.

    Args:
        true_df (pd.DataFrame): Dataframe of true target variables.
        pred_df (pd.DataFrame): Dataframe of predicted target variables.
        thres_like (float): Probability threshold to consider a prediction as like.
        thres_dislike (float): Probability threshold to consider a prediction as dislike.

    Returns:
        dict: Dictionary of metrics.
    """

    true_df = true_df.reset_index()
    pred_df = pred_df[['like', 'dislike', 'engage_time']].reset_index()

    pred_df['like'] = (pred_df['like'] > thres_like).astype(int)
    pred_df['dislike'] = (pred_df['dislike'] > thres_dislike).astype(int)

    actual_user_content = true_df[['user_id', 'content_id']]
    pred_user_content = pred_df[['user_id', 'content_id']]

    common_user_content = pd.merge(actual_user_content,
                                   pred_user_content,
                                   how='inner',
                                   on=['user_id', 'content_id'])

    true_df = pd.merge(common_user_content,
                         true_df,
                         how='left',
                         on=['user_id', 'content_id'])


    pred_df = pd.merge(common_user_content,
                       pred_df,
                       how='left',
                       on=['user_id', 'content_id'])


    metrics = {}
    for col in ['like', 'dislike', 'engage_time']:
        metrics[col] = {}

        if col == 'engage_time':
            metrics[col]['rmse'] = np.sqrt(mean_squared_error(true_df[col], pred_df[col]))
        else:
            metrics[col]['precision'] = precision_score(true_df[col], pred_df[col])
            metrics[col]['recall'] = recall_score(true_df[col], pred_df[col])

    return metrics

# Your Implementation - Example Here, Must Modify

In [None]:
class DataCollectorAlpha(DataCollector):

    def feature_generation_user(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
       ## TODO: Fix NaN Values
        """Generates user features. Keep all the categorical variables as is,
        since the one-hot encoding will be done by our own pipeline. Along with
        the feature dataframe, you'll need to output lists of numerical features
        and categorical features as well.

        Returns
          pd.DataFrame: User feature dataframe
          List[str]: List of numerical features. E.g. ['feat_1', 'feat_3', ...]
          List[str]: List of categorical features. E.g. ['feat_2', 'feat_4', ...]
        """
        target_df = self.get_Ys(self.user_data)

        user_df = self.user_data
        user_avg_engagement_time = target_df[['user_id', 'engage_time']].groupby(['user_id']).mean().rename(columns={'engage_time':'user_avg_engagement_time'})
        user_like_rate = target_df[['user_id', 'like']].groupby(['user_id']).mean().rename(columns={'like':'user_like_rate'})
        user_dislike_rate = target_df[['user_id', 'dislike']].groupby(['user_id']).mean().rename(columns={'dislike':'user_dislike_rate'})
        frequency = user_df.groupby(['user_id']).size().reset_index(name='frequency')
        content_diversity = user_df.groupby('user_id')['content_id'].nunique().reset_index(name='content_diversity')
        duration_variability = user_df[user_df['engagement_type'] == 'MillisecondsEngagedWith'].groupby('user_id')['engagement_value'].std().reset_index(name='duration_variability').fillna(0)


        feature_df = self.user_data[['user_id']].drop_duplicates().copy()
        feature_df = feature_df.merge(user_avg_engagement_time,on='user_id')
        feature_df = feature_df.merge(user_like_rate ,on='user_id')
        feature_df = feature_df.merge(user_dislike_rate,on='user_id')
        feature_df = feature_df.merge(frequency,on='user_id')
        feature_df = feature_df.merge(content_diversity ,on='user_id')
        feature_df = feature_df.merge(duration_variability,on='user_id')

        return feature_df, ['user_avg_engagement_time','user_like_rate','user_dislike_rate', 'frequency', 'content_diversity', 'duration_variability'], []

    def feature_generation_content(self) -> Tuple[pd.DataFrame, List[str], List[str]]:
          #TODO add more features
          """Generates content features. Keep all the categorical variables as is,
          since the one-hot encoding will be done by our own pipeline. Along with
          the feature dataframe, you'll need to output lists of numerical features
          and categorical features as well.

          Returns
            pd.DataFrame: Content feature dataframe
            List[str]: List of numerical features. E.g. ['feat_1', 'feat_3', ...]
            List[str]: List of categorical features. E.g. ['feat_2', 'feat_4', ...]
          """

          content_df = self.generated_content_metadata
          target_df = self.get_Ys(self.user_data)
          content_avg_engagement_time = target_df[['content_id', 'engage_time']].groupby(['content_id']).mean().rename(columns={'engage_time':'content_avg_engagement_time'})
          content_like_rate = target_df[['content_id', 'like']].groupby(['content_id']).mean().rename(columns={'like':'content_like_rate'})
          content_dislike_rate = target_df[['content_id', 'dislike']].groupby(['content_id']).mean().rename(columns={'dislike':'content_dislike_rate'})

          # guidance_scale = content_df.groupby('content_id')['guidance_scale'].apply(lambda x: x.mode()[0])
          # num_inference_steps = content_df.groupby('content_id')['num_inference_steps'].apply(lambda x: x.mode()[0])
          source_list = ['human_prompts', 'r/Showerthoughts', 'r/EarthPorn', 'r/scifi', 'r/pics',
                         'r/Damnthatsinteresting', 'r/MadeMeSmile', 'r/educationalgifs', 'r/SimplePrompts']
          style_list = ['van_gogh', 'jean-michel_basquiat', 'detailed_portrait',
                        'kerry_james_marshall', 'medieval', 'studio', 'edward_hopper',
                        'takashi_murakami', 'anime', 'leonardo_da_vinci',
                        'laura_wheeler_waring', 'ma_jir_bo', 'jackson_pollock',
                        'shepard_fairey', 'unreal_engine', 'face_and_lighting', 'keith_haring',
                        'marta_minujín', 'franck_slama', 'oil_on_canvas', 'scifi', 'gta_v',
                        'louise bourgeois', 'salvador_dali', 'ibrahim_el_salahi', 'juan_gris']
          def map_source(row):
            # Keep the known sources; bucket everything else as 'other'.
            if row['source'] in source_list:
                return row['source']
            else:
                return 'other'
          def map_style(row):
            # Keep the known styles; bucket everything else as 'other'.
            if row['artist_style'] in style_list:
                return row['artist_style']
            else:
                return 'other'

          feature_df = self.generated_content_metadata[['content_id','guidance_scale','num_inference_steps','source','artist_style']].drop_duplicates().copy()
          feature_df['source'] = feature_df.apply(map_source,axis=1)
          feature_df['artist_style'] = feature_df.apply(map_style,axis=1)
          feature_df = feature_df.merge(content_avg_engagement_time,on='content_id')
          feature_df = feature_df.merge(content_like_rate ,on='content_id')
          feature_df = feature_df.merge(content_dislike_rate,on='content_id')



          # feature_df['guidance_scale'] = guidance_scale
          # feature_df['num_inference_steps'] = num_inference_steps

          # feature_df['content_source'] = self.generated_content_metadata['source']

          return feature_df, ['content_avg_engagement_time','content_like_rate', 'content_dislike_rate','guidance_scale', 'num_inference_steps'], ['source','artist_style']

    def get_Ys(self, engagement_data) -> pd.DataFrame:
        """Engineers the target variables that you are predicting.
        Args
            engagement_data (pd.DataFrame): Engagement data.
        Returns
            pd.DataFrame: Dataframe of 5 columns;
                'user_id', 'content_id', 'like', 'dislike', 'engage_time'
        """

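        np.random.seed(42)

        # Work on a copy so the helper columns added below do not leak into the caller's dataframe.
        engagement_data = engagement_data.copy()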

        target_df = engagement_data[['user_id', 'content_id']].drop_duplicates().copy()
        def map_like_dislike(row):
            if str(row['engagement_type']) == 'Like':
                if row['engagement_value'] > 0:
                    return 1
                elif row['engagement_value'] < 0:
                    return -1
            else:
                return None
        def map_engagement_time(row):
            if str(row['engagement_type']) == 'MillisecondsEngagedWith':
                return row['engagement_value']
            else:
                return None

        engagement_data["engagement_like_dislike"] = engagement_data.apply(map_like_dislike, axis=1)
        engagement_data["engage_time"] = engagement_data.apply(map_engagement_time, axis=1)
        like_dislike = engagement_data[engagement_data['engagement_type'] == 'Like'][['user_id', 'content_id','engagement_like_dislike']].groupby(['user_id', 'content_id']).max().reset_index()
        engagement_time = engagement_data[engagement_data['engagement_type'] == 'MillisecondsEngagedWith'][['user_id', 'content_id','engage_time']].groupby(['user_id', 'content_id']).sum().reset_index()

        target_df = target_df.merge(engagement_time, on=('user_id', 'content_id'), how = 'left').merge(like_dislike,on=('user_id', 'content_id'),how='left').fillna(0)

        target_df['like'] = (target_df['engagement_like_dislike'] > 0).astype(int)
        target_df['dislike'] = (target_df['engagement_like_dislike'] < 0).astype(int)
        target_df = target_df.drop('engagement_like_dislike',axis=1)

        return target_df

    def predict(self, X: pd.DataFrame) -> Tuple[list, list, list]:
        """Predicts the 3 target variables by using the model that you trained.
        Make sure you load the model properly.

        Args:
            X (pd.DataFrame): Feature dataframe with 2-level index of (user_id, content_id)

        Returns:
            (list, list, list): (predicted probability of like,
                                predicted probability of dislike,
                                predicted engagement time)
        """

        model = self.load_model()

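        # predict_proba columns follow model.classes_ ([0, 1] for 0/1 targets),
        # so the probability of the positive class is column 1, not column 0.
        pred_like = model['like'].predict_proba(X)[:,1]
        pred_dislike = model['dislike'].predict_proba(X)[:,1]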
        pred_engtime = model['engage_time'].predict(X)
        # pred_engtime = np.reshape(pred_engtime, (pred_engtime.shape[0],))

        return pred_like, pred_dislike, pred_engtime

    def rank(self,
            score_df: pd.DataFrame,
            user_id: int,
            content_ids: Optional[list] = None) -> list:

        """Ranks the items for a given user based on your own criteria.

        Args:
            score_df (pd.DataFrame): Predicted-score Dataframe of columns;
                'user_id', 'content_id', 'like', 'dislike', 'engage_time', and
                also columns for content metadata.
            user_id (int): User ID to rank the items for.
            content_ids (Optional[list]): List of content ids to be considered for ranking.
        """

        score_df = score_df[score_df['user_id'] == user_id].copy()  # copy: columns are added below
        scaler = MinMaxScaler()
        score_df['engage_time'] = scaler.fit_transform(score_df['engage_time'].values.reshape(-1, 1))
        score_df['artist_style_theme'] = score_df['artist_style'].apply(lambda x: 'movie' if 'movie' in str(x) else x)
        score_group_df = score_df.groupby(['artist_style_theme']).apply(lambda x: x.sort_values(['like','dislike'], ascending = False))
        score_group_df['like-dislike'] = score_group_df.apply(lambda x: x['like']-x['dislike'], axis=1)
        score_group_df['overall_score'] = score_group_df.apply(lambda x: (0.8*x['like-dislike'])+(0.2*x['engage_time']) if x['like-dislike']>0 else ((0.6*x['like-dislike'])+(0.4*x['engage_time']) if x['like-dislike']<0 else x['engage_time']), axis=1)
        score_group_df = score_group_df.drop(columns = {'artist_style_theme'})
        new_group_df = score_group_df.groupby(['artist_style_theme']).apply(lambda x: x.sort_values(['overall_score'], ascending = False))
        new_group_df = new_group_df.droplevel(0)
        new_group_df.reset_index(inplace=True)
        new_df = new_group_df[['content_id', 'artist_style_theme', 'overall_score']]
        output_df = new_df.sort_values(by=['overall_score', 'artist_style_theme'], ascending=[False, True])

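        assorted_df = pd.DataFrame(columns=output_df.columns)
        style_counts = output_df.value_counts('artist_style_theme').astype(float)
        total = len(output_df)

        # Cap the 'movie' share at 30% and spread the excess over the other styles.
        # The defaults cover the case where 'movie' is absent or under the cap;
        # without them the else-branch below fails on undefined variables.
        excl_movie_size = total
        movie_diff = 0.0
        movie_size = style_counts.get('movie', 0.0)
        movie_capped = total > 0 and movie_size / total > 0.3
        if movie_capped:
            excl_movie_size = total - movie_size
            movie_diff = movie_size / total - 0.3

        # NOTE: only the style order of style_counts is used further down, not these weights.
        for style, size in style_counts.items():
            if style == 'movie' and movie_capped:
                style_counts[style] = 0.3
            else:
                style_counts[style] = size/excl_movie_size + ((size/excl_movie_size) * movie_diff)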

        # Initialize an empty list to store the DataFrames
        dfs_to_concat = []

        # Iterate over each style
        for style, count in style_counts.items():
            # Filter the DataFrame for the current style and add it to the list
            dfs_to_concat.append(output_df[output_df['artist_style_theme'] == style])

        # Concatenate all the DataFrames in the list
        assorted_df = pd.concat(dfs_to_concat, ignore_index=True)

        assorted_df['tile'] = assorted_df.groupby(['artist_style_theme']).cumcount()+1
        assorted_df = assorted_df.sort_values('tile')

        return assorted_df['content_id'].tolist()

    def load_model(self) -> object:
        """Loads your model. Different ML frameworks require different ways
        to load a model, so change this to reflect your choice of framework.

        Returns:
            object: Model object
        """

        with open(f'{self.objects_dir}/model.pkl', 'rb') as f:
            return pickle.load(f)
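
The question of how to trade the three scores off against each other can be explored in isolation before wiring anything into `.rank()`. The sketch below is one possible policy, not a reference solution: it min-max scales engagement time and blends it with `like - dislike`. The function name `blend_scores` and the weights `w_pref` and `w_time` are hypothetical and untuned.

```python
import pandas as pd

def blend_scores(score_df: pd.DataFrame, w_pref: float = 0.8, w_time: float = 0.2) -> pd.DataFrame:
    """Return score_df sorted by a weighted blend of the three predictions.

    w_pref and w_time are illustrative weights, not tuned values.
    """
    out = score_df.copy()
    # Scale engagement time to [0, 1] so it is comparable with the probabilities.
    span = out["engage_time"].max() - out["engage_time"].min()
    if span > 0:
        out["time_scaled"] = (out["engage_time"] - out["engage_time"].min()) / span
    else:
        out["time_scaled"] = 0.0
    out["overall_score"] = w_pref * (out["like"] - out["dislike"]) + w_time * out["time_scaled"]
    return out.sort_values("overall_score", ascending=False)

# Made-up predicted scores for three items.
scores = pd.DataFrame({
    "content_id":  [101, 102, 103],
    "like":        [0.9, 0.2, 0.5],
    "dislike":     [0.1, 0.6, 0.1],
    "engage_time": [1000.0, 5000.0, 3000.0],
})
ranked = blend_scores(scores)
print(ranked[["content_id", "overall_score"]])
```

Here item 102 has the longest predicted engagement time but still ranks last, because the preference term carries most of the weight; changing the weights is the simplest way to see how sensitive the ordering is.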

# Training

## Train Test Split

In [None]:
engagement_data = pd.read_csv('sample_data/engagement.csv', sep="\t")
content_meta = pd.read_csv('sample_data/generated_content_metadata.csv', sep="\t")

interactions = engagement_data[
    ['user_id', 'content_id']].drop_duplicates()

interactions_train, interactions_test = train_test_split(interactions, test_size=0.2, random_state=42)

engagement_train = pd.merge(interactions_train, engagement_data, how='left', on=['user_id', 'content_id'])
engagement_test = pd.merge(interactions_test, engagement_data, how='left', on=['user_id', 'content_id'])

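# index=False keeps the row index out of the files, so reading them back adds no 'Unnamed: 0' column
engagement_train.to_csv('sample_data/engagement_train.csv', sep="\t", index=False)
engagement_test.to_csv('sample_data/engagement_test.csv', sep="\t", index=False)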
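
Splitting on unique `(user_id, content_id)` pairs, as the cell above does, keeps every event of a pair on one side of the split, so a pair's like and its viewing time cannot end up in different sets. A small self-contained check of that property follows; the toy rows are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy log: several events per (user, content) pair.
events = pd.DataFrame({
    "user_id":          [1, 1, 1, 2, 2, 3, 3, 4, 4, 5],
    "content_id":       [10, 10, 11, 10, 10, 12, 12, 11, 11, 13],
    "engagement_type":  ["Like", "MillisecondsEngagedWith"] * 5,
    "engagement_value": [1, 900, 1, 400, -1, 1500, 1, 700, -1, 300],
})
keys = ["user_id", "content_id"]

pairs = events[keys].drop_duplicates()
pairs_train, pairs_test = train_test_split(pairs, test_size=0.2, random_state=42)

# Inner merges pull every event of a pair into that pair's split.
events_train = events.merge(pairs_train, on=keys, how="inner")
events_test = events.merge(pairs_test, on=keys, how="inner")

# No pair should appear on both sides, and no event should be lost.
overlap = pairs_train.merge(pairs_test, on=keys, how="inner")
print(len(events_train), len(events_test), len(overlap))
```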

In [None]:
#@title get training data
data_collector = DataCollectorAlpha(
    engagement_path='sample_data/engagement_train.csv',
    content_meta_path='sample_data/generated_content_metadata.csv'
    )

X_train = data_collector.gen_model_input(is_train=True)
y_train = data_collector.gen_target_vars()

# ensure that each row of y_train corresponds to the correct user-content in X_train
y_train = y_train.reindex(index=X_train.index)
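
To illustrate what the `reindex` call does, here is a small sketch on a three-row example. The index values and `like` labels are taken from the `y_train` output shown further down, while the `feature` column is made up. It also shows one way to check that no pair in `X_train` is missing from `y_train`, since `reindex` fills missing rows with `NaN` rather than raising an error.

```python
import pandas as pd

# Same (user_id, content_id) pairs, but in a different row order.
idx_X = pd.MultiIndex.from_tuples(
    [(51, 124105), (30, 77269), (53, 106302)], names=["user_id", "content_id"])
idx_y = pd.MultiIndex.from_tuples(
    [(53, 106302), (51, 124105), (30, 77269)], names=["user_id", "content_id"])

X_demo = pd.DataFrame({"feature": [0.1, 0.2, 0.3]}, index=idx_X)  # hypothetical feature
y_demo = pd.DataFrame({"like": [0, 0, 1]}, index=idx_y)

# Reorder the targets so that row i of y matches row i of X.
y_aligned = y_demo.reindex(index=X_demo.index)

assert y_aligned.index.equals(X_demo.index)
# A NaN here would mean a pair exists in X but has no target row.
assert not y_aligned.isna().any().any()
```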

## Training: Create your own training
Save the trained model to a file so that you can send it to the professor later.

In [None]:
## TODO: Build Models
import pickle

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import SGD
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor


class DummyModel:
    """Placeholder that ignores its training data and predicts uniform random scores in [0, 1)."""

    def __init__(self):
        pass

    def fit(self, X, y):
        return None

    def predict(self, X):
        return np.random.uniform(0, 1, len(X))


# One model per target: classifiers for the binary like/dislike labels,
# and a regressor for the continuous engagement time.
model_like = LogisticRegression(random_state=0, penalty='l2', solver='lbfgs')
model_dislike = LogisticRegression(random_state=0, penalty='l2', solver='lbfgs')
model_engtime = DecisionTreeRegressor(max_depth=20, min_samples_split=20, min_samples_leaf=10, random_state=42)

model_like.fit(X_train, y_train['like'])
model_dislike.fit(X_train, y_train['dislike'])
model_engtime.fit(X_train, y_train['engage_time'])

# Bundle the three models in one dict keyed by target name,
# then save them together as a single pickle file.
model = {
    'like': model_like,
    'dislike': model_dislike,
    'engage_time': model_engtime
}

with open('sample_data/model.pkl', 'wb') as f:
    pickle.dump(model, f)
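
Before sending the model file, it may be worth confirming that it reloads and predicts the same values. The sketch below does this round trip on small synthetic data with a temporary file, and all `*_demo` names are hypothetical. It also uses `predict_proba`, because scikit-learn's `LogisticRegression.predict` returns class labels rather than probabilities. Whether your `.score()` should call `predict_proba` depends on how you implement it.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for X_train / y_train (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_like_demo = (X_demo[:, 0] > 0).astype(int)
y_time_demo = np.abs(X_demo[:, 1]) * 1000

demo_model = {
    "like": LogisticRegression(random_state=0).fit(X_demo, y_like_demo),
    "engage_time": DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, y_time_demo),
}

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)

# predict_proba returns one column per class; column 1 is P(like == 1).
p_like_before = demo_model["like"].predict_proba(X_demo)[:, 1]
p_like_after = reloaded["like"].predict_proba(X_demo)[:, 1]

assert np.allclose(p_like_before, p_like_after)
assert np.allclose(demo_model["engage_time"].predict(X_demo),
                   reloaded["engage_time"].predict(X_demo))
```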

In [None]:
model_engtime.predict(X_train)

array([1039.45      ,  265.4       , 1248.39018692, ..., 1127.46666667,
       3232.01265823, 3421.40740741])

In [None]:
y_train

                    engage_time  like  dislike
user_id content_id
51      124105           1007.0     0        0
30      77269               0.0     1        0
53      106302           1224.0     0        0
64      72623             887.0     0        0
89      41445            5596.0     0        0
...                         ...   ...      ...
59      81683           90133.0     1        0
62      101336           3207.0     0        0
100     92628            2062.0     1        0
103     48548            3384.0     0        1


In [None]:
y_train['engage_time']

user_id  content_id
51       124105         1007.0
30       77269             0.0
53       106302         1224.0
64       72623           887.0
89       41445          5596.0
                        ...   
59       81683         90133.0
62       101336         3207.0
100      92628          2062.0
103      48548          3384.0
82       120005         4621.0
Name: engage_time, Length: 146671, dtype: float64

# Evaluation

In [None]:
# Simulates contents filtered from previous stage.
# Feel free to change this to reflect your previous stage.

sample_contents = content_meta['content_id'].sample(frac=0.01)

In [None]:
# Get true target variables
y_true = data_collector.gen_target_vars(engagement_test)

# Make predictions
y_pred = data_collector.score(content_ids=sample_contents)

In [None]:
thres_like = 0.5
thres_dislike = 0.5
evaluate(y_true, y_pred, thres_like, thres_dislike)

{'like': {'precision': 0.23703703703703705, 'recall': 0.6153846153846154},
 'dislike': {'precision': 0.1930379746835443, 'recall': 0.8243243243243243},
 'engage_time': {'rmse': 337097.61203835334}}
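
The `thres_like` and `thres_dislike` arguments presumably turn predicted probabilities into 0/1 labels before precision and recall are computed. The implementation of `evaluate` is not shown in this section, so the sketch below only assumes that behaviour. It uses a made-up eight-row example and a hypothetical helper, `precision_recall_at`, to show how moving the threshold trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities, for illustration only.
y_true_demo = np.array([1, 0, 1, 0, 0, 1, 0, 0])
p_like_demo = np.array([0.9, 0.6, 0.55, 0.4, 0.3, 0.2, 0.7, 0.1])


def precision_recall_at(threshold, y_true, p_pred):
    """Binarise probabilities at `threshold`, then compute precision and recall."""
    y_hat = (p_pred >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
    }


sweep = {t: precision_recall_at(t, y_true_demo, p_like_demo) for t in (0.5, 0.8)}
for threshold, metrics in sweep.items():
    print(threshold, metrics)
```

On this toy data a threshold of 0.5 gives precision 0.5 and recall 2/3, while 0.8 gives precision 1.0 and recall 1/3. A similar sweep over `y_pred` can help you choose `thres_like` and `thres_dislike` rather than defaulting to 0.5.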

# Inference Example

In [None]:
sample_contents = content_meta['content_id'].sample(frac=0.01)  # simulated contents filtered from previous stage


data_collector = DataCollectorAlpha(
    engagement_path='sample_data/engagement_train.csv',  # will be None in real production
    content_meta_path='sample_data/generated_content_metadata.csv'  # will be None in real production
    )

recs = data_collector.recommend(user_id=8, content_ids=sample_contents, top_k=20)

In [None]:
recs

[71048,
 110870,
 112341,
 111113,
 115161,
 116338,
 110611,
 111262,
 79983,
 116864,
 111136,
 82712,
 113945,
 80840,
 83499,
 81533,
 110707,
 110683,
 112851,
 115475]