# 📰 News Recommendation System Part 0 - Baseline

## 📘 Project Introduction
This project explores user behavior prediction in a news recommendation scenario. The goal is to build a model that predicts a user's future click behavior, specifically the last news article they clicked on, from their historical browsing and click data.

The setting is inspired by a real-world news app, where delivering timely, relevant content is essential for user engagement. This project aims to simulate a practical recommender system, combining business intuition with machine learning techniques to address a realistic problem in the content recommendation space.

## 📊 Data Overview
The dataset contains user interaction data from a large-scale news platform, including:
- 300,000 users
- ~3 million clicks
- 360,000+ unique news articles; each news article is represented by a pre-trained embedding vector, capturing semantic relationships between articles.

We extracted click log data from 200,000 users as the training set, 50,000 users as test set A, and 50,000 users as test set B.

## 📄 Data Tables

- train_click_log.csv: Training set user click logs
- testA_click_log.csv: Test set user click logs
- articles.csv: News article information data table
- articles_emb.csv: Embedding vector representation of news articles

|        **Field**        |         **Description**          |
| :---------------------: | :------------------------------: |
|         user_id         |              User ID             |
|    click_article_id     |            Clicked article ID    |
|     click_timestamp     |            Click timestamp        |
|    click_environment    |             Click environment     |
|    click_deviceGroup    |            Click device group     |
|        click_os         |           Click operating system  |
|      click_country      |            Click country          |
|      click_region       |             Click region          |
|   click_referrer_type   |           Click source type       |
|       article_id        | Article ID, corresponding to click_article_id |
|       category_id       |            Article type ID        |
|      created_at_ts      |          Article creation timestamp |
|       words_count       |             Article word count     |
| emb_1,emb_2,...,emb_249 |      Article embedding vector representation |

## 📏 Evaluation Metrics
The final recommendation for each user will include five recommended articles, sorted by click probability.

For example, for user1, our recommendation would be:
> user1, article1, article2, article3, article4, article5.

There is only one correct answer for each user's last clicked article, so we check if any of the recommended five articles match the actual answer. We will use **mean reciprocal rank** as the evaluation metric. The formula is as follows:
$$
score(user) = \sum_{k=1}^5 \frac{s(user, k)}{k}
$$

Here s(user, k) = 1 if the k-th recommended article is the one the user actually clicked, and 0 otherwise. If article1 is the article clicked by user1, then s(user1, 1) = 1 and s(user1, k) = 0 for k = 2, ..., 5, so score(user1) = 1. If article2 is the clicked article, then s(user1, 2) = 1 and all other terms are 0, so score(user1) = 1/2. In general, score(user) is the reciprocal of the rank at which the match occurs, and 0 if none of the five recommendations match. The final metric is the mean of score(user) over all users. This is reasonable because we want hits to rank as high as possible, which yields a higher score.
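
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the official scorer: the function names `reciprocal_rank` and `mean_reciprocal_rank` and the toy user/article IDs are assumptions made for the example.

```python
def reciprocal_rank(recommended, clicked, k=5):
    """Return 1/rank of `clicked` within the first k recommendations, or 0.0 if absent."""
    for rank, article in enumerate(recommended[:k], start=1):
        if article == clicked:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(recommendations, answers, k=5):
    """Average the per-user reciprocal rank over every user in `answers`."""
    scores = [
        reciprocal_rank(recommendations.get(user, []), clicked, k)
        for user, clicked in answers.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: user1 hits at rank 1, user2 at rank 2, user3 misses.
recs = {
    "user1": [11, 12, 13, 14, 15],
    "user2": [21, 22, 23, 24, 25],
    "user3": [31, 32, 33, 34, 35],
}
truth = {"user1": 11, "user2": 22, "user3": 99}

mrr = mean_reciprocal_rank(recs, truth)
print(mrr)  # (1 + 1/2 + 0) / 3 = 0.5
```

A hit at rank 1 contributes 1, a hit at rank 5 contributes only 0.2, and a miss contributes 0, so the ordering of the five recommendations matters as much as whether the clicked article is among them.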

## 💡 Project Understanding
The goal of this project is to **predict the last news article a user clicked, based on their historical browsing data**. Unlike traditional structured prediction problems, this is more aligned with real-world recommendation systems, using raw user click logs rather than neatly labeled data.

To approach this, I framed the task as a **supervised learning** problem by transforming user-article interactions into "features + labels" training data. The core idea is to predict the likelihood of a user clicking a given article, turning this into a click-through rate (CTR) prediction task. This reframing allows for the use of **classification models**—starting with simple baselines like logistic regression and moving toward deep learning approaches.

With this framing, the classification label is whether a user clicks on a particular article, and the features describe the user and the article. We need to train a classification model that predicts the probability of a given user clicking a given article. This raises several additional questions:
- How to create training and testing datasets?
- What specific features can we leverage?
- What models can we attempt?
- With 360,000 articles and over 200,000 users, what strategies do we have to reduce the problem's scale? How do we make the final predictions?
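
One possible answer to the first question, sketched under assumptions: for offline validation, each user's most recent click can be held out as the label while the earlier clicks serve as history. The helper name `split_last_click` and the toy data are hypothetical, and this baseline notebook does not itself perform such a split.

```python
import pandas as pd


def split_last_click(click_df):
    """Separate each user's most recent click (label) from their earlier clicks (history)."""
    # Sort chronologically within each user and give every row a unique index.
    click_df = click_df.sort_values(['user_id', 'click_timestamp']).reset_index(drop=True)
    last_click = click_df.groupby('user_id').tail(1)  # One row per user: the latest click
    history = click_df.drop(last_click.index)         # Everything before it
    return history, last_click


# Hypothetical toy click log with the same column names as train_click_log.csv
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'click_article_id': [10, 20, 30, 10, 40],
    'click_timestamp': [100, 200, 300, 150, 250],
})

history, labels = split_last_click(toy)
print(labels[['user_id', 'click_article_id']])
```

Note that a user with a single click ends up with a label but no history, so such users would need separate handling (for example, falling back to popular articles).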

**For this beginning part, we will run a baseline for our news recommendation system project.**

## 1. Import Packages

In [None]:
# Import packages
import time  # Import the time module for handling time-related operations.
import math  # Import the math module, which provides mathematical functions and constants.
import os  # Import the os module for interacting with the operating system.
from tqdm import tqdm  # Import the tqdm module for creating progress bars to visualize the progress of iterations.
import gc  # Import the garbage collection module for releasing memory space.
import pickle  # Import the pickle module for serializing and deserializing Python objects.
import random  # Import the random module for generating random numbers.
from datetime import datetime  # Import the datetime module for handling date and time.
from operator import itemgetter  # Import the itemgetter function from the operator module for retrieving elements based on an index or key.
import numpy as np  # Import the NumPy library, which provides high-performance numerical computing capabilities.
import pandas as pd  # Import the Pandas library, which offers data analysis and manipulation functionalities.
import warnings  # Import the warnings module for controlling the display of warning messages.
from collections import defaultdict  # Import the defaultdict class from the collections module, which provides a dictionary that allows setting default values.
import collections  # Import the collections module, which provides commonly used collection classes.
warnings.filterwarnings('ignore')  # Ignore the display of warning messages.

In [None]:
# If using Google Colab, use this cell to load data
from google.colab import drive

# Connect to Google Drive
drive.mount('/content/drive')

# Define file paths
data_path = '/content/drive/MyDrive/Datasets/news-rec-sys/'
save_path = '/content/drive/MyDrive/Datasets/news-rec-sys/temp_results/'

In [None]:
# If using a local machine
data_path = './data/' 
save_path = './data/temp_results/'  # Save temporary results

### Reduce DataFrame memory usage

In [None]:
# A standard function for memory optimization
def reduce_mem(df):
    starttime = time.time()  # Record the start time of the function
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']  # List of numeric data types
    start_mem = df.memory_usage().sum() / 1024**2  # Calculate the memory usage of the DataFrame (in Mb)

    # Iterate through each column of the DataFrame
    for col in df.columns:
        col_type = df[col].dtypes  # Get the data type of the column
        if col_type in numerics:  # If the column's data type is numeric
            c_min = df[col].min()  # Get the minimum value in the column
            c_max = df[col].max()  # Get the maximum value in the column

            # Check if there are missing values in the minimum and maximum values
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue

            # Choose the appropriate data type conversion based on the data type's range
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2  # Calculate the memory usage of the DataFrame after conversion (in Mb)
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction), time spend:{:2.2f} min'.format(end_mem,
                                                                                                  100*(start_mem-end_mem)/start_mem,
                                                                                                  (time.time()-starttime)/60))
    return df
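
The idea behind `reduce_mem` is to downcast each numeric column to the smallest dtype that still holds its values. As a rough illustration of the same idea, the sketch below uses pandas' built-in `pd.to_numeric(..., downcast=...)` on a made-up DataFrame. It is not a drop-in replacement: `to_numeric` stops at `float32` for floats, whereas `reduce_mem` may go down to `float16`. The column values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: every column starts out as a 64-bit type.
toy = pd.DataFrame({
    'user_id': np.arange(1000, dtype=np.int64),       # Values 0..999 fit in int16
    'click_os': np.full(1000, 17, dtype=np.int64),    # A constant small value fits in int8
    'score': np.arange(1000, dtype=np.float64) * 0.5, # Exactly representable in float32
})

before = toy.memory_usage().sum()

# Downcast each column to the smallest dtype that can hold its values.
toy['user_id'] = pd.to_numeric(toy['user_id'], downcast='integer')
toy['click_os'] = pd.to_numeric(toy['click_os'], downcast='integer')
toy['score'] = pd.to_numeric(toy['score'], downcast='float')

after = toy.memory_usage().sum()

print(toy.dtypes)
print('Memory: {} bytes -> {} bytes'.format(before, after))
```

Downcasting floats trades precision for memory, so it is worth checking that columns such as timestamps or embeddings still carry enough precision afterwards.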

## 2. Read Sampled or Full Data

In [None]:
# Debug mode: Sample a portion of data from the training set for code debugging
def get_all_click_sample(data_path, sample_nums=10000):
    """
    Samples a portion of data from the training set for debugging purposes.
    data_path: The path where the original data is stored.
    sample_nums: The number of samples to take (can be a small number of users due to memory limitations).
    """
    all_click = pd.read_csv(data_path + 'train_click_log.csv')  
    all_user_ids = all_click.user_id.unique()  # Get unique identifiers for all users

    # Randomly select a specified number of users from all users as sampled users
    sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False)
    all_click = all_click[all_click['user_id'].isin(sample_user_ids)]  # Retain click data for sampled users

    all_click = all_click.drop_duplicates(['user_id', 'click_article_id', 'click_timestamp'])  # Remove duplicate click data
    return all_click

# Read click data in one of two modes. For an online submission, the test set's click data should be merged into the overall data.
# For offline validation of the model or of individual features, the training set alone is enough.
def get_all_click_df(data_path='./data_raw/', offline=True):
    if offline:
        all_click = pd.read_csv(data_path + 'train_click_log.csv')  # Read the click data from the training set
    else:
        trn_click = pd.read_csv(data_path + 'train_click_log.csv') # Read the click data from the training set
        tst_click = pd.read_csv(data_path + 'testA_click_log.csv')   # Read the click data from the test set

        all_click = pd.concat([trn_click, tst_click])  # Combine the training and test click data (DataFrame.append was removed in pandas 2.0)

    all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))  # Remove duplicate click data
    return all_click

In [None]:
all_click_df = get_all_click_df(data_path, offline=False)

In [None]:
all_click_df.info()

In [None]:
all_click_df.head()

## 3. Create a dictionary that maps users to articles and their corresponding click times

In [None]:
# Get the user's article-click time sequence based on click time: {user1: [(item1, time1), (item2, time2), ...]...}
def get_user_item_time(click_df):
    # Sort the click DataFrame by click timestamp
    click_df = click_df.sort_values('click_timestamp')

    def make_item_time_pair(df):
        # Create a list of article IDs and corresponding timestamps
        return list(zip(df['click_article_id'], df['click_timestamp']))

    # Group by user, generate a list of article ID and timestamp pairs for each user, reset the index, and rename the columns
    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\
                                                            .reset_index().rename(columns={0: 'item_time_list'})
    # Create a dictionary with user IDs as keys and lists of article ID and timestamp pairs as values
    user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))

    return user_item_time_dict
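
To make the shape of the resulting dictionary concrete, here is a small self-contained sketch that builds the same `{user: [(item, time), ...]}` structure with a dictionary comprehension over a toy click log. The toy rows are hypothetical, and the comprehension is an alternative way to obtain the same structure rather than the function above.

```python
import pandas as pd

# Hypothetical toy click log, deliberately out of chronological order
toy_clicks = pd.DataFrame({
    'user_id': [1, 2, 1, 2, 1],
    'click_article_id': [30, 40, 10, 10, 20],
    'click_timestamp': [300, 250, 100, 150, 200],
})

# Sort by time first so that each user's list comes out in click order.
sorted_clicks = toy_clicks.sort_values('click_timestamp')

user_item_time = {
    int(user): list(zip(group['click_article_id'], group['click_timestamp']))
    for user, group in sorted_clicks.groupby('user_id')
}

print(user_item_time)
# {1: [(10, 100), (20, 200), (30, 300)], 2: [(10, 150), (40, 250)]}
```

Because each list is in chronological order, the last tuple for a user is always their most recent click.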

## 4. Retrieve the top k clicked articles

In [None]:
# Get the most frequently clicked articles
def get_item_topk_click(click_df, k):
    """
    Get the top k article IDs with the most clicks.

    Parameters:
        click_df: DataFrame containing click data.
        k: Number of top article IDs to retrieve.

    Returns:
        topk_click: List of the top k article IDs with the most clicks.
    """
    # Use the value_counts() function to count the occurrences of each article ID in the click_df DataFrame's click_article_id column.
    # Then, use index slicing [:k] to select the top k articles with the highest click counts.
    topk_click = click_df['click_article_id'].value_counts().index[:k]
    return topk_click
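
A quick toy run shows what the `value_counts()` approach returns. The article IDs below are made up, and `.tolist()` is added only to turn the resulting pandas `Index` into a plain Python list.

```python
import pandas as pd

# Hypothetical toy clicks: article 10 is clicked three times, 20 twice, 30 once.
toy = pd.DataFrame({'click_article_id': [10, 20, 10, 30, 10, 20]})

# value_counts() sorts by frequency in descending order, so the first k index
# entries are the k most-clicked article IDs.
top2 = toy['click_article_id'].value_counts().index[:2].tolist()

print(top2)  # [10, 20]
```

Articles with equal click counts have no meaningful relative order here, so the exact top-k boundary can be arbitrary when there are ties.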

## 5. ItemCF: Calculating the similarity matrix

In [None]:
def itemcf_sim(df):
    """
    Calculation of the similarity matrix between articles

    Parameters:
        df: Click log DataFrame

    Returns:
        i2i_sim_: Dictionary of item-to-item similarities, {item_i: {item_j: similarity, ...}, ...}

    Idea:
    Item-based collaborative filtering (for details, refer to the previous team learning on basic recommendation systems).
    In the multi-recall section, a recall strategy based on association rules will be added.
    """

    # Call a function to obtain a dictionary of user-item-click time data
    user_item_time_dict = get_user_item_time(df)

    # Calculate item similarity
    i2i_sim = {}  # Store a dictionary of item-item similarity
    item_cnt = defaultdict(int)  # Dictionary to count item occurrences

    # Iterate through the user-item-click time data dictionary
    for user, item_time_list in tqdm(user_item_time_dict.items()):
        # Consider time factors when optimizing item-based collaborative filtering

        # Iterate through the list of items and their click times for the same user
        for i, i_click_time in item_time_list:
            # Count item occurrences
            item_cnt[i] += 1
            i2i_sim.setdefault(i, {})

            # Iterate through the list of other items and their click times for the same user
            for j, j_click_time in item_time_list:
                if i == j:
                    continue
                i2i_sim[i].setdefault(j, 0)
                i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1)

    i2i_sim_ = i2i_sim.copy()

    # Further process and optimize the similarity dictionary
    for i, related_items in i2i_sim.items():
        for j, wij in related_items.items():
            i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])

    # Save the obtained similarity matrix locally
    with open(save_path + 'itemcf_i2i_sim.pkl', 'wb') as f:
        pickle.dump(i2i_sim_, f)

    return i2i_sim_

In [None]:
i2i_sim = itemcf_sim(all_click_df)
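
Read off the code above, the similarity between articles i and j can be written as follows, where N(u) denotes the click list of user u and |N(i)| the number of clicks on article i (this notation is introduced here for illustration and does not appear in the notebook):

$$
w_{ij} = \frac{\sum_{u:\, i,j \in N(u)} \frac{1}{\log(1 + |N(u)|)}}{\sqrt{|N(i)|\,|N(j)|}}
$$

The sketch below recomputes this on two made-up users so the numbers can be checked by hand. `toy_itemcf_sim` is a hypothetical, simplified stand-in that takes plain item lists instead of a click DataFrame.

```python
import math
from collections import defaultdict


def toy_itemcf_sim(user_items):
    """Compute item-to-item similarity from {user: [item, ...]} click lists."""
    co_weight = defaultdict(lambda: defaultdict(float))  # Weighted co-occurrence of (i, j)
    item_cnt = defaultdict(int)                          # Number of clicks per item

    for items in user_items.values():
        # Users with long click lists contribute less per co-occurrence.
        weight = 1 / math.log(len(items) + 1)
        for i in items:
            item_cnt[i] += 1
            for j in items:
                if i != j:
                    co_weight[i][j] += weight

    # Normalize by item popularity so that very popular items do not dominate.
    return {
        i: {j: w / math.sqrt(item_cnt[i] * item_cnt[j]) for j, w in related.items()}
        for i, related in co_weight.items()
    }


# Hypothetical toy data: 'a' and 'b' co-occur for both users, 'c' only for u2.
user_items = {'u1': ['a', 'b'], 'u2': ['a', 'b', 'c']}
sim = toy_itemcf_sim(user_items)

print(round(sim['a']['b'], 4), round(sim['a']['c'], 4))
```

For this toy input, `sim['a']['b']` equals (1/log 3 + 1/log 4) / sqrt(2 * 2), which is larger than `sim['a']['c']` = (1/log 4) / sqrt(2 * 1), because 'a' and 'b' are co-clicked by both users.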

## 6. ItemCF article recommendation

In [None]:
def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):
    """
    Recommendation based on item-based collaborative filtering.
    
    Parameters:
        user_id: User ID
        user_item_time_dict: Dictionary of user-clicked article sequences based on click time {user1: [(item1, time1), (item2, time2), ...]...}
        i2i_sim: Dictionary, article similarity matrix
        sim_item_topk: Integer, choose the top k most similar articles to the current article
        recall_item_num: Integer, the number of recalled articles in the end
        item_topk_click: List, the list of most-clicked articles for user recall completion
    
    Returns:
        item_rank: List of (item, score) tuples sorted by score in descending order, [(item1, score1), (item2, score2), ...]
        
    Note: In the multi-recall part, a recall strategy based on association rules will be added.
    """

    # Get the articles that the user has interacted with in the past
    user_hist_items = user_item_time_dict[user_id]  # Get the list of articles the user has clicked on
    user_hist_items_ = {item_id for item_id, _ in user_hist_items}  # Set of article IDs the user has clicked on, for fast lookup

    item_rank = {}  # Store a dictionary of articles and their similarity scores
    for loc, (i, click_time) in enumerate(user_hist_items):
        # Iterate through the list of articles the user has clicked on and their corresponding click times
        for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:
            # Iterate through the top sim_item_topk articles most similar to the current article and their similarity scores
            if j in user_hist_items_:
                continue  # If the similar article is already in the user's historical click list, skip it
            item_rank.setdefault(j, 0)
            item_rank[j] += wij  # Accumulate the similarity score of the similar articles

    # If there are fewer than recall_item_num articles, complete with popular items
    if len(item_rank) < recall_item_num:
        for i, item in enumerate(item_topk_click):
            if item in item_rank:  # If the completed article is already in the previous list, skip it
                continue
            item_rank[item] = -i - 100  # Assign a negative score to the completed article (arbitrarily set)
            if len(item_rank) == recall_item_num:
                break  # Exit the loop after reaching the specified number of recalled articles

    item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]  # Sort articles in descending order based on score and truncate to the specified number of articles

    return item_rank  # Return the list of recalled articles, including articles and their scores
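
The recall logic can be traced on a tiny hand-made similarity dictionary. The sketch below is a simplified, hypothetical stand-in (`toy_recommend`) that takes a plain list of clicked items instead of `(item, time)` pairs. The similarity values and popular-item list are invented for the example.

```python
def toy_recommend(history, i2i_sim, sim_item_topk, recall_item_num, popular_items):
    """Score unseen items by summed similarity to the user's history, then pad with popular items."""
    seen = set(history)
    scores = {}

    for i in history:
        # Take the sim_item_topk nearest neighbours of each clicked item.
        neighbours = sorted(i2i_sim.get(i, {}).items(), key=lambda x: x[1], reverse=True)
        for j, wij in neighbours[:sim_item_topk]:
            if j not in seen:
                scores[j] = scores.get(j, 0.0) + wij

    # Pad with popular items, using negative scores so they always rank last.
    for rank, item in enumerate(popular_items):
        if len(scores) >= recall_item_num:
            break
        if item not in scores:
            scores[item] = -rank - 100

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]


# Hypothetical similarities: 'c' is close to both clicked items, 'd' only weakly to 'b'.
i2i_sim = {
    'a': {'b': 0.8, 'c': 0.5},
    'b': {'a': 0.8, 'c': 0.6, 'd': 0.1},
}

result = toy_recommend(['a', 'b'], i2i_sim, sim_item_topk=2,
                       recall_item_num=3, popular_items=['e', 'f', 'g'])
print(result)
```

Here 'c' collects similarity from both clicked items (0.5 + 0.6), 'd' is cut off because it is not among the top 2 neighbours of 'b', and the remaining two slots are filled with the popular items 'e' and 'f'.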

## 7. Recommend articles for each user based on item-based collaborative filtering

In [None]:
# Dictionary holding the recall results for each user
user_recall_items_dict = collections.defaultdict(dict)

# Get the user-item-click time dictionary
user_item_time_dict = get_user_item_time(all_click_df)

# Load item-item similarity
i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))

# Number of similar articles to consider
sim_item_topk = 10

# Number of recalled articles
recall_item_num = 10

# User recall completion with popular items
item_topk_click = get_item_topk_click(all_click_df, k=50)

# Loop through all users
for user in tqdm(all_click_df['user_id'].unique()):
    # Recommend articles for each user based on item-based collaborative filtering
    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim,
                                                        sim_item_topk, recall_item_num, item_topk_click)

## 8. Convert the recall dictionary to a DataFrame

In [None]:
user_item_score_list = []

for user, items in tqdm(user_recall_items_dict.items()):
    for item, score in items:
        user_item_score_list.append([user, item, score])

recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score'])

## 9. Save Recommendation Results

In [None]:
# create submit file
def submit(recall_df, topk=5, model_name=None):
    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])
    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')

    # Check that every user has at least topk recalled articles
    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())
    assert tmp.min() >= topk

    del recall_df['pred_score']
    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()

    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]
    # define column name
    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2',
                                                  3: 'article_3', 4: 'article_4', 5: 'article_5'})

    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'
    submit.to_csv(save_name, index=False, header=True)
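
The core of `submit` is a per-user ranking followed by a long-to-wide reshape. The sketch below shows the same two steps on a made-up recall table. It uses `DataFrame.pivot` rather than `set_index(...).unstack(...)`, which is an alternative formulation assumed here for readability, and keeps only the top 2 articles to stay small.

```python
import pandas as pd

# Hypothetical recall results: three candidates per user with predicted scores
recall = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'click_article_id': [10, 20, 30, 40, 50, 60],
    'pred_score': [0.2, 0.9, 0.5, 0.7, 0.1, 0.3],
})

# Rank candidates within each user: the highest score gets rank 1.
recall['rank'] = (
    recall.groupby('user_id')['pred_score']
    .rank(ascending=False, method='first')
    .astype(int)
)

# Keep the top 2 per user and reshape to one row per user.
top2 = recall[recall['rank'] <= 2]
wide = top2.pivot(index='user_id', columns='rank', values='click_article_id')
wide.columns = ['article_{}'.format(r) for r in wide.columns]
wide = wide.reset_index()

print(wide)
```

For user 1 the scores 0.9 and 0.5 rank first and second, so the row reads articles 20 then 30. For user 2 the row reads 40 then 60.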

In [None]:
# create test set
tst_click = pd.read_csv(data_path + 'testA_click_log.csv')
tst_users = tst_click['user_id'].unique()

# Keep only the recall results for users that appear in the test set
tst_recall = recall_df[recall_df['user_id'].isin(tst_users)]

# submit final file, which is the baseline
submit(tst_recall, topk=5, model_name='itemcf_baseline')