# 📰 News Recommendation System Part 3 - Feature Engineering

## 📘 Project Introduction
This project explores user behavior prediction in a news recommendation scenario. The goal is to build a model that predicts a user's future click behavior, specifically the last news article they clicked on, from their historical browsing and click data.

The setting is inspired by a real-world news app, where delivering timely, relevant content is essential for user engagement. This project aims to simulate a practical recommender system, combining business intuition with machine learning techniques to address a realistic problem in the content recommendation space.

## 📊 Data Overview
The dataset contains user interaction data from a large-scale news platform, including:
- 300,000 users
- ~3 million clicks
- 360,000+ unique news articles; each news article is represented by a pre-trained embedding vector, capturing semantic relationships between articles.

We extracted click log data from 200,000 users as the training set, 50,000 users as test set A, and 50,000 users as test set B.

## 📄 Data Tables

- train_click_log.csv: Training set user click logs
- testA_click_log.csv: Test set user click logs
- articles.csv: News article information data table
- articles_emb.csv: Embedding vector representation of news articles

|        **Field**        |         **Description**          |
| :---------------------: | :------------------------------: |
|         user_id         |              User ID             |
|    click_article_id     |            Clicked article ID    |
|     click_timestamp     |            Click timestamp        |
|    click_environment    |             Click environment     |
|    click_deviceGroup    |            Click device group     |
|        click_os         |           Click operating system  |
|      click_country      |            Click country          |
|      click_region       |             Click region          |
|   click_referrer_type   |           Click source type       |
|       article_id        | Article ID, corresponding to click_article_id |
|       category_id       |            Article type ID        |
|      created_at_ts      |          Article creation timestamp |
|       words_count       |             Article word count     |
| emb_1,emb_2,...,emb_249 |      Article embedding vector representation |

## 📏 Evaluation Metrics
The final recommendation for each user will include five recommended articles, sorted by click probability.

For example, for user1, our recommendation would be:
> user1, article1, article2, article3, article4, article5.

There is only one correct answer for each user's last clicked article, so we check if any of the recommended five articles match the actual answer. We will use **mean reciprocal rank** as the evaluation metric. The formula is as follows:
$$
score(user) = \sum_{k=1}^5 \frac{s(user, k)}{k}
$$

If article1 is the article the user actually clicked, then s(user1, 1) = 1 and s(user1, k) = 0 for k = 2, ..., 5, so score(user1) = 1. If article2 is the clicked article, then s(user1, 2) = 1 and all other s(user1, k) are 0, so score(user1) = 1/2. In general, score(user) is the reciprocal of the rank at which the match occurs, and it is 0 if none of the five recommendations match. The final metric averages score(user) over all users. This is reasonable because hits ranked closer to the top yield a higher score.
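As a minimal sketch of the metric, the snippet below scores a top-5 list against a single correct article and averages over users. The function names and the toy article IDs are illustrative assumptions, not part of the competition code.

```python
def reciprocal_rank_score(recommended, clicked, k=5):
    """Return 1/rank of `clicked` within the top-k list, or 0.0 if it is absent."""
    for rank, article in enumerate(recommended[:k], start=1):
        if article == clicked:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(recommendations, answers, k=5):
    """Average the per-user score over every user that has an answer."""
    scores = [reciprocal_rank_score(recommendations[u], answers[u], k) for u in answers]
    return sum(scores) / len(scores)


# Toy data (hypothetical IDs): user1 is hit at rank 2, user2 is not hit at all.
recs = {
    'user1': ['a1', 'a2', 'a3', 'a4', 'a5'],
    'user2': ['a9', 'a7', 'a3', 'a4', 'a5'],
}
answers = {'user1': 'a2', 'user2': 'a8'}

mrr = mean_reciprocal_rank(recs, answers)  # (1/2 + 0) / 2 = 0.25
```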

## 💡 Project Understanding
The goal of this project is to **predict the last news article a user clicked, based on their historical browsing data**. Unlike traditional structured prediction problems, this is more aligned with real-world recommendation systems, using raw user click logs rather than neatly labeled data.

To approach this, I framed the task as a **supervised learning** problem by transforming user-article interactions into "features + labels" training data. The core idea is to predict the likelihood of a user clicking a given article, turning this into a click-through rate (CTR) prediction task. This reframing allows for the use of **classification models**—starting with simple baselines like logistic regression and moving toward deep learning approaches.

Now, we have converted this problem into a classification problem, where the classification label is whether the user will click on a particular article. The features of the classification problem will include the user and the article. We need to train a classification model to predict the probability of a particular user clicking on a specific article. This raises several additional questions:
- How to create training and testing datasets?
- What specific features can we leverage?
- What models can we attempt?
- With 360,000 articles and over 200,000 users, what strategies do we have to reduce the problem's scale? How do we make the final predictions?

**For the third part, we will prepare new features for each user's recalled list. These features will be used in the ranking model in the fourth part.**

## Converting to a Supervised Learning Problem
To prepare our data for machine learning, we need to convert it into a supervised learning problem with features and labels. The features we can use directly from our initial data are:

- **Article Features**: These are the inherent properties of each article, such as its category_id, created_at_ts (creation timestamp, which indicates its timeliness), and words_count. These help us understand user preferences for content type, newness, and length.

- **Article Embedding Features**: These are the vector representations of article content, which we've used in the recall phase. We can also use Word2Vec or BERT to create new embedding features that contain semantic relationships.

- **User Device Features**: Information about the user's device provides a useful context for their interactions.

## Building the Supervised Dataset

Our recall phase gives us a dictionary in the format {user_id: [list of potential articles]}. We can use this to build our training set. For each user and each potential article in their list, we'll create a data point. For example, if user1's recall list is {user1: [item1, item2, item3]}, our dataset will have three rows: (user1, item1), (user1, item2), and (user1, item3). These pairs form the first two columns of our supervised dataset; the features and the label are added as further columns.
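The expansion described above can be sketched as follows. This toy version assumes the recall dictionary maps each user to a plain list of item IDs, as in the example; the column names are illustrative.

```python
import pandas as pd

# Toy recall dictionary in the {user_id: [items]} format from the example above.
recall = {
    'user1': ['item1', 'item2', 'item3'],
    'user2': ['item2'],
}

# One row per (user, candidate article) pair.
rows = [(user, item) for user, items in recall.items() for item in items]
candidates = pd.DataFrame(rows, columns=['user_id', 'article_id'])
```

Feature columns and the label can then be joined onto `candidates` by `user_id` and `article_id`.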

## Constructing New Features from Historical Behavior

A key insight from the data analysis (Part 1) is that a user's final click is strongly correlated with their recent clicks. Therefore, our most important features will be a combination of the user's historical behavior and the candidate article. For each candidate article, we will create the following features based on its relationship to the user's most recent clicks:

1. Similarity Features: Calculate the similarity (e.g., using the inner product of embeddings) between the candidate item and the last few articles the user clicked. This directly captures the user's most recent interests.

2. Statistical Features of Similarity: Compute statistical measures like the average or standard deviation of the similarity features. This can help smooth out noise and capture broader trends in a user's preferences.

3. Word Count Difference: Calculate the difference in word count between the candidate item and the user's most recently clicked articles. This can reveal preferences for article length.

4. Time Difference: Calculate the time gap between the candidate article's creation time and the user's last click time. This is a powerful feature for understanding a user's preference for timely content.

5. User-item similarity obtained from YouTubeDNN recall: Create a similarity feature between the user and the candidate item itself, which can be very informative.
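
To make features 1 and 2 concrete, here is a small sketch that takes the inner product between a candidate embedding and each of a user's recent clicks, then summarizes those similarities. The function name, the output keys, and the 2-dimensional toy vectors are assumptions for illustration only.

```python
import numpy as np


def similarity_stats(candidate_emb, hist_embs):
    """Inner-product similarity to each recent click, plus summary statistics."""
    cand = np.asarray(candidate_emb, dtype=np.float64)
    hist = np.asarray(hist_embs, dtype=np.float64)  # shape: (n_recent_clicks, dim)
    sims = hist @ cand  # one similarity per recent click
    return {
        'sim_sum': float(sims.sum()),
        'sim_max': float(sims.max()),
        'sim_min': float(sims.min()),
        'sim_mean': float(sims.mean()),
        'sim_std': float(sims.std()),
    }


# Toy embeddings: the candidate matches the first click, is orthogonal to the second.
candidate = [1.0, 0.0]
recent_clicks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
stats = similarity_stats(candidate, recent_clicks)  # similarities are [1.0, 0.0, 0.5]
```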

In this part, we will implement the creation of these features. The logic is as follows:
1. First, we'll get a user's last click and their historical clicks from the click log data.
2. Next, we'll create the new features using the user's historical click data, the recall list, article information, and embedding vectors.
3. Finally, we'll create the labels to form our supervised learning dataset.

## 1. Import Packages

In [None]:
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import gc, os
import logging
import time
import lightgbm as lgb
from gensim.models import Word2Vec
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')

In [None]:
# If using Google Colab, use this cell to load data
from google.colab import drive

# Connect to Google Drive
drive.mount('/content/drive')

# Define file paths
data_path = '/content/drive/MyDrive/Datasets/news-rec-sys/'
save_path = '/content/drive/MyDrive/Datasets/news-rec-sys/temp_results/'

In [None]:
# If using a local machine
data_path = './data/' 
save_path = './data/temp_results/' # save temporary results

## 2. Data Loading and Splitting Training and Testing Sets

### 2.1 Function to save DataFrame memory.

In [None]:
# A standard function for memory optimization
def reduce_mem(df):
    starttime = time.time()  # Record the start time of the function
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']  # List of numeric data types
    start_mem = df.memory_usage().sum() / 1024**2  # Calculate the memory usage of the DataFrame (in Mb)

    # Iterate through each column of the DataFrame
    for col in df.columns:
        col_type = df[col].dtypes  # Get the data type of the column
        if col_type in numerics:
            c_min = df[col].min()  # Get the minimum value in the column
            c_max = df[col].max()  # Get the maximum value in the column

            # Check if there are missing values in the minimum and maximum values
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue

            # Choose the appropriate data type conversion based on the data type's range
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
                    
    end_mem = df.memory_usage().sum() / 1024**2  # Calculate the memory usage of the DataFrame after conversion (in Mb)
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction), time spent:{:2.2f} min'.format(end_mem,
                                                                                                  100*(start_mem-end_mem)/start_mem,
                                                                                                  (time.time()-starttime)/60))
    return df
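
For comparison, pandas offers a built-in downcast through `pd.to_numeric`. The sketch below is an alternative illustration on a toy DataFrame, not a replacement for `reduce_mem`: unlike the function above, `downcast='float'` stops at float32 and never produces float16.

```python
import numpy as np
import pandas as pd

# Toy frame with wide default dtypes.
df_demo = pd.DataFrame({
    'user_id': np.array([1, 2, 300], dtype='int64'),
    'score': np.array([0.5, 1.25, 3.0], dtype='float64'),
})
mem_before = df_demo.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest dtype that can hold its values.
df_demo['user_id'] = pd.to_numeric(df_demo['user_id'], downcast='integer')
df_demo['score'] = pd.to_numeric(df_demo['score'], downcast='float')

mem_after = df_demo.memory_usage(deep=True).sum()
```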

### 2.2 Training and Validation Split Function
We split the data into training and validation sets to test our model's performance offline. To fully simulate the validation set, we will take a portion of the user data from the training set and use all of their information to form the validation set.

The benefit of doing this split early is that it reduces the pressure of creating ranking features. Generating features for the entire dataset at once can be time-consuming, so this approach helps us work more efficiently.

In [None]:
# all_click_df refers to the training set.
# sample_user_nums is the number of users to sample for the validation set.
def trn_val_split(all_click_df, sample_user_nums):
    all_click = all_click_df
    all_user_ids = all_click['user_id'].unique()
    
    # replace=False means that users cannot be sampled more than once.
    sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False)

    click_val = all_click[all_click['user_id'].isin(sample_user_ids)]
    click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]

    # Extract the last click from the validation set to serve as the answer.
    click_val = click_val.sort_values(['user_id', 'click_timestamp'])
    val_ans = click_val.groupby('user_id').tail(1)

    click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)

    # Remove cases where a user in val_ans has only one click.
    # If a user has only one click and it's put into val_ans,
    # the training set will have no data for that user,
    # leading to a user cold-start problem which complicates model validation.
    val_ans = val_ans[val_ans['user_id'].isin(click_val['user_id'].unique())] # Ensure users in the answer set also appear in the validation set.
    click_val = click_val[click_val['user_id'].isin(val_ans['user_id'].unique())]

    return click_trn, click_val, val_ans

### 2.3 Get Historical Clicks and Last Click Function

In [None]:
# Get historical and last clicks from the current data
def get_hist_and_last_click(all_click):
    all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])
    click_last_df = all_click.groupby('user_id').tail(1)

    # If the user has only one click record (len(user_df) == 1),
    # the hist_func will return the entire click record for that user.
    # Otherwise, it will return all clicks except the last one.
    def hist_func(user_df):
        if len(user_df) == 1:
            return user_df
        else:
            return user_df[:-1]

    click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)

    return click_hist_df, click_last_df
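
The same split can also be written without `groupby(...).apply`, which tends to be slow with many users. The sketch below is an assumed vectorized variant (the function name is hypothetical) that keeps the single-click rule from above: a user with only one click keeps that click in the history.

```python
import pandas as pd


def split_hist_and_last(clicks):
    """Vectorized split into history and last click per user."""
    clicks = clicks.sort_values(['user_id', 'click_timestamp'])
    # True only on each user's final row after sorting by time.
    is_last = ~clicks.duplicated('user_id', keep='last')
    # Number of clicks per user, aligned to every row.
    n_clicks = clicks.groupby('user_id')['user_id'].transform('size')
    last_df = clicks[is_last]
    # Keep all non-final rows, plus the only row of single-click users.
    hist_df = clicks[~is_last | (n_clicks == 1)]
    return hist_df.reset_index(drop=True), last_df.reset_index(drop=True)


toy_clicks = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'click_article_id': ['a', 'b', 'c', 'd'],
    'click_timestamp': [1, 2, 3, 5],
})
toy_hist, toy_last = split_hist_and_last(toy_clicks)
```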

### 2.4 Read the Training, Validation, and Test Sets Function

In [None]:
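# sample_user_nums: number of users held out for validation (10000 is an assumed default; adjust as needed).
def get_trn_val_tst_data(data_path, offline=True, sample_user_nums=10000):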
    if offline:
        click_trn_data = pd.read_csv(data_path + 'train_click_log.csv')  # Training Set
        click_trn_data = reduce_mem(click_trn_data)
        click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)
    else:
        click_trn = pd.read_csv(data_path + 'train_click_log.csv')
        click_trn = reduce_mem(click_trn)
        click_val = None
        val_ans = None

    click_tst = pd.read_csv(data_path + 'testA_click_log.csv')  # Test set

    return click_trn, click_val, click_tst, val_ans

### 2.5 Load Recall Dictionary Function

In [None]:
# Return a multi-channel recall list or a single-channel recall list.
def get_recall_list(save_path, single_recall_model=None, multi_recall=False):
    if multi_recall:
        return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))

    if single_recall_model == 'i2i_itemcf':
        return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))
    elif single_recall_model == 'i2i_emb_itemcf':
        return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))
    elif single_recall_model == 'user_cf':
        return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))
    elif single_recall_model == 'youtubednn':
        return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))

### 2.6 Read Article Information Function

In [None]:
def get_article_info_df():
    article_info_df = pd.read_csv(data_path + 'articles.csv')
    article_info_df = reduce_mem(article_info_df)

    return article_info_df

### 2.7 Read Various Embeddings

### 2.7.1 Word2Vec Embedding

In [None]:
# The default embed_size of 16 matches the vector size used in the original call.
def train_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):
    click_df = click_df.sort_values('click_timestamp')
    # The data must be converted to strings before training.
    click_df['click_article_id'] = click_df['click_article_id'].astype(str)
    # Convert the data into a sentence-like format.
    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()
    docs = docs['click_article_id'].values.tolist()

    # Set up logging for easy monitoring of training progress.
    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)
    
    # The default negative sampling is 5.
    # Note: the keyword names below assume gensim >= 4.0 (vector_size, epochs, model.wv);
    # gensim 3.x uses size= and iter= instead.
    w2v = Word2Vec(docs, vector_size=embed_size, sg=1, window=5, seed=2020, workers=24, min_count=1, epochs=10)

    # Save the embeddings as a dictionary.
    item_w2v_emb_dict = {k: w2v.wv[k] for k in click_df['click_article_id']}
    pickle.dump(item_w2v_emb_dict, open(save_path + save_name, 'wb'))

    return item_w2v_emb_dict

### 2.7.2 Load embeddings from the recall phase

In [None]:
# Load the Word2Vec embeddings and the three embeddings from the recall phase.
def get_embedding(save_path, all_click_df):
    if os.path.exists(save_path + 'item_content_emb.pkl'):
        item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb'))
    else:
        item_content_emb_dict = None  # Keep the name defined so the return below does not fail
        print('item_content_emb.pkl file does not exist...')
        
    # The Word2Vec embeddings need to be pre-trained.
    if os.path.exists(save_path + 'item_w2v_emb.pkl'):
        item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb'))
    else:
        item_w2v_emb_dict = train_item_word2vec(all_click_df)

    if os.path.exists(save_path + 'item_youtube_emb.pkl'):
        item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb'))
    else:
        item_youtube_emb_dict = None
        print('item_youtube_emb.pkl file does not exist...')

    if os.path.exists(save_path + 'user_youtube_emb.pkl'):
        user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))
    else:
        user_youtube_emb_dict = None
        print('user_youtube_emb.pkl file does not exist...')

    return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict

## 3. Read Data

In [None]:
# The distinction between 'offline' and 'online' here is whether the validation set is empty.
click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)

click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)

if click_val is not None:
    click_val_hist, click_val_last = click_val, val_ans
else:
    click_val_hist, click_val_last = None, None

click_tst_hist = click_tst

## 4. Downsampling the negative samples

After the recall phase, we're left with a list of positive items (what a user clicked) and a huge number of negative items (what they didn't click). This creates a massive class imbalance.

#### The Problem: Imbalanced Data
The data is extremely imbalanced. For each positive click (label 1), there are thousands of negative examples (label 0). Training a model on this skewed data would lead it to simply predict "not clicked" for everything, resulting in a useless model.

#### The Solution: Downsampling
To solve this, we downsample the negative samples. This means we selectively choose a smaller number of negative examples to create a more balanced ratio between positive and negative data points. This serves two purposes:

- Mitigates Imbalance: It makes the ratio of positive to negative samples more manageable, allowing the model to learn what a "click" looks like rather than just what a "non-click" looks like.

- Reduces Computation: It drastically reduces the size of the training dataset, which makes feature engineering and model training significantly faster.

#### Key Considerations for Downsampling
When performing negative downsampling, we follow these important rules:

1. Only the negative samples are downsampled. All positive samples should be retained.
2. After downsampling, ensure that all users and items that appeared in the original data are still present in the new, smaller dataset.
3. The ratio of positive to negative samples is a hyperparameter we need to tune. We will try different ratios to see what works best for our model.
4. Since downsampling changes the dataset, it's crucial to update our user recall lists to reflect the new data. This ensures that any subsequent features, especially those related to a user's position or history, are accurate.

In this context, we're performing the negative downsampling early because creating ranking features for the full, imbalanced dataset would be too slow. This strategic step helps make the entire process more efficient.
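
As a simplified illustration of rule 1, the sketch below keeps every positive row and caps the negatives per user. The function name, the cap of 2, and the toy data are assumptions; it covers only per-user sampling, so it does not by itself guarantee that every item survives (rule 2).

```python
import pandas as pd


def downsample_negatives(df, max_neg_per_user=2, seed=0):
    """Keep all positives; keep at most `max_neg_per_user` random negatives per user."""
    pos = df[df['label'] == 1]
    neg = df[df['label'] == 0]
    # Shuffle once, then take the first rows of each user group.
    neg_kept = neg.sample(frac=1.0, random_state=seed).groupby('user_id').head(max_neg_per_user)
    return pd.concat([pos, neg_kept], ignore_index=True)


toy_recall = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 1, 2, 2, 2],
    'sim_item': [10, 11, 12, 13, 14, 10, 15, 16],
    'label':    [1, 0, 0, 0, 0, 0, 0, 0],
})
toy_sampled = downsample_negatives(toy_recall)
```

Here user 1 keeps the positive row plus two negatives, and user 2 keeps two of three negatives, so both users remain in the sampled data.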

In [None]:
# Convert a recall list dictionary to a DataFrame format
def recall_dict_2_df(recall_list_dict):
    """
    Converts a recall list dictionary (recall_list_dict) to a DataFrame.
    Parameters:
        recall_list_dict (dict): A dictionary of recall lists, where keys are users
                                 and values are lists of recalled items.
    Returns:
        recall_list_df (pd.DataFrame): The converted DataFrame, containing users, items, and scores.

    """
    df_row_list = []  # [user, item, score]
    for user, recall_list in tqdm(recall_list_dict.items()):
        for item, score in recall_list:
            df_row_list.append([user, item, score])

    col_names = ['user_id', 'sim_item', 'score']
    recall_list_df = pd.DataFrame(df_row_list, columns=col_names)

    return recall_list_df

In [None]:
# The negative sampling function, where the sampling ratio can be controlled. A default value is provided here.
def neg_sample_recall_data(recall_items_df, sample_rate=0.001):
    pos_data = recall_items_df[recall_items_df['label'] == 1]
    neg_data = recall_items_df[recall_items_df['label'] == 0]

    print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data)/len(neg_data))

    # Grouped sampling function
    def neg_sample_func(group_df):
        neg_num = len(group_df)
        sample_num = max(int(neg_num * sample_rate), 1)  # Ensure at least one sample is taken.
        sample_num = min(sample_num, 5)  # Ensure a maximum of 5 samples; this can be adjusted.
        return group_df.sample(n=sample_num, replace=True)

    # Perform negative sampling on a per-user basis, ensuring all users are in the sampled data.
    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)
    # Perform negative sampling on a per-item basis, ensuring all items are in the sampled data.
    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)

    # Merge the sampled data from both of the above cases.
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
    neg_data_new = pd.concat([neg_data_user_sample, neg_data_item_sample])
    # Since the two operations above were separate, some data points might be duplicated.
    # We must remove duplicates from the merged data.
    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')

    # Merge with the positive samples.
    data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)

    return data_new

In [None]:
# Label the recall data
def get_rank_label_df(recall_list_df, label_df, is_test=False):
    # The test set has no labels. To unify the code, we'll use a negative number as a placeholder.
    if is_test:
        recall_list_df['label'] = -1
        return recall_list_df

    label_df = label_df.rename(columns={'click_article_id': 'sim_item'})
    recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \
                                               how='left', on=['user_id', 'sim_item'])
    recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)
    del recall_list_df_['click_timestamp']

    return recall_list_df_

In [None]:
def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist, click_trn_last, click_val_last, recall_list_df):
    # Get the recall list for the training data.
    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]
    # Label the training data.
    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)
    # Perform negative sampling on the training data.
    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)

    # Check the function argument rather than the global click_val.
    if click_val_hist is not None:
        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]
        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)
        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)
    else:
        val_user_item_label_df = None

    # Test data does not need negative sampling; all recalled items are directly given a label of -1.
    tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]
    tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)

    return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df

In [None]:
# Read the recall list
recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf')
# Convert the recall dictionary to a DataFrame
recall_list_df = recall_dict_2_df(recall_list_dict)

In [None]:
# Label the training and validation data, and perform negative sampling.
trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist,
                                                                                                       click_val_hist,
                                                                                                       click_tst_hist,
                                                                                                       click_trn_last,
                                                                                                       click_val_last,
                                                                                                       recall_list_df)

## 5. Convert the Downsampled Recall Data to a Dictionary

In [None]:
# Convert the final recall DataFrame to a dictionary format for ranking feature creation.
def make_tuple_func(group_df):
    row_data = []
    for name, row_df in group_df.iterrows():
        row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))

    return row_data

In [None]:
trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))

if val_user_item_label_df is not None:
    val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
    val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))
else:
    val_user_item_label_tuples_dict = None

tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()
tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))

## 6. Feature Engineering

Feature engineering is the process of creating new features from raw data to improve the performance of machine learning models. Here's a summary of the key techniques:

#### a. Text Feature Engineering
- **Text Preprocessing**: Clean and normalize text data by tokenizing, removing stop words, removing irrelevant characters, symbols, or punctuation, removing duplicated/noisy/missing data, performing stemming or lemmatization, and handling named entities.
- **Text Length Features**: Create features based on the length of a news article's title or body, such as character count, word count, or sentence count.
- **Sentiment Analysis**: Use sentiment analysis models (e.g., VADER, TextBlob) to extract the emotional tone of an article as a feature.

#### b. User Behavior Features
- **Time Interval Features**: Calculate the time gap between user actions, such as the interval between clicks on two different articles.
- **User Behavior Sequence**: Sequence a user's historical actions and use a sequence model (e.g., RNN, LSTM) to extract features.
- **User Interest Tags**: Use clustering or topic modeling to build interest tags that represent a user's preferences for different topics or categories.

#### c. Article Attribute Features
- **Topic Modeling**: Use models like LDA or LSA to identify and use the topics of articles as features.
- **One-Hot Encoding**: Convert an article's category into a numerical feature using one-hot encoding.
- **Title Keywords**: Extract keywords from article titles to use as features.

#### d. Contextual Features
- **Geographic Features**: Use a user's location or an article's location tags as features.
- **Time Features**: Convert timestamps into features like season, weekday/weekend, or holiday.
- **User Device Features**: Use information about a user's device, operating system, or browser type.
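
As a small example of the time features above, the sketch below derives hour, weekday, and a weekend flag from click timestamps. It assumes the timestamps are Unix epoch values in milliseconds interpreted as UTC; the two sample values and the column names are illustrative.

```python
import pandas as pd

# Two sample timestamps in epoch milliseconds (assumed unit).
click_ts = pd.Series([1507029570190, 1507406400000], name='click_timestamp')
click_dt = pd.to_datetime(click_ts, unit='ms')

time_feats = pd.DataFrame({
    'click_hour': click_dt.dt.hour,
    'click_weekday': click_dt.dt.dayofweek,   # Monday=0 ... Sunday=6
    'is_weekend': click_dt.dt.dayofweek >= 5,
})
```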

#### e. Embedding and Dimensionality Reduction
- **Word2Vec Embeddings**: Use a Word2Vec model (or a transformer-based one such as BERT) to convert article titles or bodies into low-dimensional, dense vectors.
- **TF-IDF Weighted Dimensionality Reduction**: Reduce the dimensionality of TF-IDF weighted features to remove noise and make them more manageable.

#### f. Feature Crossing and Combination
- **Feature Combination**: Combine multiple features, such as a user's interest tag with an article's category.
- **Feature Crossing**: Perform feature crossing for categorical features, such as combining a user's click count for a certain category with an article's category.

#### g. Handling Missing Values and Outliers
- **Missing Value Imputation**: Fill in missing feature values with methods like the mean, median, or mode.
- **Outlier Treatment**: Detect and handle outliers using methods like the Z-Score or box plots.
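
A minimal sketch of both steps on a toy word-count column: median imputation, followed by clipping with the box-plot (1.5 × IQR) rule. The values are made up, and the box-plot rule is used here because a Z-Score threshold such as 3 can never trigger on a sample this small.

```python
import pandas as pd

# Toy word counts with one missing value and one extreme value.
words = pd.Series([200.0, 220.0, None, 210.0, 5000.0], name='words_count')

# Missing value imputation with the median of the observed values.
filled = words.fillna(words.median())

# Box-plot rule: clip anything beyond 1.5 * IQR from the quartiles.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
clipped = filled.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```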

#### h. Feature Selection and Importance
- **Variance Thresholding**: Remove features with a low variance, assuming they have little predictive power.
- **Correlation Analysis**: Calculate the correlation between features and the target variable to select the most relevant ones.
- **Feature Importance Ranking**: Use a machine learning model (e.g., Random Forest, GBDT) to rank features by their importance and select the most significant ones.
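
The first two selection techniques can be sketched with plain pandas. The feature table below is a toy assumption; real thresholds would need tuning on the actual ranking features.

```python
import pandas as pd

# Toy feature table: one constant column and two informative columns.
features = pd.DataFrame({
    'click_environment': [4, 4, 4, 4],      # constant, so zero variance
    'words_count': [100, 250, 180, 300],
    'score': [0.2, 0.9, 0.1, 0.8],
})
label = pd.Series([0, 1, 0, 1], name='label')

# Variance thresholding: drop columns whose variance is zero.
variances = features.var()
kept_cols = variances[variances > 0.0].index.tolist()

# Correlation analysis: rank the remaining columns by correlation with the label.
correlations = features[kept_cols].corrwith(label).sort_values(ascending=False)
```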

### 6.1 Feature Engineering of User Click History Versus Recalled Articles
For each user and each candidate article recalled for them, we need to generate a set of features that capture the relationship between the candidate article and the user's past interests.

The specific steps are as follows:

1. Retrieve Recent Clicks: For every user, get the item_ids of the last N articles they clicked.
2. Generate Candidate-Specific Features: For each candidate article, calculate the following features by comparing it to the user's last N clicks:

- **Similarity Statistics**: Compute statistical features (sum, max, min, mean) of the similarity between the candidate article and the user's last N clicked articles (e.g., using embedding similarity).

- **Time-Based Features**: Calculate the time difference between the candidate article's creation time and the creation times of the last N clicked articles.

- **Length Features**: Determine the word count difference between the candidate article and the last N clicked articles.

- **User Similarity**: Calculate the similarity score between the user and the candidate article (e.g., using the user and item embeddings from the YouTubeDNN recall model).
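
The time-based and length features can be sketched with a one-time dictionary lookup instead of filtering the article table for every pair. This is an illustrative assumption rather than the implementation used below: the helper name is hypothetical, and taking absolute differences is a choice made for this sketch.

```python
import pandas as pd

# Toy article table with the columns described in the data overview.
toy_articles = pd.DataFrame({
    'article_id': [1, 2, 3],
    'created_at_ts': [1000, 4000, 9000],
    'words_count': [120, 300, 200],
})

# Build the lookup once: {article_id: {'created_at_ts': ..., 'words_count': ...}}
info = toy_articles.set_index('article_id')[['created_at_ts', 'words_count']].to_dict('index')


def time_and_word_diffs(candidate_id, hist_ids, info):
    """Absolute creation-time and word-count gaps to each recent click."""
    cand = info[candidate_id]
    time_diffs = [abs(cand['created_at_ts'] - info[h]['created_at_ts']) for h in hist_ids]
    word_diffs = [abs(cand['words_count'] - info[h]['words_count']) for h in hist_ids]
    return time_diffs, word_diffs


# Candidate article 3 compared with the user's last two clicks (articles 1 and 2).
t_diffs, w_diffs = time_and_word_diffs(3, [1, 2], info)
```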

In [None]:
# We will create history-related features based on the data below.
def create_feature(users_id, recall_list, click_hist_df,  articles_info, articles_emb, user_emb=None, N=1):
    """
    Creates features based on a user's historical behavior.
    :param users_id: A list of user IDs.
    :param recall_list: A dictionary where keys are user IDs and values are lists of candidate articles recalled for that user.
    :param click_hist_df: DataFrame containing the user's historical click information.
    :param articles_info: DataFrame with information about articles.
    :param articles_emb: A dictionary of article embedding vectors. This can be item_content_emb, item_w2v_emb, or item_youtube_emb.
    :param user_emb: A dictionary of user embedding vectors (user_youtube_emb). If not provided, the feature will not be created.
                     Note that if user_emb is used, articles_emb must be item_youtube_emb to ensure the dimensions match.
    :param N: The number of most recent clicks to consider. Defaults to 1 because many users in the test set have only one historical click; a larger N would leave those users with fewer than N history items.
    """
    # Create a 2D list to store the results, which will later be converted to a DataFrame.
    all_user_feas = []
    for user_id in tqdm(users_id):
        # Get the user's last N clicks.
        hist_user_items = click_hist_df[click_hist_df['user_id']==user_id]['click_article_id'][-N:]

        # Iterate through the user's recall list.
        for rank, (article_id, score, label) in enumerate(recall_list[user_id]):
            # Get the article's creation time and word count.
            a_create_time = articles_info[articles_info['article_id']==article_id]['created_at_ts'].values[0]
            a_words_count = articles_info[articles_info['article_id']==article_id]['words_count'].values[0]
            single_user_fea = [user_id, article_id]
            # Calculate the sum, max, min, and mean of similarity, time difference, and word difference
            sim_fea = []
            time_fea = []
            word_fea = []
            # Iterate through the user's last N clicked articles.
            for hist_item in hist_user_items:
                b_create_time = articles_info[articles_info['article_id']==hist_item]['created_at_ts'].values[0]
                b_words_count = articles_info[articles_info['article_id']==hist_item]['words_count'].values[0]

                sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))
                time_fea.append(abs(a_create_time - b_create_time))
                word_fea.append(abs(a_words_count - b_words_count))

            single_user_fea.extend(sim_fea)   # Similarity features
            single_user_fea.extend(time_fea)   # Time difference features
            single_user_fea.extend(word_fea)   # Word count difference features
            single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)])  # Statistical features of the similarities

            if user_emb:  # If user embedding is provided, calculate the similarity feature between the recalled article and the user.
                single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))

            single_user_fea.extend([score, rank, label])
            # Add to the main list.
            all_user_feas.append(single_user_fea)

    # Define column names.
    id_cols = ['user_id', 'click_article_id']
    sim_cols = ['sim' + str(i) for i in range(N)]
    time_cols = ['time_diff' + str(i) for i in range(N)]
    word_cols = ['word_diff' + str(i) for i in range(N)]
    sta_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']
    user_item_sim_cols = ['user_item_sim'] if user_emb else []
    user_score_rank_label = ['score', 'rank', 'label']
    cols = id_cols + sim_cols + time_cols + word_cols + sta_cols + user_item_sim_cols + user_score_rank_label

    # Convert to a DataFrame.
    df = pd.DataFrame(all_user_feas, columns=cols)

    return df

In [None]:
# Load article info
article_info_df = get_article_info_df()

# Merge all click logs (train, validation if present, and test)
if click_val is not None:
    all_click = pd.concat([click_trn, click_val, click_tst])
else:
    all_click = pd.concat([click_trn, click_tst])
all_click = reduce_mem(all_click)

# Get embeddings
item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)

In [None]:
# Get features related to the recalled articles in the training, validation, and test data.
trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict, \
                                            click_trn_hist, article_info_df, item_content_emb_dict)
trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)

if val_user_item_label_tuples_dict is not None:
    val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict, \
                                            click_val_hist, article_info_df, item_content_emb_dict)
    val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict, \
                                            click_tst_hist, article_info_df, item_content_emb_dict)
tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)

### 6.2 User Behavior and Inherent Article Features
In this section we extend the features built so far with user behavior features and inherent article features. Here's a breakdown of the features we will use:

#### 1. Inherent Article Features
- **Article Word Count**: The number of words in an article indicates its length. This helps us understand a user's preference for long-form vs. short-form content.

- **Article Creation Time**: An article's creation timestamp reflects its freshness. This can be used to gauge a user's preference for timely, up-to-date news.

- **Article Embedding**: Converting articles into low-dimensional, dense vectors (e.g., via Word2Vec or BERT) helps us capture the semantic similarity between articles, which is crucial for improving recommendation accuracy.

#### 2. User Click Environment Features
- **Device Information**: Features related to a user's device type, operating system, or browser can reflect their habits and preferences, helping us deliver personalized content for their specific environment.

#### 3. User Behavior and Article Popularity Features
- **User Activity**: We can create features that reflect a user's activeness by counting their total clicks and analyzing the time intervals between their clicks.

- **Article Popularity**: We'll build features to measure an article's popularity or "hotness" by analyzing its total click count and the distribution of clicks over time.

- **User Timeliness Preference**: By comparing the click times of a user's historical articles to their creation times, we can understand their preference for content timeliness and recommend articles accordingly.

- **User Topic Preferences**: We can build features that represent a user's favorite topics by statistically analyzing the categories of their past clicks. This helps us predict if a new article belongs to a topic they've previously engaged with.

- **User Reading Length Preference**: By analyzing the word counts of a user's historical articles (e.g., calculating the average word count), we can create a feature that reflects their preference for a specific article length.

In [None]:
# Merge with article information
all_data = all_click.merge(article_info_df, left_on='click_article_id', right_on='article_id')
all_data.shape

### 6.2.1 User Activity: Total Clicks and Time Intervals Between Clicks
To distinguish active users from less active ones, we'll create a feature that combines their click frequency and the time between their clicks. The steps for this feature are as follows:

1. Group by User: First, we group the data by user_id.

2. Calculate Metrics: For each user, we calculate two key metrics:

 - The total number of articles they've clicked.

 - The average time interval between their consecutive clicks.

3. Combine and Normalize: To create a single score, we take the inverse of the click count and min-max normalize it. We also min-max normalize the average time interval (without inverting it). These two normalized values are then added together.

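4. Handle Single Clicks: If a user has only one click, there is no interval to average, so the mean would be a null value. These cases need a placeholder. The code below uses 1; because a smaller interval is read as higher activity, a large sentinel value (for example, the maximum observed interval) would separate these users from active ones more clearly.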

The intuition behind this feature is that a smaller final value indicates a more active user—meaning they clicked more often (high count -> low inverse) and did so in shorter time frames (short interval -> low normalized value).
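
To make the arithmetic concrete, here is a small sketch of the score on toy data. It is not the notebook's `active_level` function: the names `min_max`, `activity_score`, and `single_click_gap` are hypothetical, and single-click users are given an explicit large gap as an assumption.

```python
import pandas as pd

def min_max(s):
    """Min-max normalize a Series; return zeros if all values are equal."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span != 0 else s * 0.0

def activity_score(clicks, single_click_gap):
    """Lower score = more active (more clicks, shorter gaps between them)."""
    clicks = clicks.sort_values(["user_id", "click_timestamp"])
    grouped = clicks.groupby("user_id")["click_timestamp"]
    out = pd.DataFrame({
        "click_count": grouped.size(),
        # Mean gap between consecutive clicks; NaN when a user has one click.
        "gap_mean": grouped.apply(lambda ts: ts.diff().mean()),
    })
    out["gap_mean"] = out["gap_mean"].fillna(single_click_gap)
    out["active_level"] = min_max(1.0 / out["click_count"]) + min_max(out["gap_mean"])
    return out.reset_index()

# User 1: three clicks 10 apart; user 2: two clicks 100 apart; user 3: one click.
toy_clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "click_timestamp": [0, 10, 20, 0, 100, 50],
})

scores = activity_score(toy_clicks, single_click_gap=1000)
print(scores)
```

On this toy input user 1 gets the lowest score and user 3 the highest, which matches the intuition that a smaller value means a more active user.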

In [None]:
def active_level(all_data, cols):
    """
    Creates features to distinguish user activity levels.
    :param all_data: The dataset.
    :param cols: The feature columns to use.
    """
    # Copy the slice so the in-place sort does not operate on a view of all_data.
    data = all_data[cols].copy()
    data.sort_values(['user_id', 'click_timestamp'], inplace=True)
    user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']].\
                            agg({'click_article_id': np.size, 'click_timestamp': list}).values, columns=['user_id', 'click_size', 'click_timestamp'])

    # Calculate the mean time difference.
    def time_diff_mean(l):
        if len(l) == 1:
            return 1
        else:
            return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])

    user_act['click_time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))

    # Take the inverse of the click count.
    user_act['click_size'] = 1 / user_act['click_size']

    # Normalize both features.
    user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())
    user_act['click_time_diff_mean'] = (user_act['click_time_diff_mean'] - user_act['click_time_diff_mean'].min()) / (user_act['click_time_diff_mean'].max() - user_act['click_time_diff_mean'].min())
    user_act['active_level'] = user_act['click_size'] + user_act['click_time_diff_mean']

    user_act['user_id'] = user_act['user_id'].astype('int')
    del user_act['click_timestamp']

    return user_act

In [None]:
user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])
user_act_fea.head()

### 6.2.2 Article Popularity
To measure an article's popularity, we'll use a similar approach to how we measured user activity. The logic is that an article is considered "hot" if it receives many clicks in a short amount of time. Here are the steps for creating this feature:

1. Group by Article: We group the data by click_article_id.

2. Calculate Metrics: For each article, we calculate two metrics:

 - The total number of times it has been clicked.

 - The average time interval between each of those clicks.

3. Combine and Normalize: To get a single "hotness" score, we take the inverse of the click count and normalize it. The average time interval will also be normalized. We add these two normalized values together.

The result is a single score where a smaller value indicates a hotter article. This is because a lower value means the article was clicked more often and in shorter time intervals.

In [None]:
def hot_level(all_data, cols):
    """
    Creates features to measure an article's popularity.
    :param all_data: The dataset.
    :param cols: The feature columns to use.
    """
    # Copy the slice so the in-place sort does not operate on a view of all_data.
    data = all_data[cols].copy()
    data.sort_values(['click_article_id', 'click_timestamp'], inplace=True)
    article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']].\
                               agg({'user_id': np.size, 'click_timestamp': list}).values, columns=['click_article_id', 'user_num', 'click_timestamp'])

    # Calculate the mean time difference between clicks.
    def time_diff_mean(l):
        if len(l) == 1:
            return 1
        else:
            return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])

    article_hot['article_time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))

    # Take the inverse of the click count.
    article_hot['user_num'] = 1 / article_hot['user_num']

    # Normalize both features.
    article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())
    article_hot['article_time_diff_mean'] = (article_hot['article_time_diff_mean'] - article_hot['article_time_diff_mean'].min()) / (article_hot['article_time_diff_mean'].max() - article_hot['article_time_diff_mean'].min())
    article_hot['hot_level'] = article_hot['user_num'] + article_hot['article_time_diff_mean']

    article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')

    del article_hot['click_timestamp']

    return article_hot

In [None]:
article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])
article_hot_fea.to_csv(save_path + 'articles_hot_features.csv', index=False)
article_hot_fea.head()

### 6.3 User Habit and Preference Features
To create a more comprehensive user profile, we can build a DataFrame that contains features reflecting their unique habits and preferences. Here are some key features to construct:

- **User Device Habits**: Identify a user's most frequently used device (the mode).

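- **User Time Habits**: Analyze the timestamps of a user's clicks to understand when they are most active. One option is to extract the hour of the day or day of the week and take the mean or most common value. The implementation in 6.3.2 takes a simpler route: it averages each user's min-max normalized click timestamps and article creation timestamps.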

- **User Topic Preferences**: Determine a user's content interests by statistically analyzing the topics of their historical clicks. This could be represented using multi-hot encoding to show which categories a user is interested in.

- **User Reading Length Preferences**: Analyze the word counts of a user's historical articles to understand their preference for article length. We can calculate the average word count to see if they prefer long-form or short-form content.
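
As an illustration of the hour-of-day idea for time habits, the sketch below finds each user's most common click hour. It assumes `click_timestamp` is a Unix timestamp in milliseconds and interprets it in UTC. The function name `user_hour_habit` and the toy data are hypothetical.

```python
import pandas as pd

def user_hour_habit(clicks, ts_col="click_timestamp", unit="ms"):
    """Return each user's most common click hour (0-23, UTC)."""
    hours = pd.to_datetime(clicks[ts_col], unit=unit).dt.hour
    return (
        hours.groupby(clicks["user_id"])
        # mode() is sorted, so ties resolve to the smallest hour.
        .agg(lambda h: h.mode().iloc[0])
        .rename("click_hour_mode")
        .reset_index()
    )

# Toy clicks: user 1 mostly reads around 08:00, user 2 around 20:00.
toy_clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "click_timestamp": [
        28_800_000,  # 08:00
        30_600_000,  # 08:30
        72_000_000,  # 20:00
        72_000_000,  # 20:00
        75_000_000,  # 20:50
        3_600_000,   # 01:00
    ],
})

hour_habit = user_hour_habit(toy_clicks)
print(hour_habit)
```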

### 6.3.1 User Device Habits

In [None]:
def device_fea(all_data, cols):
    """
    Creates features for user devices.
    :param all_data: The dataset.
    :param cols: The feature columns to use.
    """
    user_device_info = all_data[cols]

    # Use the mode to represent the device information for each user.
    user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()

    return user_device_info

In [None]:
device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']
user_device_info = device_fea(all_data, device_cols)
user_device_info.head()

### 6.3.2 User Time Habits

In [None]:
def user_time_hab_fea(all_data, cols):
    """
    Creates features for user time habits.
    :param all_data: The dataset.
    :param cols: The feature columns to use.
    """
    user_time_hab_info = all_data[cols].copy()  # copy so the column assignments below do not act on a view

    # First, normalize the timestamps.
    mm = MinMaxScaler()
    user_time_hab_info['click_timestamp'] = mm.fit_transform(user_time_hab_info[['click_timestamp']])
    user_time_hab_info['created_at_ts'] = mm.fit_transform(user_time_hab_info[['created_at_ts']])

    user_time_hab_info = user_time_hab_info.groupby('user_id').agg('mean').reset_index()

    user_time_hab_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)
    return user_time_hab_info

In [None]:
user_time_hab_cols = ['user_id', 'click_timestamp', 'created_at_ts']
user_time_hab_info = user_time_hab_fea(all_data, user_time_hab_cols)
user_time_hab_info.head()

### 6.3.3 User Topic Preference
First, we will convert the topics of the articles a user has clicked into a list. During the final data aggregation, we will create a separate feature where an article's topic gets a value of 1 if it's in this list, and 0 otherwise.
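
The membership flag described above can be sketched on toy data like this. The frames `user_topics` and `candidates` and their values are hypothetical; only the column names follow this notebook.

```python
import pandas as pd

# Each user's clicked categories, collected into a list.
user_topics = pd.DataFrame({
    "user_id": [1, 2],
    "cate_list": [[281, 375, 281], [412]],
})

# Candidate articles recalled for each user, with their category.
candidates = pd.DataFrame({
    "user_id": [1, 1, 2],
    "category_id": [375, 412, 412],
})

merged = candidates.merge(user_topics, on="user_id", how="left")

# 1 if the candidate's category appears in the user's click history, else 0.
merged["is_cat_hab"] = [
    int(cat in set(cats))
    for cat, cats in zip(merged["category_id"], merged["cate_list"])
]
print(merged[["user_id", "category_id", "is_cat_hab"]])
```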

In [None]:
def user_cat_hab_fea(all_data, cols):
    """
    User's topic preferences.
    :param all_data: The dataset.
    :param cols: The feature columns to use.
    """
    user_category_hab_info = all_data[cols]
    user_category_hab_info = user_category_hab_info.groupby('user_id').agg(list).reset_index()

    user_cat_hab_info = pd.DataFrame()
    user_cat_hab_info['user_id'] = user_category_hab_info['user_id']
    user_cat_hab_info['cate_list'] = user_category_hab_info['category_id']

    return user_cat_hab_info

In [None]:
user_category_hab_cols = ['user_id', 'category_id']
user_cat_hab_info = user_cat_hab_fea(all_data, user_category_hab_cols)  

### 6.3.4 User Reading Length Preference

In [None]:
user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()
user_wcou_info.rename(columns={'words_count': 'words_hab'}, inplace=True)

### 6.4 Merge All User Information Features

In [None]:
# Merge all dataframes
user_info = pd.merge(user_act_fea, user_device_info, on='user_id')
user_info = user_info.merge(user_time_hab_info, on='user_id')
user_info = user_info.merge(user_cat_hab_info, on='user_id')
user_info = user_info.merge(user_wcou_info, on='user_id')

In [None]:
# Save user features so that they can be read directly in the future
user_info.to_csv(save_path + 'user_info.csv', index=False)
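
One caveat when persisting `user_info`: CSV has no list type, so a list-valued column such as `cate_list` comes back from `pd.read_csv` as a string unless it is parsed. The sketch below uses hypothetical toy data and an in-memory buffer to show the round trip and one way to restore the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

toy_user_info = pd.DataFrame({
    "user_id": [1, 2],
    "cate_list": [[281, 375], [412]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_user_info.to_csv(buffer, index=False)

# Plain read: the list column is now text such as "[281, 375]".
buffer.seek(0)
raw = pd.read_csv(buffer)

# Read with a converter: the text is parsed back into a Python list.
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"cate_list": ast.literal_eval})

print(type(raw.loc[0, "cate_list"]), type(parsed.loc[0, "cate_list"]))
print(375 in set(raw.loc[0, "cate_list"]), 375 in set(parsed.loc[0, "cate_list"]))
```

Without the converter, `set(...)` over the string yields individual characters, so a membership test against an integer category ID is always false.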

## 7. Read Features

### 7.1 Read User-related Features

In [None]:
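import ast  # cate_list is stored as text in the CSV, so parse it back into a list
user_info = pd.read_csv(save_path + 'user_info.csv', converters={'cate_list': ast.literal_eval})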

In [None]:
if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):
    trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')

if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):
    tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')

if os.path.exists(save_path + 'val_user_item_feats_df.csv'):
    val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')
else:
    val_user_item_feats_df = None

In [None]:
# Merge train, validation, and test sets with user-related features
# Below is for offline validation
trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')

if val_user_item_feats_df is not None:
    val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id', how='left')

In [None]:
trn_user_item_feats_df.columns

### 7.2 Read Article-related Features

In [None]:
# Load article info
articles = get_article_info_df()
articles_hot_fea = pd.read_csv(save_path + 'articles_hot_features.csv')  # written to save_path in section 6.2.2

In [None]:
# Merge article information with article popularity features
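# Preview only: the result is not assigned, because both tables are merged into the feature sets below.
articles.merge(articles_hot_fea, left_on='article_id', right_on='click_article_id', how='left').head()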

In [None]:
# Merge train, validation, and test sets with article-related features
trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')
trn_user_item_feats_df = trn_user_item_feats_df.merge(articles_hot_fea, on='click_article_id', how='left')

if val_user_item_feats_df is not None:
    val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')
    val_user_item_feats_df = val_user_item_feats_df.merge(articles_hot_fea, on='click_article_id', how='left')
else:
    val_user_item_feats_df = None

tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')
tst_user_item_feats_df = tst_user_item_feats_df.merge(articles_hot_fea, on='click_article_id', how='left')

### 7.3 Check if the Topic of the Recalled Article is in the User's Interest History

In [None]:
trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)

if val_user_item_feats_df is not None:
    val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)
else:
    val_user_item_feats_df = None
    
tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)

In [None]:
# For offline validation
del trn_user_item_feats_df['cate_list']

if val_user_item_feats_df is not None:
    del val_user_item_feats_df['cate_list']
else:
    val_user_item_feats_df = None

del tst_user_item_feats_df['cate_list']

del trn_user_item_feats_df['article_id']

if val_user_item_feats_df is not None:
    del val_user_item_feats_df['article_id']
else:
    val_user_item_feats_df = None

del tst_user_item_feats_df['article_id']

## 8. Save the Features for the User's Recall List

In [None]:
# Save the features for training, validation, testing sets
trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)

if val_user_item_feats_df is not None:
    val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)
    
tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)

## 9. Summary

"Data and features determine the ceiling of a model's performance; the algorithm only gets you close to that ceiling." The quality of the features often determines the final outcome.

In this section, we accomplished two major goals:

- Framing the Problem: We transformed our recall results into a supervised learning dataset by creating features and assigning labels. This converted our prediction problem into a format a model can easily understand and learn from.

- Building a Rich Feature Set: We've begun to create a series of features based on both user profiles and article profiles. These features go beyond raw data to capture nuanced information about user activity, preferences, and the characteristics of the content they engage with.

We also tackled the data imbalance problem by implementing negative sampling, which ensures our model gets a balanced view of both what users like and what they don't.