# Weak signal prioritization

## Table of contents

- [Table of contents](#table-of-contents)
- [Libraries](#libraries)
- [Global variables](#global-variables)
- [Load dataset](#load-dataset)
- [Prepare the subsets](#prepare-the-subsets)

## Libraries

In [26]:
from pathlib import Path

import numpy as np
import pandas as pd

## Global variables

In [20]:
ROOT = Path("../..") / "data"
PATH_ANONYM = ROOT / "anonymous_forum_filtered.csv"

## Load dataset

In [21]:
df = pd.read_csv(PATH_ANONYM, sep=",", encoding="utf-8")

In [22]:
print(f"Number of rows : {df.shape[0]}")
print(f"Columns: {df.columns.tolist()}")
df.head(3)

Number of rows : 1087011
Columns: ['msg_id', 'user', 'content', 'topic', 'deleted', 'banned', 'hour']


|   | msg_id | user | content | topic | deleted | banned | hour |
|---|--------|------|---------|-------|---------|--------|------|
| 0 | anon_msg_55908da50f0b | anon_user_a5f371a2c3 | La bière "Urine de Go*" vient d'arriver. | anon_topic_a181712c | 0 | 0 | 18 |
| 1 | anon_msg_ca8c3a80715a | anon_user_768dcf0a9f | Non mais ça va aller, les gargouilles le feron... | anon_topic_2e39b13d | 1 | 0 | 18 |
| 2 | anon_msg_f2279e23f198 | anon_user_f3ac5bb79a | Comme mercredi hein\nLes cliqueurs dans le déni | anon_topic_eb025aee | 1 | 0 | 17 |


## Prepare the subsets
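
The cell in this section orders messages by a priority weight: a Gaussian in `hour` centred on `mu`, divided by the number of messages posted in that hour, then multiplied by `(1 + lambda_banned * banned)` and `(1 + lambda_deleted * deleted)`. As a rough illustration only, the sketch below recomputes that weight in plain Python on made-up rows and produces a weighted ordering. The helpers `priority_weight` and `weighted_order` and the toy data are hypothetical, and the key-based ordering (Efraimidis-Spirakis style) is a stand-in for `df.sample`, not the notebook's implementation.

```python
import math
import random
from collections import Counter

# Same parameter values as the notebook cell.
MU, SIGMA = 6, 2.5
LAMBDA_BANNED, LAMBDA_DELETED = 10, 3


def priority_weight(hour, banned, deleted, hour_count):
    """Gaussian weight around MU, divided by the hour's message count, then boosted."""
    gauss = math.exp(-((hour - MU) ** 2) / (2 * SIGMA ** 2))
    boost = (1 + LAMBDA_BANNED * banned) * (1 + LAMBDA_DELETED * deleted)
    return gauss / hour_count * boost


def weighted_order(weights, seed=42):
    """Return all indices in a weighted random order (hypothetical helper).

    Each index gets the key log(u) / w with u uniform in (0, 1]; sorting the keys
    in descending order tends to place large weights first. Weights must be > 0.
    """
    rng = random.Random(seed)
    keys = [math.log(1.0 - rng.random()) / w for w in weights]
    return sorted(range(len(weights)), key=keys.__getitem__, reverse=True)


# Hypothetical toy rows: 20 messages per hour, a few flagged banned or deleted.
rows = [
    {"hour": h, "banned": int((h + k) % 17 == 0), "deleted": int((h + k) % 7 == 0)}
    for h in range(24)
    for k in range(20)
]
hour_counts = Counter(r["hour"] for r in rows)
weights = [
    priority_weight(r["hour"], r["banned"], r["deleted"], hour_counts[r["hour"]])
    for r in rows
]

order = weighted_order(weights)
first_50 = [rows[i] for i in order[:50]]

# Compare how close to MU the early rows are versus the whole toy dataset.
mean_gap_first = sum(abs(r["hour"] - MU) for r in first_50) / len(first_50)
mean_gap_all = sum(abs(r["hour"] - MU) for r in rows) / len(rows)
print(f"Mean |hour - mu| in the first 50 rows: {mean_gap_first:.2f}")
print(f"Mean |hour - mu| over all rows:        {mean_gap_all:.2f}")
```

Every index appears exactly once in `order`, so the result is a reordering of the rows rather than a subsample. With these weights, rows close to `mu` (and flagged rows in particular) should dominate the first positions, which is the property the fixed-size subsets rely on.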

In [31]:
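mu = 6  # hour on which the Gaussian weight is centred
sigma = 2.5  # spread of the Gaussian, in hours
lambda_banned = 10  # boost for banned users: weight is multiplied by (1 + lambda_banned)
lambda_deleted = 3  # boost for deleted content: weight is multiplied by (1 + lambda_deleted)

# Compute the Gaussian weight on the hour.
# The distance to mu is linear, not circular: hour 23 counts as 17 hours away from mu = 6.
df["gauss_weight"] = np.exp(-((df["hour"] - mu) ** 2) / (2 * sigma ** 2))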

# Normalize by number of messages per hour
hour_counts = df["hour"].value_counts().to_dict()
df["hour_count"] = df["hour"].map(hour_counts)
df["hour_norm_weight"] = df["gauss_weight"] / df["hour_count"]

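# Boost messages from banned users and messages that were deleted
df["priority_weight"] = df["hour_norm_weight"] * (1 + lambda_banned * df["banned"]) * (1 + lambda_deleted * df["deleted"])

# Weighted shuffle: drawing every row (frac=1.0) without replacement keeps all rows
# but tends to place high-priority rows first (pandas normalizes the weights itself)
df = df.sample(frac=1.0, random_state=42, weights=df["priority_weight"]).reset_index(drop=True)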

# Drop helper columns
df = df.drop(columns=["gauss_weight", "hour_count", "hour_norm_weight", "priority_weight"])

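# Split into fixed-size subsets, in priority order
size_of_the_datasets = 3000
# Floor division: the trailing df.shape[0] % size_of_the_datasets rows are not written
number_of_subsets = df.shape[0] // size_of_the_datasets
print(f"Number of subsets: {number_of_subsets}")

# Make sure the output directory exists before writing
subset_dir = ROOT / "subsets_Di"
subset_dir.mkdir(parents=True, exist_ok=True)

for i in range(number_of_subsets):
    start = i * size_of_the_datasets
    end = (i + 1) * size_of_the_datasets
    subset = df.iloc[start:end]
    subset.to_csv(subset_dir / f"subset_{i}.csv", index=False)
    print(f"Subset {i} created with {subset.shape[0]} rows.")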

Number of subsets: 362
Subset 0 created with 3000 rows.
Subset 1 created with 3000 rows.
Subset 2 created with 3000 rows.
Subset 3 created with 3000 rows.
Subset 4 created with 3000 rows.
Subset 5 created with 3000 rows.
Subset 6 created with 3000 rows.
Subset 7 created with 3000 rows.
Subset 8 created with 3000 rows.
Subset 9 created with 3000 rows.
Subset 10 created with 3000 rows.
Subset 11 created with 3000 rows.
Subset 12 created with 3000 rows.
Subset 13 created with 3000 rows.
Subset 14 created with 3000 rows.
Subset 15 created with 3000 rows.
Subset 16 created with 3000 rows.
Subset 17 created with 3000 rows.
Subset 18 created with 3000 rows.
Subset 19 created with 3000 rows.
Subset 20 created with 3000 rows.
Subset 21 created with 3000 rows.
Subset 22 created with 3000 rows.
Subset 23 created with 3000 rows.
Subset 24 created with 3000 rows.
Subset 25 created with 3000 rows.
Subset 26 created with 3000 rows.
Subset 27 created with 3000 rows.
Subset 28 created with 3000 rows.
S