# Topic Classification

This notebook builds a dataset from the **Zeeguu** data and defines the functions used to evaluate our models.

In [1]:
# ---------------------------- PREPARING NOTEBOOK ---------------------------- #
# Autoreload
%load_ext autoreload
%autoreload 2

# Random seed
import numpy as np
np.random.seed(42)

# External modules
import os
from IPython.display import display, Markdown, Latex, clear_output
from tqdm import notebook as tqdm

# Set global log level
import logging
logging.basicConfig(level=logging.INFO)

# Define PWD as the current git repository
import git
repo = git.Repo('.', search_parent_directories=True)
pwd = repo.working_dir
os.chdir(pwd)

## Dataset Creation

*The purpose of this dataset is to train models that assign one or more topics to each article.*

The dataset is created from **Zeeguu** data. It has two columns:
- **text**: the title of the article followed by its content
- **topic**: the topic associated with the article
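
Because the dataset is built by joining articles with their topic mappings, an article mapped to several topics would appear once per topic. The snippet below is a minimal sketch, on hypothetical toy data, of how such rows could be collapsed into one row per article with a list of labels. The names `pairs`, `multi_label` and the `labels` column are assumptions made for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame: one row per (article, topic) pair.
pairs = pd.DataFrame(
    {
        "text": ["Article A", "Article A", "Article B"],
        "topic": ["Sport", "Health", "Sport"],
    }
)

# Collapse to one row per article, gathering its topics into a sorted list.
multi_label = (
    pairs.groupby("text")["topic"]
    .agg(lambda topics: sorted(set(topics)))
    .reset_index()
    .rename(columns={"topic": "labels"})
)

print(multi_label)
```

A list-valued column like this is one possible input for a multi-label encoder. The single-label layout used in the rest of this notebook is kept as is.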

In [2]:
# ---------------------------- CREATING DATAFRAME ---------------------------- #
import pandas as pd
from sklearn.model_selection import train_test_split

zeeguu_data_path = os.path.join(
    pwd,
    "data",
    "processed",
    "recommendation",
)
csv_to_load = ["article_topic_map.csv", "topic.csv", "article.csv"]

# Load data
article_topic_map = pd.read_csv(os.path.join(zeeguu_data_path, csv_to_load[0]))
topic = pd.read_csv(os.path.join(zeeguu_data_path, csv_to_load[1]))
article = pd.read_csv(os.path.join(zeeguu_data_path, csv_to_load[2]))

# Keep only French articles
article = article[article["language_id"] == 7]

# Create DataFrame
df = pd.merge(
    article[["id", "title", "content"]],
    article_topic_map,
    left_on="id",
    right_on="article_id",
)
df = pd.merge(df, topic, left_on="topic_id", right_on="id")
df = df[["title_x", "content", "title_y"]].rename(
    columns={"title_x": "title", "title_y": "topic"}
)
df["text"] = df["title"] + "\n\n" + df["content"]
df = df[["text", "topic"]]

# Display
display(df.head())

# Split Train/Test
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Save
## Create folder if it does not exist
output_folder = os.path.join(pwd, "results", "topic_classification", "DataPreparation")
os.makedirs(output_folder, exist_ok=True)
## Save
df_train.to_csv(os.path.join(output_folder, "train.csv"), index=False)
df_test.to_csv(os.path.join(output_folder, "test.csv"), index=False)


|   | text | topic |
|---|------|-------|
| 0 | La France, championne de hand !\n\nb"franceinf... | Sport |
| 1 | Des enfants ont fait une course de chiens de t... | Sport |
| 2 | C’est quoi, le rallye Dakar ?\n\nb"Parce que l... | Sport |
| 3 | Grande-Bretagne: l'ex-entraîneur Barry Bennell... | Sport |
| 4 | Défi de Monte-Cristo : une course en mer à la ... | Sport |
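
The split above is purely random. If some topics are rare, a stratified split may be worth considering so that both sets keep the same topic proportions. The following is a sketch on hypothetical toy data, not the split used in this notebook. It relies on the `stratify` parameter of `train_test_split`, and the names `toy`, `toy_train` and `toy_test` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced topic column (15 vs. 5 articles).
toy = pd.DataFrame(
    {
        "text": [f"article {i}" for i in range(20)],
        "topic": ["Sport"] * 15 + ["Culture"] * 5,
    }
)

# stratify keeps the topic proportions identical in the train and test splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["topic"]
)

train_counts = toy_train["topic"].value_counts().to_dict()
test_counts = toy_test["topic"].value_counts().to_dict()
print(train_counts, test_counts)
```

Stratification requires every topic to have at least two articles, so very rare topics would need to be merged or dropped first.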


## Evaluation Function Creation

Now that we have a training set and a test set, we can create an evaluation function that lets us compare the performance of our models on the same metrics.

In [3]:
# modcell
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
    confusion_matrix,
)
import pandas as pd
import numpy as np


def compute_metrics(y_pred: pd.Series, y_real: pd.Series) -> pd.Series:
    """Compute classification metrics for predicted vs. real topics.

    Note the argument order: predictions first, ground truth second.
    Recall, precision and F1 are averaged over classes, weighted by support.
    """
    accuracy = accuracy_score(y_real, y_pred)
    recall = recall_score(y_real, y_pred, average="weighted")
    precision = precision_score(y_real, y_pred, average="weighted")
    f1 = f1_score(y_real, y_pred, average="weighted")

    # Row-normalized confusion matrix: each row (true class) sums to 1.
    # normalize="true" also avoids NaN rows for classes that appear only in y_pred.
    conf_matrix = confusion_matrix(y_real, y_pred, normalize="true")

    metrics = pd.Series(
        {
            "accuracy": accuracy,
            "recall": recall,
            "precision": precision,
            "f1_score": f1,
            "confusion_matrix": conf_matrix,
        }
    )

    return metrics
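
To make the choices in `compute_metrics` more concrete, here is a small sketch on hypothetical toy labels. It shows what `average="weighted"` computes for recall, and that the diagonal of the row-normalized confusion matrix is the per-class recall. The labels and variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical toy labels: 1 "Culture", 2 "Health" and 3 "Sport" articles.
y_real = ["Sport", "Sport", "Sport", "Culture", "Health", "Health"]
y_pred = ["Sport", "Sport", "Culture", "Culture", "Health", "Sport"]

# Rows are true classes, columns are predicted classes (labels sorted alphabetically).
cm = confusion_matrix(y_real, y_pred)
support = cm.sum(axis=1)  # number of true samples per class

# Row normalization: each row sums to 1, the diagonal is the per-class recall.
cm_norm = cm.astype("float") / support[:, np.newaxis]

# Per-class recall, then its support-weighted mean computed by hand.
per_class = recall_score(y_real, y_pred, average=None)
manual = float((per_class * support).sum() / support.sum())

# The same quantity as computed by scikit-learn.
weighted = recall_score(y_real, y_pred, average="weighted")

print(per_class, manual, weighted)
```

Weighted recall always equals accuracy, since each class recall is weighted by its share of the true samples. The precision and F1 values are the ones that add information beyond the accuracy entry.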