# Concatenate Corpora

This notebook is by Moacir P. de Sá Pereira.

This notebook assumes the existence of one dataset: 

A set of CSVs in `./dataframe_files`, each named `{corpus}_nnn.csv`. Each CSV holds up to 10,000 articles and was prepared in the `prepare-texts` notebook. The CSVs have the following columns:

- `goid`: Int. The article's identifier, carried over from the `prepare-texts` notebook.
- `title`: Str. The headline of the article.
- `date`: Str. The publication date, in `YYYY-MM-DD` format.
- `publisher`: Str. The article's publisher.
- `pub_title`: Str. The title of the publication.
- `author`: Str. The display name of the author, when available.
- `tokens`: Int. A naive word count, derived from splitting the full text on whitespace.
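
The `{corpus}_nnn.csv` naming convention can be matched strictly, so that a file is only ever assigned to the corpus it names. The following is a minimal sketch, assuming `nnn` is a run of digits; `files_for_corpus` and the example filenames are hypothetical and not part of this notebook.

```python
import re

def files_for_corpus(corpus, filenames):
    """Return the chunk files that belong to one corpus, ordered by chunk number."""
    # Assumption: "nnn" in `{corpus}_nnn.csv` is one or more digits.
    pattern = re.compile(rf"^{re.escape(corpus)}_(\d+)\.csv$")
    matches = [
        (int(m.group(1)), name)
        for name in filenames
        if (m := pattern.match(name))
    ]
    return [name for _, name in sorted(matches)]

# Hypothetical directory listing, deliberately out of order.
example_files = ["walmart_001.csv", "walmart_000.csv", "walgreens_000.csv", "notes.txt"]
selected = files_for_corpus("walmart", example_files)
print(selected)
```

Sorting by chunk number also makes the concatenation order independent of whatever order the file system happens to list the files in.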

The goal of this notebook is to restructure some of the data to address issues found in earlier versions of this codebase.

It is intended to be run once to create parquet files saved in `./full_fixed_parquet_files`. Each file, named `{corpus}.parquet`, contains the following data for each article we want to keep from a particular corpus saved in `./data/{corpus}`:

- `index`: Int. A consecutive index.
- `goid`: Int. As above.
- `date`: DT. The publication date, in datetime format.
- `tokens`: Int. A naive word count, derived from splitting the full text on whitespace.
- `corpus`: Str. The corpus name. This is used in the next notebook.
- `daily_article_count`: Int. The number of articles in the corpus for that day.
- `daily_token_sum`: Int. The sum of naive tokens in the corpus for that day.

The last two columns are new. They help us understand whether a text is particularly long for its day and whether it was published on a busy or slow news day.
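
As a rough illustration of how the two daily columns can be derived and used, the sketch below builds them on a three-row toy frame with `groupby(...).transform(...)`. The `relative_length` column is a hypothetical addition for illustration only; it is not written to the parquet files.

```python
import pandas as pd

# Toy data: two articles on one day, one article on the next.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-06-13", "2024-06-13", "2024-06-14"]),
    "tokens": [100, 300, 250],
})

by_day = toy.groupby("date")["tokens"]
# transform() broadcasts each day's aggregate back onto every row of that day.
toy["daily_article_count"] = by_day.transform("size")
toy["daily_token_sum"] = by_day.transform("sum")

# Hypothetical follow-up: article length relative to the day's mean length.
daily_mean = toy["daily_token_sum"] / toy["daily_article_count"]
toy["relative_length"] = toy["tokens"] / daily_mean
print(toy)
```

A `relative_length` above 1 marks an article longer than the average for its day, while a high `daily_article_count` marks a busy news day.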

Additionally, each corpus is trimmed in two ways:

1. Articles whose token count is at or above the corpus's 80th percentile are dropped, which removes roughly the longest 20% of articles.
2. Articles published on weekends (Saturday or Sunday) are dropped.
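
The two trimming rules can be seen on a small toy frame. This is a sketch under the same defaults as the notebook (an 80% quantile cutoff, weekends removed); the dates and token counts are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    # 2024-06-14 is a Friday; the 15th and 16th are a weekend.
    "date": pd.to_datetime(
        ["2024-06-14", "2024-06-15", "2024-06-16", "2024-06-17", "2024-06-18"]
    ),
    "tokens": [100, 200, 300, 400, 5000],
})

# Rule 1: keep only articles strictly below the 80th-percentile token count.
cutoff = toy["tokens"].quantile(0.8)
trimmed = toy[toy["tokens"] < cutoff]

# Rule 2: keep only weekdays (dayofweek is 0 for Monday through 6 for Sunday).
trimmed = trimmed[trimmed["date"].dt.dayofweek < 5]
print(cutoff)
print(trimmed)
```

Here the 5,000-token outlier is removed by the first rule and the Saturday and Sunday articles by the second, leaving the Friday and Monday rows.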

Each corpus is then shuffled and _reindexed_ with a consecutive index number, because the full parquet file acts as the source of truth for batch processing in the next notebook.
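
The shuffle-and-reindex step can be sketched on a toy frame as follows. The batch slice at the end is only an assumption about how a later notebook might consume the `index` column; the actual batching logic is not shown here.

```python
import pandas as pd

# Toy frame whose row labels are deliberately non-consecutive.
toy = pd.DataFrame({"goid": [11, 22, 33, 44]}, index=[7, 3, 9, 5])

# Shuffle with a fixed seed so reruns give the same order, then drop the old labels.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
# A second reset_index() turns the fresh 0..n-1 labels into an "index" column.
shuffled = shuffled.reset_index()

# Hypothetical batch of size 2: rows with index 0 and 1.
first_batch = shuffled[shuffled["index"].between(0, 1)]
print(shuffled)
```

Because the seed is fixed, the same `index` value always refers to the same article, as long as the input rows arrive in the same order.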
    

## Imports

In [None]:
%conda update -n base -c conda-forge conda

In [None]:
%conda install pyarrow=18.0.0

In [None]:
%conda install pandas=2.2.3

In [1]:
import os
import pandas as pd

## Constants

In [2]:
corpora = [
    "dollar-tree",
    "lululemon",
    "ulta",
    "walgreens",
    "walmart"
]

root_path = "/home/ec2-user/SageMaker"
dataframe_path = f"{root_path}/dataframe_files"
full_parquets_path = f"{root_path}/full_fixed_parquet_files"

dataframe_files = os.listdir(dataframe_path)

## Reconstruct Corpora

In [3]:
def concat_csvs(corpus, token_cutoff_percentage=0.8, omit_weekends=True):
    # Read every chunk for this corpus and concatenate once. Sorting the file
    # names keeps the row order (and so the seeded shuffle later) reproducible.
    chunks = [
        pd.read_csv(f"{dataframe_path}/{file}", index_col=0)
        for file in sorted(dataframe_files)
        if file.startswith(f"{corpus}_") and file.endswith(".csv")
    ]
    df = pd.concat(chunks, ignore_index=True)
    df["corpus"] = corpus # this lets us know what corpus we are using in later code.

    # Parse dates up front so `date` is a datetime even when weekends are kept.
    df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

    # Calculate the token breakpoint to remove outliers in terms of token counts.
    # Some articles are unrealistically long (several thousands of tokens) and are
    # likely aggregations of all kinds of information.
    cutoff = df["tokens"].quantile(token_cutoff_percentage)
    df = df[df["tokens"] < cutoff].copy()

    if omit_weekends:
        # Remove all articles that were published on weekends (Monday is 0).
        df = df[df["date"].dt.dayofweek < 5].copy()

    # Build up aggregate columns
    date_group = df.groupby("date")["tokens"]
    df["daily_article_count"] = date_group.transform("size")
    df["daily_token_sum"] = date_group.transform("sum")

    # Drop columns we do not need anymore
    df = df.drop(columns=["title", "publisher", "pub_title", "author"])

    return df

## Create Full Parquet Lists for Iteration in Next Notebook



In [4]:
# Make sure the output directory exists before writing to it.
os.makedirs(full_parquets_path, exist_ok=True)

for corpus in corpora:
    # Create a single dataframe from all the files in dataframe_files
    df = concat_csvs(corpus)
    # Shuffle reproducibly and discard the old row labels
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    # Expose the new consecutive row numbers as an explicit "index" column.
    df = df.reset_index()
    df.to_parquet(f"{full_parquets_path}/{corpus}.parquet")

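|   | index | goid | date | tokens | corpus | daily_article_count | daily_token_sum |
|---|-------|------|------|--------|--------|---------------------|-----------------|
| 0 | 0 | 2341682516 | 2019-11-01 | 3990 | walmart | 295 | 512898 |
| 1 | 1 | 2879485854 | 2023-10-17 | 4597 | walmart | 267 | 597333 |
| 2 | 2 | 3067752956 | 2024-06-14 | 705 | walmart | 4845 | 4322638 |
| 3 | 3 | 2412245758 | 2020-06-11 | 407 | walmart | 266 | 400490 |
| 4 | 4 | 2650312404 | 2022-04-15 | 1211 | walmart | 147 | 365830 |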
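
Before handing the files to the next notebook, it may be worth checking that a finished frame has the shape described at the top of this notebook. The sketch below is one possible check; `check_corpus_frame` is a hypothetical helper, and the sample frame reuses three of the rows shown above instead of reading a parquet file.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "index", "goid", "date", "tokens", "corpus",
    "daily_article_count", "daily_token_sum",
]

def check_corpus_frame(df):
    """Hypothetical helper: list the problems found in a finished corpus frame."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return ["unexpected columns"]
    problems = []
    if df["index"].tolist() != list(range(len(df))):
        problems.append("index is not consecutive from 0")
    if (df["date"].dt.dayofweek >= 5).any():
        problems.append("weekend articles present")
    if df["corpus"].nunique() > 1:
        problems.append("more than one corpus in the file")
    return problems

sample = pd.DataFrame({
    "index": [0, 1, 2],
    "goid": [2341682516, 2879485854, 3067752956],
    "date": pd.to_datetime(["2019-11-01", "2023-10-17", "2024-06-14"]),
    "tokens": [3990, 4597, 705],
    "corpus": ["walmart"] * 3,
    "daily_article_count": [295, 267, 4845],
    "daily_token_sum": [512898, 597333, 4322638],
})
problems = check_corpus_frame(sample)
print(problems)
```

In practice the frame would come from `pd.read_parquet` on one of the files in `full_parquets_path`; an empty list means none of these checks failed.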
