# Consolidate Analysis Files

This notebook is by Moacir P. de Sá Pereira.

This notebook consolidates the files in `./analyzed_batch_parquet_files/`, which take the form of `{corpus}_nnn.parquet`. To account for the tendency of TDM Studio to shut itself off, we saved our work in 1,000-article batches and generated 20 of those files for each corpus with every run until the system shut off. Given that the server automatically shuts down after two days and we can analyze about 500 articles an hour, no single run could ever complete the full 20 batches for all corpora.
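
The arithmetic behind that claim can be sketched as follows. The throughput, session length, and batch figures come from the paragraph above; the count of five corpora comes from the constants defined later in this notebook. The variable names are ours and purely illustrative.

```python
# Figures quoted in the paragraph above (all approximate).
ARTICLES_PER_HOUR = 500      # analysis throughput
SESSION_HOURS = 48           # the server shuts down after two days
BATCH_SIZE = 1_000           # articles per saved batch file
BATCHES_PER_CORPUS = 20      # batch files attempted per corpus in one run
N_CORPORA = 5                # corpora listed under "Constants"

# Articles we can analyze before the server shuts down.
articles_per_session = ARTICLES_PER_HOUR * SESSION_HOURS

# Articles a complete run would need to analyze.
articles_per_full_run = BATCH_SIZE * BATCHES_PER_CORPUS * N_CORPORA

print(articles_per_session)    # 24000
print(articles_per_full_run)   # 100000
```

At these rates, one session covers roughly a quarter of a full run, which is why the batch files accumulate across many runs.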

The consolidated files are saved as `./analyzed_full_parquet_files/{corpus}_x_{file_count}000.parquet`, where `file_count` indicates how deeply into the analysis we got. By the end of this project, we had accumulated 100,000 analyzed articles for the two largest corpora and exhausted the smallest three.

Next, we imported the consolidated parquet files into our GitHub repo and renamed them `{ticker}_sent.parquet`, where `ticker` is the company's stock symbol rather than the corpus name.
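
The renaming step is not part of this notebook, so the following is only a minimal sketch of how it could be done. The `TICKERS` mapping, the uppercase ticker spelling, and both function names are assumptions made for illustration, not the project's actual code.

```python
import shutil
from pathlib import Path

# Assumed corpus-to-ticker mapping; the notebook itself does not list tickers.
TICKERS = {
    "walgreens": "WBA",
    "walmart": "WMT",
    "dollar-tree": "DLTR",
    "lululemon": "LULU",
    "ulta": "ULTA",
}


def ticker_filename(consolidated_name):
    """Map '{corpus}_x_{file_count}000.parquet' to '{ticker}_sent.parquet'."""
    corpus = consolidated_name.split("_x_")[0]
    return f"{TICKERS[corpus]}_sent.parquet"


def copy_with_ticker_names(src_dir, dest_dir):
    """Copy consolidated files into dest_dir under their ticker-based names.

    If src_dir holds several files for one corpus, the last one in sorted
    order overwrites the earlier copies.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(src_dir).glob("*_x_*.parquet")):
        dest = dest_dir / ticker_filename(src.name)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Copying rather than moving leaves the consolidated originals in place, so the step can be repeated after a later run adds more batches.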

## Imports

In [None]:
import pandas as pd
from tqdm.notebook import tqdm

## Constants

In [None]:
corpora = [
    "walgreens",
    "walmart",
    "dollar-tree",
    "lululemon",
    "ulta"
]

root_path = "/home/ec2-user/SageMaker"
analyzed_files_path = f"{root_path}/analyzed_batch_parquet_files"
analyzed_full_parquet_path = f"{root_path}/analyzed_full_parquet_files"

In [None]:
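# Consolidate up to `file_count` batches of 1,000 articles per corpus
# (at most 100,000 analyzed articles each)
file_count = 100
for corpus in tqdm(corpora):
    dfs = []
    for i in range(1, file_count + 1):
        path = f"{analyzed_files_path}/{corpus}_{str(i).zfill(3)}.parquet"
        try:
            dfs.append(pd.read_parquet(path))
        except Exception:
            # Batch is missing or unreadable (interrupted run or exhausted corpus)
            continue
    if not dfs:
        # pd.concat raises ValueError on an empty list, so skip this corpus
        continue
    df = pd.concat(dfs)
    df.to_parquet(f"{analyzed_full_parquet_path}/{corpus}_x_{file_count}000.parquet")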
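
Because the loop above silently skips batches that cannot be read, it can help to check which batch files actually exist for a corpus before consolidating. The helper below is a hypothetical sketch, not part of the original workflow; it relies only on the `{corpus}_nnn.parquet` naming scheme described at the top of this notebook.

```python
from pathlib import Path


def existing_batches(batch_dir, corpus, max_batches=100):
    """Return the batch files present for `corpus`, in batch order."""
    candidates = (
        Path(batch_dir) / f"{corpus}_{i:03d}.parquet"
        for i in range(1, max_batches + 1)
    )
    return [path for path in candidates if path.is_file()]
```

For example, `len(existing_batches(analyzed_files_path, "ulta"))` gives the number of batches found for the `ulta` corpus, which shows how far the analysis got independently of the fixed `file_count` used in the output filename.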