# Reducing File Size

The original file from [Kaggle](https://www.kaggle.com/datasets/adityakharosekar2/guardian-news-articles?resource=download) is over 700 MB. The following code reduces the file size by keeping only articles published in 2022 and dropping all rows with empty body text.

In [1]:
import csv

# Raise the per-field size limit to 1 GiB: long article bodies can exceed
# the csv module's default limit of 131072 characters
csv.field_size_limit(1024 * 1024 * 1024)

input_file = "data/guardian_articles.csv"
output_file = "data/guardian_articles_2022.csv"

with open(input_file, mode='r', encoding='utf-8') as infile, \
     open(output_file, mode='w', encoding='utf-8', newline='') as outfile:
    
    reader = csv.reader(infile)
    writer = csv.writer(outfile)

    # Write the header
    header = next(reader)
    writer.writerow(header)

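    for row in reader:
        # Skip malformed rows that are too short to hold both columns read below
        if len(row) > 5:
            pub_date = row[5]              # publication date column
            body_content = row[4].strip()  # body text; whitespace-only counts as empty

            # Keep the row only if it was published in 2022 and has body text
            if pub_date.startswith("2022") and body_content:
                writer.writerow(row)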
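
The same filter can also be sketched as a small function that looks columns up by header name instead of position, which makes it easy to check on a tiny in-memory sample before running it on the full file. This is only an illustration, not part of the original notebook: the function name `filter_rows`, its parameters, and the sample column names `title`, `body`, and `date` are all hypothetical, and the real header names in the Kaggle file would need to be passed in.

```python
import csv
import io


def filter_rows(infile, outfile, date_col, body_col, year="2022"):
    """Copy rows whose date starts with `year` and whose body is non-empty.

    `date_col` and `body_col` are header names (assumed, not taken from
    the dataset). Returns (kept, total) data-row counts.
    """
    reader = csv.DictReader(infile)
    # extrasaction="ignore" drops surplus fields from over-long rows
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames,
                            extrasaction="ignore")
    writer.writeheader()

    kept = total = 0
    for row in reader:
        total += 1
        # Missing fields in short rows come back as None, so fall back to ""
        date = row.get(date_col) or ""
        body = (row.get(body_col) or "").strip()
        if date.startswith(year) and body:
            writer.writerow(row)
            kept += 1
    return kept, total


# Tiny made-up sample: one valid 2022 row, one empty body, one 2021 row
sample = io.StringIO(
    "title,body,date\n"
    "A,Some text,2022-01-05T10:00:00Z\n"
    "B,   ,2022-02-01T09:00:00Z\n"
    "C,Other text,2021-12-31T23:59:00Z\n"
)
out = io.StringIO(newline="")
counts = filter_rows(sample, out, date_col="date", body_col="body")
print(counts)  # (1, 3)
```

Returning the kept and total counts gives a quick sanity check on how much of the file the filter removes.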