# Exercise: Working with Massive Text Data

**Time: 15 minutes**

This exercise tests whether you actually understand the tools, not just whether you can copy code.

You'll need to:
1. Observe your specific machine's behavior
2. Fix broken code by reading error messages
3. Explain your reasoning (not just write code)

---

## Part 0: Record Your Environment

Run this cell and **write down the values**. You'll need them later.

In [None]:
import os
import psutil
import platform

# Your machine's specs
ram_gb = psutil.virtual_memory().total / (1024**3)
available_ram_gb = psutil.virtual_memory().available / (1024**3)
cpu_count = psutil.cpu_count()
os_name = platform.system()

# The data file
data_file = "../data/cc_news_large.csv"
file_size_gb = os.path.getsize(data_file) / (1024**3)

print("=" * 50)
print("YOUR ENVIRONMENT:")
print("=" * 50)
print(f"Operating System: {os_name}")
print(f"Total RAM: {ram_gb:.1f} GB")
print(f"Available RAM (right now): {available_ram_gb:.1f} GB")
print(f"Logical CPU cores: {cpu_count}")
print(f"Data file size: {file_size_gb:.2f} GB")
print("=" * 50)
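
If you would rather not copy the numbers by hand, one option is to save them to a small file you can reopen when you reach Part 3. This is a minimal sketch, not part of the graded exercise: the helper `save_environment`, the file name `environment.json`, and the example numbers are all hypothetical placeholders, so pass in the values printed by the cell above.

```python
import json
from pathlib import Path


def save_environment(values: dict, path: str = "environment.json") -> Path:
    """Write the recorded values to a JSON file and return its path."""
    out = Path(path)
    out.write_text(json.dumps(values, indent=2))
    return out


# Placeholder numbers for illustration only; replace them with the
# variables from the cell above (os_name, ram_gb, available_ram_gb, ...).
example_values = {
    "os_name": "Linux",
    "ram_gb": 16.0,
    "available_ram_gb": 9.5,
    "cpu_count": 8,
    "file_size_gb": 1.25,
}

saved = save_environment(example_values)
reloaded = json.loads(saved.read_text())
print(reloaded)
```

Note that available RAM changes from moment to moment, so the saved figure is a snapshot of when the cell ran.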

## Part 1: Debugging Broken Code (4 points)

The following code has **three bugs**. Run it, read the error messages, and fix them.

**Hint:** The bugs are NOT syntax errors. They're conceptual mistakes in how the tools are used.

In [None]:
# BROKEN CODE - Fix the 3 bugs
import polars as pl

# Task: Get all articles from February 2023, sorted by word count (longest first)
result = (
    pl.read_csv("../data/cc_news_large.csv")  # Bug 1: wrong polars method (function)
    .filter(pl.col("publication_date") >= "2023-02-01")  # Bug 2: wrong column name
    .filter(pl.col("date") < "2023-03-01")
    .with_columns([
        pl.col("text").str.split(" ").list.len().alias("word_count")
    ])
    .sort("word_count")  # Bug 3: wrong sort order for "longest first"
)

print(f"Found {len(result)} articles")
result.head(10)

### Your Fixed Code:

In [None]:
# Task: Get all articles from February 2023, sorted by word count (longest first)
result = (
    # ... your code ...
)

print(f"Found {len(result)} articles")
result.head(10)

### Bug Explanations

## Part 2: Timed Challenge

Write a query that answers this question:

> **Which domain published the most articles containing the word "election" in 2023?**

**Requirements:**
1. Use DuckDB (not Polars, not pandas)
2. Return the top 5 domains with their article counts
3. Record how long your query takes (the timer is built in)

In [None]:
import duckdb
import time

start = time.time()

result = duckdb.sql("""
                    
    -- Write your SQL query in this string...
    -- Filter for 2023
    -- Filter for articles containing 'election' (case-insensitive)
    -- Group by domain
    -- Order by count descending
    -- Limit to top 5

""").df()

elapsed = time.time() - start

print(f"Query completed in {elapsed:.2f} seconds")
print("\nTop 5 domains for 'election' articles in 2023:")
result

## Part 3: Reasoning Questions

Answer in 2-3 sentences each. No code required.

### Question 3a

Look at your environment values from Part 0. Given YOUR specific available RAM, approximately how large could a CSV file be before `pd.read_csv()` would fail?

Show some reasoning (not just a number).

_Your answer:_

### Question 3b

In the teaching demo, both DuckDB and Polars were faster than pandas, with DuckDB ahead by a wide margin.

Under what circumstances would you choose one over the other?

Give a specific scenario for each.

_Your answer:_

### Question 3c

You have a 100GB CSV file. Your laptop has 16GB RAM. You need to:
1. Filter to rows from a specific month
2. Count word frequencies across all remaining articles
3. Return the top 1000 words

Describe your approach in plain English. Which tool(s) would you use and why?

_Your answer:_

---

## Done?

Wait for the solution review. Use extra time to:
- Write a more complex query combining multiple filters
- Explore the data: what date ranges exist? What domains?

<br>

> Also, wait a moment before starting the next exercise.

## Part 4: "Cryptic Task"

- A table similar to the one we generated synthetically is available on a Postgres database server at `liip.econai.org`
- The port is the default one
- The user is `user` and the password is `BSEpass!$)`
- The database is `BSE` and the schema is `synthetic`

1. Investigate how to connect to and query that database using DuckDB (and/or another library)
    - The table is the only table in the schema
2. Compare the query times against running the same query locally
3. Why is there a difference in times?
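
For the timing comparison in step 2, a single measurement can be noisy, so it may help to run each query a few times and compare medians. The sketch below shows one way to do that with only the standard library. The helper name `time_query` and the stand-in workload are hypothetical; swap the lambda for a function that runs your local or remote query.

```python
import statistics
import time
from typing import Callable


def time_query(run_query: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock time in seconds over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        run_query()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload so the sketch runs on its own; replace the lambda
# with a call that executes your own query.
elapsed = time_query(lambda: sum(range(100_000)))
print(f"Median time: {elapsed:.4f} seconds")
```

Using the same helper and the same number of repeats for both the local and the remote query keeps the two measurements comparable.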

In [None]:
# Your code here...