## 02: Process text and counts

This script uses a pretrained [spaCy](https://spacy.io) model to extract entities from JSONL-formatted data and count them. It expects each record to have a `"meta"` dict with a `"utc"` value containing the UTC timestamp. Counts are computed per entity and per month, then saved as a CSV with one row per entity and one column per month. For example:

```csv
,2012-01,2012-02
meat,1011.0,873.0
salt,805.0,897.0
chicken,694.0,713.0
```
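As a rough illustration of how the per-entity, per-month counts come about, the sketch below buckets a few made-up records by month using only the standard library. The records, their `"entities"` lists, and the `month_key` helper are hypothetical; in the actual script the entities come from the spaCy model rather than from the data.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical records: the "entities" lists stand in for what the model would extract
records = [
    {"meta": {"utc": 1325376000}, "entities": ["Salt", "meat"]},  # 2012-01-01
    {"meta": {"utc": 1325462400}, "entities": ["salt"]},          # 2012-01-02
    {"meta": {"utc": 1328054400}, "entities": ["meat"]},          # 2012-02-01
]


def month_key(timestamp):
    """Convert a Unix timestamp to a "YYYY-MM" string in UTC."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).strftime("%Y-%m")


# counts[entity][year_month] -> number of mentions
counts = defaultdict(Counter)
for record in records:
    key = month_key(record["meta"]["utc"])
    for entity in record["entities"]:
        counts[entity.lower()][key] += 1

print(dict(counts["salt"]))  # {'2012-01': 2}
```

Lowercasing the entity text means `"Salt"` and `"salt"` land in the same row.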

> ⚠️ **Important note:** If you have a lot of data, you probably want to split up your raw data and run multiple jobs in parallel. The next script, which calculates the final counts and variance, can take a directory of `.csv` files as its input, so reconciling the counts afterwards is no problem.
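One possible way to split the raw data is to distribute the lines of the JSONL file round-robin across several shard files and run this notebook once per shard. The sketch below is only an assumption about how that could be done; the `split_jsonl` helper and the shard naming scheme are hypothetical and not part of the original scripts.

```python
from pathlib import Path


def split_jsonl(input_path, output_dir, n_shards):
    """Distribute the records of a JSONL file round-robin across n_shards files."""
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <stem>_000.jsonl, <stem>_001.jsonl, ...
    shard_paths = [
        output_dir / f"{input_path.stem}_{i:03d}.jsonl" for i in range(n_shards)
    ]
    handles = [path.open("w", encoding="utf8") for path in shard_paths]
    try:
        n_records = 0
        with input_path.open("r", encoding="utf8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                handles[n_records % n_shards].write(line.rstrip("\n") + "\n")
                n_records += 1
    finally:
        for handle in handles:
            handle.close()
    return shard_paths
```

Each shard can then be used as `DATA_FILE` with its own `OUTPUT_FILE`, which produces one `.csv` per job.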

In [6]:
SPACY_MODEL = "./tmp_model/model-best"      # path to spaCy model with entity recognizer
DATA_FILE = "./corpusSkillsCleaned.jsonl"   # preprocessed Reddit data created in previous step
OUTPUT_FILE = "./raw_counts.csv"            # path to output file
N_PROCESSES = 16                            # number of processes for multiprocessing
ENTITY_LABEL = "SKILL"                      # label of entity to count

In [2]:
!pip install spacy srsly pandas

Collecting pandas
  Obtaining dependency information for pandas from https://files.pythonhosted.org/packages/ff/5a/c7359edec58500b35da8dc40a69ea7b0a3be48a479e1c91e8e8d0a2d9aa7/pandas-2.1.1-cp310-cp310-macosx_11_0_arm64.whl.metadata
  Downloading pandas-2.1.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (18 kB)
Downloading pandas-2.1.1-cp310-cp310-macosx_11_0_arm64.whl (10.9 MB)
Installing collected packages: pandas
Successfully installed pandas-2.1.1


In [2]:
import spacy
from collections import Counter, defaultdict
import srsly
from datetime import datetime
import pandas as pd

In [8]:
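# Nested counts: counts[entity_text][year_month] -> number of mentions
counts = defaultdict(Counter)
nlp = spacy.load(SPACY_MODEL)
data = srsly.read_jsonl(DATA_FILE)

# Stream (text, record) pairs so each Doc stays paired with its source record
data_tuples = ((eg["text"], eg) for eg in data)
for doc, eg in nlp.pipe(data_tuples, as_tuples=True, n_process=N_PROCESSES):
    # Read the timestamp from the top level, falling back to the "meta" dict
    utc = eg["utc"] if "utc" in eg else eg["meta"]["utc"]
    timestamp = int(utc)
    year_month = datetime.utcfromtimestamp(timestamp).strftime("%Y-%m")
    for ent in doc.ents:
        if ent.label_ == ENTITY_LABEL:
            counts[ent.text.lower()][year_month] += 1

# One row per entity, one column per month
df = pd.DataFrame(data=counts).transpose()
df.to_csv(OUTPUT_FILE)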
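
To make the shape of the resulting table concrete, here is a small sketch that builds the same kind of nested counter by hand and converts it with pandas. The entity names and counts are made up for illustration.

```python
from collections import Counter, defaultdict

import pandas as pd

# Made-up counts: counts[entity][year_month] -> number of mentions
counts = defaultdict(Counter)
counts["meat"]["2012-01"] += 2
counts["meat"]["2012-02"] += 1
counts["salt"]["2012-02"] += 1  # "salt" never appears in 2012-01

# Outer keys become columns, so transpose to get one row per entity
df = pd.DataFrame(data=counts).transpose()
print(df)
```

Entity/month combinations that never occur come out as `NaN`, so pandas stores the table as floats, which is why the CSV contains values such as `1011.0`. Filling the gaps with `df.fillna(0)` is one option if zeros are preferred, though whether the downstream script expects that is an assumption to check.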