# Proccess Notebook

__by Sean Gilleran__  
__Last updated November 28__, __2021__  
[https://github.com/seangilleran/ia-compmag-collect](https://github.com/seangilleran/ia-compmag-collect)

This notebook processes raw text files into data more suitable to topic modelling. In order, the steps that it performs are:

1. Load text files.
2. Replace Unicode characters with ASCII equivalents.
3. "De-fuzz" the text, i.e.:
   * Combine hyphenated words split across lines using Regex.
   * Remove isolated special characters (most of these are probably OCR artifacts).
   * Check and standardize spelling.
4. Tokenize the text into individual words.
5. Remove stopwords.
6. Replace lemmas with a common equivalent where possible.
7. Save the result to a new file.

## 1. Initialization

### 1.1 Import & Initialization

Python imports. Set up NLTK and other tools here. Make sure to run this before the other parts of the notebook!

In [1]:
from datetime import datetime
import os

from autocorrect import Speller
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import regex as re
from unidecode import unidecode

nltk.download("averaged_perceptron_tagger")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("universal_tagset")
wnl = WordNetLemmatizer()

dehyphenator = re.compile(r"(?<=[A-Za-z])-\s\n(?=[A-Za-z])")
defuzzer = re.compile(r"([^a-zA-Z0-9]+)")

spell = Speller(only_replacements=True)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sgill\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sgill\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sgill\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sgill\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\sgill\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


### 1.2 Input & Output Paths

Files should be in `.txt` format in the `in_path` directory. Once processed, these will be duplicated into the `out_path` directory.

In [2]:
in_path = "./corpus/raw"
out_path = "./corpus/out"

print(f" in_path: {os.path.abspath(in_path)}")
print(f"out_path: {os.path.abspath(out_path)}")

 in_path: g:\sgill\Development\ia-compmag-collect\corpus\raw
out_path: g:\sgill\Development\ia-compmag-collect\corpus\out


### 1.3 Add Stop Words

Load the NLTK stop word list and append our own, if necessary.

In [3]:
stop_words = set(stopwords.words("english"))

with open("stopwords.txt", "r", encoding="utf-8") as f:
    stop_words.update([w.strip() for w in f.readlines()])

print(stop_words)

{'my', 'bei', 'piv', 'tht', 'mul', 'our', 'under', 'artd', 'itself', 'hou', 'where', 'rir', 'jun', 'yourselves', 'jiv', 'yoy', "won't", "it's", 'mil', 'doesn', 'than', "shan't", 'liie', 'weren', 'wilti', 'rhe', 'pvn', 'her', 'when', 'be', 'at', 'ihe', "should've", 'gama', 'jou', 'herself', 'fhe', 'haa', 'deg', "you've", 'hui', 'ihia', 'ma', 'haven', 'of', "aren't", 't', 'so', 'aren', 'bul', 'and', 'piu', 'tiave', 'ttii', 'just', 'itiey', 'iiri', 'not', 'm', 'ihat', 'from', 'few', 'their', 'has', 'up', 'only', 'hli', "wasn't", 'iib', "wouldn't", 'ttie', 'been', 'itie', 'lor', 'mustn', 'wtiich', 'needn', 'an', 'if', 'ttm', 'tha', 'he', 'doing', 'it', 'until', 'some', 'll', 've', 'hers', 'couldn', 'hadn', 'arki', 'iht', 'tiy', 'was', 'rtu', 'thi', 'isn', "you'd", "needn't", 'these', 'there', 'cai', 'noa', 'wos', 'ria', 'no', 'ond', 'ibe', 'giv', 'we', 'on', 'copr', 'ihey', 'here', 'any', 'rei', 'anfl', 'pue', 'all', 'vwi', "hasn't", 'after', 'for', 'does', 'what', 'mfi', 're', 'because', 

## 2. Processing

### 2.1 Find Files

Look for raw files to process in the `out_path` directory. Check against the `in_path` directory to make sure we're not doubling up. This is also a handy way of being able to pause and resume our work.

In [4]:
files = []
total_count = 0
skip_count = 0

for file in [f for f in os.listdir(in_path) if f.endswith(".txt")]:

    if os.path.exists(os.path.join(out_path, file)):
        skip_count = skip_count + 1
        continue

    files.append(file)
    total_count = total_count + 1

print(f"Found {total_count} files to process ({skip_count} skipped).")

Found 8841 files to process (443 skipped).


### 2.2 Process Files

This can take a very long time, especially with large data sets! We'll print out a message before each file with a note as to how far we've gotten.

In [5]:
i = 0
for file in files:

    timestamp = datetime.now().strftime("%X")
    i = i + 1
    print(f"[{timestamp}] {i}/{total_count} ({((i / total_count)*100.0):.0f}%): {file}")

    # Load file, remove special characters.
    text = ""
    with open(os.path.join(in_path, file), "r", encoding="utf-8") as f:
        text = unidecode(f.read())
    if not text or text == None or text == "":
        print("No content in file!")
        continue

    # De-fuzz.
    text = dehyphenator.sub("", text)
    text = defuzzer.sub(" ", text)
    text = spell(text)

    # Tokenize.
    tokenized_text = word_tokenize(text)
    text = []

    # Remove stopwords.
    for word in [w for w in tokenized_text if w not in stop_words]:
        text.append(word)

    # Lemmatize.
    text = pos_tag(text, tagset="universal")
    for x in range(len(text)):
        word, pos = text[x]
        if pos == "VERB":
            pos = "v"
        elif pos == "ADJ":
            pos = "a"
        elif pos == "ADV":
            pos = "r"
        else:
            pos = "n"
        text[x] = wnl.lemmatize(word, pos=pos)

    # Save updated text to new file.
    text = " ".join(text)
    with open(os.path.join(out_path, file), "w") as f:
        f.write(text)

print("** DONE **")

ValueError: Invalid format string