# Preprocessing 

## Lemmatization

### What is lemmatization?

Lemmatization is a procedure that reduces the inflected forms of a word to a common base form, called the lemma.

English has minimal inflection (e.g. words can be inflected by number: "cat" becomes "cats" in the plural). Other languages, however, have much more inflection. Words can vary, for example, according to whether the word is definite or indefinite, and also according to number and gender.

### Creating a lemmatized version of your corpus

For methods that rely on word counts (e.g. frequency counts, tf-idf), it's best to use lemmatized text so that as many as possible of the word forms we want counted together are counted together. For topic modeling, however, there is evidence that lemmatization is unnecessary and may even be counterproductive. The finding quoted below concerns stemming, a related technique that strips word endings.

> "Stemming has been found to provide little measurable benefits for topic modeling and can sometimes even be harmful (Schofield and Mimno, 2016)." (Nguyen et al., "How We Do Things With Words," p. 8)


It might be good practice to keep both a lemmatized and an unlemmatized version of your corpus so you can experiment with which one produces the most meaningful outputs.
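
To see why lemmatization matters for count-based methods, here is a minimal sketch. The `toy_lemmas` dictionary is a hypothetical hand-written lookup that stands in for a real lemmatizer such as spaCy; it is only there so the example runs without a language model.

```python
from collections import Counter

# Hypothetical hand-written lookup standing in for a real lemmatizer
toy_lemmas = {"cats": "cat", "sat": "sit", "sits": "sit"}

tokens = ["the", "cat", "sat", "the", "cats", "sit", "a", "cat", "sits"]

# Counting surface forms keeps "cat" and "cats" apart
raw_counts = Counter(tokens)

# Counting lemmas merges the inflected forms of the same word
lemma_counts = Counter(toy_lemmas.get(token, token) for token in tokens)

print(raw_counts["cat"], raw_counts["cats"])     # 2 1
print(lemma_counts["cat"], lemma_counts["sit"])  # 3 3
```

Comparing the two counters on your own corpus is one way to judge how much lemmatization changes the numbers you care about.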

**Lemmatizing multiple files**

In [None]:
#The cells below loop over multiple files in a directory,
#but the kernel might crash if it runs out of memory.
#If the kernel crashes, you might have to lemmatize one file at a time (cf. below).
#If spaCy instead reports that a text exceeds nlp.max_length (1,000,000 characters by default),
#you can raise the limit after loading the model, e.g. nlp.max_length = 2000000

#Lemmatizing using spaCy for English
import spacy
import glob

#Download the language model you're interested in (this is the English pipeline)
!python -m spacy download en_core_web_md

In [None]:
#Load language model
nlp = spacy.load('en_core_web_md')

In [None]:
#Set the path to the corpus directory
filepath = 'soderberg-corpus'
#Skip output files from earlier runs so they are not lemmatized again
text_files = [file for file in sorted(glob.glob(f'{filepath}/*.txt'))
              if not file.endswith('-lemmatized.txt')]

#Loop through the files and open each one as a spaCy document
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    print(file)
    document = nlp(text)

    #Lemmatize each file and create a new file with '-lemmatized.txt' added to the name
    outname = file[:-len('.txt')] + '-lemmatized.txt'
    with open(outname, 'w', encoding='utf-8') as out:
        for token in document:
            #Write the lowercased lemma of each token
            out.write(token.lemma_.lower())
            #Insert white space between each token
            out.write(' ')
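
If a single file is too large to process in one pass, one option is to split it into smaller pieces and lemmatize each piece separately. The sketch below is an illustration only: `split_into_chunks` and its `max_chars` parameter are hypothetical names, and it assumes paragraphs are separated by blank lines. A single paragraph longer than `max_chars` is kept whole, so chunks can exceed the limit in that case.

```python
def split_into_chunks(text, max_chars=100_000):
    """Split text on blank lines into chunks of roughly max_chars characters."""
    chunks = []
    current = ''
    for paragraph in text.split('\n\n'):
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = current + '\n\n' + paragraph if current else paragraph
    if current:
        chunks.append(current)
    return chunks


# Small demonstration with an artificially low limit
sample = 'aaaa\n\nbbbb\n\ncccc'
chunks = split_into_chunks(sample, max_chars=10)
print(len(chunks))
```

Each chunk could then be passed to `nlp` in turn, with the lemmas of every chunk written to the same output file.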

**Lemmatizing single files**

In [None]:
#Lemmatizing single files

#Lemmatizing using spaCy for English
import spacy
#!python -m spacy download en_core_web_md

In [None]:
#Load the language model
nlp = spacy.load('en_core_web_md')

#Open your text and create spaCy document
filepath = 'soderberg-corpus/1897_Drizzle.txt'
with open(filepath, encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

outname = filepath.replace('.txt', '-lemmatized.txt')
with open(outname, 'w', encoding='utf8') as out:   
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')

**Checking over the lemmatized forms**

In [None]:
#This prints each original word in the text, a dash, and then the lemma
#that was written to the derivative text document.
#Check whether there are places where the model consistently makes mistakes.
#Only the first 50 tokens are printed; change the slice on document to see more.
for token in document[:50]:
    print(token.text + ' - ' + token.lemma_)
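
Reading token by token works for a spot check, but consistent mistakes are easier to see in aggregate. The sketch below counts how often each word was changed to a given lemma. `most_common_changes` is a hypothetical helper, and `sample_pairs` is made-up data; with spaCy you could pass `((token.text, token.lemma_) for token in document)` instead.

```python
from collections import Counter

def most_common_changes(pairs, n=10):
    """Count (word, lemma) pairs in which the lemma differs from the lowercased word."""
    changes = Counter(
        (word.lower(), lemma.lower())
        for word, lemma in pairs
        if word.lower() != lemma.lower()
    )
    return changes.most_common(n)

# Hypothetical sample of (original word, lemma) pairs
sample_pairs = [("Cats", "cat"), ("were", "be"), ("cats", "cat"), ("the", "the"), ("was", "be")]

for (word, lemma), count in most_common_changes(sample_pairs):
    print(f'{word} - {lemma}: {count}')
```

A frequent pair that looks wrong is a good candidate for manual correction or for a custom rule.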

# Using spaCy to create a tokenized version of Chinese and Korean texts

Some languages do not separate words with spaces. One way to tokenize text in these languages is to artificially insert spaces between the words. This is called segmentation. We can use spaCy to create a *segmented derivative* of the original text.
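
As a minimal illustration of what segmentation buys you, the sketch below compares splitting on whitespace before and after spaces are inserted. The token boundaries in `tokens` are written by hand for illustration; in practice they would come from a tokenizer such as spaCy.

```python
# Unsegmented Chinese text: "I like reading books"
raw = "我喜欢读书"

# Hypothetical token boundaries, written by hand for illustration
tokens = ["我", "喜欢", "读书"]

# Segmentation: insert a space between the tokens
segmented = " ".join(tokens)

print(raw.split())        # ['我喜欢读书']
print(segmented.split())  # ['我', '喜欢', '读书']
```

Tools that split on whitespace treat the unsegmented sentence as a single word, while the segmented version yields one item per token.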

In [None]:
# Imports
import spacy

In [None]:
#Download the language model you're interested in
#e.g. for Chinese: python -m spacy download zh_core_web_sm
#e.g. for Korean: python -m spacy download ko_core_news_sm
#Visit: https://spacy.io/usage/models#languages for more
!python -m spacy download ko_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('ko_core_news_sm')

#Create spaCy document
with open('korean-corpus.txt', encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

In [None]:
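#Create a segmented version of the original text file:
#loop through each token in the document, lemmatize and lowercase it,
#and write it to a new file with a space after each token.
#Note: if the loaded pipeline has no lemmatizer component (check nlp.pipe_names),
#token.lemma_ may be empty; in that case write token.text instead.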

filepath = 'korean-corpus.txt'
outname = filepath.replace('.txt', '-segmented.txt')
with open(outname, 'w', encoding='utf8') as out:
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')

The code cell below prints the lemma of each token (words and punctuation) on its own line, so you can see how successfully spaCy identified word boundaries.

In [None]:
for token in document:
    print(token.lemma_)

_Acknowledgements_: This notebook is inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/Multilingual/Chinese/03-POS-Keywords-Chinese.html#keyword-extraction).