# Preprocessing 

## Lemmatization

### What is lemmatization?

Lemmatization is a procedure that reduces the inflected forms of a word to a common base form, or lemma.

English has minimal inflection (e.g. words can be inflected by number: "cat" becomes "cats" in the plural). Other languages, however, have much more inflection. Words can vary, for example, according to whether the word is definite or indefinite, and also according to number and gender.

### Creating a lemmatized version of your corpus

For methods that rely on word counts (e.g. frequency counts, TF-IDF), it's best to use lemmatized text so that as many as possible of the words we want counted together are counted together. For topic modeling, however, there is evidence that this kind of normalization is unnecessary and may even be counterproductive (the study quoted below looked at stemming, a cruder relative of lemmatization).

> "Stemming has been found to provide little measurable benefits for topic modeling and can sometimes even be harmful (Schofield and Mimno, 2016)." (Nguyen et al., "How We Do Things With Words," p. 8)


It might be good practice to have a lemmatized and unlemmatized version of your corpus so you can experiment with which one produces the most meaningful outputs.
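
To make the effect on word counts concrete, here is a minimal sketch that counts the same sentence with and without lemmatization. The `TOY_LEMMAS` table and the `count_words` helper are hypothetical stand-ins written for this illustration; a real workflow would take lemmas from a trained model such as the spaCy pipeline used in this notebook.

```python
from collections import Counter

# Hypothetical, hand-written lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"cats": "cat", "is": "be", "are": "be", "days": "day"}

def count_words(tokens, lemmas=None):
    """Count lowercased tokens, optionally mapping each to its lemma first."""
    lemmas = lemmas or {}
    return Counter(lemmas.get(tok.lower(), tok.lower()) for tok in tokens)

tokens = "The cat is here and the cats are there".split()

raw_counts = count_words(tokens)                # unlemmatized counts
lemma_counts = count_words(tokens, TOY_LEMMAS)  # lemmatized counts

# "cat" and "cats" are separate entries in raw_counts,
# but share a single entry in lemma_counts.
print(raw_counts["cat"], raw_counts["cats"])
print(lemma_counts["cat"], lemma_counts["be"])
```

Running both counts side by side like this is one way to compare the lemmatized and unlemmatized versions of a corpus.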

**Lemmatizing multiple files**

In [1]:
#This loops over multiple files in a directory,
#but it might make the kernel crash if it runs out of memory.
#If the kernel crashes, you might have to lemmatize one file at a time (cf. below).
#If spaCy instead reports that a text exceeds nlp.max_length (1,000,000 characters by default),
#raise the limit after loading the model, e.g. nlp.max_length = 2000000

#Lemmatizing using spaCy for English
import spacy
import glob

#Download the language model you're interested in (this is the English pipeline)
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 26.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [2]:
#Load language model
nlp = spacy.load('en_core_web_md')

In [3]:
#Set filepath
filepath = 'soderberg-corpus/'
text_files = glob.glob(f'{filepath}/*.txt')

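#Loop through the files and open each one as a spaCy document
for file in text_files:
    #Skip output files left over from a previous run of this cell
    if file.endswith('-lemmatized.txt'):
        continue
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    print(file)
    document = nlp(text)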
        
    #Lemmatize each file and create new file with '-lemmatized.txt' added to name
    outname = file.replace('.txt', '-lemmatized.txt')
    with open(outname, 'w', encoding='utf8') as out:   
        for token in document:
            # Get the lemma for each token
            out.write(token.lemma_.lower())
            # Insert white space between each token
            out.write(' ')

soderberg-corpus/1900_TheSixthSense.txt
soderberg-corpus/1897_TheDreamOfEternity.txt
soderberg-corpus/1897_PatriarchPapinianus.txt
soderberg-corpus/1903_TheBlueAnchor.txt
soderberg-corpus/1912_TheSeriousGame.txt
soderberg-corpus/1904_TheConsulGeneralAtThePalaceBall.txt
soderberg-corpus/1901_SatanTheMajorandtheCourtChaplain.txt
soderberg-corpus/1903_TheKiss.txt
soderberg-corpus/1897_TheFurCoat.txt
soderberg-corpus/1903_AfterDinner.txt
soderberg-corpus/1894_TheHistoryTeacher.txt
soderberg-corpus/1897_VoxPopuli.txt
soderberg-corpus/1897_Drizzle.txt
soderberg-corpus/1905_DoctorGlas.txt
soderberg-corpus/1898_Communion.txt
soderberg-corpus/1913_TheTalentedDragon.txt
soderberg-corpus/1898_TheClown.txt
soderberg-corpus/1907_Bloom.txt
soderberg-corpus/1924_ChurchyardArabesque.txt
soderberg-corpus/1905_TheBurningTown.txt
soderberg-corpus/1901_Rugg.txt
soderberg-corpus/1905_TheChinese.txt
soderberg-corpus/1907_TheDarknessGathers.txt
soderberg-corpus/1898_TrueStory.txt
soderberg-corpus/1898_Archim
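
If a single file is too large to process in one go, one option is to split its text into smaller pieces and pass them to the model one at a time. The sketch below shows only the splitting step, using plain Python. The `chunk_text` helper and its `max_chars` parameter are hypothetical names chosen for this illustration, and breaking at blank lines assumes that paragraphs in the corpus are separated by blank lines.

```python
def chunk_text(text, max_chars=100_000):
    """Split text into chunks of at most max_chars, breaking at blank lines.

    A single paragraph longer than max_chars is kept whole, so a chunk
    can exceed the limit in that case.
    """
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # +2 accounts for the blank line that rejoins paragraphs.
        added = len(paragraph) + (2 if current else 0)
        if current and size + added > max_chars:
            # The current chunk is full: store it and start a new one.
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(paragraph)
        current.append(paragraph)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
chunks = chunk_text(sample, max_chars=90)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be passed to `nlp(chunk)` in turn, with the lemmas written to the same output file in order.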

**Lemmatizing single files**

In [None]:
#Lemmatizing single files

#Lemmatizing using spaCy for English
import spacy
#!python -m spacy download en_core_web_md

In [6]:
#Load the language model
nlp = spacy.load('en_core_web_md')

#Open your text and create spaCy document
filepath = 'soderberg-corpus/1897_Drizzle.txt'
with open(filepath, encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

outname = filepath.replace('.txt', '-lemmatized.txt')
with open(outname, 'w', encoding='utf8') as out:   
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')
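
Building the output name with `str.replace` changes every occurrence of `.txt` in the path, including one in a directory name. The sketch below is one possible alternative using `pathlib`. The helper names `lemmatized_name` and `is_derivative` are hypothetical and not part of spaCy or of the cells above.

```python
from pathlib import Path

SUFFIX = "-lemmatized"  # marker appended to derivative file names

def lemmatized_name(path):
    """Return the output path for a source file: 'x.txt' -> 'x-lemmatized.txt'."""
    path = Path(path)
    # Only the final file name is changed; directory names are left alone.
    return path.with_name(path.stem + SUFFIX + path.suffix)

def is_derivative(path):
    """Return True if the file name looks like output from an earlier run."""
    return Path(path).stem.endswith(SUFFIX)

print(lemmatized_name("soderberg-corpus/1897_Drizzle.txt"))
print(is_derivative("soderberg-corpus/1897_Drizzle-lemmatized.txt"))
```

A check like `is_derivative` can also be used to skip files that were produced by a previous run when looping over a directory.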

**Checking over the lemmatized forms**

In [7]:
#This prints each original word in the text,
#a dash, then its lemma (the derivative text document contains the lowercased lemma).
#Check whether there are places where the model consistently makes mistakes.
#This prints the first 50 tokens; modify the slice next to document for more.
for token in document[:50]:
    print(token.text + ' - ' + token.lemma_)

Drizzle - Drizzle

 - 

It - it
is - be
autumn - autumn
again - again
and - and
the - the
days - day
are - be
dark - dark
, - ,
and - and
the - the
sun - sun
is - be
hiding - hide
in - in
the - the
gloomiest - gloomy
comer - comer
of - of
space - space
for - for
fear - fear
that - that
someone - someone
might - might
notice - notice
how - how
pale - pale
and - and
old - old
and - and
worn - wear
out - out
she - she
has - have
become - become
lately - lately
. - .
But - but
while - while
the - the
wind - wind
whistles - whistle
about - about
the - the
windows - window
and - and
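
Reading the pairs one by one works for a short sample. For a longer text, it may help to tally how often each word is mapped to a different lemma and review the most frequent mappings first. The sketch below does this with plain tuples copied from the output above. The `changed_pairs` helper is a hypothetical name, and words that the model left unchanged by mistake will not show up in this tally.

```python
from collections import Counter

def changed_pairs(pairs):
    """Count (word, lemma) pairs where the lemma differs from the lowercased word."""
    return Counter(
        (word.lower(), lemma.lower())
        for word, lemma in pairs
        if word.lower() != lemma.lower()
    )

# Hand-copied from the output above; with spaCy this could instead be
# [(token.text, token.lemma_) for token in document].
pairs = [("is", "be"), ("days", "day"), ("are", "be"), ("dark", "dark"),
         ("is", "be"), ("hiding", "hide"), ("worn", "wear")]

# Show the most frequent word -> lemma mappings first.
for (word, lemma), n in changed_pairs(pairs).most_common(3):
    print(f"{word} -> {lemma}: {n}")
```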


# Using spaCy to create a tokenized version of Chinese and Korean texts

Some languages do not separate words with spaces. One way to tokenize text in these languages is to insert spaces artificially at word boundaries. This is called segmentation. We can use spaCy to create a *segmented derivative* of the original text.
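
To illustrate what inserting spaces means in practice, here is a toy segmenter that greedily matches the longest known word at each position. This is a deliberately simple technique for illustration only, not what spaCy does internally, and the word list `toy_vocabulary` is a hypothetical example.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

toy_vocabulary = {"我", "喜欢", "猫"}  # hypothetical word list
print(" ".join(segment("我喜欢猫", toy_vocabulary)))  # prints "我 喜欢 猫"
```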

In [8]:
# Imports
import spacy

In [9]:
#Download the language model you're interested in
#e.g. for Chinese: python -m spacy download zh_core_web_sm
#e.g. for Korean: python -m spacy download ko_core_news_sm
#Visit https://spacy.io/usage/models#languages for more
!python -m spacy download ko_core_news_sm

Collecting ko-core-news-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/ko_core_news_sm-3.5.0/ko_core_news_sm-3.5.0-py3-none-any.whl (14.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.7/14.7 MB 34.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('ko_core_news_sm')


In [11]:
#Load language model
nlp = spacy.load('ko_core_news_sm')

#Create spaCy document
with open('korean-corpus.txt', encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

In [13]:
# Create a segmented version of the original text file
#Loop through each token in the original text, lemmatize and lowercase each token, 
#and insert a space between the tokens. Then write them out to new file

filepath = 'korean-corpus.txt'
outname = filepath.replace('.txt', '-segmented.txt')
with open(outname, 'w', encoding='utf8') as out:
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')

The code cell below prints the lemma of each token (words and punctuation), one per line, so you can see how successfully the model identified word boundaries.

In [14]:
for token in document:
    print(token.lemma_)

﻿일
(
日
)
방송
일기예보에독도를
자국영
(
領
)
표시




일본
NHK
위성+방송+의
일기예보+를
보+면
일본+은
녹색，한국+을
비롯+하+ㄴ
다르+ㄴ
국가+는
적+색으로
표시+하+ㄴ다
.




그런데
문제+는
독도+를
녹색으로표+기+해
자국+의
영토+로
나타내+고
있+다는
점이다.+일본
국내+뿐
아니라위성방송+이
되+는
외국+에
독도는
일본
영+토라+는
잘못+되+ㄴ
인식+을
심+어주+기+에
충분+하+다
.




일본+은
외국인+들+에게
TV
방송，교과서，지도
등+을
통하+어
반복+적+으로
「
독도+는
일본
영토
」
이+라는
점을강조+하+고
있+다.하지만
우리+는
장+기적
안목+으로
독도+를
차지하려는일본+의
이러하+ㄴ
움직+임+에+는
전혀
반응+하+지
않+고，오로지
국민감정만을내+세+워
독도
영유권+의
돌파구+를
찾+으려는
것
같+다
.




결국
우리+는
독도영+유권+에
대해방+어적
자세+만
취하+어
왔는데，이러한
우리+의
입장+은
오히려
일본의장기+적+이+ㄴ
독+도
차지
속셈+의
발판을만들+어
주+고
있다.독도문+제+에
대하+ㄴ
좀더
유연성있는
대책+이
필요+하+다+고
보+ㄴ다
.




일
(
日
)
잇+단
독도망언




「
독도우+표
」
발행+을




독도+가
분명
대한민국
+영토+임+에+도
불구하고，일본+의
우익단체와정치인+들+은
40+년이+상+을
오만방자한망언으+로
일삼+아+오+았+다
.
그러나，실+상
우리+의
독도+에
대하+ㄴ
사랑+과
관심+은
그
분노+의
분량+만+큼
진하고+애+절+한
것이었을까
?




여태+껏
그래왔듯+이
일본+의
망발+이
아니고서+는
독도+는
환상+의
섬에불과했다.우리국민의
독도+에
대한관+심+의
정도+는
그저
십여년+전
정광태라는
가수+의
「
독도+는
우+이+리땅
」
에나오+는
노랫+말
정도+이+다
.




이렇+ㄴ
현실속+에서+도
일부에서나마독도사랑+운동+이
펼쳐지+고
있+는
것+은
매우
반가운
일+이+다
.
그
하나의+예+로
대구+지+역
모
시중은행
+「
독도사랑+통장
」
은
대구，경북지역
뿐+만+
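
In the output above, many lemmas join several parts with a `+` sign, so the segmented file will contain those `+` characters as well. If a later step expects units separated only by spaces, one option is to split each lemma on `+` before writing it out. The sketch below assumes that `+` marks boundaries between parts of a word, as it appears to in this output. The `split_morphemes` helper is a hypothetical name, and a literal `+` in the source text would be split too.

```python
def split_morphemes(lemma, separator="+"):
    """Split a '+'-joined lemma into its non-empty parts."""
    return [part for part in lemma.split(separator) if part]

# Lemmas copied from the output above.
lemmas = ["위성+방송+의", "일본", "표시+하+ㄴ다"]

# Join all parts with single spaces, as in the segmented derivative file.
segmented = " ".join(
    part for lemma in lemmas for part in split_morphemes(lemma)
)
print(segmented)
```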

_Acknowledgements_: This notebook is inspired by Melanie Walsh’s [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/Multilingual/Chinese/03-POS-Keywords-Chinese.html#keyword-extraction).