# Pre-Processing

**Contents**

[What is preprocessing and why is it important?](#section-1)

[Tokenizing](#section-2)

[Lowercasing and Punctuation](#section-3)

[Normalization](#section-4)

[Lemmatization and Stemming](#section-5)

[Textual units of analysis: chunking](#section-6)

[Stopwords](#section-7)

<a id='section-1'></a>
## What is preprocessing and why is it important?  

Preprocessing refers to the steps and procedures carried out before analyzing texts, in order to prepare the text data for analysis and make the results more meaningful. Text analysis methods require data to be structured in certain ways and work better when the text data are prepared accordingly. For example, most text analysis methods rely on matching identical sequences of characters in order to count them together. The more consistent the data are (the more the things we want counted together share the same characters, the same spelling, etc.), the better the outputs will be.

In sum, preprocessing is about modifying and preparing the text data so that they work well with the methods you intend to use. Preprocessing steps are especially important when working with languages other than English, or with languages whose features are very different from English, since there might be particular considerations and steps to take so that the text data work well with widely used text analysis methods.

There is no one-size-fits-all approach to preprocessing. Different methods, text data, questions and stages of analysis might require different kinds of preprocessing choices.

**Why is preprocessing important?**

The decisions made and the steps taken during preprocessing influence the analysis and the outputs generated. And yet, as Nguyen et al. point out, preprocessing steps are often underreported and overlooked.

>“The pre-processing steps have a big impact on the operationalizations, subsequent analyses and reproducibility efforts (Fokkens et al., 2013), and they are usually tightly linked to what we intend to measure. Unfortunately, these steps tend to be underreported, but documenting the pre-processing choices made is essential” (Nguyen et al. “How We Do Things With Words”, p. 7)


It is also not always evident what the consequences of our preprocessing choices might be, which makes it even more crucial to be explicit about preprocessing procedures and to document the steps taken.
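
One lightweight way to document the steps taken is to keep a machine-readable record of the choices made. The sketch below is only an illustration: the field names and the file name `preprocessing-log.json` are hypothetical, not a standard.

```python
import json

# Hypothetical record of preprocessing choices; adapt the fields to your project
preprocessing_choices = {
    "tokenizer": "re.split on non-word characters",
    "lowercased": True,
    "punctuation_removed": True,
    "lemmatized": False,
    "stopword_list": "custom-stopwords.txt",
}

# Turn the record into a JSON string that can be reported or stored
record = json.dumps(preprocessing_choices, indent=2)
print(record)

# To keep the record with your outputs, write it to a file, e.g.:
# with open("preprocessing-log.json", "w", encoding="utf-8") as out:
#     out.write(record)
```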

> “In unsupervised settings, it is more challenging to understand the effects of different steps. Inferences drawn from unsupervised settings can be sensitive to pre-processing choices (Denny and Spirling, 2018). (...) All in all, this again highlights the need to document these steps.” (Nguyen et al. “How We Do Things With Words”, p. 8)

<a id='section-2'></a>
## Tokenizing

Tokenizing involves splitting the text into the units you want to analyze - most often these are assumed to be "words".

Why do we need to do this? Look what happens when we pass "raw text", an untokenized text, to Counter (in order to count occurrences in our text):

In [None]:
from collections import Counter

#Read the whole file into a single string (assuming it is UTF-8 encoded)
with open('kafka_dv.txt', 'r', encoding='utf-8') as f:
    test_text = f.read()
Counter(test_text)

It counts every single character. Strings/text are sequences of character encodings.

In [None]:
#ord() returns the unicode code-point value in Python of a character
ord('å')

In [None]:
#chr() returns the character associated with the code-point
chr(229)

In [None]:
#This is how computers read and store information
#prints the sequence of character code-points that are encoded/decoded into bits
text = 'It was the best of times, it was the worst of times.'

for char in text:
    print(ord(char))

We need to split out the sequences of characters into units we want to analyze. This process is called tokenizing: restructuring the text data into units we want to analyze.

Tokenization works by defining markers at which you split the string. Different tokenizing procedures might use different markers.

Here are different ways of tokenizing text:

In [None]:
#the .split() method in Python uses whitespace as default
text = "I'd say, they're happy it's mother's day."
text.split()

In [None]:
#you can also use re.split() with different markers to define where you want to split your text
#this uses a regular expression to split at any run of one or more characters that are NOT "word" characters
import re
text = "I'd say, they're happy it's mother's day."
#the raw string (r'...') keeps the backslash in \W from being read as a Python escape
tokens = re.split(r'\W+', text)
tokens

In [None]:
#Built-in tokenizing procedure in NLTK
#(if NLTK reports a missing resource, download the tokenizer data it names with nltk.download())
import nltk
from nltk.tokenize import word_tokenize
text = "I'd say, they're happy it's mother's day."
tokens = word_tokenize(text)
tokens

In [None]:
#Built-in tokenizing procedure in spaCy in Spanish
#Download model
import spacy
!python -m spacy download es_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('es_core_news_sm')
#Create spaCy process document
text = 'Yo diría, que están felices de que sea el día de la madre.'
document = nlp(text)

tokens = [token.text for token in document]
tokens

In [None]:
#Built-in tokenizing procedure in spaCy in French
#Download model
import spacy
!python -m spacy download fr_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('fr_core_news_sm')
#Create spaCy process document
text = 'On dirait qu\'ils sont heureux que ce soit la fête des mères.'
document = nlp(text)

tokens = [token.text for token in document]
tokens

**Define your own tokenizing function**

You could define your own tokenizing function to tokenize your text the way you want.

In [None]:
#Only words, no numbers
#Define a function to lowercase, split at and remove anything not a "word" character
#(i.e. a letter or digit or underscore)
#So it will split at and remove whitespace and punctuation
#Then keep only alphabetic tokens (i.e. remove numbers and empty strings) with .isalpha()

def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    no_numbers = [word for word in split_words if word.isalpha()]
    return no_numbers

text_example = "I'd say, they're happy it's mother's day. 1988!"
tokenized_text_example = tokenize(text_example)
tokenized_text_example

In [None]:
#Words and numbers
#Define a function to lowercase, split at and remove anything not a "word" character
#(i.e. a letter or digit or underscore)
#So it will split at and remove whitespace and punctuation

def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    #drop the empty strings re.split() leaves when the text starts or ends with punctuation
    return [word for word in split_words if word]

text_example = "I'd say, they're happy it's mother's day. 1988!"
tokenized_text_example = tokenize(text_example)
tokenized_text_example

Tokenizing procedures are therefore premised on the assumption that meaningful semantic units are separable by repeated markers such as whitespace or punctuation.

Different tokenizing procedures operationalize different assumptions about what the markers that delimit semantic units should be (for example, predominantly relying on the assumption that meaningful semantic units are words separated by whitespace).

Some critics resist the assumption that the primary meaningful units of analysis should be words. And if words really are the most meaningful units of analysis, then it might not always be clear what a word is. 

For example, Ramsay argues that 
> “Tokenization forces us to confront the fact that the notion of a word is neither unambiguous nor satisfactorily definable for all circumstances.” (Ramsay, _Reading Machines_, p. 34)

In [None]:
#White space does not always signal the demarcation between two different semantic units
#For example, in Vietnamese, the word "thời gian" is read as a single semantic unit
#but it would be split according to most default tokenizing procedures
text = "thời gian"
text.split()

Nguyen et al. also raise the issue of multi-words: 

>“Multi-word terms are also challenging. Treating them as a single unit can dramatically alter the patterns in text. Many words that are individually ambiguous have clear, unmistakable meanings as terms, like “black hole” or “European Union.”” (Nguyen et al. “How We Do Things With Words”, p. 8)
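
One way to experiment with treating multi-word terms as single units is to merge them after tokenizing. This is a minimal sketch: the function name `merge_multiword_terms` and the short term list are assumptions for illustration, and a real project would need its own list of terms.

```python
def merge_multiword_terms(tokens, terms):
    """Join adjacent tokens that form a known two-word term into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in terms:
            # Keep the term as a single unit, joined with an underscore
            merged.append('_'.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical list of two-word terms to keep together
terms = {('black', 'hole'), ('european', 'union')}

tokens = 'the european union funded research on a black hole'.split()
merged_tokens = merge_multiword_terms(tokens, terms)
print(merged_tokens)
```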

Tokenization is also problematic for agglutinative languages, which build words by combining smaller units together, and for languages that form long compound words.

For example, in German, words are often combined together to create new concepts or simply to express ideas in more compact form. 

For example: 

Kummerspeck = Kummer (grief, sorrow) + Speck (bacon) = the weight you put on from emotional overeating

Dekadenzkonzept = Dekadenz (decadence) + Konzept (concept) = the concept of decadence

This last word would be tokenized into separate entities in English, but there is no easy way to do that in German, and it's not clear if we should do that in the first place - is this combined word a distinct concept or should it be decomposed into its constituent forms? Is it simply the sum of its constituent units?

Tokenizing involves decisions and choices that may impact the analysis. 

> “Such choices may appear simple, but they may have a strong influence on the final text representation, and, subsequently, on the analysis based on this representation.” (Karsdorp, et al. _Humanities Data Analysis_, p. 82)


>“One step that almost everyone takes is to tokenize the original character sequence into the words and word-like units. Tokenization is a more subtle and more powerful process than people expect.” (Nguyen et al. “How We Do Things With Words”, p. 8)


> “Unfortunately, it is difficult to provide a recommendation here apart from advising that tokenization procedures be carefully documented.” (Karsdorp, et al. _Humanities Data Analysis_, p. 82)


> “tokenizers may come with a certain set of assumptions, which should be made explicit through, for instance, properly referring to the exact tokenizer applied in the analysis.” (Karsdorp, et al. _Humanities Data Analysis_, p. 83)

**Segmentation**  

Some languages do not separate words with spaces. One way to tokenize these languages is to artificially insert spaces in the text.

In [None]:
# Segmentation for Chinese in spaCy
import spacy
#!python -m spacy download zh_core_web_sm

In [None]:
#Load language model
nlp = spacy.load('zh_core_web_sm')
#Create spaCy processed document
filepath = 'segmentation_text_sample.txt'
with open(filepath, encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

#Create a segmented version of the original text file
#Loop through each token in the document and write it out to a new file,
#inserting a space between the tokens

outname = filepath.replace('.txt', '-segmented.txt')
with open(outname, 'w', encoding='utf8') as out:
    for token in document:
        print(token)
        out.write(token.text)
        out.write(' ')

<a id='section-3'></a>
## Lowercasing and Punctuation

Uppercase and lowercase letters have distinct encodings, so words that differ only in case will be counted as distinct tokens. Similarly, punctuation marks are encoded characters - words with trailing punctuation will be counted separately from the same words without it.

In [None]:
from collections import Counter
example_text = 'And then, there were none. None and then there were.'
tokenized_example_text = example_text.split()
Counter(tokenized_example_text)

It is common practice to lowercase the text and remove punctuation.

In [None]:
#lowercase using .lower() string method
lower_example_text = example_text.lower()
lower_example_text

In [None]:
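#Use isalpha() to keep only the tokens made up entirely of alphabetic characters
#this removes numbers and stand-alone punctuation, but it also drops whole tokens
#that have punctuation attached (e.g. 'then,' and 'none.')
nopunct_text = [word for word in tokenized_example_text if word.isalpha()]
nopunct_text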
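
If the goal is to keep words such as 'then,' and 'none.' rather than lose them, a possible alternative is to strip punctuation from the edges of each token. The sketch below uses the standard library's `string.punctuation`, which only covers ASCII punctuation; the variable names are illustrative.

```python
import string
from collections import Counter

example_text = 'And then, there were none. None and then there were.'

# Lowercase, split on whitespace, then strip punctuation from both ends of each token
stripped_tokens = [word.strip(string.punctuation) for word in example_text.lower().split()]

# Drop tokens that were nothing but punctuation
stripped_tokens = [word for word in stripped_tokens if word]

token_counts = Counter(stripped_tokens)
print(token_counts)
```

With this approach every word in the example is counted twice, regardless of case or trailing punctuation.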

Even these routine practices could have consequences for analysis. Upper and lower case can carry semantic meaning. For example, in German nouns are capitalized, so "Essen" means "food" and "essen" means "to eat". We lose that distinction by lowercasing our text.

> “We already spoke about lowercasing texts, which is another common preprocessing step. Here as well, we should be aware that it has certain consequences for the final text representation. For instance, it complicates identifying proper nouns or the beginnings of sentences at a later stage in an analysis.” (Karsdorp, et al. _Humanities Data Analysis_, p. 83)

Similarly, punctuation marks can also be carriers of semantic meaning (as analyzed by Piper in Andrew Piper, "Punctuation (Opposition)" in *Enumerations*, The University of Chicago Press, 2018). Depending on our research goals, it may be necessary to consider the semantic function played by punctuation.

> “To illustrate the complexity, consider the problem of modeling thematic differences between texts. For this problem, certain linguistic markers such as punctuation might not be relevant. However, the same linguistic markers might be of crucial importance to another problem. In authorship attribution, for example, it has been demonstrated that punctuation is one of the strongest predictors of authorial identity (Grieve 2007).” (Karsdorp, et al. _Humanities Data Analysis_, p. 83)

<a id='section-4'></a>
## Normalization

Normalization is a catch-all term for a number of different procedures that mainly have to do with the idea of reducing inconsistencies and variations in the text data. 

Text analysis methods work better the more consistent the text data are: because methods rely on matching sequences of characters, the more consistent the sequences of characters you want matched, the better the results. 
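
The same visible character can also be stored as different sequences of code points. For instance, 'å' can be a single code point or an 'a' followed by a combining ring, and the two will not match. This is a minimal sketch of smoothing that over with the standard library's `unicodedata.normalize`; whether the step is needed depends on how your texts were produced.

```python
import unicodedata

composed = '\u00e5'       # 'å' as one code point
decomposed = 'a\u030a'    # 'a' followed by a combining ring above

# The two strings look the same but are different sequences of code points
print(composed == decomposed)

# Normalizing both to the NFC form makes them match
nfc_composed = unicodedata.normalize('NFC', composed)
nfc_decomposed = unicodedata.normalize('NFC', decomposed)
print(nfc_composed == nfc_decomposed)
```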

Normalization could involve smoothing over inconsistencies in spelling - either because different forms of spelling exist, or because there are spelling errors introduced by OCR. 

> “Data may also vary enormously in quality, depending on how it has been generated. Many historians, for example, work with text produced from an analog original using Optical Character Recognition (OCR). Often, there will be limited information available regarding the accuracy of the OCR, and the degree of accuracy may even vary within a single corpus (e.g., where digitized text has been produced over a period of years, and the software has gradually improved). The first step, then, is to try to correct for common OCR errors. These will vary depending on the type of text, the date at which the “original” was produced, and the nature of the font and typesetting.” (Nguyen et al. “How We Do Things With Words”, p. 7)

Normalization, like other pre-processing procedures, therefore also involves making decisions about what  is considered a meaningful variation and what is not. 

> “Each step requires making additional assumptions about which distinctions are relevant: is “apple” different from “Apple”? Is “burnt” different from “burned”? Is “cool” different from “coooool”? Sometimes these steps can actively hide useful patterns, like social meaning (Eisenstein, 2013). Some of us therefore try do as little modification as possible.” (Nguyen et al. “How We Do Things With Words”, p. 8)

Identifying characters to normalize using character counts:

In [None]:
#Counting characters in a text
from collections import Counter

#Read the whole file into a single string (assuming it is UTF-8 encoded)
with open('kafka_dv.txt', 'r', encoding='utf-8') as f:
    test_text = f.read()
Counter(test_text)

In [None]:
#Replacing problematic characters
test_text.replace('\n', ' ')

Identifying words to normalize using word counts and `.replace()` as above:

In [None]:
#Counting all the words in a text
import re
from collections import Counter

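#defining a tokenizing function which will split at and remove whitespace and punctuation
#return words and numbers
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    #drop the empty strings re.split() leaves when the text starts or ends with punctuation
    return [word for word in split_words if word]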

#Read in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Use our tokenizing function to tokenize the text
all_the_words = tokenize(text)

##Count frequencies of all the words
all_the_words_count = Counter(all_the_words)
all_the_words_count
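
Once the counts reveal variants that should be counted together, one option is to map them onto a single form. The sketch below uses the "burnt"/"burned" and "cool"/"coooool" cases from the quote above. The mapping, the helper name `normalize_token` and the rule for collapsing repeated letters are assumptions for illustration, and they should be chosen per project since they can also erase meaningful distinctions.

```python
import re
from collections import Counter

# Hypothetical mapping from variant spellings to the form we want to count
variant_map = {'burnt': 'burned'}

def normalize_token(token):
    # Collapse runs of three or more identical characters down to two ("coooool" -> "cool")
    collapsed = re.sub(r'(.)\1{2,}', r'\1\1', token)
    # Replace known variants with the chosen form, leave other tokens unchanged
    return variant_map.get(collapsed, collapsed)

tokens = ['burnt', 'burned', 'cool', 'coooool']
normalized_counts = Counter(normalize_token(token) for token in tokens)
print(normalized_counts)
```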

<a id='section-5'></a>
## Lemmatization and Stemming

Both stemming and lemmatization are procedures that reduce the inflectional forms of words to a common base or root. 

English has minimal inflection (e.g. words can be inflected by number: "cat" becomes "cats" in the plural). Other languages, however, have much more inflection. Words can vary, for example, according to whether the word is definite or indefinite, and also according to number and gender. For example in Swedish, not only could there be three different forms for "little" - "litet", "liten", "lilla", but there are also four different forms for "coat" (jacka, jackan, jackor, jackorna) and "apple" (äpple, äpplet, äpplen, äpplena), depending on gender and whether they’re plural or singular and definite or indefinite (whereas in English there would only be two different forms: plural and singular). When working with methods that rely on word counts, and on grouping words we consider the "same" together, these inflectional forms need to be grouped together. Stemming and lemmatization are two processes to reduce inflectional forms to a base form, but they do it in slightly different ways.

### Stemming  

Stemming converts a word into a stem: it works with a list of common inflectional prefixes and suffixes specific to a language, and it cuts off the beginnings and endings of words that have those forms. 

> sing > sing  

>singing > sing  

>sung > sung  

>sang > sang



>niñas > niñ   

>niñez > niñ

In [None]:
#Stemming with NLTK
#Imports
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from collections import Counter

text = 'To sing a happy song, while dancing a happy dance, is to experience happiness, one of the best experiences.'

#tokenize
tokens = nltk.word_tokenize(text)

#initialize stemmer
porter = PorterStemmer()

#stem tokens
stemmed_tokens = [porter.stem(token) for token in tokens]

#count which words are now counted together
counts = Counter(stemmed_tokens)
counts
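
To make the basic mechanism concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any library's stemmer; the suffix list and the function name `naive_stem` are made up for illustration.

```python
# Hypothetical list of suffixes to strip, longest first
suffixes = ['ing', 's']

def naive_stem(word):
    for suffix in suffixes:
        # Only strip if a stem of at least three letters remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(word) for word in ['sing', 'singing', 'sung', 'sang', 'cats']]
print(stems)
```

Because it only looks at endings, irregular forms such as "sung" and "sang" are left apart from "sing".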

### Lemmatization

Lemmatization converts inflections to a base lemma. Lemmatization takes into account morphology (how words change when inflected) and relies on dictionaries to identify the base lemma of different forms.

> sing > sing 

> singing > sing 

> sung > sing 

> sang > sing  

> niñas > niño  

> niñez > niñez

Lemmatization might be more precise at grouping together words we think belong together whilst stemming might be more imprecise - it might not take into account all the different inflections of a word, and it might group together words that might not necessarily belong together. Yet stemming can still be an efficient procedure for low inflection languages. Furthermore, as Nguyen et al. point out, not all languages have the same kinds of resources available, and there might not be good quality lemmatization packages available.
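
The dictionary-based idea behind lemmatization can be sketched with a plain lookup table. The table below is a tiny hand-made stand-in for the much larger resources real lemmatizers rely on, and `lookup_lemma` is a hypothetical helper.

```python
# Tiny hand-made lemma dictionary (illustrative only)
lemma_table = {
    'singing': 'sing',
    'sung': 'sing',
    'sang': 'sing',
    'niñas': 'niño',
}

def lookup_lemma(word):
    # Fall back to the word itself when it is not in the dictionary
    return lemma_table.get(word, word)

words = ['sing', 'singing', 'sung', 'sang', 'niñas', 'niñez']
lemmas = [lookup_lemma(word) for word in words]
print(lemmas)
```

The fallback also shows the limitation noted above: forms missing from the available resources are simply left as they are.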


> “From a multilingual perspective, English and Chinese have unusually simple inflectional systems, and so it is statistically reasonable to treat each inflection as a unique word type. Romance languages have considerably more inflections than English; many indigenous North American languages have still more. For these languages, unseen data is far more likely to include previously-unseen inflections, and therefore, dealing with inflections is more important. On the other hand, the resources for handling inflections vary greatly by language, with European languages dominating the attention of the computational linguistics community thus far.” (Nguyen et al. “How We Do Things With Words”, p. 8)

**Lemmatizing multiple files**

In [None]:
#This loops over multiple files in a directory
#but it might make the kernel crash if it runs out of memory
#If the kernel crashes you might have to lemmatize single files at a time (cf. below)

#Lemmatizing using spaCy for English
import spacy
import glob

#Download the language model you're interested in (this is the English pipeline)
#For French: fr_core_news_sm
#For Spanish: es_core_news_sm
!python -m spacy download en_core_web_md

In [None]:
#Load language model (it needs to match the name above)
nlp = spacy.load('en_core_web_md')

In [None]:
#Open your texts and create spaCy document
filepath = 'kafka-corpus/'
text_files = glob.glob(f'{filepath}/*.txt')

#Loop through the files and open as spacy document
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
        print(file)
        document = nlp(text)
        
    #Lemmatize
    outname = file.replace('.txt', '-lemmatized.txt')
    with open(outname, 'w', encoding='utf8') as out:   
        for token in document:
            # Get the lemma for each token
            out.write(token.lemma_.lower())
            # Insert white space between each token
            out.write(' ')

**Lemmatizing single files**

In [None]:
#Lemmatizing single files

#Lemmatizing using spaCy for English
import spacy
#!python -m spacy download en_core_web_md

In [None]:
#Load language model (it needs to match the name above)
nlp = spacy.load('en_core_web_md')

#Open your text and create spaCy document
filepath = 'kafka_metamorphosis.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

outname = filepath.replace('.txt', '-lemmatized.txt')
with open(outname, 'w', encoding='utf8') as out:   
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')

In [None]:
#Prints the original word in the text, 
#a dash, then the lemmatized form that was written to the derivative text document
#check if there are places where the model consistently makes mistakes
#this prints the first 50 tokens - modify the slice next to document for more
for token in document[:50]:
    print(token.text + ' - ' + token.lemma_)

<a id='section-6'></a>
## Textual units of analysis: chunking

Having texts of widely different lengths might skew the analyses. It's good practice to ensure that the texts are of roughly similar lengths. This might mean joining texts together into larger text blocks if they are very short (e.g. tweets), or splitting longer texts into shorter units.

This process can be called chunking. It is also sometimes referred to as segmentation.
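
As a minimal sketch of the first case, joining very short texts into larger blocks, something like the following could work. The function name `combine_short_texts`, the sample texts and the 10-word threshold are assumptions chosen to keep the example small.

```python
def combine_short_texts(texts, min_words):
    """Join consecutive texts until each block has at least min_words words."""
    blocks = []
    current = []
    current_length = 0
    for text in texts:
        current.append(text)
        current_length += len(text.split())
        if current_length >= min_words:
            blocks.append(' '.join(current))
            current = []
            current_length = 0
    # Keep any leftover texts as a final, shorter block
    if current:
        blocks.append(' '.join(current))
    return blocks

short_texts = [
    'one two three four',
    'five six seven',
    'eight nine ten eleven',
    'twelve thirteen',
]
blocks = combine_short_texts(short_texts, min_words=10)
print(blocks)
```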

> “From a computational perspective, the unit of text can also make a huge difference, especially when we are using bag-of-words models, where word order within a unit does not matter (Boyd-Graber et al., 2017). Finding a good segmentation sometimes means combining short documents and subdividing long documents.” (Nguyen et al. “How We Do Things With Words”, p. 6)


> “Small segments, like tweets, sometimes do not have enough information to make their semantic context clear (Mehrotra et al., 2013). In contrast, larger segments, like novels, have too much variation, making it difficult to train focused models (Jockers, 2013). The word “document” can therefore be misleading. But it is so ingrained in the common NLP lexicon that we use it anyway in this article.” (Nguyen et al. “How We Do Things With Words”, p. 6-7)

In [None]:
#Loop through and check how long the texts are
import glob

#Open your texts
filepath = 'kafka-corpus/'
text_files = glob.glob(f'{filepath}/*.txt')

#Loop through the files and print text file name with number of words
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
        #len(text) would count characters, so split on whitespace to count words
        print(file, len(text.split()))

In [None]:
#Split long texts into shorter units
#Split into a collection of documents of 3000 characters each
import math
import os

#Make sure the output folder exists
os.makedirs('kafka-corpus/kafka-segmented', exist_ok=True)

#Loop through the files and print text file name with number of characters
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
        print(file, len(text))
        
        segment_length = 3000
        
        #Round up so the final, shorter segment is not dropped
        nseg = math.ceil(len(text) / segment_length)
        for i in range(nseg):
            segment = text[segment_length*i:segment_length*(i+1)]
            outname = file.replace('kafka-corpus/', 'kafka-corpus/kafka-segmented/').replace('.txt', f'-{i}.txt')
            with open(outname, 'w', encoding='utf8') as out:
                out.write(segment)

Note that this is a coarse kind of splitting: it counts characters rather than words, so it can split in the middle of words.
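
To split on word boundaries instead, one option is to split the text on whitespace first and then group the words. This is a minimal sketch, with a hypothetical helper `chunk_by_words` and a tiny chunk size for demonstration; note that it discards the original line breaks.

```python
def chunk_by_words(text, words_per_chunk):
    """Split text into chunks of at most words_per_chunk whitespace-separated words."""
    words = text.split()
    return [
        ' '.join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]

sample = 'It was the best of times, it was the worst of times.'
chunks = chunk_by_words(sample, words_per_chunk=5)
print(chunks)
```

The last chunk is shorter than the others whenever the number of words is not a multiple of the chunk size.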

<a id='section-7'></a>
## Stopwords

Stopwords are words that we want to filter out from our analyses because they are considered irrelevant or not meaningful to our analyses. The choice of stopwords matters because any word on the stopword list will be removed from the analyses.

There is no real agreement on what should or should not be included on a stopwords list, and this varies widely depending on the research aims and questions.

Very often, stopword lists include function words. Function words are used to express grammatical relations and hold sentences together. They include pronouns (e.g. I, she, he, me, you, they, their, him, her), articles (e.g. the, a), conjunctions (e.g. and, before, but, because, for, whether, that), and prepositions (e.g. below, before, in, during). Function words are grammatically useful, but are considered not to carry much semantic content. In contrast, content words, such as verbs (e.g. run, shout, eat), nouns (e.g. apple, sister, depth), and adjectives (e.g. yellow, quiet, confusing) are considered carriers of meaning. Yet function words are effective indicators of style - function words are therefore central to analyses of style and authorship. The choices made about what words to include on a stopwords list will therefore vary according to the specificities of the research questions and the text data.
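
For a style-oriented analysis, the same filtering logic can be inverted to keep only function words. This is a sketch under assumptions: the short `function_words` set is illustrative, not a complete or authoritative list.

```python
from collections import Counter

# Small illustrative set of function words (not a complete list)
function_words = {'the', 'a', 'and', 'but', 'in', 'of', 'it', 'was'}

tokens = 'it was the best of times it was the worst of times'.split()

# Keep only the function words, the reverse of stopword removal
style_tokens = [token for token in tokens if token in function_words]

# Relative frequency of each function word among all tokens
style_profile = {word: count / len(tokens) for word, count in Counter(style_tokens).items()}
print(style_profile)
```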

>“We sometimes also remove words that are not relevant to our goals, for example by calculating vocabulary frequencies. We construct a “stoplist” of words that we are not interested in. If we are looking for semantic themes we might remove function words like determiners and prepositions. If we are looking for author-specific styles, we might remove all words except function words. Some words are generally meaningful but too frequent to be useful within a specific collection. The word “prisoner” would be very interesting in most contexts, but in London court records that consist entirely of decisions about prisoners, it adds nothing. We sometimes also remove very infrequent words. Their occurrences are too low for robust patterns and removing them helps reducing the vocabulary size.”
(Nguyen et al. “How We Do Things With Words”, p. 8)
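
Removing very infrequent words can be sketched with `Counter`. The threshold `min_count` and the sample tokens are assumptions; a suitable cutoff depends on the size of the collection.

```python
from collections import Counter

tokens = 'the prisoner said the prisoner saw the strange bird'.split()
counts = Counter(tokens)

# Keep only the words that occur at least min_count times
min_count = 2
frequent_tokens = [token for token in tokens if counts[token] >= min_count]
print(frequent_tokens)
```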


>“What one researcher considers noise, or something to be discounted in a dataset, may provide essential evidence for another.” (Owens, “Defining Data for Humanists”)


Many packages have built-in stopwords lists, but you will probably need to modify these for the purposes of your analyses and create custom stopword lists. For a review of built-in stopwords list cf. Nothman, Qin and Yurchak. ["Stop Word Lists in Free Open-source Software Packages"](https://aclanthology.org/W18-2502/).

**Adding to Built-in Stop Words List within the package**

In [None]:
#Stopwords in spaCy
import spacy

#Download the language model you're interested in (this is the English pipeline)
!python -m spacy download en_core_web_md

In [None]:
#Load language model and stopwords list
nlp = spacy.load('en_core_web_md')
stopwords = nlp.Defaults.stop_words
stopwords

In [None]:
# Add a word to spacy stopword list
nlp.Defaults.stop_words.add('explain')
stopwords = nlp.Defaults.stop_words
stopwords

In [None]:
#Remove a word from spacy stopword list
nlp.Defaults.stop_words.remove('explain')
stopwords = nlp.Defaults.stop_words
stopwords

In [None]:
# Add multiple words to spacy stopword list
nlp = spacy.load('en_core_web_md')
nlp.Defaults.stop_words |= {"explain","sample"}
stopwords = nlp.Defaults.stop_words
stopwords

In [None]:
# Remove multiple stopwords at once from spacy stopword list
nlp = spacy.load('en_core_web_md')
nlp.Defaults.stop_words -= {'explain', 'sample'}
stopwords = nlp.Defaults.stop_words
stopwords

**Creating a Custom Stopword List**

Identifying what words to add to a stopwords list often relies on frequency counts in order to filter out frequent words that are not relevant to the analyses.

In [None]:
#Counting all the words in a text
import re
from collections import Counter

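#defining a tokenizing function which will split at and remove whitespace and punctuation
#return words and numbers
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    #drop the empty strings re.split() leaves when the text starts or ends with punctuation
    return [word for word in split_words if word]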

#Read in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Use our tokenizing function to tokenize the text
all_the_words = tokenize(text)

##Count frequencies of all the words
all_the_words_count = Counter(all_the_words)
all_the_words_count

In [None]:
#Counting most frequent words in a text

#How many most frequent words do you want to see?
number_of_desired_words = 50

#Return most frequent words
most_frequent_all_the_words_count = all_the_words_count.most_common(number_of_desired_words)
most_frequent_all_the_words_count

What words in this list are not relevant to your project? You can add them to your custom stopwords list.

**Building a custom stopwords list**

It might be helpful to use an existing stopwords list as a starting point, and tailor that list to specific projects. For example, we could use the built-in spaCy stopwords list as a starting point.

In [None]:
#Stopwords in spaCy
import spacy

#Download the language model you're interested in (this is the English pipeline)
#For French: fr_core_news_sm
#For Spanish: es_core_news_sm
!python -m spacy download en_core_web_md

In [None]:
#Load language model and stopwords list
nlp = spacy.load('en_core_web_md')
stopwords = nlp.Defaults.stop_words
sorted(list(stopwords))

In [None]:
#Write out the spacy stopwords list to a txt file
#("w" overwrites the file, so re-running this cell does not duplicate the list)
with open("spacy-stopwords-english.txt", "w") as file_object:
    for word in sorted(list(stopwords)): 
        file_object.write(word + '\n')

**Read in your stopwords list, use it in code, and add words to your stopwords list**

In [None]:
#Open your txt file and convert to a Python list
with open("spacy-stopwords-english.txt", "r") as file_object:
    custom_stopwords = [s.rstrip('\n') for s in file_object.readlines()] 

custom_stopwords

In [None]:
#Append a new word to the list
custom_stopwords.append('got')

In [None]:
#Append multiple new words to the list
custom_stopwords += ['gotten', 'mr']

In [None]:
#Remove a word
#Find the index of the word you want to remove
index = custom_stopwords.index('gotten')
index

In [None]:
#then delete word
del custom_stopwords[index]

In [None]:
#Write out the updated list and sort alphabetically
with open("custom-stopwords.txt", "w") as file_object:
    for word in sorted(custom_stopwords):
        file_object.write(word + '\n')

In [None]:
#Check if a given word in list (True if in list, False if not in list)
'friend' in custom_stopwords 

**Example of using your custom stopword list in code**

In [None]:
#Open your txt file and convert to a Python list
with open("custom-stopwords.txt", "r") as file_object:
    custom_stopwords = [s.rstrip('\n') for s in file_object.readlines()] 

custom_stopwords

In [None]:
import re
from collections import Counter

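#Defining a tokenizing function
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    #drop the empty strings re.split() leaves when the text starts or ends with punctuation
    return [word for word in split_words if word]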

#Reading in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Tokenizing text
all_the_words = tokenize(text)

#Filtering only the words not on stopwords list (you use your stopwords list variable here)
meaningful_words = [word for word in all_the_words if word not in custom_stopwords]

#Counting words
meaningful_words_tally = Counter(meaningful_words)

#How many frequent words we want to see
number_of_desired_words = 50

#Return most frequent words
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)
most_frequent_meaningful_words

****
**Read in your stopwords list, use it in code, and add words to your stopwords list**

This does the same thing as above but using pandas instead of Python lists.

In [None]:
import pandas as pd

#Read in stopwords list as pandas dataframe and convert it to a list
stopwords_df = pd.read_csv('spacy-stopwords-english.txt', names=['word'])
custom_stopwords_list = stopwords_df['word'].to_list()
custom_stopwords_list

In [None]:
#Adding words to the list

#Create list of words you want to add
new_words = ['got', 'mr']

#Create a dataframe of words you want to add
new_words_df = pd.DataFrame(new_words, columns=['word'])
new_words_df

In [None]:
#Concatenate/merge the old dataframe with new dataframe with new words in it, and sort it
updated_stopwords_df = pd.concat([stopwords_df, new_words_df], ignore_index=True)
updated_stopwords_df = updated_stopwords_df.sort_values(by='word')
updated_stopwords_df

In [None]:
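#Check if a word is in the list (False if not in list, True if in list)
#an exact comparison is used because .str.contains() would also match substrings (e.g. 'friendly')
(updated_stopwords_df['word'] == 'friend').any()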

In [None]:
#Write out the dataframe to a txt file
#(the file name matches the one read in by the next example)
updated_stopwords_df.to_csv('custom-stopwords.txt', sep=' ', header=None, index=False)

**Example of using your custom stopword list in code**

In [None]:
import pandas as pd

#Read in stopwords list as pandas dataframe and convert it to a list
stopwords_df = pd.read_csv('custom-stopwords.txt', names=['word'])
custom_stopwords_list = stopwords_df['word'].to_list()
custom_stopwords_list

In [None]:
import re
from collections import Counter

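#Defining a tokenizing function
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    #drop the empty strings re.split() leaves when the text starts or ends with punctuation
    return [word for word in split_words if word]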

#Reading in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Tokenizing text
all_the_words = tokenize(text)

#Filtering only the words not on stopwords list (you use your stopwords list variable here)
meaningful_words = [word for word in all_the_words if word not in custom_stopwords_list]

#Counting words
meaningful_words_tally = Counter(meaningful_words)

#How many frequent words we want to see
number_of_desired_words = 50

#Return most frequent words
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)
most_frequent_meaningful_words

**Defining a function for removing a list of stopwords from a list of tokens**

In [None]:
#Function to filter out stopwords from your list of tokens
def remove_stopwords(list_of_tokens, stopwords):
    return [token for token in list_of_tokens if token not in stopwords]
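
A short usage sketch for a function like the one above. The sample tokens and stopwords are made up, and converting the stopwords to a `set` is optional: it makes the membership test faster, which can matter for long texts.

```python
def remove_stopwords(list_of_tokens, stopwords):
    return [token for token in list_of_tokens if token not in stopwords]

# Made-up example data
example_tokens = ['one', 'morning', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams']
example_stopwords = set(['one', 'when', 'from'])

filtered_tokens = remove_stopwords(example_tokens, example_stopwords)
print(filtered_tokens)
```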

Acknowledgements: This notebook incorporates code and ideas from Melanie Walsh and Quinn Dombrowski's collaborations in Walsh's ["Introduction to Cultural Analytics & Python"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/01-Multilingual-Text-Analysis.html), and from Jed Dobson's [notebooks](https://github.com/jeddobson/ENGL64.05-21F/blob/main/homework/Homework-03.ipynb) for his "Cultural Analytics" course at Dartmouth College.