# Preprocessing

**Contents**

[Step 1: Normalizing and building a stopwords list](#section-1)

[Step 2: Lemmatization](#section-2)

[Step 3: Textual units of analysis: Chunking](#section-3)

[Step 4: Tokenizing](#section-4)

<a id='section-1'></a>
## Step 1: Normalizing and building a stopwords list

### Are there any strange characters or strange things in the data that I need to replace or get rid of?

Do a character count in order to see if there are any strange characters I might want to delete or replace: 

In [None]:
#Counting characters in a text
from collections import Counter

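#Open the file with an explicit encoding; the "with" block closes it afterwards
with open('kafka_dv.txt', 'r', encoding='utf-8') as f:
    test_text = f.read()
Counter(test_text)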
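
A raw count of every character can be long. As a minimal sketch, the cell below narrows the counts down to characters outside ordinary ASCII letters, digits, punctuation and the space, and prints each one with its Unicode name. The `sample_text` string is a made-up stand-in for the contents of your file, and what counts as "ordinary" here is an assumption you may want to adjust for your language.

```python
import string
import unicodedata
from collections import Counter

# Hypothetical sample standing in for the contents of a text file
sample_text = "Gregor\u00a0Samsa woke up.\n\u201cWhat\u2019s happened?\u201d\tcaf\u00e9"

# Characters we treat as unremarkable (an assumption; adjust for your corpus)
ordinary = set(string.ascii_letters + string.digits + string.punctuation + ' ')

char_counts = Counter(sample_text)

# Keep only the characters that fall outside the "ordinary" set
unusual = {char: count for char, count in char_counts.items() if char not in ordinary}

# Print code point, Unicode name, and count, most frequent first
for char, count in sorted(unusual.items(), key=lambda item: -item[1]):
    # Control characters such as \n and \t have no Unicode name, so supply a default
    name = unicodedata.name(char, 'no Unicode name (control character)')
    print(f'U+{ord(char):04X}  {name}  count={count}')
```

Characters such as the no-break space (U+00A0) or curly quotation marks are easy to miss by eye, which is why listing them by name can help when deciding what to replace.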

I see there are a lot of newline characters (`\n`). I'm going to replace those with a space.

In [None]:
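#Strings are immutable, so assign the result back to the variable to keep the change
test_text = test_text.replace('\n', ' ')
test_text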

**Normalizing encodings**


Character counts can also flag up any encoding issues. Especially when working with languages that have accented characters, we can see here if there are any problems with those characters (for example, if the "same" accented characters don't in fact share the same encoding). Then we can normalize the data so that all characters are encoded consistently and inconsistent encodings don't distort our counts.

In [None]:
#For example, here we have the letter "a" with acute accent
# Its unicode value in python is 225
#and it has a length of 1: it is one accented character
import unicodedata
char = "á"
print(ord(char))
len(char)

In [None]:
#The unicode name for this character
[ unicodedata.name(c) for c in char ]

In [None]:
#But a character that looks the same could be encoded differently
#Here we have the letter a with acute accent again
#but it is in fact the combination of two unicode code-points (97 + 769) 
#so it has a length of 2
char2 = "a\u0301"  #written with an escape so the two code points stay visible
print([ord(c) for c in char2])
len(char2)

In [None]:
#The unicode name for this character
[ unicodedata.name(c) for c in char2 ]

In [None]:
#These characters look the same but are not in fact the same, and have different lengths
len(char) == len(char2)

In [None]:
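#Differently encoded characters are counted as different characters, so these strings are not equal
#(escapes are used so the difference between the two encodings stays visible)
'h\u00e1lo friend' == 'ha\u0301lo friend'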

In [None]:
#Concatenate the two versions of the character: 3 code points in total (1 + 2)
test = char + char2
len(test)

In [None]:
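#Use unicodedata.normalize() to normalize all character encodings in your text
#Pass the variable test, not the string 'test'
#After NFC normalization each accented character is a single code point, so the length is 2
new_test = unicodedata.normalize('NFC', test)
len(new_test)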
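
To see why this matters for counts, here is a small sketch using a made-up two-word string in which the same word appears once in each encoding. The variable names are illustrative only.

```python
import unicodedata
from collections import Counter

# Hypothetical text: the same word, once with a precomposed "á" (U+00E1)
# and once with "a" plus a combining acute accent (U+0061 U+0301)
text = "h\u00e1lo ha\u0301lo"

# Without normalization the two spellings are counted as different words
raw_counts = Counter(text.split())
print(len(raw_counts))

# After NFC normalization both spellings share one encoding and one count
normalized_text = unicodedata.normalize('NFC', text)
normalized_counts = Counter(normalized_text.split())
print(normalized_counts)
```

`'NFC'` composes characters where possible, while `'NFD'` decomposes them. Either works for counting as long as the whole corpus uses the same form.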

### Word Frequency counts to identify problematic words

Running a word frequency count can help us start building a custom stopword list. It can also help us identify if there are any variant spellings we want to normalize (using `.replace()` as above).

In [None]:
#Counting all the words in a text
import re
from collections import Counter

#defining a tokenizing function which will split at and remove whitespace and punctuation
#return words and numbers
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)  #raw string avoids an invalid-escape warning
    return split_words

#Read in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Use our tokenizing function to tokenize the text
all_the_words = tokenize(text)

#Count frequencies of all the words
all_the_words_count = Counter(all_the_words)
all_the_words_count

In [None]:
#Counting most frequent words in a text

#How many most frequent words do you want to see?
number_of_desired_words = 50

#Return most frequent words
most_frequent_all_the_words_count = all_the_words_count.most_common(number_of_desired_words)
most_frequent_all_the_words_count

What words in this list are not relevant to your project? You can add them to your custom stopwords list.

### Building a custom stopwords list

As we've discussed, you might want to build your own custom stopwords list for your project.

You could use an existing stopwords list as your starting point and remove and add words that you want for your project. 

For example, you could start with the spaCy stopwords list.

In [None]:
#Stopwords in spaCy
import spacy

#Download the language model you're interested in (this is the English pipeline)
#For French: fr_core_news_sm
#For Spanish: es_core_news_sm
!python -m spacy download en_core_web_md

In [None]:
#Load language model and stopwords list
nlp = spacy.load('en_core_web_md')
stopwords = nlp.Defaults.stop_words
sorted(list(stopwords))

In [None]:
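#Write out the spaCy stopwords list to a txt file
#("w" overwrites the file, so re-running this cell does not append duplicate words)
with open("spacy-stopwords-english.txt", "w", encoding="utf-8") as file_object: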
    for word in sorted(list(stopwords)): 
        file_object.write(word + '\n')

**Read in your stopwords list, use it in code, and add words to your stopwords list**

In [None]:
#Open your txt file and convert to a Python list
with open("spacy-stopwords-english.txt", "r") as file_object:
    custom_stopwords = [s.rstrip('\n') for s in file_object.readlines()] 

custom_stopwords

In [None]:
#Append a new word to the list
custom_stopwords.append('got')

In [None]:
#Append multiple new words to the list
custom_stopwords += ['gotten', 'mr']

In [None]:
#Remove a word
#Find the index of the word you want to remove
index = custom_stopwords.index('gotten')
index

In [None]:
#then delete word
del custom_stopwords[index]

In [None]:
#Write out the updated list and sort alphabetically
with open("custom-stopwords.txt", "w") as file_object:
    for word in sorted(custom_stopwords):
        file_object.write(word + '\n')

In [None]:
#Check if a given word in list (True if in list, False if not in list)
'friend' in custom_stopwords 

**Example of using your custom stopword list in code**

In [None]:
#Open your txt file and convert to a Python list
with open("custom-stopwords.txt", "r") as file_object:
    custom_stopwords = [s.rstrip('\n') for s in file_object.readlines()] 

custom_stopwords

In [None]:
import re
from collections import Counter

#Defining a tokenizing function
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)  #raw string avoids an invalid-escape warning
    return split_words

#Reading in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Tokenizing text
all_the_words = tokenize(text)

#Filtering only the words not on stopwords list (you use your stopwords list variable here)
meaningful_words = [word for word in all_the_words if word not in custom_stopwords]

#Counting words
meaningful_words_tally = Counter(meaningful_words)

#How many frequent words we want to see
number_of_desired_words = 50

#Return most frequent words
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)
most_frequent_meaningful_words
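
Checking `word not in custom_stopwords` scans the whole list for every word, which can be slow on long texts. A possible refinement, sketched here with made-up data, is to convert the list to a set first. The short lists below are hypothetical stand-ins for your own stopwords and tokens.

```python
from collections import Counter

# Hypothetical stand-ins for the stopwords list and the tokenized text
custom_stopwords = ['the', 'a', 'of', 'the']
all_the_words = ['the', 'metamorphosis', 'of', 'a', 'travelling',
                 'salesman', 'the', 'salesman']

# A set drops duplicate entries and makes membership checks fast
stopword_set = set(custom_stopwords)

# Same filtering step as above, using the set instead of the list
meaningful_words = [word for word in all_the_words if word not in stopword_set]
meaningful_words_tally = Counter(meaningful_words)
print(meaningful_words_tally.most_common(2))
```

The results are the same as with a list. Only the lookup speed changes.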

****
**Read in your stopwords list, use it in code, and add words to your stopwords list**

This does the same thing as above but using pandas instead of Python lists.

In [None]:
import pandas as pd

#Read in stopwords list as pandas dataframe and convert it to a list
stopwords_df = pd.read_csv('spacy-stopwords-english.txt', names=['word'])
custom_stopwords_list = stopwords_df['word'].to_list()
custom_stopwords_list

In [None]:
#Adding words to the list

#Create list of words you want to add
new_words = ['got', 'mr']

#Create a dataframe of words you want to add
new_words_df = pd.DataFrame(new_words, columns=['word'])
new_words_df

In [None]:
#Concatenate/merge the old dataframe with new dataframe with new words in it, and sort it
updated_stopwords_df = pd.concat([stopwords_df, new_words_df], ignore_index=True)
updated_stopwords_df = updated_stopwords_df.sort_values(by='word')
updated_stopwords_df

In [None]:
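#Check if a word is in the updated list (False if not in list, True if in list)
#Using == checks for an exact match; .str.contains() would also match substrings such as "friends"
(updated_stopwords_df['word'] == 'friend').any()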

In [None]:
#Write out the dataframe to a txt file (the same filename that is read back in further down)
updated_stopwords_df.to_csv('custom-stopwords.txt', sep=' ', header=False, index=False)

**Example of using your custom stopword list in code**

In [None]:
import pandas as pd

#Read in stopwords list as pandas dataframe and convert it to a list
stopwords_df = pd.read_csv('custom-stopwords.txt', names=['word'])
custom_stopwords_list = stopwords_df['word'].to_list()
custom_stopwords_list

In [None]:
import re
from collections import Counter

#Defining a tokenizing function
def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)  #raw string avoids an invalid-escape warning
    return split_words

#Reading in text
text = open('kafka_metamorphosis.txt', encoding="utf-8").read()

#Tokenizing text
all_the_words = tokenize(text)

#Filtering only the words not on stopwords list (you use your stopwords list variable here)
meaningful_words = [word for word in all_the_words if word not in custom_stopwords_list]

#Counting words
meaningful_words_tally = Counter(meaningful_words)

#How many frequent words we want to see
number_of_desired_words = 50

#Return most frequent words
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)
most_frequent_meaningful_words

<a id='section-2'></a>
## Step 2: Lemmatization

### Creating a lemmatized version of your corpus

For methods that rely on word counts (e.g. frequency counts, Tf-idf), it's best to use lemmatized text so that as many as possible of the words we want counted together are in fact counted together. However, there is evidence that lemmatization is not necessary, and may even be counterproductive, for topic modeling.  

> "Stemming has been found to provide little measurable benefits for topic modeling and can sometimes even be harmful (Schofield and Mimno, 2016)." (Nguyen et al., "How We Do Things With Words," p. 8)


It might be good practice to have a lemmatized and unlemmatized version of your corpus so you can experiment with which one produces the most meaningful outputs.
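
To illustrate what "counted together" means, here is a toy sketch. The `toy_lemmas` dictionary is a hypothetical hand-made lookup used only for illustration. A real lemmatizer such as spaCy's works from a trained pipeline, not from a small table like this.

```python
from collections import Counter

# Hypothetical toy lookup from inflected forms to lemmas (illustration only)
toy_lemmas = {'woke': 'wake', 'waking': 'wake', 'wakes': 'wake', 'legs': 'leg'}

tokens = ['woke', 'waking', 'wakes', 'leg', 'legs', 'bed']

# Counting surface forms treats every inflection as a separate word
surface_counts = Counter(tokens)

# Counting lemmas pools the inflected forms under one entry;
# tokens missing from the lookup are kept unchanged
lemma_counts = Counter(toy_lemmas.get(token, token) for token in tokens)

print(surface_counts)
print(lemma_counts)
```

In the surface counts each form of "wake" appears once, while in the lemma counts they add up to a single entry.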

Below we create a lemmatized version of the kafka corpus.

**Lemmatizing multiple files**

In [None]:
#This loops over multiple files in a directory
#but it might make the kernel crash if it runs out of memory
#If the kernel crashes you might have to lemmatize single files one at a time (cf. below)

#Lemmatizing using spaCy for English
import spacy
import glob

#Download the language model you're interested in (this is the English pipeline)
#For French: fr_core_news_sm
#For Spanish: es_core_news_sm
!python -m spacy download en_core_web_md

In [None]:
#Load language model (it needs to match the name above)
nlp = spacy.load('en_core_web_md')

In [None]:
#Open your texts and create spaCy document
filepath = 'kafka-corpus/'
text_files = glob.glob(f'{filepath}/*.txt')

#Loop through the files and open as spacy document
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
        print(file)
        document = nlp(text)
        
    #Lemmatize
    outname = file.replace('.txt', '-lemmatized.txt')
    with open(outname, 'w', encoding='utf8') as out:   
        for token in document:
            # Get the lemma for each token
            out.write(token.lemma_.lower())
            # Insert white space between each token
            out.write(' ')

**Lemmatizing single files**

In [None]:
#Lemmatizing single files

#Lemmatizing using spaCy for English
import spacy
#!python -m spacy download en_core_web_md

In [None]:
#Load language model (it needs to match the name above)
nlp = spacy.load('en_core_web_md')

#Open your text and create spaCy document
filepath = 'kafka_metamorphosis.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

outname = filepath.replace('.txt', '-lemmatized.txt')
with open(outname, 'w', encoding='utf8') as out:   
    for token in document:
        # Get the lemma for each token
        out.write(token.lemma_.lower())
        # Insert white space between each token
        out.write(' ')

In [None]:
#prints the original word in the text, 
#a dash, then the lemmatized form that was written to the derivative text document
#check if there are places where the model consistently makes mistakes
#this prints the first 50 tokens - modify the slice next to document for more
for token in document[:50]:
    print(token.text + ' - ' + token.lemma_)

<a id='section-3'></a>
## Step 3: Textual units of analysis: Chunking

Having texts of widely different lengths might skew the analyses. It's good practice to ensure that the texts are of roughly similar lengths. This might mean joining texts together into larger text blocks if they are very short (e.g. tweets), or splitting longer texts into shorter units.

This process can be called chunking. It is also sometimes referred to as segmentation.

In [None]:
#Loop through and check how long the texts are
import glob

#Open your texts
filepath = 'kafka-corpus/'
text_files = glob.glob(f'{filepath}/*.txt')

#Loop through the files and print each file name with its length in characters
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
        print(file, len(text))

In [None]:
#Split long texts into shorter units
#Split each text into a collection of documents of 3000 characters each
import os

segment_length = 3000
out_dir = 'kafka-corpus/kafka-segmented/'
os.makedirs(out_dir, exist_ok=True)  #make sure the output folder exists

#Loop through the files, print each file name with its length in characters, and write out the segments
for file in text_files:
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    print(file, len(text))

    #Step through the text in blocks of segment_length characters
    #(the last segment may be shorter, but no text is dropped)
    for i, start in enumerate(range(0, len(text), segment_length)):
        segment = text[start:start + segment_length]
        outname = file.replace('kafka-corpus/', out_dir).replace('.txt', f'-{i}.txt')
        with open(outname, 'w', encoding='utf8') as out:
            out.write(segment)

Note that this is a coarse kind of splitting - it can split in the middle of words.
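
If splitting inside words is a problem for your analysis, one alternative is to chunk by number of words instead of characters. The sketch below assumes whitespace-separated words. The function name `chunk_by_words` and the sample sentence are made up for illustration.

```python
def chunk_by_words(text, words_per_chunk=3000):
    """Split text into chunks of at most words_per_chunk whitespace-separated words."""
    words = text.split()
    return [
        ' '.join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]

# Hypothetical short example with a small chunk size so the result is easy to read
sample = "One morning Gregor Samsa woke from troubled dreams"
chunks = chunk_by_words(sample, words_per_chunk=3)
print(chunks)
```

Note that joining the words back together with single spaces discards the original line breaks and spacing, and the final chunk may be shorter than the others.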

<a id='section-4'></a>
## Step 4: Tokenizing

Tokenizing involves splitting the text into the units you want to analyze - most often these are assumed to be "words".  

We need to do this because, if you look back to the very start of this notebook, unstructured "raw" text is just sequences of character encodings, so the analyses will count individual character encodings. We need to restructure our text into the units we want to analyze (i.e. usually "words" or "tokens").

Tokenization works by defining markers at which you split the string. Different tokenizing procedures might use different markers.

Most methods have built-in tokenizing functions (see, for example, the TF-IDF notebook on scikit-learn's built-in tokenizing procedure, which you can override with your own). 

You can use the built-in tokenizing procedures, or you can define and use your own. 

Here are examples of different tokenizing procedures:

In [None]:
#the .split() method in Python uses whitespace as default
text = "I'd say, they're happy it's mother's day."
text.split()

In [None]:
#you can also pass different markers to .split() to define where you want to split your text
#this uses a regular expression to split at any run of one or more characters that are NOT word characters
import re
text = "I'd say, they're happy it's mother's day."
tokens = re.split(r'\W+', text)  #raw string avoids an invalid-escape warning
tokens
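
The regular expression split above breaks contractions at the apostrophe ("I'd" becomes "I" and "d"). If you would rather keep them together, one possible pattern, offered as a sketch and not the only option, matches runs of word characters that may contain internal straight apostrophes. The pattern is an assumption, and it does not handle curly apostrophes.

```python
import re

text = "I'd say, they're happy it's mother's day."

# Match word characters, optionally followed by apostrophe + more word characters
# (handles the straight apostrophe ' only, not the curly one)
contraction_tokens = re.findall(r"\w+(?:'\w+)*", text)
print(contraction_tokens)
```

Whether "mother's" should be one token or two depends on your research question, so it is worth checking the output against what you want to count.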

In [None]:
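#Built-in tokenizing procedure in NLTK
#word_tokenize needs NLTK's Punkt tokenizer data, downloaded once with nltk.download()
#(the resource is named 'punkt' or 'punkt_tab' depending on your NLTK version)
import nltk
from nltk.tokenize import word_tokenize
text = "I'd say, they're happy it's mother's day."
tokens = word_tokenize(text)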
tokens

In [None]:
#Built-in tokenizing procedure in spaCy in Spanish
#Download model
import spacy
!python -m spacy download es_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('es_core_news_sm')
#Process the text to create a spaCy document
text = 'Yo diría, que están felices de que sea el día de la madre.'
document = nlp(text)

tokens = [token.text for token in document]
tokens

In [None]:
#Built-in tokenizing procedure in spaCy in French
#Download model
import spacy
!python -m spacy download fr_core_news_sm

In [None]:
#Load language model
nlp = spacy.load('fr_core_news_sm')
#Process the text to create a spaCy document
text = 'On dirait qu\'ils sont heureux que ce soit la fête des mères.'
document = nlp(text)

tokens = [token.text for token in document]
tokens

**Defining your own tokenizing functions**

In [None]:
#Only words, no numbers
#Define a function to lowercase, split at and remove anything not a "word" character
#(i.e. a letter or digit or underscore)
#So it will split at and remove whitespace and punctuation
#Then keep only alphabetic tokens (i.e. remove numbers) with .isalpha()

def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    no_numbers = [word for word in split_words if word.isalpha()]
    return no_numbers

text_example = "I'd say, they're happy it's mother's day. 1988!"
tokenized_text_example = tokenize(text_example)
tokenized_text_example

In [None]:
#Words and numbers
#Define a function to lowercase, split at and remove anything not a "word" character
#(i.e. a letter or digit or underscore)
#So it will split at and remove whitespace and punctuation

def tokenize(text):
    lowercase_text = text.lower()
    split_words = re.split(r'\W+', lowercase_text)
    return split_words

text_example = "I'd say, they're happy it's mother's day. 1988!"
tokenized_text_example = tokenize(text_example)
tokenized_text_example
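
One detail to watch for: when the text begins or ends with punctuation, `re.split` returns an empty string at that edge, and that empty string would then be counted as a "word". The sketch below shows this with the same example sentence and uses `re.findall` with `\w+` as one possible way to avoid it.

```python
import re

text_example = "I'd say, they're happy it's mother's day. 1988!"

# Splitting at non-word characters leaves an empty string after the final "!"
split_tokens = re.split(r'\W+', text_example.lower())
print(split_tokens)

# Collecting runs of word characters instead never produces empty strings
found_tokens = re.findall(r'\w+', text_example.lower())
print(found_tokens)
```

Filtering the split result with `[word for word in split_tokens if word]` would give the same tokens as `re.findall` here.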