# Preprocessing text from the SEP
With the scraper from the previous step, we saved every SEP article to its own file. From now on, we treat each of these files as one document. Before running our models, we need to preprocess the files so that they are easier for the machine to work with.

In [3]:
import os
import ast
files = os.listdir('sep_articles/')

In [9]:
files[0]

'18th_century_french_aesthetics'

## Pre-processing with NLTK
The goal is to have a piece of text where all variations of a word are reduced to a single lemma, and where only meaningful words remain. To do this, we must:

1. Tokenize the text.
    * Tokenizing means breaking the text into pieces, be it words or sentences. In our case, we can use words.
2. Part-of-speech tagging.
    * This tags each word in terms of its role in the sentence (noun, verb, adjective, etc.)
3. Remove stop words.
    * Stop words are words that carry little meaning (at least for our analysis), such as 'the' or 'and.'
4. Lemmatization.
    * This reduces word variations to a common lemma (e.g. 'ate' and 'eating' => 'eat').

In [1]:
import nltk

### Tokenizing

First, we divide each text into sentences with NLTK's sent_tokenize() function. Then we divide each sentence into words. For this second step we use NLTK's RegexpTokenizer, which gives us better control over the kinds of words we keep. For our purposes, we want words without numbers or symbols, so we spell this out in a regular expression when we instantiate the tokenizer. Once we have each sentence as a list of words, we save the result to a file.

In [2]:
tokenizer = nltk.RegexpTokenizer('[A-Za-z]+')
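
To see what this pattern keeps and what it drops, here is a minimal sketch that uses Python's `re` module as a stand-in. It assumes that `RegexpTokenizer` with its default settings returns the same non-overlapping matches as `re.findall` for this pattern. The sample sentence is made up for illustration.

```python
import re

# Same pattern as the tokenizer above: runs of ASCII letters only.
pattern = re.compile('[A-Za-z]+')

# Made-up sample sentence, not taken from the SEP.
sample = "Kant's 3 Critiques (1781-1790) re-shaped aesthetics."

tokens = pattern.findall(sample)
print(tokens)
# ['Kant', 's', 'Critiques', 're', 'shaped', 'aesthetics']
```

Digits and punctuation disappear, while apostrophes and hyphens split words into fragments such as 's' and 're'. Accented letters would split words in the same way, since the pattern only matches A-Z and a-z.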

In [19]:
for f in files:
    # In Python 3, open() decodes the file itself, so no .decode() call is needed.
    with open('sep_articles/' + f, encoding='utf8') as raw_text:
        sent_tokens = nltk.sent_tokenize(raw_text.read())
    
    sent_words = []
    for sentence in sent_tokens:
        sent_words.append(tokenizer.tokenize(sentence))
        
    # After tokenizing, some sentences (or at least what the sentence tokenizer took to be sentences) are empty.
    # Let's remove them to keep things clean.
    sent_words = [sentence for sentence in sent_words if sentence]
    
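    # splitext() drops an extension only if there is one, so names without an extension are not truncated.
    with open('sep_articles_tokenized/' + os.path.splitext(f)[0] + '.txt', 'w') as tokenized_text: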
        tokenized_text.write(str(sent_words))
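
The loop above stores each document as the `str()` of a nested list, and the later cells read it back with `ast.literal_eval()`. The following is a minimal sketch of that round trip, using a temporary directory and made-up sentences rather than the real article folders.

```python
import ast
import os
import tempfile

# Made-up tokenized sentences, shaped like sent_words above.
sent_words = [['Beauty', 'is', 'subjective'], ['Taste', 'varies']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')

    # Save the nested list as its Python literal representation.
    with open(path, 'w') as out_file:
        out_file.write(str(sent_words))

    # Parse the literal back; literal_eval() only accepts literals and runs no code.
    with open(path) as in_file:
        restored = ast.literal_eval(in_file.read())

print(restored == sent_words)
```

The same round trip works for the lists of (word, tag) tuples produced in the next step, since tuples and strings are literals too.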

### Part-of-speech tagging
We can now tag each word with its part of speech. It is better to do this before removing the stop words, because stop words carry information about the sentence's syntactic structure, and therefore about each word's role in the sentence. For the tagging, we use NLTK's pos_tag() function.

In [28]:
for f in os.listdir('sep_articles_tokenized/'):
    with open('sep_articles_tokenized/' + f) as sent_words_raw:
        sent_words = ast.literal_eval(sent_words_raw.read())
    
    sent_tagged = []
    for sentence in sent_words:
        sent_tagged.append(nltk.pos_tag(sentence))
        
    with open('sep_articles_tagged/' + f, 'w') as tagged_file:
        tagged_file.write(str(sent_tagged))

### Removing stop words
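NLTK includes a list of English stop words we can use. Each token is now a (word, tag) pair, so we compare the lowercased word against that list and keep the pair only if there is no match.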

In [29]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set keeps the membership test fast

In [37]:
for f in os.listdir('sep_articles_tagged/'):
    with open('sep_articles_tagged/' + f) as sent_tagged_raw:
        sent_tagged = ast.literal_eval(sent_tagged_raw.read())
        
    sent_clean = []
    for sentence in sent_tagged:
        sent_clean.append([word for word in sentence if word[0].lower() not in stop_words])
        
    with open('sep_articles_clean/' + f, 'w') as clean_file:
        clean_file.write(str(sent_clean))
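
The filtering step on a single sentence can be sketched as follows. Both the stop word set and the tags are hand-written for illustration: `example_stop_words` is a hypothetical stand-in for NLTK's list, and the tags are not actual pos_tag() output.

```python
# Hypothetical, tiny stop word set; the notebook uses NLTK's English list instead.
example_stop_words = {'the', 'is', 'of', 'a'}

# Hand-written (word, tag) pairs for illustration, not actual pos_tag() output.
tagged_sentence = [('The', 'DT'), ('judgment', 'NN'), ('of', 'IN'),
                   ('taste', 'NN'), ('is', 'VBZ'), ('aesthetic', 'JJ')]

# Compare the lowercased word (index 0) so that 'The' matches 'the'.
kept = [pair for pair in tagged_sentence
        if pair[0].lower() not in example_stop_words]
print(kept)
# [('judgment', 'NN'), ('taste', 'NN'), ('aesthetic', 'JJ')]
```

The tags stay attached to the surviving words, which is what the lemmatization step relies on.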

### Lemmatization
The last pre-processing step is to reduce each word to its lemma. This normalizes the word counts, so that variations of the same word are not counted as if they were different words altogether. Again, we do this with the help of NLTK's lemmatizers.

In [1]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [3]:
lemmatizer.lemmatize('desired', pos='a')

'desired'

NLTK's lemmatizer considers the part-of-speech of the word in order to get its lemma. This is why we did POS tagging before lemmatization. To make things easier, we build a custom lemmatizer function that just runs the lemmatizer with the corresponding argument for each part of speech. Since we are only interested in nouns, verbs, and adjectives, we will return False for any other part of speech.

In [2]:
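def customLemmatizer(tagged_word):
    """Return the lemma of a (word, tag) pair, or False for parts of speech we ignore."""
    word = tagged_word[0]
    pos = tagged_word[1]

    # Nouns: the lemmatizer treats words as nouns by default.
    if pos in ['NN', 'NNS', 'NNP', 'NNPS']:
        lemma = lemmatizer.lemmatize(word)

    # Verbs, in any tense or form.
    elif pos in ['VB', 'VBN', 'VBZ', 'VBD', 'VBP', 'VBG']:
        lemma = lemmatizer.lemmatize(word, pos='v')

    # Adjectives, including comparatives and superlatives.
    elif pos in ['JJ', 'JJR', 'JJS']:
        lemma = lemmatizer.lemmatize(word, pos='a')

    # Anything else (adverbs, prepositions, etc.) is discarded by the caller.
    else:
        lemma = False

    return lemma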
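
All the tags listed in the function share a prefix per category (NN for nouns, VB for verbs, JJ for adjectives), so the same mapping could also be written by prefix. This is only a sketch of an alternative, assuming the tags follow the Penn Treebank tag set used in the lists above; `wordnet_pos` is a hypothetical helper name, not part of NLTK.

```python
def wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if we ignore it."""
    if tag.startswith('NN'):
        return 'n'  # NN, NNS, NNP, NNPS
    if tag.startswith('VB'):
        return 'v'  # VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('JJ'):
        return 'a'  # JJ, JJR, JJS
    return None     # adverbs, prepositions, determiners, etc.

for tag in ['NNS', 'VBD', 'JJR', 'RB', 'IN']:
    print(tag, wordnet_pos(tag))
```

The returned letter could then be passed as the `pos` argument of `lemmatizer.lemmatize()`, with `None` signalling that the word should be skipped. Whether explicit lists or prefixes read better is a matter of taste.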

Time to run the lemmatizer! We go through each sentence word by word and collect the lemmas into a single flat list per document.

In [12]:
for f in os.listdir('sep_articles_clean/'):
    with open('sep_articles_clean/' + f) as sent_clean_raw:
        sent_clean = ast.literal_eval(sent_clean_raw.read())

    lemmatized_text = []
    for sentence in sent_clean:
        for word in sentence:
            lemma = customLemmatizer(word)
            if lemma:
                lemmatized_text.append(lemma)
                
    # Remove words with two letters or fewer, and lowercase the rest.
    lemmatized_text = [word.lower() for word in lemmatized_text if len(word) > 2]
                
    with open('sep_articles_lemmatized/' + f, 'w') as lemmatized_file:
        lemmatized_file.write(str(lemmatized_text))
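
The final length filter mainly removes short fragments that the letters-only tokenizer can leave behind when it splits on apostrophes or hyphens. A minimal sketch with a hand-written list of lemmas (not real output from the pipeline):

```python
# Hand-written lemmas for illustration, including short fragments.
lemmas = ['Kant', 's', 're', 'shape', 'aesthetic', 'be']

# Keep only words longer than two letters, lowercased.
filtered = [word.lower() for word in lemmas if len(word) > 2]
print(filtered)
# ['kant', 'shape', 'aesthetic']
```

Note that genuine two-letter words such as 'be' are dropped as well.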