# Python for Digital Humanities

## Unit #4: Stop Words

* Overview
* Loading the NLTK Stop Words
* Removing Stop Words from our Text
* Creating our own List of Stop Words


<font color=blue>---------------------------------------------------------------</font>

### 4.1 Overview 


Some words, like "is", "the", and "and", occur frequently in text but carry little semantic content. If we wanted to know the 25 most commonly used words in a text, these words would top the list.  Unfortunately, they would not give us any insight into the topics that the text focuses on.  In text analysis, we call these _stop words_, and we generally want to remove them before we do any analysis on the text.
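
To see the problem concretely, here is a minimal sketch that counts word frequencies in a made-up sentence. The sample text and variable names are assumptions for illustration only and are not part of the course files. It relies only on `collections.Counter` from the Python standard library.

```python
from collections import Counter

# A made-up sample sentence (an assumption, for illustration only)
sample_text = "the cat and the dog and the bird sat on the mat"

# Split on whitespace and count how often each word occurs
word_counts = Counter(sample_text.split())

# Collect the three most frequent words with their counts
top_three = word_counts.most_common(3)
print(top_three)
```

The two most frequent words here are "the" and "and", which tell us nothing about what the sentence is about.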

That, in turn, means that we need a list of words that we deem unnecessary for our analysis.  While we could sit down and come up with our own list of stop words, it is easier to start with a list of words that other programmers have already created.  There is such a list in the `nltk` package.




### 4.2  Loading the NLTK Stop Words



```
# If the next lines raise a LookupError, run nltk.download("stopwords") once first
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
```

Let's take a look at the list of stop words.

```
print(stop_words)
```

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

This list may not be complete, but it is a good start.

### 4.3  Removing Stop Words from our Text

Now that we have a list of words to remove, we need to learn how to remove them.  The code below uses a Python feature, the list comprehension, that we have not covered yet.  We will learn how it works as we cover more Python.  For now, we'll call it a magic formula.

 
<font color=blue>---------------------------------------------------------------</font>



```
# Load the stop words
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

# Read the Emma excerpt
with open("austen-emma-excerpt.txt") as f:
    raw_text = f.read()

# Convert to a list of words
words = nltk.word_tokenize(raw_text)

# Perform the magic: keep each word whose lowercase form is not a stop word
reduced_words = [w for w in words if w.lower() not in stop_words]
print(reduced_words)
```

['Emma', 'Jane', 'Austen', '1816', 'VOLUME', 'CHAPTER', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'rich', ',', 'comfortable', 'home', 'happy', 'disposition', ',', 'seemed', 'unite', 'best', 'blessings', 'existence', ';', 'lived', 'nearly', 'twenty-one', 'years', 'world', 'little', 'distress', 'vex', '.', 'youngest', 'two', 'daughters', 'affectionate', ',', 'indulgent', 'father', ';', ',', 'consequence', 'sister', "'s", 'marriage', ',', 'mistress', 'house', 'early', 'period', '.', 'mother', 'died', 'long', 'ago', 'indistinct', 'remembrance', 'caresses', ';', 'place', 'supplied', 'excellent', 'woman', 'governess', ',', 'fallen', 'little', 'short', 'mother', 'affection', '.']
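
For readers curious about the magic formula, the sketch below unrolls it into an ordinary loop. To stay self-contained it uses a tiny hand-written stop list and a hand-made list of words; both are assumptions for illustration, and neither is the real `nltk` list or the Emma excerpt.

```python
# A tiny hand-written stop list (an assumption; the real list comes from nltk)
small_stop_words = ["the", "and", "of", "a"]

# A short hand-made list of words standing in for the tokenized text
sample_words = ["The", "best", "blessings", "of", "existence"]

# The one-line form used in the cell above...
reduced_short = [w for w in sample_words if w.lower() not in small_stop_words]

# ...does the same work as this explicit loop
reduced_long = []
for w in sample_words:
    # Lowercase the word before checking, so that "The" matches "the"
    if w.lower() not in small_stop_words:
        reduced_long.append(w)

print(reduced_short)
print(reduced_long)
```

Both versions print the same list. Note that the words themselves keep their original capitalization; only the comparison is done in lowercase.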




#### Extra Punctuation

The words look much better; however, we are seeing punctuation listed as words.  We could handle this in a couple of ways.  First, we could remove punctuation from our words before we apply the stop words.  This would require another magical formula. Or, we could include punctuation in our list of stop words.
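
As a rough sketch of the first option, the code below drops any token made up entirely of punctuation characters. It uses `string.punctuation` from the standard library and a hand-made token list, which is an assumption standing in for the tokenizer's output.

```python
import string

# A hand-made list of tokens, similar to what the tokenizer produced above
sample_tokens = ["Emma", "Woodhouse", ",", "handsome", ";",
                 "sister", "'s", "marriage", "."]

# Keep a token only if at least one of its characters is not punctuation
no_punct = [t for t in sample_tokens
            if not all(ch in string.punctuation for ch in t)]
print(no_punct)
```

This removes the commas, semicolons, and periods, but `'s` survives because it contains a letter. That is one reason the second option, adding such tokens to the stop word list, can be more convenient.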

<font color=blue>---------------------------------------------------------------</font>

### 4.4 Creating our own List of Stop Words

There is no such thing as a perfect list of stop words.  We may have to keep tweaking the list until it suits our corpus.  So, it is important to know how to extend the list of stop words that we loaded from `nltk`.

Specifically, let's see how to include punctuation, like commas, periods, and semicolons, and even the possessive `'s`.

To do this, we can use the list's `extend` method to merge our list into the existing list of stop words.  This can be done in two lines:
```
our_stopwords = [",", ".", ";", "'s"]
stop_words.extend(our_stopwords)
```
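
A small sketch of what `extend` does, compared with `append`, may help here. The short lists below are assumptions that stand in for the much longer `nltk` list.

```python
# A short stand-in for the nltk list (an assumption, to keep the example small)
demo_stop_words = ["the", "and"]
extra = [",", "."]

# extend adds each item of `extra` to the end of the list, in place
demo_stop_words.extend(extra)
print(demo_stop_words)

# append, by contrast, adds the whole list as one single item
appended = ["the", "and"]
appended.append(extra)
print(appended)
```

Because `extend` changes the list in place and returns `None`, we call it on its own line rather than assigning its result back to `stop_words`.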

Let's see it in action with the complete code.



<font color=blue>---------------------------------------------------------------</font>

```
# Load the stop words
import nltk
from nltk.corpus import stopwords
stop_words = stopwords.words("english")

# Extend the stop words with our own punctuation tokens
our_stopwords = [",", ".", ";", "'s"]
stop_words.extend(our_stopwords)

# Read the Emma excerpt
with open("austen-emma-excerpt.txt") as f:
    raw_text = f.read()

# Convert to a list of words
words = nltk.word_tokenize(raw_text)

# Perform the magic: keep each word whose lowercase form is not a stop word
reduced_words = [w for w in words if w.lower() not in stop_words]
print(reduced_words)
```

['Emma', 'Jane', 'Austen', '1816', 'VOLUME', 'CHAPTER', 'Emma', 'Woodhouse', 'handsome', 'clever', 'rich', 'comfortable', 'home', 'happy', 'disposition', 'seemed', 'unite', 'best', 'blessings', 'existence', 'lived', 'nearly', 'twenty-one', 'years', 'world', 'little', 'distress', 'vex', 'youngest', 'two', 'daughters', 'affectionate', 'indulgent', 'father', 'consequence', 'sister', 'marriage', 'mistress', 'house', 'early', 'period', 'mother', 'died', 'long', 'ago', 'indistinct', 'remembrance', 'caresses', 'place', 'supplied', 'excellent', 'woman', 'governess', 'fallen', 'little', 'short', 'mother', 'affection']




## Activity:  Using Stop Words on a Larger File


Make sure that you have downloaded the file "emma_chapter_one.txt".
Read in the contents of that file and remove your stop words from the text.  Are there any more words that you feel should be added to the stop words?
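
One possible way to spot candidates is to count the words that remain after filtering and look at the most frequent ones. The sketch below assumes a hypothetical `remaining_words` list; in the activity you would use your own `reduced_words` instead.

```python
from collections import Counter

# Hypothetical stand-in for the reduced_words list produced in the activity
remaining_words = ["Emma", "could", "Mr.", "Weston", "could",
                   "Emma", "would", "could"]

# Count the words case-insensitively
counts = Counter(w.lower() for w in remaining_words)

# Print the two most frequent remaining words with their counts
for word, n in counts.most_common(2):
    print(word, n)
```

Frequent words that say little about the topic are candidates for your own stop word list; frequent names and nouns are usually worth keeping.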

    
<font color=blue>---------------------------------------------------------------</font>

### Summary

In this section, we learned how to remove commonly used words from our text before we do any significant text analysis.  We also saw how to extend the list of stop words with entries of our own.

We will want to save the extended list of stop words for future use.  Before we can do that, we will need to learn a little more Python.


<font color=blue>---------------------------------------------------------------</font>