<img align="left" width="200" src="Picture1.png">

# 3. Preparing data

## Using NLP on texts

As in any data analysis project, the data might need to be prepared and cleaned before doing Natural Language Processing (NLP). This might include:
<ul><li>Removing certain words</li>
<li>Combining or removing chunks of text</li>
<li>Converting words to lower case</li>
<li>Removing punctuation or stop words</li></ul>
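
As a rough illustration of two of these steps, the sketch below uses only the Python standard library to lowercase a short string and strip punctuation. The sample sentence and the `clean_text` helper are made up for this example; the rest of this notebook does similar work with spaCy.

```python
import string

def clean_text(raw):
    """Lowercase the text and remove ASCII punctuation (hypothetical helper)."""
    lowered = raw.lower()
    # With a third argument, str.maketrans maps each listed character to None,
    # so translate() deletes it. string.punctuation covers ASCII marks only;
    # curly quotes like those in Little Women would need to be added by hand.
    table = str.maketrans("", "", string.punctuation)
    return lowered.translate(table)

sample = "It's so dreadful to be poor!"
print(clean_text(sample))
```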

In [1]:
#Run the line below if spaCy is not downloaded already.
#! python -m spacy download en_core_web_sm

In [2]:
import spacy

spaCy has a built-in list of stop words, which are common words that you might want to remove before doing NLP. The code in the next cell prints the stop words so you can see the list. More advanced NLP projects might create their own stop word lists, depending on the data.

In [3]:
nlp = spacy.load('en_core_web_sm')
stopwords = nlp.Defaults.stop_words
print(len(stopwords))
print(stopwords)

326
{'same', 'wherein', 'part', 'nothing', 'thereupon', 'becoming', 'twelve', 'ever', 'thereby', 'others', 'whither', 'could', 'noone', 'anywhere', 'until', 'most', 'onto', 'nor', 'is', 'with', '‘re', 'once', '‘m', 'a', 'regarding', 'still', 'whenever', 'top', 'who', 'somehow', 'these', 'twenty', 'after', 'moreover', 'along', 'several', 'hundred', 'does', 'forty', 'n‘t', 'name', 'none', 'thru', 'yourself', 'then', 'such', 'an', 'anyhow', 'wherever', 'those', 'everywhere', 'nevertheless', 'because', 'yet', "'d", 'anyone', "'m", 'myself', 'were', 'become', 'from', 'her', 'few', 'namely', 'n’t', 'last', 'indeed', 'besides', 'will', 'that', 'or', 'up', 'anything', 'latter', 'afterwards', 'rather', 'if', 'except', 'sixty', 'much', 'do', 'on', 'off', 'themselves', 'within', 'due', 'next', '’ve', 'own', 'mostly', 'therein', 'although', 'neither', 'whoever', 'say', 'using', 'be', 'am', 'either', 'alone', 'via', 'sometime', 're', 'but', 'empty', 'beforehand', 'hereby', 'throughout', 'get', 'bot
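
A project-specific stop word list is often just the default set plus or minus a few words. The sketch below shows one way this could be done with ordinary Python sets. The `build_stopwords` helper, the small `base` set, and the example words are all assumptions for illustration; in this notebook, `base` could be the `stopwords` set printed above.

```python
def build_stopwords(base, extra=(), keep=()):
    """Return a new set: base plus `extra`, minus `keep` (hypothetical helper)."""
    custom = set(base)              # copy, so the original set is not modified
    custom.update(extra)            # words to treat as stop words in this project
    custom.difference_update(keep)  # words we want to keep in the analysis
    return custom

# Tiny stand-in for spaCy's default list, so the example runs on its own
base = {"a", "the", "nothing", "with"}
custom = build_stopwords(base, extra={"said"}, keep={"nothing"})
print(sorted(custom))
```

Working on a copy leaves the default list untouched, so different analyses can use different stop word lists side by side.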

I've added a file, `little-women.txt`, which contains the first chapter of Little Women. I am only including one chapter because the whole book was too much text for spaCy to process at once. Running the code below loads that text into a variable named `text`.

In [4]:
#Import the Little Women text file
with open ("little-women.txt", "r") as f:
    text = f.read()
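
If you want to work with more than one chapter, one option is to split a long text into smaller pieces and process them one at a time. The sketch below is a minimal example of that idea in plain Python. The `split_into_chunks` helper and the `max_chars` value are assumptions for illustration; it splits on blank lines so that paragraphs stay whole. Each chunk could then be passed to `nlp` separately.

```python
def split_into_chunks(text, max_chars=1000):
    """Group paragraphs into chunks of at most max_chars characters (hypothetical helper).

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Try adding the next paragraph to the chunk being built
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)  # the chunk is full, start a new one
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "first paragraph\n\nsecond paragraph\n\nthird paragraph"
for chunk in split_into_chunks(sample, max_chars=35):
    print(repr(chunk))
```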

Next, we will print out the first 500 characters so we can see how the text looks unprocessed.

In [5]:
print (text[:500])


CHAPTER ONE
PLAYING PILGRIMS


“Christmas won’t be Christmas without any presents,” grumbled Jo, lying
on the rug.

“It’s so dreadful to be poor!” sighed Meg, looking down at her old
dress.

“I don’t think it’s fair for some girls to have plenty of pretty
things, and other girls nothing at all,” added little Amy, with an
injured sniff.

“We’ve got Father and Mother, and each other,” said Beth contentedly
from her corner.

The four young faces on which the firelight shone brightened at the
cheer


Sometimes it is easier to work with text if all the words are lowercase. We can lowercase the text easily with this code:

In [6]:
text = text.lower()
print (text[:500])


chapter one
playing pilgrims


“christmas won’t be christmas without any presents,” grumbled jo, lying
on the rug.

“it’s so dreadful to be poor!” sighed meg, looking down at her old
dress.

“i don’t think it’s fair for some girls to have plenty of pretty
things, and other girls nothing at all,” added little amy, with an
injured sniff.

“we’ve got father and mother, and each other,” said beth contentedly
from her corner.

the four young faces on which the firelight shone brightened at the
cheer


Still looks pretty normal, right? Now we're going to use spaCy to turn that text into tokens. Once the text is tokenized, we can use NLP to do textual analysis. When we print the doc it will look the same as before, but the computer now understands the text as a sequence of tokens.

In [7]:
doc = nlp(text)
print (doc)


chapter one
playing pilgrims


“christmas won’t be christmas without any presents,” grumbled jo, lying
on the rug.

“it’s so dreadful to be poor!” sighed meg, looking down at her old
dress.

“i don’t think it’s fair for some girls to have plenty of pretty
things, and other girls nothing at all,” added little amy, with an
injured sniff.

“we’ve got father and mother, and each other,” said beth contentedly
from her corner.

the four young faces on which the firelight shone brightened at the
cheerful words, but darkened again as jo said sadly, “we haven’t got
father, and shall not have him for a long time.” she didn’t say
“perhaps never,” but each silently added it, thinking of father far
away, where the fighting was.

nobody spoke for a minute; then meg said in an altered tone, “you know
the reason mother proposed not having any presents this christmas was
because it is going to be a hard winter for everyone; and she thinks we
ought not to spend money for pleasure, when our men are suff

By running the next cell, we see that the text is 21,861 characters long, but once it is processed into tokens, its length becomes 5,478. As tokens, it is possible to do NLP.

In [9]:
print (len(text))
print (len(doc))

21861
5478
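
To see why the two numbers differ, it can help to count characters and tokens in a single sentence. The sketch below uses a simple regular expression as a stand-in tokenizer. This is an approximation for illustration only, not how spaCy tokenizes: as the outputs in this notebook show, spaCy splits contractions such as "won't" differently and keeps line breaks as tokens.

```python
import re

sentence = "it's so dreadful to be poor!"

# \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(len(sentence))  # number of characters, spaces included
print(len(tokens))    # number of tokens
print(tokens)
```

Each token is a word or punctuation mark rather than a single character, which is why the token count is so much smaller than the character count.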


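This is the code to run to remove stop words and punctuation. The blank lines in the output are whitespace tokens (line breaks), which this filter does not remove: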

In [10]:
for token in doc:
    # Keep the token only if it is neither a stop word nor punctuation
    if not token.is_stop and token.pos_ != "PUNCT":
        print(token.text)



chapter


playing
pilgrims




christmas
wo
christmas
presents
grumbled
jo
lying


rug



dreadful
poor
sighed
meg
looking
old


dress



think
fair
girls
plenty
pretty


things
girls
added
little
amy


injured
sniff



got
father
mother
said
beth
contentedly


corner



young
faces
firelight
shone
brightened


cheerful
words
darkened
jo
said
sadly
got


father
shall
long
time


silently
added
thinking
father
far


away
fighting



spoke
minute
meg
said
altered
tone
know


reason
mother
proposed
having
presents
christmas


going
hard
winter
thinks


ought
spend
money
pleasure
men
suffering


army
little
sacrifices


ought
gladly
afraid
meg
shook


head
thought
regretfully
pretty
things
wanted



think
little
spend
good


got
dollar
army
helped
giving


agree
expect
mother
want


buy
_
undine
sintran
_
wanted
long
said


jo
bookworm



planned
spend
new
music
said
beth
little
sigh


heard
hearth
brush
kettle
holder



shall
nice
box
faber
drawing
pencils
need


said
amy
decidedly



m
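
Printing is useful for a quick look, but for later analysis it is usually more convenient to collect the kept tokens in a list. The sketch below is one possible variation, not this notebook's own method: it checks punctuation with `token.is_punct` instead of `token.pos_`, and it also drops the whitespace tokens that show up as blank lines in the output above, using `token.is_space`. It assumes `spacy.blank("en")` and a made-up sentence so it can run without the downloaded model; in this notebook you could loop over the existing `doc` instead.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because is_stop,
# is_punct and is_space are lexical attributes that need no trained model
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("the girls got a dollar, and each bought presents.\n\n")

# Keep the text of every token that is not a stop word, punctuation, or whitespace
kept = [
    token.text
    for token in sample_doc
    if not token.is_stop and not token.is_punct and not token.is_space
]
print(kept)
```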

The code in the next cell prints each token together with the linguistic features spaCy has assigned to it. Once the tokens are labeled with their parts of speech, we can do analysis on the text.

In [11]:
for token in doc[100:125]:
    # Skip stop words and punctuation, then print the token's features
    if not token.is_stop and token.pos_ != "PUNCT":
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha)

said say VERB VBD ROOT xxxx True
beth beth PROPN NNP compound xxxx True
contentedly contentedly PROPN NNP nsubj xxxx True

 
 SPACE _SP dep 
 False
corner corner NOUN NN pobj xxxx True


 

 SPACE _SP dep 

 False
young young ADJ JJ amod xxxx True
faces face NOUN NNS nsubj xxxx True
firelight firelight NOUN NN nsubj xxxx True
shone shine VERB VBD relcl xxxx True
brightened brighten VERB VBN advcl xxxx True

 
 SPACE _SP dep 
 False
cheerful cheerful ADJ JJ amod xxxx True
words word NOUN NNS pobj xxxx True


What is this telling us? The tags are identifying the parts of speech of the tokens, excluding stop words and punctuation. More information on linguistic features can be found at: https://spacy.io/usage/linguistic-features

<p>Take this line: `faces face NOUN NNS nsubj xxxx True`
<p>faces: The token
<p>face: Lemma, or the base form of the word (how it might appear in the dictionary)
<p>NOUN: Part of speech
<p>NNS: Detailed part-of-speech tag (here, a plural noun)
<p>nsubj: Syntactic dependency (the relation between tokens)
<p>xxxx: Word shape, which records capitalization, punctuation, and digits
<p>True: The token is alphabetic
<p>Stop words were filtered out before printing, so every token shown is not on the stop word list
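
The word shape column can look puzzling, since every word in the output above has the shape `xxxx` regardless of its length. The sketch below is a simplified re-implementation of the idea, written for illustration and not taken from spaCy's code: letters become `x` or `X`, digits become `d`, other characters are kept as they are, and a run of the same symbol is cut off after four. The `simple_shape` name is made up for this example.

```python
def simple_shape(word, max_run=4):
    """Approximate a word shape such as 'Xxxxx' or 'dddd' (hypothetical helper)."""
    shape = []
    last = None
    run = 0
    for ch in word:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and other characters are kept as-is
        # Count how many times in a row the same symbol has appeared
        if symbol == last:
            run += 1
        else:
            last = symbol
            run = 1
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)

for word in ["faces", "Christmas", "1868", "jo"]:
    print(word, simple_shape(word))
```

Because the text was lowercased earlier in this notebook, none of the shapes in the output above contain an `X`.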

At this point we've already done a lot: turned text into tokens, removed stop words and punctuation, and annotated the text with information about its parts of speech. In the next notebook, we are going to do a little more textual analysis, including visualizations, tables, and comparisons.

## More information:

<p><a href="https://nlp.stanford.edu/">Stanford NLP</a></p>
<p><a href="https://spacy.io/">spaCy</a></p>
<p><a href="https://openrefine.org/">OpenRefine</a></p>