<img align="left" width="200" src="Picture1.png">

# Preparing data

## Using NLP on texts

Like any data analysis project, the data might need to be prepared before doing Natural Language Processing (NLP). This session will use an NLP tool called spaCy to remove certain words from a document before textual analysis.

In [None]:
#Run the line below if the spaCy model en_core_web_sm is not downloaded already.
#! python -m spacy download en_core_web_sm
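If you are not sure whether the model is already installed, one option is to check before trying to load it. The sketch below assumes spaCy itself is installed and uses `spacy.util.is_package`, which reports whether a pip-installed package with the given name can be found. The variable names are only illustrative.

```python
import spacy

MODEL_NAME = "en_core_web_sm"  # the model used in this notebook

# is_package returns True if a pip-installed package with this name is found
model_installed = spacy.util.is_package(MODEL_NAME)

if model_installed:
    print(MODEL_NAME, "is installed and can be loaded with spacy.load().")
else:
    print(MODEL_NAME, "is missing. Run: python -m spacy download", MODEL_NAME)
```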

In [None]:
import spacy

spaCy includes a list of stop words: common words (such as "the" or "and") that you might want to remove before doing NLP. The code in the next cell prints how many stop words there are, followed by the list itself. More advanced NLP projects might create their own stop word lists, depending on the data.

In [None]:
nlp = spacy.load('en_core_web_sm')
stopwords = nlp.Defaults.stop_words
print(len(stopwords))
print(stopwords)
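As a rough illustration of a project-specific stop word list, the sketch below combines spaCy's English stop words with a few extra words. The extra words (character names) and the sample word list are made up for this example. Note that this only filters plain strings; it does not change what spaCy itself reports through `token.is_stop`.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Hypothetical project-specific additions: character names from the novel
extra_stop_words = {"jo", "meg", "beth", "amy"}

# The union builds a new set, so spaCy's own list is left unchanged
custom_stop_words = STOP_WORDS | extra_stop_words

sample_words = ["Jo", "was", "a", "dreadful", "grumbler"]  # made-up sample

# Compare in lowercase, because the stop word list is all lowercase
kept = [word for word in sample_words if word.lower() not in custom_stop_words]
print(kept)
```

Building a new set instead of editing spaCy's defaults keeps the change local to this one analysis.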

I've added a file, `little-women.txt`, which contains the first chapter of Little Women. Running the code below reads that file and stores its contents in a variable named `text`.

In [None]:
#Import the Little Women text file
with open("little-women.txt", "r") as f:
    text = f.read()

Next, we will print the first 500 characters so we can see how the unprocessed text looks.

In [None]:
print(text[:500])
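Keep in mind that slicing a string counts characters, not words. The sketch below shows the difference on the novel's opening sentence, used here as a stand-in for the full `text` variable; splitting on whitespace is only a rough way to count words.

```python
# Short sample standing in for the `text` variable loaded from the file
sample_text = "Christmas won't be Christmas without any presents."

first_chars = sample_text[:9]                     # slice by characters
first_words = " ".join(sample_text.split()[:3])   # slice by whitespace-separated words

print(first_chars)   # Christmas
print(first_words)   # Christmas won't be
```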

Looks pretty normal, right? Now we're going to use spaCy to split that text into tokens: the individual words and punctuation marks that NLP tools work with. Once the text is tokenized, we can use NLP to do textual analysis. Printing the resulting `doc` shows the same text as before, but spaCy now stores it as a sequence of tokens.

In [None]:
doc = nlp(text)
print(doc)
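To see that the processed document really is a different kind of object even though it prints the same, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) so that it runs without downloading a model, and it uses a short sample sentence instead of the full chapter.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed
nlp_blank = spacy.blank("en")

sample_text = "We have Father and Mother, and each other."  # short sample sentence
sample_doc = nlp_blank(sample_text)

print(type(sample_text).__name__)        # the raw text is a plain string
print(type(sample_doc).__name__)         # the processed text is a spaCy Doc
print(sample_doc.text == sample_text)    # the Doc still holds the original text
print([token.text for token in sample_doc])
```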

Running the next cell shows that the text is 21,861 characters long, but once it is processed into tokens, its length becomes 5,473.

In [None]:
print(len(text))
print(len(doc))
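The two numbers differ because `len()` counts characters for a string but tokens for a spaCy document. A small sketch on one sample sentence, again assuming a blank English pipeline so no model is required, also shows why the token count is higher than a simple word count: punctuation marks become tokens of their own.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model needed

sample_text = "We have Father and Mother, and each other."
sample_doc = nlp_blank(sample_text)

n_characters = len(sample_text)      # every letter, space, and punctuation mark
n_words = len(sample_text.split())   # rough count: pieces separated by whitespace
n_tokens = len(sample_doc)           # words plus separate punctuation tokens

print(n_characters, n_words, n_tokens)
```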

The next cell prints the first 50 tokens, one per line. We see that punctuation marks are their own tokens, just as words are.

In [None]:
for token in doc[:50]:
    print(token)

Before doing more text analysis, it might be good to remove stop words and punctuation:

In [None]:
for token in doc:
    #keep the token only if it is neither a stop word nor punctuation
    if not token.is_stop and token.pos_ != "PUNCT":
        print(token.text)
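Printing is useful for inspection, but for later analysis you will usually want to keep the filtered tokens in a list. The sketch below is one way to do that. It assumes a blank English pipeline, which has no part-of-speech tagger, so it checks `token.is_punct` rather than `token.pos_`; the sample sentence is only for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model needed
sample_doc = nlp_blank("We have Father and Mother, and each other.")

# Keep the text of every token that is neither a stop word nor punctuation
kept_tokens = [token.text for token in sample_doc
               if not token.is_stop and not token.is_punct]

print(kept_tokens)
```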

The code in the next cell prints each remaining token along with several of its linguistic features.

In [None]:
for token in doc[100:125]:
    #skip stop words and punctuation, then print the token and its features
    if not token.is_stop and token.pos_ != "PUNCT":
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)

What is this telling us? Each line shows one token that is neither a stop word nor punctuation, followed by its linguistic features, such as its part of speech. More information on linguistic features can be found at: https://spacy.io/usage/linguistic-features

<p>Take this line: `faces face NOUN NNS nsubj xxxx True False`
<p>faces: The token
<p>face: Lemma, or the base form of the word (how it might appear in the dictionary)
<p>NOUN: Part of speech
<p>NNS: Detailed part-of-speech tag (here, a plural noun)
<p>nsubj: Syntactic dependency (the relation between tokens)
<p>xxxx: Word shape, which encodes capitalization, punctuation, and digits
<p>True: The token is alphabetic
<p>False: The token is not on the stop word list
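Word shape and the alphabetic flag are easiest to understand by comparing a few different tokens. The sketch below assumes a blank English pipeline, which is enough for these two attributes (lemma, part of speech, and dependency need the full model). The four sample tokens are chosen only for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model needed
sample_doc = nlp_blank("Christmas faces 1868 !")

# Print each token with its word shape and whether it is purely alphabetic
for token in sample_doc:
    print(token.text, token.shape_, token.is_alpha)

# Collect the shapes in a dictionary for easy lookup
shapes = {token.text: token.shape_ for token in sample_doc}
```

In the shape, `X` stands for an uppercase letter, `x` for a lowercase letter, and `d` for a digit, and long runs of the same kind of character are cut off after four, which is why `faces` appears as `xxxx`.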

At this point we've already done a lot: turned text into tokens, removed stop words and punctuation, and annotated the text with information about parts of speech. In the next notebook, we are going to do a little more textual analysis, including visualizations, tables, and comparisons.

## More information:

<p><a href="https://nlp.stanford.edu/">https://nlp.stanford.edu/</a></p>
<p><a href="https://spacy.io/">https://spacy.io/</a></p>