# Intro to Natural Language Processing
In this notebook, we'll start using Python's regular expression (RegEx) module, `re`, and work through a simple natural language processing exercise.

## What are Strings?
Strings are a data type usually used to store text. They are different from other data types in Python, such as integers, lists, or dictionaries. In Python, strings are enclosed in quotation marks, either "" or ''. When you check an object's type, a string displays `str` and an integer displays `int`.

In [2]:
# Check this object's type
type("hello world!")

str

In [3]:
# Check this object's type
type(342)

int

Strings can also be combined into objects like lists and dictionaries. A list is just that: a collection of objects, such as a bunch of strings or a bunch of numbers. Lists are enclosed in square brackets [ ]. Dictionaries are kind of like lists, but instead of storing single items like this:

* Cappuccino
* Latte
* Chai

they can store pairs of objects linked together, kind of like this:

Key | Value
--- | ---
one: | you're like a dream come true
two: | just wanna be with you
three: | you know it's plain to see, that you're the only one for me

Dictionaries are enclosed in curly brackets { }.

In [4]:
print(type(["Cappuccino", "Latte", "Chai"]))
print(type({"one":"you're like a dream come true", "two": "just wanna be with you", "three": "you know it's plain to see, that you're the only one for me"}))

<class 'list'>
<class 'dict'>
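To see how the two containers differ in practice, here is a small sketch of looking things up in each. The variable names (`drinks`, `lyrics`, `first_drink`, `line_two`) are made up for illustration and are not used elsewhere in this notebook.

```python
# Hypothetical example values, reusing the drinks and lyrics from above
drinks = ["Cappuccino", "Latte", "Chai"]
lyrics = {"one": "you're like a dream come true", "two": "just wanna be with you"}

# Lists are indexed by position, counting from 0
first_drink = drinks[0]

# Dictionaries are indexed by key
line_two = lyrics["two"]

print(first_drink)
print(line_two)
```

A list is a good fit when order matters, and a dictionary is a good fit when you want to look a value up by name.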


## What's RegEx?
Regular expressions, or RegEx, are a way to search and parse strings to make them more useful to a computer program. For example, we might want to split a paragraph into separate sentences to assess average sentence length, or split a text into separate words to do a word frequency analysis or to remove unwanted common words like "to" and "the" that don't give us the information we want.

We can do things like match a substring, or search for a substring within another string.

In [5]:
# Import Python's regular expression module
import re
re.match("You", "You know nothing John Snow.")

<_sre.SRE_Match object; span=(0, 3), match='You'>

In [6]:
re.search("Winter", "Winter is coming.")

<_sre.SRE_Match object; span=(0, 6), match='Winter'>

The difference between `match` and `search` is that `match` only checks whether the pattern matches *at the very beginning* of the string, whereas `search` looks for the first place the pattern occurs anywhere within the string. When nothing is found, both return `None`, so the cell shows no output.

In [7]:
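# "know" is not at the very beginning of the string, so match returns None and the cell shows no output
re.match("know", "You know nothing John Snow.")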

In [8]:
re.search("is", "Winter is coming.")

<_sre.SRE_Match object; span=(7, 9), match='is'>
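Here is a minimal side-by-side sketch of the two functions on the same sentence. The variable names (`sentence`, `at_start`, `anywhere`) are just for illustration. It also shows two handy methods on the match object: `group()` for the matched text and `span()` for its position.

```python
import re

sentence = "You know nothing John Snow."

# re.match only succeeds if the pattern matches at position 0
at_start = re.match("know", sentence)

# re.search scans the whole string for the first occurrence
anywhere = re.search("know", sentence)

print(at_start)          # None, because the sentence does not start with "know"
print(anywhere.group())  # the text that was matched
print(anywhere.span())   # (start, end) positions of the match
```

Because a failed match returns `None`, it is a good habit to check the result before calling `group()` or `span()` on it.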

## RegEx Patterns
RegEx also allows us to match whole classes of characters. For example, '\w+' matches a run of one or more word characters (letters, digits, and underscores), which is a simple way to find words.

In [9]:
# The r prefix makes this a raw string, so Python leaves the backslash alone
word = r'\w+'
re.search(word, "Is it too late now to say sorry?")

<_sre.SRE_Match object; span=(0, 2), match='Is'>

Here are some common RegEx patterns:
    
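pattern | matches | example
--- | --- | ---
\w+ | one or more word characters (a word) | "Bieber"
\d | a digit (add + to match a run of digits) | 250624
\s | a single whitespace character | ' '
.\* | any run of characters (except newlines) | 25or6to4
+ or \* | repeats the preceding item: + means one or more times, \* means zero or more times | 'Aaaaaaaaaaaa'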
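To make the two repetition symbols concrete, here is a small sketch. The test strings and variable names are invented for illustration. `+` needs at least one copy of the preceding item, while `*` is also happy with none.

```python
import re

# "+" means one or more of the preceding item
runs_of_a = re.findall(r"a+", "Aaaa baa b")

# "*" means zero or more, so "ba*" also accepts a bare "b"
star_ok = re.fullmatch(r"ba*", "b") is not None
plus_ok = re.fullmatch(r"ba+", "b") is not None

print(runs_of_a)
print(star_ok, plus_ok)
```

Note that the capital "A" is not part of the first run, because RegEx patterns are case sensitive by default.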

Capitalizing a character class negates it. For example, \S matches anything that is NOT whitespace. You can also define a set or range of characters with square brackets:

pattern | matches | example
--- | --- | ---
[a-z] | any single lowercase letter (add + to match a run) | "sorry"
[A-Za-z@.] | any single upper or lower case letter, at sign, or period | MyEmail@bah.com
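The patterns in these tables can be tried out with `re.findall`, which returns every non-overlapping match as a list. This is only a sketch: the sample `text` and the variable names are made up for illustration.

```python
import re

text = "Order 66 was sent to MyEmail@bah.com at 9 pm"

digits_each = re.findall(r"\d", text)          # one digit per match
digits_runs = re.findall(r"\d+", text)         # runs of one or more digits
non_space = re.findall(r"\S+", text)           # runs of non-whitespace characters
lowercase = re.findall(r"[a-z]+", text)        # runs of lowercase letters
email_like = re.findall(r"[A-Za-z@.]+", text)  # runs of letters, at signs, and periods

print(digits_each)
print(digits_runs)
print(non_space)
print(lowercase)
print(email_like)
```

Notice that `[a-z]+` breaks "Order" and "MyEmail" apart at the capital letters, and that `[A-Za-z@.]+` matches ordinary words too, so it is far from a real email validator.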

Run the code below to split the string on whitespace into separate words:

In [12]:
re.split(r'\s+', "The baristas in this coffee shop are playing a lot of Bieber.")

['The',
 'baristas',
 'in',
 'this',
 'coffee',
 'shop',
 'are',
 'playing',
 'a',
 'lot',
 'of',
 'Bieber.']

## Getting Started Processing Text
Now we'll get started processing text with `re.split()` and `re.findall()`. Please complete and run the cell below.

In [23]:
first_string = "Congrats on your first RegEx split!  How was it?  Hopefully not too difficult.  Can you find 5 sentences?"

# TO DO: Write a regular expression to identify the sentence endings "!", "?", and ".".
## To do this, first type "r" to make the pattern a raw string, so Python passes any
## backslashes to the RegEx engine unchanged. Right after the r, with no spaces, type
## your brackets, and put the three sentence end symbols inside the brackets with no
## spaces or commas. Inside brackets, the period is read as a literal period instead
## of as the "match anything" symbol.
sentence_endings = r"[!?.]"

# Next, split the string first_string into sentences based on the sentence_endings
## expression you just created, and print the result. The split function will take
## 2 inputs inside the inner parentheses, and these two inputs will be separated by
## a comma. The first input will be the name of the regular expression you created above
## and the second input will be first_string.
print(re.split(sentence_endings, first_string))

# Find all capitalized words in first_string and print the result. First create the range of
## capital characters inside the brackets, then follow it immediately, no spaces, with
## the RegEx pattern for words, all inside the quotation marks.
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, first_string))

# Now split first_string on spaces and print the result. Remember the spaces RegEx pattern and put it in the quotes
spaces = r"\s+"
print(re.split(spaces, first_string))

# Find all digits in first_string and print the result
digits = r"\d"
print(re.findall(digits, first_string))

['Congrats on your first RegEx split', '  How was it', '  Hopefully not too difficult', '  Can you find 5 sentences', '']
['Congrats', 'RegEx', 'How', 'Hopefully', 'Can']
['Congrats', 'on', 'your', 'first', 'RegEx', 'split!', 'How', 'was', 'it?', 'Hopefully', 'not', 'too', 'difficult.', 'Can', 'you', 'find', '5', 'sentences?']
['5']
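You may have noticed that the sentence split leaves leading spaces on most pieces and an empty string at the end, because the text ends with a sentence-ending symbol. One possible cleanup, sketched here with illustrative variable names (`pieces`, `sentences`), is to strip the whitespace and drop anything that ends up empty.

```python
import re

first_string = "Congrats on your first RegEx split!  How was it?  Hopefully not too difficult.  Can you find 5 sentences?"

# Split on any single sentence-ending symbol
pieces = re.split(r"[!?.]", first_string)

# Strip surrounding whitespace and keep only the non-empty pieces
sentences = [p.strip() for p in pieces if p.strip()]

print(sentences)
print(len(sentences))
```

This simple approach would also split on periods inside abbreviations like "Mr.", which is one reason dedicated sentence tokenizers exist.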


## Importing a File and Tokenization
Now we're going to import a text file, the first chapter of Pride and Prejudice, and parse the chapter into individual sentences and words. This parsing is called "tokenization" because it returns a list of individual tokens (words or sentences), and we can run later operations and analysis on those lists.

In [3]:
# Import packages. Don't worry about these too much right now.
import nltk
nltk.download('punkt')

# TO DO: Import sent_tokenize from nltk.tokenize and then import word_tokenize from nltk.tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# TO DO: Import text file of Pride and Prejudice, chapter one by typing the file name 
## "pride_and_prejudice.txt" inside the first set of parentheses. Opening the file in
## "r" (text) mode gives us a string, which is what the tokenizers expect.
chapter_one = open("pride_and_prejudice.txt", "r").read()

# TO DO: Split chapter_one into sentences by typing chapter_one inside of the parentheses
sentences = sent_tokenize(chapter_one)

# TO DO: Use word_tokenize to tokenize the fourth sentence by putting a 3 inside the brackets, 
## and print. Note: many computer programs count from 0, not 1.  Therefore, the first sentence 
## would be indexed 0, the second would be indexed 1, etc.
tokenized_sent = word_tokenize(sentences[3])
print(tokenized_sent)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/brianfriederich/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Mr.', 'Bennet', 'replied', 'that', 'he', 'had', 'not', '.']
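To get a feel for what a word tokenizer does, here is a rough RegEx-only sketch. It is not how NLTK's `word_tokenize` works internally; it is just a simple stand-in that grabs runs of word characters or single punctuation marks. The names `sentence` and `rough_tokens` are illustrative.

```python
import re

sentence = "Mr. Bennet replied that he had not."

# Either a run of word characters, or one character that is
# neither a word character nor whitespace (i.e. punctuation)
rough_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(rough_tokens)
```

Compare this with the NLTK output above: the rough version splits "Mr." into "Mr" and ".", while NLTK keeps the abbreviation together. Handling cases like that is a big part of why we use a tokenizer library instead of a one-line pattern.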


In [4]:
# TO DO: Make a set of unique tokens in the entire chapter by putting chapter_one inside the parentheses.
unique_tokens = set(word_tokenize(chapter_one))

# TO DO: Print the unique tokens result by typing unique_tokens inside the parentheses.
print(unique_tokens)

set(['all', 'consider', 'four', 'go', 'children', 'certainly', 'young', 'send', 'to', 'fancied', 'woman', 'returned', 'very', 'none', 'Mrs.', 'fall', 'affect', 'large', 'quick', 'says', 'Long', 'surrounding', 'likely', 'design', 'Jane', 'odd', 'what', 'over-scrupulous', 'giving', ';', 'told', 'men', 'here', 'understanding', 'let', 'others', 'consideration', 'My', 'settling', 'engage', 'impatiently', 'experience', 'Lizzy', 'love', 'humour', 'merely', 'When', 'feelings', 'use', 'from', 'would', 'visit', 'by', 'fortune', 'live', 'therefore', 'Michaelmas', 'recommend', 'taken', 'themselves', 'tell', 'more', 'possession', '``', 'ignorant', 'known', 'cases', 'glad', 'must', 'me', 'account', 'word', 'this', 'pretend', 'can', 'servants', 'my', 'give', 'share', 'high', 'heard', 'Mr.', 'mixture', 'something', 'want', '!', 'information', 'end', 'compassion', 'answer', 'establishment', 'Why', 'A', 'Lucas', 'beauty', 'may', 'abuse', 'such', 'man', 'a', 'lines', 'so', 'talk', 'replied', 'What', 'ind
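A set keeps only one copy of each item, which is why it works for finding unique tokens. Here is a tiny sketch with a made-up token list showing that "My" and "my" count as different items until we lowercase them.

```python
# Hypothetical tokens, for illustration only
tokens = ["My", "dear", "my", "dear", ",", "My"]

# A set drops duplicates but treats "My" and "my" as different
unique = set(tokens)

# Lowercasing first merges the two spellings
unique_lower = set(t.lower() for t in tokens)

# Sets are unordered, so sort them for a stable printout
print(sorted(unique))
print(sorted(unique_lower))
```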

## Finding Things in Text
Nice work! You've already turned raw text into something more useful and easier to work with: a set of all the unique tokens in Chapter 1 of the book. You'll notice that capitalized words are counted as completely different words from their lowercase versions, and that punctuation marks are counted as words of their own. This could make analysis difficult, since we usually want to count forms of the same word together regardless of capitalization, and since in novels we usually don't care about symbols and punctuation (although in tweets, we'd care about symbols like @ and \#).

Now let's learn how to search within a text. We'll find the location of the first mention of servants, and create a list of every quote in the first chapter.

In [24]:
# TO DO: Search for the first occurrence of "servants" in chapter_one by running re.search on "servants"
## and chapter_one
match = re.search("servants", chapter_one)

# Print the start and end indexes of match, match.start() and match.end() respectively
print((match.start(), match.end()))

# Write a regular expression to search for quotes. Hint: remember the regex expression should be 
## r followed by quotation marks, with the expression you want written inside the quotation marks.
## The \ symbol makes sure that the symbol right after it is read literally instead of as a symbol 
## for some other action. For example, \. is read as a period instead of as a "match anything" 
## character. Also, put (.+?) inside of the quotes to capture the text inside. Don't worry too much about
## the meaning of (.+?) right now.
pattern1 = r"\"(.+?)\""

# Inside the parentheses, use re.findall on pattern1 and chapter_one to return a list of every quote in Chapter 1
print(re.findall(pattern1, chapter_one))

(1144, 1152)
['My dear Mr. Bennet,', 'have you heard that Netherfield Park is let at last?', 'But it is,', 'for Mrs. Long has just been here, and she told me all about it.', 'Do not you want to know who has taken it?', 'You want to tell me, and I have no objection to hearing it.', 'Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he came down on Monday in a chaise and four to see the place, and was so much delighted with it, that he agreed with Mr. Morris immediately; that he is to take possession before Michaelmas, and some of his servants are to be in the house by the end of next week.', 'What is his name?', 'Bingley.', 'Is he married or single?', 'Oh! single, my dear, to be sure! A single man of large fortune; four or five thousand a year. What a fine thing for our girls!', 'How so? how can it affect them?', 'My dear Mr. Bennet,', 'how can you be so tiresome! You must know that I am thinking of his 
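If you are curious about `(.+?)`, here is a short sketch on a made-up `sample` string. The parentheses capture the text between the quotation marks, and the `?` after `+` makes the match "lazy", so it stops at the nearest closing quote instead of running on to the last one.

```python
import re

sample = '"What is his name?" "Bingley." "Is he married or single?"'

# Greedy: .+ grabs as much as it can, so everything from the
# first quotation mark to the very last one becomes one match
greedy = re.findall(r'"(.+)"', sample)

# Lazy: .+? grabs as little as it can, so each quote is separate
lazy = re.findall(r'"(.+?)"', sample)

print(len(greedy))
print(lazy)
```

Keep in mind that `.` does not match newlines by default, so a quote that spans several lines of the file would be missed by this pattern.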

## Finding the Most Common Tokens
You've got this! Now, back to the word tokens. We're going to split the chapter into word tokens, convert these to lowercase so that all case versions of the same word are counted together, and then print the 10 most common tokens.

In [41]:
from collections import Counter
# TO DO: word_tokenize chapter_one
tokens = word_tokenize(chapter_one)

# Convert the tokens into lowercase
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 50), ('.', 46), ('``', 32), ("''", 32), ('you', 31), ('of', 29), ('to', 22), ('a', 21), ('the', 18), ('i', 17)]
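If `Counter` is new to you, here is a small sketch on a hand-written token list (the tokens and variable names are illustrative). A `Counter` behaves like a dictionary that maps each item to how many times it appeared.

```python
from collections import Counter

# Hypothetical tokens, for illustration only
tokens = ["You", "want", "to", "tell", "me", ",", "and", "you", "must", "know", "."]

# Lowercase first so "You" and "you" are counted together
counts = Counter(t.lower() for t in tokens)

print(counts["you"])
print(counts.most_common(1))
```

`most_common(n)` returns a list of `(item, count)` pairs, sorted from the highest count down.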


## Stopwords
So... that ended up not being that useful. We can see that punctuation and filler words make up the bulk of the tokens in this chapter. Luckily, there's a way to get rid of the filler words, which are called **stop words**. NLTK has a built-in list of these, but just to give you an idea of what such a list looks like, I've spelled out a stopwords list below.

In [42]:
stopwords = ("a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", 
"somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thickv", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the")

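Removing stop words comes down to a membership check. The sketch below uses a deliberately tiny, made-up stop list (`toy_stopwords`) rather than the full one above, and stores it as a set, since looking something up in a set is typically faster than scanning a long tuple.

```python
# A tiny hypothetical stop list, for illustration only
toy_stopwords = {"you", "to", "and", "me"}

tokens = ["you", "want", "to", "tell", "me", "and", "you", "must", "know"]

# Keep only the tokens that are not stop words
kept = [t for t in tokens if t not in toy_stopwords]

print(kept)
```

The same list comprehension works with the full `stopwords` tuple; wrapping it in `set(...)` first is an optional speed-up.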
## Lemmatization and Getting Rid of Symbols and Punctuation
We'll also import a tool called a lemmatizer, which changes different forms of a word into the same root. For example, "sing", "sang", and "sung" would all become "sing", and both "pineapple" and "pineapples" would be counted as "pineapple." (NLTK's `WordNetLemmatizer` treats every word as a noun by default, so verb forms like "sang" are only reduced if you pass `pos="v"`.) We'll also retain only alphabetic words to get rid of those pesky punctuation "words" like : and " that were crowding our lists of common word tokens, and we'll remove the stop words. Finally, we'll print the 10 most common words from this new list.

In [44]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [w for w in alpha_only if w not in stopwords]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words
bow = Counter(lemmatized)

# TO DO: Print the 10 most common tokens by typing 10 into the parentheses.
print(bow.most_common(10))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/brianfriederich/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[('i', 17), ('dear', 8), ('bennet', 6), ('know', 5), ('visit', 5), ('year', 4), (u'girl', 4), (u'come', 4), (u'daughter', 4), ('single', 4)]


Much better! Notice the words "girl", "come", and "daughter" have been lemmatized so all forms of those words are counted together. Congrats! You've just done your first word frequency analysis!
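To show why lemmatizing changes the counts, here is a toy sketch that swaps the WordNet lemmatizer for a small hand-written lookup table. The table (`toy_lemmas`) and the tokens are invented for illustration; a real lemmatizer relies on a much larger dictionary and rules.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer
toy_lemmas = {"girls": "girl", "daughters": "daughter", "came": "come", "comes": "come"}

tokens = ["girls", "girl", "daughters", "came", "comes", "come", "!", "5"]

# Keep alphabetic tokens only, which drops "!" and "5"
alpha_only = [t for t in tokens if t.isalpha()]

# Replace each token with its root form if the table has one
lemmatized = [toy_lemmas.get(t, t) for t in alpha_only]

bow = Counter(lemmatized)
print(bow.most_common(2))
```

Without the lookup step, "came", "comes", and "come" would each be counted once instead of being counted together three times.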