#### Day 6: Regular Expressions and Naive Bayes Classifier


##### Part 1: Regular Expressions

- Regular expressions are useful to extract information from text.
- Set of “rules” to identify or match a particular sequence of characters.
- Most text in utf-8 or utf-16: letters, digits, punctuation and symbols
- In Python, mainly through library `re`


In [1]:
pwd

'/Users/mac/pythoncamp2023/Day06/Lecture'

In [2]:
# Set Directory
import os
os.chdir('/Users/mac/pythoncamp2023/Day06/Lecture')

import re # for regular expressions

For demonstration, we will work with Obama's 2008 concession speech from New Hampshire primary. 

In [3]:
# read in example text, remember:
# readlines makes a list of each line break in file
with open("obama-nh.txt", "r") as f:
  text = f.readlines()

Let's take a look at how this file is structured 

In [4]:
# How does it impact our 'text' object?
print(text[0])
# print(text[1])
# print(text[2])

# print(text[0:3])

I want to congratulate Senator Clinton on a hard-fought victory here in



In [5]:
# Join into one string
# What could we have done at the outset instead?
alltext = ''.join(text) 

Or equivalently

In [6]:
with open("obama-nh.txt", "r") as f:
  alltext = f.read()

##### 1.1 Useful functions from `re` library:

- `re.findall`: Return all non-overlapping matches of pattern 
            in string, as a list of strings
- `re.split`: Split string by the occurrences of pattern.
- `re.match`: Search the beginning of the string for a
          regular expression and return the first occurrence.
          Returns a match object.
- `re.search`: Like re.match, but will check all lines of the input string.
- `re.compile`: Compile a regular expression pattern into a regular 
            expression object, which can be used for matching using
            match(), search() and other methods

Source: https://docs.python.org/3/library/re.html

Let's run some examples!

In [7]:
# re.findall(pattern = "Yes we can", string= alltext) # All instance of Yes we can
re.findall("Yes we can", alltext) # All instance of Yes we can

['Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can',
 'Yes we can']

In [8]:
re.findall("American", alltext) # All instances of American

['American', 'American', 'American', 'American']

In [9]:
re.findall("\n", alltext) # all breaklines

['\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n',
 '\n']

##### 1.2 Backslash Characters

Regular expressions use the backslash character `\` to indicte special forms or to allow special characters to be used without invoking their special meaning. 

! This collides with Python's usage of the same character for the same purpose in string literals 

How do we find the literal character `\` in our file? 

In [16]:
# re.findall("\", alltext)
# re.findall("\\", alltext)
re.findall("\\\\", alltext)

['\\']

One way to address such issue is to use Python's raw string notation for regular expression patterns. 

Backslashes are NOT handled in any special way in a string prefixed with `r`. 

So equivalently: 

In [17]:
re.findall(r"\\", alltext)

['\\']

In [18]:
print("\n")





In [19]:
print(r"\n")

\n


In [20]:
print("\\n")

\n


##### 1.3 Basic special characters

In [21]:
# \d find any decimal digit, equivalent to [0-9]
re.findall("\d", alltext) 

['9', '1', '1']

In [22]:
# \D any character that is NOT a decimal digit, equivalent to ^[0-9]
re.findall("\D", alltext) 

['I',
 ' ',
 'w',
 'a',
 'n',
 't',
 ' ',
 't',
 'o',
 ' ',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e',
 ' ',
 'S',
 'e',
 'n',
 'a',
 't',
 'o',
 'r',
 ' ',
 'C',
 'l',
 'i',
 'n',
 't',
 'o',
 'n',
 ' ',
 'o',
 'n',
 ' ',
 'a',
 ' ',
 'h',
 'a',
 'r',
 'd',
 '-',
 'f',
 'o',
 'u',
 'g',
 'h',
 't',
 ' ',
 'v',
 'i',
 'c',
 't',
 'o',
 'r',
 'y',
 ' ',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'i',
 'n',
 '\n',
 'N',
 'e',
 'w',
 ' ',
 'H',
 'a',
 'm',
 'p',
 's',
 'h',
 'i',
 'r',
 'e',
 '.',
 '\n',
 '\n',
 'A',
 ' ',
 'f',
 'e',
 'w',
 ' ',
 'w',
 'e',
 'e',
 'k',
 's',
 ' ',
 'a',
 'g',
 'o',
 ',',
 ' ',
 'n',
 'o',
 ' ',
 'o',
 'n',
 'e',
 ' ',
 'i',
 'm',
 'a',
 'g',
 'i',
 'n',
 'e',
 'd',
 ' ',
 't',
 'h',
 'a',
 't',
 ' ',
 'w',
 'e',
 "'",
 'd',
 ' ',
 'h',
 'a',
 'v',
 'e',
 ' ',
 'a',
 'c',
 'c',
 'o',
 'm',
 'p',
 'l',
 'i',
 's',
 'h',
 'e',
 'd',
 ' ',
 'w',
 'h',
 'a',
 't',
 ' ',
 'w',
 'e',
 ' ',
 'd',
 'i',
 'd',
 '\n',
 'h',
 'e',
 'r',
 'e',
 ' ',


`[]` can be used to indicate a set of characters 

In [26]:
# all instances of the char in []
re.findall("[a]", alltext) 

['a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a'

In [24]:
# all instances of the from char 1 to char 2 in []
re.findall("[a-d]", alltext) 

['a',
 'c',
 'a',
 'a',
 'a',
 'a',
 'a',
 'd',
 'c',
 'a',
 'a',
 'a',
 'd',
 'a',
 'd',
 'a',
 'a',
 'c',
 'c',
 'd',
 'a',
 'd',
 'd',
 'c',
 'a',
 'a',
 'a',
 'b',
 'd',
 'a',
 'd',
 'a',
 'a',
 'c',
 'b',
 'd',
 'b',
 'c',
 'd',
 'b',
 'c',
 'a',
 'a',
 'd',
 'c',
 'a',
 'd',
 'c',
 'a',
 'd',
 'a',
 'd',
 'c',
 'a',
 'a',
 'a',
 'c',
 'a',
 'c',
 'a',
 'a',
 'a',
 'd',
 'a',
 'd',
 'a',
 'b',
 'a',
 'a',
 'd',
 'c',
 'd',
 'c',
 'a',
 'a',
 'a',
 'a',
 'c',
 'b',
 'c',
 'a',
 'b',
 'c',
 'b',
 'c',
 'a',
 'b',
 'a',
 'c',
 'c',
 'a',
 'b',
 'a',
 'c',
 'a',
 'a',
 'a',
 'a',
 'd',
 'a',
 'b',
 'a',
 'c',
 'a',
 'd',
 'c',
 'b',
 'b',
 'c',
 'a',
 'a',
 'a',
 'b',
 'd',
 'a',
 'a',
 'b',
 'b',
 'd',
 'c',
 'a',
 'a',
 'c',
 'b',
 'a',
 'c',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'd',
 'a',
 'a',
 'a',
 'a',
 'a',
 'd',
 'a',
 'c',
 'a',
 'd',
 'a',
 'a',
 'd',
 'c',
 'a',
 'a',
 'a',
 'c',
 'a',
 'a',
 'a',
 'a',
 'c',
 'a',
 'c',
 'a',
 'b',
 'a',
 'c',
 'a',
 'a',
 'd',
 'a',
 'a'

In [27]:
# all char, ^ except for of the from char 1 to char 2 in []
re.findall("[^a-z]", alltext) 

['I',
 ' ',
 ' ',
 ' ',
 ' ',
 'S',
 ' ',
 'C',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 '\n',
 'N',
 ' ',
 'H',
 '.',
 '\n',
 '\n',
 'A',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 "'",
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 '.',
 ' ',
 'F',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 'B',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 'A',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 'A',
 '.',
 '\n',
 '\n',
 'T',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 'D',
 ' ',
 'M',
 ' ',
 '\n',
 'D',
 ';',
 ' ',
 ' ',
 'L',
 ' ',
 ' ',
 'C',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 'J',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 '\n',
 

In [28]:
# all char and digits (alphanumeric)
re.findall("[a-zA-Z0-9]", alltext) 

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e',
 'S',
 'e',
 'n',
 'a',
 't',
 'o',
 'r',
 'C',
 'l',
 'i',
 'n',
 't',
 'o',
 'n',
 'o',
 'n',
 'a',
 'h',
 'a',
 'r',
 'd',
 'f',
 'o',
 'u',
 'g',
 'h',
 't',
 'v',
 'i',
 'c',
 't',
 'o',
 'r',
 'y',
 'h',
 'e',
 'r',
 'e',
 'i',
 'n',
 'N',
 'e',
 'w',
 'H',
 'a',
 'm',
 'p',
 's',
 'h',
 'i',
 'r',
 'e',
 'A',
 'f',
 'e',
 'w',
 'w',
 'e',
 'e',
 'k',
 's',
 'a',
 'g',
 'o',
 'n',
 'o',
 'o',
 'n',
 'e',
 'i',
 'm',
 'a',
 'g',
 'i',
 'n',
 'e',
 'd',
 't',
 'h',
 'a',
 't',
 'w',
 'e',
 'd',
 'h',
 'a',
 'v',
 'e',
 'a',
 'c',
 'c',
 'o',
 'm',
 'p',
 'l',
 'i',
 's',
 'h',
 'e',
 'd',
 'w',
 'h',
 'a',
 't',
 'w',
 'e',
 'd',
 'i',
 'd',
 'h',
 'e',
 'r',
 'e',
 't',
 'o',
 'n',
 'i',
 'g',
 'h',
 't',
 'F',
 'o',
 'r',
 'm',
 'o',
 's',
 't',
 'o',
 'f',
 't',
 'h',
 'i',
 's',
 'c',
 'a',
 'm',
 'p',
 'a',
 'i',
 'g',
 'n',
 'w',
 'e',
 'w',
 'e',
 'r',
 'e',
 'f'

In [29]:
# \w alphanumeric, one word char 
re.findall("\w", alltext) # same as above

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e',
 'S',
 'e',
 'n',
 'a',
 't',
 'o',
 'r',
 'C',
 'l',
 'i',
 'n',
 't',
 'o',
 'n',
 'o',
 'n',
 'a',
 'h',
 'a',
 'r',
 'd',
 'f',
 'o',
 'u',
 'g',
 'h',
 't',
 'v',
 'i',
 'c',
 't',
 'o',
 'r',
 'y',
 'h',
 'e',
 'r',
 'e',
 'i',
 'n',
 'N',
 'e',
 'w',
 'H',
 'a',
 'm',
 'p',
 's',
 'h',
 'i',
 'r',
 'e',
 'A',
 'f',
 'e',
 'w',
 'w',
 'e',
 'e',
 'k',
 's',
 'a',
 'g',
 'o',
 'n',
 'o',
 'o',
 'n',
 'e',
 'i',
 'm',
 'a',
 'g',
 'i',
 'n',
 'e',
 'd',
 't',
 'h',
 'a',
 't',
 'w',
 'e',
 'd',
 'h',
 'a',
 'v',
 'e',
 'a',
 'c',
 'c',
 'o',
 'm',
 'p',
 'l',
 'i',
 's',
 'h',
 'e',
 'd',
 'w',
 'h',
 'a',
 't',
 'w',
 'e',
 'd',
 'i',
 'd',
 'h',
 'e',
 'r',
 'e',
 't',
 'o',
 'n',
 'i',
 'g',
 'h',
 't',
 'F',
 'o',
 'r',
 'm',
 'o',
 's',
 't',
 'o',
 'f',
 't',
 'h',
 'i',
 's',
 'c',
 'a',
 'm',
 'p',
 'a',
 'i',
 'g',
 'n',
 'w',
 'e',
 'w',
 'e',
 'r',
 'e',
 'f'

In [30]:
# \W non-alphanumeric, opposite to \w
re.findall("\W", alltext) # same as re.findall(r"[^a-zA-Z0-9]", alltext)

[' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 '.',
 '\n',
 '\n',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 "'",
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 '.',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ';',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 '-',
 ' ',
 ' ',


In [31]:
# \s whitespace
re.findall("\s", alltext) 

[' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 '\n',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 '\n',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',

In [32]:
# \S non-whitespace
re.findall("\S", alltext) 

['I',
 'w',
 'a',
 'n',
 't',
 't',
 'o',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e',
 'S',
 'e',
 'n',
 'a',
 't',
 'o',
 'r',
 'C',
 'l',
 'i',
 'n',
 't',
 'o',
 'n',
 'o',
 'n',
 'a',
 'h',
 'a',
 'r',
 'd',
 '-',
 'f',
 'o',
 'u',
 'g',
 'h',
 't',
 'v',
 'i',
 'c',
 't',
 'o',
 'r',
 'y',
 'h',
 'e',
 'r',
 'e',
 'i',
 'n',
 'N',
 'e',
 'w',
 'H',
 'a',
 'm',
 'p',
 's',
 'h',
 'i',
 'r',
 'e',
 '.',
 'A',
 'f',
 'e',
 'w',
 'w',
 'e',
 'e',
 'k',
 's',
 'a',
 'g',
 'o',
 ',',
 'n',
 'o',
 'o',
 'n',
 'e',
 'i',
 'm',
 'a',
 'g',
 'i',
 'n',
 'e',
 'd',
 't',
 'h',
 'a',
 't',
 'w',
 'e',
 "'",
 'd',
 'h',
 'a',
 'v',
 'e',
 'a',
 'c',
 'c',
 'o',
 'm',
 'p',
 'l',
 'i',
 's',
 'h',
 'e',
 'd',
 'w',
 'h',
 'a',
 't',
 'w',
 'e',
 'd',
 'i',
 'd',
 'h',
 'e',
 'r',
 'e',
 't',
 'o',
 'n',
 'i',
 'g',
 'h',
 't',
 '.',
 'F',
 'o',
 'r',
 'm',
 'o',
 's',
 't',
 'o',
 'f',
 't',
 'h',
 'i',
 's',
 'c',
 'a',
 'm',
 'p',
 'a',
 'i',
 'g',
 'n',
 ',',
 'w'

In [34]:
# . any char (include white spaces, except a newline)
re.findall(".", alltext) 

['I',
 ' ',
 'w',
 'a',
 'n',
 't',
 ' ',
 't',
 'o',
 ' ',
 'c',
 'o',
 'n',
 'g',
 'r',
 'a',
 't',
 'u',
 'l',
 'a',
 't',
 'e',
 ' ',
 'S',
 'e',
 'n',
 'a',
 't',
 'o',
 'r',
 ' ',
 'C',
 'l',
 'i',
 'n',
 't',
 'o',
 'n',
 ' ',
 'o',
 'n',
 ' ',
 'a',
 ' ',
 'h',
 'a',
 'r',
 'd',
 '-',
 'f',
 'o',
 'u',
 'g',
 'h',
 't',
 ' ',
 'v',
 'i',
 'c',
 't',
 'o',
 'r',
 'y',
 ' ',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'i',
 'n',
 'N',
 'e',
 'w',
 ' ',
 'H',
 'a',
 'm',
 'p',
 's',
 'h',
 'i',
 'r',
 'e',
 '.',
 'A',
 ' ',
 'f',
 'e',
 'w',
 ' ',
 'w',
 'e',
 'e',
 'k',
 's',
 ' ',
 'a',
 'g',
 'o',
 ',',
 ' ',
 'n',
 'o',
 ' ',
 'o',
 'n',
 'e',
 ' ',
 'i',
 'm',
 'a',
 'g',
 'i',
 'n',
 'e',
 'd',
 ' ',
 't',
 'h',
 'a',
 't',
 ' ',
 'w',
 'e',
 "'",
 'd',
 ' ',
 'h',
 'a',
 'v',
 'e',
 ' ',
 'a',
 'c',
 'c',
 'o',
 'm',
 'p',
 'l',
 'i',
 's',
 'h',
 'e',
 'd',
 ' ',
 'w',
 'h',
 'a',
 't',
 ' ',
 'w',
 'e',
 ' ',
 'd',
 'i',
 'd',
 'h',
 'e',
 'r',
 'e',
 ' ',
 't',
 'o',
 'n',
 'i',
 'g'

In [36]:
# \ is an escape character (. has a special use)
re.findall("\.", alltext) 

['.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.',
 '.']

In [38]:
# ? Makes the preceding RE optional. (match 0 or 1 repetitions of the preceding RE)
re.findall("Am?", alltext) # This would match A or Am where m is optional

['A',
 'A',
 'Am',
 'Am',
 'A',
 'Am',
 'Am',
 'Am',
 'Am',
 'A',
 'A',
 'Am',
 'A',
 'A',
 'A',
 'Am',
 'Am',
 'A',
 'A',
 'Am',
 'Am']

In [42]:
# + match 1 or more repetitions of the preceding RE 
re.findall("\d+", alltext)
re.findall("am+", alltext)

['am', 'am', 'am', 'am', 'am', 'am', 'am', 'am', 'am', 'am', 'am']

In [40]:
# * match 0 or more repetitions of the preceding RE
re.findall("am*", alltext) # match a, am, or a followed by any number of m's 

['a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'am',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'am',
 'a',
 'a',
 'a',
 'a',
 'a',
 'am',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'am',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'am',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'am',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a'

In [45]:
# get any word that starts with America
re.findall("America[a-z]*", alltext) 

['America',
 'Americans',
 'America',
 'America',
 'American',
 'Americans',
 'America',
 'America',
 'Americans',
 'America',
 'America']

`{m}` specifies exactly m copies of the previous RE should be matched

In [49]:
# {x} exactly x times (numbers with exact number of digits)
re.findall("\d{2}", alltext) 

['11']

In [47]:
re.findall("\d{1}", alltext) 

['9', '1', '1']

`{m,n}` matches from m to n repetitions of the preceding RE, while attempting to match as many repetitions as possible

In [51]:
re.findall("\d{1,2}", alltext) 

['9', '11']

- There are so many more special characters
- Regex can be super powerful and complicated 
- Use parenthese to group things together when using operators like `+`, `*`, `?`, `^`


<br>

##### Short Exercise: 
How would we grab 10/10 and 19/18 as they appear in the text using `re.findall()`? 

In [55]:
x = "Hi 10/10 hello 19/18 asdf 7/6 and 1/10 or 10/1 "
re.findall("[0-9][0-9]/[0-9][0-9]",x)

['10/10', '19/18']

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

In [None]:
# Answer
re.findall("\d{2}/\d{2}", x) 

#### 1.4 `re.split()`

Split string by the occurrences of pattern. 

In [56]:
# splits at digits, deletes digits
re.split("\d", alltext) 

["I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in America.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when Americans who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vot

In [57]:
re.split("America*", alltext) 

["I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in ",
 '.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when ',
 "ns who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vote no

#### 1.5 `re.compile()`

Compile a RE pattern into a RE object, which can then be used for matching using the `match()` and `search()` methods. 

In [132]:
keyword = re.compile("America[a-z]*")

In [133]:
# search file for keyword in line by line version
for l in text: 
    if keyword.search(l): # reuse the RE here
        print(l)

something happening in America.

There is something happening when Americans who are young in age and in

America right now. Change is what's happening in America.

Our new American majority can end the outrage of unaffordable,

working Americans who deserve it.

is a challenge that should unite America and the world against the

But in the unlikely story that is America, there has never been anything

we can't, generations of Americans have responded with a simple creed

remember that there is something happening in America; that we are not

nation; and together, we will begin the next great chapter in America's



In [134]:
# Create a regex object
pattern = re.compile('\d+')

In [135]:
pattern.findall(alltext) # equivalent to the earlier but longer version using RE

['9', '11']

In [136]:
pattern.split(alltext)

["I want to congratulate Senator Clinton on a hard-fought victory here in\nNew Hampshire.\n\nA few weeks ago, no one imagined that we'd have accomplished what we did\nhere tonight. For most of this campaign, we were far behind, and we\nalways knew our climb would be steep. But in record numbers, you came\nout and spoke up for change. And with your voices and your votes, you\nmade it clear that at this moment - in this election - there is\nsomething happening in America.\n\nThere is something happening when men and women in Des Moines and\nDavenport; in Lebanon and Concord come out in the snows of January to\nwait in lines that stretch block after block because they believe in\nwhat this country can be.\n\nThere is something happening when Americans who are young in age and in\nspirit - who have never before participated in politics - turn out in\nnumbers we've never seen because they know in their hearts that this\ntime must be different.\n\nThere is something happening when people vot

#### 1.6 `re.MULTILINE` or `re.M`

When specified, it helps to search across lines in a single string. 

In [61]:
mline = "python\nis\nfun"
print(mline)

python
is
fun


I want to search for "fun" in the third line, where it starts with an "f"

- We can use `^` to search the start of a string
- Be careful, `^` when used in `[]` means negating characters
- `$` can be used to match the end of a string

In [62]:
re.findall("^f\w*", mline)

[]

In [64]:
# re.findall("^f\w*", mline, re.M)
re.findall("^f\w*", mline, re.MULTILINE)

['fun']

#### Short Exercise: 

What does the following code search for? 

In [66]:
re.findall("^.*\.$", alltext, re.MULTILINE)

['New Hampshire.',
 'something happening in America.',
 'what this country can be.',
 'time must be different.',
 "America right now. Change is what's happening in America.",
 'fulfill.',
 'working Americans who deserve it.',
 'can do this with our new majority.',
 'weapons; climate change and poverty; genocide and disease.',
 'ideas. And all are patriots who serve this country honorably.',
 'the people who love this country, can do to change it.',
 "That's why tonight belongs to you.",
 'believed in our improbable journey and rallied so many others to join.',
 'in the weeks to come.',
 'offering the people of this nation false hope.',
 'that sums up the spirit of a people.',
 'Yes we can.',
 'destiny of a nation.',
 'Yes we can.',
 'toward freedom through the darkest of nights.',
 'Yes we can.',
 'pioneers who pushed westward against an unforgiving wilderness.',
 'Yes we can.',
 'who took us to the mountaintop and pointed the way to the Promised Land.',
 'world. Yes we can.',
 'shinin

##### Part 2: Naive Bayes Classification


Docs for this library: https://www.nltk.org/api/nltk.classify.naivebayes.html

#### 2.1 Installation and Import Libraries

In [137]:
# !pip3 install nltk
import nltk
nltk.download('names')
from nltk.corpus import names
import random

[nltk_data] Downloading package names to /Users/mac/nltk_data...
[nltk_data]   Package names is already up-to-date!


In [138]:
# Create a list of tuples with names
names = ([(name, 'male') for name in names.words('male.txt')] +
        [(name, 'female') for name in names.words('female.txt')])

In [139]:
names

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male'),
 ('Abdullah', 'male'),
 ('Abe', 'male'),
 ('Abel', 'male'),
 ('Abelard', 'male'),
 ('Abner', 'male'),
 ('Abraham', 'male'),
 ('Abram', 'male'),
 ('Ace', 'male'),
 ('Adair', 'male'),
 ('Adam', 'male'),
 ('Adams', 'male'),
 ('Addie', 'male'),
 ('Adger', 'male'),
 ('Aditya', 'male'),
 ('Adlai', 'male'),
 ('Adnan', 'male'),
 ('Adolf', 'male'),
 ('Adolfo', 'male'),
 ('Adolph', 'male'),
 ('Adolphe', 'male'),
 ('Adolpho', 'male'),
 ('Adolphus', 'male'),
 ('Adrian', 'male'),
 ('Adrick', 'male'),
 ('Adrien', 'male'),
 ('Agamemnon', 'male'),
 ('Aguinaldo', 'male'),
 ('Aguste', 'male'),
 ('Agustin', 'male'),
 ('Aharon', 'male'),
 ('Ahmad', 'male'),
 ('Ahmed', 'male'),
 ('Ahmet', 'male'),
 ('Ajai', 'male'),
 ('Ajay', 'male'),
 ('Al', 'male'),
 ('Alaa', 'male'),
 ('Alain', 'male'),
 ('Alan', 'male

In [140]:
# Now, we shuffle
random.shuffle(names)
print(names)

[('Naomi', 'female'), ('Imojean', 'female'), ('Fulton', 'male'), ('Selinda', 'female'), ('Webster', 'male'), ('Waine', 'male'), ('Quint', 'male'), ('Elisabet', 'female'), ('Pavel', 'male'), ('Wojciech', 'male'), ('Marga', 'female'), ('Alecia', 'female'), ('Hoyt', 'male'), ('Roni', 'male'), ('Saunders', 'male'), ('Damara', 'female'), ('Ruperto', 'male'), ('Norton', 'male'), ('Maudie', 'female'), ('Angus', 'male'), ('Jacinta', 'female'), ('Jodee', 'female'), ('Waiter', 'male'), ('Paulina', 'female'), ('Leanora', 'female'), ('Essa', 'female'), ('Bambie', 'female'), ('Mamie', 'female'), ('Henryetta', 'female'), ('Celle', 'female'), ('Christopher', 'male'), ('Kary', 'female'), ('Garret', 'male'), ('Elwina', 'female'), ('Sunny', 'female'), ('Lucio', 'male'), ('Danyelle', 'female'), ('Lotta', 'female'), ('Haley', 'female'), ('Curtis', 'male'), ('Alica', 'female'), ('Doe', 'female'), ('Edy', 'female'), ('Joao', 'male'), ('Duffie', 'male'), ('Eddie', 'male'), ('Cher', 'female'), ('Verna', 'fema

#### 2.2 Split Training and Test Sets

In [141]:
len(names) # N of observations

7944

In [142]:
# Define training and test set sizes
train_size = 5000

# Split train and test objects
train_names = names[:train_size]
test_names = names[train_size:]

#### 2.3 Define Features

In [143]:
# A simple feature: Get the last letter of the name
def g_features1(name):
  return {'last_letter': name[-1]}

Tips: Python functions can return multiple values

In [77]:
# Quick break — some syntax:
def return_two():
  return 5, 10

# When a method returns two values, we can use this format: 
x, y = return_two()
x, y

(5, 10)

#### 2.4 Data Preparation

Loop over names, and return tuple of dictionary and label

In [144]:
train_set = [(g_features1(n), g) for (n, g) in train_names]
test_set = [(g_features1(n), g) for (n,g) in test_names]

In [145]:
train_set

[({'last_letter': 'i'}, 'female'),
 ({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 't'}, 'female'),
 ({'last_letter': 'l'}, 'male'),
 ({'last_letter': 'h'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 'i'}, 'male'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'o'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_le

#### 2.5 Train the Classifier

In [86]:
# Run the naive Bayes classifier for the train set
classifier = nltk.NaiveBayesClassifier.train(train_set)

#### 2.6 Test your Classifier

In [146]:
# Apply the classifier to some names
classifier.classify(g_features1('Cecilia'))

'neg'

In [88]:
classifier.classify(g_features1('Leticia'))

'female'

In [89]:
classifier.classify(g_features1('Irene'))

'female'

In [90]:
classifier.classify(g_features1('Jie'))

'female'

In [91]:
classifier.classify(g_features1('Tian'))

'male'

In [92]:
classifier.classify(g_features1('Masanori'))

'female'

In [93]:
classifier.classify(g_features1('Peter'))

'male'

In [94]:
# Get the probability of female:
classifier.prob_classify(g_features1('Cecilia')).prob("female")

0.9802321060009186

In [95]:
classifier.prob_classify(g_features1('Peter')).prob("male")

0.7790075994370524

We can check the overall accuracy with our test set. 

More on accuracy, F1, precision, recall: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

In [96]:
print(nltk.classify.accuracy(classifier, test_set))

0.7652853260869565


#### 2.7 Feature Attribution

In [97]:
# Lets see what is driving this
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     28.6 : 1.0
             last_letter = 'k'              male : female =     17.1 : 1.0
             last_letter = 'f'              male : female =     11.4 : 1.0
             last_letter = 'd'              male : female =     10.9 : 1.0
             last_letter = 'm'              male : female =     10.7 : 1.0


Let's be smarter and add more features!

In [108]:
# What all are we including now?
def g_features2(name):
  features = {}
  features["firstletter"] = name[0].lower()
  features["lastletter"] = name[-1].lower()
  for letter in 'abcdefghijklmnopqrstuvwxyz':
      features["count(%s)" % letter] = name.lower().count(letter)
      features["has(%s)" % letter] = (letter in name.lower())
  return features

In [109]:
g_features2('Cecilia')

{'firstletter': 'c',
 'lastletter': 'a',
 'count(a)': 1,
 'has(a)': True,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 2,
 'has(c)': True,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 1,
 'has(e)': True,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 2,
 'has(i)': True,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 1,
 'has(l)': True,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 0,
 'has(o)': False,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [100]:
# Run for train set
train_set = [(g_features2(n), g) for (n,g) in train_names]
# Run for test set
test_set = [(g_features2(n), g) for (n,g) in test_names]

In [101]:
# Run new classifier
classifier_new = nltk.NaiveBayesClassifier.train(train_set)

In [102]:
# Check the overall accuracy with test set
print(nltk.classify.accuracy(classifier_new, test_set))

0.7700407608695652


In [103]:
# Lets see what is driving this
classifier_new.show_most_informative_features(20)

Most Informative Features
              lastletter = 'a'            female : male   =     28.6 : 1.0
              lastletter = 'k'              male : female =     17.1 : 1.0
              lastletter = 'f'              male : female =     11.4 : 1.0
              lastletter = 'd'              male : female =     10.9 : 1.0
              lastletter = 'm'              male : female =     10.7 : 1.0
              lastletter = 'o'              male : female =      9.1 : 1.0
              lastletter = 'v'              male : female =      7.3 : 1.0
              lastletter = 'r'              male : female =      6.1 : 1.0
                count(a) = 3              female : male   =      5.3 : 1.0
              lastletter = 'w'              male : female =      5.2 : 1.0
             firstletter = 'w'              male : female =      4.9 : 1.0
                count(v) = 2              female : male   =      4.8 : 1.0
                count(y) = 2              female : male   =      4.5 : 1.0

In [110]:
# Worse? Better? How can we refine?
# Lets look at the errors from this model
# and see if we can do better
errors = []
for (name, label) in test_names:
  guess = classifier.classify(g_features2(name))
  if guess != label:
    prob = classifier.prob_classify(g_features2(name)).prob(guess)
    errors.append((label, guess, prob, name))

In [111]:
for (label, guess, prob, name) in sorted(errors):
  print('correct={} guess={} prob={:.2f} name={}'.format(label, guess, prob, name))

correct=male guess=female prob=0.63 name=Abbie
correct=male guess=female prob=0.63 name=Abbot
correct=male guess=female prob=0.63 name=Abbott
correct=male guess=female prob=0.63 name=Abby
correct=male guess=female prob=0.63 name=Abdullah
correct=male guess=female prob=0.63 name=Abel
correct=male guess=female prob=0.63 name=Ace
correct=male guess=female prob=0.63 name=Adams
correct=male guess=female prob=0.63 name=Adger
correct=male guess=female prob=0.63 name=Adnan
correct=male guess=female prob=0.63 name=Adolph
correct=male guess=female prob=0.63 name=Adolpho
correct=male guess=female prob=0.63 name=Adolphus
correct=male guess=female prob=0.63 name=Adrian
correct=male guess=female prob=0.63 name=Adrick
correct=male guess=female prob=0.63 name=Adrien
correct=male guess=female prob=0.63 name=Aharon
correct=male guess=female prob=0.63 name=Ajai
correct=male guess=female prob=0.63 name=Ajay
correct=male guess=female prob=0.63 name=Alaa
correct=male guess=female prob=0.63 name=Alan
correct

What could we do to improve it? (Lab Assignment)

<br>
<br>
Now lets look at some bigger documents. 

This may take a while to download.

In [112]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/mac/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [113]:
# list of tuples
# ([words], label)
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]


In [None]:
# type(documents[0])
# type(documents)
# documents[0][1] # only neg & pos

In [114]:
random.shuffle(documents)

In [115]:
# Dictionary of words and number of instances
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
len(all_words)

39768

In [116]:
all_words

FreqDist({',': 77717, 'the': 76529, '.': 65876, 'a': 38106, 'and': 35576, 'of': 34123, 'to': 31937, "'": 30585, 'is': 25195, 'in': 21822, ...})

In [117]:
# Check the frequency of ','
all_words[',']

77717

In [122]:
word_features = [k for k in all_words.keys() if all_words[k] > 5]

In [123]:
len(word_features)

13214

In [125]:
# Function to get document features
def document_features(document):
  document_words = set(document)
  features = {}
  for word in word_features:
      features['contains(%s)' % word] = (word in document_words)
  return features

In [126]:
document_features(['This', 'is', 'a', 'horrible', 'movie'])

{'contains(plot)': False,
 'contains(:)': False,
 'contains(two)': False,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': False,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': False,
 'contains(drink)': False,
 'contains(and)': False,
 'contains(then)': False,
 'contains(drive)': False,
 'contains(.)': False,
 'contains(they)': False,
 'contains(get)': False,
 'contains(into)': False,
 'contains(an)': False,
 'contains(accident)': False,
 'contains(one)': False,
 'contains(of)': False,
 'contains(the)': False,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': False,
 'contains(his)': False,
 'contains(girlfriend)': False,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': False,
 'contains(in)': False,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': False,
 'contains(nightmares)': False,
 'contains(what)': False,
 "contains(')": F

In [127]:
document_features(movie_reviews.words('pos/cv957_8737.txt'))

{'contains(plot)': True,
 'contains(:)': True,
 'contains(two)': True,
 'contains(teen)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(to)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(party)': False,
 'contains(,)': True,
 'contains(drink)': False,
 'contains(and)': True,
 'contains(then)': True,
 'contains(drive)': False,
 'contains(.)': True,
 'contains(they)': True,
 'contains(get)': True,
 'contains(into)': True,
 'contains(an)': True,
 'contains(accident)': False,
 'contains(one)': True,
 'contains(of)': True,
 'contains(the)': True,
 'contains(guys)': False,
 'contains(dies)': False,
 'contains(but)': True,
 'contains(his)': True,
 'contains(girlfriend)': True,
 'contains(continues)': False,
 'contains(see)': False,
 'contains(him)': True,
 'contains(in)': True,
 'contains(her)': False,
 'contains(life)': False,
 'contains(has)': True,
 'contains(nightmares)': False,
 'contains(what)': True,
 "contains(')": True,
 'contains(s)': T

In [129]:
## Now we have tuple of ({features}, label)
train_docs = documents[:1000]
test_docs = documents[1000:1500]
train_set = [(document_features(d), c) for (d,c) in train_docs]
test_set = [(document_features(d), c) for (d,c) in test_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [130]:
print(nltk.classify.accuracy(classifier, test_set))

0.77


In [131]:
classifier.show_most_informative_features(10)

Most Informative Features
    contains(whatsoever) = True              neg : pos    =     14.2 : 1.0
     contains(ludicrous) = True              neg : pos    =     11.6 : 1.0
      contains(captures) = True              pos : neg    =     11.1 : 1.0
    contains(surrounded) = True              pos : neg    =     10.4 : 1.0
     contains(redeeming) = True              neg : pos    =     10.3 : 1.0
      contains(depicted) = True              pos : neg    =      9.1 : 1.0
         contains(legal) = True              pos : neg    =      9.1 : 1.0
   contains(outstanding) = True              pos : neg    =      8.8 : 1.0
        contains(german) = True              pos : neg    =      8.7 : 1.0
       contains(ordered) = True              pos : neg    =      8.4 : 1.0


In [None]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
