## Cleaning text

Now that we can search text, numbers, and symbols with regular expressions, we can put those patterns to work for data cleaning, representation, and decision-making analyses. Let's start with some data cleaning and filtering.

For this purpose we will first use the NLTK library (Natural Language Toolkit).

<code>Regular expressions and examples
Data cleaning:
    Tokenizing
    Removing punctuation
    Stemming and Lemmatizing
    Removing tags
Text representation
        TF-IDF: Term frequencies (counter)
        Vector normalization
        Feature weighting (Inverse Document Frequency)	
        Sklearn implementation
Learning text representations
        Stopwords
        Bag of Words
        n-grams
        Training a (naive Bayes) Classifier with NLTK: film critiques example
        Training a (naive Bayes) Classifier with TextBlob: A Tweet Sentiment Analyzer
        Pattern module</code>

In [4]:
import nltk
nltk.download()
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


### Tokenizing text into bags of words

NLTK makes it easy to convert documents-as-strings into lists of word tokens, a process called tokenizing.

In [6]:
from nltk.tokenize import word_tokenize

tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print (tokenized_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]
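To get a feel for what a tokenizer does, here is a minimal regex-based sketch that needs no NLTK data. It is not how word_tokenize works internally: the helper name simple_tokenize and the pattern are made up for illustration, and contractions such as "won't" are kept whole instead of being split into 'wo' and "n't".

```python
import re

# Hypothetical helper for illustration; not part of NLTK.
# \w+(?:'\w+)?  matches a word, optionally with an internal apostrophe ("won't")
# [^\w\s]       matches any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("They won't be very interesting, I'm afraid."))
# ['They', "won't", 'be', 'very', 'interesting', ',', "I'm", 'afraid', '.']
```

Note that the underscore counts as a word character in \w, so a token like '_learn' would stay intact here, just as in the NLTK output above.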


### Removing punctuation

Punctuation helps the tokenizer find word and sentence boundaries, but once the text is tokenized there is usually little reason to keep it around. There are many ways to remove punctuation.

Let's review some useful functions.

re.escape: Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

In [7]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [8]:
import re
import string
print (re.escape(string.punctuation))
regex = re.compile('[%s]' % re.escape(string.punctuation)) 

tokenized_docs_no_punctuation = []

for review in tokenized_docs:
    
    new_review = []
    for token in review: 
        new_token = regex.sub(u'', token)
        if not new_token == u'':
            new_review.append(new_token)
    
    tokenized_docs_no_punctuation.append(new_review)
    
print (tokenized_docs)    
print (tokenized_docs_no_punctuation)

!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~
[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]
[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]


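Tokens that consist only of punctuation are dropped, while tokens that contain punctuation are kept with the symbols stripped (for example, "n't" becomes 'nt' and '_learn' becomes 'learn'). The u'' in the code is just an empty Unicode string literal; it does not mark the tokens in any way.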
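As one of the many alternatives, the same cleaning can be sketched without regular expressions, using str.translate with a deletion table. The helper name strip_punctuation is an assumption made for this example.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens left empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]

print(strip_punctuation(['They', 'wo', "n't", ',', 'I', "'m", 'afraid', '.']))
# ['They', 'wo', 'nt', 'I', 'm', 'afraid']
```

For ASCII punctuation this should give the same result as the regex version; neither approach touches non-ASCII punctuation such as curly quotes.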

### Stemming and Lemmatizing

If you have taken linguistics, you may be familiar with morphology, the study of how words are built from a root form plus affixes. To get at the base form of a word, you can apply a stemmer or a lemmatizer. A stemmer strips suffixes by rule and may produce non-words (such as 'veri' for 'very'), while a lemmatizer maps a word to its dictionary form. Here are three very popular methods ready to go right out of the NLTK box.

In [8]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()

preprocessed_docs = []

for doc in tokenized_docs_no_punctuation:
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
        # final_doc.append(snowball.stem(word))      # alternative: Snowball stemmer
        # final_doc.append(wordnet.lemmatize(word))  # alternative: lemmatizer, requires 'corpora/wordnet' -> nltk.download()
    preprocessed_docs.append(final_doc)

print (tokenized_docs_no_punctuation)
print (preprocessed_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
[['here', 'are', 'some', 'veri', 'simpl', 'basic', 'sentenc'], ['they', 'wo', 'nt', 'be', 'veri', 'interest', 'I', 'm', 'afraid'], ['the', 'point', 'of', 'these', 'exampl', 'is', 'to', 'learn', 'how', 'basic', 'text', 'clean', 'work', 'on', 'veri', 'simpl', 'data']]


### Removing HTML entities and tags

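NLTK used to ship a clean_html function for this. Recent releases removed it and point to BeautifulSoup instead, so the original implementation is reproduced below.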

In [12]:
import re

def clean_html(html):
    """
    Copied from NLTK package.
    Remove HTML markup from the given string.

    :param html: the HTML string to be cleaned
    :type html: str
    :rtype: str
    """

    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()

test_string ="<p>While many of the stories tugged <a> at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)</a>"
print (test_string)
clean_html(test_string) 

# For more robust tag cleaning, consider BeautifulSoup (from bs4 import BeautifulSoup)

<p>While many of the stories tugged <a> at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)</a>


"While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)"

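The cleaned string above still contains &quot; because clean_html only replaces &nbsp;. A minimal sketch for decoding entities as well, using html.unescape from the standard library, could look like this. The helper name strip_tags_and_entities is made up for the example, and the tag regex is the same simple one used above, so it is no substitute for a real HTML parser.

```python
import html
import re

def strip_tags_and_entities(markup):
    """Drop tags, decode HTML entities, and collapse whitespace."""
    text = re.sub(r"(?s)<.*?>", " ", markup)   # replace every tag with a space
    text = html.unescape(text)                 # &quot; -> ", &amp; -> &, &nbsp; -> non-breaking space
    return re.sub(r"\s+", " ", text).strip()   # \s also matches the non-breaking space

print(strip_tags_and_entities('<p>the &quot;Chicken Soup&quot; series</p>'))
# the "Chicken Soup" series
```

Tags are removed before decoding so that an escaped literal such as &lt;b&gt; in the text is not mistaken for markup.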
## The Vector Space Model of text: TF-IDF (Term Frequency - Inverse Document Frequency)

Once the text has been matched against regular expressions and cleaned with the tools above, we can represent it numerically for subsequent analyses.

<p>We need to start thinking about how to translate collections of texts into quantifiable phenomena.  The easiest way to start is to think about word frequencies.</p>

Example with histograms of visual words representation:

<img src="VisualWords.png">

### Basic term frequencies

First, let's review how to get a count of terms per document: a term frequency vector.

In [13]:
#examples taken from here: http://stackoverflow.com/a/1750187

mydoclist = ['Julie loves me more than Linda loves me',
'Jane likes me more than Julie loves me',
'He likes basketball more than baseball']

#mydoclist = ['sun sky bright', 'sun sun bright']

from collections import Counter

for doc in mydoclist:
    tf = Counter()
    for word in doc.split():
        tf[word] +=1
    print (tf.items())

dict_items([('Julie', 1), ('loves', 2), ('me', 2), ('more', 1), ('than', 1), ('Linda', 1)])
dict_items([('Jane', 1), ('likes', 1), ('me', 2), ('more', 1), ('than', 1), ('Julie', 1), ('loves', 1)])
dict_items([('He', 1), ('likes', 1), ('basketball', 1), ('more', 1), ('than', 1), ('baseball', 1)])


<p>Here, we've introduced a Python object called a Counter, available in the collections module since Python 2.7.  Counters are neat because they do exactly this kind of job: counting things in a loop.</p>


A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The Counter class is similar to bags or multisets in other languages.

Elements are counted from an iterable or initialized from another mapping (or counter):

<code>
c = Counter()                           # a new, empty counter
c = Counter('gallahad')                 # a new counter from an iterable
</code>

Counter objects have a dictionary interface except that they return a zero count for missing items instead of raising a KeyError:

<code>
c = Counter(['eggs', 'ham'])
c['bacon']                              # count of a missing element is zero
0
</code>
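Since a Counter can be built directly from an iterable, the explicit counting loop used earlier can be shortened. A small sketch on the first example document:

```python
from collections import Counter

doc = 'Julie loves me more than Linda loves me'
tf = Counter(doc.split())      # counts every whitespace-separated token

print(tf['loves'])             # 2
print(tf['basketball'])        # 0, missing terms count as zero
print(tf.most_common(2))       # [('loves', 2), ('me', 2)]
```

Terms with equal counts are listed in the order they were first seen, which is why 'loves' comes before 'me'.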


<p>Let's call this a first stab at representing documents quantitatively, just by their word counts (also thinking that we may have previously filtered and cleaned the text using previous approaches).  </p>

In [20]:
import string 
    
def build_lexicon(corpus): # define a set with all possible words included in all the sentences or "corpus"
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
  return freq(term, document)

def freq(term, document):
  return document.split().count(term)

vocabulary = build_lexicon(mydoclist)

doc_term_matrix = []
print ('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
for doc_number, doc in enumerate(mydoclist, start=1):
    print ('The doc is "' + doc + '"')
    tf_vector = [tf(word, doc) for word in vocabulary]
    # 'count' avoids reusing the name of the freq() function defined above
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print ('The tf vector for Document %d is [%s]' % (doc_number, tf_vector_string))
    doc_term_matrix.append(tf_vector)
print ('All combined, here is our master document term matrix: ')
print (doc_term_matrix)

Our vocabulary vector is [than, me, Jane, baseball, Julie, basketball, likes, Linda, loves, He, more]
The doc is "Julie loves me more than Linda loves me"
The tf vector for Document 1 is [1, 2, 0, 0, 1, 0, 0, 1, 2, 0, 1]
The doc is "Jane likes me more than Julie loves me"
The tf vector for Document 2 is [1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1]
The doc is "He likes basketball more than baseball"
The tf vector for Document 3 is [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1]
All combined, here is our master document term matrix: 
[[1, 2, 0, 0, 1, 0, 0, 1, 2, 0, 1], [1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1]]


Okay, that seems reasonable enough. If any of you have any experience with machine learning, what you've just seen is the creation of a feature space. Now every document is in the same feature space, meaning that we can represent the entire corpus in the same dimensional space without having lost too much information.
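One practical detail: the vocabulary above is a Python set, so the column order of the matrix can change from one run to the next. A minimal sketch that sorts the vocabulary to get reproducible columns (sorting is a choice made for this example, not something the method requires):

```python
mydoclist = ['Julie loves me more than Linda loves me',
             'Jane likes me more than Julie loves me',
             'He likes basketball more than baseball']

# Sorted list instead of a set: uppercase words sort before lowercase ones
vocabulary = sorted({word for doc in mydoclist for word in doc.split()})

# One row per document, one column per vocabulary term
doc_term_matrix = [[doc.split().count(term) for term in vocabulary]
                   for doc in mydoclist]

print(vocabulary)
print(doc_term_matrix[0])
# ['He', 'Jane', 'Julie', 'Linda', 'baseball', 'basketball', 'likes', 'loves', 'me', 'more', 'than']
# [0, 0, 1, 1, 0, 0, 0, 2, 2, 1, 1]
```

The counts are the same as before; only the column order differs from the output shown above.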

### Normalizing vectors to L2 Norm

<p>Once you've got your data in the same feature space, you can start applying some machine learning methods; classifying, clustering, and so on.  But actually, we've got a few problems.  Words aren't all equally informative.</p>
<p>If words appear too frequently in a single document, they're going to muck up our analysis.  We want to perform some scaling of each of these term frequency vectors into something a bit more representative.  In other words, we need to do some <strong>vector normalizing</strong>.</p>
<p>One possibility is to ensure that the L2 norm of each vector is equal to 1.  Here's some code that shows how this is done.</p>

In [23]:
import math
import numpy as np

def l2_normalizer(vec):
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))

print ('A regular old document term matrix: ') 
print (np.matrix(doc_term_matrix))
print ('\nA document term matrix with row-wise L2 norms of 1:')
print (np.matrix(doc_term_matrix_l2))

A regular old document term matrix: 
[[1 2 0 0 1 0 0 1 2 0 1]
 [1 2 1 0 1 0 1 0 1 0 1]
 [1 0 0 1 0 1 1 0 0 1 1]]

A document term matrix with row-wise L2 norms of 1:
[[ 0.28867513  0.57735027  0.          0.          0.28867513  0.          0.
   0.28867513  0.57735027  0.          0.28867513]
 [ 0.31622777  0.63245553  0.31622777  0.          0.31622777  0.
   0.31622777  0.          0.31622777  0.          0.31622777]
 [ 0.40824829  0.          0.          0.40824829  0.          0.40824829
   0.40824829  0.          0.          0.40824829  0.40824829]]


<p>You can see immediately that we've scaled down vectors such that each element is between [0, 1], without losing too much valuable information.</p>
<p>Why would we care about this kind of normalizing?  Think about it this way; if you wanted to make a document seem more related to a particular topic than it actually was, you might try boosting the likelihood of its inclusion into a topic by repeating the same word over and over and over again.  Frankly, at a certain point, we're getting a diminishing return on the informative value of the word.  So we need to scale down words that appear too frequently in a document.  </p>
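The same row-wise normalization can be sketched in vectorized form with NumPy. The function name l2_normalize_rows and the guard for all-zero rows are assumptions added for this example.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale every row to unit L2 norm; all-zero rows are left unchanged."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing by zero for empty documents
    return matrix / norms

tf_matrix = [[1, 2, 0, 0, 1, 0, 0, 1, 2, 0, 1],
             [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1]]
normalized = l2_normalize_rows(tf_matrix)
print(np.linalg.norm(normalized, axis=1))   # each row now has norm 1
```

For the first row the sum of squares is 12, so every count is divided by sqrt(12), which gives the 0.28867513 and 0.57735027 values seen above.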

<h3 id="idf-frequency-weighting">IDF frequency weighting</h3>

<p>But we're still not there yet.  Just as all words aren't equally valuable <em>within</em> a document, not all words are valuable across <em>all documents</em>.  We can try reweighting every word by its <strong>inverse document frequency</strong>. Let's see what's involved in that.</p>

In [24]:
def numDocsContaining(word, doclist):
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount +=1
    return doccount 

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / (float(df)) )

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]

print ('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
print ('The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']')

Our vocabulary vector is [than, me, Jane, baseball, Julie, basketball, likes, Linda, loves, He, more]
The inverse document frequency vector is [0.000000, 0.405465, 1.098612, 1.098612, 0.405465, 1.098612, 0.405465, 1.098612, 0.405465, 1.098612, 0.000000]


<p>Now we have a general sense of information values per term in our vocabulary, accounting for their relative frequency across the entire corpus.  Recall that this is an inverse!  The lower the value, the more frequent it is.</p>
<p>To get TF-IDF weighted word vectors, we have to perform the simple calculation of tf * idf.  </p>
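To make the arithmetic concrete, here is a small worked sketch for three terms of the first document, using the same unsmoothed formula idf = ln(N / df). The document frequencies are read off mydoclist by hand.

```python
import math

n_docs = 3
# Number of documents containing each term: 'more' is in all 3, 'me' in 2, 'Linda' in 1
doc_freq = {'more': 3, 'me': 2, 'Linda': 1}
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Term counts in document 1: "Julie loves me more than Linda loves me"
tf_doc1 = {'more': 1, 'me': 2, 'Linda': 1}
tfidf_doc1 = {term: tf_doc1[term] * idf[term] for term in tf_doc1}

for term in tf_doc1:
    print('%-6s idf=%.6f  tf*idf=%.6f' % (term, idf[term], tfidf_doc1[term]))
```

A term that occurs in every document, like 'more', gets an IDF of ln(1) = 0 and therefore drops out of the weighted vector entirely.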

In [26]:
import numpy as np

def build_idf_matrix(idf_vector):
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)
print (my_idf_matrix)

[[ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.40546511  0.          0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.          1.09861229  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.          0.          1.09861229  0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.40546511  0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          1.09861229
   0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.
   0.40546511  0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   1.098612

Now we have converted our IDF vector into a matrix of size BxB, where B is the vocabulary size and the diagonal holds the IDF vector. That means we can now multiply every term frequency vector by the inverse document frequency matrix. Then, to make sure we are also accounting for words that appear too frequently within documents, we'll normalize each document using the L2 norm.

In [27]:
doc_term_matrix_tfidf = []

#performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))

#normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
                                    
print (vocabulary)
print (np.matrix(doc_term_matrix_tfidf_l2)) # np.matrix() just to make it easier to look at

{'than', 'me', 'Jane', 'baseball', 'Julie', 'basketball', 'likes', 'Linda', 'loves', 'He', 'more'}
[[ 0.          0.49474872  0.          0.          0.24737436  0.          0.
   0.67026363  0.49474872  0.          0.        ]
 [ 0.          0.52812101  0.71547492  0.          0.2640605   0.
   0.2640605   0.          0.2640605   0.          0.        ]
 [ 0.          0.          0.          0.56467328  0.          0.56467328
   0.20840411  0.          0.          0.56467328  0.        ]]
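Multiplying by a diagonal matrix is the same as scaling each column by the matching IDF value, so the BxB matrix is not strictly needed. A short sketch with made-up counts that checks the equivalence using NumPy broadcasting:

```python
import numpy as np

tf = np.array([[1, 2, 0],
               [0, 1, 3]], dtype=float)      # made-up term counts, 2 docs x 3 terms
idf = np.array([0.0, 0.405465, 1.098612])    # one IDF weight per term

via_diagonal = tf @ np.diag(idf)   # explicit diagonal-matrix product
via_broadcast = tf * idf           # elementwise scaling of each column

print(np.allclose(via_diagonal, via_broadcast))   # True
```

For large vocabularies the broadcast form avoids allocating a mostly-zero BxB matrix.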


<p>Now let's see an efficient implementation of the previous approach using scikit-learn, which ensures that you don't have to worry about the efficiency of all the previous steps.</p>
<p><strong>NOTE</strong>: The values you get from the <code>TfidfVectorizer/TfidfTransformer</code> will be different from what we have computed by hand. This is because scikit-learn uses an adapted, smoothed version of IDF to avoid divide-by-zero errors, and because its vectorizers lowercase the text by default.</p>

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print ("Vocabulary:", count_vectorizer.vocabulary_)

# print term_freq_matrix
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)

tf_idf_matrix = tfidf.transform(term_freq_matrix)
print (tf_idf_matrix.todense())

Vocabulary: {'julie': 4, 'loves': 7, 'me': 8, 'more': 9, 'than': 10, 'linda': 6, 'jane': 3, 'likes': 5, 'he': 2, 'basketball': 1, 'baseball': 0}
[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]
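The third row of this output can be reproduced by hand. The sketch below assumes the default smoothed IDF described in the scikit-learn documentation, idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization; the document frequencies are counted by hand from the lowercased mydoclist.

```python
import math

n_docs = 3
# Document frequencies of the terms in document 3 ("he likes basketball more than baseball")
doc_freq = {'baseball': 1, 'basketball': 1, 'he': 1, 'likes': 2, 'more': 3, 'than': 3}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Every term occurs exactly once in document 3, so tf * idf equals idf
norm = math.sqrt(sum(weight ** 2 for weight in idf.values()))
row = {term: weight / norm for term, weight in idf.items()}

for term, value in row.items():
    print('%-10s %.8f' % (term, value))
```

With this smoothing a term that appears in every document keeps a weight of 1 instead of 0, which is why 'more' and 'than' are non-zero here but vanished in the hand computation.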


Or more directly

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)

print (tfidf_matrix.todense())

[[ 0.          0.          0.          0.          0.28945906  0.
   0.38060387  0.57891811  0.57891811  0.22479078  0.22479078]
 [ 0.          0.          0.          0.41715759  0.3172591   0.3172591
   0.          0.3172591   0.6345182   0.24637999  0.24637999]
 [ 0.48359121  0.48359121  0.48359121  0.          0.          0.36778358
   0.          0.          0.          0.28561676  0.28561676]]


And we can fit new observations into this vocabulary space like so:

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print (tfidf_vectorizer.vocabulary_)
print (new_term_freq_matrix.todense())

{'julie': 4, 'loves': 7, 'me': 8, 'more': 9, 'than': 10, 'linda': 6, 'jane': 3, 'likes': 5, 'he': 2, 'basketball': 1, 'baseball': 0}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]


Note that we didn't get words like 'watches' in the new_term_freq_matrix. That's because we trained the object on the documents in mydoclist, and that word doesn't appear in the vocabulary from that corpus. In other words, it's out of the lexicon.
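What happens to out-of-lexicon words can be sketched in plain Python: each token is looked up in the fixed vocabulary and silently skipped if it is missing. The helper project_counts and the three-word vocabulary are made up for illustration and only mimic the counting step, not the TF-IDF weighting.

```python
def project_counts(doc, vocabulary_index):
    """Count only the tokens that exist in a previously fitted vocabulary."""
    counts = [0] * len(vocabulary_index)
    for token in doc.lower().split():
        column = vocabulary_index.get(token)
        if column is not None:      # unknown words such as 'watches' are ignored
            counts[column] += 1
    return counts

# Toy excerpt of a fitted vocabulary: term -> column index
vocabulary_index = {'baseball': 0, 'basketball': 1, 'he': 2}

print(project_counts('He watches basketball and baseball', vocabulary_index))
# [1, 1, 1]
```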

<h2 id="Learning text representations">Learning text representations</h2>
<ul>
<li>The NLTK classifiers expect <code>dict</code>-style feature sets, so we need to transform the text into a <code>dict</code>.</li>
<li>We could use the previous TF-IDF representation. Let's see another one: Bag of Words.</li>
</ul>


In [31]:
# First, write a Feature extractor (the following is taken from nltk-trainer package)

# download featx.py (written by Perkins)

import math
from nltk import probability

def bag_of_words(words):
        return dict([(word, True) for word in words])

def bag_of_words_in_set(words, wordset):
        return bag_of_words(set(words) & wordset)
    
def word_counts(words):
        return dict(probability.FreqDist((w for w in words)))

def word_counts_in_set(words, wordset):
        return word_counts((w for w in words if w in wordset))

def train_test_feats(label, instances, featx=bag_of_words, fraction=0.75):
        labeled_instances = [(featx(i), label) for i in instances]
        
        if fraction != 1.0:
                l = len(instances)
                cutoff = int(math.ceil(l * fraction))
                return labeled_instances[:cutoff], labeled_instances[cutoff:]
        else:
                return labeled_instances, labeled_instances

In [32]:
import nltk
bag_of_words(['this', 'is', 'awesome'])

{'awesome': True, 'is': True, 'this': True}

In [33]:
def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

bag_of_words_not_in_set(['this','is','awesome'],['this'])

{'awesome': True, 'is': True}

In [34]:
from nltk.corpus import stopwords

def bag_of_non_stopwords(words, stopfile = 'english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)

bag_of_non_stopwords(['this','is','awesome'])

#nltk.download()   # try this if you need some package

{'awesome': True}

In [36]:
#nltk.download()   # try this if you need some package
from nltk.corpus import stopwords
print (stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

<h3 id="bi-gram">Bi-gram and n-grams</h3>
<ul>
<li>It is sometimes useful to include <em>significant</em> bi-grams in the bag-of-words model. Note that this example can be extended to n-grams.</li>

<li>In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.</li>

<li>An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on.</li>
</ul>
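The definition above can be sketched in a few lines of plain Python. The helper name ngrams is chosen for this example; it simply zips the token list with shifted copies of itself.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['this', 'is', 'an', 'incredible', 'place']
print(ngrams(tokens, 2))
# [('this', 'is'), ('is', 'an'), ('an', 'incredible'), ('incredible', 'place')]
print(ngrams(tokens, 3))
# [('this', 'is', 'an'), ('is', 'an', 'incredible'), ('an', 'incredible', 'place')]
```

A sequence of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams.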

In [16]:
#featx.py file 

import collections
from nltk.corpus import stopwords, reuters
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

def bag_of_words(words):
	'''
	>>> bag_of_words(['the', 'quick', 'brown', 'fox'])
	{'quick': True, 'brown': True, 'the': True, 'fox': True}
	'''
	return dict([(word, True) for word in words])

def bag_of_words_not_in_set(words, badwords):
	'''
	>>> bag_of_words_not_in_set(['the', 'quick', 'brown', 'fox'], ['the'])
	{'quick': True, 'brown': True, 'fox': True}
	'''
	return bag_of_words(set(words) - set(badwords))

def bag_of_non_stopwords(words, stopfile='english'):
	'''
	>>> bag_of_non_stopwords(['the', 'quick', 'brown', 'fox'])
	{'quick': True, 'brown': True, 'fox': True}
	'''
	badwords = stopwords.words(stopfile)
	return bag_of_words_not_in_set(words, badwords)

def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
	'''
	>>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox'])
	{'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True}
	'''
	bigram_finder = BigramCollocationFinder.from_words(words)
	bigrams = bigram_finder.nbest(score_fn, n)
	return bag_of_words(words + bigrams)

def bag_of_words_in_set(words, goodwords):
	return bag_of_words(set(words) & set(goodwords))

def label_feats_from_corpus(corp, feature_detector=bag_of_words):
	label_feats = collections.defaultdict(list)
	
	for label in corp.categories():
		for fileid in corp.fileids(categories=[label]):
			feats = feature_detector(corp.words(fileids=[fileid]))
			label_feats[label].append(feats)
	
	return label_feats

def split_label_feats(lfeats, split=0.75):
	train_feats = []
	test_feats = []
	
	for label, feats in lfeats.items():
		cutoff = int(len(feats) * split)
		train_feats.extend([(feat, label) for feat in feats[:cutoff]])
		test_feats.extend([(feat, label) for feat in feats[cutoff:]])
	
	return train_feats, test_feats

def high_information_words(labelled_words, score_fn=BigramAssocMeasures.chi_sq, min_score=5):
	word_fd = FreqDist()
	label_word_fd = ConditionalFreqDist()
	
	for label, words in labelled_words:
		for word in words:
			word_fd[word] += 1  # FreqDist.inc() was removed in NLTK 3
			label_word_fd[label][word] += 1
	
	n_xx = label_word_fd.N()
	high_info_words = set()
	
	for label in label_word_fd.conditions():
		n_xi = label_word_fd[label].N()
		word_scores = collections.defaultdict(int)
		
		for word, n_ii in label_word_fd[label].items():
			n_ix = word_fd[word]
			score = score_fn(n_ii, (n_ix, n_xi), n_xx)
			word_scores[word] = score
		
		bestwords = [word for word, score in word_scores.items() if score >= min_score]
		high_info_words |= set(bestwords)
	
	return high_info_words

def reuters_high_info_words(score_fn=BigramAssocMeasures.chi_sq):
	labeled_words = []
	
	for label in reuters.categories():
		labeled_words.append((label, reuters.words(categories=[label])))
	
	return high_information_words(labeled_words, score_fn=score_fn)

def reuters_train_test_feats(feature_detector=bag_of_words):
	train_feats = []
	test_feats = []
	
	for fileid in reuters.fileids():
		if fileid.startswith('training'):
			featlist = train_feats
		else: # fileid.startswith('test')
			featlist = test_feats
		
		feats = feature_detector(reuters.words(fileid))
		labels = reuters.categories(fileid)
		featlist.append((feats, labels))
	
	return train_feats, test_feats

if __name__ == '__main__':
	import doctest
	doctest.testmod()

**********************************************************************
File "__main__", line 33, in __main__.bag_of_bigrams_words
Failed example:
    bag_of_bigrams_words(['the', 'quick', 'brown', 'fox'])
Expected:
    {'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True}
Got:
    {'the': True, 'quick': True, 'brown': True, 'fox': True, ('brown', 'fox'): True, ('quick', 'brown'): True, ('the', 'quick'): True}
**********************************************************************
File "__main__", line 25, in __main__.bag_of_non_stopwords
Failed example:
    bag_of_non_stopwords(['the', 'quick', 'brown', 'fox'])
Expected:
    {'quick': True, 'brown': True, 'fox': True}
Got:
    {'quick': True, 'fox': True, 'brown': True}
**********************************************************************
File "__main__", line 11, in __main__.bag_of_words
Failed example:
    bag_of_words(['the', 'quick', 'brown', 'fox'])
Expec

In [17]:
#from featx import bag_of_bigrams_words
bag_of_bigrams_words(['this','is','an','incredible','place'])

{'this': True,
 'is': True,
 'an': True,
 'incredible': True,
 'place': True,
 ('an', 'incredible'): True,
 ('incredible', 'place'): True,
 ('is', 'an'): True,
 ('this', 'is'): True}

<h2 id="training-a-naive-bayes-classifier">Training a (naive Bayes) Classifier with NLTK: film critiques example</h2>
<ul>
<li>Once we have extracted features from text, we can train a classifier.</li>
<li>The easiest classifier to get started with is the <code>NaiveBayesClassifier</code>.<ul>
<li>It uses <em>Bayes' theorem</em> to predict the probability that a given feature set belongs to a particular label. Recall the formula (more detail in the next classes):<pre><code>P(label|features) = P(label) * P(features|label) / P(features)
</code></pre></li>
</ul>
</li>
</ul>
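The formula can be made concrete with a toy computation. Under the naive assumption, P(features|label) is the product of the per-feature probabilities, and P(features) is obtained by summing the numerators over all labels. The priors and likelihoods below are invented numbers, not values estimated from any corpus, and the function name posterior is hypothetical.

```python
def posterior(priors, likelihoods, features):
    """Return P(label | features) for every label, assuming independent features."""
    joint = {}
    for label, prior in priors.items():
        p = prior                              # P(label)
        for feature in features:
            p *= likelihoods[label][feature]   # times P(feature | label)
        joint[label] = p
    evidence = sum(joint.values())             # P(features)
    return {label: p / evidence for label, p in joint.items()}

# Made-up numbers for illustration only
priors = {'pos': 0.5, 'neg': 0.5}
likelihoods = {'pos': {'magnificent': 0.20, 'ludicrous': 0.02},
               'neg': {'magnificent': 0.02, 'ludicrous': 0.20}}

print(posterior(priors, likelihoods, ['ludicrous']))
```

With these numbers the review is assigned to 'neg' with probability 0.1 / 0.11, roughly 0.91.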
<ul>
<li>Corpus: movie reviews corpus<ul>
<li>Each file in the corpus is a positive or a negative movie review.</li>
<li>Let's try a sentiment analysis.</li>
</ul>
</li>
</ul>

In [18]:
#nltk.download()
from nltk.corpus import movie_reviews
movie_reviews.categories()

['neg', 'pos']

The <code>label_feats_from_corpus()</code> function takes a <em>corpus</em>, and a <em>feature_detector function</em>, which is <code>bag_of_words()</code> by default.

In [19]:
lfeats = label_feats_from_corpus(movie_reviews)
lfeats.keys()


dict_keys(['neg', 'pos'])

Once we get a mapping of <code>label:feature</code> sets, as shown in <code>lfeats</code>: <pre><code>defaultdict(&lt;type 'list'&gt;, {'neg': [{'all': True, 'concept': True, 'skip': True, 'go': True, 'seemed': True, 'suits': True, 'presents': True, 'to': True, 'sitting': True, 'very': True, 'horror': True, 'continues': True, 'every': True, 'exact': True, 'cool': True, 'entire': True, 'did': True, 'dig': True, 'flick': True, 'neighborhood': True, 'crow': True, 'street': True, 'video': True, 'further': True,.............
</code></pre>we need to split the data into training and test sets.

In [20]:
# (split = 0.75) by default

train_feats, test_feats = split_label_feats(lfeats)
len(train_feats)

1500

In [21]:
len(test_feats)

500
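The 1,500 and 500 figures are what one gets by taking the first 75% of each label's feature sets for training and the remainder for testing. Below is a minimal sketch of such a split, assuming this is how `split_label_feats()` behaves (its definition is not shown in this section). It is run on dummy data with the same number of items per label as the corpus.

```python
def split_label_feats_sketch(lfeats, split=0.75):
    """Split {label: [feature_set, ...]} into (train, test) lists of (feature_set, label)."""
    train_feats, test_feats = [], []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        train_feats.extend((feat, label) for feat in feats[:cutoff])
        test_feats.extend((feat, label) for feat in feats[cutoff:])
    return train_feats, test_feats

# Dummy data: 1,000 placeholder feature sets per label.
dummy = {
    "neg": [{"doc%d" % i: True} for i in range(1000)],
    "pos": [{"doc%d" % i: True} for i in range(1000)],
}
train_demo, test_demo = split_label_feats_sketch(dummy)
print(len(train_demo), len(test_demo))  # 1500 500
```

Because the cutoff is applied per label, both sets keep the same 50/50 balance of labels as the full data.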

So there are 1,000 pos files and 1,000 neg files, and we end up with 1,500 labeled training instances and 500 labeled testing instances. Now we can train a NaiveBayesClassifier using its train() method.

In [22]:
from nltk.classify import NaiveBayesClassifier
nb_classifier = NaiveBayesClassifier.train(train_feats)
nb_classifier.labels()

['neg', 'pos']

Once trained, let's test the classifier on some made-up reviews. The classify() method takes a single argument, which should be a feature set. We can use the bag_of_words() feature detector on a made-up list of words to get the feature set.

In [23]:
negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
nb_classifier.classify(negfeat)

'neg'

In [24]:
posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible'])
nb_classifier.classify(posfeat)

'pos'

We can test the accuracy of the classifier on the held-out test set:

In [25]:
from nltk.classify.util import accuracy
accuracy(nb_classifier, test_feats)

0.728
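Accuracy here is the fraction of test instances whose predicted label matches the true label, so 0.728 corresponds to 364 of the 500 test instances. A sketch of that computation follows, with a hypothetical rule-based `toy_classify` function standing in for a trained classifier's `classify()` method:

```python
def accuracy_sketch(classify, labeled_feats):
    """Fraction of (feature_set, label) pairs for which classify(feature_set) == label."""
    correct = sum(1 for feats, label in labeled_feats if classify(feats) == label)
    return correct / len(labeled_feats)

def toy_classify(feats):
    # Hypothetical stand-in for a trained classifier's classify() method
    return "neg" if "ludicrous" in feats else "pos"

toy_test = [
    ({"ludicrous": True, "plot": True}, "neg"),
    ({"magnificent": True}, "pos"),
    ({"boring": True}, "neg"),        # misclassified by the toy rule
    ({"outstanding": True}, "pos"),
]
print(accuracy_sketch(toy_classify, toy_test))  # 0.75
```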

To get the classification probability of each label, you can use the prob_classify() method.

In [26]:
probs = nb_classifier.prob_classify(test_feats[0][0])
probs.samples()
# ['neg','pos']
probs.max()
#'pos'
probs.prob('pos')
#0.99999996464309127
probs.prob('neg')
#3.5356889692409258e-08

3.5356889692412044e-08

The most_informative_features() method returns a list of the form [(feature name, feature value)], ordered from most informative to least informative. In this case, the feature value will always be True, though.

In [27]:
nb_classifier.most_informative_features(n=5)

[('magnificent', True),
 ('outstanding', True),
 ('insulting', True),
 ('vulnerable', True),
 ('ludicrous', True)]

The show_most_informative_features() method will print out the results and include the probability of a feature pair belonging to each label.

In [28]:
nb_classifier.show_most_informative_features(n=5)

Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
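Each ratio compares how likely the feature is under one label versus the other, that is, P(feature|label A) / P(feature|label B). The sketch below computes such a ratio from document counts. The counts are hypothetical, and the add-0.5 smoothing is an assumption meant to resemble NLTK's default estimator, not a reproduction of its internals.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Estimate P(feature=True | label) with additive (Lidstone-style) smoothing."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: training documents of each label that contain the word.
docs_per_label = {"pos": 750, "neg": 750}
docs_with_word = {"pos": 20, "neg": 2}

p = {
    label: smoothed_prob(docs_with_word[label], docs_per_label[label])
    for label in docs_per_label
}
ratio = p["pos"] / p["neg"]
print("pos : neg = %.1f : 1.0" % ratio)
```

Smoothing keeps the ratio finite even when a word never occurs with one of the labels.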


<h2 id="training-a-naive-bayes-classifier">Training a (naive Bayes) Classifier with TextBlob: A Tweet Sentiment Analyzer</h2>

We will train a simple sentiment analyzer on a small dataset of fake tweets. To begin, we’ll import the textblob.classifiers module and create some training and test data.


https://blog.twitter.com/2012/a-new-barometer-for-the-election

"( Wednesday, August 1, 2012 | By Adam Sharp (@AdamS) [16:00 UTC] ) One glance at the numbers, and it’s easy to see why pundits are already calling 2012 “the Twitter election.” More Tweets are sent every two days today than had ever been sent prior to Election Day 2008 — and Election Day 2008’s Tweet volume represents only about six minutes of Tweets today."

"Each day, the Index evaluates and weighs the sentiment of Tweets mentioning Obama or Romney relative to the more than 400 million Tweets sent on all other topics. For example, a score of 73 for a candidate indicates that Tweets containing their name or account name are on average more positive than 73 percent of all Tweets."

<img src="Tweetpolitics.png">

In [30]:
from textblob.classifiers import NaiveBayesClassifier

In [31]:


train = [
    ('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')
]
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]



We create a new classifier by passing training data into the constructor for a NaiveBayesClassifier.

In [32]:
cl = NaiveBayesClassifier(train)

We can now classify arbitrary text using the NaiveBayesClassifier.classify(text) method.

In [33]:
cl.classify("Their burgers are amazing")  # "pos"

'pos'

In [34]:
cl.classify("I don't like their pizza.")  # "neg"

'neg'

Another way to classify strings of text is to use TextBlob objects. You can pass classifiers into the constructor of a TextBlob.

In [36]:
import textblob
blob = textblob.TextBlob("The beer was amazing. "
                "But the hangover was horrible. My boss was not happy.", classifier=cl)
print (blob)

The beer was amazing. But the hangover was horrible. My boss was not happy.


You can then call the classify() method on the blob.

In [37]:
blob.classify()  # "neg"

'neg'

You can also take advantage of TextBlob’s sentence tokenization and classify each sentence individually.

In [38]:
for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())
# "pos", "neg", "neg"

The beer was amazing.
pos
But the hangover was horrible.
neg
My boss was not happy.
neg


In [39]:
cl.accuracy(test)  

0.8333333333333334

We can improve our classifier by adding more training and test data. Here we’ll add data from the movie review corpus which was downloaded with NLTK.

In [41]:
import random
from nltk.corpus import movie_reviews

reviews = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

new_train, new_test = reviews[0:100], reviews[101:200]

Let’s see what one of these documents looks like.

In [42]:
print(new_train[0])

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'b

We can now update our classifier with the new training data using the update(new_data) method, as well as test it using the larger test dataset.

In [43]:
cl.update(new_train) # it takes a while
accuracy = cl.accuracy(test + new_test) 
print("Accuracy: {0}".format(accuracy))

Accuracy: 0.9714285714285714
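A caveat about the slices used above: `reviews` is built category by category, so its first 1,000 entries are all 'neg'. The unshuffled slices `reviews[0:100]` and `reviews[101:200]` therefore contain a single label (and skip index 100). The toy list below, a hypothetical stand-in for `reviews`, reproduces the effect and shows how shuffling with a fixed seed changes it.

```python
import random
from collections import Counter

# Stand-in for `reviews`: 1,000 'neg' items followed by 1,000 'pos' items.
toy_reviews = [("doc%d" % i, "neg") for i in range(1000)] + \
              [("doc%d" % i, "pos") for i in range(1000, 2000)]

# Unshuffled slices: only one label is present.
unshuffled = toy_reviews[0:100] + toy_reviews[101:200]
print(Counter(label for _, label in unshuffled))  # Counter({'neg': 199})

# Shuffle a copy with a fixed seed, then take slices of the same total size.
rng = random.Random(1)
shuffled = toy_reviews[:]
rng.shuffle(shuffled)
mixed = shuffled[0:100] + shuffled[100:200]
print(Counter(label for _, label in mixed))
```

An accuracy computed on a single-label test slice mostly reflects how often the classifier predicts that one label, so it should be read with caution.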


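Here’s the full, updated script. Unlike the cells above, it shuffles the reviews before slicing, so both labels appear in the new training and test data. The unshuffled slices used earlier contain only 'neg' reviews, which largely explains why the second accuracy printed here is lower, and more realistic, than the 0.97 obtained above: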

In [45]:
import random
from nltk.corpus import movie_reviews
from textblob.classifiers import NaiveBayesClassifier
random.seed(1)
 
train = [
('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')
]
test = [
('The beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]
 
cl = NaiveBayesClassifier(train)
accuracy = cl.accuracy(test)
print("Accuracy: {0}".format(accuracy))

# Grab some movie review data
reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]

random.shuffle(reviews)
new_train, new_test = reviews[0:100], reviews[101:200]
 
# Update the classifier with the new training data
cl.update(new_train)
 
# Compute accuracy
accuracy = cl.accuracy(test + new_test)
print("Accuracy: {0}".format(accuracy))

Accuracy: 0.8333333333333334
Accuracy: 0.7714285714285715


## Pattern Python module

Pattern is a web mining module for the Python programming language. It bundles tools for data mining (Google + Twitter + Wikipedia API, web crawler, HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization).

Modules:
    
    pattern.web
    pattern.db
    pattern.en | es | de | fr | it | nl
    pattern.search
    pattern.vector
    pattern.graph 


In [63]:
# For older Python versions; on newer ones, use python-twitter 3.3 instead
# pip install pattern
#
#from pattern.web import Twitter3
#t = Twitter3()
#i = None
#for j in range(3):
#    print (j)
#    for tweet in t.search('notebook', start=i, count=10):
#        print (tweet)
#        print (tweet.text)
#        print
#        i = tweet.id

### Homework exercise

Define a binary problem to be learned by a naive Bayes classifier. For this purpose, define a set of text examples from two categories:

$\bullet$ Design a corpus or load it from already existing examples (or data from previous classes)

$\bullet$ Define the positive and negative sets

$\bullet$ Train the classifier (define a proper partition of train and test data)

$\bullet$ Estimate the performance of your trained classifier

Summary of some useful functions

<img src="summary.png">