# Lab 7: Text Analysis

- **Instructor:** Li Zeng ([lizeng@uw.edu](mailto:lizeng@uw.edu))
- **Date:** Nov 1, 2018
- **Course:** IMT 547 AU18 - Social Media Data Mining and Analysis

In this tutorial, we will perform text analysis using [NLTK](https://www.nltk.org/), a Python package that provides a diverse set of natural language processing algorithms. It is free, open source, easy to use, well documented, and backed by a large community.

### Topics
* Tokenization
* Normalization
* Text Classification using Bag of Words and TF-IDF

## Install and import NLTK

In [1]:
#In your terminal:
#pip install nltk

#Loading NLTK
import nltk

In [26]:
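# Download NLTK corpora
#nltk.download()
nltk.download('wordnet')
# The tokenizer and stop word cells below also need these resources
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
#nltk.download('punkt')
#nltk.download('stopwords')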

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sakura\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

## Import data - threads on [Topix](http://www.topix.com/forum/us)

In this tutorial, we will work on a sample of text data scraped from a public forum on Topix for the local Santa Cruz community to discuss news and issues. 

In [2]:
# Loading pandas and numpy
import pandas as pd
import numpy as np

In [4]:
data = pd.read_csv("topix.csv")

## Inspect data

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4580 entries, 0 to 4579
Data columns (total 15 columns):
thread_id          4580 non-null int64
title              4580 non-null object
author_name        4580 non-null object
author_geo         4580 non-null object
registered_user    4580 non-null bool
comment_order      4580 non-null object
comment_date       4580 non-null object
post_content       4580 non-null object
judge1_title       4580 non-null object
judge1_count       4580 non-null int64
judge2_title       4231 non-null object
judge2_count       4580 non-null int64
judge3_title       3972 non-null object
judge3_count       4580 non-null int64
score              4580 non-null object
dtypes: bool(1), int64(4), object(10)
memory usage: 505.5+ KB


In [6]:
data.head()

| | thread_id | title | author_name | author_geo | registered_user | comment_order | comment_date | post_content | judge1_title | judge1_count | judge2_title | judge2_count | judge3_title | judge3_count | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Norse distributes flier promoting executing ho... | Give Up Santa Cruz | Santa Cruz, CA | False | #1 | 25-Oct-13 | https://www.indybay.org/newsitems/2013/10/23/.... | Interesting | 1 | Incendiary | 1 | Disagree | 1 | Negative |
| 1 | 1 | Norse distributes flier promoting executing ho... | Bud | Santa Cruz, CA | False | #2 | 25-Oct-13 | Give Up Santa Cruz wrote: https://www.indy... | Brilliant | 3 | Agree | 3 | Helpful | 3 | Positive |
| 2 | 1 | Norse distributes flier promoting executing ho... | DBS | Cupertino, CA | False | #3 | 25-Oct-13 | It's really too bad that Robert doesn't have h... | Brilliant | 4 | Agree | 4 | Helpful | 4 | Positive |
| 3 | 1 | Norse distributes flier promoting executing ho... | jeff helms | Walnut Creek, CA | False | #4 | 25-Oct-13 | Give Up Santa Cruz wrote: https://www.indy... | Spam | 4 | Clueless | 4 | Nuts | 4 | Negative |
| 4 | 1 | Norse distributes flier promoting executing ho... | jeff helms | Walnut Creek, CA | False | #5 | 25-Oct-13 | Give Up Santa Cruz wrote: https://www.indy... | Spam | 3 | Clueless | 3 | Nuts | 3 | Negative |


In [7]:
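# keep the first 1,000 posts; copy so that later column assignments
# do not trigger pandas' SettingWithCopyWarning
data = data.iloc[:1000].copy()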

In [8]:
data['score'].value_counts()

Negative    545
Positive    455
Name: score, dtype: int64

## Clean data
Before we analyze and mine text data, we want it to contain as little noise as possible. However, this is rarely the case for text collected from the real world, so we need to de-noise the raw text first.

#### Sample noise removal tasks could include:
* removing text file headers, footers
* removing HTML, XML, etc. markup and metadata
* extracting valuable data from other formats, such as JSON
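
The tasks above can be sketched with the standard library alone. The record below is a made-up example (the field names `author` and `body` are assumptions, not columns of the Topix dataset), and the regex-based tag stripping is a rough approach that works for simple markup; a real HTML parser is safer for messy pages.

```python
import html
import json
import re

# Hypothetical scraped record: JSON wrapping an HTML fragment.
raw_record = '{"author": "demo_user", "body": "<p>Rents are <b>too high</b> &amp; rising.</p>"}'

def strip_markup(text):
    """Remove HTML/XML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)   # drop anything that looks like a tag
    decoded = html.unescape(no_tags)          # e.g. &amp; becomes &
    return re.sub(r'\s+', ' ', decoded).strip()

record = json.loads(raw_record)               # extract the field we care about
clean_body = strip_markup(record['body'])
print(clean_body)
```

A function like `strip_markup` could be called from `customized_cleaning` alongside the URL removal if the scraped posts still contain markup.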

In [9]:
# raw text
data['post_content'].head()

0    https://www.indybay.org/newsitems/2013/10/23/....
1      Give Up Santa Cruz wrote:   https://www.indy...
2    It's really too bad that Robert doesn't have h...
3      Give Up Santa Cruz wrote:   https://www.indy...
4      Give Up Santa Cruz wrote:   https://www.indy...
Name: post_content, dtype: object

In [10]:
import re

In [13]:
def customized_cleaning(x):
    # remove URLs
    cleaned_x = re.sub(r'http\S+', '', x)
    # optionally remove non-ASCII characters from the string (Python 3)
    #cleaned_x = cleaned_x.encode('ascii', 'ignore').decode('ascii')
    # you can implement other cleaning tasks here
    return cleaned_x

In [14]:
data['post_content'] = data['post_content'].apply(customized_cleaning)

In [15]:
data['post_content'].head()

0                                 His words, not mine.
1      Give Up Santa Cruz wrote:    words, not mine...
2    It's really too bad that Robert doesn't have h...
3      Give Up Santa Cruz wrote:    words, not mine...
4      Give Up Santa Cruz wrote:    words, not mine...
Name: post_content, dtype: object

## Tokenization
Tokenization is the first step in text analytics: the process of breaking a paragraph of text into smaller chunks, such as sentences or words. A token is a single unit, the building block of a sentence or paragraph.

In [16]:
# let's first play around with the first few posts in this sample
posts = data['post_content']
print(posts.head().values)

[' His words, not mine.'
 '  Give Up Santa Cruz wrote:    words, not mine.   Rather than click a link and give Indybay more visitors just let it be what it is: A theatrical display of ignorance.'
 "It's really too bad that Robert doesn't have his pathetic fliers printed on toilet paper.Then for the first time, he would have done something to help the homeless he pretends to care so much about!!"
 '  Give Up Santa Cruz wrote:    words, not mine.   Don\'t you think its kinda of ironic people think they can attack and shove aside someone who is poor?--most egged on by the prostitution racket?? Where would these prostitutes be without prostitution?? HOMELESS!--there\'s the "get the tweakers" crowd also--and "tweakers" are people also and might turn into a fascist movement etc.--we all know what a horrible monster fascism is. The anti zoonotic crowd has legitimate complaint along with private property owners. The anti-drug crowd needs to look at the std/drug epidemic and what\'s causing it-

In [17]:
# sentence tokenization - breaks text paragraph into sentences.
from nltk.tokenize import sent_tokenize
tokenized_text = posts.apply(sent_tokenize)
print(tokenized_text.head().values)

[list([' His words, not mine.'])
 list(['  Give Up Santa Cruz wrote:    words, not mine.', 'Rather than click a link and give Indybay more visitors just let it be what it is: A theatrical display of ignorance.'])
 list(["It's really too bad that Robert doesn't have his pathetic fliers printed on toilet paper.Then for the first time, he would have done something to help the homeless he pretends to care so much about!", '!'])
 list(['  Give Up Santa Cruz wrote:    words, not mine.', "Don't you think its kinda of ironic people think they can attack and shove aside someone who is poor?--most egged on by the prostitution racket??", 'Where would these prostitutes be without prostitution??', 'HOMELESS!--there\'s the "get the tweakers" crowd also--and "tweakers" are people also and might turn into a fascist movement etc.--we all know what a horrible monster fascism is.', 'The anti zoonotic crowd has legitimate complaint along with private property owners.', 'The anti-drug crowd needs to look a

In [18]:
# word tokenization - breaks text paragraph into words.
from nltk.tokenize import word_tokenize
tokenized_word=posts.apply(word_tokenize)
print(tokenized_word.head().values)

[list(['His', 'words', ',', 'not', 'mine', '.'])
 list(['Give', 'Up', 'Santa', 'Cruz', 'wrote', ':', 'words', ',', 'not', 'mine', '.', 'Rather', 'than', 'click', 'a', 'link', 'and', 'give', 'Indybay', 'more', 'visitors', 'just', 'let', 'it', 'be', 'what', 'it', 'is', ':', 'A', 'theatrical', 'display', 'of', 'ignorance', '.'])
 list(['It', "'s", 'really', 'too', 'bad', 'that', 'Robert', 'does', "n't", 'have', 'his', 'pathetic', 'fliers', 'printed', 'on', 'toilet', 'paper.Then', 'for', 'the', 'first', 'time', ',', 'he', 'would', 'have', 'done', 'something', 'to', 'help', 'the', 'homeless', 'he', 'pretends', 'to', 'care', 'so', 'much', 'about', '!', '!'])
 list(['Give', 'Up', 'Santa', 'Cruz', 'wrote', ':', 'words', ',', 'not', 'mine', '.', 'Do', "n't", 'you', 'think', 'its', 'kinda', 'of', 'ironic', 'people', 'think', 'they', 'can', 'attack', 'and', 'shove', 'aside', 'someone', 'who', 'is', 'poor', '?', '--', 'most', 'egged', 'on', 'by', 'the', 'prostitution', 'racket', '?', '?', 'Where',
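
In the output above, `paper.Then` stays a single token, and the sentence tokenizer does not split the third post at that point either, because the post has no space after the period. One possible fix, sketched below, is to insert the missing space before tokenizing. It assumes that a period, question mark, or exclamation mark followed directly by a capital letter marks a sentence boundary. The helper name `add_missing_spaces` is hypothetical, and the heuristic will also split abbreviations such as `U.S.A`, so treat it as a starting point.

```python
import re

def add_missing_spaces(text):
    """Insert a space after ., ! or ? when a capital letter follows directly."""
    return re.sub(r'([.!?])([A-Z])', r'\1 \2', text)

sample = "printed on toilet paper.Then for the first time"
fixed = add_missing_spaces(sample)
print(fixed)
```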

## Normalization

Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.

Remember, after tokenization, we are no longer working at a text level, but now at a word level. Our normalization functions, shown below, reflect this.

In [19]:
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [20]:
def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

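def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    # build the stop word set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words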

def normalize(words):
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    return words

In [21]:
normalized_posts = tokenized_word.apply(normalize)
print(normalized_posts.head())

0                                        [words, mine]
1    [give, santa, cruz, wrote, words, mine, rather...
2    [really, bad, robert, nt, pathetic, fliers, pr...
3    [give, santa, cruz, wrote, words, mine, nt, th...
4    [give, santa, cruz, wrote, words, mine, robert...
Name: post_content, dtype: object
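
The `nt` tokens in the output above come from contractions: `word_tokenize` splits `doesn't` into `does` and `n't`, and `remove_punctuation` then strips the apostrophe, leaving `nt`, which is not in the stop word list. A minimal sketch of one way to handle this is to map contraction pieces to full words before removing punctuation. The mapping `CONTRACTION_PIECES` and the helper `expand_contraction_pieces` are hypothetical names, and the mapping is deliberately small (it skips ambiguous pieces such as `'s`).

```python
# Hypothetical mapping from Treebank-style contraction pieces to full words.
CONTRACTION_PIECES = {
    "n't": "not",
    "'re": "are",
    "'ve": "have",
    "'ll": "will",
}

def expand_contraction_pieces(words):
    """Replace contraction fragments in a list of lowercased tokens."""
    return [CONTRACTION_PIECES.get(word, word) for word in words]

tokens = ['it', "'s", 'really', 'too', 'bad', 'that', 'robert', 'does', "n't"]
expanded = expand_contraction_pieces(tokens)
print(expanded)
```

Calling such a helper between `to_lowercase` and `remove_punctuation` would turn `n't` into `not`. Since `not` is in NLTK's English stop word list (it was removed from the first post above), it would then be dropped unless you take it out of the stop list, which may matter for sentiment-like labels.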


### Stemming and Lemmatization

In [22]:
from nltk.stem import LancasterStemmer, WordNetLemmatizer

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

In [23]:
normalized_posts.iloc[:10]

0                                        [words, mine]
1    [give, santa, cruz, wrote, words, mine, rather...
2    [really, bad, robert, nt, pathetic, fliers, pr...
3    [give, santa, cruz, wrote, words, mine, nt, th...
4    [give, santa, cruz, wrote, words, mine, robert...
5    [people, using, homeless, racketeer, property,...
6    [looks, like, roberts, flyer, coming, true, le...
7    [punk, ass, wannabes, think, push, americans, ...
8    [expensive, apartment, losing, house, illegall...
9    [bud, wrote, quoted, text, rather, click, link...
Name: post_content, dtype: object

In [24]:
normalized_posts.iloc[:10].apply(stem_words)

0                                          [word, min]
1    [giv, sant, cruz, wrot, word, min, rath, click...
2    [real, bad, robert, nt, pathet, fli, print, to...
3    [giv, sant, cruz, wrot, word, min, nt, think, ...
4    [giv, sant, cruz, wrot, word, min, robert, sol...
5    [peopl, us, homeless, racket, property, etc, r...
6    [look, lik, robert, fly, com, tru, leigh, say,...
7    [punk, ass, wannab, think, push, am, around, d...
8    [expend, apart, los, hous, illeg, racket, prop...
9    [bud, wrot, quot, text, rath, click, link, giv...
Name: post_content, dtype: object

In [27]:
normalized_posts.iloc[:10].apply(lemmatize_verbs)

0                                         [word, mine]
1    [give, santa, cruz, write, word, mine, rather,...
2    [really, bad, robert, nt, pathetic, fliers, pr...
3    [give, santa, cruz, write, word, mine, nt, thi...
4    [give, santa, cruz, write, word, mine, robert,...
5    [people, use, homeless, racketeer, property, e...
6    [look, like, roberts, flyer, come, true, leigh...
7    [punk, ass, wannabes, think, push, americans, ...
8    [expensive, apartment, lose, house, illegally,...
9    [bud, write, quote, text, rather, click, link,...
Name: post_content, dtype: object

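Compare the stemmed and lemmatized results. The stemmer strips suffixes by rule and often produces non-words (`giv`, `wrot`, `peopl`), while the verb lemmatizer maps inflected verbs to dictionary forms (`wrote` to `write`, `using` to `use`) and leaves words it does not recognize as verb forms unchanged. Depending on your downstream NLP task or preference, one may be more appropriate than the other.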
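
To make the difference concrete, here is a toy sketch that contrasts rule-based suffix stripping with dictionary lookup. It is not the Lancaster algorithm or WordNet: the suffix list `SUFFIXES` and the mini-dictionary `LEMMA_TABLE` are hypothetical and were picked only so that a few results resemble the outputs above.

```python
# Toy illustration only: NOT the Lancaster algorithm or WordNet.
SUFFIXES = ("ing", "ed", "es", "s", "e")            # hypothetical rule list
LEMMA_TABLE = {"wrote": "write", "using": "use",    # hypothetical mini-dictionary
               "losing": "lose", "coming": "come"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words pass through unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["wrote", "using", "losing", "words"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer needs no vocabulary but can cut too much, while the lookup only helps for words it knows about. Real stemmers and lemmatizers make the same trade-off with far larger rule sets and dictionaries.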

## Text Classification

### Feature Generation using Bag of Words

In [28]:
# Prepare the text for ML
text = normalized_posts.apply(lemmatize_verbs)
text_str = [" ".join(t) for t in text.values]

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
x = np.asarray(text_str)
# min_df=1 (the default) keeps every term; recent scikit-learn versions may reject the integer 0
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(x)
X = vectorizer.transform(x)
X = X.toarray()
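
To see what a bag-of-words matrix contains, here is a hand-rolled sketch on a made-up two-document corpus. It splits on whitespace only, whereas `CountVectorizer` applies its own token pattern, so the two will not always agree on real text. The names `docs`, `vocabulary`, and `bag_of_words` are illustrative.

```python
from collections import Counter

docs = ["give santa cruz word", "word mine word"]   # made-up mini corpus

# Vocabulary: every distinct token, sorted so the column order is reproducible.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc):
    """Return the count of each vocabulary term in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

matrix = [bag_of_words(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Each row is one document and each column one vocabulary term, which is why the real matrix above has one column per distinct term in the 1,000 posts.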

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, data['score'], test_size=0.3, random_state=1)

In [32]:
print(X_train.shape, Y_train.shape)
print(X_test.shape,  Y_test.shape)

(700, 5633) (700,)
(300, 5633) (300,)


In [34]:
from sklearn.naive_bayes import MultinomialNB
# call the fitted model fitted_model, for future reference:
fitted_model = MultinomialNB()
fitted_model.fit(X_train, Y_train)
print(fitted_model.score(X_test, Y_test))

0.61
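
To put the accuracy above in context, it helps to compare it with a majority-class baseline, i.e. a classifier that always predicts the most frequent label. The sketch below uses the label counts printed earlier for the 1,000-post sample (545 Negative, 455 Positive). The exact class balance of the 300-post test split is not shown in this lab, so the baseline is only approximate.

```python
# Label counts from data['score'].value_counts() on the 1,000-post sample.
label_counts = {"Negative": 545, "Positive": 455}

majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / sum(label_counts.values())

print(majority_label, baseline_accuracy)
```

Always predicting `Negative` would be right about 54.5% of the time on the full sample, so the Naive Bayes score above is a modest improvement over that baseline.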


### Feature Generation using TF-IDF

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer()
text_tf= tf.fit_transform(text_str)
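
TF-IDF down-weights terms that appear in many documents. The sketch below computes it by hand on a made-up two-document corpus, using the smoothed formula that, to the best of our reading of the scikit-learn documentation, `TfidfVectorizer` applies by default: idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. Tokenization here is plain whitespace splitting, which is a simplification.

```python
import math
from collections import Counter

docs = ["give santa cruz word", "word mine word"]   # made-up mini corpus
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(t for tokens in tokenized for t in set(tokens))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Raw term counts times idf, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

vectors = [tfidf_vector(tokens) for tokens in tokenized]
print(vocabulary)
print(vectors)
```

In the first document, `word` receives a lower weight than `give` even though both occur once, because `word` appears in every document and therefore carries less information.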

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_tf, data['score'], test_size=0.3, random_state=1)

In [38]:
from sklearn.naive_bayes import MultinomialNB
# call the fitted model fitted_model, for future reference:
fitted_model = MultinomialNB()
fitted_model.fit(X_train, Y_train)
print(fitted_model.score(X_test, Y_test))

0.6466666666666666
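
The two accuracies were measured on the same 300-post test split, so the gap between them corresponds to a small number of posts. The sketch below converts the scores back to counts and computes a rough binomial standard error for an accuracy near 0.6. It assumes the test posts are independent draws, which is a simplification, since many posts come from the same thread.

```python
import math

n_test = 300                      # size of the test split (30% of 1,000 posts)
acc_bow = 0.61                    # bag-of-words score reported above
acc_tfidf = 0.6466666666666666    # TF-IDF score reported above

correct_bow = round(acc_bow * n_test)
correct_tfidf = round(acc_tfidf * n_test)

# Binomial standard error of a single accuracy estimate.
std_error = math.sqrt(acc_bow * (1 - acc_bow) / n_test)

print(correct_bow, correct_tfidf, correct_tfidf - correct_bow)
print(round(std_error, 3))
```

The difference is 11 posts out of 300, a gap of a little over one standard error, so this single split is not strong evidence that TF-IDF features work better than raw counts here. Repeating the comparison over several random splits would give a more reliable picture.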
