# Text as Data using Python
By Shuhei Kitamura

### Outline
1. Introduction
2. Import data
3. Clean data
4. Vectorization
5. Sentiment Analysis

## 1. Introduction
- Nowadays, many kinds of text data are available online.
    - E.g., news articles, customer reviews, tweets, etc.
- Possible analyses of such data include:
    - Sentiment analysis (pos/neg), conflict/hate speech detection, etc.

- Let's evaluate the sentiment of user reviews in [IMDb (Internet Movie Database)](https://www.imdb.com/)!
    - The same method can be applied to other review data, e.g. Amazon customer reviews.
- In particular, we:
    1. Prepare data for analysis using techniques in natural language processing (NLP)
    2. Convert words to a matrix
    3. Conduct sentiment analysis using machine learning
- The following packages will be mainly used:
    - nltk
    - scikit-learn

In [None]:
import os
import re
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk; nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [3]:
os.chdir('...') # set the directory

## 2. Import data
- Let's import data.
    - ``full_train.txt``: train data.
    - ``full_test.txt``: test data
- We will train the machine using ``full_train.txt`` and then apply it to ``full_test.txt``.
- All data were taken from [this website](http://ai.stanford.edu/~amaas/data/sentiment/).

In [None]:
train = [] # import data as a list
with open('data/full_train.txt', encoding="utf-8") as file: 
    for line in file:
        train.append(line.strip())
        
test = [] # import data as a list
with open('data/full_test.txt', encoding="utf-8") as file: 
    for line in file:
        test.append(line.strip())
        
print(len(train), len(test))

- Print train data here.

## 3. Clean data
### - Remove unnecessary items
- The text contains many unnecessary items such as punctuation marks (., :, ;, !, ?), parentheses, etc.
- Let's remove them first.

In [None]:
def clean_texts(text):
    text = [re.compile(r"[*.;:!'?,\"()\[\]]").sub("", line.lower()) for line in text] # make all text lower case, then remove punctuation marks, parentheses, etc. (e.g., "isn't" becomes "isnt")
    text = [re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)").sub(" ", line) for line in text] # replace "<br /><br />", "-", and "/" by a single space
    text = [re.compile(r"[^\x00-\x7f]").sub("", line) for line in text] # remove all non-ASCII characters (\x96, \x97, etc.)
    return text

train_clean = clean_texts(train)
test_clean = clean_texts(test)
print(len(train_clean), len(test_clean))

- Print train data again.
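- To see what each substitution does, here is a small sketch that applies the same three patterns to a single made-up string. The names `REMOVE`, `TO_SPACE`, `NON_ASCII`, and the string `raw` are only for illustration.

```python
import re

# The same three substitutions as in clean_texts, applied to one made-up string.
REMOVE = re.compile(r"[*.;:!'?,\"()\[\]]")            # punctuation to delete
TO_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")  # separators turned into a space
NON_ASCII = re.compile(r"[^\x00-\x7f]")               # anything outside the ASCII range

raw = "It isn't bad!<br /><br />A well-made film (8/10)."
step1 = REMOVE.sub("", raw.lower())   # lower case, punctuation removed
step2 = TO_SPACE.sub(" ", step1)      # line breaks, hyphens, slashes -> space
cleaned = NON_ASCII.sub("", step2)    # non-ASCII characters removed (none here)
print(cleaned)  # it isnt bad a well made film 8 10
```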

### - Remove stop words
- The text also contains stop words (is, a, it, as, some, etc.). Let's remove them.
    - By doing so, you can focus more on "meaningful" words.

In [None]:
stop_words = stopwords.words('english')
print(stop_words)

In [8]:
def remove_stop_words(text):
    remain_words = []
    for item in text:
        remain_words.append(' '.join([word for word in item.split() if word not in stop_words]))
    return remain_words

train_clean_no_stop = remove_stop_words(train_clean)
test_clean_no_stop = remove_stop_words(test_clean)

- Print train data again.
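- A toy sketch of a side effect worth keeping in mind: if the stop list contains negations, two reviews with opposite meaning can collapse to the same text. The stop list below is a made-up assumption for illustration, not NLTK's list, and `drop_stop_words` is a hypothetical helper.

```python
# Toy stop list for illustration only; NLTK's English list is much longer.
toy_stop_words = {"i", "do", "not", "this", "is", "a"}

def drop_stop_words(review, stop_set):
    """Keep only the words of one review that are not in stop_set."""
    return " ".join(word for word in review.split() if word not in stop_set)

negative = drop_stop_words("i do not like this movie", toy_stop_words)
positive = drop_stop_words("i like this movie", toy_stop_words)
print(negative)  # like movie
print(positive)  # like movie
```

- Using a ``set`` for the stop words also makes each membership test faster than scanning a list.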

### - Normalization
- Finally, we normalize text. For example:
    - am, is, are -> be
    - play, plays, playing, played -> play
- There are two major ways to normalize text:
    - Stemming: Simply return stems without knowledge of the context (e.g. connections, connected, connecting, connection -> connect).
    - Lemmatization: Involves a vocabulary and morphological analysis of words (e.g. better -> good, meeting -> meeting or meet). Returns the base or dictionary form of a word (= the lemma).
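
- The difference between the two ideas can be sketched with toy code: a rule that strips suffixes versus a dictionary lookup. Both functions below are deliberately crude assumptions for illustration. `toy_stem` is not the Porter algorithm and `toy_lemmas` is a hypothetical four-entry dictionary, not WordNet.

```python
# Crude suffix stripping: NOT the Porter algorithm, only the idea behind stemming.
def toy_stem(word):
    for suffix in ("ions", "ion", "ing", "ed", "s"):
        # Strip the first matching suffix, but keep at least 3 characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Dictionary lookup: the idea behind lemmatization (hypothetical mini-dictionary).
toy_lemmas = {"better": "good", "are": "be", "is": "be", "played": "play"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)  # unknown words are returned unchanged

words = ["connections", "connected", "connecting", "better", "are"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['connect', 'connect', 'connect', 'better', 'are']
print(lemmas)  # ['connections', 'connected', 'connecting', 'good', 'be']
```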

#### Stemming
- There are several stemmers (see e.g. [this webpage](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)):
    - Porter stemmer
    - Lancaster stemmer
    - etc.
- We will use the Porter stemmer.

In [10]:
def get_stemmed_text(text):
    stemmer = PorterStemmer()
    stemmed_text = [' '.join([stemmer.stem(word) for word in item.split()]) for item in text]
    return stemmed_text

train_clean_stemmed = get_stemmed_text(train_clean)
test_clean_stemmed = get_stemmed_text(test_clean)

- Print train data.

#### Lemmatization
- We will use WordNet Lemmatizer. 
    - It uses the WordNet database to look up the lemmas of words.

In [12]:
def get_lemmatized_text(text):
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = [' '.join([lemmatizer.lemmatize(word) for word in item.split()]) for item in text]
    return lemmatized_text

train_clean_lemmatized = get_lemmatized_text(train_clean)
test_clean_lemmatized = get_lemmatized_text(test_clean)

- Print train data.

## 4. Vectorization
- Let's convert text to numeric values.... How?
- There are several methods:
    - Bag-of-Words (BOW)
    - Word Counts
    - Term Frequency-Inverse Document Frequency (TF-IDF)
    - N-grams
    - etc.

- Let's use the lemmatized version for now.

In [14]:
train_data = train_clean_lemmatized # or train_clean_stemmed
test_data = test_clean_lemmatized # or test_clean_stemmed

- Print the first review in the train data.

### Bag-of-Words (BOW)
- Make a review-by-word matrix whose entry is 1 if the word appears in the review, and 0 otherwise.
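- A hand-made sketch of such a matrix for three made-up reviews, in plain Python. It only illustrates the idea and is not how ``CountVectorizer`` is implemented.

```python
docs = ["good movie", "bad movie", "good good acting"]

# Vocabulary: every distinct word, in alphabetical order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per review, one column per word: 1 if the word occurs, else 0.
bow = [[1 if word in doc.split() else 0 for word in vocab] for doc in docs]

print(vocab)  # ['acting', 'bad', 'good', 'movie']
for row in bow:
    print(row)
```

- Note that the third review contains "good" twice, but its entry is still 1.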

In [None]:
vec = CountVectorizer(binary=True)
vec.fit(train_data) # set a vocabulary dictionary of all words in the train data
X = vec.transform(train_data) # transform train data to a review-by-word sparse matrix using the dictionary (entries are 0 or 1, mostly zeros)

- Let's check the size of the vocabulary and its content.

In [None]:
print('Vocabulary size: {}'.format(len(vec.vocabulary_)))
print('Vocabulary content: {}'.format(vec.vocabulary_))

- Let's check the content of the matrix.

In [None]:
df_X = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in recent scikit-learn versions
print(df_X)

- Delete ``df_X`` for now because it consumes a lot of memory.

In [None]:
del df_X
gc.collect()

### Word Counts
- The method is similar to BOW but takes into account the frequency of words.
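- The same three made-up reviews as a count matrix, sketched with ``collections.Counter``. Again this is only an illustration of the idea, not scikit-learn's implementation.

```python
from collections import Counter

docs = ["good movie", "bad movie", "good good acting"]
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per review: how many times each vocabulary word occurs in it.
counts = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)  # ['acting', 'bad', 'good', 'movie']
for row in counts:
    print(row)
```

- Only the third row differs from a 0/1 matrix: "good" now has the value 2.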

In [20]:
vec = CountVectorizer(binary=False)
vec.fit(train_data)
X = vec.transform(train_data)

- Check the content of the matrix.

In [None]:
df_X = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in recent scikit-learn versions
print(df_X['high'][df_X['high']!=0])

In [None]:
plt.hist(df_X['high'], bins=8, align='mid', range=(1,8), alpha=0.3, color='r')
plt.title('Frequency of "high" per review') 
plt.ylabel('reviews')
plt.xlabel('times')
plt.show() 

- Delete ``df_X`` for now because it consumes a lot of memory.

In [None]:
del df_X
gc.collect()

### Term Frequency-Inverse Document Frequency (TF-IDF)
- The method takes into account the frequency of words in a document and in all documents.
- The formula: TF * IDF, where
    - TF: # of word `X` in review i / # of all words in review i
    - IDF: log(# of reviews / # of reviews that contains word `X`)
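
- The formula can be worked through by hand on made-up reviews. The sketch below follows the textbook definition above and assumes the natural logarithm. ``TfidfVectorizer`` with default settings uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math

docs = ["good movie", "bad movie", "good good acting"]
tokenized = [doc.split() for doc in docs]

def tf(word, tokens):
    """Share of the review's words that are `word`."""
    return tokens.count(word) / len(tokens)

def idf(word, all_tokens):
    """log(number of reviews / number of reviews containing `word`)."""
    n_containing = sum(1 for tokens in all_tokens if word in tokens)
    return math.log(len(all_tokens) / n_containing)

def tf_idf(word, tokens, all_tokens):
    return tf(word, tokens) * idf(word, all_tokens)

# Third review: "good" is frequent there but also appears elsewhere,
# "acting" is rarer there but appears in no other review.
score_good = tf_idf("good", tokenized[2], tokenized)      # 2/3 * log(3/2)
score_acting = tf_idf("acting", tokenized[2], tokenized)  # 1/3 * log(3/1)
print(round(score_good, 3), round(score_acting, 3))
```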

In [24]:
vec = TfidfVectorizer()
vec.fit(train_data)
X = vec.transform(train_data)

- Check the content of the matrix.

In [None]:
df_X = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in recent scikit-learn versions
print(df_X['high'][df_X['high']!=0])

In [None]:
plt.hist(df_X['high'], bins=8, align='mid', range=(0.001,1), alpha=0.3, color='r')
plt.title('TF-IDF of "high" per review') 
plt.ylabel('reviews')
plt.xlabel('TF-IDF')
plt.show() 

- Delete ``df_X`` for now because it consumes a lot of memory.

In [None]:
del df_X
gc.collect()

### N-grams
- All three methods above consider only single words.
- An alternative approach is to consider multiple consecutive words.
- Let's collect all pairs of consecutive words. They are called bi-grams.
    - Increasing N does not necessarily improve accuracy. Also, computation takes longer for a larger N because the vocabulary grows.
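
- A minimal sketch of how N-grams are built from one made-up review. `ngrams` is a hypothetical helper for illustration, not part of scikit-learn.

```python
def ngrams(text, n):
    """All runs of n consecutive words in text, joined by a space."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "not a good movie"
unigrams = ngrams(review, 1)
bigrams = ngrams(review, 2)
print(unigrams)  # ['not', 'a', 'good', 'movie']
print(bigrams)   # ['not a', 'a good', 'good movie']
```

- With ``ngram_range=(1,2)``, ``CountVectorizer`` keeps both the single words and the bi-grams in its vocabulary.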

In [None]:
vec = CountVectorizer(binary=True, ngram_range=(1,2))
vec.fit(train_data)

- Let's print data.

In [None]:
print('Vocabulary size: {}'.format(len(vec.vocabulary_)))
print('bi-grams containing "good": {}'.format([key for key in vec.vocabulary_ if " " in key and "good" in key])) # the vocabulary also holds single words, so keep only keys with a space

## 5. Sentiment Analysis
- We train a machine such that it can predict the sentiment of movie reviews with high accuracy.
- There are several types of learning algorithms:
    - Supervised learning: Labeled train data are needed.
    - Unsupervised learning: Learn from unlabeled data.
    - Others (semi-supervised learning, reinforcement learning, etc.)
- We will use supervised learning.

- There are several models:
    - Linear models (e.g., OLS, logistic regression)
    - Support Vector Machines
    - Decision Trees
    - Ensemble methods (e.g., Random Forest)
    - Neural Networks
    - etc.

### Logistic regression
- Let's start with a simple model: logistic regression.
- There are many words in the data.
    - We want the model to rely mainly on the words that matter for prediction.
- Regularization penalizes large coefficients, so words that do not help prediction get weights close to zero.
- Let's find the best "c" (the ``C`` argument of ``LogisticRegression``), which is the inverse of regularization strength.
    - A smaller value means a stronger regularization.
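
- A rough sketch of why a smaller "c" means stronger regularization: plain gradient descent on the log-loss plus an L2 penalty scaled by 1/C, on made-up data with three word columns. This is a hand-written illustration without an intercept, not scikit-learn's solver. `fit_logistic`, the learning rate, the number of iterations, and the data are all assumptions.

```python
import math

def fit_logistic(X, y, C, lr=0.005, n_iter=5000):
    """Gradient descent on  sum(log-loss) + ||w||^2 / (2*C)  (no intercept)."""
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [wj / C for wj in w]  # gradient of the penalty term
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj    # gradient of the log-loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Columns: "great", "boring", "movie" (binary bag-of-words, made-up reviews).
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]

w_strong = fit_logistic(X, y, C=0.01)  # strong regularization
w_weak = fit_logistic(X, y, C=1.0)     # weak regularization
print([round(v, 3) for v in w_strong])
print([round(v, 3) for v in w_weak])
```

- With the smaller value, all weights are pulled much closer to zero. With the larger one, "great" gets a clearly positive weight and "boring" a clearly negative one.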

- Let's vectorize text using BOW.

In [30]:
vec = CountVectorizer(binary=True)
vec.fit(train_data) 
X = vec.transform(train_data) 
X_test = vec.transform(test_data)

- Next, randomly split the train data into two groups, fit the model on one group, and predict the other.
- Find the best "c" that yields the highest accuracy.
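
- What the split and the accuracy measure do can be sketched in plain Python. `holdout_split` and `accuracy` are hypothetical helpers written for illustration. They mimic the idea of ``train_test_split`` and ``accuracy_score`` but will not reproduce scikit-learn's exact shuffling.

```python
import random

def holdout_split(items, labels, train_share, seed):
    """Shuffle the indices with a fixed seed, then cut them into two groups."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_share)
    tr, te = idx[:cut], idx[cut:]
    return ([items[i] for i in tr], [items[i] for i in te],
            [labels[i] for i in tr], [labels[i] for i in te])

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

reviews = ["r%d" % i for i in range(8)]  # stand-ins for 8 reviews
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X_a, X_b, y_a, y_b = holdout_split(reviews, labels, 0.75, seed=12345)
print(len(X_a), len(X_b))                    # 6 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```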

In [None]:
target = [1 if i < 12500 else 0 for i in range(25000)] # labels: 1 for pos (first 12,500 reviews), 0 for neg (last 12,500 reviews)
X_tr, X_te, y_tr, y_te = train_test_split(X, target, train_size=0.75, random_state=12345) # split "X" and "target" into 2 groups; train_size is the share allocated to the train group

for c in [0.01, 0.05, 0.25, 0.5, 1]: 
    lr = LogisticRegression(C=c)
    lr.fit(X_tr, y_tr)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_te, lr.predict(X_te))))

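- It seems that the best "c" is 0.05.
    - A small value means that fairly strong regularization works best, which suggests that a limited set of words carries most of the predictive signal.
- Let's train on the entire train data and check the accuracy on the test data.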

In [None]:
lr = LogisticRegression(C=0.05)
lr.fit(X, target)
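print("Final Accuracy: %s" % accuracy_score(target, lr.predict(X_test))) # "target" is reused for the test data, which assumes the test file has the same layout (pos first, then neg)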

- Let's compare the correct answer and the machine's prediction for a specific review.

In [33]:
out = pd.DataFrame({'text':test_data, 'pred':lr.predict(X_test), 'orig':target})

In [None]:
row = 2 # choose a specific review
print(out.loc[row,'text']); print('predicted =', out.loc[row,'pred']); print('correct =', out.loc[row,'orig'])

- Let's play with the data by changing reviews.

In [38]:
test_clean_mod = list(test_clean) # copy the list so that test_clean itself is not modified
test_clean_mod[0] = 'i do not like this movie'
test_clean_mod[1] = 'i love this movie'
X_test2 = vec.transform(test_clean_mod)
out2 = pd.DataFrame({'text':test_clean_mod, 'pred':lr.predict(X_test2)})

In [None]:
print(out2.loc[0,'text']); print('predicted =', out2.loc[0,'pred'])
print(out2.loc[1,'text']); print('predicted =', out2.loc[1,'pred'])

### Support Vector Machines
- Linear Support Vector Machines find the "maximum-margin" hyperplane that separates data points.
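
- The notion of a margin can be illustrated with made-up 2-D points and two lines that both separate them. `margin` is a hypothetical helper, and neither line is claimed to be the optimal one. The sketch only shows what "wider margin" means.

```python
import math

# Toy 2-D points with labels +1 / -1 (made up for illustration).
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

def margin(w, b, data):
    """Smallest signed distance from any point to the line w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, label in data)

margin_a = margin((1.0, 1.0), -2.5, points)  # line x1 + x2 = 2.5
margin_b = margin((1.0, 0.0), -1.5, points)  # line x1 = 1.5
print(round(margin_a, 3), round(margin_b, 3))
```

- Both margins are positive, so both lines classify every point correctly, but the first line leaves more room on each side. A maximum-margin method prefers such a line.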

In [None]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:    
    linear = LinearSVC(C=c)
    linear.fit(X_tr, y_tr)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_te, linear.predict(X_te))))

- It seems that the best "c" is 0.01.

In [None]:
linear = LinearSVC(C=0.01)
linear.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, linear.predict(X_test)))

### Decision Trees
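- A decision tree splits the train data step by step on input features (here, whether a word appears in a review) so that each resulting group is as pure in its labels as possible.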
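- How a single split might be chosen can be sketched with the Gini impurity, which is the default criterion of ``DecisionTreeClassifier``. The helpers `gini` and `split_impurity` and the data are made up for illustration. A real tree repeats this search at every node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, labels, col):
    """Weighted Gini impurity after splitting on whether feature `col` is 1."""
    left = [y for x, y in zip(rows, labels) if x[col] == 0]
    right = [y for x, y in zip(rows, labels) if x[col] == 1]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Columns: "great", "movie" (binary bag-of-words, made-up reviews).
rows = [[1, 1], [1, 0], [0, 1], [0, 1]]
labels = [1, 1, 0, 0]

scores = {col: split_impurity(rows, labels, col) for col in (0, 1)}
best_col = min(scores, key=scores.get)  # the split with the lowest impurity
print(scores, best_col)
```

- Splitting on "great" separates the labels perfectly here, so it is chosen over "movie".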

In [None]:
dt = DecisionTreeClassifier(random_state=12345, max_depth=10) 
dt.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, dt.predict(X_test)))

### Random Forests
- Each tree is built from a randomly drawn sample from train data with replacement.
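- Drawing with replacement and combining the trees can be sketched as follows. The review ids and the three votes are made up, and `majority_vote` is a hypothetical helper. scikit-learn's random forest averages the trees' predicted probabilities rather than counting hard votes, so this is only the basic idea.

```python
import random
from collections import Counter

rng = random.Random(12345)
train_ids = list(range(10))  # stand-ins for 10 reviews

# Each tree sees a sample of the same size, drawn with replacement,
# so some reviews appear several times and others not at all.
bootstrap_samples = [rng.choices(train_ids, k=len(train_ids)) for _ in range(3)]
for sample in bootstrap_samples:
    print(sorted(sample))

def majority_vote(votes):
    """Return the label predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]

forest_prediction = majority_vote([1, 0, 1])  # made-up votes of three trees
print(forest_prediction)  # 1
```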

In [None]:
rf = RandomForestClassifier(random_state=12345, max_depth=10)
rf.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, rf.predict(X_test)))

### Neural Networks
- Neural Networks (here, a Multi-layer Perceptron) are loosely modeled on biological neurons.
- A network has hidden layers between the input and output layers.
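
- A forward pass through one hidden layer can be written out by hand. The weights below are made up rather than learned, and the layer sizes are an assumption chosen to keep the example small. ReLU in the hidden layer and a logistic output match the defaults of ``MLPClassifier`` for a two-class problem.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def dense(x, W, b):
    """One fully connected layer: W has one row of weights per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up weights: 3 inputs ("great", "boring", "movie") -> 2 hidden units -> 1 output.
W1, b1 = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0]], [0.0, 0.0]
W2, b2 = [[2.0, -2.0]], [0.0]

def predict_proba(x):
    hidden = relu(dense(x, W1, b1))           # input layer -> hidden layer
    return sigmoid(dense(hidden, W2, b2)[0])  # hidden layer -> output probability

p_pos = predict_proba([1, 0, 1])  # review containing "great" and "movie"
p_neg = predict_proba([0, 1, 1])  # review containing "boring" and "movie"
print(round(p_pos, 3), round(p_neg, 3))
```

- Training consists of adjusting ``W1``, ``b1``, ``W2``, and ``b2`` so that these probabilities match the labels.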

In [None]:
clf = MLPClassifier(solver='sgd', random_state=12345, max_iter=5)
clf.fit(X, target)
print("Final Accuracy: %s" % accuracy_score(target, clf.predict(X_test)))