# Text as Data using Python
By Shuhei Kitamura

- Nowadays, many kinds of text data are available online.
    - E.g., news articles, customer reviews, tweets, etc.
- Possible analysis using such data includes:
    - Sentiment analysis (pos/neg), conflict/hate speech detection, etc.

- Let's evaluate the sentiment (pos/neg) of user reviews in [IMDb (Internet Movie Database)](https://www.imdb.com/)!
    - You can apply the same method to analyze the sentiment of, e.g., Amazon customer reviews.
- In particular, we:
    1. Prepare data for analysis using techniques in natural language processing (NLP)
    2. Convert texts to a matrix
    3. Conduct a sentiment analysis using machine learning
- The following packages will be mainly used:
    - nltk
    - scikit-learn

### Outline<a id='top'></a>
1. [Importing data](#sec1)
2. [Cleaning data](#sec2)
    1. [Removing unnecessary items](#sec2_1)
    2. [Normalization](#sec2_2)
    3. [Removing stop words](#sec2_3)    
3. [Vectorization](#sec3)
    1. [Bag-of-Words](#sec3_1)
    2. [Word Counts](#sec3_2)
    3. [Term Frequency-Inverse Document Frequency (TF-IDF)](#sec3_3)
    4. [N-grams](#sec3_4)
4. [Sentiment Analysis](#sec4)
    1. [Logistic Regressions](#sec4_1)
    2. [Support Vector Machines](#sec4_2)
    3. [Decision Trees](#sec4_3)
    4. [Random Forests](#sec4_4)
    5. [Neural Networks](#sec4_5)

In [None]:
# import packages and modules
import os
import re
import gc
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk; nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In [None]:
# os.chdir('...') # set the directory if necessary

## 1. Importing data<a id='sec1'></a>
- Let's import data as lists.
    - ``full_train.txt``: train data
    - ``full_test.txt``: test data
- We will train the machine using ``full_train.txt`` and then apply it to ``full_test.txt``.
- All data were taken from [this website](http://ai.stanford.edu/~amaas/data/sentiment/).
    
[back to top](#top)

In [None]:
train = [] # import data as a list
with open('full_train.txt', encoding="utf-8") as file: 
    for line in file:
        train.append(line.strip())
        
test = [] # import data as a list
with open('full_test.txt', encoding="utf-8") as file: 
    for line in file:
        test.append(line.strip())
        
print(len(train), len(test))

- Print the first item in `train`.

## 2. Cleaning data<a id='sec2'></a>
    
[back to top](#top)

### A. Removing unnecessary items<a id='sec2_1'></a>
- The text contains many unnecessary items such as punctuation marks (., :, ;, !, ?), parentheses, etc.
- We will remove them first.
    
[back to top](#top)

In [None]:
def clean_texts(text): # define a function 
    t = [re.sub(r"[*.;:!'?,\"()\[\]]", "", line.lower()) for line in text] # lowercase all text, then remove punctuation marks, parentheses, etc. (e.g., "isn't" becomes "isnt")
    t = [re.sub(r"(<br\s*/><br\s*/>)|(-)|(/)", " ", line) for line in t] # replace "<br /><br />", "-", and "/" with a space
    t = [re.sub(r"[^\x00-\x7f]", "", line) for line in t] # remove all non-ASCII characters such as \x96 and \x97. see, e.g., https://www.codetable.net/asciikeycodes
    return t

train_clean = clean_texts(train) # apply the function
test_clean = clean_texts(test)

print(len(train_clean), len(test_clean))

- Print the first item in `train_clean`.
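
- To see what each step of the cleaning does, here is a minimal sketch that applies the same three patterns to one string. The sample review is hypothetical (it is not taken from the IMDb data), and the variable names `step1` to `step3` are only for illustration.

```python
import re

# Hypothetical sample review, not taken from the IMDb data
sample = "It isn't bad!<br /><br />A must-see (really)."

step1 = re.sub(r"[*.;:!'?,\"()\[\]]", "", sample.lower())  # lowercase, then drop punctuation
step2 = re.sub(r"(<br\s*/><br\s*/>)|(-)|(/)", " ", step1)   # line breaks, "-", "/" -> space
step3 = re.sub(r"[^\x00-\x7f]", "", step2)                  # drop non-ASCII characters

print(step3)  # it isnt bad a must see really
```

- Note that "must-see" is split into two words, while "isn't" is merged into "isnt".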

### B. Normalization<a id='sec2_2'></a>
- Next, we normalize text. For example:
    - am, is, are -> be
    - play, plays, playing, played -> play
- There are two major ways to normalize text:
    - Stemming: Simply return stems without knowledge of the context (e.g. connections, connected, connecting, connection -> connect).
    - Lemmatization: Involves a vocabulary and morphological analysis of words (e.g. better -> good, meeting -> meeting or meet). Returns the base or dictionary form of a word (= the lemma).
        
[back to top](#top)
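
- The difference between the two approaches can be sketched with toy versions of both. This is only an illustration: `toy_stem` strips a few hand-picked suffixes and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hand-written dictionary instead of WordNet. All names here are hypothetical.

```python
# Toy stemmer: strip a few suffixes by rule, without any knowledge of the word
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # keep at least 3 characters so that short words like "is" are untouched
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look the word up in a (tiny, hand-written) dictionary
TOY_LEMMAS = {"better": "good", "am": "be", "is": "be", "are": "be", "played": "play"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

print([toy_stem(w) for w in ["connections", "connected", "connecting", "connection"]])
# ['connect', 'connect', 'connect', 'connect']
print(toy_stem("better"), toy_lemmatize("better"))  # better good
```

- A rule-based stemmer cannot map "better" to "good" because no suffix rule connects them. A lemmatizer can, because it relies on a vocabulary.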

#### Stemming
- There are several types of stemmers (see e.g. [this webpage](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)):
    - Porter stemmer
    - Lancaster stemmer
    - etc.
- We will use the Porter stemmer.
- Run the following code. This may take a few minutes.

In [None]:
def get_stemmed_text(text): 
    stemmer = PorterStemmer()
    stemmed_text = [' '.join([stemmer.stem(word) for word in line.split()]) for line in text]
    return stemmed_text

train_clean_stemmed = get_stemmed_text(train_clean)
test_clean_stemmed = get_stemmed_text(test_clean)

- Print the first item in `train_clean_stemmed`.

#### Lemmatization
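- We will use the WordNet Lemmatizer.
    - It uses the [WordNet Database](https://wordnet.princeton.edu/) to look up lemmas of words.
    - Without a part-of-speech tag, `lemmatize` treats every word as a noun, so e.g. "better" is left unchanged.
- Run the following code. This process also takes time...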

In [None]:
def get_lemmatized_text(text):
    lemmatizer = WordNetLemmatizer()
    lemmatized_text = [' '.join([lemmatizer.lemmatize(word) for word in line.split()]) for line in text]
    return lemmatized_text

train_clean_lemmatized = get_lemmatized_text(train_clean)
test_clean_lemmatized = get_lemmatized_text(test_clean)

- Print the first item in `train_clean_lemmatized`.

### C. Removing stop words<a id='sec2_3'></a>
- The text still contains stop words (is, a, it, as, some, etc.). Let's remove them.
    - By doing so, you can focus more on "meaningful" words.
        
[back to top](#top)

In [None]:
stop_words = stopwords.words('english') # define stop words
print(stop_words)

In [None]:
def remove_stop_words(text): # make a function
    stop_set = set(stop_words) # a set makes membership tests fast
    remain_words = []
    for line in text:
        remain_words.append(' '.join([word for word in line.split() if word not in stop_set])) # keep non-stop words and join them with ' ' in between
    return remain_words

train_clean_no_stop = remove_stop_words(train_clean_lemmatized) # apply the function  # alternatively, use train_clean_stemmed
test_clean_no_stop = remove_stop_words(test_clean_lemmatized) # alternatively, use test_clean_stemmed

print(len(train_clean_no_stop), len(test_clean_no_stop))

- Print the first item in `train_clean_no_stop`.

## 3. Vectorization<a id='sec3'></a>
- Let's convert text to numeric values.... How?
- There are several methods:
    - Bag-of-Words (BOW)
    - Word Counts
    - Term Frequency-Inverse Document Frequency (TF-IDF)
    - N-grams
    - etc.
        
[back to top](#top)

- Let's use the lemmatized version for now.

In [None]:
train_data = train_clean_no_stop
test_data = test_clean_no_stop

### A. Bag-of-Words (BOW)<a id='sec3_1'></a>
- Make a review-by-word matrix whose entries are 1 if a word appears in a review, and 0 otherwise.
        
[back to top](#top)
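
- Before using `CountVectorizer`, the idea can be sketched in plain Python. The two toy reviews below are hypothetical, and this sketch only splits on spaces, so it skips the tokenization that `CountVectorizer` performs.

```python
reviews = ["good movie good plot", "bad movie"]  # hypothetical toy reviews

# Vocabulary: all distinct words, sorted so that the column order is fixed
vocabulary = sorted({word for review in reviews for word in review.split()})

# One row per review, one column per word: 1 if the word appears, 0 otherwise
bow = [[1 if word in review.split() else 0 for word in vocabulary] for review in reviews]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for row in bow:
    print(row)
# [0, 1, 1, 1]
# [1, 0, 1, 0]
```

- "good" appears twice in the first review, but its entry is still 1.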

In [None]:
vec = CountVectorizer(binary=True)
vec.fit(train_data) # learn a vocabulary dictionary of all words in the train data
X = vec.transform(train_data) # transform train data to a review x word sparse matrix using the dictionary (takes either 0 or 1 with many zeros)

- Let's check the vocabulary size and the vocabulary itself.

In [None]:
print('Vocabulary size: {}'.format(len(vec.vocabulary_)))
print('Vocabulary content: {}'.format(vec.vocabulary_))

- Let's check the content of the matrix.
    - If you get an error, that's fine. It means that your computer does not have enough memory to hold the dense matrix.

In [None]:
df_X = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
print(df_X)

- Delete ``df_X`` for now because it consumes a lot of memory.

In [None]:
del df_X
gc.collect()

### B. Word Counts<a id='sec3_2'></a>
- The method is similar to BOW but takes into account the frequency of words.
        
[back to top](#top)
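
- A minimal sketch of the same idea in plain Python, using hypothetical toy reviews. Here `collections.Counter` stands in for `CountVectorizer(binary=False)`.

```python
from collections import Counter

reviews = ["good movie good plot", "bad movie"]  # hypothetical toy reviews
vocabulary = sorted({word for review in reviews for word in review.split()})

# One row per review: how many times each vocabulary word appears in it
counts = []
for review in reviews:
    word_count = Counter(review.split())
    counts.append([word_count[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for row in counts:
    print(row)
# [0, 2, 1, 1]
# [1, 0, 1, 0]
```

- Compared with BOW, the entry for "good" in the first review is now 2 instead of 1.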

In [None]:
vec = CountVectorizer(binary=False)
vec.fit(train_data)
X = vec.transform(train_data)

- Check the content of the matrix.

In [None]:
df_X = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
print(df_X['high'][df_X['high']>3]) # print the frequency of 'high' in reviews that contain more than 3 'high's

In [None]:
plt.hist(df_X['high'], bins=8, align='mid', range=(1,8), alpha=0.3, color='r') # plot a histogram of the frequency of 'high' per review
plt.title('Frequency of "high" per review') 
plt.ylabel('reviews')
plt.xlabel('times')
plt.show() 

- Delete ``df_X`` for now because it consumes a lot of memory.

In [None]:
del df_X
gc.collect()

### C. Term Frequency-Inverse Document Frequency (TF-IDF)<a id='sec3_3'></a>
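- The method takes into account the frequency of a word within a review and across all reviews.
- A higher value means that the word is frequent in the review but rare in the other reviews.
- The formula: TF * IDF, where
    - TF: # of word `X` in review i / # of all words in review i
    - IDF: log(# of reviews / # of reviews that contain word `X`)
- By default, `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and rescales each row to unit length, so its values differ from the formula above.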

[back to top](#top)
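
- The formula can be worked through by hand on a few toy reviews. This sketch follows the textbook formula given in this section, not the exact variant implemented in `TfidfVectorizer`. The reviews and function names are hypothetical.

```python
import math

reviews = ["good movie good plot", "bad movie", "good acting"]  # hypothetical toy reviews

def tf(word, review):
    words = review.split()
    return words.count(word) / len(words)  # share of the review taken up by the word

def idf(word, all_reviews):
    n_containing = sum(word in r.split() for r in all_reviews)
    return math.log(len(all_reviews) / n_containing)  # assumes the word appears at least once

def tf_idf(word, review, all_reviews):
    return tf(word, review) * idf(word, all_reviews)

for word in ["good", "movie", "plot"]:
    print(word, round(tf_idf(word, reviews[0], reviews), 4))
# good 0.2027
# movie 0.1014
# plot 0.2747
```

- "plot" appears only once in the first review but in no other review, so its TF-IDF is higher than that of "movie", which appears in two of the three reviews.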

In [None]:
vec = TfidfVectorizer()
vec.fit(train_data)
X = vec.transform(train_data)

- Check the content of the matrix.

In [None]:
df_X = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
print(df_X['high'][df_X['high']>0.2]) # print the TF-IDF of 'high' in reviews where it is larger than 0.2

In [None]:
plt.hist(df_X['high'], bins=8, align='mid', range=(0.001,1), alpha=0.3, color='r') # plot a histogram of the TF-IDF of 'high' per review
plt.title('TF-IDF of "high" per review') 
plt.ylabel('reviews')
plt.xlabel('TF-IDF')
plt.show() 

- Delete ``df_X`` for now because it consumes a lot of memory.

In [None]:
del df_X
gc.collect()

### D. N-grams<a id='sec3_4'></a>
- All three methods above consider only single words.
- An alternative approach is to consider multiple consecutive words.
- Let's collect all pairs of consecutive words, which are called bi-grams.
    - Increasing N does not necessarily improve accuracy. Also, computation takes a longer time for a larger N.
    
[back to top](#top)
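
- A minimal sketch of how N-grams are formed from a text. The helper `ngrams` is hypothetical and only splits on spaces.

```python
def ngrams(text, n):
    """Return all runs of n consecutive words in text."""
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("not a good movie", 1))  # ['not', 'a', 'good', 'movie']
print(ngrams("not a good movie", 2))  # ['not a', 'a good', 'good movie']
```

- Bi-grams keep some word order, which single words cannot capture.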

In [None]:
vec = CountVectorizer(binary=True, ngram_range=(1,2)) # ngram_range=(1,2) keeps both single words and bi-grams
vec.fit(train_data)

- Let's print the vocabulary size and some of the bi-grams.

In [None]:
print('Vocabulary size: {}'.format(len(vec.vocabulary_)))
print('bi-grams containing "good": {}'.format([key for key, value in vec.vocabulary_.items() if "good" in key and " " in key])) # " " in key keeps bi-grams only, since the vocabulary also contains single words

## 4. Sentiment Analysis<a id='sec4'></a>
- We train a machine so that it can predict the sentiment (pos/neg) of movie reviews with high accuracy.
- There are several types of learning algorithms:
    - Supervised learning: Labeled train data are needed.
    - Unsupervised learning: Learn from unlabeled data.
    - Others (semi-supervised learning, reinforcement learning, etc.)
- We will use supervised learning.
    
[back to top](#top)

- There are several models:
    - Linear models (e.g., OLS, logistic regression)
    - Support Vector Machines
    - Decision Trees
    - Ensemble methods (e.g., Random Forest)
    - Neural Networks
    - etc.
- [Cheat sheets](https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463) are also available.

### A. Logistic Regressions<a id='sec4_1'></a>
- Let's start by using a simple model: Logistic regression.
- There are so many words in the data. 
- We want to consider only a few words that are important for prediction.
- Using **regularization**, we penalize large coefficients, so words that are not important for prediction receive weights close to zero.
- Let's find the best "c", which is the inverse of the regularization strength.
    - A smaller value means a stronger regularization.
    
[back to top](#top)
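
- The effect of "c" can be illustrated on synthetic data. This is a minimal sketch, assuming randomly generated features where only the first column matters. The data and the names `X_toy`, `y_toy`, and `norms` are hypothetical and unrelated to the IMDb reviews.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 5 features, but only the first one drives the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

norms = {}
for c in [0.01, 1, 100]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_toy, y_toy)
    norms[c] = float(np.linalg.norm(model.coef_))  # overall size of the coefficients
    print("C=%s: coefficient norm = %.3f" % (c, norms[c]))
```

- The smaller the "c", the more the coefficients are shrunk toward zero.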

- First, vectorize text using BOW.

In [None]:
vec = CountVectorizer(binary=True)
vec.fit(train_data) 
X = vec.transform(train_data) # a 25000 x 84063 scipy sparse matrix
X_test = vec.transform(test_data)

- Next, randomly split train data into two groups and predict one from the other.
    - In the below example, we use `X_train_sub` and `y_train_sub` to train the machine.
    - Then predict `y_pred_sub` by using `X_test_sub`.
    - Finally, compare `y_pred_sub` (predicted values) with `y_test_sub` (true values).
- Find the best "c" that yields the highest accuracy.
    - You may get a warning message but that's fine for now.
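
- The split and the accuracy measure can be sketched in plain Python. This is only an illustration with hypothetical toy labels. `train_test_split` and `accuracy_score` do the same job on the real data.

```python
import random

labels = [1] * 6 + [0] * 6  # hypothetical labels: 6 pos, then 6 neg
indices = list(range(len(labels)))
random.Random(12345).shuffle(indices)  # shuffle so that both groups mix pos and neg

n_train = int(0.75 * len(indices))  # 75% of the observations go to the train group
train_idx, valid_idx = indices[:n_train], indices[n_train:]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true values."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(len(train_idx), len(valid_idx))          # 9 3
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))    # 0.75
```

- Shuffling matters here because the labels are ordered (all pos first), so an unshuffled split would put only neg reviews in the second group.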

In [None]:
y = [1 if i < 12500 else 0 for i in range(25000)] # make a list, which takes 1 for pos (first 12500 obs) and 0 for neg (next 12500 obs)
X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(X, y, train_size=0.75, random_state=12345) # split X and y into 2 groups, train_size means how many you want to allocate to the train group

for c in [0.01, 0.05, 0.25, 0.5, 1]: 
    lr = LogisticRegression(C=c)
    lr.fit(X_train_sub, y_train_sub) # train a machine
    y_pred = lr.predict(X_test_sub) # predict values using X_test_sub
    print ("Accuracy for C=%s: %s" % (c, accuracy_score(y_test_sub, y_pred))) # compare true values (y_test_sub) with predicted values (y_pred)

- It seems that the best "c" is 0.05.
    - A small "c" means strong regularization, which suggests that a relatively small set of words carries most of the predictive signal.
- Let's train the machine on the entire train data and check its accuracy on the test data.

In [None]:
lr = LogisticRegression(C=0.05)
lr.fit(X, y)
y_pred = lr.predict(X_test)
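print ("Accuracy is: %s" % accuracy_score(y, y_pred)) # y is reused as the test labels, which assumes full_test.txt has the same layout as the train data (first 12500 pos, next 12500 neg)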

- Let's compare the correct answer and the machine's prediction for a specific review.

In [None]:
out = pd.DataFrame({'text':test_data, 'pred':y_pred, 'orig':y}) # make a dataframe
row = 2 # choose the third review
print(out.loc[row, 'text']); print('predicted =', out.loc[row, 'pred']); print('correct =', out.loc[row, 'orig'])

- Let's play with the data by manually changing reviews.

In [None]:
test_clean_mod = list(test_clean) # copy the list so that test_clean itself is not modified
test_clean_mod[0] = 'i do not like this movie' # change the first review
test_clean_mod[1] = 'i love this movie' # change the second review
X_test_mod = vec.transform(test_clean_mod) # make a sparse matrix
y_mod_pred = lr.predict(X_test_mod) # predict values
out_mod = pd.DataFrame({'text':test_clean_mod, 'pred':y_mod_pred})

In [None]:
print(out_mod.loc[0, 'text']); print('predicted =', out_mod.loc[0, 'pred'])
print(out_mod.loc[1, 'text']); print('predicted =', out_mod.loc[1, 'pred'])

### B. Support Vector Machines<a id='sec4_2'></a>
- Linear Support Vector Machines find the maximum-margin hyperplane that separates data points ([web](https://scikit-learn.org/stable/modules/svm.html)).
    
[back to top](#top)
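
- What "maximum margin" means can be sketched with a few 2-D points. This is only an illustration: the points and the two candidate hyperplanes are hypothetical, and the code compares margins rather than solving the optimization that `LinearSVC` performs.

```python
import math

# Hypothetical 2-D points with labels +1 / -1
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

def margin(w, b, data):
    """Smallest signed distance from the data points to the hyperplane w.x + b = 0."""
    norm = math.hypot(*w)
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, label in data)

margin_a = margin((1.0, 1.0), -2.0, points)  # hyperplane x1 + x2 = 2
margin_b = margin((1.0, 0.0), -1.0, points)  # hyperplane x1 = 1

print(round(margin_a, 3), round(margin_b, 3))  # 1.414 1.0
```

- Both hyperplanes separate the two groups, but the first one leaves a wider gap to the closest points, so a maximum-margin classifier prefers it.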

In [None]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:    
    linear = LinearSVC(C=c)
    linear.fit(X_train_sub, y_train_sub)
    y_pred = linear.predict(X_test_sub)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(y_test_sub, y_pred)))

- It seems that the best "c" is 0.01.

In [None]:
linear = LinearSVC(C=0.01)
linear.fit(X, y)
y_pred = linear.predict(X_test)
print("Accuracy is: %s" % accuracy_score(y, y_pred))

### C. Decision Trees<a id='sec4_3'></a>
- The machine learns simple decision rules inferred from the train data ([web](https://scikit-learn.org/stable/modules/tree.html#tree)).
    
[back to top](#top)

In [None]:
dt = DecisionTreeClassifier(random_state=12345, max_depth=10) 
dt.fit(X, y)
y_pred = dt.predict(X_test)
print("Accuracy is: %s" % accuracy_score(y, y_pred))

### D. Random Forests<a id='sec4_4'></a>
- Each tree is built from a randomly drawn sample from train data with replacement ([web](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)).
    
[back to top](#top)
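
- Drawing a sample "with replacement" can be sketched as follows. The row indices and tree votes are hypothetical. Combining trees by majority vote is a simplifying assumption here, since scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities instead.

```python
import random
from collections import Counter

rng = random.Random(12345)
train_rows = list(range(10))  # hypothetical row indices of the train data

# One bootstrap sample per tree: same size as the data, rows may repeat
bootstrap_samples = [rng.choices(train_rows, k=len(train_rows)) for _ in range(3)]
for sample in bootstrap_samples:
    print(sorted(sample))

# Hypothetical predictions of the 3 trees for one review, combined by majority vote
tree_votes = [1, 0, 1]
forest_prediction = Counter(tree_votes).most_common(1)[0][0]
print(forest_prediction)  # 1
```

- Because each tree sees a slightly different sample, the trees make different errors, and combining them tends to be more stable than a single tree.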

In [None]:
rf = RandomForestClassifier(random_state=12345, max_depth=10)
rf.fit(X, y)
y_pred = rf.predict(X_test)
print("Accuracy is: %s" % accuracy_score(y, y_pred))

### E. Neural Networks<a id='sec4_5'></a>
- Neural Networks (or Multi-layer Perceptron) mimic neurons ([web](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)).
- A network has hidden layers between the input and output layers. Each neuron in a hidden layer transforms the values from the previous layer with a weighted linear summation, followed by an activation function that produces its output. The weights are updated at each iteration.
    
[back to top](#top)
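
- What a single neuron computes can be sketched in a few lines. The inputs, weights, and bias below are hypothetical. ReLU is used as the activation function because it is the default in `MLPClassifier`.

```python
def relu(z):
    """Activation function: pass positive values through, set negative ones to 0."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted linear summation
    return relu(z)

# Hypothetical BOW inputs for three words (1 = the word appears in the review)
active = neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], 0.1)
inactive = neuron([0.0, 1.0, 0.0], [0.5, -0.3, 0.2], 0.1)
print(active, inactive)
```

- The first call outputs a value of about 0.8 (0.5 + 0.2 + 0.1), while the second outputs 0 because the weighted sum (-0.3 + 0.1) is negative.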

In [None]:
clf = MLPClassifier(solver='sgd', random_state=12345, max_iter=5) # default max_iter is 200 but it takes time to run the code
clf.fit(X, y)
y_pred = clf.predict(X_test)
print("Accuracy is: %s" % accuracy_score(y, y_pred))