# Natural Language Processing

In this homework, you will apply the TF-IDF technique to text classification and use a word2vec model to generate dense word embeddings for other NLP tasks.

## Text Classification
The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In this lab, we will experiment with different feature extraction methods on the 20 Newsgroups dataset, including the count vector and the TF-IDF vector. We will also apply the Naive Bayes classifier to this dataset and report the prediction accuracy.
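
To make the difference between the two feature types concrete, here is a minimal sketch on a made-up three-document corpus (the `docs` list is hypothetical and not part of the assignment data). It assumes sklearn's default settings, where the IDF weight is smoothed as `ln((1 + n) / (1 + df)) + 1` and each TF-IDF row is L2-normalised.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the cubs won the game",
    "the braves lost the game",
    "new graphics card for sale",
]

# Count vector: one row per document, one column per vocabulary term
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF vector: counts reweighted by inverse document frequency,
# then each row is scaled to unit L2 norm (sklearn defaults)
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

# Column indices of a common word and a rarer word
the_idx = count_vec.vocabulary_["the"]
cubs_idx = count_vec.vocabulary_["cubs"]

print("raw counts in doc 0:  the =", counts[0, the_idx], " cubs =", counts[0, cubs_idx])
print("tf-idf in doc 0:      the =", tfidf[0, the_idx], " cubs =", tfidf[0, cubs_idx])

# "the" occurs in 2 of 3 documents, "cubs" in only 1, so "cubs" gets the larger IDF
print("idf(the) =", tfidf_vec.idf_[the_idx], " idf(cubs) =", tfidf_vec.idf_[cubs_idx])
```

In the first document "the" is counted twice and "cubs" once, but after IDF weighting the gap between them narrows because "the" appears in more documents.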

### Load and explore the 20 Newsgroups data

The 20 Newsgroups data is available through sklearn's dataset loaders. We can load it directly using the following commands.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import string
import re
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cProfile
import argparse
import pprint
import seaborn as sns

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer  # used by the 'stem_words' step in preprocess()
from scipy.sparse import coo_matrix
from scipy.sparse import csr_matrix

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

In [2]:
# load the training data and test data
import numpy as np
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=False)
twenty_test = fetch_20newsgroups(subset='test', shuffle=False)

# print the number of training documents and the number of categories
print("Number of training data:" + str(len(twenty_train.data)))
print("Number of categories:" + str(len(twenty_train.target_names)))

# print the first text and its category
print(twenty_train.data[0])
print(twenty_train.target[0])

# You can check the target variable by printing all the categories
twenty_train.target_names

Number of training data:11314
Number of categories:20
From: cubbie@garnet.berkeley.edu (                               )
Subject: Re: Cubs behind Marlins? How?
Article-I.D.: agate.1pt592$f9a
Organization: University of California, Berkeley
Lines: 12
NNTP-Posting-Host: garnet.berkeley.edu


gajarsky@pilot.njin.net writes:

morgan and guzman will have era's 1 run higher than last year, and
 the cubs will be idiots and not pitch harkey as much as hibbard.
 castillo won't be good (i think he's a stud pitcher)

       This season so far, Morgan and Guzman helped to lead the Cubs
       at top in ERA, even better than THE rotation at Atlanta.
       Cubs ERA at 0.056 while Braves at 0.059. We know it is early
       in the season, we Cubs fans have learned how to enjoy the
       short triumph while it is still there.

9


['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Build a Naive Bayes Model 

Your task is to build an ML model that classifies the newsgroup data into different categories. You will try both raw counts and TF-IDF for feature extraction, each followed by a Naive Bayes classifier. Note that you can combine the feature generation and model training steps into one by using the [pipeline API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) in sklearn.

Try to use grid search to find the best hyperparameters from the following settings (feel free to explore other options as well):

* Different n-gram ranges
* Whether or not to remove stop words
* Whether or not to apply IDF

After building the best model from the training set, we apply that model to make predictions on the test data and report its accuracy.
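
One possible way to search over the three settings listed above is sketched below. This is not the notebook's own solution: the step names `vect`, `tfidf`, and `clf` are arbitrary labels chosen here, and the tiny `texts`/`labels` corpus is hypothetical so that the sketch runs in seconds. For the real experiment you would pass `twenty_train.data` and `twenty_train.target` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-class toy corpus (0 = baseball, 1 = space)
texts = [
    "the pitcher threw a fast ball", "home run in the ninth inning",
    "the team won the baseball game", "a great catch by the outfielder",
    "the batter hit the ball hard", "pitcher and catcher discuss the game",
    "the rocket reached low orbit", "nasa launched a new satellite",
    "the shuttle docked with the station", "astronauts repaired the telescope in orbit",
    "the probe sent images from mars", "a satellite orbit decays over time",
]
labels = [0] * 6 + [1] * 6

# Feature extraction and classifier chained into a single estimator
pipe = Pipeline([
    ("vect", CountVectorizer()),     # raw counts
    ("tfidf", TfidfTransformer()),   # optional IDF reweighting
    ("clf", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "vect__stop_words": [None, "english"],   # keep or remove stop words
    "tfidf__use_idf": [True, False],         # apply IDF or not
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)

# The refitted best pipeline accepts raw text directly
predictions = search.predict(["the satellite is in orbit"])
```

Because the vectorizer sits inside the pipeline, it is refitted on each cross-validation training fold, so the validation folds never influence the vocabulary or the IDF weights.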

In [3]:
def preprocess(text, list_of_steps):
    
    for step in list_of_steps:
        if step == 'remove_non_ascii':
            text = ''.join([x for x in text if ord(x) < 128])
        elif step == 'lowercase':
            text = text.lower()
        elif step == 'remove_punctuation':
            punct_exclude = set(string.punctuation)
            text = ''.join(char for char in text if char not in punct_exclude)
        elif step == 'remove_numbers':
            text = re.sub(r"\d+", "", text)  # raw string avoids an invalid-escape warning
        elif step == 'strip_whitespace':
            text = ' '.join(text.split())
        elif step == 'remove_stopwords':
            stops = stopwords.words('english')
            word_list = text.split(' ')
            text_words = [word for word in word_list if word not in stops]
            text = ' '.join(text_words)
        elif step == 'stem_words':
            # despite the step name, this lemmatizes with WordNet rather than stemming
            lmtzr = WordNetLemmatizer()
            word_list = text.split(' ')
            stemmed_words = [lmtzr.lemmatize(word) for word in word_list]
            text = ' '.join(stemmed_words)
    return text

step_list = ['remove_non_ascii', 'lowercase', 'remove_punctuation', 'remove_numbers',
            'strip_whitespace']

In [4]:
train = pd.DataFrame({'data':twenty_train.data, 'target':twenty_train.target})
test = pd.DataFrame({'data':twenty_test.data, 'target':twenty_test.target})
train.head()
train.shape

| | data | target |
|---|---|---|
| 0 | From: cubbie@garnet.berkeley.edu ( ... | 9 |
| 1 | From: gnelson@pion.rutgers.edu (Gregory Nelson... | 4 |
| 2 | From: crypt-comments@math.ncsu.edu\nSubject: C... | 11 |
| 3 | From: ()\nSubject: Re: Quadra SCSI Problems??... | 4 |
| 4 | From: keith@cco.caltech.edu (Keith Allan Schne... | 0 |


(11314, 2)

In [5]:
tf = TfidfVectorizer()
X_train = tf.fit_transform(train['data'])
y_train = train['target']
X_test = tf.transform(test['data'])
y_test = test['target']

In [6]:
tf_STOP = TfidfVectorizer(stop_words='english')
X_train_STOP = tf_STOP.fit_transform(train['data'])
y_train_STOP = train['target']
X_test_STOP = tf_STOP.transform(test['data'])
y_test_STOP = test['target']

In [7]:
train_clean = train.copy()
train_clean['data'] = train_clean['data'].map(lambda s: preprocess(s, step_list))

In [8]:
# TODO
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# Grid Search
parameters = {
    'alpha': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
}
nb_grid = GridSearchCV(MultinomialNB(), parameters)

# Models 
# nb = MultinomialNB()
# nb_TFIDF = make_pipeline(tf, nb_grid)
# nb_TFIDF_STOP = make_pipeline(tf_STOP, nb_grid)
nb_TFIDF_CLEAN = make_pipeline(TfidfVectorizer(), nb_grid)

# Train
# nb_TFIDF.fit(X_train, y_train)
# nb_TFIDF_STOP.fit(X_train_STOP, y_train_STOP)
nb_TFIDF_CLEAN.fit(train_clean['data'], train_clean['target'])
print(f"The best hyper parameter setting is {nb_grid.best_params_}")

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('gridsearchcv',
                 GridSearchCV(estimator=MultinomialNB(),
                              param_grid={'alpha': [0.1, 0.2, 0.3, 0.4, 0.5,
                                                    0.6, 0.7, 0.8, 0.9, 1]}))])

The best hyper parameter setting is {'alpha': 0.1}


In [9]:
# PERFORMANCE from TFIDF
# accuracy_score(y_true, y_pred)
# from sklearn.metrics import accuracy_score

# y_pred_TFIDF = nb_grid.predict(X_test)
# accuracy_score(y_test, y_pred_TFIDF)

In [10]:
# PERFORMANCE from TFIDF with Stop Words
# accuracy_score(y_true, y_pred)
# from sklearn.metrics import accuracy_score

# y_pred_STOP = nb_TFIDF_STOP.predict(twenty_test.data)
# accuracy_score(twenty_test.target, y_pred_STOP)

In [11]:
# PERFORMANCE from TFIDF with Clean Data
# accuracy_score(y_true, y_pred)
from sklearn.metrics import accuracy_score

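# Note: the test text is passed in raw here, while the training text went through preprocess();
# applying the same step_list to the test set would keep the two consistent.
y_pred_CLEAN = nb_TFIDF_CLEAN.predict(twenty_test.data)
print(f'TF-IDF accuracy from clean data: {accuracy_score(twenty_test.target, y_pred_CLEAN):.4f}')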

TF-IDF accuracy from clean data: 0.8128


In [12]:
# Reference: https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html
# categories = ['talk.religion.misc', 'soc.religion.christian',
#              'sci.space', 'comp.graphics']
# train = fetch_20newsgroups(subset='train', categories=categories)
# test = fetch_20newsgroups(subset='test', categories=categories)
# model = make_pipeline(TfidfVectorizer(), MultinomialNB())
# model.fit(train.data, train.target)
# y_pred = model.predict(test.data)
# accuracy_score(test.target, y_pred)

---------

## Word Embedding with word2vec

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. 

In this assessment, we will experiment with the [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model from the [gensim](https://radimrehurek.com/gensim/) package and generate word embeddings from a review dataset. You can then explore those word embeddings and see if they make sense semantically.

In [13]:
import gzip
import logging
import warnings
from gensim.models import Word2Vec

warnings.simplefilter(action='ignore', category=FutureWarning)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Load the review data

In [14]:
import gensim

def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    print("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
 
            if (i % 10000 == 0):
                print("read {0} reviews".format(i))
            # do some pre-processing and yield the list of words for each review text
            yield gensim.utils.simple_preprocess(line)
            
documents = list(read_input('../data/reviews_data.txt.gz'))
logging.info("Done reading data file")

reading file ../data/reviews_data.txt.gz...this may take a while
read 0 reviews
read 10000 reviews
read 20000 reviews
read 30000 reviews
read 40000 reviews
read 50000 reviews
read 60000 reviews
read 70000 reviews
read 80000 reviews
read 90000 reviews
read 100000 reviews
read 110000 reviews
read 120000 reviews
read 130000 reviews
read 140000 reviews
read 150000 reviews
read 160000 reviews
read 170000 reviews
read 180000 reviews
read 190000 reviews
read 200000 reviews
read 210000 reviews
read 220000 reviews
read 230000 reviews
read 240000 reviews
read 250000 reviews


2021-03-01 23:33:36,235 : INFO : Done reading data file


### Train the word2vec model

The word2vec algorithms include the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling. They were introduced in the papers "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality". A word2vec tutorial can be found [here](https://rare-technologies.com/word2vec-tutorial/).
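
To illustrate how the two model types frame the learning problem, the sketch below builds training examples from one tokenised sentence. It is a simplified, pure-Python illustration and not gensim's implementation: the helper names `skipgram_pairs` and `cbow_examples` are made up here, and details of real training such as frequent-word subsampling and negative sampling are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word.

    Returns a list of (center, context) pairs for every context word
    that lies within `window` positions of the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding context words.

    Returns a list of (context_words, center) examples.
    """
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical tokenised review sentence
sentence = ["the", "staff", "was", "polite"]

sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
print(sg)
print(cb)
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with all context words grouped together.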

In [15]:
# TODO build vocabulary and train model
model = gensim.models.Word2Vec(documents, min_count=5) 

2021-03-01 23:33:36,238 : INFO : collecting all words and their counts
2021-03-01 23:33:36,239 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-01 23:33:36,419 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2021-03-01 23:33:36,605 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2021-03-01 23:33:36,823 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2021-03-01 23:33:37,026 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2021-03-01 23:33:37,248 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2021-03-01 23:33:37,500 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2021-03-01 23:33:37,707 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2021-03-01 23:33:37,896 : INFO : PROG

2021-03-01 23:34:20,794 : INFO : EPOCH 2 - PROGRESS: at 57.65% examples, 1361776 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:34:21,797 : INFO : EPOCH 2 - PROGRESS: at 62.49% examples, 1364519 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:34:22,799 : INFO : EPOCH 2 - PROGRESS: at 67.28% examples, 1364085 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:34:23,807 : INFO : EPOCH 2 - PROGRESS: at 71.83% examples, 1365253 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:34:24,815 : INFO : EPOCH 2 - PROGRESS: at 76.53% examples, 1366640 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:34:25,818 : INFO : EPOCH 2 - PROGRESS: at 80.88% examples, 1365570 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:34:26,823 : INFO : EPOCH 2 - PROGRESS: at 85.47% examples, 1367259 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:34:27,836 : INFO : EPOCH 2 - PROGRESS: at 90.61% examples, 1369717 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:34:28,838 : INFO : EPOCH 2 - PROGRESS: at 95.38% examples, 1370506 words/s

2021-03-01 23:35:25,146 : INFO : EPOCH 5 - PROGRESS: at 66.24% examples, 1440486 words/s, in_qsize 6, out_qsize 0
2021-03-01 23:35:26,146 : INFO : EPOCH 5 - PROGRESS: at 71.11% examples, 1443046 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:35:27,149 : INFO : EPOCH 5 - PROGRESS: at 76.15% examples, 1445970 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:35:28,151 : INFO : EPOCH 5 - PROGRESS: at 80.92% examples, 1448160 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:35:29,151 : INFO : EPOCH 5 - PROGRESS: at 85.77% examples, 1449751 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:35:30,157 : INFO : EPOCH 5 - PROGRESS: at 91.06% examples, 1450692 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:35:31,162 : INFO : EPOCH 5 - PROGRESS: at 96.15% examples, 1452317 words/s, in_qsize 5, out_qsize 0
2021-03-01 23:35:31,904 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-03-01 23:35:31,908 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-03-01 23:35:31,

### Find similar words for a given word
Once the model is built, you can find interesting patterns in the model. For example, can you find the 5 most similar words to the word `polite`?

In [16]:
# TODO: look up top 5 words similar to 'polite' using most_similar function
# Feel free to try other words and see if it makes sense.
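# query the trained word vectors via model.wv (calling most_similar on the model itself is deprecated)
model.wv.most_similar(positive=['polite'], topn=5)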

2021-03-01 23:35:31,915 : INFO : precomputing L2-norms of word weight vectors


[('courteous', 0.9425445795059204),
 ('cordial', 0.912461519241333),
 ('curteous', 0.9069803953170776),
 ('curtious', 0.8921737670898438),
 ('personable', 0.8755834698677063)]

### Compare the word embedding by comparing their similarities
We can also find the similarity between two words in the embedding space. Can you find the similarities between the word `great` and `good`/`horrible`, and also between `dirty` and `clean`/`smelly`? Feel free to play around with the word embeddings you just learned and see if they make sense.
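
The similarity score reported for two word vectors is their cosine similarity, that is, the cosine of the angle between them. A minimal NumPy sketch follows; the three-dimensional vectors are invented for illustration and are not taken from the trained model, whose vectors have many more dimensions.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1 = same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy embeddings, chosen by hand for this example
great = [0.9, 0.1, 0.3]
good = [0.8, 0.2, 0.3]
horrible = [-0.7, 0.6, 0.1]

sim_great_good = cosine_similarity(great, good)
sim_great_horrible = cosine_similarity(great, horrible)

print(f"great vs good:     {sim_great_good:.4f}")
print(f"great vs horrible: {sim_great_horrible:.4f}")
```

Vectors pointing in nearly the same direction score close to 1, while vectors pointing in different directions score lower and can even be negative.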

In [17]:
# TODO: find similarities between two words using similarity function
a = 'great'
b = 'good'
c = 'horrible'
print(f'{a} and {b} similarity: {model.wv.similarity(a, b):.4f}')
print(f'{a} and {c} similarity: {model.wv.similarity(a, c):.4f}')

great and good similarity: 0.8299
great and horrible similarity: 0.3865




In [18]:
# TODO: find similarities between two words using similarity function
a = 'dirty'
b = 'clean'
c = 'smelly'
print(f'{a} and {b} similarity: {model.wv.similarity(a, b):.4f}')
print(f'{a} and {c} similarity: {model.wv.similarity(a, c):.4f}')

dirty and clean similarity: 0.3428
dirty and smelly similarity: 0.7966




<b>Reflection</b>
<p>
This week I learned how powerful natural language processing is. It is incredibly useful for quantifying vocabularies in order to find insights. I greatly enjoyed diving into this dataset and getting my hands dirty.
<p>