# Cleaning SVM Bag of Words Question

In [2]:
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
import string
from keras.preprocessing.text import Tokenizer
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

- nltk is an NLP library; here we use it for its list of common English stop words, as you will see below.
- keras makes it easy to set up the bag of words model: we simply feed in text rather than implementing the featurization ourselves as we did in the first HW question.
- sklearn is a library for building the ML models we have discussed in this class, such as SVMs and logistic regression. It also makes it easy to compute the confusion matrix and accuracy.
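
As a rough illustration of what the bag of words model computes, the sketch below builds binary feature vectors by hand in plain Python. The toy sentences and the helper name `bag_of_words` are made up for this example and are not part of the assignment; the keras `Tokenizer` differs in its details.

```python
def bag_of_words(docs):
    """Return (vocabulary, binary feature vectors) for whitespace-tokenized docs."""
    # Hypothetical helper, for illustration only.
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            row[index[word]] = 1  # 1 means "this word appears in the document"
        vectors.append(row)
    return vocabulary, vectors


toy_docs = ["great movie great cast", "boring movie"]
vocabulary, vectors = bag_of_words(toy_docs)
print(vocabulary)
print(vectors)
```

Each document becomes one row with one column per vocabulary word, which is the kind of input an SVM can be trained on.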


## 1. Cleaning data

For this question we use IMDb movie reviews from Kaggle and use an SVM (which we learned about earlier in this class) with the bag of words model to do sentiment analysis. We also examine the effect of data cleaning on train/test accuracy.

### Part A: Identifying and Adding Cleaning functions

We first load an example review and create a function to clean the data. Below, identify what each of the three given cleaning sections does and explain why it is helpful. Then write code for a fourth section that removes words such as "I" or "A", which have no impact on sentiment analysis.

Hint: Consider the minimum length of useful information

In [None]:
# function for getting the doc
def get_doc(filename):
    # the with-block closes the file even if reading fails
    with open(filename, 'r') as f:
        txt = f.read()
    return txt

# Used if we did not clean file
def not_clean_file(f):
    data = f.split()
    return data

# function for cleaning the doc
def clean_file(f):
    # we grab all the data separated by whitespace
    data = f.split()
    
    # Clean 1  
    table = str.maketrans('', '', string.punctuation)
    data = [w.translate(table) for w in data]
    
    #  Clean 2
    data = [w for w in data if w.isalpha()]
    
    # Clean 3
    # Recall from our notes what stop words are. Here we use the stopwords library to easily gain access
    # to a list of common stop words in english. EX. words include: 
    #{‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’}
    stop_words = set(stopwords.words('english'))
    data = [w for w in data if w not in stop_words]
    
    ### Begin Part A
    # Add code here
    ### End Part A
    return data
    

# get the cleaned text
f = 'data/pos/cv000_29590.txt'
text = get_doc(f)
print("Original text: ")
print(text[:1000])

cleaned_text = clean_file(text)
not_cleaned_text = not_clean_file(text)

print()
print("Words from text: ")
print(not_cleaned_text[:30])

print()
print("Cleaned words from text: ")
print(cleaned_text[:20])

#### RESPONSE:

In [None]:
# Clean 1:
# 
# Clean 2:
# 
# Clean 3:
# 

## 2. Training without Cleaning

### Part B: Building Vocabulary

We now create a vocabulary that we can use for later steps. To do this we run the functions from before on all the training data. For this part of the assignment we will NOT be cleaning the data.
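
The vocabulary is kept in a `Counter`, which accumulates token counts across documents. A minimal sketch of that behavior, using two made-up token lists rather than the real reviews (the name `toy_vocab` is only for this example):

```python
from collections import Counter

toy_vocab = Counter()
# Each update adds the token counts from one (made-up) document.
toy_vocab.update("plot twist ending".split())
toy_vocab.update("plot pacing plot ending".split())

# most_common(n) returns the n tokens with the highest counts.
print(toy_vocab.most_common(2))
```

Calling `update` with a list of tokens increments the count of each token, so repeated calls build up corpus-wide frequencies.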

In [None]:
# load doc and add to vocab (not clean)
def add_doc_to_vocab(filename, vocab):
    doc = get_doc(filename)
    not_cleaned = not_clean_file(doc)
    vocab.update(not_cleaned)
    

def process_docs(directory, vocab):
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        add_doc_to_vocab(path, vocab)
        
# define vocab as a counter type
vocab = Counter()
# Adding both positive and negative data
process_docs('data/pos', vocab)
process_docs('data/neg', vocab)
# Printing the most common words from vocab
print(vocab.most_common(50))

What do you notice about the most common values in the vocabulary above? Do you think that they are helpful in our sentiment analysis?

#### RESPONSE:

In [None]:
# 

### Part C: Removing words that appear fewer than five times

We do not need to keep rare words in our vocabulary: they are unlikely to recur across reviews, so they play little role in sentiment analysis. Write the code below to remove all words that appear fewer than 5 times.

In [None]:
### Start Part C
min_occurance = 5
trim_vocab = [k for k,c in vocab.items() if c >= min_occurance]
### End Part C

We then save this vocab as a file to use for later.

In [None]:
def save_file(lines, filename):
    data = '\n'.join(lines)
    # the with-block closes the file even if writing fails
    with open(filename, 'w') as file:
        file.write(data)
    
# save tokens to a vocabulary file
save_file(trim_vocab, 'vocab_unclean.txt')

### Part D: Creating a model

We create helper functions that will let us use SVMs as our learning models. Please complete the code segments below.

In [None]:
# Function we will use to load a doc and grab all values that are also
# in vocab
def doc_to_line(filename, vocab):
    doc = get_doc(filename)
    words = not_clean_file(doc)
    # Write code to only include words that are in the vocabulary
    ### Begin Part D
    # Add code here
    ### End Part D
    return ' '.join(words)

In [None]:
# Loads all data given whether it is train or test
def process_docs(directory, vocab, is_train):
    lines = list()
    for filename in listdir(directory):
        # choose train or test data: files starting with 'cv9' are the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
            
        path = directory + '/' + filename
        line = doc_to_line(path, vocab)
        lines.append(line)
    return lines
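
The train/test split above is decided purely by filename prefix. The sketch below isolates that rule on a short list of filenames. Only `cv000_29590.txt` comes from this notebook; the other names and the helper `split_by_prefix` are hypothetical. It assumes each class directory holds files numbered `cv000` to `cv999`, which would be consistent with the 900/100 label counts used later.

```python
def split_by_prefix(filenames, test_prefix="cv9"):
    """Split filenames into (train, test) using the test-set prefix."""
    # Hypothetical helper, for illustration only.
    train = [name for name in filenames if not name.startswith(test_prefix)]
    test = [name for name in filenames if name.startswith(test_prefix)]
    return train, test


# Only the first filename appears in the notebook; the rest are invented.
names = ["cv000_29590.txt", "cv899_0001.txt", "cv900_0002.txt", "cv999_0003.txt"]
train_names, test_names = split_by_prefix(names)
print(train_names)
print(test_names)
```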

In [None]:
# We use the keras Tokenizer to generate Xtrain and Xtest. It builds the bag of words
# featurization from the input documents and the vocabulary we have.

# After fitting, the Tokenizer provides 4 attributes that you can use to query what has
# been learned about your documents:

# word_counts: A dictionary of words and their counts.
# word_docs: A dictionary of words and how many documents each appeared in.
# word_index: A dictionary of words and their uniquely assigned integers.
# document_count: An integer count of the total number of documents that were used to fit the Tokenizer.

# texts_to_matrix provides a suite of standard bag-of-words text encoding schemes,
# selected via its mode argument.

# The modes available include:

# 'binary': Whether or not each word is present in the document. This is the default.
# 'count': The count of each word in the document.
# 'tfidf': The Term Frequency-Inverse Document Frequency (TF-IDF) scoring for each word in the document.
# 'freq': The frequency of each word as a ratio of words within each document.

# For more information about Tokenizer, you can visit: 
# https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/

# For documentation, see: https://keras.io/api/preprocessing/text/#tokenizer
def prepare_data(train_docs, test_docs, mode):
    # We create the tokenizer
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(train_docs)
    # encode train data set
    Xtrain = tokenizer.texts_to_matrix(train_docs, mode=mode)
    # encode test data set
    Xtest = tokenizer.texts_to_matrix(test_docs, mode=mode)
    return Xtrain, Xtest
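
To make the difference between the encoding modes concrete, here is a plain-Python sketch of the `binary`, `count`, and `freq` schemes on one toy document. The function `encode` and the three-word vocabulary are made up for this example. This is not the keras implementation: the real `Tokenizer` differs in details (for instance, by default it lowercases text and filters punctuation), and `tfidf` is left out here.

```python
from collections import Counter


def encode(doc, vocabulary, mode="binary"):
    """Encode one document as a row with one entry per vocabulary word."""
    # Hypothetical helper, for illustration only.
    counts = Counter(w for w in doc.split() if w in vocabulary)
    total = sum(counts.values())
    row = []
    for word in vocabulary:
        c = counts[word]
        if mode == "binary":
            row.append(1 if c > 0 else 0)   # present or not
        elif mode == "count":
            row.append(c)                   # raw number of occurrences
        elif mode == "freq":
            row.append(c / total if total else 0.0)  # share of in-vocabulary words
        else:
            raise ValueError("unsupported mode: " + mode)
    return row


toy_vocabulary = ["good", "bad", "film"]
toy_doc = "good good film"
print(encode(toy_doc, toy_vocabulary, "binary"))
print(encode(toy_doc, toy_vocabulary, "count"))
print(encode(toy_doc, toy_vocabulary, "freq"))
```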

### Part E: Running the Model

We now begin to run the model. Please complete the code below and answer the following questions.

In [None]:
# load the vocabulary
vocab_filename = 'vocab_unclean.txt'
vocab = get_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

In [None]:
# load all training reviews
positive_lines = process_docs('data/pos', vocab, True)
negative_lines = process_docs('data/neg', vocab, True)
train_docs = negative_lines + positive_lines

In [None]:
positive_lines = process_docs('data/pos', vocab, False)
negative_lines = process_docs('data/neg', vocab, False)
test_docs = negative_lines + positive_lines

In [None]:
# prepare labels
ytrain = np.array([0 for _ in range(900)] + [1 for _ in range(900)])
ytest = np.array([0 for _ in range(100)] + [1 for _ in range(100)])

In [None]:
# Note here we use the binary option. If you read the explanation for Tokenizer, you see there are three other options.
# Feel free to try 'count', 'freq', or 'tfidf' instead, but we do not require you to do so for this problem
Xtrain, Xtest = prepare_data(train_docs, test_docs, 'binary')
# Write code below to use SVMs to create a model. You may use sklearn
### Begin Part E
# Add code here
### End Part E


In [None]:
# Print train error
### Begin Part E
# Add code here
### End Part E

In [None]:
# Print test error
### Begin Part E
# Add code here
### End Part E

In [None]:
# Print Confusion Matrix
### Begin Part E
# Add code here
### End Part E
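
For reference when reading the output, the sketch below shows how a 2x2 confusion matrix and the accuracy are derived from true and predicted labels. It is written in plain Python with made-up labels and hypothetical helper names; it uses the same layout that sklearn's `confusion_matrix` documents, with rows for true labels and columns for predicted labels.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels 0 (negative) and 1 (positive)."""
    # Hypothetical helper, for illustration only.
    matrix = [[0, 0], [0, 0]]
    for truth, guess in zip(y_true, y_pred):
        matrix[truth][guess] += 1  # row = true label, column = predicted label
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels, not results from the assignment.
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 1, 0, 1, 1, 0]
toy_matrix = confusion_counts(toy_true, toy_pred)
print(toy_matrix)
print(accuracy(toy_matrix))
```

The error rate is one minus the accuracy, so the off-diagonal entries are the misclassified reviews.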

How well did your model perform? Is it what you expected?

#### RESPONSE:

In [None]:
# 

### Part F: Vocabulary with Clean Data

We will recreate our vocabulary, this time with cleaned data. Respond to the question below.

In [None]:
# load doc and add to vocab (clean)
def add_doc_to_vocab2(filename, vocab):
    doc = get_doc(filename)
    cleaned = clean_file(doc)
    vocab.update(cleaned)
    
def process_docs2(directory, vocab):
    for filename in listdir(directory):
        # skip any reviews in the test set
        if filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        add_doc_to_vocab2(path, vocab)
        
# define vocab as a counter type
vocab2 = Counter()
# Adding both positive and negative data
process_docs2('data/pos', vocab2)
process_docs2('data/neg', vocab2)
# Printing the most common words from vocab
print(vocab2.most_common(50))

How do the most common words compare to those from Part B, where we built the vocabulary without cleaning?

#### RESPONSE:

In [None]:
#

Re-add the code from Part C below to remove words that appear fewer than five times.

In [None]:
### Start Part F
min_occurance = 5
trim_vocab2 = [k for k,c in vocab2.items() if c >= min_occurance]
### End Part F

In [None]:
# save tokens to a vocabulary file
save_file(trim_vocab2, 'vocab_clean.txt')

### Part G: Training with Clean Data

We re-train the model with clean data this time. Add code below and answer the following questions.

In [None]:
# Function we will use to load a doc and grab all values that are also
# in vocab
def doc_to_line2(filename, vocab):
    doc = get_doc(filename)
    words = clean_file(doc)
    # Write code to only include words that are in the vocabulary
    # This is the same as part D
    ### Begin Part G
    # Add code here
    ### End Part G
    return ' '.join(words)

# Loads all data given whether it is train or test
def process_docs(directory, vocab, is_train):
    lines = list()
    for filename in listdir(directory):
        # choose train or test data: files starting with 'cv9' are the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
            
        path = directory + '/' + filename
        line = doc_to_line2(path, vocab)
        lines.append(line)
    return lines

In [None]:
# load the vocabulary
vocab_filename = 'vocab_clean.txt'
vocab = get_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)

In [None]:
# load all training reviews
positive_lines = process_docs('data/pos', vocab, True)
negative_lines = process_docs('data/neg', vocab, True)
train_docs = negative_lines + positive_lines

In [None]:
positive_lines = process_docs('data/pos', vocab, False)
negative_lines = process_docs('data/neg', vocab, False)
test_docs = negative_lines + positive_lines

In [None]:
# prepare labels
ytrain = np.array([0 for _ in range(900)] + [1 for _ in range(900)])
ytest = np.array([0 for _ in range(100)] + [1 for _ in range(100)])

In [None]:
Xtrain, Xtest = prepare_data(train_docs, test_docs, 'binary')
# Write code below to use SVMs to create a model. You may use sklearn
# Same as Part E
### Begin Part G
# Add code here
### End Part G

In [None]:
# Print train error
# Same as Part E
### Begin Part G
# Add code here
### End Part G

In [None]:
# Print test error
# Same as Part E
### Begin Part G
# Add code here
### End Part G

In [None]:
# Print Confusion Matrix
# Same as Part E
### Begin Part G
# Add code here
### End Part G

Did you expect these results? What effect did cleaning the data before training have?

#### RESPONSE:

In [None]:
#