# Introduction to Bag of Words for Binary Classification

Motivation: This problem introduces a real-world application of the Bag of Words model: sentiment analysis framed as binary classification. The student will perform binary classification on two datasets: a Yelp Review dataset partitioned into low-rated and highly rated reviews, and a dataset of Airplane Tweets labeled with positive and negative sentiment. The student starts with data exploration. Afterwards, they create a simple logistic regression model whose only feature is word count, then implement bag of words features, and finally explore a number of modifications to the model in order to measure the impact of each variation and better understand the nuances of the model.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import sklearn
import json

from collections import Counter



## 1. Exploring the data set
Here we explore the dataset of Yelp Reviews and Airplane Tweets. The Yelp dataset has a star rating of 1 to 5, so we extract the most polar reviews with 1 and 5 stars and classify them as -1 and 1 respectively. For Airplane Tweets, we translate the 'negative' and 'positive' sentiment labels into classes of -1 and 1. We initially load the data and do some exploratory analysis in order to learn more about the data we will be classifying and gain some insight into how to do so.
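
The label mapping described above can be sketched on a few toy records. This is only an illustration in plain Python: the rows, the `label_rows` helper, and the two mapping dictionaries are invented here and are not part of the assignment code, which works on pandas DataFrames.

```python
# Toy records standing in for rows of the two datasets (invented for illustration).
yelp_rows = [
    {"stars": 1, "text": "Cold food and rude staff."},
    {"stars": 3, "text": "It was fine."},
    {"stars": 5, "text": "Best tacos in town!"},
]
tweet_rows = [
    {"airline_sentiment": "negative", "text": "Flight delayed again."},
    {"airline_sentiment": "neutral", "text": "Boarding now."},
    {"airline_sentiment": "positive", "text": "Great crew today!"},
]

# Hypothetical lookup tables: only the polar labels are kept.
STAR_TO_CLASS = {1: -1, 5: 1}
SENTIMENT_TO_CLASS = {"negative": -1, "positive": 1}


def label_rows(rows, key, mapping):
    """Keep rows whose label appears in `mapping` and attach a -1/1 class."""
    return [
        {"class": mapping[row[key]], "text": row["text"]}
        for row in rows
        if row[key] in mapping
    ]


yelp_labeled = label_rows(yelp_rows, "stars", STAR_TO_CLASS)
tweets_labeled = label_rows(tweet_rows, "airline_sentiment", SENTIMENT_TO_CLASS)

print([row["class"] for row in yelp_labeled])    # the 3 star row is dropped
print([row["class"] for row in tweets_labeled])  # the neutral row is dropped
```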

In [None]:
### Uncomment to use Yelp Reviews dataset
df = pd.read_csv('yelp_academic_dataset_review.csv') 
###

### Uncomment this to use the Airplane Tweets dataset
# df = pd.read_csv('Tweets.csv') 
###

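For the Yelp dataset, we take the first 10,000 one star reviews and the first 10,000 five star reviews. For the Airplane Tweets dataset, we keep every tweet that is not labeled neutral.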

In [None]:

### Uncomment to use Yelp Reviews dataset
# Get one star reviews and label them with -1
# (.copy() avoids pandas' SettingWithCopyWarning when overwriting the column)
dfNegative = df[df['stars'] == 1].head(10000).copy()
dfNegative['stars'] = -1
print("Shape of the negative input: ")
print(dfNegative.shape)

# Get five star reviews and label them with 1
dfPositive = df[df['stars'] == 5].head(10000).copy()
dfPositive['stars'] = 1

print("Shape of the positive input: ")
print(dfPositive.shape)
dfCombined = pd.concat([dfNegative, dfPositive], axis=0)
dfCombined = dfCombined[['stars', 'text']]
dfCombined=dfCombined.rename(columns = {'stars':'class'})
###


### Uncomment this to use the Airplane Tweets dataset
# dfCombined = df[['airline_sentiment', 'text']]
# dfCombined = dfCombined[dfCombined.airline_sentiment != 'neutral']
# dfCombined['airline_sentiment'] = dfCombined['airline_sentiment'].replace(['positive','negative'],[1, -1])
# dfCombined = dfCombined.rename(columns = {'airline_sentiment':'class'})
# dfPositive = dfCombined[dfCombined['class'] == 1]
# dfNegative = dfCombined[dfCombined['class'] == -1]
# print("Shape of the negative input: ")
# print(dfNegative.shape)
# print("Shape of the positive input: ")
# print(dfPositive.shape)
###



# Randomly shuffling the data then dividing it into train and test sets
dfCombined = dfCombined.sample(frac=1)
print("Shape of the dataframe: ")
print(dfCombined.shape)

dfTrainset = dfCombined.head(int(len(dfCombined.index) * .8))
dfTestset = dfCombined.tail(int(len(dfCombined.index) * .2))

trainX = np.asarray(dfTrainset['text'])
trainY = np.asarray(dfTrainset['class'])

testX = np.asarray(dfTestset['text'])
testY = np.asarray(dfTestset['class'])

print('Data Frame of reviews:')


dfCombined

### Part A: Data Sampling

Try to run the below block multiple times to see different reviews and their respective class. Please comment below on what interesting aspects of the reviews you find associated with each class. What distinguishes between a classification of 1 and one of -1? Do so for both datasets.

In [None]:
sample = dfTrainset.sample() 
print("Text: " + sample['text'].values[0]  + "\n")
print("Classification: " + str(sample['class'].values[0]))

#### RESPONSE: 

Yelp Reviews:
    
Airplane Tweets:

### Part B: Corpus Examination

We will now look at all the text in our train dataset (corpus) in order to see what it contains. In the provided space below use a histogram to visualize the frequency of the 25 most common words. Then answer the questions that follow. Hint: The most_common function for Counters may come in handy.
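
As a small illustration of the hint, here is how `Counter.most_common` behaves on a toy sentence. The sentence and variable names are made up for this sketch; the slicing idiom for the least common entries follows the pattern shown in the Python `collections` documentation.

```python
from collections import Counter

toy_text = "the food was good and the service was good but the wait was long"
toy_counts = Counter(toy_text.split())

# most_common(n) returns (word, count) pairs sorted by decreasing count.
top_three = toy_counts.most_common(3)
print(top_three)

# Reversed slice of the full ranking gives the n least common entries.
least_two = toy_counts.most_common()[:-3:-1]
print(least_two)

# Split the pairs into parallel sequences, e.g. as x and y values for a bar plot.
top_words, top_freqs = zip(*top_three)
print(top_words, top_freqs)
```

The two tuples at the end are in the shape that a plotting call such as `plt.bar` expects for its labels and heights.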

In [None]:
allText = ' '.join(dfTrainset["text"])
words = allText.split() 

wordCounts = Counter()
for word in words:
    wordCounts[word] += 1


In [None]:
print("Length of all text:")
print(len(allText))
print("Number of unique words:")
print(len(wordCounts))
### Begin Part B




### End Part B

#### What do you notice about the most common words for both datasets? Do you think they are useful in classifying a review?

#### RESPONSE: 

RESPONSE HERE

Look at some of the least common words below. Define the variable `leastCommon` so that the following cell can print it.

In [None]:
### Begin Part B

### End Part B

In [None]:
print(leastCommon)

#### What do you notice about the least common words for both datasets? Do you think they are useful in classifying a review?

#### RESPONSE: 

RESPONSE HERE

### Part C: Identifying Unique Most Common Words of Each Classification

We now want to find the most common words in each class that are not included in the other. Basically, we find the most common words in positive reviews (class = 1) that are not in the most common set of words for negative reviews (class = -1) and vice versa. Fill out the below code and answer the following questions.
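
The set-difference idea can be tried on two tiny corpora first. The texts, the cutoff of 4, and the `top_words` helper below are assumptions made for this sketch (the exercise itself uses the 100 most common words of each class).

```python
from collections import Counter

# Invented mini-corpora, one per class.
positive_text = "great food great service friendly staff the food was great"
negative_text = "terrible food rude staff the service was terrible"


def top_words(text, n):
    """Return the set of the n most common whitespace-separated tokens."""
    return {word for word, _ in Counter(text.split()).most_common(n)}


top_positive = top_words(positive_text, 4)
top_negative = top_words(negative_text, 4)

# Words in one class's top list that are absent from the other's.
positive_only = top_positive - top_negative
negative_only = top_negative - top_positive

print(sorted(positive_only))
print(sorted(negative_only))
```

Note that "service" and "staff" land in the unique sets only because of where the cutoff falls, not because they carry sentiment. The same effect is worth watching for in the real data.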

In [None]:
allTextPositive = ' '.join(dfPositive["text"])
wordsPositive = allTextPositive.split() 

### Begin Part C
# Find the 100 most common words that are found in the five star reviews





### End Part C



allTextNegative = ' '.join(dfNegative["text"])
wordsNegative = allTextNegative.split() 

### Begin Part C
# Find the 100 most common words that are found in the one star reviews




### End Part C

### Begin Part C
# Subtract sets in order to find the most common unique words for each set



### End Part C

print("Most common words in negative reviews: ")
print(negativeUnique)
print()
print("Most common words in positive reviews: ")
print(positiveUnique)

#### What do you notice about the words above? Are they more representative of each classification? What words do you think are good indicators of each review class? What words are not so good? Answer for both datasets.

#### RESPONSE: 

Yelp Reviews:
    
Airplane Tweets:

## 2. Constructing and Evaluating Different Models

### Part D: Baseline Model

To see the effect of the bag of words model, we first build a naive baseline model that classifies a review purely based on its length. Complete the code below and answer the following questions.
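
To make the shape of a one-feature input concrete, here is a minimal sketch on two invented reviews. It assumes "length" means the number of whitespace-separated tokens (character count would be an equally valid reading), and the helper name `word_count_feature` is hypothetical.

```python
toy_reviews = [
    "Great place",
    "The wait was far too long and the food was cold",
]


def word_count_feature(text):
    """Return a one-element feature vector: the number of tokens in the text."""
    return [len(text.split())]


# One row per review, one column per feature.
toy_features = [word_count_feature(text) for text in toy_reviews]
print(toy_features)
```

Each review becomes its own one-element list because scikit-learn estimators expect a two-dimensional input of shape (number of samples, number of features), even when there is a single feature.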

In [None]:
def baseline_featurize(review):
    ### Begin Part D
    # Featurize the data based on the length of the review. Hint: There should only be one feature.

    ### End Part D

def trainModel(X_featurized, y_true):
    ### Begin Part D
    # Return a logistic regression model

    

    ### End Part D

def accuracyData(model, X_featurized, y_true):
    ### Begin Part D
    # Predict the data given the model and corresponding data. Print and return the accuracy 
    # as the percentage of values that were correctly classified. Also print a confusion
    # matrix to help visualize the error. Hint: Look at sklearn.metrics.confusion_matrix
    
    
    
    
    
    
    
    ### End Part D
    return accuracy
    

In [None]:
### Begin Part D
# Featurize the training data and then train a model on it. 
# Afterwards, featurize the test data and evaluate the model on it.
# Use the functions you made above to do so
print("Beginning Train Featurization")

print("Beginning Training")

print("Beginning Test Featurization")

print("Accuracy:")

### End Part D

#### What did you get as your accuracy? Does that surprise you? Why or why not? Answer for both datasets.

#### RESPONSE: 

Yelp Reviews:

Airplane Tweets:

### Part E: Bag of Words Model

Now implement the bag of words featurization below based on the provided lecture. Please complete the following code segments and answer the following questions.
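
Before working with the full vocabulary, it can help to see a count vector for a four-word vocabulary. The vocabulary, the sentence, and the helper `count_vector` are invented for this sketch and are not the lecture's reference implementation.

```python
from collections import Counter

# Hypothetical vocabulary; position i in the vector belongs to toy_vocabulary[i].
toy_vocabulary = ["good", "food", "bad", "service"]


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    # A Counter returns 0 for missing keys, so absent words need no special case.
    return [counts[word] for word in vocabulary]


vector = count_vector("good food good service slow", toy_vocabulary)
print(vector)
```

The word "slow" is not in the vocabulary, so it does not contribute to any feature.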

In [None]:
# We create a wordsOrdered list that contains all words in the train data that show up more
# than one time. Each word count should be in its respective place in the feature vector.

modifiedCounter = Counter(el for el in wordCounts.elements() if wordCounts[el] > 1)
wordsOrdered = [key for key, _ in modifiedCounter.most_common()]

def bag_of_words_featurize(review):
    ### Begin Part E
    # Code the featurization for the bag of words model. Return the corresponding vector
    
    
    
    
    
    
    ### End Part E        

Run the script below and see how well the bag of words model performs. Warning: this block may
take around 10 minutes to run.

In [None]:
print("Beginning Train Featurization")
currBagFeaturized_data = np.array(list(map(bag_of_words_featurize, trainX)))
print("Beginning Training")
currBagModel = trainModel(currBagFeaturized_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedBag_data = np.array(list(map(bag_of_words_featurize, testX)))
print("Accuracy:")
accuracyData(currBagModel, testFeaturizedBag_data, np.asarray(dfTestset["class"]))

#### What was your accuracy? Does that surprise you? Why did it perform as it did? Answer for both datasets.

#### RESPONSE: 

Yelp Reviews:

Airplane Tweets:

In [None]:
intermed = dict(enumerate(wordsOrdered))
wordPosition = {y:x for x,y in intermed.items()}

### Part F: Examining Bag of Words Weights

We have provided a function that gets the weight of a word feature below in the weight vector generated from the logistic regression model with bag of words featurization. Answer the question below.

In [None]:
def weightOfWords(word):
    if word not in wordPosition.keys():
        print("Word does not exist in model, no weight is assigned to it")
        return
    return currBagModel.coef_[0][wordPosition[word]]


In [None]:
# Try different words here
weightOfWords('good')

#### List three words that have positive weights. List three that have negative weights. Explain why that makes sense. Answer for both datasets.

#### RESPONSE: 

Yelp Reviews:

Airplane Tweets:

### Part G: Binary Bag of Words

There are times when we only want to identify whether a word is in a review or not and disregard the number of times it appears. In this case, a binary bag of words can be more useful than our regular bag of words model. Hypothesize which model should perform better given your examination of the dataset. Complete the code below and answer the questions that follow.
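
The difference from count features is easiest to see on a toy example. The vocabulary, sentence, and helper name `presence_vector` are assumptions made for this sketch.

```python
# Hypothetical vocabulary; position i in the vector belongs to toy_vocabulary[i].
toy_vocabulary = ["good", "food", "bad", "service"]


def presence_vector(text, vocabulary):
    """Mark each vocabulary word with 1 if it occurs in the text, else 0."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocabulary]


binary_vector = presence_vector("good food good service slow", toy_vocabulary)
print(binary_vector)
```

Here "good" appears twice in the sentence but still contributes a single 1, so repeated words no longer increase the feature value.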

In [None]:
def bag_of_words_binary_featurize(review):
    ### Begin Part G
    
    
    
    
    
    
    ### End Part G

Run the script below and see how well the binary bag of words model performs. Warning: this block may
take around 10 minutes to run.

In [None]:
print("Beginning Train Featurization")
currBinBagFeaturized_data = np.array(list(map(bag_of_words_binary_featurize, trainX)))
print("Beginning Training")
currBinBagModel = trainModel(currBinBagFeaturized_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedBinBag_data = np.array(list(map(bag_of_words_binary_featurize, testX)))
print("Accuracy:")
accuracyData(currBinBagModel, testFeaturizedBinBag_data, np.asarray(dfTestset["class"]))

#### What was your accuracy percentage? Was it what you expected? How did it compare to the regular Bag of Words model? Answer for both datasets.

#### RESPONSE:

Yelp Reviews:

Airplane Tweets:

### Part H: Bag of Words Negative Features

There are times when we also want to treat negated words as separate features rather than as regular ones. For example, in the review "The food is not good", the word "good" is negated and should be treated as such. Thus we add a new feature for the negation of each of our chosen words. Complete the code below and answer the following questions. Hint: Try doubling the size of the feature vector.
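
One possible way to realize the doubled feature vector is sketched below on a three-word vocabulary. Everything here is an assumption for illustration: the negator list, the rule that a word counts as negated only when the token directly before it is a negator, and the layout in which the second half of the vector holds the negated counts. Other negation rules are equally reasonable.

```python
# Hypothetical set of negating tokens.
NEGATORS = {"not", "no", "never"}


def negation_count_vector(text, vocabulary):
    """Count plain and negated occurrences of each vocabulary word.

    Positions [0, V) hold plain counts and positions [V, 2V) hold counts of
    occurrences directly preceded by a negator, where V = len(vocabulary).
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * (2 * len(vocabulary))
    tokens = text.lower().split()
    for position, token in enumerate(tokens):
        if token not in index:
            continue
        negated = position > 0 and tokens[position - 1] in NEGATORS
        offset = len(vocabulary) if negated else 0
        vector[index[token] + offset] += 1
    return vector


toy_vocabulary = ["good", "food", "bad"]
negation_vector = negation_count_vector("The food is not good", toy_vocabulary)
print(negation_vector)
```

In the output, "food" is counted in the first half and "good" in the second half, so the classifier can learn different weights for "good" and "not good".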

In [None]:
def bag_of_words_neg_featurize(review):
    ### Begin Part H
    
    
    
    
    
    
    
    
    
    
    
    
    
    ### End Part H

Run the script below and see how well the bag of words model with negative features performs. Warning: this block may
take around 10 minutes to run.

In [None]:
print("Beginning Train Featurization")
neg_data = np.array(list(map(bag_of_words_neg_featurize, trainX)))
print("Beginning Training")
negModel = trainModel(neg_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedNeg_data = np.array(list(map(bag_of_words_neg_featurize, testX)))
print("Accuracy:")
accuracyData(negModel, testFeaturizedNeg_data, np.asarray(dfTestset["class"]))

#### How did this model perform? Is it as expected? Why did it perform this way? Answer for both datasets.

#### RESPONSE:

Yelp Reviews:

Airplane Tweets:

### Part I: Negative Binary Features

Complete the code below to combine the two modifications we worked on (binary features and negative features), then answer the questions that follow.

In [None]:
def bag_of_words_neg_binary_featurize(review):
    ### Begin Part I
    
    
    
    
    
    
    
    
    
    
    
    
    
    ### End Part I

Run the script below and see how well the negative binary bag of words model performs. Warning: this block may take around 10 minutes to run.

In [None]:
print("Beginning Train Featurization")
negbin_data = np.array(list(map(bag_of_words_neg_binary_featurize, trainX)))
print("Beginning Training")
negBinModel = trainModel(negbin_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedNegBin_data = np.array(list(map(bag_of_words_neg_binary_featurize, testX)))
print("Accuracy:")
accuracyData(negBinModel, testFeaturizedNegBin_data, np.asarray(dfTestset["class"]))

#### Was the result as expected? Why or why not? Answer for both datasets.

#### RESPONSE:

Yelp Reviews:

Airplane Tweets:

## 3. Extra Credit

### Part J (OPTIONAL): Enhanced Model

To get extra credit, try to create a featurization below that reaches an accuracy of .97 or higher for either model. Ideas to keep in mind are the Bigram model discussed in the notes, which takes consecutive words into account, as well as methods to increase the number of features we use. Good luck!
HINT: You can combine additional features like length with existing bag of words features.
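
As a reminder of what "consecutive words" means in practice, the sketch below lists the bigrams of a short sentence. The helper name `bigrams` is hypothetical, and how the pairs are turned into features (for example, counting them against a bigram vocabulary) is left open.

```python
def bigrams(text):
    """Return each pair of adjacent tokens joined by a single space."""
    tokens = text.split()
    # zip stops at the shorter sequence, so the last token starts no pair.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


toy_bigrams = bigrams("the food is not good")
print(toy_bigrams)
```

A bigram such as "not good" keeps the negation attached to the word it modifies, which single-word features cannot do.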

In [None]:
def bag_of_words_extra_credit_featurize(review):
    ### Begin Part J
    # User solution!
    ### End Part J

In [None]:
print("Beginning Train Featurization")
ExtraBagFeaturized_data = np.array(list(map(bag_of_words_extra_credit_featurize, trainX)))
print("Beginning Training")
ExtraBagModel = trainModel(ExtraBagFeaturized_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedBinBag_extra = np.array(list(map(bag_of_words_extra_credit_featurize, testX)))
print("Accuracy:")
accuracyData(ExtraBagModel, testFeaturizedBinBag_extra, np.asarray(dfTestset["class"]))

#### What features did you add? Why did you do so? What was your accuracy percentage?

#### RESPONSE:

RESPONSE HERE

### <span style="color:red">ONLY RUN BELOW CODE IF YOU ARE ON THE YELP DATASET</span> 

## 4. Evaluating Yelp Model with Less Polar Data

Now we will be performing a similar analysis on the Yelp Dataset but including both 1 star and 2 star reviews as the negative class and 4 star and 5 star reviews as the positive class. This way there will be less of a clear divide between the two classes and students should see how adapting the bag of words model can prove beneficial.

In [None]:

# Get one and two star reviews and label them with -1
# (.copy() avoids pandas' SettingWithCopyWarning when overwriting the column)
dfOnes = df[df['stars'] == 1].head(10000).copy()
dfOnes['stars'] = -1

dfTwos = df[df['stars'] == 2].head(10000).copy()
dfTwos['stars'] = -1

print("Shape of the ones input: ")
print(dfOnes.shape)

# Get four and five star reviews and label them with 1
dfFives = df[df['stars'] == 5].head(10000).copy()
dfFives['stars'] = 1

dfFours = df[df['stars'] == 4].head(10000).copy()
dfFours['stars'] = 1

print("Shape of the fives input: ")
print(dfFives.shape)
dfCombined = pd.concat([dfOnes, dfTwos, dfFours, dfFives], axis=0)
dfCombined=dfCombined.rename(columns = {'stars':'class'})
dfCombined = dfCombined.sample(frac=1)

dfTrainset = dfCombined.head(int(len(dfCombined.index) * .8))
dfTestset = dfCombined.tail(int(len(dfCombined.index) * .2))

trainX = np.asarray(dfTrainset['text'])
trainY = np.asarray(dfTrainset['class'])

testX = np.asarray(dfTestset['text'])
testY = np.asarray(dfTestset['class'])

print('Data Frame of reviews:')
dfCombined

### Part K: Data Sampling of the 2 and 4 star reviews

In [None]:
sample = dfTwos.sample() 
print("Text: " + sample['text'].values[0]  + "\n")
# dfTwos still uses the 'stars' column, which now holds the class label
print("Class: " + str(sample['stars'].values[0]))

In [None]:
sample = dfFours.sample() 
print("Text: " + sample['text'].values[0]  + "\n")
# dfFours still uses the 'stars' column, which now holds the class label
print("Class: " + str(sample['stars'].values[0]))

In [None]:
# Build the vocabulary from the training set only, as in Part B
allText = ' '.join(dfTrainset["text"])
words = allText.split() 

wordCounts = Counter()
for word in words:
    wordCounts[word] += 1
    
modifiedCounter = Counter(el for el in wordCounts.elements() if wordCounts[el] > 1)
wordsOrdered = [key for key, _ in modifiedCounter.most_common()]

### Part L: Baseline Model

In [None]:
print("Beginning Train Featurization")
currBaselineFeaturized_data = np.array(list(map(baseline_featurize, trainX)))
print("Beginning Training")
currBaselineModel = trainModel(currBaselineFeaturized_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedBaseline_data = np.array(list(map(baseline_featurize, testX)))
print("Accuracy:")
accuracyData(currBaselineModel, testFeaturizedBaseline_data, np.asarray(dfTestset["class"]))

#### What was your accuracy percentage? Was it what you expected? How did it compare to the baseline model on the 1 star and 5 star data in Part D?

#### RESPONSE:

RESPONSE HERE

### Part M: Bag of Words Model

In [None]:
print("Beginning Train Featurization")
currBagFeaturized_data = np.array(list(map(bag_of_words_featurize, trainX)))
print("Beginning Training")
currBagModel = trainModel(currBagFeaturized_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedBag_data = np.array(list(map(bag_of_words_featurize, testX)))
print("Accuracy:")
accuracyData(currBagModel, testFeaturizedBag_data, np.asarray(dfTestset["class"]))

#### What was your accuracy percentage? Was it what you expected? How did it compare to the Bag of Words model on the 1 star and 5 star data in Part E?

#### RESPONSE:

RESPONSE HERE

### Part N: Binary Bag of Words Model

In [None]:
print("Beginning Train Featurization")
currBinBagFeaturized_data = np.array(list(map(bag_of_words_binary_featurize, trainX)))
print("Beginning Training")
currBinBagModel = trainModel(currBinBagFeaturized_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedBinBag_data = np.array(list(map(bag_of_words_binary_featurize, testX)))
print("Accuracy:")
accuracyData(currBinBagModel, testFeaturizedBinBag_data, np.asarray(dfTestset["class"]))

#### What was your accuracy percentage? Was it what you expected? How did it compare to the regular Bag of Words model in Part M?

#### RESPONSE:

RESPONSE HERE

### Part O: Negative Bag of Words Model

In [None]:
print("Beginning Train Featurization")
neg_data = np.array(list(map(bag_of_words_neg_featurize, trainX)))
print("Beginning Training")
negModel = trainModel(neg_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedNeg_data = np.array(list(map(bag_of_words_neg_featurize, testX)))
print("Accuracy:")
accuracyData(negModel, testFeaturizedNeg_data, np.asarray(dfTestset["class"]))

#### What was your accuracy percentage? Was it what you expected? How did it compare to the regular Bag of Words model in Part M?

#### RESPONSE:

RESPONSE HERE

### Part P: Negative Binary Bag of Words Model

In [None]:
print("Beginning Train Featurization")
negbin_data = np.array(list(map(bag_of_words_neg_binary_featurize, trainX)))
print("Beginning Training")
negBinModel = trainModel(negbin_data, np.asarray(dfTrainset["class"]))
print("Beginning Test Featurization")
testFeaturizedNegBin_data = np.array(list(map(bag_of_words_neg_binary_featurize, testX)))
print("Accuracy:")
accuracyData(negBinModel, testFeaturizedNegBin_data, np.asarray(dfTestset["class"]))

#### What was your accuracy percentage? Was it what you expected? How did it compare to the regular Bag of Words model in Part M?

#### RESPONSE:

RESPONSE HERE