## Final Project Part 2 - NLP Tasks
#### Josh Breunig

### Load Data

In [1]:
import pandas as pd
import json
import numpy as np


In [2]:
# loading in the review data in chunks
reviews_df = pd.read_json('Data/yelp_academic_dataset_review.json', lines=True, chunksize=1000)

In [3]:
# iterating through the data and pulling the desired columns into a data frame
reviews_output = pd.DataFrame()
for chunk in reviews_df:
    data = chunk[['business_id', 'stars', 'text']]
    reviews_output = pd.concat([reviews_output, data], axis=0)

In [4]:
# adding sentiment labels to the dataframe based on user 'star' rating
reviews_output['sentiment'] = ['positive' if stars >= 3 else 'negative' for stars in reviews_output['stars']]

In [5]:
# cutting the size of the dataset
reviews_output = reviews_output[:200000]

### Preprocessing

We will be using the normalize_corpus function from Sarkar's book to preprocess the reviews.  The following preprocessing steps will be performed:
- Correct spelling
- Lemmatization
- Remove whitespace
- Remove numbers and special characters
- Remove stop words (We will want to edit the stop words list and remove words like 'no', 'not', etc. to ensure we capture the context of each review)
- Expand contractions
- Remove accented characters


In [6]:
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import pattern

In [7]:
def normalize_data(doc, text_lemmatization=True, stopword_removal=True):
   # adjusting the stop word list
    stop_words = nltk.corpus.stopwords.words('english')
    stop_words.remove('no')
    stop_words.remove('but')
    stop_words.remove('not')
    lemmatizer = WordNetLemmatizer()
    
    normalized_text = []
    for text in doc:
        text = text.lower()
        text = text.strip()
        text = re.sub(r'[\r|\n|\r\n]', '', text) # removing html tags
        text = re.sub(r'[^a-zA-Z0-9]', ' ', text) # removing special characters
        text = re.sub(r'[0-9]', '', text) # removing numbers
        if text_lemmatization:
            tokens = word_tokenize(text)
            tokens = [token.strip() for token in tokens] 
            text = ' '.join([lemmatizer.lemmatize(w) for w in tokens])
        if stopword_removal:
            filtered_tokens = [token for token in tokens if token not in stop_words]
            text = ' '.join(filtered_tokens)
        # correct word lengthening
        pattern = re.compile(r'(.)\1{2,}')
        text = pattern.sub(r'\1\1', text)
        normalized_text.append(text)
    return normalized_text
        

In [8]:
# adding the reviews to an array
reviews = np.array(reviews_output['text'])

In [9]:
# normalizing the reviews
norm_reviews = normalize_data(reviews)

In [10]:
# saving the normalized reviews to a dataframe
norm_reviews_df = pd.DataFrame(norm_reviews)

In [11]:
# loading in csv with normalized reviews
norm_reviews_df = pd.read_csv("preprocessed_reviews.csv")

In [23]:
norm_reviews_df.head()

Unnamed: 0,norm_text
0,delicious best sandwich shop ever way top sand...
1,terrific unique little spot downtown tucson gr...
2,company moved irvington area darling husband a...
3,appointment today truly nolen not show time bu...
4,came sushi rose celebrate friend birthday free...


In [22]:
norm_reviews_df.rename(columns= {'0':'norm_text'}, inplace=True)

In [24]:
full_df = reviews_output.copy()

In [25]:
full_df['norm_text'] = norm_reviews_df['norm_text']

In [26]:
full_df.head()

Unnamed: 0,business_id,stars,text,sentiment,norm_text
0,XQfwVwDr-v0ZS3_CbbE5Xw,3,"If you decide to eat here, just be aware it is...",positive,delicious best sandwich shop ever way top sand...
1,7ATYjTIgM3jUlt4UM3IypQ,5,I've taken a lot of spin classes over the year...,positive,terrific unique little spot downtown tucson gr...
2,YjUWPpI6HXG530lwP-fb2A,3,Family diner. Had the buffet. Eclectic assortm...,positive,company moved irvington area darling husband a...
3,kxX2SOes4o-D3ZQBkiMRfA,5,"Wow! Yummy, different, delicious. Our favo...",positive,appointment today truly nolen not show time bu...
4,e4Vwtrqf-wpJfwesgvdgxQ,4,Cute interior and owner (?) gave us tour of up...,positive,came sushi rose celebrate friend birthday free...


In [37]:
# saving  to a csv
full_df.to_csv("preprocessed_df.csv", index=False)
full_df.fillna(" ", inplace=True)

### Create Test/Train Sets

In [38]:
reviews_array = np.array(full_df['norm_text'])
sentiments_array = np.array(full_df['sentiment'])

# calculating the 33% mark to split the data into train and test datasets
total_reviews = full_df.shape[0]
cut_off = round(total_reviews*.33)

# build train and test datasets
train_reviews = reviews[cut_off:]
train_sentiments = sentiments_array[cut_off:]
test_reviews = reviews_array[:cut_off]
test_sentiments = sentiments_array[:cut_off]

In [39]:
print("Test Dataset Shape: ", test_reviews.shape)
print("Train Dataset Shape: ", train_reviews.shape)

Test Dataset Shape:  (66000,)
Train Dataset Shape:  (134000,)


### Feature Engineering

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

# build BOW features on train_reviews
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(train_reviews)

In [41]:
# transform test reviews into features
cv_test_features = cv.transform(test_reviews)


In [46]:
# saving vectorizer 
pickle.dump(cv, open("cv.pickel", "wb"))

### Sentiment Analysis
#### Model Training, Prediction, and Performance


##### Logistic Regression

In [42]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=700, C=1)

In [43]:
lr.fit(cv_train_features, train_sentiments)
lr_predictions = lr.predict(cv_test_features)

In [44]:
import model_evaluation_utils as meu

meu.display_model_performance_metrics(true_labels=test_sentiments, 
                                      predicted_labels=lr_predictions,
                                     classes=['positive', 'negative'])

Model Performance metrics:
------------------------------
Accuracy: 0.9265
Precision: 0.9242
Recall: 0.9265
F1 Score: 0.9236

Model Classification report:
------------------------------
              precision    recall  f1-score   support

    positive       0.94      0.98      0.96     53550
    negative       0.87      0.72      0.79     12450

    accuracy                           0.93     66000
   macro avg       0.90      0.85      0.87     66000
weighted avg       0.92      0.93      0.92     66000


Prediction Confusion Matrix:
------------------------------
                 Predicted:         
                   positive negative
Actual: positive      52220     1330
        negative       3524     8926


In [45]:
# model traning and saving
import pickle
pickle.dump(lr, open('saved_lr_model.sav', 'wb')) # saving the LR model

### Contribution Statement

I managed all contributions for this sprint, as I am working on this project solo.

One of the biggest challenges I faced during this sprint was the size of the dataset.  With over 6M data points and a file size of ~5GB, it was very difficult to work with the data on my local machine.  It was a challenge to load the data into the program, but I was able to find a solution by chunking the file.  I was very limited by the preprocessing steps I was able to take… the more steps included in the function, the longer it would take to run.  I originally tried to use the text_normalizer.py file from Sarkar’s book and that took over nine hours to run and eventually failed.  In the end, I decided to cut down the size of the dataset to make it more manageable. 

Another issue I faced with the preprocessing step was expanding the contractions.  Breaking down the contractions could help the sentiment analysis results, but I wasn’t able to successfully implement that step into my preprocessing code.
