<a href="https://colab.research.google.com/github/jacob-hansen/Multimodal-Activity-Classification/blob/main/Classification/yelpToWebClassification-Jacob.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open Connection to Google Sheets File

In [1]:
import requests
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())

In [54]:
worksheet = gc.open('Yelp Scraping Data Final').sheet1
places = set(worksheet.col_values(2)[1:])
testValues = worksheet.get()
headerNames = worksheet.get("A1:F1")[0]
headerDic = {headerNames[i]: i for i in range(len(headerNames))}

# Filter Text to Desired Words

In [3]:
# Set up nltk imports for tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import RegexpTokenizer  # ships with nltk itself, so no separate download is needed

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
# Set up nltk and gensim stopwords removal
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
# set up to get words to stem root
import re
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [62]:
def getTextByRating(file, minRating, readLength = 0):
    """
    From a specified Sheets file in Google Drive,
    read the first readLength reviews and keep those rated at least minRating.
    If readLength == 0, all reviews are read.
    Returns a generator of (activity, review) pairs and a dictionary
    mapping each activity id to its name.
    """
    worksheet = gc.open(file).sheet1
    # Column 1: activity id, column 2: activity name, column 3: rating, column 4: review text
    end = readLength + 1 if readLength != 0 else None  # +1 accounts for the header row
    activities = worksheet.col_values(1)[1:]
    names = worksheet.col_values(2)[1:]
    ratings = worksheet.col_values(3)[1:end]
    reviews = worksheet.col_values(4)[1:end]
    zipped = zip(ratings, activities, reviews)
    places = (i[1:] for i in zipped if int(i[0]) >= minRating)
    # Build the id -> name lookup from this sheet rather than from a global variable
    activityNames = {}
    for activity, name in zip(activities, names):
        if activity not in activityNames:
            activityNames[activity] = name
    return places, activityNames

In [6]:
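# Combined stopword set, built once instead of once per word
ALL_STOPWORDS = set(STOPWORDS) | set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')

def masterTokenizer(inputtext):
    """Strip digits, split into word tokens, stem, and drop stopwords."""
    no_nums = re.sub(r'[0-9]+', ' ', inputtext)
    firstTokens = tokenizer.tokenize(no_nums)
    stemmed_words_ps = [stemmer.stem(word) for word in firstTokens]
    # Filtering happens after stemming, so only stems that still match a stopword are removed
    return [word.lower() for word in stemmed_words_ps if word.lower() not in ALL_STOPWORDS]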

In [100]:
testText, activityNames = getTextByRating('Yelp Scraping Data Final', 4)
tokenList = {}
tokenizer = RegexpTokenizer(r'\w+')
for activity, review in testText:
    if int(activity) not in tokenList:
        tokenList[int(activity)] = []
    if len(review) > 200:
        tokenList[int(activity)].append(masterTokenizer(review)) 
        # Dictionaries preserve insertion order (Python 3.7+),
        # so tokenList.keys() returns the activities in the order they were first seen
    

The output is tokenList: a dictionary that maps each activity id (0 to n) to a list of tokenized reviews, where each review is itself a list of words.
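
To make that shape concrete, here is a small self-contained sketch that builds a dictionary of the same form from made-up reviews. The `toy_tokenize` helper and the sample reviews are assumptions for illustration only: they stand in for `masterTokenizer` and the Sheets data, and the 200-character length filter is left out so the short samples survive.

```python
import re

def toy_tokenize(text):
    """Hypothetical stand-in for masterTokenizer: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Made-up (activity id, review) pairs, not taken from the Yelp sheet
sample_reviews = [
    ("0", "Great aquarium, loved the penguins"),
    ("1", "The escape room puzzles were fun"),
    ("0", "So many fish on every floor"),
]

token_list = {}
for activity, review in sample_reviews:
    # One list of tokens per review, grouped under the integer activity id
    token_list.setdefault(int(activity), []).append(toy_tokenize(review))

print(token_list[0])
```

Activity 0 ends up with two token lists (one per review) and activity 1 with one, which is the nesting the Word2Vec cell later flattens into one long "sentence" per activity.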

# Gensim Word2Vec 

In [17]:
from gensim.models import Word2Vec
import itertools
import numpy as np

In [9]:
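def getVec(model, inputtext):
    """Sum the vectors of all in-vocabulary tokens of inputtext and scale the result to unit length."""
    modifiedText = masterTokenizer(inputtext)
    checkText = [i for i in modifiedText if i in model.wv]
    vTx = np.zeros(model.vector_size)
    for word in checkText:
        vTx += model.wv[word]  # index through model.wv; model[word] is deprecated
    norm = np.sqrt(np.dot(vTx, vTx))
    # Return the zero vector when no token is in the vocabulary, instead of dividing by zero
    return vTx / norm if norm > 0 else vTx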

In [106]:
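# Merge each activity's reviews into one long token list, so every activity is a single training "sentence"
sentences = [list(itertools.chain.from_iterable(tokenList[i])) for i in tokenList]
# `size` is the gensim 3.x name for the embedding dimension; gensim 4.0+ calls it `vector_size`
WVmodel = Word2Vec(sentences, min_count=3, size=10, window=200)
vocabulary = WVmodel.wv
# One unit-length vector per activity, in the same order as tokenList
activityVecs = np.array([getVec(WVmodel, " ".join(i)) for i in sentences])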

  


In [58]:
def inf_norm(matrix):
    """Currently a pass-through: returns the scores unchanged (no normalization is applied)."""
    return matrix

In [59]:
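def mostProbable(model, activityVecs, inputText, activityNames = None):
    """
    Score inputText against every activity vector by dot product.
    With activityNames, return the name of the best-scoring activity;
    otherwise return the array of scores.
    """
    inputVec = getVec(model, inputText)
    scores = inf_norm(activityVecs @ inputVec)
    if activityNames:
        # Assumes row i of activityVecs belongs to activity id i
        return activityNames[str(np.argmax(scores))]
    return scores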


In [64]:
testSentence = "I love going to the aquarium. I get to see tons of fish and I get to explore. Every floor is so cool! There are penguins, frogs, more fish. You name it!"
test2 = "Escape rooms that I got out fast"  #Yesterday I went to the escape room with some friends. There were a lot of us in the room, so we got out pretty fast. But there were fun puzzles and tricks."
print(mostProbable(WVmodel, activityVecs, testSentence, activityNames))
print(mostProbable(WVmodel, activityVecs, test2, activityNames))

new england aquarium boston
boxaroo boston 6
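
The prediction step reduces to a matrix-vector product: each activity is a unit vector, the input text is a unit vector, and the predicted activity is the row with the largest dot product, which for unit vectors is the cosine similarity. Below is a minimal sketch with made-up 3-dimensional vectors. The names `toy_vectors`, `text_vector`, and `labels` are illustrative assumptions, not values from the trained model.

```python
import numpy as np

# Made-up word vectors standing in for a trained Word2Vec vocabulary
toy_vectors = {
    "fish":    np.array([1.0, 0.1, 0.0]),
    "penguin": np.array([0.9, 0.2, 0.1]),
    "puzzl":   np.array([0.0, 1.0, 0.2]),
    "escap":   np.array([0.1, 0.9, 0.0]),
}

def text_vector(tokens):
    """Sum the vectors of known tokens and scale to unit length (zero vector if none are known)."""
    total = np.zeros(3)
    for token in tokens:
        if token in toy_vectors:
            total += toy_vectors[token]
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

# One row per activity, in the same order as the labels
labels = ["aquarium", "escape room"]
activity_vecs = np.array([
    text_vector(["fish", "penguin"]),
    text_vector(["puzzl", "escap"]),
])

query = text_vector(["fish", "tank"])  # "tank" is out of vocabulary and is ignored
scores = activity_vecs @ query         # cosine similarities, since every vector has unit length
print(labels[int(np.argmax(scores))])
```

Because the label lookup relies on row order, the activity vectors and the labels have to be built in the same order, which is the same assumption the notebook makes about `tokenList`.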


  


In [65]:
def reviewSimilarity(model, a, b):
    """
    Given a gensim Word2Vec model and two token lists,
    sums the vectors of the in-vocabulary words in each list
    and returns the absolute value of their cosine similarity.
    """
    checkA = [i for i in a if i in model.wv]
    checkB = [i for i in b if i in model.wv]
    vA = np.zeros(model.vector_size)
    vB = np.zeros(model.vector_size)
    for word in checkA:
        vA += model.wv[word]  # index through model.wv; model[word] is deprecated
    for word in checkB:
        vB += model.wv[word]
    similarity = abs(np.dot(vA, vB) / (np.sqrt(np.dot(vA,vA)) * np.sqrt(np.dot(vB,vB))))
    return similarity


In [66]:
reviewSimilarity(WVmodel, tokenList[0][4], tokenList[0][24])

0.9958994746097066

In [67]:
testActivity1 = "The interactive exhibits and exquisite attention to historical detail make this a quintessential Boston museum that every visitor must experience."  
testActivity4 = "Voted one of the Best Boston Ghost Tours for a Frightfully Good Time that's guaranteed to raise your spirits. Not all Boston Haunted Tours are created equal."
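# reviewSimilarity iterates over its arguments, so raw strings are compared character by character;
# pass masterTokenizer(text) instead to compare the descriptions word by word
print(reviewSimilarity(WVmodel, testActivity1, testActivity4))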

0.2607496960179214



# Testing Google Web Searches Against Review Sorting

In [116]:
worksheet = gc.open('Google Search Websites Boston').sheet1
testValues = worksheet.get()[1:]

In [115]:
scores = {}
for row in testValues:
    try:
        activity = row[0]
        if activity not in scores:
            scores[activity] = []
        # Concatenate the available text fields for this website
        text = row[2]+" "+row[3]
        if len(row) == 6:
            text += " "+row[5]
        test = mostProbable(WVmodel, activityVecs, text)
        # A prediction counts as correct when the best-scoring index equals the activity id
        scores[activity].append(str(np.argmax(test))==activity)
    except IndexError:
        # Rows with missing columns are reported and skipped
        print(len(row))
        continue
total = [0, 0]
for i in scores:
    subset = scores[i][:4]  # score only the first four websites per activity
    total[0] += sum(subset)
    total[1] += len(subset)
    print("Correctly predicted "+str(sum(subset))+" of "+str(len(subset))+" websites for "+str(activityNames[i]))
print("Predicting success total at "+str(total[0]/total[1]*100)+"%")

  


Correctly predicted 4 of 4 websites for boston tea party ships and museum boston
Correctly predicted 0 of 4 websites for trapology boston boston
Correctly predicted 2 of 4 websites for jacques cabaret boston
Correctly predicted 4 of 4 websites for haunted boston ghost tours boston
Correctly predicted 1 of 4 websites for escape the room boston boston
Correctly predicted 0 of 4 websites for the lawn on d boston
Correctly predicted 3 of 4 websites for charles river canoe and kayak cambridge 2
Correctly predicted 2 of 4 websites for urban axes somerville somerville
Correctly predicted 1 of 4 websites for cambridge center roof garden cambridge
Correctly predicted 2 of 4 websites for lucky strike somerville somerville
Correctly predicted 4 of 4 websites for the esplanade boston 2
Correctly predicted 3 of 4 websites for boxaroo boston 6
Correctly predicted 1 of 4 websites for chez vous roller skating rink boston
Predicting success total at 51.92307692307693%


# Data Explanation
The model predicted about half of the websites correctly (27 of 52, or 51.9%). Given the limited data set, I was happy with the results, especially since the model took only about 10 seconds to train. The biggest limitation of this model is its vocabulary: many of the words in non-training samples are not found in it. With limited data, it is also especially hard to make predictions on data formatted differently from the training data. In this case, I simply concatenated all the information provided by Google for each website. Ideally, I would attempt this again by training on a wider variety of information and by pre-classifying similar activities. The training set contained 3 escape room activities, so it is no wonder that the model performed poorly on most of them. The descriptions of the lawn on d and cambridge center roof garden are also difficult to distinguish (even by hand, once the names were taken out).
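
One way to quantify the vocabulary limitation is to measure what fraction of a sample's tokens the model actually knows. The sketch below is a hypothetical helper, not part of the notebook; `vocabulary_coverage`, the toy vocabulary, and the sample tokens are all assumptions made for illustration.

```python
def vocabulary_coverage(tokens, vocabulary):
    """Return the fraction of tokens found in vocabulary (0.0 for an empty token list)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return known / len(tokens)

# Made-up vocabulary and website snippet tokens, for illustration only
toy_vocabulary = {"escap", "room", "puzzl", "fun", "boston"}
snippet_tokens = ["book", "escap", "room", "onlin", "boston", "ticket"]

print(vocabulary_coverage(snippet_tokens, toy_vocabulary))
```

With the notebook's model, the vocabulary argument would be `WVmodel.wv` and the tokens would come from `masterTokenizer`, so low coverage on a website snippet would flag predictions that rest on very few known words.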

In a model that attempts to classify activities from people's lives, it will be important to collect a time and location stamp to help strengthen the grouping of activities that belong together. I propose first collecting a substantial database of journals and information relating to the activities of the people who journaled. I would then group the information by location and time, train one model simply to weed out dissimilar data, and train a separate model to recognize data of a similar type. Importantly, the approach for cleaning the data and the approach for training the final model will need to differ, and both require more thought.
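
As a rough illustration of the first grouping step, the sketch below buckets entries by a rounded location and by calendar day. The record fields (`lat`, `lon`, `timestamp`, `text`), the rounding precision, and the sample entries are all assumptions; the proposal above does not specify a data format, and a real pipeline would likely need a more careful notion of "same place".

```python
from collections import defaultdict
from datetime import datetime

def group_by_place_and_day(entries, precision=2):
    """Group entry texts whose rounded coordinates and calendar date match."""
    groups = defaultdict(list)
    for entry in entries:
        key = (
            round(entry["lat"], precision),
            round(entry["lon"], precision),
            entry["timestamp"].date(),
        )
        groups[key].append(entry["text"])
    return dict(groups)

# Made-up journal entries, for illustration only
entries = [
    {"lat": 42.3591, "lon": -71.0497, "timestamp": datetime(2021, 6, 5, 10, 30), "text": "Saw the penguins"},
    {"lat": 42.3593, "lon": -71.0499, "timestamp": datetime(2021, 6, 5, 11, 15), "text": "Touch tank was crowded"},
    {"lat": 42.3467, "lon": -71.0972, "timestamp": datetime(2021, 6, 6, 19, 0), "text": "Great game tonight"},
]

groups = group_by_place_and_day(entries)
print(len(groups))
```

The first two entries share a rounded location and a date, so they land in one group, while the third forms its own. Each group could then be passed to the filtering and classification models as a single candidate activity.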