<a href="https://colab.research.google.com/github/hembrickad/HoloAAC/blob/swati_files/updatedWeights_conv_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# For questions about this code, contact the owner: Swati Rampalli (rampa009@umn.edu).
# See the "NOTE:" comments for known issues and suggestions.

# This conversation retrieval script needs several dependencies installed before it will run.
# NOTE: spaCy took a very long time to download locally, so I used this Colab document for now.
# spaCy is used primarily for its comprehensive stop word list, but if it is infeasible
# to download locally, you can switch to a different stop word removal package.

# NOTE: I created this dataset by hand, so it has only 100 sentences. For further testing, you might
# want to add more sentences. An initial problem was that many words received the same weight,
# which made the weights uninformative. The word frequencies in the dataset were VERY
# similar, so after the log function that TF-IDF applies to compute the weights, the weights came out
# nearly identical. A significantly larger dataset would probably solve this (or at least work around it), but
# make sure the frequencies of the non-stop words vary significantly.

# NOTE: We connect Unity and Python with Upython, which requires a local Python file to run. This
# is a Google Colab document (I used it because of spaCy, see above). You need to either download spaCy locally or use
# a different stop word removal package.

# NOTE: If this script is run locally, you might need to add the directory where your Python packages are installed
# at the top of the file (for example, with sys.path.append(<path to your packages>)). Otherwise, IDLE will not be able
# to find the "nltk" package. This usually happened because of conflicting paths between Anaconda and IDLE.


!pip install spacy
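
The NOTE about conflicting Anaconda and IDLE paths can be handled along these lines when the script runs locally. This is only a minimal sketch: `PACKAGE_DIR` is a hypothetical placeholder, not a path from this project, and must be replaced with the directory where `nltk` and the other packages are actually installed.

```python
import sys

# Hypothetical location (an assumption): replace with your own package directory.
PACKAGE_DIR = "/path/to/your/site-packages"

# Put the directory at the front of the module search path so that
# "import nltk" resolves there before any conflicting installation.
if PACKAGE_DIR not in sys.path:
    sys.path.insert(0, PACKAGE_DIR)

print(sys.path[0])
```

This has to run before the first `import nltk` in the file, otherwise the import is attempted against the old search path.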



In [2]:
!pip install nltk



In [3]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [5]:
import random
from heapq import nlargest

import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

wordnet_lemmatizer = WordNetLemmatizer()
sw_nltk = stopwords.words('english')
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words

# this function determines whether a word is an adverb, noun, etc. so the lemmatizer can be given the right part of speech
# refer to this link for info on lemmatization and how it differs from stemming: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/ 
# NOTE: stemming does not always return a real English word AND does not account for context/meaning, which is why I used a lemmatizer instead
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# this "keyword" is the object that was detected like "apple, milk, etc." and is initially used
# to just gather ALL sentences containing the keyword - we want this initially so we can better 
# refine that list with additional keywords later on 
keyword = "milk"

return_sentences = []

# mounting Google Drive is specific to the Colab document; if this file is run locally, replace
# the path below with a local path
# the "with" block closes the file as soon as the sentences have been read
with open("/content/gdrive/My Drive/Colab Notebooks/grocery_store_sentences.txt", "r", encoding='utf-8-sig') as sentences_file:
    sentences_list = sentences_file.readlines()
stripped_sentences_list = []
returned_sentences_list = []
stemmed_sentence_list = []

punctuations="?:!.,;"
lemmatizer = WordNetLemmatizer()



for sentence in sentences_list:
    strip = sentence.strip()
    if keyword in strip:
        stripped_sentences_list.append(strip)

# I used this example list for testing purposes because I was getting the same weights 
# for the words, indicating that words like "nonfat" and "bag" had the same frequency, 
# even though "bag" was used a lot more 

#stripped_sentences_list = ['Can I weigh this apple',
#                           'Can I weigh this avocado',
#                           'Can I weigh this banana',
#                           'Can I weigh this orange',
#                           'Can I weigh this tomato',
#                           'Can I weigh this ketchup',
#                           'Can I weigh this soda',
#                           'Can I weigh this drink',
#                           'Can I buy this mango',
#                           'Can I buy this water']

# for preprocessing, I used the WordNet lemmatizer and spaCy stop word removal, which reduce
# each sentence to its root words (e.g. "playing" becomes "play") and remove words that carry little meaning (like "this", "and")

for sentence in stripped_sentences_list:
    words_to_join = []
    possible_adj = []
    stemmed_list = nltk.word_tokenize(sentence)
    # drop ALL punctuation tokens because the lemmatizer does not recognize them;
    # build a new list instead of calling remove() while iterating, which skips the token after each removal
    stemmed_list = [word for word in stemmed_list if word not in punctuations]
    tokens_without_stop = [word for word in stemmed_list if word.lower() not in sw_spacy]

# this step is a little extra because I was having problems with lemmatizing adverbs earlier: a single adverb can have
# several different pertainyms - see this page for an example: http://wordnetweb.princeton.edu/perl/webwn?s=separately
# I narrowed the candidates down by keeping those that share their first three letters with the word of interest,
# so for "separately", whose pertainyms include "individual", "separate" and "single", "separate" is the one we keep

    for word in tokens_without_stop:
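        synsets = wordnet.synsets(word)
        # treat the word as an adverb only if the POS tagger and its first WordNet synset both say so;
        # the "synsets" check avoids an IndexError for words that WordNet does not know
        if get_wordnet_pos(word) == wordnet.ADV and synsets and synsets[0].pos() == wordnet.ADV:
            for ss in synsets: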
                for lemmas in ss.lemmas(): # all possible lemmas
                    for ps in lemmas.pertainyms(): # all possible pertainyms
                        possible_adj.append(ps.name())
            for adj in possible_adj:
                if adj[0:3] == word[0:3]:
                    words_to_join.append(adj)
        else:
            words_to_join.append(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

    lemmatized_output = ' '.join(words_to_join)        
    #lemmatized_output = ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens_without_stop])
    stemmed_sentence_list.append(lemmatized_output)
    

#vectorizer = TfidfVectorizer()
#X = vectorizer.fit(stemmed_sentence_list)
#print(vectorizer.vocabulary_)
##
#### figure out how to strip all punctuation - maybe beforehand, when
#### reading the sentences from the file?
##
#### only 100 sentence dataset - maybe too small to accurately represent what
#### the proper distinguishing words are for each class?
##
##
###instantiate CountVectorizer() 
cv=CountVectorizer(token_pattern='(?u)\\b\\w+\\b') 

# this step generates word counts for each word in the sentences and assigns an IDF weight to each word
# NOTE: words that appear in more sentences get smaller weights than words that appear in fewer sentences

word_count_vector=cv.fit_transform(stemmed_sentence_list)
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)
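# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() is its replacement
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])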

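# gets the 5 words with the highest weights, i.e. the most "distinguishing" words
# sort_values() returns a new DataFrame, so its result has to be used directly;
# df_idf itself stays in its original (alphabetical) order

top_words = list(df_idf.sort_values(by=['idf_weights']).tail(5).index)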


# this is where we refine the initial list of sentences: the user can select additional keywords after the
# original object class to narrow the list down.
# I also assigned "fake" click counts from 1 to 5, generated with the random library, to simulate user selection history as another
# ranking parameter. For example, if three sentences match, the one with the highest click count is
# chosen

additional_keywords = "bag"
refined_sentence_dict = {}
stripped_dict = {}

# a simulated list of all the sentences with the object name and their "click counts"
# this is intended to show how the user's previous selection history can be accounted for 
for item in stripped_sentences_list:
  stripped_dict[item] = random.randint(1,5)

# the narrowed list of sentences based on the additional keyword(s) provided
for sentence in stripped_dict:
  if additional_keywords in sentence:
    refined_sentence_dict[sentence] = stripped_dict[sentence]

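# sentence with the highest click count out of the narrowed list of sentences
# (default=None avoids a ValueError when no sentence contains the additional keyword)
max_key = max(refined_sentence_dict, key=refined_sentence_dict.get, default=None)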

# to view the list of weights, type "df_idf"
# to view the list of sentences JUST with the object name and their click counts, type "stripped_dict"
# to view the list of refined sentences BASED on the additional keyword(s), type "refined_sentence_dict"
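
The punctuation and stop word filtering in the cell above can be sketched on its own without spaCy or NLTK. This is an illustration under stated assumptions, not the notebook's pipeline: `STOP_WORDS` is a tiny hand-picked set standing in for spaCy's much larger list, the tokens are written by hand rather than produced by `nltk.word_tokenize`, and `filter_tokens` is a hypothetical helper name. It also shows why removing items from a list while looping over it is unreliable.

```python
PUNCTUATION = set("?:!.,;")

# Tiny illustrative stop word set (an assumption); spaCy's list is far larger.
STOP_WORDS = {"could", "can", "you", "the", "and", "in", "do", "have", "this"}


def filter_tokens(tokens, stop_words=STOP_WORDS, punctuation=PUNCTUATION):
    """Return a new list without punctuation tokens and stop words."""
    return [
        token
        for token in tokens
        if token not in punctuation and token.lower() not in stop_words
    ]


# Hand-written tokens with two adjacent punctuation marks at the end.
tokens = ["Could", "you", "double", "bag", "the", "milk", "gallons", "?", "!"]

# Removing while iterating: after "?" is removed, "!" shifts into its slot
# and the loop ends without ever visiting it.
buggy = list(tokens)
for token in buggy:
    if token in PUNCTUATION:
        buggy.remove(token)

print(buggy)
print(filter_tokens(tokens))
```

Building a new list keeps the input untouched and visits every token exactly once, which is why a comprehension is the safer choice here.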

In [None]:
##unique_sentence_weights = {}
###for all the sentences containing the keyword                
##for basic_sent in returned_sentences_list:
##    unique_sentence_weights[basic_sent] = 0
##    #compute the sum of the idfs for a sentence with no duplicate words
##    #make the sentence unique first (remove duplicate words like "the")
##    basic_sent_split = basic_sent.lower().strip("?.’").split(" ")
##    for word in basic_sent_split:
##        if word not in unique_sentence_weights:
##            unique_sentence_weights[basic_sent] = unique_sentence_weights[basic_sent]+ float(df_idf.loc[[word.lower()], 'idf_weights'][word.lower()])
##        
##highest_sen = nlargest(4, unique_sentence_weights, key = unique_sentence_weights.get)              
##     
##
##sentences_file.close()
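
The commented-out cell above appears to aim at ranking sentences by the summed IDF weights of their distinct words. A minimal sketch of that idea follows, assuming the weights are copied by hand from the `df_idf` output of this notebook and the sentences are already reduced to lemmatized, stop-word-free tokens. `sentence_score` and `candidates` are hypothetical names introduced for illustration.

```python
from heapq import nlargest

# Weights copied from the df_idf output of this notebook.
idf_weights = {
    "bag": 2.321756,
    "double": 3.014903,
    "find": 3.014903,
    "gallon": 2.098612,
    "milk": 1.0,
    "nonfat": 2.321756,
    "organic": 3.014903,
}


def sentence_score(tokens, weights):
    """Sum the IDF weight of each distinct token; unknown tokens count as 0."""
    return sum(weights.get(token, 0.0) for token in set(tokens))


# Sentences mapped to hand-written, already-lemmatized tokens (an assumption).
candidates = {
    "Could you double bag the milk gallons?": ["double", "bag", "milk", "gallon"],
    "Do you have organic milk?": ["organic", "milk"],
    "Where can I find nonfat milk?": ["find", "nonfat", "milk"],
}

scores = {s: sentence_score(t, idf_weights) for s, t in candidates.items()}

# The two sentences with the highest summed weight, best first.
top = nlargest(2, scores, key=scores.get)
print(top)
```

Using `set(tokens)` means a repeated word such as "milk" contributes only once, matching the "no duplicate words" intent stated in the commented code.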

In [6]:
df_idf

         idf_weights
1           2.609438
2           2.609438
bag         2.321756
double      3.014903
expire      3.014903
find        3.014903
gallon      2.098612
milk        1.000000
nonfat      2.321756
organic     3.014903
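
The weights above can be reproduced by hand. With `smooth_idf=True`, scikit-learn documents the weight of a term as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below assumes `n = 14`, the number of "milk" sentences listed in `stripped_dict`; `smoothed_idf` is a hypothetical helper name, not part of scikit-learn.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """IDF with add-one smoothing, as documented for smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n = 14  # number of sentences containing the keyword "milk"

# df = 14 corresponds to "milk", 4 to "gallon", 3 to "bag" and "nonfat",
# 2 to "1" and "2", and 1 to words such as "double" or "organic".
for df in (14, 4, 3, 2, 1):
    print(df, round(smoothed_idf(n, df), 6))
```

Because the weight depends only on `df`, every word that occurs in the same number of sentences gets exactly the same weight. That is why "bag" and "nonfat" tie here, and why a small dataset with similar word frequencies yields few distinct weights.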


In [7]:
stripped_dict

{'Can you put the 2% and 1% milk in separate bags?': 1,
 'Can you put the whole milk and skim milk in separate bags?': 3,
 'Could you double bag the milk gallons?': 2,
 'Do you have 2% milk?': 4,
 'Do you have more of the 1% milk?': 5,
 'Do you have organic milk?': 1,
 'Do you have skim milk?': 3,
 'What is the price of a milk gallon?': 4,
 'What is the price of this milk gallon?': 2,
 'What is the price of this nonfat milk gallon?': 1,
 'What is the price of this skim milk?': 5,
 'When does this nonfat milk expire?': 3,
 'Where can I find nonfat milk?': 1,
 'Where is the whole milk?': 3}

In [8]:
refined_sentence_dict

{'Can you put the 2% and 1% milk in separate bags?': 1,
 'Can you put the whole milk and skim milk in separate bags?': 3,
 'Could you double bag the milk gallons?': 2}

In [9]:
max_key

'Can you put the whole milk and skim milk in separate bags?'
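
The refinement step can be summarized as a small function: keep the sentences that contain the additional keyword, then return the one with the highest click count. This is a sketch rather than the notebook's code. `best_match` is a hypothetical name, and the click counts are copied from the simulated `stripped_dict` output above, so a different random run would give different values.

```python
def best_match(click_counts, extra_keyword):
    """Return the most-clicked sentence containing extra_keyword, or None."""
    refined = {
        sentence: clicks
        for sentence, clicks in click_counts.items()
        if extra_keyword in sentence  # substring match, so "bag" also matches "bags"
    }
    if not refined:
        return None
    return max(refined, key=refined.get)


# A subset of the simulated click counts shown in the stripped_dict output.
click_counts = {
    "Can you put the 2% and 1% milk in separate bags?": 1,
    "Can you put the whole milk and skim milk in separate bags?": 3,
    "Could you double bag the milk gallons?": 2,
    "Do you have organic milk?": 1,
}

print(best_match(click_counts, "bag"))
print(best_match(click_counts, "cheese"))
```

Returning `None` for an unmatched keyword lets the caller fall back to the unrefined list instead of failing on an empty dictionary.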