import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    # Split the document into word tokens
    token = nltk.word_tokenize(doc)
    # Tag each token with its part of speech
    tag = nltk.pos_tag(token)
    # Convert the nltk POS tags into WordNet POS tags
    nltk2wordnet = [(i[0], convert_tag(i[1])) for i in tag]
    # Keep the first synset of each word/tag pair; skip pairs with no synsets
    output = []
    for word, pos in nltk2wordnet:
        synsets = wn.synsets(word, pos)
        if synsets:
            output.append(synsets[0])
    return output


def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2.

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sums all of the largest similarity values and normalizes this value by dividing
    it by the number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    list1 = []
    # For each synset in s1
    for a in s1:
        # Score it against every synset in s2; path_similarity may return None
        scores = [i.path_similarity(a) for i in s2]
        scores = [score for score in scores if score is not None]
        # Skip synsets with no comparable match, since max() fails on an empty list
        if scores:
            list1.append(max(scores))
    # Normalize by the number of largest similarity values found
    # (raises ZeroDivisionError if none were found)
    output = sum(list1) / len(list1)
    return output


def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""
    # Convert each document into its list of synsets
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    # Average the two directional scores so the result is symmetric
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2


def test_document_path_similarity():
    doc1 = 'This is a function to test document_path_similarity.'
    doc2 = ('Use this function to see if your code in doc_to_synsets '
            'and similarity_score is correct!')
    return document_path_similarity(doc1, doc2)
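The directional score depends on argument order, which is why document_path_similarity averages both directions. The following is a minimal sketch of that idea without WordNet: it swaps path_similarity for a hypothetical toy_similarity over plain strings (toy_similarity, TOY_SCORES, directional_score and symmetric_score are illustrative names and made-up values, not part of NLTK or of the code above).

```python
# Hypothetical stand-in for Synset.path_similarity: identical items score 1.0,
# listed pairs use a made-up value, and every other pair is not comparable (None).
TOY_SCORES = {frozenset(('cat', 'dog')): 0.5}


def toy_similarity(a, b):
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset((a, b)))


def directional_score(s1, s2):
    """Mean of the best match in s2 for each item of s1, skipping items with no match."""
    best = []
    for a in s1:
        scores = [toy_similarity(a, b) for b in s2]
        scores = [score for score in scores if score is not None]
        if scores:
            best.append(max(scores))
    return sum(best) / len(best)


def symmetric_score(s1, s2):
    """Average of the two directional scores, so argument order no longer matters."""
    return (directional_score(s1, s2) + directional_score(s2, s1)) / 2


s1 = ['like', 'cat']
s2 = ['like', 'dog', 'cat', 'run']
print(directional_score(s1, s2))  # every item of s1 has an exact match in s2
print(directional_score(s2, s1))  # 'dog' matches only partially, 'run' is skipped
print(symmetric_score(s1, s2))
```

Here 'run' has no comparable item in s1, so it is left out of both the sum and the count, mirroring the "number of largest similarity values found" normalization described in the similarity_score docstring.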
Assign Scores to New Documents
# paraphrases is a DataFrame which contains the following columns: Quality, D1, and D2.
# Quality is an indicator variable which indicates if the two documents D1 and D2
# are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).
import numpy as np


def most_similar_docs():
    def func(x):
        # Score one row; any failure (e.g. no comparable synsets) becomes NaN
        try:
            return document_path_similarity(x['D1'], x['D2'])
        except Exception:
            return np.nan

    # Note: this adds a similarity_score column to the paraphrases DataFrame
    paraphrases['similarity_score'] = paraphrases.apply(func, axis=1)
    # Sort by score, highest first (NaN scores sort last), and keep the top row
    df = paraphrases.sort_values('similarity_score', ascending=False)[:1]
    # Drop Quality, keeping D1, D2 and similarity_score
    df = df[['D1', 'D2', 'similarity_score']]
    # Convert the remaining row to an array, then to a tuple
    output = tuple(df.values[0])
    return output
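As an alternative to sorting the whole frame just to read its first row, the best pair can be picked directly with idxmax. This is a minimal sketch on a hypothetical stand-in for paraphrases: the rows and scores in toy are made up for illustration, and only the column names come from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `paraphrases` with precomputed, made-up scores.
# The NaN mimics a row whose similarity could not be computed.
toy = pd.DataFrame({
    'Quality': [1, 0, 1],
    'D1': ['a b', 'c d', 'e f'],
    'D2': ['a c', 'x y', 'e f g'],
    'similarity_score': [0.4, np.nan, 0.9],
})

# idxmax skips NaN by default and returns the index label of the largest score
best_label = toy['similarity_score'].idxmax()

# Select the wanted columns by name, so the result does not depend on column order
best_row = toy.loc[best_label, ['D1', 'D2', 'similarity_score']]
result = tuple(best_row)
print(result)
```

Both approaches should pick the same row when the top score is unique; idxmax simply avoids sorting every row.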