# NLTK Project: Analysis of Snippets

The aim of this project is to design an approach that uses Google and MSN (Bing) search snippets to compute the semantic similarity between sentences. Given two sentences S1 and S2, the key idea is to submit each sentence to the search engine as a query and measure the overlap between the returned snippets.

Seminar report date: 11.12.2018.
Project delivery deadline: 7.1.2018.

1.	Define two sentences S1 and S2.

  S1: "Several research groups have discovered new pharmaceuticals from nordic berries."
  
  S2: "New substances discovered from fruits, vegetables and berries have positive health effects."
 

In [104]:
import nltk
import string
import math 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
import pandas as pd
import re
import sys
import numpy
numpy.set_printoptions(threshold=sys.maxsize)

pd.set_option('display.max_columns', None)      # or 1000
pd.set_option('display.max_rows', None)         # or 1000
pd.set_option('display.max_colwidth', None)     # no truncation (-1 is deprecated in pandas >= 1.0)



# snippets / sentences examples ##########

document_0 = " 'Several research-groups have discovered new@ pharmaceuticals; from nordic berries.' '"
document_1 = "New pharmaceuticals discovered from northern plants and berries have -positive- health effects."
document_2 = "Several substances of nordic berries that prevent the growth of cancer and microbe cells: flavonoids, peptides and phenolics."
document_3 = "Antimicrobial peptides of nordic 'crowberry' prevent microbe growth in humans."
document_4 = "New pharmaceutically active compounds of forestical plants were discovered to have antimicrobial effects on highly resistant infections."
document_5 = "New pharmaceutically active peptides discovered from crowberry prevent the growth of highly resistant hospital infections"

all_documents = [document_0, document_1, document_2, document_3, document_4, document_5]

data_all = pd.DataFrame(all_documents)
data_all.columns = ['Academic sentence - short example']
display(data_all.head())  

low_documents = []
for document in all_documents:
    low_documents.append(document.lower())
    
data_low = pd.DataFrame(low_documents)
data_low.columns = ['Lower case sentence']
display(data_low.head())  
    
# tokenization by split # Sentences Tokenized into Words - split by whitespace

sentences_documents = []
#document_counter = 0
for document in low_documents:
    sentences_documents.append(document.split())

# join each token list into one comma-separated string for display
printableList1 = [', '.join(sentence1) for sentence1 in sentences_documents]

data_sentences1 = pd.DataFrame(printableList1)
data_sentences1.columns = ['Sentence tokenized into words   - string form and comma separated for display']
display(data_sentences1.head())     


Unnamed: 0,Academic sentence - short example
0,'Several research-groups have discovered new@ pharmaceuticals; from nordic berries.' '
1,New pharmaceuticals discovered from northern plants and berries have -positive- health effects.
2,"Several substances of nordic berries that prevent the growth of cancer and microbe cells: flavonoids, peptides and phenolics."
3,Antimicrobial peptides of nordic 'crowberry' prevent microbe growth in humans.
4,New pharmaceutically active compounds of forestical plants were discovered to have antimicrobial effects on highly resistant infections.


Unnamed: 0,Lower case sentence
0,'several research-groups have discovered new@ pharmaceuticals; from nordic berries.' '
1,new pharmaceuticals discovered from northern plants and berries have -positive- health effects.
2,"several substances of nordic berries that prevent the growth of cancer and microbe cells: flavonoids, peptides and phenolics."
3,antimicrobial peptides of nordic 'crowberry' prevent microbe growth in humans.
4,new pharmaceutically active compounds of forestical plants were discovered to have antimicrobial effects on highly resistant infections.


Unnamed: 0,Sentence tokenized into words - string form and comma separated for display
0,"'several, research-groups, have, discovered, new@, pharmaceuticals;, from, nordic, berries.', '"
1,"new, pharmaceuticals, discovered, from, northern, plants, and, berries, have, -positive-, health, effects."
2,"several, substances, of, nordic, berries, that, prevent, the, growth, of, cancer, and, microbe, cells:, flavonoids,, peptides, and, phenolics."
3,"antimicrobial, peptides, of, nordic, 'crowberry', prevent, microbe, growth, in, humans."
4,"new, pharmaceutically, active, compounds, of, forestical, plants, were, discovered, to, have, antimicrobial, effects, on, highly, resistant, infections."


In [95]:

# change compound words to separate words, e.g. 'conditional-statements' -> 'conditional', 'statements'
print("\n" 'Single words' "\n")
compound_separator = re.compile("[-_]")   # compiled once, outside the loops
single_word_documents = []
for sentence_words in sentences_documents:
    single_word_list = []
    for word in sentence_words:
        # replace hyphens/underscores with spaces, then split into the separate words
        single_word_list.extend(compound_separator.sub(' ', word).split())
    single_word_documents.append(single_word_list)
print(single_word_documents)

# join each token list into one comma-separated string for display
printableList2 = [', '.join(sentence2) for sentence2 in single_word_documents]

data_sentences2 = pd.DataFrame(printableList2)
data_sentences2.columns = ['Single words   - string form and comma separated for display']
display(data_sentences2.head())      
    
    
    
# remove all tokens that are not alphabetic #############
print("\n" 'Tokenized with alphabetic chars only' "\n")
alpha_documents = []
for single_word_sentence in single_word_documents:
    cleaned_list = []
    for single_word in single_word_sentence:
        regex = re.compile('[^a-zA-Z]')
        #First parameter is the replacement, second parameter is your input string
        nonAlphaRemoved = regex.sub('', single_word)
        # add string to list only if it has content
        if nonAlphaRemoved:
            cleaned_list.append(nonAlphaRemoved)
    alpha_documents.append(cleaned_list)
print(alpha_documents)

# join each token list into one comma-separated string for display
printableList3 = [', '.join(sentence3) for sentence3 in alpha_documents]

data_sentences3 = pd.DataFrame(printableList3)
data_sentences3.columns = ['Tokenized with alphabetic chars only   - string form and comma separated for display']
display(data_sentences3.head())     


# filter out stopwords ########
print("\n" 'English stopwords filtered tokens' "\n")
stop_filtered_tokens = []
english_stop_words = set(stopwords.words('english'))

for fword in alpha_documents:
    # keep only the tokens that are not English stopwords
    fword_list = [sword for sword in fword if sword not in english_stop_words]
    stop_filtered_tokens.append(fword_list)
print(stop_filtered_tokens)  


# join each token list into one comma-separated string for display
printableList4 = [', '.join(sentence4) for sentence4 in stop_filtered_tokens]

data_sentences4 = pd.DataFrame(printableList4)
data_sentences4.columns = ['English stopwords filtered tokens   - comma separated for display']
display(data_sentences4.head())     


# stemming with PorterStemmer ############
print("\n" 'Word Stemming by PorterStemmer' "\n")
PS = PorterStemmer()   # one stemmer instance is enough for all documents
porter_documents = []
for ps_word_list in stop_filtered_tokens:
    # reduce every remaining token to its Porter stem
    porter_documents.append([PS.stem(ps_word) for ps_word in ps_word_list])
print(porter_documents)

# join each stem list into one comma-separated string for display
printableList5 = [', '.join(sentence5) for sentence5 in porter_documents]

data_sentences5 = pd.DataFrame(printableList5)
data_sentences5.columns = ['Word Stemming by PorterStemmer   - comma separated for display']
display(data_sentences5.head())     


Single words

[["'several", 'research', 'groups', 'have', 'discovered', 'new@', 'pharmaceuticals;', 'from', 'nordic', "berries.'", "'"], ['new', 'pharmaceuticals', 'discovered', 'from', 'northern', 'plants', 'and', 'berries', 'have', 'positive', 'health', 'effects.'], ['several', 'substances', 'of', 'nordic', 'berries', 'that', 'prevent', 'the', 'growth', 'of', 'cancer', 'and', 'microbe', 'cells:', 'flavonoids,', 'peptides', 'and', 'phenolics.'], ['antimicrobial', 'peptides', 'of', 'nordic', "'crowberry'", 'prevent', 'microbe', 'growth', 'in', 'humans.'], ['new', 'pharmaceutically', 'active', 'compounds', 'of', 'forestical', 'plants', 'were', 'discovered', 'to', 'have', 'antimicrobial', 'effects', 'on', 'highly', 'resistant', 'infections.'], ['new', 'pharmaceutically', 'active', 'peptides', 'discovered', 'from', 'crowberry', 'prevent', 'the', 'growth', 'of', 'highly', 'resistant', 'hospital', 'infections']]


Unnamed: 0,Single words - string form and comma separated for display
0,"'several, research, groups, have, discovered, new@, pharmaceuticals;, from, nordic, berries.', '"
1,"new, pharmaceuticals, discovered, from, northern, plants, and, berries, have, positive, health, effects."
2,"several, substances, of, nordic, berries, that, prevent, the, growth, of, cancer, and, microbe, cells:, flavonoids,, peptides, and, phenolics."
3,"antimicrobial, peptides, of, nordic, 'crowberry', prevent, microbe, growth, in, humans."
4,"new, pharmaceutically, active, compounds, of, forestical, plants, were, discovered, to, have, antimicrobial, effects, on, highly, resistant, infections."



Tokenized with alphabetic chars only

[['several', 'research', 'groups', 'have', 'discovered', 'new', 'pharmaceuticals', 'from', 'nordic', 'berries'], ['new', 'pharmaceuticals', 'discovered', 'from', 'northern', 'plants', 'and', 'berries', 'have', 'positive', 'health', 'effects'], ['several', 'substances', 'of', 'nordic', 'berries', 'that', 'prevent', 'the', 'growth', 'of', 'cancer', 'and', 'microbe', 'cells', 'flavonoids', 'peptides', 'and', 'phenolics'], ['antimicrobial', 'peptides', 'of', 'nordic', 'crowberry', 'prevent', 'microbe', 'growth', 'in', 'humans'], ['new', 'pharmaceutically', 'active', 'compounds', 'of', 'forestical', 'plants', 'were', 'discovered', 'to', 'have', 'antimicrobial', 'effects', 'on', 'highly', 'resistant', 'infections'], ['new', 'pharmaceutically', 'active', 'peptides', 'discovered', 'from', 'crowberry', 'prevent', 'the', 'growth', 'of', 'highly', 'resistant', 'hospital', 'infections']]


Unnamed: 0,Tokenized with alphabetic chars only - string form and comma separated for display
0,"several, research, groups, have, discovered, new, pharmaceuticals, from, nordic, berries"
1,"new, pharmaceuticals, discovered, from, northern, plants, and, berries, have, positive, health, effects"
2,"several, substances, of, nordic, berries, that, prevent, the, growth, of, cancer, and, microbe, cells, flavonoids, peptides, and, phenolics"
3,"antimicrobial, peptides, of, nordic, crowberry, prevent, microbe, growth, in, humans"
4,"new, pharmaceutically, active, compounds, of, forestical, plants, were, discovered, to, have, antimicrobial, effects, on, highly, resistant, infections"



English stopwords filtered tokens

[['several', 'research', 'groups', 'discovered', 'new', 'pharmaceuticals', 'nordic', 'berries'], ['new', 'pharmaceuticals', 'discovered', 'northern', 'plants', 'berries', 'positive', 'health', 'effects'], ['several', 'substances', 'nordic', 'berries', 'prevent', 'growth', 'cancer', 'microbe', 'cells', 'flavonoids', 'peptides', 'phenolics'], ['antimicrobial', 'peptides', 'nordic', 'crowberry', 'prevent', 'microbe', 'growth', 'humans'], ['new', 'pharmaceutically', 'active', 'compounds', 'forestical', 'plants', 'discovered', 'antimicrobial', 'effects', 'highly', 'resistant', 'infections'], ['new', 'pharmaceutically', 'active', 'peptides', 'discovered', 'crowberry', 'prevent', 'growth', 'highly', 'resistant', 'hospital', 'infections']]


Unnamed: 0,English stopwords filtered tokens - comma separated for display
0,"several, research, groups, discovered, new, pharmaceuticals, nordic, berries"
1,"new, pharmaceuticals, discovered, northern, plants, berries, positive, health, effects"
2,"several, substances, nordic, berries, prevent, growth, cancer, microbe, cells, flavonoids, peptides, phenolics"
3,"antimicrobial, peptides, nordic, crowberry, prevent, microbe, growth, humans"
4,"new, pharmaceutically, active, compounds, forestical, plants, discovered, antimicrobial, effects, highly, resistant, infections"



Word Stemming by PorterStemmer

[['sever', 'research', 'group', 'discov', 'new', 'pharmaceut', 'nordic', 'berri'], ['new', 'pharmaceut', 'discov', 'northern', 'plant', 'berri', 'posit', 'health', 'effect'], ['sever', 'substanc', 'nordic', 'berri', 'prevent', 'growth', 'cancer', 'microb', 'cell', 'flavonoid', 'peptid', 'phenol'], ['antimicrobi', 'peptid', 'nordic', 'crowberri', 'prevent', 'microb', 'growth', 'human'], ['new', 'pharmaceut', 'activ', 'compound', 'forest', 'plant', 'discov', 'antimicrobi', 'effect', 'highli', 'resist', 'infect'], ['new', 'pharmaceut', 'activ', 'peptid', 'discov', 'crowberri', 'prevent', 'growth', 'highli', 'resist', 'hospit', 'infect']]


Unnamed: 0,Word Stemming by PorterStemmer - comma separated for display
0,"sever, research, group, discov, new, pharmaceut, nordic, berri"
1,"new, pharmaceut, discov, northern, plant, berri, posit, health, effect"
2,"sever, substanc, nordic, berri, prevent, growth, cancer, microb, cell, flavonoid, peptid, phenol"
3,"antimicrobi, peptid, nordic, crowberri, prevent, microb, growth, human"
4,"new, pharmaceut, activ, compound, forest, plant, discov, antimicrobi, effect, highli, resist, infect"


In [102]:
# define jaccard similarity for python ################
def jaccard_similarity(query, jdoc):
    intersection = set(query).intersection(set(jdoc))
    union = set(query).union(set(jdoc))
    return len(intersection)/len(union)

# calculate jaccard similarity
print('Jaccard similarity')
result = jaccard_similarity(porter_documents[0], porter_documents[1])
j_string = "{:.4f}".format(result)
print(j_string)


#data_table1 = pd.DataFrame(tableJaccSim1)
# data_table1.columns = ['n', 'sentence', 'n', 'sentence', 'JaccardSim'  ]
#display(data_table1.head())

def listToString (sourceList):
    # comma-separated string form of a token list
    return ', '.join(sourceList)

# compare the first porter document to the rest in the porter docs list
def printJaccardSimilarities (porterDocs):    
    printableJaccardList = []    
    for porterIndex, porter in enumerate(porterDocs):
        if porterIndex > 0:            
            jresult = jaccard_similarity(porterDocs[0], porter)    
            j_string = "{:.4f}".format(jresult)
            porterParamStr1 = listToString(porterDocs[0])
            porterParamStr2 = listToString(porter)
            data = [0, porterParamStr1, porterIndex, porterParamStr2, j_string]
            printableJaccardList.append(data)
    df = pd.DataFrame(printableJaccardList,columns=['First Index','First Sentence','Second Index','Second Sentence','Jaccard similarity'])
    display(df.head()) 

printJaccardSimilarities(porter_documents)

    


Jaccard similarity
0.3077


Unnamed: 0,First Index,First Sentence,Second Index,Second Sentence,Jaccard similarity
0,0,"sever, research, group, discov, new, pharmaceut, nordic, berri",1,"new, pharmaceut, discov, northern, plant, berri, posit, health, effect",0.3077
1,0,"sever, research, group, discov, new, pharmaceut, nordic, berri",2,"sever, substanc, nordic, berri, prevent, growth, cancer, microb, cell, flavonoid, peptid, phenol",0.1765
2,0,"sever, research, group, discov, new, pharmaceut, nordic, berri",3,"antimicrobi, peptid, nordic, crowberri, prevent, microb, growth, human",0.0667
3,0,"sever, research, group, discov, new, pharmaceut, nordic, berri",4,"new, pharmaceut, activ, compound, forest, plant, discov, antimicrobi, effect, highli, resist, infect",0.1765
4,0,"sever, research, group, discov, new, pharmaceut, nordic, berri",5,"new, pharmaceut, activ, peptid, discov, crowberri, prevent, growth, highli, resist, hospit, infect",0.1765
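
As a hand check of the first row: the stemmed sets for sentences 0 and 1 contain 8 and 9 stems and share four of them (new, pharmaceut, discov, berri), so the Jaccard coefficient works out as follows.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{4}{8 + 9 - 4}
        = \frac{4}{13} \approx 0.3077
```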


2.	Use the Google search API and the MSN (Bing) search API to retrieve the first ten snippets associated with each sentence.

In [115]:
from googleapiclient.discovery import build
import pprint
import json

my_api_key = "YOUR_GOOGLE_API_KEY"   # placeholder: supply your own key and keep real credentials out of the notebook
my_cse_id = "YOUR_CSE_ID"            # placeholder: ID of your Custom Search Engine

searchTerms = 'nordic:berry:antibiotic'

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res.get('items', [])   # 'items' is missing from the response when nothing is found

results = google_search(
    searchTerms, my_api_key, my_cse_id, num=10)
# each result is already a dict; collect its 'snippet' text when present
googleSearchSnippetlist = [result['snippet'] for result in results if 'snippet' in result]
gSnippetDf = pd.DataFrame(googleSearchSnippetlist, columns=['Google search snippets for search terms : ' + searchTerms])
display(gSnippetDf.head()) 
    

Unnamed: 0,Google search snippets for search terms : nordic:berry:antibiotic
0,"Oct 10, 2018 ... PDF | Antimicrobial activity and mechanisms of phenolic extracts of 12 Nordic \nberries were studied against selected human pathogenic ..."
1,"Our popular Nordic Berries multivitamin give growing kids the essential vitamins \nand minerals they need, including vitamins A, B, C, D3, E, and zinc. They even ..."
2,"The content of total phenolics in berry pulp extracts varied from 20.4 to 35.5, .... (\n2006) have studied the antimicrobial activity of 12 Nordic berries against ..."
3,"Jul 27, 2016 ... Antibiotics could be called the cure and the cause but a new study suggests that \ncranberries is the way to go."
4,"BERRY DELICIOUS: Our award-winning Nordic Berries capture the distinct \nsweet-and-sour taste of Norwegian cloudberries, and deliver the same quality \nyou ..."


In [160]:
import requests

subscription_key = "YOUR_BING_SUBSCRIPTION_KEY"   # placeholder: supply your own key and keep real credentials out of the notebook
assert subscription_key

search_url = "https://api.cognitive.microsoft.com/bing/v7.0/search"
search_term = "nordic AND berry AND antibiotic"

headers = {"Ocp-Apim-Subscription-Key" : subscription_key}
params  = {"q": search_term, "textDecorations":True, "textFormat":"HTML"}
response = requests.get(search_url, headers=headers, params=params)
response.raise_for_status()
search_results = response.json()

# snippets are stored under webPages -> value -> [{..., "snippet": ...}, ...]
web_pages = search_results.get("webPages", {}).get("value", [])
bingSearchSnippetlist = [page["snippet"] for page in web_pages if "snippet" in page]
                                
bingdf = pd.DataFrame(bingSearchSnippetlist, columns=['Bing search snippets for search terms : ' + search_term])
display(bingdf.head()) 

Unnamed: 0,Bing search snippets for search terms : nordic AND berry AND antibiotic
0,• The use of NSAIDs and antibiotic ... Nordic Laboratories · Nygade 6 · 3.sal · 1164 Copenhagen K · Denmark ...
1,• Antibiotic use • Changing sexual mores ... Nordic Laboratories · Nygade 6 · 3.sal · 1164 Copenhagen K · Denmark ...
2,Nordic Naturals Nordic Berries. Sign in . Your Account. contact; ... Antibiotic/Steroid ... and cherry berry. Nordic Berries make an ideal companion to any Nordic ...
3,Nordic Naturals Nordic Berries Cherry Berry. Sign in . Your Account. ... Antibiotic/Steroid Combination ; ... Nordic Berries Cherry Berry
4,"Prophylactic antibiotics are an ... the time of administration of preoperative prophylactic antibiotic in relation to the ... Berry WR , Lipsitz SR ..."


3.	Design and implement a similarity measure that counts the words shared between the combined terms of the ten snippets retrieved for the first sentence S1 and those retrieved for the second sentence S2.

    Hint: loop over the S1 snippets and the S2 snippets, and measure the similarity between each S1 snippet and each S2 snippet.

In [None]:
#your code here.
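
One possible starting point for this step, offered as a minimal sketch rather than the expected solution. It pools the words of each snippet list and counts the shared ones, and it also builds the snippet-by-snippet matrix the hint asks for. The function names and the example snippets are made up for illustration; in the notebook the inputs would be the Google and Bing snippet lists, and the stopword removal and stemming from step 1 could be applied first.

```python
import re

def snippet_terms(snippets):
    """Lower-case the snippets and return the set of alphabetic words they contain."""
    terms = set()
    for snippet in snippets:
        terms.update(re.findall(r"[a-z]+", snippet.lower()))
    return terms

def overlap_count(snippets_1, snippets_2):
    """Number of distinct words shared by the pooled terms of two snippet lists."""
    return len(snippet_terms(snippets_1) & snippet_terms(snippets_2))

def pairwise_overlap(snippets_1, snippets_2):
    """Entry [i][j] is the shared-word count of snippet i of S1 and snippet j of S2."""
    return [[overlap_count([a], [b]) for b in snippets_2] for a in snippets_1]

# Hypothetical snippets, invented for this example (not real search results).
s1_snippets = ["Nordic berries contain phenolic compounds.",
               "New pharmaceuticals from berries."]
s2_snippets = ["Berries and vegetables have health effects.",
               "Phenolic compounds in fruits."]

print(overlap_count(s1_snippets, s2_snippets))     # pooled overlap
print(pairwise_overlap(s1_snippets, s2_snippets))  # one row per S1 snippet
```

Dividing the pooled count by the size of the union of both term sets would turn it into the Jaccard coefficient already used in step 1.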

5. Compare the result with the sentence semantic similarity that you saw in Lab 2.
    
    Hint: in Lab 2, WordNet was used to calculate sentence semantic similarity.

In [1]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

#example
def penn_to_wn(tag):
    """ Convert between a Penn Treebank tag to a simplified Wordnet tag """
    if tag.startswith('N'):
        return 'n'
 
    if tag.startswith('V'):
        return 'v'
 
    if tag.startswith('J'):
        return 'a'
 
    if tag.startswith('R'):
        return 'r'
 
    return None
 
def tagged_to_synset(word, tag):
    wn_tag = penn_to_wn(tag)
    if wn_tag is None:
        return None
 
    try:
        return wn.synsets(word, wn_tag)[0]
    except IndexError:
        # no synset exists for this word/tag combination
        return None
 
def sentence_similarity(sentence1, sentence2):
    """ compute the sentence similarity using Wordnet """
    # Tokenize and tag
    sentence1 = pos_tag(word_tokenize(sentence1))
    sentence2 = pos_tag(word_tokenize(sentence2))
 
    # Get the synsets for the tagged words
    synsets1 = [tagged_to_synset(*tagged_word) for tagged_word in sentence1]
    synsets2 = [tagged_to_synset(*tagged_word) for tagged_word in sentence2]
 
    # Filter out the Nones
    synsets1 = [ss for ss in synsets1 if ss]
    synsets2 = [ss for ss in synsets2 if ss]
 
    score, count = 0.0, 0
 
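    # For each synset of the first sentence, look for a match in the second.
    # NOTE: the None check below sits outside the inner loop, so only the
    # similarity to the last synset of sentence2 is kept. Indent it into the
    # inner loop to take the true maximum over sentence2. The scores printed
    # further down come from the code exactly as written here.
    best_score = [0.0]
    for ss1 in synsets1:
        for ss2 in synsets2:
            best1_score=ss1.path_similarity(ss2)
        if best1_score is not None:
            best_score.append(best1_score)
        max1=max(best_score)
        if best_score is not None:
            score += max1
        if max1 != 0.0:
            count += 1
        best_score=[0.0]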
    print(score/count)      
   
    # Average the values
    score /= count
    return score
 
sentences = [
    "Dogs are awesome.",
    "Some gorgeous creatures are felines.",
    "Dolphins are swimming mammals.",
    "Cats are beautiful animals.",
]
 
focus_sentence = "Cats are beautiful animals."
 
for sentence in sentences:
    print ("Similarity(\"%s\", \"%s\") = %s" % (focus_sentence, sentence, sentence_similarity(focus_sentence, sentence)))
    print ("Similarity(\"%s\", \"%s\") = %s" % (sentence, focus_sentence, sentence_similarity(sentence, focus_sentence)))

0.3333333333333333
Similarity("Cats are beautiful animals.", "Dogs are awesome.") = 0.3333333333333333
0.2222222222222222
Similarity("Dogs are awesome.", "Cats are beautiful animals.") = 0.2222222222222222
0.23650793650793647
Similarity("Cats are beautiful animals.", "Some gorgeous creatures are felines.") = 0.23650793650793647
0.41798941798941797
Similarity("Some gorgeous creatures are felines.", "Cats are beautiful animals.") = 0.41798941798941797
0.17777777777777778
Similarity("Cats are beautiful animals.", "Dolphins are swimming mammals.") = 0.17777777777777778
0.14027777777777778
Similarity("Dolphins are swimming mammals.", "Cats are beautiful animals.") = 0.14027777777777778
0.41203703703703703
Similarity("Cats are beautiful animals.", "Cats are beautiful animals.") = 0.41203703703703703
0.41203703703703703
Similarity("Cats are beautiful animals.", "Cats are beautiful animals.") = 0.41203703703703703
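
In the output above a sentence compared with itself scores about 0.41, and the score changes when the two arguments are swapped. A best-match average that takes the maximum over every candidate in the second sentence, and then averages both directions, avoids both effects. The sketch below shows that idea on plain strings with a toy similarity function so that it runs without WordNet; `best_match_average`, `symmetric_similarity` and `toy_similarity` are hypothetical names, not part of NLTK.

```python
def best_match_average(items_1, items_2, similarity):
    """Average, over items_1, of the best similarity each item reaches in items_2.

    `similarity(a, b)` may return None (no score); such pairs are ignored.
    Items with no scored match at all are left out of the average.
    """
    total, count = 0.0, 0
    for a in items_1:
        scores = [similarity(a, b) for b in items_2]
        scores = [s for s in scores if s is not None]
        if scores:
            total += max(scores)
            count += 1
    return total / count if count else 0.0

def symmetric_similarity(items_1, items_2, similarity):
    """Mean of both directions, so the argument order does not matter."""
    return (best_match_average(items_1, items_2, similarity)
            + best_match_average(items_2, items_1, similarity)) / 2

def toy_similarity(a, b):
    """Demonstration only: 1.0 for equal strings, 0.5 for a shared first letter."""
    if a == b:
        return 1.0
    if a[0] == b[0]:
        return 0.5
    return None

words_1 = ["cat", "dog", "cow"]
words_2 = ["cat", "deer"]
print(best_match_average(words_1, words_2, toy_similarity))
print(best_match_average(words_2, words_1, toy_similarity))
print(symmetric_similarity(words_1, words_2, toy_similarity))
```

With WordNet, the items would be the filtered synset lists and the similarity could be `lambda a, b: a.path_similarity(b)`.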


6. Refine your code to expand the terms of each snippet with all the hyponyms and hypernyms of the associated words by querying the WordNet database, and repeat the overlap computation.

In [None]:
#
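
A minimal sketch of the expansion step, under the assumption that it happens before stemming (WordNet expects dictionary forms). `expand_terms` takes any lookup function, so the demonstration runs on a small made-up taxonomy. `wordnet_related` shows how the same lookup might be done with NLTK's WordNet reader; it needs the WordNet corpus (`nltk.download('wordnet')`) and is not exercised in the demo. All function names and the toy data are hypothetical.

```python
def expand_terms(terms, related_words):
    """Return the terms plus everything `related_words(term)` yields for each of them."""
    expanded = set(terms)
    for term in terms:
        expanded.update(related_words(term))
    return expanded

def wordnet_related(word):
    """Lemma names of the direct hypernyms and hyponyms of every synset of `word`.

    Multi-word lemmas come back with underscores, e.g. 'edible_fruit'.
    """
    from nltk.corpus import wordnet as wn  # lazy import: the toy demo runs without NLTK
    related = set()
    for synset in wn.synsets(word):
        for neighbour in synset.hypernyms() + synset.hyponyms():
            related.update(name.lower() for name in neighbour.lemma_names())
    return related

# Toy taxonomy standing in for WordNet (invented for this example).
TOY_RELATIONS = {
    "berry": {"fruit", "cranberry"},
    "apple": {"fruit"},
}

def toy_related(word):
    return TOY_RELATIONS.get(word, set())

s1_terms = expand_terms({"berry", "nordic"}, toy_related)
s2_terms = expand_terms({"apple", "health"}, toy_related)

# The original term sets share nothing; after expansion they meet at 'fruit'.
print(sorted(s1_terms & s2_terms))
```

Expansion makes the term sets much larger, so a normalised measure such as Jaccard is probably a fairer comparison than a raw overlap count.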

7. Wikipedia based similarity.
   Similarly, use Wikipedia dump files to design a program that searches the Wikipedia documents for each sentence. The similarity between the sentences is then measured as the number of Wikipedia documents returned by both queries (S1 and S2) divided by the total number of documents returned by the two queries. Repeat the process of calculating the semantic similarity for your set of chosen academic examples.

In [None]:
#your code here.
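
The measure described above is again a Jaccard coefficient, this time over sets of retrieved documents. The sketch below illustrates it on a three-document stand-in corpus. The search is a naive word match, and `search`, `document_overlap_similarity` and the corpus text are assumptions made for this example; a real solution would index the Wikipedia dump instead.

```python
import re

def tokenize(text):
    """Set of lower-cased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search(query, documents, min_shared=1):
    """Titles of the documents sharing at least `min_shared` words with the query."""
    query_terms = tokenize(query)
    return {title for title, text in documents.items()
            if len(query_terms & tokenize(text)) >= min_shared}

def document_overlap_similarity(query_1, query_2, documents, min_shared=1):
    """Documents returned by both queries divided by documents returned by either."""
    docs_1 = search(query_1, documents, min_shared)
    docs_2 = search(query_2, documents, min_shared)
    union = docs_1 | docs_2
    return len(docs_1 & docs_2) / len(union) if union else 0.0

# Tiny stand-in corpus (invented text, not real Wikipedia content).
toy_wiki = {
    "Crowberry": "crowberry is a nordic berry",
    "Peptide": "peptides are short chains of amino acids",
    "Hospital": "a hospital treats infections",
}

score = document_overlap_similarity("nordic berry peptides",
                                    "peptides against infections",
                                    toy_wiki)
print(round(score, 4))
```

Here the first query retrieves Crowberry and Peptide, the second retrieves Peptide and Hospital, so one document out of three is shared.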

8. Use a publicly available dataset of your choice to test the usefulness of these similarity measures (snippet-based and Wikipedia-based) and compare the results with some state-of-the-art measures from the literature that were evaluated on the same dataset.
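
A common way to run such a comparison, assuming the dataset provides human similarity ratings for sentence pairs, is to correlate each measure's scores with those ratings. The sketch below computes the Pearson correlation with the standard library only; the two score lists are invented numbers for illustration, not results from any dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant list carries no information about agreement
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers, one entry per sentence pair (illustration only).
human_ratings = [0.9, 0.1, 0.5, 0.7]
measure_scores = [0.8, 0.2, 0.4, 0.6]

print(round(pearson(human_ratings, measure_scores), 4))
```

The same function can be applied to the snippet-based scores, the Wikipedia-based scores and any baseline, which gives one number per measure to compare.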

9. Design a simple GUI that allows you to demonstrate your findings.

In [None]:
#your code here