<a href="https://colab.research.google.com/github/lanashin/NLP/blob/main/NLP/Jaccard_cosine_meassures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basic text processing: sentences, types, tokens, similarity**

In [1]:
from __future__ import print_function
from operator import itemgetter

In [2]:
import re

In [60]:
# Copy and pasted background info from New York Times page
# (https://www.nytimes.com/interactive/2022/09/26/world/asia/china-fishing-south-america.html)
# "How China Targets the Global Fish Supply" written by Steven Lee Myers, Agnes Chang, Derek Watkins and Claire Fu | 09. 26, 2022

background = """
The Chinese effort has prompted diplomatic and legal protests. The fleet has also been linked to illegal activity, including encroaching on other countries’ territorial waters, tolerating labor abuses and catching endangered species. In 2017, Ecuador seized a refrigerated cargo ship, the Fu Yuan Yu Leng 999, carrying an illicit cargo of 6,620 sharks, whose fins are a delicacy in China.

Much of what China does, however, is legal — or, on the open seas at least, largely unregulated. Given the growing demands of an increasingly prosperous consumer class in China, it is unlikely to end soon. That doesn’t mean it is sustainable.

In the summer of 2020, the conservation group Oceana counted nearly 300 Chinese ships operating near the Galápagos, just outside Ecuador’s exclusive economic zone, the 200 nautical miles off its territory where it maintains rights to natural resources under the Law of the Sea Treaty. The ships hugged the zone so tightly that satellite mapping of their positions traced the zone’s boundary.

Together, they accounted for nearly 99 percent of the fishing near the Galápagos. No other country came close.

“Our sea can’t handle this pressure anymore,” said Alberto Andrade, a fisherman from the Galápagos. The presence of so many Chinese vessels, he added, has made it harder for local fishermen inside Ecuador’s territorial waters, a UNESCO World Heritage Site that inspired Charles Darwin’s theory of evolution.

Mr. Andrade has organized a group of fishermen, the Island Front for the Galápagos Marine Reserve, to call for the expansion of fishery protections around the islands.

“The industrial fleets are razing the stocks, and we are afraid that in the future there will be no more fishery,” he said. “Not even the pandemic stopped them.”

An Industrial Effort
China can fish on such an industrial scale because of vessels like Hai Feng 718, a refrigerated cargo ship built in Japan in 1996. It is registered in Panama and managed by a company in Beijing called Zhongyu Global Seafood Corporation.

Its owner is a state-owned enterprise: the China National Fisheries Corporation.

Hai Feng 718 is known as a carrier vessel, or mothership. It has refrigerated storage holds to preserve tons of catch. It also carries fuel and other supplies for smaller ships that can unload their hauls and resupply their crews at sea. As a result, the other vessels do not need to spend time returning to port, allowing them to fish almost continuously.

Over the course of a year beginning June 2021, the Hai Feng 718 met at least 70 smaller Chinese-flagged fishing vessels in various locations at sea, according to Global Fishing Watch, a research organization that assembles location data from ship transponders. Each encounter, known as a transshipment, represents the transfer of tons of fish that the smaller ships would have had to unload in port hundreds of miles away.

Together the vessels followed the coasts of South America in what has become a year-round pursuit of catch.

After leaving Weihai, a port town in China’s Shandong Province, the Hai Feng 718 arrived in the Galápagos in August 2021 and spent nearly a month in the waters off Ecuador’s exclusive economic zone. There it serviced numerous ships like the Hebei 8588.

Such vessels are designed for catching squid, one of the prizes for the fleet. The lights the ships use at night to lure squid to the surface are so bright they can be tracked from space.

A month later, the Chinese fleet traveled to the coast of Peru, where the Hai Feng 718 sidled up to more than two dozen smaller vessels, some of them multiple times, including, again, the Hebei 8588.

Loaded with catch, the mothership returned to China. By last December, it was at sea again, this time heading west through the Indian Ocean. It arrived off the coast of Argentina for the start of the squid season there in January. In May, it was once again off the coast of the Galápagos.

These operations have allowed a boom in the squid harvest. Between 1990 and 2019, the number of deep-water squid boats soared from six to 528, while the annual reported catch rose from about 5,000 tons to 278,000, according to a report this year by Global Fishing Watch. In 2019, China accounted for nearly all the squid boats operating in the South Pacific.

The arrangement of transferring catch to another vessel is not illegal, but according to experts, the use of the motherships makes it easy to underreport the catch and disguise its origins. Other places also deploy deep-water fleets, including Japan, South Korea and Taiwan, but none do so on the scale of China.

The Hai Feng 718 alone has more than 500,000 cubic feet of cargo space, enough to carry thousands of tons of fish.

Global Fishing Watch has tracked scores of unexplained “loitering events,” where larger ships linger in one area without any recorded meetings between the carriers and smaller ships. Experts warn that the smaller ships may be turning off their transponders to avoid detection to disguise illegal or unregulated catch.

The impact on certain species like squid off the coast of South America is difficult to measure exactly. In some regions, like the South Pacific, international agreements require countries to report their haul, though underreporting is believed to be common. In the South Atlantic, there is no such agreement.

There are already worrisome signs of diminishing stocks, which could foreshadow a broader ecological collapse.

“The concern is the sheer number of ships and the lack of accountability, to know how much is being fished out and where it’s going to,” said Marla Valentine, an oceanographer with Oceana, the conservation group. “And I’m worried that the impacts that are happening now are going to cascade into the future.

“Because it’s not just the squid that are going to be affected,” she added. “It’s going to be everything that feeds on the squid, too.”

The Global Backlash
The appearance of the Chinese fleet on the edge of the Galápagos in 2020 focused international attention on the industrial scale of China’s fishing fleet. Ecuador lodged a protest in Beijing. Its president at the time, Lenín Moreno, vowed on Twitter to defend the marine sanctuary, which he called “a seedbed of life for the entire planet.”

China has responded with offers of concessions. It announced moratoriums on fishing in certain areas, though critics noted that the restrictions apply to seasons when the fish are not as abundant. It vowed to cap the size of its deep-water fleet, though not to reduce it, and to trim the government subsidies it provides fishing companies, many still state-owned or controlled.

In the year that followed the furor over the Galápagos, the bulk of the Chinese fleet kept a greater distance from Ecuador’s exclusive economic zone. Otherwise it continued to fish as much as before.

In Argentina, a group of environmentalists, supported by the Gallifrey Foundation, an ocean conservation organization, filed an injunction with the country’s top court last year in the hope of prodding the government to do more to comply with its constitutional obligations to protect the environment. They plan to submit a similar injunction in the coming months in Ecuador.

“We have a permanent Chinese fleet 200 miles off our coast,” said Pablo Ferrara, a lawyer and professor at the University of Salvador in Buenos Aires, referring to the distance covered by Argentina’s exclusive economic zone.

Argentina’s navy, which sank a Chinese fishing boat inside the zone in 2016, has since announced it would add four new patrol ships to step up its enforcement efforts in its coastal waters.

The United States, too, has pledged to assist smaller nations to counter China’s illegal or unregulated fishing practices. The U.S. Coast Guard, which now calls the practice one of the greatest security threats in the oceans, has dispatched patrol ships to the South Pacific.

In July, President Biden issued a national security memorandum pledging to increase monitoring of the industry. Speaking virtually at a forum of Pacific nations that month, Vice President Kamala Harris said the United States would triple American assistance to help the nations patrol their waters, offering $60 million a year for the next decade.

Such efforts may help in territorial waters, but they do little to restrict China’s fleet on the open seas. The consumption of fish worldwide continues to rise, reaching a record high in 2019. At the same time, the known stocks of most species of fish continue to decline, according to the latest report by the United Nations Food and Agriculture Organization.

“The challenge is to persuade China that it, too, has a need to ensure the long-range sustainability of the ocean’s resources,” said Duncan Currie, an international environmental lawyer who advises the Deep Sea Conservation Coalition. “It’s not going to be there forever.”



"""

We want to work with sentences. A simple idea is to break at periods.

In [61]:
background.split(".")

['\nThe Chinese effort has prompted diplomatic and legal protests',
 ' The fleet has also been linked to illegal activity, including encroaching on other countries’ territorial waters, tolerating labor abuses and catching endangered species',
 ' In 2017, Ecuador seized a refrigerated cargo ship, the Fu Yuan Yu Leng 999, carrying an illicit cargo of 6,620 sharks, whose fins are a delicacy in China',
 '\n\nMuch of what China does, however, is legal — or, on the open seas at least, largely unregulated',
 ' Given the growing demands of an increasingly prosperous consumer class in China, it is unlikely to end soon',
 ' That doesn’t mean it is sustainable',
 '\n\nIn the summer of 2020, the conservation group Oceana counted nearly 300 Chinese ships operating near the Galápagos, just outside Ecuador’s exclusive economic zone, the 200 nautical miles off its territory where it maintains rights to natural resources under the Law of the Sea Treaty',
 ' The ships hugged the zone so tightly that sat

This is not perfect. One common problem with copied text is footnote markers such as "[9]" (typical of Wikipedia articles), which are irrelevant to us. Let's look for them and remove any we find with a regular expression.

In [62]:
re.findall(r"\[\d+\]", background)


[]
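
The search comes back empty for this article, so the pattern is easier to see on a string that does contain footnote markers. The following is a minimal sketch; the sample sentence and the name `footnote` are made up for illustration.

```python
import re

# Hypothetical sample text with Wikipedia-style footnote markers
sample = "The fleet left port.[9] It returned in May.[10]"

# A literal "[", one or more digits, then a literal "]"
footnote = re.compile(r"\[\d+\]")

print(footnote.findall(sample))   # the markers that were found
print(footnote.sub("", sample))   # the text with the markers removed
```

Because the brackets are escaped, the pattern only matches bracketed numbers and leaves ordinary digits such as "2017" or "6,620" alone.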

In [63]:
background_nofoot = re.sub(r"\[\d+\]", "", background)
background_nofoot

'\nThe Chinese effort has prompted diplomatic and legal protests. The fleet has also been linked to illegal activity, including encroaching on other countries’ territorial waters, tolerating labor abuses and catching endangered species. In 2017, Ecuador seized a refrigerated cargo ship, the Fu Yuan Yu Leng 999, carrying an illicit cargo of 6,620 sharks, whose fins are a delicacy in China.\n\nMuch of what China does, however, is legal — or, on the open seas at least, largely unregulated. Given the growing demands of an increasingly prosperous consumer class in China, it is unlikely to end soon. That doesn’t mean it is sustainable.\n\nIn the summer of 2020, the conservation group Oceana counted nearly 300 Chinese ships operating near the Galápagos, just outside Ecuador’s exclusive economic zone, the 200 nautical miles off its territory where it maintains rights to natural resources under the Law of the Sea Treaty. The ships hugged the zone so tightly that satellite mapping of their posit

We can now try to split sentences again. We also want to split on other punctuation marks, not just the period. This is possible with regular expressions.



In [64]:
splitter = re.compile(r"""
    [.!?]       # split on punctuation
    """, re.VERBOSE)

for sent in splitter.split(background_nofoot):
    print(sent.strip())
    print("--")

The Chinese effort has prompted diplomatic and legal protests
--
The fleet has also been linked to illegal activity, including encroaching on other countries’ territorial waters, tolerating labor abuses and catching endangered species
--
In 2017, Ecuador seized a refrigerated cargo ship, the Fu Yuan Yu Leng 999, carrying an illicit cargo of 6,620 sharks, whose fins are a delicacy in China
--
Much of what China does, however, is legal — or, on the open seas at least, largely unregulated
--
Given the growing demands of an increasingly prosperous consumer class in China, it is unlikely to end soon
--
That doesn’t mean it is sustainable
--
In the summer of 2020, the conservation group Oceana counted nearly 300 Chinese ships operating near the Galápagos, just outside Ecuador’s exclusive economic zone, the 200 nautical miles off its territory where it maintains rights to natural resources under the Law of the Sea Treaty
--
The ships hugged the zone so tightly that satellite mapping of their 

While this looks better, it messes up some instances: O.J. Simpson's name, the "E!" network's name, and ellipses (...).

To deal with this, we need to think about the decision process of whether a punctuation mark is an end of sentence or not. In class, we came up with three requirements:

1. It is succeeded by at least one whitespace character (discards O.J).
2. The first character after the whitespace should be uppercase (discards E! renewed).
3. The last character before it should not be uppercase (discards J. Simpson).

Are these rules complete?

In [65]:
#Note that I do not need to catch the last sentence ending, since I am using split.


splitter = re.compile(r"""
    (?<![A-Z])  # LOOKBEHIND: the character before the punctuation cannot be uppercase
    [.!?]       # match punctuation
    \s+         # followed by at least one whitespace character
    (?=[A-Z])   # LOOKAHEAD: the first character after the whitespace must be uppercase
    """, re.VERBOSE)

# But if I really want to match that last period (note that this will create an extra empty sentence)
# splitter = re.compile(r"""
#     (?<![A-Z])  # LOOKBEHIND last character cannot be uppercase
#     [.!?]       # match punctuation
#     (?=         # LOOKAHEAD next character must either be:
#     \s+[A-Z]    # followed by at least one whitespace and an uppercase
#     |
#     \s*$        # followed by the end of the string
#     )           
#     """, re.VERBOSE)
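
One way to probe whether the three rules are complete is to try them on a sentence built around known traps. The sketch below writes the same rules on a single line and applies them to a made-up test sentence; the sentence is an assumption chosen for illustration, not part of the article.

```python
import re

# The same three rules as above, written on one line
splitter = re.compile(r"(?<![A-Z])[.!?]\s+(?=[A-Z])")

# Made-up test sentence with initials, "E!", and a title ("Mr.")
text = "O.J. Simpson watched E! renew the show. Mr. Andrade went fishing. It rained."

parts = splitter.split(text)
for sent in parts:
    print(repr(sent))
```

The initials and "E!" survive, but "Mr." is treated as a sentence end because it is followed by whitespace and an uppercase letter. Titles and abbreviations therefore need an extra rule or an exception list. Sentences that open with a quotation mark are another likely gap, since the lookahead only accepts an uppercase letter.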

Running our sentence splitter on the text shows that it seems to do a good job.



In [66]:
for sentence in splitter.split(background_nofoot):
    print(sentence.strip())
    print("--")

The Chinese effort has prompted diplomatic and legal protests
--
The fleet has also been linked to illegal activity, including encroaching on other countries’ territorial waters, tolerating labor abuses and catching endangered species
--
In 2017, Ecuador seized a refrigerated cargo ship, the Fu Yuan Yu Leng 999, carrying an illicit cargo of 6,620 sharks, whose fins are a delicacy in China
--
Much of what China does, however, is legal — or, on the open seas at least, largely unregulated
--
Given the growing demands of an increasingly prosperous consumer class in China, it is unlikely to end soon
--
That doesn’t mean it is sustainable
--
In the summer of 2020, the conservation group Oceana counted nearly 300 Chinese ships operating near the Galápagos, just outside Ecuador’s exclusive economic zone, the 200 nautical miles off its territory where it maintains rights to natural resources under the Law of the Sea Treaty
--
The ships hugged the zone so tightly that satellite mapping of their 

Now let's identify individual words within each sentence. Rather than splitting at whitespace, let's match all sequences of word characters.
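
Matching `\w+` has side effects on contractions and on numbers with separators, because apostrophes and commas are not word characters. The sketch below shows this on a made-up sample and compares it with a slightly more lenient pattern; `lenient` is a hypothetical alternative, not something the rest of this notebook uses.

```python
import re

word_splitter = re.compile(r"(\w+)")

# Made-up sample mixing a contraction and a number with a thousands separator
sample = "That doesn’t mean 6,620 sharks"
strict_tokens = word_splitter.findall(sample)
print(strict_tokens)

# Hypothetical alternative: allow an apostrophe or comma inside a token,
# but only when it is immediately followed by more word characters
lenient = re.compile(r"\w+(?:[’',]\w+)*")
lenient_tokens = lenient.findall(sample)
print(lenient_tokens)
```

Whether "doesn" and "t" or "doesn’t" is the better token depends on the task. For the simple matching done here, either choice works as long as queries and sentences are tokenized the same way.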



In [67]:
word_splitter = re.compile(r"""
    (\w+)
    """, re.VERBOSE)

In [68]:
sent_words = [word_splitter.findall(sent)
              for sent in splitter.split(background_nofoot) if len(sent)>0]

sent_words

[['The',
  'Chinese',
  'effort',
  'has',
  'prompted',
  'diplomatic',
  'and',
  'legal',
  'protests'],
 ['The',
  'fleet',
  'has',
  'also',
  'been',
  'linked',
  'to',
  'illegal',
  'activity',
  'including',
  'encroaching',
  'on',
  'other',
  'countries',
  'territorial',
  'waters',
  'tolerating',
  'labor',
  'abuses',
  'and',
  'catching',
  'endangered',
  'species'],
 ['In',
  '2017',
  'Ecuador',
  'seized',
  'a',
  'refrigerated',
  'cargo',
  'ship',
  'the',
  'Fu',
  'Yuan',
  'Yu',
  'Leng',
  '999',
  'carrying',
  'an',
  'illicit',
  'cargo',
  'of',
  '6',
  '620',
  'sharks',
  'whose',
  'fins',
  'are',
  'a',
  'delicacy',
  'in',
  'China'],
 ['Much',
  'of',
  'what',
  'China',
  'does',
  'however',
  'is',
  'legal',
  'or',
  'on',
  'the',
  'open',
  'seas',
  'at',
  'least',
  'largely',
  'unregulated'],
 ['Given',
  'the',
  'growing',
  'demands',
  'of',
  'an',
  'increasingly',
  'prosperous',
  'consumer',
  'class',
  'in',
  'China

In [69]:
sent_words_lower = [[w.lower() for w in sent]
                    for sent in sent_words]
sent_words_lower

[['the',
  'chinese',
  'effort',
  'has',
  'prompted',
  'diplomatic',
  'and',
  'legal',
  'protests'],
 ['the',
  'fleet',
  'has',
  'also',
  'been',
  'linked',
  'to',
  'illegal',
  'activity',
  'including',
  'encroaching',
  'on',
  'other',
  'countries',
  'territorial',
  'waters',
  'tolerating',
  'labor',
  'abuses',
  'and',
  'catching',
  'endangered',
  'species'],
 ['in',
  '2017',
  'ecuador',
  'seized',
  'a',
  'refrigerated',
  'cargo',
  'ship',
  'the',
  'fu',
  'yuan',
  'yu',
  'leng',
  '999',
  'carrying',
  'an',
  'illicit',
  'cargo',
  'of',
  '6',
  '620',
  'sharks',
  'whose',
  'fins',
  'are',
  'a',
  'delicacy',
  'in',
  'china'],
 ['much',
  'of',
  'what',
  'china',
  'does',
  'however',
  'is',
  'legal',
  'or',
  'on',
  'the',
  'open',
  'seas',
  'at',
  'least',
  'largely',
  'unregulated'],
 ['given',
  'the',
  'growing',
  'demands',
  'of',
  'an',
  'increasingly',
  'prosperous',
  'consumer',
  'class',
  'in',
  'china

In [70]:
len(sent_words_lower)

59

In [71]:
def getwords(sent):
    return [w.lower() 
            for w in word_splitter.findall(sent)]

sent_words_lower = [getwords(sent) 
                    for sent in splitter.split(background_nofoot)]

How many sentences do we have?



In [72]:
len(sent_words_lower)


59

How many words are there in total? ("tokens")



In [73]:
allwords=[w for sent in sent_words_lower
          for w in sent]
len(allwords)

How many distinct types of words ("types") are there?



In [74]:
len(set(allwords))


619
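
The difference between tokens and types is easiest to see on a tiny example. This sketch uses a made-up, already tokenized sentence and `collections.Counter` from the standard library.

```python
from collections import Counter

# Made-up sentence, already tokenized and lowercased
tokens = ["the", "fleet", "met", "the", "fleet", "at", "sea"]

counts = Counter(tokens)   # maps each type to its number of occurrences
n_tokens = len(tokens)     # every occurrence counts
n_types = len(counts)      # each distinct word counts once

print(n_tokens, n_types)
print(counts["fleet"])
```

Here "the" and "fleet" each contribute two tokens but only one type, so the type count is always at most the token count.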

In [75]:
sorted(set(allwords))

['000',
 '1990',
 '1996',
 '200',
 '2016',
 '2017',
 '2019',
 '2020',
 '2021',
 '278',
 '300',
 '5',
 '500',
 '528',
 '6',
 '60',
 '620',
 '70',
 '718',
 '8588',
 '99',
 '999',
 'a',
 'about',
 'abundant',
 'abuses',
 'according',
 'accountability',
 'accounted',
 'activity',
 'add',
 'added',
 'advises',
 'affected',
 'afraid',
 'after',
 'again',
 'agreement',
 'agreements',
 'agriculture',
 'aires',
 'alberto',
 'all',
 'allowed',
 'allowing',
 'almost',
 'alone',
 'already',
 'also',
 'america',
 'american',
 'an',
 'and',
 'andrade',
 'announced',
 'annual',
 'another',
 'any',
 'anymore',
 'appearance',
 'apply',
 'are',
 'area',
 'areas',
 'argentina',
 'around',
 'arrangement',
 'arrived',
 'as',
 'assembles',
 'assist',
 'assistance',
 'at',
 'atlantic',
 'attention',
 'august',
 'avoid',
 'away',
 'backlash',
 'be',
 'because',
 'become',
 'been',
 'before',
 'beginning',
 'beijing',
 'being',
 'believed',
 'between',
 'biden',
 'boat',
 'boats',
 'boom',
 'boundary',
 'brigh

# **Stemming**

We might not want to distinguish between air/aired/airing or announce/announcing, since they refer to the same concept. We can do this through stemming.
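
To make the idea concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm and is much less careful; `crude_stem` and its suffix list are invented for illustration only.

```python
# Hypothetical suffix list, checked in order; real stemmers use far richer rules
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["air", "aired", "airing", "announce", "announcing"]:
    print(w, "->", crude_stem(w))
```

Both word families collapse to a single stem each. Note that a stem such as "announc" does not have to be a real word; it only has to be shared by the related forms.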



In [76]:
from nltk.stem import PorterStemmer
stemmer=PorterStemmer()

def getstems(sent):
    return [stemmer.stem(w.lower()) for w in word_splitter.findall(sent)]

sent_words_lower_stemmed = [getstems(sent) for sent in splitter.split(background_nofoot)]

allstemms=[w for sent in sent_words_lower_stemmed
           for w in sent]

set(allstemms)

{'000',
 '1990',
 '1996',
 '200',
 '2016',
 '2017',
 '2019',
 '2020',
 '2021',
 '278',
 '300',
 '5',
 '500',
 '528',
 '6',
 '60',
 '620',
 '70',
 '718',
 '8588',
 '99',
 '999',
 'a',
 'about',
 'abund',
 'abus',
 'accord',
 'account',
 'activ',
 'ad',
 'add',
 'advis',
 'affect',
 'afraid',
 'after',
 'again',
 'agreement',
 'agricultur',
 'air',
 'alberto',
 'all',
 'allow',
 'almost',
 'alon',
 'alreadi',
 'also',
 'america',
 'american',
 'an',
 'and',
 'andrad',
 'ani',
 'announc',
 'annual',
 'anoth',
 'anymor',
 'appear',
 'appli',
 'are',
 'area',
 'argentina',
 'around',
 'arrang',
 'arriv',
 'as',
 'assembl',
 'assist',
 'at',
 'atlant',
 'attent',
 'august',
 'avoid',
 'away',
 'backlash',
 'be',
 'becaus',
 'becom',
 'been',
 'befor',
 'begin',
 'beij',
 'believ',
 'between',
 'biden',
 'boat',
 'boom',
 'boundari',
 'bright',
 'broader',
 'bueno',
 'built',
 'bulk',
 'but',
 'by',
 'call',
 'came',
 'can',
 'cap',
 'cargo',
 'carri',
 'carrier',
 'cascad',
 'catch',
 'certa

# **Answering queries on the data**

We will try to retrieve the closest matching sentence to a given query. To do this, we must define what "closest" means. In other words, we need a similarity measure.

A simple one is the number of types in common between the query and the sentence.

In [77]:
def types_in_common(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    return len(A.intersection(B))

A slightly more complex one is the Jaccard similarity measure, which additionally takes into account the total number of types that the query and the sentence have: it divides the number of shared types by the number of types in their union.

In [78]:
#later
def jaccard(query_words, sentence):
    A = set(query_words)
    B = set(sentence)
    union = A.union(B)
    if not union:  # both empty: avoid dividing by zero
        return 0.0
    return float(len(A.intersection(B))) / len(union)
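
The two measures can disagree when sentences differ in length. The sketch below restates both functions in compact form so that it runs on its own, and compares them on a made-up query and two made-up tokenized sentences.

```python
def types_in_common(query_words, sentence):
    return len(set(query_words) & set(sentence))

def jaccard(query_words, sentence):
    A, B = set(query_words), set(sentence)
    return len(A & B) / len(A | B)

# Made-up query and sentences, already tokenized and lowercased
query = ["squid", "boats"]
short_sent = ["squid", "boats", "soared"]
long_sent = ["the", "squid", "boats", "sailed", "far", "away"]

for sent in (short_sent, long_sent):
    print(types_in_common(query, sent), round(jaccard(query, sent), 2))
```

Both sentences share two types with the query, so `types_in_common` cannot separate them. Jaccard prefers the short sentence because its union with the query is smaller (3 types instead of 6).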

Next we'll define a basic "search engine" which will go through all the sentences and calculate each one's similarity with the query. It returns a list of sentences sorted by their similarity score (if that score is greater than zero).

To calculate the similarity, this function takes as an argument a similarity_measure function.

In [79]:
def run_search(query, similarity_measure):
    query_words = word_splitter.findall(query)
    query_words = [w.lower() for w in query_words]
    
    sent_scores = [(sent, similarity_measure(query_words, sent))
                   for sent in sent_words_lower]

    sent_scores = sorted(sent_scores, key=itemgetter(1), reverse=True)
    sent_scores = [(sent, score)
                   for sent, score in sent_scores
                   if score > 0]

    joined_sents = [(" ".join(sent), score) for sent, score in sent_scores]
    return joined_sents

Now we'll run two versions of the search engine (one using the types_in_common measure and one using the jaccard measure) for a few different queries.



In [59]:
run_search("Raptors",types_in_common)


[]

In [24]:
##first add Jaccard
run_search("Raptors",jaccard)

[('most famously the toronto raptors built a wall of lengthy adept defenders to neutralise antetokounmpo s paint dominance in their 2019 conference finals victory',
  0.041666666666666664),
 ('the bucks went on to reach the eastern conference finals where they were defeated 4 2 by the eventual champions the toronto raptors despite winning the first two games',
  0.04),
 ('on april 15 2017 antetokounmpo scored a playoff career high 28 points in a 97 83 win over the third seed toronto raptors in game 1 of their first round playoff series',
  0.034482758620689655)]

In [25]:
##first add Jaccard
run_search("Toronto Raptors",jaccard)

[('most famously the toronto raptors built a wall of lengthy adept defenders to neutralise antetokounmpo s paint dominance in their 2019 conference finals victory',
  0.08333333333333333),
 ('the bucks went on to reach the eastern conference finals where they were defeated 4 2 by the eventual champions the toronto raptors despite winning the first two games',
  0.08),
 ('on april 15 2017 antetokounmpo scored a playoff career high 28 points in a 97 83 win over the third seed toronto raptors in game 1 of their first round playoff series',
  0.06896551724137931)]

In [26]:
##first add Jaccard
run_search("greek freak",jaccard)

[('because many could not pronounce his surname he quickly became known as the greek freak',
  0.13333333333333333),
 ('he also became the first greek nba all star', 0.1),
 ('antetokounmpo again joined the greek national team for eurobasket 2015',
  0.09090909090909091),
 ('antetokounmpo s country of origin in addition to his size speed strength and ball handling skills have earned him the nickname greek freak',
  0.08695652173913043),
 ('from 2014 to 2019 antetokounmpo played with the senior men s greek national team in 49 games',
  0.05555555555555555),
 ('he was eventually issued greek citizenship on may 9 2013 less than two months before the 2013 nba draft',
  0.05263157894736842),
 ('he was also selected by the coaches as a special participant in the 2013 greek league all star game',
  0.05263157894736842),
 ('national team career junior national team antetokounmpo represented greece for the first time in july 2013 with the greek under 20 national team at the 2013 fiba europe unde

In [27]:
def print_results(orderedlist, relevant_docs=(), maxresults=5):
    """Print search results while highlighting the ones we truly care about"""
    for count, (item, score) in enumerate(orderedlist, start=1):
        if item in relevant_docs:
            # relevant results are always shown, whatever their rank
            print("{:d} !!! {:.2f} {}".format(count, score, item))
        elif count <= maxresults:
            print("{:d}     {:.2f} {}".format(count, score, item))
        else:
            continue  # skip the blank line for results we do not print
        print()

In [28]:
relevant_docs_citizen = ["after gaining greek citizenship in 2013 his official surname became αντετοκούνμπο the greek transcription of adetokunbo which was then transliterated letter for letter and officially spelled on his greek passport as antetokounmpo",
                          'although adetokunbo and three of his four brothers were born in greece, they did not automatically receive greek citizenship as greek nationality law follows jus sanguinis',
                          'antetokounmpo also holds nigerian citizenship, having received his nigerian passport in 2015, and as such possesses dual citizenship']

In [29]:
print_results(run_search("greek citizenship", jaccard),
              relevant_docs=relevant_docs_citizen)

1     0.11 he was eventually issued greek citizenship on may 9 2013 less than two months before the 2013 nba draft

2     0.10 he also became the first greek nba all star

3     0.09 antetokounmpo again joined the greek national team for eurobasket 2015

4     0.08 although adetokunbo and three of his four brothers were born in greece they did not automatically receive greek citizenship as greek nationality law follows jus sanguinis

5 !!! 0.07 after gaining greek citizenship in 2013 his official surname became αντετοκούνμπο the greek transcription of adetokunbo which was then transliterated letter for letter and officially spelled on his greek passport as antetokounmpo













# **Vector Space Model**
In order to have a more flexible system that we can modify, let's implement a vector space model. We will use term frequency (TF) weights in order to address one of the issues with Jaccard, namely, that if a word occurs more than once, it should naturally matter more.

In [30]:
terms = sorted(set(allwords))

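# TF (term frequency) vectorization
# We represent vectors as dictionaries mapping each vocabulary term to its count.
# Every term in `terms` gets a key (zero if it does not occur), so words
# outside the vocabulary, such as unseen query words, are ignored.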

def doc_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v)
    return d

def query_to_vec(term_list):
    d = {}
    for v in terms:
        d[v] = term_list.count(v)
    return d

In [31]:
import math

def dot(d, q):
    total = 0  # avoid shadowing the built-in sum()
    for v in d:  # iterates through keys
        total += d[v] * q[v]
    return total
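
The dictionaries built by `doc_to_vec` hold one entry per vocabulary term, most of them zero. A sparser variant can be sketched with `collections.Counter`, which stores only the terms that occur. The names `to_sparse_vec` and `sparse_dot` are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter

def to_sparse_vec(term_list):
    """Hypothetical helper: store only the terms that actually occur."""
    return Counter(term_list)

def sparse_dot(d, q):
    """Dot product that only visits the keys of the smaller vector."""
    if len(q) < len(d):
        d, q = q, d
    # Counter returns 0 for a missing key, so absent terms contribute nothing
    return sum(count * q[term] for term, count in d.items())

doc = to_sparse_vec(["squid", "boats", "catch", "squid"])
query = to_sparse_vec(["squid", "catch"])
print(sparse_dot(doc, query))
```

One difference from the dense version is worth keeping in mind: a `Counter` built from a query keeps words that are not in the vocabulary. They add nothing to a dot product, but they would change the vector's norm.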

One simple similarity measure operating on vectors is the dot product. The higher the dot product between two vectors, the more similar they are.



In [32]:
def dot_measure(query_words, sentence):
    A = query_to_vec(query_words)
    B = doc_to_vec(sentence)
    return float(dot(A, B))

In [33]:
print_results(run_search("greek citizenship",dot_measure),relevant_docs=relevant_docs_citizen)


1 !!! 4.00 after gaining greek citizenship in 2013 his official surname became αντετοκούνμπο the greek transcription of adetokunbo which was then transliterated letter for letter and officially spelled on his greek passport as antetokounmpo

2     3.00 although adetokunbo and three of his four brothers were born in greece they did not automatically receive greek citizenship as greek nationality law follows jus sanguinis

3     2.00 he was eventually issued greek citizenship on may 9 2013 less than two months before the 2013 nba draft

4     2.00 antetokounmpo also holds nigerian citizenship having received his nigerian passport in 2015 and as such possesses dual citizenship

5     1.00 giannis antetokounmpo giannis sina ugo antetokounmpo a né adetokunbo b december 6 1994 is a greek nigerian professional basketball player for the milwaukee bucks of the national basketball association nba













We can see that this does even worse, as it rewards longer documents unfairly. We can address that by length normalizing the documents and the query, that is, dividing the vectors by their norm.

The resulting measure is the cosine similarity measure:

In [34]:
def norm(d):
    sum_sq = 0
    for v in d:
        sum_sq += d[v] * d[v]
    return math.sqrt(sum_sq)

def cos_measure(query_words, sentence):
    A = query_to_vec(query_words)
    B = doc_to_vec(sentence)
    denom = norm(A) * norm(B)
    if denom == 0:  # query or sentence contains no vocabulary terms
        return 0.0
    return float(dot(A, B)) / denom
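
The effect of length normalization can be checked on a toy case in which one document is simply the other repeated three times. This is a self-contained sketch using `Counter` vectors and made-up data, not the notebook's `doc_to_vec` format.

```python
import math
from collections import Counter

def dot(a, b):
    return sum(a[t] * b[t] for t in a)

def norm(a):
    return math.sqrt(sum(c * c for c in a.values()))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

query = Counter(["squid"])
short_doc = Counter(["squid", "boats"])
long_doc = Counter(["squid", "boats"] * 3)   # same content, repeated three times

print(dot(query, short_doc), dot(query, long_doc))
print(round(cosine(query, short_doc), 3), round(cosine(query, long_doc), 3))
```

The dot product triples for the longer document even though it says nothing new, while the cosine is identical for both, because dividing by the norms cancels the repetition.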

In [35]:
print_results(run_search("greek citizenship",cos_measure),relevant_docs=relevant_docs_citizen)


1 !!! 0.44 after gaining greek citizenship in 2013 his official surname became αντετοκούνμπο the greek transcription of adetokunbo which was then transliterated letter for letter and officially spelled on his greek passport as antetokounmpo

2     0.40 although adetokunbo and three of his four brothers were born in greece they did not automatically receive greek citizenship as greek nationality law follows jus sanguinis

3     0.31 he was eventually issued greek citizenship on may 9 2013 less than two months before the 2013 nba draft

4     0.30 antetokounmpo also holds nigerian citizenship having received his nigerian passport in 2015 and as such possesses dual citizenship

5     0.24 he also became the first greek nba all star













This already does better, ranking one of our important documents as first. The other is still low in the ranking.

Another issue we have at the moment is that all words matter equally, but intuition dictates that some words in the query (e.g. olympic) are more important than others (e.g. in). A way to address this is to weight the words according to their specificity. A concrete implementation is inverse document frequency (IDF), which gives lower weight to words that occur in many sentences.

In [36]:
IDF = {}
DF = {}

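for t in terms:
    # DF: number of sentences that contain the term
    DF[t] = sum(1 for sent in sent_words_lower if t in sent)
    # Rarer terms get larger weights; the +1 is a smoothing term
    IDF[t] = 1 / float(DF[t] + 1)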

In [37]:
for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[:10]:
    print(IDF_t)

print("...")

for IDF_t in sorted(IDF.items(), key=itemgetter(1),reverse = False)[-10:]:
    print(IDF_t)

('the', 0.006289308176100629)
('in', 0.008064516129032258)
('and', 0.009900990099009901)
('antetokounmpo', 0.009900990099009901)
('a', 0.011111111111111112)
('to', 0.012987012987012988)
('he', 0.013333333333333334)
('of', 0.013513513513513514)
('points', 0.016666666666666666)
('bucks', 0.018518518518518517)
...
('west', 0.5)
('western', 0.5)
('winner', 0.5)
('without', 0.5)
('work', 0.5)
('yet', 0.5)
('youth', 0.5)
('zaragoza', 0.5)
('zisis', 0.5)
('αντετοκούνμπο', 0.5)
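
The weighting above is the plain inverse 1/(DF+1). A common textbook variant takes a logarithm instead, log(N/DF), where N is the number of sentences. The sketch below compares the two on a made-up three-sentence corpus; all names and data in it are assumptions for illustration.

```python
import math

# Made-up tokenized sentences
sentences = [["the", "squid", "boats"], ["the", "fleet"], ["the", "squid"]]
N = len(sentences)
vocab = sorted({w for sent in sentences for w in sent})

# Document frequency: in how many sentences does each term occur?
df = {t: sum(1 for sent in sentences if t in sent) for t in vocab}

idf_inverse = {t: 1 / (df[t] + 1) for t in vocab}    # the weighting used above
idf_log = {t: math.log(N / df[t]) for t in vocab}    # a common textbook variant

for t in vocab:
    print(t, df[t], round(idf_inverse[t], 3), round(idf_log[t], 3))
```

Both versions rank the terms in the same order, from "the" (least specific) to "boats" and "fleet" (most specific). The log variant additionally gives a weight of exactly zero to a term that occurs in every sentence.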
