# Exercise 1 : Locate the hidden fortress of the evil Dr. Unstructured

General guidelines : after decomposing each step into small tasks, check for each task whether :


*   You know how to implement it : write the code
*   You don't know how to implement it : search Google for either **code examples** or **existing packages**

If you can't find code for a simple task on Google, the task is probably not general enough yet, and you need to divide it into smaller tasks.



### Step 1 : First, we need to make sure all agents are ready for duty.
Write code that prints, for each agent, that they are ready for duty.
For example for agent Fluffinson it would print 'Fluffinson, ready for duty!'

> Use a **loop** and an **f-string**

In [None]:
# Code printing ready for duty messages
agents = ['Doggson', 'Fluffinson', 'Marshmallow', 'Bella']

# Write your code here :
for agent in agents:
  print(f'{agent}, ready for duty!')


Doggson, ready for duty!
Fluffinson, ready for duty!
Marshmallow, ready for duty!
Bella, ready for duty!


### Step 2 : Now that our agents are ready for duty, they are hungry.
We need a way to retrieve their favorite food. Write a function that takes the name of an agent as argument, and returns their favorite food. Don't forget to use **types** and **documentation**.
Here are the favorite foods of the different agents. Please keep them secret.

*   Doggson: Chicken
*   Fluffinson: Milk
*   Bella: Apples



In [None]:
# Write your code here :
def get_favorite_food(name:str) -> str:
  """
  Returns the favorite food of an agent
  """
  favorite_foods = {'Doggson':'Chicken', 'Fluffinson':'Milk', 'Bella':'Apples'}
  return favorite_foods[name]


Now use this function to print the favorite food of Bella. It should print : 'The favorite food of Bella is Apples'. Use a **function call** and an **f-string**.

In [None]:
# Call your function here:
food = get_favorite_food('Bella')
print(f'The favorite food of Bella is {food}')

The favorite food of Bella is Apples
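
The list of favorite foods above has no entry for agent Marshmallow, so looking that name up directly in the dictionary would raise a `KeyError`. The sketch below shows one possible way to handle that case with `dict.get`. The names `FAVORITE_FOODS` and `find_favorite_food` are illustrative and not part of the exercise.

```python
from typing import Optional

# Favorite foods copied from the list above; Marshmallow has no entry.
FAVORITE_FOODS = {'Doggson': 'Chicken', 'Fluffinson': 'Milk', 'Bella': 'Apples'}

def find_favorite_food(name: str) -> Optional[str]:
    """Return the agent's favorite food, or None if it is not on record."""
    return FAVORITE_FOODS.get(name)

for agent in ['Bella', 'Marshmallow']:
    food = find_favorite_food(agent)
    if food is None:
        print(f'No favorite food on record for {agent}')
    else:
        print(f'The favorite food of {agent} is {food}')
```

Returning `None` leaves it to the caller to decide what to do with an unknown agent. Raising an error with a clearer message would be an equally reasonable choice.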


### Step 3 : Quick, our agents just intercepted a new code message sent from the hidden fortress of the evil Dr. Unstructured !
Our top communication experts tell us that the longest word from that message contains a crucial clue to the location of the fortress. We need your help to **print** the **longest** **word** from that message.

Here is the secret message :

> We are guarding the hidden fortress where it is snowing and we are so cold



In [None]:
# Write your code here:
message = 'We are guarding the hidden fortress where it is snowing and we are so cold'
words = message.split(' ')
longest_word = ''
for word in words:
  if len(word) > len(longest_word):
    longest_word = word
print(longest_word)

guarding
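
Once the loop version works, the same search can be written more compactly with the built-in `max` function and `key=len`. This is only an alternative sketch of the loop above, not a required solution. When several words share the maximum length, `max` keeps the first one it meets, just like the loop with a strict `>` comparison.

```python
message = 'We are guarding the hidden fortress where it is snowing and we are so cold'

# split() without arguments splits on any run of whitespace
words = message.split()

# key=len tells max to compare the words by their length
longest_word = max(words, key=len)
print(longest_word)
```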


### Step 4 : eutils is an API provided by PubMed. It lets you search and retrieve biomedical publications.
You can get the number of publications matching a particular query by supplying the query as the **term** parameter in the URL.

For example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=guarding&retmax=0&retmode=json



1.   Open this URL in your browser to see the structure of the JSON response it returns. Notice how dictionaries and lists in JSON look similar to dictionaries and lists in Python.
2.   What do you think is the search query in this URL ? Try different search queries.
3.   Write a function in Python that takes a query (or search term) as parameter and returns the number of publications matching this query in PubMed.
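
To illustrate how JSON maps onto Python dictionaries and lists, here is a small sketch using the standard `json` module. The JSON text in `sample` is a made-up example for illustration only; it is not the actual response returned by eutils.

```python
import json

# Hypothetical JSON text (NOT the real eutils response)
sample = '{"agents": ["Doggson", "Bella"], "mission": {"step": 4, "ready": true}}'

# json.loads turns JSON text into Python objects
data = json.loads(sample)

print(type(data))                 # a JSON object becomes a dict
print(data['agents'][0])          # a JSON array becomes a list
print(data['mission']['step'])    # nested objects become nested dicts
print(data['mission']['ready'])   # JSON true becomes Python True
```

The `result.json()` method of the `requests` package performs the same conversion on the body of an HTTP response.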




In [None]:
import requests
# Write your code here :
def get_publications_count(query:str) -> int:
  """
  Returns the number of PubMed publications matching the query
  """
  url = f'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={query}&retmax=0&retmode=json'
  result = requests.get(url)
  if result.status_code == 200:
    # Cast to int so the returned value matches the declared return type
    return int(result.json()['esearchresult']['count'])
  else:
    print('Cannot access the URL')

Now, use this function to retrieve and **print** the number of publications corresponding to the term **guarding**.

In [None]:
# Write your code here :
print(get_publications_count('guarding'))

23656
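
Placing the query directly inside the URL works for a single word such as guarding, but a query containing spaces or characters like `&` needs to be URL-encoded first. The sketch below builds the same URL with `urllib.parse.urlencode` from the standard library. The helper name `build_esearch_url` is an illustrative assumption, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_esearch_url(query: str) -> str:
    """Build the search URL, encoding special characters in the query."""
    params = {'db': 'pubmed', 'term': query, 'retmax': 0, 'retmode': 'json'}
    return f'{BASE_URL}?{urlencode(params)}'

print(build_esearch_url('guarding'))
print(build_esearch_url('hidden fortress & snow'))
```

With the `requests` package, passing the same dictionary as `requests.get(BASE_URL, params=params)` should achieve the same encoding.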


### Step 5 : Now, sum together all the digits of the number.

For example, if the number of publications was 123, you should get 1 + 2 + 3 = 6

**TIP 1**: A string is a sequence. You can **loop** over it like you can loop over a list.

**TIP 2**: You can cast variables to change their type. For example you can cast an integer into a string, so it will behave like a string. And you can cast a string into an integer.

- *my_integer=int('6')*
- *my_string = str(234)*


In [None]:
# Write your code here :
total = 0  # named 'total' to avoid shadowing the built-in sum() function
pub_count = get_publications_count('guarding')
s_pub_count = str(pub_count)
for s_digit in s_pub_count:
  total += int(s_digit)
print(total)

22
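
The two tips can also be combined into a single expression with the built-in `sum` and a generator expression. The function name `digit_sum` is an illustrative choice, and the sketch assumes a non-negative integer (a minus sign could not be cast to an integer).

```python
def digit_sum(number: int) -> int:
    """Sum the decimal digits of a non-negative integer."""
    # str(number) is a sequence of characters; each one is cast back to an int
    return sum(int(digit) for digit in str(number))

print(digit_sum(123))
print(digit_sum(23656))
```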


### Step 6 : Good job! We now have the first coordinate of the secret fortress of the evil Dr. Unstructured.
Our agents are ready. Now you need to find the second coordinate. Thankfully, agent Mittenson, after he finished playing with the curtains, intercepted another transmission from Dr. Unstructured's evil agents. The transmission was just a list of numbers. However, our best data experts think that the only number not divisible by **2** is the second coordinate of the hidden fortress.

Agent Doggson has charged you personally with uncovering (**printing**) that second coordinate. Here are the intercepted numbers :



> 8, 6, 10, 11, 14, 2

TIP : You will need to search Google for the **modulo** operator in Python


In [None]:
# Write your code here :
numbers = [8, 6, 10, 11, 14, 2]
for number in numbers:
  rest = number % 2
  if rest != 0:
    print(number)

11
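
The modulo operator `%` returns the remainder of a division, so `number % 2` is 0 for even numbers and 1 for odd ones. As a sketch of the same idea in a more compact form, a list comprehension can collect every number that is not divisible by 2:

```python
numbers = [8, 6, 10, 11, 14, 2]

# Show the remainder of the division by 2 for each number
for number in numbers:
    print(number, '% 2 =', number % 2)

# Keep only the numbers whose remainder is not 0
odd_numbers = [number for number in numbers if number % 2 != 0]
print(odd_numbers)
```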


# Exercise 2 : Let's build our own search engine !

In this exercise we will build our own database, add publications from PubMed, and search them using stopwords and lemmatization.

We will also calculate the score of each match and rank the hits by score.

What we want is to be able to index text such as 'dogs are cool' and find it with a query such as 'are dogs cool ?'. This means that we need to break both our text and our query into words.

### Step 1 : To do that, let's first design a function **get_words** that will take as input a string of text, and return a list of words.

We don't want the returned words to contain punctuation or stopwords.

TIP1: Use the function **word_tokenize** from the nltk package

TIP2: You can check if a word is in **string.punctuation**

In [None]:
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab') # Tokenizer data needed by word_tokenize in recent NLTK versions

def get_words(text:str) -> list:
  """
  Returns meaningful words from the text
  """
  # Write your code here :
  words = nltk.word_tokenize(text)
  clean_words = []
  for word in words:
    if word not in stopwords.words('english'):
      if word not in string.punctuation:
        clean_words.append(word)
  return clean_words

# Use print to try out your code here :
print(get_words('dogs are cool!'))

['dogs', 'cool']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Step 2 : Now that we have a function that can transform any text into words, we want to build a database (implemented as dictionary).
The dictionary will **associate** each word with the set of indexes of the texts that contain it. The original texts themselves are stored in a separate **list**.

In [None]:
def build_database(texts:list) -> dict:
  """
  Returns a dictionary associating each word of each text to the index of that text
  """
  database = {}
  # Write your code here :
  for index, text in enumerate(texts):
    words = get_words(text)
    for word in words:
        if word not in database:
          database[word] = set()
        database[word].add(index)
  return database

# Try your code with a print here :
print(build_database(['dogs are cool!', 'my computer is too slow.']))


{'dogs': {0}, 'cool': {0}, 'computer': {1}, 'slow': {1}}
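
This kind of dictionary, mapping each word to the texts that contain it, is often called an inverted index. The sketch below shows the same structure built with `collections.defaultdict`, which removes the need to check whether a word is already present. To stay self-contained it uses `simple_words`, a hypothetical and much cruder stand-in for **get_words** (no stopword removal), so its output differs from the one above.

```python
import string
from collections import defaultdict

def simple_words(text: str) -> list:
    """Hypothetical stand-in for get_words: lowercase, split, strip punctuation."""
    words = [word.strip(string.punctuation) for word in text.lower().split()]
    return [word for word in words if word]

def build_index(texts: list) -> dict:
    """Map each word to the set of indexes of the texts that contain it."""
    index = defaultdict(set)  # missing words start with an empty set
    for position, text in enumerate(texts):
        for word in simple_words(text):
            index[word].add(position)
    return dict(index)

index = build_index(['dogs are cool!', 'cool computers are slow.'])
print(index['dogs'])
print(index['cool'])
```

Storing sets rather than lists means a word that appears twice in the same text is only recorded once for that text.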


### Step 3 : Now let's create the function **get_match_indexes** that will take the database and a query string as input, and return the indexes of all matching texts

In [None]:
def get_match_indexes(database:dict, query:str) -> set:
  """
  Returns the indexes of sentences matching the query
  """
  match_indexes = set()
  # Write your code here :
  query_words = get_words(query)
  for word in query_words:
    if word in database:
      word_match_indexes = database[word]
      match_indexes.update(word_match_indexes)
  return match_indexes

# Let's try this function :
demo_database = build_database(['dogs are cool!', 'my computer is too slow.'])
print(get_match_indexes(demo_database, 'I like dogs'))

{0}


### Our search engine is ready !
Let's combine the functions we created. We will add two sentences to the database, then search it with a query :

In [None]:
texts = ['dogs are cool!', 'my computer is too slow.']
demo_database = build_database(texts)
for match_index in get_match_indexes(demo_database, 'I like dogs'):
  print(texts[match_index])

dogs are cool!


### Step 4 : Nice job! You now have a fully functional search engine.
But before we can index some cool stuff, such as publications from PubMed, we need to improve it a little.

For example, let's try to search for 'My dog is great!'. How many matches do we have ? Why ?

In [None]:
# Write here the code to search for 'My dog is great!'
for match_index in get_match_indexes(demo_database, 'My dog is great!'):
  print(texts[match_index])

### Step 5 : It seems that our search engine doesn't know that dogs is the plural of dog.
The simplest way to deal with that is to transform each word into a root form, which we will store in our database.


> For example, we want to transform "dogs" into "dog"


And for that we can use lemmatization.

Let's improve our **get_words** function by lemmatizing each word.

In [None]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def get_words(text:str) -> list:
  """
  Returns meaningful words from the text
  """
  # Write your code here :
  words = nltk.word_tokenize(text)
  clean_words = []
  for word in words:
    if word not in stopwords.words('english'):
      if word not in string.punctuation:
        word = lemmatizer.lemmatize(word)
        clean_words.append(word)
  return clean_words

# Let's try the function here :
print(get_words('the dogs are outside'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


['dog', 'outside']


What if we try the text 'My Dogs are great' ?

In [None]:
# Write here the code to get the words of 'My Dogs are great'
print(get_words('My Dogs are great'))

['My', 'Dogs', 'great']


### Step 6 : Update the function **get_words** to make sure we only add lowercase words to the database.

In [None]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def get_words(text:str) -> list:
  """
  Returns meaningful words from the text
  """
  text = text.lower()
  words = nltk.word_tokenize(text)
  clean_words = []
  for word in words:
    if word not in stopwords.words('english'):
      if word not in string.punctuation:
        word = lemmatizer.lemmatize(word)
        clean_words.append(word)
  return clean_words

print(get_words('The Dogs are outside'))

['dog', 'outside']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Step 7 : What if we add to our database two sentences :


*   I like dogs
*   Dogs and cats

If we search for '**I like my Dog**', the first sentence is clearly a better match than the second. So we should display it first.

This means that we should **rank** our matches based on the *number of words from the query that they contain*.

Modify the **get_match_indexes** function to store the number of words from the query matched for each document where there was a match.



In [None]:
from collections import defaultdict

def get_match_indexes(database:dict, query:str) -> dict:
  """
  Returns, for each matching sentence index, the number of query words it contains
  """
  match_indexes = defaultdict(int)
  # Write your code here :
  query_words = get_words(query)
  for word in query_words:
    if word in database:
      word_match_indexes = database[word]
      for word_match_index in word_match_indexes:
        match_indexes[word_match_index] += 1
  return match_indexes

# Now let's try this code :
texts = ['Dogs and cats', 'I like dogs']
demo_database = build_database(texts)
print(get_match_indexes(demo_database, 'I like my Dog'))

defaultdict(<class 'int'>, {1: 2, 0: 1})


### Step 8 : Now add a **get_ranked_matches** function to rank the match indexes by the number of query words that each sentence in our database matches.

In [None]:
def get_ranked_matches(matches:dict)->list:
  """
  Returns a ranked list of matches
  """
  # Write your code here :
  l_matches = [(index, score) for index, score in matches.items()]
  sorted_matches = sorted(l_matches, key=lambda m:m[1], reverse=True)
  return [m[0] for m in sorted_matches]

Now let's try it on an example :

In [None]:
texts = ['Dogs and cats', 'I like dogs']
demo_database = build_database(texts)
matches = get_match_indexes(demo_database, 'I like my Dog')
ranked_matches = get_ranked_matches(matches)

for match_index in ranked_matches:
  print(texts[match_index])

I like dogs
Dogs and cats
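
As a possible alternative to sorting by hand, `collections.Counter` from the standard library has a `most_common` method that returns its entries from the highest count to the lowest. The sketch below applies it to the scores printed in Step 7. The variable names are illustrative.

```python
from collections import Counter

texts = ['Dogs and cats', 'I like dogs']

# Scores from Step 7: text 1 matched two query words, text 0 matched one
matches = {1: 2, 0: 1}

# most_common() yields (index, score) pairs, highest score first
ranked_matches = [index for index, score in Counter(matches).most_common()]

for match_index in ranked_matches:
    print(texts[match_index])
```

Entries with equal scores keep the order in which they were first inserted, so ties are not broken in any meaningful way. That is also true of the `sorted` version above.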


### Step 9 : Now, let's nicely pack all this functionality inside a **class** called **SearchEngine**

In [None]:
import string
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

class SearchEngine:

  def __init__(self):
    self.lemmatizer = WordNetLemmatizer()

  # Add your functions after this line. Don't forget to transform them into methods. (the first parameter must be 'self')

  def get_words(self, text:str) -> list:
    text = text.lower()
    words = nltk.word_tokenize(text)
    clean_words = []
    for word in words:
      if word not in stopwords.words('english'):
        if word not in string.punctuation:
          word = self.lemmatizer.lemmatize(word)
          clean_words.append(word)
    return clean_words

  def build_database(self, texts:list) -> None:
    self.texts = texts
    self.database = {}
    for index, text in enumerate(texts):
      words = self.get_words(text)
      for word in words:
          if word not in self.database:
            self.database[word] = set()
          self.database[word].add(index)

  def get_match_indexes(self, query:str) -> dict:
    match_indexes = defaultdict(int)
    query_words = self.get_words(query)
    for word in query_words:
      if word in self.database:
        word_match_indexes = self.database[word]
        for word_match_index in word_match_indexes:
          match_indexes[word_match_index] += 1
    return match_indexes

  def get_ranked_matches(self, matches:dict)->list:
    l_matches = [(index, score) for index, score in matches.items()]
    sorted_matches = sorted(l_matches, key=lambda m:m[1], reverse=True)
    return [m[0] for m in sorted_matches]

  def search(self, query:str) -> None:
    match_indexes = self.get_match_indexes(query)
    ranked_matches = self.get_ranked_matches(match_indexes)
    for match_index in ranked_matches:
      match_text = self.texts[match_index]
      print(match_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now let's try this new class :

In [None]:
engine = SearchEngine()
engine.build_database(['Dogs and cats', 'I like dogs', 'only cats here'])
engine.search('I like my dog')

I like dogs
Dogs and cats


### Step 10 : Now let's try with some actual publication titles !
Write a function that will retrieve a list of titles from the LitCovid API.

In [None]:
import requests
def get_publication_titles() -> list:
  """
  Returns the top 100 most recent titles from publications in LitCovid
  """
  url = 'https://www.ncbi.nlm.nih.gov/research/coronavirus-api/latest/?limit=100'
  # Write your code here :
  result = requests.get(url)
  if result.status_code == 200:
    titles = []
    for p in result.json():
      titles.append(p['title'])
    return titles
  else:
    print('Cannot access the URL')
    return []  # empty list, so the caller can still build a database
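
The title extraction can be tried without any network access by running it on a hand-written list of records. In the sketch below, `sample_records` and the helper `extract_titles` are hypothetical. Only the **title** key comes from the code above; the other content is invented for illustration and does not reflect the real API response.

```python
def extract_titles(records: list) -> list:
    """Collect the 'title' of each record, skipping missing or empty titles."""
    return [record['title'] for record in records if record.get('title')]

# Hypothetical records, NOT real LitCovid data
sample_records = [
    {'title': 'A hypothetical title about vaccines'},
    {'title': ''},          # empty title: skipped
    {'other_field': 0},     # no title at all: skipped
]

print(extract_titles(sample_records))
```

Using `record.get('title')` instead of `record['title']` in the filter avoids a `KeyError` if a record has no title.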


Now let's try it !

In [None]:
titles = get_publication_titles()
engine = SearchEngine()
engine.build_database(titles)
engine.search('vaccine covid-19')

Analysis of antibody markers as immune correlates of risk of severe COVID-19 in the PREVENT-19 efficacy trial of the NVX-CoV2373 recombinant protein vaccine.
Daily briefing: People with cancer lived longer if they'd had a COVID-19 vaccine.
Nationwide estimates of SARS-CoV-2 infection fatality rates and numbers needed to vaccinate for COVID-19 vaccines in 2024 in Austria.
Nanobody-based combination vaccine using licensed protein nanoparticles protects animals against respiratory and viral infections.
mRNA covid vaccines may "turbo charge" cancer immunotherapy, research suggests.
Science for vaccine policy: Independent review of the September 2025 ACIP processes, deliberations and votes.
Evolving Features of Acute Flaccid Myelitis After COVID-19: A Four-Case Series.
Loans dominated COVID-19 funding: it's time to adjust.
Retrospective Analysis on Mortality and Functional Outcomes in Critically Ill Elderly with COVID-19: A Comparative Study Between Full Code and DNR Orders.
Frailty Assessm