<a href="https://colab.research.google.com/github/guilhermelaviola/NaturalLanguageProcessing/blob/main/Class06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Information Retrieval**
Information Retrieval (IR) is an area of Natural Language Processing (NLP) concerned with finding the most relevant documents in a large text collection, using ranking algorithms rather than exact keyword matching alone. Central to IR is document indexing, in particular the inverted index, which maps each term to the documents that contain it so that a query does not have to scan every document. Common techniques include Levenshtein distance, which counts the single-character edits needed to turn one string into another and so tolerates variations in search terms, and the Bag of Words (BoW) model, which represents documents as word-frequency vectors and compares them using cosine similarity. An IR pipeline typically starts with preprocessing steps such as stop-word removal and normalization, followed by similarity calculations that rank documents by relevance. Large-scale search engines such as Google combine indexing, ranking algorithms, and machine learning to deliver accurate results, which makes IR an essential discipline for anyone working with large volumes of text.
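
To make the indexing idea concrete, here is a minimal sketch of an inverted index over a toy collection. The names `build_inverted_index`, `lookup`, and `toy_docs` are hypothetical and are not used elsewhere in this notebook; tokenization is a plain whitespace split, which is an assumption made to keep the example short.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of documents that contain every token of the query."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

# Toy collection (illustrative data only, not the Wikipedia pages used below)
toy_docs = {
    'd1': 'brazil is in south america',
    'd2': 'france is in europe',
    'd3': 'south africa is in africa',
}
toy_index = build_inverted_index(toy_docs)
print(lookup(toy_index, 'south america'))
```

Only the posting sets of the query terms are touched during a lookup, which is why the structure scales better than scanning every document.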

In [None]:
! pip3 install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=8ce7e7bfd96aede4b7117229c2cc97cf3392c8e2722e6ebb2863c4781650573d
  Stored in directory: /root/.cache/pip/wheels/8f/ab/cb/45ccc40522d3a1c41e1d2ad53b8f33a62f394011ec38cd71c6
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [None]:
! pip3 install python-Levenshtein

In [None]:
# Importing all the necessary libraries and resources:
import wikipedia
from Levenshtein import distance
import string
import nltk
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

## **Example: Information Retrieval with NLTK and Levenshtein**
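In the following example, we download Wikipedia pages and retrieve documents by comparing the search string with the page titles using Levenshtein distance. After that, we build Bag-of-Words vectors with NLTK and rank the documents by cosine similarity.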
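
As a rough illustration of what the Levenshtein distance computes, the sketch below implements the classic dynamic-programming recurrence with unit costs. The function name `edit_distance` is hypothetical; the notebook itself relies on the `Levenshtein` package, which is faster and also supports per-operation weights.

```python
def edit_distance(source, target):
    """Unit-cost Levenshtein distance between two strings (row-by-row DP)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into '' takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(edit_distance('kitten', 'sitting'))               # 3
print(edit_distance('North America', 'South America'))  # 2
```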

In [None]:
# Downloading Wikipedia pages:
wikipedia.set_lang('en')

wikipedia.search('North America')
wikipedia.search('Central America')
wikipedia.search('South America')
wikipedia.search('Europe')
wikipedia.search('Asia')
wikipedia.search('Africa')
wikipedia.search('Oceania')

documents = {}
for continent in ['North America', 'Central America', 'South America', 'Europe', 'Asia', 'Africa', 'Oceania']:
  for page in wikipedia.search(continent):
    # Skip titles that do not resolve to a single page (missing or ambiguous)
    try:
      documents[page] = {
          'title': page,
          'content': wikipedia.page(page).content
      }
    except wikipedia.exceptions.PageError:
      print(f"Skipping page '{page}' due to PageError.")
    except wikipedia.exceptions.DisambiguationError as e:
      print(f"Skipping page '{page}' due to DisambiguationError: {e}")

len(documents)
print(documents['South America'])

['Oceania',
 'List of islands in the Pacific Ocean',
 'List of sovereign states and dependent territories in Oceania',
 'Political geography of Nineteen Eighty-Four',
 'Oceania (disambiguation)',
 '2025 Formula Regional Oceania Championship',
 'Oceania Cruises',
 'Australia (continent)',
 '2025 Oceania Badminton Championships',
 'Women in Oceania']

In [None]:
# Retrieving documents using Levenshtein:
distance('Europe', 'Africa')
distance('North America', 'South America')

documents.keys()

titles = list(documents.keys())


In [None]:
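# Defining functions to calculate distance and to rank the closest titles to a search:
def levenshtein(str1, str2):
  # weights=(insertion, deletion, substitution): insertions are cheap, so a short
  # search string is penalized little for the extra characters of a longer title.
  return distance(str1.lower(), str2.lower(), weights=(1, 10, 10))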

def distance_ranking(search):
  distances = [(title, levenshtein(search, title)) for title in titles]
  return sorted(distances, key=lambda x: x[1])

distance_ranking('El Salvador')

distance_ranking('Louisiana')

[('South American plate', 41),
 ('List of islands in the Pacific Ocean', 47),
 ('Women in Oceania', 47),
 ('Asia (Asia album)', 49),
 ('South Africa', 53),
 ('Latin America', 54),
 ('South America', 54),
 ('Australia (continent)', 54),
 ('List of North American cities by population', 56),
 ('British colonization of the Americas', 57),
 ('Languages of South America', 57),
 ('List of South American countries by population', 57),
 ('European Union', 57),
 ('Survivor: Africa', 58),
 ('Enel North America', 59),
 ('North American Union', 61),
 ('British North America', 62),
 ('West Asia', 62),
 ('List of sovereign states and dependent territories in Oceania', 62),
 ('North Africa', 63),
 ('2025 Formula Regional Oceania Championship', 63),
 ('Flag of Central America', 64),
 ('Europe (band)', 64),
 ('West Africa', 64),
 ('Indigenous peoples of the Americas', 65),
 ('Economy of South America', 65),
 ('History of South America', 65),
 ('Languages of Europe', 65),
 ('Federal Republic of Central A

In [None]:
# Defining a function to return the closest document to the search:
def recover_document(search):
  # distance_ranking already returns the titles sorted by ascending distance
  best_title = distance_ranking(search)[0][0]
  return documents[best_title]['content']

recover_document('Mexico')
recover_document('Zambia')

'The Americas, sometimes collectively called America, are a landmass comprising the totality of North America and South America. When viewed as a single continent, the Americas or America is the 2nd largest continent right after Asia, and is the 3rd largest continent by population. The Americas make up most of the land in Earth\'s Western Hemisphere and comprise the New World.\nAlong with their associated islands, the Americas cover 8% of Earth\'s total surface area and 28.4% of its land area. The topography is dominated by the American Cordillera, a long chain of mountains that runs the length of the west coast. The flatter eastern side of the Americas is dominated by large river basins, such as the Amazon, St. Lawrence River–Great Lakes, Mississippi, and La Plata basins. Since the Americas extend 14,000 km (8,700 mi) from north to south, the climate and ecology vary widely, from the arctic tundra of Northern Canada, Greenland, and Alaska, to the tropical rainforests in Central Americ

In [None]:
# Retrieving documents using Bag-of-Words:
punctuation = string.punctuation

nltk.download('punkt')  # recent NLTK releases may also require nltk.download('punkt_tab')
nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

# 'Honduras' is not among the downloaded titles (it raises a KeyError), so inspect a page that exists:
documents['South America'].keys()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

In [None]:
# Defining the function that counts how often each token occurs in a document:
def extract_tokens_freq(text):
  freq = {}
  more_stopwords = ['===', '==', "''", '``']
  tokens = word_tokenize(text)
  # Lowercase every token and drop punctuation, stop words, and wiki markup leftovers
  tokens = [token.lower() for token in tokens if token.lower() not in punctuation and token.lower() not in stopwords and token.lower() not in more_stopwords]
  for token in tokens:
    if token not in freq:
      freq[token] = 1
    else:
      freq[token] += 1
  return freq

In [None]:
# Defining the 10 most important words for all documents, which will be used as parameters for the vectors:
for document in documents:
  freq_tokens = extract_tokens_freq(documents[document]['content'])
  top_10 = sorted(freq_tokens.items(), key=lambda x: x[1], reverse=True)[:10]
  documents[document]['top_10'] = dict(top_10)

# Inspecting a page that is known to be in the collection:
print(documents['South America']['top_10'])

all_top_10 = []

for document in documents:
  top_10 = documents[document]['top_10']
  all_top_10.extend([token for token in top_10])

print(all_top_10)

print(len(all_top_10))

# Removing duplicates; sorting keeps the vector dimensions in a fixed order:
all_top_10 = sorted(set(all_top_10))
print(len(all_top_10))

In [None]:
# Defining the function to vectorize documents and user search:
def vectorize(text):
  vector = []
  freq_tokens = extract_tokens_freq(text)
  # One dimension per vocabulary token; tokens absent from the text count as 0
  for token in all_top_10:
    vector.append(freq_tokens.get(token, 0))
  return vector

vectors = []
titles = []

for document in documents:
  titles.append(document)
  vector = vectorize(documents[document]['content'])
  vectors.append(vector)
  print(vector)

In [None]:
# Defining functions to calculate similarity between documents and user search using vectors:
def top_vectors(question):
  question_vec = vectorize(question)
  similarities = cosine_similarity(np.asarray(question_vec).reshape(1, -1), np.asarray(vectors))
  ranking = sorted([(titles[i], similarity) for i, similarity in enumerate(similarities[0])], key=lambda x: x[1], reverse=True)
  return ranking

def recover_using_vector(question):
  best_result = top_vectors(question)[0]
  title = best_result[0]
  similarity = best_result[1]
  return "Similarity: {}\n{}\n\n{}".format(similarity, question, documents[title]['content'])

top_vectors('Brazil')
top_vectors('Colombia')
top_vectors('Paraguay')

recover_using_vector('What is the brazilian capital?')
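
The ranking above delegates the similarity computation to scikit-learn. As a minimal sketch of what that computation does, the example below builds two Bag-of-Words vectors by hand and computes their cosine similarity with the standard library only. The helpers `bow_vector` and `cosine`, the vocabulary `toy_vocabulary`, and the two toy sentences are hypothetical and independent of the notebook's variables.

```python
import math
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vocabulary and texts (illustrative only)
toy_vocabulary = ['brazil', 'capital', 'europe']
doc_vec = bow_vector('brazil capital brazil', toy_vocabulary)  # [2, 1, 0]
query_vec = bow_vector('capital of brazil', toy_vocabulary)    # [1, 1, 0]
print(cosine(doc_vec, query_vec))
```

Because cosine similarity depends on direction rather than length, a long document and a short query can still score close to 1 when they use the vocabulary terms in similar proportions.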