# Task B - Information Retrieval 1
1. Consider an academic journal of your own choice and collect 20 abstracts using a method of your own (this can be a simple manual copy-and-paste operation) in a single file. Save this file on your local disk. Also save the keywords mentioned in each abstract (if the journal provides keywords; otherwise you may use the words of the paper's title as keywords) in a separate file.
2. Consider an information retrieval system where a keyword plays the role of a search query. Write a script that uses logical query-matching for five queries of your own choice (from the list of keywords) to find out whether a given query is found in the document or not, so that for each keyword input, the program outputs 1 if a logical match is found (the given keyword is found in the abstract) and 0 otherwise.
3. Now, instead of compiling the abstracts into a single file, we want to keep each abstract as a separate file, labeled A1, A2, …, A20. Write a script that constructs an inverted file of the abstract files. Then suggest a program that employs a simple string matching operation to output the list of abstract files (A0, A1, …, A20) for each keyword.
4. We want to relax the assumption of exact matching between keywords and words of the abstract and allow the matching to be considered correct if 90% of the characters of the keyword are found in one word of the abstract. Write a script that implements this reasoning and displays the result of your search operation.

## Solution B1:
Consider an academic journal of your own choice and collect 20 abstracts using a method of your own (this can be a simple manual copy-and-paste operation) in a single file. Save this file on your local disk. Also save the keywords mentioned in each abstract (if the journal provides keywords; otherwise you may use the words of the paper's title as keywords) in a separate file.

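For this project, I will use one of my favorite sources, "arXiv" (strictly a preprint repository rather than a journal). The Python program below fetches the listing page "https://arxiv.org/list/astro-ph.GA/current", collects the link to each paper's abstract page, and saves the links in "paper_links.txt". The titles and abstracts are fetched from those links in the next step, and the keywords are then derived from the titles. Note that the listing returned 25 links in the run shown here, so more than the required 20 papers are processed.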

In [168]:
import requests
from bs4 import BeautifulSoup

root_url = "https://arxiv.org/list/astro-ph.GA/current"

response = requests.get(root_url)
print("Fetching: %s" % root_url)

links = []
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    elements = soup.select("#dlpage span.list-identifier a")
    for el in elements:
        href = el.get('href', '')
        if href[0:4] == "/abs":
            link = "https://arxiv.org" + href
            print(link)
            links.append(link)
print("Total links: %s" % len(links))

with open("paper_links.txt", "w") as file:
    file.write("\n".join(links))
    
print("Saving paper links...")

Fetching: https://arxiv.org/list/astro-ph.GA/current
https://arxiv.org/abs/2309.00031
https://arxiv.org/abs/2309.00039
https://arxiv.org/abs/2309.00041
https://arxiv.org/abs/2309.00045
https://arxiv.org/abs/2309.00048
https://arxiv.org/abs/2309.00050
https://arxiv.org/abs/2309.00053
https://arxiv.org/abs/2309.00102
https://arxiv.org/abs/2309.00110
https://arxiv.org/abs/2309.00198
https://arxiv.org/abs/2309.00272
https://arxiv.org/abs/2309.00291
https://arxiv.org/abs/2309.00318
https://arxiv.org/abs/2309.00449
https://arxiv.org/abs/2309.00459
https://arxiv.org/abs/2309.00501
https://arxiv.org/abs/2309.00657
https://arxiv.org/abs/2309.00670
https://arxiv.org/abs/2309.00671
https://arxiv.org/abs/2309.00719
https://arxiv.org/abs/2309.00852
https://arxiv.org/abs/2309.00888
https://arxiv.org/abs/2309.00955
https://arxiv.org/abs/2309.01024
https://arxiv.org/abs/2309.01039
Total links: 25
Saving paper links...
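
The listing returned 25 links in this run, while the task asks for 20 abstracts. A minimal sketch of trimming the list before it is saved is shown below. The function `select_abs_links` and its `limit` and `base` parameters are hypothetical names introduced for illustration; they are not part of the program above.

```python
def select_abs_links(hrefs, limit=20, base="https://arxiv.org"):
    """Return up to `limit` unique absolute links built from hrefs that start with /abs."""
    links = []
    seen = set()
    for href in hrefs:
        if not href.startswith("/abs"):
            continue  # skip pdf and other non-abstract links
        link = base + href
        if link not in seen:  # keep the original order, drop repeats
            seen.add(link)
            links.append(link)
        if len(links) == limit:
            break
    return links


# hand-written sample hrefs, only to show the behavior
sample = ["/abs/2309.00031", "/pdf/2309.00031", "/abs/2309.00039", "/abs/2309.00031"]
print(select_abs_links(sample))
```

In the crawler, the `href` values collected from the page could be passed through this function before writing `paper_links.txt`.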


Now we will load the `paper_links.txt` file and fetch the title and abstract of each paper.

In [185]:
import requests
from bs4 import BeautifulSoup
import json

with open('paper_links.txt', 'r') as file:
    # splitlines() drops the trailing newline that readlines() would leave on each URL
    paper_links = file.read().splitlines()

journal_data = []
abstracts = []
titles = []
for paper_link in paper_links:
    response = requests.get(paper_link)
    print("Fetching: %s" % paper_link)
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")

        title_el = soup.select("#abs h1.title")
        abstract_el = soup.select("blockquote.abstract")
        
        # the [6:] and [12:] slices skip the label text that precedes the title and the abstract
        title = title_el[0].text[6:]
        abstract = abstract_el[0].text[12:].replace('\n', ' ')
        
        titles.append(title)
        abstracts.append(abstract)
        
        journal_data.append({
            "title": title,
            "abstract": abstract
        })
        
        # save this data for backup purpose as json
        with open("journal_data.json", "w") as file:
            json.dump(journal_data, file, indent=4)
            
print("Total papers fetched: %s" % len(titles))
            
print("Saving abstracts.txt...")
with open("abstracts.txt", "w") as file:
    file.write("\n".join(abstracts))

print("Saving titles.txt...")
with open("titles.txt", "w") as file:
    file.write("\n".join(titles))

        
        
        
        
        

Fetching: https://arxiv.org/abs/2309.00031

Fetching: https://arxiv.org/abs/2309.00039

Fetching: https://arxiv.org/abs/2309.00041

Fetching: https://arxiv.org/abs/2309.00045

Fetching: https://arxiv.org/abs/2309.00048

Fetching: https://arxiv.org/abs/2309.00050

Fetching: https://arxiv.org/abs/2309.00053

Fetching: https://arxiv.org/abs/2309.00102

Fetching: https://arxiv.org/abs/2309.00110

Fetching: https://arxiv.org/abs/2309.00198

Fetching: https://arxiv.org/abs/2309.00272

Fetching: https://arxiv.org/abs/2309.00291

Fetching: https://arxiv.org/abs/2309.00318

Fetching: https://arxiv.org/abs/2309.00449

Fetching: https://arxiv.org/abs/2309.00459

Fetching: https://arxiv.org/abs/2309.00501

Fetching: https://arxiv.org/abs/2309.00657

Fetching: https://arxiv.org/abs/2309.00670

Fetching: https://arxiv.org/abs/2309.00671

Fetching: https://arxiv.org/abs/2309.00719

Fetching: https://arxiv.org/abs/2309.00852

Fetching: https://arxiv.org/abs/2309.00888

Fetching: https://arxiv.org/abs/
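
The slices `[6:]` and `[12:]` in the cell above depend on the exact length of the label text that precedes the title and the abstract. A hedged alternative is to strip the label by name. The sketch below assumes the element text begins with a label such as `Title:` or `Abstract:`, possibly surrounded by whitespace; `strip_label` is a hypothetical helper, not something the program above defines.

```python
import re


def strip_label(text, label):
    """Remove a leading 'label:' prefix and collapse whitespace runs to single spaces."""
    pattern = r"^\s*" + re.escape(label) + r"\s*:\s*"
    without_label = re.sub(pattern, "", text, count=1)
    return " ".join(without_label.split())


# hand-written inputs that imitate the assumed element text
print(strip_label("Title:The energy distribution of the first supernovae", "Title"))
print(strip_label("\n  Abstract:  We present\nnew results. ", "Abstract"))
```

Because the whitespace is normalized as well, the separate `replace('\n', ' ')` step would no longer be needed with this helper.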

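Next we load the `titles.txt` file and use NLTK to extract keywords from each title: the title is tokenized and POS-tagged, and only nouns, verbs, and adjectives that are not English stop words are kept.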


In [199]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK data if not already installed
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# keep nouns, verbs, and adjectives; prepositions, articles, and other tags are dropped
included_pos_tags = ['NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS']

keyword_lines = []

# load input
with open('titles.txt', 'r') as file:
    # splitlines() avoids the trailing newline that readlines() leaves on each title
    titles = file.read().splitlines()

    # build the stop-word set once instead of once per token
    stop_words = set(stopwords.words('english'))

    for title in titles:
        # Tokenize into words
        words = word_tokenize(title.lower())
        # perform part-of-speech (POS) tagging
        pos_tags = nltk.pos_tag(words)

        keywords = []

        for word, pos in pos_tags:
            # Check if the word is in included parts of speech tags and not a stop word
            if pos in included_pos_tags and word not in stop_words:
                keywords.append(word)
                
        keyword_line = ",".join(keywords)
        
        print("%s -> %s" % (title, keyword_line))
        
        keyword_lines.append(keyword_line)
        
print("Saving keywords.txt...")
with open("keywords.txt", "w") as file:
    file.write("\n".join(keyword_lines))
        


PEARLS: Near Infrared Photometry in the JWST North Ecliptic Pole Time Domain Field
 -> pearls,infrared,photometry,jwst,north,ecliptic,pole,time,domain,field
Fuzzy dark matter dynamics in tidally perturbed dwarf spheroidal galaxy satellites
 -> fuzzy,dark,matter,dynamics,perturbed,dwarf,spheroidal,galaxy,satellites
EDGE -- Dark matter or astrophysics? Clear prospects to break dark matter heating degeneracies with HI rotation in faint dwarf galaxies
 -> edge,dark,matter,astrophysics,clear,prospects,break,dark,matter,heating,degeneracies,hi,rotation,faint,dwarf,galaxies
The energy distribution of the first supernovae
 -> energy,distribution,first,supernovae
Detection of the Keplerian decline in the Milky Way rotation curve
 -> detection,keplerian,decline,milky,way,rotation,curve
Illuminating the Dark Side of Cosmic Star Formation II. A second date with RS-NIRdark galaxies in COSMOS
 -> illuminating,dark,side,cosmic,star,formation,ii,second,date,rs-nirdark,galaxies,cosmos
The first compreh

[nltk_data] Downloading package punkt to /home/burhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/burhan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/burhan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
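
The filtering step can be looked at in isolation from the tagger. The sketch below applies the same tag list to a hand-tagged title. The `(word, tag)` pairs are written by hand for illustration rather than produced by `nltk.pos_tag`, and `STOP_WORDS` is a tiny illustrative subset, not NLTK's English list.

```python
INCLUDED_POS_TAGS = {"NN", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJ", "JJR", "JJS"}
STOP_WORDS = {"the", "of", "in", "a", "an", "is", "are"}  # illustrative subset only


def filter_keywords(tagged_words):
    """Keep words whose tag is in the included list and that are not stop words."""
    return [word for word, tag in tagged_words
            if tag in INCLUDED_POS_TAGS and word not in STOP_WORDS]


# hand-assigned tags for "The energy distribution of the first supernovae"
tagged = [("the", "DT"), ("energy", "NN"), ("distribution", "NN"), ("of", "IN"),
          ("the", "DT"), ("first", "JJ"), ("supernovae", "NN")]
print(",".join(filter_keywords(tagged)))
```

With these assumed tags the articles and the preposition are removed by their tags alone, which is why the stop-word check mostly matters for words that are tagged as nouns, verbs, or adjectives.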


## Solution B2:
Consider an information retrieval system where a keyword plays the role of a search query. Write a script that uses logical query-matching for five queries of your own choice (from the list of keywords) to find out whether a given query is found in the document or not, so that for each keyword input, the program outputs 1 if a logical match is found (the given keyword is found in the abstract) and 0 otherwise.


In [201]:
# load Keywords from the File
with open('keywords.txt', 'r') as keywords_file:
    keywords_list = [line.strip() for line in keywords_file]

# Sample Queries

queries = [
    "dark matter",
    "radio galaxies",
    "star formation",
    "distributed systems",
    "spectral libraries"
]

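# query matching: returns 1 if every word of the query occurs (as a substring)
# in the comma-separated keyword line of at least one paper, and 0 otherwise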
def query_matching(query):
    for keywords in keywords_list:
        if all(keyword in keywords for keyword in query.split()):
            return 1  # match found
    return 0  # match not found

# run sample queries
for query in queries:
    result = query_matching(query)
    print(f"Query: '{query}' -> Match: {result}")


Query: 'dark matter' -> Match: 1
Query: 'radio galaxies' -> Match: 1
Query: 'star formation' -> Match: 1
Query: 'distributed systems' -> Match: 0
Query: 'spectral libraries' -> Match: 1
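
The task statement asks whether the keyword is found in the abstract, whereas the cell above tests the title-derived keyword lines. A minimal sketch of an abstract-based variant follows. It assumes a query matches when all of its words occur as whole words anywhere in the abstract (not necessarily adjacent). The helper names and the two toy abstracts are made up for illustration; they are not the fetched abstracts.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_in_abstract(query, abstract):
    """Return 1 if every word of the query occurs as a whole word in the abstract, else 0."""
    return int(tokenize(query) <= tokenize(abstract))


def match_vector(query, abstracts):
    """One 0/1 entry per abstract, in the order the abstracts are given."""
    return [query_in_abstract(query, abstract) for abstract in abstracts]


# toy abstracts written by hand, only to exercise the functions
toy_abstracts = [
    "We study dark matter heating in faint dwarf galaxies.",
    "Star formation histories of radio galaxies are measured.",
]
for query in ["dark matter", "radio galaxies", "distributed systems"]:
    print(query, match_vector(query, toy_abstracts))
```

To run it on the real data, the lines of `abstracts.txt` could be passed as the `abstracts` argument, one abstract per line.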


## Solution B3:
Now, instead of compiling the abstracts into a single file, we want to keep each abstract as a separate file, labeled A1, A2, …, A20. Write a script that constructs an inverted file of the abstract files. Then suggest a program that employs a simple string matching operation to output the list of abstract files (A0, A1, …, A20) for each keyword.

First we split our `abstracts.txt` file into individual files named A0, A1, A2, and so on. The numbering comes from `enumerate`, so it starts at 0.


In [202]:
import os

with open('abstracts.txt', 'r') as abstracts_file:
    abstracts = abstracts_file.read().splitlines()

# create the output folder if it does not exist yet
os.makedirs('abstracts', exist_ok=True)

for i, abstract in enumerate(abstracts):
    abstract_filename = f'abstracts/A{i}.txt'
    with open(abstract_filename, 'w') as abstract_file:
        abstract_file.write(abstract)

Now, let's construct an inverted file for keyword lookup:


In [204]:
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK data if not already installed
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# keep nouns, verbs, and adjectives; prepositions, articles, and other tags are dropped
included_pos_tags = ['NN', 'NNS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS']

inverted_file = {}

# read all files in the 'abstracts' folder (sorted so the posting lists are reproducible)
for filename in sorted(os.listdir('abstracts')):
    if filename.endswith('.txt'):
        abstract_filename = os.path.join('abstracts', filename)
        with open(abstract_filename, 'r') as abstract_file:
            abstract_text = abstract_file.read()
            
            words = word_tokenize(abstract_text.lower())
            # perform part-of-speech (POS) tagging
            pos_tags = nltk.pos_tag(words)
            
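            # build the stop-word set once per file rather than once per token
            stop_words = set(stopwords.words('english'))

            # collect the qualifying words as a set so that a file is listed
            # at most once per keyword, however often the word occurs in it
            keywords = set()

            for word, pos in pos_tags:
                # Check if the word is in included parts of speech tags and not a stop word
                if pos in included_pos_tags and word not in stop_words:
                    keywords.add(word)
            for keyword in keywords:
                if keyword not in inverted_file:
                    inverted_file[keyword] = []
                inverted_file[keyword].append(filename)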

# Save the inverted file
with open('inverted_file.txt', 'w') as inverted_file_txt:
    for keyword, abstracts_list in inverted_file.items():
        inverted_file_txt.write(f'{keyword}: {", ".join(abstracts_list)}\n')
print("Saved inverted_file.txt...")


[nltk_data] Downloading package punkt to /home/burhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/burhan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/burhan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Saved inverted_file.txt...
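
The saved file can also be read back into a dictionary, which is useful when the index is needed in a later session. The sketch below assumes the `keyword: file1, file2` line format written by the cell above; `parse_inverted_lines` is a hypothetical helper, and the sample lines are illustrative rather than copied from the real `inverted_file.txt`.

```python
def parse_inverted_lines(lines):
    """Rebuild a keyword -> list-of-files mapping from 'keyword: f1, f2' lines."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # split at the last ': ' so a keyword containing a colon is kept intact
        keyword, separator, files = line.rpartition(": ")
        if not separator:
            continue  # skip lines that do not follow the expected format
        index[keyword] = files.split(", ")
    return index


# illustrative lines in the assumed format
sample_lines = ["dark: A1.txt, A2.txt\n", "hole: A17.txt\n"]
print(parse_inverted_lines(sample_lines))
```

Since an open file object iterates over its lines, the same function could be called as `parse_inverted_lines(open_file)` inside a `with open('inverted_file.txt') as open_file:` block.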


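Now we can use the `inverted_file` index built above (and saved to `inverted_file.txt`) to answer keyword queries. First we define the `search_keywords()` function, which returns every file that contains at least one of the given keywords: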



In [206]:
def search_keywords(keywords):
    results = set()
    for keyword in keywords:
        keyword = keyword.lower()
        if keyword in inverted_file:
            results.update(inverted_file[keyword])
    return results


Now we search for the keywords `dark` and `matter`:

In [207]:

search_keywords_list = ["dark", "matter"]
matching_abstracts = search_keywords(search_keywords_list)

# Print matching abstracts
for abstract in matching_abstracts:
    print(f'Matching Abstract: {abstract}')


Matching Abstract: A5.txt
Matching Abstract: A2.txt
Matching Abstract: A1.txt
Matching Abstract: A24.txt
Matching Abstract: A18.txt


Another query, for the keywords `black` and `hole`:

In [209]:

search_keywords_list = ["black", "hole"]
matching_abstracts = search_keywords(search_keywords_list)

# Print matching abstracts
for abstract in matching_abstracts:
    print(f'Matching Abstract: {abstract}')


Matching Abstract: A17.txt
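
`search_keywords()` returns the union of the posting lists, so a file matches if it contains any one of the keywords. A stricter variant that requires all keywords can be sketched as an intersection. The function name `search_all_keywords` and the toy index are assumptions made for this example and do not reflect the real index contents.

```python
def search_all_keywords(keywords, inverted_index):
    """Return the set of files that contain every keyword (empty if any keyword is missing)."""
    result = None
    for keyword in keywords:
        postings = set(inverted_index.get(keyword.lower(), []))
        result = postings if result is None else result & postings
        if not result:
            return set()  # no file can satisfy the remaining keywords
    return result or set()


# toy index written by hand for illustration
toy_index = {"black": ["A5.txt", "A17.txt"], "hole": ["A17.txt"], "dark": ["A5.txt"]}
print(search_all_keywords(["black", "hole"], toy_index))
print(search_all_keywords(["dark", "hole"], toy_index))
```

Whether union or intersection is the better reading of the task depends on whether a multi-word query is meant as alternatives or as a phrase-like conjunction.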


## Solution B4:
We want to relax the assumption of exact matching between keywords and words of the abstract and allow the matching to be considered correct if 90% of the characters of the keyword are found in one word of the abstract. Write a script that implements this reasoning and displays the result of your search operation.

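We can implement fuzzy (partial) matching without external libraries by computing the Levenshtein (edit) distance between each keyword and each word of the abstract. A keyword is treated as matching a word when the distance is at most 10% of the keyword's length, which approximates the 90% requirement; the code also allows at least one edit, so that keywords shorter than ten characters can still match inexactly. Here is how to do it: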

In [212]:
# Function to calculate Levenshtein distance between two strings
def levenshtein_distance(s1, s2):
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

# Function to check if a keyword is a fuzzy match for an abstract word:
# the edit distance may be at most 10% of the keyword length, but never less than 1
def is_fuzzy_match(keyword, word):
    return levenshtein_distance(keyword.lower(), word.lower()) <= max(len(keyword) * 0.1, 1)  # Adjust the threshold as needed

# Function to search for keywords with fuzzy matching
def search_keywords_fuzzy(keywords):
    results = set()
    for keyword in keywords:
        keyword = keyword.lower()
        for i in range(len(abstracts)):
            abstract_filename = f'abstracts/A{i}.txt'
            with open(abstract_filename, 'r') as abstract_file:
                abstract_text = abstract_file.read()
                # Tokenize and process words in the abstract
                words = abstract_text.split()  # Split by space as a basic example
                for word in words:
                    if is_fuzzy_match(keyword, word):
                        results.add(f'A{i}')
                        break  # Break if one match is found in the abstract
    return results
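
To see what the threshold in `is_fuzzy_match` means for keywords of different lengths, the short sketch below tabulates it. The helper `allowed_edits` simply repeats the expression from the cell above, and the sample keywords are arbitrary choices for illustration.

```python
def allowed_edits(keyword):
    """Threshold used by is_fuzzy_match: 10% of the keyword length, but never below 1."""
    return max(len(keyword) * 0.1, 1)


for keyword in ["hol", "matter", "supernovae", "spectroscopically"]:
    limit = allowed_edits(keyword)
    # Levenshtein distances are integers, so only int(limit) whole edits can pass
    share = int(limit) / len(keyword)
    print(f"{keyword!r}: length {len(keyword)}, up to {int(limit)} edit(s), "
          f"{share:.0%} of the keyword")
```

For keywords shorter than ten characters the minimum of one edit is more permissive than a strict 90% rule; a three-letter keyword, for example, matches any word that differs from it by a single character.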



In [213]:

# Example 1: Search for keywords with fuzzy matching
search_keywords_list_fuzzy = ["blak", "hol"]
matching_abstracts_fuzzy = search_keywords_fuzzy(search_keywords_list_fuzzy)

# Print matching abstracts
for abstract in matching_abstracts_fuzzy:
    print(f'Matching Abstract: {abstract}')

Matching Abstract: A5
Matching Abstract: A16
Matching Abstract: A17


In [214]:
# Example 2: Search for keywords with fuzzy matching
search_keywords_list_fuzzy = ["dak", "matar"]
matching_abstracts_fuzzy = search_keywords_fuzzy(search_keywords_list_fuzzy)

# Print matching abstracts
for abstract in matching_abstracts_fuzzy:
    print(f'Matching Abstract: {abstract}')

Matching Abstract: A24
Matching Abstract: A2
Matching Abstract: A18
Matching Abstract: A5
Matching Abstract: A1
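
A more literal reading of the 90% requirement counts how many of the keyword's characters can be found, in order, inside a single word. The sketch below does this with the longest common subsequence (LCS), which is a different technique from the Levenshtein approach used above and is offered only as an alternative under that assumed interpretation. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b):
            if ch_a == ch_b:
                current.append(previous[j] + 1)
            else:
                current.append(max(previous[j + 1], current[j]))
        previous = current
    return previous[-1]


def covers_90_percent(keyword, word):
    """True if at least 90% of the keyword's characters appear, in order, in the word."""
    keyword, word = keyword.lower(), word.lower()
    if not keyword:
        return False
    # integer arithmetic avoids floating-point rounding at exactly 90%
    return 10 * lcs_length(keyword, word) >= 9 * len(keyword)


print(covers_90_percent("supernovae", "supernova"))  # 9 of 10 characters
print(covers_90_percent("matter", "mater"))          # 5 of 6 characters
print(covers_90_percent("hol", "hot"))               # 2 of 3 characters
```

Under this rule only keywords of ten or more characters can match with a character missing, so short misspelled queries such as `hol` would no longer match unrelated three-letter words.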


### End of Task B