# Task C - Information Retrieval 2
1. Use Wikipedia as the data source. Use the `beautifulsoup` library to crawl all text from the relevant Wikipedia pages (there is no need to crawl the links on these pages).
2. Use scikit-learn's `CountVectorizer` and `TfidfVectorizer` to construct the tf-idf matrix representation of all documents (all identified Finnish town pages plus the Finland page). Refine the search using different preprocessing strategies, such as with and without stopwords, your own list of stopwords, and lemmatization.
3. Consider the query `I will visit Oulu this summer and possibly Espoo`. Write a program that takes the query as input and outputs its tf-idf representation.
4. Write a program that scores the query against each document by computing the `inner product` of the query's tf-idf representation and each document's tf-idf vector. `Display the documents that achieve a high score`.
5. Repeat the above using the `CountVectorizer` representation instead of tf-idf.

In [1]:
!pip install wikipedia

# Standard library
import re
import string
from urllib.error import HTTPError
from urllib.request import urlopen
from warnings import simplefilter

# Third-party libraries
import numpy as np
import requests
import wikipedia
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# NLTK
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# Silence FutureWarning messages raised by the libraries above
simplefilter(action='ignore', category=FutureWarning)

# Download the NLTK resources used for lemmatization, stopword removal and tokenization
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')



[nltk_data] Downloading package wordnet to /Users/moinul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/moinul/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/moinul/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
Stopwords = list(set(nltk.corpus.stopwords.words('english')))
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

In [3]:
# Define the URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Finland"

# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Create a list to store the extracted text
    extracted_text = []

    # Extract text from the main page content
    main_content = soup.find('div', {'class': 'mw-parser-output'})
    for paragraph in main_content.find_all('p'):
        extracted_text.append(paragraph.get_text())

    # Extract text from hyperlinks within the page
    for link in main_content.find_all('a', href=True):
        link_url = link['href']
        if link_url.startswith("/wiki/"):
            # Construct the full URL for the linked page
            full_link_url = f"https://en.wikipedia.org{link_url}"
            # Send a request to the linked page and extract its text
            linked_page_response = requests.get(full_link_url)
            if linked_page_response.status_code == 200:
                linked_soup = BeautifulSoup(linked_page_response.text, 'html.parser')
                for paragraph in linked_soup.find_all('p'):
                    extracted_text.append(paragraph.get_text())

    # Save the extracted text to a file
    with open('extracted_text.txt', 'w', encoding='utf-8') as file:
        file.write("\n".join(extracted_text))

    print("Text extraction and saving completed.")
else:
    print("Failed to retrieve the webpage.")

Text extraction and saving completed.
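The crawl above follows every `href` that starts with `/wiki/`, so the same page can be fetched more than once and namespace pages (for example `File:` or `Help:`) are included. A minimal sketch of a link filter that could run before the requests are sent is shown below. The helper name `article_links` and the sample hrefs are hypothetical, and the "no colon in the title" rule is only a heuristic that would also drop the few article titles that contain a colon.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

def article_links(hrefs, base_url=BASE_URL):
    """Return unique absolute article URLs, in first-seen order (hypothetical helper)."""
    seen = set()
    result = []
    for href in hrefs:
        if not href.startswith("/wiki/"):
            continue  # skip external links and in-page anchors
        # Drop the fragment so /wiki/Espoo#History and /wiki/Espoo count as one page
        title = href[len("/wiki/"):].split("#", 1)[0]
        if not title or ":" in title:
            continue  # heuristic: skip namespace pages such as File: or Help:
        url = urljoin(base_url, "/wiki/" + title)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result

# Illustrative hrefs, not taken from the real page
sample_hrefs = [
    "/wiki/Oulu",
    "/wiki/Espoo#History",
    "/wiki/File:Oulu_montage.jpg",
    "/wiki/Oulu",
    "#cite_note-1",
    "https://example.org/",
]
links = article_links(sample_hrefs)
print(links)
```

In the notebook, the list passed in would be `[a['href'] for a in main_content.find_all('a', href=True)]`, and each returned URL would be requested once.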


In [4]:
# Load the file
with open("extracted_text.txt", "r", encoding="utf-8") as file:
    documents = [line.strip() for line in file]

In [5]:
# Build the helpers once, reusing the objects created in cell In [2],
# instead of recreating them for every document
stop_words = set(Stopwords)
punctuation_table = str.maketrans('', '', string.punctuation)

def preprocess_text(text):
    """Strip punctuation, drop stopwords and non-alphabetic tokens, then stem and lemmatize."""
    # Remove punctuation
    text = text.translate(punctuation_table)

    # Lowercase, then remove stopwords and tokens that are not purely alphabetic
    tokens = [word for word in text.lower().split() if word not in stop_words and word.isalpha()]

    # Stemming (Snowball, English)
    tokens = [stemmer.stem(word) for word in tokens]

    # Lemmatization (WordNet), applied to the stemmed tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(tokens)

# Preprocess the document
preprocessed_documents = [preprocess_text(doc) for doc in documents]

In [6]:
preprocessed_documents = [i for i in preprocessed_documents if i]

In [8]:
query = "I will visit Oulu this summer and possibly Espoo"

preprocessed_query = preprocess_text(query)

### TF-IDF Vectorization 

In [40]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)
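The vectorizer above is fitted on text that has already been preprocessed by `preprocess_text`. Task item 2 also asks for a comparison of stopword strategies, and `TfidfVectorizer` can apply these itself through its `stop_words` parameter. The following is a minimal sketch on two made-up sentences. The names `raw_docs`, `custom_stopwords` and `variants` are assumptions for illustration, and lemmatization is left out because it needs the NLTK data downloaded earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy raw documents, not taken from the crawled corpus
raw_docs = [
    "Oulu is a city in Finland and the university is in Oulu",
    "Espoo is the second largest city in Finland",
]

# A hand-written stopword list; domain words such as "city" can be added freely
custom_stopwords = ["is", "in", "the", "and", "city", "finland"]

variants = {
    "no stopword removal": TfidfVectorizer(),
    "built-in English list": TfidfVectorizer(stop_words="english"),
    "custom list": TfidfVectorizer(stop_words=custom_stopwords),
}

vocab_sizes = {}
for name, vec in variants.items():
    vec.fit(raw_docs)
    vocab_sizes[name] = len(vec.vocabulary_)
    print(f"{name}: {sorted(vec.vocabulary_)}")
```

Note that the default tokenizer keeps only tokens of two or more characters, so a word such as "a" never enters the vocabulary regardless of the stopword setting.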

In [41]:
query_tfidf = vectorizer.transform([preprocessed_query])
query_tfidf = query_tfidf.toarray()
query_tfidf

array([[0., 0., 0., ..., 0., 0., 0.]])
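The printed array shows only zeros because NumPy truncates the display and nearly every vocabulary term is absent from the query; the few non-zero weights sit somewhere in the elided middle. To make clear what those weights are, here is a pure-Python sketch of the computation that `TfidfVectorizer` performs with its default settings: smoothed idf, raw term counts and L2 normalisation. The toy corpus and the helper names `fit_idf` and `tfidf_vector` are assumptions, and whitespace splitting stands in for scikit-learn's own tokenizer.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed idf per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    """L2-normalised tf-idf weights for the terms of `text` that are in the vocabulary."""
    tf = Counter(t for t in text.split() if t in idf)  # out-of-vocabulary terms are dropped
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy preprocessed corpus, not the crawled one
toy_docs = ["oulu univers oulu", "espoo citi finland", "finland summer"]
idf = fit_idf(toy_docs)

query_vec = tfidf_vector("visit oulu summer possibl espoo", idf)
print(query_vec)
```

In the notebook itself, the non-zero entries can be inspected by calling `.nonzero()` on the sparse result of `vectorizer.transform` and looking the column indices up in `vectorizer.get_feature_names_out()`.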

In [36]:
# Initialize variables to keep track of the highest score and its corresponding document index
max_score_tfidf = -1
max_score_index_tfidf = -1

# Compute the inner product (dot product) between the query and each document
for i in range(len(preprocessed_documents)):
    document_tfidf = tfidf_matrix[i].toarray()
    inner_product_tfidf = np.dot(query_tfidf, document_tfidf.T)

    if inner_product_tfidf > max_score_tfidf:
        max_score_tfidf = inner_product_tfidf
        max_score_index_tfidf = i


In [37]:
# Display the document with the highest score
if max_score_index_tfidf != -1:
    print(f"Document {max_score_index_tfidf + 1}: {preprocessed_documents[max_score_index_tfidf]} (Score: {max_score_tfidf[0][0]:.4f})")
else:
    print("No matching documents found.")

Document 1656: univers oulu oulu univers appli scienc main campus locat oulu (Score: 0.3075)
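The loop above keeps only the single best document, while the task asks for the documents with high scores. One way to get a ranked list is to compute all inner products with a single sparse matrix product and sort the result. This is a sketch on a toy corpus, not the notebook's method; the function name `top_k` and the parameter `k` are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k(query, docs, k=3):
    """Return (index, score) pairs for the k highest non-zero inner-product scores."""
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(docs)   # sparse, shape (n_docs, n_terms)
    query_vec = vec.transform([query])     # sparse, shape (1, n_terms)
    # One sparse product yields the inner product with every document at once
    scores = (doc_matrix @ query_vec.T).toarray().ravel()
    order = np.argsort(scores)[::-1][:k]   # indices of the k largest scores, best first
    return [(int(i), float(scores[i])) for i in order if scores[i] > 0]

# Toy preprocessed documents, not the crawled corpus
toy_docs = [
    "univers oulu oulu univers appli scienc",
    "espoo cathedr oldest build espoo",
    "helsinki capit finland",
]
ranking = top_k("visit oulu summer possibl espoo", toy_docs, k=2)
for index, score in ranking:
    print(f"Document {index + 1}: {toy_docs[index]} (Score: {score:.4f})")
```

Applied to the notebook's variables, the same product would be `tfidf_matrix @ vectorizer.transform([preprocessed_query]).T`, which avoids converting each row to a dense array inside a Python loop.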


### CountVectorization 

In [38]:
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(preprocessed_documents)

In [39]:
query_countvec = count_vectorizer.transform([preprocessed_query])
query_countvec = query_countvec.toarray()
query_countvec

array([[0, 0, 0, ..., 0, 0, 0]])

In [29]:
# Initialize variables to keep track of the highest score and its corresponding document index
max_score_count = -1
max_score_index_count = -1

# Compute the inner product (dot product) between the query and each document
for i in range(len(preprocessed_documents)):
    document_countvec = count_matrix[i].toarray()
    inner_product_count = np.dot(query_countvec, document_countvec.T)

    if inner_product_count > max_score_count:
        max_score_count = inner_product_count
        max_score_index_count = i


In [30]:
# Display the document with the highest score
if max_score_index_count != -1:
    print(f"Document {max_score_index_count + 1}: {preprocessed_documents[max_score_index_count]} (Score: {max_score_count[0][0]:.4f})")
else:
    print("No matching documents found.")

Document 1183: second crusad finland settler sweden establish perman agricultur settlement uusimaa espoo subdivis kirkkonummi congreg oldest known document refer kirkkonummi espoo subchapt date although first document direct refer espoo late construct espoo cathedr oldest preserv build espoo mark independ espoo administr espoo part uusimaa provinc split eastern western provinc govern porvoo raseborg castl respect eastern border raseborg provinc espoo road connect import citi finland time king road pas espoo way stockholm via turku porvoo viipuri (Score: 9.0000)
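The two representations pick different documents, and document length is a likely reason. `TfidfVectorizer` scales every vector to unit L2 length by default, so its inner product is a cosine similarity, whereas `CountVectorizer` applies no normalisation and a long paragraph that repeats a query term many times can collect a large raw score. The sketch below illustrates the effect in pure Python on made-up documents; `count_score` and `cosine_score` are hypothetical helpers, and idf weighting is deliberately left out to isolate the effect of normalisation.

```python
import math
from collections import Counter

def count_score(query, doc):
    """Inner product of raw term-count vectors."""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum(q[t] * d[t] for t in q)

def cosine_score(query, doc):
    """The same inner product after scaling both count vectors to unit length."""
    q, d = Counter(query.split()), Counter(doc.split())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return count_score(query, doc) / norm if norm else 0.0

query = "oulu espoo"
short_doc = "oulu univers oulu"
# A long document: three mentions of "espoo" among twenty unrelated filler terms
long_doc = " ".join(["espoo"] * 3 + [f"filler{i}" for i in range(20)])

for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(f"{name}: count={count_score(query, doc)}, cosine={cosine_score(query, doc):.4f}")
```

With raw counts the long document wins, while after normalisation the short, focused document scores higher.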


### End of Task C