## Web-Scraping and Data Preprocessing 

In [4]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

The scrape_details function collects quote details such as the quote text, author name, and associated tags from a list of URLs using BeautifulSoup. It then organizes this information into a pandas DataFrame for further processing. It uses the internal structure of the initial HTML page to extract the information.

In [128]:
def scrape_details(urls): 
    df = pd.DataFrame()
    for url in urls:
        response=requests.get(url)
        doc=BeautifulSoup(response.text,'html.parser')
        div_tags=doc.find_all('div',class_='quote')
        for tag in div_tags:
            quote=tag.find('span',class_='text').text
            span_tag=tag.find('span',class_=None)
            author=span_tag.find('small',class_='author').text
            name_tag=tag.find('div',class_='tags').meta['content']
            quotes_dict = [{"quote":quote, "tags":name_tag, "author": author}]
            df_curr=pd.DataFrame(quotes_dict)
            df = pd.concat([df, df_curr], ignore_index=2)
    return df 

The function scrape_all(pages) is designed to scrape data from multiple pages on "quotes.toscrape.com". It first constructs a list of URLs corresponding to the pages specified in the pages argument. Since it has only 10 pages this can be initialized as (1,11). But for future purposes it serves as a great generalized page based scraping. Then, it calls  the scrape_details(urls) to scrape data from each of these URLs and aggregates the results into a single DataFrame df. Finally, it returns this DataFrame.

In [129]:
def scrape_all(pages):
    df = pd.DataFrame()
    urls = []
    for page in pages:
        url = f'https://quotes.toscrape.com/page/{page}/'
        urls.append(url)
    print(urls)
    df = scrape_details(urls)
    return df

In [130]:
df = scrape_all(range(1,11))

['https://quotes.toscrape.com/page/1/', 'https://quotes.toscrape.com/page/2/', 'https://quotes.toscrape.com/page/3/', 'https://quotes.toscrape.com/page/4/', 'https://quotes.toscrape.com/page/5/', 'https://quotes.toscrape.com/page/6/', 'https://quotes.toscrape.com/page/7/', 'https://quotes.toscrape.com/page/8/', 'https://quotes.toscrape.com/page/9/', 'https://quotes.toscrape.com/page/10/']


In [51]:
df.to_csv("quotes.csv", index=False)

In [307]:
df = pd.read_csv('quotes.csv')
df.dropna(inplace=True)

In [308]:
df.tags.iloc[1]

'abilities,choices'

The function preprocess_quotes(df) takes a DataFrame df containing quotes as input and preprocesses each quote in the DataFrame. It removes non-alphanumeric characters, replaces multiple consecutive whitespace characters with a single space, and converts the text to lowercase. By performing these preprocessing steps, the function ensures that the quotes are cleaned and standardized, making them suitable for subsequent text analysis tasks such as topic modeling. The preprocessed quotes can then be used for various natural language processing applications with improved consistency and reliability. This preprocessing pipeline can be also generalized for text preprocessing.

In [309]:
import re
def preprocess_quotes(df):
    for i in range(len(df)):
        quote = df.quote.iloc[i]
        quote = re.sub(r"[^a-zA-Z0-9ğüşöçıİĞÜŞÖÇ\s]", "", quote)
        quote = re.sub(r"\s+", " ", quote)
        quote = quote.lower()
        df.quote.iloc[i] = quote
    return df

In [310]:
preprocess_quotes(df)
df_og = df.copy()

This code snippet first creates a list of unique tags extracted from 'tags' in the DataFrame df. It then iterates over each tag in the list and adds a new column for each tag to the DataFrame, initializing them with zeros. Afterward, it iterates over each row in the DataFrame and extracts the tags associated with each quote. For each tag associated with a quote, it sets the corresponding column value to 1, indicating the presence of that tag for the respective quote. Finally, it removes the original 'tags' column from the DataFrame. This process essentially converts the categorical data represented by tags into a binary format, allowing for easier analysis and modeling.

In [311]:
import pandas as pd
import warnings

warnings.filterwarnings("ignore")
tags_list = df['tags'].str.split(',').explode().dropna().unique()
for tag in tags_list:
    df[tag] = 0
for index, row in df.iterrows():
    quote_tags = str(row['tags']).split(',')
    for tag in quote_tags:
        df.at[index, tag] = 1
df.drop(columns=['tags'], inplace=True)

In [313]:
df

Unnamed: 0,quote,author,change,deep-thoughts,thinking,world,abilities,choices,inspirational,life,...,christianity,faith,sun,adventure,better-life-empathy,difficult,grown-ups,write,writers,mind
0,the world as we have created it is a process o...,Albert Einstein,1,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,it is our choices harry that show what we trul...,J.K. Rowling,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,there are only two ways to live your life one ...,Albert Einstein,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,the person be it gentleman or lady who has not...,Jane Austen,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,imperfection is beauty madness is genius and i...,Marilyn Monroe,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,you never really understand a person until you...,Harper Lee,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
96,you have to write the book that wants to be wr...,Madeleine L'Engle,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,1,1,0
97,never tell the truth to people who are not wor...,Mark Twain,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,a persons a person no matter how small,Dr. Seuss,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


## Fine-Tuning KeyBERT

The reason for employing a state-of-the-art model for transfer learning is due to the scarcity of data, which would lead to suboptimal performance with traditional architectures like LSTM and BiLSTM, designed for sequential data processing. Furthermore, it is noteworthy that additional transfer learning models, such as BERT Topic, could be considered for this task. The decision to utilize KeyBERT is motivated by the challenge of dealing with a significantly limited dataset in conjunction with a substantial number of tags. Through the fine-tuning process of KeyBERT, we have achieved an average cosine similarity score exceeding 46%

In [314]:
from transformers import BertForSequenceClassification, BertTokenizerFast 
import torch 
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

The first line instantiates a KeyBERT model, denoted by kw_model KeyBERT is a keyword extraction model that utilizes BERT-based embeddings to identify keywords or key phrases within a given text. For our task, the objective is to extract keywords.
The second line initializes a SentenceTransformer model. SentenceTransformer is a siamese network that employs identical weights for both inputs. The rationale for utilizing this model is to evaluate the results obtained from KeyBERT. Both the extracted keywords from KeyBERT and the actual tags will be fed into the network to obtain embeddings. Subsequently, cosine similarity score will be employed for evaluation, comparing the outputs of both models
n.

In [315]:
kw_model = KeyBERT()
model_for_embeddings = SentenceTransformer('all-MiniLM-L6-v2')

This line creates a list called 'seed_keywords'. It iterates over each column in the DataFrame 'df' that we have created, where every single tag is a column, and checks if the column name is not equal to "quote" or "author". If the condition is met, meaning the column is not named "quote" or "author", which means it is a tag, the column name is added to the 'seed_keywords' list. This list essentially contains the names of all columns in the DataFrame except for "quote" and "author". These tags are essentially going to be used for fine-tuning of KeyBERT.

In [317]:
seed_keywords = [column for column in df.columns if column not in ("qoute", "author")] 

quotes = df.quote.to_list()

The code invokes a keyword extraction process facilitated by the extract_keywords method of the kw_model object. This method takes two main arguments. First, the docs parameter specifies the quotes from which the keywords will be extracted. Second, the seed_keywords parameter provides seed keywords or initial cues to guide the extraction process ,fine-tuning, which are derived from the columns of  our DataFrame, excluding certain specified columns. The extract_keywords method then processes the input documents using the provided seed keywords and returns the extracted keywords.

In [318]:
output = kw_model.extract_keywords(docs=quotes, seed_keywords=seed_keywords)

In [319]:
output[:5]

[[('thinking', 0.5109),
  ('changing', 0.5043),
  ('world', 0.4503),
  ('process', 0.396),
  ('created', 0.3514)],
 [('harry', 0.6125),
  ('abilities', 0.4428),
  ('choices', 0.3607),
  ('far', 0.2797),
  ('truly', 0.2434)],
 [('miracle', 0.6189), ('life', 0.4815), ('live', 0.3917), ('ways', 0.2701)],
 [('novel', 0.5722),
  ('lady', 0.4404),
  ('gentleman', 0.4351),
  ('intolerably', 0.3405),
  ('person', 0.3103)],
 [('imperfection', 0.7058),
  ('madness', 0.5221),
  ('boring', 0.4572),
  ('beauty', 0.4097),
  ('genius', 0.2245)]]

This code snippet filters keywords extracted from text documents based on their associated scores. It iterates through each document's keywords and selects only those with a score greater than 0.35. Then, it formats the filtered keywords into comma-separated strings for each document. This process results in a list where each element represents a document's filtered keywords as a string. This will be then used with the actual tags in the sentence transformer encoder.

In [353]:
output_list = []
for i in range(len(output)):
    curr_keywords = []
    for j in range(len(output[i])):
        if output[i][j][1] > .35: 
            curr_keywords.append(output[i][j][0])
    output_list.append(curr_keywords)
for i in range(len(output_list)): 
     output_list[i] = ",".join(output_list[i])
output_list[:5]

['thinking,changing,world,process,created',
 'harry,abilities,choices',
 'miracle,life,live',
 'novel,lady,gentleman',
 'imperfection,madness,boring,beauty']

In [354]:
df_og

Unnamed: 0,quote,tags,author
0,the world as we have created it is a process o...,"change,deep-thoughts,thinking,world",Albert Einstein
1,it is our choices harry that show what we trul...,"abilities,choices",J.K. Rowling
2,there are only two ways to live your life one ...,"inspirational,life,live,miracle,miracles",Albert Einstein
3,the person be it gentleman or lady who has not...,"aliteracy,books,classic,humor",Jane Austen
4,imperfection is beauty madness is genius and i...,"be-yourself,inspirational",Marilyn Monroe
...,...,...,...
95,you never really understand a person until you...,better-life-empathy,Harper Lee
96,you have to write the book that wants to be wr...,"books,children,difficult,grown-ups,write,write...",Madeleine L'Engle
97,never tell the truth to people who are not wor...,truth,Mark Twain
98,a persons a person no matter how small,inspirational,Dr. Seuss


Same process is also applied to the original tags

In [355]:
original_tags = []
for i in range(len(df_og)): 
    original_tags.append(df_og.tags.iloc[i])
original_tags[:5]

['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational']

In [356]:
print(len(original_tags)==len(output_list))

True


This function embedd_outputs_and_tags takes two lists, original_tags and output_list, and generates embeddings for each pair of corresponding elements from these lists using  pre-trained sentence transformer mode called model model_for_embeddings. It iterates through the elements of original_tags and output_list, generates embeddings for each pair, and appends them to the list called embeddings. Finally, it returns the list of embeddings. Essentially, this function embeds both the original tags and the extracted keyword lists into a vector space using the same weights which then will be used for cosine similarity score calculation.

In [357]:
def embedd_outputs_and_tags(original_tags, output_list): 
        embeddings = []
        for i in range(len(original_tags)): 
                embedding = model_for_embeddings.encode([original_tags[i], output_list[i]], convert_to_tensor=True)
                embeddings.append(embedding)
        return embeddings

In [358]:
embeddings =embedd_outputs_and_tags(original_tags, output_list)

This function, get_overall_score, computes the average cosine similarity score between pairs of embeddings provided in the embeddings list. It iterates through each pair of embeddings, calculates the cosine similarity between them, and stores the scores. Finally, it returns the average similarity score, providing a measure of overall similarity between the original tags and the extracted keyword lists.

In [359]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def get_overall_score(embeddings):
    similarity_scores = []
    for i in range(len(embeddings)):
        vector1 = embeddings[i][0].reshape(1, -1) 
        vector2 = embeddings[i][1].reshape(1, -1)
        similarity = cosine_similarity(vector1, vector2)
        similarity_scores.append(similarity)
    return sum(similarity_scores)/len(similarity_scores)

In [360]:
model_performance = get_overall_score(embeddings)

The resulting score of approximately 46% suggests that, on average, the embeddings of the original tags and the extracted keyword lists exhibit a moderate degree of similarity. A cosine similarity score of 1 indicates perfect similarity, while a score of 0 implies no similarity. Therefore, a score of 46% indicates a reasonable level of resemblance between the original tags and the extracted keywords.

In [361]:
model_performance

array([[0.46252713]], dtype=float32)