<h1>Keyphrase Extraction Methods</h1>

<h2>Span-finding approaches</h2>

<h3>Using MPNet for sentence similarity</h3>

In [1]:
doc = """
         Supervised learning is the machine learning task of 
         learning a function that maps an input to an output based 
         on example input-output pairs.[1] It infers a function 
         from labeled training data consisting of a set of 
         training examples.[2] In supervised learning, each 
         example is a pair consisting of an input object 
         (typically a vector) and a desired output value (also 
         called the supervisory signal). A supervised learning 
         algorithm analyzes the training data and produces an 
         inferred function, which can be used for mapping new 
         examples. An optimal scenario will allow for the algorithm 
         to correctly determine the class labels for unseen 
         instances. This requires the learning algorithm to  
         generalize from the training data to unseen situations 
         in a 'reasonable' way (see inductive bias).
      """

doc = """Multan is a city and capital of Multan Division located in Punjab, Pakistan. Situated on the bank of the Chenab River, Multan is Pakistan's 7th largest city and is the major cultural and economic centre of Southern Punjab. Multan's history stretches deep into antiquity. The ancient city was site of the renowned Hindu Multan Sun Temple, and was besieged by Alexander the Great during the Mallian Campaign. Multan was one of the most important trading centres of medieval Islamic India, and attracted a multitude of Sufi mystics in the 11th and 12th centuries, earning the city the sobriquet "City of Saints". The city, along with the nearby city of Uch, is renowned for its large number of Sufi shrines dating from that era."""

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

n_gram_range = (5, 5)
stop_words = "english"

# Extract candidate words/phrases
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
candidates = count.get_feature_names_out()
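
As a rough illustration of what the candidate-extraction cell does, the sketch below rebuilds the same steps in plain Python: lowercase the text, keep tokens of two or more word characters, drop stop words, then join the remaining tokens into n-grams. The helper name `ngram_candidates` and the tiny stop list are assumptions made for this example; scikit-learn's built-in "english" list is much longer, so real output will differ.

```python
import re

def ngram_candidates(text, n, stop_words):
    """Lowercase, keep tokens of 2+ word characters, drop stop words, join n-grams."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop_words]
    # A set removes duplicates, mirroring a vocabulary of unique n-grams.
    grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return sorted(grams)

# Hypothetical, deliberately tiny stop list (not scikit-learn's "english" list).
tiny_stop_words = {"is", "a", "of", "the", "and", "in"}
sample = "Multan is a city and capital of Multan Division"
bigrams = ngram_candidates(sample, 2, tiny_stop_words)
print(bigrams)
```

Because stop words are removed before the n-grams are formed, a candidate such as "capital multan" can join words that were not adjacent in the original sentence. This is worth keeping in mind when reading 5-gram candidates.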

In [13]:
from sentence_transformers import SentenceTransformer

message = "oh yes they have a lot of cows and push out a lot of dairy every hour"
questions = ["Where do dairy farms typically consist of high producing dairy cows?",
            "What do a lot of cows push a lot of?"]
model = SentenceTransformer('../../../../models/all-mpnet-base-v2', device='cpu')
doc_embedding = model.encode([message])
candidate_embeddings = model.encode(questions)

In [14]:

from sklearn.metrics.pairwise import cosine_similarity

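top_n = 5
# cosine_similarity returns similarity scores (higher means closer), shape (1, len(questions))
distances = cosine_similarity(doc_embedding, candidate_embeddings)
# argsort is ascending, so the most similar question ends up last in the list
keywords = [questions[index] for index in distances.argsort()[0][-top_n:]]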

In [15]:
print(keywords)

['What do a lot of cows push a lot of?', 'Where do dairy farms typically consist of high producing dairy cows?']


In [16]:
print(distances)

[[0.6593828 0.5885073]]
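
The ranking cell above lists results from least to most similar and drops the scores. A minimal sketch of an alternative is given here, assuming the embeddings are already computed: it returns (candidate, score) pairs with the best match first. The function name `rank_by_cosine` and the toy 2-D vectors are hypothetical; in the notebook the inputs would come from `model.encode(...)`.

```python
import numpy as np

def rank_by_cosine(doc_embedding, candidate_embeddings, candidates, top_n=5):
    """Return up to top_n (candidate, score) pairs, most similar first."""
    doc = np.asarray(doc_embedding, dtype=float).reshape(-1)
    cands = np.asarray(candidate_embeddings, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    doc = doc / np.linalg.norm(doc)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ doc
    # argsort is ascending, so reverse it to put the best match first.
    order = np.argsort(scores)[::-1][:top_n]
    return [(candidates[i], float(scores[i])) for i in order]

# Toy 2-D vectors standing in for real sentence embeddings.
toy_doc = [[1.0, 0.0]]
toy_candidates = ["a", "b", "c"]
toy_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranked = rank_by_cosine(toy_doc, toy_embeddings, toy_candidates, top_n=2)
print(ranked)
```

The sketch does not guard against zero-length vectors, which would cause a division by zero during normalisation.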


In [None]:
from .keyphrase_extraction.pipelines import pipeline

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_path = "../../../../models/t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

In [None]:
q0 = "question: How many tournaments have been held so far? context: Six tournaments have so far been played, and only the West Indies, who currently hold the title, has won the tournament on multiple occasions. The inaugural 2007 World Twenty20, was staged in South Africa, and won by India, who defeated Pakistan in the final at the Wanderers Stadium in Johannesburg. The 2009 tournament took place in England, and was won by the previous runner-up, Pakistan, who defeated Sri Lanka in the final at Lord's. The third tournament was held in 2010, hosted by the countries making up the West Indies cricket team."
q1 = "ask_question: Six tournaments have so far been played, and only the West Indies, who currently hold the title, has won the tournament on multiple occasions. The inaugural 2007 World Twenty20, was staged in South Africa, and won by India, who defeated Pakistan in the final at the Wanderers Stadium in Johannesburg. The 2009 tournament took place in England, and was won by the previous runner-up, Pakistan, who defeated Sri Lanka in the final at Lord's. The third tournament was held in 2010, hosted by the countries making up the West Indies cricket team."
q2 = "ask_question: Organised by cricket's governing body, the International Cricket Council (ICC), the tournament currently consists of 16 teams, comprising the top ten teams from the rankings at the given deadline and six other teams chosen through the T20 World Cup Qualifier."
input_ids = tokenizer(q0, return_tensors='pt', add_special_tokens=True).input_ids
outputs = model.generate(input_ids)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))

<h3>End-to-End question generation</h3>

In [17]:
# !git clone https://github.com/patil-suraj/question_generation.git
%cd question_generation

from pipelines import pipeline
nlp = pipeline("e2e-qg", model="../../../../../models/t5-base-e2e-qg", use_cuda=False)

/home/farjad/Documents/1. UNIVERSITY FOLDER/7th sem/FYP/PRACTICAL/CODE/KBC-Agent-Prototype/notebooks/SentSImilarity/keyphrase_extraction/bekaar/question_generation


In [21]:
q0 = "Organised by cricket's governing body, the International Cricket Council (ICC), the tournament currently consists of 16 teams, comprising the top ten teams from the rankings at the given deadline and six other teams chosen through the T20 World Cup Qualifier."
# q0 = "comprising the top ten teams from the rankings at the given deadline and six other teams chosen through the T20 World Cup Qualifier."
q1 = "team."

q2 = "Situated on the bank of the Chenab River, Multan is Pakistan's 7th largest city and is the major cultural and economic centre of Southern Punjab."
q3 = "economic"
nlp("I went to the river near Multan")

['What river did I go to near Multan?']

<h3>Question Answering pipeline</h3>

In [None]:
# !git clone https://github.com/patil-suraj/question_generation.git
%cd question_generation

from pipelines import pipeline
nlp = pipeline("multitask-qa-qg", model="../../../../../models/t5-small-qa-qg-hl")

In [None]:
doc = """Multan is a city and capital of Multan Division located in Punjab, Pakistan. Situated on the bank of the Chenab River, Multan is Pakistan's 7th largest city and is the major cultural and economic centre of Southern Punjab. Multan's history stretches deep into antiquity. The ancient city was site of the renowned Hindu Multan Sun Temple, and was besieged by Alexander the Great during the Mallian Campaign. Multan was one of the most important trading centres of medieval Islamic India, and attracted a multitude of Sufi mystics in the 11th and 12th centuries, earning the city the sobriquet "City of Saints". The city, along with the nearby city of Uch, is renowned for its large number of Sufi shrines dating from that era."""

nlp({
    "question": "I like Multan",
    "context": doc
})

In [None]:
nlp = pipeline("e2e-qg", model="../../../../../models/t5-small-qa-qg-hl")
nlp("Organised by cricket's governing body, the International Cricket Council (ICC), the tournament currently consists of 16 teams, comprising the top ten teams from the rankings at the given deadline and six other teams chosen through the T20 World Cup Qualifier.")