# Keyword Extraction Experiment with YAKE, RAKE, and TF-IDF

- TF-IDF (Term Frequency-Inverse Document Frequency)
- YAKE (Yet Another Keyword Extractor)
- RAKE (Rapid Automatic Keyword Extraction)

The goal is to see whether, given an arbitrary assignment question, I can generate a title that captures the assignment's most important keywords. Titles built from these keywords should perform better for SEO, that is, rank higher on search engine (Google) results pages.

The question is: how can we generate the most effective title for a given text? In this case, the text is a set of assignment instructions.

In [21]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/question1/q1.txt
/kaggle/input/question/q.txt


In [22]:
# Step 1: Install required packages
!pip install PyMuPDF rake-nltk yake scikit-learn

# Step 2: Import libraries
import nltk
# Download all required NLTK resources
nltk.download(['stopwords', 'punkt', 'punkt_tab', 'wordnet', 'omw-1.4'])


import fitz  # PyMuPDF
import re
from rake_nltk import Rake
from yake import KeywordExtractor
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

# Step 3: Document scraping function
def scrape_pdf_text(path):
    doc = fitz.open(path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text



[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
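The input files listed earlier are `.txt` files, while `scrape_pdf_text` is written for PDFs. A minimal sketch of a loader that handles both cases is shown below. The name `load_document_text` and the UTF-8 assumption for non-PDF files are mine, not part of the original notebook.

```python
from pathlib import Path


def load_document_text(path):
    """Return the text of a .txt or .pdf file (hypothetical helper)."""
    path = Path(path)
    if path.suffix.lower() == ".pdf":
        # Imported lazily so plain-text inputs do not require PyMuPDF.
        import fitz  # PyMuPDF

        with fitz.open(str(path)) as doc:
            return "".join(page.get_text() for page in doc)
    # Assumption: anything that is not a PDF is UTF-8 plain text.
    return path.read_text(encoding="utf-8")
```

Reading plain text directly avoids depending on how a PDF library interprets a non-PDF file, and keeps the PDF path available for other inputs.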


**Load the question from the file.**

In [37]:

# Example document (replace with your document path)
text = scrape_pdf_text("/kaggle/input/question/q.txt")  # q.txt is plain text here; a PDF path is passed the same way

text

'Write a research paper that contains the following: \nDefine and describe cloud-based collaboration.   \nGoogle Docs is a cloud-based tool used for document\nsharing.  \nDiscuss pros and cons of using Google Docs for\nbusiness-based documents.   \nCompare and contrast the use of Google Docs with\nMicrosoft 365 Word Docs for business-based documents.\nResearch Paper Requirements:  \nThe paper should be four pages long, not including\nthe title and reference pages. \nUse Times New Roman, size 12 font throughout the\npaper. \nApply APA 7th edition style and include three major\nsections: the Title Page, Main Body, and References.\nA minimum of two scholarly journal articles (besides\nyour textbook) are required.\nWriting should demonstrate a thorough understanding\nof the materials and address all required elements. \nWriting should use exceptional language that\nskillfully communicates meaning to the readers with\nclarity and fluency and is virtually error-free. \nNote: plagiarism check

In [38]:

# Download nltk sentence tokenizer data (run once)
# import nltk
# nltk.download('punkt')

def generate_title_sentence_segment(content):
    sentences = sent_tokenize(content)
    if sentences:
        return sentences[0] # Use the first sentence
    return "No Title Extracted" # Handle cases with no sentences

# Load your CSV
# df = pd.read_csv('daily_questions.csv')

# Generate optimized titles
# df['optimized_title'] = df['content'].apply(generate_title_sentence_segment)

# Save the updated CSV (optional)
# df.to_csv('daily_questions_with_titles_sentence.csv', index=False)

# print(df[['title_raw', 'optimized_title']].head()) # Show a preview
optimized_title = generate_title_sentence_segment(text)
optimized_title

'Write a research paper that contains the following: \nDefine and describe cloud-based collaboration.'
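The first-sentence title above still contains a raw `\n` from the extracted text. The sketch below collapses whitespace before taking the first sentence. It uses a naive regex splitter as a stand-in for `sent_tokenize`, and both the function name `first_sentence_title` and the `max_words` budget are assumptions made for illustration.

```python
import re


def first_sentence_title(content, max_words=15):
    """Return a cleaned-up first sentence to use as a title candidate."""
    # Collapse newlines and repeated spaces left over from extraction.
    flat = re.sub(r"\s+", " ", content).strip()
    if not flat:
        return "No Title Extracted"
    # Naive splitter (assumption): a sentence ends at . ! or ? followed by a space.
    first = re.split(r"(?<=[.!?])\s+", flat, maxsplit=1)[0]
    # Trim to a word budget so the title stays short.
    return " ".join(first.split()[:max_words])


sample = (
    "Write a research paper that contains the following: \n"
    "Define and describe cloud-based collaboration.   \n"
    "Google Docs is a cloud-based tool."
)
print(first_sentence_title(sample))
```

Even after cleanup, the result is mostly the instruction stem ("Write a research paper that contains..."), which is why the later cells look at keyword-based approaches.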

In [39]:
# RAKE for Phrase Ranking:
def generate_title_rake(content):
    r = Rake()
    r.extract_keywords_from_text(content)
    ranked_phrases = r.get_ranked_phrases() # Get ranked phrases
    if ranked_phrases:
        return ranked_phrases[:2] # Return the two top-ranked phrases
    return "No Title Extracted"

rake_title = generate_title_rake(text)
rake_title

['apply apa 7th edition style', 'two scholarly journal articles']
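RAKE's top results here are formatting requirements rather than the topic. One likely reason is its scoring: to my understanding, rake-nltk's default metric scores each word by its degree-to-frequency ratio and sums those scores over a phrase, which favors long phrases made of words that appear only once. The following is a simplified pure-Python sketch of that scoring, not rake-nltk's actual code, and the tiny `STOP_WORDS` set is an assumption for illustration.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (assumption); rake-nltk uses NLTK's list.
STOP_WORDS = {"a", "and", "are", "of", "the", "is", "for", "to", "in", "that"}


def rake_scores(text):
    """Score candidate phrases with a degree-to-frequency ratio."""
    phrases = []
    # Punctuation ends a candidate phrase, and so do stop words.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    # A phrase's score is the sum of its words' degree/frequency ratios.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


sample = "Apply APA 7th edition style. Google Docs is a tool. Use Google Docs."
for phrase, score in rake_scores(sample):
    print(f"{score:5.1f}  {phrase}")
```

In this sample the five-word phrase scores 25.0 while "google docs" scores 5.0, even though "google docs" occurs twice.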

In [40]:
# YAKE for Phrase Ranking:
def generate_title_yake(content):
    kw_extractor = KeywordExtractor()
    keywords = kw_extractor.extract_keywords(content) # List of (phrase, score) pairs; a lower score means more relevant
    if keywords:
        return keywords # Return every extracted keyword, most relevant first
    return "No Title Extracted"

yake_title = generate_title_yake(text)
yake_title

[('describe cloud-based collaboration', 0.01139419529487205),
 ('Define and describe', 0.013083010864350314),
 ('Google Docs', 0.018001472449993478),
 ('Docs', 0.05241615149759153),
 ('cloud-based collaboration', 0.05280622031964822),
 ('describe cloud-based', 0.062238267008141106),
 ('Define', 0.06480294082007379),
 ('Word Docs', 0.07411097376975266),
 ('Google', 0.08145926305964542),
 ('Docs for business-based', 0.09989466125910407),
 ('paper', 0.10347887217316704),
 ('Research Paper Requirements', 0.10370021410516542),
 ('business-based documents', 0.10701176040443292),
 ('research paper', 0.11648890796419961),
 ('cloud-based', 0.14700685900316215),
 ('cloud-based tool', 0.15782633142146119),
 ('Write a research', 0.16766439611237663),
 ('Main Body', 0.169410106738005),
 ('Paper Requirements', 0.16963832012154287),
 ('Times New Roman', 0.17055435761718224)]
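YAKE returns `(phrase, score)` pairs in which a lower score means a more relevant keyword. A small sketch of picking a single title candidate from that list follows. The helper name `best_yake_phrase` and the word-count limits are assumptions, and the sample data is copied from the output above.

```python
def best_yake_phrase(keywords, min_words=2, max_words=4):
    """Pick the lowest-scoring (most relevant) phrase within a word-count range."""
    candidates = [
        (phrase, score)
        for phrase, score in keywords
        if min_words <= len(phrase.split()) <= max_words
    ]
    if not candidates:
        return None
    # YAKE assigns lower scores to more relevant keywords.
    return min(candidates, key=lambda item: item[1])[0]


# A few (phrase, score) pairs copied from the output above.
yake_output = [
    ("describe cloud-based collaboration", 0.01139419529487205),
    ("Google Docs", 0.018001472449993478),
    ("Docs", 0.05241615149759153),
    ("cloud-based collaboration", 0.05280622031964822),
]
print(best_yake_phrase(yake_output))
```

Filtering by word count drops single words such as "Docs", which rarely work as a title on their own.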

In [41]:
from nltk.tokenize import word_tokenize
from nltk import ngrams
from collections import Counter

In [42]:

# Download nltk tokenizer data (run once if you haven't already)
# import nltk
# nltk.download('punkt')

def generate_title_ngram_freq(content, n=2): # Bigrams by default
    tokens = word_tokenize(content.lower()) # Tokenize and lowercase
    ngram_counts = Counter(ngrams(tokens, n))
    most_common_ngrams = ngram_counts.most_common(5) # Five most frequent n-grams with their counts
    if most_common_ngrams:
        return most_common_ngrams # List of (n-gram tuple, count) pairs
    return "No Title Extracted"

ngram_title = generate_title_ngram_freq(text)

In [43]:
ngram_title

[(('google', 'docs'), 3),
 (('research', 'paper'), 2),
 (('docs', 'for'), 2),
 (('for', 'business-based'), 2),
 (('business-based', 'documents'), 2)]
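Two of the five bigrams above, `('docs', 'for')` and `('for', 'business-based')`, contain a stop word. One possible refinement is to skip any pair that includes a stop word. The sketch below does that with a regex tokenizer instead of `word_tokenize`. The short `STOP_WORDS` set and the name `content_bigrams` are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stop word list (assumption); NLTK's list is much longer.
STOP_WORDS = {"a", "and", "for", "is", "of", "the", "to", "with"}


def content_bigrams(content, top=5):
    """Count adjacent word pairs, skipping pairs that contain a stop word."""
    # Keep hyphenated words such as "cloud-based" as single tokens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", content.lower())
    pairs = zip(tokens, tokens[1:])
    counts = Counter(
        pair for pair in pairs if not any(word in STOP_WORDS for word in pair)
    )
    return [(" ".join(pair), count) for pair, count in counts.most_common(top)]


sample = (
    "Discuss pros and cons of using Google Docs for business-based documents. "
    "Compare Google Docs with Word Docs for business-based documents."
)
print(content_bigrams(sample, top=2))
```

Pairs are dropped rather than re-joined across the removed stop word, so every reported bigram is a pair of words that were actually adjacent in the text.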

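Pre-process the text (a rough equivalent of normalization): lowercase it, strip punctuation, and remove stop words and tokens of two characters or fewer. Note that stripping punctuation also merges hyphenated words ("cloud-based" becomes "cloudbased"), and the length filter drops short numbers such as the font size "12".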

In [44]:
# Step 4: Text preprocessing
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    stop_words = set(stopwords.words('english'))
    words = nltk.word_tokenize(text)
    return ' '.join([word for word in words if word not in stop_words and len(word) > 2])

cleaned_text = preprocess_text(text)
cleaned_text

'write research paper contains following define describe cloudbased collaboration google docs cloudbased tool used document sharing discuss pros cons using google docs businessbased documents compare contrast use google docs microsoft 365 word docs businessbased documents research paper requirements paper four pages long including title reference pages use times new roman size font throughout paper apply apa 7th edition style include three major sections title page main body references minimum two scholarly journal articles besides textbook required writing demonstrate thorough understanding materials address required elements writing use exceptional language skillfully communicates meaning readers clarity fluency virtually errorfree note plagiarism check required apa7 format include references within 8hrs'
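In the cleaned text above, "cloud-based" has become "cloudbased" and the font size "12" is gone. If those tokens matter, a variant along the following lines could keep them. This is a sketch under my own assumptions: it uses a regex tokenizer and a short illustrative `STOP_WORDS` set rather than NLTK, and it relaxes the length filter to one character.

```python
import re

# Small illustrative stop word list (assumption); the notebook uses NLTK's.
STOP_WORDS = {"a", "and", "is", "of", "the", "for", "to", "that", "with"}


def preprocess_keep_hyphens(text):
    """Lowercase, drop punctuation except intra-word hyphens, remove stop words."""
    text = text.lower()
    # Tokens are runs of letters/digits, optionally joined by single hyphens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    # Drop stop words and single letters; keep short numbers such as "12".
    kept = [t for t in tokens if t not in STOP_WORDS and (len(t) > 1 or t.isdigit())]
    return " ".join(kept)


print(preprocess_keep_hyphens(
    "Use Times New Roman, size 12 font. Define cloud-based collaboration."
))
```

Extracting tokens with `re.findall` instead of deleting punctuation means a hyphen can only survive between two word characters, so stray dashes are still removed.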

In [53]:
# Step 5: Keyword extraction implementations

# 5.1 RAKE Implementation
rake = Rake()
rake.extract_keywords_from_text(text)
rake_keywords = rake.get_ranked_phrases()[:10]
rake_keywords

['apply apa 7th edition style',
 'two scholarly journal articles',
 'size 12 font throughout',
 'include three major sections',
 'microsoft 365 word docs',
 'use times new roman',
 'skillfully communicates meaning',
 'use exceptional language',
 'four pages long',
 'based tool used']

In [54]:
rake.get_ranked_phrases()

['apply apa 7th edition style',
 'two scholarly journal articles',
 'size 12 font throughout',
 'include three major sections',
 'microsoft 365 word docs',
 'use times new roman',
 'skillfully communicates meaning',
 'use exceptional language',
 'four pages long',
 'based tool used',
 'using google docs',
 'plagiarism check required',
 'research paper requirements',
 'google docs',
 'google docs',
 'reference pages',
 'include references',
 'research paper',
 'based documents',
 'based documents',
 'based collaboration',
 'within 8hrs',
 'virtually error',
 'thorough understanding',
 'required elements',
 'main body',
 'document sharing',
 'discuss pros',
 'apa7 format',
 'title page',
 'describe cloud',
 'use',
 'required',
 'paper',
 'paper',
 'title',
 'references',
 'cloud',
 'writing',
 'writing',
 'write',
 'textbook',
 'readers',
 'note',
 'minimum',
 'materials',
 'including',
 'free',
 'following',
 'fluency',
 'demonstrate',
 'define',
 'contrast',
 'contains',
 'cons',
 'com

In [47]:
# 5.2 YAKE Implementation
yake = KeywordExtractor(lan="en", top=20)
yake_keywords = yake.extract_keywords(cleaned_text)
yake_keywords

[('docs businessbased documents', 0.0011432701183019271),
 ('word docs businessbased', 0.0012491307146187405),
 ('google docs microsoft', 0.0017235193912572293),
 ('google docs businessbased', 0.0018582265824709117),
 ('edition style include', 0.0018696597986292672),
 ('sharing discuss pros', 0.002017488396269081),
 ('discuss pros cons', 0.002017488396269081),
 ('roman size font', 0.002017488396269081),
 ('scholarly journal articles', 0.002017488396269081),
 ('understanding materials address', 0.002017488396269081),
 ('exceptional language skillfully', 0.002017488396269081),
 ('language skillfully communicates', 0.002017488396269081),
 ('skillfully communicates meaning', 0.002017488396269081),
 ('communicates meaning readers', 0.002017488396269081),
 ('meaning readers clarity', 0.002017488396269081),
 ('readers clarity fluency', 0.002017488396269081),
 ('clarity fluency virtually', 0.002017488396269081),
 ('fluency virtually errorfree', 0.002017488396269081),
 ('virtually errorfree not

In [55]:
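# 5.3 TF-IDF Implementation (Scikit-learn)
def tfidf_extractor(text, n=30):
    # Unigrams and bigrams. Note: with a single document the IDF factor is
    # identical for every feature, so this ranks by normalized term frequency.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform([text])
    feature_names = vectorizer.get_feature_names_out()
    # Sum each feature's column and keep the n highest-scoring features
    return sorted(zip(feature_names, tfidf_matrix.sum(0).A1),
                  key=lambda x: x[1], reverse=True)[:n]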

tfidf_keywords = tfidf_extractor(cleaned_text)
tfidf_keywords


[('docs', 0.2377753193111056),
 ('paper', 0.2377753193111056),
 ('google', 0.1783314894833292),
 ('google docs', 0.1783314894833292),
 ('required', 0.1783314894833292),
 ('use', 0.1783314894833292),
 ('businessbased', 0.1188876596555528),
 ('businessbased documents', 0.1188876596555528),
 ('cloudbased', 0.1188876596555528),
 ('docs businessbased', 0.1188876596555528),
 ('documents', 0.1188876596555528),
 ('include', 0.1188876596555528),
 ('pages', 0.1188876596555528),
 ('references', 0.1188876596555528),
 ('research', 0.1188876596555528),
 ('research paper', 0.1188876596555528),
 ('title', 0.1188876596555528),
 ('writing', 0.1188876596555528),
 ('365', 0.0594438298277764),
 ('365 word', 0.0594438298277764),
 ('7th', 0.0594438298277764),
 ('7th edition', 0.0594438298277764),
 ('8hrs', 0.0594438298277764),
 ('address', 0.0594438298277764),
 ('address required', 0.0594438298277764),
 ('apa', 0.0594438298277764),
 ('apa 7th', 0.0594438298277764),
 ('apa7', 0.0594438298277764),
 ('apa7 form

In [57]:
# Step 6: Display results
# print("RAKE Keywords:", rake_keywords)  # rake_keywords is already a list of phrase strings
print("\nYAKE Keywords:", [kw[0] for kw in yake_keywords])
print("\nTF-IDF Keywords:", [kw[0] for kw in tfidf_keywords])


YAKE Keywords: ['docs businessbased documents', 'word docs businessbased', 'google docs microsoft', 'google docs businessbased', 'edition style include', 'sharing discuss pros', 'discuss pros cons', 'roman size font', 'scholarly journal articles', 'understanding materials address', 'exceptional language skillfully', 'language skillfully communicates', 'skillfully communicates meaning', 'communicates meaning readers', 'meaning readers clarity', 'readers clarity fluency', 'clarity fluency virtually', 'fluency virtually errorfree', 'virtually errorfree note', 'errorfree note plagiarism']

TF-IDF Keywords: ['docs', 'paper', 'google', 'google docs', 'required', 'use', 'businessbased', 'businessbased documents', 'cloudbased', 'docs businessbased', 'documents', 'include', 'pages', 'references', 'research', 'research paper', 'title', 'writing', '365', '365 word', '7th', '7th edition', '8hrs', 'address', 'address required', 'apa', 'apa 7th', 'apa7', 'apa7 format', 'apply']
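Every TF-IDF score printed in the earlier cell is an integer multiple of about 0.0594, which is what plain term counts look like after L2 normalization. The reason is that the vectorizer was fitted on one document, so the IDF factor cannot distinguish terms. The sketch below reimplements the smoothed IDF formula that scikit-learn documents for its default `smooth_idf=True` and compares a single document with a small corpus. The corpus contents and the helper names are hypothetical, and only unigrams are handled.

```python
import math
from collections import Counter


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_scores(docs, target=0):
    """L2-normalized TF-IDF scores for docs[target], unigram tokens only."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    counts = Counter(tokenized[target])
    raw = {t: c * smoothed_idf(len(docs), doc_freq[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


# Hypothetical mini-corpus: two extra assignment prompts as reference documents.
corpus = [
    "google docs research paper google docs",
    "research paper on network security",
    "research paper on database design",
]
alone = tfidf_scores(corpus[:1])
in_corpus = tfidf_scores(corpus)
# Ratio of the generic term "paper" to the topical term "google".
print(alone["paper"] / alone["google"], in_corpus["paper"] / in_corpus["google"])
```

With reference documents available, terms that appear in every prompt (such as "research" and "paper") are weighted down relative to topical terms, which a single-document fit cannot do.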


In [59]:

# Step 7: Title generation
def generate_title(keywords_list):
    # Simple strategy: join the top keyword from each method's list
    return ' '.join(keywords[0] for keywords in keywords_list)

title = generate_title([
    # rake_keywords,  # already a list of phrase strings
    [kw[0] for kw in yake_keywords],
    [kw[0] for kw in tfidf_keywords]
])

print("\nGenerated Title:", title)


Generated Title: docs businessbased documents docs
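The generated title repeats "docs" because the top YAKE phrase and the top TF-IDF term overlap. A small sketch of a combiner that skips words already used is given below. The name `combine_keywords` and the `max_words` cap are assumptions, and this only removes the repetition; it does not make the title more meaningful.

```python
def combine_keywords(phrases, max_words=6):
    """Join phrases into one title, skipping words that were already used."""
    seen = set()
    words = []
    for phrase in phrases:
        for word in phrase.split():
            key = word.lower()
            if key not in seen:
                seen.add(key)
                words.append(word)
    return " ".join(words[:max_words]).title()


# Top YAKE and TF-IDF results from the cells above.
print(combine_keywords(["docs businessbased documents", "docs"]))
```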


### Preliminary Conclusions
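Keyword extraction alone is not the right approach for generating a meaningful title from the content provided: the generated title above is a string of overlapping terms, not a readable phrase. A deep learning (DL) approach is likely needed here.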

End.