<a href="https://colab.research.google.com/github/huynhhoc/AI-VLU/blob/main/Sources/KeywordsExtraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Extracting Keywords with TF-IDF and Python’s Scikit-Learn
TF-IDF can be used for a wide range of tasks, including:
1. text classification
2. clustering / topic modeling
3. search
4. keyword extraction
Source: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.YRuOZ4gzbIU

**Dataset**
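The source tutorial uses two Stack Overflow files: stackoverflow-data-idf.json (20,000 posts), used to compute the Inverse Document Frequency (IDF), and stackoverflow-test.json (500 posts), used as a test set for keyword extraction. That dataset is based on the publicly available Stack Overflow dump from Google's BigQuery. This notebook instead loads papers.csv (see PAPERS_PATH in the constants cell), which appears to come from the NIPS papers dataset on Kaggle; the first 10 papers are held out as the test set and the rest are used to compute the IDF.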
https://www.kaggle.com/rowhitswami/nips-papers-1987-2019-updated/?select=authors.csv
https://www.kaggle.com/rowhitswami/nips-papers-1987-2019-updated/?select=papers.csv

In [55]:
# Libraries for file handling, data loading, and HTTP downloads
import os
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import requests

In [56]:
# General libraries
import re, os, string
import pandas as pd

# Scikit-learn importings
from sklearn.feature_extraction.text import TfidfVectorizer

In [57]:
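def get_stopwords_list(stop_file_path):
    """Download a newline-separated stop-word file and return its words as a list."""
    response = requests.get(stop_file_path)
    response.raise_for_status()
    # Iterating over the Response object yields raw byte chunks, not lines,
    # so split the decoded text into one stop word per line instead.
    stop_set = {line.strip() for line in response.text.splitlines() if line.strip()}
    return sorted(stop_set)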
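
A stop-word list only takes effect when every entry is a plain string token that matches the vectorizer's own tokens. The sketch below checks this on a tiny example. It assumes the stop-word file is newline-separated; `parse_stopwords`, the `raw` string, and the two sample documents are hypothetical stand-ins, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def parse_stopwords(raw_text):
    """Turn newline-separated text into a sorted list of unique, lowercased stop words."""
    return sorted({line.strip().lower() for line in raw_text.splitlines() if line.strip()})

# Hypothetical in-memory stand-in for the downloaded stop-word file.
raw = "the\nof\nand\n\n is \n"
stop_words = parse_stopwords(raw)

# Two made-up documents that contain several of the stop words.
docs = ["the state of the network is stable", "the bit and the synapse"]

vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorizer.fit(docs)
vocabulary = sorted(vectorizer.vocabulary_)

print(stop_words)
print(vocabulary)
```

If a word such as "the" still shows up in the vocabulary after fitting, the list passed to `stop_words` is probably not in the expected one-word-per-entry form.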

In [58]:
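def clean_text(text):
    """Lowercase a document, strip punctuation, and collapse whitespace."""

    # Lowercasing text
    text = text.lower()

    # Removing punctuation (PUNCTUATION is defined in the constants cell)
    text = "".join([c for c in text if c not in PUNCTUATION])

    # Collapsing runs of whitespace and newlines into a single space
    text = re.sub(r'\s+', ' ', text)

    return text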
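
To see the effect of these three steps on one string, here is a minimal sketch. It is a variant rather than the notebook's function: `clean_text_sketch` is a hypothetical name, it deletes punctuation with `str.translate` instead of a character loop, and it also strips leading and trailing spaces. The sample string is made up, loosely modelled on the start of the first paper.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)

def clean_text_sketch(text):
    """Lowercase, delete punctuation, and collapse whitespace runs to one space."""
    text = text.lower().translate(_DROP_PUNCT)
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample with newlines, a hyphen, and a period.
sample = "573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan F. Murray"
cleaned = clean_text_sketch(sample)
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, a hyphenated word written without spaces (for example "multi-level") is joined into a single token.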

In [59]:
def sort_coo(coo_matrix):
    """Sort (column index, score) pairs by descending score; ties go to the higher index"""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
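
The sorting step can be tried on a hand-made sparse row. This sketch assumes SciPy is installed (scikit-learn depends on it); the five scores are invented for illustration.

```python
from scipy.sparse import csr_matrix

# One made-up TF-IDF row with three non-zero entries (columns 1, 3 and 4).
row = csr_matrix([[0.0, 0.5, 0.0, 0.2, 0.5]])
coo = row.tocoo()

# Pair each column index with its score, then sort by score (and index) descending.
pairs = sorted(
    zip(coo.col.tolist(), coo.data.tolist()),
    key=lambda p: (p[1], p[0]),
    reverse=True,
)
print(pairs)
```

Only the non-zero entries appear in the result, which is why converting to COO format is convenient here: `col` and `data` line up one-to-one.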

In [60]:
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # map each feature name to its score (dict insertion order preserves the ranking)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    
    return results
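
The same mapping can be written as a single dict comprehension. The sketch below is an equivalent compact form, not the notebook's function; `top_n_as_dict` and the sample names and scores are hypothetical.

```python
def top_n_as_dict(feature_names, sorted_items, topn=10):
    """Map the names of the top-n (index, score) pairs to their rounded scores."""
    return {
        feature_names[idx]: round(float(score), 3)
        for idx, score in sorted_items[:topn]
    }

# Made-up vocabulary and an already-sorted list of (index, score) pairs.
feature_names = ["bit", "network", "neuron", "synaptic"]
sorted_items = [(3, 0.61234), (0, 0.5), (1, 0.25), (2, 0.1)]

top = top_n_as_dict(feature_names, sorted_items, topn=2)
print(top)
```

Since Python dicts keep insertion order, iterating over the result yields the keywords from highest to lowest score.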

In [61]:
def get_keywords(vectorizer, feature_names, doc):
    """Return top k keywords from a doc using TF-IDF method"""

    #generate tf-idf for the given document
    tf_idf_vector = vectorizer.transform([doc])
    
    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())

    #extract only TOP_K_KEYWORDS
    keywords=extract_topn_from_vector(feature_names,sorted_items,TOP_K_KEYWORDS)
    
    return list(keywords.keys())

In [62]:
# Constants
PUNCTUATION = string.punctuation  # the same ASCII punctuation set, without the invalid "\]" escape
TOP_K_KEYWORDS = 10 # top k number of keywords to retrieve in a ranked document
STOPWORD_PATH = "https://raw.githubusercontent.com/huynhhoc/AI-VLU/main/resources/stopwords.txt"
PAPERS_PATH = "https://raw.githubusercontent.com/huynhhoc/AI-VLU/main/Data/papers.csv"

In [63]:
data = pd.read_csv(PAPERS_PATH)
data.head()

| | source_id | year | title | abstract | full_text |
|---|---|---|---|---|---|
| 0 | 27 | 1987 | Bit-Serial Neural Networks | | 573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan... |
| 1 | 63 | 1987 | Connectivity Versus Entropy | | 1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S... |
| 2 | 60 | 1987 | The Hopfield Model with Multi-Level Neurons | | 278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N... |
| 3 | 59 | 1987 | How Neural Nets Work | | 442 \n\nAlan Lapedes \nRobert Farber \n\nThe... |
| 4 | 69 | 1987 | Spatial Organization of Neural Networks: A Pro... | | 740 \n\nSPATIAL ORGANIZATION OF NEURAL NEn... |


In [64]:
data.dropna(subset=['full_text'], inplace=True)

# Preparing data

In [65]:
data['full_text'] = data['full_text'].apply(clean_text)

In [66]:
data.head()

| | source_id | year | title | abstract | full_text |
|---|---|---|---|---|---|
| 0 | 27 | 1987 | Bit-Serial Neural Networks | | 573 bit serial neural networks alan f murray a... |
| 1 | 63 | 1987 | Connectivity Versus Entropy | | 1 connectivity versus entropy yaser s abumosta... |
| 2 | 60 | 1987 | The Hopfield Model with Multi-Level Neurons | | 278 the hopfield model with mul tilevel neuron... |
| 3 | 59 | 1987 | How Neural Nets Work | | 442 alan lapedes robert farber theoretical div... |
| 4 | 69 | 1987 | Spatial Organization of Neural Networks: A Pro... | | 740 spatial organization of neural nenorks a p... |


In [67]:
corpora = data['full_text'].to_list()

# Keywords Extraction using TF-IDF

In [70]:
# Load the list of stop words
stopwords = get_stopwords_list(STOPWORD_PATH)

# Initializing TF-IDF Vectorizer with stopwords
vectorizer = TfidfVectorizer(stop_words=stopwords, smooth_idf=True, use_idf=True)

# Creating vocab with our corpora
# Excluding the first 10 docs, which are held out for testing
vectorizer.fit(corpora[10:])

# Storing vocab (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vectorizer.get_feature_names_out()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# create a vocabulary of words,
# ignore words that appear in more than 85% of documents,
# eliminate stop words
cv = CountVectorizer(max_df=0.85, stop_words=stopwords)
word_count_vector = cv.fit_transform(corpora)
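
A small sketch of what `max_df=0.85` does, using four invented documents: a term is dropped when its document frequency is strictly greater than 85% of the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "network" occurs in all 4 documents, every other word in 2.
docs = [
    "network learns weights",
    "network stores patterns",
    "network learns patterns",
    "network stores weights",
]

cv = CountVectorizer(max_df=0.85)
counts = cv.fit_transform(docs)
vocabulary = sorted(cv.vocabulary_)
print(vocabulary)
print(counts.shape)
```

With four documents the cutoff is 0.85 × 4 = 3.4, so a term found in all four is removed while a term found in three or fewer is kept.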


In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)
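
With `smooth_idf=True`, scikit-learn documents the IDF weight as ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below checks that formula on an invented corpus and also confirms that `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as `TfidfVectorizer` alone (default settings otherwise).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up corpus for illustration.
docs = [
    "neural networks learn weights",
    "networks of neurons store patterns",
    "hopfield networks store patterns",
]

# Two-step route: raw counts first, then IDF weighting and L2 normalisation.
counts = CountVectorizer().fit_transform(docs)
transformer = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

# Recompute the smoothed IDF by hand from document frequencies.
n_docs = counts.shape[0]
doc_freq = (counts.toarray() > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
idf_matches = np.allclose(transformer.idf_, expected_idf)

# One-step route: TfidfVectorizer does both stages internally.
two_step = transformer.transform(counts).toarray()
one_step = TfidfVectorizer(smooth_idf=True, use_idf=True).fit_transform(docs).toarray()
routes_match = np.allclose(two_step, one_step)

print(idf_matches, routes_match)
```

The two-step route is useful when the raw counts are needed for something else; otherwise `TfidfVectorizer` is the shorter path to the same weights.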

# Result 🔥

In [71]:
result = []
for doc in corpora[0:10]:
    # one record per held-out document
    row = {}
    row['full_text'] = doc
    row['top_keywords'] = get_keywords(vectorizer, feature_names, doc)
    result.append(row)

final = pd.DataFrame(result)
final

| | full_text | top_keywords |
|---|---|---|
| 0 | 573 bit serial neural networks alan f murray a... | [the, and, of, to, is, synaptic, bit, state, a... |
| 1 | 1 connectivity versus entropy yaser s abumosta... | [the, v2, 2n, of, h2, 2k, environment, to, is,... |
| 2 | 278 the hopfield model with mul tilevel neuron... | [the, of, is, qnn, neurons, in, hopfields, cap... |
| 3 | 442 alan lapedes robert farber theoretical div... | [the, of, to, is, in, and, bumps, eqn, bump, net] |
| 4 | 740 spatial organization of neural nenorks a p... | [the, of, queueing, in, network, and, stimulat... |
| 5 | 775 a neuralnetwork solution to the concentrat... | [the, of, to, sites, assignment, site, and, in... |
| 6 | 642 learning by st ate recurrence detecfion br... | [the, recurrence, failure, of, to, aseace, sta... |
| 7 | 554 stability results for neural networks a n ... | [the, of, equilibrium, in, and, stability, for... |
| 8 | 804 introduction to a system for implementing ... | [the, of, processors, processor, is, to, paths... |
| 9 | 474 optimiza non with artificial neural networ... | [the, dipole, of, to, eq, for, and, settling, ... |
