# Keyword extraction from papers


This notebook shows a simple strategy to collect a reasonable set of keywords from a collection of documents. 

As an example, keywords are extracted from a corpus of NeurIPS abstracts.

In [1]:
import os
import json
from collections import defaultdict

from tqdm.notebook import tqdm

from papeles.paper.neurips import get_key


In [2]:
NEURIPS_ANALYSIS_DATA_PATH = '/var/data/neurips_analysis'

metadata_path = os.path.join(NEURIPS_ANALYSIS_DATA_PATH, 'files_metadata/')
metadata = {}
for filename in tqdm(os.listdir(metadata_path), 'loading metadata'):
    with open(os.path.join(metadata_path, filename), 'r') as f:  # each line is parsed as one JSON record
        for line in f:
            data = json.loads(line)
            metadata[get_key(data['pdf_name'])] = data






In [3]:
from papeles.utils import text as text_utils
from papeles.utils import keywords


## Corpus generation 

To extract keywords, let's first generate two datasets of n-grams: one for n=2 and another for n=3.

The goal is to extract keywords from these versions of the data. 
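
To make the n-gram representation concrete, here is a minimal sketch of what an n-gram text might look like. The function `ngram_text`, its tokenization rule, and the underscore-joined output format are assumptions made for illustration; the actual behavior of `text_utils.generate_ngram_text` in `papeles` may differ.

```python
import re


def ngram_text(text, n):
    """Hypothetical stand-in for an n-gram text generator (not the papeles implementation).

    Lowercases the text, keeps alphanumeric tokens, and joins every run of
    n consecutive tokens with underscores so each n-gram becomes one "word".
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ngrams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return " ".join(ngrams)


bigrams = ngram_text("Deep neural networks learn", 2)
trigrams = ngram_text("Deep neural networks learn", 3)
```

Representing each n-gram as a single token means that a word-level model can later score whole phrases such as `neural_networks` rather than isolated words.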

In [4]:
text_list_n2_year = defaultdict(list)
text_list_n3_year = defaultdict(list)
for file, data in tqdm(metadata.items()):
    text_list_n2_year[data['year']].append(text_utils.generate_ngram_text(data['abstract'], 2))
    text_list_n3_year[data['year']].append(text_utils.generate_ngram_text(data['abstract'], 3))






## Keywords extraction 

Now that the datasets are generated, let's extract keywords using a very simple TF-IDF model implemented in the `papeles` Python package.
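
For readers unfamiliar with TF-IDF, the following is a rough sketch of the general idea, not the `papeles` implementation. The function name `tfidf_keywords`, the whitespace tokenization, and the particular weighting (relative term frequency times `log(N / df)`, summed over documents) are all assumptions chosen for illustration; `keywords.get_keywords` may score and return terms differently.

```python
import math
from collections import Counter


def tfidf_keywords(documents):
    """Hypothetical TF-IDF scorer over a list of whitespace-tokenized documents.

    Returns a Counter mapping each term to its TF-IDF score summed over all documents.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            tf = count / len(tokens)                 # relative frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # zero for terms present in every document
            scores[term] += tf * idf
    return scores


docs = ["a_b b_c", "a_b c_d", "a_b c_d"]
scores = tfidf_keywords(docs)
top_term = scores.most_common(1)[0][0]
```

In this toy corpus `a_b` occurs in every document, so its IDF and therefore its score is zero, while the rarer `b_c` ranks first. Computing IDF over a single year's documents, as the cell below does, means a term is penalized only if it is common within that year.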

In [5]:
# Note that keywords are extracted per year (IDF is computed over that year's documents only)

year_keywords_counter_n2 = {}
year_keywords_counter_n3 = {}
for year in tqdm(range(2009, 2020), 'year'):
    year_keywords_counter_n2[year] = keywords.get_keywords(text_list_n2_year[year])
    year_keywords_counter_n3[year] = keywords.get_keywords(text_list_n3_year[year])




