In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')
import embeddings
import new_ngram
import new_ngram_comb
import similarity_calculator

new_ngram_combs = new_ngram_comb.calculate_new_ngram_combs(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_words.csv')  # 2000 is the last year of the baseline
new_word_comb = new_ngram_combs()



Running on CPU
Baseline built.
Done.


373it [00:00, 462.19it/s]


# Tutorial

## Overview

This notebook shows how to compute novelty and impact metrics for scientific papers.
It showcases the measurement of novelty for a custom set of papers downloaded from OpenAlex.
Novelty and impact are measured only within this subset of papers.

The following steps are undertaken:
1. **Data Collection**: Data is downloaded from OpenAlex using the API
2. **Preprocessing**: The title and abstract of each paper in the subset are processed
3. **Text Embedding**: The unprocessed title and abstract of each paper are transformed into 768-dimensional vectors using SPECTER
4. **Cosine Distance**: For each paper, the average and maximum cosine distance are calculated from the embedding vectors
5. **New words**: For each paper, the new words it introduces and their reuse in future papers are identified
6. **New bigrams**: For each paper, the new bigrams it introduces and their reuse in future papers are identified
7. **New trigrams**: For each paper, the new trigrams it introduces and their reuse in future papers are identified
8. **New word combinations**: For each paper, the new word combinations it introduces and their reuse in future papers are identified

## Initial imports

In [3]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')

import pandas as pd
import requests
import time
import os
from tqdm.notebook import tqdm

import preprocessing
import csv

## Increase the max size of a line reading, otherwise an error is raised
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

import embeddings
import new_ngram
import new_ngram_comb
import similarity_calculator

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Reading stopwords...


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\u0152835\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 1. Data Collection

In this phase, all papers matching the search terms "natural language processing" and "novelty" are downloaded from OpenAlex.

The papers are downloaded in chunks of 100 and stored in a tab-separated file called `papers_raw.csv`.
The publication date of each paper is required, and the papers must be sorted by publication date.

The resulting file contains four columns: *PaperID*, *Date*, *Title* and *Abstract*.

See the notebook [`1.data-collection.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/1.data-collection.ipynb) for more detailed information on this process.
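
The download cell relies on `preprocessing.plain_text_from_inverted`, whose implementation is not shown in this notebook. OpenAlex returns abstracts as an inverted index that maps each word to the positions where it occurs. The following is a minimal sketch of how such an index can be turned back into plain text, assuming that shape; the name `inverted_index_to_text` is hypothetical and is not part of the repository.

```python
def inverted_index_to_text(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping (hypothetical helper)."""
    # The index may be missing (None) when no abstract is available
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then order them by position
    positioned = [(pos, word)
                  for word, positions in inverted_index.items()
                  for pos in positions]
    return ' '.join(word for _, word in sorted(positioned))


example = {'Novelty': [0], 'in': [1], 'science': [2, 4], 'and': [3]}
print(inverted_index_to_text(example))
```

The download cell itself does not show the step that writes `papers_raw.csv`. Assuming the standard pandas writer, a call such as `papers.to_csv('../data/raw/papers_raw.csv', sep='\t', index=False)` would produce a tab-separated file of the kind read in the later steps.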

In [6]:
# OpenAlex API URL
url = "https://api.openalex.org/works"

# This is an example query
query = '(natural language processing) & novelty'

# Define the initial page and per page variables
page = 1
per_page = 100 
papers = []

params = {'search': query, 'filter': 'type:article'}

# Send a GET request to the API
response = requests.get(url, params=params)
count = response.json()['meta']['count']
total_pages = round(count/per_page) + 1

print('Total papers: %d'%(count))

time.sleep(1)

print('Start querying...')
# Loop through all pages (+1 to get the last page)
for page in tqdm(range(1,total_pages + 1)):
    
    # Get the cursor for the first page
    if page == 1:
        cursor = '*'

    params = {
        'search': query,
        'sort': 'publication_date',
        'per-page': per_page,
        'filter': 'type:article',
        'cursor' : cursor
    }

    # Send a GET request to the API
    response = requests.get(url, params=params)

    # If the request is successful
    if response.status_code == 200:
        data = response.json()

        # Get the data
        results = data.get('results',[])
        
        # Select the information needed from these publications
        papers.extend([((res['id'].split('/')[-1].replace('W','')),
                        res['publication_date'],
                        res['title'],
                        preprocessing.plain_text_from_inverted(res['abstract_inverted_index'])) 
                               for res in results])
        
        # Get the next cursor for the pagination
        cursor = response.json()['meta']['next_cursor']

        # Respect the API rate limit
        time.sleep(1)
        
    else:
        print(f"Request failed with status code {response.status_code}.")
        break
    
print('Creating the dataframe...')
papers = pd.DataFrame(papers, columns = ['PaperID','Date','Title','Abstract'])

print('Drop missing papers with missing title and abstract.')
papers = papers.dropna(subset = ['Title','Abstract'], how = 'all')

Total papers: 3364
Start querying...


  0%|          | 0/35 [00:00<?, ?it/s]

Creating the dataframe...
Drop missing papers with missing title and abstract


## 2. Preprocessing 

In this phase, the subset of papers downloaded from OpenAlex is processed following the procedure in Arts et al. (2023).
The processing creates three files containing the words, bigrams and trigrams of each paper, respectively.
The preprocessing is done on the fly: papers are read one at a time so that the whole dataset never has to fit in memory.

Three files are generated from this process and put in the `data/processed/` directory:

- `papers_words.csv`: one row and three columns for each paper. Columns: *PaperID*, *Words_Title* and *Words_Abstract*.
- `papers_bigrams.csv`: one row and three columns for each paper. Columns: *PaperID*, *Bigrams_Title* and *Bigrams_Abstract*.
- `papers_trigrams.csv`: one row and three columns for each paper. Columns: *PaperID*, *Trigrams_Title* and *Trigrams_Abstract*.

See the notebook [`2.preprocessing.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/2.preprocessing.ipynb) for more detailed information on this process.
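
To make the three outputs concrete, here is a minimal sketch of how unigrams, bigrams and trigrams can be derived from a text. It is not the repository's `preprocessing.process_text`, which is not shown here and appears to also load stopwords and WordNet; the function `process_text_sketch` and its tokenization rule are assumptions made for illustration. The n-grams are joined with underscores, matching the form seen in the outputs of the later steps.

```python
import re


def ngrams(tokens, n):
    """Return the n-grams of a token list, joined with underscores."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def process_text_sketch(text):
    """Hypothetical stand-in: lowercase, tokenize, build 1-, 2- and 3-grams."""
    # Keep alphanumeric tokens, allowing hyphens inside a word (e.g. "two-step")
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return tokens, ngrams(tokens, 2), ngrams(tokens, 3)


unigrams, bigrams, trigrams = process_text_sketch('Unified Medical Language System')
print(unigrams)
print(bigrams)
print(trigrams)
```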

In [5]:
print('Get the number of papers to process...')
with open('../data/raw/papers_raw.csv', 'r', encoding = 'utf-8') as file:
    line_count = sum(1 for line in file)

# Subtract 1 for the header if the CSV has a header
total_papers = line_count - 1

print('Preparing for writing...')
# Open the output files as UTF-8 so non-ASCII tokens are written consistently with the input
words_write = open('../data/processed/papers_words.csv', 'w', encoding='utf-8')
words_write.write('PaperID,Words_Title,Words_Abstract\n') # write the first line for the headers
bigrams_write = open('../data/processed/papers_bigrams.csv', 'w', encoding='utf-8')
bigrams_write.write('PaperID,Bigrams_Title,Bigrams_Abstract\n') # write the first line for the headers
trigrams_write = open('../data/processed/papers_trigrams.csv', 'w', encoding='utf-8')
trigrams_write.write('PaperID,Trigrams_Title,Trigrams_Abstract\n') # write the first line for the headers

print('Processing...')
with open('../data/raw/papers_raw.csv', 'r', encoding='utf-8') as reader:
    csv_reader = csv.reader(reader, delimiter='\t', quotechar='"')
    
    # Skip header
    next(csv_reader)

    for line in tqdm(csv_reader, total = total_papers):
        
        writing_words = line[0] # add the PaperID
        writing_bigrams = line[0] # add the PaperID
        writing_trigrams = line[0] # add the PaperID
        
        ## Assuming that the first two columns are the PaperID and the Date
        for text in [line[2], line[3]]:  # loop over title and abstract
            
            # preprocess text (either title or abstract)            
            unigrams, bigrams, trigrams = preprocessing.process_text(text)
            
            writing_words += ',' + ' '.join(unigrams)
            writing_bigrams += ',' + ' '.join(bigrams)
            writing_trigrams += ',' + ' '.join(trigrams)
            
        words_write.write(writing_words + '\n')
        bigrams_write.write(writing_bigrams + '\n')
        trigrams_write.write(writing_trigrams + '\n')
            
# close the file
words_write.close()
bigrams_write.close()      
trigrams_write.close()

Get the number of papers to process...
Preparing for writing...
Processing...


  0%|          | 0/3474 [00:00<?, ?it/s]

## 3. Text Embeddings 

In this phase, the subset of papers downloaded from OpenAlex is transformed into 768-dimensional vectors using SPECTER.
A Python module called `embedding_generator` is imported to perform this process.

One file per publication year is created in the directory `data/vectors/`, storing the embeddings in either CSV or NumPy format, depending on the `storage` setting.

See the notebook [`3.text-embeddings.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/3.text-embeddings.ipynb) for more detailed information on this process.
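
The `chunk_size` parameter used in the cell below controls how many papers are handed to the embedding model at a time. The internals of `embedding_generator` are not shown here, so the following is only a sketch of the batching idea, with `iter_chunks` as a hypothetical helper. Assuming the progress bar counts chunks, 3474 rows with a chunk size of 50 give the 70 steps shown in the output.

```python
import math


def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items (hypothetical helper)."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


rows = list(range(3474))              # stand-in for the rows of papers_raw.csv
chunks = list(iter_chunks(rows, 50))
print(len(chunks), math.ceil(3474 / 50))
```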

In [6]:
import embedding_generator as eg

# Set your directories and parameters
input_file = '../data/raw/papers_raw.csv'
output_dir = '../data/vectors/'
storage_method = 'csv'  # or 'numpy'
chunk_size = 50

# Call the function to process embeddings
eg.process_embeddings(input_file, output_dir, storage=storage_method, chunk_size=chunk_size)

Load the embedding model...
Using CPU.
Get the number of papers to process...
Processing...


  0%|          | 0/70 [00:00<?, ?it/s]

KeyboardInterrupt: 

## 4. Cosine distance 

In this phase, we calculate the cosine distance between the text embeddings to measure the similarity between different papers. We use the `similarity_calculator` module to compute these distances efficiently.

The similarities are stored in a file called `papers_cosine.csv` in the directory `data/metrics`.

The file contains one row per paper and three columns: *PaperID*, *max_similarity* and *avg_similarity*.

See the notebook [`4.cosine-distance.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/4.cosine-distance.ipynb) for more detailed information on this process.
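
As an illustration of the two reported columns, the sketch below computes the maximum and average cosine similarity of one paper against a set of earlier papers. It is written in plain Python for clarity and is not the `similarity_calculator` implementation; the function names, and the assumption that each paper is compared against papers from earlier years, are illustrative. The cosine distance is one minus the cosine similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_to_prior(vector, prior_vectors):
    """Return (max_similarity, avg_similarity) against earlier papers."""
    sims = [cosine_similarity(vector, p) for p in prior_vectors]
    return max(sims), sum(sims) / len(sims)


# Toy 2-dimensional embeddings; real SPECTER vectors have 768 dimensions
prior = [[1.0, 0.0], [0.0, 1.0]]
max_sim, avg_sim = similarity_to_prior([1.0, 0.0], prior)
print(max_sim, avg_sim)
```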

In [14]:
import similarity_calculator as sc

# Set your directories
input_dir = '../data/vectors/'
output_dir = '../data/metrics/'

# Get the years of the embeddings
years = sorted([int(f.split('_')[0]) for f in os.listdir(input_dir) if 'check' not in f])

# Set the range of years you want to process
start_year = min(years)
end_year = max(years) + 1

# Call the function to calculate similarities
sc.calculate_similarities(start_year, end_year, input_dir, output_dir)

Running on CPU


  0%|          | 0/150 [00:00<?, ?it/s]

Reading 1851...
Reading 1900...
Reading 1906...
Reading 1969...
Reading 1974...
Reading 1976...
Reading 1980...
Reading 1983...
Reading 1984...
Reading 1985...
Reading 1986...
Reading 1987...
Reading 1988...
Calculating similarities for 1988...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1989...
Calculating similarities for 1989...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1990...
Calculating similarities for 1990...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1991...
Calculating similarities for 1991...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1992...
Calculating similarities for 1992...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1993...
Calculating similarities for 1993...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1994...
Calculating similarities for 1994...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1995...
Calculating similarities for 1995...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1996...
Calculating similarities for 1996...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1997...
Calculating similarities for 1997...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1998...
Calculating similarities for 1998...


  0%|          | 0/1 [00:00<?, ?it/s]

Reading 1999...
Calculating similarities for 1999...


  0%|          | 0/1 [00:00<?, ?it/s]

## 5. New words

In this phase, we identify the new words introduced by each paper.
For each new word, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the word is counted.
A baseline of words is defined. Words that appear in the baseline are not considered new words.

### Methodology:
- **Establishing a Baseline**: A list of words serves as our baseline. Any word that belongs to the baseline is not classified as a 'new word'.

- **Data Processing and Comparison**: The script starts by importing data from a designated CSV file. Each word in the dataset is compared against the baseline to determine whether it is new. For each new word, the number of later papers that reuse it is counted.

- **Utilizing the new_ngram Module**: To streamline and optimize the process of identifying and counting new words, we import the `new_ngram` module.

### Prerequisites:
- **Baseline Year**: A specific year must be defined as the terminal year for the baseline. All papers published prior to this year contribute to the baseline. Consequently, words from these papers are not considered as new words.

- **File Paths**:

    - The file containing the processed papers.
    - The file containing the publication date of each paper.

### Assumptions:
To ensure the seamless execution of the code, we operate under the following assumptions:

- The dataset of papers is chronologically ordered based on their publication dates, with the most recent papers listed last.
- The publication year of each paper is located in the second column of the file containing the date information.

### Output:

The findings from this phase are cataloged in a file named `new_words.csv`, located in the `data/metrics` directory. This file is structured with each row representing a new word and three columns detailing the word, the PaperID of the paper that introduced it, and the reuse count, indicating the frequency of its subsequent usage.

See the notebook [`5.new-word.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/5.new-word.ipynb) for more detailed information on this process.
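
The logic described above can be sketched in a few lines. This is not the `new_ngram` implementation, which is not shown here: the function `find_new_ngrams`, its input format, the choice to treat the baseline year as inclusive, and the choice to count only later papers (not the introducing paper) as reuse are all assumptions made for illustration.

```python
def find_new_ngrams(papers, baseline_year):
    """Map each new n-gram to (first_paper_id, reuse_count).

    papers: iterable of (paper_id, year, ngrams), sorted by publication date.
    """
    baseline = set()
    new_ngrams = {}  # ngram -> [first_paper_id, reuse_count]
    for paper_id, year, ngrams in papers:
        for ngram in set(ngrams):  # count each n-gram once per paper
            if year <= baseline_year:
                baseline.add(ngram)            # assumption: baseline year is inclusive
            elif ngram in baseline:
                continue                       # already known, never "new"
            elif ngram in new_ngrams:
                new_ngrams[ngram][1] += 1      # reused by a later paper
            else:
                new_ngrams[ngram] = [paper_id, 0]  # first appearance
    return {ngram: tuple(info) for ngram, info in new_ngrams.items()}


toy_papers = [
    (1, 1999, ['language', 'model']),
    (2, 2001, ['language', 'transformer']),
    (3, 2002, ['transformer', 'model']),
    (4, 2003, ['transformer', 'novelty']),
]
result = find_new_ngrams(toy_papers, baseline_year=2000)
print(result)
```

In this toy run, `language` and `model` belong to the baseline, `transformer` is introduced by paper 2 and reused by two later papers, and `novelty` is introduced by paper 4 with no reuse.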

In [20]:
import new_ngram

# define the last year of the baseline and the directory of the dates and the processed file
new_words_calculator = new_ngram.calculate_new_ngrams(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_words.csv')
new_words = new_words_calculator()

new_words

Baseline built.
Done.


3362it [00:00, 15858.00it/s]


{'oil': (4253608092, 24),
 'filter': (4253608092, 52),
 'umls': (4253608092, 8),
 'simulate': (4253608092, 14),
 'disease': (4253608092, 119),
 'biomedical': (4253608092, 79),
 'clinically': (4253608092, 11),
 'fish': (4253608092, 2),
 'connecting': (4253608092, 9),
 'resulted': (4253608092, 23),
 'deficiency': (4253608092, 11),
 'two-step': (4253608092, 10),
 'experimentally': (4253608092, 30),
 'literature-based': (4253608092, 7),
 'swanson': (4253608092, 2),
 'unified': (4253608092, 40),
 'corroborated': (4253608092, 1),
 'simulating': (4253608092, 13),
 'cross-language': (2068649023, 2),
 'goal': (2068649023, 152),
 'series': (2068649023, 70),
 'proposal': (2068649023, 62),
 'organizing': (2068649023, 18),
 'clef': (2068649023, 3),
 'campaign': (2068649023, 22),
 'difficulty': (2068649023, 69),
 'letter': (2068649023, 11),
 'running': (2068649023, 24),
 'future': (2068649023, 300),
 'examined': (2068649023, 42),
 'forum': (2068649023, 36),
 'principle': (2166769854, 69),
 'meaning'

## 6. New bigrams

In this phase, we identify the new bigrams introduced by each paper. 
For each new bigram, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the bigram is counted.
A baseline of bigrams is defined. Bigrams that appear in the baseline are not considered as new bigrams. 

The procedure is the same as for new words.

See the notebook [`6.new-bigram.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/6.new-bigram.ipynb) for more detailed information on this process.

In [21]:
import new_ngram

# define the last year of the baseline and the directory of the dates and the processed file
new_bigrams_calculator = new_ngram.calculate_new_ngrams(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_bigrams.csv')
new_bigrams = new_bigrams_calculator()

new_bigrams

Baseline built.
Done.


3362it [00:00, 24185.07it/s]


{'medical_language': (4253608092, 5),
 'unified_medical': (4253608092, 5),
 'ha_resulted': (4253608092, 3),
 'lexical_rule': (2166769854, 1),
 'formal_semantics': (2166769854, 1),
 'organizing_principle': (2166769854, 1),
 'generative_lexicon': (2166769854, 1),
 'correct_sense': (2048592642, 1),
 'input_text': (2048592642, 5),
 'teaching_effectiveness': (2026914640, 1),
 'national_literacy': (2026914640, 1),
 'older_age': (2064926943, 2),
 'older_people': (2064926943, 1),
 'issue_arising': (1985039455, 1),
 'memory_management': (1985039455, 1),
 'field_focus': (1985039455, 1),
 'logic_program': (1985039455, 1),
 'data_structure': (1985039455, 6),
 'parallel_execution': (1985039455, 1),
 'shared_memory': (1985039455, 1),
 'memory_implementation': (1985039455, 1),
 'dynamic_data': (1985039455, 4),
 'comprehensive_survey': (1985039455, 8),
 'potentially_interesting': (1985039455, 1),
 'interesting_candidate': (1985039455, 3),
 'news_topic': (2094661073, 2),
 'news_coverage': (2094661073, 

## 7. New trigrams

In this phase, we identify the new trigrams introduced by each paper. 
For each new trigram, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the trigram is counted.
A baseline of trigrams is defined. Trigrams that appear in the baseline are not considered as new trigrams. 

The procedure is the same as for new words.

See the notebook [`7.new-trigram.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/7.new-trigram.ipynb) for more detailed information on this process.

In [22]:
import new_ngram

# define the last year of the baseline and the directory of the dates and the processed file
new_trigrams_calculator = new_ngram.calculate_new_ngrams(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_trigrams.csv')
new_trigrams = new_trigrams_calculator()

new_trigrams

Baseline built.
Done.


3362it [00:00, 22869.66it/s]


{'language_system_umls': (4253608092, 5),
 'unified_medical_language': (4253608092, 5),
 'static_and_dynamic': (2166769854, 3),
 'identifying_the_correct': (2048592642, 2),
 'teaching_and_learning': (4254747961, 8),
 'understanding_the_meaning': (2040022066, 1),
 'search_and_mining': (2040022066, 1),
 'structure_and_connection': (2040022066, 1),
 'artificial_intelligence_ai': (2034698323, 78),
 'lack_of_annotated': (2031868269, 2),
 'named_entity_tagging': (2031868269, 1),
 'extraction_of_entity': (2031868269, 1),
 'precision_and_recall': (2031868269, 21),
 'european_patent_office': (2093756022, 1),
 'environment_the_proposed': (2022083943, 1),
 'syntactic_and_semantic': (2022083943, 15),
 'neuro-linguistic_programming_nlp': (1603773285, 4),
 'wavelet_transform_cwt': (1603773285, 1),
 'experiment_on_real': (1501998038, 3),
 'knowledge_in_textual': (1501998038, 1),
 'understanding_of_human': (1652999312, 3),
 'tracking_the_evolution': (2154744626, 1),
 'opera_and_ballet': (2074306890, 1

## 8. New word combinations

In this phase, we identify the new word combinations introduced by each paper. 
For each new word combination, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the word combination is counted.
A baseline of word combinations is defined. Word combinations that appear in the baseline are not considered as new word combinations. 

The procedure is the same as for new words.

To streamline and optimize the process of identifying and counting new word combinations, we import the `new_ngram_comb` module.

See the notebook [`8.new-word-comb.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/8.new-word-comb.ipynb) for more detailed information on this process.
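
A word combination here is an unordered pair of words that occur in the same paper. The sketch below shows one way to enumerate such pairs, with each pair sorted alphabetically, which matches the form of the keys in the output further down. It is an illustration only: `word_pairs` is a hypothetical helper, not part of `new_ngram_comb`.

```python
from itertools import combinations


def word_pairs(words):
    """Unordered pairs of distinct words, each pair sorted alphabetically."""
    return list(combinations(sorted(set(words)), 2))


pairs = word_pairs(['umls', 'medical', 'knowledge', 'medical'])
print(pairs)
```

The pairs can then be compared against a baseline and counted in the same way as single words. Because the number of pairs grows quadratically with the number of distinct words in a paper, this step is noticeably slower than the n-gram steps.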

In [24]:
import new_ngram_comb

# define the last year of the baseline and the directory of the dates and the processed file
new_word_combs_calculator = new_ngram_comb.calculate_new_ngram_combs(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_words.csv')
new_word_combs = new_word_combs_calculator()

new_word_combs

Baseline built.
Done.


3362it [00:14, 235.51it/s]


{('experimentally', 'knowledge'): (4253608092, 7),
 ('natural', 'successfully'): (4253608092, 22),
 ('medical', 'successfully'): (4253608092, 5),
 ('natural', 'resulted'): (4253608092, 7),
 ('clinically', 'tested'): (4253608092, 1),
 ('natural', 'unified'): (4253608092, 12),
 ('information', 'tested'): (4253608092, 41),
 ('experimentally', 'ha'): (4253608092, 9),
 ('filter', 'natural'): (4253608092, 13),
 ('filter', 'semantic'): (4253608092, 6),
 ('knowledge', 'umls'): (4253608092, 4),
 ('language', 'two-step'): (4253608092, 4),
 ('knowledge', 'two-step'): (4253608092, 3),
 ('filter', 'language'): (4253608092, 15),
 ('context', 'filter'): (4253608092, 3),
 ('simulating', 'tested'): (4253608092, 2),
 ('clinically', 'knowledge'): (4253608092, 2),
 ('ha', 'simulate'): (4253608092, 5),
 ('medical', 'umls'): (4253608092, 6),
 ('corroborated', 'information'): (4253608092, 1),
 ('clinically', 'ha'): (4253608092, 6),
 ('literature-based', 'medical'): (4253608092, 2),
 ('ha', 'medical'): (42536