# Tutorial

## Overview

This notebook shows how to compute novelty and impact metrics for scientific papers.
It showcases the workflow on a custom set of papers downloaded from OpenAlex.
Novelty and impact are measured relative to this subset of papers only.

The following steps are undertaken:
1. **Data Collection**: Data is downloaded from OpenAlex using the API
2. **Preprocessing**: Title and abstract of the subset of papers are processed
3. **Text Embedding**: The unprocessed title and abstract of each paper are transformed into 768-dimensional vectors using SPECTER
4. **Semantic Distance**: For each paper the average and maximum distance are calculated using the embedding vectors
5. **New words**: For each paper the number of new words and their reuse in future papers are identified
6. **New phrases**: For each paper the number of new noun phrases and their reuse in future papers are identified
7. **New word combinations**: For each paper the number of new word combinations and their reuse in future papers are identified
8. **New phrase combinations**: For each paper the number of new phrase combinations and their reuse in future papers are identified

## Initial imports

In [None]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')

import pandas as pd
import requests
import time
import os
from tqdm.notebook import tqdm

import preprocessing
import csv

## Increase the max size of a line reading, otherwise an error is raised
maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10 
    # as long as the OverflowError occurs.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)

import embeddings
import new_ngram
import new_ngram_comb
import similarity_calculator

## 1. Data Collection

In this phase all papers containing the words "natural language processing" and "novelty" are downloaded from OpenAlex.

The papers are downloaded in chunks of 100 and stored in a tab-separated file called `papers_raw.csv`.
The publication date of each paper is required, and the papers must be ordered by publication date.

The resulting file contains four columns: *PaperID*, *Date*, *Title* and *Abstract*.

See the notebook [`1.data-collection.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/1.data-collection.ipynb) for more detailed information on this process.
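OpenAlex returns abstracts as an inverted index (a mapping from each word to the positions where it occurs), which is why the download cell below calls `preprocessing.plain_text_from_inverted`. The following is a minimal sketch of how such a reconstruction can work. The function name `plain_text_from_inverted_sketch` is hypothetical, and the implementation in the `preprocessing` module may differ.

```python
def plain_text_from_inverted_sketch(inverted_index):
    """Rebuild plain text from an OpenAlex-style inverted index (word -> positions)."""
    if not inverted_index:
        # Works without an abstract have no inverted index
        return None
    # Flatten to (position, word) pairs, then order them by position
    positioned = [(pos, word)
                  for word, positions in inverted_index.items()
                  for pos in positions]
    return ' '.join(word for _, word in sorted(positioned))


example = {'Novelty': [0], 'is': [1], 'hard': [2], 'to': [3, 6],
           'define': [4], 'and': [5], 'measure': [7]}
print(plain_text_from_inverted_sketch(example))
# Novelty is hard to define and to measure
```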

In [None]:
# OpenAlex API URL
url = "https://api.openalex.org/works"

# This is an example query
query = '(natural language processing) & novelty'

# Define the initial page and per page variables
page = 1
per_page = 100 
papers = []

params = {'search': query, 'filter': 'type:article'}

# Send a GET request to the API
response = requests.get(url, params=params)
count = response.json()['meta']['count']
# Ceiling division: number of pages needed to cover all results
total_pages = -(-count // per_page)

print('Total papers: %d'%(count))

time.sleep(1)

print('Start querying...')
# Loop through all pages (+1 to get the last page)
for page in tqdm(range(1,total_pages + 1)):
    
    # Get the cursor for the first page
    if page == 1:
        cursor = '*'

    params = {
        'search': query,
        'sort': 'publication_date',
        'per-page': per_page,
        'filter': 'type:article',
        'cursor' : cursor
    }

    # Send a GET request to the API
    response = requests.get(url, params=params)

    # If the request is successful
    if response.status_code == 200:
        data = response.json()

        # Get the data
        results = data.get('results',[])
        
        # Select the information needed from these publications
        papers.extend([((res['id'].split('/')[-1].replace('W','')),
                        res['publication_date'],
                        res['title'],
                        preprocessing.plain_text_from_inverted(res['abstract_inverted_index'])) 
                               for res in results])
        
        # Get the next cursor for the pagination
        cursor = data['meta']['next_cursor']

        # Respect the API rate limit
        time.sleep(1)
        
    else:
        print(f"Request failed with status code {response.status_code}.")
        break
    
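print('Creating the dataframe...')
papers = pd.DataFrame(papers, columns = ['PaperID','Date','Title','Abstract'])

print('Dropping papers with missing title and abstract.')
papers = papers.dropna(subset = ['Title','Abstract'], how = 'all')

# Store the papers as a tab-separated file, as expected by the next phases
papers.to_csv('../data/raw/papers_raw.csv', sep = '\t', index = False)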

## 2. Preprocessing 

In this phase, the subset of papers downloaded from OpenAlex is processed following the procedure in Arts et al. (2023).
The processing phase extracts the words and the phrases from the title and abstract of each paper.
The preprocessing is done on the fly: papers are read one by one to avoid overloading memory.

Two files are generated from this process and put in the `data/processed/` directory:

- `papers_words.csv`: one row per paper and four columns: *PaperID*, *Year*, *Words_Title* and *Words_Abstract*.
- `papers_phrases.csv`: one row per paper and four columns: *PaperID*, *Year*, *Phrases_Title* and *Phrases_Abstract*.

See the notebook [`2.preprocessing.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/2.preprocessing.ipynb) for more detailed information on this process.
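The cell below writes each row with a plain f-string, which works as long as the processed fields never contain commas or quotes. Whether `preprocessing.process_text` can emit such characters is not stated here, so the following is only a defensive sketch: it uses the standard `csv.writer`, which quotes fields when needed. The in-memory buffer stands in for the real output file.

```python
import csv
import io

# In-memory stand-in for open('../data/processed/papers_words.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)

writer.writerow(['PaperID', 'Year', 'Words_Title', 'Words_Abstract'])
# Illustrative row: fields containing commas or quotes are quoted automatically
writer.writerow(['123', '2020', 'novelty, measured', 'a "quoted" abstract'])

# Reading the data back recovers the original fields unchanged
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1])
```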

In [None]:
print('Get the number of papers to process...')
with open('../data/raw/papers_raw.csv', 'r', encoding = 'utf-8') as file:
    line_count = sum(1 for line in file)

# Subtract 1 for the header if the CSV has a header
total_papers = line_count - 1

print('Preparing for writing...')
words_writer = open('../data/processed/papers_words.csv', mode = 'w', encoding = 'utf-8')
words_writer.write('PaperID,Year,Words_Title,Words_Abstract\n') # write the first line for the headers

phrases_writer = open('../data/processed/papers_phrases.csv', mode = 'w', encoding = 'utf-8')
phrases_writer.write('PaperID,Year,Phrases_Title,Phrases_Abstract\n') # write the first line for the headers

print('Processing...')
with open('../data/raw/papers_raw.csv', mode = 'r', encoding='utf-8') as reader:
    csv_reader = csv.reader(reader, delimiter='\t', quotechar='"')
    
    # Skip header
    next(csv_reader)

    for line in tqdm(csv_reader, total = total_papers):
        
        paperID, date, title, abstract = line # unpack the four columns

        year = date.split('-')[0]

        title_words = preprocessing.process_text(title, 'words')
        abstract_words = preprocessing.process_text(abstract, 'words')

        title_phrases = preprocessing.process_text(title, 'phrases')
        abstract_phrases = preprocessing.process_text(abstract, 'phrases')    
            
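        words_writer.write(f'{paperID},{year},{title_words},{abstract_words}\n')
        phrases_writer.write(f'{paperID},{year},{title_phrases},{abstract_phrases}\n')

# Close the output files so that all rows are flushed to disk
words_writer.close()
phrases_writer.close()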

## 3. Text Embeddings 

In this phase, the subset of papers downloaded from OpenAlex is transformed into 768-dimensional vectors using SPECTER.
A Python module called `embedding_generator` is imported to perform this process.

A new file is created for each year, storing the embeddings in either CSV or NumPy format (depending on the chosen storage method) in the directory `data/vectors/`.

See the notebook [`3.text-embeddings.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/3.text-embeddings.ipynb) for more detailed information on this process.

In [None]:
import embedding_generator as eg

# Set your directories and parameters
input_file = '../data/raw/papers_raw.csv'
output_dir = '../data/vectors/'
storage_method = 'csv'  # or 'numpy'
chunk_size = 50

# Call the function to process embeddings
eg.process_embeddings(input_file, output_dir, storage=storage_method, chunk_size=chunk_size)

## 4. Semantic distance 

In this phase, we calculate the cosine similarity between the text embeddings to measure how close each paper is to the other papers. We use the `similarity_calculator` module to compute these similarities efficiently.

The similarities are stored in a file called `papers_cosine.csv` in the directory `data/metrics`.

The file contains one row per paper and three columns: *PaperID*, *max_similarity* and *avg_similarity*.

See the notebook [`4.cosine-distance.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/4.cosine-distance.ipynb) for more detailed information on this process.
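To make the two metrics concrete, here is a minimal pure-Python sketch of the maximum and average cosine similarity of each paper to the papers that precede it. The function names are hypothetical, and comparing against *all* earlier papers is an assumption; `similarity_calculator` may restrict the comparison set (for example to a window of prior years) and works on the stored SPECTER vectors rather than toy lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_to_prior(vectors):
    """vectors: list of (paper_id, vector) in chronological order.

    Returns one (paper_id, max_similarity, avg_similarity) tuple per paper.
    Assumption: each paper is compared with every earlier paper.
    """
    rows = []
    for i, (paper_id, vec) in enumerate(vectors):
        sims = [cosine_similarity(vec, prev) for _, prev in vectors[:i]]
        if sims:
            rows.append((paper_id, max(sims), sum(sims) / len(sims)))
        else:
            # The first paper has nothing to be compared with
            rows.append((paper_id, None, None))
    return rows


toy = [('A', [1.0, 0.0]), ('B', [0.0, 1.0]), ('C', [1.0, 1.0])]
for row in similarity_to_prior(toy):
    print(row)
```

A semantic distance can then be read as one minus the similarity, so a paper with a low maximum similarity is far from everything that came before it.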

In [None]:
import similarity_calculator as sc

# Set your directories
input_dir = '../data/vectors/'
output_dir = '../data/metrics/'

# Get the years of the embeddings
years = sorted([int(f.split('_')[0]) for f in os.listdir(input_dir) if 'check' not in f])

# Set the range of years you want to process
start_year = min(years)
end_year = max(years) + 1

# Call the function to calculate similarities
sc.calculate_similarities(start_year, end_year, input_dir, output_dir)

## 5. New words

In this phase, we identify the new words introduced by each paper. 
For each new word, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the word is counted.
A baseline of words is defined. Words that appear in the baseline are not considered new words.

### Methodology:
- **Establishing a Baseline**: A list of words serves as our baseline. Any word that belongs to the baseline is not classified as a 'new word'.

- **Data Processing and Comparison**: The script starts by importing data from a designated CSV file. Each word from the dataset is compared against the baseline to determine its novelty. The number of papers that use each new word after its introduction is counted.

- **Utilizing the new_ngram Module**: To streamline and optimize the process of identifying and counting new words, we import the `new_ngram` module.

### Prerequisites:
- **Baseline Year**: A specific year must be defined as the terminal year for the baseline. All papers published prior to this year contribute to the baseline. Consequently, words from these papers are not considered as new words.

- **Directory Definitions**:

    - A directory containing the processed papers.
    - A directory storing the publication year of each paper.

### Assumptions:
To ensure the seamless execution of the code, we operate under the following assumptions:

- The dataset of papers is chronologically ordered based on their publication dates, with the most recent papers listed last.
- The publication year of each paper is located in the second column of the second specified file (the file containing the year information).
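The ordering assumption can be checked before running the calculation. The sketch below assumes dates are ISO-formatted strings (`YYYY-MM-DD`), as in the *Date* column downloaded from OpenAlex, so that plain string comparison matches chronological order. The helper name `is_chronological` is hypothetical.

```python
def is_chronological(dates):
    """Return True if ISO-formatted dates are in non-decreasing order."""
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))


print(is_chronological(['1999-01-01', '2000-05-03', '2000-05-03']))  # True
print(is_chronological(['2001-01-01', '2000-12-31']))                # False
```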

### Output:

The findings from this phase are cataloged in a file named `new_words.csv`, located in the `data/metrics` directory. This file is structured with each row representing a new word and three columns detailing the word, the PaperID of the paper that introduced it, and the reuse count, indicating the frequency of its subsequent usage.

See the notebook [`5.new-word.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/5.new-word.ipynb) for more detailed information on this process.
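The logic described above can be sketched in a few lines of plain Python. This is an illustration only, not the implementation in `new_ngram`: the function name and input format are hypothetical, and treating the baseline year as inclusive is an assumption.

```python
def find_new_ngrams(papers, baseline_year):
    """papers: iterable of (paper_id, year, ngrams) in chronological order.

    Returns {ngram: [first_paper_id, reuse_count]}.
    Assumption: papers published up to and including baseline_year form the baseline.
    """
    baseline = set()
    new_ngrams = {}
    for paper_id, year, ngrams in papers:
        for ngram in set(ngrams):  # count each n-gram at most once per paper
            if year <= baseline_year:
                baseline.add(ngram)
            elif ngram in baseline:
                continue  # already known before the baseline cut-off
            elif ngram in new_ngrams:
                new_ngrams[ngram][1] += 1  # reused by a later paper
            else:
                new_ngrams[ngram] = [paper_id, 0]  # first appearance
    return new_ngrams


papers_toy = [
    ('P1', 1999, ['language', 'model']),
    ('P2', 2001, ['language', 'transformer']),
    ('P3', 2002, ['transformer', 'model', 'transformer']),
    ('P4', 2003, ['transformer', 'prompt']),
]
print(find_new_ngrams(papers_toy, 2000))
# {'transformer': ['P2', 2], 'prompt': ['P4', 0]}
```

Because the papers are read in chronological order, a single pass is enough: the first paper in which a non-baseline word appears is by construction the one that introduced it.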

In [None]:
import new_ngram

# define the last year of the baseline and the directory of the dates and the processed file
new_words_calculator = new_ngram.calculate_new_ngrams(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_words.csv')
new_words = new_words_calculator()

new_words

## 6. New phrases

In this phase, we identify the new phrases introduced by each paper. 
For each new phrase, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the phrase is counted.
A baseline of phrases is defined. Phrases that appear in the baseline are not considered new phrases.

The procedure is the same as for new words.

In [None]:
import new_ngram

# define the last year of the baseline and the directory of the dates and the processed file
new_phrases_calculator = new_ngram.calculate_new_ngrams(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_phrases.csv')
new_phrases = new_phrases_calculator()

new_phrases

## 7. New word combinations

In this phase, we identify the new word combinations introduced by each paper. 
For each new word combination, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the word combination is counted.
A baseline of word combinations is defined. Word combinations that appear in the baseline are not considered new word combinations.

The procedure is the same as for new words.

To streamline and optimize the process of identifying and counting new word combinations, we import the `new_ngram_comb` module.

See the notebook [`8.new-word-comb.ipynb`](https://github.com/nicolamelluso/science-novelty/blob/main/notebooks/8.new-word-comb.ipynb) for more detailed information on this process.
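As an illustration of the idea, the sketch below treats a word combination as an unordered pair of distinct words that occur in the same paper. That definition, the function names and the inclusive baseline year are all assumptions; `new_ngram_comb` may define and count combinations differently.

```python
from itertools import combinations


def word_pairs(words):
    """Unordered pairs of distinct words, in a canonical (sorted) form."""
    return set(combinations(sorted(set(words)), 2))


def find_new_pairs(papers, baseline_year):
    """papers: iterable of (paper_id, year, words) in chronological order.

    Returns {pair: [first_paper_id, reuse_count]}.
    """
    baseline = set()
    new_pairs = {}
    for paper_id, year, words in papers:
        for pair in word_pairs(words):
            if year <= baseline_year:
                baseline.add(pair)
            elif pair in baseline:
                continue  # combination already present in the baseline
            elif pair in new_pairs:
                new_pairs[pair][1] += 1  # reused by a later paper
            else:
                new_pairs[pair] = [paper_id, 0]  # first appearance
    return new_pairs


papers_toy = [
    ('P1', 1999, ['language', 'model']),
    ('P2', 2001, ['language', 'model', 'graph']),
    ('P3', 2002, ['graph', 'model']),
]
print(find_new_pairs(papers_toy, 2000))
```

Sorting the words before pairing them makes `('graph', 'model')` and `('model', 'graph')` the same key, so a combination is recognised regardless of word order. Note that the number of pairs grows quadratically with the number of distinct words per paper.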

In [None]:
import new_ngram_comb

# define the last year of the baseline and the directory of the dates and the processed file
new_word_combs_calculator = new_ngram_comb.calculate_new_ngram_combs(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_words.csv')
new_word_combs = new_word_combs_calculator()

new_word_combs

## 8. New phrase combinations

In this phase, we identify the new phrase combinations introduced by each paper. 
For each new phrase combination, the ID of the first paper that uses it is identified, and the number of subsequent papers that reuse the phrase combination is counted.
A baseline of phrase combinations is defined. Phrase combinations that appear in the baseline are not considered new phrase combinations.

The procedure is the same as for new phrases.


In [None]:
import new_ngram_comb

# define the last year of the baseline and the directory of the dates and the processed file
new_phrase_combs_calculator = new_ngram_comb.calculate_new_ngram_combs(2000, '../data/raw/papers_raw.csv', '../data/processed/papers_phrases.csv')
new_phrase_combs = new_phrase_combs_calculator()

new_phrase_combs