# Preprocessing

## Overview 

This notebook processes a dataset of papers: it extracts the words and noun phrases from the title and abstract of each paper and saves them. It reads the raw paper data from a CSV file, processes each paper's title and abstract, and writes the results to two separate CSV files, one for words and one for noun phrases. Processing is streamed: papers are read and processed line by line rather than loaded all at once.

## Workflow
- **Setting Up the Environment**: The script begins by importing the necessary libraries and modules. It also raises the `csv` module’s maximum field size to avoid errors when reading very long fields from the CSV file.

- **Counting the Number of Papers**: It calculates the total number of papers to be processed by counting the lines in the raw data CSV file. The count gives the progress bar (`tqdm`) a total, so it can estimate the time remaining to process the text.

- **Preparing Output Files**: The notebook then prepares two separate CSV files, one to store the processed words and one to store the noun phrases. It writes the header row to each file before any data is written.

- **Processing Each Paper**: The script reads the raw data CSV file line by line, skipping the header. For each paper’s title and abstract, it performs the following steps:
    - Extracts and processes the text to obtain words and noun phrases.
    - Writes the processed data into the respective CSV files, associating each set of processed data with the paper’s ID.

> **Note**: The processing is based on the `preprocessing` module imported from `../science_novelty/`.
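
The workflow above can be sketched end to end on in-memory data. This is a minimal illustration, not the notebook's actual code: `process_text` here is a hypothetical stand-in for `preprocessing.process_text` (its real output format is not shown in this notebook), the raw header names are assumed, and `io.StringIO` objects replace the files on disk.

```python
import csv
import io


def process_text(text, mode):
    """Hypothetical stand-in for preprocessing.process_text (NOT the real logic)."""
    tokens = text.lower().split()
    if mode == 'words':
        return ' '.join(tokens)
    # 'phrases': naively join adjacent tokens, only so the two outputs differ
    return ' '.join('_'.join(pair) for pair in zip(tokens, tokens[1:]))


def stream_papers(raw, words_out, phrases_out):
    """Read tab-separated papers one at a time and write one row per paper."""
    reader = csv.reader(raw, delimiter='\t', quotechar='"')
    next(reader)  # skip the header row

    words_out.write('PaperID,Words_Title,Words_Abstract\n')
    phrases_out.write('PaperID,Phrases_Title,Phrases_Abstract\n')

    count = 0
    for paper_id, _date, title, abstract in reader:
        title_words = process_text(title, 'words')
        abstract_words = process_text(abstract, 'words')
        title_phrases = process_text(title, 'phrases')
        abstract_phrases = process_text(abstract, 'phrases')

        words_out.write(f'{paper_id},{title_words},{abstract_words}\n')
        phrases_out.write(f'{paper_id},{title_phrases},{abstract_phrases}\n')
        count += 1
    return count


# Assumed raw layout: ID, date, title, abstract (header names are illustrative)
raw = io.StringIO(
    'PaperID\tDate\tTitle\tAbstract\n'
    '1\t2020-01-01\tDeep Learning\tNeural networks learn features\n'
    '2\t2021-06-30\tGraph Theory\tGraphs model pairwise relations\n'
)
words_out, phrases_out = io.StringIO(), io.StringIO()

n_papers = stream_papers(raw, words_out, phrases_out)
words_lines = words_out.getvalue().splitlines()
phrases_lines = phrases_out.getvalue().splitlines()
print(n_papers, words_lines[1])
```

Because each paper is written as soon as it is processed, memory use stays flat regardless of how many rows the raw file contains.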

## Output
The notebook generates two CSV files as output:

- A CSV file containing the words extracted from the titles and abstracts of each paper, associated with the paper’s ID.
- A CSV file containing the noun phrases extracted from the titles and abstracts of each paper, associated with the paper’s ID.

Each row in these files corresponds to a paper from the raw data file and has three comma-separated columns: the paper’s ID, the processed data extracted from the title, and the processed data extracted from the abstract. This format makes the files easy to read and analyze in subsequent steps of the data analysis.
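
As an illustration of how a later step might consume these files, the sketch below reads a words file with `csv.DictReader`. The column names come from the header this notebook writes. The sample rows are invented, and the sketch assumes each processed field is a space-separated string of tokens with no embedded commas, which depends on what `preprocessing.process_text` actually returns.

```python
import csv
import io

# Invented sample standing in for ../data/processed/papers_words.csv
sample = io.StringIO(
    'PaperID,Words_Title,Words_Abstract\n'
    '1,deep learning,neural networks learn features\n'
    '2,graph theory,graphs model pairwise relations\n'
)


def load_tokens(handle):
    """Map each paper ID to its title and abstract tokens."""
    tokens_by_paper = {}
    for row in csv.DictReader(handle):
        tokens_by_paper[row['PaperID']] = {
            # Assumption: tokens are separated by whitespace
            'title': row['Words_Title'].split(),
            'abstract': row['Words_Abstract'].split(),
        }
    return tokens_by_paper


tokens_by_paper = load_tokens(sample)
print(tokens_by_paper['1']['title'])
```

If the processed fields can contain commas, the files would need quoting (for example, writing them with `csv.writer`) before they could be parsed this way.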

In [1]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(1, '../science_novelty/')

import preprocessing
from tqdm.notebook import tqdm
import csv

# Raise the csv field size limit; otherwise very long fields raise an error
maxInt = sys.maxsize

while True:
    # Decrease maxInt by a factor of 10 for as long as
    # csv.field_size_limit raises an OverflowError.

    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = maxInt // 10

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\u0152835\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
print('Get the number of papers to process...')
with open('../data/raw/papers_raw.csv', 'r', encoding = 'utf-8') as file:
    line_count = sum(1 for line in file)

# Subtract 1 for the header row
total_papers = line_count - 1

print('Preparing for writing...')
# Open both output files in a with-block so they are flushed and closed
# even if processing is interrupted
with open('../data/processed/papers_words.csv', mode = 'w', encoding = 'utf-8') as words_writer, \
     open('../data/processed/papers_phrases.csv', mode = 'w', encoding = 'utf-8') as phrases_writer:

    # Write the header row of each output file
    words_writer.write('PaperID,Words_Title,Words_Abstract\n')
    phrases_writer.write('PaperID,Phrases_Title,Phrases_Abstract\n')

    print('Processing...')
    with open('../data/raw/papers_raw.csv', mode = 'r', encoding = 'utf-8') as reader:
        # The raw file is tab-separated despite its .csv extension
        csv_reader = csv.reader(reader, delimiter='\t', quotechar='"')

        # Skip header
        next(csv_reader)

        for line in tqdm(csv_reader, total = total_papers):

            paperID, date, title, abstract = line

            # Extract words and noun phrases from the title and the abstract
            title_words = preprocessing.process_text(title, 'words')
            abstract_words = preprocessing.process_text(abstract, 'words')

            title_phrases = preprocessing.process_text(title, 'phrases')
            abstract_phrases = preprocessing.process_text(abstract, 'phrases')

            # Write one row per paper to each output file
            words_writer.write(f'{paperID},{title_words},{abstract_words}\n')
            phrases_writer.write(f'{paperID},{title_phrases},{abstract_phrases}\n')
        

Get the number of papers to process...
Preparing for writing...
Processing...


  0%|          | 0/90215 [00:00<?, ?it/s]

KeyboardInterrupt: 