# Data Preprocessing for the arXiv Abstracts Dataset

## Imports

In [1]:
import random
import pandas as pd

from data_preprocessing_util import clean_text

[nltk_data] Downloading package stopwords to /home/vscode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data Loading

### Load the raw data from a text file

In [2]:
%%time

# Read the whole file into memory; each line holds one abstract.
with open('../Datasets/Arvix Abstracts/arxiv-abstracts-all.txt', 'r') as data_file:
    all_data = data_file.readlines()

print(f"Total number of abstracts: {len(all_data)}\n")

Total number of abstracts: 1578655

CPU times: user 4.02 s, sys: 3.62 s, total: 7.64 s
Wall time: 45.2 s
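The loading cell keeps each line's trailing newline and relies on the platform's default text encoding. A minimal sketch of a loader that makes both choices explicit, assuming the file is UTF-8 text with one abstract per line; `load_abstracts` is a hypothetical helper, not part of the original notebook:

```python
def load_abstracts(path, encoding="utf-8"):
    """Return one string per line of `path`, without trailing newlines.

    Assumption: the file is UTF-8 text with one abstract per line.
    """
    with open(path, "r", encoding=encoding) as data_file:
        # rstrip("\n") drops only the line terminator and keeps other whitespace.
        return [line.rstrip("\n") for line in data_file]
```

Iterating over the file object instead of calling `readlines()` avoids holding both the raw and the stripped copy of every line at the same time.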


### Sample some values for visualization

In [3]:
# Pick three random indices to inspect (randint may return the same index twice).
sample_ids = [random.randint(0, len(all_data) - 1) for _ in range(3)]

print("Sample raw data:\n")

for i, sample_id in enumerate(sample_ids):
    print(f"#{i + 1}\n{all_data[sample_id]}\n")

Sample raw data:

#1
"With the emergence of the Internet-of-Things (IoT), there is a growing need for access control and data protection on low-power, pervasive devices. Biometric-based authentication is promising for IoT due to its convenient nature and lower susceptibility to attacks. However, the costs associated with biometric processing and template protection are nontrivial for smart cards, key fobs, and so forth. In this paper, we discuss the security, cost, and utility of biometric systems and develop two major frameworks for improving them. First, we introduce a new framework for implementing biometric systems based on physical unclonable functions (PUFs) and hardware obfuscation that, unlike traditional software approaches, does not require nonvolatile storage of a biometric template/key. Aside from reducing the risk of compromising the biometric, the nature of obfuscation also provides protection against access control circumvention via malware and fault injection. The PUF p

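`random.randint` can return the same index more than once, and the chosen samples change on every run. A minimal sketch of a reproducible alternative, assuming a fixed seed is acceptable for visual inspection; `pick_sample_ids` and the seed value are illustrative and not taken from the original notebook:

```python
import random


def pick_sample_ids(n_items, k=3, seed=42):
    """Return k distinct indices from range(n_items), reproducible for a given seed."""
    # A local generator leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(range(n_items), k)


# 1578655 is the dataset size reported by the loading cell.
sample_ids = pick_sample_ids(1_578_655)
```

Using the same `sample_ids` before and after cleaning, as the notebook already does, keeps the raw and processed samples comparable.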
## Data Cleaning

In [4]:
%%time

# Apply clean_text to every abstract, replacing the raw strings in place.
all_data = list(map(clean_text, all_data))

print("Processed data:\n")

for i, sample_id in enumerate(sample_ids):
    print(f"#{i + 1}\n{all_data[sample_id]}\n")

Processed data:

#1
emergence internetofthings iot growing need access control data protection lowpower pervasive devices biometricbased authentication promising iot due convenient nature lower susceptibility attacks however costs associated biometric processing template protection nontrivial smart cards key fobs forth paper discuss security cost utility biometric systems develop two major frameworks improving first introduce new framework implementing biometric systems based physical unclonable functions pufs hardware obfuscation unlike traditional software approaches require nonvolatile storage biometric templatekey aside reducing risk compromising biometric nature obfuscation also provides protection access control circumvention via malware fault injection puf provides noninvertibility nonlinkability second major requirement proposed pufobfuscation approach reliable robust key generated users input biometric propose noiseaware biometric quantization framework capable generating uniq
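`clean_text` comes from `data_preprocessing_util`, which is not shown here. Judging only from the output above, it appears to lowercase the text, delete punctuation without inserting spaces (so `Internet-of-Things` becomes `internetofthings`), and drop English stopwords. A minimal sketch of such a function under those assumptions; `clean_text_sketch` and the tiny `STOPWORDS` set are hypothetical stand-ins, and the real helper presumably relies on the NLTK stopword list downloaded in the imports cell:

```python
import re

# Hypothetical, deliberately tiny stopword set used only for this illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "with", "there", "to", "in"}


def clean_text_sketch(text):
    """Lowercase, strip non-alphanumeric characters, and remove stopwords."""
    text = text.lower()
    # Delete anything that is not an ASCII letter, digit, or whitespace.
    # No space is inserted, so hyphenated words are fused together.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # split() also collapses repeated whitespace and drops the trailing newline.
    return " ".join(word for word in text.split() if word not in STOPWORDS)


raw = "With the emergence of the Internet-of-Things (IoT), there is a growing need\n"
print(clean_text_sketch(raw))  # emergence internetofthings iot growing need
```

Fusing hyphenated terms keeps the vocabulary compact but merges tokens such as `lowpower`; replacing punctuation with a space instead would be an equally reasonable design choice.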

## Saving Cleaned Data as CSV

In [5]:
data_df = pd.DataFrame(all_data, columns=["abstract"])
data_df.head()

|   | abstract |
|---|----------|
| 0 | natural basic principles allow extend feynman ... |
| 1 | study effects adding loops critical percolatio... |
| 2 | propose large quantum fluctuations conformal f... |
| 3 | offshell behaviors bound nucleons deep inelast... |
| 4 | following recent work gross consider partition... |


In [6]:
data_df.to_csv("../Processed Data/arvix_abstracts_cleaned.csv", index=False)
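
One way to confirm the export is to read the file back and compare it with the in-memory frame. A minimal sketch on a small stand-in frame written to a temporary directory; the helper name `roundtrip_ok` and the demo data are hypothetical:

```python
import os
import tempfile

import pandas as pd


def roundtrip_ok(df, path):
    """Write df to CSV without the index, read it back, and compare contents."""
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)
    return restored.equals(df)


demo_df = pd.DataFrame(
    ["first cleaned abstract", "second cleaned abstract"],
    columns=["abstract"],
)
with tempfile.TemporaryDirectory() as tmp_dir:
    print(roundtrip_ok(demo_df, os.path.join(tmp_dir, "demo.csv")))
```

Writing with `index=False`, as the notebook does, is what keeps an extra unnamed index column out of the file when it is loaded again.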