# Data Preprocessing for the arXiv Abstracts Dataset

## Imports

In [1]:
import random
import pandas as pd

from data_preprocessing_util import clean_text

[nltk_data] Downloading package stopwords to /home/vscode/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data Loading

### Load the raw data from a text file

In [2]:
%%time

with open('../Datasets/Arvix Abstracts/arxiv-abstracts-all.txt', 'r', encoding='utf-8') as data_file:
    all_data = data_file.readlines()

print(f"Total number of abstracts: {len(all_data)}\n")

Total number of abstracts: 1578655

CPU times: user 3.98 s, sys: 3.3 s, total: 7.29 s
Wall time: 43.3 s
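Note that `readlines()` keeps the trailing newline on every line, and `open()` without an `encoding` argument falls back to the platform default. A minimal sketch of a loader that handles both, assuming the file holds one abstract per line (as the count above suggests); the helper name `load_abstracts` is hypothetical and not part of the notebook:

```python
from pathlib import Path
from typing import List


def load_abstracts(path: str, encoding: str = "utf-8") -> List[str]:
    """Read one abstract per line and drop the trailing newline of each line.

    Hypothetical helper: assumes a UTF-8 text file with one abstract per line.
    """
    abstracts = []
    with Path(path).open("r", encoding=encoding) as data_file:
        # Iterating over the file object reads it line by line.
        for line in data_file:
            abstracts.append(line.rstrip("\n"))
    return abstracts
```

Because no lines are skipped, the returned list has the same length and ordering as the result of `readlines()`, so indices sampled later still refer to the same abstracts.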


### Sample a few abstracts for inspection

In [3]:
sample_ids = [random.randint(0, len(all_data) - 1) for _ in range(3)]

print("Sample raw data:\n")

for i, sample_id in enumerate(sample_ids):
    print(f"#{i + 1}\n{all_data[sample_id]}\n")

Sample raw data:

#1
"This report is a review of Darwin's classical theory of bodily tides in which we present the analytical expressions for the orbital and rotational evolution of the bodies and for the energy dissipation rates due to their tidal interaction. General formulas are given which do not depend on any assumption linking the tidal lags to the frequencies of the corresponding tidal waves (except that equal frequency harmonics are assumed to span equal lags). Emphasis is given to the cases of companions having reached one of the two possible final states: (1) the super-synchronous stationary rotation resulting from the vanishing of the average tidal torque; (2) the capture into a 1:1 spin-orbit resonance (true synchronization). In these cases, the energy dissipation is controlled by the tidal harmonic with period equal to the orbital period (instead of the semi-diurnal tide) and the singularity due to the vanishing of the geometric phase lag does not exist. It is also shown t
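The sampling cell uses `random.randint` without a seed, so the selected abstracts change on every run and the same index can in principle be drawn twice. If reproducible, distinct samples are wanted, one option is sketched below; the function name `sample_indices` and the seed value are illustrative assumptions, not part of the notebook:

```python
import random
from typing import List


def sample_indices(n_items: int, k: int = 3, seed: int = 42) -> List[int]:
    """Return k distinct indices in [0, n_items), reproducible for a given seed."""
    # A dedicated Random instance avoids changing the global random state.
    rng = random.Random(seed)
    # sample() draws without replacement, so no index is repeated.
    return rng.sample(range(n_items), k)
```

In the notebook this could replace the list comprehension, for example `sample_ids = sample_indices(len(all_data))`.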

## Data Cleaning

In [4]:
%%time

all_data = list(map(clean_text, all_data))

print("Processed data:\n")

for i, sample_id in enumerate(sample_ids):
    print(f"#{i + 1}\n{all_data[sample_id]}\n")

Processed data:

#1
report review darwins classical theory bodily tides present analytical expressions orbital rotational evolution bodies energy dissipation rates due tidal interaction general formulas given depend assumption linking tidal lags frequencies corresponding tidal waves except equal frequency harmonics assumed span equal lags emphasis given cases companions reached one two possible final states supersynchronous stationary rotation resulting vanishing average tidal torque capture spinorbit resonance true synchronization cases energy dissipation controlled tidal harmonic period equal orbital period instead semidiurnal tide singularity due vanishing geometric phase lag exist also shown true synchronization nonzero eccentricity possible extra torque exists opposite tidal torque theory developed assuming additional torque produced equatorial permanent asymmetry companion results modeldependent theory developed second degree eccentricity inclination obliquity easily extended hig
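The implementation of `clean_text` lives in `data_preprocessing_util` and is not shown in this notebook. Judging from the output above, it appears to lowercase the text, strip punctuation and digits (so `super-synchronous` becomes `supersynchronous` and `1:1` disappears), and remove English stopwords. A minimal sketch under those assumptions follows; the name `clean_text_sketch` and the small stopword set are illustrative only, and the NLTK log in the Imports section suggests the real helper uses NLTK's stopword list instead:

```python
import re

# Tiny illustrative stopword set (an assumption); the real helper most likely
# relies on NLTK's much larger English stopword list.
STOPWORDS = {
    "a", "an", "the", "of", "in", "into", "is", "are", "and", "for", "to",
    "which", "we", "this", "do", "does", "not", "on", "any", "that", "by",
    "with", "it", "be",
}

# Matches every character that is neither a lowercase letter nor whitespace.
NON_LETTER = re.compile(r"[^a-z\s]")


def clean_text_sketch(text: str) -> str:
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    text = text.lower()
    # Removing (not replacing with a space) joins hyphenated words,
    # which matches the 'spinorbit' style seen in the processed output.
    text = NON_LETTER.sub("", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)
```

For example, `clean_text_sketch("This report is a review of Darwin's classical theory of bodily tides")` returns `report review darwins classical theory bodily tides`, in line with the start of the processed sample.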

## Saving cleaned data as CSV

In [5]:
data_df = pd.DataFrame(all_data, columns=["abstract"])
data_df.head()

|   | abstract |
|---|----------|
| 0 | natural basic principles allow extend feynman ... |
| 1 | study effects adding loops critical percolatio... |
| 2 | propose large quantum fluctuations conformal f... |
| 3 | offshell behaviors bound nucleons deep inelast... |
| 4 | following recent work gross consider partition... |


In [6]:
data_df.to_csv("../Processed Data/arvix_abstracts_cleaned.csv", index=False)
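After writing the file, it can be worth reading it back to confirm that nothing was lost. A minimal sketch of such a round-trip check, assuming a single `abstract` column as above; the helper name `save_and_verify` is hypothetical. It passes `keep_default_na=False` because an abstract that becomes empty after cleaning may otherwise be read back as a missing value rather than an empty string:

```python
import pandas as pd


def save_and_verify(abstracts, csv_path):
    """Write abstracts to CSV, reload them, and check that the row count matches."""
    df = pd.DataFrame(abstracts, columns=["abstract"])
    df.to_csv(csv_path, index=False)

    # dtype=str and keep_default_na=False keep every cell as a plain string.
    reloaded = pd.read_csv(csv_path, dtype=str, keep_default_na=False)
    if len(reloaded) != len(df):
        raise ValueError(
            f"Row count changed on reload: wrote {len(df)}, read {len(reloaded)}"
        )
    return reloaded
```

With the notebook's data this would be called as `save_and_verify(all_data, "../Processed Data/arvix_abstracts_cleaned.csv")`.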