In [1]:
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Convert deduplicated text into training data for preprocessing

In [2]:
deduplicated_arxiv = pd.read_csv("./datasets/deduplicated_arxiv.csv")

In [3]:
deduplicated_arxiv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2276449 entries, 0 to 2276448
Data columns (total 5 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   title       object
 1   authors     object
 2   date        object
 3   abstract    object
 4   categories  object
dtypes: object(5)
memory usage: 86.8+ MB


In [4]:
deduplicated_arxiv.head()

Unnamed: 0,title,authors,date,abstract,categories
0,Calculation of prompt diphoton production cros...,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",2008-11-26,A fully differential calculation in perturba...,hep-ph
1,Sparsity-certifying Graph Decompositions,Ileana Streinu and Louis Theran,2008-12-13,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG
2,The evolution of the Earth-Moon system based o...,Hongjun Pan,2008-01-13,The evolution of Earth-Moon system is descri...,physics.gen-ph
3,A determinant of Stirling cycle numbers counts...,David Callan,2007-05-23,We show that a determinant of Stirling cycle...,math.CO
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,Wael Abu-Shammala and Alberto Torchinsky,2013-10-15,In this paper we show how to compute the $\L...,math.CA math.FA


In [6]:
# Convert the space-separated category string into a tuple of categories
deduplicated_arxiv["categories"] = deduplicated_arxiv["categories"].apply(lambda categories: tuple(categories.split()))

# Number of categories assigned to each paper
deduplicated_arxiv["num_categories"] = deduplicated_arxiv["categories"].apply(len)

In [7]:
deduplicated_arxiv.head()

Unnamed: 0,title,authors,date,abstract,categories,num_categories
0,Calculation of prompt diphoton production cros...,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",2008-11-26,A fully differential calculation in perturba...,"(hep-ph,)",1
1,Sparsity-certifying Graph Decompositions,Ileana Streinu and Louis Theran,2008-12-13,"We describe a new algorithm, the $(k,\ell)$-...","(math.CO, cs.CG)",2
2,The evolution of the Earth-Moon system based o...,Hongjun Pan,2008-01-13,The evolution of Earth-Moon system is descri...,"(physics.gen-ph,)",1
3,A determinant of Stirling cycle numbers counts...,David Callan,2007-05-23,We show that a determinant of Stirling cycle...,"(math.CO,)",1
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,Wael Abu-Shammala and Alberto Torchinsky,2013-10-15,In this paper we show how to compute the $\L...,"(math.CA, math.FA)",2


In [8]:
deduplicated_arxiv['categories'].nunique()

77452

In [9]:
categories = deduplicated_arxiv["categories"].tolist()

# Count how often each individual category occurs by flattening the
# 'categories' column; the number of keys is the number of unique categories.
unique_categories = {}
for row in categories:
    for category in row:
        unique_categories[category] = unique_categories.get(category, 0) + 1

print(f"Num. unique categories: {len(unique_categories)}")

Num. unique categories: 176
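
The same tally can also be written with `collections.Counter`, which makes it easy to look up the most frequent labels. A minimal sketch on a few hand-written rows (the `rows` list is illustrative, copied from the `head()` output above, and is not the full dataset):

```python
from collections import Counter
from itertools import chain

# Illustrative rows only; in the notebook this would be the `categories` list.
rows = [
    ("hep-ph",),
    ("math.CO", "cs.CG"),
    ("physics.gen-ph",),
    ("math.CO",),
    ("math.CA", "math.FA"),
]

# Flatten the tuples and count how often each category label occurs.
category_counts = Counter(chain.from_iterable(rows))

print(len(category_counts))            # number of distinct labels
print(category_counts.most_common(1))  # most frequent label with its count
```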


In [10]:
print(deduplicated_arxiv.abstract[0])

  A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
that enhanced sensitivity to the signal can be obtained with judicious
selection of events.



In [11]:
# Only the title, abstract, and categories are used for training, so the title
# and abstract are concatenated into one text field and the other columns are dropped.
# Preview the concatenation on a single row first.
NUM = 1000
text = deduplicated_arxiv.title[NUM] + deduplicated_arxiv.abstract[NUM]
print(deduplicated_arxiv.title[NUM])
print()
print(deduplicated_arxiv.abstract[NUM])
print()
print(text)

Tautological relations in Hodge field theory

  We propose a Hodge field theory construction that captures algebraic
properties of the reduction of Zwiebach invariants to Gromov-Witten invariants.
It generalizes the Barannikov-Kontsevich construction to the case of higher
genera correlators with gravitational descendants.
  We prove the main theorem stating that algebraically defined Hodge field
theory correlators satisfy all tautological relations. From this perspective
the statement that Barannikov-Kontsevich construction provides a solution of
the WDVV equation looks as the simplest particular case of our theorem. Also it
generalizes the particular cases of other low-genera tautological relations
proven in our earlier works; we replace the old technical proofs by a novel
conceptual proof.


Tautological relations in Hodge field theory  We propose a Hodge field theory construction that captures algebraic
properties of the reduction of Zwiebach invariants to Gromov-Witten invariants.


In [12]:
deduplicated_arxiv["text"] = deduplicated_arxiv.title + deduplicated_arxiv.abstract
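
Concatenating with `+` relies on every abstract starting with whitespace, as the ones shown here do, and it keeps the hard line breaks inside the abstract. If a string with single spaces is preferred, one option is sketched below; `join_title_abstract` is a hypothetical helper, not part of the notebook.

```python
def join_title_abstract(title: str, abstract: str) -> str:
    """Join title and abstract with one space and collapse all whitespace runs."""
    return " ".join(f"{title} {abstract}".split())


title = "Tautological relations in Hodge field theory"
abstract = "  We propose a Hodge field theory construction that captures algebraic\nproperties of the reduction"
print(join_title_abstract(title, abstract))
```

The later preprocessing step splits on whitespace anyway, so this only matters if the raw `text` column is inspected or reused elsewhere.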

In [13]:
deduplicated_arxiv.head()

Unnamed: 0,title,authors,date,abstract,categories,num_categories,text
0,Calculation of prompt diphoton production cros...,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",2008-11-26,A fully differential calculation in perturba...,"(hep-ph,)",1,Calculation of prompt diphoton production cros...
1,Sparsity-certifying Graph Decompositions,Ileana Streinu and Louis Theran,2008-12-13,"We describe a new algorithm, the $(k,\ell)$-...","(math.CO, cs.CG)",2,Sparsity-certifying Graph Decompositions We d...
2,The evolution of the Earth-Moon system based o...,Hongjun Pan,2008-01-13,The evolution of Earth-Moon system is descri...,"(physics.gen-ph,)",1,The evolution of the Earth-Moon system based o...
3,A determinant of Stirling cycle numbers counts...,David Callan,2007-05-23,We show that a determinant of Stirling cycle...,"(math.CO,)",1,A determinant of Stirling cycle numbers counts...
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,Wael Abu-Shammala and Alberto Torchinsky,2013-10-15,In this paper we show how to compute the $\L...,"(math.CA, math.FA)",2,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...


In [14]:
deduplicated_arxiv.drop(["title", "abstract", "authors", "date", "num_categories"], axis = 1, inplace = True)

In [15]:
deduplicated_arxiv.head()

Unnamed: 0,categories,text
0,"(hep-ph,)",Calculation of prompt diphoton production cros...
1,"(math.CO, cs.CG)",Sparsity-certifying Graph Decompositions We d...
2,"(physics.gen-ph,)",The evolution of the Earth-Moon system based o...
3,"(math.CO,)",A determinant of Stirling cycle numbers counts...
4,"(math.CA, math.FA)",From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...


In [18]:
deduplicated_arxiv.to_csv("./datasets/training_data_for_preprocessing.csv", index=False)

## Preprocessing

In [19]:
df = pd.read_csv("./datasets/training_data_for_preprocessing.csv")

In [20]:
df.head()

Unnamed: 0,categories,text
0,"('hep-ph',)",Calculation of prompt diphoton production cros...
1,"('math.CO', 'cs.CG')",Sparsity-certifying Graph Decompositions We d...
2,"('physics.gen-ph',)",The evolution of the Earth-Moon system based o...
3,"('math.CO',)",A determinant of Stirling cycle numbers counts...
4,"('math.CA', 'math.FA')",From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...
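
After the CSV round trip, the `categories` column holds strings such as `('math.CO', 'cs.CG')` rather than tuples. If tuples are needed again, one way is `ast.literal_eval`. A minimal sketch, where `parse_categories` is a hypothetical helper name:

```python
import ast


def parse_categories(raw: str) -> tuple:
    """Turn the string form of a tuple, as stored in the CSV, back into a tuple."""
    parsed = ast.literal_eval(raw)
    # Guard against unexpected cell contents.
    if not isinstance(parsed, tuple):
        raise ValueError(f"expected a tuple literal, got {raw!r}")
    return parsed


print(parse_categories("('math.CO', 'cs.CG')"))
print(parse_categories("('hep-ph',)"))
```

In the notebook this could be applied with `df["categories"].apply(parse_categories)`.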


In [21]:
print(df.text[1])

Sparsity-certifying Graph Decompositions  We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use
it obtain a characterization of the family of $(k,\ell)$-sparse graphs and
algorithmic solutions to a family of problems concerning tree decompositions of
graphs. Special instances of sparse graphs appear in rigidity theory and have
received increased attention in recent years. In particular, our colored
pebbles generalize and strengthen the previous results of Lee and Streinu and
give a new proof of the Tutte-Nash-Williams characterization of arboricity. We
also present a new decomposition that certifies sparsity based on the
$(k,\ell)$-pebble game with colors. Our work also exposes connections between
pebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and
Westermann and Hendrickson.



In [24]:
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from tqdm import tqdm

nltk.download("stopwords")

ps = PorterStemmer()
# A set gives constant-time membership tests; the list returned by NLTK is scanned linearly
stop_word_collection = set(stopwords.words('english'))
# Build the punctuation-to-space table once instead of on every call
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def text_preprocess(text, progress_bar):
    # Update the progress bar
    progress_bar.update(1)

    # Replace all punctuation with white space
    text = text.translate(translator)

    # Remove all numbers and words containing numbers
    text = re.sub(r'\w*\d\w*', ' ', text)

    # Convert to lower case
    text = text.lower()

    # Remove all stop words, then stem the remaining words
    words = [ps.stem(word) for word in text.split() if word not in stop_word_collection]
    return ' '.join(words)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nirajan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
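
To see what the first cleaning steps do on their own, the sketch below applies only the punctuation, digit, and lower-casing steps to a short string. It leaves out stop-word removal and stemming so that it runs without NLTK; `strip_punct_and_digits` and `PUNCT_TO_SPACE` are illustrative names, not part of the notebook.

```python
import re
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punct_and_digits(text: str) -> str:
    text = text.translate(PUNCT_TO_SPACE)
    # Drop any token that contains a digit (e.g. "1605", or the "2" left over from "GL_2").
    text = re.sub(r"\w*\d\w*", " ", text)
    # Lower-case and collapse the whitespace left behind by the replacements.
    return " ".join(text.lower().split())


print(strip_punct_and_digits("GL_2 (arXiv:1605.05089) Rings!"))
```

Because punctuation is replaced before the digit filter runs, an identifier such as `GL_2` is first split into `GL` and `2`, and only the `2` is dropped.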


In [27]:
df.text[900_000]

'Degeneration of globally hyperbolic maximal anti-de Sitter structures\n  along rays  Using the parameterisation of the deformation space of GHMC anti-de Sitter\nstructures on $S \\times \\mathbb{R}$ by the cotangent bundle of the\nTeichm\\"uller space of $S$, we study how some geometric quantities, such as the\nLorentzian Hausdorff dimension of the limit set, the width of the convex core\nand the H\\"older exponent, degenerate along rays of quadratic differentials.\n'

In [35]:
progress_bar = tqdm(total=1_000_000)

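# Use .loc to avoid chained assignment; .loc slices include the end label,
# so rows 0 through 999_999 (1,000,000 rows) are preprocessed.
df.loc[:999_999, "text"] = df.loc[:999_999, "text"].apply(text_preprocess, args=(progress_bar,))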
# Close the progress bar
progress_bar.close()

100%|███████████████████████████████████████████████████████████████████████| 1000000/1000000 [43:32<00:00, 382.75it/s]


In [29]:
print(df.text[900_000])

degener global hyperbol maxim anti de sitter structur along ray use parameteris deform space ghmc anti de sitter structur time mathbb r cotang bundl teichm uller space studi geometr quantiti lorentzian hausdorff dimens limit set width convex core h older expon degener along ray quadrat differenti


In [36]:
df.to_csv("./datasets/preprocessed_csv.csv", index = False)

## First 1 million rows

In [38]:
data = pd.read_csv("./datasets/preprocessed_csv.csv")

In [39]:
data.head()

Unnamed: 0,categories,text
0,"('hep-ph',)",calcul prompt diphoton product cross section t...
1,"('math.CO', 'cs.CG')",sparsiti certifi graph decomposit describ new ...
2,"('physics.gen-ph',)",evolut earth moon system base dark matter fiel...
3,"('math.CO',)",determin stirl cycl number count unlabel acycl...
4,"('math.CA', 'math.FA')",dyadic lambda alpha lambda alpha paper show co...


In [40]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2276449 entries, 0 to 2276448
Data columns (total 2 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   categories  object
 1   text        object
dtypes: object(2)
memory usage: 34.7+ MB


In [41]:
data.iloc[999_999].text

'proactiv interv downtrend employ attrit use artifici intellig techniqu predict employ attrit beforehand enabl manag take individu prevent action use ensembl classif model techniqu linear regress model could predict accur employ predict lead time separ individu reason cau attrit prior intim employ attrit enabl manag take prevent action retain employ manag busi consequ attrit deploy model help downtrend employ attrit help manag manag team effect model cover natur calam unforeseen event occur individu level like accid death etc'

In [42]:
data.iloc[1_000_000].text

'Random motion on finite rings, II: Noncommutative rings  We extend our previous study of Markov chains on finite commutative rings\n(arXiv:1605.05089) to arbitrary finite rings with identity. At each step, we\neither add or multiply by a randomly chosen element of the ring, where the\naddition (resp. multiplication) distribution is uniform (resp. conjugacy\ninvariant). We prove explicit formulas for some of the eigenvalues of the\ntransition matrix and give lower bounds on their multiplicities. We also give\nrecursive formulas for the stationary distribution and prove that the mixing\ntime is bounded by an absolute constant. For the matrix rings $M_2(\\mathbb\nF_q),$ we compute the entire spectrum explicitly using the representation\ntheory of $\\text{GL}_2(\\mathbb F_q),$ as well as the stationary probabilities.\n'

In [43]:
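# Note: the end of an iloc slice is exclusive, so this keeps rows 0..999_998
# (999,999 rows) and leaves out row 999_999, which was also preprocessed.
sample_data = data.iloc[0:999_999]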

In [44]:
sample_data.text

0         calcul prompt diphoton product cross section t...
1         sparsiti certifi graph decomposit describ new ...
2         evolut earth moon system base dark matter fiel...
3         determin stirl cycl number count unlabel acycl...
4         dyadic lambda alpha lambda alpha paper show co...
                                ...                        
999994    dynam coupl dilut magnet impur quantum spin li...
999995    recogni cardiac abnorm wearabl devic photoplet...
999996    evolut skyrmion crystal fe co si like quasi tw...
999997    benedick amrein berthier type theorem relat tw...
999998    constraint scalar tensor model gauss bonnet co...
Name: text, Length: 999999, dtype: object

In [45]:
sample_data.to_csv("first 1 million.csv", index = False)