
# **arXiv Metadata Analytics with PySpark RDD: JSON Case Study**

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)
#### email: amin.karami@ymail.com

In [None]:
########## ONLY in Colab ##########
!pip3 install pyspark
########## ONLY in Colab ##########

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Using cached pyspark-3.3.2.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Using cached py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=c58e0537bffa02c3d67068ef8c2bf690296c0726703b590ed5bafe811681fc24
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstalling py4j-0.10.9.7:
      Successfully uninstalled py4j-0.10.9.7
Successfully installed py4j-0.10.9.5 pyspark-3.3.2


In [None]:
# Initializing Spark
from pyspark import SparkConf,SparkContext
conf = SparkConf().setAppName("first").setMaster("local[*]")
sc = SparkContext(conf = conf)
print(sc)

<SparkContext master=local[*] appName=first>


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
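# Read and load the data into Spark
# Data source: https://www.kaggle.com/Cornell-University/arxiv/version/62
import json
from pyspark import StorageLevel
# Each line of the file is parsed as one JSON object; 100 is the minimum number of partitions
rdd_json = sc.textFile("/content/drive/MyDrive/Colab Notebooks/arxiv-metadata-oai-snapshot.json", 100)
rdd = rdd_json.map(lambda x: json.loads(x))
# Cache the parsed records (persist() defaults to StorageLevel.MEMORY_ONLY for RDDs)
rdd.persist()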



PythonRDD[2] at RDD at PythonRDD.scala:53
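
The `map` above assumes every line is valid JSON; a single malformed line would make the job fail when an action runs. A minimal sketch of a more tolerant parser follows, run here on a small in-memory list rather than on the RDD. The helper name `parse_json_line` and the sample lines are hypothetical and are not part of the original notebook.

```python
import json

def parse_json_line(line):
    """Return [record] for a valid JSON line, or [] if the line cannot be parsed."""
    try:
        return [json.loads(line)]
    except json.JSONDecodeError:
        return []

# Hypothetical sample lines standing in for the text file.
sample_lines = [
    '{"id": "0704.0001", "license": null}',
    'not json',
    '{"id": "0704.0002"}',
]

# Locally, flatMap corresponds to concatenating the per-line lists.
records = [rec for line in sample_lines for rec in parse_json_line(line)]
print(len(records), "of", len(sample_lines), "lines parsed")
```

Because the helper returns a list, it could be passed to `flatMap` instead of `map`, for example `rdd_json.flatMap(parse_json_line)`, so that unparseable lines are dropped instead of raising.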

In [None]:
# Check the number of partitions and the default parallelism settings:
print(rdd.getNumPartitions())
print(sc.defaultParallelism)
print(sc.defaultMinPartitions)


100
2
2


## Question 1: Count elements

In [None]:
rdd.count()

2011231

## Question 2: Get the first two records


In [8]:
print(rdd.take(2))

[{'id': '0704.0001', 'submitter': 'Pavel Nadolsky', 'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan", 'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies', 'comments': '37 pages, 15 figures; published version', 'journal-ref': 'Phys.Rev.D76:013009,2007', 'doi': '10.1103/PhysRevD.76.013009', 'report-no': 'ANL-HEP-PR-07-12', 'categories': 'hep-ph', 'license': None, 'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermila

## Question 3: Get all attributes


In [9]:
rdd.flatMap(lambda x: x.keys()).distinct().collect()

['authors',
 'comments',
 'title',
 'id',
 'journal-ref',
 'versions',
 'submitter',
 'categories',
 'update_date',
 'authors_parsed',
 'report-no',
 'license',
 'abstract',
 'doi']

## Question 4: Get the name of the licenses

In [11]:
rdd.map(lambda x: x["license"]).distinct().collect()

[None,
 'http://creativecommons.org/licenses/publicdomain/',
 'http://creativecommons.org/licenses/by-nc-nd/4.0/',
 'http://creativecommons.org/licenses/by-nc-sa/4.0/',
 'http://creativecommons.org/licenses/by-nc-sa/3.0/',
 'http://creativecommons.org/licenses/by/3.0/',
 'http://creativecommons.org/licenses/by/4.0/',
 'http://creativecommons.org/publicdomain/zero/1.0/',
 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
 'http://creativecommons.org/licenses/by-sa/4.0/']

## Question 5: Get the shortest and the longest titles

In [13]:
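# Note: "<" and ">" on strings compare lexicographically, not by length, so these
# reductions return the alphabetically first and last titles. Comparing len(x) with
# len(y) instead would rank the titles by length.
shortest_title = rdd.map(lambda x: x["title"]).reduce(lambda x, y: x if x < y else y)
longest_title = rdd.map(lambda x: x["title"]).reduce(lambda x, y: x if x > y else y)
print("shortest title :", shortest_title)
print("longest title :", longest_title)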

shortest title : !-Graphs with Trivial Overlap are Context-Free
longest title : Weyl formula for the negative dissipative eigenvalues of Maxwell's
  equations
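
The two titles printed above are the first and last in lexicographic order, because `<` and `>` compare strings character by character. A minimal sketch of a length-based comparison is shown below on a small local list with `functools.reduce`. The sample titles and the helper names `shorter` and `longer` are hypothetical.

```python
from functools import reduce

# Hypothetical titles, not taken from the dataset.
titles = [
    "Graph theory",
    "On sets",
    "A very long title about quantum chromodynamics",
]

def shorter(a, b):
    """Return the title with fewer characters (a tie keeps the first argument)."""
    return a if len(a) <= len(b) else b

def longer(a, b):
    """Return the title with more characters (a tie keeps the first argument)."""
    return a if len(a) >= len(b) else b

shortest_title = reduce(shorter, titles)
longest_title = reduce(longer, titles)
print("shortest title :", shortest_title)
print("longest title :", longest_title)
```

The same two functions could be passed to `rdd.map(lambda x: x["title"]).reduce(...)`. Titles in this dataset can contain embedded line breaks and indentation, so normalizing whitespace first, for example with `" ".join(title.split())`, may give more meaningful lengths.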


## Question 6: Find abbreviations with 5 or more letters in the abstract

In [14]:
import re

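def get_abbreviations(line):
    # Match a parenthesized token: one letter followed by at least five more characters
    result = re.search(r"\(([A-Za-z][^_ /\\<>]{5,})\)", line)
    if result:
        return result.group(1)  # text inside the parentheses; group(0) is the whole match, parentheses included
    return None

# Count the abstracts that contain at least one such token
rdd.filter(lambda x: get_abbreviations(x['abstract'])).count()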

192721
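
To make the behaviour of the pattern concrete, here is a small sketch that applies the same regular expression to hypothetical sentences (the sample text is an assumption, not taken from the dataset). It shows the difference between `group(0)` and `group(1)`, and that the pattern as written needs six or more characters between the parentheses.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
ABBREVIATION = re.compile(r"\(([A-Za-z][^_ /\\<>]{5,})\)")

# Hypothetical sample sentence.
text = "Next-to-leading order (NLOQCD) results and (QCD) fits"

match = ABBREVIATION.search(text)
whole_match = match.group(0)      # includes the surrounding parentheses
inner_text = match.group(1)       # only the captured token
all_tokens = ABBREVIATION.findall(text)

print(whole_match, inner_text, all_tokens)

# A five-character token does not match: one letter plus {5,} needs six characters.
five_chars = ABBREVIATION.search("a token (ABCDE) here")
print(five_chars)
```

If tokens of exactly five characters should also count, the quantifier would have to be `{4,}`; that change would alter the count reported above.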

## Question 7: Get the number of archive records per month ('update_date' attribute)

In [15]:
import datetime

def extract_date(DateIn):
    d = datetime.datetime.strptime(DateIn, "%Y-%m-%d")
    return d.month

# check the function:
extract_date('2008-12-13')

rdd.map(lambda x: (extract_date(x["update_date"]),1)).reduceByKey(lambda x,y: x+y).collect()


[(1, 134247),
 (2, 116948),
 (3, 126458),
 (4, 117126),
 (5, 296587),
 (6, 191746),
 (7, 122649),
 (8, 138469),
 (9, 138978),
 (10, 197755),
 (11, 297963),
 (12, 132305)]
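
The counts above group records by month number only, so every year contributes to the same twelve keys. If a breakdown per calendar month is wanted, the key can be a `(year, month)` tuple. The sketch below illustrates this on a small local list with `collections.Counter`; the helper name `extract_year_month` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime

def extract_year_month(date_in):
    """Parse a 'YYYY-MM-DD' string and return a (year, month) tuple."""
    d = datetime.strptime(date_in, "%Y-%m-%d")
    return (d.year, d.month)

# Hypothetical dates in the same format as 'update_date'.
sample_dates = ["2008-12-13", "2008-12-01", "2009-05-20"]

counts = Counter(extract_year_month(d) for d in sample_dates)
per_month = sorted(counts.items())
print(per_month)
```

On the RDD, the same key function could be used as `rdd.map(lambda x: (extract_year_month(x["update_date"]), 1)).reduceByKey(lambda a, b: a + b).sortByKey().collect()`, assuming every record has a well-formed `update_date`.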

## Question 8: Get the average number of pages

In [17]:

def get_Page(line):
    search = re.findall(r'\d+ pages', line)
    if search:
        return int(search[0].split(" ")[0])
    else:
        return 0

In [18]:
rdd_average = rdd.map(lambda x: get_Page(x['comments'] if x['comments'] is not None else "None"))

# Drop records where no page count was found (get_Page returned 0):
rdd_average = rdd_average.filter(lambda x: x != 0)

average_counter = rdd_average.count()
average_summation = rdd_average.reduce(lambda x, y: x + y)

print(average_counter)
print(average_summation)
print("the average of pages is ", average_summation/average_counter)

1184075
21139516
the average of pages is  17.85319004286046
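
The cell above runs two actions (`count` and `reduce`) over the same RDD. A possible alternative is to carry a `(sum, count)` pair through a single pass. The sketch below shows the two combining functions on a small local list; the names `seq_op` and `comb_op` and the sample page counts are hypothetical.

```python
from functools import reduce

def seq_op(acc, pages):
    """Fold one page count into a (total, count) accumulator."""
    total, count = acc
    return (total + pages, count + 1)

def comb_op(a, b):
    """Merge two (total, count) accumulators, e.g. from two partitions."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical page counts, as get_Page would return them.
page_counts = [37, 15, 8]

total, count = reduce(seq_op, page_counts, (0, 0))
average = total / count
print(count, total, average)
```

With Spark, the same functions could be passed as `rdd_average.aggregate((0, 0), seq_op, comb_op)`; `rdd_average.mean()` is another option when only the average is needed.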
