# **arXiv Metadata Analytics with PySpark DataFrames: JSON Case Study**

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)
#### email: amin.karami@ymail.com

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
########## ONLY in Colab ##########
!pip3 install pyspark
########## ONLY in Colab ##########

In [None]:
########## ONLY in Ubuntu Machine ##########
# Load Spark engine
# !pip3 install -q findspark
# import findspark
# findspark.init()
########## ONLY in Ubuntu Machine ##########

In [1]:
# import SparkSession
from pyspark.sql import SparkSession

In [4]:
# Read and Load Data to Spark
spark = SparkSession.builder.master('local[*]').appName('arxiv').getOrCreate()
df = spark.read.json('/content/drive/MyDrive/Colab Notebooks/Udemy - Best Hands-on Big Data Practices with PySpark & Spark Tuning 2022-8/arxiv-metadata-oai-snapshot.json')

In [5]:
# check the partitions
df.rdd.getNumPartitions()

25

## Question 1: Create a new Schema

In [6]:
# Original schema
df.printSchema()

root
 |-- abstract: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- authors_parsed: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- categories: string (nullable = true)
 |-- comments: string (nullable = true)
 |-- doi: string (nullable = true)
 |-- id: string (nullable = true)
 |-- journal-ref: string (nullable = true)
 |-- license: string (nullable = true)
 |-- report-no: string (nullable = true)
 |-- submitter: string (nullable = true)
 |-- title: string (nullable = true)
 |-- update_date: string (nullable = true)
 |-- versions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- version: string (nullable = true)



In [17]:
# Redefine the schema; supplying an explicit schema speeds up loading because Spark can skip schema inference
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField('authors', StringType(), True),
    StructField('categories', StringType(), True),
    StructField('license', StringType(), True),
    StructField('comments', StringType(), True),
    StructField('abstract', StringType(), True),
    StructField('versions', ArrayType(StringType(), True))
])
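
As a rough illustration of what this schema selects, the sketch below uses only the standard library to project one record onto the six chosen fields. The record, `KEPT_FIELDS`, and the helper `project` are hypothetical (made up for illustration, not taken from the dataset). It assumes the snapshot stores one JSON object per line, and it mimics `ArrayType(StringType())` by keeping each nested `versions` entry as a JSON string.

```python
import json

# Hypothetical record shaped like the printed schema (all values are made up)
line = (
    '{"id": "0000.0001", "authors": "A. Author", "categories": "math.CO", '
    '"license": null, "comments": "11 pages", "abstract": "An example.", '
    '"versions": [{"version": "v1", "created": "Mon, 1 Jan 2007"}]}'
)

# The six fields named in the schema above
KEPT_FIELDS = ["authors", "categories", "license", "comments", "abstract", "versions"]

def project(record, fields):
    """Keep only the requested fields; absent fields become None."""
    return {name: record.get(name) for name in fields}

record = json.loads(line)
projected = project(record, KEPT_FIELDS)

# Re-serialize each nested struct as a compact JSON string,
# imitating how the struct elements appear when read as strings
projected["versions"] = [
    json.dumps(version, separators=(",", ":")) for version in projected["versions"]
]

print(projected)
```

Fields outside the schema (here `id`) are simply dropped, which is why the later `show()` output has only six columns.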

## Question 2: Binding Data to a Schema

In [18]:
df = spark.read.json('/content/drive/MyDrive/Colab Notebooks/Udemy - Best Hands-on Big Data Practices with PySpark & Spark Tuning 2022-8/arxiv-metadata-oai-snapshot.json', schema=schema)
df.show()

+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|             authors|       categories|             license|            comments|            abstract|            versions|
+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|C. Bal\'azs, E. L...|           hep-ph|                NULL|37 pages, 15 figu...|  A fully differe...|[{"version":"v1",...|
|Ileana Streinu an...|    math.CO cs.CG|http://arxiv.org/...|To appear in Grap...|  We describe a n...|[{"version":"v1",...|
|         Hongjun Pan|   physics.gen-ph|                NULL| 23 pages, 3 figures|  The evolution o...|[{"version":"v1",...|
|        David Callan|          math.CO|                NULL|            11 pages|  We show that a ...|[{"version":"v1",...|
|Wael Abu-Shammala...|  math.CA math.FA|                NULL|                NULL|  In this paper w...|[{"version":"v1",...|


## Question 3: Missing values for "comments" and "license" attributes

In [19]:
# drop rows that have no value in "comments"
df = df.dropna(subset=['comments'])

# replace missing "license" values with the placeholder "unknown"
df = df.fillna(value="unknown", subset=['license'])

df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|             authors|          categories|             license|            comments|            abstract|            versions|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|C. Bal\'azs, E. L...|              hep-ph|             unknown|37 pages, 15 figu...|  A fully differe...|[{"version":"v1",...|
|Ileana Streinu an...|       math.CO cs.CG|http://arxiv.org/...|To appear in Grap...|  We describe a n...|[{"version":"v1",...|
|         Hongjun Pan|      physics.gen-ph|             unknown| 23 pages, 3 figures|  The evolution o...|[{"version":"v1",...|
|        David Callan|             math.CO|             unknown|            11 pages|  We show that a ...|[{"version":"v1",...|
|Y. H. Pong and C....|   cond-mat.mes-hall|             unknown|6 pages, 4 figure...|  We study the tw..
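
The two cleaning steps can be mirrored in plain Python to make their effect on individual rows explicit. This is only a sketch of the semantics, not how Spark executes them; the rows below are hypothetical and `None` plays the role of `NULL`.

```python
# Hypothetical rows (made up); None stands in for NULL
rows = [
    {"comments": "11 pages", "license": None},
    {"comments": None, "license": None},
    {"comments": "To appear", "license": "http://example.org/license"},
]

# Like dropna(subset=['comments']): discard rows whose comments value is missing
kept = [row for row in rows if row["comments"] is not None]

# Like fillna("unknown", subset=['license']): fill missing license values only
cleaned = [
    {**row, "license": "unknown" if row["license"] is None else row["license"]}
    for row in kept
]

for row in cleaned:
    print(row)
```

The row with no comments disappears entirely, while a row with a missing license is kept and relabeled, which matches the `unknown` entries in the output above.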

## Question 4: Get the names of authors who published a paper in a 'math' category

In [20]:
# Register DF to be used in Spark SQL
df.createOrReplaceTempView('arxiv')

spark.sql("""
    SELECT distinct authors
    FROM arxiv
    WHERE categories LIKE '%math%'
""").show()

+--------------------+
|             authors|
+--------------------+
|H.J. Haubold, A.M...|
|Gabriela Schmithu...|
|      Alexander Fish|
|         Wenhua Zhao|
|F. Cannata, M.V. ...|
|Omer Egecioglu, T...|
|Pedro Daniel Gonz...|
|Arthur Jaffe (1) ...|
|    Fabrizio Zanello|
|Pierpaolo Vivo, M...|
|         Dapeng Zhan|
|Tadashi Ochiai, F...|
|        Jason Lucier|
|Christina Sormani...|
|AM Semikhatov (Le...|
|      Morgan Sherman|
|F. Bourgeois, K. ...|
|Michael J. Gruber...|
|Jacob van den Ber...|
|George Boros and ...|
+--------------------+
only showing top 20 rows
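
One thing worth checking with a substring filter like `LIKE '%math%'` is which category strings it actually accepts. The sketch below compares it with a stricter, token-based test in plain Python. The helper `in_math_archive` is hypothetical, and `math-ph` is added here as an assumed example of a category that contains "math" without starting with `math.`; whether it should count is a judgment call the question does not settle.

```python
def in_math_archive(categories):
    """True if any space-separated category is 'math' or starts with 'math.'."""
    return any(c == "math" or c.startswith("math.") for c in categories.split())

# Category strings as they appear in the `categories` column, plus 'math-ph'
samples = ["math.CO cs.CG", "hep-ph", "math-ph", "physics.gen-ph"]

# Same idea as LIKE '%math%': accept any string containing the substring
substring_hits = [s for s in samples if "math" in s]

# Stricter alternative: look at each category token separately
token_hits = [s for s in samples if in_math_archive(s)]

print("substring:", substring_hits)
print("token:    ", token_hits)
```

Note also that `authors` holds one string per paper, so `distinct authors` deduplicates whole author lists rather than individual names.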



## Question 5: Get licenses with 5 or more letters in the "abstract"

In [21]:
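# Note: `regexp` uses regular-expression syntax, so the leading and trailing
# '%' are matched as literal percent signs; they are not LIKE-style wildcards.
sql_query = """
    select distinct license
    from arxiv
    where abstract regexp '%(([a-zA-Z][^_ /\\<>]{5,}))%'
"""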
spark.sql(sql_query).show()

+--------------------+
|             license|
+--------------------+
|http://arxiv.org/...|
|http://creativeco...|
|http://creativeco...|
|http://creativeco...|
|             unknown|
+--------------------+
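
A small standard-library check shows how `%` behaves inside a regular expression. Python's `re` stands in for Spark's regex engine here (an assumption for illustration; both treat `%` as an ordinary character), and the two patterns are simplified stand-ins for the one in the query, not the query's exact pattern.

```python
import re

# Simplified stand-ins: the same letter run, with and without surrounding '%'
with_percent = re.compile(r"%[a-zA-Z]{5,}%")
without_percent = re.compile(r"[a-zA-Z]{5,}")

# Made-up sample texts
plain = "We describe a new algorithm"
percent_wrapped = "a literal %token% in the text"

results = {
    ("with_percent", "plain"): bool(with_percent.search(plain)),
    ("with_percent", "percent_wrapped"): bool(with_percent.search(percent_wrapped)),
    ("without_percent", "plain"): bool(without_percent.search(plain)),
}

for key, matched in results.items():
    print(key, matched)
```

With the `%` characters in place, only text that literally contains percent signs around the word matches; dropping them would match any abstract containing a long enough run of letters, and the list of licenses returned could then differ from the one shown above.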



## Question 6: Extract statistics on the number of pages for papers with an unknown license

In [28]:
import re
from pyspark.sql.types import IntegerType

def get_pages(line):
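  # treat a missing comment as zero pages
  line = line if line is not None else '0 pages'
  # raw string so that \d is passed to the regex engine unchanged
  result = re.findall(r"\d+ pages", line)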
  if result:
    return int(result[0].split(' ')[0])
  else:
    return 0

spark.udf.register("getPages", get_pages, IntegerType())

sql_query = """
    SELECT
      avg(getPages(comments)) as avg
      , sum(getPages(comments)) as sum
      , min(getPages(comments)) as min
      , max(getPages(comments)) as max
      , std(getPages(comments)) as std
    FROM arxiv
    WHERE license = 'unknown' and getPages(comments)<>0
"""
spark.sql(sql_query).show()

+------------------+-------+---+---+------------------+
|               avg|    sum|min|max|               std|
+------------------+-------+---+---+------------------+
|15.991180538236561|5642584|  1|885|17.168944606277094|
+------------------+-------+---+---+------------------+
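
To see what the UDF and the aggregates do on a handful of values, the following sketch runs the same extraction logic in plain Python and summarizes it with the `statistics` module. The comment strings are hypothetical examples in the style of the `comments` column. `statistics.stdev` is used because Spark's `std` is an alias for the sample standard deviation.

```python
import re
import statistics

def get_pages(line):
    """Return the first 'N pages' count found in a comment, or 0 if none."""
    if line is None:
        return 0
    result = re.findall(r"\d+ pages", line)
    return int(result[0].split(" ")[0]) if result else 0

# Hypothetical comments (made up for illustration)
comments = [
    "37 pages, 15 figures",
    "23 pages, 3 figures",
    "11 pages",
    "To appear",
    None,
    "6 pages, 4 figures",
]

pages = [get_pages(c) for c in comments]

# Mirror the WHERE clause: ignore rows where no page count was found
nonzero = [p for p in pages if p != 0]

summary = {
    "avg": statistics.mean(nonzero),
    "sum": sum(nonzero),
    "min": min(nonzero),
    "max": max(nonzero),
    "std": statistics.stdev(nonzero),  # sample standard deviation
}
print(pages)
print(summary)
```

Because the pattern requires the plural "pages", a comment such as "1 page" yields 0 and is excluded by the `<> 0` filter.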

