<a href="https://colab.research.google.com/github/kasikotnani23/Kasi-k/blob/main/arxiv_metadata_Analysis_(JSON_DF).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Arxiv metadata Analytics with PySpark DF: JSON case study**

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)
#### email: amin.karami@ymail.com

In [1]:
########## ONLY in Colab ##########
!pip3 install pyspark
########## ONLY in Colab ##########

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.2.tar.gz (281.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m281.4/281.4 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.7/199.7 KB[0m [31m14.3 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.3.2-py2.py3-none-any.whl size=281824028 sha256=537f9bb88bcac2d9bbfe8ed561f1ee4cf75d52d7582b9db51f9d657edb1c0f38
  Stored in directory: /root/.cache/pip/wheels/6c/e3/9b/0525ce8a69478916513509d43693511463c6468db0de237c86
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [2]:
# import SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark


In [49]:
# Read and Load Data to Spark
df = spark.read.format('json').load("/content/drive/MyDrive/Colab Notebooks/arxiv-metadata-oai-snapshot.json")
df.printSchema()

root
 |-- abstract: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- authors_parsed: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- categories: string (nullable = true)
 |-- comments: string (nullable = true)
 |-- doi: string (nullable = true)
 |-- id: string (nullable = true)
 |-- journal-ref: string (nullable = true)
 |-- license: string (nullable = true)
 |-- report-no: string (nullable = true)
 |-- submitter: string (nullable = true)
 |-- title: string (nullable = true)
 |-- update_date: string (nullable = true)
 |-- versions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- version: string (nullable = true)



In [36]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [50]:
# check the partitions
print(df.rdd.getNumPartitions())
df.show(5)

25
+--------------------+--------------------+--------------------+---------------+--------------------+--------------------+---------+--------------------+--------------------+----------------+------------------+--------------------+-----------+--------------------+
|            abstract|             authors|      authors_parsed|     categories|            comments|                 doi|       id|         journal-ref|             license|       report-no|         submitter|               title|update_date|            versions|
+--------------------+--------------------+--------------------+---------------+--------------------+--------------------+---------+--------------------+--------------------+----------------+------------------+--------------------+-----------+--------------------+
|  A fully differe...|C. Bal\'azs, E. L...|[[Balázs, C., ], ...|         hep-ph|37 pages, 15 figu...|10.1103/PhysRevD....|0704.0001|Phys.Rev.D76:0130...|                null|ANL-HEP-PR-07-12|    Pavel N

## Question 1: Create a new Schema

In [51]:
from pyspark.sql.types import *

Schema = StructType([
                     StructField('abstract',StringType(),True),
                     StructField('authors',StringType(),True),
                     StructField('categories',StringType(),True),
                     StructField('comments',StringType(),True),
                     StructField('id',StringType(),True), 
                     StructField('versions',ArrayType(StringType()),True),
                     StructField('license',StringType(),True)
    ])

print(Schema)

StructType([StructField('abstract', StringType(), True), StructField('authors', StringType(), True), StructField('categories', StringType(), True), StructField('comments', StringType(), True), StructField('id', StringType(), True), StructField('versions', ArrayType(StringType(), True), True), StructField('license', StringType(), True)])


## Question 2: Binding Data to a Schema

In [56]:
df = spark.read.json("/content/drive/MyDrive/Colab Notebooks/arxiv-metadata-oai-snapshot.json",schema = Schema)
df.show(5)

+--------------------+--------------------+---------------+--------------------+---------+--------------------+--------------------+
|            abstract|             authors|     categories|            comments|       id|            versions|             license|
+--------------------+--------------------+---------------+--------------------+---------+--------------------+--------------------+
|  A fully differe...|C. Bal\'azs, E. L...|         hep-ph|37 pages, 15 figu...|0704.0001|[{"version":"v1",...|                null|
|  We describe a n...|Ileana Streinu an...|  math.CO cs.CG|To appear in Grap...|0704.0002|[{"version":"v1",...|http://arxiv.org/...|
|  The evolution o...|         Hongjun Pan| physics.gen-ph| 23 pages, 3 figures|0704.0003|[{"version":"v1",...|                null|
|  We show that a ...|        David Callan|        math.CO|            11 pages|0704.0004|[{"version":"v1",...|                null|
|  In this paper w...|Wael Abu-Shammala...|math.CA math.FA|          

## Question 3: Missing values for "comments" and "license" attributes

In [53]:
df = df.dropna(subset=["comments"])
df = df.na.fill({"license": "unknown"}) 
#df = df.fillna(value = "unknown",subset = ["license"])
df.show()


+--------------------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+
|            abstract|             authors|          categories|            comments|       id|            versions|             license|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+
|  A fully differe...|C. Bal\'azs, E. L...|              hep-ph|37 pages, 15 figu...|0704.0001|[{"version":"v1",...|             unknown|
|  We describe a n...|Ileana Streinu an...|       math.CO cs.CG|To appear in Grap...|0704.0002|[{"version":"v1",...|http://arxiv.org/...|
|  The evolution o...|         Hongjun Pan|      physics.gen-ph| 23 pages, 3 figures|0704.0003|[{"version":"v1",...|             unknown|
|  We show that a ...|        David Callan|             math.CO|            11 pages|0704.0004|[{"version":"v1",...|             unknown|
|  We study the tw...|Y. H. Pong a

## Question 4: Get the author names who published a paper in a 'math' category

In [54]:
from pyspark.sql.column import *
#df.createOrReplaceGlobalTempView("table1")
#sql_query = """ select authors from table1 where categories like math% """
#spark.sql(sql_query).show()
#df = df.selectExpr("authors").filter("categories like 'math%'").show()

df = df.select("authors").filter("categories like 'math%'").show()


+--------------------+
|             authors|
+--------------------+
|Ileana Streinu an...|
|        David Callan|
|  Sergei Ovchinnikov|
|Clifton Cunningha...|
|        Koichi Fujii|
|         Norio Konno|
|Simon J.A. Malham...|
|Robert P. C. de M...|
|  P\'eter E. Frenkel|
|          Mihai Popa|
|   Debashish Goswami|
|      Mikkel {\O}bro|
|Nabil L. Youssef,...|
|         Boris Rubin|
|         A. I. Molev|
| Branko J. Malesevic|
|   John W. Robertson|
|     Yu.N. Kosovtsov|
|        Osamu Fujino|
|Stephen C. Power ...|
+--------------------+
only showing top 20 rows



## Question 5: Get linceses with 5 or more letters in the "abstract"

In [60]:
df.createOrReplaceTempView("Archive")

#sql_query = """ SELECT authors FROM ArchiveWHERE categories LIKE 'math%'"""
sql_query= """ SELECT distinct(license) from Archive where abstract REGEXP '%\(([A-Za-z][^_ /\\<>]{5,})\)%'"""

spark.sql(sql_query).show()

print(spark.sql(sql_query).count())

+--------------------+
|             license|
+--------------------+
|http://arxiv.org/...|
|http://creativeco...|
|http://creativeco...|
|http://creativeco...|
|http://creativeco...|
|                null|
+--------------------+

6


## Question 6: Extract the statistic of the number of pages for unknown licenses

In [61]:
import re
def get_Page(line):
    search = re.findall('\d+ pages', line)
    if search:
        return int(search[0].split(" ")[0])
    else:
        return 0

    
spark.udf.register("PageNumbers", get_Page)

sql_query = """SELECT AVG(PageNumbers(comments)) AS avg, SUM(PageNumbers(comments)) AS sum,
                STD(PageNumbers(comments)) AS std
                FROM Archive
                WHERE license="unknown"
            """

spark.sql(sql_query).show()

+----+----+----+
| avg| sum| std|
+----+----+----+
|null|null|null|
+----+----+----+

