# Spark Assignment
---

> *Name:* Panagiota Gkourioti <br />
> *Student ID:* p2822109 <br />
> *Course:* Big Data Systems and Architectures <br />
> *Professor:* Thanasis Vergoulis <br />

> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />

## Description of the case

This project explores book metadata with Apache Spark (specifically PySpark) in order to reveal useful insights.

## Load the necessary libraries

First, we import the libraries used throughout the project.

In [2]:
# import packages
import findspark
# for a local installation of Spark, uncomment the next line (raw string keeps the backslashes literal)
# findspark.init(r'C:\spark\spark-3.2.1-bin-hadoop3.2')
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import pyspark.sql.types as T
import pyspark.sql.functions as F

# reuse the running SparkContext if there is one, then wrap it in a SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

## Task 1

Our first task is to explore the dataset using Spark SQL DataFrames in a Jupyter notebook. The notebook does the following:
- It loads the dataset with the json() function.
- It counts and displays the number of books in the database.
- It counts and displays the number of e-books in the database (based on the “is_ebook” field).
- It uses the summary() command to display basic statistics about the “average_rating” field.
- It uses the groupBy() and count() commands to display all distinct values in the “format” field and their number of appearances.
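
The last step can be pictured in plain Python: grouping by a field and counting is a tally of how often each value occurs. The sketch below starts from a few (value, count) pairs that appear in the “format” output further down. It also shows one possible refinement, which is an assumption here and not something the task requires: folding letter case and surrounding whitespace so that `Paperback` and `paperback` are tallied together. In PySpark, a comparable normalization could probably be written as `books.groupBy(F.lower(F.trim(F.col('format'))))`.

```python
from collections import Counter

# (value, count) pairs copied from the notebook's "format" output.
observed = [
    ("Paperback", 2629), ("paperback", 2),
    ("Hardcover", 826), ("hardcover", 1),
    ("Comics", 2), ("comics", 1),
]

# Optional refinement (an assumption, not part of the task):
# fold case and whitespace so spelling variants share one key.
normalized = Counter()
for value, count in observed:
    normalized[value.strip().lower()] += count

# Print the merged tallies, most frequent first.
for value, count in normalized.most_common():
    print(f"{value}: {count}")
```

Whether such variants should be merged depends on the analysis; the raw grouping keeps them apart, which is what the task asks for.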

In [3]:
# load the data 
books = spark.read.json("books_5000.json")

In [4]:
# check the schema and data types
books.printSchema()

root
 |-- asin: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- author_id: string (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- book_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- edition_information: string (nullable = true)
 |-- format: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- is_ebook: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- kindle_asin: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- link: string (nullable = true)
 |-- num_pages: string (nullable = true)
 |-- popular_shelves: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- count: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- pub

In [7]:
# check for duplicates
if books.count() > books.dropDuplicates().count():
    print('There are duplicate rows in dataframe.')
else:
    print('No duplicates.')

No duplicates.


In [8]:
# count and display the number of books in the database
books_num = books.count()
print("The number of books in database is:", books_num)

# count and display the number of e-books in the database
ebooks_num = books.where(books.is_ebook == 'true').count()
print("The number of e-books in database is:", ebooks_num)

The number of books in database is: 4999
The number of e-books in database is: 749


In [20]:
# display descriptive statistics about the “average_rating” field
df1 = books.select('average_rating').summary()
df1.show()

+-------+-------------------+
|summary|     average_rating|
+-------+-------------------+
|  count|               4999|
|   mean| 3.9112042408481678|
| stddev|0.43444489528688784|
|    min|               1.00|
|    25%|               3.66|
|    50%|               3.98|
|    75%|               4.23|
|    max|               5.00|
+-------+-------------------+
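
The schema above lists `average_rating` as a string, and the `min` and `max` rows print `1.00` and `5.00` in their original string form. This suggests that those two statistics are compared as text, not as numbers. That looks harmless for the values shown here, but string and numeric comparison can disagree for fields such as `num_pages`, which the schema also lists as a string. The plain-Python sketch below illustrates the difference with made-up page counts (hypothetical values, not taken from the dataset). In PySpark, the usual remedy would be an explicit cast such as `F.col('num_pages').cast('double')` before aggregating.

```python
# Hypothetical page counts stored as strings, as in the schema (not real data).
num_pages = ["320", "95", "1200"]

# Comparing strings is lexicographic: "95" sorts after "320" and "1200".
string_min, string_max = min(num_pages), max(num_pages)

# Converting to numbers first gives the intended ordering.
as_numbers = [float(p) for p in num_pages]
numeric_min, numeric_max = min(as_numbers), max(as_numbers)

print("string min/max: ", string_min, string_max)
print("numeric min/max:", numeric_min, numeric_max)
```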



In [24]:
# display all distinct values in the “format” field and their number of appearances
df2 = books.groupBy('format').count()
df2.show(100, truncate=False)

+--------------------------+-----+
|format                    |count|
+--------------------------+-----+
|Paperback comic book      |1    |
|Paperback                 |2629 |
|Bolsillo con sobrecubierta|2    |
|Audible Audio             |1    |
|paperback                 |2    |
|Library Binding           |2    |
|Board book                |11   |
|Klappenbroschur           |1    |
|Nook                      |1    |
|Illustrated               |2    |
|Unknown Binding           |7    |
|Hardcover                 |826  |
|Issue                     |1    |
|Album                     |2    |
|Webtoon                   |2    |
|Book                      |1    |
|Paperback Manga           |2    |
|Kindle Edition            |41   |
|Comics                    |2    |
|hardcover                 |1    |
|Rustica con sobrecubierta |1    |
|Comic Book                |15   |
|Comic                     |15   |
|Mass Market Paperback     |64   |
|comics                    |1    |
|Audio              