# Spark Assignment
---

> *Name:* Panagiota Gkourioti <br />
> *Student ID:* p2822109 <br />
> *Course:* Big Data Systems and Architectures <br />
> *Professor:* Thanasis Vergoulis <br />

> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />

## Task 2

For this task, we create a Jupyter notebook that uses PySpark DataFrames to deliver the following:
- The “book_id” and “title” of the book with the highest “average_rating” among the books whose title starts with the *first* letter of my last name.
- The average “average_rating” of the books whose title starts with the *second* letter of my last name.
- The “book_id” and “title” of the Paperback book with the most pages, considering only books whose title starts with the *third* letter of my last name.
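The three letters follow directly from the last name given in the header. As a minimal plain-Python sketch, they and the matching SQL `LIKE` patterns could be derived as follows; the variable names are illustrative and are not part of the notebook:

```python
# Derive the three title prefixes from the last name (illustrative, not notebook code)
last_name = "Gkourioti"

# first three letters, upper-cased because the filters match on a capital initial
prefixes = [ch.upper() for ch in last_name[:3]]

# SQL LIKE patterns in the form used by the filters, e.g. 'G%'
like_patterns = [f"{p}%" for p in prefixes]

print(prefixes)
print(like_patterns)
```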

In [1]:
# import packages
import findspark
# findspark.init(r'C:\spark\spark-3.2.1-bin-hadoop3.2')  # only needed for a local installation of Spark
from pyspark.sql import SparkSession
import pyspark.sql.types as T
import pyspark.sql.functions as F

# create (or reuse) the Spark session and its context
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [2]:
# load the data 
books = spark.read.json("books_5000.json")

In [3]:
# check the schema and data types
books.printSchema()

root
 |-- asin: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- author_id: string (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- book_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- edition_information: string (nullable = true)
 |-- format: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- is_ebook: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- kindle_asin: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- link: string (nullable = true)
 |-- num_pages: string (nullable = true)
 |-- popular_shelves: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- count: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- pub

In [4]:
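# convert num_pages to integer and average_rating to double (both are read as strings),
# so that sorting and aggregation are numeric rather than lexicographic
books = books.withColumn("num_pages", F.col("num_pages").cast(T.IntegerType()))\
.withColumn("average_rating", F.col("average_rating").cast(T.DoubleType()))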
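The cast matters because string values are compared character by character, and sorting a string column in Spark follows the same character-wise ordering. A small plain-Python illustration, using made-up page counts rather than values from the dataset:

```python
# hypothetical page counts stored as strings, as they appear in the raw JSON
pages_as_text = ["99", "1000", "250"]

# lexicographic comparison: "99" > "250" > "1000" because '9' > '2' > '1'
largest_as_text = max(pages_as_text)

# numeric comparison after converting to int
largest_as_number = max(int(p) for p in pages_as_text)

print(largest_as_text)    # 99
print(largest_as_number)  # 1000
```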

In [17]:
# keep only books whose title starts with "G",
# select book ID, title and average rating,
# sort by average rating in descending order and
# extract the first book's ID and title
b1 = books.filter("title like 'G%'").select('book_id','title','average_rating')\
.orderBy('average_rating', ascending = False).first()[0:2]
print('The book with the highest average rating whose title starts with "G" has ID',
      b1[0], 'and its title is', b1[1])

The book with the highest average rating whose title starts with "G" has ID 2513980 and its title is Gary Panter
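
Note that `first()` returns a single row, so if several books share the top rating, the one reported depends on how Spark happens to order the tied rows. A plain-Python sketch of collecting every tied row, using hypothetical records rather than the dataset:

```python
# hypothetical (book_id, title, average_rating) records, not taken from the dataset
rows = [
    ("1", "Gamma", 4.8),
    ("2", "Galaxy", 5.0),
    ("3", "Garden", 5.0),
]

# highest rating among the candidates
top_rating = max(rating for _, _, rating in rows)

# keep every book that reaches it, sorted by book_id for a reproducible result
top_books = sorted(
    (book_id, title) for book_id, title, rating in rows if rating == top_rating
)

print(top_books)
```

In the notebook, a similar effect could be obtained by passing a second sort key such as `book_id` to `orderBy`, so that ties are broken in a fixed way.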


In [16]:
# select title and average rating from the books data frame,
# keep only books whose title starts with "K" and
# compute the mean of their average ratings
b2 = books.select('title','average_rating').filter("title like 'K%'").agg({'average_rating': 'avg'})
print('The average of the average ratings of the books whose title starts with "K" is', round(b2.first()[0],2))

The average of the average ratings of the books whose title starts with "K" is 3.95


In [18]:
# keep only books in Paperback format
# whose title starts with "O",
# select book ID, title and number of pages,
# sort by number of pages in descending order and
# extract the first book's ID and title
b3 = books.where(books.format=='Paperback').filter("title like 'O%'").select('book_id','title','num_pages')\
.orderBy('num_pages',ascending = False).first()[0:2]
print('The Paperback book with the most pages whose title starts with "O" has ID', b3[0],
      'and its title is', b3[1])

The Paperback book with the most pages whose title starts with "O" has ID 21411974 and its title is One Piece: Skypeia 28-29-30, Vol. 10 (One Piece: Omnibus, #10)
