# Final Project Big Data

### Dataset Preprocessing

In this step, the dataset is converted into a form that simplifies building the API. The dataset used is the Book-Crossing Dataset (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

## Spark Initialization

In [1]:
# Import findspark to read SPARK_HOME and HADOOP_HOME
import findspark
findspark.init()

In [2]:
# Import required library
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession \
    .builder \
    .appName("Convert Dataset") \
    .getOrCreate()

In [3]:
# Print Spark object ID
print(spark)

<pyspark.sql.session.SparkSession object at 0x000002723BF75400>


In [4]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql import types
import os
from pyspark.sql.types import *
from pyspark.sql import functions as F

## Convert Dataset 1 (BX-Book-Ratings)

In [5]:
lines = spark.read.csv("F:/fpbigdata/fpedit/dataset/BX-Book-Ratings.csv", header=True, inferSchema=True)

In [6]:
lines = lines.selectExpr(['`User-ID` as uid','`ISBN` as iid','`Book-Rating` as rating'])

In [7]:
lines.show()

+------+----------+------+
|   uid|       iid|rating|
+------+----------+------+
|276725|034545104X|     0|
|276726|0155061224|     5|
|276727|0446520802|     0|
|276729|052165615X|     3|
|276729|0521795028|     6|
|276733|2080674722|     0|
|276736|3257224281|     8|
|276737|0600570967|     6|
|276744|038550120X|     7|
|276745| 342310538|    10|
|276746|0425115801|     0|
|276746|0449006522|     0|
|276746|0553561618|     0|
|276746|055356451X|     0|
|276746|0786013990|     0|
|276746|0786014512|     0|
|276747|0060517794|     9|
|276747|0451192001|     0|
|276747|0609801279|     0|
|276747|0671537458|     9|
+------+----------+------+
only showing top 20 rows



In [8]:
from pyspark.ml.feature import IndexToString, StringIndexer

# Index the ISBN strings as numeric ids; "keep" places invalid labels in an extra bucket
stringindexer = StringIndexer(inputCol='iid', outputCol='iid_int')
stringindexer.setHandleInvalid("keep")
model = stringindexer.fit(lines)
lines_int = model.transform(lines)

# Index the rating column as well. Note: by default StringIndexer orders labels by
# descending frequency, so rating_int is a frequency rank, not the rating value.
stringindexer = StringIndexer(inputCol='rating', outputCol='rating_int')
stringindexer.setHandleInvalid("keep")
model = stringindexer.fit(lines_int)
lines_int_fix = model.transform(lines_int)

In [9]:
lines_int_fix.show()

+------+----------+------+--------+----------+
|   uid|       iid|rating| iid_int|rating_int|
+------+----------+------+--------+----------+
|276725|034545104X|     0|  1636.0|       0.0|
|276726|0155061224|     5| 87069.0|       5.0|
|276727|0446520802|     0|   568.0|       0.0|
|276729|052165615X|     3|310005.0|       8.0|
|276729|0521795028|     6|147200.0|       6.0|
|276733|2080674722|     0| 77066.0|       0.0|
|276736|3257224281|     8| 35182.0|       1.0|
|276737|0600570967|     6|293513.0|       6.0|
|276744|038550120X|     7|   232.0|       3.0|
|276745| 342310538|    10| 87749.0|       2.0|
|276746|0425115801|     0|   446.0|       0.0|
|276746|0449006522|     0|   604.0|       0.0|
|276746|0553561618|     0|   424.0|       0.0|
|276746|055356451X|     0|   280.0|       0.0|
|276746|0786013990|     0| 24580.0|       0.0|
|276746|0786014512|     0| 14934.0|       0.0|
|276747|0060517794|     9|  1413.0|       4.0|
|276747|0451192001|     0|   933.0|       0.0|
|276747|06098
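
In the output above, rating 3 becomes 8.0 and rating 8 becomes 1.0. This is consistent with `StringIndexer`'s default ordering (`frequencyDesc`), which gives index 0 to the most frequent label, so `rating_int` behaves like a frequency rank rather than the original score. The plain-Python sketch below imitates that ordering on a small made-up list; the function name `frequency_index`, the sample ratings, and the tie-breaking rule are assumptions for illustration, not part of the notebook or of Spark.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken by the label's string form
    # so the result is deterministic (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical ratings: 0 is the most common, then 8, then 3.
ratings = [0, 0, 0, 0, 8, 8, 8, 3]
mapping = frequency_index(ratings)
indexed = [mapping[r] for r in ratings]
```

If the numeric rating itself is what a later model or API needs, keeping the original `rating` column (which is still present in `lines_int_fix`) is probably the safer choice.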

In [11]:
# Write one CSV part file with a header row; Spark creates a directory at this path
lines_int_fix.repartition(1).write.csv("F:/fpbigdata/fpedit/newdataset/BookCrossing.csv", header=True)
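
Spark writes its output as a directory, so `BookCrossing.csv` here is a folder that holds a `part-*.csv` file plus marker files such as `_SUCCESS`. A minimal sketch of locating that part file from plain Python follows; the helper name `find_part_file` is hypothetical, and it assumes the data was written with `repartition(1)` so that exactly one part file exists.

```python
import glob
import os


def find_part_file(output_dir):
    """Return the path of the single part-*.csv file in a Spark output directory."""
    matches = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    if len(matches) != 1:
        # More or fewer than one part file means the single-partition
        # assumption of this sketch does not hold.
        raise FileNotFoundError(
            f"expected exactly one part-*.csv in {output_dir}, found {len(matches)}"
        )
    return matches[0]
```

For the path used in this notebook, the call would look like `find_part_file("F:/fpbigdata/fpedit/newdataset/BookCrossing.csv")`.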

## Convert Dataset 2 (BX-Books)

In [13]:
lines = spark.read.csv("F:/fpbigdata/fpedit/dataset/BX-Books.csv", header=True, inferSchema=True)

In [14]:
lines = lines.drop("Year-Of-Publication","Publisher","Image-URL-S","Image-URL-M","Image-URL-L")

In [15]:
print(lines.take(5))

[Row(ISBN='0195153448', Book-Title='Classical Mythology', Book-Author='Mark P. O. Morford'), Row(ISBN='0002005018', Book-Title='Clara Callan', Book-Author='Richard Bruce Wright'), Row(ISBN='0060973129', Book-Title='Decision in Normandy', Book-Author="Carlo D'Este"), Row(ISBN='0374157065', Book-Title='Flu: The Story of the Great Influenza Pandemic of 1918 and the Search for the Virus That Caused It', Book-Author='Gina Bari Kolata'), Row(ISBN='0393045218', Book-Title='The Mummies of Urumchi', Book-Author='E. J. W. Barber')]


In [17]:
lines = lines.selectExpr(['`ISBN` as bid','`Book-Title` as bname','`Book-Author` as bauthor'])

In [18]:
lines.show()

+----------+--------------------+--------------------+
|       bid|               bname|             bauthor|
+----------+--------------------+--------------------+
|0195153448| Classical Mythology|  Mark P. O. Morford|
|0002005018|        Clara Callan|Richard Bruce Wright|
|0060973129|Decision in Normandy|        Carlo D'Este|
|0374157065|Flu: The Story of...|    Gina Bari Kolata|
|0393045218|The Mummies of Ur...|     E. J. W. Barber|
|0399135782|The Kitchen God's...|             Amy Tan|
|0425176428|What If?: The Wor...|       Robert Cowley|
|0671870432|     PLEADING GUILTY|         Scott Turow|
|0679425608|Under the Black F...|     David Cordingly|
|074322678X|Where You'll Find...|         Ann Beattie|
|0771074670|Nights Below Stat...|David Adams Richards|
|080652121X|Hitler's Secret B...|          Adam Lebor|
|0887841740|  The Middle Stories|         Sheila Heti|
|1552041778|            Jane Doe|        R. J. Kaiser|
|1558746218|A Second Chicken ...|       Jack Canfield|
|156740778

In [19]:
from pyspark.ml.feature import IndexToString, StringIndexer

# Index the ISBN strings as numeric ids; "keep" places invalid labels in an extra bucket
stringindexer = StringIndexer(inputCol='bid', outputCol='bid_int')
stringindexer.setHandleInvalid("keep")
model = stringindexer.fit(lines)
lines_int = model.transform(lines)

In [20]:
lines_int.show()

+----------+--------------------+--------------------+--------+
|       bid|               bname|             bauthor| bid_int|
+----------+--------------------+--------------------+--------+
|0195153448| Classical Mythology|  Mark P. O. Morford|119165.0|
|0002005018|        Clara Callan|Richard Bruce Wright| 33817.0|
|0060973129|Decision in Normandy|        Carlo D'Este| 43432.0|
|0374157065|Flu: The Story of...|    Gina Bari Kolata|188211.0|
|0393045218|The Mummies of Ur...|     E. J. W. Barber|156302.0|
|0399135782|The Kitchen God's...|             Amy Tan|198730.0|
|0425176428|What If?: The Wor...|       Robert Cowley| 37272.0|
|0671870432|     PLEADING GUILTY|         Scott Turow|122925.0|
|0679425608|Under the Black F...|     David Cordingly|248741.0|
|074322678X|Where You'll Find...|         Ann Beattie|142907.0|
|0771074670|Nights Below Stat...|David Adams Richards|228708.0|
|080652121X|Hitler's Secret B...|          Adam Lebor| 60719.0|
|0887841740|  The Middle Stories|       
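
One point to keep in mind: `bid_int` comes from an indexer fitted on BX-Books, while `iid_int` came from a separate indexer fitted on BX-Book-Ratings. Two independently fitted indexers generally assign different integers to the same ISBN, so the two integer columns should not be used as a join key; the ISBN strings (`iid` and `bid`) remain the reliable link. The plain-Python sketch below illustrates the effect with made-up data. The helper `build_index` and both ISBN lists are hypothetical, and it assigns ids in first-seen order rather than by frequency to keep the example short.

```python
def build_index(labels):
    """Assign consecutive integer ids to labels in first-seen order."""
    index = {}
    for label in labels:
        index.setdefault(label, len(index))
    return index


# Hypothetical ISBNs, as they might appear in each file.
rating_isbns = ["034545104X", "0155061224", "034545104X"]
book_isbns = ["0155061224", "0195153448", "034545104X"]

# Independent indexes: the same ISBN can receive different ids.
rating_index = build_index(rating_isbns)
book_index = build_index(book_isbns)

# One index built over both sources: ids agree by construction.
shared_index = build_index(rating_isbns + book_isbns)
```

In Spark terms, a comparable approach would be to fit a single `StringIndexer` and reuse the fitted model on both DataFrames, or simply to join on the ISBN string.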

In [21]:
# Write one CSV part file with a header row; Spark creates a directory at this path
lines_int.repartition(1).write.csv("F:/fpbigdata/fpedit/newdataset/BookDetail.csv", header=True)