## Display Movies from the Cloudant database

This example uses Spark SQL with a Cloudant data source.

This sample notebook is written in Python and expects the Python 2.7.5 runtime. Make sure the kernel is started and you are connected to it when executing this notebook.

In [None]:
# Import Python stuff
import pprint
from collections import Counter

In [None]:
# Import PySpark stuff
from pyspark.sql import *
from pyspark.sql.functions import udf, asc, desc
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import IntegerType

### 1. Work with the Spark Context
A Spark Context handle `sc` is available in every notebook created in the Spark Service.  
Use it to check the Spark version and the environment settings, and to create a Spark SQL Context object from it.

In [None]:
sc.version

In [None]:
# sc is an existing SparkContext.
sqlContext = SQLContext(sc)

### 2. Work with a Cloudant database
A DataFrame object can be created directly from a Cloudant database. To configure the database as the source, specify the following:  
1 - the format: the package name that provides the connector classes (such as CloudantDataSource) that extend BaseRelation. For the Cloudant Spark connector this is com.cloudant.spark  
2 - cloudant.host option to pass the Cloudant account host name (for example ACCOUNT.cloudant.com)  
3 - cloudant.username option to pass the Cloudant user name  
4 - cloudant.password option to pass the Cloudant account password  


In [None]:
# @hidden_cell
credentials = {
  'username':'f16bd2e3-8b90-4581-97d1-43dcd23c3ccd-bluemix',
  'password':"""28423678f7490bca3312b2f4cd4d839a9e7e587613cddedf8eaaae959a8d2d28""",
  'host':'f16bd2e3-8b90-4581-97d1-43dcd23c3ccd-bluemix.cloudant.com',
  'port':'443',
  'url':'https://f16bd2e3-8b90-4581-97d1-43dcd23c3ccd-bluemix:28423678f7490bca3312b2f4cd4d839a9e7e587613cddedf8eaaae959a8d2d28@f16bd2e3-8b90-4581-97d1-43dcd23c3ccd-bluemix.cloudant.com'
}


In [None]:
cloudantdata = sqlContext.read.format("com.cloudant.spark").\
option("cloudant.host",credentials['host']).\
option("cloudant.username",credentials['username']).\
option("cloudant.password",credentials['password']).\
load("moviedb")
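The three option keys used above can also be collected in one dictionary and handed to the reader in a single `options(**...)` call. The following is a minimal sketch, not part of the original notebook: the helper name `cloudant_options` is hypothetical, and it assumes a credentials dictionary with the same `host`, `username` and `password` keys as the `credentials` cell. The values shown are placeholders.

```python
def cloudant_options(creds):
    """Map a credentials dict onto the option keys used by the reader above."""
    required = ('host', 'username', 'password')
    missing = [key for key in required if not creds.get(key)]
    if missing:
        raise ValueError('missing credential field(s): ' + ', '.join(missing))
    return {
        'cloudant.host': creds['host'],
        'cloudant.username': creds['username'],
        'cloudant.password': creds['password'],
    }


# Placeholder values for illustration only, not a real account.
example_options = cloudant_options({
    'host': 'ACCOUNT.cloudant.com',
    'username': 'ACCOUNT',
    'password': 'PASSWORD',
})

# In the notebook, with sqlContext and credentials defined as above:
# cloudantdata = (sqlContext.read.format("com.cloudant.spark")
#                 .options(**cloudant_options(credentials))
#                 .load("moviedb"))
```

Failing early on a missing field gives a clearer error than a connection failure raised later, when the data is first loaded.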

### 3. Work with a Dataframe
At this point, all transformations and functions should behave as documented for Spark SQL (http://spark.apache.org/sql/).  


In [None]:
cloudantdata.printSchema()

In [None]:
cloudantdata.count()

In [None]:
cloudantdata.select("name", "url").show()

In [None]:
import pandas as pd
pandaDf = cloudantdata.select("name").toPandas()
# Show only the first 10 rows of the pandas DataFrame
print(pandaDf.head(10))

## Display the "randomly generated" Ratings from Object Storage

Drag and drop the ratings file onto the files rectangle. Then use the DSX code insertion feature to create a Hadoop configuration that allows Spark to access the object storage where the ratings file resides.

In [None]:

from pyspark.sql import SparkSession

# @hidden_cell
# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_5a2eb3130fdf4523ad2d610fb1f3d6f9(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', 'a47485e45a62467a92be88568bb933cd')
    hconf.set(prefix + '.username', '14b0c25dc91d4f3cb54a5b1bcb714eb7')
    hconf.set(prefix + '.password', 'kwBOG#^Z28Z{vs)y')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_5a2eb3130fdf4523ad2d610fb1f3d6f9(name)

spark = SparkSession.builder.getOrCreate()

# Please read the documentation of PySpark to learn more about the possibilities to load data files.
# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# The SparkSession object is already initialized for you.
# The following variable contains the path to your file on your Object Storage.
ratingsfile = "swift://MovieRecommender." + name + "/ratings.dat"
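The path built above has the form `swift://<container>.<name>/<object>`, where `<name>` must match the name used for the `fs.swift.service.<name>` configuration keys. A small sketch of that construction follows; the helper `swift_url` is a hypothetical name introduced only for illustration.

```python
def swift_url(container, service_name, object_name):
    """Build a swift:// URL of the form used for `ratingsfile` above."""
    # The Hadoop configuration keys use the prefix 'fs.swift.service.<service_name>',
    # so the same name must appear after the container in the URL.
    return 'swift://{0}.{1}/{2}'.format(
        container, service_name, object_name.lstrip('/'))


example_path = swift_url('MovieRecommender', 'keystone', 'ratings.dat')
print(example_path)  # swift://MovieRecommender.keystone/ratings.dat
```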


Read the file from object storage and display the first 10 records.  

Note: rename the 'path_n' variable generated by the code insertion above to 'ratingsfile'.

In [None]:
data_rdd = sc.textFile(ratingsfile)
data_rdd.take(10)
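`sc.textFile` returns plain text lines, so each record still has to be split into fields before it can be analysed. The layout of `ratings.dat` is not shown in this notebook. The sketch below assumes `::`-separated fields in the order user, movie, rating, and ignores any trailing fields. Adjust the separator and field order to match what `data_rdd.take(10)` actually displays. The function name `parse_rating` and the sample lines are hypothetical.

```python
def parse_rating(line, sep='::'):
    """Split one text line into (user_id, movie_id, rating).

    Assumes at least three `sep`-separated fields in that order; this layout
    is an assumption, not something confirmed by the notebook.
    """
    fields = line.strip().split(sep)
    if len(fields) < 3:
        raise ValueError('expected at least 3 fields, got: ' + repr(line))
    return int(fields[0]), int(fields[1]), float(fields[2])


# Made-up sample lines for illustration only.
sample_lines = ['1::10::4', '2::10::3.5::978300760']
parsed = [parse_rating(line) for line in sample_lines]

# In the notebook the same function could be applied to the RDD:
# ratings_rdd = data_rdd.map(parse_rating)
# ratings_rdd.take(10)
```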