# Load data into a notebook from different sources (Scala)

Before you can start analyzing your data, you have to load the data from a data source. You can store your data in many different data sources. This reference notebook shows you how to load and integrate data in a notebook from the following data sources:
-  Object Storage V3
-  dashDB
-  Cloudant
-  PostgreSQL

The notebook sample code shows you how to load data into a notebook by using Scala. You can copy and paste these code snippets into the notebook you are developing.

## Table of contents

- [Load data from Object Storage V3](#osv3)
  - [Load data by using Scala](#osv3_scala)
  - [Load data by using Stocator](#osv3_stocator)
- [Load data from dashDB](#dashdb)
- [Load data from a Cloudant database](#cloudant)
- [Load data from a PostgreSQL database](#postgresql)
- [Summary](#summary)

<a id="osv3"></a>
## Load data from Object Storage V3
IBM® Object Storage for Bluemix® provides you with access to a fully provisioned Swift Object Storage account to manage your data. Object Storage uses OpenStack Identity (Keystone) for authentication and can be accessed directly by using [OpenStack Object Storage (Swift) API v3](http://developer.openstack.org/api-ref-identity-v3.html#credentials-v3). 

When you load data for use in a notebook, the data file is stored in the Object Storage instance associated with your Spark service.

Click the next code cell to set the focus on the cell. Now add the credentials to access the data file to this code cell by selecting **Palette>Data Sources** and clicking the `Insert to code` function below the data file in the **Data Source** pane.

When you select the `Insert to code` function, a code cell with a Scala hashmap is created for you. Adjust the credentials in the hashmap to correspond with the credentials inserted by the `Insert to code` function and run the code cell. The access credentials to the Object Storage instance in the hashmap are then available for later use.
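
As a rough sketch of the shape of such a hashmap, the following cell builds one by hand. The key names are the ones that the later cells in this notebook read; every value is a placeholder, and the name `exampleCredentials` is hypothetical and chosen so that it does not overwrite the generated `credentials` variable. The cell that `Insert to code` generates may contain more entries than shown here.

```scala
// Hypothetical stand-in for the hashmap created by `Insert to code`.
// Key names come from the cells later in this notebook; values are placeholders.
val exampleCredentials = scala.collection.mutable.HashMap[String, String](
  "auth_url"   -> "https://identity.example.com",
  "project_id" -> "<project_id>",
  "region"     -> "<region>",
  "user_id"    -> "<user_id>",
  "username"   -> "<username>",
  "password"   -> "<password>",
  "filename"   -> "<filename>.csv"
)

// Look up a single entry the same way the loading cells do.
val fileName = exampleCredentials("filename")

// getOrElse avoids a NoSuchElementException when a key is absent.
val tenant = exampleCredentials.getOrElse("tenantId", "")
```

Direct lookup with `exampleCredentials("key")` fails loudly on a missing key, which is usually preferable for required settings, while `getOrElse` suits optional ones.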

<a id="osv3_scala"></a>
### Load data by using Scala

Run the next cells to load the data from a file in Object Storage by using Scala functions.

In [None]:
import org.apache.spark.sql.SQLContext
import scala.collection.breakOut

def setConfig(name:String, dsConfiguration:String) : Unit = {
    // Property keys have the form fs.swift.service.<name>.<setting>
    val pfx = "fs.swift.service." + name + "."
    // Parse one "key:value" pair per line; split on the first colon only so that URLs stay intact
    val settings:Map[String,String] = dsConfiguration.split("\\n").
        map(l=>(l.split(":",2)(0).trim(), l.split(":",2)(1).trim()))(breakOut)

    // Set the values on the Hadoop configuration; sc.getConf would only return a copy
    val conf = sc.hadoopConfiguration
    conf.set(pfx + "auth.url", settings.getOrElse("auth_url",""))
    conf.set(pfx + "tenant", settings.getOrElse("tenantId", ""))
    conf.set(pfx + "username", settings.getOrElse("username", ""))
    conf.set(pfx + "password", settings.getOrElse("password", ""))
    conf.set(pfx + "apikey", settings.getOrElse("password", ""))
    conf.set(pfx + "auth.endpoint.prefix", "endpoints")
}

In [None]:
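// Serialize the credentials as one "key:value" pair per line, the format that setConfig parses
setConfig("spark", credentials.map { case (k, v) => k + ":" + v }.mkString("\n"))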

val sqlctx = new SQLContext(sc)
val scplain = sqlctx.sparkContext
sqlctx.setConf("spark.sql.shuffle.partitions", "10")
import sqlctx.implicits._

val df = (sqlctx.read
    .format("com.databricks.spark.csv")
    .option("header","true")
    .option("inferschema","true")
    .option("mode","DROPMALFORMED")
    .load("swift://notebooks.spark/" + credentials("filename"))
)

df.show(5)

Now your data is in a Spark DataFrame and you can begin analyzing it.
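
The service name passed to `setConfig` (`spark` in the cell above) also appears as the suffix of the host part of the `swift://` URL, after the container name. The following sketch shows how the two fit together. `swiftKey` and `swiftUrl` are hypothetical helper names, and `data.csv` is a placeholder file name.

```scala
// Sketch: how the service name ties the Hadoop property keys to the swift:// URL.
// Both helpers are hypothetical and only build strings.
def swiftKey(service: String, setting: String): String =
  s"fs.swift.service.$service.$setting"

def swiftUrl(container: String, service: String, fileName: String): String =
  s"swift://$container.$service/$fileName"

// The same service name is used on both sides.
val authKey = swiftKey("spark", "auth.url")
val dataUrl = swiftUrl("notebooks", "spark", "data.csv")

println(authKey)
println(dataUrl)
```

If the service name in the URL does not match the name used in the property keys, the driver has no credentials for that URL, so it helps to derive both from one value.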

<a id="osv3_stocator"></a>
### Load data by using Stocator
Stocator is a storage connector for Spark that bypasses some of the Hadoop drivers that are not needed to interact with object storage. Run the next cell to import the required classes and to create the contexts that the Stocator cells use:

In [None]:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import scala.util.control.NonFatal
import play.api.libs.json.Json

val sqlctx = new SQLContext(sc)
val scplain = sqlctx.sparkContext

Before you can read the data file in Object Storage through the [`SparkContext`](https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.SparkContext) object, you must set the Hadoop configuration by using the following configuration function:

In [None]:
def setRemoteObjectStorageConfig(name:String, sc: SparkContext, dsConfiguration:String) : Boolean = {
    try {
        scala.util.parsing.json.JSON.parseFull(dsConfiguration) match {
            // Type arguments are erased at run time, so this matches any JSON object
            case Some(e: Map[String,String] @unchecked) =>
                val prefix = "fs.swift2d.service." + name
                val hconf = sc.hadoopConfiguration
                hconf.set("fs.swift2d.impl","com.ibm.stocator.fs.ObjectStoreFileSystem")
                hconf.set(prefix + ".auth.url", e("auth_url") + "/v3/auth/tokens")
                hconf.set(prefix + ".tenant", e("project_id"))
                hconf.set(prefix + ".username", e("user_id"))
                hconf.set(prefix + ".password", e("password"))
                hconf.set(prefix + ".auth.method", "keystoneV3")
                hconf.set(prefix + ".region", e("region"))
                hconf.setBoolean(prefix + ".public", true)
                println("Successfully modified sparkcontext object with remote Object Storage Credentials using datasource name " + name)
                println("")
                true
            // Covers unparsable input (None) and JSON that is not an object
            case _ =>
                println("Failed.")
                false
        }
    }
    catch {
        // A missing credential key or any other non-fatal error ends up here
        case NonFatal(exc) =>
            println(exc)
            false
    }
}

Set the Hadoop configuration and load the data.

In [None]:
val setObjStor = setRemoteObjectStorageConfig("sparksql", scplain, Json.toJson(credentials.toMap).toString)
val data_rdd = sc.textFile("swift2d://notebooks.sparksql/" + credentials("filename"))
data_rdd.take(5)

Now your data is in a Spark RDD and you can begin analyzing it.
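
Because `setRemoteObjectStorageConfig` looks up each credential by key, a missing entry only shows up as a printed exception. A small check before the call can name the missing keys instead. The sketch below assumes the five keys that the function reads; `requiredKeys` and `checkCredentials` are hypothetical names, and the sample map is deliberately incomplete.

```scala
// Keys that setRemoteObjectStorageConfig reads from the credentials.
val requiredKeys = Seq("auth_url", "project_id", "user_id", "password", "region")

// Returns Left with the names of missing or empty keys,
// or Right with the unchanged credentials when everything is present.
def checkCredentials(creds: Map[String, String]): Either[Seq[String], Map[String, String]] = {
  val missing = requiredKeys.filterNot(key => creds.get(key).exists(_.nonEmpty))
  if (missing.isEmpty) Right(creds) else Left(missing)
}

// Placeholder credentials with three keys left out on purpose.
val incomplete = Map(
  "auth_url" -> "https://identity.example.com",
  "password" -> "<password>"
)

val checked = checkCredentials(incomplete)
checked match {
  case Right(_)      => println("All required keys are present.")
  case Left(missing) => println("Missing keys: " + missing.mkString(", "))
}
```

Returning an `Either` keeps the check free of side effects, so the caller decides whether to stop, log, or ask for new credentials.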

<a id="dashdb"></a>
## Load data from dashDB

dashDB is a data warehousing and analytics solution. Use dashDB to store relational data, including special data types such as geospatial data. You can leverage the in-memory database technology to use both columnar and row-based tables. 

You must have an IBM dashDB for Bluemix service instance. In the notebook, select **Palette>Data Sources**. Click **Add Source**, select **From Bluemix**, and choose your dashDB instance. The dashDB instance name appears in the **Data Source** pane.

Click the next code cell and use the `Insert to code` function below the dashDB instance name in the **Data Source** pane to add the dashDB credentials. 

When you select the `Insert to code` function, a code cell with a Scala hashmap is created for you. Adjust the credentials in the hashmap to correspond with the credentials inserted by the `Insert to code` function and run the code cell. The access credentials to the dashDB instance in the hashmap are then available for later use.

After adding the credentials of the dashDB instance that contains your data, run the next cells to load this data. Be sure to set the `TABLENAME` variable to the name of the dashDB table that you want to access.

The code in the cells reads the credentials and loads the data from dashDB into a DataFrame.

In [None]:
import java.util.Properties
import collection.JavaConversions._
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)

In [None]:
val TABLENAME = "<name>"

val propMap = mapAsJavaMap(Map("user"->credentials("username"), "password"->credentials("password")))
val table = credentials("username") + "." + TABLENAME

val props = new Properties()
props.putAll(propMap)

val df = sqlctx.read.jdbc(credentials("jdbcurl"), table, properties=props)
df.show(5)

Now your data is in a Spark DataFrame and you can begin analyzing it.
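
The cell above relies on `scala.collection.JavaConversions`, which is deprecated as of Scala 2.12 and no longer available in Scala 2.13. As a minimal sketch of an alternative, the connection properties can be filled in explicitly. `jdbcProperties` and `qualifiedTable` are hypothetical helper names, and the values are placeholders.

```scala
import java.util.Properties

// Build the JDBC connection properties explicitly instead of relying on
// implicit Scala-to-Java collection conversions.
def jdbcProperties(user: String, password: String): Properties = {
  val props = new Properties()
  props.setProperty("user", user)
  props.setProperty("password", password)
  props
}

// Qualify the table with its schema; in the cell above the schema is the user name.
def qualifiedTable(schema: String, table: String): String = s"$schema.$table"

val exampleProps = jdbcProperties("<username>", "<password>")
val exampleTable = qualifiedTable("<username>", "<name>")
```

The resulting `Properties` object and table name can be passed to `sqlctx.read.jdbc` in the same positions as `props` and `table` in the cell above.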

<a id="cloudant"></a>
## Load data from a Cloudant database
Cloudant is a NoSQL database as a service (DBaaS) built to scale globally, run nonstop, and handle a wide variety of data types such as JSON, full-text, and geospatial data. Cloudant NoSQL DB is an operational data store optimized to handle concurrent reads and writes and to provide high availability and data durability.

You must have an IBM Cloudant NoSQL Database for Bluemix service instance. In the notebook, select **Palette>Data Sources**. Click **Add Source**, select **From Bluemix**, and choose your Cloudant NoSQL DB instance. The Cloudant NoSQL DB instance name appears in the **Data Source** pane. 

Click the next code cell and use the `Insert to code` function below the Cloudant NoSQL DB instance name in the **Data Source** pane to add the Cloudant NoSQL DB instance credentials. 

Adjust the credentials in the Scala hashmap, which is prepared for you, to correspond with the credentials inserted by the `Insert to code` function and run the code cell. The access credentials to your Cloudant NoSQL DB instance in the hashmap are then available for later use.

After adding the credentials of your Cloudant instance that contains your data, run the next cell to load this data. Be sure to set the `DBNAME` variable to the name of the database in your Cloudant service you would like to access.

In [None]:
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)

val DBNAME = "<name>"

val df = sqlctx.read.format("com.cloudant.spark").
option("cloudant.host", credentials("host")).
option("cloudant.username", credentials("username")).
option("cloudant.password", credentials("password")).
load(DBNAME)
df.show()

Now your data is in a Spark DataFrame and you can begin analyzing it.
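
The three connection settings can also be collected into one map. The sketch below only builds that map from the credentials. `cloudantOptions` is a hypothetical helper, the values are placeholders, and handing the result to the reader assumes the `options` method of the Spark `DataFrameReader`.

```scala
// Hypothetical helper: derive the connector options from the credentials.
// Only the three keys used in the cell above are read.
def cloudantOptions(creds: scala.collection.Map[String, String]): Map[String, String] =
  Seq("host", "username", "password")
    .map(key => ("cloudant." + key) -> creds(key))
    .toMap

val exampleOptions = cloudantOptions(Map(
  "host"     -> "<host>",
  "username" -> "<username>",
  "password" -> "<password>"
))

// With Spark available, the map could be passed in one call:
// sqlctx.read.format("com.cloudant.spark").options(exampleOptions).load(DBNAME)
```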

<a id="postgresql"></a>
## Load data from a PostgreSQL database
PostgreSQL is an object-relational database system offered as a Bluemix service. It must be paired with an existing [Compose](https://www.compose.com/) account to be used.

First, load the JAR file for the required JDBC driver, then set the connection details for the table that you want to access. You can find these details in your Compose account.

In [None]:
%AddJar https://jdbc.postgresql.org/download/postgresql-9.4.1208.jre6.jar

In [None]:
val host = "<host>"
val port = "<port>"
val user = "<user>"
val password = "<password>"
val dbname = "<db>"
val dbtable = "<table>"

Now run the next cell to load your data through the JDBC driver.

In [None]:
import org.apache.spark.sql.SQLContext
val sqlctx = new SQLContext(sc)

val df = sqlctx.read.format("jdbc").
                    option("url", "jdbc:postgresql://"+host+":"+port+"/"+dbname+"?user="+user+"&password="+password).
                    option("dbtable", dbtable).
                    option("driver", "org.postgresql.Driver").
                    load()

df.printSchema()
df.show(5)

Now your data is in a Spark DataFrame and you can begin analyzing it.
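
The JDBC URL in the cell above concatenates the user name and password as they are, so characters such as `&` or `@` in a password would change the meaning of the URL. A minimal sketch of one way around this is to percent-encode both values first, assuming the driver decodes URL parameter values. `enc` and `postgresUrl` are hypothetical helpers, and the host, database, and credentials are made-up sample values.

```scala
import java.net.URLEncoder

// Percent-encode a single URL parameter value.
def enc(value: String): String = URLEncoder.encode(value, "UTF-8")

// Assemble the JDBC URL with encoded user name and password.
def postgresUrl(host: String, port: String, db: String, user: String, password: String): String =
  s"jdbc:postgresql://$host:$port/$db?user=${enc(user)}&password=${enc(password)}"

// Sample values only; the password contains characters that need encoding.
val exampleUrl = postgresUrl("db.example.com", "5432", "compose", "admin", "p@ss&word")
println(exampleUrl)
```

An alternative that avoids encoding altogether is to keep the credentials out of the URL and pass them as separate connection properties.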

<a id="summary"></a>
## Summary

In this notebook, you learned how to load data from Object Storage V3, dashDB, Cloudant, and PostgreSQL into a notebook by using Scala.