# Hello PixieDust!
This sample notebook introduces many of the features included in PixieDust. You can find more information about PixieDust at https://pixiedust.github.io/pixiedust/. To make sure you are running the latest version of PixieDust, run the following cell (uncomment it first if it is commented out). Do not run this cell if you installed PixieDust locally from source and want to continue running it from source.

In [1]:
!pip install --user --upgrade pixiedust

Collecting pixiedust
  Downloading pixiedust-1.0.5.tar.gz (110kB)
    100% |████████████████████████████████| 112kB 824kB/s eta 0:00:01
Requirement already up-to-date: mpld3 in /Users/mbrobergus.ibm.com/anaconda/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: lxml in /Users/mbrobergus.ibm.com/.local/lib/python2.7/site-packages (from pixiedust)
Building wheels for collected packages: pixiedust
  Running setup.py bdist_wheel for pixiedust ... done
  Stored in directory: /Users/mbrobergus.ibm.com/Library/Caches/pip/wheels/64/88/d8/dab16cc6385c872294f763afd88cb35bc3a6e679506b1e5231
Successfully built pixiedust
Installing collected packages: pixiedust
Successfully installed pixiedust-1.0.5
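
If you want to confirm which version ended up installed without importing the library, a minimal sketch along these lines may help. It assumes Python 3.8 or later (for `importlib.metadata`), which is newer than the Python 2.7 environment shown in this notebook's output, and `installed_version` is a hypothetical helper name, not part of PixieDust.

```python
# Sketch: look up an installed package's version without importing it.
# Assumes Python 3.8+; installed_version is a hypothetical helper, not a PixieDust API.
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pixiedust is installed, otherwise None.
print("pixiedust version:", installed_version("pixiedust"))
```

Returning `None` instead of raising keeps the check safe to run in a notebook before the package has been installed.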


# Import PixieDust
Run the following cell to import the PixieDust library. You may need to restart your kernel after importing. Follow the instructions, if any, after running the cell. Note: You must import PixieDust every time you restart your kernel.

In [1]:
import pixiedust

Pixiedust database opened successfully


# Enable the Spark Progress Monitor
PixieDust includes a Spark Progress Monitor bar that lets you track the status of your Spark jobs. You can find more info at https://pixiedust.github.io/pixiedust/sparkmonitor.html. Note: there is a known issue with the Spark Progress Monitor on Spark 2.1. Run the following cell to enable the Spark Progress Monitor:

In [2]:
pixiedust.enableJobMonitor();

Succesfully enabled Spark Job Progress Monitor


# Example use of the PackageManager
You can use the PackageManager component of PixieDust to install and uninstall Maven packages in your notebook kernel without editing configuration files. This component is essential when you run notebooks in a hosted cloud environment and do not have access to the configuration files. You can find more info at https://pixiedust.github.io/pixiedust/packagemanager.html. Run the following cell to install the GraphFrames package. You may need to restart your kernel after installing new packages. Follow the instructions, if any, after running the cell.

In [3]:
pixiedust.installPackage("graphframes:graphframes:0.1.0-spark1.6")
print("done")

Package already installed: graphframes:graphframes:0.1.0-spark1.6
done


Run the following cell to print out all installed packages:

In [4]:
pixiedust.printAllPackages()


graphframes:graphframes:0.1.0-spark1.6 => /Users/mbrobergus.ibm.com/pixiedust/data/libs/graphframes-0.1.0-spark1.6.jar
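
The package identifiers above follow the Maven `group:artifact:version` coordinate format, and the jar names in the output look like `artifact-version.jar`. The following is a small illustrative sketch of that naming pattern in plain Python. The helper names (`parse_coordinate`, `jar_file_name`) are hypothetical and this is not how PixieDust itself resolves packages.

```python
# Sketch: split a Maven coordinate and derive the jar file name seen in the output.
# parse_coordinate and jar_file_name are hypothetical helpers, not PixieDust APIs.
from collections import namedtuple

MavenCoordinate = namedtuple("MavenCoordinate", ["group_id", "artifact_id", "version"])


def parse_coordinate(coordinate):
    """Parse 'group:artifact:version' into its three parts."""
    parts = coordinate.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected 'group:artifact:version', got %r" % coordinate)
    return MavenCoordinate(*parts)


def jar_file_name(coordinate):
    """Build the conventional '<artifact>-<version>.jar' file name."""
    parsed = parse_coordinate(coordinate)
    return "{}-{}.jar".format(parsed.artifact_id, parsed.version)


print(jar_file_name("graphframes:graphframes:0.1.0-spark1.6"))
```

Validating the coordinate up front gives a clear error for a malformed string before any download is attempted.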


# Example use of the display() API
PixieDust lets you visualize your data in just a few clicks using the display() API. You can find more info at https://pixiedust.github.io/pixiedust/displayapi.html. The following cell creates a DataFrame and uses the display() API to create a bar chart:

In [5]:
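from pyspark.sql import SparkSession

# Reuse the notebook's Spark session if one exists, otherwise create one
sparkSession = SparkSession.builder.getOrCreate()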
d1 = sparkSession.createDataFrame(
[(2010, 'Camping Equipment', 3),
 (2010, 'Golf Equipment', 1),
 (2010, 'Mountaineering Equipment', 1),
 (2010, 'Outdoor Protection', 2),
 (2010, 'Personal Accessories', 2),
 (2011, 'Camping Equipment', 4),
 (2011, 'Golf Equipment', 5),
 (2011, 'Mountaineering Equipment', 2),
 (2011, 'Outdoor Protection', 4),
 (2011, 'Personal Accessories', 2),
 (2012, 'Camping Equipment', 5),
 (2012, 'Golf Equipment', 5),
 (2012, 'Mountaineering Equipment', 3),
 (2012, 'Outdoor Protection', 5),
 (2012, 'Personal Accessories', 3),
 (2013, 'Camping Equipment', 8),
 (2013, 'Golf Equipment', 5),
 (2013, 'Mountaineering Equipment', 3),
 (2013, 'Outdoor Protection', 8),
 (2013, 'Personal Accessories', 4)],
["year","zone","unique_customers"])

display(d1)
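
To get a feel for what a bar chart of this data would show, the sketch below reproduces one possible aggregation in plain Python: the sum of `unique_customers` per `year`. It does not use Spark or PixieDust, `sum_by_key` is a hypothetical helper, and the choice of key and aggregation is an assumption, since in PixieDust you pick those in the chart options.

```python
# Sketch: sum unique_customers per year, as a bar chart keyed on year might plot.
# Plain Python only; sum_by_key is a hypothetical helper, not a PixieDust API.
from collections import defaultdict

rows = [
    (2010, 'Camping Equipment', 3), (2010, 'Golf Equipment', 1),
    (2010, 'Mountaineering Equipment', 1), (2010, 'Outdoor Protection', 2),
    (2010, 'Personal Accessories', 2), (2011, 'Camping Equipment', 4),
    (2011, 'Golf Equipment', 5), (2011, 'Mountaineering Equipment', 2),
    (2011, 'Outdoor Protection', 4), (2011, 'Personal Accessories', 2),
    (2012, 'Camping Equipment', 5), (2012, 'Golf Equipment', 5),
    (2012, 'Mountaineering Equipment', 3), (2012, 'Outdoor Protection', 5),
    (2012, 'Personal Accessories', 3), (2013, 'Camping Equipment', 8),
    (2013, 'Golf Equipment', 5), (2013, 'Mountaineering Equipment', 3),
    (2013, 'Outdoor Protection', 8), (2013, 'Personal Accessories', 4),
]


def sum_by_key(records, key_index, value_index):
    """Group records by one column and sum another column within each group."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[value_index]
    return dict(totals)


customers_per_year = sum_by_key(rows, key_index=0, value_index=2)
for year in sorted(customers_per_year):
    print(year, customers_per_year[year])
```

Passing `key_index=1` instead groups by `zone`, which corresponds to choosing a different key column for the chart.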

# Example use of the Scala bridge
Data scientists working with Spark may occasionally need to call one of the hundreds of libraries available on spark-packages.org that are written in Scala or Java. PixieDust addresses this by letting you write and run Scala code directly in its own cell. It also lets you share variables between Python and Scala in both directions. You can find more info at https://pixiedust.github.io/pixiedust/scalabridge.html.

Start by creating Python variables that we'll use in Scala:

In [6]:
python_var = "Hello From Python"
python_num = 10


Write Scala code that uses python_var and python_num and creates a new variable, __scala_var, that we'll use in Python:

In [7]:
%%scala
println(python_var)
println(python_num+10)
val __scala_var = "Hello From Scala"


Hello From Python
20


Use __scala_var from Python:

In [8]:
print(__scala_var)


Hello From Scala


# Sample Data
PixieDust includes a number of sample data sets. You can use these sample data sets to start playing with the display() API and other PixieDust features. You can find more info at https://pixiedust.github.io/pixiedust/loaddata.html. Run the following cell to view the available data sets:

In [9]:
pixiedust.sampleData()


Id,Name,Topic,Publisher
1,Car performance data,transportation,IBM
2,"Sample retail sales transactions, January 2009",Economy & Business,IBM Cloud Data Services
3,Total population by country,Society,IBM Cloud Data Services
4,GoSales Transactions for Naive Bayes Model,Leisure,IBM
5,Election results by County,Society,IBM
6,Million dollar home sales in NE Mass late 2016,Economy & Business,Redfin.com
7,"Boston Crime data, 2-week sample",Society,City of Boston


# Example use of sample data
To use the sample data locally, run the following cell to install the required packages. You may need to restart your kernel after running this cell.

In [10]:
pixiedust.installPackage("com.databricks:spark-csv_2.10:1.5.0")
pixiedust.installPackage("org.apache.commons:commons-csv:0")


Downloading package com.databricks:spark-csv_2.10:1.5.0 to /Users/mbrobergus.ibm.com/pixiedust/data/libs/spark-csv_2.10-1.5.0.jar


Package com.databricks:spark-csv_2.10:1.5.0 downloaded successfully
Please restart Kernel to complete installation of the new package
Successfully added package com.databricks:spark-csv_2.10:1.5.0
Downloading package org.apache.commons:commons-csv:1.4 to /Users/mbrobergus.ibm.com/pixiedust/data/libs/commons-csv-1.4.jar


Package org.apache.commons:commons-csv:1.4 downloaded successfully
Please restart Kernel to complete installation of the new package
Successfully added package org.apache.commons:commons-csv:1.4


<pixiedust.packageManager.package.Package at 0x11761bf90>

Run the following cell to get the first data set from the list. This will return a DataFrame and assign it to the variable d2:

In [11]:
d2 = pixiedust.sampleData(1)


Downloading 'Car performance data' from https://github.com/ibm-watson-data-lab/open-data/raw/master/cars/cars.csv


Creating pySpark DataFrame for 'Car performance data'. Please wait...


Successfully created pySpark DataFrame for 'Car performance data'


Pass the sample data set (d2) into the display() API:

In [12]:
display(d2)

mpg,cylinders,engine,horsepower,weight,acceleration,year,origin,name
18.0,8,307.0,130,3504,12.0,70,American,chevrolet chevelle malibu
15.0,8,350.0,165,3693,11.5,70,American,buick skylark 320
18.0,8,318.0,150,3436,11.0,70,American,plymouth satellite
16.0,8,304.0,150,3433,12.0,70,American,amc rebel sst
17.0,8,302.0,140,3449,10.5,70,American,ford torino
15.0,8,429.0,198,4341,10.0,70,American,ford galaxie 500
14.0,8,454.0,220,4354,9.0,70,American,chevrolet impala
14.0,8,440.0,215,4312,8.5,70,American,plymouth fury iii
14.0,8,455.0,225,4425,10.0,70,American,pontiac catalina
15.0,8,390.0,190,3850,8.5,70,American,amc ambassador dpl


You can also load a CSV file from a URL into a DataFrame, which you can then pass to the display() API:

In [13]:
d3 = pixiedust.sampleData("https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv")


Downloading 'https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv' from https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv


Creating pySpark DataFrame for 'https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv'. Please wait...


Successfully created pySpark DataFrame for 'https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv'


# PixieDust Log

In [14]:
%pixiedustLog -l debug


2017-06-02 10:27:21,882 - pixiedust.utils.storage - INFO - Change in version detected: 1.0.4 -> 1.0.5.
2017-06-02 10:27:28,995 - pixiedust.utils.scalaBridge.PixiedustScalaMagics - DEBUG - Calling scala compiler with command: /Users/mbrobergus.ibm.com/pixiedust/bin/scala/scala-2.11.8/bin/scalac -classpath /Users/mbrobergus.ibm.com/pixiedust/data/libs/graphframes-0.1.0-spark1.6.jar:/Users/mbrobergus.ibm.com/pixiedust/data/libs/pixiedust.jar:/Users/mbrobergus.ibm.com/pixiedust/bin/spark/spark-2.0.2-bin-hadoop2.7/conf/:/Users/mbrobergus.ibm.com/pixiedust/bin/spark/spark-2.0.2-bin-hadoop2.7/jars/activation-1.1.1.jar:/Users/mbrobergus.ibm.com/pixiedust/bin/spark/spark-2.0.2-bin-hadoop2.7/jars/antlr-2.7.7.jar:/Users/mbrobergus.ibm.com/pixiedust/bin/spark/spark-2.0.2-bin-hadoop2.7/jars/antlr-runtime-3.4.jar:/Users/mbrobergus.ibm.com/pixiedust/bin/spark/spark-2.0.2-bin-hadoop2.7/jars/antlr4-runtime-4.5.3.jar:/Users/mbrobergus.ibm.com/pixiedust/bin/spark/spark-2.0.2-bin-hadoop2.7/jars/aopallianc
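
Each log entry above appears to follow a `timestamp - logger - level - message` layout. As a rough sketch under that assumption, the snippet below splits one entry into its fields using only the standard library. `parse_log_line` is a hypothetical helper, not something PixieDust provides, and the layout is inferred from the output shown rather than from a documented format.

```python
# Sketch: split a log entry of the form 'timestamp - logger - level - message'.
# The layout is inferred from the output above; parse_log_line is a hypothetical helper.
from datetime import datetime


def parse_log_line(line):
    """Return the four fields of a log entry as a dictionary."""
    # maxsplit=3 keeps any ' - ' inside the message itself intact
    timestamp_text, logger_name, level, message = line.split(" - ", 3)
    timestamp = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S,%f")
    return {
        "timestamp": timestamp,
        "logger": logger_name,
        "level": level,
        "message": message,
    }


sample = ("2017-06-02 10:27:21,882 - pixiedust.utils.storage - INFO - "
          "Change in version detected: 1.0.4 -> 1.0.5.")
entry = parse_log_line(sample)
print(entry["level"], entry["logger"], entry["message"])
```

Once entries are parsed this way, filtering by `level` or `logger` in a list comprehension is straightforward.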

# Environment Info.
The following cells will print out information related to your notebook environment.


In [15]:
%%scala
val __scala_version = util.Properties.versionNumberString


In [16]:
import platform
# sc is the SparkContext provided by the notebook environment
print('PYTHON VERSION = ' + platform.python_version())
print('SPARK VERSION = ' + sc.version)
print('SCALA VERSION = ' + __scala_version)


PYTHON VERSION = 2.7.13
SPARK VERSION = 2.0.2
SCALA VERSION = 2.11.8
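
The cell above relies on `sc` and `__scala_version` already existing in the notebook. If you want a version report that also runs outside a Spark notebook, something like the following sketch could work. `environment_info` is a hypothetical helper, and note that it reports the version of the installed `pyspark` Python package, which may differ from the `sc.version` of a running cluster.

```python
# Sketch: collect version info without assuming a live SparkContext.
# environment_info is a hypothetical helper; pyspark is treated as optional.
import platform


def environment_info():
    """Return a dict of version strings; pyspark is None when not installed."""
    info = {"python": platform.python_version()}
    try:
        import pyspark
        info["pyspark"] = pyspark.__version__
    except ImportError:
        info["pyspark"] = None
    return info


for name, value in sorted(environment_info().items()):
    print("{} = {}".format(name.upper(), value))
```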


# More Info.
For more information about PixieDust check out the following:
#### PixieDust Documentation: https://pixiedust.github.io/pixiedust/index.html
#### PixieDust GitHub Repo: https://github.com/ibm-watson-data-lab/pixiedust