Tutorial: Using Notebooks with PixieDust for Fast, Flexible, and Easier Data Analysis and Experimentation

va barbosa edited this page Aug 22, 2017 · 7 revisions

Interactive notebooks are powerful tools for fast and flexible experimentation and data analysis. Notebooks can contain live code, static text, equations and visualizations. In this lab, you create a notebook via the IBM Data Science Experience to explore and visualize data to gain insight. We will be using PixieDust, an open source Python notebook helper library, to visualize the data in different ways (e.g., charts, maps, etc.) with one simple call.

pixiedust

Overview

In this tutorial, you will be learning about and using:

The tutorial can be followed from a local Jupyter Notebook environment. However, the instructions and screenshots here walk through the notebook in the DSX environment.

A corresponding notebook is available here: https://gist.github.com/vabarbosa/dc1eeaa363e8534306a2f5e09270cfee

You may access this tutorial at a later time and try it again at your own pace from here: http://ibm.biz/pixiedustlab

Note: For best results, use the latest version of either Mozilla Firefox or Google Chrome.

DSX

DSX is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (e.g., RStudio, Jupyter Notebooks, Spark, etc.) to collaborate, share, and gather insight from their data.

Sign Up

DSX is powered by IBM Bluemix, therefore your DSX login is same as your IBM Bluemix login. If you already have a Bluemix account or previously accessed DSX you may proceed to the Sign In section. Otherwise, you first need to sign up for an account.

From your browser:

  1. Go to the DSX site: http://datascience.ibm.com
  2. Click on Sign Up
  3. Enter your Email
  4. Click Continue
  5. Fill out the form to register for IBM Bluemix

Sign In

From your browser:

  1. Go to the DSX site: http://datascience.ibm.com
  2. Click on Sign In
  3. Enter your IBMid or email
  4. Click Continue
  5. Enter your Password
  6. Click Sign In

Jupyter Notebooks

Jupyter Notebooks are a powerful tool for fast and flexible data analysis and can contain live code, equations, visualizations and explanatory text.

Create a New Notebook

You will need to create a noteboook to experiment with the data and a project to house your notebook. After signing into DSX:

  1. On the upper right of the DSX site, click the + and choose Create project.
  2. Enter a Name for your project
  3. Select a Spark Service
  4. Click Create

From within the new project, you will create your notebook:

  1. Click add notebooks
  2. Click the Blank tab in the Create Notebook form
  3. Enter a Name for the notebook
  4. Select Python 2 for the Language
  5. Select 2.0 for the Spark version
  6. Select the Spark Service
  7. Click Create Notebook

You are now in your notebook and ready to start working.

When you use a notebook in DSX, you can run a cell only by selecting it, then going to the toolbar and clicking on the Run Cell (▸) button. When a cell is running, an [*] is shown beside the cell. Once the cell has finished the asterisks is replaced by a number.

If you don’t see the Jupyter toolbar showing the Run Cell (▸) button and other notebook controls, you are not in edit mode. Go to the dark blue toolbar above the notebook and click the edit (pencil) icon.

toobar

PixieDust

PixieDust is an open source Python helper library that works as an add-on to Jupyter notebooks to extends the usability of notebooks.

With interactive notebooks, a mundane task like creating a simple chart or saving data into a persistence repository requires mastery of complex code like this matplotlib snippet:

chart

To improve the notebook experience PixieDust simplifies much of this and provides a single display() API to visualize your data.

Update PixieDust

DSX already comes with the PixieDust library installed, but it is always a good idea to make sure you have the latest version:

  1. In the first cell of the notebook enter:

    !pip install --upgrade pixiedust
    
  2. Click on the Run Cell (▸) button

After the cell completes, if instructed to restart the kernel, from the notebook toolbar menu:

  1. Go to > Kernel > Restart
  2. Click Restart in the confirmation dialog

Note: The status of the kernel briefly flashes near the upper right corner, alerting when it is Not Connected, Restarting, Ready, etc.

Import PixieDust

Before, you can use the PixieDust library it must be imported into the notebook:

  1. In the next cell enter:

    import pixiedust
    
  2. Click on the Run Cell (▸) button

Note: Whenever the kernel is restarted, the import pixiedust cell must be run before continuing.

PixieDust has been updated and imported, you are now ready to play with your data!

Las Vegas Open Data

You now need some data! Many cities are now making much of their data available. One such city is Las Vegas.

Las Vegas Open Data is the online home of a large portion of the data the City of Las Vegas collects and makes available for citizens to see and use.

It would be good to take a look at some data from the city of Las Vegas. More specifically, the Las Vegas Restaurant Inspections data. This dataset contains demerits, grades, etc from inspections of Las Vegas restaurants.

Load the Data

With PixieDust, you can easily load CSV data from a URL into a PySpark DataFrame in the notebook.

In a new cell enter and run:

inspections = pixiedust.sampleData("https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv")

Remember to wait for the [*] indicator to turn into number, at which point the cell has completed running.

Here, you are passing the URL of the Las Vegas Restaurant Inspections CSV file to PixieDust's sampleData API and store the resultant dataframe into an inspections variable. In the output, you will see logging from PixieDust as it downloads the files and creates the dataframe.

View the Data

Now that you have the data into a dataframe in your notebook, it is time to take a look at it. With PixieDust's display API, you can easily view and visualize the data.

In a new cell enter and run:

display(inspections)

The output from this cell is the PixieDust display output which includes a toolbar and a visualization area:


display


By default, you will be presented with the Table View showing a sampling (100 rows max) of the data and the schema of the data, that is which columns are strings, integers, etc.

Filter the Data

Looking at the restaurants data in the table, you may notice it contains entries for restaurants outside of Las Vegas. You can however, filter this to a subset of only Las Vegas restaurants.

In a new cell enter and run:

inspections.registerTempTable("restaurants")
lasDF = sqlContext.sql("SELECT * FROM restaurants WHERE city='Las Vegas'")
lasDF.count()

Using a basic SQL query, you filtered the data and created a new dataframe with only restaurants in Las Vegas. The cell output is a count of entries specifically for Las Vegas.

Visualize the Data

With your data ready to go, you can begin to visualize it as charts and not just a simple table.

Number of Restaurants By Categories

In a new cell enter and run:

bycat = lasDF.groupBy("category_name").count()
display(bycat)

The result is a new table showing the number of entries by categories in the city of Las Vegas. From the PixieDust display output toolbar, you can view this data in multiple ways:

  1. Click the Chart dropdown menu and choose Bar Chart

    toolbar

  2. From the Chart Options dialog

    1. Drag the category_name field and drop it into the Keys area

    2. Drag the count field and drop it into the Values area

    3. Set the # of Rows to Display to 1000

      options

  3. Click OK

And just like that you have a bar chart showing the percentages of the entries with a given grade!

Rendering Options

You can play around with the chart further to provide a better visual experience. PixieDust supports mulitple renderers, each with their own set of features and distinct look. The default renderer is matplotlib but you can easily switch to a different renderer.

  1. Click the Renderer dropdown menu and choose bokeh

    renderer

  2. Toggle the Show Legend Bar Chart Option to show or hide the legend

    renderer

The result is a nice bar chart showing the count of the different categories of places to eat. It's probably no surprise that most are restaurants and bars.

Inspection Demerits and Grades

What if you wanted to visualize something a little more complex? What if you wanted to see the average number of inspection demerits per category clustered by the inspection grade? Give it a try!

  1. In a new cell enter and run:

    display(lasDF)
    
  2. Click the Chart dropdown menu and choose Bar Chart

  3. From the Chart Options dialog

    1. Drag the category_name field and drop it into the Keys area
    2. Drag the inspection_demerits field and drop it into the Values area
    3. Set the Aggregation to AVG
    4. Set the # of Rows to Display to 1000
    5. Click OK
  4. Click the Renderer dropdown menu and choose bokeh

  5. Click the Cluster By dropdown menu and choose inspection_grade

  6. Click the Type dropdown menu and choose the desired bar type (e.g., stacked)

    stacked

Current Demerits vs Inspection Demerits

You are not restricted to just bar charts. You can try other charts to gain additional insights and different perspective of the data.

  1. Click the Options button to launch the Chart Options dialog

  2. From the Chart Options dialog

    1. Set the Keys to inspection_demerits
    2. Set the Values to current_demerits
    3. Set the # of Rows to Display to 1000
    4. Click OK
  3. Click the Chart dropdown menu and choose Scatter Plot

  4. Select bokeh from the Renderer dropdown menu

  5. Select inspection_grade from the Color dropdown menu

    scatter

What can be gathered from this chart?

Map the Data

When looking at the sample data, you may have noticed it also includes the location data of the restaurants. Plotting these points on a map can also be done with PixieDust.

Access Token

For the Map renderers, a token is required for them to display properly. Currently, PixieDust has two map renderers (i.e, Google, MapBox). For this section of the tutorial, you will be using the MapBox renderer and thus a MapBox API Access Token will need to be created if you choose to continue.

Open a new browser tab (do not close the DSX browser tab):

  1. If you do not have an MapBox account, please Sign up for one: https://www.mapbox.com/studio/signup

  2. If you are not already logged into MapBox, go to https://www.mapbox.com and Log in

  3. Navigate to your MapBox account page: https://www.mapbox.com/studio/account

  4. Click the API access tokens tab

  5. Click Create a new token and give your new token a name

  6. Click on Generate

  7. Make note of your token

  8. Return to your notebook in DSX but do not close the MapBox page just yet

Shape the Data

The current data includes the longitude/latitude in the location_1 field as a string like such: POINT (-114.923505 36.114434)

However, the current Map renderers in PixieDust expect the longitude and latitude as separate number fields. The first thing you will need to do is parse the location_1 field into separate longitude and latitude number fields.

Note: Python is indentation sensitive. Do not mix space and tab indentations. Either use strictly spaces or tabs for all indentations.

The last character in the field name location_1 is the number 1.

In a new cell enter and run:

from pyspark.sql.functions import udf
from pyspark.sql.types import *

def valueToLon(value):
    lon = float(value.split('POINT (')[1].strip(')').split(' ')[0])
    return None if lon == 0 else lon if lon < 0 else (lon * -1)

def valueToLat(value):
    lat = float(value.split('POINT (')[1].strip(')').split(' ')[1])
    return None if lat == 0 else lat

udfValueToLon = udf(valueToLon, DoubleType())
udfValueToLat = udf(valueToLat, DoubleType())

lonDF = lasDF.withColumn("lon", udfValueToLon("location_1"))
lonlatDF = lonDF.withColumn("lat", udfValueToLat("location_1"))

lonlatDF.printSchema()

You should have a new dataframe (lonlatDF) with two new columns (lon, lat) which contain the longitude and latitude for the restaurant.

View the Map Data

You are ready to view the data on a map.

  1. In a new cell enter and run:

    display(lonlatDF)
    
  2. Click the Chart dropdown menu and choose Map

  3. From the Chart Options dialog

    1. Drag the lon field and the lat field and drop it into the Keys area

    2. Drag the current_demerits field and drop it into the Keys area

    3. Set the # of Rows to Display to 1000

    4. Enter your access token from MapBox into the MapBox Access Token field. If you left the MapBox browser tab open you may return to it, copy the token and paste it here.

      options

    5. Click OK

  4. Click the kind dropdown menu and choose choropleth

You can move around the map and zoom into the various areas and get a quick glimpse of the restaurants current_demerits based on it's color on the map.

mapbox

Summary

Before finishing the tutorial and stepping away do not forget to sign out of DSX, MapBox, and close out of any additional tabs you opened up.

In this tutorial, you covered some of the basics of visualizing data from a Jupyter Notebook with PixieDust in the IBM Data Science Experience. Visualization is just one aspect of PixieDust. PixieDust contains additional features such as a Package Manager, Spark Progress Monitor, and Scala Bridge to name a few.

Likewise DSX has numerous tools to analyze your data. DSX tools make it easier to share, collaborate, and solve your toughest data challenges.

Feel free to sign back into DSX at later time and continue with analyzing and visualizing this data further. Better yet, load and start experimenting with your own data.

Links

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.