### Loading text data from remote sources

Data files commonly reside in remote sources, such as public or private marketplaces or GitHub repositories. You can load comma-separated values (CSV) data files using PixieDust's `sampleData` method.

#### Prerequisites

Import PixieDust 

In [1]:
import pixiedust

Pixiedust database opened successfully


Running a notebook cell that loads or processes data might trigger the execution of one or more Apache Spark jobs.

#### Enable Apache Spark job monitoring

In [2]:
pixiedust.enableJobMonitor()

Succesfully enabled Spark Job Progress Monitor


#### Loading data

To load a data set, invoke `pixiedust.sampleData` and specify the data set URL:

In [10]:
homes = pixiedust.sampleData("https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv")



Downloading 'https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv' from https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv


Creating pySpark DataFrame for 'https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv'. Please wait...


Successfully created pySpark DataFrame for 'https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv'


<div class="alert alert-block alert-info">
`pixiedust.sampleData` loads the data into an [Apache Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes), which you can inspect and visualize using `display()`.
</div>
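
For comparison, fetching a remote CSV file without PixieDust comes down to downloading the text and parsing it. The sketch below uses only the Python standard library; the helper name `fetch_csv_rows` is hypothetical, and it is not how `pixiedust.sampleData` is implemented. It assumes the file is UTF-8 encoded and has a header row.

```python
import csv
import io
import urllib.request


def fetch_csv_rows(url):
    """Download a CSV file and return its rows as a list of dicts keyed by header."""
    with urllib.request.urlopen(url) as response:
        # Decode the raw bytes; UTF-8 is an assumption about the file's encoding.
        text = response.read().decode("utf-8")
    # DictReader uses the first line as column names; every value stays a string.
    return list(csv.DictReader(io.StringIO(text)))
```

Calling `fetch_csv_rows` with the data set URL used above would return one dictionary per listing. Unlike the Spark DataFrame created by `sampleData`, this sketch infers no schema, so numeric columns such as `PRICE` would still need to be converted.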


#### Inspecting and previewing the loaded data

To inspect the automatically inferred schema and preview a small subset of the data, you can use the _DataFrame Table_ view, as shown in the preconfigured example below:


In [11]:
display(homes)

PROPERTY TYPE,ADDRESS,CITY,STATE,ZIP,PRICE,BEDS,BATHS,LOCATION,SQFT,LOT SIZE,YEAR BUILT,DAYS ON MARKET,URL,SOURCE,LISTING ID,LATITUDE,LONGITUDE
Single Family Residential,4 Newbury Road Rd,Windham,NH,3087,2450000,5.0,7.5,Windham,13461,139392.0,2008,84.0,http://www.redfin.com/NH/Windham/4-Newbury-Rd-03087/home/96548208,NEREN,58467283.0,42.83153747,-71.27639808
Single Family Residential,25 Marshall Rd,Wellesley,MA,2482,1909847,5.0,4.5,Wellesley,4900,12228.0,2016,71.0,http://www.redfin.com/MA/Wellesley/25-Marshall-Rd-02482/home/105557102,MLS PIN,61782463.0,42.2997542,-71.3088256
Single Family Residential,15 E Meadow Ln,Middleton,MA,1949,1177500,,2.5,,4263,40281.0,2015,,http://www.redfin.com/MA/Middleton/15-E-Meadow-Ln-01949/home/67981805,,,42.585715,-71.012888
Condo/Co-op,983 Memorial Dr #302,Cambridge,MA,2138,1100000,3.0,2.0,Harvard Square,1606,,1920,74.0,http://www.redfin.com/MA/Cambridge/983-Memorial-Dr-02138/unit-302/home/105594755,MLS PIN,61690710.0,42.3722656,-71.1252212
Condo/Co-op,1 Franklin St Ph 2E,Boston,MA,2110,8950000,3.0,4.5,Midtown,3435,,2016,86.0,http://www.redfin.com/MA/Boston/1-Franklin-St-02108/unit-2E/home/102070369,MLS PIN,55818606.0,42.35631,-71.05945
Condo/Co-op,18 Yarmouth St #1,Boston,MA,2116,2600000,3.0,3.5,South End,2522,,1880,88.0,http://www.redfin.com/MA/Boston/18-Yarmouth-St-02116/unit-1/home/9313347,MLS PIN,59168291.0,42.3458731,-71.0767967
Single Family Residential,128 Lowell St,Lexington,MA,2420,1185000,5.0,3.5,Lexington,3275,6300.0,2016,88.0,http://www.redfin.com/MA/Lexington/128-Lowell-St-02420/home/8553025,MLS PIN,59375875.0,42.436932,-71.190511
Single Family Residential,20 Jackson Rd,Wellesley,MA,2481,2165000,4.0,4.5,Wellesley,5199,16321.0,2016,88.0,http://www.redfin.com/MA/Wellesley/20-Jackson-Rd-02481/home/8964864,MLS PIN,51221892.0,42.307657,-71.252257
Condo/Co-op,30 Winchester St #3,Brookline,MA,2446,1400000,3.0,3.0,Coolidge Corner,1504,,1915,66.0,http://www.redfin.com/MA/Brookline/30-Winchester-St-02446/unit-3/home/105251020,MLS PIN,58480309.0,42.3420632,-71.1257602
Condo/Co-op,30 Winchester St #4,Brookline,MA,2446,1500000,3.0,3.0,Coolidge Corner,1584,,1915,66.0,http://www.redfin.com/MA/Brookline/30-Winchester-St-02446/unit-4/home/105251022,MLS PIN,58480311.0,42.3420632,-71.1257602


#### Simple visualization using bar charts

With PixieDust `display()`, you can visually explore the loaded data using built-in charts such as bar charts, line charts, scatter plots, or maps.

To explore a data set:
* choose the desired chart type from the drop-down
* configure chart options
* configure display options

We can analyze the average home price for each city by choosing:
* chart type: bar chart
* chart options
  * _Options > Keys_: `CITY`
  * _Options > Values_: `PRICE`
  * _Options > Aggregation_: `AVG`
 
Run the next cell to review the results. 

In [12]:
display(homes)
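
To make the `AVG` aggregation concrete, here is a minimal plain-Python sketch of the same grouping, applied to a handful of `CITY` and `PRICE` values copied from the preview rows shown earlier. It does not use Spark or PixieDust, and the helper name `average_by_key` is hypothetical; the chart itself computes the averages over the full data set.

```python
from collections import defaultdict
from statistics import mean

# (CITY, PRICE) pairs copied from the preview rows shown earlier.
listings = [
    ("Windham", 2450000),
    ("Wellesley", 1909847),
    ("Wellesley", 2165000),
    ("Boston", 8950000),
    ("Boston", 2600000),
    ("Brookline", 1400000),
    ("Brookline", 1500000),
]


def average_by_key(pairs):
    """Group (key, value) pairs by key and return the mean value per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


avg_price_by_city = average_by_key(listings)
for city, avg_price in sorted(avg_price_by_city.items()):
    print(city, avg_price)
```

Each bar in the chart corresponds to one key (a city), and its height to the aggregated value (the average price of that city's listings).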

#### Exploring the data

By changing the display **Options**, you can continue to explore the loaded data set without having to pre-process the data.

For example, by changing
* _Options > Keys_ to `YEAR BUILT` and
* _Options > Aggregation_ to `COUNT`

you can find out how old the listed properties are:

In [13]:
display(homes)
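
As a rough illustration of what the `COUNT` aggregation produces, the sketch below tallies listings per construction year in plain Python. The values are the `YEAR BUILT` entries of the ten preview rows shown earlier, so the counts cover only that subset and not the full data set; the variable names are illustrative.

```python
from collections import Counter

# YEAR BUILT values copied from the ten preview rows shown earlier.
years_built = [2008, 2016, 2015, 1920, 2016, 1880, 2016, 2016, 1915, 1915]

# Count how many listings share each construction year.
listings_per_year = Counter(years_built)

for year, count in sorted(listings_per_year.items()):
    print(year, count)
```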

#### Using sample data sets

PixieDust comes with a set of curated data sets that you can use to get familiar with the different chart types and options.

Type `pixiedust.sampleData()` to display those data sets.

In [14]:
pixiedust.sampleData()



Id,Name,Topic,Publisher
1,Car performance data,transportation,IBM
2,"Sample retail sales transactions, January 2009",Economy & Business,IBM Cloud Data Services
3,Total population by country,Society,IBM Cloud Data Services
4,GoSales Transactions for Naive Bayes Model,Leisure,IBM
5,Election results by County,Society,IBM
6,Million dollar home sales in NE Mass late 2016,Economy & Business,Redfin.com
7,"Boston Crime data, 2-week sample",Society,City of Boston


<div class="alert alert-block alert-info"> The home sales data set we loaded earlier is one of the samples. Therefore, we could also have loaded it by specifying the displayed data set id as the parameter: `homes = pixiedust.sampleData(6)`</div>

If your data isn't stored in CSV files, you can load it into a DataFrame from any supported Spark [data source](https://spark.apache.org/docs/latest/sql-programming-guide.html#data-sources). Refer to [these Python code snippets](https://apsportal.ibm.com/docs/content/analyze-data/python_load.html) for more information.