# Tutorial #1: Get Data

This tutorial covers:

1. An introduction to Notebooks
1. A quick tour of IBM Data Scientist Workbench
1. Uploading files to the workbench
1. Renaming files in the workbench
1. Importing files from external URLs
1. Loading a CSV file into a `pandas` DataFrame
1. Manipulating a DataFrame

## Import a File
You can also import a publicly addressable file into your workbench data folder by entering its URL into the search box in the top navigation bar.  The file at the URL will automatically download into your workbench data folder as long as it is:

1. accessible via HTTP or HTTPS protocol, and
1. one of the following media types:
    * Plain text
    * CSV
    * JSON (including `*.ipynb` notebooks)
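
The two conditions above can be sketched as a small check. This is only an illustration, not the workbench's actual logic: the function name `is_importable` and the mapping of the listed media types to the MIME strings `text/plain`, `text/csv`, and `application/json` are assumptions.

```python
from urllib.parse import urlparse

# Assumed MIME strings for the media types listed above (plain text, CSV, JSON).
SUPPORTED_MEDIA_TYPES = {"text/plain", "text/csv", "application/json"}


def is_importable(url, media_type):
    """Hypothetical helper: True if the URL scheme and media type meet both conditions."""
    scheme = urlparse(url).scheme.lower()
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = media_type.split(";")[0].strip().lower()
    return scheme in {"http", "https"} and base_type in SUPPORTED_MEDIA_TYPES


print(is_importable("https://example.com/medals.csv", "text/csv"))
print(is_importable("ftp://example.com/medals.csv", "text/csv"))
```

The first call passes both checks. The second fails because `ftp` is neither HTTP nor HTTPS.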

## Load Data
**[pandas](http://pandas-docs.github.io/pandas-docs-travis/)** is a Python package that provides data structures for managing structured data.  The two primary data structures of pandas are the [Series](http://pandas-docs.github.io/pandas-docs-travis/dsintro.html#series) (1-dimensional) and [DataFrame](http://pandas-docs.github.io/pandas-docs-travis/dsintro.html#dataframe) (2-dimensional).
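
As a minimal sketch of the two structures, the cell below builds one of each by hand. The variable names and the values are made up for illustration and are not taken from the medals file.

```python
import pandas

# A Series is a one-dimensional labeled array (illustrative values).
medal_counts = pandas.Series([2, 1, 1], index=["Gold", "Silver", "Bronze"])

# A DataFrame is a two-dimensional table; each column is itself a Series.
sample_df = pandas.DataFrame({
    "NOC": ["AUT", "AUT", "BEL"],
    "Medal": ["Silver", "Gold", "Bronze"],
})

print(medal_counts.ndim)        # number of dimensions of the Series
print(sample_df.ndim)           # number of dimensions of the DataFrame
print(type(sample_df["NOC"]))   # selecting one column yields a Series
```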

In the following steps, we'll load the Olympic medals by country CSV file into a DataFrame in memory.

### Step 1: Import the `pandas` package into the notebook
Click on the code cell below, then click the right arrow button (**&#9658;**) in the notebook toolbar to run the code.

In [1]:
import pandas

### Step 2: Create a new code cell
Click on the plus button (**+**) in the notebook toolbar to create a new cell.

Click the newly created cell and enter the following line of code:
<pre>medals_df = pandas.read_csv('')</pre>

In [2]:
medals_df = pandas.read_csv('medals.csv')
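
`read_csv` accepts a file path, a URL, or a file-like object. If you do not have the file at hand, the sketch below shows the same call on a small in-memory CSV. The text in `csv_text` is a made-up stand-in and not the real contents of the medals file.

```python
import io
import pandas

# Hypothetical CSV text standing in for a file on disk.
csv_text = """Year,City,Sport,NOC,Medal
1924,Chamonix,Skating,AUT,Silver
1924,Chamonix,Skating,AUT,Gold
1924,Chamonix,Bobsleigh,BEL,Bronze
"""

# The first line is used as the column header by default.
sample_df = pandas.read_csv(io.StringIO(csv_text))

print(sample_df.shape)            # (rows, columns)
print(list(sample_df.columns))    # column names taken from the header line
```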

## Manipulate a DataFrame
Now that we have the data in memory, we can explore and manipulate it.

Display the first five and last five rows of the data using the `head()` and `tail()` methods.  Run each code cell below.

In [3]:
medals_df.head()

| Unnamed: 0 | Year | City | Sport | Discipline | NOC | Event | Event gender | Medal |
|---|---|---|---|---|---|---|---|---|
| 0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
| 1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
| 2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
| 3 | 1924 | Chamonix | Bobsleigh | Bobsleigh | BEL | four-man | M | Bronze |
| 4 | 1924 | Chamonix | Ice Hockey | Ice Hockey | CAN | ice hockey | M | Gold |
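
Both `head()` and `tail()` return five rows by default and accept an optional row count. The sketch below uses a made-up ten-row DataFrame (`numbers_df` is a hypothetical name) so the effect is easy to see.

```python
import pandas

# Hypothetical ten-row DataFrame with the values 0 through 9.
numbers_df = pandas.DataFrame({"value": range(10)})

print(len(numbers_df.head()))                 # default row count
print(numbers_df.head(3)["value"].tolist())   # first three rows
print(numbers_df.tail(2)["value"].tolist())   # last two rows
```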


In [4]:
medals_df.tail()

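| Unnamed: 0 | Year | City | Sport | Discipline | NOC | Event | Event gender | Medal |
|---|---|---|---|---|---|---|---|---|
| 2309 | 2006 | Turin | Skiing | Snowboard | USA | Snowboard Cross | M | Gold |
| 2310 | 2006 | Turin | Skiing | Snowboard | USA | Snowboard Cross | W | Silver |
| 2311 | | | | | | | | |
| 2312 | SOURCE | IOC | | | | | | |
| 2313 | DATALINK | http://www.olympic.org/ | | | | | | |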


The tail output shows that the CSV file ends with lines that are not data.  Most cells in these rows are empty, and pandas represents an empty cell as `NaN` (not a number).
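
One way to locate such rows is to count the missing values in each row with `isna()`. The sketch below uses a small hand-built DataFrame that imitates the shape of the problem. Its rows are illustrative and were not read from the medals file.

```python
import pandas

# Hypothetical data: two complete rows followed by two footer-like rows.
sample_df = pandas.DataFrame({
    "Year": ["2006", "2006", None, "SOURCE"],
    "City": ["Turin", "Turin", None, "IOC"],
    "Medal": ["Gold", "Silver", None, None],
})

# isna() marks each missing cell; summing across columns counts them per row.
missing_per_row = sample_df.isna().sum(axis=1)
has_missing = sample_df.isna().any(axis=1)

print(missing_per_row.tolist())
print(has_missing.tolist())
```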

We can prune these rows from our data by running the following code cell.

In [6]:
medals_df = medals_df.dropna()
medals_df.tail()

| Unnamed: 0 | Year | City | Sport | Discipline | NOC | Event | Event gender | Medal |
|---|---|---|---|---|---|---|---|---|
| 2306 | 2006 | Turin | Skiing | Snowboard | USA | Half-pipe | M | Silver |
| 2307 | 2006 | Turin | Skiing | Snowboard | USA | Half-pipe | W | Gold |
| 2308 | 2006 | Turin | Skiing | Snowboard | USA | Half-pipe | W | Silver |
| 2309 | 2006 | Turin | Skiing | Snowboard | USA | Snowboard Cross | M | Gold |
| 2310 | 2006 | Turin | Skiing | Snowboard | USA | Snowboard Cross | W | Silver |
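
By default, `dropna()` removes every row that contains at least one missing value, and it returns a new DataFrame rather than changing the original. The sketch below compares the default with the `how` and `subset` parameters on a small made-up DataFrame.

```python
import pandas

# Hypothetical data: one complete row, one partly empty row, one fully empty row.
sample_df = pandas.DataFrame({
    "NOC": ["USA", "USA", None],
    "Medal": ["Gold", None, None],
})

any_missing_dropped = sample_df.dropna()            # default how="any"
all_missing_dropped = sample_df.dropna(how="all")   # drop only fully empty rows
noc_required = sample_df.dropna(subset=["NOC"])     # only check the NOC column

print(len(any_missing_dropped), len(all_missing_dropped), len(noc_required))
print(len(sample_df))  # the original DataFrame keeps all of its rows
```

If a data set has legitimate rows with a few empty cells, `how="all"` or `subset` may be the safer choice, since the default would discard those rows as well.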


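Now we can sort the data by country (`NOC`), year, event, and type of medal. Each entry in the `ascending` list applies to the column at the same position: 1 sorts that column in ascending order and 0 sorts it in descending order, so here only `Medal` is sorted in descending order.

In [8]:
medals_df.sort_values(['NOC', 'Year', 'Event', 'Medal'], ascending=[1, 1, 1, 0])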

| Unnamed: 0 | Year | City | Sport | Discipline | NOC | Event | Event gender | Medal |
|---|---|---|---|---|---|---|---|---|
| 1437 | 1994 | Lillehammer | Skating | Short Track S. | AUS | 5000m relay | M | Bronze |
| 1620 | 1998 | Nagano | Skiing | Alpine Skiing | AUS | slalom | W | Bronze |
| 1825 | 2002 | Salt Lake City | Skating | Short Track S. | AUS | 1000m | M | Gold |
| 1826 | 2002 | Salt Lake City | Skiing | Freestyle Ski. | AUS | aerials | W | Gold |
| 2059 | 2006 | Turin | Skiing | Freestyle Ski. | AUS | aerials | W | Bronze |
| 2060 | 2006 | Turin | Skiing | Freestyle Ski. | AUS | moguls | M | Gold |
| 0 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | M | Silver |
| 1 | 1924 | Chamonix | Skating | Figure skating | AUT | individual | W | Gold |
| 2 | 1924 | Chamonix | Skating | Figure skating | AUT | pairs | X | Gold |
| 49 | 1928 | St. Moritz | Skating | Figure skating | AUT | individual | M | Silver |
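
The multi-column sort can be tried on a smaller scale. The sketch below uses a four-row made-up DataFrame and boolean flags, which `sort_values` accepts in the same way as 1 and 0.

```python
import pandas

# Hypothetical rows, deliberately out of order.
sample_df = pandas.DataFrame({
    "NOC": ["AUT", "AUS", "AUT", "AUS"],
    "Year": [1924, 2002, 1924, 1994],
    "Medal": ["Gold", "Gold", "Silver", "Bronze"],
})

# Sort by NOC and Year ascending, then by Medal descending within ties.
sorted_df = sample_df.sort_values(
    ["NOC", "Year", "Medal"], ascending=[True, True, False]
)

print(sorted_df)
print(sample_df.index.tolist())  # the original row order is unchanged
```

`sort_values` returns a new DataFrame, so assign the result to a variable if you want to keep it. Note also that `Medal` holds text, so descending order is reverse alphabetical (Silver, Gold, Bronze) and not a ranking by medal value.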
