OK, now that we have the basics of **IPython Notebooks** down, let's get to work!

As is almost always the case when working with **Python**, we will need more than its built-in functionality as we develop our analytical pipelines.

To make this additional functionality available (in particular, to be able to use **pandas**), we will rely on a couple of `import` statements.

Here they are:

In [1]:
import pandas as pd
import numpy as np

The code above did two things:

* Loaded in all of the functionality that **pandas** provides (`import pandas as pd`)
* Loaded in additional functionality from **NumPy**, a separate package that **pandas** relies on (`import numpy as np`)

Importantly, `pd` is now the alias (a shorter name) for the entire `pandas` library and `np` is the alias for the `numpy` library. Instead of typing `pandas.something` or `numpy.something` to access a given function, you can now type `pd.something` or `np.something`.
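
If you want to convince yourself that an alias is only a second name for the same library, here is a small sketch. It uses Python 3 `print()` syntax (the cells in this notebook use the Python 2 form), and the variable names are made up for illustration.

```python
import pandas
import pandas as pd

# Both names point at the very same module object, so nothing is loaded twice.
same_module = pd is pandas
same_class = pd.Series is pandas.Series

print(same_module, same_class)
```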

So what exactly is [**pandas**](http://pandas.pydata.org) and why the funny name (we will talk about [**NumPy**](http://www.numpy.org) a bit later)?

**pandas** is a data analysis library written in and for the **Python** programming language. The name is derived from "**pan**el **da**ta", an econometrics term for multidimensional structured datasets, and it also plays on the phrase "Python data analysis".

It provides open source, easy-to-use data structures and data analysis tools.

We will be using it exclusively for the next two days.

Before we get started with an actual dataset, let's make a dummy dataset and just understand the basics of the two main kinds of objects we will be working with in **pandas**, `Series` and `DataFrame` objects.

Here is an example `Series` stored in a variable we will call `example_series`:

In [2]:
example_series = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
print example_series

a    0
b    1
c    2
d    3
e    4
dtype: int64


From the **pandas** documentation, a `Series` is "a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)."

This means that it is simply a table with a single column (unnamed in our example) and an `index`, a set of labels that identifies each row in that `Series`.

In our case, `example_series` contains 5 rows, whose values are the integers from 0-4 (inclusive) and whose index values are the letters a-e (inclusive).

To create a `Series` object you call `pd.Series(data, index)`, where `data` is the data you want stored and `index` holds the row labels. The `index` is **optional**: if you don't provide it, one will be made for you:

In [4]:
example_series_no_index_given = pd.Series(range(5))
print example_series_no_index_given

0    0
1    1
2    2
3    3
4    4
dtype: int64


By default, when you don't provide an `index` **pandas** constructs one for you, starting at 0 and ending at the number of rows found in the `Series` minus 1. 
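
As a quick check of that rule, the sketch below builds a `Series` from made-up values without an index and looks at the first and last labels pandas generated. It is written for Python 3 and a recent pandas, which is an assumption on my part, since the cells here use Python 2 syntax.

```python
import pandas as pd

values = [10, 20, 30]  # made-up data, used only for this illustration
no_index = pd.Series(values)

# The generated labels run from 0 up to (number of rows - 1).
first_label = no_index.index[0]
last_label = no_index.index[-1]

print(list(no_index.index))
print(last_label == len(no_index) - 1)
```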

To access just the values or just the index of a `Series` object, use the `values` or `index` attribute of the objects you just created:

In [11]:
print example_series.values
print example_series.index
print example_series_no_index_given.index

[0 1 2 3 4]
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
Int64Index([0, 1, 2, 3, 4], dtype='int64')


As you can see, the indices of the two objects we just created are different and have different dtypes (one is `int64` and the other is `object`).

`object` is just **pandas**' way of saying that the type is something other than a number or a `DateTime` (the latter is used for time series, which we will cover later). Here it holds strings.
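
If you would rather test the kind of index in code than read it off the printout, something like the following should work. It assumes a recent pandas and Python 3; the exact dtype name printed for the string labels can differ between pandas versions, so the sketch only asks whether each index holds integers.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

with_labels = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])
without_labels = pd.Series(range(5))

# True only when the index holds integers.
labels_are_ints = is_integer_dtype(with_labels.index)
default_are_ints = is_integer_dtype(without_labels.index)

print(with_labels.index.dtype, labels_are_ints)
print(without_labels.index.dtype, default_are_ints)
```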

You can access values in a `Series` by their `index`:

In [14]:
print example_series['a']
print example_series_no_index_given[0]

0
0


Or by their position in the `Series`, either one position or several at a time:

In [6]:
print "A single value: ", example_series[0]

A single value:  0


When you access multiple rows, you get a `Series` back instead of a single value:

In [7]:
print "A series in return: \n", example_series[0:2]

A series in return: 
a    0
b    1
dtype: int64
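
In recent versions of pandas it is common to spell out which kind of access you mean: `.loc` for index labels and `.iloc` for positions. The sketch below repeats the lookups above in that style. It assumes Python 3 and a recent pandas, and the variable names are only for illustration.

```python
import pandas as pd

example_series = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

by_label = example_series.loc['a']         # look up by index label
by_position = example_series.iloc[0]       # look up by position
first_two = example_series.iloc[0:2]       # positions 0 and 1; the stop is excluded
a_through_c = example_series.loc['a':'c']  # label slices include the end label

print(by_label, by_position)
print(first_two)
print(a_through_c)
```

Note the difference between the two slices: a positional slice leaves out its end point, while a label slice includes it.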


You can also rearrange the values in a `Series` when you query it:

In [24]:
print "A series rearranged: \n", example_series[['d','a','c']]

A series rearranged: 
d    3
a    0
c    2
dtype: int64


When working with `Series` objects, you can do all sorts of math and selections on them (as long as the values in the object are numbers!):

In [30]:
print "Multiplying every value in the series * 2: \n", example_series * 2,"\n"
print "Get those indices in the Series that have values greater than 1: \n",\
example_series > 1, "\n"
print "Select those values in the Series that have values greater than 1: \n",\
example_series[example_series > 1]

Multiplying every value in the series * 2: 
a    0
b    2
c    4
d    6
e    8
dtype: int64 

Get those indices in the Series that have values greater than 1: 
a    False
b    False
c     True
d     True
e     True
dtype: bool 

Select those values in the Series that have values greater than 1: 
c    2
d    3
e    4
dtype: int64
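
The selection in the last line happens in two steps, which the sketch below pulls apart: the comparison first produces a `Series` of `True`/`False` values (often called a mask), and that mask is then used to pick rows. The name `mask` is my own; the code assumes Python 3.

```python
import pandas as pd

example_series = pd.Series(range(5), index=['a', 'b', 'c', 'd', 'e'])

# Step 1: the comparison is applied to every value and yields booleans.
mask = example_series > 1

# Step 2: indexing with the mask keeps only the rows where it is True.
selected = example_series[mask]

print(mask.tolist())
print(selected)
```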


**Whenever you extract a single column from a `DataFrame` object, or whenever you compute some values on a `DataFrame` object that are only a single column, you will always get a `Series` back in return.**

Conceptually, a `Series` object behaves much like a **Python** `dict` object (which you should have practiced with in the pre-work!), where the `keys` are the index values in the `Series` and the `values` of the `dict` are the actual values stored in the `Series`. It is not literally a `dict`, though: the values are stored in a **NumPy** array, and a `Series` supports positional access and element-wise math, which a `dict` does not.

This is important to understand for the remainder of the course. If you only get a single column, it's a `Series` (which you can think of as a `dict`). If there are multiple columns together, you get a `DataFrame`.
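
To make the `dict` comparison concrete, here is a small sketch with made-up data showing a few operations that work the same way on both. It assumes Python 3, and it is only meant to illustrate the analogy, not how pandas stores the data.

```python
import pandas as pd

as_dict = {'a': 0, 'b': 1, 'c': 2}  # made-up data for illustration
as_series = pd.Series(as_dict)      # the keys become the index

has_key = 'a' in as_series          # membership is tested against the index
value_b = as_series['b']            # lookup by key, just like a dict
fallback = as_series.get('z', -1)   # .get() with a default also works
round_trip = as_series.to_dict()    # and you can convert back to a dict

print(has_key, value_b, fallback)
print(round_trip == as_dict)
```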

So let's talk about `DataFrame` objects now. 

Here is an example `DataFrame` object:

In [48]:
d = {'one': pd.Series(range(4), index=['a','b','c','e']),
    'two': pd.Series(['aa','oo','ee','ii',"yy"],index=['a','b','c','d','e'])}
example_df = pd.DataFrame(d)
print "An example dataframe:"
example_df

An example dataframe:


|   | one | two |
|---|-----|-----|
| a | 0.0 | aa  |
| b | 1.0 | oo  |
| c | 2.0 | ee  |
| d | NaN | ii  |
| e | 3.0 | yy  |


`example_df` is a `DataFrame` that contains 2 columns, `one` and `two`. They have different datatypes and an `index` that is non-numeric:

In [49]:
print "The datatypes for the columns in the DataFrame: \n", example_df.dtypes ,"\n"
print "The index of the DataFrame: \n", example_df.index
print "The values in the DataFrame: \n", example_df.values

The datatypes for the columns in the DataFrame: 
one    float64
two     object
dtype: object 

The index of the DataFrame: 
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
The values in the DataFrame: 
[[0.0 'aa']
 [1.0 'oo']
 [2.0 'ee']
 [nan 'ii']
 [3.0 'yy']]


Also, notice that in our example `DataFrame` one of the elements is `NaN`. The label `d` appears in the index of the second column, but no value was supplied for `d` in the first column. **pandas** aligns the columns on their index labels and automatically fills in `NaN` wherever a value is missing (`NaN` stands for "not a number" and is the default way **pandas** represents nulls). This is also why column `one` has dtype `float64` even though we supplied integers: `NaN` is a floating point value.
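
Here is one way to find the missing entry in code rather than by eye. The sketch rebuilds the same `DataFrame` and uses `isna()`, assuming Python 3 and a recent pandas; the helper variable names are mine.

```python
import pandas as pd

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
     'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                      index=['a', 'b', 'c', 'd', 'e'])}
example_df = pd.DataFrame(d)

# isna() marks every missing entry in the column with True.
missing = example_df['one'].isna()
rows_missing = list(missing[missing].index)

# The column started out as integers and became floats to make room for NaN.
original_dtype = d['one'].dtype
aligned_dtype = example_df['one'].dtype

print(rows_missing)
print(original_dtype, aligned_dtype)
```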

As an aside, the `u` before each letter in the `index` tells you that the labels are Unicode strings (the prefix is how Python 2 displays them). Unicode is a standard that can represent far more symbols than ASCII can handle (things like characters in non-European languages, characters with accents, non-standard symbols, etc.).

You can also access the column names directly:

In [38]:
example_df.columns

Index([u'one', u'two'], dtype='object')

And you can access the values in a column by passing the column name to the dataframe:

In [39]:
example_df["one"]

a     0
b     1
c     2
d   NaN
e     3
Name: one, dtype: float64

You can also access the values in a set of rows and columns by their position.

To do so, treat the values in the dataframe as a 2-d grid and access the specific elements you want directly, giving the rows first and the columns second. If you want a whole row or column, use `:`.

Here is an example that gets all of the values in the first column, just as I did above (remember, in Python indexing starts from 0, not 1):

In [36]:
example_df.ix[:,0]

a     0
b     1
c     2
d   NaN
e     3
Name: one, dtype: float64

And here is how I would access only the first two rows in the second column of the dataframe by either calling the column or via indexing on the values:

In [40]:
print "Calling the specific column: \n", example_df["two"][0:2],"\n"
print "Using pure indexing on the values: \n", example_df.ix[0:2,1]

Calling the specific column: 
a    aa
b    oo
Name: two, dtype: object 

Using pure indexing on the values: 
a    aa
b    oo
Name: two, dtype: object
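
A note for anyone running a newer pandas: the `.ix` indexer was deprecated in pandas 0.20 and removed in pandas 1.0. The usual replacements are `.iloc` for positions and `.loc` for labels, so the selections above could be sketched as follows (Python 3; the variable names are just for illustration).

```python
import pandas as pd

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
     'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                      index=['a', 'b', 'c', 'd', 'e'])}
example_df = pd.DataFrame(d)

first_column = example_df.iloc[:, 0]            # every row, first column
first_two_of_second = example_df.iloc[0:2, 1]   # rows 0 and 1, second column
same_by_label = example_df.loc['a':'b', 'two']  # the same cells, chosen by label

print(first_column)
print(first_two_of_second)
```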


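Also, keep in mind that `0:2` actually means the positions 0 and 1, excluding 2.

If you want to go from some position to the end, leave the end of the slice empty, as in `2:`. The examples below write this as `2::`, which is equivalent because the optional step after the second colon is also left empty.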

So, here is a way to get all of the rows in the first column from the 3rd row on (again, I will show you two ways of doing it):

In [41]:
print "Access via column name: \n", example_df["one"][2::],"\n"
print "Pure indexing: \n", example_df.ix[2::,0]

Access via column name: 
c     2
d   NaN
e     3
Name: one, dtype: float64 

Pure indexing: 
c     2
d   NaN
e     3
Name: one, dtype: float64
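
The same open-ended selections can be written with `.iloc`, which is what a pandas version without `.ix` would need. This sketch (Python 3, illustrative variable names) also confirms that `2:` and `2::` select the same rows.

```python
import pandas as pd

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
     'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                      index=['a', 'b', 'c', 'd', 'e'])}
example_df = pd.DataFrame(d)

from_third = example_df['one'].iloc[2:]             # 3rd row to the end
from_third_with_step = example_df['one'].iloc[2::]  # same slice, empty step
every_column = example_df.iloc[2:, :]               # 3rd row on, all columns

print(from_third)
print(every_column)
```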


And this is how you would get all of the values in every column from the 3rd row down:

In [43]:
print "Calling via access on the dataframe: \n", example_df[2::], "\n"
print example_df.ix[2::,:]

Calling via access on the dataframe: 
   one two
c    2  ee
d  NaN  ii
e    3  yy 

   one two
c    2  ee
d  NaN  ii
e    3  yy


If by now you are starting to grok how you access data via pure indexing, then you should quickly see that the following two ways to access all the data in our example dataframe are functionally equivalent:

In [24]:
print "This is one way to get the whole dataframe: \n", example_df,"\n"
print "And this one is equivalent: \n", example_df.ix[:,:]

This is one way to get the whole dataframe: 
   one two
a    0  aa
b    1  oo
c    2  ee
d  NaN  ii
e    3  yy 

And this one is equivalent: 
   one two
a    0  aa
b    1  oo
c    2  ee
d  NaN  ii
e    3  yy


Selecting and performing math on columns within a `DataFrame` object works identically to how it does in a `Series`, except you need to be careful that the type of the column you're working on matches the operation you're trying to perform:

In [34]:
example_df["one"] * 2

a     0
b     2
c     4
d   NaN
e     6
Name: one, dtype: float64

Otherwise the result may not be what you want. Multiplying a column of strings by 2, for example, repeats each string instead of raising an error:

In [46]:
example_df["two"] * 2

a    aaaa
b    oooo
c    eeee
d    iiii
e    yyyy
Name: two, dtype: object
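
One way to guard against this kind of surprise is to check which columns are numeric before doing arithmetic. The sketch below uses `is_numeric_dtype` from `pandas.api.types` and assumes Python 3 and a recent pandas; it is one possible approach, not the only one.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
     'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                      index=['a', 'b', 'c', 'd', 'e'])}
example_df = pd.DataFrame(d)

# Keep only the columns on which arithmetic means what we expect.
numeric_columns = [name for name in example_df.columns
                   if is_numeric_dtype(example_df[name])]
doubled = example_df[numeric_columns] * 2

# For comparison, this is what plain Python does with a string.
python_repeat = 'aa' * 2

print(numeric_columns)
print(doubled)
print(python_repeat)
```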

What does all of this mean? How do `DataFrame` and `Series` objects relate to each other? 

A `DataFrame` is essentially a collection of `Series` objects that all share the same index.

Again, straight from the [**pandas documentation**](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe):

"...`DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a `dict` of `Series` objects."

So, at bottom, what we will be doing in this workshop is learning how to manipulate `dict`-like objects in a variety of useful ways.
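
The "`dict` of `Series`" picture can be checked directly. The sketch below takes our example `DataFrame` apart into its columns and puts it back together again. It assumes Python 3 and a recent pandas, and the helper names are mine.

```python
import pandas as pd

d = {'one': pd.Series(range(4), index=['a', 'b', 'c', 'e']),
     'two': pd.Series(['aa', 'oo', 'ee', 'ii', 'yy'],
                      index=['a', 'b', 'c', 'd', 'e'])}
example_df = pd.DataFrame(d)

# items() yields (column name, column) pairs, much like a dict does.
columns = {name: column for name, column in example_df.items()}

all_series = all(isinstance(c, pd.Series) for c in columns.values())
shared_index = all(c.index.equals(example_df.index) for c in columns.values())

# A dict of Series sharing one index rebuilds the same DataFrame.
rebuilt = pd.DataFrame(columns)

print(sorted(columns), all_series, shared_index)
print(rebuilt.equals(example_df))
```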

This is just a basic prelude to get you to understand what we are going to be dealing with.

Just to get some practice with `DataFrame` and `Series` objects, do the following:

1. Get all of the values in the first column of `example_df`
2. Get all of the values in the second column of `example_df`
3. Get all of the values less than 2 in `example_df` and in `example_series`
4. Get the value found in the 4th row of the second column in `example_df`
5. Get the values in every column from the 4th row on in `example_df`
6. Divide every value in `example_series` by 3

In [None]:
##YOUR CODE HERE