# Quick Introduction to pandas

[pandas](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.
Although a comprehensive introduction to the API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [pandas docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials. (Note that Datalab may use a slightly older version number, but the parts of pandas covered here are unlikely to differ from version to version.)

## Basic Concepts

The following line imports the pandas API and prints the API version:

In [None]:
import pandas as pd
pd.__version__

The primary data structures in pandas are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.
  * **`Series`**, which is a single column. A DataFrame contains one or more Series and a name for each Series.

The `DataFrame` is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

One way to create a Series is to construct a `Series` object. For example:

In [None]:
pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

`DataFrame` objects can be created by passing a `dict` mapping `string` column names to their respective `Series`. If the `Series` don't match in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. Example:

In [None]:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

pd.DataFrame({ 'City name': city_names, 'Population': population })
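To see the missing-value behavior described above, here is a minimal sketch in which one `Series` is deliberately shorter than the other. The names `names`, `partial_population`, and `mismatched` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative Series of different lengths; the second is one value short.
names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
partial_population = pd.Series([852469, 1015785])

mismatched = pd.DataFrame({'City name': names, 'Population': partial_population})

# The row without a population value holds NaN.
print(mismatched)

# isnull() marks which entries are missing.
print(mismatched['Population'].isnull())
```

Rows are aligned by index label, so the gap appears at label 2, where `partial_population` has no entry.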

Most of the time, though, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the cell below to load the data and display summary statistics.

In [None]:
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")
california_housing_dataframe.describe()

The example above used `DataFrame.describe` to show summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column of a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records:

In [None]:
california_housing_dataframe.head()
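If the remote file is not reachable, the same calls can be tried on a small in-memory CSV. This is only a sketch: the CSV text and the names `csv_text` and `small_dataframe` are made up for illustration and have nothing to do with the housing data.

```python
import io

import pandas as pd

# A tiny, made-up CSV held in a string instead of a remote file.
csv_text = u"""city,population
San Francisco,852469
San Jose,1015785
Sacramento,485199
"""

small_dataframe = pd.read_csv(io.StringIO(csv_text), sep=",")

# head() accepts the number of rows to show.
print(small_dataframe.head(2))

# describe() summarizes the numeric column only.
print(small_dataframe.describe())
```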

Another powerful feature of pandas is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:

In [None]:
california_housing_dataframe.hist('housing_median_age')

## Accessing Data

Most mechanisms familiar in Python work as expected:

In [None]:
cities = pd.DataFrame({ 'City name': city_names, 'Population': population })
print(type(cities['City name']))
cities['City name']

In [None]:
print(type(cities['City name'][1]))
cities['City name'][1]

In [None]:
print(type(cities[0:2]))
cities[0:2]

In addition, pandas provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html) that is too extensive to be covered here.
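As a small taste of that API, the sketch below shows three common selection styles: by position with `iloc`, by label with `loc`, and by boolean mask. The table `example` is rebuilt here for illustration so the block runs on its own.

```python
import pandas as pd

# Illustrative table mirroring the cities data used in this notebook.
example = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Select a row by integer position; the result is a Series.
first_row = example.iloc[0]

# Select a single value by row label and column name.
one_value = example.loc[1, 'City name']

# Keep only the rows where a condition holds (boolean mask).
large = example[example['Population'] > 500000]

print(first_row)
print(one_value)
print(large)
```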

## Manipulating Data

You may apply Python's basic arithmetic operations to `Series`. For example:

In [None]:
population / 1000.

[NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. pandas `Series` can be used as arguments to most NumPy functions:

In [None]:
import numpy as np

np.log(population)

For more complex single-column transformations, `Series.apply` provides a powerful mechanism. It accepts a function, often written as a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), and calls it on each value in the `Series`.
The example below creates a new `Series` that indicates whether the population is over one million:

In [None]:
population.apply(lambda val: val > 1000000)
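`Series.apply` also accepts a named function. The sketch below (with the illustrative names `pop` and `over_one_million`) applies one and compares the result with a direct comparison on the whole `Series`, which gives the same values for a simple test like this.

```python
import pandas as pd

# Illustrative copy of the population values used above.
pop = pd.Series([852469, 1015785, 485199])

def over_one_million(val):
    """Return True when a single population value exceeds one million."""
    return val > 1000000

# apply calls the function once per value.
via_apply = pop.apply(over_one_million)

# Comparing the Series directly works element-wise without a function.
via_comparison = pop > 1000000

print(via_apply)
print(via_comparison)
```

For simple arithmetic or comparisons, the direct form is usually shorter; `apply` is most useful when the per-value logic does not map onto a built-in operator.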


Modifying `DataFrames` is also straightforward. For example, the following block adds two `Series` to an existing `DataFrame`:

In [None]:
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities

### Exercise #1

Modify the cities table by adding a new boolean column that is True if and only if *both* of the following are True:

  * The city is named after a saint.
  * The city has an area greater than 50 square miles.

Note: Boolean `Series` are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing logical and, use "`&`" instead of "`and`".

Hint: "San" in Spanish means "saint".
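The note about bitwise operators can be illustrated with a condition unrelated to the exercise. In this sketch the table `table` and the thresholds are chosen only for illustration. Each comparison is wrapped in parentheses because `&` binds more tightly than `>` and `<` in Python.

```python
import pandas as pd

# Illustrative table; the thresholds below are arbitrary.
table = pd.DataFrame({
    'Population': [852469, 1015785, 485199],
    'Area square miles': [46.87, 176.53, 97.92],
})

# Combine two boolean Series element-wise with &.
# Without the parentheses, Python would evaluate 500000 & table[...] first.
is_large_and_compact = (table['Population'] > 500000) & (table['Area square miles'] < 100)

print(is_large_and_compact)
```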

In [None]:
# Your code here

## Indexes
Both `Series` and `DataFrame` objects also define an index property that governs row ordering. By default, at construction, pandas creates an index that reflects the ordering of the source data. Once created, the index values are stable; that is, they do not change with reordering.

In [None]:
city_names.index

In [None]:
cities.index

Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:

In [None]:
cities.reindex([2, 0, 1])
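To check the claim about sorting, the sketch below rebuilds an illustrative copy of the table (named `example` here) and compares the reindexed rows with `DataFrame.sort_values`. Note that each row keeps its original index label after reordering.

```python
import pandas as pd

# Illustrative table mirroring the cities data used in this notebook.
example = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Reorder rows by listing index labels in the desired order.
reordered = example.reindex([2, 0, 1])

# Sort rows alphabetically by city name.
sorted_by_name = example.sort_values('City name')

# Both results list the rows in the same order, with labels preserved.
print(reordered.index.tolist())
print(sorted_by_name.index.tolist())
```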

Reindexing is a great way to shuffle a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a shuffled copy of its values and leaves the original index unchanged. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.
Try running the cell multiple times!

In [None]:
cities.reindex(np.random.permutation(cities.index))
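The sketch below walks through the shuffle step by step on an illustrative table (`example`, built here so the block runs on its own). `np.random.permutation` returns a new array, so the original table keeps its order, and every row carries its index label with it into the shuffled result.

```python
import numpy as np
import pandas as pd

# Illustrative table mirroring the cities data used in this notebook.
example = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# A shuffled copy of the index labels; example.index itself is not modified.
shuffled_labels = np.random.permutation(example.index)

# reindex returns a new DataFrame with rows in the shuffled order.
shuffled = example.reindex(shuffled_labels)

print(example.index.tolist())   # original order
print(shuffled.index.tolist())  # a random order of the same labels

# Label 1 still refers to the same row after shuffling.
print(shuffled.loc[1, 'City name'])
```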

For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects).

### Exercise #2

The `reindex` method allows index values that are not in the original `DataFrame`'s index values. Try it and see what happens if you use such values! Why do you think this is allowed?

In [None]:
# Your code here