We start this notebook by running a "magic" command that lets IPython notebooks display plots directly in the browser.

In [2]:
# Render our plots inline
%matplotlib inline

To read and process files, we are going to use pandas, a powerful and widely used Python library. Our next step is to import pandas, along with the matplotlib library for generating plots:

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

Pandas should already be installed on your machine, but if you get an error in the import statement above, indicating that pandas is not available, please go to the Unix shell and type:

`sudo pip install -U pandas`

It will take a few minutes to get everything installed.

Next, we run some code that changes the visual style of the plots. (The code below is optional, and for now you do not need to understand exactly what it does.)

In [4]:
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)

Let's take a look at the restaurant inspections file (at /home/ubuntu/data/restaurant.csv), which we used in our earlier classes.

If you do not have it, then type the following in the shell:

`curl 'https://dl.dropboxusercontent.com/u/16006464/DwD_Winter2015/restaurant.zip' -o /home/ubuntu/data/restaurant.zip`

`unzip /home/ubuntu/data/restaurant.zip -d /home/ubuntu/data/`


In [None]:
!curl 'https://dl.dropboxusercontent.com/u/16006464/DwD_Winter2015/restaurant.zip' -o /home/ubuntu/data/restaurant.zip
!unzip /home/ubuntu/data/restaurant.zip -d /home/ubuntu/data/

In [None]:
!head -5 /home/ubuntu/data/restaurant.csv

We want to be able to read and process this file within Python. The pandas library has a very convenient function, `read_csv`, which reads the file and returns a variable that contains its contents.

In [5]:
restaurants = pd.read_csv("/home/ubuntu/data/restaurant.csv", dtype=unicode, encoding="utf-8")

When you read a CSV, you get back a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary. Let's take a look at what the object looks like:

In [None]:
restaurants.head(5)

You will notice that each line now has a number, which in a DataFrame is called the "index number" of the row (and serves as the equivalent of a primary key). If we already have a value that can serve as a primary key for a row then we can specify the "index_col" parameter.

In [None]:
restaurants = pd.read_csv("/home/ubuntu/data/restaurant.csv", dtype=unicode, index_col=["CAMIS"], encoding="utf-8")
restaurants

The read_csv method has many options, and you can read further in the [online documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html).
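
As a minimal sketch of the effect of `index_col`, the example below reads the same small CSV twice from an in-memory string instead of a file. The three sample rows are made up for illustration, and the sketch assumes Python 3 and a reasonably recent pandas.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
csv_text = """CAMIS,DBA,BORO
1001,CAFE ONE,MANHATTAN
1002,PIZZA TWO,QUEENS
1003,DINER THREE,BROOKLYN
"""

# Default behavior: pandas assigns a 0-based integer index to the rows.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the CAMIS column becomes the row index instead.
camis_index = pd.read_csv(io.StringIO(csv_text), index_col="CAMIS")

print(default_index.index.tolist())
print(camis_index.index.tolist())

# Rows can now be looked up by their CAMIS value.
print(camis_index.loc[1002, "DBA"])
```

Note that once `CAMIS` is the index, it no longer appears among the regular columns of the dataframe.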

### Descriptive statistics

We can use the method "describe()" to get a quick overview of the data in the dataframe.

In [None]:
restaurants.describe()
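
Because we read every column as text, `describe()` reports counts and most frequent values rather than means and percentiles. Here is a small sketch on a made-up dataframe (the `toy` name and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with text-only columns, like our restaurants dataframe.
toy = pd.DataFrame({
    "BORO": ["MANHATTAN", "QUEENS", "MANHATTAN"],
    "GRADE": ["A", "A", "B"],
})

# For text columns, describe() returns four rows:
# count (non-missing values), unique (distinct values),
# top (most frequent value) and freq (how often "top" appears).
summary = toy.describe()
print(summary)
```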

### Selecting a subset of the columns

In a dataframe, we can specify the column(s) that we want to keep. Selecting a single column name returns that column as a Series, while selecting a list of column names returns another dataframe with just those columns.

In [None]:
restaurants["VIOLATION CODE"].head(5)

In [None]:
restaurants[["GRADE DATE","VIOLATION CODE", "DBA"]].head(5)
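
The two selections above return different kinds of objects, which is easy to miss. The following sketch, on a made-up dataframe named `toy`, shows the difference between single and double brackets:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "DBA": ["CAFE ONE", "PIZZA TWO"],
    "VIOLATION CODE": ["08A", "10F"],
    "BORO": ["MANHATTAN", "QUEENS"],
})

# A single column name returns a one-dimensional Series.
one_column = toy["VIOLATION CODE"]

# A list of column names returns a DataFrame, even if the list has one entry.
two_columns = toy[["DBA", "VIOLATION CODE"]]
one_column_frame = toy[["VIOLATION CODE"]]

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(one_column_frame).__name__)
```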

We can also get quick statistics about the common values that appear in each column:

In [None]:
violation_counts = restaurants["VIOLATION CODE"].value_counts()
violation_counts[0:10]

In [None]:
violation_counts = restaurants["VIOLATION DESCRIPTION"].value_counts()
violation_counts[0:10]

And we can use the `plot` method to draw the resulting counts as a bar chart:

In [None]:
violation_counts[:10].plot(kind='bar')
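
To see what `value_counts()` produces before it is plotted, here is a small sketch on a made-up Series of violation codes (the `codes` name and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical list of violation codes, made up for illustration.
codes = pd.Series(["10F", "08A", "10F", "02G", "10F", "08A"])

# value_counts() returns a Series whose index holds the distinct values
# and whose values hold the counts, sorted from most to least frequent.
counts = codes.value_counts()

# head(n) keeps the n most frequent values, like the [0:10] slice above.
top_two = counts.head(2)

# normalize=True reports shares of the total instead of raw counts.
shares = codes.value_counts(normalize=True)

print(counts)
print(top_two.index.tolist())
print(shares["10F"])
```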

#### Using the map command

The map command in Python has the following syntax:

`map(function, [list of values for first argument], [list of values for second argument]...)`

It takes as input a function and one or more lists. It iterates over the lists in parallel and passes the elements at the same position as the arguments of the function. In Python 2, `map` returns a list of the values that result from applying the function to the elements of the list(s); in Python 3 it returns an iterator, which `list(...)` turns into a list.

For example, in the following code, the `add` function is going to be applied to the two lists (`[1, 2, 3, 4]` and `[9, 10, 10, 11]`) that follow. The result of the map will be a list containing the values `[add(1,9), add(2,10), add(3,10), add(4,11)]`, that is, `[10, 12, 13, 15]`.




In [None]:
def add(x,y):
    return x+y
    
example = list(map(add, [1, 2, 3, 4], [9, 10, 10, 11]))
example
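
As a sketch of what `map` does with two lists, the same pairing can be spelled out with `zip` and a list comprehension. The names `left`, `right`, `mapped` and `spelled_out` are illustrative only. Wrapping `map` in `list(...)` makes the result a list under both Python 2 and Python 3.

```python
def add(x, y):
    return x + y

left = [1, 2, 3, 4]
right = [9, 10, 10, 11]

# map pairs up the elements at the same position in each list.
mapped = list(map(add, left, right))

# The same computation written out explicitly with zip.
spelled_out = [add(x, y) for x, y in zip(left, right)]

print(mapped)
print(mapped == spelled_out)
```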

#### Using the map for dataframes

Using the map command, we can:
* Create new columns for the dataframe
* Modify existing columns
* Generate new columns that are the result of operations on the columns of the dataframe

For example, suppose that we want to format the phone column. We can write a function that takes as input a phone and formats it as we want. Then we apply the function using the map command as follows:

In [None]:
import re

def formatPhone(phoneString):
    
    regex = re.compile(r'([2-9]\d{2})\W*(\d{3})\W*(\d{4})')
    match = regex.search(str(phoneString))
    if match:
        formatted = "(" + match.group(1) + ") " + match.group(2) + "-" + match.group(3)
        return formatted
    else:
        return None
    
restaurants['FormattedPhone'] = list(map(formatPhone, restaurants['PHONE']))

In [None]:
restaurants[['PHONE', 'FormattedPhone']]
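
Pandas columns also have their own `map` method, which applies a function to each value and returns a new Series. The sketch below uses it on a few made-up phone numbers; `format_phone` is a condensed variant of the function above, and the sample values are assumptions for illustration.

```python
import re
import pandas as pd

# Same pattern as above: area code, exchange, and line number,
# with optional non-word separators between them.
PHONE_RE = re.compile(r'([2-9]\d{2})\W*(\d{3})\W*(\d{4})')

def format_phone(phone_string):
    match = PHONE_RE.search(str(phone_string))
    if match is None:
        return None
    return "({}) {}-{}".format(*match.groups())

# Hypothetical phone values, made up for illustration.
phones = pd.Series(["2125551234", "718-555-0000", "n/a"])

# Series.map applies the function to every value in the column.
formatted = phones.map(format_phone)
print(formatted)
```

Values that do not look like a phone number come back as missing values, so they are easy to find later with `isnull()`.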

### Selecting rows

To select rows, we can use the following approach: we generate a list of boolean values, one for each row of the dataframe, and then we use the list to select which of the rows of the dataframe we want to keep:

In [None]:
is_08A = (restaurants["VIOLATION CODE"] == "08A")
inspections08A = restaurants[is_08A]
inspections08A["DBA"].value_counts()[:10]

And we can use more complex conditions:

In [None]:
is_08A_manhattan = (restaurants["VIOLATION CODE"] == "08A") & (restaurants["BORO"] == "MANHATTAN")
inspections08A_in_manhattan = restaurants[is_08A_manhattan]
inspections08A_in_manhattan["DBA"].value_counts()[:10].plot(kind='bar')
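
Boolean conditions can be combined with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses when written inline, because these operators bind more tightly than `==`. A small sketch on a made-up dataframe named `toy`:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "DBA": ["CAFE ONE", "PIZZA TWO", "DINER THREE", "CAFE ONE"],
    "BORO": ["MANHATTAN", "QUEENS", "MANHATTAN", "BROOKLYN"],
    "VIOLATION CODE": ["08A", "08A", "10F", "08A"],
})

# Each comparison yields one True/False value per row.
is_08A = toy["VIOLATION CODE"] == "08A"
in_manhattan = toy["BORO"] == "MANHATTAN"

both = toy[is_08A & in_manhattan]    # rows where both conditions hold
either = toy[is_08A | in_manhattan]  # rows where at least one holds
not_08A = toy[~is_08A]               # rows where the first does not hold

print(len(both), len(either), len(not_08A))
```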

## Reading Excel files

Pandas makes it trivially easy to read the contents of Excel files. For example, I stored the restaurant inspection dataset as an Excel file. Let's grab it and store it locally:

In [None]:
!curl -L -s "https://dl.dropboxusercontent.com/u/16006464/DwD_Fall2014/Restaurants.xlsx" -o Restaurants.xlsx

To read the Excel file, pandas uses the xlrd package. It should already be installed on your machine, but if it is not, type this in the shell:

`sudo pip install xlrd`

In [None]:
restaurantsExcelFile = pd.ExcelFile("Restaurants.xlsx")

Read the worksheet named "WebExtract"

In [None]:
tableWebExtract = restaurantsExcelFile.parse("WebExtract")

In [None]:
tableViolationCodes = restaurantsExcelFile.parse("Violation")

In [None]:
tableWebExtract

In [None]:
tableViolationCodes

### Comparison with SQL

For a comparison with SQL, see http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

Pandas supports its own set of operators for doing SQL-like operations. It is also possible to query pandas dataframes with plain SQL by using the `pandasql` package, which copies the dataframes into an in-memory SQLite database behind the scenes and runs the query there:

In [6]:
!sudo -H pip install -U pandasql

Requirement already up-to-date: pandasql in /usr/local/lib/python2.7/dist-packages
Requirement already up-to-date: sqlalchemy in /usr/local/lib/python2.7/dist-packages (from pandasql)
Requirement already up-to-date: pandas in /usr/local/lib/python2.7/dist-packages (from pandasql)
Requirement already up-to-date: numpy in /usr/local/lib/python2.7/dist-packages (from pandasql)
Requirement already up-to-date: pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas->pandasql)
Requirement already up-to-date: python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas->pandasql)
Requirement already up-to-date: six>=1.5 in /usr/local/lib/python2.7/dist-packages (from python-dateutil->pandas->pandasql)


In [7]:
from pandasql import sqldf

In [8]:
# pandasql does not like column names with spaces, so we rename (some of) them.
restaurants.rename(columns={"VIOLATION CODE": "VIOLATION"}, inplace = True)

In [9]:
rest = restaurants[["DBA", "BORO", "VIOLATION"]]

In [11]:
q  = """
SELECT BORO, VIOLATION, COUNT(*) AS CNT 
FROM
  rest
GROUP BY BORO, VIOLATION
ORDER BY CNT DESC
LIMIT 20;
"""

df = sqldf(q, globals())

In [12]:
df

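|   | BORO | VIOLATION | CNT |
|---|------|-----------|-----|
| 0 | MANHATTAN | 10F | 26854 |
| 1 | MANHATTAN | 02G | 19721 |
| 2 | MANHATTAN | 08A | 19124 |
| 3 | QUEENS | 10F | 15200 |
| 4 | BROOKLYN | 10F | 14925 |
| 5 | MANHATTAN | 04L | 14049 |
| 6 | MANHATTAN | 06D | 13604 |
| 7 | MANHATTAN | 10B | 12773 |
| 8 | BROOKLYN | 08A | 12424 |
| 9 | QUEENS | 08A | 11558 |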
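
For comparison, here is a sketch of how the same GROUP BY query could be written with pandas' own operators. It runs on a small made-up dataframe rather than the real data, so the counts below have nothing to do with the actual inspection results; the `counts` name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `rest` dataframe; the rows are made up.
rest = pd.DataFrame({
    "DBA": ["A", "B", "C", "D", "E", "F"],
    "BORO": ["MANHATTAN", "MANHATTAN", "MANHATTAN",
             "QUEENS", "QUEENS", "MANHATTAN"],
    "VIOLATION": ["10F", "10F", "10F", "08A", "08A", "08A"],
})

counts = (
    rest.groupby(["BORO", "VIOLATION"])       # GROUP BY BORO, VIOLATION
        .size()                               # COUNT(*)
        .reset_index(name="CNT")              # ... AS CNT
        .sort_values("CNT", ascending=False)  # ORDER BY CNT DESC
        .head(20)                             # LIMIT 20
)
print(counts)
```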
