### Common Pandas Operations

We will use the data set from NYC OpenData called "New York City Leading Causes of Death".

In [None]:
%matplotlib inline
import requests
import pandas as pd
import numpy as np

#### Fetching the data


We fetch the data in JSON format using the NYC OpenData API:

In [None]:
# Data set: New York City Leading Causes of Death
# https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam
url = 'http://data.cityofnewyork.us/api/views/jb7j-dtam/rows.json'
results = requests.get(url).json()

In [None]:
results.keys()

The returned JSON has two main fields: `meta`, which holds the metadata describing the data set, and `data`, which holds the actual rows.

In [None]:
results['meta']['view'].keys()

In [None]:
results['data']

### Creating a DataFrame from JSON data

Let's create a pandas dataframe from the `results["data"]` part.

In [None]:
df = pd.DataFrame(results["data"])
df

### Adding Column Names

Hm, this is kind of ugly without column names...

We need to peek at the "meta" part to find information about the columns.

In [None]:
# This part of the results contains the description and names for the columns
columns = results["meta"]["view"]["columns"]
columns

In [None]:
# We will create a list of the column names, to reuse it when creating our dataframe
headers = [c["fieldName"] for c in columns]
headers

In [None]:
# Now we also pass a list of column names
df = pd.DataFrame(results["data"], columns=headers)
df

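The two steps above (reading the field names from `meta` and passing them as `columns`) can also be tried offline. The sketch below assumes a payload shaped like the API response; the `sample` dictionary and its values are made up for illustration and are not taken from the real data set.

```python
import pandas as pd

# Hypothetical payload that mimics the shape of the API response:
# column descriptions under meta -> view -> columns, rows under data.
sample = {
    "meta": {"view": {"columns": [
        {"fieldName": ":sid"},
        {"fieldName": "year"},
        {"fieldName": "deaths"},
    ]}},
    "data": [
        ["row-1", "2010", "120"],
        ["row-2", "2011", "."],
    ],
}

# Same pattern as above: collect the field names, then pass them as column names
headers = [c["fieldName"] for c in sample["meta"]["view"]["columns"]]
sample_df = pd.DataFrame(sample["data"], columns=headers)
print(sample_df)
```
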
### Deleting Columns and/or Rows

We do not need all these columns. Let's drop a few that we will definitely not use. For that, we will use the `drop` method.

In [None]:
df.drop(labels = [':sid', ':position', ':meta', ':created_meta', ':updated_meta'], 
        axis='columns', inplace=True)
df

##### Common Patterns: axis and inplace

* The `axis='columns'` argument says that we want to drop columns. With `axis='index'` we would instead drop the rows whose index values match the passed labels.

* The `inplace=True` argument modifies the existing dataframe directly instead of returning a new one, so `df` itself now has fewer columns.

In [None]:
df

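To see `axis` and `inplace` in isolation, here is a minimal sketch on a made-up dataframe (`toy` and its labels are hypothetical, not part of the data set):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# axis='columns' drops by column label; without inplace a new dataframe is returned
without_b = toy.drop(labels=["b"], axis="columns")

# axis='index' drops the rows whose index value matches the labels
without_y = toy.drop(labels=["y"], axis="index")

# toy is untouched so far; inplace=True modifies it directly and returns None
toy.drop(labels=["a"], axis="columns", inplace=True)
print(toy)
```
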
### Renaming Columns

We do not like some of these column names. Let's rename them.

We will use a dictionary that maps each existing column name to its new name.

In [None]:
# This dictionary specifies as a key the existing name of the column, and as value the new name
renaming_dict = {
    ':id': 'key', 
    ':created_at': 'created_at', 
    ':updated_at': 'updated_at'
}

df.rename(columns=renaming_dict, inplace=True)
df

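As a small illustration of how `rename` treats the dictionary, the sketch below uses a made-up dataframe (`toy` is hypothetical). Keys that do not match any column are silently ignored by default, and without `inplace=True` the original dataframe is left unchanged.

```python
import pandas as pd

toy = pd.DataFrame({":id": ["r1", "r2"], "year": [2010, 2011]})

# "missing_column" does not exist in toy, so that entry has no effect
renamed = toy.rename(columns={":id": "key", "missing_column": "ignored"})
print(renamed.columns.tolist())
print(toy.columns.tolist())
```
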
### Converting Data Types

In [None]:
df.dtypes

In [None]:
# Let's convert the columns to the right data types, starting with the year
df["year"] = pd.to_numeric(df["year"])
df.dtypes

Sometimes, during the conversion of data, the cells contain values that cannot be properly converted. We can specify how we want pandas to handle such cases. By default, it will raise an exception, and will not allow us to convert the data to a new data type.

In [None]:
# This one will cause an error, as the "deaths" column contains non-numeric values.
# Try by uncommenting
# df["deaths"] = pd.to_numeric(df["deaths"])

We can pass the `errors` parameter to specify what should happen. From the [documentation of to_numeric](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html), we get:
* If ‘raise’, then invalid parsing will raise an exception
* If ‘coerce’, then invalid parsing will be set as NaN
* If ‘ignore’, then invalid parsing will return the input

In [None]:
df["deaths"] = pd.to_numeric(df["deaths"], errors='coerce')
df["death_rate"] = pd.to_numeric(df["death_rate"], errors='coerce')
df["age_adjusted_death_rate"] = pd.to_numeric(df["age_adjusted_death_rate"], errors='coerce')
df.dtypes

In [None]:
df

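The difference between the default behavior and `errors='coerce'` is easier to see on a tiny example. This is a sketch on made-up values; the `"."` entry stands in for any non-numeric cell and is an assumption about what such cells might look like.

```python
import pandas as pd

raw = pd.Series(["12", "7", "."])  # hypothetical values with one non-numeric cell

# Default (errors='raise'): the conversion fails with a ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce': the non-numeric cell becomes NaN and the column becomes float
coerced = pd.to_numeric(raw, errors="coerce")
print(raised)
print(coerced)
```
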
We will also mark the other values as Categorical.

In [None]:
df["sex"] = pd.Categorical(df["sex"])
df["race_ethnicity"] = pd.Categorical(df["race_ethnicity"])
df["leading_cause"] = pd.Categorical(df["leading_cause"])
df.dtypes

And we will also convert the timestamps to dates. Notice that we specify the unit to be `s`, because the values are Unix timestamps: seconds elapsed since January 1, 1970.

In [None]:
df["created_at"] = pd.to_datetime(df["created_at"], unit='s')
df["updated_at"] = pd.to_datetime(df["updated_at"], unit='s')
df.dtypes

In [None]:
df

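A quick check of what `unit='s'` does, on made-up timestamps (the values below are illustrative, not from the data set): 0 seconds maps to the start of 1970, and 86400 seconds is exactly one day later.

```python
import pandas as pd

seconds = pd.Series([0, 86400])  # hypothetical Unix timestamps, in seconds
as_dates = pd.to_datetime(seconds, unit="s")
print(as_dates)
```
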
### Exploratory Data Analysis

In [None]:
df["race_ethnicity"].value_counts()

In [None]:
df["sex"].value_counts()

In [None]:
df["leading_cause"].value_counts()

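`value_counts` has a couple of options that are often useful during exploration. The sketch below uses a made-up series (not the real data) to show relative frequencies via `normalize=True` and the counting of missing values via `dropna=False`.

```python
import pandas as pd

sex = pd.Series(["F", "M", "F", "F", None])  # hypothetical values, one missing

counts = sex.value_counts()                   # absolute counts, most frequent first
shares = sex.value_counts(normalize=True)     # fractions of the non-missing values
with_missing = sex.value_counts(dropna=False) # also counts the missing entry
print(counts)
print(shares)
print(with_missing)
```
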
### Pivot Tables

Let's create a pivot table now. We are going to put the "leading cause" as the row, with sex and race as columns. For the cell values we will use the number of deaths, and we are going to sum (`np.sum`) the values.

_Note: You will also find the `pivot` and `crosstab` functions in Pandas. The `pivot_table` function is typically a more general version of both._

In [None]:

import numpy as np
pivot = pd.pivot_table(df, 
                       values='deaths', 
                       index=['leading_cause'], # rows
                       columns=['sex', 'race_ethnicity'], # columns
                       aggfunc=np.sum) # aggregation function: sum the deaths
pivot

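To make the mechanics of `pivot_table` easier to follow, here is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical). It passes the aggregation as the string `'sum'`, which pandas also accepts for `aggfunc`. Combinations that never occur in the data show up as `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({
    "cause": ["Heart", "Heart", "Heart", "Flu"],
    "sex": ["F", "M", "F", "M"],
    "deaths": [10, 20, 5, 3],
})

toy_pivot = pd.pivot_table(toy,
                           values="deaths",
                           index=["cause"],   # rows
                           columns=["sex"],   # columns
                           aggfunc="sum")     # add up the deaths in each cell
print(toy_pivot)
```
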
And we can easily transpose the dataframe:

In [None]:
pivot.transpose()
# alternatively
# pivot.T

In [None]:
# And we can of course, plot:
pivot.transpose()["Diseases of Heart (I00-I09, I11, I13, I20-I51)"].plot.bar()

#### Exercises

* Write a function that will change the values for the "leading cause" column, and make them shorter. For example, we want to eliminate the codes within the parentheses; the value "Alzheimer's Disease (G30)" should become "Alzheimer's Disease". Use the `apply` function and/or the `map` function to create a new column with the shortened values. Then use the `drop` method to delete the old `leading_cause` column.
* Change the pivot_table to compute the average `age_adjusted_death_rate` instead of the sum of deaths. (Hint: you can use the `numpy.mean` function to compute averages.)

In [None]:
# Example input:
# 'Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)'
# Example output (the name is also truncated to 30 characters):
# 'Accidents Except Drug Posionin'
import re

def shorten(cause):
    # Capture everything before the parenthesized codes
    regex_expression = r'(.*)\(.*\)' # notice that we escape the parentheses
    regex = re.compile(regex_expression)
    match = regex.search(cause)
    if match:
        return match.group(1).strip()[:30]
    # No parentheses found: just truncate the original value
    return cause[:30]

shorten('Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)')

In [None]:
[shorten(cause) for cause in set(df['leading_cause'].values)]

In [None]:
df["cause"] = df["leading_cause"].apply(shorten)
df

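As an alternative to `apply` with a custom function, the codes can also be stripped with the vectorized string method `str.replace`. This is a sketch of a different technique, not the `shorten` function above: it removes the trailing parenthesized part but does not truncate the result to 30 characters. The `causes` series is a small hand-made sample.

```python
import pandas as pd

causes = pd.Series([
    "Alzheimer's Disease (G30)",
    "Diseases of Heart (I00-I09, I11, I13, I20-I51)",
    "All Other Causes",  # no codes: left unchanged
])

# Remove optional whitespace plus a parenthesized group at the end of the string
short = causes.str.replace(r"\s*\(.*\)\s*$", "", regex=True)
print(short.tolist())
```
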
In [None]:

import numpy as np
pivot = pd.pivot_table(df, 
                       values='age_adjusted_death_rate', 
                       index=['cause'], # rows
                       columns=['sex', 'race_ethnicity'], # columns
                       aggfunc=np.mean) # aggregation function
pivot

#### Exercise

* Get a new dataset from NYC Open Data. (Go for something small.) Fetch it and load it into a dataframe. Put the right column names into the dataframe, eliminate columns and rows that you do not need. Create a basic plot that summarizes some aspect of the dataset.