# Data Wrangling with Pandas

We've seen how to get data with Python. Now let's put it to work. From here on, we're going to mostly use the PyData stack rather than Python's built-in functionality.

Our objective in this section is to learn enough to clean the larger sample of Chicago Health Inspection data and get it ready for modeling.

## Preliminaries: DataFrames

As mentioned, the core data structure in pandas is called a DataFrame. A DataFrame is a tabular data structure, holding many columns, similar to a spreadsheet.
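To make the idea concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names echo the inspection data, but the values are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "inspection_id": [101, 102, 103],
    "name": ["Cafe A", "Deli B", "Grill C"],
    "risk": ["High", "Low", "Medium"],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.columns)  # the column labels
```

Each dictionary key becomes a column label, and each list becomes that column's values.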

Its **key features** are:

* Easy handling of **missing data**
* **Size mutability**: columns can be inserted and deleted from DataFrames
* Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
* Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
* Intelligent label-based **slicing**, **fancy indexing**, and **subsetting** of large data sets
* Intuitive **merging and joining** data sets
* Flexible **reshaping and pivoting** of data sets
* **Hierarchical labeling** of axes
* Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
* **Time series functionality**: 
  * date range generation and frequency conversion
  * moving window statistics
  * moving window linear regressions
  * date shifting and lagging, etc.
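Two of the features listed above, missing-data handling and automatic alignment, can be seen in a few lines. This is a small sketch on made-up Series, not on the inspection data.

```python
import pandas as pd

# Two hypothetical Series whose labels overlap only partially
# and appear in a different order.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["z", "x"])

# Values are matched by label, not by position.
total = a + b
print(total)

# Labels present in only one operand produce a missing value.
print(total.isna())
```

The label `y` exists only in `a`, so its sum is `NaN` rather than an error.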

In [None]:
import pandas as pd

dta = pd.read_csv("data/health_inspection_chi.csv")

Pandas provides labelled **indices** to access rows and columns, should they have natural labels.

In [None]:
dta.index

In [None]:
dta.columns

For example, with this data set we have a natural unique identifier in the `inspection_id` column. We might wish to make this our index.

In [None]:
dta.head()

In [None]:
dta = dta.set_index('inspection_id')

In [None]:
dta.head()
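As a small sketch of what `set_index` does, the example below uses a made-up three-row frame. Note that `set_index` returns a new DataFrame by default, which is why the result is assigned back above.

```python
import pandas as pd

# Hypothetical toy frame with an identifier column.
toy = pd.DataFrame({
    "inspection_id": [103, 101, 102],
    "risk": ["High", "Low", "Medium"],
})

indexed = toy.set_index("inspection_id")

print(indexed.index.name)       # the column name becomes the index name
print(list(indexed.columns))    # the identifier is no longer a column
print(list(toy.index))          # the original frame keeps its default index
```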

## Indexing

To look at a column from a DataFrame, you can use attribute lookup.

In [None]:
dta.address

Or you can use the **getitem** syntax that relies on square brackets `[]`, which is familiar from dealing with dictionaries.

In [None]:
dta['address']

These two operations return pandas **Series** objects. **Series** are like single-column DataFrames. If you want to preserve the DataFrame type, index the DataFrame with a list.

In [None]:
dta[['address']]

You can use this syntax to pull out multiple columns.

In [None]:
dta[['address', 'inspection_date']]
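The difference between the three forms can be checked on a toy frame. This sketch uses invented values and deliberately includes a column named `count` to show one caveat of attribute lookup: when a column name clashes with a DataFrame method, the attribute gives you the method, not the column.

```python
import pandas as pd

# Hypothetical toy frame; "count" clashes with the DataFrame.count method.
toy = pd.DataFrame({
    "address": ["1 Main St", "2 Oak Ave"],
    "count": [3, 5],
})

col = toy["address"]      # a single label returns a Series
frame = toy[["address"]]  # a list of labels returns a DataFrame

print(type(col).__name__, type(frame).__name__)

# Attribute lookup finds the method, so square brackets are the safe choice here.
print(callable(toy.count))
print(toy["count"].tolist())
```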

You can index the rows by using the **loc** and **iloc** accessors.

`loc` does *label-based* indexing.

In [None]:
dta.loc[[1965287, 1329698]]

`iloc`, on the other hand, provides *integer-based* indexing. We can pass a list of row integers.

In [None]:
dta.iloc[[0, 2]]
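The contrast is easiest to see when the labels are not simply 0, 1, 2. Here is a minimal sketch on a made-up frame whose index labels are hypothetical inspection ids.

```python
import pandas as pd

# Hypothetical toy frame indexed by invented inspection ids.
toy = pd.DataFrame(
    {"risk": ["High", "Low", "Medium"]},
    index=pd.Index([103, 101, 102], name="inspection_id"),
)

by_label = toy.loc[[101, 103]]   # rows whose index labels are 101 and 103
by_position = toy.iloc[[0, 2]]   # the first and third rows, whatever their labels

print(by_label)
print(by_position)
```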

Both support the Python **slice notation** (`start:stop:step`). This can be really powerful. Note that `loc` slices include the stop label, while `iloc` slices exclude the stop position.

In [None]:
dta.iloc[:5]

In [None]:
dta.loc[:1335320]

Note that these inspection ids are *not* sorted, yet we can still use slice notation.
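A short sketch of label slicing on an unsorted index, using made-up ids. It assumes the index labels are unique and that the slice endpoints actually exist in the index, which is what lets pandas locate them without sorting.

```python
import pandas as pd

# Hypothetical toy frame with unique but unsorted index labels.
toy = pd.DataFrame(
    {"risk": ["High", "Low", "Medium"]},
    index=pd.Index([103, 101, 102], name="inspection_id"),
)

# Label slice: everything from the start up to and including label 101.
upto_label = toy.loc[:101]

# Position slice: stops before position 1, as in plain Python.
upto_position = toy.iloc[:1]

print(list(upto_label.index))
print(list(upto_position.index))
```

The label slice follows the order in which rows appear, not the numeric order of the ids.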

Of course, we can also combine row and column indexing.

In [None]:
dta.iloc[:5, [0, 5]]

In [None]:
dta.loc[:68091, ["address", "inspection_date"]]

## Reading Data and Dealing with Types

We saw above that `csv` reads everything in as strings, `json` does some type conversion with facilities for doing more, and `pandas` does a bit more type conversion still. That conversion isn't always what we want, though: we want the zip codes to stay strings.

First, we can use the `parse_dates` argument to read in the larger inspections data sample and tell pandas that one of our columns is a date column. We'll also go ahead and make `inspection_id` the index.

In [None]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv", 
    index_col="inspection_id",
    parse_dates=["inspection_date"]
)
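To see the effect of these arguments without the real file, the sketch below reads a tiny in-memory CSV. The contents are invented, and the ISO date format is an assumption that may not match the real data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real file; one zip code is missing.
csv_text = (
    "inspection_id,inspection_date,zip\n"
    "101,2016-01-05,60601\n"
    "102,2016-02-10,\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="inspection_id",
    parse_dates=["inspection_date"],
)

# The date column is parsed to a datetime dtype, while the zip column,
# which contains a missing value, is read as a float.
print(toy.dtypes)
```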

And let's cast zip code from a float to a string.

In [None]:
import numpy as np

def float_to_zip(zip_code):
    # convert from the string in the file to a float
    try:
        zip_code = float(zip_code)
    except ValueError:  # some of them are empty
        return np.nan

    # values that are already NaN (e.g. in a parsed column) stay missing
    if np.isnan(zip_code):
        return np.nan

    # 0 makes sure to left-pad with zero
    # 5 is the minimum width, since zip codes have 5 digits
    # .0 means we don't want anything after the decimal
    # f is for float
    return "{:05.0f}".format(zip_code)

Here we use Python's **string formatting** facilities to convert from a numeric type to a string. Some of the zip codes are empty strings. Pandas uses numpy's `NaN` to indicate missingness, so we'll return it here.
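The format specification can be tried on its own, independent of pandas. This sketch applies the same `05.0f` spec to a few made-up numbers.

```python
# The width in a format spec is a minimum, so shorter numbers are
# left-padded with zeros and longer ones are left untouched.
padded = "{:05.0f}".format(1234.0)
unchanged = "{:05.0f}".format(60601.0)
too_long = "{:05.0f}".format(123456.0)

print(padded, unchanged, too_long)
```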

In [None]:
float_to_zip('1234')

In [None]:
float_to_zip('123456')

In [None]:
float_to_zip('')

We can supply this function to the `converters` argument.

In [None]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
)

In [None]:
dta.head()
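Here is a self-contained sketch of `converters` on an in-memory CSV. The helper `pad_zip` is a hypothetical, shortened stand-in for `float_to_zip`, and the CSV contents are invented. The point to notice is that a converter receives each raw cell as a string, including the empty string for blank cells.

```python
import io

import numpy as np
import pandas as pd

def pad_zip(raw):
    # Hypothetical compact converter: blank cells become NaN,
    # everything else is zero-padded to at least 5 digits.
    return "{:05.0f}".format(float(raw)) if raw else np.nan

# Hypothetical CSV text standing in for the real file.
csv_text = "inspection_id,zip\n101,60601.0\n102,\n103,1234\n"

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="inspection_id",
    converters={"zip": pad_zip},
)

print(toy["zip"].tolist())
```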

Finally, we might want to exclude a column like `location` since we have the separate `latitude` and `longitude` columns. We can take advantage of the fact that the `usecols` argument accepts a function to exclude `location`.

In [None]:
dta = pd.read_csv(
    "data/health_inspection_chi.csv",
    index_col='inspection_id',
    parse_dates=['inspection_date'],
    converters={
        'zip': float_to_zip
    },
    usecols=lambda col: col != 'location'
)

Here we are using a **lambda function** that returns `False` for the `location` column name and `True` for every other name, so only `location` is skipped. Lambda functions are known as anonymous functions because they don't have a name. A short, throwaway callback like this is precisely their intended use.
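The same filter can be tried on a one-row, in-memory CSV. The contents are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text; the quoted location field contains a comma.
csv_text = (
    "inspection_id,latitude,longitude,location\n"
    '101,41.88,-87.63,"(41.88, -87.63)"\n'
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="inspection_id",
    # Called once per column name; columns that return False are skipped.
    usecols=lambda col: col != "location",
)

print(list(toy.columns))
```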

Of course, you don't have to let `read_csv` have all the fun. You can do all of this on-the-fly with the DataFrames themselves.

In [None]:
dta = pd.read_csv("data/health_inspection_chi.csv")

We can set the index. Note the use of `inplace`, which modifies `dta` directly instead of returning a new DataFrame.

In [None]:
dta.set_index("inspection_id", inplace=True)
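One consequence of `inplace=True` is worth checking on a toy frame: the method returns `None`, so its result should not be assigned back. The data here is made up.

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "inspection_id": [101, 102],
    "risk": ["High", "Low"],
})

# With inplace=True the frame itself is changed and nothing is returned.
result = toy.set_index("inspection_id", inplace=True)

print(result)
print(toy.index.name)
```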

Convert to datetime types. Here we'll use the **apply** function to apply a function to each element of a Series.

In [None]:
dta.inspection_date = dta.inspection_date.apply(pd.to_datetime)
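As a side note, `pd.to_datetime` also accepts a whole Series, which avoids calling it once per element and is typically faster on large columns. A minimal sketch on made-up date strings:

```python
import pandas as pd

# Hypothetical date strings in ISO format.
dates = pd.Series(["2016-01-05", "2016-02-10"])

elementwise = dates.apply(pd.to_datetime)  # one call per element
vectorized = pd.to_datetime(dates)         # one call for the whole Series

# Both approaches yield the same timestamps.
print((elementwise == vectorized).all())
print(vectorized.dt.year.tolist())
```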

And finally, convert the zip code data.

In [None]:
dta.zip = dta.zip.apply(float_to_zip)

In [None]:
dta.head()

DataFrames have a `dtypes` attribute for checking the data types. Pandas relies on NumPy's dtype objects. Here we see that the `object` dtype is used to hold strings. This is for technical reasons.

In [None]:
dta.dtypes[['inspection_date', 'zip']]

In a few cases, we may want to take advantage of pandas' native `category` dtype. We can convert these variables using `astype`.

In [None]:
dta.info()

In [None]:
dta.results = dta.results.astype('category')
dta.risk = dta.risk.astype('category')
dta.inspection_type = dta.inspection_type.astype('category')
dta.facility_type = dta.facility_type.astype('category')

If we select only the categorical columns, we can look at the categorical variables on their own.

We can use the `select_dtypes` method to pull out a DataFrame with only the requested types.

In [None]:
dta.select_dtypes(['category'])
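The sketch below repeats the conversion on a toy frame with invented values and then peeks at what a categorical column stores through the `.cat` accessor: the distinct labels and an integer code per row.

```python
import pandas as pd

# Hypothetical toy frame; only "risk" is converted to a categorical.
toy = pd.DataFrame({
    "risk": ["High", "Low", "High", "Medium"],
    "zip": ["60601", "60614", "60601", "60622"],
})
toy["risk"] = toy["risk"].astype("category")

cats = toy.select_dtypes(["category"])
print(list(cats.columns))

# The distinct labels, and the integer code stored for each row.
print(list(toy["risk"].cat.categories))
print(toy["risk"].cat.codes.tolist())
```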

Finally, we can delete columns in a DataFrame using Python's built-in `del` statement.

In [None]:
del dta['location']
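An alternative worth knowing is the `drop` method, which returns a new DataFrame and leaves the original alone, whereas `del` changes the frame in place. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical one-row toy frame.
toy = pd.DataFrame({
    "latitude": [41.88],
    "longitude": [-87.63],
    "location": ["(41.88, -87.63)"],
})

# drop returns a new frame; toy still has all three columns afterwards.
trimmed = toy.drop(columns=["location"])
columns_after_drop = list(toy.columns)

# del removes the column from toy itself.
del toy["location"]

print(list(trimmed.columns))
print(columns_after_drop)
print(list(toy.columns))
```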