# Simple Data Processing using Pandas

## Data source

This notebook uses a sample of the Oakland dataset from [The Stanford Open Policing Project](https://openpolicing.stanford.edu/). A description of the data is available [here](https://openpolicing.stanford.edu/data).

<br>

## Data format

The data is a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). CSV stands for *comma-separated values*, a very common format for large data files. Here is a simplified view of our data:

<div style="width: 60%; margin: auto;">

```
date,time,location
2013-04-01,20:52:00,700 Blk Of Center St
2013-04-01,15:55:00,73R D AV&INTERNATIONAL BLVD
NA,01:33:00,E. 28th St. & Park BLVD
```

</div>

CSVs hold *tabular* data (i.e. rows and columns of information). Each line of the file is a row, and each comma marks the start of a new column. The first line usually holds the column names.

|date | time | location |
|:----|:-----|:---------|
|2013-04-01|20:52:00|700 Blk Of Center St|
|2013-04-01|15:55:00|73R D AV&INTERNATIONAL BLVD|
|NA|01:33:00|E. 28th St. & Park BLVD|
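
To make the line-to-row and comma-to-column mapping concrete, here is a minimal sketch that parses the simplified sample above with Python's built-in `csv` module. The sample is held in an in-memory string rather than a file, and the variable names (`sample`, `header`, `records`) are chosen only for this illustration.

```python
import csv
import io

# The simplified sample shown above, held in memory instead of a file.
sample = """date,time,location
2013-04-01,20:52:00,700 Blk Of Center St
2013-04-01,15:55:00,73R D AV&INTERNATIONAL BLVD
NA,01:33:00,E. 28th St. & Park BLVD
"""

# csv.reader splits each line into a list of column values.
rows = list(csv.reader(io.StringIO(sample)))

# The first line holds the column names; the rest are data rows.
header, records = rows[0], rows[1:]

print(header)        # the column names
print(len(records))  # the number of data rows
print(records[2])    # the third data row, split on commas
```

Note that the `csv` module keeps every value as a plain string, so the missing date stays as the text `'NA'`.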

<br>

## Python tools

Python is a great tool for processing CSV data that is large or complex. There are several ways to go about this task: with [Python's built-in `csv` module](https://docs.python.org/3/library/csv.html), with the [`NumPy` library](https://numpy.org/), or with the [`Pandas` library](https://pandas.pydata.org/). Read more about the differences [here].

All are worth knowing. This notebook uses `Pandas`.

[here]: https://janakiev.com/blog/csv-in-python/

<hr>

## 1. Creating a `dataframe` object

In [None]:
# Import Pandas
# You'll first have to install it if you haven't already.
# (https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html)
import pandas as pd

In [None]:
# The name of the csv file (in the same folder as this notebook)
path = 'ca_oakland_2020_04_01_SHORT.csv'

# Here, we create a 'dataframe' object using our csv data.
# Dataframe objects are specific to the Pandas library.
# 'df' is common shorthand for 'dataframe'.
df = pd.read_csv(path)

# One nice thing about the dataframe object is it lets us
# view our csv data as an easy-to-read table.
df
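
If the data file is not at hand, the same steps can be tried on the simplified three-row sample from the "Data format" section. This is only a sketch: the sample is fed to `pd.read_csv` through an in-memory `io.StringIO` buffer, and `sample_df` is a name made up for this example so that it does not overwrite `df`.

```python
import io

import pandas as pd

# The simplified sample from the "Data format" section, held in memory.
sample = io.StringIO(
    "date,time,location\n"
    "2013-04-01,20:52:00,700 Blk Of Center St\n"
    "2013-04-01,15:55:00,73R D AV&INTERNATIONAL BLVD\n"
    "NA,01:33:00,E. 28th St. & Park BLVD\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(sample)

# (number of rows, number of columns); the header line is not counted as a row.
print(sample_df.shape)

# Column names come from the first line of the CSV.
print(list(sample_df.columns))

# By default, read_csv treats the text 'NA' as a missing value.
print(sample_df['date'].isna().sum())
```

The last line shows one practical difference from reading the file as plain text: pandas recognizes `NA` as missing data instead of keeping it as a string.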

## 2. Navigating the `dataframe` object

In [None]:
# Show only the first ten rows of the dataframe object.
df.head(10)

In [None]:
# Grab only specific columns.
df[['lat','lng']]

In [None]:
# Grab all of the data where the 'reason_for_stop' column is 'Probable Cause'.
df[df['reason_for_stop'] == 'Probable Cause']
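
The filter above works in two steps, which the following sketch separates out. It uses a small made-up dataframe (`toy`) rather than the Oakland file; the coordinates and the `'Traffic Violation'` value are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Oakland file.
toy = pd.DataFrame({
    'reason_for_stop': ['Probable Cause', 'Traffic Violation', 'Probable Cause'],
    'lat': [37.80, 37.78, 37.75],
    'lng': [-122.27, -122.22, -122.18],
})

# Step 1: the comparison yields one True/False value per row.
mask = toy['reason_for_stop'] == 'Probable Cause'
print(mask.tolist())

# Step 2: indexing with that mask keeps only the rows marked True.
matches = toy[mask]
print(len(matches))

# Row filtering and column selection can be combined with .loc.
coords = toy.loc[mask, ['lat', 'lng']]
print(coords)
```

Naming the mask is optional, but it can make longer filters easier to read and reuse.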

## 3. Creating simple visualizations

In [None]:
# Dataframe objects come with plotting methods.
# Here, we create a scatter plot using the lng and lat columns
# of data.
df.plot.scatter(x='lng',
                y='lat',
                c='DarkBlue')

In [None]:
# Count how many times each value appears in the
# 'reason_for_stop' column.
reason_counts = df['reason_for_stop'].value_counts()

# Create a pie chart from those counts.
reason_counts.plot.pie(figsize=(8, 8))
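
The pie chart is drawn from per-value counts, so it can help to look at those counts directly before plotting. The sketch below uses a short made-up series (`reasons`) in place of the real column; the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the Oakland file.
reasons = pd.Series(
    ['Probable Cause', 'Traffic Violation', 'Probable Cause', 'Probable Cause'],
    name='reason_for_stop',
)

# value_counts tallies each distinct value, most frequent first.
counts = reasons.value_counts()
print(counts)

# normalize=True returns fractions of the total instead of raw counts,
# which correspond to the slice sizes in a pie chart.
shares = reasons.value_counts(normalize=True)
print(shares)
```

On the real data, the same two calls on `df['reason_for_stop']` give the numbers behind each slice.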