# DVS Week 2: Processing Data

## CSV examples

Python's standard library has some support for reading CSV:


In [None]:
import csv

with open("../data/flights.csv") as f:
    reader = csv.DictReader(f)
    data = list(reader)

data

As you can see, _every_ field is imported as a string. We _could_ tidy that up:

In [None]:
def tidy_flights_data(d):
    numeric_fields = [
        "year",
        "month",
        "day",
        "dep_time",
        "sched_dep_time",
        "dep_delay",
        "arr_time",
        "sched_arr_time",
        "arr_delay",
        "air_time",
        "distance",
        "hour",
        "minute"
    ]

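    for field in numeric_fields:
        # Missing values appear in the file as the string "NA"
        if d[field] == "NA":
            d[field] = None
        elif d[field] is not None:
            # Everything else in these columns should parse as an integer
            d[field] = int(d[field])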

    return d


flights = [tidy_flights_data(d) for d in data]
flights[0]

But that's long-winded and painful. The standard library's CSV support is fairly basic, partly because Python has excellent external libraries for this job. We'll look at one of them, `pandas`, later.
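
To make the string-typing problem easy to reproduce without the data file, here is a minimal sketch that runs `csv.DictReader` over an in-memory CSV. The sample rows and the helper names `to_int_or_none` and `tidy_row` are made up for illustration; only the column names are borrowed from the flights data.

```python
import csv
import io

# Hypothetical sample standing in for flights.csv (values are illustrative)
SAMPLE_CSV = """year,month,dep_delay,carrier
2013,1,2,UA
2013,1,NA,AA
"""


def to_int_or_none(value):
    """Convert a CSV string to int, treating 'NA' and empty strings as missing."""
    if value is None or value in ("", "NA"):
        return None
    return int(value)


def tidy_row(row, numeric_fields):
    """Return a new dict with the named fields converted; others are left as strings."""
    return {
        key: to_int_or_none(value) if key in numeric_fields else value
        for key, value in row.items()
    }


raw_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
tidy_rows = [tidy_row(r, {"year", "month", "dep_delay"}) for r in raw_rows]

print(raw_rows[0])   # every value is still a string here
print(tidy_rows[1])  # numeric fields converted, "NA" replaced by None
```

Unlike `tidy_flights_data`, this version builds a new dict for each row rather than changing the original in place, so `raw_rows` stays available for comparison.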

## JSON

Python also has good JSON support. Let's load our sample JSON file into Python, and parse it.

In [None]:
import json

with open('../data/flights_ten.json') as f:
    sample_flights = json.load(f)

sample_flights

Let's get just the first item in the `flights` list.

In [None]:
sample_flights['flights'][0]

Note how nested objects are preserved, and numbers have been automatically parsed. (Note also that some of the times in the sample data _weren't_ numbers but strings, and that's been preserved too.)
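
The same behaviour can be checked on a small in-memory document. This is only a sketch: the JSON text below is hypothetical and merely shaped like a flights record, not copied from the sample file.

```python
import json

# Hypothetical JSON text: one number, one time stored as a string, one nested object
text = """
{
  "flights": [
    {
      "carrier": "UA",
      "distance": 1400,
      "dep_time": "517",
      "origin": {"code": "EWR"}
    }
  ]
}
"""

parsed = json.loads(text)
first = parsed["flights"][0]

# JSON numbers become int/float, JSON strings stay str, JSON objects become dict
for key, value in first.items():
    print(key, type(value).__name__)
```

`json.loads` parses a string, while `json.load` (used above) reads from an open file; both produce the same Python types.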

## CSV with Pandas

`pandas` is a very popular data science / analysis library for Python. It's a great alternative for handling CSV, and is built on top of `numpy`.

This will only be a _very_ brief introduction to `pandas`. There will also be an in-class exercise to explore it a little further.

### Key `pandas` Ideas

A `Series` is a single-dimensional list of data.

A `DataFrame` is a table, composed of many columns; each column is a `Series`.

Each item in a Series - or DataFrame - has an `index`. That could be an integer - just like in an array - but it could also be a string, or _multiple_ fields (e.g. year/month/day). The point is: an index is a **way of identifying an item**, and it is usually unique (although pandas does not force it to be).
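
These three ideas can be sketched with a couple of hand-built tables. All of the data below is hypothetical and exists only to show a default integer index, a string index, and a multi-field index.

```python
import pandas as pd

# Hypothetical table: two columns, default integer index 0, 1, 2
df = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Squirtle"],
    "hp": [45, 39, 44],
})

hp = df["hp"]              # selecting one column gives a Series
print(type(hp).__name__)
print(list(df.index))

# Use a string column as the index, then look a row up by its label
by_name = df.set_index("name")
print(by_name.loc["Charmander", "hp"])

# An index can also be built from several fields (a MultiIndex)
daily = pd.DataFrame({
    "year": [2013, 2013],
    "month": [1, 1],
    "day": [1, 2],
    "count": [3, 5],       # made-up values
})
by_date = daily.set_index(["year", "month", "day"])
print(by_date.loc[(2013, 1, 2), "count"])
```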


In [None]:
import pandas as pd

pokemon = pd.read_csv('../data/pokemon.csv')

pokemon

Wow! That's a lot more impressive than the Python CSV library.

The pretty tabular view is, in fact, Jupyter rendering the DataFrame as an HTML table. Let's look at the data in more detail:

In [None]:
pokemon.info()

See how it's guessed the data types? Numbers are picked up as integers, `Legendary` is a boolean, and the other items are strings (which `info()` usually reports as `object`).
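
A minimal sketch of that type inference, using a hypothetical in-memory CSV rather than the real `pokemon.csv`. It also shows one thing worth knowing: with default settings, an integer column that contains a missing value is read as floating point.

```python
import io

import pandas as pd

# Hypothetical sample with the same kinds of column as the pokemon data
COMPLETE = """Name,HP,Legendary
Bulbasaur,45,False
Mewtwo,106,True
"""

# Same shape, but with one HP value left blank
WITH_GAP = """Name,HP,Legendary
Bulbasaur,45,False
Mewtwo,,True
"""

complete = pd.read_csv(io.StringIO(COMPLETE))
with_gap = pd.read_csv(io.StringIO(WITH_GAP))

print(complete.dtypes)  # HP is an integer type, Legendary is bool
print(with_gap.dtypes)  # HP becomes a float type because of the missing value
```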

Let's ask pandas about some statistics:

In [None]:
pokemon.describe()

We can filter the table. Let's find every Gen1 pokemon:

In [None]:
gen1 = pokemon.query('Generation == 1')
gen1

What about every Gen1 pokemon with an HP above 100?

In [None]:
gen1.query('HP > 100')
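
`query` is a convenience: the same filter can be written with boolean indexing. The sketch below uses a small hypothetical table with the same column names to show that the two forms select the same rows, and that a column name containing a space has to be wrapped in backticks inside a query string.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the pokemon data
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Chansey", "Snorlax", "Chikorita"],
    "Type 1": ["Grass", "Normal", "Normal", "Grass"],
    "HP": [45, 250, 160, 45],
    "Generation": [1, 1, 1, 2],
})

# Two conditions combined in one query string...
via_query = df.query("Generation == 1 and HP > 100")

# ...and the equivalent boolean mask (note & and the parentheses)
via_mask = df[(df["Generation"] == 1) & (df["HP"] > 100)]

print(via_query.equals(via_mask))

# Backticks let query() refer to a column whose name contains a space
normal = df.query("`Type 1` == 'Normal'")
print(normal)
```

Chaining two `query` calls, as above with `gen1`, and combining the conditions with `and` should give the same result; which reads better is mostly a matter of taste.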

We can also perform aggregate operations:

In [None]:
pokemon.groupby("Type 1").size()
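
`size()` counts the rows in each group, but `groupby` can compute other aggregates too. Here is a short sketch on hypothetical data, using the same `Type 1` and `HP` column names; the output names `count`, `mean_hp` and `max_hp` are chosen just for this example.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the pokemon data
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Chansey", "Snorlax", "Chikorita"],
    "Type 1": ["Grass", "Normal", "Normal", "Grass"],
    "HP": [45, 250, 160, 45],
})

# Number of rows per type
counts = df.groupby("Type 1").size()

# Mean of one column per type
mean_hp = df.groupby("Type 1")["HP"].mean()

# Several aggregates at once, each given its own output column name
summary = df.groupby("Type 1").agg(
    count=("HP", "size"),
    mean_hp=("HP", "mean"),
    max_hp=("HP", "max"),
)

print(counts)
print(summary)
```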