In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 6)  # max number of rows to show in this notebook, to save space
import seaborn as sns  # for better style in plots

# 1D analysis: `pandas`!

For 1D analysis, we are generally thinking about data that varies in time, that is, time series analysis. The `pandas` package is well suited to this type of data: it has very convenient methods for interpreting, searching through, and using time representations.

Let's start with the example we started the class with: taxi rides in New York City.

In [None]:
df = pd.read_csv('../data/yellow_tripdata_2016-05-01_decimated.csv', header=0, parse_dates=[0, 2], index_col=[0])

What do all these (and other) input keyword arguments do?

* header: tells which row of the data file is the header, from which it will extract column names
* parse_dates: try to interpret the values in the listed columns as dates and convert them into `datetime` objects. With `[col1, col2]`, each column is parsed separately; with `[[col1, col2]]`, the columns are combined and parsed as a single date.
* index_col: if no index column is given, the rows get an index counting from 0. By inputting `index_col=[column integer]`, that column will be used as the index instead. This is usually done with the time information for the dataset.
* skiprows: skip specific rows by passing a list of row numbers, `skiprows=[row number list]`, or skip the first rows of the file by passing an integer number of rows, `skiprows=number of rows`.
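
As a small sketch of how these arguments interact, the cell below reads a made-up CSV string rather than the taxi file. The column names and values are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical data: one junk line, a header row, then three records
raw = """# exported by some tool
pickup,dropoff,distance
2016-05-01 00:00:00,2016-05-01 00:10:00,1.5
2016-05-01 00:30:00,2016-05-01 00:45:00,3.2
2016-05-01 01:15:00,2016-05-01 01:20:00,0.8
"""

demo = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,           # skip the first line of the file
    header=0,             # the first remaining row holds the column names
    parse_dates=[0, 1],   # convert columns 0 and 1 to datetimes
    index_col=0,          # use column 0 (pickup time) as the index
)
print(demo.index)
print(demo.dtypes)
```

Note that `header` counts rows after `skiprows` has been applied, so `header=0` here refers to the `pickup,dropoff,distance` line.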


In [None]:
df

We can check to make sure the date/time information has been read in as the index:

In [None]:
df.index

From this we see that the index is indeed using the timing information in the file: it is a `DatetimeIndex`, and its `dtype` is a `datetime64` type.

We can now access a column of the data by its name, using square brackets, like so:

In [None]:
df['trip_distance']

We can plot in this way, too:

In [None]:
df['trip_distance'].plot(figsize=(14,6))

One of the biggest benefits of using `pandas` is being able to easily reference the data in intuitive ways. For example, because we set up the index of the dataframe to be the date, we can pull out data using dates. In the following, we pull out all data from the first hour of the day:

In [None]:
df.loc['2016-05-01 00']  # use .loc: selecting rows by date string with plain [] fails in recent pandas

Here we further subdivide to examine the passenger count during that time period:

In [None]:
df.loc['2016-05-01 00', 'passenger_count']  # rows for that hour, one column
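
To see what a partial date string selects, here is a minimal sketch on synthetic data. The dataframe `demo` and its values are made up and only stand in for the taxi data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one row every 20 minutes for 3 hours
times = pd.date_range('2016-05-01 00:00', periods=9, freq='20min')
demo = pd.DataFrame({'passenger_count': np.arange(1, 10)}, index=times)

# A partial date string selects every row whose timestamp falls in that hour
second_hour = demo.loc['2016-05-01 01']
print(second_hour)

# Row selection and column selection can be combined in one .loc call
counts = demo.loc['2016-05-01 01', 'passenger_count']
print(counts.tolist())
```

The string `'2016-05-01 01'` matches the rows at 01:00, 01:20, and 01:40, since it is interpreted as the whole hour rather than a single instant.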

---
###  *Exercise*

> Figure out how to access the data from dataframe `df` for the first three hours of the day at once. Plot the tip amount (`tip_amount`) for this time period.

> After you can make a line plot, try making a histogram of the data. Play around with the data range and the number of bins.

---

We can change many `plot` parameters directly from `pandas`. We can do this in our exercise plot.

We can add data to our dataframe very easily. Below we add a column that gives the minute of the hour for each ride throughout the day.

We can now use the values from the key `minute` to compute the average total fare for each minute of the hour. We access the data in the dataframe, `groupby` the minute of the hour (spanning all of the hours in the day), and then compute the mean.

In [None]:
df.loc[:, 'minute'] = df.index.minute  # adding a field for the minute of the hour
# select the column before averaging so that non-numeric columns are not aggregated
df.groupby('minute')['total_amount'].mean().plot(color='k', grid=True, figsize=(14, 4), lw=3)
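
The grouping step can be checked by hand on a tiny synthetic example. The values below are invented so that the means are easy to verify.

```python
import pandas as pd

# Synthetic data: one fare every 30 minutes over 2 hours
times = pd.date_range('2016-05-01 00:00', periods=4, freq='30min')
demo = pd.DataFrame({'total_amount': [10.0, 20.0, 30.0, 40.0]}, index=times)

# Minute of the hour for each row: 0, 30, 0, 30
demo['minute'] = demo.index.minute

# Average fare for each minute of the hour, across all hours
mean_by_minute = demo.groupby('minute')['total_amount'].mean()
print(mean_by_minute)
```

Rows at minute 0 have fares 10 and 30, and rows at minute 30 have fares 20 and 40, so the result has one mean per distinct minute value.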