# Data

We'll use weather data to explore several ways of working with data. Some rely on packages from PyPI and others on modules built into Python. The built-in modules tend to be designed to handle smaller amounts of data.

We'll start with data from a weather station in the Capitol Hill area of Seattle. Take a look at the `3235995.csv` file. I extracted it from [NOAA](www.ncdc.noaa.gov/) yesterday (using instructions found in your book).

First, let's load it with the standard `csv` module. After you `import csv`, run `help(csv)` to get some brief help on the module.

Here we load the file, using code taken straight from our book:

In [None]:
import csv

with open('3235995.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # same as reader.__next__(): reads one row (the header)
    print(header)

Note that each call to `__next__()` (or, equivalently, the built-in `next(reader)`) grabs the next row. Let's read 5 rows using `enumerate` in a `for` loop, breaking out once we have enough:

In [None]:
with open('3235995.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # same as reader.__next__(): reads one row (the header)
    print(header)
    for index, row in enumerate(reader):
        if index < 5:
            print(row)
        else:
            break
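
As an alternative to the `enumerate`/`break` pattern, `itertools.islice` can take the first few rows of any iterator. The sketch below uses a few made-up rows in an `io.StringIO` in place of the real file, so the column names and values are illustrative assumptions, not the contents of `3235995.csv`.

```python
import csv
import io
from itertools import islice

# Made-up rows standing in for the real file (hypothetical values)
sample = io.StringIO(
    'STATION,DATE,PRCP\n'
    'X1,2020-01-01,0.10\n'
    'X1,2020-01-02,0.00\n'
    'X1,2020-01-03,0.25\n'
)

reader = csv.reader(sample)
header = next(reader)                 # first row: the column names
first_two = list(islice(reader, 2))   # at most the next 2 rows, then stop

print(header)
print(first_two)
```

The reader is left positioned just after the rows that were consumed, so a later loop would continue from the third data row.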

Now let's get the first 1000 entries of the `PRCP` (precipitation) column into a single list, `measurements`, which we will then plot.
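
One possible way to build that list is sketched here. The helper name `read_prcp` is hypothetical, and the sketch assumes that a blank `PRCP` field means "no reading", which it stores as `nan` so the list keeps its positions. The sample rows are made up so the example runs without the real file.

```python
import csv
import io
from itertools import islice


def read_prcp(f, limit=1000):
    """Return up to `limit` PRCP values from an open CSV file as floats."""
    reader = csv.DictReader(f)  # uses the header row for column names
    values = []
    for row in islice(reader, limit):
        raw = row['PRCP'].strip()
        # Assumption: a blank field is a missing reading, kept as nan
        values.append(float(raw) if raw else float('nan'))
    return values


# Tiny made-up sample standing in for the real file
sample = io.StringIO('DATE,PRCP\n2020-01-01,0.10\n2020-01-02,\n2020-01-03,0.25\n')
print(read_prcp(sample))

# With the real file, this would look like:
# with open('3235995.csv', newline='') as f:
#     measurements = read_prcp(f)
```

Converting to `float` matters because `csv` hands back every field as a string, and plotting strings would not give a numeric axis.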

WARNING: 

Next, let's make a simple plot of them. We need to think for a second about what we want: a trend as a function of time. We won't use dates just yet; let's start with just the sequence number.

That means plotting each measurement against its sequence number, as a scatter plot.

In [None]:
from matplotlib import pyplot as plt

# x is the sequence number; using len() keeps x and y the same length
plt.scatter(range(len(measurements)), measurements)
plt.xlabel('Measurement Number')
plt.ylabel('Precipitation (in)')

OK, we can already see patterns! Keep in mind that we've only pulled in the first 1000 entries here (a little under 3 years, assuming one reading per day), not the whole file.

## Using `pandas`

Pandas is the standard way to manipulate tabular (rows and columns) data. There are whole courses taught on it. We are going to go through some very simple stuff here.

First, let's read in the whole sample and take a quick look. Note the integration with Jupyter: the `DataFrame` is rendered as a table.

In [None]:
import pandas as pd

df = pd.read_csv('3235995.csv')
df

Let's make the same plot as we did previously.

In [None]:
plt.scatter(range(len(df)), df['PRCP'])

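OK, great. Can we do anything with the months? First we check the column types: `DATE` will most likely have been read in as plain strings, so we convert it with `pd.to_datetime`.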

In [None]:
df.dtypes

In [None]:
df['DATE'] = pd.to_datetime(df['DATE'])
df.dtypes
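
To see the effect of the conversion in isolation, here is a small sketch on a made-up frame. The `demo` values and the `YYYY-MM-DD` date format are assumptions for illustration; the real file's format may differ.

```python
import pandas as pd

# Made-up frame; the real file's DATE format may differ
demo = pd.DataFrame({'DATE': ['2020-01-31', '2020-02-01'], 'PRCP': [0.10, 0.25]})
before = demo['DATE'].dtype   # a text dtype (shown as `object` in many pandas versions)

demo['DATE'] = pd.to_datetime(demo['DATE'])
after = demo['DATE'].dtype    # a datetime64 dtype

print(before, '->', after)
print(demo['DATE'].dt.month.tolist())  # the .dt accessor only works after conversion
```

Once the column holds datetimes, the `.dt` accessor exposes parts such as `.dt.year`, `.dt.month`, and `.dt.day`.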

Now we can do things like asking for summaries as a function of year. First we split the date into separate year, month, and day columns:

In [None]:
# The .dt accessor extracts date parts for the whole column at once
df['year'] = df['DATE'].dt.year
df['month'] = df['DATE'].dt.month
df['day'] = df['DATE'].dt.day
df

In [None]:
by_year = df.groupby('year')['PRCP'].sum()
by_year
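
To make the `groupby` step less magical, the same call is sketched here on a made-up frame small enough to check by hand. The `demo` values are illustrative assumptions, not readings from the station.

```python
import pandas as pd

# Five made-up readings across two years
demo = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020, 2020],
    'PRCP': [0.10, 0.20, 0.00, 0.50, 0.25],
})

# Split the rows by year, keep the PRCP column, and add up each group
totals = demo.groupby('year')['PRCP'].sum()
print(totals)
```

The result is a `Series` indexed by year, with roughly 0.30 for 2019 and 0.75 for 2020. Swapping `.sum()` for `.mean()` or `.max()` gives other per-year summaries.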

In [None]:
by_year.plot.bar()

## Seaborn

Just to give you a quick example of some of the crazy visualizations you can do, let's look at the rainfall by month. We'll use a very nice, and very opinionated, plotting library called `seaborn`.

In [None]:
# %pip installs into the environment of the running kernel
%pip install seaborn

Let's make a plot of accumulation per month, with the years on top of each other so we can see the general trend.

In [None]:
import seaborn as sns
by_month = df.groupby(['year', 'month'])['PRCP'].sum().reset_index()
sns.relplot(x='month', y='PRCP', data=by_month, hue="year", kind="line")
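
If you want to inspect the numbers behind such a plot, one option is to pivot the long year/month table into a month-by-year grid. The sketch below uses made-up values, and the names `demo`, `by_month_demo`, and `wide` are hypothetical.

```python
import pandas as pd

# Made-up readings: two months in each of two years
demo = pd.DataFrame({
    'year':  [2019, 2019, 2019, 2020, 2020],
    'month': [1, 1, 2, 1, 2],
    'PRCP':  [0.10, 0.20, 0.40, 0.50, 0.25],
})

# Long format: one row per (year, month), as used for the line plot
by_month_demo = demo.groupby(['year', 'month'])['PRCP'].sum().reset_index()

# Wide format: months down the side, one column per year
wide = by_month_demo.pivot(index='month', columns='year', values='PRCP')
print(wide)
```

`reset_index()` turns the `year` and `month` group keys back into ordinary columns, which is the long layout that the `x`, `y`, and `hue` arguments refer to by name.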