

# Introduction to Pandas

![pandas Logo](https://github.com/pandas-dev/pandas/raw/master/web/pandas/static/img/pandas.svg "pandas Logo")

## Questions
1. What are the important pandas data structures?
1. How do I interact with these?
1. What else can pandas do for me?

In [None]:
import pandas as pd

## The pandas [`DataFrame`](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)...
... is a **labeled**, two-dimensional columnar structure similar to a table, spreadsheet, or the R `data.frame`.

![dataframe schematic](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_dataframe.svg "Schematic of a pandas DataFrame")

The `columns` that make up our `DataFrame` can be lists, dictionaries, NumPy arrays, pandas `Series`, or more. Within these `columns` our data can be text, numbers, dates and times, or many other data types you may have encountered in Python and NumPy. Shown here on the left in dark gray, the leftmost column has a special name, the `Index`, and it contains information characterizing each row of our `DataFrame`. Like any other `column`, the `Index` can label our rows by text, numbers, `datetime`s (a popular one!), or more.
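To make the description above concrete, here is a minimal sketch that builds a small `DataFrame` from a list, a NumPy array, and a `Series`, all sharing one `Index`. The column names and values are made up for illustration and are not part of the ENSO data used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical dates and values, for illustration only.
dates = pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"])

example_df = pd.DataFrame(
    {
        "from_list": [26.1, 26.5, 27.0],                          # plain Python list
        "from_array": np.array([0.1, 0.2, 0.3]),                  # NumPy array
        "from_series": pd.Series(["a", "b", "c"], index=dates),   # pandas Series
    },
    index=dates,  # the Index labels every row
)

print(example_df.index)   # the row labels
print(example_df.dtypes)  # one data type per column
```

Each column keeps its own data type, while the `Index` is shared by all of them.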

Let's take a look by reading in some `.csv` data [[ref](https://www.ncdc.noaa.gov/teleconnections/enso/indicators/sst/)].

In [None]:
df = pd.read_csv('data/enso_data.csv')

df

In [None]:
df.index

Our index isn't particularly helpful yet: it is just a default row counter. pandas is clever, though! A few optional keyword arguments later, and...

In [None]:
df = pd.read_csv('data/enso_data.csv', index_col=0, parse_dates=True)

df

In [None]:
df.index

... now we have our data helpfully organized by a proper `datetime`-like object. Each row of our data can now be referenced by its date! This sneak preview of the pandas `DatetimeIndex` also unlocks much of pandas' most useful time series functionality. Don't worry, we'll get there. What are the actual columns of data we've read in here?

In [None]:
df.columns
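If you want to see the effect of `index_col` and `parse_dates` without the data file at hand, the following sketch reads the same kind of table from an in-memory string. The CSV text, its column names, and its values are hypothetical stand-ins, not the contents of `data/enso_data.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = """datetime,Nino12,Nino34
1982-01-01,24.3,26.7
1982-02-01,25.5,26.7
1982-03-01,25.2,27.2
"""

# Default: pandas generates a simple integer index.
default_df = pd.read_csv(io.StringIO(csv_text))

# Use the first column as the index and parse it as dates.
dated_df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(default_df.index).__name__)  # RangeIndex
print(type(dated_df.index).__name__)    # DatetimeIndex
```

Note that the date column no longer appears in `dated_df.columns`, because it has become the `Index`.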

## The pandas [`Series`](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)...

... is essentially any one of the columns of our `DataFrame`, with its accompanying `Index` to provide a label for each value in our column.

![pandas Series](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_series.svg "Schematic of a pandas Series")

The pandas `Series` is a fast and capable 1-dimensional array of nearly any data type we could want, and it can behave very similarly to a NumPy `ndarray` or a Python `dict`. You can take a look at any of the `Series` that make up your `DataFrame` with its label and Python `dict`-like notation, as well as dot-shorthand:

In [None]:
df["Nino34"]

In [None]:
df.Nino34
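The `ndarray`-like and `dict`-like behavior mentioned above can be sketched on a tiny, hypothetical `Series` (the labels and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
s = pd.Series([26.7, 27.2, 28.0], index=["jan", "feb", "mar"], name="Nino34")

# ndarray-like: element-wise arithmetic and NumPy functions work directly.
warmer = s + 1.0
largest = np.max(s)

# dict-like: look up by label, test membership, list the labels.
feb_value = s["feb"]
has_apr = "apr" in s
labels = list(s.keys())

print(warmer)
print(largest, feb_value, has_apr, labels)
```

One caveat worth knowing: dot-shorthand such as `df.Nino34` only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method, so bracket notation is the safer general choice.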

## Investigating the `DataFrame` and `Series`

pandas has some helpful shorthand methods for quickly investigating our `DataFrame`. By default, `head()` and `tail()` show the first and last five rows.

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
#TODO add df label indexing here, mention coordinate comparison, abstract the discussion a bit

Let's take a look at a `Series` on its own. Recall selecting just one of our columns.

In [None]:
nino34_series = df["Nino34"]

nino34_series

`Series` can be indexed, selected, and subset as both `ndarray`-like,

In [None]:
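# positional access; plain integer keys in [] are deprecated on a non-integer index in recent pandas
nino34_series.iloc[3]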

and `dict`-like,

In [None]:
nino34_series["1982-04-01"]

and these can be extended in ways you might expect, and in ways you might not:

In [None]:
# numpy-like interval slices
nino34_series[::12]

In [None]:
# label-based slicing
nino34_series["2000-12-01":"2002-04-01"]
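One of the ways you might not expect: label-based slices include their end point, while position-based slices do not. A small sketch on a hypothetical monthly `Series` (values invented for illustration), using the explicit `.iloc` and `.loc` accessors:

```python
import pandas as pd

# Hypothetical monthly series, for illustration only.
months = pd.date_range("2000-01-01", periods=6, freq="MS")
s = pd.Series(range(6), index=months)

# Position-based: positions 1 and 2; the stop position is excluded.
by_position = s.iloc[1:3]

# Label-based: February through April; the stop label is included.
by_label = s.loc["2000-02-01":"2000-04-01"]

print(len(by_position))  # 2
print(len(by_label))     # 3
```

Spelling out `.iloc` (position) or `.loc` (label) also makes the intent clear to anyone reading the code later.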

In [None]:
#TODO add `slice`, expand on label-based indexing of entire df, lists of columns, etc.

## The Powers of Pandas

### Quick Plots of Your Data

In [None]:
df.Nino34.plot()

### Calculations

In [None]:
df.describe()

In [None]:
df.mean()

In [None]:
df.Nino34.mean()

### Other Plots

In [None]:
df[['Nino12', 'Nino34']].plot.hist()

In [None]:
df[['Nino12', 'Nino34']].plot.box()

For more examples of plotting choices, check out [the pandas plot documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).

### Advanced subsetting

In [None]:
# Select every January using the month of the DatetimeIndex
df[df.index.month == 1]

In [None]:
df.loc[df.index.year == 1995]

In [None]:
df[df.Nino34anom > 2]
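Each of the selections above passes a boolean mask inside the brackets, and masks can be combined with `&` (and) and `|` (or). Here is a minimal sketch on a hypothetical two-year table; the column name `anom` and its values are invented for illustration:

```python
import pandas as pd

# Hypothetical monthly anomalies rising from -3.0 to 8.5, for illustration only.
months = pd.date_range("1997-01-01", periods=24, freq="MS")
toy = pd.DataFrame({"anom": [0.5 * i - 3 for i in range(24)]}, index=months)

is_january = toy.index.month == 1   # boolean array from the DatetimeIndex
is_warm = toy["anom"] > 2           # boolean Series from a column

warm = toy[is_warm]
warm_januaries = toy[is_warm & is_january]    # both conditions must hold
january_or_warm = toy[is_warm | is_january]   # either condition may hold

print(warm_januaries)
```

When writing the conditions inline, wrap each one in parentheses, for example `toy[(toy["anom"] > 2) & (toy.index.month == 1)]`, because `&` and `|` bind more tightly than comparison operators.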

In [None]:
nino_temp_cols = ['Nino12', 'Nino3', 'Nino4', 'Nino34']
nino_anom_cols = ['Nino12anom', 'Nino3anom', 'Nino4anom', 'Nino34anom']

In [None]:
df[nino_temp_cols].plot()

In [None]:
df[nino_anom_cols].plot()

### Resampling

In [None]:
df.Nino34.plot()

In [None]:
df.Nino34.resample('1Y').mean().plot()
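To see what resampling computes, it can help to try it on numbers small enough to check by hand. The sketch below uses a hypothetical monthly `Series` holding the values 1 through 24, and the `"YS"` (year start) alias, which labels each annual bin by its first day rather than its last; that alias is a choice made here for illustration, not the one used in the cell above.

```python
import pandas as pd

# Hypothetical monthly series with values 1..24, for illustration only.
months = pd.date_range("2000-01-01", periods=24, freq="MS")
monthly = pd.Series(range(1, 25), index=months, dtype="float64")

# Group the months into calendar years, then aggregate each group.
annual_mean = monthly.resample("YS").mean()
annual_max = monthly.resample("YS").max()

# A similar result via groupby; here the index is the integer year.
by_year = monthly.groupby(monthly.index.year).mean()

print(annual_mean)
print(by_year)
```

`resample` returns an intermediate object, so it always needs a following aggregation such as `.mean()`, `.max()`, or `.sum()` before there is anything to plot.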

In [None]:
def convert_degc_to_kelvin(temperature_degc):
    """
    Convert from degrees Celsius to kelvin.

    Works on scalars, NumPy arrays, and pandas Series alike.
    """
    return temperature_degc + 273.15

In [None]:
# Convert a single value
convert_degc_to_kelvin(0)

In [None]:
nino_34 = df.Nino34

In [None]:
nino_34

In [None]:
type(df.Nino12.values[0:10])

In [None]:
type(df.Nino12[0:10])

In [None]:
convert_degc_to_kelvin(nino_34)

In [None]:
df['Nino34_degK'] = convert_degc_to_kelvin(nino_34)

In [None]:
df.Nino34_degK
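The cells above work because arithmetic on a `Series` is applied element by element, so a function written for a single number often works unchanged on a whole column. A self-contained sketch with hypothetical temperatures (the labels and values are invented for illustration):

```python
import numpy as np
import pandas as pd


def convert_degc_to_kelvin(temperature_degc):
    """Convert from degrees Celsius to kelvin."""
    return temperature_degc + 273.15


# Hypothetical temperatures in degrees Celsius, for illustration only.
temps = pd.Series([0.0, 25.0, 26.85], index=["a", "b", "c"])

as_series = convert_degc_to_kelvin(temps)         # Series in, Series out, labels kept
as_array = convert_degc_to_kelvin(temps.values)   # ndarray in, ndarray out

print(type(as_series).__name__)
print(type(as_array).__name__)
```

Passing the `Series` itself, rather than `.values`, keeps the index attached, which is what lets the result be assigned straight back to the `DataFrame` as a new column.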

In [None]:
df.Nino34_degK.plot()

In [None]:
df.to_csv('nino_analyzed_output.csv')

In [None]:
pd.read_csv('nino_analyzed_output.csv', index_col=0, parse_dates=True)
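The write-then-read round trip above can also be sketched entirely in memory, which makes it easy to check what survives the trip. The data, the index name `datetime`, and the use of a `StringIO` buffer in place of a file are all assumptions made for illustration:

```python
import io

import pandas as pd

# Hypothetical data, for illustration only.
months = pd.date_range("2000-01-01", periods=3, freq="MS", name="datetime")
original = pd.DataFrame({"Nino34": [26.7, 27.2, 28.0]}, index=months)

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer)

# Read it back, restoring the first column as a parsed date index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

# Without index_col, the dates come back as an ordinary column of text.
buffer.seek(0)
flat = pd.read_csv(buffer)

print(restored.index)
print(flat.columns)
```

A CSV file stores only text, so the `DatetimeIndex` has to be rebuilt with `index_col` and `parse_dates` every time the file is read.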