## A Few More Handy Pandas Ditties

##### Imports

In [None]:
# Python 2 & 3 Compatibility
from __future__ import print_function, division
import numpy as np
import pandas as pd
%matplotlib inline

##### Read in some Weather Data

In [None]:
# Read it in, set the Date/Time as the index (we're going to build a time series!)
weather = pd.read_csv('data/weather.csv', index_col='Date/Time', parse_dates=True)
# Take a look
weather['Temp (C)'].plot(figsize=(15, 6))

In [None]:
# Examine the columns
weather.info()

In [None]:
# Examine some of the values
weather.head()

### Applying Functions Across `Series` or `DataFrame`
Often we'll want to update a `Series` or `DataFrame` by applying a function to every value in one or more columns.  There are three main methods for doing this: `map()`, `apply()`, and `applymap()`.

**`map()`**  
Run a function on every element in a **`Series`**.  For instance, let's try converting temperature to Fahrenheit and adding that column:

In [None]:
# Function that converts Celsius to Fahrenheit
def celsius_to_fahrenheit(temp):
    return (9.0*temp/5.0) + 32

# Use it to make the conversion and add a new column for it
weather['Temp (F)'] = weather['Temp (C)'].map(celsius_to_fahrenheit)
weather.head()

##### Lambda Functions
Often we won't want to explicitly write out the function definition for something like this because we'll just use it once and never again.  This is where "throwaway" (anonymous) functions come in, written with the `lambda` keyword.  Here's how you would do the same task with a `lambda`:

In [None]:
weather['Temp (F)'] = weather['Temp (C)'].map(lambda x: 9.0*x/5.0 + 32)
weather.head()

**`apply()`**  
This is for functions that operate on entire arrays (`Series`) within a `DataFrame`.  Examples include the usual aggregation functions such as `sum()` and `mean()`.  Here, for example, is how we might use it to find the range of each temperature column:

In [None]:
weather.info()

In [None]:
# Select only temperature columns and find their range
weather_temps = weather[['Temp (C)', 'Temp (F)']]
weather_temps.apply(lambda x: x.max() - x.min())

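**`applymap()`**  
This does element-wise operations on everything in a **`DataFrame`**.  In pandas 2.1 and later the same operation is also available as `DataFrame.map()`, and `applymap()` is deprecated: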

In [None]:
# Function to format numerics to two decimal places
# (named fmt_2dp so it doesn't shadow the built-in format())
fmt_2dp = lambda x: '%.2f' % x
weather_temps.applymap(fmt_2dp)
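
To see the three methods side by side, here is a minimal sketch on a tiny made-up table (the `toy` DataFrame and its values are hypothetical, not the weather file).  It assumes nothing beyond `pandas` itself, and it picks `DataFrame.map` when the installed version has it, falling back to `applymap` otherwise:

```python
import pandas as pd

# Hypothetical toy data, not the weather file used above
toy = pd.DataFrame({'Temp (C)': [0.0, 10.0, 25.0]})

# map(): element-wise on a single Series
toy['Temp (F)'] = toy['Temp (C)'].map(lambda c: 9.0 * c / 5.0 + 32)

# apply(): the function receives each column as a whole Series
ranges = toy.apply(lambda col: col.max() - col.min())

# Element-wise on every cell of the DataFrame:
# DataFrame.map on pandas 2.1+, applymap on older versions
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap
formatted = elementwise(lambda v: '%.2f' % v)

print(ranges)
print(formatted)
```

Note that `apply()` collapses each column to a single number here, while the element-wise call returns a table of the same shape filled with strings.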

#### String Operators
`map`, `apply`, and `applymap` are general and let you apply just about any function to the elements of a DataFrame.  However, `pandas` also has many built-in methods for common operations, especially for strings, which are available through the `.str` accessor.

The `Weather` column is our only string here, so let's use it to look at some string operators.  First let's check out the unique values in there:

In [None]:
weather.Weather.unique()

**`replace()`**  
Just to demonstrate, let's replace "Fog" with " Fog" wherever a value starts with "Fog" (the `^` in the pattern anchors the match to the start of the string):

In [None]:
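# regex=True is needed for '^' to act as an anchor (pandas 2.0+ defaults to literal matching)
spacey_fog = weather.Weather.str.replace('^Fog', ' Fog', regex=True)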
spacey_fog.unique()

**`strip()`**  
Now let's undo our work with `strip()` to remove leading/trailing whitespace:

In [None]:
spacey_fog.str.strip().unique()

**`contains()`**  
Let's use this method to check whether the Weather value contains "Snow" and store the result as `is_snowing`:

In [None]:
is_snowing = weather.Weather.str.contains('Snow')
weather[is_snowing]
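
Here is a small self-contained sketch of the same three string methods on a handful of made-up labels (the `labels` Series is hypothetical, written in the style of the `Weather` column), so you can check the results by eye:

```python
import pandas as pd

# Hypothetical labels in the style of the Weather column
labels = pd.Series(['Fog', 'Freezing Fog', 'Snow', 'Rain,Snow Showers'])

# Anchored regex: only values that START with "Fog" get the leading space
spaced = labels.str.replace('^Fog', ' Fog', regex=True)

# strip() removes the leading space again
restored = spaced.str.strip()

# contains() returns a boolean Series that can be used as a row mask
has_snow = labels.str.contains('Snow')
snowy = labels[has_snow]

print(spaced.tolist())
print(snowy.tolist())
```

"Freezing Fog" is left alone by the replacement because the `^` anchor only matches at the start of the value.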

### Time Series  
As you're probably well aware by now, `pandas` can be indexed by datetimes without batting an eye.  Effectively, this means it handles **Time Series Data** out of the box!

Here are a few nice methods for working with time series.  We'll expand on these when we discuss time series explicitly later in the course.

**`date_range()`**  
Creates a `DatetimeIndex`, which can index a time series.  This is especially useful if you don't already have one or want to make changes to one:

In [None]:
# Example date range: one date every 3 days, starting January 1st, for 6 periods
dates = pd.date_range('20130101', periods=6, freq='3D')
dates

In [None]:
# Create a random dataframe with it
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

**`resample()`**  
A regularly spaced datetime index has a frequency.  For instance, in the example above the frequency was every 3 days.  The `resample()` method is valuable because it lets us either **upsample** or **downsample** to change the frequency of the observations.  Upsampling means moving to more frequent observations.  You can't conjure up measurements you never took, so the new rows start out empty, but you can **interpolate** values for them, and `pandas` has methods for doing this.  In downsampling, you're simply reducing the observations down to a lower frequency.

Let's use resampling along with our `is_snowing` variable to determine the snowiest **month** (as opposed to the current data by day)!  When downsampling like this, you can specify parameters as to how to aggregate the observations being dropped.  Here we'll use the **mean**:

In [None]:
# What's happening here?
# Aggregate by calling .mean() on the resampler (on pandas 2.2+, use 'ME' instead of 'M')
is_snowing.astype(float).resample('M').mean()
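
To make upsampling and downsampling concrete, here is a minimal sketch on a made-up series with one value every 3 days (the series `s` and its values are hypothetical).  It upsamples to daily and fills the gaps by linear interpolation, then downsamples to 6-day bins using the mean:

```python
import pandas as pd

# Hypothetical series: one observation every 3 days
idx = pd.date_range('2013-01-01', periods=4, freq='3D')
s = pd.Series([0.0, 3.0, 6.0, 9.0], index=idx)

# Upsample to daily: the days we never observed show up as NaN
daily = s.resample('D').asfreq()

# Linear interpolation fills those gaps
daily_filled = daily.interpolate()

# Downsample to 6-day bins, aggregating each bin with the mean
six_day = s.resample('6D').mean()

print(daily_filled)
print(six_day)
```

The interpolated values are estimates, not observations, so it's worth keeping track of which rows were filled in.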

### Joining Related Datasets

**`merge()`**  
What if we now wanted to join `is_snowing` with the columns from `weather` so they're all aligned in the same DataFrame?  This is literally called a **join** in classical SQL (database query language) terms, and `pandas` has a few ways to accomplish it.  `merge` is the most general, so let's start there:

In [None]:
# What's happening?
weather_snowing = weather.merge(pd.DataFrame(is_snowing), left_index=True, right_index=True)
weather_snowing

**`join()`**  
This is slightly different from `merge`: it joins on the index by default.  We prefer `merge`:

In [None]:
weather.join(is_snowing, how="inner", rsuffix='2')
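
Both calls above end up with two columns derived from the name `Weather`, which is why suffixes appear in the output.  Here is a minimal sketch on a three-row stand-in (the `obs` and `flag` objects are hypothetical substitutes for `weather` and `is_snowing`) showing the default suffixes and one way to avoid them by renaming the Series first:

```python
import pandas as pd

idx = pd.date_range('2013-01-01', periods=3, freq='D')

# Hypothetical stand-ins for `weather` and `is_snowing`
obs = pd.DataFrame({'Weather': ['Snow', 'Fog', 'Snow']}, index=idx)
flag = obs['Weather'].str.contains('Snow')  # this Series is also named 'Weather'

# merge on the index: overlapping column names get _x / _y suffixes by default
merged = obs.merge(flag.to_frame(), left_index=True, right_index=True)

# Renaming the Series first avoids the name clash altogether
joined = obs.join(flag.rename('is_snowing'), how='inner')

print(merged.columns.tolist())
print(joined.columns.tolist())
```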

**`concat()`**  
This concatenates two DataFrames vertically by default, i.e. it stacks the rows of one on top of the rows of the other:

In [None]:
weather_concat = pd.concat([weather.iloc[0:100,], weather.iloc[200:300,]])
weather_concat.info()

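**`append()`**  
Add a single row to a DataFrame.  `DataFrame.append()` was removed in pandas 2.0, so we use `pd.concat()` with a one-row DataFrame instead:

In [None]:
# iloc[[305]] (note the double brackets) selects row 305 as a one-row DataFrame
weather_append = pd.concat([weather_concat, weather.iloc[[305]]])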
weather_append.info()
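
Here is a minimal sketch of stacking rows with `pd.concat()` on a tiny made-up table (the `frame` DataFrame is hypothetical), including the single-row case.  The detail worth noticing is that selecting with a list, as in `iloc[[2]]`, gives back a one-row DataFrame, whereas `iloc[2]` gives back a `Series`:

```python
import pandas as pd

# Hypothetical data
frame = pd.DataFrame({'Temp (C)': [1.0, 2.0, 3.0, 4.0]})

# Stack two slices vertically; the original index labels are kept
stacked = pd.concat([frame.iloc[0:2], frame.iloc[3:4]])

# iloc[[2]] (list selector) is a one-row DataFrame, so it stacks cleanly
with_row = pd.concat([stacked, frame.iloc[[2]]])

print(with_row)
```

The result keeps the original index labels in the order the pieces were stacked; pass `ignore_index=True` to `pd.concat()` if you'd rather get a fresh 0..n-1 index.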

### Summarizing Data

**`corr()`**  
This method calculates pairwise **correlations** between all of the numeric columns in your data table:

In [None]:
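# Keep only numeric columns; recent pandas raises an error on string columns like Weather
weather.select_dtypes(include=[np.number]).corr()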

**`cov()`**  
Similarly, here is **covariance**:

In [None]:
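# Keep only numeric columns; recent pandas raises an error on string columns like Weather
weather.select_dtypes(include=[np.number]).cov()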
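
To get a feel for the difference between the two, here is a minimal sketch on made-up data (the `toy` table is hypothetical) in which one column is an exact linear function of the other.  The correlation is unit-free and comes out at 1, while the covariance depends on the scale of both columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: Fahrenheit is an exact linear function of Celsius
celsius = [0.0, 10.0, 20.0, 30.0]
toy = pd.DataFrame({
    'Temp (C)': celsius,
    'Temp (F)': [9.0 * c / 5.0 + 32 for c in celsius],
    'Weather': ['Snow', 'Fog', 'Clear', 'Clear'],
})

# Drop the string column before computing the summaries
numeric = toy.select_dtypes(include=[np.number])
corr = numeric.corr()
cov = numeric.cov()

print(corr)
print(cov)
```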

### Categorical Variables
`pandas` lets you work with categorical variables.  These are variables that can take only a certain set of values ("R", "PG-13", "PG", "G").  We'll see this a lot later on; for now, here's just an example:

In [None]:
weather['Weather Cat'] = weather.Weather.astype('category')
weather.info()
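
As a quick look under the hood, here is a minimal sketch using the movie-rating values mentioned above (the `ratings` Series is made up for illustration).  A categorical column stores the set of allowed values once, plus a small integer code per row:

```python
import pandas as pd

# Hypothetical ratings column
ratings = pd.Series(['PG', 'R', 'G', 'PG', 'PG-13'])
cat = ratings.astype('category')

# The distinct values, stored once (sorted alphabetically by default)
categories = cat.cat.categories.tolist()

# One integer code per row, pointing into the categories
codes = cat.cat.codes.tolist()

print(categories)
print(codes)
```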

### Misc

**`shift()`**  
Shift a column forward or backward by some number of rows:

In [None]:
weather['Forward'] = weather['Temp (F)'].shift(periods=2)
weather['Backward'] = weather['Temp (F)'].shift(periods=-1)
weather

**`diff()`**  
Calculate the difference between each row and the one before it:

In [None]:
weather['Temp Diff'] = weather['Temp (F)'].diff()
weather
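
The full weather table is a lot to scan, so here is a minimal sketch of `shift()` and `diff()` on four made-up temperatures (the `temps` Series is hypothetical).  It also shows that `diff()` is the same as subtracting a shifted copy of the column from itself:

```python
import pandas as pd

# Hypothetical temperatures
temps = pd.Series([50.0, 53.0, 51.0, 58.0])

forward = temps.shift(periods=1)    # values move down one row; first row becomes NaN
backward = temps.shift(periods=-1)  # values move up one row; last row becomes NaN
change = temps.diff()               # each row minus the row before it

# diff() matches a manual subtraction of the shifted column
same = change.equals(temps - temps.shift(1))

print(pd.DataFrame({'temp': temps, 'forward': forward,
                    'backward': backward, 'change': change}))
```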