##### Imports

In [None]:
# Python 2 & 3 Compatibility
from __future__ import print_function, division
import numpy as np
import pandas as pd
%matplotlib inline

## A Few More Handy Pandas Ditties

##### Read in some Weather Data

In [None]:
# Read it in, set the Date/Time as the index (we're going to build a time series!)
weather = pd.read_csv('data/weather.csv', index_col='Date/Time')
# Take a look
weather['Temp (C)'].plot(figsize=(15, 6))

In [None]:
# Examine the columns
weather.info()

In [None]:
# Examine some of the values
weather.head()

### Applying Functions Across `Series` or `DataFrame`
- Often we'll want to update a `Series` or `DataFrame` by applying a function to all of the values in one or more columns.  
- There are 3 main methods for doing this: 
  - `map()`
  - `apply()`
  - `applymap()`

**`map()`**  
Run a function on every element in a **`Series`**.  For instance, let's try converting temperature to Fahrenheit and adding that column:

In [None]:
# Function that converts Celsius to Fahrenheit
def celsius_to_fahrenheit(temp):
    return (9.0*temp/5.0) + 32

def hello_fog(weather):
    if weather == 'Fog':
        return 'Hello'
    else:
        return weather

# Use it to make the conversion and add a new column for it
weather['Temp (F)'] = weather['Temp (C)'].map(celsius_to_fahrenheit)
weather.head()
# weather.Weather.map(hello_fog)

**Lambda Functions**
- Often we won't want to explicitly write out the function definition for something like this because we'll just use it once and never again.  
- This is where "throwaway" or "temp" functions come in with the `lambda` operator.  
- Here's how you would do the same task with a `lambda`:

In [None]:
weather['Temp (F)'] = weather['Temp (C)'].map(lambda x: 9.0*x/5.0 + 32)
weather.head()
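For simple arithmetic like this, a vectorized expression on the whole `Series` should give the same result as `map()` with a `lambda`. Here is a minimal sketch on a made-up `Series` (the name `temps_c` and its values are illustrative, not taken from the weather file):

```python
import pandas as pd

# Hypothetical Celsius readings, standing in for weather['Temp (C)']
temps_c = pd.Series([-1.8, 0.0, 100.0])

# Element-wise conversion with map() and a lambda
mapped = temps_c.map(lambda x: 9.0 * x / 5.0 + 32)

# The same computation written as a vectorized expression on the whole Series
vectorized = 9.0 * temps_c / 5.0 + 32
```

`map()` remains the more general tool: it accepts any Python function, including ones with branching logic like `hello_fog` above.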

**`apply()`**  
- This is for functions that operate on entire arrays (`Series`) within a `DataFrame`: each column (or, with `axis=1`, each row) is passed to the function as a whole.  
- Examples would include your usual aggregation functions like `sum()`, `mean()`, etc.  
- Here, for example, is how we might use it to find the range of each temperature column:

In [None]:
# Select only temperature columns and find their range
weather_temps = weather[['Temp (C)', 'Temp (F)']]
a = weather_temps.apply(lambda x: x.max() - x.min())
a.loc['Temp (C)']
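To make the column-versus-row behavior concrete, here is a small sketch on a made-up frame (the column names `a` and `b` are placeholders):

```python
import pandas as pd

# Hypothetical two-column frame; the values are made up for illustration
df = pd.DataFrame({'a': [1.0, 4.0, 2.0], 'b': [10.0, 30.0, 20.0]})

# Default axis=0: the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)
```

`col_range` is indexed by column name, while `row_range` is indexed by the original row labels.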

**`applymap()`**  
This does element-wise operations on everything in a **`DataFrame`** (pandas 2.1 and later deprecate it in favor of the equivalent `DataFrame.map()`):

In [None]:
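# Function to format numerics as strings with two decimal places
# (named so it doesn't shadow Python's built-in format())
format_two_decimals = lambda x: '%.2f' % x
weather_temps.applymap(format_two_decimals)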
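If the notebook has to run on both older and newer pandas, one option is to pick whichever element-wise method exists. This is only a sketch: the frame below is a made-up stand-in for `weather_temps`, and `two_decimals` is a hypothetical helper name.

```python
import pandas as pd

# Made-up stand-in for weather_temps
df = pd.DataFrame({'Temp (C)': [-1.8, -1.5], 'Temp (F)': [28.76, 29.3]})

def two_decimals(x):
    """Format a number as a string with two decimal places."""
    return '%.2f' % x

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = df.map if hasattr(pd.DataFrame, 'map') else df.applymap
formatted = elementwise(two_decimals)
```

Note that the result holds strings, not numbers, so it is suited to display rather than further arithmetic.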

#### String Operators
- `map`, `apply`, and `applymap` are general and allow you to write just about any function to apply to elements in dataframes.  

- However, `pandas` has a bunch of its own built-in functions for these things, especially when working with strings.

- The `Weather` column is our only string here, so let's use it to look at some string operators.  First let's check out the unique values in there:

In [None]:
weather.Weather.unique()

**`replace()`**  
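Just to demonstrate, let's replace "Fog" with " Fog" wherever a value starts with "Fog" (the `^` in the pattern anchors the match to the start of the string):

In [17]:
# '^Fog' is a regular expression; pandas 2.0+ needs regex=True to treat it as one
spacey_fog = weather.Weather.str.replace('^Fog', ' Fog', regex=True)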
spacey_fog.unique()

array([' Fog', 'Freezing Drizzle,Fog', 'Mostly Cloudy', 'Cloudy', 'Rain',
       'Rain Showers', 'Mainly Clear', 'Snow Showers', 'Snow', 'Clear',
       'Freezing Rain,Fog', 'Freezing Rain', 'Freezing Drizzle',
       'Rain,Snow', 'Moderate Snow', 'Freezing Drizzle,Snow',
       'Freezing Rain,Snow Grains', 'Snow,Blowing Snow', 'Freezing Fog',
       'Haze', 'Rain,Fog', 'Drizzle,Fog', 'Drizzle',
       'Freezing Drizzle,Haze', 'Freezing Rain,Haze', 'Snow,Haze',
       'Snow,Fog', 'Snow,Ice Pellets', 'Rain,Haze', 'Thunderstorms,Rain',
       'Thunderstorms,Rain Showers', 'Thunderstorms,Heavy Rain Showers',
       'Thunderstorms,Rain Showers,Fog', 'Thunderstorms',
       'Thunderstorms,Rain,Fog', 'Thunderstorms,Moderate Rain Showers,Fog',
       'Rain Showers,Fog', 'Rain Showers,Snow Showers', 'Snow Pellets',
       'Rain,Snow,Fog', 'Moderate Rain,Fog',
       'Freezing Rain,Ice Pellets,Fog', 'Drizzle,Ice Pellets,Fog',
       'Drizzle,Snow', 'Rain,Ice Pellets', 'Drizzle,Snow,Fog',
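The difference between an anchored regular expression and a plain substring replacement is easy to see on a tiny made-up `Series` (three values chosen to resemble the weather labels):

```python
import pandas as pd

s = pd.Series(['Fog', 'Freezing Fog', 'Rain,Fog'])

# regex=True: '^Fog' matches "Fog" only at the start of a value
anchored = s.str.replace('^Fog', ' Fog', regex=True)

# regex=False: plain substring replacement, so every "Fog" is replaced
literal = s.str.replace('Fog', ' Fog', regex=False)
```

Only the first value changes in `anchored`, which matches the output above where `'Freezing Fog'` and `'Rain,Fog'` are left alone.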
      

**`strip()`**  
Now let's undo our work with `strip()` to remove leading/trailing whitespace:

In [18]:
spacey_fog.str.strip().unique()

array(['Fog', 'Freezing Drizzle,Fog', 'Mostly Cloudy', 'Cloudy', 'Rain',
       'Rain Showers', 'Mainly Clear', 'Snow Showers', 'Snow', 'Clear',
       'Freezing Rain,Fog', 'Freezing Rain', 'Freezing Drizzle',
       'Rain,Snow', 'Moderate Snow', 'Freezing Drizzle,Snow',
       'Freezing Rain,Snow Grains', 'Snow,Blowing Snow', 'Freezing Fog',
       'Haze', 'Rain,Fog', 'Drizzle,Fog', 'Drizzle',
       'Freezing Drizzle,Haze', 'Freezing Rain,Haze', 'Snow,Haze',
       'Snow,Fog', 'Snow,Ice Pellets', 'Rain,Haze', 'Thunderstorms,Rain',
       'Thunderstorms,Rain Showers', 'Thunderstorms,Heavy Rain Showers',
       'Thunderstorms,Rain Showers,Fog', 'Thunderstorms',
       'Thunderstorms,Rain,Fog', 'Thunderstorms,Moderate Rain Showers,Fog',
       'Rain Showers,Fog', 'Rain Showers,Snow Showers', 'Snow Pellets',
       'Rain,Snow,Fog', 'Moderate Rain,Fog',
       'Freezing Rain,Ice Pellets,Fog', 'Drizzle,Ice Pellets,Fog',
       'Drizzle,Snow', 'Rain,Ice Pellets', 'Drizzle,Snow,Fog',
       

**`contains()`**  
Let's use this method to check whether each `Weather` value contains "Snow", and store the resulting boolean `Series` as `is_snowing`:

In [19]:
is_snowing = weather.Weather.str.contains('Snow')
weather[is_snowing]

Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,Temp (F)
2012-01-02 17:00:00,-2.1,-9.5,57,22,25.0,99.66,Snow Showers,28.22
2012-01-02 20:00:00,-5.6,-13.4,54,24,25.0,100.07,Snow Showers,21.92
2012-01-02 21:00:00,-5.8,-12.8,58,26,25.0,100.15,Snow Showers,21.56
2012-01-02 23:00:00,-7.4,-14.1,59,17,19.3,100.27,Snow Showers,18.68
2012-01-03 00:00:00,-9.0,-16.0,57,28,25.0,100.35,Snow Showers,15.80
2012-01-03 02:00:00,-10.5,-15.8,65,22,12.9,100.53,Snow Showers,13.10
2012-01-03 03:00:00,-11.3,-18.7,54,33,25.0,100.61,Snow Showers,11.66
2012-01-03 05:00:00,-12.9,-19.1,60,22,25.0,100.76,Snow Showers,8.78
2012-01-03 06:00:00,-13.3,-19.3,61,19,25.0,100.85,Snow Showers,8.06
2012-01-03 07:00:00,-14.0,-19.5,63,19,25.0,100.95,Snow,6.80
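A couple of `contains()` options are worth knowing when the column might have missing values or inconsistent capitalization. The sketch below uses a made-up `Series`; the weather data itself may not need either option.

```python
import pandas as pd

s = pd.Series(['Snow', 'Rain,Snow', 'Fog', None])

# na=False treats missing values as "no match", so the mask is purely boolean
has_snow = s.str.contains('Snow', na=False)

# case=False makes the match case-insensitive
has_snow_any_case = s.str.contains('snow', case=False, na=False)

# A boolean mask selects the matching rows
snowy = s[has_snow]
```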


### Time Series  
- As you're probably well aware by now, `pandas` can be indexed by datetimes without batting an eye.  
- Effectively, this means it handles **Time Series Data** out of the box!

Here are a few nice methods for working with time series. We'll expand on these when we discuss time series explicitly later in the course:

**`date_range()`**  
Creates a `DatetimeIndex`, which can index a time series.  This is especially useful if you don't already have one or want to make changes to one:

In [21]:
# Example date range: one entry every 3 days, starting January 1st, for 6 periods
dates = pd.date_range('20130101', periods=6, freq='3D')
dates

DatetimeIndex(['2013-01-01', '2013-01-04', '2013-01-07', '2013-01-10',
               '2013-01-13', '2013-01-16'],
              dtype='datetime64[ns]', freq='3D')
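The same index can also be described by its end date instead of a number of periods. A short sketch, reusing the dates from the cell above:

```python
import pandas as pd

# Built from explicit start and end dates
by_end = pd.date_range(start='2013-01-01', end='2013-01-16', freq='3D')

# Built from a start date and a number of periods, as in the cell above
by_periods = pd.date_range('20130101', periods=6, freq='3D')
```

Both calls produce the same six timestamps; which form is more convenient depends on whether you know the end date or the number of observations.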

In [22]:
# Create a random dataframe with it
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

,A,B,C,D
2013-01-01,-0.018081,-2.353688,0.34905,-0.488859
2013-01-04,0.578237,-1.692687,-0.177682,-0.005176
2013-01-07,0.370273,-2.367376,0.698478,1.091079
2013-01-10,0.601254,0.258627,-0.614227,0.70627
2013-01-13,0.29119,-0.320473,1.598379,0.828028
2013-01-16,-1.257092,-0.516609,-0.37079,-0.360395


**`resample()`**  
- Every datetime index has an inherent frequency.  
  - For instance, in the example above the frequency was every 3 days.      

- The `resample()` method is valuable because it lets us either **upsample** or **downsample** to change the frequency of the observations.
  - Upsampling means asking for more frequent observations than you actually have. The original data can't supply the new points, but `pandas` has methods to **interpolate** values for them.
  - In downsampling, you're simply reducing the observations down to a lower frequency.

- Let's use resampling along with our `is_snowing` variable to determine the snowiest **month** (as opposed to the current hourly data)!  
  - When downsampling like this, you can specify parameters as to how to aggregate the observations being dropped.
  - Here we'll use the **mean**:

In [23]:
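# What's happening here?
is_snowing.index = pd.to_datetime(is_snowing.index)
# Cast True/False to 1.0/0.0, then average per month: the fraction of hours with snow
# ('M' is month-end frequency; recent pandas versions prefer the alias 'ME')
is_snowing.astype(float).resample('M').mean()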



2012-01-31    0.240591
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
Freq: M, Name: Weather, dtype: float64
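Downsampling and upsampling can both be tried on a tiny made-up series. This is a sketch under simple assumptions: the series `daily` is hypothetical, and linear interpolation is just one of the fill strategies `pandas` offers.

```python
import pandas as pd

# Hypothetical daily series: values 0..5 on six consecutive days
daily = pd.Series(range(6),
                  index=pd.date_range('2013-01-01', periods=6, freq='D'),
                  dtype=float)

# Downsample: one value per 3-day bin, aggregated with the mean
down = daily.resample('3D').mean()

# Upsample: start from every-other-day data, then fill the new days
# by linear interpolation between the known points
sparse = daily.iloc[::2]
up = sparse.resample('D').interpolate()
```

`down` has two values (one per 3-day bin), while `up` recovers a daily series from the every-other-day observations.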

### Joining Related Datasets
**`merge()`**  
- What if we now wanted to join our `is_snowing` together with the columns from `weather` so they're all aligned in the same DataFrame?  
- This is literally called a **join** in classical SQL (database query language) terms, and `pandas` has a few ways to accomplish it.  
  - `merge` is the most flexible, so let's start there:

In [None]:
# What's happening?
# (the index is converted to datetimes so it lines up with is_snowing's index)
weather.index = pd.to_datetime(weather.index)
weather_snowing = weather.merge(pd.DataFrame(is_snowing).rename(columns={'Weather': 'Is_Snowing'}),
                                left_index=True, right_index=True)
weather_snowing
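`merge` can also match on a shared column and control which rows survive through its `how` parameter. A minimal sketch with two made-up tables (the `key` column and its values are hypothetical):

```python
import pandas as pd

# Hypothetical tables sharing a 'key' column
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'temp': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'is_snowing': [True, False, True]})

# Inner join (the default): only keys present in both tables
inner = left.merge(right, on='key', how='inner')

# Left join: every row of `left`; unmatched rows get NaN.
# indicator=True adds a '_merge' column saying where each row came from.
left_join = left.merge(right, on='key', how='left', indicator=True)
```

The `_merge` column is handy for checking how many rows failed to find a partner before trusting the joined result.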

**`join()`**  
This is slightly different from `merge`: it matches on the index by default. We prefer `merge`:

In [None]:
weather.join(is_snowing, how="inner", rsuffix='2')
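To see how the two relate, the sketch below builds two small made-up frames and shows a `join()` call next to a `merge()` call that should produce the same table:

```python
import pandas as pd

idx = pd.to_datetime(['2012-01-01', '2012-01-02', '2012-01-03'])
left = pd.DataFrame({'temp': [1.0, 2.0, 3.0]}, index=idx)
right = pd.DataFrame({'is_snowing': [True, False]}, index=idx[:2])

# join() matches on the index and defaults to a left join
joined = left.join(right)

# The equivalent merge() call spells out the index matching and the join type
merged = left.merge(right, left_index=True, right_index=True, how='left')
```

The last row of `left` has no partner in `right`, so its `is_snowing` value is missing in both results.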

**`concat()`**  
This concatenates DataFrames vertically, i.e. stacks the rows of one on top of the rows of another:

In [24]:
weather_concat = pd.concat([weather.iloc[0:100], weather.iloc[200:300]])
weather_concat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 2012-01-01 00:00:00 to 2012-01-13 11:00:00
Data columns (total 8 columns):
Temp (C)              200 non-null float64
Dew Point Temp (C)    200 non-null float64
Rel Hum (%)           200 non-null int64
Wind Spd (km/h)       200 non-null int64
Visibility (km)       200 non-null float64
Stn Press (kPa)       200 non-null float64
Weather               200 non-null object
Temp (F)              200 non-null float64
dtypes: float64(5), int64(2), object(1)
memory usage: 14.1+ KB
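Two `concat()` options come up often: `ignore_index` to renumber the rows, and `axis=1` to place frames side by side. A short sketch on made-up frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [3, 4]}, index=[0, 1])

# Default: stack rows and keep the original index labels (duplicates allowed)
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index
side_by_side = pd.concat([a, b.rename(columns={'x': 'y'})], axis=1)
```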


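**`append()`**  
Add one or more rows to a DataFrame. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in 2.0, so the cell below uses `pd.concat()`, which gives the same result:

In [30]:
# Equivalent to the older weather_concat.append(weather.iloc[250:252])
weather_append = pd.concat([weather_concat, weather.iloc[250:252]])
weather_append.info()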

<class 'pandas.core.frame.DataFrame'>
Index: 202 entries, 2012-01-01 00:00:00 to 2012-01-11 11:00:00
Data columns (total 8 columns):
Temp (C)              202 non-null float64
Dew Point Temp (C)    202 non-null float64
Rel Hum (%)           202 non-null int64
Wind Spd (km/h)       202 non-null int64
Visibility (km)       202 non-null float64
Stn Press (kPa)       202 non-null float64
Weather               202 non-null object
Temp (F)              202 non-null float64
dtypes: float64(5), int64(2), object(1)
memory usage: 14.2+ KB


### Summarizing Data
**`corr()`**  
This function is really useful as it calculates pairwise **correlations** between all of the numeric columns in your data table:

In [31]:
# numeric_only=True skips the string Weather column (needed in pandas 2.0+)
weather.corr(numeric_only=True)

,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Temp (F)
Temp (C),1.0,0.932714,-0.220182,-0.061876,0.273455,-0.236389,1.0
Dew Point Temp (C),0.932714,1.0,0.139494,-0.095685,0.050813,-0.320616,0.932714
Rel Hum (%),-0.220182,0.139494,1.0,-0.092743,-0.633683,-0.231424,-0.220182
Wind Spd (km/h),-0.061876,-0.095685,-0.092743,1.0,0.004883,-0.356613,-0.061876
Visibility (km),0.273455,0.050813,-0.633683,0.004883,1.0,0.231847,0.273455
Stn Press (kPa),-0.236389,-0.320616,-0.231424,-0.356613,0.231847,1.0,-0.236389
Temp (F),1.0,0.932714,-0.220182,-0.061876,0.273455,-0.236389,1.0


**`cov()`**  
Similarly, here is **covariance**:

In [32]:
# numeric_only=True skips the string Weather column (needed in pandas 2.0+)
weather.cov(numeric_only=True)

,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Temp (F)
Temp (C),136.606604,118.641308,-43.540073,-6.28366,40.343485,-2.331894,245.891887
Dew Point Temp (C),118.641308,118.441263,25.684917,-9.047958,6.980371,-2.944971,213.554355
Rel Hum (%),-43.540073,25.684917,286.24855,-13.633521,-135.3305,-3.304649,-78.372132
Wind Spd (km/h),-6.28366,-9.047958,-13.633521,75.49344,0.535508,-2.615151,-11.310589
Visibility (km),40.343485,6.980371,-135.3305,0.535508,159.332259,2.470011,72.618273
Stn Press (kPa),-2.331894,-2.944971,-3.304649,-2.615151,2.470011,0.712344,-4.19741
Temp (F),245.891887,213.554355,-78.372132,-11.310589,72.618273,-4.19741,442.605396
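The two tables are related: correlation is covariance rescaled by both standard deviations. Because `Temp (F)` is a linear function of `Temp (C)`, their correlation is 1 and the covariance scales by the factor 9/5. A small sketch with made-up readings (columns `c` and `f` are placeholders):

```python
import pandas as pd

# Made-up Celsius readings and their Fahrenheit conversion
df = pd.DataFrame({'c': [-5.0, 0.0, 10.0, 25.0]})
df['f'] = 9.0 * df['c'] / 5.0 + 32

corr = df.corr()
cov = df.cov()

# Correlation is covariance divided by the product of the standard deviations
manual_corr = cov.loc['c', 'f'] / (df['c'].std() * df['f'].std())

# For a linear conversion, cov(c, f) is 9/5 times var(c)
scale = cov.loc['c', 'f'] / cov.loc['c', 'c']
```

The same relationship is visible in the output above: 136.606604 × 1.8 ≈ 245.891887.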


### Categorical Variables
`pandas` lets you work with categorical variables.  These are variables that can take only a certain set of values (for example, the movie ratings "R", "PG-13", "PG", "G").  We'll see this a lot later on; for now, just an example:

In [None]:
weather['Weather Cat'] = weather.Weather.astype('category')
weather.info()
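Under the hood, a categorical column stores each distinct value once and keeps a small integer code per row. A minimal sketch using the ratings example from the text (the `Series` itself is made up):

```python
import pandas as pd

ratings = pd.Series(['PG', 'R', 'PG', 'G']).astype('category')

# The distinct values are stored once, in sorted order by default
categories = list(ratings.cat.categories)

# Each row holds an integer code pointing into the categories
codes = list(ratings.cat.codes)
```

This is why converting a repetitive string column to `category` usually shrinks its memory footprint in `info()`.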

### Misc

**`shift()`**  
Shift a column forward or backward some number of rows:

In [33]:
weather['Forward'] = weather['Temp (F)'].shift(periods=2)
weather['Backward'] = weather['Temp (F)'].shift(periods=-1)
weather

Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,Temp (F),Forward,Backward
2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,28.76,,28.76
2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,28.76,,28.76
2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",28.76,28.76,29.30
2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",29.30,28.76,29.30
2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,29.30,28.76,29.48
2012-01-01 05:00:00,-1.4,-3.3,87,9,6.4,101.27,Fog,29.48,29.30,29.30
2012-01-01 06:00:00,-1.5,-3.1,89,7,6.4,101.29,Fog,29.30,29.30,29.48
2012-01-01 07:00:00,-1.4,-3.6,85,7,8.0,101.26,Fog,29.48,29.48,29.48
2012-01-01 08:00:00,-1.4,-3.6,85,9,8.0,101.23,Fog,29.48,29.30,29.66
2012-01-01 09:00:00,-1.3,-3.1,88,15,4.0,101.20,Fog,29.66,29.48,30.20


**`diff()`**  
Calculate the difference between each row and the previous one:

In [34]:
weather['Temp Diff'] = weather['Temp (F)'].diff()
weather

Date/Time,Temp (C),Dew Point Temp (C),Rel Hum (%),Wind Spd (km/h),Visibility (km),Stn Press (kPa),Weather,Temp (F),Forward,Backward,Temp Diff
2012-01-01 00:00:00,-1.8,-3.9,86,4,8.0,101.24,Fog,28.76,,28.76,
2012-01-01 01:00:00,-1.8,-3.7,87,4,8.0,101.24,Fog,28.76,,28.76,0.00
2012-01-01 02:00:00,-1.8,-3.4,89,7,4.0,101.26,"Freezing Drizzle,Fog",28.76,28.76,29.30,0.00
2012-01-01 03:00:00,-1.5,-3.2,88,6,4.0,101.27,"Freezing Drizzle,Fog",29.30,28.76,29.30,0.54
2012-01-01 04:00:00,-1.5,-3.3,88,7,4.8,101.23,Fog,29.30,28.76,29.48,0.00
2012-01-01 05:00:00,-1.4,-3.3,87,9,6.4,101.27,Fog,29.48,29.30,29.30,0.18
2012-01-01 06:00:00,-1.5,-3.1,89,7,6.4,101.29,Fog,29.30,29.30,29.48,-0.18
2012-01-01 07:00:00,-1.4,-3.6,85,7,8.0,101.26,Fog,29.48,29.48,29.48,0.18
2012-01-01 08:00:00,-1.4,-3.6,85,9,8.0,101.23,Fog,29.48,29.30,29.66,0.00
2012-01-01 09:00:00,-1.3,-3.1,88,15,4.0,101.20,Fog,29.66,29.48,30.20,0.18
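`shift()` and `diff()` are closely related: `diff()` is the current value minus the value one row earlier. A short sketch on a made-up `Series` (four temperatures chosen to resemble the first rows above):

```python
import pandas as pd

s = pd.Series([28.76, 28.76, 29.30, 29.48])

# Positive periods move values to later rows; negative periods move them earlier
lagged = s.shift(periods=1)    # each row sees the previous row's value
led = s.shift(periods=-1)      # each row sees the next row's value

# diff() written out by hand: current value minus the previous one
manual_diff = s - s.shift(1)
builtin_diff = s.diff()
```

Rows with no neighbor to borrow from (the first row for a forward shift, the last for a backward one) are filled with `NaN`.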


### Advanced Aggregation
Let's be masters of `groupby`!

In [35]:
# Load some sample MTA data
mta_df = pd.read_csv('data/sample_df.csv')
mta_df

,C/A,UNIT,STATION,wk,d,tot_traffic
0,A002,R051,59 ST,13,2017-04-01,12002.0
1,A002,R051,59 ST,13,2017-04-02,12516.0
2,A002,R051,59 ST,14,2017-04-03,22615.0
3,A002,R051,59 ST,14,2017-04-04,23477.0
4,A002,R051,59 ST,14,2017-04-05,24106.0
5,A002,R051,59 ST,14,2017-04-06,24102.0
6,A002,R051,59 ST,14,2017-04-07,22872.0


In [36]:
# Group by week
week_grouped = mta_df.groupby('wk')

**`agg()`** and **`aggregate()`**  
Create your own custom aggregation functions to apply after a `groupby` (`agg` is an alias for `aggregate`):

In [40]:
# Create a custom aggregation function
def get_day(week, day):
    if len(week) > day:
        return week.iloc[day]
    else:
        return None

In [41]:
week_grouped.agg({'tot_traffic': lambda x: get_day(x,1)})



wk,tot_traffic
13,12516.0
14,23477.0
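`agg()` also accepts a list, so built-in aggregations (named by string) can be mixed with a custom function. The sketch below rebuilds the first rows of the sample data by hand; `traffic_range` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# First rows of the sample MTA data shown above, rebuilt by hand
mta = pd.DataFrame({'wk': [13, 13, 14, 14, 14],
                    'tot_traffic': [12002.0, 12516.0, 22615.0, 23477.0, 24106.0]})

def traffic_range(values):
    """Custom aggregation: spread between the busiest and quietest day."""
    return values.max() - values.min()

# One output column per aggregation; the custom one is named after the function
summary = mta.groupby('wk')['tot_traffic'].agg(['mean', 'max', traffic_range])
```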


In [42]:
# For each day, create a column, join it to original by week, rename to day n
for day in range(7):
    day_traffic = week_grouped.agg({'tot_traffic': lambda x: get_day(x, day)})
    mta_df = pd.merge(mta_df, day_traffic, left_on='wk', right_index=True)
    # Both frames have a tot_traffic column, so merge suffixes them _x (left) and _y (right)
    mta_df = mta_df.rename(columns={'tot_traffic_x': 'tot_traffic',
                                    'tot_traffic_y': 'day ' + str(day + 1)})



In [43]:
# What happened?
mta_df

,C/A,UNIT,STATION,wk,d,tot_traffic,day 1,day 2,day 3,day 4,day 5,day 6,day 7
0,A002,R051,59 ST,13,2017-04-01,12002.0,12002.0,12516.0,,,,,
1,A002,R051,59 ST,13,2017-04-02,12516.0,12002.0,12516.0,,,,,
2,A002,R051,59 ST,14,2017-04-03,22615.0,22615.0,23477.0,24106.0,24102.0,22872.0,,
3,A002,R051,59 ST,14,2017-04-04,23477.0,22615.0,23477.0,24106.0,24102.0,22872.0,,
4,A002,R051,59 ST,14,2017-04-05,24106.0,22615.0,23477.0,24106.0,24102.0,22872.0,,
5,A002,R051,59 ST,14,2017-04-06,24102.0,22615.0,23477.0,24106.0,24102.0,22872.0,,
6,A002,R051,59 ST,14,2017-04-07,22872.0,22615.0,23477.0,24106.0,24102.0,22872.0,,
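As an alternative to the merge loop, a similar week-by-day layout can be sketched by numbering the rows within each week and pivoting. This is not the approach used above, just another way to get there; the frame below copies the `wk` and `tot_traffic` values from the sample data.

```python
import pandas as pd

# The week and traffic columns from the sample MTA data shown above
mta = pd.DataFrame({
    'wk': [13, 13, 14, 14, 14, 14, 14],
    'tot_traffic': [12002.0, 12516.0, 22615.0, 23477.0, 24106.0, 24102.0, 22872.0],
})

# Number the rows within each week: 1 for the first day seen, 2 for the next, ...
mta['day'] = mta.groupby('wk').cumcount() + 1

# One row per week, one column per day number
wide = mta.pivot(index='wk', columns='day', values='tot_traffic')
wide.columns = ['day ' + str(d) for d in wide.columns]
```

One difference from the loop: the pivot only creates columns for day numbers that actually occur, so there are no all-empty `day 6` and `day 7` columns here.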


**`drop_duplicates()`**:  
Remove duplicates based on a subset of columns, with a strategy on which to keep.

In [44]:
# Keep only the first row for each week, then drop the tot_traffic column
mta_weeks_by_day = mta_df.drop_duplicates(subset='wk', keep='first').drop('tot_traffic', axis=1)
mta_weeks_by_day

,C/A,UNIT,STATION,wk,d,day 1,day 2,day 3,day 4,day 5,day 6,day 7
0,A002,R051,59 ST,13,2017-04-01,12002.0,12516.0,,,,,
2,A002,R051,59 ST,14,2017-04-03,22615.0,23477.0,24106.0,24102.0,22872.0,,
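The `keep` argument decides which of the duplicated rows survives. A quick sketch of its three settings on a made-up frame (the dates are shortened placeholders):

```python
import pandas as pd

df = pd.DataFrame({'wk': [13, 13, 14, 14],
                   'd': ['04-01', '04-02', '04-03', '04-04']})

first = df.drop_duplicates(subset='wk', keep='first')      # earliest row per week
last = df.drop_duplicates(subset='wk', keep='last')        # latest row per week
unique_only = df.drop_duplicates(subset='wk', keep=False)  # drop every duplicated week
```

Since every week appears twice here, `unique_only` ends up empty.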


`groupby` can do a lot more.  Use `tab` and `shift+tab` on `week_grouped.` to explore the other methods available on a groupby object and what they can do for you.