# Notes on plots in jupyter from
https://www.reddit.com/r/IPython/comments/36p360/try_matplotlib_notebook_for_interactive_plots/:

- The old %matplotlib inline activates the inline backend, which renders figures in the notebook as static PNGs.

- The new %matplotlib notebook activates the nbagg backend, added in matplotlib 1.4, which includes a JavaScript interface for interacting with inline figures in the notebook (e.g. move, zoom, resize, and save). This only works in IPython 3.x; for older IPython versions, use %matplotlib nbagg.

- nbagg is different than mpld3 in that it requires a live connection to a Python kernel. This allows it to be more feature complete than mpld3, but any static rendering of the notebook will not include the interactivity.

In [20]:
%matplotlib notebook

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas

# This notebook is me playing around with pandas using the following tutorial as a basis: 
http://synesthesiam.com/posts/an-introduction-to-pandas.html

I modified several lines to be Python 3 compliant

In [2]:
data = pandas.read_csv("synasthesiam.com__intro-to-pandas.csv")

In [3]:
data.head()

Unnamed: 0,EDT,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2012-3-10,56,40,24,24,20,16,74,50,26,...,10,10,10,13,6,17.0,0.00,0,,138
1,2012-3-11,67,49,30,43,31,24,78,53,28,...,10,10,10,22,7,32.0,T,1,Rain,163
2,2012-3-12,71,62,53,59,55,43,90,76,61,...,10,10,6,24,14,36.0,0.03,6,Rain,190
3,2012-3-13,76,63,50,57,53,47,93,66,38,...,10,10,4,16,5,24.0,0.00,0,,242
4,2012-3-14,80,62,44,58,52,43,93,68,42,...,10,10,10,16,6,22.0,0.00,0,,202


In [4]:
data

Unnamed: 0,EDT,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2012-3-10,56,40,24,24,20,16,74,50,26,...,10,10,10,13,6,17.0,0.00,0,,138
1,2012-3-11,67,49,30,43,31,24,78,53,28,...,10,10,10,22,7,32.0,T,1,Rain,163
2,2012-3-12,71,62,53,59,55,43,90,76,61,...,10,10,6,24,14,36.0,0.03,6,Rain,190
3,2012-3-13,76,63,50,57,53,47,93,66,38,...,10,10,4,16,5,24.0,0.00,0,,242
4,2012-3-14,80,62,44,58,52,43,93,68,42,...,10,10,10,16,6,22.0,0.00,0,,202
5,2012-3-15,79,69,58,61,58,53,90,69,48,...,10,10,10,31,10,41.0,0.04,3,Rain-Thunderstorm,209
6,2012-3-16,75,64,52,57,54,51,100,75,49,...,10,10,10,14,5,20.0,T,2,,169
7,2012-3-17,78,62,46,60,54,46,100,78,56,...,10,5,0,12,5,17.0,T,3,Fog-Thunderstorm,162
8,2012-3-18,80,70,59,61,58,57,93,69,45,...,10,10,9,18,8,25.0,T,2,Rain,197
9,2012-3-19,84,72,59,58,56,50,90,66,42,...,10,10,10,17,6,23.0,0.00,1,,165


In [39]:
type(data)

pandas.core.frame.DataFrame

# pandas DataFrames have a nice describe() method for summary statistics

In [5]:
data.describe()



Unnamed: 0,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,Mean Sea Level PressureIn,Min Sea Level PressureIn,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,CloudCover,WindDirDegrees
count,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,366.0,365.0,366.0,366.0
mean,66.803279,55.68306,44.101093,49.54918,44.057377,37.980874,90.027322,67.860656,45.193989,30.108907,30.022705,29.936831,9.994536,8.73224,5.797814,16.418033,6.057377,22.764384,2.885246,189.704918
std,20.361247,18.436506,17.301141,16.397178,16.829996,17.479449,9.108438,9.945591,15.360261,0.172189,0.174112,0.182476,0.073821,1.875406,3.792219,5.564329,3.20094,8.131092,2.707261,94.04508
min,16.0,11.0,1.0,0.0,-3.0,-5.0,54.0,37.0,15.0,29.64,29.42,29.23,9.0,2.0,0.0,6.0,0.0,7.0,0.0,1.0
25%,51.0,41.0,30.0,36.0,30.0,24.0,85.0,61.25,35.0,29.99,29.91,29.83,10.0,8.0,2.0,13.0,4.0,,0.0,131.0
50%,69.0,59.0,47.0,54.5,48.0,41.0,93.0,68.0,42.0,30.1,30.02,29.94,10.0,10.0,6.0,16.0,6.0,,2.0,192.5
75%,84.0,70.75,57.75,62.0,57.0,51.0,96.0,74.0,54.0,30.21,30.1275,30.04,10.0,10.0,10.0,20.0,8.0,,5.0,259.75
max,106.0,89.0,77.0,77.0,72.0,71.0,100.0,95.0,90.0,30.6,30.48,30.44,10.0,10.0,10.0,39.0,19.0,63.0,8.0,360.0


Get Number of rows

In [6]:
len(data)

366

Get names of columns

In [7]:
data.columns

Index(['EDT', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressureIn',
       ' Mean Sea Level PressureIn', ' Min Sea Level PressureIn',
       ' Max VisibilityMiles', ' Mean VisibilityMiles', ' Min VisibilityMiles',
       ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH',
       'PrecipitationIn', ' CloudCover', ' Events', ' WindDirDegrees'],
      dtype='object')

Get number of columns

In [8]:
len(data.columns)

23
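The row and column counts above can also be read in one step from the DataFrame's shape attribute. A minimal sketch on a small made-up frame (the values are illustrative stand-ins, not the weather CSV):

```python
import pandas as pd

# Tiny stand-in frame; the notebook itself uses the 366-row weather data.
df = pd.DataFrame({"EDT": ["2012-3-10", "2012-3-11", "2012-3-12"],
                   "Mean TemperatureF": [40, 49, 62]})

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```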

# Columns can be accessed in two ways. The first is using the DataFrame like a dictionary with string keys:

In [16]:
data["EDT"]

You can get multiple columns out at the same time by passing in a list of strings.



In [11]:
data.head()[["EDT", "Mean TemperatureF"]]

Unnamed: 0,EDT,Mean TemperatureF
0,2012-3-10,40
1,2012-3-11,49
2,2012-3-12,62
3,2012-3-13,63
4,2012-3-14,62


The second way to access columns is using the dot syntax. This only works if your column name could also be a Python variable name (i.e., no spaces), and if it doesn't collide with another DataFrame property or function name (e.g., count, sum).

In [None]:
data.EDT

In [12]:
data.EDT.head()

0    2012-3-10
1    2012-3-11
2    2012-3-12
3    2012-3-13
4    2012-3-14
Name: EDT, dtype: object

Passing a number n to head() gives us the first n items in the column (the default is 5). There is also a corresponding tail() method that gives the last n items or rows.

In [None]:
data.EDT.tail(10) # or:   data["EDT"].tail(10)
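The caveat about dot syntax can be seen on a small example. This is only a sketch: the frame below is made up, and the "count" column is a hypothetical name chosen because it collides with the DataFrame.count() method.

```python
import pandas as pd

# Illustrative frame: one plain name, one name containing a space, and one
# name that collides with an existing DataFrame method.
df = pd.DataFrame({"EDT": ["2012-3-10", "2012-3-11"],
                   "Mean TemperatureF": [40, 49],
                   "count": [1, 2]})

same = df["EDT"].equals(df.EDT)            # both syntaxes give the same Series
spaced = df["Mean TemperatureF"]           # a name with a space needs brackets
count_attr_is_method = callable(df.count)  # dot syntax finds the method instead
count_column = df["count"]                 # brackets still find the column
```

When in doubt, the dictionary syntax is the safer choice because it always refers to a column.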

The column names in data are a little unwieldy, so we're going to rename them. This is as easy as assigning a new list of column names to the columns property of the DataFrame.

In [14]:
# Note: "max_visibilty" is misspelled, and "min_wind" actually holds the
# "Max Gust SpeedMPH" column; both names are kept because later cells use them.
data.columns = ["date", "max_temp", "mean_temp", "min_temp", "max_dew",
                "mean_dew", "min_dew", "max_humidity", "mean_humidity",
                "min_humidity", "max_pressure", "mean_pressure",
                "min_pressure", "max_visibilty", "mean_visibility",
                "min_visibility", "max_wind", "mean_wind", "min_wind",
                "precipitation", "cloud_cover", "events", "wind_dir"]

In [15]:
data.columns

Index(['date', 'max_temp', 'mean_temp', 'min_temp', 'max_dew', 'mean_dew',
       'min_dew', 'max_humidity', 'mean_humidity', 'min_humidity',
       'max_pressure', 'mean_pressure', 'min_pressure', 'max_visibilty',
       'mean_visibility', 'min_visibility', 'max_wind', 'mean_wind',
       'min_wind', 'precipitation', 'cloud_cover', 'events', 'wind_dir'],
      dtype='object')
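As an alternative to assigning the whole list, individual columns can be renamed with a mapping, which avoids having to keep the list in the right order. A minimal sketch, assuming a made-up three-column frame; the new name "max_gust" is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

# Stand-in for three of the original headers, including the leading spaces
# that are visible in the data.columns output.
df = pd.DataFrame({"EDT": ["2012-3-10"],
                   " Mean Humidity": [50],
                   " Max Gust SpeedMPH": [17.0]})

# Strip stray whitespace from the header names first
df.columns = df.columns.str.strip()

# rename() returns a new DataFrame; only the listed columns are changed
renamed = df.rename(columns={"EDT": "date",
                             "Mean Humidity": "mean_humidity",
                             "Max Gust SpeedMPH": "max_gust"})
print(list(renamed.columns))
```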

Now our columns can all be accessed using the dot syntax!

In [17]:
data.mean_temp.head()

0    40
1    49
2    62
3    63
4    62
Name: mean_temp, dtype: int64

In [18]:
data.mean_temp.std()

18.43650599625107

In [23]:
data.mean_temp.hist()

<IPython.core.display.Javascript object>

<matplotlib.axes._subplots.AxesSubplot at 0x1158260b8>

By the way, many of the column-specific methods also work on the entire DataFrame. Instead of a single number, you'll get a result for each column.

In [22]:
data.std()

max_temp           20.361247
mean_temp          18.436506
min_temp           17.301141
max_dew            16.397178
mean_dew           16.829996
min_dew            17.479449
max_humidity        9.108438
mean_humidity       9.945591
min_humidity       15.360261
max_pressure        0.172189
mean_pressure       0.174112
min_pressure        0.182476
max_visibilty       0.073821
mean_visibility     1.875406
min_visibility      3.792219
max_wind            5.564329
mean_wind           3.200940
min_wind            8.131092
cloud_cover         2.707261
wind_dir           94.045080
dtype: float64
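Note that the output above lists only the numeric columns. On recent pandas versions you may need to ask for that explicitly with numeric_only=True when the frame also holds string columns. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Illustrative frame with two numeric columns and one string column
df = pd.DataFrame({"mean_temp": [40, 49, 62],
                   "max_temp": [56, 67, 71],
                   "events": ["", "Rain", "Rain"]})

col_std = df.mean_temp.std()          # a single number for one column
all_std = df.std(numeric_only=True)   # one value per numeric column
print(all_std)
```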

---------------------------
# Bulk Operations with apply()
Methods like sum() and std() work on entire columns. We can run our own functions across all values in a column (or row) using apply().

To give you an idea of how this works, let's consider the "date" column in our DataFrame (formerly "EDT").

In [24]:
data.date.head()

0    2012-3-10
1    2012-3-11
2    2012-3-12
3    2012-3-13
4    2012-3-14
Name: date, dtype: object

We can use the values property of the column to get an array of its values. Inspecting the first value reveals that these are strings with a particular format.

In [25]:
first_date = data.date.values[0]
first_date

'2012-3-10'

In [26]:
from datetime import datetime
datetime.strptime(first_date, "%Y-%m-%d")

datetime.datetime(2012, 3, 10, 0, 0)

Using the <font color='red'>apply()</font> method, which takes an anonymous function, we can apply strptime to each value in the column. We'll overwrite the string date values with their Python datetime equivalents.

In [33]:
data.date = data.date.apply(lambda d: datetime.strptime(d, "%Y-%m-%d"))
data.date.head()

0   2012-03-10
1   2012-03-11
2   2012-03-12
3   2012-03-13
4   2012-03-14
Name: date, dtype: datetime64[ns]
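The section intro mentions that apply() can run across rows as well as columns, but only the column case is shown. A minimal sketch of both, assuming a small made-up frame (the computed "range" is just an illustration):

```python
import pandas as pd

# Illustrative frame using two of the renamed temperature columns
df = pd.DataFrame({"max_temp": [56, 67, 71],
                   "min_temp": [24, 30, 53]})

# On a Series, apply() calls the function once per value
doubled = df.max_temp.apply(lambda t: t * 2)

# On a DataFrame with axis=1, apply() calls the function once per row;
# each row arrives as a Series indexed by column name
temp_range = df.apply(lambda row: row["max_temp"] - row["min_temp"], axis=1)
```

For simple arithmetic like this, the direct column expression df.max_temp - df.min_temp gives the same result; apply() is most useful when the function cannot be written that way.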

________

## Side note: how to change color of fonts in markdown cells in jupyter notebook...the following code yields "bar" in red font:


In [32]:
 foo <font color='red'>bar</font> foo

foo <font color='red'>bar</font> foo

_____________

Let's go one step further. Each row in our DataFrame represents the weather from a single day. Each row in a DataFrame is associated with an index, which is a label that uniquely identifies a row.

#### Our row indices up to now have been auto-generated by pandas, and are simply integers from 0 to 365. <font color='blue'>If we use dates instead of integers for our index, we will get some extra benefits from pandas when plotting later on.</font> Overwriting the index is as easy as assigning to the index property of the DataFrame.

In [34]:
data.index = data.date
data.head()

Unnamed: 0_level_0,date,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-03-10,2012-03-10,56,40,24,24,20,16,74,50,26,...,10,10,10,13,6,17.0,0.00,0,,138
2012-03-11,2012-03-11,67,49,30,43,31,24,78,53,28,...,10,10,10,22,7,32.0,T,1,Rain,163
2012-03-12,2012-03-12,71,62,53,59,55,43,90,76,61,...,10,10,6,24,14,36.0,0.03,6,Rain,190
2012-03-13,2012-03-13,76,63,50,57,53,47,93,66,38,...,10,10,4,16,5,24.0,0.00,0,,242
2012-03-14,2012-03-14,80,62,44,58,52,43,93,68,42,...,10,10,10,16,6,22.0,0.00,0,,202


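#### <font color='green'>Now we can quickly look up a row by its date with the ix[] property</font>. Note that ix will also take an integer position. (ix was deprecated in pandas 0.20 and removed in 1.0; on current versions use loc[] for labels and iloc[] for positions.)

In [35]:
data.ix[1].head()  # on pandas >= 1.0: data.iloc[1].head()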

date         2012-03-11 00:00:00
max_temp                      67
mean_temp                     49
min_temp                      30
max_dew                       43
Name: 2012-03-11 00:00:00, dtype: object

In [38]:
data.ix[datetime(2012, 3, 11)].head()  # on pandas >= 1.0: data.loc[datetime(2012, 3, 11)].head()

date         2012-03-11 00:00:00
max_temp                      67
mean_temp                     49
min_temp                      30
max_dew                       43
Name: 2012-03-11 00:00:00, dtype: object

With all of the dates in the index now, we no longer need the "date" column. Let's drop it.

In [40]:
data = data.drop(["date"], axis=1)
data.columns

Index(['max_temp', 'mean_temp', 'min_temp', 'max_dew', 'mean_dew', 'min_dew',
       'max_humidity', 'mean_humidity', 'min_humidity', 'max_pressure',
       'mean_pressure', 'min_pressure', 'max_visibilty', 'mean_visibility',
       'min_visibility', 'max_wind', 'mean_wind', 'min_wind', 'precipitation',
       'cloud_cover', 'events', 'wind_dir'],
      dtype='object')
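The two steps above (assigning the index, then dropping the column) can also be done in one call with set_index(). A minimal sketch on a made-up three-day frame; it also shows one benefit of a date index, which is slicing by date labels.

```python
import pandas as pd

# Illustrative frame with an already-parsed "date" column
df = pd.DataFrame({"date": pd.to_datetime(["2012-03-10", "2012-03-11", "2012-03-12"]),
                   "max_temp": [56, 67, 71]})

# set_index() moves "date" into the index and removes it from the columns
indexed = df.set_index("date")

# Label-based slice on the date index; both end points are included
two_days = indexed.loc["2012-03-11":"2012-03-12"]
```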

# Handling missing values

Pandas considers values like NaN and None to represent missing data. The pandas.isnull function can be used to tell whether or not a value is missing.

Let's use apply() across all of the columns in our DataFrame to figure out which values are missing.

In [45]:
empty = data.apply(lambda col: pandas.isnull(col)) # See pandas.notnull for the opposite
empty.head()

Unnamed: 0_level_0,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-03-10,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2012-03-11,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2012-03-12,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2012-03-13,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2012-03-14,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False


We got back a DataFrame (empty) with boolean values for all 22 columns and 366 rows. Inspecting the first 10 values of the "events" column, we can see that there are some missing values because a True was returned from pandas.isnull.

In [46]:
empty.events.head(10)

date
2012-03-10     True
2012-03-11    False
2012-03-12    False
2012-03-13     True
2012-03-14     True
2012-03-15    False
2012-03-16     True
2012-03-17    False
2012-03-18    False
2012-03-19     True
Freq: D, Name: events, dtype: bool
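The same boolean frame can be obtained without apply(), and summing it gives a count of missing values per column. A minimal sketch, assuming a small made-up frame with both NaN and None in the "events" column:

```python
import numpy as np
import pandas as pd

# Illustrative frame; NaN and None are both treated as missing
df = pd.DataFrame({"max_temp": [56, 67, 71, 76],
                   "events": [np.nan, "Rain", "Rain", None]})

# DataFrame.isnull() tests every cell, like applying pandas.isnull per column
empty = df.isnull()

# True counts as 1 when summed, so this is the number of missing values
# in each column
missing_per_column = empty.sum()
print(missing_per_column)
```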

Looking at the corresponding rows in the original DataFrame reveals that pandas has filled in NaN for empty values in the "events" column.

In [47]:
data.events.head(10)

date
2012-03-10                  NaN
2012-03-11                 Rain
2012-03-12                 Rain
2012-03-13                  NaN
2012-03-14                  NaN
2012-03-15    Rain-Thunderstorm
2012-03-16                  NaN
2012-03-17     Fog-Thunderstorm
2012-03-18                 Rain
2012-03-19                  NaN
Freq: D, Name: events, dtype: object

This isn't exactly what we want. One option is to drop all rows in the DataFrame with missing "events" values.

In [51]:
data.dropna(subset=["events"]).head()
# By default dropna() returns a new object; pass inplace=True to modify data in place

Unnamed: 0_level_0,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-03-11,67,49,30,43,31,24,78,53,28,30.37,...,10,10,10,22,7,32.0,T,1,Rain,163
2012-03-12,71,62,53,59,55,43,90,76,61,30.13,...,10,10,6,24,14,36.0,0.03,6,Rain,190
2012-03-15,79,69,58,61,58,53,90,69,48,30.13,...,10,10,10,31,10,41.0,0.04,3,Rain-Thunderstorm,209
2012-03-17,78,62,46,60,54,46,100,78,56,30.15,...,10,5,0,12,5,17.0,T,3,Fog-Thunderstorm,162
2012-03-18,80,70,59,61,58,57,93,69,45,30.14,...,10,10,9,18,8,25.0,T,2,Rain,197


The DataFrame we get back has only 162 rows, so we can infer that there were 366 - 162 = 204 missing values in the "events" column. Note that this didn't affect data; we're just looking at a copy.

Instead of dropping the rows with missing values, let's fill them with empty strings (you'll see why in a moment). This is easily done with the fillna() function. We'll go ahead and overwrite the "events" column with empty string missing values instead of NaN.

In [53]:
data.events = data.events.fillna("")
data.events.head(10)

date
2012-03-10                     
2012-03-11                 Rain
2012-03-12                 Rain
2012-03-13                     
2012-03-14                     
2012-03-15    Rain-Thunderstorm
2012-03-16                     
2012-03-17     Fog-Thunderstorm
2012-03-18                 Rain
2012-03-19                     
Freq: D, Name: events, dtype: object
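A minimal sketch of the two options on a made-up frame. It also illustrates one likely reason for preferring empty strings here (my assumption about the motivation): a substring test such as "Rain" in e raises a TypeError when e is NaN, because NaN is a float.

```python
import numpy as np
import pandas as pd

# Illustrative frame with two missing "events" values
df = pd.DataFrame({"events": [np.nan, "Rain", np.nan],
                   "max_temp": [56, 67, 76]})

# Option 1: drop rows with a missing "events" value (df itself is unchanged)
dropped = df.dropna(subset=["events"])
n_missing = len(df) - len(dropped)

# A substring test fails on NaN because a float is not a container
try:
    df["events"].apply(lambda e: "Rain" in e)
    nan_breaks_in = False
except TypeError:
    nan_breaks_in = True

# Option 2: replace missing values with empty strings, after which the
# same substring test works on every row
filled = df.assign(events=df["events"].fillna(""))
has_rain = filled["events"].apply(lambda e: "Rain" in e)
```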

# Accessing Individual Rows (and <font color='blue'>columns</font>)
Note: there seem to be several ways to do this. See the following for clear examples: http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation

- loc works on labels in the index.
- iloc works on the positions in the index (so it only takes integers).
- ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.

It's important to note some subtleties that can make ix slightly tricky to use:
- if the index is of integer type, ix will only use label-based indexing and not fall back to position-based indexing. If the label is not in the index, an error is raised.
- if the index does not contain only integers, then given an integer, ix will immediately use position-based indexing rather than label-based indexing. If however ix is given another type (e.g. a string), it can use label-based indexing.
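The difference between loc and iloc can be sketched on a small date-indexed frame. The data below is made up; ix is left out because it no longer exists in pandas 1.0 and later.

```python
import pandas as pd

# Illustrative frame indexed by date, mirroring the notebook's layout
idx = pd.to_datetime(["2012-03-10", "2012-03-11", "2012-03-12"])
df = pd.DataFrame({"max_temp": [56, 67, 71],
                   "mean_temp": [40, 49, 62],
                   "min_temp": [24, 30, 53]}, index=idx)

# loc looks rows up by index label
by_label = df.loc[pd.Timestamp("2012-03-11")]

# iloc looks rows up by integer position (0-based)
by_pos = df.iloc[1]

# The same split applies to columns: positions with iloc, names with loc
cols_by_pos = df.iloc[:, [0, 2]]
cols_by_label = df.loc[:, ["max_temp", "min_temp"]]
```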

To extract the max_temp and min_temp <font color='blue'>columns</font> by position:
data.iloc[:,[0,2]].head()

There are at least three ways to access rows (see below). In each case the result is a Series holding that row's values.

In [85]:
# use the row label (ix was removed in pandas 1.0; there, use data.loc[...])
data.ix[datetime(2012, 3, 10)].head()

max_temp     56
mean_temp    40
min_temp     24
max_dew      24
mean_dew     20
Name: 2012-03-10 00:00:00, dtype: object

In [86]:
# use the integer row position (on pandas >= 1.0, use data.iloc[0])
data.ix[0].head()

max_temp     56
mean_temp    40
min_temp     24
max_dew      24
mean_dew     20
Name: 2012-03-10 00:00:00, dtype: object

In [70]:
data.irow(0).head()  # note the (), not []; irow() is deprecated in favor of iloc[]


max_temp     56
mean_temp    40
min_temp     24
max_dew      24
mean_dew     20
Name: 2012-03-10 00:00:00, dtype: object

In [68]:
data.iloc[0].head()  ### Note the square brackets

max_temp     56
mean_temp    40
min_temp     24
max_dew      24
mean_dew     20
Name: 2012-03-10 00:00:00, dtype: object

In [76]:
# to extract the max_temp and min_temp columns
data.iloc[:,[0,2]].head()

Unnamed: 0_level_0,max_temp,min_temp
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2012-03-10,56,24
2012-03-11,67,30
2012-03-12,71,53
2012-03-13,76,50
2012-03-14,80,44


_______

You can iterate over each row in the DataFrame with iterrows(). Note that this function returns both the index and the row. Also, you must access columns in the row you get back from iterrows() with the dictionary syntax.

In [89]:
num_rain = 0
for idx, row in data.iterrows():
    if "Rain" in row["events"]:
        num_rain += 1

"Days with rain: {0}".format(num_rain)

'Days with rain: 121'
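The same count can be computed without an explicit loop by using the string methods on the column. A minimal sketch on a made-up "events" column, shown next to the loop so the two can be compared:

```python
import pandas as pd

# Illustrative events column with empty strings for "no event"
df = pd.DataFrame({"events": ["", "Rain", "Rain-Thunderstorm", "Fog", ""]})

# Loop version, as in the cell above
num_rain_loop = 0
for idx, row in df.iterrows():
    if "Rain" in row["events"]:
        num_rain_loop += 1

# Vectorized version: str.contains() gives a boolean Series, and summing
# it counts the True values
num_rain_vec = int(df["events"].str.contains("Rain").sum())
```

On larger frames the vectorized form is usually much faster than iterrows(), which builds a new Series for every row.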

In [95]:
# Print the index and min_temp for the first five rows only
for idx, row in data.head().iterrows():
    print(idx, row['min_temp'])


2012-03-10 00:00:00 24
2012-03-11 00:00:00 30
2012-03-12 00:00:00 53
2012-03-13 00:00:00 50
2012-03-14 00:00:00 44


## Filtering 

Most of your time using pandas will likely be devoted to selecting rows of interest from a DataFrame. In addition to strings, the dictionary syntax accepts boolean conditions like this:

In [101]:
freezing_days = data[data.max_temp <= 32]
# freezing_days.head()
freezing_days.describe()

Unnamed: 0,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,mean_pressure,min_pressure,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,cloud_cover,wind_dir
count,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0
mean,28.142857,21.714286,14.857143,19.0,13.619048,6.666667,83.047619,69.047619,54.809524,30.323333,30.18619,30.068095,9.952381,7.761905,3.904762,17.190476,7.285714,23.857143,4.857143,269.47619
std,4.452928,5.487648,7.708993,7.35527,8.06521,9.046178,8.114655,10.200373,13.581675,0.122202,0.165302,0.200565,0.218218,2.278262,3.858818,5.192485,3.180296,7.939054,2.651145,70.01544
min,16.0,11.0,1.0,0.0,-3.0,-5.0,69.0,50.0,27.0,30.14,29.86,29.6,9.0,3.0,0.0,6.0,1.0,7.0,0.0,30.0
25%,26.0,19.0,10.0,16.0,8.0,-2.0,74.0,59.0,43.0,30.21,30.09,30.0,10.0,6.0,1.0,15.0,6.0,21.0,3.0,237.0
50%,30.0,22.0,14.0,20.0,14.0,7.0,84.0,71.0,58.0,30.31,30.22,30.11,10.0,9.0,2.0,17.0,7.0,23.0,5.0,284.0
75%,31.0,26.0,20.0,23.0,19.0,15.0,92.0,75.0,63.0,30.39,30.34,30.19,10.0,10.0,7.0,20.0,9.0,29.0,7.0,308.0
max,32.0,31.0,29.0,31.0,26.0,25.0,92.0,85.0,78.0,30.6,30.44,30.32,10.0,10.0,10.0,28.0,15.0,39.0,8.0,353.0


We get back another DataFrame with fewer rows (21 in this case). This DataFrame can be filtered down even more.

In [100]:
freezing_days[freezing_days.min_temp >= 20].head()

Unnamed: 0_level_0,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353
2013-01-25,30,25,20,18,12,0,74,57,39,30.35,...,10,8,1,16,7,21.0,0.02,6,Snow,192


Or, using boolean operations, we could apply both filters to the original DataFrame at the same time.

In [102]:
data[(data.max_temp <= 32) & (data.min_temp >= 20)].head()

Unnamed: 0_level_0,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-11-24,31,26,21,20,18,15,81,72,63,30.3,...,10,10,9,9,4,14.0,0.00,4,,270
2012-12-21,29,26,22,25,19,15,85,74,63,30.21,...,10,5,0,25,14,39.0,0.02,7,Fog-Snow,285
2012-12-29,32,28,23,28,25,16,92,80,68,30.29,...,10,3,0,18,9,29.0,0.20,8,Fog-Snow,308
2013-01-01,32,26,20,31,25,16,92,83,74,30.2,...,9,5,2,14,5,15.0,T,8,,353
2013-01-25,30,25,20,18,12,0,74,57,39,30.35,...,10,8,1,16,7,21.0,0.02,6,Snow,192
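A minimal sketch comparing the chained and combined forms on a made-up frame. The parentheses around each comparison are required because & binds more tightly than <= and >=. The query() call is an alternative spelling that is not used elsewhere in this notebook.

```python
import pandas as pd

# Illustrative temperatures; rows 0 and 3 satisfy both conditions
df = pd.DataFrame({"max_temp": [31, 29, 56, 32],
                   "min_temp": [21, 10, 24, 20]})

# Two filters applied one after the other
chained = df[df.max_temp <= 32]
chained = chained[chained.min_temp >= 20]

# Both filters at once, combined element-wise with &
combined = df[(df.max_temp <= 32) & (df.min_temp >= 20)]

# The same condition written as a query string
queried = df.query("max_temp <= 32 and min_temp >= 20")
```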


It's important to understand what's really going on underneath with filtering. Let's look at what kind of object we actually get back when creating a filter.

In [103]:
temp_max = data.max_temp <= 32
type(temp_max)

pandas.core.series.Series

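This is a pandas Series object, which is the one-dimensional equivalent of a DataFrame. Because our DataFrame uses datetime objects for the index, the Series is indexed by date as well (older pandas versions called this a specialized TimeSeries object; as the output above shows, the type is reported as a plain Series).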

###### What's inside the filter? and how to access some info on it

In [104]:
temp_max.head()

date
2012-03-10    False
2012-03-11    False
2012-03-12    False
2012-03-13    False
2012-03-14    False
Freq: D, Name: max_temp, dtype: bool

In [125]:
### Aside to figure out how to use what's inside a filter

In [105]:
temp_max.describe()

count       366
unique        2
top       False
freq        345
Name: max_temp, dtype: object

In [106]:
temp_max_info = temp_max.describe()

In [124]:
print(type(temp_max_info))
print("")
print(temp_max_info.iloc[[0, 2]])  # select by position explicitly with iloc
print("")
print(temp_max_info.iloc[0:3:2])   # the same two entries via a stepped slice

<class 'pandas.core.series.Series'>

count      366
top      False
Name: max_temp, dtype: object

count      366
top      False
Name: max_temp, dtype: object


In [None]:
### End of aside

Our filter is nothing more than a Series with a boolean value for every item in the index. When we "run the filter" like so:

In [127]:
data[temp_max]

pandas lines up the rows of the DataFrame and the filter using the index, and then keeps the rows with a True filter value. That's it.
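The alignment step can be made visible by reordering the filter before using it. A minimal sketch on a made-up frame; it uses loc[] for the reordered case, since that is the explicit label-based form.

```python
import pandas as pd

# Illustrative date-indexed frame with two freezing days
idx = pd.to_datetime(["2012-03-10", "2012-12-21", "2013-01-25"])
df = pd.DataFrame({"max_temp": [56, 30, 28]}, index=idx)

# The filter is a boolean Series sharing the DataFrame's index
mask = df.max_temp <= 32
kept = df[mask]

# Reverse the filter's row order: the labels still match, so pandas
# lines them up by index and selects the same rows
shuffled_mask = mask.iloc[::-1]
kept_shuffled = df.loc[shuffled_mask]
```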

Let's create another filter.

In [130]:
temp_min = data.min_temp >= 20
temp_min.head()

date
2012-03-10    True
2012-03-11    True
2012-03-12    True
2012-03-13    True
2012-03-14    True
Freq: D, Name: min_temp, dtype: bool

In [137]:
# rows where the max_temp column is <= 32 and the min_temp column is >= 20
temp_min & temp_max

date
2012-03-10    False
2012-03-11    False
2012-03-12    False
2012-03-13    False
2012-03-14    False
2012-03-15    False
2012-03-16    False
2012-03-17    False
2012-03-18    False
2012-03-19    False
2012-03-20    False
2012-03-21    False
2012-03-22    False
2012-03-23    False
2012-03-24    False
2012-03-25    False
2012-03-26    False
2012-03-27    False
2012-03-28    False
2012-03-29    False
2012-03-30    False
2012-03-31    False
2012-04-01    False
2012-04-02    False
2012-04-03    False
2012-04-04    False
2012-04-05    False
2012-04-06    False
2012-04-07    False
2012-04-08    False
              ...  
2013-02-09    False
2013-02-10    False
2013-02-11    False
2013-02-12    False
2013-02-13    False
2013-02-14    False
2013-02-15    False
2013-02-16    False
2013-02-17    False
2013-02-18    False
2013-02-19    False
2013-02-20    False
2013-02-21    False
2013-02-22    False
2013-02-23    False
2013-02-24    False
2013-02-25    False
2013-02-26    False
2013-02-27    F

...is just lining up the two filters using the index, performing a boolean AND operation, and returning the result as another Series.

In [138]:
(temp_min & temp_max).describe()

count       366
unique        2
top       False
freq        359
dtype: object

We can do other boolean operations too, like OR:

In [139]:
temp_min | temp_max

date
2012-03-10     True
2012-03-11     True
2012-03-12     True
2012-03-13     True
2012-03-14     True
2012-03-15     True
2012-03-16     True
2012-03-17     True
2012-03-18     True
2012-03-19     True
2012-03-20     True
2012-03-21     True
2012-03-22     True
2012-03-23     True
2012-03-24     True
2012-03-25     True
2012-03-26     True
2012-03-27     True
2012-03-28     True
2012-03-29     True
2012-03-30     True
2012-03-31     True
2012-04-01     True
2012-04-02     True
2012-04-03     True
2012-04-04     True
2012-04-05     True
2012-04-06     True
2012-04-07     True
2012-04-08     True
              ...  
2013-02-09     True
2013-02-10     True
2013-02-11     True
2013-02-12     True
2013-02-13     True
2013-02-14     True
2013-02-15     True
2013-02-16     True
2013-02-17    False
2013-02-18     True
2013-02-19     True
2013-02-20     True
2013-02-21    False
2013-02-22     True
2013-02-23     True
2013-02-24     True
2013-02-25     True
2013-02-26     True
2013-02-27     

Because the result is just another Series, we have all of the regular pandas functions at our disposal. The any() function returns True if any value in the Series is True.

In [140]:
temp_both = temp_min & temp_max
temp_both.any()

True
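A few related reductions are handy on boolean Series. A minimal sketch with made-up values:

```python
import pandas as pd

# Illustrative boolean filter
temp_both = pd.Series([False, True, False, True])

has_any = temp_both.any()          # True if at least one value is True
has_all = temp_both.all()          # True only if every value is True
n_true = int(temp_both.sum())      # True counts as 1, so this counts matches
share_true = temp_both.mean()      # fraction of values that are True
```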

Sometimes filters aren't so intuitive. This (sadly) doesn't work:

In [145]:
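try:
    # "Rain" in data.events tests the index labels, not the values, so it is
    # simply False; data[False] then looks for a column named False.
    data["Rain" in data.events]
except KeyError:
    print("dang it, it would be nice if that would've worked")

dang it, it would be nice if that would've worked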
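The reason this fails is that Python's in operator on a Series checks the index labels rather than the values. A minimal sketch on a made-up Series with the default integer index:

```python
import pandas as pd

# Illustrative events column with index labels 0, 1, 2
events = pd.Series(["", "Rain", "Fog"])

in_labels = "Rain" in events         # checks the index labels
label_hit = 1 in events              # 1 is one of the index labels
in_values = "Rain" in events.values  # checks the underlying array of values

# A per-row test needs an element-wise operation such as apply()
mask = events.apply(lambda e: "Rain" in e)
```

So the expression inside the brackets collapses to a single boolean before pandas ever sees it, which is why it cannot act as a row filter.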


We can wrap it up in an apply() call fairly easily, though:

In [150]:
data[data.events.apply(lambda e: "Rain" in e)]
# Remember: data.events.apply(lambda e: "Rain" in e) returns a boolean Series
# the same length as data (True where "Rain" appears in that row's events value);
# passing that Series to data[] returns only the rows that were True.

Unnamed: 0_level_0,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,max_visibilty,mean_visibility,min_visibility,max_wind,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2012-03-11,67,49,30,43,31,24,78,53,28,30.37,...,10,10,10,22,7,32.0,T,1,Rain,163
2012-03-12,71,62,53,59,55,43,90,76,61,30.13,...,10,10,6,24,14,36.0,0.03,6,Rain,190
2012-03-15,79,69,58,61,58,53,90,69,48,30.13,...,10,10,10,31,10,41.0,0.04,3,Rain-Thunderstorm,209
2012-03-18,80,70,59,61,58,57,93,69,45,30.14,...,10,10,9,18,8,25.0,T,2,Rain,197
2012-03-22,81,69,57,63,57,51,87,65,42,30.11,...,10,10,2,31,4,41.0,0.14,3,Rain,159
2012-03-23,73,64,55,61,58,54,97,79,61,30.03,...,10,9,2,21,6,24.0,0.86,7,Rain-Thunderstorm,129
2012-03-24,65,56,46,54,49,43,100,80,48,29.88,...,10,8,0,12,5,14.0,0.06,5,Fog-Rain,222
2012-03-29,69,58,46,45,39,35,76,55,34,30.08,...,10,10,10,14,6,17.0,T,2,Rain,84
2012-03-30,81,66,51,61,50,42,78,59,39,29.93,...,10,10,10,25,11,37.0,0.01,4,Rain-Thunderstorm,182
2012-04-01,79,64,48,62,52,44,96,75,54,29.89,...,10,6,1,24,7,31.0,0.51,4,Rain-Thunderstorm,169


# Grouping
Besides  <font color='red'>apply()</font>, another great DataFrame function is  <font color='red'>groupby()</font>. It will group a DataFrame by one or more columns, and let you iterate through each group.

As an example, let's group our DataFrame by the "cloud_cover" column (a value ranging from 0 to 8).

In [154]:
cover_temps = {}
for cover, cover_data in data.groupby("cloud_cover"):
    cover_temps[cover] = cover_data.mean_temp.mean()
    # The line above stores the mean of the mean_temp column for each cloud_cover group
cover_temps

{0: 59.73076923076923,
 1: 61.41509433962264,
 2: 59.72727272727273,
 3: 58.0625,
 4: 51.5,
 5: 50.827586206896555,
 6: 57.72727272727273,
 7: 46.5,
 8: 40.90909090909091}
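The loop above can also be written as a single aggregation on the grouped object. A minimal sketch on a made-up frame, with the loop kept alongside for comparison:

```python
import pandas as pd

# Illustrative frame with two cloud_cover groups
df = pd.DataFrame({"cloud_cover": [0, 0, 8, 8],
                   "mean_temp": [60, 70, 40, 42]})

# Loop version, as in the cell above
cover_temps = {}
for cover, cover_data in df.groupby("cloud_cover"):
    cover_temps[cover] = cover_data.mean_temp.mean()

# Aggregation version: select the column on the grouped object and call
# mean(); the result is a Series indexed by the cloud_cover values
grouped_means = df.groupby("cloud_cover")["mean_temp"].mean()
```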

In [164]:
for cover, cover_data in data.groupby("cloud_cover"):
    print(cover)  # the cloud_cover value shared by the first group
    print()
    # print(cover_data)  # would print every row in the first group
    print(cover_data.describe())  # summary statistics for the first group
    break  # stop after the first group


0

         max_temp   mean_temp    min_temp     max_dew    mean_dew     min_dew  \
count  104.000000  104.000000  104.000000  104.000000  104.000000  104.000000   
mean    74.057692   59.730769   44.942308   49.144231   44.201923   38.884615   
std     18.819958   17.698791   16.834908   15.524749   15.605812   15.927815   
min     25.000000   13.000000    1.000000   13.000000    7.000000   -2.000000   
25%     56.750000   44.750000   31.750000   34.000000   28.000000   24.000000   
50%     79.500000   64.000000   48.500000   54.000000   48.500000   41.000000   
75%     90.000000   73.250000   57.250000   61.000000   57.000000   51.000000   
max    103.000000   89.000000   77.000000   74.000000   70.000000   67.000000   

       max_humidity  mean_humidity  min_humidity  max_pressure  mean_pressure  \
count    104.000000     104.000000    104.000000    104.000000     104.000000   
mean      88.375000      61.480769     34.019231     30.151346      30.087885   
std       10.037007     

When you iterate through the result of <font color = "red">groupby()</font>, you will get a tuple. The first item is the column value, and the second item is a filtered DataFrame (where the column equals the first tuple value).

You can group by more than one column as well. In this case, the first tuple item returned by groupby() will itself be a tuple with the value of each column.

In [167]:
for (cover, events), group_data in data.groupby(["cloud_cover", "events"]):
    print("Cover: {0}, Events: {1}, Count: {2}".format(cover, events, len(group_data)))

Cover: 0, Events: , Count: 99
Cover: 0, Events: Fog, Count: 2
Cover: 0, Events: Rain, Count: 2
Cover: 0, Events: Thunderstorm, Count: 1
Cover: 1, Events: , Count: 35
Cover: 1, Events: Fog, Count: 5
Cover: 1, Events: Fog-Rain, Count: 1
Cover: 1, Events: Rain, Count: 4
Cover: 1, Events: Rain-Thunderstorm, Count: 2
Cover: 1, Events: Thunderstorm, Count: 6
Cover: 2, Events: , Count: 20
Cover: 2, Events: Fog, Count: 1
Cover: 2, Events: Rain, Count: 5
Cover: 2, Events: Rain-Thunderstorm, Count: 4
Cover: 2, Events: Snow, Count: 1
Cover: 2, Events: Thunderstorm, Count: 2
Cover: 3, Events: , Count: 12
Cover: 3, Events: Fog, Count: 2
Cover: 3, Events: Fog-Rain-Thunderstorm, Count: 3
Cover: 3, Events: Fog-Thunderstorm, Count: 1
Cover: 3, Events: Rain, Count: 9
Cover: 3, Events: Rain-Thunderstorm, Count: 4
Cover: 3, Events: Snow, Count: 1
Cover: 4, Events: , Count: 16
Cover: 4, Events: Fog, Count: 3
Cover: 4, Events: Fog-Rain, Count: 2
Cover: 4, Events: Fog-Rain-Thunderstorm, Count: 2
Cover: 4, Ev
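
If all you want is the count per group, the loop is optional. This is a minimal sketch, assuming a made-up frame (`toy` is hypothetical) with the same two column names; `size()` counts the rows in each group:

```python
import pandas as pd

# Hypothetical toy data; an empty string means "no event", as in the weather data
toy = pd.DataFrame({
    "cloud_cover": [0, 0, 0, 1, 1],
    "events": ["", "", "Rain", "Fog", "Fog"],
})

# One row count per (cloud_cover, events) pair, indexed by a MultiIndex
counts = toy.groupby(["cloud_cover", "events"]).size()

print(counts)
print(counts.loc[(0, "Rain")])  # look up one group by its tuple key
```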

# Creating New Columns
Weather events in our DataFrame are stored in strings like "Rain-Thunderstorm" to represent that it rained and there was a thunderstorm that day. Let's split them out into boolean "rain", "thunderstorm", etc. columns.

First, let's discover the different kinds of weather events we have with <font color = red>unique()</font>.

In [173]:
print(type(data.events.unique()))

data.events.unique()

<class 'numpy.ndarray'>


array(['', 'Rain', 'Rain-Thunderstorm', 'Fog-Thunderstorm', 'Fog-Rain',
       'Thunderstorm', 'Fog-Rain-Thunderstorm', 'Fog', 'Fog-Rain-Snow',
       'Fog-Rain-Snow-Thunderstorm', 'Fog-Snow', 'Snow', 'Rain-Snow'], dtype=object)

Looks like we have "Rain", "Thunderstorm", "Fog", and "Snow" events. Creating a new column for each of these event kinds is a piece of cake with the dictionary syntax.

In [176]:
for event_kind in ["Rain", "Thunderstorm", "Fog", "Snow"]:
    col_name = event_kind.lower()  # Turn "Rain" into "rain", etc.
    data[col_name] = data.events.apply(lambda e: event_kind in e)
    # Left of "=": creates (or overwrites) the column named col_name.
    # Right of "=": apply() runs the lambda on each row's events string and
    # returns True if event_kind appears in it, False otherwise.
data.head()

date,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir,rain,thunderstorm,fog,snow
2012-03-10,56,40,24,24,20,16,74,50,26,30.53,...,6,17.0,0.00,0,,138,False,False,False,False
2012-03-11,67,49,30,43,31,24,78,53,28,30.37,...,7,32.0,T,1,Rain,163,True,False,False,False
2012-03-12,71,62,53,59,55,43,90,76,61,30.13,...,14,36.0,0.03,6,Rain,190,True,False,False,False
2012-03-13,76,63,50,57,53,47,93,66,38,30.12,...,5,24.0,0.00,0,,242,False,False,False,False
2012-03-14,80,62,44,58,52,43,93,68,42,30.15,...,6,22.0,0.00,0,,202,False,False,False,False


Our new columns show up at the end (the rightmost columns of the table). We can access them now with the dot syntax.

In [177]:
data.rain.head()

date
2012-03-10    False
2012-03-11     True
2012-03-12     True
2012-03-13    False
2012-03-14    False
Freq: D, Name: rain, dtype: bool

We can also do cool things like find out how many True values there are (i.e., how many days had rain)...

In [178]:
data.rain.sum()

121

...and get all the days that had both rain and snow!

In [179]:
data[data.rain & data.snow].head()

date,max_temp,mean_temp,min_temp,max_dew,mean_dew,min_dew,max_humidity,mean_humidity,min_humidity,max_pressure,...,mean_wind,min_wind,precipitation,cloud_cover,events,wind_dir,rain,thunderstorm,fog,snow
2012-11-12,57,43,28,54,38,21,96,72,48,30.41,...,8,29.0,0.77,6,Fog-Rain-Snow,267,True,False,True,True
2012-12-20,57,43,28,50,36,25,92,79,66,29.75,...,19,54.0,0.44,8,Fog-Rain-Snow-Thunderstorm,219,True,True,True,True
2013-01-30,68,48,27,57,42,20,96,82,68,29.7,...,14,63.0,0.99,8,Rain-Snow,260,True,False,False,True
2013-02-19,47,35,23,43,19,10,86,71,55,30.19,...,16,39.0,0.1,8,Rain-Snow,282,True,False,False,True
2013-02-21,33,24,15,27,20,12,88,71,54,30.35,...,10,32.0,0.3,5,Fog-Rain-Snow,91,True,False,True,True
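
The boolean columns can also be built with the vectorized string method `str.contains()` instead of `apply()` with a lambda. This is a minimal sketch on made-up data (`toy` is hypothetical), not a replacement for the cells above:

```python
import pandas as pd

# Hypothetical toy data in the same "A-B-C" event format
toy = pd.DataFrame(
    {"events": ["", "Rain", "Fog-Rain-Snow", "Snow", "Rain-Thunderstorm"]}
)

for event_kind in ["Rain", "Thunderstorm", "Fog", "Snow"]:
    # regex=False treats the event name as a plain substring
    toy[event_kind.lower()] = toy["events"].str.contains(event_kind, regex=False)

# Boolean columns combine with & (and), | (or), ~ (not) to filter rows
rain_and_snow = toy[toy.rain & toy.snow]

print(toy.rain.sum())  # True counts as 1, so this is the number of rainy rows
print(rain_and_snow)
```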


# Plotting 

We've already seen how the <font color = red>hist()</font> function makes generating histograms a snap. Let's look at the <font color = red>plot()</font> function now.

In [187]:
data.max_temp.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x116d026a0>

That one line of code did a lot for us. First, it created a nice looking line plot using the maximum temperature column from our DataFrame. Second, because we used datetime objects in our index, pandas labeled the x-axis appropriately.

Pandas is smart too. If we're only looking at a couple of days, the x-axis looks different:

In [188]:
data.max_temp.tail().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x116f38e48>

### Side note about matplotlib notebook magic
If you don't stop the interactive mode of a figure before making another plot call, you don't get a new plot; the existing plot is modified instead. For example, try "data.max_temp.plot()" followed by "data.max_temp.tail().plot()".
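
One way to get a fresh plot each time is to open a new figure before plotting. This is a minimal sketch, assuming a made-up series (`temps` is hypothetical) and the non-interactive Agg backend so it also runs outside a notebook; in the notebook you would keep `%matplotlib notebook` and leave out the `matplotlib.use` line:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs as a script
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for data.max_temp
temps = pd.Series(
    [56, 67, 71, 76, 80, 79, 75],
    index=pd.date_range("2012-03-10", periods=7, freq="D"),
    name="max_temp",
)

plt.figure()               # first figure
ax1 = temps.plot()

plt.figure()               # open a second figure instead of drawing on the first
ax2 = temps.tail().plot()

print(ax1.figure is ax2.figure)
```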
### End of side note about %matplotlib notebook

Prefer a bar plot? Pandas has got you covered.

In [189]:
data.max_temp.tail().plot(kind="bar", rot=10)

<matplotlib.axes._subplots.AxesSubplot at 0x116f91780>

The plot() function returns a matplotlib AxesSubplot object. You can pass this object into subsequent calls to plot() in order to compose plots.

Although plot() takes a variety of parameters to customize your plot, users familiar with matplotlib will feel right at home with the AxesSubplot object.

In [190]:
ax = data.max_temp.plot(title="Min and Max Temperatures")
data.min_temp.plot(style="red", ax=ax)
ax.set_ylabel("Temperature (F)")

<matplotlib.text.Text at 0x117aab2b0>

# Getting data out
Writing data out in pandas is as easy as getting data in. To save our DataFrame out to a new csv file, we can just do this:

In [None]:
data.to_csv("data/weather-mod.csv") # to make comma separated
data.to_csv("data/weather-mod.tsv", sep="\t") # to make tab separated

There's also support for reading and writing Excel files, if you need it. (http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files)
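
To check that a file written this way reads back cleanly, here is a minimal round-trip sketch. It assumes a made-up frame (`toy` is hypothetical) and writes to an in-memory buffer rather than the `data/` folder used above:

```python
import io
import pandas as pd

# Hypothetical toy frame with a date index, like the weather DataFrame
toy = pd.DataFrame(
    {"max_temp": [56, 67], "events": ["", "Rain"]},
    index=pd.date_range("2012-03-10", periods=2, freq="D", name="date"),
)

buffer = io.StringIO()        # in-memory stand-in for a file on disk
toy.to_csv(buffer, sep="\t")  # tab separated; the index becomes the first column
buffer.seek(0)

# index_col + parse_dates rebuild the date index; keep_default_na=False keeps
# empty event strings as "" instead of turning them into NaN
restored = pd.read_csv(
    buffer, sep="\t", index_col="date", parse_dates=True, keep_default_na=False
)

print(restored)
```

Without `keep_default_na=False`, the days with no events would come back as NaN, and the `event_kind in e` test from earlier would fail on them.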

# Miscellanea
We've only covered a small fraction of the pandas library here. Before I wrap up, however, there are a few miscellaneous tips I'd like to go over.

First, it can be confusing to know when an operation will modify a DataFrame and when it will return a copy to you. Pandas' behavior here is largely inherited from NumPy's view-versus-copy rules, and some situations are unintuitive.

For example, what do you think will happen here?

In [191]:
for idx, row in data.iterrows():
    row["max_temp"] = 0
data.max_temp.head()

date
2012-03-10    56
2012-03-11    67
2012-03-12    71
2012-03-13    76
2012-03-14    80
Freq: D, Name: max_temp, dtype: int64

Contrary to what you might expect, modifying row did not modify data! This is because row is a copy, and does not point back to the original DataFrame.

Here's the right way to do it:

In [193]:
for idx, row in data.iterrows():
    data.loc[idx, "max_temp"] = 0  # .loc replaces .ix, which was removed in pandas 1.0
any(data.max_temp != 0)  # Any rows with max_temp not equal to zero?

False

Just to make you even more confused, this also doesn't work:

In [194]:
for idx, row in data.iterrows():
    data.loc[idx]["max_temp"] = 100  # chained indexing: assigns to a temporary copy
data.max_temp.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


date
2012-03-10    0
2012-03-11    0
2012-03-12    0
2012-03-13    0
2012-03-14    0
Freq: D, Name: max_temp, dtype: int64
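
The difference between the working and non-working versions is whether the row and column are selected in one indexing operation or two. This is a minimal sketch on made-up data (`toy` is hypothetical) that sticks to the single-step form and also shows a loop-free variant with a boolean mask:

```python
import pandas as pd

# Hypothetical toy frame with mixed column types, like the weather data
toy = pd.DataFrame(
    {"max_temp": [56, 67, 71], "events": ["", "Rain", "Rain"]},
    index=pd.date_range("2012-03-10", periods=3, freq="D", name="date"),
)

# One indexing operation (row label and column name together), so pandas
# writes into the original frame rather than into a temporary object
toy.loc[toy.index[0], "max_temp"] = 0

# A boolean mask plus a column name is also a single step, and needs no loop
toy.loc[toy.events == "Rain", "max_temp"] = 100

print(toy.max_temp.tolist())
```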

When using <font color = red>apply()</font>, the <font color = blue>default</font> behavior is to go over columns.

In [195]:
data.apply(lambda c: c.name)

max_temp                  max_temp
mean_temp                mean_temp
min_temp                  min_temp
max_dew                    max_dew
mean_dew                  mean_dew
min_dew                    min_dew
max_humidity          max_humidity
mean_humidity        mean_humidity
min_humidity          min_humidity
max_pressure          max_pressure
mean_pressure        mean_pressure
min_pressure          min_pressure
max_visibilty        max_visibilty
mean_visibility    mean_visibility
min_visibility      min_visibility
max_wind                  max_wind
mean_wind                mean_wind
min_wind                  min_wind
precipitation        precipitation
cloud_cover            cloud_cover
events                      events
wind_dir                  wind_dir
rain                          rain
thunderstorm          thunderstorm
fog                            fog
snow                          snow
dtype: object



You can make apply() go over rows by passing axis=1.

In [196]:
data.apply(lambda r: r["max_pressure"] - r["min_pressure"], axis=1).head()

date
2012-03-10    0.19
2012-03-11    0.24
2012-03-12    0.25
2012-03-13    0.15
2012-03-14    0.11
Freq: D, dtype: float64
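
For simple arithmetic like this, subtracting the columns directly gives the same result without a per-row lambda. This is a minimal sketch on made-up pressures (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data reusing the pressure column names
toy = pd.DataFrame({
    "max_pressure": [30.5, 30.25, 30.0],
    "min_pressure": [30.25, 30.0, 29.5],
})

# Row-wise apply, as in the cell above
by_row = toy.apply(lambda r: r["max_pressure"] - r["min_pressure"], axis=1)

# Column arithmetic: pandas subtracts element by element, aligned on the index
vectorized = toy["max_pressure"] - toy["min_pressure"]

print(by_row.equals(vectorized))
```

The row-wise form is still useful when the per-row logic cannot be expressed as column operations.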

When you call drop(), though, it's flipped: the default axis=0 drops rows by label, so to drop a column you need to pass axis=1.

In [197]:
data.drop(["events"], axis=1).columns

Index(['max_temp', 'mean_temp', 'min_temp', 'max_dew', 'mean_dew', 'min_dew',
       'max_humidity', 'mean_humidity', 'min_humidity', 'max_pressure',
       'mean_pressure', 'min_pressure', 'max_visibilty', 'mean_visibility',
       'min_visibility', 'max_wind', 'mean_wind', 'min_wind', 'precipitation',
       'cloud_cover', 'wind_dir', 'rain', 'thunderstorm', 'fog', 'snow'],
      dtype='object')
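
If the axis numbers are hard to remember, drop() also accepts a `columns=` keyword that names the intent directly. This is a minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"max_temp": [56, 67], "events": ["", "Rain"]})

by_axis = toy.drop(["events"], axis=1)     # drop a column via axis=1
by_keyword = toy.drop(columns=["events"])  # same thing, spelled out
without_row = toy.drop([0])                # default axis=0: drops the row labelled 0

print(list(by_axis.columns), list(by_keyword.columns), list(without_row.index))
print(list(toy.columns))  # drop() returns a new frame; toy itself is unchanged
```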