# Pandas

pandas builds on the numpy ndarray to provide a data structure with labeled rows and columns, called a data frame.  You can think of it as working much like a spreadsheet.

In this way, it provides much of the same functionality as R -- the data frame is the basis for data analysis.

Nice documentation is here:

http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## series

A series is a labeled array.  It looks superficially like a dictionary, but it is fixed size and can handle missing values.  It can also be operated on with any numpy operation or the standard operators (a dictionary cannot).  The labels are referred to as the _index_.

Some examples from: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s


In [3]:
s.index


If you don't specify an index, a default integer index (0, 1, 2, ...) is created for you.

In [4]:
pd.Series(np.random.randn(5))


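You can initialize from a dictionary.  By default, the dictionary keys become the index.  Recent pandas versions (0.23 and later, on Python 3.6 and later) keep the keys in insertion order; older versions sorted them.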

In [5]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

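
Here is a minimal sketch of how the index is ordered when building from a dictionary.  It assumes a recent pandas (0.23 or later); the name `s_dict` and the key order are made up for illustration.  If you want sorted labels, `sort_index()` gives them explicitly.

```python
import pandas as pd

# keys deliberately given out of alphabetical order
d = {'c': 2., 'a': 0., 'b': 1.}

s_dict = pd.Series(d)

# on recent pandas the index follows the insertion order of the dict
print(list(s_dict.index))

# sort the labels explicitly if that is what you need
s_sorted = s_dict.sort_index()
print(list(s_sorted.index))
```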

In [6]:
pd.Series(d, index=['b', 'c', 'd', 'a'])


Note that NaN indicates a missing value: here the label `d` is not a key in the dictionary.
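
As a small sketch of working with such gaps, `isna()` flags the missing entries and `dropna()` removes them.  The names `s_re`, `missing` and `present` are just illustrative.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# 'd' is not in the dictionary, so its value is NaN
s_re = pd.Series(d, index=['b', 'c', 'd', 'a'])

# boolean Series that is True where a value is missing
missing = s_re.isna()
print(missing)

# a new Series with the missing entries removed
present = s_re.dropna()
print(present)
```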

You can operate on a series as you would any ndarray.

In [7]:
s


In [8]:
s.iloc[0]  # positional access; plain s[0] is deprecated for a label-indexed Series in pandas 2.x


In [9]:
s[:3]


In [10]:
s[s > s.median()]


In [11]:
np.exp(s)


You can also index by label -- this mimics the behavior of a dictionary.

In [12]:
s['a']


In [13]:
s['e']


In [14]:
'e' in s


The `get()` method can be used to safely access an element if it is possible it does not exist -- you can specify a default to return in that case.  The alternative is to use a `try` / `except` block.

In [15]:
s.get('f', np.nan)

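
To make the comparison concrete, here is a sketch of both approaches side by side on a small made-up Series (the values and variable names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# get() returns the default instead of raising when the label is absent
val_get = s.get('f', np.nan)

# the equivalent with try / except: a missing label raises KeyError
try:
    val_try = s['f']
except KeyError:
    val_try = np.nan

print(val_get, val_try)
```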

Operations like those you use with an ndarray work fine on a Series.

In [16]:
s + s


In [17]:
s * 2


Note that operations are always aligned on labels, so the following is not exactly the same as with numpy arrays.  The result's index is the union of the two indices, and any label that appears in only one operand gets NaN.

In [18]:
s[1:] + s[:-1]

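
Since the Series above holds random numbers, here is the same idea sketched with fixed values so the alignment is easy to follow.  The data is made up, and `add(..., fill_value=0)` is shown as one possible way to avoid the NaNs, not the only one.

```python
import numpy as np
import pandas as pd

s = pd.Series([1., 2., 3., 4., 5.], index=['a', 'b', 'c', 'd', 'e'])

# pandas lines up the labels: 'a' and 'e' each appear in only one operand
aligned = s.iloc[1:] + s.iloc[:-1]
print(aligned)

# the underlying numpy arrays are added purely by position instead
positional = s.values[1:] + s.values[:-1]
print(positional)

# treat a label missing from one operand as 0 rather than producing NaN
filled = s.iloc[1:].add(s.iloc[:-1], fill_value=0)
print(filled)
```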

A series can have a name.

In [19]:
s = pd.Series(np.random.randn(5), name='something')
s


## DataFrame

The dataframe is like a spreadsheet -- the columns and rows have labels.  It is 2-d.  This is what you will usually use with pandas.

You can initialize from:
  * Dict of 1D ndarrays, lists, dicts, or Series
  * 2-D numpy.ndarray
  * Structured or record ndarray
  * A Series
  * Another DataFrame

In [20]:
d = {'one' : pd.Series([1., 2., 3.], index=['b', 'a', 'c']),
     'two' : pd.Series([2, 1., 3., 4.], index=['b', 'a', 'c', 'd'])}


In [21]:
df = pd.DataFrame(d)
df


In [22]:
df.mean()

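
One thing worth noticing in this example: the two Series have different indices, so the frame gets the union of the labels and column `one` has a NaN in row `d`.  A sketch of how that affects `mean()`, which skips NaN by default:

```python
import numpy as np
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['b', 'a', 'c']),
     'two': pd.Series([2, 1., 3., 4.], index=['b', 'a', 'c', 'd'])}
df = pd.DataFrame(d)

# 'one' has no entry for label 'd', so that cell is NaN
print(df.loc['d', 'one'])

# by default the column means ignore missing values
col_means = df.mean()
print(col_means)

# with skipna=False a single NaN makes the whole column mean NaN
strict_means = df.mean(skipna=False)
print(strict_means)
```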

You can select a subset of the row labels (and set their order) by passing `index`.

In [23]:
pd.DataFrame(d, index=['d', 'b', 'a'])


Here's initialization from lists / ndarrays

In [24]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}

In [25]:
pd.DataFrame(d)


In [26]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])


There are lots of other initialization methods, e.g., a list of dicts.

In [27]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2, index=['first', 'second'])


### working with the dataframe

You can index it as if it were a dict of Series objects.  Other access is as follows:

  * Select column: `df[col]` (returns Series)
  * Select row by label: `df.loc[label]` (returns Series)
  * Select row by integer location: `df.iloc[loc]` (returns Series)
  * Slice rows: `df[5:10]` (returns DataFrame)
  * Select rows by boolean vector: `df[bool_vec]` (returns DataFrame)
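
The access patterns in this list can be sketched on a small made-up frame (the data and variable names here are illustrative, not part of the examples that follow):

```python
import pandas as pd

df_demo = pd.DataFrame({'one': [1., 2., 3., 4.],
                        'two': [4., 3., 2., 1.]},
                       index=['a', 'b', 'c', 'd'])

col = df_demo['one']              # select a column -> Series
row_label = df_demo.loc['b']      # select a row by label -> Series
row_pos = df_demo.iloc[1]         # select a row by integer location -> Series
rows = df_demo[1:3]               # slice rows by position -> DataFrame
mask = df_demo[df_demo['one'] > 2]  # select rows by boolean vector -> DataFrame

print(type(col).__name__, type(rows).__name__)
print(mask)
```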

In [28]:
df['one']


In [29]:
df


In [30]:
type(df['one'])


In [31]:
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df


You can also treat any column name that is a valid Python identifier as if it were an attribute.

In [32]:
df.three


You can delete or pop columns -- popping returns a `Series`.

In [33]:
del df['two']


In [34]:
three = df.pop('three')


In [35]:
df


In [36]:
three


In [37]:
type(three)


Assigning a scalar to a column propagates that scalar to all the rows.

In [38]:
df['foo'] = 'bar'


In [39]:
df


## CSV

You can also read from CSV.

Note: if there is stray whitespace in the strings in your CSV, pandas will keep it.  The `skipinitialspace=True` argument used below drops whitespace that follows a delimiter; for anything else you might need to investigate converters to get things properly formatted.

There are similar methods for HDF5 and Excel (`pd.read_hdf()` and `pd.read_excel()`).
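
Here is a sketch of the whitespace issue using an in-memory CSV, so it runs without any file on disk.  The CSV text is made up for illustration and is not the contents of `sample.csv`; stripping the index with `.str.strip()` is one possible cleanup, not the only one.

```python
import io
import pandas as pd

# hypothetical CSV text with spaces after the commas and after the label "B"
csv_text = "student, hw 1, exam\nA, 10, 90\nB , 8, 85\n"

# without skipinitialspace the leading spaces end up in the column names
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# skipinitialspace=True drops whitespace that follows each delimiter
clean = pd.read_csv(io.StringIO(csv_text), index_col="student",
                    skipinitialspace=True)
print(list(clean.columns))

# trailing whitespace (the "B " label) is still there; strip it by hand
clean.index = clean.index.str.strip()
print(list(clean.index))
```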

In [40]:
grades = pd.read_csv('sample.csv', index_col="student", skipinitialspace=True)


In [41]:
grades


In [42]:
grades.index


In [43]:
grades.columns


A single student's grades

In [44]:
grades.loc["A"]


All the grades for the first homework

In [45]:
grades['hw 1']


Creating a new column based on the existing ones

In [46]:
grades['hw average'] = (grades['hw 1'] + grades['hw 2'] + grades['hw 3'] + grades['hw 4'])/4.0


In [47]:
grades


This didn't handle the missing data properly: any student with a missing homework gets NaN for the average.  Let's replace the NaNs with 0.

In [48]:
g2 = grades.fillna(0)


In [49]:
g2['hw average'] = (g2['hw 1'] + g2['hw 2'] + g2['hw 3'] + g2['hw 4'])/4.0


In [50]:
g2

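
To see what replacing NaN with 0 does to the average, here is a sketch on a tiny made-up grade table (two students, two homeworks; none of this is the real `sample.csv` data).  It also shows `mean(axis=1)`, which skips missing values instead of counting them as zero; which one you want depends on your grading policy.

```python
import numpy as np
import pandas as pd

# hypothetical data: student B did not turn in hw 2
demo = pd.DataFrame({'hw 1': [10., 8.],
                     'hw 2': [9., np.nan]},
                    index=['A', 'B'])
hw_cols = ['hw 1', 'hw 2']

# adding the columns by hand propagates the NaN
naive = (demo['hw 1'] + demo['hw 2']) / 2.0

# missing homework counts as a zero
zero_filled = demo[hw_cols].fillna(0).mean(axis=1)

# missing homework is ignored (average over what was turned in)
skipped = demo[hw_cols].mean(axis=1)

print(naive['B'], zero_filled['B'], skipped['B'])
```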

For big dataframes, we can view just pieces

In [51]:
g2.head()


In [52]:
g2.tail(2)


### statistics

We can get lots of statistics.

In [53]:
g2.describe()


Want to sort by values?

In [54]:
g2.sort_values(by="exam")


In [55]:
g2.mean()


In [56]:
g2.median()


In [57]:
g2.max()


In [58]:
g2


`.apply()` lets you apply a function to the `DataFrame`.  By default (`axis=0`), the function is applied to each column in turn, and each column is passed in as a `Series`.

In [59]:
g2.apply(lambda x: x.max() - x.min())

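
A quick sketch of the two directions `apply()` can work in, on a small made-up frame: the default goes column by column, while `axis=1` hands each row to the function.

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1., 4., 2.],
                        'y': [10., 30., 20.]})

# default (axis=0): the lambda receives each column as a Series
col_range = df_demo.apply(lambda c: c.max() - c.min())
print(col_range)

# axis=1: the lambda receives each row as a Series
row_range = df_demo.apply(lambda r: r.max() - r.min(), axis=1)
print(row_range)
```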

### access

Pandas provides optimized methods for accessing data: `.at`, `.iat`, `.loc`, and `.iloc`.  (The older `.ix` accessor was deprecated and then removed in pandas 1.0.)

The standard slice notation works for rows, but note *when using labels, both endpoints are included*

In [60]:
g2["E":"I"]

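
The endpoint rule is easy to trip over, so here is a sketch contrasting a label slice with a position slice on a small made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'exam': [90, 80, 70, 60]},
                       index=['A', 'B', 'C', 'D'])

# label slice: both 'B' and 'C' are included
by_label = df_demo.loc['B':'C']
print(by_label)

# position slice: the usual Python rule, the stop position is excluded
by_pos = df_demo.iloc[1:2]
print(by_pos)
```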

In [61]:
g2.loc[:,["hw 1", "exam"]]


`at` is a faster access method for a single value

In [62]:
g2.at["A","exam"]


The `i` routines work by integer position, similar to how numpy indexing does

In [63]:
g2.iloc[3:5,0:2]


In [64]:
g2.iloc[[1,3,5], [1,2,3,4]]


In [65]:
g2.iat[2,2]


### boolean indexing

In [66]:
g2[g2.exam > 90]


### np arrays

In [67]:
g2.loc[:, "new"] = np.random.random(len(g2))


In [68]:
g2


Resetting values

In [69]:
a = g2[g2.exam < 80].index


In [70]:
a


In [71]:
g2.loc[a, "exam"] = 80


In [72]:
g2

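
The same reset can be sketched in one step by passing the boolean mask straight to `.loc`, without building the index first.  The frame below is made up for illustration, and `clip(lower=80)` is shown as an alternative that happens to give the same result for this particular rule.

```python
import pandas as pd

df_demo = pd.DataFrame({'exam': [95, 72, 88, 60]},
                       index=['A', 'B', 'C', 'D'])

# alternative computed first, from the untouched data
clipped = df_demo['exam'].clip(lower=80)

# boolean mask of the rows to change, used directly as the row selector
low = df_demo['exam'] < 80
df_demo.loc[low, 'exam'] = 80

print(df_demo)
print(clipped)
```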

## histogramming

In [73]:
g2["exam"].value_counts()

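
`value_counts()` counts each distinct value, which is only a histogram in the loose sense.  If you want counts per range of scores, one possible approach (an assumption on my part, not something the original example does) is to bin with `pd.cut()` first.  The scores and bin edges below are made up.

```python
import pandas as pd

exam = pd.Series([95, 72, 88, 60, 88])

# one count per distinct value
counts = exam.value_counts()
print(counts)

# bin into the intervals (50, 70], (70, 90], (90, 100], then count per bin
bins = pd.cut(exam, bins=[50, 70, 90, 100])
binned = bins.value_counts().sort_index()
print(binned)
```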

## plotting

In [74]:
%matplotlib inline

In [75]:
g2.plot()


In [76]:
g2.plot.scatter(x="hw average", y="exam", marker="o")


A lot more examples at: http://pandas.pydata.org/pandas-docs/stable/visualization.html

In [77]:
g2


In [78]:
g2.loc["R", :] = 1


In [79]:
g2


In [80]:
g2.to_latex()
