# Review of Pandas

-----

Pandas is one of the most useful packages in Python. Its primary data structure is the DataFrame, which is a table of data with rows and columns.

In this review, I will show you the basics of DataFrames: how to create, manipulate, and filter them. Before I can do that, however, I need to review the Series data structure, since it is the basis of the DataFrame.

In a future lesson, I will show you how to group data in a DataFrame.

## Preliminaries

In [1]:
# Print working directory
%pwd

'/home/data_scientist'

In [2]:
%cd '/home/data_scientist/accy575/readonly/Pcard'

/home/data_scientist/accy575/readonly/Pcard


In [3]:
%ls

[0m[01;32mPCard_FY2010.csv[0m*  [01;32mPCard_FY2012.csv[0m*  [01;32mPCard_FY2014.csv[0m*
[01;32mPCard_FY2011.csv[0m*  [01;32mPCard_FY2013.csv[0m*  [01;32mPCard_FY2015.csv[0m*


In [29]:
import pandas as pd
import numpy as np

## The `Series` Data Structure

You can think of a series as a single column of data. Each element of the series has a label (called the index). 

Let's create a simple series:

In [5]:
s1 = pd.Series(['a','b','c','d'])
s1

0    a
1    b
2    c
3    d
dtype: object

In [6]:
s1[2]

'c'

Notice that, by default, Pandas created labels for each element of my series. These default labels always start at 0. If I want to use different labels, I can do so:

In [7]:
s2 = pd.Series(
    ['a','b','c','d'], 
    index = ['element 1', 'element 2', 'element 3', 'element 4']
)
s2

element 1    a
element 2    b
element 3    c
element 4    d
dtype: object

In [8]:
s2['element 2']

'b'
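
Square brackets can mean either a label or a position, depending on the index. If you want to be explicit about which one you mean, one option is to use the `.loc` (label) and `.iloc` (position) accessors. Here is a minimal sketch that rebuilds the same series as above; the variable names `by_label` and `by_position` are just for illustration.

```python
import pandas as pd

s2 = pd.Series(
    ['a', 'b', 'c', 'd'],
    index=['element 1', 'element 2', 'element 3', 'element 4']
)

# .loc looks an element up by its index label
by_label = s2.loc['element 2']

# .iloc looks an element up by its integer position (starting at 0)
by_position = s2.iloc[1]

print(by_label, by_position)
```

Both lookups return the same element here because 'element 2' is the label of the item in position 1.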

### A useful function: value_counts

You will likely find you want to count the number of times each item appears in a Pandas Series. Here's a built-in way to do it:

In [9]:
s3 = pd.Series((list('abc') * 3) + ['d', 'e', 'f'])
s3

0     a
1     b
2     c
3     a
4     b
5     c
6     a
7     b
8     c
9     d
10    e
11    f
dtype: object

In [10]:
s3.value_counts()

a    3
b    3
c    3
f    1
e    1
d    1
dtype: int64
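
Two optional arguments of `value_counts` are often handy: `dropna` and `normalize`. The sketch below uses a small made-up series (not the PCard data) that contains one missing value.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; one value is missing
s = pd.Series(['a', 'b', 'a', np.nan, 'a'])

# By default, missing values are left out of the counts
counts = s.value_counts()

# dropna=False adds a row that counts the missing values
counts_with_nan = s.value_counts(dropna=False)

# normalize=True returns proportions instead of raw counts
shares = s.value_counts(normalize=True)

print(counts)
print(counts_with_nan)
print(shares)
```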

## The `DataFrame` Data Structure

You can think of a DataFrame as a table of data, with rows and columns. Alternatively, you can think of a DataFrame as a collection of Series objects, _each of which share the same row index_.
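
To make the "collection of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series with the same index. The data and the names (`letters`, `numbers`, `idx`) are made up for illustration.

```python
import pandas as pd

idx = ['row 1', 'row 2', 'row 3']
letters = pd.Series(['a', 'b', 'c'], index=idx)
numbers = pd.Series([10, 20, 30], index=idx)

# Each dictionary key becomes a column name; the Series supply the rows
df = pd.DataFrame({'letters': letters, 'numbers': numbers})

# Selecting a single column gives back a Series with the same row index
col = df['numbers']

print(df)
print(type(col))
```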

I will now show you how to create a DataFrame from a CSV file.

In [12]:
df2012 = pd.read_csv('PCard_FY2012.csv')

Let's look at the first 5 rows of the DataFrame using the head method.

In [13]:
df2012.head()

Unnamed: 0,Agency Number,Agency Name,Cardholder Last Name,Cardholder First Initial,Description,Amount,Vendor,Transaction Date,Posted Date,Merchant Category Code (MCC)
0,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$60.07,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS
1,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$41.29,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS
2,1000,OKLAHOMA STATE UNIVERSITY,BEROUSEK,M,GENERAL PURCHASE,($180.00),A.C.E. SUPPLY & SERVICE,30-Jun-11,1-Jul-11,STATIONERY OFFICE SUPPLIES PRINTING AND WRIT...
3,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$9.36,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES
4,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$16.86,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES


This is looking pretty good. Let's get some basic stats about our DataFrame:

In [14]:
# This gives (number of rows, number of columns)
df2012.shape

(442184, 10)

In [15]:
df2012.columns

Index(['Agency Number', 'Agency Name', 'Cardholder Last Name',
       'Cardholder First Initial', 'Description', 'Amount', 'Vendor',
       'Transaction Date', 'Posted Date', 'Merchant Category Code (MCC)'],
      dtype='object')

In [16]:
df2012.dtypes

Agency Number                    int64
Agency Name                     object
Cardholder Last Name            object
Cardholder First Initial        object
Description                     object
Amount                          object
Vendor                          object
Transaction Date                object
Posted Date                     object
Merchant Category Code (MCC)    object
dtype: object

### What's the actual data type??!!

Consider the Description field in the data. A number of you have had issues removing leading and trailing spaces. That's because some of the values in the Description field are blank, and Pandas imported those blank values as floating-point numbers (`NaN`) instead of empty strings.

The output of dtypes above only says `object`, so how can we figure out what data types are actually stored? Here's a little code snippet to help:

In [26]:
# Take the Description column and apply the type function. 
# Then use value_counts to see counts of the different types.
df2012.Description.apply(type).value_counts()

<class 'str'>      442147
<class 'float'>        37
Name: Description, dtype: int64

The above output shows that there are 37 problematic values in the Description column: they are floats (the `NaN` values) rather than strings.
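
One possible way to deal with such values is to replace them with empty strings before stripping spaces. This assumes an empty string is an acceptable stand-in for a blank description. The sketch uses a tiny made-up DataFrame called `toy`, not the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Description column; the middle value is missing
toy = pd.DataFrame({'Description': ['  GENERAL PURCHASE ', np.nan, 'AIR TRAVEL  ']})

# Count the missing values
n_missing = toy['Description'].isna().sum()

# Replace missing values with empty strings, then strip leading/trailing spaces
cleaned = toy['Description'].fillna('').str.strip()

print(n_missing)
print(cleaned.apply(type).value_counts())
```

After `fillna('')`, every value is a string, so string operations behave the same way on every row.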

### Renaming Columns

What if we want to rename the columns? You can rename the columns as follows:

In [33]:
newColNames = {
    'Agency Number': 'AgencyNum', 
    'Agency Name': 'AgencyName',
    'Cardholder Last Name': 'LastName'}

df2012.rename(columns = newColNames).head()

Unnamed: 0,AgencyNum,AgencyName,LastName,Cardholder First Initial,Description,Amount,Vendor,Transaction Date,Posted Date,Merchant Category Code (MCC),PyType
0,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$60.07,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS,<class 'str'>
1,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$41.29,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS,<class 'str'>
2,1000,OKLAHOMA STATE UNIVERSITY,BEROUSEK,M,GENERAL PURCHASE,($180.00),A.C.E. SUPPLY & SERVICE,30-Jun-11,1-Jul-11,STATIONERY OFFICE SUPPLIES PRINTING AND WRIT...,<class 'str'>
3,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$9.36,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES,<class 'str'>
4,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$16.86,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES,<class 'str'>


Ain't life great? Let's take a look at our DataFrame again.

In [34]:
df2012.head()

Unnamed: 0,Agency Number,Agency Name,Cardholder Last Name,Cardholder First Initial,Description,Amount,Vendor,Transaction Date,Posted Date,Merchant Category Code (MCC),PyType
0,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$60.07,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS,<class 'str'>
1,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$41.29,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS,<class 'str'>
2,1000,OKLAHOMA STATE UNIVERSITY,BEROUSEK,M,GENERAL PURCHASE,($180.00),A.C.E. SUPPLY & SERVICE,30-Jun-11,1-Jul-11,STATIONERY OFFICE SUPPLIES PRINTING AND WRIT...,<class 'str'>
3,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$9.36,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES,<class 'str'>
4,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$16.86,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES,<class 'str'>


WTF? Why didn't the new column names stick? The reason is that the rename method returned a new DataFrame; it didn't make the changes "in place". To make the changes permanent, use one of the following two commands.

In [35]:
# The following are equivalent. Only run one of these!
df2012.rename(columns = newColNames, inplace = True)
#df2012 = df2012.rename(columns = newColNames)

In [36]:
df2012.head()

Unnamed: 0,AgencyNum,AgencyName,LastName,Cardholder First Initial,Description,Amount,Vendor,Transaction Date,Posted Date,Merchant Category Code (MCC),PyType
0,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$60.07,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS,<class 'str'>
1,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$41.29,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS,<class 'str'>
2,1000,OKLAHOMA STATE UNIVERSITY,BEROUSEK,M,GENERAL PURCHASE,($180.00),A.C.E. SUPPLY & SERVICE,30-Jun-11,1-Jul-11,STATIONERY OFFICE SUPPLIES PRINTING AND WRIT...,<class 'str'>
3,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$9.36,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES,<class 'str'>
4,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$16.86,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES,<class 'str'>
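
If you want to convince yourself of the difference between returning a new DataFrame and changing the original, here is a small sketch on a made-up one-row DataFrame (`toy` and `new_names` are illustrative names).

```python
import pandas as pd

toy = pd.DataFrame({
    'Agency Number': [1000],
    'Agency Name': ['OKLAHOMA STATE UNIVERSITY'],
})
new_names = {'Agency Number': 'AgencyNum', 'Agency Name': 'AgencyName'}

# rename returns a new DataFrame; toy itself is left untouched
renamed = toy.rename(columns=new_names)
original_cols = list(toy.columns)
renamed_cols = list(renamed.columns)

# Assigning the result back to the same name keeps the new column names
toy = toy.rename(columns=new_names)

print(original_cols)
print(renamed_cols)
print(list(toy.columns))
```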


A better approach is to rename the columns when you import the file. Like this:

In [40]:
# Note: if you don't say header = 0, the file's original header row will be imported as the first row of your dataset!

df2012 = pd.read_csv(
    'PCard_FY2012.csv', 
    header = 0,
    names = [
        'AgencyNum', 'AgencyName', 
        'LastName', 'FirstInit',
        'Description','Amount','Vendor',
        'TransDate','PostDate',
        'MCC']
)

In [39]:
df2012.head()

Unnamed: 0,AgencyNum,AgencyName,LastName,FirstInit,Description,Amount,Vendor,TransDate,PostDate,MCC
0,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$60.07,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS
1,1000,OKLAHOMA STATE UNIVERSITY,BELL,D,GENERAL PURCHASE,$41.29,WM SUPERCENTER,30-Jun-11,1-Jul-11,GROCERY STORES AND SUPERMARKETS
2,1000,OKLAHOMA STATE UNIVERSITY,BEROUSEK,M,GENERAL PURCHASE,($180.00),A.C.E. SUPPLY & SERVICE,30-Jun-11,1-Jul-11,STATIONERY OFFICE SUPPLIES PRINTING AND WRIT...
3,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$9.36,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES
4,1000,OKLAHOMA STATE UNIVERSITY,FOCHT,R,GENERAL PURCHASE,$16.86,NAPA AUTO PARTS,29-Jun-11,1-Jul-11,AUTOMOTIVE PARTS AND ACCESSORIES STORES


In [41]:
df2012.shape

(442184, 10)
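
To see what `header = 0` does, the sketch below reads the same small CSV text twice, once with it and once without. The CSV text is made up and has only two columns; `io.StringIO` is used so that no file is needed.

```python
import io
import pandas as pd

# Made-up CSV text with a header row and two data rows
csv_text = "Agency Number,Amount\n1000,$60.07\n1000,$41.29\n"
new_names = ['AgencyNum', 'Amount']

# header=0 tells pandas that row 0 is a header, which names then replaces
with_header = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names)

# Without header=0, the old header row is treated as a data row
without_header = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(with_header.shape)
print(without_header.shape)
print(without_header.head())
```

In the second DataFrame, the old column names show up as the first data row, so it has one extra row.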

### Filtering

Sometimes, you only want to work with a subset of your DataFrame. There are many ways to filter a DataFrame and I will only show you a few.

Let's say we're only interested in cardholders whose last name is Bell.

In [None]:
# The following are equivalent:

df2012[df2012['LastName'] == 'BELL']
#df2012[df2012.LastName == 'BELL']
#df2012[df2012.loc[:,'LastName'] == 'BELL']
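
It can help to see what is happening inside the square brackets. The comparison produces a Series of True/False values (often called a mask), and the DataFrame keeps only the rows where the mask is True. Here is a sketch on a tiny made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({
    'LastName': ['BELL', 'FOCHT', 'BELL'],
    'Vendor': ['WM SUPERCENTER', 'NAPA AUTO PARTS', 'WM SUPERCENTER'],
})

# The comparison gives one True/False value per row
mask = toy['LastName'] == 'BELL'

# Indexing with the mask keeps only the rows where it is True
bells = toy[mask]

print(mask)
print(bells)
```

Notice that the filtered rows keep their original index labels.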

#### Multiple filters

What if we want to use multiple filters? Use the following tips:
* Each condition *MUST* be grouped in parentheses
* Use the operators & for and, | for or, and ~ for not

In the following, note that the date field hasn't been converted to a Python date, so it is compared as a string.

In [None]:
df2012[(df2012['LastName'] == 'BELL') & (df2012['TransDate'] == '30-Jun-11')]
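
Here is a sketch of all three operators on a tiny made-up DataFrame. The last part shows one possible way to convert the date strings with `pd.to_datetime`. The format string `'%d-%b-%y'` is an assumption based on the sample rows shown earlier (for example, 30-Jun-11).

```python
import pandas as pd

toy = pd.DataFrame({
    'LastName': ['BELL', 'BELL', 'FOCHT'],
    'TransDate': ['30-Jun-11', '29-Jun-11', '29-Jun-11'],
})

# & means "and": both conditions must be True
both = toy[(toy['LastName'] == 'BELL') & (toy['TransDate'] == '30-Jun-11')]

# | means "or": at least one condition must be True
either = toy[(toy['LastName'] == 'BELL') | (toy['TransDate'] == '29-Jun-11')]

# ~ means "not": it flips True and False
not_bell = toy[~(toy['LastName'] == 'BELL')]

# Convert the strings to real dates so that date comparisons work
toy['TransDate'] = pd.to_datetime(toy['TransDate'], format='%d-%b-%y')
on_or_after = toy[toy['TransDate'] >= pd.Timestamp('2011-06-30')]

print(both)
print(either)
print(not_bell)
print(on_or_after)
```

Once the column holds real dates, you can filter with comparisons such as >= instead of matching exact strings.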