# Basic Data Processing with Pandas

From Coursera: Intro to Data Science, Week 2  
Patricia Schuster, University of Michigan  
March 2017

# The `DataFrame` data structure

## Basics

The DataFrame is the heart of the pandas library. It is conceptually a 2-d Series object: there is an index and multiple columns of content, with each column having a label. The distinction between columns and rows is really only conceptual, and you can think of a DataFrame as a two-axis labeled array.

You can create a DataFrame in many ways. Try it out.

**Note: Series vs. DataFrame**
What is the difference between a pandas Series and a pandas DataFrame? Detailed documentation here: <https://pandas.pydata.org/pandas-docs/stable/dsintro.html>

* Series: a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. 
* DataFrame: a two-dimensional labeled data structure with columns of potentially different types. Like a spreadsheet or SQL table, or a dict of Series objects.
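
The "dict of Series objects" description can be made concrete with a small sketch. The index labels `'a'`, `'b'`, `'c'` and the variable names are hypothetical, chosen only for illustration; the values are the ones used in the purchase orders below.

```python
import pandas as pd

# A Series: one-dimensional, one label per value.
# The labels 'a', 'b', 'c' are placeholders for this illustration.
cost = pd.Series([22.50, 2.50, 5.00], index=['a', 'b', 'c'], name='Cost')
name = pd.Series(['Chris', 'Kevin', 'Vinod'], index=['a', 'b', 'c'])

# A DataFrame built from a dict of Series: each key becomes a column,
# and the Series are aligned on their shared index.
orders = pd.DataFrame({'Name': name, 'Cost': cost})

print(cost.ndim, orders.ndim)  # number of axes of each object
print(orders.dtypes)           # each column keeps its own dtype
```

Each column of `orders` is itself a Series, which is why a single DataFrame can hold strings in one column and floats in another.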

Create three purchase orders as Series objects. Each has the name of a customer, a string that describes the item being purchased, and a price.

In [1]:
import pandas as pd

In [28]:
purchase_1 = pd.Series({'Name' : 'Chris',
                        'Item Purchased' : 'Dog Food',
                        'Cost' : 22.50})

purchase_2 = pd.Series({'Name' : 'Kevin',
                        'Item Purchased' : 'Kitty Litter',
                        'Cost' : 2.50})

purchase_3 = pd.Series({'Name' : 'Vinod',
                        'Item Purchased' : 'Bird Seed',
                        'Cost' : 5.00})

In [5]:
print(purchase_1)
print(type(purchase_1))

Cost                  22.5
Item Purchased    Dog Food
Name                 Chris
dtype: object
<class 'pandas.core.series.Series'>


Store these purchases into a DataFrame that includes the store where each purchase was made.

In [29]:
df = pd.DataFrame([purchase_1,purchase_2,purchase_3], index = ['Store 1','Store 1','Store 2'])

Pandas prints it out as a beautiful table.

In [7]:
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin
Store 2,5.0,Bird Seed,Vinod


In this DataFrame, the row labels form the index. `iloc` (by integer position) and `loc` (by label) are used for row selection. (Rows and columns can be swapped using `df.T`.)

The indexing operator applied directly to the DataFrame, as in `df[...]`, is reserved for columns.
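
As a minimal sketch of the three access paths just described, the following rebuilds the same purchase data from a dict of lists (an assumption made only to keep the example self-contained) and selects by position, by label, and by column name.

```python
import pandas as pd

# Same purchase data as above, rebuilt here so the example runs on its own.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

first_row = df.iloc[0]       # by integer position: the first row
store_2 = df.loc['Store 2']  # by label: the row labelled 'Store 2'
cost_column = df['Cost']     # plain [] selects a column by name

print(first_row['Name'], store_2['Name'])
print(cost_column)
```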

Use `loc` to pull entries from each store separately.

In [8]:
df.loc['Store 1']

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin


In [9]:
df.loc['Store 2']

Cost                      5
Item Purchased    Bird Seed
Name                  Vinod
Name: Store 2, dtype: object

An important distinction here is the `type` of what is returned to us in these last two examples. In the first query for `Store 1`, there are multiple entries, so pandas returns a DataFrame. In the query for `Store 2`, there is only one entry, so pandas returns a Series.

In [13]:
print(type(df.loc['Store 1']))
print(type(df.loc['Store 2']))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


The indices along either axis, horizontal or vertical, can be non-unique.
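
Because labels can repeat, it can help to check uniqueness explicitly and to ask for a DataFrame regardless of how many rows match. The sketch below uses `Index.is_unique` and passes a list of labels to `loc`; the variable names are hypothetical.

```python
import pandas as pd

# Same purchase data as above, rebuilt here so the example runs on its own.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

print(df.index.is_unique)    # False: 'Store 1' appears twice
print(df.columns.is_unique)  # True: every column name is distinct

one_row = df.loc['Store 2']         # a single label that matches one row: Series
always_frame = df.loc[['Store 2']]  # a list of labels: DataFrame

print(type(one_row))
print(type(always_frame))
```

Passing a list of labels is one way to keep the return type predictable when you do not know in advance whether a label is duplicated.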

How would I get a list of all items which had been purchased, regardless of where or by whom they were purchased?

In [10]:
df['Item Purchased']

Store 1        Dog Food
Store 1    Kitty Litter
Store 2       Bird Seed
Name: Item Purchased, dtype: object

One of the most powerful capabilities of pandas DataFrames is that you can quickly select data based on multiple axes. 

In [11]:
df.loc['Store 1','Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

## Some more on indexing

As briefly explained above, the row labels in a DataFrame form the index, and `iloc` and `loc` select rows first (an optional second argument selects columns).

The indexing operator applied directly to the DataFrame, as in `df[...]`, is reserved for columns. Try it out.

In [12]:
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin
Store 2,5.0,Bird Seed,Vinod


Start by exploring `loc`:

In [13]:
df.loc['Store 1']

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin


In [None]:
df.loc['Cost'] # Raises a KeyError: 'Cost' is a column name, not a row label

Now try direct indexing.

In [15]:
df['Cost']

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

In [None]:
df['Store 1'] # Raises a KeyError: 'Store 1' is a row label, not a column name

You can also chain indexing together.

In [16]:
df.loc['Store 1']['Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

In [17]:
df['Cost'].loc['Store 1']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

Chaining comes with costs and should be avoided where possible. A chained expression runs two separate indexing operations, and pandas may return a copy of the data instead of a view from the first one. For selecting data this may not be a problem, especially if the DataFrame is small. For changing data, however, this is a major point of distinction: an assignment made through a chained expression may modify a temporary copy and leave the original DataFrame unchanged.
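
A minimal sketch of the safer pattern for changing data: pass the row label and the column name to a single `loc` call. The new value `6.00` is a made-up number used only for illustration, and the chained form is left commented out because its effect is not reliable.

```python
import pandas as pd

# Same purchase data as above, rebuilt here so the example runs on its own.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# Chained form (avoid for assignment): the first indexing step may hand back
# a copy, so the write may never reach df.
# df.loc['Store 2']['Cost'] = 6.00

# Single-call form: row label and column name go to one .loc call,
# so the assignment targets df itself.
df.loc['Store 2', 'Cost'] = 6.00  # 6.00 is a hypothetical new price

print(df)
```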

## `.loc` slicing

As we saw, `.loc` selects rows by label, and it can take two arguments: the row label(s) and the column name(s). `.loc` also supports slicing.
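
As a sketch of label-based slicing, the example below slices the column axis. The column order is fixed explicitly with `columns=` so that the slice is well defined; that ordering is an assumption of this example, not something the DataFrame above guarantees.

```python
import pandas as pd

# Same purchase data as above; the column order is set explicitly
# so the label slice below has a well-defined start and end.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'],
                  columns=['Name', 'Item Purchased', 'Cost'])

# ':' alone means "everything along this axis".
# A label slice includes BOTH endpoints, unlike a Python list slice.
name_to_item = df.loc[:, 'Name':'Item Purchased']

# Row label plus a column slice in the same call.
store_1_tail = df.loc['Store 1', 'Item Purchased':'Cost']

print(name_to_item)
print(store_1_tail)
```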

Ask for all of the name and cost values for all stores:

In [18]:
df.loc[:, ['Name','Cost']]  # use a list: sets are unordered, and newer pandas rejects them as indexers

Unnamed: 0,Name,Cost
Store 1,Chris,22.5
Store 1,Kevin,2.5
Store 2,Vinod,5.0


In [19]:
df.loc['Store 1', ['Name','Cost']]  # list of column names, in the order they should appear

Unnamed: 0,Name,Cost
Store 1,Chris,22.5
Store 1,Kevin,2.5


## Dropping data

It's easy to delete data in Series and DataFrames using the `drop` method. `drop` takes the index label (row label) to remove. It doesn't change the DataFrame by default, but instead returns a copy with those rows removed.

In [20]:
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin
Store 2,5.0,Bird Seed,Vinod


In [21]:
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

Unnamed: 0,Cost,Item Purchased,Name
Store 2,5.0,Bird Seed,Vinod


`drop` has two interesting optional parameters. The first is called `inplace`, and if set to `True`, the DataFrame will be updated in place instead of a copy being returned. The second is `axis`, which says whether the label refers to a row or a column. By default the axis is 0, indicating the row axis, but you can change it to 1 if you want to drop a column.
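
The two parameters can be sketched as follows. The variable names `without_name`, `without_store_1`, and `scratch` are hypothetical, and `scratch` is a copy so that the in-place call does not disturb `df`.

```python
import pandas as pd

# Same purchase data as above, rebuilt here so the example runs on its own.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# axis=1 drops a column; a new DataFrame is returned and df is unchanged.
without_name = df.drop('Name', axis=1)

# axis=0 (the default) drops rows; every row labelled 'Store 1' is removed.
without_store_1 = df.drop('Store 1')

# inplace=True modifies the object itself and returns None.
scratch = df.copy()
result = scratch.drop('Cost', axis=1, inplace=True)

print(without_name.columns.tolist())
print(without_store_1.index.tolist())
print(result)
```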

There is a second way to drop a column, and that is through the use of `del`. 

In [38]:
copy_df.drop?

In [39]:
del copy_df['Name']
copy_df

Unnamed: 0,Cost,Item Purchased
Store 2,5.0,Bird Seed


Using `del` takes immediate effect: it modifies the DataFrame in place and does not return anything.

## Adding data

Adding a new column is as easy as assigning a value to it. Assigning a single (scalar) value broadcasts that value to every row of the new column immediately.
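
A short sketch of both kinds of assignment: a scalar that is broadcast to every row, and a column computed from an existing one. The column name `'Discounted Cost'` and the factor `0.8` are assumptions made for this illustration.

```python
import pandas as pd

# Same purchase data as above, rebuilt here so the example runs on its own.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# A scalar is broadcast: every row of the new column gets the same value.
df['Location'] = None

# A Series computed from an existing column supplies one value per row.
# 'Discounted Cost' is a hypothetical column name.
df['Discounted Cost'] = df['Cost'] * 0.8

print(df)
```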

In [41]:
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin
Store 2,5.0,Bird Seed,Vinod


In [42]:
df['Location'] = None
df

Unnamed: 0,Cost,Item Purchased,Name,Location
Store 1,22.5,Dog Food,Chris,
Store 1,2.5,Kitty Litter,Kevin,
Store 2,5.0,Bird Seed,Vinod,


Exercise: For the purchase records from the pet store, how would you update the DataFrame, applying a discount of 20% across all the values in the 'Cost' column?

In [44]:
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])

# Your answer here
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevyn
Store 2,5.0,Bird Seed,Vinod


In [48]:
df['Cost'] *= 0.8
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,18.0,Dog Food,Chris
Store 1,2.0,Kitty Litter,Kevyn
Store 2,4.0,Bird Seed,Vinod


# Modifying, copying data

Warning: Be careful about whether you are working with a copy of a DataFrame or merely a view of the same underlying data. Here is an example where it seems that we are modifying data in a new object, but really we are also modifying the original DataFrame.

In [30]:
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,22.5,Dog Food,Chris
Store 1,2.5,Kitty Litter,Kevin
Store 2,5.0,Bird Seed,Vinod


Store the `Cost` column in a separate variable. (A single column is a Series, not a DataFrame.)

In [31]:
cost = df['Cost']
print(cost)

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64


In [32]:
cost += 2
print(cost)

Store 1    24.5
Store 1     4.5
Store 2     7.0
Name: Cost, dtype: float64


In [33]:
df

Unnamed: 0,Cost,Item Purchased,Name
Store 1,24.5,Dog Food,Chris
Store 1,4.5,Kitty Litter,Kevin
Store 2,7.0,Bird Seed,Vinod


Although we modified `cost`, the values in the `Cost` column of the original DataFrame have risen as well.

If we don't want to modify the original DataFrame, then we should have used the `copy` method when we produced `cost`.
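
A minimal sketch of the `copy` approach follows. Whether the uncopied version writes through to the original depends on the pandas version and its Copy-on-Write setting, so an explicit `copy` is the way to state the intent unambiguously.

```python
import pandas as pd

# Same purchase data as above, rebuilt here so the example runs on its own.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

# .copy() gives an independent Series; changes to it do not touch df.
cost = df['Cost'].copy()
cost += 2

print(cost)        # the copy carries the increased values
print(df['Cost'])  # the original column is unchanged
```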

# Key Concepts

Just remember that the rows and columns are really just there for our benefit. Underneath, this is just a two-axis labeled array, and transposing the axes is easy.
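
As a small illustration of transposing, the sketch below swaps the axes with `.T`. The column order is fixed with `columns=` only so the printed labels are predictable, and `df_t` is a hypothetical variable name.

```python
import pandas as pd

# Same purchase data as above, with an explicit column order.
df = pd.DataFrame({'Name': ['Chris', 'Kevin', 'Vinod'],
                   'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
                   'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'],
                  columns=['Name', 'Item Purchased', 'Cost'])

df_t = df.T  # former column names become row labels, and vice versa

print(df_t.index.tolist())
print(df_t.columns.tolist())

# After transposing, 'Cost' is a row label, so .loc can select it.
print(df_t.loc['Cost'])
```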

Avoid chained indexing, especially when assigning values.