Pandas is probably the most widely used Python package among data scientists. Built on top of NumPy, it is essential for data manipulation, organization, and modeling. Here we'll introduce some of its core functionality as well as its primary data structure: the **_data frame_**.

In [1]:
import pandas as pd
import numpy as np

## The Data Frame

The *data frame* is like a two-dimensional NumPy array with a few additional features, such as column names and row indexing. It is probably the primary way data scientists handle data. You can create a data frame in many different ways: from CSV files, by querying databases, or explicitly in code. For your first data frame, let's create a 2-dimensional NumPy array. Then, to create a data frame, call the `pd.DataFrame()` function and pass in the NumPy array:

In [2]:
my_array = np.array([['Montgomery','Yellohammer state',52423],
                     ['Sacramento','Golden state',163707],
                     ['Oklahoma City','Sooner state',69960 ]])
df = pd.DataFrame(my_array)
df

,0,1,2
0,Montgomery,Yellohammer state,52423
1,Sacramento,Golden state,163707
2,Oklahoma City,Sooner state,69960
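
One thing worth checking here: a NumPy array holds a single data type, so mixing strings and integers in `np.array()` stores every value as a string, including the areas. The sketch below inspects the array's dtype and converts the third column back to integers with `astype(int)`. The names `df_check` and `area_as_int` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same mixed-type array as above: strings and integers together.
my_array = np.array([['Montgomery', 'Yellohammer state', 52423],
                     ['Sacramento', 'Golden state', 163707],
                     ['Oklahoma City', 'Sooner state', 69960]])

# NumPy picks one dtype for the whole array, here a Unicode string type.
print(my_array.dtype)

# Illustrative names: build the frame, then convert column 2 to integers.
df_check = pd.DataFrame(my_array)
area_as_int = df_check[2].astype(int)
print(area_as_int.sum())
```

If you need numeric columns from the start, passing a list of lists or a dictionary to `pd.DataFrame()` instead of a NumPy array lets each column keep its own type.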


If you're familiar with Excel, much of Pandas may be familiar to you. Data frames are organized into rows and columns that are nameable. Columns are labeled with column names, rows with an index number (starting with zero by default). You can set both column names and indexes explicitly during the creation of the data frame or after the fact. Let's set both for `df` from above.

In [3]:
df.columns = ['Capital','Nickname','Area']
df.index = ['Alabama','California','Oklahoma']
df

,Capital,Nickname,Area
Alabama,Montgomery,Yellohammer state,52423
California,Sacramento,Golden state,163707
Oklahoma,Oklahoma City,Sooner state,69960


You can also set the column and index names through the `columns=` and `index=` keyword arguments when you call the `pd.DataFrame()` function to initially construct the data frame.

In [4]:
df2 = pd.DataFrame(
    my_array,
    columns=['Capital','Nickname','Area'],
    index=['Alabama','California','Oklahoma'])
df2

,Capital,Nickname,Area
Alabama,Montgomery,Yellohammer state,52423
California,Sacramento,Golden state,163707
Oklahoma,Oklahoma City,Sooner state,69960


Whichever method you use, you now have a data frame with labeled rows and columns. Labels let you refer to rows and columns by name rather than by position, which makes your code more natural to write and simpler to read.
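
As a minimal sketch of what labels buy you, the example below looks up values by row and column name with `.loc`. It rebuilds the same table from a list of lists so that `Area` stays numeric; the variable names `states`, `capital`, `oklahoma_row`, and `largest` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Rebuild the labeled table from a list of lists (keeps Area as integers).
states = pd.DataFrame(
    [['Montgomery', 'Yellohammer state', 52423],
     ['Sacramento', 'Golden state', 163707],
     ['Oklahoma City', 'Sooner state', 69960]],
    columns=['Capital', 'Nickname', 'Area'],
    index=['Alabama', 'California', 'Oklahoma'])

# Single value: row label first, then column label.
capital = states.loc['California', 'Capital']

# Whole row, returned as a Series indexed by column name.
oklahoma_row = states.loc['Oklahoma']

# Row label of the largest value in the Area column.
largest = states['Area'].idxmax()

print(capital, largest)
```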


<div class="note">NOTE: you're probably used to seeing a space around <code>=</code> when used for assignment and <code>==</code>, which is used for comparison. In Python, the custom is to <a href="https://www.python.org/dev/peps/pep-0008/#other-recommendations">omit spaces</a> around <code>=</code> with keyword arguments to improve readability and make it easy to distinguish keyword arguments from variable assignments. </div>

#### Adding More Data
To show what data frames can really do, we're going to need something a little bigger. Let's assemble a data frame with named columns from lists. You can create an empty data frame by calling the `pd.DataFrame()` function and passing the row labels you'd like to use through `index=`, then add columns using `df['COLUMN_NAME'] = [LIST_OF_VALUES]`.

In [5]:
# This list will become our row names.
names = ['George',
         'John',
         'Thomas',
         'James',
         'Andrew',
         'Martin',
         'William',
         'Zachary',
         'Millard',
         'Franklin']

# Create an empty data frame with named rows.
purchases = pd.DataFrame(index=names)

# Add our columns to the data frame one at a time.
purchases['country'] = ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US']
purchases['ad_views'] = [16, 42, 32, 13, 63, 19, 65, 23, 16, 77]
purchases['items_purchased'] = [2, 1, 0, 8, 0, 5, 7, 3, 0, 5]
purchases 

,country,ad_views,items_purchased
George,US,16,2
John,CAN,42,1
Thomas,CAN,32,0
James,US,13,8
Andrew,CAN,63,0
Martin,US,19,5
William,US,65,7
Zachary,US,23,3
Millard,CAN,16,0
Franklin,US,77,5
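
Adding columns one at a time is only one option. As an alternative sketch, the same table can be built in a single call by passing a dictionary that maps column names to lists, along with `index=`. The name `purchases_alt` is illustrative, chosen so it does not overwrite `purchases`.

```python
import pandas as pd

names = ['George', 'John', 'Thomas', 'James', 'Andrew',
         'Martin', 'William', 'Zachary', 'Millard', 'Franklin']

# Each dictionary key becomes a column name; each list becomes that column.
purchases_alt = pd.DataFrame(
    {
        'country': ['US', 'CAN', 'CAN', 'US', 'CAN', 'US', 'US', 'US', 'CAN', 'US'],
        'ad_views': [16, 42, 32, 13, 63, 19, 65, 23, 16, 77],
        'items_purchased': [2, 1, 0, 8, 0, 5, 7, 3, 0, 5],
    },
    index=names)

print(purchases_alt.shape)
```

Every list must have the same length as the index, otherwise pandas raises an error.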


Let's say this is the purchase and browsing history for several users of an ecommerce website for a given year. `ad_views` is the number of ads they've viewed on the site and `items_purchased` is the number of items they've bought that year.

Now we have a data frame we can do something with. First note that you can pull out a column as a series using *either* dot notation *or* bracket notation: `df.column_name` or `df['column_name']` both work. So `purchases['country']` returns the country of each user who visited the ecommerce website, as does `purchases.country`. (The user names themselves are the row labels, available as `purchases.index`.) **_Bracket notation is generally preferred and we'll use bracket notation here_**.
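
One reason bracket notation is the safer default: dot notation only works when the column name is a valid Python identifier and does not collide with an existing data frame attribute. The sketch below uses a small hypothetical frame, `visits`, with a column name containing a space and another named `count`, to show the difference.

```python
import pandas as pd

# Hypothetical frame with two awkward column names.
visits = pd.DataFrame(
    {'country': ['US', 'CAN'], 'ad views': [16, 42], 'count': [2, 1]},
    index=['George', 'John'])

# For a simple name, both notations return the same Series.
same = visits['country'].equals(visits.country)

# A name with a space can only be reached with brackets.
ad_views = visits['ad views']

# 'count' is also a DataFrame method, so dot notation returns the method,
# while bracket notation returns the column.
count_column = visits['count']
count_method = visits.count

print(same, list(count_column), callable(count_method))
```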

Pandas also makes it very easy to create a new column out of existing data. Let's say we want to create a column of the average number of items purchased per ad view, and call the column `items_purch_per_ad`. We can do that with this one-liner:

In [7]:
purchases['items_purch_per_ad'] = purchases['items_purchased'] / purchases['ad_views']
purchases

,country,ad_views,items_purchased,items_purch_per_ad
George,US,16,2,0.125
John,CAN,42,1,0.02381
Thomas,CAN,32,0,0.0
James,US,13,8,0.615385
Andrew,CAN,63,0,0.0
Martin,US,19,5,0.263158
William,US,65,7,0.107692
Zachary,US,23,3,0.130435
Millard,CAN,16,0,0.0
Franklin,US,77,5,0.064935


If we just want to see those values and don't need to store them as a new column in our data frame, we can run the same expression without assigning it to `purchases['items_purch_per_ad']`. It returns a series of labeled values, giving the name and the purchases per ad view for each user.

In [8]:
purchases['items_purchased'] / purchases['ad_views']

George      0.125000
John        0.023810
Thomas      0.000000
James       0.615385
Andrew      0.000000
Martin      0.263158
William     0.107692
Zachary     0.130435
Millard     0.000000
Franklin    0.064935
dtype: float64
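
Because the result is a series that carries the row labels, you can keep working with it by name. The sketch below uses a three-user subset of the data above and assigns the result to an illustrative variable, `ratio`, then looks up one user, finds the user with the highest ratio, and sorts the values.

```python
import pandas as pd

# Three-user subset of the purchases data, for brevity.
purchases = pd.DataFrame(
    {'ad_views': [16, 13, 63], 'items_purchased': [2, 8, 0]},
    index=['George', 'James', 'Andrew'])

# Element-wise division aligned on the row labels; the result is a Series.
ratio = purchases['items_purchased'] / purchases['ad_views']

james_ratio = ratio['James']                   # look up one user by label
top_user = ratio.idxmax()                      # label of the largest value
ranked = ratio.sort_values(ascending=False)    # highest ratio first

print(top_user)
print(ranked)
```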