# Introduction to Pandas
Lesson Goals

    Learn about Pandas data structures and how to extract data from them.
    Convert other Python data structures to Pandas DataFrames.
    Apply Pandas mathematical functions to data frame fields.

Introduction

As a data analyst with Python in your toolkit, one of the libraries you will use most often is Pandas. Pandas makes data wrangling, transformation, and analysis easier and more intuitive. In this lesson, we will learn about Pandas data structures and how to apply some basic math functions to them.

As with NumPy in the previous lesson, you must import Pandas before you can use it. Just as NumPy is typically aliased to np, Pandas is usually aliased to pd. Let's import both so that we can use them in this lesson.

In [1]:
import numpy as np
import pandas as pd

Now that both libraries are imported, we can begin using them.


# Pandas Data Structures

The primary data structures in Pandas are Series and DataFrames. A series is a one-dimensional indexed array whose values can be of any data type. Let's create a series of 10 random numbers by passing a NumPy array to the pd.Series() constructor.

In [2]:
a = pd.Series(np.random.random(10))
print(a)

0    0.859944
1    0.371274
2    0.439095
3    0.335297
4    0.906550
5    0.899059
6    0.715666
7    0.513221
8    0.334326
9    0.648430
dtype: float64


As you can see, this generated an indexed array of random numbers: the left column is the index and the right column holds the values. As with other Python sequences, we can reference the elements of this array by their indexes.

In [3]:
print(a[0])
print(a[5])

0.8599440682303597
0.899059241425277
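
The positions used above work as lookups because the series was created with the default integer index. As a minimal sketch, assuming a hypothetical series `s` with string labels (not part of the lesson's data), `.loc` selects by index label and `.iloc` selects by integer position:

```python
import pandas as pd

# Hypothetical series with explicit string labels instead of the default 0..n-1 index
s = pd.Series([0.25, 0.50, 0.75], index=['low', 'mid', 'high'])

by_label = s.loc['mid']     # select by index label
by_position = s.iloc[0]     # select by integer position
first_two = s.iloc[:2]      # slicing by position returns another Series

print(by_label)
print(by_position)
print(first_two)
```

Being explicit with `.loc` and `.iloc` avoids ambiguity about whether a number refers to a label or a position.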


The other type of data structure, the DataFrame, is a two-dimensional indexed structure where each column can be of a different data type. DataFrames are very similar to spreadsheets and database tables, and they are one of the most useful data structures you will be working with as a data analyst.

DataFrames can be generated using the pd.DataFrame constructor as follows. We are also going to assign a specific name to each column in the data frame by storing a list of column names in a variable named colnames and passing it to the columns argument.

In [4]:
colnames = ['Column1','Column2','Column3','Column4','Column5']
df = pd.DataFrame(np.random.random((10,5)), columns=colnames)
df

,Column1,Column2,Column3,Column4,Column5
0,0.694961,0.105933,0.282105,0.865168,0.755852
1,0.316824,0.596603,0.326177,0.167962,0.412996
2,0.104791,0.497471,0.912999,0.49221,0.994047
3,0.787107,0.305769,0.080767,0.459807,0.190078
4,0.314104,0.195255,0.672712,0.421197,0.301431
5,0.939372,0.884106,0.306418,0.258253,0.954056
6,0.745006,0.890056,0.137691,0.540925,0.686374
7,0.441376,0.362989,0.945984,0.565267,0.033716
8,0.770482,0.277832,0.77934,0.264238,0.794917
9,0.672179,0.764007,0.584587,0.661882,0.380346
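
The data frame above holds only floating-point numbers. To illustrate that each column can hold a different data type, here is a minimal sketch. The data frame `mixed` and its column names are hypothetical and not part of the lesson's data.

```python
import pandas as pd

# Hypothetical rows mixing text, integers, and floats
rows = [['a', 1, 0.5],
        ['b', 2, 1.5],
        ['c', 3, 2.5]]
mixed = pd.DataFrame(rows, columns=['name', 'count', 'ratio'])

print(mixed.shape)    # (number of rows, number of columns)
print(mixed.dtypes)   # one data type per column
```

The `dtypes` attribute is a quick way to check how Pandas interpreted each column.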


We can reference each of the columns in a data frame directly by the column name as follows.

In [5]:
df['Column1']

0    0.694961
1    0.316824
2    0.104791
3    0.787107
4    0.314104
5    0.939372
6    0.745006
7    0.441376
8    0.770482
9    0.672179
Name: Column1, dtype: float64

This returns a series consisting of the values in the first column of our data frame. To extract just the first three columns of our data frame, we need to put the column names in a list inside the square brackets (so there are two sets of square brackets).

In [6]:
df[['Column1','Column2','Column3']]

,Column1,Column2,Column3
0,0.694961,0.105933,0.282105
1,0.316824,0.596603,0.326177
2,0.104791,0.497471,0.912999
3,0.787107,0.305769,0.080767
4,0.314104,0.195255,0.672712
5,0.939372,0.884106,0.306418
6,0.745006,0.890056,0.137691
7,0.441376,0.362989,0.945984
8,0.770482,0.277832,0.77934
9,0.672179,0.764007,0.584587


When we extract more than one column, Pandas returns the result as a data frame, since a series is only one-dimensional.
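
The difference between single and double square brackets can be checked directly. The sketch below rebuilds a data frame shaped like the lesson's `df` (the values are random, so none are shown) and inspects the type of each selection. The variable names `single`, `as_frame`, and `subset` are illustrative only.

```python
import numpy as np
import pandas as pd

colnames = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5']
df = pd.DataFrame(np.random.random((10, 5)), columns=colnames)

single = df['Column1']                           # one column name -> Series
as_frame = df[['Column1']]                       # list with one name -> one-column DataFrame
subset = df[['Column1', 'Column2', 'Column3']]   # list of names -> DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(subset.shape)
```

Note that a list containing a single column name still returns a data frame, which is handy when later code expects two-dimensional input.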



# Converting Other Data Structures to DataFrames

We can also convert data we receive in other Python data structures into data frames so that we can work with them more intuitively. For example, suppose we had a list of prices that houses sold for recently and we wanted to get those into a data frame. We could do that by passing the list of prices to pd.DataFrame.

In [7]:
lst = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]

price_df = pd.DataFrame(lst, columns=['SalePrice'])
price_df

,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000
5,143000
6,307000
7,200000
8,129900
9,118000


The list was converted into a one-column data frame with a column name of SalePrice.

What if we had more than just one list of data? What if we had a list of lists where each sublist in the master list contained information about the sale of a house (the lot area, neighborhood, year built, quality score, and final sale price)? We can pass that list of lists to pd.DataFrame in the same way, and Pandas will create a data frame with one row per sublist and one column for each position in the sublists.

In [8]:
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500],
           [9600, 'Veenker', 1976, 6, 181500],
           [11250, 'CollgCr', 2001, 7, 223500],
           [9550, 'Crawfor', 1915, 7, 140000],
           [14260, 'NoRidge', 2000, 8, 250000],
           [14115, 'Mitchel', 1993, 5, 143000],
           [10084, 'Somerst', 2004, 8, 307000],
           [10382, 'NWAmes', 1973, 7, 200000],
           [6120, 'OldTown', 1931, 7, 129900],
           [7420, 'BrkSide', 1939, 5, 118000]]

colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']
pd.DataFrame(lst_lst, columns=colnames)

,LotSize,Neighborhood,YearBuilt,Quality,SalePrice
0,8450,CollgCr,2003,7,208500
1,9600,Veenker,1976,6,181500
2,11250,CollgCr,2001,7,223500
3,9550,Crawfor,1915,7,140000
4,14260,NoRidge,2000,8,250000
5,14115,Mitchel,1993,5,143000
6,10084,Somerst,2004,8,307000
7,10382,NWAmes,1973,7,200000
8,6120,OldTown,1931,7,129900
9,7420,BrkSide,1939,5,118000
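
If the columns argument is left out, Pandas labels the columns with integer positions instead of names. A short sketch using the first three rows of `lst_lst` shows the default labels and how names can be assigned afterwards. The variable name `raw` is illustrative only.

```python
import pandas as pd

# First three rows of the lesson's list of lists
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500],
           [9600, 'Veenker', 1976, 6, 181500],
           [11250, 'CollgCr', 2001, 7, 223500]]

raw = pd.DataFrame(lst_lst)        # no column names supplied
print(list(raw.columns))           # default integer labels

# Names can be assigned after the data frame has been created
raw.columns = ['LotSize', 'Neighborhood', 'YearBuilt', 'Quality', 'SalePrice']
print(raw)
```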


Lists are not the only data structures that can be converted to a data frame. Data frames can also be created from data stored in a dictionary. Suppose we had a dictionary whose values contained the same information we had in our list of lists, and whose keys were the names of the houses.

In [9]:
house_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Beazley House': [14115, 'Mitchel', 1993, 5, 143000],
              'Dominguez House': [14260, 'NoRidge', 2000, 8, 250000],
              'Hamilton House': [6120, 'OldTown', 1931, 7, 129900],
              'James House': [11250, 'CollgCr', 2001, 7, 223500],
              'Martinez House': [9600, 'Veenker', 1976, 6, 181500],
              'Roberts House': [9550, 'Crawfor', 1915, 7, 140000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500],
              'Snyder House': [10084, 'Somerst', 2004, 8, 307000],
              'Zuckerman House': [10382, 'NWAmes', 1973, 7, 200000]}

If we use the same approach as with the list of lists, Pandas by default treats each dictionary key as a column, so we get one column for each house.

In [10]:
pd.DataFrame(house_dict)

,Baker House,Beazley House,Dominguez House,Hamilton House,James House,Martinez House,Roberts House,Smith House,Snyder House,Zuckerman House
0,7420,14115,14260,6120,11250,9600,9550,8450,10084,10382
1,BrkSide,Mitchel,NoRidge,OldTown,CollgCr,Veenker,Crawfor,CollgCr,Somerst,NWAmes
2,1939,1993,2000,1931,2001,1976,1915,2003,2004,1973
3,5,5,8,7,7,6,7,7,8,7
4,118000,143000,250000,129900,223500,181500,140000,208500,307000,200000


This is not the format we want for our data. Instead, we want each house represented as a row and the attributes of the houses represented as columns. There are (at least) two ways to transform the data frame to the format we want. Both methods below produce a data frame with houses as rows and house attributes as columns.



In [11]:
# Option 1: transpose the result, then assign the column names.
house_df = pd.DataFrame(house_dict).transpose()
house_df.columns = colnames

# Option 2: use the from_dict method with orient='index', then assign the column names.
# This reassigns house_df, so the output below comes from this version.
house_df = pd.DataFrame.from_dict(house_dict, orient='index')
house_df.columns = colnames
house_df

,LotSize,Neighborhood,YearBuilt,Quality,SalePrice
Baker House,7420,BrkSide,1939,5,118000
Beazley House,14115,Mitchel,1993,5,143000
Dominguez House,14260,NoRidge,2000,8,250000
Hamilton House,6120,OldTown,1931,7,129900
James House,11250,CollgCr,2001,7,223500
Martinez House,9600,Veenker,1976,6,181500
Roberts House,9550,Crawfor,1915,7,140000
Smith House,8450,CollgCr,2003,7,208500
Snyder House,10084,Somerst,2004,8,307000
Zuckerman House,10382,NWAmes,1973,7,200000
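
The two approaches display the same table, but they can differ in the data types Pandas assigns to the columns. The sketch below uses two houses from the lesson's dictionary to keep it short, and compares the `dtypes` of each result. It also passes the column names straight to `from_dict` through its `columns` parameter, which is accepted when `orient='index'`. The variable names `direct` and `transposed` are illustrative only.

```python
import pandas as pd

colnames = ['LotSize', 'Neighborhood', 'YearBuilt', 'Quality', 'SalePrice']

# Two houses from the lesson's dictionary
house_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Beazley House': [14115, 'Mitchel', 1993, 5, 143000]}

# from_dict can take the column names directly when orient='index'
direct = pd.DataFrame.from_dict(house_dict, orient='index', columns=colnames)

# The transposed version holds the same values, but every column has the generic object type
transposed = pd.DataFrame(house_dict).transpose()
transposed.columns = colnames

print(direct.dtypes)
print(transposed.dtypes)

# infer_objects() asks Pandas to detect a more specific type for each object column
print(transposed.infer_objects().dtypes)
```

Numeric data types matter once you start doing arithmetic on the columns, so checking `dtypes` after a transpose is a reasonable habit.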


# Applying Mathematical Functions to Data Frames

Like NumPy, Pandas has built-in mathematical functions that you can apply to series and data frames. Let's take a look at some of the basic ones.

In [12]:
# Total price of all houses sold
print(house_df['SalePrice'].sum())

# Average lot size of houses sold
print(house_df['LotSize'].mean())

# The latest year a house in the data set was built
print(house_df['YearBuilt'].max())

# The earliest year a house in the data set was built
print(house_df['YearBuilt'].min())


1901400
10123.1
2004
1915
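
The examples above apply each function to a single column. The same functions can also be applied to several columns at once, in which case Pandas returns one result per column. The sketch below rebuilds the numeric columns of the lesson's housing data in a data frame named `numeric` (an illustrative name) and also shows `median()` and `describe()`, two other built-in functions.

```python
import pandas as pd

# LotSize, YearBuilt, and SalePrice for the ten houses in the lesson
rows = [[8450, 2003, 208500], [9600, 1976, 181500], [11250, 2001, 223500],
        [9550, 1915, 140000], [14260, 2000, 250000], [14115, 1993, 143000],
        [10084, 2004, 307000], [10382, 1973, 200000], [6120, 1931, 129900],
        [7420, 1939, 118000]]
numeric = pd.DataFrame(rows, columns=['LotSize', 'YearBuilt', 'SalePrice'])

col_means = numeric.mean()                      # Series with one mean per column
median_price = numeric['SalePrice'].median()    # middle value of the sale prices
summary = numeric.describe()                    # count, mean, std, min, quartiles, max

print(col_means)
print(median_price)
print(summary)
```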
