# Introduction to Pandas

## Introduction

As a data analyst with Python in your tool kit, one of the libraries you will use most often will be Pandas. Pandas is a useful library that makes data wrangling, transformation, and analysis easier and more intuitive. In this lesson, we will learn about Pandas data structures and how to apply some basic math functions to them.

As with Numpy in the previous lesson, you must import Pandas to be able to use it. Just like Numpy is typically aliased to np, Pandas is usually aliased to pd. Let's import both of them so that we can use them in this lesson.

In [None]:
#import pandas and numpy
import pandas as pd
import numpy as np

Now that both libraries are imported, we can begin using them.

## Pandas Data Structures

The primary data structures in Pandas are Series and DataFrames. A series is an indexed one-dimensional array whose values can be of any data type. Let's create a series of 10 random numbers by passing a Numpy array to the pd.Series() constructor.

In [43]:
# create series using np.random
series = pd.Series(np.random.random(10))
series

0    0.150070
1    0.879443
2    0.046824
3    0.186894
4    0.689854
5    0.782881
6    0.123231
7    0.698700
8    0.325350
9    0.519169
dtype: float64

As you can see, this generated an indexed array of random numbers. Just as with other Python data structures, we can reference the elements of this array by their indexes.

In [44]:
# select element 0
series[0]

0.15006988243631436

In [45]:
# select element 5
series[5]

0.7828807174095559
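The lookups above work because the default index labels happen to be the integers 0 through 9. As a minimal sketch (the `prices` series and its labels are made up for illustration), the example below uses `.loc` for label-based selection and `.iloc` for position-based selection, which stay unambiguous when a series has a non-default index.

```python
import pandas as pd

# Hypothetical series with string labels instead of the default 0..n-1 index
prices = pd.Series([1.5, 2.25, 3.0], index=['apple', 'banana', 'cherry'])

by_label = prices.loc['banana']   # select by index label
by_position = prices.iloc[0]      # select by integer position

print(by_label)
print(by_position)
```

Plain square brackets also work in many cases, but being explicit with `.loc` and `.iloc` makes it clear whether a label or a position is meant.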

The other type of data structure, DataFrames, are two-dimensional indexed structures where each column can be of a different data type. DataFrames are very similar to spreadsheets and database tables, and they are one of the most useful data structures you will be working with as a data analyst.

DataFrames can be created with the pd.DataFrame() constructor as follows. We also assign a name to each column by passing a list of column names, stored in a variable named colnames, to the columns argument.

In [49]:
# define the column names
colnames = ['Numbers','Column2','Column3','Column4','Column5']

# create a data frame from a random 10x5 Numpy array, using the column names above
df = pd.DataFrame(np.random.random((10,5)), columns=colnames)
df

|  | Numbers | Column2 | Column3 | Column4 | Column5 |
|---|---|---|---|---|---|
| 0 | 0.596279 | 0.773378 | 0.449101 | 0.824192 | 0.144607 |
| 1 | 0.233571 | 0.671616 | 0.069127 | 0.397902 | 0.282668 |
| 2 | 0.143126 | 0.950178 | 0.900198 | 0.288258 | 0.021933 |
| 3 | 0.938806 | 0.875987 | 0.841989 | 0.676988 | 0.318459 |
| 4 | 0.687991 | 0.191566 | 0.882886 | 0.105449 | 0.406818 |
| 5 | 0.572548 | 0.958527 | 0.873035 | 0.876831 | 0.387775 |
| 6 | 0.263113 | 0.976593 | 0.313295 | 0.906597 | 0.803908 |
| 7 | 0.208961 | 0.952869 | 0.17985 | 0.653155 | 0.647801 |
| 8 | 0.487279 | 0.064812 | 0.823618 | 0.980643 | 0.368015 |
| 9 | 0.080729 | 0.37549 | 0.769549 | 0.571066 | 0.035947 |
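The random data frame above holds only floats, so it does not show that each column can have its own data type. The following sketch, built on a small made-up data set (`mixed_df` and its column names are illustrative), creates a data frame from a dictionary of lists and inspects the type Pandas inferred for each column.

```python
import pandas as pd

# Hypothetical data: each column holds a different kind of value
mixed_df = pd.DataFrame({
    'name': ['a', 'b', 'c'],          # strings
    'count': [1, 2, 3],               # integers
    'ratio': [0.5, 0.25, 0.125],      # floats
})

# dtypes reports the data type pandas inferred for each column
print(mixed_df.dtypes)
print(mixed_df.shape)   # (number of rows, number of columns)
```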


We can reference each of the columns in a data frame directly by the column name as follows.

In [51]:
# select a single column by name
df['Column2']

0    0.773378
1    0.671616
2    0.950178
3    0.875987
4    0.191566
5    0.958527
6    0.976593
7    0.952869
8    0.064812
9    0.375490
Name: Column2, dtype: float64

This returns a series consisting of the values in Column2, the second column of our data frame. To extract just the first three columns of our data frame, we pass a list of column names inside the square brackets (so there are two sets of square brackets).

In [52]:
# select the first 3 columns
df[['Numbers', 'Column2', 'Column3']]

|  | Numbers | Column2 | Column3 |
|---|---|---|---|
| 0 | 0.596279 | 0.773378 | 0.449101 |
| 1 | 0.233571 | 0.671616 | 0.069127 |
| 2 | 0.143126 | 0.950178 | 0.900198 |
| 3 | 0.938806 | 0.875987 | 0.841989 |
| 4 | 0.687991 | 0.191566 | 0.882886 |
| 5 | 0.572548 | 0.958527 | 0.873035 |
| 6 | 0.263113 | 0.976593 | 0.313295 |
| 7 | 0.208961 | 0.952869 | 0.17985 |
| 8 | 0.487279 | 0.064812 | 0.823618 |
| 9 | 0.080729 | 0.37549 | 0.769549 |


When we extract more than one column, Pandas returns the result as a data frame, since a series is only one-dimensional.
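The difference between single and double square brackets can be checked directly. This is a minimal sketch on a small made-up frame (`demo` and its column names are hypothetical), showing which kind of object each selection returns.

```python
import pandas as pd

# Small hypothetical frame, unrelated to the random data above
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

single = demo['a']          # one name in single brackets -> Series
as_frame = demo[['a']]      # one name inside a list -> one-column DataFrame
several = demo[['a', 'b']]  # several names in a list -> DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(several.shape)
```

Note that a list with a single column name still produces a data frame, which can be handy when later code expects two-dimensional input.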

## Converting Other Data Structures to Dataframes

We can also convert data we receive in other Python data structures into data frames so that we can work with them more intuitively. For example, suppose we had a list of prices that houses sold for recently and we wanted to get those into a data frame. We could do that by applying the pd.DataFrame method to the list of prices.

In [55]:
lst = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]

# convert list to df with column name 'Sales'
new_df = pd.DataFrame(lst, columns=['Sales'])
new_df

|  | Sales |
|---|---|
| 0 | 208500 |
| 1 | 181500 |
| 2 | 223500 |
| 3 | 140000 |
| 4 | 250000 |
| 5 | 143000 |
| 6 | 307000 |
| 7 | 200000 |
| 8 | 129900 |
| 9 | 118000 |


The list was converted into a one-column data frame with a column name of Sales.

What if we had more than just one list of data? What if we had a list of lists where each sublist in the master list contained information about the sale of a house (the lot area, neighborhood, year built, quality score, and final sale price)? We can apply the same pd.DataFrame method to that list of lists and Pandas will create a data frame with columns based on each index in the sublists.

In [57]:
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500],
           [9600, 'Veenker', 1976, 6, 181500],
           [11250, 'CollgCr', 2001, 7, 223500],
           [9550, 'Crawfor', 1915, 7, 140000],
           [14260, 'NoRidge', 2000, 8, 250000],
           [14115, 'Mitchel', 1993, 5, 143000],
           [10084, 'Somerst', 2004, 8, 307000],
           [10382, 'NWAmes', 1973, 7, 200000],
           [6120, 'OldTown', 1931, 7, 129900],
           [7420, 'BrkSide', 1939, 5, 118000]]



colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']

# convert list of lists to df using colnames above
new_df_two = pd.DataFrame(lst_lst, columns=colnames)
new_df_two

|  | LotSize | Neighborhood | YearBuilt | Quality | SalePrice |
|---|---|---|---|---|---|
| 0 | 8450 | CollgCr | 2003 | 7 | 208500 |
| 1 | 9600 | Veenker | 1976 | 6 | 181500 |
| 2 | 11250 | CollgCr | 2001 | 7 | 223500 |
| 3 | 9550 | Crawfor | 1915 | 7 | 140000 |
| 4 | 14260 | NoRidge | 2000 | 8 | 250000 |
| 5 | 14115 | Mitchel | 1993 | 5 | 143000 |
| 6 | 10084 | Somerst | 2004 | 8 | 307000 |
| 7 | 10382 | NWAmes | 1973 | 7 | 200000 |
| 8 | 6120 | OldTown | 1931 | 7 | 129900 |
| 9 | 7420 | BrkSide | 1939 | 5 | 118000 |


In [63]:
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500, ['Hello']],
           [9600, 'Veenker', 1976, 6, 181500, ['Hello']]]

double_lst_df = pd.DataFrame(lst_lst)
double_lst_df

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 8450 | CollgCr | 2003 | 7 | 208500 | [Hello] |
| 1 | 9600 | Veenker | 1976 | 6 | 181500 | [Hello] |
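In the cell above, no columns argument is passed, so Pandas falls back to integer column labels, and the nested `['Hello']` list is stored as a single value in its column. The sketch below reproduces this behavior on made-up data (`rows`, `no_names`, and the column names are hypothetical) and shows that column names can still be assigned afterwards.

```python
import pandas as pd

# Hypothetical rows; the last element of each row is itself a list
rows = [[1, 'x', ['nested']],
        [2, 'y', ['nested']]]

no_names = pd.DataFrame(rows)

# Without a columns argument, the column labels are integers starting at 0
default_labels = list(no_names.columns)
print(default_labels)

# The inner list is kept as one cell value rather than being expanded
cell = no_names.loc[0, 2]
print(type(cell).__name__)

# Column names can be assigned after the data frame is created
no_names.columns = ['id', 'code', 'extra']
print(no_names)
```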


Lists are not the only data structures that can be converted to a data frame. Data frames can also be created from data stored in a dictionary. Suppose we had a dictionary whose values contained the same information we had in our list of lists, but whose keys consisted of the names of each house.

In [None]:
house_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Beazley House': [14115, 'Mitchel', 1993, 5, 143000],
              'Dominguez House': [14260, 'NoRidge', 2000, 8, 250000],
              'Hamilton House': [6120, 'OldTown', 1931, 7, 129900],
              'James House': [11250, 'CollgCr', 2001, 7, 223500],
              'Martinez House': [9600, 'Veenker', 1976, 6, 181500],
              'Roberts House': [9550, 'Crawfor', 1915, 7, 140000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500],
              'Snyder House': [10084, 'Somerst', 2004, 8, 307000],
              'Zuckerman House': [10382, 'NWAmes', 1973, 7, 200000]}

If we use the same approach as with the list of lists, Pandas by default returns a column for each house.

In [64]:
# convert dictionary to df 

dict_df = pd.DataFrame(house_dict)
dict_df

|  | Baker House | Beazley House | Dominguez House | Hamilton House | James House | Martinez House | Roberts House | Smith House | Snyder House | Zuckerman House |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7420 | 14115 | 14260 | 6120 | 11250 | 9600 | 9550 | 8450 | 10084 | 10382 |
| 1 | BrkSide | Mitchel | NoRidge | OldTown | CollgCr | Veenker | Crawfor | CollgCr | Somerst | NWAmes |
| 2 | 1939 | 1993 | 2000 | 1931 | 2001 | 1976 | 1915 | 2003 | 2004 | 1973 |
| 3 | 5 | 5 | 8 | 7 | 7 | 6 | 7 | 7 | 8 | 7 |
| 4 | 118000 | 143000 | 250000 | 129900 | 223500 | 181500 | 140000 | 208500 | 307000 | 200000 |


This is not the format we want for our data. Instead, we want each house represented as a row and the attributes of the houses represented as columns. There are (at least) two ways to transform the data frame to the format we want. Both methods below will return the same result - a data frame with houses as rows and house attributes as columns.

In [75]:
colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']

# Option 1: transpose the result so that houses become rows
dict_df_transpose = pd.DataFrame(house_dict).transpose()

# then assign the column names
dict_df_transpose.columns = colnames

dict_df_transpose.head()

|  | LotSize | Neighborhood | YearBuilt | Quality | SalePrice |
|---|---|---|---|---|---|
| Baker House | 7420 | BrkSide | 1939 | 5 | 118000 |
| Beazley House | 14115 | Mitchel | 1993 | 5 | 143000 |
| Dominguez House | 14260 | NoRidge | 2000 | 8 | 250000 |
| Hamilton House | 6120 | OldTown | 1931 | 7 | 129900 |
| James House | 11250 | CollgCr | 2001 | 7 | 223500 |
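One detail worth checking after a transpose, offered here as a hedged aside: each column of the untransposed frame mixes numbers and strings, so the transposed frame is likely to store every column with the generic `object` data type, even the purely numeric ones. The sketch below uses a two-house subset of the dictionary (the names `small_dict`, `transposed`, and `converted` are illustrative) and calls `infer_objects()` to let Pandas re-infer more specific types.

```python
import pandas as pd

colnames = ['LotSize', 'Neighborhood', 'YearBuilt', 'Quality', 'SalePrice']

# Two entries taken from the house dictionary, to keep the example short
small_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500]}

transposed = pd.DataFrame(small_dict).transpose()
transposed.columns = colnames
print(transposed.dtypes)      # inspect the dtypes after transposing

# infer_objects() attempts to find a more specific dtype for object columns
converted = transposed.infer_objects()
print(converted.dtypes)
```

Checking `dtypes` before doing arithmetic on a transposed frame is a cheap way to confirm that numeric columns are really stored as numbers.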


In [76]:
# Option 2: use the from_dict method with orient='index', then assign the column names.
houses = pd.DataFrame.from_dict(house_dict, orient='index')
houses.columns = colnames
houses

|  | LotSize | Neighborhood | YearBuilt | Quality | SalePrice |
|---|---|---|---|---|---|
| Baker House | 7420 | BrkSide | 1939 | 5 | 118000 |
| Beazley House | 14115 | Mitchel | 1993 | 5 | 143000 |
| Dominguez House | 14260 | NoRidge | 2000 | 8 | 250000 |
| Hamilton House | 6120 | OldTown | 1931 | 7 | 129900 |
| James House | 11250 | CollgCr | 2001 | 7 | 223500 |
| Martinez House | 9600 | Veenker | 1976 | 6 | 181500 |
| Roberts House | 9550 | Crawfor | 1915 | 7 | 140000 |
| Smith House | 8450 | CollgCr | 2003 | 7 | 208500 |
| Snyder House | 10084 | Somerst | 2004 | 8 | 307000 |
| Zuckerman House | 10382 | NWAmes | 1973 | 7 | 200000 |
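As a possible shortcut, `from_dict` also accepts a `columns` argument when `orient='index'`, so the column names can be supplied in the same call instead of being assigned afterwards. This is a small sketch on a two-house subset of the dictionary (`small_dict` and `houses_small` are illustrative names).

```python
import pandas as pd

colnames = ['LotSize', 'Neighborhood', 'YearBuilt', 'Quality', 'SalePrice']

# Two entries taken from the house dictionary, to keep the example short
small_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500]}

# With orient='index', the dictionary keys become row labels and
# the column labels can be passed directly
houses_small = pd.DataFrame.from_dict(small_dict, orient='index', columns=colnames)
print(houses_small)
```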


## Applying Mathematical Functions to Dataframes

Like Numpy, Pandas also has some built-in mathematical functions that you can apply to series and data frames. Let's take a look at some of the basic ones.

In [78]:
# Total price of all houses sold
houses['SalePrice'].sum()

1901400

In [79]:
# Average lot size of houses sold
houses['LotSize'].mean()

10123.1

In [None]:
# The earliest year a house in the data set was built
houses['YearBuilt'].min()

In [81]:
# The latest year a house in the data set was built
houses['YearBuilt'].max()

2004
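The same kind of function can be applied to a whole data frame, in which case the calculation runs column by column. The sketch below uses a three-row subset of the lot sizes and sale prices from this lesson (the `sales` frame is an illustrative name) and also shows `describe()`, which reports several summary statistics at once.

```python
import pandas as pd

# Three rows of lot sizes and sale prices taken from the lesson data
sales = pd.DataFrame({
    'LotSize': [8450, 9600, 11250],
    'SalePrice': [208500, 181500, 223500],
})

median_price = sales['SalePrice'].median()   # middle value of one column
column_means = sales.mean()                  # column-wise mean, returned as a Series

print(median_price)
print(column_means)
print(sales.describe())                      # count, mean, std, min, quartiles, max
```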

## Summary

In this lesson, we learned about the two basic Pandas data structures - Series and DataFrames. We also learned how to reference the elements of a series and how to extract specific columns from a data frame. From there, we converted other Python data structures, such as lists and dictionaries, into Pandas DataFrames. Finally, we briefly looked at some basic mathematical functions that can be applied to a data frame's columns. We will be using all of these concepts and math functions as we progress through the program.