# Introduction to Pandas

## Introduction

As a data analyst with Python in your tool kit, one of the libraries you will use most often will be Pandas. Pandas is a useful library that makes data wrangling, transformation, and analysis easier and more intuitive. In this lesson, we will learn about Pandas data structures and how to apply some basic math functions to them.

As with NumPy in the previous lesson, you must import Pandas before you can use it. Just as NumPy is typically aliased to np, Pandas is usually aliased to pd. Let's import both of them so that we can use them in this lesson.

In [22]:
#import pandas and numpy
import pandas as pd
import numpy as np

Now that both libraries are imported, we can begin using them.

## Pandas Data Structures

The primary data structures in Pandas are Series and DataFrames. A series is an indexed one-dimensional array where the values can be of any data type. Let's create a series consisting of 10 random numbers by passing a NumPy array to the pd.Series() constructor.

For a discussion of the difference between a Pandas Series and a Python list, see https://discuss.analyticsvidhya.com/t/what-is-the-difference-between-pandas-series-and-python-lists/27373/2

In [25]:
# create series using np.random
series = pd.Series(np.random.random(10))
series

0    0.649666
1    0.884218
2    0.792807
3    0.011039
4    0.952290
5    0.055465
6    0.808826
7    0.877389
8    0.679460
9    0.831964
dtype: float64

As you can see, this generated an indexed array of random numbers. Just as with other Python data structures, we can reference the elements of this array by their indexes.

In [27]:
# select the element at index 3 (the fourth element)
series[3]

0.011038827445547517

In [28]:
# select the element at index 4 (the fifth element)
series[4]

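# aside: rows of a DataFrame can be filtered with a boolean condition, e.g.
# ecom[ecom['Job'] == 'Lawyer']   (ecom is a DataFrame that is not defined in this lesson)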

0.9522902882709691
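
The series above uses the default integer index, so the label and the position of each element happen to coincide. The following is a minimal sketch, assuming a small made-up series (the name prices and the labels 'a', 'b', 'c' are illustrative and not part of the lesson), of how label-based access with .loc differs from position-based access with .iloc.

```python
import pandas as pd

# Hypothetical example data: three prices with string labels as the index
prices = pd.Series([208500, 181500, 223500], index=['a', 'b', 'c'])

# .loc selects by index label
by_label = prices.loc['b']

# .iloc selects by integer position (0-based), regardless of the labels
by_position = prices.iloc[1]

# Both expressions refer to the second element of the series
print(by_label, by_position)
```

Being explicit with .loc and .iloc tends to be safer than plain square brackets once the index is no longer the default 0, 1, 2, ... sequence.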

The other type of data structure, the DataFrame, is a two-dimensional indexed structure where each column can be of a different data type. DataFrames are very similar to spreadsheets and database tables, and they are one of the most useful data structures you will be working with as a data analyst.

DataFrames can be created with the Pandas DataFrame constructor, as shown below. The first example uses the default integer column labels. The second assigns specific column names by passing a list stored in a variable named colnames to the columns argument.

In [30]:
df = pd.DataFrame(np.random.random((10,5)))
df

,0,1,2,3,4
0,0.167778,0.296816,0.300655,0.454297,0.292484
1,0.231249,0.281262,0.523183,0.946752,0.415238
2,0.822198,0.964453,0.318988,0.379873,0.69398
3,0.940336,0.43454,0.455373,0.625974,0.688153
4,0.732387,0.347126,0.78574,0.084137,0.709637
5,0.236131,0.625193,0.647873,0.858854,0.233969
6,0.046572,0.242555,0.331329,0.011596,0.7806
7,0.981143,0.198434,0.752157,0.654623,0.631839
8,0.721109,0.218124,0.732908,0.250873,0.921514
9,0.343666,0.266107,0.706864,0.62032,0.722628


In [34]:
# column names to assign; more descriptive names would work the same way, e.g.
# colnames = ['Column1','Column2','Column3','Column4','Column5']
colnames = ['numbers','1','4554','dsfsdf','dfsfd']

# create a DataFrame from a random 10x5 NumPy array, using the column names above

df = pd.DataFrame(np.random.random((10,5)),columns=colnames)
df

,numbers,1,4554,dsfsdf,dfsfd
0,0.389641,0.298789,0.734208,0.180049,0.334192
1,0.334714,0.242308,0.754796,0.577529,0.236346
2,0.464575,0.295058,0.240759,0.105076,0.19716
3,0.426392,0.934772,0.597856,0.848264,0.551463
4,0.202924,0.760555,0.062501,0.464854,0.501113
5,0.849701,0.184532,0.495586,0.858156,0.812954
6,0.126903,0.758729,0.929305,0.036866,0.552371
7,0.171071,0.737782,0.320192,0.038276,0.895634
8,0.956261,0.056366,0.845096,0.356021,0.081855
9,0.173738,0.530969,0.899641,0.336831,0.975319
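
Because the values above are random, they change on every run. The following is a minimal sketch, assuming a seeded NumPy generator and illustrative names (rng, df_demo, and the Column1 to Column5 labels are assumptions made for this example), that builds a repeatable data frame and inspects its basic properties.

```python
import numpy as np
import pandas as pd

# A seeded generator makes the "random" values repeatable between runs
rng = np.random.default_rng(seed=42)

# Illustrative column names (assumed for this sketch)
colnames = ['Column1', 'Column2', 'Column3', 'Column4', 'Column5']
df_demo = pd.DataFrame(rng.random((10, 5)), columns=colnames)

print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.columns))  # the column labels we assigned
print(df_demo.dtypes)         # data type of each column
```

Checking shape, columns, and dtypes right after creating a data frame is a quick way to confirm it has the structure we intended.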


We can reference each of the columns in a data frame directly by the column name as follows.

In [36]:
# select the column named '4554'; double brackets return a one-column DataFrame
df[['4554']]

,4554
0,0.734208
1,0.754796
2,0.240759
3,0.597856
4,0.062501
5,0.495586
6,0.929305
7,0.320192
8,0.845096
9,0.899641


In [37]:
df['4554']

0    0.734208
1    0.754796
2    0.240759
3    0.597856
4    0.062501
5    0.495586
6    0.929305
7    0.320192
8    0.845096
9    0.899641
Name: 4554, dtype: float64

This returns a series consisting of the values in the '4554' column of our data frame. If we wanted to extract more than one column, we would need to include the column names in a list inside the square brackets (so there would be two sets of square brackets). Passing the names without the inner list raises a KeyError, because Pandas treats the pair of names as a single column label.

In [40]:
df['4554','dsfsdf']

KeyError: ('4554', 'dsfsdf')

In [39]:
# select the columns '4554' and 'dsfsdf' by passing a list of column names
df[['4554','dsfsdf']]

,4554,dsfsdf
0,0.734208,0.180049
1,0.754796,0.577529
2,0.240759,0.105076
3,0.597856,0.848264
4,0.062501,0.464854
5,0.495586,0.858156
6,0.929305,0.036866
7,0.320192,0.038276
8,0.845096,0.356021
9,0.899641,0.336831


When we extract more than one column, Pandas returns the result as a data frame, since a series is only one-dimensional.
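
The difference between single and double brackets can be checked directly. The following is a minimal sketch, assuming a tiny made-up data frame (df_demo and its column names 'a' and 'b' are illustrative), that compares the type returned by each form of selection and shows the KeyError raised when the inner list is left out.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; the column names are made up for this sketch
df_demo = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

single = df_demo['a']          # one label          -> Series
as_frame = df_demo[['a']]      # list with one name -> one-column DataFrame
several = df_demo[['a', 'b']]  # list of names      -> DataFrame

print(type(single).__name__, type(as_frame).__name__, type(several).__name__)

# Without the inner list, Pandas looks for one column labeled ('a', 'b')
try:
    df_demo['a', 'b']
except KeyError as err:
    print('KeyError:', err)
```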

## Converting Other Data Structures to Dataframes

We can also convert data we receive in other Python data structures into data frames so that we can work with them more intuitively. For example, suppose we had a list of prices that houses sold for recently and we wanted to get those into a data frame. We could do that by applying the pd.DataFrame method to the list of prices.

In [45]:
lst = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]

# convert list to df with column name 'SalePrice'
new_df = pd.DataFrame(lst,columns=['SalePrice'])
new_df

,SalePrice
0,208500
1,181500
2,223500
3,140000
4,250000
5,143000
6,307000
7,200000
8,129900
9,118000


The list was converted into a one-column data frame with a column name of SalePrice.

What if we had more than just one list of data? What if we had a list of lists where each sublist in the master list contained information about the sale of a house (the lot area, neighborhood, year built, quality score, and final sale price)? We can apply the same pd.DataFrame method to that list of lists and Pandas will create a data frame with columns based on each index in the sublists.

In [47]:
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500],
           [9600, 'Veenker', 1976, 6, 181500],
           [11250, 'CollgCr', 2001, 7, 223500],
           [9550, 'Crawfor', 1915, 7, 140000],
           [14260, 'NoRidge', 2000, 8, 250000],
           [14115, 'Mitchel', 1993, 5, 143000],
           [10084, 'Somerst', 2004, 8, 307000],
           [10382, 'NWAmes', 1973, 7, 200000],
           [6120, 'OldTown', 1931, 7, 129900],
           [7420, 'BrkSide', 1939, 5, 118000]]

colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']
new_df_two = pd.DataFrame(lst_lst, columns=colnames)
new_df_two


,LotSize,Neighborhood,YearBuilt,Quality,SalePrice
0,8450,CollgCr,2003,7,208500
1,9600,Veenker,1976,6,181500
2,11250,CollgCr,2001,7,223500
3,9550,Crawfor,1915,7,140000
4,14260,NoRidge,2000,8,250000
5,14115,Mitchel,1993,5,143000
6,10084,Somerst,2004,8,307000
7,10382,NWAmes,1973,7,200000
8,6120,OldTown,1931,7,129900
9,7420,BrkSide,1939,5,118000
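
Since each position in the sublists becomes its own column, each column also gets its own data type. The following is a minimal sketch, using the first three rows of the house data (the names rows, sample_df, and numeric_cols are assumptions made for this example), that inspects the inferred types and picks out the numeric columns.

```python
import pandas as pd

# First three rows of the house data from the lesson
rows = [[8450, 'CollgCr', 2003, 7, 208500],
        [9600, 'Veenker', 1976, 6, 181500],
        [11250, 'CollgCr', 2001, 7, 223500]]
colnames = ['LotSize', 'Neighborhood', 'YearBuilt', 'Quality', 'SalePrice']

sample_df = pd.DataFrame(rows, columns=colnames)

# Each column gets its own inferred data type
print(sample_df.dtypes)

# select_dtypes keeps only the columns of a given kind, here the numeric ones
numeric_cols = sample_df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```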


In [49]:
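# a sublist can itself contain a list; Pandas stores the nested list as a single cell value
# (no columns argument is passed here, so the columns get default integer labels)
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500, ['Hello']],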
           [9600, 'Veenker', 1976, 6, 181500, ['Hello']]]

double_lst_df = pd.DataFrame(lst_lst)
double_lst_df

,0,1,2,3,4,5
0,8450,CollgCr,2003,7,208500,[Hello]
1,9600,Veenker,1976,6,181500,[Hello]


Lists are not the only data structures that can be converted to a data frame. Data frames can also be created from data stored in a dictionary. Suppose we had a dictionary where the values contained the same information we had in our list of lists, but the keys of the dictionary consisted of the names of each house.

In [50]:
house_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Beazley House': [14115, 'Mitchel', 1993, 5, 143000],
              'Dominguez House': [14260, 'NoRidge', 2000, 8, 250000],
              'Hamilton House': [6120, 'OldTown', 1931, 7, 129900],
              'James House': [11250, 'CollgCr', 2001, 7, 223500],
              'Martinez House': [9600, 'Veenker', 1976, 6, 181500],
              'Roberts House': [9550, 'Crawfor', 1915, 7, 140000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500],
              'Snyder House': [10084, 'Somerst', 2004, 8, 307000],
              'Zuckerman House': [10382, 'NWAmes', 1973, 7, 200000]}

If we use the same approach as with the list of lists, Pandas will by default create a column for each house, because the dictionary keys are used as the column labels.

In [51]:
# convert dictionary to df 
dict_df = pd.DataFrame(house_dict)
dict_df

,Baker House,Beazley House,Dominguez House,Hamilton House,James House,Martinez House,Roberts House,Smith House,Snyder House,Zuckerman House
0,7420,14115,14260,6120,11250,9600,9550,8450,10084,10382
1,BrkSide,Mitchel,NoRidge,OldTown,CollgCr,Veenker,Crawfor,CollgCr,Somerst,NWAmes
2,1939,1993,2000,1931,2001,1976,1915,2003,2004,1973
3,5,5,8,7,7,6,7,7,8,7
4,118000,143000,250000,129900,223500,181500,140000,208500,307000,200000


This is not the format we want for our data. Instead, we want each house represented as a row and the attributes of the houses represented as columns. There are (at least) two ways to transform the data frame to the format we want. Both methods below will return the same result - a data frame with houses as rows and house attributes as columns.

In [53]:
colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']

# You can transpose the result so that each house becomes a row,
# then assign the column names.
dict_df = pd.DataFrame(house_dict).transpose()
dict_df.columns = colnames
dict_df


,LotSize,Neighborhood,YearBuilt,Quality,SalePrice
Baker House,7420,BrkSide,1939,5,118000
Beazley House,14115,Mitchel,1993,5,143000
Dominguez House,14260,NoRidge,2000,8,250000
Hamilton House,6120,OldTown,1931,7,129900
James House,11250,CollgCr,2001,7,223500
Martinez House,9600,Veenker,1976,6,181500
Roberts House,9550,Crawfor,1915,7,140000
Smith House,8450,CollgCr,2003,7,208500
Snyder House,10084,Somerst,2004,8,307000
Zuckerman House,10382,NWAmes,1973,7,200000


In [56]:
# Or you can use the from_dict method with orient='index', and then assign the column names.
houses = pd.DataFrame.from_dict(house_dict, orient = 'index')
houses.columns = colnames
houses

,LotSize,Neighborhood,YearBuilt,Quality,SalePrice
Baker House,7420,BrkSide,1939,5,118000
Beazley House,14115,Mitchel,1993,5,143000
Dominguez House,14260,NoRidge,2000,8,250000
Hamilton House,6120,OldTown,1931,7,129900
James House,11250,CollgCr,2001,7,223500
Martinez House,9600,Veenker,1976,6,181500
Roberts House,9550,Crawfor,1915,7,140000
Smith House,8450,CollgCr,2003,7,208500
Snyder House,10084,Somerst,2004,8,307000
Zuckerman House,10382,NWAmes,1973,7,200000
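
The two approaches display the same table, but it can be worth checking the column data types they produce. The following is a minimal sketch, using two houses from the dictionary above (the names transposed and from_index are assumptions made for this example). In the Pandas versions we are aware of, the transposed frame keeps generic object columns while from_dict infers a type per column, but treat that as something to verify in your own environment rather than a guarantee.

```python
import pandas as pd

# Two houses from the lesson's dictionary, to keep the sketch short
house_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500]}
colnames = ['LotSize', 'Neighborhood', 'YearBuilt', 'Quality', 'SalePrice']

# Approach 1: build with houses as columns, then transpose
transposed = pd.DataFrame(house_dict).transpose()
transposed.columns = colnames

# Approach 2: from_dict with orient='index'; column names can be passed directly
from_index = pd.DataFrame.from_dict(house_dict, orient='index', columns=colnames)

# The cell values match
print((transposed.values == from_index.values).all())

# The column dtypes may differ, so it is worth printing them
print(transposed.dtypes)
print(from_index.dtypes)

# infer_objects() asks Pandas to re-infer more specific dtypes for object columns
transposed = transposed.infer_objects()
print(transposed.dtypes)
```

Numeric dtypes matter for the math functions in the next section, so checking them after a conversion is a reasonable habit.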


## Applying Mathematical Functions to Dataframes

Like NumPy, Pandas has built-in mathematical methods, such as sum, mean, and min, that you can apply to series and data frames. Let's take a look at some of the basic ones.

In [57]:
# Total price of all houses sold
houses['SalePrice'].sum()

1901400

In [58]:
# Average lot size of houses sold
houses['LotSize'].mean()

10123.1

In [60]:
# The earliest year a house in the data set was built
houses['YearBuilt'].min()

1915
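
The same functions can also be applied to several columns at once. The following is a minimal sketch, using three houses from the lesson's data (the names houses_demo and summary are assumptions made for this example), that calls max on a whole data frame and uses agg to compute several statistics in one step.

```python
import pandas as pd

# Three houses from the lesson's data, to keep the sketch short
houses_demo = pd.DataFrame(
    {'LotSize': [8450, 9600, 11250],
     'YearBuilt': [2003, 1976, 2001],
     'SalePrice': [208500, 181500, 223500]},
    index=['Smith House', 'Martinez House', 'James House'])

# Calling a function on a DataFrame applies it to every column
print(houses_demo.max())

# agg() computes several statistics in one call
summary = houses_demo[['LotSize', 'SalePrice']].agg(['min', 'max', 'mean'])
print(summary)
```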

## Summary

In this lesson, we learned about the two basic Pandas data structures: Series and DataFrames. We also learned how to reference the elements of a series and how to extract specific columns from a data frame. From there, we converted other Python data structures, such as lists and dictionaries, into Pandas DataFrames. Finally, we briefly looked at some basic mathematical functions that can be applied to a data frame's columns. We will be using all of these concepts and math functions as we progress through the program.