# Introduction to Pandas

## Introduction

<font size="10">As a data analyst with Python in your toolkit, Pandas will be one of the libraries you use most often. It makes data wrangling, transformation, and analysis easier and more intuitive. 
</font>



<font size="10">
In this lesson, we will learn:

- Pandas data structures.
- How to convert native Python data structures to data frames.
- How to apply some basic math functions to them.
</font>

<font size="10">
As with NumPy in the previous lesson, you must import Pandas before you can use it. Just as NumPy is typically aliased to np, Pandas is usually aliased to pd. Let's import both so that we can use them in this lesson.
    </font>

In [1]:
import numpy as np
import pandas as pd

<font size="10">Now that both libraries are imported, we can begin using them.</font>


## Pandas Data Structures


<font size="10">The primary data structures in Pandas are Series and DataFrames. A series is an indexed one-dimensional array where the values can be of any data type. </font>


<font size="10">Let's create a series of 10 random numbers by passing a NumPy array to the Pandas Series() constructor.</font>


In [7]:
a = pd.Series(np.random.random(10))
print(a)
#print(type(a))
#print(type(np.random.random(10)))

0    0.801157
1    0.867751
2    0.445935
3    0.603851
4    0.973916
5    0.076177
6    0.639740
7    0.395717
8    0.077096
9    0.044014
dtype: float64


<font size="10">As you can see, this generated an indexed array of random numbers (because the values are random, your output will differ). Just as with other Python data structures, we can reference the elements of this series by their index.</font>




In [8]:
a[0]  

0.8011573023255094

In [9]:
a[5]

0.07617746924506874
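
The lookups above work because the default index labels are the integers 0 through 9. When a series carries custom labels, looking up by position and looking up by label become two different operations. The following is a minimal sketch, assuming a hypothetical `prices` series whose labels (`house_a` and so on) are invented for illustration; `.iloc` selects by position and `.loc` selects by label.

```python
import pandas as pd

# Hypothetical series with string labels instead of the default 0..n-1 index.
prices = pd.Series([208500, 181500, 223500],
                   index=["house_a", "house_b", "house_c"])

first_by_position = prices.iloc[0]      # positional lookup
first_by_label = prices.loc["house_a"]  # label lookup

print(first_by_position, first_by_label)
print(prices.dtype)
```

Using `.iloc` and `.loc` explicitly makes it clear to a reader which kind of lookup is intended.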

<font size="10">The other data structure, the DataFrame, is a two-dimensional indexed structure in which each column can have a different data type. 
</font>


<font size="10">DataFrames are very similar to spreadsheets and database tables, and they are one of the most useful data structures you will be working with as a data analyst.
</font>


<font size="10">A DataFrame can be created with the Pandas DataFrame() constructor, as shown below. We also assign a name to each column by passing a list of column names, stored in a variable named colnames, to the columns argument.
</font>

In [49]:
colnames = ['Column1','Column2','Column3','Column4','Column5']
df = pd.DataFrame(np.random.random((10,5)), columns=colnames)
df


| | Column1 | Column2 | Column3 | Column4 | Column5 |
|---|---|---|---|---|---|
| 0 | 0.279659 | 0.468769 | 0.894473 | 0.263416 | 0.370674 |
| 1 | 0.111617 | 0.542098 | 0.085141 | 0.480043 | 0.342572 |
| 2 | 0.891393 | 0.419078 | 0.820403 | 0.693637 | 0.542505 |
| 3 | 0.846374 | 0.326464 | 0.863644 | 0.73803 | 0.155853 |
| 4 | 0.342263 | 0.104574 | 0.897018 | 0.636758 | 0.843403 |
| 5 | 0.772898 | 0.558498 | 0.802863 | 0.455871 | 0.09294 |
| 6 | 0.73614 | 0.625658 | 0.694276 | 0.816592 | 0.840895 |
| 7 | 0.612466 | 0.075737 | 0.946215 | 0.573653 | 0.60288 |
| 8 | 0.906494 | 0.414587 | 0.461941 | 0.753424 | 0.945062 |
| 9 | 0.305812 | 0.654644 | 0.417261 | 0.484547 | 0.878865 |
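
A few attributes help confirm the structure of a newly created data frame. The sketch below builds a frame like the one above, but with a seeded random generator so that reruns give the same numbers (the seed value 0 is an arbitrary assumption), and then inspects its shape, column labels, and first rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
colnames = ["Column1", "Column2", "Column3", "Column4", "Column5"]
df = pd.DataFrame(rng.random((10, 5)), columns=colnames)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.head(3))        # first three rows
```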


<font size="10">We can reference each of the columns in a data frame directly by the column name as follows.
</font>



In [27]:
df['Column1']

#df['Column1'][0]
#df['Column1'][5]
#.iloc or loc

0    0.396597
1    0.670957
2    0.209814
3    0.289061
4    0.588067
5    0.807916
6    0.434587
7    0.753622
8    0.834930
9    0.764514
Name: Column1, dtype: float64

<font size="10">This returns a series consisting of the values in the first column of our data frame. If we wanted to extract just the first three columns of our data frame, we would need to include the column names in a list inside the square brackets (so there would be two sets of square brackets).
</font>




In [40]:
df[['Column1','Column2','Column3']] 

#df['Column1']

| | Column1 | Column2 | Column3 |
|---|---|---|---|
| 0 | 0.396597 | 0.223153 | 0.538237 |
| 1 | 0.670957 | 0.158445 | 0.502927 |
| 2 | 0.209814 | 0.664318 | 0.455744 |
| 3 | 0.289061 | 0.218563 | 0.915153 |
| 4 | 0.588067 | 0.302597 | 0.76092 |
| 5 | 0.807916 | 0.114545 | 0.950195 |
| 6 | 0.434587 | 0.390598 | 0.868689 |
| 7 | 0.753622 | 0.016189 | 0.05504 |
| 8 | 0.83493 | 0.721148 | 0.436045 |
| 9 | 0.764514 | 0.945365 | 0.949419 |


<font size="10">When we extract more than one column, Pandas returns the result as a data frame, since a series is only one-dimensional.
</font>
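
The difference between single and double brackets can be checked with `type()`. The sketch below uses a small hypothetical frame (the values are made up) and also shows `.loc` and `.iloc`, the Pandas indexers for selecting a single value by label or by position.

```python
import pandas as pd

# Small hypothetical frame; the values are made up for illustration.
df = pd.DataFrame({"Column1": [0.1, 0.2, 0.3], "Column2": [1.0, 2.0, 3.0]})

one_col = df["Column1"]     # single brackets -> Series
as_frame = df[["Column1"]]  # a list with one name -> one-column DataFrame

print(type(one_col).__name__, type(as_frame).__name__)

value_by_label = df.loc[0, "Column1"]  # row label 0, column label "Column1"
value_by_position = df.iloc[0, 0]      # first row, first column
print(value_by_label, value_by_position)
```

Passing a one-item list is a handy way to keep a result two-dimensional when later code expects a data frame.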






## Converting Other Data Structures to DataFrames


<font size="10">We can also convert data we receive in other Python data structures into data frames so that we can work with them more intuitively.
</font>









<font size="10">For example, suppose we had a list of prices that houses sold for recently and we wanted to get those into a data frame. We could do that by applying the pd.DataFrame method to the list of prices.
</font>









In [47]:
lst = [208500, 181500, 223500, 140000, 250000, 143000, 307000, 200000, 129900, 118000]

price_df = pd.DataFrame(lst, columns=['SalePrice'])

#price_df.columns 
#type(price_df)
price_df

| | SalePrice |
|---|---|
| 0 | 208500 |
| 1 | 181500 |
| 2 | 223500 |
| 3 | 140000 |
| 4 | 250000 |
| 5 | 143000 |
| 6 | 307000 |
| 7 | 200000 |
| 8 | 129900 |
| 9 | 118000 |


<font size="10">The list was converted into a one-column data frame with a column name of SalePrice.
</font>
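
To see what the columns argument contributes, it can help to leave it out. In the sketch below (using a shortened copy of the price list), the single column is labeled with the integer 0 by default, and `rename()` is one way to give it a name afterwards. The variable name `unnamed_df` is an assumption made for this illustration.

```python
import pandas as pd

lst = [208500, 181500, 223500, 140000, 250000]

unnamed_df = pd.DataFrame(lst)  # no columns argument: the column is labeled 0
print(list(unnamed_df.columns))

# rename() returns a new frame with the column relabeled.
price_df = unnamed_df.rename(columns={0: "SalePrice"})
print(list(price_df.columns))
```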



<font size="10">
What if we had more than just one list of data? What if we had a list of lists where each sublist in the master list contained information about the sale of a house (the lot area, neighborhood, year built, quality score, and final sale price)? 
</font>

<font size="10">
We can apply the same pd.DataFrame method to that list of lists, and Pandas will create a data frame with one row per sublist and one column per position within the sublists.
</font>

In [56]:
lst_lst = [[8450, 'CollgCr', 2003, 7, 208500],
           [9600, 'Veenker', 1976, 6, 181500],
           [11250, 'CollgCr', 2001, 7, 223500],
           [9550, 'Crawfor', 1915, 7, 140000],
           [14260, 'NoRidge', 2000, 8, 250000 ],
           [14115, 'Mitchel', 1993, 5, 143000],
           [10084, 'Somerst', 2004, 8, 307000],
           [10382, 'NWAmes', 1973, 7, 200000],
           [6120, 'OldTown', 1931, 7, 129900],
           [7420, 'BrkSide', 1939, 5, 118000]]

colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']
pd.DataFrame(lst_lst, columns=colnames)

| | LotSize | Neighborhood | YearBuilt | Quality | SalePrice |
|---|---|---|---|---|---|
| 0 | 8450 | CollgCr | 2003 | 7 | 208500 |
| 1 | 9600 | Veenker | 1976 | 6 | 181500 |
| 2 | 11250 | CollgCr | 2001 | 7 | 223500 |
| 3 | 9550 | Crawfor | 1915 | 7 | 140000 |
| 4 | 14260 | NoRidge | 2000 | 8 | 250000 |
| 5 | 14115 | Mitchel | 1993 | 5 | 143000 |
| 6 | 10084 | Somerst | 2004 | 8 | 307000 |
| 7 | 10382 | NWAmes | 1973 | 7 | 200000 |
| 8 | 6120 | OldTown | 1931 | 7 | 129900 |
| 9 | 7420 | BrkSide | 1939 | 5 | 118000 |
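
Because each column of a data frame can have its own data type, a frame built from mixed sublists ends up with numeric columns alongside a text column. The sketch below checks this on the first three sublists; the name `sales_df` is an assumption made for this illustration, and the exact label Pandas prints for the text column's type may vary between versions.

```python
import pandas as pd

lst_lst = [[8450, "CollgCr", 2003, 7, 208500],
           [9600, "Veenker", 1976, 6, 181500],
           [11250, "CollgCr", 2001, 7, 223500]]
colnames = ["LotSize", "Neighborhood", "YearBuilt", "Quality", "SalePrice"]
sales_df = pd.DataFrame(lst_lst, columns=colnames)

# Each column gets its own inferred data type.
print(sales_df.dtypes)

lot_is_numeric = pd.api.types.is_numeric_dtype(sales_df["LotSize"])
hood_is_numeric = pd.api.types.is_numeric_dtype(sales_df["Neighborhood"])
print(lot_is_numeric, hood_is_numeric)
```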


<font size="10">Lists are not the only data structures that can be converted to a data frame. Data frames can also be created from data stored in a dictionary. 
</font>



<font size="10">Suppose we had a dictionary where the values contained the same information we had in our list of lists, but the keys of the dictionary consisted of the names of each house.
</font>



In [58]:
house_dict = {'Baker House': [7420, 'BrkSide', 1939, 5, 118000],
              'Beazley House': [14115, 'Mitchel', 1993, 5, 143000],
              'Dominguez House': [14260, 'NoRidge', 2000, 8, 250000],
              'Hamilton House': [6120, 'OldTown', 1931, 7, 129900],
              'James House': [11250, 'CollgCr', 2001, 7, 223500],
              'Martinez House': [9600, 'Veenker', 1976, 6, 181500],
              'Roberts House': [9550, 'Crawfor', 1915, 7, 140000],
              'Smith House': [8450, 'CollgCr', 2003, 7, 208500],
              'Snyder House': [10084, 'Somerst', 2004, 8, 307000],
              'Zuckerman House': [10382, 'NWAmes', 1973, 7, 200000]}
house_dict['Baker House']

[7420, 'BrkSide', 1939, 5, 118000]

<font size="10">If we use the same approach as with the list of lists, Pandas would by default return a column for each house.
</font>



In [60]:
pd.DataFrame(house_dict)

| | Baker House | Beazley House | Dominguez House | Hamilton House | James House | Martinez House | Roberts House | Smith House | Snyder House | Zuckerman House |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7420 | 14115 | 14260 | 6120 | 11250 | 9600 | 9550 | 8450 | 10084 | 10382 |
| 1 | BrkSide | Mitchel | NoRidge | OldTown | CollgCr | Veenker | Crawfor | CollgCr | Somerst | NWAmes |
| 2 | 1939 | 1993 | 2000 | 1931 | 2001 | 1976 | 1915 | 2003 | 2004 | 1973 |
| 3 | 5 | 5 | 8 | 7 | 7 | 6 | 7 | 7 | 8 | 7 |
| 4 | 118000 | 143000 | 250000 | 129900 | 223500 | 181500 | 140000 | 208500 | 307000 | 200000 |


<font size="10">This is not the format we want for our data. Instead, we want each house represented as a row and the attributes of the houses represented as columns.
</font>





<font size="10">There are (at least) two ways to get the data frame into the format we want. The first uses the transpose() method; the second uses the from_dict method. 
</font>





In [78]:
# You can transpose the result and adjust the column names.
colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']
house_df = pd.DataFrame(house_dict).transpose()
house_df.columns = colnames


house_df

#house_df = house_df.reset_index()
#house_df.columns
#print(house_df)

| | LotSize | Neighborhood | YearBuilt | Quality | SalePrice |
|---|---|---|---|---|---|
| Baker House | 7420 | BrkSide | 1939 | 5 | 118000 |
| Beazley House | 14115 | Mitchel | 1993 | 5 | 143000 |
| Dominguez House | 14260 | NoRidge | 2000 | 8 | 250000 |
| Hamilton House | 6120 | OldTown | 1931 | 7 | 129900 |
| James House | 11250 | CollgCr | 2001 | 7 | 223500 |
| Martinez House | 9600 | Veenker | 1976 | 6 | 181500 |
| Roberts House | 9550 | Crawfor | 1915 | 7 | 140000 |
| Smith House | 8450 | CollgCr | 2003 | 7 | 208500 |
| Snyder House | 10084 | Somerst | 2004 | 8 | 307000 |
| Zuckerman House | 10382 | NWAmes | 1973 | 7 | 200000 |


In [79]:
# Or you can add the from_dict method and specify 'index' for the orient parameter, and then adjust your column names.
colnames = ['LotSize','Neighborhood','YearBuilt','Quality','SalePrice']

house_df = pd.DataFrame.from_dict(house_dict, orient='index')
house_df.columns = colnames
house_df

| | LotSize | Neighborhood | YearBuilt | Quality | SalePrice |
|---|---|---|---|---|---|
| Baker House | 7420 | BrkSide | 1939 | 5 | 118000 |
| Beazley House | 14115 | Mitchel | 1993 | 5 | 143000 |
| Dominguez House | 14260 | NoRidge | 2000 | 8 | 250000 |
| Hamilton House | 6120 | OldTown | 1931 | 7 | 129900 |
| James House | 11250 | CollgCr | 2001 | 7 | 223500 |
| Martinez House | 9600 | Veenker | 1976 | 6 | 181500 |
| Roberts House | 9550 | Crawfor | 1915 | 7 | 140000 |
| Smith House | 8450 | CollgCr | 2003 | 7 | 208500 |
| Snyder House | 10084 | Somerst | 2004 | 8 | 307000 |
| Zuckerman House | 10382 | NWAmes | 1973 | 7 | 200000 |
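
The two approaches display the same table, but they can differ in the data types they produce. Because each dictionary value mixes numbers and text, the transposed frame typically ends up with generic object columns, while from_dict with orient='index' infers a type per column. The sketch below compares the two on a shortened copy of house_dict and shows `infer_objects()` as one possible way to recover numeric types after transposing. It also passes the column names through the `columns` parameter of from_dict, which is accepted when orient='index'. The variable names `transposed`, `from_dict_df`, and `converted` are assumptions made for this illustration.

```python
import pandas as pd

house_dict = {"Baker House": [7420, "BrkSide", 1939, 5, 118000],
              "Beazley House": [14115, "Mitchel", 1993, 5, 143000],
              "Dominguez House": [14260, "NoRidge", 2000, 8, 250000]}
colnames = ["LotSize", "Neighborhood", "YearBuilt", "Quality", "SalePrice"]

# Approach 1: transpose, then label the columns.
transposed = pd.DataFrame(house_dict).transpose()
transposed.columns = colnames

# Approach 2: from_dict with the column names passed directly.
from_dict_df = pd.DataFrame.from_dict(house_dict, orient="index", columns=colnames)

print(transposed.dtypes)
print(from_dict_df.dtypes)

# infer_objects() attempts to convert object columns to more specific types.
converted = transposed.infer_objects()
print(converted.dtypes)
```

Checking `dtypes` after a reshaping step is a cheap habit that can catch numeric columns that have silently become generic objects.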


## Applying Mathematical Functions to DataFrames


<font size="10">Like NumPy, Pandas has built-in mathematical functions that you can apply to series and data frames. Let's take a look at some of the basic ones.
</font>








In [80]:
# Total price of all houses sold
house_df['SalePrice'].sum()

1901400

In [81]:
# Average lot size of houses sold
house_df['LotSize'].mean()

10123.1

In [82]:
# The latest (i.e. most recent) year a house in the data set was built
house_df['YearBuilt'].max()

2004

In [83]:
# The earliest year a house in the data set was built
house_df['YearBuilt'].min() 

1915

In [86]:
# Span in years between the newest and oldest houses in the data set
house_df['YearBuilt'].max() - house_df['YearBuilt'].min()

89
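
When several statistics are needed for the same column, they can be requested together rather than one call at a time. The sketch below rebuilds the sale prices as a standalone series (the name `sale_prices` is an assumption made for this illustration) and uses `agg()` with a list of function names, followed by `describe()` for a standard summary.

```python
import pandas as pd

sale_prices = pd.Series([208500, 181500, 223500, 140000, 250000,
                         143000, 307000, 200000, 129900, 118000],
                        name="SalePrice")

# Several aggregates in one call; the result is a Series keyed by function name.
summary = sale_prices.agg(["min", "max", "mean", "median"])
print(summary)

# describe() reports count, mean, std, min, quartiles, and max.
print(sale_prices.describe())
```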

## Summary


<font size="10">
In this lesson, we learned about the two basic Pandas data structures: Series and DataFrames. We also learned how to reference the elements of a series and how to extract specific columns from a data frame. From there, we converted other Python data structures, such as lists and dictionaries, into Pandas DataFrames. Finally, we looked briefly at some basic mathematical functions that can be applied to a data frame's columns. We will use all of these concepts and functions as we progress through the program.
</font>








