[Return HOME](https://joshbutch.github.io)

<h1>Review of "Pandas for Everyone: Python Data Analysis"</h1>
<h4>Author: Daniel Chen</h4>
<h5>Reviewer: Josh Butch</h5>
<h5>Part One: 1/8/2019</h5>

Pandas is a very popular Python library for analyzing data, and one that is becoming more relevant than ever in the workplace.  This post is a collection of Pandas analysis examples.  I hope to use it as a future reference for code snippets and to demonstrate proficiency with the Pandas library.  The ability to organize and analyze data is relevant to every facet of data science and business intelligence/analysis.<br><br>
Pandas is open source and has a very strong support community online.  If you have questions about Pandas and have exhausted the documentation, check Stack Overflow.  Pandas introduces two new data types: `DataFrame` and `Series`.  Think of the `DataFrame` as a representation of an entire spreadsheet and a `Series` as a single column of that `DataFrame`.  Daniel Chen also says, "A Pandas `DataFrame` can also be thought of as a dictionary or collection of `Series` objects" (pg 1).<br>
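
To make the "dictionary of `Series`" idea concrete, here is a minimal sketch of my own (it is not from the book). The names `country`, `year`, and `toy_df` are hypothetical, and the three values are just a few entries borrowed from the GapMinder data used later in this post.

```python
import pandas as pd

# Two hypothetical columns, each built as its own Series
country = pd.Series(["Afghanistan", "Bangladesh", "Zimbabwe"])
year = pd.Series([1952, 1967, 2007])

# A dictionary of Series becomes a DataFrame: the keys turn into column names
toy_df = pd.DataFrame({"country": country, "year": year})

print(type(toy_df))          # the whole table is a DataFrame
print(type(toy_df["year"]))  # a single column pulled back out is a Series
print(toy_df.shape)          # (rows, columns)
```

Pulling a column back out with `toy_df["year"]` returns a `Series` again, which is the relationship the quote describes.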
<h5>Loading a Dataset</h5><br>
After selecting a dataset to analyze, it needs to be loaded so we can begin to look at its organizational structure.  Pandas is not part of the Python standard library, so it must be installed separately and then imported:

In [1]:
import pandas as pd #load pandas library and represent it as "pd" instead of "pandas" for brevity

At this point we have the `pandas` library loaded and can import our dataset.  The dataset Daniel Chen uses in his book for this section is the GapMinder dataset: www.gapminder.org. The dataset used for this analysis was prepared by Jennifer Bryan at the University of British Columbia: https://github.com/jennybc/gapminder.  We will load the dataset into a `DataFrame` named `df` which will serve as our primary variable for the first part of this analysis.

In [2]:
df = pd.read_csv('C:/Users/Josh/Desktop/data/gapminder.tsv', sep='\t') #Loading the dataset to 'df' and indicating the dataset is tab separated
print(type(df)) #Ensure we're working with a pandas DataFrame

<class 'pandas.core.frame.DataFrame'>


As you can see, `print(type(df))` reports a `pandas` `DataFrame`, so we're off to a good start.  Next we want to check how many rows and columns are in the dataset.  This begins our initial exploratory data analysis (EDA), where we will get to know the GapMinder data a little better.

In [3]:
print(df.shape) #Returns the number of rows and columns

(1704, 6)
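
If you don't have the file at the path above, the same two steps can be tried on an in-memory sample. This is only a sketch: `sample_tsv` and `sample_df` are names I made up, and the three rows are copied from output shown elsewhere in this post rather than being the full dataset.

```python
import io
import pandas as pd

# Hypothetical three-row, tab-separated sample using the GapMinder column names
sample_tsv = (
    "country\tcontinent\tyear\tlifeExp\tpop\tgdpPercap\n"
    "Afghanistan\tAsia\t1952\t28.801\t8425333\t779.445314\n"
    "Afghanistan\tAsia\t1957\t30.332\t9240934\t820.853030\n"
    "Zimbabwe\tAfrica\t2007\t43.487\t12311143\t469.709298\n"
)

# read_csv accepts a file-like object as well as a file path
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)
```

Because `shape` is a plain tuple, it can be unpacked directly into a row count and a column count.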


So there are 1,704 rows and six columns in the GapMinder dataset.  Next, let's take a look at the column header names.

In [4]:
print(df.columns) #Returns the column names

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')


Now that we have column names, I find myself immediately making assumptions about this data.  The column names are: country, continent, year, lifeExp, pop, and gdpPercap.  I immediately thought of looking at life expectancy based on gdpPercap.  There are also opportunities to look at the global distribution of wealth and life expectancy, and at how those have changed over time.  There's going to be a lot of opportunity to derive meaningful insights from this dataset, so I'm excited to dig in further.  It's a good idea to verify the type of data in each `Series` of your `DataFrame`.

In [5]:
print(df.dtypes) #Returns the data type of each column (or Series) in the dataset

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object


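The `object` data type in `pandas` is most commonly used for columns of text, so here it corresponds to Python's `str` type (strictly speaking, an `object` column can hold any Python object).  `int64` and `float64` are 64-bit integers and floating-point numbers and should be pretty self-explanatory.  Let's get a little more info about our dataset.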

In [6]:
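print(df.info()) #Return more detailed information about the Series in our dataset; info() prints the summary itself and returns None, which is why 'None' appears at the end (calling df.info() on its own is enough)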

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None
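
As a side note on the `object` dtype discussed above, here is a small sketch of my own showing that the values in a text column are ordinary Python strings. The frame `mini_df` is hypothetical. One assumption to flag: on the pandas version used in this post text columns report as `object`, but newer releases may report a dedicated string dtype instead, so the sketch does not rely on the dtype's name.

```python
import pandas as pd

# Hypothetical two-row frame mirroring three of the GapMinder columns
mini_df = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "year": [1952, 2007],
    "lifeExp": [28.801, 43.487],
})

print(mini_df.dtypes)

# Each element of the text column is an ordinary Python str
first_country = mini_df["country"].iloc[0]
print(type(first_country))

# select_dtypes keeps only the columns of a given kind, here the numeric ones
numeric_cols = mini_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

`select_dtypes` is handy once you know the column types, since it lets you separate the numeric columns from the text ones without listing them by hand.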


<h5>Examining the Different Columns, Rows, and Cells</h5><br>
Now that our dataset is loaded and we understand a bit about the data types, we can begin to inspect its elements a little more closely.  We can subset our dataset by selecting specific columns, rows, and/or cells to make the data more manageable and return only the data relevant to our work.

In [7]:
country_df = df['country'] #Create a variable that only contains the Series named 'country'

In [8]:
print(country_df.head()) #Print the first 5 rows of 'country' column

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object


In [9]:
print(country_df.tail()) #Prints the last 5 rows of 'country' column

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object


In [10]:
subset = df[['country','continent','year']] #Creates a variable 'subset' with 3 Series (or columns) of data

In [11]:
print(subset.head()) #Print the first 5 rows of our new subset

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972


In [12]:
print(subset.tail()) #Print the last 5 rows of subset

       country continent  year
1699  Zimbabwe    Africa  1987
1700  Zimbabwe    Africa  1992
1701  Zimbabwe    Africa  1997
1702  Zimbabwe    Africa  2002
1703  Zimbabwe    Africa  2007
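
One detail worth spelling out from the cells above is the difference between single and double square brackets. The following is a minimal sketch on a hypothetical three-row frame (`toy_df` and the other variable names are mine, not the book's).

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the subset above
toy_df = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", "Zimbabwe"],
    "continent": ["Asia", "Asia", "Africa"],
    "year": [1952, 1967, 2007],
})

one_col = toy_df["country"]             # single brackets with a name -> Series
one_col_df = toy_df[["country"]]        # a list holding one name -> one-column DataFrame
two_cols = toy_df[["country", "year"]]  # a list of names -> DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(two_cols.shape)
```

The inner pair of brackets is just a Python list of column names, which is why passing a list always gives back a `DataFrame`, even when the list has only one entry.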


We can subset rows in a few different ways, and the approach is similar to subsetting columns.  Every row has an 'index label', shown in the unnamed leftmost column of the printed output; by default it is simply the row number.  These 'index labels' play the same role for rows that column names play for columns, and are easy references to point to when analyzing data.

In [13]:
print(df.head()) #Print the first 5 rows of the entire dataset

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


In [14]:
print(df.loc[0]) #Return the first row by calling 'index label' 0

country      Afghanistan
continent           Asia
year                1952
lifeExp           28.801
pop              8425333
gdpPercap        779.445
Name: 0, dtype: object


In [15]:
print(df.loc[99]) #Returns the 100th row (remember to start counting from zero)

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap       721.186
Name: 99, dtype: object


In [16]:
print(df.tail(n=1)) #Returns the last row - otherwise do print(df.shape) and subtract 1 from rows

       country continent  year  lifeExp       pop   gdpPercap
1703  Zimbabwe    Africa  2007   43.487  12311143  469.709298


In [17]:
print(df.loc[[0,99,999]]) #Returns the 1st, 100th, and 1000th rows

         country continent  year  lifeExp       pop    gdpPercap
0    Afghanistan      Asia  1952   28.801   8425333   779.445314
99    Bangladesh      Asia  1967   43.453  62821884   721.186086
999     Mongolia      Asia  1967   51.253   1149500  1226.041130
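
In the GapMinder data the index labels happen to match the row positions, which hides the difference between selecting by label (`loc`) and selecting by position (`iloc`). Here is a small sketch of my own on a hypothetical frame whose labels are deliberately 10, 20, and 30.

```python
import pandas as pd

# Hypothetical frame whose index labels are deliberately not 0, 1, 2
toy_df = pd.DataFrame(
    {"country": ["Afghanistan", "Bangladesh", "Mongolia"],
     "year": [1952, 1967, 1967]},
    index=[10, 20, 30],
)

by_label = toy_df.loc[10]     # the row whose index label is 10
by_position = toy_df.iloc[0]  # the row at position 0, whatever its label
last_row = toy_df.iloc[-1]    # iloc accepts negative positions, like a list

print(by_label["country"], by_position["country"], last_row["country"])

# loc[-1] raises a KeyError here because no row carries the label -1
try:
    toy_df.loc[-1]
except KeyError:
    print("no row labelled -1")
```

This is also why `iloc[-1]` is a convenient way to grab the last row, while `loc[-1]` only works if a row is actually labelled -1.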


In [19]:
small_range = list(range(5)) #Create a list of integers 0-4 and store them as 'small_range'
print(small_range) #Print the contents of 'small_range'

[0, 1, 2, 3, 4]


In [20]:
subset = df.iloc[:, small_range] #Subset the dataset by position: keep all rows (:) and the columns at positions 0-4 stored in 'small_range'
print(subset.head()) #Prints the first 5 rows of the subset - every column except 'gdpPercap'

       country continent  year  lifeExp       pop
0  Afghanistan      Asia  1952   28.801   8425333
1  Afghanistan      Asia  1957   30.332   9240934
2  Afghanistan      Asia  1962   31.997  10267083
3  Afghanistan      Asia  1967   34.020  11537966
4  Afghanistan      Asia  1972   36.088  13079460
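
The general pattern is `df.iloc[rows, columns]`, with both parts given as positions. Here is a minimal sketch on a hypothetical 3x4 frame (the variable names are mine) showing column selection, row selection, and a single cell.

```python
import pandas as pd

# Hypothetical 3x4 frame reusing GapMinder column names
toy_df = pd.DataFrame({
    "country": ["Afghanistan", "Bangladesh", "Zimbabwe"],
    "continent": ["Asia", "Asia", "Africa"],
    "year": [1952, 1967, 2007],
    "lifeExp": [28.801, 43.453, 43.487],
})

first_two_cols = toy_df.iloc[:, [0, 1]]  # all rows, columns at positions 0 and 1
first_two_rows = toy_df.iloc[[0, 1], :]  # rows at positions 0 and 1, all columns
one_cell = toy_df.iloc[2, 3]             # row position 2, column position 3

print(list(first_two_cols.columns))
print(first_two_rows.shape)
print(one_cell)
```

The part before the comma always refers to rows and the part after it to columns, so a bare `:` on either side means "keep everything" along that axis.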


Now that we know how to load a dataset, look at its shape (rows and columns), data types (dtypes), column names, and overall structure, and navigate by row and column, we can begin to do some basic analysis.  That is where I will pick up in the next part.  This will be a running document as I work my way through the book.

<h5>Part 2: Grouped and Aggregated Calculations</h5>
<h5>Reviewer: Josh Butch</h5>
<h5>Date: TBD (expected before 1/14/2019)</h5>
   