# Chapter 02: Working with Pandas ``DataFrame`` Objects

## Setup

In this section, I import the packages used throughout this notebook.

In [50]:
# Necessary imports
import pandas as pd
import numpy as np

## Creating a ``DataFrame`` Object from a CSV File

### Reading a CSV File

I can read a CSV file with ``pd.read_csv()`` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). Pandas usually infers sensible options from the input data, so I often won't need to add arguments to the call. When I do need them, many options are available, including the following:

| Parameter | Purpose |
| --- | --- |
| `sep` | Specifies the delimiter |
| `header` | Row number where the column names are located; the default option has `pandas` infer whether they are present |
| `names` | List of column names to use as the header |
| `index_col` | Column to use as the index |
| `usecols` | Specifies which columns to read in |
| `dtype` | Specifies data types for the columns | 
| `converters` | Specifies functions for converting data in certain columns |
| `skiprows` | Rows to skip |
| `nrows` | Number of rows to read (combine with `skiprows` to read a file bit by bit) |
| `parse_dates` | Automatically parse columns containing dates into datetime objects |
| `chunksize` | For reading the file in chunks |
| `compression` | For reading in compressed files without extracting beforehand |
| `encoding` | Specifies the file encoding |
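
As a minimal sketch of how a few of these parameters combine, the example below reads a small, made-up semicolon-delimited CSV from memory. The column names (`id`, `when`, `mag`, `place`) and their values are hypothetical and are not taken from the earthquake file.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data standing in for a real file
csv_text = (
    "id;when;mag;place\n"
    "a1;2018-10-13;1.35;Aguanga\n"
    "a2;2018-10-12;3.42;Avenal\n"
    "a3;2018-10-11;0.44;Julian\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                         # non-default delimiter
    index_col="id",                  # use the id column as the index
    usecols=["id", "when", "mag"],   # leave out the place column
    dtype={"mag": "float64"},        # force the type of one column
    parse_dates=["when"],            # parse this column into datetimes
    nrows=2,                         # read only the first two data rows
)

# With chunksize, read_csv yields DataFrame pieces instead of one DataFrame
chunk_lengths = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), sep=";", chunksize=2)
]
```

Here ``df`` holds two rows and two columns, and ``chunk_lengths`` records how many rows arrived in each chunk.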

Notably, I could pull the data from:

* The CSV file stored in the data directory (``data/earthquakes.csv``)
* The GitHub repository for this book (https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/blob/master/ch_02/data/earthquakes.csv?raw=True)

I will use the pre-downloaded CSV file to keep things simple.

In [46]:
csv_file_path = 'data/earthquakes.csv'
"""
str: The path to the earthquake data from September 18, 2018 - October 13, 2018

Obtained from the US Geological Survey (USGS) using the USGS API
"""

df_csv = pd.read_csv(csv_file_path)
"""
pandas.DataFrame: The earthquake data loaded from the CSV file above.

This is the test DataFrame examined throughout this notebook.
"""

df_csv

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.020030,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.021370,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.026180,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.077990,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9327,,,73086771,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018060,,185.0,",nc73086771,",0.62,md,...,",nc,",reviewed,1537230228060,"M 0.6 - 9km ENE of Mammoth Lakes, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1537285598315,https://earthquake.usgs.gov/earthquakes/eventp...
9328,,,38063967,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.030410,,50.0,",ci38063967,",1.00,ml,...,",ci,",reviewed,1537230135130,"M 1.0 - 3km W of Julian, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1537276800970,https://earthquake.usgs.gov/earthquakes/eventp...
9329,,,2018261000,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.452600,,276.0,",pr2018261000,",2.40,md,...,",pr,",reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...
9330,,,38063959,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.018650,,61.0,",ci38063959,",1.10,ml,...,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...


## Inspecting a ``DataFrame`` Object

Now I can inspect the ``df_csv`` object using the properties and methods below.

### Whether the ``DataFrame`` Object Is Empty or Not

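The ``empty`` property returns ``True`` if the ``DataFrame`` object has no rows or no columns. Otherwise, it returns ``False``. Missing values still count as data, so a ``DataFrame`` object that contains only ``NaN`` values is not empty.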

In [11]:
# Checks to see if DataFrame object is empty or not
df_csv.empty

False
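
To illustrate the difference between "no rows" and "only missing values", here is a small sketch using two made-up DataFrames (the names ``no_rows`` and ``all_missing`` are just for this example):

```python
import numpy as np
import pandas as pd

# A DataFrame with column labels but no rows
no_rows = pd.DataFrame(columns=["mag", "place"])

# A DataFrame whose only values are missing
all_missing = pd.DataFrame({"mag": [np.nan, np.nan]})

print(no_rows.empty)      # True
print(all_missing.empty)  # False
```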

### The Dimensions of the ``DataFrame`` Object

The ``shape`` property returns a tuple containing the number of rows and columns in the ``DataFrame`` object, in that order.

In [14]:
# Get the shape of the DataFrame object
df_csv.shape

(9332, 26)
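
Because ``shape`` is a plain tuple, it can be unpacked directly. A quick sketch on a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical three-row, two-column DataFrame
frame = pd.DataFrame({"mag": [1.35, 1.29, 3.42], "tsunami": [0, 0, 0]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = frame.shape

# len() gives the row count; size gives rows * columns
row_count = len(frame)
cell_count = frame.size
```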

### The Columns Inside the ``DataFrame`` Object

The ``columns`` property returns an ``Index`` object containing the label of every column in the ``DataFrame`` object.

In [26]:
# Check that the column labels can be iterated over
columns = df_csv.columns
for col in columns:
    print(col)

# Display the Index object returned by df_csv.columns
columns

alert
cdi
code
detail
dmin
felt
gap
ids
mag
magType
mmi
net
nst
place
rms
sig
sources
status
time
title
tsunami
type
types
tz
updated
url


Index(['alert', 'cdi', 'code', 'detail', 'dmin', 'felt', 'gap', 'ids', 'mag',
       'magType', 'mmi', 'net', 'nst', 'place', 'rms', 'sig', 'sources',
       'status', 'time', 'title', 'tsunami', 'type', 'types', 'tz', 'updated',
       'url'],
      dtype='object')
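
Beyond iteration, the ``Index`` object supports membership tests and conversion to a plain list. A short sketch, using a made-up one-row DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with three columns
frame = pd.DataFrame({"mag": [1.35], "place": ["Aguanga"], "tsunami": [0]})

# Membership test against the column labels
has_mag = "mag" in frame.columns

# Convert the Index to a regular Python list
labels = frame.columns.tolist()

# Filter the labels with ordinary Python, e.g. those starting with "m"
m_columns = [label for label in frame.columns if label.startswith("m")]
```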

### The Head and Tail of the ``DataFrame`` Object

The ``head()`` method returns the first rows of the ``DataFrame`` object. If no integer argument is passed, it returns the first 5 rows.

In [29]:
# Get the head of the DataFrame object
df_csv.head()

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...
3,,,37389186,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02618,,39.0,",ci37389186,",0.44,ml,...,",ci,",automatic,1539474978070,"M 0.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475196167,https://earthquake.usgs.gov/earthquakes/eventp...
4,,,73096941,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.07799,,192.0,",nc73096941,",2.16,md,...,",nc,",automatic,1539474716050,"M 2.2 - 10km NW of Avenal, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,scit...",-480.0,1539477547926,https://earthquake.usgs.gov/earthquakes/eventp...


The ``tail()`` method works the same way as ``head()`` but returns the last rows of the ``DataFrame`` object. Here I ask for the last 2 rows.

In [32]:
# Get the tail of the DataFrame object
df_csv.tail(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
9330,,,38063959,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.01865,,61.0,",ci38063959,",1.1,ml,...,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...
9331,,,38063935,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.01698,,39.0,",ci38063935,",0.66,ml,...,",ci,",reviewed,1537228864470,"M 0.7 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537305830770,https://earthquake.usgs.gov/earthquakes/eventp...
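
Both methods accept an integer, and a negative value means "everything except that many rows". A small sketch on a made-up seven-row DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame with seven rows (index 0 through 6)
frame = pd.DataFrame({"mag": [1.35, 1.29, 3.42, 0.44, 2.16, 0.62, 1.00]})

default_head = frame.head()       # first 5 rows by default
first_two = frame.head(2)         # first 2 rows
last_two = frame.tail(2)          # last 2 rows
all_but_last_two = frame.head(-2) # every row except the last 2
```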


### The Data Types Within the ``DataFrame`` Object

The ``dtypes`` property returns the type of data within each column of the ``DataFrame`` object.

In [35]:
# Get the data types in each column of the DataFrame object
df_csv.dtypes

alert       object
cdi        float64
code        object
detail      object
dmin       float64
felt       float64
gap        float64
ids         object
mag        float64
magType     object
mmi        float64
net         object
nst        float64
place       object
rms        float64
sig          int64
sources     object
status      object
time         int64
title       object
tsunami      int64
type        object
types       object
tz         float64
updated      int64
url         object
dtype: object
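
Once I know the data types, I can pick out columns by type with ``select_dtypes()``. This is a sketch on a made-up DataFrame rather than on ``df_csv``:

```python
import pandas as pd

# Hypothetical DataFrame mixing float, integer, and text columns
frame = pd.DataFrame(
    {
        "mag": [1.35, 3.42],
        "tsunami": [0, 0],
        "place": ["Aguanga", "Avenal"],
    }
)

# Keep only the numeric (integer and float) columns
numeric = frame.select_dtypes(include="number")

# Keep everything that is not numeric
non_numeric = frame.select_dtypes(exclude="number")
```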

### Getting Extra Information and Location of ``null`` Values in the ``DataFrame`` Object

The ``info()`` method prints a summary of the ``DataFrame`` object. It covers much of what the properties and methods above report (the index range, the number of columns, and each column's data type) and also shows the number of non-null values in each column. Comparing that count against the total number of entries reveals how many values are missing.

In [38]:
# Get a lot of information about the DataFrame object
df_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9332 entries, 0 to 9331
Data columns (total 26 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alert    59 non-null     object 
 1   cdi      329 non-null    float64
 2   code     9332 non-null   object 
 3   detail   9332 non-null   object 
 4   dmin     6139 non-null   float64
 5   felt     329 non-null    float64
 6   gap      6164 non-null   float64
 7   ids      9332 non-null   object 
 8   mag      9331 non-null   float64
 9   magType  9331 non-null   object 
 10  mmi      93 non-null     float64
 11  net      9332 non-null   object 
 12  nst      5364 non-null   float64
 13  place    9332 non-null   object 
 14  rms      9332 non-null   float64
 15  sig      9332 non-null   int64  
 16  sources  9332 non-null   object 
 17  status   9332 non-null   object 
 18  time     9332 non-null   int64  
 19  title    9332 non-null   object 
 20  tsunami  9332 non-null   int64  
 21  type     9332 
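
Since ``info()`` reports non-null counts, the number of missing values per column has to be derived. One way to get it directly is ``isna().sum()``, sketched here on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with some missing values
frame = pd.DataFrame(
    {
        "mag": [1.35, np.nan, 3.42],
        "alert": [None, None, "green"],
        "code": ["a", "b", "c"],
    }
)

# Number of missing values in each column
null_counts = frame.isna().sum()

# Number of non-missing values in each column (what info() reports)
non_null_counts = frame.count()
```

For every column, ``null_counts`` plus ``non_null_counts`` equals the number of rows.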

## Describing & Summarizing ``DataFrame`` Objects

The ``describe()`` method provides a quick summary of the numeric (integer and float) columns in a ``DataFrame`` object. Called without any arguments, it reports the following statistics for each of those columns:

* Count
* Mean
* Standard deviation
* Minimum
* 25<sup>th</sup> percentile
* 50<sup>th</sup> percentile (median)
* 75<sup>th</sup> percentile
* Maximum

In [41]:
# Base describe() method
df_csv.describe()

Unnamed: 0,cdi,dmin,felt,gap,mag,mmi,nst,rms,sig,time,tsunami,tz,updated
count,329.0,6139.0,329.0,6164.0,9331.0,93.0,5364.0,9332.0,9332.0,9332.0,9332.0,9331.0,9332.0
mean,2.754711,0.544925,12.31003,121.506588,1.497345,3.651398,19.053878,0.362122,56.899914,1538284000000.0,0.006537,-451.99014,1538537000000.0
std,1.010637,2.214305,48.954944,72.962363,1.203347,1.790523,15.492315,0.317784,91.872163,608030600.0,0.080589,231.752571,656413500.0
min,0.0,0.000648,0.0,12.0,-1.26,0.0,0.0,0.0,0.0,1537229000000.0,0.0,-720.0,1537230000000.0
25%,2.0,0.020425,1.0,66.1425,0.72,2.68,8.0,0.119675,8.0,1537793000000.0,0.0,-540.0,1537996000000.0
50%,2.7,0.05905,2.0,105.0,1.3,3.72,15.0,0.21,26.0,1538245000000.0,0.0,-480.0,1538621000000.0
75%,3.3,0.17725,5.0,159.0,1.9,4.57,25.0,0.59,56.0,1538766000000.0,0.0,-480.0,1539110000000.0
max,8.4,53.737,580.0,355.91,7.5,9.12,172.0,1.91,2015.0,1539475000000.0,1.0,720.0,1539537000000.0


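I can also choose which percentiles are reported with the ``percentiles`` argument (for example, ``percentiles=[a, b]``), where each value lies between 0 and 1, so $0 \leq$ ``a`` $<$ ``b`` $\leq 1$. As the output below shows, the 50% row (the median) still appears even though I did not request it.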

In [47]:
floor = 0.05
"""
float: The lowest percentile to look at
"""
ceiling = 0.95
"""
float: The highest percentile to look at
"""
percentile_range = [floor, ceiling]
"""
list(float, float): The floor and ceiling to use for our example percentile range
"""

# Describe the DataFrame object using the custom percentiles
df_csv.describe(percentiles=percentile_range)

Unnamed: 0,cdi,dmin,felt,gap,mag,mmi,nst,rms,sig,time,tsunami,tz,updated
count,329.0,6139.0,329.0,6164.0,9331.0,93.0,5364.0,9332.0,9332.0,9332.0,9332.0,9331.0,9332.0
mean,2.754711,0.544925,12.31003,121.506588,1.497345,3.651398,19.053878,0.362122,56.899914,1538284000000.0,0.006537,-451.99014,1538537000000.0
std,1.010637,2.214305,48.954944,72.962363,1.203347,1.790523,15.492315,0.317784,91.872163,608030600.0,0.080589,231.752571,656413500.0
min,0.0,0.000648,0.0,12.0,-1.26,0.0,0.0,0.0,0.0,1537229000000.0,0.0,-720.0,1537230000000.0
5%,2.0,0.005491,1.0,35.0,-0.04,0.0,4.0,0.03,0.0,1537344000000.0,0.0,-600.0,1537387000000.0
50%,2.7,0.05905,2.0,105.0,1.3,3.72,15.0,0.21,26.0,1538245000000.0,0.0,-480.0,1538621000000.0
95%,4.3,2.6789,40.2,276.0,4.4,6.38,49.0,0.96,298.0,1539319000000.0,0.0,-60.0,1539400000000.0
max,8.4,53.737,580.0,355.91,7.5,9.12,172.0,1.91,2015.0,1539475000000.0,1.0,720.0,1539537000000.0
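
To see where the percentile rows come from, here is a sketch on a tiny made-up DataFrame where the values are easy to check by hand. The column name ``mag`` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical single-column DataFrame with evenly spaced values
frame = pd.DataFrame({"mag": [0.0, 1.0, 2.0, 3.0, 4.0]})

# Request the 10th and 90th percentiles instead of the default 25th and 75th
summary = frame.describe(percentiles=[0.1, 0.9])

# Each requested percentile becomes a row labeled with a percent sign
tenth = summary.loc["10%", "mag"]
ninetieth = summary.loc["90%", "mag"]
```

With the default linear interpolation, the 10th percentile of these five values falls at 0.4 and the 90th at 3.6.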


The ``include`` argument specifies which data types to summarize. Setting ``include='all'`` summarizes every column in the ``DataFrame`` object, regardless of its data type.

In [53]:
data_type = object
"""
object: The data type to test the include argument with
"""
df_csv.describe(include = data_type)

Unnamed: 0,alert,code,detail,ids,magType,net,place,sources,status,title,type,types,url
count,59,9332,9332,9332,9331,9332,9332,9332,9332,9332,9332,9332,9332
unique,2,9332,9332,9332,10,14,5433,52,2,7807,5,42,9332
top,green,37366074,https://earthquake.usgs.gov/fdsnws/event/1/que...,",ak20266927,",ml,ak,"10km NE of Aguanga, CA",",ak,",reviewed,"M 0.4 - 10km NE of Aguanga, CA",earthquake,",geoserve,origin,phase-data,",https://earthquake.usgs.gov/earthquakes/eventp...
freq,58,1,1,1,6803,3166,306,2981,7797,55,9081,5301,1
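
The sketch below compares ``include=[object]`` with ``include='all'`` on a small made-up DataFrame. The ``status`` column is created with an explicit ``object`` data type so that the example matches the data types shown earlier for ``df_csv``; that choice is an assumption made for this illustration.

```python
import pandas as pd

# Hypothetical DataFrame with one numeric and one object column
frame = pd.DataFrame(
    {
        "mag": [1.35, 3.42, 0.44],
        "status": pd.Series(
            ["automatic", "reviewed", "reviewed"], dtype=object
        ),
    }
)

# Summarize only the object columns: count, unique, top, freq
text_only = frame.describe(include=[object])

# Summarize every column; statistics that do not apply are left as NaN
everything = frame.describe(include="all")
```

In ``everything``, the numeric column has no value for rows such as ``unique``, and the object column has none for rows such as ``mean``.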

