# Inspecting a DataFrame Object

## About the Data
In this notebook, we will be working with earthquake data obtained from the US Geological Survey (USGS) using the [USGS API](https://earthquake.usgs.gov/fdsnws/event/1/). The `time` values in the file are millisecond Unix timestamps running from April 30, 2021 through May 29, 2021 (UTC).

## Setup
We will be working with the `earthquakes.csv` file again, so we need to handle our imports and read it in.

In [4]:
import numpy as np
import pandas as pd

df = pd.read_csv('earthquakes.csv')

## Examining dataframes
### Is it empty?

In [5]:
df.empty

False
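Keep in mind that `empty` only tells us whether one of the axes has length zero; a dataframe that contains nothing but missing values is not considered empty. Here is a minimal sketch using toy dataframes (the names `no_rows` and `all_nan` are made up for illustration and are not part of the earthquake data):

```python
import numpy as np
import pandas as pd

# toy dataframes for illustration only (not the earthquake data)
no_rows = pd.DataFrame(columns=['mag', 'place'])   # columns, but zero rows
all_nan = pd.DataFrame({'mag': [np.nan, np.nan]})  # rows exist, values are missing

print(no_rows.empty)  # True: one axis has length 0
print(all_nan.empty)  # False: rows of NaN still count as rows

# to check for "nothing but nulls", drop the missing rows first
print(all_nan.dropna().empty)  # True
```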

### What are the dimensions?

In [6]:
df.shape

(10232, 26)
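Since `shape` is a `(rows, columns)` tuple, we can unpack it when we need the counts separately. A small sketch on a toy dataframe (`toy` is a stand-in for `df`, with made-up values):

```python
import pandas as pd

# toy stand-in for the earthquake dataframe
toy = pd.DataFrame({'mag': [2.0, 0.35, 4.5], 'place': ['A', 'B', 'C']})

n_rows, n_cols = toy.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)       # 3 2

# len() gives the row count; size gives rows * columns
print(len(toy), toy.size)   # 3 6
```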

### What columns do we have?
We know there are 26 columns, but what are they? Let's use the `columns` attribute to see:

In [7]:
df.columns

Index(['mag', 'place', 'time', 'updated', 'tz', 'url', 'detail', 'felt', 'cdi',
       'mmi', 'alert', 'status', 'tsunami', 'sig', 'net', 'code', 'ids',
       'sources', 'types', 'nst', 'dmin', 'rms', 'gap', 'magType', 'type',
       'title'],
      dtype='object')
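The `columns` attribute is an `Index` object, so we can test membership on it or turn it into a plain list. The sketch below uses a toy dataframe; the `depth` column is a hypothetical name chosen only because it is absent:

```python
import pandas as pd

# toy stand-in with a few of the column names shown above
toy = pd.DataFrame({'mag': [2.0], 'place': ['A'], 'magType': ['ml']})

has_mag = 'mag' in toy.columns   # membership test on the Index
col_list = toy.columns.tolist()  # plain Python list of column names

# find which of the columns we expect are not present
missing = [col for col in ['mag', 'depth'] if col not in toy.columns]

print(has_mag, col_list, missing)
```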

### What does the data look like?
View rows from the top with `head()`:

In [8]:
df.head()

Unnamed: 0,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,...,ids,sources,types,nst,dmin,rms,gap,magType,type,title
0,2.0,"23 km N of Willow, Alaska",1622332708509,1622333534854,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",ak0216ut468m,",",ak,",",origin,",,,0.62,,ml,earthquake,"M 2.0 - 23 km N of Willow, Alaska"
1,0.35,"2km N of The Geysers, CA",1622331217450,1622338572406,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",nc73567366,",",nc,",",nearby-cities,origin,phase-data,scitech-link,",14.0,0.005785,0.02,120.0,md,earthquake,"M 0.4 - 2km N of The Geysers, CA"
2,4.5,"91 km W of San Antonio de los Cobres, Argentina",1622330654734,1622398852359,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.0,1.0,,...,",us6000efw3,",",us,",",dyfi,origin,phase-data,",,1.451,0.51,106.0,mb,earthquake,"M 4.5 - 91 km W of San Antonio de los Cobres, ..."
3,2.16,"18km E of Ocotillo, CA",1622330638080,1622330856366,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",ci39909792,",",ci,",",nearby-cities,origin,phase-data,scitech-link,",25.0,0.09278,0.28,57.0,ml,earthquake,"M 2.2 - 18km E of Ocotillo, CA"
4,4.4,"115 km WNW of San Antonio de los Cobres, Argen...",1622330295993,1622332202040,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",us6000efw1,",",us,",",origin,phase-data,",,1.111,0.44,100.0,mb,earthquake,M 4.4 - 115 km WNW of San Antonio de los Cobre...


View rows from the bottom with `tail()`. Let's view 2 rows:

In [9]:
df.tail(2)

Unnamed: 0,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,...,ids,sources,types,nst,dmin,rms,gap,magType,type,title
10230,0.89,"15km E of Seven Trees, CA",1619741326460,1619753767570,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",nc73556285,",",nc,",",nearby-cities,origin,phase-data,",17.0,0.02941,0.06,56.0,md,earthquake,"M 0.9 - 15km E of Seven Trees, CA"
10231,0.15,"6km NW of The Geysers, CA",1619740962850,1619741059078,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",nc73556280,",",nc,",",nearby-cities,origin,phase-data,",11.0,0.006624,0.02,64.0,md,earthquake,"M 0.2 - 6km NW of The Geysers, CA"
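Both `head()` and `tail()` accept the number of rows to return, and a negative number means "everything except that many rows". A minimal sketch on a toy dataframe with made-up magnitudes:

```python
import pandas as pd

# toy stand-in with a default RangeIndex of 0 through 4
toy = pd.DataFrame({'mag': [2.0, 0.35, 4.5, 2.16, 4.4]})

first_two = toy.head(2)          # rows 0 and 1
last_two = toy.tail(2)           # rows 3 and 4
all_but_last_two = toy.head(-2)  # rows 0, 1, and 2

print(all_but_last_two)
```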


In [17]:
pd.set_option('display.max_columns', 26)
df.head()

Unnamed: 0,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,alert,status,tsunami,sig,net,code,ids,sources,types,nst,dmin,rms,gap,magType,type,title
0,2.0,"23 km N of Willow, Alaska",1622332708509,1622333534854,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,automatic,0,62,ak,0216ut468m,",ak0216ut468m,",",ak,",",origin,",,,0.62,,ml,earthquake,"M 2.0 - 23 km N of Willow, Alaska"
1,0.35,"2km N of The Geysers, CA",1622331217450,1622338572406,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,automatic,0,2,nc,73567366,",nc73567366,",",nc,",",nearby-cities,origin,phase-data,scitech-link,",14.0,0.005785,0.02,120.0,md,earthquake,"M 0.4 - 2km N of The Geysers, CA"
2,4.5,"91 km W of San Antonio de los Cobres, Argentina",1622330654734,1622398852359,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.0,1.0,,,reviewed,0,312,us,6000efw3,",us6000efw3,",",us,",",dyfi,origin,phase-data,",,1.451,0.51,106.0,mb,earthquake,"M 4.5 - 91 km W of San Antonio de los Cobres, ..."
3,2.16,"18km E of Ocotillo, CA",1622330638080,1622330856366,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,automatic,0,72,ci,39909792,",ci39909792,",",ci,",",nearby-cities,origin,phase-data,scitech-link,",25.0,0.09278,0.28,57.0,ml,earthquake,"M 2.2 - 18km E of Ocotillo, CA"
4,4.4,"115 km WNW of San Antonio de los Cobres, Argen...",1622330295993,1622332202040,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,,reviewed,0,298,us,6000efw1,",us6000efw1,",",us,",",origin,phase-data,",,1.111,0.44,100.0,mb,earthquake,M 4.4 - 115 km WNW of San Antonio de los Cobre...


In [18]:
pd.reset_option('display.max_columns')
df.head()

Unnamed: 0,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,...,ids,sources,types,nst,dmin,rms,gap,magType,type,title
0,2.0,"23 km N of Willow, Alaska",1622332708509,1622333534854,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",ak0216ut468m,",",ak,",",origin,",,,0.62,,ml,earthquake,"M 2.0 - 23 km N of Willow, Alaska"
1,0.35,"2km N of The Geysers, CA",1622331217450,1622338572406,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",nc73567366,",",nc,",",nearby-cities,origin,phase-data,scitech-link,",14.0,0.005785,0.02,120.0,md,earthquake,"M 0.4 - 2km N of The Geysers, CA"
2,4.5,"91 km W of San Antonio de los Cobres, Argentina",1622330654734,1622398852359,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.0,1.0,,...,",us6000efw3,",",us,",",dyfi,origin,phase-data,",,1.451,0.51,106.0,mb,earthquake,"M 4.5 - 91 km W of San Antonio de los Cobres, ..."
3,2.16,"18km E of Ocotillo, CA",1622330638080,1622330856366,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",ci39909792,",",ci,",",nearby-cities,origin,phase-data,scitech-link,",25.0,0.09278,0.28,57.0,ml,earthquake,"M 2.2 - 18km E of Ocotillo, CA"
4,4.4,"115 km WNW of San Antonio de los Cobres, Argen...",1622330295993,1622332202040,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",us6000efw1,",",us,",",origin,phase-data,",,1.111,0.44,100.0,mb,earthquake,M 4.4 - 115 km WNW of San Antonio de los Cobre...


*Tip: we can modify the display options in order to see more columns:*

```python
# check the max columns setting
>>> pd.get_option('display.max_columns')
20

# set the max columns to show when printing the dataframe to 26
>>> pd.set_option('display.max_columns', 26)
# OR
>>> pd.options.display.max_columns = 26

# reset the option
>>> pd.reset_option('display.max_columns')

# get information on all display settings
>>> pd.describe_option('display')
```

*More information can be found in the documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).*
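If we only need the wider display for a single cell, `pd.option_context()` changes the setting temporarily and restores it when the `with` block ends, so there is no need to call `reset_option()` afterwards. A sketch with a made-up 30-column dataframe (`wide` is a hypothetical name):

```python
import pandas as pd

# a single-row dataframe with 30 columns, for illustration only
wide = pd.DataFrame([range(30)], columns=[f'col_{i}' for i in range(30)])

default_max = pd.get_option('display.max_columns')

# the setting only applies inside the with block
with pd.option_context('display.max_columns', 30):
    inside = pd.get_option('display.max_columns')
    print(wide)

# once the block ends, the previous value is back
after = pd.get_option('display.max_columns')
print(inside, after == default_max)
```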

### What data types do we have?

In [19]:
df.dtypes

mag        float64
place       object
time         int64
updated      int64
tz         float64
url         object
detail      object
felt       float64
cdi        float64
mmi        float64
alert       object
status      object
tsunami      int64
sig          int64
net         object
code        object
ids         object
sources     object
types       object
nst        float64
dmin       float64
rms        float64
gap        float64
magType     object
type        object
title       object
dtype: object
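Once we know the data types, `select_dtypes()` lets us pick out the columns of a given kind. The following sketch uses a toy dataframe with made-up values; it separates numeric columns from everything else:

```python
import pandas as pd

# toy stand-in mixing float, integer, and text columns
toy = pd.DataFrame({
    'mag': [2.0, 0.35],
    'tsunami': [0, 1],
    'place': ['A', 'B'],
})

# 'number' matches both the integer and float columns
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)  # ['mag', 'tsunami']
print(other_cols)    # ['place']
```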

### Getting extra info and finding nulls

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10232 entries, 0 to 10231
Data columns (total 26 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   mag      10232 non-null  float64
 1   place    10232 non-null  object 
 2   time     10232 non-null  int64  
 3   updated  10232 non-null  int64  
 4   tz       0 non-null      float64
 5   url      10232 non-null  object 
 6   detail   10232 non-null  object 
 7   felt     540 non-null    float64
 8   cdi      540 non-null    float64
 9   mmi      85 non-null     float64
 10  alert    51 non-null     object 
 11  status   10232 non-null  object 
 12  tsunami  10232 non-null  int64  
 13  sig      10232 non-null  int64  
 14  net      10232 non-null  object 
 15  code     10232 non-null  object 
 16  ids      10232 non-null  object 
 17  sources  10232 non-null  object 
 18  types    10232 non-null  object 
 19  nst      7544 non-null   float64
 20  dmin     7075 non-null   float64
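 21  rms      10232 non-null  float64
 22  gap      8602 non-null   float64
 23  magType  10232 non-null  object 
 24  type     10232 non-null  object 
 25  title    10232 non-null  object 
dtypes: float64(9), int64(4), object(13)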
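The non-null counts from `info()` tell us about nulls indirectly. To count them directly, we can combine `isna()` with `sum()` or `mean()`. A minimal sketch on a toy dataframe with made-up values that mimics the sparse `felt` column and the entirely empty `tz` column:

```python
import numpy as np
import pandas as pd

# toy stand-in: 'felt' is mostly missing and 'tz' is entirely missing
toy = pd.DataFrame({
    'mag': [2.0, 0.35, 4.5],
    'felt': [np.nan, np.nan, 1.0],
    'tz': [np.nan, np.nan, np.nan],
})

null_counts = toy.isna().sum()  # number of nulls per column
null_share = toy.isna().mean()  # fraction of nulls per column

# columns where every value is missing
all_null_cols = null_counts[null_counts == len(toy)].index.tolist()

print(null_counts)
print(all_null_cols)  # ['tz']
```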

## Describing and Summarizing
### Get summary statistics

In [21]:
df.describe()

Unnamed: 0,mag,time,updated,tz,felt,cdi,mmi,tsunami,sig,nst,dmin,rms,gap
count,10232.0,10232.0,10232.0,0.0,540.0,540.0,85.0,10232.0,10232.0,7544.0,7075.0,10232.0,8602.0
mean,1.585611,1621042000000.0,1621296000000.0,,42.598148,2.855926,3.741165,0.000391,61.697224,21.401246,0.586679,0.282517,119.039631
std,1.210489,738841600.0,707583700.0,,382.036385,1.42327,1.745962,0.019769,94.278662,15.917219,2.043666,0.292542,64.811104
min,-1.29,1619741000000.0,1619741000000.0,,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,9.0
25%,0.85,1620398000000.0,1620744000000.0,,1.0,2.0,3.233,0.0,11.0,10.0,0.025,0.09,68.0
50%,1.36,1621028000000.0,1621522000000.0,,2.0,2.7,3.759,0.0,28.0,17.0,0.07319,0.16,105.0
75%,2.0,1621726000000.0,1621894000000.0,,6.0,3.6,4.344,0.0,62.0,28.0,0.167384,0.36,151.0
max,7.3,1622333000000.0,1622489000000.0,,8284.0,9.1,9.0,1.0,1072.0,171.0,35.302,2.68,360.0


Specifying the 5<sup>th</sup> and 95<sup>th</sup> percentiles (the median is always included):

In [22]:
df.describe(percentiles=[0.05, 0.95])

Unnamed: 0,mag,time,updated,tz,felt,cdi,mmi,tsunami,sig,nst,dmin,rms,gap
count,10232.0,10232.0,10232.0,0.0,540.0,540.0,85.0,10232.0,10232.0,7544.0,7075.0,10232.0,8602.0
mean,1.585611,1621042000000.0,1621296000000.0,,42.598148,2.855926,3.741165,0.000391,61.697224,21.401246,0.586679,0.282517,119.039631
std,1.210489,738841600.0,707583700.0,,382.036385,1.42327,1.745962,0.019769,94.278662,15.917219,2.043666,0.292542,64.811104
min,-1.29,1619741000000.0,1619741000000.0,,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,9.0
5%,0.08,1619877000000.0,1620069000000.0,,1.0,1.0,0.0,0.0,0.0,5.0,0.006526,0.02,43.0
50%,1.36,1621028000000.0,1621522000000.0,,2.0,2.7,3.759,0.0,28.0,17.0,0.07319,0.16,105.0
95%,4.4,1622174000000.0,1622249000000.0,,156.35,5.4,6.5492,0.0,298.0,52.0,3.295,0.9,254.0
max,7.3,1622333000000.0,1622489000000.0,,8284.0,9.1,9.0,1.0,1072.0,171.0,35.302,2.68,360.0
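When we only want the percentiles themselves, `quantile()` returns them without the rest of the summary. A sketch on a toy `Series` of made-up magnitudes, using the default linear interpolation:

```python
import pandas as pd

# toy magnitudes, evenly spaced so the interpolation is easy to follow
mags = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5], name='mag')

tails = mags.quantile([0.05, 0.95])  # Series indexed by the quantile
summary = mags.describe(percentiles=[0.05, 0.95])

print(tails)
print(summary[['5%', '95%']])  # same values, labeled as percentages
```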


Describe specific data types:

In [23]:
df.describe(include=object)

Unnamed: 0,place,url,detail,alert,status,net,code,ids,sources,types,magType,type,title
count,10232,10232,10232,51,10232,10232,10232,10232,10232,10232,10232,10232,10232
unique,5173,10232,10232,3,2,15,10229,10232,67,46,8,4,8301
top,"7km NW of The Geysers, CA",https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,green,reviewed,nc,2021jzan,",mb80502414,",",nc,",",origin,phase-data,",ml,earthquake,"M 0.9 - 8km NW of The Geysers, CA"
freq,120,1,1,48,7094,1912,2,1,1782,4846,6443,10031,39
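For non-numeric columns, `top` is the most frequent value and `freq` is how many times it appears. The sketch below, on a toy `Series` with made-up values, shows how these line up with `value_counts()`:

```python
import pandas as pd

# toy stand-in for a text column such as status
status = pd.Series(
    ['reviewed', 'automatic', 'reviewed', 'reviewed'], name='status'
)

summary = status.describe()     # count, unique, top, freq
counts = status.value_counts()  # sorted from most to least frequent

print(summary['top'], summary['freq'])  # reviewed 3
print(counts.index[0], counts.iloc[0])  # reviewed 3
```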


Or describe all of them:

In [24]:
df.describe(include='all')

Unnamed: 0,mag,place,time,updated,tz,url,detail,felt,cdi,mmi,...,ids,sources,types,nst,dmin,rms,gap,magType,type,title
count,10232.0,10232,10232.0,10232.0,0.0,10232,10232,540.0,540.0,85.0,...,10232,10232,10232,7544.0,7075.0,10232.0,8602.0,10232,10232,10232
unique,,5173,,,,10232,10232,,,,...,10232,67,46,,,,,8,4,8301
top,,"7km NW of The Geysers, CA",,,,https://earthquake.usgs.gov/earthquakes/eventp...,https://earthquake.usgs.gov/fdsnws/event/1/que...,,,,...,",mb80502414,",",nc,",",origin,phase-data,",,,,,ml,earthquake,"M 0.9 - 8km NW of The Geysers, CA"
freq,,120,,,,1,1,,,,...,1,1782,4846,,,,,6443,10031,39
mean,1.585611,,1621042000000.0,1621296000000.0,,,,42.598148,2.855926,3.741165,...,,,,21.401246,0.586679,0.282517,119.039631,,,
std,1.210489,,738841600.0,707583700.0,,,,382.036385,1.42327,1.745962,...,,,,15.917219,2.043666,0.292542,64.811104,,,
min,-1.29,,1619741000000.0,1619741000000.0,,,,0.0,0.0,0.0,...,,,,2.0,0.0,0.0,9.0,,,
25%,0.85,,1620398000000.0,1620744000000.0,,,,1.0,2.0,3.233,...,,,,10.0,0.025,0.09,68.0,,,
50%,1.36,,1621028000000.0,1621522000000.0,,,,2.0,2.7,3.759,...,,,,17.0,0.07319,0.16,105.0,,,
75%,2.0,,1621726000000.0,1621894000000.0,,,,6.0,3.6,4.344,...,,,,28.0,0.167384,0.36,151.0,,,


`describe()` also works on a single column:

In [25]:
df.felt.describe()

count     540.000000
mean       42.598148
std       382.036385
min         0.000000
25%         1.000000
50%         2.000000
75%         6.000000
max      8284.000000
Name: felt, dtype: float64

There are methods for specific statistics as well. Here is a sampling of them:

| Method | Description | Data types |
| --- | --- | --- |
| `count()` | The number of non-null observations | Any |
| `nunique()` | The number of unique values | Any |
| `sum()` | The total of the values | Numerical or Boolean |
| `mean()` | The average of the values | Numerical or Boolean |
| `median()` | The median of the values | Numerical |
| `min()` | The minimum of the values | Numerical |
| `idxmin()` | The index where the minimum value occurs | Numerical |
| `max()` | The maximum of the values | Numerical |
| `idxmax()` | The index where the maximum value occurs | Numerical |
| `abs()` | The absolute values of the data | Numerical |
| `std()` | The standard deviation | Numerical |
| `var()` | The variance |  Numerical |
| `cov()` | The covariance between two `Series`, or a covariance matrix for all column combinations in a `DataFrame` | Numerical |
| `corr()` | The correlation between two `Series`, or a correlation matrix for all column combinations in a `DataFrame` | Numerical |
| `quantile()` | Calculates a specific quantile | Numerical |
| `cumsum()` | The cumulative sum | Numerical or Boolean |
| `cummin()` | The cumulative minimum | Numerical |
| `cummax()` | The cumulative maximum | Numerical |
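To make a few of these concrete, here is a short sketch on a toy dataframe. The values and the letter index labels are made up for illustration:

```python
import pandas as pd

# toy stand-in with letter labels so idxmax() returns something recognizable
toy = pd.DataFrame(
    {'mag': [2.0, 0.35, 4.5, 2.16], 'sig': [62, 2, 312, 72]},
    index=['a', 'b', 'c', 'd'],
)

strongest = toy.mag.idxmax()             # label of the largest magnitude
n_unique = toy.sig.nunique()             # number of distinct values
median_mag = toy.mag.median()            # middle of the sorted values
running_sig = toy.sig.cumsum().tolist()  # running total down the column
corr = toy.mag.corr(toy.sig)             # correlation between two Series

print(strongest, n_unique, running_sig)
```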

For example, finding the unique values in the `alert` column:

In [26]:
df.alert.unique()

array([nan, 'green', 'yellow', 'orange'], dtype=object)

We can then use `value_counts()` to see how many of each unique value we have:

In [27]:
df.alert.value_counts()

green     48
orange     2
yellow     1
Name: alert, dtype: int64
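By default, `value_counts()` leaves out nulls, which is why the counts above add up to 51 rather than 10,232. Passing `dropna=False` includes them, and `normalize=True` returns proportions instead of counts. A minimal sketch on a toy `Series` with made-up values:

```python
import numpy as np
import pandas as pd

# toy stand-in for a sparse column such as alert
alert = pd.Series(
    ['green', np.nan, 'green', 'yellow', np.nan, np.nan], name='alert'
)

default_counts = alert.value_counts()          # nulls are skipped
with_nulls = alert.value_counts(dropna=False)  # nulls get their own row
shares = alert.value_counts(normalize=True)    # proportions of non-null values

print(with_nulls)
print(shares)
```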

Note that `Index` objects also have several methods to help describe and summarize our data:

| Method | Description |
| --- | --- |
| `argmax()`/`argmin()` | Find the location of the maximum/minimum value in the index |
| `equals()` | Compare the index to another `Index` object for equality |
| `isin()` | Check if the index values are in a list of values and return an array of Booleans |
| `max()`/`min()` | Find the maximum/minimum value in the index |
| `nunique()` | Get the number of unique values in the index |
| `to_series()` | Create a `Series` object from the index |
| `unique()` | Find the unique values of the index |
| `value_counts()`| Create a frequency table for the unique values in the index |
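A quick sketch of a few of these `Index` methods, using a toy dataframe with a made-up integer index:

```python
import pandas as pd

# toy stand-in with an explicit integer index
toy = pd.DataFrame({'mag': [2.0, 0.35, 4.5]}, index=[10, 20, 30])
idx = toy.index

largest = idx.max()                         # largest label in the index
position = idx.argmax()                     # position of that label
mask = idx.isin([20, 40]).tolist()          # which labels are in the list
same = idx.equals(pd.Index([10, 20, 30]))   # element-wise equality check
as_series = idx.to_series()                 # Series with the index as values

print(largest, position, mask, same)
```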
