---
<center><h1>Basic intro to pandas</h1></center>

<center><h2>Work with pandas DataFrames: filtering, indexing and missing data</h2></center>
---

## Table of Contents

- [Work with pandas DataFrames: filtering, indexing and missing data](#Work-with-pandas-DataFrames:-filtering,-indexing-and-missing-data)
    * [Get basic information](#Get-basic-information)
    * [Conditional indexing and selection](#Conditional-indexing-and-selection)
    * [Work with indexes and MultiIndex option](#Work-with-indexes-and-MultiIndex-option)
    * [Selection by label and position](#Selection-by-label-and-position)
    * [Work with missing data](#Work-with-missing-data)
    - [*Exercise 1*](#Exercise-1)

In [44]:
import pandas as pd
import numpy as np
import random

## Work with pandas DataFrames: filtering, indexing and missing data

[[back to top]](#Table-of-Contents)

In this part we continue working with DataFrames and learn
1.	how to get basic information about a DataFrame and its contents;
2.	how to select part of a DataFrame, including the rows that satisfy some condition;
3.	how to change the index of a DataFrame and use multi-level indexing;
4.	how to select rows by label and by position;
5.	how to work with missing data.

The lesson is divided into code blocks that follow the points listed above. In the following posts we will continue with pandas and look at its other features.

In [136]:
url="https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
eqPastMonth=pd.read_csv(url)
eqPastMonth.head(10)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2025-04-16T19:44:39.410Z,46.617832,-119.800499,6.79,1.565842,ml,11.0,66.0,,0.04,...,2025-04-16T19:46:42.850Z,"11 km SE of Desert Aire, Washington",earthquake,0.35,0.29,0.273023,8.0,automatic,uw,uw
1,2025-04-16T19:30:36.660Z,33.861167,-117.5075,1.35,1.25,ml,41.0,43.0,0.04688,0.18,...,2025-04-16T19:35:11.402Z,"2 km SSE of Home Gardens, CA",earthquake,0.2,0.3,0.289,38.0,automatic,ci,ci
2,2025-04-16T19:21:00.822Z,34.1706,86.4226,10.0,4.1,mb,36.0,103.0,6.426,0.87,...,2025-04-16T19:41:06.040Z,western Xizang,earthquake,8.72,1.927,0.135,15.0,reviewed,us,us
3,2025-04-16T18:59:21.660Z,32.143333,-116.051666,15.51,3.46,mlr,62.0,44.0,0.1638,0.24,...,2025-04-16T19:30:24.040Z,"62 km ENE of Ensenada, B.C., MX",earthquake,0.27,1.14,0.401,18.0,reviewed,ci,ci
4,2025-04-16T18:57:51.552Z,-0.6994,127.5452,10.0,4.6,mb,42.0,62.0,1.472,1.17,...,2025-04-16T19:17:03.040Z,"11 km SE of Labuha, Indonesia",earthquake,6.85,1.903,0.117,22.0,reviewed,us,us
5,2025-04-16T18:34:18.170Z,33.467,-116.556167,11.44,1.01,ml,39.0,48.0,0.04736,0.19,...,2025-04-16T18:39:24.493Z,"15 km SE of Anza, CA",earthquake,0.21,0.44,0.256,29.0,automatic,ci,ci
6,2025-04-16T18:14:00.081Z,11.7508,141.6119,38.462,4.9,mb,78.0,68.0,3.666,0.7,...,2025-04-16T18:58:05.040Z,"249 km NNE of Fais, Micronesia",earthquake,9.04,7.329,0.055,103.0,reviewed,us,us
7,2025-04-16T18:06:36.870Z,19.337667,-155.179333,7.36,2.48,ml,49.0,81.0,0.04263,0.1,...,2025-04-16T19:07:16.040Z,"12 km SSE of Volcano, Hawaii",earthquake,0.31,0.44,0.226572,37.0,reviewed,hv,hv
8,2025-04-16T18:02:11.241Z,62.4338,-148.125,28.2,2.2,ml,,,,0.52,...,2025-04-16T18:03:54.344Z,"73 km NNE of Chickaloon, Alaska",earthquake,,0.2,,,automatic,ak,ak
9,2025-04-16T17:52:02.109Z,61.8794,-149.6647,2.8,2.3,ml,,,,0.64,...,2025-04-16T17:54:00.079Z,"24 km NE of Willow, Alaska",earthquake,,0.2,,,automatic,ak,ak


### Get basic information

[[back to top]](#Table-of-Contents)

pandas provides several attributes and methods for getting basic information about a DataFrame.

Let’s take a look at the types of the `eqPastMonth` columns:

In [137]:
eqPastMonth.dtypes

time                object
latitude           float64
longitude          float64
depth              float64
mag                float64
magType             object
nst                float64
gap                float64
dmin               float64
rms                float64
net                 object
id                  object
updated             object
place               object
type                object
horizontalError    float64
depthError         float64
magError           float64
magNst             float64
status              object
locationSource      object
magSource           object
dtype: object

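You may notice that the dtype of the `time` column is `object` by default, which here means the values are strings. You can change this with the `apply` method, which applies a function to every element of a Series (or to every row or column of a DataFrame). A "lambda" is a shorthand way to write a small function inline. The next two cells show `apply` with a lambda and with a named function; the cell after them converts `time` to datetime values.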

In [141]:
eqPastMonth['magPlus5'] = eqPastMonth['mag'].apply(lambda x: x + 5)
eqPastMonth[["magPlus5", 'mag']].head(5)

Unnamed: 0,magPlus5,mag
0,6.565842,1.565842
1,6.25,1.25
2,9.1,4.1
3,8.46,3.46
4,9.6,4.6


In [142]:
def addFive(x):
    return x + 5
eqPastMonth['magPlus5'] = eqPastMonth['mag'].apply(addFive)
eqPastMonth[["magPlus5", 'mag']].head(5)

Unnamed: 0,magPlus5,mag
0,6.565842,1.565842
1,6.25,1.25
2,9.1,4.1
3,8.46,3.46
4,9.6,4.6


In [143]:
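from datetime import datetime
# Parse each ISO 8601 string (e.g. '2025-04-16T19:44:39.410Z') into a datetime object
eqPastMonth['datetime'] = eqPastMonth['time'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ'))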
eqPastMonth['datetime'].head(5)

0   2025-04-16 19:44:39.410
1   2025-04-16 19:30:36.660
2   2025-04-16 19:21:00.822
3   2025-04-16 18:59:21.660
4   2025-04-16 18:57:51.552
Name: datetime, dtype: datetime64[ns]
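
As an alternative to `apply` with `strptime`, pandas can parse a whole column in one call with `pd.to_datetime`. The sketch below is only an illustration: `sample` is a hypothetical two-row frame built from timestamps shown above, and `utc=True` is a choice made here so that the trailing `Z` is treated as UTC and the result is timezone-aware.

```python
import pandas as pd

# Hypothetical two-row sample using the same ISO 8601 strings as the `time` column
sample = pd.DataFrame(
    {"time": ["2025-04-16T19:44:39.410Z", "2025-04-16T19:30:36.660Z"]}
)

# Parse the whole column in one vectorized call; utc=True gives tz-aware values
sample["datetime"] = pd.to_datetime(sample["time"], utc=True)

print(sample["datetime"].dtype)
print(sample["datetime"].dt.hour.tolist())
```

The `.dt` accessor then exposes components such as `hour` or `minute` for the whole column.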

In [146]:
from datetime import datetime
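# datetime.now() returns local time, while the feed timestamps are in UTC,
# so the differences below can come out negative
now = datetime.now()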

eqPastMonth['howlongago'] = now - eqPastMonth['datetime']

eqPastMonth[['datetime', 'howlongago', 'mag']].head(5)

Unnamed: 0,datetime,howlongago,mag
0,2025-04-16 19:44:39.410,-1 days +19:18:11.471471,1.565842
1,2025-04-16 19:30:36.660,-1 days +19:32:14.221471,1.25
2,2025-04-16 19:21:00.822,-1 days +19:41:50.059471,4.1
3,2025-04-16 18:59:21.660,-1 days +20:03:29.221471,3.46
4,2025-04-16 18:57:51.552,-1 days +20:04:59.329471,4.6
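
The negative values above most likely come from subtracting UTC timestamps from a local-time "now". A minimal sketch of one way to avoid this, assuming the timestamps are parsed as timezone-aware UTC values; the `events` frame and the `now_utc` name are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical sample: one timestamp from the feed, parsed as timezone-aware UTC
events = pd.DataFrame({"time": ["2025-04-16T19:44:39.410Z"]})
events["datetime"] = pd.to_datetime(events["time"], utc=True)

# Take "now" in UTC as well, so both sides of the subtraction share a time zone
now_utc = pd.Timestamp.now(tz="UTC")
events["howlongago"] = now_utc - events["datetime"]

# For a past event the elapsed time is positive
print(events["howlongago"].iloc[0] > pd.Timedelta(0))
```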


In [147]:
# Note: despite the column name, this adds 0.1 to the magnitude
eqPastMonth['magPlus1'] = eqPastMonth['mag'].apply(lambda x: x + 0.1)
eqPastMonth.head(5)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,depthError,magError,magNst,status,locationSource,magSource,magPlus5,datetime,howlongago,magPlus1
0,2025-04-16T19:44:39.410Z,46.617832,-119.800499,6.79,1.565842,ml,11.0,66.0,,0.04,...,0.29,0.273023,8.0,automatic,uw,uw,6.565842,2025-04-16 19:44:39.410,-1 days +19:18:11.471471,1.665842
1,2025-04-16T19:30:36.660Z,33.861167,-117.5075,1.35,1.25,ml,41.0,43.0,0.04688,0.18,...,0.3,0.289,38.0,automatic,ci,ci,6.25,2025-04-16 19:30:36.660,-1 days +19:32:14.221471,1.35
2,2025-04-16T19:21:00.822Z,34.1706,86.4226,10.0,4.1,mb,36.0,103.0,6.426,0.87,...,1.927,0.135,15.0,reviewed,us,us,9.1,2025-04-16 19:21:00.822,-1 days +19:41:50.059471,4.2
3,2025-04-16T18:59:21.660Z,32.143333,-116.051666,15.51,3.46,mlr,62.0,44.0,0.1638,0.24,...,1.14,0.401,18.0,reviewed,ci,ci,8.46,2025-04-16 18:59:21.660,-1 days +20:03:29.221471,3.56
4,2025-04-16T18:57:51.552Z,-0.6994,127.5452,10.0,4.6,mb,42.0,62.0,1.472,1.17,...,1.903,0.117,22.0,reviewed,us,us,9.6,2025-04-16 18:57:51.552,-1 days +20:04:59.329471,4.7


Notice the new "datetime" column. It is of type `datetime64[ns]`, as the `dtypes` output in the next cell confirms.

In [148]:
eqPastMonth.dtypes

time                        object
latitude                   float64
longitude                  float64
depth                      float64
mag                        float64
magType                     object
nst                        float64
gap                        float64
dmin                       float64
rms                        float64
net                         object
id                          object
updated                     object
place                       object
type                        object
horizontalError            float64
depthError                 float64
magError                   float64
magNst                     float64
status                      object
locationSource              object
magSource                   object
magPlus5                   float64
datetime            datetime64[ns]
howlongago         timedelta64[ns]
magPlus1                   float64
dtype: object

You can also get a concise summary of the DataFrame (row count, column names, non-null counts and dtypes) with `info()`:

In [149]:
eqPastMonth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10443 entries, 0 to 10442
Data columns (total 26 columns):
 #   Column           Non-Null Count  Dtype          
---  ------           --------------  -----          
 0   time             10443 non-null  object         
 1   latitude         10443 non-null  float64        
 2   longitude        10443 non-null  float64        
 3   depth            10443 non-null  float64        
 4   mag              10443 non-null  float64        
 5   magType          10443 non-null  object         
 6   nst              8806 non-null   float64        
 7   gap              8806 non-null   float64        
 8   dmin             8805 non-null   float64        
 9   rms              10442 non-null  float64        
 10  net              10443 non-null  object         
 11  id               10443 non-null  object         
 12  updated          10443 non-null  object         
 13  place            10443 non-null  object         
 14  type             10443

The `info()` method shows (top down)
+ that `eqPastMonth` is an instance of the `DataFrame` class (the same information the built-in `type()` function returns);
+ the number of rows in the DataFrame;
+ the dtype of each column and the number of non-null values in it (`dtypes` gives the dtypes alone in a shorter form);
+ the memory usage of the DataFrame, etc.

The `describe()` method quickly computes the count, mean, minimum, maximum, standard deviation and quartiles of each numeric column of the DataFrame.

In [153]:
eqPastMonth[eqPastMonth["mag"] >= 0].describe()[["mag","latitude"]]

Unnamed: 0,mag,latitude
count,9669.0,9669.0
mean,1.63814,38.513747
min,0.0,-65.0709
25%,0.81,33.467
50%,1.32,38.801666
75%,2.0,49.394
max,7.7,83.5057
std,1.199725,18.204314
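
To make the behaviour of `describe()` easier to follow, here is a small sketch on a hypothetical four-row frame (the values are made up; only the column names mirror the earthquake feed). It also shows `shape`, which returns the number of rows and columns.

```python
import pandas as pd

# Hypothetical miniature of the earthquake table (values are made up)
quakes = pd.DataFrame(
    {
        "mag": [1.0, 2.0, 3.0, 6.0],
        "depth": [10.0, 20.0, 30.0, 40.0],
        "place": ["A", "B", "C", "D"],
    }
)

print(quakes.shape)          # (rows, columns)
summary = quakes.describe()  # numeric columns only by default
print(summary.loc[["mean", "max"], ["mag", "depth"]])
```

Because `describe()` returns an ordinary DataFrame, individual statistics can be picked out with `.loc`, as in the last line.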


### Conditional indexing and selection

[[back to top]](#Table-of-Contents)

As we said above, a DataFrame is a collection of Series objects. You can therefore select a single column, which gives a Series, or several columns, which gives another DataFrame.

In [54]:
eqPastMonth_mag = eqPastMonth['mag']
# Here we are showing only one column, i.e. a Series
print ('type:', type(eqPastMonth_mag))
eqPastMonth_mag.head(10)

type: <class 'pandas.core.series.Series'>


0    1.03
1    2.16
2    3.81
3    1.76
4    0.28
5    1.60
6    1.34
7    0.93
8    0.94
9    0.74
Name: mag, dtype: float64

In [154]:
eqPastMonth_record = eqPastMonth[['time','depth', 'mag', 'place']]
# Here we are showing four columns, i.e. a new DataFrame
print ('type:', type(eqPastMonth_record))
eqPastMonth_record.tail()

type: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,time,depth,mag,place
10438,2025-03-17T20:03:00.608Z,10.0,4.5,South Sandwich Islands region
10439,2025-03-17T20:00:35.954Z,19.966,4.8,"2 km SSW of Honmachi, Japan"
10440,2025-03-17T19:58:35.742Z,5.5176,1.2,"23 km W of Garden City, Texas"
10441,2025-03-17T19:56:41.181Z,152.4,1.9,"43 km ENE of Pedro Bay, Alaska"
10442,2025-03-17T19:56:39.438Z,118.54,4.3,"272 km ENE of Lospalos, Timor Leste"


You can also refer to a single column as an attribute:

In [155]:
eqPastMonth_record.time

0        2025-04-16T19:44:39.410Z
1        2025-04-16T19:30:36.660Z
2        2025-04-16T19:21:00.822Z
3        2025-04-16T18:59:21.660Z
4        2025-04-16T18:57:51.552Z
                   ...           
10438    2025-03-17T20:03:00.608Z
10439    2025-03-17T20:00:35.954Z
10440    2025-03-17T19:58:35.742Z
10441    2025-03-17T19:56:41.181Z
10442    2025-03-17T19:56:39.438Z
Name: time, Length: 10443, dtype: object

Filtered DataFrames can be obtained with comparison and logical operators:

In [156]:
# Let's display only large earthquakes
eqPastMonth_large = eqPastMonth[eqPastMonth['mag'] > 5]
eqPastMonth_large['mag'].mean()

5.481596638655462

In [157]:
#Getting records that are large (>5mag) earthquakes and that occurred in the northern hemisphere
filtered_df_1 = eqPastMonth[(eqPastMonth['mag'] > 5 ) & (eqPastMonth['latitude'] > 0)]
filtered_df_1[['time','depth', 'mag', 'place']].describe()

Unnamed: 0,depth,mag
count,47.0,47.0
mean,33.390596,5.570426
std,60.403762,0.554644
min,7.673,5.1
25%,10.0,5.2
50%,10.0,5.4
75%,29.3475,5.8
max,382.253,7.7


In [158]:
#Getting records that are large (>5mag) earthquakes and that occurred in the southern hemisphere
filtered_df_1 = eqPastMonth[(eqPastMonth['mag'] > 5 ) & (eqPastMonth['latitude'] < 0)]
filtered_df_1[['time','depth', 'mag', 'place']].describe()

Unnamed: 0,depth,mag
count,72.0,72.0
mean,41.416403,5.423611
std,60.777892,0.450819
min,10.0,5.1
25%,10.0,5.1
50%,10.0,5.25
75%,55.79525,5.525
max,347.281,7.0


In [163]:
# Getting earthquakes with magnitude > 2 in the western hemisphere, east of 95° W (longitude between -95 and 0); also select the output columns
filtered_df_2 = eqPastMonth[(eqPastMonth['mag'] > 2 ) & (eqPastMonth['longitude'] < 0) & (eqPastMonth['longitude'] > -95)][['depth', 'mag', 'place']]
filtered_df_2.head(10)

Unnamed: 0,depth,mag,place
61,27.29,3.24,"100 km NNE of Cruz Bay, U.S. Virgin Islands"
66,10.0,5.6,southern Mid-Atlantic Ridge
86,22.3,2.37,"4 km SW of Las Piedras, Puerto Rico"
104,99.338,4.1,"194 km NW of Antofagasta de la Sierra, Argentina"
115,15.15,2.26,"7 km SSE of Guánica, Puerto Rico"
161,15.92,2.17,"6 km WNW of Linwood, Georgia"
177,14.85,2.42,"0 km SSW of Indios, Puerto Rico"
180,10.0,2.9,"12 km SW of Évora, Portugal"
190,16.09,2.04,"3 km S of Indios, Puerto Rico"
233,10.0,5.0,Scotia Sea


You can also use the `isin(values)` method to check whether each item of a Series is contained in a list of values, the `isnull()` method to detect missing (`NaN`) values, and the boolean operators `&` (`AND`) and `|` (`OR`) to build compound conditions.
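
A minimal sketch of these three tools together, on a hypothetical five-row frame (the values are invented; only the column names follow the earthquake feed):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names follow the earthquake feed
quakes = pd.DataFrame(
    {
        "magType": ["ml", "mb", "md", "ml", "mww"],
        "mag": [1.2, 4.1, 0.7, 2.5, 5.4],
        "magError": [0.2, np.nan, 0.1, np.nan, 0.07],
    }
)

# isin(): keep rows whose magType is one of the listed values
local_types = quakes[quakes["magType"].isin(["ml", "md"])]

# isnull() combined with | (OR): missing error estimate or a strong event
missing_or_strong = quakes[quakes["magError"].isnull() | (quakes["mag"] > 5)]

# ~ negates a condition; & (AND) requires both conditions to hold
small_with_error = quakes[~quakes["magError"].isnull() & (quakes["mag"] < 2)]

print(local_types.index.tolist())
print(missing_or_strong.index.tolist())
print(small_with_error.index.tolist())
```

Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `<` in Python.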

As you can see, the filtered DataFrames keep the index labels of the original rows, so the index is no longer consecutive. To fix this you can write:

In [164]:
filtered_df_2.reset_index().head(10)

Unnamed: 0,index,depth,mag,place
0,61,27.29,3.24,"100 km NNE of Cruz Bay, U.S. Virgin Islands"
1,66,10.0,5.6,southern Mid-Atlantic Ridge
2,86,22.3,2.37,"4 km SW of Las Piedras, Puerto Rico"
3,104,99.338,4.1,"194 km NW of Antofagasta de la Sierra, Argentina"
4,115,15.15,2.26,"7 km SSE of Guánica, Puerto Rico"
5,161,15.92,2.17,"6 km WNW of Linwood, Georgia"
6,177,14.85,2.42,"0 km SSW of Indios, Puerto Rico"
7,180,10.0,2.9,"12 km SW of Évora, Portugal"
8,190,16.09,2.04,"3 km S of Indios, Puerto Rico"
9,233,10.0,5.0,Scotia Sea


This restarts the index at 0 and makes it consecutive; the old labels are kept in a new `index` column.
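
If the old labels are not needed, `reset_index` accepts `drop=True`. A short sketch on a hypothetical three-row frame whose labels have gaps, similar to the filtered result above:

```python
import pandas as pd

# Hypothetical filtered result whose index labels have gaps
filtered = pd.DataFrame({"mag": [3.24, 5.6, 2.37]}, index=[61, 66, 86])

# drop=True discards the old labels instead of storing them in an `index` column
renumbered = filtered.reset_index(drop=True)

print(renumbered.index.tolist())
print(renumbered.columns.tolist())
```

Note that `reset_index` returns a new DataFrame; `filtered` itself keeps its original labels.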

Also remember that you can add new columns and rows to the DataFrame:

In [166]:
# create a new custom_mag column filled with empty strings
eqPastMonth['custom_mag'] = ''
# then label each earthquake as 'Small' (mag < 4) or 'Large'
eqPastMonth['custom_mag'] = np.where(eqPastMonth['mag'] < 4, 'Small', "Large")
eqPastMonth[['time','depth', 'mag', 'place','custom_mag']].head(10)

Unnamed: 0,time,depth,mag,place,custom_mag
0,2025-04-16T19:44:39.410Z,6.79,1.565842,"11 km SE of Desert Aire, Washington",Small
1,2025-04-16T19:30:36.660Z,1.35,1.25,"2 km SSE of Home Gardens, CA",Small
2,2025-04-16T19:21:00.822Z,10.0,4.1,western Xizang,Large
3,2025-04-16T18:59:21.660Z,15.51,3.46,"62 km ENE of Ensenada, B.C., MX",Small
4,2025-04-16T18:57:51.552Z,10.0,4.6,"11 km SE of Labuha, Indonesia",Large
5,2025-04-16T18:34:18.170Z,11.44,1.01,"15 km SE of Anza, CA",Small
6,2025-04-16T18:14:00.081Z,38.462,4.9,"249 km NNE of Fais, Micronesia",Large
7,2025-04-16T18:06:36.870Z,7.36,2.48,"12 km SSE of Volcano, Hawaii",Small
8,2025-04-16T18:02:11.241Z,28.2,2.2,"73 km NNE of Chickaloon, Alaska",Small
9,2025-04-16T17:52:02.109Z,2.8,2.3,"24 km NE of Willow, Alaska",Small
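
The cell above adds a column. Adding a row can be sketched with `pd.concat`, shown here on a hypothetical two-row frame; the frame, its values and the `new_row` name are illustrative assumptions, not part of the earthquake data.

```python
import pandas as pd

# Hypothetical two-row frame with made-up values
quakes = pd.DataFrame({"place": ["A", "B"], "mag": [1.5, 4.2]})

# A new row is built as a one-row DataFrame with the same columns
new_row = pd.DataFrame({"place": ["C"], "mag": [6.1]})

# ignore_index=True renumbers the combined rows from 0
quakes = pd.concat([quakes, new_row], ignore_index=True)

print(len(quakes))
print(quakes.loc[2, "place"])
```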


### Work with indexes and MultiIndex option

[[back to top]](#Table-of-Contents)

pandas lets you set a custom index for a DataFrame. It can be defined when the DataFrame is created:

In [63]:
import random
indexes = [random.randrange(0,100) for i in range(5)]
data = [{i:random.randint(0,10) for i in 'ABCDE'} for i in range(5)]
df = pd.DataFrame(data, index=indexes)
df

Unnamed: 0,A,B,C,D,E
84,4,10,0,3,10
84,5,4,8,9,4
63,0,7,0,6,0
21,7,7,9,6,0
74,6,5,6,2,0


or changed at any time:

In [64]:
df.index = ['a', 'b', 'c', 'd', 'e']
df

Unnamed: 0,A,B,C,D,E
a,4,10,0,3,10
b,5,4,8,9,4
c,0,7,0,6,0
d,7,7,9,6,0
e,6,5,6,2,0


You can also use one or more columns as the index:

In [65]:
# if duplicates exist you can drop duplicates to get unique values
#eqPastMonth_nodups = eqPastMonth.drop_duplicates(subset='time', keep='last')
# we don't need to do that.
# set 'time' as index
eqPastMonth_indexChange = eqPastMonth.set_index('time')
eqPastMonth_indexChange.head(10)

Unnamed: 0_level_0,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource,magPlus5,datetime,magPlus1,custom_mag
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
2025-04-16T15:22:14.240Z,38.821499,-122.762337,1.78,1.03,md,14.0,123.0,0.01074,0.02,nc,nc75166321,2025-04-16T15:23:50.969Z,"3 km W of Cobb, CA",earthquake,0.29,0.53,0.09,15.0,automatic,nc,nc,6.03,2025-04-16 15:22:14.240,1.13,Small
2025-04-16T15:19:27.860Z,19.203501,-155.373001,30.68,2.16,ml,61.0,162.0,0.04099,0.13,hv,hv74654467,2025-04-16T15:23:16.150Z,"11 km E of Pāhala, Hawaii",earthquake,0.42,0.45,0.72,17.0,automatic,hv,hv,7.16,2025-04-16 15:19:27.860,2.26,Small
2025-04-16T15:04:56.790Z,40.368999,-125.008835,2.99,3.81,ml,103.0,247.0,0.5156,0.19,nc,nc75166296,2025-04-16T15:18:08.320Z,"62 km WNW of Petrolia, CA",earthquake,1.27,2.0,0.145,11.0,reviewed,nc,nc,8.81,2025-04-16 15:04:56.790,3.91,Small
2025-04-16T15:00:49.220Z,44.342667,-115.176667,10.84,1.76,ml,16.0,123.0,0.3349,0.24,mb,mb90078908,2025-04-16T15:15:17.250Z,"23 km NW of Stanley, Idaho",earthquake,0.63,1.07,0.123255,14.0,reviewed,mb,mb,6.76,2025-04-16 15:00:49.220,1.86,Small
2025-04-16T14:59:44.190Z,38.803665,-122.779831,3.24,0.28,md,13.0,84.0,0.009603,0.02,nc,nc75166291,2025-04-16T15:17:21.353Z,"4 km NNW of The Geysers, CA",earthquake,0.33,0.92,0.12,12.0,automatic,nc,nc,5.28,2025-04-16 14:59:44.190,0.38,Small
2025-04-16T14:42:52.624Z,64.5566,-149.2173,0.8,1.6,ml,,,,0.45,ak,ak0254vjjyy6,2025-04-16T14:44:46.021Z,"5 km W of Nenana, Alaska",earthquake,,0.3,,,automatic,ak,ak,6.6,2025-04-16 14:42:52.624,1.7,Small
2025-04-16T14:38:39.010Z,36.940498,-121.473,11.03,1.34,md,5.0,190.0,0.1241,0.03,nc,nc75166286,2025-04-16T14:57:17.213Z,"11 km SE of Gilroy, CA",earthquake,1.19,1.93,,1.0,automatic,nc,nc,6.34,2025-04-16 14:38:39.010,1.44,Small
2025-04-16T14:34:00.340Z,46.615833,-119.797333,7.74,0.93,ml,11.0,80.0,0.02353,0.08,uw,uw62091961,2025-04-16T15:02:51.910Z,"11 km SE of Desert Aire, Washington",earthquake,0.26,0.18,0.139248,11.0,reviewed,uw,uw,5.93,2025-04-16 14:34:00.340,1.03,Small
2025-04-16T14:30:22.000Z,33.499,-116.441833,7.48,0.94,ml,44.0,73.0,0.1341,0.22,ci,ci40930047,2025-04-16T14:46:05.920Z,"22 km SW of La Quinta, CA",earthquake,0.21,1.04,0.166,22.0,reviewed,ci,ci,5.94,2025-04-16 14:30:22.000,1.04,Small
2025-04-16T14:29:29.390Z,38.799168,-122.752167,2.1,0.74,md,9.0,97.0,0.006073,0.02,nc,nc75166276,2025-04-16T14:47:19.151Z,"2 km NNE of The Geysers, CA",earthquake,0.4,0.9,0.07,9.0,automatic,nc,nc,5.74,2025-04-16 14:29:29.390,0.84,Small


By default, `set_index()` returns a new DataFrame and leaves the original unchanged; assign the result to a variable (or pass `inplace=True`) if you want to keep the change.
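
A minimal sketch of this behaviour on a hypothetical two-row frame (the `id` values are invented):

```python
import pandas as pd

# Hypothetical frame with made-up ids
quakes = pd.DataFrame({"id": ["q1", "q2"], "mag": [1.5, 4.2]})

# set_index() returns a new object; `quakes` itself keeps its default index
by_id = quakes.set_index("id")
print(quakes.index.tolist())
print(by_id.index.tolist())

# To keep the change, assign the result back to the same name
quakes = quakes.set_index("id")
print(quakes.loc["q2", "mag"])
```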

Let’s create a multi-level index for the `eqPastMonth` DataFrame:

In [66]:
# set 'id' & 'type' as index
eqPastMonth_multi = eqPastMonth.set_index(['id','type'])[["latitude","longitude","depth","mag", "place"]]
eqPastMonth_multi.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,latitude,longitude,depth,mag,place
id,type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
nc75166321,earthquake,38.821499,-122.762337,1.78,1.03,"3 km W of Cobb, CA"
hv74654467,earthquake,19.203501,-155.373001,30.68,2.16,"11 km E of Pāhala, Hawaii"
nc75166296,earthquake,40.368999,-125.008835,2.99,3.81,"62 km WNW of Petrolia, CA"
mb90078908,earthquake,44.342667,-115.176667,10.84,1.76,"23 km NW of Stanley, Idaho"
nc75166291,earthquake,38.803665,-122.779831,3.24,0.28,"4 km NNW of The Geysers, CA"
ak0254vjjyy6,earthquake,64.5566,-149.2173,0.8,1.6,"5 km W of Nenana, Alaska"
nc75166286,earthquake,36.940498,-121.473,11.03,1.34,"11 km SE of Gilroy, CA"
uw62091961,earthquake,46.615833,-119.797333,7.74,0.93,"11 km SE of Desert Aire, Washington"
ci40930047,earthquake,33.499,-116.441833,7.48,0.94,"22 km SW of La Quinta, CA"
nc75166276,earthquake,38.799168,-122.752167,2.1,0.74,"2 km NNE of The Geysers, CA"


and check the type of `eqPastMonth_multi.index`:

In [67]:
print ('type: ', type(eqPastMonth_multi.index))

type:  <class 'pandas.core.indexes.multi.MultiIndex'>


Thus, we get a new pandas class, `MultiIndex`, which holds the index information of the DataFrame and lets you work with it. As a check, what is the type of `filtered_df_2.index`?

You can get the `levels`, `codes` (called `labels` in older pandas versions) and `names` of a MultiIndex by accessing them as attributes.
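
A short sketch of these attributes on a hypothetical three-row frame indexed by `id` and `type` (all values are invented). It also uses `get_level_values` and `xs`, two further MultiIndex helpers that are not discussed in the text above:

```python
import pandas as pd

# Hypothetical frame with a two-level index, mirroring the cell above
quakes = pd.DataFrame(
    {
        "id": ["q1", "q2", "q3"],
        "type": ["earthquake", "explosion", "earthquake"],
        "mag": [1.5, 2.0, 4.2],
    }
).set_index(["id", "type"])

idx = quakes.index
print(list(idx.names))                      # names of the index levels
print(idx.levels[1].tolist())               # unique values of the second level
print(idx.get_level_values("id").tolist())  # one value per row for a level

# Select all rows with a given value in one level
print(quakes.xs("earthquake", level="type")["mag"].tolist())
```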

### Selection by label and position
[[back to top]](#Table-of-Contents)

After reading the previous three subsections you probably have a question: I now know how to filter a DataFrame and how to give it a multi-level index, but how do I select a specific row of the table?
pandas supports two kinds of multi-axis indexing for this:

* `.loc` works on labels in the index;
* `.iloc` works on positions in the index (so it only takes integers).

The following examples show how to work with the rows of a DataFrame.
First, let’s get the first row of the past month’s earthquakes.

In [68]:
#To return a single record(i.e. row), in this case the first one.
eqPastMonth.loc[0]

time                 2025-04-16T15:22:14.240Z
latitude                            38.821499
longitude                         -122.762337
depth                                    1.78
mag                                      1.03
magType                                    md
nst                                      14.0
gap                                     123.0
dmin                                  0.01074
rms                                      0.02
net                                        nc
id                                 nc75166321
updated              2025-04-16T15:23:50.969Z
place                      3 km W of Cobb, CA
type                               earthquake
horizontalError                          0.29
depthError                               0.53
magError                                 0.09
magNst                                   15.0
status                              automatic
locationSource                             nc
magSource                         

and rows 1 to 3 (note how ranges work in `.loc`: the right boundary is included, which is different from slicing Python lists and strings)

In [69]:
eqPastMonth.loc[1:3]

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource,magPlus5,datetime,magPlus1,custom_mag
1,2025-04-16T15:19:27.860Z,19.203501,-155.373001,30.68,2.16,ml,61.0,162.0,0.04099,0.13,hv,hv74654467,2025-04-16T15:23:16.150Z,"11 km E of Pāhala, Hawaii",earthquake,0.42,0.45,0.72,17.0,automatic,hv,hv,7.16,2025-04-16 15:19:27.860,2.26,Small
2,2025-04-16T15:04:56.790Z,40.368999,-125.008835,2.99,3.81,ml,103.0,247.0,0.5156,0.19,nc,nc75166296,2025-04-16T15:18:08.320Z,"62 km WNW of Petrolia, CA",earthquake,1.27,2.0,0.145,11.0,reviewed,nc,nc,8.81,2025-04-16 15:04:56.790,3.91,Small
3,2025-04-16T15:00:49.220Z,44.342667,-115.176667,10.84,1.76,ml,16.0,123.0,0.3349,0.24,mb,mb90078908,2025-04-16T15:15:17.250Z,"23 km NW of Stanley, Idaho",earthquake,0.63,1.07,0.123255,14.0,reviewed,mb,mb,6.76,2025-04-16 15:00:49.220,1.86,Small


As you can see, the first argument of `.loc` corresponds to the index label. If you want the value of specific column(s), pass the name(s) of the column(s) as the second argument:

In [70]:
eqPastMonth.loc[0, 'place']

'3 km W of Cobb, CA'

In [71]:
eqPastMonth.loc[3:10, ['place', 'mag']]

Unnamed: 0,place,mag
3,"23 km NW of Stanley, Idaho",1.76
4,"4 km NNW of The Geysers, CA",0.28
5,"5 km W of Nenana, Alaska",1.6
6,"11 km SE of Gilroy, CA",1.34
7,"11 km SE of Desert Aire, Washington",0.93
8,"22 km SW of La Quinta, CA",0.94
9,"2 km NNE of The Geysers, CA",0.74
10,"9 km NW of Tonasket, Washington",3.03


Let us repeat: the first argument of `.loc` is not the row number but the index label of the row.

If you need to select rows by their position, use `.iloc`:

In [72]:
eqPastMonth.iloc[0]

time                 2025-04-16T15:22:14.240Z
latitude                            38.821499
longitude                         -122.762337
depth                                    1.78
mag                                      1.03
magType                                    md
nst                                      14.0
gap                                     123.0
dmin                                  0.01074
rms                                      0.02
net                                        nc
id                                 nc75166321
updated              2025-04-16T15:23:50.969Z
place                      3 km W of Cobb, CA
type                               earthquake
horizontalError                          0.29
depthError                               0.53
magError                                 0.09
magNst                                   15.0
status                              automatic
locationSource                             nc
magSource                         

In [73]:
eqPastMonth.iloc[1:5,3:5]

Unnamed: 0,depth,mag
1,30.68,2.16
2,2.99,3.81
3,10.84,1.76
4,3.24,0.28


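In the first case the row’s position coincides with its index label, so `.iloc[0]` and `.loc[0]` return the same row. The second example demonstrates the difference between `.loc` and `.iloc`: `.iloc[1:5]` excludes the end position, while `.loc[1:5]` includes the label 5.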

In [74]:
eqPastMonth.loc[1:5]

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource,magPlus5,datetime,magPlus1,custom_mag
1,2025-04-16T15:19:27.860Z,19.203501,-155.373001,30.68,2.16,ml,61.0,162.0,0.04099,0.13,hv,hv74654467,2025-04-16T15:23:16.150Z,"11 km E of Pāhala, Hawaii",earthquake,0.42,0.45,0.72,17.0,automatic,hv,hv,7.16,2025-04-16 15:19:27.860,2.26,Small
2,2025-04-16T15:04:56.790Z,40.368999,-125.008835,2.99,3.81,ml,103.0,247.0,0.5156,0.19,nc,nc75166296,2025-04-16T15:18:08.320Z,"62 km WNW of Petrolia, CA",earthquake,1.27,2.0,0.145,11.0,reviewed,nc,nc,8.81,2025-04-16 15:04:56.790,3.91,Small
3,2025-04-16T15:00:49.220Z,44.342667,-115.176667,10.84,1.76,ml,16.0,123.0,0.3349,0.24,mb,mb90078908,2025-04-16T15:15:17.250Z,"23 km NW of Stanley, Idaho",earthquake,0.63,1.07,0.123255,14.0,reviewed,mb,mb,6.76,2025-04-16 15:00:49.220,1.86,Small
4,2025-04-16T14:59:44.190Z,38.803665,-122.779831,3.24,0.28,md,13.0,84.0,0.009603,0.02,nc,nc75166291,2025-04-16T15:17:21.353Z,"4 km NNW of The Geysers, CA",earthquake,0.33,0.92,0.12,12.0,automatic,nc,nc,5.28,2025-04-16 14:59:44.190,0.38,Small
5,2025-04-16T14:42:52.624Z,64.5566,-149.2173,0.8,1.6,ml,,,,0.45,ak,ak0254vjjyy6,2025-04-16T14:44:46.021Z,"5 km W of Nenana, Alaska",earthquake,,0.3,,,automatic,ak,ak,6.6,2025-04-16 14:42:52.624,1.7,Small
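
The difference is easier to see when the index labels do not match the row positions. A minimal sketch on a hypothetical four-row frame whose labels resemble those of a filtered result:

```python
import pandas as pd

# Hypothetical frame whose labels differ from row positions
quakes = pd.DataFrame(
    {"mag": [3.24, 5.6, 2.37, 4.1]},
    index=[61, 66, 86, 104],
)

by_label = quakes.loc[61:86]     # label slice: both ends included
by_position = quakes.iloc[0:2]   # position slice: end excluded, like a list

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Here `quakes.loc[0]` would raise a `KeyError`, because there is no row labelled 0, while `quakes.iloc[0]` returns the first row.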


### Work with missing data

[[back to top]](#Table-of-Contents)

pandas primarily uses the value `np.nan` to represent missing data (in tables, missing values are shown as `NaN`). By default it is not included in pandas computations. Missing data causes many problems in mathematical and computational tasks with DataFrames and Series, so it is important to know how to deal with these values.

Previously we learned how to check for `null` and `non-null` values in a DataFrame or Series and how to skip rows with `null` values. But what if we need to use rows with `null` data, for example, to find the sum of all values in a column?

Let’s try this:


In [167]:
magError = eqPastMonth['magError']
sum(magError)

nan

The result may be unexpected, because there are many `non-null` values in the `eqPastMonth['magError']` Series. Python’s built-in `sum()` does not skip `NaN`, so a single missing value turns the whole result into `nan`. We could filter `eqPastMonth['magError']` and keep only the `non-null` values. But what if we need to sum the numerical values in every column of `eqPastMonth`? Filtering then becomes impractical, because a whole row is dropped even if only one of its values is `null`. You can try this yourself.

To solve this you can use the pandas method `fillna(value)`, which replaces all `null` values with `value`.


In [168]:
magError = eqPastMonth['magError'].fillna(0)
magError.median()

0.141
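
To compare the different ways of aggregating data with gaps, here is a small sketch on a hypothetical three-element Series. It contrasts Python’s built-in `sum()`, the pandas `sum()` method (which skips `NaN` by default) and `fillna(0)`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
errors = pd.Series([0.2, np.nan, 0.4])

print(sum(errors))              # built-in sum propagates NaN
print(errors.sum())             # pandas skips NaN by default (skipna=True)
print(errors.fillna(0).sum())   # same total, since the filled gap contributes 0

# Filling with 0 does change statistics such as the median or the mean
print(errors.median(), errors.fillna(0).median())
```

The last line is worth keeping in mind: replacing gaps with `0` is harmless for a sum but shifts the median and the mean, so the fill value should be chosen with the statistic in mind.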

In [169]:
eqPastMonth_fillna = eqPastMonth.fillna(0)
eqPastMonth_fillna.head(10)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,magError,magNst,status,locationSource,magSource,magPlus5,datetime,howlongago,magPlus1,custom_mag
0,2025-04-16T19:44:39.410Z,46.617832,-119.800499,6.79,1.565842,ml,11.0,66.0,0.0,0.04,...,0.273023,8.0,automatic,uw,uw,6.565842,2025-04-16 19:44:39.410,-1 days +19:18:11.471471,1.665842,Small
1,2025-04-16T19:30:36.660Z,33.861167,-117.5075,1.35,1.25,ml,41.0,43.0,0.04688,0.18,...,0.289,38.0,automatic,ci,ci,6.25,2025-04-16 19:30:36.660,-1 days +19:32:14.221471,1.35,Small
2,2025-04-16T19:21:00.822Z,34.1706,86.4226,10.0,4.1,mb,36.0,103.0,6.426,0.87,...,0.135,15.0,reviewed,us,us,9.1,2025-04-16 19:21:00.822,-1 days +19:41:50.059471,4.2,Large
3,2025-04-16T18:59:21.660Z,32.143333,-116.051666,15.51,3.46,mlr,62.0,44.0,0.1638,0.24,...,0.401,18.0,reviewed,ci,ci,8.46,2025-04-16 18:59:21.660,-1 days +20:03:29.221471,3.56,Small
4,2025-04-16T18:57:51.552Z,-0.6994,127.5452,10.0,4.6,mb,42.0,62.0,1.472,1.17,...,0.117,22.0,reviewed,us,us,9.6,2025-04-16 18:57:51.552,-1 days +20:04:59.329471,4.7,Large
5,2025-04-16T18:34:18.170Z,33.467,-116.556167,11.44,1.01,ml,39.0,48.0,0.04736,0.19,...,0.256,29.0,automatic,ci,ci,6.01,2025-04-16 18:34:18.170,-1 days +20:28:32.711471,1.11,Small
6,2025-04-16T18:14:00.081Z,11.7508,141.6119,38.462,4.9,mb,78.0,68.0,3.666,0.7,...,0.055,103.0,reviewed,us,us,9.9,2025-04-16 18:14:00.081,-1 days +20:48:50.800471,5.0,Large
7,2025-04-16T18:06:36.870Z,19.337667,-155.179333,7.36,2.48,ml,49.0,81.0,0.04263,0.1,...,0.226572,37.0,reviewed,hv,hv,7.48,2025-04-16 18:06:36.870,-1 days +20:56:14.011471,2.58,Small
8,2025-04-16T18:02:11.241Z,62.4338,-148.125,28.2,2.2,ml,0.0,0.0,0.0,0.52,...,0.0,0.0,automatic,ak,ak,7.2,2025-04-16 18:02:11.241,-1 days +21:00:39.640471,2.3,Small
9,2025-04-16T17:52:02.109Z,61.8794,-149.6647,2.8,2.3,ml,0.0,0.0,0.0,0.64,...,0.0,0.0,automatic,ak,ak,7.3,2025-04-16 17:52:02.109,-1 days +21:10:48.772471,2.4,Small


Thus, we have replaced all `NaN` items with `0`. If you pass `inplace=True` to `fillna()`, the DataFrame is modified in place instead of a new one being returned.
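
`fillna()` also accepts a dict that maps column names to fill values, which is useful when `0` is not a sensible replacement for every column. A sketch on a hypothetical two-row frame; using the column mean for `magError` is just one possible choice, made here for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns
quakes = pd.DataFrame(
    {"nst": [11.0, np.nan], "magError": [np.nan, 0.3], "mag": [1.5, 2.0]}
)

# A dict maps column names to their own replacement values
filled = quakes.fillna({"nst": 0, "magError": quakes["magError"].mean()})

print(filled["nst"].tolist())
print(filled["magError"].tolist())
print(quakes["nst"].isnull().sum())  # the original frame is untouched
```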
   
To keep only the rows with `non-null` values you can use the `dropna()` method:

In [170]:
# Drop rows with any missing values
eqPastMonth_dropna = eqPastMonth.dropna(axis=0)
# Print mean of each numeric column
print(eqPastMonth_dropna.mean(numeric_only=True))
# Show the first 10 rows of the cleaned DataFrame
eqPastMonth_dropna.head(10)

latitude            35.566607
longitude         -102.154534
depth               17.344161
mag                  1.469658
nst                 23.781165
gap                112.714409
dmin                 0.493266
rms                  0.208085
horizontalError      1.616073
depthError           2.381997
magError             0.171572
magNst              19.377305
magPlus5             6.469658
magPlus1             1.569658
dtype: float64


Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,magError,magNst,status,locationSource,magSource,magPlus5,datetime,howlongago,magPlus1,custom_mag
1,2025-04-16T19:30:36.660Z,33.861167,-117.5075,1.35,1.25,ml,41.0,43.0,0.04688,0.18,...,0.289,38.0,automatic,ci,ci,6.25,2025-04-16 19:30:36.660,-1 days +19:32:14.221471,1.35,Small
2,2025-04-16T19:21:00.822Z,34.1706,86.4226,10.0,4.1,mb,36.0,103.0,6.426,0.87,...,0.135,15.0,reviewed,us,us,9.1,2025-04-16 19:21:00.822,-1 days +19:41:50.059471,4.2,Large
3,2025-04-16T18:59:21.660Z,32.143333,-116.051666,15.51,3.46,mlr,62.0,44.0,0.1638,0.24,...,0.401,18.0,reviewed,ci,ci,8.46,2025-04-16 18:59:21.660,-1 days +20:03:29.221471,3.56,Small
4,2025-04-16T18:57:51.552Z,-0.6994,127.5452,10.0,4.6,mb,42.0,62.0,1.472,1.17,...,0.117,22.0,reviewed,us,us,9.6,2025-04-16 18:57:51.552,-1 days +20:04:59.329471,4.7,Large
5,2025-04-16T18:34:18.170Z,33.467,-116.556167,11.44,1.01,ml,39.0,48.0,0.04736,0.19,...,0.256,29.0,automatic,ci,ci,6.01,2025-04-16 18:34:18.170,-1 days +20:28:32.711471,1.11,Small
6,2025-04-16T18:14:00.081Z,11.7508,141.6119,38.462,4.9,mb,78.0,68.0,3.666,0.7,...,0.055,103.0,reviewed,us,us,9.9,2025-04-16 18:14:00.081,-1 days +20:48:50.800471,5.0,Large
7,2025-04-16T18:06:36.870Z,19.337667,-155.179333,7.36,2.48,ml,49.0,81.0,0.04263,0.1,...,0.226572,37.0,reviewed,hv,hv,7.48,2025-04-16 18:06:36.870,-1 days +20:56:14.011471,2.58,Small
10,2025-04-16T17:44:40.248Z,-33.9361,-179.4185,26.558,5.4,mww,81.0,79.0,4.064,0.94,...,0.071,19.0,reviewed,us,us,10.4,2025-04-16 17:44:40.248,-1 days +21:18:10.633471,5.5,Large
11,2025-04-16T17:41:28.360Z,38.836834,-122.800163,2.04,0.23,md,12.0,87.0,0.007828,0.02,...,0.08,13.0,automatic,nc,nc,5.23,2025-04-16 17:41:28.360,-1 days +21:21:22.521471,0.33,Small
12,2025-04-16T17:36:07.080Z,33.03,-116.579667,13.85,0.55,ml,24.0,73.0,0.05301,0.27,...,0.152,9.0,automatic,ci,ci,5.55,2025-04-16 17:36:07.080,-1 days +21:26:43.801471,0.65,Small


You can control how `dropna()` treats `null` values with the `subset` parameter, which sets the columns to inspect, and the `how` parameter, which sets the rule (`'any'` or `'all'`) for dropping a row.
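
A minimal sketch of both parameters on a hypothetical three-row frame (values are invented; only the column names follow the earthquake feed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has gaps, row 2 is empty
quakes = pd.DataFrame(
    {
        "mag": [1.5, 2.0, np.nan],
        "nst": [11.0, np.nan, np.nan],
        "magError": [0.2, np.nan, np.nan],
    }
)

# subset: only look at the listed columns when deciding what to drop
has_mag = quakes.dropna(subset=["mag"])

# how="all": drop a row only if every inspected value is missing
not_all_missing = quakes.dropna(how="all")

# how="any" (the default): drop a row if any inspected value is missing
complete = quakes.dropna(how="any")

print(len(has_mag), len(not_all_missing), len(complete))
```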

> ### Exercise 1

> - Get the type of the `"latitude"` column in `eqPastMonth`.

> - In `eqPastMonth`, find all rows where `magType` equals `"md"`, `mag` is less than `5` and `magError` is `not-null`. Call the resulting DataFrame `eqPastMonth_md_large`.

In [86]:
# type your code here
eqPastMonth_md_large = eqPastMonth