# Exploratory analysis of raw iNaturalist data

This notebook examines the data downloaded from iNaturalist to decide which cleaning steps are needed.

## Libraries and environment

In [None]:
import sys
from pathlib import Path

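# Add the parent of the current working directory (the repository root when the
# notebook runs from its own folder) to sys.path so the local `src` package imports.
ROOT = Path().resolve().parents[0]
sys.path.append(str(ROOT))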

In [None]:
import pandas as pd
from src.data.data import clean_dataset

## Analysis

In [None]:
df_inat = pd.read_csv('../data/raw/df_inat_ps.csv')
df_inat.head()

Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,created_at,updated_at,quality_grade,url,...,positioning_device,place_town_name,place_county_name,place_state_name,place_country_name,species_guess,scientific_name,common_name,iconic_taxon_name,taxon_id
0,109096,2010-06-17,2010-06-17,,Eastern Time (US & Canada),6566,2012-08-04 05:09:30 UTC,2021-08-20 21:15:00 UTC,research,http://www.inaturalist.org/observations/109096,...,,,Sopetrán,Antioquia,Colombia,Poliporo naranja,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
1,134256,2007-01-07,2007-01-07,,Pacific Time (US & Canada),642,2012-10-12 00:53:24 UTC,2024-12-03 04:13:18 UTC,research,http://www.inaturalist.org/observations/134256,...,,,"Sarapiqui, Heredia Costa Rica",Heredia,Costa Rica,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
2,176110,2006-05-18,2006-05-18,,Pacific Time (US & Canada),11882,2013-01-09 00:34:23 UTC,2019-12-09 20:54:30 UTC,research,http://www.inaturalist.org/observations/176110,...,,,,San Salvador,Bahamas,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
3,206846,2013-01-15 12:31:20,2013-01-15,2013-01-15 22:31:20 UTC,Hawaii,3494,2013-02-28 07:08:38 UTC,2019-12-09 20:54:31 UTC,research,http://www.inaturalist.org/observations/206846,...,,,Pinellas,Florida,United States,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
4,240135,2010-09-18,2010-09-18,,Mexico City,9345,2013-04-18 22:30:54 UTC,2024-03-13 16:52:02 UTC,research,http://conabio.inaturalist.org/observations/24...,...,,,Malinalco,México,Mexico,Hongo de repisa naranja,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664


In [7]:
df_inat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11097 entries, 0 to 11096
Data columns (total 38 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                11097 non-null  int64  
 1   observed_on_string                11097 non-null  object 
 2   observed_on                       11097 non-null  object 
 3   time_observed_at                  10607 non-null  object 
 4   time_zone                         11097 non-null  object 
 5   user_id                           11097 non-null  int64  
 6   created_at                        11097 non-null  object 
 7   updated_at                        11097 non-null  object 
 8   quality_grade                     11097 non-null  object 
 9   url                               11097 non-null  object 
 10  image_url                         11097 non-null  object 
 11  tag_list                          259 non-null    object 
 12  desc

### Date fields

The observed_on column, which holds the normalized observation date, is kept. The time of day is not relevant here.

In [8]:
date_cols = ['observed_on_string',
             'observed_on',
             'time_observed_at',
             'time_zone',
             'created_at',
             'updated_at']

df_inat[date_cols].describe()

Unnamed: 0,observed_on_string,observed_on,time_observed_at,time_zone,created_at,updated_at
count,11097,11097,10607,11097,11097,11097
unique,11009,3103,10548,175,11067,10636
top,2015-03-01,2025-04-12,2021-11-21 13:29:00 UTC,Brasilia,2025-10-20 19:47:03 UTC,2024-06-25 13:42:48 UTC
freq,4,36,3,1693,3,30
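
The decision to keep only observed_on could be applied roughly as follows. This is a minimal sketch on toy rows, not the project's `clean_dataset` function: the helper name `keep_observed_on` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy rows standing in for df_inat; the values are illustrative only.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "observed_on_string": ["2016/11/24 2:00 PM BRST", "2014-10-25", "2018-05-18 11:16:47"],
    "observed_on": ["2016-11-24", "2014-10-25", "2018-05-18"],
    "time_observed_at": ["2016-11-24 16:00:00 UTC", None, "2018-05-18 01:16:47 UTC"],
    "time_zone": ["Brasilia", "Pacific Time (US & Canada)", "Brisbane"],
    "created_at": ["2016-11-29 11:53:57 UTC", "2016-12-10 04:57:05 UTC", "2018-05-18 03:43:02 UTC"],
    "updated_at": ["2024-02-01 16:13:10 UTC", "2025-11-10 15:50:47 UTC", "2020-06-02 07:59:59 UTC"],
})


def keep_observed_on(df):
    """Hypothetical helper: drop the other date columns and parse observed_on."""
    redundant = ["observed_on_string", "time_observed_at", "time_zone",
                 "created_at", "updated_at"]
    out = df.drop(columns=redundant, errors="ignore")
    # Values that do not match YYYY-MM-DD become NaT instead of raising.
    out["observed_on"] = pd.to_datetime(out["observed_on"], format="%Y-%m-%d",
                                        errors="coerce")
    return out


cleaned = keep_observed_on(toy)
print(cleaned.dtypes)
```

Parsing with `errors="coerce"` is a design choice: malformed dates surface as missing values that can be counted afterwards rather than stopping the run.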


### Quality grade

Every observation has a quality grade of research, because the search on the iNaturalist site was restricted to that grade.

In [10]:
df_inat.quality_grade.value_counts()

quality_grade
research    11097
Name: count, dtype: int64

### Agreements and disagreements

No observation has at least as many disagreements as agreements. Since all of them are research grade, each has at least one agreement.

In [None]:
df_inat.num_identification_agreements.describe() # every observation has at least 1 agreement

count    11097.000000
mean         1.330540
std          0.589164
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          6.000000
Name: num_identification_agreements, dtype: float64

In [None]:
df_inat.num_identification_disagreements.describe() # some observations have disagreements

count    11097.000000
mean         0.003965
std          0.064264
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          2.000000
Name: num_identification_disagreements, dtype: float64

In [13]:
df_inat[df_inat.num_identification_disagreements > 0]

Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,created_at,updated_at,quality_grade,url,...,positioning_device,place_town_name,place_county_name,place_state_name,place_country_name,species_guess,scientific_name,common_name,iconic_taxon_name,taxon_id
47,2588640,2016-01-16,2016-01-16,,Buenos Aires,168406,2016-01-17 15:42:59 UTC,2023-05-09 02:02:07 UTC,research,http://www.inaturalist.org/observations/2588640,...,,,Distrito Federal,Ciudad de Buenos Aires,Argentina,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
75,4660350,2016/11/24 2:00 PM BRST,2016-11-24,2016-11-24 16:00:00 UTC,Brasilia,361088,2016-11-29 11:53:57 UTC,2024-02-01 16:13:10 UTC,research,http://www.inaturalist.org/observations/4660350,...,,,Tapiraí,São Paulo,Brazil,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
81,4747984,2014-10-25,2014-10-25,,Pacific Time (US & Canada),25945,2016-12-10 04:57:05 UTC,2025-11-10 15:50:47 UTC,research,http://www.inaturalist.org/observations/4747984,...,,Fagáceas del Noreste de México,Aquismón,,Mexico,Trametes sanguinea,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
158,7554395,Sat Aug 05 2017 10:23:09 GMT-0300 (GMT-3),2017-08-05,2017-08-05 13:23:09 UTC,Brasilia,555357,2017-08-19 13:36:45 UTC,2019-12-09 21:30:17 UTC,research,https://www.inaturalist.org/observations/7554395,...,,,Joinvile,Santa Catarina,Brazil,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
166,7812416,Wed Sep 06 2017 08:17:10 GMT-0500 (CDT),2017-09-06,2017-09-06 13:17:10 UTC,Central Time (US & Canada),302351,2017-09-06 22:14:39 UTC,2025-10-27 21:59:47 UTC,research,https://www.inaturalist.org/observations/7812416,...,,Fagáceas del Noreste de México,San Nicolás de los Garza,Nuevo León,Mexico,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
323,10976257,2014-05-10,2014-05-10,,Pretoria,389997,2014-05-10 15:44:04 UTC,2025-11-08 17:44:32 UTC,research,https://www.inaturalist.org/observations/10976257,...,,,George Greater Municipality and marine,Western Cape,South Africa,Laetiporus,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
345,11126142,2015-01-18,2015-01-18,,Pretoria,389997,2015-01-18 14:11:45 UTC,2025-11-08 18:04:12 UTC,research,https://www.inaturalist.org/observations/11126142,...,,,George Greater Municipality and marine,Western Cape,South Africa,Stereum,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
471,15864201,2018-08-13 8:26:00 AM CST,2018-08-13,2018-08-13 13:26:00 UTC,Central Time (US & Canada),397624,2018-08-25 01:54:03 UTC,2023-11-23 02:13:17 UTC,research,https://www.inaturalist.org/observations/15864201,...,,,Ubatuba,São Paulo,Brazil,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
527,17468951,2018-10-13 12:30:00 PM GMT-05:00,2018-10-13,2018-10-13 07:30:00 UTC,Ekaterinburg,725820,2018-10-13 17:31:12 UTC,2023-01-16 16:08:23 UTC,research,https://www.inaturalist.org/observations/17468951,...,gps,,Linhares,Espírito Santo,Brazil,Hongo de repisa naranja,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
712,21365123,2019/03/17 4:32 PM -03,2019-03-17,,Buenos Aires,1234701,2019-03-18 19:47:58 UTC,2024-01-21 15:44:27 UTC,research,https://www.inaturalist.org/observations/21365123,...,,,Capital,Córdoba,Argentina,Hongo de repisa naranja,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664


In [14]:
df_inat[df_inat.num_identification_disagreements >= df_inat.num_identification_agreements]

Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,created_at,updated_at,quality_grade,url,...,positioning_device,place_town_name,place_county_name,place_state_name,place_country_name,species_guess,scientific_name,common_name,iconic_taxon_name,taxon_id
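
The empty result above comes from comparing the two count columns row by row. The same check can be wrapped as a reusable filter, sketched here on toy data. The helper name `contested` and the sample counts are assumptions for illustration; the column names are the ones used in the notebook.

```python
import pandas as pd

# Toy counts; in the real data no row satisfies the condition.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "num_identification_agreements": [1, 2, 1, 3],
    "num_identification_disagreements": [0, 1, 1, 2],
})


def contested(df):
    """Hypothetical helper: rows with at least as many disagreements as agreements."""
    mask = (df["num_identification_disagreements"]
            >= df["num_identification_agreements"])
    return df[mask]


flagged = contested(toy)
kept = toy.drop(index=flagged.index)
print(flagged["id"].tolist(), kept["id"].tolist())
```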


### Captive cultivated

In [16]:
df_inat.captive_cultivated.value_counts()

captive_cultivated
False    11097
Name: count, dtype: int64

### Geography

- Some observations have null latitude and longitude (those with geoprivacy 'private').


In [18]:
geo_cols = ['place_guess',
            'latitude',
            'longitude',
            'positional_accuracy',
            'private_place_guess',
            'private_latitude',
            'private_longitude',
            'public_positional_accuracy',
            'geoprivacy',
            'taxon_geoprivacy',
            'coordinates_obscured',
            'positioning_method',
            'positioning_device',
            'place_town_name',
            'place_county_name',
            'place_state_name',
            'place_country_name']

df_geo = df_inat[geo_cols]
df_geo.head()

Unnamed: 0,place_guess,latitude,longitude,positional_accuracy,private_place_guess,private_latitude,private_longitude,public_positional_accuracy,geoprivacy,taxon_geoprivacy,coordinates_obscured,positioning_method,positioning_device,place_town_name,place_county_name,place_state_name,place_country_name
0,Quebrada la sopetrana,6.506786,-75.758628,16.0,,,,16.0,,,False,,,,Sopetrán,Antioquia,Colombia
1,La Selva Biological Station,10.430811,-84.005815,,,,,,,,False,,,,"Sarapiqui, Heredia Costa Rica",Heredia,Costa Rica
2,"San Salvador, Bahamas",24.07123,-74.520607,3880.0,,,,3880.0,,,False,,,,,San Salvador,Bahamas
3,"Fort de Soto, Florida, United States",27.643108,-82.735084,,,,,,,,False,,,,Pinellas,Florida,United States
4,"Los Tepehuajes, sur de Malinalco, Estado de M...",18.858167,-99.445004,,,,,,,,False,,,,Malinalco,México,Mexico


In [19]:
df_geo.isna().sum()

place_guess                      15
latitude                         10
longitude                        10
positional_accuracy            2491
private_place_guess           11097
private_latitude              11097
private_longitude             11097
public_positional_accuracy     2455
geoprivacy                    10875
taxon_geoprivacy              11097
coordinates_obscured              0
positioning_method             7967
positioning_device             7951
place_town_name               10450
place_county_name               381
place_state_name                 15
place_country_name               10
dtype: int64
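
The counts above show 10 rows without latitude or longitude, which matches the number of 'private' observations. One way to confirm that they are the same rows, and to drop them, is sketched below on toy data; the sample values are assumptions for illustration.

```python
import pandas as pd

# Toy rows; None marks a missing value, as in the raw export.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "latitude": [6.5, None, 24.07, None],
    "longitude": [-75.7, None, -74.5, None],
    "geoprivacy": [None, "private", "obscured", "private"],
})

missing_coords = toy["latitude"].isna() | toy["longitude"].isna()
is_private = toy["geoprivacy"].eq("private")

# True only when the rows without coordinates are exactly the private ones.
same_rows = bool((missing_coords == is_private).all())

# Keep only observations that can be placed on a map.
with_coords = toy.dropna(subset=["latitude", "longitude"])
print(same_rows, with_coords["id"].tolist())
```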

In [None]:
df_inat.positional_accuracy.describe() # observations with higher uncertainty could be filtered out

count    8.606000e+03
mean     4.458198e+03
std      7.108133e+04
min      1.000000e+00
25%      7.000000e+00
50%      3.300000e+01
75%      2.440000e+02
max      2.801445e+06
Name: positional_accuracy, dtype: float64
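
Filtering on uncertainty, as the comment in the cell suggests, needs an explicit decision about rows where positional_accuracy is missing. The sketch below makes that choice a parameter. The helper `filter_by_accuracy` and the sample values are assumptions, the values are assumed to be in meters, and the 5000 cutoff only mirrors the number used in the notebook's next query; it is not a recommended threshold.

```python
import pandas as pd

# Toy rows; None stands for an observation with no reported accuracy.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "positional_accuracy": [16.0, None, 3880.0, 2801445.0],
})


def filter_by_accuracy(df, max_meters, keep_unknown=True):
    """Hypothetical helper: keep rows with positional_accuracy <= max_meters.

    Rows with a missing value are kept or dropped according to keep_unknown.
    """
    acc = df["positional_accuracy"]
    mask = acc <= max_meters  # comparisons with NaN evaluate to False
    if keep_unknown:
        mask = mask | acc.isna()
    return df[mask]


MAX_METERS = 5000  # assumed cutoff, for illustration only
strict = filter_by_accuracy(toy, MAX_METERS, keep_unknown=False)
lenient = filter_by_accuracy(toy, MAX_METERS)
print(strict["id"].tolist(), lenient["id"].tolist())
```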

In [37]:
df_inat[df_inat.positional_accuracy > 5000].geoprivacy.value_counts()

geoprivacy
obscured    22
Name: count, dtype: int64

In [22]:
df_inat.public_positional_accuracy.describe()

count    8.642000e+03
mean     5.149278e+03
std      7.103933e+04
min      1.000000e+00
25%      8.000000e+00
50%      3.600000e+01
75%      2.680000e+02
max      2.801445e+06
Name: public_positional_accuracy, dtype: float64

In [42]:
df_inat[df_inat.geoprivacy == 'private']

Unnamed: 0,id,observed_on_string,observed_on,time_observed_at,time_zone,user_id,created_at,updated_at,quality_grade,url,...,positioning_device,place_town_name,place_county_name,place_state_name,place_country_name,species_guess,scientific_name,common_name,iconic_taxon_name,taxon_id
88,4956905,2016/12/31 1:16 PM CST,2016-12-31,2016-12-31 19:16:00 UTC,Central Time (US & Canada),129239,2017-01-16 21:21:14 UTC,2019-12-09 20:56:01 UTC,research,http://conabio.inaturalist.org/observations/49...,...,,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
265,10480376,Sun Oct 09 2016 10:11:23 GMT-0300 (GMT-3),2016-10-09,2016-10-09 13:11:23 UTC,Brasilia,815236,2018-03-28 23:29:58 UTC,2020-06-02 07:26:49 UTC,research,https://www.inaturalist.org/observations/10480376,...,,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
415,12566784,2018-05-18 11:16:47,2018-05-18,2018-05-18 01:16:47 UTC,Brisbane,787427,2018-05-18 03:43:02 UTC,2020-06-02 07:59:59 UTC,research,https://www.inaturalist.org/observations/12566784,...,,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
433,13068009,2018-06-03 14:05:35,2018-06-03,2018-06-03 04:05:35 UTC,Brisbane,787427,2018-06-03 05:39:38 UTC,2020-06-02 08:11:02 UTC,research,https://www.inaturalist.org/observations/13068009,...,,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
1879,64079029,2015/11/16 4:58 PM CST,2015-11-16,2015-11-16 22:58:00 UTC,Mexico City,3752984,2020-11-03 01:49:25 UTC,2020-11-03 01:49:50 UTC,research,https://www.inaturalist.org/observations/64079029,...,,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
2275,69794138,Fri Feb 19 2021 09:31:20 GMT+0800 (GMT+8),2021-02-19,2021-02-18 17:31:20 UTC,Perth,4030889,2021-02-19 02:01:18 UTC,2022-01-24 06:57:49 UTC,research,https://www.inaturalist.org/observations/69794138,...,,,,,,Trametes sanguinea,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
4313,137479675,2022-10-02,2022-10-02,,Brasilia,4023807,2022-10-04 01:10:20 UTC,2024-05-21 10:27:16 UTC,research,https://www.inaturalist.org/observations/13747...,...,gps,,,,,Orelha-de-pau,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
5664,166712649,2021-11-13 17:08:39,2021-11-13,2021-11-13 20:08:39 UTC,Brasilia,7017691,2023-06-11 01:26:48 UTC,2023-06-12 15:18:53 UTC,research,https://www.inaturalist.org/observations/16671...,...,gps,,,,,Orelha-de-pau,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
7500,206504103,2024-04-10 18:07:38,2024-04-10,2024-04-10 23:07:38 UTC,Bogota,7962141,2024-04-10 23:09:35 UTC,2024-09-01 17:59:29 UTC,research,https://www.inaturalist.org/observations/20650...,...,gps,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664
10116,288072761,2025-06-09 10:29:24+10:00,2025-06-09,2025-06-09 00:29:24 UTC,Sydney,7666231,2025-06-09 01:07:43 UTC,2025-06-12 22:26:54 UTC,research,https://www.inaturalist.org/observations/28807...,...,,,,,,Cinnabar Bracket,Trametes sanguinea,Hongo de repisa naranja,Fungi,974664


In [29]:
df_inat.geoprivacy.value_counts()

geoprivacy
obscured    212
private      10
Name: count, dtype: int64

In [38]:
df_geo[df_geo.geoprivacy == 'obscured'][['public_positional_accuracy',
                                         'positional_accuracy']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
public_positional_accuracy,212.0,44617.367925,97491.92774,27847.0,29476.75,30175.0,30666.0,1217796.0
positional_accuracy,179.0,20280.558659,110535.558224,1.0,8.0,90.0,1236.0,1217796.0


In [39]:
df_geo[df_geo.geoprivacy == 'private'][['public_positional_accuracy',
                                         'positional_accuracy']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
public_positional_accuracy,10.0,30552.9,649.550178,28930.0,30508.0,30632.0,30855.0,31362.0
positional_accuracy,7.0,197.714286,456.455808,5.0,16.5,20.0,47.0,1232.0


In [40]:
df_geo[df_geo.geoprivacy.isna()][['public_positional_accuracy',
                                         'positional_accuracy']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
public_positional_accuracy,8420.0,4125.37399,70002.921066,1.0,7.0,32.0,234.0,2801445.0
positional_accuracy,8420.0,4125.37399,70002.921066,1.0,7.0,32.0,234.0,2801445.0
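
The three summaries above (obscured, private, and missing geoprivacy) can also be produced in a single grouped call. This is a minimal sketch on toy data: the label "open" for rows without a geoprivacy value is a name chosen here, and the sample numbers are illustrative.

```python
import pandas as pd

# Toy rows loosely shaped like the three groups inspected above.
toy = pd.DataFrame({
    "geoprivacy": [None, None, "obscured", "obscured", "private"],
    "public_positional_accuracy": [16.0, 32.0, 29476.0, 30666.0, 30632.0],
    "positional_accuracy": [16.0, 32.0, 8.0, 90.0, 20.0],
})

# groupby drops missing keys by default, so give them an explicit label first.
groups = toy["geoprivacy"].fillna("open")

summary = (
    toy.groupby(groups)[["public_positional_accuracy", "positional_accuracy"]]
    .median()
)
print(summary)
```

Replacing `.median()` with `.describe()` would give the full set of statistics per group, at the cost of a wider table.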


In [50]:
df_geo.positioning_device.value_counts()

positioning_device
gps       3130
manual      13
google       3
Name: count, dtype: int64

In [51]:
df_geo.positioning_method.value_counts()

positioning_method
gps    3130
Name: count, dtype: int64

In [44]:
df_inat.taxon_geoprivacy.isna().sum()

np.int64(11097)

### Species guess and names

In [59]:
df_inat.species_guess.value_counts()

species_guess
Cinnabar Bracket            5369
Hongo de repisa naranja     3226
Orelha-de-pau               1413
Trametes sanguinea           530
Poliporo naranja             358
血紅密孔菌 (朱紅菌)                   86
血红栓孔菌                         26
Hongo de Repisa Naranja       21
Orelha-de-Pau                 17
เห็ดขอนแดงรูเล็ก              11
Poliporo Naranja               8
Trametes                       5
Pycnoporus sanguineus          4
outkovka krvavá                3
Hongos de repisa               3
Common Cinnabar Polypore       2
Fungos e Líquens               2
Stereum                        1
Laetiporus                     1
血紅密孔菌                          1
血红密孔菌                          1
Hongos                         1
Orelha de pau de sangue        1
Name: count, dtype: int64
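
Several entries in the counts above differ only in capitalization, for example "Hongo de repisa naranja" and "Hongo de Repisa Naranja". If species_guess were kept, a light normalization would merge them. The sketch below uses a handful of values copied from the output; it does not merge translations or synonyms, only case and surrounding whitespace.

```python
import pandas as pd

# A few values taken from the value_counts output above.
guesses = pd.Series([
    "Cinnabar Bracket",
    "Hongo de repisa naranja",
    "Hongo de Repisa Naranja",
    "Orelha-de-pau",
    "Orelha-de-Pau",
    "Poliporo naranja",
    "Poliporo Naranja",
], name="species_guess")

# casefold() is a more aggressive lower() intended for caseless comparison.
normalized = guesses.str.strip().str.casefold()
counts = normalized.value_counts()
print(counts)
```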

In [60]:
df_inat.scientific_name.value_counts()

scientific_name
Trametes sanguinea    11097
Name: count, dtype: int64

In [61]:
df_inat.common_name.value_counts()

common_name
Hongo de repisa naranja    11097
Name: count, dtype: int64

In [62]:
df_inat.iconic_taxon_name.value_counts()

iconic_taxon_name
Fungi    11097
Name: count, dtype: int64

In [63]:
df_inat.taxon_id.value_counts()

taxon_id
974664    11097
Name: count, dtype: int64
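
Several columns checked in this notebook hold a single value across all rows (scientific_name, common_name, iconic_taxon_name, taxon_id) or are entirely null (taxon_geoprivacy). Such columns can be detected generically instead of one by one. The helper `constant_columns` below is a hypothetical name, and the toy frame only imitates the shape of the real data.

```python
import pandas as pd

# Toy frame mixing constant, all-null, and varying columns.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "scientific_name": ["Trametes sanguinea"] * 3,
    "iconic_taxon_name": ["Fungi"] * 3,
    "taxon_id": [974664] * 3,
    "taxon_geoprivacy": [None, None, None],
    "place_country_name": ["Colombia", "Costa Rica", "Bahamas"],
})


def constant_columns(df):
    """Hypothetical helper: names of columns with at most one distinct value.

    dropna=False counts a missing value as a value, so all-null columns
    are reported as constant too.
    """
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


to_drop = constant_columns(toy)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
```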