The main objective of this notebook is to create a .csv file containing the samples ('muestra' in Spanish), the number of observations in each sample, and the area, date, position and depth of each observation. A separate dataset is processed for each of the three phyla: Crustacea, Cnidaria and Mollusca.

In [None]:
#Import the libraries

import pandas
import geopandas as gpd
import os 
import fiona

We start by loading the shapefile of the areas into which the world's shelf zone is divided. The file contains 66 zones.

In [None]:

# Load the shapefile with the 66 zones
shapefile = gpd.read_file(r'C:\Users\glode\OneDrive\Desktop\doctorado\filtrado_obis\lme66.shp')

# Declare the coordinate reference system as WGS 84 (EPSG:4326)
shapefile = shapefile.set_crs(epsg=4326)

We load the data for each dataset. The data are downloaded from OBIS and filtered with Filtrado_OBIS_F1.py. After filtering, each record has the following parameters: latitude, longitude, depth, genusid (a WoRMS identifier that indicates the genus of each observation) and date.

In [None]:
#Load the data

data=pandas.read_csv('Occurrence_Filtrado_F1_mollusca_sinprofundidad.txt', sep='\t')
print(data.keys())

Index(['decimalLatitude', 'decimalLongitude', 'depth', 'genusid',
       'fecha_hora'],
      dtype='object')


In [None]:
#Check how many records have a genusid

data_genusid=data[data['genusid'].notna()]

print(len(data))    
print(len(data_genusid))

We create a new column to clean the 'fecha_hora' (date and hour) column: it contains the hour, which we don't need. We then convert the new column to datetime and overwrite the original date column with it.

In [None]:
#Clean the data
data['fecha_hora_limpia'] = data['fecha_hora'].str.extract(r'^(\d{2}-\d{2}-\d{4})')[0]

#Convert to datetime
data['fecha_hora_limpia'] = pandas.to_datetime(data['fecha_hora_limpia'], format='%d-%m-%Y')

#Overwrite the original column
data["fecha_hora"]=data["fecha_hora_limpia"]
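The date cleaning above can be tried on a few toy values. This is only a sketch: the 'dd-mm-yyyy hh:mm:ss' layout of the raw strings is an assumption inferred from the regular expression, and the toy rows are invented for illustration.

```python
import pandas as pd

# Toy values (hypothetical); the raw layout is assumed from the regex used above
toy = pd.DataFrame(
    {"fecha_hora": ["04-09-1976 10:30:00", "05-09-1976", "not a date", None]}
)

# Keep only the leading dd-mm-yyyy part; rows that do not match become NaN
fecha = toy["fecha_hora"].str.extract(r"^(\d{2}-\d{2}-\d{4})")[0]

# Parse as day-month-year; missing values become NaT
toy["fecha_hora"] = pd.to_datetime(fecha, format="%d-%m-%Y")

# Count the rows whose date could not be recovered
n_sin_fecha = int(toy["fecha_hora"].isna().sum())
print(toy)
print(n_sin_fecha)
```

Counting the missing dates after the conversion is a quick way to see how many rows did not follow the expected layout.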

In [None]:
#To check if we have done it correctly
data.head()

Unnamed: 0,decimalLatitude,decimalLongitude,depth,genusid,fecha_hora,fecha_hora_limpia
0,-39.478333,177.005,12.5,,1976-09-04,1976-09-04
1,-39.466667,177.136667,15.0,,1976-09-04,1976-09-04
2,-39.583333,177.071667,7.5,,1976-09-05,1976-09-05
3,-39.516667,177.416667,49.0,,1976-09-05,1976-09-05
4,-39.4,177.666667,33.0,,1976-09-05,1976-09-05


We keep only the rows whose genusid is not NaN, i.e., the points that have information about the genus.

In [None]:
data_genusid=data[data['genusid'].notna()]
print(data_genusid)

We create the 'muestra' column. Points that share the same latitude, longitude and date belong to the same 'muestra'. We assign a number to each of these groups.

In [None]:
#Create the 'muestra' column; assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
keys = ['decimalLatitude', 'decimalLongitude', 'fecha_hora']
data_genusid = data_genusid.assign(muestra=data_genusid.groupby(keys).ngroup())
#Check if the column was created correctly
print(data_genusid)


In [None]:
#Sort the data by 'muestra' to make checking easier
df_ordenado = data_genusid.sort_values(by='muestra')
print(df_ordenado)
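To see how the group numbering behaves, here is a minimal sketch on invented rows. The column names match the notebook; the coordinates and dates are hypothetical.

```python
import pandas as pd

# Four toy observations: the first two share position and date
toy = pd.DataFrame(
    {
        "decimalLatitude": [-39.4, -39.4, -39.4, 10.0],
        "decimalLongitude": [177.6, 177.6, 177.6, 20.0],
        "fecha_hora": pd.to_datetime(
            ["1976-09-05", "1976-09-05", "1976-09-06", "1976-09-05"]
        ),
    }
)

# ngroup() numbers each distinct (latitude, longitude, date) combination
keys = ["decimalLatitude", "decimalLongitude", "fecha_hora"]
toy["muestra"] = toy.groupby(keys).ngroup()
print(toy)
```

The third row has the same position as the first two but a different date, so it gets its own 'muestra' number.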

Once the sample column exists, we create the 'observaciones' column. It gives the number of observations in each 'muestra', where each row counts as one observation.

In [None]:
#Create the column 'observaciones'; assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
data_genusid = data_genusid.assign(observaciones=data_genusid.groupby('muestra')['muestra'].transform('count'))
#Sort the data by 'muestra' to make checking easier
df_ordenado = data_genusid.sort_values(by='muestra')
print(df_ordenado)
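The counting step can be checked on a small invented table. In this sketch the 'muestra' and 'genusid' values are hypothetical; `transform('count')` repeats the group size on every row, while `size()` gives one value per 'muestra' for comparison.

```python
import pandas as pd

# Toy table with three samples of 2, 1 and 3 rows
toy = pd.DataFrame(
    {"muestra": [0, 0, 1, 2, 2, 2], "genusid": [11, 12, 11, 13, 13, 14]}
)

# Row-wise: every row receives the number of rows in its 'muestra'
toy["observaciones"] = toy.groupby("muestra")["muestra"].transform("count")

# Per-sample summary, one value for each 'muestra'
resumen = toy.groupby("muestra").size()
print(toy)
print(resumen)
```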


Now it is time to identify the region in which each point is located. To do this, we convert the points to a GeoDataFrame and use the function gpd.sjoin to find the region that contains each point.  

The dataset contains points from all around the globe, in the ocean and even on land, and we don't need all of them. We therefore have to keep only the points that fall inside one of the regions.

In [None]:
#Convert the data to a geodataframe
gdf_puntos = gpd.GeoDataFrame(
    data_genusid, 
    geometry=gpd.points_from_xy(data_genusid['decimalLongitude'], data_genusid['decimalLatitude']),
    crs='EPSG:4326'
)

#Make sure the shapefile uses the same CRS as the points (EPSG:4326)
shapefile = shapefile.to_crs(epsg=4326)

#Assign to each point the region of the shapefile that contains it
gdf_resultado = gpd.sjoin(gdf_puntos, shapefile, how='left', predicate='within')

#We check if we have done it correctly
print(gdf_resultado)
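The spatial join can be illustrated without the real shapefile. This is a sketch under stated assumptions: the two square regions and the three points are invented, and only the 'OBJECTID' and 'muestra' column names come from the notebook.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical square regions standing in for the shapefile polygons
regiones = gpd.GeoDataFrame(
    {"OBJECTID": [1, 2]},
    geometry=[box(0, 0, 10, 10), box(20, 0, 30, 10)],
    crs="EPSG:4326",
)

# Three toy points: inside region 1, inside region 2, outside both
puntos = gpd.GeoDataFrame(
    {"muestra": [0, 1, 2]},
    geometry=gpd.points_from_xy([5, 25, 15], [5, 5, 5]),  # x = longitude, y = latitude
    crs="EPSG:4326",
)

# Left join: every point is kept; points outside all regions get NaN in OBJECTID
resultado = gpd.sjoin(puntos, regiones, how="left", predicate="within")

# Keep only the points that fall inside a region
dentro = resultado[resultado["OBJECTID"].notna()]
print(resultado[["muestra", "OBJECTID"]])
```

With how='left', a point that lies inside two overlapping polygons appears once per match, so it may be worth checking the result for duplicated rows if the regions could overlap.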

This procedure adds many columns that we don't need, so we delete them.  
As noted above, some points lie outside all 66 regions (for example on land), so we keep only the points that fall inside one of them.

In [None]:
#We delete the columns that we do not need

df_resultado = gdf_resultado.drop(columns=['geometry', 'index_right', 'LME_NUMBER', 'LME_NAME', 'Shape_Area', 'Shape_Leng'])

#We check if we have done it correctly, ordering the data by 'muestra' to facilitate the testing
df_ordenado=df_resultado.sort_values(by='muestra')
print(df_ordenado)

#We select only the points that are inside a region (points without a match have NaN in 'OBJECTID')
df_ordenado=df_resultado[df_resultado['OBJECTID'].notna()]




We save our progress in a .csv file.

In [194]:
df_ordenado.to_csv('resultado_final_mollusca_sinprofundidad.csv', sep=',', index=False)

We repeat this procedure for each phylum: Cnidaria, Mollusca and Crustacea.
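One way to avoid copying the cells for each phylum is to wrap the tabular steps in a function. The sketch below is not part of the original notebook: `add_sample_columns` is a hypothetical helper, the toy rows are invented, and the spatial join is left out to keep the example short.

```python
import pandas as pd


def add_sample_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: clean the date and add 'muestra' and 'observaciones'."""
    # Keep only the rows that have genus information
    out = df[df["genusid"].notna()].copy()

    # Keep the dd-mm-yyyy part of the date and convert it to datetime
    fecha = out["fecha_hora"].str.extract(r"^(\d{2}-\d{2}-\d{4})")[0]
    out["fecha_hora"] = pd.to_datetime(fecha, format="%d-%m-%Y")

    # Number each (latitude, longitude, date) group and count its rows
    keys = ["decimalLatitude", "decimalLongitude", "fecha_hora"]
    out["muestra"] = out.groupby(keys).ngroup()
    out["observaciones"] = out.groupby("muestra")["muestra"].transform("count")
    return out


# Toy input with the same columns as the filtered OBIS file (values invented)
toy = pd.DataFrame(
    {
        "decimalLatitude": [1.0, 1.0, 2.0, 3.0],
        "decimalLongitude": [5.0, 5.0, 6.0, 7.0],
        "depth": [10.0, 12.0, 30.0, 5.0],
        "genusid": [137.0, 138.0, None, 137.0],
        "fecha_hora": ["04-09-1976 08:00", "04-09-1976 09:15", "05-09-1976", "06-09-1976"],
    }
)
resultado = add_sample_columns(toy)
print(resultado)
```

The function could then be called once per phylum file. The input file names for Cnidaria and Crustacea are not shown in this notebook, so they would have to be supplied.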