<a href="https://colab.research.google.com/github/nataliaboaventura/microplastic/blob/main/Marine_Microplastics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://raw.githubusercontent.com/nataliaboaventura/microplastic/refs/heads/main/sea-2470908.jpg)
*Source: [Pixabay](https://pixabay.com/photos/sea-ocean-underwater-turtle-2470908/), Public Domain.*
#Exploring Microplastics in the Atlantic Ocean: Data Analysis and Insights
by *Natalia Boaventura*


The objective of this project is to explore Kaggle's dataset ["Marine Microplastics"](https://www.kaggle.com/datasets/william2020/marine-microplastics?resource=download&SSORegistrationToken=CfDJ8EnTcgNki7pNlcBNxcx19bsgzTpDM_RqiDGZsGaOUlZdCTG7WJ1TMWnQltiT6VmHxokRFTJPvsjF-6LWqziEnauehrYSXLbnZN-7B0yKPpUYeFrzWxHd02RBJ4C_U8bH6NOqB0WhJ6SXn3tul5yzqNUNOv9lcbqxzeXDLT7zNidLdz1N6QegQGQQxIKutTtiOkrWD1ZFOcVft4wrBRy3ujo3lPbjLkk-f_d4-C--L9sX50heY1qpEYm1jlj1Olfd-2UMCC4J4WX6OEJfNhblWEfutjVCsXUgEimrLWI4c1ZUXpjpeuOwhYWheuA1gpUFsek432taaBs5l5Fa4rjXBGH7zNlVuMafG9bPHydctAuuy_69kCU&DisplayName=Natalia). The Marine Microplastics database contains data on microplastic concentrations, collection locations, and sampling methods, aiming to improve water quality and protect coastal ecosystems.
Microplastics are a growing environmental concern, as they pose significant threats to marine life and ecosystems. By analyzing this dataset, the project aims to understand the distribution, sources, and impact of microplastics in the ocean. The exploration will involve cleaning and visualizing the data, identifying patterns, and possibly correlating the presence of microplastics with environmental factors such as water temperature, location, and pollution levels. The insights gained from this analysis could help in developing strategies for mitigating plastic pollution in the ocean.


In [None]:
# Importing libraries
import pandas as pd
from tabulate import tabulate

# Downloading and extracting the dataset from Kaggle
def download_and_extract_data():

    # Downloading the dataset from Kaggle using the API
    !kaggle datasets download -d william2020/marine-microplastics
    !unzip marine-microplastics.zip


In [54]:
# Loading the dataset into a Pandas DataFrame
def load_data():
    df = pd.read_csv('Marine_Microplastics.csv')
    return df

# Displaying the initial records
df.head()

Unnamed: 0,OBJECTID,Oceans,Regions,SubRegions,Sampling Method,Measurement,Unit,Density Range,Density Class,Short Reference,...,Latitude,Longitude,Date,GlobalID,x,y,Lower_Density,Upper_Density,Operator,Replicated_Value
0,9676,Atlantic Ocean,,,Grab sample,0.018,pieces/m3,0.005-1,Medium,Barrows et al.2018,...,-31.696,-48.56,8/11/2015 12:00:00 AM,a77121b2-e113-444e-82d9-7af11d62fdd2,-48.56,-31.696,0.005,1.0,,
1,6427,Pacific Ocean,,,Neuston net,0.0,pieces/m3,0-0.0005,Very Low,Law et al.2014,...,6.35,-121.85,12/18/2002 12:00:00 AM,be27c450-02ca-4261-8d89-cae21108e6cc,-121.85,6.35,0.0,0.0005,,
2,10672,Pacific Ocean,,,Manta net,0.013,pieces/m3,0.005-1,Medium,Goldstein et al.2013,...,0.5,-95.35,10/17/2006 12:00:00 AM,23effcdd-35b7-4e1e-adb4-390693a287d3,-95.35,0.5,0.005,1.0,,
3,13921,Atlantic Ocean,,,Aluminum bucket,1368.0,pieces/m3,>=10,Very High,Queiroz et al.2022,...,0.631825,-45.398158,10/17/2018 12:00:00 AM,16d77822-0533-4116-97b9-0bdb592f3d6e,-45.398158,0.631825,,,lower or equal,10.0
4,9344,Pacific Ocean,,,Grab sample,0.001,pieces/m3,0.0005-0.005,Low,Barrows et al.2018,...,16.623,-99.6978,1/3/2015 12:00:00 AM,b9e435e3-9e86-4143-8b51-877e5dcdc7a6,-99.6978,16.623,0.0005,0.005,,


# Overview of the DataFrame's general information


In [None]:
# Summary data for the dataset analysis
summary_data = [
    ["Total Rows", f"{df.shape[0]:,}"],  # Total number of rows in the dataset
    ["Total Columns", f"{df.shape[1]:,}"],  # Total number of columns in the dataset
    ["Missing Values", f"{df.isna().sum().sum():,}"],  # Total count of missing values
    ["Duplicated Rows", f"{df.duplicated().sum():,}"],  # Number of duplicated rows
    ["Duplicated Columns", f"{df.columns.duplicated().sum():,}"],  # Number of duplicated columns
]

# Displaying an overview of the dataset in a table
print("Dataset Overview:")  # Header for dataset overview section
print(tabulate(summary_data, headers=["Metric", "Value"], tablefmt="pretty"))  # Display table with metrics

Dataset Overview:
+--------------------+--------+
|       Metric       | Value  |
+--------------------+--------+
|     Total Rows     | 20,425 |
|   Total Columns    |   22   |
|   Missing Values   | 36,759 |
|  Duplicated Rows   |   0    |
| Duplicated Columns |   0    |
+--------------------+--------+


In [None]:
# Displaying detailed information about the dataset
print("Dataset Information:")  # Header for dataset information section
df.info()  # Print the DataFrame's info including column types and memory usage

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20425 entries, 0 to 20424
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   OBJECTID          20425 non-null  int64  
 1   Oceans            20154 non-null  object 
 2   Regions           8885 non-null   object 
 3   SubRegions        1307 non-null   object 
 4   Sampling Method   20425 non-null  object 
 5   Measurement       14613 non-null  float64
 6   Unit              20425 non-null  object 
 7   Density Range     20425 non-null  object 
 8   Density Class     20425 non-null  object 
 9   Short Reference   20425 non-null  object 
 10  Long Reference    20425 non-null  object 
 11  DOI               20425 non-null  object 
 12  Organization      20425 non-null  object 
 13  Keywords          20407 non-null  object 
 14  Accession Number  20425 non-null  int64  
 15  Accession Link    20425 non-null  object 
 16  Latitude          2

In [None]:
# Displaying missing values for each column in a table
print("Missing Values per Column:")  # Header for missing values section
print(
    tabulate(
        df.isnull().sum().reset_index().values,
        headers=["Column", "Missing Values"],  # Table headers
        tablefmt="pretty",  # Format for a visually appealing table
    )
)

Missing Values per Column:
+------------------+----------------+
|      Column      | Missing Values |
+------------------+----------------+
|     OBJECTID     |       0        |
|      Oceans      |      271       |
|     Regions      |     11540      |
|    SubRegions    |     19118      |
| Sampling Method  |       0        |
|   Measurement    |      5812      |
|       Unit       |       0        |
|  Density Range   |       0        |
|  Density Class   |       0        |
| Short Reference  |       0        |
|  Long Reference  |       0        |
|       DOI        |       0        |
|   Organization   |       0        |
|     Keywords     |       18       |
| Accession Number |       0        |
|  Accession Link  |       0        |
|     Latitude     |       0        |
|    Longitude     |       0        |
|       Date       |       0        |
|     GlobalID     |       0        |
|        x         |       0        |
|        y         |       0        |
+------------------+---

#Exploring numeric data





In [50]:
# Understanding the "Measurement" column
print(df['Measurement'].describe())
measurement_counts = df['Measurement'].value_counts().reset_index()
measurement_counts.columns = ['Measurement', 'Count']
print(tabulate(measurement_counts.head(20), headers='keys', tablefmt='psql'))

count     14613.000000
mean        161.983976
std        2198.862484
min           0.000000
25%           0.000000
50%           0.008640
75%           0.128412
max      110480.000000
Name: Measurement, dtype: float64
+----+---------------+---------+
|    |   Measurement |   Count |
|----+---------------+---------|
|  0 |      0        |    4610 |
|  1 |      0.00216  |     536 |
|  2 |      0.00432  |     371 |
|  3 |      0.00648  |     232 |
|  4 |      0.0108   |     202 |
|  5 |      0.00864  |     199 |
|  6 |      0.001    |     113 |
|  7 |      0.0216   |     103 |
|  8 |      0.002    |     102 |
|  9 |      0.003    |      97 |
| 10 |      0.01728  |      93 |
| 11 |      0.01296  |      93 |
| 12 |   1410.44     |      82 |
| 13 |      0.01512  |      80 |
| 14 |      0.004    |      80 |
| 15 |      0.043196 |      75 |
| 16 |      0.0072   |      68 |
| 17 |      0.006    |      62 |
| 18 |      0.005    |      61 |
| 19 |      0.01944  |      61 |
+----+---------------+-

In [None]:
# Converting the 'Measurement' column to numeric and overwriting
df['Measurement'] = pd.to_numeric(df['Measurement'], errors='coerce')
print(tabulate(measurement_counts.head(20), headers='keys', tablefmt='psql'))
print(df['Measurement'].describe())

+----+---------------+---------+
|    |   Measurement |   Count |
|----+---------------+---------|
|  0 |      0        |    4610 |
|  1 |      0.00216  |     536 |
|  2 |      0.00432  |     371 |
|  3 |      0.00648  |     232 |
|  4 |      0.0108   |     202 |
|  5 |      0.00864  |     199 |
|  6 |      0.001    |     113 |
|  7 |      0.0216   |     103 |
|  8 |      0.002    |     102 |
|  9 |      0.003    |      97 |
| 10 |      0.01728  |      93 |
| 11 |      0.01296  |      93 |
| 12 |   1410.44     |      82 |
| 13 |      0.01512  |      80 |
| 14 |      0.004    |      80 |
| 15 |      0.043196 |      75 |
| 16 |      0.0072   |      68 |
| 17 |      0.006    |      62 |
| 18 |      0.005    |      61 |
| 19 |      0.01944  |      61 |
+----+---------------+---------+
count     14613.000000
mean        161.983976
std        2198.862484
min           0.000000
25%           0.000000
50%           0.008640
75%           0.128412
max      110480.000000
Name: Measurement, dtype

In [None]:
# Performing detailed analysis of the 'Density Range' column

print(f"{df['Density Range'].describe()}")
density_range_counts = df['Density Range'].value_counts().reset_index()
density_range_counts.columns = ['Density Range', 'Count']
print(tabulate(density_range_counts, headers='keys', tablefmt='psql'))

count       20425
unique         18
top       0.005-1
freq         6136
Name: Density Range, dtype: object
+----+-----------------+---------+
|    | Density Range   |   Count |
|----+-----------------+---------|
|  0 | 0.005-1         |    6136 |
|  1 | 0-0.0005        |    4485 |
|  2 | 2-40            |    2901 |
|  3 | 0.0005-0.005    |    1838 |
|  4 | 40-200          |    1346 |
|  5 | 1-10            |    1128 |
|  6 | 0               |    1017 |
|  7 | 1-2             |     403 |
|  8 | 500-30000       |     325 |
|  9 | >=10            |     313 |
| 10 | >200            |     233 |
| 11 | 20-150          |     102 |
| 12 | 0-100           |      97 |
| 13 | 0-2             |      41 |
| 14 | 2-20            |      36 |
| 15 | >40000          |      13 |
| 16 | 150-200         |       8 |
| 17 | 30000-40000     |       3 |
+----+-----------------+---------+


In [53]:
# The "Density" column has dtype object, resolving this issue

# Identifying and categorizing operators
def identify_operator(value):
    if pd.isna(value):
        return None  # Returning None for NaN values
    if '>=' in value:
        return 'lower or equal'
    elif '>' in value:
        return 'upper'
    elif '<' in value:
        return 'lower'
    else:
        return None

# Identifying and categorizing operators
def extract_numeric(value):
    if pd.isna(value):
        return None
    for op in ['>=', '>', '<']:
        if op in value:
            try:
                return float(value.replace(op, '').strip()) # Converting to numeric
            except ValueError:
                return None
    return None

# Function for cleaning the values by removing operators
def clean_value(value):
    if pd.isna(value):
        return None
    if any(op in value for op in ['>=', '>', '<']):
        return None
    try:
        return float(value.strip())
    except ValueError:
        return None

# Splitting the 'Density Range' column into Lower_Density and Upper_Density
df[['Lower_Density', 'Upper_Density']] = df['Density Range'].str.split('-', expand=True)

# Cleaning and converting Lower_Density and Upper_Density columns to numeric
df['Lower_Density'] = df['Lower_Density'].apply(clean_value)
df['Upper_Density'] = df['Upper_Density'].apply(clean_value)

# Identifying operators and replicating numeric values to Replicated_Value
df['Operator'] = df['Density Range'].apply(identify_operator)
df['Replicated_Value'] = df['Density Range'].apply(extract_numeric)

# Displaying the results
print(tabulate(df[['Density Range', 'Lower_Density', 'Upper_Density', 'Operator', 'Replicated_Value']].head(10), headers='keys', tablefmt='psql'))

+----+-----------------+-----------------+-----------------+----------------+--------------------+
|    | Density Range   |   Lower_Density |   Upper_Density | Operator       |   Replicated_Value |
|----+-----------------+-----------------+-----------------+----------------+--------------------|
|  0 | 0.005-1         |          0.005  |          1      |                |                nan |
|  1 | 0-0.0005        |          0      |          0.0005 |                |                nan |
|  2 | 0.005-1         |          0.005  |          1      |                |                nan |
|  3 | >=10            |        nan      |        nan      | lower or equal |                 10 |
|  4 | 0.0005-0.005    |          0.0005 |          0.005  |                |                nan |
|  5 | 0-0.0005        |          0      |          0.0005 |                |                nan |
|  6 | 0.005-1         |          0.005  |          1      |                |                nan |
|  7 | 0.0

In [52]:
# Converting Lower_Density, Upper_Density, and Replicated_Value to numeric, overwriting the existing columns
df['Lower_Density'] = pd.to_numeric(df['Lower_Density'], errors='coerce')
df['Upper_Density'] = pd.to_numeric(df['Upper_Density'], errors='coerce')
df['Replicated_Value'] = pd.to_numeric(df['Replicated_Value'], errors='coerce')

# Displaying the first few rows of the DataFrame with the converted columns
print(tabulate(df[['Density Range', 'Lower_Density', 'Upper_Density', 'Operator', 'Replicated_Value']].head(10), headers='keys', tablefmt='psql'))

+----+-----------------+-----------------+-----------------+----------------+--------------------+
|    | Density Range   |   Lower_Density |   Upper_Density | Operator       |   Replicated_Value |
|----+-----------------+-----------------+-----------------+----------------+--------------------|
|  0 | 0.005-1         |          0.005  |          1      |                |                nan |
|  1 | 0-0.0005        |          0      |          0.0005 |                |                nan |
|  2 | 0.005-1         |          0.005  |          1      |                |                nan |
|  3 | >=10            |        nan      |        nan      | lower or equal |                 10 |
|  4 | 0.0005-0.005    |          0.0005 |          0.005  |                |                nan |
|  5 | 0-0.0005        |          0      |          0.0005 |                |                nan |
|  6 | 0.005-1         |          0.005  |          1      |                |                nan |
|  7 | 0.0