# Cleaned List of All Scrapped Ships

This notebook contains the steps for cleaning and normalizing each annual scrapped-ships CSV into the latest (2024) CSV format. This is necessary because the yearly lists store largely the same information but order columns differently, use different names for the same field, or include fields that other years lack.
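
As a rough illustration of what normalizing means here, the sketch below renames a few older headers to their 2024 equivalents and reorders the columns. The pairings in `HYPOTHETICAL_ALIASES` are assumptions inferred from the header names in the comparison table further down, not a confirmed mapping. `normalize_columns` is an illustrative helper rather than a function from this notebook, and the sample row is made up.

```python
import pandas as pd

# Assumed equivalences, inferred from header names only (not confirmed by the source data)
HYPOTHETICAL_ALIASES = {
    "NAME OF SHIP": "NAME",
    "BUILT IN (Y)": "BUILT",
    "IMO NUMBER": "IMO#",
    "GROSS TONNAGE (GT)": "GT",
}


def normalize_columns(df: pd.DataFrame, target_order: list[str]) -> pd.DataFrame:
    """Rename aliased headers, then reorder to target_order (missing columns become NA)."""
    renamed = df.rename(columns=HYPOTHETICAL_ALIASES)
    return renamed.reindex(columns=target_order)


# Made-up row using older-style headers
old_style = pd.DataFrame({
    "NAME OF SHIP": ["EXAMPLE SHIP"],
    "IMO NUMBER": [1234567],
    "BUILT IN (Y)": [1990],
})
normalized = normalize_columns(old_style, ["IMO#", "NAME", "BUILT", "GT"])
print(list(normalized.columns))
```

Using `reindex` keeps the target column order explicit and fills columns that a given year lacks (here `GT`) with missing values instead of raising.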

## Manual Changes

TODO

## Glossary of Terms
See the [NGO Shipbreaking Platform's Glossary Page](https://shipbreakingplatform.org/our-work/glossary/) for the meaning of each header. Its acronyms and definitions are needed to understand the data in each yearly CSV report.

## Comparison of Columns by Year

| Field | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ARRIVAL |   |   |   | X | X | X | X | X | X | X | X | X | X |
| BEACHING DATE |   | X | X |   |   |   |   |   |   |   |   |   |   |
| BENEFICIAL OWNER |   |   |   | X | X | X | X | X | X | X | X | X | X |
| BENEFICIAL OWNER OF THE SHIP | X | X | X |   |   |   |   |   |   |   |   |   |   |
| BENEFICIAL OWNER'S COUNTRY | X |   |   |   |   |   |   |   |   |   |   |   |   |
| BO COUNTRY |   | X |   | X | X | X | X | X | X | X | X | X | X |
| BUILT |   |   |   | X | X | X | X | X | X | X | X | X | X |
| BUILT IN (Y) | X | X | X |   |   |   |   |   |   |   |   |   |   |
| CHANGE OF FLAG FOR BREAKING |   |   |   |   |   |   |   |   |   | X | X |   |   |
| COMMENT |   |   |   |   |   |   |   |   |   | X |   |   |   |
| COMMERCIAL OPERATOR | X | X |   | X | X | X | X | X | X | X | X | X | X |
| COMMERCIAL OPERATOR OF THE SHIP |   |   | X |   |   |   |   |   |   |   |   |   |   |
| COUNTRY |   |   |   | X | X | X | X | X | X | X | X | X | X |
| COUNTRY OF THE BENEFICIAL OWNER |   |   | X |   |   |   |   |   |   |   |   |   |   |
| DATE OF CHANGE |   | X | X |   |   |   |   |   |   |   |   |   |   |
| DATE SOLD FOR BREAKING |   |   |   |   | X |   |   |   |   |   |   |   |   |
| DESTINATION CITY |   |   | X |   |   |   |   |   |   |   |   |   |   |
| DESTINATION COUNTRY |   |   | X |   |   |   |   |   |   |   |   |   |   |
| DESTINATION YARD | X | X |   |   |   |   |   |   |   |   |   |   |   |
| FLAG |   |   |   | X |   | X |   |   |   | X | X |   | X |
| FLAG CHANGED FOR BREAKING |   |   | X |   |   |   |   |   |   |   |   |   |   |
| FLAG PRIOR LAST VOYAGE |   |   |   |   |   |   |   |   |   |   |   |   | X |
| FORMER FLAG (CHANGED FOR BREAKING) |   |   |   |   |   | X |   |   |   |   |   |   |   |
| FORMER NAME |   |   |   |   |   |   |   | X |   |   |   |   |   |
| FORMER NAME (CHANGED FOR BREAKING) |   |   |   |   |   | X |   |   |   |   |   |   |   |
| GROSS TONNAGE (GT) |   |   | X |   |   |   |   |   |   |   |   |   |   |
| GT |   |   |   | X | X | X | X | X | X | X | X | X | X |
| IMO  NUMBER | X | X |   |   |   |   |   |   |   |   |   |   |   |
| IMO NUMBER |   |   | X |   |   |   |   |   |   |   |   |   |   |
| IMO# |   |   |   | X | X | X | X | X | X | X | X | X | X |
| LAST FLAG | X | X | X |   | X |   | X | X | X |   |   |   |   |
| LAST FLAG (CHANGE FOR BREAKING) |   |   |   |   |   |   |   |   |   |   |   | X |   |
| LDT |   | X |   |   |   |   | X | X | X | X | X | X | X |
| LDT (LIGHT DISPLACEMENT TON) | X |   |   |   |   |   |   |   |   |   |   |   |   |
| NAME |   |   |   | X | X | X | X | X | X | X | X | X | X |
| NAME OF SHIP | X | X | X |   |   |   |   |   |   |   |   |   |   |
| NEXT TO LAST |   | X |   |   |   |   |   |   |   |   |   |   |   |
| PLACE |   |   |   | X | X | X | X | X | X | X | X | X | X |
| PREVIOUS FLAG |   |   |   |   |   |   | X | X | X |   |   | X |   |
| REGISTERED OWNER | X | X |   | X | X | X | X | X | X | X | X | X | X |
| REGISTERED OWNER OF THE SHIP |   |   | X |   |   |   |   |   |   |   |   |   |   |
| RO COUNTRY |   |   |   | X | X | X | X | X | X | X | X | X | X |
| SINGLE HULLED | X |   |   |   |   |   |   |   |   |   |   |   |   |
| SOLD FOR ($/LDT) | X |   |   |   |   |   |   |   |   |   |   |   |   |
| SUB-TYPE OF SHIP |   | X |   |   |   |   |   |   |   |   |   |   |   |
| TYPE |   |   |   | X | X | X | X | X | X | X | X | X | X |
| TYPE OF SHIP | X | X | X |   |   |   |   |   |   |   |   |   |   |
| USD/TON |   | X |   |   |   |   |   |   |   |   |   |   |   |

The above table displays which columns are present in each year. This table was generated with the code from the cell below.

In [10]:
import pandas as pd


def read_excel_file(year: int, nrows: int | None = None) -> pd.DataFrame:
    """Read one year's workbook and return it with stripped, uppercased column names."""
    filename = f"annual_lists/original_xlsx/{year}-List-of-all-ships-dismantled-all-over-the-world.xlsx"

    # 2015-2024 have an extra "super" header in row 0, so the real header is row 1
    header = 1 if year >= 2015 else 0

    # Years use columns A:P unless specified otherwise
    if year == 2012:
        usecols = "A:M"
    elif year in (2014, 2016):
        usecols = "A:O"
    elif year == 2015:
        usecols = "A:N"
    elif year in (2019, 2021):
        usecols = "A:Q"
    else:
        usecols = "A:P"

    df = pd.read_excel(filename, usecols=usecols, header=header, nrows=nrows)
    df.columns = df.columns.str.strip().str.upper()
    return df

print("Uncomment to produce the column/year comparison table")
years = range(2012, 2025)

# Read each workbook once; only the header row matters here, so nrows=2 keeps it fast
columns_by_year = {year: set(read_excel_file(year, nrows=2).columns) for year in years}
headers = sorted(set().union(*columns_by_year.values()))

print("| Field | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |")
print("| ----- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |")
for header in headers:
    print(f"| {header} |", end="")
    for year in years:
        # Mark the cell with an X when this year's workbook has the header
        print(" X |" if header in columns_by_year[year] else "   |", end="")
    print()

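
The rendering logic can also be exercised without the Excel workbooks. The sketch below is a standalone variant, not the cell that produced the table: it takes headers already collected per year and returns the Markdown rows as a string. `presence_table` is a hypothetical helper, and the `headers_by_year` values are a small subset copied from the comparison table above.

```python
def presence_table(headers_by_year: dict[int, set[str]]) -> str:
    """Render a Markdown table marking which headers appear in which year."""
    years = sorted(headers_by_year)
    all_headers = sorted(set().union(*headers_by_year.values()))
    lines = [
        "| Field | " + " | ".join(str(y) for y in years) + " |",
        "| ----- | " + " | ".join(":---:" for _ in years) + " |",
    ]
    for header in all_headers:
        # "X" when the year has this header, a single space otherwise
        cells = ["X" if header in headers_by_year[y] else " " for y in years]
        lines.append(f"| {header} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Subset of the real headers, for illustration only
headers_by_year = {
    2012: {"NAME OF SHIP", "LAST FLAG"},
    2024: {"NAME", "FLAG"},
}
table = presence_table(headers_by_year)
print(table)
```

Returning a string rather than printing makes the result easy to check or to write straight into a Markdown cell.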

In [9]:
# Write the Excel (xlsx) files into an intermediary CSV format.
# CSVs are useful on their own and more easily consumed by programs.
# read_excel_file already uppercases the headers and strips leading/trailing whitespace.
for year in range(2012, 2025):
    df = read_excel_file(year)
    df.to_csv(f"annual_lists/csvs/{year}-dismantled-ships.csv", index=False)
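
One wrinkle that `str.strip()` alone does not cover: the comparison table lists both `IMO  NUMBER` (two spaces) and `IMO NUMBER` as distinct headers. If those should be treated as the same column, one possible extension is to collapse internal whitespace runs as well. The sketch below uses a hypothetical `clean_headers` helper that is not part of this notebook, and the input headers are made-up variants.

```python
import pandas as pd


def clean_headers(columns: pd.Index) -> pd.Index:
    """Trim, uppercase, and collapse runs of whitespace in column names."""
    return columns.str.strip().str.upper().str.replace(r"\s+", " ", regex=True)


# Made-up variants of the same headers
raw = pd.Index([" imo  number", "IMO NUMBER ", "Name of Ship"])
cleaned = clean_headers(raw)
print(list(cleaned))
```

Whether the two IMO headers should be merged is a judgment call about the source data; this only shows how it could be done.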


In [None]:
def read_csv(filename, parse_dates=None):
    """
    Reads the provided CSV file into a pandas DataFrame.
    Assumes `filename` is a CSV file located within the ./data/ directory.
    If `parse_dates` is not given, only the "ARRIVAL" column is parsed as a date.
    Parsed date columns are localized to UTC.

    Note: This function normalizes column names by mapping the following columns -
      "LAST FLAG"                          -> "FLAG"
      "LAST FLAG (CHANGE FOR BREAKING)"    -> "FLAG"
      "CHANGE OF FLAG FOR BREAKING"        -> "FLAG PRIOR LAST VOYAGE"
      "FORMER FLAG (CHANGED FOR BREAKING)" -> "FLAG PRIOR LAST VOYAGE"
      "PREVIOUS FLAG"                      -> "FLAG PRIOR LAST VOYAGE"
      "FORMER NAME (CHANGED FOR BREAKING)" -> "FORMER NAME"
    """
    dtype_spec = {
        "BENEFICIAL OWNER": str,
        "BO COUNTRY": str,
        "BUILT": "Int64",  # Nullable integer type bc built year is sometimes missing
        "CHANGE OF FLAG FOR BREAKING": str,
        "COMMENT": str,
        "COMMERCIAL OPERATOR": str,
        "COUNTRY": str,
        "FLAG PRIOR LAST VOYAGE": str,
        "FLAG": str,
        "FORMER FLAG (CHANGED FOR BREAKING)": str,
        "FORMER NAME (CHANGED FOR BREAKING)": str,
        "FORMER NAME": str,
        "GT": float,
        "IMO#": int,
        "LAST FLAG (CHANGE FOR BREAKING)": str,
        "LAST FLAG": str,
        "LDT": float,
        "NAME": str,
        "PLACE": str,
        "PREVIOUS FLAG": str,
        "REGISTERED OWNER": str,
        "RO COUNTRY": str,
        "TYPE": str,
    }
    na_values = ["Unknown owners", "Unknown", "unknown", "UNKNOWN?", "''"]
    if not parse_dates:
        parse_dates = ["ARRIVAL"]
    df = pd.read_csv(
        f"data/{filename}",
        dtype=dtype_spec,
        parse_dates=parse_dates,
        na_values=na_values,
    )
    df.rename(
        columns={
            "LAST FLAG": "FLAG",
            "LAST FLAG (CHANGE FOR BREAKING)": "FLAG",
            "CHANGE OF FLAG FOR BREAKING": "FLAG PRIOR LAST VOYAGE",
            "FORMER FLAG (CHANGED FOR BREAKING)": "FLAG PRIOR LAST VOYAGE",
            "PREVIOUS FLAG": "FLAG PRIOR LAST VOYAGE",
            "FORMER NAME (CHANGED FOR BREAKING)": "FORMER NAME",
        },
        inplace=True,
    )
    for col in parse_dates:
        df[col] = pd.to_datetime(df[col], errors="raise").dt.tz_localize("UTC")

    return df
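
To see what the renaming and date handling do in practice, the sketch below applies the same steps to a tiny in-memory CSV instead of a file under `data/`. The row is made up for illustration, and only two of the rename pairs from `read_csv` are reproduced.

```python
import io

import pandas as pd

# Made-up single-row CSV using the 2018-2020 style flag headers
csv_text = (
    "IMO#,NAME,LAST FLAG,PREVIOUS FLAG,ARRIVAL\n"
    "1234567,EXAMPLE SHIP,Panama,Unknown,2019-03-14\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"IMO#": int, "NAME": str, "LAST FLAG": str, "PREVIOUS FLAG": str},
    parse_dates=["ARRIVAL"],
    na_values=["Unknown"],  # "Unknown" is read as a missing value
)
df = df.rename(columns={"LAST FLAG": "FLAG", "PREVIOUS FLAG": "FLAG PRIOR LAST VOYAGE"})

# Mark the naive timestamps as UTC, as read_csv does
df["ARRIVAL"] = pd.to_datetime(df["ARRIVAL"], errors="raise").dt.tz_localize("UTC")

print(df.dtypes)
```

After this, the older flag headers sit under the 2024 names, the `Unknown` placeholder is a proper missing value, and `ARRIVAL` is timezone-aware.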

