# Satellite Tracking Data
## Union of Concerned Scientists

Assembled by experts at the Union of Concerned Scientists (UCS), the [Satellite Database](https://www.ucs.org/resources/satellite-database) is a listing of the more than 7,560 operational satellites currently in orbit around Earth. It was first published on Dec 8, 2005 and most recently updated on May 1, 2023.

Much like orbital debris plummeting into the atmosphere, the dataset requires some cleanup. To begin, we'll import the necessary libraries, read the CSV, and take a look at the columns:

In [1835]:
import pandas as pd
import numpy as np
import re

file_path = "data/UCS-Satellite-Database 5-1-2023.csv"
df = pd.read_csv(file_path)
df.columns

Index(['Name of Satellite, Alternate Names',
       'Current Official Name of Satellite', 'Country/Org of UN Registry',
       'Country of Operator/Owner', 'Operator/Owner', 'Users', 'Purpose',
       'Detailed Purpose', 'Class of Orbit', 'Type of Orbit',
       'Longitude of GEO (degrees)', 'Perigee (km)', 'Apogee (km)',
       'Eccentricity', 'Inclination (degrees)', 'Period (minutes)',
       'Launch Mass (kg.)', ' Dry Mass (kg.) ', 'Power (watts)',
       'Date of Launch', 'Expected Lifetime (yrs.)', 'Contractor',
       'Country of Contractor', 'Launch Site', 'Launch Vehicle',
       'COSPAR Number', 'NORAD Number', 'Comments', 'Unnamed: 28',
       'Source Used for Orbital Data', 'Source', 'Source.1', 'Source.2',
       'Source.3', 'Source.4', 'Source.5', 'Source.6', 'Unnamed: 37',
       'Unnamed: 38', 'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41',
       'Unnamed: 42', 'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45',
       'Unnamed: 46', 'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49',


### Basic Cleaning
#### Columns
There are many empty `Unnamed` columns that can be dropped. Likewise, the `Source` and `Comments` columns aren't relevant for our analysis, and the `Name of Satellite, Alternate Names` column is redundant. We can drop those:

In [1836]:
original_cols = len(df.columns)
columns_to_drop = []

for col in df.columns:
    if col.startswith("Unnamed"):
        columns_to_drop.append(col)
    elif col.startswith("Source"):
        columns_to_drop.append(col)
    elif col == "Comments":
        columns_to_drop.append(col)
    elif col == "Name of Satellite, Alternate Names":
        columns_to_drop.append(col)

df = df.drop(columns=columns_to_drop)

dropped_cols = len(columns_to_drop)
remaining_cols = len(df.columns)

print(f"Dropped {dropped_cols} out of {original_cols} columns.")
print(f"Remaining columns: {remaining_cols}")


Dropped 42 out of 68 columns.
Remaining columns: 26
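
The same selection can be written more compactly. The sketch below is only an illustration on a toy frame: `toy`, `drop_prefixes`, and `drop_exact` are hypothetical names, and the column list is a small stand-in for the real header. It relies on `str.startswith` accepting a tuple of prefixes.

```python
import pandas as pd

# Toy frame standing in for the real dataset (column names are illustrative)
toy = pd.DataFrame(
    columns=["Satellite", "Comments", "Source", "Source.1", "Unnamed: 28", "Purpose"]
)

drop_prefixes = ("Unnamed", "Source")  # drop anything starting with these
drop_exact = {"Comments"}              # drop these exact names

to_drop = [
    col for col in toy.columns
    if col.startswith(drop_prefixes) or col in drop_exact
]
toy = toy.drop(columns=to_drop)

print(list(toy.columns))
```

Keeping the prefixes and exact names in two small collections makes it easier to see at a glance what is being removed.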


We're still focused on columns, but let's take a quick detour to verify how many rows are in the dataset. This figure will be useful as a point of reference.

In [1837]:
total_rows = len(df)
print(f"The dataset has {total_rows} total rows.")


The dataset has 7562 total rows.


We'll check which columns are missing the most values. This helps to decide which columns are useful and which may be too incomplete.

In [1838]:
missing_values = df.isna().sum()
missing_values.sort_values(ascending=False)

Power (watts)                         6983
 Dry Mass (kg.)                       6795
Detailed Purpose                      6308
Expected Lifetime (yrs.)              2112
Type of Orbit                          653
Launch Mass (kg.)                      247
Period (minutes)                        58
Eccentricity                            13
Apogee (km)                              9
Perigee (km)                             9
Inclination (degrees)                    6
Longitude of GEO (degrees)               5
Country/Org of UN Registry               3
Date of Launch                           3
Contractor                               2
Country of Contractor                    2
Launch Site                              2
Launch Vehicle                           2
COSPAR Number                            2
Current Official Name of Satellite       2
Class of Orbit                           2
Purpose                                  2
Users                                    2
Operator/Ow

Let's look at that as a percentage to make the comparison easier:

In [1839]:
missing_percent = (missing_values / total_rows) * 100
missing_percent_sorted = missing_percent.sort_values(ascending=False)
missing_percent_sorted.round(1)

Power (watts)                         92.3
 Dry Mass (kg.)                       89.9
Detailed Purpose                      83.4
Expected Lifetime (yrs.)              27.9
Type of Orbit                          8.6
Launch Mass (kg.)                      3.3
Period (minutes)                       0.8
Eccentricity                           0.2
Apogee (km)                            0.1
Perigee (km)                           0.1
Inclination (degrees)                  0.1
Longitude of GEO (degrees)             0.1
Country/Org of UN Registry             0.0
Date of Launch                         0.0
Contractor                             0.0
Country of Contractor                  0.0
Launch Site                            0.0
Launch Vehicle                         0.0
COSPAR Number                          0.0
Current Official Name of Satellite     0.0
Class of Orbit                         0.0
Purpose                                0.0
Users                                  0.0
Operator/Ow

If a column is missing data in 25% or more of the total rows, we will exclude it:

In [1840]:
missing_threshold = total_rows // 4

columns_to_exclude = missing_values[missing_values >= missing_threshold].index.tolist()
df = df.drop(columns=columns_to_exclude)
print(f"Dropped {len(columns_to_exclude)} columns: {(', '.join(columns_to_exclude))}")


Dropped 4 columns: Detailed Purpose,  Dry Mass (kg.) , Power (watts), Expected Lifetime (yrs.)
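
As a cross-check, the same rule can be expressed as a fraction rather than a row count. This is a minimal sketch on made-up data (`toy`, `missing_fraction`, and `too_sparse` are hypothetical names). It uses `isna().mean()`, which gives the share of missing values per column. Note that `total_rows // 4` rounds down, so the integer threshold can sit very slightly below a true 25% when the row count is not divisible by four.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Satellite": ["A", "B", "C", "D"],
    "Power": [np.nan, np.nan, np.nan, 100.0],   # 75% missing
    "Launch Mass": [10.0, np.nan, 30.0, 40.0],  # 25% missing
    "Period": [90.0, 95.0, 100.0, 1436.0],      # 0% missing
})

# Share of missing values per column, between 0.0 and 1.0
missing_fraction = toy.isna().mean()

# Columns at or above the 25% cutoff
too_sparse = missing_fraction[missing_fraction >= 0.25].index.tolist()
toy_trimmed = toy.drop(columns=too_sparse)

print(too_sparse)
print(list(toy_trimmed.columns))
```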


We're almost done with the column cleanup. As a final touch, let's rename some of the more verbose columns. We'll also drop any units of measure contained in the column names (such as km).

In [1841]:
column_renames = {
    "Current Official Name of Satellite": "Satellite",
    "Country/Org of UN Registry": "UN Registry",
    "Country of Operator/Owner": "Country of Operator",
    "Operator/Owner": "Operator"
}

df = df.rename(columns=column_renames)

df.columns = (
    df.columns
        .str.replace(r"\s*\(.*?\)", "", regex=True)
        .str.strip()
)

for col in df.columns:
    print(col)

Satellite
UN Registry
Country of Operator
Operator
Users
Purpose
Class of Orbit
Type of Orbit
Longitude of GEO
Perigee
Apogee
Eccentricity
Inclination
Period
Launch Mass
Date of Launch
Contractor
Country of Contractor
Launch Site
Launch Vehicle
COSPAR Number
NORAD Number


#### Rows
While we can tolerate missing data in certain fields, others are essential for the analysis. If any rows are missing data in these fields, they will need to be either fixed or dropped entirely. Let's take a look.

In [1842]:
required_columns = [
    "Date of Launch",
    "Launch Site",
    "Satellite"
]

rows_missing_required = df[
    df[required_columns].isna().any(axis=1)
][required_columns]

rows_missing_required

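|      | Date of Launch | Launch Site                 | Satellite         |
|------|----------------|-----------------------------|-------------------|
| 240  | NaN            | Rocket Lab Launch Complex 1 | BlackSky Global 5 |
| 7560 | NaN            | NaN                         | NaN               |
| 7561 | NaN            | NaN                         | NaN               |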


The satellite **BlackSky Global 5** is missing a launch date. However, this [information](https://space.oscar.wmo.int/satellites/view/blacksky_5) isn't too hard to find so we can manually add it in: August 7, 2020.

The other two rows are just blanks, so let's delete them.

In [1843]:
rows_before = len(df)

df.loc[
    df["Satellite"] == "BlackSky Global 5",
    "Date of Launch"
] = "2020-08-07"

rows_missing_required = df[
    df[required_columns].isna().any(axis=1)
]

df = df.drop(index=rows_missing_required.index)

rows_after = len(df)
rows_removed = rows_before - rows_after

print(f"Dropped {rows_removed} out of {rows_before} rows.")
print(f"Remaining rows: {rows_after}")

Dropped 2 out of 7562 rows.
Remaining rows: 7560
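
For comparison, the patch-then-drop pattern can also be written with `DataFrame.dropna(subset=...)`. This is a sketch on toy data, not a replacement for the cell above: the frame, the satellite names, and the patched date are all placeholders.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Satellite": ["Alpha", "Beta", np.nan],
    "Date of Launch": ["2020-01-01", np.nan, np.nan],
    "Launch Site": ["Site 1", "Site 2", np.nan],
})
required = ["Satellite", "Date of Launch", "Launch Site"]

# Patch the one row whose value can be looked up (placeholder date)
toy.loc[toy["Satellite"] == "Beta", "Date of Launch"] = "2021-06-15"

# Drop every row still missing any required field
toy_clean = toy.dropna(subset=required)

print(f"Dropped {len(toy) - len(toy_clean)} out of {len(toy)} rows.")
```

Patching first matters here: running `dropna` before the manual fix would discard a row that is otherwise recoverable.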


Many of the text columns contain extra information in parentheses. Let's look at the `Satellite` column as an example.

In [1844]:
mask_parens = (
    df["Satellite"]
        .astype(str)
        .str.contains(r"\(", na=False)
)

df.loc[mask_parens, ["Satellite"]].head(5)

|     | Satellite                                |
|-----|------------------------------------------|
| 1   | AAC AIS-Sat1 (Kelpie 1)                  |
| 224 | Bispectral InfraRed Detector 2 (Bird 2)  |
| 319 | Chandra X-Ray Observatory (CXO)          |
| 815 | FUNCube-1 (AO-73)                        |
| 913 | Geomagnetic Tail Laboratory (Geotail)    |


We'll just clean house: remove the parenthetical notes from every text column and trim any leading or trailing whitespace while we're at it.

In [1845]:
text_cols = df.select_dtypes(include=["object"]).columns.tolist()

for col in text_cols:
    df[col] = (
        df[col]
            .str.replace(r"\s*\(.*?\)", "", regex=True)
            .str.strip()
            .replace({"": pd.NA})
    )

df.loc[mask_parens, ["Satellite"]].head(5)

|     | Satellite                      |
|-----|--------------------------------|
| 1   | AAC AIS-Sat1                   |
| 224 | Bispectral InfraRed Detector 2 |
| 319 | Chandra X-Ray Observatory      |
| 815 | FUNCube-1                      |
| 913 | Geomagnetic Tail Laboratory    |
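
One thing worth knowing about the pattern `\s*\(.*?\)` is how it behaves on less tidy input. The sketch below uses only the standard `re` module. `strip_notes` is a hypothetical helper, and the second and third inputs are invented values rather than rows from the dataset. Because the match is non-greedy, nested parentheses end at the first `)` and can leave a stray one behind.

```python
import re

PAREN_NOTE = re.compile(r"\s*\(.*?\)")

def strip_notes(text: str) -> str:
    """Remove parenthetical notes, then trim surrounding whitespace."""
    return PAREN_NOTE.sub("", text).strip()

simple = strip_notes("Chandra X-Ray Observatory (CXO)")
repeated = strip_notes("Alpha (A1) Beta (B2)")        # two separate notes
nested = strip_notes("Gamma (outer (inner) note)")    # nested parentheses

print(simple)
print(repeated)
print(nested)  # a leftover ")" survives the nested case
```

If nested notes turn out to be common, a follow-up pass for unmatched parentheses would be one way to handle them.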


### Making Booleans
#### `Users`

Now that we've got the basic dataset cleaned up, we'll take a look at some of the specific columns.

`Users` contains combinations of one to four possible values separated by a "/", representing who is using the satellite: *Civil*, *Commercial*, *Government*, or *Military*. This format is easy enough for a human to read in a table, but a one-to-many relationship like this will be complicated to filter in a BI tool like QuickSight or Tableau. Instead, let's make life easier by creating some boolean columns.

In [1846]:
user_categories = ["Civil", "Commercial", "Government", "Military"]

for category in user_categories:
    df[f"User: Is {category}"] = (
        df["Users"]
            .str.contains(category, case=False, na=False)
    )

user_flag_columns = ["User: Is Civil", "User: Is Commercial", "User: Is Government", "User: Is Military"]

df[user_flag_columns].sum()

example_columns = [
    "Satellite",
    "Users",
    "User: Is Civil",
    "User: Is Commercial",
    "User: Is Government",
    "User: Is Military"
]

# Show examples where User belongs to multiple categories
df[
    df[user_flag_columns].sum(axis=1) > 1
][example_columns].head(10)



|     | Satellite    | Users               | User: Is Civil | User: Is Commercial | User: Is Government | User: Is Military |
|-----|--------------|---------------------|----------------|---------------------|---------------------|-------------------|
| 70  | Amos 17      | Military/Commercial | False          | True                | False               | True              |
| 71  | Amos 3       | Military/Commercial | False          | True                | False               | True              |
| 72  | Amos 4       | Military/Commercial | False          | True                | False               | True              |
| 147 | Athena-Fidus | Government/Military | False          | False               | True                | True              |
| 171 | Beidou 2-12  | Military/Government | False          | False               | True                | True              |
| 172 | Beidou 2-13  | Military/Government | False          | False               | True                | True              |
| 173 | Beidou 2-15  | Military/Government | False          | False               | True                | True              |
| 174 | Beidou 2-16  | Military/Government | False          | False               | True                | True              |
| 175 | Beidou 2-17  | Military/Government | False          | False               | True                | True              |
| 176 | Beidou 2-18  | Military/Government | False          | False               | True                | True              |
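
An alternative worth sketching is `Series.str.get_dummies`, which splits on the separator and builds one indicator column per distinct token. This is shown on a toy series (`users` and `flags` are hypothetical names), not on the real column. Unlike the case-insensitive `str.contains` approach above, it matches tokens exactly, so a value such as `military` or `Military ` would become its own column.

```python
import pandas as pd

users = pd.Series(["Military/Commercial", "Civil", "Government/Military", None])

# One 0/1 column per distinct token, converted to booleans;
# missing values produce a row of all False
flags = users.str.get_dummies(sep="/").astype(bool)

print(flags)
print(flags.sum(axis=1).tolist())  # number of user types per row
```

The exact-token behavior is a trade-off: stricter matching surfaces typos in the source data, but it needs cleaner input than the substring search does.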


#### `Purpose`

We'll do the same thing for `Purpose`. However, there are a few more categories.

In [1847]:

purpose_categories = [
    "Communications",
    "Earth Observation",
    "Earth Science",
    "Educational",
    "Meteorological",
    "Mission Extension Technology",
    "Navigation",
    "Platform",
    "Satellite Positioning",
    "Space Observation",
    "Space Science",
    "Surveillance",
    "Technology Demonstration",
    "Technology Development",
    "Unknown",
    "Maritime Tracking"
]

for category in purpose_categories:
    df[f"Is Purpose: {category}"] = (
        df["Purpose"]
            .str.contains(category, case=False, na=False)
    )

The count by category varies significantly. Some categories have thousands of satellites while others have only a handful.

In [1848]:
purpose_flag_cols = [f"Is Purpose: {c}" for c in purpose_categories]

df[purpose_flag_cols].sum().sort_values(ascending=False)


Is Purpose: Communications                  5527
Is Purpose: Earth Observation               1260
Is Purpose: Technology Development           386
Is Purpose: Navigation                       165
Is Purpose: Space Science                    103
Is Purpose: Technology Demonstration          65
Is Purpose: Earth Science                     30
Is Purpose: Surveillance                      20
Is Purpose: Space Observation                 13
Is Purpose: Unknown                           10
Is Purpose: Meteorological                     6
Is Purpose: Maritime Tracking                  5
Is Purpose: Educational                        3
Is Purpose: Mission Extension Technology       2
Is Purpose: Platform                           1
Is Purpose: Satellite Positioning              1
dtype: int64

In a lopsided situation like this, booleans become less effective. Let's consolidate some of these categories for better clarity.

In [1849]:
p_comm = "Is Purpose: Communications"
p_earth_obs = "Is Purpose: Earth Observation"
p_earth_sci = "Is Purpose: Earth Science"
p_edu = "Is Purpose: Educational"
p_met = "Is Purpose: Meteorological"
p_mext = "Is Purpose: Mission Extension Technology"
p_nav = "Is Purpose: Navigation"
p_platform = "Is Purpose: Platform"
p_satpos = "Is Purpose: Satellite Positioning"
p_space_obs = "Is Purpose: Space Observation"
p_space_sci = "Is Purpose: Space Science"
p_surv = "Is Purpose: Surveillance"
p_tech_demo = "Is Purpose: Technology Demonstration"
p_tech_dev = "Is Purpose: Technology Development"
p_unknown = "Is Purpose: Unknown"
p_mar = "Is Purpose: Maritime Tracking"


df["Purpose: Communications"] = df[p_comm]

df["Purpose: Earth Observation"] = (
    df[p_earth_obs]
    | df[p_met]
    | df[p_surv]
    | df[p_earth_sci]
    | df[p_mar]
)

df["Purpose: Navigation"] = (
    df[p_nav]
    | df[p_satpos]
)

df["Purpose: Space Science"] = (
    df[p_space_sci]
    | df[p_space_obs]
)

df["Purpose: Tech Dev"] = (
    df[p_tech_dev]
    | df[p_edu] 
    | df[p_platform] 
    | df[p_tech_demo]
    | df[p_mext]
)

df["Purpose: Unknown"] = df[p_unknown]


This gets us down to six `Purpose` boolean columns which will be much more manageable.

In [1850]:
consolidated_cols = [
    "Purpose: Communications",
    "Purpose: Earth Observation",
    "Purpose: Navigation",
    "Purpose: Space Science",
    "Purpose: Tech Dev",
    "Purpose: Unknown",
]

original_purpose_flag_cols = [c for c in df.columns if c.startswith("Is Purpose: ")]
df = df.drop(columns=original_purpose_flag_cols)

df[consolidated_cols].sum().sort_values(ascending=False)

Purpose: Communications       5527
Purpose: Earth Observation    1319
Purpose: Tech Dev              455
Purpose: Navigation            166
Purpose: Space Science         116
Purpose: Unknown                10
dtype: int64
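
The consolidation can also be driven by a small lookup of groups, which keeps the rules in one place. The sketch below is an illustration on a toy frame: `groups` and `toy` are hypothetical names, and only two of the six groups are shown. Because the members are combined with a logical OR, a row flagged in two member columns counts once. That may explain why, for example, the consolidated Earth Observation total (1319) is slightly below the sum of its parts (1321).

```python
import pandas as pd

toy = pd.DataFrame({
    "Is Purpose: Navigation": [True, False, False],
    "Is Purpose: Satellite Positioning": [False, True, False],
    "Is Purpose: Space Science": [False, False, True],
    "Is Purpose: Space Observation": [False, False, True],
})

# Consolidated name -> original categories that roll up into it
groups = {
    "Navigation": ["Navigation", "Satellite Positioning"],
    "Space Science": ["Space Science", "Space Observation"],
}

for target, members in groups.items():
    member_cols = [f"Is Purpose: {m}" for m in members]
    # True when any member flag is True for the row
    toy[f"Purpose: {target}"] = toy[member_cols].any(axis=1)

print(toy[["Purpose: Navigation", "Purpose: Space Science"]].sum())
```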

### Aggregating
#### `Type of Orbit` and `Class of Orbit`

Next, let's take a look at the type of orbit. Unlike `Users` or `Purpose`, `Type of Orbit` is a 1:1 relationship. A satellite will only have a single type of orbit. However, this also looks like it could benefit from some consolidation.

In [1851]:
orbit_counts = (
    df["Type of Orbit"]
        .value_counts()
        .sort_values(ascending=False)
)

orbit_counts

Type of Orbit
Non-Polar Inclined            4042
Sun-Synchronous               1692
Polar                         1096
Equatorial                      38
Molniya                         23
Deep Highly Eccentric            9
Elliptical                       5
Sun-Synchronous near polar       2
Retrograde                       1
Cislunar                         1
Name: count, dtype: int64

Since some of these orbits could be considered [subtypes](https://en.wikipedia.org/wiki/Orbit) of others, we'll aggregate them into six larger categories and store the result in a new, broader column: `Orbit Category`.

In [1852]:
orbit_mapping = {
    "Non-Polar Inclined": "Inclined",
    "Sun-Synchronous": "Sun-Synchronous",
    "Sun-Synchronous near polar": "Sun-Synchronous",
    "Polar": "Polar",
    "Equatorial": "Equatorial",
    "Molniya": "Highly Elliptical",
    "Deep Highly Eccentric": "Highly Elliptical",
    "Elliptical": "Highly Elliptical",
    "Retrograde": "Other",
    "Cislunar": "Other",
}

df["Orbit Category"] = (
    df["Type of Orbit"].map(orbit_mapping)
)

df["Orbit Category"].value_counts()


Orbit Category
Inclined             4042
Sun-Synchronous      1694
Polar                1096
Equatorial             38
Highly Elliptical      37
Other                   2
Name: count, dtype: int64
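
Note that `Series.map` with a dictionary returns a missing value for anything not in the dictionary, including rows that were already blank. The counts above sum to 6,909 of 7,560 rows, which is consistent with the missing `Type of Orbit` values seen earlier. A quick way to tell "was blank" apart from "not in the mapping" is sketched below on a toy series. The names are hypothetical and `Tundra` is an invented value used only to trigger the check.

```python
import pandas as pd

orbits = pd.Series(["Polar", "Molniya", "Cislunar", None, "Tundra"])
mapping = {
    "Polar": "Polar",
    "Molniya": "Highly Elliptical",
    "Cislunar": "Other",
}

category = orbits.map(mapping)

# Values present in the source but absent from the mapping
unmapped = sorted(orbits[category.isna() & orbits.notna()].unique())

print(category.isna().sum())  # blank rows plus unmapped rows
print(unmapped)
```

An empty `unmapped` list would suggest the mapping covers every value that actually appears in the column.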

The `Class of Orbit` refers to the altitude of the orbit. The original dataset is divided into two broad classes:
* *nearly circular orbits*: LEO, MEO, and GEO
* *elliptical orbits*

Satellites in elliptical orbits have apogees and perigees that differ significantly from each other. They spend time at many different altitudes above the Earth’s surface.
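
To make "differ significantly" concrete, eccentricity can be computed from the two altitudes using the standard relation e = (r_a - r_p) / (r_a + r_p), where both radii are measured from the Earth's center. The sketch below is an illustration under stated assumptions: it uses a mean Earth radius of 6,371 km, the altitudes are invented examples, and whether the database's `Eccentricity` column was derived this way is not confirmed by the source.

```python
EARTH_RADIUS_KM = 6371.0  # mean radius; an assumption for this sketch

def eccentricity(perigee_km: float, apogee_km: float) -> float:
    """Eccentricity from perigee and apogee altitudes above the surface."""
    r_p = perigee_km + EARTH_RADIUS_KM  # perigee radius from Earth's center
    r_a = apogee_km + EARTH_RADIUS_KM   # apogee radius from Earth's center
    return (r_a - r_p) / (r_a + r_p)

near_circular = eccentricity(540, 560)   # illustrative low orbit
elongated = eccentricity(500, 39_800)    # illustrative highly elliptical orbit

print(f"{near_circular:.4f}")
print(f"{elongated:.4f}")
```

A value near zero indicates a nearly circular orbit, while values approaching one indicate an increasingly elongated ellipse.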

In [1853]:
df["Class of Orbit"].value_counts()

Class of Orbit
LEO           6767
GEO            590
MEO            143
Elliptical      59
LEo              1
Name: count, dtype: int64

However, for the purposes of our analysis, we'll consider elliptical orbits to be *High Earth Orbits (HEO)* based on the apogee. This will avoid confusion with the `Elliptical` category in `Type of Orbit`. We'll also fix the capitalization typo in one of the Low Earth Orbit entries.

In [1854]:

df["Class of Orbit"] = (
    df["Class of Orbit"]
        .replace({
            "LEo": "LEO",
            "Elliptical": "HEO"
        })
)

df["Class of Orbit"].value_counts()


Class of Orbit
LEO    6768
GEO     590
MEO     143
HEO      59
Name: count, dtype: int64

Since the shape of the orbit is still relevant though, we'll derive this from `Class of Orbit` and capture this in a new field: `Shape of Orbit`.

In [1855]:
df["Shape of Orbit"] = (
    df["Class of Orbit"]
        .map({
            "LEO": "Circular",
            "MEO": "Circular",
            "GEO": "Circular",
            "HEO": "Elliptical",
        })
)

df["Shape of Orbit"].value_counts()

Shape of Orbit
Circular      7501
Elliptical      59
Name: count, dtype: int64

Now that we've got three fields of orbit data, let's do another rename so they are easier to find later in our BI tool. We'll make sure everything starts with *Orbit*. We'll also rename `Type of Orbit` to `Orbit Subcategory` to better reflect the relationship.

In [1856]:
df = df.rename(columns={
    "Type of Orbit": "Orbit Subcategory",
    "Class of Orbit": "Orbit Class",
    "Shape of Orbit": "Orbit Shape",
})

[c for c in df.columns if "Orbit" in c]

['Orbit Class', 'Orbit Subcategory', 'Orbit Category', 'Orbit Shape']

### Mapping and Normalizing
#### `Operator` and `Contractor`

Here's where things get messy. The fields of `Operator` and `Contractor` were manually entered and cover a wide variety of organizations. As a result, several distinct issues appear:

* **Minor formatting and naming variations**: The same organization may appear multiple times due to differences in capitalization, punctuation, spacing, or legal suffixes

* **Parent companies vs. subsidiaries or internal divisions**: Some organizations are listed as specific subsidiaries, regional branches, or internal divisions of a larger organization

* **Joint or multi-organization missions**: Certain satellites are operated jointly by multiple organizations and are recorded as a combined value

* **Mixed organization types**: This includes private companies, government agencies, military organizations, universities, and research institutions that do not easily fit into a single hierarchy

For our purposes, granular distinction is beyond the scope of this project. We will do what we can to simplify.

In [1857]:
op_sample = (
    pd.Series(df["Operator"].dropna().unique())
      .sample(5, random_state=42)
      .tolist()
)

total_op_begin = df["Operator"].nunique()
total_con_begin = df["Contractor"].nunique()

print(f"Operators: {total_op_begin}")
print(f"Contractors: {total_con_begin}")
print(f"Sample: {', '.join(op_sample)}")

Operators: 662
Contractors: 562
Sample: National University of Singapore, Telesat Canada Ltd./APT Satellite Holdings Ltd., SES S.A. -- total capacity leased to subsidiary of EchoStar Corp., Shanghai Academy of Space Technology, Hellas-Sat Consortium Ltd.


That's a fairly large number to clean up. To begin, let's see what we can do about the formatting. We'll remove suffixes, trim whitespace, and apply other normalization rules.

In [1858]:
def normalize_org_column(df, col_name):
    s = (
        df[col_name]
            .astype(str)
            .str.strip()
            .str.replace(r"\s+", " ", regex=True)                 # collapse multiple spaces
            .str.replace(r"\s*/\s*", "/", regex=True)             # normalize slash spacing
            .str.replace(r",", "", regex=True)                    # remove commas
            .str.replace(r"\s+\.", ".", regex=True)               # remove space before period
            .str.replace(r"\.$", "", regex=True)                  # drop trailing periods
            .str.replace(
                r"\b(Ltd|Inc|LLC|PLC|Corp|Corporation|Co|S A|SA)\b",
                "",
                regex=True
            )                                                     # drop common suffixes
            .str.replace(r"\s+", " ", regex=True)                 # re-collapse spaces
            .str.strip()
            .replace({"nan": pd.NA})
    )
    
    s = s.str.split("/").str[0].str.strip() # Keep first entry

    df[col_name] = s
    return df[col_name]

In [1859]:
normalize_org_column(df, "Operator")
normalize_org_column(df, "Contractor")

op_sample_2 = (
    pd.Series(df["Operator"].dropna().unique())
      .sample(5, random_state=42)
      .tolist()
)

total_op_begin_2 = df["Operator"].nunique()
total_con_begin_2 = df["Contractor"].nunique()

print(f"Operators: {total_op_begin_2} (consolidated {total_op_begin - total_op_begin_2})")
print(f"Contractors: {total_con_begin_2} (consolidated {total_con_begin - total_con_begin_2})")
print(f"Sample: {', '.join(op_sample_2)}")

Operators: 579 (consolidated 83)
Contractors: 510 (consolidated 52)
Sample: Japan Meteorological Agency, Sun Yat-sen University, MinoSpace Technology, Skykraft, Tyvak Nanosatellite Systems
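
To see what the individual rules do to a single value, it can help to mirror them in a scalar function. The sketch below is a hypothetical re-expression of the same steps using only `re`. `normalize_org` is not part of the notebook and it skips missing-value handling. The inputs are taken from the first sample printed earlier.

```python
import re

SUFFIXES = re.compile(r"\b(Ltd|Inc|LLC|PLC|Corp|Corporation|Co|S A|SA)\b")

def normalize_org(name: str) -> str:
    """Scalar mirror of the column-level normalization steps."""
    s = name.strip()
    s = re.sub(r"\s+", " ", s)       # collapse multiple spaces
    s = re.sub(r"\s*/\s*", "/", s)   # normalize slash spacing
    s = s.replace(",", "")           # remove commas
    s = re.sub(r"\s+\.", ".", s)     # remove space before period
    s = re.sub(r"\.$", "", s)        # drop one trailing period
    s = SUFFIXES.sub("", s)          # drop common suffixes
    s = re.sub(r"\s+", " ", s).strip()
    return s.split("/")[0].strip()   # keep first entry

clean = normalize_org("Hellas-Sat Consortium Ltd.")
joint = normalize_org("Telesat Canada Ltd./APT Satellite Holdings Ltd.")

print(repr(clean))
print(repr(joint))  # a stray " ." is left where "Ltd." sat before the slash
```

The leftover period in the joint example arises because only the final period of the full string is removed before the suffixes are stripped. A later pass that strips trailing punctuation takes care of residue like this.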


From trial and error, I found that universities were particularly challenging to clean up. Here are some functions specific to academic institutions.

In [1860]:
def safe_university_cleanup(s: str):
    if s is None or pd.isna(s):
        return s

    s = str(s).strip()

    if "University" not in s:
        return s

    m = re.search(r"\bUniversity\b.*", s) # Grab from the first "University" onward
    if not m:
        return s

    uni_part = m.group(0)
    uni_part = re.split(r"[,/]", uni_part, maxsplit=1)[0].strip()

    if uni_part.lower() == "university": # Never collapse to just "University"
        return s

    return uni_part


def strip_trailing_space_and_punct(series: pd.Series) -> pd.Series:
    return (
        series
            .astype("string")
            .str.replace(r"[\s\.\,\;\:\-]+$", "", regex=True)
            .str.strip()
            .replace({"nan": pd.NA})
    )


def apply_academic_and_trailing_cleanup(df, col_name):
    df[col_name] = df[col_name].apply(safe_university_cleanup)
    df[col_name] = strip_trailing_space_and_punct(df[col_name])
    return df[col_name]

apply_academic_and_trailing_cleanup(df, "Operator")
apply_academic_and_trailing_cleanup(df, "Contractor")

total_op_begin_3 = df["Operator"].nunique()
total_con_begin_3 = df["Contractor"].nunique()

print(f"Operators: {total_op_begin_3} (consolidated {total_op_begin_2 - total_op_begin_3})")
print(f"Contractors: {total_con_begin_3} (consolidated {total_con_begin_2 - total_con_begin_3})")

Operators: 567 (consolidated 12)
Contractors: 499 (consolidated 11)
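
A few toy inputs show how the university rule behaves at its edges. The sketch below restates the logic without pandas so it can run on its own. `university_cleanup` is a hypothetical stand-in for `safe_university_cleanup`, and the inputs are illustrative rather than confirmed rows from the dataset. Anything before the word "University" is discarded, unless that would leave the bare word by itself.

```python
import re
from typing import Optional

def university_cleanup(s: Optional[str]) -> Optional[str]:
    """Keep the text from the first 'University' up to the next comma or slash."""
    if s is None or "University" not in s:
        return s
    m = re.search(r"\bUniversity\b.*", s.strip())
    if not m:
        return s
    uni_part = re.split(r"[,/]", m.group(0), maxsplit=1)[0].strip()
    # Never collapse to just "University"; keep the original instead
    return s if uni_part.lower() == "university" else uni_part

trailing = university_cleanup("University of Tokyo, Japan")
prefixed = university_cleanup("Technical University of Berlin")
suffix_style = university_cleanup("Sun Yat-sen University")

print(trailing)
print(prefixed)      # the leading "Technical" is dropped
print(suffix_style)  # left unchanged
```

The second case is the one to watch: names where a qualifier precedes "University of ..." lose that qualifier, so distinct institutions could in principle be merged.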


Now that the basic cleanup is done, on to the big step: mapping. Compiling this dictionary took a fair amount of manual effort, since each variant had to be identified by hand. It would be a good candidate for ML, but that's beyond the scope of this project.

In [1861]:
op_con_mapping = {
    # SpaceX variants
    "Spacex": "SpaceX",
    "SpaceX": "SpaceX",

    # Maxar / DigitalGlobe
    "DigitalGlobe": "Maxar",
    "DigitalGlobe Corporation": "Maxar",
    "Maxar Technologies Inc": "Maxar",
    "Maxar Technologies": "Maxar",
    "Maxar": "Maxar",

    # Planet / Planet Labs variants
    "Planet Labs Inc": "Planet Labs",
    "Planet Labs, Inc.": "Planet Labs",
    "Planet Labs": "Planet Labs",
    "Planet Labs, Inc": "Planet Labs",
    "Planet": "Planet Labs",

    # EUTELSAT family
    "EUTELSAT SA": "EUTELSAT",
    "EUTELSAT S A": "EUTELSAT",
    "EUTELSAT S.A.": "EUTELSAT",
    "EUTELSAT S.A": "EUTELSAT",
    "EUTELSAT Americas": "EUTELSAT",
    "Eutelsat SA": "EUTELSAT",
    "Eutelsat": "EUTELSAT",
    "EUTELSAT": "EUTELSAT",

    # GeoOptics
    "GeoOptics Inc": "GeoOptics",
    "GeoOptics, Inc.": "GeoOptics",
    "GeoOptics": "GeoOptics",

    # BlackSky
    "BlackSky Global": "BlackSky",
    "BlackSky Global, Inc": "BlackSky",
    "BlackSky": "BlackSky",

    # Spire
    "Spire Global Inc": "Spire Global",
    "Spire Global Inc.": "Spire Global",
    "Spire Global": "Spire Global",
    "Spire": "Spire Global",

    # Iridium
    "Iridium Communications Inc": "Iridium",
    "Iridium Communications, Inc.": "Iridium",
    "Iridium": "Iridium",

    # SES / EchoStar
    "Echostar Satellite Services LLC": "Echostar",
    "Echostar Satellite Services, LLC": "Echostar",
    "Echostar Corporation": "Echostar",
    "Echostar Satellite Services": "Echostar",
    "HughesNet leased from Echostar Technologies": "Echostar",
    "Echostar": "Echostar",
    "EchoStar": "Echostar",
    "SES S.A.": "SES",
    "SES S A": "SES",
    "SES S.A": "SES",
    "SES": "SES",
    "SES S.A. -- total capacity leased to subsidiary of EchoStar": "SES",

    # Intelsat / Inmarsat / Telesat / OneWeb
    "Intelsat S A": "Intelsat",
    "Intelsat S.A.": "Intelsat",
    "Intelsat S.A": "Intelsat",
    "Intelsat": "Intelsat",
    "INMARSAT, Ltd.": "Inmarsat",
    "INMARSAT": "Inmarsat",
    "INMARSAT .": "Inmarsat",
    "Inmarsat": "Inmarsat",
    "Telesat Canada Ltd": "Telesat",
    "Telesat Canada": "Telesat",
    "Telesat": "Telesat",
    "OneWeb": "OneWeb",
    "OneWeb Satellites": "OneWeb",

    # Airbus / OHB / Thales / Lockheed / Boeing
    "Airbus Defense and Space": "Airbus",
    "Airbus Defence and Space": "Airbus",
    "Airbus": "Airbus",
    "OHB SE": "OHB",
    "OHB Italia": "OHB",
    "OHB": "OHB",
    "Lockheed Martin": "Lockheed Martin",
    "Lockheed": "Lockheed Martin",
    "Thales Alenia Space": "Thales",
    "Thales": "Thales",
    "Boeing": "Boeing",
    "Boeing Co": "Boeing",

    # JSAT naming
    "Sky Perfect JSAT Corporation": "Sky Perfect JSAT",
    "Sky Perfect JSAT": "Sky Perfect JSAT",

    # Space agencies / acronyms
    "ESA": "European Space Agency",
    "European Space Agency": "European Space Agency",
    "CNES": "Centre National d'Etudes Spatiales",
    "Centre National d'Etudes Spatiales": "Centre National d'Etudes Spatiales",
    "DGA": "Directorate General of Armaments",
    "Directorate General of Armaments": "Directorate General of Armaments",

    # NASA / NOAA variants
    "National Aeronautics and Space Administration": "NASA",
    "National Aeronautics and Space Administration - Earth Science Enterprise": "NASA",
    "National Aeronautics and Space Administration Earth Science Office": "NASA",
    "NASA": "NASA",
    "NASA Goddard Space Flight Center": "NASA",
    "NASA Small Satellite Technology Program": "NASA",
    "National Aeronautics and Space Administration Goddard Space Flight Center": "NASA",
    "Goddard Space Flight Center": "NASA",
    "NASA Langley Research Center": "NASA",
    "National Aeronautics and Space Administration-Ames Research Center": "NASA",
    "NOAA": "NOAA",
    "National Oceanic and Atmospheric Administration": "NOAA",
    "National Oceanographic and Atmospheric Administration": "NOAA",

    # ISRO / JAXA standardization
    "ISRO": "Indian Space Research Organization",
    "Indian Space Research Organization": "Indian Space Research Organization",
    "JAXA": "Japan Aerospace Exploration Agency",
    "Japan Aerospace Exploration Agency": "Japan Aerospace Exploration Agency",

    # US military standardization
    "US Air Force": "US Air Force",
    "Air Force Research Laboratory": "US Air Force",
    "Air Force Satellite Control Network": "US Air Force",
    "US Air Force Academy": "US Air Force",
    "US Air Force Institute of Technology": "US Air Force",
    "Military Satellite Communications - US Air Force": "US Air Force",
    "U.S. Space Force": "US Space Force",
    "US Space Force": "US Space Force",
    "US Army": "US Army",
    "US Southern Command": "US Air Force",
    "US Naval Academy": "US Navy",
    "US Army Space and Missile Defense Command": "US Army",
    "U.S. Army’s Space and Missile Defense Command": "US Army",

    # DoD standardization
    "DoD": "Department of Defense",
    "Department of Defense": "Department of Defense",
    "Atlas 5": "Department of Defense",

    # AMSAT rollup
    "AMSAT-UK": "AMSAT",
    "AMSAT-NA": "AMSAT",
    "AMSAT": "AMSAT",

    # Beijing ZeroG rollup
    "Beijing ZeroG Technology": "Beijing ZeroG",
    "Beijing ZeroG Space Technology": "Beijing ZeroG",
    "Beijing ZeroG Space Technology Co., Ltd.": "Beijing ZeroG",
    "Beijing ZeroG Space Technology Co. Ltd.": "Beijing ZeroG",
    "Beijing ZeroG Space Technology Co Ltd": "Beijing ZeroG",
    "Beijing ZeroG Space Technology Co Ltd.": "Beijing ZeroG",

    # China aerospace umbrella
    "China Aerospace Science and Technology": "China Aerospace",
    "China Aerospace Science and Industry": "China Aerospace",
    "China Academy of Space Technology": "China Aerospace",
    "China Aerospace Science and Technology Corporation": "China Aerospace",
    "China Aerospace Science and Industry Corporation": "China Aerospace",
    "Chinese Academy of Launch Vehicle Technology": "China Aerospace",
    "Chinese Academy of Space Techology": "China Aerospace",
    "DFH Satellite": "China Aerospace",
    "DFH Satellite Co. Ltd.": "China Aerospace",
    "DFH Satellite Co. Ltd": "China Aerospace",
    "China Satellite Communication": "China Aerospace",

    # Long ministry composite shortening
    "China’s Ministry of Land and Resources Ministry of Environmental Protection and Ministry of Agriculture":
        "China Ministries (Land/Environment/Agriculture)",
    "China's Ministry of Land and Resources Ministry of Environmental Protection and Ministry of Agriculture":
        "China Ministries (Land/Environment/Agriculture)",

    # Misc. brand shortenings
    "Harris": "Harris Corporation",
    "Hisdesat": "Hisdesat",
    "Capella": "Capella Space",
    "Capella Space": "Capella Space",
    "NanoAvionics": "NanoAvionics",
    "Broadcasting Satellite System": "Broadcasting Satellite System",
    "Hellas-Sat Consortium": "Hellas-Sat",
    "GalaxySpace": "GalaxySpace",
    "Galaxy Space": "GalaxySpace",
    "Globalstar": "Globalstar",
    "ORBCOMM": "ORBCOMM",
    "DirecTV": "DirecTV",
    "DirecTV Latin America": "DirecTV",
    "National Authority for Remote Sensing and Space Science": "National Authority for Remote Sensing and Space Sciences",
    "National Space Program Office": "National Space Program",
    "China Amateur Satellite - CAMSAT": "Chinese Amateur Satellite",
    "Zhuhai Orbita Control Engineering": "Zhuhai Orbita",
    "Zhuhai Orbita Aerospace Science and Technology": "Zhuhai Orbita",
    "Unknown US agency": "Department of Defense",
    "Aerospace": "Aerospace Corporation",
    "Deimos": "Deimos Imaging",
    "Defence Research and Development Organization": "Defence and Research Development",
    "Defence Research and Development Canada": "Defence Research and Development",
    "European Space Operations Centre": "European Space Agency",
    "UK": "UK Government",
    "Hispamar": "Hispasat",
    "Spacety": "Spacety Aerospace Company",
    "Shanghai Micro Satellite Engineering Center": "Shanghai Engineering Center for Microsatellites",
    "Shanghai Academy of Space Technology": "Shanghai Academy of Spaceflight Technology",
    "Minospace": "MinoSpace Technology",
    "Horizons 2 Satellite": "Horizons Satellite",
    "F Space Test Office": "F Space",
    "F": "F Space",
    "Institute of Space and Astronautical Science": "Institute of Space and Aeronautical Science",

    # Universities
    "University of the Philippines Diliman and Japan’s Hokkaido University and Tohoku University": "University of the Philippines",
    "University of Copenhagen University of Southern Denmark Aalborg University and Aarhus University": "University of Copenhagen",
    "University of Colorado’s Laboratory for Atmospheric and Space Physics": "University of Colorado",
    "University of Colorado Boulder": "University of Colorado",
    "University of Chile for Aerospace Investigation)": "University of Chile",
    "University of California-Berkeley": "University of California",
    "University of Texas - Austin": "University of Texas",
    "University in Tokyo": "University of Tokyo",
    "University of South Florida Institute of Applied Engineering": "University of South Florida",
    "University of Southern California Space Engineering Research Center": "University of Southern California",
    "University of Tokyo and NESTRA": "University of Tokyo",
    "University of Toronto Institute for Aerospace Studies": "University of Toronto",
    "University of Versailles Saint-Quentin-en-Yvelines": "University of Versailles",
    "University of North Carolina - Wilmington": "University of North Carolina",
    "University of Illinois Urbana-Champaign": "University of Illinois",
    "College of Engineering King Saud University": "King Saud University",
    "Department of Astrophysical and Planetary Science UC Boulder": "University of Colorado",
    "Department of Computer Science and the Faculty of Engineering Ariel University": "Ariel University",
    "Nanjing and Hong Kong Universities": "Nanjing University",
    "MIT": "Massachusetts Institute of Technology",
    "Max Valier school Bolzano Italy Oskar von Miller school Merano Italy": "Max Valier School",
    "Institute of Software Chinese Academy of Sciences": "Chinese Academy of Sciences",

    # Contractor Only
    "AAC Clyde Space": "AAC",
    "AAC Microtecs": "AAC",
    "All-Russian Scientific Research Institute Of Electromechanics": "All-Russia Research Institute of Electromechanics",
    "Amsat-NA": "AMSAT",
    "Applied Physics Laboratory Johns Hopkins": "Johns Hopkins University",
    "Asher Space Research Institute at Technion": "Asher Space Research Institut",
    "Astrodynamics and Control Laboratory of Yonsei University": "Yonsei University",
    "Beijing MinoSpace Technology": "MinoSpace Technology",
    "Blue Canyon": "Blue Canyon Technologies",
    "Boeing Defense and Space": "Boeing",
    "Boeing Integrated Defense Systems": "Boeing",
    "Boeing Satellite Development Center": "Boeing",
    "Boeing Satellite Systems": "Boeing",
    "Boeing Space & Intelligence Systems": "Boeing",
    "Boeing Space Systems": "Boeing",
    "Built by Vietnamese engineers studying in Japan": "Vietnam National Space Center",
    "California Polytechnic Institute": "California Polytechnic State University",
    "California Polytechnic University": "California Polytechnic State University",
    "Carlo Gavazzi Space working with network of universities": "Carlo Gavazzi Space",
    "Centre National D'Etudes Spatiales": "Centre National d'Etudes Spatiales",
    "Changguang Satellite": "Chang Guang Satellite Technology",
    "China Academy of Science": "Chinese Academy of Sciences",
    "China Academy of Sciences": "Chinese Academy of Sciences",
    "China Academy of Space Technology (CAST": "China Academy of Space Technology",
    "CAST": "China Academy of Space Technology",
    "China Academy of SpaceTechnology": "China Academy of Space Technology",
    "Chinese Academy of Space Technology": "China Academy of Space Technology",
    "consortium of European companies and institutes": "Copernicus Land and Marine Environment services",
    "COSMIAC and ASTRA": "COSMIAC",
    "Dornier Systems and 35 subcontractors": "Dornier Systems",
    "EADS Astrium": "EADS",
    "EADS B": "EADS",
    "EADS Space": "EADS",
    "Engineering Department Chosun University": "Chosun University",
    "Gom Space ApS": "Gom Space",
    "GomSpace ApS": "Gom Space",
    "Innovative Solutions in Space BV": "Innovative Solutions in Space",
    "Institute of Mechanics of the Chinese Academy of Science": "Chinese Academy of Sciences",
    "Instituto Nacional de Técnia Aeroespacial": "Instituto Nacional de Tecnica Aerospacial",
    "Israel Aircraft Industries Missiles and Space Group": "Israel Aircraft Industries",
    "Korea Aerospace Research Institute": "Korean Aerospace Research Institute",
    "Lockheed Commercial Space Systems": "Lockheed Martin",
    "Lockheed Martin Astronautics": "Lockheed Martin",
    "Lockheed Martin Commercial Space Systems": "Lockheed Martin",
    "Lockheed Martin Missiles & Space": "Lockheed Martin",
    "Lockheed Martin Space Systems": "Lockheed Martin",
    "Lockheed Martin Space Systems Advanced Technology Center": "Lockheed Martin",
    "MIT Lincoln Laboratory": "Massachusetts Institute of Technology",
    "Mitsubishi": "Mitsubishi Heavy Industries",
    "Mitsubishi Electric": "Mitsubishi Heavy Industries",
    "NASA Ames Research Center": "NASA",
    "NASA Goddard Space Flight Center collaborators": "NASA",
    "NASA Jet Propulsion Laboratory": "NASA",
    "National Institute of Technology at Kochi College": "Kochi College",
    "National Reconnaissance Office": "National Reconnaissance Laboratory",
    "Naval Postgraduate School": "US Navy",
    "Naval Research Laboratory": "US Navy",
    "Northrop Grumman Innovative Systems": "Northrop Grumman",
    "Northrup Grumman": "Northrop Grumman",
    "Northrup Grumman Information Systems": "Northrop Grumman",
    "Northrup Grumman Innovation Systems": "Northrop Grumman",
    "NPO Lavochkin": "NPO",
    "NPO VNIIEM": "NPO",
    "NPO": "NPO",
    "NPO-PM": "NPO",
    "OAO ISS": "OAO",
    "OAO Resetneva": "OAO",
    "OAO-ISS": "OAS",
    "OAS ISS": "OAS",
    "OHB Germany": "OHB",
    "OHB System-AG": "OHB",
    "OHB-System AG": "OHB",
    "OHB-System GmbH SSTL": "OHB",
    "Payloads from DARPA and Internet of Things": "DARPA",
    "Ratheon": "Raytheon",
    "Reaktor Space Labs": "Reaktor Space Lab",
    "Satrec": "Satrec",
    "SaTReC of KAIST": "Satrec",
    "Shanghai ASES Spaceflight Technology": "Shanghai Academy of Spaceflight Technology",
    "Shanghai ASES Spaceflight Technology . . NJU": "Shanghai Academy of Spaceflight Technology",
    "Shanghai Institute of Microsatellite Innovation Chinese Academy of Sciences": "Chinese Academy of Sciences",
    "Shanghai Institute of Satellite Engineering at the Shanghai Academy of Spaceflight Technology": "Shanghai Academy of Spaceflight Technology",
    "Shenzhen Aerospace Dongfanghong Development": "Shenzhen Aerospace",
    "Shenzhen Aerospace Dongfanghong HIT Satellite": "Shenzhen Aerospace",
    "Shenzhen Aerospace Dongfanghong Satellite": "Shenzhen Aerospace",
    "Shenzhen Aerospace Oriental Red Sea Satellite": "Shenzhen Aerospace",
    "Skycraft": "Skykraft",
    "Space Dynamics Laboratory Utah State University": "Utah State University",
    "Space Research Center Polish Academy of Sciences": "Polish Academy of Sciences",
    "Space Research Institute King Abdulaziz City for Science and Technology": "King Abdulaziz City for Science and Technology",
    "Space Technologies Research Institute": "Space Technology Research Institute",
    "Spacety Aerospace": "Spacety Aerospace Company",
    "Surrey Satellite Technology": "Surrey Satellite Technologies",
    "Thales Alenia Space Italia": "Thales",
    "TRW and Aerojet Electronics Systems": "TRW",
    "TRW Defense and Space Systems Group": "TRW",
    "TRW Space and Electronics": "TRW",
    "TsSKB-Progress Samara Space Center and KB Arsenal": "TsSKB",
    "Two high schools and OHB System AG": "OHB Systems",
    "U.S. Army Space and Missile Defense Command": "US Army",
    "U.S. Army's Space and Missile Defense Command": "US Army",
    "UNITAS Space Flight Laboratory": "UNITAS",
    "Unitas SFL": "UNITAS",
    "University Berlin": "Technical University Berlin",
    "University Dresden": "Technical University Dresden",
    "University and other Japanese universities": "Wakayama University",
    "University NASA Jet Propulsion Laboratory Blue Canyon Technologies": "Colorado State University",
    "University Institute of Applied Engineering": "University of South Florida",
    "University Bengaluru": "PES University",
    "University of Louisiana at Lafayette": "University of Louisiana",
    "University of Tokyo the Tokyo Institute of Technology Keio University Japan Space Systems": "University of Tokyo",
    "University of Toronto Institute for Aerospace Studies Space Flight Laboratory": "University of Toronto",
    "Various": "Unknown",
    "USA": "Unknown"
}

In [1862]:
def apply_canonical_mapping(df, col_name: str, mapping: dict):
    """Replace values in df[col_name] using mapping (in place) and return the column."""
    df[col_name] = df[col_name].map(mapping).fillna(df[col_name])  # Fall back to the existing name
    df[col_name] = df[col_name].fillna("Unknown")  # Fill missing with "Unknown"

    return df[col_name]

apply_canonical_mapping(df, "Operator", op_con_mapping)
apply_canonical_mapping(df, "Contractor", op_con_mapping)

total_op_begin_4 = df["Operator"].nunique()
total_con_begin_4 = df["Contractor"].nunique()

print(f"Operators: {total_op_begin_4} (consolidated {total_op_begin_3 - total_op_begin_4})")
print(f"Contractors: {total_con_begin_4} (consolidated {total_con_begin_3 - total_con_begin_4})")

Operators: 493 (consolidated 74)
Contractors: 390 (consolidated 109)


That narrows it down a bit. Let's take a peek at the Top 10 most common Operators and Contractors.

In [1863]:
print("Top 10 Operators:")
display(df["Operator"].value_counts().head(10))

print("Top 10 Contractors:")
display(df["Contractor"].value_counts().head(10))


Top 10 Operators:


Operator
SpaceX                                  3996
OneWeb                                   589
Planet Labs                              220
Chinese Ministry of National Defense     149
Spire Global                             135
Ministry of Defense                      126
China Aerospace                           98
Swarm Technologies                        90
Iridium Communications                    75
National Reconnaissance Laboratory        75
Name: count, dtype: int64

Top 10 Contractors:


Contractor
SpaceX                3996
OneWeb                 589
China Aerospace        240
Planet Labs            199
Thales                 183
Spire Global           135
Space Systems          112
Lockheed Martin        108
Boeing                  97
Swarm Technologies      90
Name: count, dtype: int64
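
As a quick illustration of the map-with-fallback pattern used in `apply_canonical_mapping`, here is a minimal sketch on toy data. The series and `toy_mapping` are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy data; the names and mapping below are illustrative only.
names = pd.Series(["Boeing Space Systems", "SpaceX", None, "Northrup Grumman"])
toy_mapping = {
    "Boeing Space Systems": "Boeing",
    "Northrup Grumman": "Northrop Grumman",
}

# map() returns NaN for any value that has no key in the mapping...
mapped = names.map(toy_mapping)

# ...so fall back to the original name, then label true gaps as "Unknown".
canonical = mapped.fillna(names).fillna("Unknown")

print(canonical.tolist())
```

Names without an entry in the mapping pass through unchanged, so only the variants we explicitly list get consolidated.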

#### `Launch Vehicle`

There's a lot more crammed into this field than the name suggests. Some values are actual launch vehicles, while others include (or substitute) the launch method, the booster stage, or the spacecraft.

In [1864]:
lv_count = df["Launch Vehicle"].nunique()

lv_sample = (
    pd.Series(df["Launch Vehicle"].dropna().unique())
      .sample(5, random_state=42)
      .tolist()
)

print(f"Launch Vehicles: {lv_count}")
print(f"Sample: {', '.join(lv_sample)}")

Launch Vehicles: 158
Sample: Soyuz 2.1b, Long March 2B, Breeze KM, Long March 11H, Ariane-5


We'll begin by deriving the `Launch Vehicle Family`. This is the high-level rocket family for grouping and charts. We'll also make note of alternative methods for a later calculation.

Because launch vehicle names are more standardized than operator names, we'll match on prefixes and substrings instead of using a dictionary.

In [1865]:

def get_launch_vehicle_family(lv):
    if pd.isna(lv):
        return "Unknown"

    lv = lv.strip()

    if "Nanorack" in lv:
        return "Nanorack"
    if "Slingshot" in lv or "Dispenser" in lv:
        return "Deployer"

    if lv in {"L1011", "LauncherOne"}:
        return "Air Launch"

    if lv == "Space Shuttle":
        return "Space Shuttle"

    # Rocket families
    if lv.startswith("Falcon"):
        return "Falcon"
    if lv.startswith("Atlas"):
        return "Atlas"
    if lv.startswith("Soyuz"):
        return "Soyuz"
    if lv.startswith("Ariane"):
        return "Ariane"
    if lv.startswith("Long March"):
        return "Long March"
    if lv.startswith("PSLV"):
        return "PSLV"
    if lv.startswith("GSLV") or lv.startswith("LVM3"):
        return "GSLV"
    if lv.startswith("SSLV"):
        return "SSLV"
    if lv.startswith("Delta"):
        return "Delta"
    if lv.startswith("Titan"):
        return "Titan"
    if lv.startswith("Zenit"):
        return "Zenit"
    if lv.startswith("Minotaur"):
        return "Minotaur"
    if lv.startswith("Electron"):
        return "Electron"
    if lv.startswith("Antares"):
        return "Antares"
    if lv.startswith("Vega"):
        return "Vega"
    if lv.startswith("Pegasus"):
        return "Pegasus"
    if lv.startswith("Taurus"):
        return "Taurus"
    if lv.startswith("Proton"):
        return "Proton"
    if lv.startswith("Dnepr"):
        return "Dnepr"
    if lv.startswith("Nuri"):
        return "Nuri"
    if lv.startswith("Shavit"):
        return "Shavit"
    if lv.startswith("H2"):
        return "H-II"
    if lv.startswith("Kuaizhou"):
        return "Kuaizhou"
    if lv.startswith("Ceres"):
        return "Ceres"
    if lv.startswith("Rokot"):
        return "Rokot"
    if lv.startswith("Start"):
        return "Start"
    if lv.startswith("Qased"):
        return "Qased"
    if lv.startswith("Naro"):
        return "Naro"
    if lv.startswith("Jielong"):
        return "Jielong"
    if lv.startswith("Tsyklon"):
        return "Tsyklon"
    if lv.startswith("Rocket 3"):
        return "Rocket 3"
    if lv.startswith("Kosmos"):
        return "Kosmos"
    if lv.startswith("Epsilon"):
        return "Epsilon"
    if lv.startswith("JAXA M"):
        return "JAXA M-V"
    if lv.startswith("KT-"):
        return "KT-2"

    return "Other"

df["Launch Vehicle Family"] = df["Launch Vehicle"].apply(get_launch_vehicle_family)
df["Launch Vehicle Family"].value_counts()


Launch Vehicle Family
Falcon           4764
Soyuz             709
Long March        616
PSLV              249
Ariane            193
Atlas             142
Proton            123
Electron          103
Delta              89
Vega               65
Dnepr              61
H-II               48
Rokot              46
Pegasus            40
Zenit              35
GSLV               34
Kuaizhou           32
Nanorack           28
Minotaur           27
Rocket 3           21
Titan              21
Air Launch         20
Deployer           16
Kosmos             14
Other              12
Ceres              11
Epsilon             7
Shavit              7
Space Shuttle       6
Nuri                4
Antares             4
Jielong             3
Start               2
Tsyklon             2
Naro                1
Qased               1
Taurus              1
JAXA M-V            1
SSLV                1
KT-2                1
Name: count, dtype: int64
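
The chain of `startswith` checks above could also be written as a table-driven lookup. This is only a sketch of that alternative, not the code that produced the counts: `FAMILY_PREFIXES` and `family_from_prefix` are hypothetical names, and the table is abbreviated to a handful of families.

```python
import pandas as pd

# Hypothetical, abbreviated prefix table. The first match wins, so order
# matters whenever one prefix could shadow another.
FAMILY_PREFIXES = [
    ("Falcon", "Falcon"),
    ("Long March", "Long March"),
    ("GSLV", "GSLV"),
    ("LVM3", "GSLV"),
    ("H2", "H-II"),
]

def family_from_prefix(lv):
    """Return the family for the first matching prefix, else "Other"."""
    if pd.isna(lv):
        return "Unknown"
    lv = lv.strip()
    for prefix, family in FAMILY_PREFIXES:
        if lv.startswith(prefix):
            return family
    return "Other"

sample = pd.Series(["Falcon 9", "Long March 2B", "LVM3", "Breeze KM", None])
print(sample.apply(family_from_prefix).tolist())
```

Adding a new family then means adding one row to the table rather than another `if` branch.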

Next, we'll derive a `Launch Method` column that describes how each satellite reached orbit.

In [1866]:
def get_launch_method(lv):
    if pd.isna(lv):
        return "Unknown"

    if "Deployer" in lv or "Dispenser" in lv or "Slingshot" in lv:
        return "Deployer / Hosted Payload"

    if lv in {"L1011", "LauncherOne"}:
        return "Air Launch"

    if lv == "Space Shuttle":
        return "Space Shuttle"

    return "Orbital Rocket"

df["Launch Method"] = df["Launch Vehicle"].apply(get_launch_method)
df["Launch Method"].value_counts()

Launch Method
Orbital Rocket               7490
Deployer / Hosted Payload      44
Air Launch                     20
Space Shuttle                   6
Name: count, dtype: int64

Since our mapping was just based on substrings, it's entirely possible that our initial pass missed some categorizations. Let's see if we can find any such anomalies.

In [1867]:
df[
    df["Launch Vehicle Family"].isin(["Other", "Deployer"])
][["Launch Vehicle", "Launch Vehicle Family", "Launch Method"]].head(20)


|  | Launch Vehicle | Launch Vehicle Family | Launch Method |
| --- | --- | --- | --- |
| 110 | Breeze M | Other | Orbital Rocket |
| 353 | SEOPS Slingshot Deployer | Deployer | Deployer / Hosted Payload |
| 427 | Cygnus | Other | Orbital Rocket |
| 436 | Breeze M | Other | Orbital Rocket |
| 562 | ION SCV Dispenser | Deployer | Deployer / Hosted Payload |
| 563 | ION SCV Dispenser | Deployer | Deployer / Hosted Payload |
| 564 | ION SCV Dispenser | Deployer | Deployer / Hosted Payload |
| 565 | ION SCV Dispenser | Deployer | Deployer / Hosted Payload |
| 566 | ION SCV Dispenser | Deployer | Deployer / Hosted Payload |
| 567 | ION SCV Dispenser | Deployer | Deployer / Hosted Payload |


The *Other* category seems like it could benefit from further cleanup.

In [1868]:
other_lv = df[
    df["Launch Vehicle Family"] == "Other"
]["Launch Vehicle"]

print(f"Total records classified as 'Other': {len(other_lv)}")
print(f"\nLaunch Vehicle values classified as 'Other' (with counts):")

other_lv.value_counts()


Total records classified as 'Other': 12

Launch Vehicle values classified as 'Other' (with counts):


Launch Vehicle
Breeze M     7
Cygnus       2
Breeze KM    2
Fa           1
Name: count, dtype: int64

We'll do another cleanup pass to reclassify the "Other" launch vehicle families. Some of these values are not launch vehicles at all but upper stages or spacecraft. Since the actual launch vehicle wasn't provided, we can more accurately classify them as "Unknown".


In [1869]:
invalid_launch_vehicle_values = {
    "Breeze M",
    "Breeze KM",
    "Cygnus",
    "Fa",
}

mask_invalid_other = (
    (df["Launch Vehicle Family"] == "Other") &
    (df["Launch Vehicle"].isin(invalid_launch_vehicle_values))
)

df.loc[mask_invalid_other, "Launch Vehicle Family"] = "Unknown"

df["Launch Vehicle Family"].value_counts()

Launch Vehicle Family
Falcon           4764
Soyuz             709
Long March        616
PSLV              249
Ariane            193
Atlas             142
Proton            123
Electron          103
Delta              89
Vega               65
Dnepr              61
H-II               48
Rokot              46
Pegasus            40
Zenit              35
GSLV               34
Kuaizhou           32
Nanorack           28
Minotaur           27
Rocket 3           21
Titan              21
Air Launch         20
Deployer           16
Kosmos             14
Unknown            12
Ceres              11
Epsilon             7
Shavit              7
Space Shuttle       6
Nuri                4
Antares             4
Jielong             3
Start               2
Tsyklon             2
Naro                1
Qased               1
Taurus              1
JAXA M-V            1
SSLV                1
KT-2                1
Name: count, dtype: int64
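
One way to double-check that the two derived columns agree is a cross-tabulation of family against method. The sketch below uses a toy frame with made-up rows, so the column names match ours but the values are illustrative. Note that the rows we just reclassified as "Unknown" still carry the "Orbital Rocket" method, which this kind of table makes easy to spot.

```python
import pandas as pd

# Toy rows; values are illustrative only.
toy = pd.DataFrame({
    "Launch Vehicle Family": ["Falcon", "Deployer", "Air Launch", "Unknown", "Falcon"],
    "Launch Method": ["Orbital Rocket", "Deployer / Hosted Payload", "Air Launch",
                      "Orbital Rocket", "Orbital Rocket"],
})

# Each family should normally fall under a single method; a family spread
# over several columns is worth a manual look.
check = pd.crosstab(toy["Launch Vehicle Family"], toy["Launch Method"])
print(check)

# Families whose rows are spread over more than one method.
mixed = check[(check > 0).sum(axis=1) > 1].index.tolist()
print(mixed)
```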

### Handling Dates
#### `Date of Launch`

To begin, let's see if any dates are incorrectly formatted.

In [1870]:
parsed_attempt = pd.to_datetime(
    df["Date of Launch"],
    errors="coerce",
    format="mixed"
)

bad_dates = df.loc[
    parsed_attempt.isna() & df["Date of Launch"].notna(),
    ["Satellite", "Date of Launch"]
].copy()

print("Rows with unparseable Date of Launch values:")
display(bad_dates)


Rows with unparseable Date of Launch values:


|  | Satellite | Date of Launch |
| --- | --- | --- |
| 349 | Cicero-8 | 11/29/018 |
| 7186 | Tianmu-1 01 | 1/9//2023 |


We've got two. These look like basic typos so we can manually clean them up. Afterward, it'll be safe to parse the column as a datetime object.

In [1871]:
manual_date_fixes = {
    "Cicero-8": "11/29/2018",
    "Tianmu-1 01": "1/9/2023",
}

for sat, corrected in manual_date_fixes.items():
    df.loc[df["Satellite"] == sat, "Date of Launch"] = corrected

df["Date of Launch"] = pd.to_datetime(
    df["Date of Launch"],
    errors="coerce",
    format="mixed"
)

remaining_missing = df["Date of Launch"].isna().sum()
print(f"Remaining missing/unparseable Date of Launch values after fixes: {remaining_missing}")

display(df[df["Satellite"].isin(manual_date_fixes.keys())][
    ["Satellite", "Date of Launch"]
])

Remaining missing/unparseable Date of Launch values after fixes: 0


|  | Satellite | Date of Launch |
| --- | --- | --- |
| 349 | Cicero-8 | 2018-11-29 |
| 7186 | Tianmu-1 01 | 2023-01-09 |
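
The coerce-then-compare pattern we used to find the bad dates can be tried on a few toy values. This sketch assumes a single explicit `%m/%d/%Y` format rather than the `format="mixed"` setting used above, so it is a stricter variant of the same idea.

```python
import pandas as pd

# Toy values that mimic the two typos found above, plus one valid date.
raw = pd.Series(["11/29/018", "1/9//2023", "7/25/2020"])

# With an explicit format and errors="coerce", anything that does not match
# becomes NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Rows that had a value but failed to parse.
bad = raw[parsed.isna() & raw.notna()]
print(bad.tolist())
```

Comparing against `raw.notna()` keeps genuinely missing dates out of the list of typos.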


The dataset was last updated on May 1, 2023. Let's use this date to calculate the years in orbit.

In [1875]:
REFERENCE_DATE = pd.Timestamp("2023-05-01")

launch_date = pd.to_datetime(
    df["Date of Launch"],
    errors="coerce"
)

years_in_orbit = (
    (REFERENCE_DATE - launch_date).dt.days / 365.25
)

df["Years in Orbit"] = np.floor(years_in_orbit).astype("Int64")

df[["Satellite", "Date of Launch", "Years in Orbit"]].tail()


|  | Satellite | Date of Launch | Years in Orbit |
| --- | --- | --- | --- |
| 7555 | Ziyuan 1-02C | 2011-12-22 | 11 |
| 7556 | Ziyuan 1-2D | 2019-09-14 | 3 |
| 7557 | Ziyuan 3 | 2012-01-09 | 11 |
| 7558 | Ziyuan 3-2 | 2016-05-29 | 6 |
| 7559 | Ziyuan 3-3 | 2020-07-25 | 2 |
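
To make the arithmetic concrete, here is the same calculation on three of the launch dates shown above plus one missing value. It's a sketch under the same assumption as our cell: a year is treated as 365.25 days, which can be off by one for launches that fall on or very near an anniversary of the reference date.

```python
import numpy as np
import pandas as pd

REFERENCE = pd.Timestamp("2023-05-01")  # same reference date as above
launches = pd.Series(pd.to_datetime(["2011-12-22", "2019-09-14", "2020-07-25", None]))

# Elapsed days divided by 365.25, floored to whole years. The nullable
# Int64 dtype keeps a missing launch date as <NA> instead of failing.
years = np.floor((REFERENCE - launches).dt.days / 365.25).astype("Int64")
print(years.tolist())
```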


### Extrapolating Geographic Data
#### `Launch Site`

We've got launch sites, and we can use them to derive a lot more geodata. First, though, the site names could use some basic cleanup.

In [None]:
site_count = df["Launch Site"].dropna().nunique()
launch_sites = sorted(df["Launch Site"].dropna().unique())

print(f"Launch Sites: {site_count}")
for i in range(0, len(launch_sites), 5):
    print(", ".join(launch_sites[i:i+5]))

There are some spelling variations (Center vs. Centre), and some sites carry extra detail such as the specific launch pad. We'll normalize the names and also assign each launch site a category.

Specifically, we want to flag traditional spaceports, since they can be tagged with geodata, as opposed to sea-, air-, and space-based launch platforms.

In [None]:
def normalize_launch_site(site):
    if pd.isna(site):
        return ("Unknown", "Unknown")

    s = site.strip()

    if "International Space Station" in s: # Space-based
        return ("International Space Station", "Space-Based")

    if s in {"Orbital ATK L-1011", "Stargazer L-1011", "Virgin Orbit"}: # Air launch
        return ("Air Launch", "Air Launch")

    if s in {"Sea Launch Odyssey", "Yellow Sea Launch Platform"}: # Sea launch
        return ("Sea Launch", "Sea Launch")

    if s in {"Antares", "Cygnus", "FANTM-RAiL", "FANTM-RAiL [Xtenti]"}: # Clearly not a site (wrong field)
        return ("Unknown", "Unknown")

    # Geographic launch sites (pad-level detail intentionally collapsed)
    site_mapping = {
        "Cape Canaveral": "Cape Canaveral Space Force Station",
        "Vandeberg AFB": "Vandenberg Space Force Base",
        "Vandenberg AFB": "Vandenberg Space Force Base",
        "Baikonur Cosmodrome": "Baikonur Cosmodrome",
        "Guiana Space Center": "Guiana Space Center",
        "Jiuquan Satellite Launch Center": "Jiuquan Satellite Launch Center",
        "Xichang Satellite Launch Center": "Xichang Satellite Launch Center",
        "Taiyan Launch Center": "Taiyuan Launch Center",
        "Taiyuan Launch Center": "Taiyuan Launch Center",
        "Wenchang Satellite Launch Center": "Wenchang Space Center",
        "Wenchang Space Center": "Wenchang Space Center",
        "Plesetsk Cosmodrome": "Plesetsk Cosmodrome",
        "Vostochny Cosmodrome": "Vostochny Cosmodrome",
        "Svobodny Cosmodrome": "Svobodny Cosmodrome",
        "Satish Dhawan Space Centre": "Satish Dhawan Space Center",
        "Satish Dhawan Space Center": "Satish Dhawan Space Center",
        "Tanegashima Space Center": "Tanegashima Space Center",
        "Uchinoura Space Center": "Uchinoura Space Center",
        "Naro Space Center": "Naro Space Center",
        "Wallops Island Flight Facility": "Wallops Flight Facility",
        "Mid-Atlantic Regional Spaceport/Wallops Island": "Wallops Flight Facility",
        "Kodiak Island": "Pacific Spaceport Complex – Alaska",
        "Kodiak Launch Complex": "Pacific Spaceport Complex – Alaska",
        "Kwajalein Island": "Reagan Test Site (Kwajalein)",
        "Shahroud Missile Range": "Shahroud Missile Range",
    }

    normalized = site_mapping.get(s, s)
    return (normalized, "Geographic")

# Apply the function outside its body and split the returned tuple into two columns
df[["Launch Site", "Launch Site Type"]] = (
    df["Launch Site"]
        .apply(lambda x: pd.Series(normalize_launch_site(x)))
)

df["Launch Site Type"].value_counts()
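
For reference, here is a small sketch of how a function that returns a `(name, type)` tuple can be expanded into two columns. `toy_normalize` is a hypothetical stand-in for `normalize_launch_site`, and building a `DataFrame` from the list of tuples is just one alternative to wrapping each result in `pd.Series`.

```python
import pandas as pd

def toy_normalize(site):
    """Hypothetical stand-in for normalize_launch_site; returns (name, type)."""
    if pd.isna(site):
        return ("Unknown", "Unknown")
    if "International Space Station" in site:
        return ("International Space Station", "Space-Based")
    return (site.strip(), "Geographic")

toy = pd.DataFrame(
    {"Launch Site": ["Guiana Space Center ", None, "International Space Station"]}
)

# Collect the tuples, then build both columns in one go, reusing the index
# so the rows stay aligned.
pairs = toy["Launch Site"].apply(toy_normalize).tolist()
expanded = pd.DataFrame(pairs, index=toy.index, columns=["Launch Site", "Launch Site Type"])

toy["Launch Site"] = expanded["Launch Site"]
toy["Launch Site Type"] = expanded["Launch Site Type"]
print(toy)
```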

In [None]:
df["Launch Site"].value_counts()

In [None]:
# Print unique Launch Site values (sorted)
for site in sorted(df["Launch Site"].dropna().unique()):
    print(repr(site))

In [None]:
# Add geographic metadata for Launch Sites
# ---------------------------------------

import pandas as pd

df["Launch Site"] = df["Launch Site"].replace({
    "Rocket Lab Launch Complex 1": "Rocket Lab Launch Complex",
    "Rocket Lab Launch Complex 1B": "Rocket Lab Launch Complex",
})

launch_site_geo = {
    "Baikonur Cosmodrome": ("Kazakhstan", "Baikonur Region", "Baikonur"),
    "Cape Canaveral Space Force Station": ("United States", "Florida", "Cape Canaveral"),
    "Dombarovsky Air Base": ("Russia", "Orenburg Oblast", "Yasny"),
    "Guiana Space Center": ("France", "French Guiana", "Kourou"),
    "Jiuquan Satellite Launch Center": ("China", "Gansu", "Jiuquan"),
    "Naro Space Center": ("South Korea", "South Jeolla", "Goheung"),
    "Pacific Spaceport Complex – Alaska": ("United States", "Alaska", "Kodiak"),
    "Palmachim Launch Complex": ("Israel", "Central District", "Palmachim"),
    "Plesetsk Cosmodrome": ("Russia", "Arkhangelsk Oblast", "Mirny"),
    "Reagan Test Site (Kwajalein)": ("Marshall Islands", "Kwajalein Atoll", "Kwajalein"),
    "Rocket Lab Launch Complex": ("New Zealand", "Hawke’s Bay", "Mahia"),
    "Satish Dhawan Space Center": ("India", "Andhra Pradesh", "Sriharikota"),
    "Shahroud Missile Range": ("Iran", "Semnan Province", "Shahroud"),
    "Svobodny Cosmodrome": ("Russia", "Amur Oblast", "Svobodny"),
    "Taiyuan Launch Center": ("China", "Shanxi", "Taiyuan"),
    "Tanegashima Space Center": ("Japan", "Kagoshima Prefecture", "Minamitane"),
    "Uchinoura Space Center": ("Japan", "Kagoshima Prefecture", "Kimotsuki"),
    "Vandenberg Space Force Base": ("United States", "California", "Lompoc"),
    "Vostochny Cosmodrome": ("Russia", "Amur Oblast", "Tsiolkovsky"),
    "Wallops Flight Facility": ("United States", "Virginia", "Wallops Island"),
    "Wenchang Space Center": ("China", "Hainan", "Wenchang"),
    "Xichang Satellite Launch Center": ("China", "Sichuan", "Xichang"),
}

df[["Launch Country", "Launch Region", "Launch City"]] = (
    df["Launch Site"]
        .map(lambda s: launch_site_geo.get(s, (pd.NA, pd.NA, pd.NA)))
        .apply(pd.Series)
)


In [None]:
# Check which sites do not have geo data
df[
    df["Launch Country"].isna()
]["Launch Site"].value_counts()
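
The same lookup could also be done with a small lookup table and a left join instead of a dictionary of tuples. This is a sketch of that alternative on toy data: the two `geo` rows are copied from the dictionary above, and the `toy` frame is made up.

```python
import pandas as pd

# Two entries copied from the dictionary above; the rest are omitted for brevity.
geo = pd.DataFrame(
    [
        ("Guiana Space Center", "France", "French Guiana", "Kourou"),
        ("Naro Space Center", "South Korea", "South Jeolla", "Goheung"),
    ],
    columns=["Launch Site", "Launch Country", "Launch Region", "Launch City"],
)

toy = pd.DataFrame({"Launch Site": ["Naro Space Center", "Sea Launch", "Guiana Space Center"]})

# A left join keeps every satellite row; indicator=True adds a "_merge" column
# showing which rows found no match in the lookup table.
joined = toy.merge(geo, on="Launch Site", how="left", indicator=True)
unmatched = joined.loc[joined["_merge"] == "left_only", "Launch Site"].tolist()
print(unmatched)
```

The `_merge` column does the same job as our check for missing `Launch Country` values, without relying on NA as the signal.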


In [None]:
# Count records by full launch location combination
# ------------------------------------------------

launch_location_counts = (
    df
        .groupby(
            ["Launch Site", "Launch Country", "Launch Region", "Launch City"],
            dropna=False
        )
        .size()
        .reset_index(name="Count")
        .sort_values("Count", ascending=False)
)

launch_location_counts


In [None]:
# Launch Site -> (Latitude, Longitude)
# -----------------------------------
# Notes:
# - Fixed geographic sites get coordinates.
# - Mobile / non-geographic sites are set to NA (Air Launch, Sea Launch, ISS, Unknown).

launch_site_coords = {
    # Non-geographic / mobile
    "Air Launch": (pd.NA, pd.NA),
    "Sea Launch": (pd.NA, pd.NA),
    "International Space Station": (pd.NA, pd.NA),
    "Unknown": (pd.NA, pd.NA),

    # Fixed geographic sites (lat, lon)
    "Baikonur Cosmodrome": (45.9200, 63.3420),
    "Guiana Space Center": (5.2360, -52.7750),

    # US
    "Vandenberg Space Force Base": (34.73734, -120.58431),
    "Wallops Flight Facility": (37.9402, -75.4664),
    "Pacific Spaceport Complex – Alaska": (57.44283, -152.35811),
    "Cape Canaveral Space Force Station": (28.48731, -80.57429),

    # Marshall Islands
    "Reagan Test Site (Kwajalein)": (8.72512, 167.72818),

    # Russia / Kazakhstan
    "Vostochny Cosmodrome": (51.8167, 128.2500),
    "Plesetsk Cosmodrome": (62.92805, 40.57559),
    "Dombarovsky Air Base": (51.09393, 59.84266),
    "Svobodny Cosmodrome": (51.88888, 128.11187),

    # China
    "Jiuquan Satellite Launch Center": (40.9546639, 100.2883333),
    "Taiyuan Launch Center": (38.8427806, 111.6050972), 
    "Xichang Satellite Launch Center": (28.2409417, 102.0226000),  
    "Wenchang Space Center": (19.6144917, 110.9511333), 

    # Japan
    "Tanegashima Space Center": (30.40000, 130.97000), 
    "Uchinoura Space Center": (31.25151, 131.07549),

    # India
    "Satish Dhawan Space Center": (13.719939, 80.230425), 

    # South Korea
    "Naro Space Center": (34.4319, 127.5351),

    # Israel
    "Palmachim Launch Complex": (31.89778, 34.69056), 

    # New Zealand
    "Rocket Lab Launch Complex": (-39.26085, 177.86586),

    # Iran
    "Shahroud Missile Range": (36.20092, 55.33366),
}

# 1) Apply the mapping -> create new columns
df["Launch Latitude"] = df["Launch Site"].map(lambda s: launch_site_coords.get(s, (pd.NA, pd.NA))[0])
df["Launch Longitude"] = df["Launch Site"].map(lambda s: launch_site_coords.get(s, (pd.NA, pd.NA))[1])

# 2) Sanity checks / reporting
print("Rows:", len(df))
print("Rows missing Launch Latitude:", df["Launch Latitude"].isna().sum())

missing_sites = (
    df.loc[df["Launch Latitude"].isna(), "Launch Site"]
    .value_counts()
)

print("\nLaunch Sites missing coordinates (count):")
display(missing_sites)

In [None]:
# Distance from Equator & Hemisphere
# ---------------------------------
# Based on Launch Latitude (degrees)

KM_PER_DEGREE_LAT = 111.32  # approximate km per degree of latitude

lat_numeric = pd.to_numeric(df["Launch Latitude"], errors="coerce")

df["Distance from Equator (km)"] = (
    lat_numeric.abs() * KM_PER_DEGREE_LAT
)

# Hemisphere classification
def classify_hemisphere(lat):
    if pd.isna(lat):
        return "Unknown"
    if lat > 0:
        return "Northern"
    if lat < 0:
        return "Southern"
    return "Equator"

df["Hemisphere"] = df["Launch Latitude"].apply(classify_hemisphere)


In [None]:
df[
    df["Launch Latitude"].notna()
][["Launch Site", "Launch Latitude", "Distance from Equator (km)", "Hemisphere"]].head(10)
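
As a worked example of the distance calculation, Cape Canaveral's latitude of about 28.49° works out to roughly 3,171 km from the equator. The sketch below repeats the calculation on two latitudes from the coordinate table plus a missing value, and uses `np.select` as a vectorized stand-in for `classify_hemisphere`. It's an alternative formulation, not the code run above.

```python
import numpy as np
import pandas as pd

KM_PER_DEGREE_LAT = 111.32  # same approximation as above

# Latitudes taken from the coordinate table above; NaN stands in for a mobile site.
lat = pd.Series([28.48731, -39.26085, np.nan])

distance_km = (lat.abs() * KM_PER_DEGREE_LAT).round(1)

# np.select picks the first matching condition; rows matching none of them
# get the default.
hemisphere = np.select(
    [lat.isna(), lat > 0, lat < 0],
    ["Unknown", "Northern", "Southern"],
    default="Equator",
)

print(distance_km.tolist())
print(hemisphere.tolist())
```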


In [None]:
output_path = "data/ucs_satellite_cleaned_review.csv"

df.to_csv(output_path, index=False)

print(f"Review file written to: {output_path}")