# Web Scrape
Here I scrape a table of satellite data from the web. It will become part of the larger data set.

In [1]:
# imports
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np

# Step 1 - Import the Data

In [2]:
# download the page's HTML from the website link
URL = 'https://www.satellite-calculations.com/Satellite/satellitelist.php'
response = requests.get(URL)
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'html')

In [3]:
# find the sortable table - this contains our data
table = soup.find('table', {'class': 'sortable'})
len(table)  # number of direct child nodes of the table, not the number of rows

2145

In [4]:
# look at headers
headers = table.find_all('th')

In [5]:
# turn the contents of a header cell into a usable column name
def contents_to_title(contents):
    # keep the text pieces and skip child tags, whose string form starts with '<'
    words = [str(i) for i in contents]
    return ' '.join(w for w in words if not w.startswith('<'))

In [6]:
# create my list of column names
columns = []
for h in headers:
    contents = h.contents[0].contents
    c = contents_to_title(contents)
    columns.append(c)
    
len(columns)

24

In [7]:
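# the secret sauce
# some rows have a different structure, so this function has several fallbacks for parsing a cell
def get_vals(td):
    values = []
    for i, t in enumerate(td):
        # the cell at index 12 has its own layout: take the second child of its first element
        if i == 12:
            values.append(t.contents[0].contents[1])
            continue
        try:
            # first option: the text of a link inside the first child
            values.append(t.contents[0].find('a').contents[0])
        except AttributeError:
            try:
                # second option: the 'name' attribute of a link inside the second child
                values.append(t.contents[1].find('a')['name'])
            except IndexError:
                try:
                    # third option: the first piece of content of the first child
                    values.append(t.contents[0].contents[0])
                except IndexError:
                    # nothing to extract: the cell is skipped, which leaves the row one
                    # value short and shifts the remaining values one column to the left
                    continue

    return values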


# get rows now
rows = table.find_all('tr')
rows = rows[1:]

row_list = []
for r in rows:
    td = r.find_all('td')
    vals = get_vals(td)
    row_list.append(vals)
    
len(row_list)

1071
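
One thing to watch in `get_vals`: when every lookup fails, the cell is skipped, so the row comes out shorter and later values no longer line up with their column names. Below is a minimal sketch of an alternative that keeps one value per cell, assuming it is acceptable to store `None` for a cell that cannot be parsed. `first_value` and the lambda extractors are hypothetical names for illustration, and plain lists stand in for the parsed `<td>` cells.

```python
def first_value(cell, extractors):
    """Return the result of the first extractor that succeeds, else None."""
    for extract in extractors:
        try:
            return extract(cell)
        except (AttributeError, IndexError, KeyError, TypeError):
            continue
    # every extractor failed: keep a placeholder so the row stays aligned
    return None


# hypothetical extractors, tried in order for every cell
extractors = [
    lambda cell: cell[0][0],  # value nested inside the first child
    lambda cell: cell[1],     # value stored as the second child
]

# plain lists standing in for three parsed cells; the middle one is empty
cells = [[["a"]], [], [[], "b"]]
row = [first_value(c, extractors) for c in cells]
print(row)
```

Because every cell yields exactly one entry, an empty cell would show up as a null in its own column instead of moving the nulls to the end of the row.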

In [8]:
# let's make a data frame
sat_df = pd.DataFrame(row_list, columns=columns)
sat_df.shape

(1071, 24)

In [9]:
sat_df.head()

Unnamed: 0,Current Sat Longitude,Sat Name,SatCat,Launch date,TLE Source,Site,Org,Op,Current Lat,Lat drift 10 minutes,...,E/W Lon Osc (Incl),E/W Lon Osc (Ecc),Apogee[km],Perigee[km],Altitude [km],Epoch,Time of calculation,Time since Epoch,ID,Satellite Period [hh:mm:ss]
0,179.9720°E,INTELSAT 18 (IS-18),37834,2011-10-05,CEL,TYMSC,ITSO,+,-0.0207°S,0.000°N,...,0.0000°,0.0198°,35796,35781,35786,2023-04-20 00:00:00 UTC,2023-04-21 17:20:55 UTC,T=01.72,2011-056A,23:56:08.47
1,179.5798°E,INMARSAT 5-F3,40882,2015-08-28,CEL,TYMSC,IM,+,0.0286°N,0.001°N,...,0.0000°,0.0030°,35790,35787,35788,2023-04-20 14:54:20 UTC,2023-04-21 17:20:55 UTC,T=01.10,2015-042A,23:56:09.07
2,179.4772°E,SL-12 R/B(2),26101,2000-03-12,CEL,TYMSC,CIS,,-13.8664°S,0.117°N,...,0.8638°,0.1292°,35701,35606,35688,2023-04-20 11:08:43 UTC,2023-04-21 17:20:55 UTC,T=01.26,2000-013D,23:49:15.57
3,178.5444°E,TJS-6,47613,2021-02-04,CEL,XICLF,PRC,+,-0.0897°S,0.003°S,...,0.0001°,0.0267°,35797,35777,35790,2023-04-17 14:55:20 UTC,2023-04-21 17:20:55 UTC,T=04.10,2021-010A,23:56:04.72
4,178.3432°E,ATS 1,2608,1966-12-07,CEL,AFETR,US,-,-1.0645°S,0.076°S,...,0.0190°,0.0338°,35790,35765,35771,2023-04-20 20:01:46 UTC,2023-04-21 17:20:55 UTC,T=00.89,1966-110A,23:55:33.84


That part was the bulk of the work, and the data already looks pretty clean. Everything after this is minor tidying and not strictly necessary.

# Step 2 - Rename the SatCat Column

The 'SatCat' column here holds the same identifier as 'NORAD Number' in my other data set. To make merging easier, I'm going to rename the column.

In [10]:
sat_df.rename(columns={'SatCat':'NORAD Number'}, inplace=True)

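# Step 3 - Sort by Longitude

We can sort by satellite longitude to get a rough idea of where the satellites are relative to each other. The longitudes are stored as strings, so this is a text sort rather than a numeric one, and east and west values end up interleaved.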

In [11]:
sat_df.sort_values('Current Sat Longitude', inplace=True)
sat_df.head()

Unnamed: 0,Current Sat Longitude,Sat Name,NORAD Number,Launch date,TLE Source,Site,Org,Op,Current Lat,Lat drift 10 minutes,...,E/W Lon Osc (Incl),E/W Lon Osc (Ecc),Apogee[km],Perigee[km],Altitude [km],Epoch,Time of calculation,Time since Epoch,ID,Satellite Period [hh:mm:ss]
646,0.3424°W,METEOSAT-10 (MSG3),38552,2012-07-05,CEL,FRGUI,EUME,+,1.6814°N,0.049°N,...,0.0184°,0.0038°,35789,35786,35786,2023-04-17 00:00:00 UTC,2023-04-21 17:20:55 UTC,T=04.72,2012-035B,23:56:06.88
645,0.4799°E,EUTELSAT HOTBIRD 13F,54048,2022-10-15,CEL,AFETR,EUTE,+,-0.0095°S,0.003°S,...,0.0000°,0.0182°,35795,35781,35790,2023-04-20 20:54:07 UTC,2023-04-21 17:20:55 UTC,T=00.85,2022-134A,23:56:06.91
647,0.6335°W,THOR 7,40613,2015-04-26,CEL,FRGUI,NOR,+,-0.0080°S,0.001°N,...,0.0000°,0.0298°,35799,35777,35785,2023-04-21 00:20:13 UTC,2023-04-21 17:20:55 UTC,T=00.71,2015-022A,23:56:06.52
648,0.7310°W,THOR 5,32487,2008-02-11,CEL,TYMSC,NOR,+,-0.0266°S,0.001°S,...,0.0000°,0.0282°,35798,35778,35784,2023-04-20 20:53:44 UTC,2023-04-21 17:20:55 UTC,T=00.85,2008-006A,23:56:07.16
649,0.8244°W,THOR 6,36033,2009-10-29,CEL,FRGUI,NOR,+,-0.0192°S,0.000°S,...,0.0000°,0.0264°,35797,35778,35779,2023-04-20 20:53:44 UTC,2023-04-21 17:20:55 UTC,T=00.85,2009-058B,23:56:05.60
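
A minimal sketch of a numeric sort key for the longitude column, assuming the values always look like `179.9720°E` or `0.3424°W` and that west should count as negative. `parse_longitude` is a hypothetical helper, not part of the notebook above.

```python
def parse_longitude(text):
    """Convert a string such as '0.3424°W' to signed degrees (east positive)."""
    value, hemisphere = text.strip().split('°')
    degrees = float(value)
    # assumption: western longitudes are represented as negative numbers
    return -degrees if hemisphere.upper() == 'W' else degrees


samples = ['0.4799°E', '179.9720°E', '0.3424°W', '0.8244°W']
ordered = sorted(samples, key=parse_longitude)
print(ordered)
```

The same function could be applied to 'Current Sat Longitude' to create a numeric column, and the data frame sorted on that column instead.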


# Step 4 - Check for Nulls

Next, count the nulls in each column.

In [12]:
# print each column name with its number of null values
for column in sat_df:
    print(column, sat_df[column].isnull().sum())

Current Sat Longitude 0
Sat Name 0
NORAD Number 0
Launch date 0
TLE Source 0
Site 0
Org 0
Op 0
Current Lat 0
Lat drift 10 minutes 0
Lon drift 10 minutes 0
Longitude at Epoch 0
Lon Driftrate [deg pr.day] 0
Inclination 0
E/W Lon Osc (Incl) 0
E/W Lon Osc (Ecc) 0
Apogee[km] 0
Perigee[km] 0
Altitude [km] 0
Epoch 3
Time of calculation 3
Time since Epoch 3
ID 3
Satellite Period [hh:mm:ss] 3


Very good, not many nulls here: three per column, and only in the last five columns. 'Epoch', 'Time of calculation' and 'Time since Epoch' only tell us how recent the data is. 'ID' is just another identifier. The satellite period is largely redundant, because it can be derived from the orbit's apogee and perigee.

For those reasons, I will remove the columns with nulls, except 'ID', which could potentially be used to match records across data sets.
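
To illustrate the redundancy of the satellite period column: for an idealized two-body orbit, the period follows from the semi-major axis, which the apogee and perigee altitudes determine. This is a rough sketch, assuming the altitudes are measured above an equatorial Earth radius of 6378.137 km and using Earth's standard gravitational parameter of 398600.4418 km³/s². `orbital_period_seconds` is a hypothetical helper, and the website may compute its period differently.

```python
import math

EARTH_RADIUS_KM = 6378.137   # assumed reference radius for the altitudes
MU_EARTH = 398600.4418       # Earth's gravitational parameter in km^3/s^2


def orbital_period_seconds(apogee_km, perigee_km):
    """Kepler's third law: T = 2*pi*sqrt(a^3 / mu)."""
    semi_major_axis = EARTH_RADIUS_KM + (apogee_km + perigee_km) / 2
    return 2 * math.pi * math.sqrt(semi_major_axis ** 3 / MU_EARTH)


# apogee and perigee of the first row shown earlier (INTELSAT 18)
period = orbital_period_seconds(35796, 35781)
hours, remainder = divmod(period, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{int(hours):02d}:{int(minutes):02d}:{seconds:05.2f}")
```

The result should land within a few seconds of the listed period. The altitudes are rounded to whole kilometres, so an exact match is not expected.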

In [13]:
removes = ['Epoch', 'Time of calculation', 'Time since Epoch', 'Satellite Period [hh:mm:ss]']
sat_df.drop(columns=removes, inplace=True)

# Step 5 - Convert the Launch Date

Convert the launch date from a string to a datetime.

In [14]:
from datetime import datetime

In [15]:
# wrap the conversion in a function so that values that are not dates don't raise an error
def to_datetime(time):
    try:
        return datetime.strptime(time, '%Y-%m-%d')
    except ValueError:
        # this only occurs when we get a value of 'CEL', so we can just return that value back
        return time

In [16]:
sat_df['Launch date'] = sat_df['Launch date'].apply(to_datetime)
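
Returning the raw string on failure leaves the column with a mix of `datetime` and `str` values. Here is a minimal sketch of a variant that returns `None` instead, assuming entries that cannot be parsed may be treated as missing. `parse_launch_date` is a hypothetical name.

```python
from datetime import datetime


def parse_launch_date(text):
    """Parse 'YYYY-MM-DD'; return None for anything that is not a valid date."""
    try:
        return datetime.strptime(text, '%Y-%m-%d')
    except (ValueError, TypeError):
        # ValueError: wrong format such as 'CEL'; TypeError: non-string input such as None
        return None


print(parse_launch_date('2011-10-05'))
print(parse_launch_date('CEL'))
```

With this variant, the rows that failed to parse can be found afterwards with a null check on the column.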

There we have it! Another clean data set.