# Transform Station List
The raw station list is a fixed-width text file with no header row, which makes it inconvenient to work with. This notebook uses Pandas to convert it to CSV, a format whose structure can be inferred automatically, reducing our workload.

Update the following parameters in the first cell to accommodate your installation:

- BRONZE_STATION_LIST_PATH - The location of the raw igra2-station-list.txt file
- SILVER_STATION_LIST_PATH - The location to save the CSV version of the file

In [1]:
import os
import pandas as pd

BRONZE_STATION_LIST_PATH = '/Users/olievortex/lakehouse/default/Files/bronze/igra2/doc/igra2-station-list.txt'
SILVER_STATION_LIST_PATH = '/Users/olievortex/lakehouse/default/Files/silver/igra2/doc/igra2-station-list.csv'

In [2]:
# Get the path without the filename (backslashes are normalized to forward slashes first)
dest_path = os.path.dirname(SILVER_STATION_LIST_PATH.replace('\\', '/'))
print(f'Output folder: {dest_path}')

# Make sure the destination path exists
os.makedirs(dest_path, exist_ok=True)

# Variable not needed anymore
del dest_path

Output folder: /Users/olievortex/lakehouse/default/Files/silver/igra2/doc


In [3]:
# Define the fixed width intervals
colspecs = [
    (0, 11),        # id
    (12, 20),       # latitude
    (21, 30),       # longitude
    (31, 37),       # elevation
    (38, 40),       # state
    (41, 71),       # name
    (72, 76),       # fstyear
    (77, 81),       # lstyear
    (82, 88)        # nobs
]

# There's no header row so we must specify our own column names
names = ['id', 'latitude', 'longitude', 'elevation', 'state', 'name', 'fst_year', 'lst_year', 'nobs']
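
Each `(start, end)` pair in `colspecs` works like a Python slice: `start` is included and `end` is excluded. The sketch below illustrates this on a single synthetic line. The line is assembled by the code itself (the `values` list and the `sample_line` variable are illustrative, not part of the notebook), with values borrowed from the first row of the `df.head()` output, so the exact padding of the real file is an assumption.

```python
colspecs = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40),
            (41, 71), (72, 76), (77, 81), (82, 88)]
names = ['id', 'latitude', 'longitude', 'elevation', 'state', 'name',
         'fst_year', 'lst_year', 'nobs']

# Illustrative values only; the state field is intentionally empty
values = ['ACM00078861', '17.1170', '-61.7830', '10.0', '',
          'COOLIDGE FIELD (UA)', '1947', '1993', '13896']

# Build an 88-character line by placing each value inside its interval
chars = [' '] * 88
for (start, end), value in zip(colspecs, values):
    chars[start:end] = value.ljust(end - start)
sample_line = ''.join(chars)

# Slicing with the same (start, end) pairs recovers every field
sliced = {name: sample_line[start:end].strip()
          for name, (start, end) in zip(names, colspecs)}
print(sliced)
```

The one-character gaps between intervals (for example, position 11) are the separator columns that no field claims.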

In [4]:
# read_fwf is the Pandas reader for fixed-width files. The colspecs parameter gives the half-open
# [start, end) character range of each column. The names parameter specifies the column names.
# Passing header=None tells Pandas the first row contains data, not column names. Passing
# index_col=0 makes the first column (id) the index.
df = pd.read_fwf(BRONZE_STATION_LIST_PATH, colspecs=colspecs, header=None, names=names, index_col=0)
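
To see how these parameters interact without touching the real file, the same call can be tried on an in-memory buffer. This is a minimal sketch: `make_line` is a hypothetical helper that pads fields to the column widths above, and the two records reuse values shown elsewhere in this notebook. The alignment of numbers within their columns is an assumption about the raw file.

```python
import io
import pandas as pd

colspecs = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40),
            (41, 71), (72, 76), (77, 81), (82, 88)]
names = ['id', 'latitude', 'longitude', 'elevation', 'state', 'name',
         'fst_year', 'lst_year', 'nobs']

def make_line(id_, lat, lon, elev, state, name, fst, lst, nobs):
    # Hypothetical helper: pad each field to its width, one space between fields
    return (f'{id_:<11} {lat:>8} {lon:>9} {elev:>6} {state:<2} '
            f'{name:<30} {fst:>4} {lst:>4} {nobs:>6}')

sample_text = '\n'.join([
    make_line('ACM00078861', '17.1170', '-61.7830', '10.0', '',
              'COOLIDGE FIELD (UA)', '1947', '1993', '13896'),
    make_line('USW00094730', '42.0333', '-70.0333', '42.1', 'MA',
              'NORTH TRURO', '1944', '1946', '1581'),
])

# Same arguments as the real call, but reading from a string buffer
sample_df = pd.read_fwf(io.StringIO(sample_text), colspecs=colspecs,
                        header=None, names=names, index_col=0)
print(sample_df)
```

The blank state field comes back as a missing value, and the id column becomes the index, which is what allows lookups such as `df.loc['USW00094730']`.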

In [5]:
# Confirm the data types are correct (they are)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2879 entries, ACM00078861 to ZZXUAICE031
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   latitude   2879 non-null   float64
 1   longitude  2879 non-null   float64
 2   elevation  2879 non-null   float64
 3   state      562 non-null    object 
 4   name       2879 non-null   object 
 5   fst_year   2879 non-null   int64  
 6   lst_year   2879 non-null   int64  
 7   nobs       2879 non-null   int64  
dtypes: float64(3), int64(3), object(2)
memory usage: 202.4+ KB


In [6]:
# View a sampling for sanity checks (it is sane). Many records have a null state field, so NaN is expected.
df.head()

id,latitude,longitude,elevation,state,name,fst_year,lst_year,nobs
ACM00078861,17.117,-61.783,10.0,,COOLIDGE FIELD (UA),1947,1993,13896
AEM00041217,24.4333,54.65,16.0,,ABU DHABI INTERNATIONAL AIRPOR,1983,2024,39914
AEXUAE05467,25.25,55.37,4.0,,SHARJAH,1935,1942,2477
AFM00040911,36.7,67.2,378.0,,MAZAR-I-SHARIF,2010,2014,2179
AFM00040913,36.6667,68.9167,433.0,,KUNDUZ,2010,2013,4540


In [7]:
# Confirm that the state field is parsed correctly by viewing a row we know contains a value
df.loc['USW00094730']

latitude         42.0333
longitude       -70.0333
elevation           42.1
state                 MA
name         NORTH TRURO
fst_year            1944
lst_year            1946
nobs                1581
Name: USW00094730, dtype: object

In [8]:
# Save the dataframe as a CSV file
df.to_csv(SILVER_STATION_LIST_PATH)
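
One way to confirm that the CSV round-trips is to read it back with `id` as the index and spot-check a few values. The sketch below uses an in-memory buffer and a hypothetical two-row frame (`small_df`) in place of the real station list. In the notebook itself, the equivalent call would presumably be `pd.read_csv(SILVER_STATION_LIST_PATH, index_col='id')`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the full station list
small_df = pd.DataFrame(
    {
        'latitude': [17.117, 42.0333],
        'state': [None, 'MA'],
        'nobs': [13896, 1581],
    },
    index=pd.Index(['ACM00078861', 'USW00094730'], name='id'),
)

# Write to a buffer instead of a file, then rewind and read it back
buffer = io.StringIO()
small_df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col='id')

print(restored)
```

Because `to_csv` writes the index along with its name, the `id` column is the first column of the CSV and can be restored as the index on read.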