# Introduction to Data Preprocessing
In this notebook, we walk through the data preprocessing steps needed
for the US Airline Flight Routes and Fares dataset. We will:
1. Load the data and perform initial cleanup.
2. Handle missing or disparate coordinate entries.
3. Merge external data sources to augment our dataset.
4. Finally, output a cleaned CSV file for further analysis.

## Data Preprocessing

Import necessary libraries and functions

In [1]:
import numpy as np
import pandas as pd

pd.set_option("future.no_silent_downcasting", True) # Prevent silent data type changes during operations for future compatibility

In [2]:
import sys
sys.path.append('../src')

from utils import extract_coordinates

Load the <a href="https://www.kaggle.com/datasets/bhavikjikadara/us-airline-flight-routes-and-fares-1993-2024">US Airline Flight Routes and Fares dataset</a> into a dataframe

In [21]:
df = pd.read_csv("../data/US Airline Flight Routes and Fares 1993-2024.csv")
display(df.head())
print(df.shape)



Unnamed: 0,tbl,Year,quarter,citymarketid_1,citymarketid_2,city1,city2,airportid_1,airportid_2,airport_1,...,fare,carrier_lg,large_ms,fare_lg,carrier_low,lf_ms,fare_low,Geocoded_City1,Geocoded_City2,tbl1apk
0,Table1a,2021,3,30135,33195,"Allentown/Bethlehem/Easton, PA","Tampa, FL (Metropolitan Area)",10135,14112,ABE,...,81.43,G4,1.0,81.43,G4,1.0,81.43,,,202131013514112ABEPIE
1,Table1a,2021,3,30135,33195,"Allentown/Bethlehem/Easton, PA","Tampa, FL (Metropolitan Area)",10135,15304,ABE,...,208.93,DL,0.4659,219.98,UA,0.1193,154.11,,,202131013515304ABETPA
2,Table1a,2021,3,30140,30194,"Albuquerque, NM","Dallas/Fort Worth, TX",10140,11259,ABQ,...,184.56,WN,0.9968,184.44,WN,0.9968,184.44,,,202131014011259ABQDAL
3,Table1a,2021,3,30140,30194,"Albuquerque, NM","Dallas/Fort Worth, TX",10140,11298,ABQ,...,182.64,AA,0.9774,183.09,AA,0.9774,183.09,,,202131014011298ABQDFW
4,Table1a,2021,3,30140,30466,"Albuquerque, NM","Phoenix, AZ",10140,14107,ABQ,...,177.11,WN,0.6061,184.49,AA,0.3939,165.77,,,202131014014107ABQPHX


(245955, 23)


Loading the dataframe raises a warning: `DtypeWarning: Columns (20,21) have mixed types. Specify dtype option on import or set low_memory=False.` We first identify which columns these are, then decide whether the mixed types need to be fixed.

In [22]:
print("Columns with mixed types:")
for col in df.columns[20:22]:
  print(col)

Columns with mixed types:
Geocoded_City1
Geocoded_City2
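
One way to avoid this warning at load time is to declare the dtype of the affected columns explicitly. The following is a minimal sketch on a made-up two-row CSV held in memory (the sample rows are illustrative, not taken from the dataset); it assumes the geocoded column should be read as strings.

```python
import io

import pandas as pd

# Made-up sample: one row with a coordinate string, one row with the field empty
sample_csv = io.StringIO(
    'city1,Geocoded_City1\n'
    '"Albany, NY","(42.651242, -73.755418)"\n'
    '"Tampa, FL",\n'
)

# Declaring the dtype up front means pandas does not have to infer it chunk by chunk
sample = pd.read_csv(sample_csv, dtype={"Geocoded_City1": "string"})
print(sample.dtypes)
print(sample["Geocoded_City1"].isna().tolist())
```

Missing entries are still represented as missing values, so the later imputation steps are unaffected by this choice.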


In [23]:
print("Unique values for \"Geocoded_City1\":\n{}".format(df["Geocoded_City1"].unique()))
print("Unique values for \"Geocoded_City2\":\n{}".format(df["Geocoded_City2"].unique()))

Unique values for "Geocoded_City1":
[nan 'Salt Lake City, UT\n(40.758478, -111.888142)'
 'Colorado Springs, CO\n(38.835224, -104.819798)'
 'Pittsburgh, PA\n(40.442169, -79.994945)'
 'Las Vegas, NV\n(36.169202, -115.140597)'
 'Huntsville, AL\n(34.729538, -86.585283)'
 'Kansas City, MO\n(39.099792, -94.578559)'
 'Chicago, IL\n(41.775002, -87.696388)'
 'Albany, NY\n(42.651242, -73.755418)'
 'Boston, MA (Metropolitan Area)\n(42.358894, -71.056742)'
 'Miami, FL (Metropolitan Area)\n(44.977479, -93.264346)'
 'Nashville, TN\n(36.166687, -86.779932)'
 'Houston, TX\n(29.760803, -95.369506)'
 'Dallas/Fort Worth, TX\n(40.11086, -77.035636)'
 'El Paso, TX\n(31.76006, -106.492257)'
 'New York City, NY (Metropolitan Area)\n(40.123164, -75.333718)'
 'Charlotte, NC\n(35.222936, -80.840161)'
 'Charleston, SC\n(32.77647, -79.931027)'
 'Greensboro/High Point, NC\n(36.072701, -79.793899)'
 'Oklahoma City, OK\n(35.468494, -97.521264)'
 'Cleveland, OH (Metropolitan Area)\n(41.505546, -81.6915)'
 'Hartford, 

In [24]:
print("Percentage of missing values for 'Geocoded_City1': {:.2f}%".format(100 * df["Geocoded_City1"].isnull().sum() / df.shape[0]))
print("Percentage of missing values for 'Geocoded_City2': {:.2f}%".format(100 * df["Geocoded_City2"].isnull().sum() / df.shape[0]))

Percentage of missing values for 'Geocoded_City1': 15.94%
Percentage of missing values for 'Geocoded_City2': 15.94%


The values in the `Geocoded_City1` and `Geocoded_City2` columns are either missing, in the format `"city name, state name\n(coordinates)"`, or simply `"(coordinates)"`. We normalize every non-missing value in these columns so that it contains only the coordinates.

In [25]:
# Convert the values to be either NaN or only contain coordinates
df["Geocoded_City1"] = df["Geocoded_City1"].apply(lambda x: np.nan if pd.isnull(x) else extract_coordinates(x))
df["Geocoded_City2"] = df["Geocoded_City2"].apply(lambda x: np.nan if pd.isnull(x) else extract_coordinates(x))
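
`extract_coordinates` is imported from `../src/utils`, and its implementation is not shown in this notebook. As a rough illustration only, a function with similar behavior could be sketched with a regular expression as below. The name `extract_coordinates_sketch` and the pattern are assumptions, and the `"lat, lon"` output format is inferred from the values displayed later in the notebook.

```python
import re

# Matches "(lat, lon)" where both numbers may carry a sign and a decimal part
_COORD_PATTERN = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def extract_coordinates_sketch(value):
    """Return 'lat, lon' from a geocoded string, or None if no coordinate pair is found."""
    match = _COORD_PATTERN.search(value)
    if match is None:
        return None
    return f"{match.group(1)}, {match.group(2)}"


print(extract_coordinates_sketch("Albany, NY\n(42.651242, -73.755418)"))
print(extract_coordinates_sketch("(41.505546, -81.6915)"))
```

Because the pattern requires digits inside the parentheses, a suffix such as `(Metropolitan Area)` in the city name is skipped rather than mistaken for coordinates.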

In [26]:
print("Number of missing values for \"Geocoded_City1\" before imputation:", df["Geocoded_City1"].isnull().sum())
print("Number of missing values for \"Geocoded_City2\" before imputation:", df["Geocoded_City2"].isnull().sum())

# Impute the missing values in "Geocoded_City1" and "Geocoded_City2" based on matching values from the "airportid_1" and "airportid_2" columns
df["Geocoded_City1"] = df.groupby("airportid_1")["Geocoded_City1"].transform(lambda x: x.ffill().bfill())
df["Geocoded_City2"] = df.groupby("airportid_2")["Geocoded_City2"].transform(lambda x: x.ffill().bfill())

print("\nNumber of missing values for \"Geocoded_City1\" after imputation:", df["Geocoded_City1"].isnull().sum())
print("Number of missing values for \"Geocoded_City2\" after imputation:", df["Geocoded_City2"].isnull().sum())

Number of missing values for "Geocoded_City1" before imputation: 39206
Number of missing values for "Geocoded_City2" before imputation: 39206

Number of missing values for "Geocoded_City1" after imputation: 27
Number of missing values for "Geocoded_City2" after imputation: 39
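
To make the group-wise imputation concrete, here is a small sketch on a toy dataframe (the column names `airport_id` and `coords` and all values are made up for illustration). Within each airport group, `ffill` followed by `bfill` copies any known value to the missing rows; a group with no known value stays missing, which is why some nulls remain above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "airport_id": [1, 1, 1, 2, 2, 3],
    "coords": [np.nan, "40.0, -75.0", np.nan, "35.0, -106.0", np.nan, np.nan],
})

# Fill forward, then backward, separately inside each airport group
toy["coords_filled"] = toy.groupby("airport_id")["coords"].transform(
    lambda s: s.ffill().bfill()
)
print(toy)
```

Airport 3 has no known coordinates at all, so its row remains null after the fill.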


In [27]:
print("Cities that have missing values for \"Geocoded_City1\":", df["city1"][df["Geocoded_City1"].isnull()].unique())
print("Cities that have missing values for \"Geocoded_City2\":", df["city2"][df["Geocoded_City2"].isnull()].unique())

Cities that have missing values for "Geocoded_City1": ['Charlottesville, VA' 'Sioux Falls, SD' 'Hilton Head, SC' 'Ashland, WV'
 'New Haven, CT']
Cities that have missing values for "Geocoded_City2": ['Bozeman, MT' 'Traverse City, MI' 'Kalispell, MT' 'Jackson, WY'
 'Montrose/Delta, CO' 'St. Cloud, MN' 'Vero Beach, FL']


There are still remaining null values in the `Geocoded_City1` and `Geocoded_City2` columns even after imputation. This is likely because there are `airportid` values that do not have any non-null entries for `Geocoded_City` to begin with.

To address this, we use the <a href="https://simplemaps.com/data/us-cities">US Cities Database</a> to look up coordinates for the cities that still have no coordinates in our dataframe. City coordinates are used for imputation at this stage under the assumption that cities with missing coordinate values do not have multiple airports.

In [28]:
# Load the dataset containing the coordinates of US cities
us_cities = pd.read_csv("../data/uscities.csv")
cols_used = ["city", "state_id", "state_name", "lat", "lng", "population", "density"]
us_cities = us_cities[cols_used] # Extract columns to be used
us_cities["coordinates"] = us_cities["lat"].astype(str) + ", " + us_cities["lng"].astype(str) # Create a new column with coordinates
us_cities.drop(["lat", "lng"], axis=1, inplace=True)
display(us_cities.head())

Unnamed: 0,city,state_id,state_name,population,density,coordinates
0,New York,NY,New York,18908608,11080.3,"40.6943, -73.9249"
1,Los Angeles,CA,California,11922389,3184.7,"34.1141, -118.4068"
2,Chicago,IL,Illinois,8497759,4614.5,"41.8375, -87.6866"
3,Miami,FL,Florida,6080145,4758.9,"25.784, -80.2101"
4,Houston,TX,Texas,5970127,1384.0,"29.786, -95.3885"


Modify the `city1` and `city2` columns of the original dataframe so that they match the `city` and `state_id` columns of the `us_cities` dataframe.

In [29]:
# Divide the "city" column into "city" and "state" columns for both origin and destination
df["state_1"] = df["city1"].str.split(", ").str[-1].str[:2]
df["city_1"] = df["city1"].str.split(", ").str[0].str.split("/").str[0]
df["state_2"] = df["city2"].str.split(", ").str[-1].str[:2]
df["city_2"] = df["city2"].str.split(", ").str[0].str.split("/").str[0]
df.drop(["city1", "city2"], axis=1, inplace=True)
display(df.head())

Unnamed: 0,tbl,Year,quarter,citymarketid_1,citymarketid_2,airportid_1,airportid_2,airport_1,airport_2,nsmiles,...,carrier_low,lf_ms,fare_low,Geocoded_City1,Geocoded_City2,tbl1apk,state_1,city_1,state_2,city_2
0,Table1a,2021,3,30135,33195,10135,14112,ABE,PIE,970,...,G4,1.0,81.43,"40.602753, -75.469759","37.8606, -78.804199",202131013514112ABEPIE,PA,Allentown,FL,Tampa
1,Table1a,2021,3,30135,33195,10135,15304,ABE,TPA,970,...,UA,0.1193,154.11,"40.602753, -75.469759","37.8606, -78.804199",202131013515304ABETPA,PA,Allentown,FL,Tampa
2,Table1a,2021,3,30140,30194,10140,11259,ABQ,DAL,580,...,WN,0.9968,184.44,"35.084248, -106.649241","40.11086, -77.035636",202131014011259ABQDAL,NM,Albuquerque,TX,Dallas
3,Table1a,2021,3,30140,30194,10140,11298,ABQ,DFW,580,...,AA,0.9774,183.09,"35.084248, -106.649241","40.11086, -77.035636",202131014011298ABQDFW,NM,Albuquerque,TX,Dallas
4,Table1a,2021,3,30140,30466,10140,14107,ABQ,PHX,328,...,AA,0.3939,165.77,"35.084248, -106.649241","30.406931, -87.217578",202131014014107ABQPHX,NM,Albuquerque,AZ,Phoenix
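
The string operations above can be checked on a few city labels taken from the dataset. This sketch applies the same two expressions to a standalone Series; the variable names are illustrative.

```python
import pandas as pd

labels = pd.Series([
    "Allentown/Bethlehem/Easton, PA",
    "Tampa, FL (Metropolitan Area)",
    "Albuquerque, NM",
])

# State: the text after the last ", ", truncated to the two-letter code
state = labels.str.split(", ").str[-1].str[:2]
# City: the text before the first ", ", keeping only the first name of a "/" group
city = labels.str.split(", ").str[0].str.split("/").str[0]

print(pd.DataFrame({"city": city, "state": state}))
```

Keeping only the first name of a multi-city market (for example `Allentown`) is what lets the label match a single row in `us_cities`.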


## Clarifying the Data Merging Steps
During preprocessing, we cross-referenced city coordinates
with an external dataset to handle missing geographic details.
Extra features such as population and density were also joined
to enrich the dataset for further modeling.

Fill in null values for `Geocoded_City1` and `Geocoded_City2` by finding corresponding coordinates in `us_cities`

In [30]:
# Merge the us_cities dataframe with the original dataframe to fill in missing coordinates
df = df.merge(us_cities[['city', 'state_id', 'coordinates']], left_on=['city_1', 'state_1'], right_on=['city', 'state_id'], how='left')
df["Geocoded_City1"] = df['Geocoded_City1'].fillna(df['coordinates'])
df = df.drop(['city', 'state_id', 'coordinates'], axis=1)

df = df.merge(us_cities[['city', 'state_id', 'coordinates']], left_on=['city_2', 'state_2'], right_on=['city', 'state_id'], how='left')
df["Geocoded_City2"] = df['Geocoded_City2'].fillna(df['coordinates'])
df = df.drop(['city', 'state_id', 'coordinates'], axis=1)

# Check for null values again
geocity1_nullcount = df["Geocoded_City1"].isnull().sum()
print("Number of missing values for \"Geocoded_City1\" after imputation:", geocity1_nullcount)
if geocity1_nullcount > 0:
  print("Cities that have missing values for \"Geocoded_City1\":", df["city_1"][df["Geocoded_City1"].isnull()].unique())
geocity2_nullcount = df["Geocoded_City2"].isnull().sum()
print("Number of missing values for \"Geocoded_City2\" after imputation:", geocity2_nullcount)
if geocity2_nullcount > 0:
  print("Cities that have missing values for \"Geocoded_City2\":", df["city_2"][df["Geocoded_City2"].isnull()].unique())

Number of missing values for "Geocoded_City1" after imputation: 7
Cities that have missing values for "Geocoded_City1": ['Hilton Head' 'Ashland']
Number of missing values for "Geocoded_City2" after imputation: 0
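
The merge-then-fill pattern used here can be sketched on toy data as follows (city names and coordinates are placeholders, not real values). The `validate="many_to_one"` argument is an optional addition, not part of the cell above: it makes pandas raise an error if the lookup table has duplicate `(city, state_id)` keys, which would otherwise silently duplicate rows.

```python
import numpy as np
import pandas as pd

routes = pd.DataFrame({
    "city_1": ["CityA", "CityB", "CityC"],
    "state_1": ["MT", "FL", "WV"],
    "Geocoded_City1": [np.nan, "27.9, -82.5", np.nan],
})
lookup = pd.DataFrame({
    "city": ["CityA", "CityB"],
    "state_id": ["MT", "FL"],
    "coordinates": ["45.0, -111.0", "28.0, -82.4"],
})

merged = routes.merge(
    lookup,
    left_on=["city_1", "state_1"],
    right_on=["city", "state_id"],
    how="left",                # keep every route, matched or not
    validate="many_to_one",    # assumption: each (city, state_id) appears once in lookup
)
# Only rows that are still null take the looked-up coordinates
merged["Geocoded_City1"] = merged["Geocoded_City1"].fillna(merged["coordinates"])
merged = merged.drop(columns=["city", "state_id", "coordinates"])
print(merged)
```

`CityB` keeps its original value, `CityA` is filled from the lookup table, and `CityC` stays null because it has no match.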


Manually fill in the values that are still missing because of notation differences between the two datasets. Ashland has no entry in `us_cities` and accounts for a very small portion of the dataset, so its rows are dropped.

In [31]:
df = df[~((df["city_1"] == "Ashland") & (df["state_1"] == "WV"))]
df.loc[df["city_1"] == "Hilton Head", "Geocoded_City1"] = us_cities["coordinates"][us_cities["city"] == "Hilton Head Island"].values[0]

# Check for null values
print("Number of missing values for \"Geocoded_City1\" after imputation:", df["Geocoded_City1"].isnull().sum())
print("Number of missing values for \"Geocoded_City2\" after imputation:", df["Geocoded_City2"].isnull().sum())

Number of missing values for "Geocoded_City1" after imputation: 0
Number of missing values for "Geocoded_City2" after imputation: 0


We can add more features from the <a href="https://simplemaps.com/data/us-cities">US Cities Database</a> that may later help predict the number of passengers more accurately.

In [32]:
# Merge population and density for city_1
df = df.merge(us_cities[['city', 'state_id', 'population', 'density']], left_on=['city_1', 'state_1'], right_on=['city', 'state_id'], how='left')
df = df.rename(columns={'population': 'population_1', 'density': 'density_1'})
df = df.drop(['city', 'state_id'], axis=1)

# Merge population and density for city_2
df = df.merge(us_cities[['city', 'state_id', 'population', 'density']], left_on=['city_2', 'state_2'], right_on=['city', 'state_id'], how='left')
df = df.rename(columns={'population': 'population_2', 'density': 'density_2'})
df = df.drop(['city', 'state_id'], axis=1)

# Display the updated dataframe
display(df.head())
print(df[['population_1', 'population_2', 'density_1', 'density_2']].isnull().sum())

Unnamed: 0,tbl,Year,quarter,citymarketid_1,citymarketid_2,airportid_1,airportid_2,airport_1,airport_2,nsmiles,...,Geocoded_City2,tbl1apk,state_1,city_1,state_2,city_2,population_1,density_1,population_2,density_2
0,Table1a,2021,3,30135,33195,10135,14112,ABE,PIE,970,...,"37.8606, -78.804199",202131013514112ABEPIE,PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9
1,Table1a,2021,3,30135,33195,10135,15304,ABE,TPA,970,...,"37.8606, -78.804199",202131013515304ABETPA,PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9
2,Table1a,2021,3,30140,30194,10140,11259,ABQ,DAL,580,...,"40.11086, -77.035636",202131014011259ABQDAL,NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7
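3,Table1a,2021,3,30140,30194,10140,11298,ABQ,DFW,580,...,"40.11086, -77.035636",202131014011298ABQDFW,NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7
4,Table1a,2021,3,30140,30466,10140,14107,ABQ,PHX,328,...,"30.406931, -87.217578",202131014014107ABQPHX,NM,Albuquerque,AZ,Phoenix,769986.0,1159.8,4064275.0,1198.9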


population_1    23323
population_2    36913
density_1       23323
density_2       36913
dtype: int64


In [33]:
print("Cities in 'city1' with missing population values:", df["city_1"][df["population_1"].isnull()].unique())
print("Cities in 'city2' with missing population values:", df["city_2"][df["population_2"].isnull()].unique())

print("Cities in 'city1' with missing density values:", df["city_1"][df["density_1"].isnull()].unique())
print("Cities in 'city2' with missing density values:", df["city_2"][df["density_2"].isnull()].unique())

Cities in 'city1' with missing population values: ['Nantucket' 'New York City' "Martha's Vineyard" 'Hilton Head']
Cities in 'city2' with missing population values: ['New York City']
Cities in 'city1' with missing density values: ['Nantucket' 'New York City' "Martha's Vineyard" 'Hilton Head']
Cities in 'city2' with missing density values: ['New York City']
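
Unmatched keys like these can also be found directly at merge time. As a sketch, passing `indicator=True` to `merge` adds a `_merge` column that marks rows found only in the left table; the two-row frames below are illustrative, with population figures copied from the `us_cities` preview above.

```python
import pandas as pd

routes = pd.DataFrame({
    "city_1": ["New York City", "Chicago"],
    "state_1": ["NY", "IL"],
})
lookup = pd.DataFrame({
    "city": ["New York", "Chicago"],
    "state_id": ["NY", "IL"],
    "population": [18908608, 8497759],
})

checked = routes.merge(
    lookup,
    left_on=["city_1", "state_1"],
    right_on=["city", "state_id"],
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only" for a left join
)
unmatched = checked.loc[checked["_merge"] == "left_only", "city_1"].unique()
print(unmatched)
```

Here `New York City` is reported as unmatched because the lookup table spells it `New York`.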


Again, we manually fill in the values that are still missing because of notation differences between the two datasets.

In [34]:
# Map each unmatched city name to its manually chosen stand-in entry in us_cities
city_substitutes = {
  "Nantucket": "Siasconset",
  "New York City": "New York",
  "Martha's Vineyard": "Martha Lake",
  "Hilton Head": "Hilton Head Island",
}

for missing_city, substitute in city_substitutes.items():
  match = us_cities[us_cities["city"] == substitute]
  population, density = match["population"].values[0], match["density"].values[0]  # First matching row
  for side in ("1", "2"):
    rows = df["city_" + side] == missing_city
    df.loc[rows, "population_" + side] = population
    df.loc[rows, "density_" + side] = density

print("Number of missing values after imputation:")
print(df[["population_1", "population_2", "density_1", "density_2"]].isnull().sum())

Number of missing values after imputation:
population_1    0
population_2    0
density_1       0
density_2       0
dtype: int64


We now drop irrelevant/redundant features from the original dataframe.

According to the <a href="https://www.kaggle.com/datasets/bhavikjikadara/us-airline-flight-routes-and-fares-1993-2024">source</a> of this dataset, the data features are as follows:
1. tbl: Table identifier
2. Year: Year of the data record
3. quarter: Quarter of the year (1-4)
4. citymarketid_1: Origin city market ID
5. citymarketid_2: Destination city market ID
6. city1: Origin city name
7. city2: Destination city name
8. airportid_1: Origin airport ID
9. airportid_2: Destination airport ID
10. airport_1: Origin airport code
11. airport_2: Destination airport code
12. nsmiles: Distance between airports in miles
13. passengers: Number of passengers
14. fare: Average fare
15. carrier_lg: Code for the largest carrier by passengers
16. large_ms: Market share of the largest carrier
17. fare_lg: Average fare of the largest carrier
18. carrier_low: Code for the lowest fare carrier
19. lf_ms: Market share of the lowest fare carrier
20. fare_low: Lowest fare
21. Geocoded_City1: Geocoded coordinates for the origin city
22. Geocoded_City2: Geocoded coordinates for the destination city
23. tbl1apk: Unique identifier for the route

In [35]:
cols_not_used = ["tbl", "citymarketid_1", "citymarketid_2", "airportid_1", "airportid_2", "carrier_lg", "large_ms", "fare_lg", "carrier_low", "lf_ms", "fare_low", "tbl1apk", "fare"]
df = df.drop(cols_not_used, axis=1)
display(df.head())

Unnamed: 0,Year,quarter,airport_1,airport_2,nsmiles,passengers,Geocoded_City1,Geocoded_City2,state_1,city_1,state_2,city_2,population_1,density_1,population_2,density_2
0,2021,3,ABE,PIE,970,180,"40.602753, -75.469759","37.8606, -78.804199",PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9
1,2021,3,ABE,TPA,970,19,"40.602753, -75.469759","37.8606, -78.804199",PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9
2,2021,3,ABQ,DAL,580,204,"35.084248, -106.649241","40.11086, -77.035636",NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7
3,2021,3,ABQ,DFW,580,264,"35.084248, -106.649241","40.11086, -77.035636",NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7
4,2021,3,ABQ,PHX,328,398,"35.084248, -106.649241","30.406931, -87.217578",NM,Albuquerque,AZ,Phoenix,769986.0,1159.8,4064275.0,1198.9


Divide the geocoordinate values of cities into two columns for latitude and longitude

In [36]:
df["lat_1"] = df["Geocoded_City1"].str.split(", ").str[0].astype(float)
df["lon_1"] = df["Geocoded_City1"].str.split(", ").str[1].astype(float)

df["lat_2"] = df["Geocoded_City2"].str.split(", ").str[0].astype(float)
df["lon_2"] = df["Geocoded_City2"].str.split(", ").str[1].astype(float)

df.drop(["Geocoded_City1", "Geocoded_City2"], axis=1, inplace=True)
display(df.head())

Unnamed: 0,Year,quarter,airport_1,airport_2,nsmiles,passengers,state_1,city_1,state_2,city_2,population_1,density_1,population_2,density_2,lat_1,lon_1,lat_2,lon_2
0,2021,3,ABE,PIE,970,180,PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9,40.602753,-75.469759,37.8606,-78.804199
1,2021,3,ABE,TPA,970,19,PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9,40.602753,-75.469759,37.8606,-78.804199
2,2021,3,ABQ,DAL,580,204,NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7,35.084248,-106.649241,40.11086,-77.035636
3,2021,3,ABQ,DFW,580,264,NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7,35.084248,-106.649241,40.11086,-77.035636
4,2021,3,ABQ,PHX,328,398,NM,Albuquerque,AZ,Phoenix,769986.0,1159.8,4064275.0,1198.9,35.084248,-106.649241,30.406931,-87.217578
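
As an alternative sketch of the same split, `str.split(..., expand=True)` returns both parts at once, so each coordinate string is split only once. The two sample strings are taken from the table above.

```python
import pandas as pd

coords = pd.DataFrame({
    "Geocoded_City1": ["40.602753, -75.469759", "35.084248, -106.649241"],
})

# expand=True yields one column per part; cast both parts to float in one step
coords[["lat_1", "lon_1"]] = (
    coords["Geocoded_City1"].str.split(", ", expand=True).astype(float)
)
print(coords[["lat_1", "lon_1"]])
```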


Change column names where needed for easier interpretation

In [37]:
df = df.rename(columns={"Year": "year", "nsmiles": "distance", "passengers": "daily_passengers"})
display(df.head())

Unnamed: 0,year,quarter,airport_1,airport_2,distance,daily_passengers,state_1,city_1,state_2,city_2,population_1,density_1,population_2,density_2,lat_1,lon_1,lat_2,lon_2
0,2021,3,ABE,PIE,970,180,PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9,40.602753,-75.469759,37.8606,-78.804199
1,2021,3,ABE,TPA,970,19,PA,Allentown,FL,Tampa,627863.0,2754.2,2861173.0,1320.9,40.602753,-75.469759,37.8606,-78.804199
2,2021,3,ABQ,DAL,580,204,NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7,35.084248,-106.649241,40.11086,-77.035636
3,2021,3,ABQ,DFW,580,264,NM,Albuquerque,TX,Dallas,769986.0,1159.8,5830932.0,1478.7,35.084248,-106.649241,40.11086,-77.035636
4,2021,3,ABQ,PHX,328,398,NM,Albuquerque,AZ,Phoenix,769986.0,1159.8,4064275.0,1198.9,35.084248,-106.649241,30.406931,-87.217578


Check for missing values and the shape of the new dataframe

In [38]:
print("Shape of the new dataframe:", df.shape)
print("Number of missing values in the entire dataframe:\n{}".format(df.isnull().sum()))

Shape of the new dataframe: (245953, 18)
Number of missing values in the entire dataframe:
year                0
quarter             0
airport_1           0
airport_2           0
distance            0
daily_passengers    0
state_1             0
city_1              0
state_2             0
city_2              0
population_1        0
density_1           0
population_2        0
density_2           0
lat_1               0
lon_1               0
lat_2               0
lon_2               0
dtype: int64


Check if all numerical variables have the right data types

In [39]:
display(df.dtypes)

year                  int64
quarter               int64
airport_1            object
airport_2            object
distance              int64
daily_passengers      int64
state_1              object
city_1               object
state_2              object
city_2               object
population_1        float64
density_1           float64
population_2        float64
density_2           float64
lat_1               float64
lon_1               float64
lat_2               float64
lon_2               float64
dtype: object
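
Rather than reading the dtype listing by eye, the same check can be automated. This is a minimal sketch on a one-row frame with values from the table above; the list `expected_numeric` is an assumption about which columns should be numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

frame = pd.DataFrame({
    "year": [2021],
    "airport_1": ["ABE"],
    "population_1": [627863.0],
    "lat_1": [40.602753],
})

# Columns we expect to hold numbers (hypothetical list for this sketch)
expected_numeric = ["year", "population_1", "lat_1"]
non_numeric = [col for col in expected_numeric if not is_numeric_dtype(frame[col])]
print("Columns with unexpected dtypes:", non_numeric)
```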

Save the preprocessed data to a CSV file for later use.

In [40]:
df.to_csv("../data/cleaned_data.csv", index=False)
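
A quick round trip shows what `index=False` does: the row index is not written, so reading the file back gives the same columns with no extra index column. This sketch uses an in-memory buffer and a toy frame instead of the real file.

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "year": [2021, 2021],
    "airport_1": ["ABE", "ABQ"],
    "distance": [970, 580],
})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # omit the row index from the output
buffer.seek(0)                     # rewind before reading back

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```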