The cleaned data file should have the following layout:

| YEAR | MONTH | DAY | DAY_OF_WEEK | ORG_AIRPORT | DEST_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| integer | integer | integer | integer | string | string | integer | integer | integer | integer | integer | integer |

The file should be named "cleaned_data" and saved as a .csv file.
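
The layout above can also be checked in code. The following is a minimal sketch, not part of the original notebook: `EXPECTED_SCHEMA` and `schema_problems` are hypothetical names introduced here, and the single demo row reuses values from the first row of this dataset.

```python
import pandas as pd
from pandas.api import types as ptypes

# Target layout from the table above: column name -> expected kind.
# (EXPECTED_SCHEMA and schema_problems are illustrative names, not from the notebook.)
EXPECTED_SCHEMA = {
    "YEAR": "integer",
    "MONTH": "integer",
    "DAY": "integer",
    "DAY_OF_WEEK": "integer",
    "ORG_AIRPORT": "string",
    "DEST_AIRPORT": "string",
    "SCHEDULED_DEPARTURE": "integer",
    "DEPARTURE_TIME": "integer",
    "DEPARTURE_DELAY": "integer",
    "SCHEDULED_ARRIVAL": "integer",
    "ARRIVAL_TIME": "integer",
    "ARRIVAL_DELAY": "integer",
}


def schema_problems(df):
    """Return a list of mismatches against EXPECTED_SCHEMA (empty if none)."""
    problems = []
    if list(df.columns) != list(EXPECTED_SCHEMA):
        problems.append(f"unexpected columns or order: {list(df.columns)}")
    for name, kind in EXPECTED_SCHEMA.items():
        if name not in df.columns:
            continue  # already reported by the column check above
        if kind == "integer":
            ok = ptypes.is_integer_dtype(df[name])
        else:
            ok = ptypes.is_object_dtype(df[name]) or ptypes.is_string_dtype(df[name])
        if not ok:
            problems.append(f"{name}: expected {kind}, found {df[name].dtype}")
    return problems


# One row in the target layout, and a copy that breaks it with a float column.
good = pd.DataFrame(
    [[2024, 6, 1, 6, "ABQ", "BUR", 905, 904, -1, 1005, 955, -10]],
    columns=list(EXPECTED_SCHEMA),
)
bad = good.astype({"DEPARTURE_TIME": float})

print(schema_problems(good))
print(schema_problems(bad))
```

Running a check like this on the finished DataFrame, just before saving, would flag a leftover float column or a column that was not renamed.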

In [16]:
import pandas as pd

In [17]:
cleaned_data = pd.read_csv("C:/Users/cfman/OneDrive/Desktop/WGUClasses/D602 Deployment/Task 2/T_ONTIME_REPORTING.csv")

In [18]:
cleaned_data.head()

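|  | YEAR | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | ORIGIN | DEST | CRS_DEP_TIME | DEP_TIME | DEP_DELAY | CRS_ARR_TIME | ARR_TIME | ARR_DELAY |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0 | 2024 | 6 | 1 | 6 | ABQ | BUR | 905 | 904.0 | -1.0 | 1005 | 955.0 | -10.0 |
| 1 | 2024 | 6 | 1 | 6 | ABQ | LAX | 601 | 556.0 | -5.0 | 713 | 704.0 | -9.0 |
| 2 | 2024 | 6 | 1 | 6 | ABQ | LAX | 700 | 656.0 | -4.0 | 807 | 751.0 | -16.0 |
| 3 | 2024 | 6 | 1 | 6 | ABQ | LAX | 1201 | 1159.0 | -2.0 | 1317 | 1354.0 | 37.0 |
| 4 | 2024 | 6 | 1 | 6 | ABQ | LAX | 1247 | 1237.0 | -10.0 | 1354 | 1335.0 | -19.0 |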


In [19]:
column_names = list(cleaned_data.columns)

In [20]:
# Print the Python type of the first value in each column
for name in column_names:
    print(name, "is type:", type(cleaned_data[name].iloc[0]))

YEAR is type: <class 'numpy.int64'>
MONTH is type: <class 'numpy.int64'>
DAY_OF_MONTH is type: <class 'numpy.int64'>
DAY_OF_WEEK is type: <class 'numpy.int64'>
ORIGIN is type: <class 'str'>
DEST is type: <class 'str'>
CRS_DEP_TIME is type: <class 'numpy.int64'>
DEP_TIME is type: <class 'numpy.float64'>
DEP_DELAY is type: <class 'numpy.float64'>
CRS_ARR_TIME is type: <class 'numpy.int64'>
ARR_TIME is type: <class 'numpy.float64'>
ARR_DELAY is type: <class 'numpy.float64'>


From this output, we can see that "DEP_TIME", "DEP_DELAY", "ARR_TIME", and "ARR_DELAY" are floats and need to be converted to integers, as specified in the initial breakdown of what the cleaned data should look like. Several columns also need to be renamed.
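
A likely reason these four columns load as floats is that they contain missing values: when `pd.read_csv` finds an empty cell in a numeric column, it stores the column as float64 so the gap can be represented as NaN. The sketch below shows this on a made-up two-row CSV; the inline data is an assumption for illustration, not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV: the second row has no DEP_TIME.
csv_text = "CRS_DEP_TIME,DEP_TIME\n905,904\n601,\n"
demo = pd.read_csv(io.StringIO(csv_text))

# CRS_DEP_TIME has no gaps and stays an integer column;
# DEP_TIME has one gap and is loaded as float64.
print(demo.dtypes)
print(demo)
```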

## Target layout, repeated here for reference

| YEAR | MONTH | DAY | DAY_OF_WEEK | ORG_AIRPORT | DEST_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| integer | integer | integer | integer | string | string | integer | integer | integer | integer | integer | integer |

In [25]:
cleaned_data = cleaned_data.rename(columns = {"DAY_OF_MONTH": "DAY", "ORIGIN": "ORG_AIRPORT", "DEST": "DEST_AIRPORT",
                               "CRS_DEP_TIME": "SCHEDULED_DEPARTURE", "DEP_TIME": "DEPARTURE_TIME", "DEP_DELAY": "DEPARTURE_DELAY", 
                               "CRS_ARR_TIME": "SCHEDULED_ARRIVAL", "ARR_TIME": "ARRIVAL_TIME", "ARR_DELAY": "ARRIVAL_DELAY"})

In [26]:
cleaned_data.head()

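|  | YEAR | MONTH | DAY | DAY_OF_WEEK | ORG_AIRPORT | DEST_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0 | 2024 | 6 | 1 | 6 | ABQ | BUR | 905 | 904.0 | -1.0 | 1005 | 955.0 | -10.0 |
| 1 | 2024 | 6 | 1 | 6 | ABQ | LAX | 601 | 556.0 | -5.0 | 713 | 704.0 | -9.0 |
| 2 | 2024 | 6 | 1 | 6 | ABQ | LAX | 700 | 656.0 | -4.0 | 807 | 751.0 | -16.0 |
| 3 | 2024 | 6 | 1 | 6 | ABQ | LAX | 1201 | 1159.0 | -2.0 | 1317 | 1354.0 | 37.0 |
| 4 | 2024 | 6 | 1 | 6 | ABQ | LAX | 1247 | 1237.0 | -10.0 | 1354 | 1335.0 | -19.0 |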


In [27]:
cleaned_data.isna().sum()

YEAR                     0
MONTH                    0
DAY                      0
DAY_OF_WEEK              0
ORG_AIRPORT              0
DEST_AIRPORT             0
SCHEDULED_DEPARTURE      0
DEPARTURE_TIME         650
DEPARTURE_DELAY        651
SCHEDULED_ARRIVAL        0
ARRIVAL_TIME           732
ARRIVAL_DELAY          987
dtype: int64

Before the float columns can be converted to integers, the missing values in those columns have to be dealt with, because NaN cannot be stored in a plain integer column.

This belongs to the data cleaning process in Part C, but I need to do it here to produce an acceptable file to upload.

I am going to keep it simple and drop every row that has a missing value.
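
To illustrate the trade-off, here is a small sketch on made-up values (not rows from the real file). It shows that a plain integer cast fails while NaN is present, that dropping the incomplete rows makes the cast work, and that the pandas nullable `Int64` dtype is an alternative that keeps every row. The notebook uses the drop approach; the nullable variant is shown only for comparison.

```python
import numpy as np
import pandas as pd

# Made-up demo frame with one gap in each column.
demo = pd.DataFrame({
    "DEPARTURE_TIME": [904.0, np.nan, 656.0],
    "ARRIVAL_DELAY": [-10.0, -9.0, np.nan],
})

# A plain integer cast raises while NaN is present.
try:
    demo.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True
print("plain cast failed:", cast_failed)

# Approach used in this notebook: drop incomplete rows, then cast.
dropped = demo.dropna().astype("int64")

# Alternative: the nullable Int64 dtype keeps the rows and marks gaps as <NA>.
nullable = demo.astype("Int64")

print(dropped)
print(nullable)
```

With the nullable dtype, the gaps would most likely be written to the .csv as empty cells, so dropping the rows is the simpler way to get a file in which every value is an integer.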

In [None]:
len(cleaned_data)

In [31]:
cleaned_data = cleaned_data.dropna()

In [32]:
len(cleaned_data)

107024

Dropping the rows with missing values cost only about 1,000 rows, leaving 107,024 of the more than 100,000 we started with.
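
The size of the loss can also be computed directly instead of comparing two `len` outputs by eye. This is a sketch under assumptions: `drop_incomplete_rows` is a hypothetical helper, and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def drop_incomplete_rows(df):
    """Drop rows containing any missing value; report how many were removed."""
    before = len(df)
    cleaned = df.dropna()
    removed = before - len(cleaned)
    share = removed / before if before else 0.0
    return cleaned, removed, share


# Made-up frame: 2 of the 4 rows have a gap.
demo = pd.DataFrame({
    "DEPARTURE_DELAY": [-1.0, np.nan, -4.0, -2.0],
    "ARRIVAL_DELAY": [-10.0, -9.0, np.nan, 37.0],
})

cleaned_demo, removed, share = drop_incomplete_rows(demo)
print(f"removed {removed} of {len(demo)} rows ({share:.1%})")
```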

In [35]:
## https://www.geeksforgeeks.org/change-data-type-for-one-or-more-columns-in-pandas-dataframe/

# Define the conversion dictionary
convert_dict = {'DEPARTURE_TIME': int, 'DEPARTURE_DELAY': int,'ARRIVAL_TIME': int,'ARRIVAL_DELAY': int}

# Convert columns using the dictionary
cleaned_data = cleaned_data.astype(convert_dict)

print(cleaned_data.dtypes)

YEAR                    int64
MONTH                   int64
DAY                     int64
DAY_OF_WEEK             int64
ORG_AIRPORT            object
DEST_AIRPORT           object
SCHEDULED_DEPARTURE     int64
DEPARTURE_TIME          int32
DEPARTURE_DELAY         int32
SCHEDULED_ARRIVAL       int64
ARRIVAL_TIME            int32
ARRIVAL_DELAY           int32
dtype: object


In [36]:
cleaned_data.head()

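|  | YEAR | MONTH | DAY | DAY_OF_WEEK | ORG_AIRPORT | DEST_AIRPORT | SCHEDULED_DEPARTURE | DEPARTURE_TIME | DEPARTURE_DELAY | SCHEDULED_ARRIVAL | ARRIVAL_TIME | ARRIVAL_DELAY |
|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| 0 | 2024 | 6 | 1 | 6 | ABQ | BUR | 905 | 904 | -1 | 1005 | 955 | -10 |
| 1 | 2024 | 6 | 1 | 6 | ABQ | LAX | 601 | 556 | -5 | 713 | 704 | -9 |
| 2 | 2024 | 6 | 1 | 6 | ABQ | LAX | 700 | 656 | -4 | 807 | 751 | -16 |
| 3 | 2024 | 6 | 1 | 6 | ABQ | LAX | 1201 | 1159 | -2 | 1317 | 1354 | 37 |
| 4 | 2024 | 6 | 1 | 6 | ABQ | LAX | 1247 | 1237 | -10 | 1354 | 1335 | -19 |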


Every column is now an integer except ORG_AIRPORT and DEST_AIRPORT, which are strings, and the column names match what the cleaned file requires. With that cleaned up, we can move on to Part C.

In [37]:
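# index=False keeps the row index out of the file, so it contains only the 12 required columns
cleaned_data.to_csv('C:/Users/cfman/OneDrive/Desktop/WGUClasses/D602 Deployment/Task 2/Repo Files/cleaned_data.csv', index=False)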
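
Whether the row index is written matters for the required layout. By default, `to_csv` writes the index as an extra first column with no header, and `pd.read_csv` then names that column "Unnamed: 0" when the file is read back. The sketch below compares both settings on a made-up frame, using in-memory buffers instead of the real path.

```python
import io

import pandas as pd

# Made-up frame; the index has a gap, as it would after dropna().
demo = pd.DataFrame(
    {"YEAR": [2024, 2024], "ORG_AIRPORT": ["ABQ", "ABQ"], "ARRIVAL_DELAY": [-10, -16]},
    index=[0, 2],
)

with_index = io.StringIO()
demo.to_csv(with_index)  # default: the index is written as the first column

without_index = io.StringIO()
demo.to_csv(without_index, index=False)  # only the data columns are written

reread_with = pd.read_csv(io.StringIO(with_index.getvalue()))
reread_without = pd.read_csv(io.StringIO(without_index.getvalue()))

print(list(reread_with.columns))
print(list(reread_without.columns))
```

Reading the saved file back and checking its columns this way is a quick test that the upload has exactly the expected header.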