In [29]:
# Imports
import os
import boto3
import pandas as pd
from io import StringIO

## Data Ingestion

---

In [30]:
# Load variables from .env file, ignoring comments and lines without '='
def load_env_variables(env_file='../.env'):
    with open(env_file, 'r') as file:
        for line in file:
            line = line.strip()
            # Skip comments and lines without an equals sign
            if '=' in line and not line.startswith('#'):
                key, value = line.split('=', 1)
                # Drop stray whitespace and surrounding quotes around the value
                os.environ[key.strip()] = value.strip().strip('"\'')

# Load environment variables
load_env_variables()

In [31]:
# Create an S3 client
s3 = boto3.client('s3')

# Specify path to your file within S3 bucket
bucket_name = os.getenv('BUCKET_NAME')
object_key = os.getenv('OBJECT_KEY')

# Get the object from S3
csv_obj = s3.get_object(Bucket=bucket_name, Key=object_key)

# Get the body of the object (the file content)
body = csv_obj['Body']

# Read the body into a string
csv_string = body.read().decode('utf-8')

# Use pandas to read the CSV string into a DataFrame
aig_df = pd.read_csv(StringIO(csv_string))

aig_df.head()


Unnamed: 0,OpCo,OpCo Name,Subsidiary,Subsidiary Name,Departure Airport,Departure Airport Name,Departure Country,Departure Country Name,Departure Region,Arrival Airport,Arrival Airport Name,Arrival Country,Arrival Country Name,Arrival Region,Aircraft type,Date,Cabin,Service,# Passengers,# Flights
0,VY,Vueling,,,BIO,BILBAO,ES,Spain,Europe/Domestic,BCN,BARCELONA,ES,Spain,Europe/Domestic,A320,2019-07-02,Economy,Non-Premium,220,9
1,VY,Vueling,,,BCN,BARCELONA,ES,Spain,Europe/Domestic,BIO,BILBAO,ES,Spain,Europe/Domestic,A320,2019-07-02,Economy,Non-Premium,503,6
2,VY,Vueling,,,BIO,BILBAO,ES,Spain,Europe/Domestic,PMI,PALMA,ES,Spain,Europe/Domestic,A320,2019-07-02,Economy,Non-Premium,188,11
3,VY,Vueling,,,PMI,PALMA,ES,Spain,Europe/Domestic,BIO,BILBAO,ES,Spain,Europe/Domestic,A320,2019-07-02,Economy,Non-Premium,405,8
4,VY,Vueling,,,BIO,BILBAO,ES,Spain,Europe/Domestic,LIS,LISBON,PT,Portugal,Europe,A320,2019-07-02,Economy,Non-Premium,152,20


## Basic Data Exploration

---

In [32]:
# Dataframe info
aig_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3460 entries, 0 to 3459
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   OpCo                    3460 non-null   object
 1   OpCo Name               3460 non-null   object
 2   Subsidiary              1971 non-null   object
 3   Subsidiary Name         1971 non-null   object
 4   Departure Airport       3460 non-null   object
 5   Departure Airport Name  3460 non-null   object
 6   Departure Country       3460 non-null   object
 7   Departure Country Name  3460 non-null   object
 8   Departure Region        3460 non-null   object
 9   Arrival Airport         3460 non-null   object
 10  Arrival Airport Name    3460 non-null   object
 11  Arrival Country         3460 non-null   object
 12  Arrival Country Name    3460 non-null   object
 13  Arrival Region          3460 non-null   object
 14  Aircraft type           3238 non-null   object
 15  Date

### **Dataframe Description**

This dataframe holds flight data for a group of airlines (AIG), spanning 3460 records and 20 variables. It records the operating companies, their subsidiaries, and the associated names. It also contains geographical details: departure and arrival airports, countries, and regions. Finally, it covers the aircraft used, flight dates, cabin classes, and service types, alongside the number of passengers and flights.

| **Variable**             | **Description**           | **DataType** | **Null Value Count** |
|--------------------------|---------------------------|--------------|----------------------|
| `OpCo`                   | Operating Company Code    | object       | 0                    |
| `OpCo Name`              | Operating Company Name    | object       | 0                    |
| `Subsidiary`             | Subsidiary Company Code   | object       | 1489                 |
| `Subsidiary Name`        | Subsidiary Company Name   | object       | 1489                 |
| `Departure Airport`      | Departure Airport Code    | object       | 0                    |
| `Departure Airport Name` | Name of Departure Airport | object       | 0                    |
| `Departure Country`      | Departure Country Code    | object       | 0                    |
| `Departure Country Name` | Name of Departure Country | object       | 0                    |
| `Departure Region`       | Departure Region Name     | object       | 0                    |
| `Arrival Airport`        | Arrival Airport Code      | object       | 0                    |
| `Arrival Airport Name`   | Name of Arrival Airport   | object       | 0                    |
| `Arrival Country`        | Arrival Country Code      | object       | 0                    |
| `Arrival Country Name`   | Name of Arrival Country   | object       | 0                    |
| `Arrival Region`         | Arrival Region Name       | object       | 0                    |
| `Aircraft type`          | Type of Aircraft          | object       | 222                  |
| `Date`                   | Date of Flight            | object       | 0                    |
| `Cabin`                  | Cabin Type                | object       | 0                    |
| `Service`                | Type of Service           | object       | 0                    |
| `# Passengers`           | Number of Passengers      | int64        | 0                    |
| `# Flights`              | Number of Flights         | int64        | 0                    |

In [33]:
# Check for unique values per column (only categoricals)
for column in aig_df.select_dtypes(include=['object']).columns:
    print(column)
    print(f"{aig_df[column].unique()}\n")

OpCo
['VY' 'BA' 'LV' 'IB' 'EI' 'I2']

OpCo Name
['Vueling' 'British Airways' 'Level' 'Iberia' 'Aer Lingus'
 'Iberia Express' 'Vueling+']

Subsidiary
[nan 'CJ' 'A0' 'BA' 'IBC']

Subsidiary Name
[nan 'Cityflyer' 'Euroflyer' 'British Airways' 'Iberia Airline']

Departure Airport
['BIO' 'BCN' 'PMI' 'LIS' 'SVQ' 'FCO' 'CDG' 'JMK' 'PRG' 'MRS' 'ALG' 'NCE'
 'TLS' 'IBZ' 'CAG' 'LGW' 'ORY' 'VGO' 'MXP' 'MUC' 'MAH' 'HER' 'SPU' 'MAD'
 'AGP' 'ATH' 'OPO' 'LPA' 'LCG' 'OTP' 'DUB' 'VLC' 'ZRH' 'DME' 'TLV' 'ARN'
 'STR' 'NTE' 'CTA' 'LYS' 'ACE' 'AMS' 'BLQ' 'VCE' 'LHR' 'NBO' 'OSL' 'CPH'
 'HEL' 'PSA' 'BUD' 'ABV' 'BOS' 'SEA' 'DXB' 'ATL' 'JFK' 'KWI' 'MCO' 'TPA'
 'ANU' 'TAB' 'BGI' 'UVF' 'POS' 'PUJ' 'BDA' 'YVR' 'SFO' 'SIN' 'HKG' 'ORD'
 'LAX' 'JNB' 'NAS' 'INV' 'NCL' 'EDI' 'MAN' 'ABZ' 'GOT' 'BRU' 'HAJ' 'BHD'
 'GRU' 'BLR' 'SAN' 'GVA' 'LCA' 'FRA' 'ZAG' 'TXL' 'BSL' 'LED' 'GLA' 'DUS'
 'GIB' 'LUX' 'VIE' 'LIN' 'BLL' 'PMO' 'BOD' 'ALC' 'SZG' 'FLR' 'GRX' 'RNS'
 'LTN' 'CWL' 'FUE' 'PUY' 'EFL' 'HAM' 'IST' 'KBP' 'FAO' 'CFU' 'KEF'

#### **EDA Insight #1:** Handling Missing Subsidiary and Aircraft Type Data

Subsidiaries are exclusively associated with `Iberia`, `Iberia Express`, and `British Airways`. This lets us interpret null values in the subsidiary columns for other companies as 'No Subsidiary'. Additionally, `Aircraft type` has missing values, suggesting that the aircraft type was not specified on those records. A simple workaround for that column is to impute those values as 'Not Specified'. **Validation approach:**

In [34]:
# Airlines spotted with subsidiaries
expected_airlines = {'Iberia', 'Iberia Express', 'British Airways'}

# Filter and check the hypothesis
is_subset = set(aig_df.loc[aig_df['Subsidiary'].notnull(), 'OpCo Name'].unique()).issubset(expected_airlines)

print(f"Only Iberia, Iberia Express, and British Airways have subsidiaries: {is_subset}")

Only Iberia, Iberia Express, and British Airways have subsidiaries: True
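
With the hypothesis confirmed, the imputation described above could be sketched as follows. This is a minimal illustration on a hypothetical three-row sample (`sample_df`), not the cleaning step actually applied to `aig_df`; the placeholder labels are the ones proposed in the text.

```python
import pandas as pd

# Hypothetical sample mirroring the relevant columns of aig_df
sample_df = pd.DataFrame({
    'OpCo Name': ['Vueling', 'British Airways', 'Iberia'],
    'Subsidiary': [None, 'CJ', 'IBC'],
    'Subsidiary Name': [None, 'Cityflyer', 'Iberia Airline'],
    'Aircraft type': ['A320', None, 'A320'],
})

# One placeholder label per column, as proposed in the insight above
fill_values = {
    'Subsidiary': 'No Subsidiary',
    'Subsidiary Name': 'No Subsidiary',
    'Aircraft type': 'Not Specified',
}

# fillna with a dict only touches the listed columns
imputed_df = sample_df.fillna(value=fill_values)
print(imputed_df)
```

Passing a dictionary to `fillna` keeps the two placeholders separate, so a missing aircraft type is never labelled 'No Subsidiary' by accident.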


#### **EDA Insight #2:** Consolidating Vueling Entities

The review reveals no substantial evidence distinguishing `Vueling+` from `Vueling`, and with a mere 7 records attributed to `Vueling+`, I will streamline the analysis by consolidating these entries under `Vueling`. This approach ensures consistency and simplifies the dataset for more coherent insights. **Validation approach:**

In [35]:
# Get the number of Vueling+ records
print(f"Records attributed as 'Vueling+': {len(aig_df[aig_df['OpCo Name'] == 'Vueling+'])}")

Records attributed as 'Vueling+': 7
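
The consolidation itself could look like the sketch below. It runs on a small hypothetical sample (`sample_df`) rather than the full dataset, and assumes an exact-match relabelling of `OpCo Name` is all that is needed.

```python
import pandas as pd

# Hypothetical sample with one 'Vueling+' record
sample_df = pd.DataFrame({
    'OpCo Name': ['Vueling', 'Vueling+', 'British Airways'],
})

# Series.replace with a dict performs exact-match substitution (no regex),
# so the '+' character needs no escaping
sample_df['OpCo Name'] = sample_df['OpCo Name'].replace({'Vueling+': 'Vueling'})

print(sample_df['OpCo Name'].value_counts())
```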


#### **EDA Insight #3:** Special Characters in Country Names

I've identified issues with country names, such as unwanted special characters (e.g., 'Bulgari§') and trailing whitespace, which can introduce inconsistencies in the analysis. To address this, I will apply a general regex pattern that removes non-standard characters and excess spaces from both departure and arrival country names, giving a cleaner dataset for analysis. **Validation approach:**

In [36]:
# Regex expression
regex = r'[^a-zA-Z0-9\s\.,-]'

# Identify and filter invalid records in a single step
invalid_rows_df = aig_df[aig_df[['Departure Country Name', 'Arrival Country Name']]\
                         .apply(lambda x: x.str.contains(regex)).any(axis=1)]

invalid_rows_df[['Departure Country Name', 'Arrival Country Name']]

Unnamed: 0,Departure Country Name,Arrival Country Name
588,United Kingdom,Bulgari§
589,Bulgari§,United Kingdom
2239,Ireland,Bulgari§
2240,Bulgari§,Ireland
2845,United Kingdom,Bulgari§
2846,Bulgari§,United Kingdom
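
A possible cleaning step, reusing the same pattern, is sketched below on a hypothetical sample. Note that stripping the invalid character turns 'Bulgari§' into 'Bulgari', not a valid country name, so the sketch adds an explicit correction. That mapping rests on the assumption that the intended value is 'Bulgaria'.

```python
import pandas as pd

# Same pattern as the validation cell: anything other than letters, digits,
# whitespace, '.', ',' and '-' counts as invalid
regex = r'[^a-zA-Z0-9\s\.,-]'

# Hypothetical sample containing the corrupted value and a trailing space
sample_df = pd.DataFrame({
    'Departure Country Name': ['United Kingdom', 'Bulgari§', 'Spain '],
    'Arrival Country Name': ['Bulgari§', 'Ireland', 'Portugal'],
})

# Assumption: the truncated name left after cleaning should read 'Bulgaria'
corrections = {'Bulgari': 'Bulgaria'}

for col in ['Departure Country Name', 'Arrival Country Name']:
    sample_df[col] = (
        sample_df[col]
        .str.replace(regex, '', regex=True)  # drop invalid characters
        .str.strip()                         # drop leading/trailing whitespace
        .replace(corrections)                # repair names damaged by the removal
    )

print(sample_df)
```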


#### **EDA Insight #4:** Singular Date Observation

The dataset reveals the presence of a single unique date value. This insight has implications for how we might structure our data storage, particularly for raw data, which could be organized by date in directories like 'data/raw/day=20190702'. However, given this uniformity in date values, a straightforward conversion to a datetime format should suffice for this column. **Validation approach:**

In [37]:
print(f"Unique values for 'Date' column: {aig_df['Date'].unique()}\n")

Unique values for 'Date' column: ['2019-07-02']
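
The conversion and the date-based directory layout mentioned above could be sketched like this. The sample frame and the `partition_dirs` name are illustrative; only the 'data/raw/day=YYYYMMDD' layout comes from the text.

```python
import pandas as pd

# Hypothetical sample with the single date found in the dataset
sample_df = pd.DataFrame({'Date': ['2019-07-02', '2019-07-02']})

# Parse the strings with an explicit format so malformed dates raise an error
sample_df['Date'] = pd.to_datetime(sample_df['Date'], format='%Y-%m-%d')

# Derive one partition directory per distinct day
days = sample_df['Date'].dt.strftime('%Y%m%d').unique()
partition_dirs = [f'data/raw/day={day}' for day in days]

print(partition_dirs)
```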



#### **EDA Insight #5:** Distinction Between Iberia and Iberia Express

Our analysis identifies separate entries for `Iberia` and `Iberia Express`. Given domain knowledge and corroborated by online research, we recognize `Iberia Express` as the budget-friendly counterpart within the Iberia Group. This distinction justifies treating them as independent entities.

#### **EDA Insight #6:** Column Naming Convention

The current column names contain whitespace, which may pose challenges, particularly when integrating with data warehousing solutions that favor a different naming convention. Adopting a standardized approach to column naming enhances compatibility and keeps our data models consistent.

**NOTE:** Specific naming requirements may exist within broader workflows. For the purposes of this project, we will proceed with renaming the columns to adhere to best practices.
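
One way to do the renaming is a small snake_case helper, sketched below. The function name `to_snake_case` and the choice to spell `#` as `num` are assumptions made for illustration; the notebook does not prescribe a target convention.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a column label such as 'OpCo Name' to 'opco_name'."""
    # Assumption: '#' stands for "number of" and is spelled out as 'num'
    name = name.replace('#', 'num')
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r'[^0-9a-zA-Z]+', '_', name)
    return name.strip('_').lower()

columns = ['OpCo Name', 'Departure Airport Name', 'Aircraft type', '# Passengers']
renamed = [to_snake_case(c) for c in columns]
print(renamed)
# ['opco_name', 'departure_airport_name', 'aircraft_type', 'num_passengers']
```

Because `DataFrame.rename` accepts a function for its `columns` argument, the helper could be applied in one call, e.g. `aig_df.rename(columns=to_snake_case)`.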

### **Extra Step:** Evaluating the Distribution of Numerical Attributes

This step checks the integrity and consistency of the numerical data. By examining the distribution of these attributes, I aim to identify potential anomalies, such as outliers or irregular values, that could distort the analysis.

In [38]:
# Check for distributions per numerical column
aig_df[['# Passengers', '# Flights']].describe()

Unnamed: 0,# Passengers,# Flights
count,3460.0,3460.0
mean,545.646243,10.994798
std,256.591472,5.373581
min,100.0,2.0
25%,327.75,6.0
50%,545.5,11.0
75%,757.0,16.0
max,1000.0,20.0
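
To go beyond eyeballing the summary, an outlier check could be sketched with the common 1.5 × IQR rule. This rule is a choice made here for illustration, not something the notebook applies, and the sample series deliberately includes one hypothetical extreme value (5000) to show the mask at work.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical sample; 5000 is injected on purpose as an extreme value
sample = pd.Series([100, 328, 545, 757, 1000, 5000], name='# Passengers')
mask = iqr_outlier_mask(sample)

print(sample[mask])
```

Applied to the quartiles reported above for `# Passengers` (327.75 and 757.0), the fences would be -316.125 and 1400.875, so the observed minimum of 100 and maximum of 1000 both fall inside them.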


In [39]:
# Save the DataFrame to a CSV file
aig_df.to_csv('../data/raw/Sample_Data.csv', index=False)