# Data Exploration - Initial Summary

This notebook loads the CSV files in the data folder (airline, airport, lounge, and seat reviews) and gives an initial summary of each dataset before deeper exploration.

In [2]:
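# Import required libraries
import pandas as pd
import numpy as np
import os
from pathlib import Path
import warnings
# Explicit import so cells that call display() also run outside Jupyter
from IPython.display import display

# Suppress all warnings to keep the output readable (this also hides pandas parsing warnings)
warnings.filterwarnings('ignore')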

In [3]:
# Set up data directory path
data_dir = Path('../data')
print(f"Data directory: {data_dir.absolute()}")
print("\nCSV files found:")
# Sort so the listing and load order are deterministic (glob order is not guaranteed)
csv_files = sorted(data_dir.glob('*.csv'))
for file in csv_files:
    print(f"  - {file.name}")

Data directory: c:\Users\easan.s.2024\Documents\GitHub\CS201-G4T6\TopK-Airlines\EDA\..\data

CSV files found:
  - airline.csv
  - airport.csv
  - lounge.csv
  - seat.csv


In [4]:
# Load all CSV files into a dictionary
dataframes = {}
for csv_file in csv_files:
    df_name = csv_file.stem  # Get filename without extension
    dataframes[df_name] = pd.read_csv(csv_file)
    print(f"Loaded: {df_name}")

Loaded: airline
Loaded: airport
Loaded: lounge
Loaded: seat


## Dataset Summaries

For each dataset, we look at its shape, column data types, memory usage, and missing-value counts.
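
As a complement to printing each dataset separately, the same facts can be gathered into one comparison table. The sketch below is only an illustration under stated assumptions: `summarize_frames` is a hypothetical helper that is not part of this notebook, and the two tiny frames stand in for the real `dataframes` dictionary.

```python
import pandas as pd


def summarize_frames(frames):
    """Return one row per dataset: rows, columns, memory (MB), and missing cells.

    `frames` is assumed to be a dict mapping dataset name -> DataFrame,
    shaped like the `dataframes` dictionary built earlier in the notebook.
    """
    rows = []
    for name, df in frames.items():
        rows.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "memory_mb": df.memory_usage(deep=True).sum() / 1024**2,
            "missing_cells": int(df.isnull().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")


# Toy stand-ins for the real data (illustrative values only)
toy_frames = {
    "a": pd.DataFrame({"x": [1.0, None, 3.0], "y": ["p", "q", None]}),
    "b": pd.DataFrame({"x": [1, 2]}),
}
summary = summarize_frames(toy_frames)
print(summary)
```

With the real data, `summarize_frames(dataframes)` would give a single table that is easier to compare across datasets than four separate printouts.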

In [5]:
# Display summary for each dataset
for name, df in dataframes.items():
    print("=" * 80)
    print(f"Dataset: {name}")
    print("=" * 80)
    print(f"\nShape: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"\nColumns and Data Types:")
    print(df.dtypes)
    print(f"\nMemory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"\nMissing Values:")
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print(missing[missing > 0])
    else:
        print("No missing values")
    print("\n")

Dataset: airline

Shape: 41396 rows × 20 columns

Columns and Data Types:
airline_name                      object
link                              object
title                             object
author                            object
author_country                    object
date                              object
content                           object
aircraft                          object
type_traveller                    object
cabin_flown                       object
route                             object
overall_rating                   float64
seat_comfort_rating              float64
cabin_staff_rating               float64
food_beverages_rating            float64
inflight_entertainment_rating    float64
ground_service_rating            float64
wifi_connectivity_rating         float64
value_money_rating               float64
recommended                        int64
dtype: object

Memory Usage: 52.89 MB

Missing Values:
author_country                    1591
aircraft    

In [6]:
# Display first few rows and descriptive statistics for each dataset
for name, df in dataframes.items():
    print("=" * 80)
    print(f"Dataset: {name} - First 5 Rows")
    print("=" * 80)
    display(df.head())
    
    print(f"\nDescriptive Statistics for {name}:")
    print("=" * 80)
    display(df.describe(include='all'))
    print("\n\n")

Dataset: airline - First 5 Rows


Unnamed: 0,airline_name,link,title,author,author_country,date,content,aircraft,type_traveller,cabin_flown,route,overall_rating,seat_comfort_rating,cabin_staff_rating,food_beverages_rating,inflight_entertainment_rating,ground_service_rating,wifi_connectivity_rating,value_money_rating,recommended
0,adria-airways,/airline-reviews/adria-airways,Adria Airways customer review,D Ito,Germany,2015-04-10,Outbound flight FRA/PRN A319. 2 hours 10 min f...,,,Economy,,7.0,4.0,4.0,4.0,0.0,,,4.0,1
1,adria-airways,/airline-reviews/adria-airways,Adria Airways customer review,Ron Kuhlmann,United States,2015-01-05,Two short hops ZRH-LJU and LJU-VIE. Very fast ...,,,Business Class,,10.0,4.0,5.0,4.0,1.0,,,5.0,1
2,adria-airways,/airline-reviews/adria-airways,Adria Airways customer review,E Albin,Switzerland,2014-09-14,Flew Zurich-Ljubljana on JP365 newish CRJ900. ...,,,Economy,,9.0,5.0,5.0,4.0,0.0,,,5.0,1
3,adria-airways,/airline-reviews/adria-airways,Adria Airways customer review,Tercon Bojan,Singapore,2014-09-06,Adria serves this 100 min flight from Ljubljan...,,,Business Class,,8.0,4.0,4.0,3.0,1.0,,,4.0,1
4,adria-airways,/airline-reviews/adria-airways,Adria Airways customer review,L James,Poland,2014-06-16,WAW-SKJ Economy. No free snacks or drinks on t...,,,Economy,,4.0,4.0,2.0,1.0,2.0,,,2.0,0



Descriptive Statistics for airline:


Unnamed: 0,airline_name,link,title,author,author_country,date,content,aircraft,type_traveller,cabin_flown,route,overall_rating,seat_comfort_rating,cabin_staff_rating,food_beverages_rating,inflight_entertainment_rating,ground_service_rating,wifi_connectivity_rating,value_money_rating,recommended
count,41396,41396,41396,41396,39805,41396,41396,1278,2378,38520,2341,36861.0,33706.0,33708.0,33264.0,31114.0,2203.0,565.0,39723.0,41396.0
unique,362,362,362,29652,158,2368,41362,363,4,4,2208,,,,,,,,,
top,spirit-airlines,/airline-reviews/spirit-airlines,Spirit Airlines customer review,Anders Pedersen,United Kingdom,2015-01-19,If you experience any problems submitting comm...,A320,Solo Leisure,Economy,Toronto to Las Vegas,,,,,,,,,
freq,990,990,990,58,9969,301,5,132,804,29784,4,,,,,,,,,
mean,,,,,,,,,,,,6.039527,3.094612,3.319212,2.805886,2.392364,2.736723,2.249558,3.164111,0.53382
std,,,,,,,,,,,,3.21468,1.405515,1.541307,1.580246,1.704753,1.569073,1.541283,1.523486,0.498861
min,,,,,,,,,,,,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,,,,,,,,,,,,3.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,0.0
50%,,,,,,,,,,,,7.0,3.0,4.0,3.0,2.0,3.0,1.0,4.0,1.0
75%,,,,,,,,,,,,9.0,4.0,5.0,4.0,4.0,4.0,4.0,4.0,1.0





Dataset: airport - First 5 Rows


Unnamed: 0,airport_name,link,title,author,author_country,date,content,experience_airport,date_visit,type_traveller,overall_rating,queuing_rating,terminal_cleanliness_rating,terminal_seating_rating,terminal_signs_rating,food_beverages_rating,airport_shopping_rating,wifi_connectivity_rating,airport_staff_rating,recommended
0,aalborg-airport,/airport-reviews/aalborg-airport,Aalborg Airport customer review,Klaus Malling,Denmark,2014-02-11,A small very effective airport with few flight...,,,,9.0,5.0,5.0,,,,4.0,,,1
1,aalborg-airport,/airport-reviews/aalborg-airport,Aalborg Airport customer review,S Kroes,Netherlands,2013-02-13,This is a nice and modern airport at the momen...,,,,9.0,5.0,4.0,,,,4.0,,,1
2,aalborg-airport,/airport-reviews/aalborg-airport,Aalborg Airport customer review,M Andersen,Denmark,2012-08-07,A very nice airy terminal - that seems modern ...,,,,9.0,5.0,5.0,,,,4.0,,,1
3,aalborg-airport,/airport-reviews/aalborg-airport,Aalborg Airport customer review,Paul Van Alsten,France,2011-05-22,AMS-AAL and quite satisfied with this regional...,,,,5.0,5.0,5.0,,,,3.0,,,0
4,aalborg-airport,/airport-reviews/aalborg-airport,Aalborg Airport customer review,K Fischer,,2010-08-04,Very quick check-inn and security screening. N...,,,,4.0,,,,,,,,,0



Descriptive Statistics for airport:


Unnamed: 0,airport_name,link,title,author,author_country,date,content,experience_airport,date_visit,type_traveller,overall_rating,queuing_rating,terminal_cleanliness_rating,terminal_seating_rating,terminal_signs_rating,food_beverages_rating,airport_shopping_rating,wifi_connectivity_rating,airport_staff_rating,recommended
count,17721,17721,17721,17721,12777,17721,17721,647,593,646,13796.0,12813.0,12815.0,587.0,27.0,630.0,12676.0,412.0,26.0,17721.0
unique,741,741,741,11834,116,2375,17697,4,116,4,,,,,,,,,,
top,london-heathrow-airport,/airport-reviews/london-heathrow-airport,London Heathrow Airport customer review,S Koenig,United Kingdom,2014-09-01,Switching from F to E concourse was a mess. Th...,Arrival and Departure,21-07-2015,Solo Leisure,,,,,,,,,,
freq,520,520,520,64,5049,68,2,326,17,236,,,,,,,,,,
mean,,,,,,,,,,,4.274355,2.747912,3.44245,2.58092,2.592593,2.169841,2.821631,2.40534,2.038462,0.221206
std,,,,,,,,,,,2.722765,1.57252,1.337508,1.403862,1.393923,1.534358,1.410575,1.579452,1.248384,0.415071
min,,,,,,,,,,,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
25%,,,,,,,,,,,2.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0
50%,,,,,,,,,,,4.0,3.0,3.0,2.0,3.0,2.0,3.0,2.0,1.5,0.0
75%,,,,,,,,,,,6.0,4.0,5.0,4.0,4.0,3.0,4.0,4.0,3.0,0.0





Dataset: lounge - First 5 Rows


Unnamed: 0,airline_name,link,title,author,author_country,date,content,lounge_name,airport,lounge_type,...,type_traveller,overall_rating,comfort_rating,cleanliness_rating,bar_beverages_rating,catering_rating,washrooms_rating,wifi_connectivity_rating,staff_service_rating,recommended
0,adria-airways,/lounge-reviews/adria-airways,Adria Airways Business Class Lounge,R Deu,Spain,2014-03-09,There are 2 separate areas with arm chairs and...,ADRIA AIRWAYS BUSINESS CLASS LOUNGE REVIEW,,Business Class,...,,3.0,3,4,3.0,3.0,3.0,2.0,2.0,1
1,aegean-airlines,/lounge-reviews/aegean-airlines,Business Class - Larnaca Airport,Andreas Kar,Germany,2015-06-13,"It is a small, dark, windowless lounge located...",Business Class,Larnaca Airport,Business Class,...,Solo Leisure,5.0,4,4,3.0,3.0,3.0,4.0,4.0,1
2,aegean-airlines,/lounge-reviews/aegean-airlines,Aegean Airlines Business Class Lounge - Athens...,A Diakomichalis,Greece,2014-10-05,Both of the lounges have the same food (sandwi...,AEGEAN AIRLINES BUSINESS CLASS LOUNGE REVIEW,Athens Airport,Business Class,...,,5.0,5,5,5.0,4.0,4.0,5.0,5.0,1
3,aegean-airlines,/lounge-reviews/aegean-airlines,Aegean Airlines Business Class Lounge - Athens...,R Deu,Spain,2014-03-11,The lounge was clean and the decor is up to da...,AEGEAN AIRLINES BUSINESS CLASS LOUNGE REVIEW,Athens Airport,Business Class,...,,2.0,3,3,2.0,1.0,2.0,3.0,2.0,0
4,aegean-airlines,/lounge-reviews/aegean-airlines,Aegean Airlines Business Class Lounge - Thessa...,Petros Papadopoulos,United Kingdom,2012-07-24,Big and spacious lounge. Comfy sofas and a nic...,AEGEAN AIRLINES BUSINESS CLASS LOUNGE REVIEW,Thessaloniki Airport,Business Class,...,,5.0,5,5,4.0,4.0,4.0,5.0,5.0,1



Descriptive Statistics for lounge:


Unnamed: 0,airline_name,link,title,author,author_country,date,content,lounge_name,airport,lounge_type,...,type_traveller,overall_rating,comfort_rating,cleanliness_rating,bar_beverages_rating,catering_rating,washrooms_rating,wifi_connectivity_rating,staff_service_rating,recommended
count,2264,2264,2264,2264,1783,2264,2264,2261,2170,1964,...,119,2259.0,2264.0,2264.0,2259.0,2261.0,2238.0,2253.0,2255.0,2264.0
unique,97,97,1351,1598,68,410,2261,793,201,4,...,4,,,,,,,,,
top,british-airways,/lounge-reviews/british-airways,Turkish Airlines Business Class Lounge - Istan...,C Wajsberg,United Kingdom,2013-07-07,We were booked into a United BusinessFirst cab...,EMIRATES BUSINESS CLASS LOUNGE REVIEW,London Heathrow Airport,Business Class,...,Business,,,,,,,,,
freq,242,242,34,17,471,49,3,63,203,1700,...,41,,,,,,,,,
mean,,,,,,,,,,,...,,3.365649,3.341873,3.658569,3.231076,2.781955,3.023235,3.285397,3.253215,0.360424
std,,,,,,,,,,,...,,1.485086,1.322064,1.252673,1.385324,1.476588,1.483449,1.515094,1.435761,0.48023
min,,,,,,,,,,,...,,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,,,,,,,,,,...,,2.0,2.0,3.0,2.0,1.0,2.0,2.0,2.0,0.0
50%,,,,,,,,,,,...,,3.0,4.0,4.0,3.0,3.0,3.0,4.0,3.0,0.0
75%,,,,,,,,,,,...,,4.0,4.0,5.0,4.0,4.0,4.0,4.0,5.0,1.0





Dataset: seat - First 5 Rows


Unnamed: 0,airline_name,link,title,author,author_country,date,content,aircraft,seat_layout,date_flown,...,type_traveller,overall_rating,seat_legroom_rating,seat_recline_rating,seat_width_rating,aisle_space_rating,viewing_tv_rating,power_supply_rating,seat_storage_rating,recommended
0,aegean-airlines,/seat-reviews/aegean-airlines,Aegean Airlines customer review,Jay Simpson,United Kingdom,2015-07-20,LHR to Larnaca return. Plane was clean and in ...,A320-200,3x3,19-07-2015,...,Solo Leisure,10.0,4,4,4,5,4.0,,4.0,1
1,aegean-airlines,/seat-reviews/aegean-airlines,Aegean Airlines customer review,Paul Staples,United Kingdom,2013-01-21,For a short haul airline the seats are very go...,AIRBUS A320,3x3,,...,,9.0,4,4,4,4,4.0,,,1
2,aer-lingus,/seat-reviews/aer-lingus,Aer Lingus customer review,L Pulliam,United States,2015-07-07,The seats are a bit tight but bearable. If you...,A330,2x4x2,06-07-2015,...,Couple Leisure,6.0,3,3,3,3,3.0,3.0,3.0,1
3,aer-lingus,/seat-reviews/aer-lingus,Aer Lingus customer review,D Brose,United States,2010-10-22,Appeared new. Good PTV entertainment. Seats ha...,Airbus A330,2x4x2,,...,,5.0,2,3,3,3,4.0,,,0
4,aeroflot-russian-airlines,/seat-reviews/aeroflot-russian-airlines,Aeroflot Russian Airlines customer review,Konstantinos Grimpilakos,Greece,2015-08-02,Boeing 737-800 seats from Athens to Moscow are...,Boeing 737-800,3x3,01-07-2015,...,Business,1.0,1,1,1,2,1.0,5.0,1.0,1



Descriptive Statistics for seat:


Unnamed: 0,airline_name,link,title,author,author_country,date,content,aircraft,seat_layout,date_flown,...,type_traveller,overall_rating,seat_legroom_rating,seat_recline_rating,seat_width_rating,aisle_space_rating,viewing_tv_rating,power_supply_rating,seat_storage_rating,recommended
count,1258,1258,1258,1258,1250,1258,1258,1258,1252,113,...,118,1257.0,1258.0,1258.0,1258.0,1258.0,1229.0,62.0,113.0,1258.0
unique,97,97,97,1147,59,282,1246,156,35,73,...,4,,,,,,,,,
top,british-airways,/seat-reviews/british-airways,British Airways customer review,J Wong,United Kingdom,2014-11-19,AF seats narrow and hard and it was very diffi...,BOEING 747-400,2x4x2,16-06-2015,...,Solo Leisure,,,,,,,,,
freq,86,86,86,6,348,74,2,105,391,5,...,47,,,,,,,,,
mean,,,,,,,,,,,...,,4.318218,2.753577,2.627186,2.717011,2.730525,2.872254,3.774194,3.070796,0.36407
std,,,,,,,,,,,...,,3.041998,1.446915,1.267095,1.271828,1.343137,1.454787,1.31098,1.334413,0.48136
min,,,,,,,,,,,...,,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0
25%,,,,,,,,,,,...,,1.0,1.0,1.0,2.0,1.0,2.0,3.0,2.0,0.0
50%,,,,,,,,,,,...,,4.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,0.0
75%,,,,,,,,,,,...,,7.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,1.0







## Deep Dive: Multi-Dataset Analysis

Let's analyze all four datasets in more depth, covering their basic properties, distributions, text lengths, and missing-data patterns, to identify potential computational challenges.

In [7]:
# Overview of all datasets
print("=" * 80)
print("ALL DATASETS OVERVIEW")
print("=" * 80)
for name, df in dataframes.items():
    print(f"\n{name}:")
    print(f"  Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Memory: {df.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

ALL DATASETS OVERVIEW

airline:
  Shape: 41,396 rows × 20 columns
  Columns: ['airline_name', 'link', 'title', 'author', 'author_country', 'date', 'content', 'aircraft', 'type_traveller', 'cabin_flown', 'route', 'overall_rating', 'seat_comfort_rating', 'cabin_staff_rating', 'food_beverages_rating', 'inflight_entertainment_rating', 'ground_service_rating', 'wifi_connectivity_rating', 'value_money_rating', 'recommended']
  Memory: 52.89 MB

airport:
  Shape: 17,721 rows × 20 columns
  Columns: ['airport_name', 'link', 'title', 'author', 'author_country', 'date', 'content', 'experience_airport', 'date_visit', 'type_traveller', 'overall_rating', 'queuing_rating', 'terminal_cleanliness_rating', 'terminal_seating_rating', 'terminal_signs_rating', 'food_beverages_rating', 'airport_shopping_rating', 'wifi_connectivity_rating', 'airport_staff_rating', 'recommended']
  Memory: 21.90 MB

lounge:
  Shape: 2,264 rows × 21 columns
  Columns: ['airline_name', 'link', 'title', 'author', 'author_countr

### 1. Basic Dataset Properties for All Datasets

In [8]:
# Detailed properties for each dataset
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name}")
    print("=" * 80)
    print(f"\nTotal Records: {df.shape[0]:,}")
    print(f"Total Columns: {df.shape[1]}")
    print(f"\nColumn Names and Data Types:")
    print("-" * 80)
    for col, dtype in df.dtypes.items():
        null_count = df[col].isnull().sum()
        null_pct = (null_count / len(df)) * 100
        print(f"{col:30} | {str(dtype):15} | Nulls: {null_count:7,} ({null_pct:5.2f}%)")
    print("\n")

DATASET: airline

Total Records: 41,396
Total Columns: 20

Column Names and Data Types:
--------------------------------------------------------------------------------
airline_name                   | object          | Nulls:       0 ( 0.00%)
link                           | object          | Nulls:       0 ( 0.00%)
title                          | object          | Nulls:       0 ( 0.00%)
author                         | object          | Nulls:       0 ( 0.00%)
author_country                 | object          | Nulls:   1,591 ( 3.84%)
date                           | object          | Nulls:       0 ( 0.00%)
content                        | object          | Nulls:       0 ( 0.00%)
aircraft                       | object          | Nulls:  40,118 (96.91%)
type_traveller                 | object          | Nulls:  39,018 (94.26%)
cabin_flown                    | object          | Nulls:   2,876 ( 6.95%)
route                          | object          | Nulls:  39,055 (94.34%)
overal

### 2. Distribution Analysis Across All Datasets

In [9]:
# Distribution by airline (if applicable)
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name} - AIRLINE DISTRIBUTION")
    print("=" * 80)
    
    airline_col = [col for col in df.columns if 'airline' in col.lower()]
    if airline_col:
        airline_counts = df[airline_col[0]].value_counts()
        print(f"\nTotal Airlines: {airline_counts.shape[0]:,}")
        print(f"\nTop 10 Airlines by Count:")
        print("-" * 80)
        display(airline_counts.head(10))
    else:
        print("No airline column found in this dataset")
    print("\n")

DATASET: airline - AIRLINE DISTRIBUTION

Total Airlines: 362

Top 10 Airlines by Count:
--------------------------------------------------------------------------------


airline_name
spirit-airlines      990
british-airways      901
united-airlines      840
jet-airways          727
air-canada-rouge     715
emirates             691
ryanair              658
american-airlines    612
lufthansa            600
qantas-airways       580
Name: count, dtype: int64



DATASET: airport - AIRLINE DISTRIBUTION
No airline column found in this dataset


DATASET: lounge - AIRLINE DISTRIBUTION

Total Airlines: 97

Top 10 Airlines by Count:
--------------------------------------------------------------------------------


airline_name
british-airways       242
united-airlines       135
emirates              131
qantas-airways        116
lufthansa             100
thai-airways           79
qatar-airways          79
american-airlines      73
malaysia-airlines      73
singapore-airlines     67
Name: count, dtype: int64



DATASET: seat - AIRLINE DISTRIBUTION

Total Airlines: 97

Top 10 Airlines by Count:
--------------------------------------------------------------------------------


airline_name
british-airways            86
emirates                   84
cathay-pacific-airways     77
virgin-atlantic-airways    77
lufthansa                  66
qantas-airways             64
air-france                 64
singapore-airlines         54
etihad-airways             46
american-airlines          45
Name: count, dtype: int64





In [10]:
# Distribution by ratings/scores
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name} - RATING/SCORE DISTRIBUTION")
    print("=" * 80)
    
    rating_col = [col for col in df.columns if 'rating' in col.lower() or 'score' in col.lower() or 'stars' in col.lower()]
    if rating_col:
        for col in rating_col[:3]:  # Show first 3 rating columns
            print(f"\n{col}:")
            print("-" * 40)
            display(df[col].value_counts().sort_index())
    else:
        print("No rating/score columns found in this dataset")
    print("\n")

DATASET: airline - RATING/SCORE DISTRIBUTION

overall_rating:
----------------------------------------


overall_rating
1.0     5390
2.0     2996
3.0     2375
4.0     1810
5.0     2538
6.0     1814
7.0     3336
8.0     5329
9.0     5412
10.0    5861
Name: count, dtype: int64


seat_comfort_rating:
----------------------------------------


seat_comfort_rating
0.0     696
1.0    5970
2.0    4060
3.0    7405
4.0    9873
5.0    5702
Name: count, dtype: int64


cabin_staff_rating:
----------------------------------------


cabin_staff_rating
0.0      481
1.0     6442
2.0     3638
3.0     4947
4.0     7675
5.0    10525
Name: count, dtype: int64



DATASET: airport - RATING/SCORE DISTRIBUTION

overall_rating:
----------------------------------------


overall_rating
1.0     2315
2.0     2045
3.0     2041
4.0     2174
5.0     1628
6.0      403
7.0      697
8.0      916
9.0      790
10.0     787
Name: count, dtype: int64


queuing_rating:
----------------------------------------


queuing_rating
0.0     216
1.0    4252
2.0    1187
3.0    2673
4.0    1861
5.0    2624
Name: count, dtype: int64


terminal_cleanliness_rating:
----------------------------------------


terminal_cleanliness_rating
0.0     256
1.0    1217
2.0    1107
3.0    3843
4.0    2805
5.0    3587
Name: count, dtype: int64



DATASET: lounge - RATING/SCORE DISTRIBUTION

overall_rating:
----------------------------------------


overall_rating
1.0     172
2.0     519
3.0     537
4.0     595
5.0     369
6.0       8
7.0      11
8.0      14
9.0      20
10.0     14
Name: count, dtype: int64


comfort_rating:
----------------------------------------


comfort_rating
0      8
1    275
2    336
3    475
4    656
5    514
Name: count, dtype: int64


cleanliness_rating:
----------------------------------------


cleanliness_rating
0      8
1    170
2    239
3    477
4    646
5    724
Name: count, dtype: int64



DATASET: seat - RATING/SCORE DISTRIBUTION

overall_rating:
----------------------------------------


overall_rating
1.0     361
2.0     145
3.0     115
4.0      92
5.0      83
6.0      69
7.0      97
8.0     144
9.0     106
10.0     45
Name: count, dtype: int64


seat_legroom_rating:
----------------------------------------


seat_legroom_rating
1    369
2    211
3    230
4    257
5    191
Name: count, dtype: int64


seat_recline_rating:
----------------------------------------


seat_recline_rating
1    329
2    255
3    319
4    266
5     89
Name: count, dtype: int64
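
The counts above suggest that the rating columns do not all share one scale: `overall_rating` runs from 1 to 10, the sub-ratings shown top out at 5, and some sub-ratings include 0 while others start at 1. A quick range check per column makes such differences easy to spot before ratings are compared or averaged. This is a minimal sketch: `rating_ranges` is a hypothetical helper and the toy frame only mimics the column naming. Whether a 0 means "not rated" is not established by the output shown, so the sketch only counts zeros and does not recode them.

```python
import pandas as pd


def rating_ranges(df):
    """Summarize min, max, zero count, and non-null count for every '*rating*' column."""
    rating_cols = [col for col in df.columns if "rating" in col.lower()]
    # Coerce to numeric so stray strings become NaN instead of raising
    numeric = df[rating_cols].apply(pd.to_numeric, errors="coerce")
    return pd.DataFrame({
        "min": numeric.min(),
        "max": numeric.max(),
        "zeros": (numeric == 0).sum(),
        "non_null": numeric.notna().sum(),
    })


# Toy data (made-up values, column names borrowed from the airline dataset)
toy = pd.DataFrame({
    "overall_rating": [1.0, 10.0, 7.0, None],
    "seat_comfort_rating": [0.0, 5.0, 3.0, 4.0],
    "recommended": [0, 1, 1, 0],
})
ranges = rating_ranges(toy)
print(ranges)
```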





In [11]:
# Distribution by date/time
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name} - DATE DISTRIBUTION")
    print("=" * 80)
    
    date_col = [col for col in df.columns if 'date' in col.lower() or 'time' in col.lower() or 'year' in col.lower()]
    if date_col:
        # Parse only the first matching column; unparseable values become NaT and are dropped
        valid_dates = pd.to_datetime(df[date_col[0]], errors='coerce').dropna()
        if len(valid_dates) > 0:
            print(f"\nDate Range: {valid_dates.min()} to {valid_dates.max()}")
            print(f"\nRecords by Year:")
            print("-" * 40)
            yearly = valid_dates.dt.year.value_counts().sort_index()
            display(yearly)
        else:
            print("Could not parse dates")
    else:
        print("No date columns found in this dataset")
    print("\n")

DATASET: airline - DATE DISTRIBUTION

Date Range: 1970-01-01 00:00:00 to 2015-08-02 00:00:00

Records by Year:
----------------------------------------


date
1970        1
2002       13
2003       41
2004      120
2005      161
2006      187
2007      309
2008      431
2009      594
2010     1529
2011     3215
2012     4721
2013     8672
2014    13944
2015     7458
Name: count, dtype: int64



DATASET: airport - DATE DISTRIBUTION

Date Range: 2002-07-30 00:00:00 to 2015-08-01 00:00:00

Records by Year:
----------------------------------------


date
2002      24
2003     169
2004     352
2005     596
2006     684
2007     700
2008    1133
2009    1328
2010    1412
2011    1972
2012    2557
2013    2528
2014    2568
2015    1698
Name: count, dtype: int64



DATASET: lounge - DATE DISTRIBUTION

Date Range: 2006-05-09 00:00:00 to 2015-08-02 00:00:00

Records by Year:
----------------------------------------


date
2006      7
2007     40
2008    101
2009    173
2010    174
2011    249
2012    330
2013    417
2014    410
2015    363
Name: count, dtype: int64



DATASET: seat - DATE DISTRIBUTION

Date Range: 1970-01-01 00:00:00 to 2015-08-02 00:00:00

Records by Year:
----------------------------------------


date
1970     26
2007      4
2008     59
2009    109
2010    115
2011    165
2012     23
2013    151
2014    323
2015    283
Name: count, dtype: int64
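
The 1970-01-01 lower bound in the airline and seat date ranges stands out: every other review is dated 2002 or later, and 1970-01-01 is the Unix epoch, which often appears when a missing timestamp is stored as zero. That reading is an assumption, not something the data confirms. Under that assumption, a minimal sketch for flagging such dates and treating them as missing could look like this (`mask_epoch_dates` is a hypothetical helper and the sample values are made up):

```python
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01")


def mask_epoch_dates(dates):
    """Parse a column of date strings and replace epoch placeholders with NaT.

    Returns the cleaned datetime Series and the number of values masked.
    """
    parsed = pd.to_datetime(dates, errors="coerce")
    is_epoch = parsed == EPOCH
    return parsed.mask(is_epoch), int(is_epoch.sum())


# Made-up sample: two valid dates, one epoch placeholder, one unparseable value
sample = pd.Series(["2015-07-20", "1970-01-01", "2013-01-21", "not a date"])
cleaned, n_masked = mask_epoch_dates(sample)
print(cleaned)
print(f"Masked {n_masked} epoch placeholder(s)")
```

Masking rather than dropping keeps the rows available for analyses that do not depend on the date.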





In [12]:
# Distribution by routes/locations
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name} - ROUTE/LOCATION DISTRIBUTION")
    print("=" * 80)
    
    route_col = [col for col in df.columns if any(keyword in col.lower() for keyword in ['route', 'origin', 'destination', 'country', 'location'])]
    if route_col:
        for col in route_col[:2]:  # Show first 2 route-related columns
            print(f"\n{col}:")
            print("-" * 40)
            print(f"Unique values: {df[col].nunique():,}")
            print(f"\nTop 10 most common:")
            display(df[col].value_counts().head(10))
    else:
        print("No route/location columns found in this dataset")
    print("\n")

DATASET: airline - ROUTE/LOCATION DISTRIBUTION

author_country:
----------------------------------------
Unique values: 158

Top 10 most common:


author_country
United Kingdom    9969
United States     8507
Australia         5062
Canada            3303
Germany           1117
Singapore          661
New Zealand        638
Netherlands        547
India              543
France             517
Name: count, dtype: int64


route:
----------------------------------------
Unique values: 2,208

Top 10 most common:


route
Toronto to Las Vegas          4
Sydney to Bangkok             4
Houston to Denver             4
London Heathrow to Bangkok    3
Brisbane to Honolulu          3
Singapore to Hong Kong        3
Melbourne to Bangkok          3
SIN to KUL                    3
Manila to Bangkok             3
Denver to Chicago             3
Name: count, dtype: int64
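
The most common routes follow an "origin to destination" pattern, but they mix city names with airport codes (for example `SIN to KUL` alongside `Singapore to Hong Kong`), so the same route may be counted under several spellings. A first step toward analysing routes is to split each string into its two endpoints. The sketch below assumes the separator is always the literal word `to` surrounded by whitespace; `split_route` is a hypothetical helper, and values that do not match the pattern are left as missing.

```python
import pandas as pd


def split_route(routes):
    """Split 'A to B' strings into 'origin' and 'destination' columns.

    Missing values and strings without a ' to ' separator yield NaN in both columns.
    """
    pattern = r"^\s*(?P<origin>.+?)\s+to\s+(?P<destination>.+?)\s*$"
    return routes.str.extract(pattern)


# Sample values: two taken from the output above, plus a missing and a malformed entry
sample = pd.Series(["Toronto to Las Vegas", "SIN to KUL", None, "Unknown"])
endpoints = split_route(sample)
print(endpoints)
```

Mapping city names and airport codes to a common identifier would be a separate step and needs a lookup table that this notebook does not include.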



DATASET: airport - ROUTE/LOCATION DISTRIBUTION

author_country:
----------------------------------------
Unique values: 116

Top 10 most common:


author_country
United Kingdom    5049
United States     2069
Australia         1171
Canada             539
Germany            252
France             211
Netherlands        203
Ireland            189
Singapore          186
New Zealand        185
Name: count, dtype: int64



DATASET: lounge - ROUTE/LOCATION DISTRIBUTION

author_country:
----------------------------------------
Unique values: 68

Top 10 most common:


author_country
United Kingdom    471
Australia         316
United States     293
Canada             96
New Zealand        41
Germany            38
Brazil             35
Switzerland        32
Hong Kong          27
Thailand           26
Name: count, dtype: int64



DATASET: seat - ROUTE/LOCATION DISTRIBUTION

author_country:
----------------------------------------
Unique values: 59

Top 10 most common:


author_country
United Kingdom    348
United States     234
Australia         177
Canada             63
New Zealand        47
Germany            46
Singapore          34
Hong Kong          23
South Africa       18
Switzerland        18
Name: count, dtype: int64





### 3. Text Length Statistics Across Datasets

In [13]:
# Text length analysis for all datasets
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name} - TEXT LENGTH STATISTICS")
    print("=" * 80)
    
    text_col = [col for col in df.columns if any(keyword in col.lower() for keyword in ['review', 'text', 'comment', 'content', 'description'])]
    
    if text_col:
        for col in text_col[:2]:  # Analyze first 2 text columns
            print(f"\n{col}:")
            print("-" * 80)
            # Convert once and reuse for both character and word counts
            text = df[col].dropna().astype(str)
            text_lengths = text.str.len()
            word_counts = text.str.split().str.len()
            
            if len(text_lengths) > 0:
                print(f"Non-null entries: {len(text_lengths):,}")
                print(f"Minimum length: {text_lengths.min()} characters")
                print(f"Maximum length: {text_lengths.max()} characters")
                print(f"Average length: {text_lengths.mean():.2f} characters")
                print(f"Median length: {text_lengths.median():.2f} characters")
                print(f"\nMinimum words: {word_counts.min()}")
                print(f"Maximum words: {word_counts.max()}")
                print(f"Average words: {word_counts.mean():.2f}")
                print(f"Median words: {word_counts.median():.2f}")
                
                print(f"\nLength Distribution (percentiles):")
                display(text_lengths.describe())
    else:
        print("No text columns found in this dataset")
    print("\n")

DATASET: airline - TEXT LENGTH STATISTICS

content:
--------------------------------------------------------------------------------
Non-null entries: 41,396
Minimum length: 73 characters
Maximum length: 4901 characters
Average length: 647.43 characters
Median length: 551.00 characters

Minimum words: 12
Maximum words: 914
Average words: 116.68
Median words: 99.00

Length Distribution (percentiles):


count    41396.000000
mean       647.429462
std        401.995641
min         73.000000
25%        368.000000
50%        551.000000
75%        815.000000
max       4901.000000
Name: content, dtype: float64



DATASET: airport - TEXT LENGTH STATISTICS

content:
--------------------------------------------------------------------------------
Non-null entries: 17,721
Minimum length: 52 characters
Maximum length: 5122 characters
Average length: 645.69 characters
Median length: 550.00 characters

Minimum words: 9
Maximum words: 933
Average words: 115.57
Median words: 98.00

Length Distribution (percentiles):


count    17721.000000
mean       645.687320
std        413.573603
min         52.000000
25%        361.000000
50%        550.000000
75%        814.000000
max       5122.000000
Name: content, dtype: float64



DATASET: lounge - TEXT LENGTH STATISTICS

content:
--------------------------------------------------------------------------------
Non-null entries: 2,264
Minimum length: 33 characters
Maximum length: 3369 characters
Average length: 469.48 characters
Median length: 399.00 characters

Minimum words: 7
Maximum words: 618
Average words: 83.81
Median words: 71.00

Length Distribution (percentiles):


count    2264.000000
mean      469.484541
std       289.803457
min        33.000000
25%       269.000000
50%       399.000000
75%       598.000000
max      3369.000000
Name: content, dtype: float64



DATASET: seat - TEXT LENGTH STATISTICS

content:
--------------------------------------------------------------------------------
Non-null entries: 1,258
Minimum length: 32 characters
Maximum length: 3141 characters
Average length: 433.47 characters
Median length: 347.00 characters

Minimum words: 7
Maximum words: 594
Average words: 80.40
Median words: 64.00

Length Distribution (percentiles):


count    1258.000000
mean      433.466614
std       311.078166
min        32.000000
25%       224.250000
50%       347.000000
75%       539.000000
max      3141.000000
Name: content, dtype: float64





### 4. Missing Data Patterns Across All Datasets

In [14]:
# Missing data analysis for all datasets
for name, df in dataframes.items():
    print("=" * 80)
    print(f"DATASET: {name} - MISSING DATA PATTERNS")
    print("=" * 80)
    
    missing_data = df.isnull().sum()
    missing_pct = (missing_data / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Column': missing_data.index,
        'Missing_Count': missing_data.values,
        'Missing_Percentage': missing_pct.values
    }).sort_values('Missing_Count', ascending=False)
    
    print("\nColumns with Missing Data:")
    print("-" * 80)
    missing_cols = missing_df[missing_df['Missing_Count'] > 0]
    if len(missing_cols) > 0:
        display(missing_cols)
    else:
        print("No missing values in this dataset!")
    print("\n")

DATASET: airline - MISSING DATA PATTERNS

Columns with Missing Data:
--------------------------------------------------------------------------------


| | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| 17 | wifi_connectivity_rating | 40831 | 98.635134 |
| 7 | aircraft | 40118 | 96.912745 |
| 16 | ground_service_rating | 39193 | 94.67823 |
| 10 | route | 39055 | 94.344864 |
| 8 | type_traveller | 39018 | 94.255484 |
| 15 | inflight_entertainment_rating | 10282 | 24.838149 |
| 14 | food_beverages_rating | 8132 | 19.64441 |
| 12 | seat_comfort_rating | 7690 | 18.576674 |
| 13 | cabin_staff_rating | 7688 | 18.571843 |
| 11 | overall_rating | 4535 | 10.955165 |




DATASET: airport - MISSING DATA PATTERNS

Columns with Missing Data:
--------------------------------------------------------------------------------


| | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| 18 | airport_staff_rating | 17695 | 99.853281 |
| 14 | terminal_signs_rating | 17694 | 99.847638 |
| 17 | wifi_connectivity_rating | 17309 | 97.675075 |
| 13 | terminal_seating_rating | 17134 | 96.687546 |
| 8 | date_visit | 17128 | 96.653688 |
| 15 | food_beverages_rating | 17091 | 96.444896 |
| 9 | type_traveller | 17075 | 96.354608 |
| 7 | experience_airport | 17074 | 96.348965 |
| 16 | airport_shopping_rating | 5045 | 28.469048 |
| 4 | author_country | 4944 | 27.899103 |




DATASET: lounge - MISSING DATA PATTERNS

Columns with Missing Data:
--------------------------------------------------------------------------------


| | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| 10 | date_visit | 2165 | 95.627208 |
| 11 | type_traveller | 2145 | 94.743816 |
| 4 | author_country | 481 | 21.245583 |
| 9 | lounge_type | 300 | 13.250883 |
| 8 | airport | 94 | 4.151943 |
| 17 | washrooms_rating | 26 | 1.14841 |
| 18 | wifi_connectivity_rating | 11 | 0.485866 |
| 19 | staff_service_rating | 9 | 0.397527 |
| 12 | overall_rating | 5 | 0.220848 |
| 15 | bar_beverages_rating | 5 | 0.220848 |




DATASET: seat - MISSING DATA PATTERNS

Columns with Missing Data:
--------------------------------------------------------------------------------


| | Column | Missing_Count | Missing_Percentage |
|---|---|---|---|
| 18 | power_supply_rating | 1196 | 95.071542 |
| 9 | date_flown | 1145 | 91.017488 |
| 19 | seat_storage_rating | 1145 | 91.017488 |
| 11 | type_traveller | 1140 | 90.620032 |
| 17 | viewing_tv_rating | 29 | 2.305246 |
| 4 | author_country | 8 | 0.63593 |
| 8 | seat_layout | 6 | 0.476948 |
| 10 | cabin_flown | 6 | 0.476948 |
| 12 | overall_rating | 1 | 0.079491 |

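Several columns in the missing-data output are more than 90% empty (for example `wifi_connectivity_rating` in `airline` and `power_supply_rating` in `seat`). One possible way to handle them, sketched here under assumptions, is to drop any column whose missing share exceeds a cutoff before further analysis. The helper name `drop_sparse_columns`, the 90% default, and the toy data are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_pct: float = 90.0) -> pd.DataFrame:
    """Return a copy of df without columns whose missing share exceeds max_missing_pct.

    Hypothetical helper; the 90% default is an illustrative assumption.
    """
    # isnull().mean() gives the fraction of nulls per column
    missing_pct = df.isnull().mean() * 100
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df[keep].copy()


# Toy frame standing in for one of the review datasets (values are made up)
toy = pd.DataFrame({
    "content": ["ok"] * 20,
    "overall_rating": [np.nan, np.nan] + [5.0] * 18,    # 10% missing
    "wifi_connectivity_rating": [4.0] + [np.nan] * 19,  # 95% missing
})

trimmed = drop_sparse_columns(toy)
print(list(trimmed.columns))
```

In this notebook the same helper could be applied to every dataset with `{name: drop_sparse_columns(df) for name, df in dataframes.items()}`. Whether dropping is the right choice depends on whether the sparse fields matter for the downstream ranking; keeping them and analysing only the non-null rows is an alternative.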





### 5. Potential Computational Problems

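Based on the dataset properties gathered above, let's flag the computational challenges that could affect later analysis and algorithms: dataset size, imbalanced review counts, large text volume, high-cardinality columns, and missing data.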

In [15]:
# Identify computational problems across all datasets
print("=" * 80)
print("POTENTIAL COMPUTATIONAL PROBLEMS - COMPREHENSIVE ANALYSIS")
print("=" * 80)

all_problems = {}

for name, df in dataframes.items():
    problems = []
    
    # Problem 1: Dataset size
    total_memory = df.memory_usage(deep=True).sum() / (1024**2)
    if df.shape[0] > 50000 or total_memory > 50:
        problems.append({
            'Problem': 'Large Dataset Size',
            'Details': f'{df.shape[0]:,} records, {total_memory:.2f} MB memory',
            'Impact': 'Memory-intensive operations, slow processing for complex algorithms',
            'Considerations': 'Chunking, sampling, or distributed processing may be needed'
        })
    
    # Problem 2: Imbalanced distribution (review counts per value of the first column whose name contains 'airline')
    airline_col = [col for col in df.columns if 'airline' in col.lower()]
    if airline_col:
        airline_counts = df[airline_col[0]].value_counts()
        if len(airline_counts) > 0:
            max_count = airline_counts.max()
            min_count = airline_counts.min()
            ratio = max_count / min_count if min_count > 0 else float('inf')
            if ratio > 50:
                problems.append({
                    'Problem': 'Imbalanced Distribution',
                    'Details': f'Max: {max_count:,}, Min: {min_count:,}, Ratio: {ratio:.1f}:1',
                    'Impact': 'Bias in rankings and analysis towards popular entities',
                    'Considerations': 'Normalization, weighted scoring, or minimum threshold filtering'
                })
    
    # Problem 3: Text processing complexity (uses the first column whose name matches a text keyword)
    text_keywords = ['review', 'text', 'comment', 'content']
    text_col = [col for col in df.columns if any(keyword in col.lower() for keyword in text_keywords)]
    if text_col:
        text_lengths = df[text_col[0]].dropna().astype(str).str.len()
        if len(text_lengths) > 0:
            total_chars = text_lengths.sum()
            if total_chars > 5_000_000:  # 5M characters
                problems.append({
                    'Problem': 'Large Text Corpus',
                    'Details': f'{total_chars:,} chars, avg {text_lengths.mean():.0f} chars/entry',
                    'Impact': 'NLP operations (sentiment, embeddings) will be time-consuming',
                    'Considerations': 'Pre-processing pipeline, caching, efficient libraries (spaCy, transformers)'
                })
    
    # Problem 4: High cardinality (any object column with more than 500 unique values, including free-text fields)
    high_cardinality_cols = []
    for col in df.select_dtypes(include=['object']).columns:
        unique_count = df[col].nunique()
        if unique_count > 500:
            high_cardinality_cols.append((col, unique_count))
    
    if high_cardinality_cols:
        details = ', '.join([f'{col}: {count:,}' for col, count in high_cardinality_cols])
        problems.append({
            'Problem': 'High Cardinality Features',
            'Details': details,
            'Impact': 'Explosion of features in ML models, memory issues',
            'Considerations': 'Feature hashing, embeddings, dimensionality reduction, or grouping'
        })
    
    # Problem 5: Missing data
    missing_pct = (df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100
    if missing_pct > 10:
        problems.append({
            'Problem': 'Significant Missing Data',
            'Details': f'{missing_pct:.2f}% of all values are missing',
            'Impact': 'Reduced data quality, potential bias in analysis',
            'Considerations': 'Imputation strategies, or filtering rows/columns with too many nulls'
        })
    
    all_problems[name] = problems

# Display problems by dataset
for name, problems in all_problems.items():
    print(f"\n{'='*80}")
    print(f"DATASET: {name}")
    print('='*80)
    
    if problems:
        for i, problem in enumerate(problems, 1):
            print(f"\n{i}. {problem['Problem']}")
            print("-" * 80)
            print(f"Details: {problem['Details']}")
            print(f"Impact: {problem['Impact']}")
            print(f"Considerations: {problem['Considerations']}")
    else:
        print("\nNo major computational problems identified for this dataset.")
    print()

# Summary
total_problems = sum(len(p) for p in all_problems.values())
print(f"\n{'='*80}")
print(f"SUMMARY: {total_problems} total computational challenges identified across all datasets")
print('='*80)

POTENTIAL COMPUTATIONAL PROBLEMS - COMPREHENSIVE ANALYSIS

DATASET: airline

1. Large Dataset Size
--------------------------------------------------------------------------------
Details: 41,396 records, 52.89 MB memory
Impact: Memory-intensive operations, slow processing for complex algorithms
Considerations: Chunking, sampling, or distributed processing may be needed

2. Imbalanced Distribution
--------------------------------------------------------------------------------
Details: Max: 990, Min: 1, Ratio: 990.0:1
Impact: Bias in rankings and analysis towards popular entities
Considerations: Normalization, weighted scoring, or minimum threshold filtering

3. Large Text Corpus
--------------------------------------------------------------------------------
Details: 26,800,990 chars, avg 647 chars/entry
Impact: NLP operations (sentiment, embeddings) will be time-consuming
Considerations: Pre-processing pipeline, caching, efficient libraries (spaCy, transformers)

4. High Cardinality 