# Project: TravelTide
---

### Data Preprocessing

1. Importing External Libraries
2. Connecting to SQL Database
3. Aggregating to a cleaned, filtered table at the session level
4. Aggregating Session-based Data to User Level

   * Objective: Create a clean, user-level table
   * Each row represents a unique user_id
   * Columns contain aggregated features such as number of sessions, average session duration, booking behavior
   * This table serves as the foundation for downstream modeling and analysis

5. Exporting Aggregated User Table to Processed Data Layer

Tables:

* users
* sessions
* flights
* hotels

---
### First Step: Importing External Libraries

In [None]:
import pandas as pd
import sqlalchemy as sa
import matplotlib.pyplot as plt

---
### Second Step: Connecting to SQL Database

1. Create an engine
2. Make a connection

In [None]:
traveltide_url = 'postgresql://Test:bQNxVzJL4g6u@ep-noisy-flower-846766.us-east-2.aws.neon.tech/TravelTide?sslmode=require'

In [None]:
engine = sa.create_engine(traveltide_url)
connection = engine.connect()
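
The connection string above embeds the username and password directly in the notebook. One possible alternative, sketched below, assumes the credentials are supplied through environment variables. The variable names (`TRAVELTIDE_DB_USER`, `TRAVELTIDE_DB_PASSWORD`, `TRAVELTIDE_DB_HOST`, `TRAVELTIDE_DB_NAME`) and the helper `build_database_url` are hypothetical and not part of the original project.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables.

    All variable names used here are hypothetical placeholders.
    """
    user = env["TRAVELTIDE_DB_USER"]
    # Percent-encode the password so characters such as '@' do not break the URL
    password = quote_plus(env["TRAVELTIDE_DB_PASSWORD"])
    host = env["TRAVELTIDE_DB_HOST"]
    database = env.get("TRAVELTIDE_DB_NAME", "TravelTide")
    return f"postgresql://{user}:{password}@{host}/{database}?sslmode=require"


# Possible usage in the notebook (left commented out, since it needs a live database):
# engine = sa.create_engine(build_database_url())
# with engine.connect() as connection:
#     ...
```

Opening the connection in a `with` block, as in the commented usage, would also close it automatically when the block ends.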

---
### Session Level Aggregation

1. Filter sessions by start date
   * Limit the data to sessions starting on or after January 5, 2023.
2. Select active users only
   * Include only users with more than 7 sessions in that window, to focus on meaningful behavior patterns.
3. Clean hotel check-in and check-out data
   * Swap check-in and check-out times where the check-out precedes the check-in, so hotel nights are calculated correctly.
4. Calculate session-level features
   * Compute features such as binary booking and discount flags, customer age, flight distance, and hotel nights.
5. Enrich session data with user, flight, and hotel information
   * Join the session table with users, flights, and hotels to create a comprehensive session-level dataset.
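
To illustrate the check-in/check-out cleaning step, here is a minimal Python sketch of the same logic the SQL below applies with `CASE` expressions: put the two timestamps in chronological order, then count nights as the difference between the calendar dates. The function name `clean_stay` and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime


def clean_stay(check_in, check_out):
    """Return (check_in, check_out, nights) with the timestamps in chronological order."""
    # Swap the two values when they arrive in the wrong order
    if check_out < check_in:
        check_in, check_out = check_out, check_in
    # Nights are counted on calendar dates, ignoring the time of day
    nights = (check_out.date() - check_in.date()).days
    return check_in, check_out, nights


# A record whose check-in and check-out were swapped in the raw data (made-up values)
cleaned_in, cleaned_out, nights = clean_stay(
    datetime(2023, 6, 14, 11, 0), datetime(2023, 6, 10, 13, 12)
)
print(cleaned_in, cleaned_out, nights)  # nights is 4
```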

In [None]:
query = """
        WITH

        sessions_after_jan_5_2023 AS (
            SELECT *
            FROM sessions
            WHERE session_start >= '2023-01-05'
        ),

        users_with_more_than_7_sessions AS (
            SELECT user_id,
                  COUNT(*) AS session_count
            FROM sessions_after_jan_5_2023
            GROUP BY user_id
            HAVING COUNT(*) > 7
        ),

        ordered_check_in_out AS (
          SELECT
            trip_id,
            CASE
              WHEN check_out_time < check_in_time THEN check_out_time
              ELSE check_in_time
            END AS cleaned_check_in_time,
            CASE
              WHEN check_out_time < check_in_time THEN check_in_time
              ELSE check_out_time
            END AS cleaned_check_out_time
          FROM hotels
        ),

        calculated_stay AS (
          SELECT
            trip_id,
            (cleaned_check_out_time::date - cleaned_check_in_time::date) AS cleaned_nights,
            CASE
              WHEN cleaned_check_out_time - cleaned_check_in_time < INTERVAL '1 day'
              THEN EXTRACT(EPOCH FROM (cleaned_check_out_time - cleaned_check_in_time)) / 3600.0
              ELSE NULL
            END AS duration_hours
          FROM ordered_check_in_out
        ),

        enriched_sessions_with_user_trip_data AS (
            SELECT
                s.session_id,
                s.user_id,
                s.trip_id,
                s.session_start,
                s.session_end,
                s.page_clicks,
                s.flight_booked,
                      CASE WHEN flight_booked = 'true' THEN 1 ELSE 0 END AS binary_flight_booked,
                s.flight_discount,
                      CASE WHEN flight_discount = 'true' THEN 1 ELSE 0 END AS binary_flight_discount,
                s.flight_discount_amount,
                s.hotel_booked,
                      CASE WHEN hotel_booked = 'true' THEN 1 ELSE 0 END AS binary_hotel_booked,
                s.hotel_discount,
                      CASE WHEN hotel_discount = 'true' THEN 1 ELSE 0 END AS binary_hotel_discount,
                s.hotel_discount_amount,
                s.cancellation,
                      CASE WHEN cancellation = 'true' THEN 1 ELSE 0 END AS binary_cancellation,
                u.birthdate,
                      DATE_PART('year', AGE(CURRENT_DATE, u.birthdate)) AS customer_age,
                u.gender,
                u.married,
                      CASE WHEN u.married = 'true' THEN 1 ELSE 0 END AS binary_married,
                u.has_children,
                      CASE WHEN u.has_children = 'true' THEN 1 ELSE 0 END AS binary_has_children,
                u.home_country,
                u.home_city,
                u.home_airport,
                u.home_airport_lat,
                u.home_airport_lon,
                u.sign_up_date,
                f.origin_airport,
                f.destination,
                f.destination_airport,
                f.seats,
                f.return_flight_booked,
                      CASE WHEN return_flight_booked = 'true' THEN 1 ELSE 0 END AS binary_return_flight_booked,
                f.departure_time,
                f.return_time,
                f.checked_bags,
                f.trip_airline,
                f.destination_airport_lat,
                f.destination_airport_lon,
                f.base_fare_usd,
                      COALESCE(haversine_distance(home_airport_lat,home_airport_lon,
                               destination_airport_lat, destination_airport_lon),0) AS flown_flight_distance,
                h.hotel_name,
                      LEFT(h.hotel_name, LENGTH(h.hotel_name) - POSITION(' - ' IN REVERSE(h.hotel_name)) - 2) AS extract_hotel_name,
                      RIGHT(h.hotel_name, POSITION(' - ' IN REVERSE(h.hotel_name)) - 1) AS extract_hotel_location,
                h.rooms,
                cs.cleaned_nights,
                oco.cleaned_check_in_time,
                oco.cleaned_check_out_time,
                h.hotel_per_room_usd AS hotel_price_per_room_night_usd,
                      MAX(CASE WHEN cancellation = 'true' THEN 1 ELSE 0 END) OVER (PARTITION BY s.trip_id) AS trip_cancelled

            FROM sessions_after_jan_5_2023 s
            LEFT JOIN users u ON s.user_id = u.user_id
            LEFT JOIN flights f ON s.trip_id = f.trip_id
            LEFT JOIN hotels h ON s.trip_id = h.trip_id
            LEFT JOIN ordered_check_in_out oco ON s.trip_id = oco.trip_id
            LEFT JOIN calculated_stay cs ON s.trip_id = cs.trip_id
            WHERE s.user_id IN (SELECT user_id FROM users_with_more_than_7_sessions)
        )

        SELECT *
        FROM enriched_sessions_with_user_trip_data;
        """

df = pd.read_sql(query, engine)
with pd.option_context('display.max_columns', None, 'display.expand_frame_repr', False):
    display(df)

# Keep a copy for the CSV export below instead of running the query a second time
export = df.copy()

Unnamed: 0,session_id,user_id,trip_id,session_start,session_end,page_clicks,flight_booked,binary_flight_booked,flight_discount,binary_flight_discount,flight_discount_amount,hotel_booked,binary_hotel_booked,hotel_discount,binary_hotel_discount,hotel_discount_amount,cancellation,binary_cancellation,birthdate,customer_age,gender,married,binary_married,has_children,binary_has_children,home_country,home_city,home_airport,home_airport_lat,home_airport_lon,sign_up_date,origin_airport,destination,destination_airport,seats,return_flight_booked,binary_return_flight_booked,departure_time,return_time,checked_bags,trip_airline,destination_airport_lat,destination_airport_lon,base_fare_usd,flown_flight_distance,hotel_name,extract_hotel_name,extract_hotel_location,rooms,cleaned_nights,cleaned_check_in_time,cleaned_check_out_time,hotel_price_per_room_night_usd,trip_cancelled
0,101486-c431d39dbe884b6f9d6a267fe6655e94,101486,101486-1015905607d74b15954bfd4ac7029ef3,2023-06-01 09:00:00,2023-06-01 09:02:38,21,True,1,False,0,,True,1,False,0,,False,0,1972-12-07,52.0,F,True,1,True,1,usa,tacoma,TCM,47.138,-122.476,2022-02-17,TCM,edmonton,YED,1.0,True,1,2023-06-10 10:00:00,2023-06-14 10:00:00,0.0,United Airlines,53.667,-113.467,189.91,995.681600,Crowne Plaza - edmonton,Crowne Plaza,edmonton,1.0,4.0,2023-06-10 13:12:24.030,2023-06-14 11:00:00,253.0,0
1,101486-c668e4e44ffc4e5a93c46f661320aa23,101486,101486-6759c5dd49a1457d916bb2bbf48c3115,2023-06-17 19:42:00,2023-06-17 19:44:37,21,False,0,False,0,,True,1,False,0,,False,0,1972-12-07,52.0,F,True,1,True,1,usa,tacoma,TCM,47.138,-122.476,2022-02-17,,,,,,0,NaT,NaT,,,,,,0.000000,Banyan Tree - montreal,Banyan Tree,montreal,2.0,5.0,2023-06-24 11:00:00.000,2023-06-29 11:00:00,144.0,0
2,101961-c2ff557a18294bff9856f4d14ced1a47,101961,101961-19f641633ebc442799662326f0b4dfa0,2023-04-24 19:28:00,2023-04-24 19:30:19,19,True,1,True,1,0.1,True,1,False,0,,False,0,1980-09-14,44.0,F,True,1,False,0,usa,boston,BOS,42.364,-71.005,2022-02-17,BOS,new york,LGA,1.0,True,1,2023-05-05 11:00:00,2023-05-09 11:00:00,0.0,Allegiant Air,40.640,-73.779,49.67,297.807805,Conrad - new york,Conrad,new york,1.0,4.0,2023-05-05 13:22:30.720,2023-05-09 11:00:00,165.0,0
3,101961-83fd9718a030497ca70b55697ecba1b7,101961,101961-29a8ff7c9910469c959fffa60215cf78,2023-02-02 12:39:00,2023-02-02 12:41:23,19,True,1,False,0,,True,1,False,0,,False,0,1980-09-14,44.0,F,True,1,False,0,usa,boston,BOS,42.364,-71.005,2022-02-17,BOS,montreal,YHU,1.0,True,1,2023-02-08 07:00:00,2023-02-13 07:00:00,1.0,United Airlines,45.517,-73.417,77.02,402.714696,Rosewood - montreal,Rosewood,montreal,1.0,4.0,2023-02-08 09:30:00.990,2023-02-12 11:00:00,197.0,0
4,101961-2bc80c62f0284004bd93b2674c69748e,101961,101961-836fd88487d240baa4402c8e4c6f188c,2023-03-12 17:56:00,2023-03-12 17:59:06,25,True,1,False,0,,True,1,False,0,,False,0,1980-09-14,44.0,F,True,1,False,0,usa,boston,BOS,42.364,-71.005,2022-02-17,BOS,seattle,BFI,1.0,True,1,2023-03-16 07:00:00,2023-03-21 07:00:00,1.0,Kenmore Air,47.530,-122.302,769.50,4253.525872,Extended Stay - seattle,Extended Stay,seattle,1.0,5.0,2023-03-16 14:00:14.670,2023-03-21 11:00:00,132.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47431,641310-5c38bc611d9a447b8ce45663766131e4,641310,,2023-03-06 04:15:00,2023-03-06 04:16:09,9,False,0,False,0,,False,0,False,0,,False,0,1979-12-14,45.0,F,False,0,True,1,usa,los angeles,LAX,33.942,-118.408,2023-03-06,,,,,,0,NaT,NaT,,,,,,0.000000,,,,,,NaT,NaT,,0
47432,611852-f9378bb9ef5b4e12bcd232d4c60614dd,611852,,2023-05-15 03:19:00,2023-05-15 03:20:30,12,False,0,False,0,,False,0,False,0,,False,0,1983-09-08,41.0,M,False,0,False,0,usa,akron,AKR,41.038,-81.467,2023-02-22,,,,,,0,NaT,NaT,,,,,,0.000000,,,,,,NaT,NaT,,0
47433,612236-691728e1d9854ac5b2faa98a2ede862e,612236,,2023-05-15 21:35:00,2023-05-15 21:36:36,13,False,0,False,0,,False,0,False,0,,False,0,1974-04-23,51.0,F,True,1,True,1,usa,colorado springs,COS,38.806,-104.700,2023-02-22,,,,,,0,NaT,NaT,,,,,,0.000000,,,,,,NaT,NaT,,0
47434,614033-4110193633ec46aab363ae91930e49c4,614033,,2023-05-15 20:24:00,2023-05-15 20:25:37,13,False,0,False,0,,False,0,False,0,,False,0,1985-03-01,40.0,F,False,0,False,0,usa,honolulu,HNL,21.316,-157.927,2023-02-23,,,,,,0,NaT,NaT,,,,,,0.000000,,,,,,NaT,NaT,,0
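
The `flown_flight_distance` column comes from a `haversine_distance` function that lives in the database; its definition is not shown in this notebook. As a rough illustration of what such a function typically computes, here is a sketch of the standard haversine formula in Python. The function name `haversine_km`, the Earth radius of 6371 km, and the kilometre unit are assumptions, so results may differ from the values in the table above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, the database function may differ


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance between an airport and itself is zero, which is also what the query's
# COALESCE(..., 0) reports for sessions without a flight
print(haversine_km(42.364, -71.005, 42.364, -71.005))
```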


##### Export the output from above to a .csv file

In [None]:
export.to_csv("02_session_level_aggregation.csv", index=False)

##### Shape (rows, columns) of the exported table

In [None]:
export.shape

(47436, 54)
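
The session-level query splits `hotel_name` values such as `Crowne Plaza - edmonton` into `extract_hotel_name` and `extract_hotel_location` by locating the last `' - '` with `REVERSE` and `POSITION`. A minimal Python sketch of the same idea, using `str.rpartition`, is shown below; the helper name `split_hotel_name` is hypothetical. Unlike the SQL expressions, this sketch handles a name without any separator explicitly by returning `None` for the location.

```python
def split_hotel_name(hotel_name, separator=" - "):
    """Split 'Brand - city' at the last separator into (brand, city)."""
    name, found, location = hotel_name.rpartition(separator)
    if not found:
        # No separator present: keep the whole string as the name
        return hotel_name, None
    return name, location


print(split_hotel_name("Crowne Plaza - edmonton"))  # ('Crowne Plaza', 'edmonton')
print(split_hotel_name("Extended Stay - seattle"))  # ('Extended Stay', 'seattle')
```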

---
### User Level Aggregation

1. **Aggregate engagement metrics from session data**
   * Calculate user-level statistics including total sessions, active days, average session time, and click patterns to measure platform engagement intensity.

2. **Compute booking behavior and spending patterns**
   * Derive completed orders, total transaction values, and revenue averages across flights and hotels to quantify customer financial value.

3. **Engineer trip dynamics and travel preferences**
   * Create metrics for booking lead times, trip duration, seat/room quantities, and flight distances to understand travel behavior patterns.

4. **Calculate discount utilization and savings metrics**
   * Measure promotional usage rates, dollar amounts saved, and price sensitivity indicators to assess deal responsiveness.

5. **Generate composite customer value features**
   * Combine multiple dimensions into lifetime customer value, age group classifications, and normalized scoring metrics for segmentation readiness.
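
As a small illustration of the first two steps, the sketch below rolls a handful of session records up to one row per user in plain Python. It mirrors a few of the SQL aggregates used in the query that follows (`COUNT(session_id)`, `COUNT(DISTINCT trip_id)`, `AVG(page_clicks)`, and the conversion ratio). The function name `aggregate_sessions` and the toy records are made up for the example and are not taken from the TravelTide data.

```python
from collections import defaultdict


def aggregate_sessions(sessions):
    """Roll session dictionaries up to one summary dictionary per user_id."""
    totals = defaultdict(
        lambda: {"num_sessions": 0, "booked_sessions": 0, "trip_ids": set(), "page_clicks": 0}
    )
    for s in sessions:
        t = totals[s["user_id"]]
        t["num_sessions"] += 1
        t["page_clicks"] += s["page_clicks"]
        # A session counts as converted when a flight or a hotel was booked
        if s["flight_booked"] or s["hotel_booked"]:
            t["booked_sessions"] += 1
        # Like COUNT(DISTINCT trip_id), missing trip ids are not counted
        if s["trip_id"] is not None:
            t["trip_ids"].add(s["trip_id"])
    return {
        user_id: {
            "num_sessions": t["num_sessions"],
            "num_trips": len(t["trip_ids"]),
            "mean_clicks_per_session": t["page_clicks"] / t["num_sessions"],
            "conversion_ratio": t["booked_sessions"] / t["num_sessions"],
        }
        for user_id, t in totals.items()
    }


# Toy records (hypothetical values)
toy_sessions = [
    {"user_id": 1, "trip_id": "t1", "page_clicks": 10, "flight_booked": True, "hotel_booked": False},
    {"user_id": 1, "trip_id": None, "page_clicks": 20, "flight_booked": False, "hotel_booked": False},
    {"user_id": 2, "trip_id": None, "page_clicks": 5, "flight_booked": False, "hotel_booked": False},
]
user_table = aggregate_sessions(toy_sessions)
print(user_table[1])
```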

In [None]:
query = """
        WITH

        sessions_after_jan_5_2023 AS (
            SELECT *
            FROM sessions
            WHERE session_start >= '2023-01-05'
        ),

        users_with_more_than_7_sessions AS (
            SELECT user_id,
                  COUNT(*) AS session_count
            FROM sessions_after_jan_5_2023
            GROUP BY user_id
            HAVING COUNT(*) > 7
        ),

        ordered_check_in_out AS (
          SELECT
            trip_id,
            CASE
              WHEN check_out_time < check_in_time THEN check_out_time
              ELSE check_in_time
            END AS cleaned_check_in_time,
            CASE
              WHEN check_out_time < check_in_time THEN check_in_time
              ELSE check_out_time
            END AS cleaned_check_out_time
          FROM hotels
        ),

        calculated_stay AS (
          SELECT
            trip_id,
            (cleaned_check_out_time::date - cleaned_check_in_time::date) AS cleaned_nights,
            CASE
              WHEN cleaned_check_out_time - cleaned_check_in_time < INTERVAL '1 day'
              THEN EXTRACT(EPOCH FROM (cleaned_check_out_time - cleaned_check_in_time)) / 3600.0
              ELSE NULL
            END AS duration_hours
          FROM ordered_check_in_out
        ),

        enriched_sessions_with_user_trip_data AS (
            SELECT
                s.session_id,
                s.user_id,
                s.trip_id,
                s.session_start,
                s.session_end,
                s.page_clicks,
                s.flight_booked,
                      CASE WHEN flight_booked = 'true' THEN 1 ELSE 0 END AS binary_flight_booked,
                s.flight_discount,
                      CASE WHEN flight_discount = 'true' THEN 1 ELSE 0 END AS binary_flight_discount,
                s.flight_discount_amount,
                s.hotel_booked,
                      CASE WHEN hotel_booked = 'true' THEN 1 ELSE 0 END AS binary_hotel_booked,
                s.hotel_discount,
                      CASE WHEN hotel_discount = 'true' THEN 1 ELSE 0 END AS binary_hotel_discount,
                s.hotel_discount_amount,
                s.cancellation,
                      CASE WHEN cancellation = 'true' THEN 1 ELSE 0 END AS binary_cancellation,
                u.birthdate,
                      DATE_PART('year', AGE(CURRENT_DATE, u.birthdate)) AS customer_age,
                u.gender,
                u.married,
                      CASE WHEN u.married = 'true' THEN 1 ELSE 0 END AS binary_married,
                u.has_children,
                      CASE WHEN u.has_children = 'true' THEN 1 ELSE 0 END AS binary_has_children,
                u.home_country,
                u.home_city,
                u.home_airport,
                u.home_airport_lat,
                u.home_airport_lon,
                u.sign_up_date,
                f.origin_airport,
                f.destination,
                f.destination_airport,
                f.seats,
                f.return_flight_booked,
                      CASE WHEN return_flight_booked = 'true' THEN 1 ELSE 0 END AS binary_return_flight_booked,
                f.departure_time,
                f.return_time,
                f.checked_bags,
                f.trip_airline,
                f.destination_airport_lat,
                f.destination_airport_lon,
                f.base_fare_usd,
                      COALESCE(haversine_distance(home_airport_lat,home_airport_lon,
                               destination_airport_lat, destination_airport_lon),0) AS flown_flight_distance,
                h.hotel_name,
                      LEFT(h.hotel_name, LENGTH(h.hotel_name) - POSITION(' - ' IN REVERSE(h.hotel_name)) - 2) AS extract_hotel_name,
                      RIGHT(h.hotel_name, POSITION(' - ' IN REVERSE(h.hotel_name)) - 1) AS extract_hotel_location,
                h.rooms,
                cs.cleaned_nights,
                oco.cleaned_check_in_time,
                oco.cleaned_check_out_time,
                h.hotel_per_room_usd AS hotel_price_per_room_night_usd,
                      MAX(CASE WHEN cancellation = 'true' THEN 1 ELSE 0 END) OVER (PARTITION BY s.trip_id) AS trip_cancelled

            FROM sessions_after_jan_5_2023 s
            LEFT JOIN users u ON s.user_id = u.user_id
            LEFT JOIN flights f ON s.trip_id = f.trip_id
            LEFT JOIN hotels h ON s.trip_id = h.trip_id
            LEFT JOIN ordered_check_in_out oco ON s.trip_id = oco.trip_id
            LEFT JOIN calculated_stay cs ON s.trip_id = cs.trip_id
            WHERE s.user_id IN (SELECT user_id FROM users_with_more_than_7_sessions)
        ),

        session_combined_table AS (
          SELECT *
          FROM enriched_sessions_with_user_trip_data
        ),

        user_agg_metric AS (
          SELECT
            user_id,
            customer_age,
            gender,
            binary_married AS married,
            binary_has_children AS has_children,
            home_city,
            home_country,
            home_airport,

            COUNT(DISTINCT trip_id) AS num_trips,
            COUNT(session_id) AS num_sessions,

            MIN(session_start::DATE) AS user_start_date,
            MAX(session_end::DATE) AS user_end_date,

            COUNT(DISTINCT DATE(session_start)) AS active_presence_days,

            SUM(page_clicks) AS total_page_clicks,
            AVG(page_clicks) AS mean_clicks_per_session,

            AVG(EXTRACT(EPOCH FROM (session_end - session_start))) AS average_session_time,

            SUM(binary_cancellation) AS cancelled_session_count,

            SUM(CASE WHEN binary_flight_booked = 1 AND trip_cancelled = 0 THEN 1 ELSE 0 END) AS completed_flight_orders,
            SUM(CASE WHEN binary_hotel_booked = 1 AND trip_cancelled = 0 THEN 1 ELSE 0 END) AS completed_hotel_orders,

            SUM(COALESCE(base_fare_usd, 0)) AS aggregate_flight_spend,
            SUM(COALESCE(hotel_price_per_room_night_usd, 0) * COALESCE(rooms, 0) * COALESCE(cleaned_nights, 0)) AS aggregate_hotel_spend,
            SUM(COALESCE(base_fare_usd, 0)) + SUM(COALESCE(hotel_price_per_room_night_usd, 0) * COALESCE(rooms, 0) * COALESCE(cleaned_nights, 0)) AS overall_transaction_value,

            AVG(COALESCE(base_fare_usd, 0)) AS avg_flight_price,
            AVG(COALESCE(hotel_price_per_room_night_usd, 0) * COALESCE(rooms, 0) * COALESCE(cleaned_nights, 0)) AS avg_hotel_price,

            SUM(COALESCE(base_fare_usd, 0)) / NULLIF(COUNT(DISTINCT trip_id), 0) AS trip_revenue_avg,

            SUM(CASE WHEN binary_flight_discount = 1 OR binary_hotel_discount = 1 THEN 1 ELSE 0 END)::NUMERIC / NULLIF(COUNT(session_id), 0) AS promo_booking_ratio,

            -- Note: this is the total discount value summed over sessions, despite the alias mean_savings_value
            (SUM(CASE WHEN binary_flight_discount = 1 THEN base_fare_usd * flight_discount_amount ELSE 0 END) +
            SUM(CASE WHEN binary_hotel_discount = 1 THEN hotel_price_per_room_night_usd * hotel_discount_amount * rooms * cleaned_nights ELSE 0 END))::NUMERIC AS mean_savings_value,

            SUM(CASE WHEN binary_flight_discount = 1 THEN base_fare_usd * flight_discount_amount ELSE 0 END)::NUMERIC / NULLIF(SUM(flown_flight_distance), 0) AS savings_per_kilometer,

            SUM(seats)::NUMERIC / NULLIF(COUNT(DISTINCT trip_id), 0) AS mean_seat_count,
            SUM(rooms)::NUMERIC / NULLIF(COUNT(DISTINCT trip_id), 0) AS mean_room_quantity,
            SUM(cleaned_nights)::NUMERIC / NULLIF(COUNT(DISTINCT trip_id), 0) AS avg_nights_booked,

            SUM(flown_flight_distance)::NUMERIC / NULLIF(COUNT(DISTINCT trip_id), 0) AS avg_flight_distance,
            AVG(return_time::DATE - departure_time::DATE) AS avg_trip_timespan,
            AVG(departure_time::DATE - session_end::DATE) AS booking_lead_interval,

            COUNT(*) AS session_frequency,
            (MAX(session_end::DATE) - MIN(session_start::DATE)) AS user_activity_lifespan,
            MAX(session_end::DATE) AS last_seen_date,

            SUM(CASE WHEN binary_flight_booked = 1 OR binary_hotel_booked = 1 THEN 1 ELSE 0 END)::NUMERIC / NULLIF(COUNT(session_id), 0) AS conversion_ratio,

            COALESCE(SUM(flown_flight_distance), 0) AS total_flight_distance,
            COALESCE(SUM(checked_bags), 0) AS total_checked_bags,
            SUM(checked_bags)::NUMERIC / NULLIF(SUM(CASE WHEN binary_flight_booked = 1 THEN 1 ELSE 0 END), 0) AS checked_bag_usage_ratio,

            SUM(CASE WHEN binary_flight_booked = 1 THEN 1 ELSE 0 END)::NUMERIC / NULLIF(SUM(CASE WHEN binary_hotel_booked = 1 THEN 1 ELSE 0 END), 0) AS flight_hotel_mix_ratio

          FROM session_combined_table
          GROUP BY
            user_id, customer_age, gender, binary_married, binary_has_children,
            home_city, home_country, home_airport
        ),

        customer_value AS (
          SELECT
            user_id,

            CASE
              WHEN customer_age < 18 THEN '< 18'
              WHEN customer_age BETWEEN 18 AND 26 THEN 'Student'
              WHEN customer_age BETWEEN 27 AND 34 THEN 'Young Aged'
              WHEN customer_age BETWEEN 35 AND 60 THEN 'Middle Aged'
              WHEN customer_age > 60 THEN 'Senior'
              ELSE 'Unknown'
            END AS age_group,

            AVG(overall_transaction_value)::NUMERIC AS avg_total_sales,

            SUM(overall_transaction_value)::NUMERIC / NULLIF(SUM(num_trips), 0) AS customer_value_per_trip,

            SUM(overall_transaction_value)::NUMERIC / NULLIF(SUM(num_sessions), 0) AS customer_value_per_session,

            AVG(user_activity_lifespan)::NUMERIC / 180 AS avg_cust_lifespan

          FROM user_agg_metric
          GROUP BY user_id, customer_age
        ),

          discount_propn AS (
            SELECT
              user_id,

              SUM(CASE WHEN hotel_discount = 'true' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS hotel_discount_proportion,

              SUM(CASE WHEN flight_discount = 'true' THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS flight_discount_proportion,

              AVG(flight_discount_amount)::NUMERIC AS avg_flight_discount_charges,

              AVG(hotel_discount_amount)::NUMERIC AS avg_hotel_discount_charges,

              (SUM(CASE WHEN hotel_discount = 'true' THEN 1 ELSE 0 END)
                + SUM(CASE WHEN flight_discount = 'true' THEN 1 ELSE 0 END))::FLOAT / COUNT(*) AS total_discount_proportion,

              SUM(CASE
                  WHEN flight_discount = 'true' THEN base_fare_usd * flight_discount_amount
                  ELSE 0
                END)::NUMERIC / NULLIF(COUNT(*), 0) AS avg_dollar_saved,

              SUM(CASE
                  WHEN flight_discount = 'true' THEN base_fare_usd * flight_discount_amount
                  ELSE 0
                END)::NUMERIC / NULLIF(SUM(flown_flight_distance), 0) AS avg_dollar_saved_per_km

            FROM session_combined_table
            GROUP BY user_id
          ),

          session_level_final_table AS (
            SELECT
              user_id,

              COUNT(DISTINCT trip_id) AS booking_count,

              COUNT(DISTINCT trip_id) * 1.0 / NULLIF(COUNT(session_id), 0) AS booking_rate,

              EXTRACT(DAY FROM (MAX(session_end) - MIN(session_start))) AS active_days,

              AVG(page_clicks) AS avg_page_clicks,

              COALESCE(
                SUM(checked_bags) * 1.0 / NULLIF(COUNT(DISTINCT trip_id), 0),
                0
              ) AS avg_bags,

              CASE
                WHEN COUNT(DISTINCT trip_id) > 0 THEN
                  SUM(binary_cancellation) * 1.0 / COUNT(DISTINCT trip_id)
                ELSE 0
              END AS cancellation_rate,

              SUM(binary_flight_booked) - SUM(binary_cancellation) AS num_flight_booked,
              SUM(binary_hotel_booked) - SUM(binary_cancellation) AS num_hotel_booked,

              COALESCE(
                SUM(seats) * 1.0 / NULLIF(COUNT(DISTINCT trip_id), 0),
                0
              ) AS avg_num_seats,

              COALESCE(SUM(checked_bags), 0) AS total_checked_bags,

              AVG(departure_time::DATE - session_end::DATE) AS travel_lead_time,

              AVG(CASE
                WHEN cancellation = 'true' THEN (departure_time::DATE - session_end::DATE)
                ELSE NULL
              END) AS avg_cancel_lead_time,

              COALESCE(SUM(flown_flight_distance), 0)::NUMERIC AS total_dist_flown,
              COALESCE(AVG(flown_flight_distance), 0)::NUMERIC AS avg_dist_flown_incl,

              SUM(cleaned_nights) * 1.0 / NULLIF(COUNT(DISTINCT trip_id), 0) AS avg_hotel_stay,
              SUM(rooms) * 1.0 / NULLIF(COUNT(DISTINCT trip_id), 0) AS avg_hotel_rooms,

              COALESCE(
                EXTRACT(DAY FROM AVG(return_time - departure_time)),
                0
              ) AS avg_trip_duration,

              AVG(session_end - session_start) AS avg_session_duration,

              SUM(CASE WHEN flight_discount = 'true' AND trip_id IS NOT NULL THEN 1 ELSE 0 END) AS num_flights_discount_applied,
              SUM(CASE WHEN hotel_discount = 'true' AND trip_id IS NOT NULL THEN 1 ELSE 0 END) AS num_hotel_discount_applied,

              SUM(CASE WHEN flight_discount = 'true' AND trip_id IS NULL THEN 1 ELSE 0 END) AS num_flights_discount_offered,
              SUM(CASE WHEN hotel_discount = 'true' AND trip_id IS NULL THEN 1 ELSE 0 END) AS num_hotel_discount_offered,

              SUM(COALESCE(hotel_price_per_room_night_usd,0)
                  * COALESCE(hotel_discount_amount,0)
                  * COALESCE(rooms,0)
                  * COALESCE(cleaned_nights,0)) AS total_hotel_discount_charges,

              SUM(COALESCE(base_fare_usd,0)
                  * COALESCE(flight_discount_amount,0)
                  * COALESCE(seats,0)) AS total_flight_discount_charges

            FROM session_combined_table
            GROUP BY user_id
          ),

          trip_ratios AS (
            SELECT
              user_id,

              COALESCE((
                num_flights_discount_applied + num_hotel_discount_applied
              )::FLOAT / NULLIF(booking_count, 0), 0) AS discounted_booking_rate,

              COALESCE(total_dist_flown / NULLIF(num_flight_booked, 0), 0) AS avg_dist_flown,

              COALESCE(
                NULLIF(num_flight_booked, 0)::FLOAT / NULLIF(num_hotel_booked, 0),
                0
              ) AS flight_to_hotel_booking_ratio,

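              -- Note: both operands are integers, so this is integer division
              -- (for example, 1 bag / 2 flights yields 0); cast to FLOAT for a fractional ratio
              COALESCE(
                total_checked_bags / NULLIF(num_flight_booked, 0),
                0
              ) AS checked_bags_ratio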
            FROM session_level_final_table
          ),

          final_single_user_table AS (
            SELECT
              uam.*,
              cv.age_group,
              cv.avg_total_sales,
              cv.customer_value_per_trip,
              cv.customer_value_per_session,
              cv.avg_cust_lifespan,
              dp.hotel_discount_proportion,
              dp.flight_discount_proportion,
              dp.avg_flight_discount_charges,
              dp.avg_hotel_discount_charges,
              dp.total_discount_proportion,
              dp.avg_dollar_saved,
              dp.avg_dollar_saved_per_km,
              slf.booking_count,
              slf.booking_rate,
              slf.active_days,
              slf.avg_page_clicks,
              slf.avg_bags,
              slf.cancellation_rate,
              slf.num_flight_booked,
              slf.num_hotel_booked,
              slf.avg_num_seats,
              slf.total_checked_bags,
              slf.travel_lead_time,
              slf.avg_cancel_lead_time,
              slf.total_dist_flown,
              slf.avg_dist_flown_incl,
              slf.avg_hotel_stay,
              slf.avg_hotel_rooms,
              slf.avg_trip_duration,
              slf.avg_session_duration,
              slf.num_flights_discount_applied,
              slf.num_hotel_discount_applied,
              slf.num_flights_discount_offered,
              slf.num_hotel_discount_offered,
              slf.total_hotel_discount_charges,
              slf.total_flight_discount_charges,
              tr.discounted_booking_rate,
              tr.avg_dist_flown,
              tr.flight_to_hotel_booking_ratio,
              tr.checked_bags_ratio,

              cv.avg_cust_lifespan * cv.customer_value_per_trip AS lifetime_customer_value,

              (
                dp.avg_dollar_saved_per_km - MIN(dp.avg_dollar_saved_per_km) OVER ()
              ) / NULLIF(
                MAX(dp.avg_dollar_saved_per_km) OVER () - MIN(dp.avg_dollar_saved_per_km) OVER (), 0
              ) AS scaled_ads
            FROM user_agg_metric uam
            LEFT JOIN customer_value cv USING (user_id)
            LEFT JOIN discount_propn dp USING (user_id)
            LEFT JOIN session_level_final_table slf USING (user_id)
            LEFT JOIN trip_ratios tr USING (user_id)
          )

              SELECT *
              FROM final_single_user_table;
            """
df = pd.read_sql(query, engine)
with pd.option_context('display.max_columns', None, 'display.expand_frame_repr', False):
    display(df)

# Keep a copy for export instead of running the query a second time
export = df.copy()

Unnamed: 0,user_id,customer_age,gender,married,has_children,home_city,home_country,home_airport,num_trips,num_sessions,user_start_date,user_end_date,active_presence_days,total_page_clicks,mean_clicks_per_session,average_session_time,cancelled_session_count,completed_flight_orders,completed_hotel_orders,aggregate_flight_spend,aggregate_hotel_spend,overall_transaction_value,avg_flight_price,avg_hotel_price,trip_revenue_avg,promo_booking_ratio,mean_savings_value,savings_per_kilometer,mean_seat_count,mean_room_quantity,avg_nights_booked,avg_flight_distance,avg_trip_timespan,booking_lead_interval,session_frequency,user_activity_lifespan,last_seen_date,conversion_ratio,total_flight_distance,total_checked_bags,checked_bag_usage_ratio,flight_hotel_mix_ratio,age_group,avg_total_sales,customer_value_per_trip,customer_value_per_session,avg_cust_lifespan,hotel_discount_proportion,flight_discount_proportion,avg_flight_discount_charges,avg_hotel_discount_charges,total_discount_proportion,avg_dollar_saved,avg_dollar_saved_per_km,booking_count,booking_rate,active_days,avg_page_clicks,avg_bags,cancellation_rate,num_flight_booked,num_hotel_booked,avg_num_seats,total_checked_bags.1,travel_lead_time,avg_cancel_lead_time,total_dist_flown,avg_dist_flown_incl,avg_hotel_stay,avg_hotel_rooms,avg_trip_duration,avg_session_duration,num_flights_discount_applied,num_hotel_discount_applied,num_flights_discount_offered,num_hotel_discount_offered,total_hotel_discount_charges,total_flight_discount_charges,discounted_booking_rate,avg_dist_flown,flight_to_hotel_booking_ratio,checked_bags_ratio,lifetime_customer_value,scaled_ads
0,94883,53.0,F,1,0,kansas city,usa,MCI,2,8,2023-01-10,2023-05-28,8,73,9.125,67.750000,0,2,2,864.09,230.0,1094.09,108.01125,28.750,432.045000,0.125,0.0000,0.000000,1.5,1.500000,1.000000,1451.335404,1.500000,7.500000,8,138,2023-05-28,0.250,2902.670807,1,0.500000,1.000000,Middle Aged,1094.09,547.045000,136.76125,0.766667,0.125,0.000,,0.100000,0.125,0.000000,0.000000,2,0.250,138.0,9.125,0.500000,0.000000,2,2,1.5,1,7.500000,,2902.670807,362.833851,1.000000,1.500000,1.0,0 days 00:01:07.750000,0,0,0,1,0.00,0.0000,0.000000,1451.335404,1.00,0,419.401167,0.000000
1,101486,52.0,F,1,1,tacoma,usa,TCM,2,8,2023-01-21,2023-07-18,8,131,16.375,122.250000,0,1,2,189.91,2452.0,2641.91,23.73875,306.500,94.955000,0.250,0.0000,0.000000,0.5,1.500000,4.500000,497.840800,4.000000,9.000000,8,178,2023-07-18,0.250,995.681600,0,0.000000,0.500000,Middle Aged,2641.91,1320.955000,330.23875,0.988889,0.000,0.250,0.075000,,0.250,0.000000,0.000000,2,0.250,178.0,16.375,0.000000,0.000000,1,2,0.5,0,9.000000,,995.681600,124.460200,4.500000,1.500000,4.0,0 days 00:02:02.250000,0,0,2,0,0.00,0.0000,0.000000,995.681600,0.50,0,1306.277722,0.000000
2,101961,44.0,F,1,0,boston,usa,BOS,5,8,2023-01-19,2023-06-22,8,126,15.750,117.750000,0,5,5,1242.66,2798.0,4040.66,155.33250,349.750,248.532000,0.375,4.9670,0.000727,1.0,1.000000,4.400000,1366.569097,4.800000,6.600000,8,154,2023-06-22,0.625,6832.845483,2,0.400000,1.000000,Middle Aged,4040.66,808.132000,505.08250,0.855556,0.125,0.250,0.150000,0.100000,0.375,0.620875,0.000727,5,0.625,154.0,15.750,0.400000,0.000000,5,5,1.0,2,6.600000,,6832.845483,854.105685,4.400000,1.000000,4.0,0 days 00:01:57.750000,1,0,1,1,0.00,4.9670,0.200000,1366.569097,1.00,0,691.401822,0.005662
3,106907,46.0,F,1,1,miami,usa,TNT,1,8,2023-01-10,2023-07-27,8,240,30.000,758.915066,1,0,0,27804.12,8514.0,36318.12,3475.51500,1064.250,27804.120000,0.125,0.0000,0.000000,12.0,6.000000,22.000000,25594.961081,13.000000,198.500000,8,198,2023-07-27,0.250,25594.961081,10,5.000000,1.000000,Middle Aged,36318.12,36318.120000,4539.76500,1.100000,0.125,0.125,,,0.250,0.000000,0.000000,1,0.125,197.0,30.000,10.000000,1.000000,1,1,12.0,10,198.500000,173.0,25594.961081,3199.370135,22.000000,6.000000,13.0,0 days 00:12:38.915066,1,1,0,0,0.00,0.0000,2.000000,25594.961081,1.00,10,39949.932000,0.000000
4,118043,53.0,F,0,1,los angeles,usa,LAX,5,8,2023-02-05,2023-07-15,8,164,20.500,153.125000,0,3,4,2339.29,6638.0,8977.29,292.41125,829.750,467.858000,0.500,194.5500,0.000000,1.2,1.000000,4.800000,1503.099100,4.666667,9.666667,8,160,2023-07-15,0.625,7515.495499,3,1.000000,0.750000,Middle Aged,8977.29,1795.458000,1122.16125,0.888889,0.375,0.250,0.200000,0.116667,0.625,0.000000,0.000000,5,0.625,159.0,20.500,0.600000,0.000000,3,4,1.2,3,9.666667,,7515.495499,939.436937,4.800000,1.000000,4.0,0 days 00:02:33.125000,1,3,1,0,194.55,0.0000,0.800000,2505.165166,0.75,1,1595.962667,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5777,792549,47.0,F,0,0,kansas city,usa,MCI,4,8,2023-04-30,2023-07-20,8,114,14.250,106.875000,0,4,1,1039.17,180.0,1219.17,129.89625,22.500,259.792500,0.125,0.0000,0.000000,1.0,0.250000,1.250000,1459.165621,3.000000,5.500000,8,81,2023-07-20,0.500,5836.662486,2,0.500000,4.000000,Middle Aged,1219.17,304.792500,152.39625,0.450000,0.000,0.125,0.150000,,0.125,0.000000,0.000000,4,0.500,81.0,14.250,0.500000,0.000000,4,1,1.0,2,5.500000,,5836.662486,729.582811,1.250000,0.250000,3.0,0 days 00:01:46.875000,0,0,1,0,0.00,0.0000,0.000000,1459.165621,4.00,0,137.156625,0.000000
5778,796032,52.0,F,1,0,winnipeg,canada,YAV,3,8,2023-05-01,2023-06-29,8,148,18.500,545.319542,1,2,2,5221.64,1655.0,6876.64,652.70500,206.875,1740.546667,0.250,225.0060,0.011809,2.0,0.666667,3.000000,6351.421470,7.000000,84.000000,8,59,2023-06-29,0.500,19054.264409,5,1.250000,1.333333,Middle Aged,6876.64,2292.213333,859.58000,0.327778,0.125,0.250,0.100000,,0.375,28.125750,0.011809,3,0.375,59.0,18.500,1.666667,0.333333,3,2,2.0,5,84.000000,152.0,19054.264409,2381.783051,3.000000,0.666667,7.0,0 days 00:09:05.319542,2,1,0,0,0.00,450.0120,1.000000,6351.421470,1.50,1,751.336593,0.091980
5779,801660,55.0,F,1,1,toronto,canada,YKZ,3,8,2023-05-03,2023-07-19,8,115,14.375,106.000000,0,3,3,409.96,1081.0,1490.96,51.24500,135.125,136.653333,0.375,21.9195,0.010283,1.0,1.000000,2.333333,710.553530,3.000000,6.666667,8,77,2023-07-19,0.375,2131.660590,1,0.333333,1.000000,Middle Aged,1490.96,496.986667,186.37000,0.427778,0.000,0.375,0.166667,,0.375,2.739937,0.010283,3,0.375,77.0,14.375,0.333333,0.000000,3,3,1.0,1,6.666667,,2131.660590,266.457574,2.333333,1.000000,3.0,0 days 00:01:46,1,0,2,0,0.00,21.9195,0.333333,710.553530,1.00,0,212.599852,0.080094
5780,811077,46.0,F,1,1,knoxville,usa,TYS,1,8,2023-05-06,2023-07-09,8,105,13.125,99.125000,0,1,1,579.79,994.0,1573.79,72.47375,124.250,579.790000,0.375,0.0000,0.000000,1.0,1.000000,7.000000,3223.161635,8.000000,11.000000,8,64,2023-07-09,0.125,3223.161635,0,0.000000,1.000000,Middle Aged,1573.79,1573.790000,196.72375,0.355556,0.125,0.250,0.075000,0.200000,0.375,0.000000,0.000000,1,0.125,64.0,13.125,0.000000,0.000000,1,1,1.0,0,11.000000,,3223.161635,402.895204,7.000000,1.000000,8.0,0 days 00:01:39.125000,0,0,2,1,0.00,0.0000,0.000000,3223.161635,1.00,0,559.569778,0.000000


##### Export the output above to a .csv file

In [None]:
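# index=False keeps the DataFrame's row index out of the file,
# so the CSV holds only the user-level feature columns.
export.to_csv("02_user_level_aggregation.csv", index=False)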
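
A quick way to confirm the export is to read the file back and compare shapes. The cell below is a minimal sketch, not part of the original pipeline: `export_and_verify`, the `sample` frame, and the temporary file name are hypothetical stand-ins for the notebook's `export` DataFrame and its output path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_and_verify(df: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Write df to CSV without the index, read it back, and compare shapes."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape:
        raise ValueError(
            f"shape changed on round trip: {df.shape} -> {reloaded.shape}"
        )
    return reloaded


# Hypothetical stand-in for the notebook's `export` DataFrame.
sample = pd.DataFrame(
    {
        "user_id": [94883, 101486],
        "num_trips": [2, 2],
        "home_city": ["kansas city", "tacoma"],
    }
)

# A temporary directory keeps the sketch from touching the real output file.
with tempfile.TemporaryDirectory() as tmp:
    reloaded = export_and_verify(sample, Path(tmp) / "user_level_sample.csv")
```

In the notebook itself, the same function could be called with `export` and the real file path. If the index were written by mistake, the reloaded frame would gain an extra `Unnamed: 0` column and the shape check would fail.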

##### Shape (rows, columns) of the exported table

In [None]:
export.shape

(5782, 84)
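
Some of the 84 columns appear to overlap. The table preview shows both `total_checked_bags` and `total_checked_bags.1`, which suggests a repeated column label, and pairs such as `mean_clicks_per_session` and `avg_page_clicks` hold the same values in the rows displayed. The cell below is a minimal sketch of how such overlaps could be listed before exporting. `find_redundant_columns` and the `sample` frame are hypothetical, and whether any column should actually be dropped is a decision for the analysis, not something this check decides.

```python
import pandas as pd


def find_redundant_columns(df: pd.DataFrame):
    """Return (repeated labels, pairs of columns with identical values)."""
    # Labels that occur more than once in the header.
    dup_labels = df.columns[df.columns.duplicated()].unique().tolist()

    # Compare columns by position so repeated labels do not cause ambiguity.
    identical = []
    n_cols = df.shape[1]
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            if df.iloc[:, i].equals(df.iloc[:, j]):
                identical.append((df.columns[i], df.columns[j]))
    return dup_labels, identical


# Hypothetical stand-in using values from the first two preview rows.
sample = pd.DataFrame(
    [[9.125, 9.125, 1, 1], [16.375, 16.375, 0, 0]],
    columns=[
        "mean_clicks_per_session",
        "avg_page_clicks",
        "total_checked_bags",
        "total_checked_bags",
    ],
)

dup_labels, identical = find_redundant_columns(sample)
print(dup_labels)
print(identical)
```

Running `find_redundant_columns(export)` in the notebook would apply the same check to the full table. The pairwise loop is quadratic in the number of columns, which is fine for 84 columns.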