# Project: TravelTide
---

### Data Cleansing

1. Importing External Libraries
2. Connecting to SQL Database
3. Data Cleansing

Tables:

* users - No data cleansing required
* sessions - No data cleansing required
* flights - No data cleansing required
* hotels - Data cleansing required


---
### First Step : Importing External Libraries

In [None]:
import pandas as pd
import sqlalchemy as sa
import matplotlib.pyplot as plt

---
### Second Step : Connecting to SQL Database

1. Create an engine
2. Make a connection

In [None]:
traveltide_url = 'postgresql://Test:bQNxVzJL4g6u@ep-noisy-flower-846766.us-east-2.aws.neon.tech/TravelTide?sslmode=require'

In [None]:
engine = sa.create_engine(traveltide_url)
connection = engine.connect()
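
The connection URL above embeds the credentials directly in the notebook. As a minimal sketch of one alternative, the URL could be assembled from environment variables before it is passed to `sa.create_engine`. The helper `build_db_url` and the `TRAVELTIDE_*` variable names are hypothetical and not part of the original project; the example values are placeholders.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from hypothetical TRAVELTIDE_* variables."""
    user = env["TRAVELTIDE_USER"]
    # Percent-encode the password so characters such as '@' or '/' stay valid in a URL.
    password = quote_plus(env["TRAVELTIDE_PASSWORD"])
    host = env["TRAVELTIDE_HOST"]
    # Assumption: fall back to the database name used in this notebook.
    database = env.get("TRAVELTIDE_DB", "TravelTide")
    return f"postgresql://{user}:{password}@{host}/{database}?sslmode=require"


# Placeholder values for illustration only; real values would come from the environment.
example_url = build_db_url({
    "TRAVELTIDE_USER": "demo_user",
    "TRAVELTIDE_PASSWORD": "p@ss/word",
    "TRAVELTIDE_HOST": "db.example.com",
})
print(example_url)
```

The resulting string can be used wherever `traveltide_url` is used, for example `sa.create_engine(build_db_url())`.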

---
## SQL
---

### Data Cleansing

#### Table: hotels

##### Negative hotel nights in hotels table

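* This query corrects potentially swapped check-in and check-out times, then calculates the number of nights and the stay duration.
* It returns only stays of at least 2 hours and under 24 hours: `duration_hours` is `NULL` for stays of 24 hours or more, so the final `WHERE duration_hours >= 2` filter excludes them along with stays shorter than 2 hours.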
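
The swap-and-measure logic described above can be sketched in plain Python for a single stay. This is an illustrative sketch that mirrors the SQL, not a replacement for it; the function name `clean_stay` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def clean_stay(check_in, check_out):
    """Return (check_in, check_out, nights, duration_hours) for one stay."""
    # Swap the timestamps when the recorded check-out precedes the check-in.
    if check_out < check_in:
        check_in, check_out = check_out, check_in
    # Nights are counted as whole calendar days between the two dates.
    nights = (check_out.date() - check_in.date()).days
    # The duration in hours is reported only for stays shorter than 24 hours.
    elapsed = check_out - check_in
    if elapsed < timedelta(days=1):
        duration_hours = elapsed.total_seconds() / 3600.0
    else:
        duration_hours = None
    return check_in, check_out, nights, duration_hours


# Hypothetical swapped record: the check-out is stored before the check-in.
row = clean_stay(datetime(2023, 3, 12, 11, 0), datetime(2023, 3, 11, 23, 0))
print(row)
```

For this sample the timestamps are swapped back into order, giving one night and a duration of 12 hours.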


In [None]:
query = """
        -- 1. CTE: time cleaning
        WITH ordered_times AS (
          SELECT
            trip_id,
            CASE
              WHEN check_out_time < check_in_time THEN check_out_time
              ELSE check_in_time
            END AS cleaned_check_in_time,

            CASE
              WHEN check_out_time < check_in_time THEN check_in_time
              ELSE check_out_time
            END AS cleaned_check_out_time
          FROM hotels
        ),

        -- 2. CTE: nights and stay duration calculation
        calculated_stay AS (
          SELECT
            *,
            -- difference in nights (whole calendar days)
            (cleaned_check_out_time::date - cleaned_check_in_time::date) AS cleaned_nights,

            -- stay duration in hours (only when under 24 hours)
            CASE
              WHEN cleaned_check_out_time - cleaned_check_in_time < INTERVAL '1 day'
              THEN EXTRACT(EPOCH FROM (cleaned_check_out_time - cleaned_check_in_time)) / 3600.0
              ELSE NULL
            END AS duration_hours
          FROM ordered_times
        )

        -- Final result
        SELECT *
        FROM calculated_stay
        WHERE duration_hours >= 2 -- drops stays under 2 hours; stays of 24 hours or more are also dropped because duration_hours is NULL

        """
pd.read_sql(query, engine)

| | trip_id | cleaned_check_in_time | cleaned_check_out_time | cleaned_nights | duration_hours |
|---|---|---|---|---|---|
| 0 | 629809-2f3c89dc72014fb8a0fa4146a49af106 | 2023-03-11 22:11:22.965 | 2023-03-12 11:00:00.000 | 1 | 12.810287 |
| 1 | 1178-1679eaf1aa1541bca9df58d1aee11e4d | 2023-03-10 18:58:03.945 | 2023-03-11 11:00:00.000 | 1 | 16.032238 |
| 2 | 11242-5e5a1f9869d34d189286db98790b7e76 | 2023-03-08 18:10:12.720 | 2023-03-09 11:00:00.000 | 1 | 16.829800 |
| 3 | 13006-a24c8619a1e44bafa87de64547ead304 | 2023-03-07 11:11:31.200 | 2023-03-08 11:00:00.000 | 1 | 23.808000 |
| 4 | 13487-e349d708bc1d4768ad9270cce1fa23f5 | 2023-03-09 11:15:00.810 | 2023-03-10 11:00:00.000 | 1 | 23.749775 |
| ... | ... | ... | ... | ... | ... |
| 151890 | 629607-74e60e14f69b4af3be0e0f5abe2622bf | 2023-03-06 20:58:03.945 | 2023-03-07 11:00:00.000 | 1 | 14.032238 |
| 151891 | 629652-bf4beedf9028424b8ddc3b8a06285dea | 2023-03-09 11:47:46.410 | 2023-03-10 11:00:00.000 | 1 | 23.203775 |
| 151892 | 629679-a06185e46ebb41bba0e200185c4cb146 | 2023-03-09 17:10:34.590 | 2023-03-10 11:00:00.000 | 1 | 17.823725 |
| 151893 | 629698-e41a8444d35b49c3a88880b4f5ea3a6d | 2023-03-14 14:32:49.245 | 2023-03-15 11:00:00.000 | 1 | 20.452987 |
