# Exploring the flow of data through the database

Here we explore how our data flows through the database. We want to find out where the journeys from the *Journeys* table are lost on the way to the table *SJWaypoints*. The SQL query we wish to analyse is the following:

```SQL
SELECT J.Id as JId, J.CreatedOn, J.SearchStart, J.SearchEnd, J.StartStop, J.EndStop,
    J.StartZone, J.Endzone, J.internalStartZones, J.internalValidZones, 
    SJWaypoints._id as SJId, SJWaypoints.Id,  SJWaypoints.Name, SJWaypoints.Latitude, 
    SJWaypoints.Longitude, SJWaypoints.[Type], SJWaypoints.SJSearchJourney_Id
FROM Journeys J
    JOIN Tickets ON J.Id = Tickets.Journey_Id
    JOIN Orders ON Orders.Id = Tickets.OrderId
    JOIN SJSearchJourneys SJ ON SJ.Id = Orders.JourneyClasses_Id
    JOIN SJWaypoints ON SJWaypoints.SJSearchJourney_Id = SJ.Id
WHERE J.CreatedOn BETWEEN '2022-12-01 00:00:00' and '2023-01-01 00:00:00'
```

We split the query into its individual joins and, for each one, examine the number of rows and the relationship between the two joined tables.

As a starting point, we run the query:
```SQL
SELECT COUNT(*)
FROM Journeys
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

and see that the month has 941,093 registered journeys. How does that number change as we add each join?
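Note that `BETWEEN` includes both endpoints, so a journey created at exactly 2023/01/01 00:00:00 is counted as part of December. A minimal sketch of a half-open range is given here, assuming **CreatedOn** is a date-time column. The counts reported in this document were produced with the `BETWEEN` form, so this variant may differ slightly.

```SQL
-- Half-open range: includes 2022/12/01 00:00:00, excludes 2023/01/01 00:00:00.
SELECT COUNT(*)
FROM Journeys
WHERE Journeys.CreatedOn >= '2022/12/01 00:00:00'
    AND Journeys.CreatedOn < '2023/01/01 00:00:00'
```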


## Journeys and Tickets

*Journeys* is the starting point for our data, since this table consists of actually travelled (bought) journeys. From our ER diagram, we see that the *Journeys* and *Tickets* tables have a 1-to-1 relationship through the primary key **Id** of *Journeys* and the foreign key **Journey_Id** of *Tickets*.

We therefore expect the join to return the same number of rows as *Journeys* alone.

The SQL query used is the following:
``` SQL
SELECT COUNT(*)
FROM Journeys
    JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
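If the count from this join differs from the count of *Journeys* alone, the 1-to-1 expectation does not hold in one of two ways: some journeys have no ticket, or some journeys have more than one. The following sketch checks both cases. It only uses the columns already seen in the queries above; the aliases `JourneysWithoutTicket`, `JourneysWithSeveralTickets` and `Duplicated` are made up for illustration.

```SQL
-- Journeys in the period that have no ticket at all (dropped by the inner join).
SELECT COUNT(*) AS JourneysWithoutTicket
FROM Journeys
    LEFT JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
    AND Tickets.Journey_Id IS NULL;

-- Journeys in the period that match more than one ticket (duplicated by the join).
SELECT COUNT(*) AS JourneysWithSeveralTickets
FROM (
    SELECT Tickets.Journey_Id
    FROM Journeys
        JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
    WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
    GROUP BY Tickets.Journey_Id
    HAVING COUNT(*) > 1
) AS Duplicated;
```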


The *Tickets* table does not necessarily contain information that is explicitly relevant for our purpose. If anything, the column **Type** might indicate whether a journey comes from a standard ticket or a 'pendler-kort' (commuter pass), the latter of which we are not interested in keeping in our training data.


## Tickets and Orders

Here we look at the relationship between *Tickets* and *Orders*. We expect each ticket to match an entry in *Orders*, and we expect one order to be able to cover many tickets (1 to many). Each ticket then still produces one row in the join, so it is the number of distinct orders, rather than the count(*) itself, that we expect to be **LESS** than the 'original' 941,093 results.

```SQL
SELECT COUNT(*)
FROM Tickets
    JOIN Orders ON Tickets.OrderId = Orders.Id
WHERE Tickets.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

For comparison, we also count the tickets alone:
```SQL
SELECT COUNT(*)
FROM Tickets
WHERE Tickets.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
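The two counts above can be combined with the number of distinct orders in a single pass. This is a sketch that assumes **Id** is unique in *Orders* (so the `LEFT JOIN` keeps exactly one row per ticket); the column aliases are made up for illustration.

```SQL
SELECT COUNT(*) AS TicketCount,                      -- one row per ticket
    COUNT(DISTINCT Tickets.OrderId) AS DistinctOrders, -- orders referenced by those tickets
    SUM(CASE WHEN Orders.Id IS NULL THEN 1 ELSE 0 END) AS TicketsWithoutOrder
FROM Tickets
    LEFT JOIN Orders ON Tickets.OrderId = Orders.Id
WHERE Tickets.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

If `DistinctOrders` is clearly smaller than `TicketCount`, that supports the 1-to-many reading of the relationship.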


## Orders and SJSearchJourneys

We expect to see a large drop in this join, since an entry in *Orders* can have an associated value in **JourneyClasses_Id**, in **RejseplanenProduct_Id**, or in neither of them. To analyse *Orders* further, a few additional queries were performed, but the join itself was done using the following query:
```SQL
SELECT COUNT(*)
FROM Orders
    JOIN SJSearchJourneys ON Orders.JourneyClasses_Id = SJSearchJourneys.Id
WHERE Orders.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

We also make use of the following query, which counts the orders alone:
```SQL
SELECT COUNT(*)
FROM Orders
WHERE Orders.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
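Since an order can point to *SJSearchJourneys*, to *RejseplanenProducts*, or to neither, a breakdown by case shows where the drop comes from. The sketch below classifies each order by which of the two columns is populated; the labels, the `LinkType` alias and the 'Both' case are assumptions added for illustration.

```SQL
SELECT LinkType, COUNT(*) AS OrderCount
FROM (
    SELECT CASE
            WHEN Orders.JourneyClasses_Id IS NOT NULL
                AND Orders.RejseplanenProduct_Id IS NOT NULL THEN 'Both'
            WHEN Orders.JourneyClasses_Id IS NOT NULL THEN 'JourneyClasses_Id only'
            WHEN Orders.RejseplanenProduct_Id IS NOT NULL THEN 'RejseplanenProduct_Id only'
            ELSE 'Neither'
        END AS LinkType
    FROM Orders
    WHERE Orders.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
) AS Classified
GROUP BY LinkType;
```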


## SJSearchJourneys and SJWaypoints

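An entry in *SJSearchJourneys* has a one-to-many relationship with *SJWaypoints*. The first query below counts the joined rows and the second counts the search journeys alone.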

```SQL
SELECT COUNT(*)
FROM SJSearchJourneys
    JOIN SJWaypoints ON SJSearchJourneys.Id = SJWaypoints.SJSearchJourney_Id
WHERE SJSearchJourneys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

```SQL
SELECT COUNT(*)
FROM SJSearchJourneys
WHERE SJSearchJourneys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
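Because the relationship is one-to-many, the joined count grows with the number of waypoints per search journey. A sketch of how that fan-out could be measured is given here; the aliases are made up for illustration, and search journeys without any waypoint are not included because of the inner join.

```SQL
SELECT COUNT(*) AS SearchJourneysWithWaypoints,
    AVG(CAST(WaypointCount AS FLOAT)) AS AvgWaypoints,
    MIN(WaypointCount) AS MinWaypoints,
    MAX(WaypointCount) AS MaxWaypoints
FROM (
    -- Number of waypoints attached to each search journey in the period.
    SELECT SJSearchJourneys.Id, COUNT(*) AS WaypointCount
    FROM SJSearchJourneys
        JOIN SJWaypoints ON SJSearchJourneys.Id = SJWaypoints.SJSearchJourney_Id
    WHERE SJSearchJourneys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
    GROUP BY SJSearchJourneys.Id
) AS PerSearchJourney;
```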

# Deep dive into an extract from Journeys x Tickets where the tickets were bought for a joint travel

The SQL command that was run:

```SQL
SELECT * 
FROM Tickets 
JOIN Journeys ON Tickets.Journey_Id = Journeys.Id
WHERE OrderId = 'ed70bcb8-bc89-4983-a068-0049dd5c249d'
```

The OrderId was found by running

```SQL
SELECT OrderId, COUNT(*) AS OrderCount
FROM Tickets
GROUP BY OrderId
HAVING COUNT(*) > 2;
```

The OrderId was then chosen by picking one with a relatively high OrderCount (36). Thus the first SQL query returns a table with 36 distinct journeys (based on journey id).

Please take a look at Appendix_1.md for the full csv file. 

The data is highlighted to showcase the limits of our data.
For the extracted data, the relevant columns show the following:
- The **Type** is set to *Zone*. This would be useful if *SJSearchJourneys* actually contained detailed information on journeys of type *Zone*, but this is not the case for these entries.
- The **StartStop** and **EndStop** are left empty for all journeys, giving no information about the actual journey.
- The **SearchStart** contains the value 'Min lokation (01)' (Danish for 'My location') and thus gives no indication of a location.
- The **SearchEnd** contains the value '2 zoner' ('2 zones'), which gives just as little information.
- The **InternalValidZones** contains the value '1001,1002,1003', which tells us that the ticket for the journey is valid in these three zones.
- The **InternalStartZone** likewise tells us that the journey possibly started in zone 1001; 'possibly' because the column named **StartZone** is empty.

Of course, this is a single, albeit randomly selected, extract, so further research is needed to say whether the above observations hold for a larger number of other 'joint' travels. But the second query above returns (still counting) ~500,000 rows, which shows the amount of joint travels; most of them have around 3-5 tickets, but many have more than 25 as well, probably institutions buying tickets for an entire classroom or similar.
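Rather than reading the group sizes off the raw result, the distribution of tickets per order can be summarised directly. This is a sketch that wraps the grouping query from above; the `NumberOfOrders` and `PerOrder` aliases are made up for illustration.

```SQL
-- For each group size (tickets per order), count how many orders have that size.
SELECT OrderCount, COUNT(*) AS NumberOfOrders
FROM (
    SELECT OrderId, COUNT(*) AS OrderCount
    FROM Tickets
    GROUP BY OrderId
    HAVING COUNT(*) > 2
) AS PerOrder
GROUP BY OrderCount
ORDER BY OrderCount;
```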

* Note that we also attempted to find these journeys in *SJSearchJourneys*, but the order with this OrderId does not contain a value in **JourneyClasses_Id**, which indicates that there is no association between these journeys and *SJSearchJourneys*.
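The lookup behind this note can be sketched as follows, using the same OrderId as in the extract. It simply shows the two link columns of the order so that both can be inspected at once.

```SQL
SELECT Orders.Id, Orders.JourneyClasses_Id, Orders.RejseplanenProduct_Id
FROM Orders
WHERE Orders.Id = 'ed70bcb8-bc89-4983-a068-0049dd5c249d'
```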

## Orders with no entry in SJ or RP

We look into how many journeys in one month, when joined with tickets and orders, have neither a **JourneyClasses_Id** nor a **RejseplanenProduct_Id**. These are found with the following query, which counts the matching entries in *Orders*:

```SQL
SELECT COUNT(*)
FROM Orders
WHERE Orders.JourneyClasses_Id is NULL and Orders.RejseplanenProduct_Id is NULL
and Orders.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

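The result is ***653373*** orders in one month. Since the query counts orders rather than journeys, this means we lose the journeys belonging to "only" 653373 orders when joining onward from *Orders*, so the majority of the lost journeys do not appear to be lost here.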
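The query above counts orders and filters on the creation date of the order. To express the loss in journeys, the same condition can be applied after joining from *Journeys*, filtering on the creation date of the journey as in the other join queries. This is a sketch only; its result has not been run as part of this analysis.

```SQL
-- Journeys in the period whose order links to neither SJ nor RP.
SELECT COUNT(*)
FROM Journeys
    JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
    JOIN Orders ON Orders.Id = Tickets.OrderId
WHERE Orders.JourneyClasses_Id IS NULL AND Orders.RejseplanenProduct_Id IS NULL
    AND Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```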

# The JOINS
## How many journeys can we actually work with?


If we are to work only with data from the *Journeys* table, then each journey needs either both **SearchStart** and **SearchEnd** or both **StartStop** and **EndStop**, where the two values in the pair differ. The number of journeys we can work with is found with the following query:
```SQL
SELECT COUNT(*)
FROM Journeys
WHERE 
    (Journeys.SearchStart IS NOT NULL AND Journeys.SearchEnd IS NOT NULL AND Journeys.SearchStart <> Journeys.SearchEnd)
    OR
    (Journeys.StartStop IS NOT NULL AND Journeys.EndStop IS NOT NULL AND Journeys.StartStop <> Journeys.EndStop);
```

The result is that, out of the entire *Journeys* table, which contains around 43 million journeys, there are ***23449428*** journeys we can "work" with.
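To see which of the two pairs contributes to that number, the conditions can be counted separately. This is a sketch with made-up column aliases. It assumes that missing values are stored as NULL; if empty fields are stored as empty strings instead, an additional `<> ''` check would be needed.

```SQL
SELECT
    -- Journeys with a usable search pair.
    SUM(CASE WHEN Journeys.SearchStart IS NOT NULL AND Journeys.SearchEnd IS NOT NULL
            AND Journeys.SearchStart <> Journeys.SearchEnd THEN 1 ELSE 0 END) AS UsableSearchPair,
    -- Journeys with a usable stop pair.
    SUM(CASE WHEN Journeys.StartStop IS NOT NULL AND Journeys.EndStop IS NOT NULL
            AND Journeys.StartStop <> Journeys.EndStop THEN 1 ELSE 0 END) AS UsableStopPair,
    -- Journeys where both pairs are usable (counted in both columns above).
    SUM(CASE WHEN Journeys.SearchStart <> Journeys.SearchEnd
            AND Journeys.StartStop <> Journeys.EndStop THEN 1 ELSE 0 END) AS UsableBothPairs
FROM Journeys;
```

The total from the `OR` query equals `UsableSearchPair + UsableStopPair - UsableBothPairs`.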

``` SQL
SELECT COUNT(*)
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```

***870115***

``` SQL
SELECT COUNT(*)
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
JOIN Orders ON Orders.Id = Tickets.OrderId
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
***870115***

##### SJ

``` SQL
SELECT COUNT(*)
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
JOIN Orders ON Orders.Id = Tickets.OrderId
JOIN SJSearchJourneys ON Orders.JourneyClasses_Id = SJSearchJourneys.Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
***35948***

``` SQL
SELECT COUNT(*)
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
JOIN Orders ON Orders.Id = Tickets.OrderId
JOIN SJSearchJourneys ON Orders.JourneyClasses_Id = SJSearchJourneys.Id
JOIN SJWaypoints ON SJSearchJourneys.Id =  SJWaypoints.SJSearchJourney_Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
***571087***
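The count rises from 35948 to 571087 because every journey is repeated once per waypoint. To see how many journeys actually reach *SJWaypoints*, the distinct journey ids can be counted next to the joined rows. This is a sketch with made-up aliases; its result has not been run as part of this analysis.

```SQL
SELECT COUNT(*) AS JoinedRows,
    COUNT(DISTINCT Journeys.Id) AS DistinctJourneys
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
JOIN Orders ON Orders.Id = Tickets.OrderId
JOIN SJSearchJourneys ON Orders.JourneyClasses_Id = SJSearchJourneys.Id
JOIN SJWaypoints ON SJSearchJourneys.Id = SJWaypoints.SJSearchJourney_Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```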

##### RP

``` SQL
SELECT COUNT(*)
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
JOIN Orders ON Orders.Id = Tickets.OrderId
JOIN RejseplanenProducts ON Orders.RejseplanenProduct_Id = RejseplanenProducts.Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
***49350***

``` SQL
SELECT COUNT(*)
FROM Journeys
JOIN Tickets ON Journeys.Id = Tickets.Journey_Id
JOIN Orders ON Orders.Id = Tickets.OrderId
JOIN RejseplanenProducts ON Orders.RejseplanenProduct_Id = RejseplanenProducts.Id
JOIN RPWaypoints ON RejseplanenProducts.Id = RPWaypoints.RejseplanenProduct_Id
WHERE Journeys.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```
***394597***
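The individual counts above can also be gathered in one funnel query. The sketch below uses `LEFT JOIN` so that no journey is dropped, and counts distinct journey ids at each stage, so its numbers describe journeys rather than joined rows and need not equal the COUNT(*) results above. The table aliases and column names in the output are made up for illustration.

```SQL
SELECT
    COUNT(DISTINCT J.Id) AS AllJourneys,
    COUNT(DISTINCT CASE WHEN T.Journey_Id IS NOT NULL THEN J.Id END) AS WithTicket,
    COUNT(DISTINCT CASE WHEN O.Id IS NOT NULL THEN J.Id END) AS WithOrder,
    COUNT(DISTINCT CASE WHEN SJ.Id IS NOT NULL THEN J.Id END) AS WithSJ,
    COUNT(DISTINCT CASE WHEN RP.Id IS NOT NULL THEN J.Id END) AS WithRP
FROM Journeys J
    LEFT JOIN Tickets T ON J.Id = T.Journey_Id
    LEFT JOIN Orders O ON O.Id = T.OrderId
    LEFT JOIN SJSearchJourneys SJ ON SJ.Id = O.JourneyClasses_Id
    LEFT JOIN RejseplanenProducts RP ON RP.Id = O.RejseplanenProduct_Id
WHERE J.CreatedOn BETWEEN '2022/12/01 00:00:00' and '2023/01/01 00:00:00'
```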