# 2024: Week 5 - Getting the right data
### January 31, 2024
### Challenge by: Jenny Martin

It's the final week of beginner month and we're going to spend a little more time diving deeper into joins, calculations and outputs. 

Prep Air are interested in creating a workflow that has multiple outputs depending on user requirements. They want users to be able to answer the following questions:

 - What are the details of the customers who have booked flights and which routes are they travelling on?
 - Which customers are yet to book a flight in 2024?
 - Which flights are yet to be booked by customers in 2024?

The datasets you'll be working with are fairly large so you'll need to decide which tables to join (and when) to be as efficient as possible. You may wish to use this as an opportunity to explore the sampling options in Tableau Prep too!

### Inputs
There are 3 tables to connect to for this challenge:
 - Prep Air Ticket Sales
 - Prep Air 2024 Flights
 - Prep Air Customers

### Requirements
 - Input the data
 - For the first output:
    - Create a dataset that gives all the customer details for booked flights in 2024. Make sure the output also includes details on the flight's origin and destination
    - When outputting the data, create an Excel file with a new sheet for each output (so 1 file for all outputs this week!)
 - For the second output:
    - Create a dataset that allows Prep Air to identify which flights have not yet been booked in 2024
    - Add a datestamp field to this dataset for today's date (31/01/2024) so that Prep Air know the unbooked flights as of the day the workflow is run
    - When outputting the table to a new sheet in the Excel Workbook, choose the option "Append to Table" under Write Options. This means that if the workflow is run on a different day, the results will add additional rows to the dataset, rather than overwriting the previous run's data
 - For the third output:
    - Create a dataset that shows which customers have yet to book a flight with Prep Air in 2024
    - Create a field which will allow Prep Air to see how many days it has been since the customer last flew (compared to 31/01/2024)
    - Categorise customers into the following groups:
        - Recent fliers - flown within the last 3 months
        - Taking a break - 3-6 months since last flight
        - Been away a while - 6-9 months since last flight
        - Lapsed Customers - over 9 months since last flight
    - Output the data to a new sheet in the Excel Workbook
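
The day count and grouping for the third output could be sketched in pandas as below. This is a minimal sketch on made-up rows, not the official solution: the toy `customers` frame, the `REFERENCE_DATE` name and the lower-case column names are assumptions. The challenge does not say how to treat a customer who sits exactly on a boundary, so this sketch assumes calendar months counted back from 31/01/2024 and puts a boundary date in the more recent group.

```python
import numpy as np
import pandas as pd

# Assumption: the fixed "today" named in the challenge text.
REFERENCE_DATE = pd.Timestamp("2024-01-31")

# Toy data (hypothetical rows, same shape as the customers table).
customers = pd.DataFrame({
    "customer id": [1, 2, 3, 4],
    "last date flown": pd.to_datetime(
        ["2023-01-05", "2023-06-15", "2023-09-01", "2023-12-23"]
    ),
})

# Whole days between the last flight and the reference date.
customers["days since last flown"] = (
    REFERENCE_DATE - customers["last date flown"]
).dt.days

# Calendar-month cutoffs counted back from the reference date.
three_months_ago = REFERENCE_DATE - pd.DateOffset(months=3)
six_months_ago = REFERENCE_DATE - pd.DateOffset(months=6)
nine_months_ago = REFERENCE_DATE - pd.DateOffset(months=9)

last_flown = customers["last date flown"]

# np.select takes the first condition that is true for each row,
# so the conditions run from most recent to least recent.
customers["customer category"] = np.select(
    [
        last_flown >= three_months_ago,
        last_flown >= six_months_ago,
        last_flown >= nine_months_ago,
    ],
    ["Recent fliers", "Taking a break", "Been away a while"],
    default="Lapsed Customers",
)

print(customers)
```

Comparing dates against month cutoffs avoids having to decide how many days a "month" is; binning the day count (for example with `pd.cut`) would be an alternative if a fixed month length is acceptable.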
 
 
### Outputs
#### 2024 Booked Flights Output:

 - 11 fields
 - Date
 - From
 - To
 - Flight Number
 - Customer ID
 - Last Date Flown
 - first_name
 - last_name
 - email
 - gender
 - Ticket Price
 - 44,768 rows (44,769 including headers)
 
#### Unbooked Flights Output:

 - 5 fields
 - Flight unbooked as of
 - Date
 - Flight Number
 - From
 - To
 - 296 rows (297 including headers)

#### Customers Yet to Book in 2024 Output:

 - 8 fields
 - Customer ID
 - Customer Category
 - Days Since Last Flown
 - Last Date Flown
 - first_name
 - last_name
 - email
 - gender
 - 1,260 rows (1,261 including headers)


You can download the outputs from here if you want to check your results.

After you finish the challenge, make sure to fill in the participation tracker, then share your solution on Twitter using #PreppinData and tagging @Datajedininja, @JennyMartinDS14 & @TomProwse1.
You can also post your solution on the Tableau Forum where we have a Preppin' Data community page. Post your solutions and ask questions if you need any help! 


In [3]:
import zipfile
import os
import pandas as pd

In [4]:
def unzip_file(zip_file, extract_to):
    """
    Unzips a zip file to a specified directory.

    Args:
        zip_file (str): Path to the zip file to be extracted.
        extract_to (str): Directory where the contents of the zip file will be extracted.
    """
    try:
        with zipfile.ZipFile(zip_file, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print("Extraction complete.")
    except zipfile.BadZipFile:
        print("Error: Not a valid zip file.")
    except Exception as e:
        print(f"An error occurred: {e}")


unzip_file('Inputs-20240319T163303Z-001.zip', os.getcwd())


Extraction complete.


In [6]:
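def cln_clm(df):
    """Normalise column names in place: lower-case, trimmed, underscores removed (e.g. 'first_name' -> 'firstname')."""
    df.columns = [col.lower().strip().replace('_', '') for col in df.columns]
    return df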

In [15]:
# Input 1: Customers
customers_2024_df=cln_clm(pd.read_csv("Prep Air Customers.csv"))
customers_2024_df.head()

Unnamed: 0,customer id,last date flown,firstname,lastname,email,gender
0,1,2023-01-05,Denyse,Gebuhr,dgebuhr0@vinaora.com,Female
1,2,2023-10-05,Keene,Devennie,kdevennie1@plala.or.jp,Male
2,3,2023-11-09,Tyler,McGrail,tmcgrail2@nyu.edu,Male
3,4,2023-11-22,Drusi,Ibeson,dibeson3@hostgator.com,Female
4,5,2023-12-23,Stanwood,Seacroft,sseacroft4@wikispaces.com,Male


In [11]:
# Input 2: Flights
flights_2024_df=cln_clm(pd.read_csv("Prep Air 2024 Flights.csv"))
flights_2024_df.head()

Unnamed: 0,date,flight number,from,to
0,2024-11-22,PA001,London,New York
1,2024-11-23,PA001,London,New York
2,2024-11-23,PA002,New York,London
3,2024-11-24,PA001,London,New York
4,2024-11-27,PA001,London,New York


In [14]:
# Input 3: Tickets
tickets_df=cln_clm(pd.read_csv('Prep Air Ticket Sales.csv'))
tickets_df.head()


Unnamed: 0,date,flight number,customer id,ticket price
0,2024-01-03,PA001,232,818.99
1,2024-01-03,PA001,293,1947.99
2,2024-01-03,PA001,472,1350.99
3,2024-01-03,PA001,572,905.99
4,2024-01-03,PA001,1191,567.99


In [19]:
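# Join customers to their ticket sales, then to the flight details.
# The second merge is an inner join (the pandas default), so customers with no ticket (and so no matching flight) drop out here.
merged_df=customers_2024_df.merge(tickets_df,on='customer id',how='left')
merged_df=merged_df.merge(flights_2024_df,on=['flight number','date'])

# Convert the date column from string to datetime
merged_df['date']=pd.to_datetime(merged_df['date'])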

In [31]:
#  - For the first output:
#     - Create a dataset that gives all the customer details for booked flights in 2024. Make sure the output also includes details on the flights origin and destination
#     - When outputting the data, create an excel file with a new sheet for each output (so 1 file for all outputs this week!)

output_1_df=merged_df.copy()

In [32]:
#  - For the second output:
#     - Create a dataset that allows Prep Air to identify which flights have not yet been booked in 2024
#     - Add a datestamp field to this dataset for today's date (31/01/2024) so that Prep Air know the unbooked flights as of the day the workflow is run
#     - When outputting the table to a new sheet in the Excel Workbook, choose the option "Append to Table" under Write Options. This means that if the workflow is run on a different day, the results will add additional rows to the dataset, rather than overwriting the previous run's data
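
The first two steps of the second output amount to an anti-join: keep the flights that have no matching ticket sale, then stamp them with the run date. A minimal sketch of that idea on made-up rows follows. It uses the `indicator=True` option of `DataFrame.merge`; the toy frames, `RUN_DATE`, `flagged` and `unbooked` are hypothetical names, and the date is hard-coded to 31/01/2024 only to match the challenge text.

```python
import pandas as pd

# Assumption: fixed run date taken from the challenge text.
RUN_DATE = pd.Timestamp("2024-01-31")

# Toy data (hypothetical rows shaped like the flights and ticket sales inputs).
flights = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-04", "2024-01-05"],
    "flight number": ["PA001", "PA002", "PA001"],
    "from": ["London", "New York", "London"],
    "to": ["New York", "London", "New York"],
})
tickets = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-03", "2024-01-04"],
    "flight number": ["PA001", "PA001", "PA002"],
    "customer id": [1, 2, 1],
})

# Only the join keys are needed from the ticket table; de-duplicating them
# first keeps the left join at one row per flight.
sold_keys = tickets[["date", "flight number"]].drop_duplicates()

# indicator=True adds a "_merge" column saying where each row came from.
flagged = flights.merge(
    sold_keys, on=["date", "flight number"], how="left", indicator=True
)

# "left_only" rows are flights with no ticket sale at all.
unbooked = (
    flagged.loc[flagged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .assign(**{"flight unbooked as of": RUN_DATE})
    .reset_index(drop=True)
)

print(unbooked)
```

Reducing the ticket table to its distinct keys before the join means no grouping step is needed afterwards to collapse repeated rows.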

In [68]:
from datetime import date as dt

today=dt.today()

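# Left-join flights to ticket sales; a flight with no sale has a missing customer id
not_booked_df=flights_2024_df.merge(tickets_df,how='left',on=['date','flight number'])

# Keep only the unbooked flights and count the rows for each one
not_booked_df=not_booked_df.loc[not_booked_df['customer id'].isna()].groupby(['date','flight number','from','to']).agg({'date':'count'})

# Name the count column and turn the group keys back into ordinary columns
not_booked_df=not_booked_df.rename(columns={'date':'count'}).reset_index()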
not_booked_df

Unnamed: 0,date,flight number,from,to,count
0,2024-11-15,PA005,London,Tokyo,1
1,2024-11-16,PA011,Perth,Tokyo,1
2,2024-11-18,PA003,London,Perth,1
3,2024-11-18,PA004,Perth,London,1
4,2024-11-18,PA007,New York,Perth,1
...,...,...,...,...,...
291,2024-12-31,PA004,Perth,London,1
292,2024-12-31,PA005,London,Tokyo,1
293,2024-12-31,PA006,Tokyo,London,1
294,2024-12-31,PA010,Tokyo,New York,1
