<a href="https://colab.research.google.com/github/olga-terekhova/colabs/blob/main/Trip_Timeline_Creator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trip Timeline Creator

Create a timeline and merged PDFs from a collection of trip booking files (transport, accommodation). This works well when several groups of people go on the same trip: each group gets its own swimlane on the timeline and its own PDF file with all relevant documents.

## User guide

This notebook reads a list of files from a folder on a mounted Google Drive.

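The filenames are expected to be in this format:
YYYYMMDD HHMM To YYYYMMDD HHMM - Hotel XXXX - Group.pdf

The first date and time mark the start of the activity and the second pair marks its end. The first word after the ' - ' separator is the activity type ('Hotel' in this example), XXXX is free text, and Group is one of the group labels. Add NOPRINT to a filename to keep that file on the timeline but leave it out of the merged PDF.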

Possible activity types: 
'Airplane', 'Taxi', 'Hotel', 'Airbnb', 'Visa'.

Possible group labels are listed in group_list (see Parameters below). Use unique abbreviations as group labels.

The datetimes are used to create a timeline of activities.
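
As an illustration of the naming convention, the sketch below parses one filename with a single regular expression. The pattern, the parse_booking_name helper, and the sample filename are assumptions made for this example; the notebook's own parsing code in the Main code section works differently (it splits on ' - ' and searches each part).

```python
import re
from datetime import datetime

# Hypothetical pattern for
# "YYYYMMDD HHMM To YYYYMMDD HHMM - <Type> <details> - <Group>.pdf"
NAME_PATTERN = re.compile(
    r"^(?P<start>\d{8} \d{4}) To (?P<end>\d{8} \d{4}) - "
    r"(?P<activity>\w+)\b(?P<details>.*) - (?P<group>\w+)\.pdf$"
)

def parse_booking_name(filename):
    """Return a dict of fields, or None if the name does not follow the format."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fmt = "%Y%m%d %H%M"
    return {
        "start": datetime.strptime(match["start"], fmt),
        "end": datetime.strptime(match["end"], fmt),
        "activity": match["activity"],
        "group": match["group"],
    }

# Made-up filename, used only to exercise the helper
sample = "20230701 1500 To 20230705 1100 - Hotel Example - OV.pdf"
parsed = parse_booking_name(sample)
print(parsed["activity"], parsed["group"], parsed["start"], parsed["end"])
```

Returning None for names that do not match makes it easy to list misnamed files before building the timeline.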

PDFs referring to the same group of people are merged into one single PDF. 

The results are saved in the Google Colab instance root folder.

## Installation prerequisites

Run only this cell first:

In [1]:
# Install Kaleido to save charts as pictures

!pip install -U kaleido

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaleido
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.9/79.9 MB 7.1 MB/s eta 0:00:00
Installing collected packages: kaleido
Successfully installed kaleido-0.2.1


Restart the runtime.

After the restart, continue running from the next cell.

## Parameters

In [1]:
# List of abbreviations for the groups of people on the trip. Each file name should contain one of them
group_list = ["OMML", "OV", "EV"]

In [2]:
# Define parent folder for Google Drive
parent_folder_name = "/content/drive/My Drive/"

# Define location to a config json file which stores a path to the booking files
# The config json file should store this content: 
#    {"trip_timeline": "/content/drive/My Drive/... your folder name in Google Drive where source documents are stored/"}
config_file_name = "Colab Notebooks/config_trip_timeline.json"
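
The config file described in the comments above can be created by hand or with a few lines of Python. A minimal sketch, assuming a hypothetical write_config helper and a placeholder documents folder ("Trip Docs"); in Colab the target path would be parent_folder_name + config_file_name, while the demo below writes to a temporary directory.

```python
import json
import os
import tempfile

def write_config(config_path, documents_folder):
    """Write {"trip_timeline": <folder>} to config_path (hypothetical helper)."""
    # The notebook concatenates folder and file name, so keep a trailing slash
    if not documents_folder.endswith("/"):
        documents_folder += "/"
    os.makedirs(os.path.dirname(config_path) or ".", exist_ok=True)
    with open(config_path, "w") as file:
        json.dump({"trip_timeline": documents_folder}, file)
    return documents_folder

# Demo in a temporary directory; the folder name is a placeholder
demo_path = os.path.join(tempfile.mkdtemp(), "Colab Notebooks", "config_trip_timeline.json")
saved_folder = write_config(demo_path, "/content/drive/My Drive/Trip Docs")
print(saved_folder)
```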


## Main code

In [3]:
# Install PyPDF2
!pip install PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 5.0 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
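# Load the Drive helper and mount Google Drive
from google.colab import drive

# Mount point must match parent_folder_name ("/content/drive/My Drive/")
drive.mount('/content/drive')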


In [5]:
# Read project parameters
# Set up beforehand: add a file Colab Notebooks/config_trip_timeline.json with the content:
# {"trip_timeline": "/content/drive/My Drive/... your folder name where source documents are stored/"} 

import json

# Read the JSON file
with open(parent_folder_name + config_file_name, "r") as file:
    config_data = json.load(file)

# Get the value of trip_timeline
folder_name = config_data["trip_timeline"]

In [6]:
# Read the list of files in the folder
filelist = !ls "$folder_name"

In [7]:
# Print the list if desired
# for file in filelist:
#  print(file)

In [8]:
# Parsing the names of files
import re
import pandas as pd

date_regex = r'(\d{8})'
time_regex = r'(\d{4})'
print_guide = ["NOPRINT"] # if added to the name of the file, it won't be included in the final PDF

# Empty list to store the extracted information (one dictionary per file)
data = []

for file in filelist:
  file_string = file.strip("\'")
  
  # Splitting the input string using ' - ' as the separator
  parts = file_string.split(' - ', 1)  

  # Extracting PeriodString and ActionString
  PeriodString = parts[0]
  ActionString = parts[1]

  # Extracting StartDate, StartTime, EndDate, and EndTime from PeriodString
  StartDate = re.search(date_regex, PeriodString).group(1)
  StartTime = re.search(time_regex, PeriodString[9:]).group(1)
  EndDate = re.search(date_regex, PeriodString[PeriodString.index('To') + 3:]).group(1)
  EndTime = re.search(time_regex, PeriodString[PeriodString.index('To') + 12:]).group(1)

  # Extracting ActionType from ActionString
  ActionType = ActionString.split(' ')[0]

  # Searching for substrings in the group_list
  Group = None
  for group in group_list:
    if group in ActionString:
        Group = group
        break

  # Searching for print instructions
  Print_Ins = None
  for print_ins in print_guide:
    if print_ins in ActionString:
      Print_Ins = print_ins
      break

  file_dict = {
        'FileName': file_string,
        'StartDate': StartDate,
        'StartTime': StartTime,
        'EndDate': EndDate,
        'EndTime': EndTime,
        'ActionType': ActionType,
        'Group': Group,
        'Print_Ins': Print_Ins
  
  }

  # Appending the dictionary to the data list
  data.append(file_dict)


# Creating a pandas DataFrame from the list of dictionaries
df = pd.DataFrame(data)

# Printing the DataFrame if desired
# print(df)
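
The group lookup above uses plain substring matching (group in ActionString), so a short label can also match inside a longer word. With the sample group_list, 'OV' would be found inside a hotel name such as 'NOVOTEL' before 'EV' is ever checked. One possible safeguard, sketched here as an assumption and not part of the original notebook, is to match labels only as whole words. The find_group helper and the sample strings are made up for illustration.

```python
import re

def find_group(action_string, group_list):
    """Return the first label that appears as a whole word, else None."""
    for group in group_list:
        # \b keeps "OV" from matching inside a longer word such as "NOVOTEL"
        if re.search(r"\b" + re.escape(group) + r"\b", action_string):
            return group
    return None

groups = ["OMML", "OV", "EV"]
first = find_group("Hotel NOVOTEL - EV.pdf", groups)
second = find_group("Taxi - OV.pdf", groups)
print(first, second)  # EV OV
```

Note that an underscore counts as a word character, so a label written as 'OV_' would not match under this scheme.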


In [9]:
%%capture  

# (Remove %%capture to see the output figure)

# Creating a chart with timeline
import plotly.express as px

# Convert StartDate and StartTime to necessary format
df['Start'] = pd.to_datetime(df['StartDate'] + ' ' + df['StartTime'], format='%Y%m%d %H%M')

# Convert EndDate and EndTime to necessary format
df['Finish'] = pd.to_datetime(df['EndDate'] + ' ' + df['EndTime'], format='%Y%m%d %H%M')

# Set the width based on ActionType
df['Width'] = df['ActionType'].apply(lambda x: 0.6 if x == 'Airplane' else
                                           0.5 if x == 'Taxi' else
                                           0.4 if x in ['Hotel', 'Airbnb'] else
                                           0.4)

# Set the color map for every activity
cm = {'Airplane':'#89c0d6', 'Taxi':'#d87cab', 'Hotel': '#d4d351', 'Airbnb':'#93b264', 'Visa': '#e06c6c'}

# Create the timeline chart
fig = px.timeline(df, x_start='Start', x_end='Finish', y='Group', color='ActionType', color_discrete_map=cm,
                  text='ActionType', 
                  width=4000)

# Change bar width for each action type (one trace per ActionType)
for trace in fig.data:
    trace.width = df.loc[df['ActionType'] == trace.name, 'Width'].tolist()

# Customize the layout
fig.update_layout(
    title='Action Timeline',
    plot_bgcolor='#fcfcfc',
    uniformtext_minsize=7, uniformtext_mode='show',
    xaxis=dict(title='Time', 
               tickformat='%Y-%m-%d',
               dtick='D1',        
               showgrid=True,  # Show grid lines
               gridcolor='black',  # Set grid lines color to black
               ticklabelmode="period"
 ),
    yaxis=dict(title='Group'),
    showlegend=True
)

# Place the text labels inside the bars, left-aligned, in black
fig.update_traces(textposition='inside',textfont=dict(color='black'),insidetextanchor="start")

# Show the chart
fig.show()


In [12]:
import kaleido

# Save the chart as a PNG file
fig.write_image('timeline_chart.png', engine='kaleido')
print("Timeline saved to timeline_chart.png")


Timeline saved to timeline_chart.png


In [11]:
# Printing a PDF for every group
from PyPDF2 import PdfMerger

for group in group_list:
  file_names = df[(df['Group'] == group) & (df['Print_Ins'] != 'NOPRINT') ]['FileName'].tolist()

  full_file_names = []
  for file_name in file_names:
    full_file_name = folder_name + file_name
    full_file_names.append(full_file_name)

  # Create a PdfMerger object
  merger = PdfMerger(strict=False)

  # Append each PDF file to the merger
  for pdf_file in full_file_names:
      merger.append(pdf_file)

  # Output file name for the merged PDF
  output_file = '2023_Timeline_All_Docs_'+group+'.pdf'

  # Write the merged PDF to the output file
  merger.write(output_file)

  # Close the merger
  merger.close()

  # Signal ending
  print("Finished for "+group + ". See 2023_Timeline_All_Docs_"+group+".pdf.")




Finished for OMML. See 2023_Timeline_All_Docs_OMML.pdf.
Finished for OV. See 2023_Timeline_All_Docs_OV.pdf.
Finished for EV. See 2023_Timeline_All_Docs_EV.pdf.
