## Delay Report
### Overview
The delay report script finds the `updated_eta` and `updated_etd` of the vessels listed in `Vessel Delay Tracking.xlsx`. It does this by querying the underlying BigSchedules API and the MSC Web API, and by reading a static G2 Schedules Excel document. None of the API interactions use Rio Tinto credentials, so that the requests cannot be traced back.

The script is modular, which makes it easier to maintain and improves code quality. Configuration files are stored in a `data` subdirectory. The script expects the `Vessel Delay Tracking.xlsx` file and the G2 Schedule Excel file (`g2_filename`) in the same directory.
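
As a rough illustration of the configuration described above, the sketch below parses a hypothetical `data/config.json`. The keys `delay_filename`, `randomiser_lower_interval` and `randomiser_upper_interval` are the ones read by the code in this notebook; treating `g2_filename` as a key in the same file is an assumption, and every value shown is a placeholder.

```python
import json

# Hypothetical contents of data/config.json; all values are placeholders.
EXAMPLE_CONFIG = """
{
  "delay_filename": "Vessel Delay Tracking.xlsx",
  "g2_filename": "G2 Schedule.xlsx",
  "randomiser_lower_interval": 2,
  "randomiser_upper_interval": 5
}
"""

config = json.loads(EXAMPLE_CONFIG)

# The wait between API requests is drawn from this (lower, upper) range, in seconds.
interval = (config["randomiser_lower_interval"], config["randomiser_upper_interval"])
print(config["delay_filename"], interval)
```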

### Features
1. Avoids detection
    - Uses API calls instead of Selenium, which is easily detectable
    - Uses randomised timing for API requests
2. Modular
    - If one component breaks, you can always disable it without affecting the other modules
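
The randomised timing mentioned above can be sketched as a small helper that sleeps for a random whole number of seconds between requests. The helper name `wait_randomly` and its injectable `sleep` argument are illustrative assumptions, not part of the script, which calls `time.sleep(random.randint(*self.interval))` inline.

```python
import random
import time


def wait_randomly(interval, sleep=time.sleep):
    """Sleep for a random whole number of seconds within interval (inclusive) and return it."""
    lower, upper = interval
    delay = random.randint(lower, upper)  # randint includes both endpoints
    sleep(delay)
    return delay


# Passing a no-op sleep shows the chosen delays without actually waiting.
delays = [wait_randomly((2, 5), sleep=lambda seconds: None) for _ in range(3)]
print(delays)
```

Keeping the sleep function injectable makes the helper easy to exercise without slowing down a test run.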

In [1]:
# Imports
import pandas as pd
import numpy as np
import random
import os
import json
import requests
import time

from tqdm.auto import tqdm
from pathlib import Path
from datetime import datetime

In [2]:
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)

In [3]:
# Utility functions
def write_json(response: dict, output_file: str):
    with open(output_file, 'w') as w:
        json.dump(response, w, indent=2)

def read_config(instance: object, attr_name: str, path_to_config: str):
    with open(path_to_config, "r") as f:
        setattr(instance, attr_name, json.load(f))

In [8]:
# Read configuration file
with open("data/config.json", "r") as f:
    config = json.load(f)
    
# Used to map carrier names to the ones BigSchedules uses and supports
with open("data/carrier_mapping.json", "r") as f:
    carrier_mapping = json.load(f)

# BigSchedules login
with open("data/bigschedules_login.json", "r") as f:
    bs_login = json.load(f)
    
# Prepare base information
# UNLOCODE to port name mapping
port_mapping = (
    pd.concat([pd.read_csv(p, usecols=[1, 2, 4, 5], engine='python', names=[
              'country', 'port', 'name', 'subdiv']) for p in Path('data').glob("*UNLOCODE CodeListPart*")])
    .query('port == port')
    .assign(
        uncode=lambda x: x.country.str.cat(x.port),
        full_name=lambda x: np.where(
            x.subdiv.notnull(), x.name.str.cat(x.subdiv, sep=", "), x.name)
    )
    .drop_duplicates('uncode')
    .set_index('uncode')
    .to_dict('index')
)

# Read the vessel delay tracking file
xl = pd.ExcelFile('Vessel Delay Tracking.xlsx')

### BSExtractor
The main BigSchedules Web API takes parameters that we can only obtain by querying its sub-APIs first.
These parameters and their sources are:

1. `_`: a timestamp in `YYYYMMDDHH` format (24-hour clock)
2. `carrierId`: GET from https://www.bigschedules.com/api/carrier/fuzzyQuery
3. `scac`: GET from https://www.bigschedules.com/api/carrier/fuzzyQuery
4. `vesselGid`: GET from https://www.bigschedules.com/api/vessel/list?_=2020081916&vesselName=maersk+danube; this query also requires the `_` timestamp

Only after querying these sub-APIs do we have the parameter values needed to query the main web API.
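
As a minimal sketch of the first step, the `_` timestamp and the `vesselGid` lookup URL from the list above could be built as follows. The function names are hypothetical, and the assumption that `_` is simply the current date and hour is inferred from the `YYYYMMDDHH` format and the example URL; parsing the sub-API response is not covered here.

```python
from datetime import datetime
from urllib.parse import urlencode

BS_VESSEL_LIST_URL = "https://www.bigschedules.com/api/vessel/list"


def bs_timestamp(now: datetime) -> str:
    """Format a datetime as YYYYMMDDHH (24-hour clock) for the `_` parameter."""
    return now.strftime("%Y%m%d%H")


def vessel_list_url(vessel_name: str, now: datetime) -> str:
    """Build the vessel lookup URL; urlencode turns spaces into '+'."""
    params = {"_": bs_timestamp(now), "vesselName": vessel_name}
    return f"{BS_VESSEL_LIST_URL}?{urlencode(params)}"


print(vessel_list_url("maersk danube", datetime(2020, 8, 19, 16)))
# https://www.bigschedules.com/api/vessel/list?_=2020081916&vesselName=maersk+danube
```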

In [14]:
# Use the sheet whose name is the most recent date (sheet names are formatted dd.mm.yyyy)
latest_sheet = (pd.to_datetime(xl.sheet_names, errors='coerce', format='%d.%m.%Y')
                .max().date().strftime('%d.%m.%Y'))
bigschedules_sheet = (
    xl.parse(latest_sheet, parse_dates=True)
    .query(f"`Fwd Agent` in {list(carrier_mapping)}")
    .replace({'Fwd Agent': carrier_mapping})
)

In [11]:
# Get port name (None if the UNLOCODE is not in port_mapping)
bigschedules_sheet = bigschedules_sheet.assign(pol_name=lambda x: x['Port of Loading'].apply(lambda y: port_mapping.get(y, {}).get('name')),
                                               pod_name=lambda x: x['Port of discharge'].apply(lambda y: port_mapping.get(y, {}).get('name')))

In [20]:
bigschedules_sheet

Unnamed: 0,Plnt,Req. Delivery Date,Shipment,Term,Sold-to-Party Name,Ship-to-Pty,Sales Ord.,Delivery,Description,Product Type,Vessel,Voyage,ETD Date,Disport ETA,Gross Weight,Port of Loading,Port of discharge,Incoterms Part2,No. of Containers,Container Type,MetPro Status,Fwd Agent,Booking Ref.,Reason for rejection description,No. of bundles,Item Status Information,Incoterms Part1,Shipping Cond,updated_etd,updated_eta,No. of days delayed ETD,No. of days delayed ETA,Reason of Delay
0,2502,2020-08-10,30012791.0,4050,SHENG YU STEEL CO. LTD.,SHENGYU,15018188,802094756,TA 480 X 840 AA941.1 1182 ANY N,TA,COSCO INDONESIA,095N,2020-08-20,2020-09-02,158.426,AUMEL,TWKHH,"KAOHSIUNG, TAIWAN",7,TEU,SHIPPED,ANL,AEL0985436,,140,,CIF,31,,,,,
1,2501,2020-07-15,30012670.0,6140,"HIHO METAL CO., LTD.",HIHO,15017843,802093896,IS 22KG AA170.9 CNTR 44 N,IS,CHRISTA SCHULTE,004N,2020-08-08,2020-08-23,509.705,AUBNE,KRBNP,"BUSAN NEW PORT, KOREA",21,CNO,SHIPPED,HAPAG,43888731,,504,,CIF,31,,,,,
2,2501,2020-07-15,30012688.0,6140,"HIHO METAL CO., LTD.",HIHO,15017848,802093901,IS 22KG AA170.9 CNTR 44 N,IS,CHRISTA SCHULTE,004N,2020-08-08,2020-08-17,484.451,AUBNE,KRBNP,"BUSAN NEW PORT, KOREA",20,CNO,SHIPPED,EVERGREEN,600000015375,,480,,CIF,31,,,,,
3,2522,2020-07-15,30012647.0,400A,OHGITANI CORPORATION,NISC,15017964,802093833,TA 480 X 840 XA941.1 1000 US ANY N,TA,OOCL SHANGHAI,053N,2020-08-01,2020-08-20,262.523,AUMEL,JPOSA,"OSAKA, JAPAN",11,TEU,SHIPPED,ANL,AEL0974951,,275,,DDP,31,,,,,
4,2522,2020-07-15,30012647.0,400A,OHGITANI CORPORATION,NISC,15017965,802093834,TA 480 X 840 XA941.1 1000 US ANY N,TA,OOCL SHANGHAI,053N,2020-08-01,2020-08-18,214.673,AUMEL,JPYOK,"Yokohama, Japan",9,TEU,SHIPPED,ANL,AEL0974918,,225,,DDP,31,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,2524,2020-08-20,30012866.0,400A,FUJIDEN INTERNATIONAL CORP.,YODOGAWA,15018354,802094914,BK 300 X 865 410370 1750 US,BK,RDO CONCERT,035N,2020-08-30,2020-09-23,147.240,AUSYD,JPOSA,KURE PLANT,6,CNO,ORDERED,HAMBURG,0BNE007342,,120,O,DAP,31,,,,,
299,2504,2020-08-31,30012838.0,4200,"HONDA METAL INDUSTRIES VIETNAM, LTD.",HONDAVN,15018259,802094842,BT 203 6063962 5800 4/4 H3,BT,ANL GIPPSLAND,046N,2020-08-31,2020-09-30,145.296,AUSYD,VNCLI,"CAT LAI, VIETNAM",6,CNO,ORDERED,OOCL,4050311650,,72,O/ETA 15-25/09/2020,CIF,31,,,,,
300,2501,2020-08-30,30012857.0,4020,TRAFIGURA PTE LTD,ACCESSWKR,15018498,802095184,IS 22KG AA170.9 CNTR 44 N,IS,COSCO HONG KONG,152N,2020-08-31,2020-09-16,501.904,AUBNE,KRINC,"INCHEON, KOREA",26,C20,ORDERED,OOCL,4050357730,,494,,CIF,31,,,,,
301,2501,2020-08-30,30012857.0,4020,TRAFIGURA PTE LTD,ACCESSWKR,15018499,802095185,IS 22KG AA170.9 CNTR 44 N,IS,COSCO HONG KONG,152N,2020-08-31,2020-09-16,501.904,AUBNE,KRINC,"INCHEON, KOREA",26,C20,ORDERED,OOCL,4050357780,,494,,CIF,31,,,,,


In [None]:
class BSExtractor:
    """
    Extracts information from the BigSchedules Portal.
    
    Methods
    -------
    prepare:
        A single query to the BigSchedules Web API can provide information to multiple lines on the delay_sheet.
        Further filters self.delay_sheet to a smaller list of searches needed to fulfill all the lines on the
            delay_sheet. This reduces the total number of calls made to the BigSchedules Web API and prevents
            duplication of API calls.
        
    call_api:
        Makes calls to the BigSchedules Web API, using information from the prepare method as parameters in the
        API request. Also saves the API responses into a subdirectory "responses/<today_date>".
    
    extract:
        Extracts information from the JSON responses from the call_api method and assembles the final dataframe.
    """
    def __init__(self, main_delay_sheet: pd.DataFrame, interval: tuple):
        # Get the BigSchedules delay sheet
        self.delay_sheet = (main_delay_sheet.query(f"`Fwd Agent` not in {['MSC', 'G2OCEAN']}")
                            .drop(['updated_etd', 'updated_eta', 'No. of days delayed ETD',
                                   'No. of days delayed ETA', 'Reason of Delay'], axis=1)
                            .copy())

        # Get the BigSchedules-specific port names from the UNLOCODEs
        self.port_mapping = (pd.concat([pd.read_csv(p, usecols=[1, 2, 4, 5], engine='python',
                                               names=['country', 'port', 'name', 'subdiv']) for p in Path('data').glob("*UNLOCODE CodeListPart*")])
            .query('port == port')
            .assign(
                uncode=lambda x: x.country.str.cat(x.port),
                full_name=lambda x: np.where(
                    x.subdiv.notnull(), x.name.str.cat(x.subdiv, sep=", "), x.name)
            )
            .drop_duplicates('uncode')
            .set_index('uncode')
            .to_dict('index')
        )
        
        # Get port name (None if the UNLOCODE is not in port_mapping)
        self.delay_sheet = self.delay_sheet.assign(pol_name=lambda x: x['Port of Loading'].apply(lambda y: self.port_mapping.get(y, {}).get('name')),
                                                   pod_name=lambda x: x['Port of discharge'].apply(lambda y: self.port_mapping.get(y, {}).get('name'))).copy()

        self.interval = interval
        self.session = requests.Session()
        
    def prepare(self):
        """
        Further filters self.delay_sheet to a smaller list of searches needed to fulfill all the lines on the
            delay_sheet.
        """
#         # Further filter by POL-Vessel-Voyage to get ETD, POD-Vessel-Voyage to get ETA
#         key = ['pol_name', 'pod_name']
#         self.reduced_df = self.delay_sheet.drop_duplicates(key)[key].sort_values(key)

#         self.reduced_df['pol_code'] = self.reduced_df.pol_name.map(self.msc_port_id)
#         self.reduced_df['pod_code'] = self.reduced_df.pod_name.map(self.msc_port_id)

#         # Unable to handle those with no pod_id in BigSchedules Web; dropping these lines
#         self.reduced_df.dropna(inplace=True)
        
    def call_api(self):
        """
        Makes calls to the BigSchedules Web API, using information from the prepare method as parameters in the
        API request. Also saves the API responses into a subdirectory "responses/<today_date>".
        """
#         def get_schedules(etd: str, pol: str, pod: str):
#             url = f"https://www.bigschedules.com//api/vesselSchedule/list?DISABLE_ART=true&_=2020081917&carrierId=18&language=en-US&scac=HLCU&vesselGid=V000005557&vesselName=CHRISTA+SCHULTE"
#             headers = {
#                 'Accept': 'application/json',
#                 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
#                 'Content-Type': 'application/json',
#                 'Sec-Fetch-Site': 'same-origin',
#                 'Sec-Fetch-Mode': 'cors',
#                 'Sec-Fetch-Dest': 'empty',
#                 'Referer': 'https://www.msc.com/search-schedules',
#                 'Accept-Language': 'en-GB,en;q=0.9',
#                 'Cookie': 'CMSPreferredCulture=en-GB; ASP.NET_SessionId=tht5lkut0asln2goiskoagfe; UrlReferrer=https://www.google.com/; CurrentContact=8b0b2fea-705b-4a4f-b8bf-bb1cd6c982bc; MSCAgencyId=115867; BIGipServerkentico.app~kentico_pool=439883018.20480.0000; _ga=GA1.2.1736073830.1597290148; _gid=GA1.2.1289141279.1597290148; _gcl_au=1.1.345060449.1597290148; __hstc=100935006.13bb76c8a78a8d0a203a993ffef3a3f6.1597290148282.1597290148282.1597290148282.1; hubspotutk=13bb76c8a78a8d0a203a993ffef3a3f6; __hssrc=1; _ym_uid=15972901491036911544; _ym_d=1597290149; _ym_isad=1; newsletter-signup-cookie=temp-hidden; _hjid=3e183004-f562-4048-8b60-daccdf9c187c; _hjUserAttributesHash=2c3b62a0e1cd48bdfd4d01b922060e19; _hjCachedUserAttributes={"attributes":{"mscAgencyId":"115867"},"userId":null}; OptanonAlertBoxClosed=2020-08-13T03:42:45.080Z; CMSCookieLevel=200; VisitorStatus=11062214903; TS0142aef9=0192b4b6225179b1baa3b4d270b71a4eee782a0192338173beabaa471f306c2a13fe854bf6a7ac08ac21924991864aa7728c54559023beabd273d82285d5f943202adb58da417d61813232e89b240828c090f890c6a74dc4adfec38513d13447be4b5b4404d69f964987b7917f731b858f0c9880a139994b98397c4aeb5bd60b0d0e38ec9e5f3c97b13fb184b4e068506e6086954f8a515f2b7239d2e5c1b9c70f61ca74f736355c58648a6036e9b5d06412389ac41221c5cb740df99c84dc2bfef4a530dbc5e2577c189212eebac723d9ee9f98030f4bc6ca7d824ab313ae5fdd1eaa9886; OptanonConsent=isIABGlobal=false&datestamp=Thu+Aug+13+2020+11%3A43%3A36+GMT%2B0800+(Singapore+Standard+Time)&version=5.9.0&landingPath=NotLandingPage&groups=1%3A1%2C2%3A1%2C3%3A1%2C4%3A1%2C0_53017%3A1%2C0_53020%3A1%2C0_53018%3A1%2C0_53019%3A1%2C101%3A1&AwaitingReconsent=false'
#             }
#             response = self.session.get(url, headers=headers)
#             return response
        
#         self.response_jsons = []
#         first_day = datetime.today().replace(day=1).strftime('%Y-%m-%d')
        
#         for row in tqdm(self.reduced_df.itertuples(), total=len(self.reduced_df)):
#             response_filename = f'MSC {int(row.pol_code)}-{int(row.pod_code)}.json'
#             if response_filename not in os.listdir():
#                 response = get_schedules(first_day, int(row.pol_code), int(row.pod_code))
#                 self.response_jsons.append(response.json())
#                 write_json(response.json(), response_filename)
#                 time.sleep(random.randint(*self.interval))
#             else:
#                 with open(response_filename, 'r') as f:
#                     self.response_jsons.append(json.load(f))
        
    def extract(self):
        """
        Extracts information from the JSON responses from the call_api method and assembles the final dataframe.
        """
#         def get_relevant_fields(response, i):
#             return {
#                 'pol_code': response[0]['Sailings'][i]['PortOfLoadId'],
#                 'pod_code': response[0]['Sailings'][i]['PortOfDischargeId'],
#                 'Voyage': response[0]['Sailings'][i]['VoyageNum'],
#                 'Vessel': response[0]['Sailings'][i]['VesselName'],
#                 'updated_etd': response[0]['Sailings'][i]['NextETD'],
#                 'updated_eta': response[0]['Sailings'][i]['ArrivalDate']
#             }

#         self.response_df = pd.DataFrame(([get_relevant_fields(response, i)
#                                      for response in self.response_jsons
#                                      for i in range(len(response[0]['Sailings']))
#                                      if len(response)
#                                     ]))
        
#         # Create reverse mapping from port_code to name
#         msc_port_id_reversed = {v:k for k,v in self.msc_port_id.items()}

#         # Add additional columns to response_df
#         self.response_df['pol_name'] = self.response_df.pol_code.map(msc_port_id_reversed)
#         self.response_df['pod_name'] = self.response_df.pod_code.map(msc_port_id_reversed)

#         # Merge results back to original dataframe
#         merge_key = ['pol_name', 'pod_name', 'Vessel', 'Voyage']
#         self.delay_sheet = (self.delay_sheet.reset_index().
#                             merge(self.response_df[merge_key + ['updated_eta', 'updated_etd']],
#                                   on=merge_key, how='left')
#                             .set_index('index')
#                             .copy())
#         self.delay_sheet.updated_eta = pd.to_datetime(self.delay_sheet.updated_eta.str[:10])
#         self.delay_sheet.updated_etd = pd.to_datetime(self.delay_sheet.updated_etd.str[:10])


### MSCExtractor
There are currently two main web APIs that we can use for data extraction:
1. MSC Search Schedules API
    - countryID API
2. MSC Arrival-Departure API

The Search Schedules API is forward-looking and is the preferred mode of extraction, so we query it for all lines first. If extraction fails for a line, we then treat that line's vessel as having already left port and use the Arrival-Departure API to extract its information instead.

The countryID API is a sub-API used together with the Search Schedules API. We need to query it first because the Search Schedules API takes numeric IDs as parameters instead of UNLOCODEs.
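
The two-stage lookup described above can be sketched as a small fallback function. This is only an outline under stated assumptions: `search_schedules` and `arrival_departure` are hypothetical callables standing in for the two APIs, each returning a dict of dates or `None` when nothing is found, and the stub data is made up for illustration.

```python
from typing import Callable, Optional

Lookup = Callable[[dict], Optional[dict]]


def lookup_with_fallback(line: dict, search_schedules: Lookup, arrival_departure: Lookup) -> Optional[dict]:
    """Try the forward-looking API first; fall back if it finds nothing."""
    result = search_schedules(line)
    if result is None:
        # Assume the vessel has already left port and use the other API.
        result = arrival_departure(line)
    return result


# Stub lookups standing in for the real API calls (placeholder data only).
def fake_search(line):
    return {"updated_etd": "2020-08-21"} if line["Vessel"] == "VESSEL A" else None


def fake_arrivals(line):
    return {"updated_etd": "2020-08-09"}


found_directly = lookup_with_fallback({"Vessel": "VESSEL A"}, fake_search, fake_arrivals)
found_by_fallback = lookup_with_fallback({"Vessel": "VESSEL B"}, fake_search, fake_arrivals)
print(found_directly, found_by_fallback)
```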

### To-do
1. Figure out how to obtain the initial cookie and reuse it in the headers of subsequent requests.

### MSC Arrival-Departure API
Not currently in use.

In [None]:
# def json_payload(port_of_loading):
#     return {
#         'mscCode': port_of_loading,
#         'isCountry': False
#     }

In [None]:
# url = "https://www.msc.com/Site/WebServices/RouteFinder.svc/PortActivity"
# payload = {
#     'mscCode': 'NZBLU',
#     'isCountry': False
# }
# headers = {
#     'Accept': '*/*',
#     'X-Requested-With': 'XMLHttpRequest',
#     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36',
#     'Content-Type': 'application/json; charset=UTF-8',
#     'Origin': 'https://www.msc.com',
#     'Sec-Fetch-Site': 'same-origin',
#     'Sec-Fetch-Mode': 'cors',
#     'Sec-Fetch-Dest': 'empty',
#     'Referer': 'https://www.msc.com/arrivals-departures',
#     'Accept-Encoding': 'gzip, deflate, br',
#     'Accept-Language': 'en-GB,en;q=0.9',
#     'Cookie': 'CMSPreferredCulture=en-GB; ASP.NET_SessionId=tht5lkut0asln2goiskoagfe; UrlReferrer=https://www.google.com/; CurrentContact=8b0b2fea-705b-4a4f-b8bf-bb1cd6c982bc; CMSLandingPageLoaded=true; MSCAgencyId=115867; BIGipServerkentico.app~kentico_pool=439883018.20480.0000; _ga=GA1.2.1736073830.1597290148; _gid=GA1.2.1289141279.1597290148; _gcl_au=1.1.345060449.1597290148; __hstc=100935006.13bb76c8a78a8d0a203a993ffef3a3f6.1597290148282.1597290148282.1597290148282.1; __hssrc=1; hubspotutk=13bb76c8a78a8d0a203a993ffef3a3f6; _ym_uid=15972901491036911544; _ym_d=1597290149; _ym_isad=1; _ym_visorc_65601397=w; newsletter-signup-cookie=temp-hidden; _hjid=3e183004-f562-4048-8b60-daccdf9c187c; _hjUserAttributesHash=2c3b62a0e1cd48bdfd4d01b922060e19; _hjCachedUserAttributes={`"attributes`":{`"mscAgencyId`":`"115867`"},`"userId`":null}; OptanonAlertBoxClosed=2020-08-13T03:42:45.080Z; CMSCookieLevel=200; VisitorStatus=11062214903; CMSUserPage={`"TimeStamp`":`"2020-08-13T05:43:28.7403678+02:00`",`"LastPageDocumentID`":2969,`"LastPageNodeID`":6383,`"Identifier`":`"93e5bd8e-a77a-49db-b00e-1e3712608ca2`"}; TS0142aef9=0192b4b6225179b1baa3b4d270b71a4eee782a0192338173beabaa471f306c2a13fe854bf6a7ac08ac21924991864aa7728c54559023beabd273d82285d5f943202adb58da417d61813232e89b240828c090f890c6a74dc4adfec38513d13447be4b5b4404d69f964987b7917f731b858f0c9880a139994b98397c4aeb5bd60b0d0e38ec9e5f3c97b13fb184b4e068506e6086954f8a515f2b7239d2e5c1b9c70f61ca74f736355c58648a6036e9b5d06412389ac41221c5cb740df99c84dc2bfef4a530dbc5e2577c189212eebac723d9ee9f98030f4bc6ca7d824ab313ae5fdd1eaa9886; OptanonConsent=isIABGlobal=false&datestamp=Thu+Aug+13+2020+11%3A43%3A36+GMT%2B0800+(Singapore+Standard+Time)&version=5.9.0&landingPath=NotLandingPage&groups=1%3A1%2C2%3A1%2C3%3A1%2C4%3A1%2C0_53017%3A1%2C0_53020%3A1%2C0_53018%3A1%2C0_53019%3A1%2C101%3A1&AwaitingReconsent=false; _gat=1; _gat_local=1; __hssc=100935006.3.1597290148283; _gali=ui-id-2'
# }

In [None]:
# response = session.request("POST", url, headers=headers, json=payload)
# response.json()

### MSCExtractor

In [None]:
os.getcwd()

In [None]:
os.chdir('../..')
# Delay report skeleton
delay_report = DelayReport()
delay_report.run_bs()
delay_report.run_msc()
delay_report.run_g2()
delay_report.calculate_deltas()
delay_report.output()

In [None]:
class MSCExtractor:
    """
    Extracts information from the MSC Portal.
    
    Methods
    -------
    get_countryID:
        Gets the countryID mappings via the CountryID API in order to use the Search Schedules API.
        
    prepare:
        A single query to the Search Schedules API can provide information to multiple lines on the delay_sheet.
        Further filters self.delay_sheet to a smaller list of searches needed to fulfill all the lines on the
            delay_sheet. This reduces the total number of calls made to the Search Schedules API and prevents
            duplication of API calls.
        
    call_api:
        Makes calls to the Search Schedules API, using information from the prepare method as parameters in the
        API request. Also saves the API responses into a subdirectory "responses/<today_date>".
    
    extract:
        Extracts information from the JSON responses from the call_api method and assembles the final dataframe.
    """
    def __init__(self, main_delay_sheet: pd.DataFrame, interval: tuple):
        # Get the MSC delay sheet
        self.delay_sheet = (main_delay_sheet.loc[main_delay_sheet['Fwd Agent'] == 'MSC']
                            .drop(['updated_etd', 'updated_eta', 'No. of days delayed ETD',
                                   'No. of days delayed ETA', 'Reason of Delay'], axis=1)
                            .copy())

        # Get the MSC-specific port names from the UNLOCODEs
        self.port_mapping = {v['Port Code']: v['MSC Port Name'] for k,v in (pd.read_excel('../../data/MSC Port Code Mapping.xlsx')
                                                                   .to_dict('index').items())}
        
        # Get port name
        self.delay_sheet = self.delay_sheet.assign(pol_name=lambda x: x['Port of Loading'].apply(lambda y: self.port_mapping.get(y)),
                                                   pod_name=lambda x: x['Port of discharge'].apply(lambda y: self.port_mapping.get(y))).copy()

        self.interval = interval
        self.session = requests.Session()
        
    def get_countryID(self):
        """
        Checks if the query for countryID has been done today.
        If it has been done, skips it and uses the existing countryID JSON file.
        Otherwise, queries the countryID API.
        
        This API call does not require a cookie.
        """
        if 'countryID.json' not in os.listdir():
            def query_id(port: str):
                url = f"https://www.msc.com/api/schedules/autocomplete?q={port}"
                return self.session.get(url)

            def get_id(response):
                if len(response.json()):
                    return response.json()[0].get('id')

            msc_locations = list(self.delay_sheet.pol_name.unique()) + list(self.delay_sheet.pod_name.unique())
            location_code_responses = {location: query_id(location) for location in tqdm(msc_locations)}
            self.msc_port_id = {k:get_id(v) for k,v in location_code_responses.items()}
            write_json(self.msc_port_id, 'countryID.json')

            # PODs with no pod_id
            exception_cases = [k for k,v in self.msc_port_id.items() if v is None]
            write_json(exception_cases, 'msc_exceptions.txt')
        else:
            read_config(self, 'msc_port_id', 'countryID.json')

    def prepare(self):
        """
        Further filters self.delay_sheet to a smaller list of searches needed to fulfill all the lines on the
            delay_sheet. Also maps the UNLOCODE to their respective countryID.
        """
        # Further filter by POL-Vessel-Voyage to get ETD, POD-Vessel-Voyage to get ETA
        key = ['pol_name', 'pod_name']
        self.reduced_df = self.delay_sheet.drop_duplicates(key)[key].sort_values(key)

        self.reduced_df['pol_code'] = self.reduced_df.pol_name.map(self.msc_port_id)
        self.reduced_df['pod_code'] = self.reduced_df.pod_name.map(self.msc_port_id)

        # Unable to handle those with no pod_id in MSC; dropping these lines
        self.reduced_df.dropna(inplace=True)
        
    def call_api(self):
        """
        Makes calls to the Search Schedules API, using information from the prepare method as parameters in the
        API request. Also saves the API responses into a subdirectory "responses/<today_date>".
        """
        def get_schedules(etd: str, pol: str, pod: str):
            url = f"https://www.msc.com/api/schedules/search?WeeksOut=8&DirectRoutes=false&Date={etd}&From={pol}&To={pod}"
            headers = {
                'Accept': 'application/json',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
                'Content-Type': 'application/json',
                'Sec-Fetch-Site': 'same-origin',
                'Sec-Fetch-Mode': 'cors',
                'Sec-Fetch-Dest': 'empty',
                'Referer': 'https://www.msc.com/search-schedules',
                'Accept-Language': 'en-GB,en;q=0.9',
                'Cookie': 'CMSPreferredCulture=en-GB; ASP.NET_SessionId=tht5lkut0asln2goiskoagfe; UrlReferrer=https://www.google.com/; CurrentContact=8b0b2fea-705b-4a4f-b8bf-bb1cd6c982bc; MSCAgencyId=115867; BIGipServerkentico.app~kentico_pool=439883018.20480.0000; _ga=GA1.2.1736073830.1597290148; _gid=GA1.2.1289141279.1597290148; _gcl_au=1.1.345060449.1597290148; __hstc=100935006.13bb76c8a78a8d0a203a993ffef3a3f6.1597290148282.1597290148282.1597290148282.1; hubspotutk=13bb76c8a78a8d0a203a993ffef3a3f6; __hssrc=1; _ym_uid=15972901491036911544; _ym_d=1597290149; _ym_isad=1; newsletter-signup-cookie=temp-hidden; _hjid=3e183004-f562-4048-8b60-daccdf9c187c; _hjUserAttributesHash=2c3b62a0e1cd48bdfd4d01b922060e19; _hjCachedUserAttributes={"attributes":{"mscAgencyId":"115867"},"userId":null}; OptanonAlertBoxClosed=2020-08-13T03:42:45.080Z; CMSCookieLevel=200; VisitorStatus=11062214903; TS0142aef9=0192b4b6225179b1baa3b4d270b71a4eee782a0192338173beabaa471f306c2a13fe854bf6a7ac08ac21924991864aa7728c54559023beabd273d82285d5f943202adb58da417d61813232e89b240828c090f890c6a74dc4adfec38513d13447be4b5b4404d69f964987b7917f731b858f0c9880a139994b98397c4aeb5bd60b0d0e38ec9e5f3c97b13fb184b4e068506e6086954f8a515f2b7239d2e5c1b9c70f61ca74f736355c58648a6036e9b5d06412389ac41221c5cb740df99c84dc2bfef4a530dbc5e2577c189212eebac723d9ee9f98030f4bc6ca7d824ab313ae5fdd1eaa9886; OptanonConsent=isIABGlobal=false&datestamp=Thu+Aug+13+2020+11%3A43%3A36+GMT%2B0800+(Singapore+Standard+Time)&version=5.9.0&landingPath=NotLandingPage&groups=1%3A1%2C2%3A1%2C3%3A1%2C4%3A1%2C0_53017%3A1%2C0_53020%3A1%2C0_53018%3A1%2C0_53019%3A1%2C101%3A1&AwaitingReconsent=false'
            }
            response = self.session.get(url, headers=headers)
            return response
        
        self.response_jsons = []
        first_day = datetime.today().replace(day=1).strftime('%Y-%m-%d')
        
        for row in tqdm(self.reduced_df.itertuples(), total=len(self.reduced_df)):
            response_filename = f'MSC {int(row.pol_code)}-{int(row.pod_code)}.json'
            if response_filename not in os.listdir():
                response = get_schedules(first_day, int(row.pol_code), int(row.pod_code))
                self.response_jsons.append(response.json())
                write_json(response.json(), response_filename)
                time.sleep(random.randint(*self.interval))
            else:
                with open(response_filename, 'r') as f:
                    self.response_jsons.append(json.load(f))
        
        
    def extract(self):
        """
        Extracts information from the JSON responses from the call_api method and assembles the final dataframe.
        """
        def get_relevant_fields(response, i):
            return {
                'pol_code': response[0]['Sailings'][i]['PortOfLoadId'],
                'pod_code': response[0]['Sailings'][i]['PortOfDischargeId'],
                'Voyage': response[0]['Sailings'][i]['VoyageNum'],
                'Vessel': response[0]['Sailings'][i]['VesselName'],
                'updated_etd': response[0]['Sailings'][i]['NextETD'],
                'updated_eta': response[0]['Sailings'][i]['ArrivalDate']
            }

        # Skip empty responses before indexing into them; the filter must come before the inner loop
        self.response_df = pd.DataFrame([get_relevant_fields(response, i)
                                         for response in self.response_jsons
                                         if len(response)
                                         for i in range(len(response[0]['Sailings']))])
        
        # Create reverse mapping from port_code to name
        msc_port_id_reversed = {v:k for k,v in self.msc_port_id.items()}

        # Add additional columns to response_df
        self.response_df['pol_name'] = self.response_df.pol_code.map(msc_port_id_reversed)
        self.response_df['pod_name'] = self.response_df.pod_code.map(msc_port_id_reversed)

        # Merge results back to original dataframe
        merge_key = ['pol_name', 'pod_name', 'Vessel', 'Voyage']
        self.delay_sheet = (self.delay_sheet.reset_index().
                            merge(self.response_df[merge_key + ['updated_eta', 'updated_etd']],
                                  on=merge_key, how='left')
                            .set_index('index')
                            .copy())
        self.delay_sheet.updated_eta = pd.to_datetime(self.delay_sheet.updated_eta.str[:10])
        self.delay_sheet.updated_etd = pd.to_datetime(self.delay_sheet.updated_etd.str[:10])


In [None]:
class G2Extractor:
    """
    Extracts information from the G2 Schedule Excel file by using pd.apply to the delay_sheet.
    
    Methods
    -------
    extract:
        Extracts data from the Excel dataframe, using two helper methods get_updated_eta and get_updated_etd.
    
    get_updated_eta, get_updated_etd:
        Helper methods to extract the updated_eta and updated_etd given a row in the delay_sheet.
    """
    def __init__(self, g2_file: str, main_delay_sheet: pd.DataFrame):
        self.schedule = pd.read_excel(
            Path('../../' + g2_file), skiprows=9, index_col='Unnamed: 0')
        self.delay_sheet = main_delay_sheet.query(f"`Fwd Agent` in {['G2OCEAN']}").copy()
        self.port_mapping = {v['Port Code']: v['G2 Port Name'] for k,v in (pd.read_excel('../../data/G2 Port Code Mapping.xlsx')
                                                           .to_dict('index').items())}

    def get_updated_etd(self, row):
        try:
            # column_index_etd is the column number that points to the ETD
            column_index_etd = np.argwhere(
                self.schedule.columns.str.contains(row['Vessel']))[0][0] + 1
        except IndexError:
            return np.nan
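        # Select the row for the port of loading, then take the first match by position
        port_rows = self.schedule.loc[self.schedule.index == self.port_mapping.get(row['Port of Loading'])]
        return port_rows.iloc[:, column_index_etd].iloc[0]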

    def get_updated_eta(self, row):
        try:
            # column_index_eta is the column number that points to the ETA
            column_index_eta = np.argwhere(
                self.schedule.columns.str.contains(row['Vessel']))[0][0]
        except IndexError:
            return np.nan
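        # Select the row for the port of discharge, then take the first match by position
        port_rows = self.schedule.loc[self.schedule.index == self.port_mapping.get(row['Port of discharge'])]
        return port_rows.iloc[:, column_index_eta].iloc[0]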

    def extract(self):
        """
        Extracts data from the Excel dataframe, using two helper methods get_updated_eta and get_updated_etd.
        """
        self.delay_sheet['updated_etd'] = self.delay_sheet.apply(self.get_updated_etd, axis=1)
        self.delay_sheet['updated_eta'] = self.delay_sheet.apply(self.get_updated_eta, axis=1)

In [None]:
class DelayReport:
    """
    Main delay report class that loads configurations that are shared across Extractors and runs the Extractors.
    
    Methods
    -------
    run_bs, run_msc, run_g2:
        Runs the corresponding extraction by instantiating a relevant Extractor class.
        
    calculate_deltas:
        Calculates the deltas from the updated delay_sheet.
        
    output:
        Write the final delay report Excel file to disk.
    """
    def __init__(self):
        # Read configurations
        read_config(self, 'config', 'data/config.json')
        
        # Used to map carrier names to the ones BigSchedules uses and supports
        read_config(self, 'carrier_mapping', 'data/carrier_mapping.json')

        # Random interval in seconds
        self.interval = (self.config.get('randomiser_lower_interval'), self.config.get('randomiser_upper_interval'))

        # Prepare UNLOCODE to port name mapping
        self.port_mapping = (
            pd.concat([pd.read_csv(p, usecols=[1, 2, 4, 5], engine='python', names=[
                      'country', 'port', 'name', 'subdiv']) for p in Path('data').glob("*UNLOCODE CodeListPart*")])
            .query('port == port')
            .assign(uncode=lambda x: x.country.str.cat(x.port),
                    full_name=lambda x: np.where(x.subdiv.notnull(), x.name.str.cat(x.subdiv, sep=", "), x.name))
            .drop_duplicates('uncode')
            .set_index('uncode')
            .to_dict('index'))
        
        # Read the vessel delay tracking file
        self.xl = pd.ExcelFile(self.config['delay_filename'])
        # today_date = datetime.now().strftime('%d.%m.%Y')
        # if today_date not in self.xl.sheet_names:
        #     raise Exception(
        #         f"The script cannot find today's date ({today_date}) in the Vessel Delay Tracking.xlsx file provided. Please check that the sheets are correctly named - the script will only operate on a sheet with today's date.")

        # Assemble the final dataframe to update
        self.main_delay_sheet = self.xl.parse(pd.to_datetime(self.xl.sheet_names,
                                                             errors='coerce',
                                                             format='%d.%m.%Y').max().date().strftime('%d.%m.%Y'),
                                              parse_dates=True).copy()
        
        # If the current Excel sheet already has any of the output columns, drop them so they can be rebuilt
        new_columns = ['updated_etd', 'updated_eta', 'No. of days delayed ETD', 'No. of days delayed ETA', 'Reason of Delay']
        self.main_delay_sheet.drop(columns=new_columns, errors='ignore', inplace=True)
        
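        # Add new columns to the right side of the dataframe:
        # NaT for the date and delta columns, NaN for the reason column
        for new_column in new_columns[:-1]:
            self.main_delay_sheet[new_column] = pd.NaT
        self.main_delay_sheet[new_columns[-1]] = np.nan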
        
        # Today's directory: create it if needed, then work from inside it
        today_path = Path('responses') / datetime.now().strftime('%Y-%m-%d')
        os.makedirs(today_path, exist_ok=True)
        os.chdir(today_path)
        
    def run_bs(self):
        if self.config.get('run_bs'):
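            # Keep the extractor on the instance: calculate_deltas reads self.bs_extractor.delay_sheet
            self.bs_extractor = BSExtractor(self.main_delay_sheet, self.interval)
            self.bs_extractor.extract()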
        
    def run_msc(self):
        if self.config.get('run_msc'):
            self.msc_extractor = MSCExtractor(self.main_delay_sheet, self.interval)
            self.msc_extractor.get_countryID()
            self.msc_extractor.prepare()
            self.msc_extractor.call_api()
            self.msc_extractor.extract()
    
    def run_g2(self):
        if self.config.get('run_g2'):
            self.g2_extractor = G2Extractor(self.config.get('g2_filename'), self.main_delay_sheet)
            self.g2_extractor.extract()

    def calculate_deltas(self):
        # Update the dataframe
        if self.config.get('run_bs'):
            self.main_delay_sheet.update(self.bs_extractor.delay_sheet)

        if self.config.get('run_msc'):
            self.main_delay_sheet.update(self.msc_extractor.delay_sheet)
            
        if self.config.get('run_g2'):
            self.main_delay_sheet.update(self.g2_extractor.delay_sheet)

        # Calculate the deltas
        # Coerce both sides to datetime first, since update() can leave the updated columns as object dtype
        self.main_delay_sheet['No. of days delayed ETD'] = (pd.to_datetime(self.main_delay_sheet['updated_etd'])
                                                            - pd.to_datetime(self.main_delay_sheet['ETD Date'])).dt.days
        self.main_delay_sheet['No. of days delayed ETA'] = (pd.to_datetime(self.main_delay_sheet['updated_eta'])
                                                            - pd.to_datetime(self.main_delay_sheet['Disport ETA'])).dt.days

        # Format the dates correctly via strftime (coerce to datetime first so the .dt accessor is available)
        date_columns = ['ETD Date', 'Disport ETA', 'updated_etd', 'updated_eta']
        for column in date_columns:
            self.main_delay_sheet[column] = pd.to_datetime(self.main_delay_sheet[column]).dt.strftime('%d/%m/%Y')
    
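    def output(self):
        # Write the Excel file two levels up, because __init__ changed into responses/<date>
        saved_file = Path('../../') / f"Vessel Delay Tracking - {datetime.today().strftime('%d.%m.%Y')}.xlsx"
        self.main_delay_sheet.to_excel(saved_file, index=False)
        # os.startfile only exists on Windows, so open the file only when it is available
        if hasattr(os, 'startfile'):
            os.startfile(saved_file)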
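
The delta step in `calculate_deltas` can be sketched on a toy frame. The column names follow the class above, but the two rows and their dates are made up for illustration, and the second row assumes no updated ETD was found.

```python
import pandas as pd

# Hypothetical delay sheet: one vessel with an updated ETD, one without
sheet = pd.DataFrame({
    "ETD Date": ["2021-01-01", "2021-01-10"],
    "updated_etd": [pd.Timestamp("2021-01-04"), pd.NaT],
})

# Subtract the planned date from the updated date and keep whole days
delay_etd = (pd.to_datetime(sheet["updated_etd"])
             - pd.to_datetime(sheet["ETD Date"])).dt.days
sheet["No. of days delayed ETD"] = delay_etd
```

Rows without an updated date end up with a missing delta rather than zero, so "no data" stays distinguishable from "no delay" in the final report.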