# **Web Scraping**

## **Ship Arrival Forecast**

**Web Scraping:** Ship arrival forecast website (in Portuguese).

**The objective:** Parse the HTML of the website to collect real-time information on the ship arrival forecast for the Douro Port (Oporto, Portugal).

**Company:** APDL – Administração dos Portos do Douro, Leixões e Viana do Castelo, S.A. - *(APDL - Port Administration of Douro, Leixões and Viana do Castelo, SA).*

**Object:** The mission of APDL, S.A. is to manage the Douro, Leixões and Viana do Castelo Ports, undertaking their economic exploitation, conservation and development, including the powers assigned to the port authority. 

**Data source:** JUPII - Single Port Window.
http://siga.apdl.pt/site-apdl/planeamento/naviosprevchegada.jsp?lang=pt 

In [1]:
# import the libraries
import pandas as pd
import datetime
import requests
import sqlite3
from bs4 import BeautifulSoup

In [2]:
# Setting up the environment
def get_data(url, parser):
    # Download the page and parse it with the named BeautifulSoup parser
    raw_html = requests.get(url).content
    return BeautifulSoup(raw_html, parser)

# Identifying the path
url = 'http://siga.apdl.pt/'
raw_html = requests.get(url).content
html = BeautifulSoup(raw_html, 'html.parser')

In [3]:
# Extraction of the information
html = get_data('http://siga.apdl.pt/site-apdl/planeamento/naviosprevchegada.jsp?lang=pt', 'html.parser')
ship_arriv = html.find_all("tbody")  # the loop below treats each <tbody> as one ship
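
The download step above can be made a little more defensive. The following is a minimal sketch rather than part of the original notebook: `make_soup` and `fetch_soup` are hypothetical helper names, and the 10-second timeout is an arbitrary assumption. It separates parsing from downloading, so the parsing half can be checked on a literal fragment without touching the network.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(raw_html, parser="html.parser"):
    """Parse raw HTML with an explicitly named parser."""
    return BeautifulSoup(raw_html, parser)


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)  # timeout value is an assumption
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return make_soup(response.content, parser)


# Offline check on a literal fragment (hypothetical markup, not the real page).
demo = make_soup(b"<table><tbody><tr><td>NORDICA</td></tr></tbody></table>")
first_cell = demo.find("td").get_text()
```

Naming the parser explicitly avoids the warning BeautifulSoup emits when it has to guess one, and keeps results consistent across machines with different parsers installed.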

In [4]:
# Creating dictionary
all_ships = {}
for index in range(len(ship_arriv)):
    ship_n = "ship_{}".format(index + 1)

    # Pre-processing and data cleaning: read the row's cells once
    cells = [td.get_text() for td in ship_arriv[index].find_all("td")]
    estimated_arrival_time = cells[0]
    process = cells[1]
    customs_process_id = cells[2]
    imo_nr = cells[3]
    vessel = cells[4]
    call_reference = cells[5]
    flag = cells[6]
    comp = cells[7]
    gt = cells[8]
    berthing_place = cells[9]
    draught_entrance = cells[10]
    draught_departure = cells[11]
    port_ori = cells[12]
    port_dest = cells[13]
    shipping_agent = cells[14]
    vessel_type = cells[15]
    all_ships[ship_n] = {
        "Estimated Arrival Time": estimated_arrival_time,
        "Process": process,
        "Customs process ID": customs_process_id,
        "Imo nr": imo_nr,
        "Vessel": vessel,
        "Call Reference": call_reference,
        "Flag": flag,
        "Comp.": comp,
        "GT": gt,
        "Berthing Place": berthing_place,
        "Draught Entrance": draught_entrance,
        "Draught Departure": draught_departure,
        "Port Origin": port_ori,
        "port Destination": port_dest,
        "Shipping Agent.": shipping_agent,
        "Vessel Type": vessel_type,
    }
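
The sixteen positional lookups above can also be expressed by pairing a list of column labels with the cells of each row. This is a minimal sketch, assuming each `<tbody>` holds one ship with at least sixteen `<td>` cells. `COLUMNS`, `rows_to_dict` and the sample fragment are illustrative names that do not appear in the notebook, and stripping whitespace from each cell is an added step.

```python
from bs4 import BeautifulSoup

# Column labels in the order the cells appear, as used in the dictionary above.
COLUMNS = [
    "Estimated Arrival Time", "Process", "Customs process ID", "Imo nr",
    "Vessel", "Call Reference", "Flag", "Comp.", "GT", "Berthing Place",
    "Draught Entrance", "Draught Departure", "Port Origin",
    "port Destination", "Shipping Agent.", "Vessel Type",
]


def rows_to_dict(tbodies, columns=COLUMNS):
    """Map each <tbody> to {column: cell text}, skipping rows with too few cells."""
    ships = {}
    for number, tbody in enumerate(tbodies, start=1):
        cells = [td.get_text(strip=True) for td in tbody.find_all("td")]
        if len(cells) < len(columns):
            continue  # not a full data row, so indexing it would fail
        ships["ship_{}".format(number)] = dict(zip(columns, cells))
    return ships


# Hypothetical fragment imitating the assumed layout: one <tbody> per ship,
# followed by a short row that should be ignored.
values = [
    "2020/08/28 15:00", "ESC202001824", "1827", "9483695", "NORDICA", "PCSK",
    "PAISES BAIXOS", "151.72", "10318.0", "TERMINAL DE CONTENTORES SUL", "", "",
    "LISBOA", "TILBURY", "GARLAND NAVEGACAO", "PORTA-CONTENTORES",
]
row = "".join("<td> {} </td>".format(value) for value in values)
sample_html = (
    "<table><tbody><tr>" + row + "</tr></tbody>"
    "<tbody><tr><td>spacer</td></tr></tbody></table>"
)
sample = rows_to_dict(BeautifulSoup(sample_html, "html.parser").find_all("tbody"))
```

Keeping the labels in a single list means a change in the page's column order only has to be mirrored in one place, and the length check avoids an `IndexError` on rows that are not ship records.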

In [5]:
a_s = pd.DataFrame(all_ships).T # Displaying results in a dataframe
a_s.head(20) # Calling dataframe

| | Estimated Arrival Time | Process | Customs process ID | Imo nr | Vessel | Call Reference | Flag | Comp. | GT | Berthing Place | Draught Entrance | Draught Departure | Port Origin | port Destination | Shipping Agent. | Vessel Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ship_1 | 2020/08/28 15:00 | ESC202001824 | 1827 | 9483695 | NORDICA | PCSK | PAISES BAIXOS | 151.72 | 10318.0 | TERMINAL DE CONTENTORES SUL | | | LISBOA | TILBURY | GARLAND NAVEGACAO | PORTA-CONTENTORES |
| ship_2 | 2020/08/28 16:00 | ESC202001888 | 1892 | 9246554 | ALLEGRO | V2CQ5 | ANTIGUA E BARBUDA | 134.44 | 9962.0 | TERMINAL DE CONTENTORES SUL | 8.0 | 8.0 | VIGO | SETÚBAL | WEC LINES - IBERO PORTUGAL | PORTA-CONTENTORES |
| ship_3 | 2020/08/28 18:00 | ESC202001948 | 1951 | 9106924 | WILSON BORG | 9HNO4 | MALTA | 87.9 | 2446.0 | DOCA 2 NORTE | 3.2 | 5.0 | LISBOA | SETÚBAL | NAVEX - EMP. PORT. DE NAVEGACAO | CARGA GERAL N.D. |
| ship_4 | 2020/08/28 22:00 | ESC202001905 | 1909 | 9584487 | HELENA SCHEPERS | 5BVP3 | CHIPRE | 151.72 | 10318.0 | TERMINAL DE CONTENTORES NORTE | | | ROTTERDAM | LISBOA | One Shipping, Unipessoal Lda | PORTA-CONTENTORES |
| ship_5 | 2020/08/28 22:00 | ESC202001910 | 1914 | 9516246 | MICHELLE 1 | 9HA2395 | MALTA | 116.0 | 5576.0 | DOCA 4 NORTE | | | CAEN | UNKOWN LOCATION | GARLAND NAVEGACAO | CARGA GERAL N.D. |
| ship_6 | 2020/08/29 06:00 | ESC202001573 | 1580 | 9354428 | SARA BORCHARD | CQEJ | PORTUGAL | 134.44 | 9962.0 | TERMINAL DE CONTENTORES NORTE | 7.8 | 8.0 | LIVERPOOL | CASTELLÓN DE LA PLANA | MARMEDSA - AGENCIA MARITIMA, LDA | PORTA-CONTENTORES |
| ship_7 | 2020/08/29 06:00 | ESC202001880 | 1884 | 9351098 | MAX STABILITY | 9HA4086 | MALTA | 126.87 | 7532.0 | TERMINAL DE CONTENTORES NORTE | 5.5 | 5.8 | LISBOA | CANIÇAL | SOFRENA - SOC. AFRET. NAVEGACAO | PORTA-CONTENTORES |
| ship_8 | 2020/08/29 06:00 | ESC202001890 | 1894 | 9143972 | WEC BRUEGHEL | MCXD5 | REINO UNIDO | 121.35 | 6362.0 | TERMINAL DE CONTENTORES SUL | 6.5 | 6.5 | FIGUEIRA DA FOZ | SETÚBAL | WEC LINES - IBERO PORTUGAL | PORTA-CONTENTORES |
| ship_9 | 2020/08/29 06:00 | ESC202001571 | 1578 | 9242560 | AMELIE BORCHARD | V2GC4 | ANTIGUA E BARBUDA | 134.44 | 9981.0 | TERMINAL DE CONTENTORES NORTE | 8.7 | 8.7 | CASTELLÓN DE LA PLANA | DUBLIN | MARMEDSA - AGENCIA MARITIMA, LDA | PORTA-CONTENTORES |
| ship_10 | 2020/08/29 06:30 | ESC202001881 | 1885 | 9823352 | LAURELINE | 9HA4791 | MALTA | 216.47 | 50443.0 | Terminal Multiusos de Leixões (Molhe Sul) | | | ROTTERDAM | ZEEBRUGGE | DELPHIS PORTUGAL, LDA | RO-RO, N.D. |
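
Every value scraped this way is text, including the dates and the numeric columns. One possible use of the `sqlite3` and `datetime` imports is to clean the types and keep a timestamped snapshot of each scrape. The sketch below is an assumption about a possible next step, not something the notebook does: `clean_types`, `save_snapshot`, the `ship_arrivals` table name and the `scraped_at` column are all hypothetical, and the two sample rows are copied from the output above with only a few columns kept.

```python
import datetime
import sqlite3

import pandas as pd

# Two rows copied from the output above, limited to a few columns for brevity.
sample = pd.DataFrame(
    {
        "Estimated Arrival Time": ["2020/08/28 15:00", "2020/08/28 16:00"],
        "Vessel": ["NORDICA", "ALLEGRO"],
        "GT": ["10318.0", "9962.0"],
        "Draught Entrance": ["", "8.0"],
    },
    index=["ship_1", "ship_2"],
)


def clean_types(df):
    """Return a copy with numeric columns parsed; blank cells become NaN."""
    out = df.copy()
    for col in ("GT", "Draught Entrance"):
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Validate the timestamp format, then store it as sortable ISO-style text.
    parsed = pd.to_datetime(out["Estimated Arrival Time"], format="%Y/%m/%d %H:%M")
    out["Estimated Arrival Time"] = parsed.dt.strftime("%Y-%m-%d %H:%M")
    return out


def save_snapshot(df, conn, table="ship_arrivals"):
    """Append one scrape to SQLite, stamped with the collection time in UTC."""
    snapshot = df.copy()
    snapshot["scraped_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    snapshot.to_sql(table, conn, if_exists="append", index_label="ship")


cleaned = clean_types(sample)
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
try:
    save_snapshot(cleaned, conn)
    rows = conn.execute(
        "SELECT ship, Vessel, GT FROM ship_arrivals ORDER BY ship"
    ).fetchall()
finally:
    conn.close()
```

Appending rather than replacing keeps earlier forecasts, so changes in a ship's estimated arrival time can be compared across scrapes.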


---

**Author**: Gonçalo Guimarães Gomes. Portuguese Data Analyst and Digital Brand Marketeer. Postgraduate in Data Science and in Digital Marketing. Degree and Executive Master in Marketing Management.

### **Contacts**

- [LinkedIn](https://www.linkedin.com/in/goncaloggomes/)
- [Twitter](https://twitter.com/goncaloggomes)
- [Medium Profile](https://medium.com/@goncaloggomes)
- [GitHub](https://github.com/goncaloggomes)
- [Email](mailto:goncaloggomes@gmail.com)