# Proposal

## Use Case

#### Big Question
- Which shippers are most/least reliable (arrival time delta between estimated and actual)?

#### Sub Questions
- Which are the most reliable shippers per country/region/subregion?
- Which carrier companies are the most reliable?
- What, if any, were the reliability changes over the years?
    - How did covid affect reliability metrics of shipment times?
- Which consignees chose their shippers most wisely?
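
As a minimal sketch of the reliability metric in the Big Question, the arrival delta can be computed by parsing both date strings and subtracting them. The sample rows, the `add_arrival_delta` helper, and the choice of mean absolute delay as the score are all assumptions made for illustration; `shipper_id` is the column added later in the pipeline.

```python
import pandas as pd

# Hypothetical sample rows; date column names follow the Header table
sample = pd.DataFrame({
    'shipper_id': [0, 0, 1, 1],
    'estimated_arrival_date': ['2020-01-01', '2020-01-10', '2020-02-01', '2020-02-05'],
    'actual_arrival_date': ['2020-01-02', '2020-01-10', '2020-02-06', '2020-02-04'],
})

def add_arrival_delta(df):
    """Return a copy with a signed delay in days (positive = arrived late)."""
    out = df.copy()
    estimated = pd.to_datetime(out['estimated_arrival_date'], errors='coerce')
    actual = pd.to_datetime(out['actual_arrival_date'], errors='coerce')
    out['arrival_delta_days'] = (actual - estimated).dt.days
    return out

with_delta = add_arrival_delta(sample)

# Assumed score: mean absolute delay per shipper (smaller = more reliable)
reliability = (
    with_delta['arrival_delta_days'].abs()
    .groupby(with_delta['shipper_id'])
    .mean()
)
print(reliability)
```

Using the absolute value treats early and late arrivals as equally unreliable; keeping the sign instead would separate the two cases.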

## Raw Denormalized Tables

##### Header
| Column | Data Type | Explanation | % non-null in 1st file |
| ------ | --------- | ----------- | ---------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| carrier_code | String | Standard Carrier Alpha Code (SCAC) to identify Vessel Operating Common Carriers (VOCC) | 100% |
| vessel_country_code | String | Carrier vessel country of origin | 99.99% |
| vessel_name | String | Ship's name | 100% |
| estimated_arrival_date | String | Time given as estimated shipment arrival | 100% |
| actual_arrival_date | String | Real date of arrival | 100% |

##### Consignee
| Column | Data Type | Explanation | % non-null 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| consignee_name | String | Name of company receiving manifest items | 99.99% |
| consignee_address_1 | String | Top level address | 99.99% |
| consignee_address_2 | String | 2nd level address | 87.80% |
| consignee_address_3 | String | 3rd level address | 55.05% |
| consignee_address_4  | String | 4th level address | 11.52% |
| country_code | String | 2-letter (alpha-2) country code | 20.36% |

##### Shipper
| Column | Data Type | Explanation | % non-null 1st file |
| ------ | --------- | ----------- | ------------------- |
| identifier | int64 | id of manifest shipment | 100% |
| shipper_party_name | String | Name of company shipping manifest items | 99.99% |
| shipper_party_address_1 | String | Top level address | 99.99% |
| shipper_party_address_2 | String | 2nd level address | 91.55% |
| shipper_party_address_3 | String | 3rd level address | 62.89% |
| shipper_party_address_4 | String | 4th level address | 14.36% |
| country_code | String | 2-letter (alpha-2) country code | 21.21% |

## Proposed Normalized Schema

![ERD](/Users/jesseputnam/cs-learning/skillstorm/project01/erd.png)

# ETL Pipeline

In [1]:
import pandas as pd

from lib.utils import get_id_nums, clean_row, remove_incorrect_codes

## Set up countries table

- Get table of countries with alpha-2 code that includes region from repository
    - https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes

In [2]:
countries_cols = ['name', 'alpha-2', 'region', 'sub-region']
countries = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/all.csv', usecols=countries_cols, keep_default_na=False)

extra_codes = pd.DataFrame(
    {'name': ['Czechia', 'Netherland Antilles', 'Germany', 'European Union'], 
    'alpha-2': ['XC', 'AN', 'DD', 'EU'], 
    'region': ['Europe', 'Americas', 'Europe', 'Europe'], 
    'sub-region': ['Eastern Europe', 'Latin America and the Caribbean', 'Western Europe', 'Western Europe']}
    )

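# Append the extra codes; ignore_index=True rebuilds a clean 0..n-1 index
# (the `keys` argument has no effect when ignore_index=True, so it is omitted)
countries = pd.concat([countries, extra_codes], ignore_index=True)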

# Change countries index column to be alpha-2 values and rename to id
countries.set_index('alpha-2', inplace=True)
countries.index.name = 'id'
countries.sort_index(inplace=True)

# Create country code set with O(1) lookup for table cleaning
alpha_2_set = set(countries.index)

# Create country name dictionary with O(1) lookup for table cleaning
country_dict = {x[1].upper(): x[0] for x in countries.itertuples()}

# Add some statistically significant outliers, including 2 common 'typos'
country_dict['TAIWAN'] = 'TW'
country_dict['SOUTH KOREA'] = 'KR'
country_dict['SHANGHAI CN'] = 'CN'
country_dict['SHANGHAI'] = 'CN'
country_dict['SHANGHAI .'] = 'CN'
country_dict['HONG KONG .'] = 'CN'
country_dict['TAIPEI .'] = 'TW'
country_dict['USA'] = 'US'
country_dict['U.S.A.'] = 'US'

- Write the altered countries DataFrame to CSV

In [20]:
countries_sql_path = '/Users/jesseputnam/cs-learning/skillstorm/project01/data/sql/countries.csv'
countries.to_csv(countries_sql_path, mode='w')


## Shipper/Consignee

- Identify the necessary columns

In [6]:
# Choose columns to keep
shipper_cols = ['identifier', 'shipper_party_name', 'shipper_party_address_1', 'shipper_party_address_2', 'shipper_party_address_3', 'shipper_party_address_4', 'country_code']

shipper_final_path = '/Users/jesseputnam/cs-learning/skillstorm/project01/data/final/shippers_clean.csv'

- Read CSVs to DataFrames with only necessary columns

In [4]:
shipper_0 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2018/shipper_2018_part_0.csv', usecols=shipper_cols)
shipper_1 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2018/shipper_2018_part_1.csv', usecols=shipper_cols)
shipper_2 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2019/shipper_2019_part_0.csv', usecols=shipper_cols)
shipper_3 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2019/shipper_2019_part_1.csv', usecols=shipper_cols)
shipper_4 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2020/shipper_2020_part_0.csv', usecols=shipper_cols)
shipper_5 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2020/shipper_2020_part_1.csv', usecols=shipper_cols)

- Concatenate shippers DataFrames to single DataFrame

In [6]:
shippers = pd.concat([shipper_0, shipper_1, shipper_2, shipper_3, shipper_4, shipper_5], ignore_index=True)
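
The read-then-concatenate steps above could also be written as a loop over the part files. This is only a sketch: `load_parts` and its `data_dir` argument are hypothetical, and it assumes the `<year>/<table>_<year>_part_<n>.csv` layout seen in the paths above.

```python
from pathlib import Path

import pandas as pd

def load_parts(data_dir, table, years, usecols):
    """Read every <data_dir>/<year>/<table>_<year>_part_*.csv into one DataFrame."""
    frames = []
    for year in years:
        year_dir = Path(data_dir, str(year))
        # sorted() gives a stable order; note it is lexicographic,
        # so part_10 would sort before part_2
        for path in sorted(year_dir.glob(f'{table}_{year}_part_*.csv')):
            frames.append(pd.read_csv(path, usecols=usecols))
    # pd.concat raises ValueError if no files matched
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_parts(data_dir, 'shipper', [2018, 2019, 2020], shipper_cols)` would then replace the six individual `read_csv` lines, with `data_dir` pointing at the project's data folder.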

In [8]:
# Replace NaN in name with 'N/A'
shippers['shipper_party_name'] = shippers['shipper_party_name'].fillna('N/A')

- Clean shipper rows and remove remaining unnecessary columns - (see utils.py for **CLEAN_ROW** function)

In [9]:
shippers_clean = shippers.apply(lambda row: clean_row(row, 'shipper_party', alpha_2_set, country_dict), axis=1)
# .copy() so that columns can be added later without a SettingWithCopyWarning
shippers_clean = shippers_clean[['identifier', 'shipper_party_name', 'country_code']].copy()
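
The real `clean_row` lives in utils.py and is not shown here. The following is only a guess at the kind of logic it might contain, under the assumption that it keeps a valid `country_code` and otherwise searches the address lines for a known country name. The name `clean_row_sketch` and the comma-splitting heuristic are hypothetical.

```python
import pandas as pd

def clean_row_sketch(row, prefix, valid_codes, name_to_code):
    """Hypothetical stand-in for clean_row: repair country_code from address text."""
    row = row.copy()
    code = row['country_code']
    # Keep the existing code when it is a valid alpha-2 value
    if isinstance(code, str) and code.strip().upper() in valid_codes:
        row['country_code'] = code.strip().upper()
        return row
    row['country_code'] = None
    # Scan address lines from last to first; the country tends to come last
    for i in (4, 3, 2, 1):
        address = row.get(f'{prefix}_address_{i}')
        if not isinstance(address, str):
            continue
        for piece in reversed(address.upper().split(',')):
            match = name_to_code.get(piece.strip())
            if match is not None:
                row['country_code'] = match
                return row
    return row

# Hypothetical sample data
demo = pd.DataFrame({
    'identifier': [1, 2, 3],
    'shipper_party_name': ['A', 'B', 'C'],
    'shipper_party_address_1': ['1 MAIN ST', '2 ROAD, TAIPEI, TAIWAN', 'NOWHERE'],
    'shipper_party_address_2': ['USA', None, None],
    'shipper_party_address_3': [None, None, None],
    'shipper_party_address_4': [None, None, None],
    'country_code': [None, 'xx', 'cn'],
})
demo_clean = demo.apply(
    lambda row: clean_row_sketch(row, 'shipper_party', {'US', 'TW', 'CN'},
                                 {'USA': 'US', 'TAIWAN': 'TW'}),
    axis=1,
)
print(demo_clean['country_code'].tolist())
```

Both lookups use a set or dict, so each check stays O(1), in line with the lookup structures built in the countries section.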

- **Result:** rows with a valid country code, out of 40,240,366 total rows

| | Before Cleaning | After Cleaning |
| - | - | - |
| country_codes (count) | 9,911,774 | 13,100,153 |
| country_codes (%) | 24.63% | 32.55% |

- Create shipper id column and map IDs by name - (see utils.py for **GET_ID_NUMS** function)

In [13]:
shipper_id_dict = get_id_nums(shippers_clean['shipper_party_name'])
shippers_clean['shipper_id'] = shippers_clean['shipper_party_name'].map(shipper_id_dict)
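
`get_id_nums` is defined in utils.py and not shown. One plausible shape for it, offered purely as an assumption, is a dictionary that numbers each distinct name in order of first appearance; `get_id_nums_sketch` is a hypothetical name.

```python
import pandas as pd

def get_id_nums_sketch(names):
    """Hypothetical stand-in: map each distinct name to a sequential integer ID."""
    # pd.unique preserves order of first appearance
    return {name: idx for idx, name in enumerate(pd.unique(names))}

names = pd.Series(['ACME', 'BETA', 'ACME', 'N/A'])
id_map = get_id_nums_sketch(names)
ids = names.map(id_map)
print(id_map)
```

Because IDs are keyed on the name alone, two distinct companies that share a name would receive the same ID under this sketch.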

- Write cleaned data to CSV (time-consuming, so verify the DataFrame before running)

In [16]:
shippers_clean.to_csv(shipper_final_path, mode='w')

## Header

- Identify necessary columns

In [44]:
header_cols = ['identifier', 'carrier_code', 'vessel_country_code', 'vessel_name', 'estimated_arrival_date','actual_arrival_date']
header_final_path = '/Users/jesseputnam/cs-learning/skillstorm/project01/data/final/header_clean.csv'

- Read header CSV files

In [45]:
header_0 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2018/header_2018_part_0.csv', usecols=header_cols)
header_1 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2018/header_2018_part_1.csv', usecols=header_cols)
header_2 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2018/header_2018_part_2.csv', usecols=header_cols)
header_3 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2018/header_2018_part_3.csv', usecols=header_cols)
header_4 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2019/header_2019_part_0.csv', usecols=header_cols)
header_5 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2019/header_2019_part_1.csv', usecols=header_cols)
header_6 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2019/header_2019_part_2.csv', usecols=header_cols)
header_7 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2019/header_2019_part_3.csv', usecols=header_cols)
header_8 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2020/header_2020_part_0.csv', usecols=header_cols)
header_9 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2020/header_2020_part_1.csv', usecols=header_cols)
header_10 = pd.read_csv('/Users/jesseputnam/cs-learning/skillstorm/project01/data/2020/header_2020_part_2.csv', usecols=header_cols)


- Concat to a single DataFrame

In [46]:
headers = pd.concat([header_0, header_1, header_2, header_3, header_4, header_5, header_6, header_7, header_8, header_9, header_10], ignore_index=True)

- Remove erroneous or unknown country codes

In [51]:
# Series.apply passes one value (a country code) at a time, not a row
headers['vessel_country_code'] = headers['vessel_country_code'].apply(lambda code: remove_incorrect_codes(code, alpha_2_set))
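
`remove_incorrect_codes` is defined in utils.py and not shown. Assuming it simply blanks out any value that is not in the valid alpha-2 set, a per-value version and a vectorized equivalent might look like this (the `_sketch` name and sample values are hypothetical):

```python
import pandas as pd

def remove_incorrect_codes_sketch(code, valid_codes):
    """Hypothetical stand-in: keep a valid alpha-2 code, otherwise return None."""
    if isinstance(code, str) and code in valid_codes:
        return code
    return None

codes = pd.Series(['GB', 'ZZ', None, 'PA'])
valid = {'GB', 'PA'}

# Per-value version, mirroring the apply call above
per_value = codes.apply(lambda code: remove_incorrect_codes_sketch(code, valid))

# Vectorized alternative: keep values where isin() is True, NaN elsewhere
vectorized = codes.where(codes.isin(valid))
```

On a column with tens of millions of rows, the `isin`/`where` form avoids a Python-level function call per value.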

- Merge header with consignee and shipper on IDs only

In [97]:
# Merge on identifier, adding just the IDs for shipper and consignee
headers_full = headers.merge(shippers_clean[['identifier', 'shipper_id']], on='identifier', how='inner')
headers_full = headers_full.merge(consignees_clean[['identifier', 'consignee_id']], on='identifier', how='inner')

In [100]:
headers_full

| | identifier | carrier_code | vessel_country_code | vessel_name | estimated_arrival_date | actual_arrival_date | shipper_id | consignee_id |
| - | - | - | - | - | - | - | - | - |
| 0 | 201801010 | DFDS | GB | EVER SIGMA | 2017-02-14 | 2017-02-15 | 0 | 0 |
| 1 | 201801011 | DFDS | GB | EVER SIGMA | 2017-02-14 | 2017-02-15 | 1 | 1 |
| 2 | 201801012 | DFDS | GB | EVER SIGMA | 2017-02-14 | 2017-02-15 | 2 | 0 |
| 3 | 201801013 | DFDS | GB | EVER SIGMA | 2017-02-14 | 2017-02-15 | 3 | 2 |
| 4 | 201801014 | DFDS | GB | EVER SIGMA | 2017-02-14 | 2017-02-15 | 4 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40972605 | 2020092382683 | OERT | LR | DS KOI | 2020-09-06 | 2020-09-08 | 1714562 | 55123 |
| 40972606 | 2020092382684 | FEVM | US | SEASPAN GANGES | 2020-09-08 | 2020-09-11 | 1743310 | 2260669 |
| 40972607 | 2020092382685 | MOSJ | PA | COSCO AMERICA | 2020-07-19 | 2020-07-20 | 269749 | 229332 |
| 40972608 | 2020092382686 | MOSJ | PA | COSCO AMERICA | 2020-07-19 | 2020-07-20 | 269749 | 229332 |


In [99]:
headers_full[headers_full['identifier'] == 2018010327496]

| | identifier | carrier_code | vessel_country_code | vessel_name | estimated_arrival_date | actual_arrival_date | shipper_id | consignee_id |
| - | - | - | - | - | - | - | - | - |
| 65839 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32931 | 29064 |
| 65840 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32931 | 29064 |
| 65841 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32931 | 29064 |
| 65842 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32932 | 29064 |
| 65843 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32932 | 29064 |
| 65844 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32932 | 29064 |
| 65845 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32932 | 29064 |
| 65846 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32932 | 29064 |
| 65847 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 | 32932 | 29064 |


In [92]:
headers[headers['identifier'] == 2018010327496]

| | identifier | carrier_code | vessel_country_code | vessel_name | estimated_arrival_date | actual_arrival_date |
| - | - | - | - | - | - | - |
| 89642 | 2018010327496 | SEAG | HK | COSCO DEVELOPMENT | 2017-12-30 | 2017-12-30 |


In [93]:
headers.size

325292088

In [98]:
headers_full.size

327780880

In [96]:
shippers_clean[shippers_clean['identifier'] == 2018010327496]

| | identifier | shipper_party_name | country_code | shipper_id |
| - | - | - | - | - |
| 64830 | 2018010327496 | NINGBO HEMING IMPORT AND EXPORT | | 32931 |
| 64831 | 2018010327496 | SHAOXING SHANGYU TIANYA PHOTOGRAPHI | | 32932 |
| 64832 | 2018010327496 | SHAOXING SHANGYU TIANYA PHOTOGRAPHI | | 32932 |


In [104]:
consignees_clean['identifier'].value_counts().head(50)

identifier
2019071522115    3
2018010327496    3
2019021832036    3
2018091226571    2
2020032728109    2
2019032810956    2
2020082070417    2
2020082070405    2
2020032728185    2
2020032728227    2
2020082070357    2
2020082070292    2
2018102712160    2
2020032728327    2
2020032728336    2
2019070419873    2
2018102712218    2
2019070419621    2
2020082070195    2
2019120631445    2
2018080136090    2
2018080136089    2
2019070419988    2
2019070419995    2
2018102712319    2
2019032810630    2
2019032810628    2
2018080136061    2
2019120631417    2
2018080136037    2
2019070419623    2
2019070419619    2
2019070419620    2
2018080136606    2
2018080136902    2
2018080136886    2
2018080136880    2
2018080136879    2
2020032727691    2
2018080136825    2
2019120632371    2
2020032727746    2
2018080136786    2
2020032727895    2
2020082070670    2
2018080136613    2
2020082070632    2
2018080135985    2
2018102711873    2
2020032727956    2
Name: count, dtype: int64
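
The outputs above suggest that some identifiers appear more than once in the shipper and consignee tables, so the merge multiplies rows: identifier 2018010327496 has 1 header row, 3 shipper rows, and 3 consignee rows, which yields 3 x 3 = 9 merged rows. The sketch below reproduces the effect on made-up data and shows one possible remedy, keeping the first party per identifier. That remedy is an assumption, not a decision recorded in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
headers = pd.DataFrame({'identifier': [1, 2], 'carrier_code': ['SEAG', 'DFDS']})
shippers = pd.DataFrame({'identifier': [1, 1, 1, 2], 'shipper_id': [10, 11, 11, 12]})
consignees = pd.DataFrame({'identifier': [1, 1, 1, 2], 'consignee_id': [20, 20, 20, 21]})

# Naive merge: identifier 1 expands to 3 x 3 = 9 rows
naive = (
    headers.merge(shippers, on='identifier', how='inner')
           .merge(consignees, on='identifier', how='inner')
)

# Possible remedy: keep one shipper and one consignee per identifier
shippers_1 = shippers.drop_duplicates(subset='identifier', keep='first')
consignees_1 = consignees.drop_duplicates(subset='identifier', keep='first')

# validate='one_to_one' raises MergeError if duplicate keys remain
deduped = (
    headers.merge(shippers_1, on='identifier', how='inner', validate='one_to_one')
           .merge(consignees_1, on='identifier', how='inner', validate='one_to_one')
)
print(len(naive), len(deduped))
```

Keeping only the first party discards information when a manifest really does list several shippers; an alternative would be to model that relationship in a separate link table.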

# SQL File Creation

- Remove identifier from consignee and shipper, change index to IDs

- Change index on header to identifier?

- Upload shipper, header, consignee as CSV for SQL batch loading

- Write SQL table creation file
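
As a rough sketch of the first step in the list above, the shipper table could be reduced to one row per `shipper_id` and indexed by that ID before export. The sample data is made up, and keeping the first row per ID is an assumption, since rows sharing a name may carry different country codes.

```python
import pandas as pd

# Hypothetical sample in the shape of shippers_clean
shippers_clean = pd.DataFrame({
    'identifier': [201801010, 201801011, 201801012],
    'shipper_party_name': ['ACME', 'BETA', 'ACME'],
    'country_code': ['US', None, 'US'],
    'shipper_id': [0, 1, 0],
})

shipper_table = (
    shippers_clean
    .drop(columns='identifier')            # identifier stays on the header table only
    .drop_duplicates(subset='shipper_id')  # one row per shipper (keeps the first)
    .set_index('shipper_id')
    .rename_axis('id')                     # match the 'id' index name used for countries
)
print(shipper_table)
```

The same steps would apply to the consignee table with `consignee_id` in place of `shipper_id`.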