In [1]:
import pandas as pd
import datetime as dt
import pyarrow as pa
import pyarrow.parquet as pq

### Making Phase One Product List

The first step reads in the Excel file. I took the official document, used Adobe to extract Annex 6-1, which contains the product lists from which purchases should be made, and then converted the extracted .pdf file to an Excel file. Some simple cleaning is still needed. The end product has (i) an HS code, (ii) a description, (iii) a low category and (iv) a high category. The high category is the category for which purchase commitments are made.

In [2]:
file_path = "./data/annex-6-1.xlsx"

df = pd.read_excel(file_path, skiprows = 1, header = None, dtype = {1: str})

df.columns = ["catagory", "hs4_o", "description"]

#### Some basic cleaning.

The first steps create the low and high product categories and separate them into their own columns. The remaining steps drop headings and other non-product rows. The only issues are that some HS codes were garbled in the conversion (e.g. ones read as the letter "l") and two other issues. I replace these manually. Then we are done.

In [3]:
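df["catagory"] = df["catagory"].ffill()
# Forward-fill the category label so every row below a heading carries it.
# Assigning the result back avoids relying on a chained inplace call.

df["low_catagory"] = df.loc[df.description.isnull(), "hs4_o"]
# Rows without a description hold the low category label in the hs4_o column.

df["high_catagory"] = df.loc[df.description.isnull() & df.hs4_o.isnull(), "catagory"]
# Rows with neither a description nor an hs4_o entry are high category headings.

df["low_catagory"] = df["low_catagory"].ffill()

df["high_catagory"] = df["high_catagory"].ffill()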
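To make the forward-fill logic concrete, here is a minimal sketch on a toy frame. The row layout and values are assumptions chosen to mimic the sheet (a heading row, sub-heading rows and product rows), not rows taken from the actual file.

```python
import pandas as pd

# Hypothetical rows mimicking the assumed sheet layout:
# - heading row: only "catagory" is filled
# - sub-heading row: label sits in "hs4_o", no description
# - product row: both "hs4_o" and "description" are filled
toy = pd.DataFrame(
    {
        "catagory": ["3. Energy", None, None, None, None],
        "hs4_o": [None, "Crude oil", "2709", "Coal", "2701"],
        "description": [None, None, "Petroleum oils", None, "Coal; briquettes"],
    }
)

toy["catagory"] = toy["catagory"].ffill()

# Sub-heading rows have no description; their label is in hs4_o.
toy["low_catagory"] = toy.loc[toy["description"].isnull(), "hs4_o"]

# Heading rows have neither a description nor an hs4_o entry.
is_heading = toy["description"].isnull() & toy["hs4_o"].isnull()
toy["high_catagory"] = toy.loc[is_heading, "catagory"]

# Push both labels down onto the product rows beneath them.
toy["low_catagory"] = toy["low_catagory"].ffill()
toy["high_catagory"] = toy["high_catagory"].ffill()

# Keep only product rows, as the notebook does in the next cell.
products = toy.dropna(subset=["hs4_o", "description"])
print(products[["hs4_o", "low_catagory", "high_catagory"]])
```

Each product row ends up tagged with the nearest sub-heading above it and the nearest heading above that.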

In [5]:
df.dropna(axis = 0, subset = ["hs4_o"], inplace = True)

df.dropna(axis = 0, subset = ["description"], inplace = True)

replace_list = ['85ll', '72ll', '220 1', '290543/\n290544', "290151"]

df.replace(to_replace = replace_list, value = ["8511", "7211", "2201", "2905", "290511"], inplace = True)

df["foo"] = pd.to_numeric(df.hs4_o, errors = "coerce")
# Entries that are not numeric become NaN here and are dropped in the next cell.

In [6]:
df.dropna(axis = 0, subset = ["foo"], inplace = True)

df.drop(axis = 1, labels = ["foo", "catagory"], inplace = True)
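The numeric check above can be illustrated in isolation. This is a small sketch with made-up strings, assuming that leftover heading text and unrepaired codes are the entries that fail to parse.

```python
import pandas as pd

# Hypothetical entries: valid codes, a leftover heading, and a garbled code.
codes = pd.Series(["8511", "0101", "HS Code", "85ll", "271111"])

# errors="coerce" turns anything that is not a number into NaN.
as_number = pd.to_numeric(codes, errors="coerce")

# Keep the original strings wherever parsing succeeded.
kept = codes[as_number.notna()]
print(kept.tolist())
```

Note that a garbled code such as `85ll` would be discarded by this filter, which is why the manual replacements have to run before it.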

The final issue is that the leading "0" of some hs4 codes is not read in. The zeros display in Excel but are lost on import. The code below fixes these entries.

In [7]:
df[df.high_catagory == "3. Energy"]

Unnamed: 0,hs4_o,description,low_catagory,high_catagory
560,271111,Liquefied natural gas,Liquefied natural gas,3. Energy
563,2709,Petroleum oils and oilsobtained from bituminou...,Crude oil,3. Energy
566,271112,Liquefied propane,Refined products,3. Energy
567,271113,Liquefied butane,Refined products,3. Energy
568,27111990,Other unlisted liquefied petroleum gases and g...,Refined products,3. Energy
569,271311,U n ca lc ined petroleum coke,Refined products,3. Energy
570,271312,Calcined petroleum coke,Refined products,3. Energy
571,271012250,"Naphtha (Excluding Motor Fue,l) blend Stock n...",Refined products,3. Energy
572,290511,Me thanol,Refined products,3. Energy
575,2701,"Coal; briquett,es ovoids and similar solid fu...",Coal,3. Energy


In [8]:
def add_zero(x):
    """Restore a missing leading zero and truncate to a 4-digit code."""
    if len(x) < 4:
        x = "0" + x

    if len(x) > 4:
        # Some codes are longer than 4 digits (almost all are energy),
        # so truncate them to the first 4 digits.
        x = x[0:4]

    return x

In [15]:
is_energy = df.high_catagory == "3. Energy"

df.loc[~is_energy, "hs4_o"] = df.loc[~is_energy, "hs4_o"].apply(add_zero)
# This fixes the zero problem on all the codes except energy, which does not
# have a zero problem. Non-energy codes longer than 4 digits are truncated too.

df["hs4"] = df.hs4_o.apply(add_zero)
# This then creates the hs4 codes that mimic the original setup.
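As a quick check of what `add_zero` does, the sketch below repeats the function and runs it on a few sample strings. The vectorized variant with `str.zfill` is an alternative offered for comparison, not what the notebook uses, and the sample codes are only illustrative.

```python
import pandas as pd


def add_zero(x):
    # Same logic as the notebook's helper: pad one zero, then truncate.
    if len(x) < 4:
        x = "0" + x
    if len(x) > 4:
        x = x[0:4]
    return x


samples = ["101", "8511", "271111", "27111990"]
padded = [add_zero(s) for s in samples]
print(padded)

# Alternative sketch: zfill pads to width 4, then slicing truncates.
# It matches add_zero for these samples, but it would add more than one
# zero to inputs shorter than 3 characters, which add_zero does not.
vectorized = pd.Series(samples).str.zfill(4).str[:4].tolist()
print(vectorized)
```

Both approaches give the same 4-digit codes for inputs of three or more characters.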

In [17]:
df.loc[is_energy, "hs4"]

560    2711
563    2709
566    2711
567    2711
568    2711
569    2713
570    2713
571    2710
572    2905
575    2701
Name: hs4, dtype: object

#### And we are done

Check out the end product. **To Do** make a .csv file for posting.

In [18]:
grp = df.groupby("hs4_o")

outdf = grp.agg({"description": "first", "low_catagory": "first", "high_catagory": "first", "hs4": "first"})
# Given the truncation, this collapses rows so that each hs4_o code appears once.
# The description may be off, but the low and high category will be right.

outdf.reset_index(inplace = True)
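The effect of aggregating with `"first"` can be seen on a toy frame. The descriptions and category labels below are hypothetical, chosen only to show two rows sharing one code after truncation.

```python
import pandas as pd

# Hypothetical rows: two products collapsed onto the same 4-digit code.
toy = pd.DataFrame(
    {
        "hs4_o": ["8511", "8511", "2709"],
        "description": ["Spark plugs", "Starter motors", "Crude petroleum"],
        "high_catagory": ["1. Manufactured goods", "1. Manufactured goods", "3. Energy"],
    }
)

# "first" keeps the first value seen within each hs4_o group.
collapsed = (
    toy.groupby("hs4_o")
    .agg({"description": "first", "high_catagory": "first"})
    .reset_index()
)
print(collapsed)
```

Only the first description survives for a shared code, which is why the description can be off while the category columns stay correct.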

In [19]:
outdf[outdf.high_catagory == "3. Energy"]

Unnamed: 0,hs4_o,description,low_catagory,high_catagory,hs4
194,2701,"Coal; briquett,es ovoids and similar solid fu...",Coal,3. Energy,2701
195,2709,Petroleum oils and oilsobtained from bituminou...,Crude oil,3. Energy,2709
196,271012250,"Naphtha (Excluding Motor Fue,l) blend Stock n...",Refined products,3. Energy,2710
197,271111,Liquefied natural gas,Liquefied natural gas,3. Energy,2711
198,271112,Liquefied propane,Refined products,3. Energy,2711
199,271113,Liquefied butane,Refined products,3. Energy,2711
200,27111990,Other unlisted liquefied petroleum gases and g...,Refined products,3. Energy,2711
201,271311,U n ca lc ined petroleum coke,Refined products,3. Energy,2713
202,271312,Calcined petroleum coke,Refined products,3. Energy,2713
259,290511,Me thanol,Refined products,3. Energy,2905


In [20]:
out_file = "./data/annex-6-1.parquet"
# Forward slashes match the input path above and work on any platform.

pq.write_table(pa.Table.from_pandas(outdf), out_file)

outdf.to_csv("./data/annex-6-1.csv", index = False)
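One caveat for anyone loading the .csv file: the leading-zero problem can reappear if pandas is left to infer column types. The sketch below uses an in-memory string with made-up rows rather than the real file, and shows that passing `dtype` keeps the codes as text.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the exported file.
csv_text = "hs4_o,hs4\n0101,0101\n271111,2711\n"

# Default type inference parses the codes as integers and drops the zero.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading the code columns as strings preserves the leading zero.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"hs4_o": str, "hs4": str})

print(inferred["hs4"].tolist())
print(as_text["hs4"].tolist())
```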

In [21]:
outdf.tail(20)