# Process NACE

Processing of NACE Rev. 2 classification data obtained from Eurostat:

https://ec.europa.eu/eurostat/ramon/nomenclatures/index.cfm?TargetUrl=LST_CLS_DLD&StrNom=NACE_REV2&StrLanguageCode=EN&StrLayoutCode=HIERARCHIC#

jab 27.04.2020

# Packages and options

In [1]:
import pandas as pd

# Read data

In [2]:
# read_html returns a list with one DataFrame per <table> found in the file
lst_df = pd.read_html("./NACE_REV2_20200427_154248.htm")

In [3]:
len(lst_df)

1

In [4]:
df_in = lst_df[0]
df_in.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 996 entries, 0 to 995
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Order                     996 non-null    int64 
 1   Level                     996 non-null    int64 
 2   Code                      996 non-null    object
 3   Parent                    975 non-null    object
 4   Description               996 non-null    object
 5   This item includes        778 non-null    object
 6   This item also includes   202 non-null    object
 7   Rulings                   134 non-null    object
 8   This item excludes        507 non-null    object
 9   Reference to ISIC Rev. 4  996 non-null    object
dtypes: int64(2), object(8)
memory usage: 77.9+ KB


In [5]:
df_in.head(1)

Unnamed: 0,Order,Level,Code,Parent,Description,This item includes,This item also includes,Rulings,This item excludes,Reference to ISIC Rev. 4
0,398481,1,A,,"AGRICULTURE, FORESTRY AND FISHING",This section includes the exploitation of vege...,,,,A


# All levels in one frame

In [6]:
col_rename = {'Code': 'id',
            'Level': 'level',
            'Parent': 'parent_id',
            'Description': 'description',
            'This item includes': 'includes',
            'This item also includes': 'includesAlso',
            'Rulings': 'ruling',
            'This item excludes': 'excludes',
            'Reference to ISIC Rev. 4': 'isic4_id'}
# Rename the columns and keep only the renamed ones, in the order given above
df_all = df_in.rename(columns=col_rename)[list(col_rename.values())].copy()

Some sources also use level 3 codes ending in `.0` (the level 2 code padded with `.0`), so we add one for every level 2 code that does not already have it:

In [7]:
new_rows = []
existing_ids = set(df_all.id)
for _, row in df_all[df_all.level == 2].iterrows():
    # Skip level 2 codes whose ".0" code is already present
    if row["id"] + ".0" not in existing_ids:
        r = row.to_dict()
        r["id"] = row["id"] + ".0"
        r["level"] = 3
        r["parent_id"] = row["id"]
        # Pad the ISIC reference the same way (a bare "0" is appended)
        r["isic4_id"] = row["isic4_id"] + "0"
        new_rows.append(r)
df_ = pd.DataFrame(new_rows)
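
The same padding step can also be written without an explicit loop. The following is only a sketch on a small made-up frame (`toy` and its rows are illustrative, not taken from the Eurostat file); it assumes the same column names as `df_all`.

```python
import pandas as pd

# Toy stand-in for df_all; the rows are made up for illustration only.
toy = pd.DataFrame({
    "id": ["A", "01", "01.1", "02", "02.0"],
    "level": [1, 2, 3, 2, 3],
    "parent_id": [None, "A", "01", "A", "02"],
    "isic4_id": ["A", "01", "011", "02", "020"],
})

# Level 2 rows whose ".0" code is not already present.
level2 = toy[toy["level"] == 2]
missing = level2[~(level2["id"] + ".0").isin(toy["id"])]

# Build all padded rows at once instead of row by row.
padded = missing.assign(
    parent_id=missing["id"],
    id=missing["id"] + ".0",
    level=3,
    isic4_id=missing["isic4_id"] + "0",
)
print(padded)
```

Here only `01` gets a new `01.0` row, because `02.0` already exists in the toy data. All other columns are copied unchanged from the level 2 row, as in the loop.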

In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_out = pd.concat([df_all, df_])
df_out.to_csv("../nace_all.csv", index=False)
df_out.id.is_unique

True
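
Besides uniqueness of `id`, one further check that may be worth running is that every non-null `parent_id` refers to an existing `id`. The sketch below uses a made-up frame (`toy_out` is hypothetical) with the same column names as the output; rows without a parent, like the 21 entries with a null `Parent` in the input, are left out of the check.

```python
import pandas as pd

# Toy stand-in for the output frame; the rows are made up for illustration only.
toy_out = pd.DataFrame({
    "id": ["A", "01", "01.0"],
    "level": [1, 2, 3],
    "parent_id": [None, "A", "01"],
})

# Top-level rows have no parent, so only rows with a parent are checked.
has_parent = toy_out["parent_id"].notna()
orphans = toy_out[has_parent & ~toy_out["parent_id"].isin(toy_out["id"])]

ids_unique = toy_out["id"].is_unique
print(ids_unique, len(orphans))
```

An empty `orphans` frame means the hierarchy is closed: every code can be traced up to a top-level row.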