In [1]:
import pathlib
import numpy as np
import pandas as pd

## Trial Data
First we load some trial data (e.g. the car fleet count for the Brussels Region in 2015).

In [2]:
path = pathlib.Path('data/regexp.csv')
path

PosixPath('data/regexp.csv')

In [3]:
# Semicolon-separated file with a space as thousands separator; fully empty rows are dropped
df = pd.read_csv(str(path), sep=';', thousands=' ').rename(columns={'2015': 'Count'}).dropna(axis=0, how='all')
df.dtypes

Sector         object
Subsector      object
Fuel           object
Technology     object
Count         float64
dtype: object
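
To make the loading options concrete, here is a minimal sketch on a hypothetical inline sample. The sample rows and the name `sample` are assumptions for illustration, and the real file may be laid out differently. It shows how `sep=';'` splits the fields, how `thousands=' '` lets a value such as `6 774` be parsed as a number, and how `dropna(how='all')` removes a row whose fields are all empty.

```python
import io
import pandas as pd

# Hypothetical inline sample mimicking the assumed layout of the CSV file
raw = io.StringIO(
    "Sector;Subsector;Fuel;Technology;2015\n"
    "Passenger Cars;Gasoline <0,8 l;Gasoline;PC Euro 4 - 98/69/EC Stage2005;945\n"
    ";;;;\n"  # a row with only empty fields, removed by dropna(how='all')
    "Passenger Cars;Gasoline 0,8 - 1,4 l;Gasoline;PRE ECE;6 774\n"
)

sample = (
    pd.read_csv(raw, sep=';', thousands=' ')   # ';' delimiter, ' ' groups thousands
      .rename(columns={'2015': 'Count'})       # give the year column a generic name
      .dropna(axis=0, how='all')               # drop rows where every field is missing
)

print(sample['Count'].tolist())
print(sample.dtypes)
```

The empty row is probably also why `Count` ends up as `float64` rather than an integer type: the column contains a missing value at parse time, and the dtype is not converted back after the row is dropped.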

The DataFrame has several useful columns of structured text:

In [4]:
df.iloc[:10,:]

| | Sector | Subsector | Fuel | Technology | Count |
|---|---|---|---|---|---|
| 0 | Passenger Cars | Gasoline <0,8 l | Gasoline | PC Euro 4 - 98/69/EC Stage2005 | 945.0 |
| 1 | Passenger Cars | Gasoline <0,8 l | Gasoline | PC Euro 5 - EC 715/2007 | 33.0 |
| 2 | Passenger Cars | Gasoline <0,8 l | Gasoline | PC Euro 6 - EC 715/2007 | 1.0 |
| 3 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | PRE ECE | 6774.0 |
| 4 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/00-01 | 1020.0 |
| 5 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/02 | 2698.0 |
| 6 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/03 | 1768.0 |
| 7 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/04 | 1756.0 |
| 8 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | PC Euro 1 - 91/441/EEC | 2694.0 |
| 9 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | PC Euro 2 - 94/12/EEC | 5458.0 |


## Regular Expressions
We will extract useful information from those columns using [regular expressions][1], either with the `re` [module][2] in Python or with its Pandas binding, the `str.extract` [method][3].

[1]: https://en.wikipedia.org/wiki/Regular_expression
[2]: https://docs.python.org/3.5/library/re.html
[3]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
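
As a rough illustration of how the two approaches relate, the sketch below applies one pattern with a named group first through `re.search` and then through `Series.str.extract`. The sample values are copied from the `Technology` column. The pattern, the group name `EuroId` and the helper `first_match` are illustrative assumptions, not the ones used later in this notebook.

```python
import re
import pandas as pd

# Sample values copied from the Technology column; hard-coded to keep the sketch self-contained
values = pd.Series(['PC Euro 4 - 98/69/EC Stage2005', 'PRE ECE', 'HD Euro VI'])

# Illustrative pattern: a named group capturing a digit or Roman numerals after "Euro "
pattern = r'Euro (?P<EuroId>\d|[IV]+)'


def first_match(text):
    """Hypothetical helper: return the named group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group('EuroId') if match else None


# Plain `re`: one search per string
with_re = [first_match(v) for v in values]

# Pandas: the named group becomes the column name, non-matching rows become missing values
with_pandas = values.str.extract(pattern, expand=True)

print(with_re)
print(with_pandas)
```

Both routes use the same regular expression syntax. The Pandas method mainly saves the explicit loop and returns a DataFrame that can be concatenated with the original one.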

First, let's look at the distinct values (modalities) found in the `Technology` column:

In [5]:
df.Technology.unique()

array(['PC Euro 4 - 98/69/EC Stage2005', 'PC Euro 5 - EC 715/2007',
       'PC Euro 6 - EC 715/2007', 'PRE ECE', 'ECE 15/00-01', 'ECE 15/02',
       'ECE 15/03', 'ECE 15/04', 'PC Euro 1 - 91/441/EEC',
       'PC Euro 2 - 94/12/EEC', 'PC Euro 3 - 98/69/EC Stage2000',
       'Conventional', 'LD Euro 1 - 93/59/EEC', 'LD Euro 2 - 96/69/EEC',
       'LD Euro 3 - 98/69/EC Stage2000', 'LD Euro 4 - 98/69/EC Stage2005',
       'LD Euro 5 - 2008 Standards', 'LD Euro 6',
       'HD Euro I - 91/542/EEC Stage I',
       'HD Euro II - 91/542/EEC Stage II', 'HD Euro III - 2000 Standards',
       'HD Euro IV - 2005 Standards', 'HD Euro V - 2008 Standards',
       'HD Euro VI', 'EEV', 'Mop - Euro I', 'Mop - Euro II',
       'Mop - Euro III', 'Mot - Euro I', 'Mot - Euro II', 'Mot - Euro III'], dtype=object)
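
Before writing a pattern, it can help to check which modalities contain the literal text the pattern will be anchored on. The sketch below does this for the text `Euro ` on a hard-coded subset of the values listed above. The names `modalities`, `has_euro` and `no_euro` are illustrative.

```python
import pandas as pd

# Subset of the modalities listed above, hard-coded so the sketch is self-contained
modalities = pd.Series([
    'PC Euro 4 - 98/69/EC Stage2005', 'PRE ECE', 'ECE 15/00-01',
    'Conventional', 'LD Euro 6', 'HD Euro III - 2000 Standards',
    'EEV', 'Mop - Euro I',
])

# Boolean mask: True where the literal text "Euro " appears (no regex involved)
has_euro = modalities.str.contains('Euro ', regex=False)

# Values that a pattern anchored on "Euro " can never match
no_euro = modalities[~has_euro].tolist()

print(no_euro)
```

Rows holding such values will get a missing value in any column extracted with a pattern that starts with `Euro `.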

From these values we would like to extract the Euro standard identifier (`EuroStrId`) and the reference number of the associated norm (`NormId`):

In [6]:
# Raw strings keep `\d` from being read as a string escape; {1,3} lets "III" match in full
x1 = df.Technology.str.extract(r'Euro (?P<EuroStrId>\d|[IV]{1,3})', expand=True)
x2 = df.Technology.str.extract(r'(?P<NormId>[ EC]{0,4}(?:\d{2,4}/?){1,3}[ EC]{0,3})', expand=True)
df2 = pd.concat([df, x1, x2], axis=1)
df2.iloc[:10,:]

| | Sector | Subsector | Fuel | Technology | Count | EuroStrId | NormId |
|---|---|---|---|---|---|---|---|
| 0 | Passenger Cars | Gasoline <0,8 l | Gasoline | PC Euro 4 - 98/69/EC Stage2005 | 945.0 | 4.0 | 98/69/EC |
| 1 | Passenger Cars | Gasoline <0,8 l | Gasoline | PC Euro 5 - EC 715/2007 | 33.0 | 5.0 | EC 715/2007 |
| 2 | Passenger Cars | Gasoline <0,8 l | Gasoline | PC Euro 6 - EC 715/2007 | 1.0 | 6.0 | EC 715/2007 |
| 3 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | PRE ECE | 6774.0 | | |
| 4 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/00-01 | 1020.0 | | ECE 15/00 |
| 5 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/02 | 2698.0 | | ECE 15/02 |
| 6 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/03 | 1768.0 | | ECE 15/03 |
| 7 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | ECE 15/04 | 1756.0 | | ECE 15/04 |
| 8 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | PC Euro 1 - 91/441/EEC | 2694.0 | 1.0 | 91/441/EEC |
| 9 | Passenger Cars | Gasoline 0,8 - 1,4 l | Gasoline | PC Euro 2 - 94/12/EEC | 5458.0 | 2.0 | 94/12/EEC |
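
The `NormId` pattern is dense, so one way to read it is to spell it out with the `re.VERBOSE` flag, which allows comments inside the pattern. The sketch below is an assumed equivalent rewrite (`\d` for `[\d]`, `/?` for `[/]{0,1}`), applied to a few values from the `Technology` column. The names `norm_pattern`, `extracted` and `stripped` are illustrative.

```python
import re
import pandas as pd

# Values copied from the Technology column; hard-coded to keep the sketch self-contained
values = pd.Series([
    'PC Euro 4 - 98/69/EC Stage2005',
    'PC Euro 5 - EC 715/2007',
    'ECE 15/00-01',
    'PRE ECE',
])

# Commented rewrite of the NormId pattern (whitespace inside [...] is kept in VERBOSE mode)
norm_pattern = r'''
    (?P<NormId>
        [ EC]{0,4}            # optional prefix made of spaces, "E" and "C", e.g. "EC " or "ECE "
        (?:\d{2,4}/?){1,3}    # one to three runs of 2-4 digits, each optionally followed by "/"
        [ EC]{0,3}            # optional suffix such as "EC" or "EEC"
    )
'''

extracted = values.str.extract(norm_pattern, flags=re.VERBOSE, expand=True)

# The space belongs to both character classes, so raw matches may carry surrounding spaces
stripped = extracted['NormId'].str.strip()

print(repr(extracted['NormId'].iloc[0]))
print(stripped.tolist())
```

Because `str.extract` returns only the first match, the trailing `-01` in `ECE 15/00-01` and the `Stage2005` suffix are not captured. Stripping the result is a design choice for cleaner keys; whether the tabular display above already hides those spaces is an assumption.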
