I read in the tidy dataframe that I produced in the previous project and stored on disk as a pickled file, using the pandas "read_pickle()" function:

In [2]:
import pandas as pd
df = pd.read_pickle('EU_industry_production_dataframe.pkl')

As a reminder, this is what the dataframe looks like:

In [3]:
df.info()
print(df.head())

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 1 columns):
production_index    27936 non-null object
dtypes: object(1)
memory usage: 306.5+ KB
                        production_index
time       country_code                 
1953-01-01 AT                          :
           BA                          :
           BE                          :
           BG                          :
           CY                          :


The "production_index" column thus holds string values, which pandas reports as "non-null object".

To see what these values are, let's have a look at the most recent "production_index" values for Austria (country code 'AT'), say those of the last year. Because I want to specify a range in the time component of the MultiIndex, I have to pass a "slice()" with the beginning and the end (inclusive) of the period to the ".loc[]" indexer:

In [4]:
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])

                        production_index
time       country_code                 
2016-09-01 AT                     113.7 
2016-10-01 AT                     114.7 
2016-11-01 AT                     116.4 
2016-12-01 AT                     115.6 
2017-01-01 AT                     113.2 
2017-02-01 AT                     116.5 
2017-03-01 AT                     117.5 
2017-04-01 AT                     117.6 
2017-05-01 AT                     117.3 
2017-06-01 AT                     118.2 
2017-07-01 AT                    119.9 p
2017-08-01 AT                         : 
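
As a side note, the same kind of selection can also be written with "pd.IndexSlice", which allows the usual colon syntax inside ".loc[]". The following is only a minimal sketch on a made-up toy frame (the name "toy" and its values are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame with the same (time, country_code) MultiIndex layout
times = pd.to_datetime(['2017-06-01', '2017-07-01', '2017-08-01'])
index = pd.MultiIndex.from_product([times, ['AT', 'BE']],
                                   names=['time', 'country_code'])
toy = pd.DataFrame(
    {'production_index': ['118.2 ', '110.0 ', '119.9 p', '111.3 ', ': ', ': ']},
    index=index)

idx = pd.IndexSlice
# Time range (end inclusive) for one country, written with colon syntax
selection = toy.loc[idx['2017-07':'2017-08', 'AT'], :]
# The explicit slice() form used above selects the same rows
same = toy.loc[(slice('2017-07', '2017-08'), 'AT'), :]
print(selection)
```

Both forms rely on the MultiIndex being sorted, which is the case for the toy frame built here.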


In this selection, most values are floating-point numbers, as one would expect of an index. The penultimate value, however, contains the letter "p". What does that mean?

A quick look at the interactive online table version (http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=sts_inpr_m&lang=en) of the dataset reveals the meaning of these letters (notably, this information is not in the metadata):

"Available flags:
b 	break in time series 	c 	confidential 	d 	definition differs, see metadata
e 	estimated 	f 	forecast 	i 	see metadata (phased out)
n 	not significant 	p 	provisional 	r 	revised
s 	Eurostat estimate (phased out) 	u 	low reliability 	z 	not applicable"

Thus the "p" marks a provisional value, in contrast to the confirmed values without this flag.

The last value in the selection, a colon, stands for a missing value:

"Special value:
: not available"

To clean up the production index values, the flags should be identified and moved to their own column (so that I don't lose the information), the remaining numbers converted into floats, and the colons replaced by "NaN" values.

I start by extracting the flags from the production index values. I do this using regular expressions (http://www.regular-expressions.info/reference.html) and the Python package "re". The string values with flags, e.g., '119.9 p', follow a particular pattern that can be translated into a regular expression: one or more digits ('\d+'), optionally followed by a decimal point ('\.?', with a backslash because an unescaped '.' would match any character) and further digits ('\d*'), zero or more whitespace characters ('\s*'), and a group of flags as lowercase letters ('([a-z]+)'). Concatenating all these strings yields the pattern string, which has to be passed to the "re.compile()" function to obtain a pattern object that the "re" package can match against:

In [5]:
import re  # Package for regular expressions
# Raw string (r'...') so that the backslashes reach the regex engine unchanged
flag_pattern = re.compile(r'\d+\.?\d*\s*([a-z]+)')  # Define pattern to match
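
Before applying such a pattern to the whole column, it can help to try it on a few sample strings. The following is a minimal sketch; the name "demo_pattern" and the sample values are made up for illustration. It also shows why the decimal point is usually escaped: an unescaped '.' matches any character.

```python
import re

# Hypothetical demo pattern: raw string, escaped decimal point
demo_pattern = re.compile(r'\d+\.?\d*\s*([a-z]+)')

# One plain value, one flagged value, one missing value
results = {}
for sample in ['113.7 ', '119.9 p', ': ']:
    match = demo_pattern.match(sample)
    results[sample] = match.group(1) if match else ''
print(results)

# An unescaped '.' accepts any character in place of the decimal point
loose = re.fullmatch(r'\d+.\d', '119x9')    # matches, although 'x' is no decimal point
strict = re.fullmatch(r'\d+\.\d', '119x9')  # no match
print(loose is not None, strict is not None)
```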

Now I define a function that checks if a string matches "flag_pattern" and returns the matching flags:

In [6]:
def get_flags(string_value):
    '''Returns lower-case letter flags (as string) from a production index string value'''
    match = flag_pattern.match(string_value)
    if match:                  # A match object is truthy, None is falsy
        return match.group(1)  # Returns first matching group
    return ''                  # Else return empty string

I pass this function to the "apply()" method of the "production_index" column in order to apply it to every value and store the results in a new "flags" column:

In [7]:
df['flags'] = df['production_index'].apply(get_flags)
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])

                        production_index flags
time       country_code                       
2016-09-01 AT                     113.7       
2016-10-01 AT                     114.7       
2016-11-01 AT                     116.4       
2016-12-01 AT                     115.6       
2017-01-01 AT                     113.2       
2017-02-01 AT                     116.5       
2017-03-01 AT                     117.5       
2017-04-01 AT                     117.6       
2017-05-01 AT                     117.3       
2017-06-01 AT                     118.2       
2017-07-01 AT                    119.9 p     p
2017-08-01 AT                         :       
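
For larger datasets, the same extraction could presumably also be done without a Python-level function, using the vectorized string method "str.extract()". This is only a sketch on made-up sample values, not what I ran on the real data; the leading '^' is added because "str.extract()" searches anywhere in the string, whereas "match()" anchors at the start.

```python
import pandas as pd

# Made-up sample values in the same format as the "production_index" strings
values = pd.Series(['113.7 ', '119.9 p', ': ', '98.4 s'])

# expand=False returns a Series for a single capture group;
# values without a match become NaN, which fillna('') turns into "no flag"
flags = values.str.extract(r'^\d+\.?\d*\s*([a-z]+)', expand=False).fillna('')
print(flags.tolist())
```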


How many different flag combinations are there? To find out, let's use the "unique()" method:

In [8]:
print(df['flags'].unique())

['' 's' 'p']


So there are three different flag combinations:
'': no flags
'p': provisional
's': Eurostat estimate (phased out)

These strings mark distinct categories.

It is thus natural to convert the flag strings into categorical values. I can then look at the "categories" attribute to confirm the number of categories:

In [9]:
df['flags'] = df['flags'].astype('category')
print(df['flags'].cat.categories)

Index(['', 'p', 's'], dtype='object')
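
With a categorical column, it is also easy to count how often each flag occurs and, if desired, to give the categories more readable names. A minimal sketch on made-up flags follows; the labels are taken from the Eurostat legend quoted above, while the label 'none' and the variable names are my own assumptions.

```python
import pandas as pd

# Made-up flag values for illustration
flags = pd.Series(['', 'p', '', 's', '']).astype('category')

# Number of occurrences per category
counts = flags.value_counts()
print(counts)

# Optional: replace the one-letter codes with descriptive labels
labelled = flags.cat.rename_categories(
    {'': 'none', 'p': 'provisional', 's': 'Eurostat estimate (phased out)'})
print(labelled.cat.categories)
```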


As a next step, I extract the actual numbers from the value strings and replace the colons with NaN values. I can do this again using regular expressions and an extraction/conversion function:

In [10]:
import numpy as np  # Needed for NaN values

# Raw strings keep the backslashes intact; '\.' matches only a literal decimal point
number_pattern = re.compile(r'(\d+\.?\d*)\s*[a-z]*')  # Number group followed by zero or more flags
nan_pattern = re.compile(r':')                        # Missing value

def get_number(string_value):
    '''Returns production index value (as float) from a production index string value'''
    if nan_pattern.match(string_value):
        return np.nan
    match = number_pattern.match(string_value)
    if match:
        return float(match.group(1))  # Returns first (and only) matching group
    # No match: raise an error that names the offending value
    raise ValueError('Production index value string does not match number pattern: '
                     + repr(string_value))

I apply the "get_number()" function to the "production_index" column:

In [11]:
df['production_index_float'] = df['production_index'].apply(get_number)
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])

                        production_index flags  production_index_float
time       country_code                                               
2016-09-01 AT                     113.7                          113.7
2016-10-01 AT                     114.7                          114.7
2016-11-01 AT                     116.4                          116.4
2016-12-01 AT                     115.6                          115.6
2017-01-01 AT                     113.2                          113.2
2017-02-01 AT                     116.5                          116.5
2017-03-01 AT                     117.5                          117.5
2017-04-01 AT                     117.6                          117.6
2017-05-01 AT                     117.3                          117.3
2017-06-01 AT                     118.2                          118.2
2017-07-01 AT                    119.9 p     p                   119.9
2017-08-01 AT                         :                            NaN
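
As with the flags, a vectorized variant of this step is conceivable. The sketch below uses "str.extract()" together with "pd.to_numeric()" on made-up sample values; it is an alternative I did not use on the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample values in the same format as the "production_index" strings
values = pd.Series(['113.7 ', '119.9 p', ': '])

# Extract the leading number; values without a number (such as ':') become NaN
number_strings = values.str.extract(r'^(\d+\.?\d*)', expand=False)
numbers = pd.to_numeric(number_strings)
print(numbers)
```

Note the difference in behaviour: this variant silently turns any unexpected string into NaN, whereas "get_number()" raises an error, which is the safer choice when the data have not been inspected yet.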


Last but not least, I can replace the old "production_index" string column with the new float column and then get rid of the duplicate column:

In [12]:
df['production_index'] = df['production_index_float']
df.drop('production_index_float', axis=1, inplace=True)
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
print(df.info())

                         production_index flags
time       country_code                        
2016-09-01 AT                       113.7      
2016-10-01 AT                       114.7      
2016-11-01 AT                       116.4      
2016-12-01 AT                       115.6      
2017-01-01 AT                       113.2      
2017-02-01 AT                       116.5      
2017-03-01 AT                       117.5      
2017-04-01 AT                       117.6      
2017-05-01 AT                       117.3      
2017-06-01 AT                       118.2      
2017-07-01 AT                       119.9     p
2017-08-01 AT                         NaN      
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 2 columns):
production_index    8744 non-null float64
flags               27936 non-null category
dtypes: category(1), float64(1)
memory usage: 355.2+ KB
None


This is the clean dataframe with the fixed variables "time" (datetimes) and "country_code" (strings) as the MultiIndex, and the measured variables "production_index" (floats) and "flags" (categorical) as data columns.

I store the clean dataframe in a pickled file:

In [13]:
df.to_pickle('EU_industry_production_dataframe_clean.pkl')
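
A pickled file preserves the dtypes, including the categorical "flags" column. This can be checked with a small round trip, sketched here on a made-up two-row frame and a temporary file (the names "toy" and 'toy.pkl' are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a float column and a categorical column
toy = pd.DataFrame({'production_index': [119.9, float('nan')],
                    'flags': pd.Categorical(['p', ''])})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')  # Hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.dtypes)
print(toy.equals(restored))  # equals() treats NaNs in the same position as equal
```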