## Feature Engineering

### Merging Datasets


In [1]:
import pandas as pd

In [2]:
file_url = 'https://github.com/PacktWorkshops/'\
           'The-Data-Science-Workshop/blob/'\
           'master/Chapter12/Dataset/'\
           'Online%20Retail.xlsx?raw=true'

In [3]:
df = pd.read_excel(file_url)
print(df.shape)
df.head()

(541909, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


Next, we are going to load all the public holidays in the UK into another pandas DataFrame. From Chapter 10, Analyzing a Dataset, we know that the records of this dataset cover only the years 2010 and 2011. So we are going to extract public holidays for those two years, but we need to do so in two separate steps because the API provided by date.nager serves a single year at a time.

In [4]:
uk_holidays_2010 = pd.read_csv('https://date.nager.at/PublicHoliday/'\
                    'Country/GB/2010/CSV')
uk_holidays_2010.shape

(13, 9)

In [5]:
uk_holidays_2010.head()

Unnamed: 0,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,2010-01-01,New Year's Day,New Year's Day,GB,False,True,,Public,
1,2010-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
2,2010-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
3,2010-04-02,Good Friday,Good Friday,GB,False,True,,Public,
4,2010-04-05,Easter Monday,Easter Monday,GB,False,False,,Public,"GB-ENG,GB-WLS,GB-NIR"


In [6]:
# uk holidays 2011
uk_holidays_2011 = pd.read_csv('https://date.nager.at/PublicHoliday/'\
                    'Country/GB/2011/CSV')

In [7]:
print(uk_holidays_2011.shape)
uk_holidays_2011.head()

(15, 9)


Unnamed: 0,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,2011-01-01,New Year's Day,New Year's Day,GB,False,False,,Public,GB-NIR
1,2011-01-03,New Year's Day,New Year's Day,GB,False,False,,Public,"GB-ENG,GB-WLS"
2,2011-01-03,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
3,2011-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
4,2011-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
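Since the two downloads differ only in the year, the same pattern can be written once and reused. The sketch below is one possible way to do that; the names `URL_TEMPLATE` and `load_holidays_by_year` are hypothetical helpers introduced for illustration, and the `reader` argument exists only so the function can be exercised without network access.

```python
import pandas as pd

# URL pattern copied from the cells above; only the year changes.
URL_TEMPLATE = ('https://date.nager.at/PublicHoliday/'
                'Country/GB/{year}/CSV')


def load_holidays_by_year(years, reader=pd.read_csv):
    """Return a dict {year: DataFrame}, one request per year.

    `reader` defaults to pd.read_csv; it is a parameter (an assumption of
    this sketch) so that a stub can be passed in when working offline.
    """
    return {year: reader(URL_TEMPLATE.format(year=year)) for year in years}


# Building the URLs does not require a network connection.
urls = [URL_TEMPLATE.format(year=year) for year in (2010, 2011)]
print(urls)
```

Calling `load_holidays_by_year([2010, 2011])` would then perform the two downloads shown above, provided the service is reachable.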


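The 2011 file contains 15 holiday records (some holidays appear more than once because they are listed per region). Now we need to combine the records of these two DataFrames. We will use the pd.concat() function from pandas (the older DataFrame.append() method was removed in pandas 2.0) and assign the results to a new DataFrame:

In [8]:
uk_holidays = pd.concat([uk_holidays_2010, uk_holidays_2011])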

In [9]:
print(uk_holidays.shape)
uk_holidays.head()

(28, 9)


Unnamed: 0,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,2010-01-01,New Year's Day,New Year's Day,GB,False,True,,Public,
1,2010-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
2,2010-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
3,2010-04-02,Good Friday,Good Friday,GB,False,True,,Public,
4,2010-04-05,Easter Monday,Easter Monday,GB,False,False,,Public,"GB-ENG,GB-WLS,GB-NIR"
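Stacking two DataFrames keeps each one's original index by default, so index labels can repeat in the combined result. The following sketch uses two small made-up DataFrames (not the real holiday files) to show `pd.concat()`, which works on current pandas versions where `DataFrame.append()` no longer exists, and the effect of `ignore_index=True`.

```python
import pandas as pd

# Toy stand-ins for the two yearly holiday DataFrames (made-up rows).
h2010 = pd.DataFrame({'Date': ['2010-01-01', '2010-04-02'],
                      'Name': ["New Year's Day", 'Good Friday']})
h2011 = pd.DataFrame({'Date': ['2011-01-03', '2011-04-22', '2011-04-25'],
                      'Name': ["New Year's Day", 'Good Friday', 'Easter Monday']})

# Default behaviour: rows are stacked and each frame keeps its own index.
stacked = pd.concat([h2010, h2011])

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([h2010, h2011], ignore_index=True)

print(stacked.shape)               # (5, 2)
print(stacked.index.tolist())      # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```

Repeated index labels do not affect the merge that follows, because it is done on a column rather than on the index.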


In order to merge two DataFrames together, we need to have at least one common column between them, meaning the two DataFrames should have at least one column that contains the same type of information. In our example, we are going to merge the public holidays DataFrame on its **Date** column with the Online Retail DataFrame on its **InvoiceDate** column. We can see that the data format of these two columns is different: one is a date **(yyyy-mm-dd)** and the other is a datetime **(yyyy-mm-dd hh:mm:ss)**.

So, we need to transform the **InvoiceDate** column into date format **(yyyy-mm-dd)**. One way to do it (we will see another one later in this chapter) is to transform this column into text and then extract the first 10 characters for each cell using the **.str.slice()** method.

For example, the date 2010-12-01 08:26:00 will first be converted into a string and then we will keep only the first 10 characters, which will be 2010-12-01. We are going to save these results into a new column called **InvoiceDay**:

In [10]:
df['InvoiceDay'] = df['InvoiceDate'].astype(str).str.slice(stop=10)
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01
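To make the string-slicing step concrete, here is a small sketch on a made-up two-row DataFrame. It also shows `.dt.strftime()` as one possible alternative that produces the same text; this is an assumption of the sketch and not necessarily the second approach the chapter introduces later.

```python
import pandas as pd

# Toy datetime column (made-up values in the same format as InvoiceDate).
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2010-12-01 08:26:00',
                                   '2011-01-03 14:05:00'])
})

# Approach used above: convert to text, keep the first 10 characters.
toy['InvoiceDay'] = toy['InvoiceDate'].astype(str).str.slice(stop=10)

# Possible alternative: format the datetime directly as yyyy-mm-dd text.
toy['InvoiceDay_alt'] = toy['InvoiceDate'].dt.strftime('%Y-%m-%d')

print(toy['InvoiceDay'].tolist())   # ['2010-12-01', '2011-01-03']
```

Both columns hold plain strings, which is what matters here: the **Date** column read from the CSV file is also text, so the merge keys have the same type.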


Now InvoiceDay from the online retail DataFrame and Date from the UK public holidays DataFrame have similar information, so we can merge these two DataFrames together using .merge() from pandas.

There are multiple ways to join two tables together:

    The left join
    The right join
    The inner join
    The outer join

#### The Left Join

The left join will keep all the rows from the first DataFrame, which is the Online Retail dataset (the left-hand side), and join them to the matching rows from the second DataFrame, which is the UK Public Holidays dataset (the right-hand side).

To perform a left join, we need to specify to the .merge() method the following parameters:

    how='left' for a left join
    left_on='InvoiceDay' to specify the column used for merging from the left-hand side (here, the InvoiceDay column from the Online Retail DataFrame)
    right_on='Date' to specify the column used for merging from the right-hand side (here, the Date column from the UK Public Holidays DataFrame)


In [11]:
df_left = pd.merge(df, uk_holidays, left_on='InvoiceDay',
                  right_on='Date', how='left')
df_left.shape

(541909, 18)

In [12]:
df_left.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01,,,,,,,,,
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,,,,,,,,,
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01,,,,,,,,,
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,,,,,,,,,
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,,,,,,,,,
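In the output above, the holiday columns are empty for 1 December 2010 because that date has no match in the holidays table. The sketch below reproduces this behaviour on made-up data and shows one way the result could be turned into a feature; the `is_holiday` column and the holiday names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy sales and holiday tables (made-up rows).
sales = pd.DataFrame({
    'InvoiceNo': [1, 2, 3],
    'InvoiceDay': ['2010-12-01', '2010-12-27', '2010-12-28'],
})
holidays = pd.DataFrame({
    'Date': ['2010-12-27', '2010-12-28'],
    'Name': ['Holiday A', 'Holiday B'],
})

# Left join: every sales row is kept; unmatched rows get NaN on the right.
left = sales.merge(holidays, left_on='InvoiceDay',
                   right_on='Date', how='left')

# Hypothetical feature: True when the invoice date matched a holiday.
left['is_holiday'] = left['Date'].notna()

print(left)
```

The first row keeps its sales values but has missing `Date` and `Name`, exactly like the rows for 2010-12-01 in the real output.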


#### The Right Join

The right join is similar to the left join, except that it keeps all the rows from the second DataFrame (the right-hand side) and tries to match them with rows from the first one (the left-hand side).

In [13]:
df_right = df.merge(uk_holidays, left_on='InvoiceDay',
                   right_on='Date', how='right')
print(df_right.shape)
df_right.head()

(9602, 18)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,,,,,NaT,,,,,2010-01-01,New Year's Day,New Year's Day,GB,False,True,,Public,
1,,,,,NaT,,,,,2010-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
2,,,,,NaT,,,,,2010-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
3,,,,,NaT,,,,,2010-04-02,Good Friday,Good Friday,GB,False,True,,Public,
4,,,,,NaT,,,,,2010-04-05,Easter Monday,Easter Monday,GB,False,False,,Public,"GB-ENG,GB-WLS,GB-NIR"


We can see that the right join returns far fewer rows than the left join, but still many more than the 28 rows of the Public Holidays DataFrame. This is because several rows from the Online Retail DataFrame can match a single date in the public holidays one, and each match produces its own row.
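A tiny made-up example shows why the right join can return more rows than the right-hand table contains. The table contents below are illustrative only.

```python
import pandas as pd

# Three invoices fall on the same holiday; one falls on an ordinary day.
sales = pd.DataFrame({
    'InvoiceNo': [1, 2, 3, 4],
    'InvoiceDay': ['2010-12-27', '2010-12-27', '2010-12-27', '2010-12-01'],
})
# Two holidays; nothing was sold on the second one.
holidays = pd.DataFrame({
    'Date': ['2010-12-27', '2011-01-03'],
    'Name': ['Holiday A', 'Holiday B'],
})

right = sales.merge(holidays, left_on='InvoiceDay',
                    right_on='Date', how='right')

print(len(holidays))                    # 2
print(len(right))                       # 4
print(right['InvoiceNo'].isna().sum())  # 1
```

The first holiday appears three times (once per matching invoice), while the second holiday appears once with missing sales columns, like the `NaT` rows in the output above.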

An inner join will only keep the rows that match between the two tables:

In [15]:
df_inner = df.merge(uk_holidays, left_on='InvoiceDay', 
                   right_on='Date', how='inner')
df_inner.shape

(9579, 18)

The outer join will keep all rows from both tables (matched and unmatched):

In [16]:
df_outer = df.merge(uk_holidays, left_on='InvoiceDay', 
                   right_on='Date', how='outer')
df_outer.shape

(541932, 18)
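The four row counts are related: the inner join holds only the matches, the right join adds the unmatched holidays, and the outer join adds them to the left join. This is consistent with the shapes reported above (9602 - 9579 = 23 and 541932 - 541909 = 23). The sketch below checks the same relationships on made-up data and uses the `indicator=True` option of `.merge()` to label where each row came from.

```python
import pandas as pd

sales = pd.DataFrame({
    'InvoiceNo': [1, 2, 3, 4],
    'InvoiceDay': ['2010-12-27', '2010-12-27', '2010-12-27', '2010-12-01'],
})
holidays = pd.DataFrame({
    'Date': ['2010-12-27', '2011-01-03'],
    'Name': ['Holiday A', 'Holiday B'],
})

# Number of rows produced by each join type on the same inputs.
row_counts = {
    how: len(sales.merge(holidays, left_on='InvoiceDay',
                         right_on='Date', how=how))
    for how in ['left', 'right', 'inner', 'outer']
}
print(row_counts)  # {'left': 4, 'right': 4, 'inner': 3, 'outer': 5}

# indicator=True adds a '_merge' column: left_only, right_only, or both.
outer = sales.merge(holidays, left_on='InvoiceDay', right_on='Date',
                    how='outer', indicator=True)
origin_counts = outer['_merge'].value_counts().to_dict()
print(origin_counts)
```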

Before merging two tables, it is extremely important to know what your focus is. If your objective is to expand the number of features of an original dataset by adding the columns from another one, then you will probably use a left or right join. But be aware that you may end up with more observations than you started with, because one row can match several rows in the other table. On the other hand, if you are interested in knowing which observations matched or didn't match between the two tables, you will use either an inner join (matches only) or an outer join (everything).
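One way to guard against unexpected extra rows is the `validate` argument of `.merge()`, which raises an error when the keys are not unique on the side you expect. The sketch below is a minimal illustration on made-up sales rows; the duplicated date mirrors the 2011-01-03 entries listed once per region in the holidays table.

```python
import pandas as pd
from pandas.errors import MergeError

sales = pd.DataFrame({
    'InvoiceNo': [1, 2],
    'InvoiceDay': ['2011-01-03', '2011-01-04'],
})
# The same date appears twice, once per region.
holidays = pd.DataFrame({
    'Date': ['2011-01-03', '2011-01-03'],
    'Counties': ['GB-ENG,GB-WLS', 'GB-SCT'],
})

# Without a check, invoice 1 is silently duplicated: 2 rows in, 3 rows out.
merged = sales.merge(holidays, left_on='InvoiceDay',
                     right_on='Date', how='left')
print(len(sales), len(merged))  # 2 3

# validate='many_to_one' requires unique keys in the right-hand table.
try:
    sales.merge(holidays, left_on='InvoiceDay', right_on='Date',
                how='left', validate='many_to_one')
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

print(duplicates_detected)  # True
```

If the check fails, a common remedy is to deduplicate the key column first, for example with `holidays.drop_duplicates(subset='Date')`, at the cost of losing the per-region detail.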

### Binning Variables


In [17]:
# unique values for 'Country'
df['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

We are going to group some of the countries together into regions such as Asia, the Middle East, and America. We will leave the European countries (and the few remaining values, such as Australia and RSA) as is.

In [18]:
df['Country_bin'] = df['Country']

In [20]:
# asian countries list
asian_countries = ['Japan', 'Hong Kong', 'Singapore']

Then, using the .loc indexer and the .isin() method from pandas, we are going to change the value of Country_bin to Asia for all of the rows whose country is present in the asian_countries list:

In [23]:
df.loc[df['Country'].isin(asian_countries), 
      'Country_bin'] = 'Asia'
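The assignment above combines two steps: `.isin()` builds a Boolean mask, and `.loc[mask, column]` writes only to the rows where the mask is True. A small sketch on made-up rows makes the intermediate mask visible.

```python
import pandas as pd

toy = pd.DataFrame({'Country': ['Japan', 'France', 'Singapore', 'Japan']})
toy['Country_bin'] = toy['Country']

asian_countries = ['Japan', 'Hong Kong', 'Singapore']

# Step 1: Boolean mask, True where the country is in the list.
mask = toy['Country'].isin(asian_countries)
print(mask.tolist())                 # [True, False, True, True]

# Step 2: overwrite Country_bin only for the selected rows.
toy.loc[mask, 'Country_bin'] = 'Asia'
print(toy['Country_bin'].tolist())   # ['Asia', 'France', 'Asia', 'Asia']
```

The original `Country` column is left untouched, so the raw values remain available after binning.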

In [25]:
df['Country_bin'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Asia', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Lebanon',
       'United Arab Emirates', 'Saudi Arabia', 'Czech Republic', 'Canada',
       'Unspecified', 'Brazil', 'USA', 'European Community', 'Malta',
       'RSA'], dtype=object)

In [26]:
# binning Middle Eastern countries
m_east_countries = ['Israel', 'Bahrain', 'Lebanon', 
                   'United Arab Emirates', 'Saudi Arabia']
df.loc[df['Country'].isin(m_east_countries), 
      'Country_bin'] = 'Middle East'

In [27]:
df['Country_bin'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Asia', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Middle East', 'Finland', 'Greece', 'Czech Republic', 'Canada',
       'Unspecified', 'Brazil', 'USA', 'European Community', 'Malta',
       'RSA'], dtype=object)

In [28]:
# binning for countries in North & South America
american_countries = ['Canada', 'Brazil', 'USA']
df.loc[df['Country'].isin(american_countries), 
      'Country_bin'] = 'America'

In [29]:
df['Country_bin'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Asia', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Middle East', 'Finland', 'Greece', 'Czech Republic', 'America',
       'Unspecified', 'European Community', 'Malta', 'RSA'], dtype=object)

In [30]:
df['Country_bin'].nunique()

30
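The count of 30 is what we would expect: the 38 original values lose 2 for Asia (3 countries become 1), 4 for the Middle East (5 become 1), and 2 for America (3 become 1). As an alternative to repeating the `.loc` assignment once per region, the same binning could be expressed with a single lookup table. The sketch below is one possible formulation on made-up rows; `region_lists` and `country_to_region` are names introduced here for illustration.

```python
import pandas as pd

# Region definitions taken from the lists used above.
region_lists = {
    'Asia': ['Japan', 'Hong Kong', 'Singapore'],
    'Middle East': ['Israel', 'Bahrain', 'Lebanon',
                    'United Arab Emirates', 'Saudi Arabia'],
    'America': ['Canada', 'Brazil', 'USA'],
}

# Invert to a country -> region lookup.
country_to_region = {country: region
                     for region, countries in region_lists.items()
                     for country in countries}

toy = pd.DataFrame({'Country': ['Japan', 'France', 'USA',
                                'Bahrain', 'United Kingdom']})

# map() returns NaN for countries without a region; fillna() restores
# the original country name for those rows.
toy['Country_bin'] = (toy['Country'].map(country_to_region)
                                    .fillna(toy['Country']))

print(toy['Country_bin'].tolist())
# ['Asia', 'France', 'America', 'Middle East', 'United Kingdom']
```

Keeping the regions in one dictionary makes it easier to add or adjust a group later without touching the assignment logic.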