## Feature Engineering

### Merging Datasets


In [1]:
import pandas as pd

In [2]:
file_url = 'https://github.com/PacktWorkshops/'\
           'The-Data-Science-Workshop/blob/'\
           'master/Chapter12/Dataset/'\
           'Online%20Retail.xlsx?raw=true'

In [3]:
df = pd.read_excel(file_url)
print(df.shape)
df.head()

(541909, 8)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


Next, we are going to load all the public holidays in the UK into another pandas DataFrame. From Chapter 10, *Analyzing a Dataset*, we know that the records of this dataset cover only the years 2010 and 2011. So we are going to extract the public holidays for those two years, but we need to do so in two separate steps because the API provided by date.nager.at serves a single year per request.

In [4]:
uk_holidays_2010 = pd.read_csv('https://date.nager.at/PublicHoliday/'\
                    'Country/GB/2010/CSV')
uk_holidays_2010.shape

(13, 9)

In [5]:
uk_holidays_2010.head()

Unnamed: 0,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,2010-01-01,New Year's Day,New Year's Day,GB,False,True,,Public,
1,2010-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
2,2010-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
3,2010-04-02,Good Friday,Good Friday,GB,False,True,,Public,
4,2010-04-05,Easter Monday,Easter Monday,GB,False,False,,Public,"GB-ENG,GB-WLS,GB-NIR"


In [6]:
# UK public holidays for 2011
uk_holidays_2011 = pd.read_csv('https://date.nager.at/PublicHoliday/'\
                    'Country/GB/2011/CSV')

In [7]:
print(uk_holidays_2011.shape)
uk_holidays_2011.head()

(15, 9)


Unnamed: 0,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,2011-01-01,New Year's Day,New Year's Day,GB,False,False,,Public,GB-NIR
1,2011-01-03,New Year's Day,New Year's Day,GB,False,False,,Public,"GB-ENG,GB-WLS"
2,2011-01-03,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
3,2011-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
4,2011-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR


There are 15 public holiday records for 2011. Now we need to combine the records of these two DataFrames. We will use the pd.concat() function from pandas and assign the result to a new DataFrame (the older .append() method was deprecated in pandas 1.4 and removed in pandas 2.0):

In [8]:
uk_holidays = pd.concat([uk_holidays_2010, uk_holidays_2011])

In [9]:
print(uk_holidays.shape)
uk_holidays.head()

(28, 9)


Unnamed: 0,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,2010-01-01,New Year's Day,New Year's Day,GB,False,True,,Public,
1,2010-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
2,2010-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
3,2010-04-02,Good Friday,Good Friday,GB,False,True,,Public,
4,2010-04-05,Easter Monday,Easter Monday,GB,False,False,,Public,"GB-ENG,GB-WLS,GB-NIR"
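As a minimal sketch of stacking yearly tables, the block below uses two tiny hand-made DataFrames (hypothetical stand-ins for the yearly holiday files, so it runs without network access). The `ignore_index=True` option is an optional extra not used in the text: without it, each input keeps its own 0-based index, so index labels repeat in the combined DataFrame.

```python
import pandas as pd

# Hypothetical stand-ins for the two yearly holiday DataFrames
holidays_2010 = pd.DataFrame({'Date': ['2010-01-01', '2010-04-02'],
                              'Name': ["New Year's Day", 'Good Friday']})
holidays_2011 = pd.DataFrame({'Date': ['2011-01-03', '2011-04-22', '2011-04-25'],
                              'Name': ["New Year's Day", 'Good Friday', 'Easter Monday']})

# Stack the rows of both frames; columns are aligned by name
stacked = pd.concat([holidays_2010, holidays_2011])

# Same rows, but with a fresh 0..n-1 index
stacked_clean = pd.concat([holidays_2010, holidays_2011], ignore_index=True)

print(stacked.shape)                   # row count is the sum of both inputs
print(stacked.index.is_unique)         # labels 0 and 1 appear twice
print(stacked_clean.index.is_unique)   # renumbered index has no repeats
```

Whether to renumber the index is a design choice: repeated labels are harmless for a later merge on a column, but they can be surprising when selecting rows by label.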


In order to merge two DataFrames, they need at least one column in common, meaning a column in each that contains the same type of information. In our example, we are going to merge the holidays DataFrame with the Online Retail DataFrame, using the **Date** column of the former and the **InvoiceDate** column of the latter. We can see that the data format of these two columns is different: one is a date **(yyyy-mm-dd)** and the other is a datetime **(yyyy-mm-dd hh:mm:ss)**.

So, we need to transform the **InvoiceDate** column into date format **(yyyy-mm-dd)**. One way to do it (we will see another one later in this chapter) is to transform this column into text and then extract the first 10 characters for each cell using the **.str.slice()** method.

For example, the date 2010-12-01 08:26:00 will first be converted into a string and then we will keep only the first 10 characters, which will be 2010-12-01. We are going to save these results into a new column called **InvoiceDay**:

In [10]:
df['InvoiceDay'] = df['InvoiceDate'].astype(str).str.slice(stop=10)
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01
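Slicing the string is not the only way to obtain a yyyy-mm-dd value. As a sketch on a small hypothetical Series (not the full dataset), the block below compares the slicing approach with `.dt.strftime()`, which formats each timestamp directly. Both produce text values that can be matched against a text Date column.

```python
import pandas as pd

# Hypothetical sample of invoice timestamps
invoice_dates = pd.Series(pd.to_datetime(['2010-12-01 08:26:00',
                                          '2011-12-09 12:50:00']))

# Approach used in the text: cast to string, keep the first 10 characters
by_slice = invoice_dates.astype(str).str.slice(stop=10)

# Alternative: format the datetime directly as yyyy-mm-dd
by_strftime = invoice_dates.dt.strftime('%Y-%m-%d')

print(by_slice.tolist())
print(by_slice.tolist() == by_strftime.tolist())
```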


Now InvoiceDay from the online retail DataFrame and Date from the UK public holidays DataFrame have similar information, so we can merge these two DataFrames together using .merge() from pandas.

There are multiple ways to join two tables together:

- The left join
- The right join
- The inner join
- The outer join

#### The Left Join

The left join keeps all the rows from the first DataFrame, which is the Online Retail dataset (the left-hand side), and joins them to the matching rows from the second DataFrame, which is the UK Public Holidays dataset (the right-hand side).

To perform a left join, we need to pass the following parameters to the .merge() method:

- how='left' for a left join
- left_on='InvoiceDay' to specify the column used for merging from the left-hand side (here, the InvoiceDay column from the Online Retail DataFrame)
- right_on='Date' to specify the column used for merging from the right-hand side (here, the Date column from the UK Public Holidays DataFrame)


In [11]:
df_left = pd.merge(df, uk_holidays, left_on='InvoiceDay',
                  right_on='Date', how='left')
df_left.shape

(541909, 18)

In [12]:
df_left.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01,,,,,,,,,
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,,,,,,,,,
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01,,,,,,,,,
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,,,,,,,,,
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,,,,,,,,,


#### The Right Join

The right join is similar to the left join, except that it keeps all the rows from the second DataFrame (the right-hand side) and tries to match them with rows from the first one (the left-hand side).

In [13]:
df_right = df.merge(uk_holidays, left_on='InvoiceDay',
                   right_on='Date', how='right')
print(df_right.shape)
df_right.head()

(9602, 18)


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Date,LocalName,Name,CountryCode,Fixed,Global,LaunchYear,Type,Counties
0,,,,,NaT,,,,,2010-01-01,New Year's Day,New Year's Day,GB,False,True,,Public,
1,,,,,NaT,,,,,2010-01-04,New Year's Day,New Year's Day,GB,False,False,,Public,GB-SCT
2,,,,,NaT,,,,,2010-03-17,Saint Patrick's Day,Saint Patrick's Day,GB,True,False,,Public,GB-NIR
3,,,,,NaT,,,,,2010-04-02,Good Friday,Good Friday,GB,False,True,,Public,
4,,,,,NaT,,,,,2010-04-05,Easter Monday,Easter Monday,GB,False,False,,Public,"GB-ENG,GB-WLS,GB-NIR"


We can see that the right join returns far fewer rows than the left join, but still many more than the 28 rows of the Public Holidays DataFrame. This is because multiple rows from the Online Retail DataFrame can match a single date in the public holidays one, and each match produces its own row.

An inner join will only keep the rows that match between the two tables:

In [14]:
df_inner = df.merge(uk_holidays, left_on='InvoiceDay', 
                   right_on='Date', how='inner')
df_inner.shape

(9579, 18)

The outer join keeps all rows from both tables (matched and unmatched):

In [15]:
df_outer = df.merge(uk_holidays, left_on='InvoiceDay', 
                   right_on='Date', how='outer')
df_outer.shape

(541932, 18)
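The row counts above fit together: the right join has 9602 − 9579 = 23 more rows than the inner join (holiday rows with no matching invoice), and the outer join has 541909 + 23 = 541932 rows. The sketch below reproduces the same relationships on hypothetical toy data. The `indicator=True` option, which is not used in the text, adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical toy data: three invoice lines and two holidays
sales = pd.DataFrame({'InvoiceDay': ['2011-01-03', '2011-01-03', '2011-01-05'],
                      'Quantity': [6, 8, 2]})
holidays = pd.DataFrame({'Date': ['2011-01-03', '2011-04-22'],
                         'Name': ["New Year's Day", 'Good Friday']})

# Row count for each join type
counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = sales.merge(holidays, left_on='InvoiceDay', right_on='Date', how=how)
    counts[how] = len(merged)
print(counts)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'
outer = sales.merge(holidays, left_on='InvoiceDay', right_on='Date',
                    how='outer', indicator=True)
print(outer['_merge'].value_counts())
```

In this toy case one holiday has no matching invoice, so the right join has one row more than the inner join and the outer join has one row more than the left join.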

Before merging two tables, it is extremely important for you to know what your focus is. If your objective is to expand the number of features from an original dataset by adding the columns from another one, then you will probably use a left or right join. But be aware you may end up with more observations due to potentially multiple matches between the two tables. On the other hand, if you are interested in knowing which observations matched or didn't match between the two tables, you will either use an inner or outer join.
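One way to guard against the extra observations mentioned above is to check that the merge key is unique on the lookup side. The sketch below uses hypothetical toy data in which the same date appears twice, once per region, as it does in the holiday file. The `validate='many_to_one'` option of `.merge()` is not used in the text; it makes pandas raise an error instead of silently duplicating rows.

```python
import pandas as pd

sales = pd.DataFrame({'InvoiceDay': ['2011-01-03', '2011-01-05'],
                      'Quantity': [6, 2]})
# The same date appears twice, once per region
holidays = pd.DataFrame({'Date': ['2011-01-03', '2011-01-03'],
                         'Counties': ['GB-ENG,GB-WLS', 'GB-SCT']})

# A plain left join duplicates the sales row that matches two holiday rows
grown = sales.merge(holidays, left_on='InvoiceDay', right_on='Date', how='left')
print(len(sales), len(grown))

# validate='many_to_one' raises MergeError when right-hand keys are not unique
try:
    sales.merge(holidays, left_on='InvoiceDay', right_on='Date',
                how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)

# Keeping one row per date restores the original row count
unique_days = holidays.drop_duplicates(subset='Date')
safe = sales.merge(unique_days, left_on='InvoiceDay', right_on='Date',
                   how='left', validate='many_to_one')
print(len(safe))
```

Dropping duplicates discards the regional detail, so this only suits cases where a simple "is this date a holiday somewhere" flag is enough.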

### Binning Variables


In [16]:
# unique values for 'Country'
df['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

We are going to group some of the countries together into regions such as Asia, the Middle East, and America. We will leave the European countries as is.

In [17]:
df['Country_bin'] = df['Country']

In [18]:
# asian countries list
asian_countries = ['Japan', 'Hong Kong', 'Singapore']

Then, using the .loc[] indexer and the .isin() method from pandas, we change the value of Country_bin to Asia for all rows whose country is present in the asian_countries list:

In [19]:
df.loc[df['Country'].isin(asian_countries), 
      'Country_bin'] = 'Asia'

In [20]:
df['Country_bin'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Asia', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Lebanon',
       'United Arab Emirates', 'Saudi Arabia', 'Czech Republic', 'Canada',
       'Unspecified', 'Brazil', 'USA', 'European Community', 'Malta',
       'RSA'], dtype=object)

In [21]:
# binning Middle Eastern countries
m_east_countries = ['Israel', 'Bahrain', 'Lebanon', 
                   'United Arab Emirates', 'Saudi Arabia']
df.loc[df['Country'].isin(m_east_countries), 
      'Country_bin'] = 'Middle East'

In [22]:
df['Country_bin'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Asia', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Middle East', 'Finland', 'Greece', 'Czech Republic', 'Canada',
       'Unspecified', 'Brazil', 'USA', 'European Community', 'Malta',
       'RSA'], dtype=object)

In [23]:
# binning for countries in North & South America
american_countries = ['Canada', 'Brazil', 'USA']
df.loc[df['Country'].isin(american_countries), 
      'Country_bin'] = 'America'

In [24]:
df['Country_bin'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Asia', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Middle East', 'Finland', 'Greece', 'Czech Republic', 'America',
       'Unspecified', 'European Community', 'Malta', 'RSA'], dtype=object)

In [25]:
df['Country_bin'].nunique()

30
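The three separate `.loc[]` assignments can also be written as a single lookup. As a sketch, the block below builds a hypothetical `country_to_region` dictionary from the same three lists and applies it with `.replace()`, which leaves values not found in the dictionary untouched. It runs on a small sample Series rather than the full dataset.

```python
import pandas as pd

# Small hypothetical sample of the Country column
countries = pd.Series(['United Kingdom', 'Japan', 'Israel', 'USA', 'France'])

regions = {
    'Asia': ['Japan', 'Hong Kong', 'Singapore'],
    'Middle East': ['Israel', 'Bahrain', 'Lebanon',
                    'United Arab Emirates', 'Saudi Arabia'],
    'America': ['Canada', 'Brazil', 'USA'],
}

# Invert the region lists into a {country: region} lookup
country_to_region = {country: region
                     for region, members in regions.items()
                     for country in members}

# Countries not in the dictionary keep their original value
country_bin = countries.replace(country_to_region)
print(country_bin.tolist())
```

Keeping the groups in one dictionary makes it easier to add a region later without repeating the assignment code.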

## Manipulating Dates

In most datasets you will be working on, there will be one or more columns containing date information. Usually, you will not feed that type of information directly as input to a machine learning algorithm. The reason is you don't want it to learn extremely specific patterns, such as customer A bought product X on August 3, 2012, at 08:11 a.m. The model would be overfitting in that case and wouldn't be able to generalize to future data. 

In [26]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
InvoiceDay             object
Country_bin            object
dtype: object

In [27]:
# extract year of a date
df['InvoiceDate'].dt.year

0         2010
1         2010
2         2010
3         2010
4         2010
          ... 
541904    2011
541905    2011
541906    2011
541907    2011
541908    2011
Name: InvoiceDate, Length: 541909, dtype: int64

In [28]:
# extract day of the week for each row for InvoiceDate column
df['InvoiceDate'].dt.dayofweek

0         2
1         2
2         2
3         2
4         2
         ..
541904    4
541905    4
541906    4
541907    4
541908    4
Name: InvoiceDate, Length: 541909, dtype: int64
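The `.dt` accessor exposes several such components, and they can be gathered into model-ready features in one step. The sketch below uses two hypothetical timestamps (the first and last values shown above). The `is_weekend` column is an illustrative addition that relies on pandas numbering the days from Monday=0 to Sunday=6.

```python
import pandas as pd

# Hypothetical sample timestamps
dates = pd.Series(pd.to_datetime(['2010-12-01 08:26:00', '2011-12-09 12:50:00']))

features = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'dayofweek': dates.dt.dayofweek,        # Monday=0 ... Sunday=6
    'hour': dates.dt.hour,
    'is_weekend': dates.dt.dayofweek >= 5,  # Saturday or Sunday
})
print(features)
```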

In [29]:
# add 3 days to each date using pandas time-series offset object
df['InvoiceDate'] + pd.tseries.offsets.Day(3)

0        2010-12-04 08:26:00
1        2010-12-04 08:26:00
2        2010-12-04 08:26:00
3        2010-12-04 08:26:00
4        2010-12-04 08:26:00
                 ...        
541904   2011-12-12 12:50:00
541905   2011-12-12 12:50:00
541906   2011-12-12 12:50:00
541907   2011-12-12 12:50:00
541908   2011-12-12 12:50:00
Name: InvoiceDate, Length: 541909, dtype: datetime64[ns]

In [30]:
df['InvoiceDate']

0        2010-12-01 08:26:00
1        2010-12-01 08:26:00
2        2010-12-01 08:26:00
3        2010-12-01 08:26:00
4        2010-12-01 08:26:00
                 ...        
541904   2011-12-09 12:50:00
541905   2011-12-09 12:50:00
541906   2011-12-09 12:50:00
541907   2011-12-09 12:50:00
541908   2011-12-09 12:50:00
Name: InvoiceDate, Length: 541909, dtype: datetime64[ns]

In [31]:
# offset by business days: return the previous business day of each value in the column
df['InvoiceDate'] + pd.tseries.offsets.BusinessDay(-1)

0        2010-11-30 08:26:00
1        2010-11-30 08:26:00
2        2010-11-30 08:26:00
3        2010-11-30 08:26:00
4        2010-11-30 08:26:00
                 ...        
541904   2011-12-08 12:50:00
541905   2011-12-08 12:50:00
541906   2011-12-08 12:50:00
541907   2011-12-08 12:50:00
541908   2011-12-08 12:50:00
Name: InvoiceDate, Length: 541909, dtype: datetime64[ns]
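The rows shown above all fall midweek, so the business-day offset looks identical to subtracting one calendar day. The sketch below uses three hypothetical dates (a Wednesday, a Monday, and a Saturday) to show where the two differ: the offset skips the weekend.

```python
import pandas as pd

# Hypothetical dates: a Wednesday, a Monday and a Saturday
dates = pd.Series(pd.to_datetime(['2011-12-07', '2011-12-05', '2011-12-10']))

# Move each date back to the previous business day
previous_bday = dates + pd.tseries.offsets.BusinessDay(-1)

# Show the day names before and after the shift
print(pd.DataFrame({'original': dates.dt.day_name(),
                    'shifted': previous_bday.dt.day_name()}))
```

Note that `BusinessDay` only knows about weekends; public holidays such as those loaded earlier are not taken into account.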

In [32]:
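# note: a Timedelta is a fixed duration, so this adds 1 millisecond to each date;
# it does not return the first day of the month ('MS' is a month-start frequency alias, not a Timedelta unit)
df['InvoiceDate'] + pd.Timedelta(1, unit='ms')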

0        2010-12-01 08:26:00.001
1        2010-12-01 08:26:00.001
2        2010-12-01 08:26:00.001
3        2010-12-01 08:26:00.001
4        2010-12-01 08:26:00.001
                   ...          
541904   2011-12-09 12:50:00.001
541905   2011-12-09 12:50:00.001
541906   2011-12-09 12:50:00.001
541907   2011-12-09 12:50:00.001
541908   2011-12-09 12:50:00.001
Name: InvoiceDate, Length: 541909, dtype: datetime64[ns]
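The output above shows each timestamp shifted by one millisecond, because a `Timedelta` is a fixed duration rather than a calendar-aware offset. If the goal is the first day of a month, two possible approaches are sketched below on hypothetical sample timestamps. Neither appears in the text, so treat them as suggestions: converting to a monthly period and back, or adding a `MonthBegin` offset.

```python
import pandas as pd

# Hypothetical sample timestamps
dates = pd.Series(pd.to_datetime(['2010-12-01 08:26:00', '2011-12-09 12:50:00']))

# First day of the same month (the time is reset to midnight)
month_start = dates.dt.to_period('M').dt.to_timestamp()

# MonthBegin(1) moves forward to the next month start and keeps the time of day
next_month_start = dates + pd.tseries.offsets.MonthBegin(1)

print(month_start.tolist())
print(next_month_start.tolist())
```

The two results differ on purpose: the first stays within the current month, while the second always moves to the following month start, even for a date that is already the first of a month.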

## Performing Data Aggregation
Data aggregation summarizes a numerical column for specific groups defined by another column.

One potential reason is that you want to normalize another numerical column using this aggregation. For instance, if you are working on a dataset for a retailer that contains all the sales for each store around the world, the volume of sales may differ drastically from one country to another because they don't have the same population. In this case, rather than using the raw sales figures for each store, you would calculate a ratio (or a percentage) of the sales of a store divided by the total volume of sales in its country. With this new ratio feature, some stores that looked as though they were underperforming, because their raw sales volume was lower than that of stores in other countries, may actually be performing much better than the average in their own country.
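The store example can be sketched with a few hypothetical rows. The `stores` DataFrame and its column names are invented for illustration. The block uses `.groupby().transform('sum')`, a shortcut that returns the group total aligned to every original row. This is a different route from the aggregate-then-merge steps this section walks through, but it yields the same kind of ratio.

```python
import pandas as pd

# Hypothetical store-level sales
stores = pd.DataFrame({
    'store': ['A', 'B', 'C', 'D'],
    'country': ['UK', 'UK', 'Malta', 'Malta'],
    'sales': [900, 100, 30, 10],
})

# Total sales of the country each store belongs to, aligned row by row
stores['country_sales'] = stores.groupby('country')['sales'].transform('sum')

# Share of the country's sales made by each store
stores['sales_ratio'] = stores['sales'] / stores['country_sales']
print(stores)
```

Store C sells less than store B in raw terms (30 versus 100) but accounts for a larger share of its own country's sales (0.75 versus 0.1).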

In pandas, it is quite easy to perform data aggregation. We just need to combine the following methods successively: .groupby() and .agg().
The .agg() method expects a dictionary with the name of a column as a key and the aggregation function as a value such as {'column_name': 'aggregation_function'}.
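As a small illustration of the dictionary format, the sketch below aggregates two columns of a hypothetical `orders` DataFrame at once, using a different function for each column.

```python
import pandas as pd

# Hypothetical order lines
orders = pd.DataFrame({
    'Country': ['France', 'France', 'Malta'],
    'Quantity': [12, 6, 4],
    'UnitPrice': [0.85, 2.10, 4.15],
})

# One aggregation per column, passed as {'column_name': 'aggregation_function'}
summary = orders.groupby('Country').agg({'Quantity': 'sum', 'UnitPrice': 'mean'})
print(summary)
```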

Let's calculate the total quantity of items sold for each country. We will specify the Country column as the grouping column:

In [33]:
df.groupby('Country').agg({'Quantity': 'sum'})

Country,Quantity
Australia,83653
Austria,4827
Bahrain,260
Belgium,23152
Brazil,356
Canada,2763
Channel Islands,9479
Cyprus,6317
Czech Republic,592
Denmark,8188


This result gives the total volume of items sold for each country. We can see that the total for Australia is roughly 3.6 times that of Belgium (83,653 versus 23,152). This level of information may be too coarse, and we may want more granular detail. Let's perform the same aggregation, but this time we will group on two columns: Country and StockCode. We just need to provide the names of these columns as a list to the .groupby() method:

In [34]:
df.groupby(['Country', 'StockCode']).agg({'Quantity': 'sum'})

Country,StockCode,Quantity
Australia,15036,600
Australia,20665,6
Australia,20675,216
Australia,20676,216
Australia,20677,216
...,...,...
Unspecified,85049A,1
Unspecified,85179A,1
Unspecified,85179C,1
Unspecified,85180A,2


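We can now see how many units of each product have been sold in each country. Note that products 20675, 20676, and 20677 have the same total for Australia (216 each). This may indicate that these products are sold together.

Let's go one level deeper and also group by invoice date. We first extract the date part of InvoiceDate into a new Invoice_Date column: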

In [36]:
df['Invoice_Date'] = df['InvoiceDate'].dt.date

In [37]:
df.groupby(['Country', 'StockCode', 'Invoice_Date']).agg({'Quantity': 'sum'})

Country,StockCode,Invoice_Date,Quantity
Australia,15036,2011-05-17,600
Australia,20665,2011-03-24,6
Australia,20675,2011-01-06,72
Australia,20675,2011-03-03,144
Australia,20676,2011-01-06,72
...,...,...,...
Unspecified,85049A,2011-07-28,1
Unspecified,85179A,2011-07-28,1
Unspecified,85179C,2011-07-28,1
Unspecified,85180A,2011-07-28,2


We can now merge this additional information back into the original DataFrame. But before that, there is an additional data transformation step required: resetting the index. By default, pandas stores the grouping columns (Country, StockCode, and Invoice_Date) as a multi-level row index after aggregation, so they are no longer regular columns. To turn them back into regular columns with a single-level index, you need to call the .reset_index() method:

In [38]:
df_agg = df.groupby(['Country', 'StockCode', 'Invoice_Date']).agg({'Quantity': 'sum'}).reset_index()
df_agg.head()

Unnamed: 0,Country,StockCode,Invoice_Date,Quantity
0,Australia,15036,2011-05-17,600
1,Australia,20665,2011-03-24,6
2,Australia,20675,2011-01-06,72
3,Australia,20675,2011-03-03,144
4,Australia,20676,2011-01-06,72
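The effect of `.reset_index()` is easier to see by inspecting the index and columns before and after. The sketch below does this on a hypothetical `orders` DataFrame. The last step shows `as_index=False`, a `.groupby()` option not used in the text, which is assumed here to give the same flat layout directly.

```python
import pandas as pd

# Hypothetical order lines
orders = pd.DataFrame({
    'Country': ['France', 'France', 'Malta'],
    'StockCode': ['22613', '22613', '23254'],
    'Quantity': [12, 6, 4],
})

grouped = orders.groupby(['Country', 'StockCode']).agg({'Quantity': 'sum'})
print(grouped.index.nlevels)      # the grouping keys form a two-level row index
print(grouped.columns.tolist())   # only the aggregated column remains

flat = grouped.reset_index()
print(flat.columns.tolist())      # the keys are ordinary columns again

# Alternative: ask groupby to keep the keys as columns from the start
flat_direct = orders.groupby(['Country', 'StockCode'],
                             as_index=False).agg({'Quantity': 'sum'})
print(flat_direct.columns.tolist())
```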


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   InvoiceNo     541909 non-null  object        
 1   StockCode     541909 non-null  object        
 2   Description   540455 non-null  object        
 3   Quantity      541909 non-null  int64         
 4   InvoiceDate   541909 non-null  datetime64[ns]
 5   UnitPrice     541909 non-null  float64       
 6   CustomerID    406829 non-null  float64       
 7   Country       541909 non-null  object        
 8   InvoiceDay    541909 non-null  object        
 9   Country_bin   541909 non-null  object        
 10  Invoice_Date  541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 45.5+ MB


Now we can merge this new DataFrame into the original one using the .merge() method we saw earlier in this chapter:

In [40]:
df_merged = pd.merge(df, df_agg, how='left', on=['Country', 'StockCode', 'Invoice_Date'])
df_merged

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity_x,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Country_bin,Invoice_Date,Quantity_y
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,454
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,33
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,40
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,59
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,551
...,...,...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,2011-12-09,France,2011-12-09,12
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,2011-12-09,France,2011-12-09,6
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12-09,France,2011-12-09,4
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12-09,France,2011-12-09,4


In [41]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 541909 entries, 0 to 541908
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   InvoiceNo     541909 non-null  object        
 1   StockCode     541909 non-null  object        
 2   Description   540455 non-null  object        
 3   Quantity_x    541909 non-null  int64         
 4   InvoiceDate   541909 non-null  datetime64[ns]
 5   UnitPrice     541909 non-null  float64       
 6   CustomerID    406829 non-null  float64       
 7   Country       541909 non-null  object        
 8   InvoiceDay    541909 non-null  object        
 9   Country_bin   541909 non-null  object        
 10  Invoice_Date  541909 non-null  object        
 11  Quantity_y    541909 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(2), object(7)
memory usage: 53.7+ MB


We can see there are two columns called Quantity_x and Quantity_y instead of Quantity.

The reason is that, after merging, there were two different columns with the exact same name (Quantity), so by default, pandas added a suffix to differentiate them. 

We can fix this either by renaming one of those two columns before merging or by renaming both of them after merging. To rename columns, we can use the .rename() method from pandas by providing a dictionary with the old name as the key and the new name as the value, such as {'old_name': 'new_name'}.

Let's replace the column names after merging with Quantity and DailyQuantity:

In [42]:
df_merged.rename(columns={'Quantity_x': 'Quantity', 'Quantity_y': 'DailyQuantity'}, inplace=True)
df_merged

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Country_bin,Invoice_Date,DailyQuantity
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,454
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,33
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,40
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,59
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,551
...,...,...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,2011-12-09,France,2011-12-09,12
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,2011-12-09,France,2011-12-09,6
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12-09,France,2011-12-09,4
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12-09,France,2011-12-09,4
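The other option mentioned earlier, renaming before merging, can be sketched on hypothetical toy data as follows. The column name `TotalQuantity` is chosen for illustration. The second half shows the `suffixes` parameter of `.merge()`, which is not used in the text and lets you choose the suffixes instead of accepting the default `_x` and `_y`.

```python
import pandas as pd

# Hypothetical order lines
lines = pd.DataFrame({'Country': ['France', 'France', 'Malta'],
                      'Quantity': [12, 6, 4]})

# Rename the aggregated column before merging, so no name clash occurs
totals = (lines.groupby('Country')
               .agg({'Quantity': 'sum'})
               .reset_index()
               .rename(columns={'Quantity': 'TotalQuantity'}))
merged = lines.merge(totals, how='left', on='Country')
print(merged.columns.tolist())

# Alternative: keep the clashing names and choose the suffixes yourself
totals_raw = lines.groupby('Country').agg({'Quantity': 'sum'}).reset_index()
merged_suffix = lines.merge(totals_raw, how='left', on='Country',
                            suffixes=('', '_total'))
print(merged_suffix.columns.tolist())
```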


Now we can create a new feature that calculates the ratio between the quantity sold on each invoice line and the total quantity of the same product sold on the same day in the same country:

In [43]:
df_merged['QuantityRatio'] = df_merged['Quantity'] / df_merged['DailyQuantity']
df_merged

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,InvoiceDay,Country_bin,Invoice_Date,DailyQuantity,QuantityRatio
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,454,0.013216
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,33,0.181818
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,40,0.200000
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,59,0.101695
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01,United Kingdom,2010-12-01,551,0.010889
...,...,...,...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,2011-12-09,France,2011-12-09,12,1.000000
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,2011-12-09,France,2011-12-09,6,1.000000
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12-09,France,2011-12-09,4,1.000000
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12-09,France,2011-12-09,4,1.000000
