## Handling Row Duplication

In [1]:
import pandas as pd

In [2]:
file_url = 'https://github.com/PacktWorkshops/The-Data-Science-Workshop/blob/'\
           'master/Chapter10/dataset/Online%20Retail.xlsx?raw=true'

In [3]:
df = pd.read_excel(file_url)

In [4]:
# duplicated rows
df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
541904    False
541905    False
541906    False
541907    False
541908    False
Length: 541909, dtype: bool

In Python, the Boolean values True and False behave like the integers 1 and 0, respectively. To find out how many rows have been flagged as duplicates, call the sum() method on the output of duplicated(). This adds up all the 1s (that is, the True values) and gives the count of duplicates:

In [5]:
df.duplicated().sum()

5268
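
The counting trick above can be checked on a tiny example. This is a minimal sketch on a hypothetical toy DataFrame (the `toy` name and its values are made up for illustration and are not part of the Online Retail dataset):

```python
import pandas as pd

# Hypothetical toy data, not taken from the Online Retail dataset
toy = pd.DataFrame({
    'InvoiceNo': [1, 1, 2, 2, 2],
    'StockCode': ['A', 'A', 'B', 'C', 'B'],
})

mask = toy.duplicated()     # True for every repeat of an earlier row
n_duplicates = mask.sum()   # True counts as 1, False as 0

print(mask.tolist())   # [False, True, False, False, True]
print(n_duplicates)    # 2
```

Rows 1 and 4 repeat rows 0 and 2, so the sum of the Boolean mask is 2.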

In [6]:
df[['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']]

| | InvoiceNo | StockCode | InvoiceDate | CustomerID |
|---|---|---|---|---|
| 0 | 536365 | 85123A | 2010-12-01 08:26:00 | 17850.0 |
| 1 | 536365 | 71053 | 2010-12-01 08:26:00 | 17850.0 |
| 2 | 536365 | 84406B | 2010-12-01 08:26:00 | 17850.0 |
| 3 | 536365 | 84029G | 2010-12-01 08:26:00 | 17850.0 |
| 4 | 536365 | 84029E | 2010-12-01 08:26:00 | 17850.0 |
| ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | 2011-12-09 12:50:00 | 12680.0 |
| 541905 | 581587 | 22899 | 2011-12-09 12:50:00 | 12680.0 |
| 541906 | 581587 | 23254 | 2011-12-09 12:50:00 | 12680.0 |
| 541907 | 581587 | 23255 | 2011-12-09 12:50:00 | 12680.0 |


If you only want to keep the rows that are considered duplicates, you can pass the output of the duplicated() method to the same square-bracket subsetting syntax. Only the rows with a value of True are kept:

In [7]:
df[df.duplicated()]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 517 | 536409 | 21866 | UNION JACK FLAG LUGGAGE TAG | 1 | 2010-12-01 11:45:00 | 1.25 | 17908.0 | United Kingdom |
| 527 | 536409 | 22866 | HAND WARMER SCOTTY DOG DESIGN | 1 | 2010-12-01 11:45:00 | 2.10 | 17908.0 | United Kingdom |
| 537 | 536409 | 22900 | SET 2 TEA TOWELS I LOVE LONDON | 1 | 2010-12-01 11:45:00 | 2.95 | 17908.0 | United Kingdom |
| 539 | 536409 | 22111 | SCOTTIE DOG HOT WATER BOTTLE | 1 | 2010-12-01 11:45:00 | 4.95 | 17908.0 | United Kingdom |
| 555 | 536412 | 22327 | ROUND SNACK BOXES SET OF 4 SKULLS | 1 | 2010-12-01 11:49:00 | 2.95 | 17920.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541675 | 581538 | 22068 | BLACK PIRATE TREASURE CHEST | 1 | 2011-12-09 11:34:00 | 0.39 | 14446.0 | United Kingdom |
| 541689 | 581538 | 23318 | BOX OF 6 MINI VINTAGE CRACKERS | 1 | 2011-12-09 11:34:00 | 2.49 | 14446.0 | United Kingdom |
| 541692 | 581538 | 22992 | REVOLVER WOODEN RULER | 1 | 2011-12-09 11:34:00 | 1.95 | 14446.0 | United Kingdom |
| 541699 | 581538 | 22694 | WICKER STAR | 1 | 2011-12-09 11:34:00 | 2.10 | 14446.0 | United Kingdom |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


If you want to subset rows and columns at the same time, you must use one of the other two available APIs: .loc or .iloc. They serve the same purpose, but .loc selects by labels (index labels and column names) or Boolean masks, while .iloc selects by integer position.
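
To make the difference between the two concrete, here is a minimal sketch on a hypothetical toy DataFrame. The `toy` frame and its custom index labels (100 to 102) are assumptions chosen so that labels and positions differ:

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2
toy = pd.DataFrame(
    {'InvoiceNo': [10, 10, 11],
     'StockCode': ['A', 'A', 'B'],
     'Quantity': [1, 1, 5]},
    index=[100, 101, 102],
)

# .loc: select by index label and column name
by_label = toy.loc[[101, 102], ['InvoiceNo', 'Quantity']]

# .iloc: select by integer position (rows 1 and 2, columns 0 and 2)
by_position = toy.iloc[[1, 2], [0, 2]]

print(by_label.equals(by_position))   # True

# .loc also accepts a Boolean mask for the rows
dupes = toy.loc[toy.duplicated(), ['InvoiceNo', 'StockCode']]
print(dupes.index.tolist())           # [101]
```

Both selections return the same two rows and two columns; only the way they are addressed differs.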

In [9]:
df.loc[df.duplicated(), ['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']]

| | InvoiceNo | StockCode | InvoiceDate | CustomerID |
|---|---|---|---|---|
| 517 | 536409 | 21866 | 2010-12-01 11:45:00 | 17908.0 |
| 527 | 536409 | 22866 | 2010-12-01 11:45:00 | 17908.0 |
| 537 | 536409 | 22900 | 2010-12-01 11:45:00 | 17908.0 |
| 539 | 536409 | 22111 | 2010-12-01 11:45:00 | 17908.0 |
| 555 | 536412 | 22327 | 2010-12-01 11:49:00 | 17920.0 |
| ... | ... | ... | ... | ... |
| 541675 | 581538 | 22068 | 2011-12-09 11:34:00 | 14446.0 |
| 541689 | 581538 | 23318 | 2011-12-09 11:34:00 | 14446.0 |
| 541692 | 581538 | 22992 | 2011-12-09 11:34:00 | 14446.0 |
| 541699 | 581538 | 22694 | 2011-12-09 11:34:00 | 14446.0 |


The preceding output shows that the first few rows flagged as duplicates are 517, 527, 537, and so on. By default, pandas doesn't mark the first occurrence of a duplicated row: every later repeat gets a value of True, but the first occurrence stays False. You can change this behavior with the keep parameter. If you want the last occurrence to be the one left unmarked, specify keep='last':

In [10]:
df.loc[df.duplicated(keep='last'), ['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID']]

| | InvoiceNo | StockCode | InvoiceDate | CustomerID |
|---|---|---|---|---|
| 485 | 536409 | 22111 | 2010-12-01 11:45:00 | 17908.0 |
| 489 | 536409 | 22866 | 2010-12-01 11:45:00 | 17908.0 |
| 494 | 536409 | 21866 | 2010-12-01 11:45:00 | 17908.0 |
| 521 | 536409 | 22900 | 2010-12-01 11:45:00 | 17908.0 |
| 548 | 536412 | 22327 | 2010-12-01 11:49:00 | 17920.0 |
| ... | ... | ... | ... | ... |
| 541640 | 581538 | 22992 | 2011-12-09 11:34:00 | 14446.0 |
| 541644 | 581538 | 22694 | 2011-12-09 11:34:00 | 14446.0 |
| 541646 | 581538 | 23275 | 2011-12-09 11:34:00 | 14446.0 |
| 541656 | 581538 | 23318 | 2011-12-09 11:34:00 | 14446.0 |
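
The effect of the keep parameter is easier to see on a handful of rows. The following is a small sketch on a hypothetical single-column DataFrame (`toy` is made up for illustration) that compares the three accepted values of keep:

```python
import pandas as pd

# Hypothetical toy data: 'A' appears three times
toy = pd.DataFrame({'StockCode': ['A', 'B', 'A', 'A', 'C']})

first = toy.duplicated(keep='first')   # default: first occurrence stays False
last = toy.duplicated(keep='last')     # last occurrence stays False
every = toy.duplicated(keep=False)     # every member of a duplicate group is True

print(first.tolist())  # [False, False, True, True, False]
print(last.tolist())   # [True, False, True, False, False]
print(every.tolist())  # [True, False, True, True, False]
```

With keep=False, no occurrence is spared, which is handy when you want to inspect whole groups of duplicates side by side.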


To remove duplicate rows, you can use the drop_duplicates() method from pandas. It has the same keep parameter as duplicated(), which specifies which of the duplicated records to keep, or whether to remove all of them (keep=False). In this case, we want to keep one row from each group of duplicates, namely the first occurrence:

In [11]:
df.drop_duplicates(keep='first')

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 2011-12-09 12:50:00 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680.0 | France |
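
As a quick check of how drop_duplicates() behaves, here is a minimal sketch on a hypothetical toy DataFrame (the `toy` frame and its values are illustrative only). It shows which index labels survive for each value of keep, and that the original DataFrame is left untouched unless the result is reassigned:

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    'StockCode': ['A', 'B', 'A', 'A', 'C'],
    'Quantity': [1, 2, 1, 1, 3],
})

kept_first = toy.drop_duplicates(keep='first')
kept_last = toy.drop_duplicates(keep='last')
kept_none = toy.drop_duplicates(keep=False)

print(kept_first.index.tolist())  # [0, 1, 4]
print(kept_last.index.tolist())   # [1, 3, 4]
print(kept_none.index.tolist())   # [1, 4]
print(len(toy))                   # 5, the original is unchanged
```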


The drop_duplicates() and duplicated() methods also have another very useful parameter: subset. This parameter allows you to specify the list of columns to consider while looking for duplicates. By default, all the columns of a DataFrame are used to find duplicate rows.

In [12]:
df.duplicated(subset=['InvoiceNo', 'StockCode', 'InvoiceDate', 'CustomerID'], keep='first').sum()

10677
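
The same effect, where a narrower subset flags more rows than the full-row comparison, can be reproduced on a few rows. This is a sketch on a hypothetical toy DataFrame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: rows 0-2 share InvoiceNo and StockCode,
# but row 2 has a different Quantity
toy = pd.DataFrame({
    'InvoiceNo': [1, 1, 1, 2],
    'StockCode': ['A', 'A', 'A', 'A'],
    'Quantity': [6, 6, 8, 6],
})

# All columns compared: only row 1 is an exact repeat
print(toy.duplicated().sum())                                    # 1

# Only two columns compared: rows 1 and 2 both repeat row 0
print(toy.duplicated(subset=['InvoiceNo', 'StockCode']).sum())   # 2

# drop_duplicates() accepts the same subset parameter
deduped = toy.drop_duplicates(subset=['InvoiceNo', 'StockCode'])
print(deduped.index.tolist())                                    # [0, 3]
```

Whether rows that differ only in columns outside the subset should count as duplicates is a judgment about the data, so it is worth inspecting them before dropping anything.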

## Converting Data Types

In [13]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [14]:
# change CustomerID column data type to category while loading the file
df = pd.read_excel(file_url, dtype={'CustomerID': 'category'})
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID           category
Country                object
dtype: object
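
The dtype argument used above can be tried without downloading the Excel file. The sketch below is an assumption-laden stand-in: it uses pd.read_csv on an in-memory string (the `csv_text` content is made up) rather than pd.read_excel, since both readers accept a dtype dictionary keyed by column name:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the Excel file
csv_text = "InvoiceNo,CustomerID\n536365,17850\n536366,13047\n536367,\n"

toy = pd.read_csv(io.StringIO(csv_text), dtype={'CustomerID': 'category'})

print(toy.dtypes['CustomerID'])         # category
print(toy['CustomerID'].isna().sum())   # 1, the empty field is still missing
```

Missing entries remain missing after the conversion; they are not turned into a category of their own.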

Now, let's look at the second way of converting a single column into a different type. In pandas, you can use the astype() method and specify the new data type that it will be converted into as its parameter. It will return a new column (a new pandas series, to be more precise), so you need to reassign it to the same column of the DataFrame. For instance, if you want to change the InvoiceNo column into a categorical variable, you would do the following:

In [15]:
df['InvoiceNo'] = df['InvoiceNo'].astype('category')
df.dtypes

InvoiceNo            category
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID           category
Country                object
dtype: object

As you can see, the data type of InvoiceNo has changed to category. The difference between object and category is that a categorical column takes its values from a finite set of possible values (such variables are also called discrete variables). Once a column has been converted to a categorical variable, pandas automatically lists all of its distinct values, which can be accessed using the .cat.categories attribute:

In [16]:
df['InvoiceNo'].cat.categories

Index([   536365,    536366,    536367,    536368,    536369,    536370,
          536371,    536372,    536373,    536374,
       ...
       'C581464', 'C581465', 'C581466', 'C581468', 'C581470', 'C581484',
       'C581490', 'C581499', 'C581568', 'C581569'],
      dtype='object', length=25900)
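
For a smaller view of what the conversion does, here is a minimal sketch on a hypothetical Series of invoice numbers stored as strings (the values are illustrative). Besides .cat.categories, it also prints .cat.codes, an attribute not used elsewhere in this notebook, which exposes the integer code pandas stores for each row:

```python
import pandas as pd

# Hypothetical invoice numbers stored as strings
invoices = pd.Series(['536365', 'C581464', '536365', '536366'])

as_category = invoices.astype('category')

# The distinct values, sorted
print(as_category.cat.categories.tolist())  # ['536365', '536366', 'C581464']

# Each row is stored as the position of its value in the category list
print(as_category.cat.codes.tolist())       # [0, 2, 0, 1]
```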

## Handling Incorrect Values

In [17]:
df.loc[df['StockCode'] == 23131, 'Description'].unique()

array(['MISTLETOE HEART WREATH CREAM', 'MISELTOE HEART WREATH WHITE',
       'MISELTOE HEART WREATH CREAM', '?', 'had been put aside', nan],
      dtype=object)

Let's focus on the misspelling issue. What we need to do here is modify the incorrect spelling and replace it with the correct value. First, let's create a new column called StockCodeDescription, which is an exact copy of the Description column:

In [18]:
df['StockCodeDescription'] = df['Description']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype         
---  ------                --------------   -----         
 0   InvoiceNo             541909 non-null  category      
 1   StockCode             541909 non-null  object        
 2   Description           540455 non-null  object        
 3   Quantity              541909 non-null  int64         
 4   InvoiceDate           541909 non-null  datetime64[ns]
 5   UnitPrice             541909 non-null  float64       
 6   CustomerID            406829 non-null  category      
 7   Country               541909 non-null  object        
 8   StockCodeDescription  540455 non-null  object        
dtypes: category(2), datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 32.6+ MB


You will use this new column to fix the misspelling issue. To do this, use the subsetting technique you learned about earlier in this chapter. You need to use .loc and filter the rows and the column you want, that is, all rows with **StockCode == 23131** and **StockCodeDescription == MISELTOE HEART WREATH CREAM**, and the **StockCodeDescription** column:

In [19]:
df.loc[(df['StockCode'] == 23131) &
      (df['StockCodeDescription'] == 'MISELTOE HEART WREATH CREAM'),
      'StockCodeDescription'] = 'MISTLETOE HEART WREATH CREAM'

In [20]:
df.loc[df['StockCode'] == 23131, 
      'StockCodeDescription'].unique()

array(['MISTLETOE HEART WREATH CREAM', 'MISELTOE HEART WREATH WHITE', '?',
       'had been put aside', nan], dtype=object)
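
The conditional assignment pattern used in cell [19] can be sketched on three rows. The `toy` DataFrame below is hypothetical; it only reuses the column names and spellings from the dataset:

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names
toy = pd.DataFrame({
    'StockCode': [23131, 23131, 23130],
    'Description': ['MISELTOE HEART WREATH CREAM',
                    'MISTLETOE HEART WREATH CREAM',
                    'MISELTOE HEART WREATH'],
})
toy['StockCodeDescription'] = toy['Description']

# Combine two conditions with & (each one wrapped in parentheses)
condition = ((toy['StockCode'] == 23131) &
             (toy['StockCodeDescription'] == 'MISELTOE HEART WREATH CREAM'))

# Assign only to the matching rows of the chosen column
toy.loc[condition, 'StockCodeDescription'] = 'MISTLETOE HEART WREATH CREAM'

print(toy['StockCodeDescription'].tolist())
# ['MISTLETOE HEART WREATH CREAM', 'MISTLETOE HEART WREATH CREAM', 'MISELTOE HEART WREATH']
print(toy['Description'][0])   # MISELTOE HEART WREATH CREAM
```

Only the row matching both conditions changes, and the original Description column keeps its raw values for reference.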

Let's use the .str.contains() method to see if we have the same misspelling issue (MISEL) elsewhere in the dataset. This method doesn't handle missing values on its own, so rows with a missing description need to be excluded from the match. This can be done by providing the na=False parameter to .str.contains(), which treats missing values as non-matches:

In [21]:
df.loc[df['StockCodeDescription']
      .str.contains('MISEL', na=False)]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | StockCodeDescription |
|---|---|---|---|---|---|---|---|---|---|
| 186760 | 552882 | 23131 | MISELTOE HEART WREATH WHITE | 48 | 2011-05-12 10:10:00 | 3.75 | 14646.0 | Netherlands | MISELTOE HEART WREATH WHITE |
| 186761 | 552882 | 23130 | MISELTOE HEART WREATH | 48 | 2011-05-12 10:10:00 | 3.75 | 14646.0 | Netherlands | MISELTOE HEART WREATH |
| 195286 | 553711 | 23130 | MISELTOE HEART WREATH | 12 | 2011-05-18 15:39:00 | 4.15 | 13552.0 | United Kingdom | MISELTOE HEART WREATH |
| 195288 | 553711 | 23131 | MISELTOE HEART WREATH WHITE | 12 | 2011-05-18 15:39:00 | 4.15 | 13552.0 | United Kingdom | MISELTOE HEART WREATH WHITE |
| 372887 | 569252 | 23130 | MISELTOE HEART WREATH | 4 | 2011-10-03 10:38:00 | 4.15 | 14333.0 | United Kingdom | MISELTOE HEART WREATH |
| 373325 | 569324 | 23131 | MISELTOE HEART WREATH WHITE | 6 | 2011-10-03 12:32:00 | 4.15 | 16912.0 | United Kingdom | MISELTOE HEART WREATH WHITE |
| 373327 | 569324 | 23130 | MISELTOE HEART WREATH | 8 | 2011-10-03 12:32:00 | 4.15 | 16912.0 | United Kingdom | MISELTOE HEART WREATH |
| 377632 | 569558 | 23131 | MISELTOE HEART WREATH WHITE | 12 | 2011-10-05 08:53:00 | 4.15 | 14936.0 | Channel Islands | MISELTOE HEART WREATH WHITE |
| 377635 | 569558 | 23130 | MISELTOE HEART WREATH | 8 | 2011-10-05 08:53:00 | 4.15 | 14936.0 | Channel Islands | MISELTOE HEART WREATH |


This misspelling issue (MISELTOE) is not limited to StockCode 23131; it also affects other items. You can fix all of them at once using the str.replace() method, which takes the string of characters to be replaced and the replacement string as parameters:

In [22]:
df['StockCodeDescription'] = df['StockCodeDescription'].str.replace('MISELTOE', 'MISTLETOE')

In [23]:
# print all rows containing misspelling MISEL; there should be none.
df.loc[df['StockCodeDescription']
      .str.contains('MISEL', na=False)]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | StockCodeDescription |
|---|---|---|---|---|---|---|---|---|---|
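
The find-then-replace steps above can be sketched on a three-element Series. The `descriptions` values are hypothetical, and regex=False is passed explicitly as an assumption so that the pattern is treated as a literal string regardless of the pandas version in use:

```python
import pandas as pd

# Hypothetical descriptions, including one missing value
descriptions = pd.Series(['MISELTOE HEART WREATH', 'WICKER STAR', None])

# na=False makes missing values count as "no match"
mask = descriptions.str.contains('MISEL', na=False)
print(mask.tolist())   # [True, False, False]

# Literal (non-regex) replacement of the misspelled word
fixed = descriptions.str.replace('MISELTOE', 'MISTLETOE', regex=False)
print(fixed[0])        # MISTLETOE HEART WREATH

# No description contains the misspelling any more
print(fixed.str.contains('MISEL', na=False).sum())   # 0
```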


## Handling Missing Values

In [25]:
df.isna().sum()

InvoiceNo                    0
StockCode                    0
Description               1454
Quantity                     0
InvoiceDate                  0
UnitPrice                    0
CustomerID              135080
Country                      0
StockCodeDescription      1455
dtype: int64

In [26]:
df[df['Description'].isna()]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | StockCodeDescription |
|---|---|---|---|---|---|---|---|---|---|
| 622 | 536414 | 22139 | NaN | 56 | 2010-12-01 11:52:00 | 0.0 | NaN | United Kingdom | NaN |
| 1970 | 536545 | 21134 | NaN | 1 | 2010-12-01 14:32:00 | 0.0 | NaN | United Kingdom | NaN |
| 1971 | 536546 | 22145 | NaN | 1 | 2010-12-01 14:33:00 | 0.0 | NaN | United Kingdom | NaN |
| 1972 | 536547 | 37509 | NaN | 1 | 2010-12-01 14:33:00 | 0.0 | NaN | United Kingdom | NaN |
| 1987 | 536549 | 85226A | NaN | 1 | 2010-12-01 14:34:00 | 0.0 | NaN | United Kingdom | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 535322 | 581199 | 84581 | NaN | -2 | 2011-12-07 18:26:00 | 0.0 | NaN | United Kingdom | NaN |
| 535326 | 581203 | 23406 | NaN | 15 | 2011-12-07 18:31:00 | 0.0 | NaN | United Kingdom | NaN |
| 535332 | 581209 | 21620 | NaN | 6 | 2011-12-07 18:35:00 | 0.0 | NaN | United Kingdom | NaN |
| 536981 | 581234 | 72817 | NaN | 27 | 2011-12-08 10:33:00 | 0.0 | NaN | United Kingdom | NaN |


The pandas package provides a method that we can use to easily remove missing values: .dropna(). This method returns a new DataFrame without the rows that have missing values. By default, it looks at all the columns, but you can restrict the check to a list of columns with the subset parameter:

In [30]:
df.dropna(subset=['Description'], inplace=True)  # inplace=True modifies the original DataFrame directly

In [31]:
df.isna().sum()

InvoiceNo                    0
StockCode                    0
Description                  0
Quantity                     0
InvoiceDate                  0
UnitPrice                    0
CustomerID              133626
Country                      0
StockCodeDescription         1
dtype: int64
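
To see how the subset parameter changes what .dropna() removes, here is a minimal sketch on a hypothetical three-row DataFrame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy = pd.DataFrame({
    'Description': ['WICKER STAR', None, 'RAIN PONCHO RETROSPOT'],
    'CustomerID': [17850.0, None, None],
})

# Missing values per column
print(toy.isna().sum().tolist())    # [1, 2]

# Drop only the rows where Description is missing
cleaned = toy.dropna(subset=['Description'])
print(cleaned.index.tolist())       # [0, 2]

# Without subset, any missing value removes the row
print(len(toy.dropna()))            # 1
```

Restricting the check to Description keeps row 2, which would otherwise be lost because of its missing CustomerID.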

In [32]:
# rows with missing CustomerID values
df[df['CustomerID'].isna()]

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | StockCodeDescription |
|---|---|---|---|---|---|---|---|---|---|
| 1443 | 536544 | 21773 | DECORATIVE ROSE BATHROOM BOTTLE | 1 | 2010-12-01 14:32:00 | 2.51 | NaN | United Kingdom | DECORATIVE ROSE BATHROOM BOTTLE |
| 1444 | 536544 | 21774 | DECORATIVE CATS BATHROOM BOTTLE | 2 | 2010-12-01 14:32:00 | 2.51 | NaN | United Kingdom | DECORATIVE CATS BATHROOM BOTTLE |
| 1445 | 536544 | 21786 | POLKADOT RAIN HAT | 4 | 2010-12-01 14:32:00 | 0.85 | NaN | United Kingdom | POLKADOT RAIN HAT |
| 1446 | 536544 | 21787 | RAIN PONCHO RETROSPOT | 2 | 2010-12-01 14:32:00 | 1.66 | NaN | United Kingdom | RAIN PONCHO RETROSPOT |
| 1447 | 536544 | 21790 | VINTAGE SNAP CARDS | 9 | 2010-12-01 14:32:00 | 1.66 | NaN | United Kingdom | VINTAGE SNAP CARDS |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541536 | 581498 | 85099B | JUMBO BAG RED RETROSPOT | 5 | 2011-12-09 10:26:00 | 4.13 | NaN | United Kingdom | JUMBO BAG RED RETROSPOT |
| 541537 | 581498 | 85099C | JUMBO BAG BAROQUE BLACK WHITE | 4 | 2011-12-09 10:26:00 | 4.13 | NaN | United Kingdom | JUMBO BAG BAROQUE BLACK WHITE |
| 541538 | 581498 | 85150 | LADIES & GENTLEMEN METAL SIGN | 1 | 2011-12-09 10:26:00 | 4.96 | NaN | United Kingdom | LADIES & GENTLEMEN METAL SIGN |
| 541539 | 581498 | 85174 | S/4 CACTI CANDLES | 1 | 2011-12-09 10:26:00 | 10.79 | NaN | United Kingdom | S/4 CACTI CANDLES |


This time, all the transactions look normal, except that they are missing a value for the CustomerID column; all the other variables have been filled in with values that seem genuine. There is no way to infer the missing CustomerID values from the other columns. These rows represent almost 25% of the dataset, so we can't remove them.

However, most algorithms require a value for each observation, so you need to provide one for these cases. We will use the .fillna() method from pandas to do this, passing Missing as the value to impute and inplace=True as a parameter. First, look at the current unique values of the column:

In [36]:
df['CustomerID'].unique()

[17850.0, 13047.0, 12583.0, 13748.0, 15100.0, ..., 13436.0, 15520.0, 13298.0, 14569.0, 12713.0]
Length: 4373
Categories (4372, float64): [17850.0, 13047.0, 12583.0, 13748.0, ..., 15520.0, 13298.0, 14569.0, 12713.0]
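
The output above still has one more unique value (4373) than it has categories (4372), which suggests that the missing value has not been imputed yet. One caveat when filling it in: CustomerID is a categorical column here, and pandas only accepts a fill value that is already one of the column's categories. The following is a minimal sketch of one way to handle this on a hypothetical three-element Series (the `customer` name and values are illustrative). It registers the Missing label with .cat.add_categories() first, and reassigns the result instead of relying on inplace=True:

```python
import pandas as pd

# Hypothetical categorical column with one missing value
customer = pd.Series([17850.0, None, 13047.0]).astype('category')
print(customer.isna().sum())   # 1

# 'Missing' must be a known category before it can be used as a fill value
customer = customer.cat.add_categories('Missing').fillna('Missing')

print(customer.tolist())       # the NaN is now the label 'Missing'
print(customer.isna().sum())   # 0
```

On the full dataset, the equivalent would be to apply the same two calls to df['CustomerID'] and assign the result back to that column.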