### <span style="color:black"><b>Pandas Tutorial 7</b></span>

---

<u>Duplicated Rows</u>

* We now move on to handling duplicated records in pandas
* Exact copies of a row are rarely useful, so it is best to identify and remove them before analysing the data
* The two main objectives are to [identify duplicated rows](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) using `duplicated()` and [drop duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html#pandas.DataFrame.drop_duplicates) using `drop_duplicates()`

Useful dataframe methods:
<pre>
df.duplicated() ~ Returns a boolean mask; place it inside .loc[] to select the flagged rows
df.drop_duplicates(inplace=True)
</pre>
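
As a minimal sketch of how the two pieces fit together, the snippet below builds a small made-up DataFrame (the `toy` data is hypothetical and not taken from `sales.csv`) and shows that `duplicated()` returns a boolean Series, which `.loc[]` can then use to select the flagged rows.

```python
import pandas as pd

# Hypothetical toy data (not from sales.csv): rows 0 and 2 are identical
toy = pd.DataFrame({
    "Country": ["Iran", "Yemen", "Iran"],
    "Units Sold": [10, 20, 10],
})

# duplicated() returns a boolean Series aligned with the DataFrame index.
# By default the first occurrence is NOT flagged, only the later copies.
mask = toy.duplicated()
print(mask.tolist())  # [False, False, True]

# Passing the mask to .loc[] keeps only the flagged rows
flagged = toy.loc[mask, :]
print(flagged.index.tolist())  # [2]
```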


In [1]:
import pandas as pd

In [2]:
# Read in the csv file 
df = pd.read_csv('../00_ExtraData/sales.csv')
df.head()

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Europe,Lithuania,Vegetables,Online,C,2010-05-27,440731847,2010-06-02,1202,154.06,90.93,185180.12,109297.86,75882.26
1,Middle East and North Africa,Iran,Office Supplies,Offline,M,2012-06-23,373544442,2012-06-28,8956,651.21,524.96,5832236.76,4701541.76,1130695.0
2,Middle East and North Africa,Yemen,Baby Food,Online,C,2017-02-06,488047730,2017-03-21,3024,255.28,159.42,771966.72,482086.08,289880.64
3,Sub-Saharan Africa,Guinea-Bissau,Vegetables,Online,M,2010-12-16,980862224,2011-01-03,7784,154.06,90.93,1199203.04,707799.12,491403.92
4,Asia,Myanmar,Cereal,Offline,H,2013-03-08,365934693,2013-03-24,7062,205.7,117.11,1452653.4,827030.82,625622.58


**Exercise: Identify the rows that have been duplicated, and introduce the `keep` argument**

In [3]:
# keep='last' flags every copy of a duplicated row except its last occurrence
df.loc[df.duplicated(keep='last'), :]

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
8784,Sub-Saharan Africa,Namibia,Household,Offline,M,2015-08-31,897751939,2015-10-12,3604,668.27,502.54,2408445.08,1811154.16,597290.92
12660,Europe,Moldova,Meat,Online,L,2012-02-28,459845054,2012-03-20,7225,421.89,364.69,3048155.25,2634885.25,413270.0
21769,Europe,Iceland,Baby Food,Online,H,2010-11-20,599480426,2011-01-09,8435,255.28,159.42,2153286.8,1344707.7,808579.1
31911,Asia,Indonesia,Meat,Online,H,2010-08-20,472974574,2010-08-27,2542,421.89,364.69,1072444.38,927041.98,145402.4
32912,Europe,Malta,Cereal,Online,M,2010-08-12,626391351,2010-09-13,1975,205.7,117.11,406257.5,231292.25,174965.25
37557,Europe,Russia,Meat,Online,L,2017-06-22,538911855,2017-06-25,4848,421.89,364.69,2045322.72,1768017.12,277305.6
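
The `keep` argument decides which copies get flagged. The sketch below compares its three accepted values (`'first'`, `'last'` and `False`) on a made-up single-column frame. The `toy` data is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the value 1 appears three times (index 0, 2 and 4)
toy = pd.DataFrame({"Order ID": [1, 2, 1, 3, 1]})

# keep='first' (the default): the first occurrence is left unflagged
first_mask = toy.duplicated(keep="first")
print(first_mask.tolist())  # [False, False, True, False, True]

# keep='last': the last occurrence is left unflagged
last_mask = toy.duplicated(keep="last")
print(last_mask.tolist())   # [True, False, True, False, False]

# keep=False: every member of a duplicated group is flagged
all_mask = toy.duplicated(keep=False)
print(all_mask.tolist())    # [True, False, True, False, True]
```

With `keep='first'` or `keep='last'`, the number of flagged rows equals the number of rows that `drop_duplicates()` would remove with the same setting. `keep=False` is more useful for inspecting whole groups of duplicates side by side.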


**Exercise: Identify all rows that have been duplicated across a few of the columns of your choice**

In [4]:
# keep=False flags every row whose values in the subset columns occur more than once
df.loc[df.duplicated(keep=False, subset=['Order Date', 'Country', 'Sales Channel']), :]

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
4,Asia,Myanmar,Cereal,Offline,H,2013-03-08,365934693,2013-03-24,7062,205.70,117.11,1452653.40,827030.82,625622.58
15,North America,Canada,Baby Food,Offline,L,2014-11-10,553182494,2014-12-09,7848,255.28,159.42,2003437.44,1251128.16,752309.28
34,Europe,Montenegro,Cosmetics,Online,H,2010-01-07,492513072,2010-02-13,4418,437.20,263.33,1931549.60,1163391.94,768157.66
73,Sub-Saharan Africa,Kenya,Baby Food,Online,L,2015-11-28,260121333,2015-12-12,4538,255.28,159.42,1158460.64,723447.96,435012.68
102,Europe,Serbia,Beverages,Offline,H,2017-02-21,156536567,2017-03-19,1504,47.45,31.79,71364.80,47812.16,23552.64
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49807,Middle East and North Africa,Somalia,Cereal,Online,M,2012-06-24,776492059,2012-07-15,110,205.70,117.11,22627.00,12882.10,9744.90
49828,Middle East and North Africa,Somalia,Baby Food,Online,H,2016-12-18,409011828,2016-12-20,4544,255.28,159.42,1159992.32,724404.48,435587.84
49836,Central America and the Caribbean,Costa Rica,Cosmetics,Offline,C,2013-12-17,471324479,2014-01-21,4242,437.20,263.33,1854602.40,1117045.86,737556.54
49844,Middle East and North Africa,Kuwait,Household,Online,M,2015-10-27,578327882,2015-12-05,8927,668.27,502.54,5965646.29,4486174.58,1479471.71
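
To illustrate why `subset` changes the result, the sketch below uses a made-up frame (hypothetical values, not rows from `sales.csv`) in which no two rows are fully identical, yet two rows share the same `Country` and `Order Date`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 match on Country and Order Date only
toy = pd.DataFrame({
    "Country": ["Kenya", "Kenya", "Malta"],
    "Order Date": ["2015-11-28", "2015-11-28", "2010-08-12"],
    "Units Sold": [100, 250, 100],
})

# Without subset, all columns are compared, so nothing is flagged
full_mask = toy.duplicated(keep=False)
print(full_mask.tolist())    # [False, False, False]

# With subset, only the listed columns are compared
subset_mask = toy.duplicated(subset=["Country", "Order Date"], keep=False)
print(subset_mask.tolist())  # [True, True, False]
```

Rows flagged this way are not necessarily errors: two genuine orders can share a date, country and sales channel, so it is worth inspecting them before dropping anything.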


**Exercise: Drop duplicate rows**

In [5]:
df.shape

(50006, 14)

In [6]:
df.drop_duplicates(inplace=True)

In [7]:
# Duplicated rows have been dropped
df.shape

(50000, 14)
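
As an alternative to `inplace=True`, `drop_duplicates()` can return a new DataFrame and leave the original untouched. The sketch below shows this on hypothetical toy data (not from `sales.csv`). It also checks that the number of rows removed matches the count of rows flagged by `duplicated()`, the same relationship as the change from 50006 to 50000 rows above.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "Country": ["Iran", "Yemen", "Iran"],
    "Units Sold": [10, 20, 10],
})

# Count the rows that would be dropped (default keep='first')
n_dupes = int(toy.duplicated().sum())

# Without inplace=True a new DataFrame is returned; toy is unchanged
deduped = toy.drop_duplicates()
print(toy.shape, deduped.shape)  # (3, 2) (2, 2)
print(len(toy) - len(deduped) == n_dupes)  # True

# keep='last' retains the final copy; ignore_index=True renumbers rows from 0
renumbered = toy.drop_duplicates(keep="last", ignore_index=True)
print(renumbered["Country"].tolist())  # ['Yemen', 'Iran']
print(renumbered.index.tolist())       # [0, 1]
```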