
# Chapter 2 - Data Preparation Basics
## Segment 3 - Removing duplicates

[Instructor] It's really important to remove duplicates from your dataset in order to preserve the dataset's accuracy and to avoid producing incorrect and misleading statistics. For example, imagine you're analyzing retail sales data and shopaholic Sally came in three times and used three different credit cards to make purchases. Now imagine that she gave the cashier the same zip code, 32803, for each sale. Based on the card number alone, Sally looks like three different customers, all from the same zip code. If you fail to examine other attributes of the customer so that you can identify and remove duplicates, shopaholic Sally's transactions would skew the results of a customer demographic analysis, because Sally would be counted as three people rather than one. To market to 32803 customers effectively, you need to understand their characteristics. Don't let duplicate records skew your analysis.

I've already imported numpy and pandas, so let's start off by creating a data frame object. We're going to call it DF_obj, as usual, and then we're going to call the DataFrame constructor and create three columns, so I'm going to pass in a dictionary. The first column I'm going to name column one, and it's going to contain the numbers one, one, two, two, three, three, and three. Then we're going to create a second column, column two, containing the values a, a, b, b, and then c, c, and c. And we'll have one more column, column three, with the same letters as column two but capitalized: A, A, B, B, and then three Cs. Now we need to close the dictionary, close the function call, and print this out. One thing before printing: I see we have a stray parenthesis here, so just clean that up and then run it.

Okay, cool. So here we have a data frame object, and we're going to use it to practice dropping duplicates. First, we're going to use the .duplicated() method. This method checks each row in the data frame and returns a True or False value to indicate whether it's a duplicate of a row found earlier in the data frame. So let's test this out real quick. We'll say DF_obj.duplicated() and run it. We can see that a False value was returned at index position zero, and that makes sense because there were no rows before it. But let's look at a row that returned True, for example, row six. If we look at row four, we can see that row six is a duplicate of it. Row four returned False; in other words, it's not a duplicate. That's because row four was the first row to contain that exact combination of values. Any subsequent rows that have the same combination of values are counted as duplicates and return True.

Now that we've found the duplicate records, let's look at how we can drop them. To do that, we're going to use the drop_duplicates method, so we'll say DF_obj.drop_duplicates() and run this. Looking at the output, you can see that the row at index position one has been dropped, and that makes sense because it's a duplicate of the row at index position zero. Row three was also dropped, because it's a duplicate of the row at index position two. Rows five and six are gone as well, since both duplicate row four. All of the duplicate rows have been dropped from the data frame that was returned.

I also want to show you how to drop records based on column values. To do that, I want to make a small change to our data frame. I'm going to go back up and copy the code we used to create the data frame, and change one letter in column three from a C to a D. This is just for the purpose of our demonstration.

Let me go ahead and print this out. Now let's drop the rows that have duplicate values in just one column. To do that, we'll call the drop_duplicates method on the data frame and pass in the label of the column we want to de-duplicate on. In this case, let's drop duplicates based on column three. We're going to say DF_obj.drop_duplicates, pass in a list containing the label for column three, and run this. And just as we predicted, it dropped the rows with index values one, three, and six. Now we have no duplicates in column three.

Now that I've shown you how to drop duplicates from your data, I just want to highlight that it's really important to check your data for duplicates and remove them if you find them. Now it's time to move on to data concatenation and transformation.

In [0]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

### Removing duplicates

In [0]:
DF_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})
DF_obj

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
1,1,a,A
2,2,b,B
3,2,b,B
4,3,c,C
5,3,c,C
6,3,c,C


In [0]:
DF_obj.duplicated()

0    False
1     True
2    False
3     True
4    False
5     True
6     True
dtype: bool
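
The output above uses the default behavior, where the first occurrence of each row is treated as the original. As a small sketch of how the `keep` parameter of `duplicated()` changes which rows get flagged, the block below rebuilds the same seven-row frame under the illustrative name `df` (the variable names here are not from the lesson).

```python
import pandas as pd

# Same seven-row frame as in the cell above (rebuilt so this block runs on its own).
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

# Default keep='first': the first occurrence of each row is not flagged.
first_flags = df.duplicated()

# keep='last': the last occurrence is the one left unflagged instead.
last_flags = df.duplicated(keep='last')

# keep=False: every row that has a twin anywhere in the frame is flagged.
all_flags = df.duplicated(keep=False)

# Summing the boolean Series counts the rows that would be dropped by default.
n_duplicates = int(first_flags.sum())
print(n_duplicates)
```

Counting the flags before dropping anything is a cheap way to see how much of the data is affected.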

In [0]:
DF_obj.drop_duplicates()

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
2,2,b,B
4,3,c,C
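
Two details of this result are worth checking by hand. `drop_duplicates()` returns a new data frame rather than changing the original, and the surviving rows keep their old index labels. The sketch below rebuilds the frame under the illustrative name `df` and shows that filtering with the inverted `duplicated()` mask selects the same rows.

```python
import pandas as pd

# Same seven-row frame as in the cells above (rebuilt so this block runs on its own).
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

# drop_duplicates() returns a new frame; df itself still has all seven rows.
deduped = df.drop_duplicates()

# Keeping only the rows where duplicated() is False selects the same rows.
masked = df[~df.duplicated()]

# The surviving rows keep their original labels; reset_index renumbers them from 0.
renumbered = deduped.reset_index(drop=True)

print(deduped.index.tolist())
print(renumbered.index.tolist())
```

To keep the de-duplicated data, assign the result back to a variable, as done with `deduped` here.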


In [0]:
DF_obj = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                    'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})
DF_obj

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
1,1,a,A
2,2,b,B
3,2,b,B
4,3,c,C
5,3,c,D
6,3,c,C


In [0]:
DF_obj.drop_duplicates(['column 3'])

Unnamed: 0,column 1,column 2,column 3
0,1,a,A
2,2,b,B
4,3,c,C
5,3,c,D
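
The list passed to `drop_duplicates` above is the method's `subset` argument. As a sketch of a few variations, the block below rebuilds the modified frame under the illustrative name `df`, names `subset` explicitly, combines it with `keep='last'`, and de-duplicates on two columns at once.

```python
import pandas as pd

# Modified frame from the cell above, with a 'D' at index 5 of column 3.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})

# Same call as in the lesson, with the subset keyword spelled out.
by_col3 = df.drop_duplicates(subset=['column 3'])

# keep='last' retains the last row seen for each value of column 3.
by_col3_last = df.drop_duplicates(subset=['column 3'], keep='last')

# A subset can hold several labels; rows are compared on that combination only.
by_two_cols = df.drop_duplicates(subset=['column 1', 'column 2'])

print(by_col3.index.tolist())
print(by_col3_last.index.tolist())
print(by_two_cols.index.tolist())
```

When only some columns identify a record, as with the customer in the opening example, `subset` restricts the comparison to those columns.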
