## Deduplicating Data with Pandas


In [1]:
import pandas as pd
import numpy as np

To load text files with data into pandas, we use the function `pandas.read_csv()`. [Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [2]:
df = pd.read_csv("over_a_hundred_thousand.csv", delimiter="|")
df.shape

(100000, 7)
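To make the role of the `delimiter` argument concrete, here is a minimal sketch that parses a pipe-separated string held in memory instead of a file. The two sample rows and the name `toy` are made up for illustration and do not come from the dataset.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited text standing in for the real file.
raw = (
    "name|sex|birthdate\n"
    "Alice Example|F|01-02-1990\n"
    "Bob Example|M|03-04-1985\n"
)

# `sep` and `delimiter` are aliases in read_csv; either one sets the field separator.
toy = pd.read_csv(io.StringIO(raw), sep="|")
print(toy.shape)
```

Without the separator argument, pandas would look for commas and read each line as a single column.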

In [13]:
df.head(1)

Unnamed: 0,name,sex,birthdate,username,address,SSN,mail
0,Omar Caldwell,M,16-04-2001,cowanpatrick,Unit 8870 Box 8137 DPO AE 69568,803-02-2702,wjohnson@hotmail.com


We will pick 5% of the records at random and copy them into a separate DataFrame.

In [4]:
random_sample = df.sample(int(len(df)*0.05))  # 5 percent of the original dataset
len(random_sample)

5000
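The sample above changes on every run. If a repeatable sample is needed, one option is to pass `random_state` to `DataFrame.sample()`. The sketch below uses a small made-up frame (`toy`) and an arbitrary seed of 42; both are assumptions for illustration.

```python
import pandas as pd

# Toy frame with 100 rows; a stand-in for the real dataset.
toy = pd.DataFrame({"id": range(100)})

# The same seed yields the same rows on every call.
sample_a = toy.sample(frac=0.05, random_state=42)
sample_b = toy.sample(frac=0.05, random_state=42)

print(len(sample_a))
print(sample_a.equals(sample_b))
```

Here `frac=0.05` requests the same 5% share that `int(len(df)*0.05)` computes by hand.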

In the next step we concatenate the sample we just created with the main dataset.

In [5]:
df_with_duplicates = pd.concat([df, random_sample], ignore_index=True)
len(df_with_duplicates)
df_with_duplicates.head(5)

Unnamed: 0,name,sex,birthdate,username,address,SSN,mail
0,Omar Caldwell,M,16-04-2001,cowanpatrick,Unit 8870 Box 8137 DPO AE 69568,803-02-2702,wjohnson@hotmail.com
1,Taylor Green,F,29-05-1994,dhoward,"2141 Christensen Turnpike Gabrielaport, MI 45332",274-99-3139,kim81@wallace-thomas.com
2,Nicholas Smith,F,18-07-1982,bonnie86,"37930 Sanchez Fort Apt. 872 West Nancy, MI 32549",429-31-5875,erin80@hotmail.com
3,Sherry Wood,M,06-09-1980,david66,"362 Walters Brooks South Jenna, IL 53923",165-49-9744,jasonrivera@yahoo.com
4,Ashley Mckenzie,M,10-05-1994,ucastro,"43489 White Bridge South Alyssaport, NY 49020",878-34-1547,stonegabrielle@yahoo.com


The `DataFrame.sample()` method, which we used to draw the random sample, can also shuffle the rows of a frame. To do that, we pass the `frac` parameter with a value of 1, which returns all rows in random order:

In [6]:
df_with_duplicates = df_with_duplicates.sample(frac=1)

In [7]:
df_with_duplicates.head(5)

Unnamed: 0,name,sex,birthdate,username,address,SSN,mail
93224,Tristan Wells,M,04-08-1985,gcarr,"2433 Laura Ford Suite 833 Evanton, OK 30407",822-92-3179,vwalker@gmail.com
61548,Eric Berger,M,23-05-1991,sfields,"PSC 0526, Box 8570 APO AP 69375",179-13-6873,mfrench@gmail.com
50462,William Ballard,M,06-05-1973,cole15,"41617 Cooper Flats Jacksonville, WY 21859",694-56-7550,bryan55@morales.com
9514,Lauren Chang,M,28-11-1994,jonathan87,"88552 Lewis Summit Brownland, ME 71574",077-57-0085,sdickerson@haas.org
59999,Hayden Bowen,M,26-09-1981,harrisontoni,"145 Griffith Keys Port Javierbury, DC 12777",029-28-5149,jonathanrobinson@yahoo.com
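As the output shows, shuffling keeps the original index labels, so they appear out of order. If a clean 0..n-1 index is preferred, a common pattern is to chain `reset_index(drop=True)`. The sketch below runs on a made-up frame (`toy`) with an arbitrary seed.

```python
import pandas as pd

# Toy frame; a stand-in for df_with_duplicates.
toy = pd.DataFrame({"id": range(10)})

# Shuffle all rows, then discard the old index labels and renumber from 0.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.index.tolist())
```

Whether to reset the index is a design choice: keeping the old labels makes it possible to trace a row back to its position before the shuffle.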


In [8]:
df_with_duplicates.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105000 entries, 93224 to 52087
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   name       105000 non-null  object
 1   sex        105000 non-null  object
 2   birthdate  105000 non-null  object
 3   username   105000 non-null  object
 4   address    105000 non-null  object
 5   SSN        105000 non-null  object
 6   mail       105000 non-null  object
dtypes: object(7)
memory usage: 51.8 MB


In [9]:
df_with_duplicates.duplicated().any()

True

In [10]:
df_with_duplicates.duplicated(subset=["name", "birthdate"]).any()

True
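The two checks above can give different answers: without `subset`, rows count as duplicates only when every column matches, while with `subset` only the listed columns are compared. A minimal sketch on made-up records (the names and e-mail addresses are hypothetical):

```python
import pandas as pd

# Two rows share name and birthdate but differ in the mail column.
toy = pd.DataFrame({
    "name": ["Ann Example", "Ann Example", "Bob Example"],
    "birthdate": ["01-01-1990", "01-01-1990", "02-02-1985"],
    "mail": ["ann@example.com", "ann2@example.com", "bob@example.com"],
})

full_row = toy.duplicated().any()                             # compares all columns
by_key = toy.duplicated(subset=["name", "birthdate"]).any()   # compares two columns only

print(full_row, by_key)
```

In this toy frame the full-row check finds nothing, while the key-based check flags the second "Ann Example" row.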

In [11]:
duplicate_records = df_with_duplicates[df_with_duplicates.duplicated(keep=False)]  # keep='first' or 'last' spares one copy; keep=False marks every copy
len(duplicate_records)

10000
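The count of 10000 is consistent with `keep=False` flagging both copies of each duplicated record: the 5000 sampled rows plus their 5000 originals, assuming the source file had no duplicates of its own. The effect of the three `keep` values can be sketched on a small made-up frame:

```python
import pandas as pd

# Rows 0 and 2 are identical.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "year": [1990, 1985, 1990, 2000],
})

first = toy.duplicated(keep="first").tolist()  # flags later copies only
last = toy.duplicated(keep="last").tolist()    # flags earlier copies only
both = toy.duplicated(keep=False).tolist()     # flags every copy

print(first)  # [False, False, True, False]
print(last)   # [True, False, False, False]
print(both)   # [True, False, True, False]
```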

In [12]:
duplicate_records.sort_values(by=["name", "birthdate"]).head(6)

Unnamed: 0,name,sex,birthdate,username,address,SSN,mail
11533,Aaron Coleman,F,12-02-1993,cday,"3852 Brooks Shoal Tanyamouth, TX 04442",691-35-1258,rodriguezkatherine@peters.com
102223,Aaron Coleman,F,12-02-1993,cday,"3852 Brooks Shoal Tanyamouth, TX 04442",691-35-1258,rodriguezkatherine@peters.com
102711,Aaron Diaz,M,21-01-1983,victoriastewart,"39777 Fry Mountain Lake Sarahland, MO 12901",233-13-1425,xpayne@gmail.com
43446,Aaron Diaz,M,21-01-1983,victoriastewart,"39777 Fry Mountain Lake Sarahland, MO 12901",233-13-1425,xpayne@gmail.com
26230,Aaron Edwards,M,02-07-1994,jacqueline11,"228 Hobbs Via Michaelshire, MS 34304",387-99-1485,evansdaniel@simmons.com
103589,Aaron Edwards,M,02-07-1994,jacqueline11,"228 Hobbs Via Michaelshire, MS 34304",387-99-1485,evansdaniel@simmons.com
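Once the duplicates are identified, a possible next step (not shown in the cells above) is to remove them with `DataFrame.drop_duplicates()`. The sketch below works on a made-up frame and keeps the first copy of each record; which copy to keep is an assumption that depends on the data.

```python
import pandas as pd

# Rows 0 and 2 are identical.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "year": [1990, 1985, 1990],
})

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(len(deduped))
print(deduped.duplicated().any())
```

`drop_duplicates()` accepts the same `subset` and `keep` arguments as `duplicated()`, so the removal can be restricted to key columns such as `name` and `birthdate`.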
