# Data Quality Mini-Project: Duplicate Detection
## ***Christophe COMPAIN / Sander COHEN***

### Objective and Available Data
The goal of the project is to identify the software products that are sold on both platforms.

To do so, we have each platform's data separately, in the files ***Company1.csv*** and ***Company2.csv*** respectively.

### Package Imports, Global Variables, and CSV Loading

In [1]:
import pandas as pd
import nltk
import time
import numpy as np
import math
import re
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\scohe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\scohe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
path = "D:\\OneDrive - Université Paris-Dauphine\\Bureau\\Cours Master\\12-Qualité de Données\\Projet\\mini-projet\\"
file1= "Data\\Company1.csv" #"SampleData\\Sample_Company1.csv"
file2= "Data\\Company2.csv" #"SampleData\\Sample_Company2.csv"
real= "Data\\Ground_truth_mappings.csv" #"SampleData\\Sample_Groud_truth_mappings.csv"

In [3]:
company1 = pd.read_csv(path+file1, encoding = "ISO-8859-1")
company2 = pd.read_csv(path+file2, encoding = "ISO-8859-1")
ground_truth_matches = pd.read_csv(path+real, encoding = "ISO-8859-1")
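
The identifiers of the second platform are 20-digit integers (for example 18441480711193821750), which is beyond the signed 64-bit range. As a precaution, a loader could read identifier columns as text so that they are compared as strings everywhere. The sketch below is only an illustration: it uses an in-memory CSV with made-up names and prices instead of the real files, and the `dtype` argument is the one point being shown.

```python
import io
import pandas as pd

# Two-row stand-in for Company2.csv (assumption: same column names; names and prices are made up)
sample_csv = io.StringIO(
    "id,name,description,manufacturer,price\n"
    "18441480711193821750,product a,,,9.95\n"
    "11125907881740407428,product b,,,38.99\n"
)

# Reading `id` as str keeps every digit exactly as written in the file
sample = pd.read_csv(sample_csv, dtype={"id": str})
print(sample["id"].tolist())
```

If this approach were adopted, the two id columns of the ground-truth file would need the same treatment so that later merges compare like with like.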

### Data Exploration

In [4]:
company1.head(5)

Unnamed: 0,id,title,description,manufacturer,price
0,b000jz4hqo,clickart 950 000 - premier image pack (dvd-rom),,broderbund,0.0
1,b0006zf55o,ca international - arcserve lap/desktop oem 30pk,oem arcserve backup v11.1 win 30u for laptops ...,computer associates,0.0
2,b00004tkvy,noah's ark activity center (jewel case ages 3-8),,victory multimedia,0.0
3,b000g80lqo,peachtree by sage premium accounting for nonpr...,peachtree premium accounting for nonprofits 20...,sage software,599.99
4,b0006se5bq,singing coach unlimited,singing coach unlimited - electronic learning ...,carry-a-tune technologies,99.99


In [5]:
company2.head(5)

Unnamed: 0,id,name,description,manufacturer,price
0,11125907881740407428,learning quickbooks 2007,learning quickbooks 2007,intuit,38.99
1,11538923464407758599,superstart! fun with reading & writing!,fun with reading & writing! is designed to hel...,,8.49
2,11343515411965421256,qb pos 6.0 basic software,qb pos 6.0 basic retail mngmt software. for re...,intuit,637.99
3,12049235575237146821,math missions: the amazing arcade adventure (g...,save spectacle city by disrupting randall unde...,,12.95
4,12244614697089679523,production prem cs3 mac upgrad,adobe cs3 production premium mac upgrade from ...,adobe software,805.99


In [6]:
ground_truth_matches.head(5)

Unnamed: 0,idCompany1,idCompany2
0,b000jz4hqo,18441480711193821750
1,b00004tkvy,18441110047404795849
2,b000g80lqo,18441188461196475272
3,b0006se5bq,18428750969726461849
4,b00021xhzw,18430621475529168165


#### Inspecting a First Duplicate

In [7]:
company1[company1.id == ground_truth_matches.idCompany1[1]]

Unnamed: 0,id,title,description,manufacturer,price
2,b00004tkvy,noah's ark activity center (jewel case ages 3-8),,victory multimedia,0.0


In [8]:
company2[company2.id == ground_truth_matches.idCompany2[1]]

Unnamed: 0,id,name,description,manufacturer,price
1881,18441110047404795849,the beginners bible: noah's ark activity cente...,,,9.95


In [9]:
stop_words = set(nltk.corpus.stopwords.words('english'))
# "r" is what remains of markers such as "(r)" once punctuation is stripped
stop_words.add("r")
lemmatizer = nltk.stem.WordNetLemmatizer()

def prep(texte):
    # Replace non-alphanumeric characters with spaces and lower-case everything
    texte = re.sub("[^a-zA-Z0-9_]", " ", str(texte)).lower()
    # Tokenize into words
    tokens = nltk.word_tokenize(texte)
    # Remove stop words
    filtered_tokens = [w for w in tokens if w not in stop_words]
#    # Stemming
#    texte = [nltk.stem.SnowballStemmer('english').stem(w) for w in filtered_tokens]
    # Lemmatization
    texte = [lemmatizer.lemmatize(w) for w in filtered_tokens]
    # Join the tokens back into a single string
    return " ".join(texte)
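
As a quick illustration of what this cleaning does, here is a simplified stand-in for `prep` that needs no NLTK data. It applies the same regular expression and lower-casing, but splits on whitespace instead of calling `nltk.word_tokenize`, uses a tiny hand-written stop-word set (an assumption, not NLTK's list), and skips lemmatization. The name `simple_prep` is hypothetical.

```python
import re

# Hand-picked stop words for the demo only; the notebook uses NLTK's English list plus "r"
STOP_WORDS = {"the", "of", "to", "a", "and", "for", "it", "r"}

def simple_prep(text):
    # Same character filter as prep(): anything outside [a-zA-Z0-9_] becomes a space
    text = re.sub("[^a-zA-Z0-9_]", " ", str(text)).lower()
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(simple_prep("jumpstart(r) advanced 1st grade"))
# prints: jumpstart advanced 1st grade
print(simple_prep("microsoft(r) picture it! digital image pro 9.0"))
# prints: microsoft picture digital image pro 9 0
```

Because punctuation is replaced by spaces, a version number such as `9.0` is split into the two tokens `9` and `0`.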
        

In [10]:
company1['Company']="company1"
company1=company1.rename(columns={"title": "name"})
company2['Company']="company2"
corpus = pd.concat([company1, company2],sort=False,ignore_index=True)
corpus['name'] = corpus['name'].fillna(' ')
corpus['manufacturer'] = corpus['manufacturer'].fillna(' ')
corpus['description'] = corpus['description'].fillna(' ')
# Matching text: cleaned manufacturer followed by cleaned name.
# The description is left out: + ' ' + corpus['description'].apply(prep)
corpus['full data'] = corpus['manufacturer'].apply(prep) + ' ' + corpus['name'].apply(prep)

corpus.tail()

Unnamed: 0,id,name,description,manufacturer,price,Company,full data
4584,14872602878188858026,jumpstart(r) advanced 1st grade,prepare your child for the 1st grade and beyon...,,19.99,company2,jumpstart advanced 1st grade
4585,14916162814320983138,ibm(r) viavoice(r) advanced edition 10,ibm viavoice advanced edition release 10 is a ...,,78.95,company2,ibm viavoice advanced edition 10
4586,14974113209571399013,xbox 360: gears of war,as marcus fenix you fight a war against the im...,,59.99,company2,xbox 360 gear war
4587,14986935400648190776,documents to go premium 7.0,this pda software enables you to use your docu...,,49.99,company2,document go premium 7 0
4588,14996991014087320062,microsoft(r) picture it! digital image pro 9.0,picture it! digital image pro puts you in cont...,,99.87,company2,microsoft picture digital image pro 9 0
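
The matching loop later in the notebook relies on the row order produced by this concatenation: company1 rows come first, so row `j` of company2 sits at position `len(company1) + j` in `corpus`. The following sketch checks that property on two toy catalogues; the rows and the `toy_` names are made up for the illustration.

```python
import pandas as pd

# Toy catalogues (made-up rows) with the same column roles as the notebook
toy1 = pd.DataFrame({"id": ["a1", "a2"], "title": ["photo studio 5", None]})
toy2 = pd.DataFrame({"id": ["b1", "b2", "b3"], "name": ["photo studio v5", "tax tool", None]})

toy1 = toy1.rename(columns={"title": "name"})
toy1["Company"] = "company1"
toy2["Company"] = "company2"

# ignore_index=True renumbers the rows 0..n-1 in concatenation order
toy_corpus = pd.concat([toy1, toy2], sort=False, ignore_index=True)
toy_corpus["name"] = toy_corpus["name"].fillna(" ")

j = 1
print(toy_corpus.loc[len(toy1) + j, "name"])  # prints: tax tool
```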


In [11]:
corpus['full data']

0       broderbund clickart 950 000 premier image pack...
1       computer associate ca international arcserve l...
2       victory multimedia noah ark activity center je...
3       sage software peachtree sage premium accountin...
4           carry tune technology singing coach unlimited
                              ...                        
4584                         jumpstart advanced 1st grade
4585                     ibm viavoice advanced edition 10
4586                                    xbox 360 gear war
4587                              document go premium 7 0
4588              microsoft picture digital image pro 9 0
Name: full data, Length: 4589, dtype: object

In [12]:
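vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.01, sublinear_tf=True)
vectors = vectorizer.fit_transform(corpus['full data'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
feature_names = vectorizer.get_feature_names_out()
# Dense copy of the TF-IDF matrix (np.matrix), indexed row by row in the matching loop
dense = vectors.todense()
#denselist = dense.tolist()
#df = pd.DataFrame(denselist, columns=feature_names)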
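
To make the vectorization step more concrete, here is a small sketch on three made-up product strings. It shows that, with scikit-learn's default `norm="l2"`, every TF-IDF row has unit length, so the dot product of two rows is already their cosine similarity. The variable names are hypothetical, and `max_df` is left at its default because a 1% document-frequency cap cannot be satisfied by a three-document corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up product strings, already cleaned
docs = [
    "adobe photoshop elements 5",
    "adobe photoshop elements 5 win",
    "intuit quickbooks pro 2007",
]

# Same n-gram and sublinear_tf settings as the notebook, default max_df
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X = toy_vectorizer.fit_transform(docs).toarray()

norms = np.linalg.norm(X, axis=1)   # each row has length 1 under norm="l2"
cosine = X @ X.T                    # so the dot product is the cosine similarity
print(np.round(norms, 6))
print(np.round(cosine, 3))
```

Note that scikit-learn's default `token_pattern` only keeps tokens of two or more characters, so single-digit tokens such as the `5` above are ignored unless the pattern is changed.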


In [13]:
#feature_names
dense.shape  # (number of products, number of TF-IDF features)

(4589, 18590)
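
A dense matrix of this shape is fairly large, and most of its entries are zero. As a possible alternative, not what the notebook does, the sparse matrix can be kept as is and all cross-catalogue cosine similarities obtained with one sparse product. The sketch below first estimates the dense storage cost, then demonstrates the product on a toy matrix with made-up values.

```python
import numpy as np
from scipy import sparse

# Size of the dense matrix reported by the notebook, stored as float64 (8 bytes per entry)
n_rows, n_cols = 4589, 18590
dense_bytes = n_rows * n_cols * 8
print(f"dense storage: about {dense_bytes / 1e6:.0f} MB")  # prints: dense storage: about 682 MB

# Toy stand-in for the TF-IDF matrix: rows 0-1 play company1, rows 2-4 play company2
toy = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0, 1.0],
                [1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0],
                [0.0, 0.0, 0.0, 1.0]])
toy /= np.linalg.norm(toy, axis=1, keepdims=True)  # unit-length rows, as with norm="l2"
X = sparse.csr_matrix(toy)

n1 = 2
sims = (X[:n1] @ X[n1:].T).toarray()  # all company1-vs-company2 cosines at once
pairs = np.argwhere(sims > 0.5)       # (i, j) pairs above the 0.5 threshold
print(pairs.tolist())                 # prints: [[0, 0], [1, 1], [1, 2]]
```

Each `(i, j)` pair refers to row `i` of the first block and row `j` of the second, which mirrors the `dense[i]` and `dense[len(company1)+j]` indexing used in the loop.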

In [14]:
##tfidf
#number_of_matches = 0
#matches=[]
#start = time.process_time()
#for i in range(len(company1)):
#    try :  
#        price1 = float(company1.iloc[i,4]) 
#    except : 
#        price1 = 0
#    for j in range(len(company2)):
#        try :  
#            price2 = float(company2.iloc[j,4]) 
#        except : 
#            price2 = 0
#        if price1* price2 == 0:
#            price_ratio=1
#        else:
#            price_ratio =max(price1, price2)/min(price1, price2)
#        try :
#            similarity = np.dot(dense[i],np.transpose(dense[len(company1)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1)+j],np.transpose(dense[len(company1)+j])).item(0))
#        except : 
#            similarity = 0
#        if ((similarity > 0.5) and (price_ratio<2)):# or name_score<=1) :
#            number_of_matches = number_of_matches +1
#            matches.append((company1.iloc[i,0],company2.iloc[j,0]))
#print("Number of matches: {}".format(number_of_matches))
#matches_df = pd.DataFrame(matches)
#matches_df.columns= ['idCompany1','idCompany2']
#diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
#true_positives = diff_df[diff_df.Exist=='both']
#false_positives = diff_df[diff_df.Exist=='right_only']
#false_negatives = diff_df[diff_df.Exist=='left_only']
#end = time.process_time()
#print("Processing time: {}".format(end - start))
#print("Number of true positives: {}".format(len(true_positives)))
#print("Number of false positives: {}".format(len(false_positives)))
#print("Number of false negatives: {}".format(len(false_negatives)))
#precision = len(true_positives)/(len(true_positives)+ len(false_positives))
#print("Precision: {}".format(precision))
#recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
#print("Recall: {}".format(recall))
#f_measure = 2*(precision*recall)/(precision+recall)
#print("F measure: {}".format(f_measure))
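
The price rule used in the matching loop can be isolated in a small function, which makes it easier to test on its own. This is a sketch: the name `price_compatible` and the `max_ratio` parameter are hypothetical, and reading a price of 0 as "unknown" is an assumption based on the data shown earlier.

```python
def price_compatible(price1, price2, max_ratio=2.0):
    """Return True when a pair passes the price rule of the matching loop."""
    # A price of 0 is treated as missing and never rejects a pair (assumption)
    if price1 * price2 == 0:
        return True
    # Otherwise the larger price must be less than max_ratio times the smaller one
    return max(price1, price2) / min(price1, price2) < max_ratio

print(price_compatible(0.0, 266.92))   # True: one price is missing
print(price_compatible(29.95, 31.99))  # True: ratio is about 1.07
print(price_compatible(10.0, 25.0))    # False: ratio is 2.5
```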



In [None]:
# TF-IDF matching: compare each company1 product with each company2 product
matches = []
start = time.process_time()
n1 = len(company1)  # company1 rows come first in corpus, so company2 row j is dense[n1 + j]

def to_price(value):
    # Values that cannot be parsed as a number count as a missing price (0)
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0

for i in range(n1):
    price1 = to_price(company1.iloc[i, 4])  # column 4 holds the price
    for j in range(len(company2)):
        price2 = to_price(company2.iloc[j, 4])
        # Cheap price test first: keep the pair if a price is missing (0)
        # or if the larger price is less than twice the smaller one
        if price1 * price2 == 0 or max(price1, price2) / min(price1, price2) < 2:
            u, v = dense[i], dense[n1 + j]
            # Cosine similarity of the two TF-IDF rows (0 if either row is empty)
            norm = math.sqrt(np.dot(u, u.T).item(0) * np.dot(v, v.T).item(0))
            similarity = np.dot(u, v.T).item(0) / norm if norm > 0 else 0
            if similarity > 0.5:
                # Column 0 holds the product id
                matches.append((company1.iloc[i, 0], company2.iloc[j, 0]))
number_of_matches = len(matches)
print("Number of matches: {}".format(number_of_matches))
matches_df = pd.DataFrame(matches, columns=['idCompany1', 'idCompany2'])
# Outer merge with indicator: 'both' = true positive,
# 'right_only' = false positive, 'left_only' = false negative
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))
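
The evaluation at the end of this cell boils down to comparing two sets of id pairs. In the outer merge, a pair present on both sides is a true positive, a pair found only among the predictions is a false positive, and a pair found only in the ground truth is a false negative. A minimal sketch with made-up pairs and hypothetical `toy_` names:

```python
# Made-up ground truth and predictions, as sets of (idCompany1, idCompany2) pairs
truth = {("a", "1"), ("b", "2"), ("c", "3")}
predicted = {("a", "1"), ("b", "9"), ("c", "3"), ("d", "4")}

tp = len(truth & predicted)   # pairs in both sets
fp = len(predicted - truth)   # predicted but not in the ground truth
fn = len(truth - predicted)   # in the ground truth but not predicted

toy_precision = tp / (tp + fp) if tp + fp else 0.0
toy_recall = tp / (tp + fn) if tp + fn else 0.0
toy_f = (2 * toy_precision * toy_recall / (toy_precision + toy_recall)
         if toy_precision + toy_recall else 0.0)

print(tp, fp, fn)                            # prints: 2 2 1
print(toy_precision, round(toy_recall, 4))   # prints: 0.5 0.6667
print(round(toy_f, 4))                       # prints: 0.5714
```

The guards against empty denominators avoid a `ZeroDivisionError` when no pair is predicted at all.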

In [17]:
def with_product_details(pairs):
    # Attach description, price and cleaned text from both catalogues
    # to each (idCompany1, idCompany2) pair
    def side(company, suffix):
        return (corpus.loc[corpus['Company'] == company]
                .drop(['Company', 'name', 'manufacturer'], axis=1)
                .rename(columns={'id': 'idCompany' + suffix,
                                 'description': 'descr' + suffix,
                                 'price': 'price' + suffix,
                                 'full data': 'full data' + suffix}))
    return (pairs.merge(side('company1', '1'), how='inner', on='idCompany1')
                 .merge(side('company2', '2'), how='inner', on='idCompany2'))

base_false_negatives = with_product_details(false_negatives)

In [18]:
base_false_positives = with_product_details(false_positives)

In [19]:
base_false_positives


Unnamed: 0,idCompany1,idCompany2,Exist,descr1,price1,full data1,descr2,price2,full data2
0,b000ehpzv8,18440602911777368492,right_only,emc retrospect 7.5 disk to diskcromwindows,0,dantz emc retrospect 7 5 disk disk window,emc ju10a0075 emc retrospect 7.5 disk to disk ...,266.92,emc ju10a0075 emc retrospect 7 5 disk disk wi...
1,b000i82j80,7702263644150091221,right_only,,29.95,webroot software spy sweeper spanish,spy sweeper 4.5 with the most advanced blockin...,31.99,webroot software webroot software 65210 spy sw...
2,b000in34bq,7702263644150091221,right_only,format: win 2000 xp,49.95,webroot software webroot spy sweeper antivirus...,spy sweeper 4.5 with the most advanced blockin...,31.99,webroot software webroot software 65210 spy sw...
3,b000b6n2o4,7702263644150091221,right_only,spy sweeper (small box) (win 98 me 2000 xp),29.95,webroot software webroot spy sweeper antispywa...,spy sweeper 4.5 with the most advanced blockin...,31.99,webroot software webroot software 65210 spy sw...
4,b000i82j80,3306668805908280679,right_only,,29.95,webroot software spy sweeper spanish,spy sweeper 4.5 with the most advanced blockin...,31.99,webroot software webroot software 65210 spy sw...
...,...,...,...,...,...,...,...,...,...
577,b000pienka,18037480601440105621,right_only,the weekly reader mastering middle school lear...,29.99,fogware publishing weekly reader mastering hig...,weekly reader?s mastering elementary & middle ...,16.93,fogware publishing weekly reader l mastering ...
578,b000in6tnq,11826796220164615935,right_only,personal finance software for max os x. liquid...,79.99,csdc liquid ledger personal finance,works on mac os x intel mac os x,71.99,modeless software liquid ledger personal finance
579,b000iflerk,17886724437749928713,right_only,sportsman's double play gives you the very bes...,19.95,masque publishing sportsman double play,features sportsman's double play and american ...,17.55,masque publishing sportsman double play ameri...
580,b000pihtbk,17520994730465780544,right_only,,19.99,fogware publishing weekly reader mastering ele...,developed by educational experts the weekly re...,24.91,fogware publishing weekly reader l mastering ...


In [20]:
base_false_negatives

Unnamed: 0,idCompany1,idCompany2,Exist,descr1,price1,full data1,descr2,price2,full data2
0,b0006se5bq,18428750969726461849,left_only,singing coach unlimited - electronic learning ...,99.99,carry tune technology singing coach unlimited,learn to sing with the help of a patented real...,82.5,singing coach unlimited electronic learning p...
1,b00021xhzw,18430621475529168165,left_only,upgrade only; installation of after effects st...,499.99,adobe adobe effect professional 6 5 upgrade st...,adobe after effects pb 6.5 win upgrade.standar...,507,adobe software 22070152 effect 6 5 pbupgrd
2,b0000dbykm,18363672170449359273,left_only,in mia's math adventure: just in time children...,19.99,kutoka mia math adventure time,kutoka interactive 61208 : mia s math adventur...,18.97,kutoka interactive 61208 mias math adventure ...
3,b00029bqa2,18431790997623472871,left_only,disney's 1st & 2nd grade bundle will help your...,14.99,disney disney 1st 2nd grade bundle pixar 1st g...,disney learning 1st & 2nd features an all-star...,13.61,disney learning 1st 2nd grade win
4,b00006hvvo,3409784122469217433,left_only,today enterprises and service providers face i...,0,sonic system inc upg sgms 1000 incremental node,today enterprises and service providers face i...,62920.89,sonicwall gm 1000n incremental lic upg
...,...,...,...,...,...,...,...,...,...
568,b0009x6qew,18305953820307083661,left_only,with instant immersion italian deluxe you'll g...,29.99,topic entertainment instant immersion italian ...,your first class journey to a new ready to com...,31.06,topic entertainment instant immersion italian...
569,b000cs3s2c,17715266298340609214,left_only,- marketing information: macromedia flash remo...,3314.09,adobe flash remoting 1 alp ret eng cd 2u,macromedia flash remoting mx provides the conn...,1928.09,adobe system inc flash remoting 1 0 net java ...
570,b00005bigp,17738870551709614779,left_only,,9.99,school zone shape,shapes challenges children to identify and cre...,9.45,school zone interactive shape track software
571,b000p9cr66,17724977097925207764,left_only,mediarecover gives you the ability to recover ...,29.99,aladdin system mediarecover,mediarecover retrieves your lost photos audio ...,26.14,allume system inc mediarecover
