# Data Quality Mini-Project: Duplicate Detection
## ***Christophe COMPAIN / Sander COHEN***

### Objective and Available Data
The objective of the project is to identify the software products sold on both platforms.

To do so, we have the data for each platform separately, in the files ***Company1.csv*** and ***Company2.csv*** respectively.

### Package Imports, Global Variables, and CSV Import

In [23]:
import pandas as pd
import nltk
import time
import numpy as np
import math
import re
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\scohe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\scohe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [24]:
path = "D:\\OneDrive - Université Paris-Dauphine\\Bureau\\Cours Master\\12-Qualité de Données\\Projet\\mini-projet\\"
file1= "Data\\Company1.csv" #"SampleData\\Sample_Company1.csv"
file2= "Data\\Company2.csv" #"SampleData\\Sample_Company2.csv"
real= "Data\\Ground_truth_mappings.csv" #"SampleData\\Sample_Groud_truth_mappings.csv"

In [25]:
company1 = pd.read_csv(path+file1, encoding = "ISO-8859-1")
company2 = pd.read_csv(path+file2, encoding = "ISO-8859-1")
ground_truth_matches = pd.read_csv(path+real, encoding = "ISO-8859-1").drop_duplicates()

### Data Exploration

In [26]:
company1.head(5)

Unnamed: 0,id,title,description,manufacturer,price
0,b000jz4hqo,clickart 950 000 - premier image pack (dvd-rom),,broderbund,0.0
1,b0006zf55o,ca international - arcserve lap/desktop oem 30pk,oem arcserve backup v11.1 win 30u for laptops ...,computer associates,0.0
2,b00004tkvy,noah's ark activity center (jewel case ages 3-8),,victory multimedia,0.0
3,b000g80lqo,peachtree by sage premium accounting for nonpr...,peachtree premium accounting for nonprofits 20...,sage software,599.99
4,b0006se5bq,singing coach unlimited,singing coach unlimited - electronic learning ...,carry-a-tune technologies,99.99


In [27]:
company2.head(5)

Unnamed: 0,id,name,description,manufacturer,price
0,11125907881740407428,learning quickbooks 2007,learning quickbooks 2007,intuit,38.99
1,11538923464407758599,superstart! fun with reading & writing!,fun with reading & writing! is designed to hel...,,8.49
2,11343515411965421256,qb pos 6.0 basic software,qb pos 6.0 basic retail mngmt software. for re...,intuit,637.99
3,12049235575237146821,math missions: the amazing arcade adventure (g...,save spectacle city by disrupting randall unde...,,12.95
4,12244614697089679523,production prem cs3 mac upgrad,adobe cs3 production premium mac upgrade from ...,adobe software,805.99


In [28]:
ground_truth_matches.head(5)

Unnamed: 0,idCompany1,idCompany2
0,b000jz4hqo,18441480711193821750
1,b00004tkvy,18441110047404795849
2,b000g80lqo,18441188461196475272
3,b0006se5bq,18428750969726461849
4,b00021xhzw,18430621475529168165


#### Inspecting a First Duplicate

In [29]:
company1[company1.id == ground_truth_matches.idCompany1[1]]

Unnamed: 0,id,title,description,manufacturer,price
2,b00004tkvy,noah's ark activity center (jewel case ages 3-8),,victory multimedia,0.0


In [30]:
company2[company2.id == ground_truth_matches.idCompany2[1]]

Unnamed: 0,id,name,description,manufacturer,price
1881,18441110047404795849,the beginners bible: noah's ark activity cente...,,,9.95


In [31]:
stop_words = set(nltk.corpus.stopwords.words('english'))  
stop_words.update(["r","v","software","entertainment","inc","usa"])

# Build the lemmatizer once instead of once per token
lemmatizer = nltk.stem.WordNetLemmatizer()

def prep(texte):
    # Replace non-alphanumeric characters with spaces and lowercase everything
    texte = re.sub("[^a-zA-Z0-9_]", " ",str(texte)).lower()
    # Normalize a few frequent abbreviations and variants
    texte = texte.replace("professional", "pro").replace(" upg "," upgrade ").replace(" dlx "," deluxe ")
    # Tokenize into words
    tokens = nltk.word_tokenize(texte)
    # Remove stopwords
    filtered_tokens = [w for w in tokens if w not in stop_words]
#    # Stemming
#    texte = [nltk.stem.SnowballStemmer('english').stem(w) for w in filtered_tokens]
    # Lemmatization
    texte = [lemmatizer.lemmatize(w) for w in filtered_tokens]
    # Join the tokens back into a single string
    return " ".join(texte)
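
To make the effect of this normalization concrete, here is a minimal sketch of the same steps without NLTK. It assumes a tiny hand-written stopword list and plain whitespace tokenization, and it skips lemmatization; `normalize` and `DEMO_STOP_WORDS` are hypothetical names that do not appear in the notebook.

```python
import re

# Hypothetical, deliberately tiny stopword list. The notebook uses NLTK's English
# list extended with "r", "v", "software", "entertainment", "inc", "usa".
DEMO_STOP_WORDS = {"the", "of", "for", "and", "r", "v",
                   "software", "entertainment", "inc", "usa"}

def normalize(text):
    """Lowercase, strip punctuation, expand abbreviations, drop stopwords."""
    # Replace every non-alphanumeric character with a space, then lowercase
    text = re.sub("[^a-zA-Z0-9_]", " ", str(text)).lower()
    # Unlike prep(), pad with spaces so " upg " also matches at the string edges
    text = " " + text + " "
    text = (text.replace("professional", "pro")
                .replace(" upg ", " upgrade ")
                .replace(" dlx ", " deluxe "))
    # Whitespace tokenization stands in for nltk.word_tokenize here
    tokens = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    return " ".join(tokens)

print(normalize("Adobe(R) Photoshop Professional UPG"))
print(normalize("xbox 360: gears of war"))
```

Without the lemmatizer, plural forms such as "gears" are left untouched, which is the main difference from what `prep` produces.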
        

In [32]:
## Price cleanup
def retreatprice(texte):
    # Keep only digits and the decimal point; a blank price becomes 0.0 (treated as unknown later on)
    cleaned = re.sub("[^0-9.]", "",str(texte))
    return float(cleaned) if cleaned else 0.0
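
Price fields in product catalogs can carry currency symbols, thousands separators, or blanks. The following is a minimal sketch of a more defensive parser for such values; `parse_price` and its `default` parameter are hypothetical names introduced for illustration, and the sample strings are made up.

```python
import re

def parse_price(raw, default=0.0):
    """Return the price as a float, or `default` when no number can be read."""
    # Keep only digits and the decimal point ("$ 1,299.00" -> "1299.00")
    cleaned = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(cleaned)
    except ValueError:
        # Empty or malformed strings such as "" or "1.2.3" fall back to the default
        return default

for raw in ["599.99", "$ 1,299.00", " ", None, "19.99 gbp"]:
    print(repr(raw), "->", parse_price(raw))
```

Returning 0 for unreadable prices fits the rest of the notebook, where a price of 0 is treated as "unknown" by the price filter.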


In [33]:
company1['Company']="company1"
company1=company1.rename(columns={"title": "name"})
company1['name'] = company1['name'].fillna(' ')
company1['manufacturer'] = company1['manufacturer'].fillna(' ')
company1['description'] = company1['description'].fillna(' ')
company1['price'] = company1['price'].fillna(' ')
company1['price_retreat'] = company1['price'].apply(retreatprice)
company1['full data']=company1['manufacturer'].apply(prep) + ' ' + company1['name'].apply(prep) # + ' ' + company1['description'].apply(prep)

company2['Company']="company2"
company2['name'] = company2['name'].fillna(' ')
company2['manufacturer'] = company2['manufacturer'].fillna(' ')
company2['description'] = company2['description'].fillna(' ')
company2['price'] = company2['price'].fillna(' ')
company2['price_retreat'] = company2['price'].apply(retreatprice)
company2['full data']=company2['manufacturer'].apply(prep) + ' ' + company2['name'].apply(prep) # + ' ' + company2['description'].apply(prep)


In [34]:
corpus = pd.concat([company1, company2],sort=False,ignore_index=True)
#corpus.reset_index(drop=True)
len(corpus)
corpus.tail()

Unnamed: 0,id,name,description,manufacturer,price,Company,price_retreat,full data
4584,14872602878188858026,jumpstart(r) advanced 1st grade,prepare your child for the 1st grade and beyon...,,19.99,company2,19.99,jumpstart advanced 1st grade
4585,14916162814320983138,ibm(r) viavoice(r) advanced edition 10,ibm viavoice advanced edition release 10 is a ...,,78.95,company2,78.95,ibm viavoice advanced edition 10
4586,14974113209571399013,xbox 360: gears of war,as marcus fenix you fight a war against the im...,,59.99,company2,59.99,xbox 360 gear war
4587,14986935400648190776,documents to go premium 7.0,this pda software enables you to use your docu...,,49.99,company2,49.99,document go premium 7 0
4588,14996991014087320062,microsoft(r) picture it! digital image pro 9.0,picture it! digital image pro puts you in cont...,,99.87,company2,99.87,microsoft picture digital image pro 9 0


In [35]:
# Find the words that occur only once in the corpus so they can be removed
allwords = corpus['full data'].str.split(expand=True).stack().value_counts()
stop_unique = set(allwords[allwords==1].index)

def prep2(texte):
    tokens = nltk.word_tokenize(texte)
    # Remove the words that occur only once
    filtered_tokens = [w for w in tokens if w not in stop_unique]
    # Join the tokens back into a single string
    return " ".join(filtered_tokens)
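
The idea behind this step is that a word appearing exactly once in the combined corpus belongs to a single record, so it cannot contribute to the overlap between two records. A minimal sketch of the same filtering with the standard library only, on three made-up product names:

```python
from collections import Counter

# Made-up, already normalized product names
docs = [
    "adobe photoshop cs3 upgrade",
    "adobe photoshop element 5",
    "microsoft office 2007 upgrade",
]

# Count every token across the whole corpus
counts = Counter(token for doc in docs for token in doc.split())
# Tokens seen exactly once occur in a single record only
singletons = {token for token, n in counts.items() if n == 1}

# Drop the singletons from every record
cleaned = [" ".join(t for t in doc.split() if t not in singletons) for doc in docs]
print(sorted(singletons))
print(cleaned)
```

One trade-off worth keeping in mind: a distinguishing token such as a version number is also removed when it appears only once, so two records can look more alike after filtering than before.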
        

In [36]:
company1['full data']=company1['full data'].apply(prep2)
company2['full data']=company2['full data'].apply(prep2)

In [37]:
#company1_light = company1[company1['full data'].str.contains(filtre)].reset_index(drop=True)
#company2_light

In [38]:
### punch software data
filtre = "punch"
#stopwords_suppl =" software"
company1_light = company1[company1['full data'].str.contains(filtre)].reset_index(drop=True)
company2_light = company2[company2['full data'].str.contains(filtre)].reset_index(drop=True)
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.5,sublinear_tf=True,stop_words=[filtre])#+stopwords_suppl]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

number_of_matches = 0
matches=[]
start = time.process_time()
for i in range(len(company1_light)):
    try :  
        price1 = float(company1_light.iloc[i,6]) 
    except : 
        price1 = 0
    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
        try :  
            price2 = float(company2_light.iloc[j,6]) 
        except : 
            price2 = 0
        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if  ((similarity > 0.35)) :#or jd_ng1_ng2_name<0.1 :# or name_score<=1) :
                number_of_matches = number_of_matches +1
                matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))
print("Number of matches: {}".format(number_of_matches))
matches_df = pd.DataFrame(matches)
matches_df.columns= ['idCompany1','idCompany2']
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

Number of matches: 61
Processing time: 0.34375
Number of true positives: 31
Number of false positives: 30
Number of false negatives: 1269
Precision: 0.5081967213114754
Recall: 0.023846153846153847
F measure: 0.0455547391623806
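
The evaluation at the end of the cell boils down to comparing two sets of id pairs. A minimal sketch of that computation with plain Python sets follows; `prf_from_counts` and `evaluate` are hypothetical helper names, and the guards against empty denominators are an addition that the notebook does not have.

```python
def prf_from_counts(tp, fp, fn):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def evaluate(predicted_pairs, truth_pairs):
    """Score predicted (idCompany1, idCompany2) pairs against the ground truth."""
    predicted, truth = set(predicted_pairs), set(truth_pairs)
    tp = len(predicted & truth)   # pairs present on both sides
    fp = len(predicted - truth)   # predicted but not in the ground truth
    fn = len(truth - predicted)   # in the ground truth but not predicted
    return prf_from_counts(tp, fp, fn)

# Counts printed by the pass above: 31 TP, 30 FP, 1269 FN
print(prf_from_counts(31, 30, 1269))

# Toy example with made-up ids
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
print(evaluate(predicted, truth))
```

Recall after a single pass is low by construction: the denominator counts all 1300 ground-truth pairs (31 + 1269), while the pass only looks at one keyword block.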


In [39]:
### topics data
filtre = "topic"
#stopwords_suppl =" entertainment"
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
company1_light = company1_light[company1_light['full data'].str.contains(filtre)].reset_index(drop=True)
company2_light = company2_light[company2_light['full data'].str.contains(filtre)].reset_index(drop=True)
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.05,sublinear_tf=True,stop_words=[filtre]) #+stopwords_suppl]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
    try :  
        price1 = float(company1_light.iloc[i,6]) 
    except : 
        price1 = 0
    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
        try :  
            price2 = float(company2_light.iloc[j,6]) 
        except : 
            price2 = 0
        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if  ((similarity > 0.3)) or jd_ng1_ng2_name<0.2 :# or name_score<=1) :
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))
print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
new_matches_df = pd.DataFrame(new_matches)
new_matches_df.columns= ['idCompany1','idCompany2']
matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)

diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 51
Total matches: 112
Processing time: 0.9375
Number of true positives: 61
Number of false positives: 51
Number of false negatives: 1239
Precision: 0.5446428571428571
Recall: 0.04692307692307692
F measure: 0.08640226628895183
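
The long `np.dot(...)/math.sqrt(...)` expression in the loop is the cosine similarity between two TF-IDF rows, cos(u, v) = u·v / (‖u‖ ‖v‖), and the surrounding `try`/`except` covers the case where a row is all zeros. A minimal pure-Python sketch of the same quantity, with `cosine_similarity` as a hypothetical helper name and an explicit zero-vector check instead of exception handling:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # An empty document has no direction; treat it as dissimilar to everything
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))   # shared weight on both terms
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # no term in common
print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))   # empty document
```

Since `TfidfVectorizer` normalizes each row to unit L2 length by default, the denominator is 1 for every non-empty row, so the dot product alone would give the same score.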


In [40]:
### apple data
filtre = "apple"
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
company1_light = company1_light[company1_light['full data'].str.contains(filtre)].reset_index(drop=True)
company2_light = company2_light[company2_light['full data'].str.contains(filtre)].reset_index(drop=True)
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.1,sublinear_tf=True,stop_words=[filtre]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
    try :  
        price1 = float(company1_light.iloc[i,6]) 
    except : 
        price1 = 0
    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
        try :  
            price2 = float(company2_light.iloc[j,6]) 
        except : 
            price2 = 0
        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if  ((similarity > 0.4)) or jd_ng1_ng2_name<0.5 :# or name_score<=1) :
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))
print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
new_matches_df = pd.DataFrame(new_matches)
new_matches_df.columns= ['idCompany1','idCompany2']
matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)

diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 66
Total matches: 178
Processing time: 0.703125
Number of true positives: 108
Number of false positives: 70
Number of false negatives: 1192
Precision: 0.6067415730337079
Recall: 0.08307692307692308
F measure: 0.14614343707713126
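
In this pass a pair is also accepted when the Jaccard distance between the two token sets is below 0.5. The distance is 1 - |A ∩ B| / |A ∪ B|, so 0 means identical token sets and 1 means no token in common. A minimal sketch without NLTK follows; `jaccard_distance` is redefined here as a standalone helper, and returning 1.0 for two empty sets is a choice made for this sketch only.

```python
def jaccard_distance(tokens_a, tokens_b):
    """1 - |A ∩ B| / |A ∪ B| on two collections of tokens."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    if not union:
        # Assumption of this sketch: two empty records are not considered a match
        return 1.0
    return (len(union) - len(a & b)) / len(union)

# Made-up product names, already normalized
name1 = "adobe photoshop cs3".split()
name2 = "adobe photoshop cs3 upgrade".split()
name3 = "microsoft office 2007".split()

print(jaccard_distance(name1, name2))  # 3 shared tokens out of 4 distinct ones
print(jaccard_distance(name1, name3))  # nothing in common
```

With a threshold of 0.5, the first pair would be accepted (distance 0.25), even though one record is an upgrade edition and the other is not, which illustrates how a loose Jaccard threshold can add false positives.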


In [41]:
### Encore data
filtre = "encore"
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
company1_light = company1_light[company1_light['full data'].str.contains(filtre)].reset_index(drop=True)
company2_light = company2_light[company2_light['full data'].str.contains(filtre)].reset_index(drop=True)
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.01,sublinear_tf=True,stop_words=[filtre]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
    try :  
        price1 = float(company1_light.iloc[i,6]) 
    except : 
        price1 = 0
    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
        try :  
            price2 = float(company2_light.iloc[j,6]) 
        except : 
            price2 = 0
        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if ((similarity > 0.2)) or jd_ng1_ng2_name<0.2 :# or name_score<=1) :
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))
print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
new_matches_df = pd.DataFrame(new_matches)
new_matches_df.columns= ['idCompany1','idCompany2']
matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)

diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 144
Total matches: 322
Processing time: 7.3125
Number of true positives: 201
Number of false positives: 121
Number of false negatives: 1099
Precision: 0.6242236024844721
Recall: 0.15461538461538463
F measure: 0.24784217016029597
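
Every pass applies the same price filter before computing a similarity: a pair is only considered when one of the prices is 0 (unknown) or when the larger price is less than twice the smaller one. A minimal sketch of that rule as a standalone function, where `prices_compatible` and `max_ratio` are hypothetical names:

```python
def prices_compatible(price1, price2, max_ratio=2.0):
    """True when a price is unknown (0) or the larger is below max_ratio times the smaller."""
    if price1 * price2 == 0:
        # At least one price is missing, so the filter cannot rule the pair out
        return True
    return max(price1, price2) / min(price1, price2) < max_ratio

print(prices_compatible(0.0, 637.99))    # unknown price on one side
print(prices_compatible(49.99, 59.99))   # close prices
print(prices_compatible(19.99, 99.87))   # roughly a factor of five apart
```

Testing for a zero product first also protects the division, because `min(price1, price2)` can only be 0 in the branch that has already returned.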


In [42]:
### Adobe data
filtre = "adobe"
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
company1_light = company1_light[company1_light['full data'].str.contains(filtre)].reset_index(drop=True)
company2_light = company2_light[company2_light['full data'].str.contains(filtre)].reset_index(drop=True)
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.5,sublinear_tf=True,stop_words=[filtre]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
    try :  
        price1 = float(company1_light.iloc[i,6]) 
    except : 
        price1 = 0
    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
        try :  
            price2 = float(company2_light.iloc[j,6]) 
        except : 
            price2 = 0
        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if ((similarity > 0.6)) :# jd_ng1_ng2_name<0.3 :# or name_score<=1) :
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))
print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
new_matches_df = pd.DataFrame(new_matches)
new_matches_df.columns= ['idCompany1','idCompany2']
matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)

diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 91
Total matches: 413
Processing time: 4.15625
Number of true positives: 251
Number of false positives: 162
Number of false negatives: 1049
Precision: 0.6077481840193705
Recall: 0.1930769230769231
F measure: 0.29305312317571514
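
Each keyword pass repeats the same three steps: drop the records already matched, keep the records containing the keyword, and compare the remaining pairs against a threshold. The sketch below shows how those steps could be factored into one function. It works on plain lists of dictionaries rather than DataFrames, and it uses a toy token-overlap score in place of the TF-IDF cosine; `run_pass`, `token_overlap`, and the sample records are all hypothetical.

```python
def run_pass(records1, records2, keyword, score, threshold, matched1, matched2):
    """Compare only the still-unmatched records whose text contains `keyword`."""
    block1 = [r for r in records1 if r["id"] not in matched1 and keyword in r["text"]]
    block2 = [r for r in records2 if r["id"] not in matched2 and keyword in r["text"]]
    new_matches = [
        (r1["id"], r2["id"])
        for r1 in block1
        for r2 in block2
        if score(r1["text"], r2["text"]) > threshold
    ]
    # Matched ids are excluded from later passes, as in the notebook
    matched1.update(m[0] for m in new_matches)
    matched2.update(m[1] for m in new_matches)
    return new_matches

def token_overlap(a, b):
    """Toy score (Jaccard similarity on tokens), standing in for the TF-IDF cosine."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

company1 = [{"id": "a1", "text": "adobe photoshop cs3"},
            {"id": "a2", "text": "microsoft office 2007"}]
company2 = [{"id": "b1", "text": "adobe photoshop cs3 upgrade"},
            {"id": "b2", "text": "microsoft office 2007 pro"}]

matched1, matched2 = set(), set()
all_matches = []
# One (keyword, threshold) entry per pass; an empty keyword keeps every record
for keyword, threshold in [("adobe", 0.6), ("microsoft", 0.45)]:
    all_matches += run_pass(company1, company2, keyword, token_overlap,
                            threshold, matched1, matched2)
print(all_matches)
```

Keeping the keyword and threshold in one list makes the per-vendor tuning visible in a single place instead of being spread across near-identical cells.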


In [43]:
### microsoft data
filtre = "microsoft"
#stopwords_suppl =" software"
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
company1_light = company1_light[company1_light['full data'].str.contains(filtre)].reset_index(drop=True)
company2_light = company2_light[company2_light['full data'].str.contains(filtre)].reset_index(drop=True)
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.5,sublinear_tf=True,stop_words=[filtre]) #+stopwords_suppl]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
    try :  
        price1 = float(company1_light.iloc[i,6]) 
    except : 
        price1 = 0
    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
        try :  
            price2 = float(company2_light.iloc[j,6]) 
        except : 
            price2 = 0
        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if ((similarity > 0.45)):# or name_score<=1) :
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))

print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
new_matches_df = pd.DataFrame(new_matches)
new_matches_df.columns= ['idCompany1','idCompany2']
matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
                
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 73
Total matches: 486
Processing time: 4.140625
Number of true positives: 285
Number of false positives: 201
Number of false negatives: 1015
Precision: 0.5864197530864198
Recall: 0.21923076923076923
F measure: 0.3191489361702127


In [44]:
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,3), max_df=0.01,sublinear_tf=True)#,stop_words=["software"]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
#    try :  
    price1 = float(company1_light.iloc[i,6]) 
#    except : 
#        price1 = 0
#    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
#    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
#        try :  
        price2 = float(company2_light.iloc[j,6]) 
#        except : 
#            price2 = 0
#        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
#        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
#        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except : 
                similarity = 0
            if ((similarity > 0.5)):# or name_score<=1) :
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))



print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
new_matches_df = pd.DataFrame(new_matches)
new_matches_df.columns= ['idCompany1','idCompany2']


matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
#matches_df_temp = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
                
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
#diff_df = pd.merge(ground_truth_matches, matches_df_temp, how='outer', indicator='Exist')


true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 678
Total matches: 1164
Processing time: 756.0
Number of true positives: 724
Number of false positives: 440
Number of false negatives: 576
Precision: 0.6219931271477663
Recall: 0.556923076923077
F measure: 0.5876623376623376
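
This unfiltered pass compares every remaining record of one catalog with every remaining record of the other, which is reflected in the processing time of 756 seconds. One common way to reduce the number of comparisons, which is not what the notebook does, is to build an inverted index and only compare pairs that share at least one token. Pairs with no token in common have no n-gram in common either, so their cosine similarity is 0 and they could never pass the threshold. A minimal sketch, with `candidate_pairs` as a hypothetical helper and whitespace tokenization as an assumption:

```python
from collections import defaultdict

def candidate_pairs(texts1, texts2):
    """Return the index pairs (i, j) whose texts share at least one token."""
    # Inverted index: token -> positions of the second-catalog records containing it
    index = defaultdict(set)
    for j, text in enumerate(texts2):
        for token in set(text.split()):
            index[token].add(j)

    pairs = set()
    for i, text in enumerate(texts1):
        for token in set(text.split()):
            for j in index.get(token, ()):
                pairs.add((i, j))
    return pairs

# Made-up, already normalized product names
texts1 = ["adobe photoshop", "intuit quickbooks"]
texts2 = ["photoshop element", "norton antivirus", "quickbooks pro"]

pairs = candidate_pairs(texts1, texts2)
print(sorted(pairs))
print(len(pairs), "candidate pairs instead of", len(texts1) * len(texts2))
```

The gain depends on how selective the tokens are: very frequent tokens produce many candidate pairs, so dropping them beforehand (as `max_df` does for the vectorizer) helps.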


In [23]:
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.01,sublinear_tf=True)#,stop_words=["software"]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
#    try :  
    price1 = float(company1_light.iloc[i,6]) 
#    except : 
#        price1 = 0
#    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
#    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
#        try :  
        price2 = float(company2_light.iloc[j,6]) 
#        except : 
#            price2 = 0
#        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
#        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
#        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
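        # Compare only pairs where one price is missing (0) or the prices differ by less than a factor of 2.
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                # Cosine similarity between TF-IDF row i (Company1) and row len(company1_light)+j (Company2).
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except ZeroDivisionError:  # an all-zero TF-IDF row has norm 0
                similarity = 0
            if similarity > 0.5: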
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))



print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
# Passing the column names to the constructor also works when no new match was found.
new_matches_df = pd.DataFrame(new_matches, columns=['idCompany1','idCompany2'])


matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
###matches_df_temp = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
                
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
###diff_df = pd.merge(ground_truth_matches, matches_df_temp, how='outer', indicator='Exist')


true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 163
Total matches: 1206
Processing time: 293.9375
Number of true positives: 786
Number of false positives: 420
Number of false negatives: 514
Precision: 0.6517412935323383
Recall: 0.6046153846153847
F measure: 0.627294493216281
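
The pass above computes one cosine similarity per pair inside two nested Python loops, which is the likely reason for the long processing time. As a sketch under assumptions, the whole similarity matrix can be obtained with a single matrix product. The helper name `cosine_similarity_matrix` and the toy vectors are illustrative and do not come from the notebook. Rows with zero norm get similarity 0, which mirrors the fallback used in the loop.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Normalise each row; rows with zero norm are left as zeros (similarity 0).
    a_unit = np.divide(a, a_norm, out=np.zeros_like(a), where=a_norm > 0)
    b_unit = np.divide(b, b_norm, out=np.zeros_like(b), where=b_norm > 0)
    return a_unit @ b_unit.T


# Toy vectors standing in for TF-IDF rows (hypothetical values).
left = np.array([[1.0, 0.0], [0.0, 0.0]])
right = np.array([[1.0, 0.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(left, right)

# Index pairs whose similarity exceeds the 0.5 threshold used in the notebook.
candidate_pairs = [(int(i), int(j)) for i, j in zip(*np.nonzero(sim > 0.5))]
print(candidate_pairs)
```

In the notebook, `left` and `right` would correspond to the first `len(company1_light)` rows of `dense` and to the remaining rows. The price condition would still have to be applied to the candidate pairs afterwards.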


In [24]:
company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

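# Same pass as before, this time on unigrams only (ngram_range=(1,1)).
vectorizer = TfidfVectorizer(ngram_range=(1,1), max_df=0.01,sublinear_tf=True)#,stop_words=["software"]) #ngram_range=(1),
vectors = vectorizer.fit_transform(corpus['full data'])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2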
dense = vectors.todense()

new_number_of_matches = 0
new_matches=[]
start = time.process_time()
for i in range(len(company1_light)):
#    try :  
    price1 = float(company1_light.iloc[i,6]) 
#    except : 
#        price1 = 0
#    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
#    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
    for j in range(len(company2_light)):
#        try :  
        price2 = float(company2_light.iloc[j,6]) 
#        except : 
#            price2 = 0
#        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
#        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
#        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
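        # Compare only pairs where one price is missing (0) or the prices differ by less than a factor of 2.
        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
            try :
                # Cosine similarity between TF-IDF row i (Company1) and row len(company1_light)+j (Company2).
                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
            except ZeroDivisionError:  # an all-zero TF-IDF row has norm 0
                similarity = 0
            if similarity > 0.5: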
                new_number_of_matches = new_number_of_matches +1
                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))



print("New matches: {}".format(new_number_of_matches))
number_of_matches= number_of_matches + new_number_of_matches
print("Total matches: {}".format(number_of_matches))
# Passing the column names to the constructor also works when no new match was found.
new_matches_df = pd.DataFrame(new_matches, columns=['idCompany1','idCompany2'])


matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
###matches_df_temp = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
                
diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
###diff_df = pd.merge(ground_truth_matches, matches_df_temp, how='outer', indicator='Exist')


true_positives = diff_df[diff_df.Exist=='both']
false_positives = diff_df[diff_df.Exist=='right_only']
false_negatives = diff_df[diff_df.Exist=='left_only']
end = time.process_time()
print("Processing time: {}".format(end - start))
print("Number of true positives: {}".format(len(true_positives)))
print("Number of false positives: {}".format(len(false_positives)))
print("Number of false negatives: {}".format(len(false_negatives)))
precision = len(true_positives)/(len(true_positives)+ len(false_positives))
print("Precision: {}".format(precision))
recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
print("Recall: {}".format(recall))
f_measure = 2*(precision*recall)/(precision+recall)
print("F measure: {}".format(f_measure))

New matches: 297
Total matches: 1503
Processing time: 39.828125
Number of true positives: 931
Number of false positives: 572
Number of false negatives: 369
Precision: 0.6194278110445776
Recall: 0.7161538461538461
F measure: 0.6642882625758117


In [25]:
#company1_light=company1[~company1.id.isin(matches_df.idCompany1)]
#company2_light=company2[~company2.id.isin(matches_df.idCompany2)]
#corpus = pd.concat([company1_light, company2_light],sort=False,ignore_index=True)

In [26]:
#vectorizer = TfidfVectorizer(ngram_range=(1,1), max_df=0.01,sublinear_tf=True)#,stop_words=["software"]) #ngram_range=(1),
#vectors = vectorizer.fit_transform(corpus['full data'])
#feature_names = vectorizer.get_feature_names()
#dense = vectors.todense()

#new_number_of_matches = 0
#new_matches=[]
#start = time.process_time()
#for i in range(len(company1_light)):
##    try :  
#    price1 = float(company1_light.iloc[i,6]) 
##    except : 
##        price1 = 0
##    tokens1name = nltk.word_tokenize(company1_light.iloc[i,7])
##    ng1_tokensname = set(nltk.ngrams(tokens1name, n=1))
#    for j in range(len(company2_light)):
##        try :  
#        price2 = float(company2_light.iloc[j,6]) 
##        except : 
##            price2 = 0
##        tokens2name = nltk.word_tokenize(company2_light.iloc[j,7])
##        ng2_tokensname = set(nltk.ngrams(tokens2name, n=1))
##        jd_ng1_ng2_name = nltk.jaccard_distance(ng1_tokensname, ng2_tokensname)
#        if price1* price2 == 0 or max(price1, price2)/min(price1, price2)<2:
#            try :
#                similarity = np.dot(dense[i],np.transpose(dense[len(company1_light)+j])).item(0)/math.sqrt(np.dot(dense[i],np.transpose(dense[i])).item(0) * np.dot(dense[len(company1_light)+j],np.transpose(dense[len(company1_light)+j])).item(0))
#            except : 
#                similarity = 0
#            if ((similarity > 0.45)):# or name_score<=1) :
#                new_number_of_matches = new_number_of_matches +1
#                new_matches.append((company1_light.iloc[i,0],company2_light.iloc[j,0]))



#print("New matches: {}".format(new_number_of_matches))
#number_of_matches= number_of_matches + new_number_of_matches
#print("Total matches: {}".format(number_of_matches))
#new_matches_df = pd.DataFrame(new_matches)
#new_matches_df.columns= ['idCompany1','idCompany2']


###matches_df = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
#matches_df_temp = pd.concat([matches_df, new_matches_df],sort=False,ignore_index=True)
                
###diff_df = pd.merge(ground_truth_matches, matches_df, how='outer', indicator='Exist')
#diff_df = pd.merge(ground_truth_matches, matches_df_temp, how='outer', indicator='Exist')


#true_positives = diff_df[diff_df.Exist=='both']
#false_positives = diff_df[diff_df.Exist=='right_only']
#false_negatives = diff_df[diff_df.Exist=='left_only']
#end = time.process_time()
#print("Processing time: {}".format(end - start))
#print("Number of true positives: {}".format(len(true_positives)))
#print("Number of false positives: {}".format(len(false_positives)))
#print("Number of false negatives: {}".format(len(false_negatives)))
#precision = len(true_positives)/(len(true_positives)+ len(false_positives))
#print("Precision: {}".format(precision))
#recall = len(true_positives)/(len(true_positives)+ len(false_negatives))
#print("Recall: {}".format(recall))
#f_measure = 2*(precision*recall)/(precision+recall)
#print("F measure: {}".format(f_measure))

In [27]:
# Attach the Company1 and Company2 attributes to each false negative for manual inspection.
company1_side = (corpus.loc[corpus['Company'] == 'company1']
                 .drop(['Company','name','manufacturer'], axis=1)
                 .rename(columns = {'id': 'idCompany1','description': 'descr1',
                                    'price': 'price1','full data': 'full data1'}))
company2_side = (corpus.loc[corpus['Company'] == 'company2']
                 .drop(['Company','name','manufacturer'], axis=1)
                 .rename(columns = {'id': 'idCompany2','description': 'descr2',
                                    'price': 'price2','full data': 'full data2'}))
base_false_negatives = (false_negatives
                        .merge(company1_side, how='inner', on='idCompany1')
                        .merge(company2_side, how='inner', on='idCompany2'))

In [28]:
# Attach the Company1 and Company2 attributes to each false positive for manual inspection.
company1_side = (corpus.loc[corpus['Company'] == 'company1']
                 .drop(['Company','name','manufacturer'], axis=1)
                 .rename(columns = {'id': 'idCompany1','description': 'descr1',
                                    'price': 'price1','full data': 'full data1'}))
company2_side = (corpus.loc[corpus['Company'] == 'company2']
                 .drop(['Company','name','manufacturer'], axis=1)
                 .rename(columns = {'id': 'idCompany2','description': 'descr2',
                                    'price': 'price2','full data': 'full data2'}))
base_false_positives = (false_positives
                        .merge(company1_side, how='inner', on='idCompany1')
                        .merge(company2_side, how='inner', on='idCompany2'))

In [29]:
base_false_positives


Unnamed: 0,idCompany1,idCompany2,Exist,descr1,price1,price_retreat_x,full data1,descr2,price2,price_retreat_y,full data2
0,b00021xhzw,7972469902462196789,right_only,upgrade only; installation of after effects st...,499.99,499.99,adobe adobe effect pro 6 5 upgrade standard pro,system requirements powerpc® g4 or g5 or multi...,306.99,306.99,adobe cs3 effect upgrade
1,b00021xhzw,8148630564555070134,right_only,upgrade only; installation of after effects st...,499.99,499.99,adobe adobe effect pro 6 5 upgrade standard pro,system requirements powerpc® g4 or g5 or multi...,329.99,329.99,adobe cs3 effect academic
2,b000o26lo8,8148630564555070134,right_only,note: this is the upgrade version of adobe aft...,299,299.00,adobe adobe effect cs3 upgrade,system requirements powerpc® g4 or g5 or multi...,329.99,329.99,adobe cs3 effect academic
3,b000gzwjgc,18431827533263730235,right_only,- marketing information: dragon naturallyspeak...,399.54,399.54,nuance academic acad upgrade dragon naturallys...,voice recognition - english - pc,212.62,212.62,nuance dragon naturallyspeaking 9 0 pro upgrade
4,b000083k56,5658005667319165691,right_only,,0,0.00,compaq computer compaq comp secure path v3 0c ...,hp 231327-b22 : usually ships in 24 hours : : ...,65395.24,65395.24,231327 b22 hp storageworks secure path netware...
...,...,...,...,...,...,...,...,...,...,...,...
147,b00006sijr,4579143769589006469,right_only,the big story is flexibility and ease of use. ...,129.99,129.99,punch punch home design architectura series 18,we've taken our bestselling benchmark home des...,135.99,135.99,punch punch 90500 home design studio mac
148,b00006sijr,1203903874013146755,right_only,the big story is flexibility and ease of use. ...,129.99,129.99,punch punch home design architectura series 18,we've taken our bestselling benchmark home des...,135.99,135.99,punch punch 90500 home design studio mac
149,b000ndicqi,2618959948842073817,right_only,note: this is the upgrade version of adobe fla...,199,199.00,adobe adobe flash pro cs3 upgrade mac,system requirements 1ghz or faster powerpc g4 ...,239.99,239.99,adobe cs3 flash pro academic
150,b000ndicqi,18425990585802870010,right_only,note: this is the upgrade version of adobe fla...,199,199.00,adobe adobe flash pro cs3 upgrade mac,adobe flash cs3 professional software is the m...,167.87 gbp,167.87,adobe flash cs3 v9 0 pro win flash pro basic a...


In [30]:
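pd.set_option('display.max_colwidth', None)  # show full cell contents; -1 is deprecated
pd.set_option('display.max_rows', None)      # full option name; the short form can be ambiguous in recent pandas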
base_false_negatives

Unnamed: 0,idCompany1,idCompany2,Exist,descr1,price1,price_retreat_x,full data1,descr2,price2,price_retreat_y,full data2
0,b00021xhzw,18430621475529168165,left_only,upgrade only; installation of after effects standard new disk caching tools speed up your interactive work save any combination of animation parameters as presets -- create transformations masks expressions effects and text tighter integration with other adobe tools - import photoshop cs and illustrator cs files with preserved layers and other attributes output to firewire for easier previewing on ntsc and pal video monitors,499.99,499.99,adobe adobe effect pro 6 5 upgrade standard pro,adobe after effects pb 6.5 win upgrade.standard to pro v6.5 upgrademodel- adbcd36149wi vendor- adobe software features- after effects 6.5 pro for windows- upgrade version upgrade after effects standard to after effects pro 6.5. the essential tool ...,507,507.0,adobe 22070152 effect 6 5 pbupgrd
1,b00006hvvo,3409784122469217433,left_only,today enterprises and service providers face increasing security challenges in their distributed networks from security and virus attacks to enforcing security policies. as a distributed network grows and branches into multiple subnetworks linked by the internet so does the complexity of managing security appliances security policies and updates. a single flaw in security implementation at any point in the network can expose the entire infrastructure allowing malicious access to important data and files with severe consequences. managing security for distributed networks on a site-by-site basis is time consuming expensive and unreliable putting a big strain on already limited resources. sonicwall global management system (sonicwall gms) standard edition enables distributed enterprises and service providers to manage thousands of sonicwall internet security appliances from a central location with a powerful flexible and intuitive application. sonicwall gms delivers a cost-effective global security management solution that reduces staffing requirements speeds up deployment and lowers the cost of managing security services. sonicwall gms gives administrators the integrated tools to manage all security policies and services throughout a large-scale multiple policy enterprise or service provider environment. administrators can easily configure sonicwall firewall settings deploy vpn to any user and add sonicwall upgrade and subscription services such as gateway anti-virus anti-spyware and intrusion prevention service complete anti-virus and content filtering service with the click of a button.,0.0,0.0,sonic system upgrade sgms 1000 incremental node,today enterprises and service providers face increasing security challenges in their distributed networks from security and virus attacks to enforcing security policies. 
as a distributed network grows and branches into multiple subnetworks linked ...,62920.89,62920.89,sonicwall gm 1000n incremental lic upg
2,b00006hvvo,9687887699930800156,left_only,today enterprises and service providers face increasing security challenges in their distributed networks from security and virus attacks to enforcing security policies. as a distributed network grows and branches into multiple subnetworks linked by the internet so does the complexity of managing security appliances security policies and updates. a single flaw in security implementation at any point in the network can expose the entire infrastructure allowing malicious access to important data and files with severe consequences. managing security for distributed networks on a site-by-site basis is time consuming expensive and unreliable putting a big strain on already limited resources. sonicwall global management system (sonicwall gms) standard edition enables distributed enterprises and service providers to manage thousands of sonicwall internet security appliances from a central location with a powerful flexible and intuitive application. sonicwall gms delivers a cost-effective global security management solution that reduces staffing requirements speeds up deployment and lowers the cost of managing security services. sonicwall gms gives administrators the integrated tools to manage all security policies and services throughout a large-scale multiple policy enterprise or service provider environment. 
administrators can easily configure sonicwall firewall settings deploy vpn to any user and add sonicwall upgrade and subscription services such as gateway anti-virus anti-spyware and intrusion prevention service complete anti-virus and content filtering service with the click of a button.,0.0,0.0,sonic system upgrade sgms 1000 incremental node,sonicwall global management system (sonicwall gms) enables distributed enterprises and service providers to manage and monitor thousands of sonicwall internet security appliances from a central location with a powerful flexible and intuitive ...,63074.12,63074.12,sonicwall gm 1000 upgrade
3,b00002sac9,8401993747038150345,left_only,now featuring addition subtraction and counting to 30 the award-winning millie's math house has been enhanced to offer even more learning. in seven activities children explore numbers shapes sizes patterns addition and subtraction as they build mouse houses create wacky bugs count animated critters make jellybean cookies and answer math challenges posed by dorothy the duck. use the spanish language version for esl bilingual and spanish language instruction.,0.0,0.0,ibm aap misc part millie math house age 3 7 win mac,develop fundamental early math skills with millie and friends,9.9,9.9,millies math house
4,b00009apna,3435680088823178198,left_only,the complete and easy publishing solution for your home orbusinessproduct informationentrust your business and personal publications to the renowned quality of theprint shop. backed by two decades of industry leadership this versatile programis an essenti,69.99,69.99,broderbund printshop 20 pro publisher,overview access powerful images and design tools. the print shop is your creative toolbox filled with an infinite number of projects you can personalize for every occasion. design memorable cards calendars newsletters and more using a variety of ...,34.9,34.9,printshop 20 pro publisher
5,b0009jlux8,4923451909983851979,left_only,icopydvds2 pro edition contains all you need to copy a dvd movie quickly and easily. the interface is simple and direct just put in your movie press copy and icopydvds2 does the rest! our unique auto fit and auto burn technology creates the highest quality movie possible on a single disc every time!,69.99,69.99,copy dvd 2 pro edition,the 1 selling dvd backup solution - bar none! create perfect dvd movie copies. create backups of any dvd movie. burns dvd movies from dvd to dvd. new feature: now includes burn suite 2005 new feature:capture edit and author dvd video new feature: ...,27.9,27.9,copydvds 2 pro edition
6,b000cszg2m,5032391174084349054,left_only,media made easy vol. 3 is your front-row ticket to enjoying the best of today's entertainment! powerful and easy to use it contains 8 full programs from some of the biggest names in entertainment software: roxio pinnacle steinberg & more. create record edit remix capture and restore -- media made easy vol. 3 does it all.,49.95,49.95,gen x gen x medium made easy vol 3,this is the best collection of media based products on the market. create edit record remix restore capture burn present. media made easy does it all. powerful and easy to use media made easy contains 8 full programs from some of the biggest ...,24.9,24.9,medium made easy volume 3
7,b0002qnd2y,975032998818002648,left_only,customize your forms to your specific needs product information the complete solution for all of your business and personal form needs. selectfrom 1200 essential forms and letters customize the layout text and graphics or use the full suite of tools to create your own documents from scratch. a new simplified user interface quickly guides you though all of the highly advancedfeatures.&nbs,29.99,29.99,valusoft form workshop 1200,overview the complete solution for all of your business and personal form needs. select from 1200 essential forms and letters customize the layout text and graphics or use the full suite of tools to create your own documents from scratch. a new ...,12.9,12.9,form workshop 1200
8,b000cqyclu,1695089910833084773,left_only,combines exciting multimedia technology with great story-telling traditions that have brought bible stories to life for thousands of years. teaches kids reinforces and tests their general knowledge of the bible.,29.95,29.95,dorling kindersley multimedia dk first bible story,brings seven of the best-loved stories from the bible to life,9.9,9.9,first bible story jc
9,b000bezsyi,10911853136488554482,left_only,the fastest easiest way to design!product informationit's easy to bring your wildest design ideas to life with turbocad's powerfulvisualization tools.&nbsp; create professional 2d sketches and precisiondrawings that are sure to impress.&nbsp; the best par,39.99,39.99,imsi imsi turbocad designer 11,lets even the most novice computer user design like a pro. use the interactive tutorials and start creating impressive 2d sketches and precision technical drawings and more. whether your a beginner casual designer or a design enthusiast this ...,12.9,12.9,imsi turbocad designer 11
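
In rows 4 to 9 of the table above, the Company1 price is more than twice the Company2 price (for example 69.99 vs 34.9, or 29.99 vs 12.9). Assuming the matching loops compare these cleaned prices, the factor-2 price condition may be one reason these pairs were never scored. The sketch below isolates that condition in a helper so the ratio can be tuned. The name `prices_compatible` and the `max_ratio` parameter are hypothetical and do not appear in the notebook.

```python
def prices_compatible(price1, price2, max_ratio=2.0):
    """Replicate the notebook's price guard with a configurable ratio.

    A price of 0 is treated as missing, so the pair is always kept.
    """
    if price1 * price2 == 0:
        return True
    return max(price1, price2) / min(price1, price2) < max_ratio


# Prices taken from the false-negative table above.
print(prices_compatible(499.99, 507.0))              # row 0: ratio close to 1
print(prices_compatible(0.0, 62920.89))              # row 1: missing price on one side
print(prices_compatible(69.99, 34.9))                # row 4: ratio slightly above 2
print(prices_compatible(69.99, 34.9, max_ratio=3.0)) # row 4 with a looser ratio
```

Loosening `max_ratio` would let more true pairs through, at the cost of more comparisons and possibly more false positives, so any change should be checked against the precision and recall figures.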


In [None]:
['microsoft','adobe', 'apple']


In [22]:
pd.set_option("display.max_rows", None)
allwords

mac                           529
0                             504
1                             467
win                           423
pro                           421
encore                        378
microsoft                     373
adobe                         365
2                             347
3                             298
complete                      293
user                          290
pc                            280
5                             275
window                        271
edition                       268
2007                          253
upgrade                       249
license                       220
xp                            193
deluxe                        191
4                             191
cd                            189
medium                        188
dvd                           187
package                       166
apple                         161
punch                         160
cs3                           157
suite         