# INFOMDWR – Assignment 2: Data Integration & Preparation

## Introduction
In this assignment, you will work on different tasks related to data preparation, including data profiling, finding records that refer to the same entity, and computing the correlation between different attributes. The datasets that you need to work on are available online (links are provided).

In [None]:
# Import the necessary libraries

import re
import time
import pandas as pd
import numpy as np

from typing import Any
from py_stringmatching.similarity_measure.levenshtein import Levenshtein
from py_stringmatching.similarity_measure.jaro import Jaro
from py_stringmatching.similarity_measure.affine import Affine
from itertools import combinations

In [4]:
# Load the csv into a dataframe and print a part of the dataframe

df = pd.read_csv("dft-road-casualty-statistics-collision-2024.csv", low_memory = False)
df[:10]

Unnamed: 0,collision_index,collision_year,collision_ref_no,location_easting_osgr,location_northing_osgr,longitude,latitude,police_force,collision_severity,number_of_vehicles,...,carriageway_hazards_historic,carriageway_hazards,urban_or_rural_area,did_police_officer_attend_scene_of_accident,trunk_road_flag,lsoa_of_accident_location,enhanced_severity_collision,collision_injury_based,collision_adjusted_severity_serious,collision_adjusted_severity_slight
0,202417H103224,2024,17H103224,448894,532505,-1.24312,54.68523,17,3,2,...,-1,0,1,2,2,E01011983,3,1,0.0,1.0
1,202417M217924,2024,17M217924,452135,519436,-1.19517,54.56747,17,2,2,...,-1,0,1,3,2,E01012061,7,1,1.0,0.0
2,202417S204524,2024,17S204524,445427,522924,-1.29837,54.59946,17,3,2,...,0,0,2,1,2,E01012280,-1,0,0.111621,0.888379
3,2024481510889,2024,481510889,533587,181174,-0.07626,51.51371,48,2,1,...,-1,0,1,1,2,E01000005,7,1,1.0,0.0
4,2024481563500,2024,481563500,532676,180902,-0.08948,51.51148,48,2,1,...,-1,0,1,1,2,E01032739,5,1,1.0,0.0
5,202417S114624,2024,17S114624,443610,520960,-1.32678,54.58197,17,2,1,...,-1,0,1,1,2,E01012271,7,1,1.0,0.0
6,2024481561505,2024,481561505,531722,181452,-0.10302,51.51664,48,3,2,...,-1,0,1,1,2,E01032740,3,1,0.0,1.0
7,202417S203724,2024,17S203724,446255,525635,-1.28513,54.62375,17,3,2,...,0,0,1,1,2,E01012243,-1,0,0.127317,0.872683
8,202417S303424,2024,17S303424,443934,514917,-1.32267,54.52764,17,3,1,...,0,0,2,1,2,E01033475,-1,0,0.170336,0.829664
9,202417M109924,2024,17M109924,449627,517955,-1.23421,54.55441,17,3,2,...,-1,0,1,3,2,E01033469,3,1,0.0,1.0


## Task 1: Profiling relational data

For this task, download and read the paper about profiling relational data, select a set of summary statistics about the data (minimum of 10 different values) and write Python code to compute these quantities for a dataset of your choice. Preferably, you can use one of the csv files from the road safety dataset. Explain the importance of each summary statistic that you selected in understanding the characteristics of the dataset.

*Note: Computing the same statistical quantity on multiple columns of the dataset will be counted only once.*

In [5]:
# Set up a dictionary to hold the summary statistics

summary_statistics: dict[str, Any] = {
    "mean": None,
    "median": None,
    "mode": None,
    "minimum": None,
    "maximum": None,
    "range": None,
    "variance": None,
    "standard_deviation": None,
    "records": None,
    "number_of_missing_records": None,
}

In [6]:
# Fill the dictionary for summary statistics

summary_statistics["mean"] = df["speed_limit"].mean()                                           # Mean of the speed limit
summary_statistics["median"] = df["speed_limit"].median()                                       # Median of the speed limit 
summary_statistics["mode"] = df["speed_limit"].mode()                                           # Mode of the speed limit
summary_statistics["minimum"] = df["speed_limit"].min()                                         # Min of the speed limit
summary_statistics["maximum"] = df["speed_limit"].max()                                         # Max of the speed limit
summary_statistics["range"] = df["speed_limit"].max() - df["speed_limit"].min()                 # Range of the speed limit: max - min
summary_statistics["variance"] = df["speed_limit"].var()                                        # Variance of the speed limit
summary_statistics["standard_deviation"] = df["speed_limit"].std()                              # Standard deviation of the speed limit
summary_statistics["records"] = len(df["speed_limit"])                                          # Number of rows in the dataframe
summary_statistics["number_of_missing_records"] = df.isnull().sum().sum()                       # Number of missing values in the entire dataframe

summary_statistics

{'mean': 35.87966550080751,
 'median': 30.0,
 'mode': 0    30
 Name: speed_limit, dtype: int64,
 'minimum': -1,
 'maximum': 70,
 'range': 71,
 'variance': 211.30421831349014,
 'standard_deviation': 14.536306900774012,
 'records': 100927,
 'number_of_missing_records': 3}
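
The minimum of -1 above suggests that `speed_limit` uses -1 as a placeholder rather than a real limit. The sketch below assumes that -1 encodes "data missing" (an assumption about the dataset's coding that should be checked against its data guide) and masks that value before computing the statistics. The helper `profile_numeric` and the toy values are hypothetical.

```python
import pandas as pd

def profile_numeric(values: pd.Series, sentinel: int = -1) -> dict:
    """Summary statistics with a sentinel value treated as missing.

    `sentinel=-1` is an assumption about how the dataset marks missing data.
    """
    # Replace every occurrence of the sentinel by NaN so it is skipped below.
    cleaned = values.astype(float).mask(values == sentinel)
    return {
        "mean": cleaned.mean(),
        "median": cleaned.median(),
        "minimum": cleaned.min(),
        "maximum": cleaned.max(),
        "range": cleaned.max() - cleaned.min(),
        "missing": int(cleaned.isna().sum()),
    }

toy = pd.Series([30, 30, -1, 60, 70, 20])  # hypothetical speed limits
stats = profile_numeric(toy)
print(stats)
```

With the sentinel masked, the mean, minimum and range describe real speed limits only, and the placeholder is counted as a missing value instead.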

## Task 2: Entity resolution

### Part 1:

Write Python code to compare every single record in the dataset (ACM.csv) with all the records in (DBLP2.csv) and find the similar records (records that represent the same publication). To compare two records, follow these steps:

In [7]:
# Import the datasets

df_acm = pd.read_csv("ACM.csv")
df_dblp2 = pd.read_csv("DBLP2.csv", encoding="latin1")


In [9]:
# Show the dataframe of ACM

df_acm[:10]

Unnamed: 0,id,title,authors,venue,year
0,304586,The WASA2 object-oriented workflow management ...,"Gottfried Vossen, Mathias Weske",International Conference on Management of Data,1999
1,304587,A user-centered interface for querying distrib...,"Isabel F. Cruz, Kimberly M. James",International Conference on Management of Data,1999
2,304589,"World Wide Database-integrating the Web, CORBA...","Athman Bouguettaya, Boualem Benatallah, Lily H...",International Conference on Management of Data,1999
3,304590,XML-based information mediation with MIX,"Chaitan Baru, Amarnath Gupta, Bertram Lud&#228...",International Conference on Management of Data,1999
4,304582,The CCUBE constraint object-oriented database ...,"Alexander Brodsky, Victor E. Segal, Jia Chen, ...",International Conference on Management of Data,1999
5,304583,The Cornell Jaguar project: adding mobility to...,"Phillippe Bonnet, Kyle Buza, Zhiyuan Chan, Vic...",International Conference on Management of Data,1999
6,304584,The active MultiSync controller of the cubetre...,"Nick Roussopoulos, Yannis Kotidis, Yannis Sism...",International Conference on Management of Data,1999
7,304585,The Jungle database search engine,"Michael B&#246;hlen, Linas Bukauskas, Curtis D...",International Conference on Management of Data,1999
8,306112,ADEPT: an agent-based approach to business pro...,"N. R. Jennings, T. J. Norman, P. Faratin",ACM SIGMOD Record,1998
9,306115,A componentized architecture for dynamic elect...,"Benny Reich, Israel Ben-Shaul",ACM SIGMOD Record,1998


In [10]:
# Show the dataframe of dblp2

df_dblp2[:10]

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and it...,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996
2,conf/vldb/PalpanasSCP02,Incremental Maintenance for Non-Distributive A...,"Themistoklis Palpanas, Richard Sidle, Hamid Pi...",VLDB,2002
3,conf/vldb/GardarinGT96,Cost-based Selection of Path Expression Proces...,"Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...",VLDB,1996
4,conf/vldb/HoelS95,Benchmarking Spatial Join Operations with Spat...,"Erik G. Hoel, Hanan Samet",VLDB,1995
5,conf/sigmod/Keim99,Efficient Geometry-based Similarity Search of ...,Daniel A. Keim,SIGMOD Conference,1999
6,journals/sigmod/Ouksel02,Mining the World Wide Web: An Information Sear...,Aris M. Ouksel,SIGMOD Record,2002
7,journals/vldb/Seshadri98,Enhanced Abstract Data Types in Object-Relatio...,Praveen Seshadri,VLDB J.,1998
8,journals/sigmod/RamamrithamS97,Report on DART '96: Databases: Active and Real...,"Nandit Soparkar, Krithi Ramamritham",SIGMOD Record,1997
9,journals/sigmod/DAndreaJ96,UniSQL's Next-Generation Object-Relational Dat...,"Phil Janus, Albert D'Andrea",SIGMOD Record,1996


## Part 1

### A. 

Ignore the pub_id

In [11]:
df_acm.drop("id", axis=1, inplace = True)

In [12]:
df_dblp2.drop("id", axis=1, inplace = True)

### B. 

Change all alphabetical characters into lowercase.

In [13]:
for col in df_acm.columns:
    df_acm[col] = df_acm[col].astype(str).str.lower()
for col in df_dblp2.columns:
    df_dblp2[col] = df_dblp2[col].astype(str).str.lower()

In [14]:
df_acm[:10]

Unnamed: 0,title,authors,venue,year
0,the wasa2 object-oriented workflow management ...,"gottfried vossen, mathias weske",international conference on management of data,1999
1,a user-centered interface for querying distrib...,"isabel f. cruz, kimberly m. james",international conference on management of data,1999
2,"world wide database-integrating the web, corba...","athman bouguettaya, boualem benatallah, lily h...",international conference on management of data,1999
3,xml-based information mediation with mix,"chaitan baru, amarnath gupta, bertram lud&#228...",international conference on management of data,1999
4,the ccube constraint object-oriented database ...,"alexander brodsky, victor e. segal, jia chen, ...",international conference on management of data,1999
5,the cornell jaguar project: adding mobility to...,"phillippe bonnet, kyle buza, zhiyuan chan, vic...",international conference on management of data,1999
6,the active multisync controller of the cubetre...,"nick roussopoulos, yannis kotidis, yannis sism...",international conference on management of data,1999
7,the jungle database search engine,"michael b&#246;hlen, linas bukauskas, curtis d...",international conference on management of data,1999
8,adept: an agent-based approach to business pro...,"n. r. jennings, t. j. norman, p. faratin",acm sigmod record,1998
9,a componentized architecture for dynamic elect...,"benny reich, israel ben-shaul",acm sigmod record,1998


In [15]:
df_dblp2[:10]

Unnamed: 0,title,authors,venue,year
0,semantic integration of environmental models f...,d. scott mackay,sigmod record,1999
1,estimation of query-result distribution and it...,"viswanath poosala, yannis e. ioannidis",vldb,1996
2,incremental maintenance for non-distributive a...,"themistoklis palpanas, richard sidle, hamid pi...",vldb,2002
3,cost-based selection of path expression proces...,"zhao-hui tang, georges gardarin, jean-robert g...",vldb,1996
4,benchmarking spatial join operations with spat...,"erik g. hoel, hanan samet",vldb,1995
5,efficient geometry-based similarity search of ...,daniel a. keim,sigmod conference,1999
6,mining the world wide web: an information sear...,aris m. ouksel,sigmod record,2002
7,enhanced abstract data types in object-relatio...,praveen seshadri,vldb j.,1998
8,report on dart '96: databases: active and real...,"nandit soparkar, krithi ramamritham",sigmod record,1997
9,unisql's next-generation object-relational dat...,"phil janus, albert d'andrea",sigmod record,1996


### C. 

Convert multiple spaces to one

In [16]:
for col in df_acm.columns:
    df_acm[col] = df_acm[col].str.replace(r'\s+', ' ', regex=True)

df_acm[:10]

Unnamed: 0,title,authors,venue,year
0,the wasa2 object-oriented workflow management ...,"gottfried vossen, mathias weske",international conference on management of data,1999
1,a user-centered interface for querying distrib...,"isabel f. cruz, kimberly m. james",international conference on management of data,1999
2,"world wide database-integrating the web, corba...","athman bouguettaya, boualem benatallah, lily h...",international conference on management of data,1999
3,xml-based information mediation with mix,"chaitan baru, amarnath gupta, bertram lud&#228...",international conference on management of data,1999
4,the ccube constraint object-oriented database ...,"alexander brodsky, victor e. segal, jia chen, ...",international conference on management of data,1999
5,the cornell jaguar project: adding mobility to...,"phillippe bonnet, kyle buza, zhiyuan chan, vic...",international conference on management of data,1999
6,the active multisync controller of the cubetre...,"nick roussopoulos, yannis kotidis, yannis sism...",international conference on management of data,1999
7,the jungle database search engine,"michael b&#246;hlen, linas bukauskas, curtis d...",international conference on management of data,1999
8,adept: an agent-based approach to business pro...,"n. r. jennings, t. j. norman, p. faratin",acm sigmod record,1998
9,a componentized architecture for dynamic elect...,"benny reich, israel ben-shaul",acm sigmod record,1998


In [17]:
for col in df_dblp2.columns:
    df_dblp2[col] = df_dblp2[col].str.replace(r'\s+', ' ', regex=True)

df_dblp2[:10]

Unnamed: 0,title,authors,venue,year
0,semantic integration of environmental models f...,d. scott mackay,sigmod record,1999
1,estimation of query-result distribution and it...,"viswanath poosala, yannis e. ioannidis",vldb,1996
2,incremental maintenance for non-distributive a...,"themistoklis palpanas, richard sidle, hamid pi...",vldb,2002
3,cost-based selection of path expression proces...,"zhao-hui tang, georges gardarin, jean-robert g...",vldb,1996
4,benchmarking spatial join operations with spat...,"erik g. hoel, hanan samet",vldb,1995
5,efficient geometry-based similarity search of ...,daniel a. keim,sigmod conference,1999
6,mining the world wide web: an information sear...,aris m. ouksel,sigmod record,2002
7,enhanced abstract data types in object-relatio...,praveen seshadri,vldb j.,1998
8,report on dart '96: databases: active and real...,"nandit soparkar, krithi ramamritham",sigmod record,1997
9,unisql's next-generation object-relational dat...,"phil janus, albert d'andrea",sigmod record,1996


### D.

Use Levenshtein similarity $(L_{sim}(S_1, S_2) = 1 - \frac{MED(S_1, S_2)}{MAX(|S_1|, |S_2|)})$
for comparing the values in the **title** attribute and compute the score $(s_t)$. (MED refers to the minimum edit distance and $|S_i|$ is the number of characters in string $S_i$).


In [18]:
# Levenshtein similarity for title comparison
def levenshtein_similarity(s1, s2):
    lev = Levenshtein()
    distance = lev.get_raw_score(s1, s2)
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 1.0
    return 1 - (distance / max_len)

# Compare titles from ACM and DBLP datasets
title_similarity = {}
for i, title1 in enumerate(df_acm["title"]):
    for j, title2 in enumerate(df_dblp2["title"]):
        similarity = levenshtein_similarity(title1, title2)
        title_similarity[(i, j)] = similarity

# Find the pair with highest similarity
max_title_sim = max(title_similarity, key=title_similarity.get)
print(f"Highest title similarity: {title_similarity[max_title_sim]}")
print(f"ACM title: {df_acm.iloc[max_title_sim[0]]['title']}")
print(f"DBLP title: {df_dblp2.iloc[max_title_sim[1]]['title']}")


Highest title similarity: 1.0
ACM title: the wasa2 object-oriented workflow management system
DBLP title: the wasa2 object-oriented workflow management system
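
To make the formula concrete, here is a minimal pure-Python sketch of the minimum edit distance and the similarity derived from it. It is an illustration only, not the `py_stringmatching` implementation used above; the function names `edit_distance` and `lev_sim` are hypothetical.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum edit distance (insert, delete, substitute; each costs 1)."""
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def lev_sim(s1: str, s2: str) -> float:
    """1 - MED(s1, s2) / max(|s1|, |s2|); two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - edit_distance(s1, s2) / longest

print(edit_distance("kitten", "sitting"))  # 3 edits
print(lev_sim("kitten", "sitting"))        # 1 - 3/7
```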


### E. 

Use Jaro similarity to compare the values in the **authors** field and compute $(S_a)$.

In [19]:
# Jaro similarity for authors comparison
jaro = Jaro()

# Compare authors from ACM and DBLP datasets
authors_similarity = {}
for i, author1 in enumerate(df_acm["authors"]):
    for j, author2 in enumerate(df_dblp2["authors"]):
        similarity = jaro.get_sim_score(author1, author2)
        authors_similarity[(i, j)] = similarity

# Find the pair with highest similarity
max_authors_sim = max(authors_similarity, key=authors_similarity.get)
print(f"Highest authors similarity: {authors_similarity[max_authors_sim]}")
print(f"ACM author: {df_acm.iloc[max_authors_sim[0]]['authors']}")
print(f"DBLP author: {df_dblp2.iloc[max_authors_sim[1]]['authors']}")


Highest authors similarity: 1.0
ACM author: martin bichler, arie segev, j. leon zhao
DBLP author: martin bichler, arie segev, j. leon zhao


### F.

Use a modified version of the affine similarity that is scaled to the interval [0, 1] for the venue attribute $(S_c)$.

In [21]:
affine = Affine()

venue_similarity = {}
raw_scores = []

for i, venue1 in enumerate(df_acm["venue"]):
    for j, venue2 in enumerate(df_dblp2["venue"]):
        similarity = affine.get_raw_score(str(venue1), str(venue2))
        raw_scores.append(similarity)
        venue_similarity[(i, j)] = similarity

# Normalize to [0, 1]
min_score = min(raw_scores)
max_score = max(raw_scores)
for key in venue_similarity:
    if max_score != min_score:
        venue_similarity[key] = (venue_similarity[key] - min_score) / (max_score - min_score)
    else:
        venue_similarity[key] = 1.0  # if all scores are equal

# Get max similarity
max_venue_sim = max(venue_similarity, key=venue_similarity.get)
print(f"Highest venue similarity: {venue_similarity[max_venue_sim]}")
print(f"ACM venue: {df_acm.iloc[max_venue_sim[0]]['venue']}")
print(f"DBLP venue: {df_dblp2.iloc[max_venue_sim[1]]['venue']}")


Highest venue similarity: 1.0
ACM venue: acm transactions on database systems (tods) 
DBLP venue: acm trans. database syst.


### G.

Use Match (1) / Mismatch (0) for the year $(S_y)$.

In [22]:
# Match/Mismatch for year comparison
def year_similarity(year1, year2):
    """Return 1 if years match exactly, 0 otherwise"""
    return 1.0 if year1 == year2 else 0.0

# Compare years from ACM and DBLP datasets
year_similarity_dict = {}
for i, year1 in enumerate(df_acm["year"]):
    for j, year2 in enumerate(df_dblp2["year"]):
        similarity = year_similarity(year1, year2)
        year_similarity_dict[(i, j)] = similarity

# Count exact matches
exact_matches = sum(1 for sim in year_similarity_dict.values() if sim == 1.0)
total_comparisons = len(year_similarity_dict)
print(f"Exact year matches: {exact_matches} out of {total_comparisons} comparisons")
print(f"Match rate: {exact_matches/total_comparisons}")


Exact year matches: 601284 out of 6001104 comparisons
Match rate: 0.10019556401622101


### H.

Use the formula $\text{rec\_sim} = w_1 * s_1 + w_2 * s_2 + w_3 * s_3 + w_4 * s_4$ to combine the scores and compute the final score, where $\sum^4_{i=1}w_i=1$.

In [None]:
# Combine scores using weighted formula: rec_sim = w1*s1 + w2*s2 + w3*s3 + w4*s4
def combine_scores(s_t, s_a, s_c, s_y, w1=0.3, w2=0.3, w3=0.2, w4=0.2):
    return w1 * s_t + w2 * s_a + w3 * s_c + w4 * s_y

# Calculate combined similarity for all pairs
combined_similarity = {}
for i in range(len(df_acm)):
    for j in range(len(df_dblp2)):
        s_t = title_similarity.get((i, j), 0)
        s_a = authors_similarity.get((i, j), 0)
        s_c = venue_similarity.get((i, j), 0)
        s_y = year_similarity_dict.get((i, j), 0)
        
        # Combine scores
        combined_score = combine_scores(s_t, s_a, s_c, s_y)
        combined_similarity[(i, j)] = combined_score

# Find the pair with highest combined similarity
max_combined = max(combined_similarity, key=combined_similarity.get)
print(f"Highest combined similarity: {combined_similarity[max_combined]}")
print(f"ACM record: {df_acm.iloc[max_combined[0]]['title']}")
print(f"DBLP record: {df_dblp2.iloc[max_combined[1]]['title']}")


Highest combined similarity: 1.0
ACM record: security of random data perturbation methods
DBLP record: security of random data perturbation methods
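
The same weighted sum can also be expressed on score matrices instead of dictionaries keyed by `(i, j)`. The sketch below assumes each similarity has been stored as a NumPy array with one row per left-hand record and one column per right-hand record; `combine_score_matrices` and the toy values are hypothetical.

```python
import numpy as np

def combine_score_matrices(score_matrices, weights):
    """Weighted sum of equally shaped score matrices; weights must sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    stacked = np.stack(score_matrices)  # shape: (n_scores, n_left, n_right)
    # Contract the weight vector against the first axis of the stack.
    return np.tensordot(weights, stacked, axes=1)

# Toy 2x2 example: rows are left-hand records, columns right-hand records.
s_t = np.array([[1.0, 0.2], [0.1, 0.9]])  # title scores
s_a = np.array([[0.9, 0.3], [0.2, 1.0]])  # author scores
s_c = np.array([[0.5, 0.5], [0.5, 0.5]])  # venue scores
s_y = np.array([[1.0, 0.0], [0.0, 1.0]])  # year match / mismatch

rec_sim = combine_score_matrices([s_t, s_a, s_c, s_y], [0.3, 0.3, 0.2, 0.2])
duplicates = np.argwhere(rec_sim > 0.7)  # index pairs above the threshold
print(rec_sim)
print(duplicates)
```

For the pair `(0, 0)` this gives 0.3 * 1.0 + 0.3 * 0.9 + 0.2 * 0.5 + 0.2 * 1.0 = 0.87, which is above the 0.7 threshold.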


### I.

Report the records with rec_sim > 0.7 as duplicate records by storing the ids of both records in a list

In [24]:
# Report records with rec_sim > 0.7 as duplicate records
def report_duplicates(threshold=0.7):
    duplicates = []
    for (i, j), score in combined_similarity.items():
        if score > threshold:
            duplicates.append((i, j, score))
    
    # Sort by similarity score (descending)
    duplicates.sort(key=lambda x: x[2], reverse=True)
    
    print(f"Found {len(duplicates)} duplicate pairs with similarity > {threshold}")

    return duplicates

# Report duplicates
duplicate_pairs = report_duplicates()


Found 2602 duplicate pairs with similarity > 0.7


### J.

In the table `DBLP-ACM_perfectMapping.csv`, you can find the actual mappings (the ids of the correct duplicate records). Compute the precision of this method by counting the number of duplicate records that you discovered correctly. That is, among all the similar records reported by your method, how many pairs exist in the file `DBLP-ACM_perfectMapping.csv`?

In [None]:
# Reload the original datasets to get the correct IDs
df_acm_original = pd.read_csv("ACM.csv")
df_dblp2_original = pd.read_csv("DBLP2.csv", encoding="latin1")

# Load the perfect mapping file
perfect_mapping = pd.read_csv("DBLP-ACM_perfectMapping.csv")

perfect_pairs = set()
for _, row in perfect_mapping.iterrows():
    perfect_pairs.add((row['idACM'], row['idDBLP']))

# Calculate precision
def calculate_precision(reported_duplicates, perfect_pairs, df_acm_original, df_dblp2_original):
    """Calculate precision: TP / (TP + FP)"""
    true_positives = 0
    false_positives = 0
    
    for acm_idx, dblp_idx, score in reported_duplicates:
        # Get the actual IDs from the original datasets
        acm_id = df_acm_original.iloc[acm_idx]['id']
        dblp_id = df_dblp2_original.iloc[dblp_idx]['id']
        
        if (acm_id, dblp_id) in perfect_pairs:
            true_positives += 1
        else:
            false_positives += 1
    
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    return precision, true_positives, false_positives

# Calculate precision
precision, tp, fp = calculate_precision(duplicate_pairs, perfect_pairs, df_acm_original, df_dblp2_original)
print(f"Precision: {precision}")
print(f"True Positives: {tp}")
print(f"False Positives: {fp}")
print(f"Total reported duplicates: {len(duplicate_pairs)}")

Precision: 0.8439661798616449
True Positives: 2196
False Positives: 406
Total reported duplicates: 2602
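
Precision only describes the reported pairs; it does not show how many true matches were missed. A minimal sketch, on hypothetical toy ids, that computes precision together with recall using set intersection (the helper name `precision_recall` is an assumption, not part of the assignment):

```python
def precision_recall(reported, truth):
    """Precision = TP / |reported|, recall = TP / |truth|."""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical (ACM id, DBLP id) pairs.
reported_pairs = [("a1", "d1"), ("a2", "d2"), ("a3", "d9")]
true_pairs = [("a1", "d1"), ("a2", "d2"), ("a4", "d4"), ("a5", "d5")]

p, r = precision_recall(reported_pairs, true_pairs)
print(f"precision={p:.3f}, recall={r:.3f}")
```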


### K.

Record the running time of the method. You can observe that the program takes a long time to get the results. What can you do to reduce the running time? (Just provide clear discussion – no need for implementing the ideas.)

In [30]:
start_time = time.time()

# Run all the code of this part to see what the total execution time is.

df_acm = pd.read_csv("ACM.csv")
df_dblp2 = pd.read_csv("DBLP2.csv", encoding="latin1")
df_acm.drop("id", axis=1, inplace = True)
df_dblp2.drop("id", axis=1, inplace = True)
for col in df_acm.columns:
    df_acm[col] = df_acm[col].astype(str).str.lower()
for col in df_dblp2.columns:
    df_dblp2[col] = df_dblp2[col].astype(str).str.lower()
for col in df_acm.columns:
    df_acm[col] = df_acm[col].str.replace(r'\s+', ' ', regex=True)
for col in df_dblp2.columns:
    df_dblp2[col] = df_dblp2[col].str.replace(r'\s+', ' ', regex=True)

def levenshtein_similarity(s1, s2):
    lev = Levenshtein()
    distance = lev.get_raw_score(s1, s2)
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 1.0
    return 1 - (distance / max_len)

title_similarity = {}
for i, title1 in enumerate(df_acm["title"]):
    for j, title2 in enumerate(df_dblp2["title"]):
        similarity = levenshtein_similarity(title1, title2)
        title_similarity[(i, j)] = similarity

max_title_sim = max(title_similarity, key=title_similarity.get)

jaro = Jaro()


authors_similarity = {}
for i, author1 in enumerate(df_acm["authors"]):
    for j, author2 in enumerate(df_dblp2["authors"]):
        similarity = jaro.get_sim_score(author1, author2)
        authors_similarity[(i, j)] = similarity


max_authors_sim = max(authors_similarity, key=authors_similarity.get)
affine = Affine()

venue_similarity = {}
raw_scores = []

for i, venue1 in enumerate(df_acm["venue"]):
    for j, venue2 in enumerate(df_dblp2["venue"]):
        similarity = affine.get_raw_score(str(venue1), str(venue2))
        raw_scores.append(similarity)
        venue_similarity[(i, j)] = similarity

min_score = min(raw_scores)
max_score = max(raw_scores)
for key in venue_similarity:
    if max_score != min_score:
        venue_similarity[key] = (venue_similarity[key] - min_score) / (max_score - min_score)
    else:
        venue_similarity[key] = 1.0 


max_venue_sim = max(venue_similarity, key=venue_similarity.get)
def year_similarity(year1, year2):
    """Return 1 if years match exactly, 0 otherwise"""
    return 1.0 if year1 == year2 else 0.0

year_similarity_dict = {}
for i, year1 in enumerate(df_acm["year"]):
    for j, year2 in enumerate(df_dblp2["year"]):
        similarity = year_similarity(year1, year2)
        year_similarity_dict[(i, j)] = similarity

exact_matches = sum(1 for sim in year_similarity_dict.values() if sim == 1.0)
total_comparisons = len(year_similarity_dict)
def combine_scores(s_t, s_a, s_c, s_y, w1=0.3, w2=0.3, w3=0.2, w4=0.2):
    return w1 * s_t + w2 * s_a + w3 * s_c + w4 * s_y

combined_similarity = {}
for i in range(len(df_acm)):
    for j in range(len(df_dblp2)):
        s_t = title_similarity.get((i, j), 0)
        s_a = authors_similarity.get((i, j), 0)
        s_c = venue_similarity.get((i, j), 0)
        s_y = year_similarity_dict.get((i, j), 0)
        
        combined_score = combine_scores(s_t, s_a, s_c, s_y)
        combined_similarity[(i, j)] = combined_score

max_combined = max(combined_similarity, key=combined_similarity.get)
def report_duplicates(threshold=0.7):
    duplicates = []
    for (i, j), score in combined_similarity.items():
        if score > threshold:
            duplicates.append((i, j, score))
    
    duplicates.sort(key=lambda x: x[2], reverse=True)

    return duplicates

duplicate_pairs = report_duplicates()

df_acm_original = pd.read_csv("ACM.csv")
df_dblp2_original = pd.read_csv("DBLP2.csv", encoding="latin1")

perfect_mapping = pd.read_csv("DBLP-ACM_perfectMapping.csv")

perfect_pairs = set()
for _, row in perfect_mapping.iterrows():
    perfect_pairs.add((row['idACM'], row['idDBLP']))

def calculate_precision(reported_duplicates, perfect_pairs, df_acm_original, df_dblp2_original):
    """Calculate precision: TP / (TP + FP)"""
    true_positives = 0
    false_positives = 0
    
    for acm_idx, dblp_idx, score in reported_duplicates:
        acm_id = df_acm_original.iloc[acm_idx]['id']
        dblp_id = df_dblp2_original.iloc[dblp_idx]['id']
        
        if (acm_id, dblp_id) in perfect_pairs:
            true_positives += 1
        else:
            false_positives += 1
    
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    return precision, true_positives, false_positives

precision, tp, fp = calculate_precision(duplicate_pairs, perfect_pairs, df_acm_original, df_dblp2_original)


# End of all the code.

end_time = time.time()
print(f"Total time taken: {end_time - start_time}")

Total time taken: 1324.6808943748474


### Total Time Taken:

1324.68 s / 60 ≈ 22 minutes
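
One possible way to reduce the running time, sketched under the assumption that some columns (venue and year, for example) contain only a small number of distinct values: score each distinct value pair once and reuse the result for every record pair. This does not change any score. The helper `pairwise_scores_cached` and the toy data are hypothetical.

```python
from itertools import product

def pairwise_scores_cached(left, right, sim):
    """Score every (i, j) pair while calling `sim` once per distinct value pair."""
    cache = {(a, b): sim(a, b) for a, b in product(set(left), set(right))}
    return {
        (i, j): cache[(a, b)]
        for i, a in enumerate(left)
        for j, b in enumerate(right)
    }

call_log = []  # records every call to the similarity function

def exact_match(a, b):
    call_log.append((a, b))
    return 1.0 if a == b else 0.0

left = ["vldb", "sigmod", "vldb", "vldb"]
right = ["vldb", "vldb", "sigmod"]
scores = pairwise_scores_cached(left, right, exact_match)
print(len(scores), len(call_log))  # 12 record pairs, 4 similarity calls
```

Other options, such as blocking on an attribute so that only records in the same block are compared, reduce the work further but can miss true matches that fall into different blocks.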

## Part 2

### Step 0: Load the data

In [31]:
# load dataset
acm_data = pd.read_csv("ACM.csv")
dblp_data = pd.read_csv("DBLP2.csv", encoding = "latin1")
dblp_data[:10]

Unnamed: 0,id,title,authors,venue,year
0,journals/sigmod/Mackay99,Semantic Integration of Environmental Models f...,D. Scott Mackay,SIGMOD Record,1999
1,conf/vldb/PoosalaI96,Estimation of Query-Result Distribution and it...,"Viswanath Poosala, Yannis E. Ioannidis",VLDB,1996
2,conf/vldb/PalpanasSCP02,Incremental Maintenance for Non-Distributive A...,"Themistoklis Palpanas, Richard Sidle, Hamid Pi...",VLDB,2002
3,conf/vldb/GardarinGT96,Cost-based Selection of Path Expression Proces...,"Zhao-Hui Tang, Georges Gardarin, Jean-Robert G...",VLDB,1996
4,conf/vldb/HoelS95,Benchmarking Spatial Join Operations with Spat...,"Erik G. Hoel, Hanan Samet",VLDB,1995
5,conf/sigmod/Keim99,Efficient Geometry-based Similarity Search of ...,Daniel A. Keim,SIGMOD Conference,1999
6,journals/sigmod/Ouksel02,Mining the World Wide Web: An Information Sear...,Aris M. Ouksel,SIGMOD Record,2002
7,journals/vldb/Seshadri98,Enhanced Abstract Data Types in Object-Relatio...,Praveen Seshadri,VLDB J.,1998
8,journals/sigmod/RamamrithamS97,Report on DART '96: Databases: Active and Real...,"Nandit Soparkar, Krithi Ramamritham",SIGMOD Record,1997
9,journals/sigmod/DAndreaJ96,UniSQL's Next-Generation Object-Relational Dat...,"Phil Janus, Albert D'Andrea",SIGMOD Record,1996


## Steps 1, 2 & 3: Concatenate the values in each record into one single string, change all alphabetical characters into lowercase and convert multiple spaces to one.

This was done by iterating over the rows, joining all column values of a row into one long string, converting it to lowercase and using `re.sub` to collapse multiple spaces into one.

In [32]:
# 1: concatenate values into a single string
def concat_values(df):
  """
  Takes a dataframe and returns a list of concatenated strings. Goes row by row
  to convert all column values of that row into one long string.
  """
  concat_strings = []
  for idx, row in df.iterrows():
    # 2: change alphabetical values into lowercase
    string = ' '.join(row.astype(str)).lower()
    # 3: collapse multiple spaces into one using the re library
    string = re.sub(r"\s+", " ", string).strip()
    concat_strings.append(string)

  return concat_strings

acm_concat = concat_values(acm_data)
dblp_concat = concat_values(dblp_data)

dblp_concat[:3]

['journals/sigmod/mackay99 semantic integration of environmental models for application to global information systems and decision-making d. scott mackay sigmod record 1999',
 'conf/vldb/poosalai96 estimation of query-result distribution and its application in parallel-join load balancing viswanath poosala, yannis e. ioannidis vldb 1996',
 'conf/vldb/palpanasscp02 incremental maintenance for non-distributive aggregate functions themistoklis palpanas, richard sidle, hamid pirahesh, roberta cochrane vldb 2002']
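
The same three steps can also be written with pandas string methods instead of an explicit row loop. This is an alternative sketch rather than the code used above; `concat_values_vectorized` and the toy frame are hypothetical.

```python
import pandas as pd

def concat_values_vectorized(df: pd.DataFrame) -> list:
    """Join each row into one string, lowercase it and collapse whitespace."""
    joined = df.astype(str).apply(" ".join, axis=1)  # step 1: one string per row
    return (
        joined.str.lower()                             # step 2: lowercase
              .str.replace(r"\s+", " ", regex=True)    # step 3: single spaces
              .str.strip()
              .tolist()
    )

toy = pd.DataFrame({
    "title": ["  Query   Optimization ", "XML Mediation"],
    "venue": ["VLDB", "SIGMOD  Record"],
    "year": [1996, 1999],
})
result = concat_values_vectorized(toy)
print(result)
```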

## Step 4 & 5: Combine the records from both tables into one big list as we did during the lab, use the functions in the tutorials from lab 5 to compute the shingles, the minhash signature and the similarity.

Here we combined the records from both tables with a simple '+' operation. Then we defined the functions that were used in the lab, and used them on our data. This left us with a list of sets of shingles, a vocabulary of all found shingles and a one-hot matrix that shows which shingles are in which documents.

In [33]:
# 4: Combine the records from both tables into one big list as we did during the lab.
combined_list = acm_concat + dblp_concat

# 5: Use the functions in the tutorials from lab 5 to compute the shingles, the minhash signature and the similarity.
def shingle(text: str, k: int)->set:
    """
    Create a set of 'shingles' from the input text using k-shingling.

    Parameters:
        text (str): The input text to be converted into shingles.
        k (int): The length of the shingles (substring size).

    Returns:
        set: A set containing the shingles extracted from the input text.
    """
    shingle_set = []
    for i in range(len(text) - k+1):
        shingle_set.append(text[i:i+k])
    return set(shingle_set)

def build_vocab(shingle_sets: list)->dict:
    """
    Constructs a vocabulary dictionary from a list of shingle sets.

    This function takes a list of shingle sets and creates a unified vocabulary
    dictionary. Each unique shingle across all sets is assigned a unique integer
    identifier.

    Parameters:
    - shingle_sets (list of set): A list containing sets of shingles.

    Returns:
    - dict: A vocabulary dictionary where keys are the unique shingles and values
      are their corresponding unique integer identifiers.

    Example:
    sets = [{"apple", "banana"}, {"banana", "cherry"}]
    build_vocab(sets)
    {'apple': 0, 'cherry': 1, 'banana': 2}  # The exact order might vary due to set behavior
    """
    full_set = {item for set_ in shingle_sets for item in set_}
    vocab = {}
    for i, shingle in enumerate(list(full_set)):
        vocab[shingle] = i
    return vocab

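def one_hot(shingles: set, vocab: dict) -> np.ndarray:
    """
    Return a 0/1 vector of length len(vocab) that marks which vocabulary
    shingles occur in the given shingle set.
    """
    vec = np.zeros(len(vocab))
    for shingle in shingles:
        idx = vocab[shingle]
        vec[idx] = 1
    return vec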

# find shingles in the lists
shingles = []
k = 3
for sentence in combined_list:
    shingles.append(shingle(sentence,k))

# build vocab using shingles
vocab = build_vocab(shingles)
print(f"Number of vocabulary is:{len(vocab)}")

# create one-hot matrix for which shingles are in each document
shingles_1hot = []
for shingle_set in shingles:
    shingles_1hot.append(one_hot(shingle_set,vocab))
shingles_1hot = np.stack(shingles_1hot)
shingles_1hot.shape # should be (no. of docs, vocab size)

Number of vocabulary is:13501


(4910, 13501)
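
To make the three helper steps concrete, here is a small self-contained sketch on two made-up strings (not taken from the ACM/DBLP data). It follows the same logic as the functions above but uses plain lists instead of NumPy; the names `toy_shingle`, `doc_a` and `doc_b` are hypothetical and exist only for this illustration.

```python
def toy_shingle(text: str, k: int) -> set:
    # All substrings of length k; the set removes duplicates.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Two made-up documents (hypothetical, not from the datasets).
doc_a = "data lake"
doc_b = "data base"

sets = [toy_shingle(doc_a, 3), toy_shingle(doc_b, 3)]

# Sorting makes the shingle -> column assignment reproducible.
toy_vocab = {s: i for i, s in enumerate(sorted(set().union(*sets)))}

# One row per document, one column per shingle in the vocabulary.
toy_1hot = [[0] * len(toy_vocab) for _ in sets]
for row, shingle_set in enumerate(sets):
    for s in shingle_set:
        toy_1hot[row][toy_vocab[s]] = 1

shared = sets[0] & sets[1]
print(sorted(shared))                      # shingles both documents contain
print(len(toy_1hot), len(toy_1hot[0]))     # (no. of docs, vocab size)
```

The two strings share only the shingles that come from their common prefix, which is what the one-hot rows encode as columns where both rows hold a 1.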

## Step 5 continuation: Use the functions in the tutorials from lab 5 to compute the shingles, the minhash signature and the similarity.

Here we build the MinHash array with the function from the lab. We used 220 hash functions (random permutations of the vocabulary) because, after some trial and error, this gave the highest precision at the end of the pipeline.

We then used the `get_signature` function from the lab to compute a signature for each row of the one-hot shingle matrix.

Finally, we compared every pair of documents and kept the pairs for which both the signature similarity and the exact Jaccard similarity of the shingle sets are above 0.7. This needs O(N^2) comparisons (about 12 million pairs for the 4910 documents), so it took around 1 minute on this data.

In [34]:
# 5 (cont.)
def get_minhash_arr(num_hashes:int,vocab:dict):
    """
    Generates a MinHash array for the given vocabulary.

    This function creates an array where each row represents a hash function and
    each column corresponds to a word in the vocabulary. The values are permutations
    of integers representing the hashed value of each word for that particular hash function.

    Parameters:
    - num_hashes (int): The number of hash functions (rows) to generate for the MinHash array.
    - vocab (dict): The vocabulary where keys are words and values can be any data
      (only keys are used in this function).

    Returns:
    - np.ndarray: The generated MinHash array with `num_hashes` rows and columns equal
      to the size of the vocabulary. Each cell contains the hashed value of the corresponding
      word for the respective hash function.

    Example:
    vocab = {'apple': 1, 'banana': 2}
    get_minhash_arr(2, vocab)
    # Possible output:
    # array([[1, 2],
    #        [2, 1]])
    """
    length = len(vocab.keys())
    arr = np.zeros((num_hashes,length))
    for i in range(num_hashes):
        permutation = np.random.permutation(len(vocab.keys())) + 1
        arr[i,:] = permutation.copy()
    return arr.astype(int)

def get_signature(minhash:np.ndarray, vector:np.ndarray):
    """
    Computes the signature of a given vector using the provided MinHash matrix.

    The function finds the nonzero indices of the vector, extracts the corresponding
    columns from the MinHash matrix, and computes the signature as the minimum value
    across those columns for each row of the MinHash matrix.

    Parameters:
    - minhash (np.ndarray): The MinHash matrix where each column represents a shingle
      and each row represents a hash function.
    - vector (np.ndarray): A vector representing the presence (non-zero values) or
      absence (zero values) of shingles.

    Returns:
    - np.ndarray: The signature vector derived from the MinHash matrix for the provided vector.

    Example:
    minhash = np.array([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
    vector = np.array([0, 1, 0])
    get_signature(minhash, vector)
    output:array([3, 6, 9])
    """
    idx = np.nonzero(vector)[0].tolist()
    shingles = minhash[:,idx]
    signature = np.min(shingles,axis=1)
    return signature

def jaccard_similarity(set1, set2):
    intersection_size = len(set1.intersection(set2))
    union_size = len(set1.union(set2))
    return intersection_size / union_size if union_size != 0 else 0.0

def compute_signature_similarity(signature_1, signature_2):
    """
    Estimate the Jaccard similarity of two documents from their MinHash signatures.

    Parameters:
    - signature_1: First signature vector as a numpy array.
    - signature_2: Second signature vector as a numpy array.

    Returns:
    - Estimated Jaccard similarity (fraction of positions where the signatures agree).
    """
    # Ensure the signatures have the same shape
    if signature_1.shape != signature_2.shape:
        raise ValueError("Both signatures must have the same shape.")
    # Count the number of positions where the two signatures agree
    agreement_count = np.sum(signature_1 == signature_2)
    # Calculate the similarity
    similarity = agreement_count / signature_2.shape[0]

    return similarity

# array to hold 220 permutations
minhash_arr =  get_minhash_arr(220,vocab)
signatures = []

# create signature matrix
for vector in shingles_1hot:
    signatures.append(get_signature(minhash_arr,vector))
signatures = np.stack(signatures)
print("the dimensions of the signature matrix are", signatures.shape)

# compute similarity
high_sim = []
for i in range(len(signatures)):
  for j in range(i + 1, len(signatures)):
    # if combination has high similarity in both measures we save it
    if compute_signature_similarity(signatures[i], signatures[j]) > 0.7 and jaccard_similarity(shingles[i], shingles[j]) > 0.7:
      high_sim.append((i, j))

the dimensions of the signature matrix are (4910, 220)
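
The reason the fraction of agreeing signature positions can stand in for the Jaccard similarity is that, under a random permutation, two sets have the same minimum exactly when that minimum lies in their intersection. The following sketch checks this on two made-up sets. It uses only the standard library, and all names and sizes (`set_a`, `set_b`, 2000 permutations) are assumptions chosen for the illustration, not values from the notebook.

```python
import random

def exact_jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def minhash_signature(items: set, permutations: list) -> list:
    # For each permutation (a dict item -> rank), keep the smallest rank
    # among the items that are present in the set.
    return [min(perm[x] for x in items) for perm in permutations]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    # Fraction of positions where the two signatures agree.
    agree = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return agree / len(sig_a)

rng = random.Random(0)          # fixed seed, only so the sketch is repeatable
universe = list(range(50))      # hypothetical shingle ids
set_a = set(range(0, 30))
set_b = set(range(10, 40))

num_hashes = 2000
permutations = []
for _ in range(num_hashes):
    ranks = universe[:]
    rng.shuffle(ranks)
    permutations.append(dict(zip(universe, ranks)))

sig_a = minhash_signature(set_a, permutations)
sig_b = minhash_signature(set_b, permutations)

exact = exact_jaccard(set_a, set_b)
estimate = estimate_jaccard(sig_a, sig_b)
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

With fewer hash functions the estimate becomes noisier, which is one reason the number of hashes affects the precision at the end of the pipeline.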


## Step 6: Extract the top 2224 candidates from the LSH algorithm, compare them to the actual mappings in the file DBLP-ACM_perfectMapping.csv and compute the precision of the method.

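Here we again used the code from the lab to create the LSH class. It splits each signature into bands, hashes every band into a bucket, and treats two documents as a candidate pair if they share a bucket in at least one band. Note that `check_candidates` returns an unordered set, so taking the first 2224 entries selects up to 2224 candidate pairs in arbitrary order rather than a ranking by similarity.

We used 20 bands (11 rows per band for 220 hashes), again because this worked well after a little trial and error.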

In [35]:
# 6: Extract the top 2224 candidates from the LSH algorithm, compare them to
# the actual mappings in the file DBLP-ACM_perfectMapping.csv and compute the
# precision of the method.
class LSH:
    """
    Implements the Locality Sensitive Hashing (LSH) technique for approximate
    nearest neighbor search.
    """
    # Buckets and counter are created per instance in __init__; as class
    # attributes the bucket list would be shared by every LSH object.

    def __init__(self, b: int):
        """
        Initializes the LSH instance with a specified number of bands.

        Parameters:
        - b (int): The number of bands to divide the signature into.
        """
        self.b = b
        self.buckets = [{} for _ in range(b)]
        self.counter = 0

    def make_subvecs(self, signature: np.ndarray) -> np.ndarray:
        """
        Divides a given signature into subvectors based on the number of bands.

        Parameters:
        - signature (np.ndarray): The MinHash signature to be divided.

        Returns:
        - np.ndarray: A stacked array where each row is a subvector of the signature.
        """
        l = len(signature)
        assert l % self.b == 0
        r = int(l / self.b)
        subvecs = []
        for i in range(0, l, r):
            subvecs.append(signature[i:i+r])
        return np.stack(subvecs)

    def add_hash(self, signature: np.ndarray):
        """
        Adds a signature to the appropriate LSH buckets based on its subvectors.

        Parameters:
        - signature (np.ndarray): The MinHash signature to be hashed and added.
        """
        subvecs = self.make_subvecs(signature).astype(str)
        for i, subvec in enumerate(subvecs):
            subvec = ','.join(subvec)
            if subvec not in self.buckets[i].keys():
                self.buckets[i][subvec] = []
            self.buckets[i][subvec].append(self.counter)
        self.counter += 1

    def check_candidates(self) -> set:
        """
        Identifies candidate pairs from the LSH buckets that could be potential near duplicates.

        Returns:
        - set: A set of tuple pairs representing the indices of candidate signatures.
        """
        candidates = []
        for bucket_band in self.buckets:
            keys = bucket_band.keys()
            for bucket in keys:
                hits = bucket_band[bucket]
                if len(hits) > 1:
                    candidates.extend(combinations(hits, 2))
        return set(candidates)

# use 20 bands (220 / 20 = 11 rows per band) and create the LSH object
b = 20
lsh = LSH(b)
for signature in signatures:
    lsh.add_hash(signature)
candidate_pairs = lsh.check_candidates()
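
The choice of the number of bands controls which similarities are likely to produce a candidate pair. Under the usual textbook assumption that the hash functions are independent, two documents with signature similarity s agree on all r rows of a given band with probability s^r, so they become candidates in at least one of b bands with probability 1 - (1 - s^r)^b. The sketch below evaluates this for the configuration used here (b = 20, r = 11); the function name `candidate_probability` is hypothetical.

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    # Probability that two signatures with similarity s agree on all r rows
    # of at least one of the b bands.
    return 1.0 - (1.0 - s ** r) ** b

b, r = 20, 11  # 220 hashes split into 20 bands of 11 rows

for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"s={s:.1f} -> P(candidate)={candidate_probability(s, b, r):.3f}")

# Common rule of thumb for where the curve rises steeply: (1/b)^(1/r).
threshold = (1.0 / b) ** (1.0 / r)
print(f"approximate threshold: {threshold:.2f}")
```

More bands with fewer rows each lower this threshold (more candidates, more false positives), while fewer bands with more rows raise it.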

## Step 6 Continuation
Now we read the perfect mapping and convert it to a set of (DBLP id, ACM id) tuples, the same format as our candidate pairs once their indices are translated to record ids.
We then check each of the selected candidate pairs (at most 2224) against the perfect mapping and count the ones that are correct. The precision is the number of correct pairs divided by the number of selected candidate pairs.

In [36]:
# 6 (cont.)
perfect_map = pd.read_csv("DBLP-ACM_perfectMapping.csv")

# convert the dataframe to a set of tuples, so it is the same format as top_pairs_ids
perfect_mapping = set(zip(perfect_map['idDBLP'], perfect_map['idACM']))

# take up to 2224 candidate pairs (indices into combined_list); the set is
# unordered, so this is an arbitrary subset rather than a ranking
top_pairs = list(candidate_pairs)[:2224]

# create lists of ids and put them together
acm_ids = list(acm_data['id'])
dblp_ids = list(dblp_data['id'])
combined_ids = acm_ids + dblp_ids

# create a list of id tuples so that we can compare to the perfect mapping
top_pairs_ids = []
for idx1, idx2 in top_pairs:
    id1 = combined_ids[idx1]
    id2 = combined_ids[idx2]
    top_pairs_ids.append((id1, id2))

# find out how many pairs are mapped correctly
correct_count = 0
for pair in top_pairs_ids:
    # check every top pair against the pairs in perfect mapping
    if pair in perfect_mapping or (pair[1], pair[0]) in perfect_mapping:
        correct_count += 1

# precision is amount correct / total candidates
precision = correct_count / len(top_pairs)
print("Precision of LSH method:", precision)

# runtime of whole notebook was 8 seconds

Precision of LSH method: 0.5516093229744728
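
Because `check_candidates` returns a set, `list(candidate_pairs)[:2224]` is not ordered by similarity. One possible way to obtain an actual top-k is to score every candidate pair by the agreement of its two signatures and sort on that score. The following is a minimal sketch of that idea on toy signatures; `signature_agreement` and `rank_candidates` are hypothetical helpers, not part of the lab code, and the reported precision above was not computed this way.

```python
def signature_agreement(sig_a, sig_b) -> float:
    # Fraction of positions where the two signatures are equal.
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)

def rank_candidates(pairs, signatures, k: int) -> list:
    # Sort candidate index pairs by estimated similarity, highest first,
    # and keep the first k of them.
    scored = sorted(
        pairs,
        key=lambda p: signature_agreement(signatures[p[0]], signatures[p[1]]),
        reverse=True,
    )
    return scored[:k]

# Toy signatures (hypothetical values, 4 hashes each).
toy_signatures = [
    [1, 5, 3, 7],
    [1, 5, 3, 2],   # agrees with document 0 on 3 of 4 positions
    [4, 9, 8, 7],   # agrees with document 0 on 1 of 4 positions
]
toy_pairs = {(0, 2), (0, 1), (1, 2)}

top = rank_candidates(toy_pairs, toy_signatures, k=2)
print(top)
```

In the notebook this would correspond to something like `rank_candidates(candidate_pairs, signatures, 2224)`, since the helper only iterates over the signature rows.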


### 6.

Extract the top 2224 candidates from the LSH algorithm, compare them to the actual mappings in the file `DBLP-ACM_perfectMapping.csv` and compute the precision of the method.

In [38]:
# map the index pairs from the exhaustive pairwise comparison (high_sim) to record ids
top_pairs_ids = []
for idx1, idx2 in high_sim:
    id1 = combined_ids[idx1]
    id2 = combined_ids[idx2]
    top_pairs_ids.append((id1, id2))

# find out how many pairs are mapped correctly
correct_count = 0
for pair in top_pairs_ids:
    # check every top pair against the pairs in perfect mapping
    if pair in perfect_mapping or (pair[1], pair[0]) in perfect_mapping:
        correct_count += 1

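# precision is amount correct / total candidates
# NOTE: the denominator here is len(top_pairs) (the LSH candidates), not the
# number of high_sim pairs; the precision of the pairwise method itself would be
# correct_count / len(top_pairs_ids). The printed label also still says "LSH".
precision = correct_count / len(top_pairs)
print("Precision of LSH method:", precision)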

Precision of LSH method: 0.560488346281909


### 7. Record the running time of the method.

In [47]:
start_time = time.time()

acm_data = pd.read_csv("ACM.csv")
dblp_data = pd.read_csv("DBLP2.csv", encoding = "latin1")

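acm_concat = concat_values(acm_data)
dblp_concat = concat_values(dblp_data)
# rebuild the combined list so the timed run does not depend on earlier cells
combined_list = acm_concat + dblp_concat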

shingles = []
k = 3
for sentence in combined_list:
    shingles.append(shingle(sentence,k))

vocab = build_vocab(shingles)

shingles_1hot = []
for shingle_set in shingles:
    shingles_1hot.append(one_hot(shingle_set,vocab))
shingles_1hot = np.stack(shingles_1hot)
shingles_1hot.shape 

minhash_arr =  get_minhash_arr(220,vocab)
signatures = []

for vector in shingles_1hot:
    signatures.append(get_signature(minhash_arr,vector))
signatures = np.stack(signatures)

high_sim = []
for i in range(len(signatures)):
  for j in range(i + 1, len(signatures)):
    if compute_signature_similarity(signatures[i], signatures[j]) > 0.7 and jaccard_similarity(shingles[i], shingles[j]) > 0.7:
      high_sim.append((i, j))

b = 20
lsh = LSH(b)
for signature in signatures:
    lsh.add_hash(signature)
candidate_pairs = lsh.check_candidates()

perfect_map = pd.read_csv("DBLP-ACM_perfectMapping.csv")
perfect_map

perfect_mapping = set(zip(perfect_map['idDBLP'], perfect_map['idACM']))

top_pairs = list(candidate_pairs)[:2224]

acm_ids = list(acm_data['id'])
dblp_ids = list(dblp_data['id'])
combined_ids = acm_ids + dblp_ids

top_pairs_ids = []
for idx1, idx2 in top_pairs:
    id1 = combined_ids[idx1]
    id2 = combined_ids[idx2]
    top_pairs_ids.append((id1, id2))

correct_count = 0
for pair in top_pairs_ids:
    if pair in perfect_mapping or (pair[1], pair[0]) in perfect_mapping:
        correct_count += 1

precision = correct_count / len(top_pairs)


top_pairs_ids = []
for idx1, idx2 in high_sim:
    id1 = combined_ids[idx1]
    id2 = combined_ids[idx2]
    top_pairs_ids.append((id1, id2))

correct_count = 0
for pair in top_pairs_ids:
    if pair in perfect_mapping or (pair[1], pair[0]) in perfect_mapping:
        correct_count += 1

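# NOTE: this divides by len(top_pairs) (the LSH candidates); the pairwise
# method's own precision would use len(top_pairs_ids) instead
precision = correct_count / len(top_pairs)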

end_time = time.time()
print(f"Total time taken: {end_time - start_time}")

Total time taken: 49.95915746688843
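
The cell above reports a single total, which does not show how the time is split between the exhaustive pairwise comparison and the LSH step. A minimal sketch of per-stage timing with `time.perf_counter` is given below. The `timed` helper and the two stand-in workloads are hypothetical; in the notebook the `with` blocks would wrap the shingling, signature, pairwise-comparison and LSH steps.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label: str):
    # Record the wall-clock duration of the enclosed block under `label`.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-in workloads (hypothetical), used only to have something to measure.
with timed("stage_a"):
    total = sum(i * i for i in range(100_000))

with timed("stage_b"):
    words = sorted(str(i) for i in range(10_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

`time.perf_counter` is intended for measuring short durations, whereas `time.time` follows the system clock and can be affected by clock adjustments.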


## Task 3. Data preparation

For this task, use the Pima Indians Diabetes Database.

### 1. 

Compute the correlation between the different columns after removing the `Outcome` column.

In [39]:
#loading csv file
df = pd.read_csv("diabetes.csv")

In [41]:
#calculating correlation before cleaning
cor_raw = df.drop(columns=["Outcome"]).corr()
cor_raw

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Pregnancies,1.0,0.129459,0.141282,-0.081672,-0.073535,0.017683,-0.033523,0.544341
Glucose,0.129459,1.0,0.15259,0.057328,0.331357,0.221071,0.137337,0.263514
BloodPressure,0.141282,0.15259,1.0,0.207371,0.088933,0.281805,0.041265,0.239528
SkinThickness,-0.081672,0.057328,0.207371,1.0,0.436783,0.392573,0.183928,-0.11397
Insulin,-0.073535,0.331357,0.088933,0.436783,1.0,0.197859,0.185071,-0.042163
BMI,0.017683,0.221071,0.281805,0.392573,0.197859,1.0,0.140647,0.036242
DiabetesPedigreeFunction,-0.033523,0.137337,0.041265,0.183928,0.185071,0.140647,1.0,0.033561
Age,0.544341,0.263514,0.239528,-0.11397,-0.042163,0.036242,0.033561,1.0


### 2.

Remove the disguised values from the table. We need to remove the values equal to `0` from the columns `BloodPressure`, `SkinThickness` and `BMI`, as these are missing values that have been encoded as `0`. Remove the value but keep the record (i.e., change the value to `null`).

In [42]:
#replacing 0 values with NaN
df_cleaned = df.copy()
df_cleaned[["BloodPressure", "SkinThickness", "BMI"]] = df_cleaned[["BloodPressure", "SkinThickness", "BMI"]].replace(0, np.nan)
df_cleaned[:10]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72.0,35.0,0,33.6,0.627,50,1
1,1,85,66.0,29.0,0,26.6,0.351,31,0
2,8,183,64.0,,0,23.3,0.672,32,1
3,1,89,66.0,23.0,94,28.1,0.167,21,0
4,0,137,40.0,35.0,168,43.1,2.288,33,1
5,5,116,74.0,,0,25.6,0.201,30,0
6,3,78,50.0,32.0,88,31.0,0.248,26,1
7,10,115,,,0,35.3,0.134,29,0
8,2,197,70.0,45.0,543,30.5,0.158,53,1
9,8,125,96.0,,0,,0.232,54,1
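
Before replacing the zeros it can help to count how many disguised values each column holds, and to confirm afterwards that only the intended columns changed. The sketch below does this on a small made-up frame; the values are hypothetical and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows (hypothetical values), with 0 standing in for a missing measurement.
toy = pd.DataFrame({
    "BloodPressure": [72, 0, 64, 66],
    "SkinThickness": [35, 29, 0, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 8, 1],   # here 0 is a legitimate value
})

cols = ["BloodPressure", "SkinThickness", "BMI"]

# How many disguised missing values does each affected column hold?
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)

cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

# After the replacement the zeros show up as NaN; Pregnancies is untouched.
nan_counts = cleaned[cols].isna().sum()
print(nan_counts)
```

Restricting the replacement to the listed columns matters because a `0` in a column such as `Pregnancies` is a real value, not a missing one.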


### 3.

Fill the `null` cells using the mean value of the records that have the same class label.

In [43]:
#filling NaN values with the mean per group based on Outcome
for col in ["BloodPressure", "SkinThickness", "BMI"]:
    df_cleaned[col] = df_cleaned.groupby("Outcome")[col].transform(lambda x: x.fillna(x.mean()))

df_cleaned[:10]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72.0,35.0,0,33.6,0.627,50,1
1,1,85,66.0,29.0,0,26.6,0.351,31,0
2,8,183,64.0,33.0,0,23.3,0.672,32,1
3,1,89,66.0,23.0,94,28.1,0.167,21,0
4,0,137,40.0,35.0,168,43.1,2.288,33,1
5,5,116,74.0,27.235457,0,25.6,0.201,30,0
6,3,78,50.0,32.0,88,31.0,0.248,26,1
7,10,115,70.877339,27.235457,0,35.3,0.134,29,0
8,2,197,70.0,45.0,543,30.5,0.158,53,1
9,8,125,96.0,33.0,0,35.406767,0.232,54,1
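
The effect of `groupby(...).transform(...)` with `fillna` is easier to verify on a frame small enough to compute by hand. In the made-up example below (hypothetical values), the missing `BMI` in class 0 should become the mean of 20 and 30, and the one in class 1 the mean of 40 and 50.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BMI": [20.0, np.nan, 30.0, 40.0, np.nan, 50.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Each missing BMI is replaced by the mean of the known BMIs in its own class.
toy["BMI_filled"] = toy.groupby("Outcome")["BMI"].transform(
    lambda x: x.fillna(x.mean())
)
print(toy)
```

`transform` returns a result aligned with the original index, which is why it can be assigned straight back to a column without reordering the rows.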


### 4.

Compute the correlation between the different columns.

In [44]:
#calculating the correlation of the cleaned data
cor_clean = df_cleaned.drop(columns=["Outcome"]).corr()
cor_clean[:10]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Pregnancies,1.0,0.129459,0.208935,0.094172,-0.073535,0.024127,-0.033523,0.544341
Glucose,0.129459,1.0,0.222417,0.220943,0.331357,0.219879,0.137337,0.263514
BloodPressure,0.208935,0.222417,1.0,0.203453,-0.048106,0.286518,-0.002264,0.324439
SkinThickness,0.094172,0.220943,0.203453,1.0,0.104017,0.565443,0.102426,0.135916
Insulin,-0.073535,0.331357,-0.048106,0.104017,1.0,0.185545,0.185071,-0.042163
BMI,0.024127,0.219879,0.286518,0.565443,0.185545,1.0,0.15253,0.027578
DiabetesPedigreeFunction,-0.033523,0.137337,-0.002264,0.102426,0.185071,0.15253,1.0,0.033561
Age,0.544341,0.263514,0.324439,0.135916,-0.042163,0.027578,0.033561,1.0


### 5.

Compare the values from this step with the values in the first step (just mention the most important changes (if any)) and comment on your findings.

In [46]:
#calculating difference between clean and raw correlation
cor_diff = cor_clean - cor_raw
cor_diff

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Pregnancies,0.0,0.0,0.067653,0.175844,0.0,0.006444,0.0,0.0
Glucose,0.0,0.0,0.069828,0.163615,0.0,-0.001192,0.0,0.0
BloodPressure,0.067653,0.069828,0.0,-0.003918,-0.137039,0.004713,-0.043529,0.084911
SkinThickness,0.175844,0.163615,-0.003918,0.0,-0.332766,0.172869,-0.081501,0.249886
Insulin,0.0,0.0,-0.137039,-0.332766,0.0,-0.012314,0.0,0.0
BMI,0.006444,-0.001192,0.004713,0.172869,-0.012314,0.0,0.011884,-0.008664
DiabetesPedigreeFunction,0.0,0.0,-0.043529,-0.081501,0.0,0.011884,0.0,0.0
Age,0.0,0.0,0.084911,0.249886,0.0,-0.008664,0.0,0.0
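
Reading the difference table, the largest shifts involve `SkinThickness` (for example -0.332766 with `Insulin` and +0.249886 with `Age`), while pairs that do not include a cleaned column are unchanged. To pick out the most important changes without scanning the matrix by eye, the pairs can be ranked by absolute change. The sketch below shows one way to do this on a small made-up matrix; `largest_changes` is a hypothetical helper and the values are not from the dataset.

```python
from itertools import combinations

import pandas as pd

def largest_changes(diff: pd.DataFrame, top_n: int = 5) -> pd.Series:
    # Visit each unordered column pair once, which skips the diagonal
    # and the mirrored lower triangle of the symmetric matrix.
    pairs = {(a, b): diff.loc[a, b] for a, b in combinations(diff.columns, 2)}
    series = pd.Series(pairs)
    # Order by absolute change, largest first, keeping the sign in the result.
    return series.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)

# Small made-up difference matrix (hypothetical values).
cols = ["A", "B", "C"]
toy_diff = pd.DataFrame(
    [[0.0, 0.10, -0.30],
     [0.10, 0.0, 0.05],
     [-0.30, 0.05, 0.0]],
    index=cols, columns=cols,
)
result = largest_changes(toy_diff, top_n=2)
print(result)
```

Applied to the notebook this would be called as `largest_changes(cor_diff)`, assuming `cor_diff` is the symmetric difference matrix computed above.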
