# Fast Semantic Similarity


Fast sparse matrix multiplication using Sentence-BERT vectorization and cosine similarity to compute contextual similarity between text sentences. The pipeline creates n-grams for each sentence and compares each sentence with the rest to generate a sparse similarity matrix.
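As a minimal, dependency-free sketch of the idea (the notebook itself relies on Sentence-BERT embeddings and `sparse_dot_topn`), the snippet below L2-normalizes a few toy vectors, uses their dot products as cosine similarities, and keeps only the pairs at or above a threshold, which is what makes the result sparse. The vectors, the `threshold` value and the helper names are illustrative assumptions, not part of the pipeline.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def sparse_cosine_pairs(vectors, threshold):
    """Return {(i, j): score} for i < j whose cosine similarity is >= threshold."""
    unit = [l2_normalize(v) for v in vectors]
    pairs = {}
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            # For unit vectors the dot product equals the cosine similarity
            score = sum(a * b for a, b in zip(unit[i], unit[j]))
            if score >= threshold:
                pairs[(i, j)] = score
    return pairs

# Toy "embeddings" (hypothetical values, not real Sentence-BERT output)
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
similar_pairs = sparse_cosine_pairs(toy_vectors, threshold=0.8)
print(similar_pairs)
```

Only the first two vectors point in nearly the same direction, so only that pair survives the threshold; every other entry is simply absent from the result.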


The following pipeline computes semantic similarity using one of **two** approaches:

1. If you have only *one* input file (the source file):

    - The code finds duplicates and very similar sentences within the source file.
    - The output is a dataframe in cluster format, i.e. clusters of duplicate and near-duplicate sentences.
    - The original IDs are kept intact so that results can be traced back to the source file.
    - Set the threshold for filtering similar sentences with `source_min_similarity`.

2. If you have *two* files and want to find similar sentences across them:

    - The code computes semantic similarity between a master file (the source) and a comparison file (the target).
    - The output is a dataframe with the compared sentences and their similarity scores.
    - Two IDs are carried through the process: one from the master (source) and one from the comparison (target).
    - Set the threshold for filtering similar sentences with `min_similarity`.
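The cluster output of the first approach can be pictured as connected components over the pairs that pass the threshold. The notebook imports `connected_components` and `networkx`, which suggests a graph-based grouping, but the actual clustering code is not shown in this section, so the union-find sketch below is an assumption about the general technique rather than the notebook's implementation. The function name and the sample IDs are hypothetical.

```python
def cluster_similar_pairs(ids, similar_pairs):
    """Group ids into clusters: ids linked by any chain of similar pairs share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right in similar_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

ids = ["a1", "a2", "a3", "a4"]
pairs = [("a1", "a2"), ("a2", "a3")]  # pairs that passed the similarity threshold
clusters = cluster_similar_pairs(ids, pairs)
print(clusters)
```

Because the original IDs are the cluster members, each cluster can be traced straight back to rows of the source file.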

## Steps to run

- Get Python >= 3.6.0

- Put your input file(s) in the connected input folder.

- Set the directory paths.

- Your input files must contain at least the `ID` and `TEXT` columns.
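A quick header check before running the pipeline can catch a missing column early. The sketch below uses only the standard library and a hypothetical helper name, `missing_columns`; the notebook itself works with pandas dataframes, so this is just one possible way to do the check, assuming CSV input.

```python
import csv
import io

REQUIRED_COLUMNS = {"ID", "TEXT"}

def missing_columns(csv_text):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)

good = "ID,TEXT\n1,hello world\n"
bad = "ID,BODY\n1,hello world\n"
print(missing_columns(good))  # nothing missing
print(missing_columns(bad))   # the TEXT column is absent
```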


---

- Select a `Master file` (the file you wish to run similarity on, OR find duplicates and very similar rows in). This is referred to as "Master" or "Source".


- (Optional) Select a `Comparison file` (if you wish to compare it against the Master file). This is referred to as "Comparison" or "Target".

---

- `Approach 1` ---> To collect duplicates, very similar ids (defined by `source_min_similarity`) on master file.
    - Returns dataframe with:
        - identified duplicate rows and
        - very similar rows,
        - initial generic cluster id


---
- `Approach 2` ---> To compute textual similarity between the master and comparison files (defined by `min_similarity`).
    - Returns dataframe with:
        - identified duplicate rows in master file,
        - very similar rows in master file,
        - initial generic cluster id in master file,
        - most similar record from comparison file,
        - similar records found in comparison file
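To make the shape of the second approach's result concrete, here is a small sketch that takes scored (source, target) pairs, drops those below `min_similarity`, and orders the remaining matches so the most similar comparison record comes first. The function name, the tuple layout and the scores are assumptions made for illustration; the notebook's own dataframe columns are not shown in this section.

```python
def best_matches(scored_pairs, min_similarity):
    """Map each source id to its matches at or above the threshold, best match first."""
    matches = {}
    for source_id, target_id, score in scored_pairs:
        if score >= min_similarity:
            matches.setdefault(source_id, []).append((target_id, score))
    for source_id in matches:
        matches[source_id].sort(key=lambda item: item[1], reverse=True)
    return matches

# Hypothetical similarity scores between master (s*) and comparison (t*) rows
scored_pairs = [
    ("s1", "t1", 0.91),
    ("s1", "t2", 0.95),
    ("s1", "t3", 0.40),
    ("s2", "t1", 0.30),
]
result = best_matches(scored_pairs, min_similarity=0.80)
print(result)
```

The first entry of each list plays the role of the "most similar record", and the full list corresponds to the "similar records found"; sources with no match above the threshold do not appear at all.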


---

## Contents:


1. Imports
2. Directory Setup
3. Preprocessing Unit
4. Vectorization Unit
5. Pair-Wise Fast-Similarity Matrix Creation Unit
6. Similarity Computation Unit
7. Execute (aka. final :) )

## Imports

In [1]:
## Imports
'''Update code from Python 3.6.10 to a stable Kernel Python Version 3.8.0 '''

# Standard libs
import os
import sys
import json
import warnings
import re
import io
from io import StringIO
import inspect
import shutil
import ast
import string
import time
import pickle
import glob
import traceback
import multiprocessing
import requests
import logging
import math
import pytz
from itertools import chain
from string import Template
from datetime import datetime, timedelta
from dateutil import parser
import base64
from collections import defaultdict, Counter, OrderedDict
from contextlib import contextmanager
import unicodedata
from functools import reduce
import itertools
import tempfile
from typing import Any, Dict, List, Callable, Optional, Tuple, NamedTuple, Union
from functools import wraps

# graph
import networkx as nx

# Required pkgs
import numpy as np
from numpy import array, argmax
import pandas as pd
import ntpath
import tqdm

# General text correction - fit text for you (ftfy) and others
import ftfy
from fuzzywuzzy import fuzz
#from wordcloud import WordCloud
from spellchecker import SpellChecker

# imbalanced-learn
from imblearn.over_sampling import SMOTE, SVMSMOTE, ADASYN

# scikit-learn
from sklearn.utils import shuffle
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, jaccard_score, silhouette_score, homogeneity_score, calinski_harabasz_score
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.base import BaseEstimator, TransformerMixin

# scipy
from scipy import spatial, sparse
from scipy.sparse import coo_matrix, vstack, hstack
from scipy.spatial.distance import euclidean, jensenshannon, cosine, cdist
from scipy.io import mmwrite, mmread
from scipy.stats import entropy
from scipy.cluster.hierarchy import dendrogram, ward, fcluster
import scipy.cluster.hierarchy as sch
# import from the public namespace; `scipy.sparse.csr` / `scipy.sparse.lil` are deprecated
from scipy.sparse import csr_matrix, lil_matrix
from scipy.sparse.csgraph import connected_components

# sparse_dot_topn: matrix multiplier
from sparse_dot_topn import awesome_cossim_topn
import sparse_dot_topn.sparse_dot_topn as ct

# Gensim
import gensim
from gensim.models import Phrases, Word2Vec, KeyedVectors, FastText, LdaModel
from gensim import utils
from gensim.utils import simple_preprocess
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
import gensim.downloader as api
from gensim import models, corpora, similarities

# NLTK
import nltk
#nltk_model_data_path = "/someppath/"
#nltk.data.path.append(nltk_model_data_path)
from nltk import FreqDist, tokenize, sent_tokenize, word_tokenize, pos_tag
from nltk.corpus import stopwords, PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import *
from nltk.translate.bleu_score import sentence_bleu
print("NLTK loaded.")

# Spacy
import spacy
# spacy_model_data_path = "/Users/pranjalpathak/opt/anaconda3/envs/Python_3.6/lib/python3.6/site-packages/en_core_web_lg/en_core_web_lg-2.2.5"
nlp = spacy.load('en_core_web_lg')  # disabling: nlp = spacy.load(spacy_data_path, disable=['ner'])
from spacy import displacy
from spacy.matcher import Matcher
from spacy.lang.en import English
print("Spacy loaded.")

# TF & Keras
import tensorflow as tf
from keras import backend as K
from keras.layers import *
from tensorflow.keras.layers import Layer, InputSpec, BatchNormalization, Embedding, LSTM, Dense, Activation
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.utils import pad_sequences, CustomObjectScope
from keras.utils.np_utils import to_categorical
from keras import initializers as initializers, regularizers, constraints, optimizers
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from keras.models import Sequential, Model, load_model
import tensorflow_hub as hub
print("TensorFlow loaded.")

# Pytorch
import torch
from torch import optim, nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoTokenizer
from transformers import AutoModelWithLMHead
from transformers import pipeline
from transformers import AutoModel
print("PyTorch loaded.")

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly import offline
%matplotlib inline

# Theme settings
pd.set_option("display.max_columns", 80)
sns.set_context('talk')
sns.set(rc={'figure.figsize':(15,10)})
sns.set_style("darkgrid")
warnings.filterwarnings('ignore')


NLTK loaded.


Spacy loaded.
TensorFlow loaded.
PyTorch loaded.


## Directory Setup

In [11]:
# input data
input_data_fp = ""
output_data_fp = ""

# common NLP resources
resources_dir_path = "./data/resources/"

# embedding models
sbert_model_fp = "./models/sentence-transformers-models/all-distilroberta-v1"

## Preprocessing Unit (Unit 1/4)

In [4]:
class preprocessText:
    
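    def __init__(self, resources_dir_path, custom_vocab=None, do_lemma=False):
        # avoid a shared mutable default argument
        custom_vocab = list(custom_vocab) if custom_vocab is not None else []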
        self.stopwords_file = os.path.join(resources_dir_path, "stopwords.txt")
        self.special_stopwords_file = os.path.join(resources_dir_path, "special_stopwords.txt")
        self.special_characters_file = os.path.join(resources_dir_path, "special_characters.txt")
        self.contractions_file = os.path.join(resources_dir_path, "contractions.json")
        self.chatwords_file = os.path.join(resources_dir_path, "chatwords.txt")
        self.emoticons_file = os.path.join(resources_dir_path, "emoticons.json")
        self.greeting_file = os.path.join(resources_dir_path, "greeting_words.txt")
        self.signature_file = os.path.join(resources_dir_path, "signature_words.txt")
        self.preserve_key = "<$>" # preserve special vocab
        self.vocab_list = custom_vocab
        self.preseve = True if len(custom_vocab) > 0 else False
        self.load_resources()
        self.do_lemma = do_lemma
        return
    
    def load_resources(self):
        
        ### Build Vocab Model --> Words to keep
        self.vocab_list = set(map(str.lower, self.vocab_list))
        self.vocab_dict = {w: self.preserve_key.join(w.split()) for w in self.vocab_list}
        self.re_retain_words = re.compile('|'.join(sorted(map(re.escape, self.vocab_dict), key=len, reverse=True)))
        
        ### Build Stopwords Model --> Words to drop/delete
        with open(self.stopwords_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.stopwords = [x.rstrip() for x in f.readlines()]
        with open(self.special_stopwords_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.stopwords.extend([x.rstrip() for x in f.readlines()])
        with open(self.special_characters_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.stopwords.extend([x.rstrip() for x in f.readlines()])
        self.stopwords = list(sorted(set(self.stopwords).difference(self.vocab_list)))

        ### Build Contractions
        with open(self.contractions_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.contractions = dict(json.load(f))
        
        ### Build Chat-words
        with open(self.chatwords_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.chat_words_map_dict, self.chat_words_list = {}, []
            chat_words = [x.rstrip() for x in f.readlines()]
            for line in chat_words:
                cw = line.split("=")[0]
                cw_expanded = line.split("=")[1]
                self.chat_words_list.append(cw)
                self.chat_words_map_dict[cw] = cw_expanded
            self.chat_words_list = set(self.chat_words_list)
        
        ### Build social markups
        # emoticons
        with open(self.emoticons_file, "r") as f:
            self.emoticons = re.compile(u'(' + u'|'.join(k for k in json.load(f)) + u')')
        # emojis
        self.emojis = re.compile("["
                                   u"\U0001F600-\U0001F64F"  # emoticons
                                   u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                                   u"\U0001F680-\U0001F6FF"  # transport & map symbols
                                   u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                                   u"\U00002702-\U000027B0"
                                   u"\U000024C2-\U0001F251"
                                   "]+", flags=re.UNICODE)
        # greeting
        with open(self.greeting_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.greeting_words = [x.rstrip() for x in f.readlines()]
        # signature
        with open(self.signature_file, 'r', encoding='utf-8', errors='ignore') as f:
            self.signature_words = [x.rstrip() for x in f.readlines()]
        # spell-corrector
        self.spell_checker = SpellChecker()   
        return
    
    
    def reserve_keywords_from_cleaning(self, text, reset=False):
        """ 
        Finds common words from a user-provided list of special keywords to preserve them from 
        cleaning steps. Identifies every special keyword and joins them using `self.preserve_key` during the 
        cleaning steps, and later resets it back to original word in the end.
        """
        if reset is False:
            # compile using a dict of words and their expansions, and sub them if found!
            match_and_sub = self.re_retain_words.sub(lambda x: self.vocab_dict[x.string[x.start():x.end()]], text)
            return re.sub(r"([\s\n\t\r]+)", " ", match_and_sub).strip()
        else:
            # reverse the change! - use this at the end of preprocessing
            text = text.replace(self.preserve_key, " ")
            return re.sub(r"([\s\n\t\r]+)", " ", text).strip()


    def basic_clean(self, input_sentences):
        cleaned_sentences = []
        for sent in input_sentences:
            sent = str(sent).strip()
            # FIX text
            sent = ftfy.fix_text(sent)
            # Normalize accented chars
            sent = unicodedata.normalize('NFKD', sent).encode('ascii', 'ignore').decode('utf-8', 'ignore')
            # Removing <…> web scrape tags
            sent = re.sub(r"\<(.*?)\>", " ", sent)
            # Expanding contractions using contractions_file
            sent = re.sub(r"(\w+\'\w+)", lambda x: self.contractions.get(x.group().lower(), x.group().lower()), sent)
            # Removing web urls
            sent = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»""'']))''', " ", sent)
            # Removing date formats
            sent = re.sub(r"(\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\s\:)", " ", sent)
            # Removing extra whitespaces
            sent = re.sub(r"([\s\n\t\r]+)", " ", sent).strip()
            cleaned_sentences.append(sent)
        return cleaned_sentences


    def deep_clean(self, input_sentences):
        cleaned_sentences = []
        for sent in input_sentences:
            # strip accents and drop any remaining non-ASCII characters
            sent = unicodedata.normalize('NFKD', str(sent)).encode('ascii', 'ignore').decode('utf-8', 'ignore')
            # lowercasing
            sent = str(sent).strip().lower()

            # <----------------------------- CUSTOM CLEANING ----------------------------- >
            #
            # *** Mark important keywords such as: Domain specific, Question words(wh-words), etc, using 
            # "self.vocab_list". Words from this list if found in any input sentence shall be joined using 
            # a key (self.preserve_key) during pre-processing step, and later un-joined to retain them.
            #
            if self.preseve: 
                sent = self.reserve_keywords_from_cleaning(sent, reset=False)
            #
            # <----------------------------- CUSTOM CLEANING ----------------------------- >

            # remove Emojis
            sent = self.emojis.sub(r'', sent)
            # remove emoticons
            sent = self.emoticons.sub(r'', sent)
            # expand common chat-words (abbreviations) to their full form
            sent = " ".join([self.chat_words_map_dict[w.upper()] if w.upper() in self.chat_words_list else w for w in sent.split()])
            # FIX text
            sent = ftfy.fix_text(sent)
            # Normalize accented chars
            sent = unicodedata.normalize('NFKD', sent).encode('ascii', 'ignore').decode('utf-8', 'ignore')
            # Removing <…> web scrape tags
            sent = re.sub(r"\<(.*?)\>", " ", sent)
            # Expanding contractions using contractions_file
            sent = re.sub(r"(\w+\'\w+)", lambda x: self.contractions.get(x.group().lower(), x.group().lower()), sent)
            # Removing web urls
            sent = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»""'']))''', " ", sent)
            # Removing date formats
            sent = re.sub(r"(\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\s\:)", " ", sent)

            # <----------------------------- OPTIONAL CLEANING ----------------------------- >
            #
            # removing punctuations 🔥🔥
            # *** disable them, when sentence structure needs to be retained ***
            sent = re.sub(r"[\$|\#\@\*\%]+\d+[\$|\#\@\*\%]+", " ", sent)
            sent = re.sub(r"\'s", " \'s", sent)
            sent = re.sub(r"\'ve", " \'ve", sent)
            sent = re.sub(r"n\'t", " n\'t", sent)
            sent = re.sub(r"\'re", " \'re", sent)
            sent = re.sub(r"\'d", " \'d", sent)
            sent = re.sub(r"\'ll", " \'ll", sent)
            sent = re.sub(r"[\/,\@,\#,\\,\{,\},\(,\),\[,\],\$,\%,\^,\&,\*,\<,\>]", " ", sent)
            sent = re.sub(r"[\,,\;,\:,\-]", " ", sent)      # main puncts
            
            # remove sentence delimiters 🔥🔥
            # *** disable them, when sentence boundary/ending is important ***
            # sent = re.sub(r"[\!,\?,\.]", " ", sent)

            # keep only text & numbers 🔥🔥
            # *** enable them, when only text and numbers matter! *** 
            # sent = re.sub(r"\s+", " ", re.sub(r"[\\|\/|\||\{|\}|\[|\]\(|\)]+", " ", re.sub(r"[^A-Za-z0-9]", " ", str(sent))))
            
            # correct spelling mistakes 🔥🔥
            # *** enable them when english spelling mistakes matter *** 
            # sent = " ".join([self.spell_checker.correction(w) if w in self.spell_checker.unknown(sent.split()) else w for w in sent.split()])
            #
            # <----------------------------- OPTIONAL CLEANING ----------------------------- >

            # Remove stopwords
            sent = " ".join(token.text for token in nlp(sent) if token.text not in self.stopwords and 
                                                                 token.lemma_ not in self.stopwords)
            # Lemmatize
            if self.do_lemma:
                sent = " ".join(token.lemma_ for token in nlp(sent))
            # Removing extra whitespaces
            sent = re.sub(r"([\s\n\t\r]+)", " ", sent).lower().strip()

            # <----------------------------- CUSTOM CLEANING ----------------------------- >
            #
            # *** Reverse the custom joining now to un-join the special words found!
            if self.preseve: 
                sent = self.reserve_keywords_from_cleaning(sent, reset=True)
            # <----------------------------- CUSTOM CLEANING ----------------------------- >

            cleaned_sentences.append(sent.strip().lower())
        return cleaned_sentences


    def spacy_get_pos_list(self, results):
        word_list, pos_list, lemma_list, ner_list, start_end_list = [], [], [], [], []
        indices = results['sentences']
        for line in indices:
            tokens = line['tokens']
            for token in tokens:
                # (1). save tokens
                word_list.append(token['word'])
                # (2). save pos
                pos_list.append(token['pos'])
                # (3). save lemmas
                lemma = token['lemma'].lower()
                if lemma in self.stopwords: continue
                lemma_list.append(lemma)
                # (4). save NER
                ner_list.append(token['ner'])
                # (5). save start
                start_end_list.append(str(token['characterOffsetBegin']) + "_" + str(token['characterOffsetEnd']))
        output = {"word_list": word_list, 
                  "lemma_list": lemma_list, 
                  "token_start_end_list": start_end_list,
                  "pos_list": pos_list, "ner_list": ner_list}
        return output

    def spacy_generate_features(self, doc, operations='tokenize,ssplit,pos,lemma,ner'):
        """
        Spacy nlp pipeline to generate features such as pos, tokens, ner, dependency. Accepts doc=nlp(text)
        """
        # spacy doc
        doc_json = doc.to_json()  # Includes all operations given by spacy pipeline

        # Get text
        text = doc_json['text']

        # ---------------------------------------- OPERATIONS  ---------------------------------------- #
        # 1. Extract Entity List
        entity_list = doc_json["ents"]

        # 2. Create token lib
        token_lib = {token["id"]: token for token in doc_json["tokens"]}

        # init output json
        output_json = {}
        output_json["sentences"] = []

        # Perform spacy operations on each sent in text
        for i, sentence in enumerate(doc_json["sents"]):
            # init parsers
            parse = ""
            basicDependencies = []
            enhancedDependencies = []
            enhancedPlusPlusDependencies = []

            # init output json
            out_sentence = {"index": i, "line": 1, "tokens": []}
            output_json["sentences"].append(out_sentence)

            # 3. Split sentences by indices(i), add labels (pos, ner, dep, etc.)
            for token in doc_json["tokens"]:

                if sentence["start"] <= token["start"] and token["end"] <= sentence["end"]:
                    
                    # >>> Extract Entity label
                    ner = "O"
                    for entity in entity_list:
                        if entity["start"] <= token["start"] and token["end"] <= entity["end"]:
                            ner = entity["label"]

                    # >>> Extract dependency info
                    dep = token["dep"]
                    governor = 0 if token["head"] == token["id"] else (token["head"] + 1)  # CoreNLP index = pipeline index +1
                    governorGloss = "ROOT" if token["head"] == token["id"] else text[token_lib[token["head"]]["start"]:
                                                                                     token_lib[token["head"]]["end"]]
                    dependent = token["id"] + 1
                    dependentGloss = text[token["start"]:token["end"]]

                    # >>> Extract lemma
                    lemma = doc[token["id"]].lemma_

                    # 4. Add dependencies
                    basicDependencies.append({"dep": dep,
                                              "governor": governor,
                                              "governorGloss": governorGloss,
                                              "dependent": dependent,
                                              "dependentGloss": dependentGloss})
                    # 5. Add tokens
                    out_token = {"index": token["id"] + 1,
                                 "word": dependentGloss,
                                 "originalText": dependentGloss,
                                 "characterOffsetBegin": token["start"],
                                 "characterOffsetEnd": token["end"]}

                    # 6. Add lemmas
                    if "lemma" in operations:
                        out_token["lemma"] = lemma

                    # 7. Add POS tagging
                    if "pos" in operations:
                        out_token["pos"] = token["tag"]

                    # 8. Add NER
                    if "ner" in operations:
                        out_token["ner"] = ner

                    # Update output json
                    out_sentence["tokens"].append(out_token)

            # 9. Add dependencies operation
            if "parse" in operations:
                out_sentence["parse"] = parse
                out_sentence["basicDependencies"] = basicDependencies
                out_sentence["enhancedDependencies"] = out_sentence["basicDependencies"]
                out_sentence["enhancedPlusPlusDependencies"] = out_sentence["basicDependencies"]
        # ---------------------------------------- OPERATIONS  ---------------------------------------- #
        return output_json
    
    def spacy_clean(self, input_sentences):
        batch_size = min(int(np.ceil(len(input_sentences)/100)), 500)
        
        # Part 1: generate spacy textual features (pos, ner, lemma, dependencies)
        sentences = [self.spacy_generate_features(doc) for doc in nlp.pipe(input_sentences, batch_size=batch_size, n_process=-1)]
        
        # Part 2: collect all the features for each sentence
        spacy_sentences = [self.spacy_get_pos_list(sent) for sent in sentences]

        return spacy_sentences


    ## MAIN ##
    def run_pipeline(self, sentences, operation):
        """
        Main module to execute pipeline. Accepts list of strings, and desired operation.
        """
        if operation=="":
            raise Exception("Please pass a cleaning type - `basic`, `deep` or `spacy` !!")

        # run basic cleaning
        if "basic" == operation.lower(): 
            return self.basic_clean(sentences)

        # run deep cleaning
        if "deep" == operation.lower(): 
            return self.deep_clean(sentences)

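        # run spacy pipeline
        if "spacy" == operation.lower(): 
            return self.spacy_clean(sentences)

        # unknown operation: fail loudly instead of silently returning None
        raise ValueError("Unknown cleaning type '%s' - use `basic`, `deep` or `spacy`." % operation)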
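To see what a few of the `basic_clean` steps do to a single sentence, the sketch below applies tag removal, contraction expansion and whitespace collapsing in the same order as the class above. The `CONTRACTIONS` dictionary is a hypothetical stand-in for `contractions.json`, whose contents are not shown here, and `basic_clean_sketch` is an illustrative helper rather than part of the class.

```python
import re

# Hypothetical stand-in for the contents of contractions.json
CONTRACTIONS = {"don't": "do not", "it's": "it is"}

def basic_clean_sketch(sentence):
    """Apply a few of the `basic_clean` steps to a single sentence."""
    sentence = str(sentence).strip()
    # Drop <...> markup left over from web scraping
    sentence = re.sub(r"<(.*?)>", " ", sentence)
    # Expand contractions found in the lookup; unknown ones are only lowercased
    sentence = re.sub(
        r"(\w+'\w+)",
        lambda m: CONTRACTIONS.get(m.group().lower(), m.group().lower()),
        sentence,
    )
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", sentence).strip()

cleaned = basic_clean_sketch("<p>It's   fine, don't worry</p>")
print(cleaned)
```

Note that punctuation and casing outside the contractions are left alone at this stage; the heavier normalization belongs to `deep_clean`.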

### Execute

In [5]:
## settings ##

"""
CUSTOM VOCABULARY ::

- List of words you wish to mark and retain across the preprocessing steps - very important!
- Examples: task-specific or domain-specific keywords.

"""

custom_vocab = ["who", "what", "where", "when", "would", "which", "how", "why", "can", "may", 
                "will", "won't", "does", "does not","doesn't", "do", "do i", "do you", "is it", "would you", 
                "is there", "are there", "is it so", "is this true", "to know", "is that true", "are we", 
                "am i", "question is", "can i", "can we", "tell me", "can you explain", "how ain't", 
                "question", "answer", "questions", "answers", "ask", "can you tell"]


"""
Utilities:
- Lemmatization reduces each word to its dictionary root form (Example: "running" becomes "run", "is" becomes "be").
- This is different from a stemmer (e.g. PorterStemmer).
- Regex-based stemming can be used as an alternative.
- See also spaCy's dependency parsing.

"""

do_lemmatizing = True
#do_chinking = False
#do_chunking = False
#do_dependencyParser = False
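The custom vocabulary is protected by temporarily gluing the words of each phrase together with the preserve key, so that stopword removal cannot break the phrase apart, and then undoing the join at the end. The standalone sketch below mirrors that mechanism from `reserve_keywords_from_cleaning`; the function names `protect` and `restore` and the three-phrase vocabulary are assumptions chosen for illustration.

```python
import re

PRESERVE_KEY = "<$>"  # same joining token as `self.preserve_key`

def protect(text, vocab):
    """Join multi-word vocabulary phrases with PRESERVE_KEY so they act as one token."""
    mapping = {w: PRESERVE_KEY.join(w.split()) for w in vocab}
    # Longest phrases first, so "can you explain" wins over the shorter "can"
    pattern = re.compile("|".join(sorted(map(re.escape, mapping), key=len, reverse=True)))
    return pattern.sub(lambda m: mapping[m.group()], text)

def restore(text):
    """Undo `protect` once cleaning is finished."""
    return re.sub(r"\s+", " ", text.replace(PRESERVE_KEY, " ")).strip()

vocab = ["can you explain", "can", "tell me"]
protected = protect("can you explain this and tell me more", vocab)
print(protected)
print(restore(protected))
```

As in the class, the pattern has no word boundaries, so a short entry such as "can" would also match inside a longer word like "scan"; keeping the vocabulary specific limits that side effect.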

In [6]:
## Preprocessing ##

preprocessText_obj = preprocessText(resources_dir_path, custom_vocab, do_lemmatizing)

def cleaning(data, text_col):
    data["Basic_%s" % text_col] = preprocessText_obj.run_pipeline(data[text_col], "basic")
    data["Deep_%s" % text_col] = preprocessText_obj.run_pipeline(data[text_col], "deep")
    data["Spacy_%s" % text_col] = preprocessText_obj.run_pipeline(data[text_col], "spacy")
    return data


## SAMPLE
# df = cleaning(df, <_TEXT_COLUMN_>)

In [7]:
## Execute ##

df = pd.DataFrame({"TEXT": ['hello i am good', "hello bye!"]})
cleaning(df, "TEXT")

|   | TEXT | Basic_TEXT | Deep_TEXT | Spacy_TEXT |
|---|------|------------|-----------|------------|
| 0 | hello i am good | hello i am good | hello good | {'word_list': ['hello', 'i', 'am', 'good'], 'l... |
| 1 | hello bye! | hello bye! | hello bye | {'word_list': ['hello', 'bye', '!'], 'lemma_li... |


---

## Vectorization Unit (Unit 2/4)

In [10]:
class BertTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, tokenizer, model, max_length=128, embedding_func: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,):
        self.tokenizer = tokenizer
        self.model = model
        self.model.eval()
        self.max_length = max_length
        self.embedding_func = embedding_func
        if self.embedding_func is None:
            self.embedding_func = lambda x: x[0][:, 0, :].squeeze()

    def _tokenize(self, text):
        # Mean Pooling - Take attention mask into account for correct averaging
        def mean_pooling(model_output, attention_mask):
            token_embeddings = model_output[0] #First element of model_output contains all token embeddings
            input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
            sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
            sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
            return sum_embeddings / sum_mask

        # Tokenize the text with the tokenizer passed to the constructor (not a global)
        encoded_input = self.tokenizer(text, padding=True, truncation=True, max_length=self.max_length, return_tensors='pt')

        # Compute token embeddings
        with torch.no_grad():
            model_output = self.model(**encoded_input)

        # Perform mean pooling
        sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

        # one pooled embedding per input sentence: shape (batch_size, hidden_size)
        return sentence_embeddings

    def transform(self, text: List[str]):
        if isinstance(text, pd.Series):
            text = text.tolist()
        return self._tokenize(text)

    def fit(self, X, y=None):
        """No fitting required so we just return ourselves. For fine-tuning, refer to shared gpu-code!"""
        return self
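The mean pooling used in `_tokenize` averages token vectors while ignoring padded positions. The sketch below reproduces the same arithmetic with plain Python lists instead of tensors, purely to make the masking visible; the function name and the sample numbers are hypothetical, and the real code operates on `torch` tensors.

```python
def masked_mean_pooling(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring positions where the mask is 0."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        summed = [0.0] * dim
        for vector, keep in zip(tokens, mask):
            if keep:
                summed = [s + v for s, v in zip(summed, vector)]
        count = max(sum(mask), 1e-9)  # mirrors the clamp that avoids division by zero
        pooled.append([s / count for s in summed])
    return pooled

# One sentence, three token slots, the last slot is padding (hypothetical values)
token_embeddings = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
attention_mask = [[1, 1, 0]]
sentence_vectors = masked_mean_pooling(token_embeddings, attention_mask)
print(sentence_vectors)
```

The padded slot carries large values on purpose: without the mask it would dominate the average, with the mask it has no effect.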

In [None]:
# SENTENCE-BERT VECTORIZATION

# load tokenizer, model classes
tokenizer = AutoTokenizer.from_pretrained(sbert_model_fp)
model_bert = AutoModel.from_pretrained(sbert_model_fp)

# load vectorizer
bert_vectorizer = BertTransformer(tokenizer, model_bert, embedding_func=lambda x: x[0][:, 0, :].squeeze())
print("Bert Model '%s' loaded." % ntpath.basename(sbert_model_fp))

## SAMPLE FOR BERT CLASS VECTORIZATION
# corpus = df['text_col']
# bert_matrix = F.normalize(bert_vectorizer.transform(corpus), p=2, dim=1)
# bert_matrix = csr_matrix(bert_matrix.numpy().astype(np.float64))
# matches = awesome_cossim_topn(A = bert_matrix,
#                               B = bert_matrix.transpose(),
#                               ntop = bert_matrix.shape[0],
#                               lower_bound = 0.65)   # minimum similarity to keep
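Conceptually, the sample above asks for at most `ntop` matches per row and discards scores below a lower bound. The pure-Python sketch below illustrates that selection on a small dense matrix; it is not how `sparse_dot_topn` is implemented, the helper name `top_n_above` is hypothetical, and whether the library treats the bound as strict or inclusive has not been checked here (this sketch uses a strict comparison).

```python
import heapq

def top_n_above(similarity_rows, ntop, lower_bound):
    """For each row keep at most `ntop` (column, score) entries with score > lower_bound."""
    kept = []
    for row in similarity_rows:
        candidates = [(col, score) for col, score in enumerate(row) if score > lower_bound]
        kept.append(heapq.nlargest(ntop, candidates, key=lambda item: item[1]))
    return kept

# Hypothetical cosine similarities between three sentences
rows = [
    [1.00, 0.70, 0.20],
    [0.70, 1.00, 0.66],
    [0.20, 0.66, 1.00],
]
top = top_n_above(rows, ntop=2, lower_bound=0.65)
print(top)
```

Each row always matches itself with a score of 1.0, which is why self-matches usually need to be filtered out before reporting duplicates.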

---

## Pair-wise Fast-Similarity Matrix Creation Unit (Unit 3/4)

In [12]:
class Generate_Similarity_Matrix(object):

    def __init__(self, master, duplicates=None, master_id=None, duplicates_id=None, min_similarity=0.80, vectorizer='tfidf'):
        # UTILITY FUNCTIONS
        def _is_series_of_strings(series_to_test: pd.Series):
            if not isinstance(series_to_test, pd.Series): return False
            elif series_to_test.to_frame().applymap(lambda x: not isinstance(x, str)).squeeze(axis=1).any(): return False
            return True
        def _is_input_data_combination_valid(duplicates, master_id, duplicates_id):
            if duplicates is None and (duplicates_id is not None) or duplicates is not None and ((master_id is None) ^ (duplicates_id is None)): return False
            else: return True

        # VALIDATE INPUT ARGS
        if not _is_series_of_strings(master) or (duplicates is not None and not _is_series_of_strings(duplicates)):
            raise TypeError('Input does not consist of pandas.Series containing only Strings')
        if not _is_input_data_combination_valid(duplicates, master_id, duplicates_id):
            raise Exception('List of data Series options is invalid')
        if master_id is not None and len(master) != len(master_id):
            raise Exception('Both master and master_id must be pandas.Series of the same length.')
        if duplicates is not None and duplicates_id is not None and len(duplicates) != len(duplicates_id):
            raise Exception('Both duplicates and duplicates_id must be pandas.Series of the same length.')

        # SAVE INPUT ARGS
        self._master = master
        self._duplicates = duplicates if duplicates is not None else None
        self._master_id = master_id if master_id is not None else None
        self._duplicates_id = duplicates_id if duplicates_id is not None else None
        self.min_similarity = min_similarity
        self.vectorizer_name = vectorizer     # tfidf, bert

        # CONFIG
        self._true_max_n_matches = None
        self._max_n_matches = len(self._master) if self._duplicates is None else len(self._duplicates)
        self.ngram_size = 3
        self.regex = r'[,-./]|\s'
        self.number_of_processes = max(1, multiprocessing.cpu_count() - 1)   # never drop below one worker
        self.DEFAULT_COLUMN_NAME = 'side'
        self.DEFAULT_ID_NAME = 'id'
        self.LEFT_PREFIX = 'left_'
        self.RIGHT_PREFIX = 'right_'
        self._matches_list = pd.DataFrame()
        self.is_build = False  # indicates if fit has been called or not

        # -- INIT VECTORIZER --
        if self.vectorizer_name=="tfidf":
            def get_n_grams(string):
                if string is not None: string = string.lower()    # lowercasing all str
                string = re.sub(self.regex, r'', string)
                n_grams = zip(*[string[i:] for i in range(self.ngram_size)])
                return [''.join(n_gram) for n_gram in n_grams]
            # - enable fit() in "_get_tf_idf_matrices(self)"
            self._vectorizer = TfidfVectorizer(min_df=1, analyzer=get_n_grams, dtype=np.float64)
        elif self.vectorizer_name=="bert":
            self._vectorizer = BertTransformer(tokenizer, model_bert, embedding_func=lambda x: x[0][:, 0, :].squeeze())
        else:
            raise ValueError("vectorizer must be one of: 'tfidf', 'bert'")
        # -- INIT VECTORIZER --
        return


    def fit(self):
        """
        Fit a vectorizer (already init) with Master & Duplicates matrix and calculate cosine-sim without original-ids.
        Params  : Master, Duplicates
        Return  : dataframe{ Master_Text, Duplicates_Text, cosine_sim(vectorizer_master, vectorized_duplicates) }

        """
        # UTILITY FUNCTIONS
        def fix_diagonal(m: lil_matrix):
            r = np.arange(m.shape[0])
            m[r, r] = 1
            return m
        def symmetrize_matrix(m_symmetric: lil_matrix):
            r, c = m_symmetric.nonzero()
            m_symmetric[c, r] = m_symmetric[r, c]
            return m_symmetric

        # Vectorize the matrices
        # - if duplicate matrix is present use it, else utilize master itself
        master_matrix, duplicate_matrix = self.get_vectorized_matrices()

        # Calculate cosine similarity b/w master & duplicates (if passed, else use master itself)
        matches = self.build_matches(master_matrix, duplicate_matrix)
        self._true_max_n_matches = self._max_n_matches-1

        # Correct sparse matrix multiplication
        if self._duplicates is None:
            # convert to lil format for best efficiency when setting matrix-elements
            # matrix diagonal elements must be exactly 1 (numerical precision errors introduced by floating-point computations
            #                                             in awesome_cossim_topn sometimes lead to unexpected results)
            matches = matches.tolil()
            matches = fix_diagonal(matches)
            if self._max_n_matches < self._true_max_n_matches:
                matches = symmetrize_matrix(matches)
            matches = matches.tocsr()

        # Create the basic "matches" dataframe with "Master, Duplicate and Similarity" cols only
        r, c = matches.nonzero()
        self._matches_list = pd.DataFrame({'master_side': r.astype(np.int64), 'dupe_side': c.astype(np.int64), 'similarity': matches.data})
        self.is_build = True
        return self

    def get_vectorized_matrices(self):
        """
        Vectorize matrices using one of the vectorizers.
        Params    : Master, Duplicates, Vectorizer_name("tfidf", "bert")
        Return    : vectorizer_master, vectorized_duplicates
        """
        def fit_vectorizer():
            # if both master & duplicates series are set - concat them to fit the vectorizer on all strings at once!
            if self._duplicates is not None:
                strings = pd.concat([self._master, self._duplicates])
            else:
                strings = self._master
            self._vectorizer.fit(strings)
            return self._vectorizer

        if self.vectorizer_name=="tfidf":
            print("tfidf vectorization")
            self._vectorizer = fit_vectorizer()
            master_matrix = self._vectorizer.transform(self._master)
            if self._duplicates is not None:
                duplicate_matrix = self._vectorizer.transform(self._duplicates)
            else:
                # IF there is no duplicate matrix, match on the master matrix itself!
                duplicate_matrix = master_matrix

        if self.vectorizer_name=="bert":
            print("bert vectorization")
            master_matrix = self._vectorizer.transform(self._master)
            # --> Convert Tensor Matrices to CSR (np.float64)
            master_matrix = csr_matrix( F.normalize(master_matrix).numpy().astype(np.float64) )
            if self._duplicates is not None:
                duplicate_matrix = self._vectorizer.transform(self._duplicates)
                duplicate_matrix = csr_matrix( F.normalize(duplicate_matrix).numpy().astype(np.float64) )
            else:
                # IF there is no duplicate matrix, match on the master matrix itself!
                duplicate_matrix = master_matrix

        return master_matrix, duplicate_matrix


    def build_matches(self, master_matrix, duplicate_matrix):
        """
        Builds the cosine similarity matrix of two CSR matrices.
        Params   : vectorizer_master, vectorized_duplicates
        Return   : cosine_sim(vectorized_master, vectorized_duplicates)
        """
        # Matrix A, B
        tf_idf_matrix_1 = master_matrix
        tf_idf_matrix_2 = duplicate_matrix.transpose()

        # Calculate cosine similarity
        optional_kwargs = {'use_threads': self.number_of_processes > 1, 'n_jobs': self.number_of_processes}
        cosine_sim_matrix = awesome_cossim_topn(tf_idf_matrix_1,
                                                tf_idf_matrix_2,
                                                self._max_n_matches,
                                                self.min_similarity,
                                                **optional_kwargs)
        return cosine_sim_matrix


    def get_matches(self):
        """
        Creates the complete dataframe with index matching(ids) if passed.
        Params  : dataframe
        Return  : dataframe{ Master_ids, Master_Text, cosine_similarity, Duplicate_ids, Duplicates_Text }
        """
        # UTILITY FUNCTIONS
        def get_both_sides(master, duplicates, generic_name=(self.DEFAULT_COLUMN_NAME, self.DEFAULT_COLUMN_NAME), drop_index=False):
            lname, rname = generic_name
            left = master if master.name else master.rename(lname)
            left = left.iloc[matches_list.master_side].reset_index(drop=drop_index)
            if self._duplicates is None:
                right = master if master.name else master.rename(rname)
            else:
                right = duplicates if duplicates.name else duplicates.rename(rname)
            right = right.iloc[matches_list.dupe_side].reset_index(drop=drop_index)
            return left, (right if isinstance(right, pd.Series) else right[right.columns[::-1]])
        def prefix_column_names(data, prefix):
            if isinstance(data, pd.DataFrame):
                return data.rename(columns={c: f"{prefix}{c}" for c in data.columns})
            else:
                return data.rename(f"{prefix}{data.name}")

        if self.min_similarity > 0:
            matches_list = self._matches_list
        else:
            raise Exception("min_similarity cannot be set to less than or equal to 0!")

        # ID Retrieval
        left_side, right_side = get_both_sides(self._master, self._duplicates, drop_index=False)
        similarity = matches_list.similarity.reset_index(drop=True)
        if self._master_id is None:
            # if ids are not passed
            return pd.concat([prefix_column_names(left_side, self.LEFT_PREFIX),
                              similarity,
                              prefix_column_names(right_side, self.RIGHT_PREFIX)], axis=1)

        else:
            # if ids are passed, retrieve ids
            left_side_id, right_side_id = get_both_sides(self._master_id, self._duplicates_id, (self.DEFAULT_ID_NAME, self.DEFAULT_ID_NAME), drop_index=True)
            return pd.concat([prefix_column_names(left_side, self.LEFT_PREFIX),
                              prefix_column_names(left_side_id, self.LEFT_PREFIX),
                              similarity,
                              prefix_column_names(right_side_id, self.RIGHT_PREFIX),
                              prefix_column_names(right_side, self.RIGHT_PREFIX)], axis=1)
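
For the `tfidf` option, the class feeds character n-grams to `TfidfVectorizer` through its `get_n_grams` analyzer. The sketch below reproduces that analyzer as a standalone function so its output can be inspected; the constants `NGRAM_SIZE` and `REGEX` mirror `self.ngram_size` and `self.regex`, and the `n` parameter is an addition made here for illustration.

```python
import re

NGRAM_SIZE = 3
REGEX = r'[,-./]|\s'   # same pattern as in the class: strips , - . / and whitespace

def get_n_grams(string, n=NGRAM_SIZE):
    """Lowercase, strip separators, then emit overlapping character n-grams."""
    string = re.sub(REGEX, '', string.lower())
    # zip over n shifted copies of the string yields one tuple per n-gram
    return [''.join(gram) for gram in zip(*[string[i:] for i in range(n)])]

grams = get_n_grams("Open 529")
print(grams)              # ['ope', 'pen', 'en5', 'n52', '529']
print(get_n_grams("a."))  # [] because fewer than 3 characters remain
```

Strings shorter than the n-gram size produce no features at all, so very short records cannot match anything under `tfidf`.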

In [14]:
# Similarity Computation Engine

def match_strings(master, duplicates=None, master_id=None, duplicates_id=None, min_similarity=0.80, vectorizer=None):
    """ 
    Find pair-wise similarity b/w 'Master' & 'Duplicate' Matrices. 
    """
    if vectorizer and vectorizer.lower().strip() in ['tfidf', 'bert']:
        vectorizer=vectorizer.lower().strip()
        cos_sim_matrix = Generate_Similarity_Matrix(master, duplicates=duplicates, master_id=master_id, duplicates_id=duplicates_id, min_similarity=min_similarity, vectorizer=vectorizer)
        cos_sim_matrix.fit()                     # run vectorizer & generate basic pair-wise cosine sim matrix
        sim_df = cos_sim_matrix.get_matches()    # add ids if passed to sim matrix
        return sim_df
    else:
        raise Exception("Vectorizer is not passed or incorrect! Please select one: [tfidf, bert]")

In [16]:
## GENERIC RUN ::


# Textual semantic Similarity between all strings of file A:
# matches = match_strings(master = df[<_text_col_>], master_id = df[<_id_col_>], 
#                         min_similarity=0.60, vectorizer='bert')

## Textual semantic similarity between two files A and B:
# matches = match_strings(master = df[__text_col_A__], master_id = df[__id_col_A__], 
#                         duplicates = df[__text_col_B__], duplicates_id = df[__id_col_B__], 
#                         min_similarity=0.85, vectorizer='bert')

## Textual semantic similarity between two files A and B having no ids:
# matches = match_strings(master = df[__text_col_A__], duplicates = df[__text_col_B__],
#                         min_similarity=0.85, vectorizer='bert')
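
Conceptually, `match_strings` keeps, for each master row, at most `ntop` candidates whose cosine score clears the threshold, and returns them as (master, duplicate, similarity) rows. The sketch below shows that selection in plain Python, assuming the vectors are already L2-normalized. The function `top_n_matches` is hypothetical, and the strict `>` comparison is an assumption of this sketch rather than a statement about the library.

```python
def top_n_matches(master_vecs, dupe_vecs, ntop, min_similarity):
    """Return (master_side, dupe_side, similarity) triples.

    Assumes every vector is already L2-normalized, so the dot product
    is the cosine similarity.
    """
    triples = []
    for i, u in enumerate(master_vecs):
        scored = [(j, sum(a * b for a, b in zip(u, v))) for j, v in enumerate(dupe_vecs)]
        # keep only scores above the threshold, best first, at most ntop per row
        kept = sorted((s for s in scored if s[1] > min_similarity),
                      key=lambda s: s[1], reverse=True)[:ntop]
        triples.extend((i, j, score) for j, score in kept)
    return triples

# Toy unit vectors (made-up, 2-dimensional)
master = [[1.0, 0.0], [0.0, 1.0]]
dupes = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

pairs = top_n_matches(master, dupes, ntop=2, min_similarity=0.7)
print(pairs)   # [(0, 0, 1.0), (1, 2, 1.0), (1, 1, 0.8)]
```

This brute-force loop is quadratic in the number of rows; the sparse top-n kernel used in the notebook exists precisely to avoid materializing every pair.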

---

## Similarity Computation Unit (Unit 4/4)

**STEPS**

- Option 1: Run similarity analysis on 1 source file to get duplicates and very-similar records.
- Option 2: Run similarity analysis between 2 files: "source" & "target".

### OPTION 1. Compute Similarity on 1 file (i.e. source)

- Runs similarity computation on one "source" file and adds four columns: duplicate ids (`dup_idx`), similar ids (`similar_idx`), merged duplicate-and-similar ids (`dup_similar_idx`), and a cluster id (`cluster_id`).
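
The cluster ids come from merging every id list that shares at least one element with another. The notebook does this with `networkx` connected components; the sketch below swaps in a small union-find so it runs on the standard library alone. The function `merge_id_lists` and the numbering of clusters by sorted order are assumptions made for this example.

```python
def merge_id_lists(id_lists):
    """Merge lists sharing at least one id into sorted clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for ids in id_lists:
        for other in ids:
            union(ids[0], other)            # also registers single-id lists

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())

# similar-id lists as they appear in the sample run of this notebook
similar_lists = [[100, 102], [100, 101], [100, 102], [103, 105], [104], [103, 105]]
clusters = merge_id_lists(similar_lists)
id_to_cluster = {uid: "cluster_%d" % k
                 for k, members in enumerate(clusters, start=1)
                 for uid in members}
print(clusters)            # [[100, 101, 102], [103, 105], [104]]
print(id_to_cluster[105])  # cluster_2
```

Note that the merge is transitive: 101 and 102 end up together only because both are similar to 100.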

In [8]:
def compute_similarity_source_file(df, source_min_similarity=0.75):

    # 1. collect duplicate ids based on "text" col
    def collect_dups(data, id_col, dup_col, output_col_name):
        dup_dict = data.reset_index()\
                    .groupby(data[dup_col].tolist())[id_col]\
                    .agg(list).reset_index().reset_index(drop=True)\
                    .rename(columns={"index": dup_col, id_col: output_col_name})
        dup_dict = dup_dict.set_index(dup_col)[output_col_name].to_dict()
        data[output_col_name] = data[dup_col].apply(lambda txt: dup_dict[txt])
        return data

    # 2. drop dup ids, keep first
    def drop_dups(data, col):
        return data.drop_duplicates(subset=[col]).reset_index(drop=True)

    # 3. collect similar pairs
    def pairwise_similarity_matrix(data, id_col, text_col, similar_id_col, min_similarity=0.75):
        # TEXTUAL SIMILARITY

        # MODULE 1 :: pair-wise textual similarity
        matches = match_strings(master=data[text_col], master_id=data[id_col], min_similarity=min_similarity, vectorizer='bert')

        # group similar-pairs together (left-join)
        # dict-style renaming inside SeriesGroupBy.agg is not supported by pandas >= 1.0, so rename the result instead
        left_col_name, left_unique_id, right_unique_id = "left_%s" % text_col, "left_%s" % id_col, "right_%s" % id_col
        match_df = matches.groupby([left_col_name, left_unique_id])[right_unique_id]\
                          .agg(lambda x: sorted(set(x))).rename(similar_id_col)\
                          .reset_index().sort_values(by=[left_unique_id], ascending=True).reset_index(drop=True)

        # aesthetic: drop dummy added left/right names
        matches = matches.drop(columns=['left_index', "right_index"])
        match_df = match_df.rename(columns={left_unique_id: id_col, left_col_name: text_col})
        return matches, match_df

    ## Utility: alphanumeric sort
    _nsre = re.compile('([0-9]+)')
    def natural_sort_key(s):
        return [int(text) if text.isdigit() else text.lower()
                for text in re.split(_nsre, s)]

    # 4. create "dup_similar_idx" col - merge dup_id data with similar_id data
    def combine_dup_similar(data, match_df, id_col, dup_id_col, similar_id_col, dup_sim_id_col):

        # merge df_duplicates with df_similar
        cols_to_use = [id_col, similar_id_col]
        data = data.merge(match_df[cols_to_use], on=id_col, how='outer')

        # create combined list: dup_ids + similar_ids
        # --> "dup_similar_id_col" == "duplicated_pairs_idx" + "similar_pairs_idx"
        data[dup_sim_id_col] = [sorted(set(sum(tup, []))) for tup in zip(data[dup_id_col], data[similar_id_col])]

        # custom sorting (to handle alphanumeric ids); inspect the first id, not the list that holds it
        if isinstance(data[dup_sim_id_col].iloc[0][0], str):
            data[dup_sim_id_col] = data[dup_sim_id_col].apply(lambda x: sorted(x, key=natural_sort_key))
        return data

    # 5. merge all nested lists containing common sub-elements in "dup_similar_id" cols
    def collect_similar_ids(data, id_col, dup_id_col, similar_id_col, dup_sim_id_col):

        # collect nested list which needs to be merged
        list_similar_ids = list(map(list, data[dup_sim_id_col]))

        # merge all nested lists with common elements
        g = nx.Graph()
        for p in list_similar_ids:
            # a single-id list becomes a self-loop so the id still forms its own component
            g.add_edges_from(zip(p, p[1:]) if len(p) > 1 else zip(p, p))
        merged_similar_idx = [sorted(c) for c in nx.connected_components(g)]

        # create two mappings, one for storing cluster_id: list of ids, and one inverted dict
        # --> "id_clus_dict" is the cluster id mapping for each 'unique_id'
        temp_id = 1
        clus_id_dict = {}  # cluster_1: merged([id1, id2,..., idn])
        id_clus_dict = {}  # merged(id1): cluster_1; merged(id2): cluster_1; .., merged(idn): cluster_1
        for lst in merged_similar_idx:
            key = "cluster_%s"%temp_id
            for value in lst:
                id_clus_dict[value] = key
            clus_id_dict[key]=lst
            temp_id+=1

        # assign dup_similar_idx based on two mappings above
        data[dup_sim_id_col] = data[id_col].apply(lambda uid: clus_id_dict[id_clus_dict[uid]])

        # create duplicate id mapping and similar id mapping files
        dup_id_dict = {_id: ids for ids in data[dup_id_col].tolist() for _id in ids}
        sim_id_dict = {_id: ids for ids in data[similar_id_col].tolist() for _id in ids}
        dup_sim_id_dict = {_id: ids for ids in data[dup_sim_id_col].tolist() for _id in ids}

        # custom sorting (to handle alphanumeric ids); inspect the first id, not the list that holds it
        if isinstance(data[dup_sim_id_col].iloc[0][0], str):
            data[dup_sim_id_col] = data[dup_sim_id_col].apply(lambda x: sorted(x, key=natural_sort_key))
        return data, clus_id_dict, id_clus_dict, dup_id_dict, sim_id_dict, dup_sim_id_dict

    # 6. Drop duplicates based on dup_similar_id_col, i.e. duplicated_id + similar_ids
    def create_final_single_matrix(data, dup_sim_id_col):
        data[dup_sim_id_col] = tuple(map(tuple, data[dup_sim_id_col]))
        data = data.drop_duplicates(subset=[dup_sim_id_col]).reset_index(drop=True)
        data[dup_sim_id_col] = list(map(list, data[dup_sim_id_col]))
        return data

    # 7. Expand each id to assign clusters
    def create_clusters(data, id_col, idx_cluster_map):
        data['cluster_id'] = data[id_col].apply(lambda uid: idx_cluster_map.get(uid, -1))
        return data

    # 8. Display Analytics
    def run_stats():
        print("Stats:\n"
          "\nOrignal number of records = {}"
          "\nTotal dup count = {}"
          "\nTotal similar pairs found = {}"
          "\nFinal number of records post dup-similar rows removal = {}".format(len(original_df),
                                                                              len(sum(df[source_dup_id_col].tolist(), [])),
                                                                              len(sum(df[source_similar_id_col].tolist(), [])),
                                                                              len(df)))
    
    """
    Run similarity computation on pre-processed Master file involving Dup identification, collection & removal, finally running
    similarity analysis on remaining unique rows.
    param    : dataframe (single file containing - 'unique_id', 'cleaned_text')
    return   : original dataframe annotated with dup_idx, similar_idx, dup_similar_idx and cluster_id
    """
    
    ## EXECUTE ##
    
    if not isinstance(df, pd.DataFrame):
        raise Exception("Please pass a dataframe object")
    if source_id_col not in df.columns or source_text_col not in df.columns:
        raise Exception("Input dataframe should contain '%s' and '%s' fields, as set in config" % (source_id_col, source_text_col))
    if source_clean_text_col not in df:
        raise Exception("Source dataframe must be pre-processed! Perform cleaning on Source df's '%s'!" % source_clean_text_col)
    
    original_df = df.copy()
    df = collect_dups(df, source_id_col, source_clean_text_col, source_dup_id_col)
    df = drop_dups(df, source_clean_text_col)
    matches, match_df = pairwise_similarity_matrix(df, source_id_col, source_clean_text_col, source_similar_id_col, min_similarity=source_min_similarity)
    df = combine_dup_similar(df, match_df, source_id_col, source_dup_id_col, source_similar_id_col, source_dup_similar_id_col)
    df, cluster_id_map, idx_cluster_map, dup_id_dict, sim_id_dict, dup_sim_id_dict = collect_similar_ids(df,  source_id_col, source_dup_id_col, source_similar_id_col, source_dup_similar_id_col)
    df = create_final_single_matrix(df, source_dup_similar_id_col)
    df['cluster_id'] = df[source_id_col].apply(lambda uid: idx_cluster_map.get(uid, -1))
    # save back in original df (without dups or similar dropped!)
    original_df['dup_idx'] = original_df[source_id_col].apply(lambda uid: dup_id_dict.get(uid, -1))
    original_df['similar_idx'] = original_df[source_id_col].apply(lambda uid: sim_id_dict.get(uid, -1))
    original_df['dup_similar_idx'] = original_df[source_id_col].apply(lambda uid: dup_sim_id_dict.get(uid, -1))
    original_df['cluster_id'] = original_df[source_id_col].apply(lambda uid: idx_cluster_map.get(uid, -1))
    run_stats()
    return original_df

In [9]:
## SAMPLE EXECUTION
## columns that should be present already in the file
# source_id_col = 'ID'                 
# source_text_col = 'TEXT'
## columns that will be created during runtime
# source_clean_text_col = 'C_TEXT'               
# source_dup_id_col = "dup_idx"                       
# source_similar_id_col = "similar_idx"           
# source_dup_similar_id_col = "dup_similar_idx"  
# df = cleaning(df, source_text_col, source_clean_text_col)
# df = compute_similarity_source_file(df, source_min_similarity=0.75)
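
Steps 1 and 2 of the function (collect exact duplicates, then keep the first of each) can be pictured without pandas. The sketch below works on a plain list of `(id, cleaned_text)` pairs; the sample rows are made up, and these standalone `collect_dups` and `drop_dups` functions are simplified stand-ins for the nested helpers of the same name.

```python
from collections import defaultdict

def collect_dups(rows):
    """rows: list of (id, cleaned_text). Returns {id: [ids with identical text]}."""
    by_text = defaultdict(list)
    for uid, text in rows:
        by_text[text].append(uid)
    return {uid: by_text[text] for uid, text in rows}

def drop_dups(rows):
    """Keep the first row for each distinct cleaned text."""
    seen, unique = set(), []
    for uid, text in rows:
        if text not in seen:
            seen.add(text)
            unique.append((uid, text))
    return unique

rows = [(1, "open 529"), (2, "close account"), (3, "open 529")]   # made-up sample
dup_idx = collect_dups(rows)
unique_rows = drop_dups(rows)
print(dup_idx)      # {1: [1, 3], 2: [2], 3: [1, 3]}
print(unique_rows)  # [(1, 'open 529'), (2, 'close account')]
```

Only the unique rows are sent to the comparatively expensive similarity step; the `dup_idx` mapping lets the dropped ids be restored afterwards.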

### OPTION 2. Compute Similarity between 2 files

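- Runs similarity computation between two files, "Source" and "Target". For each source record it adds the most similar target text (`most_similar_FAQ`), its score (`max_sim_score_FAQ`), the matching target ids (`similar_IDs`) and their count (`sim_FAQ_count`); it also returns the pair-wise matches dataframe. The source dataframe must already carry the Option 1 columns (`dup_similar_idx`, `cluster_id`).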
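
The per-record summary that Option 2 attaches to the source file can be sketched as a small aggregation over (source id, target id, similarity) triples. The function `summarize_matches`, its dictionary keys and the sample triples are illustrative assumptions; the ordering rule (best score first, ties broken by ascending target id) follows the `sort_values` call in the function below.

```python
def summarize_matches(source_ids, matches):
    """matches: list of (source_id, target_id, similarity) triples."""
    summary = {}
    for uid in source_ids:
        # best score first; ties broken by ascending target id
        hits = sorted((m for m in matches if m[0] == uid),
                      key=lambda m: (-m[2], m[1]))
        summary[uid] = {
            "most_similar_id": hits[0][1] if hits else None,
            "max_sim_score": hits[0][2] if hits else 0.0,
            "pairs": [m[1] for m in hits] or None,   # None when nothing matched
            "count": len(hits),
        }
    return summary

# made-up matches: source 100 has three hits, 103 has one, 104 has none
matches = [(100, 7, 0.91), (100, 3, 0.91), (100, 9, 0.65), (103, 4, 0.72)]
summary = summarize_matches([100, 103, 104], matches)
print(summary[100]["pairs"])   # [3, 7, 9]
print(summary[104])            # {'most_similar_id': None, 'max_sim_score': 0.0, 'pairs': None, 'count': 0}
```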

In [44]:
def compute_similarity_source_target_files(source_df, target_df, vectorization, min_similarity=0.80):
    """
    Run similarity computation between 2 "pre-processed" files: Master and child(duplicate) file.
    param    : dataframe_1('unique_id', 'cleaned_text'), dataframe_2('unique_id', 'cleaned_text')
    return   : original source df with match columns added, pair-wise comparison df
    """
    
    # VALIDATION
    if not isinstance(source_df, pd.DataFrame) or not isinstance(target_df, pd.DataFrame):
        raise Exception("Please pass a dataframe object!")
    if source_id_col not in source_df.columns or source_text_col not in source_df.columns:
        raise Exception("Source dataframe should contain '%s' and '%s' fields, as set in config" % (source_id_col, source_text_col))
    if target_id_col not in target_df.columns or target_text_col not in target_df.columns:
        raise Exception("Target dataframe should contain '%s' and '%s' fields, as set in config" % (target_id_col, target_text_col))
    if source_clean_text_col not in source_df or target_clean_text_col not in target_df:
        raise Exception("Dataframes are not pre-processed! Perform cleaning on both Source and Target dfs!")
    if vectorization not in ['tfidf', 'bert']:
        raise Exception("Vectorization error - please use any one: ['tfidf', 'bert']")

    # => SOURCE/MASTER DF
    # - drop dups + similar, keep only unique records to compare with
    # - expects the Option 1 columns (dup_similar_idx, cluster_id) to be present
    original_source_df = source_df.copy()
    source_df = source_df.copy()    # work on a copy so the caller's dataframe is not mutated
    source_df[source_dup_similar_id_col] = source_df[source_dup_similar_id_col].apply(tuple)
    source_df = source_df.drop_duplicates(subset=[source_dup_similar_id_col]).reset_index(drop=True)
    master_ids, master_sents = source_df[source_id_col], source_df[source_clean_text_col]

    # => TARGET/COMPARISON/DUPLICATES DF (target_df or duplicates_df)
    child_ids, child_sents = target_df[target_id_col], target_df[target_clean_text_col]

    # Pair-wise Fast Similarity Matrix Creation
    matches = match_strings(master=master_sents, master_id=master_ids, duplicates=child_sents, duplicates_id=child_ids, min_similarity=min_similarity, vectorizer=vectorization)
    matches = matches.drop(columns=['left_index', 'right_index'])

    # STORE RESULTS
    output=dict()
    for uid in source_df[source_id_col].values:
        # subset from matches_df
        similar = matches[matches['left_%s' % source_id_col] == uid].sort_values(by=['similarity','right_%s' % target_id_col], ascending=[False, True])
        # storing in source_df
        temp_dict = dict()
        temp_dict["most_similar"], temp_dict["max_sim_score"], temp_dict["pairs"], temp_dict["count"] = None, 0.0, None, 0
        if len(similar)>0:
            most_similar_id = similar['right_%s' % target_id_col].values[0]
            temp_dict["most_similar"] = target_df[target_df[target_id_col]==most_similar_id][target_text_col].values[0]
            temp_dict["max_sim_score"] = similar['similarity'].values[0]
            temp_dict["pairs"] = similar['right_%s' % target_id_col].tolist()
            temp_dict["count"] = int(len(similar['right_%s' % target_id_col].tolist()))
        output[uid] = temp_dict
    source_df["most_similar_FAQ"] = source_df[source_id_col].apply(lambda x: output[x]['most_similar'])
    source_df["max_sim_score_FAQ"] = source_df[source_id_col].apply(lambda x: output[x]['max_sim_score'])
    source_df["similar_IDs"] = source_df[source_id_col].apply(lambda x: output[x]['pairs'])
    source_df["sim_FAQ_count"] = source_df[source_id_col].apply(lambda x: output[x]['count'])

    # save back in original df (without dups or similar dropped!)
    original_source_df = original_source_df.merge(source_df[['cluster_id', 'most_similar_FAQ', 'max_sim_score_FAQ', 'similar_IDs', 'sim_FAQ_count']], on=['cluster_id']).sort_values(by=[source_id_col]).reset_index(drop=True)

    # pair-wise comparison results, original df with all columns
    return original_source_df, matches

In [45]:
## SAMPLE EXECUTION
# source_id_col = 'ID'                 
# source_text_col = 'TEXT'
# source_clean_text_col = 'C_TEXT'               
# source_dup_id_col = "dup_idx"                       
# source_similar_id_col = "similar_idx"           
# source_dup_similar_id_col = "dup_similar_idx"  
# source_df = cleaning(source_df, source_text_col, source_clean_text_col)
# target_id_col = 'ID'
# target_text_col = 'FAQ'
# target_clean_text_col = 'clean_FAQ'
# target_df = cleaning(target_df, target_text_col, target_clean_text_col)
# df, pair_wise_sim_df = compute_similarity_source_target_files(source_df, target_df, vectorization="bert", min_similarity=0.80)

---


# `Execute` - Final Module

### A. Configuration

In [17]:
# :: config ::

# Source File 
source_id_col = 'ID'                                # already present!, mention the column name here
source_text_col = 'TEXT'                            # already present!, mention the column name here
source_clean_text_col = 'clean_text'                # will be created, give a name
source_dup_id_col = "dup_idx"                       # will be created, give a name
source_similar_id_col = "similar_idx"               # will be created, give a name
source_dup_similar_id_col = "dup_similar_idx"       # will be created, give a name


# [OPTIONAL] 
# Target File
target_id_col = 'ID'                                # already present!, mention the column name here
target_text_col = 'TEXT'                            # already present!, mention the column name here
target_clean_text_col = 'clean_TEXT2'               # will be created, give a name
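
Because every later step looks columns up by these configured names, it can help to check them against a loaded file before running anything expensive. A minimal sketch, assuming a hypothetical helper `missing_columns` that is not part of this notebook and a made-up column list standing in for `list(df.columns)`:

```python
def missing_columns(available, required):
    """Return the required column names that are absent, preserving order."""
    present = set(available)
    return [col for col in required if col not in present]

source_columns = ["ID", "TEXT", "notes"]   # stands in for list(df.columns) of a loaded file
missing = missing_columns(source_columns, ["ID", "TEXT"])
if missing:
    raise ValueError("Source file is missing columns: %s" % ", ".join(missing))

# A target file that uses 'FAQ' instead of the configured 'TEXT' would be reported
print(missing_columns(["ID", "FAQ"], ["ID", "TEXT"]))   # ['TEXT']
```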

### B. Load input data

In [20]:
def read_source_file(source_fp):
    if not os.path.exists(source_fp):
        raise Exception("File not found!")
    # read from the same path that was just checked for existence
    if source_fp.lower().endswith('.csv'): 
        df = pd.read_csv(source_fp)
    elif source_fp.lower().endswith(('.xlsx', '.xls')): 
        df = pd.read_excel(source_fp)
    else: 
        raise Exception("Unsupported file format!")
    # pre-processing
    df = df[[source_id_col, source_text_col]].copy()   # copy avoids SettingWithCopyWarning on the next line
    df[source_text_col] = df[source_text_col].fillna(value="NONE")
    df = cleaning(df, source_text_col, source_clean_text_col)
    print("Source file is ready.")
    return df


## OPTIONAL
def read_target_file(target_fp):
    if not os.path.exists(target_fp):
        raise Exception("File not found!")
    # read from the same path that was just checked for existence
    if target_fp.lower().endswith('.csv'): 
        df = pd.read_csv(target_fp)
    elif target_fp.lower().endswith(('.xlsx', '.xls')): 
        df = pd.read_excel(target_fp)
    else: 
        raise Exception("Unsupported file format!")
    # pre-processing
    df = df[[target_id_col, target_text_col]].copy()   # copy avoids SettingWithCopyWarning on the next line
    df[target_text_col] = df[target_text_col].fillna(value="NONE")
    df = cleaning(df, target_text_col, target_clean_text_col)
    print("Target file is ready.")
    return df
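
Both readers choose a parser from the file extension. One way to keep that decision in a single place is a small lookup keyed on the lowercased extension, sketched below. The names `READERS` and `detect_format` are hypothetical, and the returned labels would still have to be mapped to `pd.read_csv` or `pd.read_excel` by the caller.

```python
import os

# extension -> reader label (illustrative mapping)
READERS = {".csv": "csv", ".xlsx": "excel", ".xls": "excel"}

def detect_format(path):
    """Return 'csv' or 'excel' based on the case-insensitive file extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return READERS[ext]
    except KeyError:
        raise ValueError("Unsupported file format: %r" % ext) from None

print(detect_format("input/file_1.xlsx"))   # excel
print(detect_format("input/FILE_2.CSV"))    # csv
```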

In [48]:
## load a source file ::


source_fp = input_data_fp + "/file_1.xlsx"
df_source = read_source_file(source_fp)
df_source

Source file is ready.


|   | ID | TEXT | clean_text |
|---|---|---|---|
| 0 | 100 | How to open 529 accounts? | how open 529 account |
| 1 | 101 | open 529 | open 529 |
| 2 | 102 | how do people open 529 acc? | how do people open 529 acc |
| 3 | 103 | close my account | close account |
| 4 | 104 | close ira accounts | close ira account |
| 5 | 105 | closing all accounts | closing account |


### C. Compute

In [49]:
# 1. Run -----------------------------------------------------
# Compute Similarity on 1 file (i.e. source)
#    >> find dups & very similar records in source file
# ------------------------------------------------------------

df_source = compute_similarity_source_file(df_source, source_min_similarity=0.75)
df_source

bert vectorization
Stats:

Orignal number of records = 6
Total dup count = 3
Total similar pairs found = 6
Final number of records post dup-similar rows removal = 3




|   | ID | TEXT | clean_text | dup_idx | similar_idx | dup_similar_idx | cluster_id |
|---|---|---|---|---|---|---|---|
| 0 | 100 | How to open 529 accounts? | how open 529 account | [100] | [100, 102] | [100, 101, 102] | cluster_1 |
| 1 | 101 | open 529 | open 529 | [101] | [100, 101] | [100, 101, 102] | cluster_1 |
| 2 | 102 | how do people open 529 acc? | how do people open 529 acc | [102] | [100, 102] | [100, 101, 102] | cluster_1 |
| 3 | 103 | close my account | close account | [103] | [103, 105] | [103, 105] | cluster_2 |
| 4 | 104 | close ira accounts | close ira account | [104] | [104] | [104] | cluster_3 |
| 5 | 105 | closing all accounts | closing account | [105] | [103, 105] | [103, 105] | cluster_2 |


In [59]:
# 2. Run -----------------------------------------------------
# Compute Similarity between 2 files
#    >> run computation b/w source & target files
# ------------------------------------------------------------


# load another target file (for comparison)
target_fp = input_data_fp + "/file_2.xlsx"
df_target = read_target_file(target_fp)

# run computation
df, pair_wise_sim = compute_similarity_source_target_files(df_source,  df_target,
                                                           vectorization="bert",
                                                           min_similarity=0.60)

Target file is ready.
bert vectorization


---

### D. Results

In [89]:
fn = "df.csv"   # output file name
df.to_csv(os.path.join(output_data_fp, fn), index=False)
print("Exported output:", fn)

Exported output: df.csv
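
One thing to keep in mind when reloading the export: list-valued columns such as `similar_idx` are written to CSV as their text representation, so they come back as strings. The sketch below parses them back into lists with `ast.literal_eval`, using an in-memory CSV with made-up rows shaped like the exported file.

```python
import ast
import csv
import io

# A tiny CSV shaped like the exported file (made-up rows)
exported = io.StringIO(
    'ID,similar_idx,cluster_id\n'
    '100,"[100, 102]",cluster_1\n'
    '104,[104],cluster_3\n'
)

rows = []
for row in csv.DictReader(exported):
    # list cells arrive as text such as "[100, 102]"; literal_eval parses them safely
    row["similar_idx"] = ast.literal_eval(row["similar_idx"])
    rows.append(row)

print(rows[0]["similar_idx"])   # [100, 102]
print(rows[1]["similar_idx"])   # [104]
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.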


----