# Difference detector

This notebook looks for the differences between two versions of the same service level agreement (SLA). Google's diff-match-patch package is used to detect the differences between two sentences.

The spaCy library is used to split the text into sentences.

In [None]:
!pip install diff-match-patch
!python3 -m spacy download en_core_web_sm
!pip install pandas

In [38]:
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm', disable=['ner'])

First, we check whether each sentence of text 1 has a similar sentence in text 2, or vice versa; for this we use the get_close_matches function from the difflib module. Next, the diff_match_patch module is used to obtain the differences between the two matched sentences. That information is then processed and printed so that the text in bold is the text that differs.
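The matching step can be tried in isolation. The following is a minimal sketch, assuming made-up sample sentences and a hypothetical helper named best_match (neither comes from the SLA files); it only illustrates how get_close_matches behaves with n=1 and a cutoff of 0.5, the values used in the cells below.

```python
import difflib

# Illustrative sentences only; they are not taken from the SLA files.
old_sentences = [
    "AWS will use commercially reasonable efforts to make AWS Glue available.",
    "Exclusions",
]
new_sentences = [
    "AWS will use commercially reasonable efforts to make each Included Service available.",
    "Included Services",
]

def best_match(sentence, candidates, cutoff=0.5):
    """Hypothetical helper: return the closest candidate, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(sentence, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for sentence in old_sentences:
    print(repr(sentence), "->", repr(best_match(sentence, new_sentences)))
```

A higher cutoff makes the matching stricter: with cutoff=0.95 the first sentence would no longer find a partner, because the two versions differ by more than a few characters.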

In [39]:
from IPython.display import Markdown, display
import diff_match_patch as dmp_module
import difflib

def printmd(string):
    colorstr = "<span>{}</span>".format(string)
    display(Markdown(colorstr))

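def open_file(str1, str2):
    # Open both files as UTF-8; the caller is responsible for closing them.
    text1 = open(str1, encoding="utf8")
    text2 = open(str2, encoding="utf8")
    return text1, text2

def close_file(file1, file2):
    # Close the two file objects returned by open_file.
    file1.close()
    file2.close()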

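def line_similar(line1, line2):
    # Diff the two lines and return both with their differences in bold.
    comparation = line_comparator(line1, line2)
    return mark_diferences(comparation)

def line_comparator(line1, line2):
    # Compute the diff and clean it up so the chunks are human-readable.
    dmp = dmp_module.diff_match_patch()
    diff = dmp.diff_main(line1, line2)
    dmp.diff_cleanupSemantic(diff)
    return diff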

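def aux(i, lines_t1, lines_t2, final_text_1, final_text_2):
    # Fallback when no close match exists: compare with the line at the same position in text 2.
    if final_text_2[i] is None:
        comparation = line_comparator(lines_t1[i], lines_t2[i])
        line_t1, line_t2 = mark_diferences(comparation)
        final_text_1[i] = line_t1
        final_text_2[i] = line_t2
    else:
        # That position is already matched, so the whole line of text 1 is marked as different.
        final_text_1[i] = "**" + lines_t1[i].strip() + "** "
    return final_text_1, final_text_2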

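def mark_diferences(comparation):
    # Each diff entry is (op, text): 0 = equal, 1 = only in text 2, -1 = only in text 1.
    line_t1 = ""
    line_t2 = ""
    for op, text in comparation:
        # Whitespace-only chunks are skipped entirely.
        if text.strip() != "":
            if op == 0:
                line_t1 += text
                line_t2 += text
            elif op == 1:
                # Inserted text: bold in text 2 only.
                line_t2 += "**" + text.strip() + "** "
            else:
                # Deleted text: bold in text 1 only.
                line_t1 += "**" + text.strip() + "** "
    return line_t1, line_t2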

def comparation_process(i, lines_t1, lines_t2, final_text_1, final_text_2):
    # Look for the most similar line in text 2 (similarity ratio of at least 0.5).
    d = difflib.get_close_matches(lines_t1[i], lines_t2, n=1, cutoff=0.5)
    if d:
        j = lines_t2.index(d[0])
        line_t1, line_t2 = line_similar(lines_t1[i], lines_t2[j])
        final_text_1[i] = line_t1
        final_text_2[j] = line_t2
    elif i < len(lines_t2):
        final_text_1, final_text_2 = aux(i, lines_t1, lines_t2, final_text_1, final_text_2)
    else:
        # Text 1 is longer than text 2: mark the whole line as different.
        final_text_1[i] = "**" + lines_t1[i].strip() + "** "
    return final_text_1, final_text_2

def complete_text2(final_text_2, lines_t1, lines_t2):
    # Fill in the lines of text 2 that were not matched while walking text 1.
    for index, l in enumerate(final_text_2):
        if l is None:
            d = difflib.get_close_matches(lines_t2[index], lines_t1, n=1, cutoff=0.5)
            if d:
                line_t2, _ = line_similar(lines_t2[index], lines_t1[lines_t1.index(d[0])])
                final_text_2[index] = line_t2
            else:
                final_text_2[index] = "**" + lines_t2[index].strip() + "** "
    return final_text_2

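def convert_list_to_string(lines):
    # Join the lines, adding a newline after each one (avoids shadowing the built-ins list and str).
    return "".join(line + "\n" for line in lines)

def delete_asterisk(lines):
    # Remove existing asterisks so they cannot be confused with the bold markers.
    return [line.replace('*', '') for line in lines]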
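To see the bold-marking idea without installing diff-match-patch, here is a rough sketch that swaps in difflib.SequenceMatcher from the standard library and compares whole words instead of characters. The function name mark_changed_words and the sample phrases are assumptions made for illustration; this is not the notebook's method, and its output will not match mark_diferences exactly.

```python
import difflib

def mark_changed_words(old, new):
    """Hypothetical stand-in: wrap the words that differ in ** ** on each side."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            old_out.extend(old_words[i1:i2])
            new_out.extend(new_words[j1:j2])
        else:
            # 'replace', 'delete' and 'insert' all mean the texts differ here.
            if i1 < i2:
                old_out.append("**" + " ".join(old_words[i1:i2]) + "**")
            if j1 < j2:
                new_out.append("**" + " ".join(new_words[j1:j2]) + "**")
    return " ".join(old_out), " ".join(new_out)

# Illustrative phrases only; they are not taken from the SLA files.
old_marked, new_marked = mark_changed_words(
    "to make AWS Glue available",
    "to make each Included Service available",
)
print(old_marked)  # to make **AWS Glue** available
print(new_marked)  # to make **each Included Service** available
```

Working on words keeps the markers on word boundaries, whereas a character-level diff can split a word, as in "**Ma** y" in the output further down.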

In [48]:
def comparator(file1, file2):
    text1, text2 = open_file(file1, file2)
    lines_t1 = delete_asterisk(text1.readlines())
    lines_t2 = delete_asterisk(text2.readlines())
    # The contents are in memory now, so the files can be closed.
    close_file(text1, text2)
    final_text_1 = [None for _ in range(len(lines_t1))]
    final_text_2 = [None for _ in range(len(lines_t2))]
    # Match every line of text 1 against text 2, then fill in what is left of text 2.
    for i in range(len(lines_t1)):
        final_text_1, final_text_2 = comparation_process(i, lines_t1, lines_t2, final_text_1, final_text_2)
    final_text_2 = complete_text2(final_text_2, lines_t1, lines_t2)
    return final_text_1, final_text_2

In [49]:
v0, v1 = "slas/aws_glue_sla_january_2019.txt", "slas/aws_glue_sla_may_2022.txt"

final_text_1, final_text_2 = comparator(v0, v1)

In [60]:
str1 = convert_list_to_string(final_text_1)
str2 = convert_list_to_string(final_text_2)
str2

'**AWS Glue Service Level Agree** ment\n\nLast Updated: **Ma** y 2, 20**22** \nThis AWS Glue Service Level Agreement (“SLA”) is a policy governing the use of **the Included Services (listed below** ) and applies separately to each account using **the Included Services** . In the event of a conflict between the terms of this SLA and the terms of the AWS Customer Agreement or other agreement with us governing your use of our Services (the “Agreement”), the terms and conditions of this SLA apply, but only to the extent of such conflict. Capitalized terms used herein but not defined herein shall have the meanings set forth in the Agreement.\n\n**Included** Services\n\n**•\tAWS Glue Studio, AWS Glue Crawlers, AWS Glue Data Catalog, AWS Glue Schema Registry, and AWS Glue ETL ("Glue ETL Services")** \n**•\tAWS Glue DataBrew (“Glue DataBrew”)** \nService Commitment\n\nAWS will use commercially reasonable efforts to make **each Included Servic** e available with a Monthly Uptime Percentage for 

In [51]:
textos = [str1, str2]

In [52]:
print("Processing with spaCy...")
docs = list(nlp.pipe(textos))
print("Done.")

Processing with spaCy...
Done.


In [58]:
sents_test = []

# Keep only the sentences of the second version that contain a marked difference.
for sent in docs[1].sents:
    sentence = sent.text
    if "**" in sentence:
        sents_test.append(sentence)

In [59]:
df = pd.DataFrame(sents_test, columns=["text"])
df.to_csv("test.csv", index=False, encoding='utf-8')
print(df.head())

                                                text
0  **AWS Glue Service Level Agree** ment\n\nLast ...
1  This AWS Glue Service Level Agreement (“SLA”) ...
2  **Included** Services\n\n**•\tAWS Glue Studio,...
3  Glue Crawlers, AWS Glue Data Catalog, AWS Glue...
4  In the event **an Included Servic** e does not...
