# Doc2Vec Process

We divide this process into three steps:

- **Doc2Vec Model Trainng**: Using contracts, we train a doc2vec model to turn contract sentences into representations.

- **Processing a New Contract**: Given the doc2vec model, we start the process in a new contract.

    - Norm Extraction: First, we extract the norms from the new contract;
    - Then, we create a representation for each norm using the doc2vec model.
    
- **Conflict Identification**: Using the norm representations, we can have two different paths to follow:

    - T-SNE: Manual identification of modal verbs. (Experimental)
    - Norm Comparisons: Compare norms and find the most similar among them based on a threshold.

### Doc2Vec Model Training

In [1]:
# -*- coding:utf-8 -*-
import os
import sys
import pickle
import argparse
import logging
from random import shuffle
from convert_to_sentences import convert_to_sentences
from time import gmtime, strftime
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sentence_classifier.sentence_classifier import SentenceClassifier

Using TensorFlow backend.


In [15]:
# CONSTANTS.
TRAIN = False
TRAIN_PATH = 'dataset/manufact_cntrcs.txt'
PREPROCESS = False
TEST = True
TEST_PATH = 'models/model_2017-11-27_18-21-45.doc2vec'
MODEL = False
MODEL_PATH = 'model_2017-11-27_18-21-45.doc2vec'

In [3]:
# Set argparse.
parser = argparse.ArgumentParser(description='Convert sentences and paragraphs into a dense representation.')

# Set logger.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

formatter = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s')

file_handler = logging.FileHandler('logs/doc2vec.log')
file_handler.setFormatter(formatter)

logger.addHandler(file_handler)

In [21]:
# Set sentence classifier.
sent_cls_path = 'sentence_classifier/classifiers/17-11-03_18:45/sentence_classifier_17-11-03_18:45.pkl'
sent_cls_names_path = 'sentence_classifier/classifiers/17-11-03_18:45/sentence_classifier_dict_17-11-03_18:45.pkl'
sent_cls = SentenceClassifier()
sent_cls.load_classifier(sent_cls_path)
sent_cls_names = pickle.load(open(sent_cls_names_path, 'r'))
sent_cls.set_names(sent_cls_names)

In [9]:
class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
        self.sentences = []

    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            pred = sent_cls.predict_class(line)
            if pred[0]:
                yield TaggedDocument(words=line.split(), tags=['SENT_%s' % uid])
            else:
                continue

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

In [5]:
def get_model_path():

    logger.info('Generating output path.')
    if not os.path.isdir('models'):
        os.makedirs('models')

    return 'models/model_' + strftime("%Y-%m-%d_%H-%M-%S.doc2vec", gmtime())

In [6]:
def train_model(sentences, model=None):
    logger.info('Training model.')

    if not model:
        model = Doc2Vec(size=100, window=2, min_count=2, workers=2, alpha=0.025, min_alpha=0.025)

    model.build_vocab(sentences)

    for epoch in range(10):
        model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay

    output_path = get_model_path()

    logger.info('Saving trained model.')
    model.save(output_path)

    return output_path

In [11]:
def create_sent_dict(sentences):

    s_dict = dict()

    for sent in sentences:
        s_dict[sent[1][0]] = sent[0]

    return s_dict

In [16]:
if TRAIN:

    file_path = TRAIN_PATH

    logger.info('Receive training path: %s' % file_path)

    # Get sentences.
    if PREPROCESS:
        logger.info('Preprocessing file.')
        file_path = convert_to_sentences(file_path)

    sentences = LabeledLineSentence(file_path)

    # Create a dict to convert a sent code into its respective sentence.
    sent_dict = create_sent_dict(sentences)

    if not MODEL:
        output_model = train_model(sentences)
    else:
        old_model = Doc2Vec.load(MODEL_PATH)
        output_model = train_model(sentences, old_model)

    base, _ = os.path.splitext(output_model)

    # Save the dict.
    pickle.dump(sent_dict, open(base + '.pkl', 'w'))

elif TEST:
    model = Doc2Vec.load(TEST_PATH)
    # print model.docvecs.most_similar(20)
    print model.infer_vector('This shall be respected.')

else:
    print "Nothing to do here."

[-0.10322545  0.18548104  0.13756144 -0.04352326 -0.22969635  0.04753093
  0.02571246  0.24581064  0.33188435 -0.22686234 -0.10024916 -0.02796608
  0.06870021 -0.1846437   0.08619713  0.08191245 -0.02686881  0.21596356
  0.00322839 -0.08194727  0.16578761  0.28537318  0.21536617  0.13582177
  0.15831652  0.10290952  0.10735521  0.1601747   0.05893871 -0.11352994
  0.12639868 -0.0636748   0.09565083 -0.04995884 -0.07665579 -0.04295772
  0.08856849 -0.20593353  0.25739577  0.36179683  0.24817976  0.2827124
  0.07550059  0.15275338  0.05074598 -0.12287454  0.09645459  0.23509987
 -0.04460795  0.00963673  0.04208945 -0.24068561 -0.05101367 -0.01756793
 -0.26988509  0.02037181 -0.13784027  0.12749006  0.12695429 -0.46913627
  0.37706915  0.05398905 -0.23660167 -0.06240413  0.09760214 -0.01032645
  0.18350114  0.461429   -0.29158783 -0.39826754  0.32975742  0.10630392
  0.04601648  0.04274211  0.13623574  0.34299347  0.37144002 -0.18252996
  0.13218072 -0.27193859  0.13597392 -0.39603809  0.

### Processing a new contract

In [40]:
import pickle
import numpy as np
from nltk.tokenize import sent_tokenize

In [54]:
contract_path = '/home/aires/Dropbox/PUCRS/Mestrado/Dissertation/Corpus/xIbinCorpus/noHTML/licence/amazon.lic.1998.08.10.shtml'

In [55]:
def extract_norms(sentences, path_to_classifier):    
    # Load sentence classifier.
    norms = []
    
    for sent in sentences:
        
        pred = sent_cls.predict_class(sent)
        
        if pred[0]:
            norms.append(sent)
    
    return norms

In [56]:
# Read contract text.
text = open(contract_path, 'r').read()

# Extract sentences.
sentences = sent_tokenize(text)

# Extract Norms.
norms = extract_norms(sentences, sent_classifier_path)

# Get norm representations.
norm_representations = np.zeros(shape=(len(norms), 100))
norm_text = dict()

for i, norm in enumerate(norms):
    norm_text[i] = norm
    norm_representations[i] = model.infer_vector(norm)

### Conflict Identification

We divide this section into two subsections: T-SNE and Norm Representation Comparison

##### T-SNE

##### Norm Representation Comparison

In [57]:
threshold = 0.6

In [58]:
def find_similars(indx, norm_rep, norm_representations):
    diff = np.divide(np.absolute(np.subtract(norm_rep, norm_representations)).sum(axis=1), norm_representations.shape[0])
    similar = np.where(diff > threshold)[0]
    percents = []
    for i in similar:
        percents.append(diff[i])
        
    return similar, percents

In [59]:
# Run over norm representations.
similars = dict()
for ind, norm_rep in enumerate(norm_representations):
    similars[ind] = find_similars(ind, norm_rep, norm_representations)
    if similars[ind]:
        print "Original: %s \n\n Potential Conflict: %s\n Percentage: %.2f\n\n-----------------------------------------" % (norm_text[ind], norm_text[similars[ind][0][0]], similars[ind][1][0])

Original: PAGE 1 licensable (without cost to Amazon.com or Amazon.com D) by Amazon.com during the Support Period; and (b) all Amazon.com IPR embodied in such software and other technology; provided, however, that the Amazon.com Technology shall not include, without limitation, any database, customer data or information or other business information. 

 Potential Conflict: IPR does not include any Trademarks.
 Percentage: 0.73

-----------------------------------------
Original: IPR does not include any Trademarks. 

 Potential Conflict: PAGE 1 licensable (without cost to Amazon.com or Amazon.com D) by Amazon.com during the Support Period; and (b) all Amazon.com IPR embodied in such software and other technology; provided, however, that the Amazon.com Technology shall not include, without limitation, any database, customer data or information or other business information.
 Percentage: 0.73

-----------------------------------------
Original: Amazon.com may change its appointed technica