<h1> Document Vector Embeddings </h1>

The initial experiment is based on the experiment by Sugathadasa et al. [https://arxiv.org/pdf/1805.10685.pdf].

<h2> Text Preprocessing </h2>

The first step is to create a <i> document corpus </i>, a subset of the most important sentences in each document. We select these sentences with the <i> TextRank </i> algorithm, which is based on <i> PageRank </i>. Before that, we preprocess each document by cleaning the text of unwanted characters and common words. The first cleaning steps are lemmatization and case-folding to lowercase.
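As a rough illustration of the cleaning step described above, here is a minimal sketch using only the standard library. The function name <i> clean_text </i> and the tiny stop-word set are hypothetical and not part of this notebook; lemmatization is left out because it needs a language model.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that"}

def clean_text(text):
    """Lowercase the text, drop unwanted characters and remove stop words."""
    text = text.lower()                       # case-folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace unwanted characters with spaces
    tokens = text.split()                     # split on any whitespace
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_text("The Court, in 2008, AFFIRMED the judgment of the lower court."))
# prints: court 2008 affirmed judgment lower court
```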

<h5> Required libraries </h5>


In [None]:
# !pip install requests

In [66]:
import requests
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy
import nltk.data
import pandas as pd
from text_rank import analyze 
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
import en_core_web_sm

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [38]:
NUM_OF_DOCUMENTS = 5
NUM_OF_SENTENCES = 30
NUM_OF_CHARACTERS = 10

URL = "https://www.courtlistener.com/api/rest/v3/opinions/"

In [39]:
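def get_document(endpoint):
    # Fetch one opinion from the API and return its plain text on a single line
    r = requests.get(endpoint, timeout=30)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
    data = r.json()
    verdict_text = data["plain_text"]
    verdict_text = verdict_text.replace("\n", " ")

    return verdict_text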

In [40]:
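# spaCy-based lemma tokenizer; it is callable, so it is used like a function in lemmatize()

class LemmaTokenizer():
    def __init__(self):
        # Load the small English spaCy model once
        self.spacynlp = spacy.load('en_core_web_sm')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        # Keep the lowercased lemma of every token that is not punctuation
        nlpdoc = [token.lemma_.lower() for token in nlpdoc if (not token.is_punct)]
        return nlpdoc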


<h2> <i> TextRank </i> algorithm </h2>

The <i> TextRank </i> algorithm is implemented following the work of Mihalcea and Tarau [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf]. <br>
We use it to extract the "most valuable" sentences of a document.  <br> <br>
The algorithm is implemented in a Python script named <i> text_rank.py </i>.

In [41]:
def apply_textrank(text):
    sorted_sentences = analyze(text, NUM_OF_SENTENCES)
    return sorted_sentences
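The script <i> text_rank.py </i> is not shown here, so the following is only a sketch of how sentence-level TextRank can work, assuming the word-overlap similarity and the weighted PageRank update from the TextRank paper. The names <i> sentence_similarity </i> and <i> textrank_sentences </i>, the regex tokenization and the fixed iteration count are assumptions made for illustration; the notebook's <i> analyze </i> function may differ.

```python
import math
import re

def sentence_similarity(words_a, words_b):
    """Word overlap normalised by the log of both sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # skip very short sentences so the normaliser stays positive
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))

def textrank_sentences(sentences, top_k, damping=0.85, iterations=50):
    """Return the top_k sentences ranked by a weighted PageRank score."""
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    # Symmetric weight matrix of the sentence graph (no self-loops)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = sentence_similarity(tokens[i], tokens[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, weighted by edge strength
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_k]]

example = [
    "The court affirmed the judgment of the district court.",
    "The district court entered judgment for the plaintiff.",
    "Lunch was served at noon.",
]
print(textrank_sentences(example, top_k=2))
```

In this toy input the third sentence shares no words with the other two, so it receives only the base score and is not selected.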

Sentences that are not longer than <i> NUM_OF_CHARACTERS </i> characters are removed during lemmatization.

<h2> Text processing after <i> TextRank </i> algorithm </h2>

After the <i> TextRank </i> algorithm, we lemmatize each word of the selected sentences.

In [42]:
def sorted_list2str(s):
    # Concatenate the ranked sentences into one string;
    # every sentence is preceded by a single space
    return "".join(" " + sentence for sentence in s)

In [43]:
def lemmatized_txt2str(s):
    # Flatten a list of token lists into one string, skipping whitespace-only tokens;
    # every kept token is preceded by a single space
    return "".join(" " + token for sentence in s for token in sentence if not token.isspace())

In [44]:
lemma_tokenizer = LemmaTokenizer()
def lemmatize(sentences):
    # The document corpus is the k most important sentences returned by TextRank.
    # Join them, split the result into sentences again and drop the very short ones.
    sentences = tokenizer.tokenize(sorted_list2str(sentences))
    sentences = [x for x in sentences if len(x) > NUM_OF_CHARACTERS]
    lemmatized_text = []

    # Lemmatize sentence by sentence, then flatten everything into one string
    for sentence in sentences:
        one_sentence = lemma_tokenizer(sentence)
        lemmatized_text.append(one_sentence)
    lemmatized_text = lemmatized_txt2str(lemmatized_text)
    return lemmatized_text

<h2> Apply TF-IDF </h2>

In [45]:
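def call_vectorizer(df):
    # Fit TF-IDF on the first column of df (one preprocessed document per row)
    tfidf_vectorizer = TfidfVectorizer(stop_words = "english")
    tfidf_vector = tfidf_vectorizer.fit_transform(df.iloc[:, 0].tolist())
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

    return tfidf_vectorizer, tfidf_df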
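To make the numbers in the TF-IDF matrix easier to interpret, the sketch below recomputes them by hand in plain Python. It assumes the smoothed IDF and L2 row normalisation that are the documented defaults of <i> TfidfVectorizer </i>; the whitespace tokenization and the name <i> tfidf_matrix </i> are simplifications introduced for this example (no stop-word removal, no token pattern).

```python
import math
from collections import Counter

def tfidf_matrix(documents):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocabulary, rows

vocab, rows = tfidf_matrix(["court affirm judgment", "court deny motion"])
print(vocab)
print(rows[0])
```

A term that occurs in every document ("court" here) ends up with a lower weight than a term that occurs in only one.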

<h2> Global Term Frequency </h2>

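To measure how important a term is across the whole dataset, we compute the GTF_IDF matrix. Each TF-IDF value is multiplied by the sum of that term's TF-IDF values over all documents and divided by the number of documents:

GTF_IDF(d, t) = TF_IDF(d, t) * sum over all documents d' of TF_IDF(d', t) / NUM_OF_DOCUMENTS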
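A small worked example of the formula above, written with plain lists instead of a DataFrame. The function name <i> gtf_idf_rows </i> is hypothetical, and the sketch assumes that the number of documents equals the number of rows in the matrix, whereas the notebook divides by the constant <i> NUM_OF_DOCUMENTS </i>.

```python
def gtf_idf_rows(tfidf_rows):
    """Scale every TF-IDF value by its term's column sum divided by the row count."""
    n_docs = len(tfidf_rows)
    n_terms = len(tfidf_rows[0])
    # Sum of each term's TF-IDF values over all documents
    column_sums = [sum(row[t] for row in tfidf_rows) for t in range(n_terms)]
    return [
        [row[t] * column_sums[t] / n_docs for t in range(n_terms)]
        for row in tfidf_rows
    ]

# Two documents (rows) and two terms (columns)
tfidf = [[0.5, 0.0],
         [0.5, 1.0]]
print(gtf_idf_rows(tfidf))
# prints: [[0.25, 0.0], [0.25, 0.5]]
```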

In [55]:

def calculate_gtfidf(tf_df):
    # Per-term sum of TF-IDF scores over the corpus (reads the global tfidf_df)
    sum_of_tfidfs = tfidf_df.sum(axis = 0)
    # Scale each term column of tf_df in place by (column sum / NUM_OF_DOCUMENTS)
    for column in tfidf_df.columns:
        tf_df[column] = tf_df[column] * (sum_of_tfidfs[column] / NUM_OF_DOCUMENTS)

    return tf_df

<h2> The Experiment </h2>

We run the whole pipeline on a small set of documents from the CourtListener database.

In [47]:
df = pd.DataFrame(columns = ["document"])

# Note: range(1, NUM_OF_DOCUMENTS) yields the ids 1 .. NUM_OF_DOCUMENTS - 1
for i in range(1, NUM_OF_DOCUMENTS):
    document = get_document(URL + str(i))
    sorted_sentences = apply_textrank(document)
    df.loc[i] = lemmatize(sorted_sentences)

vectorizer, tfidf_df = call_vectorizer(df)
# calculate_gtfidf takes a single DataFrame and rescales it in place
gtf_idf = calculate_gtfidf(tfidf_df)
    



In [77]:
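def cos_similarity(tfidf_test):
    # Note: the tfidf_test argument is overwritten below, and the return inside the
    # loop ends the function after the first document, so only df's first row is ranked
    for i, doc in enumerate(df.iloc[:, 0]):
        tfidf_test = vectorizer.transform([doc])
        tfidf_test = pd.DataFrame(tfidf_test.toarray(), columns=vectorizer.get_feature_names_out())
        tfidf_test = calculate_gtfidf(tfidf_test)

        # Cosine similarity of this document against every corpus document
        similarities = cosine_similarity(tfidf_test, tfidf_df).flatten()
        print(i)
        # Corpus positions ordered from most to least similar
        indexes = np.argsort(similarities)[::-1]
        print(indexes)
        return indexes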

In [78]:
df

Unnamed: 0,document
1,although the court do not specifically refere...
2,to determine whether a violation have occur w...
3,on february 12 2008 the jury return a unanimo...
4,in august 2009 ansys sue its former employee ...


In [79]:
indexes = cos_similarity(gtf_idf)

0
[0 1 3 2]
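The run above ranks the corpus against a single query document. As a sketch of how the same ranking could be produced for every document, the example below uses plain Python lists and a hand-written cosine similarity. The names <i> cosine </i> and <i> rank_all </i> and the toy vectors are assumptions for illustration; they are not results from the CourtListener data.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally long vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_all(vectors):
    """For each vector, list the indices of all vectors from most to least similar."""
    rankings = []
    for query in vectors:
        sims = [cosine(query, other) for other in vectors]
        order = sorted(range(len(vectors)), key=lambda j: sims[j], reverse=True)
        rankings.append(order)
    return rankings

# Toy document vectors: the first two point in nearly the same direction
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
for i, order in enumerate(rank_all(vectors)):
    print(i, order)
```

Each document ranks itself first, which is a quick sanity check that the similarity computation is wired up correctly.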


