<h1> Document Vector Embeddings </h1>

The initial experiment follows the experiment by Sugathadasa et al. [https://arxiv.org/pdf/1805.10685.pdf].

<h2> Text Preprocessing </h2>

The first step is to create a <i> document corpus </i>, which is a subset of the most important sentences in each document. We can do that by implementing the <i> PageRank </i> algorithm. Before we do that, we need to preprocess the document by cleaning the text of unwanted characters and common words. We use lemmatization and case-folding to lowercase as the first steps in cleaning the documents.

<h5> Required libraries </h5>


In [20]:
# !pip install requests

In [21]:
import requests
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy
import nltk.data
import pandas as pd
from text_rank import analyze 
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
import en_core_web_sm
import time
import os
import json
import ast
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [74]:
NUM_OF_DOCUMENTS = 5
NUM_OF_SENTENCES = 50
NUM_OF_CHARACTERS = 10

URL = "https://www.courtlistener.com/api/rest/v3/opinions/"

RUN_TRAIN = False

In [39]:
def get_document(file_name):
    data = ""
    with open(file_name) as json_file:
        data = json.load(json_file)
    return data["plain_text"].replace("\n", " ")


<h2> <i> TextRank </i> algorithm </h2>

The <i> TextRank </i> algorithm is implemented based on the work of Mihalcea et al. [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf]. <br>
We use this algorithm to extract the "most valuable" sentences in a document.  <br> <br>
The <i> TextRank </i> algorithm is implemented in a Python script named <i> text_rank.py </i>.

In [40]:
def apply_textrank(text):
    sorted_sentences = analyze(text, NUM_OF_SENTENCES)
    return sorted_sentences
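
The script text_rank.py is not shown in this notebook, so the block below is only a minimal sketch of the idea behind TextRank, not the code behind analyze. It assumes a naive regex sentence splitter and the word-overlap similarity described by Mihalcea et al. The names split_sentences, sentence_similarity and textrank_sentences, the damping factor of 0.85 and the fixed 50 iterations are assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace (illustration only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_similarity(a, b):
    """Shared words divided by the summed log sizes (sizes count unique words here)."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator, since log(1) == 0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_sentences(text, top_k, damping=0.85, iterations=50):
    """Return the top_k sentences ranked by a weighted PageRank over the sentence graph."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []

    # Symmetric weight matrix of the sentence similarity graph.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = sentence_similarity(sentences[i], sentences[j])
    out_sums = [sum(row) for row in weights]

    # Fixed number of PageRank updates; isolated sentences keep the base score.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_k]]


example_text = (
    "The court denied the appeal. The appeal was filed by the defendant. "
    "The defendant asked the court for a new trial. Bananas are yellow."
)
top = textrank_sentences(example_text, 3)
```

In this toy text the last sentence shares no words with the others, so it keeps only the base score and is not among the three selected sentences.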

Sentences that are not longer than NUM_OF_CHARACTERS characters are removed in the lemmatization step below.

<h2> Text processing after <i> TextRank </i> algorithm </h2>

After the <i> TextRank </i> algorithm, we apply lemmatization to each word in the document.

In [41]:
# Lemma tokenizer: lemmatizes and lowercases every token with spaCy, skipping punctuation

class LemmaTokenizer():
    def __init__(self):
        self.spacynlp = spacy.load('en_core_web_sm')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_.lower() for token in nlpdoc if (not token.is_punct)]
        return nlpdoc


In [42]:
def sorted_list2str(s):
    # Concatenate the sentences, each preceded by a single space.
    return "".join(" " + ele for ele in s)

In [43]:
def lemmatized_txt2str(s):
    # Flatten the per-sentence token lists into one string, skipping whitespace-only tokens.
    return "".join(" " + token for sentence in s for token in sentence if not token.isspace())

In [44]:
lemma_tokenizer = LemmaTokenizer()
def lemmatize(sentences):
    # Because of the TextRank algorithm, we have to split the document into sentences to create the document corpus 
    # (document corpus is the k most important sentences after applying TextRank algorithm)


    sentences = tokenizer.tokenize(sorted_list2str(sentences))
    sentences = [x for x in sentences if len(x) > NUM_OF_CHARACTERS]
    lemmatized_text = []

    for sentence in sentences:
        one_sentence = lemma_tokenizer(sentence)
        lemmatized_text.append(one_sentence)
    lemmatized_text = lemmatized_txt2str(lemmatized_text)
    return lemmatized_text

<h2> Apply TF-IDF </h2>

In [85]:
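def call_vectorizer(df):
    # Fit TF-IDF on the document column (second column), dropping English stop words.
    tfidf_vectorizer = TfidfVectorizer(stop_words="english")
    tfidf_vector = tfidf_vectorizer.fit_transform(df.iloc[:, 1].values.astype('U').tolist())
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

    return tfidf_vectorizer, tfidf_df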
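
For orientation, the weighting that TfidfVectorizer applies with its default settings can be reproduced by hand: each term count is multiplied by a smoothed inverse document frequency, and each row is then scaled to unit Euclidean length. The block below is a minimal NumPy sketch on a hypothetical count matrix. It leaves out tokenization and stop-word removal, and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical term counts: rows are documents, columns are terms.
counts = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0]])
n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF, which is the scikit-learn default (smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit Euclidean length.
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
```

The middle term occurs in both documents, so it receives the smallest IDF and contributes least to each row.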

<h2> Global Term Frequency </h2>

To see how important a word is across the whole dataset, we calculate the GTF_IDF matrix by applying the formula below, where sum(TF_IDF) is the sum of a term's TF_IDF values over all documents:

GTF_IDF = TF_IDF * sum(TF_IDF) / NUM_OF_DOCUMENTS

In [46]:

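def calculate_gtfidf(tf_df):
    # Global weight of each term: column sum of the training TF-IDF matrix
    # (the global `tfidf_df`, created in a later cell) over NUM_OF_DOCUMENTS.
    sum_of_idfs = tfidf_df.sum(axis=0)
    # Scale each term column by its weight; the product aligns on column names.
    # A new frame is returned, so `tfidf_df` keeps its raw TF-IDF values.
    return tf_df.mul(sum_of_idfs / NUM_OF_DOCUMENTS, axis="columns")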
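
The formula can be checked by hand on a toy matrix. The block below is a small NumPy sketch with invented values. It divides by the number of rows of the toy matrix, whereas the notebook divides by the constant NUM_OF_DOCUMENTS.

```python
import numpy as np

# Toy TF-IDF matrix: 2 documents (rows) x 3 terms (columns); values are invented.
toy_tfidf = np.array([[0.5, 0.0, 1.0],
                      [0.5, 2.0, 0.0]])
num_docs = toy_tfidf.shape[0]

# Global weight per term: column sum divided by the number of documents.
global_weight = toy_tfidf.sum(axis=0) / num_docs

# Broadcasting scales every column by its global weight.
toy_gtfidf = toy_tfidf * global_weight
```

Terms with a large total weight across the dataset are amplified, while terms that are rare overall are damped.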

<h2> The Experiment </h2>

We run the whole pipeline on N documents from the CourtListener database.

In [54]:
df = pd.DataFrame(columns = ["id", "document"])
if RUN_TRAIN:
    i = 0
    for file_name in [file for file in os.listdir("data/train/") if file.endswith('.json')]:
        try:
            print(i)
            document = get_document("data/train/" + file_name)
            sorted_sentences = apply_textrank(document)
            df.loc[i] = [file_name, lemmatize(sorted_sentences)]
            i += 1
        except Exception as e:
            # Report the failing file and skip it instead of failing silently.
            print(file_name, e)
            continue
else:
    df = pd.read_csv("train_textrank.csv", sep='\t')[['id', 'document']]

In [55]:
df

Unnamed: 0,id,document
0,174995.json,09 1504 united states of america appellee v. ...
1,174996.json,in this case the district court instruct the ...
2,175074.json,the bia affirm the april 3 2008 opinion of an...
3,175075.json,moreover fia 's affidavit explicitly confirm ...
4,175076.json,objection your honor to the line of questioni...
...,...,...
1391,198335.json,then on august 15 1994 toyota advise citi tha...
1392,198336.json,the court then say briefly that while petrone...
1393,198337.json,upon careful review of the record appellant '...
1394,198338.json,of medical examiners 375 u.s. 411 1964 we non...


In [58]:

df_test = pd.DataFrame(columns = ["id", "document"])
if RUN_TRAIN:
    i = 0
    for file_name in [file for file in os.listdir("data/test/") if file.endswith('.json')]:
        try:
            document = get_document("data/test/" + file_name)
            sorted_sentences = apply_textrank(document)
            df_test.loc[i] = [file_name, lemmatize(sorted_sentences)]
            i += 1
        except Exception as e:
            print(e)
            continue
else:
    df_test = pd.read_csv("test_textrank.csv", sep='\t')[['id', 'document']]

In [59]:
df_test

Unnamed: 0,id,document
0,198340.json,finally the government argue even if turner '...
1,198341.json,5861(d 5871 2 be a felon in know possession o...
2,198342.json,receive evidence interrogate examine and cros...
3,198343.json,yet the high maximum set by the guideline be ...
4,198631.json,98 1710 united states appellee v. michael b. ...
...,...,...
414,199125.json,see downes 182 u.s. at 380 harlan j. dissent ...
415,199126.json,credibility determination be for the jury not...
416,199127.json,see e.g. manso pizarro v. secretary of health...
417,199129.json,the jury find for volkswagen and we reason th...


In [60]:
df.to_csv("train_textrank.csv", sep='\t')

In [61]:
df_test.to_csv("test_textrank.csv", sep='\t')

In [86]:
vectorizer, tfidf_df = call_vectorizer(df)
train_gtf_idf = calculate_gtfidf(tfidf_df)

    

1396


In [87]:
train_gtf_idf.sum(axis = 0)

00          1.645593
000        16.412672
001         0.000114
001b        0.051524
005         0.000197
             ...    
zuleta      0.029197
zulma       0.002448
zuluaga     0.013546
zurosky     0.001466
zyrone      0.002997
Length: 24401, dtype: float64

In [93]:

def cos_similarity():
    # For each test document, rank the training documents by cosine similarity of GTF-IDF vectors.
    feature_names = vectorizer.get_feature_names_out()
    rows = []
    for i, doc in enumerate(df_test.iloc[:, 1]):
        tfidf_test = vectorizer.transform([doc])
        tfidf_test = pd.DataFrame(tfidf_test.toarray(), columns=feature_names)
        tfidf_test = calculate_gtfidf(tfidf_test)
        similarities = cosine_similarity(tfidf_test, train_gtf_idf).flatten()
        # Indexes of the 100 most similar training documents, best first.
        indexes = np.argsort(similarities)[::-1][:100]
        rows.append({"verdict": df_test.iloc[i, 0], "indexes": indexes})
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first.
    return pd.DataFrame(rows, columns=["verdict", "indexes"])
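
The ranking step can be illustrated in isolation. The notebook relies on cosine_similarity from scikit-learn. The block below is a plain NumPy sketch of the same idea on a tiny invented matrix, and the helper name top_k_by_cosine is hypothetical.

```python
import numpy as np


def top_k_by_cosine(query, matrix, k):
    """Return the indexes of the k rows of `matrix` most similar to `query`, best first."""
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    # Rows (or a query) with zero norm get similarity 0 instead of a division error.
    sims = np.divide(matrix @ query, denom, out=np.zeros(len(matrix)), where=denom > 0)
    return np.argsort(sims)[::-1][:k], sims


# Three invented "training" vectors and one query vector.
train_vectors = np.array([[1.0, 0.0],
                          [0.0, 1.0],
                          [1.0, 1.0]])
query_vector = np.array([1.0, 0.0])

top_indexes, sims = top_k_by_cosine(query_vector, train_vectors, 2)
```

Here the first row points in the same direction as the query and ranks first, followed by the third row, which shares one of its two components.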

In [94]:
results = cos_similarity()



KeyboardInterrupt: 

In [90]:
results

NameError: name 'results' is not defined

In [83]:
results.to_csv("results/text_rank.csv", sep = "\t")