<h1> Document Vector Embeddings </h1>

The initial experiment will be performed following the approach of Sugathadasa et al. [https://arxiv.org/pdf/1805.10685.pdf].

<h2> Text Preprocessing </h2>

The first step is to create a <i> document corpus </i>: a subset of the most important sentences in each document. We select these sentences with a ranking algorithm based on <i> PageRank </i>. Before ranking, we need to preprocess each document by removing unwanted characters and common words. As first cleaning steps, we use lemmatization and case-folding to lowercase.
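The character-cleaning step can be sketched roughly as follows. This is only an illustration: `clean_text` is a hypothetical helper, and the set of characters treated as "unwanted" (anything other than letters, digits, whitespace, and basic sentence punctuation) is an assumption, not something fixed by the experiment. Sentence punctuation is kept so that the text can still be split into sentences afterwards.

```python
import re

def clean_text(text: str) -> str:
    """Case-fold the text and drop characters outside an assumed allow-list."""
    # Case-folding to lowercase
    text = text.lower()
    # Assumption: keep a-z, 0-9, whitespace and . , ; : ! ? ' - ; replace the rest with a space
    text = re.sub(r"[^a-z0-9\s.,;:!?'-]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("The Court\n\nHELD:  §12(b) applies.")
print(cleaned)
```

Lemmatization and stop-word removal are handled separately, since they operate on tokens rather than on raw characters.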

<h5> Required libraries </h5>


In [9]:
# !pip install requests

In [71]:
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy
import nltk.data
import pandas as pd

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [6]:
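# Fetch a single opinion from the CourtListener REST API

URL = "https://www.courtlistener.com/api/rest/v3/opinions/1"

r = requests.get(URL, timeout=30)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
data = r.json()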

In [10]:
verdict_text = data["plain_text"]

<h2> <i> TextRank </i> algorithm </h2>

The <i> TextRank </i> algorithm will be implemented based on the work of Mihalcea and Tarau [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf].
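As a rough sketch of the idea, TextRank treats every sentence as a node in a graph, weights the edge between two sentences by their similarity, and then runs a weighted PageRank over that graph. The example below uses the word-overlap similarity and the damping factor of 0.85 described in the cited paper. The function names `overlap_similarity` and `textrank` are hypothetical, and the fixed iteration count is a simplifying assumption in place of a convergence threshold. Another similarity measure, such as cosine similarity of TF-IDF vectors, could be substituted as the edge weight.

```python
import math

def overlap_similarity(a, b):
    """Number of shared words, normalised by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0.0:  # both sentences contain a single token
        return 0.0
    return len(set(a) & set(b)) / denom

def textrank(tokenized_sentences, d=0.85, iterations=50):
    """Score sentences with weighted PageRank over a sentence-similarity graph."""
    n = len(tokenized_sentences)
    # Symmetric edge weights, no self-loops
    w = [[0.0 if i == j else overlap_similarity(tokenized_sentences[i], tokenized_sentences[j])
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores, proportional to edge weight
        scores = [
            (1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                              for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

example = [
    ["court", "grant", "motion", "dismiss"],
    ["court", "deny", "motion", "appeal"],
    ["weather", "be", "nice", "today"],
]
scores = textrank(example)
print(scores)
```

In this toy input the third sentence shares no words with the others, so it keeps only the baseline score and ranks last.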

In [30]:
# Lemma tokenizer: lemmatizes and lowercases each token, dropping punctuation

class LemmaTokenizer():
    def __init__(self):
        # Requires the model to be installed: python -m spacy download en_core_web_md
        self.spacynlp = spacy.load("en_core_web_md")
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_.lower() for token in nlpdoc if (not token.is_punct)]
        return nlpdoc


In [34]:
# TextRank ranks individual sentences, so we first split the document into sentences.
# The document corpus is then the k most important sentences according to TextRank.


sentences = tokenizer.tokenize(verdict_text)

In [35]:
lemma_tokenizer = LemmaTokenizer()

In [54]:
lemmatized_text = []

for sentence in sentences:
    one_sentence = lemma_tokenizer(sentence)
    without_new_lines = [string for string in one_sentence if not string.startswith("\n")]
    lemmatized_text.append(without_new_lines)

In [76]:
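# TfidfVectorizer expects one string per document, so re-join each lemmatized sentence.
# Empty sentences are kept so that row i still corresponds to sentences[i].
sentence_strings = [" ".join(tokens) for tokens in lemmatized_text]

# Fit one shared vocabulary over all sentences so that the vectors are comparable
vectorizer = TfidfVectorizer(stop_words = "english")
tfidf_vectors = vectorizer.fit_transform(sentence_strings)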
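
The document corpus is the k most important sentences of the document. A minimal sketch of that selection step is shown here, assuming a list of per-sentence importance scores is already available from the ranking step. `top_k_sentences` is a hypothetical helper, and returning the selected sentences in their original document order is an assumption made for readability.

```python
def top_k_sentences(sentences, scores, k):
    """Return the k highest-scoring sentences, preserving document order."""
    # Indices of the k best scores
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original order of the selected sentences
    return [sentences[i] for i in sorted(best)]

demo = top_k_sentences(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8], k=2)
print(demo)
```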