<h1> Document Vector Embeddings </h1>

The initial experiment is based on the work of Sugathadasa et al. [https://arxiv.org/pdf/1805.10685.pdf].

<h2> Text Preprocessing </h2>

The first step is to create a <i> document corpus </i>, a subset of the most important sentences in each document. We select these sentences with the <i> TextRank </i> algorithm, a graph-based ranking method derived from <i> PageRank </i>. Before doing that, we preprocess the document by cleaning the text of unwanted characters and common words. As first cleaning steps, we use lemmatization and case-folding to lowercase.
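
The cleaning steps described above can be illustrated with a small sketch. It is a simplified stand-in, not the notebook's pipeline: it covers only case-folding, removal of unwanted characters, and removal of common words, and it leaves lemmatization to the spaCy-based tokenizer defined later. The function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list used only for illustration;
# a real run would rely on a fuller list (e.g. the one shipped with scikit-learn).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "this"}


def clean_text(text):
    """Lowercase the text, strip unwanted characters, and drop common words."""
    text = text.lower()                       # case-folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace unwanted characters with spaces
    tokens = text.split()                     # split on whitespace
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_text("The Court, in 2009!"))
```

Lemmatization is left out on purpose, because it needs a language model; in this notebook that part is handled by spaCy.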

<h5> Required libraries </h5>


In [None]:
# !pip install requests

In [1]:
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy
import nltk.data
import pandas as pd
from text_rank import analyze 

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [2]:
# Fetch a single opinion from the CourtListener REST API

URL = "https://www.courtlistener.com/api/rest/v3/opinions/1"

r = requests.get(url=URL, timeout=30)
r.raise_for_status()  # stop early if the request failed
data = r.json()

In [3]:
verdict_text = data["plain_text"]
verdict_text = verdict_text.replace("\n", " ")

In [4]:
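# Lemma tokenizer intended for use with TfidfVectorizer:
# returns lowercased lemmas with punctuation removed

class LemmaTokenizer:
    def __init__(self):
        # Requires the model to be installed: python -m spacy download en_core_web_md
        self.spacynlp = spacy.load("en_core_web_md")

    def __call__(self, doc):
        # Keep the lowercased lemma of every non-punctuation token
        nlpdoc = self.spacynlp(doc)
        return [token.lemma_.lower() for token in nlpdoc if not token.is_punct]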


<h2> <i> TextRank </i> algorithm </h2>

The <i> TextRank </i> algorithm is implemented following the work of Mihalcea and Tarau [https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf]. <br>
We use this algorithm to extract the "most valuable" sentences in a document.  <br> <br>
The <i> TextRank </i> algorithm is implemented in a Python script named <i> text_rank.py </i>.
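
Since <i> text_rank.py </i> is not shown here, the following is only a minimal sketch of TextRank-style sentence ranking, not the script used in this notebook. It assumes the word-overlap similarity from the TextRank paper (shared words divided by the sum of the logs of the sentence lengths) and a fixed number of weighted PageRank iterations. The names `tokenize`, `similarity`, and `rank_sentences`, as well as the default parameters, are hypothetical.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def similarity(tokens_a, tokens_b):
    """Word overlap normalized by the logs of the two sentence lengths."""
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0  # log(1) = 0 could make the denominator zero
    overlap = len(set(tokens_a) & set(tokens_b))
    return overlap / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def rank_sentences(sentences, k, damping=0.85, iterations=50):
    """Return the k highest-scoring sentences, best first."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted, undirected sentence graph stored as an adjacency matrix
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]


example = [
    "The court reviewed the sentence.",
    "The court affirmed the sentence.",
    "Bananas are yellow fruit.",
]
print(rank_sentences(example, 2))
```

A fixed iteration count keeps the sketch short; an implementation meant for long documents would more likely stop once the scores change by less than a small threshold.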

In [5]:
sorted_sentences = analyze(verdict_text, 50)

In [7]:
sorted_sentences

['Although the court did not specifically reference the factors that the appellant now highlights, the sentencing transcript, read as a whole, evinces a sufficient weighing of the section 3553(a) factors.',
 "In this venue, the appellant does not challenge any of these rulings but, rather, accepts the district court's calculation of the guideline sentencing range (GSR): 70-87 months.",
 'That suggestion, however, is grounded in a misreading of the statute.1 This provision applies only when the span of the GSR, measured from the low end to the high end, is greater than 24 months.',
 'When a sentencing appeal follows a guilty plea, "we glean the relevant facts from the change-of-plea colloquy, the unchallenged portions of the presentence investigation report (PSI Report), and the record of the disposition hearing."',
 "Citing the Court's follow-on decision in Nelson v. United States, 129 S. Ct. 890 (2009) (per curiam), the appellant labors to convince us that the court below transgressed

<h2> Text processing after <i> TextRank </i> algorithm </h2> 

In [6]:
# The document corpus is the k most important sentences returned by the TextRank algorithm.
# analyze() returns a list of strings, while the Punkt tokenizer expects a single string,
# so we tokenize each entry separately and flatten the result. An entry can still contain
# more than one sentence (e.g. where a footnote marker follows a full stop).

sentences = [s for text in sorted_sentences for s in tokenizer.tokenize(text)]

In [None]:
lemma_tokenizer = LemmaTokenizer()

In [None]:
lemmatized_text = []

for sentence in sentences:
    one_sentence = lemma_tokenizer(sentence)
    lemmatized_text.append(one_sentence)

In [None]:
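# TfidfVectorizer expects one string per document, so join each sentence's lemmas back together
lemmatized_sentences = [" ".join(tokens) for tokens in lemmatized_text]

# Fit once over all sentences so that they share one vocabulary and one set of IDF weights
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_vectors = vectorizer.fit_transform(lemmatized_sentences)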