**Relevant Source Materials**

https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34

ETA Module 6, Vectorization with SciKit Learn

Stat. Learning Final Project

### Import

In [1]:
import os
from glob import glob
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

### SetUp

In [2]:
## MODIFY THIS
# get path to your folder that holds the txt files
source_files = "C:/Users/jacqu/Downloads/Court Case PDFs/Court Case TXTs"
# outputs a list of all the txt files in the folder
source_file_list = sorted(glob(f"{source_files}/*.txt"))

# creates a list of tuples with an element for the source path and
# for the file title
file_data = []
for source_file_path in source_file_list:
    # basename/splitext avoid hard-coding the path separator;
    # still recommend checking the result with INFO.sample() or .head()
    file_title = os.path.splitext(os.path.basename(source_file_path))[0]
    file_data.append((source_file_path, file_title))

# creating df with the file title as the index and source path as a col
INFO = pd.DataFrame(file_data, columns=['txt_path','file_title'])\
    .set_index('file_title').sort_index()
INFO.head()

file_title,txt_path
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",C:/Users/jacqu/Downloads/Court Case PDFs/Court...
"A.D. v. Best Western Int_l, Inc., 2023 U.S. Dist. LEXIS 150376",C:/Users/jacqu/Downloads/Court Case PDFs/Court...
"A.D. v. Choice Hotels Int_l, Inc., 2023 U.S. Dist. LEXIS 150380",C:/Users/jacqu/Downloads/Court Case PDFs/Court...
B.M. v. Wyndham Hotels,C:/Users/jacqu/Downloads/Court Case PDFs/Court...
"Bacon v. Marshall, 2023 U.S. App. LEXIS 32309",C:/Users/jacqu/Downloads/Court Case PDFs/Court...

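The title extraction above depends on which separator `glob` returns, which is why the comment recommends checking the result. A minimal sketch of a separator-agnostic alternative, assuming the files live on Windows-style paths; the helper name `title_from_path` and the `C:\cases` folder are hypothetical and not part of the notebook:

```python
from pathlib import PureWindowsPath


def title_from_path(path_str: str) -> str:
    """Return the file name without its final extension."""
    # PureWindowsPath treats both "\\" and "/" as separators,
    # and .stem strips only the last suffix (".txt" here).
    return PureWindowsPath(path_str).stem


# Hypothetical paths; the file names mimic the case titles shown above.
paths = [
    r"C:\cases\A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289.txt",
    "C:/cases/B.M. v. Wyndham Hotels.txt",
]
titles = [title_from_path(p) for p in paths]
print(titles)
```

Because only the last suffix is removed, the periods inside a case name such as "U.S. Dist." are left alone.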

In [3]:
# making the CORPUS
## CORPUS df: MultiIndex = doc title, sentence number, token number
## columns = POS tag, token string, term string (normalized token string)

narratives_list = []
for doc_idx, txt_path in enumerate(INFO['txt_path']):
    with open(txt_path, 'r',  encoding='utf-8') as file:
        narrative = file.read()
    narratives_list.append({"title": INFO.index[doc_idx], "narrative": narrative})

# Convert the list of dictionaries to a DataFrame
narratives = pd.DataFrame(narratives_list).set_index("title")
narratives.head()

title,narrative
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",OPINION AND ORDER GRANTING DEFENDANT SUMMIT H...
"A.D. v. Best Western Int_l, Inc., 2023 U.S. Dist. LEXIS 150376",OPINION AND ORDER This matter comes before the...
"A.D. v. Choice Hotels Int_l, Inc., 2023 U.S. Dist. LEXIS 150380",OPINION AND ORDER This matter comes before the...
B.M. v. Wyndham Hotels,ORDER GRANTING IN PART AND DENYING IN PART DE...
"Bacon v. Marshall, 2023 U.S. App. LEXIS 32309",[*1] ORDER AND JUDGMENT* _____________________...


In [4]:
df = pd.DataFrame(index=narratives.index)
# iterate over the values directly; integer lookups on a string-indexed Series are deprecated
df['sent_str'] = [nltk.sent_tokenize(text) for text in narratives.narrative]
df = df.explode('sent_str')
s1 = df.index.to_series()
s2 = s1.groupby(s1).cumcount()
df.index = [df.index, s2]
df.index.names = ['title','sent_num']
# alternative tokenizer: nltk.word_tokenize(sent)
df['token_pos'] = [nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(sent)) for sent in df.sent_str]
df = df.explode('token_pos')
s1 = df.index.to_series()
s2 = s1.groupby(s1).cumcount()
df.index = [df.index.get_level_values(level=0), df.index.get_level_values(level=1), s2]
df.index.names = ['title','sent_num', 'token_num']
df.drop(columns=['sent_str'], inplace=True)
df.head()

title,sent_num,token_num,token_pos
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,0,"(OPINION, NN)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,1,"(AND, CC)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,2,"(ORDER, NNP)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,3,"(GRANTING, NNP)"
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,4,"(DEFENDANT, NNP)"

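The cell above builds the (title, sent_num, token_num) index by exploding a list column and then numbering the rows inside each group with `cumcount`. A minimal sketch of that pattern on invented data, using a plain whitespace split in place of the NLTK tokenizers so that it runs without any NLTK downloads:

```python
import pandas as pd

# Toy frame: one row per document, each holding a list of sentences.
toy = pd.DataFrame(
    {"sent_str": [["First sentence.", "Second sentence."], ["Only sentence."]]},
    index=pd.Index(["doc_a", "doc_b"], name="title"),
)

# One row per sentence; the title index value repeats for each sentence.
toy = toy.explode("sent_str")
# cumcount numbers rows within each title: 0, 1, ...
sent_num = toy.groupby(level="title").cumcount().rename("sent_num")
toy = toy.set_index(sent_num, append=True)

# Same idea one level down: one row per token, numbered within each sentence.
toy["token_str"] = toy["sent_str"].str.split()
toy = toy.explode("token_str")
token_num = toy.groupby(level=["title", "sent_num"]).cumcount().rename("token_num")
toy = toy.set_index(token_num, append=True).drop(columns="sent_str")

print(toy)
```

Each `explode` repeats the existing index values, so the counter has to be computed after the explode and before the next level is added.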

In [5]:
df['token_str'] = df.token_pos.apply(lambda x: x[0].strip())
df['term_str'] = df.token_pos.apply(lambda x: x[0].lower().strip())
df['pos_tag'] = df.token_pos.apply(lambda x: x[1])
CORPUS = df.drop(columns="token_pos")
CORPUS.head()

title,sent_num,token_num,token_str,term_str,pos_tag
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,0,OPINION,opinion,NN
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,1,AND,and,CC
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,2,ORDER,order,NNP
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,3,GRANTING,granting,NNP
"A.B. v. Shilo Inn, Salem, LLC, 2023 U.S. Dist. LEXIS 143289",0,4,DEFENDANT,defendant,NNP


In [6]:
np.random.seed(3418)

**Here's a question:** Do we want to remove stop words, perhaps with a custom stop-word list, and do we want to lemmatize the tokens? Consult Brain?

If the answer is no to one or both questions, a df with index `title` and columns `raw_narrative` and `n_tokens` is enough, and we can go straight to `TfidfVectorizer`. Most of the steps above become unnecessary: go directly from the `narratives` df to `tfidf_engine.fit_transform(narratives.narrative)`.

### Vectorization with scikit-learn, TF-IDF

In [None]:
## DOC df: index = doc name/index
## columns = narrative str, num tokens

In [None]:
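def gather_docs(CORPUS, ohco_level, term_col='term_str'):
    """Join terms into one string per group at the given OHCO level (1 = document)."""
    OHCO = list(CORPUS.index.names)
    # cast a copy of the column so the caller's CORPUS is not modified
    terms = CORPUS[term_col].astype('str')
    DOC = terms.groupby(level=OHCO[:ohco_level]).apply(lambda x: ' '.join(x)).to_frame('doc_str')
    return DOC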
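To show what `ohco_level` controls, here is a small self-contained sketch that reimplements the same grouping idea on an invented four-token corpus. The toy terms are assumptions for illustration only.

```python
import pandas as pd


def gather_docs(CORPUS, ohco_level, term_col="term_str"):
    """Join terms into one string per group at the given index depth."""
    OHCO = list(CORPUS.index.names)
    terms = CORPUS[term_col].astype(str)
    return (
        terms.groupby(level=OHCO[:ohco_level])
        .apply(lambda x: " ".join(x))
        .to_frame("doc_str")
    )


# Invented token table with the same index layout as CORPUS.
index = pd.MultiIndex.from_tuples(
    [("doc_a", 0, 0), ("doc_a", 0, 1), ("doc_a", 1, 0), ("doc_b", 0, 0)],
    names=["title", "sent_num", "token_num"],
)
toy_corpus = pd.DataFrame({"term_str": ["motion", "denied", "so", "ordered"]}, index=index)

docs = gather_docs(toy_corpus, 1)   # one row per title
sents = gather_docs(toy_corpus, 2)  # one row per (title, sent_num)

print(docs)
print(sents)
```

Level 1 produces one string per document, which is what the vectorizer needs; level 2 would produce one string per sentence.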

In [None]:
DOC = gather_docs(CORPUS, 1)
DOC['n_tokens'] = DOC.doc_str.apply(lambda x: len(x.split()))
DOC.head()

In [None]:
ngram_range = (2,2)
n_terms = 1000

**Applying TFIDF Vectorization**

In [None]:
tfidf_engine = TfidfVectorizer(
    stop_words = 'english',
    ngram_range = ngram_range,
    max_features = n_terms,
    norm = 'l2', 
    use_idf = True)
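With `use_idf = True`, `norm = 'l2'`, and scikit-learn's default `smooth_idf=True`, each score is the raw term count times `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, after which each document row is scaled to unit L2 length. A minimal sketch that checks this by hand on three invented documents (unigrams and no stop-word removal here, to keep the arithmetic easy to follow):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy documents.
docs = [
    "motion to dismiss granted",
    "motion to compel denied",
    "summary judgment granted",
]

# Raw term counts (tf), with the same default tokenization TfidfVectorizer uses.
tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)                 # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf

manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

auto = TfidfVectorizer(norm="l2", use_idf=True).fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Because of the row normalization, scores are comparable within a document but a long and a short document both end up with unit-length vectors.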

**Vectorized data**

In [None]:
X = tfidf_engine.fit_transform(DOC.doc_str)
print(X[:1])

**Learned vocabulary**

In [None]:
import itertools
print(dict(itertools.islice(tfidf_engine.vocabulary_.items(), 5)))

In [None]:
TFIDF = pd.DataFrame(X.toarray(), columns=tfidf_engine.get_feature_names_out(), index=DOC.index)
TFIDF.head()

In [None]:
TFIDF.stack().nlargest(20).to_frame('score')

### VOCAB DF
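Build a vocabulary table of the most significant n-grams, ranked by mean TF-IDF score across documents. With `ngram_range = (2, 2)` as set above, these are all bigrams; `(1, 2)` would include unigrams as well. The values are TF-IDF term weights averaged over documents, not model coefficients.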

In [None]:
VOCAB = TFIDF.mean().to_frame('tfidf_mean')
VOCAB.sort_values('tfidf_mean', ascending=False).head(20)
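A mean alone cannot distinguish a term that scores moderately everywhere from one that scores highly in a single document. One possible extension, sketched on an invented score table (the bigrams and numbers are made up, and the `tfidf_max` and `df` columns are suggestions rather than part of the notebook):

```python
import pandas as pd

# Invented document-by-term TF-IDF scores.
TFIDF_demo = pd.DataFrame(
    {
        "motion dismiss": [0.8, 0.6, 0.0],
        "summary judgment": [0.0, 0.8, 1.0],
        "hotel staff": [0.6, 0.0, 0.0],
    },
    index=pd.Index(["case_a", "case_b", "case_c"], name="title"),
)

VOCAB_demo = pd.DataFrame(
    {
        "tfidf_mean": TFIDF_demo.mean(),   # average weight over all documents
        "tfidf_max": TFIDF_demo.max(),     # strongest single-document weight
        "df": (TFIDF_demo > 0).sum(),      # number of documents containing the term
    }
).sort_values("tfidf_mean", ascending=False)

print(VOCAB_demo)
```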

### Logistic Regression
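The notebook does not yet define class labels, so the following is only a sketch of how the imported `LogisticRegression`, `model_selection`, and `accuracy_score` might fit together. The texts and the binary labels (1 = motion granted, 0 = motion denied) are invented for illustration and are not derived from the court cases.

```python
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Invented texts and HYPOTHETICAL labels: 1 = granted, 0 = denied.
texts = [
    "motion to dismiss is granted",
    "defendant motion granted in full",
    "court grants summary judgment",
    "motion for judgment is granted",
    "motion to dismiss is denied",
    "court denies the motion to compel",
    "summary judgment is denied",
    "request for dismissal denied",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_txt, test_txt, y_train, y_test = model_selection.train_test_split(
    texts, labels, test_size=0.25, random_state=3418, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it on the test split,
# so no information from the test documents leaks into the vocabulary or idf.
vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(train_txt)
X_test = vec.transform(test_txt)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print(acc)
```

With a corpus this small the accuracy figure is meaningless; the point is the order of operations (split, fit the vectorizer on train, transform test, then classify).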

### SVM
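As with the logistic regression section, no labels exist in the notebook yet, so this is a hedged sketch only. It wraps `TfidfVectorizer` and a linear-kernel `svm.SVC` in a pipeline and scores it with cross-validation; the texts, the labels (1 = granted, 0 = denied), and the choice of `kernel="linear"` and `cv=2` are all assumptions for illustration.

```python
from sklearn import model_selection, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented texts and HYPOTHETICAL labels: 1 = granted, 0 = denied.
texts = [
    "motion to dismiss is granted",
    "defendant motion granted in full",
    "court grants summary judgment",
    "motion for judgment is granted",
    "motion to dismiss is denied",
    "court denies the motion to compel",
    "summary judgment is denied",
    "request for dismissal denied",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside every fold, so each fold's
# vocabulary and idf come from that fold's training documents only.
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    svm.SVC(kernel="linear", C=1.0),
)

scores = model_selection.cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
print(scores.mean())
```

Cross-validation is a reasonable fit for a small set of documents, since a single held-out split would leave very few test cases.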