# **01_PREPROCESSING**

Summary:


1.   Import and Normalization
2.   Split Opinions into Subjects of Interest
3.   Text Cleaning





---

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import sys
sys.path.append('/content/drive/My Drive/Università/inforet_prj/')

In [3]:
!pip install -U spacy unidecode

Collecting spacy
  Downloading spacy-3.1.3-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Collecting unidecode
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting thinc<8.1.0,>=8.0.9
  Downloading thinc-8.0.10-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (623 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.1-cp37-cp37m-manylinux2014_x86_64.whl (456 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.0-py3-no

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.1.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:
import lzma, json
import pandas as pd
import pickle
from tqdm import tqdm
import spacy
import string
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from unidecode import unidecode
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
sns.set()
tqdm.pandas()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
nlp = spacy.load("en_core_web_sm")

## 1. Import and Normalization

### *1.1 Data Import*


**NB**: run the 3 cells below only if on Google Colab. Otherwise skip them and download the compressed data manually from https://api.case.law/v1/bulk/22341/download/

In [None]:
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome(options=chrome_options)  # chromedriver is found on PATH (/usr/bin)
wd.get("https://case.law/bulk/download/")

In [None]:
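from selenium.webdriver.common.by import By  # find_element_by_xpath is gone in recent Selenium 4 releases
wd.find_element(By.XPATH, "/html/body/div/main/div/div/div[2]/div/div[2]/div/div[2]/a").click()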

In [None]:
!unzip Illinois-20200302-text.zip
!mv Illinois-20200302-text/data/data.jsonl.xz data.jsonl.xz
!rm -r Illinois-20200302-text
!rm Illinois-20200302-text.zip

### *1.2 Data Normalization*

Here we build the main DataFrame (df) and the flattened lists of citations and opinions.

In [None]:
# We know that there will be 183146 items,
# so we set the total manually: tqdm cannot
# infer the length when reading from a file.
pbar = tqdm(total=183146)

# Read directly from the compressed file.
# We will create a list where each element is a line
# of the file, which in turn is a JSON object
# (parsed into a Python dict).
with lzma.open("data.jsonl.xz") as f:
    cases = []

    for line in f:
        cases.append(json.loads(str(line, 'utf8')))
        pbar.update(1)

    pbar.close()

100%|██████████| 183146/183146 [01:23<00:00, 2183.15it/s]
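The reading pattern above can be tried on a tiny synthetic file before running it on the full dump. The sketch below is only an illustration: the helper name `read_jsonl_xz` and the sample records are assumptions, not part of the original notebook. It opens the archive in text mode, which makes the explicit `str(line, 'utf8')` conversion unnecessary.

```python
import json
import lzma
import os
import tempfile


def read_jsonl_xz(path):
    """Return a list with one dict per non-empty line of an xz-compressed JSON Lines file."""
    records = []
    # "rt" opens the stream in text mode, so each line is already a str
    with lzma.open(path, "rt", encoding="utf8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


# Build a tiny synthetic file (hypothetical records, not real case data)
sample = [{"id": 1, "name": "case a"}, {"id": 2, "name": "case b"}]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.jsonl.xz")
with lzma.open(sample_path, "wt", encoding="utf8") as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")

loaded = read_jsonl_xz(sample_path)
print(len(loaded), loaded[0]["name"])
```

Reading line by line keeps only the parsed records in memory, never the whole decompressed file.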


In [None]:
# https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
df = pd.json_normalize(cases)

In [None]:
del cases

In [None]:
# Flattens the list of attorneys to a single string
# with ; as separator
df["casebody.data.attorneys"] = df.apply(lambda x: "; ".join(x["casebody.data.attorneys"]), axis=1)

In [None]:
"""
Each element of the columns 'citations' and 'casebody.data.opinions' is
a list, and in turn each element of the list is a JSON object.
This means that we need to unravel those columns to obtain a "flatter"
version (like a simple table, e.g. a DataFrame).
The approach shown here consists of creating two different DataFrames
that will contain the data from the two columns. In order to preserve the
association of each row of the new DataFrames with the corresponding data
in the original DataFrame, we add to each JSON object a new key called "id"
that has the original row number as its value.
"""

def add_id_todict(x, col):
    vals = x[col]

    for i, elem in enumerate(vals):
        d = elem
        d["id"] = x.name
        vals[i] = d

    return vals

In [None]:
df["casebody.data.opinions"] = df.apply(lambda x: add_id_todict(x, "casebody.data.opinions"), axis=1)
df["citations"] = df.apply(lambda x: add_id_todict(x, "citations"), axis=1)

In [None]:
# For clarity, let's also add the "id" column to the original df
df["id"] = df.index.values

In [None]:
# We merge each element in the "citations" column (which is a list)
# to a single list called "citations".
#
# Using list comprehension instead of df["column"].sum()
# because the latter is slow for large df. See:
# https://stackoverflow.com/a/51576777
citations = [item for x in df["citations"] for item in x]
df.drop(columns=["citations"], inplace=True)

In [None]:
# Same for the opinions column
opinions = [item for x in df["casebody.data.opinions"] for item in x]
df.drop(columns=["casebody.data.opinions"], inplace=True)

In [None]:
# Let's now get the flattened table from the citations
# and from the opinions
citations_df = pd.json_normalize(citations)

In [None]:
opinions_df = pd.json_normalize(opinions)

We now have 3 dataframes that can be joined using the "id" column.
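As a rough illustration of how the "id" key ties the tables together, the sketch below repeats the tag, flatten and join steps on a toy DataFrame. All names (`toy_cases`, `toy_opinions`, `joined`) and the data are hypothetical; the real columns are much richer.

```python
import pandas as pd

# Hypothetical toy data: two cases, each with a list of opinion dicts
toy_cases = pd.DataFrame({
    "decision_date": ["1997-05-01", "2000-01-15"],
    "opinions": [
        [{"type": "majority", "text": "a"}, {"type": "dissent", "text": "b"}],
        [{"type": "majority", "text": "c"}],
    ],
})
toy_cases["id"] = toy_cases.index.values

# Tag every nested dict with the row number of its parent case
tagged = [
    {**opinion, "id": int(case_id)}
    for case_id, opinion_list in zip(toy_cases["id"], toy_cases["opinions"])
    for opinion in opinion_list
]
toy_opinions = pd.json_normalize(tagged)

# Recover case-level information for each opinion with a left join on "id"
toy_cases["year"] = pd.to_datetime(toy_cases["decision_date"]).dt.year
joined = pd.merge(toy_opinions, toy_cases[["id", "year"]], on="id", how="left")
print(joined)
```

A left join keeps every opinion row even if, for some reason, its case were missing from the case table.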

In [None]:
df['year'] = pd.to_datetime(df['decision_date']).apply(lambda x: x.year)
opinions_df = pd.merge(opinions_df, df[['year','id']], on="id", how="left")

### *1.3 Serialize data*


In [None]:
with open("/content/drive/MyDrive/Università/inforet_prj/df.pkl", "wb") as f:
    pickle.dump(df, f)

In [None]:
with open("/content/drive/MyDrive/Università/inforet_prj/citations.pkl", "wb") as f:
    pickle.dump(citations_df, f)

In [None]:
with open("/content/drive/MyDrive/Università/inforet_prj/opinions.pkl", "wb") as f:
    pickle.dump(opinions_df, f)

In [None]:
del df
del citations_df
del opinions
del citations
del opinions_df

In [None]:
import gc
gc.collect()

253

---

## **2. Split Opinions into Subjects of Interest**

We split the rows into three groups based on the lists of terms provided for each subject of interest: narcotics, weapons and investigation.

In [3]:
with open("/content/drive/MyDrive/Università/inforet_prj/opinions.pkl", "rb") as f:
  opinions_df = pickle.load(f)

In [4]:
opinions_df["text"] = opinions_df["text"].str.replace("|", " ", regex=False)

In [5]:
opinions_df.author = opinions_df.author.fillna("")
array = opinions_df["author"].progress_apply(lambda x: nltk.word_tokenize(x.lower()))

authors_judges = []

for op in array:
    for token in op:
        if token.isalpha() and len(token) > 1:
            authors_judges.append(token)

authors_judges = set(authors_judges)

100%|██████████| 194366/194366 [00:22<00:00, 8819.74it/s]


In [6]:
with open("authors_judges.pkl", "wb") as f:
    pickle.dump(authors_judges, f)

In [7]:
def typo(text):
    cleaned_text = (
        text.replace('cannabi ','cannabis ')
        .replace('lysergic acid diethylamide', 'lsd')
        .replace('methylenedioxymethamphetamine', 'mdma')
        .replace('MDMA', 'mdma')
        .replace('methylenedioxyamphetamine', 'mda')
        .replace('ciacetyl','diacetyl')
        .replace(' nar cotic', ' narcotic')
        .replace(' fi ','')
        )
    return cleaned_text

In [8]:
opinions_df['text'] = opinions_df.text.progress_apply(lambda x: typo(x))
#typo(narco_data.lemmatized[30])

100%|██████████| 194366/194366 [00:16<00:00, 12140.36it/s]


In [12]:
#narcotics = ["cannabis",  "marijuana",  "lsd", "heroin", 'methaqualone', "ecstasy", "mdma", "cocaine", "cocaine", "methamphetamine", "hydromorphone", "dilaudid", "meperidine", "demerol", "oxycodone", "dexedrine", "fentanyl", "ritalin", "methadone", "amphetamine", "phencyclidine", "ephedrine"]
narcotics = [ "cannabis",  "marijuana",  "lsd", "heroin", 'methaqualone', "ecstasy", "peyote", "mescaline", "mda", "mdma", "cocaine", "methamphetamine", "hydromorphone", "dilaudid", "meperidine", "demerol", "oxycodone", "dexedrine", "fentanyl", "ritalin", "methadone", "amphetamine", "phencyclidine", "pseudoephedrine", "ephedrine", "meth", "opium", "dilaudid", "preludin","ketamine", "anabolic" , "steroids",  "testosterone", "ketamine", "modafinil", "provigil", "adderall", "methylphenidate", "memantine", "axura", "soma", "xanax", "darvon", "darvocet", "valium", "ativan", "talwin", "ambien", "tramadol",  "ethclorvynol","phenylpropanolamine", "lomotil", "motofen", "lyrica", "parepectolin", "tetracaine"]
weapons = ["gun", "knife", "weapon", "firearm", "rifle", "carabine", "shotgun", "assaults rifle", "sword", "blunt objects"]
investigations = ["gang", "mafia", "serial killer", "rape", "thefts", "recidivism", "arrest", "ethnicity", "caucasian", "afroamerican", "native american", "hispanic", "gender", "male", "female", "man", "woman", "girl", "boy", "robbery", "cybercrime"]

In [13]:
narco_df = opinions_df.loc[opinions_df['text'].str.contains("|".join(narcotics))] # 35410 rows
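Note that joining the terms with "|" produces a plain substring match, so short terms such as "meth" or "soma" also match inside longer words. The sketch below compares that behaviour with a word-boundary pattern on a few made-up sentences. It is an illustration only: switching the filter above to word boundaries would change the row count reported there.

```python
import re

terms = ["meth", "soma", "lsd"]  # a few terms taken from the narcotics list

# Same idea as "|".join(narcotics): matches anywhere, even inside other words
substring_pattern = re.compile("|".join(map(re.escape, terms)))
# Alternative: require word boundaries around each term
word_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b")

texts = [
    "the method used by the officer",       # "meth" only inside "method"
    "defendant possessed meth and lsd",     # genuine mentions
    "psychosomatic symptoms were alleged",  # "soma" only inside a longer word
]

substring_hits = [bool(substring_pattern.search(t)) for t in texts]
word_hits = [bool(word_pattern.search(t)) for t in texts]
print(substring_hits)
print(word_hits)
```

With these inputs the substring pattern flags all three sentences, while the word-boundary pattern flags only the second one.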

In [14]:
narco_df

| | type | text | author | id | year |
|---|---|---|---|---|---|
| 2 | majority | CHIEF JUSTICE HEIPLE\ndelivered the opinion of... | CHIEF JUSTICE HEIPLE | 1 | 1997 |
| 3 | dissent | JUSTICE HARRISON,\ndissenting:\nThe trial cour... | JUSTICE HARRISON, | 1 | 1997 |
| 6 | majority | JUSTICE FREEMAN\ndelivered the opinion of the ... | JUSTICE FREEMAN | 4 | 1997 |
| 8 | majority | JUSTICE BILANDIC\ndelivered the opinion of the... | JUSTICE BILANDIC | 5 | 1997 |
| 10 | majority | JUSTICE COLWELL\ndelivered the opinion of the ... | JUSTICE COLWELL | 7 | 1997 |
| ... | ... | ... | ... | ... | ... |
| 194341 | majority | PRESIDING JUSTICE QUINN\ndelivered the opinion... | PRESIDING JUSTICE QUINN | 183124 | 2006 |
| 194349 | majority | Mr. Justice Smith\ndelivered the opinion of th... | Mr. Justice Smith | 183131 | 1947 |
| 194353 | majority | Mb. Presiding Justice Stone\ndelivered the opi... | Mb. Presiding Justice Stone | 183133 | 1938 |
| 194357 | majority | OPINION\nFrederick, J.\nThis cause comes befor... | Frederick, J. Frederick, J. Frederick, J. | 183137 | 1999 |


In [15]:
narco_df.to_csv("narco_df.csv", index=False, sep="|")

In [16]:
!cp narco_df.csv /content/drive/MyDrive/Università/inforet_prj

In [None]:
del opinions_df
del authors_judges

In [None]:
#gc.collect()

---

## **3. Text Cleaning**
Load the set of judge and author tokens saved in the previous step.

In [None]:
with open("authors_judges.pkl", "rb") as f:
    authors_judges = pickle.load(f)

In [None]:
# Proper nouns found in the dataset
names = ["Brinks", "Flores", "People v.","Pinnix", "Garvey", "Steinbach", "Fowlar", "Mobil", "Milian", "TQ", "Yanez", "Tawanda", "Geder", "Mason", "Payne", "Bair", "ILCS",  "tbe", "tbat", "Delores","Stivers", "Spades", "Snyders", "Nally", "Budaj", "Yacoo", "Cosgrove", "Cos-grove", "Gayles", "Hodges"]

In [None]:
def full_text_clean(text):

    bb = (
        text.replace(' U.S. ','US')
        .replace(' S.Ct. ','SCt')
        .replace(' f. supp. ', ' fsupp ')
        .replace(' cir.', ' cir ')
        .replace("[o]", "o")
        .replace(" CIR ", " confidential source ")
        .replace("Reg.", " regulation ")
        .replace("miIe", " mile ")
        .replace(" com mitted ", " committed ")
        .replace("wtap", "tap")
        )
    
    temp = bb.split()
    bb = " ".join([ele for ele in temp if not ele[0].isupper()])
    
    bb = bb.split(":")
    bb.pop(0)
    bb = ' '.join(bb)


    bb = unidecode(re.sub(' +', ' ', bb.strip())) #any additional whitespaces and foreign characters
    bb = bb.strip()
    # Raw strings (r'...') keep regex escapes such as \d and \. intact.
    bb = re.sub(r'[0-9]{1,2} [Uu]\.[Ss]\.[Cc]\. §\s?\d+(\w+)?( \([0-9]{4}\))?', ' USCCITATION ', bb)
    bb = re.sub(r'[a-zA-Z]+ [vV]\. [a-zA-Z]+', ' CaseAvCaseB ', bb) # CaseA v. CaseB
    bb = re.sub(r'\d+ (Ark|Ill)\. \d+', ' StateCase ', bb) # e.g. 300 Ark. 230
    bb = re.sub(r' [Ss][Tt][Aa][Tt][Ss]\.', ' StateCase2 ', bb) # "Stats." abbreviation
    bb = re.sub(r'\d+ [A-Za-z]+\.[ ]*[A-Za-z]+\.[ ]*\d[A-Za-z]+ \d+', ' CaseRef ', bb) # 953 S.W.2d 559 or 87 L.Ed.2d 481
    bb = re.sub(r'[Jj][Rr]\.', 'Jr ', bb)
    bb = re.sub(r'\d+ (Ark|Ill)\. App\. \d+', ' StateAppCase ', bb)
    bb = re.sub(r'(Ark|Ill)\. Code Ann\. § ', ' StateCodeSection ', bb)
    bb = re.sub(r' [Ii][Dd]\.', ' Idem ', bb)
    bb = re.sub(r'§+', ' Section ', bb)
    bb = re.sub(r'[Aa][Nn][Nn][Oo][:.]* \d+ [Aa]\.*[ ]*[Ll]\.*[ ]*[Rr]\.*[ ]*\d+', 'anno', bb)
    bb = re.sub(r' [Aa][Nn][Nn][Oo][:.]*', ' anno', bb)
    bb = re.sub(r'[Cc][Ff]\.', 'cf', bb)
    bb = re.sub(r' [Rr][Ee][Vv]\. [Ss][Tt][Aa][Tt]\.', ' revstat ', bb)
    bb = re.sub(r'[ \d]+[Pp][Aa][Rr]\.', ' par ', bb)
    bb = re.sub(r'[ \d]+[Ss][Tt][Aa][Tt]\.', ' stat ', bb)
    bb = re.sub(r"[\(\[].*?[\)\]]", "", bb) # remove brackets and their contents
    
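    # Remove the placeholders. "StateCase2" must be handled before "StateCase",
    # otherwise a stray "2" is left behind; "anno" is removed only as a whole
    # word so that words such as "cannot" stay intact.
    bb = (
        bb.replace("USCCITATION", "")
        .replace("CaseAvCaseB", "")
        .replace("StateCase2", "")
        .replace("StateCase", "")
        .replace("CaseRef", "")
        .replace("StateAppCase", "")
        .replace("StateCodeSection", "")
    )
    bb = re.sub(r'\banno\b', '', bb)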

    bb = unidecode(re.sub(' +', ' ', bb.strip()))
    bb = bb.strip()

    doc = nlp(bb)
    persons = set([str(ent.text).lower() for ent in doc.ents if ent.label_ == "PERSON"])
    persons = [x.translate(str.maketrans('', '', string.punctuation)) for x in set(nltk.word_tokenize(" ".join(persons)))]
    persons.extend(names)

    result = []
    for token in doc:
        if (len(token.text) > 1 
            and token.text.isalpha() # Token is word
            and token.pos_ not in ['NUM', 'PROPN']  # Token is neither NUM nor PROPN
            and not token.is_punct # Token not punctuation
            and token.text not in authors_judges # Token is not a judge
            and token.text not in persons # Token is not a person name
        ):

            result.append((token.text.lower(), token.lemma_.lower(), token.pos_))
    
    # Our result is a string of the form:
    # "text lemma POS; text lemma POS; text lemma POS; ..."
    result = "; ".join([text + " " + lemma + " " + pos for text, lemma, pos in result])
    
    return result
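The citation handling in this function can also be done without placeholders, by substituting a space directly and collapsing whitespace afterwards. The sketch below shows that variant on a made-up sentence, together with two pitfalls of plain `str.replace` on placeholder names. The function `strip_citations` and the sample text are hypothetical and cover only three of the patterns used above.

```python
import re


def strip_citations(text):
    """Remove a few citation patterns directly, then collapse whitespace."""
    # Raw strings keep backslashes such as \d intact for the regex engine
    text = re.sub(r"[a-zA-Z]+ [vV]\. [a-zA-Z]+", " ", text)  # e.g. "smith v. jones"
    text = re.sub(r"\d+ (?:Ark|Ill)\. \d+", " ", text)       # e.g. "300 Ill. 230"
    text = re.sub(r"\banno\b[:.]*", " ", text)               # "anno" as a whole word only
    return re.sub(r" +", " ", text).strip()


sample = "we cannot follow smith v. jones, 300 Ill. 230, anno. on this point"
cleaned = strip_citations(sample)
print(cleaned)

# Pitfall 1: a plain substring replace also hits ordinary words
naive = "cannot".replace("anno", "")
# Pitfall 2: removing a placeholder that is a prefix of another leaves debris
leftover = "StateCase2".replace("StateCase", "")
print(repr(naive), repr(leftover))
```

The word boundary `\b` is what protects "cannot"; removing the longer placeholder first would be the equivalent safeguard when chaining `str.replace` calls.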

In [34]:
# NOTE: this cell takes about 5 hours to run
with open("narco_nlp.csv", "w") as my_empty_csv:
    pass

pbar = tqdm(total=35410) # narco_df total rows
chunksize = 1

for chunk in pd.read_csv("narco_df.csv", chunksize=chunksize, sep="|", usecols=["text"]):
    chunk['spacy_nlp'] = chunk.apply(lambda row: full_text_clean(row["text"]), axis=1)
    chunk.drop(columns=["text"], inplace=True)
    chunk.to_csv("narco_nlp.csv", index=False, sep="|", mode="a", header=False)

    pbar.update(1)

pbar.close()

  0%|          | 106/35410 [01:10<7:10:35,  1.37it/s]

KeyboardInterrupt: ignored
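The same read, clean and append loop can be sketched on a toy file to check the mechanics quickly. In the sketch below `clean` is only a stand-in for the expensive cleaning function, the file names are hypothetical, and a chunk size larger than 1 is used as an assumption to reduce per-chunk overhead. The header is written once, so the output can be read back with its column name.

```python
import os
import tempfile

import pandas as pd


def clean(text):
    # Stand-in for the expensive cleaning function (hypothetical, for illustration)
    return text.lower().strip()


tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "toy_in.csv")
dst = os.path.join(tmp_dir, "toy_out.csv")

pd.DataFrame({"text": [" A ", " B ", " C ", " D ", " E "]}).to_csv(src, index=False, sep="|")

first = True
for chunk in pd.read_csv(src, chunksize=2, sep="|", usecols=["text"]):
    out = pd.DataFrame({"spacy_nlp": chunk["text"].map(clean)})
    # Overwrite on the first chunk, append afterwards; write the header only once
    out.to_csv(dst, index=False, sep="|", mode="w" if first else "a", header=first)
    first = False

result = pd.read_csv(dst, sep="|")
print(result["spacy_nlp"].tolist())
```

Appending chunk by chunk means that an interrupted run still leaves the rows processed so far on disk.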

In [None]:
!cp narco_nlp.csv /content/drive/MyDrive/Università/inforet_prj

## **SENTENCES**

In [None]:
narcotics_schedule_1 = ["cannabis",  "marijuana", "mdma", "lsd", "heroin", "cannabis"]

In [None]:
narco_1_pmi = opinions_df.loc[opinions_df['text'].str.contains("|".join(narcotics_schedule_1))] # 5923

In [None]:
from nltk.tokenize import sent_tokenize
narco_1_pmi["sentences"] = narco_1_pmi.text.apply(lambda x: sent_tokenize(x)) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
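This warning appears because the filtered DataFrame may be a view of the original one. One common way to avoid it, sketched below on toy data, is to take an explicit `.copy()` of the filtered rows before adding columns. The names `toy` and `subset` are hypothetical, and a plain `split` replaces `sent_tokenize` to keep the sketch self-contained.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["heroin was found. it was seized.", "no drugs here."]})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column no longer targets a possible view of `toy`
subset = toy.loc[toy["text"].str.contains("heroin")].copy()
subset["sentences"] = subset["text"].apply(lambda t: t.split(". "))

print(subset)
```

The original `toy` frame is left untouched, which is usually what is intended when working on a filtered subset.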
  


In [None]:
narco_1_pmi["sentences_joined"] = narco_1_pmi.sentences.apply(lambda x: "; ".join([y.replace(";", "") for y in x]))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [None]:
narco_1_pmi.drop(columns=["sentences"]).to_csv("narco_1_pmi.csv", index=False, sep="|")
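Because a list column does not survive a CSV round trip as a list, the sentences are stored as a single string joined by "; ". The short sketch below, on made-up sentences, shows why semicolons are first removed from each sentence: it keeps the separator unambiguous when the string is split again.

```python
sentences = ["first sentence; with a semicolon.", "second sentence.", "third one."]

# Drop semicolons inside each sentence so "; " is unambiguous as a separator
joined = "; ".join(s.replace(";", "") for s in sentences)
restored = joined.split("; ")

print(joined)
print(restored)
```

The number of sentences is preserved, at the cost of losing the semicolons inside them.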

In [None]:
!cp narco_1_pmi.csv /content/drive/MyDrive/Università/inforet_prj

In [None]:
narco_1_pmi.sentences[8]

In [None]:
names = ["Brinks", "Flores", "People v.","Pinnix", "Garvey", "Steinbach", "Fowlar", "Mobil", "Milian", "TQ", "Yanez", "Tawanda", "Geder", "Mason", "Payne", "Bair", "ILCS",  "tbe", "tbat", "Delores","Stivers", "Spades", "Snyders", "Nally", "Budaj", "Yacoo", "Cosgrove", "Cos-grove", "Gayles", "Hodges"]

In [None]:
def full_text_clean_sentences(testo):
    lista = []
    for text in testo:
        #doc = nlp(text)
        #persons = set([ent.text for ent in doc.ents if ent.label_ == "PERSON"])

        bb = (
            #text.lower()
            text.replace(' U.S. ','US')
            .replace(' S.Ct. ','SCt')
            .replace(' f. supp. ', ' fsupp ')
            .replace(' cir.', ' cir ')
            .replace("[o]", "o") #OCR is doing weird things: [o]ne
            .replace(" CIR ", " confidential source ")
            .replace("Reg.", " regulation ")
            )
        
        temp = bb.split()
        bb = " ".join([ele for ele in temp if not ele[0].isupper()])
        
        #bb = bb.split(":")
        #bb.pop(0)
        #bb = ' '.join(bb)
        
        # IMPORTANT:
        # The regex steps applied to bb below do not remove the matches!
        # Note, for example, that
        #
        #   bb = re.sub('[a-zA-Z]+ [vV]\. [a-zA-Z]+',' CaseAvCaseB ', bb)
        #
        # means: find pieces of text like "CaseA v. CaseB" and replace them with
        # "CaseAvCaseB". A placeholder is therefore inserted.
        # So, if you actually want to remove that string completely
        # (e.g. you are not interested in counting how many case references there are),
        # there are two ways. One is to change the second argument of every re.sub()
        # call below to " ", that is:
        #
        #   bb = re.sub('[a-zA-Z]+ [vV]\. [a-zA-Z]+', " ", bb)
        #
        # Once that is done, remove multiple spaces by repeating, after these
        # re.sub(...) calls, the first two lines:
        #
        #   bb = unidecode(re.sub(' +', ' ', bb.strip()))
        #   bb = bb.strip()
        #
        # The other method, perhaps faster, is to use replace as I did below.

        bb = unidecode(re.sub(' +', ' ', bb.strip())) #any additional whitespaces and foreign characters
        bb = bb.strip()
        bb = re.sub(r'[0-9]{1,2} [Uu]\.[Ss]\.[Cc]\. §\s?\d+(\w+)?( \([0-9]{4}\))?', ' USCCITATION ', bb)
        bb = re.sub(r'[a-zA-Z]+ [vV]\. [a-zA-Z]+', ' CaseAvCaseB ', bb) # CaseA v. CaseB
        bb = re.sub(r'\d+ (Ark|Ill)\. \d+', ' StateCase ', bb) # e.g. 300 Ark. 230
        bb = re.sub(r' [Ss][Tt][Aa][Tt][Ss]\.', ' StateCase2 ', bb) # "Stats." abbreviation
        bb = re.sub(r'\d+ [A-Za-z]+\.[ ]*[A-Za-z]+\.[ ]*\d[A-Za-z]+ \d+', ' CaseRef ', bb) # 953 S.W.2d 559 or 87 L.Ed.2d 481
        bb = re.sub(r'[Jj][Rr]\.', 'Jr ', bb)
        bb = re.sub(r'\d+ (Ark|Ill)\. App\. \d+', ' StateAppCase ', bb)
        bb = re.sub(r'(Ark|Ill)\. Code Ann\. § ', ' StateCodeSection ', bb)
        bb = re.sub(r' [Ii][Dd]\.', ' Idem ', bb)
        bb = re.sub(r'§+', ' Section ', bb)
        bb = re.sub(r'[Aa][Nn][Nn][Oo][:.]* \d+ [Aa]\.*[ ]*[Ll]\.*[ ]*[Rr]\.*[ ]*\d+', 'anno', bb)
        bb = re.sub(r' [Aa][Nn][Nn][Oo][:.]*', ' anno', bb)
        bb = re.sub(r'[Cc][Ff]\.', 'cf', bb)
        bb = re.sub(r' [Rr][Ee][Vv]\. [Ss][Tt][Aa][Tt]\.', ' revstat ', bb)
        #bb = re.sub(r'[ \d]+[Cc][Hh]\.',' ch ', bb)
        bb = re.sub(r'[ \d]+[Pp][Aa][Rr]\.', ' par ', bb)
        bb = re.sub(r'[ \d]+[Ss][Tt][Aa][Tt]\.', ' stat ', bb)
        bb = re.sub(r"[\(\[].*?[\)\]]", "", bb) # remove brackets and their contents. Why is this needed?
        
        
        # Second method: placeholder words can be removed
        # in this way. "StateCase2" is handled before "StateCase"
        # so that no stray "2" is left behind, and "anno" is removed
        # only as a whole word so that "cannot" stays intact.
        bb = (
            bb.replace("USCCITATION", "")
            .replace("CaseAvCaseB", "")
            .replace("StateCase2", "")
            .replace("StateCase", "")
            .replace("CaseRef", "")
            .replace("StateAppCase", "")
            .replace("StateCodeSection", "")
            #.replace("cf", "")
            #.replace("ch", "")
            #.replace("par", "")
            #.replace("stat", "")
        )
        bb = re.sub(r'\banno\b', '', bb)
        bb = unidecode(re.sub(' +', ' ', bb.strip()))
        bb = bb.strip()

        doc = nlp(bb)
        persons = set([str(ent.text).lower() for ent in doc.ents if ent.label_ == "PERSON"])
        persons = [x.translate(str.maketrans('', '', string.punctuation)) for x in set(nltk.word_tokenize(" ".join(persons)))]
        persons.extend(names)


        # Here we clean out persons and authors, and also extract
        # the lemmatization and the spaCy POS.


        result = []
        for token in doc:
            if (len(token.text) > 1 
                and token.text.isalpha() # Token is word
                and token.pos_ not in ['NUM', 'PROPN', 'ADV']  # Token is not NUM, PROPN or ADV
                and not token.is_punct # Token not punctuation
                and token.text not in authors_judges # Token is not a judge
                and token.text not in persons # Token is not a person name
            ):

                result.append(token.lemma_.lower())
        
        # Our result is a single string of space-separated lemmas:
        # "lemma lemma lemma ..."
        result = " ".join(result)
        lista.append(result)
    return lista

In [None]:
narco_1_pmi.iloc[8]["sentences"]

In [None]:
full_text_clean_sentences(narco_1_pmi["sentences"][8])

In [None]:
with open("narco_1_pmi_nlp.csv", "w") as my_empty_csv:
    pass

pbar = tqdm(total=4165) # narco_1_pmi total rows
chunksize = 1

for chunk in pd.read_csv("narco_1_pmi.csv", chunksize=chunksize, sep="|", usecols=["sentences_joined"]):
    chunk["sentences"] = chunk.sentences_joined.apply(lambda x: x.split("; "))
    chunk['spacy_nlp'] = chunk.apply(lambda row: full_text_clean_sentences(row["sentences"]), axis=1)
    chunk.drop(columns=["sentences", "sentences_joined"], inplace=True)
    chunk.to_csv("narco_1_pmi_nlp.csv", index=False, sep="|", mode="a", header=False)

    pbar.update(1)

pbar.close()

100%|██████████| 4165/4165 [1:55:24<00:00,  1.66s/it]


In [None]:
narco_1_pmi

| | type | text | author | id | year | sentences | sentences_joined |
|---|---|---|---|---|---|---|---|
| 8 | majority | JUSTICE BILANDIC\ndelivered the opinion of the... | JUSTICE BILANDIC | 5 | 1997 | [JUSTICE BILANDIC\ndelivered the opinion of th... | JUSTICE BILANDIC\ndelivered the opinion of the... |
| 63 | majority | PRESIDING JUSTICE CAHILL\ndelivered the opinio... | PRESIDING JUSTICE CAHILL | 55 | 2000 | [PRESIDING JUSTICE CAHILL\ndelivered the opini... | PRESIDING JUSTICE CAHILL\ndelivered the opinio... |
| 65 | majority | JUSTICE HUTCHINSON\ndelivered, the opinion of ... | JUSTICE HUTCHINSON | 56 | 2000 | [JUSTICE HUTCHINSON\ndelivered, the opinion of... | JUSTICE HUTCHINSON\ndelivered, the opinion of ... |
| 83 | majority | JUSTICE McBRIDE\ndelivered the opinion of the ... | JUSTICE McBRIDE | 72 | 2000 | [JUSTICE McBRIDE\ndelivered the opinion of the... | JUSTICE McBRIDE\ndelivered the opinion of the ... |
| 86 | majority | JUSTICE COLWELL\ndelivered the opinion of the ... | JUSTICE COLWELL | 75 | 2000 | [JUSTICE COLWELL\ndelivered the opinion of the... | JUSTICE COLWELL\ndelivered the opinion of the ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 194287 | majority | PRESIDING JUSTICE RIZZI\ndelivered the opinion... | PRESIDING JUSTICE RIZZI | 183077 | 1986 | [PRESIDING JUSTICE RIZZI\ndelivered the opinio... | PRESIDING JUSTICE RIZZI\ndelivered the opinion... |
| 194288 | concurrence | JUSTICE McNAMARA,\nspecially concurring:\nI ag... | JUSTICE McNAMARA, | 183077 | 1986 | [JUSTICE McNAMARA,\nspecially concurring:\nI a... | JUSTICE McNAMARA,\nspecially concurring:\nI ag... |
| 194322 | majority | Mr. PRESIDING JUSTICE GOLDBERG\ndelivered the ... | Mr. PRESIDING JUSTICE GOLDBERG | 183108 | 1976 | [Mr. PRESIDING JUSTICE GOLDBERG\ndelivered the... | Mr. PRESIDING JUSTICE GOLDBERG\ndelivered the ... |
| 194341 | majority | PRESIDING JUSTICE QUINN\ndelivered the opinion... | PRESIDING JUSTICE QUINN | 183124 | 2006 | [PRESIDING JUSTICE QUINN\ndelivered the opinio... | PRESIDING JUSTICE QUINN\ndelivered the opinion... |


In [None]:
y = narco_1_pmi[:3]
y

| | type | text | author | id | year | sentences |
|---|---|---|---|---|---|---|
| 8 | majority | JUSTICE BILANDIC\ndelivered the opinion of the... | JUSTICE BILANDIC | 5 | 1997 | [JUSTICE BILANDIC\ndelivered the opinion of th... |
| 63 | majority | PRESIDING JUSTICE CAHILL\ndelivered the opinio... | PRESIDING JUSTICE CAHILL | 55 | 2000 | [PRESIDING JUSTICE CAHILL\ndelivered the opini... |
| 65 | majority | JUSTICE HUTCHINSON\ndelivered, the opinion of ... | JUSTICE HUTCHINSON | 56 | 2000 | [JUSTICE HUTCHINSON\ndelivered, the opinion of... |


In [None]:
y["sent_clean"] = y.sentences.progress_apply(lambda x: full_text_clean_sentences(x)) 


100%|██████████| 3/3 [00:08<00:00,  2.84s/it]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [None]:
y.iloc[0]["sent_clean"]

In [None]:
!cp narco_1_pmi_nlp.csv /content/drive/MyDrive/Università/inforet_prj