### New Methodology

Last time, I created a function that takes the document, a start phrase, and an end phrase, and finds the narrative within the given court case. While this works, it cannot be automated, because the start and end phrases are not consistent across districts, counties, states, etc.

After another careful review of the court cases given to me, I noticed that IF a court case has a narrative section (some documents do not), that section is devoid of links (that is, text that is blue, italicized, AND underlined). The idea for this new methodology is therefore to extract the sections of the document that contain NO links.

The rough process will be as follows:
1. Create a list of all potential headers (as we did before).
2. About 2-3 major sections are consistent across all the court cases I've seen: Case Summary/Core Terms, Notes, and Opinion. The probability of the narrative being in the first two is extremely low, so the next step is to extract the potential headers, and the text associated with them, within the Opinion section.
3. Go through the text between each pair of consecutive potential headers and count the number of words associated with a link (that is, the number of blue, italicized, and underlined words).
4. If the count is 0 at the end of the scan, save that section in a dictionary, with the potential header name as the key and a list of the words within that section as the value. If the dictionary is empty, there is no narrative section within the document, and the algorithm will let the user know.
5. Combine the keys and values of the dictionary into one string of text (it may look ugly and need some cleaning).
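
Steps 3 to 5 can be sketched on plain Python data, without a real PDF. Everything here is a stand-in for illustration: the span dictionaries, the precomputed `is_header` and `is_link` flags, and the function names `group_by_header` and `build_narrative` are hypothetical, and the values are lists of span texts rather than individual words.

```python
# Toy stand-in for PDF spans (hypothetical structure; real spans come from
# PyMuPDF and carry font information instead of ready-made flags).
spans = [
    {"text": "Opinion", "is_header": True, "is_link": False},
    {"text": "The court cites", "is_header": False, "is_link": False},
    {"text": "Penal Law § 240.37", "is_header": False, "is_link": True},
    {"text": "Background", "is_header": True, "is_link": False},
    {"text": "First sentence.", "is_header": False, "is_link": False},
    {"text": "Second sentence.", "is_header": False, "is_link": False},
]


def group_by_header(spans):
    """Map each header text to the list of spans that follow it."""
    sections = {}
    current = None
    for span in spans:
        if span["is_header"]:
            current = span["text"]
            sections[current] = []
        elif current is not None:
            sections[current].append(span)
    return sections


def build_narrative(spans):
    """Keep only link-free sections and join them into one string."""
    kept = {}
    for header, body in group_by_header(spans).items():
        link_count = sum(1 for s in body if s["is_link"])  # step 3
        if link_count == 0:  # step 4
            kept[header] = [s["text"] for s in body]
    if not kept:
        return None  # no narrative section found
    # step 5: combine keys and values into a single string
    return " ".join(f"{header}: {' '.join(texts)}" for header, texts in kept.items())


narrative = build_narrative(spans)
print(narrative if narrative else "No narrative section found.")
```

Returning `None` when nothing survives the filter keeps the "no narrative section" case explicit, so the caller decides how to report it.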


In [6]:
# needed packages
import fitz  # PyMuPDF
import sys
from collections import Counter  # used to tally style strings

In [7]:
def flags_decomposer(flags):
    """Make font flags human readable."""
    l = []
    if flags & 2 ** 0:
        l.append("superscript")
    if flags & 2 ** 1:
        l.append("italic")
    if flags & 2 ** 2:
        l.append("serifed")
    else:
        l.append("sans")
    if flags & 2 ** 3:
        l.append("monospaced")
    else:
        l.append("proportional")
    if flags & 2 ** 4:
        l.append("bold")
    return ", ".join(l)

In [24]:
def get_styles(doc):
    style_counts = []

    for page in doc:
        #, flags=11

        paths = page.get_drawings()  # get drawings on the page

        drawn_lines = []
        for p in paths:
            # print(p)
            for item in p["items"]:
                # print(item[0])
                if item[0] == "l":  # an actual line
                    # print(item[1], item[2])
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    # print(item[0])
                    # print(item[1])
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points
        
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    r = fitz.Rect(s['bbox']) 
                    for p1, p2 in drawn_lines:  # check distances for start / end points
                        if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                            font_properties = " ".join([font_properties, 'underlined'])

                    style_counts.append(font_properties)
    styles = dict(Counter(style_counts))

    style_list = sorted(styles.items(), key=lambda x:x[1], reverse=True)
    
    return style_list
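
The underline test inside get_styles() has two parts: collecting horizontal lines (or very thin rectangles) from the page drawings, then checking whether one of them starts and ends near the bottom corners of a span. The sketch below isolates that logic using plain named tuples in place of PyMuPDF's point and rectangle objects, so it runs without a PDF. The names `horizontal_lines` and `is_underlined` are hypothetical, and the thresholds simply mirror the ones used above.

```python
import math
from collections import namedtuple

# Plain stand-ins for PyMuPDF's Point and Rect (assumption: only these
# coordinates are needed for the underline test).
Point = namedtuple("Point", "x y")
Rect = namedtuple("Rect", "x0 y0 x1 y1")  # top-left (x0, y0), bottom-right (x1, y1)


def horizontal_lines(paths, max_height=2):
    """Collect (start, end) points of horizontal lines and thin rectangles."""
    lines = []
    for path in paths:
        for item in path["items"]:
            if item[0] == "l":  # a line segment
                p1, p2 = item[1], item[2]
                if p1.y == p2.y:
                    lines.append((p1, p2))
            elif item[0] == "re":  # a rectangle: treat a thin one as a line
                r = item[1]
                width, height = r.x1 - r.x0, r.y1 - r.y0
                if width > height and height <= max_height:
                    lines.append((Point(r.x0, r.y0), Point(r.x1, r.y0)))
    return lines


def is_underlined(bbox, lines, tol=4):
    """True if some line starts and ends near the bottom corners of bbox."""
    bl, br = Point(bbox.x0, bbox.y1), Point(bbox.x1, bbox.y1)
    return any(
        math.dist(bl, p1) <= tol and math.dist(br, p2) <= tol
        for p1, p2 in lines
    )


paths = [
    {"items": [("l", Point(10, 21), Point(60, 21))]},  # horizontal line
    {"items": [("l", Point(10, 30), Point(10, 80))]},  # vertical line, ignored
    {"items": [("re", Rect(100, 50, 150, 51))]},       # thin rectangle
]
lines = horizontal_lines(paths)
print(is_underlined(Rect(10, 10, 60, 20), lines))  # span sitting just above the first line
print(is_underlined(Rect(10, 40, 60, 50), lines))  # span with no line beneath it
```

Because `any()` stops at the first match, a span is reported as underlined at most once, however many drawn lines sit beneath it.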

In [11]:
def get_headers(doc, style_list):

    headers = {}
    count = 0
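    # paragraph size = size of the most common style; float() also handles
    # fractional sizes such as 10.5, which int() would reject
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))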
    slist = [style[0] for style in style_list][1:]

    for page in doc:

        paths = page.get_drawings()  # get drawings on the page

        drawn_lines = []
        for p in paths:
            # print(p)
            for item in p["items"]:
                # print(item[0])
                if item[0] == "l":  # an actual line
                    # print(item[1], item[2])
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    # print(item[0])
                    # print(item[1])
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points
        
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    # print(s['text'][:8])
                    # print(s['bbox'])
                    count+=1
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )
            
                    if s['size'] >= p_size:
                        if s['text'].isupper():
                            headers.update({s['text']: count})
            
                        if font_properties in slist:
                            headers.update({s['text']: count})
                
                        r = fitz.Rect(s['bbox']) 
                        for p1, p2 in drawn_lines:  # check distances for start / end points
                            if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                                headers.update({s['text']: count})
                                break
                    
    return headers

### Step 1: Extracting the Potential Headers

Above are the functions that I made last time. Combining these will complete the first step of the process. Here's an example below:

In [12]:
doc = fitz.open('pplvpv.pdf')
get_styles(doc)

[("Font: 'Helvetica' (sans, proportional), size 10, color #000000", 450),
 ("Font: 'Helvetica' (sans, proportional), size 9, color #000000", 101),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 10, color #0077cc",
  93),
 ("Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 10, color #000000",
  65),
 ("Font: 'Helvetica-Bold' (sans, proportional, bold), size 10, color #000000",
  55),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 9, color #0077cc",
  30),
 ("Font: 'Helvetica' (superscript, sans, proportional), size 8, color #000000",
  24),
 ("Font: 'Helvetica' (sans, proportional), size 6, color #000000", 24),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 9, color #000000",
  20),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 10, color #000000",
  11),
 ("Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 9, color #000000",
  11),
 ("Font: 'Arial' (sans, proportional), size 1

In [13]:
style_list = get_styles(doc)
get_headers(doc, style_list)

{'People v P.V.': 1,
 '2005QN060819': 4,
 ' [****1] ': 7,
 'Subsequent History:': 9,
 'People v. V., 2019 NYLJ LEXIS 1619 (Apr. 30, 2019)': 13,
 'Case Summary': 14,
 'Overview': 15,
 'CPL 440.10(1)(i)': 17,
 'sex trafficking': 824,
 'sex': 802,
 'Outcome': 28,
 'Counsel:': 30,
 ' [***1] ': 32,
 'The Legal Aid Society': 33,
 'Criminal Appeals Bureau': 35,
 'Elizabeth L. Isaacs': 37,
 'Richard A. Brown': 40,
 'District Attorney': 42,
 'Kathryn E. Mullen': 44,
 'Judges:': 46,
 'Opinion by:': 48,
 'Opinion': 50,
 ' [**498]  [*345] ': 51,
 ' [*346] ': 57,
 '§ ': 59,
 '230.00': 315,
 'Penal Law § 240.37': 64,
 'Penal Law § ': 667,
 '120.00': 70,
 'Penal Law § 240.26 [1]': 544,
 'Criminal Procedure Law § 440.10 (1) (i)': 84,
 '(6)': 561,
 ' [****2] ': 91,
 'Penal Law § 230.34': 528,
 ' [***2] ': 96,
 ' [**499] ': 113,
 'Relevant Laws': 114,
 'sex ': 339,
 'trafficking': 793,
 ' [*347] ': 120,
 'Criminal Procedure Law § 440.10': 122,
 'section 240.37': 128,
 '230.03': 133,
 ' [***3] ': 135,
 '
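
In the output above, a few keys carry locations far out of sequence (for example, 'sex trafficking': 824 sits between entries at 17 and 28). That follows from keying the dictionary on the span text: each time the same text reappears, `headers.update` overwrites the earlier location. One possible alternative, sketched here on made-up data, is to keep one record per occurrence. The `(location, text)` input format and the name `collect_headers` are assumptions, not part of the original function.

```python
from collections import defaultdict


def collect_headers(candidates):
    """Return header occurrences in document order plus an index by text.

    `candidates` is an iterable of (location, text) pairs for spans that
    already passed the header test (hypothetical input format).
    """
    ordered = sorted(candidates)  # document order by span location
    by_text = defaultdict(list)
    for location, text in ordered:
        by_text[text].append(location)  # keep every occurrence, not just the last
    return ordered, dict(by_text)


# Made-up candidates: the same phrase appears at three locations.
candidates = [
    (17, "CPL 440.10(1)(i)"),
    (19, "sex trafficking"),
    (28, "Outcome"),
    (310, "sex trafficking"),
    (824, "sex trafficking"),
]

ordered, by_text = collect_headers(candidates)
print(by_text["sex trafficking"])
print(ordered[1])
```

With every occurrence preserved, a later step can choose the first location, the last one, or treat each occurrence as its own section boundary.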

#### An extension to step 1

To make things easier, I want to distinguish how each potential header is 'different'. Right now we only have a list of all potential headers, with no record of why each one qualified. I want to indicate which ones are larger in text size than paragraph text, which ones have different font properties (and how), and which ones are underlined. So I'm going to change the get_headers() function to add that information.

The output will be a dictionary of dictionaries: each key is the header text, and each value is another dictionary holding the count (location) and the type (how it's different).
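
As a rough illustration of how this nested dictionary could be used, the sketch below filters headers by a keyword in their type string and returns them in document order. The sample entries are abbreviated and partly made up, and `headers_of_type` is a hypothetical helper.

```python
def headers_of_type(headers, keyword):
    """Return (text, location) pairs whose type mentions keyword, in document order."""
    matches = [
        (text, info["location"])
        for text, info in headers.items()
        if keyword in info["type"]
    ]
    return sorted(matches, key=lambda pair: pair[1])


# Abbreviated, partly made-up sample in the shape described above.
headers = {
    "Opinion": {
        "location": 50,
        "type": "Font: 'Helvetica-Bold' (sans, proportional, bold), size 10, color #000000",
    },
    "People v P.V.": {
        "location": 1,
        "type": "Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), "
                "size 14, color #0077cc underlined",
    },
    "OVERVIEW": {
        "location": 15,
        "type": "upper, Font: 'Helvetica' (sans, proportional), size 10, color #000000",
    },
}

print(headers_of_type(headers, "underlined"))
print(headers_of_type(headers, "bold"))
```

Matching on substrings of the type string is quick but loose; storing the individual properties as separate fields would make such queries more reliable.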

In [26]:
def get_headers(doc, style_list):

    headers = {}
    count = 0
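    # paragraph size = size of the most common style; float() also handles
    # fractional sizes such as 10.5, which int() would reject
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))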
    slist = [style[0] for style in style_list][1:]

    for page in doc:

        paths = page.get_drawings()  # get drawings on the page

        drawn_lines = []
        for p in paths:
            # print(p)
            for item in p["items"]:
                # print(item[0])
                if item[0] == "l":  # an actual line
                    # print(item[1], item[2])
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    # print(item[0])
                    # print(item[1])
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points
        
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    # print(s['text'][:8])
                    # print(s['bbox'])
                    count+=1
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    r = fitz.Rect(s['bbox']) 
                    for p1, p2 in drawn_lines:  # check distances for start / end points
                        if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                            font_properties = " ".join([font_properties, 'underlined'])
            
                    if s['size'] >= p_size:
                        if s['text'].isupper():
                            headers.update({s['text']: {'location':count, 'type':f'upper, {font_properties}'}})
            
                        if font_properties in slist:
                            headers.update({s['text']: {'location':count, 'type':f'{font_properties}'}})
                    
    return headers

In [29]:
doc = fitz.open('pplvpv.pdf')
get_styles(doc)[:4]

[("Font: 'Helvetica' (sans, proportional), size 10, color #000000", 440),
 ("Font: 'Helvetica' (sans, proportional), size 9, color #000000", 101),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 10, color #0077cc underlined",
  90),
 ("Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 10, color #000000 underlined",
  65)]

In [31]:
style_list = get_styles(doc)
get_headers(doc, style_list)['People v P.V.']

{'location': 1,
 'type': "Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 14, color #0077cc underlined"}

In [33]:
style_list[0][0].split('color')[1].split()[0].strip(',')

'#000000'
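
The split chains used to pull the size and color out of a style string could be collected into one helper. This is only a sketch, and `parse_style` is a hypothetical name. It returns the color as an integer because the span's `color` value is formatted with `%06x` in the functions above, which means the spans store it as an integer rather than as a '#rrggbb' string.

```python
import re

# Matches the "size ..., color #......" part of the style strings built above.
STYLE_RE = re.compile(r"size (?P<size>[\d.]+), color #(?P<color>[0-9a-fA-F]{6})")


def parse_style(style):
    """Extract font size (float) and color (int) from a style string."""
    match = STYLE_RE.search(style)
    if match is None:
        raise ValueError(f"unrecognized style string: {style!r}")
    return float(match["size"]), int(match["color"], 16)


size, color = parse_style(
    "Font: 'Helvetica' (sans, proportional), size 10, color #000000"
)
print(size, color)

link_style = parse_style(
    "Font: 'Helvetica-Oblique' (italic, sans, proportional), size 9, color #0077cc underlined"
)
print(link_style)
```

Parsing once into numbers means later comparisons against `s['size']` and `s['color']` are like-for-like.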

### Step 3: Scanning the Text
I'm going to skip step 2 for now, since it is a refinement meant to improve the algorithm's accuracy rather than a strict requirement. I'm going to repurpose the get_narrative() function that I made last time to scan through each section.

In [46]:
def get_narrative(doc, style_list, header_dict):

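    # paragraph color as an integer, so it can be compared with s['color'] (also an integer)
    p_color = int(style_list[0][0].split('color')[1].split()[0].strip(',').lstrip('#'), 16)
    # paragraph size; float() also handles fractional sizes such as 10.5
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))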
    # list of potential headers
    keys_as_list = list(header_dict)
    # keeping track of location (aka index of span)
    count = 0
    # keeping track of the number of links in each section
    link_tracker = {}

    for header_index in range(len(keys_as_list)):
        start = keys_as_list[header_index]
        start_span = header_dict[start]["location"]
        
        if header_index+1 < len(keys_as_list):
            end = keys_as_list[header_index+1]
            end_span = header_dict[end]["location"]
        else:
            end = keys_as_list[header_index]
            end_span = header_dict[end]["location"]

        # keeping track of the number of links found in each potential header section
        links_counter = 0

        for page in doc:

            paths = page.get_drawings()  # get drawings on the page

            drawn_lines = []
            for p in paths:
                # print(p)
                for item in p["items"]:
                    # print(item[0])
                    if item[0] == "l":  # an actual line
                        # print(item[1], item[2])
                        p1, p2 = item[1], item[2]
                        if p1.y == p2.y:
                            drawn_lines.append((p1, p2))
                    elif item[0] == "re":  # a rectangle: check if height is small
                        # print(item[0])
                        # print(item[1])
                        r = item[1]
                        if r.width > r.height and r.height <= 2:
                            drawn_lines.append((r.tl, r.tr))  # take top left / right points
            
            blocks = page.get_text("dict", flags=11)["blocks"]

            for b in blocks:  # iterate through the text blocks
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        # keeping track of span index
                        count += 1

                        font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                            s["font"],  # font name
                            flags_decomposer(s["flags"]),  # readable font flags
                            s["size"],  # font size
                            s["color"],  # font color
                        )

                        r = fitz.Rect(s['bbox']) 
                        for p1, p2 in drawn_lines:  # check distances for start / end points
                            if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                                font_properties = " ".join([font_properties, 'underlined'])
                        
                        # checking text in between start and end index
                        if count >= start_span and count < end_span:
                            # setting up a check indicator for whether the span is a link
                            # start by assuming the text is a link
                            check = True
                            # if it's a different size than paragraph text, it's not a link
                            if s['size']!=p_size:
                                check = False
                            # if it's not italicized, it's not a link
                            if 'italic' not in font_properties:
                                check = False
                            # if it's not underlined, it's not a link
                            if 'underlined' not in font_properties:
                                check = False
                            # if it's the same color as the paragraph text, it's not a link
                            if s['color']==p_color:
                                check = False
                            # if after checking font size, italics, underlines, and color, 
                            # the text is still classified as a link, add 1 to links_counter
                            if check:
                                links_counter+=1

        link_tracker.update({start:links_counter})                                
                    
    return link_tracker

In [47]:
headers = get_headers(doc, style_list)

In [48]:
get_narrative(doc, style_list, headers)

{'People v P.V.': 0,
 '2005QN060819': 0,
 ' [****1] ': 0,
 'Subsequent History:': 0,
 'People v. V., 2019 NYLJ LEXIS 1619 (Apr. 30, 2019)': 0,
 'Case Summary': 0,
 'Overview': 0,
 'CPL 440.10(1)(i)': 0,
 'sex trafficking': 0,
 'sex': 0,
 'Outcome': 0,
 'Counsel:': 0,
 ' [***1] ': 0,
 'The Legal Aid Society': 0,
 'Criminal Appeals Bureau': 0,
 'Elizabeth L. Isaacs': 0,
 'Richard A. Brown': 0,
 'District Attorney': 0,
 'Kathryn E. Mullen': 0,
 'Judges:': 0,
 'Opinion by:': 0,
 'Opinion': 0,
 ' [**498]  [*345] ': 0,
 ' [*346] ': 0,
 '§ ': 0,
 '230.00': 0,
 'Penal Law § 240.37': 0,
 'Penal Law § ': 0,
 '120.00': 0,
 'Penal Law § 240.26 [1]': 0,
 'Criminal Procedure Law § 440.10 (1) (i)': 0,
 '(6)': 0,
 ' [****2] ': 0,
 'Penal Law § 230.34': 0,
 ' [***2] ': 0,
 ' [**499] ': 0,
 'Relevant Laws': 0,
 'sex ': 0,
 'trafficking': 0,
 ' [*347] ': 0,
 'Criminal Procedure Law § 440.10': 0,
 'section 240.37': 0,
 '230.03': 0,
 ' [***3] ': 0,
 '64 Misc. 3d 344, *345; 100 N.Y.S.3d 496, **498; 2019 N.Y
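
All the counts above are 0. For every header after the first, the code itself guarantees this, whatever the document contains: `count` is initialized once before the header loop and never reset, so from the second header onward it is already larger than every stored location and the `count >= start_span and count < end_span` test never passes. The last header also gets an empty range, because its end is set to its own location. A possible restructuring, sketched on toy data, walks the spans once and attributes each link to the section it falls in. The `(index, is_link)` span pairs, the precomputed link flag, and the name `count_links` are assumptions for illustration.

```python
import bisect


def count_links(header_locations, spans):
    """Count link spans per section in a single pass.

    header_locations: dict of header text -> span index where it starts.
    spans: list of (index, is_link) pairs in document order.
    The last section runs to the end of the document.
    """
    # Sort headers by location so each span can be placed with a binary search.
    ordered = sorted(header_locations.items(), key=lambda kv: kv[1])
    starts = [loc for _, loc in ordered]
    counts = {name: 0 for name, _ in ordered}

    for index, is_link in spans:
        pos = bisect.bisect_right(starts, index) - 1  # section containing this span
        if pos >= 0 and is_link:
            counts[ordered[pos][0]] += 1
    return counts


# Toy document: 9 spans, with links at indices 3 and 4.
headers = {"Case Summary": 1, "Opinion": 3, "Background": 6}
spans = [(i, i in (3, 4)) for i in range(1, 10)]

link_counts = count_links(headers, spans)
print(link_counts)
print([name for name, n in link_counts.items() if n == 0])
```

A single pass also avoids re-reading every page once per header, which the nested loops above do.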

In [76]:
headers = {}
count = 0
each_line = []
# paragraph size; float() also handles fractional sizes such as 10.5
p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))

doc = fitz.open('pplvpv.pdf')
for page in doc:
    #, flags=11
    blocks = page.get_text("dict", flags=11)["blocks"]

    for b in blocks:  # iterate through the text blocks
        for l in b["lines"]:  # iterate through the text lines
            texts = ""
            for s in l['spans']:
                if s['size'] >= p_size:
                    texts = "".join([texts, s['text']])
            print(texts)

People v P.V.
Criminal Court of the City of New York, Queens County
 April 30, 2019, Decided 
2005QN060819


 [****1]  The People of the State of New York, Plaintiff, v P.V., Defendant.
Subsequent History: As Corrected May 8, 2019.
As Corrected May 14, 2019.
Reported at People v. V., 2019 NYLJ LEXIS 1619 (Apr. 30, 2019)
Case Summary
Overview
HOLDINGS: [1]-Pursuant to CPL 440.10(1)(i), defendant's pleas of guilty to prostitution were vacated because she 
was a victim of sex trafficking from 2003 to 2006; the trafficker coerced her to work in commercial sex for 
extensive hours every night, the house rules were oppressive and coercive, the trafficker was physically 
threatening, and the trafficker controlled her access to hormones that allowed her to physically transition and realize 
her true gender identity; [2]-Defendant's conviction for disorderly conduct could not be vacated because it was not a 
prostitution-related charge.
Outcome
Motion to vacate convictions granted in part and d