### Starting Point
The following code blocks are what I've developed over the past 3 rounds: the needed packages, the flag decomposer, and the get_styles() function. I will keep these and build further functions on top of them.

In [1]:
# needed packages
import fitz
import sys
from collections import Counter

In [2]:
def flags_decomposer(flags):
    """Make font flags human readable."""
    l = []
    if flags & 2 ** 0:
        l.append("superscript")
    if flags & 2 ** 1:
        l.append("italic")
    if flags & 2 ** 2:
        l.append("serifed")
    else:
        l.append("sans")
    if flags & 2 ** 3:
        l.append("monospaced")
    else:
        l.append("proportional")
    if flags & 2 ** 4:
        l.append("bold")
    return ", ".join(l)
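
As a quick sanity check of the bit layout, here is a minimal standalone sketch that lists only the bits that are set in a span's flags value. The names FLAG_BITS and set_flag_names are illustrative and not part of the notebook; the bit positions are the ones used by flags_decomposer above.

```python
# Bit positions as used by flags_decomposer above; the names here are illustrative.
FLAG_BITS = {0: "superscript", 1: "italic", 2: "serifed", 3: "monospaced", 4: "bold"}


def set_flag_names(flags):
    """Return the names of the bits that are set in a font-flags integer."""
    return [name for bit, name in FLAG_BITS.items() if flags & (1 << bit)]


print(set_flag_names(20))  # 20 = 2**4 + 2**2 -> ['serifed', 'bold']
print(set_flag_names(0))   # no bits set -> []
```

Unlike flags_decomposer, this sketch does not emit the "sans" and "proportional" defaults for unset bits.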

In [3]:
def get_styles(doc):
    style_counts = []

    for page in doc:
        #, flags=11

        paths = page.get_drawings()  # get drawings on the page

        drawn_lines = []
        for p in paths:
            # print(p)
            for item in p["items"]:
                # print(item[0])
                if item[0] == "l":  # an actual line
                    # print(item[1], item[2])
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    # print(item[0])
                    # print(item[1])
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points
        
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    r = fitz.Rect(s['bbox']) 
                    for p1, p2 in drawn_lines:  # check distances for start / end points
                        if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                            font_properties = " ".join([font_properties, 'underlined'])

                    style_counts.append(font_properties)
    styles = dict(Counter(style_counts))

    style_list = sorted(styles.items(), key=lambda x:x[1], reverse=True)
    
    return style_list
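
The underline check inside get_styles() compares the bottom corners of a span's bounding box with the end points of each horizontal drawn line. A minimal sketch of that geometry with plain tuples instead of fitz objects follows; is_underlined and its tol parameter are hypothetical names, and the 4-point tolerance is simply the value used above.

```python
import math


def is_underlined(bbox, segments, tol=4):
    """Return True if a horizontal segment starts and ends near the bbox bottom corners.

    bbox is (x0, y0, x1, y1); segments is a list of ((x, y), (x, y)) pairs.
    """
    x0, y0, x1, y1 = bbox
    bottom_left, bottom_right = (x0, y1), (x1, y1)
    return any(
        math.dist(bottom_left, p1) <= tol and math.dist(bottom_right, p2) <= tol
        for p1, p2 in segments
    )


span_bbox = (100, 200, 180, 212)
print(is_underlined(span_bbox, [((100, 213), (180, 213))]))  # line hugs the span: True
print(is_underlined(span_bbox, [((100, 213), (300, 213))]))  # line runs far past it: False
```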

### New Methodology: get_opinion()
The court cases I've seen typically have 2-3 major headers, usually titled Case Summary, Core Terms, Opinion, and so forth. They all share the same font properties (size, flags, and font name). While Case Summary and Core Terms may or may not be present, Opinion is very likely to be present.

get_styles() returns the styles sorted by frequency, so its first entry is the paragraph text style. From it we take the paragraph text size (named p_size) and use it to our advantage.
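
A minimal sketch of pulling the size and color out of a style string is shown below. parse_style is a hypothetical helper, and the style_list values are made-up examples in the format that get_styles() produces. It parses the size as a float, since the "%g" format can yield values such as 10.5.

```python
import re


def parse_style(style):
    """Pull the size (float) and color (hex string) out of a get_styles() style string."""
    size = float(re.search(r"size ([0-9.]+)", style).group(1))
    color = re.search(r"color (#[0-9a-fA-F]{6})", style).group(1)
    return size, color


# Hypothetical get_styles() output: (style string, span count), most common first.
style_list = [
    ("Font: 'Arial' (sans, proportional), size 10.5, color #000000", 812),
    ("Font: 'Arial' (sans, proportional, bold), size 14, color #000000", 3),
]
p_size, p_color = parse_style(style_list[0][0])
print(p_size, p_color)  # 10.5 #000000
```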

This function reads the PDF line by line and tags each line with a running count: the first line read is 1, the second is 2, and so forth. This will help with locating lines later. For each line, the counter is incremented and the line is split into spans. Each span's text size is compared with the paragraph text size; if it is equal to or greater, the span's text is appended to the texts string. This keeps the main text and removes any smaller-sized text, such as subscripts, from each line.

The function then splits texts into a list of words; if the line has more than 0 and fewer than 7 words, it is considered a potential header and saved in the headers dict, with its location as the value.
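
The filtering steps above can be sketched on plain data, without opening a PDF. candidate_headers and the sample lines are hypothetical; each span dict only mimics the 'text' and 'size' keys of a PyMuPDF span. Unlike the notebook code, this sketch strips surrounding whitespace from the key, which is an assumption on my part.

```python
def candidate_headers(lines, p_size, max_words=7):
    """Map short line texts to their 1-based line number.

    lines is a list of lines; each line is a list of span dicts with
    'text' and 'size' keys.
    """
    headers = {}
    for count, spans in enumerate(lines, start=1):
        # keep only spans at least as large as the paragraph text
        text = "".join(s["text"] for s in spans if s["size"] >= p_size)
        if 0 < len(text.split()) < max_words:
            headers[text.strip()] = count
    return headers


lines = [
    [{"text": "Opinion", "size": 14}],
    [{"text": "The defendant appeals from a judgment of the trial court entered", "size": 10},
     {"text": "1", "size": 6}],
    [{"text": "Facts", "size": 10}, {"text": "2", "size": 6}],
]
headers = candidate_headers(lines, p_size=10)
print(headers)  # {'Opinion': 1, 'Facts': 3}
```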

The main purpose of this function is to find where the major header, Opinion, is located. The result is stored in opinion_loc and will be used later. Restricting the candidates to short lines reduces the chance of picking up the word "Opinion" from running text. Note that headers is keyed by line text, so if a short line reading exactly "Opinion" appears more than once, the last occurrence is the one returned, and the lookup raises a KeyError if no such line exists.

In [78]:
def get_opinion(doc, style_list):

    headers = {}
    count = 0
    # paragraph text size; parsed as float because "%g" can yield values like 10.5
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                texts = ""
                count+=1
                for s in l['spans']:
                    # format the size the same way get_styles() does before comparing
                    if float("%g" % s['size']) >= p_size:
                        texts = "".join([texts, s['text']])
                text_list = texts.split()
                if len(text_list) > 0 and len(text_list) < 7:
                    headers.update({texts:count})

    opinion_loc = headers['Opinion']
    return opinion_loc

### Getting the Major Headers

This code block can get the 2-3 major headers (Core Terms, Case Summary, Opinion, etc.). Currently, I have no use for this code, but for demonstrative purposes, I will keep it.
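
The idea behind the function is to take the style found on the Opinion line and collect every span that shares it. A minimal sketch on plain tuples follows; same_style_lines and the short style labels are hypothetical stand-ins for the full style strings. It matches against every style found on the reference line, whereas the notebook code keeps only the style of the last span on that line.

```python
def same_style_lines(spans, ref_line):
    """Return {text: line_no} for spans sharing a style with the reference line.

    spans is a list of (line_no, style, text) tuples.
    """
    ref_styles = {style for line_no, style, _ in spans if line_no == ref_line}
    return {text: line_no for line_no, style, text in spans if style in ref_styles}


spans = [
    (11, "bold 14", "Core Terms"),
    (12, "regular 10", "trafficking, victim"),
    (18, "bold 14", "Opinion"),
    (19, "regular 10", "DECISION AND ORDER"),
]
print(same_style_lines(spans, ref_line=18))  # {'Core Terms': 11, 'Opinion': 18}
```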

In [5]:
def get_majors(doc, style_list, opinion_loc):
    
    count = 0
    # parsed as float because "%g" can yield non-integer sizes such as 10.5
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))
    new_headers = {}
    header_properties = ""

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                count+=1
                if count==opinion_loc:
                    for s in l['spans']:
                        header_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                            s["font"],  # font name
                            flags_decomposer(s["flags"]),  # readable font flags
                            s["size"],  # font size
                            s["color"],  # font color
                        )

    count = 0                
    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                count+=1
                for s in l['spans']:
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )
                    if font_properties==header_properties:
                        new_headers.update({s['text']:count})

    return new_headers

In [6]:
doc = fitz.open('pplvpv.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
get_majors(doc, style_list, opinion_loc)

{'Case Summary': 11, 'Opinion': 26}

In [106]:
doc = fitz.open('usvbell.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
get_majors(doc, style_list, opinion_loc)

{'Core Terms': 11, 'LexisNexis® Headnotes': 15, 'Opinion': 119}

In [8]:
doc = fitz.open('statevward.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
get_majors(doc, style_list, opinion_loc)

{'Case Summary': 18, 'Opinion': 53}

In [9]:
doc = fitz.open('tompvus.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
get_majors(doc, style_list, opinion_loc)

{'Core Terms': 11, 'Opinion': 18}

### get_master()
The next step is to get a list of subheaders under the major header, Opinion.

This begins with creating a master list of the font style properties that define a subheader.

The vocab font is the font used for vocabulary words in the text: bold, italicized, underlined, and the same size as the paragraph text. I don't include vocab styles smaller than the paragraph text because those are filtered out by the size check anyway, and I don't include larger ones because those would be considered a header/subheader.

The links font is the font used for words that are 'links': the text is blue and matches the paragraph text size. In the code, any style whose color differs from the paragraph text color is treated as a link style and filtered out, regardless of its size.

I then build the master list of subheader styles, called master. I filter out the styles that match the paragraph style, the links style, and the vocab style (collected in bad_fonts), as well as styles whose size is smaller than the paragraph text size.

This function returns a list of header/subheader font styles.
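
The four filters can be sketched as named boolean checks on a made-up style list. header_styles and _size_and_color are hypothetical names, and the styles below are invented examples in the get_styles() format, not output from a real case file.

```python
import re


def _size_and_color(style):
    """Extract (size, color) from a style string; helper for this sketch only."""
    size = float(re.search(r"size ([0-9.]+)", style).group(1))
    color = re.search(r"color (#[0-9a-fA-F]{6})", style).group(1)
    return size, color


def header_styles(style_list):
    """Return the styles left after dropping paragraph, link, vocab and small styles."""
    p_style = style_list[0][0]
    p_size, p_color = _size_and_color(p_style)
    master = []
    for style, _count in style_list:
        size, color = _size_and_color(style)
        is_paragraph = style == p_style
        is_link = color != p_color
        is_vocab = all(w in style for w in ("bold", "italic", "underlined")) and size == p_size
        is_small = size < p_size
        if not (is_paragraph or is_link or is_vocab or is_small):
            master.append(style)
    return master


style_list = [
    ("Font: 'Arial' (sans, proportional), size 10, color #000000", 900),        # paragraph
    ("Font: 'Arial' (sans, proportional), size 10, color #0000ff", 40),         # link
    ("Font: 'Arial' (italic, sans, proportional, bold), size 10, color #000000 underlined", 12),  # vocab
    ("Font: 'Arial' (sans, proportional), size 6, color #000000", 30),          # small text
    ("Font: 'Arial' (sans, proportional, bold), size 10, color #000000", 8),    # subheader
]
print(header_styles(style_list))
```

Only the last, bold style survives the filters in this example.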

In [12]:
def get_master(style_list):

    # paragraph style is the most frequent one; sizes are parsed as float
    # because "%g" can yield non-integer values such as 10.5
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))
    p_color = style_list[0][0].split('color')[1].split()[0].strip(',')
    p_font = style_list[0][0]

    bad_fonts = []

    for style in style_list:
        font_str = style[0]
        s_size = float(font_str.split('size')[1].split()[0].strip(','))
        s_color = font_str.split('color')[1].split()[0].strip(',')

        # if font matches paragraph font, it's a bad_font
        if font_str==p_font:
            bad_fonts+=[font_str]
        # if font doesn't match paragraph text color, it's a bad_font
        if s_color!=p_color:
            bad_fonts+=[font_str]
        # if font matches characteristics of vocab word font, it's a bad font
        if ('bold' in font_str and 'underlined' in font_str) and ('italic' in font_str and p_size==s_size):
            bad_fonts+=[font_str]
        # if font size is smaller than paragraph text size, it's a bad_font
        if s_size<p_size:
            bad_fonts+=[font_str]

    master = []
    for style in style_list:
        if style[0] not in bad_fonts:
            master += [style[0]]

    return master

### get_subheaders()
This function reads the PDF starting from opinion_loc and compares each span's font style to the master list. If a line has fewer than 7 words and either one of its span styles matches a font in the master list or its text is all uppercase, the line is added to the opinion_subheaders dict.
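
The per-line decision can be sketched as a small predicate on plain data. is_subheader and the short style labels are hypothetical; in the notebook the styles are the full strings built from each span.

```python
def is_subheader(text, span_styles, master, max_words=7):
    """Return True for a short line that has a master-list style or is all uppercase."""
    words = text.split()
    if not (0 < len(words) < max_words):
        return False
    return any(style in master for style in span_styles) or text.isupper()


master = ["bold 10"]
print(is_subheader("Facts and Procedural History", ["bold 10"], master))  # style match
print(is_subheader("I. FACTUAL BACKGROUND", ["regular 10"], master))      # all uppercase
print(is_subheader("the court held that", ["regular 10"], master))        # neither
```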

In [102]:
def get_subheaders(doc, style_list, opinion_loc, master):

    count = 0
    opinion_subheaders = {}

    for page in doc:
        # collect the horizontal lines drawn on this page (used to detect underlining);
        # this is done per page so each span is checked against its own page's drawings
        drawn_lines = []
        for p in page.get_drawings():
            for item in p["items"]:
                if item[0] == "l":  # an actual line
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points

        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                texts = ""
                count+=1
                span_fonts = []
                if count>=opinion_loc:
                    for s in l['spans']:
                        font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                            s["font"],  # font name
                            flags_decomposer(s["flags"]),  # readable font flags
                            s["size"],  # font size
                            s["color"],  # font color
                        )

                        r = fitz.Rect(s['bbox']) 
                        for p1, p2 in drawn_lines:  # check distances for start / end points
                            if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                                font_properties = " ".join([font_properties, 'underlined'])
                    
                        span_fonts+=[font_properties]
                        texts = "".join ([texts, s['text']])
                
                text_list = texts.split()
                if len(text_list) > 0 and len(text_list) < 7:
                    if any(i in span_fonts for i in master):
                        opinion_subheaders.update({texts:count})
                    # all-uppercase short lines are also treated as subheaders
                    if texts.isupper():
                        opinion_subheaders.update({texts:count})
                    
    return opinion_subheaders

In [103]:
def get_zero_links_headers(doc, style_list, opinion_subheaders):
    p_color = style_list[0][0].split('color')[1].split()[0].strip(',')
    # parsed as float because "%g" can yield non-integer sizes such as 10.5
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))
    # list of potential headers
    keys_as_list = list(opinion_subheaders)
    # keeping track of the number of links in each section
    link_tracker = {}
    zero_links = {}

    for header_index in range(len(keys_as_list)):
        start = keys_as_list[header_index]
        start_span = opinion_subheaders[start]
        
        if header_index+1 < len(keys_as_list):
            end = keys_as_list[header_index+1]
            end_span = opinion_subheaders[end]
        else:
            end = keys_as_list[header_index]
            end_span = opinion_subheaders[end]

        # keeping track of the number of links found in each potential header section
        links_counter = 0
        count = 0

        for page in doc:

            paths = page.get_drawings()  # get drawings on the page

            drawn_lines = []
            for p in paths:
                # print(p)
                for item in p["items"]:
                    # print(item[0])
                    if item[0] == "l":  # an actual line
                        # print(item[1], item[2])
                        p1, p2 = item[1], item[2]
                        if p1.y == p2.y:
                            drawn_lines.append((p1, p2))
                    elif item[0] == "re":  # a rectangle: check if height is small
                        # print(item[0])
                        # print(item[1])
                        r = item[1]
                        if r.width > r.height and r.height <= 2:
                            drawn_lines.append((r.tl, r.tr))  # take top left / right points
            
            blocks = page.get_text("dict", flags=11)["blocks"]

            for b in blocks:  # iterate through the text blocks
                for l in b["lines"]:  # iterate through the text lines
                    # keeping track of line index, to match the line-based locations in opinion_subheaders
                    count += 1
                    for s in l["spans"]:  # iterate through the text spans

                        font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                            s["font"],  # font name
                            flags_decomposer(s["flags"]),  # readable font flags
                            s["size"],  # font size
                            s["color"],  # font color
                        )

                        r = fitz.Rect(s['bbox']) 
                        for p1, p2 in drawn_lines:  # check distances for start / end points
                            if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                                font_properties = " ".join([font_properties, 'underlined'])
                        
                        # checking text in between start and end index
                        if count >= start_span and count < end_span:
                            # a span counts as a link if its color differs from the
                            # paragraph color while its size matches the paragraph size;
                            # s['color'] is an int, so format it like p_color before comparing
                            s_color = "#%06x" % s['color']
                            s_size = float("%g" % s['size'])
                            if s_color != p_color and s_size == p_size:
                                links_counter+=1

        link_tracker.update({start:{'counter':links_counter, 'location':opinion_subheaders[start]}})

    links_as_list = list(link_tracker)
    for each_counter in range(len(link_tracker)):
        current_header = links_as_list[each_counter]
        if link_tracker[current_header]['counter']==0:
            zero_links.update({current_header:link_tracker[current_header]['location']})
    
    return zero_links
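
The function above counts link spans in the section between each potential header and the next one, and keeps the headers whose section contains no links. A minimal sketch of that bookkeeping on plain data follows. zero_link_headers, link_lines and last_line are hypothetical names, and the sample values are made up. One assumption differs from the notebook code: here the last header's section runs to the end of the document, whereas above the last header gets an empty range.

```python
def zero_link_headers(headers, link_lines, last_line):
    """Return the headers whose section contains no link.

    headers maps header text to its line number; link_lines lists the line
    numbers that contain a link span; last_line is the final line number.
    """
    items = sorted(headers.items(), key=lambda kv: kv[1])
    result = {}
    for i, (text, start) in enumerate(items):
        # a section ends where the next header starts, or after the last line
        end = items[i + 1][1] if i + 1 < len(items) else last_line + 1
        if not any(start <= n < end for n in link_lines):
            result[text] = start
    return result


headers = {"Opinion": 1, "Background": 5, "cited case fragment": 9, "Conclusion": 20}
print(zero_link_headers(headers, link_lines=[9, 12], last_line=25))
# {'Opinion': 1, 'Background': 5, 'Conclusion': 20}
```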

In [104]:
doc = fitz.open('pplvpv.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
master = get_master(style_list)
get_subheaders(doc, style_list, opinion_loc, master)
# get_zero_links_headers(doc, style_list, opinion_subheaders)

{'Opinion': 26,
 ' [**498]  [*345] Toko Serita, J.': 27,
 ' [**499] Relevant Laws': 45,
 '(CPL 440.10 [6]).': 98,
 ' [**501]  [*350] 1. Findings of Fact': 141,
 ' [*351] 2. Conclusions of Law': 172,
 'from a young [****5]  age.17': 202,
 ' [***20] ': 365}

In [105]:
doc = fitz.open('statevbraun.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
master = get_master(style_list)
get_subheaders(doc, style_list, opinion_loc, master)
# get_zero_links_headers(doc, style_list, opinion_subheaders)

{'Opinion': 49,
 '¶1  [*759]  [**887] FEARING, J. —': 50,
 '[S]ex traffickers select victims who demonstrate ': 51,
 'vulnerabilities including homelessness, substance ': 52,
 '[T]raffickers control their victims through physical ': 60,
 'violence, sexual violence, psychological violence ': 61,
 'socially isolating them, and controlling them ': 67,
 'drug dependency. …': 69,
 'prostitution [*760] . Because overwhelming evidence ': 84,
 'FACTS': 92,
 "behind RCW 9A.40.100, Washington's trafficking ": 111,
 ' [*762] ': 165,
 ' [**889] ': 212,
 'Jane [***9]  ': 259,
 ' [*765] ': 306,
 ' [**890] ': 314,
 ' [**892] ': 499,
 ' [**893]  ': 604,
 'son [***21]  to view her naked.': 607,
 'PROCEDURE': 664,
 'Washington [***24]  ': 690,
 'trafficking.': 728,
 'Uncertainty. [***26] ': 756,
 'trafficking and second degree promoting prostitution.': 783,
 'could [***30]  say no.': 870,
 'trafficking described by [Jane].': 900,
 '3.1.1.4. The defendant exerted psychological ': 905,
 "harm, restrain, o

In [85]:
doc = fitz.open('statevward.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
master = get_master(style_list)
get_subheaders(doc, style_list, opinion_loc, master)
# get_zero_links_headers(doc, style_list, opinion_subheaders)

{'Opinion': 53,
 'OPINION': 67,
 'Facts and Procedural History': 68,
 'Analysis': 238,
 'I. Sufficiency': 248,
 '301(3)(A), (C), (D).': 347,
 'B. Promoting Prostitution': 382,
 'II. Failure to Merge Convictions': 411,
 'sex ': 624,
 'convictions [*16]  ': 439,
 'prong of the Blockburger analysis.': 496,
 'Id. ': 509,
 '39-13-301(3)(A), (C), (D).': 581,
 "defendant's [*21]  ": 628,
 'III. Jury Instruction': 630,
 'evidence [*23]  ': 692,
 'accomplice\'s] testimony." Id.': 730,
 'Conclusion': 758,
 'J. ROSS DYER, JUDGE': 765}

In [89]:
doc = fitz.open('tompvus.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
master = get_master(style_list)
get_subheaders(doc, style_list, opinion_loc, master)
# get_zero_links_headers(doc, style_list, opinion_subheaders)

{'Opinion': 18,
 'DECISION AND ORDER': 19,
 'See Second Circuit Docket No. 17-822.': 23,
 'SO ORDERED.': 54,
 '/s/ Richard J. Arcara': 57,
 'HONORABLE RICHARD J. ARCARA': 58,
 'UNITED STATES DISTRICT JUDGE': 59}

In [92]:
opinion_loc

18

In [88]:
doc = fitz.open('usvbell.pdf')
style_list = get_styles(doc)
opinion_loc = get_opinion(doc, style_list)
master = get_master(style_list)
get_subheaders(doc, style_list, opinion_loc, master)
# get_zero_links_headers(doc, style_list, opinion_subheaders)

{'Opinion': 119,
 'ORDER': 120,
 'I. FACTUAL BACKGROUND': 125,
 'II. PROCEDURAL BACKGROUND': 139,
 'III. DISCUSSION': 181,
 'A. Motion for Judgment of Acquittal': 182,
 'HN1[': 184,
 '2. Summary of the Evidence': 213,
 '700, [*40]  745, 804.': 617,
 'a. Counts 1 through 3': 618,
 'HN2[': 622,
 'HN3[': 631,
 'b. Counts 4 through 7': 746,
 'HN4[': 747,
 'B. Motion for New Trial': 781,
 'HN5[': 785,
 '2. Evidence Presented at Trial': 799,
 "3. Addition of T.A.'s Testimony": 816,
 'a. Newly Discovered Evidence': 833,
 'HN7[': 834,
 'b. Due Diligence': 854,
 'HN8[': 855,
 'c. Cumulative or Impeaching': 875,
 'HN9[': 878,
 'd. Materiality': 899,
 'e. Likelihood of Acquittal': 903,
 'HN10[': 904,
 'IV. CONCLUSION': 926,
 'IT IS SO ORDERED.': 929,
 'U.S. DISTRICT JUDGE': 933}