### Methodology: Extracting Headers
Since PDF files consist of unstructured text, we need to find similarities across documents in how headers and paragraphs are separated. Headers and paragraphs are usually distinguished by the font size and/or font weight of the text, and the most used font can be considered the paragraph font.

So, theoretically, I can extract the font sizes, tag each word or line with its font size, use that to find the headers and subheaders, and then extract the text between the starting header and the ending header for the narratives.

Below I have created a function called 'fonts' that takes in a document and counts the number of text spans for each font size and font style.

In [None]:
def fonts(doc, granularity=False):
    """Extracts fonts and their usage in PDF documents.
    :param doc: PDF document to iterate through
    :type doc: <class 'fitz.fitz.Document'>
    :param granularity: also use 'font', 'flags' and 'color' to discriminate text
    :type granularity: bool
    :rtype: [(font_size, count), (font_size, count)], dict
    :return: most used fonts sorted by count, font style information
    """
    styles = {}
    font_counts = {}

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b['type'] == 0:  # block contains text
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        if granularity:
                            identifier = "{0}_{1}_{2}_{3}".format(s['size'], s['flags'], s['font'], s['color'])
                            styles[identifier] = {'size': s['size'], 'flags': s['flags'], 'font': s['font'],
                                                  'color': s['color']}
                        else:
                            identifier = "{0}".format(s['size'])
                            styles[identifier] = {'size': s['size'], 'font': s['font']}

                        font_counts[identifier] = font_counts.get(identifier, 0) + 1  # count the fonts usage

    font_counts = sorted(font_counts.items(), key=itemgetter(1), reverse=True)

    if len(font_counts) < 1:
        raise ValueError("Zero discriminating fonts found!")

    return font_counts, styles

In [None]:
import fitz
from operator import itemgetter

In [None]:
doc = fitz.open('usvstennis.pdf')

In [None]:
font_counts, styles = fonts(doc, granularity=True)

After running my function on a court case, we can see the most used font size, which is a good indicator that the text with this size is paragraph text.

My next step would be to create a dictionary that distinguishes between headers, paragraphs, and other.

In [None]:
def font_tags(font_counts, styles):
    """Returns dictionary with font sizes as keys and tags as value.
    :param font_counts: (font_size, count) for all fonts occuring in document
    :type font_counts: list
    :param styles: all styles found in the document
    :type styles: dict
    :rtype: dict
    :return: all element tags based on font-sizes
    """
    p_size = styles[font_counts[0][0]].get('size')  # get the paragraph's size
    p_font = styles[font_counts[0][0]].get('font')
    p_flag = styles[font_counts[0][0]].get('flags')
    p_color = styles[font_counts[0][0]].get('color')

    return p_size, p_font, p_flag, p_color

### STOPPED
It has come to my attention after reviewing the US v. Stennis court case that some subheaders have the same font size, style, and color as paragraph text. The only difference is that the subheaders are in all caps, italicized, or enumerated (e.g. I, II, III or A, B, C).

This has put a damper on my current methodology.
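The three cues mentioned above (all caps, italics, enumeration) can be checked per span. The following is only a rough sketch: the sample spans are made up, `looks_like_subheader` and the `ENUMERATED` pattern are hypothetical names, and the only thing taken from PyMuPDF is that bit 1 of a span's `flags` value marks italic text.

```python
import re

ITALIC_FLAG = 2 ** 1  # bit 1 of a span's 'flags' value marks italic text

# Hypothetical pattern: "I.", "II.", "A.", "1." and similar at the start of the text
ENUMERATED = re.compile(r"^\s*(?:[IVXLC]+|[A-Z]|\d+)[.)]\s+\S")


def looks_like_subheader(span):
    """Return True if a span shows any of the three cues."""
    text = span["text"].strip()
    if not text:
        return False
    return (
        text.isupper()
        or bool(span["flags"] & ITALIC_FLAG)
        or bool(ENUMERATED.match(text))
    )


# Made-up spans, shaped like the span dictionaries of page.get_text("dict")
spans = [
    {"text": "I. BACKGROUND", "flags": 4},
    {"text": "A. The Charges", "flags": 4},
    {"text": "Standard of Review", "flags": 6},
    {"text": "The defendant was charged in 2020.", "flags": 4},
]
print([looks_like_subheader(s) for s in spans])
```

These checks are deliberately loose, so italicized case names or sentences that start with an initial ("A. Smith testified") would also be flagged and have to be filtered out later.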

In [None]:
def font_tags(font_counts, styles):
    """Returns dictionary with font sizes as keys and tags as value.
    :param font_counts: (font_size, count) for all fonts occuring in document
    :type font_counts: list
    :param styles: all styles found in the document
    :type styles: dict
    :rtype: dict
    :return: all element tags based on font-sizes
    """
    p_style = styles[font_counts[0][0]]  # get style for most used font by count (paragraph)
    p_size = p_style['size']  # get the paragraph's size

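    # sorting the font sizes high to low, so that we can append the right integer to each tag;
    # sizes are read from styles because the keys in font_counts are composite
    # identifiers (size_flags_font_color) when fonts() ran with granularity=True
    font_sizes = sorted({styles[identifier]['size'] for identifier, count in font_counts}, reverse=True)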

    # aggregating the tags for each font size
    idx = 0
    size_tag = {}
    for size in font_sizes:
        idx += 1
        if size == p_size:
            idx = 0
            size_tag[size] = '<p>'
        if size > p_size:
            size_tag[size] = '<h{0}>'.format(idx)
        elif size < p_size:
            size_tag[size] = '<s{0}>'.format(idx)

    return size_tag

In [None]:
size_tag = font_tags(font_counts, styles)
size_tag
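To make the shape of the resulting dictionary concrete, here is a compact sketch of the same tagging idea on made-up font sizes. `tag_sizes` is a hypothetical helper written for this illustration, not the function used above.

```python
def tag_sizes(sizes, p_size):
    """Map each font size to '<p>', '<hN>' (larger) or '<sN>' (smaller)."""
    unique = set(sizes)
    larger = sorted((s for s in unique if s > p_size), reverse=True)
    smaller = sorted((s for s in unique if s < p_size), reverse=True)

    tags = {p_size: "<p>"}
    for rank, size in enumerate(larger, start=1):
        tags[size] = "<h{0}>".format(rank)  # <h1> is the largest size
    for rank, size in enumerate(smaller, start=1):
        tags[size] = "<s{0}>".format(rank)  # <s1> is just below paragraph size
    return tags


# Made-up sizes; 10.0 plays the role of the most used (paragraph) size
size_tag = tag_sizes([14.0, 10.0, 9.0, 6.0, 10.0], p_size=10.0)
print(size_tag)
```

Because the tags depend only on size, a subheader set in the paragraph size gets `<p>` as well, which is exactly the limitation described in the STOPPED note.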

In [None]:
def headers_para(doc, size_tag):
    """Scrapes headers & paragraphs from PDF and return texts with element tags.
    :param doc: PDF document to iterate through
    :type doc: <class 'fitz.fitz.Document'>
    :param size_tag: textual element tags for each size
    :type size_tag: dict
    :rtype: list
    :return: texts with pre-prended element tags
    """
    header_para = []  # list with headers and paragraphs
    first = True  # boolean operator for first header
    previous_s = {}  # previous span

    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for b in blocks:  # iterate through the text blocks
            if b['type'] == 0:  # this block contains text

                # REMEMBER: multiple fonts and sizes are possible IN one block

                block_string = ""  # text found in block
                for l in b["lines"]:  # iterate through the text lines
                    for s in l["spans"]:  # iterate through the text spans
                        if s['text'].strip():  # removing whitespaces:
                            if first:
                                previous_s = s
                                first = False
                                block_string = size_tag[s['size']] + s['text']
                            else:
                                if s['size'] == previous_s['size']:

                                    if block_string and all((c == "|") for c in block_string):
                                        # block_string only contains pipes
                                        block_string = size_tag[s['size']] + s['text']
                                    elif block_string == "":
                                        # new block has started, so append size tag
                                        block_string = size_tag[s['size']] + s['text']
                                    else:  # in the same block, so concatenate strings
                                        block_string += " " + s['text']

                                else:
                                    header_para.append(block_string)
                                    block_string = size_tag[s['size']] + s['text']

                                previous_s = s

                    # new block started, indicating with a pipe
                    block_string += "|"

                header_para.append(block_string)

    return header_para

In [None]:
headers = headers_para(doc, size_tag)

In [None]:
[element for element in headers if element.startswith('<h')]

### New Approach, Same Methodology
I made a new function called flags_decomposer that makes a list of different flags that include bold and italics.

In [1]:
import sys
from collections import Counter
import fitz

In [2]:
def flags_decomposer(flags):
    """Make font flags human readable."""
    l = []
    if flags & 2 ** 0:
        l.append("superscript")
    if flags & 2 ** 1:
        l.append("italic")
    if flags & 2 ** 2:
        l.append("serifed")
    else:
        l.append("sans")
    if flags & 2 ** 3:
        l.append("monospaced")
    else:
        l.append("proportional")
    if flags & 2 ** 4:
        l.append("bold")
    return ", ".join(l)
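Since the flags value is a bitmask, a single property can also be tested directly without building the readable string. A small sketch, using the same bit positions as flags_decomposer; the helper names `is_bold` and `is_italic` are made up for this example.

```python
# Bit values, in the same order as in flags_decomposer
SUPERSCRIPT, ITALIC, SERIFED, MONOSPACED, BOLD = (2 ** i for i in range(5))


def is_bold(flags):
    """True if the bold bit (2 ** 4) is set."""
    return bool(flags & BOLD)


def is_italic(flags):
    """True if the italic bit (2 ** 1) is set."""
    return bool(flags & ITALIC)


flags = SERIFED | BOLD  # 4 + 16
print(flags, is_bold(flags), is_italic(flags))
```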

### get_styles
This next function, get_styles, has the same purpose as my other function, fonts, but this time it's more thorough in its categorization. It includes italics, color, size, font style, and other flags, and counts the number of spans in each category. It returns a list of (style, count) tuples sorted by count. (I will explain 'span' in a different section.)

In [3]:
def get_styles(doc):
    style_counts = []

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    style_counts.append(font_properties)
    styles = dict(Counter(style_counts))

    style_list = sorted(styles.items(), key=lambda x:x[1], reverse=True)
    
    return style_list

One concern is the runtime: every call walks through each block, line, and span of every page, so it may get slow for larger inputs.
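One way to limit the repeated traversal is to flatten the spans once and reuse that list for every later step. The sketch below assumes `pages` is a list of dictionaries shaped like the result of `page.get_text("dict", flags=11)`; the sample data and the name `iter_spans` are made up. The nested loops themselves visit each span only once, so I would expect most of the cost to come from extracting the text of every page again in each function, though I have not measured this.

```python
from collections import Counter


def iter_spans(pages):
    """Yield (index, span) for every text span, numbering spans from 1."""
    index = 0
    for page in pages:
        for block in page["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                for span in line["spans"]:
                    index += 1
                    yield index, span


# Made-up stand-in for [page.get_text("dict", flags=11) for page in doc]
pages = [
    {"blocks": [
        {"lines": [{"spans": [{"text": "FACTS", "size": 12.0}]}]},
        {"lines": [
            {"spans": [{"text": "The defendant ", "size": 10.0},
                       {"text": "knowingly", "size": 10.0}]},
            {"spans": [{"text": "entered the hotel.", "size": 10.0}]},
        ]},
    ]},
]

spans = list(iter_spans(pages))  # single traversal, reused below
size_counts = Counter(span["size"] for _, span in spans)
p_size = size_counts.most_common(1)[0][0]  # most used size = paragraph size
print(len(spans), p_size)
```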

In [4]:
doc = fitz.open('usvbell.pdf')
style_list = get_styles(doc)
style_list

[("Font: 'Helvetica' (sans, proportional), size 10, color #000000", 1256),
 ("Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 10, color #000000",
  260),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 10, color #0077cc",
  100),
 ("Font: 'Helvetica-Bold' (sans, proportional, bold), size 10, color #000000",
  96),
 ("Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 10, color #0077cc",
  20),
 ("Font: 'Helvetica' (sans, proportional), size 9, color #000000", 16),
 ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 10, color #000000",
  10),
 ("Font: 'Helvetica' (superscript, sans, proportional), size 8, color #000000",
  4),
 ("Font: 'Helvetica' (sans, proportional), size 6, color #000000", 4),
 ("Font: 'Helvetica-Bold' (sans, proportional, bold), size 14, color #000000",
  3),
 ("Font: 'Helvetica-Bold' (sans, proportional, bold), size 9, color #000000",
  2),
 ("Font: 'Helvetica-BoldOblique' (italic, sans, proport

### get_headers
This function takes in the document and the style list and outputs a list of POTENTIAL headers (not actual headers). These potential headers have the following characteristics:

1. The font size is either equal to or greater than the regular text size. This requirement filters out the subscript text and is a must to be considered a potential header.
2. If the text is all uppercase, it is considered a possible header and added to the list.
3. If the complete font properties of the text do not match the regular text properties, it is considered a possible header and added to the list.

It first takes the style list and chops it down to instances of 50 spans or less into a new list named better_list. This assumes that the number of times a specific combination of font characteristics appears is 50 times or less. I think this is a safe assumption since this assumes that the number of headers in a document is 50 or less PER font style. This restriction is set in place to combat extracting noise.

### STOPPED
It has come to my attention that in US v. Bell, some of the headers are only distinguishable by being BOLD, meaning that the font style, font color, and font size of the subheaders are the same as for the regular text. So I have taken out the better_list. This means that there will be a lot of noise in the header list, but it cannot be helped. As far as I know, it will not interfere with the final result.
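To illustrate the effect of the 50-span cutoff, the sketch below applies the better_list filter to a few (style, count) pairs copied from the US v. Bell output of get_styles shown earlier. Which style the bold subheaders actually belong to is an assumption on my part; the point is only that any style with more than 50 spans, such as the bold size-10 entry with 96 spans, is removed by the filter.

```python
# (style description, span count) pairs copied from the US v. Bell output
style_list = [
    ("Font: 'Helvetica' (sans, proportional), size 10, color #000000", 1256),
    ("Font: 'Helvetica-BoldOblique' (italic, sans, proportional, bold), size 10, color #000000", 260),
    ("Font: 'Helvetica-Oblique' (italic, sans, proportional), size 10, color #0077cc", 100),
    ("Font: 'Helvetica-Bold' (sans, proportional, bold), size 10, color #000000", 96),
    ("Font: 'Helvetica-Bold' (sans, proportional, bold), size 14, color #000000", 3),
]

# The original restriction: keep only styles that occur 50 times or less
better_list = [style for style in style_list if style[1] <= 50]

# Non-paragraph styles that the restriction throws away
dropped = [style for style in style_list[1:] if style[1] > 50]

print(len(better_list), "kept;", len(dropped), "dropped")
for description, count in dropped:
    print(count, description)
```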

In [None]:
def get_headers(doc, style_list):

    headers = {}
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))  # float, since a size such as 10.5 would break int()
    p_style = style_list[0][0]
    count = 0

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    count += 1
                    if s['size'] >= p_size:
                        font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                            s["font"],  # font name
                            flags_decomposer(s["flags"]),  # readable font flags
                            s["size"],  # font size
                            s["color"],  # font color
                        )

                        if s['text'].isupper()==True:
                            headers.update({s['text']:count})
                        
                        if font_properties != p_style:
                            headers.update({s['text']:count})
                    
    return headers

In [5]:
def get_headers(doc, style_list):

    headers = {}
    count = 0
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))  # float, since a size such as 10.5 would break int()
    #better_list = [style for style in style_list if style[1]<=50]
    slist = [style[0] for style in style_list][1:]

    for page in doc:
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    count += 1
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    if s['size'] >= p_size:
                        if s['text'].isupper()==True:
                                headers.update({s['text']:count})
                    
                        if font_properties in slist:
                                headers.update({s['text']:count})
                    
    return headers

In [6]:
headers = get_headers(doc, style_list)
headers

{'United States v. Bell': 2,
 'Subsequent History:': 9,
 'United States v. Bell, 761 F.3d 900, 2014 U.S. App. LEXIS 14912 (8th Cir. Iowa, ': 11,
 'Aug. 4, 2014)': 12,
 'Core Terms': 13,
 'prostitution': 1747,
 'sex': 1687,
 'pimp': 1383,
 'coercion': 1695,
 'trafficking': 1433,
 'force': 1693,
 'LexisNexis® Headnotes': 28,
 'HN1': 371,
 ' Trials, Motions for Acquittal': 33,
 'Fed. R. Crim. P. 29(a)': 375,
 'Sex': 134,
 'Prostitution': 136,
 'HN2': 1118,
 ' Goods Smuggling, Elements': 88,
 '18 U.S.C.S. §§ 1591(a)(1)': 76,
 '1591(b)(1)': 1172,
 'HN3': 1143,
 'commercial sex': 1745,
 '18 ': 107,
 'U.S.C.S. §§ 1591(a)': 108,
 '1591(a)(1)': 1168,
 '1591(a)(2)': 1170,
 '2': 1458,
 'Pandering': 141,
 'HN4': 1397,
 ' ': 140,
 ' & Pimping, Elements': 142,
 '18 U.S.C.S. §§ 2422(a)': 153,
 'HN5': 1491,
 ' Postconviction Proceedings, Motions for New Trial': 227,
 'Fed. R. Crim. P. 33(a)': 1495,
 'Rule 33': 1761,
 'HN6': 1558,
 'Fed. ': 183,
 'R. Crim. P. 33': 184,
 'HN7': 1595,
 'Fed. R. Crim. P. 

### get_narrative()
This function takes in the document, the list of font styles, a start string, an end string, and the header list, and outputs the narrative. Specifically, it relies on the start and end being distinct phrases (having distinct font styles) from regular text. Therefore, even if the start phrase appears in the regular text before it's needed, the extraction won't start prematurely. The same goes for the end phrase: if it appears in the text before we need it, the extraction won't end prematurely.

After this, it may require some additional cleaning, but I think it's a good start.
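The core of the extraction is a slice over span numbers: everything from the start header's span number up to, but not including, the end header's. Here is a minimal sketch on made-up data. `slice_narrative` and its `sep` parameter are hypothetical; joining with a space is my own addition for readability.

```python
def slice_narrative(span_texts, headers, start, end, sep=" "):
    """Join the spans from the start header up to (not including) the end header."""
    start_span = headers[start]  # raises KeyError if the phrase was not detected
    end_span = headers[end]
    selected = [
        text.strip()
        for index, text in enumerate(span_texts, start=1)  # spans are numbered from 1
        if start_span <= index < end_span
    ]
    return sep.join(selected)


# Made-up span texts and the span numbers of two detected headers
span_texts = [
    "Opinion",
    "FACTS",
    "The defendant entered the hotel.",
    "She left at noon.",
    "PROCEDURE",
    "The case went to trial.",
]
headers = {"FACTS": 2, "PROCEDURE": 5}

print(slice_narrative(span_texts, headers, start="FACTS", end="PROCEDURE"))
```

Note that the header dictionary is keyed by text, so if the same header text is detected more than once, only the last span number is kept.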

In [7]:
def get_narrative(doc, style_list, start, end, header_list):

    narrative = ""
    count = 0
    start_span = header_list[start]
    end_span = header_list[end]

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    count += 1
                    if count >= start_span and count < end_span:
                        narrative+=s['text']
                    
    return narrative

In [8]:
get_narrative(doc, style_list, start='2. Summary of the Evidence', end='a. Counts 1 through 3', header_list=headers)



### Combining the Results
In this section, I will be combining the three functions into one so that it can all run at once.

In [9]:
## needed packages
# import fitz
# import sys
# from collections import Counter

In [None]:
def get_narratives(doc, start, end):
    style_counts = []

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    style_counts.append(font_properties)
    styles = dict(Counter(style_counts))

    style_list = sorted(styles.items(), key=lambda x:x[1], reverse=True)

    headers = {}
    count = 0
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))  # float, since a size such as 10.5 would break int()
    # better_list = [style for style in style_list if style[1]<=50]
    slist = [style[0] for style in style_list][1:]

    for page in doc:
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    count += 1
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    if s['size'] >= p_size:
                        if s['text'].isupper()==True:
                                headers.update({s['text']:count})
                    
                        if font_properties in slist:
                                headers.update({s['text']:count})

    narrative = ""
    count = 0
    start_span = headers[start]
    end_span = headers[end]

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    count += 1
                    if count >= start_span and count < end_span:
                        narrative+=s['text']
                    
    return narrative

In [None]:
doc = fitz.open('pplvpv.pdf')
style_list = get_styles(doc)
get_headers(doc, style_list)

### STOPPED
After trying to run this through People v. PV, it has come to my attention that my code doesn't include text that is underlined... I need to work on this more.

In [None]:
doc = fitz.open('pplvpv.pdf')
get_narratives(doc, start='1. Findings of Fact', end='2. Conclusions of Law')

In [None]:
doc = fitz.open('statevbraun.pdf')
get_narratives(doc, start='FACTS', end='PROCEDURE')

In [None]:
doc = fitz.open('statevward.pdf')
get_narratives(doc, start='Facts and Procedural History', end='Analysis')

### 'Span'
The code below is a snippet of the get_styles() function and helps explain how everything is counted. The code splits each page into blocks, each block into lines, and each line into spans. A span is a run of text within a line that shares the same font, size, flags, and color, so most lines consist of a single span. The exception is a change of style mid-line: if a word is in italics, for example, it is split off into its own span (and printed on its own line in the output below).
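As a schematic illustration, here is what one line with a style change could look like. The dictionary is made up and reduced to a few keys; it only mimics the shape of one entry of a block's "lines" list in the "dict" output.

```python
# Made-up line: the bold italic word splits the line into three spans
line = {
    "spans": [
        {"text": "Stennis was charged with ", "font": "Helvetica", "flags": 0, "size": 10.0},
        {"text": "sex", "font": "Helvetica-BoldOblique", "flags": 18, "size": 10.0},  # 2 (italic) + 16 (bold)
        {"text": " trafficking of women.", "font": "Helvetica", "flags": 0, "size": 10.0},
    ]
}

for s in line["spans"]:
    print(repr(s["text"]), s["font"], s["flags"])

# Concatenating the spans gives back the full line
full_line = "".join(s["text"] for s in line["spans"])
print(full_line)
```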

In [74]:
doc = fitz.open('usvstennis.pdf')
page = doc[0]

blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            print(s['text'])

 
United States v. Stennis
United States District Court for the District of Minnesota
June 27, 2022, Decided; June 27, 2022, Filed
Case No. 20-CR-0019 (PJS/BRT)
Reporter
2022 U.S. Dist. LEXIS 112659 *; 2022 WL 2312214
UNITED STATES OF AMERICA, Plaintiff, v. DARNELL DESHAWN STENNIS, Defendant.
Core Terms
sex
, 
prostitution
, 
trafficking
, 
pimp
, phone, 
coercion
, investigated, obstruct, arrested, hotel, advertisements, 
counts, jail, convicted, reasonable jury, bathroom, violence, custody, started, phone number, text message, 
assaulted, charges, sounded, beyond a reasonable doubt, arrive, driver, upset, door
Counsel:
 
 [*1] 
Joseph H. Thompson and Manda M. Sertich, UNITED STATES ATTORNEY'S OFFICE, for plaintiff.
Daniel L. Gerdts, for defendant.
Judges:
 Patrick J. Schiltz, United States District Judge.
Opinion by:
 Patrick J. Schiltz
Opinion
MEMORANDUM
Defendant Darnell Stennis was charged with several crimes in connection with his 
sex
 
trafficking
 of women. At 
trial, Stennis 

### Incorporating Underlined Text

After some digging, it seems that my current method is unable to handle underlined text (as seen in my attempt below). This is because an underline is actually just a line drawn on the page near the text, and unless there is a tag already associated with the word or words, it's nearly impossible to catch the underlined text with the code I have.

Most methods online require the PDF tool to look for lines that are drawn close enough to the text, which is what I attempted to do down below. At first, I was unsuccessful, but I have finally been able to extract underlined text! 

The first raw code chunk is the block that I used to test a single page of People v. PV. The method was to collect the coordinates of everything that counts as a drawn line (an actual horizontal line, or a very flat rectangle) and save them in a list named drawn_lines. Then, I iterated through the spans of the page and saved each bbox as r. The bbox holds the coordinates of the span in the form of a rectangle. By comparing the bottom corners of the current span with the endpoints of the drawn lines, I was able to see which span texts were underlined and which were not.

The second raw code chunk is the block that I used to test incorporating the rest of the requirements for selecting potential headers.
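The geometric test itself can be tried without a PDF. In the sketch below, plain tuples stand in for the PyMuPDF points and rectangles, all coordinates are made up, and `is_underlined` is a hypothetical helper; the tolerance of 4 matches the threshold used in the code chunks. As in PyMuPDF, a bbox is (x0, y0, x1, y1) with y growing downward, so y1 is the bottom edge.

```python
import math


def is_underlined(bbox, drawn_lines, tolerance=4):
    """True if some drawn line starts and ends near the bottom corners of bbox."""
    x0, y0, x1, y1 = bbox
    bottom_left, bottom_right = (x0, y1), (x1, y1)
    for p1, p2 in drawn_lines:
        if (math.dist(bottom_left, p1) <= tolerance
                and math.dist(bottom_right, p2) <= tolerance):
            return True
    return False


# One made-up horizontal line, given as (start point, end point)
drawn_lines = [((72.0, 101.5), (180.0, 101.5))]

heading_bbox = (72.0, 90.0, 180.5, 100.0)     # sits directly above the line
paragraph_bbox = (72.0, 120.0, 300.0, 130.0)  # further down the page

print(is_underlined(heading_bbox, drawn_lines))
print(is_underlined(paragraph_bbox, drawn_lines))
```

Because both endpoints have to match, a line that underlines only part of a span, or one line running under several spans, would not be detected by this test.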

In [None]:
doc = fitz.open('pplvpv.pdf')
page = doc[3]
paths = page.get_drawings()  # get drawings on the page
# print(paths)

drawn_lines = []
for p in paths:
    # print(p)
    for item in p["items"]:
        # print(item[0])
        if item[0] == "l":  # an actual line
            # print(item[1], item[2])
            p1, p2 = item[1], item[2]
            if p1.y == p2.y:
                drawn_lines.append((p1, p2))
        elif item[0] == "re":  # a rectangle: check if height is small
            # print(item[0])
            # print(item[1])
            r = item[1]
            if r.width > r.height and r.height <= 2:
                drawn_lines.append((r.tl, r.tr))  # take top left / right points

p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))  # float, since a size such as 10.5 would break int()
slist = [style[0] for style in style_list][1:]
count = 0

blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            # print(s['text'][:8])
            # print(s['bbox'])
            count+=1
            font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                s["font"],  # font name
                flags_decomposer(s["flags"]),  # readable font flags
                s["size"],  # font size
                s["color"],  # font color
            )
            
            if s['size'] >= p_size:
                if s['text'].isupper()==True:
                    print(s['text'], count)
            
                if font_properties in slist:
                    print(s['text'], count)
                
                r = fitz.Rect(s['bbox']) 
                for p1, p2 in drawn_lines:  # check distances for start / end points
                    if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                        print(s['text'], count)
                        break

In [None]:
def get_headers(doc, style_list):

    headers = {}
    count = 0
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))  # float, since a size such as 10.5 would break int()
    slist = [style[0] for style in style_list][1:]

    for page in doc:

        paths = page.get_drawings()  # get drawings on the page

        drawn_lines = []
        for p in paths:
            # print(p)
            for item in p["items"]:
                # print(item[0])
                if item[0] == "l":  # an actual line
                    # print(item[1], item[2])
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    # print(item[0])
                    # print(item[1])
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points
        
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    # print(s['text'][:8])
                    # print(s['bbox'])
                    count+=1
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )
            
                    if s['size'] >= p_size:
                        if s['text'].isupper()==True:
                            headers.update({s['text']: count})
            
                        if font_properties in slist:
                            headers.update({s['text']: count})
                
                        r = fitz.Rect(s['bbox']) 
                        for p1, p2 in drawn_lines:  # check distances for start / end points
                            if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                                headers.update({s['text']: count})
                                break
                    
    return headers

In [None]:
doc = fitz.open('pplvpv.pdf')
style_list = get_styles(doc)
get_headers(doc, style_list).get('1. Findings of Fact')

In [None]:
blocks=page.get_text("dict",flags=fitz.TEXTFLAGS_TEXT)["blocks"]
max_lineheight=0
for b in blocks:
    for l in b["lines"]:
        bbox=fitz.Rect(l["bbox"])
        if bbox.height > max_lineheight:
            max_lineheight = bbox.height

# make a list of words
words = page.get_text("words", sort=True)
#print(words)
# if underlined, the bottom left / right of a word
# should not be too far away from left / right end of some line:
for w in words:  # w[4] is the actual word string
    r = fitz.Rect(w[:4])   # first 4 items are the word bbox
    for p1, p2 in drawn_lines:
        rect = fitz.Rect(p1.x, p1.y - max_lineheight, p2.x, p2.y) # the rectangle "above" a drawn line
        text = page.get_textbox(rect)
        print(f"Underlined: '{text}'.")


# -----
words = page.get_text("words", sort=True)
# if underlined, the bottom left / right of a word
# should not be too far away from left / right end of some line:
for w in words:  # w[4] is the actual word string
    print(w)
    r = fitz.Rect(w[:4])  # first 4 items are the word bbox
    for p1, p2 in drawn_lines:  # check distances for start / end points
        if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
            print(f"Word '{w[4]}' is underlined!")
            break  # don't check more lines

### Putting it All Together (PT2)

In [82]:
def get_narratives(doc, start, end):
    style_counts = []

    for page in doc:
        #, flags=11
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )

                    style_counts.append(font_properties)
    styles = dict(Counter(style_counts))

    style_list = sorted(styles.items(), key=lambda x:x[1], reverse=True)

    headers = {}
    count = 0
    p_size = float(style_list[0][0].split('size')[1].split()[0].strip(','))  # float, since a size such as 10.5 would break int()
    slist = [style[0] for style in style_list][1:]

    for page in doc:

        paths = page.get_drawings()  # get drawings on the page

        drawn_lines = []
        for p in paths:
            # print(p)
            for item in p["items"]:
                # print(item[0])
                if item[0] == "l":  # an actual line
                    # print(item[1], item[2])
                    p1, p2 = item[1], item[2]
                    if p1.y == p2.y:
                        drawn_lines.append((p1, p2))
                elif item[0] == "re":  # a rectangle: check if height is small
                    # print(item[0])
                    # print(item[1])
                    r = item[1]
                    if r.width > r.height and r.height <= 2:
                        drawn_lines.append((r.tl, r.tr))  # take top left / right points
        
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    # print(s['text'][:8])
                    # print(s['bbox'])
                    count+=1
                    font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                        s["font"],  # font name
                        flags_decomposer(s["flags"]),  # readable font flags
                        s["size"],  # font size
                        s["color"],  # font color
                    )
            
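                    # only spans at least as large as the paragraph text can be headers
                    if s['size'] >= p_size:
                        # A. the span is in all caps
                        if s['text'].isupper():
                            headers.update({s['text']: count})

                        # B. the span's style differs from the regular paragraph style
                        if font_properties in slist:
                            headers.update({s['text']: count})

                        # C. the span is underlined by a drawn line
                        r = fitz.Rect(s['bbox'])
                        for p1, p2 in drawn_lines:  # check distances for start / end points
                            if abs(r.bl - p1) <= 4 and abs(r.br - p2) <= 4:
                                headers.update({s['text']: count})
                                break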

    narrative = ""
    count = 0
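    # span positions of the start and end headers; a KeyError here means the
    # phrase was not found among the potential headers
    start_span = headers[start]
    end_span = headers[end]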

    for page in doc:
        blocks = page.get_text("dict", flags=11)["blocks"]

        for b in blocks:  # iterate through the text blocks
            for l in b["lines"]:  # iterate through the text lines
                for s in l["spans"]:  # iterate through the text spans
                    count += 1
                    # keep spans from the start header up to, but not including, the end header
                    if start_span <= count < end_span:
                        narrative += s['text']
                    
    return narrative

In [83]:
doc = fitz.open('pplvpv.pdf')
get_narratives(doc, start='1. Findings of Fact', end='2. Conclusions of Law')

'1. Findings of FactThe facts pertaining to defendant\'s arrest on March 3, 2008, under docket number 2008QN012069 were addressed at the aforementioned hearing held before this court on October 5, 2018. Defendant credibly testified that when she was arrested on the underlying charges she was [****6]  under the direct control of another trafficker identified as A. Her trafficker was physically present at the time of the incident and watching her at the location where he required her to engage in sex work. As defendant approached a potential customer to meet A\'s earning quota, another sex worker tried to steal away the "john" by approaching the same individual. Fearing her trafficker\'s wrath, and the possibility of violence if she lost this potential client, defendant\'s altercation with the rival sex worker led to her arrest.'

In [84]:
doc = fitz.open('statevbraun.pdf')
get_narratives(doc, start='FACTS', end='PROCEDURE')

"FACTS¶3 This prosecution arises from the protracted, poignant, and pungent relationship between Jane, [***3]  a pseudonym, and defendant Lars Braun. The four-year relationship included the performance of commercial sex acts by Jane at the request of and for the financial gain and sexual enchantment of Braun. Braun's appeal raises two principal questions: First, whether Braun, within the meaning of Washington's human trafficking statute, meted “force, coercion, and fraud” against Jane in fulfillment of one of the crime's elements? Second, whether the force, coercion, or fraud led to Jane's prostitution in satisfaction of a second element of the crime? Answers to these questions require a review of the language, intent, and history behind RCW 9A.40.100, Washington's trafficking statute. Answering the questions also necessitates a narrative of Jane's remarkable story, remarkable not because of its singularity or the tragedy portrayed but because of the prevalence, yet hidden nature, of t

In [85]:
doc = fitz.open('statevward.pdf')
get_narratives(doc, start='Facts and Procedural History', end='Analysis')

'Facts and Procedural HistoryA Madison County grand jury indicted the defendant, Randall Ray Ward, for two counts of trafficking a person for a commercial sex act (Counts one and three) and two counts of promoting prostitution (Counts two 1 It is the policy of this Court to refer to victims of sexual crimes by their initials. We intend no disrespect.Page 2 of 8and four). Following a jury trial, the defendant was convicted of one count of trafficking a person for a commercial sex act (Count three) and two counts of promoting prostitution (Counts two and four). The defendant was acquitted on Count one. At trial, the State presented the following facts for the jury\'s review.T.G. testified she was a heroin addict for twenty years, and, to support her habit, she began engaging in acts of prostitution. Initially, she worked for a man named Chew. However, after a physical altercation, T.G. wanted more protection. The defendant approached her and promised to take care of her if she worked as 

### Concluding Notes as of Nov. 17, 2023
The get_narratives() function has now been run on all of the given cases except three: Thompson v. US, US v. Gemma, and Walker v. Madden were left out because I could not identify a narrative section in the copies provided.

Overall, this task was quite tedious. The headers and subheaders came in many different forms. The most frustrating cases were those where the headers had almost exactly the same format as the paragraph text, which made them much harder to isolate. One setback is that the get_headers() function does not identify the headers and subheaders exactly; instead, it produces a list of potential headers.

Luckily, I was able to filter out subscripts first, simply by removing the text that was smaller than the paragraph text. That left only the headers/subheaders, the 'special' paragraph text, and the regular paragraph text. The 'special' paragraph text consists of text spans that occur frequently (though less often than the regular text): ordinary words that are bold, italicized, underlined, and so on.
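
The size filter described above can be sketched roughly as follows. This is a simplified illustration on plain dictionaries shaped like PyMuPDF spans (only the `text` and `size` keys are used), and it picks the paragraph size by counting spans per size rather than per full font style, which is an assumption made to keep the example short. The helper names `paragraph_size` and `drop_small_spans` are hypothetical and do not exist in this notebook.

```python
from collections import Counter


def paragraph_size(spans):
    """Return the font size shared by the largest number of spans."""
    size_counts = Counter(span["size"] for span in spans)
    return size_counts.most_common(1)[0][0]


def drop_small_spans(spans):
    """Keep only the spans that are at least as large as the paragraph text."""
    p_size = paragraph_size(spans)
    return [span for span in spans if span["size"] >= p_size]


# Hypothetical spans: one footnote marker, three body spans, one heading.
example_spans = [
    {"text": "1", "size": 6.5},
    {"text": "The defendant", "size": 10.0},
    {"text": "was arrested", "size": 10.0},
    {"text": "on March 3.", "size": 10.0},
    {"text": "FACTS", "size": 12.0},
]

kept = drop_small_spans(example_spans)
print([span["text"] for span in kept])
```

Everything that survives this filter is either regular paragraph text, 'special' paragraph text, or a potential header, so the later checks only have to separate those three groups.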

So, a recap on my final function, get_narratives():

1. It takes three arguments: the document, the start phrase, and the end phrase.
2. The first part categorises each span by its font properties and counts how many times each category occurs. This allows the function to tell the regular paragraph text apart from everything else. The result is a list of the categories and their counts, sorted in descending order.
3. The second part creates a list of potential headers together with each potential header's position in the PDF. The list contains the spans whose text is equal to or larger than the regular text (in size) and that meet at least ONE of the following requirements:

    A. The span is in all caps,

    B. The span differs from the regular paragraph text in at least one font attribute (font, flag, size, or color),

    OR

    C. The span is underlined.

4. The last part looks up the start and end phrases in the list of potential headers, takes their positions in the PDF, and extracts the text between those two positions as the narrative. The start header itself is included in the output; the end header is not.
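
The steps in this recap can be illustrated with a small sketch that works on plain dictionaries instead of a real PDF. It is not the notebook's implementation: the span layout (`text`, `size`, `style`, and an optional `underlined` key) and the names `is_header_candidate` and `extract_between` are assumptions made for illustration, and the underline test is reduced to a boolean rather than a comparison against drawn lines.

```python
def is_header_candidate(span, p_size, p_style, underlined=False):
    """Apply the three header tests from the recap to a single span.

    span: dict with 'text', 'size' and 'style' keys (hypothetical layout)
    p_size, p_style: size and style label of the regular paragraph text
    underlined: whether a drawn line sits just below the span
    """
    if span["size"] < p_size:
        return False  # smaller than the body text, e.g. footnote markers
    return (
        span["text"].isupper()        # A. all caps
        or span["style"] != p_style   # B. differs from the paragraph style
        or underlined                 # C. underlined
    )


def extract_between(spans, start, end, p_size, p_style):
    """Join span texts from the start header up to, not including, the end header."""
    positions = {}
    for index, span in enumerate(spans):
        if is_header_candidate(span, p_size, p_style, span.get("underlined", False)):
            positions[span["text"]] = index  # a repeated header keeps its last position
    return "".join(s["text"] for s in spans[positions[start]:positions[end]])


# Hypothetical document: an all-caps header, body text with a footnote
# marker, and a bold header that ends the narrative.
spans = [
    {"text": "FACTS", "size": 10.0, "style": "regular"},
    {"text": "The defendant was arrested. ", "size": 10.0, "style": "regular"},
    {"text": "1", "size": 6.0, "style": "regular"},
    {"text": "She testified.", "size": 10.0, "style": "regular"},
    {"text": "Procedure", "size": 10.0, "style": "bold"},
    {"text": "The case was appealed.", "size": 10.0, "style": "regular"},
]

result = extract_between(spans, "FACTS", "Procedure", p_size=10.0, p_style="regular")
print(result)
```

As in the real outputs above, small spans such as footnote markers are excluded from the header list but still end up inside the extracted narrative, because the extraction step copies every span between the two positions.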

A final note: because the function requires a start and an end phrase, I think it would be beneficial to run two or more court cases from the same district/county through the function with the same start and end phrases, to see whether it produces acceptable results.
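
One possible way to organise such a test is sketched below. The helper `run_batch` and the stand-in extractor `fake_extract` are hypothetical; the file names are made up. The sketch assumes the extractor raises a `KeyError` when a phrase is not among the detected headers, which is what the dictionary lookup in get_narratives() would do.

```python
def run_batch(paths, start, end, extract):
    """Apply one start/end phrase pair to several documents.

    extract: callable(path, start, end) -> str, expected to raise KeyError
             when a phrase is not among the document's potential headers.
    Returns (results, failures): the extracted text per path, and the
    paths for which the phrases were not found.
    """
    results, failures = {}, []
    for path in paths:
        try:
            results[path] = extract(path, start, end)
        except KeyError:
            failures.append(path)
    return results, failures


def fake_extract(path, start, end):
    """Stand-in extractor, used only to demonstrate the control flow."""
    headers = {"case_a.pdf": {"FACTS", "PROCEDURE"}, "case_b.pdf": {"Background"}}
    if start not in headers[path] or end not in headers[path]:
        raise KeyError(start)
    return f"narrative from {path}"


results, failures = run_batch(
    ["case_a.pdf", "case_b.pdf"], "FACTS", "PROCEDURE", fake_extract
)
print(results)
print(failures)
```

In the notebook, the extractor could instead wrap the existing function, for example `lambda path, start, end: get_narratives(fitz.open(path), start=start, end=end)`. The list of failures then shows which cases from the same district/county do not share the chosen header phrases.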