In [1]:
from unstructured.partition.auto import partition
# Only NarrativeText is referenced by name below, so import it explicitly
from unstructured.documents.elements import NarrativeText

## Partition
Partitioning in unstructured extracts content from unstructured documents by breaking each document into typed elements such as Title, ListItem, Text, and NarrativeText, among others. Because the goal here is to train a chatbot for movie recommendations, after partitioning the documents I keep only the NarrativeText elements, which contain the body of each review.
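To make the filtering step concrete without needing the library installed, here is a minimal sketch. `DemoElement`, `DemoTitle`, and `DemoNarrativeText` are hypothetical stand-ins I made up for illustration, not the unstructured classes; the two helpers only rely on `type()` and `isinstance()`, so the same pattern should carry over to real partitioned elements.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DemoElement:
    """Hypothetical stand-in for a partitioned element (not the library class)."""
    text: str


class DemoTitle(DemoElement):
    pass


class DemoNarrativeText(DemoElement):
    pass


def count_types(elements):
    """Count how many elements of each type a partition returned."""
    return Counter(type(e).__name__ for e in elements)


def keep_only(elements, element_type):
    """Keep only the elements that are instances of element_type."""
    return [e for e in elements if isinstance(e, element_type)]


sample = [
    DemoTitle("Dune review"),
    DemoNarrativeText("The film adapts Frank Herbert's novel."),
    DemoNarrativeText("Spice is found only on Arrakis."),
]

narrative = keep_only(sample, DemoNarrativeText)
print(count_types(sample))
print([e.text for e in narrative])
```

Counting the types first is a cheap way to see what a partition produced before deciding which element types to discard.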

In the first example, we pass a URL to the partition function, which fetches and parses the returned HTML document. As the output below shows, the request with only the URL was rejected by the website, but adding a User-Agent header allowed us to fetch the HTML document properly.

In [22]:
url_elements = partition(url='https://sarahgvincentviews.com/movies/dune-part-one/')
for i in url_elements:
    print(i)

Not Acceptable!
An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
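A rejection like the one above usually means the server's filter dislikes the client's default User-Agent. As a rough standard-library illustration of what "adding a header" means, the sketch below builds a request object carrying a browser-like User-Agent. It does not send anything over the network, and it is not a description of how unstructured performs the fetch internally; `build_request` is a hypothetical helper name.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request with a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(
    "https://sarahgvincentviews.com/movies/dune-part-one/",
    "Mozilla/5.0 (Windows NT 10.4; Win64; x64; en-US) Gecko/20130401 Firefox/73.2",
)

# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```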


In [27]:
# Fetch the URL with a custom user agent
url_elements = partition(url='https://sarahgvincentviews.com/movies/dune-part-one/', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.4; Win64; x64; en-US) Gecko/20130401 Firefox/73.2'})

# Create a set of the unique types of elements in the returned HTML
url_types = set([type(i) for i in url_elements])

# Print the elements that are of type NarrativeText
print("Narrative Text Content: \n")
for i in [i for i in url_elements if isinstance(i, NarrativeText)]:
    print(i)

# Print the types of elements in the returned HTML
print("\nTypes of Elements in the HTML:\n")
for i in url_types:
    print(i)

Narrative Text Content: 

“Dune” or “Dune: Part One” (2021) is the second film adaptation of Frank Herbert’s 1965 novel series. In the future year of 10191, an Emperor governs the universe, and this universe’s most valuable resource is spice, which aids in interstellar travel, but it is only found on Arrakis, a desert planet. The Emperor determines which House (thank you, “Game of Thrones” for making these concepts more comprehensible) will occupy Arrakis and insure that spice can be harvested without interference from the indigenous people, the Fremen, who never consented to the Emperor’s rule. The brutal house of Harkonnen have occupied the land and grown wealthy as a result of their position, but the Emperor orders them to withdraw and gives the job to the House of Atreides, whose head is Duke Leto Atreides (Oscar Isaac) and whose home planet is Caladan. His consort, Lady Jessica (Rebecca Ferguson) is training their young, inexperienced son, Paul (Timothee Chamalet), in the ways of 

In the following examples, I partition other Dune movie reviews in different formats: image, PDF, and HTML (already downloaded).

In [2]:
pdf_filename = 'example-files/dune_review_pdf.pdf'
image_filename = 'example-files/dune_review_image.png'
html_filename = 'example-files/dune_review_html.htm'

pdf_elements = partition(filename=pdf_filename)
image_elements = partition(filename=image_filename)
html_elements = partition(filename=html_filename)

pdf_elements_narrative = [i for i in pdf_elements if isinstance(i, NarrativeText)]
image_elements_narrative = [i for i in image_elements if isinstance(i, NarrativeText)]
html_elements_narrative = [i for i in html_elements if isinstance(i, NarrativeText)]

print("\nPDF Narrative Text Elements: \n")
for i in pdf_elements_narrative:
    print(i)

print("\nImage Narrative Text Elements: \n")
for i in image_elements_narrative:
    print(i)

print("\nHTML Narrative Text Elements: \n")
for i in html_elements_narrative:
    print(i)
    
print("\n\nUnique Types across all three files: \n")
all_types = set([type(i) for i in pdf_elements + image_elements + html_elements])
for i in all_types:
    print(i)

This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name



PDF Narrative Text Elements: 

Directed by Denis Villeneuve, performances by Timothée Chalamet, Rebecca Ferguson, and Oscar Isaac, 2021. 156 mins. Reviewed by Fabio Bego
Denis Villeneuve’s film Dune (2021) provides interesting insight on how
notions of race, gender, and empire that are at the core of current post- colonial critique are being transferred into popular culture. Analyses of the short- and long-term consequences of colonialism in the contempo- rary world pervade public discourse in shows and documentaries for main- stream media, blockbuster movies, institutionally financed film festivals, and art exhibitions. From a political perspective it is possible to distinguish two broad approaches. On the one hand there is a critique from the left which is focused on the deconstruction of race and ethnicity. On the other hand, there is a critique from the far right that aims at restoring race and ethnic divisions and privileges which were presumably spoiled by “globalization” or “co

## Cleaning

Unstructured provides functions to clean the partitioned data, including removing bullets, dashes, and extra whitespace, replacing unicode quotes, and even translating text. The only apparent issues in the examples above are broken paragraphs and extra whitespace, so those are what the examples below address.
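To show what these two fixes amount to, here is a simplified standard-library approximation. It is a sketch of the idea only, not the library's implementation: `join_broken_lines` and `collapse_spaces` are hypothetical names, and the rule assumed here is that a blank line separates paragraphs while a single line break inside a paragraph is an extraction artifact.

```python
import re


def join_broken_lines(text: str) -> str:
    """Rejoin lines inside each paragraph; blank lines still separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text)
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(p for p in joined if p)


def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text).strip()


raw = "Denis Villeneuve's film\nDune (2021)   provides\n\nA second paragraph."
cleaned = collapse_spaces(join_broken_lines(raw))
print(cleaned)
print(len(raw), "->", len(cleaned))
```

Comparing lengths before and after, as the cells below do, is a quick check on whether a cleaning step changed anything.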

In [3]:
from unstructured.cleaners.core import group_broken_paragraphs, clean

sum_pdf_narrative = sum([len(i.text) for i in pdf_elements_narrative])
sum_image_narrative = sum([len(i.text) for i in image_elements_narrative])
sum_html_narrative = sum([len(i.text) for i in html_elements_narrative])

print("Sum of lengths pre-cleaning:")
print("PDF: ", sum_pdf_narrative)
print("Image: ", sum_image_narrative)
print("HTML: ", sum_html_narrative)

# Rejoin paragraphs that were split across lines in each source
bp_cleaned_pdf_narrative = [group_broken_paragraphs(i.text) for i in pdf_elements_narrative]
bp_cleaned_image_narrative = [group_broken_paragraphs(i.text) for i in image_elements_narrative]
bp_cleaned_html_narrative = [group_broken_paragraphs(i.text) for i in html_elements_narrative]

sum_bp_cleaned_pdf_narrative = sum([len(i) for i in bp_cleaned_pdf_narrative])
sum_bp_cleaned_image_narrative = sum([len(i) for i in bp_cleaned_image_narrative])
sum_bp_cleaned_html_narrative = sum([len(i) for i in bp_cleaned_html_narrative])

print("\nSum of lengths after broken paragraphs:")
print("PDF: ", sum_bp_cleaned_pdf_narrative)
print("Image: ", sum_bp_cleaned_image_narrative)
print("HTML: ", sum_bp_cleaned_html_narrative)

Sum of lengths pre-cleaning:
PDF:  11016
Image:  1560
HTML:  3864

Sum of lengths after broken paragraphs:
PDF:  11016
Image:  1560
HTML:  3764


In [4]:
cleaned_pdf_narrative = [clean(i, bullets=True, extra_whitespace=True, dashes=True, trailing_punctuation=True, lowercase=True) for i in bp_cleaned_pdf_narrative]
cleaned_image_narrative = [clean(i, bullets=True, extra_whitespace=True, dashes=True, trailing_punctuation=True, lowercase=True) for i in bp_cleaned_image_narrative]
cleaned_html_narrative = [clean(i, bullets=True, extra_whitespace=True, dashes=True, trailing_punctuation=True, lowercase=True) for i in bp_cleaned_html_narrative]

sum_cleaned_pdf_narrative = sum([len(i) for i in cleaned_pdf_narrative])
sum_cleaned_image_narrative = sum([len(i) for i in cleaned_image_narrative])
sum_cleaned_html_narrative = sum([len(i) for i in cleaned_html_narrative])

print("\nSum of lengths after generic clean:")
print("PDF: ", sum_cleaned_pdf_narrative)
print("Image: ", sum_cleaned_image_narrative)
print("HTML: ", sum_cleaned_html_narrative)


Sum of lengths after generic clean:
PDF:  10991
Image:  1548
HTML:  3754
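The three sets of totals printed so far are easier to compare as per-stage differences. The sketch below hard-codes the totals reported above and computes how many characters each cleaning stage removed; `stage_deltas` is a hypothetical helper name.

```python
# Character totals copied from the outputs above, in stage order:
# pre-cleaning, after broken paragraphs, after generic clean
totals = {
    "PDF": [11016, 11016, 10991],
    "Image": [1560, 1560, 1548],
    "HTML": [3864, 3764, 3754],
}


def stage_deltas(counts):
    """Return the change in length between each pair of consecutive stages."""
    return [after - before for before, after in zip(counts, counts[1:])]


for source, counts in totals.items():
    print(source, stage_deltas(counts))
```

Read this way, grouping broken paragraphs only affected the HTML review, while the generic clean trimmed a small number of characters from all three sources.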


## Extracting 

The extracting functionality in unstructured pulls features such as datetimes, email addresses, IP addresses, bulleted lists, and phone numbers out of text. Because these extraction functions are not directly applicable to the movie reviews I selected, I instead show the translate_text function provided by unstructured, which uses the Helsinki NLP MT models.

In [8]:
from unstructured.cleaners.translate import translate_text

translated_pdf_narrative = [translate_text(i, target_lang='fr') for i in cleaned_pdf_narrative] 
for i in translated_pdf_narrative:
    print(i)


mise en scène de denis villeneuve, performances de timothée chalamet, rebecca ferguson, et oscar isaac, 2021. 156 minutes. reviewed by fabio bego
denis villeneuve’s film dune (2021) provides interesting insight on how
Les notions de race, de genre et d'empire qui sont au cœur de la critique postcoloniale actuelle sont transférées dans la culture populaire. des analyses des conséquences à court et à long terme du colonialisme dans le monde contemporain imprègnent le discours public dans les spectacles et les documentaires pour les principaux médias, les films blockbuster, les festivals de cinéma financés par les institutions et les expositions d'art. d'un point de vue politique il est possible de distinguer deux grandes approches. d'une part, il y a une critique de gauche qui se concentre sur la déconstruction de la race et de l'ethnicité. d'autre part, il y a une critique de l'extrême droite qui vise à restaurer la race et les divisions ethniques et les privilèges qui ont vraisemblable
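Although extraction is not needed for the reviews, the idea mentioned at the start of this section can be sketched with regular expressions. This is a simplified standard-library illustration, not the library's own extraction helpers, which should be preferred in practice; the patterns, the helper names, and the sample sentence are all assumptions made for the example, and the IPv4 pattern does not check that each octet is in range.

```python
import re

# Deliberately simple patterns for illustration only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_emails(text: str) -> list:
    """Return all email-like substrings, lowercased."""
    return [m.lower() for m in EMAIL_RE.findall(text)]


def extract_ipv4(text: str) -> list:
    """Return all dotted-quad substrings (octet ranges are not validated)."""
    return IPV4_RE.findall(text)


sample_text = "Contact reviewer@example.com from 192.168.0.1 or editor@example.org."
print(extract_emails(sample_text))
print(extract_ipv4(sample_text))
```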

## Chunking

Chunking splits document elements into smaller parts for use cases like retrieval-augmented generation. It is similar to the partitioning we performed above, but it combines the partitioned elements to produce a sequence of CompositeElement, Table, or TableChunk elements. Since our examples contain no tables, it should produce only CompositeElement objects.

In [15]:
from unstructured.chunking.basic import chunk_elements

# Applying the basic chunking strategy during partitioning should yield the same number of chunks as calling chunk_elements after partitioning.
chunk_elements_pdf = chunk_elements(pdf_elements)
print(len(chunk_elements_pdf))

partition_chunk_pdf = partition(filename=pdf_filename, chunking_strategy='basic')
print(len(partition_chunk_pdf))

# Showing that, as predicted, the chunks are made up entirely of CompositeElement objects
for i in set([type(i) for i in chunk_elements_pdf + partition_chunk_pdf]):
    print(i)

# By artificially limiting the max_characters parameter, the chunk_elements function will create more chunks
chunk_elements_pdf_small = chunk_elements(pdf_elements, max_characters=250)
print(len(chunk_elements_pdf_small))
    

32
32
<class 'unstructured.documents.elements.CompositeElement'>
58
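The jump in chunk count when max_characters is lowered follows from how size-based chunking works in general. The sketch below is a simplified greedy approximation written for illustration, not the library's algorithm: `greedy_chunks` is a hypothetical name, the 500-character default and the separator are assumptions, and a single text longer than the limit is kept whole rather than split.

```python
def greedy_chunks(texts, max_characters=500, separator="\n\n"):
    """Pack consecutive texts into chunks no longer than max_characters.

    Simplification: a single text longer than the limit becomes its own
    oversized chunk instead of being split.
    """
    chunks = []
    current = ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # Adding this text would overflow, so close the current chunk
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


texts = ["a" * 100, "b" * 100, "c" * 100]
print(len(greedy_chunks(texts, max_characters=500)))
print(len(greedy_chunks(texts, max_characters=250)))
print(len(greedy_chunks(texts, max_characters=120)))
```

The smaller the limit, the fewer elements fit into each chunk, so the same input produces more chunks.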
