# About this Notebook

This notebook analyses the structure of text and its recurring syntactic patterns, and lays the foundation for a chatbot that can read in and analyse a PDF, so that you can query and discuss its content using vector search.




In [20]:
##### SETTING UP THE ENVIRONMENT #####

# %pip install ipykernel -U --user --force-reinstall
# %pip install --upgrade pip
# %pip install diversity
# %pip install spacy
# %pip install nltk
# %pip install --upgrade packaging
# !python3 -m spacy download en_core_web_sm  # the bare 'en' shortcut is deprecated since spaCy v3

# import nltk
# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context

# nltk.download('punkt_tab')

In [33]:
#### IMPORTING DUMMY ESSAY ####

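# Read the generated essay and split it into lines; each line (blank ones included)
# is passed to the diversity metrics as a separate document
with open('example_essay.txt', 'r') as file:
    essay_text = file.read()

split_text = essay_text.split('\n')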


#### Extracting Syntactic Templates from a generated essay

In [34]:
from diversity import compression_ratio, homogenization_score, ngram_diversity_score, extract_patterns

cr = compression_ratio(split_text, 'gzip')
hs = homogenization_score(split_text, 'rougel')
# hs = homogenization_score(split_text, 'bertscore')
nds = ngram_diversity_score(split_text, 4)

print(cr, hs, nds)


==> Scoring all pairs


100%|██████████| 9120/9120 [00:00<00:00, 39226.80it/s]

2.216 0.052 2.938
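
The three numbers printed above are the compression ratio, the ROUGE-L homogenization score, and the 4-gram diversity score. As a rough intuition for the first and last, the sketch below recomputes simplified versions with only the standard library. It assumes that the compression ratio is the original size divided by the gzip-compressed size, and that n-gram diversity sums the unique-to-total n-gram ratio for each n from 1 to 4. The helper names are hypothetical, and the `diversity` package's own tokenisation and implementation may differ.

```python
import gzip

def simple_compression_ratio(docs):
    """Original byte size divided by gzip-compressed byte size (assumed definition)."""
    raw = "\n".join(docs).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def simple_ngram_diversity(docs, max_n=4):
    """Sum over n = 1..max_n of (unique n-grams / total n-grams), whitespace-tokenised."""
    tokens = " ".join(docs).split()
    score = 0.0
    for n in range(1, max_n + 1):
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        if ngrams:
            score += len(set(ngrams)) / len(ngrams)
    return score

# Toy inputs: one highly repetitive corpus and one with no repeated lines
repetitive = ["the model reads the text"] * 20
varied = [
    "the model reads the text",
    "a quick brown fox jumps",
    "vector search finds similar chunks",
]

# Repetitive text compresses better (higher ratio) and has lower n-gram diversity
print(simple_compression_ratio(repetitive) > simple_compression_ratio(varied))
print(simple_ngram_diversity(repetitive) < simple_ngram_diversity(varied))
```

Under these definitions a higher compression ratio signals more redundancy, while a higher n-gram diversity score (at most 4 when `max_n=4`) signals less repetition.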





In [37]:
n = 5 
top_n = 100
patterns = extract_patterns(split_text, n, top_n)
patterns

{'IN JJ NNS IN NN': {'on massive amounts of text',
  'on relevant parts of input',
  'over long sequences of text'},
 'HYPH NN NNP : :': {'- Thought Prompting : -', '- shot Prompting : -'},
 'NNP : : VBZ DT': {'Prompting : - Assigns a',
  'Prompting : - Provides a',
  'Prompting : - Requires no'},
 'JJ NNS CD . NN': set(),
 '. NNP NNP : :': set(),
 'JJ NN HYPH VBN NNS': {'simple rule - based systems',
  'various language - related tasks'},
 'NNP -LRB- NNP NNP JJ': {'GPT ( Generative Pre -'},
 '-LRB- NNP NNP JJ VBN': {'( Generative Pre - trained'},
 'NNP NNP JJ VBN NNP': {'Generative Pre - trained Transformer'},
 'NNP JJ VBN NNP -RRB-': {'Pre - trained Transformer )'},
 'NN NNP : : VBZ': {'shot Prompting : - Provides',
  'shot Prompting : - Requires'},
 'VBZ DT NN TO VB': {'Allows the model to focus', 'asks the model to perform'},
 ': : VBZ DT JJ': {': - Assigns a specific', ': - Provides a few'},
 'CC JJ NNS CD .': {'and residual connections 3 .', 'or logical problems 4 .'},
 'NNS CD .

In [44]:
with open('human_essay.txt', 'r') as new_file:
    text = new_file.read()

# One entry per line, matching how the generated essay was split
text_split_human = text.split('\n')

patterns = extract_patterns(text_split_human, n, top_n)
patterns


{'NN IN DT JJ NN': {'definition of a good developer',
  'democracy in the Western world',
  'indicator of a good design',
  'industry like no other shakedown',
  'model as the gold standard',
  'one with the architectural fit',
  'tech on a worldwide scale',
  'wave of the same category'},
 '. DT JJ NN IN': set(),
 'DT JJ NN IN DT': {'The 2nd wave of the',
  'The sole indicator of a',
  'The standard definition of a',
  'an auxiliary outlet in the'},
 'JJ NN IN DT JJ': {'2nd wave of the same',
  'big tech on a worldwide',
  'sole indicator of a good',
  'standard definition of a good'},
 'DT JJ NN IN NN': {'The bad thing about design',
  'a dominant point in management',
  'the direct line of fire',
  'the gold standard of app',
  'the high ground around privacy'},
 'DT JJ JJ NN .': {'a good mobile developer .', 'a great mobile developer .'},
 'NNP , NNP , CC': {'Apple , Android , and',
  'C++ , Java , and',
  'Cordova , Xamarin , and'},
 'DT NNS IN DT NN': {'the buzzwords on every tec

In [45]:
# Get patterns for both texts
human_patterns = extract_patterns(text_split_human, n, top_n)
example_patterns = extract_patterns(split_text, n, top_n)

# Rank patterns by the number of distinct matching phrases and keep the top 5
def get_top_5_patterns(patterns):
    # Convert the patterns dict to a list of (pattern, count) tuples, where
    # count is the size of the set of matching phrases; skip empty sets
    pattern_list = [(k, len(v)) for k, v in patterns.items() if len(v) > 0]
    # Sort by count in descending order
    pattern_list.sort(key=lambda x: x[1], reverse=True)
    # Return the top 5, or all of them if there are fewer than 5
    return pattern_list[:5]

print("Top 5 patterns in human-written essay:")
for pattern, freq in get_top_5_patterns(human_patterns):
    print(f"{pattern}: {freq} occurrences")

print("\nTop 5 patterns in example essay:")
for pattern, freq in get_top_5_patterns(example_patterns):
    print(f"{pattern}: {freq} occurrences")


Top 5 patterns in human-written essay:
NN IN DT JJ NN: 8 occurrences
DT JJ NN IN NN: 5 occurrences
DT JJ NN IN DT: 4 occurrences
JJ NN IN DT JJ: 4 occurrences
NNP , NNP , CC: 3 occurrences

Top 5 patterns in example essay:
IN JJ NNS IN NN: 3 occurrences
NNP : : VBZ DT: 3 occurrences
HYPH NN NNP : :: 2 occurrences
JJ NN HYPH VBN NNS: 2 occurrences
NN NNP : : VBZ: 2 occurrences


#### Analysis of Pattern Differences Between Human and Example Essays

The pattern analysis points to differences in writing style between the human-written and example essays. The counts reported as "occurrences" above are the number of distinct phrases matched by each part-of-speech (POS) template.

1. Pattern Frequency:
- Human essay: the top templates match more distinct phrases (8, 5, 4, 4, 3)
- Example essay: the top templates match fewer distinct phrases (3, 3, 2, 2, 2)
- These are raw counts that are not normalised for essay length, so treat the gap as indicative only

2. Pattern Types:
- Human essay favors noun phrases joined by a preposition (e.g. "NN IN DT JJ NN": noun, preposition, determiner, adjective, noun, as in 'definition of a good developer')
- Example essay's top templates contain colons and hyphens (e.g. "NNP : : VBZ DT", as in 'Prompting : - Provides a'), which come from heading and list formatting

3. Key Differences:
- Human writing: the recurring templates are descriptive phrases inside running prose
- Example essay: the recurring templates reflect a formatted, templated structure typical of technical writing

4. Notable Patterns:
Human Essay:
- Noun phrases built with prepositions and adjectives
- Coordinated lists of proper nouns (e.g. 'Apple , Android , and')
- Natural language patterns

Example Essay:
- Rigid formatting patterns around 'Prompting : -' style headings
- Technical/documentation style
- Structured headings and lists

This suggests the human essay has a more natural prose style while the example essay follows a more structured technical format.
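
To make the template labels concrete, the sketch below shows one way POS n-gram templates can be collected from text that has already been tagged. It illustrates the idea only and is not how `extract_patterns` is implemented: `pos_templates` is a hypothetical helper, and the `tagged` list is a hand-written Penn Treebank-style tagging of a phrase taken from the human essay output.

```python
from collections import defaultdict

def pos_templates(tagged_tokens, n):
    """Map each POS n-gram to the set of token sequences that realise it."""
    templates = defaultdict(set)
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        tags = " ".join(tag for _, tag in window)
        words = " ".join(word for word, _ in window)
        templates[tags].add(words)
    return dict(templates)

# Hand-tagged (token, tag) pairs; a real pipeline would get these from a tagger
tagged = [
    ("The", "DT"), ("standard", "JJ"), ("definition", "NN"), ("of", "IN"),
    ("a", "DT"), ("good", "JJ"), ("developer", "NN"),
]

# Slide a 5-token window over the phrase and group the text by its tag sequence
for tags, phrases in pos_templates(tagged, 5).items():
    print(tags, "->", phrases)
```

Counting the size of each set, as `get_top_5_patterns` does, then gives the number of distinct phrases per template.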


# Building a First Chatbot - PDF Reader
The user uploads a PDF of their choice through the user interface.

The application parses the PDF using a PDF parsing library and splits the extracted text into manageable chunks.

The chunks are converted into vector form, called embeddings.

When a user issues a query through the chat interface, the query is also converted into vector form.

The vector similarity between the query vector and each of the chunk vectors is calculated.

The text corresponding to the top-k most similar vectors is retrieved.

The retrieved text is fed, along with the query and any additional instructions, to an LLM.

The LLM uses the given information to generate an answer to the user query.

The response is displayed on the user interface. The user can now respond, for example with a clarification question, a new question, or thanks.

The entire conversation history is fed back to the LLM during each turn of the conversation.
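
The retrieval steps above can be sketched end to end without any external services. This is a toy illustration under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, chunking is by fixed word count with overlap, and `build_prompt` only assembles the text that would be sent to an LLM rather than calling one.

```python
import math
import re
from collections import Counter

def chunk_text(text, size=10, overlap=2):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy embedding: lower-cased word counts (a real system would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_chunks(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved, history):
    """Combine the conversation history, retrieved context, and the new query."""
    past = "\n".join(f"{role}: {msg}" for role, msg in history)
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return f"{past}\nContext:\n{context}\nUser: {query}"

document = (
    "Tesseract performs optical character recognition on scanned pages. "
    "Embeddings map text chunks to vectors so similar chunks lie close together. "
    "Gradio provides a simple chat interface for the user."
)
chunks = chunk_text(document, size=10, overlap=2)
history = [("User", "Hello"), ("Assistant", "Hi, ask me about the PDF.")]
query = "How do embeddings map text to vectors?"

best = top_k_chunks(query, chunks, k=1)
print(build_prompt(query, best, history))
```

In a real pipeline the history list would grow by one user turn and one assistant turn per exchange, and the whole list would be passed in again on every call.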

In [4]:
import pytesseract

# Point pytesseract at the Tesseract binary before calling it; adjust the path for your system
pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'
print(pytesseract.get_tesseract_version())

5.5.0


In [5]:
# %pip install openai langchain gradio unstructured
# %pip install langchain-community
# %pip install --upgrade pydantic
# %pip install pdfminer.six
# %pip uninstall pdfminer.six
# %pip install pi_heif
# %pip install unstructured_inference
# %pip install pytesseract
# %pip install poppler-utils

# If Tesseract is not in your PATH, include the following line
# pytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'

# LangChain: A popular framework for building LLM application pipelines.

# Gradio: A library for building user interfaces for LLM-driven applications.

# Unstructured: A PDF parsing suite that supports a variety of methods for extracting text from PDFs.
# Sentence-Transformers: A library for generating embeddings from text.
# OpenAI: This API provides access to the GPT family of models from OpenAI.
import langchain
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
import gradio as gr

In [6]:
loader = UnstructuredPDFLoader('gpt4_technical_report.pdf')
data = loader.load()

pytesseract is not installed. Cannot use the ocr_only partitioning strategy. Falling back to partitioning with another strategy.
Falling back to partitioning with hi_res.
Failed to get OCRAgent instance: No module named 'unstructured_pytesseract'


RuntimeError: Could not get the OCRAgent instance. Please check the OCR package and the OCR_AGENT environment variable.
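
The traceback above names the module that could not be imported (`unstructured_pytesseract`). Since `import pytesseract` succeeded in an earlier cell, the failure points at that module specifically rather than at Tesseract itself. A quick way to see which OCR-related modules the current kernel can actually import is sketched below, using only the standard library. The module list is an assumption based on the messages above, and `missing_modules` is a hypothetical helper. Which pip package provides each module depends on your `unstructured` version, so check its installation documentation before reinstalling.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Modules mentioned in the loader's messages (assumed list, adjust as needed)
ocr_modules = ["pytesseract", "unstructured_pytesseract", "unstructured_inference"]
print("Not importable:", missing_modules(ocr_modules))
```

Run this in the same kernel as the notebook, because a package installed into a different Python environment will still show up as missing here.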