# Documentation


## Part 1: Document Conversion, OCR, and Preprocessing

### Document Conversion
I used the pypdf library to convert PDF documents into text. The Document_Conversion function takes a PDF file path as input, extracts text from each page, and concatenates it into a single string. pypdf was chosen for its simplicity and efficiency in handling PDF files without the need for additional dependencies.

### OCR Integration
For OCR (Optical Character Recognition), I use the pdf2image library to convert PDF pages into images and pytesseract to extract text from these images. The OCR_Integration function converts each page of the PDF to an image, applies OCR using pytesseract, and combines the extracted text into a single string. pytesseract was chosen for its accuracy and wide language support, making it suitable for documents in various languages.
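Tesseract identifies languages by its own three-letter codes, and pytesseract accepts them through the `lang` argument of `image_to_string`. The sketch below is illustrative only: the notebook calls `image_to_string` without `lang`, and the helper names `tesseract_lang` and `ocr_page` are hypothetical. It assumes the document language is already known and that the matching Tesseract language pack is installed.

```python
# Mapping from ISO 639-1 codes (the form langdetect returns) to Tesseract
# language codes. Only a few entries are listed here; extend as needed.
TESSERACT_LANG = {"de": "deu", "en": "eng", "fr": "fra", "es": "spa", "it": "ita"}


def tesseract_lang(iso_code, default="eng"):
    """Return the Tesseract code for an ISO 639-1 code, or the default."""
    return TESSERACT_LANG.get(iso_code, default)


def ocr_page(image, iso_code):
    """Run OCR on one page image using the language pack for iso_code."""
    # Imported lazily so tesseract_lang() works without pytesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image, lang=tesseract_lang(iso_code))


print(tesseract_lang("de"))  # deu
```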

### Preprocessing
The preprocess function performs several steps to clean and prepare the text for LLM processing:
* Removing repeating lines to reduce redundancy.
* Cleaning text by replacing newlines, tabs, and hyphens with spaces and removing non-ASCII characters.
* Language detection using the langdetect library, which supports a wide range of languages.
* Sentence segmentation and tokenization using spaCy. For the detected language, I try the web-based model (_core_web_sm) first and fall back to the news-based model (_core_news_sm), since web models handle general-purpose text better. If neither model is installed, the text is split on whitespace.
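The first two steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's exact `preprocess` function; the helper names `drop_repeated_lines` and `clean` are hypothetical.

```python
import re


def drop_repeated_lines(text: str) -> str:
    """Keep only the first occurrence of each line, preserving order."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def clean(text: str) -> str:
    """Replace newlines and tabs with spaces, then drop non-ASCII characters."""
    text = text.replace("\n", " ").replace("\t", " ")
    text = re.sub(r"[^\x00-\x7F]+", "", text)
    return text.strip()


sample = "Invoice 42\nPage header\nTotal: 10 €\nPage header\n"
print(clean(drop_repeated_lines(sample)))  # Invoice 42 Page header Total: 10
```

Note that the non-ASCII filter also removes characters such as German umlauts, which is worth keeping in mind for non-English documents.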

## Part 2: LLM-Powered Understanding and Actions

### LLM Integration
I used a combination of pre-trained models, one per task (entity recognition, relationship extraction, summarization, classification, and translation), rather than a single model. Relying on transfer learning lets me benefit from models trained on large datasets without extensive fine-tuning.

### Information Extraction
* Named Entity Recognition (NER): I used the flair library with the flair/ner-german-large model. This model is German-specific; it was chosen for its high accuracy on German text and can be swapped for a model that matches another document language.
* Relationship Extraction: Also using flair, I extract relationships between entities. The pre-trained relations classifier was chosen for its ability to identify semantic relationships.
* Summarization: I use the bert-extractive-summarizer library, which leverages BERT for extractive summarization. This model was chosen for its ability to identify key sentences without the need for fine-tuning.

### Document Classification
For document classification, I used the SetFitModel from the setfit library, specifically the luis-cardoso-q/kotodama-multilingual-v3 model. This model is pre-trained to classify documents into six categories: buying, company name, invoice, refund, rent, and salary. The multilingual aspect makes it suitable for my diverse language requirements.

### Internal Translation
I used the transformers library from Hugging Face for translation. The google/bert2bert_L-24_wmt_de_en model handles German-to-English translation and was chosen for its translation quality. The pipeline is called with max_length=4000 to allow long outputs. Only German is wired up at present; for other languages, similar models can be integrated.
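Translation models generally accept a bounded input length, so a long document may need to be split before it is translated. The notebook does not do this; the following is a minimal sketch under the assumption that a simple word budget (`max_words`, a hypothetical stand-in for a real token limit) is good enough to size the chunks. Each chunk could then be passed to the translation pipeline separately.

```python
import re


def chunk_sentences(text: str, max_words: int = 100) -> list:
    """Group sentences into chunks of at most max_words words each.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_sentences("Eins zwei drei. Vier fünf. Sechs sieben acht neun.", max_words=5))
# ['Eins zwei drei. Vier fünf.', 'Sechs sieben acht neun.']
```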

## Part 3: User Interaction via a Chatbot Interface

### Chatbot UI
I used streamlit to create a web-based chatbot interface. streamlit was chosen for its simplicity in creating interactive web apps with Python. The interface allows users to:
* Upload PDF documents.
* View document language, summary, and English translation (if applicable).
* Ask questions about entities, relationships, summary, or document category.
* Provide feedback on the chatbot's responses.
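Question handling in the interface is keyword-based rather than model-based. A minimal sketch of that routing, separated from the UI so it can be tested on its own (the function name `route_question` is hypothetical; the keywords and their order mirror the checks in `main`):

```python
def route_question(question: str) -> str:
    """Return the topic a question refers to, or "unknown"."""
    q = question.lower()
    # "entit" matches both "entity" and "entities".
    keywords = [
        ("entit", "entities"),
        ("relation", "relationships"),
        ("summary", "summary"),
        ("category", "category"),
    ]
    # The first matching keyword wins, so order matters.
    for needle, topic in keywords:
        if needle in q:
            return topic
    return "unknown"


print(route_question("Which entities are mentioned?"))  # entities
```

Because the first match wins, a question that mentions both entities and relationships is answered with the entity list.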

### Pipeline Integration
The process_document function integrates all the pipeline components:
* Convert PDF to text and apply OCR.
* Clean and preprocess the text.
* Translate if not in English.
* Extract entities, relationships, and generate summary.
* Classify the document.

The chatbot then uses this processed information to answer user queries.
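The pipeline returns its results as a tuple of eight values, which is easy to unpack in the wrong order. One possible alternative, shown here only as a sketch, is to group the results in a dataclass. The class `ProcessedDocument` and its field names are hypothetical and not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessedDocument:
    """Hypothetical container for the outputs of the pipeline steps."""
    text: str
    language: Optional[str] = None
    translated_text: str = ""
    summary: str = ""
    category: str = ""
    entities: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def needs_translation(self) -> bool:
        """True when a language was detected and it is not English."""
        return self.language is not None and self.language != "en"


doc = ProcessedDocument(text="Rechnung Nr. 42", language="de")
print(doc.needs_translation())  # True
```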

### Technical Requirements
Libraries and Frameworks

* Document Processing: pypdf, pdf2image, pytesseract
* NLP and Preprocessing: spacy, langdetect, flair
* LLM and Information Extraction: bert-extractive-summarizer, setfit, transformers
* Chatbot UI: streamlit
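Before running the app, it can help to confirm that these libraries are importable. The snippet below is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical. Note that import names can differ from pip names: bert-extractive-summarizer is imported as `summarizer`.

```python
from importlib.util import find_spec

# Import names of the libraries listed above.
REQUIRED_MODULES = [
    "pypdf", "pdf2image", "pytesseract", "spacy", "langdetect",
    "flair", "summarizer", "setfit", "transformers", "streamlit",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


print(missing_modules(REQUIRED_MODULES))
```

An empty list means every library was found. System packages such as tesseract-ocr and poppler-utils are not covered by this check.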

# Installing Necessary Packages

In [11]:
!pip install pypdf



In [12]:
!pip install pytesseract



In [13]:
!apt install tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [14]:
!pip install pdf2image



In [24]:
!apt-get update && apt-get install poppler-utils -y

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Readi

In [16]:
!pip install nltk



In [17]:
!pip install langdetect



In [18]:
!python -m spacy download de_core_news_sm

Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
!pip install flair

Collecting flair
  Downloading flair-0.13.1-py3-none-any.whl (388 kB)
Collecting boto3>=1.20.27 (from flair)
  Downloading boto3-1.34.122-py3-none-any.whl (139 kB)
Collecting bpemb>=0.3.2 (from flair)
  Downloading bpemb-0.3.5-py3-none-any.whl (19 kB)
Collecting conllu>=4.0 (from flair)
  Downloading conllu-4.5.3-py2.py3-none-any.whl (16 kB)
Collecting deprecated>=1.2.13 (from flair)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting ftfy>=6.1.0 (from flair)
  Downloading ftfy-6.2.0-py3-none-any.whl (54 kB)
Collecting janome>=0.4.2 (from flair)
  Downloading Janome-0.5.0-py2.py3-none-any.whl (1

In [19]:
!pip install bert-extractive-summarizer



In [11]:
!pip install streamlit

Collecting streamlit
  Downloading streamlit-1.35.0-py2.py3-none-any.whl (8.6 MB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Downloading GitPython-3.1.43-py3-none-any.whl (207 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Collecting watchdog>=2.1.5 (from streamlit)
  Downloading watchdog-4.0.1-py3-none-manylinux2014_x86_64.whl (83 kB)
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Downloading gitdb-4.0

In [1]:
!pip install langchain

Collecting langchain
  Using cached langchain-0.2.3-py3-none-any.whl (974 kB)
Collecting langchain-core<0.3.0,>=0.2.0 (from langchain)
  Using cached langchain_core-0.2.5-py3-none-any.whl (314 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Using cached langchain_text_splitters-0.2.1-py3-none-any.whl (23 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Using cached langsmith-0.1.75-py3-none-any.whl (124 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.3.0,>=0.2.0->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting packaging<24.0,>=23.2 (from langchain-core<0.3.0,>=0.2.0->langchain)
  Downloading packaging-23.2-py3-none-any.whl (53 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_6

In [2]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.2.4-py3-none-any.whl (2.2 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.6-py3-none-any.whl (28 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.21.3-py3-none-any.whl (49 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensi

In [3]:
!pip install setfit

Collecting setfit
  Downloading setfit-1.0.3-py3-none-any.whl (75 kB)
Collecting datasets>=2.3.0 (from setfit)
  Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
Collecting sentence-transformers>=2.2.1 (from setfit)
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting evaluate>=0.3.0 (from setfit)
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.3.0->setfit)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)


# Building the app

In [31]:
%%writefile app.py
import streamlit as st
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract
import spacy
from langdetect import detect
import re
from flair.data import Sentence
from flair.nn import Classifier
from flair.models import SequenceTagger
from summarizer import Summarizer
from setfit import SetFitModel
from transformers import pipeline

def Document_Conversion(path):
  reader = PdfReader(path)
  text = ""
  for page in reader.pages:
    per_page_text = page.extract_text()
    text += per_page_text
  return text

def OCR_Integration(path):
  pages = convert_from_path(path)
  extracted_text = []
  for page in pages:
    text = pytesseract.image_to_string(page)
    extracted_text.append(text)

  string_text = ""
  for texts in extracted_text:
    string_text += texts
  return string_text


def preprocess(text):
  #removing repeating lines:
  lines = text.split("\n")
  unique_lines = []
  for line in lines:
    if line not in unique_lines:
      unique_lines.append(line)

  text = "\n".join(unique_lines)

  # clean text (same for all languages)
  text = text.replace("\n", " ").replace("\t", " ").replace("-", " ")
  # remove non-ASCII characters (note: this also drops umlauts and other accented letters)
  text = re.sub(r"[^\x00-\x7F]+", "", text)
  text = text.strip()  # remove leading and trailing whitespace

  # detect language; langdetect raises an exception on empty or undetectable input
  try:
      language = detect(text)
  except Exception:
      language = None

  nlp_model = None

  # Prioritize web models first, then news models
  language_model_names = [f"{language}_core_web_sm", f"{language}_core_news_sm"]

  for language_model_name in language_model_names:
    try:
      nlp_model = spacy.load(language_model_name)
      break
    except Exception:
      # typically an OSError because the model package is not installed
      continue

  tokens = []
  if nlp_model:
      # segment into sentences, then collect the tokens of each sentence
      doc = nlp_model(text)
      for sent in doc.sents:
          for token in sent:
              tokens.append(token.text)
  else:
      # fall back to whitespace tokenization when no spaCy model is available
      tokens = text.split()

  return tokens, language

def summary_gen(text):
  model = Summarizer()
  result = model(text)
  return result

def named_entity_recognition_flair(tokens):
  tagger = SequenceTagger.load("flair/ner-german-large")
  new_full = Sentence(" ".join(tokens))
  # predict NER tags
  tagger.predict(new_full)

  #print('The following NER tags are found:')
  # iterate over entities and print
  entities = []
  relationships = []
  labels = []
  for entity in new_full.get_spans('ner'):
    entities.append(entity)
    #print(entity)

  extractor = Classifier.load('relations')

  extractor.predict(new_full)

  # collect the relations predicted by the extractor
  for relation in new_full.get_labels('relation'):
    relationships.append(relation)

  # NOTE: `labels` is filled from the same get_labels('relation') call, so it
  # duplicates `relationships`; it is kept so the return signature is unchanged.
  for label in new_full.get_labels('relation'):
    labels.append(label)
  return entities, relationships, labels


def clean_text(text):
  lines = [line.strip() for line in text.splitlines() if line.strip()]
  unwanted_chars = set("°~—;:-+#^&")
  filtered_lines = [line for line in lines if not any(char in line for char in unwanted_chars)]
  cleaned_text = "\n".join(filtered_lines)
  return cleaned_text

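def internal_translation(text, source_language):
  # Only German -> English is configured. For any other language, return an
  # empty string instead of failing on an undefined model_name.
  if source_language != "de":
    return ""
  model_name = "google/bert2bert_L-24_wmt_de_en"
  pipe = pipeline("translation_de_to_en", model=model_name)
  res = pipe(text, max_length=4000)
  # the pipeline returns a list of dicts with a "translation_text" key
  return res[0]["translation_text"]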

def document_classification(text):
  model = SetFitModel.from_pretrained("luis-cardoso-q/kotodama-multilingual-v3")
  category = model.predict(text)
  return category #categories are : buying, company name, invoice, refund, rent, salary

def process_document(file):

    with open("temp.pdf", "wb") as f:
        f.write(file.getvalue())


    text = Document_Conversion("temp.pdf")
    ocr_text = OCR_Integration("temp.pdf")
    combined_text = text + "\n" + ocr_text


    cleaned_text = clean_text(combined_text)
    tokens, language = preprocess(cleaned_text)

    translated_text = ""
    if language != "en":
      translated_text = internal_translation(text, language)
    else:
      pass

    entities, relationships, labels = named_entity_recognition_flair(tokens)

    summary = summary_gen(cleaned_text)

    category = document_classification(cleaned_text)

    return cleaned_text, entities, relationships, labels, summary, language, translated_text, category

def main():
    st.set_page_config(page_title="Chat PDF")
    st.header("Chat PDF 💬")

    pdf = st.file_uploader("Upload your PDF file", type="pdf")

    if pdf is not None:

        text, entities, relationships, labels, summary, language, translated_text, category = process_document(pdf)

        st.subheader(f"Document Language: {language}")
        st.subheader("Document Summary:")
        st.write(summary)
        if language != "en":
          st.subheader("English Translation")
          st.subheader("Document Contents:")
          st.write(translated_text)

        st.subheader("Chat with your Document:")
        user_input = st.text_input("Ask a question about your document:")

        if user_input:

            if "entit" in user_input.lower():
              st.write("Entities found in the document:")
              for entity in entities:
                st.write(f"- {entity}")
            elif "relation" in user_input.lower():
              st.write("Relationships found in the document:")
              for relation in relationships:
                st.write(f"- {relation}")
            elif "summary" in user_input.lower():
              st.write("Document Summary:")
              st.write(summary)
            elif "category" in user_input.lower():
              st.write("In the following categories: buying, company name, invoice, refund, rent, salary, the category of this document is:")
              st.write(category)
            else:
                st.write("I'm sorry, I don't understand that question. You can ask about entities, relationships, the summary, or the category of the document.")

            # User feedback
            st.write("Was this response helpful?")
            col1, col2 = st.columns(2)
            with col1:
                st.button("👍")
            with col2:
                st.button("👎")

if __name__ == "__main__":
    main()

Overwriting app.py


In [34]:
! wget -q -O - ipv4.icanhazip.com

34.106.230.164


In [38]:
! streamlit run app.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.106.230.164:8501

npx: installed 22 in 3.275s
your url is: https://tall-beans-shine.loca.lt
2024-06-09 10:04:17.839715: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-09 10:04:17.839771: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-09 10:04:17.844797: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register facto