<a href="https://colab.research.google.com/github/kishdas/from_modernaipro/blob/main/Document_analysis_with_LLM_Modern_AI_Pro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modern AI Pro: Document analysis with LLMs
## Step 1: Set up the basics for processing the document
Get one of our [research papers](https://drive.google.com/file/d/1kfjF9iuGG74ORFGu4v9dMO15pgsKI5Rh/view?usp=sharing) as a sample. Download a copy locally and upload it to the runtime as `arso1.pdf`, the filename the code below reads.

In [2]:
# We will use a simple utility to make the text wrap properly when printing.
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [3]:
# Read pages of the document
!pip install -q -U pypdf2
from PyPDF2 import PdfReader
reader = PdfReader('arso1.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text() + " "


FileNotFoundError: [Errno 2] No such file or directory: 'arso1.pdf'
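
The `FileNotFoundError` above is what the cell raises when the PDF has not been uploaded to the runtime yet. A minimal sketch of a guard that fails with a more helpful message; `require_file` is a hypothetical helper introduced here, not part of PyPDF2:

```python
from pathlib import Path

def require_file(path):
    """Return path as a Path object, or raise a readable error if the file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found. Upload it to the runtime and re-run this cell."
        )
    return p

# Possible usage before reading the PDF (assumes the file is named arso1.pdf):
# reader = PdfReader(str(require_file('arso1.pdf')))
```

Checking up front keeps the failure close to its cause instead of surfacing inside the PDF reader.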

In [None]:
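from wordcloud import WordCloud
import matplotlib.pyplot as plt

def display_word_cloud(top_100_words):
  # Build the cloud from (word, count) pairs, so larger words are more frequent.
  wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(dict(top_100_words))

  # Render the image without axes.
  plt.figure(figsize=(10, 5))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis('off')
  plt.show()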

## Step 2: Visualize the data

In [None]:
import re
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = re.sub(r'[^a-zA-Z\s]', '', text)
text = text.lower()
words = text.split()
word_counts = Counter(words)
top_100_words = word_counts.most_common(100)

display_word_cloud(top_100_words)

Most of the words in that cloud are common words such as "the" and "of". Let's remove them and display the cloud again.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Filter out stop words from your list of words
filtered_words = [word for word in words if word not in stop_words]
word_counts_filtered = Counter(filtered_words)

# If you still want to limit it to the top 100 words
top_100_words_filtered = word_counts_filtered.most_common(100)

display_word_cloud(top_100_words_filtered)
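
To see the effect of the stop-word filter without a word cloud, the same steps can be run on a single sentence. This is a small sketch under stated assumptions: `TOY_STOP_WORDS` is a hand-picked list for illustration only (NLTK's English list is much longer), and `top_words` and the sample sentence are made up for this example.

```python
import re
from collections import Counter

# Tiny hand-picked stop list, for illustration only.
TOY_STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def top_words(raw_text, stop_words=frozenset(), n=3):
    """Lowercase, strip non-letters, drop stop words, return the n most common words."""
    cleaned = re.sub(r"[^a-zA-Z\s]", "", raw_text).lower()
    words = [w for w in cleaned.split() if w not in stop_words]
    return Counter(words).most_common(n)

sample = "The robot is likable. The likability of the robot is the focus of the study."
print(top_words(sample))                  # dominated by common words
print(top_words(sample, TOY_STOP_WORDS))  # content words come to the top
```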

Lemmatize the words so that inflected forms (for example, plurals) are counted together.

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this resource

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(text)  # Tokenize the text
stop_words = set(stopwords.words('english'))

# Lemmatize tokens and remove stop words
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words and token.isalpha()]

# Recount words
word_counts = Counter(lemmatized_tokens)

# Extract the top 100 words
top_100_words_lemmatized = word_counts.most_common(100)
display_word_cloud(top_100_words_lemmatized)

## Step 3: Storing the document in a vector DB

**Split the texts into small chunks**

In [None]:
!pip install -q -U langchain langchainhub langchain-community chromadb sentence-transformers
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

documents = [Document(page_content=text, metadata={"source": "local"})]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
all_splits = text_splitter.split_documents(documents)
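
The two parameters above control how long each chunk may be and how many characters neighbouring chunks share. The sketch below illustrates only that idea with fixed-width windows. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs, newlines and spaces), and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=40):
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

demo_text = "abcdefghij" * 50  # 500 characters of filler
demo_chunks = sliding_chunks(demo_text)
print([len(c) for c in demo_chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.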

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings  # installed above as langchain-community
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs={"device": "cpu"})

In [None]:
from langchain_community.vectorstores import Chroma  # installed above as langchain-community

vectordb_paper = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db_paper")

In [None]:
query = "Tell me about likability index"
docs = vectordb_paper.similarity_search(query)
print(docs[0].page_content)
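
Conceptually, a similarity search embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that ranking with cosine similarity on made-up three-dimensional vectors. Cosine similarity is one common choice, and the metric a given Chroma collection actually uses may differ. The names `cosine_similarity`, `top_k` and `toy_chunks` are assumptions for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k chunk ids whose vectors are most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]

# Made-up vectors; real all-mpnet-base-v2 embeddings have 768 dimensions.
toy_chunks = {
    "likability": [0.9, 0.1, 0.0],
    "methods": [0.2, 0.8, 0.1],
    "references": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_chunks))
```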

In [None]:
retriever_paper = vectordb_paper.as_retriever()

## Step 4: Setting up the LLM

In [None]:
!pip install -U -q langchain-groq
import os
from google.colab import userdata
from langchain_groq import ChatGroq
os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")
llm_groq = ChatGroq(model_name="llama3-70b-8192")

In [None]:
from langchain.chains import RetrievalQA
qa_paper = RetrievalQA.from_chain_type(
    llm_groq,
    chain_type="stuff",
    retriever=retriever_paper,
    verbose=True
)
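
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM in one call. A rough sketch of the idea follows. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up chunks standing in for retriever output.
example_prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "Tell me about the likability index",
)
print(example_prompt)
```

Because everything goes into one prompt, this approach suits small chunks like the ones created above. With many or large chunks the prompt can exceed the model's context window.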

In [None]:
def rag_manager(qa, query):
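    # invoke() replaces the deprecated run(); RetrievalQA returns the answer under "result".
    print("\nResult: ", qa.invoke({"query": query})["result"])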

In [None]:
rag_manager(qa_paper,"Tell me about the likability index")

## Step 5: You can now analyze any piece of text

In [None]:
news = """SRINAGAR, India (AP) — For decades, India has focused its defense policy on its land borders with rivals Pakistan and China. Now, as its global ambitions expand, it is beginning to flex its naval power in international waters, including anti-piracy patrols and a widely publicized deployment close to the Red Sea to help protect ships from attacks during Israel’s war with Hamas.

India sent three guided missile destroyers and reconnaissance aircraft in November when Yemen-based Houthi rebels began targeting ships in solidarity with Hamas, causing disruptions in a key trading route that handles about 12% of global trade.

The deployment highlights the country as a “proactive contributor” to international maritime stability, said Vice Adm. Anil Kumar Chawla, who retired in 2021 as head of India’s southern naval command.


“We are not doing it only out of altruism. Unless you are a maritime power you can never aspire to be a global power,” Chawla said. India, already a regional power, is positioning itself “as a global player today, an upcoming global power,” he said.
India is widely publicizing the deployments, signaling its desire to assume a wider responsibility in maritime security to the world and its growing maritime ambitions to regional rival China.

“It is a message to China that, look, we can deploy such a large force here. This is our backyard. Though we don’t own it, but we are probably the most capable and responsible resident naval power,” Chawla said.

The Indian navy has helped at least four ships, three of which were attacked by Houthi rebels and another that Washington blamed on Iran, a charge denied by Tehran. It has also conducted several anti-piracy missions."""

documents = [Document(page_content=news, metadata={"source": "local"})]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")
retriever = vectordb.as_retriever()
qa = RetrievalQA.from_chain_type(
    llm=llm_groq,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

In [None]:
rag_manager(qa, "What are the key countries involved in this? Comment on the geopolitics behind it.")