<a href="https://colab.research.google.com/github/kareemrasheed89/DataQuestVisualization-Proj/blob/master/Doc_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a Document Summarizer
------------------------------
Summarizing long texts can be quite a challenge, but LlamaIndex paired with a large language model (LLM) makes it manageable. Imagine you're reading a lengthy book or a detailed report and need to condense it into a short, easy-to-read summary.

LlamaIndex provides several response strategies to help you do just that. This notebook walks through one of them, tree summarization, using a PDF guide to Apache Spark as a real-world example.

In [35]:
!pip install llama-index llama-index-llms-groq llama-index-embeddings-huggingface



In [36]:
import os


In [37]:
from google.colab import userdata

# Read the Groq API key from Colab's secrets and expose it as an environment variable
os.environ["GROQ_API_KEY"] = userdata.get("GROQ_API_KEY")

In [38]:
import nest_asyncio
nest_asyncio.apply()

In [39]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.llms.groq import Groq
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [40]:
llm = Groq(model="llama-3.1-8b-instant")

In [44]:
# LLM used to generate the summaries
Settings.llm = llm
# Embedding model registered as the global default
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [45]:
# Load the PDF into a list of Document objects
Documents = SimpleDirectoryReader(input_files=["/content/The-Data-Engineers-Guide-to-Apache-Spark.pdf"]).load_data()

In [46]:
print(type(Documents))

<class 'list'>


In [47]:
len(Documents)

98

In [48]:
Documents[0]

Document(id_='9e2706ab-fa73-4e1a-8cf1-604a0f18cfff', embedding=None, metadata={'page_label': '1', 'file_name': 'The-Data-Engineers-Guide-to-Apache-Spark.pdf', 'file_path': '/content/The-Data-Engineers-Guide-to-Apache-Spark.pdf', 'file_type': 'application/pdf', 'file_size': 13315912, 'creation_date': '2024-10-23', 'last_modified_date': '2024-10-23'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='The Dat a Engineer’ s Guide t o ', mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

In [49]:
# Split the documents into nodes of at most 2048 tokens each
splitter = SentenceSplitter(chunk_size=2048)
nodes = splitter.get_nodes_from_documents(Documents)

In [50]:
summary_index=SummaryIndex(nodes)

In [51]:
summary_query_engine=summary_index.as_query_engine(response_mode="tree_summarize",
                                                   use_async=True)
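
The `tree_summarize` response mode builds its answer hierarchically. As a rough mental model (a simplified sketch, not LlamaIndex's actual implementation), it summarizes groups of chunks, then summarizes those summaries, until a single answer remains. The names `tree_summarize_sketch`, `summarize_fn`, `fan_in`, and `show_grouping` below are hypothetical, and `summarize_fn` stands in for an LLM call.

```python
from typing import Callable, List, Sequence


def tree_summarize_sketch(
    chunks: Sequence[str],
    summarize_fn: Callable[[List[str]], str],
    fan_in: int = 2,
) -> str:
    """Reduce chunks to one summary by repeatedly summarizing groups of `fan_in`.

    Hypothetical illustration only: `summarize_fn` stands in for an LLM call.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")

    level = list(chunks)
    while True:
        # Summarize each group of `fan_in` neighbours into a single text.
        level = [
            summarize_fn(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


# Toy stand-in for the LLM: show the grouping instead of summarizing.
def show_grouping(group: List[str]) -> str:
    return "(" + "+".join(group) + ")"


print(tree_summarize_sketch(["a", "b", "c"], show_grouping))  # prints ((a+b)+(c))
```

Each level of the tree is a round of LLM calls, so a single query over many chunks can consume far more tokens than the final answer suggests.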

In [52]:
response=summary_query_engine.query("What is Apache Spark?")

In [53]:
print(response.response)

A unified analytics engine for large-scale data processing.


In [57]:
response2=summary_query_engine.query("What are the main features of Apache Spark?")



RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.1-8b-instant` in organization `org_01javs6vfnecfsjt6khqjy561j` on tokens per minute (TPM): Limit 20000, Used 39685, Requested 4675. Please try again in 1m13.081s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}
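
The 429 error above means the query exceeded the tokens-per-minute limit for the model, and the message suggests waiting before trying again. One way to handle this is to wrap the query in a retry loop with exponential backoff. This is a minimal sketch: `query_with_retry` and `is_rate_limit_error` are hypothetical helpers, and checking for a `status_code` attribute equal to 429 is an assumption about the exception raised by the client library.

```python
import time
from typing import Any, Callable


def is_rate_limit_error(exc: BaseException) -> bool:
    # Assumption: the client's exception exposes the HTTP status as `status_code`.
    return getattr(exc, "status_code", None) == 429


def query_with_retry(
    query_fn: Callable[[str], Any],
    prompt: str,
    max_retries: int = 3,
    base_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call query_fn(prompt), waiting and retrying when a rate limit is hit."""
    for attempt in range(max_retries + 1):
        try:
            return query_fn(prompt)
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or if retries ran out.
            if not is_rate_limit_error(exc) or attempt == max_retries:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

With the engine from this notebook, it could be called as `query_with_retry(summary_query_engine.query, "What are the main features of Apache Spark?")`. The 60-second default for `base_delay` reflects the per-minute limit reported in the error and can be adjusted.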

In [None]:
print(response2.response)

In [56]:
# Persist the summary index (including its nodes) to disk so it can be reloaded later
summary_index.storage_context.persist(persist_dir="./summary_index")

In [59]:
from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext


In [61]:
storage_context=StorageContext.from_defaults(persist_dir="./summary_index")

In [63]:
summary_index=load_index_from_storage(storage_context=storage_context)

In [64]:
summary_query_engine=summary_index.as_query_engine(response_mode="tree_summarize",
                                                   use_async=True)

In [65]:
response=summary_query_engine.query("What is Apache Spark?")

In [66]:
print(response)

A unified analytics engine for large-scale data processing that provides high-level APIs in multiple programming languages and a highly optimized engine that supports general execution graphs.
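
The persist and reload cells above can be combined so that the index is rebuilt only when no saved copy exists. The helper below is a hypothetical sketch of that pattern: `load_or_build`, `build_fn`, and `load_fn` are illustrative names rather than LlamaIndex API, and the two callables are placeholders for the notebook's own build and load steps.

```python
import os
from typing import Any, Callable


def load_or_build(
    persist_dir: str,
    build_fn: Callable[[str], Any],
    load_fn: Callable[[str], Any],
) -> Any:
    """Load from persist_dir if it holds saved files, otherwise build."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    # build_fn is expected to create the object and persist it to persist_dir.
    return build_fn(persist_dir)
```

In this notebook, `build_fn` would create `SummaryIndex(nodes)` and persist its storage context to the given directory, while `load_fn` would wrap `StorageContext.from_defaults(persist_dir=...)` and `load_index_from_storage(...)`.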
