# Practice using different kinds of document loaders
The loaders are wrapped up in utils/document_loaders.py. \
Import that module to ingest different kinds of unstructured data as text for use with an LLM.

## PPT Loader

In [2]:
# custom file for different document loaders
from utils import document_loaders
from utils import qna_llm

In [2]:
ppt_loader = document_loaders.PowerPointLoader()

docs = ppt_loader.load("data/ml_course.pptx")

In [3]:
context = ppt_loader.format_docs(docs)
#context = ppt_loader.clean_text(context) # not really necessary for pptx
context[:1000]

'### Slide 1:\n\nMachine Learning Model Deployment\nIntroduction to ML Pipeline\nhttps://bit.ly/bert_nlp\n\n\n### Slide 2:\n\nWhat is Machine Learning Pipeline?\n\n\n### Slide 3:\n\nType of ML Deployment\nBatch: In batch deployment, ML models process large volumes of data at scheduled intervals, ideal for tasks like end-of-day reporting or monthly analytics.\nStream: Stream deployment enables ML models to process and analyze data in real-time as it flows in, suitable for applications like fraud detection or live social media analysis.\nRealtime: Realtime deployment allows ML models to provide instant predictions or decisions in response to incoming data, essential for use cases like recommendation systems or autonomous driving.\nEdge: Edge deployment involves running ML models on local devices close to the data source, reducing latency and bandwidth usage, which is crucial for IoT applications and smart devices.\n\n\n### Slide 4:\n\nInfrastructure and Integration\nHardware and Software
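
The output above suggests that `format_docs` joins the slide texts under `### Slide N:` headings. The source of `utils/document_loaders.py` is not shown here, so the following is only a minimal sketch of that formatting step. The `Doc` class and the `format_slides` function are hypothetical names used for illustration, not the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (one per slide)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_slides(docs):
    """Join slide texts under '### Slide N:' headings, two blank lines apart."""
    sections = []
    for number, doc in enumerate(docs, start=1):
        sections.append(f"### Slide {number}:\n\n{doc.page_content.strip()}")
    return "\n\n\n".join(sections)


slides = [
    Doc("Machine Learning Model Deployment\nIntroduction to ML Pipeline"),
    Doc("What is Machine Learning Pipeline?"),
]
formatted = format_slides(slides)
print(formatted)
```

Keeping the slide number in a heading gives the LLM an anchor for per-slide instructions such as the script-writing prompt used later.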

In [5]:
question = """
For each PowerPoint slide provided above, write a 2-minute script that effectively conveys the key points.
Ensure a smooth flow between slides, maintaining a clear and engaging narrative.
"""
response = qna_llm.ask_llm(context, question)
response[:1000]

'Here are the scripts for each PowerPoint slide:\n\n**Slide 1: Machine Learning Model Deployment**\n\n[Opening shot of a presentation title]\n\nNarrator: "Welcome to our presentation on machine learning model deployment. In today\'s digital landscape, deploying machine learning models is crucial for any organization that wants to stay competitive. But with so many options available, it can be overwhelming to choose the right one."\n\n[Pause for emphasis]\n\nNarrator: "In this presentation, we\'ll take you through the basics of machine learning pipeline, type of deployment, infrastructure and integration, benefits, challenges, data and model management, A/B testing, security, compliance, and bias. So let\'s get started!"\n\n**Slide 2: What is Machine Learning Pipeline?**\n\n[Visuals of a workflow diagram]\n\nNarrator: "A machine learning pipeline is the process of collecting, processing, and analyzing data to build and deploy machine learning models. It involves several stages, includin

### Using save_markdown from the document loader class to save the response to a folder

In [6]:
ppt_loader.save_markdown(response, 'llm_reports/ppt_script.md')
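
How `save_markdown` is implemented is not shown in this notebook. A minimal sketch of such a helper, assuming it only needs to create the target folder and write UTF-8 text, could look like this. The standalone function below is hypothetical and runs against a temporary directory.

```python
from pathlib import Path
import tempfile


def save_markdown(text, path):
    """Write text to a Markdown file, creating missing parent folders."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


# Demo in a temporary directory so nothing is written to the project tree.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_markdown("# Script\n\nSlide 1 ...", Path(tmp) / "llm_reports" / "ppt_script.md")
    saved_text = saved.read_text(encoding="utf-8")
    print(saved_text)
```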

## Excel Loader
Similar operations, but for Excel files.

In [1]:
from utils import document_loaders
from utils import qna_llm

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [2]:
excel_loader = document_loaders.ExcelLoader()
docs = excel_loader.load("data/sample.xlsx")
docs

[Document(metadata={'source': 'data/sample.xlsx', 'file_directory': 'data', 'filename': 'sample.xlsx', 'last_modified': '2025-04-21T14:37:46', 'page_name': 'Data', 'page_number': 1, 'text_as_html': '<table><tr><td>First Name</td><td>Last Name</td><td>City</td><td>Gender</td></tr><tr><td>Brandon</td><td>James</td><td>Miami</td><td>M</td></tr><tr><td>Sean</td><td>Hawkins</td><td>Denver</td><td>M</td></tr><tr><td>Judy</td><td>Day</td><td>Los Angeles</td><td>F</td></tr><tr><td>Ashley</td><td>Ruiz</td><td>San Francisco</td><td>F</td></tr><tr><td>Stephanie</td><td>Gomez</td><td>Portland</td><td>F</td></tr></table>', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 'category': 'Table', 'element_id': 'c44665196e07c27314922db69accb8b6'}, page_content='First Name Last Name City Gender Brandon James Miami M Sean Hawkins Denver M Judy Day Los Angeles F Ashley Ruiz San Francisco F Stephanie Gomez Portland F')]

We preserve the HTML formatting of the sheet so that the LLM can understand the structure of the Excel file. This can later be converted to Markdown for better readability.
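
The HTML-to-Markdown conversion mentioned above does not have to go through the LLM. As a rough sketch using only the standard library, the `text_as_html` table can be parsed and re-rendered directly. `TableParser` and `html_table_to_markdown` are hypothetical helpers, and the sketch assumes a simple table with no nested tables, merged cells, or `|` characters inside cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of a simple <table> into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html):
    """Render the first row as the header and the rest as body rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


html = ("<table><tr><td>First Name</td><td>City</td></tr>"
        "<tr><td>Judy</td><td>Los Angeles</td></tr></table>")
markdown_table = html_table_to_markdown(html)
print(markdown_table)
```

A deterministic conversion like this avoids spending tokens on reformatting and cannot drop or alter rows the way a generated answer might.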

In [3]:
context = excel_loader.format_docs(docs)
context

'### Sheet:\n\n<table><tr><td>First Name</td><td>Last Name</td><td>City</td><td>Gender</td></tr><tr><td>Brandon</td><td>James</td><td>Miami</td><td>M</td></tr><tr><td>Sean</td><td>Hawkins</td><td>Denver</td><td>M</td></tr><tr><td>Judy</td><td>Day</td><td>Los Angeles</td><td>F</td></tr><tr><td>Ashley</td><td>Ruiz</td><td>San Francisco</td><td>F</td></tr><tr><td>Stephanie</td><td>Gomez</td><td>Portland</td><td>F</td></tr></table>\n\n---'

Some example LLM tasks using this data:

In [4]:
question = "Return this data in Markdown format."
response = qna_llm.ask_llm(context, question)
print(response)

| First Name | Last Name | City | Gender |
| --- | --- | --- | --- |
| Brandon | James | Miami | M |
| Sean | Hawkins | Denver | M |
| Judy | Day | Los Angeles | F |
| Ashley | Ruiz | San Francisco | F |
| Stephanie | Gomez | Portland | F |


In [6]:
question = "Return all entries in the table where Gender is female. Format the response in Markdown. Only output the resulting table."
response = qna_llm.ask_llm(context, question)
print(response)

| First Name | Last Name | City    | Gender |
|------------|-----------|---------|--------|
| Judy       | Day        | Los Angeles | F      |
| Ashley     | Ruiz       | San Francisco | F      |
| Stephanie  | Gomez      | Portland   | F      |
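
Because the sheet is small and structured, an LLM answer like the one above can be cross-checked with a plain filter. The sketch below hard-codes the five rows from the sample sheet as dictionaries (in practice they would be parsed from the loaded table) and applies the same "Gender is female" condition.

```python
# Rows copied from the sample sheet shown above.
rows = [
    {"First Name": "Brandon", "Last Name": "James", "City": "Miami", "Gender": "M"},
    {"First Name": "Sean", "Last Name": "Hawkins", "City": "Denver", "Gender": "M"},
    {"First Name": "Judy", "Last Name": "Day", "City": "Los Angeles", "Gender": "F"},
    {"First Name": "Ashley", "Last Name": "Ruiz", "City": "San Francisco", "Gender": "F"},
    {"First Name": "Stephanie", "Last Name": "Gomez", "City": "Portland", "Gender": "F"},
]

# Keep only the rows whose Gender column is "F".
female_rows = [row for row in rows if row["Gender"] == "F"]
female_names = [row["First Name"] for row in female_rows]
print(female_names)
```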


## Microsoft Office data - Word Document

In [7]:
from utils import document_loaders
from utils import qna_llm

In [None]:
docx_loader = document_loaders.WordLoader()
docs = docx_loader.load("data/job_description.docx")
docs

[Document(metadata={'source': 'data/job_description.docx'}, page_content='Job Description - Data Scientist\n\nAt SpiceJet, we rely on data to provide us valuable insights, and to automate our systems and solutions to help us increase revenues, reduce costs and provide improved customer experiences. We are seeking an experienced data scientist to deliver insights and automate our systems and processes. Ideal team member will have mathematical and statistical expertise, experience with modern data science programming languages and machine learning/AI platforms and techniques. You will mine, clean and interpret our data and then develop machine learning models to deliver business value across different parts of the business. \n\nObjectives of this Role\n\nUse Data Science and Machine Learning to increase revenue, reduce costs and increase customer satisfaction.\n\nCollaborate with product design and engineering to develop an understanding of needs\n\nUnderstand where the required data res

In [11]:
context = docx_loader.format_docs(docs)
context = docx_loader.clean_text(context)
context[:1000]

'Job Description - Data Scientist At SpiceJet, we rely on data to provide us valuable insights, and to automate our systems and solutions to help us increase revenues, reduce costs and provide improved customer experiences. We are seeking an experienced data scientist to deliver insights and automate our systems and processes. Ideal team member will have mathematical and statistical expertise, experience with modern data science programming languages and machine learning/AI platforms and techniques. You will mine, clean and interpret our data and then develop machine learning models to deliver business value across different parts of the business. Objectives of this Role Use Data Science and Machine Learning to increase revenue, reduce costs and increase customer satisfaction. Collaborate with product design and engineering to develop an understanding of needs Understand where the required data resides and work on ways to extract the relevant data. Research and devise statistical and m
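
Comparing the raw `page_content` with the cleaned context suggests that `clean_text` mainly collapses newlines and repeated spaces. The actual method is not shown, so this is only a sketch of that assumed behavior.

```python
import re


def clean_text(text):
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "Job Description - Data Scientist\n\nAt SpiceJet, we rely on data.  \n\nObjectives of this Role"
cleaned = clean_text(raw)
print(cleaned)
```

Flattening whitespace shortens the context, but it also removes paragraph boundaries, which is presumably why the step is skipped for the slide text earlier.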

In [12]:
question = """
My name is Frank. I am a recent graduate from MIT with a focus on NLP and ML. 
I am applying for a Data Scientist position at SpiceJet.
Please write a concise job application email for me, removing any placeholders, including references to job boards or sources.
"""

response = qna_llm.ask_llm(context, question)
print(response)

Subject: Application for Data Scientist Position at SpiceJet

Dear Hiring Manager,

I am excited to apply for the Data Scientist position at SpiceJet. As a recent graduate from MIT with a focus on NLP and ML, I am confident that my skills and expertise align well with the requirements of this role.

With a strong foundation in mathematical and statistical techniques, I have developed proficiency in machine learning platforms and techniques, data mining, mathematics, and statistical analysis. My experience with Python, R, Excel, Tableau, and SQL enables me to effectively extract insights from data and develop predictive models.

I am particularly drawn to this role because of the opportunity to apply my skills to drive business value across different parts of the business. I am excited about the prospect of collaborating with product design and engineering to understand customer needs and developing solutions that increase revenue, reduce costs, and improve customer satisfaction.

Throu

## Potential future work for this concept:
Use your own resume as additional context. Ask the LLM to generate a cover letter based on the skills and information in your resume, matched to what the job description asks for. This can be done either by combining the two into a single context to pass in, or by creating a chain that takes two separate contexts and returns the desired description or cover letter.
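
The single-context option could be sketched as follows. `combine_contexts` is a hypothetical helper, and the section headings are an assumption chosen to match the `### ...:` style the loaders already emit. The combined string would then be passed as `context` to `qna_llm.ask_llm` in the same way as the other examples.

```python
def combine_contexts(job_description, resume):
    """Label each source so the LLM can tell the two documents apart."""
    sections = [("Job Description", job_description), ("Resume", resume)]
    return "\n\n".join(f"### {title}:\n\n{body.strip()}" for title, body in sections)


combined = combine_contexts(
    "We are seeking an experienced data scientist.",
    "Recent graduate with a focus on NLP and ML.\n",
)
print(combined)
```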

# YouTube video transcripts and SEO keywords
Extract YouTube transcripts and send them to an LLM to perform the desired tasks. NOTE: I couldn't get the video description and info to work; this seems to be an issue with pytube, which langchain_community uses.

In [1]:
from utils.document_loaders import YouTubeLoader
from utils import qna_llm

USER_AGENT environment variable not set, consider setting it to identify your requests.


### The YouTube loader is configured to chunk the transcript by a user-defined chunk_size_seconds.
Transcripts often run long, especially for longer videos, so pre-chunking is built into the load function.

In [3]:
# "MCP explained" YouTube video
url = 'https://www.youtube.com/watch?v=_d0duu3dED4'
yt_loader = YouTubeLoader()
docs = yt_loader.load(url, chunk_size_seconds=180) # set the chunk size in seconds. Default is 600 (10 min)

With 3-minute chunks, the transcript splits into 2 chunks (the video is under 5 minutes long).

In [7]:
len(docs)

2
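
The idea behind time-based chunking can be illustrated without the loader. The sketch below assumes the transcript arrives as `(start_seconds, text)` pairs, which is a simplification, and groups them into fixed windows. `chunk_transcript` and `format_timestamp` are hypothetical names and not part of `YouTubeLoader`.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def chunk_transcript(segments, chunk_size_seconds=180):
    """Group (start_seconds, text) pairs into fixed-length time windows."""
    chunks = {}
    for start, text in segments:
        index = int(start // chunk_size_seconds)
        chunks.setdefault(index, []).append(text)
    return [
        {"timestamp": format_timestamp(index * chunk_size_seconds), "text": " ".join(texts)}
        for index, texts in sorted(chunks.items())
    ]


segments = [
    (0, "Today we're diving into MCP."),
    (95, "Let's break down the architecture."),
    (200, "To wrap up."),
]
chunks = chunk_transcript(segments, chunk_size_seconds=180)
for chunk in chunks:
    print(chunk["timestamp"], chunk["text"])
```

A segment that starts at 200 seconds lands in the second 180-second window, so this toy transcript also yields two chunks.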

In [5]:
context = yt_loader.format_docs(docs)
context[:1000]

"### Untitled--Timestamp: 00:00:00\n\nToday we're diving into the model context protocol or MCP. One of the most significant advancements in LLM integration released by Anthropic in late 2024. So what exactly is MCP? At its core, the model context protocol is an open standard that enables seamless integration between AI models like claude and external data sources or tools. is addressing a fundamental limitation that has held back AI assistance from reaching their potential. Before MCP, connecting models to each new data source require custom implementations, which can get expensive. MCB solves this by providing a universal open standard for connecting AI systems with data sources, replacing fragmented integrations with a single protocol. This means we can give AI systems access to databases, file systems, APIs, and other tools in a standardized way. Let's break down the architecture. MCP follows a client server models with three key components. Hosts, clients, and servers. Host are LL

### Use LLM to generate YouTube video keywords
An SEO tool that generates keywords from a YouTube transcript.

In [12]:
question = """
You are an assistant for generating SEO keywords for YouTube.
Please generate a list of keywords from the above context.
Be creative and correct spelling if needed.
"""

keywords = []
for doc in docs:
    kws = qna_llm.ask_llm(context=doc.page_content, question=question)
    keywords.append(kws)

keywords = ', '.join(keywords)

In [13]:
print(keywords)

Here's a list of SEO keywords based on the provided context:

**Main Keywords:**

1. Model Context Protocol (MCP)
2. AI Integration
3. Artificial Intelligence (AI) Assistance
4. Natural Language Processing (NLP)

**Long-Tail Keywords:**

1. Standardized integration for AI models and external data sources
2. Seamless AI model connections
3. Universal protocol for AI system access
4. Open standard for AI assistance
5. Secure file access for AI applications
6. Executable functions in AI model context
7. Prompt-based instruction injection
8. Structured data objects for AI reference
9. Client-server architecture for AI integration
10. Server-side capabilities for AI tools

**Keyword Phrases:**

1. "MCP enables seamless AI model connections"
2. "Standardized protocol for AI system access"
3. "Seamless integration of AI and external data sources"
4. "Secure file access for AI applications"
5. "Prompt-based instruction injection in AI models"

Feel free to modify or expand these keywords as ne
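
Since each chunk produces its own free-form response, the joined string repeats introductions and may contain duplicate keywords. One possible post-processing step, sketched here under the assumption that the responses use numbered lists like the output above, is to extract the list items and de-duplicate them case-insensitively. `extract_keywords` and `merge_keywords` are hypothetical helpers.

```python
import re


def extract_keywords(response):
    """Pull the items out of numbered-list lines such as '1. AI Integration'."""
    keywords = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+\.\s+(.*\S)", line)
        if match:
            keywords.append(match.group(1).strip('"'))
    return keywords


def merge_keywords(responses):
    """Merge keywords from several responses, dropping case-insensitive repeats."""
    seen = set()
    merged = []
    for response in responses:
        for keyword in extract_keywords(response):
            key = keyword.lower()
            if key not in seen:
                seen.add(key)
                merged.append(keyword)
    return merged


responses = [
    "**Main Keywords:**\n\n1. Model Context Protocol (MCP)\n2. AI Integration",
    "Here is a list:\n\n1. AI integration\n2. Client-server architecture",
]
merged = merge_keywords(responses)
print(", ".join(merged))
```

Asking the LLM for a plain comma-separated list in the prompt would be an alternative that avoids parsing altogether.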