## Setup Environment

In [1]:
!pip install -U langchain langchain-community langchain-core langchain-deepseek
!pip install faiss-cpu

Collecting langchain-community
  Downloading langchain_community-0.3.21-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-deepseek
  Downloading langchain_deepseek-0.1.3-py3-none-any.whl.metadata (1.1 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-openai<1.0.0,>=0.3.9 (from langchain-deepseek)
  Downloading langchain_openai-0.3.12-py3-none-any.whl.metadata (2.3 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->l

In [2]:
!gdown 1n8HhgQenifEBGJKMcOwh53q5JOJbv58a

Downloading...
From: https://drive.google.com/uc?id=1n8HhgQenifEBGJKMcOwh53q5JOJbv58a
To: /content/description.txt
  0% 0.00/6.87k [00:00<?, ?B/s]100% 6.87k/6.87k [00:00<00:00, 16.8MB/s]


## Split Document

In [3]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("description.txt")
document = loader.load()

print(document)

[Document(metadata={'source': 'description.txt'}, page_content="Alex is a dedicated software engineer with a passion for solving complex problems. Alex has over five years of experience in full-stack development, specializing in JavaScript and Python. Alex thrives in collaborative environments and enjoys mentoring junior developers. Alex’s attention to detail ensures that every project meets high standards of quality and efficiency.\u200b\u200b\n\n\u200b\u200bJordan is an enthusiastic marketing professional with a knack for creative storytelling. Jordan has led successful campaigns for global brands, blending data-driven strategies with compelling narratives. Jordan’s ability to connect with diverse audiences makes them a valuable asset to any team. Jordan is also an avid traveler, drawing inspiration from different cultures and perspectives.\u200b\u200b\n\n\u200b\u200bRiley is a compassionate nurse with a strong commitment to patient care. Riley works in a busy urban hospital, where t

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the document into sentence-sized chunks.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", "! ", "? ", "\u200b"],  # Try these separators in order.
    chunk_size=1,  # Maximum chunk length in characters; 1 forces a split at every separator.
    chunk_overlap=0,  # No overlap between chunks.
    keep_separator=False  # Discard the separators after splitting.
)

docs = splitter.split_documents(document)
print(docs)

[Document(metadata={'source': 'description.txt'}, page_content='Alex is a dedicated software engineer with a passion for solving complex problems'), Document(metadata={'source': 'description.txt'}, page_content='Alex has over five years of experience in full-stack development, specializing in JavaScript and Python'), Document(metadata={'source': 'description.txt'}, page_content='Alex thrives in collaborative environments and enjoys mentoring junior developers'), Document(metadata={'source': 'description.txt'}, page_content='Alex’s attention to detail ensures that every project meets high standards of quality and efficiency.'), Document(metadata={'source': 'description.txt'}, page_content='Jordan is an enthusiastic marketing professional with a knack for creative storytelling'), Document(metadata={'source': 'description.txt'}, page_content='Jordan has led successful campaigns for global brands, blending data-driven strategies with compelling narratives'), Document(metadata={'source': 'd
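
To see why the chunks above end up sentence-sized, here is a minimal standard-library sketch of separator-based splitting. It is a simplified one-pass stand-in, not LangChain's recursive algorithm, and the helper name `split_on_separators` and the sample text are made up for illustration.

```python
import re

def split_on_separators(text, separators):
    """Split text at every listed separator and drop empty pieces."""
    # Build one alternation pattern; re.escape keeps ". " and "? " literal.
    pattern = "|".join(re.escape(sep) for sep in separators)
    pieces = re.split(pattern, text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Hypothetical sample that mimics the zero-width spaces in description.txt.
sample = (
    "Alex is an engineer. Alex likes Python.\u200b\u200b\n\n\u200b\u200b"
    "Jordan works in marketing. Jordan travels often."
)
chunks = split_on_separators(sample, ["\n\n", "\n", ". ", "! ", "? ", "\u200b"])
for chunk in chunks:
    print(repr(chunk))
```

A sentence that ends a paragraph keeps its period, because the period is followed by a zero-width space rather than by the space that the `". "` separator requires. The same effect is visible in the notebook output above.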

## Embedding Documents

In [5]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

emb = embedding_model.embed_query("Hello, I'm learning LLM.")  # Embed a sentence into a vector.
print(type(emb))  # A plain Python list of floats.
print(len(emb))  # 768 dimensions for this model.

<class 'list'>
768
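
Embeddings are useful because sentences with similar meaning map to nearby vectors. The sketch below compares toy 3-dimensional vectors with cosine similarity; the vectors are invented stand-ins for the 768-dimensional ones returned by `embed_query`, and cosine similarity is only one common choice of metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real model output.
v_query = [1.0, 0.0, 1.0]
v_same_direction = [2.0, 0.0, 2.0]
v_orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(v_query, v_same_direction))  # close to 1: same direction
print(cosine_similarity(v_query, v_orthogonal))      # 0: unrelated direction
```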


## Build Vector Database

In [6]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(docs, embedding_model)

In [7]:
vectorstore.similarity_search("Introduce Alex", k=3)

[Document(id='e6e02e39-4ddd-4f05-80de-5c79491ef404', metadata={'source': 'description.txt'}, page_content='Alex is a dedicated software engineer with a passion for solving complex problems'),
 Document(id='09a9b5b1-a103-4aa1-a46e-5bcf211bcbe8', metadata={'source': 'description.txt'}, page_content='Alex thrives in collaborative environments and enjoys mentoring junior developers'),
 Document(id='6663bcac-aa48-4ef6-b80b-c927228b9376', metadata={'source': 'description.txt'}, page_content='Alex has over five years of experience in full-stack development, specializing in JavaScript and Python')]
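
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are nearest to it. The brute-force sketch below illustrates that idea with Euclidean distance over a made-up toy index; the function `top_k` and the 2-dimensional vectors are hypothetical, and the actual index type and metric depend on how the FAISS store is configured.

```python
import math

def top_k(query_vec, indexed, k):
    """Return the k texts whose vectors are closest to query_vec (Euclidean)."""
    scored = [(math.dist(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [text for _, text in scored[:k]]

# Hypothetical (text, vector) pairs standing in for embedded chunks.
toy_index = [
    ("chunk about Alex", [0.9, 0.1]),
    ("chunk about Jordan", [0.1, 0.9]),
    ("another chunk about Alex", [0.8, 0.2]),
]

print(top_k([1.0, 0.0], toy_index, k=2))
```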

## Chat with LLM

Here we use DeepSeek as an example. First, add a small amount of credit to your account and create an API key via [this link](https://platform.deepseek.com/).

In [8]:
import os
from langchain_deepseek import ChatDeepSeek

# Configure your DeepSeek API key.
os.environ["DEEPSEEK_API_KEY"] = ""  # TODO: paste your API key here.

llm = ChatDeepSeek(
    model="deepseek-chat"
)
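
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to prompt for the key only when the environment variable is not already set. The helper `ensure_api_key` and its `ask` parameter are assumptions made for this example, not part of `langchain_deepseek`.

```python
import os
from getpass import getpass

def ensure_api_key(env_var="DEEPSEEK_API_KEY", ask=getpass):
    """Set env_var from a hidden prompt if it is missing, then return its value."""
    if not os.environ.get(env_var):
        # `ask` is injectable so the helper can be exercised without a terminal.
        os.environ[env_var] = ask(f"Enter {env_var}: ")
    return os.environ[env_var]
```

Calling `ensure_api_key()` before constructing `ChatDeepSeek` would replace the hardcoded assignment in the cell above.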

In [9]:
messages = [
    ("human", "Hi, do you know Alex?"),
]
ai_msg = llm.invoke(messages)
print(ai_msg.content)

It depends! If you're referring to a specific person named Alex, I don't have personal knowledge of individuals unless they are public figures. If you're asking about a public figure (e.g., a celebrity, author, or historical figure), let me know more details so I can help.  

If you're asking in general—yes, I "know" Alex as a common name! 😊 How can I assist you?


Next, let's build our prompt.

In [10]:
prompt_template = """You are an AI assistant. Please answer the user's question based on the following content:
'''
{content}
'''

Answer the user's question directly. There is no need to say "Based on the information".
"""

question = "Hi, do you know Alex?"

Finally, let's combine all the steps into a single function.

In [11]:
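def chat(question: str) -> str:
    """Answer a question using the three most similar chunks as context."""
    # Retrieve the chunks most similar to the question.
    docs = vectorstore.similarity_search(question, k=3)
    content = '\n'.join(doc.page_content for doc in docs)
    # Put the retrieved text in the system prompt and the question in the human turn.
    messages = [
        ("system", prompt_template.format(content=content)),
        ("human", question),
    ]
    resp = llm.invoke(messages)
    return resp.content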
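
To inspect what is sent to the model without spending API credit, the retrieval and prompt-assembly steps can be dry-run offline. The sketch below uses hypothetical stand-ins (`FakeDoc`, `FakeStore`, `build_messages`, and a shortened prompt); the fake store ranks by keyword overlap instead of vector similarity, so it only mirrors the shape of the real pipeline.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document; only page_content is needed here."""
    page_content: str

class FakeStore:
    """Stand-in vector store that ranks texts by shared words with the query."""
    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=3):
        words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: -len(words & set(text.lower().split())),
        )
        return [FakeDoc(text) for text in ranked[:k]]

def build_messages(store, question, k=3):
    """Assemble the (role, text) pairs that would be passed to llm.invoke."""
    docs = store.similarity_search(question, k=k)
    content = "\n".join(doc.page_content for doc in docs)
    return [("system", f"Answer based on:\n{content}"), ("human", question)]

store = FakeStore(["Alex writes Python", "Jordan runs campaigns", "Riley is a nurse"])
for role, text in build_messages(store, "Who writes Python?", k=1):
    print(f"[{role}] {text}")
```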

In [12]:
print(chat("Who can analyse data?"))

Both Casey and Sage can analyze data. Casey has strong analytical skills and attention to detail, while Sage is an astrophysicist specifically analyzing data from the James Webb Space Telescope.
