In [27]:
from IPython.display import Markdown, display
from langchain.chat_models import ChatOllama, ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import FAISS
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableParallel


In [8]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM

template = """


Question: {question}

Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="llama3.2:3b")

chain = prompt | model | StrOutputParser()

Markdown(chain.invoke({"question": "What is LangChain for LLMs? Answer in 1000 words."}))

LangChain is an open-source library developed to enhance the performance and efficiency of Large Language Models (LLMs) in various Natural Language Processing (NLP) tasks. In this response, we'll break down what LangChain is, its key features, and how it can benefit LLMs.

**What is LangChain?**

LangChain is a Python library designed to facilitate the integration of language models with other tools and services used in NLP, such as data pipelines, databases, and workflow management systems. It provides a standardized interface for interacting with various components of the LLM pipeline, allowing developers to build custom workflows that optimize the performance and efficiency of their LLMs.

**Key Features**

LangChain is built around several key features that make it an attractive choice for integrating language models with other tools and services:

1. **Modular Architecture**: LangChain's modular architecture allows users to choose only the components they need, making it easy to integrate with existing workflows and pipelines.
2. **Standardized Interface**: The library provides a standardized interface for interacting with LLMs, databases, and other NLP components, ensuring consistency across different tools and services.
3. **Scalability**: LangChain is designed to handle large volumes of data and complex workflows, making it suitable for enterprise-level applications.
4. **Flexibility**: Users can easily customize the library to fit their specific use cases, incorporating new features or modifying existing ones.

**Components of LangChain**

LangChain consists of several components that work together to provide a comprehensive set of tools for integrating language models:

1. **Web3**: This component provides an interface to interact with blockchain-based databases and services.
2. **Task**: The Task component allows users to define complex workflows, including data processing, model inference, and task execution.
3. **Data**: The Data component manages the storage, retrieval, and manipulation of large datasets used by LLMs.
4. **Pipeline**: This component enables users to build and manage custom workflows that integrate with their language models.

**Benefits for LLMs**

LangChain offers several benefits when it comes to Large Language Models:

1. **Improved Efficiency**: By integrating LangChain with existing workflows and pipelines, developers can optimize the performance of their LLMs, reducing processing time and increasing throughput.
2. **Enhanced Scalability**: The library's modular architecture and scalability features enable users to handle large volumes of data and complex workflows, making it suitable for enterprise-level applications.
3. **Increased Flexibility**: LangChain's customizable nature allows users to incorporate new features or modify existing ones, ensuring the library remains relevant as LLMs evolve.

**Real-World Applications**

LangChain has numerous real-world applications across various industries:

1. **Chatbots and Conversational AI**: Integrating LangChain with LLMs can create more efficient and effective chatbots that understand context and respond accordingly.
2. **Data Annotation and Processing**: The library's data management features enable fast and accurate annotation, making it suitable for large-scale NLP projects.
3. **Content Generation**: By integrating LangChain with LLMs, developers can build content generation tools that produce high-quality text quickly and efficiently.

**Future Developments**

The development of LangChain is ongoing, with new features and updates being added regularly:

1. **Improved Workflow Management**: Upcoming releases will include enhanced workflow management capabilities, allowing users to define more complex workflows and integrate multiple components seamlessly.
2. **New Data Sources**: The library may soon support integration with new data sources, such as external databases or cloud-based storage services.

**Conclusion**

LangChain is an essential tool for developers working with Large Language Models in NLP applications. Its modular architecture, standardized interface, scalability features, and flexibility make it an attractive choice for integrating language models with other tools and services. By providing a comprehensive set of components, LangChain enables users to build custom workflows that optimize the performance and efficiency of their LLMs. As LLMs continue to evolve, LangChain's continued development will ensure it remains relevant in the NLP landscape.

### Basic LCEL

In [None]:
# Prompt
prompt = ChatPromptTemplate.from_template("Translate to French: {text}")

# LLM
llm = ChatOllama(model="llama3.2:3b")

# Output parser
parser = StrOutputParser()

# LCEL chain
chain = prompt | llm | parser

# Run
chain.invoke({"text": "I love you"})

'The translation of "I love you" in French depends on the formality and level of affection. Here are a few options:\n\n- Je t\'aime (informal, very intimate)\n- J\'adore (more formal, but still affectionate)\n- Tu m\'aimes aussi (formal, but with a more casual tone)\n- Je t\'aime beaucoup (very formal, expressing strong love)\n\nFor an even more romantic touch, you could say:\n\n- "Je t\'aime plus que tout le monde" (I love you more than anyone else)\n- "Tu es mon tout" (You are my everything)'

In [10]:
(prompt | llm).invoke({"text": "I love you"})

AIMessage(content='"Je t\'aime" (pronounced "zhuh tehm") is a more common way to say it in French, but the translation of "I love you" can also be:\n\n* "Je t\'adore" (pronounced "zhuh teh-DOHR") - a more intense and passionate expression\n* "Je t\'aime beaucoup" (pronounced "zhuh teh-mee BOH-keu") - meaning "I love you very much"\n* "Je vous aime" (pronounced "zhuh voo eh-MAY") - meaning "I love you" (with a more formal tone)\n\nHowever, the most common and widely used expression in French is indeed "Je t\'aime".', additional_kwargs={}, response_metadata={'model': 'llama3.2:3b', 'created_at': '2025-06-22T15:53:10.408138785Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 1821306680, 'load_duration': 14014614, 'prompt_eval_count': 32, 'prompt_eval_duration': 16544062, 'eval_count': 151, 'eval_duration': 1790325153}, id='run--1aed982c-5e1e-4679-8da8-43304b9ebc9e-0')
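
The two runs above differ only in the final parser step: without it the chain returns the chat model's message object, with it the chain returns a plain string. The sketch below illustrates that idea in pure Python. It is a toy model of `|` composition, not LangChain's implementation; `Step`, `FakeMessage`, and the three fake stages are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat message object with a `content` field."""
    content: str


class Step:
    """Toy runnable: wraps a function and supports `|` composition."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that feeds a's output into b
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)


# Hypothetical stages mirroring prompt | llm | parser (no model is called)
fake_prompt = Step(lambda d: f"Translate to French: {d['text']}")
fake_llm = Step(lambda p: FakeMessage(content=f"[reply to: {p}]"))
fake_parser = Step(lambda m: m.content)

without_parser = (fake_prompt | fake_llm).invoke({"text": "I love you"})
with_parser = (fake_prompt | fake_llm | fake_parser).invoke({"text": "I love you"})

print(type(without_parser).__name__)  # FakeMessage
print(with_parser)                    # [reply to: Translate to French: I love you]
```

In this toy version the parser simply reads `.content` from the message, which is the behavior visible in the two outputs above.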

#### Custom Callable

In [19]:
from langchain_core.runnables import RunnableLambda

def capitalize(inputs: dict) -> dict:
    """Upper-case the "text" field and keep the dict shape the prompt expects."""
    return {"text": inputs["text"].upper()}

custom_runnable = RunnableLambda(capitalize)

# Prompt
prompt = ChatPromptTemplate.from_template("Translate to French: {text}")

# LLM
llm = OllamaLLM(model="llama3.2:3b")

# LCEL chain: custom step, then prompt, then LLM
chain = custom_runnable | prompt | llm

Markdown(chain.invoke({"text": "omelettes are awesome"}))

The translation to French would be:

LES OMELETTES SONT INCROYABLES

### Larger chains - length 4 and 5


Save a vector index - from a known book or earnings report


In [30]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import OllamaEmbeddings

# 1. Read your .txt file
with open("docs/StockWatson.txt", "r", encoding="utf-8") as f:
    data = f.read()

# 2. Split text into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=2500*2, chunk_overlap=200)
docs = splitter.create_documents([data])

# 3. Create embeddings
embeddings = OllamaEmbeddings(model = "nomic-embed-text")

# 4. Build FAISS index
vectorstore = FAISS.from_documents(docs, embeddings)

# 5. Save the index locally
vectorstore.save_local("my_faiss_index")
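
To make the `chunk_size` and `chunk_overlap` arguments above more concrete, here is a minimal sketch of overlapping chunking. It uses a plain fixed-size sliding window over characters, which is simpler than what `RecursiveCharacterTextSplitter` does (that splitter tries to break on separators such as paragraph and line breaks first). The function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # how far the window advances each time
    # Stop once the remaining tail is already covered by the previous window
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.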

RAG - Chain of length 4

In [31]:
# 1. Retriever
vectorstore = FAISS.load_local(
    "my_faiss_index",
    OllamaEmbeddings(model="nomic-embed-text"),
    allow_dangerous_deserialization=True
)
retriever = vectorstore.as_retriever()

# 2. Prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question based on the documents below. Answer in an elaborate manner, listing all the details."),
    ("human", "Documents:\n{context}\n\nQuestion:\n{question}")
])

# 3. LLM
llm = OllamaLLM(model="qwen2.5:7b")

# 4. Parser
parser = StrOutputParser()

# 5. Bundle step: run the retriever AND echo the question
bundle = RunnableParallel({
    # the retriever expects the query string, not the whole input dict
    "docs": RunnableLambda(lambda inp: inp["question"]) | retriever,
    # pull "question" out of the input dict
    "question": RunnableLambda(lambda inp: inp["question"])
})

# 6. Assemble into exactly the inputs the prompt needs
assemble = RunnableLambda(lambda inp: {
    "context": "\n".join(d.page_content for d in inp["docs"]),
    "question": inp["question"]
})

# 7. Chain the steps together
chain = bundle | assemble | prompt | llm | parser

# 8. Invoke with a single dict containing "question"
response = chain.invoke({"question": "What are the ways of handling panel or longitudinal data? Give me a brief introduction to each one of them"})
Markdown(response)


Handling panel or longitudinal data involves several methods, each with its own advantages and appropriate contexts. Here's an overview of some common approaches:

1. **Fixed Effects Model**:
   - **Description**: This model controls for unobserved heterogeneity that is constant over time but varies across entities (e.g., individuals, firms). It does this by including entity-specific intercepts.
   - **Use Case**: Suitable when the unobserved effects are correlated with the independent variables of interest. For example, if you're analyzing the impact of a policy on different states and want to control for state-specific characteristics that don't change over time.
   - **Advantages**: It can address endogeneity issues by isolating within-entity variation.
   - **Disadvantages**: Can be less efficient when the number of entities is large compared to the number of time periods.

2. **Random Effects Model**:
   - **Description**: This model assumes that unobserved heterogeneity is not correlated with the independent variables and can vary randomly across entities over time. It includes both entity-specific intercepts (as in fixed effects) but treats them as random draws from a distribution.
   - **Use Case**: Appropriate when you believe that the omitted variables are not related to the regressors of interest, or when there's too much variation among entities and time periods for fixed effects.
   - **Advantages**: More efficient than fixed effects if the unobserved heterogeneity is truly random.
   - **Disadvantages**: If the unobserved heterogeneity is correlated with the independent variables, this model can produce biased estimates.

3. **Pooled Ordinary Least Squares (OLS)**:
   - **Description**: This method assumes that there's no unobserved heterogeneity across entities or over time. It treats all data as if they were cross-sectional.
   - **Use Case**: Suitable when the assumption of no omitted variables is likely to hold, and you have a large number of time periods relative to the number of entities.
   - **Advantages**: Simple and computationally efficient.
   - **Disadvantages**: Can be biased in the presence of unobserved heterogeneity.

4. **Randomized Controlled Trials (RCTs) or Quasi-Experiments**:
   - **Description**: These involve either randomized assignment to treatment groups or natural experiments where a policy change occurs randomly across some units.
   - **Use Case**: To estimate causal effects in the presence of unobserved heterogeneity, by comparing outcomes before and after a treatment or between treated and control groups.
   - **Advantages**: Can provide unbiased estimates if proper randomization is achieved (RCTs) or if the natural experiment meets certain assumptions.
   - **Disadvantages**: May require strong assumptions about the comparability of units.

5. **Difference-in-Differences (DiD)**:
   - **Description**: This method compares changes in outcomes over time between a treatment group and a control group to estimate the effect of an intervention.
   - **Use Case**: Often used when you have panel data and can identify a "treatment" and "control" group, where the treatment group experiences the intervention while the control group does not.
   - **Advantages**: Can handle both within- and between-entity variation.
   - **Disadvantages**: Requires that the pre-treatment trends in outcomes are similar across groups.

6. **Generalized Method of Moments (GMM)**:
   - **Description**: This is a more general approach used when there are multiple moment conditions available, allowing for estimation even if some of them are not identified.
   - **Use Case**: Often used with panel data to address issues like serial correlation and heteroskedasticity.
   - **Advantages**: Flexible and can handle complex models with many instruments.
   - **Disadvantages**: Can be computationally intensive.

7. **Dynamic Panel Data Models**:
   - **Description**: These models account for the lagged dependent variable, capturing how past values of the dependent variable affect current outcomes.
   - **Use Case**: Suitable when there is a clear temporal dynamic in the data, such as stock prices or economic growth over time.
   - **Advantages**: Captures both short-term and long-term effects.
   - **Disadvantages**: Can be challenging to estimate due to potential autocorrelation issues.

8. **Mundlak’s Random Effects Model**:
   - **Description**: This model includes the expected value of the unobserved entity-specific effect in the regression equation, allowing for a more flexible specification.
   - **Use Case**: Useful when you have some information about the factors causing unobserved heterogeneity.
   - **Advantages**: Can handle both fixed and random effects simultaneously.
   - **Disadvantages**: Requires additional data or assumptions.

9. **Panel Probit Models**:
   - **Description**: An extension of probit models to panel data, used when the dependent variable is binary.
   - **Use Case**: Suitable for analyzing discrete outcomes in a panel setting.
   - **Advantages**: Accounts for unobserved heterogeneity and time-invariant characteristics.
   - **Disadvantages**: Computationally intensive.

10. **Heterogeneous Effects Models**:
    - **Description**: These models allow for different effects of the independent variables on different entities or at different times.
    - **Use Case**: When you suspect that the relationship between variables might differ across subgroups or time periods.
    - **Advantages**: More flexible and can capture heterogeneous responses to treatments or policies.
    - **Disadvantages**: Can be complex to estimate and interpret.

These methods provide a comprehensive toolkit for analyzing panel data, allowing researchers to address various issues such as omitted variable bias, unobserved heterogeneity, and dynamic effects.
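
The bundle and assemble steps in the RAG cell above can be pictured as a fan-out followed by a reshaping step. The sketch below mimics that data flow in pure Python, with no LangChain, FAISS, or model involved. `CORPUS`, `toy_retriever`, `run_parallel`, and `assemble` are hypothetical names, and the retriever ranks by shared words rather than by vector similarity.

```python
from typing import Any, Callable, Dict, List

# Hypothetical in-memory "documents"; the notebook retrieves from a FAISS index instead
CORPUS: List[str] = [
    "Fixed effects regression controls for entity-specific constants.",
    "Omelettes are made from beaten eggs.",
    "Panel data follows the same entities over several time periods.",
]


def toy_retriever(query: str, k: int = 2) -> List[str]:
    """Return the k documents sharing the most words with the query (not vector search)."""
    words = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(words & set(doc.lower().rstrip(".").split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def run_parallel(branches: Dict[str, Callable[[Any], Any]], inp: Any) -> Dict[str, Any]:
    """Give every branch the same input and collect results by key (sequentially here)."""
    return {name: branch(inp) for name, branch in branches.items()}


def assemble(bundle: Dict[str, Any]) -> Dict[str, str]:
    """Reshape the bundle into the two variables the prompt template expects."""
    return {"context": "\n".join(bundle["docs"]), "question": bundle["question"]}


inp = {"question": "panel data fixed effects"}
bundle = run_parallel(
    {
        "docs": lambda d: toy_retriever(d["question"]),
        "question": lambda d: d["question"],
    },
    inp,
)
prompt_inputs = assemble(bundle)
print(prompt_inputs["context"])
```

Each branch receives the same input dict, so the retrieval branch has to pull out the question string itself before searching.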