# Installing packages

- langchain
- openai
- tqdm: library to show the progress of an action (downloading, training, ...) 
- jq: lightweight and flexible JSON processor
- unstructured: A library that prepares raw documents for downstream ML tasks
- pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
- tiktoken: a fast open-source tokenizer by OpenAI.

In [22]:
# %pip install langchain openai tqdm unstructured pypdf tiktoken
# %pip jq

# Loading documents

In [1]:
from langchain.docstore.document import Document
import pandas as pd
# https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data?select=daily_weather.parquet
df =  pd.read_parquet('data/daily_weather.parquet', engine='pyarrow')
 
documents = []
for i, row in df.head().iterrows():
    string = " ".join([
         "{}: {}".format(k, v) 
         for k, v in row.to_dict().items()
    ])
    doc = Document(
        page_content=string,
        meta_data={'row': i}
    )
    documents.append(doc)

In [2]:
documents

[Document(page_content='station_id: 41515 city_name: Asadabad date: 1957-07-01 00:00:00 season: Summer avg_temp_c: 27.0 min_temp_c: 21.1 max_temp_c: 35.6 precipitation_mm: 0.0 snow_depth_mm: nan avg_wind_dir_deg: nan avg_wind_speed_kmh: nan peak_wind_gust_kmh: nan avg_sea_level_pres_hpa: nan sunshine_total_min: nan'),
 Document(page_content='station_id: 41515 city_name: Asadabad date: 1957-07-02 00:00:00 season: Summer avg_temp_c: 22.8 min_temp_c: 18.9 max_temp_c: 32.2 precipitation_mm: 0.0 snow_depth_mm: nan avg_wind_dir_deg: nan avg_wind_speed_kmh: nan peak_wind_gust_kmh: nan avg_sea_level_pres_hpa: nan sunshine_total_min: nan'),
 Document(page_content='station_id: 41515 city_name: Asadabad date: 1957-07-03 00:00:00 season: Summer avg_temp_c: 24.3 min_temp_c: 16.7 max_temp_c: 35.6 precipitation_mm: 1.0 snow_depth_mm: nan avg_wind_dir_deg: nan avg_wind_speed_kmh: nan peak_wind_gust_kmh: nan avg_sea_level_pres_hpa: nan sunshine_total_min: nan'),
 Document(page_content='station_id: 4151

In [3]:
documents[0]

Document(page_content='station_id: 41515 city_name: Asadabad date: 1957-07-01 00:00:00 season: Summer avg_temp_c: 27.0 min_temp_c: 21.1 max_temp_c: 35.6 precipitation_mm: 0.0 snow_depth_mm: nan avg_wind_dir_deg: nan avg_wind_speed_kmh: nan peak_wind_gust_kmh: nan avg_sea_level_pres_hpa: nan sunshine_total_min: nan')

In [4]:
from langchain.document_loaders import (
    UnstructuredCSVLoader,
    UnstructuredHTMLLoader,
    UnstructuredImageLoader,
    PythonLoader,
    PyPDFLoader,
    JSONLoader
)

In [5]:
# https://hastie.su.domains/Papers/ESLII.pdf
file_path = 'mixed_data/element_of_SL.pdf'

sl_loader = PyPDFLoader(file_path=file_path)
sl_data = sl_loader.load_and_split()

In [6]:
sl_data[0]

Document(page_content='Springer Series in Statistics\nTrevor Hastie\nRobert TibshiraniJerome FriedmanSpringer Series in Statistics\nThe Elements of\nStatistical Learning\nData Mining, Inference, and Prediction\nThe Elements of Statistical LearningDuring the past decade there has been an explosion in computation and information tech-\nnology. With it have come vast amounts of data in a variety of fields such as medicine, biolo-gy, finance, and marketing. The challenge of understanding these data has led to the devel-opment of new tools in the field of statistics, and spawned new areas such as data mining,machine learning, and bioinformatics. Many of these tools have common underpinnings butare often expressed with different terminology. This book describes the important ideas inthese areas in a common conceptual framework. While the approach is statistical, theemphasis is on concepts rather than mathematics. Many examples are given, with a liberaluse of color graphics. It should be a va

In [7]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter
)

# split on "\n\n"
splitter1 = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
)

# split ["\n\n", "\n", " ", ""]
splitter2 = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
)


sl_data1 = sl_loader.load_and_split(text_splitter=splitter1)
sl_data2 = sl_loader.load_and_split(text_splitter=splitter2)

In [8]:
len(sl_data1[600].page_content)

2173

In [9]:
len(sl_data2[1].page_content)

999

In [10]:
from langchain.document_loaders import DirectoryLoader

folder_path = 'mixed_data/'

mixed_loader = DirectoryLoader(
    path=folder_path,
    use_multithreading=True,
    show_progress=True
)

mixed_data = mixed_loader.load_and_split()

100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.24s/it]


In [11]:
mixed_data

[Document(page_content='DEEP LEARNING\n\nDS-GA 1008 · SPRING 2021 · NYU CENTER FOR DATA SCIENCE\n\nINSTRUCTORS Yann LeCun & Alfredo Canziani LECTURES Wednesday 9:30 – 11:30, Zoom PRACTICA Tuesdays 9:30 – 10:30, Zoom FORUM r/NYU_DeepLearning DISCORD NYU DL MATERIAL 2021 repo\n\n2021 edition disclaimer\n\nCheck the repo’s README.md and learn about:\n\nContent new organisation\n\nThe semester’s second half intellectual dilemma\n\nThis semester repository\n\nPrevious releases\n\nLectures\n\nMost of the lectures, labs, and notebooks are similar to the previous edition, nevertheless, some are brand new.\nI will try to make clear which is which.\n\nLegend: 🖥 slides, 📝 notes, 📓 Jupyter notebook, 🎥 YouTube video.\n\nTheme 1: Introduction\n\nHistory and resources 🎥 🖥\n\nGradient descent and the backpropagation algorithm 🎥 🖥\n\nNeural nets inference 🎥 📓\n\nModules and architectures 🎥 🖥\n\nNeural nets training 🎥 🖥 📓 📓\n\nHomework 1: backprop\n\nTheme 2: Parameters sharing\n\nRecurrent and convolut

# Summarizing

## The "stuff" chain

In [12]:
import os

os.environ['OPENAI_API_KEY'] = ''

In [13]:
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

chain = load_summarize_chain(
    llm=llm,
    chain_type='stuff'
)

chain.run(sl_data[:20])

'The "Elements of Statistical Learning" is a book that explores the field of statistical learning, focusing on concepts rather than mathematics. It covers topics such as data mining, machine learning, and bioinformatics, and includes a range of methods from supervised to unsupervised learning. The book is written by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, who are professors of statistics at Stanford University. The second edition includes new topics such as graphical models, random forests, and ensemble methods, making it a valuable resource for statisticians and those interested in data mining in science or industry.'

In [14]:
chain.run(documents[:5])

'The summary includes weather data for Asadabad from July 1st to July 5th, 1957 during the Summer season. The average temperatures ranged from 22.8°C to 30.8°C, with the highest temperature reaching 41.7°C on July 5th. There was minimal precipitation on July 1st and July 5th, with a higher amount of 4.1mm on July 4th. Other weather parameters such as wind direction, speed, gust, sea level pressure, snow depth, and sunshine duration were not recorded.'

## Custom prompt

In [15]:
print(chain.llm_chain.prompt.template)

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:


In [16]:
from langchain.prompts import PromptTemplate

template = """
Write a concise summary of the following in spanish:

"{text}"

CONCISE SUMMARY IN SPANISH:
"""

prompt = PromptTemplate.from_template(template)

chain = load_summarize_chain(
    llm=llm,
    prompt=prompt   
)

chain.run(sl_data[:2])

'El libro "The Elements of Statistical Learning" es una obra que aborda el tema de la minería de datos, inferencia y predicción en el contexto del crecimiento de la tecnología y la abundancia de datos en diversos campos. Escrito por Trevor Hastie, Robert Tibshirani y Jerome Friedman, abarca temas como aprendizaje supervisado y no supervisado, redes neuronales, máquinas de vectores de soporte, entre otros. La segunda edición incluye nuevos temas como modelos gráficos, bosques aleatorios y métodos de conjunto. Los autores son reconocidos investigadores en el área de la estadística en la Universidad de Stanford.'

## The Map-reduce chain

In [17]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    verbose=True
)

chain.run(sl_data[:20])



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"Springer Series in Statistics
Trevor Hastie
Robert TibshiraniJerome FriedmanSpringer Series in Statistics
The Elements of
Statistical Learning
Data Mining, Inference, and Prediction
The Elements of Statistical LearningDuring the past decade there has been an explosion in computation and information tech-
nology. With it have come vast amounts of data in a variety of fields such as medicine, biolo-gy, finance, and marketing. The challenge of understanding these data has led to the devel-opment of new tools in the field of statistics, and spawned new areas such as data mining,machine learning, and bioinformatics. Many of these tools have common underpinnings butare often expressed with different terminology. This book describes the important ideas inthese areas in a common conceptual framework. While the ap

'"The book "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman explores statistical concepts in data mining, machine learning, and bioinformatics. The second edition includes new topics and updates to improve accessibility, with a focus on supervised and unsupervised learning methods. The preface discusses the organization, changes, and reasons for updating the book, while the text covers topics such as linear methods, kernel smoothing, model assessment, boosting, neural networks, support vector machines, random forests, ensemble learning, and undirected graphical models. The book emphasizes learning from data to predict outcomes in various fields."'

## Custom prompt

In [18]:
print(chain.llm_chain.prompt.template)

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:


In [19]:
print(chain.combine_document_chain.llm_chain.prompt.template)

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:


In [20]:
map_template = """The following is a set of documents

{text}

Based on this list of docs, please identify the main themes 
Helpful Answer:"""

combine_template = """The following is a set of summaries:

{text}

Take these and distill it into a final, consolidated list of the main themes. 
Return that list as a comma separated list. 
Helpful Answer:"""


map_prompt = PromptTemplate.from_template(map_template)
combine_prompt = PromptTemplate.from_template(combine_template)

chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    verbose=True
)

chain.run(sl_data[:20])



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mThe following is a set of documents

Springer Series in Statistics
Trevor Hastie
Robert TibshiraniJerome FriedmanSpringer Series in Statistics
The Elements of
Statistical Learning
Data Mining, Inference, and Prediction
The Elements of Statistical LearningDuring the past decade there has been an explosion in computation and information tech-
nology. With it have come vast amounts of data in a variety of fields such as medicine, biolo-gy, finance, and marketing. The challenge of understanding these data has led to the devel-opment of new tools in the field of statistics, and spawned new areas such as data mining,machine learning, and bioinformatics. Many of these tools have common underpinnings butare often expressed with different terminology. This book describes the important ideas inthese areas in a common conceptual framework. While the approach i

'Data mining, statistical learning, machine learning, bioinformatics, supervised learning, unsupervised learning, neural networks, support vector machines, classification trees, boosting, graphical models, random forests, ensemble methods, Lasso, non-negative matrix factorization, spectral clustering, methods for "wide" data, multiple testing, false discovery rates, linear methods, model assessment and selection, high-dimensional problems, kernel smoothing methods, model inference and averaging, estimation methods, model selection and bias-variance tradeoff, specific algorithms, computational considerations, prototype methods, neural networks, support vector machines, flexible discriminants, boosting, random forests, ensemble learning, undirected graphical models, high-dimensional problems, linear classifiers, regularization techniques, high-dimensional regression, feature selection, classification without available features, feature assessment, multiple-testing problem, statistical le

In [21]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='map_reduce',
    verbose=True
)

chain.run(documents[:200])



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"station_id: 41515 city_name: Asadabad date: 1957-07-01 00:00:00 season: Summer avg_temp_c: 27.0 min_temp_c: 21.1 max_temp_c: 35.6 precipitation_mm: 0.0 snow_depth_mm: nan avg_wind_dir_deg: nan avg_wind_speed_kmh: nan peak_wind_gust_kmh: nan avg_sea_level_pres_hpa: nan sunshine_total_min: nan"


CONCISE SUMMARY:[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"station_id: 41515 city_name: Asadabad date: 1957-07-02 00:00:00 season: Summer avg_temp_c: 22.8 min_temp_c: 18.9 max_temp_c: 32.2 precipitation_mm: 0.0 snow_depth_mm: nan avg_wind_dir_deg: nan avg_wind_speed_kmh: nan peak_wind_gust_kmh: nan avg_sea_level_pres_hpa: nan sunshine_total_min: nan"


CONCISE SUMMARY:[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"station_id: 4151

'In Asadabad, station_id 41515 recorded summer weather data from July 1 to July 5, 1957. Temperatures ranged from 22.8°C to 30.8°C, with precipitation on July 3 and July 4. Other weather details such as wind speed, sea level pressure, and sunshine duration were not available for any of the days.'

## The Refine chain

In [22]:
chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    verbose=True
)

chain.run(sl_data[:20])



[1m> Entering new RefineDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"Springer Series in Statistics
Trevor Hastie
Robert TibshiraniJerome FriedmanSpringer Series in Statistics
The Elements of
Statistical Learning
Data Mining, Inference, and Prediction
The Elements of Statistical LearningDuring the past decade there has been an explosion in computation and information tech-
nology. With it have come vast amounts of data in a variety of fields such as medicine, biolo-gy, finance, and marketing. The challenge of understanding these data has led to the devel-opment of new tools in the field of statistics, and spawned new areas such as data mining,machine learning, and bioinformatics. Many of these tools have common underpinnings butare often expressed with different terminology. This book describes the important ideas inthese areas in a common conceptual framework. While the appro

'The existing summary of "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman provides a comprehensive overview of important concepts in data mining, inference, and prediction, focusing on statistics, machine learning, and bioinformatics. The second edition includes new topics like graphical models, random forests, ensemble methods, and methods for "wide" data. The authors, prominent researchers in statistics, updated the second edition to reflect the rapid pace of research in statistical learning. The book explores challenges in learning from data, covering topics such as linear methods for regression and classification, boosting fits, forward stagewise additive modeling, exponential loss, AdaBoost, loss functions, robustness, off-the-shelf procedures for data mining, boosting trees, numerical optimization via gradient boosting, right-sized trees for boosting, regularization, interpretation, and illustrations of various datasets. Additionally


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYour job is to produce a final summary
We have provided an existing summary up to a certain point: The book covers a comprehensive range of topics in machine learning, including supervised learning, regression, classification, kernel smoothing, model assessment and selection, additive models, trees, boosting, neural networks, support vector machines, prototype methods, unsupervised learning, random forests, ensemble learning, undirected graphical models, high-dimensional problems, and more. The second edition also includes chapters on undirected graphical models, learning in high-dimensional feature spaces, support vector machines, flexible discriminants, prototype methods, and nearest-neighbors. The authors have made improvements to address colorblind readers and clarify concepts related to error-rate estimation. The book also covers variable types, statistical decision theory, 

'The existing summary is already comprehensive and does not require any refinement.'

## Custom prompt

In [23]:
print(chain.initial_llm_chain.prompt.template)

Write a concise summary of the following:


"{text}"


CONCISE SUMMARY:


In [24]:
print(chain.refine_llm_chain.prompt.template)

Your job is to produce a final summary.
We have provided an existing summary up to a certain point: {existing_answer}
We have the opportunity to refine the existing summary (only if needed) with some more context below.
------------
{text}
------------
Given the new context, refine the original summary.
If the context isn't useful, return the original summary.


In [25]:
initial_template = """
Extract the most relevant themes from the following:


"{text}"


THEMES:"""

refine_template = """
Your job is to extract the most relevant themes
We have provided an existing list of themes up to a certain point: {existing_answer}
We have the opportunity to refine the existing list(only if needed) with some more context below.
------------
{text}
------------
Given the new context, refine the original list
If the context isn't useful, return the original list and ONLY the original list.
Return that list as a comma separated list.

LIST:"""

initial_prompt = PromptTemplate.from_template(initial_template)
refine_prompt = PromptTemplate.from_template(refine_template)

chain = load_summarize_chain(
    llm=llm,
    chain_type='refine',
    question_prompt=initial_prompt,
    refine_prompt=refine_prompt,
    verbose=True
)

chain.run(sl_data[:20])



[1m> Entering new RefineDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3m
Extract the most relevant themes from the following:


"Springer Series in Statistics
Trevor Hastie
Robert TibshiraniJerome FriedmanSpringer Series in Statistics
The Elements of
Statistical Learning
Data Mining, Inference, and Prediction
The Elements of Statistical LearningDuring the past decade there has been an explosion in computation and information tech-
nology. With it have come vast amounts of data in a variety of fields such as medicine, biolo-gy, finance, and marketing. The challenge of understanding these data has led to the devel-opment of new tools in the field of statistics, and spawned new areas such as data mining,machine learning, and bioinformatics. Many of these tools have common underpinnings butare often expressed with different terminology. This book describes the important ideas inthese areas in a common conceptual framework. Whi

'Boosting and Additive Trees, Neural Networks, Projection Pursuit Regression, Fitting Neural Networks, Some Issues in Training Neural Networks, Example: Simulated Data, Example: ZIP Code Data, Bayesian Neural Nets and the NIPS 2003 Challenge, Computational Considerations, Bayes, Boosting and Bagging, Performance Comparisons, Random Forests, Ensemble Learning, Undirected Graphical Models, High-Dimensional Problems: p≫N, Diagonal Linear Discriminant Analysis, Nearest Shrunken Centroids, Linear Classifiers with Quadratic Regularization, Regularized Discriminant Analysis, Logistic Regression with Quadratic Regularization, The Support Vector Classifier, Feature Selection, Computational Shortcuts When p≫N, Linear Classifiers with L1 Regularization, Application of Lasso to Protein Mass Spectroscopy, The Fused Lasso for Functional Data, Classification When Features are Unavailable, Example: String Kernels and Protein Classification, Classification and Other Models Using Inner-Product Kernels a

'Explosion in computation and information technology, Vast amounts of data in various fields, Tools and techniques in statistics, data mining, machine learning, and bioinformatics, Common conceptual framework for understanding these areas, Emphasis on concepts rather than mathematics, Examples and color graphics provided, Broad coverage, including supervised and unsupervised learning, Introduction of new topics in the second edition, Prominent researchers and authors in the field, Statistical modeling software and environment (R/S-PLUS), Introduction of various data mining tools and techniques'

In [26]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

model = ChatOpenAI()

prompt = ChatPromptTemplate.from_template("tell me a joke about {foo}")

chain = prompt | model

In [27]:
chain.invoke({'foo':'test'})

AIMessage(content='Why did the math book look sad during the test? \nBecause it had too many problems!')