### Document Loaders


#### CSV Loader

In [63]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path="../../Sessions_Part2/datasets/sns_datasets/titanic.csv")
data = loader.load()
print(data[0])

page_content='survived: 0\npclass: 3\nsex: male\nage: 22.0\nsibsp: 1\nparch: 0\nfare: 7.25\nembarked: S\nclass: Third\nwho: man\nadult_male: True\ndeck: \nembark_town: Southampton\nalive: no\nalone: False' metadata={'source': '../../Sessions_Part2/datasets/sns_datasets/titanic.csv', 'row': 0}


In [64]:
print(data[0].page_content)

survived: 0
pclass: 3
sex: male
age: 22.0
sibsp: 1
parch: 0
fare: 7.25
embarked: S
class: Third
who: man
adult_male: True
deck: 
embark_town: Southampton
alive: no
alone: False
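The output above shows each CSV row turned into one document whose text is a list of `column: value` lines, with the file path and row index kept in the metadata. The following is a minimal sketch of that mapping using only the standard library. It is not LangChain's implementation; `rows_to_documents`, the plain-dict document shape, and the shortened inline sample are assumptions made for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, source="inline.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # Render every column as "name: value", one column per line.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Shortened sample: a few columns of the first Titanic row, with an empty 'deck'.
sample = "survived,pclass,sex,age,deck\n0,3,male,22.0,\n"
docs = rows_to_documents(sample)
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

An empty cell such as `deck` still produces a line (`deck: ` with nothing after it), which matches the loader output shown above.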


#### HTML Loader

In [65]:
from langchain.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader('../../Sessions_Part2/datasets/harry_potter_html/001.htm')
data = loader.load()
data

[Document(page_content='A Day of Very Low Probability\n\nBeneath the moonlight glints a tiny fragment of silver, a fraction of a line…\n\n(black robes, falling)\n\n…blood spills out in litres, and someone screams a word.\n\nEvery inch of wall space is covered by a bookcase. Each bookcase has six shelves, going almost to the ceiling. Some bookshelves are stacked to the brim with hardback books: science, maths, history, and everything else. Other shelves have two layers of paperback science fiction, with the back layer of books propped up on old tissue boxes or lengths of wood, so that you can see the back layer of books above the books in front. And it still isn’t enough. Books are overflowing onto the tables and the sofas and making little heaps under the windows.\n\nThis is the living-room of the house occupied by the eminent Professor Michael Verres-Evans, and his wife, Mrs. Petunia Evans-Verres, and their adopted son, Harry James Potter-Evans-Verres.\n\nThere is a letter lying on th

#### Markdown Loader

In [66]:
from langchain.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader(file_path='../../Sessions_Part2/datasets/harry_potter_md/001.md')

data = loader.load()

print(data[0].page_content)

A Day of Very Low Probability

Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…

(black robes, falling)

…blood spills out in litres, and someone screams a word.

Every inch of wall space is covered by a bookcase. Each bookcase has six shelves, going almost to the ceiling. Some bookshelves are stacked to the brim with hardback books: science, maths, history, and everything else. Other shelves have two layers of paperback science fiction, with the back layer of books propped up on old tissue boxes or lengths of wood, so that you can see the back layer of books above the books in front. And it still isn’t enough. Books are overflowing onto the tables and the sofas and making little heaps under the windows.

This is the living-room of the house occupied by the eminent Professor Michael Verres-Evans, and his wife, Mrs. Petunia Evans-Verres, and their adopted son, Harry James Potter-Evans-Verres.

There is a letter lying on the living-room table, and an unstampe

#### PDF Loader

In [67]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader('../../Sessions_Part2/datasets/harry_potter_pdf/hpmor-trade-classic.pdf')

data = loader.load()

print(data[0].page_content)

Harry Potter and the Methods of Rationality


#### Wikipedia Loader

In [68]:
from langchain.document_loaders import WikipediaLoader
loader = WikipediaLoader(query='LangChain', load_max_docs=1)
data = loader.load()

data

[Document(page_content='LangChain is a framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain\'s use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n\n== History ==\nLangChain was launched in October 2022 as an open source project by Harrison Chase, while working at machine learning startup Robust Intelligence. The project quickly garnered popularity, with improvements from hundreds of contributors on GitHub, trending discussions on Twitter, lively activity on the project\'s Discord server, many YouTube tutorials, and meetups in San Francisco and London. In April 2023, LangChain had incorporated and the new startup raised over $20 million in funding at a valuation of at least $200 million from venture firm Sequoia Capital, a week after announcing a $10 million seed investment from Benchmark.In Octobe

In [69]:
data[0].metadata

{'title': 'LangChain',
 'summary': "LangChain is a framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n",
 'source': 'https://en.wikipedia.org/wiki/LangChain'}

#### arXiv Loader

In [70]:
from langchain_community.document_loaders import ArxivLoader

loader = ArxivLoader(query='1706.03762', load_max_docs=1) 

data = loader.load()

data

[Document(page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, d

In [71]:
data[0].metadata

{'Published': '2023-08-02',
 'Title': 'Attention Is All You Need',
 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin',
 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, 

### Creating a bot that can answer questions based on Wikipedia articles

In [72]:
import os
from langchain_openai import ChatOpenAI
from langchain.globals import set_llm_cache
from langchain.cache import InMemoryCache
from langchain.prompts import HumanMessagePromptTemplate, ChatPromptTemplate
from langchain.document_loaders import WikipediaLoader

# Read the API key from a local file; strip() drops any trailing newline.
with open('../../openai_api_key.txt') as f:
    api_key = f.read().strip()
os.environ['OPENAI_API_KEY'] = api_key

chat = ChatOpenAI()
# Cache responses in memory so identical calls are not sent to the API twice.
set_llm_cache(InMemoryCache())
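
`set_llm_cache(InMemoryCache())` keeps model responses in process memory, so repeating an identical request can be served without another API call. The sketch below illustrates the general idea with a plain dictionary. It is not how LangChain implements its cache; `TinyPromptCache` and `fake_model` are hypothetical names used only for this illustration.

```python
class TinyPromptCache:
    """Remember the answer for each prompt string seen so far."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def lookup_or_call(self, prompt, call_model):
        # Serve a stored answer when the exact same prompt was seen before.
        if prompt in self._store:
            self.hits += 1
            return self._store[prompt]
        answer = call_model(prompt)
        self._store[prompt] = answer
        return answer


calls = []


def fake_model(prompt):
    """Stand-in for a real model call; records each invocation."""
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = TinyPromptCache()
first = cache.lookup_or_call("When was AI founded?", fake_model)
second = cache.lookup_or_call("When was AI founded?", fake_model)
print(len(calls), cache.hits)
```

The second lookup never reaches `fake_model`, which is the saving an in-memory cache aims for when a notebook cell is re-run with the same prompt.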

In [73]:
def qa_bot(topic, question):
    """Answer a question using the summary of the top Wikipedia hit for `topic`."""
    chat = ChatOpenAI()

    # Fetch only the best-matching article.
    loader = WikipediaLoader(query=topic, load_max_docs=1)
    data = loader.load()

    document = data[0]
    title = document.metadata['title']
    summary = document.metadata['summary']

    human_template = (
        "Read the Wikipedia article on {title}, which has the following content:\n"
        "{content}\n\n"
        "Then answer this question: {question}"
    )
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
    chat_prompt = ChatPromptTemplate.from_messages([human_message_prompt])
    prompt = chat_prompt.format_prompt(title=title, content=summary, question=question)

    # invoke() replaces calling the model object directly, which is deprecated.
    response = chat.invoke(prompt.to_messages())
    return response.content

In [74]:
qa_bot('Artificial Intelligence', 'When was AI born?')

'AI was founded as an academic discipline in 1956.'
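
Note that `qa_bot` passes only the article's `summary` metadata to the model, not the full `page_content`, so answers are limited to what the summary covers. The sketch below shows, without any network or API call, roughly what text reaches the model and adds a guard for the case where the search returns no article. `build_prompt`, the plain-dict document shape, the template wording, and the sample question are assumptions for illustration; the title and summary sentence are taken from the Wikipedia loader output earlier in this section.

```python
PROMPT_TEMPLATE = (
    "Read the Wikipedia article on {title}, which has the following content:\n"
    "{content}\n\n"
    "Then answer this question: {question}"
)


def build_prompt(documents, question):
    """Fill the template from the first document's title and summary."""
    if not documents:
        # An unguarded data[0] would raise IndexError here instead.
        raise ValueError("no article found for this topic")
    metadata = documents[0]["metadata"]
    return PROMPT_TEMPLATE.format(
        title=metadata["title"],
        content=metadata["summary"],
        question=question,
    )


sample_docs = [
    {
        "metadata": {
            "title": "LangChain",
            "summary": (
                "LangChain is a framework designed to simplify the creation "
                "of applications using large language models (LLMs)."
            ),
        }
    }
]
prompt_text = build_prompt(sample_docs, "What is LangChain?")
print(prompt_text)
```

Printing the filled prompt before sending it is a cheap way to check that the topic lookup returned the intended article.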