# Hello, World in LangChain

## Install the main libraries and configure the OpenAI API key

In [3]:
import os

OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
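
Indexing `os.environ` directly raises a bare `KeyError` when the variable is not set. A minimal sketch of a friendlier lookup follows; the helper name `get_api_key` and its error message are hypothetical and not part of the notebook.

```python
import os


def get_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or fail with a clear message.

    `get_api_key` is a hypothetical helper used only for illustration.
    """
    # Allow a plain dict to be passed in so the function is easy to test.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it in your shell before starting the notebook."
        )
    return key
```

In the notebook this could replace the direct lookup with `OPENAI_API_KEY = get_api_key()`.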

## Loading documents

In [5]:
# Make the HTTP calls that download the papers
import requests
# Make each document readable for the model
from langchain.document_loaders import PyPDFLoader

# The OpenAI model does not have this information

urls = [
    'https://arxiv.org/pdf/2306.06031v1.pdf',
    'https://arxiv.org/pdf/2306.12156v1.pdf',
    'https://arxiv.org/pdf/2306.14289v1.pdf',
    'https://arxiv.org/pdf/2305.10973v1.pdf',
    'https://arxiv.org/pdf/2306.13643v1.pdf'
]

ml_papers = []
for i, url in enumerate(urls):
    response = requests.get(url)
    response.raise_for_status()  # stop early if a download fails
    filename = f'paper{i+1}.pdf'
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f'Downloaded {filename}')

    # Load the PDF only after the file has been closed and flushed to disk
    loader = PyPDFLoader(filename)
    data = loader.load()
    ml_papers.extend(data)

# Use the ml_papers list to access the elements of all the downloaded documents
print('Contents of ml_papers:')
print()

Downloaded paper1.pdf
Downloaded paper2.pdf
Downloaded paper3.pdf
Downloaded paper4.pdf
Downloaded paper5.pdf
Contents of ml_papers:



In [6]:
type(ml_papers), len(ml_papers), ml_papers[3]

(list,
 57,
 Document(page_content='Figure 1: FinGPT Framework.\n4.1 Data Sources\nThe first stage of the FinGPT pipeline involves the collec-\ntion of extensive financial data from a wide array of online\nsources. These include, but are not limited to:\n•Financial news: Websites such as Reuters, CNBC, Yahoo\nFinance, among others, are rich sources of financial news\nand market updates. These sites provide valuable informa-\ntion on market trends, company earnings, macroeconomic\nindicators, and other financial events.\n•Social media : Platforms such as Twitter, Facebook, Red-\ndit, Weibo, and others, offer a wealth of information in\nterms of public sentiment, trending topics, and immediate\nreactions to financial news and events.\n•Filings : Websites of financial regulatory authorities, such\nas the SEC in the United States, offer access to company\nfilings. These filings include annual reports, quarterly earn-\nings, insider trading reports, and other important company-\nspecific in

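I merged all the loaded papers into a single list to see how the papers and the loader output are structured.
`type` shows that it is a list.
Its length (`len`) is 57; `PyPDFLoader` yields one `Document` per PDF page.
I take the fourth element (index 3) and look at what information it holds.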
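
As a small sketch of this kind of inspection, the snippet below counts how many page-level documents came from each source file. It assumes each element exposes a `metadata` dict with a `source` key, as the `Document` output in this notebook does; the helper name `pages_per_source` and the sample objects are hypothetical stand-ins.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs):
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata["source"] for doc in docs)


# Stand-in objects that mimic the `page_content` / `metadata` attributes
# seen in the notebook output; the texts here are made up.
sample_docs = [
    SimpleNamespace(page_content="page one", metadata={"source": "paper1.pdf", "page": 0}),
    SimpleNamespace(page_content="page two", metadata={"source": "paper1.pdf", "page": 1}),
    SimpleNamespace(page_content="page one", metadata={"source": "paper2.pdf", "page": 0}),
]

counts = pages_per_source(sample_docs)
print(dict(counts))
```

With the real list, `pages_per_source(ml_papers)` should show how the 57 pages are distributed across the five PDFs.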

## Splitting documents

In [7]:
# Before turning the documents into embeddings, the text must be split
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,      # maximum characters per chunk
    chunk_overlap=200,    # characters shared between consecutive chunks
    length_function=len
)

documents = text_splitter.split_documents(ml_papers)

In [8]:
len(documents), documents[3]

(211,
 Document(page_content='ing untapped potentials in open finance.\n•Data-centric approach : Recognizing the significance of\ndata curation, FinGPT adopts a data-centric approach and\nimplements rigorous cleaning and preprocessing methods\nfor handling varied data formats and types, thereby ensur-\ning high-quality data.\n•End-to-end framework : FinGPT embraces a full-stack\nframework for FinLLMs with four layers:\n–Data source layer : This layer assures comprehensive\nmarket coverage, addressing the temporal sensitivityarXiv:2306.06031v1  [q-fin.ST]  9 Jun 2023', metadata={'source': 'paper1.pdf', 'page': 0}))

To convert the text into embeddings, the documents must first be made smaller, because the model only accepts a limited number of tokens. I cannot pass in whole documents, so I split them into smaller ones. Each chunk is then converted into numbers (a vector).

Now I can feed the chunks to the embeddings model and store the results in a vector database with Chroma.
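
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and lines); the function `split_with_overlap` is a hypothetical illustration of the overlap idea only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size character windows that share an overlap.

    Simplified illustration: real splitters also respect separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1).
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.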

## Embeddings and ingestion into a vector database

In [14]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings
)

retriever = vectorstore.as_retriever(
    search_kwargs={"k":3}
)
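
The `k` in `search_kwargs` controls how many chunks the retriever returns per query. The toy sketch below shows the general idea of a top-k similarity search over hand-written vectors. Cosine similarity is an assumption made for illustration; the metric Chroma actually uses depends on how the collection is configured, and `top_k` and `toy_index` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries most similar to `query_vec`."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up two-dimensional "embeddings"; real ones have many more dimensions.
toy_index = [
    ("chunk about finance", [1.0, 0.0]),
    ("chunk about images", [0.0, 1.0]),
    ("chunk about markets", [0.9, 0.1]),
    ("chunk about audio", [-1.0, 0.0]),
]

print(top_k([1.0, 0.0], toy_index, k=3))
```

A larger `k` gives the chat model more context but also a longer prompt.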

## Chat models and chains for querying information

In [15]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

chat = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name="gpt-3.5-turbo",
    temperature=0.0
)

qa_chain = RetrievalQA.from_chain_type(
    llm=chat,
    chain_type='stuff',
    retriever=retriever
)
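
The `'stuff'` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea with plain strings. The prompt wording is invented for illustration and is not LangChain's actual template, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt for the chat model.

    The template text is an assumption, not LangChain's own prompt.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "what is fingpt?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks plus the question fit in the model's context window, which is one reason `k` is kept small here.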

In [16]:
query = 'what is fingpt?'
qa_chain.run(query)

'FinGPT is a model that focuses on open finance and adopts a data-centric approach with rigorous cleaning and preprocessing methods to ensure high-quality data. It embraces a full-stack framework for FinLLMs with four layers, including a data source layer for comprehensive market coverage. Additionally, FinGPT offers hands-on tutorials and demo applications for financial tasks like robo-advisory services, quantitative trading, and low-code development, showcasing the practical applicability of LLMs in finance.'