<a href="https://colab.research.google.com/github/ibran-el/go-colab/blob/main/uni_eden_ai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

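The code below illustrates how to use LangChain to make a large language model answer questions about specific (custom) data from a document, PDF, text file, or Markdown file. Strictly speaking this is retrieval over the documents rather than fine-tuning: the model's weights are not changed, and the relevant chunks of text are looked up and passed to the model along with each question. In this case we are using Eden AI, but you can use OpenAI or many other providers supported by LangChain. Some lines of code vary depending on the LLM provider, so consider checking LangChain's documentation.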

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install langchain langchain-community

In [None]:
!pip install PyPDF2

In [None]:
!pip install python-dotenv

In [None]:
!pip install Flask Streamlit

In [None]:
# install only one FAISS build; swap in faiss-gpu if you are on a GPU runtime
!pip install faiss-cpu

In [None]:
!pip install python-docx

IMPORTANT IMPORTS

In [44]:
from google.colab import userdata
import docx
from dotenv import load_dotenv
from PyPDF2 import PdfReader

# the imports below are LangChain-specific
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
# from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains.question_answering import load_qa_chain

# the imports below are Eden AI related
from langchain_community.llms import EdenAI
from langchain_community.embeddings.edenai import EdenAiEmbeddings


import os

os.environ['EDENAI_API_KEY'] = userdata.get('EDEN_KEY')
load_dotenv()

FILE READING FUNCTIONS

In [45]:
# file-reading functions
def for_pdf(dir_path):
    # read every page of a PDF and concatenate the extracted text
    with open(dir_path, 'rb') as pfile:
        pdf_r = PdfReader(pfile)
        text = ""
        for page in pdf_r.pages:
            # extract_text() can yield nothing for image-only pages
            text += page.extract_text() or ""
    return text

def for_doc(dir_path):
    # python-docx opens the file itself, so no separate open() is needed
    doc_r = docx.Document(dir_path)
    text = ""
    for par in doc_r.paragraphs:
        text += par.text + "\n"
    return text

def for_text(dir_path):
    with open(dir_path, 'r') as file:
        text = file.read()
    return text  # without this the function returns None and readFilez fails


# general document-reading function: concatenates the text of every supported file in a directory
def readFilez(directory):
    combined_txt = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith('.txt'):
            combined_txt += for_text(file_path)
        elif filename.endswith('.docx'):
            combined_txt += for_doc(file_path)
        elif filename.endswith('.pdf'):
            combined_txt += for_pdf(file_path)
    return combined_txt

MAIN WORKFLOW

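Divide the text into chunks that will be processed one at a time, since we cannot load the whole document at once. Consecutive chunks overlap slightly so that a sentence cut at a chunk boundary still appears intact in one of them.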

In [46]:
data_dir = '/content/drive/MyDrive/edenAI/edentrain/data/'
text = readFilez(data_dir)

# split text into chunks for easy processing and memory management
char_txt_splitter = CharacterTextSplitter(
    separator='\n', chunk_size=1000, chunk_overlap=200, length_function=len)

text_chunks = char_txt_splitter.split_text(text)
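
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding character window. The helper `chunk_with_overlap` is hypothetical and written only for illustration; it is not how `CharacterTextSplitter` works internally (the real splitter first splits on the separator and then merges the pieces), but the overlap idea is the same.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Toy character-window chunker (hypothetical helper, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(toy_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, which is what the overlap of 200 characters does at a larger scale in the cell above.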


Create embedding vectors for the chunks and index them in a FAISS vector store

In [47]:
embeddings = EdenAiEmbeddings(provider='openai')

docsearch = FAISS.from_texts(text_chunks,embeddings)
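
As a rough picture of what a vector store does with those embeddings, the sketch below ranks a few hand-made vectors against a query vector. Everything in it is an assumption for illustration: the three-dimensional vectors, the texts, and the helpers `cosine_similarity` and `top_k` are made up, and cosine similarity is used for simplicity, whereas FAISS may rank with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# toy "index": (text, embedding) pairs with made-up 3-dimensional embeddings
toy_index = [
    ("course codes", [0.9, 0.1, 0.0]),
    ("campus map", [0.1, 0.9, 0.0]),
    ("admission fees", [0.0, 0.2, 0.9]),
]
toy_query_vec = [0.8, 0.2, 0.1]
print(top_k(toy_query_vec, toy_index, k=2))
# ['course codes', 'campus map']
```

Real embeddings have hundreds or thousands of dimensions, but the principle is the same: the query is embedded and the nearest chunks are returned.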

Load an LLM and build a question-answering chain

In [59]:
llm = EdenAI(edenai_api_key=os.getenv("EDENAI_API_KEY"), provider="openai", temperature=0.3, max_tokens=250)
chain = load_qa_chain(llm, chain_type='stuff')
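
The `'stuff'` chain type places all retrieved chunks into a single prompt and sends that to the model in one call. The sketch below shows the idea only; `build_stuff_prompt` and its prompt wording are hypothetical, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block (hypothetical template)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_prompt = build_stuff_prompt(
    ["Chunk one.", "Chunk two."],
    "What is in the chunks?",
)
print(toy_prompt)
```

Because every chunk goes into one prompt, this approach only works while the retrieved chunks plus the question fit within the model's context window.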

Query the vector store for the chunks most similar to the question, then pass them to the chain to generate the response

In [61]:
query = """what does sz003 mean"""
docs = docsearch.similarity_search(query)

response = chain.run(input_documents=docs, question=query)

print(" ")
print(query)
print(response)

 
what does sz003 mean
 SZ003 refers to the code for the Bachelor of IT Application & Management program at the State University of Zanzibar (SUZA) in Zanzibar.
