
LangChain provides many document loaders. This notebook focuses on three:
1. WebBaseLoader - loads content from a web page
2. PyPDFLoader - loads content from a PDF file
3. UnstructuredExcelLoader - loads content from an Excel file

A document loader converts data from different sources and formats into Document objects. Each Document exposes the extracted text as Document.page_content and information about its origin as Document.metadata.
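
As a rough illustration of that shape, the sketch below uses a hypothetical stand-in class, `SimpleDocument`, rather than LangChain's own `Document`. The class and the helper `load_text_as_document` are assumptions made for this example only; they simply mirror the two fields named above.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields of a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_as_document(text: str, source: str) -> SimpleDocument:
    """Wrap raw text and record where it came from, as a loader would."""
    return SimpleDocument(page_content=text, metadata={"source": source})


doc = load_text_as_document("Hello, loaders!", source="example.txt")
print(doc.page_content)
print(doc.metadata)
```

The real loaders used in this notebook return a list of such objects, which is why the cells access `raw_documents[0]` or `pages[39]`.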

**Requirements.txt**

In [1]:
!touch requirements.txt
!echo langchain >> requirements.txt
!echo langchain-openai >> requirements.txt
!echo langchain-google-genai >> requirements.txt
!echo langchain_experimental >> requirements.txt
!echo openai >> requirements.txt
!echo anthropic >> requirements.txt
!echo cohere >> requirements.txt
!echo streamlit >> requirements.txt
!echo python-dotenv >> requirements.txt
!echo beautifulsoup4 >> requirements.txt
!echo faiss-cpu >> requirements.txt
!echo pypdf >> requirements.txt
!echo unstructured >> requirements.txt
!echo networkx >> requirements.txt
!echo openpyxl >> requirements.txt
!echo rapidocr-onnxruntime >> requirements.txt

**Terminal / bash command**

In [None]:
pip install -r requirements.txt

In [None]:
# LOADER

# WebBaseLoader - load content from a web page
from langchain_community.document_loaders import WebBaseLoader

target_url="https://kpmg.com/tr/tr/home/gorusler/2023/12/uretken-yapay-zeka-uygulamalarinin-kurumsallasma-yaklasimi.html"

loader = WebBaseLoader(target_url)  # create a loader for the target URL
raw_documents = loader.load()  # fetch the page; returns a list of Document objects


# Save the page text to a file; utf-8 keeps the Turkish characters intact.
with open("URL_Icerik.txt", "w", encoding="utf-8") as file:
  file.write(raw_documents[0].page_content)  # page_content holds the extracted text

print("Dosya işlemi tamamlandı.")  # "File operation completed."

print(raw_documents[0].metadata)  # source, title, description, language; this metadata can be used in a RAG pipeline

# PyPDFLoader - load content from a PDF file
from langchain_community.document_loaders import PyPDFLoader

# first file
filepath = "/content/drive/MyDrive/timeline.pdf"
loader = PyPDFLoader(filepath)  # create a loader for the PDF
pages = loader.load()  # returns one Document per page

print(pages[39].page_content,pages[39].metadata)  # index 39 is the 40th page
##########################################################################################
print("##########################################################################################")
# second file
filepath = "/content/drive/MyDrive/digital.pdf"
loader = PyPDFLoader(filepath,extract_images=True)  # extract_images=True also tries to convert images in the PDF to text (OCR)
pages = loader.load()  # returns one Document per page
print(pages[6].page_content)
##########################################################################################
print("##########################################################################################")

# UnstructuredExcelLoader - load content from an Excel file
from langchain_community.document_loaders import UnstructuredExcelLoader
filepath="/content/drive/MyDrive/ai_course.xlsx"
loader = UnstructuredExcelLoader(filepath, mode="elements")  # "elements" mode also stores the table as HTML in the metadata
docs = loader.load()

excel_content = docs[0].metadata["text_as_html"]  # here the first element (index 0) holds the whole table as HTML

# Save the HTML table to a file.
with open("excel.html", "w", encoding="utf-8") as file:
  file.write(excel_content)

print("Dosya işlemi tamamlandı.")  # "File operation completed."


Dosya işlemi tamamlandı.
{'source': 'https://kpmg.com/tr/tr/home/gorusler/2023/12/uretken-yapay-zeka-uygulamalarinin-kurumsallasma-yaklasimi.html', 'title': 'Üretken Yapay Zeka Uygulamalarının Kurumsallaşma Yaklaş - KPMG Türkiye', 'description': 'Mikro-Dil-Modeli Mimarisi', 'language': 'tr-TR'}
 
 
1989: George Cybenko proves that neural networks can approximate continuous 
functions  
 
 
1989: Yann LeCun's convolutional neural network for handwritten -digit recognition 
(LeNet -1)  
 
 
 
1989: Kurt Hornik proves that neural networks are universal app roximators  
 
 
 
1990: Carver Mead describes a neuromorph ic processor   {'source': '/content/drive/MyDrive/timeline.pdf', 'page': 39}
##########################################################################################
European Economy  Economic Briefs                                                                            Issue 054 | July 2020  
  
 
 
5 
 expansion of e -commerce and e -govern ment and the 
adoption of new

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Dosya işlemi tamamlandı.


**Resource:**
- https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb
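
The splitters compared in the next cell are driven by two numbers, chunk_size and chunk_overlap. The sketch below is a simplified, fixed-width illustration of how those two values interact. The function `chunk_with_overlap` is hypothetical and written for this note; LangChain's splitters additionally respect separators, so their output will differ.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk begins with the last three characters of the previous one, which is the effect chunk_overlap has in the app: neighbouring chunks repeat a little context so a sentence cut at a boundary is still readable in one of them.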

In [7]:
%%writefile app.py
# SPLITTER
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
import os
from dotenv import load_dotenv

#load_dotenv()
#my_key_openai = os.getenv("openai_apikey")
#my_key_google = os.getenv("google_apikey")
my_key_openai="----"
my_key_google="----"

llm_gemini = ChatGoogleGenerativeAI(google_api_key=my_key_google, model="gemini-pro", convert_system_message_to_human=True)

embeddings = OpenAIEmbeddings(api_key=my_key_openai)

def split_content(splitter_type, target_url="", chunk_size=500, chunk_overlap=0):
  loader = WebBaseLoader(target_url)
  raw_documents = loader.load()

  if splitter_type == "Character":
    # Splits on a single separator ("\n\n" by default) and merges the pieces into chunks of up to
    # chunk_size characters. A piece that contains no separator is kept whole, so a chunk can exceed
    # chunk_size instead of being cut in the middle of the content.
    text_splitter = CharacterTextSplitter(
        chunk_size=chunk_size, # maximum number of characters per chunk
        chunk_overlap=chunk_overlap, # number of characters shared by consecutive chunks
        # Example: with chunk_size=800 and chunk_overlap=200, the second chunk starts with the
        # last 200 characters of the first chunk, followed by up to 600 new characters.
        length_function=len
    )

  elif splitter_type == "Recursive":
    # Also limited by chunk_size, but tries a list of separators in order (by default paragraph
    # break, line break, space, then single characters) until the pieces are small enough.
    # A custom list can be passed with the separators argument.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )

  elif splitter_type == "Semantic":
    # Ignores chunk_size and chunk_overlap; uses the embeddings to split where the meaning changes.
    text_splitter = SemanticChunker(embeddings)

  else:
    # Without this branch an unknown type would fail later with a NameError.
    raise ValueError(f"Unknown splitter_type: {splitter_type!r}")

  splitted_documents = text_splitter.split_documents(raw_documents)

  return splitted_documents

import streamlit as st

st.set_page_config(page_title="Splitter Comparison", layout="wide")
st.title("Splitter Comparison")
st.divider()

target_url = st.text_input(label="Enter the web address to process:")
st.divider()
chunk_size = st.slider(label="Select the chunk size:",min_value=100, max_value=2000, value=1000, step=100, key="url_chunk_size")
st.divider()
chunk_overlap = st.slider(label="Select the chunk overlap:",min_value=0, max_value=1000, value=0, step=100, key="url_chunk_overlap")
st.divider()
submit_btn= st.button(label="Start", key = "url_button")
st.divider()
st.divider()

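if submit_btn:
  # The sliders allow an overlap larger than the chunk size, which the splitters reject.
  if chunk_overlap > chunk_size:
    st.error("Chunk overlap must not be larger than chunk size.")
    st.stop()

  col_character, col_recursive, col_semantic = st.columns(3)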

  with col_character:
    splitted_documents = split_content(splitter_type="Character", target_url=target_url, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    st.subheader("Character Splitter")
    for splitted_document in splitted_documents:
      st.success(splitted_document.page_content)

  with col_recursive:
    splitted_documents = split_content(splitter_type="Recursive", target_url=target_url, chunk_size=chunk_size,chunk_overlap=chunk_overlap)
    st.subheader("Recursive Character Splitter")
    for splitted_document in splitted_documents:
      st.info(splitted_document.page_content)

  with col_semantic:
    splitted_documents = split_content(splitter_type="Semantic",target_url=target_url,chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    st.subheader("Semantic Splitter")
    for splitted_document in splitted_documents:
      st.warning(splitted_document.page_content)



Overwriting app.py
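
To make the difference between the "Character" and "Recursive" options in app.py more concrete, here is a simplified sketch of the recursive idea: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. The function `recursive_split` is a hypothetical illustration written for this note, not LangChain's implementation; it ignores chunk_overlap and does not re-merge small leftover pieces.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # This piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
result = recursive_split(sample, chunk_size=30)
for chunk in result:
    print(repr(chunk))
```

With this sample, the short paragraphs stay whole and only the long middle paragraph is broken at word boundaries, which is the behaviour the Recursive column in the app tends to show compared with the Character column.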


In [8]:
!npm install localtunnel
!streamlit run /content/app.py &>/content/logs.txt &
!npx localtunnel --port 8501

[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35msaveError[0m ENOENT: no such file or directory, open '/content/package.json'
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35menoent[0m ENOENT: no such file or directory, open '/content/package.json'
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No description
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No repository field.
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No README data
[0m[37;40mnpm[0m [0m[30;43mWARN[0m[35m[0m content No license field.
[0m
[K[?25h+ localtunnel@2.0.2
updated 1 package and audited 36 packages in 0.436s

3 packages are looking for funding
  run `npm fund` for details

found 2 [93mmoderate[0m severity vulnerabilities
  run `npm audit fix` to fix them, or `npm audit` for details
[K[?25hnpx: installed 22 in 1.918s
your url is: https://lucky-tools-end.loca.lt
^C
