# Markdown extraction

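This notebook shows examples of text extraction from Markdown (MD) files using two loaders: the local Unstructured Markdown loader and the Unstructured.io API. The loaded documents are then compared through embedding similarity search.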

**Table of contents**<a id='toc0_'></a>    
- 1. [Methods to load MD files](#toc1_)    
  - 1.1. [Load from unstructured local MD loader](#toc1_1_)    
  - 1.2. [Load from unstructured io API](#toc1_2_)    
- 2. [Evaluate loaded docs by embedding similarity](#toc2_)    
  - 2.1. [Embedding & Storage](#toc2_1_)    
  - 2.2. [Similarity search](#toc2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=false
	minLevel=2
	maxLevel=4
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

import glob
import pandas as pd
from dotenv import load_dotenv
from langchain.text_splitter import MarkdownHeaderTextSplitter
from tqdm.autonotebook import trange




## 1. <a id='toc1_'></a>[Methods to load MD files](#toc0_)

In [2]:
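# Use the first Markdown file found in the kit directory as the sample input
folder_loc = kit_dir
md_files = glob.glob(f'{folder_loc}/*.md')
if not md_files:
    raise FileNotFoundError(f'No .md files found in {folder_loc}')
file_path = md_files[0]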


### 1.1. <a id='toc1_1_'></a>[Load from unstructured local MD loader](#toc0_)

In [9]:
from langchain.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader(file_path, mode="elements")
docs_unstructured_local = loader.load()
for doc in docs_unstructured_local[:10]:
    print(f'{doc.page_content}\n---')

SambaNova AI Starter Kits
---
Data Extraction Examples
---
Data Extraction Examples
Overview
Getting started
Deploy in vitual environment
Deploy in Docker container


File Loaders
CSV Documents
XLS/XLSX Documents
DOC/DOCX Documents
RTF Documents
Markdown Documents
HTML Documents
Multidocument
PDF Documents
Included Files
---
Overview
---
This kit include a series of Notebooks that demonstrates various methods for extracting text from documents in different input formats. including Markdown, PDF, CSV, RTF, DOCX, XLS, HTML
---
Getting started
---
Deploy the starter kit
---
Option 1: Run through local virtual environment
---
Important: With this option some funcionalities requires to install some pakges directly in your system
- pandoc (for local rtf files loading)
- tesseract-ocr (for PDF ocr and table extraction)
- poppler-utils (for PDF ocr and table extraction)
---
Clone repo.
git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git
2.1 Install requirements: It is re
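
The output above mixes headings, paragraphs, and list items. A minimal sketch for summarizing which element types a loader produced, assuming each document carries a `category` key in its `metadata` (an assumption about what `mode="elements"` records; the helper name `count_categories` and the sample documents are hypothetical stand-ins):

```python
from collections import Counter
from types import SimpleNamespace


def count_categories(docs):
    """Count documents per element category, for example Title or ListItem."""
    # Assumption: each doc has a `metadata` dict. A missing "category" key
    # is counted as "unknown" instead of raising.
    return Counter(doc.metadata.get("category", "unknown") for doc in docs)


# Stand-in documents so the sketch runs without the loader.
# In the notebook, pass docs_unstructured_local instead.
sample_docs = [
    SimpleNamespace(page_content="Overview", metadata={"category": "Title"}),
    SimpleNamespace(page_content="Clone repo.", metadata={"category": "ListItem"}),
    SimpleNamespace(page_content="Getting started", metadata={"category": "Title"}),
]
category_counts = count_categories(sample_docs)
print(category_counts.most_common())
```

Running the same helper on `docs_unstructured_api` would give a quick way to check whether both loaders classify the file's elements alike.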

### 1.2. <a id='toc1_2_'></a>[Load from unstructured io API](#toc0_)

In [10]:
from langchain.document_loaders import UnstructuredAPIFileLoader
# Register at Unstructured.io to get a free API key; the key and URL are read from the repo-level .env file
load_dotenv(os.path.join(repo_dir,'.env'))

loader = UnstructuredAPIFileLoader(file_path,
                                   mode="elements",
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get("UNSTRUCTURED_URL"))
docs_unstructured_api = loader.load()
for doc in docs_unstructured_api:
    print(f'{doc.page_content}\n---')

SambaNova AI Starter Kits
---
Data Extraction Examples
---
Data Extraction Examples
Overview
Getting started
Deploy in vitual environment
Deploy in Docker container


File Loaders
CSV Documents
XLS/XLSX Documents
DOC/DOCX Documents
RTF Documents
Markdown Documents
HTML Documents
Multidocument
PDF Documents
Included Files
---
Overview
---
This kit include a series of Notebooks that demonstrates various methods for extracting text from documents in different input formats. including Markdown, PDF, CSV, RTF, DOCX, XLS, HTML
---
Getting started
---
Deploy the starter kit
---
Option 1: Run through local virtual environment
---
Important: With this option some funcionalities requires to install some pakges directly in your system
- pandoc (for local rtf files loading)
- tesseract-ocr (for PDF ocr and table extraction)
- poppler-utils (for PDF ocr and table extraction)
---
Clone repo.
git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git
2.1 Install requirements: It is re
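
The two loaders appear to return the same text for this file. Rather than comparing the printed output by eye, a small helper can list the positions where the extracted text differs. This is a sketch: `compare_loader_outputs` is a hypothetical helper, and the sample documents stand in for `docs_unstructured_local` and `docs_unstructured_api`.

```python
from types import SimpleNamespace


def compare_loader_outputs(docs_a, docs_b):
    """Return (index, text_a, text_b) for every position where the texts differ."""
    texts_a = [doc.page_content for doc in docs_a]
    texts_b = [doc.page_content for doc in docs_b]
    differences = []
    for i in range(max(len(texts_a), len(texts_b))):
        # A shorter list is padded with None so extra elements show up as differences.
        text_a = texts_a[i] if i < len(texts_a) else None
        text_b = texts_b[i] if i < len(texts_b) else None
        if text_a != text_b:
            differences.append((i, text_a, text_b))
    return differences


# Stand-in documents; in the notebook, pass the two loaded document lists.
local_docs = [SimpleNamespace(page_content="Overview"),
              SimpleNamespace(page_content="Clone repo.")]
api_docs = [SimpleNamespace(page_content="Overview"),
            SimpleNamespace(page_content="Clone repo.")]

differences = compare_loader_outputs(local_docs, api_docs)
print(f"{len(differences)} differing elements")
```

An empty list means both loaders produced identical text in the same order.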

## 2. <a id='toc2_'></a>[Evaluate loaded docs by embedding similarity](#toc0_)

### 2.1. <a id='toc2_1_'></a>[Embedding & Storage](#toc0_)

In [12]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings( model_name='intfloat/e5-large-v2',
                                            embed_instruction="", # no instructions needed for candidate passages
                                            query_instruction="Represent this sentence for searching relevant passages: ",
                                            encode_kwargs=encode_kwargs)
vectorstore_unstructured_local = FAISS.from_documents(documents=docs_unstructured_local, embedding=embd_model)
vectorstore_unstructured_api = FAISS.from_documents(documents=docs_unstructured_api, embedding=embd_model)

load INSTRUCTOR_Transformer
max_seq_length  512
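
Setting `normalize_embeddings` to `True` scales every embedding to unit length. Assuming the FAISS store ranks results by Euclidean (L2) distance, this makes the ranking equivalent to ranking by cosine similarity, because for unit vectors the squared distance equals `2 - 2 * cosine`. The sketch below checks that identity with plain Python on two toy vectors (not real embeddings):

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors standing in for a passage embedding and a query embedding.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

cosine = dot(a, b)
squared_distance = squared_l2_distance(a, b)
print(cosine, squared_distance, 2 - 2 * cosine)
```

A smaller distance therefore always corresponds to a higher cosine similarity, so the nearest neighbor is the same under either measure.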


### 2.2. <a id='toc2_2_'></a>[Similarity search](#toc0_)

In [20]:
query = "How do I clone the repo?"

ans = vectorstore_unstructured_local.similarity_search(query)
print("-------Unstructured local Loader----------\n")
print(ans[0].page_content)


ans_2 = vectorstore_unstructured_api.similarity_search(query)
print("--------Unstructured api loader------------\n")
print(ans_2[0].page_content)


-------Unstructured local Loader----------

Clone repo.
git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git
--------Unstructured api loader------------

Clone repo.
git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git
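
A single query gives only one data point. The comparison can be repeated over several queries with a helper like the one below. It is a sketch: it relies only on a `similarity_search(query, k=...)` method, as used above, and `FakeStore`, its canned answers, and the second query are hypothetical stand-ins so the block runs without an embedding model.

```python
from types import SimpleNamespace


def top_hits_agree(store_a, store_b, queries, k=1):
    """Map each query to True when both stores return the same top-k texts."""
    agreement = {}
    for query in queries:
        hits_a = [doc.page_content for doc in store_a.similarity_search(query, k=k)]
        hits_b = [doc.page_content for doc in store_b.similarity_search(query, k=k)]
        agreement[query] = hits_a == hits_b
    return agreement


class FakeStore:
    """Hypothetical stand-in that mimics `similarity_search(query, k=...)`."""

    def __init__(self, answers):
        self.answers = answers  # maps query -> list of texts, best match first

    def similarity_search(self, query, k=4):
        texts = self.answers.get(query, [])[:k]
        return [SimpleNamespace(page_content=text) for text in texts]


queries = ["how I clone the repo?", "which formats are supported?"]
store_a = FakeStore({queries[0]: ["Clone repo."], queries[1]: ["Overview"]})
store_b = FakeStore({queries[0]: ["Clone repo."], queries[1]: ["File Loaders"]})

agreement = top_hits_agree(store_a, store_b, queries)
print(agreement)
```

In the notebook, `vectorstore_unstructured_local` and `vectorstore_unstructured_api` would take the place of the two fake stores.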
