# OPEN AI-BASED FILE RETRIEVAL SYSTEM

## Basic File Name retrieval

### Installation

We start by installing all the required packages.

In [None]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken
!pip install fuzzywuzzy
!pip install PyPDF2
!pip install azure-storage-blob

Collecting langchain
  Downloading langchain-0.0.218-py3-none-any.whl (1.2 MB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.5.8-py3-none-any.whl (26 kB)
Collecting langchainplus-sdk>=0.0.17 (from langchain)
  Downloading langchainplus_sdk-0.0.17-py3-none-any.whl (25 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain)
  Downloading openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)
Collecting marshmallow<4.0.0,>=3.3.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain)
  Downloading marshmallow-3.19.0-py3-none-any.whl (49 kB)

### Load Required Packages

In [None]:
from langchain.document_loaders import (
    AzureBlobStorageContainerLoader,
    CSVLoader,
    DirectoryLoader,
    GitLoader,
    NotebookLoader,
    OnlinePDFLoader,
    PythonLoader,
    TextLoader,
    UnstructuredFileLoader,
    UnstructuredHTMLLoader,
    UnstructuredPDFLoader,
    UnstructuredWordDocumentLoader,
    WebBaseLoader,
)
from langchain.indexes import VectorstoreIndexCreator
from fuzzywuzzy import fuzz, process



### OpenAI API Key

In [None]:
# Get your API key from OpenAI; you will need to create an account.
# Here is the link to get the key: https://platform.openai.com/account/billing/overview
import os
os.environ["OPENAI_API_KEY"] = ""
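Leaving the key as a string literal in the notebook makes it easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the key is either already exported in the environment or can be typed interactively. `load_api_key` is a hypothetical helper written for this illustration; it is not part of the OpenAI or LangChain libraries.

```python
import os
from getpass import getpass


def load_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting once if it is missing.

    Hypothetical helper for this notebook, not a library function.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it does not appear in the cell output.
        key = prompt(f"Enter {var_name}: ").strip()
    if not key:
        raise ValueError(f"{var_name} is empty")
    # Store the key back so libraries that read the variable can find it.
    os.environ[var_name] = key
    return key
```

Calling `load_api_key()` once before building the index would then replace the hard-coded assignment.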

### Connect to Blob

In [None]:
loaders = AzureBlobStorageContainerLoader(conn_str="DefaultEndpointsProtocol=https;AccountName=ishitagptblob1;EndpointSuffix=core.windows.net",container="ishita-container-langchain")

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

In [None]:
documents = loaders.load()
print(f'You have {len(documents)} documents in your data')


You have 17 documents in your data


### Load Multiple File Extensions (do not execute)

In [None]:
folder_path = '/content/gdrive/My Drive/data/'
os.listdir(folder_path)

['test1.pdf',
 'test2.pdf',
 'test3.pdf',
 'test4.pdf',
 'test5.pdf',
 'EmpSampledata.csv',
 'CauseAndEffectOfHomelessness2.txt',
 'CauseAndEffectOfHomelessness3.txt',
 'Crime with Violence in USA and SA.docx',
 'CauseAndEffectOfHomelessness.txt',
 'Wiki.md',
 'PerksPlus.pdf',
 'Copy of Benefit_Options.pdf',
 'role_library.pdf',
 'File Viewer Migration factory.pdf',
 'Northwind_Standard_Benefits_Details.pdf',
 'Northwind_Health_Plus_Benefits_Details (1).pdf',
 'OPEN AI-BASED\xa0FILE RETRIEVAL\xa0SYSTEM (1).pptx']

In [None]:
# Location of the files: one DirectoryLoader per file extension.
# loaders = [UnstructuredPDFLoader(os.path.join(folder_path, fn)) for fn in os.listdir(folder_path)]
text_loader = DirectoryLoader(folder_path, glob='**/*.txt')
pdf_loader = DirectoryLoader(folder_path, glob='**/*.pdf')
readme_loader = DirectoryLoader(folder_path, glob='**/*.md')
doc_loader = DirectoryLoader(folder_path, glob='**/*.docx')
ppt_loader = DirectoryLoader(folder_path, glob='**/*.pptx')
csv_loader = DirectoryLoader(folder_path, glob='**/*.csv')
# UnstructuredHTMLLoader expects a single file path rather than a folder and a
# glob pattern, so a DirectoryLoader is used for HTML files as well.
html_loader = DirectoryLoader(folder_path, glob='**/*.html')
# folder_path is a directory and never ends with ".ipynb"; match notebooks by glob instead.
pynb_loader = DirectoryLoader(folder_path, glob='**/*.ipynb', loader_cls=NotebookLoader)
# html_loader and pynb_loader are left out: the folder listed above has no .html or .ipynb files.
loaders = [pdf_loader, readme_loader, text_loader, doc_loader, ppt_loader, csv_loader]
documents = []
for loader in loaders:
    documents.extend(loader.load())


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


### Vector Store
Chroma is used as the vector store to index and search embeddings.


There are three main steps going on after the documents are loaded:

- Splitting documents into chunks

- Creating embeddings for each chunk

- Storing documents and embeddings in a vectorstore
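The three steps can be illustrated with a small, dependency-free sketch. It is a toy stand-in, not what `VectorstoreIndexCreator` or Chroma do internally: the "embedding" here is a plain bag-of-words count vector instead of an OpenAI embedding, the store is an in-memory list, and every name (`split_into_chunks`, `embed`, `ToyVectorStore`, the sample file names) is hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=40, overlap=10):
    """Step 1: split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def embed(text):
    """Step 2 (toy version): represent text as a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Step 3: keep (embedding, chunk, source) rows and search them by similarity."""

    def __init__(self):
        self.rows = []

    def add_document(self, text, source, **split_kwargs):
        for chunk in split_into_chunks(text, **split_kwargs):
            self.rows.append((embed(chunk), chunk, source))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [(chunk, source) for _, chunk, source in ranked[:k]]


store = ToyVectorStore()
store.add_document("Document errors can delay a claim.", source="benefits.pdf")
store.add_document("Homelessness affects health and income.", source="homeless.txt")
print(store.search("claim errors")[0][1])  # source of the best-matching chunk
```

Keeping the source next to each chunk is what later allows a query to report which file an answer came from.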


In [None]:
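# from_loaders expects a list of loaders. `loaders` is a single loader when it comes
# from the blob cell and already a list when it comes from the multi-extension cell.
loader_list = loaders if isinstance(loaders, list) else [loaders]
index = VectorstoreIndexCreator().from_loaders(loader_list)
print(index)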

vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7f1807f467d0>


### Testing for user sample query

In [None]:
# Enter the query for which you want the related document
# (despite its name, query_vector holds the plain-text query, not an embedding)
query_vector=input()

What are document errors?


In [None]:
#store it in result for further processing
result = index.query(query_vector)
print(query_vector)
print(result)

What are document errors
 Document errors are mistakes or omissions in the documentation that is submitted with a claim. These errors can include incorrect or missing information, incorrect coding, or incorrect dates of service. Document errors can lead to delays in processing or denial of a claim.


In [None]:
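# Note: `result` (the generated answer) is passed as the question here, which is why
# the 'question' field in the output echoes the answer. Pass `query_vector` instead
# to look up sources for the original query.
response = index.query_with_sources(result)
print(response)
print(response['sources'])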

{'question': ' Document errors are mistakes or omissions in the documentation that is submitted with a claim. These errors can include incorrect or missing information, incorrect coding, or incorrect dates of service. Document errors can lead to delays in processing or denial of a claim.', 'answer': ' Document errors can lead to delays in processing or denial of a claim.\n', 'sources': '/tmp/tmpene3qesg/ishita-container-langchain/Northwind_Standard_Benefits_Details.pdf, /tmp/tmpyxl2__d5/ishita-container-langchain/Northwind_Health_Plus_Benefits_Details (1).pdf'}
/tmp/tmpene3qesg/ishita-container-langchain/Northwind_Standard_Benefits_Details.pdf, /tmp/tmpyxl2__d5/ishita-container-langchain/Northwind_Health_Plus_Benefits_Details (1).pdf
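The `sources` value shown above is a single comma-separated string of temporary paths, while the goal of this notebook is to surface file names. A minimal sketch of extracting them follows; `source_file_names` is a hypothetical helper, and it assumes that paths are separated by `", "` and that no file name itself contains that separator.

```python
import posixpath


def source_file_names(sources):
    """Split a comma-separated sources string into unique file names, keeping order.

    Hypothetical helper; assumes POSIX-style paths joined by ", ".
    """
    names = []
    for path in sources.split(", "):
        name = posixpath.basename(path.strip())
        if name and name not in names:
            names.append(name)
    return names


sources = (
    "/tmp/tmpene3qesg/ishita-container-langchain/Northwind_Standard_Benefits_Details.pdf, "
    "/tmp/tmpyxl2__d5/ishita-container-langchain/Northwind_Health_Plus_Benefits_Details (1).pdf"
)
print(source_file_names(sources))
```

The resulting names can then be matched against the blob names in the container to build download URLs.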


### Testing with user input for filename
**Works with both correctly and incorrectly entered file names**

In [None]:
from azure.storage.blob import BlobServiceClient

# Define your storage account and container details
account_name = 'ishitagptblob1'
account_key = ''
container_name = "ishita-container-langchain"

# Create a BlobServiceClient object
blob_service_client = BlobServiceClient.from_connection_string(f"DefaultEndpointsProtocol=https;AccountName={account_name};AccountKey={account_key};EndpointSuffix=core.windows.net")

# Get the container client
container_client = blob_service_client.get_container_client(container_name)

# List blobs in the container
blob_list = [blob.name for blob in container_client.list_blobs()]


In [None]:
# Enter the name of the file whose source URL should be printed
blob_name=input("Enter the file name: ")

Enter the file name: EmpSampledata.csv


In [None]:
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

# Check if the blob exists
if blob_client.exists():
    # Print the blob URL
    blob_url = blob_client.url
    print(f"Blob URL: {blob_url}")
else:
    # No exact match: suggest the closest blob name by fuzzy string similarity
    matches = process.extract(blob_name, blob_list, scorer=fuzz.ratio, limit=1)
    for match in matches:
        blob_name = match[0]
        # Build the URL through the client so that characters such as spaces are encoded
        blob_url = blob_service_client.get_blob_client(container=container_name, blob=blob_name).url
        print(f"Suggested file: {blob_name} - URL: {blob_url}")

Blob URL: https://ishitagptblob1.blob.core.windows.net/ishita-container-langchain/EmpSampledata.csv
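The fallback branch for a mistyped name can be tried without an Azure connection. The sketch below swaps in the standard library's `difflib.SequenceMatcher` for `fuzzywuzzy`, so its scores are on a 0 to 1 scale rather than 0 to 100, and it lowercases both strings before comparing. `suggest_file` and its `min_score` cutoff are hypothetical additions for illustration; the notebook cell itself always prints the single best match.

```python
from difflib import SequenceMatcher


def suggest_file(name, candidates, min_score=0.0):
    """Return (best_candidate, score), or None if nothing reaches min_score.

    Hypothetical helper using difflib as a stand-in for fuzzywuzzy.
    """
    best = None
    for candidate in candidates:
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score >= min_score and (best is None or score > best[1]):
            best = (candidate, score)
    return best


blob_names = ["EmpSampledata.csv", "Wiki.md", "PerksPlus.pdf"]
print(suggest_file("EmpSampleData.cvs", blob_names))
```

A cutoff such as `min_score=0.6` would avoid suggesting an unrelated file when nothing in the container resembles the input; the right value is a judgment call and is not taken from the notebook.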


### Testing with different types of files

In [None]:
#with pdf
print(index.query('What is the Northwind Standard Health Plan'))
print(index.query_with_sources('What is the Northwind Standard Health Plan')['sources'])

 The Northwind Standard Health Plan is a comprehensive health plan that provides coverage for medical, vision, and dental services, as well as preventive care services and prescription drug coverage. It offers a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies. It does not offer coverage for emergency services, mental health and substance abuse coverage, or out-of-network services.
/content/gdrive/My Drive/data/Northwind_Standard_Benefits_Details.pdf, /content/gdrive/My Drive/data/Copy of Benefit_Options.pdf


In [None]:
#with md
print(index.query('How to setup CMAV tool'))
print(index.query_with_sources('How to setup CMAV tool')['sources'])

 To setup the CMAV tool, you need to download and unzip the file, open the “CloudMigrationAssessmentAndValidation.sln” file in Visual Studio or any other IDE, and open the “Configuration.json” file and set the values for different keys according to the instructions provided.
/content/gdrive/My Drive/data/test5.pdf


In [None]:
#with txt
print(index.query('What are the effects of homelessness'))
print(index.query_with_sources('What are the effects of homelessness')['sources'])

 The effects of homelessness can include poor health, personal and psychological decline, decreased access to opportunity, loss of job or income, poverty, substance abuse, violence in the home, and disability and illness.
/content/gdrive/My Drive/data/CauseAndEffectOfHomelessness3.txt, /content/gdrive/My Drive/data/CauseAndEffectOfHomelessness2.txt


In [None]:
#with doc
print(index.query('Tell me about criminal violence against Black Americans'))
print(index.query_with_sources('Tell me about criminal violence against Black Americans')['sources'])

 Criminal violence against Black Americans has been a major problem in the United States and South Africa in the last two decades. The rate of violence in the Black American communities was exceedingly high, with the number of violence cases in the United States among the Black Americans being twenty-eight times higher than those in Germany and France. In 1990, most of the homicidal cases were discovered to target Black American males, with the highest number of these cases involving men between the ages of fifteen and twenty-four years. Most of these crimes were gang-related and involved drug smuggling, robbery, and gang fights.
/content/gdrive/My Drive/data/Crime with Violence in USA and SA.docx


In [None]:
#with ppt
print(index.query('Problem Statement Related to Migration of artifacts'))
print(index.query_with_sources('Problem Statement Related to Migration of artifacts')['sources'])

 The problem statement is to build a chatbot-like solution to help Factory team members to identify the artifacts and information that would enable them to consume the info more efficiently and optimize their delivery.
/content/gdrive/My Drive/data/File Viewer Migration factory.pdf


In [None]:
#with csv
print(index.query('Print all Latino employees'))
print(index.query_with_sources('Print all Latino employees')['sources'])

 Elias Alvarado, Eva Rivera, Logan Rivera, Mateo Her, Jose Henderson, Abigail Mejia
/content/gdrive/My Drive/data/EmpSampledata.csv
