# Prerequisites

In [1]:
import os
from dotenv import load_dotenv

# Define the target directory
target_directory = r'C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search'

# Load .env file
load_dotenv()

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search


# 1. Extract, Chunk and Index 

We are going to use two custom classes to reduce the amount of code we have to write. We'll have two clients: one for extracting data from SharePoint and the other for chunking and indexing in Azure Search.

### SharePointDataExtractor

SharePointDataExtractor is a client designed to interact with Microsoft SharePoint through the Microsoft Graph API. It handles various tasks related to fetching and processing data from SharePoint sites.

Key Functionalities:

- Authentication: Handles OAuth authentication with Microsoft Graph API using tenant ID, client ID, and client secret.
- Data Retrieval: Fetches data from specific SharePoint sites and drives. This includes retrieving site IDs, drive IDs, and files within a SharePoint site.
- File Processing: Ability to filter files based on modification time and file formats. It can retrieve file contents, particularly from .docx files.
- Permissions Handling: Fetches and processes file permissions to understand access control and roles associated with SharePoint files.
- Data Extraction: Compiles detailed information about each file, including content, location, and user roles, into a structured format.
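
To make the authentication, data retrieval, and permissions steps above more concrete, here is a minimal sketch of the requests involved. It is not the actual SharePointDataExtractor implementation: the function names are hypothetical, and the helpers only build the request URL and payload without sending anything over the network.

```python
from typing import Dict, Tuple
from urllib.parse import quote

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Tuple[str, Dict[str, str]]:
    """Return the token endpoint URL and form payload for the OAuth client-credentials flow."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    payload = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    }
    return url, payload


def site_lookup_url(site_domain: str, site_name: str) -> str:
    """URL that resolves a SharePoint site (and its site ID) from its domain and name."""
    return f"{GRAPH_BASE}/sites/{site_domain}:/sites/{quote(site_name)}"


def drive_children_url(site_id: str) -> str:
    """URL that lists the files at the root of the site's default drive."""
    return f"{GRAPH_BASE}/sites/{site_id}/drive/root/children"


def item_permissions_url(site_id: str, item_id: str) -> str:
    """URL that lists the permissions granted on a single file."""
    return f"{GRAPH_BASE}/sites/{site_id}/drive/items/{item_id}/permissions"
```

With an HTTP library such as `requests`, the token would be obtained by POSTing the payload as form data to the returned URL and reading `access_token` from the JSON response; that token is then sent as a `Bearer` header on the Graph calls.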

Usage Context:
This client is used when there is a need to extract and process data from SharePoint. It's particularly useful for applications that require automated retrieval and processing of documents and files stored in SharePoint.
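
For the file-processing step, a filter on modification time and file format might look like the sketch below. It assumes each file is a dictionary shaped like a Microsoft Graph drive item (with `name` and `lastModifiedDateTime` keys) and that passing `None` disables the corresponding filter; `filter_files` and its `now` parameter are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def filter_files(
    files: List[Dict[str, str]],
    minutes_ago: Optional[int] = None,
    file_formats: Optional[List[str]] = None,
    now: Optional[datetime] = None,
) -> List[Dict[str, str]]:
    """Keep files with an allowed extension that were modified within the last `minutes_ago` minutes."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=minutes_ago) if minutes_ago is not None else None
    allowed = {fmt.lower().lstrip(".") for fmt in file_formats} if file_formats else None

    kept = []
    for item in files:
        name = item["name"]
        extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if allowed is not None and extension not in allowed:
            continue
        if cutoff is not None:
            # Graph timestamps look like "2023-12-09T20:16:39Z"; normalise the trailing "Z".
            modified = datetime.fromisoformat(item["lastModifiedDateTime"].replace("Z", "+00:00"))
            if modified < cutoff:
                continue
        kept.append(item)
    return kept
```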

### TextChunkingIndexing

TextChunkingIndexing is a client focused on processing and indexing text data. It primarily deals with chunking large text into manageable pieces and preparing it for indexing or further analysis.

Key Functionalities:

- Environment Setup: Loads necessary environment variables required for indexing and chunking operations.
- Text Chunking: Capable of breaking down large text data into smaller chunks based on character count, which is essential for text analysis and indexing in databases.
- Customization: Offers customization options for chunk size and overlap, making it versatile for various text processing needs.

Usage Context:
This client is particularly useful in scenarios where large text documents need to be processed, analyzed, or indexed. Examples include Natural Language Processing tasks, machine learning model training, and preparing data for storage in databases where smaller text chunks are preferable.
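
To illustrate what chunk size and overlap mean, here is a minimal pure-Python sketch of character-based chunking. It is a simplification rather than the TextChunkingIndexing or LangChain implementation (which also splits on separators); the function name and default values are illustrative assumptions.

```python
from typing import List


def chunk_by_characters(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters so that content cut at a
    chunk boundary still appears intact in one of the two neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

A larger overlap preserves more context across chunk boundaries at the cost of indexing more duplicated text.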

### Example Workflow:

1. Use client_scrapping (SharePointDataExtractor) to retrieve documents from a SharePoint site.
2. Pass these documents to client_indexing (TextChunkingIndexing) to break the text into smaller, more manageable chunks.
3. Embed the chunked text and index it in our selected vector database, Azure Search.

In [2]:
from gbb_ai.sharepoint_data_extractor import SharePointDataExtractor
from gbb_ai.langchain_indexing import TextChunkingIndexing

# Instantiate the SharePointDataExtractor client
# This client is responsible for connecting to Microsoft SharePoint through the Microsoft Graph API.
# The client handles the complexities of interacting with SharePoint's REST API, providing an easy-to-use interface for data extraction.
client_scrapping = SharePointDataExtractor()

# Instantiate the TextChunkingIndexing client
# This client is responsible for chunking text into smaller pieces using the LangChain framework, which are then indexed by Azure Cognitive Search.
# The client offers customizable options for how text should be chunked, ensuring flexibility to suit various text processing needs.
client_indexing = TextChunkingIndexing()


In [3]:
SITE_DOMAIN = 'mngenvmcap747548.sharepoint.com'
SITE_NAME = 'Contoso'

In [4]:
# Retrieve .docx file contents from a specified SharePoint site using SharePointDataExtractor
content_files = client_scrapping.retrieve_sharepoint_files_content(
    site_domain=SITE_DOMAIN,
    site_name=SITE_NAME,
    minutes_ago=None,
    file_formats=["docx"],
)

2023-12-09 20:16:39,991 - micro - MainProcess - INFO     New access token retrieved.... (sharepoint_data_extractor.py:msgraph_auth:58)
2023-12-09 20:16:39,992 - micro - MainProcess - INFO     Decoded Access Token:
{
  "aud": "https://graph.microsoft.com",
  "iss": "https://sts.windows.net/9495d8c9-4ebb-4107-b905-c7b45d1b7b7a/",
  "iat": 1702174299,
  "nbf": 1702174299,
  "exp": 1702178199,
  "aio": "E2VgYHjqzM3iyTY70/+7wJ/X23clAQA=",
  "app_displayname": "dev-graph",
  "appid": "118583ee-94ed-45dd-870b-73784045eb37",
  "appidacr": "1",
  "idp": "https://sts.windows.net/9495d8c9-4ebb-4107-b905-c7b45d1b7b7a/",
  "idtyp": "app",
  "oid": "4f614374-65fa-45fc-8369-cb616a6fe08f",
  "rh": "0.Ab0AydiVlLtOB0G5Bce0XRt7egMAAAAAAAAAwAAAAAAAAADLAAA.",
  "roles": [
    "TeamsActivity.Read.All",
    "SharePointTenantSettings.Read.All",
    "People.Read.All",
    "Sites.Read.All",
    "Sites.Manage.All",
    "Directory.Read.All",
    "OnlineMeetingTranscript.Read.All",
    "BrowserSiteLists.ReadWrite.

In [5]:
client_indexing.setup_aoai()

In [6]:
DEPLOYMENT = "foundational-ada"
MODEL_NAME = "text-embedding-ada-002"
client_indexing.load_embedding_model(deployment=DEPLOYMENT, model_name=MODEL_NAME)

2023-12-09 20:16:44,273 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model text-embedding-ada-002, deployment foundational-ada, and chunk size 1000 (langchain_indexing.py:load_embedding_model:103)
2023-12-09 20:16:44,278 - micro - MainProcess - INFO     OpenAIEmbeddings object created successfully. (langchain_indexing.py:load_embedding_model:116)


OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, async_client=None, model='text-embedding-ada-002', deployment='foundational-ada', openai_api_version='2023-05-15', openai_api_base='https://ml-workspace-dev-eastus-001-aoai.openai.azure.com/', openai_api_type='azure', openai_proxy='', embedding_ctx_length=8191, openai_api_key='d050ad8b96ef4ecbb5099eece1212a91', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=16, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=True, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None)

In [7]:
# This function creates the index in Azure AI Search if it does not exist and loads the configuration. Please modify the function if needed; see the quick guide below.
client_indexing.setup_azure_search(index_name="langchain-vector-demo-custom")

100%|██████████| 1/1 [00:00<00:00,  1.77it/s]
100%|██████████| 1/1 [00:00<00:00, 15.90it/s]


ValueError: You need to specify at least the following fields {'content_vector': 'Collection(Edm.Single)'} or provide alternative field names in the env variables.

content_vector current type: 'Edm.String'. It has to be 'Collection(Edm.Single)' or you can point to a different 'Collection(Edm.Single)' field name by using the env variable 'AZURESEARCH_FIELDS_CONTENT_VECTOR'
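
The error message points at one possible workaround: if the existing index already has a vector field under a different name, the `AZURESEARCH_FIELDS_CONTENT_VECTOR` environment variable can point to it. A minimal sketch follows; the field name `my_vector_field` is hypothetical, and the assumption that the variable must be set before the LangChain Azure Search module is imported should be checked against the LangChain version in use.

```python
import os

# Hypothetical field name: replace it with a field of type
# Collection(Edm.Single) that actually exists in your index.
os.environ["AZURESEARCH_FIELDS_CONTENT_VECTOR"] = "my_vector_field"

# Assumption: set the variable before importing the vector store, e.g.
# from langchain.vectorstores.azuresearch import AzureSearch
```

The alternative is to recreate the index (or use a new index name) so that content_vector is defined with the right type.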

#### Quick Guide to Setting Up Azure Search Index

Let's set up an Azure Search index tailored for advanced search capabilities, including semantic and vector-based searches. Here's a step-by-step guide:

##### Embedding Function Setup:

Define embedding_function to transform text into vectors. This powers the semantic search.

##### Define Index Fields:

Create fields like id, content, content_vector, and others in the fields list. Each field represents a document attribute.
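Make sure content_vector is of type Collection(Edm.Single) and that its dimensions match your embedding function's output (1536 for text-embedding-ada-002). A content_vector field of type Edm.String triggers the ValueError shown above.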

##### Initialize Azure Search Client:

Instantiate AzureSearch with your Azure endpoint, key, custom index_name, and the fields list.
Configure semantic settings to fine-tune search relevance.

##### Customize As Needed:

Modify fields based on your document attributes.
Adjust index_name or semantic configurations to fit your specific search needs.

```python 
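import os

from azure.search.documents.indexes.models import (
    PrioritizedFields, SearchableField, SearchField, SearchFieldDataType,
    SemanticConfiguration, SemanticField, SemanticSettings, SimpleField,
)
from langchain.vectorstores.azuresearch import AzureSearch
from your_embedding_module import embeddings  # Replace with your actual module

# Embedding function and fields setup
embedding_function = embeddings.embed_query
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, searchable=True),
    # Vector field: must be Collection(Edm.Single), sized to the embedding output.
    # `vector_search_configuration` assumes the preview SDK that these imports come from.
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=len(embedding_function("Text")),
        vector_search_configuration="default",
    ),
    # ... other fields ...
]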

# Azure Search client initialization
vector_store = AzureSearch(
    azure_search_endpoint=os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT"),
    azure_search_key=os.getenv("AZURE_SEARCH_ADMIN_KEY"),
    index_name="your-custom-index-name",
    embedding_function=embedding_function,
    fields=fields,
    # Semantic settings
    semantic_settings=SemanticSettings(
        default_configuration="config",
        configurations=[
            SemanticConfiguration(
                name="config",
                prioritized_fields=PrioritizedFields(
                    title_field=SemanticField(field_name="content"),
                    # ... other configurations ...
                ),
            )
        ],
    ),
)

# Now, your Azure Search index is ready for advanced querying!
```

In [None]:
chunks = client_indexing.split_documents_by_character(content_files)

In [None]:
client_indexing.embed_and_index(texts=chunks)

100%|██████████| 1/1 [00:00<00:00, 13.14it/s]

# 2. Search

In [None]:
import os
from dotenv import load_dotenv
import openai

# Define the target directory
target_directory = r'C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search'

# Load .env file
load_dotenv()

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search


In [None]:
from gbb_ai.trimming_ai_search import AzureSearchManager

client_search = AzureSearchManager()

In [None]:
search_query = "LLM is a master of laws"

In [None]:
results = client_search.hybrid_retrieval_rerank(
    search_query=search_query,
    security_group="Group_critical",
    top_k=5,
    azure_deployment_name="foundational-ada",
    semantic_configuration_name="config",
)

100%|██████████| 1/1 [00:00<00:00,  1.71it/s]
2023-12-09 20:14:56,605 - micro - MainProcess - INFO     Search query: LLM is a master of laws, results: [{'score': 0.02812499925494194, 'reranker_score': 2.2067630290985107, 'content': 'A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning. As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Lar
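
The `security_group` argument suggests that results are trimmed to documents the caller's group may see. How AzureSearchManager does this is not shown here; one common approach in Azure AI Search is an OData filter on a field that stores the allowed groups. The sketch below assumes a hypothetical collection field named `security_group`, and the helper name is illustrative.

```python
from typing import Iterable


def build_security_filter(groups: Iterable[str], field_name: str = "security_group") -> str:
    """Build an OData filter that matches documents tagged with any of `groups`.

    Assumes `field_name` is a filterable Collection(Edm.String) field in the index.
    """
    # Single quotes inside OData string literals are escaped by doubling them.
    escaped = [group.replace("'", "''") for group in groups]
    if not escaped:
        raise ValueError("at least one group is required")
    joined = ",".join(escaped)
    return f"{field_name}/any(g: search.in(g, '{joined}', ','))"
```

The resulting string could be passed as the `filter` argument of `SearchClient.search` in `azure-search-documents`, so that the hybrid query only ranks documents the given groups are allowed to read.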

In [None]:
results

['A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning. As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[4] They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corpora.[5] Notable ex