# Real-Time SharePoint Document Indexing with Security Trimming via Graph API and Azure AI Search

Leverage the Microsoft [Graph API](https://learn.microsoft.com/en-us/sharepoint/dev/apis/sharepoint-rest-graph) in combination with Azure AI Search to index SharePoint Online documents in real-time. This notebook employs [Langchain](https://python.langchain.com/docs/integrations/vectorstores/azuresearch) as a high-level orchestration tool for chunking and vectorizing text, using the Ada embedding model from Azure OpenAI to transform SharePoint text into vectors. The solution tracks document updates and applies [security trimming](https://learn.microsoft.com/en-us/azure/search/search-security-trimming-for-azure-search) so that search results match each user's access level. It also improves the search experience by enabling hybrid search with reranking ([RRF](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking)), using the out-of-the-box relevance scoring provided by Azure AI Search. With Azure AI Search as the vector store, users get more accurate and contextually relevant results.


#### Flow

1. [Extracting Files from SharePoint with SharePointDataExtractor](#extracting-files-from-sharepoint-with-sharepointdataextractor)
2. [Chunking, Text Vectorization, and Indexing with TextChunkingIndexing](#chunking-text-vectorization-and-indexing-with-textchunkingindexing)
3. [Search with Embedded Security Trimming Intelligence](#search-with-security-trimming)

## Prerequisites

Modify `target_directory` in the code to match the path of your desired directory before executing the notebook.

In [1]:
import os

# Define the target directory (change yours)
target_directory = r'C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search'

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search


#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

1. **Prepare the Environment File**:
   - Ensure you have an `environment.yml` file in your repository. This file should list all the necessary libraries and dependencies for your project.

2. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yml` file:
     
     ```bash
     make create_conda_env
     ```

   - This command runs a `make` target that creates a Conda environment as defined in `environment.yml`.

3. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate [YourEnvName]
     ```
     Replace `[YourEnvName]` with the name of your environment as specified in `environment.yml`.

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions in VSCode.

2. **Attach Kernel to VSCode**:
   - Once the Conda environment is created, you should be able to see it in the kernel selection (top right corner of your VSCode interface).
   - Select your newly created environment as the kernel for running Jupyter Notebooks.

By following these steps, you'll set up a dedicated Conda environment for your project and configure VSCode to run Jupyter Notebooks efficiently. This environment will contain all the necessary dependencies in your `environment.yml` file.

#### Necessary Azure AI Services for Running This Notebook

+ *Azure AI Search* (formerly Azure Cognitive Search): This service indexes and queries large amounts of data. It provides full-text search, vector search, AI-powered enrichment, and rich data integration.

+ *Azure OpenAI Service*: Utilized for leveraging advanced AI models like Ada for text vectorization and other AI-based tasks. 

+ *Microsoft Graph API*: Essential for accessing SharePoint Online data. It allows the notebook to authenticate, access documents, and understand document-level permissions within SharePoint. To run the notebook with Microsoft Graph API for accessing SharePoint Online data, you need to register an application in Azure. This process involves several steps:
    - Sign in to the [Azure portal](https://portal.azure.com/) with an administrator account.
    - If you have access to multiple tenants, use the Directories + subscriptions filter in the top menu to switch to the tenant in which you want to register the application.
    - Search for and select Azure Active Directory (now Microsoft Entra ID).
    - Under Manage, select App registrations > New registration.
    - Enter a Name for your application, for example <code>sharepoint-cog-search-indexing</code>. Users of your app might see this name, and you can change it later.
    - Select Register.
    - Under Manage, select Certificates & secrets.
    - Under Client secrets, select New client secret, enter a name, and then select Add. Record the secret's Value in a safe location; this is the "Client Secret" used in a later step. NOTE: Do not copy the "Secret ID", as it is not needed.
    - Under Manage, select API Permissions > Add a permission. Select Microsoft Graph.
    - Select Application permissions.
    - Under the User node, select User.Read.All, and under the Sites node, select Sites.Read.All. Then select Add permissions.
    - If you notice that "Grant Admin Consent" is required, enable this now. Make sure all permissions have been granted admin consent. If you require an Admin, please see this [document](https://learn.microsoft.com/azure/active-directory/develop/console-app-quickstart?pivots=devlang-python) for additional help.
    - Click "Overview" and copy the "Application (client) ID" as well as the "Directory (tenant) ID".



#### Configure Environment Variables 

To load secrets and configurations for your notebook, you will use environment variables. These variables should be defined in a .env file, which you need to create and fill with your specific keys and service endpoints. *Refer to .env.sample for Guidance.*

In [2]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True
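
Before moving on, it can help to fail fast when a required setting is missing. The sketch below is only an illustration: `AZURE_SEARCH_SERVICE_ENDPOINT` and `AZURE_SEARCH_ADMIN_KEY` appear later in this notebook, while `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET` are hypothetical names. Use whatever names your own `.env.sample` defines.

```python
import os

# Names are illustrative assumptions; align them with your own .env.sample.
REQUIRED_VARS = [
    "TENANT_ID",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
]


def find_missing_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Running this right after `load_dotenv()` surfaces a misnamed or empty variable before any Azure call is made.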

In [3]:
# Local Variables Needed

# SharePoint Configuration
SITE_DOMAIN = 'mngenvmcap747548.sharepoint.com'  # Domain of your SharePoint site
SITE_NAME = 'Contoso'                           # Name of your SharePoint site

# Azure Open AI Deployment Configuration
DEPLOYMENT = "foundational-ada"                 # Deployment name for your Azure Open AI service needed for text embeddings
MODEL_NAME = "text-embedding-ada-002"           # Model name for text embeddings in Azure Open AI service

# Azure AI Search Index Configuration
INDEX_NAME = "langchain-vector-demo-custom"     # Name of the index in Azure Cognitive Search

### 1. Extracting Files from SharePoint with `SharePointDataExtractor`

The `SharePointDataExtractor` class, built atop the official SDK, streamlines interactions with Microsoft SharePoint through the Microsoft Graph API, focusing on efficient data retrieval and permissions management.

#### Key Features:

- **Authentication**: Automates OAuth with Microsoft Graph API using tenant ID, client ID, and client secret.
- **Data Retrieval**: Retrieves site and drive IDs, and files from SharePoint sites.
- **File Processing**: Filters and processes files, primarily .docx (extensible to other formats), based on modification time and type.
- **Permissions Management**: Analyzes file permissions for understanding access controls and associated roles.
- **Data Extraction**: Compiles detailed information about files, including content, location, and user roles, into a structured format.
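
The internals of `SharePointDataExtractor` are not shown in this notebook. As a rough illustration of the authentication and site-lookup steps listed above, the following sketch builds an OAuth 2.0 client-credentials token request and the Graph URL that resolves a site by hostname and path. The function names are hypothetical and only the standard library is used; the class itself may work differently.

```python
import json
import urllib.parse
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_token_request(tenant_id, client_id, client_secret):
    """Build the URL and form body for a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",
    }).encode("utf-8")
    return url, body


def build_site_url(site_domain, site_name):
    """Build the Graph URL that resolves a SharePoint site by hostname and path."""
    return f"{GRAPH_BASE}/sites/{site_domain}:/sites/{site_name}"


def fetch_access_token(tenant_id, client_id, client_secret):
    """POST the token request and return the access token (network call)."""
    url, body = build_token_request(tenant_id, client_id, client_secret)
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

The token returned by `fetch_access_token` would then be sent as a `Bearer` token in the `Authorization` header of each Graph request, starting with the URL from `build_site_url` to obtain the site ID.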


In [4]:
from gbb_ai.sharepoint_data_extractor import SharePointDataExtractor

# Instantiate the SharePointDataExtractor client
# The client handles the complexities of interacting with SharePoint through the Microsoft Graph API, providing an easy-to-use interface for data extraction.
client_scrapping = SharePointDataExtractor()

In [5]:
# Retrieve .docx file contents from a specified SharePoint site using SharePointDataExtractor
content_files = client_scrapping.retrieve_sharepoint_files_content(
    site_domain=SITE_DOMAIN,
    site_name=SITE_NAME,
    minutes_ago=None,
    file_formats=["docx"],
)

2023-12-11 17:00:28,038 - micro - MainProcess - INFO     New access token retrieved.... (sharepoint_data_extractor.py:msgraph_auth:59)
2023-12-11 17:00:28,040 - micro - MainProcess - INFO     Decoded Access Token:
{
  "aud": "https://graph.microsoft.com",
  "iss": "https://sts.windows.net/9495d8c9-4ebb-4107-b905-c7b45d1b7b7a/",
  "iat": 1702335327,
  "nbf": 1702335327,
  "exp": 1702339227,
  "aio": "E2VgYNj8br7t7bUmT1h/Ha+ZnzS3FgA=",
  "app_displayname": "dev-graph",
  "appid": "118583ee-94ed-45dd-870b-73784045eb37",
  "appidacr": "1",
  "idp": "https://sts.windows.net/9495d8c9-4ebb-4107-b905-c7b45d1b7b7a/",
  "idtyp": "app",
  "oid": "4f614374-65fa-45fc-8369-cb616a6fe08f",
  "rh": "0.Ab0AydiVlLtOB0G5Bce0XRt7egMAAAAAAAAAwAAAAAAAAADLAAA.",
  "roles": [
    "TeamsActivity.Read.All",
    "SharePointTenantSettings.Read.All",
    "People.Read.All",
    "Group.Read.All",
    "Sites.Read.All",
    "Group.ReadWrite.All",
    "Sites.Manage.All",
    "Directory.Read.All",
    "OnlineMeetingTrans
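
The retrieval call in cell `In [5]` accepts `minutes_ago` and `file_formats` arguments. How the class applies them is not shown here; the sketch below is one plausible way to filter by extension and modification time, assuming each item is a dict shaped like a Graph `driveItem` with `name` and `lastModifiedDateTime` keys. The function name and the `now` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def filter_recent_files(items, minutes_ago=None, file_formats=None, now=None):
    """Keep items matching the extension list and modified within the window.

    `items` is assumed to be a list of dicts with `name` and
    `lastModifiedDateTime` keys. `minutes_ago=None` disables the time filter.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = None
    if minutes_ago is not None:
        cutoff = now - timedelta(minutes=minutes_ago)

    selected = []
    for item in items:
        name = item.get("name", "")
        extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if file_formats and extension not in file_formats:
            continue
        if cutoff is not None:
            # Graph timestamps end in "Z"; normalize for datetime.fromisoformat.
            modified = datetime.fromisoformat(
                item["lastModifiedDateTime"].replace("Z", "+00:00")
            )
            if modified < cutoff:
                continue
        selected.append(item)
    return selected
```

With `minutes_ago=None`, as in the cell above, every `.docx` file is returned. Passing a small value instead restricts the run to recently modified files, which is what makes incremental re-indexing possible.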

### 2. Chunking, Text Vectorization, and Indexing with `TextChunkingIndexing`

The `TextChunkingIndexing` class simplifies chunking, text vectorization, and indexing in Azure AI Search, which acts as the vector database. It uses Langchain as an orchestrator to simplify and enhance the text processing strategy. More about the Azure AI Search and LangChain integration [here](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-cognitive-search-and-langchain-a-seamless-integration-for/ba-p/3901448).

#### Key Features of `TextChunkingIndexing`:

- **Text Chunking**: Breaks down extensive text data into smaller chunks based on character count, facilitating easier analysis and indexing.
- **Customization**: Allows for the adjustment of chunk size and overlap, catering to various text processing needs.
- **Text Vectorization**: Transforms the chunked text into vector representations, essential for efficient indexing and retrieval.
- **Indexing to Vector Store**: The vectorized text is then indexed into Azure AI Search, a powerful vector database for storing and retrieving text data.

#### Importance of Fine-Tuning Chunk Size and Overlap:

- Fine-tuning chunk sizes and overlaps is critical for optimizing text retrieval quality, particularly in applications that depend on precise search relevance, such as RAG. More about fine-tuning and understanding relevance scores [here](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-cognitive-search-outperforming-vector-search-with-hybrid/ba-p/3929167).
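
To make the effect of chunk size and overlap concrete, here is a minimal plain-Python sketch of fixed-window character chunking. It is a simplified stand-in, not the Langchain splitter this notebook relies on, and `split_text` is a hypothetical helper. The values 1000 and 200 match the ones used later in this notebook.

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into character windows that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 250  # 2,500 characters
chunks = split_text(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlaps improve recall at the cost of more vectors to store and search.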

In [6]:
from gbb_ai.langchain_indexing import TextChunkingIndexing

client_indexing = TextChunkingIndexing()

In [7]:
# Instantiate the TextChunkingIndexing client
# This client is responsible for chunking text into smaller pieces using the Langchain framework; the chunks are then indexed by Azure AI Search.
client_indexing = TextChunkingIndexing()

client_indexing.setup_aoai()

client_indexing.load_embedding_model(deployment=DEPLOYMENT,model_name=MODEL_NAME)


2023-12-10 12:39:53,214 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model text-embedding-ada-002, deployment foundational-ada, and chunk size 1000 (langchain_indexing.py:load_embedding_model:102)
2023-12-10 12:39:53,232 - micro - MainProcess - INFO     OpenAIEmbeddings object created successfully. (langchain_indexing.py:load_embedding_model:115)


OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, async_client=None, model='text-embedding-ada-002', deployment='foundational-ada', openai_api_version='2023-05-15', openai_api_base='https://ml-workspace-dev-eastus-001-aoai.openai.azure.com/', openai_api_type='azure', openai_proxy='', embedding_ctx_length=8191, openai_api_key='d050ad8b96ef4ecbb5099eece1212a91', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=16, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=True, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None)

#### Quick Guide to Setting Up Azure Search Index (Optional)

Let's set up an Azure Search index tailored for advanced search capabilities, including semantic and vector-based searches. Here's a step-by-step guide:

##### Embedding Function Setup:

Define embedding_function to transform text into vectors. This powers the semantic search.

##### Define Index Fields:

Create fields like id, content, content_vector, and others in the fields list. Each field represents a document attribute.
Make sure content_vector aligns with your embedding function's output.

##### Initialize Azure Search Client:

Instantiate AzureSearch with your Azure endpoint, key, custom index_name, and the fields list.
Configure semantic settings to fine-tune search relevance.

##### Customize As Needed:

Modify fields based on your document attributes.
Adjust index_name or semantic configurations to fit your specific search needs.

```python
import os

from azure.search.documents.indexes.models import (
    SearchFieldDataType, SimpleField, SearchableField, SemanticSettings, SemanticConfiguration, PrioritizedFields, SemanticField
)
from azure.search.documents.models import Vector
from langchain.vectorstores.azuresearch import AzureSearch
from your_embedding_module import embeddings  # Replace with your actual module

# Embedding function and fields setup
embedding_function = embeddings.embed_query
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, filterable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, searchable=True),
    # ... other fields ...
]

# Azure Search client initialization
vector_store = AzureSearch(
    azure_search_endpoint=os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT"),
    azure_search_key=os.getenv("AZURE_SEARCH_ADMIN_KEY"),
    index_name="your-custom-index-name",
    embedding_function=embedding_function,
    fields=fields,
    # Semantic settings
    semantic_settings=SemanticSettings(
        default_configuration="config",
        configurations=[
            SemanticConfiguration(
                name="config",
                prioritized_fields=PrioritizedFields(
                    title_field=SemanticField(field_name="content"),
                    # ... other configurations ...
                ),
            )
        ],
    ),
)

# Now, your Azure Search index is ready for advanced querying!
```
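
The guide above asks that `content_vector` align with the embedding function's output. One way to check this before creating the index is to embed a short probe string and compare the vector length with the dimension configured on the field. The helper below is a hypothetical sketch; `fake_embed` stands in for `embeddings.embed_query`, and 1536 is the output size of `text-embedding-ada-002`.

```python
def check_vector_dimensions(embed_fn, expected_dimensions, probe_text="dimension probe"):
    """Embed a probe string and confirm its length matches the index field."""
    vector = embed_fn(probe_text)
    if len(vector) != expected_dimensions:
        raise ValueError(
            f"Embedding returned {len(vector)} dimensions, "
            f"but the index field expects {expected_dimensions}."
        )
    return len(vector)


# Stand-in embedder for illustration only; swap in embeddings.embed_query.
def fake_embed(text):
    return [0.0] * 1536


print(check_vector_dimensions(fake_embed, expected_dimensions=1536))  # 1536
```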

In [8]:
# this is the function that will load or create the index mentioned above
client_indexing.setup_azure_search(index_name="langchain-vector-demo-custom")

  from .autonotebook import tqdm as notebook_tqdm
100%|██████████| 1/1 [00:00<00:00,  1.54it/s]
100%|██████████| 1/1 [00:00<00:00, 15.91it/s]
2023-12-10 12:40:00,377 - micro - MainProcess - INFO     Azure Cognitive Search client configured successfully. (langchain_indexing.py:setup_azure_search:202)


<langchain.vectorstores.azuresearch.AzureSearch at 0x271375885e0>

In [9]:
# function to split the text from content_files (a list of Document objects) into chunks of 1000 characters and overlap of 200 characters
chuncks = client_indexing.split_documents_in_chunks(content_files, chunk_size=1000, chunk_overlap=200)

In [10]:
# Indexing chunks in Azure AI Search
client_indexing.embed_and_index(texts=chuncks)

100%|██████████| 1/1 [00:00<00:00, 14.49it/s]

### 3. Search with Security Trimming

In [1]:
import os
from dotenv import load_dotenv

# Define the target directory
target_directory = r'C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search'

# Load .env file
load_dotenv()

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search


In [2]:
INDEX_NAME = "langchain-vector-demo-custom" 

In [3]:
from gbb_ai.trimming_ai_search import AzureSearchManager

client_search = AzureSearchManager(index_name=INDEX_NAME)

In [4]:
user_groups = client_search.get_current_user_groups()
user_groups

['admins']

In [5]:
search_query = "LLM in International Business Law"

In [6]:
security_group="Group_criticaldfvdm;mvs"
security_group_list=["Group_critical","Group_high","Group_medium","Group_low"]
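
Security trimming in Azure AI Search is typically expressed as an OData filter on a field that stores the groups allowed to read each document. How `AzureSearchManager` builds its filter is not shown in this notebook; the sketch below is one way to do it with the `search.in` function, assuming a filterable index field named `security_group` (a hypothetical name). Set `is_collection=True` if the field is a collection of strings.

```python
def build_group_filter(groups, field_name="security_group", is_collection=False):
    """Build an OData filter that matches documents visible to any of `groups`."""
    if not groups:
        raise ValueError("At least one group is required to build a security filter")
    # Escape single quotes per OData string-literal rules.
    values = ",".join(group.replace("'", "''") for group in groups)
    if is_collection:
        return f"{field_name}/any(g: search.in(g, '{values}', ','))"
    return f"search.in({field_name}, '{values}', ',')"


security_group_list = ["Group_critical", "Group_high", "Group_medium", "Group_low"]
print(build_group_filter(security_group_list[:2]))
# search.in(security_group, 'Group_critical,Group_high', ',')
```

The comma is used as the delimiter here, so group names that themselves contain commas would need a different delimiter.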

In [7]:
results = client_search.secure_hybrid_search_rerank(search_query=search_query, security_group=security_group, top_k=5, azure_deployment_name="foundational-ada", semantic_configuration_name="config")

  from .autonotebook import tqdm as notebook_tqdm
100%|██████████| 1/1 [00:00<00:00,  1.74it/s]
2023-12-11 17:13:17,458 - micro - MainProcess - INFO     Search Results:
Result 1:
Score: 0.026480834931135178
Reranker Score: 2.258655071258545
Content: A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning. As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, 
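
The `Score` in the output above is presumably the fused hybrid score, with `Reranker Score` coming from the semantic reranker. Reciprocal Rank Fusion (RRF), the method linked in the introduction, merges the keyword and vector rankings by summing `1 / (k + rank)` for every list in which a document appears. The following plain-Python sketch illustrates the idea; the constant `k=60` and the document IDs are assumptions for the example, not values read from the service.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each hit adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


keyword_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 6))
```

Document `b` ranks first because it sits near the top of both lists, even though it is not first in the keyword ranking. RRF uses only ranks, which is why it can combine keyword and vector scores that are on different scales.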

In [18]:
results

['A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning. As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[4] They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corpora.[5]',
 "Flamin

In [None]:
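import logging

import requests

logger = logging.getLogger(__name__)


# Standalone version of the method fragment: the token is passed in
# explicitly instead of being read from `self.token`.
def get_user_groups(token, user_id):
    """Return the groups `user_id` belongs to via Microsoft Graph, or [] on failure."""
    if not token:
        logger.error("Access token is not available.")
        return []

    headers = {
        'Authorization': f'Bearer {token}',
        'Content-Type': 'application/json'
    }
    endpoint = f"https://graph.microsoft.com/v1.0/users/{user_id}/memberOf"

    try:
        response = requests.get(endpoint, headers=headers, timeout=30)
        if response.status_code != 200:
            logger.error(f"Error retrieving user groups: {response.status_code} {response.reason}")
            logger.error(f"Response content: {response.text}")
            return []
        # Graph wraps results in a "value" list; returning that list keeps
        # every code path on the same return type.
        return response.json().get("value", [])
    except (requests.RequestException, ValueError) as e:
        logger.error(f"Exception in retrieving user groups: {e}")
        return []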