## 📚 Prerequisites

Before running this notebook, configure SharePoint, register an application to handle Microsoft Graph API authentication, and set the required configuration parameters. The [README](README.md) describes these steps.

## 📋 Table of Contents

This notebook guides you through the following sections:

1. [**Create an Azure Cognitive Search Index**](#create-index): This index will store the content from a document hosted on SharePoint Online.

2. [**Initialize the `client_extractor` client**](#init-client): This client (instantiated as `client_scrapping` in the code cells) manages the connection to a SharePoint site through the Microsoft Graph REST API and retrieves the Site ID for the site.

3. [**Download and Process Content and Metadata**](#download-process): The `client_extractor` client provides several methods for this:
    - Download and process all `.docx` and `.pdf` files from a SharePoint site.
    - Download and process only `.docx` files from a specific SharePoint site that were modified or uploaded in the last 60 minutes.
    - Download and process files from a specific folder within a SharePoint site.
    - Download and process a specific file within a SharePoint site.

4. [**Ingest into Azure AI Search Index**](#ingest-index): The extracted content and metadata are ingested into the Azure AI Search Index for easy retrieval and search.

For more details, refer to the following resources:
- [Quickstart: Register an app with the Azure AD v2.0 endpoint](https://learn.microsoft.com/en-us/azure/active-directory/develop/console-app-quickstart?pivots=devlang-python)
- [Create a Demo SharePoint Online Environment](https://cdx.transform.microsoft.com/) (Note: this requires being either a Microsoft employee or a member of the [Microsoft Partner Program](https://partner.microsoft.com/dashboard/account/v3/enrollment/introduction/partnership))



### 🚀 Getting Started

#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

> Instructions for Windows users: 

1. **Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory.
   - Execute the following command to create the Conda environment using the `environment.yml` file:
     ```bash
     conda env create -f environment.yml
     ```
   - This command creates a Conda environment as defined in `environment.yml`.

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate sharepoint-indexing
     ```

> Instructions for Linux users (or Windows users with WSL or another Linux setup): 

1. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yml` file:
     ```bash
     make create_conda_env
     ```

2. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate sharepoint-indexing
     ```

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions for VSCode. These extensions provide support for running and editing Jupyter Notebooks within VSCode.

2. **Open the Notebook**:
   - Open the Jupyter Notebook file (`01-indexing-content.ipynb`) in VSCode.

3. **Attach Kernel to VSCode**:
   - After creating the Conda environment, it should be available in the kernel selection dropdown. This dropdown is located in the top-right corner of the VSCode interface.
   - Select your newly created environment (`sharepoint-indexing`) from the dropdown. This sets it as the kernel for running your Jupyter Notebooks.

4. **Run the Notebook**:
   - Once the kernel is attached, you can run the notebook by clicking on the "Run All" button in the top menu, or by running each cell individually.


These steps give you a dedicated Conda environment containing every dependency listed in `environment.yml`, with VSCode configured to run the notebook against it. To add packages or change versions, run `pip install` in a notebook cell or in a terminal with the environment activated, then restart the kernel so the changes take effect.

#### Setting the Target Directory

Before executing the notebook, modify `target_directory` to point to the location where you downloaded this code.

In [1]:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import (
    CorsOptions,
    SearchIndex,
    ComplexField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
)

# Define the target directory (change yours)
target_directory = (
    r"C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search"
)

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search
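
Hard-coding an absolute path means every user has to edit the cell above. As a minimal sketch, assuming the repository root can be recognized by the presence of `environment.yml` (the file used earlier to create the Conda environment), the hypothetical helper `find_repo_root` walks up from a starting directory until it finds that marker. It is not part of the repository's code.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, marker: str = "environment.yml") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Returns None when no ancestor contains the marker file.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None
```

In the notebook, `find_repo_root(Path.cwd())` could supply the value of `target_directory`, falling back to the manual path when it returns `None`.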


#### Configure Environment Variables 

Before running this notebook, configure the environment variables listed below. Keeping configuration in environment variables rather than in the notebook reduces the risk of accidentally committing secrets to version control.

Create a `.env` file in your project root (use the provided `.env.sample` as a template) and add the following variables:

```env
# Azure Active Directory Configuration
TENANT_ID='[Your Azure Tenant ID]'
CLIENT_ID='[Your Azure Client ID]'
CLIENT_SECRET='[Your Azure Client Secret]'

# SharePoint Site Configuration
SITE_HOSTNAME='[Your SharePoint Site Domain]'
SITE_NAME='[Your SharePoint Site Name]'

# Azure AI Search Service Configuration
SEARCH_SERVICE_ENDPOINT='[Your Azure Search Service Endpoint]'
SEARCH_INDEX_NAME='[Your Azure Search Index Name]'
SEARCH_ADMIN_API_KEY='[Your Azure Search Admin API Key]'
```

Replace the placeholders (e.g., [Your Azure Tenant ID]) with your actual values.

- `TENANT_ID`, `CLIENT_ID`, and `CLIENT_SECRET` come from your registered application. [Detailed steps here](README.md)
- `SITE_HOSTNAME` and `SITE_NAME` specify the SharePoint site from which data will be extracted.
- `SEARCH_SERVICE_ENDPOINT`, `SEARCH_INDEX_NAME`, and `SEARCH_ADMIN_API_KEY` are used to configure the Azure AI Search service.

> 📌 **Note**
> Remember not to commit the `.env` file to your version control system. Add it to your `.gitignore` file to prevent it from being tracked.
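
A missing variable otherwise surfaces later as a `KeyError` in the middle of a cell. The sketch below checks all eight variables up front. The helper name `missing_variables` is hypothetical, and treating blank values as missing is an assumption about what counts as "configured".

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the .env template above
REQUIRED_VARIABLES = (
    "TENANT_ID",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "SITE_HOSTNAME",
    "SITE_NAME",
    "SEARCH_SERVICE_ENDPOINT",
    "SEARCH_INDEX_NAME",
    "SEARCH_ADMIN_API_KEY",
)


def missing_variables(
    required: Iterable[str] = REQUIRED_VARIABLES,
    environ: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required names that are unset or blank in `environ`."""
    return [name for name in required if not environ.get(name, "").strip()]
```

Calling `missing_variables()` after `load_dotenv()` and raising an error when the list is non-empty makes configuration problems visible before any network call is made.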

## Create an Azure Cognitive Search Index <a id='create-index'></a>

In [2]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Read the service endpoint, index name, and admin key from the environment
endpoint = os.environ["SEARCH_SERVICE_ENDPOINT"]
index_name = os.environ["SEARCH_INDEX_NAME"]
credential = AzureKeyCredential(os.environ["SEARCH_ADMIN_API_KEY"])

# Client for document operations (upload, search) on a single index
search_client = SearchClient(
    endpoint=endpoint,
    index_name=index_name,
    credential=credential,
)

# Client for index management (create, delete); it is scoped to the service,
# so it takes no index name
admin_client = SearchIndexClient(endpoint=endpoint, credential=credential)

In [3]:
# Delete the index if it exists
try:
    result = admin_client.delete_index(os.environ["SEARCH_INDEX_NAME"])
    print("Index", os.environ["SEARCH_INDEX_NAME"], "Deleted")
except Exception as ex:
    print(ex)

Index langchain-vector-demo-custom Deleted


In [4]:
# Create the index
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        filterable=True,
        sortable=True,
        key=True,
    ),
    SearchableField(
        name="name", type=SearchFieldDataType.String, filterable=True, sortable=True
    ),
    SimpleField(
        name="created_datetime",
        type=SearchFieldDataType.DateTimeOffset,
        facetable=True,
        filterable=True,
        sortable=True,
    ),
    SearchableField(
        name="created_by",
        type=SearchFieldDataType.String,
        filterable=True,
        sortable=True,
    ),
    SimpleField(
        name="size",
        type=SearchFieldDataType.Int32,
        facetable=True,
        filterable=True,
        sortable=True,
    ),
    SimpleField(
        name="last_modified_datetime",
        type=SearchFieldDataType.DateTimeOffset,
        facetable=True,
        filterable=True,
        sortable=True,
    ),
    SearchableField(
        name="last_modified_by",
        type=SearchFieldDataType.String,
        filterable=True,
        sortable=True,
    ),
    ComplexField(
        name="read_access_entity",
        collection=True,
        fields=[
            SimpleField(
                name="list_item", type=SearchFieldDataType.String, searchable=True
            )
        ],
        searchable=True,
    ),
    SimpleField(name="source", type=SearchFieldDataType.String),
    SearchableField(
        name="content", type=SearchFieldDataType.String, analyzer_name="en.lucene"
    ),
]
cors_options = CorsOptions(allowed_origins=["*"], max_age_in_seconds=60)
scoring_profiles = []
suggester = [{"name": "sg", "source_fields": ["name"]}]

index = SearchIndex(
    name=os.environ["SEARCH_INDEX_NAME"],
    fields=fields,
    scoring_profiles=scoring_profiles,
    suggesters=suggester,
    cors_options=cors_options,
)

try:
    result = admin_client.create_index(index)
    print("Index", result.name, "created")
except Exception as ex:
    print(ex)

Index langchain-vector-demo-custom created


## Initialize the `client_extractor` client <a id='init-client'></a>

In [5]:
from gbb_ai.sharepoint_data_extractor import SharePointDataExtractor

# Instantiate the SharePointDataExtractor client.
# It wraps the Microsoft Graph REST API calls needed to reach SharePoint
# and exposes a simpler interface for data extraction.
client_scrapping = SharePointDataExtractor()

> 💡 **Note**
> The `get_site_id` and `get_drive_id` methods are optional. They are automatically called by the `retrieve_sharepoint_files_content` function. However, they are available for use if further analysis is required.

In [6]:
# Load environment variables from the .env file
client_scrapping.load_environment_variables_from_env_file()

# Authenticate with Microsoft Graph API
client_scrapping.msgraph_auth()

# Get the Site ID for the specified SharePoint site
site_id = client_scrapping.get_site_id(
    site_hostname=os.environ["SITE_HOSTNAME"], site_name=os.environ["SITE_NAME"]
)

# Get the Drive ID associated with the Site ID
drive_id = client_scrapping.get_drive_id(site_id)

2023-12-19 13:21:09,725 - micro - MainProcess - INFO     Successfully loaded environment variables: TENANT_ID, CLIENT_ID, CLIENT_SECRET (sharepoint_data_extractor.py:load_environment_variables_from_env_file:86)
2023-12-19 13:21:10,376 - micro - MainProcess - INFO     New access token retrieved. (sharepoint_data_extractor.py:msgraph_auth:118)
2023-12-19 13:21:10,377 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:187)
2023-12-19 13:21:10,890 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:191)
2023-12-19 13:21:11,507 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:208)
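
For readers who want to see what the two lookups correspond to at the HTTP level, here is a minimal sketch of the Microsoft Graph v1.0 URLs for resolving a site by hostname and path and for fetching its default document library. Whether `SharePointDataExtractor` builds its requests exactly this way is an assumption, as is the site living under the `/sites/` path. The function names are hypothetical.

```python
from urllib.parse import quote

GRAPH_BASE_URL = "https://graph.microsoft.com/v1.0"


def build_site_lookup_url(site_hostname: str, site_name: str) -> str:
    """Build the URL that resolves a site from its hostname and relative path.

    Assumes the site is served from /sites/<site_name>.
    """
    relative_path = quote(f"sites/{site_name}")  # keeps "/" and escapes spaces
    return f"{GRAPH_BASE_URL}/sites/{site_hostname}:/{relative_path}"


def build_drive_lookup_url(site_id: str) -> str:
    """Build the URL for the default drive (document library) of a site."""
    return f"{GRAPH_BASE_URL}/sites/{site_id}/drive"
```

A `GET` on either URL needs an `Authorization: Bearer <token>` header carrying the access token obtained during authentication.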


## Download and Process Content and Metadata <a id='download-process'></a>

> 💡 **Note**
> The `retrieve_sharepoint_files_content` method returns a list of dictionaries, with each dictionary representing a file and containing its SharePoint content and metadata. 
> Here's an example of the expected output:

```python
[
    {
        'content': 'LLM creators should exclude from their training data papers on creating or enhancing pathogens....',  # content
        'id': '01W3WT6PG5HFCYLSOAMNGIGWEBISZCI5X4',  # The unique identifier of the file
        'source': 'https://XXX.sharepoint.com/sites/XXX/_layouts/15/Doc.aspx?sourcedoc=%7B854539DD-C0C9-4C63-8358-8144B22476FC%7D&file=test3.docx&action=default&mobileredirect=true',  # The source URL of the file
        'name': 'test3.docx',  # The name of the file
        'size': 73576,  # The size of the file in bytes
        'created_by': 'System Administrator',  # The user who created the file
        'created_datetime': '2023-12-15T00:44:01Z',  # The date and time when the file was created
        'last_modified_datetime': '2023-12-15T00:44:15Z',  # The date and time when the file was last modified
        'last_modified_by': 'System Administrator',  # The user who last modified the file
        'read_access_entity': 'Contoso Visitors'  # The entity that has read access to the file
    },
    # ... more files ...
]
```
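
Because each dictionary is later uploaded as a search document, its keys need to line up with the fields defined on the index. The sketch below compares a document's keys with the ten field names from the index definition. The helper `check_document` is hypothetical, and it checks field names only, not value types.

```python
from typing import Any, Dict, List, Set

# Field names declared on the index in the "Create the index" cell
INDEX_FIELDS: Set[str] = {
    "id",
    "name",
    "created_datetime",
    "created_by",
    "size",
    "last_modified_datetime",
    "last_modified_by",
    "read_access_entity",
    "source",
    "content",
}


def check_document(doc: Dict[str, Any]) -> Dict[str, List[str]]:
    """Report index fields absent from `doc` and keys the index does not define."""
    keys = set(doc)
    return {
        "missing": sorted(INDEX_FIELDS - keys),
        "unexpected": sorted(keys - INDEX_FIELDS),
    }
```

Running this over the returned list before ingestion helps catch schema drift early, since a key that the index does not define would be expected to cause the upload to be rejected.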

In [7]:
# Download and process all `.docx` and `.pdf` files from a specific Site ID.
files_from_root_folder = client_scrapping.retrieve_sharepoint_files_content(
    site_hostname=os.environ["SITE_HOSTNAME"],
    site_name=os.environ["SITE_NAME"],
    file_formats=["docx", "pdf"],
)

2023-12-19 13:21:11,537 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:187)
2023-12-19 13:21:12,189 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:191)
2023-12-19 13:21:12,849 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:208)
2023-12-19 13:21:12,851 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:247)
2023-12-19 13:21:13,394 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:250)


In [8]:
# Download and process only `.docx` files from a specific SharePoint site that were modified or uploaded in the last 60 minutes.
files_from_root_folder_last_60_min = client_scrapping.retrieve_sharepoint_files_content(
    site_hostname=os.environ["SITE_HOSTNAME"],
    site_name=os.environ["SITE_NAME"],
    file_formats=["docx"],
    minutes_ago=60,
)

2023-12-19 13:21:16,391 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:187)
2023-12-19 13:21:16,905 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:191)
2023-12-19 13:21:17,495 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:208)
2023-12-19 13:21:17,495 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:247)
2023-12-19 13:21:18,041 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:250)
2023-12-19 13:21:18,041 - micro - MainProcess - ERROR    No files found in the site's drive (sharepoint_data_extractor.py:retrieve_sharepo
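
The "No files found" message above is the expected outcome when nothing changed inside the window. To illustrate the idea behind `minutes_ago`, here is a sketch of a recency check against timestamps in the format shown in the example output (for instance `2023-12-15T00:44:15Z`). This is an assumption about how such a filter could work, not the extractor's actual implementation, and `modified_within` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def modified_within(
    last_modified: str, minutes_ago: int, now: Optional[datetime] = None
) -> bool:
    """Return True if the ISO 8601 UTC timestamp falls within the last N minutes."""
    now = now or datetime.now(timezone.utc)
    # fromisoformat() on older Python versions does not accept a trailing "Z"
    modified = datetime.fromisoformat(last_modified.replace("Z", "+00:00"))
    return modified >= now - timedelta(minutes=minutes_ago)
```

Passing an explicit `now` keeps the check deterministic, which is useful when testing an incremental indexing schedule.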

In [9]:
# Download and process files from a specific folder within a SharePoint site.
selected_files_content = client_scrapping.retrieve_sharepoint_files_content(
    site_hostname=os.environ["SITE_HOSTNAME"],
    site_name=os.environ["SITE_NAME"],
    folder_path="/test/test2/test3/",
)

2023-12-19 13:21:18,073 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:187)
2023-12-19 13:21:18,602 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:191)
2023-12-19 13:21:19,191 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:208)
2023-12-19 13:21:19,191 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:247)
2023-12-19 13:21:19,872 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:250)
2023-12-19 13:21:32,808 - micro - MainProcess - INFO     Text extraction from PDF bytes was successful. (pdf_utils.py:extract_text_from_pd

In [10]:
# Download and process a specific file within a SharePoint site.
selected_file_content = client_scrapping.retrieve_sharepoint_files_content(
    site_hostname=os.environ["SITE_HOSTNAME"],
    site_name=os.environ["SITE_NAME"],
    folder_path="/test/test2/test3/",
    file_names=["test3.docx"],
)

2023-12-19 13:21:33,557 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:187)
2023-12-19 13:21:34,055 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:191)
2023-12-19 13:21:34,705 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:208)
2023-12-19 13:21:34,705 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:247)
2023-12-19 13:21:35,275 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:250)


In [11]:
selected_file_content

[{'content': 'A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning.\nAs autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[4] They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corpora.[

## Ingest into Azure AI Search Index <a id='ingest-index'></a>

> 💡 **Note**
> The `upload_documents` method is part of the Azure Cognitive Search SDK for Python. It's used to upload a batch of documents to an Azure AI Search index.
>
> This method accepts a list of documents, where each document is a dictionary that represents a JSON document. Each key-value pair in the dictionary corresponds to a field in the index schema.
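
Azure AI Search limits how many documents a single indexing request may carry (1,000 per batch at the time of writing; check the current service limits). For sites with more files than that, the list can be split before calling `upload_documents`. The helper `iter_batches` below is a hypothetical sketch of that splitting step and does not call the service.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(
    documents: List[Dict[str, Any]], batch_size: int = 1000
) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of `documents` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start : start + batch_size]
```

Each yielded batch can then be passed as `search_client.upload_documents(documents=batch)` inside a loop.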

In [12]:
# Single Document Upload
try:
    # 'search_client.upload_documents' expects a list of documents
    result = search_client.upload_documents(documents=selected_file_content)

    # Print the result for each document
    for res in result:
        print("Upload of new document succeeded: {}".format(res.succeeded))
except Exception as ex:
    print("Error in single document upload: ", ex)

Upload of new document succeeded: True


In [13]:
# Multiple Documents Upload
try:
    # 'search_client.upload_documents' can ingest multiple documents at once
    # 'selected_files_content' is a list of documents
    result = search_client.upload_documents(documents=selected_files_content)

    # Print the result for each document
    for res in result:
        print("Upload of new document succeeded: {}".format(res.succeeded))
except Exception as ex:
    print("Error in multiple documents upload: ", ex)

Upload of new document succeeded: True
Upload of new document succeeded: True
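
Printing `True` per document does not say which document a result belongs to. Each result returned by `upload_documents` also carries the document `key`, so failures can be collected by ID. A minimal sketch, with `failed_keys` as a hypothetical helper that relies only on the `key` and `succeeded` attributes:

```python
from typing import Any, Iterable, List


def failed_keys(results: Iterable[Any]) -> List[str]:
    """Return the document keys whose indexing result reports a failure."""
    return [res.key for res in results if not res.succeeded]
```

An empty list from `failed_keys(result)` means every document in the batch was accepted.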


In [15]:
# Use the Azure Cognitive Search SDK to perform a search on the index
# The search_text parameter is set to "LLMs are the best"
results = search_client.search(search_text="LLMs are the best")

# Iterate through the search results
for result in results:
    # Print the ID and name of each result
    print(f"ID: {result['id']}, Name: {result['name']}")

    # Print the first 100 characters of the content of each result
    print(f"Content: {result['content'][:100]}")

ID: 01W3WT6PG5HFCYLSOAMNGIGWEBISZCI5X4, Name: test3.docx
Content: A large language model (LLM) is a type of language model notable for its ability to achieve general-


**🎉 Hooray!** 

Success! 

In this example, we fetched a document named `test3.docx` from SharePoint, ingested it into Azure AI Search, and retrieved it with a search query. This demonstrates how SharePoint content can be indexed and searched with Azure AI Search. 
