## Prerequisites

Before running the notebook, set `target_directory` in the cell below to the path of the directory you want to work from.

In [1]:
import os

# Define the target directory (change this to your own path)
target_directory = r'C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search'

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\sharepoint-indexing-azure-cognitive-search


#### Configure Environment Variables 

In [2]:
# Local Variables Needed

# SharePoint Configuration
SITE_DOMAIN = 'mngenvmcap747548.sharepoint.com'  # Domain of your SharePoint site
SITE_NAME = 'Contoso'                           # Name of your SharePoint site

# Azure OpenAI Deployment Configuration
DEPLOYMENT = "foundational-ada"                 # Deployment name in your Azure OpenAI service, used for text embeddings
MODEL_NAME = "text-embedding-ada-002"           # Model name for text embeddings in the Azure OpenAI service

# Azure AI Search Index Configuration
INDEX_NAME = "langchain-vector-demo-custom"     # Name of the index in Azure AI Search (formerly Azure Cognitive Search)
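
The values above are hard-coded in the cell. A minimal sketch, assuming you would rather override them from environment variables without editing the notebook: the `load_config` helper and the choice of reusing the variable names as environment keys are hypothetical and not part of the original notebook.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the configuration cell above.
DEFAULTS = {
    "SITE_DOMAIN": "mngenvmcap747548.sharepoint.com",
    "SITE_NAME": "Contoso",
    "DEPLOYMENT": "foundational-ada",
    "MODEL_NAME": "text-embedding-ada-002",
    "INDEX_NAME": "langchain-vector-demo-custom",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the settings, letting non-empty environment values override the defaults."""
    env = os.environ if env is None else env
    return {key: env.get(key) or default for key, default in DEFAULTS.items()}


config = load_config()
print(sorted(config))
```

Passing a plain dictionary instead of `os.environ` (for example `load_config({"SITE_NAME": "Fabrikam"})`) makes the override behaviour easy to check in isolation.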

In [3]:
from gbb_ai.sharepoint_data_extractor import SharePointDataExtractor

# Instantiate the SharePointDataExtractor client
# The client wraps the Microsoft Graph API calls used to reach SharePoint,
# providing an easy-to-use interface for data extraction.
client_scrapping = SharePointDataExtractor()

In [4]:
client_scrapping.load_environment_variables_from_env_file()
client_scrapping.msgraph_auth()

2023-12-15 01:30:36,938 - micro - MainProcess - INFO     Successfully loaded environment variables: TENANT_ID, CLIENT_ID, CLIENT_SECRET (sharepoint_data_extractor.py:load_environment_variables_from_env_file:90)
2023-12-15 01:30:37,678 - micro - MainProcess - INFO     New access token retrieved. (sharepoint_data_extractor.py:msgraph_auth:122)


'eyJ0eXAiOiJKV1QiLCJub25jZSI6IndEWTljcDFlc3BPZVVsOTdsdDNfQVN3cWZqR1R3OWNpRFFfUExBNHBmM0kiLCJhbGciOiJSUzI1NiIsIng1dCI6IlQxU3QtZExUdnlXUmd4Ql82NzZ1OGtyWFMtSSIsImtpZCI6IlQxU3QtZExUdnlXUmd4Ql82NzZ1OGtyWFMtSSJ9.eyJhdWQiOiJodHRwczovL2dyYXBoLm1pY3Jvc29mdC5jb20iLCJpc3MiOiJodHRwczovL3N0cy53aW5kb3dzLm5ldC85NDk1ZDhjOS00ZWJiLTQxMDctYjkwNS1jN2I0NWQxYjdiN2EvIiwiaWF0IjoxNzAyNjI1MTM3LCJuYmYiOjE3MDI2MjUxMzcsImV4cCI6MTcwMjYyOTAzNywiYWlvIjoiRTJWZ1lGaWplemJxaE1rZk9WMU95eWRmSW5jekFnQT0iLCJhcHBfZGlzcGxheW5hbWUiOiJkZXYtZ3JhcGgiLCJhcHBpZCI6IjExODU4M2VlLTk0ZWQtNDVkZC04NzBiLTczNzg0MDQ1ZWIzNyIsImFwcGlkYWNyIjoiMSIsImlkcCI6Imh0dHBzOi8vc3RzLndpbmRvd3MubmV0Lzk0OTVkOGM5LTRlYmItNDEwNy1iOTA1LWM3YjQ1ZDFiN2I3YS8iLCJpZHR5cCI6ImFwcCIsIm9pZCI6IjRmNjE0Mzc0LTY1ZmEtNDVmYy04MzY5LWNiNjE2YTZmZTA4ZiIsInJoIjoiMC5BYjBBeWRpVmxMdE9CMEc1QmNlMFhSdDdlZ01BQUFBQUFBQUF3QUFBQUFBQUFBRExBQUEuIiwicm9sZXMiOlsiVGVhbXNBY3Rpdml0eS5SZWFkLkFsbCIsIlNoYXJlUG9pbnRUZW5hbnRTZXR0aW5ncy5SZWFkLkFsbCIsIlBlb3BsZS5SZWFkLkFsbCIsIkdyb3VwLlJlYWQuQWxsIiwiU2l0ZXMuUm
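
The log shows that `TENANT_ID`, `CLIENT_ID` and `CLIENT_SECRET` are loaded before a token is retrieved, which is consistent with the OAuth 2.0 client credentials flow. The sketch below shows what such a token request generally looks like against the Microsoft identity platform; it is an assumption about the flow, not the actual implementation inside `msgraph_auth`. The helper name `build_token_request` is hypothetical, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client-credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already granted to the app.
        "scope": "https://graph.microsoft.com/.default",
    }
    return Request(
        url,
        data=urlencode(form).encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; real ones would come from TENANT_ID, CLIENT_ID and CLIENT_SECRET.
request = build_token_request("<tenant-id>", "<client-id>", "<client-secret>")
print(request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) would return a JSON body containing an `access_token` field. Since `msgraph_auth()` appears to return the token, the notebook echoes it as the cell result; assigning the return value to a variable keeps it out of the saved output.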

In [5]:
site_id = client_scrapping.get_site_id(SITE_DOMAIN, SITE_NAME)
drive_id = client_scrapping.get_drive_id(site_id)

2023-12-15 01:30:37,714 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:191)
2023-12-15 01:30:38,303 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:195)
2023-12-15 01:30:39,168 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:212)
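
The site ID in the log is the composite identifier that Microsoft Graph uses for SharePoint sites (hostname, site collection ID, web ID). As a rough sketch of the two lookups, assuming the site lives under the `/sites/` managed path and that the default document library is the drive of interest: the helper names are hypothetical, and the real `get_site_id` and `get_drive_id` methods may build their requests differently.

```python
GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def site_lookup_url(site_domain: str, site_name: str) -> str:
    """URL that resolves a site by hostname and server-relative path."""
    # Assumption: the site is served from /sites/<name> (not, e.g., /teams/<name>).
    return f"{GRAPH_BASE}/sites/{site_domain}:/sites/{site_name}"


def default_drive_url(site_id: str) -> str:
    """URL of the default document library (drive) of a site."""
    return f"{GRAPH_BASE}/sites/{site_id}/drive"


print(site_lookup_url("mngenvmcap747548.sharepoint.com", "Contoso"))
print(default_drive_url("<site-id>"))
```

A `GET` on each URL with an `Authorization: Bearer <token>` header returns a JSON object whose `id` field holds the site ID or drive ID respectively.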


In [6]:
files = client_scrapping.retrieve_sharepoint_files_content(site_domain=SITE_DOMAIN, site_name=SITE_NAME, 
                                                           folder_path="/test/test2/test3/")

2023-12-15 01:30:39,195 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:191)
2023-12-15 01:30:39,904 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:195)
2023-12-15 01:30:40,653 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:212)
2023-12-15 01:30:40,655 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:251)
2023-12-15 01:30:41,433 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:254)
2023-12-15 01:30:44,379 - micro - MainProcess - INFO     Returning highest priority group: Group_critical (azure_search_security_trimming.py:get_highest_priority_group:51)
2023-12-15 01:30:55,564 - micro - MainProcess - INFO     Text extraction from PDF bytes was s
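
The `folder_path` argument restricts retrieval to one folder of the document library. A minimal sketch of how a folder path can be mapped to the Microsoft Graph endpoint that lists a folder's children, assuming path-based addressing on the drive: `folder_children_url` is a hypothetical helper, and the library's own `get_files_in_site` may use a different request.

```python
from urllib.parse import quote

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def folder_children_url(drive_id: str, folder_path: str) -> str:
    """URL that lists the items directly inside a folder of a drive."""
    # Drop leading and trailing slashes so "/a/b/" and "a/b" behave the same.
    clean = folder_path.strip("/")
    if not clean:
        # An empty path means the root of the document library.
        return f"{GRAPH_BASE}/drives/{drive_id}/root/children"
    # quote() escapes spaces and special characters but keeps "/" separators.
    return f"{GRAPH_BASE}/drives/{drive_id}/root:/{quote(clean)}:/children"


print(folder_children_url("<drive-id>", "/test/test2/test3/"))
```

Normalizing the slashes first avoids producing a `root://...` path when the caller passes a leading slash, as the notebook does with `"/test/test2/test3/"`.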

In [7]:
files

[{'page_content': 'A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning.\nAs autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word.[3] Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[4] They are thought to acquire knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corp