# Load and Enrich Content in Azure Cognitive Search

In this Jupyter Notebook, you create and run enrichment steps that unlock searchable content in Azure blobs. The notebook processes mixed content in Azure Storage, such as images and application files, using a skillset that analyzes the files and extracts text that becomes searchable in Azure Cognitive Search.
The reference sample can be found at [Tutorial: Use Python and AI to generate searchable content from Azure blobs](https://docs.microsoft.com/azure/search/cognitive-search-tutorial-blob-python).

Although this notebook uses only PDF files, the same pipeline works at a much larger scale, and Azure Cognitive Search supports a range of other file formats, including Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).

This notebook creates the following objects on your search service:

+ search index
+ data source
+ skillset
+ indexer

In the last step, you'll run queries against the search index to explore the text output that was generated for each blob.
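
As a rough sketch of how these four objects are addressed, each one lives under its own collection on the search service, and the notebook creates it with a PUT to `<endpoint>/<collection>/<name>`. The helper name `object_url` and the `COLLECTIONS` mapping are illustrative and not part of the original notebook; the endpoint is a placeholder.

```python
# Illustrative helper (an assumption, not from the original notebook):
# build the REST URL for each object that the notebook creates.
COLLECTIONS = {
    "data source": "datasources",
    "skillset": "skillsets",
    "index": "indexes",
    "indexer": "indexers",
}


def object_url(endpoint, kind, name):
    """Return the URL used to create or update one search object."""
    return f"{endpoint.rstrip('/')}/{COLLECTIONS[kind]}/{name}"


# Placeholder endpoint, for illustration only.
example_endpoint = "https://<YOUR-SEARCH-SERVICE-NAME>.search.windows.net/"
for kind, name in [
    ("data source", "cogsrch-py-datasource"),
    ("skillset", "cogsrch-py-skillset"),
    ("index", "cogsrch-py-index"),
    ("indexer", "cogsrch-py-indexer"),
]:
    print(kind, "->", object_url(example_endpoint, kind, name))
```

Stripping a trailing slash from the endpoint avoids a double slash in the final URL, whichever way the endpoint was written.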

This notebook calls the [Search REST APIs](https://docs.microsoft.com/rest/api/searchservice/), but you can also use the `azure-search-documents` client library in the Azure SDK for Python to perform the same steps. See this [Python quickstart](https://docs.microsoft.com/azure/search/search-get-started-python) for details.

To run this sample, you should have already uploaded the sample data to a blob container in an Azure Storage account. In this notebook, replace the placeholders for the search service endpoint, the admin API key, the Azure Storage connection string, and the blob container. Once you've provided all four values, you can run all cells, but the queries won't return results until the indexer has finished and the search index is loaded.
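
One way to catch a forgotten placeholder before any request is sent is a small check like the one below. This is a minimal sketch: the function name `unfilled_settings` and the `settings` dictionary are assumptions made for illustration, and the pattern only recognizes placeholders written in the `<YOUR-...>` style.

```python
import re

# Matches template values such as "<YOUR-ADMIN-API-KEY>".
PLACEHOLDER = re.compile(r"<YOUR-[A-Z-]+>")


def unfilled_settings(settings):
    """Return the names of settings that are empty or still hold a placeholder."""
    return sorted(name for name, value in settings.items()
                  if not value or PLACEHOLDER.search(value))


# Hypothetical settings dictionary; the values here are illustrative.
settings = {
    "endpoint": "https://<YOUR-SEARCH-SERVICE-NAME>.search.windows.net",
    "api_key": "<YOUR-ADMIN-API-KEY>",
    "connection_string": "<YOUR-BLOB-RESOURCE-CONNECTION-STRING>",
    "container": "pdf",
}

missing = unfilled_settings(settings)
if missing:
    print("Still to fill in:", ", ".join(missing))
```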

We recommend running each step and making sure it completes before moving on.

Reference:

https://learn.microsoft.com/en-us/azure/search/cognitive-search-tutorial-blob

In [1]:
import json
import requests
from pprint import pprint

In [2]:
# Define the names for the data source, skillset, index and indexer
datasource_name = "cogsrch-py-datasource"
skillset_name = "cogsrch-py-skillset"
index_name = "cogsrch-py-index"
indexer_name = "cogsrch-py-indexer"

In [3]:
# Set up the endpoint, request headers, and API version.
# Replace the placeholders with your search service name and admin API key.
# Avoid committing real keys to a shared notebook.
endpoint = 'https://<YOUR-SEARCH-SERVICE-NAME>.search.windows.net'
headers = {'Content-Type': 'application/json',
           'api-key': '<YOUR-ADMIN-API-KEY>'}
params = {'api-version': '2020-06-30'}

## Create Data Source (Blob container with the arXiv CS PDFs)

In [4]:
# Create a data source
# This data source points to your Azure Storage account.
# You should already have a blob container that contains the sample data

# Replace the placeholder with the connection string of your storage account
datasourceConnectionString = "<YOUR-BLOB-RESOURCE-CONNECTION-STRING>"

datasource_payload = {
    "name": datasource_name,
    "description": "Demo files to demonstrate cognitive search capabilities.",
    "type": "azureblob",
    "credentials": {
        "connectionString": datasourceConnectionString
    },
    "container": {
        "name": "pdf"
    }
}
r = requests.put(endpoint + "/datasources/" + datasource_name,
                 data=json.dumps(datasource_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

201
True
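
The two status codes you are likely to see from these PUT requests are 201 (the object was created) and 204 (an existing object with that name was updated), which is why re-running a cell can print 204 instead of 201. The helper below is a small sketch for making that explicit; the name `describe_put_result` is an assumption and not part of the original notebook.

```python
def describe_put_result(status_code):
    """Translate the status code of a create-or-update PUT into a short message."""
    if status_code == 201:
        return "created"
    if status_code == 204:
        return "updated (object already existed)"
    return f"failed (HTTP {status_code})"


for code in (201, 204, 400):
    print(code, "->", describe_put_result(code))
```

In the notebook you could call `describe_put_result(r.status_code)` after each `requests.put`, and inspect `r.text` when the request fails.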


## Create Skillset - OCR, Text Splitter, Language Detection, KeyPhrase extraction, Entity Recognition

In [9]:
# Replace the placeholder with the key of your Cognitive Services resource
cog_services_name = "cognitive-service-pabdyosydd7ta"
cog_services_key = "<YOUR-COGNITIVE-SERVICES-KEY>"

In [13]:
# Create a skillset
skillset_payload = {
    "name": skillset_name,
    "description": "Extract entities, detect language and extract key-phrases",
    "skills":
    [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            "description": "Extract text (plain and structured) from image.",
            "context": "/document/normalized_images/*",
            "defaultLanguageCode": "en",
            "detectOrientation": True,
            "inputs": [
                {
                  "name": "image",
                  "source": "/document/normalized_images/*"
                }
            ],
            "outputs": [
                {
                  "name": "text"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field. This is useful for PDF and other file formats that support embedded images.",
            "context": "/document",
            "insertPreTag": " ",
            "insertPostTag": " ",
            "inputs": [
                {
                  "name":"text", "source": "/document/extracted_content"
                },
                {
                  "name": "itemsToInsert", "source": "/document/normalized_images/*/text"
                },
                {
                  "name":"offsets", "source": "/document/normalized_images/*/contentOffset"
                }
            ],
            "outputs": [
                {
                  "name": "mergedText", 
                  "targetName" : "merged_text"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 4000,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/merged_text"
                }
            ],
            "outputs": [
                {
                    "name": "textItems",
                    "targetName": "pages"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "description": "If you have multilingual content, adding a language code is useful for filtering",
            "context": "/document",
            "inputs": [
                {
                  "name": "text",
                  "source": "/document/pages/*"
                }
            ],
            "outputs": [
                {
                  "name": "languageName",
                  "targetName": "language"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document/pages/*",
            "inputs": [
                {
                    "name": "text", 
                    "source": "/document/pages/*"
                }
            ],
            "outputs": [
                {
                    "name": "keyPhrases",
                    "targetName": "keyPhrases"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
            "context": "/document",
            "categories": ["Person", "Location", "Organization", "DateTime", "URL", "Email"],
            "minimumPrecision": 0.5, 
            "inputs": [
                {
                    "name": "text", 
                    "source": "/document/pages/*"
                }
            ],
            "outputs": [
                {
                    "name": "persons", 
                    "targetName": "persons"
                },
                {
                    "name": "locations", 
                    "targetName": "locations"
                },
                {
                    "name": "organizations", 
                    "targetName": "organizations"
                },
                {
                    "name": "dateTimes", 
                    "targetName": "dateTimes"
                },
                {
                    "name": "urls", 
                    "targetName": "urls"
                },
                {
                    "name": "emails", 
                    "targetName": "emails"
                }
            ]
        }
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": cog_services_name,
        "key": cog_services_key
    }
}

r = requests.put(endpoint + "/skillsets/" + skillset_name,
                 data=json.dumps(skillset_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

204
True
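
To see how the skills chain together, it helps to work out where each skill writes its output in the enrichment tree: the output lands under the skill's `context`, named by `targetName` (or by `name` when no `targetName` is given). The sketch below applies that rule to three abbreviated skills taken from the payload above; the helper name `output_paths` and the `sample_skills` list are illustrative, not part of the original notebook.

```python
def output_paths(skill):
    """List the enrichment-tree paths that a skill writes to."""
    context = skill.get("context", "/document")
    return [f"{context}/{out.get('targetName', out['name'])}"
            for out in skill["outputs"]]


# Abbreviated copies of three skills from the skillset (inputs omitted).
sample_skills = [
    {"@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
     "context": "/document/normalized_images/*",
     "outputs": [{"name": "text"}]},
    {"@odata.type": "#Microsoft.Skills.Text.MergeSkill",
     "context": "/document",
     "outputs": [{"name": "mergedText", "targetName": "merged_text"}]},
    {"@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
     "context": "/document/pages/*",
     "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}]},
]

for skill in sample_skills:
    skill_name = skill["@odata.type"].rsplit(".", 1)[-1]
    print(skill_name, "->", output_paths(skill))
```

These are the same paths that later skills consume as `source` values and that the indexer's output field mappings read from.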


## Create Index

The body of the request defines the schema of the search index. A fields collection requires one field to be designated as the key. For blob content, this field is often the "metadata_storage_path" that uniquely identifies each blob in the container.

In this schema, the "text" field receives OCR output, "content" receives merged output, "language" receives language detection output. Key phrases, entities, and several fields lifted from blob storage comprise the remaining entries.

In [14]:
# Create an index
# Queries operate over the searchable fields and filterable fields in the index
index_payload = {
    "name": index_name,
    "fields": [
        {
            "name": "text",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "false"
        },
        {
            "name": "content",
            "type": "Edm.String",
            "searchable": "true",
            "sortable": "false",
            "filterable": "false",
            "facetable": "false"
        },
        {
            "name": "language",
            "type": "Edm.String",
            "searchable": "true",
            "sortable": "true",
            "filterable": "true",
            "facetable": "false"
        },
        {
            "name": "keyPhrases",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "true"
        },
        {
            "name": "persons",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "true"
        },
        {
            "name": "locations",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "true"
        },
        {
            "name": "organizations",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "true"
        },
        {
            "name": "dateTimes",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "true"
        },
        {
            "name": "urls",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "false",
            "facetable": "false"
        },
        {
            "name": "emails",
            "type": "Collection(Edm.String)",
            "searchable": "true",
            "sortable": "false",
            "filterable": "true",
            "facetable": "true"
        },
        {
            "name": "metadata_storage_path",
            "type": "Edm.String",
            "key": "true",
            "searchable": "true",
            "sortable": "false",
            "filterable": "false",
            "facetable": "false"
        },
        {
            "name": "metadata_storage_name",
            "type": "Edm.String",
            "searchable": "true",
            "sortable": "false",
            "filterable": "false",
            "facetable": "false"
        }
    ]
}

r = requests.put(endpoint + "/indexes/" + index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)

204
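
Because every field definition repeats the same attributes, a small factory function can shorten the schema. This is only a sketch: `make_field` is a hypothetical helper, and it uses Python booleans (which `json.dumps` serializes as JSON `true`/`false`) rather than the quoted strings used in the cell above.

```python
import json


def make_field(name, edm_type="Edm.String", *, key=False, searchable=True,
               sortable=False, filterable=False, facetable=False):
    """Build one index field definition with explicit attribute flags."""
    field = {
        "name": name,
        "type": edm_type,
        "searchable": searchable,
        "sortable": sortable,
        "filterable": filterable,
        "facetable": facetable,
    }
    if key:
        field["key"] = True
    return field


# A few fields from the schema above, rebuilt with the helper.
fields = [
    make_field("content"),
    make_field("keyPhrases", "Collection(Edm.String)",
               filterable=True, facetable=True),
    make_field("metadata_storage_path", key=True),
]
print(json.dumps(fields, indent=1))
```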


## Create and Run the Indexer - (runs the pipeline)
This process takes about 30 minutes to load all the arXiv CS PDFs.

Call Create Indexer to drive the pipeline. The three components you have created thus far (data source, skillset, index) are inputs to an indexer. Creating the indexer on Azure Cognitive Search is the event that puts the entire pipeline into motion.

In [18]:
# Create an indexer
indexer_payload = {
    "name": indexer_name,
    "dataSourceName": datasource_name,
    "targetIndexName": index_name,
    "skillsetName": skillset_name,
    "fieldMappings": [
        {
          "sourceFieldName" : "metadata_storage_path",
          "targetFieldName" : "metadata_storage_path",
          "mappingFunction" : { "name" : "base64Encode" }
        },
        {
            "sourceFieldName": "metadata_storage_name",
            "targetFieldName": "metadata_storage_name"
        }
    ],
    "outputFieldMappings":
    [
        {
            "sourceFieldName": "/document/merged_text",
            "targetFieldName": "content"
        },
        {
            "sourceFieldName" : "/document/normalized_images/*/text",
            "targetFieldName" : "text"
        },
        {
            "sourceFieldName": "/document/language",
            "targetFieldName": "language"
        },
        {
            "sourceFieldName": "/document/pages/*/keyPhrases/*",
            "targetFieldName": "keyPhrases"
        },
        {
          "sourceFieldName" : "/document/persons", 
          "targetFieldName" : "persons"
        },
        {
          "sourceFieldName" : "/document/locations", 
          "targetFieldName" : "locations"
        },
        {
            "sourceFieldName": "/document/organizations",
            "targetFieldName": "organizations"
        },
        {
            "sourceFieldName": "/document/dateTimes",
            "targetFieldName": "dateTimes"
        },
        {
            "sourceFieldName": "/document/urls",
            "targetFieldName": "urls"
        },
        {
            "sourceFieldName": "/document/emails",
            "targetFieldName": "emails"
        }
    ],
    "parameters":
    {
        "maxFailedItems": -1,
        "maxFailedItemsPerBatch": -1,
        "configuration":
        {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages"
        }
    }
}

r = requests.put(endpoint + "/indexers/" + indexer_name,
                 data=json.dumps(indexer_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)
print(r.text)

201
True
{"@odata.context":"https://azure-cog-search-pabdyosydd7ta.search.windows.net/$metadata#indexers/$entity","@odata.etag":"\"0x8DB1D50C55629D3\"","name":"cogsrch-py-indexer","description":null,"dataSourceName":"cogsrch-py-datasource","skillsetName":"cogsrch-py-skillset","targetIndexName":"cogsrch-py-index","disabled":null,"schedule":null,"parameters":{"batchSize":null,"maxFailedItems":-1,"maxFailedItemsPerBatch":-1,"base64EncodeKeys":null,"configuration":{"dataToExtract":"contentAndMetadata","imageAction":"generateNormalizedImages"}},"fieldMappings":[{"sourceFieldName":"metadata_storage_path","targetFieldName":"metadata_storage_path","mappingFunction":{"name":"base64Encode","parameters":null}},{"sourceFieldName":"metadata_storage_name","targetFieldName":"metadata_storage_name","mappingFunction":null}],"outputFieldMappings":[{"sourceFieldName":"/document/merged_text","targetFieldName":"content","mappingFunction":null},{"sourceFieldName":"/document/normalized_images/*/text","target
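
Since the indexer ties the index and the skillset together, a quick consistency check before creating it can save a failed run: every `targetFieldName` in the mappings should exist as a field in the index. The sketch below uses trimmed-down sample payloads; the function name `missing_targets` and the sample dictionaries are assumptions for illustration.

```python
def missing_targets(index_payload, indexer_payload):
    """Return mapping targets that are not defined as fields in the index."""
    field_names = {field["name"] for field in index_payload["fields"]}
    mappings = (indexer_payload.get("fieldMappings", [])
                + indexer_payload.get("outputFieldMappings", []))
    targets = {mapping["targetFieldName"] for mapping in mappings}
    return sorted(targets - field_names)


# Trimmed-down sample payloads (the real ones have more fields and mappings).
sample_index = {"fields": [{"name": "content"}, {"name": "language"},
                           {"name": "metadata_storage_path"}]}
sample_indexer = {
    "fieldMappings": [
        {"sourceFieldName": "metadata_storage_path",
         "targetFieldName": "metadata_storage_path"},
    ],
    "outputFieldMappings": [
        {"sourceFieldName": "/document/merged_text", "targetFieldName": "content"},
        {"sourceFieldName": "/document/persons", "targetFieldName": "persons"},
    ],
}

print("Targets missing from the index:", missing_targets(sample_index, sample_indexer))
```

In the notebook, the same call would take `index_payload` and `indexer_payload` directly.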

In [29]:
# Optionally, get indexer status to confirm that it's running
r = requests.get(endpoint + "/indexers/" + indexer_name + "/status",
                 headers=headers, params=params)
# print(json.dumps(r.json(), indent=1))
print(r.status_code)
# lastResult can be null before the first run, so fall back to an empty dict
last_result = r.json().get('lastResult') or {}
print("Status:", last_result.get('status'))
print("Items Processed:", last_result.get('itemsProcessed'))
print(r.ok)

200
Status: inProgress
Items Processed: 6390
True
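
Rather than re-running the status cell by hand, you could poll until the run leaves the `inProgress` state. The following is a minimal sketch under stated assumptions: `wait_for_indexer` is a hypothetical helper, it takes any callable that returns the parsed status JSON, and the demonstration feeds it canned responses instead of calling the service.

```python
import time


def wait_for_indexer(get_status, poll_seconds=30, timeout_seconds=3600,
                     sleep=time.sleep):
    """Poll get_status() until the last run is no longer in progress.

    get_status must return the parsed JSON of the indexer status response.
    Returns the final status string, or raises TimeoutError.
    """
    waited = 0
    while True:
        last_result = get_status().get("lastResult") or {}
        status = last_result.get("status")
        if status and status != "inProgress":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("indexer still running after timeout")
        sleep(poll_seconds)
        waited += poll_seconds


# Demonstration with canned responses, so no network access is needed.
responses = iter([
    {"lastResult": None},
    {"lastResult": {"status": "inProgress", "itemsProcessed": 10}},
    {"lastResult": {"status": "success", "itemsProcessed": 25}},
])
final = wait_for_indexer(lambda: next(responses), poll_seconds=0,
                         sleep=lambda seconds: None)
print("Final status:", final)
```

In the notebook, `get_status` could be `lambda: requests.get(endpoint + "/indexers/" + indexer_name + "/status", headers=headers, params=params).json()`.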


In [23]:
# Query the service for the index definition
# Query responses can be verbose. If you get "Output exceeds the size limit. Open the full output data in a text editor", open the output in an editor.
# r = requests.get(endpoint + "/indexes/" + index_name,
#                  headers=headers, params=params)
# print(json.dumps(r.json(), indent=1))

In [22]:
# Query the index to return the contents of "organizations", created through Entity Recognition during enrichment
# For keyword search, replace the asterisk with comma-separated query terms: search=microsoft,azure
# r = requests.get(endpoint + "/indexes/" + index_name +
#                  "/docs?&search=*&$select=organizations", headers=headers, params=params)
# print(json.dumps(r.json(), indent=1))
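
As an alternative to concatenating the query string by hand, the query options can be collected in a dictionary and passed through the `params` argument of `requests.get`, which handles the URL encoding. This is a sketch only: `search_params` is a hypothetical helper, and it assumes the service accepts the percent-encoded form of the parameters.

```python
def search_params(search="*", select=None, api_version="2020-06-30"):
    """Build the query-string parameters for a GET request against /docs."""
    query = {"api-version": api_version, "search": search}
    if select:
        # $select takes a comma-separated list of field names
        query["$select"] = ",".join(select)
    return query


print(search_params(select=["organizations"]))
print(search_params(search="microsoft,azure",
                    select=["organizations", "metadata_storage_name"]))
```

In the notebook this could be used as `requests.get(endpoint + "/indexes/" + index_name + "/docs", headers=headers, params=search_params(select=["organizations"]))`.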