# Load CSVs (one-to-many) to Azure Cognitive Search

In this Jupyter Notebook, we create and run the steps to index a CSV file in which each row is an individual, independent record/document. Each row then becomes searchable in Azure Cognitive Search.
The reference documentation can be found at [Indexing blobs and files to produce multiple search documents](https://learn.microsoft.com/en-us/azure/search/search-howto-index-one-to-many-blobs).

By default, an indexer will treat the contents of a blob or file as a single search document. If you want a more granular representation in a search index, you can set parsingMode values to create multiple search documents from one blob or file.

We will use a public Blob Storage container that holds abstracts of ~52k medical publications about COVID-19 published in 2020. You can check the source website [HERE](https://www.ncbi.nlm.nih.gov/research/coronavirus/).

If you want to download the dataset, go [HERE](https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/topic_tagger/)

In [1]:
import os
import json
import requests

# Set the Data source connection string. This is the location of the CSV with the COVID articles on each line. 
# You can change it and use your own data
DATASOURCE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=storagedocs;AccountKey=FE1EheTvelSZ+DfkiGHYnqdoNQWOUlfNOPSmp7hNqJq5eHhTSCrPiRkSFUiqPSvVb+a9yh/XyxlR+AStrDuHXw==;EndpointSuffix=core.windows.net"
DATASOURCE_SAS_TOKEN = "?sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2024-05-24T08:46:47Z&st=2023-04-24T00:46:47Z&spr=https&sig=jttV8Xj2fBbzWklIZXCc%2BUroUoUygcXzS3XyFv%2F0XW0%3D"
BLOB_CONTAINER_NAME = "csvonetomany"

# Don't mess with this unless you really know what you are doing
AZURE_SEARCH_API_VERSION = '2021-04-30-Preview'

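# Replace the placeholders below with your own service credentials, or set the matching
# environment variables (the variable names used here are only a suggestion)
AZURE_SEARCH_ENDPOINT = os.environ.get("AZURE_SEARCH_ENDPOINT", "https://<your-search-service>.search.windows.net")
AZURE_SEARCH_KEY = os.environ.get("AZURE_SEARCH_KEY", "<your-admin-key>") # Make sure this is the MANAGEMENT (admin) key, not the query key
COG_SERVICES_NAME = os.environ.get("COG_SERVICES_NAME", "<your-cognitive-services-name>")
COG_SERVICES_KEY = os.environ.get("COG_SERVICES_KEY", "<your-cognitive-services-key>")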

In [3]:
# Define the names for the data source, skillset, index, and indexer
datasource_name = "cogsrch-datasource-csv"
skillset_name = "cogsrch-skillset-csv"
index_name = "cogsrch-index-csv"
indexer_name = "cogsrch-indexer-csv"

In [4]:
# Set up the request headers and query parameters shared by all REST calls
headers = {'Content-Type': 'application/json','api-key': AZURE_SEARCH_KEY}
params = {'api-version': AZURE_SEARCH_API_VERSION}
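
As a quick illustration of how `params` is used: `requests` appends it to the URL as a query string, so every call in this notebook targets `<endpoint>/<collection>/<name>?api-version=...`. The sketch below only builds such a URL with the standard library and does not call the service. The helper name `resource_url` and the example endpoint are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlencode


def resource_url(endpoint, collection, name, api_version):
    """Build the REST URL for a named search resource (hypothetical helper)."""
    query = urlencode({"api-version": api_version})
    return f"{endpoint.rstrip('/')}/{collection}/{name}?{query}"


# The endpoint below is a placeholder, not a real service
url = resource_url("https://my-search.search.windows.net", "datasources",
                   "cogsrch-datasource-csv", "2021-04-30-Preview")
print(url)
```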

## Create Data Source (Blob container with the Litcovid CSV data file)

In [5]:
# Create a data source
# You should already have a blob container that contains the sample data.

datasource_payload = {
    "name": datasource_name,
    "description": "Demo files to demonstrate cognitive search capabilities of one-to-many.",
    "type": "azureblob",
    "credentials": {
        "connectionString": DATASOURCE_CONNECTION_STRING
    },
    "container": {
        "name": BLOB_CONTAINER_NAME
    }
}
r = requests.put(AZURE_SEARCH_ENDPOINT + "/datasources/" + datasource_name,
                 data=json.dumps(datasource_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

201
True


## Create Skillset - Text Splitter, Language Detection
We will use Cognitive Services enrichment to split the text of each content field into chunks (pages) and to detect its language. We should always split the text, since we don't know how long the content of each row might be.
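
To make the idea of "pages" concrete, here is a deliberately simplified, local stand-in for the splitting step. It is not the algorithm used by the service's split skill, which is more careful about where it cuts; it only shows how a long text becomes a list of chunks bounded by a maximum length. The function name `split_into_pages` is hypothetical.

```python
def split_into_pages(text, max_page_length=5000):
    """Greedy whitespace-based chunking (simplified illustration only)."""
    pages, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # Accept the word if it fits, or if the page is still empty
        # (a single word longer than the limit is kept whole)
        if len(candidate) <= max_page_length or not current:
            current = candidate
        else:
            pages.append(current)
            current = word
    if current:
        pages.append(current)
    return pages


pages = split_into_pages("one two three four five", max_page_length=9)
print(pages)
```

In the index, the real skill's output plays the same role: a collection of strings stored in the `pages` field.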

In [6]:
# Create a skillset
skillset_payload = {
    "name": skillset_name,
    "description": "Splits Text and detect language",
    "skills":
    [
        {
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "description": "If you have multilingual content, adding a language code is useful for filtering",
            "context": "/document",
            "inputs": [
                {
                  "name": "text",
                  "source": "/document/abstract"
                }
            ],
            "outputs": [
                {
                  "name": "languageCode",
                  "targetName": "language"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "textSplitMode": "pages",
            "maximumPageLength": 5000, # 5000 is default
            "defaultLanguageCode": "en",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/abstract"
                },
                {
                    "name": "languageCode",
                    "source": "/document/language"
                }
            ],
            "outputs": [
                {
                    "name": "textItems",
                    "targetName": "pages"
                }
            ]
        }
    ],
    "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
        "description": COG_SERVICES_NAME,
        "key": COG_SERVICES_KEY
    }
}

r = requests.put(AZURE_SEARCH_ENDPOINT + "/skillsets/" + skillset_name,
                 data=json.dumps(skillset_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

201
True


## Inspect CSV file so we can understand the column types before creating the Index

In [11]:
# Read the CSV file directly from Blob Storage and inspect it with pandas
import pandas as pd
remote_file_path = "https://storagedocs.blob.core.windows.net/csvonetomany/train.csv"

In [12]:
df = pd.read_csv(remote_file_path+DATASOURCE_SAS_TOKEN)
print("No. of lines:",df.shape[0])
df.head()

No. of lines: 52419


| | pmid | journal | title | abstract | keywords | label | pub_type | authors | date1 | doi | date2 | label_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32410266 | J Med Virol | Immunoregulation with mTOR inhibitors to preve... | Coronavirus disease 2019 (COVID-19) has become... | ade;antibody-dependent enhancement;coronavirus... | Treatment;Mechanism | Journal Article;Systematic Review | Zheng, Yunfeng;Li, Renfeng;Liu, Shunai | | 10.1002/jmv.26009 | 2020-05-16 | title_abstract_abstract |
| 1 | 33052950 | PLoS One | Measuring the resilience of criminogenic ecosy... | This paper uses resilience as a lens through w... | | | Journal Article;Research Support, Non-U.S. Gov't | Borrion, Herve;Kurland, Justin;Tilley, Nick;Ch... | | 10.1371/journal.pone.0240077 | 2020-10-15 | abstract_only |
| 2 | 32589531 | Br J Hosp Med (Lond) | Pulmonary embolism in acute medicine: a case-b... | Pulmonary embolism remains an important cause ... | covid-19;catheter-directed thrombolysis;pulmon... | Prevention | Case Reports;Journal Article;Review | Stevenson, Alexander;Davis, Sarah;Murch, Nick | | 10.12968/hmed.2020.0300 | 2020-06-27 | title_abstract_abstract |
| 3 | 32835070 | Groundw Sustain Dev | A positive perspective during COVID-19 related... | The months from March to June refer as water c... | covid-19;groundwater;positive perspective;rain... | | Journal Article | Patni, Kiran;Jindal, Manoj Kumar | | 10.1016/j.gsd.2020.100420 | 2020-08-25 | abstract_only |
| 4 | 32620125 | J Transl Med | The timeline and risk factors of clinical prog... | BACKGROUND: The novel coronavirus disease 2019... | covid-19;clinical progression;pneumonia;retros... | Treatment;Diagnosis | Journal Article;Research Support, Non-U.S. Gov't | Wang, Fang;Qu, Mengyuan;Zhou, Xuan;Zhao, Kai;L... | | 10.1186/s12967-020-02423-8 | 2020-07-06 | title_abstract_abstract |


In [12]:
df.dtypes

pmid                int64
journal            object
title              object
abstract           object
keywords           object
label              object
pub_type           object
authors            object
date1             float64
doi                object
date2              object
label_category     object
dtype: object
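
The dtypes above are what guide the field types chosen for the index. As a rough sketch, the mapping below suggests an EDM type for each pandas dtype name. The mapping table and the function name `suggest_edm_types` are assumptions made for illustration, and the result is only a starting point: for example, the document key must be an `Edm.String`, which is why `pmid` ends up as the string field `id` in this notebook.

```python
# Assumed mapping from pandas dtype names to Azure Cognitive Search EDM types
DTYPE_TO_EDM = {
    "int64": "Edm.Int64",
    "float64": "Edm.Double",
    "bool": "Edm.Boolean",
    "object": "Edm.String",
}


def suggest_edm_types(dtypes):
    """Map {column: dtype name} to {column: EDM type}, defaulting to Edm.String."""
    return {col: DTYPE_TO_EDM.get(dtype, "Edm.String") for col, dtype in dtypes.items()}


# Small hand-written sample; with pandas you could pass
# {c: str(t) for c, t in df.dtypes.items()} instead
sample = {"pmid": "int64", "journal": "object", "date1": "float64"}
suggested = suggest_edm_types(sample)
print(suggested)
```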

## Create the Index
In Azure Cognitive Search, both blob indexers and file indexers support a delimitedText parsing mode for CSV files that treats each line in the CSV as a separate search document.
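
Conceptually, this is similar to reading the file with a CSV reader and emitting one dictionary per row. The sketch below mimics that locally with Python's `csv` module. The sample rows are made up for illustration, and the code is not what the indexer runs internally.

```python
import csv
import io

# Made-up sample with a header line and a quoted field containing a comma
sample_csv = io.StringIO(
    "pmid,title,abstract\n"
    '1,"First title","Abstract, with a comma"\n'
    '2,"Second title","Another abstract"\n'
)

# Each row becomes its own dictionary, keyed by the header names
documents = list(csv.DictReader(sample_csv))
for doc in documents:
    print(doc)
```

Each dictionary corresponds to one search document, with the header names serving as the source field names.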

In [13]:
index_payload = {
    "name": index_name,  
    "fields": [
        {"name": "id", "type": "Edm.String", "key": "true", "searchable": "false", "retrievable": "true", "facetable": "false", "filterable": "false", "sortable": "false"},
        {"name": "title", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "content", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "language", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "true", "filterable": "true", "facetable": "true"},
        {"name": "pages","type": "Collection(Edm.String)", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "journal", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "keywords", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "label", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "true", "filterable": "true", "sortable": "false"},
        {"name": "pub_type", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "authors", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "date1", "type": "Edm.Double", "searchable": "false", "retrievable": "true", "facetable": "true", "filterable": "true", "sortable": "true"},
        {"name": "doi", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "date2", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "false", "filterable": "true", "sortable": "false"},
        {"name": "label_category", "type": "Edm.String", "searchable": "true", "retrievable": "true", "facetable": "true", "filterable": "true", "sortable": "false"},
        {"name": "metadata_storage_name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "metadata_storage_path", "type":"Edm.String", "searchable": "false", "retrievable": "true", "filterable": "false", "sortable": "false"},
        {"name": "metadata_storage_last_modified", "type":"Edm.DateTimeOffset", "searchable": "false", "retrievable": "false", "filterable": "false", "sortable": "false"}
    ],
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": 
                        {
                            "fieldName": "title"
                        },
                    "prioritizedContentFields": [
                        { 
                            "fieldName":"content" 
                        }
                    ],
                    "prioritizedKeywordsFields": [
                        {
                          "fieldName": "keywords"
                        }
                    ]
                }
            }
        ]
    }
}

r = requests.put(AZURE_SEARCH_ENDPOINT + "/indexes/" + index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

201
True


## Create and Run the Indexer - (runs the pipeline)
To create a one-to-many indexer for CSV blobs, create or update an indexer definition that uses the delimitedText parsing mode.

In [14]:
indexer_payload = {
    "name": indexer_name,
    "dataSourceName": datasource_name,
    "targetIndexName": index_name,
    "skillsetName": skillset_name,
    "schedule" : { "interval" : "PT2H"},
    "fieldMappings": [
        {
          "sourceFieldName" : "pmid",
          "targetFieldName" : "id"
        },
        {
          "sourceFieldName" : "abstract",
          "targetFieldName" : "content"
        }
    ],
    "outputFieldMappings":
    [
        {
            "sourceFieldName": "/document/language",
            "targetFieldName": "language"
        },
        {
            "sourceFieldName": "/document/pages/*",
            "targetFieldName": "pages"
        }
    ],
    "parameters" : { 
        "configuration" : { 
            "dataToExtract": "contentAndMetadata",
            "parsingMode" : "delimitedText", 
            "firstLineContainsHeaders" : True,
            "delimitedTextDelimiter": ","
        } 
    }
}
r = requests.put(AZURE_SEARCH_ENDPOINT + "/indexers/" + indexer_name,
                 data=json.dumps(indexer_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

201
True


In [15]:
# Optionally, get the indexer status to confirm that it's running
r = requests.get(AZURE_SEARCH_ENDPOINT + "/indexers/" + indexer_name +
                 "/status", headers=headers, params=params)
print(r.status_code)
# lastResult can be empty before the first execution, so fall back to an empty dict
last_result = r.json().get('lastResult') or {}
print("Status:", last_result.get('status'))
print("Items Processed:", last_result.get('itemsProcessed'))
print(r.ok)

200
Status: inProgress
Items Processed: 5000
True


**When the indexer finishes running, all 52,419 rows will be indexed as separate documents in our search engine.**
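
If you would rather wait for completion than re-run the status cell by hand, a small polling loop is one option. The sketch below is a minimal example under stated assumptions: `wait_for_indexer` is a hypothetical helper, it takes any callable that returns the parsed status JSON, and it stops as soon as the reported status is something other than `inProgress`. The demonstration uses canned responses rather than real HTTP calls.

```python
import time


def wait_for_indexer(fetch_status, poll_seconds=30, max_polls=120):
    """Poll until the last run is no longer 'inProgress' or max_polls is reached."""
    last_result = {}
    for attempt in range(max_polls):
        # lastResult may be missing or null before the first execution
        last_result = fetch_status().get("lastResult") or {}
        status = last_result.get("status")
        if status is not None and status != "inProgress":
            break
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return last_result


# Offline demonstration with made-up responses
canned = iter([
    {"lastResult": None},
    {"lastResult": {"status": "inProgress", "itemsProcessed": 10}},
    {"lastResult": {"status": "success", "itemsProcessed": 20}},
])
final = wait_for_indexer(lambda: next(canned), poll_seconds=0)
print(final["status"], final["itemsProcessed"])
```

With the notebook's variables, `fetch_status` could be `lambda: requests.get(AZURE_SEARCH_ENDPOINT + "/indexers/" + indexer_name + "/status", headers=headers, params=params).json()`.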

# Reference

- https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs
- https://learn.microsoft.com/en-us/azure/search/knowledge-store-create-rest



# NEXT
Now that we have two separate indexes loaded with two different types of information, in the next notebook (Notebook 3) we will run a multi-index query, sort the results by the semantic reranker score from Azure Search, and then use OpenAI to interpret these results and give the best possible answer.