This Azure Function binds to an Azure Storage Blob container and triggers when a PDF file is stored. The function extracts text from the PDF file using pdf.js, applies the supplied rules to extract metadata from the text, and stores the result (text + metadata) in a DocumentDB collection, which can then be used as a data source for Azure Search.
- You must have an active Azure Subscription. If you do not, you can always start with a free trial.
- General understanding of
- The Azure Storage Explorer is handy for creating containers and uploading/deleting files.
- Postman is fantastic for making REST calls and storing them for re-use. A Postman collection for search is included in this repo.
- Optional (but recommended): Create a Resource Group to contain the resources you will create below.
- Create an Azure Storage Account (General-Purpose or Blob, either will work)
- Create a public or private (depending on your needs) blob container called `inbound`
  - I suggest using the Azure Storage Explorer tool for this.
- Create a DocumentDB service (using the SQL API) by following the "Create a database account" steps
  - Be sure to create a database called `documents` and a collection called `docs`
    - For testing, when creating your collection start with the smallest/cheapest configuration, which would be
      - Fixed Capacity (10GB)
      - 400 RUs
      - No partition key
- Create an Azure Function or use the Azure Function Command Line utility to run this function locally
  - If you fork this repo you can use the continuous deployment option.
  - I suggest using the `Consumption Plan` for testing and small/medium workloads.
  - Since you already have a storage account that you created previously, I suggest using that instead of creating a separate one.
- Create an Azure Search instance
  - Tip: Use the Free pricing tier for testing!
See functions/settings.sample.json for all Azure Function app settings. Rename this to `local.settings.json` if running under the Azure Function Command Line utility.
- In your Azure Function you will need to supply a few App Settings, specifically
  - `BlobStore`: The Azure Storage Account connection string, which you can find in the `Access keys` blade for your Storage Account. The function will trigger whenever a .pdf file is uploaded into the `inbound` container of this storage account. Again, I recommend using the Azure Storage Explorer for this.
    - Note: `AzureWebJobsStorage` & `BlobStore` can be the same or different storage accounts depending on your needs.
  - `DocumentDBConnectionString`: Connection string to your Azure DocumentDB database. This can be found in the `Keys` blade of your DocumentDB under `PRIMARY CONNECTION STRING`.
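Because Azure Functions exposes app settings to Node.js as environment variables, a quick check for the settings named above can catch a misconfigured app early. The following is a minimal sketch, not part of this repo: `missingSettings` and `REQUIRED_SETTINGS` are hypothetical names, and the list only covers the settings mentioned in this README.

```javascript
// Setting names taken from this README. functions/settings.sample.json
// may list more; this array is an assumption, not the full set.
const REQUIRED_SETTINGS = ['AzureWebJobsStorage', 'BlobStore', 'DocumentDBConnectionString'];

// Hypothetical helper: returns the names of required settings that are
// missing or blank in the given environment object.
function missingSettings(env) {
  return REQUIRED_SETTINGS.filter((name) => {
    const value = env[name];
    return typeof value !== 'string' || value.trim() === '';
  });
}

// Usage: warn with a readable message instead of failing later with an
// opaque connection error.
const missing = missingSettings(process.env);
if (missing.length > 0) {
  console.warn(`Missing app settings: ${missing.join(', ')}`);
}
```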
Once your function is running you can upload the sample/sample_doc.pdf into the `inbound` container, which will trigger the function. If it works you should see output in the Function logs like so:
Function started (Id=28dce41e-474c-42c8-9f9b-da82072ce4fb)
Updating document => dafe8948ef379e6aef78cc1b059122cebcae436d7dd878375f16094a99a9243b
Metadata found => {
"Title": "PDF to Text Function",
"Author": "Marc Gagne",
"Description": "An azure function that extracts text from PDFs, runs the regular expression captures found in rules.json \nagainst the text and stores the results in DocumentDB.",
"Technologies": [
"Azure Functions",
"pdf.js",
"JavaScript",
"Node.js"
]
}
Function completed (Success, Id=28dce41e-474c-42c8-9f9b-da82072ce4fb, Duration=286ms)
The rules.json file contains the regular expression rules that are matched against the extracted text. The matches are stored as metadata.
The format for a rule is
{
"key": "<Metadata Name>",
"type": "<Match Type>",
"expression": "<Regular Expression>",
"default": "<Default Value if no matches>",
"startKeyword": "Optional: <Keyword for substring match start>",
"endKeyword": "Optional: <Keyword for substring match end>",
"options": {
"flags": "<Optional RegularExpression Flags>"
}
}
This function uses the TextMeta module which is a text extraction and rules engine. To learn more about the rules/how the text is extracted please refer to the TextMeta GitHub repo.
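To make the rule format more concrete, here is a rough sketch of how a single rule could be evaluated against extracted text. It is not TextMeta's implementation: `applyRule` is a hypothetical helper, the `type` field is ignored because its allowed values are not described here, and the way `startKeyword`/`endKeyword` narrow the text is an assumption.

```javascript
// Hypothetical helper: applies one rule object to the extracted text and
// returns the captured value, or the rule's default when nothing matches.
function applyRule(rule, text) {
  let scope = text;

  // ASSUMPTION: the optional keywords limit matching to the text between them.
  if (rule.startKeyword) {
    const start = scope.indexOf(rule.startKeyword);
    if (start === -1) return rule.default;
    scope = scope.slice(start + rule.startKeyword.length);
  }
  if (rule.endKeyword) {
    const end = scope.indexOf(rule.endKeyword);
    if (end !== -1) scope = scope.slice(0, end);
  }

  const flags = (rule.options && rule.options.flags) || '';
  const match = new RegExp(rule.expression, flags).exec(scope);
  if (!match) return rule.default;

  // Prefer the first capture group, fall back to the whole match.
  const value = match[1] !== undefined ? match[1] : match[0];
  return value.trim();
}

const text = 'Title: PDF to Text Function \nAuthor: Marc Gagne \n';
const rule = {
  key: 'Author',
  expression: 'Author:\\s*(.+)',
  default: 'Unknown',
  options: { flags: 'i' },
};
console.log(applyRule(rule, text)); // Marc Gagne
```

For the real matching behaviour, including what each `type` does, the TextMeta repo remains the authoritative source.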
The result of processing the sample file in /sample/sample_doc.pdf using the sample rules.json is the following document being stored in DocumentDB:
{
"id": "dafe8948ef379e6aef78cc1b059122cebcae436d7dd878375f16094a99a9243b",
"name": "sample_doc.pdf",
"text": "Title: PDF to Text Function \nAuthor: Marc Gagne \n \nDescription: \nAn azure function that extracts text from PDFs, runs the regular expression captures found in rules.json \nagainst the text and stores the results in DocumentDB. \n \nTechnologies used: \n• Azure Functions \n• pdf.js \n• JavaScript \n• Node.js \n \nGitHub: https://github.com/m-gagne/PDF2AzSearch \n ",
"last_updated": "2017-05-23T20:10:31.653Z",
"meta": {
"Title": "PDF to Text Function",
"Author": "Marc Gagne",
"Description": "An azure function that extracts text from PDFs, runs the regular expression captures found in rules.json",
"Technologies": [
"Azure Functions",
"pdf.js",
"JavaScript",
"Node.js"
]
}
}
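The shape of the stored document can be sketched as a small builder function. This is illustrative only: `buildDocument` is a hypothetical helper, and the way the `id` is derived is an assumption. The sample above only shows that the id is 64 hex characters, which is consistent with a SHA-256 digest, so the sketch hashes the file name.

```javascript
const crypto = require('crypto');

// Hypothetical helper: assembles a document with the same fields as the
// sample above (id, name, text, last_updated, meta).
// ASSUMPTION: the id is a SHA-256 hex digest of the blob name. The README
// does not say how the real function computes it.
function buildDocument(name, text, meta, now = new Date()) {
  return {
    id: crypto.createHash('sha256').update(name).digest('hex'),
    name,
    text,
    last_updated: now.toISOString(),
    meta,
  };
}

const doc = buildDocument(
  'sample_doc.pdf',
  'Title: PDF to Text Function',
  { Title: 'PDF to Text Function' }
);
console.log(doc.id.length); // 64
```

A deterministic id like this would let a re-upload of the same file overwrite the earlier document rather than create a duplicate, which would fit the "Updating document" log line, though that is a guess about the actual behaviour.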
To configure search to index data from your DocumentDB collection, I highly recommend getting familiar with the Azure Search REST APIs, which I find more efficient (once you learn them) than using the portal/code.
The included search/PDF2Search.postman_collection.json Postman collection contains the basics required to create the data source (DocumentDB), the index (search schema) and the indexer (reads from data source and indexes data using the configured index) as well as a very simple search query.
- Open Postman, click `Import` and import the search/PDF2Search.postman_collection.json collection.
- Configure your environment variables in Postman to include
  - `DocDbConnectionString`: The connection string for your DocumentDB database.
    - Note: When setting your DocumentDB connection string as the data source, you will need to include the database name in the string like so: `AccountEndpoint=https://[your account name].documents.azure.com;AccountKey=[your account key];Database=[your database id]`
  - `SearchAdminKey`: Your Azure Search admin key (so you can create/delete data sources, indexes, indexers etc.)
  - `SearchAccountName`: The name of your Azure Search service
    - Note: This is just the name, not the full URL.
- In the `001 - Setup` folder, `Send` the `Create Data Source`, `Create Index` & `Create Indexer` requests.
  - You should look at the `Body` of each of these requests to better understand what they are doing.
- After a brief moment (give it a minute) you should now be able to run the `002 - Searches/Sample Query` request to search your document!
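If you would rather script these steps than use Postman, the two pieces that are easiest to get wrong are the `Database=` suffix on the connection string and the shape of the search request. The sketch below only builds those values and sends nothing. Both helper names are hypothetical, the index name is a placeholder you must supply, and the `api-version` value is an assumption that may differ from the one used in the Postman collection.

```javascript
// Hypothetical helper: appends the Database segment that the data source
// connection string needs, tolerating a trailing semicolon.
function withDatabase(connectionString, databaseId) {
  const base = connectionString.replace(/;+\s*$/, '');
  return `${base};Database=${databaseId}`;
}

// Hypothetical helper: builds (but does not send) a Search Documents request.
// ASSUMPTIONS: apiVersion is a placeholder default; indexName is whatever
// name your Create Index request used.
function buildSearchRequest({ accountName, adminKey, indexName, query, apiVersion = '2020-06-30' }) {
  const params = new URLSearchParams({ 'api-version': apiVersion, search: query });
  return {
    method: 'GET',
    url: `https://${accountName}.search.windows.net/indexes/${encodeURIComponent(indexName)}/docs?${params}`,
    headers: { 'api-key': adminKey },
  };
}

const exampleConnection = withDatabase(
  'AccountEndpoint=https://[your account name].documents.azure.com;AccountKey=[your account key];',
  'documents'
);
const exampleRequest = buildSearchRequest({
  accountName: 'my-search-service', // just the name, not the full URL
  adminKey: '[your admin key]',
  indexName: '[your index name]',
  query: 'azure',
});
console.log(exampleConnection);
console.log(exampleRequest.url);
```

The returned object can be passed to whichever HTTP client you prefer. The admin key travels in the `api-key` header, so avoid logging the full request.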