## Training a Form Recognizer model and extracting form data (Python)

### Before you start
 
Install Python. Do this either via the Microsoft Store (recommended) or via https://www.python.org/downloads/
***
Install a Python distribution, preferably Anaconda, via https://www.anaconda.com/distribution/
***
Install the required modules. To do this:
1. Launch the Anaconda prompt
2. Run the command `pip install azure`
3. Run the command `pip install python-dotenv`
***
Create your own '.env' file. This is where we will keep all of our secrets required to run this tutorial. To do this:
1. Go to your local directory that you are working in.
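2. Open a new file in Notepad. On the first line write `KEY=`, on the second `ENDPOINT=`, on the third `SAS_URL=`, on the fourth `FILE_PATH=`, and on the fifth `FILE_TYPE=`.
3. Save the file as '.env'. In the Save dialog, set **Save as type** to **All Files** so that Notepad does not append `.txt` to the name.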
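
If you would rather not create the file by hand, the same template can be written from Python. This is only a sketch: the helper name `write_env_template` is made up for illustration, and it deliberately leaves an existing '.env' untouched so that secrets you have already saved are not lost.

```python
from pathlib import Path

# The five settings this tutorial reads from the .env file.
ENV_KEYS = ["KEY", "ENDPOINT", "SAS_URL", "FILE_PATH", "FILE_TYPE"]


def write_env_template(directory="."):
    """Create an empty .env template in `directory` (hypothetical helper).

    Returns the path of the file. An existing .env is left untouched.
    """
    env_path = Path(directory) / ".env"
    if not env_path.exists():
        env_path.write_text("".join(f"{key}=\n" for key in ENV_KEYS))
    return env_path
```

Calling `write_env_template()` from your working directory creates the file there; you still need to fill in the values as described in the following sections.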
***
Have data to train your model. You must have either:
* A set of at least five forms of the same type. They can be of different file types but must be the same type of document; OR
* A single empty form with two filled-in forms. The empty form's file name needs to include the word "empty". 
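
The two options above can be checked against a list of file names before you upload anything. The sketch below is an illustration only: `training_set_is_valid` is a hypothetical helper, it treats "two filled-in forms" as "at least two", and it assumes the word "empty" is matched regardless of case.

```python
def training_set_is_valid(file_names):
    """Check file names against the two training options (hypothetical helper).

    Assumptions: "empty" is matched case-insensitively, and the second
    option is satisfied by one empty form plus two or more filled-in forms.
    """
    empty = [name for name in file_names if "empty" in name.lower()]
    filled = [name for name in file_names if "empty" not in name.lower()]
    has_five_forms = len(filled) >= 5
    has_empty_plus_two = len(empty) == 1 and len(filled) >= 2
    return has_five_forms or has_empty_plus_two


print(training_set_is_valid(["EmptyPage1.png", "Form1.png", "Form2.png"]))  # True
print(training_set_is_valid(["Form1.png", "Form2.png"]))                    # False
```

The check only looks at names and counts; it cannot tell whether the forms really are of the same document type.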
***

### Set up your Form Recognizer resource group
Go to the Azure portal (https://portal.azure.com/) and create a new resource group to hold your Form Recognizer resource and your storage account. Once you have created the resource group, create a new Form Recognizer resource in it. Then create a Storage Account, add a blob container to it, and upload your training forms to that container.

Now that we have established the resources that we need, we can go ahead and find and save our secrets. 

#### Saving your key and endpoint values
From within your Form Recognizer resource, select the **Quick start** tab to view your subscription data. Save the **Key** and **Endpoint** values to your .env file as the `KEY` and `ENDPOINT` values respectively.

#### Saving your SAS URL 
Navigate to your Storage Account resource. Open the **Storage Explorer** tab and right-click on your container. Select **Get shared access signature**. Make sure that the **Read** and **List** permissions are checked, and click **Create**. Then copy the value in the **URL** section and paste it into the `SAS_URL` value in your .env file. Save your .env file.

If you do not have access to the **Storage Explorer** tab, navigate further down to the **Shared access signature** tab (under **Settings**). Make sure that **_all_** of the checkboxes are selected, and leave all other settings as they are. Press **Generate SAS and connection string**. Copy the **Blob service SAS URL** and save it to your .env file. You will have to alter the SAS URL so that it includes the name of your container; see the note below.

**_Note_**: Make sure that your SAS URL is of the form: `https://<storage account>.blob.core.windows.net/<container name>?<SAS value>`. If you have gotten your URL via the **Shared access signature** tab, your URL will _not_ include the `<container name>` so you will need to add this in order for your URL to work correctly. You can check the name of your container by going to your Storage Account resource, and navigating to **Blob service** > **Containers** in the side bar. 
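
Editing the URL by hand is error-prone, so here is a minimal sketch of doing it with the standard library. The function name `add_container_to_sas_url` and the account, container, and SAS values in the example are placeholders, not values from this tutorial.

```python
from urllib.parse import urlsplit, urlunsplit


def add_container_to_sas_url(blob_service_sas_url, container_name):
    """Insert the container name into a Blob service SAS URL (hypothetical helper).

    A URL that already has a path (that is, already names a container)
    is returned unchanged. The query string is kept exactly as it is.
    """
    parts = urlsplit(blob_service_sas_url)
    if parts.path.strip("/"):
        return blob_service_sas_url
    return urlunsplit(parts._replace(path="/" + container_name))


# Placeholder account name, container name, and SAS value.
print(add_container_to_sas_url(
    "https://myaccount.blob.core.windows.net/?sv=2019-02-02&sig=PLACEHOLDER",
    "mycontainer",
))
# https://myaccount.blob.core.windows.net/mycontainer?sv=2019-02-02&sig=PLACEHOLDER
```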

***

### Finalising your .env file
There are two variables still unpopulated in your .env file. These should be `FILE_PATH` and `FILE_TYPE`. These two values refer to the file that you would like to analyze using your trained model. 
* `FILE_PATH` is the file path of your form (for example, `C:\temp\file.pdf`). The analyze code in this tutorial opens the file from local disk, so a URL to a remote file will not work without changing that code.
* `FILE_TYPE` is the form's file type. Supported types: `application/pdf`, `image/jpeg`, `image/png`, `image/tiff`. 
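
`FILE_TYPE` has to match the file behind `FILE_PATH`, so it can be derived from the file extension instead of typed by hand. A small sketch, assuming the extension reflects the real file format; `guess_file_type` is a hypothetical helper and the extension list is my own mapping onto the four supported types.

```python
from pathlib import Path

# File extensions mapped onto the four supported content types.
SUPPORTED_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def guess_file_type(file_path):
    """Return the FILE_TYPE value for `file_path` (hypothetical helper)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return SUPPORTED_TYPES[suffix]


print(guess_file_type(r"C:\temp\file.pdf"))  # application/pdf
```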

### Running the code 
We should now be ready to run the code below. The first block imports the modules that the code needs. The second block loads the values from your .env file into environment variables and prints `True` once they are loaded.

In [1]:
import json
import os
import time

from dotenv import load_dotenv
from requests import get, post

In [2]:
load_dotenv(verbose=True)

True
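
A result of `True` only tells you that the file was loaded, not that every value has been filled in. The sketch below checks for blank or missing settings before any request is sent. The helper `missing_settings` and the two lists are illustrative names, grouped by which step of this tutorial reads which variable.

```python
import os

# Which settings each step of the tutorial reads.
REQUIRED_FOR_TRAINING = ["KEY", "ENDPOINT", "SAS_URL"]
REQUIRED_FOR_ANALYSIS = ["KEY", "ENDPOINT", "FILE_PATH", "FILE_TYPE"]


def missing_settings(names, env=None):
    """Return the names in `names` that are unset or blank in `env`.

    `env` defaults to os.environ, which is where load_dotenv puts the values.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Example with a hand-made mapping instead of the real environment.
print(missing_settings(REQUIRED_FOR_TRAINING, {"KEY": "abc", "ENDPOINT": ""}))
# ['ENDPOINT', 'SAS_URL']
```

In the notebook you would call `missing_settings(REQUIRED_FOR_TRAINING)` right after `load_dotenv` and stop if the list is not empty.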

***
#### Train a Form Recognizer model
The following block of code calls the **Train Custom Model** API. This will train a Form Recognizer model with the documents that are in your Azure blob container.


In [3]:
########### Python Form Recognizer Labeled Async Train #############

# Endpoint URL
# Strip any trailing slash so the joined URL does not contain "//".
endpoint = os.getenv("ENDPOINT").rstrip("/")
post_url = endpoint + "/formrecognizer/v2.0-preview/custom/models"
source = os.getenv("SAS_URL")
prefix = ""
includeSubFolders = False
useLabelFile = False

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': os.getenv("KEY"),
}

body = {
    "source": source,
    "sourceFilter": {
        "prefix": prefix,
        "includeSubFolders": includeSubFolders
    },
    "useLabelFile": useLabelFile
}

try:
    resp = post(url = post_url, json = body, headers = headers)
    if resp.status_code != 201:
        print("POST model failed (%s):\n%s" % (resp.status_code, json.dumps(resp.json())))
        quit()
    print("POST model succeeded:\n%s" % resp.headers)
    get_url = resp.headers["location"]
except Exception as e:
    print("POST model failed:\n%s" % str(e))
    quit()

POST model succeeded:
{'Content-Length': '0', 'Location': 'https://formrecognisertest-kawar.cognitiveservices.azure.com/formrecognizer/v2.0-preview/custom/models/277d882b-16e4-40d6-8faf-17dedf9ee956', 'x-envoy-upstream-service-time': '38', 'apim-request-id': '1c601a20-6cfa-4254-bc72-a9a8adab68d4', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'Date': 'Wed, 08 Apr 2020 01:59:33 GMT'}


***
#### Get the training results
The code below polls the model URL from the `Location` header of the previous response until training finishes, then prints the training results and the model ID.

In [4]:
n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = headers)
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET model failed (%s):\n%s" % (resp.status_code, json.dumps(resp_json)))
            quit()
        model_status = resp_json["modelInfo"]["status"]
        if model_status == "ready":
            print("Training succeeded:\n%s" % json.dumps(resp_json, indent=4, sort_keys=True))
            modelID = resp_json["modelInfo"]["modelId"]
            print("ModelID:\n%s" % modelID)
            break
        if model_status == "invalid":
            print("Training failed. Model is invalid:\n%s" % json.dumps(resp_json))
            quit()
        # Training still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)     
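    except Exception as e:
        msg = "GET model failed:\n%s" % str(e)
        print(msg)
        quit()
else:
    # Every attempt was used up and the model never became ready.
    print("Training did not finish after %s tries." % n_tries)
    quit()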

Training succeeded:
{
    "modelInfo": {
        "createdDateTime": "2020-04-08T01:59:33Z",
        "lastUpdatedDateTime": "2020-04-08T01:59:44Z",
        "modelId": "277d882b-16e4-40d6-8faf-17dedf9ee956",
        "status": "ready"
    },
    "trainResult": {
        "errors": [],
        "trainingDocuments": [
            {
                "documentName": "EmptyPage1.png",
                "errors": [],
                "pages": 1,
                "status": "succeeded"
            },
            {
                "documentName": "Form1.png",
                "errors": [],
                "pages": 1,
                "status": "succeeded"
            },
            {
                "documentName": "Form2.png",
                "errors": [],
                "pages": 1,
                "status": "succeeded"
            }
        ]
    }
}
ModelID:
277d882b-16e4-40d6-8faf-17dedf9ee956
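
The polling loop above doubles its wait after every attempt, capped at `max_wait_sec`. To see how long it can keep waiting in total, the schedule can be reproduced on its own. This is a sketch that mirrors the loop's arithmetic; `wait_schedule` is a name chosen for illustration, and the total excludes the time spent on the requests themselves.

```python
def wait_schedule(n_tries=15, wait_sec=5, max_wait_sec=60):
    """Return the sleeps (in seconds) the polling loop would use, in order."""
    waits = []
    for _ in range(n_tries):
        waits.append(wait_sec)
        wait_sec = min(2 * wait_sec, max_wait_sec)
    return waits


schedule = wait_schedule()
print(schedule)       # [5, 10, 20, 40, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60, 60]
print(sum(schedule))  # 735
```

With the default settings the loop therefore sleeps for at most 735 seconds, a little over twelve minutes. Raise `n_tries` or `max_wait_sec` if your training set needs longer.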


***
#### Analyse your forms for key-value pairs and tables
Now we'll use the newly trained model to analyze a document and extract key-value pairs and tables from it. We do this by calling the **Analyze Form** API. 


In [5]:
########### Python Form Recognizer Async Analyze #############

# Endpoint URL
apim_key = os.getenv("KEY")
post_url = endpoint + "/formrecognizer/v2.0-preview/custom/models/%s/analyze" % modelID
source = os.getenv("FILE_PATH")
params = {
    "includeTextDetails": True
}

headers = {
    # Request headers
    'Content-Type': os.getenv("FILE_TYPE"),
    'Ocp-Apim-Subscription-Key': apim_key,
}
with open(source, "rb") as f:
    data_bytes = f.read()

try:
    resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
    if resp.status_code != 202:
        print("POST analyze failed:\n%s" % json.dumps(resp.json()))
        quit()
    print("POST analyze succeeded:\n%s" % resp.headers)
    get_url = resp.headers["operation-location"]
except Exception as e:
    print("POST analyze failed:\n%s" % str(e))
    quit()

POST analyze succeeded:
{'Content-Length': '0', 'Operation-Location': 'https://formrecognisertest-kawar.cognitiveservices.azure.com/formrecognizer/v2.0-preview/custom/models/277d882b-16e4-40d6-8faf-17dedf9ee956/analyzeresults/1d3ac21a-1448-4fc5-8081-27f7fec8955f', 'x-envoy-upstream-service-time': '179', 'apim-request-id': '00e25250-1f42-4506-b660-45bd855ba42f', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'Date': 'Wed, 08 Apr 2020 01:59:54 GMT'}
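
The cell above reads `FILE_PATH` from local disk with `open()`, so a remote URL would fail at that point. One way to support both cases is to build the request differently depending on the source. This is a sketch, not the tutorial's method: `build_analyze_request` is a hypothetical helper, and it assumes that the Analyze Form API accepts a JSON body of the form `{"source": "<url>"}` for remote files, which you should confirm in the API reference for your service version.

```python
from urllib.parse import urlsplit


def build_analyze_request(source, file_type, apim_key):
    """Return keyword arguments for requests.post (hypothetical helper).

    Assumption: remote files are submitted as a JSON body {"source": url}
    with Content-Type application/json. Local files are sent as raw bytes
    with the given file_type, as in the cell above.
    """
    headers = {"Ocp-Apim-Subscription-Key": apim_key}
    if urlsplit(source).scheme in ("http", "https"):
        headers["Content-Type"] = "application/json"
        return {"headers": headers, "json": {"source": source}}
    headers["Content-Type"] = file_type
    with open(source, "rb") as f:
        return {"headers": headers, "data": f.read()}
```

The result would be passed on as `post(url=post_url, params=params, **build_analyze_request(source, os.getenv("FILE_TYPE"), apim_key))`.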


***
#### Get the Analyze results
The code in cell `In [6]` polls the URL from the `Operation-Location` header until the analysis finishes, then prints the results as JSON.
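
Once cell `In [6]` has stored the response in `resp_json`, the key-value pairs can be pulled out of the nested JSON rather than read off the printed output. The sketch below uses a small hand-made sample in the shape of that response. `extract_key_value_pairs` is a hypothetical helper, the value text in the sample is a placeholder, and the presence of a `text` field on each `value` object is an assumption based on the `key` object shown in the output.

```python
def extract_key_value_pairs(resp_json):
    """Flatten keyValuePairs from every page into (key, value, confidence) tuples.

    Assumption: each pair's "value" object carries a "text" field,
    just as its "key" object does.
    """
    pairs = []
    for page in resp_json.get("analyzeResult", {}).get("pageResults", []):
        for pair in page.get("keyValuePairs", []):
            pairs.append((
                pair["key"]["text"],
                pair["value"].get("text", ""),
                pair.get("confidence"),
            ))
    return pairs


# Hand-made sample; only the key text and confidence come from the tutorial output.
sample = {
    "analyzeResult": {
        "pageResults": [
            {"keyValuePairs": [
                {"confidence": 0.7,
                 "key": {"text": "Latitude:"},
                 "value": {"text": "<value text>"}},
            ]}
        ]
    }
}
print(extract_key_value_pairs(sample))
# [('Latitude:', '<value text>', 0.7)]
```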

In [6]:
n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET analyze results failed:\n%s" % json.dumps(resp_json))
            quit()
        status = resp_json["status"]
        if status == "succeeded":
            print("Analysis succeeded:\n%s" % json.dumps(resp_json,indent=4, sort_keys=True))
            break
        if status == "failed":
            print("Analysis failed:\n%s" % json.dumps(resp_json))
            quit()
        # Analysis still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)     
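    except Exception as e:
        msg = "GET analyze results failed:\n%s" % str(e)
        print(msg)
        quit()
else:
    # Every attempt was used up and the analysis never completed.
    print("Analysis did not finish after %s tries." % n_tries)
    quit()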


Analysis succeeded:
{
    "analyzeResult": {
        "documentResults": [],
        "errors": [],
        "pageResults": [
            {
                "clusterId": 0,
                "keyValuePairs": [
                    {
                        "confidence": 0.7,
                        "key": {
                            "boundingBox": [
                                139,
                                180,
                                181,
                                180,
                                181,
                                194,
                                139,
                                194
                            ],
                            "elements": [
                                "#/readResults/0/lines/13/words/0"
                            ],
                            "text": "Latitude:"
                        },
                        "value": {
                            "boundingBox": [
                                