## Training a Form Recognizer model and extracting form data (Python)

### Before you start
 
Install Python. Do this either via the Microsoft Store (recommended) or via https://www.python.org/downloads/
***
Install a Python distribution, preferably Anaconda via https://www.anaconda.com/distribution/
***
Install the required modules. To do this:
1. Launch the Anaconda prompt
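2. Run the command `pip install requests` (the code in this guide calls the REST API directly through the `requests` library; the `azure` meta-package is deprecated and can no longer be installed)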
3. Run the command `pip install python-dotenv`
***
Have data to train your model. You must have either:
* A set of at least five forms of the same type. They can be of different file types but must be the same type of document; OR
* A single empty form with two filled-in forms. The empty form's file name needs to include the word "empty". 
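
As a quick local sanity check before uploading, the two options above can be sketched as a small function. This is only an illustration: `has_enough_training_data` is a hypothetical helper, it looks at file names only, and it assumes the second option means exactly one file with "empty" in its name plus at least two filled-in forms.

```python
def has_enough_training_data(file_names):
    """Return True if the file names satisfy either training-set option."""
    names = [name.lower() for name in file_names]
    empty_forms = [n for n in names if "empty" in n]
    filled_forms = [n for n in names if "empty" not in n]
    # Option 1: at least five filled-in forms of the same type.
    if len(filled_forms) >= 5:
        return True
    # Option 2 (assumed reading): one empty form plus two or more filled-in forms.
    return len(empty_forms) == 1 and len(filled_forms) >= 2


print(has_enough_training_data(["invoice_empty.pdf", "invoice_1.pdf", "invoice_2.jpg"]))
```

The check cannot tell whether the files really are the same type of document, so it complements a manual review rather than replacing it.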
***

### Set up your Form Recognizer resource group
Go to the Azure portal (https://portal.azure.com/) and create a new resource group to hold your Form Recognizer resource and your storage. Once you have created the resource group, create a new Form Recognizer resource. Then create a Storage Account, add a blob container to it, and upload your training forms to that container.

Now that we have established the resources that we need, we can go ahead and find and save our secrets. 

#### Saving your key and endpoint values
From within your Form Recognizer resource, select the **Quick start** tab to view your subscription data. Save the values **Key** and **Endpoint** to a temporary location, such as a .txt file. Make sure that you label the key and endpoint clearly so you can tell them apart later.

#### Saving your SAS URL 
Navigate to your Storage Account resource. Open the **Storage Explorer** tab and right-click on your container. Select **Get shared access signature**. Make sure that the **Read** and **List** permissions are checked, and click **Create**. Then copy the value in the **URL** section and paste it into your .txt file.

If you do not have access to the **Storage Explorer** tab, navigate further down to the **Shared access signature** tab (under **Settings**). Make sure that all of the checkboxes are selected and leave all other settings as they are. Press **Generate SAS and connection string**. Copy the **Blob service SAS URL** and save it to your .txt file. You will have to alter the SAS URL so that it includes the name of your container; see the note below.

**_Note_**: Make sure that your SAS URL is of the form: `https://<storage account>.blob.core.windows.net/<container name>?<SAS value>`. If you have gotten your URL via the **Shared access signature** tab, your URL will _not_ include the `<container name>` so you will need to add this in order for your URL to work correctly. You can check the name of your container by going to your Storage Account resource, and navigating to **Blob service** > **Containers** in the side bar. 
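
The manual edit described in the note can also be done in code. The following is a minimal sketch using only the standard library; `add_container_to_sas_url` is a hypothetical helper name, and the account name, container name, and SAS value in the example are placeholders rather than real credentials.

```python
from urllib.parse import urlsplit, urlunsplit


def add_container_to_sas_url(sas_url, container):
    """Insert the container name into a Blob service SAS URL that lacks one."""
    parts = urlsplit(sas_url)
    # If the URL already has a path (a container name), leave it untouched.
    if parts.path.strip("/"):
        return sas_url
    # Otherwise set the path to the container; the SAS query string is kept as-is.
    return urlunsplit(parts._replace(path="/" + container))


example = "https://myaccount.blob.core.windows.net/?sv=2019-02-02&sig=abc"
print(add_container_to_sas_url(example, "forms"))
```

Because the query string is passed through unchanged, the signature in the SAS value is not re-encoded.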


***
### Running the code

**_Note_**: Before we run the notebook, we need to populate the variables with our **Key**, **Endpoint** and **SAS URL** values that we collected earlier. The values you need to change are listed at the beginning of each block of code. Please read these changes and alter the variables as necessary before running the notebook.
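
If you would rather not paste secrets directly into the notebook, one option is to read them from environment variables. The sketch below is an optional alternative, not part of the original notebook: `load_settings` and the three variable names are assumptions chosen for illustration. With `python-dotenv` installed, calling `load_dotenv()` beforehand would copy values from a `.env` file into `os.environ`.

```python
import os

# Hypothetical variable names; use whatever names you prefer.
SETTING_NAMES = (
    "FORM_RECOGNIZER_ENDPOINT",
    "FORM_RECOGNIZER_KEY",
    "FORM_RECOGNIZER_SAS_URL",
)


def load_settings(env=os.environ):
    """Return the three settings as a dict, or raise if any is missing or empty."""
    missing = [name for name in SETTING_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in SETTING_NAMES}
```

The returned values could then be assigned to `endpoint`, the subscription key header, and `source` in the cells that follow.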


#### Train a Form Recognizer model
The following block of code calls the **Train Custom Model** API. This will train a Form Recognizer model with the documents that are in your Azure blob container.

Variables to populate:
* Replace `<endpoint>` with the endpoint URL from your Form Recognizer resource
* Replace `<SAS URL>` with the URL you generated before
* Replace `<subscription key>` with the subscription key you copied from your Form Recognizer resource

In [None]:
########### Python Form Recognizer Labeled Async Train #############

import json
import time
from requests import get, post

# Endpoint URL
endpoint = r"<endpoint>"
post_url = endpoint + r"/formrecognizer/v2.0-preview/custom/models"
source = r"<SAS URL>"
prefix = "" #Path to the folder in blob storage where your forms are located. If your forms are at the root of your container, leave this string empty.
includeSubFolders = False
useLabelFile = False

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': '<subscription key>',
}

body = {
    "source": source,
    "sourceFilter": {
        "prefix": prefix,
        "includeSubFolders": includeSubFolders
    },
    "useLabelFile": useLabelFile
}

try:
    resp = post(url = post_url, json = body, headers = headers)
    if resp.status_code != 201:
        print("POST model failed (%s):\n%s" % (resp.status_code, json.dumps(resp.json())))
        quit()
    print("POST model succeeded:\n%s" % resp.headers)
    get_url = resp.headers["location"]
except Exception as e:
    print("POST model failed:\n%s" % str(e))
    quit()

***
#### Get the training results
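The code below polls the training operation until the model status is `ready` or `invalid`, waiting longer between each attempt. On success it prints the training results and stores the model ID in `modelID` for use in the analysis step.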

In [None]:
n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = headers)
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET model failed (%s):\n%s" % (resp.status_code, json.dumps(resp_json)))
            quit()
        model_status = resp_json["modelInfo"]["status"]
        if model_status == "ready":
            print("Training succeeded:\n%s" % json.dumps(resp_json, indent=4, sort_keys=True))
            modelID = resp_json["modelInfo"]["modelId"]
            print("ModelID:\n%s" % modelID)
            break
        if model_status == "invalid":
            print("Training failed. Model is invalid:\n%s" % json.dumps(resp_json))
            quit()
        # Training still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)     
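    except Exception as e:
        msg = "GET model failed:\n%s" % str(e)
        print(msg)
        quit()
else:
    # The loop ran out of tries without a break, so the model never became ready.
    print("Training did not complete within the allotted number of tries.")
    quit()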
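
To see how long the polling loop above can wait in total, the sketch below reproduces its wait schedule without making any requests. `backoff_schedule` is a hypothetical helper; its defaults mirror the `n_tries`, `wait_sec`, and `max_wait_sec` values used in the cell.

```python
def backoff_schedule(n_tries=15, wait_sec=5, max_wait_sec=60):
    """Return the list of sleep durations the polling loop would use."""
    waits = []
    for _ in range(n_tries):
        waits.append(wait_sec)
        # Double the wait after each try, capped at max_wait_sec.
        wait_sec = min(2 * wait_sec, max_wait_sec)
    return waits


schedule = backoff_schedule()
print(schedule[:6])   # [5, 10, 20, 40, 60, 60]
print(sum(schedule))  # 735
```

With the default values the loop therefore sleeps for at most 735 seconds (a little over 12 minutes) before giving up; raise `n_tries` if training on a large set of forms needs longer.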

***
#### Analyze your forms for key-value pairs and tables
Now we'll use the newly trained model to analyze a document and extract key-value pairs and tables from it. We do this by calling the **Analyze Form** API. 

Variables to populate:
* Replace `<endpoint>` with the endpoint URL from your Form Recognizer resource
* Replace `<subscription key>` with the key from your Form Recognizer resource
* Replace `<file path>` with the path of the local form file you would like to analyze, for example C:\temp\file.pdf. The code below opens this path from disk, so the URL of a remote file will not work as written.
* Replace `<file type>` with the file type. Supported types: `application/pdf`, `image/jpeg`, `image/png`, `image/tiff`. 
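
Rather than typing the file type by hand, it can be derived from the file extension. This is a minimal sketch: `content_type_for` is a hypothetical helper, and the extension-to-type mapping is an assumption built from the supported types listed above.

```python
import os

# Assumed mapping from common extensions to the supported content types.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
}


def content_type_for(path):
    """Return the Content-Type header value for a supported form file."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONTENT_TYPES:
        raise ValueError("Unsupported file type: %r" % ext)
    return CONTENT_TYPES[ext]


print(content_type_for("file.pdf"))
```

The result could be used as the `'Content-Type'` value in the request headers in place of `<file type>`.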


In [None]:
########### Python Form Recognizer Async Analyze #############

# Endpoint URL
endpoint = r"<endpoint>"
apim_key = "<subscription key>"
post_url = endpoint + "/formrecognizer/v2.0-preview/custom/models/%s/analyze" % modelID
source = r"<file path>"
params = {
    "includeTextDetails": True
}

headers = {
    # Request headers
    'Content-Type': '<file type>', #Note: make sure that this is a '/', rather than a '\'
    'Ocp-Apim-Subscription-Key': apim_key,
}
with open(source, "rb") as f:
    data_bytes = f.read()

try:
    resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
    if resp.status_code != 202:
        print("POST analyze failed:\n%s" % json.dumps(resp.json()))
        quit()
    print("POST analyze succeeded:\n%s" % resp.headers)
    get_url = resp.headers["operation-location"]
except Exception as e:
    print("POST analyze failed:\n%s" % str(e))
    quit()

***
#### Get the Analyze results
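This code polls the analyze operation until its status is `succeeded` or `failed`. On success it prints the analysis results as formatted JSON and leaves them in `resp_json`.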

In [None]:
n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET analyze results failed:\n%s" % json.dumps(resp_json))
            quit()
        status = resp_json["status"]
        if status == "succeeded":
            print("Analysis succeeded:\n%s" % json.dumps(resp_json,indent=4, sort_keys=True))
            break
        if status == "failed":
            print("Analysis failed:\n%s" % json.dumps(resp_json))
            quit()
        # Analysis still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)     
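    except Exception as e:
        msg = "GET analyze results failed:\n%s" % str(e)
        print(msg)
        quit()
else:
    # The loop ran out of tries without a break, so no final result was received.
    print("Analysis did not complete within the allotted number of tries.")
    quit()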
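
Once the analysis has succeeded, the key-value pairs and tables can be pulled out of `resp_json`. The sketch below is an illustration under assumptions: `extract_pairs_and_tables` is a hypothetical helper, and the field names (`analyzeResult`, `pageResults`, `keyValuePairs`, `tables`, `cells`, `rowIndex`, `columnIndex`) are assumed from the typical shape of a response for a model trained without labels. Check them against the JSON printed above before relying on this code.

```python
def extract_pairs_and_tables(result):
    """Collect (key, value) text pairs and tables as row-by-column grids of text."""
    pairs, tables = [], []
    # Assumed layout: one entry in pageResults per analyzed page.
    for page in result.get("analyzeResult", {}).get("pageResults", []):
        for kv in page.get("keyValuePairs", []):
            pairs.append((kv.get("key", {}).get("text"),
                          kv.get("value", {}).get("text")))
        for table in page.get("tables", []):
            # Start with an empty grid, then place each cell by its indices.
            grid = [[""] * table["columns"] for _ in range(table["rows"])]
            for cell in table.get("cells", []):
                grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
            tables.append(grid)
    return pairs, tables


# Hand-written sample in the assumed shape, not real service output.
sample = {"analyzeResult": {"pageResults": [{
    "keyValuePairs": [{"key": {"text": "Name:"}, "value": {"text": "Contoso"}}],
    "tables": [{"rows": 1, "columns": 2, "cells": [
        {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
        {"rowIndex": 0, "columnIndex": 1, "text": "Price"}]}],
}]}}
print(extract_pairs_and_tables(sample))
```

In the notebook, the same function would be called as `extract_pairs_and_tables(resp_json)` after the polling cell finishes.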
