# Companies House

<p>
Mal Minhas, v0.2<br>
28.10.24
</p>
<p>
<h4>Versions</h4>
<ul>
<li><b>v0.1</b>: 27.10.24. First version focussed only on REST API.</li>
<li><b>v0.2</b>: 28.10.24. Added support for document processing.</li>
</ul>
</p>

### 0. Introduction

This playbook provides a brief outline of how to access the Companies House (CH) REST API per the developer documentation available [here](https://developer.company-information.service.gov.uk/overview).  To follow along you will first need to [create an application](https://developer.company-information.service.gov.uk/how-to-create-an-application/) and an associated API key for accessing the CH API.  Note that you can set up applications and keys to communicate with either a Sandbox endpoint, which is read/write, or the public API, which is read-only.  The two endpoints are located at the following hosts:
* **Sandbox API endpoint**: test-data-sandbox.company-information.service.gov.uk
* **Public API endpoint**: api.company-information.service.gov.uk

This notebook will focus only on the public read endpoint.
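
As a minimal sketch, the two hosts can be kept in a small lookup so that the rest of the code only deals in API paths.  The `BASE_URLS` table and `build_url` helper are hypothetical names introduced here for illustration, and the `https://` scheme for the sandbox host is an assumption:

```python
# Hosts taken from the list above; the https scheme is assumed for both
BASE_URLS = {
    "sandbox": "https://test-data-sandbox.company-information.service.gov.uk",
    "public": "https://api.company-information.service.gov.uk",
}

def build_url(path, endpoint="public"):
    """Join one of the hosts with an API path such as 'company/06782942'."""
    return f"{BASE_URLS[endpoint]}/{path.lstrip('/')}"

print(build_url("company/06782942"))
```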

### 1. Authorisation

The CH REST API uses [HTTP Basic Auth](https://en.wikipedia.org/wiki/Basic_access_authentication) for GET requests made from a script or CLI tool, with the API key as the username and an empty password.  There is also an OAuth2 flow, which appears to be required for certain types of sensitive access.  For instance, obtaining [user profile](https://developer-specs.company-information.service.gov.uk/companies-house-identity-service/reference/user-details/user-profile) information from the endpoint at https://identity.company-information.service.gov.uk/user/profile requires going through the OAuth2 flow with a `client_id` and `client_secret` per the documentation [here](https://developer-specs.company-information.service.gov.uk/companies-house-identity-service/guides/ServerWeb).  This playbook focuses on the simple REST API access use case only.  The web flow is more involved as it requires setting up a web server to handle the redirect.
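
To make the Basic Auth scheme concrete, here is a small sketch of the header that gets sent when the API key is the username and the password is empty.  The `basic_auth_header` helper is a hypothetical name used only for illustration.  In practice an HTTP library can build this header for you, so this is only meant to show what goes over the wire:

```python
import base64

def basic_auth_header(api_key):
    """Build the Authorization header for Basic Auth with an empty password."""
    # Basic Auth encodes "username:password"; here the password is empty
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# 'my_api_key' is a placeholder, not a real key
print(basic_auth_header("my_api_key"))
```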

CH will be [migrating to the Gov UK One Login scheme](https://www.gov.uk/government/news/companies-house-to-join-govuk-one-login) in Autumn 2024. 

### 2. Company Profile

Let's start with the most basic information, which is the company profile.  The following function `getCompanyProfile` will retrieve that information using a shared `procUrl` helper:

In [1]:
import os
import requests

API_KEY = os.environ.get('COMPANIES_HOUSE_API_KEY')

def procUrl(url, query_params=None, verbose=False):
    # Basic Auth: the API key is the username and the password is empty
    r = requests.get(url, auth=(API_KEY, ''), params=query_params)
    if verbose:
        print("-------------------------")
        print(f"URL: {r.request.url}")
        print(f"Body: {r.request.body}")
        hdrs = ""
        for k,v in r.request.headers.items():
            hdrs += f"  {k}:{v}\n"
        print(f"Headers:\n{hdrs[:-1]}")
        print("-------------------------")
    if r.status_code != 200:
        raise Exception(f"Error {r.status_code}: '{r.text}'")
    return r.json()  # if response is JSON format

def getCompanyProfile(company_id):
    directive = "company"
    url = f"https://api.company-information.service.gov.uk/{directive}/{company_id}"
    return procUrl(url)

We can use it to find out information about a company called [Oppian](https://find-and-update.company-information.service.gov.uk/company/06782942).  Its company number, 06782942, was obtained from the CH web search interface.

In [2]:
def procAddress(data):
    ''' ChatGPT Python GPT assisted :-) '''
    # Extract the registered office address from the data
    address = data.get('registered_office_address', {})
    # Retrieve each part of the address, or an empty string if it doesn't exist
    address_line_1 = address.get('address_line_1', '')
    address_line_2 = address.get('address_line_2', '')
    locality = address.get('locality', '')
    region = address.get('region', '')
    postal_code = address.get('postal_code', '')
    country = address.get('country', '')
    # Concatenate all parts with commas, filtering out empty strings
    full_address = ', '.join(filter(None, [address_line_1, address_line_2, locality, region, postal_code, country]))
    return full_address

companyId = '06782942'
d = getCompanyProfile(companyId)
print(f"{d.get('company_name')} has company id={d.get('company_number')} is {d.get('company_status')} and has registered address:\n{procAddress(d)}")

OPPIAN SYSTEMS LIMITED has company id=06782942 is active and has registered address:
C/O High Royd Business Services Limited B B I C, Innovation Way, Barnsley, South Yorkshire, S75 1JL, United Kingdom
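
Note that the company number is passed as a string with its leading zero.  The sketch below shows one way to tidy up user input before calling the API.  It assumes company numbers are always 8 characters, either all digits or a letter prefix followed by digits, which matches the example here but should be checked against the CH documentation.  `normaliseCompanyNumber` is a hypothetical helper:

```python
def normaliseCompanyNumber(raw):
    """Strip whitespace, upper-case, and zero-pad all-digit numbers to 8 characters."""
    cleaned = str(raw).strip().upper()
    if cleaned.isdigit() and len(cleaned) <= 8:
        return cleaned.zfill(8)
    # Prefixed numbers are assumed to already be 8 characters long
    if cleaned.isalnum() and len(cleaned) == 8:
        return cleaned
    raise ValueError(f"Unexpected company number: {raw!r}")

print(normaliseCompanyNumber(6782942))
```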


### 3. Company Details

Next, let's retrieve the list of [persons with significant control](https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/persons-with-significant-control/list?v=latest), which is obtained from https://api.company-information.service.gov.uk/company/{company_number}/persons-with-significant-control:

In [3]:
def getPersonsWithSignificantControl(company_id):
    directive = f"company/{company_id}/persons-with-significant-control"
    url = f"https://api.company-information.service.gov.uk/{directive}"
    params = {'start_index':0, 'items_per_page':10, 'register_view':'false'}
    return procUrl(url, params, verbose=False)

d = getPersonsWithSignificantControl(companyId)
print(f"Found {len(d.get('items'))} persons with significant control")

Found 2 persons with significant control
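
Counting the items is only a start.  The following sketch summarises who currently has control.  It assumes each item carries `name`, `natures_of_control` and, for people who no longer have control, `ceased_on` fields.  The sample response is made up for illustration and is not real Companies House data:

```python
def summarisePscs(response):
    """Return (name, natures_of_control) for each PSC that has not ceased."""
    summary = []
    for item in response.get('items', []):
        if item.get('ceased_on'):
            continue  # skip people who no longer have control
        summary.append((item.get('name'), item.get('natures_of_control', [])))
    return summary

# Made-up response for illustration only
sample = {'items': [
    {'name': 'Ms Jane Example',
     'natures_of_control': ['ownership-of-shares-25-to-50-percent']},
    {'name': 'Mr John Example',
     'natures_of_control': ['voting-rights-25-to-50-percent'],
     'ceased_on': '2020-01-01'},
]}
for name, natures in summarisePscs(sample):
    print(f"{name}: {', '.join(natures)}")
```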


### 4. Search

We can also search programmatically, mirroring the search user interface [here](https://find-and-update.company-information.service.gov.uk/):

In [4]:
def searchCompany(category, search_string, nitems=10, start_index=0):
    directive = f"search/{category}"
    url = f"https://api.company-information.service.gov.uk/{directive}"
    # Pass the paging arguments through instead of ignoring them
    params = {'q':search_string, 'items_per_page':nitems, 'start_index':start_index}
    return procUrl(url, params)

In [5]:
d = searchCompany('companies','Oppian')

A cursory look at the first search result reveals the same company we looked up earlier:

In [6]:
first = d.get('items')[0]
company_title = first.get('title')
company_id = first.get('company_number')
print(f"Company '{company_title}' with company id={company_id}")

Company 'OPPIAN SYSTEMS LIMITED' with company id=06782942
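
Search results come back a page at a time.  The sketch below shows one way to walk through every page by advancing `start_index`.  It assumes the response reports a `total_results` count alongside `items`.  `iterAllItems` is a hypothetical helper, and `fakeFetch` is an offline stand-in for a function that calls the search endpoint with `start_index` and `items_per_page` query parameters:

```python
def iterAllItems(fetch_page, page_size=10):
    """Yield every item from a paged endpoint by advancing start_index."""
    start_index = 0
    while True:
        page = fetch_page(start_index, page_size)
        items = page.get('items') or []
        if not items:
            break
        yield from items
        start_index += len(items)
        # Stop once the reported total has been reached, if one is reported
        total = page.get('total_results')
        if total is not None and start_index >= total:
            break

# Offline stand-in for the API so the sketch runs without a key
FAKE_RESULTS = [{'title': f"COMPANY {i}"} for i in range(25)]

def fakeFetch(start_index, page_size):
    return {'items': FAKE_RESULTS[start_index:start_index + page_size],
            'total_results': len(FAKE_RESULTS)}

print(len(list(iterAllItems(fakeFetch))))
```

Keeping the fetch function separate from the loop means the same generator could be reused for other paged endpoints, such as the persons with significant control list.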


### 5. Documents

Let's get the filing history for the company using https://api.company-information.service.gov.uk/company/{company_number}/filing-history per the documentation [here](https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference/filing-history/list?v=latest):

In [7]:
def getFilingHistory(company_id):
    directive = f"company/{company_id}/filing-history"
    url = f"https://api.company-information.service.gov.uk/{directive}"
    return procUrl(url)

d = getFilingHistory(companyId)

Let's collect the doc id of every filing in the `accounts` category:

In [8]:
def getDocId(url):
    # Strip any trailing slashes and split the URL by '/'
    parts = url.rstrip('/').split('/')
    # Return the last part
    return parts[-1]

accounts = []
for dd in d.get('items'):
    if dd.get('category') == 'accounts':
        doc_date = dd.get('date')
        doc_id = getDocId(dd.get('links').get('document_metadata'))
        accounts.append({'doc_date':doc_date,'doc_id':doc_id})
print(f"Found {len(accounts)} accounts records")

Found 9 accounts records
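
The code that follows uses `accounts[0]`, which relies on the order in which the API returns filings.  A small sketch of picking the most recent filing explicitly is shown below.  It assumes the `date` field is an ISO `YYYY-MM-DD` string.  `latestFiling` is a hypothetical helper and the sample items are made up:

```python
def latestFiling(items, category):
    """Return the most recent filing in a category, or None if there is none."""
    matches = [i for i in items if i.get('category') == category and i.get('date')]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings
    return max(matches, key=lambda i: i['date'], default=None)

# Made-up items shaped like the fields used above
sample_items = [
    {'category': 'accounts', 'date': '2023-09-12'},
    {'category': 'confirmation-statement', 'date': '2024-08-01'},
    {'category': 'accounts', 'date': '2024-07-25'},
]
latest = latestFiling(sample_items, 'accounts')
print(latest['date'])
```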


Let's fetch the document metadata for the first set of accounts at `accounts[0]`:

In [9]:
def getDocumentFromId(document_id):
    directive = f"document/{document_id}"
    url = f"https://document-api.company-information.service.gov.uk/{directive}"
    return procUrl(url)

doc_id = accounts[0].get('doc_id')
d = getDocumentFromId(doc_id)
filetype = 'application/pdf'
filesize = d.get('resources').get('application/pdf').get('content_length')
filename = d.get('filename')
document_url = d.get('links').get('document')
print(f"Document for file '{filename}' of type '{filetype}' and size {filesize} is at:\n{document_url}")

Document for file '06782942_aa_2024-07-25' of type 'application/pdf' and size 25239 is at:
https://document-api.company-information.service.gov.uk/document/KCA3S09EKA-QQkrICYbVBAUtFVh5kKOPe7A0c434zlI/content
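
The metadata lists the available formats under `resources`, keyed by MIME type.  Rather than hard-coding `application/pdf`, a sketch like the following could pick the first format we know how to handle.  The `pickContentType` helper is hypothetical, and the second preferred type, `application/xhtml+xml`, is an assumption about what other filings might offer:

```python
def pickContentType(metadata, preferred=('application/pdf', 'application/xhtml+xml')):
    """Return (mime_type, content_length) for the first preferred type available."""
    resources = metadata.get('resources') or {}
    for mime in preferred:
        if mime in resources:
            return mime, resources[mime].get('content_length')
    raise ValueError(f"None of {preferred} available; found {list(resources)}")

# Trimmed-down metadata using the values printed above
sample_metadata = {'resources': {'application/pdf': {'content_length': 25239}}}
print(pickContentType(sample_metadata))
```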


Now we can write a function to download the pdf document with a couple of checks on filetype and file size.  This code assumes the doc is `application/pdf`:

In [10]:
def get_pdf_file_size(file_path):
    # Get the file size in bytes
    return int(os.path.getsize(file_path))

def getDocument(document_url, filename, filetype, filesize):
    # Check the type before downloading anything
    if filetype != 'application/pdf':
        raise ValueError(f"Unsupported file type '{filetype}'")
    # requests follows the redirect to the document's storage location,
    # so the final response already holds the document bytes
    r = requests.get(document_url, auth=(API_KEY, ''))
    if r.status_code != 200:
        raise Exception(f"Error {r.status_code}: '{r.text}'")
    fname = filename + '.pdf'
    # Mode 'wb' overwrites any existing file of the same name
    with open(fname, 'wb') as f:
        f.write(r.content)
    fsize = get_pdf_file_size(fname)
    if fsize != filesize:
        raise ValueError(f"Expected {filesize} bytes but got {fsize}")
    return fname,fsize

fname,fsize = getDocument(document_url, filename, filetype, filesize)
print(f"Downloaded {fname} locally of size {fsize} bytes")

Downloaded 06782942_aa_2024-07-25.pdf locally of size 25239 bytes
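
A matching file size does not prove the bytes are actually a PDF.  As a further cheap check, PDF files begin with the marker `%PDF-`, so a sketch like this could be run on the downloaded bytes before they are written to disk.  `looksLikePdf` is a hypothetical helper:

```python
def looksLikePdf(data):
    """True if the bytes start with the PDF header marker '%PDF-'."""
    return data[:5] == b'%PDF-'

print(looksLikePdf(b'%PDF-1.7\n'))
print(looksLikePdf(b'<html>Not found</html>'))
```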


Once we have the pdf locally we can try to convert it to text.  We use the open source `tesseract` OCR engine and the `poppler` pdf renderer for this because `pymupdf` did not work on this document.  If you are on a Mac, you will need to do the following to get this working:
```
$ brew install tesseract
$ tesseract --version
$ brew install poppler
$ pdftoppm -h
$ pip install pytesseract pdf2image pillow
```

Now we can run this function to convert the pdf to text.  Note that it takes about 2 seconds to run on this document:

In [11]:
%%time
import pytesseract
from pdf2image import convert_from_path

def pdf_to_text_with_ocr(file_path):
    text = ""
    # Convert PDF pages to images
    images = convert_from_path(file_path)
    for image in images:
        # Use OCR to extract text from each image
        text += pytesseract.image_to_string(image)
    return text

text = pdf_to_text_with_ocr(fname)

CPU times: user 47.9 ms, sys: 52.8 ms, total: 101 ms
Wall time: 2.22 s
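
Since OCR is the slow step, it may be worth caching the result.  The sketch below stores the extracted text in a `.txt` file next to the pdf and reuses it on later calls.  `cachedText` is a hypothetical helper, and the demonstration uses a stand-in extractor rather than real OCR so that it runs without `tesseract` installed:

```python
import os
import tempfile

def cachedText(pdf_path, extract):
    """Return text for pdf_path, reusing a .txt file beside it if present."""
    txt_path = os.path.splitext(pdf_path)[0] + '.txt'
    if os.path.isfile(txt_path):
        with open(txt_path, encoding='utf-8') as f:
            return f.read()
    text = extract(pdf_path)
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return text

# Offline demonstration with a stand-in extractor instead of real OCR
calls = []
def fakeExtract(path):
    calls.append(path)
    return "Registered number: 06782942"

with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "example.pdf")
    first = cachedText(pdf_path, fakeExtract)   # runs the extractor
    second = cachedText(pdf_path, fakeExtract)  # served from example.txt
print(first == second, len(calls))
```

With the real function this would be called as `cachedText(fname, pdf_to_text_with_ocr)`.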


Let's check the first 200 characters to ensure we are on the right track:

In [12]:
print(text[:200])

Registered number: 06782942

OPPIAN SYSTEMS LIMITED
UNAUDITED FINANCIAL STATEMENTS
FOR THE YEAR ENDED 31 DECEMBER 2023

High Royd Business Services Limited
BBIC
Innovation Way
Barnsley
South Yorkshire


Looks good!
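
We can go one step beyond eyeballing the output and check that the OCR text mentions the company number we expect.  This is a minimal sketch that assumes the document contains a `Registered number:` line like the one above.  `extractRegisteredNumber` is a hypothetical helper:

```python
import re

def extractRegisteredNumber(text):
    """Find the 8-character value following 'Registered number:', if any."""
    match = re.search(r"Registered number:\s*([A-Z0-9]{8})\b", text, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

sample_text = "Registered number: 06782942\n\nOPPIAN SYSTEMS LIMITED\n"
print(extractRegisteredNumber(sample_text))
```

Comparing the result against `companyId` gives a quick sanity check that the right document was downloaded and read.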

We could now go further: download all the documents for a specific company and create a retrieval augmented generation (RAG) store that lets us ask questions of them.  We would need to use a secure LLM for that.  This exercise is left for a later stage.
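
As a pointer towards that later stage, a common first step when building a retrieval store is to split each document's text into overlapping chunks.  The sketch below is one simple character-based way to do that.  `chunkText` and its default sizes are hypothetical choices, not a recommendation:

```python
def chunkText(text, size=500, overlap=50):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunkText("abcdefghij" * 120, size=500, overlap=50)
print(len(chunks), len(chunks[0]))
```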