# Machine-Readable Text of the Kenya 2010 Constitution

In this notebook, we're going to produce a machine-readable version of the Kenya 2010 constitution. We'll use a PDF version of the constitution from the [Kenya Law][0] website as the source.

This version was last revised in 2022 and published by the Government Printer, so we can have a high degree of confidence in its integrity.

[0]: https://kenyalaw.org/kl/fileadmin/pdfdownloads/TheConstitutionOfKenya.pdf

The first step is to download it:

In [1]:
file_name = "TheConstitutionOfKenya.pdf"
!mkdir -p pdf
![ ! -f "pdf/{file_name}" ] && wget "https://kenyalaw.org/kl/fileadmin/pdfdownloads/TheConstitutionOfKenya.pdf" -O "pdf/{file_name}"

We'll use [pypdf][3] to extract the text. We install it and load the PDF:

[3]: https://pypi.org/project/pypdf/

In [2]:
!pip install --quiet pypdf

In [3]:
from pypdf import PdfReader

In [4]:
reader = PdfReader('pdf/TheConstitutionOfKenya.pdf')

In [5]:
len(reader.pages)

166

The PDF file has 166 pages. We can print a sample of the extracted text from page 23 (index 22, since pages are zero-indexed):

In [6]:
sample = reader.pages[22]

print(sample.extract_text())

[Rev. 2022] Constitution of Kenya 23
Human dignity.
28.  Every person has inherent dignity and the right to have that dignity respected
and protected.
Freedom and security of the person.
29.  Every person has the right to freedom and security of the person, which
includes the right not to be—
(a) deprived of freedom arbitrarily or without just cause;
(b) detained without trial, except during a state of emergency, in which
case the detention is subject to Article 58;
(c) subjected to any form of violence from either public or private sources;
(d) subjected to torture in any manner, whether physical or psychological;
(e) subjected to corporal punishment; or
(f)treated or punished in a cruel, inhuman or degrading manner.
Slavery, servitude and forced labour.
30.   (1)  A person shall not be held in slavery or servitude.
(2)  A person shall not be required to perform forced labour.
Privacy.
31.  Every person has the right to privacy, which includes the right not to have—
(a) their person, 

We extract text from all the pages:

In [7]:
page_texts = [p.extract_text() for p in reader.pages]

The constitution is divided into chapters, each of which contains articles. There are 264 articles in total, on pages 14 to 123, so we narrow the pages down to this range:

In [8]:
relevant_text = "\n".join(page_texts[13:123])
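The slice bounds follow from Python's zero-based indexing: printed page 14 sits at index 13, and the exclusive upper bound 123 keeps printed page 123. A minimal sketch of that conversion, where `page_slice` is a hypothetical helper (not part of the notebook) and `fake_pages` stands in for `page_texts`:

```python
def page_slice(first_page, last_page):
    """Return a slice selecting 1-based pages first_page..last_page inclusive."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Subtract 1 from the start only: the slice end is already exclusive.
    return slice(first_page - 1, last_page)

# Stand-in for page_texts: one string per page of a 166-page document.
fake_pages = [f"page {n}" for n in range(1, 167)]
selected = fake_pages[page_slice(14, 123)]
print(selected[0], selected[-1], len(selected))
```

With these inputs the selection starts at page 14, ends at page 123, and covers 110 pages.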

We then break the text down into lines, as extracted from the PDF file:

In [9]:
import io

mem_file = io.StringIO(relevant_text)
all_lines = [l for l in mem_file]

In [10]:
len(all_lines)

4657

Some exploration indicates that each article starts on a new line with a pattern like `31.  Every person has the right to privacy...`; that is, a number followed by a period, then some whitespace, then non-whitespace characters. We create a regular expression for this:

In [11]:
import re

article_num_pattern = re.compile(r'(\d+)\.\s+\S+')

We can extract the article number like this:

In [12]:
article_num_pattern.match('31.  Every person has the right to privacy').group(1)

'31'
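To see what the pattern accepts and rejects, here is a small sketch that runs it over a few lines taken from the page 23 sample above. Because `match` anchors at the start of the string, numbered clauses such as `(2)`, lettered items, and numbers in the middle of a line are not picked up:

```python
import re

article_num_pattern = re.compile(r'(\d+)\.\s+\S+')

samples = [
    "31.  Every person has the right to privacy",                      # article start
    "30.   (1)  A person shall not be held in slavery or servitude.",  # article with a clause
    "(2)  A person shall not be required to perform forced labour.",   # numbered clause
    "(a) deprived of freedom arbitrarily or without just cause;",      # lettered item
    "case the detention is subject to Article 58;",                    # number mid-line
]

# group(1) is the article number; None means the line is not an article start.
matched = [m.group(1) if (m := article_num_pattern.match(s)) else None for s in samples]
print(matched)
```

A wrapped line that happened to begin with something like `58. ` would still match, which is why the count and ordering of the matches are worth validating.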

Running the pattern over all the lines should identify 264 of them:

In [13]:
article_nums = []

for line in all_lines:
    line = line.strip()
    match = article_num_pattern.match(line)
    if match:
        article_nums.append(int(match.group(1)))

In [14]:
len(article_nums)

264

We validate that the numbers run from 1 to 264, in order:

In [15]:
assert [i+1 for i in range(264)] == article_nums
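The assertion only says whether the sequence is right, not what went wrong. If it ever fails (for example on a different revision of the PDF), a helper along these lines could narrow the problem down. This is a sketch only: `diagnose_sequence` is a hypothetical name and the input list is made up for illustration.

```python
def diagnose_sequence(found, expected_count):
    """Report numbers missing from, duplicated in, or unexpected in `found`."""
    expected = set(range(1, expected_count + 1))
    seen, duplicates = set(), []
    for n in found:
        if n in seen:
            duplicates.append(n)
        seen.add(n)
    return {
        "missing": sorted(expected - seen),
        "duplicates": duplicates,
        "unexpected": sorted(seen - expected),
        "in_order": found == sorted(found),
    }

# Made-up example: article 3 was missed and article 5 matched twice.
report = diagnose_sequence([1, 2, 4, 5, 5, 6], expected_count=6)
print(report)
```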

Further observation shows that the title of each article seems to precede the line with the number. To validate this, we'll pair each article number with the text of the previous line:

In [16]:
articles = []

for i, line in enumerate(all_lines):
    line = line.strip()
    match = article_num_pattern.match(line)
    if match:
        articles.append((int(match.group(1)), all_lines[i-1].strip()))

In [17]:
articles

[(1, 'Sovereignty of the people.'),
 (2, 'Supremacy of this Constitution.'),
 (3, 'Defence of this Constitution.'),
 (4, 'Declaration of the Republic.'),
 (5, 'Territory of Kenya.'),
 (6, 'Devolution and access to services.'),
 (7, 'National, ofﬁcial and other languages.'),
 (8, 'State and religion.'),
 (9, 'National symbols and national days.'),
 (10, 'National values and principles of governance.'),
 (11, 'Culture.'),
 (12, 'Entitlements of citizens.'),
 (13, 'Retention and acquisition of citizenship.'),
 (14, 'Citizenship by birth.'),
 (15, 'Citizenship by registration.'),
 (16, 'Dual citizenship.'),
 (17, 'Revocation of citizenship.'),
 (18, 'Legislation on citizenship.'),
 (19, 'Rights and fundamental freedoms.'),
 (20, 'Application of Bill of Rights.'),
 (21, 'Implementation of rights and fundamental freedoms.'),
 (22, 'Enforcement of Bill of Rights.'),
 (23, 'Authority of courts to uphold and enforce the Bill of Rights.'),
 (24, 'Limitation of rights and fundamental freedoms.'),

Cross-checking with the PDF version, it looks like the articles and their titles were correctly identified.

The chapters all seem to start on a new line in the form `CHAPTER ONE`, `CHAPTER TWO` and so on, with the title on the following line. We extract these and check:

In [18]:
chapter_pattern = re.compile(r'CHAPTER\s+(\w+)')

chapters = []

for i, line in enumerate(all_lines):
    line = line.strip()
    if chapter_pattern.match(line):
        chapters.append((line, all_lines[i+1].strip()))

In [19]:
chapters

[('CHAPTER ONE',
  'SOVEREIGNTY OF THE PEOPLE AND SUPREMACY OF THIS CONSTITUTION'),
 ('CHAPTER TWO', 'THE REPUBLIC'),
 ('CHAPTER THREE', 'CITIZENSHIP'),
 ('CHAPTER FOUR', 'THE BILL OF RIGHTS'),
 ('CHAPTER FIVE', 'LAND AND ENVIRONMENT'),
 ('CHAPTER SIX', 'LEADERSHIP AND INTEGRITY'),
 ('CHAPTER SEVEN', 'REPRESENTATION OF THE PEOPLE'),
 ('CHAPTER EIGHT', 'THE LEGISLATURE'),
 ('CHAPTER NINE', 'THE EXECUTIVE'),
 ('CHAPTER TEN', 'JUDICIARY'),
 ('CHAPTER ELEVEN', 'DEVOLVED GOVERNMENT'),
 ('CHAPTER TWELVE', 'PUBLIC FINANCE'),
 ('CHAPTER THIRTEEN', 'THE PUBLIC SERVICE'),
 ('CHAPTER FOURTEEN', 'NATIONAL SECURITY'),
 ('CHAPTER FIFTEEN', 'COMMISSIONS AND INDEPENDENT OFFICES'),
 ('CHAPTER SIXTEEN', 'AMENDMENT OF THIS CONSTITUTION'),
 ('CHAPTER SEVENTEEN', 'GENERAL PROVISIONS'),
 ('CHAPTER EIGHTEEN', 'TRANSITIONAL AND CONSEQUENTIAL PROVISIONS')]

That seems to have correctly identified each chapter with its title.
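Since the chapter numbers are spelled out as words, it can help to convert them to integers so they can be sorted or checked for gaps. A minimal sketch, assuming the headings always have the form `CHAPTER <WORD>` with words from one to eighteen; `chapter_number` is a hypothetical helper name:

```python
number_words = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen",
]

def chapter_number(heading):
    """Turn a heading like 'CHAPTER FOUR' into 4 (raises ValueError if unknown)."""
    word = heading.split()[1].lower()
    return number_words.index(word) + 1

# A few of the headings found above.
headings = ["CHAPTER ONE", "CHAPTER TWO", "CHAPTER THREE", "CHAPTER EIGHTEEN"]
numbers = [chapter_number(h) for h in headings]
print(numbers)
```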

Some chapters are divided into parts. They are represented as lines of the form `PART 1 – ESTABLISHMENT AND ROLE OF PARLIAMENT`. We create a regex for this as well:

In [20]:
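# Capture the rest of the line as the title, so that punctuation such as
# the apostrophe in "PARLIAMENT'S" does not cut the title short.
part_pattern = re.compile(r'PART\s+(\d+)\s+–\s+(.+)')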

for line in all_lines:
    line = line.strip()
    if part_pattern.match(line):
        print(line)

PART 1 – GENERAL PROVISIONS TO THE BILL OF RIGHTS
PART 2 – RIGHTS AND FUNDAMENTAL FREEDOMS
PART 3 – SPECIFIC APPLICATION OF RIGHTS
PART 4 – STATE OF EMERGENCY
PART 5 – KENYA NATIONAL HUMAN RIGHTS AND EQUALITY COMMISSION
PART 1 – LAND
PART 2 – ENVIRONMENT AND NATURAL RESOURCES
PART 1 – ELECTORAL SYSTEM AND PROCESS
PART 2 – INDEPENDENT ELECTORAL AND BOUNDARIES
PART 3 – POLITICAL PARTIES
PART 1 – ESTABLISHMENT AND ROLE OF PARLIAMENT
PART 2 – COMPOSITION AND MEMBERSHIP OF PARLIAMENT
PART 3 – OFFICES OF PARLIAMENT
PART 4 – PROCEDURES FOR ENACTING LEGISLATION
PART 5 – PARLIAMENT'S GENERAL PROCEDURES AND RULES
PART 6 – MISCELLANEOUS
PART 1 – PRINCIPLES AND STRUCTURE OF THE NATIONAL EXECUTIVE
PART 2 – THE PRESIDENT AND DEPUTY PRESIDENT
PART 3 – THE CABINET
PART 4 – OTHER OFFICES
PART 1 – JUDICIAL AUTHORITY AND LEGAL SYSTEM
PART 2 – SUPERIOR COURTS
PART 3 – SUBORDINATE COURTS
PART 4 – JUDICIAL SERVICE COMMISSION
PART 1 – OBJECTS AND PRINCIPLES OF DEVOLVED GOVERNMENT
PART 2 – COUNTY GOVERNMENTS


At this point we've identified all the important sections. However, some lines are the running header of each page in the original document. The headers include the pattern `[Rev. 2022]`, so we create a regular expression for this:

In [21]:
header_pattern = re.compile(r'\[Rev\.\s+2022\]')

In [22]:
for line in all_lines:
    line = line.strip()
    if header_pattern.search(line):
        print(line, end="       ")

14 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 15       16 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 17       18 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 19       20 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 21       22 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 23       24 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 25       26 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 27       28 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 29       30 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 31       32 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 33       34 Constitution of Kenya [Rev. 2022]       [Rev. 2022] Constitution of Kenya 35       36 Constitution of Kenya [Rev. 2022]       [Rev. 2022]

We can now exclude these lines:

In [23]:
relevant_lines = [l for l in all_lines if not header_pattern.search(l)]

In [24]:
len(relevant_lines)

4547
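A quick sanity check on that number: if every page in the selected range contributes exactly one header line, the number of removed lines should equal the number of pages. The sketch below uses the counts reported in this notebook as plain literals.

```python
total_lines = 4657   # len(all_lines) reported earlier
kept_lines = 4547    # len(relevant_lines) after removing headers
first_page, last_page = 14, 123

removed = total_lines - kept_lines
page_count = last_page - first_page + 1
print(removed, page_count)
assert removed == page_count, "expected one header line per page"
```

Both values come out to 110, which is consistent with one header per page and no body lines being dropped by the filter.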

In [25]:
relevant_lines[:30]

['CHAPTER ONE\n',
 'SOVEREIGNTY OF THE PEOPLE AND SUPREMACY OF THIS CONSTITUTION\n',
 'Sovereignty of the people.\n',
 '1. \xa0 (1)\xa0\xa0All sovereign power belongs to the people of Kenya and shall be exercised\n',
 'only in accordance with this Constitution.\n',
 '(2)\xa0\xa0The people may exercise their sovereign power either directly or through their\n',
 'democratically elected representatives.\n',
 '(3)\xa0\xa0Sovereign power under this Constitution is delegated to the following State\n',
 'organs, which shall perform their functions in accordance with this Constitution—\n',
 '(a) Parliament and the legislative assemblies in the county governments;\n',
 '(b) the national executive and the executive structures in the county\n',
 'governments; and\n',
 '(c) the Judiciary and independent tribunals.\n',
 '(4)\xa0\xa0The sovereign power of the people is exercised at—\n',
 '(a) the national level; and\n',
 '(b) the county level.\n',
 'Supremacy of this Constitution.\n',
 '2. \xa0 (1)\

Looking at the first few lines, we notice the non-breaking space character `\xa0` in some lines. These can be normalized to regular whitespace:

In [26]:
import unicodedata

l = relevant_lines[3]
unicodedata.normalize('NFKC', l), l

('1.   (1)  All sovereign power belongs to the people of Kenya and shall be exercised\n',
 '1. \xa0 (1)\xa0\xa0All sovereign power belongs to the people of Kenya and shall be exercised\n')

We do the same for all the lines:

In [27]:
relevant_lines = [unicodedata.normalize('NFKC', l) for l in relevant_lines]
relevant_lines[:30]

['CHAPTER ONE\n',
 'SOVEREIGNTY OF THE PEOPLE AND SUPREMACY OF THIS CONSTITUTION\n',
 'Sovereignty of the people.\n',
 '1.   (1)  All sovereign power belongs to the people of Kenya and shall be exercised\n',
 'only in accordance with this Constitution.\n',
 '(2)  The people may exercise their sovereign power either directly or through their\n',
 'democratically elected representatives.\n',
 '(3)  Sovereign power under this Constitution is delegated to the following State\n',
 'organs, which shall perform their functions in accordance with this Constitution—\n',
 '(a) Parliament and the legislative assemblies in the county governments;\n',
 '(b) the national executive and the executive structures in the county\n',
 'governments; and\n',
 '(c) the Judiciary and independent tribunals.\n',
 '(4)  The sovereign power of the people is exercised at—\n',
 '(a) the national level; and\n',
 '(b) the county level.\n',
 'Supremacy of this Constitution.\n',
 '2.   (1)  This Constitution is the supr
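NFKC normalization does more than replace non-breaking spaces: it also folds other compatibility characters, such as the `ﬁ` ligature seen earlier in the title 'National, ofﬁcial and other languages.', while leaving punctuation like the em dash untouched. A small sketch with hand-written sample strings:

```python
import unicodedata

samples = {
    "nbsp": "1.\xa0\xa0(1)",          # non-breaking spaces
    "ligature": "of\ufb01cial",       # U+FB01 LATIN SMALL LIGATURE FI
    "em dash": "exercised at\u2014",  # U+2014 has no compatibility mapping
}

normalized = {name: unicodedata.normalize("NFKC", text) for name, text in samples.items()}
for name, text in normalized.items():
    print(f"{name}: {text!r}")
```

This matters for downstream search: after normalization, a plain-text query for "official" will find the article title.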

At this point we can put all the ingredients together. We'll extract each of the 264 articles as a list of its lines. For each article, we include the number and title of the chapter it belongs to. For chapters that are divided into parts, we also include the number and title of the part the article belongs to; otherwise the part is set to `None`.

In [28]:
number_words = [
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
    "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen"
]

current_chapter, current_part, current_article = None, None, None
article_list = []
index = 0

def to_int(word):
    """Convert string word to integer number"""
    word = word.lower()
    return number_words.index(word) + 1

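def is_article_title(index):
    """Return True if the line at `index` is directly followed by an article number line."""
    next_index = index + 1
    if next_index < len(relevant_lines):
        next_line = relevant_lines[next_index].strip()
        return article_num_pattern.match(next_line) is not None
    # The last line has nothing after it, so it cannot be a title.
    return False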


while index < len(relevant_lines):
    line = relevant_lines[index].strip()
    if match := chapter_pattern.match(line):
        # reset part, article
        current_part = None

        if current_article:
            # append to list before resetting
            article_list.append(current_article)
        current_article = None

        # Current line is the chapter num, next line is the title
        num_word = match.group(1)
        current_chapter = (to_int(num_word), relevant_lines[index+1].strip())
        index += 1 # swallow the title line
    elif match := part_pattern.match(line):
        part_number, part_title = int(match.group(1)), match.group(2)
        current_part = (part_number, part_title)
    elif match := article_num_pattern.match(line):
        
        if current_article:
            # append previous article before starting new one
            article_list.append(current_article)

        current_article = dict(
            number=int(match.group(1)),
            title=relevant_lines[index-1].strip(),
            lines=[re.sub(r'\d+\.\s+', '', relevant_lines[index], count=1)],  # strip out article number
            part=current_part,
            chapter=current_chapter,
        )
    else:
        # not the beginning of any section
        
        # if we have a current_article, then this
        # is likely an additional line of the article.
         
        # However, since a new article's title precedes
        # the article number, check that possibility first
        if current_article and not is_article_title(index):
            current_article['lines'].append(relevant_lines[index])
    index += 1

# append last article
article_list.append(current_article)

In [29]:
len(article_list)

264
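To make the control flow easier to follow, here is a stripped-down sketch of the same idea on a toy input. It is a simplification, not the notebook's exact code: `parse_articles` is a hypothetical function, it ignores parts, and it keeps the article number in the first line instead of stripping it.

```python
import re

chapter_pattern = re.compile(r'CHAPTER\s+(\w+)')
article_num_pattern = re.compile(r'(\d+)\.\s+\S+')

def parse_articles(lines):
    """Group lines into articles; chapters and articles only, no parts."""
    articles, current, chapter = [], None, None
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if chapter_pattern.match(line):
            if current:
                articles.append(current)
                current = None
            chapter = (line, lines[i + 1].strip())
            i += 1  # skip the chapter title line
        elif m := article_num_pattern.match(line):
            if current:
                articles.append(current)
            current = {"number": int(m.group(1)), "title": lines[i - 1].strip(),
                       "chapter": chapter, "lines": [line]}
        elif current:
            # A line directly followed by an article number is the next title.
            next_is_article = (i + 1 < len(lines)
                               and article_num_pattern.match(lines[i + 1].strip()))
            if not next_is_article:
                current["lines"].append(line)
        i += 1
    if current:
        articles.append(current)
    return articles

toy = [
    "CHAPTER ONE", "TOY CHAPTER",
    "First title.", "1.  Body of the first article", "continues here.",
    "Second title.", "2.  Body of the second article.",
]
parsed = parse_articles(toy)
for a in parsed:
    print(a["number"], a["title"], a["lines"])
```

Note how "Second title." is kept out of the first article's lines because the line after it starts a new article.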

In [30]:
from pprint import pprint

for article in article_list[:5]:
    pprint(article)
    print('***')

{'chapter': (1, 'SOVEREIGNTY OF THE PEOPLE AND SUPREMACY OF THIS CONSTITUTION'),
 'lines': ['(1)  All sovereign power belongs to the people of Kenya and shall '
           'be exercised\n',
           'only in accordance with this Constitution.\n',
           '(2)  The people may exercise their sovereign power either directly '
           'or through their\n',
           'democratically elected representatives.\n',
           '(3)  Sovereign power under this Constitution is delegated to the '
           'following State\n',
           'organs, which shall perform their functions in accordance with '
           'this Constitution—\n',
           '(a) Parliament and the legislative assemblies in the county '
           'governments;\n',
           '(b) the national executive and the executive structures in the '
           'county\n',
           'governments; and\n',
           '(c) the Judiciary and independent tribunals.\n',
           '(4)  The sovereign power of the people is exercis

We can check that all articles were extracted in order:

In [31]:
for i in range(len(article_list)):
    article_num = article_list[i]['number']
    assert article_num == i+1

Now we have all articles of the constitution as a list of dictionaries. Each dictionary has the following attributes:
- `number`: the article number
- `title`: the article's title
- `lines`: the lines that make up the article, including all clauses and sub-clauses.
- `chapter`: a tuple of the number and title of the chapter the article belongs to
- `part`: a tuple of the part number and title the article belongs to if available, otherwise this will be `None`.
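As an illustration of working with one of these records, the sketch below builds a shortened copy of Article 1 by hand (only its first two lines) and derives a readable heading and a single text string from it. The helper names `article_text` and `heading` are hypothetical.

```python
example_article = {
    "number": 1,
    "title": "Sovereignty of the people.",
    "lines": [
        "(1)  All sovereign power belongs to the people of Kenya and shall be exercised\n",
        "only in accordance with this Constitution.\n",
    ],
    "chapter": (1, "SOVEREIGNTY OF THE PEOPLE AND SUPREMACY OF THIS CONSTITUTION"),
    "part": None,  # Chapter One is not divided into parts
}

def article_text(article):
    """Join an article's lines into one string, undoing the PDF line wraps."""
    return " ".join(line.strip() for line in article["lines"])

def heading(article):
    """Build a heading like 'Chapter 1, Article 1: <title>'."""
    chapter_num, _ = article["chapter"]
    part = f", Part {article['part'][0]}" if article["part"] else ""
    return f"Chapter {chapter_num}{part}, Article {article['number']}: {article['title']}"

print(heading(example_article))
print(article_text(example_article))
```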

We can now save this in JSON:

In [32]:
!mkdir -p json

import json

with open('json/ConstitutionKenya2010.json', 'wt') as f:
    json.dump(article_list, f)

We now have a JSON version of the 2010 constitution's articles that can easily be ingested by programs without any complicated parsing.
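A consumer of the file might load it and index the articles by number. The sketch below shows the round trip on a one-record sample written to a temporary file, so it does not depend on the real output. One thing to keep in mind is that JSON has no tuple type, so the `chapter` and `part` tuples come back as lists.

```python
import json
import os
import tempfile

sample = [{
    "number": 1,
    "title": "Sovereignty of the people.",
    "lines": ["only in accordance with this Constitution.\n"],
    "chapter": (1, "SOVEREIGNTY OF THE PEOPLE AND SUPREMACY OF THIS CONSTITUTION"),
    "part": None,
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")  # temporary stand-in for the real file
    with open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    with open(path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

# Index by article number for direct lookup.
by_number = {article["number"]: article for article in loaded}
print(by_number[1]["title"])
print(type(by_number[1]["chapter"]).__name__)  # tuples are read back as lists
```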

Kudos to the folks at the Government Printer for producing a PDF version that has consistent formatting and is easy to parse without ugly hacks. At least in this case, Kenyans can be happy their tax money was used properly :)