# Analysis of State Constitutions

This project analyzes the nature, length, and evolution of state constitutions over time. Important questions:

1. How do constitutions change over time? 
2. Are there any obvious groups of similar constitutions? Have they become more similar or less similar over time? 
3. Can we find an

A first step is to download all the constitutions from the NBER's State Constitutions Project website (the URL is in the cell below).

In [1]:
import urllib, urllib.request
from bs4 import BeautifulSoup, SoupStrainer
import os
import requests
import wget
import io
import re

In [2]:
constitutions_url = 'http://www.stateconstitutions.umd.edu/texts/'

# URLopener is deprecated; urlopen fetches the same index page
with urllib.request.urlopen(constitutions_url) as f:
    content = f.read()

soup = BeautifulSoup(content, "lxml")

files = []
for link in BeautifulSoup(content, "lxml",
                         parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if len(link['href'])>1:
            print(link['href'])
            files.append(link['href'])

/texts/AK1836_final_parts_0.txt
/texts/AK1959_final_parts_0.txt
/texts/AL1819_final_parts_0.txt
/texts/AL1861_final_parts_0.txt
/texts/AL1865_final_parts_0.txt
/texts/AL1868_final_parts_0.txt
/texts/AL1875_final_parts_0.txt
/texts/AL1901_001200_final_parts_0.txt
/texts/AL1901_201400_final_parts_0.txt
/texts/AL1901_401658_final_parts_0.txt
/texts/AL1901_final_parts_0.txt
/texts/AR1864_final_parts_0.txt
/texts/AR1868_final_parts_0.txt
/texts/AR1874_final_parts_0.txt
/texts/AZ1912_final_parts_0.txt
/texts/CO1876_amd_final_parts.txt
/texts/CO1876_amd_final_parts_0.txt
/texts/CO1876_final_parts_0.txt
/texts/CT1662_final_parts_0.txt
/texts/CT1818_final_parts_0.txt
/texts/CT1955_final_parts_0.txt
/texts/CT1965_final_parts_0.txt
/texts/DE1776_final_parts_0.txt
/texts/DE1792_final_parts_0.txt
/texts/DE1831_final_parts_0.txt
/texts/DE1897_final_parts_0.txt
/texts/FL1838%20_final_parts_0.txt
/texts/FL1861_final_parts_0.txt
/texts/FL1865_final_parts_0.txt
/texts/FL1868_Final_parts_0.txt
/texts/FL1
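The listing above includes at least one percent-encoded name (`FL1838%20_final_parts_0.txt`), and the loop keeps every link longer than one character. As a minimal sketch, assuming the index page is plain HTML with `<a href>` links, the standard-library `html.parser` can collect only the `.txt` links and resolve them to absolute URLs. The class name `TxtLinkCollector` and the sample HTML are illustrative and not part of the original notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TxtLinkCollector(HTMLParser):
    """Collect absolute URLs of links ending in .txt (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and href.lower().endswith('.txt'):
            # Resolve relative links against the page they came from
            self.links.append(urljoin(self.page_url, href))


# Made-up sample standing in for the downloaded index page
sample_html = (
    '<a href="/">Home</a>'
    '<a href="/texts/AK1959_final_parts_0.txt">AK 1959</a>'
    '<a href="/texts/FL1838%20_final_parts_0.txt">FL 1838</a>'
)
collector = TxtLinkCollector('http://www.stateconstitutions.umd.edu/texts/')
collector.feed(sample_html)
txt_links = collector.links
```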

We now have a list of the files on the site. Next, let's download all of them to a local directory so that we can inspect the texts.

If it is not already there, make a directory, then download each text in turn.

In [3]:
!mkdir texts

A subdirectory or file texts already exists.
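The `mkdir` call reports that the directory already exists when the cell is re-run. A minimal sketch of an alternative, assuming we want the files inside `texts/` (the cells below save into the working directory instead): `os.makedirs(..., exist_ok=True)` is silent on re-runs, and the fetch step is passed in as a callable. The names `local_name`, `download_missing`, and `fetch` are hypothetical.

```python
import os
from urllib.parse import unquote


def local_name(href):
    """Map a link such as '/texts/FL1838%20_final_parts_0.txt' to a file name."""
    return unquote(href.rsplit('/', 1)[-1])


def download_missing(hrefs, fetch, out_dir='texts'):
    """Save each missing file into out_dir and return the hrefs that failed.

    `fetch` is any callable that takes an href and returns bytes
    (an assumption of this sketch, not something the notebook defines).
    """
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    failed = []
    for href in hrefs:
        path = os.path.join(out_dir, local_name(href))
        if os.path.isfile(path):
            continue  # already downloaded
        try:
            data = fetch(href)
        except Exception:
            failed.append(href)
            continue
        with open(path, 'wb') as out:
            out.write(data)
    return failed
```

In the notebook's setting, `fetch` could wrap something like `requests.get(base_url + href).content`.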


In [4]:
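base_url = 'http://www.stateconstitutions.umd.edu'

rejects = []      # URLs that wget failed to download
rejectNames = []  # matching local file names

for file in files:
    f = base_url + file
    fileName = file.replace('/texts/', '')
    # Skip files that are already present locally
    if not os.path.isfile(fileName):
        try:
            junk = wget.download(f)
        except Exception:
            # Remember the failure so it can be retried with requests
            rejects.append(f)
            rejectNames.append(fileName)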

In [5]:
# Retry each failed download with requests
for reject in rejects:
    name = reject.replace(base_url + '/texts/', '')
    if not os.path.isfile(name):
        r = requests.get(reject)
        # Decode leniently, dropping bytes that are not valid UTF-8
        with open(name, 'w') as code:
            code.write(r.content.decode("utf-8", 'ignore'))

We now have all the files in the directory. Since some file names changed during the download, let's get a list of the files actually on disk.

In [6]:
fileList = os.listdir()

In [7]:
fileList[1]

'AK1836_final_parts_0.txt'
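`os.listdir()` returns every entry in the directory, in arbitrary order, so the list can also contain the notebook itself or `dictionary.txt`. A minimal sketch of a stricter listing, assuming constitution files always start with a two-letter state code and a four-digit year as in the names shown earlier; `CONSTITUTION_NAME` and `constitution_files` are hypothetical names.

```python
import os
import re

# Two capital letters, four digits, anything, then .txt
# (e.g. AL1901_final_parts_0.txt)
CONSTITUTION_NAME = re.compile(r'^[A-Z]{2}\d{4}.*\.txt$')


def constitution_files(directory='.'):
    """Return the sorted constitution file names found in `directory`."""
    return sorted(name for name in os.listdir(directory)
                  if CONSTITUTION_NAME.match(name))
```

With a sorted, filtered list, an index such as `fileList[1]` refers to the same file on every run.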

In [8]:
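with open(fileList[1], encoding='cp1252', errors='ignore') as fh:
    file = fh.read()

In [9]:
# Drop everything before the first occurrence of 'ARKANSAS'
m = re.search('ARKANSAS', file)
file = file[m.start():]

In [10]:
# Replace each '!' plus the word character after it with a space
file = re.sub(r'!\w', ' ', file)

In [11]:
# Lowercase the text and split it into runs of word characters
words = re.findall(r'\w+', file.lower())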

We still have a lot of stray characters in these token sequences. Let's see if we can strip them out.

In [12]:
words = [re.sub('[ìôóðäïòø¼àµ]', ' ', word).strip().split() for word in words]
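To make the effect of this step concrete, here is a small illustration on made-up tokens (they are not taken from the file). Each stray character becomes a space, so a token with one in the middle splits into two fragments, and every token turns into a list.

```python
import re

STRAY = '[ìôóðäïòø¼àµ]'  # same character class as the cell above

tokens = ['general', 'proìvisions', 'ôstate']  # made-up tokens for illustration
pieces = [re.sub(STRAY, ' ', tok).strip().split() for tok in tokens]
# pieces == [['general'], ['pro', 'visions'], ['state']]
```

The result is a list of lists, and a token made up only of stray characters would yield an empty list.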

It would also help to have a function that flattens this nested result into a simple list:

In [13]:
def flatten(x):
    result = []
    for el in x:
        if hasattr(el, "__iter__") and not isinstance(el, str):
            result.extend(flatten(el))
        else:
            result.append(el)
    return result

In [14]:
words = flatten(words)
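Because the nested list here is only one level deep, the recursion in `flatten` is not strictly needed. As a sketch of an equivalent for this specific case, `itertools.chain.from_iterable` from the standard library gives the same result; the sample data is made up.

```python
from itertools import chain

nested = [['general'], ['pro', 'visions'], [], ['state']]  # illustrative data
flat = list(chain.from_iterable(nested))
# flat == ['general', 'pro', 'visions', 'state']
```

The recursive `flatten` remains the safer choice if the nesting depth is unknown.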

Next, it is probably a good idea to keep only the words that appear in a dictionary. We have a dictionary file, `dictionary.txt`, and we will make good use of it here.

In [15]:
with open('dictionary.txt') as fh:
    dictionaryFile = fh.read()

In [16]:
# Lowercase the dictionary and split it into individual words
dictwords = re.findall(r'\w+', dictionaryFile.lower())

In [17]:
len(dictwords)

45333

In [18]:
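# Keep dictionary words, or two adjacent fragments that join into one
keepers = []
for i in range(len(words)):
    if words[i] in dictwords:
        keepers.append(words[i])
    # Note: this bound skips the merge test for the second-to-last word
    elif i < len(words) - 2 and words[i] + words[i+1] in dictwords:
        keepers.append(words[i] + words[i+1])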
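The loop above merges two adjacent fragments when their concatenation is a dictionary word, but it then examines the second fragment again on the next iteration. That is probably why a pair such as `'assembled', 'bled'` shows up in the printed output further down. A hedged sketch of a variant that consumes the second fragment after a merge and uses a `set` for faster lookups follows. The function name `keep_dictionary_words` is hypothetical, and this variant would change the word counts reported in this notebook.

```python
def keep_dictionary_words(words, vocabulary):
    """Keep dictionary words; merge adjacent fragments that form one."""
    vocab = set(vocabulary)  # set membership is much faster than list membership
    kept = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in vocab:
            kept.append(word)
            i += 1
        elif i + 1 < len(words) and word + words[i + 1] in vocab:
            kept.append(word + words[i + 1])
            i += 2  # skip the fragment that was just merged
        else:
            i += 1  # not a dictionary word, drop it
    return kept


example = keep_dictionary_words(
    ['assem', 'bled', 'little', 'rock'],
    ['assembled', 'bled', 'little', 'rock'],
)
# example == ['assembled', 'little', 'rock']
```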

In [19]:
import nltk
from nltk.corpus import stopwords
import collections

In [20]:
print(keepers[:100])

['arkansas', 'ordinance', 'and', 'acceptance', 'compact', 'the', 'general', 'assembly', 'the', 'state', 'arkansas', 'ordained', 'the', 'general', 'assembly', 'the', 'state', 'arkansas', 'virtue', 'the', 'authority', 'vested', 'said', 'general', 'assembly', 'the', 'pro', 'visions', 'the', 'ordinance', 'adopted', 'the', 'convention', 'delegates', 'assembled', 'bled', 'little', 'rock', 'for', 'the', 'purpose', 'forming', 'constitution', 'and', 'system', 'government', 'for', 'said', 'state', 'that', 'the', 'propositions', 'set', 'forth', 'act', 'supplementary', 'the', 'act', 'entitled', 'act', 'for', 'the', 'admission', 'the', 'state', 'arkansas', 'into', 'the', 'union', 'and', 'provide', 'for', 'the', 'due', 'execution', 'the', 'laws', 'the', 'united', 'states', 'within', 'the', 'same', 'and', 'for', 'other', 'purposes', 'and', 'the', 'same', 'are', 'hereby', 'freely', 'accepted', 'ratified', 'and', 'irrevocably', 'confirmed', 'articles', 'compact']


In [21]:
KeeperText = nltk.Text(keepers)

In [22]:
KeeperText.collocations()

general assembly; united states; bylaw law; elect one; one
representative; shall elect; compose one; mend art; free white; one
senator; supreme court; house representatives; prescribed bylaw; white
male; state arkansas; ratified november; shall compose; two
representatives; circuit courts; november mend


In [23]:
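# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in keepers if word not in stop_words]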

In [24]:
len(filtered_words)

5379

In [25]:
fdist = nltk.FreqDist(filtered_words)

In [26]:
fdist.most_common(40)

[('shall', 428),
 ('send', 132),
 ('state', 120),
 ('may', 88),
 ('one', 84),
 ('general', 77),
 ('county', 72),
 ('assembly', 66),
 ('elect', 60),
 ('law', 53),
 ('two', 44),
 ('office', 44),
 ('governor', 44),
 ('time', 42),
 ('house', 39),
 ('representatives', 38),
 ('years', 37),
 ('bylaw', 35),
 ('elected', 34),
 ('arkansas', 33),
 ('court', 32),
 ('courts', 32),
 ('states', 30),
 ('united', 30),
 ('power', 29),
 ('counties', 28),
 ('circuit', 27),
 ('election', 26),
 ('every', 25),
 ('peace', 25),
 ('representative', 23),
 ('supreme', 22),
 ('senator', 21),
 ('judges', 20),
 ('constitution', 20),
 ('term', 20),
 ('number', 20),
 ('manner', 20),
 ('government', 19),
 ('free', 19)]

So, we have a fairly reliable way of reading in a constitution and cleaning it. The natural next step is a function that automates this work. First, though, it is more interesting to look at the next file in the list and see how similar it is.
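The automation mentioned above could be sketched roughly as follows. This is a sketch, not the author's function: `clean_text` is a hypothetical name, the stop words are passed in as a parameter so that the block does not depend on NLTK, and `collections.Counter` stands in for `nltk.FreqDist`. The steps mirror the cells above, including the same bound in the fragment-merging loop.

```python
import re
from collections import Counter

STRAY = '[ìôóðäïòø¼àµ]'  # stray characters removed in the cells above


def clean_text(text, vocabulary, stop_words=()):
    """Return word counts for one constitution, following the steps above."""
    vocab = set(vocabulary)
    stop = set(stop_words)
    text = re.sub(r'!\w', ' ', text)
    # Tokenize, then split tokens wherever a stray character occurs
    words = []
    for token in re.findall(r'\w+', text.lower()):
        words.extend(re.sub(STRAY, ' ', token).split())
    # Keep dictionary words, or adjacent fragments that join into one
    keepers = []
    for i, word in enumerate(words):
        if word in vocab:
            keepers.append(word)
        elif i < len(words) - 2 and word + words[i + 1] in vocab:
            keepers.append(word + words[i + 1])
    return Counter(w for w in keepers if w not in stop)


counts = clean_text(
    'The Gene ral Assembly shall by law provide',
    ['the', 'general', 'assembly', 'shall', 'by', 'law', 'provide'],
    stop_words=['the', 'by'],
)
```

In the notebook, a call would pass the file contents, `dictwords`, and `stopwords.words('english')`, and `counts.most_common(40)` would then play the role of `fdist.most_common(40)`.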

In [28]:
fileList[2]

'AK1959_final_parts_0.txt'

In [29]:
with open(fileList[2], encoding='cp1252', errors='ignore') as fh:
    file = fh.read()

In [30]:
file

'*** CSTART AK 1/3/1959 1/1/2003 ***\n\nCONSTITUTION OF ALASKA-1959 \n\n*** ASTART 9001.0 AK 1959 ***\n\nPreamble\n\nWe the people of Alaska, grateful to God and to those who founded our nation and pioneered this\ngreat land, in order to secure and transmit to succeeding generations our heritage of political,\ncivil, and religious liberty within the Union of States, do ordain and establish this constitution for\nthe State of Alaska.\n\n.0*** AEND ***\n*** ASTART 001.0 AK 1959 ***\n\nARTICLE I.  Declaration of Rights\n\n*** SSTART 001.0 001.0 0 AK 1959 ***\n\nSection 1. Inherent Rights. This constitution is dedicated to the principles that all persons have a\nnatural right to life, liberty, the pursuit of happiness, and the enjoyment of the rewards of their\nown industry; that all persons are equal and entitled to equal rights, opportunities, and protection\nunder the law; and that all persons have corresponding obligations to the people and to the State.\n \n*** SEND ***\n*** SSTART 00
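The raw text contains structural markers such as `*** SSTART ... ***` and `*** SEND ***`. Most marker tokens (`cstart`, `sstart`, `aend`) are dropped later by the dictionary check, but `send` is an ordinary English word, so these markers may be inflating its count in the frequency tables. A minimal sketch, assuming every marker opens and closes with `***` on a single line, that removes them before tokenizing; `strip_markers` is a hypothetical helper.

```python
import re

MARKER = re.compile(r'\*\*\*.*?\*\*\*')  # e.g. '*** SEND ***'


def strip_markers(text):
    """Replace each '*** ... ***' annotation with a space."""
    return MARKER.sub(' ', text)


# Sample assembled from fragments of the output above
sample = ('*** SSTART 001.0 001.0 0 AK 1959 ***\n\n'
          'Section 1. Inherent Rights.\n \n*** SEND ***\n')
tokens = re.findall(r'\w+', strip_markers(sample).lower())
# tokens == ['section', '1', 'inherent', 'rights']
```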

In [31]:
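# Replace each '!' plus the word character after it with a space
file = re.sub(r'!\w', ' ', file)

In [33]:
# Lowercase the text and split it into runs of word characters
words = re.findall(r'\w+', file.lower())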

In [34]:
words

['cstart',
 'ak',
 '1',
 '3',
 '1959',
 '1',
 '1',
 '2003',
 'constitution',
 'of',
 'alaska',
 '1959',
 'astart',
 '9001',
 '0',
 'ak',
 '1959',
 'preamble',
 'we',
 'the',
 'people',
 'of',
 'alaska',
 'grateful',
 'to',
 'god',
 'and',
 'to',
 'those',
 'who',
 'founded',
 'our',
 'nation',
 'and',
 'pioneered',
 'this',
 'great',
 'land',
 'in',
 'order',
 'to',
 'secure',
 'and',
 'transmit',
 'to',
 'succeeding',
 'generations',
 'our',
 'heritage',
 'of',
 'political',
 'civil',
 'and',
 'religious',
 'liberty',
 'within',
 'the',
 'union',
 'of',
 'states',
 'do',
 'ordain',
 'and',
 'establish',
 'this',
 'constitution',
 'for',
 'the',
 'state',
 'of',
 'alaska',
 '0',
 'aend',
 'astart',
 '001',
 '0',
 'ak',
 '1959',
 'article',
 'i',
 'declaration',
 'of',
 'rights',
 'sstart',
 '001',
 '0',
 '001',
 '0',
 '0',
 'ak',
 '1959',
 'section',
 '1',
 'inherent',
 'rights',
 'this',
 'constitution',
 'is',
 'dedicated',
 'to',
 'the',
 'principles',
 'that',
 'all',
 'persons',
 

In [35]:
words = [re.sub('[ìôóðäïòø¼àµ]', ' ', word).strip().split() for word in words]

In [37]:
words = flatten(words)

In [38]:
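# Keep dictionary words, or two adjacent fragments that join into one
keepers = []
for i in range(len(words)):
    if words[i] in dictwords:
        keepers.append(words[i])
    # Note: this bound skips the merge test for the second-to-last word
    elif i < len(words) - 2 and words[i] + words[i+1] in dictwords:
        keepers.append(words[i] + words[i+1])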

In [39]:
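# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in keepers if word not in stop_words]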

In [40]:
fdist = nltk.FreqDist(filtered_words)

In [42]:
fdist.most_common(40)

[('shall', 468),
 ('section', 299),
 ('send', 206),
 ('state', 190),
 ('may', 172),
 ('law', 158),
 ('governor', 137),
 ('legislature', 119),
 ('bylaw', 111),
 ('election', 104),
 ('court', 74),
 ('office', 69),
 ('including', 64),
 ('members', 63),
 ('article', 62),
 ('mend', 62),
 ('area', 59),
 ('amended', 53),
 ('board', 52),
 ('session', 51),
 ('public', 50),
 ('prescribed', 50),
 ('one', 50),
 ('drained', 49),
 ('states', 48),
 ('united', 46),
 ('subject', 46),
 ('districts', 45),
 ('house', 45),
 ('general', 44),
 ('vote', 43),
 ('right', 43),
 ('first', 43),
 ('provided', 42),
 ('unless', 42),
 ('alaska', 42),
 ('river', 41),
 ('year', 41),
 ('days', 41),
 ('judicial', 40)]
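One way to put a number on "how similar" two constitutions are is the cosine similarity of their word-count vectors. This is offered as one possible measure, not as the method this project settles on. `nltk.FreqDist` is a subclass of `collections.Counter`, so the sketch below should accept either. The function name is hypothetical, and the sample counts are a handful of entries copied from the two tables above, for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A few counts copied from the two tables above, for illustration only
first = Counter({'shall': 428, 'state': 120, 'may': 88, 'county': 72})
second = Counter({'shall': 468, 'state': 190, 'may': 172, 'legislature': 119})
similarity = cosine_similarity(first, second)
```

A value near 1 means the two documents use words in similar proportions; a value of 0 means they share no words. Because `shall` dominates every table, raw counts will tend to make all constitutions look alike, so some reweighting may be worth considering before drawing conclusions.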

In [44]:
# Same pipeline as above, applied to fileList[3]
with open(fileList[3], encoding='cp1252', errors='ignore') as fh:
    file = fh.read()
file = re.sub(r'!\w', ' ', file)
words = re.findall(r'\w+', file.lower())
words = [re.sub('[ìôóðäïòø¼àµ]', ' ', word).strip().split() for word in words]
words = flatten(words)
# Keep dictionary words, or two adjacent fragments that join into one
keepers = []
for i in range(len(words)):
    if words[i] in dictwords:
        keepers.append(words[i])
    elif i < len(words) - 2 and words[i] + words[i+1] in dictwords:
        keepers.append(words[i] + words[i+1])
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in keepers if word not in stop_words]

In [46]:
fdist = nltk.FreqDist(filtered_words)

In [50]:
fdist.most_common(80)

[('shall', 393),
 ('send', 148),
 ('state', 109),
 ('general', 98),
 ('may', 86),
 ('assembly', 80),
 ('law', 51),
 ('county', 50),
 ('one', 48),
 ('house', 42),
 ('representatives', 41),
 ('office', 40),
 ('person', 38),
 ('governor', 36),
 ('bylaw', 36),
 ('power', 31),
 ('two', 30),
 ('provided', 30),
 ('time', 29),
 ('election', 26),
 ('courts', 26),
 ('states', 25),
 ('constitution', 25),
 ('united', 23),
 ('years', 23),
 ('laws', 23),
 ('every', 21),
 ('cases', 21),
 ('town', 20),
 ('court', 19),
 ('manner', 19),
 ('alabama', 19),
 ('said', 18),
 ('government', 18),
 ('first', 18),
 ('number', 18),
 ('session', 18),
 ('bank', 18),
 ('three', 18),
 ('entitled', 17),
 ('members', 17),
 ('next', 17),
 ('public', 16),
 ('year', 16),
 ('houses', 16),
 ('hundred', 16),
 ('within', 16),
 ('city', 16),
 ('except', 15),
 ('vote', 15),
 ('several', 15),
 ('term', 15),
 ('twenty', 15),
 ('circuit', 15),
 ('officers', 14),
 ('white', 14),
 ('hold', 14),
 ('unless', 14),
 ('counties', 14),
 (

In [55]:
# Same pipeline as above, applied to fileList[15]
with open(fileList[15], encoding='cp1252', errors='ignore') as fh:
    file = fh.read()
file = re.sub(r'!\w', ' ', file)
words = re.findall(r'\w+', file.lower())
words = [re.sub('[ìôóðäïòø¼àµ]', ' ', word).strip().split() for word in words]
words = flatten(words)
# Keep dictionary words, or two adjacent fragments that join into one
keepers = []
for i in range(len(words)):
    if words[i] in dictwords:
        keepers.append(words[i])
    elif i < len(words) - 2 and words[i] + words[i+1] in dictwords:
        keepers.append(words[i] + words[i+1])
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in keepers if word not in stop_words]

In [57]:
fdist = nltk.FreqDist(filtered_words)

In [58]:
fdist.most_common(80)

[('shall', 2157),
 ('state', 775),
 ('section', 755),
 ('law', 432),
 ('court', 402),
 ('may', 384),
 ('county', 368),
 ('office', 368),
 ('provided', 320),
 ('bylaw', 319),
 ('one', 311),
 ('election', 309),
 ('legislature', 277),
 ('person', 268),
 ('property', 261),
 ('effective', 257),
 ('year', 255),
 ('send', 253),
 ('members', 235),
 ('two', 227),
 ('mend', 225),
 ('amendment', 218),
 ('public', 218),
 ('governor', 216),
 ('commission', 209),
 ('dollars', 198),
 ('five', 194),
 ('thousand', 173),
 ('term', 171),
 ('said', 169),
 ('years', 165),
 ('general', 164),
 ('elected', 163),
 ('states', 158),
 ('except', 157),
 ('limitation', 157),
 ('united', 150),
 ('superior', 144),
 ('arizona', 144),
 ('prescribed', 142),
 ('number', 137),
 ('fiscal', 137),
 ('hundred', 136),
 ('district', 134),
 ('upon', 133),
 ('political', 132),
 ('corporation', 130),
 ('expenditure', 129),
 ('board', 129),
 ('supreme', 128),
 ('judges', 127),
 ('thereof', 126),
 ('total', 124),
 ('exceed', 124),
 