<center>
<h1><b>Pocket Lawyer: Basic Techniques</h1>
</center>

<h2>Intro:</h2>

This notebook details some of the algorithms that power the Pocket Lawyer app, available online <a href="http://pocketlawyer.maxfarago.com">here</a>.  In addition to these, the app also uses a number of more advanced algorithms including geolocation, custom text pre-processing, and multiple layers of classification and vector comparison.  Only the basic techniques are discussed here in order to keep the notebook brief and clear.

## Goal:
To retrieve and display laws and legal encyclopedia articles that are relevant to, and potentially answer, any law-related question.

## Process:

The basic idea is to categorize a user's question into an area of the law, *e.g.*, **Family Law.**  Once that section is predicted, the question is compared to state statutes and articles within that particular area of the law. The most relevant statutes and articles are then returned for the user.

The project uses three main data sets, each scraped from publicly available online sources and stored in its own NoSQL database:
- posts from an online message board
- state and federal laws
- legal encyclopedia articles

The laws and articles are already categorized by the sources they are retrieved from, and that categorization can be preserved while scraping.  Each law and article document in the databases stores this `'section'` data given to it by the source.

To categorize the user's question, a multi-class model is trained using the forum posts.  Message boards are generally categorized by the type of posts they contain, and the model is essentialy predicting where a user would choose to place their legal question had they posted it on the message board.

The benefit of this approach is that the user's question is compared to the natural language of the forum rather than the formal language of the laws or articles.

## Steps:
<a href="#step1">1. Data Retrieval and Storage</a>
<br>
<a href="#step2">2. Text Preprocessing</a>
<br>
<a href="#step3">3. Vectorization</a>
<br>
<a href="#step4">4. Machine Learning</a>
<br>
<a href="#step5">5. Cosine Similarity</a>

<hr />
<a name='step1'></a><center><b><font size='5em'>Step 1: Data Retrieval and Storage</center>
<hr />

Throughout this project I use <a href="https://www.mongodb.com/">MongoDB</a>, a popular NoSQL database, and <a href="https://api.mongodb.com/python/current/">Pymongo</a>, a Python package used to manipulate Mongo databases within a Python script. This project uses locally stored Mongo databases, accessed using a Pymongo client directed at the MongoDB port on `localhost`.

In [1]:
from pymongo import MongoClient
client = MongoClient('localhost')

<hr />
<a name="#step1a"></a>__Step 1a: Forum Posts__<br>
I chose to scrape legal-related posts from the <a href="http://boards.answers.findlaw.com/">FindLaw</a> message boards, where users post their legal questions and others respond with their advice.  Only the original post is scraped, not the response posts, with the rationale that the first post's text contains the actual question beneath the thread.  

The forum divides its threads into sections; these sections will be the basis for the `collections` of the database.  The data is stored in a database named `POSTS`, which uses the forum sections as collections.  This gives each MongoDB document the following structure:

    {
        'title': postTitle,
        'text': postText,
    }

Again, there is no need for a separate `'section'` field: the collection name indicates the section the thread was placed within on the forum.

<br>

__Step 1b: Legal Encyclopedia__<br>
I chose <a href="http://www.nolo.com/legal-encyclopedia">Nolo</a> as my "legal encyclopedia" because (1) it categorizes its articles similarly to how FindLaw categorizes its threads; (2) it's comprehensive, containing quality legal articles on virtually every area of law that exist; (3) the articles are written in the format of answers to example legal questions.  

Nolo breaks its articles into approximately 20 different areas of law.  These will serve as the collections which the `ARTICLES` database is broken up into. Consider <a href="http://www.nolo.com/legal-encyclopedia/damages-how-much-personal-injury-32264.html">this article</a> on personal injury law as an example.  Scraping the title, text, and URL of the page will create a MongoDB document structured in this way:

    {
        'title': 'Damages: How Much is a Personal Injury Case Worth?',
        'url': 'http://www.nolo.com/legal-encyclopedia/damages-how-much-personal-injury-32264.html'
        'text': 'If you're considering filing a personal injury lawsuit over a car accident...'
    }
        
Again, the `'section'` data is stored in the collection name, not in a document field for each article.

<br>

__Step 1c: State Laws__<br>
To retrieve state and federal laws, I chose <a href="http://law.justia.com/codes/">Justia</a>.  States categorize their laws in different ways and using widely varying nomenclature, but Justia does a nice job of keeping the hierarchies relatively consistent, which makes a huge difference when scraping the data.  

Because the organization of statutes varies so much by state, and the app will only be retrieving laws from the state the user is in, the database of laws is divided into collections organized by state rather than by section.  

<a name="lawExample" href="http://law.justia.com/codes/new-york/2015/cpl/part-2/title-m/article-450/450.30/">Take a look at this statute as an example.</a>  Above the law's text, in large bold font, is the entire hierarchy path for reaching it, starting with the root node ("2015 New York Laws") and ending with the title of the law itself ("450.30 - Appeal from sentence.").  This entire path, as well as the page URL and the text of the statute, is stored in the database, such that each MongoDB document is structured in this way:

    {
        "text" : "450.30  Appeal from sentence.  An  appeal  by  the  defendant... ",
        "section" : [
            "2015 New York Laws",
            "CPL - Criminal Procedure",
            "Part 2 - THE PRINCIPAL PROCEEDINGS",
            "Title M - PROCEEDINGS AFTER JUDGMENT",
            "Article 450 - (450.10 - 450.90) APPEALS--IN WHAT CASES AUTHORIZED AND TO WHAT COURTS TAKEN",
            "450.30 - Appeal from sentence."
        ],
        "url" : "http://law.justia.com/codes/new-york/2015/cpl/part-2/title-m/article-450/450.30/index.html"
    }

Note that unlike the posts and articles, the laws have a `'section'` field in order to store the entire hierarchy path with a list of strings.  This makes the app more versatile, as it can search statutes by section either broadly or narrowly.  The `LAWS` database is organized into collections by state rather than by section.

<hr />
<a name="step2"></a><center><b><font size='5em'>Step 2: Text Preprocessing</center>
<hr />

After scraping, three separate databases of documents are created, containing very different bodies of text. To analyze them meaningfully, each document's text is broken down into its individual words, *i.e.*, tokenizing by word.  This allows each text to be vectorized by word counts.

__Step 2a: Create Corpus Cleaning Functions__

Two functions are created to process the text scraped into the databases.  The first (`cleanText`) strips text of non-letter symbols and lowercases all the characters.  The second (`getTokens`) takes that cleaned string of text, removes the insignificant words in it, and *stems* the remaining significant ones.

<a href="https://en.wikipedia.org/wiki/Stemming">Stemming</a> improves vectorization by grouping together words of the same root, and thus the same root meaning, rather than treating them as separate words.

In [16]:
def cleanText(text):
    symbols = ['\n', '\r', '\u', '\u201', 'Nolo.com', 'nolo.com']
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    numbers = [ str(x) for x in range(10) ]
    
    for s in symbols:
        text = text.replace(s, ' ')
    for p in punctuation:
        text = text.replace(p, '')
    for n in numbers:
        text = text.replace(n, '')

    return text.lower()

The <a href="http://www.nltk.org/">Natural Language Toolkit (NLTK)</a> package has a function that stems via the <a href="https://tartarus.org/martin/PorterStemmer/">Porter algorithm</a>, and one that tokenizes by word.  These are imported and used here.  NLTK also conveniently has a list of <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a> that can be easily imported and used for text processing.

In [17]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def getTokens(text):
    tokens = word_tokenize(cleanText(text))
    return ' '.join([ PorterStemmer().stem(w) for w in tokens if w not in stopwords.words('english') ])

__Step 2b: Apply functions to databases__

Down the road it might be convenient for each post, law, and article document to store its tokens as well as its text.  To that end, the databases are traversed and a `'tokens'` field is inserted for each document, storing the result of the document's `'text'` field run through the preprocessing functions.

In [119]:
for database in client.database_names():
    db = client[database]
    for collection in db.collection_names():
        for doc in db[collection].find():
            db[collection].update_one({ '_id':doc['_id'] }, { '$set': { 'tokens':getTokens(doc['text']) } })

The `Pymongo` command `update_one()` requires a way to locate the document to update.  Here, the `_id` value — a unique ID given to each document by MongoDB during insertion into the database — is used to locate the document.  The `_id` value is found by going through the database, one document at a time, accessing each document's `_id`, and then performing the update.

There's a lot of looping going on here: the first `for` loop traverses all the databases in the client; the second traverses the collections in each database; the third traverses the documents in each collection by using a non-specific `find()` query for each collection.

<hr />
<a name="#step3"></a><center><b><font size='5em'>Step 3: Vectorization</center>
<hr />

Once all the text has been tokenized by word, each document can be vectorized into a word count.  <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF weighting</a> proves especially effective here, because law-related texts frequently contain key words that have great importance for a particular area of law, but are rarely used in any other areas.

As an example, take a look at the statute <a href="#lawExample">used earlier</a>.  The first sentence of the statute tells us a lot about it:

    An  appeal  by  the  defendant from a sentence, as authorized by  subdivision two of section 450.10, may be based  upon  the  ground  that  such sentence either was (a) invalid as a matter of law, or (b) harsh or excessive.
    
Running this sentence through the `getTokens` function creates a list of these stemmed words:

    [ "appeal", "defend", "sentenc", "author", "subdivis", "two", "section", "may", "base", "upon",
    "ground", "sentenc", "either", "invalid", "matter", "law", "harsh", "excess" ]

The first word, `appeal`, tells us that a legal issue has been ruled on, and suggests the statute is looking *after* the ruling temporally.  The word `sentence` indicates that the statute has to do with *criminal* law — it's specifically used in the law to refer to punishment in criminal cases.  Those two words alone tell us that the statute concerns a post-judgment issue in a criminal case.  

These words rarely occur together in the context of family law, bankruptcy law, or any other area of law.  The consistent usage of the same words and phrases in the legal profession to indicate **only** specific times and places within the law is what makes TF-IDF weighting so powerful in this context.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

The <a href="http://scikit-learn.org/stable/">scikit-learn</a> package contains a `TfidfVectorizer` object that vectorizes text into word counts and simultaneously applies TF-IDF weighting.  The result is a matrix whose rows represent the vectors of the individual documents, with a column for each unique token in the total "vocabulary" of the data set.  Scikit-learn has great <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">documentation</a> on the vectorizer (and on the package in general), and it's worth going through.

<a name="step3a"></a>__Step 3a: Vectorizing Forum Posts__

A vectorizer `postTIV` is fitted with a list of the `'tokens'` value of every post in the database.  The `postTIV` vectorizer will be used to transform the user's question.  The `fit_transform` method simultaneously returns the vectorized posts into `X`, which will be used to train the classifier in <a href="#step4">Step 4</a>.

In [19]:
db = client["POSTS"]   # Now working in the POSTS database

postTokens = [ x['tokens'] for collection in db.collection_names() for x in db[collection].find() ]

postTIV = TfidfVectorizer(max_features=10000, lowercase=False)
X = postTIV.fit_transform(postTokens)
X.shape

`TfidfVectorizer` has many parameters that can be passed to it.  For this example, `max_features` is set to `10000`, which limits the matrix to 10,000 columns representing only the most common words used in the text of the posts.

In [193]:
type(X)

scipy.sparse.csr.csr_matrix

Since each post uses only a small percentage of the 10,000 words in the total corpus vocabulary, the vast majority of values calculated by the vectorizer are going to be zeros.  To take advantage of this, a SciPy <a href="https://en.wikipedia.org/wiki/Sparse_matrix">sparse matrix</a> is used to store the data, generating much smaller filesizes and shorter load times.

In [17]:
import cPickle as pickle
pickle.dump(postTIV, open('postTIV.pkl', 'wb'))

The vectorizer is pickled so that it can be loaded when a user submits a question and used to transform the input.

__Step 3b: Vectorizing State Law Sections__

Once a user's question is categorized, it needs to be compared against the state laws.  The `postTIV` vectorizer transforms the user question using the vocabulary of the posts.  To be compared to the laws, it needs to be transformed again with a vectorizer fitted with the vocabulary of the state code.

Instead of creating a vectorizer for the entire set of state statutes, separate vectorizers are created for the different sections of the laws that correspond with the forum subsections.  Since the forum sections are not exactly the same as the statute sections, the connection has to be mapped manually and is stored in a dictionary.  The dictionary is kept in a `.txt` file for each state; this is what the forum -> laws section mapping looks like for Maryland:

    {
        "After Sentencing": (["Title 7 - UNIFORM POSTCONVICTION PROCEDURE ACT",
                            "Title 10 - CRIMINAL RECORDS",
                            "Title 8 - OTHER POSTCONVICTION REVIEW"]),

        "Auto Accidents":	([ "Title 20 - MARYLAND AUTOMOBILE INSURANCE FUND",
                            "TRANSPORTATION",
                            "Subtitle 1 - LIMITATIONS",
                            "Title 20 - VEHICLE LAWS -- ACCIDENTS AND ACCIDENT REPORTS" ]),
        ...
    }
    
The `key:value` format is structured `forum-section:law-section-list` — this allows multiple sections of the state code to be mapped against each forum-section.  Now we can traverse the dictionary and create a vectorizer for each `key` value that represents a forum-section, fitting the vectorizer using only the laws that contain the one of the law-section values in their hierarchy path.  That was the path stored in the `'section'` field of the `LAWS` database documents during scraping.

In [18]:
db = client["LAWS"]   # Now working in the LAWS database

for state in db.collection_names():
    f = open('queryToLaws/' + state + '.txt', 'r')
    forumSections_to_lawSections = eval(f.read())

    for forumSection in forumSections_to_lawSections.keys():
        lawSectionTokens =  [ ' '.join(law['tokens']) for lawSection in forumSections_to_lawSections[forumSection] \
                            for law in db[state].find({ 'section':lawSection }) ]

        tempTIV = TfidfVectorizer(lowercase=False)
        tempX = tempTIV.fit_transform(lawSectionTokens)
        law_pkl = (tempTIV, tempX)
        
        # Pickle law-section vectorizer and sparse matrix using forum-section as filename
        for c in ' /&\\,.:':
            filename = forumSection.replace(c, '')
        pickle.dump(law_pkl, open('vectorizers/laws/' + state + '/' + filename + '.pkl', 'wb'))

ValueError: empty vocabulary; perhaps the documents only contain stop words

Like the post vectorizer, the individual law-section vectorizers are pickled to be loaded — if needed.  Unlike the post vectorizer, which transforms every user question, the law-section vectorizers only transform the user question if it has been placed classifier in the forum-section corresponding to the law-section.

The sparse matrix containing the vectorized laws is pickled along with the vectorizer.  After vectorizing the user question, <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> is used to compare the vectorized question to the vectors of the sparse matrix in <a href="#step5">Step 5</a>.

__Step 3c: Vectorizing Legal Articles__

In [None]:
db = client["ARTICLES"]   # Now working in the ARTICLES database

f = open('forums_to_articles.txt', 'r')
forumSections_to_articleSections = eval(f.read())

for forumSection in forumSections_to_articleSections.keys():

    articleTokens =  [ article['tokens'] for articleSection in forumSections_to_articleSections \
                     for article in db[articleSection].find() ]

    tempTIV = TfidfVectorizer(lowercase=False)
    tempX = tempTIV.fit_transform(articleTokens)
    temp_pkl = (tempTIV, tempX)

    # Pickle article-section vectorizer and sparse matrix using forum-section as filename
    for c in ' /&\\,.:':
        filename = forumSection.replace(c, '')
    pickle.dump(temp_pkl, open('vectorizers/articles/' + filename + '.pkl', 'wb'))

<hr />
<a name="step4"></a><center><b><font size='5em'>Step 4: Machine Learning</center>
<hr />

To train a classifier, labels are needed for the sparse matrix produced by vectorizing the forum posts in <a href="#step3a">Step 3a</a>.  For this example, the name of the `collection` for each post is used.  In <a href="#step1a">Step 1a</a> the `POSTS` database collections are created using the different sections of the forum, so this classifier will predict which forum-section a user question belongs in.

__Step 4a: Create Labels__

In [None]:
postSections = [ section for section in db.collection_names() for post in db[section].find() ]

__Step 4b: Train Classifier on Sparse Matrix and Labels__

A <a href="http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html">multinomial naive Bayes classifier</a> is trained on the vectorized posts stored in `X` from <a href="#step3a">Step 3a</a>, and the forum-section labels stored in `postSections`.

In [23]:
from sklearn.naive_bayes import MultinomialNB

In [262]:
postSectionMNB = MultinomialNB()
postSectionMNB.fit(X, postSections)

pickle.dump(postSectionMNB, open('postSectionMNB.pkl', 'wb'))

__Step 4c: Predict Section__

Once the classifier has been trained on the sections, it can be used to predict which section a user's question would be placed in.  An example user query is created and transformed by the `postTIV` vectorizer, then passed into the naive Bayes classifier.

In [195]:
userQuery = '''
            i work more than 40 hours a week but my boss wont pay me for overtime
            '''
userX = postTIV.transform([getTokens(userQuery)])

predictedSection = postSectionMNB.predict(userX)[0]
print "Section: " + predictedSection

<hr />
<a name="step5"></a><center><b><font size='5em'>Step 5: Cosine Similarity</center>
<hr />

After predicting a forum-section for a user question, the __forum —> laws__ and __forum —> articles__ section mapping can be accessed from the dictionaries created in <a href="#step3">Step 3</a> to find the appropriate law and article vectorizers.  Each vectorizer transforms the user question into so that it can be compared to law-section and article-section sparse matrices, respectively.

Cosine similarity works by measuring the angle between two vectors.  All zero-values are skipped when measuring cosine similarity, since when the angle is $0^{\circ}$ the vectors are directly on top of each other.  This makes cosine similarity efficient for sparse matrices, which consist primarily of zeros.

The scikit-learn <a href="http://scikitlearn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html">cosine-similarity function</a> conveniently computes the cosine distances between a vector and every row in a sparse matrix.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

__Step 5a: Compute Cosine Distances Between User Question and Vectorized Articles__

In this example, the forum-section for the user question has been predicted by the classifier and is stored in `predictedSection`.  The article vectorizer corresponding to this forum-section is found and used to vectorize the user question and compare it to the rows of the sparse matrix.

In [None]:
for c in ' /&\\,.:':
    filename = predictedSection.replace(c, '')

articleTIV_pkl = pickle.load(open('vectorizers/articles/' + filename + '.pkl', 'rb'))
articleTIV = articleTIV_pkl[0]
articleX = articleTIV_pkl[1]

userX = articleTIV.transform([getTokens(userQuery)])
relevantArticleIndices = list(np.argsort([x[0] for x in cosine_similarity(articleX, userX)])[::-1])

The `cosine_similarity` function returns an array storing the cosine similarity between the user question and each article.  The most relevant articles will have the great values – *i.e.*, the values closest to `1` – and their indices within the array can be found using the `argsort` function from NumPy.  `argsort` takes the array and returns a list of its indices sorted by the magnitude of their values, least to greatest. 

The `[::-1]` is added to reverse the array and order it from greatest to least. The indices returned by `argsort` can be plugged into a list pulled from the `ARTICLES` database of all the articles used for the vectorizer.  Using these indices, the top 5 most relevant articles are located, and their titles and URLs displayed.

In [10]:
db = client["ARTICLES"]   # Now working in the ARTICLES database

tempList = [ article for articleSection in forumSections_to_articleSections[predictedSection] \
            for article in db[articleSection].find() ]

for i in relevantArticleIndices[:5]:
    print tempList[i]['title']
    print tempList[i]['url']
    print

NameError: name 'forumSections_to_articleSections' is not defined

__Step 5b: Compute Cosine Distances Between User Question and Vectorized Laws__

The same process is used to compare the user question to the laws that correspond to the forum-section predicted by the classifier.

In [58]:
db = client["LAWS"]   # Now working in the LAWS database

state = "NY"   # App will need to retrieve state using geolocation; "NY" chosen as example
f = open('forums_to_laws/' + state + ''.txt', 'r')
forumSections_to_ = eval(f.read())

nearestLawSects = queryToLaws[nearestSubsection][0]
nearestLawSects

'Title 2 - LAW ENFORCEMENT PROCEDURES; ARREST PROCESS'

In [59]:
lawTIV_pkl = pickle.load(open('vectorizers/laws/' + userState + '/' + subsectionFilename + '.pkl', 'rb'))
lawTIV = lawTIV_pkl[0]
lawX = lawTIV_pkl[1]
userX = lawTIV.transform([' '.join(getTokens(userQuery))])

In [60]:
distances = 1 - cosine_similarity(lawX, userX)
relevantLawIndices = list(np.argsort([x[0] for x in distances]))

In [66]:
#   Search within the predicted state statutes sections for most relevant laws

userState = "MD"
db = client["LAWS"]
tempLawList, tempSectionList = [], []

for lawSection in queryToLaws[nearestSubsection]:
    for law in db[userState].find({ 'section':lawSection }):
        tempLawList.append(law['text'])
        tempSectionList.append(law['section'])

        
nearestLaws = [ (tempLawList[x], tempSectionList[x][-1:]) for x in relevantLawIndices[:5] ]

for x in nearestLaws:
    print x[0]
    print

 MD Crim Pro Code § 2-102 (2015) What's This? (a) Scope of section. -- This section does not apply to an employee of the Department of State Police to whom the Secretary of State Police assigns the powers contained in § 2-412 of the Public Safety Article. (b) In general. --  (1) Subject to the limitations of paragraph (3) of this subsection, a police officer may make arrests, conduct investigations, and otherwise enforce the laws of the State throughout the State without limitations as to jurisdiction.  (2) This section does not authorize a police officer who acts under the authority granted by this section to enforce the Maryland Vehicle Law beyond the police officer's sworn jurisdiction, unless the officer is acting under a mutual aid agreement authorized under § 2-105 of this subtitle.  (3) A police officer may exercise the powers granted by this section when:  (i) 1. the police officer is participating in a joint investigation with officials from another state, federal, or local la

In [112]:
states = ({'AK': 'Alaska', 'AL': 'Alabama', 'AR': 'Arkansas', 'AS': 'American Samoa', 'AZ': 'Arizona',
    'CA': 'California', 'CO': 'Colorado', 'CT': 'Connecticut', 'DC': 'District of Columbia', 'DE': 'Delaware',
    'FL': 'Florida', 'GA': 'Georgia', 'GU': 'Guam', 'HI': 'Hawaii', 'IA': 'Iowa', 'ID': 'Idaho', 'IL': 'Illinois',
    'IN': 'Indiana', 'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'MA': 'Massachusetts', 'MD': 'Maryland', 'ME': 'Maine',
    'MI': 'Michigan', 'MN': 'Minnesota', 'MO': 'Missouri', 'MP': 'Northern Mariana Islands', 'MS': 'Mississippi',
    'MT': 'Montana', 'NA': 'National', 'NC': 'North Carolina', 'ND': 'North Dakota', 'NE': 'Nebraska',
    'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico', 'NV': 'Nevada', 'NY': 'New York', 'OH': 'Ohio',
    'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'PR': 'Puerto Rico', 'RI': 'Rhode Island',
    'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VA': 'Virginia',
    'VI': 'Virgin Islands', 'VT': 'Vermont', 'WA': 'Washington', 'WI': 'Wisconsin', 'WV': 'West Virginia', 'WY': 'Wyoming'})

In [118]:
#   Search within the predicted Encyclopaedia category for most relevant articles

userState = "Maryland"
nearestArticles = []

db = client["ARTICLES"]
tempList = [ x for x in db[nearestArtSect].find() ]


for article in [ x for x in db[nearestArtSect].find() ][:50]:
    if userState in article['title']:
        nearestArticles.append((article['title'], article['url']))
        continue

for i in relevantArticleIndices:
    if any(state in tempList[i]['title'] for state in states.values()):
        print tempList[i]['title']
        continue

    nearestArticles.append((tempList[i]['title'][:-11], tempList[i]['url']))
    
    if len(nearestArticles) == 5:
        break

nearestArticles

DUI Laws in Florida | Nolo.com
DUI Laws in Kentucky | Nolo.com
DUI Laws in Colorado | Nolo.com
Pennsylvania DUI Laws | Nolo.com
California DUI Law | Nolo.com
Illinois DUI Law | Nolo.com
DUI Laws in Maryland | Nolo.com
DUI Laws in Vermont | Nolo.com
DUI Laws in Connecticut | Nolo.com
DUI Laws in Washington | Nolo.com
DUI Laws in Delaware | Nolo.com
DUI Laws in Kansas | Nolo.com
DUI Laws in Virginia | Nolo.com
DUI Laws in Utah | Nolo.com
DUI Laws in Tennessee | Nolo.com
DUI Laws in Mississippi | Nolo.com
Georgia DUI Law | Nolo.com
DUI Laws in Hawaii | Nolo.com
DUI Laws in Alabama | Nolo.com
DUI Laws in Arizona | Nolo.com
DUI Laws in Nevada | Nolo.com
DUI Laws in Montana | Nolo.com
DUI Laws in Oklahoma | Nolo.com
DUI Laws in Ohio | Nolo.com
DUI Laws in New Hampshire | Nolo.com


[u'DUI Laws in Maryland | Nolo.com',
 (u'DUIs: When Do You Need a Lawyer?',
  u'http://www.nolo.com/legal-encyclopedia/free-books/beat-ticket-book/chapter8-5.html'),
 (u'DUI and DWI Defenses',
  u'http://www.nolo.com/legal-encyclopedia/dui-dwi-defenses-32254.html'),
 (u'Should You Plead Guilty to a DUI?',
  u'http://www.nolo.com/legal-encyclopedia/free-books/beat-ticket-book/chapter8-7.html'),
 (u"DUI or DWI Penalties: Jail Time, Driver's License Issues, Fines & More",
  u'http://www.nolo.com/legal-encyclopedia/dui-or-dwi-punishments-penalties-30321.html')]