## Processing XML

In the first set of notebooks, we went over the basic parts of Python that are commonly used for data processing tasks. In this set, we'll go over some useful libraries and add-ons, using concrete examples.

https://aws.amazon.com/public-datasets/irs-990/

We download and read the XML files of the 990 returns for one year only, using the index at:

https://s3.amazonaws.com/irs-form-990/index_2016.json

* Open file
* Load into dictionary
* Find dictionary keys - we want ein, org name, mission statement, filer city and state
* filer/businessname
* Insert into elasticsearch index
* Codes - BusinessCode
* ActivityOrMissionDesc
* Grant amounts
* Revenue amounts

Element paths in the 2016 990s:

* Return->ReturnHeader->Filer->BusinessName->BusinessNameLine1Txt
* Return->ReturnHeader->Filer->USAddress->CityNm
* Return->ReturnHeader->Filer->USAddress->StateAbbreviationCd
* Return->ReturnHeader->Filer->USAddress->ZIPCd
* Return->ReturnData->IRS990->PrincipalOfficerNm
* Return->ReturnData->IRS990->USAddress
* Return->ReturnData->IRS990->USAddress->CityNm
* Return->ReturnData->IRS990->USAddress->StateAbbreviationCd
* Return->ReturnData->IRS990->CYContributionsGrantsAmt
* Return->ReturnData->IRS990->CYProgramServiceRevenueAmt
* Return->ReturnData->IRS990->CYInvestmentIncomeAmt
* Return->ReturnData->IRS990->CYOtherRevenueAmt
* Return->ReturnData->IRS990->CYTotalRevenueAmt
* Return->ReturnData->IRS990->GrossReceiptsAmt
* Return->ReturnData->IRS990->Desc
* Return->ReturnData->IRS990->MissionDesc
* Return->ReturnHeader->BusinessOfficerGrp->PersonNm
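
Once a return has been parsed into nested dictionaries, each path above becomes a chain of dictionary lookups. The sketch below is one possible way to follow such a path without raising a KeyError when a return omits an element. The helper name get_path and the sample record are made up for illustration; they are not part of the IRS data or of any library.

```python
def get_path(record, path, default=None):
    """Follow an 'A->B->C' path through nested dictionaries."""
    current = record
    for key in path.split("->"):
        # Stop early if we hit a non-dictionary or a missing key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical, heavily trimmed record shaped like a parsed 990 return.
sample_return = {
    "Return": {
        "ReturnHeader": {
            "Filer": {
                "BusinessName": {"BusinessNameLine1Txt": "EXAMPLE ORG"},
                "USAddress": {"CityNm": "TAMPA", "StateAbbreviationCd": "FL"},
            }
        }
    }
}

city = get_path(sample_return, "Return->ReturnHeader->Filer->USAddress->CityNm")
missing = get_path(sample_return, "Return->ReturnData->IRS990->MissionDesc", default="")
print(city, repr(missing))
```

Returning a default instead of failing is a design choice: it keeps a long download loop running when a single return lacks a field, at the cost of silently producing empty values.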




In [5]:
import requests
import json

response = requests.get("https://s3.amazonaws.com/irs-form-990/index_2016.json")
data = json.loads(response.text)
urls = []
for return_metadata in data["Filings2016"]:
    if return_metadata["TaxPeriod"].startswith("2015") and return_metadata["FormType"] == "990":
        urls.append(return_metadata["URL"] + "\n")

with open("urls.txt", "w") as f:
    f.writelines(urls)

print ("All done!")

All done!
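
The filtering loop in the cell above can also be expressed as a list comprehension with a condition. This sketch runs against a small made-up sample shaped like the index file, so it needs no network access; the URLs are placeholders, not real filings.

```python
# Made-up entries with the same keys the loop above reads from the index.
sample_index = {
    "Filings2016": [
        {"TaxPeriod": "201512", "FormType": "990", "URL": "https://example.org/a.xml"},
        {"TaxPeriod": "201606", "FormType": "990", "URL": "https://example.org/b.xml"},
        {"TaxPeriod": "201512", "FormType": "990EZ", "URL": "https://example.org/c.xml"},
    ]
}

# Keep only full 990 forms whose tax period starts with 2015.
urls = [
    filing["URL"] + "\n"
    for filing in sample_index["Filings2016"]
    if filing["TaxPeriod"].startswith("2015") and filing["FormType"] == "990"
]
print(urls)
```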


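Next, we read the URLs back from urls.txt, download each return, parse the XML into a dictionary with _xmltodict_, and write the first few records to a tab-delimited file: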

In [6]:
import requests
import csv
import xmltodict

urls = []
with open("urls.txt") as f:
    urls = f.readlines()

output = []
for url in urls:
    response = requests.get(url.replace("\n", ""))
    return_data = xmltodict.parse(response.content)
    # Return->ReturnHeader->Filer->BusinessName->BusinessNameLine1Txt
    output_dict = {}
    output_dict["ein"] = return_data["Return"]["ReturnHeader"]["Filer"]["EIN"]
    output_dict["name"] = return_data["Return"]["ReturnHeader"]["Filer"]["BusinessName"]["BusinessNameLine1Txt"]
    output_dict["mission"] = return_data["Return"]["ReturnData"]["IRS990"]["ActivityOrMissionDesc"]
    output.append(output_dict)
    if len(output) > 10:
        break

with open("data.txt", "w") as f:
    dr = csv.DictWriter(f, delimiter="\t", fieldnames=["ein", "name", "mission"])
    dr.writerows(output)

print ("All done!")

All done!


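The same cell again, this time calling writeheader() so that the output file starts with a header row, which csv.DictReader relies on in the later cells: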

In [None]:
import requests
import csv
import xmltodict

urls = []
with open("urls.txt") as f:
    urls = f.readlines()

output = []
for url in urls:
    response = requests.get(url.replace("\n", ""))
    return_data = xmltodict.parse(response.content)
    # Return->ReturnHeader->Filer->BusinessName->BusinessNameLine1Txt
    output_dict = {}
    output_dict["ein"] = return_data["Return"]["ReturnHeader"]["Filer"]["EIN"]
    output_dict["name"] = return_data["Return"]["ReturnHeader"]["Filer"]["BusinessName"]["BusinessNameLine1Txt"]
    output_dict["mission"] = return_data["Return"]["ReturnData"]["IRS990"]["ActivityOrMissionDesc"]
    output.append(output_dict)
    if len(output) > 10:
        break

with open("data.txt", "w") as f:
    dw = csv.DictWriter(f, delimiter="\t", fieldnames=["ein", "name", "mission"])
    dw.writeheader()
    dw.writerows(output)

print ("All done!")
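
The header row matters because csv.DictReader takes its field names from the first line of the file unless they are passed in explicitly. A small in-memory round trip shows the idea, using io.StringIO as a stand-in for data.txt and a made-up record:

```python
import csv
import io

fieldnames = ["ein", "name", "mission"]
# Made-up record for illustration only.
records = [{"ein": "000000000", "name": "EXAMPLE ORG", "mission": "An example mission."}]

# Write a header row followed by the data rows, tab-delimited.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, delimiter="\t", fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)

# Read the same text back; the header row supplies the dictionary keys.
buffer.seek(0)
rows_read = [dict(row) for row in csv.DictReader(buffer, delimiter="\t")]
print(rows_read)
```

Without the writeheader() call, the reader would treat the first data row as the header and the keys would be wrong.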

## Elasticsearch

With the tab-delimited file in place, we can load each row into an Elasticsearch index, using the EIN as the document id:

In [9]:
import csv
from elasticsearch import Elasticsearch

es = Elasticsearch("http://fcsearchdev04:9200", http_auth=("elastic", "ocelot243kiwi"))
es.indices.delete(index='gmg', ignore=[400, 404])
with open("data.txt") as f:
    dr = csv.DictReader(f, delimiter="\t")
    for d in dr:
        es.create(index="gmg", body=d, doc_type="document", id=d["ein"])

print ("All done!")


All done!


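The Python client used above is installed with:

pip install elasticsearch

Once the documents are indexed, a match query finds the organizations whose mission statement contains a given word: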

In [10]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://fcsearchdev04:9200", http_auth=("elastic", "ocelot243kiwi"))

q = {
    "query": {
        "match" : {
            "mission" : "mental"
        }
    }
}

r = es.search(index="gmg", body=q)

for d in r["hits"]["hits"]:
    print(d["_source"]["name"])

print ("All done!")


FLORIDA BETA PI BETA PHI FRATERNITY
All done!


A fuzzy query tolerates small spelling differences, so the misspelled term below still finds a match:

In [11]:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://fcsearchdev04:9200", http_auth=("elastic", "ocelot243kiwi"))
q = {
    "query": {
        "fuzzy": {"name": "haita"}
    }
}
r = es.search(index="gmg", body=q)
for d in r["hits"]["hits"]:
    print(d["_source"]["name"])
print ("All done!")


HAITI ENDOWMENT FUND
All done!
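
Both query bodies are plain dictionaries, so they can be built and inspected without a running cluster. The helper names below are hypothetical. The long form of the fuzzy query, with "value" and "fuzziness" keys, is part of the Elasticsearch query DSL, but the choice of "AUTO" here is only an illustration. The response is a hand-written stand-in with the same shape the loops above read from.

```python
def match_query(field, text):
    """Build a match query body for one field."""
    return {"query": {"match": {field: text}}}


def fuzzy_query(field, text, fuzziness="AUTO"):
    """Build a fuzzy query body in its long form, with explicit fuzziness."""
    return {"query": {"fuzzy": {field: {"value": text, "fuzziness": fuzziness}}}}


def hit_names(response):
    """Pull the organization names out of a search response."""
    return [hit["_source"]["name"] for hit in response["hits"]["hits"]]


# Hand-written stand-in for what es.search() returns (assumption: trimmed shape).
fake_response = {"hits": {"hits": [{"_source": {"name": "HAITI ENDOWMENT FUND"}}]}}

print(match_query("mission", "mental"))
print(fuzzy_query("name", "haita"))
print(hit_names(fake_response))
```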


A search service!

In [None]:
from elasticsearch import Elasticsearch
import cherrypy
import json

es = Elasticsearch("http://fcsearchdev04:9200", http_auth=("elastic", "ocelot243kiwi"))

class Search():
    def search(self, s):
        q = {
            "query": {
                "fuzzy": {"name": s}
            }
        }
        r = es.search(index="gmg", body=q)
        result = []
        for d in r["hits"]["hits"]:
            result.append(d)

        return json.dumps(result)

    search.exposed = True

cherrypy.quickstart(Search())
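
The handler above returns each raw hit, including index metadata. One possible variation trims every hit down to a few fields before serializing. The function name and the fake response are illustrative, and the field names assume the documents indexed earlier in this notebook.

```python
import json


def summarize_hits(response, fields=("ein", "name")):
    """Return a JSON string with only the chosen fields and the score per hit."""
    summaries = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        summary = {field: source.get(field) for field in fields}
        summary["score"] = hit.get("_score")
        summaries.append(summary)
    return json.dumps(summaries)


# Hand-written stand-in for a search response (assumption: trimmed shape).
fake_response = {
    "hits": {
        "hits": [
            {
                "_index": "gmg",
                "_score": 1.5,
                "_source": {"ein": "000000000", "name": "EXAMPLE ORG", "mission": "An example mission."},
            }
        ]
    }
}

summary_json = summarize_hits(fake_response)
print(summary_json)
```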



## NLTK

NLTK (the Natural Language Toolkit) is a Python library for working with text. The next cell reads the first mission statement from data.txt and runs it through a few basic NLTK functions. Note that the tokenizers and the tagger rely on data packages that are fetched separately with nltk.download().

* **sent_tokenize** splits the text into sentences.
* **wordpunct_tokenize** splits a sentence into runs of word characters and runs of punctuation.
* **word_tokenize** splits a sentence into word tokens.
* **pos_tag** labels each token with its part of speech.



In [None]:
import nltk
import csv

sample_text = ""
with open("data.txt") as f:
    dr = csv.DictReader(f, delimiter="\t")
    for d in dr:
        sample_text = d["mission"]
        break

for sent in nltk.sent_tokenize(sample_text):
    print(sent)

for sent in nltk.sent_tokenize(sample_text):
    print (list(nltk.wordpunct_tokenize(sent)))

for sent in nltk.sent_tokenize(sample_text):
    print(list(nltk.pos_tag(nltk.word_tokenize(sent))))

text = list(nltk.word_tokenize(sample_text))
print(text)
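
To get a feel for what wordpunct_tokenize does, here is a rough illustration that uses only the standard library. The regular expression is, as far as we know, the pattern NLTK uses for this tokenizer, but treat the sketch as an approximation rather than a replacement; the function name is made up.

```python
import re


def wordpunct_like(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


tokens = wordpunct_like("Can't stop, won't stop.")
print(tokens)
```

Note how the apostrophes become separate tokens, which is one reason word_tokenize is usually preferred for English text.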



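Stemming reduces each word to a root form by stripping suffixes. The cell below compares three of NLTK's stemmers on the same text: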

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

import nltk
import csv

sample_text = ""
with open("data.txt") as f:
    dr = csv.DictReader(f, delimiter="\t")
    for d in dr:
        sample_text = d["mission"]
        break

text = list(nltk.word_tokenize(sample_text))

snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
porter = PorterStemmer()

for stemmer in (snowball, lancaster, porter):
    stemmed_text = [stemmer.stem(t) for t in text]
    print(" ".join(stemmed_text))


Lemmatizing maps each word to its dictionary form (its lemma), here using WordNet:

In [None]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
import csv

sample_text = ""
with open("data.txt") as f:
    dr = csv.DictReader(f, delimiter="\t")
    for d in dr:
        sample_text = d["mission"]
        break
text = list(nltk.word_tokenize(sample_text))
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in text]
print(" ".join(lemmas))

In [None]:
import string
import nltk
from nltk import WordNetLemmatizer
import csv

sample_text = ""
with open("data.txt") as f:
    dr = csv.DictReader(f, delimiter="\t")
    for d in dr:
        sample_text = d["mission"]
        break

## Module constants
lemmatizer  = WordNetLemmatizer()
stopwords   = set(nltk.corpus.stopwords.words('english'))
punctuation = string.punctuation

def normalize(text):
    for token in nltk.word_tokenize(text):
        token = token.lower()
        token = lemmatizer.lemmatize(token)
        if token not in stopwords and token not in punctuation:
            yield token

print (list(normalize(sample_text)))


In [None]:
import nltk

print (nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize("John Smith is from the United States of America and works at Microsoft Research Labs"))))


## Exceptions

A try block can check for one specific type of exception. You can also just use the type "Exception" to catch all types. By referencing the Exception object, you can print out diagnostic information.

In [None]:
x = 100
y = 0
try:
    z = x/y
except Exception as e:
    print ("Something went wrong!", e)
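
Handlers are tried in order, so a specific exception type goes before the catch-all. A small sketch of that ordering; the function name safe_divide is made up for illustration.

```python
def safe_divide(x, y):
    """Divide x by y, returning None when the division fails."""
    try:
        return x / y
    except ZeroDivisionError as e:
        # Specific handler: only division by zero lands here.
        print("Cannot divide by zero:", e)
    except Exception as e:
        # Catch-all handler for anything else, such as a TypeError.
        print("Something else went wrong:", e)
    return None


print(safe_divide(100, 4))
print(safe_divide(100, 0))
print(safe_divide(100, "a"))
```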

The _traceback_ module can also help by letting you output more detailed information about the error, like a stack trace (list of calling functions that led to the error), and the line number of the error.

In [None]:
import traceback

x = 100
y = 0
try:
    z = x/y
except Exception as e:
    print ("Something went wrong!", e)
    traceback.print_exc()
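
If you would rather keep the stack trace than print it, say to write it to a log file, traceback.format_exc() returns the same text as a string. A minimal sketch; the function name is made up.

```python
import traceback


def divide_and_capture(x, y):
    """Return (result, None) on success or (None, stack trace text) on failure."""
    try:
        return x / y, None
    except Exception:
        # format_exc() returns what print_exc() would have printed.
        return None, traceback.format_exc()


result, trace = divide_and_capture(100, 0)
print(trace)
```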

Finally, the _finally_ clause lets you run some code at the end of your try block, whether or not there was an exception. This is occasionally useful if you want to, say, always report a valid value whether or not there was an exception:

In [None]:
x = 100
y = 0
try:
    z = x/y
    result = str(z)
except Exception as e:
    print ("Something went wrong!", e)
    result = "unknown"
finally:
    print ("The result is: ", result)
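
Inside a function, one way to get this behavior is to set a fallback value in the except clause, use finally for work that must always run, and return after the whole block. The function name and the log list are hypothetical.

```python
log = []


def describe_ratio(x, y):
    """Return x / y as a string, or "unknown" if the division fails."""
    try:
        result = str(x / y)
    except ZeroDivisionError:
        result = "unknown"
    finally:
        # Runs on both paths, so every attempt is recorded.
        log.append((x, y))
    return result


ok = describe_ratio(100, 4)
failed = describe_ratio(100, 0)
print(ok, failed, log)
```

Placing the return after the block, rather than inside finally, avoids a known pitfall: a return statement inside finally silently discards any exception that was not handled.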

### Exercise: can you rewrite one of the examples from earlier that accesses a database or web service, and use Exceptions to handle cases where the database or service can't be reached? 