<h1>Sources of data</h1>
<li><b>local files</b>: csv, xls, txt, pdf</li>
<li><b>database servers</b>
<ul>
<li>Relational databases
<li>NoSQL databases
</ul>
<li><b>On the web</b>
<ul>
<li>html
<li>JSON
<li>XML
</ul>

<h2>Getting data from the Internet</h2>
<li>Data is accessed through APIs or by web scraping
<li>Data usually arrives as JSON, XML or HTML
<li>Python has libraries that help with fetching the data as well as extracting the "useful" parts

<h2>The <i>requests</i> library</h2>
<li>The primary mechanism for sending an API request or accessing a web server

<h3>Step 1: Import the requests library</h3>

In [1]:
import requests

<h3>Step 2: Send an HTTP request, get the response, and save in a variable</h3>

In [2]:
response = requests.get("http://www.epicurious.com/search/Tofu+Chili")

In [3]:
response.content    # Returns the raw body of the response as bytes; it has not been decoded into a str yet.
                    # Hence, the b in front of the output.



In [4]:
#response.elapsed

In [5]:
response.status_code

200

<h3>Step 3: Check the response status code to see if everything went as planned</h3>
<li>status code 200: the request-response cycle was successful
<li>any other status code: it usually didn't work as expected (e.g., 404 = page not found)

In [6]:
print(response.status_code)

200
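As a rough illustration of the status-code check, the sketch below uses the standard library's http.HTTPStatus enum to turn a code into its standard phrase. It treats every 2xx code as a success, which is a slightly broader rule than "200 only". The helper name describe_status is made up for this example and is not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase            # raises ValueError for unknown codes
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase}: {outcome}"

print(describe_status(200))   # 200 OK: success
print(describe_status(404))   # 404 Not Found: not a success
```

In a notebook, the same helper could be called with response.status_code.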


In [7]:
type(response.content)

bytes

<h3>WWW data is usually encoded</h3>
<li>The b' in front of the response content indicates a byte string (encoded data)
<li>The &lt;meta charset="utf-8"&gt; tag in the page indicates that the page is using "utf-8" encoding
<li>utf-8 == a variable-length character encoding for <i>Unicode</i>
<li>Data received from the world wide web is usually in utf-8
<li>Python strings (str) are sequences of Unicode characters, not encoded bytes
<li><b>Corollary!</b> We need to decode the utf-8 bytes into a python str

<a href="http://www.diveintopython3.net/strings.html">Click here!</a> if you want to know all about strings and character encoding
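The difference between bytes and str can be seen without any network access. The following is a minimal sketch with a made-up sample word; it only uses the built-in encode and decode methods.

```python
# A str holds Unicode characters; bytes hold their encoded form.
text = "caf\u00e9"                  # four characters, the last one is an accented e
encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(type(encoded), len(encoded))  # the accented e takes two bytes in utf-8
print(type(decoded), len(decoded))
print(decoded == text)              # decoding recovers the original str
```

The byte string is one element longer than the str because utf-8 is a variable-length encoding.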

In [8]:
print(type(response.content))
print(type(response.content.decode('utf-8')))

<class 'bytes'>
<class 'str'>


In [9]:
response.encoding    # The encoding requests inferred from the HTTP headers. UTF-8 uses one to four 8-bit bytes per character.

'utf-8'

<h3>Step 4: Get the content of the response</h3>
<li>Decode from utf-8 into a str if necessary

In [10]:
type(response.content.decode('utf-8'))

str

In [11]:
response.content.decode('utf-8')   # This is the form that we want our data in before we 
                                   # start doing things with it. (Notice there is no b in the output.)



<h3>In-class problem</h3>
<li>Get the contents of Wikipedia's main page and look for the string "Did you know" in it
<li>At what location is it on the page?

In [13]:
url = "https://en.wikipedia.org/wiki/main_page"
#The rest of your code should go below this line
#import


<h1>JSON: JavaScript Object Notation</h1>

<li>Standard for "serializing" data objects for storage or transmission 
<li>Human-readable, useful for data interchange
<li>Also useful for representing and storing semistructured data
<li>Stored as plain text (as a str or as utf-8 encoded bytes)
<li>Contains data type information

<h2>json</h2>
<li>The python library - json - deals with converting between JSON text and python objects


<h2>Python and JSON data types</h2>
<table align="left">
<tr><td>JSON</td><td>Python</td></tr>
<tr><td>number</td>	<td>int,float</td></tr>
<tr><td>string</td>	<td>str</td></tr>
<tr><td>null</td>	<td>None</td></tr>
<tr><td>true/false</td>	<td>True/False</td></tr>
<tr><td>object</td>	<td>dict</td></tr>
<tr><td>array</td>	<td>list</td></tr>
</table>
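The mapping in the table can be checked directly. This is a small sketch using hand-written sample data (the keys and values are invented for illustration) with one JSON value of each kind.

```python
import json

# One JSON value of each kind from the table (hand-written sample data).
sample = '{"count": 3, "ratio": 0.5, "name": "tofu", "missing": null, "ok": true, "tags": ["a", "b"], "nested": {"k": 1}}'
converted = json.loads(sample)

# Print the python type that each JSON value was converted into.
for key, value in converted.items():
    print(key, "->", type(value).__name__)
```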

<b>json.loads converts a JSON string into a python data object</b>

In [12]:
import json
data_string = '[{"b": [2, 4], "3.0": "c", "a": "A"},34]'     # All keys of a dictionary for JSON must be strings
python_data = json.loads(data_string)   # loads is pronounced 'load-es' for load string
type(python_data)
# data_string
# print(type(python_data))

list

In [14]:
python_data

[{'b': [2, 4], '3.0': 'c', 'a': 'A'}, 34]

<h3>json.loads recursively decodes a string in JSON format into equivalent python objects</h3>
<li>data_string's outermost element is converted into a python list (in the example!)
<li>the first element of that list is converted into a dictionary
<li>each key of that dictionary is converted into a string
<li>the value for the key "b" is converted into a list of two integer elements

In [15]:
print(type(data_string),type(python_data))
print(type(python_data[0]),python_data[0])
print(type(python_data[0]['b']),python_data[0]['b'])

<class 'str'> <class 'list'>
<class 'dict'> {'b': [2, 4], '3.0': 'c', 'a': 'A'}
<class 'list'> [2, 4]


In [16]:
python_data[0]['b']

[2, 4]

<h3>json.loads will throw an exception if the format is incorrect</h3>

In [17]:
# Wrong. WHY?
json.loads('Hello')
# Because it's not in valid JSON format: Hello without double quotes is not a JSON value.
# The loads function takes a string in valid JSON format and creates the equivalent Python object.
# A JSON string must itself be wrapped in double quotes, as in the next cell.

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [18]:
#Correct since "Hello" (with double quotes) is a valid JSON string
json.loads('"Hello"')


'Hello'
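In a script, the exception can be caught instead of stopping the program. The sketch below wraps json.loads in a hypothetical helper, parse_json_or_none (a name invented for this example), that returns None when the text is not valid JSON.

```python
import json

def parse_json_or_none(text):
    """Return the decoded python object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        print("Not valid JSON:", err)
        return None

print(parse_json_or_none('Hello'))      # invalid: a bare word is not a JSON value
print(parse_json_or_none('"Hello"'))    # valid: a JSON string in double quotes
```

Returning None is only one possible design. A caller that must tell a JSON null apart from a failure would need a different signal.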

<h3>json.dumps creates a json string from a python data object</h3>

In [19]:
import json
data_string = json.dumps(python_data)  #converts a python object into a json string
print(type(data_string))
print(data_string)
data_string

<class 'str'>
[{"b": [2, 4], "3.0": "c", "a": "A"}, 34]


'[{"b": [2, 4], "3.0": "c", "a": "A"}, 34]'
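A round trip through json.dumps and json.loads does not always give back the same python object, because JSON has fewer types than python. The following sketch uses made-up data to show two such cases: a tuple comes back as a list, and a non-string key comes back as a string.

```python
import json

original = {"b": (2, 4), 3.0: "c", "a": None}   # tuple value and float key
as_json = json.dumps(original)
restored = json.loads(as_json)

print(as_json)                # the tuple is written as an array, the key 3.0 as "3.0"
print(restored)
print(restored == original)   # False: the types changed on the way
```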

<h1>API: Application Programming Interface</h1>
<li>A protocol containing a set of commands or functions that allow one piece of software to talk to another
<li>Data from the web is often obtained through an API
<li>Web APIs usually consist of two parts:
<ul>
<li><b>request</b>: a well-formed HTTP request to a server
<li><b>response</b>: a response from the server, usually either an html page or a JSON object
</ul>

<h2>The HTTP request</h2>
<li>Contains a url
<li>Contains a set of parameters required by the server to figure out what data to send back
<li>Often, a parameter is a unique <b>access key</b> that the server uses to keep track of who is requesting the data
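To make the url-plus-parameters idea concrete, here is a minimal sketch that builds a request url with the standard library. The endpoint https://api.example.com/search, the parameter names and the access key are all placeholders invented for this example, not a real API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://api.example.com/search"             # hypothetical endpoint
params = {"q": "Tofu Chili", "key": "MY_ACCESS_KEY"}    # placeholder parameters

# urlencode escapes characters that are not allowed in a url (a space becomes +).
url = base_url + "?" + urlencode(params)
print(url)

# Splitting the url again recovers the original parameter values.
query = parse_qs(urlsplit(url).query)
print(query)
```

requests.get also accepts a params dictionary and builds the query string itself, which avoids encoding the values by hand.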

<h1>API example: Google Geocoding API</h1>
<li><a href="https://developers.google.com/maps/documentation/geocoding/start">Documentation</a>
<li>Google has a large number of map and location related APIs
<li>You need an account and an API key to use these APIs
<li>To set up an account and get a key:
<ul>
<li>go to <a href="https://cloud.google.com/">google cloud</a>
<li>click "go to console" or "try gcp for free"
<li>if creating a new account, enter all details 
<li>go to API and services
<li>click "Enable APIs" and search for geocoding api
<li>click on credentials and create an API key
</ul>


<h2>requests library and API requests</h2>

In [20]:
#My api_key
with open("/home/uday/Documents/Columbia_University/API_Keys/geocoding_key.txt",'r') as f:
    api_key = f.read().strip()
# strip() gets rid of the whitespace (e.g., a trailing newline) that isn't a part of the API key

FileNotFoundError: [Errno 2] No such file or directory: '/home/uday/Documents/Columbia_University/API_Keys/geocoding_key.txt'

In [None]:
address="Columbia University, New York, NY"
address=address.replace(' ','+')   # URLs can't contain spaces. In a query string, a space is encoded as + (or %20).
#api_key=""
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s&key=%s" % (address,api_key)
response = requests.get(url)  # Always check the status after requests.get(url)
response.content.decode('utf8')
print(type(response))
# The /json in the url path denotes that you want the result to come back in JSON format

In [None]:
response.encoding

<h4>requests can automatically decode and convert a json response into a python object</h4>

In [25]:
response.json()   ## A method of the response object (which carries the payload, the status, etc.)
                  ## that parses the JSON payload and returns the equivalent python object.

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

<h3>Exception checking!</h3>
<li>Ideally, we should always check if the data grab has been successful
<li>Especially if we are incorporating our results into a "live" analysis

In [None]:
response_data = ''
address="Columbia University, New York, NY"
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s&key=%s" % (address,api_key)
try:
    response = requests.get(url)
    if not response.status_code == 200:
        print("HTTP error",response.status_code)
    else:
        try:
            response_data = response.json()  # although the request is successful, it might not return a valid JSON object
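        except ValueError:   # raised when the body cannot be parsed as JSON
            print("Response not in valid JSON format")
except requests.exceptions.RequestException:   # connection errors, timeouts, invalid urls, etc.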
    print("Something went wrong with requests.get")
print(type(response_data))
print(response_data)

<b>Try this</b>: Write a function that takes an address as an argument and returns a (latitude, longitude) tuple

In [None]:
# def get_lat_lng(address_string,api_key):

#     return (lat,lng)
    
# get_lat_lng("Columbia University",api_key)

In [None]:
# get_lat_lng("London Business School",api_key)

In [None]:
# get_lat_lng("Monash University",api_key)

<h1>XML</h1>
<li>eXtensible Markup Language
<li>data is stored in a tree
<li>data items are "tagged" with named values
<li>html is (loosely) similar to XML (both are based on SGML)
<li>The python library - lxml - deals with converting an xml string to python objects and vice versa</li>

In [None]:
data_string = """
<Bookstore>
   <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
      <Title>New York Deco</Title>
      <Authors>
         <Author Residence="New York City">
            <First_Name>Richard</First_Name>
            <Last_Name>Berenholtz</Last_Name>
         </Author>
      </Authors>
   </Book>
   <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
      <Remark>
      Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.
      </Remark>
      <Title>Five Hundred Buildings of New York</Title>
      <Authors>
         <Author Residence="Beijing">
            <First_Name>Bill</First_Name>
            <Last_Name>Harris</Last_Name>
         </Author>
         <Author Residence="New York City">
            <First_Name>Jorg</First_Name>
            <Last_Name>Brockmann</Last_Name>
         </Author>
      </Authors>
   </Book>
</Bookstore>
"""

In [None]:
from lxml import etree
root = etree.XML(data_string)       ## Returns a reference to the root object (the outermost object in the XML structure/tree)
# print(root.tag,type(root.tag))
root

<h4>etree.tostring serializes an XML tree into a byte string</h4>

In [None]:
print(etree.tostring(root, pretty_print=True))

In [None]:
print(etree.tostring(root, pretty_print=True).decode("utf-8"))   # Decodes the serialized bytes into a str, which prints as an indented tree.
                                                                 # When would you use this? When you want to write it to a file or share it with someone.
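The bytes-versus-str distinction for serialized XML can also be tried with the standard library. Note the swap: the sketch below uses xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages, and the tiny tree is a cut-down version of the bookstore data. The name demo_root is chosen so that it does not overwrite root.

```python
import xml.etree.ElementTree as ET   # standard-library stand-in for lxml.etree

demo_root = ET.fromstring("<Bookstore><Book><Title>New York Deco</Title></Book></Bookstore>")

as_bytes = ET.tostring(demo_root)                      # bytes by default
as_text = ET.tostring(demo_root, encoding="unicode")   # str when encoding="unicode"
print(type(as_bytes), type(as_text))

# Parsing the serialized text gives back an equivalent tree.
round_trip = ET.fromstring(as_text)
print(round_trip.find("Book/Title").text)
```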

<h3>Iterating over an XML tree</h3>
<li>Use an iterator. 
<li>The iterator will generate every tree element for a given subtree

In [None]:
for element in root.iter():
    print(element)

<h4>Or just use the "for child in subtree" construction</h4>

In [None]:
for child in root:
    for thing in child:
        print(thing)
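The two ways of walking a tree visit different sets of elements. The sketch below compares them on a cut-down copy of the bookstore data. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example runs without extra packages), and the names demo_xml and demo_root are invented for this example.

```python
import xml.etree.ElementTree as ET   # standard-library stand-in for lxml.etree

demo_xml = """
<Bookstore>
   <Book>
      <Title>New York Deco</Title>
      <Authors>
         <Author><First_Name>Richard</First_Name><Last_Name>Berenholtz</Last_Name></Author>
      </Authors>
   </Book>
</Bookstore>
"""
demo_root = ET.fromstring(demo_xml)

# .iter() yields the element itself and every descendant, in document order.
all_tags = [element.tag for element in demo_root.iter()]
# Looping over an element yields only its direct children.
child_tags = [child.tag for child in demo_root]
# A nested loop goes exactly one level deeper.
grandchild_tags = [thing.tag for child in demo_root for thing in child]

print(all_tags)
print(child_tags)
print(grandchild_tags)
```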

<h4>Accessing the tag</h4>


In [None]:
for child in root:
    print(child.tag)

<h4>Using the iterator to get specific tags</h4>
<li>In the example below, only the Author tags are accessed
<li>For each Author tag, the .find function accesses the First_Name and Last_Name tags
<li>Given a bare tag name, the .find function only looks at the children, not other descendants, so be careful!
<li>The .text attribute holds the text in a leaf node

In [None]:
for element in root.iter("Author"):
    #print(element)
    print(element.find('First_Name').text,element.find('Last_Name').text)   # The .find method will only search among the children of a tag, not the children of the children of a tag.
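The children-only behaviour of .find is easy to trip over, so here is a small sketch of it. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, chosen so the example runs without extra packages) on a one-book fragment of the data.

```python
import xml.etree.ElementTree as ET   # standard-library stand-in for lxml.etree

demo_book = ET.fromstring(
    "<Book><Authors><Author><First_Name>Richard</First_Name></Author></Authors></Book>"
)

direct = demo_book.find("First_Name")                  # not a direct child of Book
nested = demo_book.find(".//First_Name")               # .// searches all descendants
by_path = demo_book.find("Authors/Author/First_Name")  # or spell out the full path

print(direct is None)
print(nested.text, by_path.text)
```

Comparing against None matters here: calling .text on a failed lookup raises an AttributeError.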

In [None]:
for element in root.findall('Book/Authors/Author/First_Name'):
    print(element.text)

<h4>Problem: Find the last names of all authors in the tree “root” using xpath</h4>

In [None]:
for element in root.findall('Book/Authors/Author/Last_Name'):
    print(element.text)

<h4>Using values of attributes as filters</h4>
<li>Example: Find the first name of the author of a book that weighs 1.5 oz

In [None]:
root.find('Book[@Weight="1.5"]/Authors/Author/First_Name').text
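A path filter can only test an attribute for equality, and a missing attribute is simply skipped. The sketch below shows both points and then does a numeric comparison in python instead. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, so the example runs without extra packages) on a shortened copy of the bookstore data. The price threshold of 15.5 is made up for the example.

```python
import xml.etree.ElementTree as ET   # standard-library stand-in for lxml.etree

demo_store = ET.fromstring("""
<Bookstore>
   <Book Price="15.23" Weight="1.5"><Title>New York Deco</Title></Book>
   <Book Price="15.80"><Title>Five Hundred Buildings of New York</Title></Book>
</Bookstore>
""")

# Attribute equality can be tested inside the path.
heavy = demo_store.findall("Book[@Weight='1.5']")
# .get returns None when an element does not have the attribute.
weights = [book.get("Weight") for book in demo_store.findall("Book")]
# Attribute values are strings, so numeric comparisons are done in python.
cheap_titles = [book.find("Title").text
                for book in demo_store.findall("Book")
                if float(book.get("Price")) < 15.5]

print(len(heavy), weights, cheap_titles)
```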

<b>Try This</b>: Print first and last names of all authors who live in New York City