# Connecting Multiple Data Sources to SQL DB
#### 11/29/17 | Natalie | Sprint 2
#### Description
This project pulls DevLeague student information from multiple data sources into one SQL database.

## Skill Backlog User Story
As the Director of Technology, I need to identify errors and inconsistencies in the data so that I can develop solutions to address them and, possibly, their source.

## Project Proposal
Retrieve all existing student data from multiple data sources and place it in one central database using a Python script and PostgreSQL. The data is currently stored in Hubspot, Asana, and a Google Drive.


## Key Questions
- How can I access API data sources from Python and load the results into PostgreSQL?

## Key Findings
- I learned how to write a Python script and execute it from the command line. Since I'm only taking data in at this step, I don't need a full application; a script works fine.
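
A minimal sketch of what a command-line script like this could look like, using `argparse` from the standard library. The `--source` and `--dry-run` options are hypothetical and only there for illustration; they are not part of the scripts in this log.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; argv=None means argparse reads sys.argv."""
    parser = argparse.ArgumentParser(
        description="Pull contact data from one source into the database."
    )
    # --source and --dry-run are hypothetical options for illustration
    parser.add_argument("--source", default="hubspot",
                        choices=["hubspot", "asana", "drive"])
    parser.add_argument("--dry-run", action="store_true",
                        help="fetch and print data without writing to the db")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: read the options and report what would be done."""
    args = parse_args(argv)
    print("source:", args.source, "| dry run:", args.dry_run)
    return args
```

Saved as, say, `pull_contacts.py` with `main()` called under an `if __name__ == "__main__":` guard, it could be run as `python pull_contacts.py --source hubspot --dry-run`.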

## Gameplan
Here is my overall approach:
1. Determine which Python libraries I can use to pull in data from an API.
2. Determine how to execute command-line scripts in Python.
3. Create a Python script to grab data from Hubspot's API.
4. Determine how to create a Postgres db in Python.
5. Create a script to connect to a Postgres db in Python.
6. Understand how to take data held in Python data structures and store it in a SQL db.

## Day 1 Work

Initially, I was going to create a Python application. However, I realized that since I'm only taking data in, I don't need a large framework or application to do so. I can just create a script that pulls data from any data source and stores it in a database. My first step is to work with one data source (Hubspot's API) and just try to bring in the data and parse it.

Below is the first script I wrote, using my own generated Hubspot API key.

In [29]:
import urllib.request, json

# Store api key value into variable
APIKEY_VALUE = "my api key"

# concat query string with api key
APIKEY = "?hapikey=" + APIKEY_VALUE

#user id
usrid = "&userID=0000000"

# hs api end point stored to a variable
HS_API_URL = "http://api.hubapi.com"

def get_total_number_of_contacts():
    # First, we build the correct url
    xurl = "/contacts/v1/lists/all/contacts/all"
    url = HS_API_URL + xurl + APIKEY + usrid
    # Now we use urllib to open the url and read it
    response = urllib.request.urlopen(url).read()
    # print(response) to see what it looks like
    statistics = json.loads(response)
    # Finally, return the list of contacts from the parsed response
    return statistics["contacts"]

print (get_total_number_of_contacts())

b'{"contacts":[],"has-more":false,"vid-offset":0}'
[]
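
The script above builds the query string by gluing strings together, which makes it easy to misplace a `?` or `&`. A small sketch of the same URL built with `urllib.parse.urlencode` instead. The helper name `build_contacts_url`, the `count` parameter, and the `https` scheme are assumptions for illustration, not something the original script uses.

```python
from urllib.parse import urlencode

HS_API_URL = "https://api.hubapi.com"  # assumption: https instead of http
CONTACTS_PATH = "/contacts/v1/lists/all/contacts/all"


def build_contacts_url(api_key, **params):
    """Return the contacts URL with a properly encoded query string."""
    query = {"hapikey": api_key}
    # extra keyword arguments become additional query parameters
    query.update(params)
    return HS_API_URL + CONTACTS_PATH + "?" + urlencode(query)


# `count` is a hypothetical extra parameter used only to show the encoding
print(build_contacts_url("demo", count=20))
```

Because `urlencode` escapes each value, a key or parameter containing spaces or other special characters still produces a valid URL.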


## Day 2 Work
My API script to Hubspot works, but it gives me back an empty dataset with zero records in it. I tried adding a user ID, but I still got 0 results. I may just need to use the demo API key to prototype a solution.

Once I used the demo key, I was able to pull in data from Hubspot. Below is the script I created, which just pulls in all the data.

In [5]:
from pprint import pprint
import urllib.request, json

# Store api key value into variable
APIKEY_VALUE = "demo"

# concat query string with api key
APIKEY = "?hapikey=" + APIKEY_VALUE

# hs api end point stored to a variable
HS_API_URL = "http://api.hubapi.com"

def get_contacts():
    # builds the correct url
    xurl = "/contacts/v1/lists/all/contacts/all"
    url = HS_API_URL + xurl + APIKEY 
    # Now we use urllib to open the url and read it
    response = urllib.request.urlopen(url).read()
    #loads to json obj to all_contacts variable
    all_contacts = json.loads(response)
    #return the contact data
    return all_contacts

# calls the function stores to variable
contacts = get_contacts()

# pretty print the full response
pprint(contacts)


{'contacts': [{'addedAt': 1456333855820,
               'canonical-vid': 860626,
               'form-submissions': [],
               'identity-profiles': [{'deleted-changed-timestamp': 0,
                                      'identities': [{'timestamp': 1456333855819,
                                                      'type': 'LEAD_GUID',
                                                      'value': 'c4324a6f-ef03-4250-a2a8-252536f4e443'},
                                                     {'is-primary': True,
                                                      'timestamp': 1502322131721,
                                                      'type': 'EMAIL',
                                                      'value': 'new-email1@hubspot.com'}],
                                      'saved-at-timestamp': 1511397633819,
                                      'vid': 860626}],
               'is-contact': True,
               'merge-audits': [],
               'merged-vids': [

## Day 3

Python doesn't check much about a data structure up front; a wrong key or index only shows up as an error when the script runs. I spent a lot of time figuring out how to access values in Python data structures, in this case nested lists and dictionaries.

Below is the working script I created, which gives me back one specific value (first name) from the JSON object pulled from Hubspot.

In [6]:
from pprint import pprint
import urllib.request, json

# Store api key value into variable
APIKEY_VALUE = "demo"

# concat query string with api key
APIKEY = "?hapikey=" + APIKEY_VALUE

# hs api end point stored to a variable
HS_API_URL = "http://api.hubapi.com"

def get_contacts():
    # builds the correct url
    xurl = "/contacts/v1/lists/all/contacts/all"
    url = HS_API_URL + xurl + APIKEY 
    # Now we use urllib to open the url and read it
    response = urllib.request.urlopen(url).read()
    #loads to json obj to all_contacts variable
    all_contacts = json.loads(response)
    #return the contact data
    return all_contacts

# calls the function stores to variable
contacts = get_contacts()

# pretty print just the first name in a contact record
pprint(contacts['contacts'][0]['properties']['firstname']['value'])

'kolokithas'
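
The chained lookup above raises a `KeyError` or `IndexError` as soon as one level is missing. A minimal sketch of a more forgiving lookup follows; `dig` is a hypothetical helper name, and `sample` is a trimmed-down stand-in for the real response.

```python
def dig(data, *path, default=None):
    """Follow keys and list indexes in `path`; return `default` if any step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# a trimmed-down record shaped like the Hubspot response above
sample = {"contacts": [{"properties": {"firstname": {"value": "kolokithas"}}}]}

# existing path: prints kolokithas
print(dig(sample, "contacts", 0, "properties", "firstname", "value"))
# missing 'lastname': prints the default (an empty string) instead of raising
print(dig(sample, "contacts", 0, "properties", "lastname", "value", default=""))
```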


## Day 4

Time to create my database! I installed PostgreSQL with Homebrew in my terminal and created a database from the command line, with myself as a password-protected user. Once I had that set up, I had to create my data model. I went through all the data I was collecting and selected the data points I wanted.

I then realized that, for speed, I could use a GUI to create my data model and set up my tables. I installed Postico and created a **contacts** table for the data I wanted, based on what my company needs for reporting.
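
For reference, a sketch of what the **contacts** table could look like if it were created from Python rather than in Postico. The columns `first_name`, `last_name`, and `email` are the ones the Day 5 insert script writes to; `id`, `created_on`, `last_login`, and all column types are assumptions, since the actual table definition isn't recorded here.

```python
# first_name, last_name and email match the Day 5 insert script;
# id, created_on, last_login and the column types are assumptions.
CREATE_CONTACTS_TABLE = """
CREATE TABLE IF NOT EXISTS contacts (
    id          SERIAL PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    email       TEXT,
    created_on  TIMESTAMPTZ,
    last_login  TIMESTAMPTZ
)
"""


def create_contacts_table(cur):
    """Run the DDL on an open database cursor (for example a psycopg2 cursor)."""
    cur.execute(CREATE_CONTACTS_TABLE)
```

With a psycopg2 connection `con`, this would be called as `create_contacts_table(con.cursor())` followed by `con.commit()`.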

## Day 5

Now that I have an idea of how to access values, I need to iterate through all of the contacts, pull out only the values I want, and create a script that puts that data into my PostgreSQL db.
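
One thing to keep in mind when iterating: the very first response in this log included `has-more` and `vid-offset` fields, which suggests the API returns contacts in pages. A hedged sketch of collecting every page follows. `fetch_page` is a hypothetical callable standing in for the real request, and the assumption is that the returned `vid-offset` gets passed back on the next request (for example as a `vidOffset` query parameter).

```python
def fetch_all_contacts(fetch_page):
    """Collect contacts from every page.

    `fetch_page(offset)` is a hypothetical callable that returns one parsed
    response dict for the given offset.
    """
    all_contacts = []
    offset = 0
    while True:
        page = fetch_page(offset)
        all_contacts.extend(page["contacts"])
        # stop when the response says there is nothing more to fetch
        if not page.get("has-more"):
            break
        offset = page["vid-offset"]
    return all_contacts


# stand-in for the real API call: two fake pages keyed by offset
fake_pages = {
    0: {"contacts": [{"vid": 1}, {"vid": 2}], "has-more": True, "vid-offset": 2},
    2: {"contacts": [{"vid": 3}], "has-more": False, "vid-offset": 3},
}

print(len(fetch_all_contacts(fake_pages.get)))  # prints 3
```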

After pulling in specific values, I found that many records had missing values in certain fields. For the purposes of this project, I wanted complete data, so I added some logic that fills blank fields with mock values. That gave me a complete list of data to insert into the db.

Below is my script for structuring all of my data into a new "thin" or clean list.

Libraries Used:
- **pprint** for pretty printing json objects so I can easily see where the values were in the data structure.
- **urllib.request** to open api url and read it.
- **json** to load data in json format.

In [9]:
from pprint import pprint
import urllib.request, json

# Store api key value into variable
APIKEY_VALUE = "demo"

# concat api query string with api key
APIKEY = "?hapikey=" + APIKEY_VALUE

# hs api end point stored to a variable
HS_API_URL = "http://api.hubapi.com"

thin_contact_list = []

def get_contacts():
    # builds the correct url
    xurl = "/contacts/v1/lists/all/contacts/all"
    url = HS_API_URL + xurl + APIKEY 
    # Now we use urllib to open the url and read it
    response = urllib.request.urlopen(url).read()
    #loads to json obj to all_contacts variable
    all_contacts = json.loads(response)
    #return the contact data
    return all_contacts

def process_contacts(contact_list):
    new_contact_list = []

    # loop through the contacts in the response that was passed in
    # (use the parameter, not the global `contacts` variable)
    for record in contact_list['contacts']:

        # store values needed to variables
        first_name = record['properties']['firstname']['value']
        last_name = record['properties']['lastname']['value']

        # the email lives in the list of identities, so search for it
        email = ''
        for identity in record['identity-profiles'][0]['identities']:
            if identity['type'] == 'EMAIL':
                email = identity['value']

        created_on = record['addedAt']
        last_login = record['identity-profiles'][0]['saved-at-timestamp']

        # added mock values to blanks in fields
        if first_name == '':
            first_name = 'Amanda'

        if last_name == '':
            last_name = 'Miranda'

        if email == '':
            email = 'unicorn@aweseomeco.com'

        # created contact dict to go into db
        contact = {"firstname": first_name,
                   "lastname": last_name,
                   "email": email,
                   "createdon": created_on,
                   "lastlogin": last_login
                  }

        new_contact_list.append(contact)

    return new_contact_list
        
# Start processing logic
if __name__== "__main__":
    #invoke function to get data from api
    contacts = get_contacts()
    #process list of contacts
    thin_contact_list = process_contacts(contacts)
    
    print(thin_contact_list)

[{'firstname': 'kolokithas', 'lastname': 'Record11', 'email': 'new-email1@hubspot.com', 'createdon': 1456333855820, 'lastlogin': 1511397633819}, {'firstname': 'John', 'lastname': 'cruz', 'email': 'juanignaciosl-ded-05-578@test-org.com', 'createdon': 1456333839974, 'lastlogin': 1511413619107}, {'firstname': 'Updated', 'lastname': 'Record', 'email': 'new-email99@hubspot.com', 'createdon': 1456333849586, 'lastlogin': 1512081116948}, {'firstname': 'Amanda', 'lastname': 'Miranda', 'email': 'juanignaciosl-ded-05-587@test-org.com', 'createdon': 1456333869192, 'lastlogin': 1511401626850}, {'firstname': 'Amanda', 'lastname': 'Miranda', 'email': 'juanignaciosl-ded-05-588@test-org.com', 'createdon': 1456333873752, 'lastlogin': 1511414279605}, {'firstname': 'Amanda', 'lastname': 'Miranda', 'email': 'juanignaciosl-ded-05-594@test-org.com', 'createdon': 1456333895045, 'lastlogin': 1511417369152}, {'firstname': 'Amanda', 'lastname': 'Miranda', 'email': 'juanignaciosl-ded-05-597@test-org.com', 'create
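
The `createdon` and `lastlogin` values in the output above look like Unix timestamps in milliseconds. Assuming that is what they are, here is a small sketch of converting one to a timezone-aware `datetime` before storing it; `ms_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    # split into whole seconds and leftover milliseconds to avoid float rounding
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=millis * 1000
    )


# first 'createdon' value from the output above
print(ms_to_datetime(1456333855820).isoformat())
# prints 2016-02-24T17:10:55.820000+00:00
```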

## Day 5 Continued...

Now that I have a thin list of data, I need a script to insert it into a database. Surprisingly, this was the easiest part of the whole project.

Below is the script I created to insert data into my SQL db. As written, it only works with a database on localhost.

Libraries used:
- **psycopg2** lets my Python script talk to Postgres within one connection session. Pretty easy to follow.
- **sys** to exit the script with an error code if a database error occurs.


In [8]:
import psycopg2
import sys

# print to see if you have access to the clean new list of contact data
# print(thin_contact_list)

# start with no connection
con = None

# try to connect
try:
    # adapter to connect to postgres db 
    con = psycopg2.connect(database='hsbd', user='nat') 
    # allows python code to execute sql commands
    cur = con.cursor()
    # execute method that process sql commands in db
    cur.execute('SELECT version()')          
    # error check connection to db
    ver = cur.fetchone()
    print (ver, "i can conncet")    
    
    # loop through clean list and insert to db
    for contact in thin_contact_list:
        # pass the values as parameters so psycopg2 quotes them safely
        # (no manual string concatenation, no stray space in the email)
        cur.execute(
            "INSERT INTO contacts(first_name, last_name, email) VALUES (%s, %s, %s)",
            (contact['firstname'], contact['lastname'], contact['email'])
        )
        # error check print to see if each record was inserted
        print('inserted')
    
    # commit everything in this session to db
    con.commit()

# exception error handling for failed connection    
except psycopg2.DatabaseError as e:
    print ('Error %s' % e)    
    sys.exit(1)
    

# closes session to db after everything runs    
finally:
    
    if con:
        con.close()

('PostgreSQL 9.6.6 on x86_64-apple-darwin13.4.0, compiled by clang version 4.0.1 (tags/RELEASE_401/final), 64-bit',) i can conncet
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted
inserted


## Peer Feedback on Day 5

After talking it over with a peer, I received the following feedback and decided to make these changes

## Here are some overall notes on the skills I learned
And perhaps some stream of consciousness notes about what I did, and other questions I might have