# Data Operations - Sprint Journal for Natalie Ramirez

---
## DAY 1: Tuesday (Week 1)
### What I Expect to Learn

I want to learn how to automate exporting aggregated data as a readable file to a designated folder in Google Drive. I also hope to determine which tools best fit the needs of my data analysis.

Key Questions:
- What are the differences between tables and dataframes?
- What are the differences between lists and arrays?
- If my data is already modeled and in my database, how am I going to compare it against an external data source for duplicates? Will I have to aggregate all the data in one table, or will I keep a separate data dump table that holds everything and compare against that?
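
A minimal sketch of one way to approach the last question, assuming each record carries an `email` field that can serve as the comparison key. The function names and the sample records are hypothetical and only illustrate the idea of checking incoming records against the ones already stored.

```python
def normalize_email(email):
    """Lower-case and trim an email so trivial differences do not hide duplicates."""
    return email.strip().lower()


def split_new_and_duplicate(existing_records, incoming_records):
    """Split incoming records into new ones and ones that are already known."""
    seen = {normalize_email(record["email"]) for record in existing_records}
    new_records, duplicates = [], []
    for record in incoming_records:
        key = normalize_email(record["email"])
        if key in seen:
            duplicates.append(record)
        else:
            new_records.append(record)
            seen.add(key)  # also catches repeats inside the incoming batch
    return new_records, duplicates


# Hypothetical sample data, not taken from the project.
existing = [{"email": "ada@example.com"}]
incoming = [
    {"email": "Ada@Example.com "},
    {"email": "grace@example.com"},
    {"email": "grace@example.com"},
]
new_records, duplicates = split_new_and_duplicate(existing, incoming)
print(len(new_records), "new,", len(duplicates), "duplicate")
```

With this approach the external source never has to be merged into the modeled tables first; only the comparison keys of the existing records need to be loaded.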

### Project References
- [RAM Cloud Project](https://ramcloud.atlassian.net/wiki/spaces/RAM/pages/6848531/Data+Operations)
- [Comparing two or more data frames and find rows that appear in more than one data frame, or rows that appear only in one data frame.](http://www.cookbook-r.com/Manipulating_data/Comparing_data_frames/)
- [Comparing matching data structures in Tableau](http://kb.tableau.com/articles/howto/comparing-matching-and-non-matching-records-across-blended-data-sources)

### Project Pitch
- In this sprint I will compare data from different data sources for duplicate records in R and then upload unique records to a PostgreSQL database.
- This project will help me aggregate, normalize, and structure data using data operations.

--- 
## DAY 2: Wednesday (Week 1)
### Pitch Feedback 
- Use R to compare records.
- Make sure to normalize the data before creating the data model.

### Prototype Notes
*Results from prototype experiments, snippets of code, things I tried*

### Pair Show & Tell Comments
*Comments from prototype discussion*
- Can use Cookbook for R as a resource.
- Can complete the DataCamp intro to R course.
- Can also try to remove duplicates straight in SQL.

### Proposed Plan: Key Milestones by Day

##### Day 2 (Wed Week 1):
- Develop Project Proposal
- Push Docs / Repo / Roadmap Update

##### Day 3 (Thu Week 1):
- Normalize Data
- Generate Data Model
- Create Dups Script
- Push docs / repo

##### Day 4 (Tue Week 2): 
- Fix Python Script to collect new data
- Update Python Script to insert data into SQL Database
- Push docs / repo

##### Day 5 (Wed Week 2):
- Project highlights
- Identify question or cohort knowledge gap for sprint review
- Develop Topic Project + Presentation
- Push Repo / docs / Presentation

### Project Definition and README.MD Discussion 
*This is a discussion of how this project will fit into my overall roadmap. I will update my roadmap with the following project definition*

*I will focus my project Repo's README.MD on the same topic, but with this additional detail.*


--- 
## DAY 3: Thursday (Week 1)

#### Setup for Repo and Documentation Push
*Setup and testing I did to make sure my repo and documentation were ready to push at the end of the day*

#### Repo File Strategy Discussion
*How I will present my repo files for clarity and demonstration*

#### Work towards milestone 1
Listed all the data points I need to collect from the data sources.

#### Work towards milestone 2
Created a list of the tables I need to create in the PostgreSQL database.

#### Work towards milestone 3
Perform data operations to combine data from both sources and insert unique values into a new table.




In [None]:
from pprint import pprint
import urllib.request, json

# Store api key value into variable
APIKEY_VALUE = "c6ba6fa8-84c5-40d3-a30e-8d6d414a21c7"

# concat api query string with api key
APIKEY = "?hapikey=" + APIKEY_VALUE

# hs api end point stored to a variable
HS_API_URL = "http://api.hubapi.com"

thin_contact_list = []

def get_contacts():
    # builds the correct url
    xurl = "/contacts/v1/lists/all/contacts/all"
    url = HS_API_URL + xurl + APIKEY 
    # Now we use urllib to open the url and read it
    response = urllib.request.urlopen(url).read()
    # print("response", response)
    # loads to json obj to all_contacts variable
    all_contacts = json.loads(response)
    # return the contact data
    return all_contacts

def process_contacts(contact_list):
    new_contact_list = []

    # loop through the contacts passed in (not the global) and keep only the fields we need
    for contact_record in contact_list['contacts']:
        properties = contact_record['properties']

        # a property can be missing, so fall back to an empty string
        first_name = properties.get('firstname', {}).get('value', '')
        last_name = properties.get('lastname', {}).get('value', '')

        email = ''
        for identity in contact_record['identity-profiles'][0]['identities']:
            if identity['type'] == 'EMAIL':
                email = identity['value']

        created_on = contact_record['addedAt']
        last_updated = contact_record['identity-profiles'][0]['saved-at-timestamp']

        # contact dict to go into db
        contact = {"firstname": first_name,
                   "lastname": last_name,
                   "email": email,
                   "createdon": created_on,
                   "lastupdated": last_updated
                  }

        new_contact_list.append(contact)

    return new_contact_list
        
contacts = get_contacts()
#process list of contacts
thin_contact_list = process_contacts(contacts)
    
pprint(thin_contact_list)
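
The `createdon` and `lastupdated` values collected above are raw numbers from the API. A minimal sketch of converting them before they go into a timestamp column, assuming (as I understand the HubSpot documentation) that these fields are Unix timestamps in milliseconds. The helper name `millis_to_datetime` and the sample value are hypothetical.

```python
from datetime import datetime, timezone


def millis_to_datetime(millis):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# Hypothetical sample value, not taken from the API response.
example = millis_to_datetime(1500000000000)
print(example.isoformat())
```

A database adapter such as psycopg2 can pass a `datetime` object straight to a `timestamp` column, so the conversion can happen while the contact dict is being built.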

---
## DAY 4: Tuesday (Week 2)

#### Work in Progress Feedback 
*Feedback and ideas from my work in progress presentation*

#### Work towards milestone 1
Update script to collect only data I want to insert into PostgreSQL database.

In [2]:
from pprint import pprint
import urllib.request, json

# Store api key value into variable
APIKEY_VALUE = "c6ba6fa8-84c5-40d3-a30e-8d6d414a21c7"

# concat api query string with api key
APIKEY = "?hapikey=" + APIKEY_VALUE

# hs api end point stored to a variable
HS_API_URL = "http://api.hubapi.com"

thin_contact_list = []

def get_vids():
    vid_list = []
    
    # builds the correct url
    xurl = "/contacts/v1/lists/all/contacts/all"
    url = HS_API_URL + xurl + APIKEY 
    # Now we use urllib to open the url and read it
    response = urllib.request.urlopen(url).read()
    # print("response", response)
    # loads to json obj to all_contacts variable
    all_contacts = json.loads(response)
    for i in range(len(all_contacts['contacts'])):
        vid_list.append(all_contacts['contacts'][i]['vid'])
        
    return vid_list

def get_urls(vids):
    url_list = []
    for vid in vids:
        # ":vid" in the API docs is a placeholder, so the id goes in without the colon
        xurl = "/contacts/v1/contact/vid/" + str(vid) + "/profile"
        url = HS_API_URL + xurl + APIKEY
        url_list.append(url)

    return url_list
        
def get_properties(urls):
    contact_list = []
    for i in urls:
        response = urllib.request.urlopen(i).read()
        all_properties = json.loads(response)
        contact_list.append(all_properties)
    return contact_list
        
new_vid_list = get_vids()
all_urls = get_urls(new_vid_list)
#pprint(new_vid_list)
#pprint(get_properties(all_urls))

#### Work towards milestone 2
Update script to insert data into PostgreSQL database.

- Created a script to import the Google Sheets data into PostgreSQL.

In [3]:
import psycopg2, csv
 
con = None

# try to connect
try:
    # adapter to connect to postgres db 
    con = psycopg2.connect(database='hsbd', user='nat') 
    # allows python code to execute sql commands
    cur = con.cursor()
    # execute method that process sql commands in db
    cur.execute('''DROP TABLE IF EXISTS all_applicants_g''')

    # table name must match the INSERT below and the later person query
    cur.execute('''CREATE TABLE all_applicants_g
        (   date text,
            status text,
            stage text,
            referal text,
            projected_start_date text,
            first text,
            last text,
            gender text,
            term text,
            location text,
            skype text,
            start_date text,
            cohort text,
            experience text,
            profession text,
            motivation text,
            media text,
            media2 text,
            elevate_scholarship text,
            notes_log text
        );''')
    print ("Table data created successfully")
 
    reader = csv.reader(open('devleague_applicants_data.csv', 'r', encoding = 'ISO-8859-1'))
 
    for i, row in enumerate(reader):
        if i < 1 : continue
        print(i, row)
        cur.execute('''
            INSERT INTO "all_applicants_g"(
            "date",
            "status",
            "stage",
            "referal",
            "projected_start_date",
            "first",
            "last",
            "gender",
            "term",
            "location",
            "skype",
            "start_date",
            "cohort",
            "experience",
            "profession",
            "motivation",
            "media",
            "media2",
            "elevate_scholarship",
            "notes_log"
            ) values %s ''', [tuple(row)]
        )
    con.commit()

finally:
    
    if con:
        con.close()

Table data created successfully


FileNotFoundError: [Errno 2] No such file or directory: 'devleague_applicants_data.csv'
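
The error above shows the script creating the table before finding out that the CSV is missing. A minimal sketch of reading the file first, so the problem surfaces before anything in the database is dropped. The helper name `read_csv_rows` and the temporary sample file are hypothetical; the encoding matches the one used in the script.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_rows(csv_path, encoding="ISO-8859-1"):
    """Return the data rows of a CSV file as tuples, skipping the header row."""
    path = Path(csv_path)
    if not path.is_file():
        # report the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"CSV not found: {path.resolve()}")
    with path.open("r", encoding=encoding, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return [tuple(row) for row in reader]


# Demonstration with a throwaway file standing in for the real export.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("first,last\nAda,Lovelace\n", encoding="ISO-8859-1")
    rows = read_csv_rows(sample)

print(rows)
```

Calling the helper before `psycopg2.connect` means a missing file stops the run while the existing table is still intact.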

#### Work towards milestone 3
Push up both project documentation and portfolio repo documentation.
- Executed data operation queries to insert data. 
- Pushed up repo docs.

``` sql
drop table if exists person;

create table person (
	id serial PRIMARY KEY,
	first_name text not null,
	last_name text not null,
	create_date timestamp default now(),
	last_updated timestamp default now()
);

insert into person (first_name, last_name) select distinct first_name, last_name from all_contacts_hb;

insert into person (first_name, last_name) select distinct first, last from all_applicants_g;
```
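
Each `select distinct` above removes duplicates within one source only, so a person who appears in both tables ends up in `person` twice. A minimal sketch of combining both sources with `UNION`, which removes duplicates across the two selects. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the sample names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Stand-in tables with the same column names as the queries above.
cur.executescript("""
    CREATE TABLE all_contacts_hb (first_name TEXT, last_name TEXT);
    CREATE TABLE all_applicants_g (first TEXT, last TEXT);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL
    );
    INSERT INTO all_contacts_hb VALUES ('Ada', 'Lovelace'), ('Grace', 'Hopper');
    INSERT INTO all_applicants_g VALUES ('Ada', 'Lovelace'), ('Alan', 'Turing');
""")

# UNION (without ALL) keeps one copy of each (first, last) pair across both sources.
cur.execute("""
    INSERT INTO person (first_name, last_name)
    SELECT first_name, last_name FROM all_contacts_hb
    UNION
    SELECT first, last FROM all_applicants_g
""")
con.commit()

people = cur.execute(
    "SELECT first_name, last_name FROM person ORDER BY last_name"
).fetchall()
con.close()
print(people)
```

The same `INSERT ... SELECT ... UNION SELECT ...` statement is valid in PostgreSQL, where `serial` plays the role of the auto-incrementing id used here.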

--- 

## DAY 5: Wednesday (Week 2)

### Project Highlights: The things I am most excited about in my project
Me
- Excited to merge dump data into one table and see how to filter through duplicates with just a query.
- Using Postico vs the command line.

Peer Identified

- How to use Postico
- How to auto-generate unique IDs for unique values

### Peer Repo Feedback 

*Here are the changes I am making to my repo structure for additional clarity* 
- Didn't use R
- Only used Postico
- Found out that I need to cleanse more of the data used
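
A minimal sketch of the kind of cleansing the last point refers to, assuming the main problems are stray whitespace, inconsistent capitalization, and blank name fields. The function names and sample rows are hypothetical.

```python
def clean_name(value):
    """Collapse extra whitespace and normalize the capitalization of a name."""
    return " ".join(value.split()).title()


def clean_rows(rows):
    """Clean (first, last) pairs and drop rows where either name is blank."""
    cleaned = []
    for first, last in rows:
        first, last = clean_name(first), clean_name(last)
        if first and last:
            cleaned.append((first, last))
    return cleaned


# Hypothetical sample rows, not taken from the project data.
raw = [("  ada ", "LOVELACE"), ("", "Hopper"), ("alan", "turing  ")]
cleaned = clean_rows(raw)
print(cleaned)
```

`str.title()` is a blunt tool: it turns a name such as "McDonald" into "Mcdonald", so real data may need a gentler rule.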



---
## DAY 6: Thursday (Week 2)

#### Things I didn't get to
*Here are some ideas that I didn't get to implement, but wanted to. I will be adding these to my roadmap table entry for this sprint as well.*
- Getting all data from HubSpot API, still need to work on that.
- Didn't get to use R; would like to do more with that.
- Need to finalize data model.
- Need complete cleansed data to work with.
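
For the first item, a minimal sketch of paging through the contacts endpoint. It assumes, based on my reading of the HubSpot documentation for this endpoint, that each response carries `has-more` and `vid-offset` fields and that the request accepts `count` and `vidOffset` parameters. The function names are hypothetical, and nothing here has been run against the live API.

```python
import json
import urllib.parse
import urllib.request

HS_API_URL = "http://api.hubapi.com"
LIST_PATH = "/contacts/v1/lists/all/contacts/all"


def fetch_page(api_key, vid_offset=None, count=100):
    """Fetch one page of contacts (parameter names are assumptions, see above)."""
    params = {"hapikey": api_key, "count": count}
    if vid_offset is not None:
        params["vidOffset"] = vid_offset
    url = HS_API_URL + LIST_PATH + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())


def get_all_contacts(fetch, max_pages=1000):
    """Collect contacts page by page until the API reports there are no more."""
    contacts, offset = [], None
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        page = fetch(offset)
        contacts.extend(page.get("contacts", []))
        if not page.get("has-more"):
            break
        offset = page["vid-offset"]
    return contacts
```

A call would look like `get_all_contacts(lambda offset: fetch_page(APIKEY_VALUE, offset))`. Passing the fetch function in keeps the paging loop testable without a network connection.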