# Jurisdiction Entry QA/QC Analysis

- Lambda function ARN:

    arn:aws:lambda:us-west-2:311263071456:function:jurisdiction-entry-analysis

## Goals: 
1. Given entries from a jurisdiction, track and analyze the changes from any previous table version.
    a. Determine metrics that reflect whether entries are valid
    b. Generalize to compare any updated entry with any master table entry that shares the same unique key
    
2. Find a method to track all rows potentially affected in the Redshift database via primary key
    
3. Implement analysis as an AWS Lambda Function for external use.
    a. Gain a better understanding of "Layers" to bootstrap libraries in a space efficient manner
    b. Implement call capabilities both from a Python Environment and through a REST API
        
4. Store warning and update logs into appropriate Redshift tables for future reference.


## Assumptions:
1. The tables ingested by this pipeline have been pulled from the Redshift land use 2020 table, so the datatype of every feature is consistent.

2. Jurisdiction updates are presented to the function as a JSON file.
    
3. That JSON file shares the schema of the old table and contains either:
    a. The full table with updated columns
    b. Only the updated columns themselves

4. If a row was added in the update, it was assigned a unique primary key

5. A row's recid cannot be changed (since it is used as the primary key between the old and new dataframes)

6. Columns of type "float" are continuous (not ordinal/nominal categories)

7. For numerical columns of the master Socrata dataframe, the range from the 5th to the 95th percentile is a sufficient metric for "normal" updates

8. String updates that do not appear anywhere in the old column are suspicious enough to raise warnings
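
Assumptions 7 and 8 can be sketched roughly as follows. This is a minimal illustration, not the code of `CheckUpdates`: the helper names `numeric_bounds` and `check_row` are hypothetical, and the warning wording is copied from the warning tables shown later in this notebook.

```python
import numbers

import pandas as pd


def numeric_bounds(master: pd.DataFrame, lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """5th and 95th percentile of every float column in the master table."""
    return master.select_dtypes(include="float").quantile([lower, upper])


def check_row(row: pd.Series, master: pd.DataFrame, bounds: pd.DataFrame) -> list:
    """Return warning strings for one updated row (hypothetical helper)."""
    warnings = []
    for col, val in row.items():
        if col not in master.columns or pd.isna(val):
            continue
        if col in bounds.columns:
            # Assumption 7: numeric values outside the percentile band are flagged
            low, high = bounds[col].iloc[0], bounds[col].iloc[-1]
            if isinstance(val, numbers.Real) and (val < low or val > high):
                warnings.append(f"{col} value is out of range (val = {val})")
        elif isinstance(val, str) and val not in set(master[col].dropna()):
            # Assumption 8: strings never seen in the old column are flagged
            warnings.append(f'"{val}" not in other row records for {col}')
    return warnings


master = pd.DataFrame({
    "max_far": [0.5, 1.0, 1.5, 2.0, 2.5],
    "zn_description": ["Open Space", "Planned District", "Open Space",
                       "Planned District", "Open Space"],
})
bounds = numeric_bounds(master)
update = pd.Series({"max_far": 16.0, "zn_description": "Testing In Place"})
found = check_row(update, master, bounds)
```

With this toy master table, `found` holds one out-of-range warning for `max_far` and one novelty warning for `zn_description`.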
    

## Output:

1. Update table log with schema:
    - recid, city_name, cols_updated, updated_from_nan, updated_to_nan, changed_vals, out_of_range, warnings, editor, edit_timestamp, edit_type__c_u_d__, main_version, old_vals, new_vals

2. Warning log with schema:
    - city, editor, edit_timestamp, recid, warning
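
The count columns of the update log could be filled in along these lines. This is only one plausible reading of `updated_from_nan`, `updated_to_nan` and `changed_vals`; the actual `CheckUpdates` logic is not shown in this notebook, and `summarize_row_update` is a hypothetical helper.

```python
import pandas as pd


def summarize_row_update(old_row: pd.Series, new_row: pd.Series) -> dict:
    """Compare two versions of one row that share the same recid."""
    cols_updated, from_nan, to_nan, changed = [], 0, 0, 0
    for col in new_row.index:
        if col not in old_row.index:
            continue
        old, new = old_row[col], new_row[col]
        old_missing, new_missing = pd.isna(old), pd.isna(new)
        if old_missing and new_missing:
            continue              # empty before and after: not an update
        if old_missing:
            from_nan += 1         # value filled in where there was none
        elif new_missing:
            to_nan += 1           # value removed
        elif old != new:
            changed += 1          # value replaced by a different value
        else:
            continue              # unchanged
        cols_updated.append(col)
    return {
        "cols_updated": cols_updated,
        "updated_from_nan": from_nan,
        "updated_to_nan": to_nan,
        "changed_vals": changed,
    }


old_row = pd.Series({"recid": "a", "max_far": 0.01, "max_height": 30.0,
                     "units_per_lot": float("nan"), "lot_coverage": 0.5})
new_row = pd.Series({"recid": "a", "max_far": 0.0, "max_height": 50.0,
                     "units_per_lot": 12.0, "lot_coverage": float("nan")})
summary = summarize_row_update(old_row, new_row)
```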

# QA/QC Use Case:

- The most recent version of this class can be found in the 'jurisdiction-entry-analysis' Lambda function

In [30]:
import sys
sys.path.insert(0, '../../Project_1_Policy_Parsing/')

# pandas is imported explicitly rather than relying on the star imports below
import pandas as pd

from table_update_insights_debugging import *
from utils_io import *

In [33]:
socrata_data_id = 'qdrp-c5ra'
df_old = pull_df_from_socrata(socrata_data_id)

pulling data in 1 chunks of 4936 rows each
pulling chunk 0
took 3.6922 seconds


In [34]:
dfuje = pd.DataFrame(load_json('Use_Case_Data/updates_only_ex.json'))
event = {'socrata_data_id': 'qdrp-c5ra',
         'jurisdiction_entries' : dfuje}

In [35]:
update_info = CheckUpdates(df_old, dfuje, primary_key='recid', parcels=False, compare_masters=False)

In [37]:
pretty_print(update_info.warning_table)

Unnamed: 0,city,editor,edit_timestamp,recid,warning
0,Fremont,Joshua Croff,2019-10-24 17:00:00,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,regional_lu_class value is out of range (val = 123.0)
1,Fremont,Joshua Croff,2019-10-24 17:00:00,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,max_dua value is out of range (val = 5000.0)
2,Oakland,Joshua Croff,2019-10-24 17:00:00,62db00d9-c9ff-4764-8d84-59413fe5f0b6,max_far value is out of range (val = 16.0)
3,Oakland,Joshua Croff,2019-10-24 17:00:00,c1a59692-ee49-4677-95d0-f64e580e5754,units_per_lot value is out of range (val = 12.0)
4,San Leandro,Avalon Schultz,2019-10-24 17:00:00,338fdfa3-3d33-42ec-afbb-b1ca11f7a73a,regional_lu_class value is out of range (val = 123.0)
5,San Leandro,Avalon Schultz,2019-10-24 17:00:00,338fdfa3-3d33-42ec-afbb-b1ca11f7a73a,max_dua value is out of range (val = 5000.0)
6,San Leandro,Avalon Schultz,2019-10-24 17:00:00,338fdfa3-3d33-42ec-afbb-b1ca11f7a73a,"""Testing In Place"" not in other row records for zn_description"


# Pulling Existing Data From Redshift

In [17]:
# Local debugging
# from json import *
# dfuje = load_json('Use_Case_Data/updates_only_ex.json')
# event = {'socrata_data_id': 'qdrp-c5ra',
#          'jurisdiction_entries' : dfuje,
#          'entry_tag' : 'basis'}

In [39]:
import json

with open('Use_Case_Data/updates_only_ex.json') as f:
    dfuje = json.load(f)

In [80]:
master_df_query = 'select * from policy.zoning_lookup_20'
zl_2020 = pull_df_from_redshift_sql(master_df_query)
zl_2020.dtypes

took 0.9691 seconds


recid                     object
county_name               object
city_name                 object
zn_code                   object
zn_description            object
zn_area_overlay           object
regional_lu_class        float64
max_far                  float64
max_dua                  float64
max_height               float64
units_per_lot            float64
minimum_lot_sqft         float64
lot_coverage             float64
max_footprint             object
zn_code_color             object
source                    object
hs                       float64
ht                       float64
hm                       float64
of                       float64
ho                       float64
sc                       float64
il                       float64
iw                       float64
ih                       float64
rs                       float64
rb                       float64
mr                       float64
mt                       float64
me                       float64
building_t

In [40]:
update_uc = pd.read_json('Use_Case_Data/updates_only_ex.json')
update_uc.rename(columns={'building_height':'max_height'}, inplace=True)

In [41]:
pull_df_from_redshift_sql('select * from basis_edit_logs.edit_log_2021').columns

took 0.452 seconds


Index(['recid', 'city_name', 'cols_updated', 'updated_from_nan',
       'edit_timestamp', 'edit_type__c_u_d__', 'main_version', 'old_vals',
       'new_vals'],
      dtype='object')

# Creating, Deleting Tables

In [77]:
query = ['DROP table if exists basis_edit_logs.edit_log_2021']

execute_redshift_cmds(query)

add_table_query = [
    """
    CREATE TABLE basis_edit_logs.edit_log_2021 (
        recid VARCHAR,
        city_name VARCHAR,
        cols_updated VARCHAR(1000),
        updated_from_nan INTEGER,
        updated_to_nan INTEGER,
        changed_vals INTEGER,
        out_of_range INTEGER,
        warnings VARCHAR(5000),
        editor VARCHAR,
        edit_timestamp VARCHAR,
        edit_type__c_u_d__ VARCHAR,
        main_version VARCHAR,
        old_vals VARCHAR(5000),
        new_vals VARCHAR(5000)
        );
    """
]

execute_redshift_cmds(add_table_query)
pull_df_from_redshift_sql('select * from basis_edit_logs.edit_log_2021')

DROP table if exists basis_edit_logs.edit_log_2021



    CREATE TABLE basis_edit_logs.edit_log_2021 (
        recid VARCHAR,
        city_name VARCHAR,
        cols_updated VARCHAR(1000),
        updated_from_nan INTEGER,
        updated_to_nan INTEGER,
        changed_vals INTEGER,
        out_of_range INTEGER,
        warnings VARCHAR(5000),
        editor VARCHAR,
        edit_timestamp VARCHAR,
        edit_type__c_u_d__ VARCHAR,
        main_version VARCHAR,
        old_vals VARCHAR(5000),
        new_vals VARCHAR(5000)
        );
    


took 0.4087 seconds


Unnamed: 0,recid,city_name,cols_updated,updated_from_nan,updated_to_nan,changed_vals,out_of_range,warnings,editor,edit_timestamp,edit_type__c_u_d__,main_version,old_vals,new_vals


In [78]:
query = ['DROP table if exists basis_edit_logs.warning_log_2021']

execute_redshift_cmds(query)

add_table_query = [
    """
    CREATE TABLE basis_edit_logs.warning_log_2021 (
        city VARCHAR,
        editor VARCHAR,
        edit_timestamp VARCHAR,
        recid VARCHAR,
        warning VARCHAR
        );
    """
]

execute_redshift_cmds(add_table_query)
pull_df_from_redshift_sql('select * from basis_edit_logs.warning_log_2021')




    CREATE TABLE basis_edit_logs.warning_log_2021 (
        city VARCHAR,
        editor VARCHAR,
        edit_timestamp VARCHAR,
        recid VARCHAR,
        warning VARCHAR
        );
    


took 0.4405 seconds


Unnamed: 0,city,editor,edit_timestamp,recid,warning


# Lambda Function Invocation:

In [79]:
from aws_creds import alternate_creds
import boto3
import json

# dfuje = "dataframe updated jurisdiction entries"
with open('Use_Case_Data/updates_uc_2020.json') as f:
    dfuje = json.load(f)

event = {'main_version': 3,
         'jurisdiction_entries' : dfuje,
         'include_parcels' : False}

payload = json.dumps(event)

session = boto3.Session(aws_access_key_id=alternate_creds['Access_Key_ID'],
                        aws_secret_access_key = alternate_creds['Secret_Key'])

lambda_client = session.client('lambda', region_name='us-west-2')
lambda_client.invoke(FunctionName='arn:aws:lambda:us-west-2:311263071456:function:jurisdiction-entry-analysis', 
                     InvocationType='RequestResponse',
                     Payload=payload)

{'ResponseMetadata': {'RequestId': 'db55b0be-1443-4e0c-8811-7669c4fe534c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 19 Jul 2021 22:48:10 GMT',
   'content-type': 'application/json',
   'content-length': '4',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'db55b0be-1443-4e0c-8811-7669c4fe534c',
   'x-amzn-remapped-content-length': '0',
   'x-amz-executed-version': '$LATEST',
   'x-amzn-trace-id': 'root=1-60f60124-46e2a2cc6ae5627e6d52df49;sampled=0'},
  'RetryAttempts': 0},
 'StatusCode': 200,
 'ExecutedVersion': '$LATEST',
 'Payload': <botocore.response.StreamingBody at 0x10d3c18b0>}
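
The `Payload` entry above is a stream, so the function's return value is only visible after reading it. A minimal sketch of doing that is shown below; `decode_lambda_response` is a hypothetical helper, and an in-memory stream stands in for the real `StreamingBody` so the example runs without AWS access. The `content-length` of 4 above is consistent with a body of `null`, but that is an inference, not something the output confirms.

```python
import io
import json


def decode_lambda_response(response: dict):
    """Read and JSON-decode the payload of a lambda_client.invoke(...) response."""
    raw = response["Payload"].read()      # the stream can only be read once
    body = json.loads(raw) if raw else None
    # boto3 adds a "FunctionError" key when the function itself raised an error
    if "FunctionError" in response:
        raise RuntimeError(f"Lambda reported an error: {body}")
    return body


# Stand-in for a real response (assumption: the function returned nothing)
fake_response = {"StatusCode": 200, "Payload": io.BytesIO(b"null")}
result = decode_lambda_response(fake_response)
```

Checking `FunctionError` matters because `StatusCode` is 200 even when the function code raises an exception.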

- Empty values are written as "-1", because NaN values raise datatype errors
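
The "-1" convention could be applied before a log is written, roughly as follows. `to_log_records` is a hypothetical helper, and the sentinel is the only detail taken from this notebook.

```python
import json

import pandas as pd


def to_log_records(df: pd.DataFrame, sentinel=-1) -> list:
    """Replace missing values with a sentinel and return plain records."""
    return df.fillna(sentinel).to_dict(orient="records")


log = pd.DataFrame({
    "recid": ["a", "b"],
    "max_far": [1.0, float("nan")],
    "source": ["zoning", None],
})
records = to_log_records(log)

# allow_nan=False makes json.dumps raise if any NaN slipped through
payload = json.dumps(records, allow_nan=False)
```

The drawback of a numeric sentinel is that -1 becomes indistinguishable from a genuine value of -1, so it only suits columns where negative values cannot occur.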

In [80]:
pretty_print(pull_df_from_redshift_sql('select * from basis_edit_logs.edit_log_2021'))

took 0.3986 seconds


Unnamed: 0,recid,city_name,cols_updated,updated_from_nan,updated_to_nan,changed_vals,out_of_range,warnings,editor,edit_timestamp,edit_type__c_u_d__,main_version,old_vals,new_vals
0,thiswasadded1-f595-4fa8-8da2-975dfae46dc4,Cloverdale,row created,-1,-1,-1,2,"[max_far value is out of range (val = 0.0),max_dua value is out of range (val = 0.0),0 Joshua Croff\nName: editor, dtype: object not in other row records for editor]",Joshua Croff,2019-10-21 00:00:00,c,3,-1,"[thiswasadded1-f595-4fa8-8da2-975dfae46dc4,Cloverdale,M-1,General Industrial,None,5,0.0,0,nan,nan,Joshua Croff,2019-10-21 00:00:00,Sonoma,#75187C,nan,nan,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"
1,dc4e140a-ed57-42c5-b4d0-89444919e88f,Dublin,"[recid,city_name,zn_code,zn_description,zn_area_overlay,regional_lu_class,max_far,max_dua,max_height,units_per_lot,editor,edit_date,county_name,zn_code_color,minimum_lot_sqft,source,iw,building_types_source,hm,ht,max_footprint,rb,ordinance_url,ho,mt,il,hs,me,lot_coverage,ih,mr,sc,rs,of]",0,2,34,0,[],Kearey Smith,2019-10-29 00:00:00,u,3,"[dc4e140a-ed57-42c5-b4d0-89444919e88f,Alameda,Dublin,C-N,Neighborhood Commercial,None,3.0,nan,nan,35.0,nan,nan,nan,None,#FF6B6B,zoning,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,2010 Aksel Geo-matching,https://www.codepublishing.com/CA/Dublin/Dublin08/Dublin0812.html#8.12,Kearey Smith,2019-10-29 00:00:00]","[dc4e140a-ed57-42c5-b4d0-89444919e88f,Dublin,C-N,Neighborhood Commercial,None,123,nan,12,35.0,nan,Michael Cass,2019-10-25 00:00:00,Alameda,#FF6B6B,5000.0,nan,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"
2,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,Fremont,"[recid,city_name,zn_code,zn_description,zn_area_overlay,regional_lu_class,max_far,max_dua,max_height,units_per_lot,editor,edit_date,county_name,zn_code_color,minimum_lot_sqft,source,iw,building_types_source,hm,ht,max_footprint,rb,ordinance_url,ho,mt,il,hs,me,lot_coverage,ih,mr,sc,rs,of]",14,0,20,3,"[regional_lu_class value is out of range (val = 123.0),max_far value is out of range (val = 0.0),max_dua value is out of range (val = 5000.0),Joshua Croff not in other row records for editor]",Joshua Croff,2019-10-25 00:00:00,u,3,"[e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,Alameda,Fremont,P-87-2,Planned District,None,11.0,0.01,1.0,30.0,nan,nan,nan,None,#556B2F,zoning,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,None,https://www.codepublishing.com/CA/Fremont/,Kearey Smith,2019-05-24 00:00:00]","[e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,Fremont,P-87-2,Planned District,None,123,0.0,5000,50.0,nan,Joshua Croff,2019-10-25 00:00:00,Alameda,#556B2F,nan,nan,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"
3,f063d0c3-d09f-47e7-9381-f71b954ca0fe,Fremont,"[recid,city_name,zn_code,zn_description,zn_area_overlay,regional_lu_class,max_far,max_dua,max_height,units_per_lot,editor,edit_date,county_name,zn_code_color,minimum_lot_sqft,source,iw,building_types_source,hm,ht,max_footprint,rb,ordinance_url,ho,mt,il,hs,me,lot_coverage,ih,mr,sc,rs,of]",0,17,34,0,[],Kearey Smith,2019-10-31 00:00:00,u,3,"[f063d0c3-d09f-47e7-9381-f71b954ca0fe,Alameda,Fremont,OS(Q),Open Space,Quarry Overlay,7.0,nan,1.0,30.0,nan,nan,nan,None,#006A00,zoning,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,None,https://www.codepublishing.com/CA/Fremont/,Kearey Smith,2019-10-31 00:00:00]","[f063d0c3-d09f-47e7-9381-f71b954ca0fe,Fremont,OS(Q),Open Space,Quarry Overlay,12,26.0,16,30.0,1.0,Marc Cleveland,2019-10-25 00:00:00,Alameda,#006A00,43560.0,nan,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"
4,62db00d9-c9ff-4764-8d84-59413fe5f0b6,Oakland,row created,-1,-1,-1,2,"[max_far value is out of range (val = 16.0),max_dua value is out of range (val = 0.0),3756 CN-2 - D-KP-3\nName: zn_code, dtype: object not in other row records for zn_code,3756 Kaiser Permanente Oakland Medical Center Zone\nName: zn_area_overlay, dtype: object not in other row records for zn_area_overlay,3756 25.0\nName: source, dtype: object not in other row records for source,3756 Joshua Croff\nName: editor, dtype: object not in other row records for editor]",Joshua Croff,2019-10-25 00:00:00,c,3,-1,"[62db00d9-c9ff-4764-8d84-59413fe5f0b6,Oakland,CN-2 - D-KP-3,Commercial - Neighborhood Center,Kaiser Permanente Oakland Medical Center Zone,3,16.0,0,nan,nan,Joshua Croff,2019-10-25 00:00:00,Alameda,#FF0000,nan,25.0,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"
5,c1a59692-ee49-4677-95d0-f64e580e5754,Oakland,row created,-1,-1,-1,3,"[max_far value is out of range (val = 0.0),max_dua value is out of range (val = 0.0),units_per_lot value is out of range (val = 12.0),4060 123.0\nName: source, dtype: object not in other row records for source,4060 Joshua Croff\nName: editor, dtype: object not in other row records for editor]",Joshua Croff,2019-10-25 00:00:00,c,3,-1,"[c1a59692-ee49-4677-95d0-f64e580e5754,Oakland,D-KP-1,Special and Combined - Kaiser Permanente Oakland Medical,None,4,0.0,0,nan,12.0,Joshua Croff,2019-10-25 00:00:00,Alameda,#990000,10000.0,123.0,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"
6,338fdfa3-3d33-42ec-afbb-b1ca11f7a73a,San Leandro,"[recid,city_name,zn_code,zn_description,zn_area_overlay,regional_lu_class,max_far,max_dua,max_height,units_per_lot,editor,edit_date,county_name,zn_code_color,minimum_lot_sqft,source,iw,building_types_source,hm,ht,max_footprint,rb,ordinance_url,ho,mt,il,hs,me,lot_coverage,ih,mr,sc,rs,of]",12,0,22,2,"[regional_lu_class value is out of range (val = 123.0),max_dua value is out of range (val = 5000.0),Avalon Schultz not in other row records for editor,Testing In Place not in other row records for zn_description]",Avalon Schultz,2019-10-25 00:00:00,u,3,"[338fdfa3-3d33-42ec-afbb-b1ca11f7a73a,Alameda,San Leandro,IT,Industrial Transition,None,5.0,1.0,40.0,35.0,nan,nan,nan,None,#B566FF,zoning,0.0,0.0,nan,nan,nan,nan,1.0,nan,nan,nan,nan,nan,nan,nan,Industrial zn_description inference,http://sanleandro.org/depts/cd/plan/zonecodemap.asp,Kearey Smith,2019-10-24 00:00:00]","[338fdfa3-3d33-42ec-afbb-b1ca11f7a73a,San Leandro,IT,Testing In Place,None,123,1.0,5000,50.0,nan,Avalon Schultz,2019-10-25 00:00:00,Alameda,#B566FF,5000.0,nan,0.0,None,1.0,0.0,None,0.0,None,0.0,0.0,0.0,0.0,0.0,None,0.0,1.0,0.0,1.0,0.0]"


In [81]:
pretty_print(pull_df_from_redshift_sql('select * from basis_edit_logs.warning_log_2021'))

took 0.4008 seconds


Unnamed: 0,city,editor,edit_timestamp,recid,warning
0,Cloverdale,Joshua Croff,2019-10-21 00:00:00,thiswasadded1-f595-4fa8-8da2-975dfae46dc4,max_far value is out of range (val = 0.0)
1,Cloverdale,Joshua Croff,2019-10-21 00:00:00,thiswasadded1-f595-4fa8-8da2-975dfae46dc4,max_dua value is out of range (val = 0.0)
2,Cloverdale,Joshua Croff,2019-10-21 00:00:00,thiswasadded1-f595-4fa8-8da2-975dfae46dc4,"0 Joshua Croff\nName: editor, dtype: object not in other row records for editor"
3,Fremont,Joshua Croff,2019-10-25 00:00:00,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,regional_lu_class value is out of range (val = 123.0)
4,Fremont,Joshua Croff,2019-10-25 00:00:00,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,max_far value is out of range (val = 0.0)
5,Fremont,Joshua Croff,2019-10-25 00:00:00,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,max_dua value is out of range (val = 5000.0)
6,Fremont,Joshua Croff,2019-10-25 00:00:00,e63d2e0d-2baf-4d4d-a6f8-5eae9bdbc33c,Joshua Croff not in other row records for editor
7,Oakland,Joshua Croff,2019-10-25 00:00:00,62db00d9-c9ff-4764-8d84-59413fe5f0b6,max_far value is out of range (val = 16.0)
8,Oakland,Joshua Croff,2019-10-25 00:00:00,62db00d9-c9ff-4764-8d84-59413fe5f0b6,max_dua value is out of range (val = 0.0)
9,Oakland,Joshua Croff,2019-10-25 00:00:00,62db00d9-c9ff-4764-8d84-59413fe5f0b6,"3756 CN-2 - D-KP-3\nName: zn_code, dtype: object not in other row records for zn_code"
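
Several warnings above contain text such as `Name: editor, dtype: object`. That pattern is what pandas prints for a whole Series, so a likely cause (an inference from the output, not confirmed by the source code) is that a one-element Series was formatted into the message instead of its scalar value. The sketch below reproduces the effect and one way around it; the dataframe is made up for illustration.

```python
import pandas as pd

new_rows = pd.DataFrame({"recid": ["thiswasadded1"], "editor": ["Joshua Croff"]})

# Boolean selection returns a Series, even when exactly one row matches
selected = new_rows.loc[new_rows["recid"] == "thiswasadded1", "editor"]

# Formatting the Series embeds its index, name and dtype in the message
noisy = f"{selected} not in other row records for editor"

# Taking the scalar first gives the intended message
clean = f"{selected.iloc[0]} not in other row records for editor"
```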


# Deprecated (though it may have some useful content)

# Simulating Jurisdiction Entries

- Using a few simulated cases, jurisdiction entries for three dates will be saved to "Example_Logs"

## Update 1:

In [None]:
df_old = load_data()
update1 = use_cases(df_old)
update1_info = CheckUpdates(df_old, update1, primary_key='recid', parcels=False)

In [None]:
update1_info.row_comparison
update1_info.update_analysis()

In [None]:
update1_info.update_log

In [None]:
pretty_print(update1_info.warning_table)

In [None]:
date = pd.Timestamp('2019-10-25')
current_date = date.strftime("%y-%m-%d_%H-%M")

warnings_file_name = 'warnings_' + current_date
update_log_file_name = 'update-log_' + current_date

update1_info.update_log.to_csv('Example_Logs/'+update_log_file_name+'.csv')
update1_info.warning_table.to_csv('Example_Logs/'+warnings_file_name+'.csv')

In [None]:
df_old.to_csv('Example_Logs/Master_Table.csv')

In [None]:
pd.Timestamp('2019-10-25')

## Update 2:

In [None]:
update2 = use_cases_entry_2(df_old)
update2_info = CheckUpdates(df_old, update2, primary_key='recid', parcels=False)

In [None]:
#update_info.row_comparison
update2_info.update_analysis()

date = pd.Timestamp('2019-11-14')
current_date = date.strftime("%y-%m-%d_%H-%M")

warnings_file_name = 'warnings_' + current_date
update_log_file_name = 'update-log_' + current_date

update2_info.update_log.to_csv('Example_Logs/'+update_log_file_name+'.csv')
update2_info.warning_table.to_csv('Example_Logs/'+warnings_file_name+'.csv')

## Update 3:

In [None]:
update3 = use_cases_entry_3(df_old)
update3_info = CheckUpdates(df_old, update3, primary_key='recid', parcels=False)

In [None]:
#update_info.row_comparison
update3_info.update_analysis()

date = pd.Timestamp('2020-01-01')
current_date = date.strftime("%y-%m-%d_%H-%M")

warnings_file_name = 'warnings_' + current_date
update_log_file_name = 'update-log_' + current_date

update3_info.update_log.to_csv('Example_Logs/'+update_log_file_name+'.csv')
update3_info.warning_table.to_csv('Example_Logs/'+warnings_file_name+'.csv')

# Pulling the Warnings/Logs From Master to a Certain Date

- Produce, in tabular format, all the updates from the "master" table up to the cutoff date

In [None]:
import os
from datetime import datetime

import pandas as pd

# Pull the file list from "Example_Logs"

def retrieve_logs(path: str, cutoff_date: str) -> tuple:
    """
    Concatenate the saved update logs and warning logs dated before a cutoff.

    Inputs:
    path -> path to the logs
    cutoff_date -> Up to what day? format: yy-mm-dd_hh-mm

    Returns:
    (update_log_df, warning_log_df)
    """
    date_format = '%y-%m-%d_%H-%M'
    cutoff = datetime.strptime(cutoff_date, date_format)

    update_frames = []
    warning_frames = []

    # Sort for a deterministic order; os.listdir returns names in arbitrary order,
    # so skipping "the first file" is not a reliable way to exclude Master_Table.csv
    for file in sorted(os.listdir(path)):
        if not file.endswith('.csv'):
            continue
        # Log names look like "<kind>_<yy-mm-dd_HH-MM>.csv"
        date_str = file.partition("_")[2].split(".")[0]
        try:
            date = datetime.strptime(date_str, date_format)
        except ValueError:
            # Not a dated log (e.g. Master_Table.csv), so skip it
            continue
        if cutoff > date:
            frame = pd.read_csv(os.path.join(path, file), index_col='Unnamed: 0')
            if 'update' in file:
                update_frames.append(frame)
            elif 'warning' in file:
                warning_frames.append(frame)

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    update_log_df = pd.concat(update_frames) if update_frames else pd.DataFrame()
    warning_log_df = pd.concat(warning_frames) if warning_frames else pd.DataFrame()

    if not update_log_df.empty:
        update_log_df = update_log_df.sort_values(by='city_name')
    if not warning_log_df.empty:
        warning_log_df = warning_log_df.sort_values(by='editor')
    return update_log_df, warning_log_df

path = '/Users/okeefe/Box/USF Data Science Practicum/2020-21/Okeefe/Project_3_BASIS_Pipeline/Example_Logs'
cutoff_date = '19-12-30_10-15'

update_log, warning_log = retrieve_logs(path, cutoff_date)

In [None]:
pretty_print(update_log)

In [None]:
pretty_print(warning_log)

# Roll Back to a Specified Date


# Save log into S3

- Once the analysis has run, the following cell names the log files by date and time so they can be saved to the S3 path of choice.

In [None]:
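from datetime import datetime

now = datetime.now()
# Same "%H-%M" stamp and "update-log_" prefix as the Example_Logs files above,
# so the names stay parseable by retrieve_logs
current_date = now.strftime("%y-%m-%d_%H-%M")

warnings_file_name = 'warnings_' + current_date
update_log_file_name = 'update-log_' + current_date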



# post_df_to_s3(update_info.warning_table, 'upload-logs-data-lake-mtc', warnings_file_name)
# post_df_to_s3(update_info.update_table, 'upload-logs-data-lake-mtc', update_log_file_name)

def all_changed_rows(date):

Function: collect all logs written before a given date/time.

1. Open the S3 bucket that holds all logs
2. Parse every log file name and extract its date
3. Concatenate all log files dated before or up to the input date/time

Returns: two pandas DataFrames

1. Concatenated update log
2. Concatenated warning log (by city)

update_logs, warning_logs = all_changed_rows(date)
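
Step 2 of the plan above, picking out the log files by the date in their names, might look like the sketch below. It works on a plain list of object keys, which in practice could come from an S3 listing (for example boto3's `list_objects_v2`); `split_log_keys` is a hypothetical helper, and the naming scheme `<kind>_<yy-mm-dd_HH-MM>.csv` is assumed from the Example_Logs files.

```python
from datetime import datetime

DATE_FORMAT = "%y-%m-%d_%H-%M"


def split_log_keys(keys, cutoff: datetime):
    """Return (update_keys, warning_keys) for logs dated on or before the cutoff."""
    update_keys, warning_keys = [], []
    for key in keys:
        name = key.rsplit("/", 1)[-1]          # drop any folder prefix
        kind, _, rest = name.partition("_")
        stamp = rest.rsplit(".", 1)[0]          # drop the file extension
        try:
            written = datetime.strptime(stamp, DATE_FORMAT)
        except ValueError:
            continue                            # not a dated log file
        if written > cutoff:
            continue
        if kind.startswith("update"):
            update_keys.append(key)
        elif kind.startswith("warning"):
            warning_keys.append(key)
    return update_keys, warning_keys


keys = [
    "update-log_19-10-25_00-00.csv",
    "warnings_19-10-25_00-00.csv",
    "update-log_20-01-01_00-00.csv",
    "Master_Table.csv",
]
selected_updates, selected_warnings = split_log_keys(keys, datetime(2019, 12, 30, 10, 15))
```

Each selected key would then be read into a dataframe and concatenated, as in step 3.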

def rollback_table(update_logs, master_socrata_table, save_df=False):

1. Identify the changed rows via update_logs
2. Make a copy of the master_socrata_table
3. Load the copy into a dataframe
4. Restore the changed rows in that dataframe

(optional)
5. If save_df, save the rolled-back dataframe as a CSV (upload to Socrata?)

return changed_master_socrata_dataframe
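
A rough sketch of these rollback steps follows. It rests on a loud assumption: each log entry's `old_vals` is a dict mapping column name to previous value, whereas the Redshift log shown earlier stores `old_vals` as a stringified list that would first need parsing. The `save_path` parameter is also hypothetical and replaces the `save_df` flag.

```python
import pandas as pd


def rollback_table(update_logs: pd.DataFrame, master_table: pd.DataFrame,
                   key: str = "recid", save_path=None) -> pd.DataFrame:
    """Undo logged edits on a copy of the master table."""
    rolled_back = master_table.copy().set_index(key)
    # Undo the newest edits first so the oldest logged values win
    ordered = update_logs.sort_values("edit_timestamp", ascending=False)
    for _, entry in ordered.iterrows():
        recid, edit_type = entry[key], entry["edit_type__c_u_d__"]
        if edit_type == "c":
            # The row was created by the edit, so rolling back removes it
            rolled_back = rolled_back.drop(index=recid, errors="ignore")
        elif edit_type == "u" and recid in rolled_back.index:
            # ASSUMPTION: old_vals is a {column: previous value} dict
            for col, old in entry["old_vals"].items():
                rolled_back.loc[recid, col] = old
        # Deleted rows ("d") would need to be re-inserted; not covered here
    rolled_back = rolled_back.reset_index()
    if save_path:
        rolled_back.to_csv(save_path, index=False)
    return rolled_back


master = pd.DataFrame({"recid": ["a", "b", "c"],
                       "max_far": [0.0, 2.0, 16.0],
                       "max_height": [50.0, 30.0, 10.0]})
logs = pd.DataFrame([
    {"recid": "a", "edit_type__c_u_d__": "u", "edit_timestamp": "2019-10-25 00:00:00",
     "old_vals": {"max_far": 0.01, "max_height": 30.0}},
    {"recid": "c", "edit_type__c_u_d__": "c", "edit_timestamp": "2019-10-25 00:00:00",
     "old_vals": -1},
])
restored = rollback_table(logs, master)
```

Working on a copy keeps the master table untouched, which matches step 2 of the plan.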