# The Objective
This exercise revolves around translating a given SQL query into a Python script. The query in question is the ams.rules_input.sql query. 

**The objective is to generate an exact copy, in a pandas DataFrame, of the ams.sf_cases_input table in Python.**

---
# 0) Setting Key Variables
To start, we'll work through a toy example that walks you through accessing the data and provides context and business background.

## Context

### What is AMS?
**AMS is our Asset Management System.** It is the data pipeline run daily that creates the "master location table" most of our business uses for device location and status.

### What is the Business Value of this Toy Query?
The ams.rules_input.sql query creates variables and values for later use by other queries. It is a workaround written by analysts at a time when SQL and Redshift were the only tools available.

### Important Notes
- **The source code is production code - for this exercise, subsets of the tables have been created within the data_engineer schema in the interview database.** 
    - They are named appropriately: ams.rules_input --> data_engineer.ams_rules_input.
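
The naming rule above is mechanical, so it can be captured in a small helper. This is only a sketch: the function name `interview_table_name` is hypothetical and not part of any package mentioned here.

```python
def interview_table_name(production_name: str, schema: str = "data_engineer") -> str:
    """Map a production table such as 'ams.rules_input' to its interview copy."""
    # Split off the production schema, then join it to the table name with an underscore.
    prod_schema, table = production_name.split(".", 1)
    return f"{schema}.{prod_schema}_{table}"


print(interview_table_name("ams.rules_input"))   # data_engineer.ams_rules_input
print(interview_table_name("salesforce.cases"))  # data_engineer.salesforce_cases
```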

### The Query Code
The code you will be "translating" from SQL to Python is below:

```sql
DROP TABLE IF EXISTS ams.rules_input;
SELECT 'Rollforward date' AS rule, (SELECT value FROM ams.temp)::varchar AS value INTO ams.rules_input UNION ALL --Change to produce AMS for another date. Represents the start of day date (i.e. equates to end of day for date prior). Therefore for end of month use '01-##-20##' not '31-##-20##'.
SELECT 'Length of mac address', '12' UNION ALL --The length of a true mac address once stripped for ':'. Shouldn't ever change but is in here incase it does
SELECT 'Days from ''In Transit To Member'' Until ''Unknown''', '35' UNION ALL
SELECT 'Days from ''In Transit To Warehouse'' Until ''Unknown''', '55' UNION ALL
SELECT 'Days from ''X'' Until ''Unknown'' Given No Ping', '45' UNION ALL -- A year from no ping to changing status to Unknown
SELECT 'Days from ''Unknown'' Until Netsuite Disposed', '180' UNION ALL -- Six months from being in Unknown to being retired in Netsuite
SELECT 'Team name of account owners for accounts undergoing installation', 'Onboarding' UNION ALL
SELECT 'Currently Running', (SELECT value FROM ams.temp)::varchar;
```

### Input Tables
- **None:** In later examples, and in the exercise itself, this section lists the necessary input tables with short descriptions of them.

### Output Tables
- **ams.rules_input:** A table that contains various business parameters business users can change. In the absence of named variables (and because AMS was built from an analyst's point of view, where SQL and Redshift were the only infrastructure possibilities), this is how parameters are created for AMS.

## Pulling the Data
### Accessing the Input Data
Before any analysis or processing, we need to access the data. Normally, we would use Outcome Health's custom data-pipeline-utils package, which allows us to access data from any of our data stores with the same interface. However, for legal reasons, we'll be using a cleaned subset of data in our interviewing database to pull what we need instead.

**In this case, there is no input data to pull, so we'll skip this step for now.**

In [9]:
import pandas as pd
import pandas.io.sql as sqlio
import psycopg2

pd.options.display.max_colwidth = 200 

if False:
    conn = psycopg2.connect(host="",
                            port="",
                            database="", 
                            user="", 
                            password="")
    sql = "select count(*) from my_table;"
    data = sqlio.read_sql_query(sql, conn)
    print(data)

### Accessing the Output Data
We will, however, want to be able to compare our generated dataframes with the dataframe from the output table. To do that, we'll need to pull the output table.

In [6]:
conn = psycopg2.connect(host="data-interview.outcomehealth.io",
                        port="5432",
                        database="product_analytics", 
                        user="pa_candidate", 
                        password="OsOntUnDleYeTivi")

rules_sql = "select id, rule, value from data_engineer.ams_rules_input;"
rules_data = sqlio.read_sql_query(rules_sql, conn)
rules_data.sort_values('rule',inplace=True)
rules_data.reset_index(drop=True,inplace=True)
rules_data

| | id | rule | value |
|---|---|---|---|
| 0 | 4 | Currently Running | |
| 1 | 1 | Days from 'In Transit To Member' Until 'Unknown' | 35 |
| 2 | 6 | Days from 'In Transit To Warehouse' Until 'Unknown' | 55 |
| 3 | 3 | Days from 'Unknown' Until Netsuite Disposed | 180 |
| 4 | 2 | Days from 'X' Until 'Unknown' Given No Ping | 45 |
| 5 | 5 | Length of mac address | 12 |
| 6 | 0 | Rollforward date | 2019-03-31 |
| 7 | 7 | Team name of account owners for accounts undergoing installation | Onboarding |


## Processing the Data
This is where we perform data transformations and merges necessary to reach our output table. 

In this case, we will be setting all "rules" as variables, rather than rebuilding the table. Since the purpose of the table was to set variables, and we can actually do that in Python, that's what we're going to do.
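
If you did want the table itself rather than loose variables, the same rules could be held in a dict and turned into a DataFrame. The following is a minimal sketch under a few assumptions: `build_rules_frame` is a hypothetical helper, the rollforward date is passed in as a string instead of being read from ams.temp, and `Currently Running` is left empty because that is how it appears in the sample output.

```python
import pandas as pd


def build_rules_frame(rollforward_date: str) -> pd.DataFrame:
    """Return the rules as a two-column frame of strings, sorted by rule name."""
    rules = {
        "Rollforward date": rollforward_date,
        "Length of mac address": "12",
        "Days from 'In Transit To Member' Until 'Unknown'": "35",
        "Days from 'In Transit To Warehouse' Until 'Unknown'": "55",
        "Days from 'X' Until 'Unknown' Given No Ping": "45",
        "Days from 'Unknown' Until Netsuite Disposed": "180",
        "Team name of account owners for accounts undergoing installation": "Onboarding",
        # Assumption: kept empty here to mirror the sample output table.
        "Currently Running": None,
    }
    frame = pd.DataFrame({"rule": list(rules), "value": list(rules.values())})
    return frame.sort_values("rule").reset_index(drop=True)


rules_frame = build_rules_frame("2019-03-31")
print(rules_frame)
```

Keeping every value as a string mirrors the `::varchar` cast in the query, so a comparison against the output table would not trip over integer versus text values.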

**Moving forward, the objective will be to produce an exact replica of the output table - as will be shown below.**

In [3]:
from datetime import datetime, timedelta
# Note: despite its name, `today` holds tomorrow's date (today + 1 day)
today = (datetime.today() + timedelta(days=1)).strftime('%Y-%m-%d') 

currently_running = ''                     # A flag for ensuring duplicate AMS processes are not running
days_from_in_transit_member_til_unk = 35   # If we do not see a device at a clinic after 35 days after sending it there, it becomes "unknown"
days_in_transit_ware_til_unk = 55          # If we do not see a device after 55 days after sending it to the warehouse, it becomes "unknown"
days_from_unk_til_netsuite_disposed = 180  # After a device is "unknown" for 180 days, our accounting team writes it off
days_from_x_til_unk = 45                   # After a device has been installed, if it does not ping for 45 days, it becomes "unknown"
mac_address_len = 12                       # The appropriate length of mac addresses
rollforward_date = today                   # The date AMS is generating data for
team_name = 'Onboarding'                   # The team name of account owners

## Output Validation
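In this case, I'm simply checking that each variable matches the corresponding value in the output table. The `variables` list is ordered to match `rules_data` after it was sorted by rule.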

**Moving forward, I'll be using the pandas.DataFrame.equals method to check that every value is identical.**
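
One caveat with `pandas.DataFrame.equals` is that it is sensitive to row order, index labels, and column dtypes, so both frames usually need to be normalized first. The sketch below shows one way to do that; `frames_match` is a hypothetical helper name, and the sample frames are made up for illustration.

```python
import pandas as pd


def frames_match(left: pd.DataFrame, right: pd.DataFrame, columns: list) -> bool:
    """Compare two frames on the given columns, ignoring row order and index."""
    def normalize(frame: pd.DataFrame) -> pd.DataFrame:
        # Same column order, same row order, fresh 0..n-1 index.
        return frame[columns].sort_values(by=columns).reset_index(drop=True)

    return normalize(left).equals(normalize(right))


expected = pd.DataFrame({"asset_tag": ["T1", "41562"],
                         "product": ["Tablet", "Waiting Room Screen"]})
shuffled = expected.iloc[::-1]                             # same rows, reversed order
changed = expected.assign(product=["Tablet", "Wallboard"])  # one value differs

print(frames_match(expected, shuffled, ["asset_tag", "product"]))
print(frames_match(expected, changed, ["asset_tag", "product"]))
```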

In [4]:
variables = [currently_running, days_from_in_transit_member_til_unk, days_in_transit_ware_til_unk, 
             days_from_unk_til_netsuite_disposed, days_from_x_til_unk, mac_address_len, rollforward_date,
             team_name]
ok = True
for n,i in enumerate(rules_data['value']):
    if str(variables[n]) != i:
        ok = False
        print('{} does not match.'.format(i))
if ok:
    print('Everything is ok!')

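None does not match.
2019-03-31 does not match.

Both mismatches are expected: `Currently Running` is null in the output table while the variable is an empty string, and the table holds a fixed rollforward date (2019-03-31) while the script computes it from the current date.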


---
# 1) A Complete Example
Now, we'll walk through what a proper example will look like. The exercise you'll be completing will look exactly like this one, with more complex logic.

## Context

### What is the Business Value of this Toy Query?
The ams.sf_cases_input.sql query processes our raw "cases" data from Salesforce into a cleaned table. "Cases" are usually problem devices that we need to address - our Network Operation Solutions team handles calling in to clinics or sending out technicians to get these devices fixed.

### The Query Code
The code you will be "translating" from SQL to Python is below:

```sql
DROP TABLE IF EXISTS ams.sf_cases_input CASCADE;
SELECT
    TRIM(UPPER(ISNULL(tablets_impacted,CASE WHEN subject_parse ~ '^[0-9]+$' THEN subject_parse END))) AS asset_tag
  ,	product
  ,	MIN(created_date)::date AS first_seen_in_data_date

INTO ams.sf_cases_input
FROM (
    SELECT
        tablets_impacted
      ,	LEFT(REPLACE(REPLACE(subject,'(Scheduled) ',''),'Player asset ',''),5) AS subject_parse
      ,	created_date
      ,	CASE WHEN case_type ILIKE '%Player%' THEN 'Waiting Room Screen'
           WHEN case_type ILIKE '%Tablet%' THEN 'Tablet'
           WHEN case_type ILIKE '%Wallboard%' THEN 'Wallboard'
           WHEN case_type ILIKE 'Irt %' THEN 'Tablet'
           WHEN case_type ILIKE '%WIFI%' THEN 'Waiting Room Wifi'
        END AS product

    FROM salesforce.cases

    WHERE case_type LIKE '%MIA%'
   ) a

GROUP BY
    1,2;
```
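
The `CASE ... ILIKE` ladder in the query above has two properties worth preserving in a translation: matching is case-insensitive, and the first matching branch wins. As a sketch of one vectorized way to express that in pandas (the names `PRODUCT_RULES` and `classify_product` are hypothetical, and the sample case types are made up):

```python
import pandas as pd

# (regex, product) pairs in the same order as the SQL CASE branches.
PRODUCT_RULES = [
    (r"player", "Waiting Room Screen"),
    (r"tablet", "Tablet"),
    (r"wallboard", "Wallboard"),
    (r"^irt ", "Tablet"),  # ILIKE 'Irt %' only matches at the start of the string
    (r"wifi", "Waiting Room Wifi"),
]


def classify_product(case_type: pd.Series) -> pd.Series:
    """Vectorized CASE WHEN: the first matching rule wins, no match leaves None."""
    product = pd.Series([None] * len(case_type), index=case_type.index, dtype=object)
    for pattern, label in PRODUCT_RULES:
        # case=False mirrors ILIKE; na=False treats null case types as non-matching.
        matches = case_type.str.contains(pattern, case=False, regex=True, na=False)
        # Only fill rows that no earlier rule has already claimed.
        product[matches & product.isna()] = label
    return product


sample = pd.Series(["Player MIA in BS", "Tablet MIA", "irt MIA", "WIFI MIA", "Other MIA"])
print(classify_product(sample).tolist())
```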

### Input Tables
- **salesforce.cases:** The most recently ingested snapshot of the "cases" object in Salesforce. Contains metadata about every case that has ever been opened.

### Output Tables
- **ams.sf_cases_input:** A cleaned up version of a slice of salesforce.cases data - meant for use in our AMS process.

## Pulling the Data
### Accessing the Input Data
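As before, we're going to access the core dataset. I'm only pulling the fields the query uses, which keeps the pull fast. Note that the pull also excludes rows where tablets_impacted is 'Not Deployed', a filter that is not in the production query.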

In [5]:
cases_sql = '''
select 
   tablets_impacted,
   subject,
   created_date,
   case_type
from data_engineer.salesforce_cases
where case_type like '%MIA%'
 AND tablets_impacted not in ('Not Deployed');
'''
cases_data = sqlio.read_sql_query(cases_sql, conn)
cases_data.head()

| | tablets_impacted | subject | created_date | case_type |
|---|---|---|---|---|
| 0 | 32302 | Player-bs MIA 32302 | 2018-01-02 13:27:21 | Player MIA in BS |
| 1 | t103695 | Tablet MIA t103695 | 2018-01-02 13:20:32 | Tablet MIA |
| 2 | t01e103160 | Tablet MIA t01e103160 | 2018-01-02 13:21:01 | Tablet MIA |
| 3 | 17604 | Player-bs MIA 17604 | 2018-01-02 13:27:21 | Player MIA in BS |
| 4 | ahtlg0000783858019 | Tablet MIA ahtlg0000783858019 | 2018-01-02 13:14:41 | Tablet MIA |


### Accessing the Output Data
Again, we want to access the output data in order to compare our result with the real thing.

In [6]:
output_sql = '''
select 
    id,
    asset_tag,
    product,
    first_seen_in_data_date
from data_engineer.ams_sf_cases_input;
'''
output_data = sqlio.read_sql_query(output_sql, conn)
output_data.head()

| | id | asset_tag | product | first_seen_in_data_date |
|---|---|---|---|---|
| 0 | 0 | T107301 | Tablet | 2018-01-02 |
| 1 | 1 | T01E122281 | Tablet | 2018-01-02 |
| 2 | 2 | T102121 | Tablet | 2018-01-02 |
| 3 | 3 | AHTLG0000820005056 | Tablet | 2018-01-02 |
| 4 | 4 | 41562 | Waiting Room Screen | 2018-01-02 |


## Processing the Data
This is where we perform data transformations and merges necessary to reach our output table. 

**Again, our objective is to produce an exact replica of the output table.**
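
One translation detail to keep in mind when replicating a SQL `GROUP BY`: SQL keeps rows whose grouping key is NULL as their own group, while a pandas `groupby` drops them by default. Filling nulls with a placeholder before grouping is one way around this. Another, sketched below on made-up data and assuming pandas 1.1 or later, is the `dropna=False` argument.

```python
import pandas as pd

cases = pd.DataFrame({
    "asset_tag": ["T1", "T1", None],
    "product": ["Tablet", "Tablet", "Tablet"],
    "created_date": pd.to_datetime(["2018-01-03", "2018-01-02", "2018-01-05"]),
})

# Default behaviour: the row whose asset_tag is null disappears from the result.
default_groups = cases.groupby(["asset_tag", "product"])["created_date"].min()

# dropna=False keeps the null key as its own group, as SQL would.
sql_like_groups = cases.groupby(["asset_tag", "product"], dropna=False)["created_date"].min()

print(len(default_groups), len(sql_like_groups))
```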

In [7]:
import re

def sort_cases(x):
    # The SQL uses ILIKE, so compare case-insensitively
    x = x.lower()
    if 'player' in x:
        return 'Waiting Room Screen'
    elif 'tablet' in x:
        return 'Tablet'
    elif 'wallboard' in x:
        return 'Wallboard'
    elif x.startswith('irt '):
        return 'Tablet'
    elif 'wifi' in x:
        return 'Waiting Room Wifi'
    else:
        return None
    
def parse_subject(x):
    if x:
        # Clean Up the string and only select numerical ids
        new_str = x.replace('(Scheduled) ','').replace('Player asset ','')[:5]
        if re.search('^[0-9]+$',new_str):
            return new_str
        else:
            return None
    else:
        return None

def clean_asset_tags(x):
    if x:
        return x.upper().strip()
    else:
        return None

# Create the "product", "subject_parse", and "asset_tag" fields
cases_data['product'] = cases_data['case_type'].apply(sort_cases)
cases_data['subject_parse'] = cases_data['subject'].apply(parse_subject)
cases_data['asset_tag'] = cases_data['tablets_impacted'].combine_first(cases_data['subject_parse']).apply(clean_asset_tags)
# .copy() avoids a SettingWithCopyWarning when the filtered frame is modified below
cases_data = cases_data[cases_data['product'].notnull()].copy()

# We have to fill nulls in the grouping keys because otherwise a group by will filter them out
cases_data[['asset_tag','product']] = cases_data[['asset_tag','product']].fillna('None')

# Group by and create the "first_seen_in_data_date" field
grouped_data = cases_data.groupby(['asset_tag','product'])['created_date'].min()
grouped_data = grouped_data.reset_index().rename(columns={'created_date': 'first_seen_in_data_date'})
grouped_data['first_seen_in_data_date'] = grouped_data['first_seen_in_data_date'].apply(lambda x: x.date())
# Turn the placeholder back into a real null; the dict form avoids the forward-fill that value=None triggers in older pandas
grouped_data = grouped_data.replace({'None': None})

In [8]:
# Sort the values
output_data.sort_values(by=list(output_data.columns),inplace=True)
grouped_data.sort_values(by=list(grouped_data.columns),inplace=True)

In [9]:
# Check that the two dfs are the same length
print(len(output_data) == len(grouped_data))

# Check that the two dfs have no duplicated rows
print(len(output_data) == len(output_data.drop_duplicates(keep=False)))
print(len(grouped_data) == len(grouped_data.drop_duplicates(keep=False)))

True
True
True


## Output Validation
To validate, I label each dataframe with its origin, concatenate the two, and drop every row that appears in both. If nothing is left, the two dataframes contain exactly the same rows.

In [10]:
grouped_data['name'] = 'grouped'
output_data['name'] = 'output'
concat_data = pd.concat([grouped_data,output_data])

In [11]:
concat_data.drop_duplicates(subset=['asset_tag','product','first_seen_in_data_date'],keep=False).sort_values(by='asset_tag')

| | asset_tag | first_seen_in_data_date | id | name | product |
|---|---|---|---|---|---|


### Every row had a duplicate, meaning our two dataframes were exactly the same!! We've achieved our objective!
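
When a comparison like this does not come back empty, it helps to know which side each stray row came from. A related technique is an outer merge with `indicator=True`, sketched here on made-up frames; `rows_only_in_one` is a hypothetical helper name.

```python
import pandas as pd


def rows_only_in_one(left: pd.DataFrame, right: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return rows present in exactly one frame, labelled with their origin."""
    merged = left[keys].merge(right[keys], on=keys, how="outer", indicator=True)
    # The _merge column is 'both', 'left_only', or 'right_only'.
    return merged[merged["_merge"] != "both"].reset_index(drop=True)


ours = pd.DataFrame({"asset_tag": ["T1", "T2"], "product": ["Tablet", "Tablet"]})
theirs = pd.DataFrame({"asset_tag": ["T1", "T3"], "product": ["Tablet", "Tablet"]})

diff = rows_only_in_one(ours, theirs, ["asset_tag", "product"])
print(diff)
```

An empty result means the two frames agree on the key columns, which is the same conclusion the concat-and-drop approach reaches.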

## Unit Test Implementation
We will ask you to implement at least one unit test to catch errors with your script.

In this case, I've implemented a unit test to check that there are no duplicates in our grouped data.

In [12]:
def dupes_unit_test(grouped_data):
    assert len(grouped_data) == len(grouped_data.drop_duplicates(keep=False)), 'We have dupes in our grouped data.'
dupes_unit_test(grouped_data)
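
Unit tests can also target the small helper functions, since they carry most of the translation logic. As a sketch, the block below tests a copy of `parse_subject` with the standard library's `unittest`. The sample subject strings are made up, and the test class name is hypothetical.

```python
import re
import unittest


def parse_subject(subject):
    """Strip known prefixes and keep the first five characters if they are all digits."""
    if not subject:
        return None
    candidate = subject.replace('(Scheduled) ', '').replace('Player asset ', '')[:5]
    return candidate if re.search('^[0-9]+$', candidate) else None


class ParseSubjectTest(unittest.TestCase):
    def test_prefixes_are_stripped(self):
        self.assertEqual(parse_subject('(Scheduled) Player asset 32302 MIA'), '32302')

    def test_non_numeric_subject_is_rejected(self):
        self.assertIsNone(parse_subject('Tablet MIA t103695'))

    def test_missing_subject_is_rejected(self):
        self.assertIsNone(parse_subject(None))


# Build and run the suite explicitly so this also works inside a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseSubjectTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```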