[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_risk_levels.ipynb)

# Drink Free Days - Risk levels
From AWS RDS using PyMySQL 

## Install PyDrive for loading files from Google Drive
https://pythonhosted.org/PyDrive/index.html

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see the PyDrive documentation for code snippets using these libraries
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

### Create and upload a text file with PyDrive (if needed)

```python
uploaded = drive.CreateFile({'title': 'sample_file.txt'})
uploaded.SetContentString('Sample file content.')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))
```

### Get the Google Drive IDs of your file (if needed)

You will need your file's Google Drive ID to load the content of the text file into a variable with PyDrive.  

To get the ID, right-click on the file in your Google Drive and select 'Get shareable link', which gives you a URL with 'id=' in it.
  
Alternatively, you can obtain the ID directly from the notebook by running this code:  

- Step 1:   
Get your file's ID.  To get a list of all the file and folder IDs in the root folder:
```python
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    # print the title and ID of each file or folder in the root folder
    print('title: %s, id: %s' % (file1['title'], file1['id']))
```
You can replace ```'root'``` with the ID of the folder your file is in. You can either get the folder ID from the web interface: ```drive.google.com/drive/u/0/folders/<folder ID>```  
or have PyDrive list the IDs for you:
```python
# Paginate file lists and specify number of max results if necessary
for file_list in drive.ListFile({'q': 'trashed=false', 'maxResults': 10}):
    print('Received %s files from Files.list()' % len(file_list)) # <= 10
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
```
- Step 2:   
Load the file.
```python
downloaded = drive.CreateFile({'id': '<file ID>'})
# you can print the content of the text file to check it
print('Downloaded content "{}"'.format(downloaded.GetContentString()))
```


### Load the file's content (i.e. the password) into a variable

In [0]:
# insert ID of the file with the password
# comment out the other user when running this cell

# Jan's file
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# Tacey's file
# passwd_file = drive.CreateFile({'id': ' '}) 

# load the password as a string into a variable
# you will use this variable in pymysql connection
# instead of the actual password string
user_passwd = passwd_file.GetContentString()

## Connect to AWS database via PyMySQL
### Return SQL queries as Pandas dataframes
Pandas ```read_sql``` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
def connect():  
    return pymysql.connect(
        
        host = "df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        
        # change user name and password as necessary
           
        user = "jan",
        # user = "tacey",
        # 'password' and 'database' replace the deprecated 'passwd' and 'db' aliases
        password = user_passwd, # string loaded into a variable with PyDrive above
   
        database = "daysoff",
        autocommit=True
        
        )

connection = connect()

def sql_to_df(sql):
    """
    Returns SQL queries as pandas dataframes
    """
    return pd.read_sql(sql, con = connection)

# Database tables
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',100)

### Drinks table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdrinks'
        """)

### Pledges table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_apppledges'
        """)

### Days off table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdaysoff'
        """)

### App users table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appusers'
        """)

# Report
SQL queries are passed as strings to the ```sql_to_df()``` function defined above, which runs them over the PyMySQL connection.  
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax reference.

### 1. Calculate units for a typical week for each user

In [0]:
# assumption: there's only ever one typical week in the database for any given user
# i.e. if I sum units per id (once I calculate them for each row) that'll be the typical week units
# confirm with app developers

drinks = sql_to_df("""
        select 
            id,
            percent,
            ml
        from
            g_appdrinks
        """)

drinks.head()

In [0]:
drinks['units'] = (drinks['percent'] * drinks['ml']) / 1000

drinks.head()
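
The units formula above can be checked by hand on a couple of drinks. This is a minimal sketch, assuming the `percent` column holds ABV as a percentage (e.g. 4.0, not 0.04) and `ml` holds the volume in millilitres; the helper name `alcohol_units` and the example drinks are hypothetical and not taken from the database.

```python
def alcohol_units(percent, ml):
    """UK alcohol units: ABV (%) x volume (ml) / 1000."""
    return percent * ml / 1000

# hypothetical drinks, not taken from the database
pint_of_beer = alcohol_units(4.0, 568)    # a 568 ml pint at 4% ABV
glass_of_wine = alcohol_units(12.0, 175)  # a 175 ml glass at 12% ABV

print(round(pint_of_beer, 2), round(glass_of_wine, 2))
```

If the values in `percent` turn out to be stored as fractions, the divisor would need to change accordingly.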

In [0]:
usersUnits = drinks.groupby('id', as_index=False)['units'].sum()

usersUnits.head()

### 2. Add gender and age to calculate risk levels "before"

In [0]:
demogr = sql_to_df("""
        select 
            id,
            gender,
            age
        from
            g_appusers
        """)

demogr.head()

In [0]:
riskBefore = pd.merge(usersUnits, demogr, how='inner', on='id')

riskBefore.head()

In [0]:
def riskCalculator(row):
    # bands are contiguous: below 15 units is 'lower' for everyone,
    # 'higher' starts at 50 units for men and at 35 units for women
    units, gender = row['units'], row['gender']
    if pd.isna(units):
        return None  # no units recorded, so no risk level can be assigned
    if units < 15:
        return 'lower'
    if gender == 'Male':
        return 'increasing' if units < 50 else 'higher'
    if gender == 'Female':
        return 'increasing' if units < 35 else 'higher'
    # gender missing or not 'Male'/'Female': no band applies
    # (without this fallback a stale value from a previous row would be returned)
    return None

In [0]:
riskBefore['risk before'] = riskBefore.apply(riskCalculator, axis=1)

riskBefore.head()
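
The edges of the risk bands are worth checking on a few hand-picked values. The sketch below assumes the bands are meant to be contiguous (below 15 units is lower risk, then increasing risk up to 35 units for women and 50 units for men, then higher risk) and that gender is stored as 'Male' or 'Female'. The helper `risk_level` is hypothetical and only illustrates the thresholds.

```python
def risk_level(units, gender):
    """Hypothetical helper illustrating the weekly-units risk bands."""
    if units < 15:
        return 'lower'
    # units at which 'higher' risk starts, per gender
    limit = {'Male': 50, 'Female': 35}.get(gender)
    if limit is None:
        return None  # gender missing or not recognised
    return 'increasing' if units < limit else 'higher'

# values chosen to sit on or near the band edges
for units, gender in [(14.95, 'Female'), (35, 'Female'), (35, 'Male'), (50, 'Male')]:
    print(units, gender, risk_level(units, gender))
```

Returning `None` for an unrecognised gender keeps those rows out of the grouped counts instead of assigning them to a band.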

### 3. Add units after to calculate risk "after"

- I will need a units per weekday table (group by weekday)
- go to the days off table and find which day of the week each day off falls on
- subtract the units of that weekday from the typical week
- recalculate weekly units as units after
- calculate risk after
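
The steps above can be sketched on a small made-up example. All data below is hypothetical, and the sketch uses a left merge from the typical week, which is only one possible way to do it and may differ from the merge used in the cells that follow.

```python
import pandas as pd

# hypothetical typical-week units per user and weekday
typical = pd.DataFrame({
    'id':    ['a', 'a', 'a', 'b'],
    'day':   ['Monday', 'Friday', 'Saturday', 'Friday'],
    'units': [2.0, 6.0, 8.0, 4.0],
})

# hypothetical drink free days recorded in the week of interest
days_off = pd.DataFrame({
    'id':  ['a', 'b'],
    'dfd': ['Friday', 'Monday'],
})

# flag the typical drinking days that were taken as drink free days
merged = typical.merge(days_off, how='left',
                       left_on=['id', 'day'], right_on=['id', 'dfd'])

# keep the units where no day off matched, otherwise set them to 0
merged['units after'] = merged['units'].where(merged['dfd'].isna(), 0)

# weekly totals before and after removing the drink free days
summary = merged.groupby('id', as_index=False)[['units', 'units after']].sum()
print(summary)
```

User `b` took a day off on a day with no typical drinking, so their weekly units stay the same.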

In [0]:
dfd = sql_to_df("""
        select 
            id, date, dayname(date) as 'dfd'
        from 
            g_appdaysoff
        where
            date between '2018-09-10' and '2018-09-16'
        """)

dfd.head()

In [0]:
drinksWD = sql_to_df("""
        select 
            id,
            percent,
            ml,
            day 
        from
            g_appdrinks
        """)

drinksWD.head()

In [0]:
drinksWD['units'] = (drinksWD['percent'] * drinksWD['ml']) / 1000

drinksWD.head()

In [0]:
usersUnitsWD = drinksWD.groupby(['id', 'day'], as_index=False)['units'].sum()

usersUnitsWD.head()

In [0]:
unitsAfter = pd.merge(dfd, usersUnitsWD, 
                      how='outer', 
                      left_on=['id', 'dfd'], 
                      right_on=['id', 'day'])

unitsAfter.head()

Where 'day' above is NaN, the user recorded a drink free day on that date but did not indicate that they typically drink on that day of the week.
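
One way to see how many rows fall into this case is the `indicator` option of `pd.merge`, which adds a `_merge` column saying whether each row came from the left table, the right table, or both. The data below is made up for illustration.

```python
import pandas as pd

# hypothetical drink free days and typical drinking days
left = pd.DataFrame({'id': ['a', 'b'], 'dfd': ['Friday', 'Monday']})
right = pd.DataFrame({'id': ['a', 'b'],
                      'day': ['Friday', 'Friday'],
                      'units': [6.0, 4.0]})

check = pd.merge(left, right, how='outer',
                 left_on=['id', 'dfd'], right_on=['id', 'day'],
                 indicator=True)

# drink free days on days the user does not typically drink
dfd_only = (check['_merge'] == 'left_only').sum()
# typical drinking days that were not taken as drink free days
drinking_only = (check['_merge'] == 'right_only').sum()
# typical drinking days that were taken as drink free days
matched = (check['_merge'] == 'both').sum()

print(check)
```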

In [0]:
def unitsAfterCal(row):
    # a typical drinking day taken as a drink free day contributes no units
    if row['dfd'] == row['day']:
        return 0
    return row['units']
    

In [0]:
unitsAfter['units after'] = unitsAfter.apply(unitsAfterCal, axis=1)

unitsAfter.head()

In [0]:
unitsAfter.fillna(0, inplace=True)

unitsAfter.head()

In [0]:
unitsAfter = unitsAfter.groupby('id', as_index=False)['units after'].sum()

In [0]:
unitsAfter.head()

In [0]:
riskAfter = pd.merge(riskBefore, unitsAfter, how='outer', on='id')

riskAfter.head()

In [0]:
def riskCalculatorAfter(row):
    # same bands as for the risk "before", applied to the units after
    units, gender = row['units after'], row['gender']
    if pd.isna(units):
        return None  # no units recorded, so no risk level can be assigned
    if units < 15:
        return 'lower'
    if gender == 'Male':
        return 'increasing' if units < 50 else 'higher'
    if gender == 'Female':
        return 'increasing' if units < 35 else 'higher'
    # gender missing or not 'Male'/'Female': no band applies
    # (without this fallback a stale value from a previous row would be returned)
    return None

In [0]:
riskAfter['risk after'] = riskAfter.apply(riskCalculatorAfter, axis=1)

riskAfter.head()

### Outcomes

In [0]:
df = riskAfter

#### Overall shifts

In [0]:
df.groupby('risk before').count()

In [0]:
df.groupby('risk after').count()
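
The two grouped counts show the totals per risk level but not how users moved between levels. A cross-tabulation is one way to see the shifts directly; the sketch below uses made-up risk levels for five users, not the real results.

```python
import pandas as pd

# hypothetical risk levels for five users
example = pd.DataFrame({
    'risk before': ['higher', 'increasing', 'increasing', 'lower', 'higher'],
    'risk after':  ['increasing', 'lower', 'increasing', 'lower', 'higher'],
})

# rows: risk before, columns: risk after, cells: number of users
shifts = pd.crosstab(example['risk before'], example['risk after'])
print(shifts)
```

On the real data the same call would be `pd.crosstab(df['risk before'], df['risk after'])`.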

# Ignore below - just some scribbles

In [0]:
dfd.loc[dfd['id'] == '00026447a772b9db']

In [0]:
usersUnitsWD.loc[usersUnitsWD['id'] == '00026447a772b9db']

In [0]:
usersUnits.loc[usersUnits['id'] == '00026447a772b9db']

### NOTES

- more drink free days are recorded than days indicated as drinking days in the typical week

### Temporary notes

- the same id appears on more than one row in my dfd and usersUnitsWD dataframes, which is why I can't join them like above
