[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_risk_levels.ipynb)

# Drink Free Days - Risk levels
From AWS RDS using PyMySQL 

## Install PyDrive for loading files from Google Drive
https://pythonhosted.org/PyDrive/index.html

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see the PyDrive documentation for these library code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

### Load the file's content (i.e. the password) into a variable

In [0]:
# insert ID of the file with the password
# comment out the other user when running this cell

# Jan's file
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# Tacey's file
# passwd_file = drive.CreateFile({'id': ' '}) 

# load the password as a string into a variable
# you will use this variable in pymysql connection
# instead of the actual password string
user_passwd = passwd_file.GetContentString()

## Connect to AWS database via PyMySQL
### Return SQL queries as Pandas dataframes
Pandas ```read_sql``` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
def connect():
    return pymysql.connect(
        host="df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        # change user name and password as necessary
        user="jan",
        # user="tacey",
        passwd=user_passwd,  # string loaded into a variable with PyDrive above
        db="daysoff",
        autocommit=True,
    )

connection = connect()

def sql_to_df(sql):
    """
    Returns SQL queries as pandas dataframes
    """
    return pd.read_sql(sql, con = connection)

# Database tables
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',100)
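
The column-metadata queries in the subsections that follow differ only in the table name, so they could be generated by one helper. The sketch below is one possible way to do that; the name `column_info_query` is hypothetical, and it assumes the `%s` placeholder style that PyMySQL uses for query parameters.

```python
def column_info_query(table_name):
    """Return (sql, params) for the column-metadata query of one table."""
    sql = """
        select
            table_name, column_name, data_type, column_comment
        from
            information_schema.columns
        where
            table_name = %s
        """
    # the table name travels separately from the SQL text
    return sql, (table_name,)


sql, params = column_info_query('g_appdrinks')

# with the live connection from above, this would be run as:
# pd.read_sql(sql, con=connection, params=params)
```

Passing the table name through `params` keeps the query text identical for all four tables, so only the argument changes between cells.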

### Drinks table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdrinks'
        """)

### Pledges table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_apppledges'
        """)

### Days off table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdaysoff'
        """)

### App users table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appusers'
        """)

# Report
SQL queries are passed as strings to the ```sql_to_df()``` function defined above, which runs them over the PyMySQL connection.  
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax reference.

### Behaviour before and after app usage

#### Selecting from drinks table

In [0]:
# I'm adding 'day' to the selection so that I can reuse this dataframe later
# 'day' is needed for calculating units and risks after dfd recorded

drinks = sql_to_df("""
        select 
            id,
            percent,
            ml,
            day,
            ((percent * ml)/ 1000) as 'units before'
        from
            g_appdrinks
        """)

drinks.head()
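
As a quick sanity check of the units formula used in the query above (`percent * ml / 1000`), the sketch below applies it to two drinks. The function name `alcohol_units` and the drink sizes are illustrative assumptions, not values taken from the database.

```python
def alcohol_units(percent, ml):
    """Units of alcohol: strength (%) times volume (ml), divided by 1000."""
    return (percent * ml) / 1000


# hypothetical drinks, for illustration only
pint_of_beer = alcohol_units(percent=4.0, ml=568)    # a 4% pint
glass_of_wine = alcohol_units(percent=12.0, ml=175)  # a 12% medium glass

print(round(pint_of_beer, 3), round(glass_of_wine, 3))
```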

#### Summing units for each user on a typical week

In [0]:
# assumption (confirmed by Jimmy):
# there's only ever one typical week in the database for any given user
# so this groupby will return sum of units per week for each user

weeklyUnits = drinks.groupby('id', as_index=False)['units before'].sum()

weeklyUnits.head()

#### Selecting demographic info 

In [0]:
# adding age as well, not necessary but may be interesting

demographic = sql_to_df("""
        select 
            id,
            gender,
            age
        from
            g_appusers
        """)

demographic.head()

#### Joining weekly units and demographic info

In [0]:
# I can merge them on id as both dataframes have only unique ids in the id column

usersBefore = pd.merge(weeklyUnits, demographic, how='left', on='id')

usersBefore.head()
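
The merge above relies on ids being unique in both dataframes. One way to have pandas check that assumption is the `validate` argument of `pd.merge`. The sketch below uses made-up toy frames (not the real tables) in which the demographic side deliberately contains a duplicate id.

```python
import pandas as pd

# toy data, for illustration only
weekly_toy = pd.DataFrame({'id': [1, 2], 'units before': [10.0, 40.0]})
demo_toy = pd.DataFrame({'id': [1, 2, 2],
                         'gender': ['Male', 'Female', 'Female']})

try:
    # 'one_to_one' asks pandas to verify that ids are unique on both sides
    pd.merge(weekly_toy, demo_toy, how='left', on='id', validate='one_to_one')
    duplicates_found = False
except pd.errors.MergeError:
    duplicates_found = True

print(duplicates_found)
```

With unique ids on both sides the merge would simply go through, so adding `validate` does not change the result of a correct join.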

#### Calculating risk before

In [0]:
def riskBeforeCalc(row):
    """
    Evaluates 'units before' in every row against risk level 
    conditions as defined in the app calculations
    and returns corresponding risk level value as a string.
    """
    units = row['units before']
    gender = row['gender']

    # local variable instead of a global, so a row that matches no band
    # cannot silently reuse the value of the previous row
    value = None

    # bands are contiguous, so values such as 14.995 cannot fall between them
    if units < 15:
        value = 'lower'
    elif gender == 'Male' and units >= 15:
        value = 'increasing' if units < 50 else 'higher'
    elif gender == 'Female' and units >= 15:
        value = 'increasing' if units < 35 else 'higher'

    # anything else (e.g. missing gender or units) stays None
    return value

In [0]:
usersBefore['risk before'] = usersBefore.apply(riskBeforeCalc, axis=1)

usersBefore.head()
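
To check how the risk bands behave at their boundaries, the thresholds can be written as a small lookup table. This is a sketch only: the names `HIGHER_FROM` and `risk_level` are hypothetical, and it assumes the bands are meant to be contiguous (below 15 units, 15 up to the gender-specific cut-off, and from the cut-off upwards).

```python
# weekly units at which 'higher' risk starts, per gender
# (50 and 35 are the cut-offs used in the risk function above)
HIGHER_FROM = {'Male': 50, 'Female': 35}


def risk_level(units, gender):
    """Return 'lower', 'increasing' or 'higher'; None if gender is unknown."""
    if units < 15:
        return 'lower'
    cutoff = HIGHER_FROM.get(gender)
    if cutoff is None:
        return None
    return 'increasing' if units < cutoff else 'higher'


# boundary cases, made up for illustration
cases = [(14.995, 'Male'), (15, 'Female'), (34.99, 'Female'),
         (35, 'Female'), (49.99, 'Male'), (50, 'Male')]

for units, gender in cases:
    print(units, gender, risk_level(units, gender))
```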

#### Selecting from days off table

In [0]:
# only doing this for the first week of the campaign for now
# by looking at user's drink free days recorded in that week

dfd = sql_to_df("""
        select 
            id, date, dayname(date) as 'dfd'
        from 
            g_appdaysoff
        where
            date between '2018-09-10' and '2018-09-16'
        """)

dfd.head()

#### Grouping drinks units by weekday for each user

In [0]:
# I'm reusing the same dataframe I used for calculating weekly units
# adding 'day' to the groupby statement

dailyUnits = drinks.groupby(['id', 'day'], as_index=False)['units before'].sum()

dailyUnits.head()

#### Joining dfd with typical weekday units

In [0]:
# the merge has to use more than the id column
# because ids are not unique in either dataframe
# (the same user appears on different days);
# the right join keeps every typical-week day and drops any dfd
# achieved on a day with no drinks in a typical week

usersAfter = pd.merge(dfd, dailyUnits, 
                      how='right', 
                      left_on=['id', 'dfd'], 
                      right_on=['id', 'day'])

usersAfter.head()
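
The effect of the right join on two key columns is easier to see on a tiny example. The frames below are made-up toy data, not rows from the database: user 1 logs a drink free Monday (a drinking day in their typical week) and a drink free Sunday (not a drinking day).

```python
import pandas as pd

# toy data, for illustration only
dfd_toy = pd.DataFrame({'id': [1, 1, 2],
                        'dfd': ['Monday', 'Sunday', 'Friday']})
daily_toy = pd.DataFrame({'id': [1, 1, 2],
                          'day': ['Monday', 'Friday', 'Friday'],
                          'units before': [4.0, 6.0, 3.0]})

merged_toy = pd.merge(dfd_toy, daily_toy,
                      how='right',
                      left_on=['id', 'dfd'],
                      right_on=['id', 'day'])

print(merged_toy)
```

Every typical-week day survives the join. User 1's Friday has no matching dfd, so its `dfd` value is missing, and the Sunday dfd disappears because there is no typical-week row for it.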

#### Calculating 'units after'

In [0]:
def dfdDayMatch(row):
    """
    Evaluates if drink free day recorded has a corresponding
    day of drinks in a typical week.
    If it does, the units are replaced with zero;
    if it doesn't, 'units before' is the returned value.
    """
    # local variable, no global needed
    if row['dfd'] == row['day']:
        value = 0
    else:
        value = row['units before']

    return value

In [0]:
# the idea is to remove units from a day which was dfd

usersAfter['units after'] = usersAfter.apply(dfdDayMatch, axis=1)

usersAfter.head()
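
The same rule can also be expressed without a row-wise `apply`, using `Series.mask` to zero the units wherever the dfd matches the typical-week day. The sketch below runs on made-up toy data and is an alternative formulation, not the method used in this notebook.

```python
import pandas as pd

# toy data, for illustration only; None stands for "no dfd logged"
after_toy = pd.DataFrame({'dfd': ['Monday', None, 'Friday'],
                          'day': ['Monday', 'Friday', 'Friday'],
                          'units before': [4.0, 6.0, 3.0]})

# replace units with 0 where the dfd falls on that typical-week day
is_dfd = after_toy['dfd'] == after_toy['day']
after_toy['units after'] = after_toy['units before'].mask(is_dfd, 0)

print(after_toy)
```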

#### Summing 'units after' by week for each user

In [0]:
# grouping it again by id will return the sum of units for a week 

unitsAfter = usersAfter.groupby('id', as_index=False)['units after'].sum()

unitsAfter.head()

#### Calculating 'risk after'

In [0]:
result = pd.merge(usersBefore, unitsAfter, how='inner', on='id')

result.head()

In [0]:
def riskAfterCalc(row):
    """
    Evaluates 'units after' in every row against risk level 
    conditions as defined in the app calculations
    and returns corresponding risk level value as a string.
    """
    units = row['units after']
    gender = row['gender']

    # local variable instead of a global, so a row that matches no band
    # cannot silently reuse the value of the previous row
    value = None

    # same contiguous bands as for 'risk before'
    if units < 15:
        value = 'lower'
    elif gender == 'Male' and units >= 15:
        value = 'increasing' if units < 50 else 'higher'
    elif gender == 'Female' and units >= 15:
        value = 'increasing' if units < 35 else 'higher'

    # anything else (e.g. missing gender or units) stays None
    return value

In [0]:
result['risk after'] = result.apply(riskAfterCalc, axis=1)

result.head()

### Demographic group 39-60

In [0]:
demo3960 = result.loc[(result['age'] > 38) & (result['age'] < 61)]

demo3960.info()

In [0]:
demo3960_improved = demo3960.loc[demo3960['units before'] > demo3960['units after']]

demo3960_improved.info()

In [0]:
3705/221.6

# Outcomes

#### Overall shifts

In [0]:
result.groupby('risk before')['id'].count()

In [0]:
result.groupby('risk after')['id'].count()
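
The two counts above show how many users sit in each risk level before and after, but not who moved where. A cross-tabulation gives every transition in one table. The sketch below uses made-up toy rows with the same column names as `result`.

```python
import pandas as pd

# toy data, for illustration only
shift_toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'risk before': ['higher', 'higher', 'increasing', 'lower'],
    'risk after': ['increasing', 'higher', 'lower', 'lower'],
})

# rows: risk level before, columns: risk level after
shifts = pd.crosstab(shift_toy['risk before'], shift_toy['risk after'])

print(shifts)
```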

#### Lower risk: units reduction

In [0]:
result[['units before', 'units after']].loc[result['risk before'] == 'lower'].mean()

#### Increasing risk: units reduction

In [0]:
result[['units before', 'units after']].loc[result['risk before'] == 'increasing'].mean()

#### Higher risk: units reduction

In [0]:
result[['units before', 'units after']].loc[result['risk before'] == 'higher'].mean()

#### From higher to increasing or lower

In [0]:
result['id'].loc[((result['risk before'] == 'higher') & 
                  ((result['risk after'] == 'increasing') | 
                   (result['risk after'] == 'lower')))].count()

In [0]:
338/41.77

#### From increasing to lower

In [0]:
result['id'].loc[((result['risk before'] == 'increasing') & 
                  (result['risk after'] == 'lower'))].count()

In [0]:
1364/131.31

# dfd analysis

### People logging dfd on days they do not drink anyway in their typical week

In [0]:
# check if people are logging dfd when they don't drink on a typical week:
# yes, they are - NaN values in 'day' indicate that they are
# so counting how many dfd have been logged is not a useful metric 
# for before and after behaviour/risk change indicator

usersAfterLEFT_JOIN = pd.merge(dfd, dailyUnits, 
                      how='left', 
                      left_on=['id', 'dfd'], 
                      right_on=['id', 'day'])

usersAfterLEFT_JOIN.head()
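
Rather than reading the NaN values by eye, the number of dfd logged on non-drinking days can be counted with the `indicator` argument of `pd.merge`. The sketch below uses made-up toy frames, not the real `dfd` and `dailyUnits` dataframes.

```python
import pandas as pd

# toy data, for illustration only
dfd_toy = pd.DataFrame({'id': [1, 1, 2],
                        'dfd': ['Monday', 'Sunday', 'Friday']})
daily_toy = pd.DataFrame({'id': [1, 1, 2],
                          'day': ['Monday', 'Friday', 'Friday'],
                          'units before': [4.0, 6.0, 3.0]})

check_toy = pd.merge(dfd_toy, daily_toy,
                     how='left',
                     left_on=['id', 'dfd'],
                     right_on=['id', 'day'],
                     indicator=True)

# 'left_only' marks a dfd with no drinks on that day in a typical week
dfd_without_drinks = int((check_toy['_merge'] == 'left_only').sum())

print(dfd_without_drinks)
```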

### Sum of units drunk, and not drunk, on dfd

In [0]:
# sum of units NOT drunk as a result of dfd
# (grouping usersAfter, which carries the 'dfd' column; result is per user and has none)

usersAfter.groupby('dfd', as_index=False)['units before'].sum()

In [0]:
usersAfter.groupby('dfd')['units before'].sum().plot(kind='bar', title='sum of units NOT drunk as a result of dfd');

In [0]:
# sum of units (put in by all users in their typical week) 
# split by day to see which days are most boozy 
# (grouping usersAfter, which carries the 'day' column; result is per user and has none)

usersAfter.groupby('day')['units before'].sum()

In [0]:
usersAfter.groupby('day')['units before'].sum().plot(kind='bar', 
                                                     title='sum of units (put in by all users in their typical week)');

In [0]:
# days (from a typical week) which haven't been "eliminated" by a dfd
# i.e. the units still consumed on each day once dfd days are set to zero,
# which is fewer than in a typical week
# (grouping usersAfter, which carries the 'day' column; result is per user and has none)

usersAfter.groupby('day', as_index=False)['units after'].sum()

### When are most dfd completed?
Within the one-week period analysed here (2018-09-10 to 2018-09-16).

In [0]:
dfd.groupby('dfd', as_index=False)['id'].count()

In [0]:
dfd.groupby('dfd')['id'].count().plot(kind='bar', 
                                      title='When are most dfd completed?');
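
A groupby on day names sorts the result alphabetically, so a bar chart starts with Friday rather than Monday. One possible fix is to reindex the counts in calendar order before plotting. The sketch below uses made-up toy data, and the constant `WEEKDAYS` is a name introduced here for illustration.

```python
import pandas as pd

WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

# toy data, for illustration only
week_toy = pd.DataFrame({'id': [1, 2, 3, 4],
                         'dfd': ['Wednesday', 'Monday', 'Monday', 'Sunday']})

# count dfd per day, then put the days in calendar order;
# days with no dfd at all get a count of 0
counts = (week_toy.groupby('dfd')['id']
                  .count()
                  .reindex(WEEKDAYS, fill_value=0))

print(counts)
```

Calling `.plot(kind='bar')` on `counts` would then draw the bars from Monday to Sunday.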

### dfd in a typical week (i.e. dfd pre app usage)