[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_risk_levels.ipynb)

# Drink Free Days app 
# RISK LEVELS ANALYSIS
Looking at the reduction in alcohol unit consumption and the subsequent reduction in risk levels among Drink Free Days app users.

## Data set
Relational database behind the app - RDS instance on AWS.

## Time period
Comparing users' typical weeks (which they fill in as part of the app's onboarding journey) with the first week of the campaign 10-16 September 2018.

## Methodology
Lorem ipsum

## Assumptions
Lorem ipsum

# Credentials to run the notebook

## Google Drive authentication
Install and authenticate [PyDrive](https://pythonhosted.org/PyDrive/index.html) for loading files from Google Drive so that database passwords aren't hardcoded into the notebook.

If login credentials are hardcoded into the database connection (code cell below), this step is not necessary.

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see PyDrive documentation for libraries code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

## Database connection
- Connecting to AWS RDS database with [PyMySQL](https://pymysql.readthedocs.io/en/latest/user/examples.html).
- Returning MySQL queries as Pandas dataframes with the [```read_sql()```](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) function.

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
# comment out the other user/options when running this cell as necessary

# Jan's file
# 'id' is Google Drive file ID
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# Tacey's file
# passwd_file = drive.CreateFile({'id': ' '}) 

# this variable is used in the connect function below
user_passwd = passwd_file.GetContentString()

# If you're not using Google Drive file but hardcoding the password
# user_passwd = password as a string

In [0]:
def connect():

    return pymysql.connect(
        
        host = "df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        
        # change user name and password as necessary
           
        user = "jan",
        # user = "tacey",
        passwd = user_passwd, # assigned in the cell above
   
        db = "daysoff",
        
        autocommit=True

        )

connection = connect()

def sql_to_df(sql):
    """
    Returns MySQL queries as Pandas dataframes.
    """
    return pd.read_sql(sql, con = connection)
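
To see what `sql_to_df()` hands back without access to the RDS instance, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for MySQL. The names `demo_connection`, `demo_sql_to_df` and `demo_drinks`, and the rows in the table, are made up for illustration; only the `pd.read_sql()` call is the same as above.

```python
import sqlite3
import pandas as pd

# in-memory SQLite database standing in for the MySQL instance (assumption)
demo_connection = sqlite3.connect(":memory:")
demo_connection.execute(
    "create table demo_drinks (id integer, percent real, ml real)"
)
demo_connection.executemany(
    "insert into demo_drinks values (?, ?, ?)",
    [(1, 4.0, 568.0), (1, 13.0, 175.0), (2, 5.0, 330.0)],
)

def demo_sql_to_df(sql):
    """Same idea as sql_to_df(), but bound to the demo connection."""
    return pd.read_sql(sql, con=demo_connection)

demo_df = demo_sql_to_df("select id, percent, ml from demo_drinks")
print(demo_df)
```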

# Database tables
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for MySQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',100)

### Drinks table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdrinks'
        """)

### Pledges table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_apppledges'
        """)

### Days off table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdaysoff'
        """)

### App users table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appusers'
        """)

# Report generation
- Write MySQL queries as long strings inside ```sql_to_df()``` function.  See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for MySQL syntax reference.
- ```sql_to_df()``` returns Pandas dataframes.

### Behaviour before and after app usage
- 'before' means during the typical week as specified by the user in the app onboarding.
- 'after' means in the first week after the campaign launch: 10-16 Sep 2018

#### Selecting from drinks table

In [0]:
# I'm adding 'day' to the selection so that I can reuse this dataframe later
# 'day' is needed for calculating units and risks after dfd recorded
# Alcohol units calculation: (percent * ml)/ 1000

drinks = sql_to_df("""
        select 
            id,
            percent,
            ml,
            day,
            ((percent * ml)/ 1000) as 'units before'
        from
            g_appdrinks
        """)

drinks.head()
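
As a quick sanity check of the units formula used in the query, the same arithmetic can be run in plain Python. The helper `alcohol_units` and the two example drinks are hypothetical and not taken from the database.

```python
def alcohol_units(percent, ml):
    """Units of alcohol in a drink: (ABV percent * volume in ml) / 1000."""
    return (percent * ml) / 1000

# hypothetical drinks, chosen only to illustrate the arithmetic
pint_of_beer = alcohol_units(4, 568)     # 568 ml at 4% ABV
glass_of_wine = alcohol_units(13, 175)   # 175 ml at 13% ABV

print(pint_of_beer, glass_of_wine)
```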

#### Summing units for each user on a typical week

In [0]:
# assumption (confirmed by Jimmy):
# there's only ever one typical week in the database for any given user
# so this groupby will return sum of units per week for each user

weeklyUnits = drinks.groupby('id', as_index=False)['units before'].sum()

weeklyUnits.head()
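
A small sketch of the same `groupby` on made-up data shows how per-drink rows collapse into one weekly total per user. The `demo_drinks` frame is illustrative only.

```python
import pandas as pd

# made-up typical-week drinks: user 1 drinks on two days, user 2 on one
demo_drinks = pd.DataFrame({
    'id': [1, 1, 2],
    'day': ['Friday', 'Saturday', 'Friday'],
    'units before': [2.0, 3.5, 1.0],
})

# one row per user, with that user's units summed over the typical week
demo_weekly = demo_drinks.groupby('id', as_index=False)['units before'].sum()
print(demo_weekly)
```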

#### Selecting demographic info 

In [0]:
# adding age as well, not necessary but may be interesting in further analysis
# gender is necessary for risk level calculation hence exclusion of empty values

demographic = sql_to_df("""
        select 
            id,
            gender,
            age
        from
            g_appusers
        where
            gender like '%ale' -- to exclude empty values
        """)

demographic.head()

#### Joining weekly units and demographic info

In [0]:
# inner join to make sure gender value is included in the dataframe
# as gender is necessary for risk level calculation

usersBefore = pd.merge(weeklyUnits, demographic, how='inner', on='id')

usersBefore.head()
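
The effect of the inner join can be checked on toy data: users without a row in the demographic frame drop out of the result. The frames and ids below are invented for illustration.

```python
import pandas as pd

demo_weekly = pd.DataFrame({'id': [1, 2, 3],
                            'units before': [5.5, 20.0, 40.0]})
# user 2 has no usable gender value, so is missing from this frame
demo_demographic = pd.DataFrame({'id': [1, 3],
                                 'gender': ['Female', 'Male'],
                                 'age': [45, 52]})

demo_users = pd.merge(demo_weekly, demo_demographic, how='inner', on='id')

# ids present before the join but absent after it
dropped = set(demo_weekly['id']) - set(demo_users['id'])
print(demo_users)
print(dropped)
```

Comparing the id sets before and after the join is a cheap way to see how many users are excluded for lack of gender information.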

#### Calculating risk before

In [0]:
def riskCalculator(row, unitsColName):
    """
    Calculates risk level from alcohol units and gender 
    based on conditions for the three risk categories.
    
    Note:
        Conditions for risk levels are specified in the app documentation.
    
    Args:
        row: Used in lambda function when riskCalculator is applied to dataframe rows.
        unitsColName (str): Name of column with alcohol units values used in calculation.
        
    Returns:
        Risk level value as a string. 
        
    Raises:
        ValueError: If some row doesn't satisfy any of the if-elif conditions.
    """
    
    if row[unitsColName] < 15: 
        
        return 'lower'
        
    # 'and' rather than '&': these are scalar comparisons on a single row
    elif (row['gender'] == 'Male' and
          15 <= row[unitsColName] < 50):

        return 'increasing'

    elif (row['gender'] == 'Female' and
          15 <= row[unitsColName] < 35):

        return 'increasing'

    elif (row['gender'] == 'Male' and
          row[unitsColName] >= 50):

        return 'higher'

    elif (row['gender'] == 'Female' and
          row[unitsColName] >= 35):

        return 'higher'
        
    else:
        raise ValueError(
            'Some row doesn\'t meet any of the if-elif conditions in riskCalculator.'
        )
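
For spot-checking the thresholds without building a dataframe, the same rules can be written as a function of two scalars. This is a sketch only: `risk_level` and `INCREASING_UPPER` are hypothetical names, and the thresholds are copied from `riskCalculator` above rather than from the app documentation.

```python
# exclusive upper bound of the 'increasing' band, per gender
INCREASING_UPPER = {'Male': 50, 'Female': 35}

def risk_level(units, gender):
    """Hypothetical scalar version of riskCalculator for quick checks."""
    if units < 15:
        return 'lower'
    upper = INCREASING_UPPER.get(gender)
    if upper is not None and 15 <= units < upper:
        return 'increasing'
    if upper is not None and units >= upper:
        return 'higher'
    # reached for unknown gender values or NaN units
    raise ValueError('No risk level for units={!r}, gender={!r}'.format(units, gender))

# boundary values on both sides of each threshold
for units, gender in [(14.9, 'Male'), (15, 'Female'), (34.9, 'Female'),
                      (35, 'Female'), (49.9, 'Male'), (50, 'Male')]:
    print(units, gender, risk_level(units, gender))
```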

In [0]:
# applying riskCalculator to every row

usersBefore['risk before'] = usersBefore.apply(lambda row: 
                                               riskCalculator(row, 'units before'), 
                                               axis=1)

usersBefore.head()

#### Selecting from days off table

In [0]:
# only doing this for the first week of the campaign for now
# by looking at user's drink free days recorded in that week

dfd = sql_to_df("""
        select 
            id, date, dayname(date) as 'dfd'
        from 
            g_appdaysoff
        where
            date between '2018-09-10' and '2018-09-16'
        """)

dfd.head()

#### Grouping drinks units by weekday for each user

In [0]:
# I'm reusing the same dataframe I used for calculating weekly units
# adding 'day' to the groupby statement

dailyUnits = drinks.groupby(['id', 'day'], as_index=False)['units before'].sum()

dailyUnits.head()

#### Joining dfd with typical weekday units

In [0]:
# merge on both id and weekday, because id alone is not unique
# in either dataframe (the same user appears on several days)
# right join keeps every typical-week drinking day, matched or not;
# dfd logged on days with no typical-week drinks are dropped

usersAfter = pd.merge(dfd, dailyUnits,
                      how='right',
                      left_on=['id', 'dfd'], 
                      right_on=['id', 'day'])

usersAfter.head()
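
What the right join keeps and drops is easier to see on a toy example. In the made-up frames below, user 1 logs a dfd on Friday (a typical drinking day) and on Monday (not a typical drinking day), while user 2 logs nothing.

```python
import pandas as pd

demo_dfd = pd.DataFrame({'id': [1, 1], 'dfd': ['Friday', 'Monday']})
demo_daily = pd.DataFrame({
    'id': [1, 1, 2],
    'day': ['Friday', 'Saturday', 'Friday'],
    'units before': [2.0, 3.5, 1.0],
})

demo_after = pd.merge(demo_dfd, demo_daily,
                      how='right',
                      left_on=['id', 'dfd'],
                      right_on=['id', 'day'])
print(demo_after)
```

Every typical-week drinking day survives the join, with `dfd` left as NaN where no drink free day matched; the Monday dfd has no counterpart in `demo_daily` and disappears.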

#### Calculating 'units after'

In [0]:
def dfdDayMatch(row):
    """
    Checks whether a recorded drink free day matches
    a drinking day in the user's typical week.
    If it does, returns 0 (the units for that day are removed);
    otherwise returns the 'units before' value unchanged.
    """
    if row['dfd'] == row['day']:
        return 0    
    else:
        return row['units before']

In [0]:
# the idea is to remove units from a day which was a dfd

usersAfter['units after'] = usersAfter.apply(dfdDayMatch, axis=1)

usersAfter.head()
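
A vectorised alternative to the row-wise `apply` is sketched below on made-up data. It relies on an assumption that holds after the right join above: `dfd` is either NaN or equal to `day`, so a non-missing `dfd` means the day was matched. The `demo_after` frame is illustrative.

```python
import numpy as np
import pandas as pd

demo_after = pd.DataFrame({
    'id': [1, 1, 2],
    'dfd': ['Friday', np.nan, np.nan],
    'day': ['Friday', 'Saturday', 'Friday'],
    'units before': [2.0, 3.5, 1.0],
})

# keep 'units before' where no dfd matched, otherwise set the day to 0 units
demo_after['units after'] = demo_after['units before'].where(
    demo_after['dfd'].isna(), 0)

# weekly total per user once matched days are zeroed
demo_weekly_after = demo_after.groupby('id', as_index=False)['units after'].sum()
print(demo_weekly_after)
```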

#### Summing 'units after' by week for each user

In [0]:
# grouping it again by id will return the sum of units for a week 

unitsAfter = usersAfter.groupby('id', as_index=False)['units after'].sum()

unitsAfter.head()

#### Calculating 'risk after'

In [0]:
result = pd.merge(usersBefore, unitsAfter, how='inner', on='id')

result.head()

In [0]:
# applying riskCalculator function (defined above) to every row

result['risk after'] = result.apply(lambda row:
                                    riskCalculator(row, 'units after'), 
                                    axis=1)

result.head()

### Demographic group 39-60
TO DO: only include users who had at least 1 dfd and see how these figures compare to overall app users or the rest of the demographic


In [0]:
demo3960 = result.loc[(result['age'] > 38) & (result['age'] < 61)]

demo3960.info()

In [0]:
demo3960_improved = demo3960.loc[demo3960['units before'] > demo3960['units after']]

demo3960_improved.info()

In [0]:
3731/236.28

# Outcomes

#### Overall shifts

In [0]:
result.groupby('risk before')['id'].count()

In [0]:
result.groupby('risk after')['id'].count()
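
The two counts above show the totals per risk level but not who moved where. One possible way to see the movement in a single table is `pd.crosstab`; the sketch below uses an invented `demo_result` frame rather than the real `result`.

```python
import pandas as pd

demo_result = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'risk before': ['higher', 'higher', 'increasing', 'lower'],
    'risk after': ['increasing', 'higher', 'lower', 'lower'],
})

# rows: risk level before, columns: risk level after, cells: number of users
shifts = pd.crosstab(demo_result['risk before'], demo_result['risk after'])
print(shifts)
```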

#### Lower risk: units reduction

In [0]:
result[['units before', 'units after']].loc[result['risk before'] == 'lower'].mean()

#### Increasing risk: units reduction

In [0]:
result[['units before', 'units after']].loc[result['risk before'] == 'increasing'].mean()

#### Higher risk: units reduction

In [0]:
result[['units before', 'units after']].loc[result['risk before'] == 'higher'].mean()

#### From higher to increasing or lower

In [0]:
result['id'].loc[((result['risk before'] == 'higher') & 
                  ((result['risk after'] == 'increasing') | 
                   (result['risk after'] == 'lower')))].count()

In [0]:
328/41.99

#### From increasing to lower

In [0]:
result['id'].loc[((result['risk before'] == 'increasing') & 
                  (result['risk after'] == 'lower'))].count()

In [0]:
1364/131.31

# dfd analysis

### People logging dfd on days they do not drink anyway in their typical week

In [0]:
# check if people are logging dfd when they don't drink on a typical week:
# yes, they are - NaN values in 'day' indicate that they are
# so counting how many dfd have been logged is not a useful metric 
# for before and after behaviour/risk change indicator

usersAfterLEFT_JOIN = pd.merge(dfd, dailyUnits, 
                      how='left', 
                      left_on=['id', 'dfd'], 
                      right_on=['id', 'day'])

usersAfterLEFT_JOIN.head()
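
To put a number on this rather than reading NaN values off `head()`, the missing `day` values can be counted. A minimal sketch on made-up frames, where two of the three logged dfd fall on days without typical-week drinks:

```python
import pandas as pd

demo_dfd = pd.DataFrame({'id': [1, 1, 3],
                         'dfd': ['Friday', 'Monday', 'Sunday']})
demo_daily = pd.DataFrame({'id': [1, 1],
                           'day': ['Friday', 'Saturday'],
                           'units before': [2.0, 3.5]})

demo_left = pd.merge(demo_dfd, demo_daily,
                     how='left',
                     left_on=['id', 'dfd'],
                     right_on=['id', 'day'])

# NaN in 'day' means the dfd was logged on a day with no typical-week drinks
no_typical_drinks = demo_left['day'].isna()
print(int(no_typical_drinks.sum()), 'of', len(demo_left), 'dfd')
```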

### Sum of units drunk and not drunk on dfd

In [0]:
# sum of units NOT drunk as a result of dfd

usersAfter.groupby('dfd', as_index=False)['units before'].sum()

In [0]:
usersAfter.groupby('dfd')['units before'].sum().plot(kind='bar', title='sum of units NOT drunk as a result of dfd');

In [0]:
# sum of units (put in by all users in their typical week) 
# split by day to see which days are most boozy 

usersAfter.groupby('day', as_index=False)['units before'].sum()

In [0]:
usersAfter.groupby('day')['units before'].sum().plot(kind='bar', 
                                                 title='sum of units (put in by all users in their typical week)');

In [0]:
# units left on each typical-week day after zeroing the days "eliminated" by a dfd
# i.e. an estimate of units still consumed per weekday,
# which is lower than in a typical week

usersAfter.groupby('day', as_index=False)['units after'].sum()

### When are most dfd completed?
Within the one-week period under analysis (10-16 September 2018).

In [0]:
dfd.groupby('dfd', as_index=False)['id'].count()

In [0]:
# make a nicer chart for Jimmy/planning

dfd.groupby('dfd')['id'].count().plot(kind='bar', 
                                      title='When are most dfd completed?');

### dfd in a typical week (i.e. dfd pre app usage)

In [0]:
drinkDays = sql_to_df("""
        select 
            id,
            day, 
            amount
        from
            g_appdrinks
        """)

drinkDays.head()

In [0]:
pivoted = pd.pivot_table(drinkDays, values='amount', columns='day', index='id').reset_index()

pivoted.fillna(0, inplace=True)

pivoted.head()

In [0]:
pivoted['typical dfd'] = (pivoted[['Monday',
                                   'Tuesday',
                                   'Wednesday', 
                                   'Thursday', 
                                   'Friday', 
                                   'Saturday', 
                                   'Sunday']] == 0).sum(axis=1)

pivoted.head()
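
The pivot-and-count step can be tried on toy data. The sketch below adds one guard that is not in the cells above: `reindex(columns=weekdays)` makes sure all seven weekday columns exist even if some weekday never appears in the data, which would otherwise raise a `KeyError` when selecting the columns. The `demo_drink_days` frame and the `weekdays` list are illustrative.

```python
import pandas as pd

demo_drink_days = pd.DataFrame({
    'id': [1, 1, 2],
    'day': ['Friday', 'Saturday', 'Friday'],
    'amount': [2, 1, 3],
})

weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

demo_pivoted = (
    pd.pivot_table(demo_drink_days, values='amount', columns='day', index='id')
    .reindex(columns=weekdays)   # add weekdays missing from the data as NaN
    .fillna(0)                   # no drinks recorded means 0
    .reset_index()
)

# number of weekdays with no drinks in the typical week
demo_pivoted['typical dfd'] = (demo_pivoted[weekdays] == 0).sum(axis=1)
print(demo_pivoted[['id', 'typical dfd']])
```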

In [0]:
actualDfd = dfd.groupby('id', as_index=False)['dfd'].count()

In [0]:
compareDfd = pd.merge(actualDfd[['id', 'dfd']], pivoted[['id', 'typical dfd']], how='left', on='id')

compareDfd.head()

In [0]:
print(compareDfd['id'].count())
print(compareDfd['dfd'].mean())
print(compareDfd['typical dfd'].mean())

In [0]:
print(compareDfd['id'].loc[compareDfd['dfd'] > compareDfd['typical dfd']].count())
print(compareDfd['dfd'].loc[compareDfd['dfd'] > compareDfd['typical dfd']].mean())
print(compareDfd['typical dfd'].loc[compareDfd['dfd'] > compareDfd['typical dfd']].mean())

In [0]:
3941/67.15

# To do

- Group people who have 1, 2, 3, etc. typical week dfd and see how they performed after



# Export to csv

In [0]:
from google.colab import files

# df.to_csv('df.csv')
# files.download('df.csv')