[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_risk_levels.ipynb)

# Drink Free Days app 
# RISK LEVELS ANALYSIS
Looking at the reduction in alcohol unit consumption and the subsequent reduction in risk levels among Drink Free Days app users.

## Questions answered in this notebook
- How many users have reduced their risk level (and alcohol unit consumption) and by how much?
  - Comparing the typical week with drink free days achieved while using the app (in the four weeks after campaign launch).
  - Shifts in Higher/Increasing/Lower risk levels before (typical week) and after using the app.

## Data set
Relational database behind the app - RDS instance on AWS.

# Credentials to run the notebook

## Google Drive authentication (optional)
NOTE: If login credentials are hardcoded into the database connection (code cell below), this step is not necessary. Otherwise: 

Install and authenticate [PyDrive](https://pythonhosted.org/PyDrive/index.html) for loading files from Google Drive so that database passwords aren't hardcoded into the notebook.

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see PyDrive documentation for libraries code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

## Database connection
- Connecting to AWS RDS database with [PyMySQL](https://pymysql.readthedocs.io/en/latest/user/examples.html).
- Returning MySQL queries as Pandas dataframes with [```read_sql()``` ](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) function.

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
# comment out the other user/options when running this cell as necessary

# Jan's file - 'id' is Google Drive file ID
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# this variable is used in the connect function below
user_passwd = passwd_file.GetContentString()

# If you're not using Google Drive file but are hardcoding the password
# user_passwd = password as a string

In [0]:
def connect():

    return pymysql.connect(
        
        host = "df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        
        # change user name and password as necessary
           
        user = "jan",

        passwd = user_passwd, # assigned in the cell above
   
        db = "daysoff",
        
        autocommit=True

        )

connection = connect()

def sql_to_df(sql):
    """
    Returns MySQL queries as Pandas dataframes.
    """
    return pd.read_sql(sql, con = connection)

# Database tables (optional)
Overview of available data and tables used in the MySQL queries below. 
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for MySQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',100)

### Drinks table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        SELECT 
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_appdrinks'
        """)

### Pledges table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        SELECT
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_apppledges'
        """)

### Days off table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        SELECT
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_appdaysoff'
        """)

### App users table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        SELECT
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_appusers'
        """)

# Report generation
- Write MySQL queries as long strings inside ```sql_to_df()``` function.  See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for MySQL syntax reference.
- ```sql_to_df()``` returns Pandas dataframes.
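
The same pattern can be tried without access to the database. The sketch below is a stand-in under stated assumptions: it swaps the RDS connection for an in-memory SQLite database, and the table `demo_drinks`, its rows and the helper `demo_sql_to_df` are made up for illustration. Only the `pd.read_sql()` call is the one this notebook relies on.

```python
import sqlite3
import pandas as pd

# in-memory SQLite database standing in for the MySQL connection (assumption)
demo_connection = sqlite3.connect(":memory:")
demo_connection.execute(
    "CREATE TABLE demo_drinks (id INTEGER, percent REAL, ml REAL)"
)
demo_connection.executemany(
    "INSERT INTO demo_drinks VALUES (?, ?, ?)",
    [(1, 4.0, 500.0), (1, 12.0, 250.0), (2, 40.0, 25.0)],  # made-up rows
)

def demo_sql_to_df(sql):
    """Returns the result of a SQL query as a Pandas dataframe."""
    return pd.read_sql(sql, con=demo_connection)

# the query is written as a long string, as in the cells below
demo_df = demo_sql_to_df("""
        SELECT
            id, percent, ml
        FROM
            demo_drinks
        ORDER BY
            id, percent
        """)

print(demo_df)
```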

### Behaviour before and after app usage
- 'before' means during the typical week as specified by the user in the app onboarding.
- 'after' means in (any or all of) the first four weeks after the campaign launch on 10 Sep 2018.

#### Selecting from drinks table
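The query in this step calculates alcohol units as (percent * ml) / 1000. A minimal sketch of that arithmetic follows; the helper name `alcohol_units` and the three drinks are illustrative assumptions, not values from the database.

```python
def alcohol_units(percent, ml):
    """Alcohol units for one drink: (percent * ml) / 1000."""
    return (percent * ml) / 1000

# illustrative drinks (assumed values, not taken from the app data)
beer = alcohol_units(4.0, 500)     # 500 ml of 4% beer
wine = alcohol_units(12.0, 250)    # 250 ml of 12% wine
spirit = alcohol_units(40.0, 25)   # 25 ml of 40% spirit

print(beer, wine, spirit)
```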

In [0]:
# 'day' is needed for calculating units and risks after dfd recorded
# Alcohol units calculation = (percent * ml)/ 1000

drinksByDay = sql_to_df("""
        SELECT
            id, 
            day AS 'typical day', 
            ((percent * ml)/ 1000) AS 'week0' -- typical week
        FROM
            g_appdrinks
        """)

drinksByDay.head()

#### Summing units for each user on a typical week

In [0]:
# there's only ever one typical week in the database for any given user
# so this groupby will return sum of units per week for each user

drinksByWeek = drinksByDay.groupby('id', as_index=False)['week0'].sum()

drinksByWeek.head()

#### Selecting from users table (demographics)
Filtering of users (i.e. WHERE condition) explained:
- ```gender LIKE '%ale'``` is data cleaning: it filters out empty values, since gender is necessary for calculating risk levels
- ```age BETWEEN 19 AND 79```: 18 (the minimum age the app accepts) seems to have too many fake values, and the upper limit cuts off outliers
- ```lastseen >= '2018-09-13'``` only includes app users who recorded some activity three or more days after the campaign launch (10 Sep 2018), so we can assume that they are using the app

**NB:** MySQL week starts on Sunday by default so: 
- ```week(lastseen)```  counts weeks from Sundays
- ```week(lastseen, 1)``` counts weeks from Monday
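
The effect of the week start can be illustrated in plain Python. This is only a sketch: `strftime("%U")` and the ISO week number are used here as approximations of the two MySQL modes, the helper names are made up, and the numbering can differ from MySQL around the New Year.

```python
from datetime import date

def week_from_sunday(d):
    """Week number with weeks starting on Sunday (strftime %U)."""
    return int(d.strftime("%U"))

def week_from_monday(d):
    """ISO week number, with weeks starting on Monday."""
    return d.isocalendar()[1]

launch_monday = date(2018, 9, 10)     # campaign launch, a Monday
following_sunday = date(2018, 9, 16)  # last day of the launch week

# Counting from Sunday splits the launch week across two week numbers;
# counting from Monday keeps both dates in the same week.
for d in (launch_monday, following_sunday):
    print(d, week_from_sunday(d), week_from_monday(d))
```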

In [0]:
# 'age' is not necessary but may be interesting in further analysis
# 'lastseen' is going to be used to filter users who use the app longer

demographic = sql_to_df("""
        SELECT 
            id,
            week(lastseen, 1) AS 'last week',
            gender,
            age
        FROM
            g_appusers
        WHERE
            gender LIKE '%ale' -- to exclude empty values
            AND
            age BETWEEN 19 AND 79 -- 18 has too many fake values
            AND
            lastseen >= '2018-09-13' -- at least three days after campaign launch
        """)

demographic.head()

In [0]:
demographic.info()

#### Joining typical week units and demographic info dataframes

In [0]:
# inner join to make sure gender value is included in the dataframe
# as gender is necessary for risk level calculation

usersBefore = pd.merge(drinksByWeek, demographic, how='inner', on='id')

usersBefore.head()

In [0]:
usersBefore.info()

#### Calculating risk before
- Defining a ```riskCalculator``` function (see docstring in code cell below for details)
- Applying ```riskCalculator``` to every row of dataframe
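
For larger dataframes, a vectorised variant may be worth considering. The sketch below is an alternative, not the method used in this notebook: it assumes the same thresholds as ```riskCalculator``` (15 units for everyone, 50 for 'Male' and 35 for 'Female'), the function name `risk_levels` and the demo rows are made up, and rows with an unexpected gender are labelled 'unknown' instead of raising an error.

```python
import numpy as np
import pandas as pd

def risk_levels(df, units_col):
    """Vectorised risk levels using the thresholds from riskCalculator."""
    units = df[units_col]
    # per-row threshold for 'higher' risk: 50 units for Male, 35 for Female
    higher_from = df['gender'].map({'Male': 50, 'Female': 35})
    # np.select takes the first condition that is True for each row
    conditions = [units < 15, units < higher_from, units >= higher_from]
    choices = ['lower', 'increasing', 'higher']
    # an unexpected gender with 15+ units matches no condition
    return np.select(conditions, choices, default='unknown')

# made-up rows for illustration
demo = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'week0': [10.0, 20.0, 40.0, 40.0, 55.0],
})
demo['risk0'] = risk_levels(demo, 'week0')

print(demo)
```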

In [0]:
def riskCalculator(row, unitsColName):
    """
    Calculates risk level from alcohol units and gender 
    based on conditions for the three risk categories.
    
    Note:
        Conditions for risk levels are specified in the app documentation.
    
    Args:
        row: Used in lambda function when riskCalculator is applied to dataframe rows.
        unitsColName (str): Name of column with alcohol units values used in calculation.
        
    Returns:
        Risk level value as a string. 
        
    Raises:
        ValueError: If some row doesn't satisfy any of the if-elif conditions.
    """
    
    if row[unitsColName] < 15: 
        
        return 'lower'
        
    elif (row['gender'] == 'Male' and 
          15 <= row[unitsColName] < 50):
        
        return 'increasing'
        
    elif (row['gender'] == 'Female' and 
          15 <= row[unitsColName] < 35):
        
        return 'increasing'
        
    elif (row['gender'] == 'Male' and 
          row[unitsColName] >= 50):
        
        return 'higher'
        
    elif (row['gender'] == 'Female' and 
          row[unitsColName] >= 35):
        
        return 'higher'
        
    else:
        raise ValueError(
            'Some row doesn\'t meet any of the if-elif conditions in riskCalculator.'
        ) # error message also returns row index number 

In [0]:
# applying riskCalculator to every row

usersBefore['risk0'] = usersBefore.apply(lambda row: 
                                               riskCalculator(row, 'week0'), 
                                               axis=1)

usersBefore.head()

#### Selecting from 'days off' table (i.e. drink free days recorded by users)

**NB:** MySQL week starts on Sunday by default so: 
- ```week(date)```  counts weeks from Sundays
- ```week(date, 1)``` counts weeks from Monday

In [0]:
# week numbers are used in subsequent calculations so
# we need to make sure weeks start on Monday i.e. week(date, 1) 
# not Sunday which is MySQL default i.e. week(date)

dfdAchieved = sql_to_df("""
        SELECT 
            id,  
            dayname(date) AS 'dfd day name', -- drink free day 
            week(date, 1) AS 'dfd week number' -- week of drink free day
        FROM 
            g_appdaysoff
        WHERE
            date >= '2018-09-10' -- campaign start date
        GROUP BY
            id, week(date, 1), dayname(date) -- one row per user, week and day
        ORDER BY
            id, week(date, 1), dayname(date)
        """)

dfdAchieved.head()

#### Creating units by day (from typical weeks)

In [0]:
# I'm reusing the same dataframe I used for calculating weekly units
# adding 'day' to the groupby statement

unitsByDay = drinksByDay.groupby(['id', 'typical day'], as_index=False)['week0'].sum()

unitsByDay.sort_values('id').head()

#### Joining drink free days with their corresponding alcohol units
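The choice of join type matters in this step. A minimal sketch of the difference between an inner and a left join, using made-up data for one hypothetical user who only drinks on Fridays in a typical week:

```python
import pandas as pd

# hypothetical drink free days recorded by one user
demo_dfd = pd.DataFrame({
    'id': [1, 1],
    'dfd day name': ['Monday', 'Friday'],
    'dfd week number': [37, 37],
})

# hypothetical typical week: this user only drinks on Fridays
demo_units = pd.DataFrame({
    'id': [1],
    'typical day': ['Friday'],
    'week0': [6.0],
})

keys = dict(left_on=['id', 'dfd day name'], right_on=['id', 'typical day'])

# inner join: only drink free days that replace a typical drinking day
inner = pd.merge(demo_dfd, demo_units, how='inner', **keys)

# left join: every recorded drink free day, with missing units where
# the day is drink free in the typical week anyway
left = pd.merge(demo_dfd, demo_units, how='left', **keys)

print(inner)
print(left)
```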

In [0]:
# grouping drink free days achieved with their typical days (daily units)
# to calculate how many units have to be deducted for each drink free day achieved

dfdUnitsMatch = pd.merge(dfdAchieved, unitsByDay, 
                # inner join keeps only drink free days recorded on days
                # with drinks in the typical week; a left join would also return
                # drink free days on days which are drink free in a typical week anyway
                how='inner', 
                left_on=['id', 'dfd day name'],
                right_on=['id', 'typical day'])

dfdUnitsMatch.sort_values('id').head(10)

In [0]:
dfdUnitsMatch.info()

#### Units reductions by week for each user
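As a possible alternative to building one dataframe per week with boolean masks, the weekly sums can be pivoted into one column per week. This is a sketch on made-up data, assuming the campaign weeks are numbered 37 to 40 as in the cells of this section; the `demo_` names are hypothetical.

```python
import pandas as pd

# hypothetical matched drink free days with the typical-day units they replace
demo_match = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'dfd week number': [37, 37, 38, 37],
    'week0': [2.0, 3.0, 2.0, 4.5],
})

# sum the units not drunk per user per week
demo_by_week = demo_match.groupby(['id', 'dfd week number'],
                                  as_index=False)['week0'].sum()

# one column per week; users without a drink free day in a week get 0
demo_wide = (demo_by_week
             .pivot(index='id', columns='dfd week number', values='week0')
             .reindex(columns=[37, 38, 39, 40])
             .fillna(0))
demo_wide.columns = ['reduction1', 'reduction2', 'reduction3', 'reduction4']

print(demo_wide)
```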

In [0]:
unitsReductionByWeek = dfdUnitsMatch.groupby(['id',
                                              'dfd week number'], as_index=False)['week0'].sum()

# rename column
unitsReductionByWeek.rename(columns={"week0": "units to deduct"}, inplace=True)

unitsReductionByWeek.sort_values('id').head(10)

In [0]:
# just as an aside
# how many units NOT drunk in each week

unitsReductionByWeek.groupby('dfd week number', as_index=False)['units to deduct'].sum()

In [0]:
# just as an aside
# how many users recorded a (matched) drink free day in each week

unitsReductionByWeek.groupby('dfd week number', as_index=False)['id'].count()

In [0]:
# assign 'id' and 'units to deduct' for each week into new dataframe 
# to get units reduction column for each week for each user

cols = ['id', 'units to deduct']

week1loc = unitsReductionByWeek['dfd week number'] == 37
week2loc = unitsReductionByWeek['dfd week number'] == 38
week3loc = unitsReductionByWeek['dfd week number'] == 39
week4loc = unitsReductionByWeek['dfd week number'] == 40

week1 = unitsReductionByWeek[cols].loc[week1loc].rename(columns={"units to deduct": "reduction1"})
week2 = unitsReductionByWeek[cols].loc[week2loc].rename(columns={"units to deduct": "reduction2"})
week3 = unitsReductionByWeek[cols].loc[week3loc].rename(columns={"units to deduct": "reduction3"})
week4 = unitsReductionByWeek[cols].loc[week4loc].rename(columns={"units to deduct": "reduction4"})

In [0]:
week1.head()

#### Join usersBefore with weeks 1-4 of app usage

In [0]:
result = usersBefore.merge(week1, 
                           on='id', 
                           # left join keeps only users in usersBefore;
                           # an outer join would also return week1 IDs of users
                           # who achieved drink free days but don't meet the
                           # demographics selection criteria (incl. 'lastseen')
                           how='left').merge(week2, 
                                              on='id',
                                              how='left').merge(week3, 
                                                                 on='id', 
                                                                 how='left').merge(week4, 
                                                                                    on='id', 
                                                                                    how='left')
# fill in missing values with zeros
result.fillna(0, inplace=True)

result.head()

#### Calculating alcohol units for each week
**NB:** Users who were last seen in e.g. week1 will get week0 (i.e. typical week) values for subsequent weeks, as if they were still using the app, because their missing reductions are filled with zeros. We thereby implicitly assume that they went back to their typical weeks, which is an open question.
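
A minimal sketch of this behaviour on made-up data: the hypothetical user 2 has no recorded reduction in week 1, so the left join leaves a missing value, the zero fill turns it into no reduction, and the week 1 units equal the typical week.

```python
import pandas as pd

# hypothetical users with typical-week units
demo_before = pd.DataFrame({'id': [1, 2], 'week0': [20.0, 30.0]})

# only user 1 recorded a drink free day in week 1
demo_week1 = pd.DataFrame({'id': [1], 'reduction1': [5.0]})

# left join keeps both users; the missing reduction becomes 0
demo_result = demo_before.merge(demo_week1, on='id', how='left').fillna(0)

# user 2 ends up with week1 == week0
demo_result['week1'] = demo_result['week0'] - demo_result['reduction1']

print(demo_result)
```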

In [0]:
# adding columns for alcohol units consumption in each week

result['week1'] = result['week0'] - result['reduction1']
result['week2'] = result['week0'] - result['reduction2']
result['week3'] = result['week0'] - result['reduction3']
result['week4'] = result['week0'] - result['reduction4']

In [0]:
result.head()

#### Calculating risk for each week

In [0]:
# applying riskCalculator function (defined above) to every row
# creating a new column for calculated risk level for each week

R = ['risk1', 'risk2', 'risk3', 'risk4']
W = ['week1', 'week2', 'week3', 'week4']

dictionary = dict(zip(R, W))

for risk, week in dictionary.items():
    result[risk] = result.apply(lambda row: 
                                riskCalculator(row, week), 
                                axis=1)

result.head()

#### Export result to CSV

In [0]:
from google.colab import files

# result.to_csv('dfd_risk_levels.csv')
# files.download('dfd_risk_levels.csv')

# Slicing the results

### Users who recorded a dfd in all four weeks
To select the desired combinations of users, sets of ids are created and the matching records in the 'result' dataframe are then isolated with the ```.isin()``` method.

For operations with ```set``` see [sets documentation](https://docs.python.org/2/library/sets.html).
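
A short sketch of how set operations combine with ```.isin()```, using made-up ids; the `demo_` names are hypothetical and stand in for the per-week sets built from the real dataframes.

```python
import pandas as pd

demo_result = pd.DataFrame({'id': [1, 2, 3], 'week0': [20.0, 30.0, 12.0]})

# hypothetical ids with at least one drink free day in each week
demo_set1 = {1, 2, 3}
demo_set2 = {1, 3}
demo_set3 = {1, 2, 3}
demo_set4 = {1, 3}

# users present in the result and in every week
in_all_weeks = set(demo_result['id']).intersection(
    demo_set1, demo_set2, demo_set3, demo_set4)

# users with a drink free day in at least one week
in_any_week = demo_set1.union(demo_set2, demo_set3, demo_set4)

# isolate the matching rows of the dataframe
demo_selected = demo_result[demo_result['id'].isin(in_all_weeks)]

print(sorted(in_all_weeks))
print(demo_selected)
```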

In [0]:
# one set per week + week0 which is typical week

set0 = set(result['id'])
set1 = set(week1['id'])
set2 = set(week2['id'])
set3 = set(week3['id'])
set4 = set(week4['id'])

In [0]:
# selecting user ids which appear in each week
# i.e. users who recorded at least one dfd in each week

selection = set0.intersection(set1, set2, set3, set4)

# see how many users are in the selection
len(selection)

In [0]:
# isolate users who are part of the selection

result4 = result[result['id'].isin(selection)]

result4.head()

#### Data exploration

In [0]:
print(result4['reduction1'].mean())
print(result4['reduction2'].mean())
print(result4['reduction3'].mean())
print(result4['reduction4'].mean())

In [0]:
print(result4['week0'].mean())
print(result4['week1'].mean())
print(result4['week2'].mean())
print(result4['week3'].mean())
print(result4['week4'].mean())

In [0]:
result4.loc[(result4['risk0'] == 'higher') & (result4['risk4'] == 'increasing')]

In [0]:
result4.loc[(result4['risk0'] == 'increasing') & (result4['risk4'] == 'lower')]