[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_pledges.ipynb)

# PLEDGES

# Credentials to run the notebook

## Google Drive authentication (optional)
NOTE: If login credentials are hardcoded into the database connection (code cell below), this step is not necessary. Otherwise:

Install and authenticate [PyDrive](https://pythonhosted.org/PyDrive/index.html) for loading files from Google Drive so that database passwords aren't hardcoded into the notebook.

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see the PyDrive documentation for library code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

## Database connection
- Connecting to AWS RDS database with [PyMySQL](https://pymysql.readthedocs.io/en/latest/user/examples.html).
- Returning MySQL queries as Pandas dataframes with the [```read_sql()```](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) function.
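
As a rough illustration of the second point, the sketch below passes a query to `pd.read_sql()` and gets a dataframe back. It uses an in-memory SQLite database as a stand-in so that it runs without credentials; the `pledges` table and its rows are made up for the example and are not part of the `daysoff` database.

```python
import sqlite3

import pandas as pd

# Stand-in for the RDS connection: an in-memory SQLite database (assumption).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE pledges (id INTEGER, daycount INTEGER)")
demo_conn.executemany(
    "INSERT INTO pledges VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 2)],  # made-up rows
)
demo_conn.commit()

# The query string goes in, a Pandas dataframe comes out.
demo_df = pd.read_sql("SELECT id, daycount FROM pledges ORDER BY id", con=demo_conn)
print(demo_df)
```

The same call pattern is what the `sql_to_df()` helper further down wraps, only with the PyMySQL connection instead of SQLite.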

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
# 'id' is Google Drive file ID
usernm_file = drive.CreateFile({'id': '1l0NedyVzKKhPJ1-_cOqF1VRt_oQyr8OL'})
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# these variables are used in the connect function below
user_name = usernm_file.GetContentString()
user_passwd = passwd_file.GetContentString()

In [0]:
def connect():
    """Open a connection to the AWS RDS database."""
    return pymysql.connect(
        host="df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        user=user_name,      # assigned in the cell above
        passwd=user_passwd,  # assigned in the cell above
        db="daysoff",
        autocommit=True,
    )

connection = connect()

def sql_to_df(sql):
    """
    Returns MySQL queries as Pandas dataframes.
    """
    return pd.read_sql(sql, con = connection)

# Database tables used (optional step)
Overview of available data and tables used in the MySQL queries below.
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for MySQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',200)

### App pledges table

In [0]:
# increase max_colwidth (set to 200 above) if the comments column gets truncated

sql_to_df("""
        SELECT
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_apppledges'
        """)

### App users table

In [0]:
# increase max_colwidth (set to 200 above) if the comments column gets truncated

sql_to_df("""
        SELECT
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_appusers'
        """)

### Days off table

In [0]:
# increase max_colwidth (set to 200 above) if the comments column gets truncated

sql_to_df("""
        SELECT
            table_name, column_name, data_type, column_comment
        FROM
            information_schema.columns
        WHERE
            table_name = 'g_appdaysoff'
        """)

# Report generation
- Write MySQL queries as long strings inside ```sql_to_df()``` function.  See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for MySQL syntax reference.
- ```sql_to_df()``` returns Pandas dataframes.

**NB:** MySQL weeks start on Sunday by default, so:

- ```WEEK(lastseen)``` counts weeks from Sunday
- ```WEEK(lastseen, 1)``` counts weeks from Monday
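
To see why the week mode matters, here is a small sketch using Python's `strftime` codes `%U` (Sunday-first) and `%W` (Monday-first) as stand-ins for the two MySQL modes. This is an analogy rather than MySQL's own implementation: the two should agree for 2018, which starts on a Monday, but MySQL's mode 1 defines week 1 differently in other years. The helper names are made up for the example.

```python
from datetime import date


def week_sunday_start(d):
    """Week of the year with Sunday as the first day of the week."""
    return int(d.strftime("%U"))


def week_monday_start(d):
    """Week of the year with Monday as the first day of the week."""
    return int(d.strftime("%W"))


saturday = date(2018, 9, 15)
sunday = date(2018, 9, 16)

# Sunday-first: the Sunday already belongs to the next week.
print(week_sunday_start(saturday), week_sunday_start(sunday))  # 36 37
# Monday-first: Saturday and Sunday share the same week.
print(week_monday_start(saturday), week_monday_start(sunday))  # 37 37
```

With Sunday-first numbering, activity logged on a Sunday would be counted in a different week from the rest of that Monday-to-Sunday campaign week, which is why the queries below pass `1` as the mode.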

In [0]:
selection = sql_to_df("""
        SELECT 
            P.id, 
            WEEK(week, 1) AS pledge_week, -- week to start on Monday
            daycount AS num_of_pledged_days,            
            WEEK(lastseen, 1) AS last_week -- week to start on Monday
        FROM 
            g_apppledges P
            JOIN
            g_appusers U ON P.id=U.id
            JOIN
            g_appdaysoff D ON D.id=P.id
        WHERE
            YEAR(week) = 2018
        GROUP BY 
            P.id, -- qualified because all three joined tables have an id column
            pledge_week
        ORDER BY
            pledge_week DESC, id
        """)

selection.head(10)

In [0]:
df = sql_to_df("""
        SELECT 
            id,
            COUNT(id) AS num_of_dfd,
            WEEK(date, 1) AS dfd_week 
        FROM 
            g_appdaysoff
        WHERE
            YEAR(date) = 2018
        GROUP BY 
            id, 
            dfd_week
        ORDER BY
            dfd_week DESC, id
        """)

df.head(10)

In [0]:
result = pd.merge(selection, df, left_on=['id', 'pledge_week'], right_on=['id', 'dfd_week'])

result.head(10)

In [0]:
# correlation between pledged days and recorded drink free days
result[['num_of_pledged_days', 'num_of_dfd']].corr()
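
One thing worth checking in the merge above: `pd.merge()` performs an inner join by default, so pledge weeks without any recorded drink free day drop out before the correlation is computed. The sketch below shows the difference on made-up data; treating a missing count as zero days is an assumption, not something the notebook states.

```python
import pandas as pd

# Made-up pledge and drink-free-day counts per user and week (illustrative only).
pledges = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "pledge_week": [37, 38, 37, 37],
    "num_of_pledged_days": [2, 3, 4, 1],
})
days_off = pd.DataFrame({
    "id": [1, 1, 2],
    "dfd_week": [37, 38, 37],
    "num_of_dfd": [2, 3, 4],
})

# Default inner join: user 3 has no matching row and is dropped.
inner = pd.merge(pledges, days_off,
                 left_on=["id", "pledge_week"], right_on=["id", "dfd_week"])

# Left join keeps every pledge; missing counts become 0 (an assumption).
left = pd.merge(pledges, days_off, how="left",
                left_on=["id", "pledge_week"], right_on=["id", "dfd_week"])
left["num_of_dfd"] = left["num_of_dfd"].fillna(0)

print(len(inner), len(left))
print(left[["num_of_pledged_days", "num_of_dfd"]].corr())
```

Whether the left join is the better choice depends on what a missing row in `g_appdaysoff` means for the analysis.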

# TBC
The cell below was copied from the 'risk levels' notebook.

### Users by day
Filtering of users (i.e. WHERE condition) explained:
- ```gender LIKE '%ale'``` is data cleaning to filter out empty values, since gender is necessary for calculating risk levels
- ```age BETWEEN 19 AND 79```: 18 (the minimum the app accepts) seems to have too many fake values, and the upper bound of 79 cuts off outliers
- ```joined >= '2018-09-10'``` to only include app users who joined on or after the campaign start date
- ```lastseen >= '2018-09-16'``` to only include app users who have recorded some activity on or after 16 Sep 2018, the last day of the first campaign week (launch was Monday 10 Sep 2018), so we can assume that they are using the app
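
For readers more at home in Pandas than SQL, the same four conditions can be sketched as a boolean mask. The `users` dataframe below is made up for illustration and is not taken from `g_appusers`; lower-casing the gender column before matching assumes the MySQL `LIKE` comparison is case-insensitive, which depends on the column's collation.

```python
import pandas as pd

# Made-up users (illustrative only).
users = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "", "Female", "Male"],
    "age": [34, 18, 45, 52, 80],
    "joined": pd.to_datetime(
        ["2018-09-10", "2018-09-12", "2018-09-11", "2018-08-01", "2018-09-15"]),
    "lastseen": pd.to_datetime(
        ["2018-09-20", "2018-09-18", "2018-09-17", "2018-09-30", "2018-09-16"]),
})

mask = (
    users["gender"].str.lower().str.endswith("ale")    # gender LIKE '%ale'
    & users["age"].between(19, 79)                      # inclusive, like BETWEEN
    & (users["joined"] >= pd.Timestamp("2018-09-10"))   # campaign start date
    & (users["lastseen"] >= pd.Timestamp("2018-09-16")) # end of first campaign week
)
kept = users[mask]
print(kept["id"].tolist())
```

Here only user 1 passes all four conditions: user 2 fails on age, user 3 on the empty gender, user 4 on the join date and user 5 on the upper age bound.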

**NB:** MySQL weeks start on Sunday by default, so:
- ```WEEK(lastseen)``` counts weeks from Sunday
- ```WEEK(lastseen, 1)``` counts weeks from Monday

In [0]:
# 'typical day' is needed for matching units to deduct 
# from drink free days recorded by a user
# Alcohol units calculation = (percent * ml)/ 1000

usersByDay = sql_to_df("""
        SELECT 
            U.id,
            gender,
            age,
            WEEK(lastseen, 1) AS last_week, -- mode 1: week starts on Monday (MySQL default is Sunday)
            day AS typical_day, -- days with no drinks in them are not recorded in database
            SUM((percent * ml)/ 1000) AS typical_units
        FROM
            g_appusers U
        LEFT JOIN g_appdrinks D
            ON U.id=D.id            
        WHERE
            gender LIKE '%ale' -- to exclude empty values
            AND
            age BETWEEN 19 AND 79 -- 18 has too many fake values, 80+ are outliers
            AND
            joined >= '2018-09-10' -- campaign start date
            AND
            lastseen >= '2018-09-16' -- at least a week from campaign launch (Monday-Sunday)
            AND
            day IS NOT NULL -- data cleaning: some users have 'None' as their typical day
        GROUP BY 
            U.id, typical_day -- to sum up multiple drinks and get total units for each day
        """)

usersByDay.head()
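
The units formula in the query, `(percent * ml) / 1000`, can be checked by hand with a few lines of Python. The drinks below are hypothetical and only illustrate how several drinks on one typical day add up, which is what `SUM(...)` with `GROUP BY U.id, typical_day` does in SQL.

```python
def alcohol_units(percent, ml):
    """Units = (ABV percent * volume in ml) / 1000, as in the query above."""
    return (percent * ml) / 1000


# Hypothetical drinks logged for one typical day: (ABV percent, volume in ml).
drinks = [(4.0, 568), (13.0, 175)]

# Sum the units of all drinks on that day.
typical_units = sum(alcohol_units(percent, ml) for percent, ml in drinks)
print(round(typical_units, 3))  # 4.547
```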