[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_pledges.ipynb)

# Drink Free Days - Pledges
Pledge data queried from AWS RDS using PyMySQL.

## Install PyDrive for loading files from Google Drive
https://pythonhosted.org/PyDrive/index.html

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see the PyDrive documentation for these library code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

### Create and upload a text file with PyDrive (if needed)

```python
uploaded = drive.CreateFile({'title': 'sample_file.txt'})
uploaded.SetContentString('Sample file content.')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))
```

### Get the Google Drive IDs of your file (if needed)

You will need your file's Google Drive ID to load the content of the text file into a variable with PyDrive.  

To get the ID, right-click the file in your Google Drive and select 'Get shareable link', which gives you a URL with 'id=' in it.
  
Alternatively, you can obtain the ID directly from the notebook by running this code:  

- Step 1:   
Get your file's ID.  To get a list of all the file and folder IDs in the root folder:
```python
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    # print the title and ID of every file and folder in the root folder
    print('title: %s, id: %s' % (file1['title'], file1['id']))
```
You can replace ```'root'``` with the ID of the folder your file is in. You can get the folder ID either from the web interface: ```drive.google.com/drive/u/0/folders/<folder ID>```  
or have PyDrive list the IDs for you:
```python
# Paginate file lists and specify the maximum number of results per page if necessary
for file_list in drive.ListFile({'q': 'trashed=false', 'maxResults': 10}):
    print('Received %s files from Files.list()' % len(file_list))  # <= 10
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
```
- Step 2:   
Load the file.
```python
downloaded = drive.CreateFile({'id': '<file ID>'})
# you can print the content of the text file to check it
print('Downloaded content "{}"'.format(downloaded.GetContentString()))
```
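
If you go the shareable-link route, the ID can also be pulled out of the URL in code rather than by eye. This is a minimal sketch that assumes the link carries the ID in an `id=` query parameter, as described above; `drive_id_from_link` and the example link are hypothetical, not part of PyDrive.

```python
from urllib.parse import urlparse, parse_qs


def drive_id_from_link(link):
    """Return the value of the 'id' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(link).query)
    values = query.get('id')
    return values[0] if values else None


# Hypothetical link; FILE_ID_PLACEHOLDER is not a real Drive ID
example_link = 'https://drive.google.com/open?id=FILE_ID_PLACEHOLDER'
file_id = drive_id_from_link(example_link)
print(file_id)
```

The returned string is what you would paste in place of `'<file ID>'` in Step 2.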


### Load the file's content (i.e. the password) into a variable

In [0]:
# insert ID of the file with the password
# comment out the other user when running this cell

# Jan's file
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# Tacey's file
# passwd_file = drive.CreateFile({'id': ' '}) 

# load the password as a string into a variable
# you will use this variable in pymysql connection
# instead of the actual password string
user_passwd = passwd_file.GetContentString()
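
A text file saved from an editor often ends with a newline, which would become part of the password string. The sketch below shows one way to guard against that; `clean_secret` is a hypothetical helper and the sample value is made up.

```python
def clean_secret(raw):
    """Strip surrounding whitespace (including a trailing newline) from a secret."""
    return raw.strip()


# Stand-in for passwd_file.GetContentString(); the value is invented
raw_content = 'not-a-real-password\n'
print(repr(clean_secret(raw_content)))
```

In the cell above this would amount to `user_passwd = clean_secret(passwd_file.GetContentString())`, assuming the real password has no meaningful leading or trailing whitespace.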

## Connect to AWS database via PyMySQL
### Return SQL queries as Pandas dataframes
Pandas ```read_sql``` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
def connect():  
    return pymysql.connect(
        
        host = "df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        
        # change user name and password as necessary
           
        user = "jan",
        # user = "tacey",
        passwd = user_passwd, # string loaded into a variable with PyDrive above
   
        db = "daysoff",
        autocommit=True
        
        )

connection = connect()

def sql_to_df(sql):
    """
    Returns SQL queries as pandas dataframes
    """
    return pd.read_sql(sql, con = connection)

# Database tables
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',100)

### Pledges table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_apppledges'
        """)

### App users table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appusers'
        """)

# Reports
SQL queries are passed as strings to the ```sql_to_df()``` function defined above, which runs them over the PyMySQL connection.  
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax reference.

## Pledges overview in numbers
* number of all users (a.k.a. app downloads)
* number of users who have pledged 
* user conversion (from downloading the app to pledging)
* number of pledges made
* average pledges per pledging user
* average pledges per user, counting all users (a.k.a. per app download)
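
The arithmetic behind these figures can be sketched in plain Python. The function below mirrors the rounding used in the SQL queries in this section; `pledge_overview` is a hypothetical helper and the input numbers are invented for illustration only.

```python
def pledge_overview(all_users, pledging_users, total_pledges):
    """Derive the overview ratios from three raw counts.

    Ratios with a zero denominator are returned as None,
    similar to MySQL returning NULL on division by zero.
    """
    def ratio(numerator, denominator, scale=1):
        if denominator == 0:
            return None
        return round(numerator / denominator * scale, 1)

    return {
        'All users': all_users,
        'Pledging users': pledging_users,
        'User conversion %': ratio(pledging_users, all_users, scale=100),
        'Total pledges': total_pledges,
        'Pledges/pledging user': ratio(total_pledges, pledging_users),
        'Pledges/all users': ratio(total_pledges, all_users),
    }


# Invented counts, not real app data
overview = pledge_overview(all_users=200, pledging_users=50, total_pledges=120)
print(overview)
```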

### Lifetime of the app

In [0]:
#@title
pledgesOverview = sql_to_df(
    """
    select
        count(distinct g_appusers.id) as 'All users',
        count(distinct g_apppledges.id) as 'Pledging users',
        round(count(distinct g_apppledges.id) / (count(distinct g_appusers.id) /100), 1) as 'User conversion %',
        count(g_apppledges.id) as 'Total pledges',
        round(count(g_apppledges.id) / count(distinct g_apppledges.id), 1) as 'Pledges/pledging user',
        round(count(g_apppledges.id) / count(distinct g_appusers.id), 1) as 'Pledges/all users'
    from
        daysoff.g_apppledges right join daysoff.g_appusers
        on g_apppledges.id=g_appusers.id
    """)

pledgesOverview
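
To see why the query distinguishes `count(distinct ...)` from `count(...)`, here is a small self-contained sketch against an in-memory SQLite database. The two toy tables only borrow the table and column names from the query above; the rows are invented, and a `left join` from the users table stands in for the `right join` (the two are equivalent here, and older SQLite versions lack `right join`).

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    create table g_appusers (id integer primary key);
    create table g_apppledges (id integer, week text, daycount integer);
    insert into g_appusers (id) values (1), (2), (3);
    insert into g_apppledges (id, week, daycount) values
        (1, '2018-09-03', 2),
        (1, '2018-09-10', 3),
        (2, '2018-09-03', 1);
""")

# User 3 has no pledge, so the join yields one row with NULL pledge columns.
# count(p.id) skips that NULL; count(distinct p.id) also collapses repeat pledges.
row = con.execute("""
    select
        count(distinct u.id),
        count(distinct p.id),
        count(p.id)
    from g_appusers u left join g_apppledges p on p.id = u.id
""").fetchone()
con.close()

overview = dict(zip(('all users', 'pledging users', 'total pledges'), row))
print(overview)
```

Three users, two of whom pledged, with three pledge rows in total: the outer join keeps the non-pledging user in the "all users" count without inflating the pledge counts.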

### Campaign period
Fill in the dates in the 'where' clause of the SQL query below as necessary.

In [0]:
#@title
pledgesOverviewCampaign = sql_to_df(
    """
    select
        count(distinct g_appusers.id) as 'All users',
        count(distinct g_apppledges.id) as 'Pledging users',
        round(count(distinct g_apppledges.id) / (count(distinct g_appusers.id) /100), 1) as 'User conversion %',
        count(g_apppledges.id) as 'Total pledges',
        round(count(g_apppledges.id) / count(distinct g_apppledges.id), 1) as 'Pledges/pledging user',
        round(count(g_apppledges.id) / count(distinct g_appusers.id), 1) as 'Pledges/all users'
    from
        daysoff.g_apppledges right join daysoff.g_appusers
        on g_apppledges.id=g_appusers.id
    where
        joined between '2018-09-03' and '2018-09-13'
    """)

pledgesOverviewCampaign
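
Instead of editing the dates inside the SQL string, the campaign window could be passed as query parameters. The following is a sketch under assumptions: `campaign_params` and `CAMPAIGN_WHERE` are hypothetical names, and the commented-out call relies on the `params` argument of `pd.read_sql` together with PyMySQL's `%(name)s` placeholder style.

```python
from datetime import date

# Placeholder-based version of the 'where' clause used above
CAMPAIGN_WHERE = "joined between %(start)s and %(end)s"


def campaign_params(start, end):
    """Validate two ISO dates (YYYY-MM-DD) and return them as a params dict."""
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError('campaign start must not be after campaign end')
    return {'start': start_date.isoformat(), 'end': end_date.isoformat()}


params = campaign_params('2018-09-03', '2018-09-13')
print(params)

# With a live connection the dict could be passed along, for example:
# pd.read_sql("select count(*) from daysoff.g_appusers where " + CAMPAIGN_WHERE,
#             con=connection, params=params)
```

If `joined` turns out to be a datetime column rather than a date, `between ... and '2018-09-13'` stops at midnight at the start of 13 September, so the end date may need to be moved one day later.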

## Pledges on timelines

In [0]:
pledges = sql_to_df(
    """
    select
        id as 'user id', 
        week as 'pledged week', 
        week(week) as 'week number',
        month(week) as 'month number',
        daycount as 'days pledged'
    from
        daysoff.g_apppledges
    """)

pledges.head()

In [0]:
pledges.info()

In [0]:
# converting strings to datetime 

pledges['pledged week'] = pd.to_datetime(pledges['pledged week'])

### Pledges by week number (2017 and 2018 on the same axis)

In [0]:
pledges_byWeekNumber = pledges.groupby('week number')['user id'].count()

pledges_byWeekNumber.plot(title='Number of pledges by week number (2017 and 2018 on the same axis)');
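
A note on what 'week number' means here: MySQL's `week()` in its default mode 0 counts weeks from Sunday and uses the range 0-53, which is not the same as the ISO week number. The sketch below illustrates the difference with Python's standard library, assuming the server has not changed its default week mode; the helper names are hypothetical.

```python
from datetime import date


def sunday_week_number(d):
    """Week of the year with Sunday as first day, 0-53 (strftime %U)."""
    return int(d.strftime('%U'))


def iso_week_number(d):
    """ISO 8601 week number, with Monday as first day, 1-53."""
    return d.isocalendar()[1]


campaign_start = date(2018, 9, 3)  # a Monday
print(sunday_week_number(campaign_start), iso_week_number(campaign_start))
```

For this date the two conventions differ by one week, which is worth keeping in mind when comparing the chart with a calendar that shows ISO weeks.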

### Pledges by month (2017 and 2018 on the same axis)

In [0]:
pledges_byMonthNumber = pledges.groupby('month number')['user id'].count()

pledges_byMonthNumber.plot(title='Number of pledges by month (2017 and 2018 on the same axis)');
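
Because the grouping above pools 2017 and 2018 into one count per month, it can be useful to split the counts by year as well. This is a minimal sketch on invented rows that reuses the column names of the `pledges` dataframe; the `year` column is an addition for illustration.

```python
import pandas as pd

# Toy rows standing in for the pledges dataframe
toy = pd.DataFrame({
    'user id': [1, 2, 3, 4],
    'pledged week': pd.to_datetime(
        ['2017-09-04', '2018-09-03', '2018-09-03', '2018-10-01']),
})
toy['year'] = toy['pledged week'].dt.year
toy['month number'] = toy['pledged week'].dt.month

# One row per month, one column per year
by_month_and_year = (
    toy.groupby(['month number', 'year'])['user id']
       .count()
       .unstack('year', fill_value=0)
)
print(by_month_and_year)
```

Calling `.plot()` on the result would draw one line per year instead of a single pooled line.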

### Pledges by calendar week

In [0]:
pledges_byCalendarWeek = pledges.groupby('pledged week')['user id'].count()

pledges_byCalendarWeek.plot(title='Number of pledges - app history timeline');

## How many days do people pledge

In [0]:
pledges_daysPledged = pledges.groupby('days pledged')['user id'].count()

pledges_daysPledged.plot(title='How many users pledge how many days');
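
Note that `count()` on 'user id' counts pledge rows, so a user who pledges in several weeks is counted once per pledge. If the number of distinct users is wanted instead, `nunique()` is one option. A small sketch on invented data shows the difference:

```python
import pandas as pd

# Toy data: user 1 pledged two days in two different weeks
toy = pd.DataFrame({
    'user id': [1, 1, 2, 3],
    'days pledged': [2, 2, 2, 3],
})

pledge_rows = toy.groupby('days pledged')['user id'].count()
distinct_users = toy.groupby('days pledged')['user id'].nunique()

print(pd.DataFrame({'pledges': pledge_rows, 'distinct users': distinct_users}))
```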

In [0]:
#@title
pledgesCampaign = sql_to_df(
    """
    select
        g_apppledges.id as 'user id', 
        week as 'pledged week', 
        week(week) as 'week number',
        month(week) as 'month number',
        daycount as 'days pledged'
    from
        daysoff.g_apppledges right join daysoff.g_appusers
        on g_apppledges.id=g_appusers.id
    where
        joined between '2018-09-03' and '2018-09-13'
    """)


In [0]:
#@title
pledges_daysPledged = pledgesCampaign.groupby('days pledged')['user id'].count()

pledges_daysPledged.plot(title='How many users pledge how many days (campaign period)');

## Pledge variation across months
Do people pledge more or fewer days depending on the time of year?

In [0]:
pledges_variationMonth = pledges.groupby(['month number', 'days pledged'], as_index=False)['user id'].count()

### Altair Viz charts

For customisation of Altair charts [see documentation](https://altair-viz.github.io/user_guide/customization.html).

For different colour schemes, replace 'scheme name' with a string that matches any of the available [Vega color schemes](https://vega.github.io/vega/docs/schemes/#reference).

```alt.Color('column name', scale=alt.Scale(scheme='scheme name'))```  

In [0]:
import altair as alt

In [0]:
alt.Chart(pledges_variationMonth, title='Days pledged by calendar month').mark_line().encode(
    x='days pledged', 
    y=alt.Y('user id', axis=alt.Axis(title='number of users')), 
    # y='user id',
    color=alt.Color('month number:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
#@title
# get totals for each month
pledges_variationMonthsGrouped = pledges_variationMonth.groupby('month number', as_index=False)['user id'].sum()

# get both values for percentage calculation into one table
pledges_variationMonthNorm = pd.merge(pledges_variationMonth, 
                                      pledges_variationMonthsGrouped, 
                                      how='inner', 
                                      on='month number', 
                                      suffixes=('_mth', '_sum'))

# calculate percentages
pledges_variationMonthNorm['percent of total users each month'] = pledges_variationMonthNorm['user id_mth'] / (pledges_variationMonthNorm['user id_sum'] / 100)

# plot
alt.Chart(pledges_variationMonthNorm, title='Days pledged by calendar month - normalised').mark_line().encode(
    x='days pledged',
    y='percent of total users each month',
    color=alt.Color('month number:N', scale=alt.Scale(scheme='tableau20')),
    # opacity=alt.value(0.5),
)
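
The normalisation step above merges the monthly totals back onto the detail rows. An alternative sketch, shown here on invented numbers, gets the same percentages with `groupby(...).transform('sum')`, which returns the monthly total aligned to each original row so no merge is needed.

```python
import pandas as pd

# Toy data with the same columns as pledges_variationMonth
toy = pd.DataFrame({
    'month number': [1, 1, 2, 2],
    'days pledged': [2, 3, 2, 3],
    'user id': [30, 10, 5, 15],
})

# Total per month, repeated on every row of that month
monthly_total = toy.groupby('month number')['user id'].transform('sum')
toy['percent of total users each month'] = toy['user id'] / monthly_total * 100

print(toy)
```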


## Pledge days

### Most popular days
1 = Monday ... 0 = Sunday

In [0]:
pledgeDays = sql_to_df(
    """
    select
        count(distinct id) as 'number of users',
        daycount as 'number of days pledged',
        days as 'day combinations'
    from
        g_apppledges 
    group by 
        days
    """)

pledgeDays.head()
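
The query above returns counts per day combination. To rank individual days, the combinations would need to be split into single day codes. The sketch below makes a loud assumption: that `days` holds the day codes as digit characters (for example `'1,3'` or `'13'`), which is not confirmed by anything in this notebook. `count_individual_days` and the sample rows are hypothetical.

```python
import re
from collections import Counter

# Day coding used in this notebook: 1 = Monday ... 0 = Sunday
DAY_NAMES = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday',
             5: 'Friday', 6: 'Saturday', 0: 'Sunday'}


def count_individual_days(rows):
    """Sum user counts per single day.

    rows: iterable of (number_of_users, day_combination) pairs.
    ASSUMPTION: each combination stores day codes as the digits 0-6.
    """
    totals = Counter()
    for users, combo in rows:
        # set() guards against a code appearing twice in one combination
        for digit in set(re.findall(r'[0-6]', str(combo))):
            totals[DAY_NAMES[int(digit)]] += users
    return totals


# Invented rows in the shape of ('number of users', 'day combinations')
toy_rows = [(10, '1,3'), (4, '1'), (2, '0,6')]
day_totals = count_individual_days(toy_rows)
print(day_totals.most_common())
```

Since each combination's user count is distinct only within that combination, a user who pledged different combinations in different weeks would be counted more than once in these totals.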

### Most popular day combinations
1 = Monday ... 0 = Sunday

In [0]:
pledgeDayCombos = sql_to_df(
    """
    select
        count(distinct id) as 'number of users',
        days as 'day combinations'
    from
        g_apppledges 
    group by 
        days
    order by
        count(distinct id) desc
    """)

pledgeDayCombos.head()

In [0]:
alt.Chart(pledgeDayCombos).mark_bar().encode(
    x='day combinations',
    y='number of users'
)