[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/master/dfd_pledges.ipynb)

# Drink Free Days - Pledges
From AWS RDS using PyMySQL 

## Install PyDrive for loading files from Google Drive
https://pythonhosted.org/PyDrive/index.html

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see PyDrive documentation for libraries code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

### Create and upload a text file with PyDrive (if needed)

```python
uploaded = drive.CreateFile({'title': 'sample_file.txt'})
uploaded.SetContentString('Sample file content.')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))
```

### Get the Google Drive IDs of your file (if needed)

You will need your file's Google Drive ID to load the content of the text file into a variable with PyDrive.  

To get the ID, right-click on the file in your Google Drive and select 'Get shareable link', which gives you a URL with 'id=' in it.
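
As a minimal sketch, the ID can also be pulled out of a shareable link programmatically. The helper name `extract_drive_id` and the example URL are hypothetical, and the sketch assumes the link either carries an `id=` query parameter or has the form `.../d/<file ID>/...` or `.../folders/<folder ID>`; other link formats are not handled.

```python
from urllib.parse import urlparse, parse_qs


def extract_drive_id(url):
    """Return the Drive ID found in a shareable link, or None (hypothetical helper)."""
    parsed = urlparse(url)
    # Case 1: the ID is passed as a query parameter, e.g. ...?id=<ID>
    query = parse_qs(parsed.query)
    if 'id' in query:
        return query['id'][0]
    # Case 2: the ID follows a 'd' or 'folders' path segment
    parts = [p for p in parsed.path.split('/') if p]
    for marker in ('d', 'folders'):
        if marker in parts:
            idx = parts.index(marker)
            if idx + 1 < len(parts):
                return parts[idx + 1]
    return None


print(extract_drive_id('https://drive.google.com/open?id=EXAMPLE_FILE_ID'))
```

The returned string can then be passed to `drive.CreateFile({'id': ...})`.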
  
Alternatively, you can obtain the ID directly from the notebook by running this code:  

- Step 1:   
Get your file's ID.  To get a list of all the file and folder IDs in the root folder:
```python
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    # print the title and ID of every file and folder found
    print('title: %s, id: %s' % (file1['title'], file1['id']))
```
You can replace ```'root'``` with the folder ID your file is in. You can either get the folder ID from the web interface: ```drive.google.com/drive/u/0/folders/<folder ID>```  
or have PyDrive list the IDs for you:
```python
# Paginate file lists and specify number of max results if necessary
for file_list in drive.ListFile({'q': 'trashed=false', 'maxResults': 10}):
    print('Received %s files from Files.list()' % len(file_list)) # <= 10
    for file1 in file_list:
        print('title: %s, id: %s' % (file1['title'], file1['id']))
```
- Step 2:   
Load the file.
```python
downloaded = drive.CreateFile({'id': '<file ID>'})
# you can print the content of the text file to check it
print('Downloaded content "{}"'.format(downloaded.GetContentString()))
```


### Load the file's content (i.e. the password) into a variable

In [0]:
# insert ID of the file with the password
# comment out the other user when running this cell

# Jan's file
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# Tacey's file
# passwd_file = drive.CreateFile({'id': ' '}) 

# load the password as a string into a variable
# you will use this variable in pymysql connection
# instead of the actual password string
user_passwd = passwd_file.GetContentString()

## Connect to AWS database via PyMySQL
### Return SQL queries as Pandas dataframes
Pandas ```read_sql``` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
def connect():  
    return pymysql.connect(
        
        host = "df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        
        # change user name and password as necessary
           
        user = "jan",
        # user = "tacey",
        passwd = user_passwd, # loaded into variable with PyDrive above
   
        db = "daysoff",
        autocommit=True
        
        )

connection = connect()

def sql_to_df(sql):
    """
    Returns SQL queries as pandas dataframes
    """
    return pd.read_sql(sql, con = connection)
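
Switching users by commenting lines in and out is easy to get wrong. A possible alternative, sketched here under assumptions, is to build the connection arguments from parameters. The helper name `build_connection_kwargs` and the placeholder host are hypothetical; `password` and `database` are the PyMySQL keyword spellings equivalent to the `passwd` and `db` used above. Stripping whitespace assumes the password file may end with a newline, which is a guess about the file rather than something this notebook states.

```python
def build_connection_kwargs(user, password, host, db='daysoff'):
    """Collect keyword arguments for pymysql.connect (hypothetical helper)."""
    return {
        'host': host,
        'user': user,
        # assumption: the password file may carry a trailing newline
        'password': password.strip(),
        'database': db,
        'autocommit': True,
    }


kwargs = build_connection_kwargs('jan', 'secret\n', host='example-host')
print(kwargs['user'], kwargs['database'])
```

The resulting dictionary could then be unpacked into the call, as in `pymysql.connect(**kwargs)`, with the real host name and `user_passwd` supplied in place of the placeholders.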

# Database tables
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax.

In [0]:
# formatting column width of Pandas dataframes
# increase column width so that longer comments don't get truncated

pd.set_option('max_colwidth',100)

### Pledges table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_apppledges'
        """)

### App users table

In [0]:
# run pd.set_option('max_colwidth',100) if comments column gets truncated

sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appusers'
        """)
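
The two schema lookups above differ only in the table name, so they could share one helper. This is a sketch rather than part of the original notebook: the name `columns_query` is hypothetical, and filtering on `table_schema` as well is an added assumption meant to avoid picking up same-named tables from other schemas. It returns the SQL text and the parameters separately so that the values are passed to the driver instead of being formatted into the string.

```python
def columns_query(table_name, schema='daysoff'):
    """Return (sql, params) listing a table's columns (hypothetical helper)."""
    sql = """
        select
            table_name, column_name, data_type, column_comment
        from
            information_schema.columns
        where
            table_schema = %s and table_name = %s
        """
    return sql, (schema, table_name)


sql, params = columns_query('g_apppledges')
print(params)
```

With the connection defined earlier, this could be run as `pd.read_sql(sql, con=connection, params=params)`.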

# Reports
SQL queries are passed as strings to the ```sql_to_df()``` function defined above, which runs them over the PyMySQL connection.  
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax reference.

## Pledges overview in numbers
* number of all users (a.k.a. app downloads)
* number of users who have pledged 
* user conversion (from downloading the app to pledging)
* number of pledges made
* average pledges per pledging user
* average pledges per user across all users (a.k.a. per app download)
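
To make the definitions above concrete, here is a small pandas sketch that computes the same six figures from two toy dataframes. The data and the function name `pledges_overview` are made up for illustration; the only assumption carried over from the database is that `id` identifies a user in both tables.

```python
import pandas as pd


def pledges_overview(users, pledges):
    """Compute the overview figures from user and pledge tables (illustrative)."""
    all_users = users['id'].nunique()
    pledging_users = pledges['id'].nunique()
    total_pledges = len(pledges)
    return {
        'All users': all_users,
        'Pledging users': pledging_users,
        'User conversion %': 100 * pledging_users / all_users,
        'Total pledges': total_pledges,
        'Pledges/pledging user': total_pledges / pledging_users,
        'Pledges/all users': total_pledges / all_users,
    }


# toy data: five users, two of whom made three pledges between them
users = pd.DataFrame({'id': [1, 2, 3, 4, 5]})
pledges = pd.DataFrame({'id': [1, 1, 2], 'daycount': [2, 3, 1]})
overview = pledges_overview(users, pledges)
print(overview)
```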

In [0]:
pledgesOverview = sql_to_df(
    """
    select
        count(distinct g_appusers.id) as 'All users',
        count(distinct g_apppledges.id) as 'Pledging users',
        round(count(distinct g_apppledges.id) / (count(distinct g_appusers.id) /100), 1) as 'User conversion %',
        count(g_apppledges.id) as 'Total pledges',
        round(count(g_apppledges.id) / count(distinct g_apppledges.id), 1) as 'Pledges/pledging user',
        round(count(g_apppledges.id) / count(distinct g_appusers.id), 1) as 'Pledges/all users'
    from
        daysoff.g_apppledges right join daysoff.g_appusers
        on g_apppledges.id=g_appusers.id
    """)

pledgesOverview

## Pledges by week

### SQL query

In [0]:
pledges = sql_to_df(
    """
    select
        id as 'user id', 
        week as 'pledged week', 
        week(week) as 'week number',  
        daycount as 'days pledged'
    from
        daysoff.g_apppledges
    """)

pledges.head()

In [0]:
pledges.info()

In [0]:
# converting strings to datetime 

pledges['pledged week'] = pd.to_datetime(pledges['pledged week'])

### Pledges by week number (2017 and 2018 on the same axis)

In [0]:
pledges_byWeekNumber = pledges.groupby('week number')['user id'].count()

pledges_byWeekNumber.plot(title='Number of pledges by week number (2017 and 2018 on the same axis)');
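
Because week numbers repeat every year, the grouping above counts pledges from both years under the same week number. If one line per year is wanted instead, a sketch along these lines may help. It uses toy data, the column names from the query above, and a hypothetical helper name `pledges_by_year_and_week`.

```python
import pandas as pd


def pledges_by_year_and_week(df):
    """Count pledges per week number, one column per year (illustrative)."""
    year = df['pledged week'].dt.year.rename('year')
    counts = df.groupby([df['week number'], year])['user id'].count()
    # rows: week number, columns: year, missing combinations filled with 0
    return counts.unstack('year', fill_value=0)


toy = pd.DataFrame({
    'user id': [1, 2, 3, 4],
    'pledged week': pd.to_datetime(
        ['2017-01-09', '2017-01-09', '2018-01-08', '2018-01-15']),
    'week number': [2, 2, 1, 2],
})
by_year = pledges_by_year_and_week(toy)
print(by_year)
```

Calling `by_year.plot()` on the result would then draw one line per year.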

### Pledges by calendar week

In [0]:
pledges_byCalendarWeek = pledges.groupby('pledged week')['user id'].count()

pledges_byCalendarWeek.plot(title='Number of pledges - app history timeline');

## How many days do people pledge

In [0]:
pledges_daysPledged = pledges.groupby('days pledged')['user id'].count()

pledges_daysPledged.plot(title='How many users pledge how many days');

# UNFINISHED FROM HERE

### Plot pledge count by month

#### Function to extract month

In [0]:
def extractMonthNumber(row):
    """
    Takes the datetime object in column 'week'
    and returns the month number from it.
    """
    return row['week'].month

pledges['monthNumber'] = pledges.apply(extractMonthNumber, axis=1)

pledges.head()
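
Row-wise `apply` works, but pandas also exposes datetime parts through the `.dt` accessor, which avoids the helper function. A sketch on toy data, assuming the `week` column has already been converted with `pd.to_datetime`:

```python
import pandas as pd

# toy stand-in for the pledges dataframe
toy = pd.DataFrame({
    'week': pd.to_datetime(['2018-01-08', '2018-02-12', '2018-02-19']),
})

# vectorised equivalent of applying extractMonthNumber to every row
toy['monthNumber'] = toy['week'].dt.month
print(toy)
```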

In [0]:
pledges_plot_m = pledges.groupby(['monthNumber', 'daycount'], as_index=False).count()

In [0]:
pledges_plot_m.head()

#### Pledge by month

Colour scheme: you can use ```alt.Color('column name', scale=alt.Scale(scheme='scheme name'))```, where 'scheme name' is a string that matches any of the available Vega colour schemes: https://vega.github.io/vega/docs/schemes/#reference

In [0]:
# Altair is used for the charts below but was not imported earlier
import altair as alt

alt.Chart(pledges_plot_m).mark_line().encode(
    x='daycount',
    y='id',
    color=alt.Color('monthNumber:N', scale=alt.Scale(scheme='tableau20'))
)

#### Pledge by month - normalised - data prep

In [0]:
pledges_plot_m.head(7)

In [0]:
pledges_plot_mg = pledges_plot_m.groupby(['monthNumber'], as_index=False)['id'].sum()

In [0]:
pledges_plot_mg.head()

In [0]:
pledges_plot_mg.columns= ['monthNumber', 'monthSum']
pledges_plot_mg.head()

In [0]:
monthNorm = pd.merge(pledges_plot_m, pledges_plot_mg, how='inner', on='monthNumber')

monthNorm.head(14)

In [0]:
monthNorm['percent'] = monthNorm['id'] / (monthNorm['monthSum'] / 100)

monthNorm.head(14)
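
The group-sum, rename and merge steps above can be collapsed into one step with `groupby(...).transform('sum')`, which returns the group total aligned to each original row. A sketch on toy data with the same column names (the numbers are invented):

```python
import pandas as pd

# toy stand-in for pledges_plot_m: pledge counts per month and daycount
toy = pd.DataFrame({
    'monthNumber': [1, 1, 2, 2],
    'daycount': [1, 2, 1, 2],
    'id': [30, 10, 5, 15],
})

# total pledges of the month, repeated on every row of that month
toy['monthSum'] = toy.groupby('monthNumber')['id'].transform('sum')
# share of the month's pledges that each daycount accounts for
toy['percent'] = 100 * toy['id'] / toy['monthSum']
print(toy)
```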

#### Pledge by month - normalised

In [0]:
alt.Chart(monthNorm).mark_line().encode(
    x='daycount',
    y='percent',
    color=alt.Color('monthNumber:N', scale=alt.Scale(scheme='tableau20'))
)

### Plot most popular day off (when only one day is pledged)

In [0]:
pledges_plot_d = pledges.groupby(['daycount', 'days'], as_index=False).count()

pledges_plot_d.head(7)

In [0]:
# daycount = 1
# when it's just one day pledged
# most popular is 1 (which is Tuesday or Monday?)
# index 0 == Monday?
pledges_plot_d.iloc[0:7,1:3].plot();

In [0]:
daysoff.info()

### Function to get week numbers

In [0]:
# app developers should change the data type of 'date'
# from 'object' to 'datetime' so this isn't needed

daysoff['date'] = pd.to_datetime(daysoff['date'])

In [0]:
def getWeekNumber(row):
    """
    Takes the datetime object in column 'date'
    and returns the week number from it.
    """
    return row['date'].week

daysoff['weekNumber'] = daysoff.apply(getWeekNumber, axis=1)

daysoff.head()

In [0]:
daysoff.info()

In [0]:
daysoff = daysoff.groupby(['id', 'weekNumber'], as_index=False)['date'].count()

daysoff.head()

### Join days off with pledges

In [0]:
pledges.head()

In [0]:
pledges.drop(columns=['week'], inplace=True)

pledges.head()

In [0]:
pledges = pledges.groupby(['id', 'weekNumber', 'daycount'], as_index=False).count()

pledges.sort_values(by='daycount', ascending=False, inplace=True)

pledges.head()

In [0]:
merge = pd.merge(daysoff, pledges, how='inner', on=['id', 'weekNumber'])

merge.head()

In [0]:
merge.drop(columns=['days'], inplace=True)

merge.columns = ['userID', 'weekNumber', 'daysOff', 'daysPledged']

merge.head()

#### Export df to csv

In [0]:
merge.to_csv('df.csv', index=False)

In [0]:
from google.colab import files

In [0]:
files.download('df.csv')

#### On the fly quick analysis

In [0]:
merge.loc[merge['daysPledged'] > merge['daysOff']].count()

In [0]:
merge.loc[merge['daysPledged'] > merge['daysOff']].nunique()

In [0]:
merge.loc[merge['daysPledged'] > merge['daysOff']].mean()

In [0]:
merge.loc[merge['daysPledged'] <= merge['daysOff']].mean()

In [0]:
merge.loc[merge['daysPledged'] < merge['daysOff']].mean()

In [0]:
genAge = sql_to_df(
    """
    SELECT 
        daysoff.g_appusers.id AS userID,
        gender,
        age
    FROM 
        daysoff.g_appusers
    """)

genAge.head()

In [0]:
merge2 = pd.merge(merge, genAge, how='inner', on='userID')

merge2.head()

In [0]:
merge2.nunique()

In [0]:
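# numeric_only=True skips the text column 'gender', which newer pandas refuses to average
merge2.loc[merge2['daysPledged'] > merge2['daysOff']].mean(numeric_only=True)

In [0]:
merge2.loc[merge2['daysPledged'] <= merge2['daysOff']].mean(numeric_only=True)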

In [0]:
merge2.loc[merge2['daysPledged'] > merge2['daysOff']].groupby('gender').count()

In [0]:
merge2.loc[merge2['daysPledged'] <= merge2['daysOff']].groupby('gender').count()

In [0]:
# numeric_only=True skips the text column 'gender' in this and the following cells
merge2.loc[(merge2['daysPledged'] <= merge2['daysOff']) & (merge2['gender'] == 'Male')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] <= merge2['daysOff']) & (merge2['gender'] == 'Female')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & (merge2['gender'] == 'Male')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & (merge2['gender'] == 'Female')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & 
           (merge2['gender'] == 'Female') & 
           (merge2['age'] < 35)].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & 
           (merge2['gender'] == 'Male') & 
           (merge2['age'] < 35)].mean(numeric_only=True)
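
The filter-then-mean cells above can also be expressed as one grouped summary. The sketch below uses invented data with the `merge2` column names; the label column `outcome` and the helper name `summarise_by_outcome` are hypothetical.

```python
import pandas as pd


def summarise_by_outcome(df):
    """Mean of the numeric columns per outcome and gender (illustrative)."""
    outcome = (df['daysPledged'] > df['daysOff']).map(
        {True: 'pledged more', False: 'pledged same or fewer'})
    return (df.assign(outcome=outcome)
              .groupby(['outcome', 'gender'])[['daysOff', 'daysPledged', 'age']]
              .mean())


toy = pd.DataFrame({
    'userID': [1, 2, 3, 4],
    'daysOff': [1, 3, 2, 4],
    'daysPledged': [3, 3, 4, 2],
    'gender': ['Female', 'Male', 'Female', 'Male'],
    'age': [30, 40, 50, 60],
})
summary = summarise_by_outcome(toy)
print(summary)
```

Selecting the three numeric columns explicitly keeps `userID` and `gender` out of the averages.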

In [0]:
genAge.mean(numeric_only=True)

In [0]:
drinks.info()

In [0]:
drinks['gender'] = drinks['gender'].astype('category')
drinks['category'] = drinks['category'].astype('category')
drinks['subcategory'] = drinks['subcategory'].astype('category')
drinks['volume'] = drinks['volume'].astype('category')

In [0]:
drinks_gb = drinks.groupby(['gender', 'age', 'category', 'subcategory', 'volume'], as_index=False)['userID'].count()

drinks_gb.head()

In [0]:
drinks_gb.info()

In [0]:
drinks_gb.dropna(inplace=True)

In [0]:
drinks_gb.head()

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='category',
    x='count(userID)',
    color=alt.Color('subcategory:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='gender',
    x='count(userID)',
    color=alt.Color('category:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    alt.Y('gender:O', scale=alt.Scale(rangeStep=17)),
    alt.X('count(userID):Q',
        axis=alt.Axis(title='User count'),
        #stack='normalize'
    ),
    alt.Color('subcategory:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
# Define aggregate fields
lower_box = 'q1(age):Q'
lower_whisker = 'min(age):Q'
upper_box = 'q3(age):Q'
upper_whisker = 'max(age):Q'

# Compose each layer individually
lower_plot = alt.Chart(drinks_gb).mark_rule().encode(
    x=alt.X(lower_whisker, axis=alt.Axis(title="age")),
    x2=lower_box,
    y='subcategory:O'
)

middle_plot = alt.Chart(drinks_gb).mark_bar(size=5.0).encode(
    x=lower_box,
    x2=upper_box,
    y='subcategory:O'
)

upper_plot = alt.Chart(drinks_gb).mark_rule().encode(
    x=upper_whisker,
    x2=upper_box,
    y='subcategory:O'
)

middle_tick = alt.Chart(drinks_gb).mark_tick(
    color='white',
    size=5.0
).encode(
    x='median(age):Q',
    y='subcategory:O',
)

lower_plot + middle_plot + upper_plot + middle_tick

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='category',
    x='userID'
)

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='subcategory',
    x='userID'
)