[View in Colaboratory](https://colab.research.google.com/github/janilles/dfdapp/blob/janilles-patch-1/DrinkFreeDays.ipynb)

# Drink Free Days app data
From AWS RDS using PyMySQL 

## Install PyDrive for loading files from Google Drive
https://pythonhosted.org/PyDrive/index.html

In [0]:
# added -q for suppressing output
!pip install -U -q PyDrive

# see PyDrive documentation for libraries code snippets
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

### Create and upload a text file with PyDrive (if needed)

```python
uploaded = drive.CreateFile({'title': 'sample_file.txt'})
uploaded.SetContentString('Sample file content.')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))
```

### Get the Google Drive IDs of your file (if needed)

You will need your file's Google Drive ID to load the content of the text file into a variable with PyDrive.  

To get the ID, you can either right-click on the file in your Google Drive and select 'Get shareable link', which gives you a URL with 'id=' in it.
  
Alternatively, you can obtain the ID directly from the notebook by running this code:  

- Step 1:   
Get your file's ID.  To get a list of all the file and folder IDs in the root folder:
```python
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s' % (file1['title'], file1['id']))
```
You can replace ```'root'``` with the ID of the folder your file is in. You can either get the folder ID from the web interface: ```drive.google.com/drive/u/0/folders/<folder ID>```  
or have PyDrive list the IDs for you:
```python
# Paginate file lists and specify number of max results if necessary
for file_list in drive.ListFile({'q': 'trashed=False', 'maxResults': 10}):
  print('Received %s files from Files.list()' % len(file_list)) # <= 10
  for file1 in file_list:
      print('title: %s, id: %s' % (file1['title'], file1['id']))
```
- Step 2:   
Load the file.
```python
downloaded = drive.CreateFile({'id': '<file ID>'})
# you can print the content of the text file to check it
print('Downloaded content "{}"'.format(downloaded.GetContentString()))
```
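
For the shareable-link route, the ID can also be pulled out of the URL in code rather than copied by hand. This is a minimal sketch using only the standard library; `drive_id_from_link` is a hypothetical helper and the link is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def drive_id_from_link(url):
    """Return the value of the 'id' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get('id')
    return values[0] if values else None


# made-up link for illustration; paste your own shareable link here
example_link = 'https://drive.google.com/open?id=EXAMPLE_FILE_ID'
print(drive_id_from_link(example_link))
```

The returned string can be used in place of ```'<file ID>'``` in Step 2.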


### Load the file's content into a variable

In [0]:
# insert ID of the file with the password
# comment out the other user when running this cell

# Jan's file
passwd_file = drive.CreateFile({'id': '1YnGugBHvqjJk0nbTqN-683Agb0vaZKHo'}) 

# Tacey's file
# passwd_file = drive.CreateFile({'id': ' '}) 

# load the password as a string into a variable
# you will use this variable in pymysql connection
# instead of the actual password string
user_passwd = passwd_file.GetContentString()

## Connect to AWS database via PyMySQL
### Return SQL queries as Pandas dataframes
Pandas ```read_sql``` [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html)

In [0]:
# added -q for suppressing output
!pip install -q -U pymysql

import pymysql
import pandas as pd

In [0]:
def connect():  
    return pymysql.connect(
        
        host = "df-phereplica3.crqbvr0pveqx.eu-west-1.rds.amazonaws.com",
        
        #change user name and password as necessary
           
        user = "jan",
        passwd = user_passwd, # text file loaded to variable with PyDrive
   
        # user = "tacey", 
        # passwd = user_passwd,
        
        db = "daysoff",
        autocommit=True
        
        )

connection = connect()

def sql_to_df(sql):
    """
    Returns SQL queries as pandas dataframes
    """
    return pd.read_sql(sql, con = connection)
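
To try the same helper pattern without access to the RDS instance, the sketch below swaps in Python's built-in `sqlite3` for PyMySQL. The in-memory database, the `demo_users` table and its rows are all made up for illustration; only the `pd.read_sql` call mirrors the function above.

```python
import sqlite3

import pandas as pd

# stand-in for the PyMySQL connection; the table and rows are made up
demo_connection = sqlite3.connect(':memory:')
demo_connection.execute("CREATE TABLE demo_users (id INTEGER, gender TEXT)")
demo_connection.executemany(
    "INSERT INTO demo_users VALUES (?, ?)",
    [(1, 'Female'), (2, 'Male'), (3, 'Female')])


def demo_sql_to_df(sql):
    """Same pattern as sql_to_df, but bound to the demo connection."""
    return pd.read_sql(sql, con=demo_connection)


demo_df = demo_sql_to_df(
    "SELECT gender, COUNT(*) AS userCount "
    "FROM demo_users GROUP BY gender ORDER BY gender")
print(demo_df)
```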

# Database schema
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax.

## Show all tables
Only the following tables contain app users' IDs:

- g_appdrinks  
- g_appmotivations  
- g_apppledges  
- g_appusers  


In [0]:
sql_to_df("SHOW TABLES")

## Describe tables
These are the five tables with app user info.

In [0]:
# increase column width so that longer comments don't get truncated

pd.set_option('display.max_colwidth', 100)

### App users table

In [0]:
sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appusers'
        """)

### Drinks table

In [0]:
sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdrinks'
        """)

### Motivations table

In [0]:
sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appmotivations'
        """)

### Pledges table

In [0]:
sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_apppledges'
        """)

### Days off table

In [0]:
sql_to_df("""
        select 
            table_name, column_name, data_type, column_comment
        from 
            information_schema.columns
        where
            table_name = 'g_appdaysoff'
        """)
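
The five describe queries above differ only in the table name, so they could be generated from a list. This is a sketch, not part of the original notebook: `describe_query` is a hypothetical helper, and it only accepts plain identifiers because the name is interpolated into the SQL string.

```python
import re

TABLES = ['g_appusers', 'g_appdrinks', 'g_appmotivations',
          'g_apppledges', 'g_appdaysoff']


def describe_query(table_name):
    """Build the information_schema query for one table (hypothetical helper)."""
    # allow only plain identifiers, because the name is interpolated into SQL
    if not re.fullmatch(r'\w+', table_name):
        raise ValueError('unexpected table name: {!r}'.format(table_name))
    return """
        select
            table_name, column_name, data_type, column_comment
        from
            information_schema.columns
        where
            table_name = '{}'
        """.format(table_name)


queries = {name: describe_query(name) for name in TABLES}
print(queries['g_appusers'])
```

Each string in `queries` can then be passed to `sql_to_df`.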

# Reports
SQL queries are passed as strings to the ```sql_to_df()``` function defined above, which runs them over the PyMySQL connection.  
See [MySQL documentation](https://dev.mysql.com/doc/refman/8.0/en/introduction.html) for SQL syntax reference.

## Questions  

DONE:
- How long are people using the app for? 
- Does the order of selected motivations correspond to the order of most selected motivations? (ANSWER: No.)
- When do people join, when do people pledge? Is there a time of the year difference? (ANSWER: No difference.)
- How many users are pledging 1, 2, 3 or more days? Or are they pledging too much? Too little? (ANSWER: They're pledging too much. UX recommendation: "Pledge less and build it up - you'll be more likely to meet your pledges.")
- Motivations by season - any variance? (ANSWER: No variance.)
- Percentages of people using the app for 0, 1, 7-30, 30+ days. (NOTE: User journey is not designed for long-term usage/drinks tracking.)

TO DO:
- Calculate risk levels (based on units per week). Do they improve over time with days off achieved? (NOTE: We need more information to know how to calculate this from the data we have.)


## How long are people using the app for?

### SQL to df

In [0]:
duration = sql_to_df(
    """
    select
        id, joined, lastseen
    from
        daysoff.g_appusers
    order by
        joined    
    """)

duration.head()

In [0]:
duration.info()

### Alternative SQL query  - question

In [0]:
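# DAY() returns the day of the month (1-31), so this diff can never exceed 30,
# whereas pandas subtracts the full timestamps and reports 578 elapsed days;
# DATEDIFF(lastseen, joined) would return elapsed days in SQL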
test = sql_to_df(
    """
    select
        id, joined, lastseen, DAY(lastseen) - DAY(joined) as diff
    from
        daysoff.g_appusers
    order by
        diff desc
    """)

test.head()
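
The gap between the two results comes from what is being subtracted. A small sketch with made-up dates shows the day-of-month difference (what `DAY(lastseen) - DAY(joined)` computes) next to the elapsed days (what subtracting the timestamps gives, and what MySQL's `DATEDIFF` would return).

```python
import pandas as pd

# made-up dates for illustration
joined = pd.Timestamp('2017-01-01')
lastseen = pd.Timestamp('2018-08-02')

# what DAY(lastseen) - DAY(joined) computes: a difference of days of the month
day_of_month_diff = lastseen.day - joined.day

# what subtracting the two timestamps computes: elapsed days
elapsed_days = (lastseen - joined).days

print(day_of_month_diff, elapsed_days)
```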

### Calculating difference 

In [0]:
duration['diff'] = duration['lastseen'] - duration['joined']

duration.sort_values(by=['diff'], inplace=True, ascending=False)

duration.head()

In [0]:
def extractDays(row):
    """
    Takes the timedelta object in column 'diff'
    and returns only the days from it.
    """
    return row['diff'].days

duration['days'] = duration.apply(extractDays, axis=1)

duration.head()
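
As an alternative to the row-wise `apply`, the `.dt` accessor extracts the days from a whole timedelta column at once. The sketch below uses a made-up two-row frame rather than the real `duration` data.

```python
import pandas as pd

# made-up sample with the same column names as the duration frame
sample = pd.DataFrame({
    'joined': pd.to_datetime(['2017-01-01 08:00', '2017-02-01 12:00']),
    'lastseen': pd.to_datetime(['2017-01-01 20:00', '2017-02-11 09:00']),
})

sample['diff'] = sample['lastseen'] - sample['joined']

# whole days only: 12 hours counts as 0 days, 9 days 21 hours as 9
sample['days'] = sample['diff'].dt.days

print(sample)
```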

In [0]:
duration_plot = duration.groupby('days').count()

In [0]:
# all values
duration_plot.iloc[:,0:1].plot();

In [0]:
# 1+ days
duration_plot.iloc[1:,0:1].plot();

In [0]:
# between 7 and 31 days (label-based slice on the 'days' index, both ends included)
duration_plot.loc[7:31].iloc[:,0:1].plot();

In [0]:
# between 0 and 31 days
duration_plot.iloc[:32,0:1].plot();

In [0]:
# between 1 and 14 days
duration_plot.iloc[1:15,0:1].plot();

### Calculating percentages

In [0]:
duration_pc = duration_plot.drop(columns=['joined', 'lastseen', 'diff'])

duration_pc['perCent'] = duration_pc['id'] / (duration_pc['id'].sum() / 100)

duration_pc.head(31)

In [0]:
duration_pcb = duration.groupby('days', as_index=False).count()

duration_pcb['perCent'] = duration_pcb['id'] / (duration_pcb['id'].sum() / 100)

duration_pcb.head()

In [0]:
duration_pcb.head(8)

In [0]:
duration_pcb[(duration_pcb['days'] > 0) & (duration_pcb['days'] < 8)]['perCent'].sum()

In [0]:
duration_pcb[(duration_pcb['days'] > 7) & (duration_pcb['days'] < 32)]['perCent'].sum()

In [0]:
duration_pcb[duration_pcb['days'] > 31]['perCent'].sum()

In [0]:
longer = duration_pcb[duration_pcb['days'] > 31]['perCent'].sum()

month = duration_pcb[(duration_pcb['days'] > 7) & (duration_pcb['days'] < 32)]['perCent'].sum() 

week = duration_pcb[(duration_pcb['days'] > 0) & (duration_pcb['days'] < 8)]['perCent'].sum() 

zero = duration_pcb[duration_pcb['days'] == 0]['perCent'].sum()

L1 = [zero, week, month, longer]

L2 = [zero, (week+month+longer), (month+longer), longer]

duration_dropoff = pd.DataFrame({'buckets':L1, 
                      'dropoff':L2, 
                      'def':['zero', 'at least a day', 'at least a week', 'at least a month']})

duration_dropoff
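
The same buckets can be sketched with `pd.cut`, which assigns each duration to a labelled interval in one step. The bin edges below follow the filters used above (0, 1 to 7, 8 to 31, more than 31 days); the sample values and the bucket labels are made up for illustration.

```python
import pandas as pd

# made-up durations in days
days = pd.Series([0, 0, 1, 5, 7, 8, 20, 31, 32, 100])

# right-inclusive intervals: (-1, 0], (0, 7], (7, 31], (31, inf)
bins = [-1, 0, 7, 31, float('inf')]
labels = ['zero', '1-7 days', '8-31 days', '32+ days']

buckets = pd.cut(days, bins=bins, labels=labels)

# percentage of users per bucket, in bucket order
share = buckets.value_counts(normalize=True).reindex(labels) * 100

# everyone who came back at least once
at_least_a_day = 100 - share['zero']

print(share)
print(at_least_a_day)
```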

In [0]:
duration_dropoff.plot();

In [0]:
import altair as alt

In [0]:
bars = alt.Chart(duration_dropoff).mark_bar().encode(
    y='def',
    x='dropoff'
)

bars

## Motivations

![](https://raw.githubusercontent.com/janilles/dfdapp/master/daysoffmotivations.PNG)


### Motivations by gender

In [0]:
gender_motivations = sql_to_df(
    """
    SELECT 
        COUNT(DISTINCT daysoff.g_appusers.id) AS userCount,
        gender,
        motivation
    FROM 
        daysoff.g_appmotivations
    JOIN 
        daysoff.g_appusers
    USING 
        (id)
    GROUP BY 
        gender, motivation
    ORDER BY
        motivation
    """)

gender_motivations.head(14)

In [0]:
# visualisations library
# https://altair-viz.github.io/
import altair as alt

# for interactive charts
alt.renderers.enable('colab');

In [0]:
alt.Chart(gender_motivations).mark_bar().encode(
    alt.Color('gender:N'),
    alt.Y('motivation:N'),
    alt.X('userCount', stack='normalize')
)

In [0]:
alt.Chart(gender_motivations).mark_bar(stroke='transparent').encode(
    alt.Y('motivation:N', scale=alt.Scale(rangeStep=12), axis=alt.Axis(title='')),
    alt.X('userCount:Q', axis=alt.Axis(title='user count', grid=False)),
    color=alt.Color('gender:N', scale=alt.Scale(range=["#EA98D2", "#659CCA"])),
    column='gender:O'
).configure_view(
    stroke='transparent'
).configure_axis(
    domainWidth=0.8
)

In [0]:
alt.Chart(gender_motivations).mark_bar(stroke='transparent').encode(
    alt.Y('gender:N', axis=alt.Axis(title='')),
    alt.X('userCount:Q', axis=alt.Axis(title='user count', grid=False)),
    color=alt.Color('gender:N', scale=alt.Scale(range=["#EA98D2", "#659CCA"])),
    column='motivation:O'
).configure_view(
    stroke='transparent'
).configure_axis(
    domainWidth=0.8
)

In [0]:
pivoted = pd.pivot_table(gender_motivations, 
                 values='userCount', 
                 index='motivation', 
                 columns='gender').reset_index()

pivoted

In [0]:
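# reordering to match the order on app screen
# reindex returns a new frame, so the result has to be assigned;
# the chart still needs an explicit sort on its axis, because Altair
# orders ordinal axes alphabetically by default
pivoted = pivoted.reindex([3,2,4,1,5,6,0])

pivoted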

In [0]:
base = alt.Chart(pivoted).encode(
    alt.X('motivation:O',
        axis=alt.Axis(format='%b'),
        scale=alt.Scale(zero=False)
    )
)

bar = base.mark_bar().encode(
    y='Female'
)


line =  base.mark_line(color='red').encode(
    y='Male',
)

alt.layer(
    bar,
    line
).resolve_scale(
    y='independent'
)


### Motivations by age

In [0]:
age_motivations = sql_to_df(
    """
    SELECT 
        COUNT(DISTINCT daysoff.g_appusers.id) AS userCount,
        age,
        motivation
    FROM 
        daysoff.g_appmotivations
    JOIN 
        daysoff.g_appusers
    USING 
        (id)
    GROUP BY
        age, motivation
    ORDER BY
        userCount, age, motivation
    """)

age_motivations.head()

In [0]:
alt.Chart(age_motivations).mark_line().encode(
    x='age',
    y='userCount',
    color='motivation'
)

In [0]:
alt.Chart(age_motivations).mark_bar().encode(
    alt.X('age:O', scale=alt.Scale(rangeStep=17)),
    alt.Y('userCount:Q',
        axis=alt.Axis(title='User count'),
        stack='normalize'
    ),
    alt.Color('motivation:N')
)

In [0]:
alt.Chart(age_motivations).mark_bar().encode(
    alt.X('userCount:Q', scale=alt.Scale(rangeStep=17)),
    alt.Y('age:O',
        axis=alt.Axis(title='Age'),
    ),
    alt.Color('motivation:N')
)

### Motivations by seasons

In [0]:
season_motivations = sql_to_df(
    """
    SELECT 
        daysoff.g_appusers.id as userID,
        joined,
        motivation
    FROM 
        daysoff.g_appmotivations
    JOIN 
        daysoff.g_appusers
    USING 
        (id)
    """)

season_motivations.head()

In [0]:
def extractMonthSeason(row):
    """
    Takes the datetime object in column 'joined'
    and returns the month number from it.
    """
    return row['joined'].month

season_motivations['monthNumber'] = season_motivations.apply(extractMonthSeason, axis=1)

season_motivations.head()
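
If the comparison is meant to be by season rather than by month, the month number can be mapped to a season label. This is only a sketch: the meteorological, northern-hemisphere grouping in `SEASON_BY_MONTH` is an assumption, and the sample dates are made up.

```python
import pandas as pd

# assumption: meteorological seasons for the northern hemisphere
SEASON_BY_MONTH = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

# made-up join dates
sample = pd.DataFrame({
    'joined': pd.to_datetime(
        ['2017-01-17', '2017-04-02', '2017-07-30', '2017-11-05'])
})

sample['monthNumber'] = sample['joined'].dt.month
sample['season'] = sample['monthNumber'].map(SEASON_BY_MONTH)

print(sample)
```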

In [0]:
plot_season = season_motivations.groupby(['motivation', 'monthNumber'], as_index=False)['userID'].count()

plot_season.head()

In [0]:
alt.Chart(plot_season).mark_line().encode(
    x='monthNumber',
    y='userID',
    color=alt.Color('motivation:N', scale=alt.Scale(scheme='tableau20'))
)

## When are people joining?

### SQL to df

In [0]:
df = sql_to_df(
    """
    select
        id, joined
    from
        daysoff.g_appusers
    order by
        joined    
    """)

df.head()

### Function to extract months

In [0]:
def extractMonths(row):
    """
    Takes the datetime object in column 'joined'
    and returns only the months from it.
    """
    return row['joined'].month

df['month'] = df.apply(extractMonths, axis=1)

df.head()

### Plot

In [0]:
df_plot = df.groupby('month').count()

df_plot

In [0]:
# all values
df_plot.iloc[:,0:1].plot();

### Calculating percentages

In [0]:
df_pc = df_plot.drop(columns=['joined'])

df_pc['perCent'] = df_pc['id'] / (df_pc['id'].sum() / 100)

df_pc.head(12)

## How many active app users are there each day?
The total for each day is defined as all users who have joined since the start date of the specified period, less those users who had stopped using the app to report self-moderation by the previous day.

### SQL to df

In [0]:
appUsers = sql_to_df(
    """
    select
        id, joined, lastseen
    from
        daysoff.g_appusers
    where
        joined between '2017-01-17 00:00:00' and '2017-03-31 23:59:59'
    order by
        joined    
    """)

appUsers.head()


### Drop times from datetime columns 

In [0]:
# without pd.to_datetime it returns the dates as dtype object

appUsers['joined'] = pd.to_datetime(appUsers['joined'].dt.date)
appUsers['lastseen'] = pd.to_datetime(appUsers['lastseen'].dt.date)

appUsers.head()
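
Another way to drop the time part, sketched here on made-up timestamps, is `.dt.normalize()`, which sets every time to midnight and keeps the datetime dtype, so no round trip through `pd.to_datetime` is needed.

```python
import pandas as pd

# made-up timestamps
stamps = pd.Series(pd.to_datetime(['2017-01-17 09:30:00',
                                   '2017-03-31 23:59:59']))

# approach used above: extract dates, then convert back to datetimes
via_date = pd.to_datetime(stamps.dt.date)

# alternative: reset the time to midnight in one step
via_normalize = stamps.dt.normalize()

print(via_normalize)
```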

In [0]:
appUsers.info()

### Loop counting total active app users by day

In [0]:
# we need this for arithmetic operations on dates
# when adding +1 day to our date

from datetime import timedelta

In [0]:
#Handy way to count number of days for loop

pd.to_datetime('2017-03-31') - pd.to_datetime('2017-01-17')

In [0]:
#loop cycles through number of days from start date up to day limit (both hardcoded)
#set lastDate to same day as joinDate because some users will join and leave on same date
results = []

for i in range(74):  # 73-day difference + 1, so that 2017-03-31 is included
    joinDate = pd.to_datetime('2017-01-17') + timedelta(days=i)
    lastDate = pd.to_datetime('2017-01-17') + timedelta(days=i)
    x = appUsers['id'].loc[(appUsers['joined'] <= joinDate) & 
                           (appUsers['lastseen'] >= lastDate)].count()
    results.append(x)
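
The hardcoded day count can be avoided by iterating over `pd.date_range`, which includes both endpoints (74 calendar dates for 2017-01-17 to 2017-03-31). The sketch below applies the same active-user condition to a made-up three-user frame over four days.

```python
import pandas as pd

# made-up users: joined and lastseen already truncated to dates
users = pd.DataFrame({
    'id': [1, 2, 3],
    'joined': pd.to_datetime(['2017-01-17', '2017-01-18', '2017-01-18']),
    'lastseen': pd.to_datetime(['2017-01-17', '2017-01-20', '2017-01-18']),
})

days = pd.date_range('2017-01-17', '2017-01-20', freq='D')

# a user is active on a day if they joined on or before it
# and were last seen on or after it
active = pd.Series(
    [int(((users['joined'] <= day) & (users['lastseen'] >= day)).sum())
     for day in days],
    index=days)

print(active)

# number of calendar dates in the period used in this notebook
full_period = pd.date_range('2017-01-17', '2017-03-31', freq='D')
print(len(full_period))
```

Because the result is indexed by date, it can be plotted against calendar days directly.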

In [0]:
usersEachDay = pd.DataFrame(results)

usersEachDay.plot(kind='bar', 
                  title='Total app users by day', 
                  legend=False);

### Counting users joining each day - to check above

In [0]:
#Note: must change number of days and start date to match spec in loop

joining = []

for i in range(74):  # 73-day difference + 1, so that 2017-03-31 is included
    joinDate = pd.to_datetime('2017-01-17') + timedelta(days=i)
    x = appUsers['id'].loc[(appUsers['joined'] == joinDate)].count()
    joining.append(x)

In [0]:
joiningEachDay = pd.DataFrame(joining)

joiningEachDay.plot(kind='bar', 
                  title='Total app users joining by day', 
                  legend=False);

### Plotting total active users by calendar date


In [0]:
#Change format from YYYY-MM-DD HH:MM:SS to YYYY-MM-DD 

resultsYear = []
calendarDay = []

for i in range(74):  # 73-day difference + 1, so that 2017-03-31 is included
    joinDate = pd.to_datetime('2017-01-17') + timedelta(days=i)
    lastDate = pd.to_datetime('2017-01-17') + timedelta(days=i)
    x = appUsers['id'].loc[(appUsers['joined'] <= joinDate) & 
                               (appUsers['lastseen'] >= lastDate)].count()
    resultsYear.append(x)
    calendarDay.append(joinDate)

In [0]:
appUsers = pd.DataFrame({'day':calendarDay,'users':resultsYear})

appUsers.set_index('day', inplace=True)

In [0]:
appUsers.plot(kind='line', 
                  title='Total active app users by calendar day', 
                  legend=False);

### Export df to csv

In [0]:
# keep the index in the export: it holds the calendar day
appUsers.to_csv('appUsers.csv')

#usersEachDay.to_csv('appUsers.csv', index=False)

from google.colab import files

files.download('appUsers.csv')

## Pledges

### Overview 
* number of users (=number of downloads)
* number of users who pledge 
* number of pledges made

In [0]:
pledgesCount = sql_to_df(
    """
    select
        count(distinct g_appusers.id) as 'Number of users (downloads)',
        count(distinct g_apppledges.id) as 'Number of users who have pledged',
        count(g_apppledges.id) as 'Number of pledges made'       
    from
        daysoff.g_apppledges right join daysoff.g_appusers
        on g_apppledges.id=g_appusers.id
    """)

pledgesCount
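
The three counts rely on the outer join keeping users who never pledged. The same logic can be sketched in pandas on made-up frames, assuming `id` identifies the user in both tables as in the query above; a left join from users to pledges plays the role of the SQL right join.

```python
import pandas as pd

# made-up data: user 2 never pledged, user 1 pledged twice
users = pd.DataFrame({'id': [1, 2, 3]})
pledges = pd.DataFrame({'id': [1, 1, 3], 'daycount': [2, 3, 1]})

# keep every user; _merge marks rows that found a matching pledge
joined = users.merge(pledges, on='id', how='left', indicator=True)
has_pledge = joined['_merge'] == 'both'

n_users = joined['id'].nunique()                     # all users
n_pledgers = joined.loc[has_pledge, 'id'].nunique()  # users with a pledge
n_pledges = int(has_pledge.sum())                    # pledge rows

print(n_users, n_pledgers, n_pledges)
```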

### SQL to df

In [0]:
pledges = sql_to_df(
    """
    select
        id, week, daycount, days
    from
        daysoff.g_apppledges
    order by
        daycount desc
    """)

pledges.head()

In [0]:
pledges.info()

In [0]:
# converting strings to datetime 
pledges['week'] = pd.to_datetime(pledges['week'])

### Function to extract weeks

In [0]:
def extractWeekNumber(row):
    """
    Takes the datetime object in column 'week'
    and returns the week number from it.
    """
    return row['week'].week

pledges['weekNumber'] = pledges.apply(extractWeekNumber, axis=1)

pledges.head()
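
Week numbers repeat every year, so if the pledges span more than one year (an assumption worth checking against the data), rows from different years share a `weekNumber`. A sketch with made-up dates, using `Series.dt.isocalendar()` (pandas 1.1 or later) to keep the ISO year alongside the week:

```python
import pandas as pd

# made-up week dates: two Mondays a year apart, and a Sunday
weeks = pd.Series(pd.to_datetime(['2017-01-02', '2018-01-01', '2017-01-01']))

# DataFrame with columns 'year', 'week' and 'day'
iso = weeks.dt.isocalendar()

iso_years = iso['year'].tolist()
iso_weeks = iso['week'].tolist()

print(iso)
```

The first two dates both fall in week 1, and only the ISO year tells them apart.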

### Plot when people pledge

In [0]:
pledges_plot = pledges.groupby('weekNumber').count()

In [0]:
# all values
pledges_plot.iloc[:,0:1].plot();

### Plot how much people pledge

In [0]:
pledges_daycount = pledges.groupby('daycount').count()

In [0]:
# all values
pledges_daycount.iloc[:,0:1].plot();

### Plot pledge count by month

#### Function to extract month

In [0]:
def extractMonthNumber(row):
    """
    Takes the datetime object in column 'week'
    and returns the month number from it.
    """
    return row['week'].month

pledges['monthNumber'] = pledges.apply(extractMonthNumber, axis=1)

pledges.head()

In [0]:
pledges_plot_m = pledges.groupby(['monthNumber', 'daycount'], as_index=False).count()

In [0]:
pledges_plot_m.head()

#### Pledge by month

Colour scheme: You can also use ```alt.Color('column name', scale=alt.Scale(scheme='scheme name'))``` where scheme_name is a string that matches any of the available Vega color schemes: https://vega.github.io/vega/docs/schemes/#reference

In [0]:
alt.Chart(pledges_plot_m).mark_line().encode(
    x='daycount',
    y='id',
    color=alt.Color('monthNumber:N', scale=alt.Scale(scheme='tableau20'))
)

#### Pledge by month - normalised - data prep

In [0]:
pledges_plot_m.head(7)

In [0]:
pledges_plot_mg = pledges_plot_m.groupby(['monthNumber'], as_index=False)['id'].sum()

In [0]:
pledges_plot_mg.head()

In [0]:
pledges_plot_mg.columns= ['monthNumber', 'monthSum']
pledges_plot_mg.head()

In [0]:
monthNorm = pd.merge(pledges_plot_m, pledges_plot_mg, how='inner', on='monthNumber')

monthNorm.head(14)

In [0]:
monthNorm['percent'] = monthNorm['id'] / (monthNorm['monthSum'] / 100)

monthNorm.head(14)
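
The group-sum, merge and divide steps can also be sketched in a single expression with `groupby(...).transform('sum')`, which returns the month total aligned to every row. The counts below are made up; the column names follow `pledges_plot_m`.

```python
import pandas as pd

# made-up counts per month and daycount
counts = pd.DataFrame({
    'monthNumber': [1, 1, 2, 2],
    'daycount': [1, 2, 1, 2],
    'id': [30, 10, 5, 15],
})

# month total repeated on each row of that month
month_sum = counts.groupby('monthNumber')['id'].transform('sum')

counts['percent'] = counts['id'] / month_sum * 100

print(counts)
```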

#### Pledge by month - normalised

In [0]:
alt.Chart(monthNorm).mark_line().encode(
    x='daycount',
    y='percent',
    color=alt.Color('monthNumber:N', scale=alt.Scale(scheme='tableau20'))
)

### Plot most popular day off (when only one day is pledged)

In [0]:
pledges_plot_d = pledges.groupby(['daycount', 'days'], as_index=False).count()

pledges_plot_d.head(7)

In [0]:
# daycount = 1
# when it's just one day pledged
# most popular is 1 (which is Tuesday or Monday?)
# index 0 == Monday?
pledges_plot_d.iloc[0:7,1:3].plot();

## Days off

### SQL to df

In [0]:
daysoff = sql_to_df(
    """
    SELECT 
        *
    FROM 
        daysoff.g_appdaysoff
    """)

daysoff.head()

In [0]:
daysoff.info()

### Function to get week numbers

In [0]:
# app developers should change the data type of 'date'
# from 'object' to 'datetime' so this isn't needed

daysoff['date'] = pd.to_datetime(daysoff['date'])

In [0]:
def getWeekNumber(row):
    """
    Takes the datetime object in column 'date'
    and returns the week number from it.
    """
    return row['date'].week

daysoff['weekNumber'] = daysoff.apply(getWeekNumber, axis=1)

daysoff.head()

In [0]:
daysoff.info()

In [0]:
daysoff = daysoff.groupby(['id', 'weekNumber'], as_index=False)['date'].count()

daysoff.head()

### Join days off with pledges

In [0]:
pledges.head()

In [0]:
pledges.drop(columns=['week'], inplace=True)

pledges.head()

In [0]:
pledges = pledges.groupby(['id', 'weekNumber', 'daycount'], as_index=False).count()

pledges.sort_values(by='daycount', ascending=False, inplace=True)

pledges.head()

In [0]:
merge = pd.merge(daysoff, pledges, how='inner', on=['id', 'weekNumber'])

merge.head()

In [0]:
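# 'monthNumber' is also carried over from pledges when the month cells above
# have been run; drop it too so that four columns remain for the renaming
merge.drop(columns=['days', 'monthNumber'], errors='ignore', inplace=True)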

merge.columns = ['userID', 'weekNumber', 'daysOff', 'daysPledged']

merge.head()

#### Export df to csv

In [0]:
merge.to_csv('df.csv', index=False)

In [0]:
from google.colab import files

In [0]:
files.download('df.csv')

#### On the fly quick analysis

In [0]:
merge.loc[merge['daysPledged'] > merge['daysOff']].count()

In [0]:
merge.loc[merge['daysPledged'] > merge['daysOff']].nunique()

In [0]:
merge.loc[merge['daysPledged'] > merge['daysOff']].mean()

In [0]:
merge.loc[merge['daysPledged'] <= merge['daysOff']].mean()

In [0]:
merge.loc[merge['daysPledged'] < merge['daysOff']].mean()
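
The comparisons above can be condensed into a per-week shortfall and an overall success rate. This is a sketch on a made-up frame with the same column names as `merge`; `shortfall` and `met` are names introduced here for illustration.

```python
import pandas as pd

# made-up weekly records
sample = pd.DataFrame({
    'userID': [1, 1, 2],
    'weekNumber': [1, 2, 1],
    'daysOff': [2, 1, 3],
    'daysPledged': [2, 3, 2],
})

# positive values mean fewer days off than pledged
sample['shortfall'] = sample['daysPledged'] - sample['daysOff']

# True where the pledge was met or exceeded
sample['met'] = sample['shortfall'] <= 0

# share of user-weeks in which the pledge was met
met_rate = sample['met'].mean()

print(sample)
print(met_rate)
```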

In [0]:
genAge = sql_to_df(
    """
    SELECT 
        daysoff.g_appusers.id AS userID,
        gender,
        age
    FROM 
        daysoff.g_appusers
    """)

genAge.head()

In [0]:
merge2 = pd.merge(merge, genAge, how='inner', on='userID')

merge2.head()

In [0]:
merge2.nunique()
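
The gender comparisons in this section can also be produced in one pass by grouping on gender and on whether the pledge was met. A sketch on made-up rows with the same column names as `merge2`; `metPledge` is a name introduced here.

```python
import pandas as pd

# made-up rows
sample = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'daysOff': [2, 1, 3, 4],
    'daysPledged': [2, 3, 2, 5],
    'gender': ['Female', 'Female', 'Male', 'Male'],
    'age': [30, 30, 45, 28],
})

sample['metPledge'] = sample['daysOff'] >= sample['daysPledged']

# one row per (gender, metPledge) combination
summary = (sample
           .groupby(['gender', 'metPledge'])[['daysOff', 'daysPledged', 'age']]
           .mean())

print(summary)
```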

In [0]:
# numeric_only=True skips the text column 'gender',
# which recent pandas versions refuse to average
merge2.loc[merge2['daysPledged'] > merge2['daysOff']].mean(numeric_only=True)

In [0]:
merge2.loc[merge2['daysPledged'] <= merge2['daysOff']].mean(numeric_only=True)

In [0]:
merge2.loc[merge2['daysPledged'] > merge2['daysOff']].groupby('gender').count()

In [0]:
merge2.loc[merge2['daysPledged'] <= merge2['daysOff']].groupby('gender').count()

In [0]:
merge2.loc[(merge2['daysPledged'] <= merge2['daysOff']) & (merge2['gender'] == 'Male')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] <= merge2['daysOff']) & (merge2['gender'] == 'Female')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & (merge2['gender'] == 'Male')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & (merge2['gender'] == 'Female')].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & 
           (merge2['gender'] == 'Female') & 
           (merge2['age'] < 35)].mean(numeric_only=True)

In [0]:
merge2.loc[(merge2['daysPledged'] > merge2['daysOff']) & 
           (merge2['gender'] == 'Male') & 
           (merge2['age'] < 35)].mean(numeric_only=True)

In [0]:
genAge.mean(numeric_only=True)

## Drinks
TO DO:  
- Do men drink more volume/percent than women on average?
- What days do people drink on most vs pledge days?

### SQL to df

In [0]:
drinks = sql_to_df(
    """
    SELECT 
        daysoff.g_appusers.id AS userID,
        gender,
        age,
        category,
        subcategory,
        volume
    FROM 
        daysoff.g_appusers
    JOIN 
        daysoff.g_appdrinks
    USING 
        (id)
    """)

drinks.head()

In [0]:
drinks.info()

In [0]:
drinks['gender'] = drinks['gender'].astype('category')
drinks['category'] = drinks['category'].astype('category')
drinks['subcategory'] = drinks['subcategory'].astype('category')
drinks['volume'] = drinks['volume'].astype('category')

In [0]:
drinks_gb = drinks.groupby(['gender', 'age', 'category', 'subcategory', 'volume'], as_index=False)['userID'].count()

drinks_gb.head()

In [0]:
drinks_gb.info()

In [0]:
drinks_gb.dropna(inplace=True)

In [0]:
drinks_gb.head()

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='category',
    x='count(userID)',
    color=alt.Color('subcategory:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='gender',
    x='count(userID)',
    color=alt.Color('category:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    alt.Y('gender:O', scale=alt.Scale(rangeStep=17)),
    alt.X('count(userID):Q',
        axis=alt.Axis(title='User count'),
        #stack='normalize'
    ),
    alt.Color('subcategory:N', scale=alt.Scale(scheme='tableau20'))
)

In [0]:
# Define aggregate fields
lower_box = 'q1(age):Q'
lower_whisker = 'min(age):Q'
upper_box = 'q3(age):Q'
upper_whisker = 'max(age):Q'

# Compose each layer individually
lower_plot = alt.Chart(drinks_gb).mark_rule().encode(
    x=alt.X(lower_whisker, axis=alt.Axis(title="age")),
    x2=lower_box,
    y='subcategory:O'
)

middle_plot = alt.Chart(drinks_gb).mark_bar(size=5.0).encode(
    x=lower_box,
    x2=upper_box,
    y='subcategory:O'
)

upper_plot = alt.Chart(drinks_gb).mark_rule().encode(
    x=upper_whisker,
    x2=upper_box,
    y='subcategory:O'
)

middle_tick = alt.Chart(drinks_gb).mark_tick(
    color='white',
    size=5.0
).encode(
    x='median(age):Q',
    y='subcategory:O',
)

lower_plot + middle_plot + upper_plot + middle_tick

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='category',
    x='userID'
)

In [0]:
alt.Chart(drinks_gb).mark_bar().encode(
    y='subcategory',
    x='userID'
)

### Top drinkers - TBC

In [0]:
drinks = sql_to_df(
    """
    SELECT 
        *
    FROM 
        daysoff.g_appdrinks
    """)

drinks.head()

In [0]:
drinks.info()

In [0]:
drinks['day'] = drinks['day'].astype('category')
drinks['codeDay'] = drinks['day'].cat.codes

drinks.head()
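
By default, `astype('category')` orders string categories alphabetically, so the codes do not follow the order of the week. If `day` holds English weekday names (an assumption; check the column's actual values first), an ordered categorical gives codes in weekday order. A sketch on made-up values:

```python
import pandas as pd

# assumption: 'day' holds English weekday names
WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

# made-up values
sample = pd.DataFrame({'day': ['Tuesday', 'Monday', 'Sunday', 'Monday']})

# default behaviour: categories sorted alphabetically
default_codes = sample['day'].astype('category').cat.codes.tolist()

# explicit, ordered categories: codes follow the week
ordered_day = pd.Categorical(sample['day'], categories=WEEKDAYS, ordered=True)
ordered_codes = ordered_day.codes.tolist()

print(default_codes, ordered_codes)
```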