Data set idea: weight loss

Variables:
- day - daily time series

- weight 
    - estimated by calories in / calories out based on 3500cal as 450g lost (https://www.mayoclinic.org/healthy-lifestyle/weight-loss/in-depth/calories/art-20048065) so every 1 calorie = 0.128205128g
    - Weight loss will be calculated each day: (calories out - calories in) * 0.128205128 +/- random noise, as weight loss is not exact


- calories in based on logging food with LoseIt 
    - split by carbs/protein/fat? put into separate variables by percent?
    - Not going to split by c/p/f in dataset but will just state calorie count assuming she maintains good ratios 
    
    
- calories out - based on BMR/TDEE and exercise
- target calorie amount - same number throughout
- over/under calorie target
- exercise 
    - boolean?  True/False whether I exercised or not
    - categorical? Listing different exercises (walk, run, yoga class, weight training)
    - estimated calories burned?
 - calories out - tdee + exercise 
     - function created to calculate tdee as it fluctuates each day
     - maybe +/- random amount to exercise so it's not so samey
    
Run weekly - 52 weeks per year over 2 years = 104 rows
or 
Daily - 17 weeks from 1 January 2019 (January to late April) = 119 rows
Can't figure out which would be better

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Time" data-toc-modified-id="Time-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Time</a></span></li><li><span><a href="#Calories" data-toc-modified-id="Calories-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Calories</a></span></li><li><span><a href="#Weight-Loss" data-toc-modified-id="Weight-Loss-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Weight Loss</a></span></li><li><span><a href="#Exercise" data-toc-modified-id="Exercise-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Exercise</a></span></li><li><span><a href="#Calories" data-toc-modified-id="Calories-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Calories</a></span></li></ul></div>

This project simulates a dataset created by a woman who decided to track her weight loss efforts over the course of a year. She set a daily calorie goal and committed to exercising for one hour approximately five days per week. She did a lot of research before beginning her journey to set herself up for success and was very fastidious in logging her calorie intake and estimated calorie output. 



In [4]:
import numpy as np
import pandas as pd
import scipy.stats as stats

## Time

There is a lot of conflicting information online about how often people should weigh themselves when trying to lose weight. Some believe that weighing in too frequently can cause anxiety (https://health.clevelandclinic.org/why-you-shouldnt-weigh-yourself-every-single-day/) or discouragement (https://www.medicinenet.com/to_weigh__or_not_to_weighthat_is_the_question/views.htm), as short-term weight fluctuations can be quite unpredictable due to factors such as hydration or what was last eaten. However, some studies have shown that higher weighing frequency is associated with greater weight loss (https://link.springer.com/article/10.1207/s15324796abm3003_5) and less weight regain (https://link.springer.com/article/10.1186/1479-5868-5-54), and is not associated with adverse psychological outcomes like anxiety (https://onlinelibrary.wiley.com/doi/full/10.1002/oby.20946). It really comes down to personal preference and what an individual feels works well for them (https://blog.myfitnesspal.com/how-often-should-you-weigh-yourself/).

Our subject wants to collect as much data as possible to track her weight loss efforts, so she decides to weigh herself first thing every morning, as she finds the consistent feedback helps her make small changes to her daily behaviour, such as eating less or exercising more (https://www.consumerreports.org/scales/the-best-time-to-weigh-yourself/). To account for daily fluctuations, weight loss or gain will depend not only on calories in/out but will also include a random component.


In [5]:
date = pd.date_range('2019-01-01', periods=365, freq='D')

# Change date format https://stackoverflow.com/a/38812486 but this returns an Index of strings, not datetimes.
date.strftime('%d/%m/%Y')

Index(['01/01/2019', '02/01/2019', '03/01/2019', '04/01/2019', '05/01/2019',
       '06/01/2019', '07/01/2019', '08/01/2019', '09/01/2019', '10/01/2019',
       ...
       '22/12/2019', '23/12/2019', '24/12/2019', '25/12/2019', '26/12/2019',
       '27/12/2019', '28/12/2019', '29/12/2019', '30/12/2019', '31/12/2019'],
      dtype='object', length=365)
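
Since `strftime` returns strings, one option is to keep the datetime values for calculations and format them only for display. The following is a minimal sketch of that idea; the `date_label` column name is a hypothetical choice, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Keep real datetimes for any date arithmetic or resampling.
date = pd.date_range('2019-01-01', periods=365, freq='D')

# 'date_label' is a hypothetical display-only column holding dd/mm/yyyy strings.
df = pd.DataFrame({'date': date})
df['date_label'] = df['date'].dt.strftime('%d/%m/%Y')

print(df.dtypes)
print(df.head())
```

The `date` column stays a datetime type, so weekly grouping or date differences still work, while `date_label` is there purely for reading.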

## Calories

Weight loss is very complex and depends on many factors such as: (https://www.niddk.nih.gov/health-information/weight-management/adult-overweight-obesity/factors-affecting-weight-health)

- Genetics
- Race
- Sex
- Age
- Diet
- Physical activity
- Environment
- Medical issues

The person in this case is a 30-year-old white Irish woman who lives in the suburbs and works at a sedentary office job. She has a moderately balanced diet but exercises very little and does not have any known medical issues that would hinder weight loss. She believes that she can begin to lose weight by making some slight lifestyle adjustments rather than any drastic changes, in particular by reducing her calorie intake and adding in a bit of exercise.

The first thing she did was calculate how many calories she should consume per day in order to steadily lose weight. Two measurements were important here: her Basal Metabolic Rate (BMR) and Total Daily Energy Expenditure (TDEE). The BMR is the energy expenditure over a certain period of time by a person at rest (https://en.wikipedia.org/wiki/Basal_metabolic_rate). In other words, it is the number of calories burned by the body just by functioning normally without moving, through processes such as breathing and circulating blood. It can be estimated based on a person's gender, age, weight and height. The TDEE is then the number of calories a person should consume to maintain their current weight. There are many online calculators that help a person figure out their BMR. I have used a few different ones here to see whether they give different results:

Measurements used: Female, 30 years old, starting weight 80kg, height 175cm:

BMR
- 1578: https://tdeecalculator.net/  
- 1591: https://www.active.com/fitness/calculators/bmr 
- 1598: https://www.bodybuilding.com/fun/bmr_calculator.htm 
- 1578: https://www.calculator.net/bmr-calculator.html# 
- 1578: https://www.thecalculatorsite.com/health/bmr-calculator.php  (Mifflin St Jeor)
- 1600: https://www.thecalculatorsite.com/health/bmr-calculator.php (Harris Benedict)

TDEE: 
- https://tdeecalculator.net/ TDEE: 1894

https://www.thecalculatorsite.com/health/bmr-calculator.php - has good explanation of equations


Of course, not all calories are created equal. She could eat 1500 calories worth of junk food and still lose weight, but this would not be healthy. She aims each day to split her calorie allowance as follows: (https://www.healthline.com/nutrition/best-macronutrient-ratio#calorie-vs-calorie)

    - 45-65% carbohydrates
    - 20-35% fats
    - 10-35% proteins 
    
In this dataset the focus is on calories in and out but our subject is generally quite good at sticking to the above ratios. 
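
To make the ratios above concrete, here is a rough sketch that converts them into gram ranges for a given calorie target. It assumes the standard conversion of 4 calories per gram for carbohydrates and proteins and 9 per gram for fats; the function and constant names are illustrative only.

```python
# Calories per gram: 4 for carbohydrates and proteins, 9 for fats.
CAL_PER_GRAM = {'carbohydrates': 4, 'fats': 9, 'proteins': 4}

# Percentage ranges quoted in the list above.
RATIO_RANGES = {
    'carbohydrates': (0.45, 0.65),
    'fats': (0.20, 0.35),
    'proteins': (0.10, 0.35),
}

def macro_ranges(target_calories):
    """Return {macro: (min_grams, max_grams)} for a daily calorie target."""
    ranges = {}
    for macro, (low, high) in RATIO_RANGES.items():
        per_gram = CAL_PER_GRAM[macro]
        ranges[macro] = (target_calories * low / per_gram,
                         target_calories * high / per_gram)
    return ranges

# Example with the 1500 calorie figure mentioned in this notebook.
for macro, (low, high) in macro_ranges(1500).items():
    print(f"{macro}: {low:.0f}-{high:.0f} g")
```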

For exercise, look into how different exercises affect weight loss - a combination of cardio, strength training and flexibility training: https://www.verywellfit.com/types-of-exercise-for-weight-loss-3495992

With calories I might split by protein, fat and carbs.

## Weight Loss

Could I do some kind of probability distribution that makes total weight go up/down?

Or have one column with weight lost daily/weekly and then add/subtract that to total weight in another column?

Weight lost every week: Say I lose an average of 0.5kg per week with a standard deviation of 0.25kg, drawn from a normal distribution.

Or with daily say it's 0.1kg per day average with a sd of 0.05kg

No! I should base the weight loss on the other variables - exercise and calories - with a random amount added or subtracted as weight loss is not exact (is this noise?)
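
One way this could look is sketched below: the expected loss comes from the calorie deficit, and a normally distributed term is added as noise. The function name, the `noise_sd_g` parameter and its default of 100 g are assumptions made for illustration, not values taken from any source.

```python
import numpy as np

GRAMS_PER_CALORIE = 0.128205128  # conversion factor used in this notebook's notes

def daily_weight_loss_g(calories_in, calories_out, noise_sd_g=100.0, rng=None):
    """Grams lost per day (negative means gained): deficit-based value plus noise.

    noise_sd_g is a hypothetical standard deviation for day-to-day fluctuation.
    """
    rng = np.random.default_rng() if rng is None else rng
    calories_in = np.asarray(calories_in, dtype=float)
    calories_out = np.asarray(calories_out, dtype=float)
    expected = (calories_out - calories_in) * GRAMS_PER_CALORIE
    noise = rng.normal(0.0, noise_sd_g, size=expected.shape)
    return expected + noise

# Three example days with an arbitrary seed so the draw is repeatable.
rng = np.random.default_rng(seed=1)
print(daily_weight_loss_g([1500, 1800, 1400], [1900, 2150, 1900], rng=rng))
```

Setting `noise_sd_g=0` returns the purely deterministic value, which is handy for checking the arithmetic.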

In [6]:
# Weight lost weekly. 104 rows 

np.random.normal(0.5, 0.25, 104)

array([-0.10128896,  0.59859516,  0.83453655,  0.52417011,  0.90521216,
        0.17909929,  0.64065219,  0.67986204,  0.12919643,  0.54468193,
        0.04929854,  0.26288368,  0.70977865,  0.65934148,  0.77326663,
        0.98388015,  0.67861682,  0.65557882,  0.36871091,  0.29111023,
        0.47680638,  0.42943281,  0.28922813,  0.76566839,  0.80250755,
        0.56005012,  0.57627409,  0.30161603,  0.60181938,  0.73819154,
        0.39303286,  0.28391161,  0.56937901,  0.39984985,  0.36841257,
        0.73038686,  0.17601089,  0.65979876,  0.50534172,  0.42181386,
        0.54017989,  0.47073429, -0.18720155,  0.47870924,  0.74946424,
        1.26925099,  0.09131415,  0.4930655 ,  0.63422509,  0.03884794,
        0.65817781,  0.81162236,  0.51286975,  0.47027433,  0.63641963,
        0.5093713 ,  0.44008204,  0.74652684, -0.02109036,  1.01067999,
        0.20676375,  0.76891219,  0.26973246,  0.35742557,  0.19319107,
        0.20013155,  0.21820184,  0.64125994,  0.89327894,  0.56

In [7]:
# Weight lost daily. 119 rows
np.random.normal(0.1, 0.05, 119)

array([ 0.17009429,  0.12419533,  0.11687747,  0.0332451 ,  0.05975039,
        0.19256842,  0.05095925,  0.05249426,  0.20078888,  0.0195992 ,
        0.11574546,  0.02812335,  0.09149349,  0.09946356,  0.07882445,
        0.12333616, -0.01468244,  0.09246127,  0.14888417,  0.05030381,
        0.08613459,  0.11043554,  0.0559676 ,  0.06190047,  0.06732921,
        0.20052231,  0.11710208,  0.11764535,  0.04182197, -0.01547591,
        0.11157168,  0.0962505 ,  0.08341353, -0.02110922,  0.13113321,
        0.10600675,  0.02118836,  0.05574964,  0.17745835,  0.08031079,
        0.09741127,  0.14663292,  0.09670708,  0.07757539,  0.14597488,
        0.1046649 ,  0.16051692, -0.02089606,  0.09356032,  0.06256797,
        0.04162591,  0.09938776,  0.0899836 ,  0.11554362,  0.15400297,
        0.16937583,  0.06974057, -0.01189572,  0.12651115,  0.10104465,
        0.13537012,  0.11858174,  0.11070824,  0.21913867,  0.01669556,
        0.06748702,  0.04096386,  0.10039162,  0.12973078,  0.09

## Exercise

This makes a case for doing a daily dataset - can't figure out how to do this weekly

Either use randint or random choice

With randint - assign exercises to different integers

None = 1
Walk = 2
Jog = 3
Yoga = 4

With random choice the options are (None, Walk, Jog, Yoga)

Use choice as you can set probability for each option.

Jan-Apr (1 January to 29 April) = 119 days = 17 weeks

Say I do some form of exercise about 5 days per week: None = 2 * 17 = 34 days of no exercise = ~28% (make it 29 for 100% probability altogether)

I walk 3 days per week: Walk = 3 * 17 = 51 walks = ~43%

I jog 1 day per week = 17 jogs = ~14%

I have a yoga class about 1 day per week = 17 yoga classes = ~14%
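
As a quick check of the percentages above, the probabilities can be derived directly from the days per week, which avoids rounding one of them up by hand to reach 100%. This is only a sketch; the seed is arbitrary and `days_per_week` is an illustrative name.

```python
import numpy as np

exercises = ['none', 'walk', 'jog', 'yoga']
days_per_week = np.array([2, 3, 1, 1])   # must total 7

# Dividing by the total gives probabilities that sum to 1.
p = days_per_week / days_per_week.sum()
print(dict(zip(exercises, p.round(3))))

# Draw 119 days (17 weeks) of activities with those probabilities.
rng = np.random.default_rng(seed=42)
sample = rng.choice(exercises, size=119, p=p)
values, counts = np.unique(sample, return_counts=True)
print(dict(zip(values, counts)))
```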


OR!

I could still do the weekly thing if I change the size so it gives an array!


Exercise not the most important thing for weight loss: https://www.vox.com/2016/4/28/11518804/weight-loss-exercise-myth-burn-calories

Exercise and calories burned: https://www.sciencealert.com/how-to-calculate-calories-burned-met-value-exercise?perpetual=yes&limitstart=1

In [8]:
exercises = ["none", "walk", "jog", "yoga"]

# 17 weeks x 7 days = 119 days, one row per week
np.random.choice(exercises, size = (17, 7), p=[0.29, 0.43, 0.14, 0.14])

array([['none', 'walk', 'none', 'walk', 'none', 'walk', 'yoga'],
       ['none', 'yoga', 'yoga', 'none', 'jog', 'none', 'none'],
       ['none', 'walk', 'yoga', 'walk', 'walk', 'none', 'yoga'],
       ['walk', 'none', 'jog', 'none', 'none', 'walk', 'walk'],
       ['walk', 'jog', 'none', 'yoga', 'jog', 'none', 'walk'],
       ['jog', 'none', 'walk', 'walk', 'none', 'none', 'yoga'],
       ['walk', 'none', 'none', 'walk', 'walk', 'yoga', 'none'],
       ['none', 'walk', 'yoga', 'none', 'none', 'walk', 'walk'],
       ['none', 'none', 'yoga', 'walk', 'none', 'jog', 'walk'],
       ['walk', 'walk', 'none', 'yoga', 'walk', 'walk', 'none'],
       ['jog', 'walk', 'none', 'walk', 'jog', 'none', 'none'],
       ['jog', 'walk', 'none', 'walk', 'walk', 'walk', 'none'],
       ['walk', 'none', 'yoga', 'none', 'none', 'none', 'walk'],
       ['none', 'yoga', 'yoga', 'walk', 'walk', 'walk', 'walk'],
       ['walk', 'yoga', 'walk', 'yoga', 'jog', 'walk', 'yoga'],
       ['jog', 'jog', 'walk', 'none

## Calories

Want to aim for 1500 calories per day 

Normal distribution again?

Below shows normal and randint

Maybe a truncated normal distribution as I should never go below 1200 calories but have some days where I'd eat over 2000


Study about outdated 3500cal = 1lb idea: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4035446/

In [9]:
np.random.normal(1500, 250, 119)

array([1782.66756138, 1632.17159753,  881.33846093, 1574.3880839 ,
       1565.41243158, 1513.40507798, 1341.61827943, 1131.97062001,
       1153.7571124 , 1768.99906335, 2120.8069394 , 1334.10840618,
       1615.58284116, 1636.35478733, 1581.43735883, 1676.40038511,
       1539.22302513, 1638.03947817, 1499.31092248, 1786.32455995,
       1659.90320088, 1498.42087634, 1853.24219088, 1530.70346831,
       1270.61337351, 1599.62513411, 1590.67545671, 1885.41131519,
       1802.43263553, 1380.40876673, 1304.69791256, 1754.46040307,
       1366.85686417, 1496.12612365, 1332.66844324, 1805.00456226,
       1438.01583603, 1498.04413668, 1278.47911591,  972.94523459,
       1452.81194437, 1468.05585793, 1901.25186276, 1977.03504552,
       1825.50610549, 1736.97300188, 1285.36572576, 1945.17479012,
       1523.81899998, 1244.00730512, 1444.50819042, 1387.90388394,
       1591.84461045, 1501.99187905, 2106.24299507, 1536.10705895,
       1581.49943254, 1298.23958084, 1459.96108529, 1862.14351

In [10]:
# or randint for neater values
np.random.randint(1200, 1800, 119)

array([1314, 1726, 1660, 1440, 1655, 1282, 1501, 1329, 1657, 1645, 1293,
       1394, 1446, 1652, 1621, 1314, 1507, 1217, 1262, 1544, 1506, 1293,
       1593, 1702, 1739, 1747, 1510, 1565, 1530, 1441, 1744, 1311, 1203,
       1578, 1312, 1213, 1721, 1300, 1585, 1777, 1359, 1424, 1561, 1612,
       1328, 1787, 1364, 1304, 1676, 1649, 1508, 1621, 1481, 1337, 1203,
       1522, 1243, 1660, 1618, 1604, 1641, 1779, 1688, 1707, 1365, 1538,
       1758, 1671, 1789, 1411, 1687, 1752, 1250, 1551, 1691, 1444, 1709,
       1762, 1465, 1208, 1364, 1285, 1638, 1638, 1476, 1568, 1485, 1446,
       1324, 1535, 1207, 1581, 1273, 1456, 1474, 1657, 1566, 1635, 1764,
       1289, 1412, 1534, 1432, 1244, 1246, 1305, 1668, 1555, 1485, 1695,
       1655, 1643, 1752, 1440, 1232, 1530, 1323, 1779, 1250])

In [11]:
# https://stackoverflow.com/a/18444710
#low = 1200
#high = 2500
#mu = 1500
#sigma = 250

#x = stats.truncnorm((low - mu)/sigma, (high - mu)/sigma, loc=mu, scale=sigma, size=119)
#x


# https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.truncnorm.html
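# Figure it out! The bounds below are raw calorie values, but truncnorm expects
# them standardised as (bound - loc) / scale, so this call does not give usable values.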
stats.truncnorm.rvs(1200, 2500, size = 119)

array([inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf,
       inf, inf])
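
The `inf` values appear because `truncnorm` reads its first two arguments as bounds on the standard normal scale, not as raw calorie values. A sketch of a working call follows, using the bounds, mean and standard deviation from the commented-out lines above; treat those numbers as placeholders rather than settled choices.

```python
import scipy.stats as stats

low, high = 1200, 2500   # calorie bounds from the commented-out attempt
mu, sigma = 1500, 250    # target and spread used in the normal example

# Convert the bounds to standard-deviation units relative to loc and scale.
a = (low - mu) / sigma
b = (high - mu) / sigma

calories_in = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma, size=119)
print(calories_in.min(), calories_in.max())
```

Because the lower bound sits much closer to the mean than the upper bound, the sample average will tend to come out somewhat above 1500.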

Since BMR and TDEE are both going to change as weight fluctuates, I have created the function below to recalculate them each day as part of the calories burned. I have used the Mifflin-St Jeor calculation as it is widely used and deemed to be quite accurate.

In [12]:
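# TDEE function
# Only weight changes in this calculation; height (175 cm) and age (30) are fixed.

def tdee(weight):
    # Mifflin-St Jeor BMR for women: 10 * weight(kg) + 6.25 * height(cm) - 5 * age - 161
    bmr = 10 * weight + 6.25 * 175 - 5 * 30 - 161
    # 1.2 is the activity multiplier for a sedentary lifestyle
    result = bmr * 1.2
    return result

tdee(80)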

1899.3

Below is a function that estimates the calories burned by exercising. Our subject exercises most days of the week and averages about 45 minutes per session, sometimes more, sometimes less. She uses her Fitbit to calculate the number of calories burned in each exercise session and logs that into her spreadsheet.

In [13]:
# https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function-o

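def exercise(activity):
    # Labels match the values passed to np.random.choice in the Exercise section.
    # Calories per session are drawn from a normal distribution (sd = 50).
    if activity == 'yoga':
        return np.random.normal(150, 50)
    elif activity == 'walk':
        return np.random.normal(250, 50)
    elif activity == 'jog':
        return np.random.normal(300, 50)
    else:
        # 'none' or any unrecognised label: no calories burned
        return 0
    
exercise('jog')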

297.6445484663469
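
The Stack Overflow link above is about building a new column from an existing one, so here is a small sketch of how that might be applied to an exercise column. The column names, the short activity labels and the clipping at zero are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Mean calories per session, taken from the exercise function above.
# The short labels ('walk', 'jog') are assumed to match the values
# produced by np.random.choice earlier in the notebook.
MEAN_CALORIES = {'yoga': 150, 'walk': 250, 'jog': 300}

def exercise_calories(activity, sd=50):
    """Draw estimated calories burned for one session; 0 for rest days."""
    if activity not in MEAN_CALORIES:
        return 0.0
    # Clip at zero so a rare extreme draw cannot give negative calories.
    return max(0.0, np.random.normal(MEAN_CALORIES[activity], sd))

# Hypothetical five-day frame to show the column being filled in.
df = pd.DataFrame({'exercise': ['none', 'walk', 'jog', 'yoga', 'walk']})
df['exercise_calories'] = df['exercise'].apply(exercise_calories)
print(df)
```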

In [14]:
def cal_to_gram(calories):
    # 0.128205128 g per calorie, so 3500 calories is roughly 450 g
    gram = calories * 0.128205128
    return gram
cal_to_gram(3500)

448.717948
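
The pieces above could be combined into a day-by-day loop in which each morning's weight feeds that day's TDEE. The sketch below is one possible arrangement under several assumptions: the column names, the 100 g noise level, the seed, and drawing calories in from a plain normal distribution are all illustrative choices.

```python
import numpy as np
import pandas as pd

GRAMS_PER_CALORIE = 0.128205128          # factor used by cal_to_gram
ACTIVITIES = ['none', 'walk', 'jog', 'yoga']
ACTIVITY_P = [0.29, 0.43, 0.14, 0.14]    # probabilities from the Exercise section
MEAN_BURN = {'none': 0, 'walk': 250, 'jog': 300, 'yoga': 150}

def tdee(weight_kg):
    """Mifflin-St Jeor BMR (woman, 175 cm, 30 years) times 1.2."""
    return (10 * weight_kg + 6.25 * 175 - 5 * 30 - 161) * 1.2

def simulate(days=365, start_weight_kg=80.0, noise_sd_g=100.0, seed=0):
    """Return one row per day; weight_kg is the weight at the start of the day."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range('2019-01-01', periods=days, freq='D')
    weight = start_weight_kg
    rows = []
    for date in dates:
        activity = str(rng.choice(ACTIVITIES, p=ACTIVITY_P))
        calories_in = rng.normal(1500, 250)
        if activity == 'none':
            burned = 0.0
        else:
            burned = max(0.0, rng.normal(MEAN_BURN[activity], 50))
        calories_out = tdee(weight) + burned
        # Deficit converted to grams, plus hypothetical day-to-day noise.
        loss_g = (calories_out - calories_in) * GRAMS_PER_CALORIE
        loss_g += rng.normal(0.0, noise_sd_g)
        rows.append({'date': date, 'weight_kg': weight, 'exercise': activity,
                     'calories_in': calories_in, 'calories_out': calories_out,
                     'weight_loss_g': loss_g})
        weight -= loss_g / 1000   # carry the new weight into the next day
    return pd.DataFrame(rows)

print(simulate(days=7))
```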