Data set idea: weight loss

Variables:
- day - daily time series

- weight 
    - estimated by calories in / calories out based on 3500cal as 450g lost (https://www.mayoclinic.org/healthy-lifestyle/weight-loss/in-depth/calories/art-20048065) so every 1 calorie = 450 / 3500 ≈ 0.128571g
    - Weight loss will be calculated each day: (calories out - calories in) * 0.128571g +/- random noise, as weight loss is not exact


- calories in based on logging food with LoseIt 
    - split by carbs/protein/fat? put into separate variables by percent?
    - Not going to split by c/p/f in dataset but will just state calorie count assuming she maintains good ratios 
    
    
- calories out - based on BMR/TDEE and exercise
- target calorie amount - same number throughout
- over/under calorie target
- exercise 
    - boolean?  True/False whether I exercised or not
    - categorical? Listing different exercises (walk, run, yoga class, weight training)
    - estimated calories burned?
 - calories out - tdee + exercise 
     - function created to calculate tdee as it fluctuates each day
     - maybe +/- random amount to exercise so it's not so samey
    
Run weekly - 52 weeks per year over 2 years = 104 rows
or 
Daily - January-April 2019 inclusive = 120 rows
Can't figure out which would be better

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Time" data-toc-modified-id="Time-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Time</a></span><ul class="toc-item"><li><span><a href="#Code" data-toc-modified-id="Code-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Code</a></span></li></ul></li><li><span><a href="#Calories" data-toc-modified-id="Calories-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Calories</a></span></li><li><span><a href="#Calories-In" data-toc-modified-id="Calories-In-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Calories In</a></span></li><li><span><a href="#Exercise" data-toc-modified-id="Exercise-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Exercise</a></span></li><li><span><a href="#Weight-Loss" data-toc-modified-id="Weight-Loss-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Weight Loss</a></span></li></ul></div>

This project simulates a dataset created by a woman - let's call her Zoe - who decided to track her weight loss efforts over the course of a calendar year, from 01/01/18 to 31/12/18. She set a daily average calorie allowance and committed to about 45 minutes of exercise approximately five days per week. She did a lot of research before beginning her journey to set herself up for success and was very fastidious in logging her calorie intake and estimated calorie output.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

## Time

The first thing Zoe had to decide was how often to track her progress. She found that there is a lot of conflicting information online about how often one should weigh themselves when trying to lose weight. Some believe that weighing in too frequently can cause anxiety (https://health.clevelandclinic.org/why-you-shouldnt-weigh-yourself-every-single-day/) or discouragement (https://www.medicinenet.com/to_weigh__or_not_to_weighthat_is_the_question/views.htm) as weight fluctuations in the short-term can be quite unpredictable due to factors such as hydration or what was last eaten. However, some studies have shown that higher weighing frequency is associated with greater weight loss, (https://link.springer.com/article/10.1207/s15324796abm3003_5) less weight regain, (https://link.springer.com/article/10.1186/1479-5868-5-54) and is not associated with adverse psychological outcomes like anxiety (https://onlinelibrary.wiley.com/doi/full/10.1002/oby.20946). It really comes down to personal preference and what an individual feels works well for them. (https://blog.myfitnesspal.com/how-often-should-you-weigh-yourself/)

Zoe is interested in collecting as much data as possible to track her weight loss efforts and so decides to weigh herself first thing in the morning every day (https://www.consumerreports.org/scales/the-best-time-to-weigh-yourself/) as she finds the consistent feedback helps her to stay on track and keep herself accountable. 

### Code




In [2]:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.weekday.html

start = '2018-01-01'
end = '2018-12-31'

date = pd.date_range(start, end, freq='D')

df = pd.DataFrame({'date': date})

# Change date format https://stackoverflow.com/a/38812486 but this returns an array of strings, not datetimes.
#df['Date'] = pd.to_datetime(df['Date'].dt.strftime('%d-%m-%Y'))

#day = date.dt.weekday 

#if (day < 5).bool() == True:
    
#print('x')

#weekdays = pd.bdate_range(start, end)

#https://stackoverflow.com/a/19960116
#weekends = ~date.isin(weekdays)

#df = df.set_index('date')

df


Unnamed: 0,date
0,2018-01-01
1,2018-01-02
2,2018-01-03
3,2018-01-04
4,2018-01-05
...,...
360,2018-12-27
361,2018-12-28
362,2018-12-29
363,2018-12-30


A day column is added so Zoe can see on which days of the week she is best able to stick to her plan, such as which day she is most likely to go well over her calorie goal or which exercises she is most likely to do on a particular day.

In [3]:
# adding days of the week
# https://stackoverflow.com/a/30222759

df['day'] = df['date'].dt.day_name()

# https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
df['day'] = df['day'].astype('category')
df

Unnamed: 0,date,day
0,2018-01-01,Monday
1,2018-01-02,Tuesday
2,2018-01-03,Wednesday
3,2018-01-04,Thursday
4,2018-01-05,Friday
...,...,...
360,2018-12-27,Thursday
361,2018-12-28,Friday
362,2018-12-29,Saturday
363,2018-12-30,Sunday
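
As a rough sketch of how the day column could be used once other columns exist, the snippet below builds a small stand-in frame and averages a column by weekday. The `demo` frame and its `calories_in` values are made-up placeholders, not the project's data, and ordering the categories Monday to Sunday is an assumption about how the summary should be sorted.

```python
import numpy as np
import pandas as pd

# Stand-in data: two weeks of dates with placeholder calorie values (hypothetical)
demo = pd.DataFrame({'date': pd.date_range('2018-01-01', periods=14, freq='D')})
demo['calories_in'] = np.arange(1500, 1500 + 14 * 10, 10)

# Ordered categorical so summaries sort Monday..Sunday rather than alphabetically
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
demo['day'] = pd.Categorical(demo['date'].dt.day_name(), categories=day_order, ordered=True)

# Mean calories per weekday
by_day = demo.groupby('day', observed=False)['calories_in'].mean()
print(by_day)
```

With an unordered category the same groupby would list the days alphabetically, which is harder to read for a weekly pattern.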


## Calories 

Weight loss is very complex and depends on many factors such as: (https://www.niddk.nih.gov/health-information/weight-management/adult-overweight-obesity/factors-affecting-weight-health)

- Genetics
- Race
- Sex
- Age
- Diet
- Physical activity
- Environment
- Medical issues

Zoe is a 30-year-old white Irish woman who lives in the suburbs and works at a sedentary office job. She has a moderately balanced diet (if perhaps a bit overindulgent!) but exercises very little and does not have any known medical issues that would hinder weight loss. There are no glaring genetic or environmental reasons that would hinder her weight loss, so she believes that she can start to lose weight by making some slight lifestyle adjustments rather than any drastic changes. Radical exercise regimens or fad diets that offer quick weight-loss results are difficult to stick to long-term and can be dangerous. In particular, she plans to lower her calorie intake and add in a bit of exercise.

The first thing she did was calculate how many calories she should consume per day in order to steadily lose weight. Two measurements were important here: her Basal Metabolic Rate (BMR) and her Total Daily Energy Expenditure (TDEE). The BMR is the energy expenditure over a certain period of time by a person at rest (https://en.wikipedia.org/wiki/Basal_metabolic_rate). In other words, it is the number of calories burned by the body just by functioning normally without moving, such as breathing and circulating blood. It can be estimated based on a person's gender, age, weight and height. The TDEE is then the number of calories a person should consume to maintain their current weight. There are many online calculators that help a person figure out their BMR - I have used a few different ones here to see whether they give different results:

Measurements used: Female, 30 years old, starting weight 80kg, height 175cm:

BMR
- 1578: https://tdeecalculator.net/  
- 1591: https://www.active.com/fitness/calculators/bmr 
- 1598: https://www.bodybuilding.com/fun/bmr_calculator.htm 
- 1578: https://www.calculator.net/bmr-calculator.html# 
- 1578: https://www.thecalculatorsite.com/health/bmr-calculator.php  (Mifflin St Jeor)
- 1600: https://www.thecalculatorsite.com/health/bmr-calculator.php (Harris Benedict)

TDEE: 
- 1894: https://tdeecalculator.net/

https://www.thecalculatorsite.com/health/bmr-calculator.php - has good explanation of equations


Of course, not all calories are created equal. She could eat 1500 calories' worth of junk food and still lose weight, but this would not be healthy. Each day she aims to split her calorie allowance as follows: (https://www.healthline.com/nutrition/best-macronutrient-ratio#calorie-vs-calorie)

    - 45-65% carbohydrates
    - 20-35% fats
    - 10-35% proteins 
    
In this dataset the focus is on calories in and out but our subject is generally quite good at sticking to the above ratios. 
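
To make the percentage ranges above more concrete, here is a minimal sketch that converts them into grams per day for a given calorie allowance. The 4/9/4 kcal-per-gram figures are the standard general energy factors for carbohydrate, fat and protein rather than something taken from the linked article, and `macro_grams` is a hypothetical helper that is not part of the dataset.

```python
# Energy per gram (general Atwater factors): carbs 4, fat 9, protein 4 kcal/g
KCAL_PER_GRAM = {'carbohydrates': 4, 'fats': 9, 'proteins': 4}

# Percentage ranges quoted above
RANGES = {'carbohydrates': (0.45, 0.65), 'fats': (0.20, 0.35), 'proteins': (0.10, 0.35)}

def macro_grams(daily_calories):
    """Return {macro: (min_grams, max_grams)} for a daily calorie allowance."""
    return {
        macro: (round(daily_calories * low / KCAL_PER_GRAM[macro]),
                round(daily_calories * high / KCAL_PER_GRAM[macro]))
        for macro, (low, high) in RANGES.items()
    }

# Example with a 1500-calorie allowance
print(macro_grams(1500))
```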

For exercise, look into how different exercises affect weight loss - a combination of cardio, strength training and flexibility training: https://www.verywellfit.com/types-of-exercise-for-weight-loss-3495992

With calories I might split by protein, fat and carbs - 

## Calories In

To track her calorie intake, Zoe used the LoseIt app (https://www.loseit.com/) which allows her to log everything she eats and provides her with a calorie total at the end of every day that she then logs into her dataset. 

It is not healthy to go below 1200 calories daily as it would be difficult to get the nutrition the body needs (https://www.everydayhealth.com/weight/can-more-calories-equal-more-weight-loss.aspx).

Zoe aims to lose the weight slowly and in a sustainable manner and so follows the guidance of 0.5kg per week (https://www.mayoclinic.org/healthy-lifestyle/weight-loss/in-depth/weight-loss/art-20047752). This means 

The randint function allows me to set minimum and maximum calorie values and was my first thought when trying to come up with a suitable function. As can be seen below, it returns 365 integer values that could potentially be used as calorie measurements.

In [None]:
# Using randint for calorie in values
first = np.random.randint(1200, 2500, 365)
first

However, this function draws from a uniform distribution, so Zoe is as likely to consume 2400 calories as she is to consume 1500 calories. Her daily calorie goal is 1500, so most values should ideally be centred around this figure. A normal distribution may work better.

In [None]:
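# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(first, kde=True)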

There is still an issue: a regular normal distribution can return values that are unrealistically below the minimum threshold of 1200 calories. The cell below sets the mean to 1500 and the standard deviation to 200.

In [None]:
second = np.random.normal(1500, 200, 365)
second
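
To put a number on how often a plain normal distribution would dip below the 1200-calorie floor, the sketch below uses the normal CDF with the same mean and standard deviation as the cell above. The variable names are illustrative only.

```python
import scipy.stats as stats

mu, sigma, floor = 1500, 200, 1200

# Probability that a single normal draw falls below the 1200-calorie floor
p_below = stats.norm.cdf(floor, loc=mu, scale=sigma)

# Expected number of such days in a 365-day year
print(f"P(calories < {floor}) = {p_below:.3f}")
print(f"Expected days below floor: {365 * p_below:.1f}")
```

The floor sits 1.5 standard deviations below the mean, so a small but non-negligible share of days would fall under it.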

The solution below uses a truncated normal distribution, which allows a range to be set while keeping the data normally distributed within it. I have also converted the results to integers, as Zoe would not realistically be tracking decimals of a calorie.

In [14]:
# https://stackoverflow.com/a/18444710 
# https://stackoverflow.com/a/53948014

def calories_in():
    low = 1200
    high = 2500
    mu = 1500
    sigma = 200
    x = stats.truncnorm((low - mu) / sigma, (high - mu) / sigma, loc=mu, scale=sigma)
    cal_in = x.rvs(365).astype(int)
    return cal_in

    
    # https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.truncnorm.html
    # https://stackoverflow.com/a/37411711
    


#calories_in()
#
#df['calories_in'] = calories_in()

#
#df

#def calorie_input(day):
 #   if day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']:
  #      return calories_in(2000, 1)
   # if day in ['Saturday', 'Sunday']:
    #    return calories_in(1300, 1)
    
#df['calories_in'] = calories_in()

#x = df['calories_in'] + np.random.randint(10000, 200000)

#np.where((df['day'] == 'Saturday') & (df['day'] == 'Sunday'), x, df['calories_in'])

#pd.options.display.max_rows = 365

#sat = df.loc[df['day'] == 'Saturday']
#sun = df.loc[df['day'] == 'Sunday']

#sat.transform(lambda x: df.calories_in + 1000)

#weekend = pd.Series(sat, sun)

#if df[df.day == df.day('Saturday')]:
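# Extra weekend calories: one random bump per day rather than a single value shared by every weekend
new = np.random.randint(300, 500, 365)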



#def assign_days(x):
  #  if x == 'Saturday':
   #     return df['calories_in'] = df['calories'] +np.random.randint(10000, 200000)
    
#df['calories_in'] = assign_days(df['day'])

# https://note.nkmk.me/en/python-numpy-where/

# Draw the base values once, then add the weekend bump on Saturdays and Sundays only
base = calories_in()
df['calories_in'] = np.where(df['day'].isin(['Saturday', 'Sunday']), base + new, base)

df

Unnamed: 0,date,day,calories_in
0,2018-01-01,Monday,1526
1,2018-01-02,Tuesday,1607
2,2018-01-03,Wednesday,1453
3,2018-01-04,Thursday,1734
4,2018-01-05,Friday,1540
...,...,...,...
360,2018-12-27,Thursday,1357
361,2018-12-28,Friday,1435
362,2018-12-29,Saturday,1894
363,2018-12-30,Sunday,1761
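
The commented-out experiments above hint at passing parameters to `calories_in`. A minimal sketch of that idea follows; `truncated_calories` and its `seed` argument are hypothetical names introduced here for illustration, and the defaults simply mirror the values hard-coded in the original function.

```python
import scipy.stats as stats

def truncated_calories(mu=1500, sigma=200, low=1200, high=2500, size=365, seed=None):
    """Draw integer calorie totals from a normal(mu, sigma) truncated to [low, high]."""
    # truncnorm expects the bounds in standard-deviation units relative to loc
    a, b = (low - mu) / sigma, (high - mu) / sigma
    draws = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size, random_state=seed)
    return draws.astype(int)

sample = truncated_calories(seed=42)
print(sample.min(), sample.max(), round(sample.mean()))
```

Because the lower bound is much closer to the mean than the upper bound, the sample mean should come out somewhat above 1500.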


## Exercise

This makes a case for doing a daily dataset - can't figure out how to do this weekly

Either use randint or random choice

With randint - assign exercises to different integers

None = 1
Walk = 2
Jog = 3
Yoga = 4

With random choice the options are (None, Walk, Jog, Yoga)

Use choice as you can set probability for each option.

Say I do some form of exercise about 5 days per week: None = 2 * 17 = 34 days of no exercise = ~28% (make it 29 for 100% probability altogether)

I walk 3 days per week: Walk = 3 * 17 = 51 walks = ~43%

I jog 1 day per week = 17 jogs = ~14%

I have a yoga class about 1 day per week = 17 yoga classes = ~14%
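
The percentages above can also be derived directly from the sessions-per-week counts instead of being rounded by hand. This is only a sketch: the seed is arbitrary, and 119 days corresponds to the 17 weeks used in the notes.

```python
import numpy as np

# Sessions per week for each activity, taken from the notes above
per_week = {'none': 2, 'walk': 3, 'jog': 1, 'yoga': 1}

activities = list(per_week)
probs = np.array(list(per_week.values())) / sum(per_week.values())  # divides by 7

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
exercise = rng.choice(activities, size=119, p=probs)  # 17 weeks of daily draws

print(dict(zip(activities, probs.round(2))))
```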


OR!

I could still do the weekly thing if I change the size so it gives an array!


Exercise not the most important thing for weight loss: https://www.vox.com/2016/4/28/11518804/weight-loss-exercise-myth-burn-calories

Exercise and calories burned: https://www.sciencealert.com/how-to-calculate-calories-burned-met-value-exercise?perpetual=yes&limitstart=1
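
One common way to approximate calories burned from MET values is MET x body weight in kg x duration in hours. The sketch below applies that approximation; the MET numbers are illustrative assumptions for this example and not values taken from the linked article, and `met_calories` is a hypothetical helper.

```python
# Illustrative MET values (assumed for this sketch; look up real ones per activity)
MET = {'none': 0.0, 'walk': 3.5, 'jog': 7.0, 'yoga': 2.5}

def met_calories(met, weight_kg, minutes):
    """Approximate kcal burned: MET x body weight (kg) x duration (hours)."""
    return met * weight_kg * (minutes / 60)

# 45-minute sessions at a body weight of 80 kg
for activity, met in MET.items():
    print(activity, round(met_calories(met, weight_kg=80, minutes=45)))
```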

In [None]:
activities = ["none", "walk", "jog", "yoga"]

exercise = np.random.choice(activities, size = (365), p=[0.29, 0.43, 0.14, 0.14])

df['exercise'] = exercise

df

Below is a function that estimates the calories burned by exercising. Zoe exercises most days of the week and averages about 45 minutes per session. She uses her Fitbit (https://www.fitbit.com/ie/home) to approximate the number of calories burned during each exercise session and logs that in her spreadsheet.

In [None]:
# https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns-apply-a-function

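def exercise_cals(row):
    # Estimated calories burned per session, drawn from a normal distribution (mean, sd)
    if row['exercise'] == 'yoga':
        return int(np.random.normal(150, 50))
    if row['exercise'] == 'walk':
        return int(np.random.normal(250, 50))
    if row['exercise'] == 'jog':
        # sd of 50 assumed to match the other activities; omitting it defaults the sd to 1.0
        return int(np.random.normal(300, 50))
    # 'none' (or any unrecognised value) burns no exercise calories
    return 0

df['exercise_cals'] = df.apply(exercise_cals, axis=1)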
df

Since BMR and TDEE will both change as weight fluctuates, I have created the function below to recalculate them each day as part of the calories burned. I have used the Mifflin-St Jeor calculation as it is widely used and deemed to be quite accurate.

In [None]:
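# TDEE function
# only weight will change in this calculation

def tdee(weight):
    # Mifflin-St Jeor BMR for a woman: weight in kg, height 175 cm, age 30
    bmr = 10 * weight + 6.25 * 175 - 5 * 30 - 161
    # 1.2 is the activity multiplier commonly used for a sedentary lifestyle
    result = bmr * 1.2
    return result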

tdee(80)

Study about outdated 3500cal = 1lb idea: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4035446/

In [None]:
def cal_to_gram(calories):
    # 3500 calories is treated as roughly 450 g of body weight
    gram = calories * 450 / 3500
    return gram
cal_to_gram(3500)

## Weight Loss

Could I do some kind of probability distribution that makes total weight go up/down?

Or have one column with weight lost daily/weekly and then add/subtract that to total weight in another column?

Weight lost every week: Say I lose an average of 0.5kg per week with a standard deviation of 0.25 and over a normal distribution.

Or with daily say it's 0.1kg per day average with a sd of 0.05kg

No! I should base the weight loss on the other variables - exercise and calories - with a random amount added or subtracted as weight loss is not exact (is this noise?)


To account for daily fluctuations, weight loss or gain will depend not only on calories in/out but will also include a random component.
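
One possible way to tie these pieces together is sketched below: each day's deficit is TDEE plus exercise minus calories in, converted to grams with the 3500 cal = 450 g rule, and random noise is added to the scale reading. The function name `simulate_weight`, the noise standard deviation of 0.2 kg and the constant placeholder inputs are all assumptions made for illustration, not the project's final method.

```python
import numpy as np

def tdee(weight):
    # Mifflin-St Jeor BMR (female, 175 cm, 30 years) with a sedentary multiplier of 1.2
    return (10 * weight + 6.25 * 175 - 5 * 30 - 161) * 1.2

GRAMS_PER_CALORIE = 450 / 3500  # from the 3500 cal = 450 g rule of thumb

def simulate_weight(calories_in, exercise_cals, start_kg=80.0, noise_sd_kg=0.2, seed=None):
    """Return the underlying weight and a noisy scale reading for each day (kg)."""
    rng = np.random.default_rng(seed)
    true_weight = np.empty(len(calories_in))
    weight = start_kg
    for i, (cal_in, cal_ex) in enumerate(zip(calories_in, exercise_cals)):
        deficit = tdee(weight) + cal_ex - cal_in      # calories out minus calories in
        weight -= deficit * GRAMS_PER_CALORIE / 1000  # grams lost -> kg
        true_weight[i] = weight
    # Day-to-day fluctuation (hydration etc.) is added to the reading only,
    # so the noise does not accumulate in the underlying weight
    measured = true_weight + rng.normal(0, noise_sd_kg, len(true_weight))
    return true_weight, measured

# Placeholder inputs: a constant 1500 cal intake and 200 cal of exercise for 30 days
true_w, measured_w = simulate_weight([1500] * 30, [200] * 30, seed=1)
print(round(true_w[-1], 2))
```

Keeping the noise out of the running total means a single heavy or light reading does not shift every later day, which seems closer to how real fluctuations behave.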
