# Health Data - Elizabeth Ashley Flack
Original data is available on [Google Sheets](https://docs.google.com/spreadsheets/d/1vaXOXPluOg_lg0uqKC-qLBt6odL0GPFPcYzOP2vbaiM/edit#gid=1309334105).

In [1]:
# Juptyer Notebook Settings
options(repr.plot.width=5, repr.plot.height=5)  # plot size, inches

## I. Load Semi-Raw Data
This data has already been somewhat wrangled in that secondary variables (namespaced as `Indicator.SECONDARY_VARIABLE_NAME`) have been created.

In [12]:
dat = read.csv("data/Health Data - Elizabeth Ashley Flack.csv", header = TRUE)
tail(dat)  # See last 6 obs
# head(dat)  # See first 6 obs

Unnamed: 0,Date,Weekday,Notes.Unstructured,Med.Class1.1,Med.Class1.2,Med.Class3.1,Med.Class4.1,Med.Class5.1,Med.Class6.1,Symptom.1,⋯,Indicator.Health,Indicator.Experience..,Indicator.Experience...1,Indicator.Experience,Indicator.SelfEfficacy,Indicator.Productivity,Indicator.Wellbeing,Indicator.Utility..,Indicator.Utility...1,Indicator.Utility
95,1/3/18,4,"Food diary (FD), G (Gym).",0,40,0,0,75,0.0,0.0,⋯,0,0,0,0,0.23,0,0.1,0,0,0.08
96,1/4/18,5,,0,40,0,0,75,,,⋯,0,0,0,0,0.0,0,0.0,0,0,0.0
97,1/5/18,6,,0,40,0,0,75,,,⋯,0,0,0,0,0.0,0,0.0,0,0,0.0
98,1/6/18,7,,0,40,0,0,75,,,⋯,0,0,0,0,0.0,0,0.0,0,0,0.0
99,1/7/18,1,,0,40,0,0,75,,,⋯,0,0,0,0,0.0,0,0.0,0,0,0.0
100,1/8/18,2,,0,40,0,0,75,,,⋯,0,0,0,0,0.0,0,0.0,0,0,0.0


## II. Explore Semi-Raw Data

In [3]:
summary(dat)

      Date       Weekday    
 1/1/18 : 1   Min.   :1.00  
 1/2/18 : 1   1st Qu.:2.00  
 1/3/18 : 1   Median :4.00  
 1/4/18 : 1   Mean   :3.95  
 1/5/18 : 1   3rd Qu.:6.00  
 1/6/18 : 1   Max.   :7.00  
 (Other):94                 
                                      Notes.Unstructured  Med.Class1.1  
                                               :20       Min.   :  0.0  
 Migraine pill.                                : 4       1st Qu.:  0.0  
 Bedridden.                                    : 3       Median :150.0  
 Cleaning house.                               : 3       Mean   :100.5  
 Food diary.                                   : 3       3rd Qu.:150.0  
 1hr convo w/ volunteer coordinator at library.: 1       Max.   :150.0  
 (Other)                                       :66                      
  Med.Class1.2   Med.Class3.1  Med.Class4.1    Med.Class5.1   Med.Class6.1   
 Min.   :30.0   Min.   :0     Min.   :  0.0   Min.   : 0.0   Min.   :0.0000  
 1st Qu.:30.0   1st Qu.:0   

In [4]:
sapply(dat, typeof)  # Check types of all fields, using "simple apply" to apply typeof() to all fields in "dat".

## III. Clean Data

### III.1 Convert read.csv() List to Data Frame

In [17]:
df <- data.frame(dat)
df$Notes.Unstructured[1]

### III.2 Drop unneeded columns

df$Notes.Unstructured[3]
drops <- c("Notes.Unstructured")  # Create vector of fields to drop
df <- df[ , !(names(df) %in% drops)]  # Creates a new data frame from the first one, including any fields that aren't in list of fields to filter out.
df$Notes.Unstructured[1]

### III.3 Remove any rows with missing values

In [7]:
df$Date[1]
nrow(df)
str(colSums(is.na(df)))
str(colSums(!is.na(df)))
df <- df[complete.cases(df), ]
nrow(df)

 Named num [1:51] 0 0 0 0 0 0 0 5 5 5 ...
 - attr(*, "names")= chr [1:51] "Date" "Weekday" "Med.Class1.1" "Med.Class1.2" ...
 Named num [1:51] 100 100 100 100 100 100 100 95 95 95 ...
 - attr(*, "names")= chr [1:51] "Date" "Weekday" "Med.Class1.1" "Med.Class1.2" ...


### IV.4 Streamline Data Types

In [8]:
df$Date[1]
typeof(df$Date)
class(df$Date)
df$Date <- as.Date(df$Date, format='%m/%d/%Y')
class(df$Date)
typeof(df$Date)
df$Date[1]

“unknown timezone 'zone/tz/2018c.1.0/zoneinfo/America/New_York'”

## IV. Explore Clean Data

In [9]:
summary(df)

      Date               Weekday       Med.Class1.1    Med.Class1.2  
 Min.   :0017-10-01   Min.   :1.000   Min.   :  0.0   Min.   :30.00  
 1st Qu.:0017-10-24   1st Qu.:2.000   1st Qu.:  0.0   1st Qu.:30.00  
 Median :0017-11-17   Median :4.000   Median :150.0   Median :30.00  
 Mean   :0017-11-17   Mean   :3.937   Mean   :105.8   Mean   :32.95  
 3rd Qu.:0017-12-10   3rd Qu.:6.000   3rd Qu.:150.0   3rd Qu.:40.00  
 Max.   :0018-01-03   Max.   :7.000   Max.   :150.0   Max.   :40.00  
  Med.Class3.1  Med.Class4.1     Med.Class5.1    Med.Class6.1   
 Min.   :0     Min.   :  0.00   Min.   : 0.00   Min.   :0.0000  
 1st Qu.:0     1st Qu.: 50.00   1st Qu.:75.00   1st Qu.:0.0000  
 Median :0     Median : 75.00   Median :75.00   Median :0.0000  
 Mean   :0     Mean   : 64.74   Mean   :73.42   Mean   :0.1789  
 3rd Qu.:0     3rd Qu.: 75.00   3rd Qu.:75.00   3rd Qu.:0.0000  
 Max.   :0     Max.   :100.00   Max.   :75.00   Max.   :1.0000  
   Symptom.1        Symptom.2         Symptom.3        

## V. Analyses

### V.1 Simple Plots
#### Data Visualization Reference Info
##### Ggplot2
Tutorial: http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)/

Options for `geom_line()`
- Line type, e.g. `linetype='dashed'`
- Line color, e.g. `color='red'`

#### Presence of any headache on weekday for timespan of entire dataset

In [10]:
lm1 <- lm(Symptom.Headache ~ Weekday, data=df)
plot(df$Weekday, df$Symptom.Headache)

ERROR: Error in eval(predvars, data, env): object 'Symptom.Headache' not found


#### Health over time

In [None]:
library(ggplot2)
var <- df$Indicator.Health
time <- df$Date

# a_lm <- lm(-10*Indicator.Health ~ Date, data=df)
# plot(df$Date, df$Indicator.Health)
# lines(df$Date, a_lm$fitted)

# Basic line plot with points
options(repr.plot.width=7, repr.plot.height=4.5)  # (1) Jupyter Notebook plot size, inches
# ggplot(data=df, aes(x=time, y=-10*(var), colour=(-10*var), group=1)) + geom_line()+ geom_point() + 
ggplot(data=df, aes(x=time, y=-10*(var), colour=(-10*var), color=qsec, group=1)) + geom_line()+ geom_point() + 
  theme(plot.title = element_text(hjust = 0.5)) + 
  scale_color_gradient(low="#AA4444", high="#ff0000") + 
  scale_x_date(date_breaks='1 month', date_labels="%b '%y") + 
  stat_smooth(method='lm', col='#2e64ba') + 
  labs(x='Time', y='Adverse Health Measure (AHM)', title='Adverse Health Measure (AHM), Daily', colour='AHM')
options(repr.plot.width=7, repr.plot.height=7)  # (1) default

In [None]:
a_lm <- lm(Indicator.Health ~ Date, data=df)
plot(df$Date, df$Indicator.Health)
lines(df$Date, a_lm$fitted)

### V.2 Linear Models

In [None]:
a_lm <- lm(Indicator.Utility ~ Date, data=df)
plot(df$Date, df$Indicator.Utility)
lines(df$Date, a_lm$fitted)

In [None]:
a_lm <- lm(Indicator.Health ~ Date, data=df)
plot(df$Date, df$Indicator.Health)
lines(df$Date, a_lm$fitted)

In [None]:
a_lm <- lm(Indicator.Wellbeing ~ Date, data=df)
plot(df$Date, df$Indicator.Wellbeing)
lines(df$Date, a_lm$fitted)