# 7. Visualizing data in Stata

```stata
global repo https://github.com/jhustata/basic/raw/main/
```

Our course catalog has several examples to help you visualize data:

- [Histogram](https://jhustata.github.io/basic/chapter3.html#histogram)
- [Twoway](https://jhustata.github.io/basic/chapter3.html#graph)
- [Special](https://jhustata.github.io/basic/chapter3.html#special-numeric-variables-dates-times) numeric variables
   - Dates
   - Times
- Longitudinal data (pending)
   - Nested data
      - When data are nested over time
   - Survival analysis

But let's kick off with some basics:

$Y_i = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_NX_N + \epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$ 

Many questions in science take the above form. We often seek to "accurately" predict $Y_i$ (absolute estimate) or test a hypothesis $\beta_j=0$ (relative estimate).
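Both goals can be sketched with a single regression. The example below is a minimal illustration, not part of the course materials: the choice of `age` as the outcome and `bmi` and `gender` as predictors is an assumption made only to have something concrete to fit, and `age_hat` is a hypothetical variable name.

```stata
// Sketch: prediction vs. hypothesis testing from one fitted model
use "https://github.com/jhustata/basic/raw/main/transplants.dta", clear

regress age bmi gender     // illustrative model; variable choice is an assumption
predict age_hat            // fitted values: the "absolute" estimate of Y_i
test bmi                   // Wald test of H0: coefficient on bmi = 0
```

`predict` after `regress` returns the linear prediction by default, and `test` leaves its p-value in `r(p)`.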

### 7.1 Univariable
But first, we tend to "explore" the variables one at a time:

$Y_i = \beta_0 + \epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$

```stata
use "${repo}transplants.dta", clear
```
##### Continuous
```stata
hist age
sum age
regress age 
``` 

Let's get fancy:
```stata
gen x=1 //dummy or meaningless variable
twoway scatter age x, jitter(15)
```

Even fancier!!
```stata
use "${repo}transplants", clear
g x=1
sum age
g mean = r(mean)
g lb = r(mean) - 1.96*r(sd)
g ub = r(mean) + 1.96*r(sd)
twoway scatter age x, jitter(15) || scatter mean x || rcap ub lb x 
```

##### Binary 
```stata
hist gender
```
##### Multicategory
```stata
hist dx
```
- Is this worth the trouble? Or is a simple `tab gender` sufficient?
- Notice there's no "error" or $\epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$ with binary variables
- And so we also don't have an "error" with logistic regression
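One way to weigh the first question is to put the table and the plot side by side. This is a minimal sketch, assuming `gender` and `dx` are stored as numeric variables in `transplants.dta` (which `hist` requires anyway):

```stata
// Sketch: frequency tables vs. a discrete histogram
use "https://github.com/jhustata/basic/raw/main/transplants.dta", clear

tab gender                 // counts and percentages for the binary variable
tab dx                     // same for the multicategory variable

// discrete: one bar per distinct value; percent: y-axis in percent
// addlabels: print the height above each bar
histogram dx, discrete percent addlabels
```

For a binary variable the table usually says everything the plot does; for a variable with many categories the plot may make the relative sizes easier to scan.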
### 7.2 Bivariable
We may then investigate correlations between two variables:

$Y_i = \beta_0 + \beta_1X_1 + \epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$ 

```stata
twoway scatter age bmi || lowess age bmi
``` 

You may even stratify your analysis:

$Y_i = \beta_0 + \beta_1X_1 + \epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$ 
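Stratifying means the intercept and slope are estimated separately within each stratum. A minimal sketch, assuming `gender` is the stratifying variable (that choice is ours, for illustration):

```stata
// Sketch: the same scatter-plus-lowess, drawn within each gender
use "https://github.com/jhustata/basic/raw/main/transplants.dta", clear

twoway (scatter age bmi) (lowess age bmi), by(gender)

// The numerical counterpart: one regression per stratum
bysort gender: regress age bmi
```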

### 7.3 Multivariable
How would we visualize the real-world scenario below?
$Y_i = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_NX_N + \epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$ 

##### Linear Regression
##### Generalized Linear Regression

```stata
use ${repo}transplants, clear
sum peak_pra,d
g highpra=peak_pra>r(p90) if !missing(peak_pra) //missing would otherwise count as "high"
sum wait_yrs,d
g longwait=wait_yrs>r(p50) if !missing(wait_yrs)
postutil clear 
postfile pp xaxis str80 coef double(result lb ub pvalue) using betas.dta, replace 
logistic highpra gender prev_ki bmi age wait_yrs don_ecd
local xaxis=1
qui foreach v of varlist gender prev_ki bmi age wait_yrs don_ecd {  
	lincom `v'
	return list 
	local est: di %3.2f r(estimate)
	local lb: di %3.2f r(lb)
	local ub: di %3.2f r(ub)
	local pval: di %3.2f r(p)
    post pp (`xaxis') ("`v'") (`est') (`lb') (`ub') (`pval')
	local xaxis=`xaxis' + 1
}

postclose pp

ls -l
use betas,clear
list 
#delimit ;
twoway (scatter result xaxis)
       (rcap lb ub xaxis, 
	       yscale(log)
	       yline(1, 
	            lc(lime) 
			    lp(dash)
	            )
	       legend(off)
	       xlab(
	           1 "Gender" 
	           2 "Previous Tx"
	           3 "BMI"
		       4 "Age"
		       5 "Wait y"
	           6 "ECD"
	           )
	       ti("Association between High PRA and select risk factors", pos(11))
	       yti("OR", 
	           orientation(horizontal)
		      )
	       xti("")
	      )
        ;
#delimit cr
graph export logistic.png,replace 
```

### 7.4 Hierarchical
Last week we briefly talked about [nested data](https://jhustata.github.io/basic/chapter6.html). In such data, repeated measurements on the same individual are correlated, so the i.i.d. assumption behind $\epsilon_{i.i.d.\sim\mathcal{N}(\mu=0,\sigma=1)}$ in the expressions above is "violated".

We assess blood pressure in each one of you during each session of the Stata I (Basic) class:

```stata
clear

// Set the number of students and sessions
local nstudents 100
local nsessions 8

// Create an empty dataset

set obs 8
gen student_id = .
gen session = .
gen sbp = .
gen session_date = .

// Loop over each student
forvalues i = 1/`nstudents' {
    // Generate data for each session for the current student
    forvalues j = 1/`nsessions' {
        // Generate student ID
        replace  student_id = `i'
        
        // Generate session
        replace session = `j'
        
        // Generate simulated systolic blood pressure measurements
        set seed `i'`j' // Set seed based on student and session
        replace sbp = rnormal(120, 10)
        
        // Append data for the current session to the dataset
        //append
    }
	save student`i', replace //edit in class after seeing the mess it causes!!
}

forvalues i=1/99 {
	
	append using student`i'.dta
}

// Sort the dataset
sort student_id session

// Display the first few observations
list student_id session sbp in 1/10

// Not what we wanted
bys student_id: replace session=_n

// Let's include the dates
local session_date=d(28mar2024)
forvalues i=1/8 {
	replace session_date=`session_date' if session==`i'
	local session_date=`session_date' + 7
}
format session_date %td
codebook 
replace sbp=round(sbp)
```
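For comparison, the same long-format dataset can be built without loops or intermediate files. This is a sketch of an alternative, not the in-class version: it uses `expand` to create the eight rows per student, and the seed value is arbitrary and chosen only for reproducibility.

```stata
// Sketch: simulate 100 students x 8 weekly sessions in long format
clear
set seed 2024              // arbitrary seed (an assumption, for reproducibility)
set obs 100                // one row per student to start
gen student_id = _n
expand 8                   // eight rows (sessions) per student
bysort student_id: gen session = _n
gen sbp = round(rnormal(120, 10))
gen session_date = d(28mar2024) + 7*(session - 1)   // weekly sessions
format session_date %td
list student_id session session_date sbp in 1/10
```

Unlike the loop above, this draws all 800 values from a single seed, so the simulated numbers will differ from the in-class output.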
### 7.5 Time
How may we visualize the change (if any) in blood pressure over the course of Stata I?

```stata
// Scatter plot of SBP over the 8-week period
twoway (scatter sbp session_date, sort jitter(9)) ///
    , xtitle("Session Date") ytitle("Systolic Blood Pressure") ///
    title("Systolic Blood Pressure Over 8 Weeks") legend(off)

```

This only tells us about the class as a whole over time and says nothing about individual students, each of whom might have a unique trajectory.
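To see the class average explicitly, one option is to collapse the data to one mean per session and connect the points. The sketch below simulates its own data so that it runs on its own; the seed and the name `mean_sbp` are assumptions made for illustration.

```stata
// Sketch: class-mean SBP at each session
clear
set seed 2024              // arbitrary seed (an assumption)
set obs 100
gen student_id = _n
expand 8
bysort student_id: gen session = _n
gen sbp = round(rnormal(120, 10))
gen session_date = d(28mar2024) + 7*(session - 1)
format session_date %td

preserve                   // keep the student-level data intact
collapse (mean) mean_sbp = sbp, by(session_date)
twoway connected mean_sbp session_date, sort ///
    xtitle("Session Date") ytitle("Mean Systolic Blood Pressure")
restore
```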
### 7.6 Nested 
```stata
// Line plot of SBP for each student over the 8-week period
twoway (connected sbp session_date if student_id==1, msymbol(circle) mcolor(blue)) ///
    (connected sbp session_date if student_id==2, msymbol(circle) mcolor(red)) ///
    (connected sbp session_date if student_id==100, msymbol(circle) mcolor(magenta)) ///
    , xtitle("Session Date") ytitle("Systolic Blood Pressure") ///
    title("Individual Trajectories of Systolic Blood Pressure Over 8 Weeks") ///
    legend(order(1 "Student 1" 2 "Student 2" 3 "Student 100")) ///
    yline(120, lcolor(black) lpattern(dash))

```
### 7.7 Mixed "Effects" of Time on SBP

##### Fixed
```stata
regress sbp session_date
```

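This is roughly equivalent to overlaying a straight-line fit (`lfit`) on the scatter plot in 7.5; a `lowess` curve would instead give a nonparametric smooth.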

##### Mixed
```stata
// Number the observations within each student, then save a working copy
bysort student_id: gen id = _n
tempfile student_data
save "`student_data'"

// Loop over each student to estimate the regression coefficient
quietly forvalues i = 1/100 {
    use "`student_data'", clear
    keep if student_id == `i'
    regress sbp session_date
    scalar beta`i' = _b[session_date]
}

// Display the coefficients
di "Regression Coefficients for Each Student:"
forvalues i = 1/100 {
    di "Student `i': " scalar(beta`i')
}
```
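The loop above fits 100 separate regressions. A mixed model estimates a single average slope while allowing each student their own baseline. The sketch below uses Stata's `mixed` command with a random intercept only; it simulates its own data, uses `session` (1 to 8) rather than `session_date` so the slope is per session, and the seed is arbitrary. These are all assumptions for illustration, not the course's specification.

```stata
// Sketch: random-intercept model of SBP on session
clear
set seed 2024              // arbitrary seed (an assumption)
set obs 100
gen student_id = _n
expand 8
bysort student_id: gen session = _n
gen sbp = round(rnormal(120, 10))

// Fixed part: sbp on session; random part: intercept varying by student
mixed sbp session || student_id:
```

Because the simulated values are independent draws, the estimated between-student variance should be close to zero here; with real repeated measurements it typically would not be.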
### 7.8 Lab
Practice examples to prepare you for your final homework can be found [here](lab7.md)
### 7.9 Homework
Your final homework is [here](hw7.md)

