# 4. Advanced file management

Is there anything wrong with this code block?

```stata
program define myprogram
quietly {
forvalues i=1/10 {
if `i' < 5 {
noisily disp "`i' is less than five" }
else {
noisily disp "`i' is at least five" }
}
} 
end
```

![](chatgpt_indent.png)

General programming practices

+ indent your code 
   + or copy & paste into ChatGPT, which will most certainly indent the code
   + code is much easier to read if indented
   + much easier to see and prevent errors
   + easier to debug
   + you'll find it easier to modify

+ test, test, test
   + iterate, iterate, iterate

+ collaborate, collaborate, collaborate
   + share, share, share
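
Applying the first point to the opening example, a sketch of the same program with one command per line and each closing brace on its own line might look like the block below. The `capture program drop` line is an addition (not in the original) so the block can be re-run in the same session.

```stata
* Drop any earlier definition so the block can be re-run (added for convenience)
capture program drop myprogram

program define myprogram
    quietly {
        forvalues i = 1/10 {
            if `i' < 5 {
                noisily disp "`i' is less than five"
            }
            else {
                noisily disp "`i' is at least five"
            }
        }
    }
end

* Run the program
myprogram
```

With the nesting visible, it is easier to check that every `{` has a matching `}`, and that each closing brace sits on a line by itself, as Stata expects.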

## 4.1 by 

Have you seen this syntax before?

```stata
gen record_no = _n //row number 
disp _N //total # rows
```

What if we want a variable of the total number of records in each ABO blood type category?

```stata
sort abo
by abo: gen cat_n = _N
```

This creates a variable `cat_n` storing the number of records in each `abo` category.

```stata
sort abo age
by abo: gen cat_id = _n
```

This creates a variable `cat_id` that is 1 for the youngest patient in each category, 2 for the
next youngest, etc.

You can combine `sort` and `by` into `bysort` (or just `bys`).
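
As a minimal sketch of the `bysort` form, the block below rebuilds `cat_id` and `cat_n` on a toy dataset. The six records are made up for illustration and only stand in for the transplants data. Variables inside the parentheses are used for sorting but not for grouping.

```stata
* Toy data standing in for the transplants dataset (values are made up)
clear
input str2 abo age
"A" 52
"O" 34
"A" 41
"B" 67
"O" 29
"A" 60
end

* Sort by abo, then by age within abo, and number the records in each abo group
bysort abo (age): gen cat_id = _n

* Count the records in each abo group (no secondary sort order needed)
bysort abo: gen cat_n = _N

list, sepby(abo)
```

`bysort abo (age):` is equivalent to `sort abo age` followed by `by abo:`.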

## 4.2 egen  

We can create a new variable equal to the mean of all records...

```stata
use transplants, clear
sum age
gen mean_age = r(mean)
```

...but how can we do that with just one command?

```stata
use transplants, clear 
egen mean_age = mean(age)
```

More valid egen commands:

```stata
egen median_age = median(age)
egen max_age = max(age)
egen min_age = min(age)
egen age_q1 = pctile(age), p(25)
egen age_sd = sd(age)
egen total_prev = total(prev) //total() is the current name for egen's older sum()
```

So what's the big deal? Suppose we want the mean age stratified by diagnosis. First, we sort on diagnosis:

```stata
use transplants, clear 
sort dx
```

Now we run `egen` with the `by dx:` prefix:

```stata
by dx: egen mean_age = mean(age)
by dx: egen min_bmi = min(bmi) 
```

Or we can use `bysort`, which sorts and groups in one step. More valid `egen` commands:

```stata
bys abo: egen m_bmi=mean(bmi)
bys abo gender: egen max_bmi = max(bmi) 
bys abo gender: egen min_bmi = min(bmi) 
gen spread = max_bmi - min_bmi
```

## 4.3 tag 

Suppose you've generated `spread` as per the previous slide. You want to look at it:

```stata
list abo gender spread
```

`egen newvar = tag(varlist)` creates a new variable that "tags" one record per group (here, per abo/gender pair). This
variable is 1 for each tagged record, and 0 for all others:

```stata
egen grouptag = tag(abo gender) 
list abo gender spread if grouptag
```
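
To see the effect of the tag without the transplants data, here is a small self-contained sketch. The six records and their values are made up; the variable names follow the examples above.

```stata
* Toy data: four abo/gender pairs spread over six records (values are made up)
clear
input str2 abo gender bmi
"A" 0 22.5
"A" 0 31.0
"A" 1 27.3
"O" 0 24.1
"O" 1 19.8
"O" 1 35.2
end

* Range of bmi within each abo/gender pair
bys abo gender: egen max_bmi = max(bmi)
bys abo gender: egen min_bmi = min(bmi)
gen spread = max_bmi - min_bmi

* Tag exactly one record per abo/gender pair
egen grouptag = tag(abo gender)

list abo gender spread              // six rows, with repeated pairs
list abo gender spread if grouptag  // one row per abo/gender pair
```

Without the `if grouptag` condition, every record in a pair repeats the same `spread` value.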

```stata
use donors_recipients, clear 
bys fake_don: gen n_tx = _N 
bys fake_don: gen tx_id = _n 
egen don_tag = tag(fake_don) 
tab n_tx if don_tag
```

1. Import dataset
2. Total number of recipients for each donor
3. ID for each transplant for a given donor (1 or 2)
4. Tag one record per donor
5. Display a table of transplants per donor

Tagging <u>protip</u>: `egen ... = tag()` is very handy for graphing (which we haven't discussed yet). Many graphing commands are slow on large datasets, so you can use the tag variable to run your graphing commands on only one record per group.

```stata
use transplants, clear
egen ethtag = tag(ethcat)
bys ethcat: egen mean_bmi=mean(bmi) 
bys ethcat: egen mean_age=mean(age)
```
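
A possible way to use the tag when graphing is sketched below on made-up data. The `twoway scatter` call is an illustrative assumption, not necessarily the graph the course goes on to draw.

```stata
* Toy data: three ethcat groups, two records each (values are made up)
clear
input ethcat age bmi
1 34 22.5
1 52 31.0
2 41 27.3
2 67 24.1
3 29 19.8
3 60 35.2
end

egen ethtag = tag(ethcat)
bys ethcat: egen mean_bmi = mean(bmi)
bys ethcat: egen mean_age = mean(age)

* Only tagged records are used: one point per group instead of one per record
count if ethtag
twoway scatter mean_bmi mean_age if ethtag
```

Since the group means are identical within a group, plotting only the tagged records gives the same picture with far fewer points.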

## 4.4 function()

### 4.4.1 round()

Round down to the nearest integer:

```stata
disp floor(0.3)
disp floor(8.9)
```

Round up to the nearest integer:

```stata
disp ceil(0.3)
disp ceil(8.9)
```

Round to the nearest integer:

```stata
disp round(0.3)
disp round(8.9)
```

### 4.4.2 min()

```stata
disp min(8,6,7,5,3,0,9)
disp max(8,6,7,5,3,0,9)
```

### 4.4.3 exp()

Useful for transforming a variable (exponent, log, square root):

```stata
disp exp(1)
disp ln(20)
disp sqrt(729)
```
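
The `disp` lines above work on constants. As a sketch of the "transforming a variable" use, the same functions can go inside `gen`. The data below are made up and the new variable names are only illustrative.

```stata
* Toy data (values are made up)
clear
input age bmi
34 22.5
52 31.0
67 27.3
end

gen log_bmi  = ln(bmi)       // natural log of bmi
gen sqrt_age = sqrt(age)     // square root of age
gen bmi_back = exp(log_bmi)  // exp() undoes ln(), up to rounding error

list
```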

Other math functions:

```stata
di abs(-6) //absolute value
disp mod(529, 10) //modulus (remainder)
disp c(pi)
disp sin(c(pi)/2) //sine function
```


### 4.4.4 strpos()

As discussed throughout this course, a string variable is data that takes the form of characters instead of numbers.

```stata
use transplants, clear
ds, has(type string)
```

```stata
list extended_dgn in 1/5, clean
```

#### 4.4.4.1 word()

The `word` function isolates the first (or second, etc.) word of a string: `word(s, n)` returns the nth word of `s`, or an empty string if there are fewer than n words.

```stata
disp word("Hello, is there anybody in there?", 4)
```

```stata
list extended_dgn if word(ext, 5) != "", clean noobs
```

#### 4.4.4.2 strlen()

The `strlen` function counts the number of characters in a string:

```stata
disp strlen("Same as it ever was")
list extended_dgn if strlen(ext)< 6, clean
```


#### 4.4.4.3 regexm()

Test whether a string appears inside another string:

```stata
assert regexm("Earth", "art")
assert !regexm("team", "I")

tab ext if regexm(ext, "HTN")
```


`regexm` actually searches for regular expressions. We won't go into the syntax, but it lets you do more powerful searches:

```stata
list ext if regexm(ext, "^A") 
//starts with A

list ext if regexm(ext, "X$") 
//ends with X

tab ext if regexm(ext, "HIV.*Y") 
//contains "HIV", then some other stuff, then Y

```
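
For a literal substring test, `strpos()` is a simpler alternative to `regexm()`: `strpos(s1, s2)` returns the position at which `s2` first appears in `s1`, or 0 if it does not appear, and it treats characters such as `^`, `$`, and `.` as ordinary text. The sketch below uses made-up diagnosis strings in place of the real `extended_dgn` values.

```stata
* Position of the first match, or 0 when there is no match
disp strpos("Earth", "art")
disp strpos("team", "I")

assert strpos("Earth", "art") > 0
assert strpos("team", "I") == 0

* Toy diagnosis strings (made up) standing in for extended_dgn
clear
input str20 extended_dgn
"HTN NEPHROPATHY"
"DIABETES TYPE II"
"MALIGNANT HTN"
end

* Keep only records whose diagnosis contains the literal text "HTN"
list extended_dgn if strpos(extended_dgn, "HTN") > 0
```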

### 4.4.5 dates()

How does Stata store dates? Remember this?

```stata
disp %3.2f exp(1)
disp %4.1f 3.14159
```


Stata stores dates as integers: the number of days since January 1, 1960. The `%td` display format shows such a number as a date.

```stata
disp %td 19400
disp %td 366
disp %td -5

```

"td" = "time, daily" (`%td` is the display format for daily dates)

Since Stata stores dates as numbers, you can do arithmetic on them.

```stata
use transplants, clear
gen oneweek = transplant_date+7
format %td oneweek
list transplant_date oneweek in 1/3
```




Converting a date to Stata %td: td()

The function `td()` returns the number corresponding to a date written in Stata's funny default format (day, three-letter month, year, as in `04jul1976`).

```stata
disp td(04jul1976)
disp td(26oct1985) 
```

The function `date()` converts a string to a numerical date. It takes two strings. The first is the date to convert, and the second is a formatting string giving the order of month, day, and year.


```stata
disp date("August 15, 1969", "MDY")
disp date("2061 28 July", "YDM")
```


Generate a date variable from a string

```stata
use donors.dta, clear
list fake_don_id fake_don_dob

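* date() returns the number of days since 01jan1960
gen donor_dob = date(fake_don_dob, "DMY")

* Without a %td format the new variable displays as a plain integer
format %td donor_dob
list fake_don_id fake_don_dob donor_dob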
```
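
Because dates are stored as day counts, subtracting one date variable from another gives the number of days between them. The sketch below assumes two made-up string date columns (`tx_str` and `end_str` are hypothetical names) and converts them with `date()` first.

```stata
* Toy data: start and end dates as strings (values are made up)
clear
input str9 tx_str str9 end_str
"01jan2010" "15jan2010"
"20feb2012" "20feb2013"
end

* Convert the strings to numeric dates and display them as dates
gen tx_date  = date(tx_str, "DMY")
gen end_date = date(end_str, "DMY")
format %td tx_date end_date

* Days elapsed between the two dates
gen days_between = end_date - tx_date

list tx_date end_date days_between
```

The second record spans February 29, 2012, so its difference is 366 days rather than 365.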

## 4.5 Survival analysis

This portion of the lecture is designed for people who:
+ have not done survival analyses in Stata, but
+ know what "survival analyses" are


### 4.5.1 gen

### 4.5.2 stset

### 4.5.3 sts 

#### 4.5.3.1 sts graph

#### 4.5.3.2 sts list

#### 4.5.3.3 stcox