# 4. Basic file management



## 4.1 Merging files

| transplants.dta | donors_recipients.dta | donors.dta  |
|-----------------|-----------------------|-------------|
| fake_id         | fake_id               | fake_don_id |
| age             | fake_don_id           | age_don     |

Suppose your analytic dataset is [transplants.dta](https://github.com/jhustata/basic/raw/main/transplants.dta) and you wish to assess whether donor age is correlated with recipient age in deceased-donor transplants.

Let's first define our working directory (here, the URL of the course repository) using a global macro:

```stata
global url "https://github.com/jhustata/basic/raw/main/"
```

Now let's load our analytic dataset:

```stata
use "${url}transplants", clear
lookfor age
```

We only have the recipient's age. So let's confirm that the donor's age is in `donors.dta`:

```stata
use "${url}donors", clear
lookfor age
```

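Can we merge `transplants.dta` with `donors.dta` to solve this issue? Not directly: as the table above shows, the two files share no identifier. `donors_recipients.dta` contains both `fake_id` and `fake_don_id`, so we first merge `transplants.dta` with it and then merge the result with `donors.dta`.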

#### Simple merge command

```stata
use transplants, clear 
merge 1:1 fake_id ///
    using donors_recipients
```

- We expect each `fake_id` to appear only once in each dataset ("one-to-one merge")
- `fake_id` is the variable that appears in both datasets, letting us link them
- `donors_recipients` is the dataset that we're merging with the dataset in memory

```stata
tab _merge 
```
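
To see what the values of `_merge` mean, here is a minimal sketch on toy data. The values below are made up for illustration and are not taken from the course files; only the variable names match.

```stata
* Toy "using" dataset (hypothetical values)
clear
input fake_id fake_don_id
1 101
2 102
4 104
end
tempfile crosswalk
save `crosswalk'

* Toy "master" dataset (hypothetical values)
clear
input fake_id age
1 50
2 61
3 47
end

merge 1:1 fake_id using `crosswalk'
list fake_id age fake_don_id _merge
* _merge==1: record found in the master dataset only (fake_id 3)
* _merge==2: record found in the using dataset only  (fake_id 4)
* _merge==3: record found in both datasets           (fake_id 1 and 2)
```

Run this from a do-file, since `tempfile` names disappear once the do-file ends.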

#### Fancier merge commands

Let's explore fancier syntax

```stata
use transplants, clear 
merge 1:1 fake_id ///
    using donors_recipients, ///
    keep(match)
```

- `keep(match)`: only records that appear in both datasets will remain in memory
- `keep(master match)`: only records that appear in the master dataset only, or in both datasets, will remain in memory

```stata
use transplants, clear 
merge 1:1 fake_id ///
    using donors_recipients, ///
    gen(mergevar)
```

- Instead of creating a "system-defined" variable called `_merge`, let's have a "user-defined" one called `mergevar`

```stata
use transplants, clear 
merge 1:1 fake_id ///
    using donors_recipients, ///
    nogen 
```

- Don't create the `_merge` variable at all
- NOTE: if the `_merge` variable already exists, the `merge` command will give an error unless you use `gen()` or `nogen`
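
The note above can be checked on toy data. This is only a sketch: the values are made up, and `capture noisily` is used so that the failed merge shows its error message without stopping the do-file.

```stata
* Toy "using" dataset (hypothetical values)
clear
input fake_id fake_don_id
1 101
2 102
end
tempfile crosswalk
save `crosswalk'

* Toy "master" dataset (hypothetical values)
clear
input fake_id age
1 50
2 61
end

* The first merge creates _merge
merge 1:1 fake_id using `crosswalk'

* A second plain merge fails because _merge already exists
capture noisily merge 1:1 fake_id using `crosswalk'
local rc = _rc
display "return code of the second merge: `rc'"

* With nogen the merge runs even though _merge is still in memory
merge 1:1 fake_id using `crosswalk', nogen
```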

#### Two merges in a row

```stata
use transplants, clear 
merge 1:1 fake_id ///
    using donors_recipients, ///
    keep(match) nogen 
merge m:1 fake_don_id ///
    using donors, keep(match) nogen ///
    keepusing(age_don)
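```

- `m:1` ("many-to-one merge"): `fake_don_id` may appear more than once in the dataset in memory, but only once in `donors`
- `keepusing(age_don)`: don't load all variables from the new (using) dataset. Just load `age_don`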
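
Why `m:1` in the second merge? Presumably because one deceased donor can donate to more than one recipient, so `fake_don_id` can repeat among recipients. A minimal sketch on toy data (the values are made up, and `donors_toy` is just a name chosen for this example):

```stata
* Toy donor dataset: one row per donor (hypothetical values)
clear
input fake_don_id age_don
101 35
102 58
end
tempfile donors_toy
save `donors_toy'

* Toy recipient dataset: recipients 1 and 2 share donor 101
clear
input fake_id age fake_don_id
1 50 101
2 61 101
3 47 102
end

* fake_don_id repeats in the data in memory, so it cannot be a 1:1 key
duplicates report fake_don_id

* Many recipients to one donor; bring in age_don only
merge m:1 fake_don_id using `donors_toy', ///
    keep(match) nogen keepusing(age_don)
list
```

Both recipients of donor 101 receive the same `age_don`.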

Now let's get back to our original problem: are donor and recipient age correlated?

```stata
corr age*
```

#### Merging protip

Using `merge, keep(match)` might drop more records than you expect. If you think all
records will match, it's a good idea to check this assumption:

```stata
use transplants, clear
merge 1:1 fake_id ///
    using donors_recipients, ///
    keep(master match)
assert _merge==3
```

Maybe you don't expect a perfect match, but you
want to make sure nearly all of your records match:

```stata
use transplants, clear
merge 1:1 fake_id ///
    using donors_recipients, ///
    keep(master match)
* _merge is 1 (master only) or 3 (matched), so a mean above 2.98
* means fewer than 1% of records are unmatched
quietly sum _merge
assert r(mean) > 2.98
```
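
If the `r(mean)` trick feels opaque, the match rate can also be computed directly. The sketch below uses made-up data (200 records, one of which is deliberately missing from the using file) and a hypothetical 99% threshold:

```stata
* Toy master dataset: 200 records (hypothetical)
clear
set obs 200
gen fake_id = _n

* Toy using dataset: same records, minus fake_id 200
tempfile using_toy
preserve
drop if fake_id == 200
gen fake_don_id = 1000 + fake_id
save `using_toy'
restore

merge 1:1 fake_id using `using_toy', keep(master match)

* Share of records in memory that matched
count if _merge == 3
local pct_matched = 100 * r(N) / _N
display "Matched: " %5.1f `pct_matched' "%"
assert `pct_matched' > 99
```

Here 199 of 200 records match (99.5%), so the `assert` passes silently.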

That's enough by way of introduction to the `merge` command. It comes in handy in projects where the relevant variables are spread across different datasets.