---
title: "Version Control"
subtitle: "Getting used to Git(hub) with R"
output:
  html_document:
    toc: TRUE
    df_print: paged
    number_sections: FALSE
    highlight: tango
    theme: lumen
    toc_depth: 3
    toc_float: true
    css: custom.css
    self_contained: false
    includes:
      after_body: footer.html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
***
## Welcome back! `r emo::ji("flexed biceps")`
This week we will be looking at a key component of the data science workflow: version control. More specifically we will create your first repo on Github and see how we can use it to keep a record of all the changes that were ever made to your code. This is an essential component of collaborative coding efforts, but can also be immensely beneficial to your own solo projects.
Today's session will follow the exact recipe suggested in Simon's lecture:
- Create a new repo on Github and initialize it
- Clone this repo to your local machine
- Make some changes to a file (we'll review some code from last week)
- Stage these local changes
- Commit them to our Git history with a helpful message
- Pull from the GitHub repo just in case anyone else made changes too (not expected here, but good practice)
- Push our changes to the GitHub repo
Before we get into the nitty-gritty of this week's session, I would like to suggest you all [download GitHub Desktop](https://desktop.github.com/). It is a GUI that lets you interact with GitHub and might be an additional option if you do not want to rely on RStudio or the Command Line alone. Instructions to set it up and get it running can be found [here](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/overview/getting-started-with-github-desktop).
There are a number of different GUIs that you can try out and play around with. Some are better than others. An additional free GUI like GitHub Desktop that I like is [GitKraken](https://www.gitkraken.com/).
***
# Project Management `r emo::ji("mage")`
Before getting to know Git(Hub) it is useful to have a look at R-projects and how you should structure them. Version Control is only useful if your projects can be understood **by a wide range of collaborators**.
First of all, you should always use RStudio projects. They allow you to keep all the files associated with a project together — input data, R scripts, analytical results, figures etc. When you open a Project you will find a .Rproj file. This file contains all the meta information relevant to your project: settings, working directory, open scripts, encoding etc. If you are curious what exactly lies within it, I suggest you open the .Rproj file with a text editor.
However, projects are only as useful as the organization within them. As we saw in the lecture a structure like the following one would be a good starting point.
```
.
+-- src
+-- output
|   +-- figures
|   +-- tables
+-- data
|   +-- raw
|   +-- processed
|
+-- README.md
+-- run_analyses.R
+-- .gitignore
```
This structure can be adapted to the requirements of any given project. It is worth spending a little more time on the different folders and what they should contain.
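The layout above can be created in one go from the command line. The following is a minimal sketch, assuming a hypothetical project called `my_project` that lives in a temporary directory; replace the path with wherever you keep your own projects.

```bash
# Build the folder skeleton shown above inside a throwaway directory.
# "my_project" is a hypothetical project name used for illustration only.
project="$(mktemp -d)/my_project"

# -p creates parent folders as needed and does not complain if they exist.
mkdir -p "$project/src" \
         "$project/output/figures" \
         "$project/output/tables" \
         "$project/data/raw" \
         "$project/data/processed"

# Create the top-level files as empty placeholders.
touch "$project/README.md" "$project/run_analyses.R" "$project/.gitignore"

# List everything that was created (hidden files included).
find "$project" | sort
```

Git does not track empty folders, so `data/raw` and friends will only show up on GitHub once they contain at least one file.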
### Data Folder
The data folder is, as the name suggests, where your data goes. The "raw" folder should contain the original data that will need to be wrangled and manipulated. It might be that your data is stored in an online relational database, or needs to be accessed through an API. In such cases the raw folder can be dispensed with.
Once you have shaped your data into a format that suits your analysis, the data should be stored in the "processed" folder. If you need to perform certain intermediate steps it might also be an idea to create a dedicated "temp" folder.
### src folder
This is the folder that should contain your code. You should aim to emulate the workflow of a standard data science project within this folder. It is best to split the different stages of your project into separate scripts. Name these scripts so that it is clear which step of the project they represent (e.g. "2-preprocess-data", "3-descriptives" etc.). The scripts should then be executed sequentially through a main runner script, which can be done by calling the `source()` function on each of them.
### relative paths
Within a given script you should always use relative paths. To access your processed data, for example, you can use a file path such as `"./data/processed.RDa"`. The "." represents your current working directory, which is set automatically to the project folder when you open an RStudio project.
For more in depth guides and explanations you can look [here](https://www.r-bloggers.com/2018/08/structuring-r-projects/). If you feel like you want to practice this you can find more material [here](https://github.com/intro-to-data-science-21/labs/tree/main/practice-scripts/01-file-management.R). And there is even a package dedicated to file referencing [here](https://cran.r-project.org/web/packages/here/vignettes/here.html).
***
# Setting Up Git `r emo::ji("hammer")`
Most of you will already have completed this step (we hope). For future reference however, we included a small reminder on the necessary steps:
1. Register for a [GitHub](https://github.com/) account.
2. Install Git (or update version) using the Command Line.
```{bash}
which git
git --version
```
```{bash, eval=FALSE}
brew install git
brew install gh
```
3. Enter Credentials for Git
```{bash, eval=FALSE}
git config --global user.name 'its-me'
git config --global user.email 'my-email@adress.eu'
```
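To confirm that Git stored these values, you can read them back with `git config --global --get`. The sketch below is only a demonstration: it points Git at a throwaway home directory through a hypothetical helper called `demo_git`, so that running it does not touch your real `~/.gitconfig`. On your own machine you would simply run `git config --global --get user.name`.

```bash
# Throwaway home directory so this sketch leaves your real settings alone.
demo_home="$(mktemp -d)"

# Hypothetical helper: runs git with HOME (and the XDG config dir) redirected.
demo_git() {
  HOME="$demo_home" XDG_CONFIG_HOME="$demo_home/.config" git "$@"
}

# Same commands as above, using the placeholder values from this section.
demo_git config --global user.name 'its-me'
demo_git config --global user.email 'my-email@adress.eu'

# Read the values back to confirm they were stored.
demo_git config --global --get user.name    # prints: its-me
demo_git config --global --get user.email   # prints: my-email@adress.eu
```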
# Suggested Workflow `r emo::ji("surfer")`
***
The following section highlights the recommended workflow that you should employ when you work with Git. You can see it as a sort of recipe that you should follow under most circumstances. At each stage you can find the instructions for working through both the Command Line and through RStudio. **Be aware** however that these are **separate** processes that should not be mixed. Either you use the shell for version control or you use RStudio.
### 1. Create a new repo on Github
* Go to [your github page](https://github.com) and make sure you are logged in.
* Click the green “New repository” button. Or, if you are on your own profile page, click on “Repositories”, then click the green “New” button.
* How to fill this in:
    + Repository name: my_first_repo (or whatever you want).
    + Description: “figuring out how this works” (or whatever, but some text is good for the README).
    + Select Public.
    + YES Initialize this repository with a README.
    + For everything else, just accept the default.
Great, now that you created a new repo on Github, it is important to note that you **should always** create a repo **prior** to starting your work in RStudio.
:::: {.questionbox data-latex=""}
::: {.center data-latex=""}
**Question:**
:::
::: {.center data-latex=""}
There is a distinct advantage to employing the "GitHub first, RStudio/Shell second" approach. Do you know what it is?
:::
::::
***
### 2. Clone it to your local machine {.tabset .tabset-fade .tabset-pills}
Whatever way you plan on cloning this repo, you first need to copy the URL identifying it. Luckily there is another green button "Code" that allows you to do just that. Copy the HTTPS link for now. It will look something like this https://github.com/tom-arend/my_first_repo.git.
#### Using the Command Line
Open the Terminal on your laptop.
Be sure to check what directory you’re in. `$ pwd` displays the working directory. `$ cd` is the command to change directory.
Clone a repo into your chosen directory.
```{bash, eval = F}
cd ~/phd_hertie/teaching/IDS_fall_21
git clone https://github.com/tom-arend/my_first_repo.git
```
Check whether it worked:
```{bash, eval=F}
cd ~/phd_hertie/teaching/IDS_fall_21/my_first_repo
git log
git status
```
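If you want to rehearse cloning without touching GitHub, a local repository can stand in for the remote. The sketch below is an illustration under that assumption: `remote_repo` is a hypothetical local folder playing the role of the GitHub repo, and the clone is made from its path instead of an HTTPS URL.

```bash
work="$(mktemp -d)"

# Stand-in for GitHub: a local repo that already holds a README commit.
git init -q "$work/remote_repo"
echo "figuring out how this works" > "$work/remote_repo/README.md"
git -C "$work/remote_repo" add README.md
git -C "$work/remote_repo" \
    -c user.name='its-me' -c user.email='my-email@adress.eu' \
    commit -q -m "Initial commit"

# Clone it just as you would clone the HTTPS URL.
git clone -q "$work/remote_repo" "$work/my_first_repo"

# Check whether it worked: one commit in the log, a clean status,
# and a remote called "origin" pointing back at the source.
git -C "$work/my_first_repo" log --oneline
git -C "$work/my_first_repo" status
git -C "$work/my_first_repo" remote -v
```

The `origin` entry printed by `git remote -v` is the link that later tells `git pull` and `git push` where to go.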
#### Using Rstudio
In RStudio, go to:
File > New Project > Version Control > Git.
In the “repository URL”-box paste the URL of your new GitHub repository.
Do not just create some random directory for the local copy. Instead think about how you organize your files and folders and make it coherent.
I always suggest that with any new R-project you “Open in new session”.
Finally, click “Create Project” to create a new directory. What you get are three things in one:
* a directory or “folder” on your computer
* a Git repository, linked to a remote GitHub repository
* an RStudio Project
In the absence of other constraints, I suggest that all of your R projects have exactly this set-up.
#### Using Github Desktop
Here is a short gif on how to clone a repo with GitHub Desktop:
```{r, echo=FALSE, out.width="85%", fig.cap="", fig.align="center"}
knitr::include_graphics("./pics/github_desktop_cloning.gif")
```
### {.unlisted .unnumbered}
***
### 3. Make changes to a file {#code}
To showcase how useful Git can be, we first need to add some files to our repo. For now there should only be the .gitignore, the .Rproj and the README file. While we do this we might as well review some of the stuff we encountered last week.
* So let's create a new R script and save it in the directory that we just cloned.
* First load/install necessary packages (the tidyverse suffices here)
```{r, eval=F}
# Set-up your script ------------------------------------------------------
# install.packages(c("tidyverse", "gapminder", "pacman")) # uncomment if not yet installed
pacman::p_load(tidyverse, gapminder)
```
* Then load the data you want to work with into R.
```{r, eval=F}
# Load your Data into R ---------------------------------------------------
data(gapminder)
head(gapminder)
```
* Finally, start cleaning your data.
```{r, eval=F}
# Clean your Data ---------------------------------------------------------
gapminder_clean <- gapminder %>%
  rename(life_exp = lifeExp, gdp_per_cap = gdpPercap) %>%
  mutate(gdp = pop * gdp_per_cap)
```
* For good measure let's also update our Readme.
### {.unlisted .unnumbered}
***
### 4. Stage your Changes and Commit {.tabset .tabset-fade .tabset-pills}
Before we get on to the next step, it is a good idea to save this newly created script and give it a name. Do the same thing for the README file. Now that your changes are saved locally, we need to let Git know.
#### Using the Command Line
Stage ("add") a file or group of files. This allows you to stage specific individual files such as the README file for example.
```{bash, eval=F}
git add NAME-OF-FILE-OR-FOLDER
```
Alternatively, you could stage all changes in the whole repo (new, modified, and deleted files):
```{bash, eval=F}
git add -A
```
Or you could stage updated files only (modified or deleted, but not new):
```{bash, eval=F}
git add -u
```
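Finally, you can stage everything in the current directory and below. In Git 2.x this covers new, modified, and deleted files (older 1.x versions skipped deletions):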
```{bash, eval=F}
git add .
```
Having done so, you are now ready to commit these changes!
```{bash, eval=F}
git commit -m "Helpful message"
```
As you can imagine, the command shell with its different options (and there are more beyond the staging phase) can be quicker and more flexible than a GUI, especially for experienced users.
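To see what staging actually does, it helps to watch `git status --short` between commands. The following is a small sketch in a throwaway repo; the file names are hypothetical and the identity is set only for this repo.

```bash
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git config user.name 'its-me'
git config user.email 'my-email@adress.eu'

# First commit: a README that Git now tracks.
echo "# my_first_repo" > README.md
git add README.md
git commit -q -m "Add README"

# Change the tracked file and create a brand-new, untracked one.
echo "figuring out how this works" >> README.md
echo 'library(tidyverse)' > initial_script.R

# -u stages changes to tracked files only.
git add -u
after_u="$(git status --short)"
echo "$after_u"
# M  README.md           <- staged
# ?? initial_script.R    <- still untracked

# -A stages everything, including the new file.
git add -A
after_all="$(git status --short)"
echo "$after_all"
# M  README.md
# A  initial_script.R

git commit -q -m "Update README and add first script"
git log --oneline
```

In the short status format, a letter in the first column means the change is staged, while `??` marks a file Git is not tracking yet.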
#### Using RStudio
In the Environment/History panel a new tab called "git" should have appeared.
* Click on it and it should display all the changed and new files in your directory.
* Now you can select which files you want to stage. Simply tick the box next to your chosen files.
* Hit the Commit Button and a new window should open up.
* In this window you can quickly add a helpful message to mark exactly what you did. (Do not neglect this, as it helps both you and your collaborators to understand your changes.)
This method has the clear advantage of being extremely intuitive. However, you are limited to selecting and staging individual files.
#### Using GitHub Desktop
Changed or added files are automatically staged in GitHub Desktop. The GUI also displays (where possible) the changes that were made.
You only really need to commit the changes with a nice little summary or message.
```{r, echo=FALSE, out.width="85%", fig.cap="", fig.align="center"}
knitr::include_graphics("./pics/github_desktop_stage_commit.gif")
```
### {.unlisted .unnumbered}
***
### 5. Pulling and Pushing your commits {.tabset .tabset-fade .tabset-pills}
This part of the version control workflow should be relatively easy. The most important thing to remember is that you should always **pull** before you **push**. The reason for this is that in collaborative projects someone else might have pushed changes to the same file you were working on. This can lead to conflicts that should be avoided.
#### Using the Command Line
There are only two commands you need to remember here and they are pretty intuitive:
To pull from the main repo:
```{bash, eval=F}
git pull
```
And to push your commits:
```{bash, eval=F}
git push
```
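The reason for pulling first becomes visible once two people work on the same repo. The sketch below simulates this locally and is only an illustration: a bare repo in a temporary folder plays the role of GitHub, and the clones `mine` and `theirs` are hypothetical stand-ins for you and a collaborator. The `--no-rebase` flag is an assumption made here so that the pull merges the collaborator's commit regardless of how your Git is configured.

```bash
work="$(mktemp -d)"

# A local bare repo plays the role of GitHub.
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/main

# Our clone: create the first commit and push it.
git clone -q "$work/remote.git" "$work/mine" 2>/dev/null
git -C "$work/mine" symbolic-ref HEAD refs/heads/main
git -C "$work/mine" config user.name 'its-me'
git -C "$work/mine" config user.email 'my-email@adress.eu'
echo "Hello Github!" > "$work/mine/README.md"
git -C "$work/mine" add README.md
git -C "$work/mine" commit -q -m "Add README"
git -C "$work/mine" push -q -u origin main

# A collaborator clones the repo, adds a script and pushes it.
git clone -q "$work/remote.git" "$work/theirs"
git -C "$work/theirs" config user.name 'collaborator'
git -C "$work/theirs" config user.email 'collaborator@example.com'
echo "x <- 1" > "$work/theirs/analysis.R"
git -C "$work/theirs" add analysis.R
git -C "$work/theirs" commit -q -m "Add analysis script"
git -C "$work/theirs" push -q

# Meanwhile we commit locally without having pulled.
echo "More notes" >> "$work/mine/README.md"
git -C "$work/mine" commit -q -am "Extend README"

# Pushing first is rejected: the remote holds a commit we do not have.
if git -C "$work/mine" push -q 2>/dev/null; then
  first_push="accepted"
else
  first_push="rejected"
fi
echo "push before pull: $first_push"

# Pull (merging the collaborator's commit), then push again.
git -C "$work/mine" pull -q --no-rebase --no-edit
git -C "$work/mine" push -q
```

Because the two commits touched different files, the pull merges them without any conflict and the second push goes through.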
#### Using RStudio
In RStudio you should see a change in the **git** tab. It should now read: "Your branch is ahead of 'origin/main' by 1 commit"
As long as this information is displayed, you know that you need to pull and push your commit.
To do this simply click on the blue arrow pointing down to **pull** from your main repo, before clicking on the green arrow pointing upwards to **push** your commits.
#### Using Github Desktop
It is not really straightforward to pull before pushing with Github Desktop. Given that there is no prominent pull button, you need to actively remember to do so prior to pressing the blue push button.
```{r, echo=FALSE, out.width="85%", fig.cap="", fig.align="center"}
knitr::include_graphics("./pics/github_desktop_pull_and_push.gif")
```
### {.unlisted .unnumbered}
At some point during any of these processes you will be prompted to enter your GitHub username and password. It might be that this works, or it might be that you receive the following error message:
> remote: Support for password authentication was removed on August 13, 2021. Please use a personal access token instead.
:::: {.noticebox data-latex=""}
::: {.center data-latex=""}
**Attention!**
:::
GitHub no longer supports simply using username and password as a way of authenticating yourself. You now have to identify yourself using *Personal Access Tokens*. You can cache these with [HTTPS](#https) or you can use [SSH keys](https://happygitwithr.com/ssh-keys.html).
::::
***
### Exercises (Part I)
The best way to get into the habit of using Git in your workflow is to practice! So let us get back to our working example with my_first_repo and the initial_script.
1. Go to GitHub and create a new repo called: exercises_1
2. Make sure to include a description for the README
3. Clone that repo to your local machine using your preferred method
4. Change the README file to read: "Hello Github!"
5. Save the README.
6. Stage and commit your changes (Do not forget the message that goes along with your commit!)
7. First pull from origin (Not necessary here, but it is good to develop the right habits)
8. Then push your commit.
# Branches `r emo::ji("tree")`
***
A really cool feature of version control with Git is *branches*.
Branches allow you to take a snapshot of your existing repo and try out a whole new idea without affecting your main (i.e. "master") branch.
Only once you (and your collaborators) are 100% satisfied would you merge it back into the master branch.
Okay, time to play around with branches a little bit!
### {.unlisted .unnumbered .tabset .tabset-fade .tabset-pills}
#### Using the Command Line
Create a new branch on your local machine and switch to it:
```{bash, eval=F}
git checkout -b NAME-OF-YOUR-NEW-BRANCH
```
Push the new branch to GitHub:
```{bash, eval=F}
git push origin NAME-OF-YOUR-NEW-BRANCH
```
List all branches on your local machine:
```{bash, eval=F}
git branch
```
Switch back to (e.g.) the master branch:
```{bash, eval=F}
git checkout master
```
Delete a branch, first locally and then on GitHub:
```{bash, eval=F}
git branch -d NAME-OF-YOUR-FAILED-BRANCH
git push origin :NAME-OF-YOUR-FAILED-BRANCH
```
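Put together, the branch commands above can be tried out end to end in a throwaway repo. This is a minimal local sketch: it merges with `git merge` on the command line instead of through a pull request, and the branch and file names are hypothetical.

```bash
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
# Make sure the first branch is called "master", as in the commands above.
git symbolic-ref HEAD refs/heads/master
git config user.name 'its-me'
git config user.email 'my-email@adress.eu'

echo "Hello Github!" > README.md
git add README.md
git commit -q -m "Add README"

# Create a branch, switch to it, and commit an experiment there.
git checkout -q -b data-manipulation-idea
echo "x <- 1" > idea.R
git add idea.R
git commit -q -m "Try out an idea"

# Back on master the experiment is not visible.
git checkout -q master
before_merge="$(ls)"
echo "$before_merge"    # only README.md

# Merge the branch into master, then delete it.
git merge -q data-manipulation-idea
git branch -d data-manipulation-idea
git branch
```

Note that `git branch -d` refuses to delete a branch whose commits have not been merged yet, which protects you from losing work by accident.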
#### Using RStudio
Creating a new branch within RStudio is very easy.
1. Under the git tab in the environment pane you click on the pink L-shaped symbol on the right.
2. This will launch a new branch that you can name.
3. Switching branches can be accomplished through the drop-down menu next to it.
#### Using Github Desktop
Given the point-&-click nature of Github Desktop, it is very easy to open a new branch. Here is how it is done:
```{r, echo=FALSE, out.width="85%", fig.cap="", fig.align="center"}
knitr::include_graphics("./pics/github_desktop_open_new_branch.gif")
```
### {.unlisted .unnumbered .tabset .tabset-fade .tabset-pills}
# Pull Requests `r emo::ji("ask")`
***
Once you are satisfied that the feature, analysis or whatever else you worked on in your branch is complete, you need to notify your collaborators (or yourself) and ask them to approve the changes. This is done through Pull Requests.
Pull Requests are best managed directly on Github. As a reminder on how you open pull requests, here is the GIF from the lecture again:
```{r, echo=FALSE, out.width="85%", fig.cap="", fig.align="center"}
knitr::include_graphics("./pics/github-rstudio-pr.gif")
```
# Dealing with Merge Conflicts `r emo::ji("exploding head")`
***
Merge conflicts occur when Git cannot automatically reconcile the differences between two commits. They only arise when both commits change **the same lines of the same file**. These merge conflicts are part of Git and alert you to the fact that there is a problem that needs to be addressed before continuing. Helpfully, Git marks where the merge conflict occurred within your code, contrasting the conflicting contributions. Unfortunately, conflicts can be harder to resolve in larger projects, where the problem is not immediately obvious to the human coder. A better approach is to avoid them in the first place by making extensive use of branches.
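To see what such a conflict looks like, you can provoke one on purpose. The sketch below does so in a throwaway repo, where the file `plot_settings.R` and the branch `idea` are hypothetical names chosen for illustration: both branches change the same line, the merge stops, and Git writes its conflict markers into the file.

```bash
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/main
git config user.name 'its-me'
git config user.email 'my-email@adress.eu'

echo 'colour <- "green"' > plot_settings.R
git add plot_settings.R
git commit -q -m "Add plot settings"

# On a branch, change the line one way ...
git checkout -q -b idea
echo 'colour <- "red"' > plot_settings.R
git commit -q -am "Use red"

# ... and on main, change the same line another way.
git checkout -q main
echo 'colour <- "blue"' > plot_settings.R
git commit -q -am "Use blue"

# The merge stops and reports a conflict (non-zero exit status).
git merge idea || echo "merge stopped: conflict in plot_settings.R"

# With default settings the file now contains both versions:
# <<<<<<< HEAD
# colour <- "blue"
# =======
# colour <- "red"
# >>>>>>> idea
conflicted="$(cat plot_settings.R)"
echo "$conflicted"

# Resolve by writing the version you want, then stage and commit.
echo 'colour <- "blue"' > plot_settings.R
git add plot_settings.R
git commit -q -m "Resolve conflict: keep blue"
```

Everything between `<<<<<<< HEAD` and `=======` is the version from the branch you are on; everything between `=======` and `>>>>>>> idea` comes from the branch being merged in.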
# Exercises (Part II) `r emo::ji("person lifting weights")`
***
For these exercises we will use the same repo created in the first part of the exercises, so be sure to finish them before moving on to this section!
1. From the exercises_1 repo open up a new branch called "data-manipulation-idea".
2. Switch to this branch and create a new script.
3. Copy the three code chunks from [this section](#code) into the newly created script.
4. Next, and using the tidyverse, subset our cleaned df to include only countries in the Americas.
5. Finally, also add a new categorical variable using `mutate()` that indicates whether a country is rich or poor! (Hint: use the variable gdp_per_cap and `if_else()`)
6. Save the script!
7. Stage your changes and commit them. (Make sure you are committing them within our "data-manipulation-idea" branch.)
8. Go to your repo online and open a pull request.
9. Review and merge this pull request.
10. Delete the now obsolete branch.
***
# Appendix
***
### Caching credentials with HTTPS {#https .tabset .tabset-fade .tabset-pills}
Github recommends that you generate a *Personal Access Token* and cache it using HTTPS. Luckily for us this can be achieved in a few simple steps.
#### Using the Command Line
You need to go to your Github Account. Then select your logo in the top right > Settings > Developer Settings > Personal Access tokens
There you can create a new *PAT*. It is suggested that you select the following categories under scope: `repo`, `workflow`, `gist`, `admin:org` and `user`.
Then click the green button "Generate Token". Make sure to **save a copy of this token** somewhere secure. Once you leave the site you will not be able to view the token again!
Finally, enter the following command:
```{bash, eval=F}
gh auth login
```
Then select, in order: GitHub.com > HTTPS > Yes > Paste an authentication token
Now you simply need to paste your *PAT* and you should no longer be challenged for your GitHub credentials.
#### Using RStudio
Luckily for R users there are two packages that essentially automate this process for us:
The `usethis` package has a function that takes you immediately to the relevant GitHub page and preselects the recommended scopes (what github calls settings) for you.
```{r}
usethis::create_github_token()
```
It should look like this:
```{r, echo=FALSE, out.width="75%", fig.cap="", fig.align="center"}
knitr::include_graphics("./pics/PAT_github_web.png")
```
Next you can use `gitcreds` to cache your credentials.
```{r, eval=F}
gitcreds::gitcreds_set()
```
It will ask you for your PAT (or if you already cached credentials, whether you want to change them) and store it for you.
Now you should be able to **pull** and **push** to your origin/main repo.
If you ever misplace your token but need it, for example to authenticate a command line operation, you can also call `gitcreds::gitcreds_set()` and select the third option to display your PAT.