##**git for Colab:**
###Google Drive -- Colab VM -- GitHub Integration
---
---
###**Caution: Do not blindly run all cells in this notebook.  </br> Each different section should be run only if it corresponds to your desired actions.**
---
---


When **setting up a new local repo**, proceed in this order:
*  Section 1: OAuth Token
*  Section 2: Personal Parameters
*  Section 3: Mount Google Drive; create authorized GitHub URL with OAuth token
*  Section 6: Clone Repo
*  Then, use sections 4 and 5 (pull and push) as you wish
</br></br>
---
---
When operating with an **existing local repo**, proceed in this order:
*  Section 2: Personal Parameters
*  Section 3: Mount Google Drive; create authorized GitHub URL with OAuth token
*  Then, use sections 4 and 5 (pull and push) as you wish
</br></br>
---
---


####**Public Repositories on GitHub:**  

The code in this notebook is presently set up for connection with GitHub **public** repositories.  StackExchange can give you more information on how to connect securely with a **private** GitHub repository using **ssh**, for example.

####**The following code blocks assist with:**

1. (Required on setup; only done once) Obtaining your **GitHub 'OAuth Token'** to enable ***push*** commands from your Google Drive to the public repo on GitHub

2. (Required) Loading your **personal parameters** regarding file locations, GitHub credentials, etc.  

3. (Required) **Mounting Google Drive** so Colab has programmatic access to your Google Drive files

4. (Optional) **git pull** code to update your local Google Drive repo

5. (Optional) **git add/commit/push** code to send updated files to remote GitHub repo

6. (Required on setup; only done once) **Clone** a GitHub remote repository to a desired Google Drive location.  *Do this before doing any pushes or pulls, but only do the cloning once for any given repo.*

7. (Optional) collection of some useful git commands to check status and/or debug

8. Random collection of other stuff found on the web that I didn't want to forget.  Probably not useful to anyone else, but enjoy if you'd like.

#Hidden section: this code doesn't change often

##1. **Obtain GitHub OAuth Token**
Do this **once only**.  The first time you set up your local repo on Google Drive, you must follow the instructions in this section.

You do *not* need to repeat this procedure again, as long as your token remains valid.


---
---



The GitHub OAuth Token provides you access to ***push*** any commits to the GitHub remote repo.  (You must have "contributor" authority on the GitHub repo to **push** commits, irrespective of Colab / Google Drive)

(You do *not* need a GitHub OAuth token or "contributor" authority to ***pull*** or ***clone*** from the public repo...  only to ***push***.)

---
---
</br>

If you do not already have one, go to [github.com/settings/tokens](https://github.com/settings/tokens/new) and create a new *personal access token* for "Google Drive Repo".  You may ignore the advanced options for the scope of the OAuth token and simply enable the first checkbox for "repo" full control of private repositories.
</br></br>
**Save your OAuth Token as a single line in a text file named "*GitHub_Token.txt*" that is stored in your default Colab directory (i.e., NOT inside the local repo!).**  GitHub will deauthorize your token and require you to create a new one, if GitHub detects your token in any file you upload to the remote GitHub repo.
</br></br>
Once you have received and saved your token as directed, you should not have to do it again for this repo, and can enjoy many pushes and pulls for the remainder of time.
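
If you would rather create the token file from inside Colab than through the Google Drive web interface, something like the following could work. This is only a sketch: `save_token` is a hypothetical helper that is not part of this notebook, and it assumes your Google Drive is already mounted (Section 3), so it would have to be run after that step. Using `getpass` keeps the token from being echoed into the notebook output.

```python
from getpass import getpass
from pathlib import Path


def save_token(token: str, token_file) -> Path:
    """Write the token as a single line of text and return the file path."""
    token_file = Path(token_file)
    token_file.parent.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
    token_file.write_text(token.strip() + "\n")             # exactly one line, no stray spaces
    return token_file


# Example use in Colab (uncomment; the path is the default location assumed in Section 2):
# save_token(getpass("Paste your GitHub token: "),
#            "/content/drive/My Drive/Colab Notebooks/GitHub_Token.txt")
```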

---
---

**Tips and Notes:**

The token provided by GitHub will look like a 40-character string of numbers and letters, and will be unique to you (i.e., tied to your GitHub account, not to any particular repo).  **DO NOT share this token with others**, even if you are in a "committed relationship."  These "relationships" have a way of failing quite often, and then your "partner" can trash years of your valuable labor.  ;)

##2. **Personal Parameters**
You must run this code cell each time you start a new runtime in Colab for this IPynb.
</br></br>

---
---

**Enter *your* personal info by replacing my info in the relevant variables and run the setup code in this section**

---
---

Tip: Your **default Colab directory** is created when you first set up Colab.  If you use Google's standard settings, this directory will be titled "Colab Notebooks" and it will reside at the top level of your Google Drive.  Once you mount your Google Drive in Colab, Colab will recognize this as a directory inside "*/content/drive/My Drive*" at "*/content/drive/My Drive/Colab Notebooks*".  The code below assumes you will store your OAuth Token in this particular directory, and that your local repo will be located in a lower-level directory.

In [1]:
##################################################################
# Required Code Cell - Must Run This To Enable Any Future Actions
'''
ADJUST THE VARIABLE ASSIGNMENTS IN THIS CELL TO FIT YOUR PARTICULAR SITUATION
'''
# Examples are provided in the comments for one particular repo and Google Drive path that was used
##################################################################


OAUTH_TOKEN_FILENAME = 'GitHub_Token.txt'                     # this is a one-line text file containing only your GitHub OAuth token (see github.com/settings/tokens) -- do not place this inside your repo!!!
COLAB_GDRIVE_MOUNTPOINT = '/content/drive'                    # leave this unchanged unless you know something
COLAB_DEFAULT_DIR = 'My Drive/Colab Notebooks'                # leave this unchanged unless you explicitly created a different default Colab directory
GDRIVE_PATH_TO_LOCAL_REPO = 'NRUHSE_2_Kaggle_Coursera/final'  # this is the directory (relative to Colab Default) in which you will have cloned the remote GitHub repo
GIT_REPO_MASTER = 'Kag'                                       # Name of the repo on GitHub (this is the repo name, not a branch name)
GIT_REPO_PATH_PARENT = 'migai'                                # Typically, the originator (owner) of the repo on GitHub (URL for 'Kag' repo == github.com/migai/Kag )

GIT_USERNAME = 'migai'
GIT_USER_EMAIL = "gaidis@alum.mit.edu"

##################################################################
# Create Paths based on personal information input above, then mount the Google Drive
##################################################################
from urllib.parse import urlunparse
from pathlib import Path
import os

GDRIVE_HOME = Path(COLAB_GDRIVE_MOUNTPOINT)                   # "/content/drive"
COLAB_HOME = GDRIVE_HOME / COLAB_DEFAULT_DIR                  # "/content/drive/My Drive/Colab Notebooks"
TOKEN_FILE = COLAB_HOME / OAUTH_TOKEN_FILENAME                # "/content/drive/My Drive/Colab Notebooks/GitHub_Token.txt"
GDRIVE_CLONE_PATH = COLAB_HOME / GDRIVE_PATH_TO_LOCAL_REPO    # "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final"
GDRIVE_REPO_PATH = GDRIVE_CLONE_PATH / GIT_REPO_MASTER        # "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"

# # Printouts of the various paths to ensure your sanity and help with debug
# print(f"GDRIVE_HOME = {GDRIVE_HOME}")
# print(f"COLAB_HOME = {COLAB_HOME}")
# print(f"TOKEN_FILE = {TOKEN_FILE}")
# print(f"GDRIVE_CLONE_PATH = {GDRIVE_CLONE_PATH}")
# print(f"GDRIVE_REPO_PATH = {GDRIVE_REPO_PATH}\n\n")
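
Since the token file must never live inside the local repo, a quick sanity check on the two paths can catch a misconfiguration before anything is pushed. The helper below is a sketch only; `token_is_outside_repo` is a hypothetical name, and it compares the paths purely as text (it does not touch the file system or resolve symlinks).

```python
from pathlib import PurePosixPath


def token_is_outside_repo(token_file, repo_path) -> bool:
    """Return True if token_file is not located anywhere inside repo_path."""
    token_file = PurePosixPath(token_file)
    repo_path = PurePosixPath(repo_path)
    # .parents lists every directory above the file; the repo must not be one of them
    return repo_path not in token_file.parents


repo = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"
print(token_is_outside_repo("/content/drive/My Drive/Colab Notebooks/GitHub_Token.txt", repo))  # True
print(token_is_outside_repo(repo + "/helper_code/GitHub_Token.txt", repo))                      # False
```

With the variables from the cell above, the equivalent call would be `token_is_outside_repo(TOKEN_FILE, GDRIVE_REPO_PATH)`.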

##3. **Mount the Google Drive in Colab; Create GitHub OAuth URL**
You must run these code cells each time you start a new runtime in Colab for this IPynb.
</br></br>

---
---

**This will mount your Google Drive in the Colab VM, </br>
load your GitHub authorization token from the mounted Google Drive,</br>
and will create an authorization URL that allows you to push files to the remote GitHub repo**



###The following code will mount your personal Google Drive in the Colab VM at "/content/drive"</br>
* If you have recently mounted your Google Drive within this IPynb, you may get lucky and not have to re-authorize (Colab will skip the following, and tell you your drive is already mounted).
* Things like re-opening this IPynb, starting a new runtime, changing runtime type (CPU/GPU), changing your computer's IP address, or having runtime suspended for a substantial amount of time before you reconnect -- will all trigger the need for you to re-authorize Colab to access your personal Google Drive.</br></br>

####To authorize Colab to access your Google Drive, 
* The code below will present an input textbox for your authorization code. (Or, it will tell you your drive is already mounted, and you don't need to go through this procedure.) 
  * To obtain this code, click the lengthy link above the input textbox that appears below </br> (Link looks like:  "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=?947318989803.apps.googleusercontent.com&redirect_uri=urn%3a&response_type=code&scope=email%?20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly " )
  * Then, in the new browser tab that appears, click to allow use of your chosen personal Google account containing the Google Drive of interest
  * This new browser tab should then present you with a lengthy (40 char?) passcode provided by Google.  COPY THE PASSCODE to your clipboard.
  * Then, you can close the passcode tab and return to this IPynb browser tab to paste and enter the passcode in the input cell provided. (paste + enter)

* Colab crunches for a few seconds, and then should return a message that your drive is mounted.
</br>

####Code output will look like this:
```
Enter your authorization code:
··········
Mounted at /content/drive
```
</br>

Or, if Drive is already mounted, you get a message like this:</br>
```
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
```
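
If you prefer to skip the mount call entirely when Drive already appears to be mounted, a small guard such as the following might help. It is a sketch under stated assumptions: both function names are hypothetical, and "looks mounted" is only a heuristic (the mountpoint exists and is not empty), not an official Colab check. The `google.colab` import is placed inside the function because that module is only available in a Colab runtime.

```python
import os


def drive_looks_mounted(mountpoint: str = "/content/drive") -> bool:
    """Heuristic check: the mountpoint is a directory with at least one entry in it."""
    return os.path.isdir(mountpoint) and bool(os.listdir(mountpoint))


def mount_drive_if_needed(mountpoint: str = "/content/drive") -> bool:
    """Mount Google Drive only when it does not look mounted; return True if a mount was attempted."""
    if drive_looks_mounted(mountpoint):
        return False
    from google.colab import drive   # importable only inside a Colab runtime
    drive.mount(mountpoint)
    return True
```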

###After Google Drive is mounted in Colab: Create GitHub URL</br>
* To access GitHub from within a Colab IPynb to be able to push files to the remote GitHub repo, we need to inform GitHub that we are authorized to do so.  This involves creating a URL pointing at your GitHub repo, but also including your OAuth token in this URL. (**NOTE:** you don't have to do this if all you wish to do is pull or clone the remote repo.)
* Rather than explicitly entering your GitHub OAuth token in this IPynb, the code below allows you to store the OAuth token in a text file (on your Google Drive, but outside your local repo).  Why do this?  Because the OAuth token is then never printed inside this IPynb, and you can place this IPynb *inside* your local repo, and have this IPynb synced with GitHub.  **GitHub will deauthorize your OAuth token if you try to upload it to GitHub inside another file**, e.g., where other viewers of your public repo may find it.
* (You do not NEED to put this IPynb in your repo, but this way of reading your OAuth token gives you that option.)</br>

###Note the order of code cells here:
* To have Colab read your token and thus create an authorized access path to GitHub, we had to mount the Google Drive first (as in the previous code cell above)

###Some tips for debugging:
If you are having trouble getting a properly formatted URL for *GITHUB_REPO_PATH*, you can try one of the following:
* This seems to work, but may be considered "unpythonic"... simple string concatenation method of creating authorized URL link to GitHub repo:
```
GITHUB_REPO_PATH = "https://" + GIT_TOKEN + "@github.com/" + \
    GIT_REPO_PATH_PARENT + "/" + GIT_REPO_MASTER + ".git"
```
* Building paths with **os.path.join()** is decent, but not as nice as the **Path** methods, which automatically handle os-related quirks such as putting your slashes in the proper direction.

* **NOTE**: Don't use python **Path** to build a URL that starts with **https://**, because **Path** treats the string as a file path and collapses the double slash into a single slash.  Example below:
```
# DO NOT DO THIS:
GITHUB_REPO_PATH = Path("https://" + GIT_TOKEN + "@github.com") / \
    GIT_REPO_PATH_PARENT / (GIT_REPO_MASTER + ".git")
```
* I ended up using **urlunparse** to create the GitHub URL in what I'm guessing is close to a pythonic, os-independent way.  You can also do the simple string concatenation.
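
The difference between the two approaches in the tips above can be seen in a short, self-contained comparison. The token here is a placeholder string, and `PurePosixPath` is used so that the result is the same on any operating system (Colab itself runs on Linux, where `Path` behaves the same way).

```python
from pathlib import PurePosixPath
from urllib.parse import urlunparse

token = "TOKEN"                  # placeholder, not a real token
owner, repo = "migai", "Kag"

# Path treats the URL as a file path and collapses "//" into "/"
bad_url = str(PurePosixPath("https://" + token + "@github.com") / owner / (repo + ".git"))

# urlunparse takes (scheme, netloc, path, params, query, fragment)
good_url = urlunparse(("https", token + "@github.com", owner + "/" + repo + ".git", "", "", ""))

print(bad_url)    # https:/TOKEN@github.com/migai/Kag.git
print(good_url)   # https://TOKEN@github.com/migai/Kag.git
```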


#Mount Google Drive in Colab

In [2]:
from google.colab import drive
drive.mount(COLAB_GDRIVE_MOUNTPOINT)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


##Create GitHub Authorized URL to Enable Push

In [3]:
GIT_TOKEN = Path(TOKEN_FILE).read_text().strip()   # strip() removes any trailing newline, which would otherwise end up inside the URL
GITHUB_REPO_PATH = urlunparse(("https", GIT_TOKEN+"@github.com", (GIT_REPO_PATH_PARENT + "/" + GIT_REPO_MASTER + ".git"),"","",""))
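
A slightly more defensive way to load the token is sketched below. `read_token` is a hypothetical helper, not something this notebook defines; it assumes the file holds exactly one token on one line, as described in Section 1, and it raises an error instead of silently building a bad URL when that is not the case.

```python
from pathlib import Path


def read_token(token_file) -> str:
    """Return the token from a one-line text file, without surrounding whitespace."""
    token = Path(token_file).read_text().strip()
    # an empty file, or one with several whitespace-separated chunks, is not a single token
    if not token or len(token.split()) != 1:
        raise ValueError("expected exactly one token on a single line")
    return token
```

In this notebook it would be called as `GIT_TOKEN = read_token(TOKEN_FILE)`. Avoid printing the returned value, so that the token never appears in the notebook output.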

##4. **Pull** Remote GitHub repo file updates to your local Google Drive
**Use the following code cell if you wish to update your local Google Drive files**

You *must* have a cloned copy of the repo on your Google Drive before this will work. 

In [4]:
os.chdir(GDRIVE_REPO_PATH)
!git pull origin master

From https://github.com/migai/Kag
 * branch            master     -> FETCH_HEAD
Already up to date.


##5. **Push** Google Drive local repo to GitHub
**Use the following code cell if you wish to push your local file changes to the remote GitHub repo**

You *must* have a cloned repo on your Google Drive first, and *must* also have a valid GitHub OAuth token (in the GITHUB_REPO_PATH URL object).

**Before running this code cell, adjust the *push_message* at the top**

In [None]:

push_message = "_model_v13 LGBM parameterization"


################################################

os.chdir(GDRIVE_REPO_PATH)
!git config user.email "{GIT_USER_EMAIL}"
!git config user.name "{GIT_USERNAME}"

# make sure we are in the correct location on GitHub
#  (you may comment out these two statements if things are running smoothly for you, but they do help prevent git errors, and add minimal overhead)
!git remote remove origin   
!git remote add origin "{GITHUB_REPO_PATH}"

!git add .
!git commit -m "{push_message}"
!git push origin master

[master 0456d63] _model_v10 sklearn histGBR
 287 files changed, 9854770 insertions(+), 1735863 deletions(-)
 delete mode 100644 LGBM_feature_importance_v1.3_mg.png
 delete mode 100644 LGBM_feature_importance_v1.4_mg.png
 delete mode 100644 data_output/kaggle_utils_at_mg.py
 create mode 100644 ipynb_versions/Feature_merge_and_model_v10.ipynb
 create mode 100644 ipynb_versions/Feature_merge_and_model_v11.ipynb
 rename ipynb_versions/{ => archived}/Feature_merge_and_model_v7.ipynb (100%)
 rename ipynb_versions/{ => archived}/Feature_merge_and_model_v8.ipynb (100%)
 create mode 100644 ipynb_versions/archived/Feature_merge_and_model_v9.ipynb
 rename models_and_predictions/{gbt_feature_improtance_v1.png => GBT_Andreas/gbt_feature_importance_v1.png} (100%)
 rename models_and_predictions/{ => GBT_Andreas}/gbt_feature_importance_v3.png (100%)
 rename models_and_predictions/{ => GBT_Andreas}/gbt_model_v1.sav (100%)
 rename models_and_predictions/{ => GBT_Andreas}/gbt_pred_test_v1.pickle (100%)
 

##6. **Clone** Repo from GitHub to Google Drive (ONLY DO THIS ONCE!)
***Beware:***

**Use the following code cell ONLY if you are starting a new local Google Drive repo.</br>And, only do this cloning once -- do not repeat for the same repo.**

---
---
</br>

The following code is commented out, to make sure you do not inadvertently run it.  If you need to start a new repo connection, uncomment the code, and run this cell immediately after mounting your Google Drive.  Only then can you safely do push or pull from your local Google Drive repo to the remote GitHub repo.

In [None]:
'''
##################################################################
# OPTIONAL Code Cell - Only run this if you are starting with a new repo cloning
##################################################################

# Create empty folder to hold the cloned repo (if not done already), and then navigate to it
GDRIVE_CLONE_PATH.mkdir(parents=True, exist_ok=True)  # "parents=True" also creates missing parent folders; "exist_ok=True" ignores the error if you have already made the directory
os.chdir(GDRIVE_CLONE_PATH)

# clone it
!git clone "{GITHUB_REPO_PATH}"
'''
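
Because the clone should happen only once, it can be useful to check whether the target folder already holds a repo before uncommenting the cell above. The following is a minimal sketch; `is_git_repo` is a hypothetical helper, and it relies on the fact that an ordinary clone contains a `.git` entry at its top level.

```python
from pathlib import Path


def is_git_repo(repo_path) -> bool:
    """Return True if repo_path already contains a .git entry (i.e., it has been cloned or initialized)."""
    return (Path(repo_path) / ".git").exists()
```

With the variables from Section 2, `is_git_repo(GDRIVE_REPO_PATH)` returning `True` would suggest that the clone step has already been done and should be skipped.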

##7. **Debugging** Tips and Code Snippets

###**Number 1 Cause of Issues:  Improper Formatting of URL Path Object**
From the code in the cell after mounting your Google Drive, make sure your GITHUB_REPO_PATH looks something like the upper URL, and not the lower URL:
</br>

https://123abc456def890adfa334af@github.com/migai/Kag.git
</br>

https:/123abc456def890adfa334af@github.com/migai/Kag.git
</br></br>

Simple solution: Ensure your URL and Path unparsers are creating properly formatted objects.  Use simple string concatenation if you must.  Check with print statements, for example, **but do not leave your token visible in any file you upload to GitHub**.  (See #2 below)
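
One way to inspect the URL without exposing the token is to mask the token before printing. The sketch below uses a regular expression rather than a URL parser so that it also works on the malformed single-slash form; `mask_token` is a hypothetical helper name.

```python
import re


def mask_token(url: str) -> str:
    """Replace the text between the last slash of the scheme and the '@' with '***'."""
    return re.sub(r"(?<=/)[^/@]+(?=@)", "***", url, count=1)


print(mask_token("https://123abc456def890adfa334af@github.com/migai/Kag.git"))  # https://***@github.com/migai/Kag.git
print(mask_token("https:/123abc456def890adfa334af@github.com/migai/Kag.git"))   # https:/***@github.com/migai/Kag.git
```

Printing `mask_token(GITHUB_REPO_PATH)` then shows whether the double slash survived, while keeping the token out of the notebook output.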
</br>


###**Number 2 Cause of Issues:  Your OAuth Token is in GitHub Repo**
You must not include your GitHub OAuth Token anywhere in any file you push to the GitHub repo.  GitHub checks every file you upload, and if it sees an active OAuth token anywhere (including code blocks, text blocks, printouts...), GitHub will reject/delete your OAuth Token.

</br>

Simple solution: go to https://github.com/settings/tokens/new and create yourself a new token.

###7.1) Status Check

In [None]:
os.chdir(GDRIVE_REPO_PATH)
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   helper_code/Git_enabled_Colab_with_GoogleDrive_and_GitHub.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


###7.2) Step-by-Step Add/Commit/Push

In [None]:
!git add "{'ipynb_versions/MG_EDA_items only v1.0.ipynb'}"

In [None]:
!git commit -m "ipynb computes word vectors"

[master 1a17701] ipynb computes word vectors
 1 file changed, 1 insertion(+), 1 deletion(-)


In [None]:
!git push origin master

Counting objects: 37, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (37/37), done.
Writing objects: 100% (37/37), 755.57 KiB | 561.00 KiB/s, done.
Total 37 (delta 29), reused 0 (delta 0)
remote: Resolving deltas: 100% (29/29), completed with 6 local objects.
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: a82a1cb20fb435e64bbe216f97aeb67e
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File data_output/item_vectors.csv is 170.98 MB; this exceeds GitHub's file size limit of 100.00 MB
remote: error: File data_output/item_vectors2.csv is 155.56 MB; this exceeds GitHub's file size limit of 100.00 MB[K
To https://github.com/migai/Kag.git
 ! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'https://<OAUTH_TOKEN>@github.com/migai/Kag.git'
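
The rejection above comes from files larger than GitHub's 100 MB limit. To catch such files before committing, the working tree can be scanned first. The following is a sketch: `find_large_files` is a hypothetical helper, and the limit is taken to be 100 * 1024 * 1024 bytes, which is an assumption about how GitHub counts the "100.00 MB" quoted in the error message.

```python
import os

GITHUB_LIMIT_BYTES = 100 * 1024 * 1024   # assumed meaning of the 100.00 MB limit in the error above


def find_large_files(root, limit=GITHUB_LIMIT_BYTES):
    """Return a sorted list of (relative_path, size_in_bytes) for files under root larger than limit."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]   # do not descend into git's own data
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                large.append((os.path.relpath(path, root), size))
    return sorted(large)
```

Calling `find_large_files(GDRIVE_REPO_PATH)` before `git add` would list candidates to compress, add to `.gitignore`, or move out of the repo.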


###7.3) Aborting, Tracking Who Did What

In [None]:
!git merge --abort
!git pull origin master

fatal: There is no merge to abort (MERGE_HEAD missing).
From https://github.com/migai/Kag
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Already up to date.


In [None]:
!git blame data_output/item_vectors2.csv

fatal: no such path 'data_output/item_vectors2.csv' in HEAD


In [None]:
# https://stackoverflow.com/questions/19573031/cant-push-to-github-because-of-large-file-which-i-already-deleted

# Here's something I found super helpful if you've already been messing around with your repo before you asked for help. First type:

# git status
# After this, you should see something along the lines of

# On branch master
# Your branch is ahead of 'origin/master' by 2 commits.
#   (use "git push" to publish your local commits)

# nothing to commit, working tree clean
# The important part is the "2 commits"! From here, go ahead and type in:

# git reset HEAD~<HOWEVER MANY COMMITS YOU WERE AHEAD>
# So, for the example above, one would type:

# git reset HEAD~2
# After you typed that, your "git status" should say:

# On branch master
# Your branch is up to date with 'origin/master'.

# nothing to commit, working tree clean
# From there, you can delete the large file (assuming you haven't already done so), and you should be able to re-commit everything without losing your work.

!git reset HEAD~1

Unstaged changes after reset:
M	helper_code/Git_enabled_Colab_with_GoogleDrive_and_GitHub.ipynb
M	ipynb_versions/MG_EDA_items only v1.0.ipynb


In [None]:
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   helper_code/Git_enabled_Colab_with_GoogleDrive_and_GitHub.ipynb
	modified:   ipynb_versions/MG_EDA_items only v1.0.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data_output/item_vectors.csv.gz

no changes added to commit (use "git add" and/or "git commit -a")


###7.4) Checking PATH and Resetting Origin

In [None]:
Path.cwd()

PosixPath('/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag')

In [None]:
# make sure we are in the correct location on GitHub
# Sometimes if you are having issues with Colab finding your file locations, it is because somehow your origin changed
#   there is code inside the "push" code cell to do this regularly, although it really shouldn't have to be done when things are working correctly
!git remote remove origin   
!git remote add origin "{GITHUB_REPO_PATH}"
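
As an alternative to removing and re-adding the remote, git also offers `git remote set-url`, which updates the URL of an existing remote in one step. The sketch below wraps it with `subprocess`; `set_origin_url` is a hypothetical helper, and it assumes that a remote named `origin` already exists in the repo (the command fails otherwise). Note that git stores the remote URL, token included, in the local `.git/config` file.

```python
import subprocess


def set_origin_url(repo_dir, url: str) -> None:
    """Point the existing 'origin' remote of the repo in repo_dir at url."""
    subprocess.run(
        ["git", "-C", str(repo_dir), "remote", "set-url", "origin", url],
        check=True,   # raise CalledProcessError if git reports a failure
    )
```

In this notebook the call would be `set_origin_url(GDRIVE_REPO_PATH, GITHUB_REPO_PATH)`.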

##8. Thoughts on Alternative Processes For Colab-GitHub Integration
See, for example, Oleg Zero's post at [towardsdatascience.com](https://towardsdatascience.com/colaboratory-drive-github-the-workflow-made-simpler-bde89fba8a39) from October, 2019 and his associated [GitHub repositories](https://github.com/OlegZero13).

Also, a slightly more detailed, updated version (December, 2019) of Oleg's post is available [here](https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228).


###Modifications to the above workflow:
Oleg creates a temporary directory on Google Drive in which to clone the GitHub repo, then copies the cloned files into the intended directory, and removes the temporary files.  I'm not exactly sure why he feels "recloning" is necessary, as opposed to just re-adding the origin and performing a pull operation.  (Oleg mentions "A nice thing about this solution is that it won’t crash if executed multiple times. Whenever executed, it will only update what is new and that’s it.") 
Anyhow, this code is shown below, if for some reason it becomes useful in the future.

In [None]:
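# GIT_PATH and PROJECT_PATH are the variable names from Oleg's post; they are not defined in this notebook
# (they correspond roughly to GITHUB_REPO_PATH and GDRIVE_REPO_PATH above)
!mkdir ./temp
!git clone "{GIT_PATH}" ./temp     # clone into ./temp (without the target, git would create ./<repo name> instead)
!mv ./temp/* "{PROJECT_PATH}"      # note: the * glob does not match hidden entries such as .git
!rm -rf ./temp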

###Importing (syncing) files Google Drive --> Colab VM and copying files from Colab VM to Google Drive
Oleg also provides an example of how one could load files from the local Google Drive repo into Colab using the !rsync command. It collects everything that belongs to the Drive directory and copies it into our local runtime.
Also, with rsync we have the option to exclude some of the content, which may be unnecessary or take too long to copy (the example below excludes the import of the directory "data" into the Colab VM)

*To Do:  I need to read more about the rsync command*

In [None]:
!rsync -aP --exclude=data/ "{PROJECT_PATH}"/*  ./
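
For comparison, a rough pure-Python counterpart of the rsync call above can be written with `shutil.copytree`. This is a sketch and not an equivalent: `sync_tree` is a hypothetical helper, it copies everything on every run instead of transferring only what changed, and its exclusion pattern matches files as well as directories named `data`. The `dirs_exist_ok` argument requires Python 3.8 or newer.

```python
import shutil


def sync_tree(src, dst, exclude=("data",)):
    """Copy the contents of src into dst, skipping entries whose names match the exclude patterns."""
    shutil.copytree(
        src,
        dst,
        ignore=shutil.ignore_patterns(*exclude),   # glob-style name patterns to leave out
        dirs_exist_ok=True,                        # allow copying into an existing directory
    )
```

The analogue of the cell above would be `sync_tree(PROJECT_PATH, ".")`, where `PROJECT_PATH` is the name used in Oleg's post for the project folder on Google Drive.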

In [None]:
# Copying files from the Colab VM to the Google Drive long-term storage can be done with the !cp command:
!cp -r ./* "{PROJECT_PATH}"

###One last thing:  quick writing and reading of .py with Colab
Here is one link to some tips:  https://colab.research.google.com/notebooks/io.ipynb

Oleg's post describes using magic commands within Colab to write code from within the Colab notebook, and how one might go about reloading the code after modification such that Colab recognizes the changes (as opposed to using !shred command):

![Code for %%writefile and %reload_ext](https://miro.medium.com/max/1334/0*IlOTzOp9dYEMiTp6.png)

