<center> <a href="https://dagshub.com"><img alt="DAGsHub" width="500px" src="https://raw.githubusercontent.com/DAGsHub/client/master/dagshub_github.png"></a> </center>

<center><h1>Use all the benefits DagsHub has to offer in your Git project.</h1></center>

---

<center><h4><b>With this Colab notebook, you will learn how to use DagsHub's features in your GitHub / GitLab project.</b></h4></center>


<p>DagsHub allows its users to mirror* their Git projects (from GitHub, GitLab, etc.) and use DagsHub's features while keeping the Git server on a different provider. We will use the Hello World project to demonstrate how mirroring works.</p>


**The project** - In this walkthrough, we will train a model to classify 'Ham' and 'Spam' emails. We will use the Enron dataset, which stores labeled emails in a CSV file.

---

**Mirror(*)** - When you mirror a Git repository to DagsHub, DagsHub clones the repository from your Git server and pulls changes automatically every 24 hours, or manually when you click the sync button next to the DagsHub repo name. This way, you can see all the Git files in your DagsHub project while using a different Git provider, and use DagsHub's features (storage, experiment tracking, etc.).

<img src="https://dragonballz.co.il/wp-content/uploads/2020/12/discord-logo.jpg" height="23"/> [Discord Channel](https://discord.com/channels/698874030052212737/698874030572437526) | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Linkedin.svg/1200px-Linkedin.svg.png" height="23"/> [LinkedIn](https://www.linkedin.com/company/dagshub/) | <img src="https://help.twitter.com/content/dam/help-twitter/brand/logo.png" height="25"/> [Twitter](https://twitter.com/TheRealDAGsHub) | <img src="https://res-2.cloudinary.com/crunchbase-production/image/upload/c_lpad,f_auto,q_auto:eco/plwmuai9t3okgwbuhkho" height="30"/> [DAGsHub](https://dagshub.com) | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Octicons-mark-github.svg/1200px-Octicons-mark-github.svg.png" height="25"/> [GitHub](https://github.com/DAGsHub)

# Configure DAGsHub, GitHub and Git

In [1]:
import requests
import getpass
import datetime

**Set Environment Variables - DAGsHub**


In [2]:
#@title Enter the DAGsHub repository owner name:

DAGSHUB_REPO_OWNER= "johngechu" #@param {type:"string"}

In [3]:
#@title Enter the DAGsHub repository name:

DAGSHUB_REPO_NAME= "First-test-demo-project" #@param {type:"string"}

In [4]:
#@title Enter the username of your DAGsHub account:

DAGSHUB_USER_NAME = "johngechu" #@param {type:"string"}

**Set Environment Variables - GitHub**


In [5]:
#@title Enter the GitHub repository owner name:

GITHUB_REPO_OWNER= "johngechu" #@param {type:"string"}

In [6]:
#@title Enter the GitHub repository name:

GITHUB_REPO_NAME= "First-test-demo-project" #@param {type:"string"}

In [7]:
#@title Enter the username of your GitHub account:

GITHUB_USER_NAME = "johngechu" #@param {type:"string"}

In [8]:
#@title Enter the email for your GitHub account:

GITHUB_EMAIL = "johnngechu18@gmail.com" #@param {type:"string"}

We take security very seriously and don't want your DAGsHub password to be saved in the notebook runtime. Thus, we created an API that generates an access token for your DAGsHub account. With this token, you can authenticate to DAGsHub (for example, to push your DVC tracked files) without saving the password as a variable.

In [9]:
r = requests.post('https://dagshub.com/api/v1/user/tokens',
                  json={"name": f"colab-token-{datetime.datetime.now()}"},
                  auth=(DAGSHUB_USER_NAME, getpass.getpass('Please enter your DAGsHub token or password: ')))
r.raise_for_status()
DAGSHUB_TOKEN=r.json()['sha1']

Please enter your DAGsHub token or password: ··········


In [10]:
GITHUB_TOKEN = getpass.getpass('Please enter your GitHub token or password: ')

Please enter your GitHub token or password: ··········


**Configure Git**

In [11]:
!git config --global user.email {GITHUB_EMAIL}
!git config --global user.name {GITHUB_USER_NAME}

**Clone the Repository**

In [12]:
!git clone https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

%cd {GITHUB_REPO_NAME}

Cloning into 'First-test-demo-project'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (3/3), done.
/content/First-test-demo-project
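
The clone command above embeds the username and token directly in the URL. If either contains characters such as `@` or `/`, the URL can break. A minimal sketch of one way to guard against that is shown here; the helper names `authenticated_git_url` and `redact` are hypothetical and not part of the notebook, and the sample credentials are made up.

```python
from urllib.parse import quote


def authenticated_git_url(user: str, token: str, owner: str, repo: str,
                          host: str = "github.com") -> str:
    """Build an HTTPS Git URL with percent-encoded credentials."""
    # safe="" also encodes "/" and "@", which would otherwise break the URL.
    return (
        f"https://{quote(user, safe='')}:{quote(token, safe='')}"
        f"@{host}/{owner}/{repo}.git"
    )


def redact(url: str, token: str) -> str:
    """Hide the (encoded) token before the URL is printed or logged."""
    return url.replace(quote(token, safe=""), "***")


# Made-up credentials, used only to illustrate the encoding.
url = authenticated_git_url("alice", "p@ss/word", "alice", "demo")
print(redact(url, "p@ss/word"))
```

In the notebook, the resulting string could be passed to `git clone` or `git push` in place of the hand-written URL.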


# Install and Configure DVC

**Initialize DVC**

In [15]:
# Install DVC
!pip install dvc &> /dev/null

# Import DVC package - relevant only when working in a Colab environment
import dvc

# Initialize DVC in the local directory
!dvc init &> /dev/null

# Track the changes with Git. `dvc init` creates .dvc/ and .dvcignore but no
# root-level .gitignore, so listing .gitignore here would make `git add` fail.
!git add .dvc .dvcignore
!git commit -m "Initialize DVC"


**Configure DVC**

In [16]:
# Set DVC remote storage as 'DAGsHub storage'
!dvc remote add origin s3://dvc
!dvc remote modify origin endpointurl https://dagshub.com/{DAGSHUB_REPO_OWNER}/{DAGSHUB_REPO_NAME}.s3
# Store the DAGsHub credentials in the local DVC config, which is not tracked by Git
!dvc remote modify origin --local access_key_id {DAGSHUB_TOKEN}
!dvc remote modify origin --local secret_access_key {DAGSHUB_TOKEN}
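
To make the structure of these four commands easier to see, here is a small sketch that assembles them from the repository owner, repository name, and token. The functions `dagshub_s3_endpoint` and `dvc_remote_commands` are hypothetical helpers written for this illustration. The endpoint pattern and the arguments are copied from the cell above, and the token value is a placeholder.

```python
def dagshub_s3_endpoint(owner: str, repo: str) -> str:
    """Return the S3-compatible endpoint used in the cell above."""
    return f"https://dagshub.com/{owner}/{repo}.s3"


def dvc_remote_commands(owner: str, repo: str, token: str,
                        remote: str = "origin") -> list[list[str]]:
    """Assemble the DVC remote commands as argument lists."""
    endpoint = dagshub_s3_endpoint(owner, repo)
    return [
        ["dvc", "remote", "add", remote, "s3://dvc"],
        ["dvc", "remote", "modify", remote, "endpointurl", endpoint],
        # --local keeps the credentials out of the Git-tracked config file.
        ["dvc", "remote", "modify", remote, "--local", "access_key_id", token],
        ["dvc", "remote", "modify", remote, "--local", "secret_access_key", token],
    ]


# Placeholder token, used only so the commands can be printed.
commands = dvc_remote_commands("johngechu", "First-test-demo-project", "<token>")
for cmd in commands:
    print(" ".join(cmd))
```

Only the two credential commands use `--local`, so the remote URL and endpoint can be committed while the token stays on the machine.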

# Project Setup


At this point, we want to add the required files for our ML project to the local directory. We will use the dvc get command, which downloads files from a Git repository or DVC storage without tracking them.

**Download the project's files**

In [17]:
!dvc get https://dagshub.com/nirbarazida/hello-world requirements.txt
!dvc get https://dagshub.com/nirbarazida/hello-world src
!dvc get https://dagshub.com/nirbarazida/hello-world-files data/


**Install Requirements**

In [18]:
!pip install -r requirements.txt &> /dev/null
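
Several cells in this notebook send output to `/dev/null`, which also hides error messages when a command fails. As a hedged alternative, the sketch below runs a command silently but raises an error that includes the command's stderr if it exits with a non-zero status. `run_quiet` is a hypothetical helper, not something the notebook or DVC provides.

```python
import subprocess
import sys


def run_quiet(cmd: list[str]) -> str:
    """Run a command silently, but raise with its stderr if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(cmd)} exited with {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# In the notebook, this could replace the silenced pip call, for example:
# run_quiet([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
print(run_quiet([sys.executable, "-c", "print('ok')"]).strip())
```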

# Track Files Using DVC and Git 🏇🏼

The data directory contains the data sets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.

**Track Files with DVC**



In [19]:
# Add the data directory to DVC tracking
!dvc add data


To track the changes with git, run:

	git add .gitignore data.dvc

To enable auto staging, run:

	dvc config core.autostage true


In [20]:
# Track the changes with Git
!git add data.dvc .gitignore
!git commit -m "Add the data directory to DVC tracking"

[main ce978a9] Add the data directory to DVC tracking
 2 files changed, 7 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 data.dvc
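
The `data.dvc` file committed above is a small pointer that Git versions in place of the data itself. The sketch below illustrates the underlying idea of identifying content by a hash, using MD5 on a single temporary file. It is a simplified illustration rather than DVC's implementation (DVC handles directories and its cache layout in its own way), and `file_md5` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"hello")
    before = file_md5(sample)
    # Any change to the content produces a different hash.
    sample.write_bytes(b"hello, world")
    after = file_md5(sample)

print(before)
print(before != after)
```

Because the hash changes whenever the data changes, committing the updated pointer file is enough for Git to record which version of the data belongs to each commit.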


**Track Files with Git**

In [21]:
!git add requirements.txt src/
!git commit -m "Add requirements and src to Git tracking"

[main f21b315] Add requirements and src to Git tracking
 4 files changed, 97 insertions(+)
 create mode 100644 requirements.txt
 create mode 100644 src/const.py
 create mode 100644 src/data_preprocessing.py
 create mode 100644 src/modeling.py


# Push the Files to the Remotes

**Push Git tracked files**


In [22]:
!git push https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

Enumerating objects: 18, done.
Counting objects: 100% (18/18), done.
Delta compression using up to 2 threads

**Push DVC tracked files**


In [24]:
!pip install dvc-s3

Collecting dvc-s3
  Downloading dvc_s3-3.2.0-py3-none-any.whl (13 kB)
Collecting s3fs>=2023.6.0 (from dvc-s3)
  Downloading s3fs-2024.6.1-py3-none-any.whl (29 kB)
Collecting aiobotocore[boto3]>=2.5.0 (from dvc-s3)
  Downloading aiobotocore-2.13.1-py3-none-any.whl (76 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76.9/76.9 kB 1.7 MB/s eta 0:00:00
Collecting botocore<1.34.132,>=1.34.70 (from aiobotocore[boto3]>=2.5.0->dvc-s3)
  Downloading botocore-1.34.131-py3-none-any.whl (12.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.3/12.3 MB 15.2 MB/s eta 0:00:00
Collecting aioitertools<1.0.0,>=0.5.1 (from aiobotocore[boto3]>=2.5.0->dvc-s3)
  Downloading aioitertools-0.11.0-py3-none-any.whl (23 kB)
Collecting boto3<1.34.132,>=1.34.70 (from aiobotocore[boto3]>=2.5.0->dvc-s3)
  Downloading boto3-1.34.131-py3-none-any.whl (139 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 139.2/139.2 kB 10.2 MB/s eta 0:00:00
Collecting jmespath<2.0.0,>=0.7.1 (from boto3<1.34.132,>=1.34.70->aiobotoco

In [25]:
!dvc push -r origin


# Checkpoint 🎯



If you check your DAGsHub repository's new status, you will see all the files that we pushed with Git and DVC, as shown here.

- The main repository page:
<center><a><img src="https://i.ibb.co/F7TpFPw/5-repo-stat-after-push.png" alt="5-repo-stat-after-push" border="0"></a></center>
<br>

  <u>**Note**</u>: The DVC tracked files are marked with a blue background.

- The data directory:
<center><a><img src="https://i.ibb.co/6P9RrNj/6-data-dir-after-push.png" alt="6-data-dir-after-push" border="0"></a></center>
<br>

- The data file itself:
<center><a><img src="https://i.ibb.co/9HWvKTY/7-content-of-enron-file.png" alt="7-content-of-enron-file" border="0"></a></center>

# Process and Track Data Changes

We want to preprocess our data and track the results using DVC. By running the data_preprocessing.py module, we will generate four new files of processed data in the 'data' directory. We will track the new files with DVC and Git and push them to the remotes.

In [26]:
# Process the Data
!python src/data_preprocessing.py

[DEBUG] Preprocessing raw data 
     [DEBUG] Loading raw data
     [DEBUG] Removing punctuation from Emails
     [DEBUG] Label encoding target column
     [DEBUG] vectorizing the emails by words
     [DEBUG] Splitting data to train and test
     [DEBUG] Saving data to file
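
The log above names the preprocessing steps: removing punctuation, label encoding, vectorizing by words, and splitting into train and test sets. The sketch below walks through the same steps on four made-up emails using only the standard library. It is not the code in src/data_preprocessing.py, whose implementation is not shown here; all function names and the sample data are assumptions for illustration.

```python
import random
import string
from collections import Counter


def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def encode_labels(labels):
    """Map each label name to an integer, in alphabetical order."""
    mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping


def vectorize(texts):
    """Turn each text into a row of word counts over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return rows, vocab


def train_test_split(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle indices with a fixed seed and hold out a share for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_ratio))
    test, train = idx[:n_test], idx[n_test:]
    pick = lambda seq, ids: [seq[i] for i in ids]
    return pick(rows, train), pick(rows, test), pick(labels, train), pick(labels, test)


# Made-up sample data, standing in for the Enron CSV.
emails = ["Win cash now!!!", "Meeting at noon.", "Cash prize, click now!", "Lunch at noon?"]
labels = ["spam", "ham", "spam", "ham"]

clean = [remove_punctuation(e) for e in emails]
y, mapping = encode_labels(labels)
X, vocab = vectorize(clean)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(mapping)
print(vocab)
print(len(X_train), len(X_test))
```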


In [27]:
# Track the Changes
!dvc add data &> /dev/null
!git add data.dvc
!git commit -m "Process raw-data and save it to data directory"

[main f4507e5] Process raw-data and save it to data directory
 1 file changed, 3 insertions(+), 3 deletions(-)


**Push the Files to the remotes**

In [28]:
!git push https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

!dvc push -r origin &> /dev/null

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 372 bytes | 372.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/johngechu/First-test-demo-project.git
   f21b315..f4507e5  main -> main


# Checkpoint

If you check the data directory's new status in your DAGsHub repository, you will see all the new data files there, as shown below.

- The data directory
<center><a><img src="https://i.ibb.co/GxjTxB3/8-data-dir-after-push.png" alt="8-data-dir-after-push" border="0" /></a></center>

# Create Data Science Experiments 🧪

In [29]:
!pip3 install dagshub &> /dev/null

**Run new experiment**

In [33]:
!python3 src/modeling.py

[DEBUG] Initialize Modeling 
     [DEBUG] Loading data sets for modeling
     [DEBUG] Runing Random Forest Classifier
     [INFO] Finished modeling with AUC Score: 0.931
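
The run reports an AUC score. For readers unfamiliar with the metric, the sketch below computes ROC AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. This is a plain-Python illustration of the definition, not the code used in src/modeling.py, and the labels and scores are made up.

```python
def roc_auc(labels, scores) -> float:
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 1.0 means every spam email is ranked above every ham email, while 0.5 corresponds to random ranking.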


**Track the Experiment Files**

In [34]:
!git add metrics.csv params.yml
!git commit -m "New Experiment - Random Forest Classifier with basic processing"

[main 1e4a785] New Experiment - Random Forest Classifier with basic processing
 2 files changed, 22 insertions(+)
 create mode 100644 metrics.csv
 create mode 100644 params.yml
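
The two files committed above, metrics.csv and params.yml, hold the experiment's results and hyperparameters. The sketch below writes a pair of files with those names into a temporary directory, to show roughly what such files can contain. The column layout (`Name,Value,Timestamp,Step`) and the flat `key: value` parameter format are assumptions about the logger's output, the `n_estimators` value is made up, and `write_experiment` is a hypothetical helper rather than a DAGsHub API.

```python
import csv
import tempfile
import time
from pathlib import Path


def write_experiment(directory, params: dict, metrics: dict, step: int = 1) -> None:
    """Write hyperparameters and metrics as two small text files."""
    directory = Path(directory)
    with open(directory / "params.yml", "w") as fh:
        for key, value in params.items():
            fh.write(f"{key}: {value}\n")
    timestamp = int(time.time() * 1000)  # milliseconds since the epoch
    with open(directory / "metrics.csv", "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Name", "Value", "Timestamp", "Step"])  # assumed layout
        for name, value in metrics.items():
            writer.writerow([name, value, timestamp, step])


with tempfile.TemporaryDirectory() as tmp:
    # n_estimators is a made-up value; 0.931 is the AUC reported above.
    write_experiment(tmp, {"n_estimators": 100}, {"auc": 0.931})
    params_text = (Path(tmp) / "params.yml").read_text()
    with open(Path(tmp) / "metrics.csv", newline="") as fh:
        metric_rows = list(csv.reader(fh))

print(params_text)
print(metric_rows[0])
```

Since both files are plain text, Git can version them alongside the code that produced them.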


**Push the Files to the Remotes**

In [35]:
!git push https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 645 bytes | 645.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/johngechu/First-test-demo-project.git
   f4507e5..1e4a785  main -> main


# Checkpoint 🎯

If you check your DAGsHub repository's new status, you will see that a new experiment was added to the Experiment tab. If you go to the tab, you will see the model's hyperparameters and its performance metrics.

- The experiment tab:
<center><a href="https://ibb.co/PWwrpQD"><img src="https://i.ibb.co/wQMdHsc/10-experiment.png" alt="10-experiment" border="0" /></a></center>

# 📀 Save the notebook to DagsHub

In [None]:
from dagshub.notebook import save_notebook

save_notebook(repo=f"{DAGSHUB_REPO_OWNER}/{DAGSHUB_REPO_NAME}",
              branch = "main",
              path="First-test-demo-project.ipynb",
              commit_message="Adding notebook",
              versioning="git")

# Finish Line 🏁

**Congratulations**  - You made it to the finish line! 🥳

In the Get Started section, we covered the fundamentals of DAGsHub usage. We started by creating a repository and configuring Git and DVC. Then, we added a project to the repository, using Git and DVC to track the files. Lastly, we created our very first Data Science Experiment with DAGsHub Logger. <br><br>

More resources that can interest you:
- [DAGsHub Docs](https://dagshub.com/docs/).
- [Get Started Tutorial](https://dagshub.com/docs/getting-started/overview/).
- [DAGsHub Blog](https://dagshub.com/blog/).
- [FAQ](https://dagshub.com/docs/faq/).

<br>

We hope that this tutorial was helpful and made the onboarding process easier for you. If you found an issue in the notebook, please [let us know](https://dagshub.com/DAGsHub-Official/DAGsHub-Issues/issues/). If you have any questions, feel free to join our [Discord channel](https://discord.com/invite/9gU36Y6) and ask there. We can't wait to see what remarkable projects you will create and share with the Data Science community!
<br><br>