<a href="https://colab.research.google.com/github/mkhfring/parallel-c/blob/main/Hugging_Face_%2B_W%26B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hugging Face + Weights & Biases

### Introduction
Visualize your [Hugging Face](https://github.com/huggingface/transformers) model's performance quickly with a seamless W&B integration. Compare hyperparameters, output metrics, and system stats like GPU utilization across your models. 

<img src="https://i.imgur.com/vnejHGh.png" width="800">

**Why use W&B**

Think of W&B as GitHub for machine learning models: it saves your machine learning experiments to a private, hosted dashboard. Experiment quickly with the confidence that all the versions of your models are saved for you, no matter where you're running your scripts.

W&B's lightweight integrations work with any Python script. All you need to do is sign up for a free W&B account to start tracking and visualizing your models.

In the Hugging Face Transformers repo, we've instrumented the Trainer to automatically log training and evaluation metrics to W&B at each logging step.

Here's an in-depth look at how the integration works: [Hugging Face + W&B Report](https://app.wandb.ai/jxmorris12/huggingface-demo/reports/Train-a-model-with-Hugging-Face-and-Weights-%26-Biases--VmlldzoxMDE2MTU).
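
The Trainer integration described above reads a few environment variables when it starts a run. The sketch below sets two of them, `WANDB_PROJECT` and `WANDB_WATCH`; the helper name `configure_wandb_env` and its arguments are hypothetical and exist only for illustration, and the project name is borrowed from the report URL above.

```python
import os


def configure_wandb_env(project, watch="false", env=None):
    """Set the W&B environment variables that the Trainer integration reads.

    `project`, `watch`, and `env` are illustrative arguments of this helper,
    not part of the transformers or wandb APIs.
    """
    env = os.environ if env is None else env
    env["WANDB_PROJECT"] = project  # W&B project that runs are logged to
    env["WANDB_WATCH"] = watch      # "gradients", "all", or "false"
    return env


# Use a plain dict here so the example does not touch the real environment
settings = configure_wandb_env("huggingface-demo", env={})
print(settings)
```

In a real notebook you would call the helper without `env` so it writes to `os.environ`, and do so before the `Trainer` is created. Passing `report_to="wandb"` to `TrainingArguments` selects W&B as the logging backend explicitly.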

### Install dependencies
Install the Hugging Face and Weights & Biases libraries, and the GLUE dataset and training script for this tutorial.
- [Hugging Face Transformers](https://github.com/huggingface/transformers): Natural language models and datasets
- [Weights & Biases](https://docs.wandb.com/): Experiment tracking and visualization
- [GLUE dataset](https://gluebenchmark.com/): A language understanding benchmark dataset
- [GLUE script](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py): Model training script for sequence classification

In [None]:
!pip install transformers datasets -qq
!pip install wandb -qq
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/text-classification/run_glue.py -qq

In [None]:
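import os

# List the working directory and keep the entries whose name contains 'program'
directories = os.listdir()

program_dirs = [element for element in directories if 'program' in element]
print(program_dirs)
# Guard against an empty match so the cell does not fail with an IndexError
if program_dirs:
  print(os.listdir(program_dirs[0]))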

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
ls

In [None]:
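import os

directories = os.listdir()
print(directories)
# Keep only directories whose name contains 'p'; later cells call os.listdir()
# on each entry, which would fail on plain files such as run_glue.py
program_dirs = [element for element in directories
                if 'p' in element and os.path.isdir(element)]

print(program_dirs)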

### [Sign up for a free account →](https://app.wandb.ai/login?signup=true)

- **Unified dashboard**: Central repository for all your model metrics and predictions
- **Lightweight**: No code changes required to integrate with Hugging Face
- **Accessible**: Free for individuals and academic teams
- **Secure**: All projects are private by default
- **Trusted**: Used by machine learning teams at OpenAI, Toyota, GitHub, Lyft and more

## API Key
Once you've signed up, run the next cell and click on the link to get your API key and authenticate this notebook.
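
A minimal sketch of the authentication step, assuming the `wandb` package installed earlier. The two helper functions are hypothetical names introduced here: they skip the interactive prompt when a `WANDB_API_KEY` environment variable is already set, and otherwise fall back to `wandb.login()`.

```python
import os


def has_wandb_api_key(env=None):
    """Return True if a non-empty WANDB_API_KEY is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("WANDB_API_KEY", "").strip())


def login_if_needed(env=None):
    """Call wandb.login() only when no API key is configured yet."""
    if has_wandb_api_key(env):
        return "using WANDB_API_KEY from the environment"
    import wandb  # imported lazily so the check above works without the package
    wandb.login()  # interactive: asks for the API key from your W&B account
    return "logged in interactively"
```

Calling `login_if_needed()` in a notebook cell should prompt for the key the first time and do nothing visible afterwards.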

In [None]:
!pip install numpy

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

In [None]:
import transformers
import torch
import numpy as np

# Load the CodeBERT model
model = transformers.AutoModel.from_pretrained('microsoft/codebert-base')

# Cosine similarity between two 1-D embedding tensors
def similarity(em1, em2):
  em1 = em1.detach().numpy()
  em2 = em2.detach().numpy()
  dot_product = np.dot(em1, em2)
  norm1 = np.linalg.norm(em1)
  norm2 = np.linalg.norm(em2)

  # Named `score` so the result does not shadow the function itself
  score = dot_product / (norm1 * norm2)

  return score

# Reference Java snippet that every other program is compared against
code1 = """
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;

public class Main {
    private static Scanner scan;

    public static void main(String[] args) {
        scan = new Scanner(System.in);
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 0; i < 10; i++) {
            list.add(scan.nextInt());
        }
        Collections.sort(list);
        System.out.println(list.get(9) + "\n");
        System.out.println(list.get(8) + "\n");
        System.out.println(list.get(7) + "\n");
    }
}
"""


# Tokenize the reference snippet (CodeBERT accepts at most 512 tokens)
tokenizer = transformers.AutoTokenizer.from_pretrained('microsoft/codebert-base')
input_ids1 = tokenizer.encode(code1, return_tensors='pt', truncation=True, max_length=512)


# Embed the reference snippet once; gradients are not needed for inference
with torch.no_grad():
  embedding1 = model(input_ids1).last_hidden_state

for p in program_dirs:
  p_similarity = []  # one mean token-level similarity per file
  files = [f for f in os.listdir(p) if 'java' in f]
  for file in files:
    with open(os.path.join(p, file)) as f:
      code2 = f.read()
    # Truncate long files to the 512-token limit of CodeBERT
    input_ids2 = tokenizer.encode(code2, return_tensors='pt', truncation=True, max_length=512)
    # Compute the token embeddings of the candidate file
    with torch.no_grad():
      embedding2 = model(input_ids2).last_hidden_state
    # Compare position by position, up to the length of the shorter sequence
    n_tokens = min(embedding1.shape[1], embedding2.shape[1])
    sim = [similarity(embedding1[0][i], embedding2[0][i]) for i in range(n_tokens)]
    p_similarity.append(float(np.mean(sim)))

  print(p_similarity)
  # Skip directories without Java files to avoid averaging an empty list
  if p_similarity:
    print(f"the similarity for problem {p} is {np.mean(p_similarity)}")
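
The cell above compares token embeddings position by position, so the score depends on how the two token sequences happen to line up. One common alternative, which the original notebook does not use, is to mean-pool each sequence into a single vector and compare those. The dependency-free sketch below contrasts the two on toy 2-D "embeddings"; `mean_pool`, `cosine`, and the toy data are illustrative names and values, not CodeBERT output.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy token embeddings: snippet_b holds the same directions in a different order
snippet_a = [[1.0, 0.0], [0.0, 1.0]]
snippet_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

# Position-wise comparison stops at the shorter sequence (zip truncates)
positionwise = [cosine(a, b) for a, b in zip(snippet_a, snippet_b)]

# Pooled comparison ignores token order and sequence length
pooled = cosine(mean_pool(snippet_a), mean_pool(snippet_b))

print(positionwise, round(pooled, 4))
```

On this toy data the position-wise scores are all zero while the pooled score is close to one, which shows how sensitive the position-wise average is to token alignment.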





In [None]:


!pip install simpletransformers

In [None]:
from simpletransformers.language_representation import RepresentationModel
code1 = """
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;

public class Main {
    private static Scanner scan;

    public static void main(String[] args) {
        scan = new Scanner(System.in);
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 0; i < 10; i++) {
            list.add(scan.nextInt());
        }
        Collections.sort(list);
        System.out.println(list.get(9) + "\n");
        System.out.println(list.get(8) + "\n");
        System.out.println(list.get(7) + "\n");
    }
}
"""
import torch

sentences = [code1]  # encode_sentences expects a list of strings
print(program_dirs)
for p in program_dirs:
  files = [f for f in os.listdir(p) if 'java' in f]
  for file in files:
    with open(os.path.join(p, file)) as f:
      sentences.append(f.read())

print(sentences)

model = RepresentationModel(
        model_type="bert",
        model_name="bert-base-uncased",
        use_cuda=torch.cuda.is_available()  # fall back to CPU when no GPU is attached
    )

# combine_strategy=None keeps one vector per token instead of pooling them
word_vectors = model.encode_sentences(sentences, combine_strategy=None)

In [None]:
word_vectors.shape
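
With `combine_strategy=None` the result holds one vector per token, so its shape has three axes: sentences, tokens, and hidden size. As a rough sketch of how such an array could be turned into similarity scores against the first sentence, the example below uses random numbers as a stand-in for `word_vectors`. The sizes are arbitrary, and padding tokens are not treated specially, which is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for `word_vectors`: 3 sentences, 5 tokens each, hidden size 8
vectors = rng.normal(size=(3, 5, 8))

# Average over the token axis to get one vector per sentence
pooled = vectors.mean(axis=1)

# Normalize to unit length so a dot product equals the cosine similarity
unit = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Similarity of every later sentence with the first one (the reference)
scores = unit[1:] @ unit[0]

print(pooled.shape, scores.shape)
```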

## Resources
- [Documentation](https://docs.wandb.com/huggingface): docs on the Weights & Biases and Hugging Face integration
- Contact: Message us at contact@wandb.com with questions