# Starting Kit - Soundness 

TODOs: 
- Add detailed description of the challenge
- describe your data
- describe how you will evaluate
- provide instructions to the participants about what they should do

***
# Setup
***
`COLAB` is `True` when this notebook is running on Google Colab and `False` otherwise.
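The check in the next cell relies on `get_ipython()`, which only exists inside IPython or Jupyter. As a hedged alternative for running the same code as a plain script, the sketch below looks at `sys.modules` instead. The helper name `running_on_colab` is hypothetical and not part of the starting kit, and the approach assumes that Colab runtimes load `google.colab` at kernel start-up.

```python
import sys


def running_on_colab() -> bool:
    """Return True when the google.colab package is already loaded.

    Assumption: Colab runtimes import google.colab at kernel start-up,
    so its presence in sys.modules is used as the signal.
    """
    return "google.colab" in sys.modules


COLAB = running_on_colab()
print(f"Running on Colab: {COLAB}")
```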

In [1]:
COLAB='google.colab' in str(get_ipython())

In [2]:
if COLAB:
    # clone github repo
    !git clone https://github.com/ihsaan-ullah/M1-Challenge-Class-2024.git

    # move to the Soundness starting kit folder
    %cd M1-Challenge-Class-2024/Soundness/Starting_Kit/

    !pip install -q --upgrade sentence-transformers transformers

***
# Imports
***

In [3]:
import os
import sys
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

from sentence_transformers import SentenceTransformer

from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")
tqdm.pandas()

***
# Directories
***

In [4]:
root_dir = "./"
# Input data directory to read training data from
input_dir = root_dir + "sample_data/"
# Reference data directory to read test labels from
reference_dir = root_dir + "sample_data/"
# Output data directory to write predictions to
output_dir = root_dir + "sample_result_submission"
# Ingestion program directory
program_dir = root_dir + "ingestion_program"
# Scoring program directory
score_dir = root_dir + "scoring_program"
# Directory containing the code submission (model.py)
submission_dir = root_dir + "sample_code_submission"

***
# Add directories to path
***

In [5]:
sys.path.append(input_dir)
sys.path.append(reference_dir)
sys.path.append(output_dir)
sys.path.append(program_dir)
sys.path.append(submission_dir)

***
# Data
***
1. Load Data
2. Preprocess data


TODOs:
- show data statistics

### ⚠️ Note:
The data used here is sample data, provided for demonstration only, to give you a view of what the data looks like.
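To make the expected file layout concrete, here is a minimal sketch that mimics what `Data.load_data` reads: a CSV with the columns `context_sentences`, `reference_title` and `reference_abstract`, plus a labels file with one integer per line. The rows and labels below are made up for illustration and are not taken from `sample_data/`.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real files live in sample_data/.
csv_text = (
    "context_sentences,reference_title,reference_abstract\n"
    '"Deep networks are used in astronomy [X].","A toy title","A toy abstract."\n'
    '"Another claim with a citation [X].","Second toy title","Second toy abstract."\n'
)
labels_text = "1\n0\n"

df = pd.read_csv(io.StringIO(csv_text))
labels = [int(line.strip()) for line in io.StringIO(labels_text)]

# Same concatenation that Data.transform_data applies to the reference columns
references = (df["reference_title"] + " " + df["reference_abstract"]).tolist()

assert len(df) == len(labels)  # one label per row
print(df.columns.tolist())
print(references[0], "->", labels[0])
```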

In [6]:
class Data():

  def __init__(self):

    self.df = None

    print("==========================================")
    print("Data")
    print("==========================================")

  def load_data(self):
    """
      Loads the train and test CSV files and their label files
    """
    print("[*] Loading Data")

    # file paths
    train_data_file = os.path.join(input_dir, 'train.csv')
    test_data_file = os.path.join(input_dir, 'test.csv')
    train_labels_file = os.path.join(input_dir, 'train.labels')
    test_labels_file = os.path.join(reference_dir, 'test.labels')
    
    # read data
    self.train_df = pd.read_csv(train_data_file)
    self.test_df = pd.read_csv(test_data_file)

    # read labels
    with open(train_labels_file, 'r') as file:
      self.train_labels = [int(line.strip()) for line in file.readlines()]
    with open(test_labels_file, 'r') as file:
      self.test_labels = [int(line.strip()) for line in file.readlines()]

  def transform_data(self):

    print("[*] Transforming Data")
    # pre-trained model
    self.embeddings_model = SentenceTransformer('all-MiniLM-L6-v2')

    # train context and references
    train_contexts = self.train_df['context_sentences'].tolist()
    train_references = (self.train_df['reference_title'] + " " + self.train_df['reference_abstract']).tolist()

    # test context and references
    test_contexts = self.test_df['context_sentences'].tolist()
    test_references = (self.test_df['reference_title'] + " " + self.test_df['reference_abstract']).tolist()

    # Compute train embeddings
    train_context_embeddings = self._compute_embeddings(train_contexts)
    train_reference_embeddings = self._compute_embeddings(train_references)

    # Compute test embeddings
    test_context_embeddings = self._compute_embeddings(test_contexts)
    test_reference_embeddings = self._compute_embeddings(test_references)

    # Calculate cosine similarity scores
    cosine_similarity = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    train_similarity_scores = [cosine_similarity(context, reference) for context, reference in zip(train_context_embeddings, train_reference_embeddings)]
    test_similarity_scores = [cosine_similarity(context, reference) for context, reference in zip(test_context_embeddings, test_reference_embeddings)]

    self.X_train = train_similarity_scores
    self.X_test = test_similarity_scores

  def get_train_data(self):
    return self.X_train, self.train_labels
  
  def get_test_data(self):
    return self.X_test, self.test_labels

  def _compute_embeddings(self, texts):
    return self.embeddings_model.encode(texts, convert_to_tensor=False)
  
  def show_random_sample(self):
    random_sample_index = np.random.randint(0, len(self.train_df))
    print("Example Context sentence")
    print(self.train_df.context_sentences.iloc[random_sample_index])

In [7]:
# Initialize data
data = Data()

Data


In [8]:
# load data
data.load_data()

[*] Loading Data


In [9]:
# transform data (compute embeddings and similarity scores)
data.transform_data()

[*] Transforming Data
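`transform_data` reduces each example to a single feature: the cosine similarity between the context embedding and the reference embedding. The sketch below shows the same computation in vectorized form on toy vectors that stand in for real embeddings. The helper name `rowwise_cosine_similarity` is hypothetical and not part of the starting kit.

```python
import numpy as np


def rowwise_cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows of two (n, d) arrays."""
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


# Toy vectors standing in for sentence embeddings
contexts = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
references = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, -3.0]])

# Rows are, in order: parallel, orthogonal, opposite
scores = rowwise_cosine_similarity(contexts, references)
print(scores)

# scikit-learn estimators expect a 2-D array: one row per sample, one column per feature
X = scores.reshape(-1, 1)
print(X.shape)
```

The final `reshape(-1, 1)` is the same step that `Program` applies before calling `fit` and `predict`.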


In [10]:
# show random sample from data
data.show_random_sample()

Example Context sentence
DNNs have also become powerful tools for many scientific fields, such as medicine, bioinformatics,and astronomy [X], which usually involve massive data volumes.\nHowever, deep learning still has some significant disadvantages.


***
# Visualize
***
TODOs:
- visualize your data

In [11]:
class Visualize():

    def __init__(self, data):

        print("==========================================")
        print("Visualize")
        print("==========================================")
        self.data = data



    def visualize_data(self):
        print("Implement this function")
        pass


In [12]:
# Initialize Visualize
visualize = Visualize(data=data)
visualize.visualize_data()

Visualize
Implement this function


***
# Import Submission Model
***
We import a class named `Model` from the submission file (`model.py`). This `Model` class has the following methods:
- `__init__`: initializes the classifier
- `fit`: takes the training data and labels as input and trains the classifier
- `predict`: takes the test data as input and returns the predictions made by the trained classifier


In this example code, the `Model` class implements a Gradient Boosting Classifier. You can find the code in `M1-Challenge-Class-2024/Soundness/Starting_Kit/sample_code_submission/model.py`. You can modify it as you like, as long as you keep the required class structure and methods. More instructions are given inside the `model.py` file. If you are running on Colab, click the folder icon in the left sidebar to open the file browser.
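For orientation, here is a minimal sketch of a class with the three methods described above. It is not the contents of `model.py`: the class is named `SketchModel` to avoid shadowing the real `Model`, the `random_state` argument is an assumption added for reproducibility, and the toy data is made up.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


class SketchModel:
    """Illustrative stand-in for the Model class; a real submission must name it Model."""

    def __init__(self):
        # random_state is an assumption, not taken from model.py
        self.clf = GradientBoostingClassifier(random_state=0)

    def fit(self, X, y):
        # X: array of shape (n_samples, 1) holding similarity scores
        self.clf.fit(X, y)

    def predict(self, X):
        # Returns one predicted label per row of X
        return self.clf.predict(X)


# Toy similarity scores (one feature per sample) with made-up labels
X_toy = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9]).reshape(-1, 1)
y_toy = [0, 0, 0, 1, 1, 1]

sketch_model = SketchModel()
sketch_model.fit(X_toy, y_toy)
toy_predictions = sketch_model.predict(X_toy)
print(toy_predictions)
```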

In [13]:
from model import Model

***
# Program
***
The **`Ingestion Program`** is responsible for running a participant's submission on the Codabench platform. **`Program`** is a simplified version of the **Ingestion Program** that shows participants how a submission is run:
1. Train the model on the training data
2. Predict on the test data

In [14]:
class Program():

    def __init__(self, data):

        # used to keep object of Model class to run the submission
        self.model = None
        # object of Data class used here to get the train and test sets
        self.data = data

        # results
        self.results = []

        print("==========================================")
        print("Program")
        print("==========================================")
    
    def initialize_submission(self):
        print("[*] Initializing Submmited Model")
        self.model = Model()

    def fit_submission(self):
        print("[*] Calling fit method of submitted model")
        X_train, y_train  = self.data.get_train_data()
        X_train = np.array(X_train).reshape(-1, 1)
        self.model.fit(X_train, y_train)


    def predict_submission(self):
        print("[*] Calling predict method of submitted model")
      
        X_test, _ = self.data.get_test_data()
        X_test = np.array(X_test).reshape(-1, 1)
        self.y_test_hat = self.model.predict(X_test)

       

In [15]:
# Initialize Program
program = Program(data=data)

Program


In [16]:
# Initialize submitted model
program.initialize_submission()

[*] Initializing Submmited Model
[*] - Initializing Classifier


In [17]:
# Call fit method of submitted model
program.fit_submission()

[*] Calling fit method of submitted model
[*] - Training Classifier on the train set


In [18]:
# Call predict method of submitted model
program.predict_submission()

[*] Calling predict method of submitted model
[*] - Predicting test set using trained Classifier


***
# Score
***

TODOs:
- Explain the evaluation metric


In [27]:
class Score():

    def __init__(self, data, program):

        self.data = data
        self.program = program

        print("==========================================")
        print("Score")
        print("==========================================")

    def compute_scores(self):
        print("[*] Computing scores")

        _, y_test = self.data.get_test_data()
        y_test_hat = self.program.y_test_hat

        accuracy = accuracy_score(y_test, y_test_hat)
        print(f"Accuracy score: {accuracy}")

In [28]:
# Initialize Score
score = Score(data=data, program=program)

Score


In [29]:
# Compute Score
score.compute_scores()

[*] Computing scores
Accuracy score: 0.6570397111913358
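The score reported above is plain classification accuracy: the fraction of test labels that the model predicts correctly. The sketch below reproduces the calculation by hand on made-up labels and compares it with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

# Fraction of positions where the prediction equals the label (3 of 5 here)
manual_accuracy = np.mean(np.array(y_true) == np.array(y_pred))
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy)
print(library_accuracy)
```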


***
# Submissions
***

### **Unit Testing**

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. To make a submission, modify the file <code>model.py</code> in the <code>sample_code_submission/</code> directory, then run the tests below to make sure everything works. The ingestion program used here is the same program that will be run on the server to test your submission.
<br>
Keep the sample code simple.<br>
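Before running the full ingestion program, a quick local check of the required interface can catch simple mistakes early. The sketch below is an optional extra, not part of the starting kit: the helper `check_model_interface` and the two example classes are hypothetical.

```python
def check_model_interface(model_cls) -> list:
    """Return the names of required methods that model_cls does not define."""
    required = ("fit", "predict")
    return [name for name in required if not callable(getattr(model_cls, name, None))]


class GoodModel:
    def fit(self, X, y):
        pass

    def predict(self, X):
        return [0 for _ in X]


class BrokenModel:
    def fit(self, X, y):
        pass


print(check_model_interface(GoodModel))    # []
print(check_model_interface(BrokenModel))  # ['predict']
```

With the real submission, the same helper could be called as `check_model_interface(Model)` after `from model import Model`.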

<code>python3</code> is required for this step

### **Test Ingestion Program**

In [30]:
!python3 $program_dir/ingestion.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


############################################
### Ingestion Program
############################################

[*] Loading Data
[*] Transforming Data
[*] Initializing Submmited Model
[*] - Initializing Classifier
[*] Calling fit method of submitted model
[*] - Training Classifier on the train set
[*] Calling predict method of submitted model
[*] - Predicting test set using trained Classifier
[*] Saving ingestion result

---------------------------------
[✔] Total duration: 0:38:25.285816
---------------------------------

----------------------------------------------
[✔] Ingestions Program executed successfully!
----------------------------------------------




### **Test Scoring Program**

In [32]:
!python3 $score_dir/score.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


############################################
### Scoring Program
############################################

[*] Reading predictions
[✔]
[*] Reading test labels
[✔]
[*] Computing scores
Accuracy Score: 0.6570397111913358
[✔]
[*] Writing scores
[✔]

---------------------------------
[✔] Total duration: 0:00:00.004887
---------------------------------

----------------------------------------------
[✔] Scoring Program executed successfully!
----------------------------------------------




### **Prepare the submission**

TODOs:  
- The following submission will be submitted by the participants to your competition website. Describe this clearly and point to the competition once your website is ready

In [33]:
import datetime
from data_io import zipdir
the_date = datetime.datetime.now().strftime("%y-%m-%d-%H-%M")
code_submission = 'Soundness-code_submission_' + the_date + '.zip'
zipdir(code_submission, submission_dir)
print("Submit : " + code_submission + " to the competition")
print("You can find the zip file in `M1-Challenge-Class-2024/Soundness/Starting_Kit/")

Submit : Soundness-code_submission_24-01-12-15-27.zip to the competition
You can find the zip file in `M1-Challenge-Class-2024/Soundness/Starting_Kit/
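The `zipdir` helper comes from the repository's `data_io` module and its implementation is not shown here. As a rough idea of what such a helper does, the sketch below zips a directory with the standard library. The function name `zip_directory` is hypothetical, and storing paths relative to the source directory (so that `model.py` sits at the archive root) is an assumption about the expected layout.

```python
import os
import tempfile
import zipfile


def zip_directory(archive_path: str, source_dir: str) -> None:
    """Write every file under source_dir into archive_path, with paths relative to source_dir."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for folder, _, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                zf.write(full_path, os.path.relpath(full_path, source_dir))


# Demo on a throwaway directory containing a placeholder model.py
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_code_submission")
    os.makedirs(src)
    with open(os.path.join(src, "model.py"), "w") as f:
        f.write("class Model:\n    pass\n")

    archive = os.path.join(tmp, "submission.zip")
    zip_directory(archive, src)

    with zipfile.ZipFile(archive) as zf:
        archived_names = zf.namelist()

print(archived_names)
```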
