## AI/ML Coding Round - Data Preparation

### Problem: 
Train an LLM that can answer queries about JFrog Pipelines' [native steps](https://jfrog.com/help/r/jfrog-pipelines-documentation/pipelines-steps).
When posed with a question like "How do I upload an artifact?" or "What step should I use for an Xray scan?", the model should list the appropriate native step(s) and provide an associated YAML for that step.

### Requirements
1. Data Collection: Acquire the publicly available documentation on native steps for building pipelines from JFrog's website (https://jfrog.com/help/r/jfrog-pipelines-documentation/pipelines-steps). Data that is not publicly accessible falls outside the scope of this coding challenge.
2. Data Preprocessing: Process the text to make it suitable for training. This might involve tokenization, stemming, and other NLP techniques.
3. Model Training: Train an LLM on the preprocessed dataset. You can choose any freely available open-source model, such as BERT.
4. Query Handling: Implement a function that takes a user query as input and returns the appropriate native step(s) and a sample YAML configuration.
5. YAML Generation: Implement a function that can generate a sample YAML configuration based on the identified native step(s).
------------

```python
@author - Midhun Kumar
@email  - midhunkumar04@agmail.com
```
-----------------------------------------------------------------------

### 1. Data preparation for training

- We collected the raw data for the various pipelines pages from the site.
- The data covers the pipeline procedures and includes sample YAML files.
- I chose Meta's open-source `codellama-instruct-7B` model for fine-tuning because it is well suited to code completion and instruction following.
- Since we are using this model, the data must be preprocessed into the format below.
![Alt text](image.png)
- The YAML parts of the raw data are tagged with `[code][/code]` blocks. We need to convert these into `[YAML]` tags so that the YAML can be extracted from the trained model's output.

#### Step1: Load the data

In [2]:
# Importing Libraries
import re
import pandas as pd 
pd.set_option('display.max_colwidth',None)

In [3]:
raw_data = pd.read_csv('./jFrog_pipline.csv', index_col=False, encoding='utf-8')
raw_data.head(1)

Unnamed: 0.1,Unnamed: 0,Title,PiplineProcess
0,0,jfrog-pipelines,"JFrog Pipelines offers JFrog Platform customers three vital capabilities: end-to-end automation (CI/CD), workflow and tool orchestration, and the optimization of the JFrog toolset functionality in use. Consistent with JFrog’s customer-centric product philosophy, Pipelines is enterprise-ready and universal.\n### Workflow Automation\nA pipeline is an event-driven automated workflow for executing a set of DevOps activities (CI, deployments, infrastructure provisioning, etc). It is composed of a sequence of interdependent **steps** which execute discrete functions. Steps act on **resources** , which hold the information needed to execute (files, key-value pairs, etc).\nDevelopers can create pipelines easily with a simple declarative YAML-based language. While each step in a pipeline executes in a stateless runtime environment, Pipelines provides facilities to manage state and step outputs across the workflow so that all dependent steps can access the information they need from upstream steps in order to execute. 
This helps coordinate activities centrally across diverse DevOps tools and teams without custom DIY scripts.\nWorkflows can be configured for a variety of scenarios, including:\n * Continuous Integration for your applications\n * Continuous Delivery workflows that connect all your CI/CD and DevOps activities across tools and functional silos\n * Automate IT Ops workflows like infrastructure provisioning, security patching, and image building\n\n\n### Get up and running with JFrog Pipelines\nIn this section, you will find information to get you started whether you are a new user or an existing user.\n * If you do not yet have a subscription, get started with trial subscription of the JFrog Platform on the Cloud.\n * If you are a new user, get started with the onboarding videos for JFrog Pipelines.Onboarding Best Practices: JFrog Pipelines\n\n\n### Features\n#### Pipelines as Code\nDefine your automated workflow through code, using a domain specific language in a YAML file of key-value pairs that you can create and maintain with your favorite text editor.\n#### Real Time Visibility\nJFrog Pipelines renders your pipeline definition as an interactive diagram, helping you to see the flow of tasks and their inter-dependencies, as well as view the success record of any runs that were performed.\n#### Universal\nConnect your pipeline automation to your source code repositories in a version control system (such as GitHub or BitBucket) to automatically trigger execution on any new submission (commit) of a code change. 
Connect to other popular tools through your credentials for storage, issue-tracking, notification, orchestration and more through a library of integrations.\n#### Native Integration with Artifactory\nJFrog Pipelines is designed to be used with Artifactory, with built-in directives for pushing artifacts, performing builds, pushing build information, image scanning, and build promotion.\n#### Integration with JFrog Platform\nJFrog Pipelines is designed as an integral part of the JFrog platform, including scanning artifacts/builds through Xray, the creation and delivery of release bundles through JFrog Distribution, for a complete end-to-end SDLC pipeline from commit to production runtime.\n#### Security First\nFine-grained permissions and access control limit who can access workflows. Centralized, encrypted storage of credentials and keys help ensure secrets stay safe.\n#### Enterprise-Ready\nManage multiple execution nodes using a single installation of Pipelines and automatically distribute Pipeline execution across them for scale and speed.\n### Watch the Screencast\n\n Document url for reference - https://jfrog.com/help/r/jfrog-pipelines-documentation/pipelines-steps"


#### Step3: Save the data as csv

#### Points to remember:
1. Each sample needs an `[INST]` tag around the instruction.
2. Use `<<SYS>>` for the system message and `[YAML]` for YAML blocks. The system message asks the LLM to perform a specific task.
3. Everything up to the first heading (`###`, three hashes) is treated as the instruction block.
4. `[code]` tags are replaced with `[YAML]` tags.

example:

`[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]
`
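
The template above can be filled in programmatically. The following is a minimal sketch; the helper name `build_prompt` and the example strings are illustrative assumptions, not part of the original notebook.

```python
def build_prompt(system_message: str, user_prompt: str) -> str:
    """Fill the [INST] / <<SYS>> template shown above."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_message}\n"
        "<</SYS>>\n"
        f"{user_prompt}[/INST]"
    )


# Hypothetical system message and user question, for illustration only
example = build_prompt(
    "You answer questions about JFrog Pipelines native steps.",
    "How do I upload an artifact?",
)
print(example)
```

Keeping the template in one function makes it easier to use the same layout at training time and at inference time.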

#### Step2: Prepare the data

In [92]:
# data preparation
system_message = f'<s>[INST]<<SYS>> You are a helpful, respectful and honest assistant. Always answers only users jFrog pipline related questiions.You will say i am not sure for other general questions <</SYS>>'

In [11]:
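# Function that adds the prompt tags
def tagAddtion(string: str) -> str:
    '''
    Take a raw document and return it with the prompt tags added.
    '''
    # Close the instruction block just before the first heading marker
    string = re.sub('#', '[/INST]#', string, count=1)
    # Rename the [code] ... [/code] markers to [YAML] ... [/YAML]
    string = re.sub(r'\[code\]', '[YAML]', string)
    string = re.sub(r'\[/code\]', '[/YAML]', string)
    # Prepend the system message and append the end-of-sequence token
    string = system_message + string + '</s>'
    return string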
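
The substitution on `#` inserts `[/INST]` only when the document contains a `#`, so a document without any heading would end up with no closing instruction tag. Below is a sketch of a variant that fails loudly in that case. The name `add_tags`, the placeholder system message, and the toy document are assumptions made for illustration.

```python
SYSTEM = '<s>[INST]<<SYS>> system message <</SYS>>'  # placeholder, not the real system message


def add_tags(doc: str, system_message: str = SYSTEM) -> str:
    """Tag a document, raising if there is no heading to split on."""
    if '#' not in doc:
        raise ValueError('document has no heading to split on')
    # Close the instruction block before the first heading marker
    doc = doc.replace('#', '[/INST]#', 1)
    # Rename the code markers to YAML markers
    doc = doc.replace('[code]', '[YAML]').replace('[/code]', '[/YAML]')
    return system_message + doc + '</s>'


tagged = add_tags('Intro text\n### Schema\n[code]\nkey: value\n[/code]\n')
print(tagged)
```

Plain `str.replace` is used here because the tags are literal text, which avoids having to escape the square brackets.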

In [93]:
final_data = pd.DataFrame()
final_data['PiplineProcess'] = raw_data.PiplineProcess.apply(tagAddtion)

In [94]:
final_data.loc[76]

PiplineProcess    <s>[INST]<<SYS>> You are a helpful, respectful and honest assistant. Always answers only users jFrog pipline related questiions.You will say i am not sure for other general questions <</SYS>>The **DistributeReleaseBundle** native step triggers the distribution of Release Bundles to an Artifactory Edge Node. This step requires a signed release bundle and one or more distribution rules to successfully execute.Distributing Release BundlesJFrog Artifactory Edge\n[/INST]##### YAML Schema\nThe YAML schema for DistributeReleaseBundle native step is as follows:\n **DistributeReleaseBundle**\n[YAML]\n    pipelines:\n      - name:    <string>\n        steps:\n          - name: my_distribute\n            type: DistributeReleaseBundle\n            configuration:\n                    #inherits all the tags from bash; \n              dryRun: <boolean>                # optional\n              inputResources:\n                - name: my_releaseBundle      # one ReleaseBundle is requi

In [95]:
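# Verify that all 244 documents have the proper tags.
# regex=False matches the tags literally; as a regex, '[INST]' would be a character class.
len(final_data.loc[final_data.PiplineProcess.str.contains('[/INST]', regex=False)]), len(final_data.loc[final_data.PiplineProcess.str.contains('[INST]', regex=False)])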

(244, 244)

- All 244 documents have the tags.
- The YAML tags are also in place.
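
One caveat when checking tags with `str.contains`: the pattern is treated as a regular expression by default, so `[INST]` is read as a character class matching any single one of `I`, `N`, `S`, `T`. The sketch below uses two toy strings (assumed for illustration) to show the difference between regex and literal matching.

```python
import pandas as pd

docs = pd.Series([
    "<s>[INST] question [/INST] answer </s>",  # both tags present
    "INST without any brackets",               # no tags at all
])

# Default regex=True: '[INST]' matches any one of the characters I, N, S, T
regex_hits = docs.str.contains('[INST]')
# regex=False: the pattern is matched as the literal text '[INST]'
literal_hits = docs.str.contains('[INST]', regex=False)

print(regex_hits.tolist())    # [True, True]
print(literal_hits.tolist())  # [True, False]
```

Only the literal match distinguishes a tagged document from an untagged one.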

#### Step3: Save the training data


In [96]:
final_data.to_csv('./final_data_for_training.csv')
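
`to_csv` writes the DataFrame index as an extra nameless column by default, which is one way columns such as `Unnamed: 0` (as seen in the raw data preview) can appear after a reload. Below is a small sketch with toy data and an in-memory buffer, assuming the index is not needed in the training file.

```python
import io

import pandas as pd

toy = pd.DataFrame({'PiplineProcess': ['<s>[INST] q [/INST] a </s>']})

# Default behaviour: the index is written as a nameless first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)  # ['Unnamed: 0', 'PiplineProcess']
print(cols_clean)    # ['PiplineProcess']
```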

#### Experiment 1 Failed 😞


#### Experiment 2
- The model did not perform well, so we try limiting the data.
- Remove pipeline documents that are not required for training.
- Add questions instead of splitting the data.

Remarks:
- Experiment 1 failed because the extracted data was not useful or was missing some features.
- We will trim the data, or keep only the important procedures where clear-cut answers are given.
- We will adjust the prompt and the training structure as well.

In [6]:
raw_data.Title.to_list()

['jfrog-pipelines',
 'pipelines-use-cases',
 'pipelines-concepts',
 'pipelines-step-by-step',
 'see-it-live',
 'pipelines-quickstart',
 'pipeline-example-hello-world',
 'pipeline-example-docker-build-and-push',
 'pipeline-example-release-to-edge-node',
 'pipeline-example-go-build',
 'pipeline-example-npm-build',
 'pipeline-example-maven-build',
 'pipeline-example-helm-blue-green-deploy',
 'configuring-pipelines',
 'managing-pipelines-integrations',
 'managing-pipeline-sources',
 'pipeline-source-sync-recovery',
 'managing-pipelines-node-pools',
 'creating-custom-vm-images',
 'creating-dynamic-nodes-on-kubernetes',
 'managing-pipelines-static-nodes',
 'sending-pipelines-nodes-agent-logs-to-logstash',
 'creating-pipelines',
 'defining-a-pipeline',
 'pipelines-integrations',
 'airbrake-integration',
 'artifactory-integration',
 'aws-keys-integration',
 'azure-keys-integration',
 'digital-ocean-integration',
 'bitbucket-integration',
 'bitbucket-server-integration',
 'distribution-integrat

In [4]:
# We will take these 100 concepts and train on them
pipline_for_training =[
    'pipeline-example-hello-world',
    'pipeline-example-docker-build-and-push',
    'pipeline-example-release-to-edge-node',
    'pipeline-example-go-build',
    'pipeline-example-npm-build',
    'pipeline-example-maven-build',
    'pipeline-example-helm-blue-green-deploy',
    'defining-a-pipeline',
    'pipelines-integrations',
    'airbrake-integration',
    'artifactory-integration',
    'aws-keys-integration',
    'azure-keys-integration',
    'digital-ocean-integration',
    'bitbucket-integration',
    'bitbucket-server-integration',
    'distribution-integration',
    'docker-registry-integration',
    'file-server-integration',
    'generic-integration',
    'github-integration',
    'github-enterprise-integration',
    'gitlab-integration',
    'google-cloud-integration',
    'jenkins-integration',
    'jenkins-server-integration',
    'jfrog-platform-access-token-integration',
    'jira-integration',
    'incoming-webhook-integration',
    'kubernetes-integration',
    'newrelic-integration',
    'outgoing-webhook-integration',
    'pagerduty-events-integration',
    'pem-key-integration',
    'slack-integration',
    'smtp-credentials-integration',
    'ssh-key-integration',
    'aql',
    'artifact',
    'buildinfo',
    'crontrigger',
    'distributionrule',
    'filespec',
    'gitrepo',
    'helmchart',
    'image',
    'incomingwebhook',
    'outgoingwebhook',
    'propertybag',
    'releasebundle',
    'remotefile',
    'vmcluster',
    'pipelines-steps',
    'bash',
    'createreleasebundle',
    'distributereleasebundle',
    'dockerpush',
    'dockerbuild',
    'gobuild',
    'gopublishbinary',
    'gopublishmodule',
    'gradlebuild',
    'helmbluegreencleanup',
    'helmbluegreendeploy',
    'helmbluegreenroleswitch',
    'helmdeploy',
    'helmpublish',
    'jenkins',
    'linuxvmdeploy',
    'matrix',
    'mvnbuild',
    'npmbuild',
    'npmpublish',
    'prematrix',
    'postmatrix',
    'powershell',
    'promotebuild',
    'publishbuildinfo',
    'signreleasebundle',
    'triggerpipeline',
    'uploadartifact',
    'xrayscan',
    'pipelines-utility-functions',
    'bump_semver',
    'replace_envs',
    'retry_command',
    'get_uuid',
    'save_artifact_info',
    'validate_artifact',
    'configure_jfrog_cli',
    'use_jfrog_cli',
    'check_xray_available',
    'cleanup_jfrog_cli',
    'set_trigger_payload',
    'get_trigger_payload',
    'end_step',
    'update_run_description',
    'set_run_name',
    'source-control',
    'compare_git'
    ]
len(pipline_for_training)

100

In [5]:
# Filter down to the 100 selected documents
new_data = raw_data[raw_data['Title'].isin(pipline_for_training)]
len(new_data)

100
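
`isin` silently ignores names that do not occur in the `Title` column, so a misspelled entry in the list would simply drop out of the result. Below is a sketch of a sanity check on toy data; the toy titles and the deliberately misspelled entry are illustrative assumptions.

```python
import pandas as pd

toy_raw = pd.DataFrame({'Title': ['aql', 'artifact', 'bash', 'see-it-live']})
wanted = ['aql', 'bash', 'xrayscann']  # last entry deliberately misspelled

# Rows whose title is in the wanted list
selected = toy_raw[toy_raw['Title'].isin(wanted)]
# Wanted titles that never matched a row
missing = sorted(set(wanted) - set(toy_raw['Title']))

print(len(selected))  # 2
print(missing)        # ['xrayscann']
```

An empty `missing` list confirms that every requested title was found in the data.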

In [56]:
# New data preparation
system_message = '<s>[INST]<<SYS>> You are a helpful, respectful and honest assistant. Helps user to write jFrog pipline and answers about jFrog pipline process <</SYS>>'
# Prefix each title with a question template to form the instruction
questions = 'Write a jFrog pipeline to do a'
def qustions_addtion(string: str) -> str:
    '''
    Build the instruction part of a sample: system message + question + title.
    '''
    string = system_message + ' ' + questions + ' ' + string + '?' + '[/INST]'
    return string


# Function that adds the YAML tags and the end-of-sequence token
def new_tagAddtion(string: str) -> str:
    '''
    Take a raw document and return it with [YAML] tags and a closing </s>.
    '''
    string = re.sub(r'\[code\]', '[YAML]', string)
    string = re.sub(r'\[/code\]', '[/YAML]', string)
    string = string + '</s>'
    return string

In [57]:
new_final_data = pd.DataFrame()
new_final_data['Title'] = new_data.Title.apply(qustions_addtion)
new_final_data['PiplineProcess'] = new_data.PiplineProcess.apply(new_tagAddtion)

new_final_data['text'] = new_final_data['Title'] + ' '+ new_final_data['PiplineProcess']

In [58]:
new_final_data['text']

6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       

In [59]:
new_final_data['text'].to_csv('./new_final_data_for_training.csv')

#### Experiment 3
- Since we are training on a fairly complex dataset, let me try to simplify it for single-GPU inference.

In [7]:
test = ''' 
<s>[INST]<<SYS>> You are a helpful, respectful and honest assistant. Helps user to write jFrog pipline and answers about jFrog pipline process <</SYS>> Write a jFrog pipeline to do a aql?[/INST] An `Aql` resource specifies an Artifactory query using Artifactory Query Language.Artifactory Query Language
An Aql resource can be used in the `configuration` of a FileSpec resource to specify file properties to match against.
An Aql resource can be used as an `inputresource` for the CreateReleaseBundle step. It defines the query that is used to create a release bundle.
##### YAML Schema
The YAML schema for Aql:
[YAML]
    resources:
      - name:                       <string>
        type: Aql
        configuration:
          sourceArtifactory:   <artifactory integration name>
          query:               <string>
          addedProperties:
            key1:              val1
            key2:              val2
          mappings: 
            - name: mapping1
              input:    
              output: 
            - name: mapping2
              input:    
              output: 
[/YAML]
##### Tags
###### name
An alphanumeric string (underscores are permitted) that identifies the resource.
###### type
Must be `Aql `for this resource type.
###### configuration
Specifies all configuration selections for the resource.
Tag
 **Description**
Required/Optional  
`sourceArtifactory `
The name of an Artifactory Integration.
Required  
`query `
An Artifactory Query Language query string.Artifactory Query Language
Required  
`addedProperties`
List of added properties that will be added to the artifacts after distribution of the release bundle.
  * `key` \- Property key to be created or updated after distribution of the release bundle.
  * `value` \- List of values to be added to the property key after distribution of the release bundle.


Optional  
`mappings`
List of mappings that are applied to the artifact paths after distribution of the release bundle.
  * `name` \- The friendly name of the mapping entry.
  * `input` \- Regex match string for artifact paths.
  * `output` \- Replacement for artifact paths matched by the ""input"" match string. Capture groups can be used as ""$1"".


Optional  
##### Examples
  * These examples may require an Artifactory integration and/or a Distribution integration.
  * The Pipelines DSL for this example is available in this repository in the JFrog GitHub account.


###### Example 1: AQL with build name and number
[YAML]
    resources:
      - name: myAqlResource
        type: Aql
        configuration:
          sourceArtifactory: myArtifactoryIntegration
          query: items.find({""@build.name"":{""$eq"":""buildName""}})
      - name: myReleaseBundle
        type: ReleaseBundle
        configuration:
          sourceDistribution: myDistributionIntegration
          name: name
          version: '1'
    
    pipelines:
      - name: pipelineToUpdateResource
        steps:
          - name: myBashStep
            type: Bash
            configuration:
              integrations:
                - name: myArtifactoryIntegration
              outputResources:
                - name: myAqlResource
            execution:
              onExecute:
                - write_output myAqlResource query='items.find({""@build.name"":{""$eq"":""myBuildName""}})'
    
      - name: CreateReleaseBundlePipeline
        configuration:
          inputResources:
            - name: myAqlResource
          outputResources:
            - name: myReleaseBundle
        steps:
          - name: myCreateReleaseBundleStep
            type: CreateReleaseBundle
            configuration:
              releaseBundleName: ${JFROG_CLI_BUILD_NAME}
              releaseBundleVersion: ${run_id}
              dryRun: true
[/YAML]
###### Example 2: AQL with file name
[YAML]
    resources:
      - name: myAqlResource
        type: Aql
        configuration:
          sourceArtifactory: myArtifactoryIntegration
          query: 'items.find({""$and"": [{""repo"": {""$eq"": ""example-repo-local""}}, {""name"": {""$match"": ""myFile.txt""}}]})'
    
      - name: myReleaseBundle
        type: ReleaseBundle
        configuration:
          sourceDistribution: myDistributionIntegration
          name: name
          version: '1'
    
    pipelines:
      - name: myCreateReleaseBundlePipeline
        steps:
          - name: myCreateReleaseBundleStep
            type: CreateReleaseBundle
            configuration:
              releaseBundleName: ${JFROG_CLI_BUILD_NAME}
              releaseBundleVersion: ${run_id}
              dryRun: false
              inputResources:
                - name: myAqlResource
              outputResources:
                - name: myReleaseBundle
[/YAML]
###### Example 3: AQL to add an additional property to the artifacts in the release bundle
[YAML]
    resources:
      - name: myAqlResource
        type: Aql
        configuration:
          sourceArtifactory: myArtifactoryIntegration
          query: 'items.find({""$and"": [{""repo"": {""$eq"": ""example-repo-local""}}, {""name"": {""$match"": ""myFile.txt""}}]})'
          addedProperties:
            myRunNumber: ${run_id}
    
      - name: myReleaseBundle
        type: ReleaseBundle
        configuration:
          sourceDistribution: myDistributionIntegration
          name: name
          version: '1'
    
    pipelines:
      - name: myCreateReleaseBundlePipeline
        steps:
          - name: myCreateReleaseBundleStep
            type: CreateReleaseBundle
            configuration:
              releaseBundleName: ${JFROG_CLI_BUILD_NAME}
              releaseBundleVersion: ${run_id}
              dryRun: false
              inputResources:
                - name: myAqlResource
              outputResources:
                - name: myReleaseBundle
[/YAML]
###### Example 4: AQL to map one path to another
[YAML]
    resources:
      - name: myAqlResource
        type: Aql
        configuration:
          sourceArtifactory: myArtifactoryIntegration
          query: 'items.find({""$and"": [{""repo"": {""$eq"": ""example-repo-local""}}, {""name"": {""$match"": ""myFile.txt""}}]})'
          mappings:
            - name: myMapping
              input: example-repo-local/myFile.txt
              output: example-repo-local/myFile_2.txt
    
      - name: myReleaseBundle
        type: ReleaseBundle
        configuration:
          sourceDistribution: myDistributionIntegration
          name: name
          version: '1'
    
    pipelines:
      - name: myCreateReleaseBundlePipeline
        steps:
          - name: myCreateReleaseBundleStep
            type: CreateReleaseBundle
            configuration:
              releaseBundleName: ${JFROG_CLI_BUILD_NAME}
              releaseBundleVersion: ${run_id}
              dryRun: false
              inputResources:
                - name: myAqlResource
              outputResources:
                - name: myReleaseBundle
[/YAML]
###### Example 5: AQL mapping paths using capturing groups in the input
[YAML]
    resources:
      - name: myAqlResource
        type: Aql
        configuration:
          sourceArtifactory: myArtifactoryIntegration
          query: 'items.find({""$and"": [{""repo"": {""$eq"": ""example-repo-local""}}, {""name"": {""$match"": ""myFile.txt""}}]})'
          mappings:
            - name: myMapping
              input: example-repo-local/myFile.(.*)
              output: example-repo-local/myFile_2.$1
    
      - name: myReleaseBundle
        type: ReleaseBundle
        configuration:
          sourceDistribution: myDistributionIntegration
          name: name
          version: '1'
    
    pipelines:
      - name: myCreateReleaseBundlePipeline
        steps:
          - name: myCreateReleaseBundleStep
            type: CreateReleaseBundle
            configuration:
              releaseBundleName: ${JFROG_CLI_BUILD_NAME}
              releaseBundleVersion: ${run_id}
              dryRun: false
              inputResources:
                - name: myAqlResource
              outputResources:
                - name: myReleaseBundle
[/YAML]
###### Example 6
 **pipelines.yml**
[YAML]
    resources:
      - name: Aql_ART_1
        type: Aql
        configuration:
          sourceArtifactory: myArtifactoryIntegration
          query: items.find({""@build.name"":{""$eq"":""art""}})
          addedProperties:
            key1: value1
            key2: value2
          mappings:
            - name: mapping1
              input: IN
              output: test1
            - name: mapping2
              input: IN
              output: test2
    
    pipelines:
      - name: pipelines_ART_1
        steps:
          - name: step_ART_1
            type: PowerShell
            configuration:
              nodePool: win_2019
              inputResources:
                - name: Aql_ART_1
                  trigger: true
            execution:
              onExecute:
                - write_output ""executing step...""
[/YAML]

 Document url for reference - https://jfrog.com/help/r/jfrog-pipelines-documentation/pipelines-steps</s>"
59,"<s>[INST]<<SYS>> You are a helpful, respectful and honest assistant. Helps user to write jFrog pipline and answers about jFrog pipline process <</SYS>> Write a jFrog pipeline to do a artifact?[/INST] An `Artifact` resource can be used to trigger a step or pipeline on any of the following Artifactory events:
  * Deploy
  * Delete
  * Copy
  * Move
  * Cache


When this resource is added as an input resource to any step, on resource update, the step or the pipeline is triggered.
### Important
If there are issues while using Artifact resource, see the Troubleshooting section below for possible solutions.
### Note
This resource cannot be added as an output to a step.
### See it Live!
Click here to see a live example that uses this resource.
##### YAML Schema
### Rename Resource Name
In your pipeline YAML, whenever you make a change to any tag in the Configuration section, the name of the resource must be renamed. If the resource name is not renamed, any changes to the Configuration section are not taken into consideration during pipeline sync.
The YAML schema for Artifact:
[YAML]
    resources:
      name:         <string>
      type: Artifact
      configuration:
        jfrogPlatformProvider:     <JFrogPlatformAccessToken_integration>    # required
        repositories:                                # required
          - repository_name
          - repository_name
        artifactPath:                                # optional
          include:
            - pattern_1
            - pattern_2
          exclude:
            - pattern_1
            - pattern_2
          branches:                                    # required
            include:     <includePattern>            
            exclude:     <excludePattern>   
          buildOn:                                     # required
            deployed:    <boolean>
            deleted:     <boolean>
            copied:      <boolean>
            moved:       <boolean>
            cached:      <boolean>
    
[/YAML]
'''

In [11]:
# Escape the brackets so the tags match literally; re.DOTALL lets '.' span newlines
print(re.search(r'\[YAML\](.*?)\[/YAML\]', test, re.DOTALL).group(1))

    resources:
      - name:                       <string>
        type: Aql
        configuration:
          sourceArtifactory:   <artifactory integration name>
          query:               <string>
          addedProperties:
            key1:              val1
            key2:              val2
          mappings: 
            - name: mapping1
              input:    
              output: 
            - name: mapping2
              input:    
              output: 
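
In a regular expression, `[` and `]` delimit a character class, and `.` does not match newlines unless `re.DOTALL` is set. Matching the literal `[YAML]` tags across several lines therefore needs escaped brackets and that flag. The sketch below collects every block from a text; the helper name `extract_yaml_blocks` and the short sample are assumptions for illustration.

```python
import re
from typing import List

# Non-greedy match so each block ends at its own closing tag
YAML_BLOCK = re.compile(r'\[YAML\](.*?)\[/YAML\]', re.DOTALL)


def extract_yaml_blocks(text: str) -> List[str]:
    """Return the contents of every [YAML]...[/YAML] block, in order."""
    return [block.strip('\n') for block in YAML_BLOCK.findall(text)]


sample = (
    "Intro text\n"
    "[YAML]\n"
    "    resources:\n"
    "      - name: myAqlResource\n"
    "        type: Aql\n"
    "[/YAML]\n"
    "More text\n"
    "[YAML]\n"
    "    pipelines:\n"
    "      - name: demo\n"
    "[/YAML]\n"
)
blocks = extract_yaml_blocks(sample)
print(len(blocks))  # 2
print(blocks[0])
```

Only the surrounding newlines are stripped, so the indentation inside each block is preserved.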
