## Welcome to the LLM tutorial 👋!

The **2 key goals** of the tutorial are as follows:

1. Get comfortable using the Python library langchain;
2. Use LLMs to solve a fictional innovation mapping scenario.

Feel free to refer to some of these materials during the tutorial and in your project work: 

#### Documentation 

- [langchain](https://python.langchain.com/docs/get_started/introduction.html)

#### Repos 

- [The Practical Guides to Large Language Models](https://github.com/Mooler0410/LLMsPracticalGuide)
- [A series of langchain tutorials](https://github.com/gkamradt/langchain-tutorials)

#### Videos

- [LangChain Crash Course for Beginners](https://www.youtube.com/watch?v=nAmC7SoVLd8)


In [1]:
import langchain

from dotenv import load_dotenv
from pathlib import Path
import os
import pandas as pd

from dap_taltech.utils.data_getters import DataGetter
from dap_taltech import logger
import llm_utils as lu

### Preamble

The code block below is the preamble to this tutorial. It ensures that we:

- install our dependencies;
- have access to our datasets; and
- have our OpenAI API key available.

In [2]:
os.system(
    f"pip install -r {Path.cwd()}/llm_requirements.txt --quiet" #install requirements to run this notebook
)

load_dotenv() # load environment variables

oa_key = os.environ.get('OPENAI_API_KEY') #get our OpenAI API key from our environment variable

if not oa_key:
    logger.error("No OpenAI API key found. Please set your OpenAI API key as an environment variable named OPENAI_API_KEY.")
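
Note that the check above only logs an error, so later cells would still run and then fail at the first API call. A minimal sketch of a fail-fast alternative, using only the standard library, is shown here. `require_env` is a hypothetical helper for illustration and is not part of `dap_taltech` or `llm_utils`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Add it to your .env file or export it in your shell."
        )
    return value
```

After `load_dotenv()`, this could be used as `oa_key = require_env("OPENAI_API_KEY")`.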

## Tour of LangChain 🦜🔗: Building a language model application

LangChain is a Python "framework for developing applications powered by language models." 

You can use it for tasks like Generative Question-Answering (GQA), analyzing structured data, and text summarisation. 

The core feature of the library is the ability to **"chain together"** different components to create advanced LLM use cases.   

NOTE: This section draws heavily on the existing online tutorials collected in [this repo](https://github.com/gkamradt/langchain-tutorials).

Core to langchain (and any language model application) are three components:

1. **Prompts:** The text input passed to the model
2. **Language models:** Make calls to language models through common interfaces
3. **Output parsers:** Extract information from model outputs

The example below does just that: it instantiates OpenAI's language model, defines a prompt, `text`, and makes a call to the language model to output a response.

In [3]:
from langchain.llms import OpenAI

llm = OpenAI(temperature=0.9) #instantiate our language model. The temperature parameter controls how "creative" the model is.

#make this innovation mapping related 
text = "What are 5 vacation destinations for someone who likes to eat pasta?" #define our prompt

print(llm(text)) #print the output



1. Rome, Italy 
2. Bologna, Italy 
3. Venice, Italy 
4. Florence, Italy 
5. Amalfi Coast, Italy
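
The model's response is a plain string, so turning it into something a program can use is the job of an output parser. Here is a minimal sketch of one, assuming the model returns one numbered item per line. `parse_numbered_list` is a hypothetical helper written for this illustration, not a langchain class, and `raw_output` is a stand-in for the string returned by `llm(text)`.

```python
import re


def parse_numbered_list(text: str) -> list[str]:
    """Extract the items of a numbered list such as '1. Rome, Italy' from raw model output."""
    items = []
    for line in text.splitlines():
        # Match an optional indent, a number, '.' or ')', then capture the item text.
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            items.append(match.group(1))
    return items


# Stand-in for the string returned by llm(text); copied from the output above.
raw_output = """

1. Rome, Italy
2. Bologna, Italy
3. Venice, Italy
4. Florence, Italy
5. Amalfi Coast, Italy
"""

destinations = parse_numbered_list(raw_output)
print(destinations)
```

Lines that do not look like numbered items are skipped, so the parser returns an empty list when the model ignores the requested format.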


Let's break down these three components in a bit more detail. 

## Prompts 

Prompts refer to the textual input into the model.

You can use langchain to create prompts, such as a simple hardcoded prompt built with the `PromptTemplate` class, or a prompt that passes worked examples to the model as part of `few-shot prompting`. 
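
To make few-shot prompting concrete, here is a minimal sketch in plain Python that assembles a prompt from a handful of worked examples. The examples and the `build_few_shot_prompt` helper are hypothetical and only illustrate the idea; langchain provides a `FewShotPromptTemplate` class for the same purpose.

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble an instruction, worked examples, and the new query into one prompt string."""
    blocks = [instruction]
    for example in examples:
        blocks.append(f"Input: {example['input']}\nOutput: {example['output']}")
    # Leave the final output blank so the model completes it.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical worked examples that show the model the expected input/output pattern.
examples = [
    {"input": "pasta", "output": "Italy"},
    {"input": "sushi", "output": "Japan"},
]

few_shot_prompt = build_few_shot_prompt(
    examples, "tacos", "Name the country best known for each dish."
)
print(few_shot_prompt)
```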

**TASK**: Can you re-write the `text` prompt to take as input variables:

1. The number of vacation destinations;
2. The meal type

Refer to the [prompt template documentation](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) for guidance. 
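
As a hint, a prompt template is essentially a string with named placeholders that are filled in at call time. The sketch below shows the idea with Python's built-in `str.format`, using a different, hypothetical pair of variables (`topic` and `audience`) so that the task itself is left to you. `PromptTemplate` follows the same pattern: by default its template string uses curly-brace placeholders, which are declared via `input_variables` and filled in with `.format(...)`.

```python
# A template is a string with named placeholders in curly braces.
template = "Explain {topic} in one paragraph for an audience of {audience}."

# Filling in the placeholders produces the final prompt text.
prompt_text = template.format(topic="innovation mapping", audience="policy makers")
print(prompt_text)
```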

In [9]:
from langchain import PromptTemplate

#Define your prompt template to take as input "number" and "meal" variables.
multiple_input_prompt = PromptTemplate(###) 
    
#format your prompt template with the variables you want to use

prompt = ###

Easy enough! What if you want to make use of data to inform your prompt? You can do so via a **feature store**. 

In [4]:
#feature store

In [5]:
#few shot learning example

In [6]:
## Chains

In [7]:
## Agents 

### Putting it all together

## Use case: LLMs in traditional NLP tasks 

To demonstrate how we can embed LLMs to help with traditional NLP tasks, we are going to work through the following: 

**Problem:** You are an innovation mapping researcher who has recently won a grant to investigate the Armenian labor market. You have been granted access to Armenian online job advert postings. You are interested in tracking the change in skill demand over time.

**Approach:** Use an LLM and langchain to:

1. Extract skill entities from the job advert text;
2. Parse the output to be json compatible;
3. Disambiguate the output by mapping the extracted skill to a knowledge base of skills. 

In more traditional methods, we might train our own NER model to extract skills, manually labelling hundreds of job adverts with the start and end indices of skills. This example embeds an LLM into the system instead.
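
Before working with the real data, the three steps of the approach can be sketched end to end with the standard library alone. Everything in this sketch is an assumption made for illustration: `fake_llm` returns a canned response in place of a real model call, `SKILL_KNOWLEDGE_BASE` is a made-up list of skills, and fuzzy string matching with `difflib` is just one simple way to do the mapping step.

```python
import difflib
import json

# Hypothetical knowledge base of canonical skill names (illustration only).
SKILL_KNOWLEDGE_BASE = [
    "python programming",
    "project management",
    "financial reporting",
    "customer service",
]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same canned JSON string."""
    return '{"skills": ["Python programing", "financial reports", "juggling"]}'


def extract_skills(job_advert: str) -> list[str]:
    """Steps 1 and 2: ask the model for skills and parse its JSON response."""
    prompt = (
        'Return the skills in this job advert as JSON of the form {"skills": [...]}.'
        f"\n\n{job_advert}"
    )
    response = fake_llm(prompt)
    try:
        return json.loads(response)["skills"]
    except (json.JSONDecodeError, KeyError):
        return []  # the model did not return the expected JSON


def map_to_knowledge_base(skill: str, cutoff: float = 0.6):
    """Step 3: map an extracted skill to its closest knowledge-base entry, or None."""
    matches = difflib.get_close_matches(
        skill.lower(), SKILL_KNOWLEDGE_BASE, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


extracted = extract_skills("We are looking for a developer ...")
mapped = {skill: map_to_knowledge_base(skill) for skill in extracted}
print(mapped)
```

Skills with no sufficiently similar entry map to `None`, which makes it easy to flag them for manual review rather than forcing a poor match.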

In [8]:
#First things first, let's load in our dataset

dg = DataGetter(local=False) #instantiate our data getter 

job_adverts = dg.get_armenian_job_adverts() #get our job adverts

2023-07-31 18:03:40,574 - TalTech HackWeek 2023 - INFO - Loading data from open dap-taltech s3 bucket. (206508154.py:46)


In [9]:
#To better understand the data, let's take a look at the dataset description by calling help on the data getter method

dg.get_armenian_job_adverts?

#Looks like there's a lot of rich information there. Let's take a look at a sample of the data itself.

job_adverts.head(5)

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


Signature: dg.get_armenian_job_adverts() -> pandas.core.frame.DataFrame
Docstring:
Get Armenian job adverts data posted from 2004 to 2015.

This data was sourced from the following website: https://www.kaggle.com/datasets/madhab/jobposts

The data includes information such as:
    - jobpost: the original job post
    - Title: the title of the job
    - Company: the company that posted the job
    - AnnouncementCode: the announcement code of the job
    - Term: the term of the job
    - Eligibility: the eligibility of the job
    - Audience: the audience of the job
    - StartDate: the start date of the job
    - Location: the location of the job

Returns:
    pd.DataFrame: A pandas dataframe containing Armenian job adverts data.
File:      /var/folders/cq/9gxjkt2j2g1c0cfhjb2qtz000000gn/T/ipykernel_47912/2

In [10]:
# While there's a lot of information in this dataset, the jobpost and date posted are key to analysis. Let's get a better sense of those columns:

print(f"there are {job_adverts.jobpost.isna().sum()} missing values in the jobpost column")

print(f"{round(job_adverts.jobpost.nunique()/len(job_adverts)*100, 2)}% of job posts in the jobpost are unique")

print(f"there are {job_adverts.date.isna().sum()} missing values in the date column")


#Great! There are no missing values in either column and almost all job postings are unique. 
# Let's take a look at an example:
print(' ')
print("-----------------------------------here is an example of job advert text:-----------------------------------")
print(' ')
print(job_adverts.jobpost.sample(1).values)
print(' ')
print("-----------------------------------here is an example of a job date:-----------------------------------")
print(' ')
print(job_adverts.date.sample(1).values)

there are 0 missing values in the jobpost column
99.43% of job posts in the jobpost are unique
there are 0 missing values in the date column
 
-----------------------------------here is an example of job advert text:-----------------------------------
 
['ArmenTel CJSC\r\nTITLE:  Head of Large Business Division\r\nLOCATION:  Yerevan, Armenia\r\nJOB DESCRIPTION:  N/A\r\nJOB RESPONSIBILITIES:\r\n- Realize control over service provision for corporate clients of the\r\nCompany in accordance with acting procedures, instructions and schemes;\r\n- Organize and control the realization of plans according to the approved\r\nkey performance indicators;\r\n- Organize the process of proactive search and attraction of potential\r\ncorporate clients;\r\n- Provide qualitative service and offer more profitable tariffs and\r\nservices for increasing the loyalty of corporate clients;\r\n- Provide profit increase and outflow minimization from the cooperation\r\nwith key clients in the frame of his/ her re

In [11]:
#Looks like there are some funky characters we will want to clean up in the job advert text.
#Also, the date column is not in a datetime format. Let's clean that up as well. 

# Let's do so by calling the clean_job_advert and clean_date functions we defined in the llm_utils script.
# We will also drop duplicate jobpost values and create a job_id column using the index.

job_adverts = (job_adverts
                    #drop rows with a missing jobpost or date
                    .dropna(subset=['jobpost', 'date'])
                    #drop duplicate jobpost values
                    .drop_duplicates(subset=['jobpost'])
                    #clean jobpost column
                    .assign(clean_jobpost=lambda x: x.jobpost.apply(lu.clean_job_advert))
                    #create jobid column
                    .assign(job_id=lambda x: x.index)
                    #clean date column
                    .assign(clean_date=lambda x: x.date.apply(lu.clean_date)))

# Finally, let's create a job_adverts_dict that we can use to map job_ids to clean job adverts.
job_adverts_dict = job_adverts.set_index('job_id').clean_jobpost.to_dict()               
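
The `clean_job_advert` and `clean_date` functions live in the `llm_utils` script and are not shown in this notebook. As a rough idea of what such helpers might do, here is a hypothetical sketch; the function names, the whitespace handling, and the `'Jan 5, 2004'` date format are assumptions based on the sample rows above, not the actual implementations. Parsing the month abbreviation with `%b` also assumes an English locale.

```python
import re
from datetime import datetime


def clean_job_advert_sketch(text: str) -> str:
    """Replace line-break sequences with spaces and collapse repeated whitespace."""
    text = re.sub(r"[\r\n]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_date_sketch(date_string: str) -> datetime:
    """Parse dates written like 'Jan 5, 2004' into datetime objects."""
    return datetime.strptime(date_string.strip(), "%b %d, %Y")


sample_text = "ArmenTel CJSC\r\nTITLE:  Head of Large Business Division\r\nLOCATION:  Yerevan, Armenia"
print(clean_job_advert_sketch(sample_text))
print(clean_date_sketch("Jan 5, 2004"))
```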

Great! Now that we've inspected our data and cleaned up the job text slightly, let's use langchain to:

    1) Extrac