### LLM Workshop -- Data Analysis using Large Language Models
Welcome to our LLM Workshop!
We hope that today you will learn some practical skills for applying Large Language Models to data analysis.
Enjoy and don't hesitate to ask questions!

### Prerequisites
Before we start, let's make sure that we have everything installed on our machines:

- Git;
- Pyenv;
- Our libraries (dependencies);
- JupyterLab;
- Ollama.
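
To confirm the command-line tools from the list above are reachable, one option is to look them up on the `PATH`. The sketch below uses only the standard library. The executable names (`git`, `pyenv`, `jupyter`, `ollama`) are assumptions about a typical installation and may differ on your system, and the check does not cover the Python libraries themselves.

```python
import shutil

# Executable names are assumptions about a typical setup; adjust as needed.
REQUIRED_TOOLS = ["git", "pyenv", "jupyter", "ollama"]


def find_tools(tools):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_tools(REQUIRED_TOOLS).items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` is either not installed or not visible to the shell that launched Python.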

### Importing Stuff
Now that everything is installed, we can start by importing the libraries we will use:

In [1]:
import re
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from langchain_ollama import OllamaLLM
from pprint import pprint

### Connect and Test Model
Let's test if we can connect to the model we pulled:

In [2]:
llm = OllamaLLM(model="llama3.2")
# answer = llm.invoke("An LLM Workshop is ...")
# pprint(answer)

In [3]:
# Exercises:
# 1. Download and connect another model (i.e. ollama pull, ollama run)
# 2. Connect to it so you can compare the results in later exercises
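
As a starting point for the comparison exercise, here is a minimal sketch of a helper that sends the same prompt to several models. The `compare_models` function and the `EchoModel` stand-in are hypothetical names introduced for illustration. The only thing assumed about a model object is the `invoke(prompt)` method already used above.

```python
def compare_models(models, prompt):
    """Send the same prompt to every model and collect the answers by label.

    `models` maps a label to any object with an `invoke(prompt)` method,
    such as the `OllamaLLM` instance created earlier.
    """
    return {name: model.invoke(prompt) for name, model in models.items()}


class EchoModel:
    """Hypothetical stand-in for a real model so the sketch runs offline."""

    def __init__(self, prefix):
        self.prefix = prefix

    def invoke(self, prompt):
        return f"{self.prefix}: {prompt}"


answers = compare_models(
    {"model-a": EchoModel("A"), "model-b": EchoModel("B")},
    "An LLM Workshop is ...",
)
for name, answer in answers.items():
    print(f"{name} -> {answer}")
```

In the notebook, the dictionary would instead hold `llm` and a second `OllamaLLM` instance for whichever model you pulled.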

### Load Data
After testing the model, it's time to load the data:

In [4]:
df = pd.read_json("db.json")
# df.head()

In [5]:
# Exercises:
# 1. Print the number of job postings in the database
# 2. Print any other statistical metrics that would be relevant here

### Categorize Positions
With the model working and the data loaded, let's try to categorize some job positions!

In [6]:
categories = ["Full-Stack Developer", "Front-End Developer", "Back-End Developer", "Dev Ops", "Data Engineer", "Other"]
prompt = """
    Please add an appropriate category to the following job descriptions.
    Respond with only the list of position descriptions and categories, without additional explanations.
    For example: Senior DevOps Engineer - Dev Ops, Full stack web разработчик (django rest framework и react js) - Full-Stack Developer, Senior / lead developer linux, python, c++, engleza - Back-End Developer etc. 
    Choose from the following categories: {}.
    Job Descriptions: {}.
"""
category_pattern = r"(.+) - (.+)"

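def categorize_positions(llm, position_descriptions):
    """Ask the model to categorize the descriptions; return (description, category) pairs."""
    llm_response = llm.invoke(prompt.format(", ".join(categories), ", ".join(position_descriptions)))
    function_response = []

    # Keep only the response lines shaped like "<description> - <category>".
    for line in llm_response.split("\n"):
        match = re.search(category_pattern, line)
        if match:
            position_description, category = match.groups()
            # Strip stray whitespace so identical categories are counted together later.
            function_response.append((position_description.strip(), category.strip()))

    return function_response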

# categorize_positions(llm, ["Lead Python Engineer", "Python Developer", "Python-разработчик для разработки систем"])
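
The parsing step can be tried without a running model. The sketch below applies the same `category_pattern` to an invented sample response. The `parse_categories` helper and the sample text are illustrative assumptions, and real model output will vary.

```python
import re

category_pattern = r"(.+) - (.+)"

# Invented sample of what a model might return; real responses will vary.
sample_response = """Here are the categories:
Lead Python Engineer - Back-End Developer
Front-end developer - React - Front-End Developer
Python Developer: Back-End Developer
"""


def parse_categories(response):
    """Extract (description, category) pairs from lines shaped like 'X - Y'."""
    pairs = []
    for line in response.split("\n"):
        match = re.search(category_pattern, line)
        if match:
            pairs.append(match.groups())
    return pairs


parsed = parse_categories(sample_response)
for description, category in parsed:
    print(f"{description!r} -> {category!r}")
```

Two behaviors are worth noticing. The first group is greedy, so a description that itself contains ` - ` is split at the last separator and stays intact. Lines in any other shape, such as the one using a colon, are dropped silently, so a batch can return fewer pairs than the number of descriptions sent.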

In [7]:
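category_df = pd.DataFrame({"Description": [], "Category": []})
batch_size = 10

# Ceiling division: one extra chunk when the rows do not divide evenly by batch_size.
num_chunks = len(df) // batch_size + (len(df) % batch_size > 0)
# Split a NumPy array rather than the Series itself, which sidesteps the
# deprecation warning that splitting a Series directly can trigger.
batches = np.array_split(df["position_title"].to_numpy(), num_chunks)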

for i, batch in enumerate(batches, 1):
    print(f"Batch {i}..")
    new_categories = categorize_positions(llm, batch.tolist())
    new_categories_df = pd.DataFrame(new_categories, columns=["Description", "Category"])
    category_df = pd.concat([category_df, new_categories_df], ignore_index=True)

# category_df

Batch 1..
Batch 2..
Batch 3..
Batch 4..
Batch 5..
Batch 6..
Batch 7..
Batch 8..
Batch 9..
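
The batch count used above is a ceiling division written with integer arithmetic. The sketch below checks that against `math.ceil` and shows a plain-Python alternative for building the batches. The `make_batches` helper and the 85 placeholder titles are hypothetical and are not the workshop data.

```python
import math


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical stand-in data: 85 placeholder titles.
titles = [f"title {i}" for i in range(85)]
batch_size = 10

# Same expression as in the notebook cell: adds one chunk for a partial batch.
num_chunks = len(titles) // batch_size + (len(titles) % batch_size > 0)
batches = make_batches(titles, batch_size)

print(num_chunks, math.ceil(len(titles) / batch_size))
print([len(batch) for batch in batches])
```

Note the difference in chunk sizes: slicing yields full batches followed by one shorter batch, whereas `np.array_split` spreads the items so that chunk sizes differ by at most one. Either way, no batch exceeds `batch_size`.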


### Visualize Results
Finally, we can visualize our results:

In [8]:
def visualize_results(df, column_name):
    counts = df[column_name].value_counts()
    counts.plot(kind='bar', color='skyblue', figsize=(10, 6))
    plt.xlabel(column_name, fontsize=12)
    plt.ylabel("Number of Positions", fontsize=12)
    plt.title(f"Number of Job Positions by {column_name}", fontsize=14)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

# visualize_results(category_df, "Category")

In [9]:
# Exercises:
# 1. Visualize other data columns available in the database (like position_experience or position_posted)
# 2. Visualize any other statistical metrics that can be relevant for this data

In [10]:
# Exercises:
# 1. How can we handle the extra job categories the model returns that are not in our list?
# 2. How can we tune the model to be more focused?
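
For the first exercise, one possible approach (a sketch, not the only answer) is to map whatever the model returns back onto the allowed list and send everything else to `"Other"`. The `normalize_category` helper below is a hypothetical name. It compares labels after removing case, spaces, and punctuation.

```python
categories = ["Full-Stack Developer", "Front-End Developer", "Back-End Developer",
              "Dev Ops", "Data Engineer", "Other"]


def _key(label):
    """Reduce a label to lowercase letters and digits for loose comparison."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


# Lookup from the simplified form to the canonical spelling.
_canonical = {_key(category): category for category in categories}


def normalize_category(label, fallback="Other"):
    """Return the matching allowed category, or `fallback` if none matches."""
    return _canonical.get(_key(label), fallback)


print(normalize_category("DevOps"))                     # Dev Ops
print(normalize_category("back-end developer."))        # Back-End Developer
print(normalize_category("Machine Learning Engineer"))  # Other
```

Applied to the results, this could look like `category_df["Category"] = category_df["Category"].map(normalize_category)` before plotting. Exact matching after simplification is deliberately strict; fuzzier matching would catch more variants at the risk of wrong assignments.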

### References
Congratulations on making it this far!
For those interested, we've left a couple of references you might find useful in your own projects.
Good luck!

1. [Pyenv GitHub Repo](https://github.com/pyenv/pyenv);
2. [Install JupyterLab](https://jupyter.org/install);
3. [Download Ollama](https://ollama.com/download);
4. [Docs for the OllamaLLM Class](https://python.langchain.com/api_reference/ollama/llms/langchain_ollama.llms.OllamaLLM.html).