# Install and load the libraries.

In [1]:
!pip install -q chromadb==0.4.22
!pip install -q langchain==0.1.4
!pip install -q sentence_transformers==2.3.0
!pip install -q accelerate==0.26.1

# Load the Dataset

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Download and unzip the dataset from Kaggle:
#https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset?resource=download

#Pass the directory where the .csv file is stored to read_csv
resumes = pd.read_csv('../data/UpdatedResumeDataSet.csv', sep=',')

#Using plain text resumes for simplicity
DOCUMENT="Resume" 
CATEGORY="Category"
ID="ID"

In [4]:
resumes.head(2)

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...


In [5]:
#Show distinct resume categories
categories = resumes[CATEGORY].unique()
print(f"Categories: {categories}")

Categories: ['Data Science' 'HR' 'Advocate' 'Arts' 'Web Designing'
 'Mechanical Engineer' 'Sales' 'Health and fitness' 'Civil Engineer'
 'Java Developer' 'Business Analyst' 'SAP Developer' 'Automation Testing'
 'Electrical Engineering' 'Operations Manager' 'Python Developer'
 'DevOps Engineer' 'Network Security Engineer' 'PMO' 'Database' 'Hadoop'
 'ETL Developer' 'DotNet Developer' 'Blockchain' 'Testing']


In [6]:
# Because this is just an example, we select a small subset of the resumes.
# Per the 3/11/25 meeting, we'll focus on tech resumes.
# A separate name is used so the CATEGORY column constant is not overwritten.
SELECTED_CATEGORY = "Data Science"
subset_resumes = resumes[resumes[CATEGORY] == SELECTED_CATEGORY].reset_index(drop=True)

In [7]:
subset_resumes.head(2)

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...


In [8]:
subset_resumes.shape

(40, 2)

## CREATE THE DOCUMENT FROM THE DATAFRAME

In [9]:
from langchain.document_loaders import DataFrameLoader
from langchain.vectorstores import Chroma


First, we create the loader. It takes the DataFrame as the data source and the name of the column that holds the document text, that is, the information we want the model to take into account in its responses.

In [10]:
df_loader = DataFrameLoader(subset_resumes, page_content_column=DOCUMENT)

Then we use the loader to load the documents, one per row of the DataFrame.

In [11]:
df_document = df_loader.load()
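
As a rough illustration of what the loader does with each row, the sketch below uses plain Python instead of LangChain. `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins, not LangChain classes. The assumption is that the chosen column becomes the document text and every other column ends up as metadata.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into a document: one column is the text, the rest is metadata."""
    documents = []
    for row in rows:
        metadata = dict(row)                       # copy so the input row is not mutated
        text = metadata.pop(page_content_column)   # the chosen column becomes page_content
        documents.append(SimpleDocument(page_content=text, metadata=metadata))
    return documents


# Two toy rows shaped like the resume DataFrame (Category, Resume)
rows = [
    {"Category": "Data Science", "Resume": "Skills * Programming Languages: Python"},
    {"Category": "Data Science", "Resume": "Education Details B.E"},
]
docs = rows_to_documents(rows, page_content_column="Resume")
print(docs[0].page_content)
print(docs[0].metadata)
```

Keeping the remaining columns as metadata means the category of each resume stays attached to its text after loading.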

In [12]:
display(df_document[:2])

[Document(page_content='Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, NaÃ¯ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details \r\n\r\nData Science Assurance Associate \r\n\r\nData Science Assurance Associate - Ernst & Young LLP\r\nSkill Details \r\nJAVASCRIPT- Exprience - 24 months\r\njQuery- Exprience - 24 months\r\nPython- Exprience - 24 monthsCompany Details \r\ncompany - Ernst & Young LLP\r\ndescripti
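
The loaded text contains mis-decoded characters such as `NaÃ¯ve` and Windows-style `\r\n` line endings. If cleaner text is wanted before embedding, one possible approach is sketched below. `clean_resume_text` is a hypothetical helper, and it assumes the damage comes from UTF-8 bytes that were read as Latin-1; text damaged in some other way is left unchanged.

```python
def clean_resume_text(text):
    """Best-effort cleanup: undo UTF-8-read-as-Latin-1 mojibake and normalize line endings."""
    try:
        # Re-encode with the wrong codec, then decode with the right one
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not damaged in this particular way; leave it unchanged
        repaired = text
    return repaired.replace("\r\n", "\n")


sample = "Machine learning: Regression, SVM, NaÃ¯ve Bayes\r\nSkill Details"
print(clean_resume_text(sample))
```

In this notebook the helper could be applied to the resume column before the loader is created, for example with `subset_resumes[DOCUMENT].map(clean_resume_text)`.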

# Creating the embeddings

We split the data into manageable chunks to store as vectors using **RecursiveCharacterTextSplitter**. There is no single correct chunk size. Smaller chunks mean more vectors in the vector store and finer-grained retrieval; larger chunks keep more of each resume together, so each retrieved passage carries more context but also takes up more of the prompt.

In [13]:
from langchain.text_splitter import CharacterTextSplitter

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,  # Large enough to keep resume context
    chunk_overlap=400  # Allows minor overlap for coherence
)

texts = text_splitter.split_documents(df_document)
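
To make the chunk-size trade-off concrete, here is a simplified sliding-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as paragraph breaks). `split_with_overlap` is a hypothetical helper that only illustrates how `chunk_size` and `chunk_overlap` change the number of chunks.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap      # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                          # this window already reached the end
    return chunks


resume = "x" * 7000                        # stand-in for a 7,000-character resume

large = split_with_overlap(resume, chunk_size=3000, chunk_overlap=400)
small = split_with_overlap(resume, chunk_size=1000, chunk_overlap=100)
print(len(large), len(small))
```

With these toy numbers the larger setting yields 3 chunks and the smaller one 8, so the smaller setting stores more vectors for the same resume.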

In [15]:
display(texts[:2])

[Document(page_content='Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, NaÃ¯ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details \r\n\r\nData Science Assurance Associate \r\n\r\nData Science Assurance Associate - Ernst & Young LLP\r\nSkill Details \r\nJAVASCRIPT- Exprience - 24 months\r\njQuery- Exprience - 24 months\r\nPython- Exprience - 24 monthsCompany Details \r\ncompany - Ernst & Young LLP\r\ndescripti

We load a pre-trained sentence-embedding model from Hugging Face to turn each chunk into an embedding vector.


In [16]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")



# Creating the Index With Chroma
Here we create the index of embeddings from the chunked documents and the embedding function created above.

In [17]:
directory_cdb = './chromadb'
chroma_db = Chroma.from_documents(
    texts, # using the entire df_document is too slow
    embedding_function, 
    persist_directory=directory_cdb
)
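
Conceptually, the index stores one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below shows that idea in plain Python with made-up 3-dimensional vectors. Real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and Chroma uses its own approximate nearest-neighbour index and distance settings rather than this brute-force loop. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=2):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy index: (chunk text, made-up embedding)
index = [
    ("Python, pandas, scikit-learn", [0.9, 0.1, 0.0]),
    ("Fraud investigations and dispute services", [0.1, 0.9, 0.2]),
    ("Regression, SVM, Random Forest", [0.8, 0.2, 0.1]),
]

query = [1.0, 0.0, 0.0]   # made-up embedding of a skills-related query
for text, score in top_k(query, index, k=2):
    print(f"{score:.3f}  {text}")
```

The two skills-related chunks rank above the unrelated one because their vectors point in nearly the same direction as the query.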

## LANGCHAIN

### Using the new LCEL Architecture from LangChain.
LangChain recommends using LCEL (LangChain Expression Language) over Chains. 
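
The core idea of LCEL is that steps are composed with the `|` operator, so the output of one step becomes the input of the next. The toy class below imitates that behaviour in plain Python. `Step` is a hypothetical stand-in and not part of LangChain, whose runnables offer much more (batching, streaming, async).

```python
class Step:
    """Hypothetical stand-in for an LCEL runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # step_a | step_b -> a new step that runs step_a, then feeds its result to step_b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


retrieve = Step(lambda question: ["Python, SQL", "Regression, SVM"])   # fake retriever
join_docs = Step(lambda docs: "\n\n".join(docs))
build_prompt = Step(lambda context: f"Context:\n{context}\n\nAnswer:")

toy_chain = retrieve | join_docs | build_prompt
print(toy_chain.invoke("Data Science Skills:"))
```

Reading the chain left to right gives the order in which the steps run.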

In [18]:
from langchain.llms import HuggingFacePipeline
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

Now we create the retriever object, which is responsible for returning the relevant data stored in the Chroma database.

In [19]:
retriever = chroma_db.as_retriever()

In [20]:
!pip install huggingface_hub==0.23.0


Collecting huggingface_hub==0.23.0
  Using cached huggingface_hub-0.23.0-py3-none-any.whl.metadata (12 kB)
Using cached huggingface_hub-0.23.0-py3-none-any.whl (401 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.30.1
    Uninstalling huggingface-hub-0.30.1:
      Successfully uninstalled huggingface-hub-0.30.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformers 4.51.0 requires huggingface-hub<1.0,>=0.30.0, but you have huggingface-hub 0.23.0 which is incompatible.
Successfully installed huggingface_hub-0.23.0


In [21]:
from getpass import getpass
hf_key = getpass("Hugging Face Key: ")

In [22]:
!huggingface-cli login --token $hf_key


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/ebengunadi/.cache/huggingface/token
Login successful


## Importing LangChain Libraries.

In [23]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

from langchain.llms import HuggingFacePipeline
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [24]:
import torch
from torch import cuda

In [27]:
# Pick the best available device: 'mps' on Apple Silicon Macs, else CUDA, else CPU
if torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device(f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu')

In [28]:
device

device(type='mps')

## Load the Model

In [29]:
#You can try with any llama model, but you will need more GPU and memory as you
#increase the size of the model.
model_id = "meta-llama/Llama-3.2-3B-Instruct" 

In [30]:
%%time

# begin initializing HF items, need auth token for these
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_key
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    device_map='auto',
    token=hf_key
)
model.eval()
print(f"Model loaded on {device}")


Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00,  6.37s/it]


Model loaded on mps
CPU times: user 5 s, sys: 5.43 s, total: 10.4 s
Wall time: 13.3 s


In [31]:
tokenizer = AutoTokenizer.from_pretrained(model_id,
                                          token=hf_key)


In [32]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    # Greedy decoding for repeatable output; temperature and top_p only matter when sampling
    do_sample=False,
    eos_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
    return_full_text=True,
    device_map='auto'
)

hf_llm = HuggingFacePipeline(pipeline=pipe)

Device set to use mps


## Testing

Query the vector database for relevant documents.

In [46]:
retrieved_docs = retriever.get_relevant_documents("Data Science Skills:")
for i, doc in enumerate(retrieved_docs):
    print(f"Doc {i+1}:\n{doc.page_content}\n")

Doc 1:
Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, NaÃ¯ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details 

Data Science Assurance Associate 

Data Science Assurance Associate - Ernst & Young LLP
Skill Details 
JAVASCRIPT- Exprience - 24 months
jQuery- Exprience - 24 months
Python- Exprience - 24 monthsCompany Details 
company - Ernst & Young LLP
description - Fraud Investigations and Dispute Services 

For comparison, have the LLM answer the same prompt without RAG.

In [47]:
%%time
response = hf_llm.invoke("Data Science Skills:")
print(response)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


 A Comprehensive Guide
Data science is a multidisciplinary field that combines statistics, computer science, and domain-specific knowledge to extract insights from data. To become proficient in data science, you need to develop a range of skills across various domains. Here's a comprehensive guide to help you get started:

**Foundational Skills**

1. **Programming**: Python, R, SQL, and Julia are popular programming languages used in data science.
2. **Statistics**: Understand probability, regression analysis, hypothesis testing, and statistical modeling.
3. **Mathematics**: Familiarize yourself with linear algebra, calculus, and optimization techniques.

**Data Analysis and
CPU times: user 7.74 s, sys: 22.3 s, total: 30.1 s
Wall time: 49 s


In [48]:
%%time

# Function to clean and format retrieved documents
def format_docs(docs):
    return "\n\n".join([doc.page_content.strip() for doc in docs])

# Build the LCEL chain: retrieve context, fill the prompt, call the LLM, parse the output
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}  # Apply formatting
    | ChatPromptTemplate.from_template(
        "Answer the question based on the following context:\n{context}\n\nQuestion: {question}"
    )
    | hf_llm
    | StrOutputParser()
)

response = chain.invoke("Data Science Skills:")
print(response)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


 What is your experience with machine learning algorithms? 

Answer: I have extensive experience with various machine learning algorithms including regression, SVM, Naive Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language Processing, Dimensionality Reduction, Topic Modeling (LDA, NMF), PCA, and Neural Networks.
CPU times: user 9.42 s, sys: 16.8 s, total: 26.2 s
Wall time: 57.3 s
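
To see roughly what the model receives, the sketch below rebuilds the prompt by hand in plain Python. `FakeDoc` and the two sample passages are made up for illustration; only `format_docs` and the template text mirror the chain cell above. The real chain may also add a role prefix such as `Human:` when the chat prompt is converted to a string, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved LangChain Document."""
    page_content: str


def format_docs(docs):
    """Join the retrieved passages into one context string, as in the chain above."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)


TEMPLATE = (
    "Answer the question based on the following context:\n{context}\n\n"
    "Question: {question}"
)

retrieved = [
    FakeDoc("  Skills * Programming Languages: Python, Sql, Java  "),
    FakeDoc("Machine learning: Regression, SVM, KNN\n"),
]
question = "Data Science Skills:"

prompt = TEMPLATE.format(context=format_docs(retrieved), question=question)
print(prompt)
```

Printing the prompt this way may help explain the RAG output above: the prompt ends with `Question: Data Science Skills:`, which is not phrased as a question and gives no cue for where the answer should start, so the model may simply continue the text.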
