# Let's Create a Chatbot with RAG!

## Define your Problem Statement


1. Find an open position at a potential client
2. Extract the details from the posting that make a good starting point for a pitch
3. Go through all of your company's projects and find the relevant ones
4. Collect the details of those projects that are worth sharing
5. Write a compelling cold email about how your firm can be of service to the potential client
6. Repeat for ALL potential clients


In [None]:
# Install the community integrations (used for web scraping in section 2)
# !pipenv install langchain_community

## 1. Chat API Inference


In [1]:
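from langchain_groq import ChatGroq

# ChatGroq reads the API key from the GROQ_API_KEY environment variable
llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0,      # keep the output as repeatable as possible
    max_tokens=None,    # no explicit cap on the response length
    timeout=None,       # no request timeout
    max_retries=2,      # retry failed requests up to twice
    # other params...
)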

response = llm.invoke("Who is the greatest footballer of all time?")
print(response.content)

The debate about who is the greatest footballer of all time is ongoing and often subjective, with opinions varying depending on personal taste, cultural context, and generational differences. However, based on various polls, awards, and expert opinions, the top contenders for this title are often narrowed down to a few exceptional players. Here are some of the most commonly cited candidates:

1. **Lionel Messi**: Regarded by many as the greatest of all time, Messi has won a record-breaking seven Ballon d'Or awards, ten La Liga titles, and four UEFA Champions League titles. His incredible dribbling skills, goal-scoring ability, and vision on the pitch have made him a legend in the sport.
2. **Cristiano Ronaldo**: A five-time Ballon d'Or winner, Ronaldo has consistently dominated the sport, winning numerous titles with Manchester United, Real Madrid, and Juventus. His athleticism, skill, and dedication have made him one of the most successful players in history.
3. **Diego Maradona**: A 

In [2]:
response = llm.invoke("Who is the greatest footballer of all time?, no preamble")
print(response.content)

Lionel Messi.


## 2. Web Scraping

In [3]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://jobs.apple.com/en-us/details/200583355/aiml-senior-data-science-manager-aiml-data?team=MLAI")
page_data = loader.load().pop().page_content

USER_AGENT environment variable not set, consider setting it to identify your requests.
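
The warning above asks for a `USER_AGENT` environment variable. Here is a minimal sketch of one way to silence it, assuming the variable is read when the loader module is imported, so it has to be set first. The agent string is a made-up placeholder, not something from this notebook.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault keeps any value that is already exported in the shell.
os.environ.setdefault("USER_AGENT", "cold-email-generator/0.1")

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader
print(os.environ["USER_AGENT"])
```

In a notebook, this cell would need to run before the cell that imports `WebBaseLoader`, and the kernel may need a restart if the import already happened.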


In [4]:
page_data

'\n\n\n\n\n\n\n\n\nAIML - Senior Data Science Manager, AIML Data - Careers at Apple\n\n\n\n\n\n\n\nAppleStoreMaciPadiPhoneWatchVisionAirPodsTV & HomeEntertainmentAccessoriesSupport\n\n\n0+\nCareers at AppleOpen MenuClose Menu\n\n      Work at Apple\n    \n \n\n      Life at Apple\n    \n \n\n      Profile\n    \n \n\n      Sign In\n    \n \nSearch\nJobs at Apple\nAIML - Senior Data Science Manager, AIML DataCupertino, California, United StatesMachine Learning and AIAdd to Favorites AIML - Senior Data Science Manager, AIML DataRemoved from favoritesAdd a favoriteCloseTo view your favorites, sign in with your Apple Account.Sign InDon’t have an Apple Account?Create one nowForgot your Apple Account or password?Submit ResumeAIML - Senior Data Science Manager, AIML DataBack to search resultsSummaryPosted: Jan 2, 2025Role Number:200583355Do you get excited by driving product impact via measurement and evaluation, for products and services used by hundreds of millions of people globally? The v
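
The scraped text is dominated by runs of newlines and indentation left over from the page layout, and all of it is sent to the model as tokens. The sketch below shows one possible cleanup step; `collapse_whitespace` is a hypothetical helper, not part of LangChain, and the sample string only imitates the output above.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Squeeze blank lines and repeated spaces left over from HTML extraction."""
    # Trim each line, then drop the ones that end up empty
    lines = (line.strip() for line in text.splitlines())
    non_empty = (line for line in lines if line)
    # Collapse repeated spaces and tabs inside each remaining line
    return "\n".join(re.sub(r"[ \t]+", " ", line) for line in non_empty)

sample = "\n\n\n      Work at Apple\n    \n \n\n      Life  at Apple\n"
print(collapse_whitespace(sample))
```

Applied here, it would be called as `page_data = collapse_whitespace(page_data)` before the text is passed to the extraction prompt.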

In [5]:
from langchain_core.prompts import PromptTemplate

prompt_extract = PromptTemplate.from_template(
    """
    I will give you a scraped text from a job posting.
    Your job is to extract the job details & requirements in a JSON format containing the following keys: 'role', 'experience', 'skills' and 'description'.
    Only return valid JSON. No preamble, please.
    Here is the scraped text: {page_data}
    """
)

In [6]:
# Extract the relevant information
chain_extract = prompt_extract | llm
response = chain_extract.invoke(input={'page_data': page_data})
print(response.content)
print(type(response.content))


```
{
  "role": "Senior Data Science Manager, AIML Data",
  "experience": "10+ years of relevant work experience, 6-8 years of experience managing a team of data scientists or related roles",
  "skills": [
    "Data science",
    "Machine learning",
    "Analytics",
    "Statistical analysis",
    "Data quality evaluation",
    "Prompt engineering",
    "Fine-tuning models",
    "Technical execution",
    "Building and deploying models",
    "Developing pipelines",
    "Debugging complex data processes"
  ],
  "description": "Drive product impact via measurement and evaluation, improve product quality and guide feature development with data. Partner with machine learning and product engineering teams to deliver amazing search experiences across Apple products."
}
```
<class 'str'>


In [7]:
# Parse the JSON string returned by the model into a Python dict
from langchain_core.output_parsers import JsonOutputParser

json_parser = JsonOutputParser()
json_response = json_parser.parse(response.content)

print(json_response)
print(type(json_response))

{'role': 'Senior Data Science Manager, AIML Data', 'experience': '10+ years of relevant work experience, 6-8 years of experience managing a team of data scientists or related roles', 'skills': ['Data science', 'Machine learning', 'Analytics', 'Statistical analysis', 'Data quality evaluation', 'Prompt engineering', 'Fine-tuning models', 'Technical execution', 'Building and deploying models', 'Developing pipelines', 'Debugging complex data processes'], 'description': 'Drive product impact via measurement and evaluation, improve product quality and guide feature development with data. Partner with machine learning and product engineering teams to deliver amazing search experiences across Apple products.'}
<class 'dict'>
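
The model wrapped its JSON in a Markdown code fence, so a plain `json.loads` on the raw string would fail. The sketch below shows, with the standard library only, the kind of unwrapping a parser has to do, plus a check that the requested keys came back. It is an illustration under those assumptions, not LangChain's actual implementation; `parse_fenced_json` and the shortened sample are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example
FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

def parse_fenced_json(text: str) -> dict:
    """Parse JSON that may be wrapped in a Markdown code fence."""
    match = FENCED.search(text)
    # Fall back to the raw text when no fence is present
    payload = match.group(1) if match else text
    return json.loads(payload)

raw = FENCE + '\n{"role": "Senior Data Science Manager", "skills": ["Data science"]}\n' + FENCE
parsed = parse_fenced_json(raw)

# Report any key the extraction prompt asked for that the model left out
expected_keys = {"role", "experience", "skills", "description"}
missing = expected_keys - set(parsed)
print(sorted(missing))
```

A check like `missing` is useful before the later cells index into `json_response['skills']` and `json_response['description']`.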


## 3. Store Relevant Info in Vector DB

In [8]:
import csv

def read_csv_file(file_path):
    """Return a list of (skills, project_link) tuples, one per CSV row."""
    data = []
    # newline='' is what the csv module recommends when opening files
    with open(file_path, 'r', newline='') as file:
        csv_reader = csv.reader(file)
        # Skip the header row
        next(csv_reader)
        for row in csv_reader:
            # Every column except the last is a skill; the last is the project link
            skills = tuple(row[:-1])
            project_link = row[-1]
            data.append((skills, project_link))
    return data

# Example usage:
file_path = '../sample_portfolio.csv'
data = read_csv_file(file_path)

for skills, project_link in data:
    print(skills, project_link)

('Python', ' SQL', ' Pandas')  https://github.com/user/project1
('SQL', ' Python', ' Airflow')  https://github.com/user/project2
('PySpark', ' Spark SQL', ' Delta Lake')  https://github.com/user/project3
('Machine Learning', ' Deep Learning', ' TensorFlow')  https://github.com/user/project4
('Data Engineering', ' ETL', ' ELT')  https://github.com/user/project5
('Cloud Platforms (AWS', ' GCP', ' Azure)')  https://github.com/user/project6
('Data Warehousing', ' Data Modeling', ' DBT')  https://github.com/user/project7
('Data Visualization', ' Power BI', ' Tableau')  https://github.com/user/project8
('MLOps', ' MLflow', ' Kubeflow')  https://github.com/user/project9
('Natural Language Processing (NLP)', ' NLTK', ' spaCy')  https://github.com/user/project10
('Computer Vision', ' OpenCV', ' TensorFlow')  https://github.com/user/project11
('Time Series Analysis', ' Forecasting', ' Prophet')  https://github.com/user/project12
('Data Cleaning', ' Data Wrangling', ' Pandas')  https://github.com
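
In the output above, every field after the first keeps a leading space (`' SQL'`, `' https://...'`) because the file has a space after each comma. The sketch below shows one way to strip it while reading, using the `skipinitialspace` option of `csv.reader`. The two-row sample and its header names are assumptions made for illustration; `read_portfolio` is a hypothetical variant of `read_csv_file`.

```python
import csv
import io

# Hypothetical sample in the same shape as sample_portfolio.csv;
# the header names are an assumption
sample_csv = (
    "skill_1, skill_2, skill_3, link\n"
    "Python, SQL, Pandas, https://github.com/user/project1\n"
    "SQL, Python, Airflow, https://github.com/user/project2\n"
)

def read_portfolio(file_obj):
    """Return (skills, project_link) pairs with surrounding whitespace removed."""
    reader = csv.reader(file_obj, skipinitialspace=True)
    next(reader)  # skip the header row
    rows = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        *skills, project_link = (field.strip() for field in row)
        rows.append((tuple(skills), project_link))
    return rows

for skills, link in read_portfolio(io.StringIO(sample_csv)):
    print(skills, link)
```

Stripping at read time means the URLs stored as metadata later carry no stray spaces.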

In [9]:
# Insert data into vector database

import uuid
import chromadb

# Persist the collection to the local 'vectorstore' directory
client = chromadb.PersistentClient('vectorstore')
collection = client.get_or_create_collection(name='portfolio_links')

# Populate the collection only when it is empty, so re-running the cell adds no duplicates
if not collection.count():
    for skills, project_link in data:
        collection.add(
            documents=[str(skills)],
            metadatas=[{'portfolio_url': project_link}],
            ids=[str(uuid.uuid4())]
        )

## 4. Generate Cold Email

In [10]:
json_response['skills']

['Data science',
 'Machine learning',
 'Analytics',
 'Statistical analysis',
 'Data quality evaluation',
 'Prompt engineering',
 'Fine-tuning models',
 'Technical execution',
 'Building and deploying models',
 'Developing pipelines',
 'Debugging complex data processes']

In [11]:
# Query with the first extracted skill only and keep the two closest portfolio entries
matched_portfolio_urls = collection.query(query_texts=json_response['skills'][0], n_results=2)
matched_portfolio_urls

{'ids': [['b54bd83c-7fe2-443e-a408-005ebfc2ff52',
   '4231e210-a217-467f-9d72-579a4d2a4203']],
 'embeddings': None,
 'documents': [["('Data Warehousing', ' Data Modeling', ' DBT')",
   "('Data Governance', ' Data Quality', ' Metadata Management')"]],
 'uris': None,
 'data': None,
 'metadatas': [[{'portfolio_url': ' https://github.com/user/project7'},
   {'portfolio_url': ' https://github.com/user/project18'}]],
 'distances': [[0.7733153595301009, 0.8905141540339159]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
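
The query result nests its metadata as one inner list per query text, and the URLs still carry the leading space from the CSV. The sketch below flattens that structure into a clean list of links, which is a more compact thing to hand to the email prompt than the whole result dict. `extract_portfolio_urls` is a hypothetical helper, and `result` is a trimmed copy of the output above.

```python
def extract_portfolio_urls(query_result: dict) -> list[str]:
    """Flatten the nested 'metadatas' lists into a de-duplicated list of URLs."""
    urls = []
    # 'metadatas' holds one inner list per query text
    for matches in query_result.get("metadatas") or []:
        for metadata in matches:
            url = metadata.get("portfolio_url", "").strip()
            if url and url not in urls:
                urls.append(url)
    return urls

# Trimmed copy of the result shown above
result = {
    "metadatas": [[
        {"portfolio_url": " https://github.com/user/project7"},
        {"portfolio_url": " https://github.com/user/project18"},
    ]],
}
print(extract_portfolio_urls(result))
```

The same helper keeps working if several skills are queried at once, because it walks every inner list and skips repeated links.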

In [12]:
email_prompt = PromptTemplate.from_template(
    """
    I will give you a role and a task that you have to perform in that specific role.
    Your Role: Your name is Harmeet. You are an incredible business development officer who knows how to get clients. You work for X Consulting firm; your firm works with all sorts of IT clients and provides solutions in the domain of Data Science and AI.
    X AI focuses on efficient, tailored solutions for all clients while keeping costs down.
    Your Job: Write cold emails to clients regarding the job openings that they have advertised. Pitch your clients with an email hook that opens a conversation about the possibility of working with them. Add the most relevant portfolio URLs from
    the following (shared below) to showcase that we have the right expertise to get the job done.
    I will now provide you with the job description and the portfolio URLs:
    JOB DESCRIPTION: {job_description}
    ------
    PORTFOLIO URLS: {portfolio_urls}
    """
)

In [13]:
job_description = json_response['description']

In [14]:
chain_email = email_prompt | llm
response = chain_email.invoke({'job_description':job_description, 'portfolio_urls':matched_portfolio_urls})
print(response.content)

Subject: Enhancing Product Impact with Data-Driven Solutions

Dear Hiring Manager,

I came across the job description for a role that focuses on driving product impact via measurement and evaluation, improving product quality, and guiding feature development with data. As a Business Development Officer at X Consulting firm, I was impressed by the emphasis on leveraging data to deliver exceptional search experiences across Apple products.

Our team at X AI has extensive experience in providing tailored solutions in the domain of Data Science and AI, with a strong focus on efficient and cost-effective approaches. I'd like to highlight a few examples of our work that align with the requirements of this role:

* Our expertise in Data Warehousing, Data Modeling, and DBT can help streamline data management and provide a solid foundation for data-driven decision-making. You can explore our project on Data Warehousing and Data Modeling at https://github.com/user/project7.
* We've also develope