In [1]:
"""
1. Search for open opportunities at your potential clients
2. Extract relevant information from each opening that could serve as a good starting point for your pitch
3. Go through all of your company's projects and find the relevant ones
4. Gather the information worth sharing about these projects
5. Write a compelling cold email about how your firm can be of great service to your potential client
6. Repeat for ALL potential clients
"""

"\n1. Search for open opportunities at your potential clients\n2. Extract relevant information from each opening that could serve as a good starting point for your pitch\n3. Go through all of your company's projects and find the relevant ones\n4. Gather the information worth sharing about these projects\n5. Write a compelling cold email about how your firm can be of great service to your potential client\n6. Repeat for ALL potential clients\n"

In [2]:
# Web Scraping

# Relevant Information

# Chromadb query

# Email Generation
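
The four stages outlined above can be wired together as a plain function pipeline. The following is a minimal sketch under stated assumptions, not the notebook's actual implementation: `run_pipeline` and the stage callables (`scrape`, `extract`, `match`, `write_email`) are hypothetical placeholders that the later cells would fill with the web loader, the extraction chain, the ChromaDB query, and the email chain. The example URLs are made up.

```python
from typing import Callable, Dict, List


def run_pipeline(
    job_urls: List[str],
    scrape: Callable[[str], str],
    extract: Callable[[str], Dict],
    match: Callable[[List[str]], List[str]],
    write_email: Callable[[Dict, List[str]], str],
) -> List[str]:
    """Run every stage once per job posting and collect the emails."""
    emails = []
    for url in job_urls:
        page_text = scrape(url)                 # 1. web scraping
        job = extract(page_text)                # 2. relevant information
        links = match(job.get("skills", []))    # 3. vector-store query
        emails.append(write_email(job, links))  # 4. email generation
    return emails


# Stand-in stages (hypothetical) so the sketch runs without network or API access.
demo_emails = run_pipeline(
    job_urls=["https://example.com/job/1"],
    scrape=lambda url: "Data Scientist. Skills: Python, SQL",
    extract=lambda text: {"role": "Data Scientist", "skills": ["Python", "SQL"]},
    match=lambda skills: [f"https://example.com/portfolio/{s.lower()}" for s in skills],
    write_email=lambda job, links: f"Re: {job['role']} | " + ", ".join(links),
)
print(demo_emails[0])
```

Passing the stages in as callables keeps each step replaceable, which makes it easier to repeat the run for every potential client.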

In [3]:
# Web Scraping
!pipenv install langchain_community

Loading .env environment variables...
Installing langchain_community...
✔ Installation Succeeded
Installing dependencies from Pipfile.lock (5bd434)...
All dependencies are now up-to-date!
Upgrading langchain_community in dependencies.
Building requirements...
Resolving dependencies....
✔ Success! Locking packages...
Installing dependencies from Pipfile.lock (385d37)...
All dependencies are now up-to-date!


In [8]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.capitalonecareers.com/job/plano/principal-associate-data-science-financial-services/1732/75657249248")
page_data = loader.load().pop().page_content

In [13]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model="llama-3.3-70b-versatile",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

# response = llm.invoke("Who is the greatest footballer of all time?")
# print(response.content)

In [12]:
from langchain_core.prompts import PromptTemplate

prompt_extract = PromptTemplate.from_template(
    """
    I will give you a scraped text from a job posting.
    Your job is to extract the job details & requirements in a JSON format containing the following keys: 'role', 'experience', 'skills' and 'description'.
    Only return valid JSON. No preamble, please.
    Here is the scraped text: {page_data}
    """
)

In [19]:
chain_extract = prompt_extract | llm
response = chain_extract.invoke(input={'page_data': page_data})
print(response.content)
print(type(response.content))


```json
{
    "role": "Principal Associate, Data Science - Financial Services",
    "experience": "Bachelor's Degree plus 5 years of experience in data analytics, or Master's Degree plus 3 years in data analytics, or PhD",
    "skills": [
        "Python",
        "GitHub",
        "Sagemaker",
        "SQL",
        "AWS",
        "Machine learning",
        "Open-source languages",
        "Cloud computing platforms",
        "Information Retrieval",
        "Recommender System",
        "Search Ranking",
        "Large-scale, Real-time machine learning systems"
    ],
    "description": "Partner with a cross-functional team of data scientists, software engineers, and product managers to deliver a product customers love. Leverage a broad stack of technologies to reveal the insights hidden within huge volumes of numeric and textual data. Build machine learning models through all phases of development, from design through training, evaluation, validation, and implementation."
}
```
<class 'str'>

In [22]:
from langchain_core.output_parsers import JsonOutputParser

json_parser = JsonOutputParser()
json_response = json_parser.parse(response.content)

print(json_response)
print(type(json_response))

{'role': 'Principal Associate, Data Science - Financial Services', 'experience': "Bachelor's Degree plus 5 years of experience in data analytics, or Master's Degree plus 3 years in data analytics, or PhD", 'skills': ['Python', 'GitHub', 'Sagemaker', 'SQL', 'AWS', 'Machine learning', 'Open-source languages', 'Cloud computing platforms', 'Information Retrieval', 'Recommender System', 'Search Ranking', 'Large-scale, Real-time machine learning systems'], 'description': 'Partner with a cross-functional team of data scientists, software engineers, and product managers to deliver a product customers love. Leverage a broad stack of technologies to reveal the insights hidden within huge volumes of numeric and textual data. Build machine learning models through all phases of development, from design through training, evaluation, validation, and implementation.'}
<class 'dict'>
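
The model reply above arrived wrapped in a Markdown code fence, and `JsonOutputParser` still returned a dict. For cases where only the standard library is available, a rough equivalent might look like the sketch below. The helper name `parse_llm_json` is hypothetical, and the sample reply is shortened for illustration.

```python
import json
import re

# Build the fence marker programmatically so the example itself stays fence-free.
FENCE = "`" * 3


def parse_llm_json(text: str) -> dict:
    """Strip an optional Markdown code fence, then parse the rest as JSON."""
    cleaned = text.strip()
    # Drop a leading fence (with an optional "json" tag) and a trailing fence.
    cleaned = re.sub(r"^`{3}(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*`{3}$", "", cleaned)
    return json.loads(cleaned)


raw = FENCE + 'json\n{"role": "Data Scientist", "skills": ["Python", "SQL"]}\n' + FENCE
parsed = parse_llm_json(raw)
print(parsed["skills"])
```

This sketch only handles a single fenced or unfenced JSON object. A reply with extra prose around the JSON would still raise `json.JSONDecodeError`.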


In [23]:
import csv

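def read_csv_file(file_path):
    """Return a list of (skills, project_link) tuples, one per CSV data row."""
    data = []
    # newline='' is the setting the csv module recommends for file objects
    with open(file_path, 'r', newline='') as file:
        csv_reader = csv.reader(file)
        # Skip the header row
        next(csv_reader)
        for row in csv_reader:
            # Every column except the last is a skill; the last is the project link.
            # Fields keep any space that follows a comma in the file (see the output below).
            skills = tuple(row[:-1])
            project_link = row[-1]
            data.append((skills, project_link))
    return data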

# Example usage:
file_path = 'sample_portfolio.csv'
data = read_csv_file(file_path)

for skills, project_link in data:
    print(skills, project_link)

('Python', ' SQL', ' Pandas')  https://github.com/user/project1
('SQL', ' Python', ' Airflow')  https://github.com/user/project2
('PySpark', ' Spark SQL', ' Delta Lake')  https://github.com/user/project3
('Machine Learning', ' Deep Learning', ' TensorFlow')  https://github.com/user/project4
('Data Engineering', ' ETL', ' ELT')  https://github.com/user/project5
('Cloud Platforms (AWS', ' GCP', ' Azure)')  https://github.com/user/project6
('Data Warehousing', ' Data Modeling', ' DBT')  https://github.com/user/project7
('Data Visualization', ' Power BI', ' Tableau')  https://github.com/user/project8
('MLOps', ' MLflow', ' Kubeflow')  https://github.com/user/project9
('Natural Language Processing (NLP)', ' NLTK', ' spaCy')  https://github.com/user/project10
('Computer Vision', ' OpenCV', ' TensorFlow')  https://github.com/user/project11
('Time Series Analysis', ' Forecasting', ' Prophet')  https://github.com/user/project12
('Data Cleaning', ' Data Wrangling', ' Pandas')  https://github.com
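
The printed rows keep the space that follows each comma (`' SQL'`, `' https://...'`), and the unquoted `Cloud Platforms (AWS, GCP, Azure)` entry is split across three fields. One possible way to handle both, sketched here on an in-memory sample, is `csv.reader(..., skipinitialspace=True)` together with quoting in the file. The header names, the second sample row, and the function name `read_portfolio` are assumptions made for illustration, not the contents of `sample_portfolio.csv`.

```python
import csv
import io

# Hypothetical sample that mirrors the layout printed above.
SAMPLE_CSV = """Techstack1,Techstack2,Techstack3,Links
Python, SQL, Pandas, https://github.com/user/project1
"Cloud Platforms (AWS, GCP, Azure)", Terraform, Docker, https://example.com/project
"""


def read_portfolio(file_obj):
    """Return (skills, link) tuples with the space after each delimiter removed."""
    reader = csv.reader(file_obj, skipinitialspace=True)
    next(reader)  # skip the header row
    return [(tuple(row[:-1]), row[-1]) for row in reader if row]


rows = read_portfolio(io.StringIO(SAMPLE_CSV))
for skills, link in rows:
    print(skills, link)
```

Quoting a field that contains commas keeps it in one column, which is why the cloud-platform entry stays intact in this sample.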

In [24]:
# Insert data into vector database

import uuid
import chromadb

# Persist the collection on disk in the 'vectorstore' directory
client = chromadb.PersistentClient('vectorstore')
collection = client.get_or_create_collection(name='portfolio_links')

# Populate the collection only once; later runs reuse the persisted data
if not collection.count():
    for skills, project_link in data:
        collection.add(
            documents=[str(skills)],
            metadatas=[{'portfolio_url': project_link}],
            ids=[str(uuid.uuid4())]
        )
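
The loop above adds one record per call. Since `collection.add` also accepts parallel lists, the records could be prepared up front and inserted in a single call. The sketch below covers only the pure-Python preparation step. `build_records` is a hypothetical helper, and joining the skills into one comma-separated string (instead of `str(skills)`) is a design choice of this sketch, not the notebook's behavior.

```python
import uuid


def build_records(data):
    """Turn (skills, link) pairs into the parallel lists that collection.add accepts."""
    documents = [", ".join(s.strip() for s in skills) for skills, _ in data]
    metadatas = [{"portfolio_url": link.strip()} for _, link in data]
    ids = [str(uuid.uuid4()) for _ in data]
    return {"documents": documents, "metadatas": metadatas, "ids": ids}


sample = [(("Python", " SQL", " Pandas"), " https://github.com/user/project1")]
records = build_records(sample)
print(records["documents"], records["metadatas"])

# With the collection created above, one call would then insert everything:
# collection.add(**records)
```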

In [27]:
json_response['skills']

['Python',
 'GitHub',
 'Sagemaker',
 'SQL',
 'AWS',
 'Machine learning',
 'Open-source languages',
 'Cloud computing platforms',
 'Information Retrieval',
 'Recommender System',
 'Search Ranking',
 'Large-scale, Real-time machine learning systems']

In [33]:
# Query with the first extracted skill only ('Python') and keep the two closest portfolio entries
matched_portfolio_urls = collection.query(query_texts=json_response['skills'][0], n_results=2)
matched_portfolio_urls

{'ids': [['72e6ed32-3af0-475e-9d1a-5caab35badff',
   '4d17ad99-d9ee-45f3-b544-377a4a5c8b3d']],
 'embeddings': None,
 'documents': [["('SQL', ' Python', ' Airflow')",
   "('Python', ' SQL', ' Pandas')"]],
 'uris': None,
 'data': None,
 'metadatas': [[{'portfolio_url': ' https://github.com/user/project2'},
   {'portfolio_url': ' https://github.com/user/project1'}]],
 'distances': [[1.0100101142910578, 1.0451927628216906]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
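
The query result is a nested dict, with one inner list per query text, and the URLs sit under `metadatas`. A small helper can flatten it into a clean list before it is handed to the email prompt. This is a minimal sketch: `extract_portfolio_urls` is a hypothetical name, and the sample dict is a trimmed copy of the output shown above.

```python
def extract_portfolio_urls(query_result):
    """Flatten a query result into a de-duplicated list of URLs, keeping order."""
    urls = []
    for per_query in query_result.get("metadatas") or []:
        for meta in per_query:
            url = (meta or {}).get("portfolio_url", "").strip()
            if url and url not in urls:
                urls.append(url)
    return urls


# Trimmed copy of the result printed above.
sample_result = {
    "metadatas": [[
        {"portfolio_url": " https://github.com/user/project2"},
        {"portfolio_url": " https://github.com/user/project1"},
    ]],
}
print(extract_portfolio_urls(sample_result))

# query_texts also accepts a list, so every skill could be queried at once:
# collection.query(query_texts=json_response['skills'], n_results=2)
```

Because the helper walks every inner list, it works unchanged when several skills are queried together.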

In [40]:
email_prompt = PromptTemplate.from_template(
    """
    I will give you a role and a task that you have to perform in that specific role.
    Your Role: Your name is Harmeet. You are an incredible business development officer who knows how to get clients. You work for X Consulting firm; your firm works with all sorts of IT clients and provides solutions in the domain of Data Science and AI.
    X AI focuses on efficient, tailored solutions for all clients while keeping costs down.
    Your Job: Your job is to write cold emails to clients regarding the job openings that they have advertised. Try to pitch your clients with an email hook that opens a conversation about the possibility of working with them.
    Add the most relevant portfolio URLs from the list shared below to showcase that we have the right expertise to get the job done.
    I will now provide you with the Job description and the portfolio URLs:
    JOB DESCRIPTION: {job_description}
    ------
    PORTFOLIO URLS: {portfolio_urls}
    """
)

In [41]:
job_description = json_response['description']
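
The cell above forwards only the `description` field, so the extracted role, experience, and skills never reach the email prompt. If those should be included, the fields could be combined into one text block first. The following is a sketch under that assumption. `summarize_job` is a hypothetical helper, and the sample dict is shortened for illustration.

```python
def summarize_job(job):
    """Combine the extracted fields into one text block for the email prompt."""
    skills = ", ".join(job.get("skills", []))
    lines = [
        f"Role: {job.get('role', 'N/A')}",
        f"Experience: {job.get('experience', 'N/A')}",
        f"Skills: {skills or 'N/A'}",
        f"Description: {job.get('description', 'N/A')}",
    ]
    return "\n".join(lines)


sample_job = {
    "role": "Principal Associate, Data Science - Financial Services",
    "skills": ["Python", "SQL", "AWS"],
    "description": "Build machine learning models through all phases of development.",
}
print(summarize_job(sample_job))
```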

In [42]:
chain_email = email_prompt | llm
response = chain_email.invoke({'job_description':job_description, 'portfolio_urls':matched_portfolio_urls})
print(response.content)

Subject: Unlocking Hidden Insights with X AI's Expertise in Data Science and AI

Dear Hiring Manager,

I came across your job posting for a data scientist position, and I was impressed by the opportunity to work with a cross-functional team to deliver a product that customers love. As a Business Development Officer at X Consulting firm, I'd like to introduce you to our team of experts in Data Science and AI, who have a proven track record of leveraging a broad stack of technologies to reveal insights hidden within large volumes of data.

Our team has extensive experience in building machine learning models, from design through training, evaluation, validation, and implementation. We've worked with a range of technologies, including SQL, Python, Airflow, and Pandas, to name a few. I'd like to highlight a couple of our notable projects that demonstrate our expertise:

* https://github.com/user/project1: This project showcases our ability to work with large datasets, applying machine lear