# Cold Email Generator for Students

This notebook builds a Streamlit application that generates cold emails for students from a job posting and the student's LinkedIn profile details.

In [None]:
# Install required packages
!pip install langchain langchain_openai langchain_community bs4 streamlit requests selenium webdriver-manager

Collecting bs4
  Using cached bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Using cached bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Installing collected packages: bs4
Successfully installed bs4-0.0.2





In [6]:
import json
import requests
import re
from bs4 import BeautifulSoup
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
import streamlit as st

ModuleNotFoundError: No module named 'langchain_openai'

## LinkedIn Scraping Issues

LinkedIn's anti-scraping measures block plain HTTP requests such as `requests.get`.
The cell below reproduces the failure with our current approach:

In [8]:
# Example URL for testing - showing why direct requests don't work
url = "https://www.linkedin.com/in/mihirhirave/"  # replace with your actual URL
response = requests.get(url, timeout=10)  # timeout keeps the cell from hanging on a stalled connection
print(f"Response status code: {response.status_code}")
print("Response HTML (first 300 characters):")
print(response.text[:300])

soup = BeautifulSoup(response.text, 'html.parser')
text = soup.get_text()
print("\nExtracted text (showing all):")
print(text)

Response status code: 999
Response HTML (first 300 characters):
<html><head>
<script type="text/javascript">
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (coo

Extracted text (showing all):






## Alternative LinkedIn Data Collection Solutions

Since LinkedIn blocks direct scraping, we need alternative approaches:

In [None]:
# Option 1: Use Selenium with browser automation (requires login)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time

def scrape_linkedin_with_selenium(linkedin_url):
    """Scrape a LinkedIn profile with Selenium; requires a manual login in the browser window."""
    # Set up Chrome options
    chrome_options = Options()
    # Headless mode stays off because the user has to log in manually
    # chrome_options.add_argument("--headless")

    driver = None  # defined up front so the finally block can always check it
    try:
        # Initialize the Chrome driver
        driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

        # Go to the LinkedIn login page first
        driver.get("https://www.linkedin.com/login")
        print("Please log in to LinkedIn in the browser window.")
        print("After logging in, the script will continue automatically.")

        # Poll until the URL contains 'feed', which indicates a successful login
        max_wait = 60  # Maximum wait time in seconds
        wait_time = 0
        while "feed" not in driver.current_url and wait_time < max_wait:
            time.sleep(5)
            wait_time += 5
            print(f"Waiting for login... ({wait_time} seconds)")

        # Re-check the URL so a login during the last sleep is not reported as a timeout
        if "feed" not in driver.current_url:
            print("Login timeout reached. Proceeding anyway...")

        # Now navigate to the profile URL
        print(f"Navigating to {linkedin_url}")
        driver.get(linkedin_url)
        time.sleep(5)  # Allow time for the page to load

        # Extract the visible text from the rendered page
        page_source = driver.page_source
        soup = BeautifulSoup(page_source, 'html.parser')
        return soup.get_text(separator=' ', strip=True)
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
    finally:
        # Close the browser whether or not scraping succeeded
        if driver is not None:
            driver.quit()

# Uncomment to test this function
# profile_text = scrape_linkedin_with_selenium("https://www.linkedin.com/in/mihirhirave/")
# if profile_text:
#     print(f"Extracted {len(profile_text)} characters of profile text")
#     print(f"Sample: {profile_text[:500]}...")

In [None]:
# Option 2: Alternative approach - get user to manually input their LinkedIn info
def get_linkedin_data_from_user_input():
    """Get LinkedIn profile data directly from user inputs instead of scraping"""
    # In Streamlit, these would be text inputs or text areas
    linkedin_data = {
        "headline": "Software Engineer",  # st.text_input("Your LinkedIn Headline")
        "about": "Experienced software engineer with background in...",  # st.text_area("About You")
        "experience": "Software Engineer at Company X (2020-Present)\nIntern at Company Y (2019-2020)",  # st.text_area("Experience")
        "education": "BS Computer Science, University Z (2016-2020)",  # st.text_area("Education")
        "skills": "Python, JavaScript, Machine Learning, Data Analysis"  # st.text_input("Key Skills (comma separated)")
    }
    
    # Format the data into a string similar to what we'd get from scraping
    formatted_data = f"""Headline: {linkedin_data['headline']}
About: {linkedin_data['about']}
Experience: {linkedin_data['experience']}
Education: {linkedin_data['education']}
Skills: {linkedin_data['skills']}"""
    
    return formatted_data

# Example of profile data format that would be returned
example_profile_data = get_linkedin_data_from_user_input()
print("Example manual profile data:")
print(example_profile_data)
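
The formatter above always emits every label, even when a field is blank. A minimal sketch of a variant that skips empty fields is shown here; `format_profile` is a hypothetical helper name, not part of the notebook, and the sample values are placeholders.

```python
def format_profile(fields):
    """Join non-empty profile fields into 'Label: value' lines."""
    lines = []
    for label, value in fields.items():
        cleaned = (value or "").strip()
        if cleaned:  # skip fields the user left blank
            lines.append(f"{label}: {cleaned}")
    return "\n".join(lines)


# Placeholder values standing in for Streamlit form inputs
profile_text = format_profile({
    "Headline": "Computer Science Student",
    "Experience": "",
    "Education": "BS Computer Science",
    "Skills": "  Python, Java  ",
})
print(profile_text)
```

Skipping blank fields keeps lines such as `Experience:` with no value out of the prompt that is later sent to the LLM.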

In [None]:
# Modified fetch_and_summarize function to handle LinkedIn specially
def fetch_and_summarize(url, llm):
    """Fetch and summarize content from a URL, with special handling for LinkedIn"""
    try:
        # Check if this is a LinkedIn URL
        if "linkedin.com/in/" in url:
            # For LinkedIn, we'll use our alternative approach
            print("LinkedIn URL detected. Using alternative method to get profile data...")
            
            # In a real app, you would use either the Selenium approach or manual input
            # For this demo, we'll use a mocked profile
            profile_data = get_linkedin_data_from_user_input()
            
            # Since we already have the formatted data, we can skip the summarization step
            # or still run it through the LLM to get a more concise version
            prompt_summary = PromptTemplate.from_template(
                """
                Summarize the following LinkedIn profile to highlight key skills and achievements that would be relevant for job applications:
                {content}
                """
            )
            chain_summary = prompt_summary | llm
            summary_response = chain_summary.invoke({"content": profile_data})
            return summary_response.content.strip()
            
        else:
            # For non-LinkedIn URLs, use the standard approach
            response = requests.get(url, timeout=10)  # fail fast if the site does not respond
            response.raise_for_status()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            text = soup.get_text()
            
            prompt_summary = PromptTemplate.from_template(
                """
                Summarize the following content to highlight key skills and achievements:
                {content}
                """
            )
            chain_summary = prompt_summary | llm
            summary_response = chain_summary.invoke({"content": text})
            
            return summary_response.content.strip()
    except Exception as e:
        return f"Error fetching data: {e}"

# This function would replace the fetch_and_summarize function in the main app

In [None]:
# Streamlit App Configuration
st.set_page_config(page_title="Cold Email Generator for Students", page_icon="📧", layout="wide")

# Custom CSS for styling
st.markdown(
    """
    <style>
        body {
            background-color: #f0f8ff;
        }
        .generated-email {
            height: 400px !important;
            font-size: 14px;
            font-family: Arial, sans-serif;
        }
        .stTextArea label {
            font-size: 16px;
            font-weight: bold;
        }
        .stButton button {
            background-color: #ff4500;
            color: white;
            font-size: 16px;
        }
        .stTextInput label, .stJson label {
            font-size: 15px;
            color: #333;
        }
    </style>
    """,
    unsafe_allow_html=True,
)

# Streamlit App Title and Description
st.title("📧 Cold Email Generator for Students")
st.markdown(
    """
    Welcome to the **Cold Email Generator for Students**! This tool helps you craft professional and tailored cold emails 
    for job applications by leveraging AI and your online profiles.
    """
)

# Step 1: User Inputs
st.header("Step 1: Provide Input Details")
with st.form("user_inputs"):
    url_input = st.text_input("Enter the Job URL:", placeholder="e.g., https://www.example.com/job-posting")
    student_name = st.text_input("Your Name:", placeholder="e.g., Jane Doe")
    university_name = st.text_input("Your University/Organization:", placeholder="e.g., University of Example")
    
    # Modified LinkedIn input approach
    st.markdown("#### LinkedIn Information")
    st.markdown("*LinkedIn profiles can't be automatically scraped due to restrictions. Please provide your details manually:*")
    linkedin_headline = st.text_input("Your Headline:", placeholder="e.g., Computer Science Student at University of Example")
    linkedin_experience = st.text_area("Your Experience:", placeholder="e.g., Internship at Company X (Summer 2023)\nResearch Assistant (2022-Present)")
    linkedin_education = st.text_input("Your Education:", placeholder="e.g., BS Computer Science, University of Example (2020-2024)")
    linkedin_skills = st.text_input("Your Skills:", placeholder="e.g., Python, Java, Machine Learning, Web Development")
    
    portfolio_url = st.text_input("Portfolio Website (Optional):", placeholder="e.g., https://yourportfolio.com/projects")
    openai_api_key = st.text_input("OpenAI API Key (Required):", type="password")
    generate_button = st.form_submit_button("Extract Job Details and Generate Email")

if generate_button:
    if not url_input.strip() or not student_name.strip() or not university_name.strip() or not linkedin_headline.strip() or not openai_api_key.strip():
        st.error("Please provide all required inputs (Job URL, Name, University/Organization, LinkedIn Headline, and OpenAI API Key).")
    else:
        try:
            # Step 1: Initialize OpenAI LLM
            llm = ChatOpenAI(
                temperature=0,
                api_key=openai_api_key,
                model_name="gpt-4"
            )

            # Step 2: Scrape job description
            st.info("Scraping the job page content...")
            loader = WebBaseLoader([url_input])
            page_data = loader.load().pop().page_content

            prompt_extract = PromptTemplate.from_template(
                """
                ### SCRAPED TEXT FROM WEBSITE:
                {page_data}
                ### INSTRUCTION:
                Extract the job postings in valid JSON format with keys: `role`, `experience`, `skills`, and `description`.
                Return ONLY the JSON object without any additional text, markdown formatting, or headers.
                The response should start directly with the JSON object.
                """
            )
            chain_extract = prompt_extract | llm
            res = chain_extract.invoke({"page_data": page_data})
            
            # Clean up the response to extract just the JSON part
            content = res.content
            # Remove any markdown headers or text before the JSON
            json_match = re.search(r'(\[|\{).*(\]|\})', content, re.DOTALL)
            if json_match:
                json_str = json_match.group(0)
            else:
                json_str = content
                
            try:
                # Parse the cleaned JSON string
                job_details = json.loads(json_str)
            except json.JSONDecodeError as e:
                st.error(f"Error parsing JSON: {e}")
                st.code(content, language="json")
                raise ValueError("Failed to parse job details: the model returned invalid JSON.") from e

            # Step 3: Process LinkedIn data (now from manual input instead of scraping)
            linkedin_data = f"""Headline: {linkedin_headline}
Experience: {linkedin_experience}
Education: {linkedin_education}
Skills: {linkedin_skills}"""
            
            # Summarize the LinkedIn data using the LLM
            prompt_linkedin = PromptTemplate.from_template(
                """
                Summarize the following LinkedIn profile to highlight relevant skills and experiences for job applications:
                {content}
                """
            )
            chain_linkedin = prompt_linkedin | llm
            linkedin_summary = chain_linkedin.invoke({"content": linkedin_data}).content.strip()
            
            # Process portfolio if provided
            if portfolio_url.strip():
                st.info("Analyzing portfolio content...")
                try:
                    # Standard web scraping approach for portfolio (assuming it's not LinkedIn)
                    response = requests.get(portfolio_url, timeout=10)  # fail fast if the site does not respond
                    response.raise_for_status()
                    soup = BeautifulSoup(response.text, 'html.parser')
                    text = soup.get_text()
                    
                    prompt_summary = PromptTemplate.from_template(
                        """
                        Summarize the following content to highlight key skills and achievements:
                        {content}
                        """
                    )
                    chain_summary = prompt_summary | llm
                    portfolio_summary = chain_summary.invoke({"content": text}).content.strip()
                except Exception as e:
                    portfolio_summary = f"Error fetching portfolio data: {e}. Please provide a summary manually."
            else:
                portfolio_summary = "No portfolio provided."

            # Step 4: Generate Cold Email
            st.header("Step 2: Generate Cold Email")
            st.info("Generating cold email...")
            prompt_email = ChatPromptTemplate.from_template(
                """
                ### JOB DESCRIPTION:
                {job_description}

                ### CANDIDATE DETAILS:
                Name: {student_name}
                University/Organization: {university_name}
                LinkedIn Summary: {linkedin_summary}
                Portfolio Summary: {portfolio_summary}

                ### INSTRUCTIONS:
                Write a professional cold email for the job description in four structured paragraphs:
                1. Introduction of the candidate (name, background, and current affiliation).
                2. Highlight relevant experiences and skills based on the LinkedIn and portfolio summaries.
                3. Explain why the candidate is an excellent fit for the job.
                4. End with a call to action for further discussion.

                ### EMAIL (START HERE):
                """
            )

            chain_email = prompt_email | llm
            email_response = chain_email.invoke(
                {
                    "job_description": str(job_details),
                    "student_name": student_name,
                    "university_name": university_name,
                    "linkedin_summary": linkedin_summary,
                    "portfolio_summary": portfolio_summary,
                }
            )

            # Display the generated email
            st.subheader("Generated Cold Email")
            st.text_area("Cold Email", email_response.content, height=400, key="generated_email", help="Copy and customize as needed.")
            st.download_button(
                label="📥 Download Email",
                data=email_response.content,
                file_name="cold_email.txt",
                mime="text/plain",
            )

        except Exception as e:
            st.error(f"An error occurred: {e}")
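
The JSON clean-up step in the app relies on a greedy regular expression, which spans from the first opening bracket to the last closing one and can therefore capture stray text when the model adds commentary. As a hedged alternative, the sketch below scans for the first position where the standard library's `json.JSONDecoder.raw_decode` succeeds. `extract_first_json` is a hypothetical helper name, and the sample response is made up for illustration.

```python
import json


def extract_first_json(text):
    """Return the first JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                # raw_decode parses one JSON value starting at index
                # and ignores whatever follows it
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON here, try the next bracket
    raise ValueError("No JSON object or array found in the model response.")


# Made-up model response with text before and after the JSON
sample = 'Here is the result:\n{"role": "Intern", "skills": ["Python"]}\nLet me know {if} that helps.'
job_details = extract_first_json(sample)
print(job_details)
```

In the app, a call such as `extract_first_json(res.content)` could stand in for the regex search plus `json.loads`, with the `ValueError` handled by the existing outer `except` block.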


In [4]:
# Test the installation worked properly
from langchain_openai import ChatOpenAI
print("Import successful!")

ModuleNotFoundError: No module named 'langchain_openai'
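
The import fails even though the install cell ran. One possible cause, which is an assumption the notebook output does not confirm, is that the kernel runs a different Python interpreter than the one `pip` installed into. The sketch below prints the kernel's interpreter path and lists which packages it cannot find; `missing_modules` is a hypothetical helper name.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Interpreter the notebook kernel is actually running
print("Interpreter:", sys.executable)

# Packages the notebook imports
required = ["langchain_openai", "langchain_community", "bs4", "streamlit", "selenium"]
print("Missing:", missing_modules(required))
```

If the printed interpreter is not the `.venv` one, installing with the `%pip install ...` magic instead of `!pip install ...` targets the kernel's own environment.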

## LinkedIn Scraping Explanation

The original scraping approach failed because LinkedIn has anti-scraping protections in place:

1. They detect and block automated requests.
2. They return a non-standard status code (999) and a JavaScript stub page instead of the profile content.
3. They require cookies and JavaScript execution, which a plain `requests.get` call does not provide.

To solve this issue, we've implemented two alternative approaches:

1. **Manual Input Method** (Option 2 above) - The simplest solution: users enter their LinkedIn details directly in the form.
2. **Selenium Method** (Option 1 above) - Browser automation with a logged-in session can collect the profile text, but it requires more setup and a manual login.

The current implementation uses the Manual Input Method as it's more reliable and easier to set up.
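
The decision described above can be sketched as a small check that routes LinkedIn URLs, or any response carrying the 999 status observed earlier, to manual input. `should_use_manual_input` and `LINKEDIN_BLOCK_STATUS` are hypothetical names introduced for illustration; the 999 value comes from the cell output in this notebook, not from any documented LinkedIn behavior.

```python
from urllib.parse import urlparse

# Non-standard status code seen in this notebook's output (an observation, not a documented API)
LINKEDIN_BLOCK_STATUS = 999


def is_linkedin_url(url):
    """Return True when the URL's host is linkedin.com or a subdomain of it."""
    if "://" not in url:
        url = "https://" + url  # urlparse needs a scheme to populate netloc
    host = urlparse(url).netloc.lower()
    return host == "linkedin.com" or host.endswith(".linkedin.com")


def should_use_manual_input(url, status_code=None):
    """Decide whether to skip scraping and ask the user for profile details."""
    return is_linkedin_url(url) or status_code == LINKEDIN_BLOCK_STATUS


print(should_use_manual_input("https://www.linkedin.com/in/mihirhirave/"))
print(should_use_manual_input("https://yourportfolio.com/projects", status_code=200))
```

Matching on the parsed host rather than a substring avoids treating an unrelated URL that merely contains `linkedin.com/in/` in its path as a LinkedIn profile.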