# Example: Extract email addresses from a webpage

Let's extract information from webpages using Unstructured.

https://python.langchain.com/docs/integrations/document_loaders/unstructured_file/#overview

In [2]:
import os
from dotenv import load_dotenv

# Load environment variables (e.g. API keys) from a local .env file
load_dotenv()

True

In [3]:
from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(web_url="https://maths.du.ac.in/faculty-profile/")
docs = loader.load()

In [4]:
full_doc = "\n\n".join(doc.page_content for doc in docs)
print(full_doc)

DU





Menu Close

Home

About

Welcome

History

Campus

Seminar & Lecture Rooms

Research Scholar Room

Computer Lab

Committee Room

Library

Disabled-Friendly Campus

Clean and Green Campus

Ranking

Committees

Accomplishments

Department

Faculty

Students

Annual Reports

Gallery

Redressal Mechanisms

Brochure

Contact Us

People

Faculty Profile

Post-Doc Fellows

Ph.D. Scholars

M.Phil. Scholars

Supporting Staff

Tutors

Former Faculty

Former HODs

Research

Research Areas

Publications

Books Authored

Research Grants

Collaborations

Research Supervision

M.Phil. Awarded

Ph.D. Awarded

Academics

M.Sc. Programme

Ph.D. Programme

U.G. Curriculum

Academic Calendar

Time Tables

Examination and Results

Admissions

M.Sc. Admissions

Ph.D. Admissions

Resources

Library@DU

Forms

Useful Links

Previous Year Papers

News & Events @ Outside DU

Opportunities

Placement

Scholarships/Fellowships/Internships

Ad-hoc Panel

Events

Ph.D. Seminars

Colloquia/Workshops

Co-Curr

## No need for chunking and splitting

Since this is a very small page, we can pass the content of the entire page as context. There is no need for splitting and retrieval in this case.
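One rough way to check the "small page" claim is to estimate the token count before sending the text. The sketch below assumes about four characters per token (a common rule of thumb for English text, not an exact count) and a hypothetical budget of 100,000 tokens. Both `estimate_tokens` and `fits_in_context` are illustrative helpers, not part of LangChain.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes ~4 characters per token (heuristic)."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, token_budget: int = 100_000) -> bool:
    """Return True if the estimated token count is within the budget.

    token_budget is an assumed value; check your model's documented limit.
    """
    return estimate_tokens(text) <= token_budget


# Stand-in for the scraped page text (full_doc in the notebook).
sample_doc = "Faculty Profile\n\n" * 500
print(estimate_tokens(sample_doc), fits_in_context(sample_doc))
```

If the estimate comes close to the budget, that is the point at which splitting and retrieval start to matter.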

In [5]:
#question = "Tell me about Randheer Singh?"
question = "Make a list of all email addresses."

In [6]:
# RAG prompt template
prompt_template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:"""

# Have the LLM read the prompt, analyse the provided context, and generate a response

from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

response = llm.invoke(prompt_template.format(
    context=full_doc,
    question=question))
    
print(response.content)

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Here are the email addresses from the provided context:

1. tarukd@gmail.com
2. cslalitha1@gmail.com
3. rdasmsu@gmail.com
4. vambethkar@gmail.com
5. vambethkar@maths.du.ac.in
6. sachi_srivastava@yahoo.com
7. ssrivastava@maths.du.ac.in
8. lalit@maths.du.ac.in
9. lkumarvashisht@gmail.com
10. arvindpatelmath09@gmail.com
11. agaur@maths.du.ac.in
12. hksinghdu@gmail.com
13. rjain@maths.du.ac.in
14. sachinambariya@gmail.com
15. rkpanda@maths.du.ac.in
16. azothansanga26@yahoo.com
17. anupama.panigrahi@gmail.com
18. anuj.bshn@gmail.com
19. pratimarai5@gmail.com
20. surendraiitr8@gmail.com
21. randheernsit@gmail.com
22. sumitnagpal.du@gmail.com
23. mrigendra154@gmail.com
24. akumar@maths.du.ac.in
25. head@maths.du.ac.in


In [30]:
from IPython.display import Markdown
Markdown(response.content)

Here is a list of email addresses from the provided context:

1. tarukd@gmail.com
2. cslalitha1@gmail.com
3. rdasmsu@gmail.com
4. vambethkar@gmail.com
5. vambethkar@maths.du.ac.in
6. sachi_srivastava@yahoo.com
7. ssrivastava@maths.du.ac.in
8. lalit@maths.du.ac.in
9. lkumarvashisht@gmail.com
10. arvindpatelmath09@gmail.com
11. agaur@maths.du.ac.in
12. hksinghdu@gmail.com
13. rjain@maths.du.ac.in
14. sachinambariya@gmail.com
15. rkpanda@maths.du.ac.in
16. azothansanga26@yahoo.com
17. anupama.panigrahi@gmail.com
18. anuj.bshn@gmail.com
19. pratimarai5@gmail.com
20. surendraiitr8@gmail.com
21. randheernsit@gmail.com
22. sumitnagpal.du@gmail.com
23. mrigendra154@gmail.com
24. akumar@maths.du.ac.in
25. head@maths.du.ac.in
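
Because an LLM can miss or alter an address, it may be worth cross-checking its list against a deterministic extraction. The sketch below uses a simplified regular expression (an assumption, not a full RFC-compliant matcher) and a hypothetical helper `extract_emails`. It will not find addresses that a page obfuscates, for example with "[at]".

```python
import re

# Simplified pattern: local part, "@", domain with at least one dot.
# It does not cover every address allowed by the email RFCs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return unique, lower-cased email addresses in order of first appearance."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.findall(text):
        seen.setdefault(match.lower(), None)
    return list(seen)


# Stand-in text; in the notebook you would pass full_doc instead.
sample = "Head: head@maths.du.ac.in, Lalit: lalit@maths.du.ac.in; head@maths.du.ac.in"
print(extract_emails(sample))
```

Comparing `set(extract_emails(full_doc))` with the addresses in the model's answer would show anything the model dropped or invented.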

In [9]:
# Re-prompting on the same data to produce output in a different format.
# RAG prompt template
prompt_template = """You are an assistant and you are required to format the information in a desired format. Use the following context. The context contains a list of professors, their names, email addresses, designations etc. Understand the task carefully and respond.
Context: {context} 
Task: Extract the name and email address of each professor and return them in this format
**Name** *emailID*

"""

# Have the LLM read the prompt, analyse the provided context, and generate a response

from langchain.chat_models import init_chat_model
llm = init_chat_model("gpt-4o-mini", model_provider="openai")

# This template has no {question} placeholder, so only the context is filled in.
response = llm.invoke(prompt_template.format(context=full_doc))
    
from IPython.display import Markdown
Markdown(response.content)

INFO: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Here is the formatted information for each professor:

**Prof. Tarun Kumar Das** *tarukd@gmail.com*  
**Prof. C.S. Lalitha** *cslalitha1@gmail.com*  
**Prof. Ruchi Das** *rdasmsu@gmail.com*  
**Prof. Vusala Ambethkar** *vambethkar@gmail.com, vambethkar@maths.du.ac.in*  
**Prof. Sachi Srivastava** *sachi_srivastava@yahoo.com; ssrivastava@maths.du.ac.in*  
**Prof. Lalit Kumar** *lalit@maths.du.ac.in, lkumarvashisht@gmail.com*  
**Prof. Arvind Patel** *arvindpatelmath09@gmail.com*  
**Dr. Atul Gaur** *agaur@maths.du.ac.in*  
**Dr. Hemant Kumar Singh** *hksinghdu@gmail.com*  
**Dr. Ranjana Jain** *rjain@maths.du.ac.in*  
**Dr. Sachin Kumar** *sachinambariya@gmail.com*  
**Dr. Ratikanta Panda** *rkpanda@maths.du.ac.in*  
**Dr. A. Zothansanga** *azothansanga26@yahoo.com*  
**Dr. Anupama Panigrahi** *anupama.panigrahi@gmail.com*  
**Dr. Anuj Bishnoi** *anuj.bshn@gmail.com*  
**Dr. Pratima Rai** *pratimarai5@gmail.com*  
**Dr. Surendra Kumar** *surendraiitr8@gmail.com*  
**Dr. Randheer Singh** *randheernsit@gmail.com*  
**Dr. Sumit Nagpal** *sumitnagpal.du@gmail.com*  
**Dr. Mrigendra Singh Kushwaha** *mrigendra154@gmail.com*  
**Prof. Ajay Kumar, FNASc** *akumar@maths.du.ac.in*  
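
Since the response follows a fixed `**Name** *emailID*` pattern, it can be parsed back into structured records for further use. The following is a minimal sketch; `parse_profiles` is a hypothetical helper, and the sample text is copied from the output above rather than fetched live. It assumes the model keeps to the requested format, which is not guaranteed.

```python
import re

# Matches lines such as: **Dr. Atul Gaur** *agaur@maths.du.ac.in*
LINE_RE = re.compile(r"^\*\*(?P<name>.+?)\*\*\s+\*(?P<emails>.+?)\*\s*$")


def parse_profiles(markdown_text: str) -> list[dict]:
    """Parse '**Name** *email[, email]*' lines into dicts."""
    records = []
    for line in markdown_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip intro sentences and blank lines
        # The model separated multiple addresses with "," or ";".
        emails = [e.strip() for e in re.split(r"[;,]", match["emails"]) if e.strip()]
        records.append({"name": match["name"], "emails": emails})
    return records


sample = """Here is the formatted information for each professor:

**Prof. Vusala Ambethkar** *vambethkar@gmail.com, vambethkar@maths.du.ac.in*
**Prof. Sachi Srivastava** *sachi_srivastava@yahoo.com; ssrivastava@maths.du.ac.in*
**Dr. Atul Gaur** *agaur@maths.du.ac.in*
"""
for record in parse_profiles(sample):
    print(record)
```

In the notebook, `parse_profiles(response.content)` would be the equivalent call. Lines that do not match the pattern are silently skipped, so comparing the record count with the expected number of faculty is a sensible sanity check.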

# Exercise: Try to extract something else from a different webpage