# Web Scraping
Web scraping is the process of **extracting data** from a website. It is an **automatic method** for obtaining large amounts of data from websites. Most of this data is **unstructured data** in HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications. There are many ways to perform web scraping: using online services, using particular APIs, or writing your own scraping code from scratch.

Web scraping requires two parts, namely the **crawler** and the **scraper**. The crawler is a program that browses the web to find the required data by following links from page to page. The scraper, on the other hand, is a tool built to extract the data from each page.
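
The split between the two roles can be sketched with the standard library alone. The class names `LinkCollector` and `TextCollector` and the sample HTML are illustrative assumptions; the notebook itself uses BeautifulSoup for both jobs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Crawler side: gathers the href of every <a> tag so the links can be followed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class TextCollector(HTMLParser):
    """Scraper side: gathers the text found between the tags of the page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


# Hypothetical page used only for illustration
SAMPLE_HTML = '<html><body><h1>Notices</h1><a href="/ew/downloads">Forms</a></body></html>'

crawler = LinkCollector()
crawler.feed(SAMPLE_HTML)

scraper = TextCollector()
scraper.feed(SAMPLE_HTML)

print(crawler.links)   # links the crawler would visit next
print(scraper.chunks)  # text the scraper would store
```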

In [None]:
# Importing Required Libraries
# requests is used to get access to the website or in other words connect to the website
import requests
# Beautiful Soup 4 is a Python library that makes it easy to scrape information from web pages. It provides Pythonic idioms for iterating, searching, and modifying the parse tree. The library sits atop an HTML or XML parser.
from bs4 import BeautifulSoup

## Accessing/Scraping embedded links
We do not scrape the page content directly. First we collect all the links embedded in the page, and after that we scrape data from the website.

In [None]:
# Send an HTTP request to the URL of the webpage we want to access
URL = "https://csvtu.ac.in/ew/"
r = requests.get(URL)

# Create a BeautifulSoup object and parse the HTML content
soup = BeautifulSoup(r.content, 'html.parser')

# Find all the links on the webpage
links = soup.find_all('a')

# Save the scraped links to a text file
with open('Embedded_links_csvtu.txt', 'w') as f:

    # Iterate through the links
    for link in links:

        # Read the href attribute of the anchor tag
        href = link.get('href')

        # Skip anchor tags that have no href attribute
        if href is not None:

            # Write the link to the text file
            f.write(href + '\n')

## Getting the HTML code from the website

To access the raw HTML code of the website, we can use the code below.

Install requests, html5lib and bs4 before running it.

```
import requests
URL = "https://csvtu.ac.in/ew/"
r = requests.get(URL)
print(r.content)

# Sending the request with a custom User-Agent header
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}
# Here the user agent is for the Edge browser on Windows 10. Replace it with your own browser's user agent string if needed.
r = requests.get(url=URL, headers=headers)
print(r.content)

# This will not run on an online IDE
import requests
from bs4 import BeautifulSoup

URL = "https://csvtu.ac.in/ew/"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib') # If this line causes an error, run 'pip install html5lib' or install html5lib
print(soup.prettify())
```

## Getting text from the main website

After this we can also scrape text from the embedded links.

In [None]:
# Since we have already established a connection with the website we do not need to do it again

# Create a BeautifulSoup object and parse the HTML content
soup = BeautifulSoup(r.content, 'html.parser')

# Extract all the text from the webpage
text = soup.get_text()

# Save the scraped data to a text file
with open('Main_website_data.txt', 'w') as f:
    f.write(text)

In [None]:
# Reading text from text file
with open('/content/Main_website_data.txt', 'r') as f:
    contents = f.read()
    print(contents)

Chhattisgarh Swami Vivekanand Technical University – CSVTU

FORMS / DOWNLOADS
CSVTU NSS
CSVTU STUDENT COUNCIL
LOCATION
PREVIOUS WEBSITE
ENROLL. DEFICIENCIES





Search for:





Recent Posts


Recruitment Notice For The Post of Principal, Professor, Asst. Professor, Asso. Professor & Lecturer  Under Statute-19 at Rungta Institute of Pharmaceutical Sciences, Bhilai


Public Relations Officer


AICTE Quality Improvement Scheme[AQIS] 2021-22 Financial Support


M.Tech/M.Plan Admissions 2020 at University Teaching Department,CSVTU,Newai,Bhilai


Important Notification-Suspicious Email Activities


Recent CommentsArchives

December 2021
August 2021
December 2020
September 2020
May 2020
April 2020
March 2020

Categories

Announcement

Notice

Uncategorized


Meta

Log in
Entries feed
Comments feed
WordPress.org

HOME
THE UNIVERSITY

About
Hon’ble Vice Chancellor
Hon’ble Pro Vice Chancellor
University Val
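
Text returned by `get_text()` tends to contain long runs of blank lines where tags were removed. A minimal sketch of one way to tidy it before saving follows; the helper name `collapse_blank_lines` and the sample string are assumptions made for illustration.

```python
def collapse_blank_lines(text):
    """Strip each line and drop the empty ones left behind by removed HTML tags."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical snippet shaped like the scraped output
raw = "\n\n\nFORMS / DOWNLOADS\n   CSVTU NSS   \n\n\n\nLOCATION\n\n"
print(collapse_blank_lines(raw))
```

BeautifulSoup can do something similar at extraction time with `soup.get_text(separator="\n", strip=True)`, which may be preferable when the raw text is not needed.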

## Reading text from embedded websites

In [None]:
# First we will iterate through all the links and then read text from there

print("Embedded Links are:\n")
# Open the file for reading
with open('/content/Embedded_links_csvtu.txt', 'r') as file:

    # Read and process each line
    for line in file:
        # Print the line (you can replace this with any action you want)
        print(line.strip())  # .strip() removes newline characters if needed

Embedded Links are:

https://csvtu.ac.in/ew/downloads
http://csvtu.ac.in/NSS_CSVTU/
https://csvtu.ac.in/ew/csvtu-student-council/
https://csvtu.ac.in/ew/contact-us/
http://www.csvtu.ac.in/prev/
https://csvtu.ac.in/ew/enrollment-deficiencies/
https://csvtu.ac.in/ew/recruitment-notice-for-the-post-of-principal-professor-asst-professor-asso-professor-lecturer-under-statute-19-at-rungta-institute-of-pharmaceutical-sciences-bhilai/
https://csvtu.ac.in/ew/public-relations-officer/
https://csvtu.ac.in/ew/aicte-quality-improvement-schemeaqis-2021-22-financial-support/
https://csvtu.ac.in/ew/m-techm-plan-admissions-2020-at-university-teaching-departmentcsvtunewaibhilai/
https://csvtu.ac.in/ew/important-notification/
https://csvtu.ac.in/ew/2021/12/
https://csvtu.ac.in/ew/2021/08/
https://csvtu.ac.in/ew/2020/12/
https://csvtu.ac.in/ew/2020/09/
https://csvtu.ac.in/ew/2020/05/
https://csvtu.ac.in/ew/2020/04/
https://csvtu.ac.in/ew/2020/03/
https://csvtu.ac.in/ew/category/announcement/
https://csvtu

In [None]:
# Open the file for reading
# Adding links to list
lst = []
with open('/content/Embedded_links_csvtu.txt', 'r') as file:
    # Read and process each line
    for line in file:
        # Appending to list
        lst.append(line.strip())  # .strip() removes newline characters if needed

## Removing '#', empty strings and unreachable links from the list

In [None]:
# Remove placeholder entries ('#' and '') and links that cannot be reached
unwanted = {'#', '', '/csvtubhilaif/add', 'act Hindi_English 2004.pdf'}
new_lst = [link for link in lst if link not in unwanted]

# Print list
print("List is:")
print(new_lst)

List is:
['https://csvtu.ac.in/ew/downloads', 'http://csvtu.ac.in/NSS_CSVTU/', 'https://csvtu.ac.in/ew/csvtu-student-council/', 'https://csvtu.ac.in/ew/contact-us/', 'http://www.csvtu.ac.in/prev/', 'https://csvtu.ac.in/ew/enrollment-deficiencies/', 'https://csvtu.ac.in/ew/recruitment-notice-for-the-post-of-principal-professor-asst-professor-asso-professor-lecturer-under-statute-19-at-rungta-institute-of-pharmaceutical-sciences-bhilai/', 'https://csvtu.ac.in/ew/public-relations-officer/', 'https://csvtu.ac.in/ew/aicte-quality-improvement-schemeaqis-2021-22-financial-support/', 'https://csvtu.ac.in/ew/m-techm-plan-admissions-2020-at-university-teaching-departmentcsvtunewaibhilai/', 'https://csvtu.ac.in/ew/important-notification/', 'https://csvtu.ac.in/ew/2021/12/', 'https://csvtu.ac.in/ew/2021/08/', 'https://csvtu.ac.in/ew/2020/12/', 'https://csvtu.ac.in/ew/2020/09/', 'https://csvtu.ac.in/ew/2020/05/', 'https://csvtu.ac.in/ew/2020/04/', 'https://csvtu.ac.in/ew/2020/03/', 'https://csvtu.a
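
Instead of listing unwanted entries by hand, relative links can be resolved against the page they came from and duplicates dropped. The sketch below uses only `urllib.parse`; the helper name `normalise_links` and the sample hrefs are assumptions, and a relative link is only resolved correctly when `base_url` is the page on which it was actually found.

```python
from urllib.parse import urljoin, urldefrag

BASE_URL = "https://csvtu.ac.in/ew/"


def normalise_links(hrefs, base_url=BASE_URL):
    """Resolve relative hrefs against base_url; drop fragments, empties and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        href = href.strip()
        # Skip empty entries and in-page anchors such as '#'
        if not href or href.startswith("#"):
            continue
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs used only for illustration
sample = ["#", "", "downloads", "/ew/contact-us/", "https://csvtu.ac.in/ew/contact-us/"]
print(normalise_links(sample))
```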

## Handling links that do not allow access

Since certain embedded links do not allow access, we use exception handling so that scraping continues without interruption.

In [None]:
# Loop through the list of URLs and scrape their content
for url in new_lst:
    try:
        # Send an HTTP request to the URL of the webpage we want to access
        r = requests.get(url)

        # Check for any HTTP errors
        r.raise_for_status()

        # Create a BeautifulSoup object and parse the HTML content
        soup = BeautifulSoup(r.content, 'lxml')

        # Extract all the text from the webpage
        text = soup.get_text()

        # Save the scraped data to a text file
        with open('Embedded_links_data_2.txt', 'a') as f:
            f.write(text)

    except requests.exceptions.ConnectTimeout as e:
        print(f"Connection to {url} timed out. Check your internet connection.")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")




An error occurred: 403 Client Error: Forbidden for url: https://forms.eduqfix.com/csvtubhilaif/add
An error occurred: 403 Client Error: Forbidden for url: https://forms.eduqfix.com/csvtubhilaif/add
Connection to https://rajbhavancg.gov.in/ timed out. Check your internet connection.
An error occurred: HTTPConnectionPool(host='www.cgdteraipur.ac.in', port=80): Max retries exceeded with url: / (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7c65fe7b2f50>: Failed to resolve 'www.cgdteraipur.ac.in' ([Errno -2] Name or service not known)"))
An error occurred: HTTPSConnectionPool(host='www.csir.res.in', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
An error occurred: 406 Client Error: Not Acceptable for url: http://www.aiu.ac.in/
Connection to http://www.niscair.res.in/ timed out. Check your internet 
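
The same error handling can be wrapped in a small helper so that every request gets an explicit timeout and a browser-like header. This is a sketch under assumptions: the name `fetch_html`, the timeout values and the User-Agent string are illustrative choices, not taken from the notebook.

```python
import requests

# Assumed values for illustration; tune them for the sites being scraped
DEFAULT_TIMEOUT = (5, 15)  # (connect timeout, read timeout) in seconds
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; csvtu-notebook-example)"}


def fetch_html(url, timeout=DEFAULT_TIMEOUT, headers=DEFAULT_HEADERS):
    """Return (html, None) on success or (None, error_message) on failure."""
    try:
        response = requests.get(url, timeout=timeout, headers=headers)
        # Turn 4xx and 5xx responses into exceptions
        response.raise_for_status()
        return response.text, None
    except requests.exceptions.Timeout:
        # Covers both connect timeouts and read timeouts
        return None, f"timed out: {url}"
    except requests.exceptions.RequestException as exc:
        return None, f"request failed: {exc}"


# A malformed URL is rejected by requests before any network traffic is sent
html, error = fetch_html("not-a-valid-url")
print(html, error)
```

Returning the error instead of printing it lets the calling loop decide whether to log the failure, retry, or skip the link.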

In [None]:
# Combining text files
# Define the names of the two input text files
file1_name = '/content/Main_website_data.txt'
file2_name = '/content/Embedded_links_data_2.txt'

# Define the name of the output file where you want to concatenate the contents
output_file_name = 'CSVTU_scraped_data.txt'

# Open the first input file for reading
with open(file1_name, 'r') as file1:
    content1 = file1.read()

# Open the second input file for reading
with open(file2_name, 'r') as file2:
    content2 = file2.read()

# Combine the contents of both files
combined_content = content1 + content2

# Open the output file for writing
with open(output_file_name, 'w') as output_file:
    # Write the combined content to the output file
    output_file.write(combined_content)

print(f"Concatenated content has been saved to {output_file_name}")

Concatenated content has been saved to CSVTU_scraped_data.txt
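
The same idea generalises to any number of input files. The helper below is a sketch; the name `concatenate_files` and the choice to skip missing files are assumptions. It also passes an explicit `encoding`, which avoids platform-dependent defaults when the scraped text contains non-ASCII characters.

```python
from pathlib import Path


def concatenate_files(input_paths, output_path, encoding="utf-8"):
    """Write the contents of each existing input file to output_path, in order.

    Returns the list of input paths that were actually written.
    """
    written = []
    with open(output_path, "w", encoding=encoding) as out:
        for path in map(Path, input_paths):
            if not path.exists():
                continue  # skip missing files instead of raising
            out.write(path.read_text(encoding=encoding))
            written.append(str(path))
    return written
```

Calling `concatenate_files(['/content/Main_website_data.txt', '/content/Embedded_links_data_2.txt'], 'CSVTU_scraped_data.txt')` would correspond to the cell above.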


## Scraping links nested within the embedded links

So far we have scraped only the main website and its embedded links, but those pages may link to further pages with useful data. So we also extract data from the pages that are linked from the embedded links of the main website.

In [None]:
def web_Scraping_Text(url, visited_link, success_link):

    # Only scrape pages on csvtu.ac.in that have not been visited yet
    if url not in visited_link and 'csvtu.ac.in' in url:

        # Mark the link as visited up front so that it is never requested twice
        visited_link.append(url)

        try:
            # Send an HTTP request to the page
            r = requests.get(url)

            # Create a BeautifulSoup object and parse the HTML content
            soup = BeautifulSoup(r.content, 'lxml')

            success_link.append(url)

            # Extract all the text from the webpage
            text = soup.get_text()

            # Find all the links on the webpage
            links = soup.find_all('a')

            # Append the scraped text to a text file
            with open('Complete_website_data.txt', 'a') as f:
                f.write(text)

            for link in links:

                # Read the href attribute of the anchor tag
                href = link.get('href')

                # Follow only unvisited links that point to csvtu.ac.in
                if href is not None and 'csvtu.ac.in' in href and href not in visited_link:
                    web_Scraping_Text(href, visited_link, success_link)

        except requests.exceptions.ConnectTimeout:
            print(f"Connection to {url} timed out. Check your internet connection.")

        except requests.exceptions.RequestException as e:
            print(f"An error occurred: {e}")

web_Scraping_Text("https://csvtu.ac.in/ew/", [], [])
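
A recursive crawl goes one call deeper for every link it follows, so on a large site it can run into Python's recursion limit (1000 frames by default). One alternative is an iterative breadth-first crawl with a queue and a `set` of visited URLs. The sketch below is not the notebook's method: `crawl`, the `get_links` and `should_visit` callbacks, `max_pages` and the toy link graph are all assumptions, chosen so the traversal logic can be tried without network access.

```python
from collections import deque


def crawl(start_url, get_links, should_visit, max_pages=100):
    """Breadth-first crawl that returns the URLs in the order they were visited.

    get_links(url)    -> iterable of absolute URLs found on that page (hypothetical callback)
    should_visit(url) -> bool, for example a same-domain check (hypothetical callback)
    """
    visited = set()          # set membership tests are O(1), unlike a list
    order = []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited or not should_visit(url):
            continue
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# Toy link graph standing in for real pages
FAKE_SITE = {
    "https://csvtu.ac.in/ew/": ["https://csvtu.ac.in/ew/downloads", "https://example.org/"],
    "https://csvtu.ac.in/ew/downloads": ["https://csvtu.ac.in/ew/"],
}

pages = crawl(
    "https://csvtu.ac.in/ew/",
    get_links=lambda url: FAKE_SITE.get(url, []),
    should_visit=lambda url: "csvtu.ac.in" in url,
)
print(pages)
```

In a real run, `get_links` would request the page, save its text, and return the `href` values found by BeautifulSoup.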

## Installing PyPDF2

We need to install PyPDF2 every time we run this code.

In [None]:
# Installing Library to Read and Extract Text from PDF
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
     232.6/232.6 kB 2.6 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


## Defining a recursive function to scrape text from all embedded pages and embedded PDF files

In [None]:
# Importing Required Libraries
import requests
from bs4 import BeautifulSoup
import PyPDF2

# Declaring visited_link and pdf links as global variables so that all functions can access it
pdf_links, visited_links, success_links = [],[],[]

# Extracting text from a PDF
def extract_text_from_pdf(url):

    # Skip PDFs that were already processed
    if url in visited_links:
        return

    # Mark the link as visited whether or not extraction succeeds
    visited_links.append(url)

    try:
        response = requests.get(url, stream=True)

        # Check that the content type is a PDF
        if 'application/pdf' in response.headers.get('content-type', ''):
            with open("temp_pdf.pdf", 'wb') as temp_pdf:
                temp_pdf.write(response.content)

            text = ""

            with open("temp_pdf.pdf", 'rb') as pdf_file:
                # PyPDF2 3.x removed PdfFileReader, numPages, getPage and extractText;
                # PdfReader, .pages and extract_text() are their replacements
                pdf_reader = PyPDF2.PdfReader(pdf_file)

                for page in pdf_reader.pages:
                    text += page.extract_text() or ""

            save_to_file(text)
            pdf_links.append(url)

    except Exception as e:
        # Report the failure instead of silently ignoring it
        print(f"Could not extract text from {url}: {e}")

def save_to_file(text):
    with open('Complete_website_data.txt', 'a', encoding = 'utf-8') as f:
        f.write(text)

def web_scraping_text(url):

    global pdf_links, visited_links, success_links

    if url not in visited_links and 'csvtu.ac.in' in url:
        try:
            # Requesting website access for the main website
            r = requests.get(url)

            # Create a BeautifulSoup object and parse the HTML content
            soup = BeautifulSoup(r.content, 'lxml')

            # Extract all the text from the webpage
            text = soup.get_text()

            if text is not None and text != '':
              save_to_file(text)
              success_links.append(url)

            # Find all the links on the webpage
            links = soup.find_all('a')

            visited_links.append(url)

            for link in links:
                # Creating an object and storing links
                href = link.get('href')

                # To ensure we are scraping the link which contains data from csvtu only
                if href is not None and 'csvtu.ac.in' in href and href not in visited_links:
                  # Check if the link points to a PDF
                    if href.lower().endswith('.pdf'):
                        extract_text_from_pdf(href)
                    else:
                        web_scraping_text(href)

        except requests.exceptions.ConnectTimeout as e:
            # Appending link to visited_links as we have already visited it
            visited_links.append(url)
            print(f"Connection to {url} timed out. Check your internet connection.")

        except requests.exceptions.RequestException as e:
            visited_links.append(url)
            print(f"An error occurred: {e}")

# Calling function
web_scraping_text("https://csvtu.ac.in/ew/")



KeyboardInterrupt: ignored
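
The check `'csvtu.ac.in' in url` is a substring test, so it also accepts any URL that merely mentions the domain, for example in a query string. A stricter test compares the parsed hostname. The helper name `is_on_domain` is an assumption for illustration.

```python
from urllib.parse import urlparse


def is_on_domain(url, domain="csvtu.ac.in"):
    """Return True only when the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


print(is_on_domain("http://www.csvtu.ac.in/prev/"))
print(is_on_domain("https://example.com/?next=csvtu.ac.in"))
```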

## Viewing the links that were visited and the links from which data was successfully scraped

In [None]:
# Visited Links
print("Visited Links are:\n")
print(visited_links)

# Links from which data was scraped
print("\nLinks from which data was successfully scraped are:\n")
print(success_links)

# PDF links
print("\nPDF links from which data was successfully scraped are:\n")
print(pdf_links)

Visited Links are:

['https://csvtu.ac.in/ew/', 'https://csvtu.ac.in/ew/downloads', 'http://csvtu.ac.in/NSS_CSVTU/', 'https://csvtu.ac.in/ew/csvtu-student-council/', 'https://csvtu.ac.in/ew/contact-us/', 'http://www.csvtu.ac.in/prev/', 'http://www.csvtu.ac.in/prev/index.htm', 'http://www.csvtu.ac.in/prev/vision.htm', 'http://www.csvtu.ac.in/prev/mission.htm', 'http://csvtu.ac.in/prev/AcadmicCalander2014_15.htm', 'http://www.csvtu.ac.in/prev/goals.htm', 'http://www.csvtu.ac.in/prev/objectives.htm', 'http://www.csvtu.ac.in/prev/faculties.htm', 'https://csvtu.ac.in/ew/enrollment-deficiencies/', 'https://csvtu.ac.in/ew/recruitment-notice-for-the-post-of-principal-professor-asst-professor-asso-professor-lecturer-under-statute-19-at-rungta-institute-of-pharmaceutical-sciences-bhilai/', 'https://csvtu.ac.in/ew/public-relations-officer/', 'https://csvtu.ac.in/ew/aicte-quality-improvement-schemeaqis-2021-22-financial-support/', 'https://csvtu.ac.in/ew/m-techm-plan-admissions-2020-at-university-

In [31]:
Visited_file = 'Visited Links.txt'
Success_file = 'Success Links.txt'

# Write every visited link to its own line
with open(Visited_file, 'w') as vlink:
    for link in visited_links:
        vlink.write(link + "\n")

# Write every successfully scraped link to its own line
with open(Success_file, 'w') as slink:
    for link in success_links:
        slink.write(link + "\n")

## Finding Links

In [21]:
# Module to match string
import re

# Read the visited links back from the text file
with open('/content/Visited Links.txt', 'r') as f:
    text = f.read()

# Extract links with a regular expression. The lookahead ends a match at whitespace
# or at the start of the next link, so links that run together are separated
links = re.findall(r'https?://\S+?(?=https?://|\s|$)', text)

# Store the links in a text file, one per line; each link keeps its own http or https scheme
output_file_path = 'links.txt'
with open(output_file_path, 'w') as file:
    for link in links:
        file.write(link + '\n')

print(f"Links extracted and stored in {output_file_path}")

Links extracted and stored in links.txt


In [29]:
visited_links = []
success_links = []

with open('/content/Visited Links.txt', 'r') as f:
  visited_links += f.read().splitlines()

with open('/content/Success Links.txt', 'r') as f:
  success_links += f.read().splitlines()

print("Total Number of Links visited are: ", len(visited_links))
print("\nTotal Number of Links from which data is Successfully Scraped are: ", len(success_links))

Total Number of Links visited are:  3156

Total Number of Links from which data is Successfully Scraped are:  3156


In [30]:
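# Removing PDF links as data from the PDFs was not successfully scraped.
# New lists are built because calling remove() inside a loop over the same list skips elements
visited_links = [link for link in visited_links if not link.endswith('pdf')]

success_links = [link for link in success_links if not link.endswith('pdf')]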

print("Total Number of Links visited are: ", len(visited_links))
print("\nTotal Number of Links from which data is Successfully Scraped are: ", len(success_links))

Total Number of Links visited are:  3080

Total Number of Links from which data is Successfully Scraped are:  3080
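
Removing items from a list while looping over that same list skips the element that follows each removal, so consecutive PDF links may not all be removed. A small illustration with made-up link names:

```python
# Hypothetical links used only for illustration
links = ["a.pdf", "b.pdf", "c.html"]

# Removing inside the loop: "b.pdf" slides into the slot just visited and is skipped
buggy = list(links)
for item in buggy:
    if item.endswith("pdf"):
        buggy.remove(item)

# Building a new list examines every element exactly once
filtered = [item for item in links if not item.endswith("pdf")]

print(buggy)     # ['b.pdf', 'c.html']
print(filtered)  # ['c.html']
```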
