# Problem Statement: Web Scraping News Information from the Dawn News Web Portal

**Background:**

Dawn News is a news platform where users can search for and read news and articles.

**Objective:**

The objective of this project is to create a Python script that scrapes the Dawn News website to extract essential information about the latest news articles (title, author, and publication date) and store it for further analysis or use.


### Import necessary libraries


In [1]:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup

### Define the URL to scrape data


In [2]:
URL = "https://www.dawn.com/latest-news"

### Setting user-agent headers


In [3]:
HEADERS = {"User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'}

### Send an HTTP GET request to the URL


In [13]:
response = requests.get(URL,headers=HEADERS)
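
A request can stall or come back with an error status, so it may help to set a timeout and check the status before parsing. The sketch below wraps the call in a hypothetical helper named `fetch_html`. The helper name, the 10-second timeout, and the injectable `get` parameter (added only so the function can be exercised without network access) are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, headers, timeout=10, get=requests.get):
    """Return the response body as bytes, raising on HTTP error statuses."""
    # The timeout stops the call from waiting indefinitely on a stalled connection
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

With the notebook's variables this could be called as `fetch_html(URL, HEADERS)`.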

### Print the HTTP response object


In [14]:
response

<Response [200]>

### Get the content of the HTTP response


In [15]:
response.content

b'<!DOCTYPE html>\n<html lang="en">\n\n<head>\n        \n    <!-- meta -->\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1">\n    <!--[if IE]> <meta http-equiv="X-UA-Compatible" content="IE=edge" /> <![endif]-->\n    <title>Latest - DAWN.COM</title> \n     <meta name=\'subject\' content=\'Your window to latest news, analysis and features from Pakistan, South Asia and the world.\' /> \n     <meta name=\'description\' content=\'Pakistan&rsquo;s most trusted outlet for the breaking, latest and top news across the country and the world.\' /> \n     <meta property=\'fb:app_id\' content=\'1383068068604634\' />  \n     <meta property=\'og:locale\' content=\'en_US\' /> \n     <link rel=\'canonical\' href=\'https://www.dawn.com/latest-news\' /> \n     <link rel=\'alternate\' type=\'application/rss+xml\' title=\'The Dawn News\' href=\'https://www.dawn.com/feeds/latest-news/\' />  \n     \n     <link rel=\'index\' href=\'https://w

### Parse the HTML content of the response using BeautifulSoup


In [16]:
soup = BeautifulSoup(response.content,"html.parser")

In [8]:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<!-- meta -->
<meta charset="utf-8"/>
<meta content="width=device-width,minimum-scale=1,initial-scale=1" name="viewport"/>
<!--[if IE]> <meta http-equiv="X-UA-Compatible" content="IE=edge" /> <![endif]-->
<title>Latest - DAWN.COM</title>
<meta content="Your window to latest news, analysis and features from Pakistan, South Asia and the world." name="subject">
<meta content="Pakistan’s most trusted outlet for the breaking, latest and top news across the country and the world." name="description">
<meta content="1383068068604634" property="fb:app_id"/>
<meta content="en_US" property="og:locale"/>
<link href="https://www.dawn.com/latest-news" rel="canonical"/>
<link href="https://www.dawn.com/feeds/latest-news/" rel="alternate" title="The Dawn News" type="application/rss+xml"/>
<link href="https://www.dawn.com/latest-news" rel="index"/>
<meta content="https://www.dawn.com/_img/social-default.jpg" name="twitter:image"/>
<meta content="https://www.daw

### Find all links with class "story__link" on the latest news page


In [17]:
links = soup.find_all("a",class_="story__link")
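
To see how the `class_` filter behaves without hitting the site, here is a small sketch on hand-written markup. The HTML snippet and its URLs are made up for illustration; only the `story__link` class name comes from the notebook. It also shows that a CSS selector via `select` finds the same anchors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure of the listing page
sample_html = """
<article><h2><a class="story__link" href="https://www.dawn.com/news/1">First story</a></h2></article>
<article><h2><a class="story__link extra" href="https://www.dawn.com/news/2">Second story</a></h2></article>
<a class="nav__link" href="https://www.dawn.com/latest-news">Latest</a>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any <a> that has "story__link" among its classes
by_find_all = sample_soup.find_all("a", class_="story__link")
# The equivalent CSS selector
by_select = sample_soup.select("a.story__link")

hrefs = [a.get("href") for a in by_find_all]
print(hrefs)
```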

### Print the list of links


In [10]:
links

[<a class="story__link" href="https://www.dawn.com/news/1877696/may-9-riots-sc-constitutional-bench-rejects-govt-plea-allowing-military-courts-to-announce-verdicts">May 9 riots: SC constitutional bench rejects govt plea allowing military courts to announce verdicts</a>,
 <a class="story__link" href="https://www.dawn.com/news/1877700/kp-supervisory-committee-formed-to-restore-law-and-order-in-kurram">KP supervisory committee formed to restore law and order in Kurram</a>,
 <a class="story__link" href="https://www.dawn.com/news/1877697/alexandrine-parrots-disappearing-from-pakistans-skies">Alexandrine parrots disappearing from Pakistan’s skies</a>,
 <a class="story__link" href="https://www.dawn.com/news/1877689/pakistan-ranked-among-lowest-12pc-globally-in-mobile-broadband-internet-speeds-in-oct-report">Pakistan ranked among lowest 12pc globally in mobile, broadband internet speeds in Oct: report</a>,
 <a class="story__link" href="https://www.dawn.com/news/1877688/extreme-heat-endangers-g

### Get the first link from the list


In [18]:
links[0]

<a class="story__link" href="https://www.dawn.com/news/1877696/may-9-riots-sc-constitutional-bench-rejects-govt-plea-allowing-military-courts-to-announce-verdicts">May 9 riots: SC constitutional bench rejects govt plea allowing military courts to announce verdicts</a>

### Extract the href attribute from the first link to get the News URL


In [23]:
link = links[0].get("href")

### Print the extracted News URL


In [22]:
link

'https://www.dawn.com/news/1877696/may-9-riots-sc-constitutional-bench-rejects-govt-plea-allowing-military-courts-to-announce-verdicts'

### Store the News URL (the href is already an absolute URL, so no base URL needs to be prepended)


In [24]:
News_url =  link
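
The links on this page are already absolute, so plain assignment works here. If a page ever returned relative links, `urllib.parse.urljoin` would handle both cases. The sketch below is an assumption-laden illustration: `to_absolute` is a hypothetical helper, and the relative path `/news/1877617` is invented to show the behaviour.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.dawn.com/latest-news"


def to_absolute(href, base=BASE_URL):
    """Resolve href against base; absolute hrefs are returned unchanged."""
    return urljoin(base, href)


# An absolute href passes through untouched
print(to_absolute("https://www.dawn.com/news/1877617"))
# A (hypothetical) relative href is resolved against the base URL
print(to_absolute("/news/1877617"))
```

Both calls print `https://www.dawn.com/news/1877617`.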

### Print the full News URL


In [25]:
News_url

'https://www.dawn.com/news/1877696/may-9-riots-sc-constitutional-bench-rejects-govt-plea-allowing-military-courts-to-announce-verdicts'

### Send a new HTTP GET request to the News URL


In [26]:
new_response = requests.get(News_url,headers=HEADERS)

### Print the new HTTP response object


In [27]:
new_response

<Response [200]>

### Parse the HTML content of the News page using BeautifulSoup


In [28]:
new_soup = BeautifulSoup(new_response.content,"html.parser")

In [53]:
new_soup

<!DOCTYPE html>

<html lang="en">
<head>
<!-- meta -->
<meta charset="utf-8"/>
<meta content="width=device-width,minimum-scale=1,initial-scale=1" name="viewport"/>
<!--[if IE]> <meta http-equiv="X-UA-Compatible" content="IE=edge" /> <![endif]-->
<title>Extreme heat endangers garment factory workers in Pakistan and beyond: study - World - DAWN.COM</title>
<meta content="Your window to latest news, analysis and features from Pakistan, South Asia and the world." name="subject">
<meta content="International Labor Organisation recommends as much rest as work in any given hour to maintain safe core body temperature levels." name="description">
<meta content="1383068068604634" property="fb:app_id"/>
<meta content="en_US" property="og:locale"/>
<link href="https://www.dawn.com/news/1877688" rel="canonical"/>
<link href="https://www.dawn.com/news/amp/1877688" rel="amphtml"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="@dawn_com" name="twitter:site"/>
<meta content="

### Find and print the News title


In [29]:
new_soup.find("h2",class_="story__title text-7.5 font-bold font-playfair-display mt-1 pb-3 border-b border-gray-300 border-solid")

<h2 class="story__title text-7.5 font-bold font-playfair-display mt-1 pb-3 border-b border-gray-300 border-solid" data-id="1877696" data-layout="story" dir="auto"><a class="story__link" href="https://www.dawn.com/news/1877696/may-9-riots-sc-constitutional-bench-rejects-govt-plea-allowing-military-courts-to-announce-verdicts">May 9 riots: SC constitutional bench rejects govt plea allowing military courts to announce verdicts</a></h2>

### Extract and print the News title without unwanted characters


In [30]:
new_soup.find("h2",class_="story__title text-7.5 font-bold font-playfair-display mt-1 pb-3 border-b border-gray-300 border-solid").text

'May 9 riots: SC constitutional bench rejects govt plea allowing military courts to announce verdicts'

In [None]:
# new_soup.find("h2",class_="story__title text-7.5 font-bold font-playfair-display mt-1 pb-3 border-b border-gray-300 border-solid").text.replace("\n","")
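
One caveat about the long class string used above: when `class_` is given several space-separated classes, BeautifulSoup compares it against the exact value of the `class` attribute, so a change in class order or an added utility class would make the lookup return `None`. Matching on a single stable class such as `story__title` is likely more robust. The markup below is a made-up example, not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical heading whose classes appear in a different order
sample_html = (
    '<h2 class="font-bold story__title mt-1">'
    '<a class="story__link" href="#">Sample headline</a></h2>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A multi-class string must equal the attribute value exactly, so this finds nothing
exact = sample_soup.find("h2", class_="story__title font-bold mt-1")
# A single class matches regardless of order or extra classes
single = sample_soup.find("h2", class_="story__title")

print(exact)
print(single.text)
```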

### Find and print the Author Name


In [31]:
new_soup.find("span",class_="story__byline text-3 text-gray-500 font-arial")

<span class="story__byline text-3 text-gray-500 font-arial" dir="auto"><a class="story__byline__link" href="https://www.dawn.com/authors/10094/umer-mehtab">Umer Mehtab</a></span>

In [32]:
new_soup.find("span",class_="story__byline text-3 text-gray-500 font-arial").text

'Umer Mehtab'

### Extract and print the News publication date


In [33]:
new_soup.find("span",class_="timestamp--date")

<span class="timestamp--date">December 9, 2024</span>

In [34]:
new_soup.find("span",class_="timestamp--date").text

'December 9, 2024'
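
The date is scraped as plain text. For sorting or filtering later, it could be parsed into a real date. This is a minimal sketch, assuming the format always looks like `December 9, 2024` and that the process runs under an English locale (month names in `%B` are locale-dependent).

```python
from datetime import datetime

published_text = "December 9, 2024"

# %B = full month name, %d = day of month, %Y = four-digit year
published_date = datetime.strptime(published_text, "%B %d, %Y").date()

print(published_date.isoformat())  # 2024-12-09
```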

In [35]:
# Create an empty list to store news URLs
links_list = []

In [36]:
# Extract news URLs and store them in the list
for link in links:
    links_list.append(link.get("href"))

In [38]:
links_list

['https://www.dawn.com/news/1877696/may-9-riots-sc-constitutional-bench-rejects-govt-plea-allowing-military-courts-to-announce-verdicts',
 'https://www.dawn.com/news/1877700/kp-supervisory-committee-formed-to-restore-law-and-order-in-kurram',
 'https://www.dawn.com/news/1877697/alexandrine-parrots-disappearing-from-pakistans-skies',
 'https://www.dawn.com/news/1877689/pakistan-ranked-among-lowest-12pc-globally-in-mobile-broadband-internet-speeds-in-oct-report',
 'https://www.dawn.com/news/1877688/extreme-heat-endangers-garment-factory-workers-in-pakistan-and-beyond-study',
 'https://www.dawn.com/news/1877685/sc-suspends-ecp-decision-to-disqualify-pml-n-lawmaker-adil-bazai',
 'https://aurora.dawn.com/news/1145250',
 'https://www.dawn.com/news/1877679/south-korea-president-yoon-banned-from-foreign-travel-as-leadership-crisis-deepens',
 'https://www.dawn.com/news/1877677/dozens-of-schools-in-delhi-get-bomb-threats-report',
 'https://www.dawn.com/news/1877617',
 'https://www.dawn.com/news/
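
The list includes `https://aurora.dawn.com/news/1145250`, which lives on a different subdomain; the seventh title in the scraped output further down is empty, which corresponds to this link. One possible way to keep only article pages on `www.dawn.com` and drop repeated links is sketched below. The helper name `is_dawn_article` and the duplicated entry in the sample list are assumptions for illustration.

```python
from urllib.parse import urlparse


def is_dawn_article(url):
    """True for URLs on www.dawn.com whose path starts with /news/."""
    parts = urlparse(url)
    return parts.netloc == "www.dawn.com" and parts.path.startswith("/news/")


sample_links = [
    "https://www.dawn.com/news/1877617",
    "https://aurora.dawn.com/news/1145250",
    "https://www.dawn.com/news/1877617",  # hypothetical duplicate
]

# dict.fromkeys removes duplicates while preserving the original order
article_links = [u for u in dict.fromkeys(sample_links) if is_dawn_article(u)]
print(article_links)
```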

In [39]:
# Create a dictionary to store scraped data
d = {"title":[],"author":[],"published":[]}

In [40]:
# Define a function to extract the news title
def get_title(soup):

    try:
        # Extract the news title and strip newline characters
        title_value = soup.find("h2",class_="story__title text-7.5 font-bold font-playfair-display mt-1 pb-3 border-b border-gray-300 border-solid").text.replace("\n","")
    except AttributeError:
        # soup.find() returns None when the title is missing; fall back to an empty string
        title_value = ""

    return title_value

In [41]:
# Define a function to extract the author name
def get_author(soup):

    try:
        # Extract the author name from the byline
        author_value = soup.find("span",class_="story__byline text-3 text-gray-500 font-arial").text
    except AttributeError:
        # soup.find() returns None when the byline is missing; fall back to an empty string
        author_value = ""

    return author_value

In [42]:
# Define a function to extract the publication date
def get_publish(soup):

    try:
        # Extract the publication date text
        published_value = soup.find("span",class_="timestamp--date").text
    except AttributeError:
        # soup.find() returns None when the date is missing; fall back to an empty string
        published_value = ""

    return published_value
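
The three extractors share the same shape, so they could be folded into one helper that checks for `None` explicitly instead of relying on an exception. This is only a sketch; `get_text` is a hypothetical name, and the sample markup is hand-written.

```python
from bs4 import BeautifulSoup


def get_text(soup, tag, class_name, default=""):
    """Return the stripped text of the first matching element, or default."""
    element = soup.find(tag, class_=class_name)
    if element is None:
        return default
    return element.get_text(strip=True)


# Hand-written sample page with a date but no byline
sample_soup = BeautifulSoup(
    '<span class="timestamp--date">December 9, 2024</span>', "html.parser"
)

print(get_text(sample_soup, "span", "timestamp--date"))
print(repr(get_text(sample_soup, "span", "story__byline")))
```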

In [43]:
# Iterate through the news URLs and scrape data
for link in links_list:

    # The scraped href is already a full URL for the news page
    new_webpage_URL =  link

    # Send an HTTP GET request to the news page and parse the HTML content
    new_response = requests.get(new_webpage_URL,headers=HEADERS)
    new_soup = BeautifulSoup(new_response.content,"html.parser")

    # Call the defined functions to extract and append data to the dictionary
    d["title"].append(get_title(new_soup))
    d["author"].append(get_author(new_soup))
    d["published"].append(get_publish(new_soup))
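
The loop above sends a couple of hundred requests back to back and stops at the first network error. A gentler variant might pause between requests and skip pages that fail. The sketch below is one possible shape, not the notebook's method: `scrape_pages`, its `fetch` and `parse` callables, the one-second default delay, and the injectable `sleep` are all assumptions introduced for illustration.

```python
import time


def scrape_pages(urls, fetch, parse, delay_seconds=1.0, sleep=time.sleep):
    """Fetch and parse each URL, skipping failures and pausing between requests."""
    rows = []
    for url in urls:
        try:
            rows.append(parse(fetch(url)))
        except Exception as exc:
            # Report the failure and move on to the next URL
            print(f"Skipping {url}: {exc}")
        # Pause after every attempt to avoid hammering the server
        sleep(delay_seconds)
    return rows
```

Here `fetch` could be a function that calls `requests.get` with the notebook's `HEADERS`, and `parse` a function that builds a `BeautifulSoup` object and returns the title, author, and publication date.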
    

In [44]:
print(d)

{'title': ['May 9 riots: SC constitutional bench rejects govt plea allowing military courts to announce verdicts', 'KP supervisory committee formed to restore law and order in Kurram', 'Alexandrine parrots disappearing from Pakistan’s skies', 'Pakistan ranked among lowest 12pc globally in mobile, broadband internet speeds in Oct: report', 'Extreme heat endangers garment factory workers in Pakistan and beyond: study', 'SC suspends ECP decision to disqualify PML-N lawmaker Adil Bazai', '', 'South Korea President Yoon banned from foreign travel as leadership crisis deepens', 'Dozens of schools in Delhi get bomb threats: report', 'Damascus is free?', 'CORPORATE WINDOW:  Business in limbo', '', 'Neglecting health and education', '', 'The ongoing internet folly', 'KSE-100 nears 110,000-point mark as shares at PSX trade in green for 9th consecutive session', '', '', 'Beyond personality disorders', 'Syria transition must ensure ‘accountability’ for past crimes: UN', 'Democratic backsliding', '

In [45]:
# Create a DataFrame from the scraped data
df = pd.DataFrame(d)

# Replace empty strings in the "title" column with NaN
# (plain column assignment, which avoids chained-assignment warnings)
df["title"] = df["title"].replace("", np.nan)

# Drop rows with NaN values in the "title" column
df = df.dropna(subset=["title"])
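
Because `dropna` keeps the original row labels, the resulting index has gaps (for example 213, 217, 221 in the output below). If a contiguous index is preferred, `reset_index(drop=True)` can be chained on. The sketch uses a small made-up DataFrame rather than the scraped data.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the scraped dictionary
sample = pd.DataFrame({
    "title": ["Story A", "", "Story C"],
    "author": ["Author 1", "", ""],
    "published": ["December 9, 2024", "", "December 8, 2024"],
})

clean = (
    sample.assign(title=sample["title"].replace("", np.nan))  # "" -> NaN
    .dropna(subset=["title"])                                 # drop rows without a title
    .reset_index(drop=True)                                   # renumber rows 0..n-1
)
print(clean)
```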

In [46]:
df

index,title,author,published
0,May 9 riots: SC constitutional bench rejects g...,Umer Mehtab,"December 9, 2024"
1,KP supervisory committee formed to restore law...,Arif Hayat,"December 9, 2024"
2,Alexandrine parrots disappearing from Pakistan...,Anadolu Agency,"December 9, 2024"
3,Pakistan ranked among lowest 12pc globally in ...,,"December 9, 2024"
4,Extreme heat endangers garment factory workers...,Reuters,"December 9, 2024"
...,...,...,...
213,‘BCCI accepts two-way hybrid model to resolve ...,Mohammad Yaqoob,"December 7, 2024"
217,"Bushra Bibi resurfaces after PTI protest, says...",Arif Hayat,"December 6, 2024"
221,Rethinking Afghan policy,Muhammad Amir Rana,"December 8, 2024"
224,Strategic dilemma,Aisha Khan,"December 8, 2024"


In [47]:
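# Note: without parentheses, `df.count` only displays the bound method (as shown below);
# call `df.count()` to get the number of non-null values per column
df.count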

<bound method DataFrame.count of                                                  title              author  \
0    May 9 riots: SC constitutional bench rejects g...         Umer Mehtab   
1    KP supervisory committee formed to restore law...          Arif Hayat   
2    Alexandrine parrots disappearing from Pakistan...      Anadolu Agency   
3    Pakistan ranked among lowest 12pc globally in ...                       
4    Extreme heat endangers garment factory workers...             Reuters   
..                                                 ...                 ...   
213  ‘BCCI accepts two-way hybrid model to resolve ...     Mohammad Yaqoob   
217  Bushra Bibi resurfaces after PTI protest, says...          Arif Hayat   
221                           Rethinking Afghan policy  Muhammad Amir Rana   
224                                  Strategic dilemma          Aisha Khan   
225                      Milei, Argentina and Pakistan      Shahid Mehmood   

            published  
0    D

In [48]:
# Save the DataFrame to a CSV file
df.to_csv("DawnNews_data.csv", index=False)