This notebook is sample web-crawling code. It collects links to the articles that ThePrint has published on the farm laws topic and stores them in a file on disk.

In [None]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
import time
import random
import pickle
import sys

In [None]:
#Header object
hdr = {'User-Agent': 'Mozilla/5.0'}

In [None]:
#Defining a function that takes a link and returns the soup version of the html page
def link_to_soup(link_pg):
    req = Request(link_pg, headers=hdr)
    page = urlopen(req)
    #Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
    soup_pg = BeautifulSoup(page, 'html.parser')
    return soup_pg

In [None]:
#Defining a function for regex pattern matching
def pattern_finder(pattern, text):
    matches = re.findall(pattern, text, re.IGNORECASE)
    
    if not matches:
        return ["no match"]
    else:
        return matches

In [None]:
#Function to collect links on a search page
def links_in_page(brwsr):
    
    #Getting the BSoup page from the browser object
    BSoup_search_page= BeautifulSoup(brwsr.page_source, 'html.parser')
    
    #Finding the relevant div tags
    news_items = BSoup_search_page.find_all("div", class_ = "td_module_16")
    
    #Initializing an empty list
    all_links_in_page = []
    
    #Looping through all the news items in the page
    for news_it in news_items:
        
        #Finding the link tag (<a>)
        
        heading3 = news_it.find("h3", class_ = "entry-title")
        link = heading3.find("a")
        link_html = str(link.get("href"))
        
        #Extracting the link and appending it to the list
        all_links_in_page.append(link_html)
    
   
    return all_links_in_page
    

Let's use selenium to get a web page that contains the search results for our topic.

In [None]:
#Using firefox to get the web page
browser = webdriver.Firefox()
browser.get("https://theprint.in/?s=farm+laws")

Selenium docs: https://www.selenium.dev/documentation/en/webdriver/locating_elements/

Transferring from selenium to BSoup: https://python-forum.io/thread-695.html
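
As a minimal sketch of that hand-off, the snippet below parses a small hand-written HTML string in place of `browser.page_source`. The markup is an assumption modelled on the class names used in `links_in_page` (`td_module_16`, `entry-title`), not a copy of the live site, and `extract_links` is a hypothetical helper that also skips items with no heading or link.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for browser.page_source (an assumption, not real site markup)
SAMPLE_HTML = """
<div class="td_module_16">
  <h3 class="entry-title"><a href="https://example.com/article-1/">First</a></h3>
</div>
<div class="td_module_16">
  <h3 class="entry-title"><a href="https://example.com/article-2/">Second</a></h3>
</div>
<div class="td_module_16"><p>No heading here</p></div>
"""

def extract_links(html):
    """Return the href of each result heading, skipping malformed items."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("div", class_="td_module_16"):
        heading = item.find("h3", class_="entry-title")
        anchor = heading.find("a") if heading is not None else None
        if anchor is not None and anchor.get("href"):
            links.append(anchor["href"])
    return links

sample_links = extract_links(SAMPLE_HTML)
print(sample_links)
```

With a live Selenium session, the same function could be called as `extract_links(browser.page_source)`.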

In [None]:
#Function to load one page of search results by its URL and collect the article links on it
def button_more_news(brwsr, page_counter):
    
    if page_counter == 1:
        link = "https://theprint.in/?s=farm+laws"
    else:
        link = "https://theprint.in/page/" + str(page_counter) + "/?s=farm+laws"
    
    
    #Fetching the next page of search results
    brwsr.get(link)
    
    #Extracting the news articles in this page
    news_links_list = links_in_page(brwsr)
    
    #Pausing
    sleep(random.randint(3, 15))
    
    #Returning the news links collected
    return news_links_list, brwsr
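
The function above hard-codes the search term in both URLs. A possible generalisation, sketched under the assumption that other search terms follow the same `/page/N/?s=...` scheme, builds the URL from the query text; `search_page_url` and `BASE_URL` are hypothetical names.

```python
from urllib.parse import urlencode

BASE_URL = "https://theprint.in/"

def search_page_url(query, page_number):
    """Build the URL of one search-results page (page numbers start at 1)."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    # The first page has no /page/N/ segment, matching the URLs used in the notebook
    path = "" if page_number == 1 else f"page/{page_number}/"
    # urlencode turns spaces into '+', giving s=farm+laws
    return f"{BASE_URL}{path}?{urlencode({'s': query})}"

first_url = search_page_url("farm laws", 1)
third_url = search_page_url("farm laws", 3)
print(first_url)
print(third_url)
```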


In [None]:
#Looping through the search-result pages and collecting the article links

page_counter = 1                       #Changing this as I'm resuming
links_on_topic = []
num_of_pages = 80                       #This can vary from topic to topic; check manually


while (page_counter <= num_of_pages):

    try:

        #Navigating to the next page
        news_links_list, browser = button_more_news(browser, page_counter)

        #Adding these links to our list
        links_on_topic = links_on_topic + news_links_list

        #Printing the page count
        print("Current page number: ", page_counter)

        #Incrementing the page counter
        page_counter += 1


    except Exception as e:    

        print("Ran into a problem")
        # get the exception information
        error_type, error_obj, error_info = sys.exc_info()      

        #print error info and line that threw the exception                          
        print(error_type, 'Line:', error_info.tb_lineno)
        print("Error object: ", error_obj)
        
        break


print("Number of links on this topic: ", len(links_on_topic))

Current page number:  1
Current page number:  2
Current page number:  3
Current page number:  4
Current page number:  5
Current page number:  6
Current page number:  7
Current page number:  8
Current page number:  9
Current page number:  10
Current page number:  11
Current page number:  12
Current page number:  13
Current page number:  14
Current page number:  15
Current page number:  16
Current page number:  17
Current page number:  18
Current page number:  19
Current page number:  20
Current page number:  21
Current page number:  22
Current page number:  23
Current page number:  24
Current page number:  25
Current page number:  26
Current page number:  27
Current page number:  28
Current page number:  29
Current page number:  30
Current page number:  31
Current page number:  32
Current page number:  33
Current page number:  34
Current page number:  35
Current page number:  36
Current page number:  37
Current page number:  38
Current page number:  39
Current page number:  40
Current p

In [None]:
len(list(set(links_on_topic)))

1475
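
Converting to a `set` counts the unique links but discards their order. If the duplicates are to be dropped while keeping the order in which the search pages listed them, one option is `dict.fromkeys`, sketched here on made-up URLs; `dedupe_keep_order` is a hypothetical helper.

```python
def dedupe_keep_order(links):
    """Drop repeated links, keeping the first occurrence of each."""
    # Dictionaries preserve insertion order (Python 3.7+), so the keys stay in list order
    return list(dict.fromkeys(links))

example_links = [
    "https://example.com/a/",
    "https://example.com/b/",
    "https://example.com/a/",
    "https://example.com/c/",
]
unique_links = dedupe_keep_order(example_links)
print(unique_links)
```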

In [None]:
links_on_topic[-5:]

['https://theprint.in/sg-national-interest/gogoi-the-only-congressman-smiling/543995/',
 'https://theprint.in/sg-writings-on-the-wall/writings-on-the-wall-in-a-tearing-hurry-ana/544001/',
 'https://theprint.in/sg-national-interest/national-interest-first-family-second-nature/544060/',
 'https://theprint.in/sg-national-interest/desi-punch-italian-judy/544275/',
 'https://theprint.in/sg-uncategorized/the-violent-aftermath/544359/']

Now the articles that these links refer to can be extracted.

First, let's save these links to disk.

In [None]:
#Saving the list as a pickle file
with open("links to articles.txt", "wb") as fp:
    pickle.dump(links_on_topic, fp)

In [None]:
#Unpickling the list
with open("links to articles.txt", "rb") as fp:
    links_on_topic = pickle.load(fp)

print("Number of links in this list: ", len(links_on_topic))

Number of links in this list:  1480
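
The cells above pickle the list, so the file is binary despite its `.txt` name. If a human-readable file is preferred, a plain-text layout with one URL per line is a possible alternative. The sketch below is an illustration only: `save_links` and `load_links` are hypothetical helpers, and the round trip runs in a temporary directory on made-up URLs.

```python
import tempfile
from pathlib import Path

def save_links(links, path):
    """Write one URL per line as UTF-8 text."""
    Path(path).write_text("\n".join(links) + "\n", encoding="utf-8")

def load_links(path):
    """Read the URLs back, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

# Round trip in a temporary directory so no real file is overwritten
demo_links = ["https://example.com/article-1/", "https://example.com/article-2/"]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "links.txt"
    save_links(demo_links, demo_path)
    restored_links = load_links(demo_path)

print(restored_links == demo_links)
```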


Let's add these articles to the dataframe.

In [None]:
#Function to find the tags in an article
def tags_finder(tags_html):
    pattern = r">([\w ]+)<"
    tags_matches = pattern_finder(pattern, str(tags_html))
    return tags_matches

In [None]:
#To check if our topic is one of the tags in the article; also extracting the article's title, tag, and description
def topic_article_checker(soup_page, link):    
    
    #Finding the header in the article
    header_obj = soup_page.find("header")
    
    #Getting the article's title
    title_html = header_obj.find("h1", class_ = "entry-title")
    pattern = r">([^<>]+)<"
    title = pattern_finder(pattern, str(title_html))[0]
    
    #Getting the article's description
    desc_html = header_obj.find("h2", class_ = "td-post-sub-title")
    pattern = r">([^<>]+)<"
    desc = pattern_finder(pattern, str(desc_html))[0]

    #The date can also be captured here
    date_html = header_obj.find("span", class_ = "update_date")
    pattern = r">([^<>]+) IST<"
    date = pattern_finder(pattern, str(date_html))[0]
       
    #We can use the article's tags from the website, if the article has tags
    try:
        #Getting the tags from the article
        tags_container = soup_page.find("div", class_ = "td-post-source-tags")
        tags_html = tags_container.find_all("li")
        pattern = r">([^<>]+)<"
        tags = pattern_finder(pattern, str(tags_html))
    
    except AttributeError:
        #This is if the article doesn't have tags (find() returned None, so find_all() failed)
        if ('farm' in title.lower()) or ('farm' in desc.lower()) or ('farm' in link.lower()):
            return True, title, desc, date
        elif ('msp' in title.lower()) or ('msp' in desc.lower()) or ('msp' in link.lower()):
            return True, title, desc, date
        elif ('crop' in title.lower()) or ('crop' in desc.lower()) or ('crop' in link.lower()):
            return True, title, desc, date
        elif ('agri' in title.lower()) or ('agri' in desc.lower()) or ('agri' in link.lower()):
            return True, title, desc, date
        elif ('paddy' in title.lower()) or ('paddy' in desc.lower()) or ('paddy' in link.lower()):
            return True, title, desc, date
        else:
            return False, title, desc, date
    
    
    #If the tags were available, then we can use them to check if the article is on the correct topic
    
    for tag in tags:
        if ('farm' in tag.lower()) or ('msp' in tag.lower()) or ('crop' in tag.lower()) or ('agri' in tag.lower()) or ('paddy' in tag.lower()):
            return True, title, desc, date
        
    #If none of the conditions have been met, then we can say that this article is not on the current topic
    return False, title, desc, date
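
The keyword checks above repeat the same test for each term. A more compact form, shown as a sketch rather than a drop-in replacement, keeps the terms in one tuple and uses `any`; `TOPIC_KEYWORDS` and `mentions_topic` are hypothetical names, and the sample strings are made up.

```python
TOPIC_KEYWORDS = ("farm", "msp", "crop", "agri", "paddy")

def mentions_topic(*texts, keywords=TOPIC_KEYWORDS):
    """Return True if any keyword occurs as a substring of any of the given texts."""
    # Empty or missing strings are skipped; matching is case-insensitive
    haystack = " ".join(t.lower() for t in texts if t)
    return any(keyword in haystack for keyword in keywords)

on_topic = mentions_topic("Farmers continue protest", "", "https://example.com/india/protest/")
off_topic = mentions_topic("Budget goes paperless", "Speech read from a tablet")
print(on_topic, off_topic)
```

Because this is substring matching, "farm" also catches "farmers" and "agri" catches "agriculture", which is the same behaviour as the chained `in` checks.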


In [None]:
#Creating an empty dataframe to store articles
theprint_df = pd.DataFrame(columns=['Title', 'Link', 'Date', 'Tag', 'Article'])
print(theprint_df.shape)
theprint_df.head()

(0, 5)


Unnamed: 0,Title,Link,Date,Tag,Article


In [None]:
#Defining a function to remove leading and trailing white spaces
def remove_whitespace_trail_lead(text):
    
    #Removing trailing whitespace (\s also matches line breaks, so those are removed as well)
    #text = re.sub(r"[ \t]+$", "", text)        #Alternative that retains line breaks
    text = re.sub(r"[ \s]+$", "", text)
    
    #Removing leading whitespace
    #text = re.sub(r"^[ \t]+", "", text)
    text = re.sub(r"^[ \s]+", "", text)
    
    return text
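
Since `\s` covers spaces, tabs, and line breaks, the two substitutions above should behave like Python's built-in `str.strip()` for ordinary text. The check below is a small sanity test on made-up strings, not a proof for every Unicode character; `strip_with_regex` is a hypothetical stand-in for the function above.

```python
import re

def strip_with_regex(text):
    """Remove trailing, then leading, whitespace with regular expressions."""
    text = re.sub(r"\s+$", "", text)
    return re.sub(r"^\s+", "", text)

samples = ["  New Delhi: \n", "\t\nfarm laws", "no padding", "   "]
# Compare the regex version against str.strip() on every sample
all_equal = all(strip_with_regex(s) == s.strip() for s in samples)
print(all_equal)
```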

In [None]:
#Defining a function to take an article's soup page and add the article to the dataframe
def article_adder(soup_page, link):   
    
    #Checking if the article is already in the dataframe
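    #regex=False makes this a literal substring check, so characters such as "?" or "." in a URL are not read as regex syntax
    if not theprint_df['Link'].str.contains(link, regex=False).any():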
        
        #Checking if the article has the relevant topic tag
        check, title, desc, date = topic_article_checker(soup_page, link)
        
        if check:
            
            #Deleting the "contribution" box, if the page has one
            div_obj = soup_page.find("div", class_ = "post_contribute")
            if div_obj is not None:
                div_obj.clear()
            
            #Deleting the "Subscribe to our channels" box, if the page has one
            p_obj = soup_page.find("p", class_ = "postBtm")
            if p_obj is not None:
                p_obj.clear()
            
            
            #Extracting the article text
            #Note that the div's class attribute may list more classes; BeautifulSoup matches on any single class name
            article_body = soup_page.find("div", class_ = "td-post-content")
            
            
            #Finding the paragraphs within the article body
            #The lambda function selects every <p> and <h3> tag, with or without a class
            #find_all(['p', 'h3']) would do the same; the lambda is kept to show that find_all accepts a function
            article_paras = article_body.find_all(lambda x: x.name in ('p', 'h3'))

            
            #Removing the paras that are links to other articles
            #The find() function returns -1 if a substring is not found in the parent string
            article_text_paras = [i for i in article_paras if str(i).lower().find("also read")==-1]
            
            #Removing the paras that contain videos embedded
            article_text_paras = [i for i in article_text_paras if str(i).find("<iframe")==-1]
            
            #Extracting the text from these paras
            pattern = r">([^<>]+)<"
            text_paras = pattern_finder(pattern, str(article_text_paras))
    
            #Removing leading and trailing whitespaces from the text
            text_paras_cleaned = [remove_whitespace_trail_lead(t) for t in text_paras if t != ', ']
            desc_cleaned = [remove_whitespace_trail_lead(desc)]
            

            #Checking for city/location name in the first paragraph
            #A raw string keeps the backslashes intact and avoids invalid-escape warnings
            pattern = r"^[A-Z]{1,15}[A-Za-z\s()\\/]{0,15}:([^<>]*$)"
            result = pattern_finder(pattern, text_paras_cleaned[0])
            
            
            #Checking if the city/location name is in the second paragraph
            if result[0] == "no match":
                                
                #Inserting the result in place of the second paragraph if there is a match in the second paragraph
                result = pattern_finder(pattern, text_paras_cleaned[1])
                if result[0] != "no match":
                    text_paras_cleaned[1] = result[0]
            
            #Inserting the result in place of the first paragraph if there is a match
            else:
                text_paras_cleaned[0] = result[0]
            
            
            #Merging the string items in the list of text paras
            text_paras_joined = ' '.join(text_paras_cleaned)
            
            
            #Making sure that there is a description in the article
            if (desc_cleaned[0] != "no match"):                
                text_paras_fin = desc_cleaned + [text_paras_joined]
            
            #In this branch, there's no description
            else:
                text_paras_fin = [text_paras_joined]
                
            
            #Combining the description (if any) and the article body into one string
            article_text = ''.join(text_paras_fin)            #Joining without a space here
            
            
            #Setting the index value where the new row is to be inserted
            if theprint_df.empty:
                row_id = 0
            else:
                row_id = theprint_df.index[-1]
                row_id += 1


            #Adding this article's details to the dataframe
            theprint_df.loc[row_id] = [title] + [link] + [date] + ['Farm Laws'] + [article_text]
            
            return 1
            
        else:
            print("Article is not about current topic")
            print("Link to article: ", link)
            return 0
        
    else:
        print("Link is already present in the dataframe")
        print("Link to article: ", link)
        return -1
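
The city/location step inside `article_adder` can be tried out in isolation. The sketch below reuses the same pattern with `re.IGNORECASE`, as `pattern_finder` does, and returns the paragraph unchanged when there is no match; `strip_dateline` is a hypothetical helper and the sample sentences are made up.

```python
import re

# Same pattern as in article_adder; IGNORECASE mirrors what pattern_finder applies
DATELINE = re.compile(r"^[A-Z]{1,15}[A-Za-z\s()\\/]{0,15}:([^<>]*$)", re.IGNORECASE)

def strip_dateline(paragraph):
    """Return the text after a leading 'City:' dateline, or the paragraph as-is."""
    match = DATELINE.match(paragraph)
    return match.group(1).strip() if match else paragraph

with_dateline = strip_dateline("New Delhi: Farmers gathered at the border.")
without_dateline = strip_dateline("The three laws were passed in September 2020.")
print(with_dateline)
print(without_dateline)
```

Because the match is case-insensitive, any short opening that ends in a colon within roughly the first 30 characters is treated as a dateline, so a spot check on a few real articles is worthwhile.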


We can extract articles from these links. Let's define a function for this.

In [None]:
def link_to_article(links_to_add):
    
    #Initializing a few variables and lists
    article_counter = 0
    already_present = 0
    other_topics = []
    exceptions_list = []

    #Looping through the links and adding them to the dataframe
    for link in links_to_add:

        try:
            soup_page = link_to_soup(link)

            #Adding the article and collecting the returned value
            check = article_adder(soup_page, link)

            #Tracking the links that were not about our current topic
            if check==0:
                other_topics.append(link)
            elif check==1:
                article_counter += check
            else:
                already_present += 1


                
            sleep(random.randint(3, 15))

            print("Articles added: ", article_counter)


        except Exception as e:    

            #Adding the link to the exceptions list
            exceptions_list.append(link)

            print("Ran into a problem")
            # get the exception information
            error_type, error_obj, error_info = sys.exc_info()      

            #print error info and line that threw the exception                          
            print(error_type, 'Line:', error_info.tb_lineno)
            print("Error object: ", error_obj)

            continue
    
    print("Number of articles added: ", article_counter)
    print("Number of articles that were already in the dataframe: ", already_present)
    
    return other_topics, exceptions_list
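
Links that raise an exception are only collected in `exceptions_list`. One way to give them a second chance, sketched here with a made-up `flaky` function in place of real network calls, is a small retry loop; `retry_links` is a hypothetical helper and not part of the notebook.

```python
def retry_links(links, process, max_rounds=2):
    """Call process(link) on each link, re-trying failures for up to max_rounds passes."""
    remaining = list(links)
    for _ in range(max_rounds):
        still_failing = []
        for link in remaining:
            try:
                process(link)
            except Exception:
                still_failing.append(link)
        remaining = still_failing
        if not remaining:
            break
    # Whatever is left failed on every pass
    return remaining

attempts = {}

def flaky(link):
    """Stand-in for fetching a page: /b/ fails once, /c/ fails every time."""
    attempts[link] = attempts.get(link, 0) + 1
    if link.endswith("/b/") and attempts[link] == 1:
        raise ConnectionError("temporary failure")
    if link.endswith("/c/"):
        raise ValueError("always fails")

unresolved = retry_links(
    ["https://example.com/a/", "https://example.com/b/", "https://example.com/c/"],
    flaky,
)
print(unresolved)
```

In this notebook, `process` could be something like `lambda l: article_adder(link_to_soup(l), l)`, ideally with the same random pause between requests.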

Let's add the articles to our dataframe.

In [None]:
#Using two lists to collect the links that were not added to the dataframe
topic_other_list, exceptions_links_list = link_to_article(links_on_topic[:25])

Article is not about current topic
Link to article:  https://theprint.in/politics/amarinder-meets-sonia-gandhi-says-sidhus-continued-attacks-on-him-reflect-poorly-on-party/712752/
Articles added:  0
Articles added:  1
Article is not about current topic
Link to article:  https://theprint.in/health/how-this-remote-backward-up-district-is-fighting-malnutrition-amid-covid-pandemic/711726/
Articles added:  1
Article is not about current topic
Link to article:  https://theprint.in/politics/uddhav-sonia-have-direct-line-of-communication-says-senas-sanjay-raut-denies-cracks-in-mva/711993/
Articles added:  1
Article is not about current topic
Link to article:  https://theprint.in/india/lok-sabha-passes-bill-allowing-depositors-to-get-insurance-money-in-90-days/712123/
Articles added:  1
Article is not about current topic
Link to article:  https://theprint.in/politics/hopeful-it-panel-will-take-up-pegasus-issue-going-forward-says-shashi-tharoor/711427/
Articles added:  1
Articles added:  2
Artic

In [None]:
theprint_df.shape

(3, 5)

In [None]:
theprint_df.head()

Unnamed: 0,Title,Link,Date,Tag,Article
0,‘Not what we wanted’ — why opposition rejected...,https://theprint.in/politics/not-what-we-wante...,"10 August, 2021 7:26 pm",Farm Laws,Rajya Sabha was adjourned Tuesday as Oppn accu...
1,"‘Mr Modi, come listen to us’: Derek O’Brien sh...",https://theprint.in/politics/mr-modi-come-list...,"8 August, 2021 6:46 pm",Farm Laws,The TMC MP released a three-minute video clip ...
2,"Rahul Gandhi, other opposition parties’ leader...",https://theprint.in/india/rahul-gandhi-other-o...,"6 August, 2021 2:08 pm",Farm Laws,Leaders of 14 opposition parties met at the Pa...


In [None]:
print(theprint_df['Link'][2])
print(theprint_df['Article'][2])

https://theprint.in/india/rahul-gandhi-other-opposition-parties-leaders-visit-kisan-sansad-at-jantar-mantar/710378/
Leaders of 14 opposition parties met at the Parliament House and decided to visit the venue. They didn't speak from the podium nor were they seated on the dais at the gathering. Several leaders of opposition parties, including former Congress chief Rahul Gandhi, on Friday extended their solidarity to protesting farmers and joined their Kisan Sansad at the Jantar Mantar here, saying the three “black” agri laws will have to be withdrawn. Leaders of 14 opposition parties met at Parliament House and then reached the nearby Jantar Mantar to participate in the Kisan Sansad, which began on July 22 to mark over seven months of the farmers’ protests at Delhi’s border points against the laws. The leaders neither spoke from the podium of the Kisan Sansad (farmers’ parliament) nor were they seated on the dais. “Today all opposition parties together decided to support the farmers and 

These articles appear to be fine. Let's go ahead and add the rest of the articles.

In [None]:
#Using two lists to collect the links that were not added to the dataframe
topic_other_list, exceptions_links_list = link_to_article(links_on_topic[25:])

Article is not about current topic
Link to article:  https://theprint.in/india/intruders-a-house-besieged-gunfire-12-hour-jk-encounter-that-killed-indian-army-colonel/415206/
Articles added:  0
Article is not about current topic
Link to article:  https://theprint.in/defence/kashmirs-most-wanted-terrorist-riyaz-naikoo-killed-in-encounter-in-his-pulwama-village/415391/
Articles added:  0
Article is not about current topic
Link to article:  https://theprint.in/india/15-yr-old-boy-killed-in-handwara-encounter-was-differently-abled-and-out-playing-with-friends/414991/
Articles added:  0
Article is not about current topic
Link to article:  https://theprint.in/talk-point/is-pakistan-taking-advantage-of-global-covid-crisis-to-turn-on-terror-tap-against-india/414911/
Articles added:  0
Article is not about current topic
Link to article:  https://theprint.in/defence/24-test-positive-for-coronavirus-in-oncology-ward-of-army-rr-hospital-in-delhi/414902/
Articles added:  0
Article is not about curr

Articles added:  4
Articles added:  5
Articles added:  6
Articles added:  7
Article is not about current topic
Link to article:  https://theprint.in/india/punjab-govt-selling-covid-vaccines-to-private-hospitals-for-profit-alleges-sukhbir-badal/671622/
Articles added:  7
Articles added:  8
Article is not about current topic
Link to article:  https://theprint.in/ani-press-releases/india-pulses-and-grains-association-urges-government-to-assuage-fear-of-traders-about-stock-monitoring-exercise/670847/
Articles added:  8
Article is not about current topic
Link to article:  https://theprint.in/opinion/not-just-modi-govts-tug-of-war-with-social-media-balance-of-power-shifting-from-users-anyway/669199/
Articles added:  8
Articles added:  9
Article is not about current topic
Link to article:  https://theprint.in/yourturn/subscriberwrites-up-elections-2022-development-or-covid-resentment-the-coming-polls-could-give-bjp-a-scare/667861/
Articles added:  9
Articles added:  10
Article is not about cu

Articles added:  30
Article is not about current topic
Link to article:  https://theprint.in/india/how-rss-helped-save-darbar-sahib-twice-and-upheld-hindu-sikh-unity/633599/
Articles added:  30
Articles added:  31
Article is not about current topic
Link to article:  https://theprint.in/india/stones-pelted-at-rakesh-tikaits-cavalcade-in-rajasthans-alwar-4-detained/633222/
Articles added:  31
Article is not about current topic
Link to article:  https://theprint.in/india/rajasthan-law-its-modernising-villages-why-you-may-find-camels-only-in-a-zoo-in-future/632613/
Articles added:  31
Articles added:  32
Article is not about current topic
Link to article:  https://theprint.in/last-laughs/nandigram-goes-to-polls-on-april-fools-day-and-the-hand-of-god-in-west-bengal/632357/
Articles added:  32
Articles added:  33
Article is not about current topic
Link to article:  https://theprint.in/50-word-edit/modi-govt-retaining-2-6-inflation-target-is-wise-rbi-now-must-balance-growth-and-inflation/6321

Articles added:  76
Article is not about current topic
Link to article:  https://theprint.in/world/pakistan-army-spokesperson-asif-ghafoor-has-his-burnol-moment-on-twitter/298473/
Articles added:  76
Article is not about current topic
Link to article:  https://theprint.in/national-interest/modi-has-convinced-the-world-kashmir-is-indias-internal-affair-but-theyre-still-watching/298147/
Articles added:  76
Article is not about current topic
Link to article:  https://theprint.in/plugged-in/modi-shares-headlines-chinmayanand-rape-victim-and-dalit-kids-killing/297049/
Articles added:  76
Article is not about current topic
Link to article:  https://theprint.in/opinion/general-rawat-is-front-runner-for-cds-but-all-he-talks-about-is-cosmetic-changes-in-military/297036/
Articles added:  76
Article is not about current topic
Link to article:  https://theprint.in/world/saudi-arabia-recovering-faster-from-oil-attack-exceeds-own-target-for-restoring-capacity/296974/
Articles added:  76
Article is n

Articles added:  94
Articles added:  95
Article is not about current topic
Link to article:  https://theprint.in/politics/wont-implement-caa-under-any-circumstance-if-voted-to-power-in-assam-says-rahul-gandhi/604985/
Articles added:  95
Articles added:  96
Articles added:  97
Articles added:  98
Article is not about current topic
Link to article:  https://theprint.in/best-of-theprint-icymi/why-pfizers-covid-vaccine-was-not-granted-emergency-use-approval-by-govts-expert-panel/604452/
Articles added:  98
Article is not about current topic
Link to article:  https://theprint.in/politics/after-modi-sitharaman-bats-for-private-sector-says-bjp-always-believed-in-indian-businesses/604523/
Articles added:  98
Articles added:  99
Article is not about current topic
Link to article:  https://theprint.in/opinion/pov/ias-officers-are-not-lazy-babus-time-to-reject-the-colonial-slang/604231/
Articles added:  99
Article is not about current topic
Link to article:  https://theprint.in/national-interest/

Articles added:  191
Articles added:  192
Article is not about current topic
Link to article:  https://theprint.in/india/accounts-of-prasar-bharati-ceo-caravan-actor-sushant-singh-among-those-withheld-by-twitter/596638/
Articles added:  192
Article is not about current topic
Link to article:  https://theprint.in/economy/winners-losers-who-got-what-in-nirmala-sitharamans-budget-2021/596532/
Articles added:  192
Articles added:  193
Articles added:  194
Article is not about current topic
Link to article:  https://theprint.in/india/govt-extends-social-security-benefits-to-platform-and-gig-workers/596345/
Articles added:  194
Article is not about current topic
Link to article:  https://theprint.in/economy/nirmala-sitharaman-reads-out-speech-from-tablet-as-budget-goes-paperless/596276/
Articles added:  194
Article is not about current topic
Link to article:  https://theprint.in/opinion/politically-correct/haryana-cm-khattar-exposes-modi-shahs-biggest-flaw-as-talent-hunters/596028/
Articles 

Articles added:  261
Articles added:  262
Articles added:  263
Articles added:  264
Article is not about current topic
Link to article:  https://theprint.in/politics/why-should-jai-shri-ram-chant-upset-anyone-says-yogi-on-mamata-refusing-to-give-speech/591641/
Articles added:  264
Articles added:  265
Articles added:  266
Articles added:  267
Articles added:  268
Articles added:  269
Articles added:  270
Article is not about current topic
Link to article:  https://theprint.in/opinion/in-indias-job-market-women-have-higher-exit-rate-lower-entry-rate-than-men-study/588645/
Articles added:  270
Articles added:  271
Articles added:  272
Articles added:  273
Articles added:  274
Article is not about current topic
Link to article:  https://theprint.in/politics/congress-says-it-will-have-elected-president-by-june-at-any-cost-after-stormy-cwc-meet/590442/
Articles added:  274
Articles added:  275
Article is not about current topic
Link to article:  https://theprint.in/theprint-otc/90-of-indias

Articles added:  358
Articles added:  359
Articles added:  360
Articles added:  361
Article is not about current topic
Link to article:  https://theprint.in/politics/mamata-indicates-west-bengal-will-implement-pm-kisan-scheme-says-govt-has-sought-farmers-data/579404/
Articles added:  361
Articles added:  362
Article is not about current topic
Link to article:  https://theprint.in/last-laughs/tackling-indias-vaccine-hesitancy-and-modis-world-topper-trophy/579088/
Articles added:  362
Article is not about current topic
Link to article:  https://theprint.in/politics/those-doubting-bharat-biotechs-efficacy-mentally-challenged-says-union-minister-pradhan/579214/
Articles added:  362
Articles added:  363
Articles added:  364
Articles added:  365
Article is not about current topic
Link to article:  https://theprint.in/politics/punjab-cm-slams-power-hungry-bjp-for-using-governors-office-for-own-vested-interests/578918/
Articles added:  365
Articles added:  366
Articles added:  367
Article is n

Articles added:  392
Article is not about current topic
Link to article:  https://theprint.in/politics/i-had-pretty-good-access-to-everybody-surendra-nihal-singh/50200/
Articles added:  392
Article is not about current topic
Link to article:  https://theprint.in/opinion/the-weakening-of-indias-institutions-under-modi-was-seen-at-indiras-time/49696/
Articles added:  392
Article is not about current topic
Link to article:  https://theprint.in/politics/the-rise-of-the-communal-hate-soundtrack-in-india/47672/
Articles added:  392
Article is not about current topic
Link to article:  https://theprint.in/opinion/women-disability-higher-risk-sexual-violence-justice/46777/
Articles added:  392
Article is not about current topic
Link to article:  https://theprint.in/politics/bjp-set-for-major-overhaul-many-states-likely-to-get-new-party-chiefs-ahead-of-ls-polls/46502/
Articles added:  392
Article is not about current topic
Link to article:  https://theprint.in/opinion/toppling-idols-is-just-more

Articles added:  532
Article is not about current topic
Link to article:  https://theprint.in/world/sikh-protesters-hold-rallies-in-us-cities-against-farm-laws-in-india/561561/
Articles added:  532
Articles added:  533
Articles added:  534
Articles added:  535
Articles added:  536
Articles added:  537
Articles added:  538
Articles added:  539
Articles added:  540
Articles added:  541
Articles added:  542
Articles added:  543
Articles added:  544
Articles added:  545
Articles added:  546
Articles added:  547
Article is not about current topic
Link to article:  https://theprint.in/economy/economy-reflating-faster-than-expected-pent-up-demand-not-the-only-factor-says-sitharaman/557823/
Articles added:  547
Articles added:  548
Articles added:  549
Articles added:  550
Article is not about current topic
Link to article:  https://theprint.in/world/trump-administration-accuses-facebook-of-h-1b-visa-abuse/557248/
Articles added:  550
Articles added:  551
Articles added:  552
Articles added:  

Articles added:  649
Article is not about current topic
Link to article:  https://theprint.in/politics/if-they-have-trouble-with-bharat-mata-ki-jai-bihar-has-trouble-with-them-modi-slams-rjd/536131/
Articles added:  649
Articles added:  650
Article is not about current topic
Link to article:  https://theprint.in/india/protection-under-new-land-orders-same-as-himachal-and-uttarakhand-laws-jk-govt-says/535976/
Articles added:  650
Article is not about current topic
Link to article:  https://theprint.in/health/toxic-air-from-farm-fires-could-make-north-indias-covid-fight-deadlier/535904/
Articles added:  650
Articles added:  651
Article is not about current topic
Link to article:  https://theprint.in/yourturn/reader-view-govt-needs-to-encourage-farmers-to-adopt-other-methods-to-tackle-stubble-burning/535429/
Articles added:  651
Article is not about current topic
Link to article:  https://theprint.in/opinion/what-is-jks-roshni-act-and-how-it-enabled-land-loot-in-the-name-of-light/535257/


Articles added:  673
Articles added:  674
Articles added:  675
Article is not about current topic
Link to article:  https://theprint.in/politics/jagans-40-minute-meeting-with-modi-amid-pandemic-fuels-ysrcp-bjp-alliance-buzz-again/518173/
Articles added:  675
Articles added:  676
Articles added:  677
Articles added:  678
Articles added:  679
Articles added:  680
Articles added:  681
Article is not about current topic
Link to article:  https://theprint.in/economy/gst-ibc-mpc-modi-govts-key-economic-reforms-stall-as-coronavirus-disrupts-economy/517170/
Articles added:  681
Articles added:  682
Articles added:  683
Article is not about current topic
Link to article:  https://theprint.in/india/bjp-running-dictatorship-in-the-country-torturing-dalits-the-most-says-mamata-banerjee/516247/
Articles added:  683
Article is not about current topic
Link to article:  https://theprint.in/india/indias-defence-interests-compromised-by-earlier-governments-says-pm-modi/516145/
Articles added:  683
Artic

Articles added:  719
Article is not about current topic
Link to article:  https://theprint.in/features/lessons-from-occupy-wallstreet-to-arab-spring-young-people-must-adopt-mature-activism/458407/
Articles added:  719
Article is not about current topic
Link to article:  https://theprint.in/national-interest/uttar-pradesh-is-indias-broken-heartland-break-it-into-4-or-5-states/458552/
Articles added:  719
Article is not about current topic
Link to article:  https://theprint.in/pageturner/excerpt/i-feed-a-surrendered-naxal-mutton-liquor-track-his-woman-dantewada-is-not-delhi/455724/
Articles added:  719
Articles added:  720
Article is not about current topic
Link to article:  https://theprint.in/world/red-tape-is-the-stealth-weapon-as-us-china-feud-gets-nasty/450697/
Articles added:  720
Articles added:  721
Article is not about current topic
Link to article:  https://theprint.in/world/now-trump-blames-obama-and-biden-for-all-the-crises-hes-facing/448790/
Articles added:  721
Article is n

Articles added:  726
Article is not about current topic
Link to article:  https://theprint.in/plugged-in/front-page-crisis-sos-on-yes-bank-spread-of-coronavirus-delhi-riots/376569/
Articles added:  726
Articles added:  727
Article is not about current topic
Link to article:  https://theprint.in/economy/how-banks-can-help-india-get-its-growth-back-on-track/370403/
Articles added:  727
Article is not about current topic
Link to article:  https://theprint.in/opinion/ahead-of-modi-trump-meet-a-look-at-eight-ongoing-trade-issues-between-us-and-india/367741/
Articles added:  727
Article is not about current topic
Link to article:  https://theprint.in/india/farmers-not-investing-in-cows-due-to-vigilantism-threat-experts-on-cattle-population-decline/364851/
Articles added:  727
Article is not about current topic
Link to article:  https://theprint.in/plugged-in/daddys-girl-mehbooba-in-todays-papers-coronavirus-hits-imports-report-mint-et/362364/
Articles added:  727
...

Articles added:  729
Article is not about current topic
Link to article:  https://theprint.in/plugged-in/dainik-jagran-says-bjps-results-less-than-expected-amar-ujala-on-gangulys-second-innings/311588/
Articles added:  729
Article is not about current topic
Link to article:  https://theprint.in/plugged-in/jk-ganguly-on-page-1-ndtvs-reality-check-on-ncrb-times-now-on-if-govt-wants-to-snoop/309942/
Articles added:  729
Article is not about current topic
Link to article:  https://theprint.in/theprint-otc/selling-psus-fiscal-deficit-short-term-solution-nobel-laureate-abhijit-banerjee/309377/
Articles added:  729
Articles added:  730
Article is not about current topic
Link to article:  https://theprint.in/theprint-essential/why-a-section-of-the-land-acquisition-act-turned-into-a-big-judicial-controversy/305787/
Articles added:  730
Article is not about current topic
Link to article:  https://theprint.in/world/us-china-relationship-quickly-deteriorating/303097/
Articles added:  730
...

Articles added:  732
Article is not about current topic
Link to article:  https://theprint.in/india/budget-2019-live-updates-finance-minister-nirmala-sitharaman-modi-govt/258740/
Articles added:  732
Article is not about current topic
Link to article:  https://theprint.in/india/viral-video-of-dalit-youth-being-thrashed-splits-open-caste-divide-in-haryana-village/255337/
Articles added:  732
Article is not about current topic
Link to article:  https://theprint.in/thought-shot/amitabh-kant-on-closing-ibc-loopholes-arif-m-khan-on-triple-talaq-c-raja-mohan-on-indo-us-ties/254171/
Articles added:  732
Article is not about current topic
Link to article:  https://theprint.in/india/sirsa-police-yet-to-submit-report-on-dera-chief-ram-rahims-parole-plea/254021/
Articles added:  732
Article is not about current topic
Link to article:  https://theprint.in/india/full-text-of-president-kovind-speech-new-india-wants-uninterrupted-accelerated-growth/252615/
Articles added:  732
Articles added:  733
...

Articles added:  742
Article is not about current topic
Link to article:  https://theprint.in/theprint-essential/what-the-fugitive-economic-offender-tag-means-for-vijay-mallya/173995/
Articles added:  742
Articles added:  743
Article is not about current topic
Link to article:  https://theprint.in/opinion/with-rbi-flip-flopping-india-could-see-a-return-of-tycoon-friendly-debtor-regime/172182/
Articles added:  743
Article is not about current topic
Link to article:  https://theprint.in/politics/would-love-it-if-2019-poll-is-rahul-vs-modi-in-presidential-style-arun-jaitley/171487/
Articles added:  743
Article is not about current topic
Link to article:  https://theprint.in/india/governance/up-cops-following-adityanaths-thoko-neeti-to-avoid-transfer-alleges-akhilesh/171008/
Articles added:  743
Article is not about current topic
Link to article:  https://theprint.in/india/governance/bulandshahr-cop-killer-is-also-suspect-in-a-delhi-murder/170470/
Articles added:  743
...

Articles added:  753
Article is not about current topic
Link to article:  https://theprint.in/politics/10-reasons-why-indias-economy-is-in-the-doldrums-according-to-p-chidambaram/68627/
Articles added:  753
Article is not about current topic
Link to article:  https://theprint.in/india/governance/govt-makes-room-for-private-sector-talent-wants-specialists-to-join-ministries-as-joint-secys/68463/
Articles added:  753
Article is not about current topic
Link to article:  https://theprint.in/national-interest/why-narendra-modi-govt-is-seized-with-last-year-panic-like-any-other/68094/
Articles added:  753
Article is not about current topic
Link to article:  https://theprint.in/plugged-in/rbi-increases-repo-rate-after-4-5-years-and-rahul-gandhis-speech-has-started-a-fight/67205/
Articles added:  753
Article is not about current topic
Link to article:  https://theprint.in/india/governance/modi-inaugurates-delhi-meerut-expressway-takes-a-jibe-at-the-opposition/63309/
Articles added:  753
...

In [None]:
#Saving the dataframe and viewing its shape
theprint_df.to_pickle("theprint_farm_laws", compression="zip")
theprint_df.shape

(762, 5)

In [None]:
#To load the pickled dataframe
theprint_df = pd.read_pickle("theprint_farm_laws", compression="zip")
theprint_df.shape

(762, 5)
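
The save and load cells above can also be checked as an explicit round trip. The following is a minimal sketch, assuming a small stand-in dataframe (`toy_df`, not the real `theprint_df`) and a temporary directory instead of the `theprint_farm_laws` file:

```python
import os
import tempfile

import pandas as pd

# Stand-in for theprint_df; the column names mirror two of the columns in this notebook
toy_df = pd.DataFrame({
    "Title": ["A", "B"],
    "Link": ["https://example.com/a", "https://example.com/b"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_frame")

    # The file name has no extension, so the compression is passed explicitly both ways
    toy_df.to_pickle(path, compression="zip")
    restored_df = pd.read_pickle(path, compression="zip")

# equals() is True only if the restored frame matches the original
round_trip_ok = toy_df.equals(restored_df)
print(round_trip_ok)
```

Comparing the restored frame against the original is a slightly stronger check than comparing shapes alone.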

In [None]:
print("Number of articles that were not on this topic: ", len(topic_other_list))
print("Number of links that threw exceptions: ", len(exceptions_links_list))

Number of articles that were not on this topic:  696
Number of links that threw exceptions:  0


Let's review some of the articles that have been added to the dataframe.

In [None]:
#Picking 10 random rows of the data frame to check
sample_df = theprint_df.sample(n=10)

In [None]:
sample_df

Unnamed: 0,Title,Link,Date,Tag,Article
221,Will voluntarily court arrest if police forcib...,https://theprint.in/india/will-voluntarily-cou...,"29 January, 2021 10:55 am",Farm Laws,The Aam Aadmi Party leader's remarks come afte...
142,‘They must know what future holds for them’ — ...,https://theprint.in/india/they-must-know-what-...,"6 February, 2021 6:20 pm",Farm Laws,Former Rajasthan Deputy CM says the Congress i...
227,‘Conspiracy’ to storm Delhi was hatched after ...,https://theprint.in/india/conspiracy-to-storm-...,"29 January, 2021 7:30 am",Farm Laws,In one of their 33 FIRs on tractor rally viole...
115,Farm protests ‘sacred’ but andolan jeevis have...,https://theprint.in/india/farm-protests-sacred...,"10 February, 2021 7:34 pm",Farm Laws,"Speaking in the Lok Sabha, Prime Minister Nare..."
475,How Modi govt can avoid another farmers’ prote...,https://theprint.in/ilanomics/how-modi-govt-ca...,"11 December, 2020 8:30 am",Farm Laws,When changes are introduced through consultati...
648,Haryana Congress stages walkout from assembly ...,https://theprint.in/india/haryana-congress-sta...,"6 November, 2020 8:53 pm",Farm Laws,"As uproar broke out in the Haryana Assembly, C..."
469,"Maoists, Leftists have infiltrated ‘so called’...",https://theprint.in/india/maoists-leftists-hav...,"12 December, 2020 4:04 pm",Farm Laws,"Union Minister Piyush Goyal, who is part of go..."
387,"Article 370, farm laws, NEP — Modi govt plans ...",https://theprint.in/india/article-370-farm-law...,"29 December, 2020 2:00 pm",Farm Laws,The PIB and the Bureau of Outreach Communicati...
357,Modi govt’s proposal shows it wants farmers to...,https://theprint.in/opinion/modi-govts-proposa...,"6 January, 2021 5:02 pm",Farm Laws,The Modi govt has put very little on the negot...
707,‘Suit-boot ki sarkar’ jibe slowed reforms in M...,https://theprint.in/opinion/suit-boot-ki-sarka...,"24 September, 2020 3:48 pm",Farm Laws,The debate against India’s recent agricultural...


In [None]:
print(sample_df.loc[707, 'Link'])
print(sample_df.loc[707, 'Article'])

https://theprint.in/opinion/suit-boot-ki-sarkar-jibe-slowed-reforms-in-modi-govt-that-must-not-repeat-with-farm-bills/509553/
The debate against India’s recent agricultural reforms has been hijacked by protests in the name of poor small farmers.A visibly low-hanging economic reform contains within it several vested interests. The protests we see against three important agricultural reforms recently are, therefore, not surprising. They are one more step in the noisy democracy of India that desperately needs economic reforms on the one hand but gets swept by the tide of political rhetoric on the other. A reform that proposes to increase the prices farmers get for their output by giving them flexibility to sell, with governments continuing to support a base minimum thorough the minimum support price (MSP) within the extant system, should not cost anyone but benefit millions of farmers. And yet, the discourse against India’s recent agricultural reforms has been hijacked by protests in the 
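
Because `sample` draws different rows on each run, a hard-coded index label such as 707 is only valid for the run that produced it. A minimal sketch of one way to make the check repeatable, assuming a stand-in dataframe (`toy_df`) and an arbitrary seed of 42:

```python
import pandas as pd

# Stand-in for theprint_df with 20 rows
toy_df = pd.DataFrame({"Title": [f"article {i}" for i in range(20)]})

# The same random_state yields the same rows on every run
first_draw = toy_df.sample(n=5, random_state=42)
second_draw = toy_df.sample(n=5, random_state=42)
same_rows = first_draw.index.equals(second_draw.index)

# Look a row up via the sample's own index instead of a hard-coded label
first_label = first_draw.index[0]
print(same_rows, first_draw.loc[first_label, "Title"])
```

Taking the label from `first_draw.index` keeps the lookup valid even if the seed or the data changes.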

In [None]:
#Saving the dataframe to disk
theprint_df.to_pickle("theprint_farm_laws", compression="zip")

In [None]:
theprint_df[theprint_df['Article'].str.contains("PTI|IANS", case=True)]

Unnamed: 0,Title,Link,Date,Tag,Article
2,"Rahul Gandhi, other opposition parties’ leader...",https://theprint.in/india/rahul-gandhi-other-o...,"6 August, 2021 2:08 pm",Farm Laws,Leaders of 14 opposition parties met at the Pa...
6,Farmer leaders to meet Mamata Banerjee to seek...,https://theprint.in/india/farmer-leaders-to-me...,"9 June, 2021 5:34 pm",Farm Laws,West Bengal CM's support for the agitation whi...
12,Punjab farmers hoist black flags to mark 6 mon...,https://theprint.in/politics/punjab-farmers-ho...,"26 May, 2021 3:14 pm",Farm Laws,SAD chief Sukhbir Singh Badal also raised a bl...
27,RLD to build memorial for farmers who died dur...,https://theprint.in/politics/rld-to-build-memo...,"17 April, 2021 1:46 pm",Farm Laws,In a bid to resonate with the farming communit...
37,Modi govt official finally let it slip — doubl...,https://theprint.in/opinion/modi-govt-official...,"31 March, 2021 3:19 pm",Farm Laws,"In the five years since it was announced, what..."
56,Farmers ready to continue protest till Modi go...,https://theprint.in/india/farmers-ready-to-con...,"10 March, 2021 1:30 pm",Farm Laws,The farmer leader also rejected allegations by...
111,"Don’t make farm laws prestige issue, draft afr...",https://theprint.in/politics/dont-make-farm-la...,"11 February, 2021 4:34 pm",Farm Laws,Congress leader Pilot said while the party sup...
126,Farmer unions ask govt to fix date for talks a...,https://theprint.in/india/farmer-unions-ask-go...,"8 February, 2021 6:29 pm",Farm Laws,"Farmers, however, objected to Modi's remarks i..."
128,Those wanting business over hunger will be dri...,https://theprint.in/india/those-wanting-busine...,"8 February, 2021 4:05 pm",Farm Laws,The Bharatiya Kisan Union spokesperson's comme...
131,Ensure protesters are allowed to demonstrate p...,https://theprint.in/diplomacy/ensure-protester...,"8 February, 2021 2:40 pm",Farm Laws,During a meeting with India's Ambassador to th...
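
The filter above matches "PTI" or "IANS" anywhere in the text, including inside longer uppercase words. A minimal sketch of a stricter variant that uses word boundaries, assuming made-up article text in a stand-in dataframe (`toy_df`):

```python
import pandas as pd

# Made-up article text for illustration only
toy_df = pd.DataFrame({
    "Article": [
        "Farmers met the minister, PTI reported.",
        "The OPTIONS paper was tabled in the house.",
        "According to IANS, talks will resume.",
        None,
    ]
})

# Plain alternation: also matches the letters PTI inside OPTIONS
plain_mask = toy_df["Article"].str.contains("PTI|IANS", case=True, na=False)

# \b restricts the match to whole words; na=False treats missing text as no match
bounded_mask = toy_df["Article"].str.contains(r"\b(?:PTI|IANS)\b", case=True, na=False)

print(plain_mask.tolist())
print(bounded_mask.tolist())
```

Either way, a mention of an agency in the text does not prove the agency wrote the article, which is why the author check in the next cells looks at the byline instead.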


In [None]:
print(theprint_df.loc[126, 'Link'])
print(theprint_df.loc[126, 'Article'])

https://theprint.in/india/farmer-unions-ask-govt-to-fix-date-for-talks-after-pm-modis-invite-to-resume-dialogue/601265/
Farmers, however, objected to Modi's remarks in Rajya Sabha that a new ‘breed’ of agitators called 'andolan jivi' has emerged, & said that agitation has an important role in a democracy. Farmer unions agitating against the three agri laws on Monday asked the government to fix a date for the next round of talks, soon after Prime Minister Narendra Modi urged them to end their stir and invited them to resume the dialogue. They, however, objected to Prime Minister Modi’s remarks in Rajya Sabha that a new “breed” of agitators called “andolan jivi” has emerged in the country, and said that agitation has an important role in a democracy. Farmer leader Shiv Kumar Kakka, who is a senior member of the Samkyukta Kisan Morcha which is spearheading the ongoing stir, said they are ready for the next round of talks and the government should tell them the date and time of the mee

Let's check whether the articles in the dataframe were written by the wire agencies PTI or IANS.

In [None]:
#Defining a function to check if the author of an article is PTI or IANS
def pti_ians_checker(df):
    
    pattern = r">([^<>]+)<"
    
    pti_articles_counter = 0
    no_author_counter = 0
    
    for index, row in df.iterrows():
        
        #Fetching the soup page
        soup_page = link_to_soup(row['Link'])
        
        try:
            #Getting the article's author's name
            author_html = soup_page.find("a", rel = "author")
            if author_html is None:
                raise ValueError("No author tag on the page")
            author = pattern_finder(pattern, str(author_html))[0]
            
            #Checking if PTI or IANS is the author of the article
            if ("pti" in author.lower()) or ("ians" in author.lower()):
                pti_ians_articles_list.append(row['Link'])
                pti_articles_counter += 1
                print("Number of PTI/IANS articles: ", pti_articles_counter)
                
        
        #A missing author tag is counted here instead of being silently skipped
        except Exception:
            
            print("Author not found in this link: ", row['Link'])
            no_author_counter += 1
        
        #Pausing
        sleep(random.randint(3, 15))
    
    print("Total number of PTI/IANS articles: ", pti_articles_counter)
    print("Total number of articles where the author was not found: ", no_author_counter)
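
The substring test `"pti" in author.lower()` also fires for bylines that merely contain those letters. A minimal sketch of a stricter check that compares whole words, assuming a hypothetical helper `is_wire_author` and a made-up byline ("Tripti Sharma") for illustration:

```python
import re

# Agency names to look for; taken from the check in pti_ians_checker
WIRE_AGENCIES = {"pti", "ians"}

def is_wire_author(author):
    """Return True if any whole word in the byline is a wire agency name."""
    words = re.findall(r"[a-z]+", author.lower())
    return any(word in WIRE_AGENCIES for word in words)

# A made-up byline that contains the letters "pti" without being the agency
sample_byline = "Tripti Sharma"
substring_hit = "pti" in sample_byline.lower()
whole_word_hit = is_wire_author(sample_byline)

print(substring_hit, whole_word_hit)
print(is_wire_author("PTI"), is_wire_author("IANS"))
```

Whether such bylines actually occur in this dataset is not known from the notebook, so this is a precaution and does not correct any known error.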


In [None]:
#Initializing an empty list to collect these articles
pti_ians_articles_list = []

In [None]:
#Checking if the articles in the dataframe are from PTI or IANS
pti_ians_checker(theprint_df[:5])

In [None]:
pti_ians_articles_list

['https://theprint.in/politics/mr-modi-come-listen-to-us-derek-obrien-shares-oppns-message-on-pegasus-farmers-issue/711400/',
 'https://theprint.in/india/rahul-gandhi-other-opposition-parties-leaders-visit-kisan-sansad-at-jantar-mantar/710378/']

Let's check the remaining articles in the dataframe.

In [None]:
#Checking if the articles in the dataframe are from PTI or IANS
pti_ians_checker(theprint_df[5:])

Number of PTI/IANS articles:  1
Number of PTI/IANS articles:  2
Number of PTI/IANS articles:  3
Number of PTI/IANS articles:  4
Number of PTI/IANS articles:  5
Number of PTI/IANS articles:  6
Number of PTI/IANS articles:  7
Number of PTI/IANS articles:  8
Number of PTI/IANS articles:  9
Number of PTI/IANS articles:  10
Number of PTI/IANS articles:  11
Number of PTI/IANS articles:  12
Number of PTI/IANS articles:  13
Number of PTI/IANS articles:  14
Number of PTI/IANS articles:  15
Number of PTI/IANS articles:  16
Number of PTI/IANS articles:  17
Number of PTI/IANS articles:  18
Number of PTI/IANS articles:  19
Number of PTI/IANS articles:  20
Number of PTI/IANS articles:  21
Number of PTI/IANS articles:  22
Number of PTI/IANS articles:  23
Number of PTI/IANS articles:  24
Number of PTI/IANS articles:  25
Number of PTI/IANS articles:  26
Number of PTI/IANS articles:  27
Number of PTI/IANS articles:  28
Number of PTI/IANS articles:  29
Number of PTI/IANS articles:  30
...

Number of PTI/IANS articles:  246
Number of PTI/IANS articles:  247
Number of PTI/IANS articles:  248
Number of PTI/IANS articles:  249
Number of PTI/IANS articles:  250
Number of PTI/IANS articles:  251
Number of PTI/IANS articles:  252
Number of PTI/IANS articles:  253
Number of PTI/IANS articles:  254
Number of PTI/IANS articles:  255
Number of PTI/IANS articles:  256
Number of PTI/IANS articles:  257
Number of PTI/IANS articles:  258
Number of PTI/IANS articles:  259
Number of PTI/IANS articles:  260
Total number of PTI/IANS articles:  260
Total number of articles where the author was not found:  0


In [None]:
len(pti_ians_articles_list)

262

The list holds 262 links: the 260 found in this run plus the 2 found earlier in the first five rows.

In [None]:
#Let's select a few random links to check (random.sample avoids picking the same link twice)
random.sample(pti_ians_articles_list, k=10)

['https://theprint.in/politics/amarinder-singh-slams-kejriwal-over-farm-law-notification-calls-him-sneaky-little-fellow/556367/',
 'https://theprint.in/world/uk-mps-seek-ministerial-intervention-in-farmer-protests-in-india/558301/',
 'https://theprint.in/india/punjab-youth-congress-to-burn-pm-modis-effigy-on-dussehra-in-protest-against-farm-bills/530399/',
 'https://theprint.in/india/pm-modi-pays-surprise-visit-to-delhis-gurudwara-rakab-ganj-amid-punjab-farmers-led-protests/570371/',
 'https://theprint.in/politics/haryanas-rampal-majra-quits-bjp-over-farmer-protests-alleges-centre-trying-to-sabotage-them/594339/',
 'https://theprint.in/politics/cant-block-roads-like-this-democracy-not-for-such-things-cm-khattar-on-farmer-protests/570853/',
 'https://theprint.in/india/governance/centre-asks-states-to-tighten-security-ensure-peace-in-advisory-on-bharat-bandh/562178/',
 'https://theprint.in/india/we-understand-why-farmers-are-at-border-govt-offer-still-on-the-table-says-sitharaman/596773/

Let's drop the articles authored by PTI or IANS from the dataframe.

In [None]:
#Dropping these articles
print("Number of articles before dropping", theprint_df.shape)
theprint_df = theprint_df[~theprint_df['Link'].isin(pti_ians_articles_list)]
print("Number of articles after dropping", theprint_df.shape)

Number of articles before dropping (762, 5)
Number of articles after dropping (500, 5)
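
The drop from 762 to 500 rows matches the 262 links in the list. The same filtering pattern can be sketched on a small scale, assuming a stand-in dataframe (`toy_df`) and made-up example.com links:

```python
import pandas as pd

# Stand-in for theprint_df
toy_df = pd.DataFrame({
    "Link": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "Title": ["A", "B", "C"],
})

# Duplicate entries in the list are harmless: isin only tests membership
links_to_remove = ["https://example.com/b", "https://example.com/b"]

# ~ inverts the mask, keeping the rows whose link is NOT in the list
kept_df = toy_df[~toy_df["Link"].isin(links_to_remove)]
removed_count = len(toy_df) - len(kept_df)

print(removed_count, kept_df["Title"].tolist())
```

Comparing `removed_count` with the number of distinct links in the list is a quick way to confirm that every link matched a row.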


Next, let's check whether the word 'farm' is present in each article in the dataframe, and list the articles where it is missing.

In [None]:
theprint_df[~theprint_df['Article'].str.contains('farm', case=False)]

Unnamed: 0,Title,Link,Date,Tag,Article
208,"90% of milk in India comprises A2 protein, say...",https://theprint.in/best-of-theprint-icymi/90-...,"30 January, 2021 3:39 pm",Farm Laws,"A selection of the best news reports, analysis..."
688,Modi govt on right path on agriculture and lab...,https://theprint.in/opinion/modi-govt-on-right...,"2 October, 2020 8:54 am",Farm Laws,Modi govt unwisely ignoring the central diffic...
711,Modi govt’s hasty passage of farm Bills shows ...,https://theprint.in/opinion/modi-govts-hasty-p...,"22 September, 2020 11:42 am",Farm Laws,"In the current Lok Sabha, 17 Bills have been r..."
728,Nirmala Sitharaman sports a different mask eac...,https://theprint.in/in-pictures/nirmala-sithar...,"17 May, 2020 7:16 pm",Farm Laws,In a series of press conferences over the past...
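
The cell above tests for the word 'farm' only. A minimal sketch of how a second keyword such as 'law' could be checked in the same pass, assuming made-up article text in a stand-in dataframe (`toy_df`):

```python
import pandas as pd

# Made-up article text for illustration only
toy_df = pd.DataFrame({
    "Article": [
        "Farmers protest the new farm laws.",
        "The stimulus package targets the agricultural sector.",
        "A new law on land acquisition was debated.",
    ]
})

# case=False ignores capitalisation; na=False treats missing text as no match
has_farm = toy_df["Article"].str.contains("farm", case=False, na=False)
has_law = toy_df["Article"].str.contains("law", case=False, na=False)

# Rows that are missing at least one of the two words
missing_either = toy_df[~(has_farm & has_law)]
print(missing_either.index.tolist())
```

As in the notebook, the flagged rows are only candidates: each one still needs a manual read before it is dropped.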


In [None]:
#Initializing an empty list to collect the indices of articles to be dropped
articles_to_drop = []

In [None]:
print(theprint_df.loc[728, 'Link'])
print(theprint_df.loc[728, 'Article'])

https://theprint.in/in-pictures/nirmala-sitharaman-sports-a-different-mask-each-day-at-rs-20-lakh-crore-package-briefings/423671/
In a series of press conferences over the past five days, Finance Minister Nirmala Sitharaman unveiled details of the Rs 20 lakh crore stimulus package announced by PM Modi last week. Finance Minister Nirmala Sitharaman faced the media every day over the last five days as she explained details of Prime Minister Modi’s Rs 20-lakh-crore package announced last week to revamp the economy hit by the coronavirus pandemic. As the clock struck 4 pm, Sitharaman took her place at the National Media Centre in Delhi elaborating on the packages for MSMEs, reforming the agricultural sector and throwing open the defence and coal sectors. She came masked every day, a different one each day — some matching with her sari, some contrasting with it. 


In [None]:
articles_to_drop.append(728)

In [None]:
#Dropping these articles
print("Number of articles before dropping", theprint_df.shape)
theprint_df.drop(articles_to_drop, inplace=True)
print("Number of articles after dropping", theprint_df.shape)

Number of articles before dropping (500, 5)
Number of articles after dropping (498, 5)
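
Note that `drop` works on index labels, not row positions, and raises a `KeyError` if the cell is run a second time after the labels are gone. A minimal sketch of a variant that is safe to re-run, assuming a stand-in dataframe (`toy_df`) whose labels are borrowed from the table above:

```python
import pandas as pd

# Index labels deliberately differ from row positions, as in a filtered dataframe
toy_df = pd.DataFrame({"Title": ["A", "B", "C"]}, index=[208, 688, 728])

labels_to_drop = [728]

# Returns a new frame without the listed labels; toy_df itself is unchanged
trimmed_df = toy_df.drop(index=labels_to_drop)
print(trimmed_df.index.tolist())

# errors="ignore" skips labels that are already gone, so re-running does not fail
trimmed_again = trimmed_df.drop(index=labels_to_drop, errors="ignore")
print(trimmed_again.shape)
```

Returning a new frame instead of using `inplace=True` also keeps the original available for comparison.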


In [None]:
#Saving the dataframe to disk
theprint_df.to_pickle("theprint_farm_laws", compression="zip")

In [None]:
#Saving the links of the PTI/IANS articles in a file (pickled, so the file is binary despite the .txt name)
with open("PTI links.txt", "wb") as fp:
    pickle.dump(pti_ians_articles_list, fp)

In [None]:
#Unpickling the list to check
with open("PTI links.txt", "rb") as fp:
    pti_ians_articles_list = pickle.load(fp)

print("Number of links in this list: ", len(pti_ians_articles_list))

Number of links in this list:  262
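
Since the file above holds pickled bytes, it cannot be read in a text editor despite its `.txt` name. If a human-readable file is preferred, one option is to write one link per line. A minimal sketch, assuming a temporary directory and made-up example.com links in place of `pti_ians_articles_list`:

```python
import os
import tempfile

# Made-up links standing in for pti_ians_articles_list
links = ["https://example.com/a", "https://example.com/b"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "links.txt")

    # One link per line, readable in any text editor
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("\n".join(links) + "\n")

    # Reading the links back, skipping blank lines
    with open(path, encoding="utf-8") as fp:
        restored_links = [line.strip() for line in fp if line.strip()]

print(restored_links == links)
```

Pickle remains the simpler choice when the list is only ever read back by Python.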
