In [None]:
This notebook contains sample data-cleaning code. It removes some of the unnecessary text that appears in articles published by ThePrint.

In [None]:
import pandas as pd
import re
import random
import pickle
import sys

Let's load the dataframe from disk.

In [None]:
#To load the pickled dataframe
theprint_df = pd.read_pickle("theprint_dataframe", compression="zip")
theprint_df.shape

(2883, 7)

In [None]:
theprint_df.head(3)

Unnamed: 0,Title,News_Outlet,Link,Date,Authors,Topic,Article
0,This Mohali village has no mobile or internet ...,theprint,https://theprint.in/health/in-mohali-poor-heal...,2021-06-06,Ananya Bhardwaj,farm laws,The Mohali administration has put up all 341 v...
1,"AAP supports farmers’ Bharat Bandh call too, a...",theprint,https://theprint.in/india/aap-supports-farmers...,2020-12-06,ANI,farm laws,Delhi Environment Minister and AAP leader Gopa...
3,"Modi’s boat ride got the coverage, but on TV, ...",theprint,https://theprint.in/opinion/telescope/modis-bo...,2020-12-03,Shailaja Bajpai,farm laws,When a Zee News reporter tried to explain a ga...


### 1. Dropping articles from agencies

In [None]:
#Checking the authors in the dataframe
theprint_df['Authors'].value_counts()[:50]

PTI                    812
Azaan Javaid           238
ThePrint Team          107
Snehesh Alex Philip     83
Nayanima Basu           81
Ananya Bhardwaj         63
Fatima Khan             58
Chitleen K Sethi        56
ANI                     55
Neelam Pandey           51
Sravasti Dasgupta       34
Shanker Arnimesh        31
Moushumi Das Gupta      30
Revathi Krishnan        30
Kairvy Grewal           28
Jyoti Malhotra          27
Apoorva Mandhani        27
Shivam Vij              27
Bismee Taskin           27
Amrita Nayak Dutta      27
Taran Deol              26
Simrin Sirur            24
Debayan Roy             24
Sanya Dhingra           24
Bhadra Sinha            22
Unnati Sharma           22
Ruhi Tewari             22
Yogendra Yadav          21
Shekhar Gupta           21
Pia Krishnankutty       21
Maneesh Chhibber        21
Samyak Pandey           20
Rohini Swamy            20
Ritika Jain             20
D.K. Singh              19
Madhuparna Das          18
Deeksha Bhardwaj        15
S

In [None]:
theprint_df[theprint_df['Authors'].str.contains('PTI|ANI', case=True)].shape

(867, 7)

Let's drop these rows.

In [None]:
#Getting the indices of the rows that are to be dropped
rows_to_drop = theprint_df.index[theprint_df['Authors'].str.contains('ANI|PTI', case=True)]
len(rows_to_drop)

867

In [None]:
#Dropping these rows
print("Number of rows before dropping: ", theprint_df.shape[0])
theprint_df.drop(rows_to_drop, inplace=True)
print("Number of rows after dropping: ", theprint_df.shape[0])

Number of rows before dropping:  2883
Number of rows after dropping:  2016
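
The filter above is a substring match, so `'PTI|ANI'` flags any author string that contains those capital letters anywhere. A minimal sketch, using invented author strings (only `PTI` and `ANI` come from this notebook), of how word boundaries would narrow the match. The same regex string could be passed to `str.contains`; whether that is appropriate depends on how agency credits are written in the data.

```python
import re

# Invented author strings for illustration; only 'PTI' and 'ANI' come from the notebook.
authors = ["PTI", "ANI", "RANI Sharma", "Reporter with PTI inputs", "Ananya Bhardwaj"]

# Plain alternation: matches the letters anywhere inside the string.
substring_pattern = re.compile(r"PTI|ANI")

# Word boundaries: matches only when the agency name stands alone as a word.
whole_word_pattern = re.compile(r"\b(?:PTI|ANI)\b")

substring_hits = [a for a in authors if substring_pattern.search(a)]
whole_word_hits = [a for a in authors if whole_word_pattern.search(a)]

print("Substring match: ", substring_hits)
print("Whole-word match:", whole_word_hits)
```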


In [None]:
theprint_df.head(3)

Unnamed: 0,Title,News_Outlet,Link,Date,Authors,Topic,Article
0,This Mohali village has no mobile or internet ...,theprint,https://theprint.in/health/in-mohali-poor-heal...,2021-06-06,Ananya Bhardwaj,farm laws,The Mohali administration has put up all 341 v...
3,"Modi’s boat ride got the coverage, but on TV, ...",theprint,https://theprint.in/opinion/telescope/modis-bo...,2020-12-03,Shailaja Bajpai,farm laws,When a Zee News reporter tried to explain a ga...
4,"Improve farm inputs, equip panchayats to verif...",theprint,https://theprint.in/opinion/improve-farm-input...,2020-12-07,Srijan Pal Singh,farm laws,In my project to evolve the PURA policy with f...


### 2. Removing unnecessary text from the article

#### '(Edited by ...)'

In [None]:
#Checking for a pattern that needs to be removed from the article
theprint_df[theprint_df['Article'].str.contains(r'\(edited by', case=False)].shape

(46, 7)

In [None]:
#Checking the indices of the articles that contain this pattern
theprint_df.index[theprint_df['Article'].str.contains(r'\(edited by', case=False)]

Int64Index([   0,   15,   47,  102,  189,  190,  235,  278,  327,  360,  419,
             442,  503,  518,  521,  529,  614,  680, 1001, 1122, 1166, 1295,
            1349, 1432, 1454, 1466, 1470, 1586, 1624, 1668, 1677, 1700, 1794,
            1949, 1956, 2432, 2522, 2529, 2640, 2702, 2757, 2819, 2852, 2952,
            3070, 3193],
           dtype='int64')

In [None]:
print(theprint_df['Link'][15])
print(theprint_df['Article'][15])

https://theprint.in/india/parliament-panel-comprising-tmc-aap-mps-who-opposed-farm-laws-bats-for-one-of-3-laws/625346/
The Standing Committee on Food, chaired by TMC MP Sudip Bandyopadhyay, recommended the government to implement the Essential Commodities (Amendment) Act, 2020. A Parliament Standing  Committee, comprising MPs of parties that have been vehemently opposing the three new farm laws, has asked for implementation of Essential Commodities (Amendment) Act, 2020 — one of the three controversial laws — in “letter and spirit”.  The Standing Committee on Food, headed by Trinamool Congress MP Sudip Bandyopadhyay, has members such as Bhagwant Mann from AAP, who had even raised slogans against the laws in the presence of Prime Minister Narendra Modi. In its report tabled in the Lok Sabha Friday, the panel recommended the government to implement the Essential Commodities (Amendment) Act, 2020.  It said, “There is a need to create an environment based on ease of doing business and for 

In [None]:
#Defining a function to remove a substring from a string
def string_pattern_remover_1(ip_string, pattern):
        
    #Checking if the string contains the pattern text
    if 'Edited by' in ip_string:
        
        #Getting the position of the pattern in the string
        pattern_pos = ip_string.index('Edited by')
        
        #Calculating the position of the pattern in terms of percentage of string length
        pos_percentage = pattern_pos*100/len(ip_string)
        
        #If the pattern is towards the end of the string
        if pos_percentage > 85:
            
            #Removing the pattern from the string
            op_string = re.sub(pattern, "", ip_string, flags=re.I)
            
            return op_string
    
    #Returning the string unchanged when the pattern is absent or not near the end
    return ip_string


In [None]:
#Defining a regex that matches the '(Edited by ...)' credit line
pattern_1 = r"\([^()]{0,2}edited by[^()]{1,30}\)"

In [None]:
#Applying the function to the articles
theprint_df['Article'] = theprint_df['Article'].apply(lambda x: string_pattern_remover_1(x, pattern_1))

In [None]:
#Checking again for the pattern
theprint_df[theprint_df['Article'].str.contains(r'\(edited by', case=False)].shape

(0, 7)
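
The position check used above (strip the credit only when it starts in the last 15% of the article) can be tried in isolation. The following is a minimal sketch, not the notebook's own function: `remove_if_near_end`, its `min_start_pct` parameter, and the sample strings are hypothetical, and it locates the credit with the regex itself rather than with a hard-coded marker.

```python
import re

def remove_if_near_end(text, pattern, min_start_pct=85):
    """Remove the first regex match only when it starts in the final part of the text."""
    match = re.search(pattern, text, flags=re.I)
    if match is None or not text:
        return text
    # Position of the match as a percentage of the text length
    start_pct = match.start() * 100 / len(text)
    if start_pct > min_start_pct:
        return text[:match.start()] + text[match.end():]
    return text

# Same regex as pattern_1 in this notebook
credit_pattern = r"\([^()]{0,2}edited by[^()]{1,30}\)"

# Invented filler text and editor name, for illustration only
body = "Sentence about the farm laws. " * 10
tail_credit = body + "(Edited by A. Person)"
early_credit = "(Edited by A. Person) " + body

cleaned_tail = remove_if_near_end(tail_credit, credit_pattern)
cleaned_early = remove_if_near_end(early_credit, credit_pattern)

print(cleaned_tail == body)           # credit at the end is removed
print(cleaned_early == early_credit)  # credit at the start is left alone
```

Restricting removal to the tail guards against deleting a phrase that happens to appear inside the article body.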

#### 'Views are personal'

In [None]:
#Checking for a pattern that needs to be removed from the article
theprint_df[theprint_df['Article'].str.contains('views are personal', case=False)].shape

(302, 7)

In [None]:
#Checking the indices of the articles that contain this pattern
views_list = list(theprint_df.index[theprint_df['Article'].str.contains('views are personal', case=False)])
views_list[:5]

[3, 4, 5, 22, 40]

In [None]:
#Defining a function to remove a substring from a string
def string_pattern_remover_2(ip_string, pattern):
        
    #Checking if the string contains the pattern text
    if 'Views are personal' in ip_string:
        
        #Getting the position of the pattern in the string
        pattern_pos = ip_string.index('Views are personal')
        
        #Calculating the position of the pattern in terms of percentage of string length
        pos_percentage = pattern_pos*100/len(ip_string)
        
        #If the pattern is towards the end of the string
        if pos_percentage > 85:
            
            #Removing the pattern from the string
            op_string = re.sub(pattern, "", ip_string, flags=re.I)
            
            return op_string
    
    #Returning the string unchanged when the pattern is absent or not near the end
    return ip_string


In [None]:
#Defining a regex that matches the disclaimer text
pattern_2 = r"views are personal"

In [None]:
#Applying the function to the articles
theprint_df['Article'] = theprint_df['Article'].apply(lambda x: string_pattern_remover_2(x, pattern_2))

In [None]:
#Checking again for the pattern
theprint_df[theprint_df['Article'].str.contains('views are personal', case=False)].shape

(7, 7)

In [None]:
#Checking the indices of the articles that contain this pattern
theprint_df.index[theprint_df['Article'].str.contains('views are personal', case=False)]

Int64Index([759, 841, 1199, 1206, 1393, 1480, 2694], dtype='int64')

In [None]:
print(theprint_df['Link'][2694])
print(theprint_df['Article'][2694])

https://theprint.in/talk-point/congress-dmk-tmc-cpim-appeal-on-caa-is-office-of-the-president-still-relevant/337296/
As public outcry over the Citizenship Amendment Act continues, representatives of 12 opposition parties met President Ram Nath Kovind on Tuesday.As public outcry over the Citizenship Amendment Act continues, representatives of 12 opposition parties met President Ram Nath Kovind Tuesday, requesting him to recommend the Narendra Modi government to take back the “unconstitutional and divisive law”. Congress, CPI(M), TMC, DMK, SP, TMC were some of the parties that sent their delegates. ThePrint asks:  Congress, DMK, TMC, CPI appeal on CAA: Is office of the President still relevant? President’s office still relevant, especially in case of constitutional crisis. CAA is by no means such a crisis Makarand R. Paranjape   
 Director, Indian Institute of Advanced Study 
 The President’s office is still relevant, especially in the case of a constitutional crisis. The brouhaha over t

In [None]:
#Defining a function to remove a literal phrase from a string, ignoring case
def string_pattern_remover_25(ip_string, pattern):
                    
    #Escaping the phrase so that characters such as '.' are matched literally
    op_string = re.sub(re.escape(pattern), "", ip_string, flags=re.I)

    return op_string


In [None]:
#Making a list of the phrase variants to remove, most specific first
patterns_list = ['The views are personal.', 'Author’s views are personal', 'His views are personal.',
                 'Views are personal']

In [None]:
#Making a list of the indices of these rows
row_indices = list(theprint_df.index[theprint_df['Article'].str.contains('views are personal', case=False)])
len(row_indices)

7

In [None]:
#Looping through this list of patterns
for p in patterns_list:
    
    for i in row_indices:
        
        #Applying the function with these combinations      
        theprint_df.loc[i, 'Article'] = string_pattern_remover_25(theprint_df.loc[i, 'Article'], p)


In [None]:
#Checking again for the pattern
theprint_df[theprint_df['Article'].str.contains('views are personal', case=False)].shape

(0, 7)
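
When several variants of a phrase are removed one after another, the order matters: removing the shortest variant first would leave fragments such as "The ." behind. A small sketch of the idea, assuming the phrases are meant as literal text; `remove_phrases` and the sample sentence are hypothetical, and the final `strip()` is an extra step that the notebook does not perform.

```python
import re

# Phrases taken from patterns_list in this notebook
phrases = ["The views are personal.", "Author’s views are personal",
           "His views are personal.", "Views are personal"]

def remove_phrases(text, phrases):
    """Remove each phrase as literal text, ignoring case, longest phrase first."""
    for phrase in sorted(phrases, key=len, reverse=True):
        # re.escape keeps '.' from acting as a wildcard
        text = re.sub(re.escape(phrase), "", text, flags=re.I)
    return text.strip()

# Invented sample sentence for illustration
sample = "The law was debated at length. The views are personal."
print(remove_phrases(sample, phrases))
```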

#### 'The author is'

In [None]:
#Checking for a pattern that needs to be removed from the article
theprint_df[theprint_df['Article'].str.contains('the author is', case=False)].shape

(193, 7)

In [None]:
#Checking the indices of the articles that contain this pattern
views_list = list(theprint_df.index[theprint_df['Article'].str.contains('the author is', case=False)])
views_list[:5]

[5, 41, 50, 103, 110]

In [None]:
#Defining a function to remove a trailing author note from a string
def string_pattern_remover_3(ip_string, pattern):
        
    #Searching for the pattern itself, so that the same function works for both
    #'the author is' and 'the authors are'
    match = re.search(pattern, ip_string, flags=re.I)
    
    if match:
        
        #Calculating the position of the match in terms of percentage of string length
        pos_percentage = match.start()*100/len(ip_string)
        
        #If the match is towards the end of the string
        if pos_percentage > 85:
            
            #Removing the pattern from the string
            op_string = re.sub(pattern, "", ip_string, flags=re.I)
            
            return op_string
    
    #Returning the string unchanged when the pattern is absent or not near the end
    return ip_string


In [None]:
#Defining a regex that matches 'the author is' and everything after it
pattern_3 = r"the author is(.*)$"

In [None]:
#Applying the function to the articles
theprint_df['Article'] = theprint_df['Article'].apply(lambda x: string_pattern_remover_3(x, pattern_3))

In [None]:
#Checking again for the pattern
theprint_df[theprint_df['Article'].str.contains('the author is', case=False)].shape

(3, 7)

In [None]:
#Checking the indices of these articles
theprint_df.index[theprint_df['Article'].str.contains('the author is', case=False)]

Int64Index([1393, 1399, 2213], dtype='int64')

In [None]:
print(theprint_df['Link'][2213])
print(theprint_df['Article'][2213])

https://theprint.in/thought-shot/faizan-mustafa-on-article-370-petitions-kaushik-basu-on-anti-corruption-petitions/284311/
The best of the day’s opinion, chosen and curated by ThePrint’s top editors.Faizan Mustafa | Vice chancellor, NALSAR University of Law, Hyderabad 
 The Hindu Mustafa writes that all the petitions on Article 370, that are being referred to a Constitution Bench by the Supreme Court, call for deep scrutiny. Firstly, the Centre’s move to amend Article 370 could be struck down by the apex court as ‘unconstitutional’ using the very same article. Article 370 “has not been abrogated”, he says. Instead, the government has invoked the article to amend Article 367. J&amp;K was already under President’s Rule, but the Parliament – by exercising ‘powers’ of the Legislative Assembly – gave its license to intervene in Article 370. However, Mustafa says that this may not be upheld by legal scrutiny–“what you cannot do directly, you cannot even do indirectly,’’ he writes. Secondly, 

In [None]:
#Checking the indices of the articles that contain 'the authors are'
theprint_df.index[theprint_df['Article'].str.contains('the authors are', case=False)]

Int64Index([2428, 2489, 2527], dtype='int64')

In [None]:
print(theprint_df['Link'][2428])
print(theprint_df['Article'][2428])

https://theprint.in/opinion/how-hinduism-got-distorted-in-the-sabarimala-debate/175794/
An attempt to homogenise Hindus using state power erodes the diversity of Hinduism. S ince the Supreme Court’s judgment on  28 September, 2018  overturning the bar on the entry of women aged 10-50 years into the 800-year old Sabarimala temple in Kerala, there has been a spirited debate on the issue of government interference in religious affairs, with passionately argued views and counter views offered on the merits and demerits of the court’s judgment. Interestingly, those projecting the Sabarimala issue as one of gender justice and women’s equality are quick to draw comparisons with the Modi government’s effort to end the practice of triple talaq. There is an argument that permitting the entry of women into Sabarimala is not only about ensuring women’s rights, but also to be seen in the context of the so-called innately discriminatory attributes within Hinduism—if the entry of women is to be barre

In [None]:
#Defining a regex that matches 'the authors are' and everything after it
pattern_35 = r"the authors are(.*)$"

In [None]:
#Applying the function to the articles
theprint_df['Article'] = theprint_df['Article'].apply(lambda x: string_pattern_remover_3(x, pattern_35))

In [None]:
#Checking again for the pattern
theprint_df[theprint_df['Article'].str.contains('the authors are', case=False)].shape

(0, 7)
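
Some articles in this dataset contain line breaks, which may affect patterns that end in `(.*)$`. By default `.` does not match a newline and `$` matches only at the end of the string (or just before a final newline), so an author note followed by another line is not matched. The sketch below uses an invented article tail to show the difference that `re.S` (DOTALL) makes; it is an illustration of the regex behaviour, not a claim about why specific rows remained.

```python
import re

# Invented article tail: the author note is followed by one more line.
text = "Body of the article.\nThe author is a professor.\nFollow her on Twitter."

pattern = r"the author is(.*)$"

# Default flags: '.' stops at the newline, so '$' is never reached and nothing is removed.
single_line = re.sub(pattern, "", text, flags=re.I)

# With re.S, '.' also matches newlines, so everything up to the end is removed.
dotall = re.sub(pattern, "", text, flags=re.I | re.S)

print(repr(single_line))
print(repr(dotall))
```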

#### 'journalist at theprint'

In [None]:
#Checking for the pattern
theprint_df[theprint_df['Article'].str.contains('journalist at theprint', case=False)].shape

(36, 7)

In [None]:
#Defining a function to remove a substring from a string
def string_pattern_remover_4(ip_string, pattern):
        
    #Checking if the string contains the pattern text
    if 'journalist at theprint' in ip_string.lower():
            
        #Getting the position of the pattern in the string
        pattern_pos = ip_string.lower().index('journalist at theprint')
        
        #Calculating the position of the pattern in terms of percentage of string length
        pos_percentage = pattern_pos*100/len(ip_string)
        
        #If the pattern is towards the end of the string
        if pos_percentage > 85:
            
            #Removing the pattern from the string
            op_string = re.sub(pattern, "", ip_string, flags=re.I)
            
            return op_string
    
    #Returning the string unchanged when the pattern is absent or not near the end
    return ip_string


In [None]:
#Defining a regex that matches the 'by <name>, journalist at theprint' byline
pattern_4 = r"by [A-Za-z\s]{1,25}, journalist at theprint"

In [None]:
#Applying the function to the articles
theprint_df['Article'] = theprint_df['Article'].apply(lambda x: string_pattern_remover_4(x, pattern_4))

In [None]:
#Checking for the pattern
theprint_df[theprint_df['Article'].str.contains('journalist at theprint', case=False)].shape

(0, 7)

#### Dropping the rows with '/plugged-in/' in the link

These articles are usually about what other media are reporting and can be dropped.

In [None]:
theprint_df[theprint_df['Link'].str.contains('/plugged-in/')].shape

(106, 7)

Let's drop these rows.

In [None]:
rows_to_drop = list(theprint_df.index[theprint_df['Link'].str.contains('/plugged-in/')])
len(rows_to_drop)

106

In [None]:
#Dropping these rows
print("Number of rows before dropping: ", theprint_df.shape[0])
theprint_df.drop(rows_to_drop, inplace=True)
print("Number of rows after dropping: ", theprint_df.shape[0])

Number of rows before dropping:  2016
Number of rows after dropping:  1910
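
The same drop can also be written with a boolean mask instead of a list of indices. A minimal sketch on an invented toy frame (`toy_df`, its links and its topics are made up for illustration); `regex=False` is used because `/plugged-in/` is plain text rather than a regular expression.

```python
import pandas as pd

# Toy frame standing in for theprint_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "Link": [
        "https://example.com/plugged-in/a/1/",
        "https://example.com/opinion/b/2/",
        "https://example.com/plugged-in/c/3/",
        "https://example.com/india/d/4/",
    ],
    "Topic": ["caa", "caa", "rafale", "rafale"],
})

# regex=False treats '/plugged-in/' as plain text.
is_plugged_in = toy_df["Link"].str.contains("/plugged-in/", regex=False)

# Keeping the complement of the mask leaves the original frame untouched.
kept_df = toy_df[~is_plugged_in]

print("Number of rows before dropping: ", len(toy_df))
print("Number of rows after dropping: ", len(kept_df))
```

Filtering into a new frame keeps the original available for comparison, at the cost of holding two frames in memory.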


In [None]:
#Saving the dataframe to disk
theprint_df.to_pickle("theprint_df_cleaned", compression="zip")
theprint_df.shape

(1910, 7)

In [None]:
print("Farm Laws", theprint_df[theprint_df['Topic']=='farm laws'].shape)
print("Rafale", theprint_df[theprint_df['Topic']=='rafale'].shape)
print("Article 370", theprint_df[theprint_df['Topic']=='article 370'].shape)
print("Sabarimala", theprint_df[theprint_df['Topic']=='sabarimala'].shape)
print("Section 377", theprint_df[theprint_df['Topic']=='section 377'].shape)
print("CAA", theprint_df[theprint_df['Topic']=='caa'].shape)

Farm Laws (348, 7)
Rafale (228, 7)
Article 370 (832, 7)
Sabarimala (57, 7)
Section 377 (37, 7)
CAA (408, 7)
