# Times of India (India Times)

The first stage is collecting links from a particular URL. Here I have taken Business, Sports, World, Entertainment, and Politics news from Times of India. The reason for doing this manual work for each category is to save the categorization (classification) training step for our model. The alternative would have been a huge task in itself, because we would have had to label the news ourselves and then classify it, which still might not have been as accurate as this manual method.

The URL format for articles in Times of India is : https://timesofindia.indiatimes.com/business/international-business/french-court-adds-pressure-on-google-to-pay-for-news/articleshow/78552309.cms

The format for reaching a category is: https://timesofindia.indiatimes.com/business

### Business 

Here we collect all the links on https://timesofindia.indiatimes.com/business using the BeautifulSoup library. In the HTML, links are given by the href attribute of anchor tags.

In [1]:
import re
import requests
from bs4 import BeautifulSoup

response_toi_business = requests.get(url='https://timesofindia.indiatimes.com/business', headers={'User-Agent':''})
soup_toi_business = BeautifulSoup(response_toi_business.content, 'html.parser')

links_toi_business = []
for link in soup_toi_business.find_all('a', href=True):
    links_toi_business.append(link.get('href'))
#print(len(links_toi_business))
#print(links_toi_business)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only Business-related articles at that. I have used a regular expression for this. All the article URLs end in '.cms' and contain '/business/', so I used those two properties to find the valid URLs.

In [2]:
import re
txt_toi_business = ' '.join(links_toi_business)
url_toi_business_raw = re.findall(r'https://timesofindia.indiatimes.com/business/[a-z0-9\/\.\-\:]*[0-9\.]+cms|/business/[a-z0-9\/\.\-\:]*[0-9\.]+cms', txt_toi_business)
#print(len(url_toi_business_raw))
#print(url_toi_business_raw)
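
Joining every link into one string and running findall works, but an alternative is to test each link on its own with fullmatch, so a match can never start in the middle of an unrelated link. The following is only a sketch under that assumption, with a slightly stricter pattern (digits followed by a literal '.cms'); the names TOI_BUSINESS_RE, filter_article_links, and sample_links are illustrative and not part of the notebook.

```python
import re

# Optional absolute prefix, then /business/..., ending in digits + ".cms".
TOI_BUSINESS_RE = re.compile(
    r'(?:https://timesofindia\.indiatimes\.com)?/business/[a-z0-9/.:-]*[0-9]+\.cms'
)

def filter_article_links(links):
    """Keep only links that look like Times of India business articles."""
    return [link for link in links if TOI_BUSINESS_RE.fullmatch(link)]

sample_links = [
    '/business/india-business/e-tailing-to-become-200-billion-opportunity-by-2025-report/articleshow/78795666.cms',
    'https://timesofindia.indiatimes.com/business/faqs/home-loan-faqs/how-to-check-home-loan-eligibility/articleshow/60479843.cms',
    'https://timesofindia.indiatimes.com/business',  # category page, not an article
    '/sports/tennis/top-stories/rafael-nadal-to-play-paris-masters-next-month/articleshow/78771732.cms',
]
print(filter_article_links(sample_links))
```

Only the first two sample links pass the filter; the category page and the sports link are dropped.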

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list. Note that this does not preserve the original order of the links.

In [3]:
url_toi_business_raw = set(url_toi_business_raw)
url_toi_business_raw = list(url_toi_business_raw)
#print(len(url_toi_business_raw))
url_toi_business_raw

['https://timesofindia.indiatimes.com/business/faqs/income-tax-faqs/income-tax-filing-all-you-need-to-know-about-form-26as/articleshow/59677791.cms',
 '/business/india-at-doorstep-of-revival-process-says-rbi-governor-shaktikanta-das/videoshow/78793420.cms',
 'https://timesofindia.indiatimes.com/business/faqs/mutual-fund-faqs/which-is-the-best-mutual-fund-to-invest/articleshow/59965814.cms',
 'https://timesofindia.indiatimes.com/business/faqs/home-loan-faqs/how-to-check-home-loan-eligibility/articleshow/60479843.cms',
 '/business/key-things-equitas-small-finance-bank-ipo-opens-for-subscription/videoshow/78766324.cms',
 '/business/india-business/e-tailing-to-become-200-billion-opportunity-by-2025-report/articleshow/78795666.cms',
 '/business/international-business/amazon-extends-work-from-home-option-till-june-2021/articleshow/78782751.cms',
 '/business/india-business/mkt-recovery-from-march-is-broad-based-sebi-chief/articleshow/78797136.cms',
 'https://timesofindia.indiatimes.com/busine
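
If the order in which links appear on the page matters later on, one possible alternative to the set round trip is dict.fromkeys, which keeps the first occurrence of each item. This is a sketch only; dedupe_keep_order is a hypothetical helper and the three short URLs are made-up placeholders, not real articles.

```python
def dedupe_keep_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(urls))

raw = [
    '/business/a/articleshow/1.cms',  # placeholder URL
    '/business/b/articleshow/2.cms',  # placeholder URL
    '/business/a/articleshow/1.cms',  # repeat of the first entry
]
print(dedupe_keep_order(raw))
```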

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; they start from the category instead, like **'/business/india-business/sensex-drops-over-100-points-in-opening-trade-nifty-below-11650/articleshow/78527429.cms'**. For such links we use a regular expression to prepend https://timesofindia.indiatimes.com. That is what the next code snippet does.

In [4]:
url_toi_business = []
url_toi_business = [re.sub(r'(?<![a-z/:])(/business/[a-z0-9/.:-]*[0-9.]+cms)', r'https://timesofindia.indiatimes.com\1', without_header) for without_header in url_toi_business_raw]
#print(len(url_toi_business))
#print(url_toi_business)
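
As a cross-check on the substitution above, the standard library's urljoin can do the same job without a regular expression: it prefixes relative paths with the site root and leaves absolute URLs untouched. This is a sketch of that alternative, not the method used in this notebook; TOI_BASE and absolutize are names I made up for illustration.

```python
from urllib.parse import urljoin

TOI_BASE = 'https://timesofindia.indiatimes.com'

def absolutize(urls, base=TOI_BASE):
    """Prefix relative paths with the site root; absolute URLs pass through unchanged."""
    return [urljoin(base, url) for url in urls]

raw = [
    '/business/international-business/amazon-extends-work-from-home-option-till-june-2021/articleshow/78782751.cms',
    'https://timesofindia.indiatimes.com/business/faqs/home-loan-faqs/how-to-check-home-loan-eligibility/articleshow/60479843.cms',
]
for url in absolutize(raw):
    print(url)
```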

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

In [5]:
cached_url_toi_business = []
with open('./data/times_of_india/cached_url_toi_business.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_toi_business.append(currentPlace)

In [6]:
len(cached_url_toi_business)

586
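
On the very first run the cache file may not exist yet, and the last line of a hand-edited file may lack a trailing newline. A more defensive loader could look like the sketch below; load_cached_urls is a hypothetical helper, and the demonstration runs on a temporary file with placeholder entries rather than on the real cache.

```python
from pathlib import Path
import tempfile

def load_cached_urls(path):
    """Return cached URLs, one per line; a missing file counts as an empty cache."""
    path = Path(path)
    if not path.exists():
        return []
    # splitlines() drops line endings without cutting the last character
    # off a final line that has no trailing newline.
    lines = path.read_text(encoding='utf-8').splitlines()
    return [line.strip() for line in lines if line.strip()]

# Demonstration on a temporary file (placeholder entries, not real articles).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'cached_demo.txt'
    demo.write_text('/business/a/1.cms\n/business/b/2.cms', encoding='utf-8')
    print(load_cached_urls(demo))
    print(load_cached_urls(Path(tmp) / 'missing.txt'))
```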

#### Compare against cached URLs to find only new links for today

In [7]:
latest_url_toi_business = list(set(url_toi_business) - set(cached_url_toi_business))
#print(len(latest_url_toi_business))
#print(latest_url_toi_business)

#### Cache the latest URL for future comparisons

In [8]:
with open('./data/times_of_india/cached_url_toi_business.txt', 'a') as filehandle:
    for listitem in latest_url_toi_business:
        filehandle.write('%s\n' % listitem)
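
The load, compare, and append steps above could also be folded into a single helper, which would make it easier to reuse for the other categories. This is a minimal sketch under the assumption that the cache is a plain text file with one URL per line; update_cache is a hypothetical name, and the demonstration uses a temporary file with placeholder values instead of the real cache.

```python
from pathlib import Path
import tempfile

def update_cache(cache_path, todays_urls):
    """Append URLs not yet in the cache file and return them in first-seen order."""
    cache_path = Path(cache_path)
    cached = set()
    if cache_path.exists():
        cached = set(cache_path.read_text(encoding='utf-8').splitlines())
    new_urls = [u for u in dict.fromkeys(todays_urls) if u not in cached]
    # Assumes the existing file ends with a newline, which holds when
    # every entry was written as url + '\n'.
    with cache_path.open('a', encoding='utf-8') as handle:
        for url in new_urls:
            handle.write(url + '\n')
    return new_urls

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'cached_demo.txt'
    print(update_cache(cache, ['u1', 'u2', 'u1']))  # first run: u1 and u2 are new
    print(update_cache(cache, ['u2', 'u3']))        # second run: only u3 is new
```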

### Sports 

The first stage is collecting links from a particular URL; here I have taken sports news from Times of India. Collecting each category by hand saves the categorization (classification) training step for our model. The alternative would have been a huge task in itself, because we would have had to label the news ourselves and then classify it, which still might not have been as accurate as this manual method.

Here we collect all the links on https://timesofindia.indiatimes.com/sports using the BeautifulSoup library. In the HTML, links are given by the href attribute of anchor tags.

In [9]:
import re
import requests
from bs4 import BeautifulSoup

response_toi_sports = requests.get(url='https://timesofindia.indiatimes.com/sports', headers={'User-Agent':''})
soup_toi_sports = BeautifulSoup(response_toi_sports.content, 'html.parser')

links_toi_sports = []
for link in soup_toi_sports.find_all('a', href=True):
    links_toi_sports.append(link.get('href'))
#print(len(links_toi_sports))
#print(links_toi_sports)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only sports-related articles at that. I have used a regular expression for this. All the article URLs end in '.cms' and contain '/sports/', so I used those two properties to find the valid URLs.

In [10]:
import re
txt_toi_sports = ' '.join(links_toi_sports)
url_toi_sports_raw = re.findall(r'https://timesofindia.indiatimes.com/sports/[a-z0-9\/\.\-\:]*[0-9\.]+cms|/sports/[a-z0-9\/\.\-\:]*[0-9\.]+cms', txt_toi_sports)
#print(len(url_toi_sports_raw))
#print(url_toi_sports_raw)
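
Since the pattern differs between categories only in the section path, it could be built from a parameter instead of being retyped. The sketch below assumes the same URL shape as the cell above, in a slightly stricter form (digits followed by a literal '.cms'); toi_article_pattern is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re

def toi_article_pattern(section):
    """Compile the article regex for a section path such as 'sports' or 'politics/news'."""
    return re.compile(
        r'(?:https://timesofindia\.indiatimes\.com)?/'
        + re.escape(section)
        + r'/[a-z0-9/.:-]*[0-9]+\.cms'
    )

sports_re = toi_article_pattern('sports')
links = [
    '/sports/badminton/sindhu-quits-national-camp-over-personal-issues-reaches-london/articleshow/78758810.cms',
    '/business/india-business/mkt-recovery-from-march-is-broad-based-sebi-chief/articleshow/78797136.cms',
]
# Only the sports link survives the filter.
print([link for link in links if sports_re.fullmatch(link)])
```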

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [11]:
url_toi_sports_raw = set(url_toi_sports_raw)
url_toi_sports_raw = list(url_toi_sports_raw)
#print(len(url_toi_sports_raw))
url_toi_sports_raw

['/sports/football/epl/top-stories/manchester-united-dismiss-super-league-reports-say-they-are-focused-on-uefa-talks/articleshow/78791343.cms',
 '/sports/football/epl/top-stories/mesut-ozil-deeply-disappointed-by-arsenal-omission/articleshow/78790467.cms',
 '/sports/badminton/sindhu-quits-national-camp-over-personal-issues-reaches-london/articleshow/78758810.cms',
 '/sports/latest-videos/ipl-2020-punjab-beat-mumbai-in-ipls-first-ever-second-super-over/videoshow/78741981.cms',
 '/sports/latest-videos/ipl-2020-pacers-help-dc-beat-rr-to-reclaim-top-spot/videoshow/78673084.cms',
 '/sports/football/sergio-ramos-on-lionel-messi-the-argentina-superstar-has-earned-right-to-leave/videoshow/77904948.cms',
 '/sports/tennis/top-stories/rafael-nadal-to-play-paris-masters-next-month/articleshow/78771732.cms',
 '/sports/cricket/match-center/indian-premier-league-2020-delhi-capitals-vs-kings-xi-punjab-live-cricket-score/scorecard/matchfile-kpdd10202020197725.cms',
 '/sports/cricket/ipl-2020-one-of-our

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; they start from the category instead, like '/sports/tennis/top-stories/rafael-nadal-to-play-paris-masters-next-month/articleshow/78771732.cms'. For such links we use a regular expression to prepend https://timesofindia.indiatimes.com. That is what the next code snippet does.

In [12]:
url_toi_sports = []
url_toi_sports = [re.sub(r'(?<![a-z/:])(/sports/[a-z0-9/.:-]*[0-9.]+cms)', r'https://timesofindia.indiatimes.com\1', without_header) for without_header in url_toi_sports_raw]
#print(len(url_toi_sports))
#print(url_toi_sports)

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

In [13]:
cached_url_toi_sports = []
with open('./data/times_of_india/cached_url_toi_sports.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_toi_sports.append(currentPlace)

In [14]:
len(cached_url_toi_sports)

502

#### Compare against cached URLs to find only new links for today

In [15]:
latest_url_toi_sports = list(set(url_toi_sports) - set(cached_url_toi_sports))

#print(len(latest_url_toi_sports))
#print(latest_url_toi_sports)

#### Cache the latest URL for future comparisons

In [16]:
with open('./data/times_of_india/cached_url_toi_sports.txt', 'a') as filehandle:
    for listitem in latest_url_toi_sports:
        filehandle.write('%s\n' % listitem)

### World 

The first stage is collecting links from a particular URL; here I have taken world news from Times of India. Collecting each category by hand saves the categorization (classification) training step for our model. The alternative would have been a huge task in itself, because we would have had to label the news ourselves and then classify it, which still might not have been as accurate as this manual method.

Here we collect all the links on https://timesofindia.indiatimes.com/world using the BeautifulSoup library. In the HTML, links are given by the href attribute of anchor tags.

In [17]:
import re
import requests
from bs4 import BeautifulSoup

response_toi_world = requests.get(url='https://timesofindia.indiatimes.com/world', headers={'User-Agent':''})
soup_toi_world = BeautifulSoup(response_toi_world.content, 'html.parser')

links_toi_world = []
for link in soup_toi_world.find_all('a', href=True):
    links_toi_world.append(link.get('href'))
#print(len(links_toi_world))
#print(links_toi_world)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only world-related articles at that. I have used a regular expression for this. All the article URLs end in '.cms' and contain '/world/', so I used those two properties to find the valid URLs.

In [18]:
import re
txt_toi_world = ' '.join(links_toi_world)
url_toi_world_raw = re.findall(r'https://timesofindia.indiatimes.com/world/[a-z0-9\/\.\-\:]*[0-9\.]+cms|/world/[a-z0-9\/\.\-\:]*[0-9\.]+cms', txt_toi_world)
#print(len(url_toi_world_raw))
#print(url_toi_world_raw)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [19]:
url_toi_world_raw = set(url_toi_world_raw)
url_toi_world_raw = list(url_toi_world_raw)
#print(len(url_toi_world_raw))
url_toi_world_raw

['/world/pakistan/two-parallel-governments-control-pakistan-former-pm-nawaz-sharif/articleshow/78793346.cms',
 '/world/china/hong-kong-eases-social-distancing-measures-after-number-of-infections-decline/articleshow/78769398.cms',
 '/world/uk/eu-to-uk-on-brexit-talks-you-cant-have-cake-eat-it-too/articleshow/78787323.cms',
 '/world/pakistan/pakistans-political-crisis-deepens/articleshow/78796926.cms',
 '/world/europe/eu-to-uk-on-brexit-talks-you-cant-have-cake-eat-it-too/articleshow/78787229.cms',
 '/world/abu-dhabi/photogallery/bab-al-qasr/articleshow/76127904.cms',
 '/world/abu-dhabi/photogallery/al-ain/articleshow/76121615.cms',
 '/world/uk/britain-partners-with-oxford-firm-to-assess-coronavirus-vaccine-t-cell-responses/articleshow/78801172.cms',
 'https://timesofindia.indiatimes.com/world/us/us-presidential-elections/trump-says-will-win-by-bigger-margin-seeks-thundering-defeat-for-rival-biden/articleshow/78781024.cms',
 '/world/rest-of-world/us-approves-1-billion-in-new-arms-sales-t

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; they start from the category instead, like '/world/pakistan/pakistans-political-crisis-deepens/articleshow/78796926.cms'. For such links we use a regular expression to prepend https://timesofindia.indiatimes.com. That is what the next code snippet does.

In [20]:
url_toi_world = []
url_toi_world = [re.sub(r'(?<![a-z/:])(/world/[a-z0-9/.:-]*[0-9.]+cms)', r'https://timesofindia.indiatimes.com\1', without_header) for without_header in url_toi_world_raw]
#print(len(url_toi_world))
#print(url_toi_world)

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

#### Compare against cached URLs to find only new links for today

In [21]:
cached_url_toi_world = []
with open('./data/times_of_india/cached_url_toi_world.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_toi_world.append(currentPlace)

In [22]:
len(cached_url_toi_world)

321

In [23]:
latest_url_toi_world = list(set(url_toi_world) - set(cached_url_toi_world)) 

#print(len(latest_url_toi_world))
#print(latest_url_toi_world)

#### Cache the latest URL for future comparisons

In [24]:
with open('./data/times_of_india/cached_url_toi_world.txt', 'a') as filehandle:
    for listitem in latest_url_toi_world:
        filehandle.write('%s\n' % listitem)

### Entertainment 

The first stage is collecting links from a particular URL; here I have taken entertainment news from Times of India. Collecting each category by hand saves the categorization (classification) training step for our model. The alternative would have been a huge task in itself, because we would have had to label the news ourselves and then classify it, which still might not have been as accurate as this manual method.

Here we collect all the links on https://timesofindia.indiatimes.com/entertainment/hindi using the BeautifulSoup library. In the HTML, links are given by the href attribute of anchor tags.

In [25]:
import re
import requests
from bs4 import BeautifulSoup

response_toi_entertainment = requests.get(url='https://timesofindia.indiatimes.com/entertainment/hindi', headers={'User-Agent':''})
soup_toi_entertainment = BeautifulSoup(response_toi_entertainment.content, 'html.parser')

links_toi_entertainment = []
for link in soup_toi_entertainment.find_all('a', href=True):
    links_toi_entertainment.append(link.get('href'))
#print(len(links_toi_entertainment))
#print(links_toi_entertainment)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only entertainment-related articles at that. I have used a regular expression for this. All the article URLs end in '.cms' and contain '/entertainment/hindi/', so I used those two properties to find the valid URLs.

In [26]:
import re
txt_toi_entertainment = ' '.join(links_toi_entertainment)
url_toi_entertainment_raw = re.findall(r'https://timesofindia.indiatimes.com/entertainment/hindi/[a-z0-9\/\.\-\:]*[0-9\.]+cms|/entertainment/hindi/[a-z0-9\/\.\-\:]*[0-9\.]+cms', txt_toi_entertainment)
#print(len(url_toi_entertainment_raw))
#print(url_toi_entertainment_raw)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [27]:
url_toi_entertainment_raw = set(url_toi_entertainment_raw)
url_toi_entertainment_raw = list(url_toi_entertainment_raw)
#print(len(url_toi_entertainment_raw))
url_toi_entertainment_raw

['/entertainment/hindi/bollywood/news/25-years-of-ddlj-karan-johar-shares-priceless-bts-photos-featuring-shah-rukh-khan-and-late-amrish-puri/articleshow/78765328.cms',
 '/entertainment/hindi/water-baby-lisa-haydon-makes-most-of-her-day-doing-what-she-loves-shares-her-latest-surfing-picture/videoshow/78746446.cms',
 '/entertainment/hindi/rajkummar-rao-and-bhumi-pednekar-to-start-shooting-for-badhaai-do-from-january-2021/videoshow/78760764.cms',
 '/entertainment/hindi/saqib-saleem-on-nepotism-debate-cant-virat-anushka-make-their-child-a-cricketer-or-an-actor/videoshow/78785083.cms',
 '/entertainment/hindi/ddlj-was-a-trendsetter-in-terms-of-fashion-styling-and-souvenirs/videoshow/78784473.cms',
 '/entertainment/hindi/movie-details/shakuntala-devi/movieshow/71146516.cms',
 '/entertainment/hindi/dolly-kitty-aur-woh-chamakte-sitare-official-trailer/videoshow/78189079.cms',
 '/entertainment/hindi/swara-bhasker-criticises-shah-rukh-khans-character-raj-malhotra-from-ddlj-says-it-makes-stalking-

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; they start from the category instead, like '/entertainment/hindi/movie-details/shakuntala-devi/movieshow/71146516.cms'. For such links we use a regular expression to prepend https://timesofindia.indiatimes.com. That is what the next code snippet does.

In [28]:
url_toi_entertainment = []
url_toi_entertainment = [re.sub(r'(?<![a-z/:])(/entertainment/hindi/[a-z0-9/.:-]*[0-9.]+cms)', r'https://timesofindia.indiatimes.com\1', without_header) for without_header in url_toi_entertainment_raw]
#print(len(url_toi_entertainment))
#print(url_toi_entertainment)

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

#### Compare against cached URLs to find only new links for today

In [29]:
cached_url_toi_entertainment = []
with open('./data/times_of_india/cached_url_toi_entertainment.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_toi_entertainment.append(currentPlace)

In [30]:
len(cached_url_toi_entertainment)

473

In [31]:
latest_url_toi_entertainment = list(set(url_toi_entertainment) - set(cached_url_toi_entertainment))

#print(len(latest_url_toi_entertainment))
#print(latest_url_toi_entertainment)

#### Cache the latest URL for future comparisons

In [32]:
with open('./data/times_of_india/cached_url_toi_entertainment.txt', 'a') as filehandle:
    for listitem in latest_url_toi_entertainment:
        filehandle.write('%s\n' % listitem)

### Politics 

The first stage is collecting links from a particular URL; here I have taken politics news from Times of India. Collecting each category by hand saves the categorization (classification) training step for our model. The alternative would have been a huge task in itself, because we would have had to label the news ourselves and then classify it, which still might not have been as accurate as this manual method.

Here we collect all the links on https://timesofindia.indiatimes.com/politics/news using the BeautifulSoup library. In the HTML, links are given by the href attribute of anchor tags.

In [33]:
import re
import requests
from bs4 import BeautifulSoup

response_toi_politics = requests.get(url='https://timesofindia.indiatimes.com/politics/news', headers={'User-Agent':''})
soup_toi_politics = BeautifulSoup(response_toi_politics.content, 'html.parser')

links_toi_politics = []
for link in soup_toi_politics.find_all('a', href=True):
    links_toi_politics.append(link.get('href'))
#print(len(links_toi_politics))
#print(links_toi_politics)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only politics-related articles at that. I have used a regular expression for this. All the article URLs end in '.cms' and contain '/politics/', so I used those two properties to find the valid URLs.

In [34]:
import re
txt_toi_politics = ' '.join(links_toi_politics)
url_toi_politics_raw = re.findall(r'https://timesofindia.indiatimes.com/politics/news/[a-z0-9\/\.\-\:]*[0-9\.]+cms|/politics/[a-z0-9\/\.\-\:]*[0-9\.]+cms', txt_toi_politics)
#print(len(url_toi_politics_raw))
#print(url_toi_politics_raw)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [35]:
url_toi_politics_raw = set(url_toi_politics_raw)
url_toi_politics_raw = list(url_toi_politics_raw)
#print(len(url_toi_politics_raw))
url_toi_politics_raw

['/politics/news/rivals-belt-out-songs-hope-theyre-music-to-voters-ears/articleshow/78651121.cms',
 '/politics/news/bihar-polls-chirag-releases-ljp-manifesto-bihar-first-bihari-first/articleshow/78789444.cms',
 '/politics/news/rectify-mistake-focus-on-ensuring-justice-to-hathras-victims-family-mayawati-tells-up-govt/articleshow/78506998.cms',
 '/politics/news/item-remark-row-did-not-insult-anyone-claims-kamal-nath-bjp-holds-silent-protest/articleshow/78748488.cms',
 '/politics/news/farooq-calls-gupkar-declaration-meeting-today-mehbooba-to-attend/articleshow/78672261.cms',
 '/politics/news/bjp-backs-nitish-in-bihar-snubs-ljp/articleshow/78524425.cms',
 '/politics/news/mamata-invokes-sc-ruling-to-block-bjp-march-street-battles-rage-in-kolkata/articleshow/78564743.cms',
 '/politics/news/will-congress-cross-the-rubicon-declare-priyanka-cm-candidate/articleshow/78695875.cms',
 '/politics/news/modi-at-key-bjp-meet-to-finalise-candidates-for-bihar-election/articleshow/78598082.cms',
 '/politi

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; they start from the category instead, like '/politics/news/bjp-backs-nitish-in-bihar-snubs-ljp/articleshow/78524425.cms'. For such links we use a regular expression to prepend https://timesofindia.indiatimes.com. That is what the next code snippet does.

In [36]:
url_toi_politics = []
# match every relative '/politics/' link found above, not only those under '/politics/news/'
url_toi_politics = [re.sub(r'(?<![a-z/:])(/politics/[a-z0-9/.:-]*[0-9.]+cms)', r'https://timesofindia.indiatimes.com\1', without_header) for without_header in url_toi_politics_raw]
#print(len(url_toi_politics))
#print(url_toi_politics)

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

#### Compare against cached URLs to find only new links for today

In [37]:
cached_url_toi_politics = []
with open('./data/times_of_india/cached_url_toi_politics.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_toi_politics.append(currentPlace)

In [38]:
len(cached_url_toi_politics)

82

In [39]:
latest_url_toi_politics = list(set(url_toi_politics) - set(cached_url_toi_politics))
#print(len(latest_url_toi_politics))
#print(latest_url_toi_politics)

#### Cache the latest URL for future comparisons

In [40]:
with open('./data/times_of_india/cached_url_toi_politics.txt', 'a') as filehandle:
    for listitem in latest_url_toi_politics:
        filehandle.write('%s\n' % listitem)
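
The five Times of India categories above repeat the same cells and differ only in the listing path and the cache file name. One way to make that explicit is a small lookup table, sketched below. The section paths and the cache file naming are taken from the cells above; TOI_BASE, TOI_SECTIONS, listing_url, and cache_path are names I introduced for illustration.

```python
TOI_BASE = 'https://timesofindia.indiatimes.com'

# Section paths exactly as requested in the cells above.
TOI_SECTIONS = {
    'business': 'business',
    'sports': 'sports',
    'world': 'world',
    'entertainment': 'entertainment/hindi',
    'politics': 'politics/news',
}

def listing_url(category):
    """Build the listing-page URL for a category name."""
    return f'{TOI_BASE}/{TOI_SECTIONS[category]}'

def cache_path(category):
    """Build the cache file path used for a category."""
    return f'./data/times_of_india/cached_url_toi_{category}.txt'

for name in TOI_SECTIONS:
    print(name, listing_url(name), cache_path(name))
```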

In [41]:
%reset -f

# Indian Express

### Business

In [42]:
import re
import requests
from bs4 import BeautifulSoup

response_ie_business = requests.get(url='https://indianexpress.com/section/business/', headers={'User-Agent':''})
soup_ie_business = BeautifulSoup(response_ie_business.content, 'html.parser')

links_ie_business = []
for link in soup_ie_business.find_all('a', href=True):
    links_ie_business.append(link.get('href'))
#print(len(links_ie_business))
#print(links_ie_business)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only business-related articles at that. I have used a regular expression for this. Indian Express article URLs start with 'https://indianexpress.com/article/business/', so I matched on that prefix to find the valid URLs.

In [43]:
import re
txt_ie_business = ' '.join(links_ie_business)
url_ie_business = re.findall(r'https://indianexpress.com/article/business/[a-zA-Z0-9\/\.\-\:]*', txt_ie_business)
#print(len(url_ie_business))
#print(url_ie_business)
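
One thing to watch with the pattern above is that the trailing character class is allowed to match nothing, so a bare 'https://indianexpress.com/article/business/' link would also be accepted. A stricter variant is sketched below, under the assumption that article URLs end in a hyphen, a numeric id, and a slash, as the URLs collected in this notebook do; IE_BUSINESS_RE and filter_ie_business are illustrative names.

```python
import re

# Assumption: article URLs end in "-<numeric id>/".
IE_BUSINESS_RE = re.compile(
    r'https://indianexpress\.com/article/business/[a-zA-Z0-9/-]+-[0-9]+/'
)

def filter_ie_business(links):
    """Keep only links shaped like Indian Express business articles."""
    return [link for link in links if IE_BUSINESS_RE.fullmatch(link)]

sample = [
    'https://indianexpress.com/article/business/companies/lic-launches-pension-scheme-6825308/',
    'https://indianexpress.com/section/business/',  # section page
    'https://indianexpress.com/article/business/',  # bare prefix, no article
]
print(filter_ie_business(sample))
```

Only the first sample link is kept; the section page and the bare prefix are rejected.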

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [44]:
url_ie_business = set(url_ie_business)
url_ie_business = list(url_ie_business)
#print(len(url_ie_business))
url_ie_business

['https://indianexpress.com/article/business/market/recovery-in-markets-after-pandemic-shock-is-broad-based-sebi-chief-ajay-tyagi-6821112/',
 'https://indianexpress.com/article/business/companies/lic-launches-pension-scheme-6825308/',
 'https://indianexpress.com/article/business/aviation/cathay-pacific-to-cut-over-5000-hong-kong-jobs-close-dragon-brand-6817254/',
 'https://indianexpress.com/article/business/economy/india-at-doorstep-of-economic-revival-says-rbi-governor-6822193/',
 'https://indianexpress.com/article/business/companies/hul-q2-net-profit-rises-9-operations-services-back-to-pre-covid-levels-6811918/',
 'https://indianexpress.com/article/business/companies/mercedes-benz-to-start-local-assembly-of-vehicle-range-amg-in-india-6805384/',
 'https://indianexpress.com/article/business/working-on-next-stimulus-dea-secy-6825271/',
 'https://indianexpress.com/article/business/companies/dukaan-raises-6-mn-from-matrix-lightspeed-others-6825387/',
 'https://indianexpress.com/article/bu

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

#### Compare against cached URLs to find only new links for today

In [45]:
cached_url_ie_business = []
with open('./data/indian_express/cached_url_ie_business.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ie_business.append(currentPlace)

In [46]:
len(cached_url_ie_business)

155

In [47]:
latest_url_ie_business = list(set(url_ie_business) - set(cached_url_ie_business))

#print(len(latest_url_ie_business))
#print(latest_url_ie_business)

#### Cache the latest URL for future comparisons

In [48]:
with open('./data/indian_express/cached_url_ie_business.txt', 'a') as filehandle:
    for listitem in latest_url_ie_business:
        filehandle.write('%s\n' % listitem)

### Sports

In [49]:
import re
import requests
from bs4 import BeautifulSoup

response_ie_sports = requests.get(url='https://indianexpress.com/section/sports/', headers={'User-Agent':''})
soup_ie_sports = BeautifulSoup(response_ie_sports.content, 'html.parser')

links_ie_sports = []
for link in soup_ie_sports.find_all('a', href=True):
    links_ie_sports.append(link.get('href'))
#print(len(links_ie_sports))
#print(links_ie_sports)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only sports-related articles at that. I have used a regular expression for this. Indian Express article URLs start with 'https://indianexpress.com/article/sports/', so I matched on that prefix to find the valid URLs.

In [50]:
import re
txt_ie_sports = ' '.join(links_ie_sports)
url_ie_sports = re.findall(r'https://indianexpress.com/article/sports/[a-zA-Z0-9\/\.\-\:]*', txt_ie_sports)
#print(len(url_ie_sports))
#print(url_ie_sports)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [51]:
url_ie_sports = set(url_ie_sports)
url_ie_sports = list(url_ie_sports)
#print(len(url_ie_sports))
url_ie_sports

['https://indianexpress.com/article/sports/ipl/kolkata-knight-riders-kkr-ipl-team-2020-full-squad-players-list-6590998/',
 'https://indianexpress.com/article/sports/football/english-premier-league/axel-tuanzebe-manchester-uniteds-unlikely-hero-in-paris-6825517/',
 'https://indianexpress.com/article/sports/ipl/sunrisers-hyderabad-srh-ipl-team-2020-full-squad-players-list-6589430/',
 'https://indianexpress.com/article/sports/ipl/rr-vs-srh-predicted-playing-11-ipl-2020-live-updates-6832104/',
 'https://indianexpress.com/article/sports/ipl/rcb-ipl-team-2020-players-list-royal-challengers-bangalore-full-squad-players-list-6590686/',
 'https://indianexpress.com/article/sports/ipl/mumbai-indians-ipl-2020-full-squad-players-list-6590649/',
 'https://indianexpress.com/article/sports/cricket/england-vs-south-africa-limited-overs-series-tour-6823186/',
 'https://indianexpress.com/article/sports/ipl/shikhar-dhawan-ipl-2020-dc-running-faster-not-afraid-6821493/',
 'https://indianexpress.com/article

#### Cached Previous day URL to compare against the new URL
The links collected here may repeat articles from the previous day or a few days back. So we compare them against the URLs we already have to find any redundant article. The earlier URLs are read from a cache file on the filesystem.

#### Compare against cached URLs to find only new links for today

In [52]:
cached_url_ie_sports = []
with open('./data/indian_express/cached_url_ie_sports.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ie_sports.append(currentPlace)

In [53]:
len(cached_url_ie_sports)

258

In [54]:
latest_url_ie_sports = list(set(url_ie_sports) - set(cached_url_ie_sports))

#print(len(latest_url_ie_sports))
#print(latest_url_ie_sports)

#### Cache the latest URL for future comparisons

In [55]:
with open('./data/indian_express/cached_url_ie_sports.txt', 'a') as filehandle:
    for listitem in latest_url_ie_sports:
        filehandle.write('%s\n' % listitem)
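
The compare and cache steps can be put together in one function. This is a sketch under the assumption that the cache is a plain text file with one URL per line; `update_cache` is a hypothetical helper name, and the demo uses a temporary file and made-up URLs rather than the files under `./data/indian_express/`.

```python
import os
import tempfile


def update_cache(path, todays_urls):
    """Return URLs not yet in the cache file and append them to it."""
    cached = set()
    if os.path.exists(path):
        with open(path, 'r') as filehandle:
            cached = {line.rstrip('\n') for line in filehandle}
    # Keep only unseen URLs; sorted() makes the result deterministic
    latest = sorted(set(todays_urls) - cached)
    with open(path, 'a') as filehandle:
        for url in latest:
            filehandle.write('%s\n' % url)
    return latest


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'cached_demo.txt')
    first_run = update_cache(demo_path, ['https://example.com/a', 'https://example.com/b'])
    second_run = update_cache(demo_path, ['https://example.com/b', 'https://example.com/c'])

print(first_run)   # both URLs are new on the first run
print(second_run)  # only the URL that was not cached yet
```

Opening the file in append mode (`'a'`) matches the cell above: earlier entries are kept, so the cache grows over successive days.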

### Entertainment

In [56]:
import re
import requests
from bs4 import BeautifulSoup

response_ie_entertainment = requests.get(url='https://indianexpress.com/section/entertainment/', headers={'User-Agent':''})
soup_ie_entertainment = BeautifulSoup(response_ie_entertainment.content, 'html.parser')

links_ie_entertainment = []
for link in soup_ie_entertainment.findAll('a', href=True):
    links_ie_entertainment.append(link.get('href'))
#print(len(links_ie_entertainment))
#print(links_ie_entertainment)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only entertainment-related ones. I have used RegEx for this task. Indian Express article URLs start with 'https://indianexpress.com/article/entertainment/', so I used that prefix to find all the valid URLs.

In [57]:
import re
txt_ie_entertainment = ' '.join(links_ie_entertainment)
url_ie_entertainment = re.findall(r'https://indianexpress.com/article/entertainment/[a-zA-Z0-9\/\.\-\:]*', txt_ie_entertainment)
#print(len(url_ie_entertainment))
#print(url_ie_entertainment)
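
To see what the pattern keeps and drops, here is a small check on hand-written hrefs. The first sample is one of the article URLs from this section's output; the other two are made up for illustration. In this sketch the dot in the domain is escaped so that it matches only a literal '.', which is slightly stricter than the pattern in the cell above.

```python
import re

sample_links = [
    'https://indianexpress.com/article/entertainment/bollywood/suraj-pe-mangal-bhari-trailer-6820350/',
    'https://indianexpress.com/section/entertainment/',  # section page, not an article
    'https://indianexpress.com/article/sports/ipl/some-match-report-1234567/',  # other category
]
txt = ' '.join(sample_links)

# Same idea as the notebook's pattern: fixed prefix, then URL-safe characters
pattern = r'https://indianexpress\.com/article/entertainment/[a-zA-Z0-9/.\-:]*'
matches = re.findall(pattern, txt)
print(matches)
```

Only the first link survives, because the other two do not contain the '/article/entertainment/' prefix.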

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [58]:
url_ie_entertainment = set(url_ie_entertainment)
url_ie_entertainment = list(url_ie_entertainment)
#print(len(url_ie_entertainment))
url_ie_entertainment

['https://indianexpress.com/article/entertainment/television/shehzad-deol-gets-evicted-from-bigg-boss-14-6821626/',
 'https://indianexpress.com/article/entertainment/bollywood/suraj-pe-mangal-bhari-trailer-6820350/',
 'https://indianexpress.com/article/entertainment/web-series/jamie-foxx-to-play-lead-in-netflix-vampire-comedy-day-shift-6820394/',
 'https://indianexpress.com/article/entertainment/bollywood/sanjay-dutt-recovers-from-cancer-happy-to-come-out-victorious-from-this-battle-6821320/',
 'https://indianexpress.com/article/entertainment/web-series/david-letterman-is-from-mars-kim-kardashian-is-from-venus-but-they-meet-on-earth-for-a-netflix-special-6821401/',
 'https://indianexpress.com/article/entertainment/movie-review/varmaa-movie-review-bala-bold-interpretation-of-arjun-reddy-6705535/',
 'https://indianexpress.com/article/entertainment/web-series/mirzapur-rasika-beena-is-as-manipulative-as-she-was-in-season-1-6738705/',
 'https://indianexpress.com/article/entertainment/movie-

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [59]:
cached_url_ie_entertainment = []
with open('./data/indian_express/cached_url_ie_entertainment.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ie_entertainment.append(currentPlace)

In [60]:
len(cached_url_ie_entertainment)

238

In [61]:
latest_url_ie_entertainment = list(set(url_ie_entertainment) - set(cached_url_ie_entertainment))

#print(len(latest_url_ie_entertainment))
#print(latest_url_ie_entertainment)

#### Cache the latest URL for future comparisons

In [62]:
with open('./data/indian_express/cached_url_ie_entertainment.txt', 'a') as filehandle:
    for listitem in latest_url_ie_entertainment:
        filehandle.write('%s\n' % listitem)

### World

In [63]:
import re
import requests
from bs4 import BeautifulSoup

response_ie_world = requests.get(url='https://indianexpress.com/section/world/', headers={'User-Agent':''})
soup_ie_world = BeautifulSoup(response_ie_world.content, 'html.parser')

links_ie_world = []
for link in soup_ie_world.findAll('a', href=True):
    links_ie_world.append(link.get('href'))
#print(len(links_ie_world))
#print(links_ie_world)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only world-related ones. I have used RegEx for this task. Indian Express article URLs start with 'https://indianexpress.com/article/world/', so I used that prefix to find all the valid URLs.

In [64]:
import re
txt_ie_world = ' '.join(links_ie_world)
url_ie_world = re.findall(r'https://indianexpress.com/article/world/[a-zA-Z0-9\/\.\-\:]*', txt_ie_world)
#print(len(url_ie_world))
#print(url_ie_world)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [65]:
url_ie_world = set(url_ie_world)
url_ie_world = list(url_ie_world)
#print(len(url_ie_world))
url_ie_world

['https://indianexpress.com/article/world/thai-protesters-reject-pms-olive-branch-give-fresh-ultimatum-6831410/',
 'https://indianexpress.com/article/world/pope-francis-endorses-same-sex-civil-unions-in-new-documentary-film-6821741/',
 'https://indianexpress.com/article/world/us-and-russia-ready-to-freeze-number-of-nuclear-warheads-6821330/',
 'https://indianexpress.com/article/world/europes-dying-villages-woo-immigrants-to-survive-6821304/',
 'https://indianexpress.com/article/world/japans-suga-opposes-actions-that-boost-tension-in-south-china-sea-6821417/',
 'https://indianexpress.com/article/world/germany-issues-warrants-for-panama-papers-lawyers-say-reports-6821299/',
 'https://indianexpress.com/article/world/pakistan-unlikely-to-exit-fatfs-grey-list-report-6821376/',
 'https://indianexpress.com/article/world/us-deserves-to-have-prez-who-understands-dignity-of-people-says-kamala-harris-6830908/',
 'https://indianexpress.com/article/world/canada-avoids-snap-election-after-parliament

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [66]:
cached_url_ie_world = []
with open('./data/indian_express/cached_url_ie_world.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ie_world.append(currentPlace)

In [67]:
len(cached_url_ie_world)

252

In [68]:
latest_url_ie_world = list(set(url_ie_world) - set(cached_url_ie_world))

#print(len(latest_url_ie_world))
#print(latest_url_ie_world)

#### Cache the latest URL for future comparisons

In [69]:
with open('./data/indian_express/cached_url_ie_world.txt', 'a') as filehandle:
    for listitem in latest_url_ie_world:
        filehandle.write('%s\n' % listitem)

### India

In [70]:
import re
import requests
from bs4 import BeautifulSoup

response_ie_india = requests.get(url='https://indianexpress.com/section/india/', headers={'User-Agent':''})
soup_ie_india = BeautifulSoup(response_ie_india.content, 'html.parser')

links_ie_india = []
for link in soup_ie_india.findAll('a', href=True):
    links_ie_india.append(link.get('href'))
#print(len(links_ie_india))
#print(links_ie_india)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only India-related ones. I have used RegEx for this task. Indian Express article URLs start with 'https://indianexpress.com/article/india/', so I used that prefix to find all the valid URLs.

In [71]:
import re
txt_ie_india = ' '.join(links_ie_india)
url_ie_india = re.findall(r'https://indianexpress.com/article/india/[a-zA-Z0-9\/\.\-\:]*', txt_ie_india)
#print(len(url_ie_india))
#print(url_ie_india)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [72]:
url_ie_india = set(url_ie_india)
url_ie_india = list(url_ie_india)
#print(len(url_ie_india))
url_ie_india

['https://indianexpress.com/article/india/case-over-patidar-agitation-hardik-moves-gujarat-hc-seeks-deletion-of-bail-clause-6828737/',
 'https://indianexpress.com/article/india/kin-of-kerala-journalist-facing-uapa-charges-meet-rahul-gandhi/',
 'https://indianexpress.com/article/india/gujarat-bjp-suspends-three-over-filing-papers-against-party-nominees-6828504/',
 'https://indianexpress.com/article/india/police-commemoration-day-in-remembering-the-slain-stories-of-loss-bravery-and-belonging-6832333/',
 'https://indianexpress.com/article/india/coronavirus-india-cases-deaths-covid-19-peak-schools-live-updates-6817667/',
 'https://indianexpress.com/article/india/amit-shah-birthday-narendra-modi-wishes-6831614/',
 'https://indianexpress.com/article/india/narendra-modi-to-virtually-inaugurate-three-projects-in-gujarat-on-oct-24-6828189/',
 'https://indianexpress.com/article/india/kerala-to-start-msp-for-vegetables-from-nov-1-6825840/',
 'https://indianexpress.com/article/india/chhattisgarh-t

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [73]:
cached_url_ie_india = []
with open('./data/indian_express/cached_url_ie_india.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ie_india.append(currentPlace)

In [74]:
len(cached_url_ie_india)

260

In [75]:
latest_url_ie_india = list(set(url_ie_india) - set(cached_url_ie_india))

#print(len(latest_url_ie_india))
#print(latest_url_ie_india)

#### Cache the latest URL for future comparisons

In [76]:
with open('./data/indian_express/cached_url_ie_india.txt', 'a') as filehandle:
    for listitem in latest_url_ie_india:
        filehandle.write('%s\n' % listitem)

In [77]:
%reset -f

# Times Now

### Business

In [78]:
import re
import requests
from bs4 import BeautifulSoup

response_tn_business = requests.get(url='https://www.timesnownews.com/business-economy', headers={'User-Agent':''})
soup_tn_business = BeautifulSoup(response_tn_business.content, 'html.parser')

links_tn_business = []
for link in soup_tn_business.findAll('a', href=True):
    links_tn_business.append(link.get('href'))
#print(len(links_tn_business))
#print(links_tn_business)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only business-related ones. I have used RegEx for this task. Times Now article links contain '/business-economy/' and end with a numeric article ID, and they appear in both absolute and site-relative form, so the pattern covers both.

In [79]:
import re
txt_tn_business = ' '.join(links_tn_business)
url_tn_business_raw = re.findall(r'https://www.timesnownews.com/business-economy/[a-zA-Z0-9\/\.\-\:]*/[0-9]+|/business-economy/[a-zA-Z0-9\/\.\-\:]*/[0-9]+', txt_tn_business)
#print(len(url_tn_business_raw))
#print(url_tn_business_raw)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [80]:
url_tn_business_raw = set(url_tn_business_raw)
url_tn_business_raw = list(url_tn_business_raw)
#print(len(url_tn_business_raw))
url_tn_business_raw

['https://www.timesnownews.com/business-economy/companies/article/ril-stock-rally-swells-mukesh-ambanis-net-worth-by-rs-41500-crore-to-rs-6-56-lakh-crore-in-a-day/650845',
 '/business-economy/real-estate/article/july-sales-of-new-homes-surge-13-9-far-more-than-thought/642730',
 '/business-economy/industry/article/aai-plans-to-develop-100-airports-waterdromes-heliports-under-udan-by/670866',
 'https://www.timesnownews.com/business-economy/industry/article/delhi-airport-emerges-second-safest-globally-on-covid-19-related-safety-protocols-dial/670878',
 '/business-economy/world-news/article/sweden-bans-chinas-huawei-zte-from-5g-network/670355',
 '/business-economy/personal-finance/article/latest-fixed-deposit-rates-compared-fd-rates-for-sbi-pnb-hdfc-and-other-banks/670368',
 'https://www.timesnownews.com/business-economy/economy/article/govt-needs-to-put-out-fiscal-roadmap-post-virus-says-rbi-governor-shaktikanta-das/670844',
 '/business-economy/article/goddess-lakshmi-shines-during-navrat

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category, like '/business-economy/world-news/article/sweden-bans-chinas-huawei-zte-from-5g-network/670355'. For such links we use RegEx to prepend https://www.timesnownews.com. That is what the next code snippet does.

In [81]:
url_tn_business = []
url_tn_business = [re.sub(r'(?<![a-z/:])(/business-economy/[a-zA-Z0-9\/\.\-\:]*/[0-9]+)', r'https://www.timesnownews.com\1', without_header) for without_header in url_tn_business_raw]
#print(len(url_tn_business))
#print(url_tn_business)
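
An alternative to the regular-expression substitution is `urllib.parse.urljoin` from the standard library, which resolves a site-relative path against a base URL and leaves absolute URLs unchanged. This is a sketch of that swapped-in technique, not what the cell above does; `BASE_URL`, `raw_links` and `full_links` are illustrative names, and the two sample links are taken from the output shown earlier in this section.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.timesnownews.com'

raw_links = [
    '/business-economy/world-news/article/sweden-bans-chinas-huawei-zte-from-5g-network/670355',
    'https://www.timesnownews.com/business-economy/economy/article/govt-needs-to-put-out-fiscal-roadmap-post-virus-says-rbi-governor-shaktikanta-das/670844',
]

# Relative paths get the base prepended; absolute URLs pass through unchanged
full_links = [urljoin(BASE_URL, link) for link in raw_links]
for link in full_links:
    print(link)
```

The regular-expression version needs the negative lookbehind `(?<![a-z/:])` so that it does not touch links that are already absolute; `urljoin` handles that case without a pattern.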

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [82]:
cached_url_tn_business = []
with open('./data/times_now/cached_url_tn_business.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_tn_business.append(currentPlace)

In [83]:
len(cached_url_tn_business)

309

In [84]:
latest_url_tn_business = list(set(url_tn_business) - set(cached_url_tn_business))

#print(len(latest_url_tn_business))
#print(latest_url_tn_business)

#### Cache the latest URL for future comparisons

In [85]:
with open('./data/times_now/cached_url_tn_business.txt', 'a') as filehandle:
    for listitem in latest_url_tn_business:
        filehandle.write('%s\n' % listitem)

### Sports

In [86]:
import re
import requests
from bs4 import BeautifulSoup

response_tn_sports = requests.get(url='https://www.timesnownews.com/sports', headers={'User-Agent':''})
soup_tn_sports = BeautifulSoup(response_tn_sports.content, 'html.parser')

links_tn_sports = []
for link in soup_tn_sports.findAll('a', href=True):
    links_tn_sports.append(link.get('href'))
#print(len(links_tn_sports))
#print(links_tn_sports)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only sports-related ones. I have used RegEx for this task. Times Now article links contain '/sports/' and end with a numeric article ID, and they appear in both absolute and site-relative form, so the pattern covers both.

In [87]:
import re
txt_tn_sports = ' '.join(links_tn_sports)
url_tn_sports_raw = re.findall(r'https://www.timesnownews.com/sports/[a-zA-Z0-9\/\.\-\:]*/[0-9]+|/sports/[a-zA-Z0-9\/\.\-\:]*/[0-9]+', txt_tn_sports)
#print(len(url_tn_sports_raw))
#print(url_tn_sports_raw)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [88]:
url_tn_sports_raw = set(url_tn_sports_raw)
url_tn_sports_raw = list(url_tn_sports_raw)
#print(len(url_tn_sports_raw))
url_tn_sports_raw

['/sports/cricket/article/miyan-ready-ho-jao-what-virat-kohli-told-mohammed-siraj-before-his-sensational-opening-spell-against-kkr/670927',
 'https://www.timesnownews.com/sports/tennis/article/novak-djokovic-pulls-out-of-paris-masters-with-no-points-to-win/670771',
 'https://www.timesnownews.com/sports/cricket/article/rr-vs-srh-prediction-who-will-win-rajasthan-royals-vs-sunrisers-hyderabad-ipl-2020-match-today/670962',
 '/sports/football/article/chapions-league-liverpool-edge-past-ajax-1-0-show-they-can-cope-without-virgil-van-dijk/670974',
 'https://www.timesnownews.com/sports/cricket/article/dont-think-a-lot-of-people-have-belief-in-rcb-but-i-do-virat-kohli-oozing-with-confidence-post-kkr-win/671031',
 'https://www.timesnownews.com/sports/cricket/article/ipl-2020-updated-points-table-orange-cap-purple-cap-standings-after-kkr-vs-rcb-match/670911',
 '/sports/football/article/kingsley-coman-bags-brace-as-bayern-munich-thump-atletico-madrid-4-0-in-champions-league-opener/670971',
 'http

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category, like '/sports/cricket/article/miyan-ready-ho-jao-what-virat-kohli-told-mohammed-siraj-before-his-sensational-opening-spell-against-kkr/670927'. For such links we use RegEx to prepend https://www.timesnownews.com. That is what the next code snippet does.

In [89]:
url_tn_sports = []
url_tn_sports = [re.sub(r'(?<![a-z/:])(/sports/[a-zA-Z0-9\/\.\-\:]*/[0-9]+)', r'https://www.timesnownews.com\1', without_header) for without_header in url_tn_sports_raw]
#print(len(url_tn_sports))
#print(url_tn_sports)

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [90]:
cached_url_tn_sports = []
with open('./data/times_now/cached_url_tn_sports.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_tn_sports.append(currentPlace)

In [91]:
len(cached_url_tn_sports)

258

In [92]:
latest_url_tn_sports = list(set(url_tn_sports) - set(cached_url_tn_sports))

#print(len(latest_url_tn_sports))
#print(latest_url_tn_sports)

#### Cache the latest URL for future comparisons

In [93]:
with open('./data/times_now/cached_url_tn_sports.txt', 'a') as filehandle:
    for listitem in latest_url_tn_sports:
        filehandle.write('%s\n' % listitem)

### World

In [94]:
import re
import requests
from bs4 import BeautifulSoup

response_tn_world = requests.get(url='https://www.timesnownews.com/international', headers={'User-Agent':''})
soup_tn_world = BeautifulSoup(response_tn_world.content, 'html.parser')

links_tn_world = []
for link in soup_tn_world.findAll('a', href=True):
    links_tn_world.append(link.get('href'))
#print(len(links_tn_world))
#print(links_tn_world)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only world-related ones. I have used RegEx for this task. Times Now article links contain '/international/' and end with a numeric article ID, and they appear in both absolute and site-relative form, so the pattern covers both.

In [95]:
import re
txt_tn_world = ' '.join(links_tn_world)
url_tn_world_raw = re.findall(r'https://www.timesnownews.com/international/[a-zA-Z0-9\/\.\-\:]*/[0-9]+|/international/[a-zA-Z0-9\/\.\-\:]*/[0-9]+', txt_tn_world)
#print(len(url_tn_world_raw))
#print(url_tn_world_raw)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [96]:
url_tn_world_raw = set(url_tn_world_raw)
url_tn_world_raw = list(url_tn_world_raw)
#print(len(url_tn_world_raw))
url_tn_world_raw

['/international/photo-gallery/international-womens-day-2020-greta-thunberg-to-oprah-winfrey-10-influential-female-figures-of-the-world/562427',
 'https://www.timesnownews.com/international/article/us-charges-6-russian-intelligence-officers-for-global-hacking-campaign-including-notpetya-ransomware-attacks/669917',
 'https://www.timesnownews.com/international/article/uk-govt-suffers-defeat-by-house-of-lords-over-brexit-bill/670343',
 'https://www.timesnownews.com/international/article/hindu-groups-want-kamala-harris-niece-to-apologise-for-tweeting-image-showing-aunt-as-durga/670136',
 '/international/video/uzbekistan-s-tourism-landscape-offers-tremendous-business-opportunities-for-india/322568',
 '/international/article/us-presidential-election-debate-2020-donald-trump-joe-biden-final-face-off/670920',
 'https://www.timesnownews.com/international/article/microphones-will-be-muted-in-next-us-presidential-debate-on-thursday-report/669985',
 'https://www.timesnownews.com/international/arti

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category, like '/international/article/us-presidential-election-debate-2020-donald-trump-joe-biden-final-face-off/670920'. For such links we use RegEx to prepend https://www.timesnownews.com. That is what the next code snippet does.

In [97]:
url_tn_world = []
url_tn_world = [re.sub(r'(?<![a-z/:])(/international/[a-zA-Z0-9\/\.\-\:]*/[0-9]+)', r'https://www.timesnownews.com\1', without_header) for without_header in url_tn_world_raw]
#print(len(url_tn_world))
#print(url_tn_world)

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [98]:
cached_url_tn_world = []
with open('./data/times_now/cached_url_tn_world.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_tn_world.append(currentPlace)

In [99]:
len(cached_url_tn_world)

168

In [100]:
latest_url_tn_world = list(set(url_tn_world) - set(cached_url_tn_world))

#print(len(latest_url_tn_world))
#print(latest_url_tn_world)

#### Cache the latest URL for future comparisons

In [101]:
with open('./data/times_now/cached_url_tn_world.txt', 'a') as filehandle:
    for listitem in latest_url_tn_world:
        filehandle.write('%s\n' % listitem)

### Entertainment

In [102]:
import re
import requests
from bs4 import BeautifulSoup

response_tn_entertainment = requests.get(url='https://www.timesnownews.com/entertainment-news', headers={'User-Agent':''})
soup_tn_entertainment = BeautifulSoup(response_tn_entertainment.content, 'html.parser')

links_tn_entertainment = []
for link in soup_tn_entertainment.findAll('a', href=True):
    links_tn_entertainment.append(link.get('href'))
#print(len(links_tn_entertainment))
#print(links_tn_entertainment)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only entertainment-related ones. I have used RegEx for this task. Times Now article links contain '/entertainment-news/' and end with a numeric article ID, and they appear in both absolute and site-relative form, so the pattern covers both.

In [103]:
import re
txt_tn_entertainment = ' '.join(links_tn_entertainment)
url_tn_entertainment_raw = re.findall(r'https://www.timesnownews.com/entertainment-news/[a-zA-Z0-9\/\.\-\:]*/[0-9]+|/entertainment-news/[a-zA-Z0-9\/\.\-\:]*/[0-9]+', txt_tn_entertainment)
#print(len(url_tn_entertainment_raw))
#print(url_tn_entertainment_raw)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [104]:
url_tn_entertainment_raw = set(url_tn_entertainment_raw)
url_tn_entertainment_raw = list(url_tn_entertainment_raw)
#print(len(url_tn_entertainment_raw))
url_tn_entertainment_raw

['https://www.timesnownews.com/entertainment-news/article/mandira-bedi-farida-jalal-anupam-kher-parmeet-sethi-how-dilwale-dulhania-le-jayenge-supporting-cast-members-look-like-now/670664',
 'https://www.timesnownews.com/entertainment-news/article/video-sidharth-shukla-screams-with-anger-as-hina-khan-gauahar-khan-say-his-team-cheated-during-task/670074',
 '/entertainment-news/article/akshay-kumar-continues-to-get-trolled-on-twitter-as-hindu-sena-demands-laxmmi-bomb-title-change/670554',
 'https://www.timesnownews.com/entertainment-news/article/kbc-12-man-urges-amitabh-bachchan-to-visit-his-ancestral-village-babu-patti-here-s-how-he-responds/670366',
 '/entertainment-news/article/bride-to-be-neha-kakkar-shares-unseen-photos-of-rohanpreet-singhs-romantic-proposal-ahead-of-wedding/671062',
 'https://www.timesnownews.com/entertainment-news/article/deepika-padukone-alia-bhatt-anushka-sharma-kareena-kapoor-khan-when-bollywood-actresses-wore-pantsuits-without-shirts/669929',
 'https://www.time

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category, like '/entertainment-news/article/akshay-kumar-continues-to-get-trolled-on-twitter-as-hindu-sena-demands-laxmmi-bomb-title-change/670554'. For such links we use RegEx to prepend https://www.timesnownews.com. That is what the next code snippet does.

In [105]:
url_tn_entertainment = []
url_tn_entertainment = [re.sub(r'(?<![a-z/:])(/entertainment-news/[a-zA-Z0-9\/\.\-\:]*/[0-9]+)', r'https://www.timesnownews.com\1', without_header) for without_header in url_tn_entertainment_raw]
#print(len(url_tn_entertainment))
#print(url_tn_entertainment)

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [106]:
cached_url_tn_entertainment = []
with open('./data/times_now/cached_url_tn_entertainment.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_tn_entertainment.append(currentPlace)

In [107]:
len(cached_url_tn_entertainment)

134

In [108]:
latest_url_tn_entertainment = list(set(url_tn_entertainment) - set(cached_url_tn_entertainment))

#print(len(latest_url_tn_entertainment))
#print(latest_url_tn_entertainment)

#### Cache the latest URL for future comparisons

In [109]:
with open('./data/times_now/cached_url_tn_entertainment.txt', 'a') as filehandle:
    for listitem in latest_url_tn_entertainment:
        filehandle.write('%s\n' % listitem)

### India

In [110]:
import re
import requests
from bs4 import BeautifulSoup

response_tn_india = requests.get(url='https://www.timesnownews.com/india', headers={'User-Agent':''})
soup_tn_india = BeautifulSoup(response_tn_india.content, 'html.parser')

links_tn_india = []
for link in soup_tn_india.findAll('a', href=True):
    links_tn_india.append(link.get('href'))
#print(len(links_tn_india))
#print(links_tn_india)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only India-related ones. I have used RegEx for this task. Times Now article links contain '/india/' and end with a numeric article ID, and they appear in both absolute and site-relative form, so the pattern covers both.

In [111]:
import re
txt_tn_india = ' '.join(links_tn_india)
url_tn_india_raw = re.findall(r'https://www.timesnownews.com/india/[a-zA-Z0-9\/\.\-\:]*/[0-9]+|/india/[a-zA-Z0-9\/\.\-\:]*/[0-9]+', txt_tn_india)
#print(len(url_tn_india_raw))
#print(url_tn_india_raw)

#### Removing duplicates
Duplicates may be present. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [112]:
url_tn_india_raw = set(url_tn_india_raw)
url_tn_india_raw = list(url_tn_india_raw)
#print(len(url_tn_india_raw))
url_tn_india_raw

['https://www.timesnownews.com/india/article/indigenously-built-stealth-corvette-ins-kavaratti-to-be-commissioned-into-navy-on-thursday/670914',
 'https://www.timesnownews.com/india/maharashtra-news/article/palghar-lynching-case-208-new-accused-50-nabbed-by-cid/670901',
 'https://www.timesnownews.com/india/article/india-us-22-dialogue-on-october-27-talks-to-focus-on-strengthening-strategic-ties/670908',
 '/india/article/kerala-govt-revokes-decision-on-salary-cut-deferred-salary-from-april-to-merge-with-pf/671014',
 '/india/article/cabinet-nod-to-defence-ministry-s-proposal-to-sign-beca-with-united-states-of-america/671017',
 'https://www.timesnownews.com/india/article/bjp-takes-it-on-the-chin-even-as-ncp-claims-not-just-eknath-khadse-many-willing-to-quit-party-to-join-aghadi/670876',
 'https://www.timesnownews.com/india/article/india-to-observe-october-22-as-black-day-to-highlight-pakistan-backed-militias-invasion-of-jk/670967',
 'https://www.timesnownews.com/india/article/environmenta

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category, like '/india/article/kerala-govt-revokes-decision-on-salary-cut-deferred-salary-from-april-to-merge-with-pf/671014'. For such links we use RegEx to prepend https://www.timesnownews.com. That is what the next code snippet does.

In [113]:
url_tn_india = []
url_tn_india = [re.sub(r'(?<![a-z/:])(/india/[a-zA-Z0-9\/\.\-\:]*/[0-9]+)', r'https://www.timesnownews.com\1', without_header) for without_header in url_tn_india_raw]
#print(len(url_tn_india))
#print(url_tn_india)

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. So, we compare them against the URLs we already have to filter out any redundant articles. The previously seen URLs are read from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [114]:
cached_url_tn_india = []
with open('./data/times_now/cached_url_tn_india.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline (safe even if the last line has none)
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_tn_india.append(currentPlace)

In [115]:
len(cached_url_tn_india)

285

In [116]:
latest_url_tn_india = list(set(url_tn_india) - set(cached_url_tn_india))

#print(len(latest_url_tn_india))
#print(latest_url_tn_india)

#### Cache the latest URL for future comparisons

In [117]:
with open('./data/times_now/cached_url_tn_india.txt', 'a') as filehandle:
    for listitem in latest_url_tn_india:
        filehandle.write('%s\n' % listitem)
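
The read, compare and append steps above recur for every category, so they could be wrapped in two small helpers. The sketch below is one possible shape, not the notebook's own code; `load_cached_urls` and `update_cache` are hypothetical names, and the example.com URLs and the temporary file are placeholders. It assumes a missing cache file should count as empty, and it keeps the new links in the order they were given.

```python
import os
import tempfile

def load_cached_urls(path):
    """Read one URL per line; a missing cache file counts as empty."""
    if not os.path.exists(path):
        return []
    with open(path, 'r') as filehandle:
        return [line.rstrip('\n') for line in filehandle if line.strip()]

def update_cache(path, todays_urls):
    """Append URLs not yet in the cache and return them in input order."""
    cached = set(load_cached_urls(path))
    latest = []
    for url in todays_urls:
        if url not in cached:
            cached.add(url)
            latest.append(url)
    with open(path, 'a') as filehandle:
        for url in latest:
            filehandle.write('%s\n' % url)
    return latest

# Demonstration against a throwaway file rather than the real cache.
demo_path = os.path.join(tempfile.mkdtemp(), 'cached_url_demo.txt')
first_run = update_cache(demo_path, ['https://example.com/a', 'https://example.com/b'])
second_run = update_cache(demo_path, ['https://example.com/b', 'https://example.com/c'])
```

On the second call only `https://example.com/c` is new, so only that URL is returned and appended to the file.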

In [118]:
%reset -f

# Hindustan Times

### Business

In [119]:
import re
import requests
from bs4 import BeautifulSoup

response_ht_business = requests.get(url='https://www.hindustantimes.com/business-news/', headers={'User-Agent':''})
soup_ht_business = BeautifulSoup(response_ht_business.content, 'html.parser')

links_ht_business = []
for link in soup_ht_business.findAll('a', href=True):
    links_ht_business.append(link.get('href'))
#print(len(links_ht_business))
#print(links_ht_business)
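
The link-collection step can also be tried offline on a literal HTML string. The sketch below uses the standard library's `html.parser.HTMLParser` in place of BeautifulSoup so that it runs without third-party packages; `LinkCollector`, `collect_links` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.links.append(value)

def collect_links(html_text):
    """Return all anchor hrefs found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

# Invented markup: two anchors with an href and one without.
sample_html = '''
<a href="/business-news/some-story/story-abc123.html">Story</a>
<a name="top">No href here</a>
<a href="https://www.hindustantimes.com/sports-news/">Sports</a>
'''
collected_links = collect_links(sample_html)
```

Anchors without an `href` attribute are skipped, which mirrors the `href=True` filter in the cell above.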

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only business-related articles at that. I have used RegEx for this task. Hindustan Times article URLs in this category contain '/business-news/' and end in a 'story-' ID followed by '.html', so I used those two features to find all the valid URLs.

In [120]:
import re
txt_ht_business = ' '.join(links_ht_business)
url_ht_business = re.findall(r'https://www\.hindustantimes\.com/business-news/[a-zA-Z0-9\.\-\:]*/story-[a-zA-Z0-9\-]+\.html', txt_ht_business)
#print(len(url_ht_business))
#print(url_ht_business)
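
One detail of patterns like this is worth checking: in a regular expression an unescaped `.` matches any character, so `.html` is looser than `\.html`. The sketch below compares the two on a story ID taken from the output further down and on a made-up variant; the names `loose`, `strict`, `real_link` and `odd_link` are illustrative only.

```python
import re

# Same story-ID tail, with and without the dot escaped.
loose = re.compile(r'story-[a-zA-Z0-9\-]+.html')
strict = re.compile(r'story-[a-zA-Z0-9\-]+\.html')

real_link = 'story-8T0bBx9ZqnX0a4kuCBKQmK.html'
odd_link = 'story-8T0bBx9ZqnX0a4kuCBKQmK/html'  # made-up non-article ending

loose_hits = [bool(loose.fullmatch(s)) for s in (real_link, odd_link)]
strict_hits = [bool(strict.fullmatch(s)) for s in (real_link, odd_link)]
```

The loose pattern accepts both strings, while the strict one accepts only the real `.html` ending.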

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [121]:
url_ht_business = set(url_ht_business)
url_ht_business = list(url_ht_business)
#print(len(url_ht_business))
url_ht_business

['https://www.hindustantimes.com/business-news/indian-spot-gold-rate-and-silver-price-on-oct-21-2020/story-8T0bBx9ZqnX0a4kuCBKQmK.html',
 'https://www.hindustantimes.com/business-news/indian-spot-gold-rate-and-silver-price-on-oct-22-2020/story-bAmuWmPEK1flF6DzL427kO.html',
 'https://www.hindustantimes.com/business-news/sebi-considering-multiple-steps-to-reboot-economy-chairman-ajay-tyagi/story-K5CJVXlqDR5DqvIr43B3KN.html',
 'https://www.hindustantimes.com/business-news/rbi-constantly-trying-to-be-innovative-to-aid-recovery-shaktikanta-das/story-MtjgsszamE4ueyQpZbRMIP.html',
 'https://www.hindustantimes.com/business-news/tesla-beats-on-profit-reaffirms-goal-of-500-000-deliveries/story-jWXJOvoZld49jhuyo2rVLI.html',
 'https://www.hindustantimes.com/business-news/judge-says-injury-speculative-in-trump-s-bid-to-crack-down-on-social-media/story-FaL1phTCpW6DshJjRGefGM.html',
 'https://www.hindustantimes.com/business-news/sensex-down-over-120-points-in-opening-session-nifty-at-11-900/story-LNl
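
Converting to a set does not keep the links in the order they appeared on the page. If that order matters, one option is `dict.fromkeys`, which relies on dictionaries preserving insertion order (guaranteed from Python 3.7). This is a sketch under that assumption; `dedupe_keep_order` is a hypothetical helper and the sample URLs are shortened placeholders, not real articles.

```python
def dedupe_keep_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    return list(dict.fromkeys(urls))

# Placeholder URLs: the first and last entries are duplicates.
sample_urls = [
    'https://www.hindustantimes.com/business-news/b/story-2.html',
    'https://www.hindustantimes.com/business-news/a/story-1.html',
    'https://www.hindustantimes.com/business-news/b/story-2.html',
]
unique_urls = dedupe_keep_order(sample_urls)
```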

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [122]:
cached_url_ht_business = []
with open('./data/hindustan_times/cached_url_ht_business.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ht_business.append(currentPlace)

In [123]:
len(cached_url_ht_business)

186

In [124]:
latest_url_ht_business = list(set(url_ht_business) - set(cached_url_ht_business))

#print(len(latest_url_ht_business))
#print(latest_url_ht_business)

#### Cache the latest URL for future comparisons

In [125]:
with open('./data/hindustan_times/cached_url_ht_business.txt', 'a') as filehandle:
    for listitem in latest_url_ht_business:
        filehandle.write('%s\n' % listitem)

### Sports

In [126]:
import re
import requests
from bs4 import BeautifulSoup

response_ht_sports = requests.get(url='https://www.hindustantimes.com/sports-news/', headers={'User-Agent':''})
soup_ht_sports = BeautifulSoup(response_ht_sports.content, 'html.parser')

links_ht_sports = []
for link in soup_ht_sports.findAll('a', href=True):
    links_ht_sports.append(link.get('href'))
#print(len(links_ht_sports))
#print(links_ht_sports)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only sports-related articles at that. I have used RegEx for this task. Hindustan Times sports articles are filed under section paths such as '/cricket/', '/tennis/', '/other-sports/' and '/football/' rather than '/sports-news/', and they end in a 'story-' ID followed by '.html', so I used those features to find all the valid URLs.

In [127]:
import re
txt_ht_sports = ' '.join(links_ht_sports)
url_ht_sports = re.findall(r'https://www\.hindustantimes\.com/(?:cricket|tennis|other-sports|football)/[a-zA-Z0-9\.\-\:]*/story-[a-zA-Z0-9\-]+\.html', txt_ht_sports)
#print(len(url_ht_sports))
#print(url_ht_sports)
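
Because each Hindustan Times category differs only in its section paths, the pattern could be built from a small table instead of being retyped per category. The sketch below is one way to do that; `HT_SECTIONS` and `build_ht_pattern` are hypothetical names, the section lists are copied from the patterns used in this notebook, and the sample links come from the outputs shown here.

```python
import re

# Section paths per category, copied from the patterns used in this notebook.
HT_SECTIONS = {
    'business': ['business-news'],
    'sports': ['cricket', 'tennis', 'other-sports', 'football'],
    'world': ['world-news'],
}

def build_ht_pattern(sections):
    """Compile a pattern for Hindustan Times story URLs in the given sections."""
    alternation = '|'.join(re.escape(section) for section in sections)
    return re.compile(
        r'https://www\.hindustantimes\.com/(?:' + alternation + r')/'
        r'[a-zA-Z0-9.:-]*/story-[a-zA-Z0-9-]+\.html'
    )

sample_text = ' '.join([
    'https://www.hindustantimes.com/cricket/ipl-2020-nortje-you-d-better-watch-your-speed/story-rTAPnwEqpoUfQi4iKYC3gM.html',
    'https://www.hindustantimes.com/world-news/us-says-iran-russia-attempting-to-interfere-in-election/story-Vwo1tqGcMIH7cE6lnw6GxK.html',
    'https://www.hindustantimes.com/sports-news/',
])
sports_matches = build_ht_pattern(HT_SECTIONS['sports']).findall(sample_text)
```

With the sports sections, only the cricket story is returned; the world story and the bare category page are left out.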

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [128]:
url_ht_sports = set(url_ht_sports)
url_ht_sports = list(url_ht_sports)
#print(len(url_ht_sports))
url_ht_sports

['https://www.hindustantimes.com/other-sports/japan-will-take-steps-to-guard-against-olympics-cyberattacks/story-9UzdJlWtVY45LenilVp49N.html',
 'https://www.hindustantimes.com/other-sports/sindhu-in-london-as-practice-in-national-camp-wasn-t-happening-properly-father/story-mfoc6iia6Xor8sNvLGhNiI.html',
 'https://www.hindustantimes.com/cricket/ipl-2020-nortje-you-d-better-watch-your-speed/story-rTAPnwEqpoUfQi4iKYC3gM.html',
 'https://www.hindustantimes.com/cricket/ipl-2020-hits-and-flops-spinners-who-have-thrilled-and-fallen-flat/story-fXBHGl2Kzfx1sSt8OLPeTJ.html',
 'https://www.hindustantimes.com/cricket/ipl-2020-rr-s-predicted-xi-vs-srh-steve-smith-likely-to-go-with-winning-combination/story-3tPKGAiyLwMzTu6szKf6LM.html',
 'https://www.hindustantimes.com/football/with-a-global-fanbase-spanning-millions-barcelona-look-to-flex-digital-muscles/story-AEU2TMiX6BwK05CTafcfoJ.html',
 'https://www.hindustantimes.com/cricket/ipl-2020-all-rounder-ravindra-jadeja-posts-strong-message-amid-csk-s-p

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [129]:
cached_url_ht_sports = []
with open('./data/hindustan_times/cached_url_ht_sports.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ht_sports.append(currentPlace)

In [130]:
len(cached_url_ht_sports)

329

In [131]:
latest_url_ht_sports = list(set(url_ht_sports) - set(cached_url_ht_sports))

#print(len(latest_url_ht_sports))
#print(latest_url_ht_sports)

#### Cache the latest URL for future comparisons

In [132]:
with open('./data/hindustan_times/cached_url_ht_sports.txt', 'a') as filehandle:
    for listitem in latest_url_ht_sports:
        filehandle.write('%s\n' % listitem)

### World

In [133]:
import re
import requests
from bs4 import BeautifulSoup

response_ht_world = requests.get(url='https://www.hindustantimes.com/world-news/', headers={'User-Agent':''})
soup_ht_world = BeautifulSoup(response_ht_world.content, 'html.parser')

links_ht_world = []
for link in soup_ht_world.findAll('a', href=True):
    links_ht_world.append(link.get('href'))
#print(len(links_ht_world))
#print(links_ht_world)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only world-related articles at that. I have used RegEx for this task. Hindustan Times article URLs in this category contain '/world-news/' and end in a 'story-' ID followed by '.html', so I used those two features to find all the valid URLs.

In [134]:
import re
txt_ht_world = ' '.join(links_ht_world)
url_ht_world = re.findall(r'https://www\.hindustantimes\.com/world-news/[a-zA-Z0-9\.\-\:]*/story-[a-zA-Z0-9\-]+\.html', txt_ht_world)
#print(len(url_ht_world))
#print(url_ht_world)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [135]:
url_ht_world = set(url_ht_world)
url_ht_world = list(url_ht_world)
#print(len(url_ht_world))
url_ht_world

['https://www.hindustantimes.com/world-news/barack-obama-hits-campaign-trail-slams-us-president-donald-trump/story-X1KNPnAoV6hPkyMTqXH2sN.html',
 'https://www.hindustantimes.com/world-news/us-says-iran-russia-attempting-to-interfere-in-election/story-Vwo1tqGcMIH7cE6lnw6GxK.html',
 'https://www.hindustantimes.com/world-news/democrats-to-boycott-senate-judiciary-vote-on-judge-barrett-s-nomination/story-bxNPjB1eOxeVNrO1XnMNZK.html',
 'https://www.hindustantimes.com/world-news/is-death-of-volunteer-a-roadblock-for-astrazeneca-covid-19-vaccine-latest-developments/story-JkdPahaSXgB28lLUhqxenM.html',
 'https://www.hindustantimes.com/world-news/china-struggles-to-fill-donald-trump-s-america-first-leadership-void/story-lORWq6lEcqrIv96Pd8B0NI.html',
 'https://www.hindustantimes.com/world-news/pope-francis-becomes-1st-pope-to-endorse-same-sex-civil-unions/story-vGkVW3mF3USG6GtwwihO1I.html',
 'https://www.hindustantimes.com/world-news/pakistan-unlikely-to-exit-fatf-s-grey-list-report/story-rtPe2SN

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [136]:
cached_url_ht_world = []
with open('./data/hindustan_times/cached_url_ht_world.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ht_world.append(currentPlace)

In [137]:
len(cached_url_ht_world)

298

In [138]:
latest_url_ht_world = list(set(url_ht_world) - set(cached_url_ht_world))

#print(len(latest_url_ht_world))
#print(latest_url_ht_world)

#### Cache the latest URL for future comparisons

In [139]:
with open('./data/hindustan_times/cached_url_ht_world.txt', 'a') as filehandle:
    for listitem in latest_url_ht_world:
        filehandle.write('%s\n' % listitem)

### Entertainment

In [140]:
import re
import requests
from bs4 import BeautifulSoup

response_ht_entertainment = requests.get(url='https://www.hindustantimes.com/entertainment/', headers={'User-Agent':''})
soup_ht_entertainment = BeautifulSoup(response_ht_entertainment.content, 'html.parser')

links_ht_entertainment = []
for link in soup_ht_entertainment.findAll('a', href=True):
    links_ht_entertainment.append(link.get('href'))
#print(len(links_ht_entertainment))
#print(links_ht_entertainment)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only entertainment-related articles at that. I have used RegEx for this task. Hindustan Times entertainment articles are filed under section paths such as '/bollywood/', '/tv/', '/music/', '/regional-movies/' and '/hollywood/' rather than '/entertainment/', and they end in a 'story-' ID followed by '.html', so I used those features to find all the valid URLs.

In [141]:
import re
txt_ht_entertainment = ' '.join(links_ht_entertainment)
url_ht_entertainment = re.findall(r'https://www\.hindustantimes\.com/(?:bollywood|tv|music|regional-movies|hollywood)/[a-zA-Z0-9\.\-\:]*/story-[a-zA-Z0-9\-]+\.html', txt_ht_entertainment)
#print(len(url_ht_entertainment))
#print(url_ht_entertainment)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [142]:
url_ht_entertainment = set(url_ht_entertainment)
url_ht_entertainment = list(url_ht_entertainment)
#print(len(url_ht_entertainment))
url_ht_entertainment

['https://www.hindustantimes.com/tv/find-salman-khan-s-humour-condescending-ex-bigg-boss-contestant-karanvir-bohra-defends-rubina-dilaik/story-nkldwi9tjyAfcxC2f3tGLJ.html',
 'https://www.hindustantimes.com/hollywood/evil-eye-movie-review-scariest-thing-about-priyanka-chopra-produced-amazon-horror-film-are-the-indian-accents/story-3G6EZdpI4ITnPB8rAdCN6O.html',
 'https://www.hindustantimes.com/bollywood/kareena-kapoor-shares-new-pic-from-aircraft-as-she-returns-to-mumbai-adds-mask-pehniye-aur-bahar-dekhiye/story-hisTpufm16J5jf04AgFt3K.html',
 'https://www.hindustantimes.com/music/neha-kakkar-shares-video-from-roka-ceremony-with-rohanpreet-singh-calls-it-a-gift-for-her-fans/story-zTfG5CMRHI4crRWeouDR1M.html',
 'https://www.hindustantimes.com/tv/shehzad-deol-on-eviction-from-bigg-boss-if-evictions-happen-through-in-house-voting-then-jasmin-bhasin-would-be-next/story-RkX3fPRiTk0FOMwPhaLlKM.html',
 'https://www.hindustantimes.com/tv/bigg-boss-14-contestants-salaries-revealed-rubina-dilaik-is

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [143]:
cached_url_ht_entertainment = []
with open('./data/hindustan_times/cached_url_ht_entertainment.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ht_entertainment.append(currentPlace)

In [144]:
len(cached_url_ht_entertainment)

327

In [145]:
latest_url_ht_entertainment = list(set(url_ht_entertainment) - set(cached_url_ht_entertainment))

#print(len(latest_url_ht_entertainment))
#print(latest_url_ht_entertainment)

#### Cache the latest URL for future comparisons

In [146]:
with open('./data/hindustan_times/cached_url_ht_entertainment.txt', 'a') as filehandle:
    for listitem in latest_url_ht_entertainment:
        filehandle.write('%s\n' % listitem)

### India

In [147]:
import re
import requests
from bs4 import BeautifulSoup

response_ht_india = requests.get(url='https://www.hindustantimes.com/india-news/', headers={'User-Agent':''})
soup_ht_india = BeautifulSoup(response_ht_india.content, 'html.parser')

links_ht_india = []
for link in soup_ht_india.findAll('a', href=True):
    links_ht_india.append(link.get('href'))
#print(len(links_ht_india))
#print(links_ht_india)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only India-related articles at that. I have used RegEx for this task. Hindustan Times article URLs in this category contain '/india-news/' and end in a 'story-' ID followed by '.html', so I used those two features to find all the valid URLs.

In [148]:
import re
txt_ht_india = ' '.join(links_ht_india)
url_ht_india = re.findall(r'https://www\.hindustantimes\.com/india-news/[a-zA-Z0-9\.\-\:]*/story-[a-zA-Z0-9\-]+\.html', txt_ht_india)
#print(len(url_ht_india))
#print(url_ht_india)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [149]:
url_ht_india = set(url_ht_india)
url_ht_india = list(url_ht_india)
#print(len(url_ht_india))
url_ht_india

['https://www.hindustantimes.com/india-news/news-updates-from-hindustan-times-pm-modi-to-address-bengal-on-durga-puja-today-and-all-the-latest-news/story-bxnwlVnRb6j6vRm8Ept2EP.html',
 'https://www.hindustantimes.com/india-news/for-3rd-consecutive-day-active-covid-19-cases-in-india-remain-below-7-5-lakh/story-FM4YZ95jxmGIb8OzNNBXvN.html',
 'https://www.hindustantimes.com/india-news/2g-services-extended-in-j-k-high-speed-data-in-ganderbal-udhampur/story-j9Xjr7ZiSGiy2r2r1DiqTI.html',
 'https://www.hindustantimes.com/india-news/isro-releases-draft-policy-to-regulate-space-communication-by-private-players/story-hcrB1xAKZDFNQdI0y4GatJ.html',
 'https://www.hindustantimes.com/india-news/devendra-fadnavis-calls-eknath-khadse-s-allegations-against-him-half-truth/story-nLFLvSqyw1FJt4nGwn0RkN.html',
 'https://www.hindustantimes.com/india-news/pm-modi-s-durga-puja-address-pm-to-address-people-of-west-bengal/story-di5bIfCh9fdJHpdWrFCcNP.html',
 'https://www.hindustantimes.com/india-news/india-germa

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [150]:
cached_url_ht_india = []
with open('./data/hindustan_times/cached_url_ht_india.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ht_india.append(currentPlace)

In [151]:
len(cached_url_ht_india)

319

In [152]:
latest_url_ht_india = list(set(url_ht_india) - set(cached_url_ht_india))

#print(len(latest_url_ht_india))
#print(latest_url_ht_india)

#### Cache the latest URL for future comparisons

In [153]:
with open('./data/hindustan_times/cached_url_ht_india.txt', 'a') as filehandle:
    for listitem in latest_url_ht_india:
        filehandle.write('%s\n' % listitem)

In [154]:
%reset -f

# ANI

### Business

In [155]:
import re
import requests
from bs4 import BeautifulSoup

response_ani_business = requests.get(url='https://aninews.in/category/business/', headers={'User-Agent':''})
soup_ani_business = BeautifulSoup(response_ani_business.content, 'html.parser')

links_ani_business = []
for link in soup_ani_business.findAll('a', href=True):
    links_ani_business.append(link.get('href'))
#print(len(links_ani_business))
#print(links_ani_business)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only business-related articles at that. I have used RegEx for this task. ANI article URLs contain '/news/business/' and end in a run of digits, so I used those two features to find all the valid URLs, in both their absolute and site-relative forms.

In [156]:
import re
txt_ani_business = ' '.join(links_ani_business)
url_ani_business_raw = re.findall(r'https://aninews.in/news/business/[a-zA-Z0-9\/\.\-\:]+[0-9]+|/news/business/[a-zA-Z0-9\/\.\-\:]+[0-9]+', txt_ani_business)
#print(len(url_ani_business_raw))
#print(url_ani_business_raw)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [157]:
url_ani_business_raw = set(url_ani_business_raw)
url_ani_business_raw = list(url_ani_business_raw)
#print(len(url_ani_business_raw))
url_ani_business_raw

['/news/business/zuper-covid-19-compliance-pack-helps-companies-like-ikea-manage-safe-business-operations-in-the-new-reality20201021151738',
 '/news/business/khaaugully-delivers-the-food-you-want-launches-online-store-having-2500-items20201021133735',
 '/news/business/satte-genx-paves-the-way-for-the-revival-of-the-travel-amp-tourism-industry20201021133543',
 '/news/business/cosmoprof-india-to-be-rescheduled-in-202120201021105426',
 '/news/business/kathak-queen-jayanti-mala-inaugurates-online-dance-platform-kathakworldcom20201021151557',
 '/news/business/your-guide-to-indulging-without-overindulging-this-festive-season-with-california-walnuts20201021105508',
 '/news/business/as-part-of-its-customer-first-approach-design-cafe-launches-a-fully-integrated-home-tech-platform-for-customers20201021134115',
 '/news/business/business/sbi-announces-interest-concession-up-to-25-bps-home-loan-emis-to-reduce20201021152658',
 '/news/business/business/jio-qualcomm-align-efforts-on-5g-achieve-over-1g

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category path, like '/news/business/cosmoprof-india-to-be-rescheduled-in-202120201021105426'. For such links we use RegEx to prepend https://aninews.in. That is what the next code snippet does.

In [158]:
url_ani_business = []
url_ani_business = [re.sub(r'(?<![a-z/:])(/news/business/[a-zA-Z0-9\/\.\-\:]+[0-9]+)', r'https://aninews.in\1', without_header) for without_header in url_ani_business_raw]
#print(len(url_ani_business))
#print(url_ani_business)
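
The ANI links in the output above end in a long run of digits whose last 14 characters read like a `YYYYMMDDHHMMSS` timestamp. That reading is an assumption based only on the sample output, not something the site documents. If it holds, the value could help when checking whether a link is from today. `trailing_timestamp` is a hypothetical helper written for this sketch.

```python
import re
from datetime import datetime

def trailing_timestamp(url):
    """Parse the last 14 digits of url as YYYYMMDDHHMMSS, or return None."""
    match = re.search(r'(\d{14})$', url)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    except ValueError:
        # The digits did not form a valid date and time.
        return None

# Link copied from the output above; its slug itself ends in "2021".
example_link = '/news/business/cosmoprof-india-to-be-rescheduled-in-202120201021105426'
parsed_timestamp = trailing_timestamp(example_link)
```

Anchoring the pattern at the end of the string means that digits belonging to the headline, such as the "2021" in this example, are not mistaken for part of the timestamp.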

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [159]:
cached_url_ani_business = []
with open('./data/ani/cached_url_ani_business.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ani_business.append(currentPlace)

In [160]:
len(cached_url_ani_business)

260

In [161]:
latest_url_ani_business = list(set(url_ani_business) - set(cached_url_ani_business))

#print(len(latest_url_ani_business))
#print(latest_url_ani_business)

#### Cache the latest URL for future comparisons

In [162]:
with open('./data/ani/cached_url_ani_business.txt', 'a') as filehandle:
    for listitem in latest_url_ani_business:
        filehandle.write('%s\n' % listitem)

### Sports

In [163]:
import re
import requests
from bs4 import BeautifulSoup

response_ani_sports = requests.get(url='https://aninews.in/category/sports/', headers={'User-Agent':''})
soup_ani_sports = BeautifulSoup(response_ani_sports.content, 'html.parser')

links_ani_sports = []
for link in soup_ani_sports.findAll('a', href=True):
    links_ani_sports.append(link.get('href'))
#print(len(links_ani_sports))
#print(links_ani_sports)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only sports-related articles at that. I have used RegEx for this task. ANI article URLs contain '/news/sports/' and end in a run of digits, so I used those two features to find all the valid URLs, in both their absolute and site-relative forms.

In [164]:
import re
txt_ani_sports = ' '.join(links_ani_sports)
url_ani_sports_raw = re.findall(r'https://aninews.in/news/sports/[a-zA-Z0-9\/\.\-\:]+[0-9]+|/news/sports/[a-zA-Z0-9\/\.\-\:]+[0-9]+', txt_ani_sports)
#print(len(url_ani_sports_raw))
#print(url_ani_sports_raw)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [165]:
url_ani_sports_raw = set(url_ani_sports_raw)
url_ani_sports_raw = list(url_ani_sports_raw)
#print(len(url_ani_sports_raw))
url_ani_sports_raw

['/news/sports/cricket/csa-gets-go-ahead-from-govt-to-host-england-for-limited-overs-series20201021222556',
 '/news/sports/cricket/mcilroy-rahm-morikawa-set-to-join-woods-for-zozo-championship-at-sherwood20201021204145',
 '/news/sports/football/it-was-massive-win-against-psg-says-david-de-gea20201021194508',
 '/news/sports/cricket/afghanistan-spinner-rashid-khan-re-signs-for-adelaide-strikers20201022081220',
 '/news/sports/cricket/ipl-13-siraj-gurkeerat-guide-rcb-to-comprehensive-win-over-kkr20201021223554',
 '/news/sports/cricket/ipl-13-morris-inclusion-has-bolstered-the-bowling-unit-says-rcb-pacer-siraj20201022103040',
 '/news/sports/cricket/ipl-13-wanted-to-deliver-magical-performance-for-rcb-says-siraj-after-heroics-against-kkr20201022090046',
 '/news/sports/football/achraf-hakimi-tests-positive-for-coronavirus20201022100101',
 '/news/sports/others/world-junior-badminton-championships-2020-cancelled20201022085605',
 '/news/sports/cricket/romario-shepherd-to-replace-injured-dwayne-b

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category path, like '/news/sports/football/achraf-hakimi-tests-positive-for-coronavirus20201022100101'. For such links we use RegEx to prepend https://aninews.in. That is what the next code snippet does.

In [166]:
url_ani_sports = []
url_ani_sports = [re.sub(r'(?<![a-z/:])(/news/sports/[a-zA-Z0-9\/\.\-\:]+[0-9]+)', r'https://aninews.in\1', without_header) for without_header in url_ani_sports_raw]
#print(len(url_ani_sports))
#print(url_ani_sports)

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [167]:
cached_url_ani_sports = []
with open('./data/ani/cached_url_ani_sports.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ani_sports.append(currentPlace)

In [168]:
len(cached_url_ani_sports)

319

In [169]:
latest_url_ani_sports = list(set(url_ani_sports) - set(cached_url_ani_sports))

#print(len(latest_url_ani_sports))
#print(latest_url_ani_sports)

#### Cache the latest URL for future comparisons

In [170]:
with open('./data/ani/cached_url_ani_sports.txt', 'a') as filehandle:
    for listitem in latest_url_ani_sports:
        filehandle.write('%s\n' % listitem)

### World

In [171]:
import re
import requests
from bs4 import BeautifulSoup

response_ani_world = requests.get(url='https://aninews.in/category/world/', headers={'User-Agent':''})
soup_ani_world = BeautifulSoup(response_ani_world.content, 'html.parser')

links_ani_world = []
for link in soup_ani_world.findAll('a', href=True):
    links_ani_world.append(link.get('href'))
#print(len(links_ani_world))
#print(links_ani_world)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only world-related articles at that. I have used RegEx for this task. ANI article URLs contain '/news/world/' and end in a run of digits, so I used those two features to find all the valid URLs, in both their absolute and site-relative forms.

In [172]:
import re
txt_ani_world = ' '.join(links_ani_world)
url_ani_world_raw = re.findall(r'https://aninews.in/news/world/[a-zA-Z0-9\/\.\-\:]+[0-9]+|/news/world/[a-zA-Z0-9\/\.\-\:]+[0-9]+', txt_ani_world)
#print(len(url_ani_world_raw))
#print(url_ani_world_raw)

#### Removing duplicates
The list can contain duplicates. A simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [173]:
url_ani_world_raw = set(url_ani_world_raw)
url_ani_world_raw = list(url_ani_world_raw)
#print(len(url_ani_world_raw))
url_ani_world_raw

['/news/world/us/obama-launches-fiery-attack-on-trump-says-he-cant-even-take-basic-steps-to-protect-himself20201022053703',
 '/news/world/europe/over-40-hostages-released-in-georgias-zugdidi20201022034519',
 '/news/world/asia/us-approves-usd-18-billion-in-arms-sales-to-taiwan-amid-tensions-with-china20201022083125',
 '/news/world/europe/eu-council-chief-to-convene-informal-meeting-on-covid-19-on-october-2920201022010220',
 '/news/world/asia/moscows-covid-19-death-toll-rises-to-618720201022020843',
 '/news/world/asia/5-killed-28-injured-in-building-explosion-in-karachi20201022031434',
 '/news/world/us/biden-tied-with-trump-in-battleground-state-of-texas-poll20201022014109',
 '/news/world/asia/10-cops-killed-in-clashes-between-army-and-karachi-police20201022010320',
 '/news/world/others/brazils-covid-19-death-toll-tops-15500020201022085245',
 '/news/world/others/brazils-covid-19-vaccine-volunteer-dies-authorities-say-trial-to-continue20201022065940',
 '/news/world/asia/indian-diplomat-su

#### Filtering without header URLs and adding header
As you can see above, many URLs don't start with https or even www; instead they start from the category path, like '/news/world/europe/over-40-hostages-released-in-georgias-zugdidi20201022034519'. For such links we use RegEx to prepend https://aninews.in. That is what the next code snippet does.

In [174]:
url_ani_world = []
url_ani_world = [re.sub(r'(?<![a-z/:])(/news/world/[a-zA-Z0-9\/\.\-\:]+[0-9]+)', r'https://aninews.in\1', without_header) for without_header in url_ani_world_raw]
#print(len(url_ani_world))
#print(url_ani_world)

#### Cached Previous day URL to compare against the new URL
The links collected here may include articles already seen on the previous day or a few days earlier. To find such redundant articles, we compare them against the URLs we already have, which we read back from a cache file on the local filesystem.

#### Compare against cached URLs to find only new links for today

In [175]:
cached_url_ani_world = []
with open('./data/ani/cached_url_ani_world.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break; unlike line[:-1], this is safe when the last line has none
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ani_world.append(currentPlace)

In [176]:
len(cached_url_ani_world)

359

In [177]:
latest_url_ani_world = list(set(url_ani_world) - set(cached_url_ani_world))

#print(len(latest_url_ani_world))
#print(latest_url_ani_world)

#### Cache the latest URL for future comparisons

In [178]:
with open('./data/ani/cached_url_ani_world.txt', 'a') as filehandle:
    for listitem in latest_url_ani_world:
        filehandle.write('%s\n' % listitem)

### Entertainment

In [179]:
import re
import requests
from bs4 import BeautifulSoup

response_ani_entertainment = requests.get(url='https://aninews.in/category/entertainment/', headers={'User-Agent':''})
soup_ani_entertainment = BeautifulSoup(response_ani_entertainment.content, 'html.parser')

links_ani_entertainment = []
for link in soup_ani_entertainment.findAll('a', href=True):
    links_ani_entertainment.append(link.get('href'))
#print(len(links_ani_entertainment))
#print(links_ani_entertainment)

#### Finding only the relevant URLs from the list of all links.
Now we have to keep only the links that point to articles, and only entertainment-related articles at that. I have used RegEx for this task. ANI article URLs contain '/news/entertainment/' and end in a run of digits, so I used those two features to find all the valid URLs, in both their absolute and site-relative forms.

In [180]:
import re
txt_ani_entertainment = ' '.join(links_ani_entertainment)
url_ani_entertainment_raw = re.findall(r'https://aninews.in/news/entertainment/[a-zA-Z0-9\/\.\-\:]+[0-9]+|/news/entertainment/[a-zA-Z0-9\/\.\-\:]+[0-9]+', txt_ani_entertainment)
#print(len(url_ani_entertainment_raw))
#print(url_ani_entertainment_raw)
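To see what the pattern keeps and what it drops, it can be run on a handful of links. This is a sketch under stated assumptions: the pattern below is a compact variant of the one above (optional host prefix, escaped dots), the first link comes from the notebook output, and the absolute link and the sports link are hypothetical examples.

```python
import re

# Optional host prefix, then the ANI entertainment path, ending in digits.
ANI_ENTERTAINMENT = re.compile(
    r'(?:https://aninews\.in)?/news/entertainment/[a-zA-Z0-9/.\-:]+[0-9]+'
)

sample_links = [
    '/news/entertainment/hollywood/jared-leto-to-reprise-joker-role-for-zack-snyders-justice-league20201022101822',
    # hypothetical absolute form of an article link
    'https://aninews.in/news/entertainment/bollywood/homi-adjania-remembers-irrfan-khan-with-angrezi-medium-bts-video20201021230439',
    # category listing page, not an article
    '/category/entertainment/',
    # hypothetical article from another category
    '/news/sports/cricket/hypothetical-match-report20201021000000',
]

# Same approach as the cell above: join the links, then find all matches.
matches = ANI_ENTERTAINMENT.findall(' '.join(sample_links))
print(matches)
```

Only the two entertainment article links survive; the listing page and the link from another category are filtered out.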

#### Removing duplicates
The list can contain duplicates, so a simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [181]:
url_ani_entertainment_raw = set(url_ani_entertainment_raw)
url_ani_entertainment_raw = list(url_ani_entertainment_raw)
#print(len(url_ani_entertainment_raw))
url_ani_entertainment_raw

['/news/entertainment/hollywood/jared-leto-to-reprise-joker-role-for-zack-snyders-justice-league20201022101822',
 '/news/entertainment/bollywood/ayushmann-khurrana-kick-starts-shooting-for-chandigarh-kare-aashiqui-with-vaani-kapoor20201021172104',
 '/news/entertainment/bollywood/homi-adjania-remembers-irrfan-khan-with-angrezi-medium-bts-video20201021230439',
 '/news/entertainment/bollywood/will-miss-my-friends-desperately-anupam-kher-pens-heartfelt-note-to-friends-anil-kapoor-satish-kaushik20201021153313',
 '/news/entertainment/bollywood/kareena-kapoor-khan-advices-fans-to-wear-mask-in-latest-instagram-post20201021204057',
 '/news/entertainment/out-of-box/kangana-ranaut-dazzles-in-pastel-lehenga-for-cousins-wedding20201021185920',
 '/news/entertainment/bollywood/on-shammi-kapoors-birth-anniversary-lata-mangeshkar-recalls-late-actors-singing-skills20201021195816',
 '/news/entertainment/bollywood/john-abraham-divya-khosla-kumar-kick-start-shooting-of-satyameva-jayate-220201021155313',
 '

#### Filtering without header URLs and adding header
As the output above shows, many URLs do not start with https or even www; they start from the path, like '/news/entertainment/hollywood/jared-leto-to-reprise-joker-role-for-zack-snyders-justice-league20201022101822'. For such links we use RegEx to prepend https://aninews.in. That is what the next code snippet does.

In [182]:
url_ani_entertainment = []
url_ani_entertainment = [re.sub(r'(?<![a-z/:])(/news/entertainment/[a-zA-Z0-9\/\.\-\:]+[0-9]+)', r'https://aninews.in\1', without_header) for without_header in url_ani_entertainment_raw]
#print(len(url_ani_entertainment))
#print(url_ani_entertainment)
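As an alternative to the `re.sub` approach, the standard library's `urllib.parse.urljoin` can resolve relative paths against a base URL and leaves absolute URLs untouched. This is only a sketch of that swapped-in technique, not what the notebook does; `ANI_BASE` is a name introduced here, and the absolute link in the sample is hypothetical.

```python
from urllib.parse import urljoin

ANI_BASE = 'https://aninews.in'  # name introduced for this sketch

raw_links = [
    '/news/entertainment/out-of-box/kangana-ranaut-dazzles-in-pastel-lehenga-for-cousins-wedding20201021185920',
    # hypothetical link that is already absolute
    'https://aninews.in/news/entertainment/bollywood/homi-adjania-remembers-irrfan-khan-with-angrezi-medium-bts-video20201021230439',
]

# Relative paths get the base prepended; absolute URLs pass through unchanged.
absolute_links = [urljoin(ANI_BASE, link) for link in raw_links]
print(absolute_links)
```

This avoids repeating the article pattern inside the substitution, at the cost of also resolving any non-article link that is passed in.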

#### Cached Previous day URL to compare against the new URL
The links collected here can include duplicates from the previous day or from a few days back. So we compare them against the URLs we have already stored to find any redundant articles. A plain text file on the local filesystem holds the cached URLs.

#### Compare against cached URLs to find only new links for today

In [183]:
cached_url_ani_entertainment = []
with open('./data/ani/cached_url_ani_entertainment.txt', 'r') as filehandle:
    for line in filehandle:
        # strip only the trailing newline, so a final line without one is not truncated
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ani_entertainment.append(currentPlace)

In [184]:
len(cached_url_ani_entertainment)

217

In [185]:
latest_url_ani_entertainment = list(set(url_ani_entertainment) - set(cached_url_ani_entertainment))

#print(len(latest_url_ani_entertainment))
#print(latest_url_ani_entertainment)

#### Cache the latest URL for future comparisons

In [186]:
with open('./data/ani/cached_url_ani_entertainment.txt', 'a') as filehandle:
    for listitem in latest_url_ani_entertainment:
        filehandle.write('%s\n' % listitem)

### Politics

In [187]:
import re
import requests
from bs4 import BeautifulSoup

response_ani_politics = requests.get(url='https://aninews.in/category/national/politics/', headers={'User-Agent':''})
soup_ani_politics = BeautifulSoup(response_ani_politics.content, 'html.parser')

links_ani_politics = []
for link in soup_ani_politics.findAll('a', href=True):
    links_ani_politics.append(link.get('href'))
#print(len(links_ani_politics))
#print(links_ani_politics)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only politics-related ones. I have used RegEx for this task. ANI article paths contain '/news/national/politics/' and end with a run of digits (what looks like a timestamp, such as 20201020195455), so I made use of those two features to find all the valid URLs.

In [188]:
import re
txt_ani_politics = ' '.join(links_ani_politics)
url_ani_politics_raw = re.findall(r'https://aninews.in/news/national/politics/[a-zA-Z0-9\/\.\-\:]+[0-9]+|/news/national/politics/[a-zA-Z0-9\/\.\-\:]+[0-9]+', txt_ani_politics)
#print(len(url_ani_politics_raw))
#print(url_ani_politics_raw)

#### Removing duplicates
The list can contain duplicates, so a simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [189]:
url_ani_politics_raw = set(url_ani_politics_raw)
url_ani_politics_raw = list(url_ani_politics_raw)
#print(len(url_ani_politics_raw))
url_ani_politics_raw

['/news/national/politics/prashant-kishor-is-not-backing-ljp-chirag-paswan-clarifies-after-rumours20201020195455',
 '/news/national/politics/604-people-booked-in-tns-thoothukudi-after-clash-between-aiadmk-dmk-cadres20201022111314',
 '/news/national/politics/people-should-choose-a-party-or-leader-based-on-past-work-nadda20201020175134',
 '/news/national/politics/im-alone-but-will-try-to-live-upto-your-expectations-chirag-paswan-to-party-workers-during-gaya-visit20201022060721',
 '/news/national/politics/shivraj-singh-chouhan-advises-kamal-nath-to-love-madhya-pradesh-and-its-people20201021031413',
 '/news/national/politics/aishwarya-daughter-of-jdu-candidate-chandrika-rai-touches-nitish-kumars-feet20201021232943',
 '/news/national/politics/will-join-ncp-on-oct-23-have-suffered-a-lot-in-bjp-khadse20201021181701',
 '/news/national/politics/nadda-accuses-congress-of-praising-pakistan-cites-examples-of-rahul-gandhi-tharoor20201020202954',
 '/news/national/politics/resignation-of-eknath-khads
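The set round-trip removes duplicates but does not keep the order in which the links appeared on the page. If that order matters (an assumption; nothing in this notebook depends on it), `dict.fromkeys` gives an order-preserving alternative. A minimal sketch using two paths from the output above, one of them repeated on purpose:

```python
raw_links = [
    '/news/national/politics/will-join-ncp-on-oct-23-have-suffered-a-lot-in-bjp-khadse20201021181701',
    '/news/national/politics/people-should-choose-a-party-or-leader-based-on-past-work-nadda20201020175134',
    # deliberate duplicate of the first entry
    '/news/national/politics/will-join-ncp-on-oct-23-have-suffered-a-lot-in-bjp-khadse20201021181701',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so the first occurrence of each link is kept in place.
unique_links = list(dict.fromkeys(raw_links))
print(unique_links)
```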

#### Filtering without header URLs and adding header
As the output above shows, many URLs do not start with https or even www; they start from the path, like '/news/national/politics/will-join-ncp-on-oct-23-have-suffered-a-lot-in-bjp-khadse20201021181701'. For such links we use RegEx to prepend https://aninews.in. That is what the next code snippet does.

In [190]:
url_ani_politics = []
url_ani_politics = [re.sub(r'(?<![a-z/:])(/news/national/politics/[a-zA-Z0-9\/\.\-\:]+[0-9]+)', r'https://aninews.in\1', without_header) for without_header in url_ani_politics_raw]
#print(len(url_ani_politics))
#print(url_ani_politics)

#### Cached Previous day URL to compare against the new URL
The links collected here can include duplicates from the previous day or from a few days back. So we compare them against the URLs we have already stored to find any redundant articles. A plain text file on the local filesystem holds the cached URLs.

#### Compare against cached URLs to find only new links for today

In [191]:
cached_url_ani_politics = []
with open('./data/ani/cached_url_ani_politics.txt', 'r') as filehandle:
    for line in filehandle:
        # strip only the trailing newline, so a final line without one is not truncated
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ani_politics.append(currentPlace)

In [192]:
len(cached_url_ani_politics)

204

In [193]:
latest_url_ani_politics = list(set(url_ani_politics) - set(cached_url_ani_politics))

#print(len(latest_url_ani_politics))
#print(latest_url_ani_politics)

#### Cache the latest URL for future comparisons

In [194]:
with open('./data/ani/cached_url_ani_politics.txt', 'a') as filehandle:
    for listitem in latest_url_ani_politics:
        filehandle.write('%s\n' % listitem)

In [195]:
%reset -f

# NDTV

### Business

In [196]:
import re
import requests
from bs4 import BeautifulSoup

response_ndtv_business = requests.get(url='https://www.ndtv.com/business/', headers={'User-Agent':''})
soup_ndtv_business = BeautifulSoup(response_ndtv_business.content, 'html.parser')

links_ndtv_business = []
for link in soup_ndtv_business.findAll('a', href=True):
    links_ndtv_business.append(link.get('href'))
#print(len(links_ndtv_business))
#print(links_ndtv_business)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only business-related ones. I have used RegEx for this task. NDTV article URLs start with 'https://www.ndtv.com/business/' and end with a numeric article ID (the pattern requires at least four trailing digits), so I made use of those two features to find all the valid URLs.

In [197]:
import re
txt_ndtv_business = ' '.join(links_ndtv_business)
url_ndtv_business = re.findall(r'https://www.ndtv.com/business/[a-zA-Z0-9\/\.\-\:]+[0-9]{4,10}', txt_ndtv_business)
#print(len(url_ndtv_business))
#print(url_ndtv_business)
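The trailing `[0-9]{4,10}` is what separates articles from section pages, which carry no numeric ID. The sketch below checks this on a few links, matching each href individually instead of joining them first; that per-link variant and the `/business/latest` link are assumptions made for illustration, while the other URLs come from the notebook outputs.

```python
import re

NDTV_BUSINESS = re.compile(
    r'https://www\.ndtv\.com/business/[a-zA-Z0-9/.\-:]+[0-9]{4,10}'
)

sample_links = [
    'https://www.ndtv.com/business/share-market-updates-nifty-sensex-track-broader-asia-lower-on-us-stimulus-setback-2313928',
    # hypothetical section page without a numeric article ID
    'https://www.ndtv.com/business/latest',
    # article from a different category
    'https://www.ndtv.com/world-news/brazilian-volunteer-in-oxford-covid-vaccine-trial-dies-say-officials-2313874',
    'https://www.ndtv.com/business/foreign-inflows-into-asian-bonds-more-than-doubles-in-september-2312432',
]

# match() anchors at the start of each link; non-matching links give None.
articles = [m.group(0) for m in map(NDTV_BUSINESS.match, sample_links) if m]
print(articles)
```

Because NDTV links on the listing page are already absolute, no step for adding the site prefix is needed here, unlike for ANI.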

#### Removing duplicates
The list can contain duplicates, so a simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [198]:
url_ndtv_business = set(url_ndtv_business)
url_ndtv_business = list(url_ndtv_business)
#print(len(url_ndtv_business))
url_ndtv_business

['https://www.ndtv.com/business/share-market-updates-nifty-sensex-track-broader-asia-lower-on-us-stimulus-setback-2313928',
 'https://www.ndtv.com/business/fixed-deposit-fd-interest-rate-sbi-hdfc-bank-kotak-mahindra-bank-icici-latest-annual-returns-2304192',
 'https://www.ndtv.com/business/sbi-fixed-deposit-fd-interest-rates-sbi-pays-4-9-return-on-1-year-fd-here-are-its-other-rates-2301442',
 'https://www.ndtv.com/business/covid-19-news-goldman-pushes-ahead-with-1-460-india-hires-internships-2229390',
 'https://www.ndtv.com/business/foreign-inflows-into-asian-bonds-more-than-doubles-in-september-2312432',
 'https://www.ndtv.com/business/prabhat-dairy-news-sebi-orders-prabhat-dairy-to-deposit-rs-1-292-crore-in-escrow-account-2313446',
 'https://www.ndtv.com/business/will-remove-constraints-governments-new-plan-to-sweeten-air-india-deal-the-loss-making-state-owned-carrier-2312762',
 'https://www.ndtv.com/business/top-court-asks-sbi-capital-markets-to-start-funding-6-stalled-projects-of-a

#### Cached Previous day URL to compare against the new URL
The links collected here can include duplicates from the previous day or from a few days back. So we compare them against the URLs we have already stored to find any redundant articles. A plain text file on the local filesystem holds the cached URLs.

#### Compare against cached URLs to find only new links for today

In [199]:
cached_url_ndtv_business = []
with open('./data/ndtv/cached_url_ndtv_business.txt', 'r') as filehandle:
    for line in filehandle:
        # strip only the trailing newline, so a final line without one is not truncated
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ndtv_business.append(currentPlace)

In [200]:
len(cached_url_ndtv_business)

143

In [201]:
latest_url_ndtv_business = list(set(url_ndtv_business) - set(cached_url_ndtv_business))

#print(len(latest_url_ndtv_business))
#print(latest_url_ndtv_business)

#### Cache the latest URL for future comparisons

In [202]:
with open('./data/ndtv/cached_url_ndtv_business.txt', 'a') as filehandle:
    for listitem in latest_url_ndtv_business:
        filehandle.write('%s\n' % listitem)

### World

In [203]:
import re
import requests
from bs4 import BeautifulSoup

response_ndtv_world = requests.get(url='https://www.ndtv.com/world-news/', headers={'User-Agent':''})
soup_ndtv_world = BeautifulSoup(response_ndtv_world.content, 'html.parser')

links_ndtv_world = []
for link in soup_ndtv_world.findAll('a', href=True):
    links_ndtv_world.append(link.get('href'))
#print(len(links_ndtv_world))
#print(links_ndtv_world)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only world-related ones. I have used RegEx for this task. NDTV article URLs start with 'https://www.ndtv.com/world-news/' and end with a numeric article ID (the pattern requires at least four trailing digits), so I made use of those two features to find all the valid URLs.

In [204]:
import re
txt_ndtv_world = ' '.join(links_ndtv_world)
url_ndtv_world = re.findall(r'https://www.ndtv.com/world-news/[a-zA-Z0-9\/\.\-\:]+[0-9]{4,10}', txt_ndtv_world)
#print(len(url_ndtv_world))
#print(url_ndtv_world)

#### Removing duplicates
The list can contain duplicates, so a simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [205]:
url_ndtv_world = set(url_ndtv_world)
url_ndtv_world = list(url_ndtv_world)
#print(len(url_ndtv_world))
url_ndtv_world

['https://www.ndtv.com/world-news/brazilian-volunteer-in-oxford-covid-vaccine-trial-dies-say-officials-2313874',
 'https://www.ndtv.com/world-news/france-vows-wont-give-up-cartoons-as-teens-charged-over-teachers-killing-2313855',
 'https://www.ndtv.com/world-news/georgia-bank-hostage-gunman-releases-majority-of-hostages-at-bank-2313920',
 'https://www.ndtv.com/world-news/astrazeneca-covid-19-vaccine-trial-volunteer-dies-says-brazil-2313790',
 'https://www.ndtv.com/world-news/barack-obama-warns-joe-biden-supporters-not-to-be-complacent-despite-polls-2313857',
 'https://www.ndtv.com/world-news/us-covid-patient-dies-on-flight-officials-didnt-know-she-was-positive-2313901',
 'https://www.ndtv.com/world-news/ex-google-boss-eric-schmidt-says-social-networks-amplifiers-for-idiots-and-crazy-people-2313895',
 'https://www.ndtv.com/world-news/coronavirus-us-body-tweaks-close-contact-definition-expands-pool-of-people-at-risk-2313892',
 'https://www.ndtv.com/world-news/tiktok-cracks-down-on-hate-o

#### Cached Previous day URL to compare against the new URL
The links collected here can include duplicates from the previous day or from a few days back. So we compare them against the URLs we have already stored to find any redundant articles. A plain text file on the local filesystem holds the cached URLs.

#### Compare against cached URLs to find only new links for today

In [206]:
cached_url_ndtv_world = []
with open('./data/ndtv/cached_url_ndtv_world.txt', 'r') as filehandle:
    for line in filehandle:
        # strip only the trailing newline, so a final line without one is not truncated
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ndtv_world.append(currentPlace)

In [207]:
len(cached_url_ndtv_world)

235

In [208]:
latest_url_ndtv_world = list(set(url_ndtv_world) - set(cached_url_ndtv_world))

#print(len(latest_url_ndtv_world))
#print(latest_url_ndtv_world)

#### Cache the latest URL for future comparisons

In [209]:
with open('./data/ndtv/cached_url_ndtv_world.txt', 'a') as filehandle:
    for listitem in latest_url_ndtv_world:
        filehandle.write('%s\n' % listitem)

### India

In [210]:
import re
import requests
from bs4 import BeautifulSoup

response_ndtv_india = requests.get(url='https://www.ndtv.com/india/', headers={'User-Agent':''})
soup_ndtv_india = BeautifulSoup(response_ndtv_india.content, 'html.parser')

links_ndtv_india = []
for link in soup_ndtv_india.findAll('a', href=True):
    links_ndtv_india.append(link.get('href'))
#print(len(links_ndtv_india))
#print(links_ndtv_india)

#### Finding only the relevant URLs from the list of all links.
Next we keep only the links that point to articles, and only India-related ones. I have used RegEx for this task. NDTV article URLs start with 'https://www.ndtv.com/india-news/' and end with a numeric article ID (the pattern requires at least four trailing digits), so I made use of those two features to find all the valid URLs.

In [211]:
import re
txt_ndtv_india = ' '.join(links_ndtv_india)
url_ndtv_india = re.findall(r'https://www.ndtv.com/india-news/[a-zA-Z0-9\/\.\-\:]+[0-9]{4,10}', txt_ndtv_india)
#print(len(url_ndtv_india))
#print(url_ndtv_india)

#### Removing duplicates
The list can contain duplicates, so a simple way to remove them is to convert the list to a set, since a set cannot contain duplicates, and then convert the set back to a list.

In [212]:
url_ndtv_india = set(url_ndtv_india)
url_ndtv_india = list(url_ndtv_india)
#print(len(url_ndtv_india))
url_ndtv_india

['https://www.ndtv.com/india-news/final-trial-of-drdo-developed-nag-missile-successful-ready-for-induction-into-army-2313971',
 'https://www.ndtv.com/india-news/as-amit-shah-turns-56-pm-narendra-modi-wishes-him-long-life-in-service-of-india-on-birthday-2313908',
 'https://www.ndtv.com/india-news/hindu-rao-doctors-urge-pm-modi-to-resolve-salary-crisis-2313860',
 'https://www.ndtv.com/india-news/pm-modi-to-join-durga-puja-event-in-bengal-today-in-bjps-big-push-for-polls-2313879',
 'https://www.ndtv.com/india-news/coronavirus-live-updates-brazil-volunteer-dies-in-astrazeneca-covid-19-vaccine-clinical-trial-2313863',
 'https://www.ndtv.com/india-news/bihar-assembly-election-2020-live-updates-nirmala-sitharaman-unveils-bjp-sankalp-patra-poll-manifesto-nitish-kumar-vs-tejaswi-yadav-bjp-jdu-vs-rjd-2313949',
 'https://www.ndtv.com/india-news/bihar-assembly-polls-2020-chirag-paswans-jail-for-scams-threat-in-fresh-attack-on-nitish-kumar-2313919',
 'https://www.ndtv.com/india-news/maharashtra-wit

#### Cached Previous day URL to compare against the new URL
The links collected here can include duplicates from the previous day or from a few days back. So we compare them against the URLs we have already stored to find any redundant articles. A plain text file on the local filesystem holds the cached URLs.

#### Compare against cached URLs to find only new links for today

In [213]:
cached_url_ndtv_india = []
with open('./data/ndtv/cached_url_ndtv_india.txt', 'r') as filehandle:
    for line in filehandle:
        # strip only the trailing newline, so a final line without one is not truncated
        currentPlace = line.rstrip('\n')

        # add item to the list
        cached_url_ndtv_india.append(currentPlace)

In [214]:
len(cached_url_ndtv_india)

287

In [215]:
latest_url_ndtv_india = list(set(url_ndtv_india) - set(cached_url_ndtv_india))

#print(len(latest_url_ndtv_india))
#print(latest_url_ndtv_india)

#### Cache the latest URL for future comparisons

In [216]:
with open('./data/ndtv/cached_url_ndtv_india.txt', 'a') as filehandle:
    for listitem in latest_url_ndtv_india:
        filehandle.write('%s\n' % listitem)

In [217]:
%reset -f
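The three NDTV categories differ only in the path segment of the pattern, so the filtering could be driven by a small lookup table. The sketch below is one possible refactoring, not part of the original notebook: `NDTV_SECTIONS`, `build_article_pattern`, and `group_by_category` are hypothetical names, and the `/video` link is a made-up non-article example.

```python
import re

# Category name -> path segment used in NDTV article URLs (taken from the patterns above).
NDTV_SECTIONS = {
    'business': 'business',
    'world': 'world-news',
    'india': 'india-news',
}


def build_article_pattern(section_slug):
    """Compile the article pattern for one section, escaping the literal prefix."""
    prefix = re.escape('https://www.ndtv.com/' + section_slug + '/')
    return re.compile(prefix + r'[a-zA-Z0-9/.\-:]+[0-9]{4,10}')


def group_by_category(hrefs):
    """Return {category: sorted unique article URLs} for a list of hrefs."""
    text = ' '.join(hrefs)
    return {
        name: sorted(set(build_article_pattern(slug).findall(text)))
        for name, slug in NDTV_SECTIONS.items()
    }


sample_links = [
    'https://www.ndtv.com/business/foreign-inflows-into-asian-bonds-more-than-doubles-in-september-2312432',
    'https://www.ndtv.com/world-news/astrazeneca-covid-19-vaccine-trial-volunteer-dies-says-brazil-2313790',
    'https://www.ndtv.com/india-news/hindu-rao-doctors-urge-pm-modi-to-resolve-salary-crisis-2313860',
    # deliberate duplicate, removed by the set
    'https://www.ndtv.com/india-news/hindu-rao-doctors-urge-pm-modi-to-resolve-salary-crisis-2313860',
    # hypothetical non-article link
    'https://www.ndtv.com/video',
]

grouped = group_by_category(sample_links)
print(grouped)
```

Adding a category then means adding one dictionary entry rather than copying a set of cells.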