<a href="https://colab.research.google.com/github/manishrawat2022/ReStock/blob/main/moneycontrol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Moneycontrol.com Scraper Notebook

#### Install required dependencies

In [51]:
!pip install requests_html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [52]:
! /usr/bin/python3 -m pip install "pymongo[srv]"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Import the required libraries

In [53]:
import requests
import requests_html
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import datetime
import pymongo
from pymongo import MongoClient

#### Function to compute links under a url

In [54]:
def compute_article_links(url):
  session = HTMLSession()
  r = session.get(url)

  element = r.html.find('ul#cagetory', first=True)
  return element.absolute_links

#### Function to scrape article data

In [55]:
def get_article_data(url):
    page = str(requests.get(url).content.decode("utf-8", "ignore"))
    soup = BeautifulSoup(page, "html.parser")
    article = {}

    try:
        article["title"] = soup.find(
            attrs={"class": "article_title"}).string.strip()
    except:
        return None;
      
    try:
        article["desc"] = soup.find(
            attrs={"class": "article_desc"}).string.strip()
    except:
        pass

    try:
        content = soup.select(".content_wrapper > p")
        article["content"] = " ".join(
            [c.string for c in content if c.string])
    except:
        pass
    
    try:
        author_element = soup.select_one(".content_block span")
        article["author"] = author_element.string
    except:
        pass
        
    try:
        time_date_element = soup.find(attrs={"class": "article_schedule"})
        time_date_string = ""
        for element in time_date_element.contents:
            if element and element.string.strip():
                time_date_string += element.string.strip()
        article["timestamp"] = datetime.datetime.strptime(time_date_string, "%B %d, %Y/ %I:%M %p %Z")
    except Exception as e:
        #print(e)
        try:
            tags_last_line = soup.select_one(".tags_last_line")
            time_date_string = tags_last_line.string.upper()
            article["timestamp"] = datetime.datetime.strptime(time_date_string, "FIRST PUBLISHED: %b %d, %Y %I:%M %p")
        except Exception as e:
            #print(e)
            return None

    return article


In [56]:
urls = compute_article_links('https://www.moneycontrol.com/news/business/')
urls

{'https://www.moneycontrol.com/news/business/adani-enterprises-loses-cils-short-term-coal-import-tender-8846541.html',
 'https://www.moneycontrol.com/news/business/amazon-ceo-andy-jassy-breaks-from-the-bezos-way-8846681.html',
 'https://www.moneycontrol.com/news/business/ashok-gehlot-approves-power-projects-of-2120-mw-capacity-8846601.html',
 'https://www.moneycontrol.com/news/business/commodities/oil-jumps-on-russia-gas-supply-jitters-weaker-dollar-8846881.html',
 'https://www.moneycontrol.com/news/business/dgca-asks-airlines-to-tighten-regulations-to-avoid-technical-malfunctions-8845101.html',
 'https://www.moneycontrol.com/news/business/earnings/arfin-india-standalone-june-2022-net-sales-at-rs-146-05-crore-up-47-09-y-o-y-8846741.html',
 'https://www.moneycontrol.com/news/business/earnings/bhansali-eng-consolidated-june-2022-net-sales-at-rs-337-41-crore-up-45-49-y-o-y-8846751.html',
 'https://www.moneycontrol.com/news/business/earnings/bhansali-eng-standalone-june-2022-net-sales-at-r

In [57]:
get_article_data('https://www.moneycontrol.com/news/business/real-estate/construction-and-demolition-waste-is-choking-bengalurus-lungs-experts-cry-for-reforms-8835051.html')

{'author': 'Souptik Datta',
 'content': 'Every morning, owners open their shops near the Tin Factory Metro Station on Old Madras Road wiping off a layer of dust from their windows or panels. However, as normal as this may sound, this is more than just pollution we see every day. The road beside the metro station is strewn with illegally dumped Construction and Demolition (C&D) waste. On the way to office every day, Balaji Ragotham comes across large tractors dumping C&D waste on the side of the road in eastern Bengaluru\'s KR Puram. "I have spoken to them often. However, they never tend to listen," he added. Sandeep Anirudhan, Convenor of Citizens’ Agenda for Bengaluru, said that from Bellandur Lake to Varthur Lake, the road is littered with C&D waste. "All across Bengaluru, C&D waste can be seen almost everywhere, be it roads, lake beds or just wetlands," he added. C&D waste is generated from the construction, renovation, repair, and demolition of houses, roads, and other real estate 

#### Connect to MongoDB

In [58]:
client = MongoClient('mongodb+srv://random:Random@stock.mbex3cy.mongodb.net/?retryWrites=true&w=majority')
db = client["Stocks"]
collection = db["moneycontrol"]

In [61]:
base_urls = {
    #"business": ("https://www.moneycontrol.com/news/business", 30),
    #"companies": ("https://www.moneycontrol.com/news/tags/companies.html", 30)
    #"economy": ("https://www.moneycontrol.com/news/business/economy", 30),
    #"personal-finance": ("https://www.moneycontrol.com/news/business/personal-finance", 30),
    #"stocks": ("https://www.moneycontrol.com/news/business/stocks", 30),
    #"tech-analysis": ("https://www.moneycontrol.com/news/tags/technical-analysis.html", 30)
}

In [62]:
for source, url_desc in base_urls.items():
    print(f"Entering : {source}")
    base_url = url_desc[0]
    page_count = url_desc[1]
    for i in range(1, page_count + 1):
        print(f"Processing page : {i}")
        if i==1:
            url = base_url
        else:
            url = base_url + "/page-" + str(i)+"/";
        links = compute_article_links(url)
        articles = []
        for link in links:
            article = get_article_data(link)
            if article == None:
                continue
            article["source"] = source
            article["link"] = link
            articles.append(article)
        collection.insert_many(articles, ordered = False)


Entering : tech-analysis
Processing page : 1


TypeError: ignored