## Web scraping notebook for www.moneycontrol.com
#### Name: Hrishikesh Milind Mahajan
#### Roll No.: PC-45
#### Serial No.: 1032171054
##### Dataset generated for seminar topic: Machine Learning techniques for Financial Sentiment Analysis

    Libraries Used: 
        1. Newspaper (https://github.com/codelucas/newspaper)
        2. NLTK (https://github.com/nltk/nltk)
        3. Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
        4. urllib (Python standard library, https://docs.python.org/3/library/urllib.html)
    
    About Data:
        The data is collected from the Moneycontrol website and contains the following fields:
        1. Author Name
        2. Summary
        3. Time of Upload 
        4. Text inside article
        5. Title of Article
        6. Metadata Description
        7. URL of the article 
        
    Procedure followed:
        1. Accept a page number range from the user
        2. Crawl each HTML page for links inside the article list DOM element
        3. Append each article link, with its upload time, to a list of URLs
        4. Pass the list to the Newspaper library
        5. Fetch the articles one by one and fill in the fields listed above
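
Step 1 of the procedure can be sketched as follows. The helper name `build_page_urls` and the inclusive page range are assumptions made for illustration; the `page-N/` suffix follows the page URLs used later in this notebook.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one listing URL per page number in [first_page, last_page]."""
    if first_page < 1 or last_page < first_page:
        raise ValueError('expected 1 <= first_page <= last_page')
    # Ensure exactly one slash between the base URL and the page suffix
    base = base_url.rstrip('/') + '/'
    return [f'{base}page-{n}/' for n in range(first_page, last_page + 1)]


urls = build_page_urls('https://www.moneycontrol.com/news/business/stocks/', 11, 13)
for u in urls:
    print(u)
```

Treating the range as inclusive avoids the off-by-one that `range(p1, pN)` would introduce if the user expects the last page to be scraped as well.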

In [1]:
# If this import fails, install the package first: pip install newspaper3k
import newspaper

ModuleNotFoundError: No module named 'newspaper'

In [22]:
import nltk
# Newspaper's article.nlp() relies on NLTK's 'punkt' tokenizer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rishikesh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [2]:
from bs4 import BeautifulSoup

In [3]:
import urllib.request 
request_url = urllib.request.urlopen('https://www.moneycontrol.com/news/business/stocks/') 
soup = BeautifulSoup(request_url, 'html.parser')

In [4]:
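# Fetch the (URL, timestamp) pair for every article listed on a single page
def get_mc_urls(base_url):
    urls_to_parse = []
    request_url = urllib.request.urlopen(base_url)
    soup = BeautifulSoup(request_url, 'html.parser')
    # The id is spelled 'cagetory' in the original notebook and is kept as-is
    article_list = soup.find(id='cagetory')
    refs = article_list.find_all('a')
    items = article_list.find_all('li')
    # Debug output: number of anchors and list items found on the page
    print(len(refs), len(items))
    for item in items:
        try:
            urls_to_parse.append((item.find('a')['href'], item.find('span').text))
        except (TypeError, AttributeError, KeyError):
            # Skip list items that have no link or no timestamp span
            continue
    # Drop duplicate (URL, timestamp) pairs; set() does not preserve page order
    return list(set(urls_to_parse))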
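
`list(set(...))` removes duplicate pairs but returns them in arbitrary order. If the order in which articles appear on the page matters, an order-preserving alternative is a small change. This is a sketch, not part of the original notebook: `dedupe_keep_order` is a hypothetical helper and the sample pairs are made up.

```python
def dedupe_keep_order(pairs):
    """Drop repeated (url, timestamp) pairs, keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(pairs))


sample = [
    ('https://example.com/a', 'time-1'),
    ('https://example.com/b', 'time-2'),
    ('https://example.com/a', 'time-1'),  # duplicate of the first entry
]
unique = dedupe_keep_order(sample)
print(unique)
```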

In [5]:
# inserter downloads each article with Newspaper and collects its fields in a pandas DataFrame
from newspaper import Article
import pandas as pd

def inserter(urls_to_parse):
    authors_list = []
    summary_list = []
    time_upload = []
    article_text_list = []
    title_list = []
    url_list = []
    description_list = []
    # Each entry is a (url, upload_time) pair produced by get_mc_urls
    for url, upload_time in urls_to_parse:
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()  # populates article.summary
        # Keep only the first listed author; use '' when none is detected
        authors_list.append(article.authors[0] if article.authors else '')
        summary_list.append(article.summary)
        time_upload.append(upload_time)
        article_text_list.append(article.text)
        title_list.append(article.title)
        url_list.append(article.url)
        description_list.append(article.meta_description)
    return pd.DataFrame({
        'time': time_upload,
        'author': authors_list,
        'title': title_list,
        'summary': summary_list,
        'description': description_list,
        'article_text': article_text_list,
        'url': url_list,
    })
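
An exception raised while downloading or parsing one article would abort the loop above and lose everything collected so far. One way to guard against that is to wrap the per-article work in a helper that records failures and moves on. The sketch below is an assumption about how this could look, not the notebook's method: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the `Article` download, parse, and nlp calls so the example runs offline.

```python
def fetch_all(urls, fetch):
    """Apply fetch(url) to every URL; collect results and failures separately."""
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(fetch(url))
        except Exception as exc:  # broad on purpose: any per-article error is recorded and skipped
            failed.append((url, repr(exc)))
    return rows, failed


def fake_fetch(url):
    # Hypothetical stand-in for the real per-article work
    if 'broken' in url:
        raise ValueError('could not download')
    return {'url': url, 'title': 'stub title'}


rows, failed = fetch_all(
    ['https://example.com/ok', 'https://example.com/broken'],
    fake_fetch,
)
print(len(rows), 'fetched;', len(failed), 'failed')
```

Returning the failed URLs alongside the rows makes it possible to retry them later instead of silently dropping them.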

In [6]:
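# Driver code: scrape pages p1 .. pN-1 and combine them into one DataFrame
backup_main_data = pd.DataFrame()
def main_scraper(base_url, p1, pN):
    # Without this declaration the assignment below would only create a local variable
    global backup_main_data
    print(f'Scraping {base_url}')
    main_data = pd.DataFrame()
    start = True
    for page_id in range(p1, pN):
        print(f'Scraping page {page_id} ({pN - p1} pages in total).')
        if start:
            # The first page is fetched from base_url exactly as given
            start = False
            page_url = base_url
        else:
            # Build the URL from the unchanged base_url; appending to base_url
            # itself would accumulate 'page-2/page-3/...' segments
            page_url = base_url + 'page-' + str(page_id) + '/'
        urls_to_parse = get_mc_urls(page_url)
        temp = inserter(urls_to_parse)
        main_data = pd.concat([main_data, temp])
        # Keep the partial result available if a later page fails
        backup_main_data = main_data
        print(f'Completed scraping page {page_id}.')
    print('Dataset ready!')
    return main_data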

In [7]:
# Scrape a single page without combining it with other pages
def page_scraper(base_url):
    print(f'Scraping {base_url}')
    urls_to_parse = get_mc_urls(base_url)
    temp = inserter(urls_to_parse)
    print('Dataset ready!')
    return temp

In [8]:
main_data = pd.DataFrame()
for i in range(11,21):
    page = page_scraper(f'https://www.moneycontrol.com/news/business/stocks/page-{i}/')
    main_data = pd.concat([main_data, page])
    print(len(main_data))

Scraping https://www.moneycontrol.com/news/business/stocks/page-11/
50 31
Dataset ready!
25
Scraping https://www.moneycontrol.com/news/business/stocks/page-12/
50 31
Dataset ready!
50
Scraping https://www.moneycontrol.com/news/business/stocks/page-13/
50 31
Dataset ready!
75
Scraping https://www.moneycontrol.com/news/business/stocks/page-14/
50 31
Dataset ready!
100
Scraping https://www.moneycontrol.com/news/business/stocks/page-15/
50 31
Dataset ready!
125
Scraping https://www.moneycontrol.com/news/business/stocks/page-16/
50 31
Dataset ready!
150
Scraping https://www.moneycontrol.com/news/business/stocks/page-17/
50 31
Dataset ready!
175
Scraping https://www.moneycontrol.com/news/business/stocks/page-18/
50 31
Dataset ready!
200
Scraping https://www.moneycontrol.com/news/business/stocks/page-19/
50 31
Dataset ready!
225
Scraping https://www.moneycontrol.com/news/business/stocks/page-20/
50 31
Dataset ready!
250


In [9]:
main_data.to_csv('MoneyControl-First-11-20-Pages.csv')
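
Because each page's DataFrame carries its own 0-based index, concatenating pages repeats index values, and `to_csv` writes that index as an extra unnamed column by default. A minimal sketch with made-up rows, assuming a clean running index is wanted in the saved file:

```python
import io
import pandas as pd

page_a = pd.DataFrame({'title': ['a1', 'a2'], 'url': ['u1', 'u2']})
page_b = pd.DataFrame({'title': ['b1', 'b2'], 'url': ['u3', 'u4']})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1
combined = pd.concat([page_a, page_b], ignore_index=True)

# index=False leaves the index out of the CSV entirely
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
print(buffer.getvalue())
```

The in-memory buffer is used only to keep the example self-contained; passing a file name to `to_csv` works the same way.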