<b>Introduction</b></br>
Financial markets move quickly, and investors need timely information to make informed decisions. Finsmes is a prominent financial news platform that covers market trends, investments, startups, venture capital, emerging opportunities, and other critical developments, which makes it a valuable source for investors who want to stay current.

<b>Rationale</b></br>
Financial markets move fast and generate a large volume of information, so investors need timely, consolidated, and actionable insights. This project responds to the difficulty investors face in efficiently gathering and analyzing relevant information from platforms like Finsmes by providing a tool that extracts key financial news data.

Financial news platforms such as Finsmes are rich sources of market trends and investment opportunities. However, extracting data from them manually is time-consuming and error-prone, which keeps investors from getting the timely information they need for decision-making. Automating the extraction provides a robust and efficient way to give investors a consolidated dataset.

By removing the manual data collection bottleneck, this project streamlines information retrieval and improves the accuracy and consistency of the extracted dataset. Investors can then focus on analyzing trends and patterns rather than spending valuable time on manual data compilation.

Ultimately, an automated web scraping tool for Finsmes should both save time and give investors a reliable resource: a consolidated dataset of news links, published dates, headlines, and content. With it, investors can stay informed, identify emerging trends, and make well-informed decisions aligned with their investment strategies.

<b>Problem Statement</b></br>
Investors face the challenge of efficiently sourcing and processing vast amounts of financial news data scattered across various platforms. Collecting such data manually is time-consuming and prone to errors. This project addresses the challenge by developing a web scraping solution tailored to extract relevant data from Finsmes, thereby mitigating the data collection bottleneck and providing a consolidated, structured dataset for analysis.

In [2]:
# import necessary libraries
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import time

In [None]:
# collect the user inputs - country and n
# country - select the country of interest from the list [ALL, USA, UK, GERMANY, FRANCE, CANADA, INDIA, ITALY]
# n - select the number of pages of interest
# strip() removes accidental spaces so that ' usa ' still matches 'usa'
country = input('Select the country of your interest from [ALL, USA, UK, GERMANY, FRANCE, CANADA, INDIA, ITALY] : ').strip().lower()
n = int(input('Enter the number of pages of your interest to scrape: '))

try:
    root = 'https://www.finsmes.com/'
    # requests for access
    source = requests.get(root)
    source.raise_for_status() # raises an HTTPError if the request fails
    
    
    # parse html content
    home_page = bs(source.text, 'html.parser')
    # extracting mainpage links for each country
    usa = home_page.find('li', id = 'menu-item-76541').find('a').get('href')
    uk = home_page.find('li', id = 'menu-item-76542').find('a').get('href')
    germany = home_page.find('li', id = 'menu-item-76543').find('a').get('href')
    france = home_page.find('li', id = 'menu-item-76547').find('a').get('href')
    canada = home_page.find('li', id = 'menu-item-76544').find('a').get('href')
    india = home_page.find('li', id = 'menu-item-76545').find('a').get('href')
    italy = home_page.find('li', id = 'menu-item-76546').find('a').get('href')
    # create a dataframe to store the extracted data
    df = pd.DataFrame(columns = ['Article URL', 'Published Date', 'Headline', 'Content'])
    
    # map each valid user input to its start url and call the sub-function
    # (the cells that define investment_news and news_parse must be run before this cell)
    start_urls = {'all': root, 'usa': usa, 'uk': uk, 'germany': germany,
                  'france': france, 'canada': canada, 'india': india, 'italy': italy}
    if country in start_urls:
        investment_news(start_urls[country], n)
    else:
        print('Kindly enter valid country!')
        
    
    # store/output the df in excel file
    df.to_excel('finsmes_investment_news.xlsx')
    
    
except Exception as e:
    print(e)
    

Select the country of your interest from [ALL, USA, UK, GERMANY, FRANCE, CANADA, INDIA, ITALY] : usa

Enter the number of pages of your interest to scrape: 3322
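
The prompts above accept any string for the country and any integer for the page count (3322 pages in the run shown). As a minimal sketch, the two inputs could be checked before any request is sent. The helper name `parse_request` and the `max_pages` cap are assumptions made for illustration and are not part of the original notebook.

```python
# Hypothetical helper: validates the two user inputs before scraping starts.
VALID_COUNTRIES = ('all', 'usa', 'uk', 'germany', 'france', 'canada', 'india', 'italy')


def parse_request(country, pages, max_pages=50):
    """Return a cleaned (country, pages) tuple or raise ValueError.

    max_pages is an assumed safety cap, not a limit imposed by the site.
    """
    country = country.strip().lower()
    if country not in VALID_COUNTRIES:
        raise ValueError(f'Unknown country: {country!r}')
    pages = int(pages)  # raises ValueError for non-numeric input
    if pages < 1:
        raise ValueError('Number of pages must be at least 1')
    return country, min(pages, max_pages)
```

Called as `parse_request(input(...), input(...))`, this would reject a misspelled country immediately instead of after the home page has been downloaded.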

In [None]:
# define a sub-function that extracts all the article urls from n listing pages
def investment_news(option, n):
    # requests for access to the first listing page
    response = requests.get(option)

    # loops for n pages from main page
    for i in range(n):

        # parse the current listing page
        main_page = bs(response.text, 'html.parser')

        # create an empty list for the urls of this page only,
        # so that earlier pages are not parsed again on every iteration
        news_links = []

        # finds all header tags with class entry-title
        links = main_page.find_all('h2', class_ = 'entry-title')

        # loops for each link in links
        for link in links:

            # extracts href content from a tag and appends it to news_links
            anchor = link.find('a')
            if anchor is not None:
                news_links.append(anchor['href'])

        # calls main function to extract the required data
        news_parse(news_links)

        # finds next page link; stop when there is no further page
        nav = main_page.find('div', class_ = 'nav-previous')
        if nav is None or nav.find('a') is None:
            break

        # requests access for next_page
        response = requests.get(nav.find('a')['href'])
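
The page-by-page loop can be checked without contacting the site if the network and parsing steps are passed in as arguments. The following is only a sketch of that idea: `crawl_pages`, `fetch`, and `extract` are hypothetical names, and the two in-memory "pages" are made-up data standing in for real listing pages.

```python
# Sketch of the pagination loop with fetching and parsing injected as callables.
def crawl_pages(start_url, n, fetch, extract):
    """Visit up to n listing pages and return every article link once.

    fetch(url)    -> page content (hypothetical; could wrap requests.get)
    extract(page) -> (list_of_links, next_url_or_None) (hypothetical)
    """
    seen = []
    url = start_url
    for _ in range(n):
        links, next_url = extract(fetch(url))
        for link in links:
            if link not in seen:  # skip links that were already collected
                seen.append(link)
        if next_url is None:  # no older page left, stop early
            break
        url = next_url
    return seen


# Made-up stand-in for two listing pages: url -> (links, next page url)
pages = {
    'page1': (['a1', 'a2'], 'page2'),
    'page2': (['a2', 'a3'], None),
}
result = crawl_pages('page1', 5, fetch=lambda u: pages[u], extract=lambda p: p)
print(result)  # → ['a1', 'a2', 'a3']
```

Keeping the loop separate from `requests` and BeautifulSoup in this way makes the two stopping conditions (page limit reached, no next link) easy to test in isolation.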
        

In [None]:
# define a main function that extracts the required data from a given article link
# url, published_date, news_headline, content are the required data in this project
def news_parse(news_links):
    # loops until all urls in the given list is parsed
    for i in range(0, len(news_links)):
        # requests for access
        url = requests.get(news_links[i])
        # wait for 1 second
        time.sleep(1)
        
        #parsing html content
        soup = bs(url.text, 'html.parser')
        
        
        # published date attribute using try and except block
        try:
            published_date = soup.find('div', class_ = 'author-date').find('time', class_ = 'entry-date published').text
        except AttributeError:
            # fall back to the alternative class; store NaN if that is missing too
            try:
                published_date = soup.find('div', class_ = 'author-date').find('time', class_ = 'entry-date published updated').text
            except AttributeError:
                published_date = np.nan
            
        
        # headline attribute
        try:
            headline = soup.find('header', class_ = 'entry-header').find('h1', class_ = 'entry-title').text
        except AttributeError:
            headline = np.nan
            
        # content attribute (text of the first paragraph only)
        try:
            content = soup.find('div', class_ = 'entry-content').find('p').text
        except AttributeError:
            content = np.nan
        
        
        # creating a dictionary using extracted data
        data = {'Article URL' : news_links[i], 'Published Date' : published_date, 'Headline' : headline, 'Content' : content}
        
        
        # append the data into the original df
        # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
        global df
        df = pd.concat([df, pd.DataFrame([data])], ignore_index = True)
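
The three try/except blocks in `news_parse` follow the same pattern: try one lookup, fall back to another, and store NaN when nothing is found. A minimal sketch of that pattern as a reusable helper is shown below. The name `first_available` is hypothetical, and the sample date is made-up data used only to exercise the fallback.

```python
import math


def first_available(*getters, default=float('nan')):
    """Return the result of the first getter that succeeds, else default.

    Each getter is a zero-argument callable. AttributeError and TypeError
    are what a chained lookup on a missing (None) tag raises, so they are
    treated as 'not found'.
    """
    for getter in getters:
        try:
            return getter()
        except (AttributeError, TypeError):
            continue
    return default


missing = None  # stands in for a soup.find(...) call that returned None

# first lookup fails, second one succeeds
date = first_available(lambda: missing.text, lambda: '2024-01-01')
# every lookup fails, so the default NaN is returned
headline = first_available(lambda: missing.text)
```

In the notebook, each getter would be a `lambda` wrapping one of the `soup.find(...)` chains, which would reduce each of the three attribute blocks to a single call.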