I will build a custom Named Entity Recognition (NER) model for this task, since NER is well suited to extracting the required information: who acquires the company, which company is being acquired, and what the sale price is. This requires labeled data. Since I couldn't find any suitable datasets online, I need to create my own, which starts with scraping news websites. We can use tools like BeautifulSoup or Selenium to extract news results from Google. The query can be just "mergers and acquisitions": the "in Germany" part is not relevant for training data, and dropping it gives us more results. I also searched in English for the same reason.
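To make the labeling target concrete, here is a minimal sketch of what one annotated example could look like, using character-offset spans in the `(text, {"entities": [...]})` layout that is commonly used with spaCy-style NER training. The label names `ACQUIRER`, `TARGET`, and `PRICE`, the helper `annotate`, and the sentence itself are illustrative assumptions, not taken from the actual dataset.

```python
def annotate(text, labeled_phrases):
    """Turn (phrase, label) pairs into (start, end, label) character spans.

    Assumes each phrase occurs in the text; only the first occurrence is used.
    """
    entities = []
    for phrase, label in labeled_phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not found in text: {phrase!r}")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": entities}


# Hypothetical sentence and labels, for illustration only
example = annotate(
    "Acme Corp agreed to acquire Globex GmbH for $1.2 billion.",
    [("Acme Corp", "ACQUIRER"), ("Globex GmbH", "TARGET"), ("$1.2 billion", "PRICE")],
)
```

Each span can be checked by slicing the text with its offsets, which is a cheap sanity test to run over a hand-labeled dataset before training.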

A single query returns at most 1000 articles (Google imposes a limit), and one query for "mergers and acquisitions" does not surface enough articles anyway. To get more, we need to search by date range and accumulate the results.

In [1]:
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime, timedelta
import concurrent.futures

def fetch_news_articles(query, start_date, end_date):
    base_url = "https://www.google.com/search"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    params = {
        "q": query,
        "tbm": "nws",  # restrict the search to the News tab
        "tbs": f"cdr:1,cd_min:{start_date.strftime('%m/%d/%Y')},cd_max:{end_date.strftime('%m/%d/%Y')}",
        "num": 100,  # request the maximum results per page (Google allows at most 100)
    }
    
    # A timeout keeps a stalled request from blocking its worker thread indefinitely
    response = requests.get(base_url, params=params, headers=headers, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        articles = []
        for item in soup.select("div.SoaBEf"):
            title_elem = item.select_one("div.MBeuO")
            link_elem = item.select_one("a")
            source_elem = item.select_one("div.UPmit")
            
            if title_elem and link_elem:
                article = {
                    "title": title_elem.text.strip(),
                    "link": link_elem["href"],
                    "source": source_elem.text.strip() if source_elem else "N/A"
                }
                articles.append(article)
        
        print(f"Fetched {len(articles)} articles for date range {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
        return articles
    else:
        print(f"Failed to fetch articles. Status code: {response.status_code}")
        return []

def fetch_articles_with_date_ranges(query):
    end_date = datetime.now()
    start_date = end_date - timedelta(days=365)
    # Build one single-day window for each day of the last year
    days = [start_date + timedelta(days=i) for i in range((end_date - start_date).days + 1)]

    all_articles = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        # Each query covers exactly one day (cd_min == cd_max)
        futures = [executor.submit(fetch_news_articles, query, day, day) for day in days]
        for future in concurrent.futures.as_completed(futures):
            all_articles.extend(future.result())
    
    return all_articles

# Query for mergers and acquisitions (the "in Germany" qualifier is deliberately omitted)
query = "mergers and acquisitions"

# Fetch news articles with date ranges
articles = fetch_articles_with_date_ranges(query)

# Write fetched articles to a CSV file
csv_file = "articles.csv"
csv_columns = ["title", "link", "source"]

try:
    with open(csv_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for article in articles:
            writer.writerow(article)
    print(f"\nSuccessfully wrote {len(articles)} articles to '{csv_file}'.")
except IOError:
    print(f"\nError: Could not write to '{csv_file}'.")

Fetched 31 articles for date range 2023-07-13 to 2023-07-13
Fetched 2 articles for date range 2023-07-15 to 2023-07-15
Fetched 21 articles for date range 2023-07-14 to 2023-07-14
Fetched 26 articles for date range 2023-07-12 to 2023-07-12
Fetched 2 articles for date range 2023-07-16 to 2023-07-16
Fetched 40 articles for date range 2023-07-19 to 2023-07-19
Fetched 23 articles for date range 2023-07-17 to 2023-07-17
Fetched 29 articles for date range 2023-07-18 to 2023-07-18
Fetched 17 articles for date range 2023-07-21 to 2023-07-21
Fetched 3 articles for date range 2023-07-23 to 2023-07-23
Fetched 42 articles for date range 2023-07-20 to 2023-07-20
Fetched 19 articles for date range 2023-07-24 to 2023-07-24
Fetched 1 articles for date range 2023-07-22 to 2023-07-22
Fetched 27 articles for date range 2023-07-26 to 2023-07-26
Fetched 23 articles for date range 2023-07-27 to 2023-07-27
Fetched 1 articles for date range 2023-07-29 to 2023-07-29
Fetched 20 articles for date range 2023-07-25

Now that we have the articles, we can proceed by extracting the main body from each URL.
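Because the articles were accumulated from many per-day queries, the same link may appear more than once. A minimal sketch of dropping duplicates before fetching each URL, assuming rows shaped like the dictionaries written to `articles.csv`; `dedupe_by_link` is a hypothetical helper, not part of the original notebook.

```python
def dedupe_by_link(articles):
    """Keep the first article seen for each link, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        link = article["link"]
        if link not in seen:
            seen.add(link)
            unique.append(article)
    return unique
```

Running this on the rows read from the CSV before the extraction loop would avoid downloading the same page twice.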

In [43]:
!pip install lxml_html_clean

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.1.1-py3-none-any.whl.metadata (1.5 kB)
Downloading lxml_html_clean-0.1.1-py3-none-any.whl (11 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.1.1


In [48]:
import csv
import requests
from bs4 import BeautifulSoup
from newspaper import Article

# Function to extract article content using newspaper3k
def extract_content_newspaper(url):
    article = Article(url)
    try:
        article.download()
        article.parse()
        return article.text
    except Exception as e:
        print(f"Failed to extract using newspaper3k: {e}")
        return None

# Function to extract article content using BeautifulSoup as fallback
def extract_content_bs(url):
    try:
        # A timeout keeps one unresponsive site from stalling the whole loop
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')
        paragraphs = soup.find_all('p')
        return ' '.join([para.get_text() for para in paragraphs])
    except Exception as e:
        print(f"Failed to extract using BeautifulSoup: {e}")
        return None

# Function to extract content from a URL
def extract_content(url):
    content = extract_content_newspaper(url)
    if not content:
        content = extract_content_bs(url)
    return content

# Read the URLs from the CSV file
csv_file = "articles.csv"
articles = []

try:
    with open(csv_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            articles.append(row)
except IOError:
    print(f"Error: Could not read from '{csv_file}'.")

# Extract content for each article
for article in articles:
    url = article['link']
    content = extract_content(url)
    article['content'] = content
    print(f"Extracted content from {url}")

# Write the updated articles with content to a new CSV file
output_csv_file = "articles_with_content.csv"
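# Include "source": the rows read from articles.csv carry it, and DictWriter
# raises ValueError for keys that are missing from fieldnames
csv_columns = ["title", "link", "source", "content"]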

try:
    with open(output_csv_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for article in articles:
            writer.writerow(article)
    print(f"Successfully wrote articles with content to '{output_csv_file}'.")
except IOError:
    print(f"Error: Could not write to '{output_csv_file}'.")


Extracted content from https://news.google.com/rss/articles/CBMiemh0dHBzOi8vd3dkLmNvbS9idXNpbmVzcy1uZXdzL21lcmdlcnMtYWNxdWlzaXRpb25zL3NvZGFsaXMtZ3JvdXAtYWNxdWlyZXMtbWFqb3JpdHktc3Rha2UtZ2VybWFuLWJlYXV0eS1hcnRkZWNvLTEyMzYzNzM3Mjkv0gEA?oc=5
Extracted content from https://news.google.com/rss/articles/CBMicWh0dHBzOi8vd3d3Lm1kbS5jb20vbmV3cy9vcGVyYXRpb25zL21lcmdlcnMtYWNxdWlzaXRpb25zL2ZvcnRpdmUtdG8tYWNxdWlyZS1nZXJtYW55cy1lbGVrdHJvLWF1dG9tYXRpay1mb3ItMS00NWIv0gEA?oc=5
Extracted content from https://news.google.com/rss/articles/CBMikQFodHRwczovL3d3dy5idXNpbmVzcy1zdGFuZGFyZC5jb20vdGVjaG5vbG9neS90ZWNoLW5ld3Mvd2hhdC1tYWtlcy1lbmdpbmVlcmluZy1yLWQtc3BhY2UtYS10YXJnZXQtZm9yLW1lcmdlcnMtYW5kLWFjcXVpc2l0aW9ucy0xMjQwNDI0MDA4MzBfMS5odG1s0gGVAWh0dHBzOi8vd3d3LmJ1c2luZXNzLXN0YW5kYXJkLmNvbS9hbXAvdGVjaG5vbG9neS90ZWNoLW5ld3Mvd2hhdC1tYWtlcy1lbmdpbmVlcmluZy1yLWQtc3BhY2UtYS10YXJnZXQtZm9yLW1lcmdlcnMtYW5kLWFjcXVpc2l0aW9ucy0xMjQwNDI0MDA4MzBfMS5odG1s?oc=5
Extracted content from https://news.google.com/rss/articles/CBMia2

This code works for a small number of requests, but after more requests Google starts rate-limiting us. I tried using proxies to get around the limits, but none of them worked, so I gave up on this approach and used SerpApi instead.
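For reference, a common first mitigation for rate limiting is to retry with exponential backoff. The sketch below is not what was used here (SerpApi replaced the scraper), and there is no guarantee it would have been enough against Google's limits. The helper `fetch_with_backoff`, its parameters, and the assumption that rate limiting shows up as HTTP 429 are all illustrative.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 status or retries run out.

    fetch is any zero-argument callable returning an object with a
    .status_code attribute (for example a lambda wrapping requests.get).
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait base_delay, then 2x, 4x, ... before each retry
        sleep(base_delay * 2 ** attempt)
        response = fetch()
    return response
```

With the scraper above, this could wrap the request as `fetch_with_backoff(lambda: requests.get(base_url, params=params, headers=headers))`. The `sleep` parameter exists only so the delay can be swapped out in tests.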