# Breitbart articles

In [1]:
# loading in the required packages
import pandas as pd
import requests as rq
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin

The articles that I'm looking for are those tagged with Ukraine, Russia, and the Ukraine-Russia war. Afterwards, I will filter by date and keywords to get the relevant articles.

## Example Exploration

In [2]:
# getting the html code from the website
article = rq.get('https://www.breitbart.com/europe/2023/08/28/poland-and-baltic-states-threaten-to-shut-border-with-russian-ally-belarus/')
article_html = article.text
# print(article_html)

# parsing the page using beautifulsoup (parser named explicitly to avoid the "no parser specified" warning)
article_soup = BeautifulSoup(article_html, 'html.parser')

In [3]:
article_soup

<!DOCTYPE html>
<html class="post-tmpl-default single single-post pid-24829277 tf-single pt-post c-europe" lang="en" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# op: http://media.facebook.com/op# article: http://ogp.me/ns/article#">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="https://geolocation.onetrust.com" rel="dns-prefetch"/>
<link as="script" href="https://cdn.cookielaw.org/scripttemplates/otSDKStub.js" rel="preload"/>
<link as="fetch" crossorigin="" href="https://cdn.cookielaw.org/consent/bea5fecf-7066-4a7e-ad83-51130a031a8a/bea5fecf-7066-4a7e-ad83-51130a031a8a.json" rel="preload" type="application/json"/>
<link as="script" href="https://scripts.webcontentassessor.com/scripts/25915dba3f71a41b2d6242657214b496c2beb5fd937e31770246f82c53195453" rel="preload"/>
<link as="script" href="https://securepubads.g.doubleclick.net/tag/js/gpt.js" rel="preload"/>
<link as="script" crossorigin="" href="https://pagead2.

In [4]:
article_soup.title.text

'Poland, Baltic States Threaten to Shut Border with Russian Ally Belarus'

In [5]:
article_soup.time.text

'28 Aug 2023'

In [6]:
article_soup.p.text

'WARSAW, Poland (AP) – NATO members Poland and the Baltic states will seal off their borders with Russia’s ally Belarus in the event of any military incidents or a massive migrant push by Minsk, the interior ministers warned Monday.'

In [7]:
# Convert one or more carriage returns to a single new line
article_text = re.sub('\r+', '\n', article_soup.text)
# Convert multiple new line to a single new line
article_text = re.sub('\n+', '\n', article_text)
# Split up string at line breaks
article_text = re.split('\n', article_text)
# Remove leading and trailing white spaces
article_text = [string.strip() for string in article_text]
# Remove list items without word characters or digits, incl. empty strings
article_text = [string for string in article_text if re.search(r'\w', string)]  # \w already covers digits
# Condense multiple spaces to one
article_text = [re.sub(r'\s+', ' ', string) for string in article_text]

article_text

['Poland, Baltic States Threaten to Shut Border with Russian Ally Belarus',
 'Enable AccessibilitySkip to Content',
 'PoliticsEntertainmentMediaEconomyWorldLondon / EuropeBorder / Cartel ChroniclesIsrael / Middle EastAfricaAsiaLatin AmericaAll WorldVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresB Inspired',
 'BREITBART',
 'Enable Accessibility',
 'PoliticsEntertainmentMediaEconomyWorldLondon / EuropeBorder / Cartel ChroniclesIsrael / Middle EastAfricaAsiaLatin AmericaWorld NewsVideoTechSportsOn the HillOn the Hill ArticlesOn The Hill Exclusive VideoWiresPodcastsBreitbart News DailyB InspiredAbout UsPeopleNewsletters',
 'BREITBART',
 'Fake ConservativesUkraineEnglish ChannelEurope Migrant CrisisUK PoliticsTrans PoliticsBrusselsFarageGermanyFrance',
 'Poland and Baltic States Threaten to Shut Border with Russian Ally Belarus',
 '16',
 'Omar Marques/Anadolu Agency via Getty ImagesBreitbart London28 Aug 2023',
 'WARSAW, Poland (AP) – NATO members Poland and t

In [8]:
# get heading
h = article_soup.find_all("h1")

# get body text
# div p

In [9]:
def remove_links(html):
    # Define a regular expression pattern to match <a> tags and everything between them
    pattern = re.compile(r'<a\b[^>]*>(.*?)</a>', re.DOTALL)

    # Use sub() function to replace matches with an empty string
    result = re.sub(pattern, '', html)

    return result

In [10]:
d = article_soup.find_all('p')
d

[<p class="subheading">WARSAW, Poland (AP) – NATO members Poland and the Baltic states will seal off their borders with Russia’s ally Belarus in the event of any military incidents or a massive migrant push by Minsk, the interior ministers warned Monday.</p>,
 <p>The ministers said they were seeing growing tensions on NATO’s and the European Union’s borders with Belarus, which has taken in thousands of Russia’s military mercenaries and is pushing Middle East and African migrants into Europe, despite barriers having been put up.</p>,
 <p class="a8d-pre">They warned of swift and concerted reaction in the case of a military incident or large migrant push.</p>,
 <p>The ministers of Poland, Lithuania, Latvia and Estonia addressed the media following their talks. In a joint statement they demanded that the government of Belarus President Alexander Lukashenko immediately remove from its territory the Wagner Group mercenaries. They also demanded the removal of migrants from border areas and th

I only want to extract the text. Which function should I use?

Tags to scrape:
- Russia-Ukraine War tag: find articles from the start of the period we are looking at, and double-check when Breitbart started using the tag.
- Ukraine tag
- Russia tag
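One possible answer to the question above: BeautifulSoup's `get_text()` returns the text of a tag with all markup, including nested links, stripped out. A minimal sketch on a made-up HTML snippet (`sample_html` is illustrative and not taken from the site):

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the real pages are far larger.
sample_html = """
<div>
  <p class="subheading">First paragraph with a <a href="/x">link</a>.</p>
  <p>Second   paragraph.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text() drops the tags but keeps the text inside them (including link text).
# " ".join(s.split()) then collapses runs of whitespace to single spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
body_text = " ".join(paragraphs)
print(body_text)
```

This works directly on the tags returned by `find_all('p')`, so there is no need to convert each tag back to a string and parse it again.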

In [11]:
def get_article_text(text):
    if not text:
        return "No <p> tags found."
    else:
        # Use BeautifulSoup to parse each HTML string in the list
        soup_objects = [BeautifulSoup(str(html), 'html.parser') for html in text]

        # Extract text from each BeautifulSoup object
        article_text = [soup.get_text() for soup in soup_objects]

        # Combine the extracted text into a single string
        article_text_combined = ' '.join(article_text)

        return article_text_combined

In [12]:
d1 = get_article_text(d)
print(d1)

WARSAW, Poland (AP) – NATO members Poland and the Baltic states will seal off their borders with Russia’s ally Belarus in the event of any military incidents or a massive migrant push by Minsk, the interior ministers warned Monday. The ministers said they were seeing growing tensions on NATO’s and the European Union’s borders with Belarus, which has taken in thousands of Russia’s military mercenaries and is pushing Middle East and African migrants into Europe, despite barriers having been put up. They warned of swift and concerted reaction in the case of a military incident or large migrant push. The ministers of Poland, Lithuania, Latvia and Estonia addressed the media following their talks. In a joint statement they demanded that the government of Belarus President Alexander Lukashenko immediately remove from its territory the Wagner Group mercenaries. They also demanded the removal of migrants from border areas and their return to their home countries. WWIII Watch: Poland to Deploy 

In [13]:
# extract the URLs of the links inside the h2 headings on a page
def extract_urls_from_h2(url):
    try:
        # Send an HTTP request to the URL
        response = rq.get(url)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse the HTML content of the page
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract URLs from h2 headings
            urls = []
            for h2_tag in soup.find_all('h2'):
                # Find all 'a' tags within the h2
                a_tags = h2_tag.find_all('a', href=True)
                for a_tag in a_tags:
                    # Join the URL with the href attribute to get the absolute URL
                    absolute_url = urljoin(url, a_tag['href'])
                    urls.append(absolute_url)

            return urls

        else:
            print(f"Failed to retrieve the page. Status code: {response.status_code}")
            return None

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

### Russia-Ukraine War tag
A manual check of the tag's results shows that they end at page 11, so num_pages is set to 11.
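The page count could also be found automatically rather than by hand. The sketch below keeps requesting pages until one yields no links. It assumes that a page past the end returns nothing (for example a 404, which `extract_urls_from_h2` turns into `None`); if the site serves a normal page with only header links instead, the stop condition would need adjusting. `collect_links` and `tag_page_url` are hypothetical helper names, and the fetcher is passed in so that the logic can be tried without network access.

```python
def tag_page_url(base_url, page_num):
    """Page 1 is the bare tag URL; later pages live under /page/N/."""
    return base_url if page_num == 1 else f"{base_url}/page/{page_num}/"


def collect_links(base_url, fetch_links, max_pages=500):
    """Collect unique links page by page until a page yields no links."""
    seen = set()
    last_page = 0
    for page_num in range(1, max_pages + 1):
        links = fetch_links(tag_page_url(base_url, page_num))
        if not links:  # None (failed request) or empty list: stop here
            break
        seen.update(links)
        last_page = page_num
    return sorted(seen), last_page


# Stand-in for extract_urls_from_h2 so the example runs offline.
def fake_fetch(url):
    pages = {
        "https://example.com/tag/x": ["a", "b"],
        "https://example.com/tag/x/page/2/": ["b", "c"],
    }
    return pages.get(url)


links, last_page = collect_links("https://example.com/tag/x", fake_fetch)
print(links, last_page)
```

In the notebook, `fetch_links` would be `extract_urls_from_h2`, and `max_pages` acts as a safety cap.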

In [14]:
# extract_urls_from_h2('https://www.breitbart.com/tag/russia-ukraine-war/')

In [15]:
all_links = set() #store as set to only get unique links
num_pages = 11
base_url = "https://www.breitbart.com/tag/russia-ukraine-war"

for page_num in range(1, num_pages + 1):
    url = f"{base_url}/page/{page_num}/" if page_num > 1 else base_url
    # extract_urls_from_h2 sends the request itself and returns None on failure
    urls = extract_urls_from_h2(url)
    if urls:
        all_links.update(urls)
        
all_links = list(all_links)

In [16]:
all_links

['https://www.breitbart.com/europe/2023/05/14/you-decide-boy-scouts-project-or-weapon-of-war-ukraine-shows-off-captured-drone-oddity/',
 'https://www.breitbart.com/europe/2023/11/12/russia-ramps-up-attacks-on-key-cities-in-eastern-ukraine/',
 'https://www.breitbart.com/europe/2023/05/21/biden-zelensky-gave-flat-assurance-american-made-f-16-jets-wont-be-used-to-attack-russian-mainland/',
 'https://www.breitbart.com/europe/2023/10/01/uk-defence-secretary-suggests-sending-british-troops-to-ukraine-on-training-missions/',
 'https://www.breitbart.com/politics/2023/02/15/poll-support-for-endless-u-s-aid-to-ukraine-softens-as-voters-see-problems-closer-to-home/',
 'https://www.breitbart.com/europe/2023/08/17/ukraine-enraged-as-nato-official-suggests-ceding-land-to-russia-in-exchange-for-membership/',
 'https://www.breitbart.com/europe/2023/06/25/wagner-chief-agrees-to-exile-in-belarus-charges-of-armed-rebellion-to-be-dropped-in-deal-with-putin/',
 'https://www.breitbart.com/europe/2023/02/19/

The list includes links to the Breitbart home page and the Breitbart store, which are not relevant to us, so they are removed. Comparing the output of len before and after the filter confirms that both were dropped.
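An alternative to listing the unwanted URLs by hand is to keep only links that look like articles. This sketch assumes article URLs follow the `/section/YYYY/MM/DD/slug/` shape seen in the links above; any article with a different path layout would be dropped, so the pattern is worth checking against the data first. `is_article_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

# Assumed article path shape: /section/YYYY/MM/DD/slug/
ARTICLE_PATH = re.compile(r"^/[^/]+/\d{4}/\d{2}/\d{2}/[^/]+/?$")


def is_article_url(url):
    """True for www.breitbart.com URLs whose path looks like a dated article."""
    parsed = urlparse(url)
    return parsed.netloc == "www.breitbart.com" and bool(ARTICLE_PATH.match(parsed.path))


candidates = [
    "https://www.breitbart.com/europe/2023/11/12/russia-ramps-up-attacks-on-key-cities-in-eastern-ukraine/",
    "https://www.breitbart.com/",
    "https://store.breitbart.com/",
]
articles_only = [url for url in candidates if is_article_url(url)]
print(len(candidates), len(articles_only))
```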

In [17]:
len(all_links)

416

In [18]:
urls_to_remove = {'https://www.breitbart.com/', 'https://store.breitbart.com/'}

# Assuming all_links is the list of URLs
filtered_links = [link for link in all_links if link not in urls_to_remove]

In [19]:
len(filtered_links)

414

In [20]:
article_titles = []
article_date = []
article_datetime = []
article_text = []

for link in filtered_links:
    article = rq.get(link)
    article_html = article.text

    # parsing the page using beautifulsoup
    article_soup = BeautifulSoup(article_html, 'html.parser')
    
    # title
    title = article_soup.title.text
    article_titles.append(title)
    
    # date
    date = article_soup.time.text
    article_date.append(date)
    # datetime
    datetime_value = article_soup.find('time')['datetime']
    article_datetime.append(datetime_value)

    #text 
    d = article_soup.find_all('p')
    text = get_article_text(d)
    article_text.append(text)
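The extraction loop above indexes straight into `article_soup.time`, which raises an error if a page has no `<time>` tag. A possible refactoring is one helper that returns a row per article and tolerates missing tags. `parse_article` is a hypothetical name, and the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return title, date, datetime and body text; missing tags become None."""
    soup = BeautifulSoup(html, "html.parser")
    time_tag = soup.find("time")
    return {
        "Title": soup.title.get_text(strip=True) if soup.title else None,
        "Date": time_tag.get_text(strip=True) if time_tag else None,
        "Datetime": time_tag.get("datetime") if time_tag else None,
        "Text": " ".join(p.get_text() for p in soup.find_all("p")),
    }


sample = (
    "<html><head><title>Headline</title></head><body>"
    "<time datetime='2023-08-28T10:00:00Z'>28 Aug 2023</time>"
    "<p>One.</p><p>Two.</p></body></html>"
)
row = parse_article(sample)
no_time = parse_article("<html><body><p>Only text.</p></body></html>")
print(row)
print(no_time)
```

Collecting these dictionaries in a list and passing the list to `pd.DataFrame` would also keep each URL aligned with its own row.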

In [21]:
len(article_titles)

414

In [23]:
#turn list into df
# article_titles
# article_date
# article_datetime
# article_text
# filtered_text_list

data = {
    'URL': filtered_links,
    'Title': article_titles,
    'Date': article_date,
    'Datetime': article_datetime,
    'Text': article_text,
#     'filtered_text_list': filtered_text_list
}

# create df from dictionary
df_ukraine_russia = pd.DataFrame(data)
df_ukraine_russia = df_ukraine_russia.sort_values(by='Datetime')
df_ukraine_russia.head()

Unnamed: 0,URL,Title,Date,Datetime,Text
9,https://www.breitbart.com/faith/2022/03/19/pop...,Pope Francis: ‘There Is No Such Thing as a Jus...,19 Mar 2022,2022-03-19T09:30:38Z,ROME — Pope Francis said Friday that we are us...
228,https://www.breitbart.com/europe/2022/03/22/pi...,"Pics: Ukraine Claims to Retake Kyiv Suburb, Ba...",22 Mar 2022,2022-03-22T05:27:35Z,"KYIV, Ukraine (AP) – Ukraine said it retook a ..."
168,https://www.breitbart.com/europe/2022/03/31/eu...,EU Members Agree To 10-Point Plan To Settle Uk...,31 Mar 2022,2022-03-31T01:35:38Z,European Union member state Interior Ministers...
49,https://www.breitbart.com/europe/2022/04/06/ca...,Car Rams Russian Embassy in Romania with 'Flam...,6 Apr 2022,2022-04-06T03:59:46Z,"BUCHAREST, Romania (AP) – A car carrying conta..."
214,https://www.breitbart.com/europe/2022/04/06/eu...,"EU, US, UK Prepare More Russian Sanctions over...",6 Apr 2022,2022-04-06T04:20:39Z,"BRUSSELS (AP) – The United States, United King..."


### Ukraine tag

The Ukraine tag has many more articles than the Russia-Ukraine War tag. After a manual check of the tag pages, num_pages is set to 101, because that page reaches back to the start of the period we are looking at, August 2021.
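The manual date check could be backed up with a quick look at the dates embedded in the article URLs. This sketch assumes the `/YYYY/MM/DD/` segment in each URL is the publication date; `url_date` and `oldest_date` are hypothetical helper names.

```python
import re
from datetime import date

URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def url_date(url):
    """Return the date embedded in an article URL, or None if there is none."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)


def oldest_date(urls):
    """Earliest embedded date among the URLs, ignoring URLs without a date."""
    dates = [d for d in map(url_date, urls) if d is not None]
    return min(dates) if dates else None


page_links = [
    "https://www.breitbart.com/europe/2023/05/14/you-decide-boy-scouts-project-or-weapon-of-war-ukraine-shows-off-captured-drone-oddity/",
    "https://www.breitbart.com/europe/2023/11/12/russia-ramps-up-attacks-on-key-cities-in-eastern-ukraine/",
    "https://www.breitbart.com/",
]
print(oldest_date(page_links))
```

Calling `oldest_date` on the links of a single tag page and comparing the result with `date(2021, 8, 1)` shows whether that page already reaches the start of the period.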

In [25]:
all_links_ukraine = set() #store as set to only get unique links
num_pages_ukraine = 101
base_url_ukraine = "https://www.breitbart.com/tag/ukraine"

for page_num in range(1, num_pages_ukraine + 1):
    url = f"{base_url_ukraine}/page/{page_num}/" if page_num > 1 else base_url_ukraine
    # extract_urls_from_h2 sends the request itself and returns None on failure
    urls = extract_urls_from_h2(url)
    if urls:
        all_links_ukraine.update(urls)
        
all_links_ukraine = list(all_links_ukraine)

In [26]:
len(all_links_ukraine)

4042

In [27]:
urls_to_remove = {'https://www.breitbart.com/', 'https://store.breitbart.com/'}

# Assuming all_links is the list of URLs
filtered_links_ukraine = [link for link in all_links_ukraine if link not in urls_to_remove]

In [28]:
len(filtered_links_ukraine)

4040

In [29]:
for link in filtered_links_ukraine:
    article = rq.get(link)
    article_html = article.text

    # parsing the page using beautifulsoup
    article_soup = BeautifulSoup(article_html, 'html.parser')
    
    # title
    title = article_soup.title.text
    article_titles.append(title)
    
    # date
    date = article_soup.time.text
    article_date.append(date)
    # datetime
    datetime_value = article_soup.find('time')['datetime']
    article_datetime.append(datetime_value)

    #text 
    d = article_soup.find_all('p')
    text = get_article_text(d)
    article_text.append(text)

### Russia tag

In [30]:
# page 105
all_links_rus = set() #store as set to only get unique links
num_pages_rus = 105
base_url_rus = "https://www.breitbart.com/tag/russia"

for page_num in range(1, num_pages_rus + 1):
    url = f"{base_url_rus}/page/{page_num}/" if page_num > 1 else base_url_rus
    # extract_urls_from_h2 sends the request itself and returns None on failure
    urls = extract_urls_from_h2(url)
    if urls:
        all_links_rus.update(urls)
        
all_links_rus = list(all_links_rus)

In [31]:
len(all_links_rus)

4202

In [32]:
urls_to_remove = {'https://www.breitbart.com/', 'https://store.breitbart.com/'}

# Assuming all_links is the list of URLs
filtered_links_rus = [link for link in all_links_rus if link not in urls_to_remove]

In [33]:
len(filtered_links_rus)

4200

In [34]:
for link in filtered_links_rus:
    article = rq.get(link)
    article_html = article.text

    # parsing the page using beautifulsoup
    article_soup = BeautifulSoup(article_html, 'html.parser')
    
    # title
    title = article_soup.title.text
    article_titles.append(title)
    
    # date
    date = article_soup.time.text
    article_date.append(date)
    # datetime
    datetime_value = article_soup.find('time')['datetime']
    article_datetime.append(datetime_value)

    #text 
    d = article_soup.find_all('p')
    text = get_article_text(d)
    article_text.append(text)

### Combine to make df

In [35]:
# removing commenting and copyright text from all articles
text_to_remove = "Please let us know if you're having issues with commenting.  \xa0   \xa0 \n\n\n\n\n\n\n Copyright © 2023 Breitbart"

for i in range(len(article_text)):
    article_text[i] = article_text[i].replace(text_to_remove, "")
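The exact-string replacement above only matches if the whitespace and the copyright year are identical in every article. A regular expression is a more forgiving variant. This is a sketch that assumes the footer always has the shape "commenting notice, whitespace, copyright line"; `strip_footer` is a hypothetical name.

```python
import re

# \s* absorbs any mix of spaces, non-breaking spaces and newlines;
# \d{4} accepts any four-digit year.
FOOTER = re.compile(
    r"Please let us know if you're having issues with commenting\."
    r"\s*Copyright © \d{4} Breitbart"
)


def strip_footer(text):
    """Remove the commenting/copyright footer and surrounding whitespace."""
    return FOOTER.sub("", text).strip()


sample = (
    "Body text. Please let us know if you're having issues with commenting."
    "  \xa0   \xa0 \n\n\n\n\n\n\n Copyright © 2022 Breitbart"
)
cleaned = strip_footer(sample)
print(repr(cleaned))
```

Unlike the original replacement, this version also trims leading and trailing whitespace from the result.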

In [38]:
#turn list into df
# article_titles
# article_date
# article_datetime
# article_text
# filtered_text_list

data = {
    'URL': filtered_links + filtered_links_ukraine + filtered_links_rus,
    'Title': article_titles,
    'Date': article_date,
    'Datetime': article_datetime,
    'Text': article_text,
#     'filtered_text_list': filtered_text_list
}

# create df from dictionary
df_all = pd.DataFrame(data)
df_all = df_all.sort_values(by='Datetime')
df_all.head()

Unnamed: 0,URL,Title,Date,Datetime,Text
2902,https://www.breitbart.com/national-security/20...,Hunter Biden's Father Says Ukraine Too Corrupt...,15 Jun 2021,2021-06-15T07:43:24Z,President Joe Biden dismissed the possibility ...
1024,https://www.breitbart.com/politics/2021/06/18/...,WH Halted Lethal Aid Package to Ukraine Before...,18 Jun 2021,2021-06-18T11:53:22Z,President Joe Biden’s White House halted a pro...
896,https://www.breitbart.com/asia/2021/06/18/ukra...,"Ukraine Claims to Have Suffered 50,000 Cyberat...",18 Jun 2021,2021-06-18T14:09:38Z,Ukraine’s State Service for Special Communicat...
2645,https://www.breitbart.com/national-security/20...,Pollak: What Biden Did to Ukraine Is Worse tha...,18 Jun 2021,2021-06-18T18:12:59Z,President Joe Biden’s administration allegedly...
5456,https://www.breitbart.com/asia/2021/06/21/russ...,"Russians Rank Stalin, Hitler, Putin in 'Histor...",21 Jun 2021,2021-06-21T12:24:12Z,Russia’s independent Levada Center published t...


In [39]:
df_all.shape

(8654, 5)

In [40]:
# saving df_all to excel
df_all.to_excel('all_bb_articles.xlsx')
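The date and keyword filtering planned in the introduction could be sketched as below. Because the three tag lists are simply concatenated, an article carrying more than one tag probably appears more than once, so the sketch removes duplicate URLs first. The toy data frame, the keyword list and the name `filter_articles` are all illustrative assumptions.

```python
import re

import pandas as pd

# Toy stand-in for df_all with the same column names.
toy = pd.DataFrame({
    "URL": ["u1", "u2", "u2", "u3"],
    "Datetime": [
        "2021-06-15T07:43:24Z",
        "2022-03-19T09:30:38Z",
        "2022-03-19T09:30:38Z",
        "2023-08-28T10:00:00Z",
    ],
    "Text": [
        "Ukraine aid halted",
        "Pope Francis on war",
        "Pope Francis on war",
        "Poland and Belarus border",
    ],
})


def filter_articles(df, start, keywords):
    """Drop duplicate URLs, then keep rows on/after `start` that mention a keyword."""
    out = df.drop_duplicates(subset="URL").copy()
    out["Datetime"] = pd.to_datetime(out["Datetime"], utc=True)
    pattern = "|".join(re.escape(k) for k in keywords)
    recent = out["Datetime"] >= pd.Timestamp(start, tz="UTC")
    relevant = out["Text"].str.contains(pattern, case=False, regex=True, na=False)
    return out[recent & relevant].sort_values("Datetime")


filtered = filter_articles(toy, "2021-08-01", ["war", "border"])
print(filtered[["URL", "Datetime"]])
```

Converting `Datetime` to real timestamps also makes the sort chronological by value rather than by string.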

We can look at the stance towards Zelensky: positive or negative. This is stance detection, a form of sentiment analysis. Take the sentences where he is mentioned and classify the stance at the sentence level; we can also code them manually. Label a few hundred sentences and use them to fine-tune a model. State what stance we expect to find. ChatGPT is the best classifier available, but we have to pay for it.
- The Wagner Group, for example
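Pulling out the sentences that mention an actor, as a first step towards sentence-level labelling, might look like the sketch below. The sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), so abbreviations such as "U.S." would be split incorrectly; a proper tokenizer would be a better choice for the real corpus. `sentences_mentioning` is a hypothetical name.

```python
import re

# Naive boundary: whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_mentioning(text, names):
    """Return the sentences of `text` that contain any of `names` (case-insensitive)."""
    name_pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    return [s.strip() for s in SENTENCE_END.split(text) if name_pattern.search(s)]


example = (
    "Zelensky spoke on Monday. The weather was mild. "
    "Critics said Zelenskyy was wrong! Nothing else happened."
)
hits = sentences_mentioning(example, ["Zelensky"])
print(hits)
```

Matching on the substring "Zelensky" also catches the spelling "Zelenskyy"; other actors can be added to the `names` list.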

Use Google Colab to run larger models.

Don't forget to connect the analysis to concepts and theories, but don't force it. Peace versus war: are negotiations mentioned, and which sources mention them?

Models: interrupted time series. Pick an event and check the difference before and after it, to see how an external shock changes the outcome.
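As a rough illustration of the interrupted time series idea, the sketch below fits a segmented regression by ordinary least squares: a baseline level and trend, plus a change in level and a change in trend after the event. The data are synthetic and noise-free, the outcome variable is left abstract, and `fit_its` is a hypothetical name; a real analysis would also need to handle noise, autocorrelation and standard errors.

```python
import numpy as np


def fit_its(y, event_index):
    """Segmented regression: y = b0 + b1*t + b2*post + b3*post*(t - event_index)."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= event_index).astype(float)  # 0 before the event, 1 from it onwards
    X = np.column_stack([np.ones_like(t), t, post, post * (t - event_index)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ["baseline", "trend", "level_change", "trend_change"]
    return dict(zip(names, coef))


# Synthetic series: level jumps by 3.0 and the slope rises by 0.25 at t = 10.
t = np.arange(20)
post = t >= 10
series = 5.0 + 0.5 * t + 3.0 * post + 0.25 * post * (t - 10)

result = fit_its(series, event_index=10)
print(result)
```

With article data, `y` could be something like a weekly count or average stance score, and `event_index` the week of the event.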