# News scraping from the archives of Le Journal de Montreal

We use web scraping with BeautifulSoup to build a dataset from the archived news of Le Journal de Montreal, which we will use to train our models. We label the data with 5 categories: 'sports', 'culture', 'actualites', 'international' and 'affaires'.

The archives of Le Journal de Montreal are already split into categories. For each category, we collect the news from 50 random days between 2016 and 2020 (10 per year). If all the data came from the same time period, the articles would probably cover similar subjects, which could bias the outcome.

We import the relevant libraries. 

In [2]:
from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

The archives of Le Journal de Montreal are served at URLs of the form "https://www.journaldemontreal.com/CATEGORY/archives/DATE", where DATE has the format YYYY/mm/dd and CATEGORY is the news category.
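
As an illustration of this URL scheme, the following sketch builds an archive URL from a category slug and a date. The helper name `archive_url` and the constant `BASE_URL` are hypothetical and not part of the notebook; the code only assumes the pattern described above.

```python
from datetime import date

BASE_URL = "https://www.journaldemontreal.com"

def archive_url(category, day):
    """Return the archive URL for a category slug and a datetime.date."""
    # strftime pads month and day to two digits (YYYY/mm/dd)
    return f"{BASE_URL}/{category}/archives/{day.strftime('%Y/%m/%d')}"

print(archive_url("actualite", date(2016, 2, 8)))
# https://www.journaldemontreal.com/actualite/archives/2016/02/08
```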

In [3]:
#Categories we are interested in
categories = ['actualite', 'monde', 'argent', 'sports', 'spectacles']

In [18]:
# Create a list of URLs for 50 random dates (10 per year, same dates for every category)
import random
random.seed(41)

years = list(range(2016, 2021))
urls = []
# Days are capped at 28 so that every (month, day) pair is a valid date
days = [random.randint(1, 28) for _ in range(50)]
sdays = ["%02d" % x for x in days]
months = [random.randint(1, 12) for _ in range(40)]
smonths = ["%02d" % x for x in months]

# 2020 is not complete, so we only draw months from January to August
months_2020 = [random.randint(1, 8) for _ in range(10)]
smonths_2020 = ["%02d" % x for x in months_2020]

for category in categories:
    for i, year in enumerate(years):
        for j in range(10):
            n = 10*i + j
            day = sdays[n]
            # The last year (2020) uses the restricted list of months
            month = smonths_2020[j] if i == 4 else smonths[n]
            url_temp = 'https://www.journaldemontreal.com/' + category + '/archives/' + str(year) + '/' + month + '/' + day
            urls.append(url_temp)

We make sure there are no duplicates. 

In [19]:
len(urls) == len(set(urls))

True
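
The check above passes for this seed, but independent draws of day and month can in principle repeat a date within a year. As a hedged alternative, the sketch below draws distinct dates by construction with `random.sample`. The function name `sample_dates` and its parameters are illustrative assumptions, not the notebook's method, and the sketch would not reproduce the dates used above.

```python
import random
from datetime import date, timedelta

def sample_dates(year, k, last_day=None, rng=random):
    """Draw k distinct dates from `year`, optionally stopping at `last_day`."""
    first = date(year, 1, 1)
    last = last_day or date(year, 12, 31)
    n_days = (last - first).days + 1
    # random.sample draws without replacement, so no date can repeat
    offsets = rng.sample(range(n_days), k)
    return sorted(first + timedelta(days=o) for o in offsets)

rng = random.Random(41)
dates = []
for year in range(2016, 2021):
    # 2020 was incomplete at scraping time, so stop at the end of August
    last_day = date(2020, 8, 31) if year == 2020 else None
    dates.extend(sample_dates(year, 10, last_day, rng))

print(len(dates), len(set(dates)))  # 50 50
```

Unlike the cell above, this version can also pick days 29 to 31, since it works on real calendar dates.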

We send a request for each URL and report any page that does not return status 200.

In [22]:
#Requests
headers = {'Accept': 'text/html', 'User-Agent':'Mozilla/5.0'}
response_jdm = {}
for url in urls:
    response_jdm[url] = get(url, headers = headers)
    if response_jdm[url].status_code != 200:
        print("Status for", url, "is", response_jdm[url].status_code)
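
Fetching 250 pages in a tight loop can fail midway or put load on the server. The following is a minimal sketch, assuming a `requests.Session`-like object, of the same loop with a timeout, a pause between requests and a record of failures. The function `fetch_all` and its parameters `pause` and `timeout` are hypothetical names introduced here.

```python
import time

def fetch_all(urls, session, headers=None, pause=1.0, timeout=10):
    """Fetch each URL once; return (responses, failures).

    `session` is any object with a requests-style
    get(url, headers=..., timeout=...) method, e.g. requests.Session().
    """
    responses, failures = {}, {}
    for url in urls:
        try:
            response = session.get(url, headers=headers, timeout=timeout)
        except Exception as exc:  # network errors, timeouts, ...
            failures[url] = repr(exc)
        else:
            if response.status_code == 200:
                responses[url] = response
            else:
                failures[url] = response.status_code
        time.sleep(pause)  # wait between requests to limit server load
    return responses, failures
```

With the notebook's variables, this could be called as `fetch_all(urls, requests.Session(), headers=headers)` after `import requests`.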

We parse each page into a BeautifulSoup object.

In [23]:
#Save the main page content
mainpage_jdm = {}
for url in urls:
    mainpage_jdm[url] = response_jdm[url].content 

#Soup Creation 
soup_jdm = {}
for url in urls:
    soup_jdm[url] = BeautifulSoup(mainpage_jdm[url], 'html.parser')

We extract the article teasers from each archive page, then download and parse each article.

In [28]:
headline_jdm = {}
for url in urls:
    headline_jdm[url] = soup_jdm[url].find_all("article")
    #print(len(headline_jdm[url]))
headline_jdm[urls[2]][2]

<article class="archive actualite archive-block">
<a href="https://www.journaldemontreal.com/2016/02/08/je-la-bonne-reaction-a-avoir-en-presence-dun-tireur-actif"><div class="lft"> <img alt="Attentat" src="https://storage.journaldemontreal.com/v1/dynamic_resize/sws_path/jdx-prod-images/886de9e0-8dcb-43c2-9452-d9f626ef4a5f_JDX-0x0_WEB.jpg?quality=80&amp;size=100x&amp;version=2" title="Attentat"> </img></div> <div class="rght"> <div class="vip-section-with-icon clearfix"> <div class="small-category-name float-left"> Faits divers </div> </div> <div class="title" labelproperty="Headline">Que faire face à un tueur ?</div> <div class="excerpt">Fuir, se cacher et attaquer, recommande la police.</div> <div class="last-update">MISE à JOUR <time datetime="2016-02-08T11:56:20Z"> Lundi, 8 février 2016 06:56 </time> </div> </div></a> </article>
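
Each teaser carries the link, the headline, the excerpt and the timestamp. As a sketch of how these fields can be read separately, the snippet below parses a shortened copy of the markup above. The sample string is abridged from that output, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Abridged copy of the teaser markup shown above
html = """
<article class="archive actualite archive-block">
<a href="https://www.journaldemontreal.com/2016/02/08/je-la-bonne-reaction-a-avoir-en-presence-dun-tireur-actif">
<div class="rght">
<div class="title" labelproperty="Headline">Que faire face à un tueur ?</div>
<div class="excerpt">Fuir, se cacher et attaquer, recommande la police.</div>
<div class="last-update">MISE à JOUR <time datetime="2016-02-08T11:56:20Z"> Lundi, 8 février 2016 06:56 </time></div>
</div></a>
</article>
"""

teaser = BeautifulSoup(html, "html.parser").find("article")
link = teaser.find("a")["href"]
# class_="title" matches only the headline, not the excerpt
title = teaser.find("div", class_="title").get_text(strip=True)
excerpt = teaser.find("div", class_="excerpt").get_text(strip=True)
published = teaser.find("time")["datetime"]

print(title)      # Que faire face à un tueur ?
print(published)  # 2016-02-08T11:56:20Z
```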

In [30]:
articles = {}
titles = {}
links = {}

# Loop over each archive page in the dictionary
for url in urls: 
    #Create lists
    articles[url] = []
    titles[url] = []
    links[url] = []
    
    #Loop over each article teaser on the page
    for n, teaser in enumerate(headline_jdm[url]):
        #First div whose class is 'excerpt' or 'title'
        title_div = teaser.find('div', {'class': ['excerpt', 'title']})
        if title_div is None:
            print('NoneType in sub', url, 'number', n)
        else:
            #Access link to the article 
            link = teaser.find('a')['href']

            #Getting the title
            title = title_div.get_text()

            #Getting the content of the article
            article = get(link)
            soup_article = BeautifulSoup(article.content, 'html.parser')
            body = soup_article.find_all('div', {'class' : ['article-main-txt', 'formatted-text', 'caption']})
            if len(body) == 0:
                print('Empty body in sub', url, 'number', n)
            else: 
                #Unifying the paragraphs of the first matching block
                list_paragraphs = [p.get_text() for p in body[0].find_all('p')]
                #Join once, so that an article without paragraphs yields an
                #empty string instead of reusing the previous article's text
                final_article = " ".join(list_paragraphs)

                articles[url].append(final_article)
                links[url].append(link)
                titles[url].append(title)

Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/11/13 number 15
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/01/11 number 27
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/02/08 number 42
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/02/08 number 44
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/12/19 number 10
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/03/23 number 47
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/01/10 number 9
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/01/10 number 12
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/01/10 number 23
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/05/18 number 14
Empty body in sub https://www.journaldemontreal.com/actualite/archives/2016/05/18

We check a few examples to make sure that the algorithm worked. 

In [32]:
titles[urls[2]]

['Desjardins face à un tournant',
 'La proportion d’autochtones a doublé',
 'Que faire face à un tueur ?',
 'Rien n’ébranle Couillard',
 'Solution proposée pour les chauffeurs',
 'Le suspect court toujours',
 'Trois-Rivières procédera par appel d’offres',
 'Les serriculteurs répliquent à Arcand',
 'Une enquête publique est réclamée',
 'Totalement légal selon la direction',
 '[VIDEO] Un géant des mers frappé par une tempête',
 'Une 5e ado fugue du Centre de Laval',
 'Zampino se plaint du délai pour avoir la preuve',
 "Trois employés blessés à l'urgence",
 '200 taxis attendus autour du parlement mardi',
 'Noémy Coderre retrouvée',
 'Fin de carrière pour la voleuse sexy',
 "500 emplois à Montréal d'ici 2020",
 'Le pays? Faut espérer un effet Zanetti...',
 'La plaignante l’a revu après la présumée agression',
 'L’érosion linguistique',
 'Un camionneur a le dessus sur l’assurance-emploi',
 'QS écarte un rapprochement avec le PQ',
 'Les Québécois ont la pire santé bucco-dentaire ',
 'Une bre

In [35]:
links[urls[4]]

['https://www.journaldemontreal.com/2016/03/13/montreal-un-homme-meurt-lors-dune-intervention-policiere',
 'https://www.journaldemontreal.com/2016/03/13/un-batiment-industriel-brule-a-saint-jean-sur-richelieu-1',
 'https://www.journaldemontreal.com/2016/03/13/plus-confiants-que-jamais',
 'https://www.journaldemontreal.com/2016/03/14/elle-conduit-et-travaille-apres-avoir-perdu-ses-jambes',
 'https://www.journaldemontreal.com/2016/03/13/un-cameleon-vieux-de-99-millions-dannees',
 'https://www.journaldemontreal.com/2016/03/13/un-abri-dauto-qui-lave--la-voiture-en-plein-hiver',
 'https://www.journaldemontreal.com/2016/03/12/jeunes-filles-piegees-par-leur-chum',
 'https://www.journaldemontreal.com/2016/03/13/des-cours-de-langues-qui-ne-profitent-pas-a-des-refugies',
 'https://www.journaldemontreal.com/2016/03/13/les-partisans-du-chboivent-et-mangent-moins',
 'https://www.journaldemontreal.com/2016/03/13/elle-se-tue-au-volant-apres-18h-de-travail',
 'https://www.journaldemontreal.com/2016/03

In [37]:
(articles[urls[6]])[2]

' Surnommé le «roi du cannabis», le controversé homme d’affaires à la tête d’une chaîne de commerces qui vend de la marijuana médicale confirme qu’il aura très bientôt pignon sur rue à Montréal.  «On s’en vient», assure Don Briere au sujet de sa chaîne Weeds Glass & Gifts, qui compte déjà près d’une vingtaine de boutiques en Colombie-Britannique et en Ontario.  Il soutient qu’il a signé un bail sur la rue Saint-Denis, près de la rue Rachel. Cependant, il refuse d’en dévoiler l’adresse exacte avant d’y installer un système de sécurité puisque ses commerces, dit-il, sont souvent la cible des voleurs.  Chez Labeaume  Don Briere ne compte pas s’arrêter à Montréal non plus, puisqu’il a aussi Québec dans sa mire. Selon lui, l’arrivée au pouvoir de Justin Trudeau, qui a promis de légaliser la marijuana, est le moment idéal pour continuer son expansion.  La vente de marijuana en magasin reste illégale au Canada, mais celui qui est souvent surnommé le «roi du cannabis» profite d’une zone grise 

In [39]:
#We count the number of data points.

def count(dictionary, name):
    #Total number of items over all the lists stored in the dictionary
    c = 0
    for key, value in dictionary.items(): 
        if isinstance(value, list): 
            c += len(value) 
    print('Number of '+ name + ':', c)

count(articles, 'articles')
count(titles, 'titles')
count(links, 'links')

Number of articles: 6696
Number of titles: 6696
Number of links: 6696
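
Equal totals do not by themselves guarantee that the three dictionaries line up page by page. A small sketch of a stricter check follows; `check_aligned` is a hypothetical helper and the toy dictionaries stand in for `articles`, `titles` and `links`.

```python
def check_aligned(*dicts):
    """Return the keys whose lists differ in length across the given dicts."""
    keys = set().union(*(d.keys() for d in dicts))
    mismatched = []
    for key in sorted(keys):
        # A missing key counts as length -1 so it is always reported
        lengths = {len(d[key]) if key in d else -1 for d in dicts}
        if len(lengths) > 1:
            mismatched.append(key)
    return mismatched

# Toy data standing in for the scraped dictionaries
toy_articles = {"page1": ["text a", "text b"], "page2": ["text c"]}
toy_titles = {"page1": ["A", "B"], "page2": ["C"]}
toy_links = {"page1": ["url a", "url b"], "page2": []}

print(check_aligned(toy_articles, toy_titles, toy_links))  # ['page2']
```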


We put all the data in a dataframe. We first extract the category from each URL and store one category label per article.

In [42]:
# Map the section name found in the URL to the label used in the dataset
category_labels = {'spectacles': 'culture', 'sports': 'sports', 'argent': 'affaires',
                   'monde': 'international', 'actualite': 'actualites'}

# Note: this rebinds `categories` from the list of section names to a dict of labels per URL
categories = {}
for url in links.keys():
    # The section name is the first path component after the domain
    string = url.split('https://www.journaldemontreal.com/',1)[1].split("/", 1)[0]
    label = category_labels.get(string, string)
    # One label per article scraped from this archive page
    categories[url] = [label] * len(links[url])
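
The same extraction can be sketched with `urllib.parse` and a lookup table, which avoids depending on the exact domain prefix. The names `LABELS` and `category_of` are illustrative; the mapping itself is the one used above.

```python
from urllib.parse import urlsplit

# Section name in the URL -> label used in the dataset
LABELS = {
    "spectacles": "culture",
    "sports": "sports",
    "argent": "affaires",
    "monde": "international",
    "actualite": "actualites",
}

def category_of(url):
    """Return the dataset label for an archive URL."""
    # The section name is the first component of the URL path
    slug = urlsplit(url).path.strip("/").split("/")[0]
    return LABELS.get(slug, slug)

print(category_of("https://www.journaldemontreal.com/monde/archives/2018/01/11"))
# international
```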

In [50]:
categories[urls[60]]

['international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international',
 'international']

In [51]:
count(categories, 'categories')

Number of categories: 6696


In [52]:
#Build one DataFrame per archive page, then concatenate them
#(DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions)
frames = []
for url in categories.keys(): 
    df_temp = pd.DataFrame({'title': titles[url],
                            'content': articles[url],
                            'link': links[url],
                            'category': categories[url]},
                           columns = ['title', 'content', 'link', 'category'])
    frames.append(df_temp)
df_jdm = pd.concat(frames)

In [53]:
df_jdm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6696 entries, 0 to 15
Data columns (total 4 columns):
title       6696 non-null object
content     6696 non-null object
link        6696 non-null object
category    6696 non-null object
dtypes: object(4)
memory usage: 261.6+ KB
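
The index shown by `info()` runs from 0 to 15 because each page's rows keep their own numbering, and an article listed on more than one archive page would appear twice. The following is a sketch, on a toy frame, of how the index could be reset and duplicates dropped before saving. Whether the scraped data actually contains duplicates has not been checked here, and the names `toy` and `clean` are illustrative.

```python
import pandas as pd

# Toy frame standing in for df_jdm, with a repeated index and a repeated link
toy = pd.DataFrame(
    {
        "title": ["A", "B", "A"],
        "link": ["https://example.com/a", "https://example.com/b", "https://example.com/a"],
        "category": ["sports", "culture", "sports"],
    },
    index=[0, 1, 0],
)

# Keep the first occurrence of each link, then renumber the rows 0..n-1
clean = toy.drop_duplicates(subset="link").reset_index(drop=True)

print(len(clean))  # 2
print(clean["category"].value_counts().to_dict())
```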


In [56]:
df_jdm.sample(20)

Unnamed: 0,title,content,link,category
26,Un entraîneur-chef par intérim pour les Jaguars,Au lendemain du congédiement de l’entraîneur-...,https://www.journaldemontreal.com/2016/12/19/u...,sports
20,26 arrestations en France et en Belgique,Treize personnes ont été arrêtées en Belgique ...,https://www.journaldemontreal.com/2020/05/27/c...,international
9,10 meilleurs matchs,La NFL célébrera aujourd’hui à San Francisco ...,https://www.journaldemontreal.com/2016/02/06/1...,sports
20,Un pétrolier s’est échoué près de la côte,"CAP-BRETON, N.-É. | Un hélicoptère Cormorant ...",https://www.journaldemontreal.com/2017/01/08/u...,actualites
18,Lourde perte pour les Stars,"Opéré à une main vendredi, le défenseur des S...",https://www.journaldemontreal.com/2018/11/09/l...,sports
45,Lune de guêpe,Après plus de 100 jours passés à diriger Mont...,https://www.journaldemontreal.com/2018/02/15/l...,actualites
3,Le ministre de la Justice sous le feu de Trump,WASHINGTON |Le ministre américain de la Justi...,https://www.journaldemontreal.com/2017/07/25/d...,international
30,Huit arrestations à Saint-Jérôme,SAINT-JÉRÔME | Les policiers ont arrêté huit ...,https://www.journaldemontreal.com/2018/11/09/t...,actualites
7,Allemagne: pas encore de gouvernement,BERLIN | La chancelière allemande Angela Merk...,https://www.journaldemontreal.com/2018/01/11/a...,international
3,Une distribution toute étoile,Plusieurs visages bien connus du petit écran ...,https://www.journaldemontreal.com/2016/03/23/l...,culture


In [57]:
df_jdm.to_csv('articles_le_journal_de_montreal_archives.csv', header=True, index=True)