# News scraping for Le Journal de Montreal

We use web scraping with BeautifulSoup to build a dataset of news articles. The scraper collects the most recent articles, so it can be run every day to acquire new data. Each article is labeled with one of 5 categories: 'Sports', 'Culture', 'Actualite', 'International' and 'Affaires'.

We import the relevant libraries.

In [1]:
from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

### Le Journal de Montreal

We scrape the data from Le Journal de Montreal. Each category is spread over several subpages, so we list the subsections of every category and build a dictionary that maps each subsection to its URL.

In [6]:
#Put the different subsections into a list
#Actualites 
sub_act_list = ['faits-divers', 'politique', 'nos-routes-en-deroute', 'sante', 
               'cannabis', 'education', 'transports', 'environnement', 
               'consommation', 'societe']
#International 
sub_int_list = ['etats-unis', 'ameriques', 'europe', 'moyen-orient', 
               'asie-pacifique', 'afrique']

#Affaires 
sub_aff_list = ['dans-vos-poches', 'penurie-de-main-doeuvre', 'reer', 
               'ou-vont-vos-impots', 'entreprises', 'bourse', 'PMEinc']
#Sports 
sub_spo_list = ['hockey', 'tennis', 'baseball', 
                'football', 'plein-air-chasse-et-peche', 
               'combats', 'courses', 'soccer', 'golf', 'ski', 'autres-sports']
#Culture 
sub_cul_list = ['jetset', 'cinema', 'television', 'musique', 'spectacles', 
               'theatre', 'sorties','humour']

#Concatenated list 
sub_jdm_list = sub_act_list + sub_int_list + sub_aff_list + sub_spo_list + sub_cul_list

#URL definitions
url = {}
for sub in sub_act_list:
    sub_url = "https://www.journaldemontreal.com/actualite/" + sub
    url[sub] = sub_url
for sub in sub_int_list:
    sub_url = "https://www.journaldemontreal.com/monde/" + sub
    url[sub] = sub_url
for sub in sub_aff_list:
    sub_url = "https://www.journaldemontreal.com/argent/" + sub
    url[sub] = sub_url
for sub in sub_spo_list:
    sub_url = "https://www.journaldemontreal.com/sports/" + sub
    url[sub] = sub_url
for sub in sub_cul_list:
    sub_url = "https://www.journaldemontreal.com/spectacles/" + sub
    url[sub] = sub_url
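
The five loops above follow the same pattern, so they could be collapsed into a single dictionary comprehension. The following is a minimal sketch under that assumption; `BASE_URL`, `sections` and `build_urls` are illustrative names that do not appear in the notebook, and the subsection lists are shortened for brevity.

```python
# Illustrative names (not part of the original notebook).
BASE_URL = "https://www.journaldemontreal.com"

# Map each URL path segment to a shortened sample of its subsections.
sections = {
    "actualite": ["faits-divers", "politique"],
    "monde": ["etats-unis", "europe"],
    "argent": ["entreprises", "bourse"],
    "sports": ["hockey", "tennis"],
    "spectacles": ["cinema", "musique"],
}


def build_urls(sections, base_url=BASE_URL):
    """Return a dict mapping each subsection name to its full URL."""
    return {
        sub: f"{base_url}/{path}/{sub}"
        for path, subs in sections.items()
        for sub in subs
    }


url = build_urls(sections)
print(url["hockey"])
```

Keeping the path segment next to its subsections makes it harder to pair a subsection with the wrong section when the lists change.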

We send a request to each subpage and print the status code of the response.

In [7]:
#Requests
headers = {'Accept': 'text/html', 'User-Agent':'Mozilla/5.0'}
response_jdm = {}
for sub in sub_jdm_list:
    response_jdm[sub] = get(url[sub], headers = headers)
    print("Status for", sub, "is", response_jdm[sub].status_code)

Status for faits-divers is 200
Status for politique is 200
Status for nos-routes-en-deroute is 200
Status for sante is 200
Status for cannabis is 200
Status for education is 200
Status for transports is 200
Status for environnement is 200
Status for consommation is 200
Status for societe is 200
Status for etats-unis is 200
Status for ameriques is 200
Status for europe is 200
Status for moyen-orient is 200
Status for asie-pacifique is 200
Status for afrique is 200
Status for dans-vos-poches is 200
Status for penurie-de-main-doeuvre is 200
Status for reer is 200
Status for ou-vont-vos-impots is 200
Status for entreprises is 200
Status for bourse is 200
Status for PMEinc is 200
Status for hockey is 200
Status for tennis is 200
Status for baseball is 200
Status for football is 200
Status for plein-air-chasse-et-peche is 200
Status for combats is 200
Status for courses is 200
Status for soccer is 200
Status for golf is 200
Status for ski is 200
Status for autres-sports is 200
Status for jet
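
A loop over this many pages can be made a little more defensive with a timeout, a pause between requests and a fallback for failed requests. The sketch below is one possible way to do it, not the method used in this notebook; `fetch_pages` and its parameters are hypothetical names, and `fetch` is assumed to be any callable that accepts the same arguments as `requests.get`.

```python
import time


def fetch_pages(urls, fetch, headers=None, delay=1.0, timeout=10):
    """Fetch every URL in `urls` (a dict name -> URL) and return name -> response.

    `fetch` is any callable with the calling convention of `requests.get`.
    Failed or non-200 requests are recorded as None instead of stopping the loop.
    """
    responses = {}
    for name, page_url in urls.items():
        try:
            response = fetch(page_url, headers=headers, timeout=timeout)
            # Keep only successful responses (HTTP 200).
            responses[name] = response if response.status_code == 200 else None
        except Exception as error:  # e.g. a connection error or a timeout
            print("Request for", name, "failed:", error)
            responses[name] = None
        time.sleep(delay)  # pause between requests to limit the load on the server
    return responses
```

With the names used in this notebook, the call would look like `fetch_pages(url, get, headers=headers)`.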

We parse the content of each response into a BeautifulSoup object.

In [8]:
#Save the main page content
mainpage_jdm = {}
for sub in sub_jdm_list:
    mainpage_jdm[sub] = response_jdm[sub].content 

#Soup Creation 
soup_jdm = {}
for sub in sub_jdm_list:
    soup_jdm[sub] = BeautifulSoup(mainpage_jdm[sub], 'html.parser')

We collect the `<article>` elements of each subpage and display the first one of the 'combats' subsection.

In [11]:
headline_jdm = {}
for sub in sub_jdm_list:
    headline_jdm[sub] = soup_jdm[sub].find_all("article")
    #print(len(headline_jdm[sub]))
headline_jdm['combats'][0]

<article class="item-inner">
<div class="ab-testing ab-test__default">
<div class="show-comments">
</div>
</div>
<div class="ab-testing ab-test__variant">
<div class="show-comments">
</div>
</div>
<a href="https://www.journaldemontreal.com/2020/09/03/une-scene-transformee-en-ring"><div class="title-box"> <div class="title-wrapper"> <div class="hidden-content"> <p class="sub-title">Québecor veut présenter des galas au Capitole, une salle qu’elle a achetée en juin.</p> </div> <div class="hints"> <span class="strapline">Boxe</span> </div> <h2 class="main-title"> Théâtre Capitole: une scène transformée en ring </h2> </div> </div> <picture> <!--[if IE 9]><video style="display: none;"><![endif]--> <source media="(min-width: 40em)" srcset="https://m1.quebecormedia.com/emp/emp/62754806_0879286acf0970-47ad-48b3-aac4-b4b54073b61d_ORIGINAL.jpg?impolicy=crop-resize&amp;x=145&amp;y=135&amp;w=1353&amp;h=1090&amp;width=764, https://m1.quebecormedia.com/emp/emp/62754806_0879286acf0970-47ad-48b3-aac4-b

This output shows that each `<article>` element contains the title and a short description of the article, as well as the link to the complete article. We take the title from this element and follow the link to extract the full text.
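
To illustrate the three fields on their own, here is a small sketch that parses a simplified copy of the `<article>` markup shown above (the `<picture>` element and the A/B testing blocks are left out). `parse_teaser` and `sample_html` are hypothetical names used only for this example.

```python
from bs4 import BeautifulSoup

# Simplified copy of the <article> markup displayed above.
sample_html = """
<article class="item-inner">
<a href="https://www.journaldemontreal.com/2020/09/03/une-scene-transformee-en-ring">
<div class="title-box"><div class="title-wrapper">
<div class="hidden-content"><p class="sub-title">Québecor veut présenter des galas au Capitole, une salle qu’elle a achetée en juin.</p></div>
<h2 class="main-title"> Théâtre Capitole: une scène transformée en ring </h2>
</div></div>
</a>
</article>
"""


def parse_teaser(article_tag):
    """Return the link, title and description of one <article> element."""
    link_tag = article_tag.find("a")
    title_tag = article_tag.find("h2", class_="main-title")
    desc_tag = article_tag.find("p", class_="sub-title")
    return {
        "link": link_tag["href"] if link_tag else None,
        # strip=True removes the padding spaces around the text
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "description": desc_tag.get_text(strip=True) if desc_tag else None,
    }


article_tag = BeautifulSoup(sample_html, "html.parser").find("article")
teaser = parse_teaser(article_tag)
print(teaser["title"])
print(teaser["link"])
```

Returning `None` for a missing tag keeps one malformed teaser from raising an exception in the middle of a long scraping run.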

In [42]:
articles = {}
titles = {}
links = {}

# Loop over each subsection in the dictionary
for sub in sub_jdm_list: 
    #Create lists
    articles[sub] = []
    titles[sub] = []
    links[sub] = []
    
    #Loop over each article in a section 
    for n, headline in enumerate(headline_jdm[sub]):
        title_tag = headline.find('h2', {'class': ['main-title', 'medium ', 'short ']})
        if title_tag is None:
            print('NoneType in sub', sub, 'number', n)
            continue

        #Access link to the article 
        link = headline.find('a')['href']

        #Getting the content of the article
        article = get(link)
        soup_article = BeautifulSoup(article.content, 'html.parser')
        body = soup_article.find_all('div', {'class' : ['article-main-txt', 'formatted-text']})
        if len(body) == 0:
            #Skip the article so that titles, links and articles keep the same length
            print('Empty body in sub', sub, 'number', n)
            continue

        #Unifying the paragraphs (join once, after all paragraphs are collected)
        list_paragraphs = [p.get_text() for p in body[0].find_all('p')]
        final_article = " ".join(list_paragraphs)

        #Store the three fields together
        links[sub].append(link)
        titles[sub].append(title_tag.get_text())
        articles[sub].append(final_article)

We inspect a few results to make sure that the scraping worked.

In [43]:
titles['combats']

[' Théâtre Capitole: une scène transformée en ring ',
 ' Hugo Girard est jugé sans avoir été avisé ',
 ' La ministre confirme l’absence de communication ',
 ' Une année de défis pour Gill ',
 ' Estephan pourrait devoir patienter ',
 ' Les sports de combat reprendront au Québec ',
 ' Combat en vue pour Charles Jourdain ',
 ' Feu vert pour la boxe pro ',
 'La guerre de Sécession n’a rien changé',
 ' «Je boycotterais mon combat» –Jean Pascal ',
 ' Le tout pour le tout pour Zewski ',
 " «C'est le combat que j'attendais» - Mikaël Zewski ",
 ' «Il doit être honnête avec lui-même» ',
 ' Artur Beterbiev change d’adversaire ',
 ' Le chant du cygne pour Alvarez ',
 'Plus rien à donner... reste une saison',
 ' Dure défaite d’Alvarez ',
 ' Une pesée sans histoire ']

In [44]:
links['politique']

['https://www.journaldemontreal.com/2020/09/06/limpuissance-des-politiciens',
 'https://www.journaldemontreal.com/2020/09/06/des-slogans-a-une-vraie-strategie-economique',
 'https://www.journaldemontreal.com/2020/09/06/la-grc-avait-a-lil-les-independantistes',
 'https://www.journaldemontreal.com/2020/09/06/le-chef-conservateur-veut-reunir-les-deux-solitudes',
 'https://www.journaldemontreal.com/2020/09/06/berlin-critique-lappel-de-trump-a-voter-deux-fois',
 'https://www.journaldemontreal.com/2020/09/06/le-quebec-en-deconfinement-la-pandemie-aura-eu-du-bon',
 'https://www.journaldemontreal.com/2020/09/06/le-quebec-en-deconfinement-10-moments-marquants-de-la-crise-sanitaire',
 'https://www.journaldemontreal.com/2020/09/05/premiere-journee-de-vote-par-anticipation-au-nouveau-brunswick',
 'https://www.journaldemontreal.com/2020/09/05/la-dissolution-du-parti-quebecois-preconisee-par-un-de-ses-fondateurs',
 'https://www.journaldemontreal.com/2020/09/05/taxe-carbone--les-autocollants-dans-les

In [45]:
articles['cinema']

["Les militants prodémocratie de Hong Kong ont relancé leur appel au boycott de la nouvelle version du film de Disney Mulan, un an après que des propos de l'actrice qui incarne l'héroïne eut vivement fait réagir la communauté chinoise.\xa0 En 2019, l’actrice Liu Yifei a exprimé son soutien à la police de Hong Kong, que les manifestants antigouvernementaux accusent d’utiliser une force excessive pour apaiser les troubles. «Ce film est sorti aujourd'hui. Mais parce que Disney s'incline devant Pékin et que Liu Yifei approuve ouvertement et fièrement la brutalité policière à Hong Kong, j'exhorte tous ceux qui croient aux droits de l'homme à #BoycottMulan», a tweeté vendredi la figure de proue du mouvement militant hongkongais Joshua Wong. Liu Yifei, une citoyenne américaine d'origine chinoise, a participé au débat l'année dernière au plus fort des manifestations à Hong Kong, qui ont commencé comme des manifestations largement pacifiques et qui se sont finalement transformées en affrontemen

In [46]:
count = 0
for key, value in articles.items(): 
    if isinstance(value, list): 
        count += len(value) 
print(count) 

704
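
Because the titles, links and article bodies are stored in three separate lists per subsection, a length comparison is a cheap extra check that they stay aligned before they are placed side by side in a DataFrame. The sketch below runs on toy data; `find_misaligned` and the `toy_*` dictionaries are hypothetical names standing in for the notebook's `titles`, `links` and `articles`.

```python
def find_misaligned(titles, links, articles):
    """Return the subsections whose three lists differ in length."""
    return [
        sub for sub in titles
        if not (len(titles[sub]) == len(links[sub]) == len(articles[sub]))
    ]


# Toy data standing in for the notebook's dictionaries.
toy_titles = {"hockey": ["t1", "t2"], "golf": ["t3"]}
toy_links = {"hockey": ["l1", "l2"], "golf": ["l3"]}
toy_articles = {"hockey": ["a1"], "golf": ["a3"]}  # one hockey body is missing

print(find_misaligned(toy_titles, toy_links, toy_articles))  # ['hockey']
```

An empty list means that every title has a matching link and body.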


We now put all the data into a DataFrame.

In [47]:
#Map each category label to its list of subsections
categories = {'actualites': sub_act_list,
              'international': sub_int_list,
              'affaires': sub_aff_list,
              'sports': sub_spo_list,
              'culture': sub_cul_list}

#Build one DataFrame per subsection, then concatenate them
#(DataFrame.append was removed in pandas 2.0, so we use pd.concat)
frames = []
for category, sub_list in categories.items():
    for sub in sub_list:
        df_temp = pd.DataFrame({'title': titles[sub],
                                'content': articles[sub],
                                'link': links[sub]})
        df_temp['category'] = category
        frames.append(df_temp)

df_jdm = pd.concat(frames)

In [48]:
df_jdm.sample(10)

Unnamed: 0,title,content,link,category
12,Les Red Bulls limogent leur entraîneur,Les Red Bulls de New York ont annoncé avoir mi...,https://www.journaldemontreal.com/2020/09/04/l...,sports
12,Trump met fin aux formations contre le racisme,"Washington | Le président américain, Donald Tr...",https://www.journaldemontreal.com/2020/09/05/t...,international
4,La maison de vos rêves grâce à la pandémie?,Du «jamais-vu» depuis 40 ans. Voilà comment es...,https://www.journaldemontreal.com/2020/09/06/l...,actualites
10,Automne intime pour le Trident,Le Trident investira quatre lieux pour sa rela...,https://www.journaldemontreal.com/2020/08/17/a...,culture
1,Cinq blessés dans un accident spectaculaire,Cinq personnes ont été blessées dans un accide...,https://www.journaldemontreal.com/2020/09/06/m...,actualites
5,Nana se dirige vers le Mexique,"L’ouragan Nana, qui a rapidement perdu de sa p...",https://www.journaldemontreal.com/2020/09/03/l...,international
16,Nemaska Lithium plusieurs fois dénoncée à l’AMF,"La minière Nemaska Lithium, dans laquelle Québ...",https://www.journaldemontreal.com/2020/09/03/n...,affaires
7,De publicitaire à fabricant de produits sanit...,"Après 12 ans dans le milieu de la pub, Fayçal ...",https://www.journaldemontreal.com/2020/06/17/d...,affaires
10,Le virus ferme les pubs et assèche la campagne,"Dans les pubs de Dunmore, les verres de Guinne...",https://www.journaldemontreal.com/2020/09/05/e...,international
4,"Indétrônable, le Kim Crawford","Le vin blanc Kim Crawford, qui est offert dans...",https://www.journaldemontreal.com/2020/09/05/s...,affaires


In [49]:
df_jdm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 704 entries, 0 to 17
Data columns (total 4 columns):
title       704 non-null object
content     704 non-null object
link        704 non-null object
category    704 non-null object
dtypes: object(4)
memory usage: 27.5+ KB
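
The line `Int64Index: 704 entries, 0 to 17` shows that every subsection keeps its own row numbers, so index labels repeat across the combined DataFrame (the sample above lists 12, 4 and 10 twice). If unique row labels are preferred, one option is to renumber the rows. The following is a minimal sketch on toy data; `df_a`, `df_b`, `stacked` and `renumbered` are illustrative names.

```python
import pandas as pd

# Two toy frames standing in for two subsections.
df_a = pd.DataFrame({"title": ["t1", "t2"], "category": "sports"})
df_b = pd.DataFrame({"title": ["t3"], "category": "culture"})

# Plain concatenation keeps each frame's own labels: 0, 1, 0.
stacked = pd.concat([df_a, df_b])
print(stacked.index.tolist())

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels.
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to renumber is a design choice: the repeated labels record the position of an article within its subsection, while unique labels make row lookups unambiguous.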


We now save the DataFrame to a CSV file. Since we will do this on different days, we include the date in the file name.

In [52]:
from datetime import date

today = date.today()
day = today.strftime("%m%d%y")
print("The day in string format is =", day)

#Create a string for the csv file
file = 'articles_journal_de_montreal_' + day + '.csv'
print("File name: ", file)

The day in string format is = 090620
File name:  articles_journal_de_montreal_090620.csv


In [53]:
df_jdm.to_csv(file, header=True, index=True)
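
The `"%m%d%y"` pattern gives compact names, but files named this way do not sort chronologically across months or years. As a possible alternative, the sketch below builds the same file name and an ISO-style variant; `dated_filename` and `scrape_day` are hypothetical names, and the fixed date is only there to make the example reproducible.

```python
from datetime import date


def dated_filename(day, fmt="%m%d%y", prefix="articles_journal_de_montreal_"):
    """Build the CSV file name for a given date and date format."""
    return prefix + day.strftime(fmt) + ".csv"


scrape_day = date(2020, 9, 6)

# Same pattern as the notebook (month, day, two-digit year).
print(dated_filename(scrape_day))

# ISO-style alternative: alphabetical order matches chronological order.
print(dated_filename(scrape_day, fmt="%Y-%m-%d"))
```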