# Scraping the archived news from La Presse 

We use web scraping with BeautifulSoup to build a dataset from the archived news of La Presse, which we will use to train our models. We label the data with 5 categories: 'Sports', 'Culture', 'Actualite', 'International' and 'Affaires'.

We collect the news from 50 random days between 2011 and 2020 (5 per year). If all the data came from the same time period, the outcome could be biased, since the articles would probably cover similar subjects.

We import the relevant libraries. 

In [4]:
from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

We will only collect the news from the following list of subcategories. 

In [1]:
#Put the different subsections into a list
#Actualites 
sub_act_list = ['national', 'politique', 'grand-montreal', 'regional',
                'justice-et-faits-divers', 'sante', 'education', 'enquetes',
                'insolite', 'environnement', 'sciences']
#International 
sub_int_list = ['afrique', 'amerique-latine', 'asie-et-oceanie', 
               'caraibes', 'etats-unis', 'europe', 'moyen-orient']

#Affaires 
sub_aff_list = ['economie', 'marches', 'entreprises', 'techno', 'medias',
               'finances-personnelles', 'pme', 'portfolio', 'tetes-daffiche']
#Sports 
sub_spo_list = ['hockey', 'tokyo-2020', 'soccer', 'football', 'tennis',
               'baseball', 'course-automobile', 'golf', 'sports-de-combat', 
               'sports-dhiver', 'basketball', 'cyclisme']
#Culture 
sub_cul_arts_list = ['musique', 'television', 'theatre', 'litterature', 
                    'arts-visuels', 'spectacles', 'humour', 'celebrites']
sub_cul_cinema_list = ['cinema']

#Concatenated list 
sub_lp_list = sub_act_list +sub_int_list + sub_aff_list + sub_spo_list + sub_cul_arts_list + sub_cul_cinema_list

The La Presse archives are served from URLs of the form "https://www.lapresse.ca/archives/DATE.php", where DATE has the format YYYY/m/d (month and day without leading zeros).

In [42]:
# Create a list of urls for 50 random dates
import random
random.seed(42)

years = list(range(2011, 2021))
urls = []
i = 0
days = [random.randint(1,28) for i in range(0,50)]
months = [random.randint(1,12) for i in range(0,45)]

#2020 is not complete, so we make sure the date is from the past
months_2020 = [random.randint(1,8) for i in range(0,5)]

while i <10: 
    j = 0
    year = str(years[i])
    while j < 5:
        n = 5*i + j
        day = str(days[n])
        if i == 9:
            month = str(months_2020[j])
        else:
            month = str(months[n])
        url_temp = 'https://www.lapresse.ca/archives/' + year + '/' + month + '/' + day + '.php'
        urls.append(url_temp)
        j = j+1
    i = i+1

We check to see if it worked.

In [43]:
urls

['https://www.lapresse.ca/archives/2011/2/21.php',
 'https://www.lapresse.ca/archives/2011/7/4.php',
 'https://www.lapresse.ca/archives/2011/2/1.php',
 'https://www.lapresse.ca/archives/2011/6/24.php',
 'https://www.lapresse.ca/archives/2011/6/9.php',
 'https://www.lapresse.ca/archives/2012/10/8.php',
 'https://www.lapresse.ca/archives/2012/5/8.php',
 'https://www.lapresse.ca/archives/2012/1/5.php',
 'https://www.lapresse.ca/archives/2012/12/24.php',
 'https://www.lapresse.ca/archives/2012/8/4.php',
 'https://www.lapresse.ca/archives/2013/9/22.php',
 'https://www.lapresse.ca/archives/2013/2/24.php',
 'https://www.lapresse.ca/archives/2013/7/18.php',
 'https://www.lapresse.ca/archives/2013/2/3.php',
 'https://www.lapresse.ca/archives/2013/9/19.php',
 'https://www.lapresse.ca/archives/2014/5/14.php',
 'https://www.lapresse.ca/archives/2014/11/2.php',
 'https://www.lapresse.ca/archives/2014/10/1.php',
 'https://www.lapresse.ca/archives/2014/6/3.php',
 'https://www.lapresse.ca/archives/201

We make sure there are no duplicates. 

In [44]:
len(urls) == len(set(urls))

True
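Uniqueness can also be guaranteed by construction instead of being checked afterwards. The sketch below is one possible alternative, not the code used above: it draws five distinct calendar dates per year with `random.sample`. The helper name `sample_archive_urls`, the variable `urls_alt` and the 2020 cutoff date (end of August) are assumptions made for illustration. Unlike the cell above, it can also pick days 29 to 31, and it will not reproduce the same 50 dates.

```python
import random
from datetime import date, timedelta

BASE = 'https://www.lapresse.ca/archives/'

def sample_archive_urls(first_year=2011, last_year=2020, per_year=5,
                        last_date=date(2020, 8, 31), seed=42):
    """Return archive URLs for `per_year` distinct random dates in each year."""
    rng = random.Random(seed)
    urls = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1)
        # Never go past the (assumed) last available date
        end = min(date(year, 12, 31), last_date)
        n_days = (end - start).days + 1
        # random.sample draws without replacement, so dates cannot repeat
        offsets = rng.sample(range(n_days), per_year)
        for offset in sorted(offsets):
            d = start + timedelta(days=offset)
            # The archive URL uses unpadded month and day numbers
            urls.append(f'{BASE}{d.year}/{d.month}/{d.day}.php')
    return urls

urls_alt = sample_archive_urls()
```

Because each year contributes five different dates, the resulting list needs no duplicate check.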

We send a request for each URL and print the status code of the response.

In [45]:
#Requests
headers = {'Accept': 'text/html', 'User-Agent':'Mozilla/5.0'}
response_lp = {}
for url in urls:
    response_lp[url] = get(url, headers = headers)
    print("Status for", url, "is", response_lp[url].status_code)

Status for https://www.lapresse.ca/archives/2011/2/21.php is 200
Status for https://www.lapresse.ca/archives/2011/7/4.php is 200
Status for https://www.lapresse.ca/archives/2011/2/1.php is 200
Status for https://www.lapresse.ca/archives/2011/6/24.php is 200
Status for https://www.lapresse.ca/archives/2011/6/9.php is 200
Status for https://www.lapresse.ca/archives/2012/10/8.php is 200
Status for https://www.lapresse.ca/archives/2012/5/8.php is 200
Status for https://www.lapresse.ca/archives/2012/1/5.php is 200
Status for https://www.lapresse.ca/archives/2012/12/24.php is 200
Status for https://www.lapresse.ca/archives/2012/8/4.php is 200
Status for https://www.lapresse.ca/archives/2013/9/22.php is 200
Status for https://www.lapresse.ca/archives/2013/2/24.php is 200
Status for https://www.lapresse.ca/archives/2013/7/18.php is 200
Status for https://www.lapresse.ca/archives/2013/2/3.php is 200
Status for https://www.lapresse.ca/archives/2013/9/19.php is 200
Status for https://www.lapresse
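When many pages are requested in a row, it can help to pause between requests and to retry the ones that fail. The following is a minimal sketch under stated assumptions, not the code used in this notebook: `fetch_pages` is a hypothetical helper, and `fetch` stands for any callable that takes a URL and returns an object with a `status_code` attribute.

```python
import time

def fetch_pages(urls, fetch, delay=1.0, retries=2):
    """Fetch each URL with `fetch`, pausing between requests and retrying failures.

    Returns a dictionary of successful responses and a list of failed URLs.
    """
    responses, failed = {}, []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                response = fetch(url)
            except Exception as exc:  # network errors, timeouts, ...
                response = None
                print('Error for', url, ':', exc)
            if response is not None and response.status_code == 200:
                responses[url] = response
                break
            time.sleep(delay)  # wait before the next attempt
        else:
            # The loop ended without a break: no attempt succeeded
            failed.append(url)
        time.sleep(delay)  # be polite between pages
    return responses, failed
```

With the names from the cell above, a call could look like `fetch_pages(urls, lambda u: get(u, headers=headers, timeout=10))`; the 10-second timeout is an arbitrary choice.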

We create a soup. 

In [46]:
#Save the main page content
mainpage_lp = {}
for url in urls:
    mainpage_lp[url] = response_lp[url].content 

#Soup Creation 
soup_lp = {}
for url in urls:
    soup_lp[url] = BeautifulSoup(mainpage_lp[url], 'html.parser')

We extract the list of headlines from each archive page.

In [57]:
headline_lp = {}
for url in urls:
    for ultag in soup_lp[url].find_all("ul", class_ = 'square square-spread'):
        headline_lp[url] = ultag.find_all('li')
headline_lp['https://www.lapresse.ca/archives/2012/5/8.php'][1]

<li>(07:00) <a alt="Un après-midi avec les bonobos" href="https://www.lapresse.ca/voyage/destinations/afrique/201205/07/01-4522889-un-apres-midi-avec-les-bonobos.php" title="Un après-midi avec les bonobos">Un après-midi avec les bonobos</a></li>

From this, we see that each list item gives the publication time, the title of the article and the link to the complete article. We take the title from the list item and follow the link to extract the full text.
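As a small illustration of how one list item can be read, here is a hedged sketch. The helper `parse_headline` is hypothetical and not part of the notebook; it returns `None` when an item has no usable link instead of raising an error. The sample HTML reuses the item shown above plus one made-up item without a link.

```python
from bs4 import BeautifulSoup

def parse_headline(li):
    """Return (title, link) for one <li> tag, or None if it has no usable link."""
    anchor = li.find('a')
    if anchor is None:
        return None
    # Tag.get returns None instead of raising when an attribute is missing
    link = anchor.get('href')
    title = anchor.get('title') or anchor.get_text(strip=True)
    if not link or not title:
        return None
    return title, link

sample = ('<ul><li>(07:00) <a href="https://www.lapresse.ca/voyage/destinations/afrique/'
          '201205/07/01-4522889-un-apres-midi-avec-les-bonobos.php" '
          'title="Un après-midi avec les bonobos">Un après-midi avec les bonobos</a></li>'
          '<li>no link here</li></ul>')
items = BeautifulSoup(sample, 'html.parser').find_all('li')
parsed = [parse_headline(li) for li in items]
```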

In [59]:
articles = {}
titles = {}
links = {}

# Loop over each archive page in the dictionary
for url in urls: 
    #Create lists
    articles[url] = []
    titles[url] = []
    links[url] = []
    
    #Loop over each headline of the page 
    for n, item in enumerate(headline_lp[url]):
        anchor = item.find('a')
        #Skip items without a link or a title
        #(.get returns None instead of raising when the attribute is missing)
        if anchor is None or anchor.get('title') is None or anchor.get('href') is None:
            print('NonType in sub', url, 'number', n)
            continue

        #Access link to the article 
        link = anchor['href']

        #Getting the title
        title = anchor['title']

        #Getting the content of the article
        article = get(link)
        article_content = article.content
        soup_article = BeautifulSoup(article_content, 'html.parser')
        body = soup_article.find_all('div', class_ = 'articleBody')
        if len(body) == 0:
            print('Empty body in sub', url, 'number', n)
            continue

        x = body[0].find_all('p', {'class' : ['lead textModule textModule--type-lead ', 'paragraph textModule textModule--type-paragraph ']}) 

        #Unifying the paragraphs
        #The join is done once, after collecting the paragraphs, so an article
        #without matching paragraphs gives an empty string instead of reusing
        #the text of the previous article
        list_paragraphs = [p.get_text() for p in x]
        final_article = " ".join(list_paragraphs)

        articles[url].append(final_article)
        links[url].append(link)
        titles[url].append(title)

Empty body in sub https://www.lapresse.ca/archives/2014/6/3.php number 5
Empty body in sub https://www.lapresse.ca/archives/2017/5/8.php number 91
Empty body in sub https://www.lapresse.ca/archives/2018/3/26.php number 24
Empty body in sub https://www.lapresse.ca/archives/2018/6/1.php number 0
Empty body in sub https://www.lapresse.ca/archives/2018/4/25.php number 104
Empty body in sub https://www.lapresse.ca/archives/2018/11/26.php number 0
Empty body in sub https://www.lapresse.ca/archives/2018/11/26.php number 1
Empty body in sub https://www.lapresse.ca/archives/2019/5/6.php number 0
Empty body in sub https://www.lapresse.ca/archives/2019/5/6.php number 3
Empty body in sub https://www.lapresse.ca/archives/2019/5/6.php number 62
Empty body in sub https://www.lapresse.ca/archives/2019/11/11.php number 0
Empty body in sub https://www.lapresse.ca/archives/2020/4/7.php number 1
Empty body in sub https://www.lapresse.ca/archives/2020/7/4.php number 1


Since the number of errors is low compared to the total amount of data, we leave it as is for now. We check a few examples to make sure the scraping worked. 

In [60]:
titles['https://www.lapresse.ca/archives/2018/3/26.php']

["Mon clin d'oeil du lundi 26 mars 2018",
 'Arcade Fire couronné au gala des prix Juno',
 'Les femmes du\xa0président',
 'Partis, quelles données utilisez-vous?',
 'Doug Ford et le\xa0casse-tête ontarien',
 'Dure semaine pour la CAQ',
 "Expulsion imminente d'une Guatémaltèque: «Nous ne savons plus quoi faire»",
 "Des jumeaux «cosmiques» révèlent les effets de l'espace",
 "Québec doit en faire plus pour reconnaître les diplômes étrangers, selon l'IDQ",
 'Tuerie à la mosquée de Québec: Bissonnette plaide non coupable',
 'Remorquages sauvages: une entreprise de Montréal sous la loupe',
 'Carles Puigdemont reste en détention en Allemagne',
 'Philips choisit une entreprise montréalaise pour son\xa0échographe portable',
 "Le fabricant d'armes Remington en difficulté",
 'Obèse... mais en santé',
 'Triomphe assuré pour Sissi à la présidentielle en Égypte',
 'Agences de voyages: des fraudeurs se paient des forfaits tout inclus',
 'Trudeau innocente six chefs autochtones exécutés il y a plus de 

In [61]:
links['https://www.lapresse.ca/archives/2013/9/22.php']

['https://www.lapresse.ca/voyage/trucs-conseils/201309/19/01-4691116-comment-se-rendre-a-new-york.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691383-chelsea-wolfe-douleur-et-beaute-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691392-elvis-costello-the-roots-un-trip-de-grands-musiciens-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691399-joseph-rouleau-un-terrifiant-boris-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691408-azam-ali-loga-ramin-torkian-futur-anterieur-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691417-jack-johnson-zone-de-confort-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691421-chvrches-du-remplissage-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691427-misteur-valaire-divertissement-consensuel-et-mondialise-.php',
 'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/0

In [63]:
articles['https://www.lapresse.ca/archives/2020/8/11.php'][2]

'Le Liban est à genoux.  L’explosion catastrophique dans le port de Beyrouth est un nouvel évènement révélateur de la totale incompétence et de la corruption de la classe politique libanaise, contestée depuis plusieurs mois par la rue libanaise, qui exige la démission de tous les leaders politiques, au premier chef le président, le gouvernement et le Parlement. Leur niveau de déliquescence et de faillite est devenu intolérable. Le Canada doit se montrer à la hauteur du moment.  En appelant à des réformes démocratiques et socioéconomiques profondes et en indiquant qu’aucune aide canadienne ne transitera par les structures politiques en place, mais plutôt par l’ONU et les ONG, Ottawa emboîte le pas à la France et à l’Europe, qui exercent une pression maximum pour le départ des autorités en place. Comme la communauté internationale, Ottawa a annoncé une aide ponctuelle humanitaire pour gérer les conséquences immédiates de l’explosion. C’est bien. Mais le plus gros reste à faire, car la cr

In [65]:
#We count the number of data points.

count = 0
for key, value in articles.items(): 
    if isinstance(value, list): 
        count += len(value) 
print('Number of articles:', count)

count = 0
for key, value in titles.items(): 
    if isinstance(value, list): 
        count += len(value) 
print('Number of titles:', count)

count = 0
for key, value in links.items(): 
    if isinstance(value, list): 
        count += len(value) 
print('Number of links:', count)



Number of articles: 6904
Number of titles: 6904
Number of links: 6904
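The three counting loops above follow the same pattern, so they could be folded into one helper. This is only a sketch: `count_items` is a hypothetical name, and the toy dictionary stands in for `articles`, `titles` or `links`.

```python
def count_items(d):
    """Total number of elements across all list values of a dictionary."""
    return sum(len(value) for value in d.values() if isinstance(value, list))

# Toy dictionary with the same shape as articles, titles and links
example = {'page1': ['a', 'b'], 'page2': [], 'page3': ['c']}
total = count_items(example)
```

The same call could then be applied to each of the three dictionaries in turn.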


We put all the data in a dataframe. We first extract the category from each URL in 'links' and store it in a dictionary with the same structure. 

In [119]:
categories = {}
for url in links.keys():
    categories[url] = []
    for n in np.arange(0, len(links[url])):
        string = links[url][n].split("https://www.lapresse.ca/",1)[1].split("/", 1)[0]
        if string == 'arts': 
            string = 'culture'
        if string == 'cinema': 
            string = 'culture'
        #if string in lp_list:
        categories[url].append(string)  
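The category is the first segment of the path in the article URL. A hedged alternative to the string splitting above is to let `urllib.parse` isolate the path. In this sketch, `category_from_link` and `RENAME` are hypothetical names, and the two example links are taken from the output shown earlier.

```python
from urllib.parse import urlparse

# arts and cinema are both relabelled as culture, as in the cell above
RENAME = {'arts': 'culture', 'cinema': 'culture'}

def category_from_link(link):
    """Return the first path segment of an article URL, relabelled if needed."""
    segments = [s for s in urlparse(link).path.split('/') if s]
    if not segments:
        return None
    return RENAME.get(segments[0], segments[0])

examples = [
    'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691383-chelsea-wolfe-douleur-et-beaute-.php',
    'https://www.lapresse.ca/voyage/trucs-conseils/201309/19/01-4691116-comment-se-rendre-a-new-york.php',
]
labels = [category_from_link(link) for link in examples]
```

Here the first link is labelled 'culture' and the second 'voyage'; a link with an empty path returns `None` instead of raising an error.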

In [120]:
#We count the number of category labels (one per article, before any filtering).

count = 0
for key, value in categories.items(): 
    if isinstance(value, list): 
        count += len(value) 
print('Number of articles:', count)

Number of articles: 6904


In [121]:
#Build one DataFrame per archive page, then concatenate them
#(DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions)
frames = []
for url in categories.keys(): 
    df_temp = pd.DataFrame({'title': titles[url],
                            'content': articles[url],
                            'link': links[url],
                            'category': categories[url]})
    frames.append(df_temp)
df_lp = pd.concat(frames)

In [122]:
df_lp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6904 entries, 0 to 88
Data columns (total 4 columns):
title       6904 non-null object
content     6904 non-null object
link        6904 non-null object
category    6904 non-null object
dtypes: object(4)
memory usage: 269.7+ KB


In [123]:
df_lp.sample(10)

index,title,content,link,category
113,CPE: Québec lève son moratoire,"La ministre de la Famille, Francine Charbonnea...",https://www.lapresse.ca/actualites/politique/p...,actualites
50,Ceux qui n'ont plus de pression...,Le phénomène n'est pas nouveau. Le club est éc...,https://www.lapresse.ca/sports/hockey/poolers/...,sports
48,Nouvelle vague de piratage: le Canada est mal ...,"Ces derniers jours, les regroupements de pirat...",https://www.lapresse.ca/techno/internet/201107...,techno
29,Plan large sur le cinéma québécois,Si vous passez par le Quartier latin durant le...,https://www.lapresse.ca/cinema/cinema-quebecoi...,culture
74,Scheer veut relancer les travaux du Parlement ...,(Ottawa) Le chef conservateur Andrew Scheer pr...,https://www.lapresse.ca/actualites/politique/2...,actualites
40,Zaz est au Québec,La populaire chanteuse française Zaz est à Mon...,https://www.lapresse.ca/arts/musique/201512/17...,culture
28,Déjà dimanche!: un duo à apprivoiser,"Serons-nous des adeptes de Déjà dimanche !, le...",https://www.lapresse.ca/arts/television/201605...,culture
63,"Baseball: paris risqués, paris gagnants",Tous les joueurs autonomes ne peuvent évidemme...,https://www.lapresse.ca/sports/baseball/201106...,sports
108,Les Sabres coupent les ailes des Wings en fusi...,Tyler Ennis et Zemgus Girgensons ont touché la...,https://www.lapresse.ca/sports/hockey/201411/0...,sports
96,Une femme tente de s'évader de prison en se fa...,"Une femme incarcérée à la prison de Sequedin, ...",https://www.lapresse.ca/actualites/insolite/20...,actualites


Some articles do not fit in the categories we have decided to gather. We remove those rows. 

In [124]:
#List of categories we want to gather
lp_list = ['actualites', 'international', 'affaires', 'sports', 'culture']

df_lp = df_lp[df_lp['category'].isin(lp_list)]

In retrospect, this filter should have been applied when we gathered the titles, links and article bodies, which would have avoided downloading articles we do not keep. We will do so for the archive data from Le Journal de Montreal.  
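A filter of this kind, applied before downloading, could look like the sketch below. It is an illustration only, not the code used for Le Journal de Montreal: `is_wanted` and `WANTED` are hypothetical names, and the section names are the raw URL segments (so 'arts' and 'cinema' appear instead of 'culture').

```python
# Raw URL sections to keep; arts and cinema are later relabelled as culture
WANTED = {'actualites', 'international', 'affaires', 'sports', 'arts', 'cinema'}

def is_wanted(link, wanted=WANTED, prefix='https://www.lapresse.ca/'):
    """Return True if the first path segment of the link is a wanted section."""
    if not link.startswith(prefix):
        return False
    section = link[len(prefix):].split('/', 1)[0]
    return section in wanted

candidate_links = [
    'https://www.lapresse.ca/voyage/trucs-conseils/201309/19/01-4691116-comment-se-rendre-a-new-york.php',
    'https://www.lapresse.ca/arts/musique/critiques-cd/201309/20/01-4691383-chelsea-wolfe-douleur-et-beaute-.php',
    'https://www.lapresse.ca/voyage/destinations/afrique/201205/07/01-4522889-un-apres-midi-avec-les-bonobos.php',
]
# Only the links that pass the filter would be requested
to_download = [link for link in candidate_links if is_wanted(link)]
```

In the scraping loop, the check would sit just before the request for the article, so that unwanted sections are skipped without any network call.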

In [125]:
df_lp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706 entries, 3 to 88
Data columns (total 4 columns):
title       5706 non-null object
content     5706 non-null object
link        5706 non-null object
category    5706 non-null object
dtypes: object(4)
memory usage: 222.9+ KB


In [126]:
df_lp.sample(10)

index,title,content,link,category
153,Les Américains célèbrent leur indépendance,"Les Américains ont célébré, lundi, le 235e ann...",https://www.lapresse.ca/international/etats-un...,international
90,Un vêtement pour bébé jugé offensant pour les ...,Walmart Canada a annoncé qu'il retirait de ses...,https://www.lapresse.ca/actualites/201707/14/0...,actualites
31,Montréal rend hommage à Julie Hamelin,"Le parc Jean-Rivard, situé dans la Cité des ar...",https://www.lapresse.ca/arts/spectacles-et-the...,culture
58,JO de 2020: 15 nouvelles épreuves à Tokyo,Un relais mixte 4 x 400 m en athlétisme et le ...,https://www.lapresse.ca/sports/autres-sports/o...,sports
75,Scott Dixon de nouveau vainqueur à Indianapolis,"(Indianapolis) Finalement, Scott Dixon a de no...",https://www.lapresse.ca/sports/course-automobi...,sports
36,Accusations contre l'ex-patron de l'IAAF: «Le ...,«Tout vient des Britanniques» et de la volonté...,https://www.lapresse.ca/sports/autres-sports/a...,sports
183,L'acteur Stephen Collins soupçonné d'attouchem...,La police de Los Angeles a confirmé mardi réex...,https://www.lapresse.ca/arts/vie-de-stars/2014...,culture
81,Le mystère des paraboles sur les huttes de cas...,"(Ottawa) Au Canada, de mystérieuses paraboles ...",https://www.lapresse.ca/actualites/insolite/20...,actualites
124,Prince avait une concentration excessive de fe...,Un rapport de toxicologie découlant de l'autop...,https://www.lapresse.ca/arts/musique/201803/26...,culture
176,Londres a déjoué un complot terroriste de l'EI,La police et le Service de sécurité britanniqu...,https://www.lapresse.ca/international/europe/2...,international


In [127]:
df_lp.to_csv('articles_lapresse_archives.csv', header=True, index=True)