# News Scraping for La Presse

We use web scraping with BeautifulSoup to build a dataset of news articles. We label the data with five categories: 'Sports', 'Culture', 'Actualites', 'International' and 'Affaires'.

We import the relevant libraries. 

In [1]:
from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time 

### La Presse

We use web scraping to get data from La Presse. We will access several subpages, so we list the subsections of each section and store their URLs in a dictionary.

In [2]:
#Put the different subsections into a list
#Actualites 
sub_act_list = ['national', 'politique', 'grand-montreal', 'regional',
                'justice-et-faits-divers', 'sante', 'education', 'enquetes',
                'insolite', 'environnement', 'sciences']
#International 
sub_int_list = ['afrique', 'amerique-latine', 'asie-et-oceanie', 
               'caraibes', 'etats-unis', 'europe', 'moyen-orient']

#Affaires 
sub_aff_list = ['economie', 'marches', 'entreprises', 'techno', 'medias',
               'finances-personnelles', 'pme', 'portfolio', 'tetes-daffiche']
#Sports 
sub_spo_list = ['hockey', 'tokyo-2020', 'soccer', 'football', 'tennis',
               'baseball', 'course-automobile', 'golf', 'sports-de-combat', 
               'sports-dhiver', 'basketball', 'cyclisme']
#Culture 
sub_cul_arts_list = ['musique', 'television', 'theatre', 'litterature', 
                    'arts-visuels', 'spectacles', 'humour', 'celebrites']
sub_cul_cinema_list = ['cinema']

#Concatenated list 
sub_lp_list = sub_act_list + sub_int_list + sub_aff_list + sub_spo_list + sub_cul_arts_list + sub_cul_cinema_list

#URL definitions
url = {}
for sub in sub_act_list:
    sub_url = "https://www.lapresse.ca/actualites/" + sub
    url[sub] = sub_url
for sub in sub_int_list:
    sub_url = "https://www.lapresse.ca/international/" + sub
    url[sub] = sub_url
for sub in sub_aff_list:
    sub_url = "https://www.lapresse.ca/affaires/" + sub
    url[sub] = sub_url
for sub in sub_spo_list:
    sub_url = "https://www.lapresse.ca/sports/" + sub
    url[sub] = sub_url
for sub in sub_cul_arts_list:
    sub_url = "https://www.lapresse.ca/arts/" + sub
    url[sub] = sub_url
url['cinema'] = "https://www.lapresse.ca/cinema/"
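
The URL dictionary above could also be built from a single mapping of section prefix to subsections. The sketch below is one possible way to do it, not the notebook's method: the helper name `build_urls`, the `BASE_URL` constant and the duplicate check are assumptions added for illustration, and only a few subsections are listed to keep it short.

```python
BASE_URL = "https://www.lapresse.ca"  # same host as in the cell above


def build_urls(sections, base_url=BASE_URL):
    """Map each subsection name to its full URL.

    `sections` maps a URL prefix (for example 'actualites') to the list
    of subsections that live under it.
    """
    urls = {}
    for prefix, subsections in sections.items():
        for sub in subsections:
            # The dictionary is keyed by subsection name only, so a name
            # reused in two sections would silently overwrite the first.
            if sub in urls:
                raise ValueError(f"duplicate subsection name: {sub}")
            urls[sub] = f"{base_url}/{prefix}/{sub}"
    return urls


# Shortened example; the real lists are defined in the cell above.
sections = {
    "actualites": ["national", "politique"],
    "arts": ["musique", "television"],
}
example_urls = build_urls(sections)
print(example_urls["musique"])
```

Keying by subsection name is convenient, but it relies on every name being unique across sections, which is why the sketch raises an error instead of overwriting an entry.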


We send a request to each subsection page and print the status code to confirm that the request succeeded.

In [3]:
#Requests
headers = {'Accept': 'text/html', 'User-Agent':'Mozilla/5.0'}
response_lp = {}
for sub in sub_lp_list:
    response_lp[sub] = get(url[sub], headers = headers)
    print("Status for", sub, "is", response_lp[sub].status_code)

Status for national is 200
Status for politique is 200
Status for grand-montreal is 200
Status for regional is 200
Status for justice-et-faits-divers is 200
Status for sante is 200
Status for education is 200
Status for enquetes is 200
Status for insolite is 200
Status for environnement is 200
Status for sciences is 200
Status for afrique is 200
Status for amerique-latine is 200
Status for asie-et-oceanie is 200
Status for caraibes is 200
Status for etats-unis is 200
Status for europe is 200
Status for moyen-orient is 200
Status for economie is 200
Status for marches is 200
Status for entreprises is 200
Status for techno is 200
Status for medias is 200
Status for finances-personnelles is 200
Status for pme is 200
Status for portfolio is 200
Status for tetes-daffiche is 200
Status for hockey is 200
Status for tokyo-2020 is 200
Status for soccer is 200
Status for football is 200
Status for tennis is 200
Status for baseball is 200
Status for course-automobile is 200
Status for golf is 200
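
The loop above sends all requests back to back and only prints the status code. A more defensive variant might pause between requests, set a timeout and record failures instead of stopping. The sketch below is an assumption about how that could look: `fetch_pages` and its parameters are hypothetical names, and `session` is expected to be a `requests.Session()` or any object with a compatible `get` method.

```python
import time


def fetch_pages(urls, session, headers=None, delay=1.0, timeout=10):
    """Fetch each page in `urls` (name -> URL); return (responses, failures)."""
    responses, failures = {}, {}
    for position, (name, page_url) in enumerate(urls.items()):
        if position > 0:
            time.sleep(delay)  # pause between requests to limit server load
        try:
            response = session.get(page_url, headers=headers, timeout=timeout)
            response.raise_for_status()  # turn 4xx/5xx codes into exceptions
        except OSError as error:
            # The exceptions raised by requests derive from OSError
            # (RequestException subclasses IOError), so this catches them.
            failures[name] = str(error)
        else:
            responses[name] = response
    return responses, failures
```

With the notebook's variables this could be called as `fetch_pages(url, requests.Session(), headers=headers)`, which requires `import requests` in addition to the `get` import used above.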

We parse the content of each page into a BeautifulSoup object.

In [4]:
#Save the main page content
mainpage_lp = {}
for sub in sub_lp_list:
    mainpage_lp[sub] = response_lp[sub].content 

#Soup Creation 
soup_lp = {}
for sub in sub_lp_list:
    soup_lp[sub] = BeautifulSoup(mainpage_lp[sub], 'html.parser')

We collect every article element on each page and display the first one from the 'techno' subsection.

In [5]:
headline_lp = {}
for sub in sub_lp_list:
    headline_lp[sub] = soup_lp[sub].find_all("article")
    #print(len(headline_lp[sub]))
headline_lp['techno'][0]

<article class=" storyCard storyCard--position-1 card AFF mostRecentCard " data-position="1">
<div class="webpart adminContainer adminContainer--layout-list" data-webpart-type="crayon" data-webpart-url="/ops/webpart/crayon/article/4/0f6243054a7033a3a7fa9129db372ead/5288044" data-webpart-validate-crayon="true">
</div>
<a class="visual" data-target-id="0f6243054a7033a3a7fa9129db372ead" data-target-legacy-id="5288044" data-target-type="story" href="https://www.lapresse.ca/affaires/techno/2020-09-05/la-minute-tiktok-comment-recycler-ses-soins-dentaires.php" title="La minute TikTok : comment recycler ses soins dentaires">
<span class="visualWrapper">
<img alt=" La minute TikTok : comment recycler ses soins dentaires)" src="https://mobile-img.lpcdn.ca/lpca/357x/r3996/7a618fb8-eed3-11ea-b8ad-02fe89184577.jpg"/>
<span class="videoSticker">
<span class="videoSticker__duration">00:58</span>
</span>
</span>
</a>
<div class="articleDetail mostRecentCard__detail">
<a data-target-id="0f6243054a7033a

From this, we see that we can easily extract the title and the description of each article, as well as the link for the complete article. We will use that to go to the article and extract the title and full text. 
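
As a small illustration of that idea, the sketch below reads the link and the title from the first anchor of a card. It is a simplified alternative, not the approach used in the next cell: `SAMPLE_CARD` is a hypothetical, trimmed-down card modeled on the output above, and `card_title_and_link` is a helper name introduced here. It takes the title from the anchor's `title` attribute, whereas the notebook reads it from a `span`.

```python
from bs4 import BeautifulSoup

# Hypothetical card, reduced to the attributes visible in the output above.
SAMPLE_CARD = """
<article class="storyCard card" data-position="1">
<a class="visual" href="https://www.lapresse.ca/affaires/techno/2020-09-05/la-minute-tiktok-comment-recycler-ses-soins-dentaires.php" title="La minute TikTok : comment recycler ses soins dentaires">
<span class="visualWrapper"></span>
</a>
</article>
"""


def card_title_and_link(card):
    """Return (title, link) from the card's first anchor with an href.

    Returns None when the card has no such anchor. The title is None
    when the anchor carries no title attribute.
    """
    anchor = card.find("a", href=True)
    if anchor is None:
        return None
    return anchor.get("title"), anchor["href"]


soup = BeautifulSoup(SAMPLE_CARD, "html.parser")
card = soup.find("article")
print(card_title_and_link(card))
```

Returning None for cards without a link lets the caller skip them explicitly instead of failing on a missing attribute.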

In [6]:
articles = {}
titles = {}
links = {}

# Loop over each subsection in the list
for sub in sub_lp_list: 
    #Create lists
    articles[sub] = []
    titles[sub] = []
    links[sub] = []
    
    #Loop over each article in a section 
    for n in np.arange(0, len(headline_lp[sub])):
        if headline_lp[sub][n].find('span', {'class': ['title mostRecentCard__title ', 'title mostRecentCard__title hasFeaturedAuthor', 'headlineCard__title ']}) is None:
            print('NonType in sub', sub, 'number', n)
        else:
        
            #Access link to the article 
            link = headline_lp[sub][n].find('a')['href']
            links[sub].append(link)
        
            #Getting the title
            title = headline_lp[sub][n].find('span', {'class': ['title mostRecentCard__title ', 'title mostRecentCard__title hasFeaturedAuthor', 'headlineCard__title ']}).get_text()
            titles[sub].append(title)
        
            #Getting the content of the article
            article = get(link)
            article_content = article.content
            soup_article = BeautifulSoup(article_content, 'html.parser')
            body = soup_article.find_all('div', class_ = 'articleBody')
            x = body[0].find_all('p', {'class' : ['lead textModule textModule--type-lead ', 'paragraph textModule textModule--type-paragraph ']}) 
        
            #Unifying the paragraphs
            #(join outside the loop so that an article without matching
            #paragraphs yields an empty string instead of reusing the
            #text of the previous article)
            list_paragraphs = [p.get_text() for p in x]
            final_article = " ".join(list_paragraphs)
        
            articles[sub].append(final_article)
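
The loop above assumes that every article page contains a `div` with the class `articleBody`. A page without one would raise an `IndexError` on `body[0]`. The sketch below shows one way to isolate the text extraction in a function that handles that case. It is an illustration under assumptions: `extract_article_text` is a hypothetical helper, and for brevity it keeps every `p` inside the body rather than filtering on the paragraph classes used above.

```python
from bs4 import BeautifulSoup


def extract_article_text(html):
    """Return the article text as one string, or None if no body is found."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="articleBody")
    if body is None:
        return None
    # Normalize the whitespace of each paragraph and drop empty ones.
    paragraphs = [" ".join(p.get_text().split()) for p in body.find_all("p")]
    return " ".join(text for text in paragraphs if text)


# Hypothetical page fragments used only to exercise the function.
sample_page = (
    '<div class="articleBody">'
    '<p class="lead">Intro.</p><p class="paragraph"> Suite du texte. </p><p></p>'
    "</div>"
)
print(extract_article_text(sample_page))
print(extract_article_text('<div class="other"><p>Rien</p></div>'))
```

A caller can then skip the article when the function returns None, which keeps the titles, links and texts aligned only if the title and link are appended after the text has been extracted successfully.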

We check the titles, links and article texts scraped for one subsection to make sure that the scraping worked.

In [7]:
titles['techno']

['La minute TikTok\xa0: comment recycler ses soins dentaires',
 'Le Pentagone réaffirme son choix de Microsoft pour son mégacontrat de «cloud»',
 'Facebook retire les comptes du groupe américain d’extrême droite Patriot Prayer',
 'Chine: Apple clarifie sa politique sur la liberté d’expression',
 'Pandémie: Amazon embauchera 7000 personnes de plus au Royaume-Uni',
 'Le compte Twitter du premier ministre indien piraté',
 'Verizon investit près de 2 milliards supplémentaires dans la 5G',
 'COVID-19: Apple et Google intègrent le traçage directement dans les téléphones',
 'Donald Trump ne lâche rien sur la vente de TikTok',
 'Facebook et Twitter démantèlent une petite campagne de désinformation russe',
 'Rogers étend son réseau\xa05G à 50\xa0marchés',
 'Clins d’œil technologiques',
 'TikTok se conformera à la nouvelle réglementation de Pékin sur les exportations',
 'Manifestations à Kenosha: Zuckerberg reconnaît que Facebook a erré',
 'Apple interdit son magasin d’applications à Epic Games'
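
Some titles in the output above contain `\xa0`, a non-breaking space. If plain spaces are preferred in the dataset, a small cleaning step could be applied to titles and article texts. The sketch below is one option; `clean_text` is a hypothetical helper that is not part of the notebook.

```python
def clean_text(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    # str.split() without arguments splits on any Unicode whitespace,
    # which includes the non-breaking space '\xa0'.
    return " ".join(text.split())


print(clean_text("Rogers étend son réseau\xa05G à 50\xa0marchés"))
```

It could be applied when a title is stored, for example `titles[sub].append(clean_text(title))`. Accented characters and typographic quotes are left untouched.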

In [8]:
links['techno']

['https://www.lapresse.ca/affaires/techno/2020-09-05/la-minute-tiktok-comment-recycler-ses-soins-dentaires.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-04/le-pentagone-reaffirme-son-choix-de-microsoft-pour-son-megacontrat-de-cloud.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-04/facebook-retire-les-comptes-du-groupe-americain-d-extreme-droite-patriot-prayer.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-04/chine-apple-clarifie-sa-politique-sur-la-liberte-d-expression.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-03/pandemie-amazon-embauchera-7000-personnes-de-plus-au-royaume-uni.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-03/le-compte-twitter-du-premier-ministre-indien-pirate.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-02/verizon-investit-pres-de-2-milliards-supplementaires-dans-la-5g.php',
 'https://www.lapresse.ca/affaires/techno/2020-09-02/covid-19-apple-et-google-integrent-le-tracage-directement-dans-les-telephones.php

In [9]:
articles['techno']

["Des broches sur des décorations de Noël, des leçons sur le langage des signes... Voici quelques perles dénichées sur le réseau social TikTok cette semaine. Nous ne nous ferons pas de cachette : les soins dentaires peuvent être onéreux, surtout lorsqu'on doit y ajouter des traitements orthodontiques. L'histoire d'un jeune homme de 22 ans, originaire du Michigan, est vite devenue virale, grâce au partage d'une tranche de vie un peu embarrassante impliquant ses broches et sa mère. Alors qu'il était assis sur la chaise du dentiste afin de se faire retirer ses broches, sa mère a refusé que le spécialiste les jette à la poubelle. La raison? La facture de 6000\xa0$US qu'elle a dû payer.  Afin de rentabiliser son investissement, la mère du jeune homme a bricolé une décoration de Noël, un sapin, avec les broches. Selon BuzzFeed, plus de 1,4 million de visionnements ont été enregistrés, sans compter les nombreux commentaires de mères qui trouvent l'idée particulièrement bonne.  Une jeune femme

In [10]:
count = 0
for key, value in articles.items(): 
    if isinstance(value, list): 
        count += len(value) 
print(count) 

772
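
The total alone does not show whether the three dictionaries stay aligned. Since each row of the dataset pairs a title, a link and a text by position, it may be worth checking that the lists have the same length in every subsection. The sketch below is a possible check; `misaligned_subsections` is a hypothetical helper and the toy dictionaries stand in for the notebook's `titles`, `links` and `articles`.

```python
def misaligned_subsections(titles, links, articles):
    """Return (subsection, lengths) pairs where the three lists differ in length."""
    problems = []
    for sub in titles:
        lengths = (
            len(titles[sub]),
            len(links.get(sub, [])),
            len(articles.get(sub, [])),
        )
        if len(set(lengths)) != 1:
            problems.append((sub, lengths))
    return problems


# Toy data: 'golf' is missing one article text.
toy_titles = {"techno": ["t1", "t2"], "golf": ["t3", "t4"]}
toy_links = {"techno": ["l1", "l2"], "golf": ["l3", "l4"]}
toy_articles = {"techno": ["a1", "a2"], "golf": ["a3"]}
print(misaligned_subsections(toy_titles, toy_links, toy_articles))
```

An empty result means every subsection has as many texts and links as titles.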


We now put all the data into a DataFrame.

In [11]:
#Map each category label to its list of subsections
categories = {'actualites': sub_act_list,
              'international': sub_int_list,
              'affaires': sub_aff_list,
              'sports': sub_spo_list,
              'culture': sub_cul_arts_list + sub_cul_cinema_list}

#Build one DataFrame per subsection, then concatenate them all
#(DataFrame.append was removed in pandas 2.0, so we use pd.concat)
frames = []
for category, sub_list in categories.items():
    for sub in sub_list:
        df_temp = pd.DataFrame({'title': titles[sub],
                                'content': articles[sub],
                                'link': links[sub]})
        df_temp['category'] = category
        frames.append(df_temp)

#Each subsection keeps its own 0-based index, as before
df_lp = pd.concat(frames)

In [12]:
df_lp.sample(10)

|    | title | content | link | category |
|----|-------|---------|------|----------|
| 8  | Japon: recherches désespérées pour retrouver l... | (Tokyo) Les garde-côtes japonais poursuivaient... | https://www.lapresse.ca/international/asie-et-... | international |
| 7  | Enquête La Presse et le Toronto Star: quand le... | Derrière la faillite à répétition se cache un ... | https://www.lapresse.ca/actualites/enquetes/20... | actualites |
| 9  | La flamme olympique bientôt exposée à Tokyo | (Tokyo) La flamme olympique sera prochainement... | https://www.lapresse.ca/sports/tokyo-2020/2020... | sports |
| 12 | Rentrée scolaire pour les élèves des écoles an... | C’est la rentrée ce matin dans le plus grand c... | https://www.lapresse.ca/actualites/education/2... | actualites |
| 7  | Le Toronto FC privé de son capitaine pour plus... | (Toronto) Le capitaine du Toronto FC Michael B... | https://www.lapresse.ca/sports/soccer/2020-09-... | sports |
| 13 | José Iglesias joue les héros contre les Blue Jays | (Buffalo) José Iglesias et Bryan Holaday ont p... | https://www.lapresse.ca/sports/baseball/2020-0... | sports |
| 11 | Podcast: SiriusXM rachète Stitcher pour 325 mi... | (New York) Le géant américain de la radio en l... | https://www.lapresse.ca/affaires/medias/2020-0... | affaires |
| 6  | Cinq cadavres retrouvés dans une maison d’Osha... | (Oshawa) L’auteur présumé d’un carnage qui a f... | https://www.lapresse.ca/actualites/justice-et-... | actualites |
| 8  | Trudeau admet qu'il y a des retards dans la li... | (Ottawa) Justin Trudeau a admis jeudi qu’il y ... | https://www.lapresse.ca/affaires/economie/2020... | affaires |
| 3  | Courrier des lecteurs: snowbirds, COVID-19 et ... | « Je suis un homme de 65 ans qui travaille enc... | https://www.lapresse.ca/affaires/finances-pers... | affaires |


In [13]:
df_lp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 772 entries, 0 to 19
Data columns (total 4 columns):
title       772 non-null object
content     772 non-null object
link        772 non-null object
category    772 non-null object
dtypes: object(4)
memory usage: 30.2+ KB
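
The summary above reports 772 entries with index labels running only from 0 to 19, which means the index labels repeat across subsections. If a unique index is needed, or if the same article may appear under more than one subsection (an assumption, not something verified here), the rows can be deduplicated and renumbered. The sketch below shows the idea on two toy frames.

```python
import pandas as pd

# Toy frames standing in for two subsections; 'l2' appears in both.
df_a = pd.DataFrame({"title": ["t1", "t2"], "link": ["l1", "l2"]})
df_b = pd.DataFrame({"title": ["t3", "t2"], "link": ["l3", "l2"]})

# Stacking keeps each frame's own labels, so 0 and 1 appear twice.
stacked = pd.concat([df_a, df_b])
print(stacked.index.is_unique)

# Drop rows that share a link, then renumber the rows from 0.
deduplicated = stacked.drop_duplicates(subset="link").reset_index(drop=True)
print(deduplicated)
```

Applied to the notebook's frame, the equivalent call would be `df_lp.drop_duplicates(subset='link').reset_index(drop=True)`.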


We now save the DataFrame to a CSV file.

In [14]:
df_lp.to_csv("articles_lapresse.csv", header=True, index=True)
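
Because the file is written with `index=True`, the index is stored as a first column without a header. The sketch below, which uses an in-memory buffer and toy data instead of the real file, illustrates how that column shows up when the CSV is read back and how `index_col=0` restores it as the index.

```python
import io

import pandas as pd

df = pd.DataFrame({"title": ["a", "b"], "category": ["sports", "culture"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, header=True, index=True)

# Reading without index_col turns the saved index into a regular column.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(list(naive.columns))

# Reading with index_col=0 uses the first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
```

For the file written above, the corresponding call would be `pd.read_csv("articles_lapresse.csv", index_col=0)`.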