# Data Scraping

This notebook scrapes the websites of Nicaraguan news outlets for the titles, links, and text of their news articles. It proceeds in two steps. First, article links and titles are collected from each outlet's listing pages. Second, those links are used to scrape the text of each article.

### Package Imports

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from lxml import html
from random import randint
from time import sleep
from time import time
from itertools import cycle
import pickle
import concurrent.futures
import logging
import glob
from newsplease import NewsPlease
from fake_useragent import UserAgent

### Scraping Titles and Links from the Outlets' Main Pages

The following list of dictionaries holds the information needed to loop over each outlet's listing pages. Each entry specifies the name of the news outlet, the base url, the number of pages, the xpath to the article titles, and the xpath to the article links. Some sites only provide relative links; for those, the missing part of the url is stored under `prefix`, which is left empty otherwise. Some sites need two xpaths to retrieve all relevant titles and links; for those, the title and link paths are given as lists.
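
The two entry shapes (one xpath or a list of xpaths) and the optional prefix can be handled uniformly. The following is a minimal sketch, not the notebook's own code: the helpers `as_list` and `full_link` and the path `/nacionales/ejemplo/` are hypothetical, and `urljoin` is swapped in for the plain string concatenation used in the scraper.

```python
from urllib.parse import urljoin


def as_list(path):
    """Return the xpath setting as a list, whether one string or several were given."""
    return list(path) if isinstance(path, (list, tuple)) else [path]


def full_link(prefix, href):
    """Complete a relative href with the outlet's prefix; leave it alone if no prefix is set."""
    return urljoin(prefix, href) if prefix else href


# Hypothetical entries mirroring the two shapes used in `pages`
single = {"prefix": "", "linkpath": "//h2[@class='post-title']/a"}
double = {"prefix": "https://100noticias.com.ni",
          "linkpath": ["//div[@class='col-md-6 m-bottom-10']//a",
                       "//div[@class='col-6 col-md-4']/a"]}

print(as_list(single["linkpath"]))        # one xpath, wrapped in a list
print(len(as_list(double["linkpath"])))   # number of xpaths for this outlet
print(full_link(double["prefix"], "/nacionales/ejemplo/"))  # made-up relative link
```

With both shapes normalised to lists, a scraper could loop over the xpaths instead of hard-coding indices 0 and 1.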

In [None]:
pages = [{'name':'Canal2',
        'url':'https://canal2tv.com/category/nacionales/page/',
        'pages':132,
        'titlepath': "//div[@class='post-container']//a[@class='post-title']/h2",
        'prefix': "",
        'linkpath': "//div[@class='post-container']//a[@class='post-title']"},
        {'name':'Canal4',
        'url':'https://www.canal4.com.ni/nicaragua/page/',
        'pages':1565,
        'titlepath': "//div[@class='tg-col-control']//h3/a",
        'prefix': "",
        'linkpath': "//div[@class='tg-col-control']//h3/a"},
        {'name':'Canal6',
        'url':'https://canal6.com.ni/category/nacionales/page/',
        'pages':307,
        'titlepath': "//figure[@class='figure']//a",
        'prefix': "",
        'linkpath': "//figure[@class='figure']//a"},
        {'name':'Canal10',
        'url':'https://www.canal10.com.ni/category/nacionales/page/',
        'pages':1234,
        'titlepath': "//div[@class='item card-type-a child']//h2/a",
        'prefix': "",
        'linkpath': "//div[@class='item card-type-a child']//h2/a"},
        {'name':'Canal13_politica',
        'url':'https://www.vivanicaragua.com.ni/category/politica/page/',
        'pages':445,
        'titlepath': "//a[@class='card-title']//h3",
        'prefix': "",
        'linkpath': "//a[@class='card-title']"},
        {'name':'Canal13_economia',
        'url':'https://www.vivanicaragua.com.ni/category/economia/page/',
        'pages':363,
        'titlepath': "//a[@class='card-title']//h3",
        'prefix': "",
        'linkpath': "//a[@class='card-title']"},
        {'name':'Canal13_sociales',
        'url':'https://www.vivanicaragua.com.ni/category/sociales/page/',
        'pages':2997,
        'titlepath': "//a[@class='card-title']//h3",
        'prefix': "",
        'linkpath': "//a[@class='card-title']"},
        {'name':'Canal14',
        'url':'https://www.vostv.com.ni/nacionales/?page=',
        'pages':669,
        'titlepath': "//section[@class='secondary-news']//h3",
        'prefix': 'https://www.vostv.com.ni',
        'linkpath': "//section[@class='secondary-news']//div[@class='figure-cap']/a[1]"},
        {'name':'Radio la Primerisima',
        'url':'https://radiolaprimerisima.com/noticias-generales/page/',
        'pages':797,
        'titlepath': "//div[@class='post_title']//a/span[1]",
        'prefix': "",
        'linkpath': "//div[@class='post_title']//a"},
        {'name':'La Nueva Radio Ya',
        'url':'https://nuevaya.com.ni/nacionales/page/',
        'pages':1430,
        'titlepath': "//div[@class='vc_column tdi_52 wpb_column vc_column_container tdc-column td-pb-span9']//h3[@class='entry-title td-module-title']//a",
        'prefix': "",
        'linkpath': "//div[@class='vc_column tdi_52 wpb_column vc_column_container tdc-column td-pb-span9']//h3[@class='entry-title td-module-title']//a"},
        {'name':'Radio 800',
        'url':'https://radio800ni.com/category/nacionales/page/',
        'pages':81,
        'titlepath': "//h2[@class='post-title']/a",
        'prefix': "",
        'linkpath': "//h2[@class='post-title']/a"},
        {'name':'Radio Nicaragua',
        'url':'https://radionicaragua.com.ni/category/nacionales/page/',
        'pages':2161,
        'titlepath': "//figcaption/a/h2",
        'prefix': "",
        'linkpath': "//figcaption/a"},
        {'name':'Radio Corporacion_nacional',
        'url':'https://radio-corporacion.com/blog/archivos/category/nacional/page/',
        'pages':584,
        'titlepath': "//h3[@class='mh-loop-title']/a",
        'prefix': "",
        'linkpath': "//h3[@class='mh-loop-title']/a"},
        {'name':'Radio Corporacion_politica',
        'url':'https://radio-corporacion.com/blog/archivos/category/politica/page/',
        'pages':264,
        'titlepath': "//h3[@class='mh-loop-title']/a",
        'prefix': "",
        'linkpath': "//h3[@class='mh-loop-title']/a"},
        {'name':'Radio Corporacion_eco',
        'url':'https://radio-corporacion.com/blog/archivos/category/eco/page/',
        'pages':116,
        'titlepath': "//h3[@class='mh-loop-title']/a",
        'prefix': "",
        'linkpath': "//h3[@class='mh-loop-title']/a"},
        {'name':'Confidencial_politica',
        'url':'https://www.confidencial.com.ni/politica/page/',
        'pages':355,
        'titlepath': "//h2[@class='archive-titles']/a",
        'prefix': "",
        'linkpath': "//h2[@class='archive-titles']/a"},
        {'name':'Confidencial_economia',
        'url':'https://www.confidencial.com.ni/economia/page/',
        'pages':168,
        'titlepath': "//h2[@class='archive-titles']/a",
        'prefix': "",
        'linkpath': "//h2[@class='archive-titles']/a"},
        {'name':'Confidencial_nacion',
        'url':'https://www.confidencial.com.ni/nacion/page/',
        'pages':637,
        'titlepath': "//h2[@class='archive-titles']/a",
        'prefix': "",
        'linkpath': "//h2[@class='archive-titles']/a"},
        {'name':'100% Noticias_nacionales',
        'url':'https://100noticias.com.ni/nacionales/?page=',
        'pages':747,
        'titlepath': ["//div[@class='col-md-6 m-bottom-10']//a//h5", "//div[@class='col-6 col-md-4']/a//h5"],
        'prefix' : "https://100noticias.com.ni",
        'linkpath': ["//div[@class='col-md-6 m-bottom-10']//a", "//div[@class='col-6 col-md-4']/a"]},
        {'name':'100% Noticias_economia',
        'url':'https://100noticias.com.ni/economia/?page=',
        'pages':73,
        'titlepath': ["//div[@class='col-md-6 m-bottom-10']//a//h5", "//div[@class='col-6 col-md-4']/a//h5"],
        'prefix' : "https://100noticias.com.ni",
        'linkpath': ["//div[@class='col-md-6 m-bottom-10']//a", "//div[@class='col-6 col-md-4']/a"]},
        {'name':'100% Noticias_politica',
        'url':'https://100noticias.com.ni/politica/?page=',
        'pages':114,
        'titlepath': ["//div[@class='col-md-6 m-bottom-10']//a//h5", "//div[@class='col-6 col-md-4']/a//h5"],
        'prefix': "https://100noticias.com.ni",
        'linkpath': ["//div[@class='col-md-6 m-bottom-10']//a", "//div[@class='col-6 col-md-4']/a"]}]

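The following function scrapes the titles and links of a single outlet, page by page, and pickles them as `[linklist, titlelist]` in `{name}-links-titles.pkl`.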

In [None]:
def scrape_articles(outlet):
    
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    linklist = []
    titlelist = []
    baseurl = outlet['url']
    name = outlet['name']
    
    logger.info(f'Working on {name}.')
    
    # start looping through pages
    for i in range(1, outlet['pages']+1):
        # report on status at every ten pages
        if i % 10 == 0: logger.info(f"Status {name}: page {i}")
        try:
            url = baseurl+str(i)
            source = requests.get(url, headers=headers).text
            tree = html.fromstring(source)
            if isinstance(outlet['titlepath'], list):
                links = [outlet['prefix'] + l.attrib['href'] for l in (tree.xpath(outlet["linkpath"][0]) + tree.xpath(outlet["linkpath"][1]))]
                titles = [l.text for l in (tree.xpath(outlet["titlepath"][0]) + tree.xpath(outlet["titlepath"][1]))]
            else:
                links = [outlet['prefix'] + l.attrib['href'] for l in tree.xpath(outlet["linkpath"])]
                titles = [l.text for l in tree.xpath(outlet["titlepath"])]
            # add this page's results to the running lists
            linklist.extend(links)
            titlelist.extend(titles)
        except Exception as e:
            logger.error(f"Error with {name} at page {i}:")
            logger.error(e)
        sleep(randint(3, 6))
    
    combined = [linklist, titlelist]
    with open(f'{name}-links-titles.pkl', 'wb') as f:
        pickle.dump(combined, f)
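
Each outlet's results are stored as a two-element list, `[linklist, titlelist]`. As a minimal sketch of how such a file can be read back and paired up, the example below uses made-up links and titles and a temporary file standing in for a real `-links-titles.pkl` output.

```python
import os
import pickle
import tempfile

# Made-up data in the same layout that scrape_articles pickles: [links, titles]
combined = [["https://example.com/a", "https://example.com/b"],
            ["First title", "Second title"]]

# Write to a temporary directory instead of the working directory
path = os.path.join(tempfile.mkdtemp(), "Example-links-titles.pkl")
with open(path, "wb") as f:
    pickle.dump(combined, f)

with open(path, "rb") as f:
    links, titles = pickle.load(f)

# zip pairs each link with its title; it stops at the shorter list, so a
# length mismatch between the two xpath results would silently drop items
records = [{"link": link, "title": title} for link, title in zip(links, titles)]
print(records)
```

Comparing `len(links)` with `len(titles)` before pairing is a cheap check that the title and link xpaths matched the same elements.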

In [None]:
#Creating and Configuring Logger
Log_Format = "%(levelname)s %(asctime)s - %(message)s"

logging.basicConfig(filename = f"logfile.log",
                    filemode = "w",
                    format = Log_Format, 
                    level = logging.INFO,
                    force = True)
logger = logging.getLogger()

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(scrape_articles, [pages[8]])

In [None]:
# merge individual files into df and save to csv
# (DataFrame.append was removed in pandas 2.0, so the frames are collected and concatenated once)
frames = []
for file in glob.glob("*.pkl"):
    with open(file, "rb") as f:
        articles = pickle.load(f)
    name = file.split("-")[0]
    frames.append(pd.DataFrame({'page': name, 'link': articles[0], 'title': articles[1]}))
# date and text are filled in later, so they start out empty
df = pd.concat(frames).reindex(columns = ['page', 'date', 'title', 'text', 'link'])
df.to_csv('data.csv', sep= ";")

In [None]:
# unifying individual categories into a single pickle file
with open("Radio Corporacion_eco-links-titles.pkl", "rb") as f:
    articles1 = pickle.load(f)
with open("Radio Corporacion_nacional-links-titles.pkl", "rb") as f:
    articles2 = pickle.load(f)
with open("Radio Corporacion_politica-links-titles.pkl", "rb") as f:
    articles3 = pickle.load(f)

# combine the link lists and the title lists of the three categories
articles = [articles1[0] + articles2[0] + articles3[0],
            articles1[1] + articles2[1] + articles3[1]]

# saved without the '-links-titles' suffix that outlets_instructions expects,
# so the file has to be renamed before the text scraper can use it
with open('Radio Corporacion.pkl', 'wb') as f:
    pickle.dump(articles, f)
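
The same merge could be written generically by grouping files on the outlet name that precedes the underscore. This is a sketch on in-memory data rather than the approach taken in this notebook; the helper `merge_by_outlet` and the sample contents are hypothetical.

```python
from collections import defaultdict


def merge_by_outlet(files):
    """Merge {filename: [links, titles]} entries whose names share an outlet prefix."""
    merged = defaultdict(lambda: [[], []])
    for filename, (links, titles) in files.items():
        # 'Radio Corporacion_eco-links-titles.pkl' -> 'Radio Corporacion'
        outlet = filename.split("-links-titles")[0].split("_")[0]
        merged[outlet][0].extend(links)
        merged[outlet][1].extend(titles)
    return dict(merged)


# Made-up contents standing in for the unpickled category files
files = {
    "Radio Corporacion_eco-links-titles.pkl": [["link1"], ["title1"]],
    "Radio Corporacion_nacional-links-titles.pkl": [["link2"], ["title2"]],
    "Canal2-links-titles.pkl": [["link3"], ["title3"]],
}
print(merge_by_outlet(files))
```

Grouping on the name avoids one hand-written cell per outlet for Canal13, Confidencial, and 100% Noticias, which are split into categories in the same way.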

### Using newsplease to scrape article text and date

In [1]:
# scraping instructions per outlet, designed for multithreading (one thread per outlet).
# 'xpath': None means the text is extracted with news-please; otherwise the given xpath is used.
outlets_instructions = [{'name': 'Canal2-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Canal6-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Canal10-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Canal13-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Confidencial-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Radio Corporacion-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Radio la Primerisima-links-titles.pkl',
                         'xpath': None},
                        {'name': 'Radio Nicaragua-links-titles.pkl',
                         'xpath': None},
                        {'name': '100% Noticias-links-titles.pkl',
                         'xpath': "//div[@class='story-body']"},
                        {'name': 'Radio 800-links-titles.pkl',
                         'xpath': "//div[@class='post-body padd-top']/p"},
                        {'name': 'Canal14-links-titles.pkl',
                         'xpath': "//div[@class='story-body']"},
                        {'name': 'Canal4-links-titles.pkl',
                         'xpath': "//span[@style='color: #000000;']"}]

# copy of the instructions above for a second scraper run that only collects the date;
# it reads the *_full.pkl files written by scrape_text
outlets_instructions1 = [{'name': 'Canal2-links-titles.pkl_full.pkl',
                         'xpath': None},
                        {'name': 'Canal10-links-titles.pkl_full.pkl',
                         'xpath': "//div[@class='date']"},
                        {'name': 'Confidencial-links-titles.pkl_full.pkl',
                         'xpath': None},
                        {'name': 'Radio Corporacion-links-titles.pkl_full.pkl',
                         'xpath': None},
                        {'name': 'Radio la Primerisima-links-titles.pkl_full.pkl',
                         'xpath': None},
                        {'name': 'Radio Nicaragua-links-titles.pkl_full.pkl',
                         'xpath': None},
                        {'name': '100% Noticias-links-titles.pkl_full.pkl',
                         'xpath': "//div[@class='story-meta top-meta text-center']/span[2]"},
                        {'name': 'Canal14-links-titles.pkl_full.pkl',
                         'xpath': "//ul[@class='story-meta m-bottom-20']/li[2]"},
                        {'name': 'Canal4-links-titles.pkl_full.pkl',
                         'xpath': None}]

def scrape_text(outlet):
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    text_list = []
    with open(outlet["name"], "rb") as f:
        articles = pickle.load(f)
    logger.info(f'Working on {outlet["name"]}.')
    if outlet['xpath'] is None:
        for url in articles[0]:
            try:
                art = NewsPlease.from_url(url)
                text = art.maintext
                text_list.append(text)
            except Exception as e:
                logger.error(f"{outlet['name']}-{url}: {e}")
                text_list.append(None)
            sleep(randint(3, 6))
    else:
        for url in articles[0]:
            try:
                source = requests.get(url, headers=headers).text
                tree = html.fromstring(source)
                text = " ".join([l.text_content() for l in tree.xpath(outlet['xpath'])])
                text_list.append(text)
            except Exception as e:
                logger.error(f"{outlet['name']}-{url}: {e}")
                text_list.append(None)
            sleep(randint(3, 6))
    combined = [articles[0], articles[1], text_list]
    with open(f'{outlet["name"]}_full.pkl', 'wb') as f:
        pickle.dump(combined, f)

def scrape_date(outlet):
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    date_list = []
    with open(outlet["name"], "rb") as f:
        articles = pickle.load(f)
    logger.warning(f'Working on {outlet["name"]}.')
    if outlet['xpath'] is None:
        for url in articles[0]:
            logger.warning(f'{outlet["name"]}: {url}.')
            try:
                art = NewsPlease.from_url(url)
                date = art.date_publish
                date_list.append(date)
            except Exception as e:
                logger.error(f"{outlet['name']}-{url}: {e}")
                date_list.append(None)
            sleep(randint(3, 6))
    else:
        for url in articles[0]:
            logger.warning(f'{outlet["name"]}: {url}.')
            try:
                source = requests.get(url, headers=headers).text
                tree = html.fromstring(source)
                date = " ".join([l.text_content() for l in tree.xpath(outlet['xpath'])])
                date_list.append(date)
            except Exception as e:
                logger.error(f"{outlet['name']}-{url}: {e}")
                date_list.append(None)
            sleep(randint(3, 6))
    combined = [articles[0], articles[1], articles[2], date_list]
    with open(f'{outlet["name"]}_full.pkl', 'wb') as f:
        pickle.dump(combined, f)

### Scraping Canal13 Separately with Multithreading

In [2]:
# this one was made for a single outlet.
# Multithreading should be used here to go through a couple of urls at the same time.
def scrape_13(page):
    if 'text' not in canal13_list[page]:
        if page % 50 == 0: logger.warning(f"Status: url {page}")
        url = canal13_list[page]['url']
        try:
            art = NewsPlease.from_url(url)
            text = art.maintext
            canal13_list[page].update({"text": text})
        except Exception as e:
            # no 'text' key is stored on failure, so the url is retried on the next run
            logger.error(f"{url}: {e}")
        sleep(randint(3, 6))

In [None]:
# transforming to list of dicts
with open('Canal13-links-titles.pkl', "rb") as f:
    articles = pickle.load(f)

canal13_list = []
for i in range(len(articles[0])):
    canal13_list.append({"url":articles[0][i], "title": articles[1][i]})
    
with open("Canal13-dict.pkl", "wb") as f:
    pickle.dump(canal13_list, f)

In [3]:
# this will load the canal13 file into memory, work on urls without text,
# and save the progress when the run finishes or is interrupted
with open("Canal13-dict.pkl", "rb") as f:
    canal13_list = pickle.load(f)

Log_Format = "%(levelname)s %(asctime)s - %(message)s"

logging.basicConfig(filename = "log_canal13.log",
                    format = Log_Format, 
                    level = logging.WARNING,
                    force = True)
logger = logging.getLogger()
try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(scrape_13, range(len(canal13_list)))
except KeyboardInterrupt:
    # report the first url that still has no text
    for i in range(len(canal13_list)):
        if 'text' not in canal13_list[i]:
            logger.warning(f"Interrupted at {i}")
            break
finally:
    # without this, a run that completes normally would never be written to disk
    with open("Canal13-dict.pkl", "wb") as f:
        pickle.dump(canal13_list, f)
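
The skip-if-done check that makes the Canal13 run resumable can be tried out without network access. In this sketch the hypothetical `fake_fetch` stands in for `NewsPlease.from_url`, and the records are made up; only the check for an existing `text` key and the thread pool mirror the notebook.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up records; the second one already has its text from an earlier run
records = [{"url": "u0"}, {"url": "u1", "text": "done earlier"}, {"url": "u2"}]
fetched = []  # urls that were actually requested in this run


def fake_fetch(url):
    """Hypothetical stand-in for the real download."""
    fetched.append(url)
    return f"text of {url}"


def scrape_one(index):
    record = records[index]
    if "text" in record:  # already scraped, nothing to do
        return
    record["text"] = fake_fetch(record["url"])


with ThreadPoolExecutor(max_workers=2) as executor:
    # list() drains the lazy map iterator, so any worker exception is raised here
    list(executor.map(scrape_one, range(len(records))))

print(sorted(fetched))                     # only the urls that lacked text
print(all("text" in r for r in records))   # every record is now complete
```

Because finished records are skipped, rerunning the cell after an interruption only requests the urls that are still missing.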

In [4]:
# checking which articles are still missing the text
with open("Canal13-dict.pkl", "rb") as f:
    canal13_list = pickle.load(f)
for i in reversed(range(len(canal13_list))):
    if 'text' not in canal13_list[i]:
        print(i)

41765
41764


### Executing Scraper

In [54]:
# running scraper with logging file without multithreading

Log_Format = "%(levelname)s %(asctime)s - %(message)s"

logging.basicConfig(filename = f"logfile.log",
                    filemode = "w",
                    format = Log_Format, 
                    level = logging.WARNING,
                    force = True)
logger = logging.getLogger()

for x in outlets_instructions1:
    scrape_date(x)

working on Canal2-links-titles.pkl_full.pkl
work on https://canal2tv.com/nacionales/covid-19-minsa-recuperacion-nicaraguenses/
work on https://canal2tv.com/nacionales/murillo-dios-ruben-sandino-gesta/
work on https://canal2tv.com/nacionales/ortega-razonen-hecatombe-planeta/
work on https://canal2tv.com/nacionales/estudiantes-secundaria-campo-recibiran-becas/
work on https://canal2tv.com/nacionales/lluvias-provocadas-frente-frio/


KeyboardInterrupt: 

In [3]:
# running scraper with logging file with multithreading

Log_Format = "%(levelname)s %(asctime)s - %(message)s"

logging.basicConfig(filename = f"logfile.log",
                    filemode = "w",
                    format = Log_Format, 
                    level = logging.WARNING,
                    force = True)
logger = logging.getLogger()

with concurrent.futures.ThreadPoolExecutor(max_workers=9) as executor:
    executor.map(scrape_date, outlets_instructions1)
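
`executor.map` returns its results lazily, so an exception raised inside a worker (for example a missing pickle file, which `scrape_date` opens outside its `try` block) only surfaces if the results are iterated. One alternative, sketched here with a hypothetical `work` function and made-up file names in place of `scrape_date` and its instructions, is to use `submit` with `as_completed` and record each outcome.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(name):
    """Hypothetical worker: fails for one input, like a missing pickle file would."""
    if name == "missing.pkl":
        raise FileNotFoundError(name)
    return f"{name} done"


names = ["a.pkl", "missing.pkl", "b.pkl"]
results, failures = {}, {}

with ThreadPoolExecutor(max_workers=3) as executor:
    # Map each future back to the input it was created for
    futures = {executor.submit(work, n): n for n in names}
    for future in as_completed(futures):
        name = futures[future]
        try:
            results[name] = future.result()  # re-raises the worker's exception
        except Exception as e:
            failures[name] = e

print(sorted(results))
print(sorted(failures))
```

In the notebook, the `failures` entries could be written to the log so that an outlet that never started does not go unnoticed.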

### Trying out xpaths

In [34]:
# scratch cell for testing an xpath on a single article
url = "https://www.canal10.com.ni/noticia/violencia-sexual-en-el-vinculo-matrimonial/"
source = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
tree = html.fromstring(source)
text = "".join([l.text_content() for l in tree.xpath("//div[@class='date']")])
text

'\n              Tuesday 03 de August 2021            '
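
The xpath returns the date as a padded string that mixes English day and month names with the Spanish "de". One way to turn it into a `datetime` is sketched below, under the assumption that every Canal10 date follows this pattern (only the single example above is confirmed); `parse_canal10_date` is a hypothetical helper. A month lookup table is used instead of `strptime` with `%B`, whose month names depend on the locale.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}


def parse_canal10_date(raw):
    """Parse strings like 'Tuesday 03 de August 2021'; return None if they do not match."""
    match = re.search(r"(\d{1,2}) de (\w+) (\d{4})", raw)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name)
    if month is None:  # unknown month name, e.g. a Spanish one
        return None
    return datetime(int(year), month, int(day))


raw = '\n              Tuesday 03 de August 2021            '
print(parse_canal10_date(raw))
```

Returning `None` for unmatched strings keeps the result compatible with the `None` placeholders that `scrape_date` appends on errors.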

In [42]:
url = "https://radionicaragua.com.ni/artistas-y-escritores-rinden-homenaje-a-sandino-siempre-mas-alla/"
art = NewsPlease.from_url(url)
date = art.date_publish
date

datetime.datetime(2022, 2, 21, 0, 0)