## Commons helper: uploads from diario.madrid.es

This notebook helps users upload images from the **Diario de Madrid** [press room](https://diario.madrid.es/blog/notas-de-prensa/) to Wikimedia Commons. It uses Python 3.6 features such as f-strings, so it won't run on earlier versions.

You can run this notebook from [PAWS](http://paws.wmflabs.org/) or from your own environment.

#### Prerequisites:
- Clone the repository that contains this notebook (it imports some functions from `utils.py`, located in a folder inside this notebook's parent folder).
- Create a Python 3 virtual environment and activate it.
- Install `pywikibot`:
```bash
pip install pywikibot
```
- Install `mako`:
```bash
pip install mako
```
or:
```bash
conda install mako
```
- Install `beautifulsoup4`:
```bash
pip install beautifulsoup4
```
or:
```bash
conda install beautifulsoup4
```
- Create a properly formatted `user-config.py` file.
- Launch `jupyter notebook` using the kernel associated with the virtual environment.
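
For the `user-config.py` step, a minimal file for Commons usually needs only three settings. The sketch below writes such a file from Python. `write_user_config` and `ExampleUser` are illustrative names that are not part of this repository, and the file contents assume the standard Pywikibot configuration variables (`family`, `mylang`, `usernames`).

```python
from pathlib import Path

# Minimal Pywikibot settings for Wikimedia Commons (assumed layout).
CONFIG_TEMPLATE = (
    "family = 'commons'\n"
    "mylang = 'commons'\n"
    "usernames['commons']['commons'] = '{username}'\n"
)


def write_user_config(folder, username):
    """Write a minimal user-config.py into `folder` and return its path."""
    path = Path(folder) / 'user-config.py'
    path.write_text(CONFIG_TEMPLATE.format(username=username), encoding='utf-8')
    return path
```

Calling `write_user_config('.', 'ExampleUser')` from the folder where the notebook runs would create the file next to it; replace the user name with your own Commons account.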

#### To-do list
1. Align with remaining tools (config)
2. Finish the documentation of this notebook
3. Support file formats other than JPG
4. Create a generic function for image uploading
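
As a possible starting point for item 3, the file extension could be read from the image URL instead of being hard-coded. This is only a sketch: `extension_from_url` and `KNOWN_EXTENSIONS` are hypothetical names, and the list of accepted formats is an assumption.

```python
import os
from urllib.parse import urlparse

# Hypothetical whitelist; adjust to the formats actually found in the galleries.
KNOWN_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.tif', '.tiff'}


def extension_from_url(image_url, default='.jpg'):
    """Return the lowercase file extension found in the URL path."""
    # urlparse drops the query string, so '?w=300' does not pollute the result
    path = urlparse(image_url).path
    extension = os.path.splitext(path)[1].lower()
    return extension if extension in KNOWN_EXTENSIONS else default
```

Falling back to `.jpg` keeps the current behaviour for URLs that carry no recognisable extension.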

In [None]:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import pywikibot as pb
from pywikibot.specialbots import UploadRobot

import requests
from requests.compat import quote
from bs4 import BeautifulSoup
from mako.template import Template

import os, re
import shutil
import calendar

commons_site = pb.Site("commons", "commons")

In [None]:
# Path handling for importing utils.py
import sys, inspect
current_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile(inspect.currentframe()))[0]))
folder_parts = current_folder.split(os.sep)
parent_folder = os.sep.join(folder_parts[:-1])

if current_folder not in sys.path:
    sys.path.insert(0, current_folder)
if parent_folder not in sys.path:
    sys.path.insert(0, parent_folder)
    
from wikimedia.utils import is_commons_file, get_hash

In [None]:
# Creation of images folder
cwd = os.getcwd()

images_directory = os.path.join(cwd, 'images')
if not os.path.exists(images_directory):
    os.makedirs(images_directory)

In [None]:
#### User input:
url = 'https://diario.madrid.es/blog/notas-de-prensa/el-ayuntamiento-homenajea-con-una-calle-a-soledad-cazorla-impulsora-de-la-ley-contra-la-violencia-de-genero/'

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
headers = {'User-Agent' : user_agent}

#### User input:
categories = ['Calle de Soledad Cazorla, Madrid',
              'Francisca Sauquillo',
              'Celia Mayer',
              'Purificación Causapié',
              'Carlos Mato',
              'Mar Espinar']

In [None]:
#### User input:
upload_categories = []
categories = categories + upload_categories
categories

In [None]:
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
title = soup.find_all("h1", class_="post-title")[0].get_text().strip().replace('  ', ' ')
title = title.replace(':', ' -').replace('  ', ' ')
title
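
The title clean-up above can also be written as a small reusable function. This sketch mirrors the cell (strip, replace colons, squeeze spaces) but collapses any run of whitespace with a regular expression; `clean_title` is an illustrative name.

```python
import re


def clean_title(raw_title):
    """Normalize a post title so it can be reused as a file name stem."""
    title = raw_title.strip().replace(':', ' -')
    # Collapse any run of whitespace (including newlines) into one space
    return re.sub(r'\s+', ' ', title)
```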

In [None]:
date = '-'.join(soup.find_all("div", class_="post-date")[0].get_text().strip().split('/')[::-1])
#date = ""
year = date.split('-')[0]
month = calendar.month_name[int(date.split('-')[1])]
date
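
The cell above reverses the date shown on the page to obtain a year-first string. Assuming the page shows dates as `dd/mm/yyyy` (which the reversal suggests), the same values can be obtained with `datetime`, which also rejects malformed input. `parse_post_date` is a hypothetical helper name.

```python
import calendar
from datetime import datetime


def parse_post_date(raw):
    """Turn 'dd/mm/yyyy' into (ISO date, year, English month name)."""
    # strptime raises ValueError if the text is not a valid dd/mm/yyyy date
    parsed = datetime.strptime(raw.strip(), '%d/%m/%Y')
    iso_date = parsed.strftime('%Y-%m-%d')
    return iso_date, str(parsed.year), calendar.month_name[parsed.month]
```

The month name feeds the `[[Category:<month> <year> in Madrid]]` line of the file description, so it must be in English.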

In [None]:
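body = soup.find_all("div", class_="post-content")
p_description = body[0].find_all("p")
# Use the first paragraph longer than 10 characters as the description;
# fall back to an empty string so the variable is always defined
description = ''
for p in p_description:
    if len(p.get_text()) > 10:
        description = p.get_text()
        break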
#description=u'Una delegación del Ayuntamiento de Madrid asiste a la manifestación de Barcelona en contra de los atentados terroristas, presidida por Manuela Carmena, y con representación de todos los grupos políticos municipales. Esta representación municipal ha estado compuesta por la portavoz del gobierno municipal, Rita Maestre: el delegado de Coordinación Territorial y Cooperación Público-Social y concejal del Grupo Municipal Ahora Madrid, Nacho Murgui; la concejala y delegada del portavoz del Grupo Municipal Partido Popular, Isabel Rosell; el concejal y delegado de la portavoz del Grupo Municipal Socialista, Ignacio de Benito y la portavoz del Grupo Municipal Ciudadanos-Partido de la Ciudadanía, Begoña Villacís.\n"Madrid como ciudad está aquí, con Barcelona, por solidaridad y por la necesidad de expresar la protesta que significan los actos de esta barbarie, para hacer posible que no se vuelvan a repetir", ha dicho la alcaldesa a su llegada a la capital catalana.'
description

In [None]:
template = u"""=={{int:filedesc}}==
{{Information
|description={{es|1=${description}}}
|date=${date}
|source=[${url} Diario de Madrid - ${title}]
|author=[https://diario.madrid.es/ Diario de Madrid]
|permission=[https://diario.madrid.es/contenidos-libres/ License information for all contents in diario.madrid.es]
|other versions=
}}

=={{int:license-header}}==
{{Diario de Madrid}}

[[Category:Images from Ayuntamiento de Madrid (to classify)]]
[[Category:${month} ${year} in Madrid]]
${cat_string}"""

# Named template_vars to avoid shadowing the built-in vars()
template_vars = {
    "url": url,
    "description": description,
    "year": year,
    "month": month,
    "date": date,
    "title": title,
    "cat_string": '\n'.join(['[[Category:'+i+']]' for i in categories])
}
t = Template(template)
_text = t.render(**template_vars)
_text
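
The `cat_string` value is plain wikitext with one category link per line. A slightly more defensive variant, sketched below, skips empty entries and duplicates, which can appear once `upload_categories` is appended to `categories`. `build_category_block` is an illustrative name.

```python
def build_category_block(categories):
    """Return one [[Category:...]] line per unique, non-empty category."""
    seen = []
    for name in categories:
        name = name.strip()
        # Keep the first occurrence only, preserving the original order
        if name and name not in seen:
            seen.append(name)
    return '\n'.join(f'[[Category:{name}]]' for name in seen)
```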

In [None]:
images = [image.a["href"] for image in soup.find_all("div", class_="gallery-icon")]
images

In [None]:
# Image retrieval and upload to Commons
excluded = []

used_names = []
global_counter = 1
for i, image in enumerate(images):
    # If the image is excluded, let's skip it
    if i in excluded:
        print ("Image excluded. Skipping")
        continue
        
    # First, the image is downloaded and locally stored
    image_url = quote(image.encode('utf-8'), ':/')
    image_name = f'{title} {global_counter:02d}.jpg'
    global_counter = global_counter + 1
    image_path = os.path.join(images_directory, image_name)
    try:
        print('Trying download')
        r = requests.get(image_url, headers=headers, stream=True)
        # Fail early on HTTP errors instead of saving an error page as an image
        r.raise_for_status()
        with open(image_path, 'wb') as out_file:
            shutil.copyfileobj(r.raw, out_file)
        print('Image downloaded. Starting upload process')
    except Exception as e:
        print(e)
        print('Failed download. Skipping')
        continue

    # If the image is already in Commons, let's skip it
    if is_commons_file(get_hash(image_path)) :
        print ("Image already in commons. Skipping")
        os.remove(image_path)
        global_counter = global_counter - 1
        continue

    # If the image name is already in commons, find a new name
    if pb.Page(commons_site, image_name, ns=6).exists():
        print (f"Image name ({image_name}) already used in Commons")
        used_names.append(image_name)
        
    # counter stays None unless the name has to be changed
    counter = None
    while True:
        if image_name in used_names:
            # Finding a new name
            image_subject = '.'.join(image_name.split('.')[:-1])
            image_extension = 'jpg'
            pattern = re.compile(r'(.*) ([0-9]{2}\.jpg)')
            m = pattern.match(image_name)
            if m is None:
                counter = 1
                image_name = f"{image_subject} 01.{image_extension}"
            else:
                counter = int(m.group(2)[:2]) + 1
                image_name = f'{m.group(1)} {counter:02d}.{image_extension}'

            if pb.Page(commons_site, image_name, ns=6).exists():
                print(f"Image name ({image_name}) already used in Commons. Finding a new name")
                used_names.append(image_name)
        else:
            print(f"Preparing to upload image with name {image_name}")
            used_names.append(image_name)
            # Resync the global counter only if the name was actually changed
            if counter is not None:
                global_counter = counter + 1
            break

    # image upload
    bot = UploadRobot([image_path],
                      description = _text,
                      useFilename = image_name,
                      keepFilename = True,
                      verifyDescription = False,
                      ignoreWarning = True,
                      targetSite = commons_site)
    bot.run()
    os.remove(image_path)
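
The renaming logic inside the loop can be isolated and tested without touching Commons. The sketch below assumes the set of taken names is already known locally (in the notebook it is built from `pb.Page(...).exists()` lookups), and `next_free_name` is a hypothetical helper rather than part of `utils.py`.

```python
import re

# '<stem> NN.<extension>', where NN is a two-digit counter
SUFFIX = re.compile(r'^(.*) ([0-9]{2})\.([A-Za-z0-9]+)$')


def next_free_name(name, taken):
    """Bump the two-digit suffix of `name` until it is not in `taken`."""
    while name in taken:
        match = SUFFIX.match(name)
        if match is None:
            # No numeric suffix yet: start numbering at 01
            stem, _, extension = name.rpartition('.')
            name = f'{stem} 01.{extension}'
        else:
            stem, number, extension = match.groups()
            name = f'{stem} {int(number) + 1:02d}.{extension}'
    return name
```

Because the extension is captured rather than hard-coded, the same helper would keep working if formats other than JPG are supported later.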