## Commons helper: uploads from gencat.cat

This notebook helps users upload images from the [press room](http://premsa.gencat.cat/) at `gencat.cat` to Wikimedia Commons. It uses Python 3.6 features such as f-strings, so it will not run on earlier versions.

You can run this notebook from [PAWS](http://paws.wmflabs.org/) or from your own environment.

#### Prerequisites (for execution from own environment)
- Clone the repository where this notebook is available (it imports some functions from `utils.py`, located in the `wikimedia` folder inside this notebook's parent folder).
- Create a Python 3 virtual environment and activate it.
- Install `pywikibot`:
```bash
pip install pywikibot
```
- Install `mako`:
```bash
pip install mako
```
or:
```bash
conda install mako
```
- Install `beautifulsoup4`:
```bash
pip install beautifulsoup4
```
or:
```bash
conda install beautifulsoup4
```
- Create a properly formatted `user-config.py` file.
- Launch `jupyter notebook` using the kernel associated with the virtual environment.
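
For reference, a minimal `user-config.py` for Wikimedia Commons typically sets the family, the language code, and the account name. The sketch below only renders and writes that text. `render_user_config`, `write_user_config`, and the `ExampleUser` account are hypothetical names used for illustration, and your setup may need further settings.

```python
from pathlib import Path


def render_user_config(username: str) -> str:
    """Return the text of a minimal user-config.py for Wikimedia Commons."""
    return (
        "family = 'commons'\n"
        "mylang = 'commons'\n"
        f"usernames['commons']['commons'] = {username!r}\n"
    )


def write_user_config(directory: str, username: str) -> Path:
    """Write user-config.py into `directory` and return its path."""
    path = Path(directory) / "user-config.py"
    path.write_text(render_user_config(username), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Print the file content instead of writing it, so nothing is overwritten.
    print(render_user_config("ExampleUser"))
```

Calling `write_user_config('.', 'ExampleUser')` would place the file in the current folder, which is where `pywikibot` looks for it by default.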

#### Configuration
This notebook takes all the photographs at a given URL (provided that the page hosts pictures as attachments or inline) and uploads them to Commons, inserting the proper license templates. The following features are extracted automatically:
- **Image name**: The name of each image is taken from the title of the attachment. For inline photographs, the image name is taken from the page title.
- **Image description**: The description is usually the first paragraph in the page.
- **Image date**: The date is extracted from the page date.

**However**, you can override most of them, add categories, or choose which images to upload by editing the `config` dictionary in the notebook:
- `url`: This is where the press note is available. This configuration element is **mandatory**.
- `categories`: Include here as many categories as you want to assign to all images. To categorize a single image, do it afterwards, once it has been uploaded to Commons. If empty, only the automatically detected categories will be added.
- `uploader_category`: If you wish to assign a category for you as uploader, do it here. If empty, no category will be added.
- `title`: Provide your own title if you don't like the extracted one. If you assign a title, it will be used as the name for all the images, with an auto-incremented number appended to tell the photographs apart.
- `pub_date`: Overrides the extracted date. Use the format YYYY-MM-DD (e.g. 2018-12-24).
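- `excluded`: A list with the zero-based indices of the pictures you don't wish to upload. Inline images are appended at the end of the image list.
- `head_picture`: Set to `True` to also upload the inline images of the page.
- `article_content`: Provide your own description if you don't like the extracted one.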
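
Because a malformed `config` only fails late in the notebook, it can help to check it up front. The following is a minimal sketch, not part of the notebook: `validate_config` is a hypothetical helper that checks only the rules stated above (mandatory `url`, `pub_date` in YYYY-MM-DD format, `excluded` as a list of indices).

```python
from datetime import datetime


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    problems = []

    # 'url' is the only mandatory element.
    if not config.get("url"):
        problems.append("'url' is mandatory")

    # 'pub_date' is optional, but must be a real calendar date when given.
    pub_date = config.get("pub_date")
    if pub_date:
        try:
            datetime.strptime(pub_date, "%Y-%m-%d")
        except ValueError:
            problems.append(f"'pub_date' is not a valid YYYY-MM-DD date: {pub_date}")

    # 'excluded' holds positions in the image list, so only non-negative integers make sense.
    excluded = config.get("excluded") or []
    if not all(isinstance(i, int) and i >= 0 for i in excluded):
        problems.append("'excluded' must contain non-negative integer indices")

    return problems
```

Running `validate_config(config)` right after the configuration cell and stopping when the returned list is non-empty avoids downloading images with a configuration that cannot succeed.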

#### To-do list
1. Support file formats other than JPG.
2. Create a generic function for image uploading.
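
One possible starting point for item 1 is to detect the real format from the first bytes of the downloaded file instead of trusting the extension. This is a sketch under assumptions: `detect_image_extension` is a hypothetical helper, and it covers only JPEG, PNG, and GIF signatures.

```python
from typing import Optional

# Leading bytes ("magic numbers") of a few common image formats.
_SIGNATURES = (
    (b'\xff\xd8\xff', 'jpg'),
    (b'\x89PNG\r\n\x1a\n', 'png'),
    (b'GIF87a', 'gif'),
    (b'GIF89a', 'gif'),
)


def detect_image_extension(data: bytes) -> Optional[str]:
    """Return the extension matching the file signature, or None if it is not recognized."""
    for signature, extension in _SIGNATURES:
        if data.startswith(signature):
            return extension
    return None
```

Reading the first 16 bytes of a file is enough for these signatures, so the check can run before the whole image is processed.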

In [None]:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import pywikibot as pb
from pywikibot.specialbots import UploadRobot

import requests
from requests.compat import quote
from bs4 import BeautifulSoup
from mako.template import Template

import os, re
import shutil
import imghdr

commons_site = pb.Site("commons", "commons")

In [None]:
# Path handling for importing utils.py
import sys, inspect
current_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile(inspect.currentframe()))[0]))
folder_parts = current_folder.split(os.sep)
parent_folder = os.sep.join(folder_parts[:-1])

if current_folder not in sys.path:
    sys.path.insert(0, current_folder)
if parent_folder not in sys.path:
    sys.path.insert(0, parent_folder)
    
from wikimedia.utils import is_commons_file, get_hash

In [None]:
# Creation of images folder
cwd = os.getcwd()

images_directory = os.path.join(cwd, 'images')
if not os.path.exists(images_directory):
    os.makedirs(images_directory)

In [None]:
# Configuration
config = {
    'url': 'http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/305550/ca/ports-crea-nou-itinerari-per-vianants-integra-platja-cavaio-amb-port-darenys-mar.do',
    'categories': ["Port d'Arenys",
                   'Platja del Cavaió'],
    'uploader_category': None,
    'head_picture': False,
    'title': None,
    'pub_date': None,
    'article_content': "L’obertura d’un tram del mur de l’espatller permet crear un camí tou que uneix la platja del Cavaió amb el port d’Arenys de Mar, ampliant la xarxa d’itineraris per a vianants i bicicletes dins la zona portuària",
    'excluded': []
}

categories = [category for category in (config['categories'] + [config['uploader_category']]) if category]

In [None]:
# Retrieval of base page for extracting gallery information
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
r = requests.get(config['url'], headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
# Image date
if not config['pub_date']:
    # The page date is expected as "DD-MM-YYYY ..."; keep the date part and reverse it into YYYY-MM-DD
    page_date = soup.find_all("span", attrs={"itemprop": "datePublished"})[0].get_text().strip()
    pub_date = '-'.join(reversed(page_date.split(' ')[0].split('-')))
else:
    pub_date = config['pub_date']
    
pub_date
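
The date handling in the cell above can be read as a small pure function. This sketch assumes the page date looks like `DD-MM-YYYY HH:MM`, which is inferred from the reversal logic rather than confirmed against the site; `to_iso_date` is a hypothetical name.

```python
def to_iso_date(page_date: str) -> str:
    """Convert 'DD-MM-YYYY ...' into 'YYYY-MM-DD' by reversing the date fields."""
    # Drop anything after the first space (assumed to be the time of day).
    day_month_year = page_date.strip().split(' ')[0].split('-')
    return '-'.join(reversed(day_month_year))


if __name__ == "__main__":
    print(to_iso_date("24-12-2018 10:30"))
```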

In [None]:
# Gallery title
if not config['title']:
    title = soup.find_all("h1", class_="FW_headline")[0].get_text().strip().replace('  ', ' ')
else:
    title = config['title']
title = title.replace(':', ' -').replace('  ', ' ')
title

In [None]:
# Image description
if not config['article_content']:
    article_content = soup.find_all("div", class_="FW_article-content")[0].get_text().strip().split('\n')[0]
else :
    article_content = config['article_content']

article_content

In [None]:
template = u"""=={{int:filedesc}}==
{{Information
|description={{ca|1=${description}}}
|date=${date}
|source=[${url} Nota de Premsa - ${title}]
|author=Generalitat de Catalunya
|permission=
|other versions=
}}

=={{int:license-header}}==
{{LicenseReview}}
{{attribution-gencat}}

${cat_string}"""

# Named template_vars to avoid shadowing the built-in vars()
template_vars = {
    "url": config['url'],
    "description": article_content,
    "date": pub_date,
    "title": title,
    "cat_string": '\n'.join(['[[Category:'+i+']]' for i in categories])
}
t = Template(template)
_text = t.render(**template_vars)
_text

In [None]:
image_list = [{"url": image["href"].strip(), "name": image["title"].replace(':', ' -').replace('  ', ' ').strip()} for image in soup.find_all("a", class_="external") if ('.jpg' in image['href'].lower() or '.jpeg' in image['href'].lower())]
image_list

In [None]:
if config['head_picture']:
    image_list.extend([{"url": item["src"], "name": title} for item in soup.find_all("img", class_="FW_object-attached") if item not in soup.find_all("img", class_="FW_object-attached_banner")])
image_list

In [None]:
# Image retrieval and upload to Commons
excluded = config['excluded']

used_names = []
for i, image in enumerate(image_list):
    # If the image is excluded, let's skip it
    if i in excluded:
        print ("Image excluded. Skipping")
        continue
        
    # First, the image is downloaded and locally stored
    image_url = quote(image["url"].encode('utf-8'), ':/')
    image_name = image["name"].replace(':', ' -').replace('  ', ' ') + '.jpg'
    image_path = os.path.join(images_directory, image_name)
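    try: 
        r = requests.get(image_url, headers=headers, stream=True)
        r.raise_for_status()  # do not store an HTTP error page as an image
        r.raw.decode_content = True  # undo gzip/deflate content encoding, if any
        with open(image_path, 'wb') as out_file:
            shutil.copyfileobj(r.raw, out_file)
        # hack for PNG files wrongly given the JPG extension
        if imghdr.what(image_path) == "png":
            # Swap only the extension; a bare str.replace would also hit "jpg" elsewhere in the path
            png_path = os.path.splitext(image_path)[0] + '.png'
            os.rename(image_path, png_path)
            image_path = png_path
            image_name = os.path.splitext(image_name)[0] + '.png'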
    except Exception as e:
        print (e)
        print ('Failed download. Skipping')
        continue

    # If the image is already in Commons, let's skip it
    if is_commons_file(get_hash(image_path)) :
        print ("Image already in commons. Skipping")
        os.remove(image_path)
        continue

    # If the image name is already in commons, find a new name
    if pb.Page(commons_site, image_name, ns=6).exists():
        print (f"Image name ({image_name}) already used in Commons")
        used_names.append(image_name)
        
    while True:
        if image_name in used_names:
            # Finding a new name
            image_subject, image_extension = image_name.rsplit('.', 1)
            # Names already ending in a two-digit counter get that counter incremented
            m = re.match(r'(.*) ([0-9]{2})$', image_subject)
            if m is None:
                image_name = f"{image_subject} 01.{image_extension}"
            else:
                counter = int(m.group(2)) + 1
                image_name = f"{m.group(1)} {counter:02d}.{image_extension}"

            if pb.Page(commons_site, image_name, ns=6).exists():
                print(f"Image name ({image_name}) already used in Commons. Finding a new name")
                used_names.append(image_name)
        else:
            print(f"Preparing to upload image with name {image_name}")
            used_names.append(image_name)
            break

    # image upload
    bot = UploadRobot([image_path],
                      description = _text,
                      useFilename = image_name,
                      keepFilename = True,
                      verifyDescription = False,
                      ignoreWarning = True,
                      targetSite = commons_site)
    bot.run()
    os.remove(image_path)
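
The renaming scheme used in the loop above (append ` 01`, then increment the two-digit counter) can be isolated and tested without touching Commons. This is a sketch, not the notebook's code: `next_name` and `first_free_name` are hypothetical helpers, and `is_taken` stands in for the `pb.Page(...).exists()` and `used_names` checks.

```python
import re

# "<subject> NN.<extension>", where NN is a two-digit counter.
_NUMBERED = re.compile(r'(.*) ([0-9]{2})\.([A-Za-z0-9]+)$')


def next_name(image_name: str) -> str:
    """Return the next candidate name: add ' 01' or increment an existing counter."""
    m = _NUMBERED.match(image_name)
    if m is None:
        subject, extension = image_name.rsplit('.', 1)
        return f"{subject} 01.{extension}"
    counter = int(m.group(2)) + 1
    return f"{m.group(1)} {counter:02d}.{m.group(3)}"


def first_free_name(image_name: str, is_taken) -> str:
    """Keep generating candidates until `is_taken(name)` returns False."""
    while is_taken(image_name):
        image_name = next_name(image_name)
    return image_name


if __name__ == "__main__":
    taken = {"Port d'Arenys.jpg", "Port d'Arenys 01.jpg"}
    print(first_free_name("Port d'Arenys.jpg", taken.__contains__))
```

Keeping the name generation separate from the existence check also makes it easier to reuse in the generic upload function mentioned in the to-do list.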