## Commons helper: uploads from gencat.cat

This notebook helps users upload images from the [press room at `gencat.cat`](http://premsa.gencat.cat/).

You can run this notebook from [PAWS](http://paws.wmflabs.org/) or from your own environment.

#### Prerequisites (for running in your own environment)
- Create a Python 3 virtual environment and activate it.
- Install `pywikibot`:
```bash
pip install pywikibot
```
- Install `mako`:
```bash
pip install mako
```
or:
```bash
conda install -c anaconda mako
```
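- Install `requests` and `beautifulsoup4`, which the notebook also imports, in the same way.
- Create a properly formatted `user-config.py` file for `pywikibot`.
- Launch `jupyter notebook`.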

#### Configuration
This notebook takes all the photographs at a given URL (provided that the page hosts pictures as attachments or inline) and uploads them to Commons, inserting the proper license templates. The following features are extracted automatically:
- **Image name**: Each image name is taken from the title of its attachment. For inline photographs, the image name is taken from the page title.
- **Image description**: The description is usually the first paragraph in the page.
- **Image date**: The date is extracted from the page date.

You can override most of these values, add categories, or choose which images to upload by editing the `config` dictionary in the notebook:
- `url`: The address of the press note. This configuration element is **mandatory**.
- `categories`: The categories to assign to every image. To categorise a single image, edit it after it has been uploaded to Commons. If empty, no categories other than the automatically detected ones are added.
- `uploader_category`: A category that identifies you as the uploader. If empty, no such category is added.
- `head_picture`: Set to `True` to also upload the inline images of the page, which are named after the page title.
- `title`: Your own title, in case you don't like the extracted one. A custom title is used as the name of every image, with an auto-incremented number appended to tell the photographs apart.
- `pub_date`: The publication date in `YYYY-MM-DD` format (e.g. `2018-12-24`).
- `article_content`: Your own description, used instead of the extracted first paragraph.
- `excluded`: A list with the zero-based indices of the pictures you don't want to upload. Inline images are appended at the end of the list.
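
As a rough illustration of the constraints listed above, the following sketch checks a `config` dictionary before any request is made. The helper name `validate_config` and its messages are hypothetical and not part of the notebook; only the keys and the `YYYY-MM-DD` format come from the description above.

```python
from datetime import datetime


def validate_config(config):
    """Return a list of problems found in a config dict (hypothetical helper)."""
    problems = []

    # `url` is the only mandatory element.
    if not config.get('url'):
        problems.append("'url' is mandatory")

    # `pub_date`, when given, must parse as a calendar date.
    pub_date = config.get('pub_date')
    if pub_date:
        try:
            datetime.strptime(pub_date, '%Y-%m-%d')
        except ValueError:
            problems.append("'pub_date' must be a valid YYYY-MM-DD date")

    # `excluded` holds zero-based positions in the image list.
    excluded = config.get('excluded') or []
    if not all(isinstance(i, int) and i >= 0 for i in excluded):
        problems.append("'excluded' must contain non-negative integers")

    return problems
```

An empty list means that nothing suspicious was found; a date such as `2018-13-24` is reported because there is no thirteenth month.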

In [1]:
#!/usr/bin/python
# -*- coding: latin-1 -*-

import pywikibot as pb
from pywikibot.specialbots import UploadRobot

import requests
from requests.compat import quote
from bs4 import BeautifulSoup
from mako.template import Template

import re
import shutil
import os
import imghdr

commons_site = pb.Site("commons", "commons")

In [2]:
import sys, inspect
current_folder = os.path.realpath(os.path.abspath(os.path.split(inspect.getfile(inspect.currentframe()))[0]))
folder_parts = current_folder.split(os.sep)
parent_folder = os.sep.join(folder_parts[:-1])

if current_folder not in sys.path:
    sys.path.insert(0, current_folder)
if parent_folder not in sys.path:
    sys.path.insert(0, parent_folder)
    
from wikimedia.utils import is_commons_file, get_hash
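
The `wikimedia.utils` module imported above is not included in this notebook. The sketch below shows what its two helpers might do, under two assumptions: that `get_hash` returns the SHA-1 digest of a local file, and that `is_commons_file` asks the Commons API (`list=allimages` with the `aisha1` parameter) whether a file with that digest already exists. The real implementations may differ, and the `User-Agent` string is a placeholder.

```python
import hashlib
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

COMMONS_API = 'https://commons.wikimedia.org/w/api.php'


def get_hash(path, chunk_size=65536):
    """SHA-1 hex digest of a file, read in chunks (assumed behaviour)."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            sha1.update(chunk)
    return sha1.hexdigest()


def is_commons_file(sha1_hex):
    """True if Commons already hosts a file with this SHA-1 (assumed behaviour)."""
    # list=allimages with aisha1 looks files up by their SHA-1 digest.
    params = urlencode({'action': 'query', 'list': 'allimages',
                        'aisha1': sha1_hex, 'format': 'json'})
    request = Request(COMMONS_API + '?' + params,
                      headers={'User-Agent': 'commons-helper-sketch/0.1'})
    with urlopen(request) as response:
        data = json.load(response)
    return bool(data['query']['allimages'])
```

Comparing digests rather than file names catches duplicates that were uploaded under a different title.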

In [3]:
cwd = os.getcwd()

images_directory = os.path.join(cwd, 'images')
if not os.path.exists(images_directory):
    os.makedirs(images_directory)

In [4]:
# Configuration
config = {
    'url': 'http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/306260/ca/president-quim-torra-insta-els-consellers-lexili-portaveus-clam-llibertat-dels-catalans.do',
    'categories': ['Casa de la República',
                   'May 2018 in Belgium',
                   'Quim Torra in 2018',
                   'Antoni Comín i Oliveres',
                   'Lluís Puig i Gordi',
                   'Meritxell Serret i Aleu'],
    'uploader_category': 'Files uploaded by User:Discasto',
    'head_picture': False,
    'title': None,
    'pub_date': None,
    'article_content': None,
    'excluded': []
}

categories = [category for category in (config['categories'] + [config['uploader_category']]) if category]

In [35]:
# Retrieval of base page for extracting gallery information
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
r = requests.get(config['url'], headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

In [36]:
# Image date
if not config['pub_date']:
    pub_date=soup.find_all("span", attrs={"itemprop": "datePublished"})[0].get_text().strip().split(' ')[0].split('-')
    pub_date.reverse()
    pub_date='-'.join(pub_date)
else:
    pub_date = config['pub_date']
    
pub_date

'2018-05-30'
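
The date handling above can be isolated into a small function. This is a sketch that assumes the page prints the date as `DD-MM-YYYY` followed by a time; the sample string and the name `to_iso_date` are illustrative and not taken from the site.

```python
def to_iso_date(raw):
    """Turn 'DD-MM-YYYY hh:mm' (assumed page format) into 'YYYY-MM-DD'."""
    # Keep only the date token, then reverse day-month-year into year-month-day.
    day_first = raw.strip().split(' ')[0].split('-')
    return '-'.join(reversed(day_first))


print(to_iso_date('30-05-2018 17:45'))  # 2018-05-30
```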

In [37]:
# Gallery title
if not config['title']:
    title = soup.find_all("h1", class_="FW_headline")[0].get_text().strip().replace('  ', ' ')
else:
    title = config['title']
title

'El president Quim Torra insta els consellers a l\'exili a ser "portaveus del clam de llibertat dels catalans"'

In [38]:
# Image description
if not config['article_content']:
    article_content = soup.find_all("div", class_="FW_article-content")[0].get_text().strip().split('\n')[0]
else :
    article_content = config['article_content']

article_content

'El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín, a Waterloo. Autor: Jordi Bedmar\r'
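
The description shown above ends with a carriage return (`\r`), most likely because the page uses `\r\n` line endings while the text is split on `\n` only. One possible way to avoid the stray character, sketched here with a hypothetical helper `first_paragraph`, is to split with `str.splitlines()`, which treats `\r\n` as a single line break:

```python
def first_paragraph(text):
    """Return the first non-empty line of text, without line-ending characters."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return ''


sample = 'First paragraph. Autor: Jordi Bedmar\r\nSecond paragraph.'
print(repr(first_paragraph(sample)))  # 'First paragraph. Autor: Jordi Bedmar'
```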

In [39]:
template = u"""=={{int:filedesc}}==
{{Information
|description={{ca|1=${description}}}
|date=${date}
|source=[${url} Nota de Premsa - ${title}]
|author=Generalitat de Catalunya
|permission=
|other versions=
}}

=={{int:license-header}}==
{{LicenseReview}}
{{attribution-gencat}}

${cat_string}"""

# Named template_vars to avoid shadowing the built-in vars()
template_vars = {
    "url": config['url'],
    "description": article_content,
    "date": pub_date,
    "title": title,
    "cat_string": '\n'.join(['[[Category:'+i+']]' for i in categories])
}
t = Template(template)
_text = t.render(**template_vars)
_text

'=={{int:filedesc}}==\n{{Information\n|description={{ca|1=El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín, a Waterloo. Autor: Jordi Bedmar\r}}\n|date=2018-05-30\n|source=[http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/306260/ca/president-quim-torra-insta-els-consellers-lexili-portaveus-clam-llibertat-dels-catalans.do Nota de Premsa - El president Quim Torra insta els consellers a l\'exili a ser "portaveus del clam de llibertat dels catalans"]\n|author=Generalitat de Catalunya\n|permission=\n|other versions=\n}}\n\n=={{int:license-header}}==\n{{LicenseReview}}\n{{attribution-gencat}}\n\n[[Category:Casa de la República]]\n[[Category:May 2018 in Belgium]]\n[[Category:Quim Torra in 2018]]\n[[Category:Antoni Comín i Oliveres]]\n[[Category:Lluís Puig i Gordi]]\n[[Category:Meritxell Serret i Aleu]]\n[[Category:Files uploaded by User:Discasto]]'

In [40]:
# Attachments: every external link whose target looks like a JPEG file
image_list = [
    {"url": image["href"].strip(),
     "name": image["title"].replace(':', ' -').replace('  ', ' ').strip()}
    for image in soup.find_all("a", class_="external")
    if ('.jpg' in image['href'].lower() or '.jpeg' in image['href'].lower())
]
image_list

[{'url': 'http://www.gencat.cat/big/img/196/BIG_196295520060618_03.jpg',
  'name': 'El president Torra, amb els consellers exiliats Toni Comín i Lluís Puig'},
 {'url': 'http://www.gencat.cat/big/img/324/BIG_324243718060618_03.jpg',
  'name': 'El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín'},
 {'url': 'http://www.gencat.cat/big/img/103/BIG_103313721060618_03.jpg',
  'name': 'Atenció als mitjans'},
 {'url': 'http://www.gencat.cat/big/img/180/BIG_180443109060618_03.jpg',
  'name': 'Atenció als mitjans'}]
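
The name clean-up in the list comprehension above only handles colons and double spaces. MediaWiki also rejects a few other characters in page titles (`# < > [ ] | { }`), so a slightly broader sketch could look as follows. The function name `sanitize_name` and the choice of replacing those characters with a hyphen are assumptions made for illustration.

```python
import re

# Characters MediaWiki does not allow in page titles, plus '/' which
# would also break the local file path used for the download.
FORBIDDEN = re.compile(r'[#<>\[\]|{}/]')


def sanitize_name(raw):
    """Mirror the notebook's clean-up and replace forbidden characters (sketch)."""
    name = raw.replace(':', ' -')
    name = FORBIDDEN.sub('-', name)
    # Collapse any run of whitespace into a single space.
    return re.sub(r'\s+', ' ', name).strip()


print(sanitize_name('Atenció  als mitjans: roda de premsa [1/2]'))
```

Unlike the original `replace('  ', ' ')`, the last step collapses runs of any length, so three consecutive spaces also become one.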

In [41]:
if config['head_picture']:
    image_list.extend([{"url": item["src"], "name": title} for item in soup.find_all("img", class_="FW_object-attached") if item not in soup.find_all("img", class_="FW_object-attached_banner")])
image_list

[{'url': 'http://www.gencat.cat/big/img/196/BIG_196295520060618_03.jpg',
  'name': 'El president Torra, amb els consellers exiliats Toni Comín i Lluís Puig'},
 {'url': 'http://www.gencat.cat/big/img/324/BIG_324243718060618_03.jpg',
  'name': 'El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín'},
 {'url': 'http://www.gencat.cat/big/img/103/BIG_103313721060618_03.jpg',
  'name': 'Atenció als mitjans'},
 {'url': 'http://www.gencat.cat/big/img/180/BIG_180443109060618_03.jpg',
  'name': 'Atenció als mitjans'}]
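
The upload cell below relies on `imghdr`, which has been deprecated since Python 3.11 and was removed in Python 3.13. As a standard-library alternative, the PNG check can be done by comparing the first eight bytes of the file with the PNG signature. This is only a sketch, and `fix_png_extension` is a hypothetical helper name.

```python
import os

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def fix_png_extension(path):
    """Rename path to .png if its content is PNG; return the final path."""
    with open(path, 'rb') as handle:
        is_png = handle.read(8) == PNG_SIGNATURE
    root, ext = os.path.splitext(path)
    if is_png and ext.lower() != '.png':
        new_path = root + '.png'
        os.rename(path, new_path)
        return new_path
    return path
```

Using `os.path.splitext` changes only the extension, so a name that happens to contain "jpg" elsewhere is left alone.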

In [42]:
# Image retrieval and upload to Commons
excluded = config['excluded']

used_names = []
for i, image in enumerate(image_list):
    # If the image is excluded, skip
    if i in excluded:
        print ("Image excluded. Skipping")
        continue
        
    # First, the image is downloaded and stored
    image_url = quote(image["url"].encode('utf-8'), ':/')
    image_name = image["name"].replace(':', ' -').replace('  ', ' ') + '.jpg'
    image_path = os.path.join(images_directory, image_name)
    try: 
        r = requests.get(image_url, headers=headers, stream=True)
        with open(image_path, 'wb') as out_file:
            shutil.copyfileobj(r.raw, out_file)
        # hack for PNG files wrongly given the JPG extension
        if imghdr.what(image_path) == "png":
            # Swap only the extension; str.replace would also alter any other "jpg" in the path
            png_path = os.path.splitext(image_path)[0] + '.png'
            os.rename(image_path, png_path)
            image_path = png_path
            image_name = os.path.splitext(image_name)[0] + '.png'
    except Exception as e:
        print (e)
        print ('Failed download. Skipping')
        continue

    # If the image is already in Commons, skip
    if is_commons_file(get_hash(image_path)) :
        print ("Image already in commons. Skipping")
        os.remove(image_path)
        continue

    # If the image name is already in commons, find a new name
    if pb.Page(commons_site, image_name, ns=6).exists():
        print ("Image name ({0}) already used in Commons".format(image_name))
        used_names.append(image_name)
        
    while True:
        if image_name in used_names :
            # Finding a new name
            # Keep the real extension, which may be .png after the check above
            image_subject, image_extension = os.path.splitext(image_name)
            # A trailing two-digit counter (e.g. "Name 01") is incremented
            m = re.match(r'(.*) ([0-9]{2})$', image_subject)
            if m is None:
                image_name = image_subject + ' 01' + image_extension
            else :
                counter = int(m.group(2)) + 1
                image_name = '{} {:02d}{}'.format(m.group(1), counter, image_extension)

            if pb.Page(commons_site, image_name, ns=6).exists():
                print ("Image name ({0}) already used in Commons. Finding a new name".format(image_name))
                used_names.append(image_name)
        else :
            print ("Preparing to upload image with name {0}".format(image_name))
            used_names.append(image_name)
            break

    # image upload
    bot = UploadRobot([image_path],
                      description = _text,
                      useFilename = image_name,
                      keepFilename = True,
                      verifyDescription = False,
                      ignoreWarning = True,
                      targetSite = commons_site)
    bot.run()
    os.remove(image_path)

Preparing to upload image with name El president Torra, amb els consellers exiliats Toni Comín i Lluís Puig.jpg


The suggested description is:
=={{int:filedesc}}==
{{Information
|description={{ca|1=El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín, a Waterloo. Autor: Jordi Bedmar}}
|date=2018-05-30
|source=[http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/306260/ca/president-quim-torra-insta-els-consellers-lexili-portaveus-clam-llibertat-dels-catalans.do Nota de Premsa - El president Quim Torra insta els consellers a l'exili a ser "portaveus del clam de llibertat dels catalans"]
|author=Generalitat de Catalunya
|permission=
|other versions=
}}

=={{int:license-header}}==
{{LicenseReview}}
{{attribution-gencat}}

[[Category:Casa de la República]]
[[Category:May 2018 in Belgium]]
[[Category:Quim Torra in 2018]]
[[Category:Antoni Comín i Oliveres]]
[[Category:Lluís Puig i Gordi]]
[[Category:Meritxell Serret i Aleu]]
[[Category:Files uploaded by User:Discasto]]
Uploading file to commons:commons...
Upload successful.
Upload of El president Torra, amb els co

Preparing to upload image with name El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín.jpg


The suggested description is:
=={{int:filedesc}}==
{{Information
|description={{ca|1=El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín, a Waterloo. Autor: Jordi Bedmar}}
|date=2018-05-30
|source=[http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/306260/ca/president-quim-torra-insta-els-consellers-lexili-portaveus-clam-llibertat-dels-catalans.do Nota de Premsa - El president Quim Torra insta els consellers a l'exili a ser "portaveus del clam de llibertat dels catalans"]
|author=Generalitat de Catalunya
|permission=
|other versions=
}}

=={{int:license-header}}==
{{LicenseReview}}
{{attribution-gencat}}

[[Category:Casa de la República]]
[[Category:May 2018 in Belgium]]
[[Category:Quim Torra in 2018]]
[[Category:Antoni Comín i Oliveres]]
[[Category:Lluís Puig i Gordi]]
[[Category:Meritxell Serret i Aleu]]
[[Category:Files uploaded by User:Discasto]]
Uploading file to commons:commons...
Sleeping for 4.0 seconds, 2018-06-10 23:21:05
Upload succes

Preparing to upload image with name Atenció als mitjans.jpg


The suggested description is:
=={{int:filedesc}}==
{{Information
|description={{ca|1=El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín, a Waterloo. Autor: Jordi Bedmar}}
|date=2018-05-30
|source=[http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/306260/ca/president-quim-torra-insta-els-consellers-lexili-portaveus-clam-llibertat-dels-catalans.do Nota de Premsa - El president Quim Torra insta els consellers a l'exili a ser "portaveus del clam de llibertat dels catalans"]
|author=Generalitat de Catalunya
|permission=
|other versions=
}}

=={{int:license-header}}==
{{LicenseReview}}
{{attribution-gencat}}

[[Category:Casa de la República]]
[[Category:May 2018 in Belgium]]
[[Category:Quim Torra in 2018]]
[[Category:Antoni Comín i Oliveres]]
[[Category:Lluís Puig i Gordi]]
[[Category:Meritxell Serret i Aleu]]
[[Category:Files uploaded by User:Discasto]]
Uploading file to commons:commons...
Sleeping for 5.1 seconds, 2018-06-10 23:21:14
Upload succes

Image name (Atenció als mitjans.jpg) already used in Commons
Preparing to upload image with name Atenció als mitjans 01.jpg


The suggested description is:
=={{int:filedesc}}==
{{Information
|description={{ca|1=El president Torra, amb els consellers exiliats Meritxell Serret, Lluís Puig i Toni Comín, a Waterloo. Autor: Jordi Bedmar}}
|date=2018-05-30
|source=[http://premsa.gencat.cat/pres_fsvp/AppJava/notapremsavw/306260/ca/president-quim-torra-insta-els-consellers-lexili-portaveus-clam-llibertat-dels-catalans.do Nota de Premsa - El president Quim Torra insta els consellers a l'exili a ser "portaveus del clam de llibertat dels catalans"]
|author=Generalitat de Catalunya
|permission=
|other versions=
}}

=={{int:license-header}}==
{{LicenseReview}}
{{attribution-gencat}}

[[Category:Casa de la República]]
[[Category:May 2018 in Belgium]]
[[Category:Quim Torra in 2018]]
[[Category:Antoni Comín i Oliveres]]
[[Category:Lluís Puig i Gordi]]
[[Category:Meritxell Serret i Aleu]]
[[Category:Files uploaded by User:Discasto]]
Uploading file to commons:commons...
Upload successful.
Upload of Atenció als mitjans 01.jpg suc
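
The renaming loop in the last cell can be condensed into a helper that takes an arbitrary "is this name taken?" predicate, which makes the numbering logic easy to try out without touching Commons. This is a sketch: `next_free_name` and the `is_taken` callback are hypothetical, and in the notebook the callback would combine the `used_names` list with the `pb.Page(...).exists()` check.

```python
import os
import re


def next_free_name(name, is_taken):
    """Append or increment a two-digit counter until is_taken(name) is False."""
    while is_taken(name):
        subject, extension = os.path.splitext(name)
        match = re.match(r'(.*) ([0-9]{2})$', subject)
        if match is None:
            name = '{} 01{}'.format(subject, extension)
        else:
            counter = int(match.group(2)) + 1
            name = '{} {:02d}{}'.format(match.group(1), counter, extension)
    return name


taken = {'Atenció als mitjans.jpg', 'Atenció als mitjans 01.jpg'}
print(next_free_name('Atenció als mitjans.jpg', taken.__contains__))
# Atenció als mitjans 02.jpg
```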