# Landmark dataset URL analysis

The [Google Landmark Recognition Challenge](https://www.kaggle.com/c/landmark-recognition-challenge) doesn't provide a ready-to-download dataset, only a list of URLs. Unfortunately, the source images are quite big: many of them are about 1600x1600 pixels. Downloading smaller versions of those images is often preferable, but first you need to understand how to change the URLs so that the servers return smaller files.

Credit for discovering that the competition URLs have "smaller" versions goes to [dhayalkarsahilr](https://www.kaggle.com/c/landmark-recognition-challenge/discussion/49703).

_NOTE: Kaggle forbids making arbitrary network requests in their kernels. Therefore, outputs from many important cells are not shown. It is advised to run this notebook in an environment where network requests are allowed._

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from pathlib import Path
import zipfile
import pandas as pd
import seaborn as sns
import requests
from PIL import Image
from io import BytesIO
from functools import partial

In [None]:
sns.set_style('whitegrid')
sns.set_context('notebook')

## Obtain the data

Retrieve the data with the [Kaggle CLI](https://github.com/Kaggle/kaggle-api/) and load it into a DataFrame.

_NOTE: Kaggle kernels already have the downloaded and unpacked dataset, so these steps are commented out._

In [None]:
# path = Path('recognition')
path = Path('..')/'input'

In [None]:
# !kaggle competitions download -c landmark-recognition-challenge -p {path}

In [None]:
# for f in path.iterdir():
#     zipfile.ZipFile(f).extractall(path=path)

In [None]:
!ls {path}

In [None]:
!head -5 {path}/train.csv

In [None]:
!head -5 {path}/test.csv

In [None]:
!head -5 {path}/sample_submission.csv

In [None]:
df = pd.read_csv(path/'train.csv')

In [None]:
df.head()

## Basic stats

In [None]:
df.info()

In [None]:
df.apply(lambda col: col.duplicated().any())

There are no missing values. Image IDs and URLs are unique.

In [None]:
df.landmark_id.nunique(), df.landmark_id.min(), df.landmark_id.max()

There are ~15K landmark IDs, represented as 0-based indices.

In [None]:
df.landmark_id.value_counts().reset_index(drop=True).plot(logy=True);

Landmark ID distribution is very imbalanced.

## Domain names

In [None]:
df['url_domain'] = df.url.str.split('/').str[2]

In [None]:
df.url_domain.nunique()

In [None]:
df.url_domain.value_counts(ascending=True).plot.barh(title='Full domains');

In [None]:
df['url_sld'] = df.url_domain.str.split('.').str[-2:].str.join('.')

In [None]:
df.url_sld.value_counts(ascending=True).plot.barh(title='Second-level domains');

Almost all source images come either from Google CDNs or from [Panoramio](https://en.wikipedia.org/wiki/Panoramio) (acquired by Google).

## Analyze URLs

### Group by parts

In order to understand the URL structure, we group URLs by the number of their path parts. For example, `http://example.com/p1/p2/p3` has 3 path parts. Presumably, each such group has a distinct pattern.

In [None]:
# The first 3 slashes belong to protocol spec and domain name.
df['url_path'] = df.url.str.split('/').str[3:].str.join('/')

In [None]:
# Assume that the URL "http://example.com/p1" has 1 path part.
df['url_pathparts'] = df.url_path.str.count('/') + 1
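The counting rule can be checked on a few sample URLs. The sketch below restates it as a standalone function (`count_path_parts` is a name introduced here for illustration, not part of the notebook) and shows one subtlety: a trailing slash adds an empty final part to the count.

```python
def count_path_parts(url):
    """Count path parts the same way as the `url_pathparts` column."""
    # Everything after the third slash is treated as the path.
    path = '/'.join(url.split('/')[3:])
    return path.count('/') + 1

print(count_path_parts('http://example.com/p1'))         # 1
print(count_path_parts('http://example.com/p1/p2/p3'))   # 3
print(count_path_parts('http://example.com/p1/p2/p3/'))  # 4, the trailing slash adds an empty part
```

This is why URLs that end with a slash are counted with one more part than they visibly have.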

In [None]:
df.url_pathparts.value_counts(sort=False)

Paths with 1, 2, 5, 7, 9 and 11 parts are very rare, so analyze them first. Then, analyze paths with 3, 4 and 6 parts.

### Analyze rare URLs

The following functions accept `nparts` as the number of path parts, and work only with the subset of the data that has the specified number of path parts.

_NOTE: In the `listurls` function, the `network` parameter is always forced to `False`. Remove the `network = False` line at the top of the function body when running the notebook on a machine with network access._

In [None]:
def listurls(nparts, network=True, n=None):
    """Print landmark ID, number of items with the same landmark ID, response code and URL."""
    network = False  # NOTE: Remove this line when running the code on a machine with network access.
    print('{:>6} {:>6} {:>6} {}'.format('LMID', 'COUNT', 'RESP', 'URL'))
    df_filtered = df.loc[df.url_pathparts == nparts, ['url', 'landmark_id']]
    if n is not None:
        df_filtered = df_filtered[:n]
    for row in df_filtered.itertuples():
        lid = row.landmark_id
        count = len(df[df.landmark_id == lid])
        if network:
            try:
                code = requests.head(row.url, allow_redirects=True, timeout=1).status_code
            except requests.exceptions.RequestException:
                code = 'FAIL'
        else:
            code = 'SKIP'
        print(f'{lid:>6} {count:>6} {code:>6} {row.url}')

In [None]:
def urlmatches(nparts, regex):
    """True if all URLs match the regex."""
    return df.url_path[df.url_pathparts == nparts].str.match(regex).all()
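One caveat worth keeping in mind: `Series.str.match` follows `re.match` semantics, so the regex is anchored only at the start of the string and a URL with unexpected trailing characters still counts as a match. The sketch below shows the difference with the standard `re` module on made-up sample paths. Recent pandas versions also offer `Series.str.fullmatch` for the stricter check; whether it is available depends on the installed pandas version.

```python
import re

regex = r'[\w-]+(?:%3Dw\d+-h\d+-no)?'
# Made-up sample paths; the last one has an unexpected suffix.
samples = ['abc_123', 'abc_123%3Dw100-h50-no', 'abc_123?unexpected']

# re.match only requires the pattern to match a prefix of the string.
prefix_ok = [re.match(regex, s) is not None for s in samples]
# re.fullmatch requires the pattern to cover the whole string.
full_ok = [re.fullmatch(regex, s) is not None for s in samples]

print(prefix_ok)  # [True, True, True]
print(full_ok)    # [True, True, False]
```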

In [None]:
def listnonmatches(nparts, regex, n=5):
    """Series with URL paths that don't match the regex (limit size with n)."""
    return df.url_path[df.url_pathparts == nparts].pipe(lambda up: up[~up.str.match(regex)][:n])

In [None]:
def urlmatchcounts(nparts, regex, group=0):
    """Extract the regex groups and return counts of identical values for the given group."""
    return df.url_path[df.url_pathparts == nparts].str.extract(regex, expand=True)[group].value_counts()

_NOTE: In the `showimage` function, remove the first line to actually see the images._

In [None]:
def showimage(url):
    """Print the image's (width, height) and display it."""
    print('NO NETWORK'); return None  # NOTE: Remove this line to actually see the images.
    img = Image.open(BytesIO(requests.get(url).content))
    print(img.size)
    return img

#### 1-part URLs

In [None]:
listurls(1, network=False, n=5)

In [None]:
urlmatches(1, r'[\w\d_-]+(?:%3Dw\d+-h\d+-no)?')

Take a single URL and try to change the last bits that look like size parameters.

In [None]:
df.url[df.url_pathparts == 1].iloc[0]

In [None]:
showimage('https://lh3.googleusercontent.com/xV1jw21l7RxwdEkLhKNxBDn0hox29kT2XYPLb3vnfw')

In [None]:
showimage('https://lh3.googleusercontent.com/xV1jw21l7RxwdEkLhKNxBDn0hox29kT2XYPLb3vnfw' + '%3Dw100')

In [None]:
showimage('https://lh3.googleusercontent.com/xV1jw21l7RxwdEkLhKNxBDn0hox29kT2XYPLb3vnfw' + '%3Dh100')

In [None]:
showimage('https://lh3.googleusercontent.com/xV1jw21l7RxwdEkLhKNxBDn0hox29kT2XYPLb3vnfw' + '%3Dw100-h50')

In [None]:
showimage('https://lh3.googleusercontent.com/xV1jw21l7RxwdEkLhKNxBDn0hox29kT2XYPLb3vnfw' + '%3Dw1000-h1000')

It looks like 1-part URLs have the pattern `/<imageid>` or `/<imageid>%3D<sizespec>`, where `%3D` is a URL-encoded `=` and `<sizespec>` is `w<width>-h<height>-no`. When both dimensions are specified, the image is scaled to fit within both of them (the more restrictive one wins), keeping the original aspect ratio. A dimension that is bigger than the original image is ignored.
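The resizing rule described above can be written down as a small model. This is only a sketch of the observed behaviour, not the server's actual algorithm: `fit_size` is a hypothetical helper, the original size `(1600, 1200)` is an assumed example, and the server may round differently.

```python
from fractions import Fraction

def fit_size(orig, w=None, h=None):
    """Predict (width, height) for a `w<width>-h<height>` size spec."""
    ow, oh = orig
    # Keep only the constraints that actually shrink the image;
    # a dimension bigger than the original is ignored.
    scales = []
    if w is not None and w < ow:
        scales.append(Fraction(w, ow))
    if h is not None and h < oh:
        scales.append(Fraction(h, oh))
    if not scales:
        return orig
    # The more restrictive constraint wins; the aspect ratio is preserved.
    scale = min(scales)
    return round(ow * scale), round(oh * scale)

print(fit_size((1600, 1200), w=100))           # (100, 75)
print(fit_size((1600, 1200), w=100, h=50))     # (67, 50)
print(fit_size((1600, 1200), w=2000, h=2000))  # (1600, 1200)
```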

#### 2-part URLs

In [None]:
listurls(2)

The image at the 2-part URL is missing.

#### 5-part URLs

In [None]:
listurls(5)

5-part URLs that actually return content have the pattern `/-<id1>/<id2>/<id3>/<id4>/`, where `<idN>` is a string that matches `[a-zA-Z0-9_-]{11}`.
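The pattern above translates directly into a regular expression. The sketch below checks it against a path built from made-up IDs (they are not taken from the dataset); `FIVE_PART` and `sample` are names introduced here for illustration.

```python
import re

ID = r'[a-zA-Z0-9_-]{11}'
# A leading dash, then four 11-character IDs, each followed by a slash.
FIVE_PART = re.compile(rf'-{ID}/{ID}/{ID}/{ID}/')

# Made-up IDs, each exactly 11 characters long.
sample = '-' + '/'.join(['abcdefghijk', 'ABCDEFGHIJK', '0123456789_', '-0123456789']) + '/'

print(FIVE_PART.fullmatch(sample) is not None)        # True
print(FIVE_PART.fullmatch('photos/x/1.jpg') is None)  # True
```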

#### 7-part URLs

In [None]:
listurls(7)

The image at the 7-part URL is missing.

#### 9-part URLs

In [None]:
listurls(9)

The 9-part URL returns the header `Content-Disposition: attachment;filename=p.txt`, but the payload is actually a JPEG file showing the [Benjamin Franklin Bridge](https://en.wikipedia.org/wiki/Benjamin_Franklin_Bridge) landmark.

#### 11-part URLs

In [None]:
listurls(11)

The 11-part URL returns the header `Content-Disposition: attachment;filename=p.txt`, but the payload is actually a JPEG file showing the [Church of Saint Francis](https://en.wikipedia.org/wiki/Church_of_S%C3%A3o_Francisco_(Porto)) landmark.
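Since the response header can be misleading, as it is here, the payload itself is a more reliable signal. A minimal sketch, assuming the response body is already available as bytes: JPEG files start with the bytes `FF D8 FF`, so a prefix check is enough for a quick test. `looks_like_jpeg` is a hypothetical helper, not part of the notebook.

```python
def looks_like_jpeg(data: bytes) -> bool:
    """True if the payload starts with the JPEG start-of-image marker."""
    return data[:3] == b'\xff\xd8\xff'

print(looks_like_jpeg(b'\xff\xd8\xff\xe0\x00\x10JFIF'))  # True
print(looks_like_jpeg(b'plain text payload'))            # False
```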

Since all these URLs except the 1-part ones are either inaccessible or show landmarks that are not particularly rare, it's reasonable to drop them.

### Analyze frequent URLs

#### 3- and 4-part URLs

In [None]:
listurls(3, network=False, n=5)

In [None]:
urlmatches(3, r'photos/[\w-]+/\d+\.jpg')

In [None]:
urlmatchcounts(3, r'photos/([\w-]+)/\d+\.jpg')

In [None]:
listurls(4, network=False, n=5)

In [None]:
urlmatches(4, r'(?:mw-panoramio|static\.panoramio\.com)/photos/[\w-]+/\d+\.jpg')

In [None]:
urlmatchcounts(4, r'(?:mw-panoramio|static\.panoramio\.com)/photos/([\w-]+)/\d+\.jpg')

3- and 4-part URLs have the following 3 last path parts: `photos`, `<sizecategory>` and `<imageid>.jpg`. Let's find out how size categories map to actual image sizes:

_NOTE: In the `listsizecategories` function, remove the first line to actually see the size categories._

In [None]:
def listsizecategories(urlfmt, cats):
    print('NO NETWORK'); return  # NOTE: Remove this line to actually see the size categories.
    for cat in cats:
        try:
            url = urlfmt.format(cat)
            sz = Image.open(BytesIO(requests.get(url).content)).size
        # Image.open raises OSError when the response body is not an image
        # (for example, an HTML error page), so catch that as well.
        except (requests.exceptions.RequestException, OSError):
            sz = 'FAIL'
        print('{:>20}: {}'.format(cat, sz))

In [None]:
listsizecategories('http://static.panoramio.com/photos/{}/70761397.jpg',
                   '''original large 1920x1280 medium small thumbnail
                      iw-thumbnail square mini_square'''.split())

```
            original: (3951, 2963)
               large: (1024, 768)
           1920x1280: (1707, 1280)
              medium: (500, 375)
               small: (240, 180)
           thumbnail: (100, 75)
        iw-thumbnail: (120, 90)
              square: (60, 60)
         mini_square: (32, 32)
```
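With the sizes listed above, choosing a category for a target resolution becomes a lookup. The sketch below is a hypothetical helper built only from the sizes observed for this one image; other images may have different aspect ratios, so treat the numbers as indicative. The two square categories change the aspect ratio, so they are skipped by default.

```python
# Sizes observed above for a single Panoramio image.
OBSERVED_SIZES = {
    'original': (3951, 2963),
    'large': (1024, 768),
    '1920x1280': (1707, 1280),
    'medium': (500, 375),
    'small': (240, 180),
    'iw-thumbnail': (120, 90),
    'thumbnail': (100, 75),
    'square': (60, 60),
    'mini_square': (32, 32),
}

def pick_category(min_longest_side, skip=('square', 'mini_square')):
    """Smallest observed category whose longest side is at least `min_longest_side`."""
    candidates = [(max(size), cat) for cat, size in OBSERVED_SIZES.items()
                  if cat not in skip and max(size) >= min_longest_side]
    return min(candidates)[1] if candidates else None

print(pick_category(224))  # small
print(pick_category(500))  # medium
print(pick_category(501))  # large
```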

#### 6-part URLs

In [None]:
listurls(6, network=False, n=5)

In [None]:
urlmatches(6, r'(?:[\w\d.-]+/){4}(?:[\w\d%-])*/')

In [None]:
urlmatchcounts(6, r'(?:[\w\d.-]+/){4}([\w\d%-]*)/')[:10]

6-part URLs are somewhat complicated. It seems that the final path part has some kind of numeric parameter. Let's try to match only the prefix without numbers:

In [None]:
urlmatchcounts(6, r'(?:[\w\d.-]+/){4}([a-z]+)(?:[\w\d%-])*/')

Now, try to list some categories with predefined sizes.

In [None]:
listsizecategories('http://lh6.ggpht.com/-vKr5G5MEusk/SR6r6SJi6mI/AAAAAAAAAco/-7JrhF1dfso/{}/',
                   's100 w100 h100 rj d'.split())

```
                s100: (100, 75)
                w100: (100, 75)
                h100: (133, 100)
                  rj: (512, 384)
                   d: (1600, 1200)
```

In [None]:
listsizecategories('https://lh3.googleusercontent.com/-LOW2cjAqubA/RvE11dfgUaI/AAAAAAAABoU/ItwXEejtwHg/{}/',
                   's200 w200 h200 rj d'.split())

```
                s200: (150, 200)
                w200: (200, 267)
                h200: (150, 200)
                  rj: (384, 512)
                   d: (1200, 1600)
```

It seems that the last path part has one of the following formats: `(w|h|s)\d+`, `rj` or `d`. The first format returns an image scaled so that its width (`w`), height (`h`) or longest side (`s`) equals the given number. In the two examples above, `rj` returns a 512x384 image and `d` returns a 1600x1200 one (width and height are swapped for the portrait image).
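The `w`/`h`/`s` rule can be checked against the sizes printed above. The sketch below is a model of the observed behaviour only: `predict_size` is a hypothetical helper, it assumes that `d` returned the original size for both sample images, and it does not cover requests larger than the original.

```python
import re

def predict_size(orig, sizepart):
    """Predict (width, height) for a `w<N>`, `h<N>` or `s<N>` path part."""
    ow, oh = orig
    m = re.fullmatch(r'([whs])(\d+)', sizepart)
    if m is None:
        return None  # e.g. `rj` or `d`, which take no numeric parameter
    kind, value = m.group(1), int(m.group(2))
    if kind == 's':
        # `s` constrains whichever side is the longest.
        kind = 'w' if ow >= oh else 'h'
    if kind == 'w':
        return value, round(oh * value / ow)
    return round(ow * value / oh), value

# Landscape sample (1600x1200) and portrait sample (1200x1600) from above.
print(predict_size((1600, 1200), 's100'))  # (100, 75)
print(predict_size((1600, 1200), 'h100'))  # (133, 100)
print(predict_size((1200, 1600), 'w200'))  # (200, 267)
print(predict_size((1200, 1600), 's200'))  # (150, 200)
```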

## Convert URLs

Now we've collected all necessary information to transform URLs into their "smaller" versions.

In [None]:
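def resizeurl(url, minsize, sizecategory):
    """Rewrite a URL to request a smaller image; return None for unsupported URL shapes.

    `minsize` is the pixel size used for Google-hosted images (1- and 6-part URLs);
    `sizecategory` is the Panoramio size category (3- and 4-part URLs).
    """
    parts = url.split('/')
    # As before, don't count protocol spec and domain name slashes.
    nparts = len(parts) - 3
    if nparts == 1:
        # Replace any existing size spec that follows the URL-encoded '=' (%3D).
        before = parts[-1].partition('%')[0]
        after = f'3Dw{minsize}-h{minsize}'
        parts[-1] = before + '%' + after
    elif nparts == 3 or nparts == 4:
        parts[-2] = sizecategory
    elif nparts == 6:
        # The URL ends with a slash, so the size spec is the second-to-last item.
        parts[-2] = f's{minsize}'
    else:
        return None
    return '/'.join(parts)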
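To see the three rewrite rules side by side, here is a standalone variant applied to sample URLs from this notebook. It is only a sketch: `resize_url_sketch` and its `maxside` parameter are names introduced here, and it uses `urllib.parse` so that the scheme and host are handled separately from the path instead of by counting slashes.

```python
from urllib.parse import urlsplit, urlunsplit

def resize_url_sketch(url, maxside=500, sizecategory='medium'):
    """Apply the 1-, 3-/4- and 6-part rewrite rules; return None otherwise."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    parts = path.split('/')[1:]  # drop the empty string before the leading slash
    if len(parts) == 1:
        imageid = parts[0].partition('%')[0]
        parts[0] = f'{imageid}%3Dw{maxside}-h{maxside}'
    elif len(parts) in (3, 4):
        parts[-2] = sizecategory
    elif len(parts) == 6:
        parts[-2] = f's{maxside}'  # the last item is empty (trailing slash)
    else:
        return None
    return urlunsplit((scheme, netloc, '/' + '/'.join(parts), query, fragment))

print(resize_url_sketch('http://static.panoramio.com/photos/original/70761397.jpg'))
# http://static.panoramio.com/photos/medium/70761397.jpg
print(resize_url_sketch('http://lh6.ggpht.com/-vKr5G5MEusk/SR6r6SJi6mI/AAAAAAAAAco/-7JrhF1dfso/s100/'))
# http://lh6.ggpht.com/-vKr5G5MEusk/SR6r6SJi6mI/AAAAAAAAAco/-7JrhF1dfso/s500/
print(resize_url_sketch('http://example.com/p1/p2'))
# None
```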

In [None]:
df['url_resized'] = df.url.transform(partial(resizeurl, minsize=500, sizecategory='large'))

In [None]:
df.url_resized.isnull().sum()

There are 9 dropped rows, which matches the number of "rare" URLs (not counting the 1-part ones). Now, write the transformed URLs to a CSV file with the same format as the original one.

In [None]:
(df
 .dropna()
 .drop(columns=['url'])
 .rename(columns=dict(url_resized='url'))
 .to_csv('trainsmall.csv', index=False, columns=['id', 'url', 'landmark_id']))

In [None]:
!head -5 trainsmall.csv