In [1]:
import glob
import pickle
import pandas as pd
import requests
from bs4 import BeautifulSoup



After successfully running a clustering pipeline on the CVPR 2018 papers data, the aim of this notebook is to manually assign a label to each cluster that identifies the topic of the papers it contains.

## Preparing the input data

In [2]:
# Load the fitted K-means model, closing the file handle afterwards
with open('models/kmeans.p', 'rb') as f:
    kmeans = pickle.load(f)

In [3]:
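def get_papers():
    """List the text files in data/ and keep those of plausible length."""
    rows = []
    for paper in sorted(glob.glob('data/*.txt')):
        with open(paper, 'r') as f:
            # Only the first line is measured (each file is assumed to
            # hold its extracted text on a single line)
            first_line = f.readline()
        rows.append((paper, len(first_line)))
    df = pd.DataFrame(rows, columns=['paper', 'len'])
    # Discard papers shorter than 5000 or longer than 80000 characters
    return df[df['len'].between(5000, 80000)]


df = get_papers()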

In [4]:
paper_files = [p
                .replace('data/', '')
                .replace('.txt', '.pdf')
                for p in df.paper]

## Scraping the paper names

We parse the CVPR 2018 open access page again to map each PDF file name to the complete paper name.

In [5]:
# Getting the page
response = requests.get("http://openaccess.thecvf.com/CVPR2018.py")
page_html = response.text
soup = BeautifulSoup(page_html, 'html.parser')

In [6]:
df = pd.DataFrame([], columns=['name', 'pdf'])

dl_tag = soup.find_all('dl')[0]
dt_tag = dl_tag.find_all('dt')
dd_tag = dl_tag.findChildren('dd', recursive=False)
for i in range(len(dt_tag)):
    dt = dt_tag[i]
    dd = dd_tag[i*2 + 1]
    
    name = dt.text
    pdf = dd.findChildren('a')[0]['href'].replace('content_cvpr_2018/papers/','')
    
    df.loc[i] = [name, pdf]
df.head()

Unnamed: 0,name,pdf
0,Embodied Question Answering,Das_Embodied_Question_Answering_CVPR_2018_pape...
1,Learning by Asking Questions,Misra_Learning_by_Asking_CVPR_2018_paper.pdf
2,Finding Tiny Faces in the Wild With Generative...,Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf
3,Learning Face Age Progression: A Pyramid Archi...,Yang_Learning_Face_Age_CVPR_2018_paper.pdf
4,PairedCycleGAN: Asymmetric Style Transfer for ...,Chang_PairedCycleGAN_Asymmetric_Style_CVPR_201...


In [7]:
# Filtering out the discarded pdf files
df = df[df.pdf.isin(paper_files)]
df = df.sort_values(by='pdf')
df.head()

Unnamed: 0,name,pdf
174,A High-Quality Denoising Dataset for Smartphon...,Abdelhamed_A_High-Quality_Denoising_CVPR_2018_...
552,When Will You Do What? - Anticipating Temporal...,Abu_Farha_When_Will_You_CVPR_2018_paper.pdf
88,Efficient Interactive Annotation of Segmentati...,Acuna_Efficient_Interactive_Annotation_CVPR_20...
514,Don't Just Assume; Look and Answer: Overcoming...,Agrawal_Dont_Just_Assume_CVPR_2018_paper.pdf
268,Image Collection Pop-Up: 3D Reconstruction and...,Agudo_Image_Collection_Pop-Up_CVPR_2018_paper.pdf


In [8]:
df['cluster'] = kmeans.labels_
df.head()

Unnamed: 0,name,pdf,cluster
174,A High-Quality Denoising Dataset for Smartphon...,Abdelhamed_A_High-Quality_Denoising_CVPR_2018_...,1
552,When Will You Do What? - Anticipating Temporal...,Abu_Farha_When_Will_You_CVPR_2018_paper.pdf,13
88,Efficient Interactive Annotation of Segmentati...,Acuna_Efficient_Interactive_Annotation_CVPR_20...,3
514,Don't Just Assume; Look and Answer: Overcoming...,Agrawal_Dont_Just_Assume_CVPR_2018_paper.pdf,16
268,Image Collection Pop-Up: 3D Reconstruction and...,Agudo_Image_Collection_Pop-Up_CVPR_2018_paper.pdf,20
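
Assigning `kmeans.labels_` as a column only gives meaningful results if the rows of `df` are in the same order as the documents the model was fitted on. The notebook does not verify this, so the following is a minimal sketch of such a check. The helper name `check_alignment` is hypothetical, and it assumes the clustering pipeline used the same sorted, length-filtered file list as `paper_files`.

```python
def check_alignment(pdf_names, paper_files, n_labels):
    """Return a list of problems; an empty list means the inputs line up."""
    problems = []
    if len(pdf_names) != n_labels:
        problems.append(f"{len(pdf_names)} scraped rows but {n_labels} cluster labels")
    if len(paper_files) != n_labels:
        problems.append(f"{len(paper_files)} paper files but {n_labels} cluster labels")
    # Report only the first position where the two orderings disagree
    for pos, (scraped, on_disk) in enumerate(zip(pdf_names, paper_files)):
        if scraped != on_disk:
            problems.append(f"position {pos}: {scraped!r} != {on_disk!r}")
            break
    return problems


# Toy file names, made up for illustration only
print(check_alignment(['a.pdf', 'b.pdf'], ['a.pdf', 'b.pdf'], 2))
print(check_alignment(['a.pdf', 'c.pdf'], ['a.pdf', 'b.pdf'], 2))
```

In this notebook the call would be `check_alignment(df.pdf.tolist(), paper_files, len(kmeans.labels_))`, run just before the cluster column is assigned.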


## Assign topics


In [9]:
cluster_size = df.groupby('cluster').count()['name'].sort_values()
cluster_size

cluster
25    12
24    12
22    16
18    17
7     22
4     24
11    25
21    27
2     29
16    30
8     31
20    32
5     34
19    38
23    44
6     46
15    47
9     47
10    48
0     49
14    49
1     51
12    51
13    60
17    62
3     63
Name: name, dtype: int64

We are going to display the papers in each cluster, ordered from the smallest cluster to the largest.

The cluster labels below were created manually after inspecting the paper titles printed by the loop that follows them:

In [10]:
cluster_labels = {
    25: 'Generative Adversarial Networks (GAN)',
    24: 'Stereo',
    22: 'Face detection/classification',
    18: 'Visual saliency and gaze detection',
    7: '3D objects',
    4: 'Text detection/recognition',
    11: 'Object recognition from very few examples',
    21: 'Tracking',
    2: 'Simultaneous Localisation and Mapping (SLAM) and 3D reconstruction',
    16: 'Visual question answering',
    8: 'Re-identification',
    20: '3D reconstruction',
    5: 'Transfer learning, domain adaptation and adversarial networks',
    19: 'Lighting models',
    23: 'Point clouds correspondences and matching',
    6: 'Face detection/recognition',
    15: 'Pose estimation',
    9: 'Image/video captioning, object detection',
    10: 'Generative Adversarial Networks (GAN)',
    0: 'Object detection, video/object segmentation',
    14: 'Image/points registration, visual features',
    1: 'Image processing',
    12: 'Deep learning',
    13: 'Temporal representation/prediction',
    17: 'Neural networks',
    3: 'Object localisation/detection'
}

In [11]:
for c in cluster_size.index.values.tolist():
    print('CLUSTER ' + str(c) + ': ' + cluster_labels[c] + '\n')
    df_c = df[df.cluster == c]
    for name in df_c.name:
        print('  ' + name)
    print('==============\n')

CLUSTER 25: Generative Adversarial Networks (GAN)

  Multi-Content GAN for Few-Shot Font Style Transfer
  PairedCycleGAN: Asymmetric Style Transfer for Applying and Removing Makeup
  CartoonGAN: Generative Adversarial Networks for Photo Cartoonization
  Disentangling Structure and Aesthetics for Style-Aware Image Completion
  Arbitrary Style Transfer With Deep Feature Reshuffle
  Creating Capsule Wardrobes From Fashion Images
  Multi-Task Adversarial Network for Disentangled Feature Learning
  A Common Framework for Interactive Texture Transfer
  Neural Style Transfer via Meta Networks
  Avatar-Net: Multi-Scale Zero-Shot Style Transfer by Feature Decoration
  TextureGAN: Controlling Deep Image Synthesis With Texture Patches
  Separating Style and Content for Generalized Style Transfer

CLUSTER 24: Stereo

  A Low Power, High Throughput, Fully Event-Based Stereo System
  CBMV: A Coalesced Bidirectional Matching Volume for Disparity Estimation
  Pyramid Stereo Matching Network
  Stereosc

## Conclusions

In this set of notebooks I created an NLP-based paper segmentation pipeline that splits the papers published at the CVPR 2018 conference into topics. Each cluster was labelled manually after visually inspecting the titles of its papers. The labels try to identify the predominant topic in each cluster but, given the nature of the K-means algorithm, some clusters contain papers that do not seem to fit the overall cluster topic very well. One possible reason is that papers focused on general topics rather than specific applications are driven into a specific cluster by the examples they use.

Additionally, and also due to the nature of the K-means algorithm, some topics are split across more than one cluster, such as 3D reconstruction (clusters 20 and 2), face detection/recognition (clusters 6 and 22), or GANs (clusters 25 and 10).
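
One quick way to spot such splits is to invert the label dictionary and list the labels that are attached to more than one cluster. This is only a sketch: the function name `clusters_sharing_a_label` is hypothetical, and the example dictionary is a small excerpt of `cluster_labels`.

```python
from collections import defaultdict


def clusters_sharing_a_label(labels):
    """Map each label used by several clusters to the sorted ids of those clusters."""
    by_label = defaultdict(list)
    for cluster_id, label in labels.items():
        by_label[label].append(cluster_id)
    return {label: sorted(ids) for label, ids in by_label.items() if len(ids) > 1}


# Excerpt of the cluster_labels dictionary defined earlier
example_labels = {
    25: 'Generative Adversarial Networks (GAN)',
    24: 'Stereo',
    10: 'Generative Adversarial Networks (GAN)',
}
print(clusters_sharing_a_label(example_labels))
# {'Generative Adversarial Networks (GAN)': [10, 25]}
```

This only catches exact string matches. Clusters 20 and 2, or 6 and 22, carry differently worded labels, so they would still need to be compared by eye.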

I felt more confident labelling some clusters than others. Generally, I felt more confident labelling the smallest clusters, so the labels of the larger clusters have to be taken with a pinch of salt. One way of improving the work presented here would be to analyse the proportion of papers in each cluster that genuinely match the cluster label.
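
The proportion mentioned above could be computed along these lines once each paper has been manually marked as matching its cluster label or not. No such annotation exists in these notebooks, so the function name `label_match_rate` is hypothetical and the input pairs are made up purely to show the calculation.

```python
from collections import defaultdict


def label_match_rate(annotations):
    """Return, per cluster, the fraction of papers judged to match the label.

    `annotations` is an iterable of (cluster_id, matches_label) pairs, where
    matches_label is a manual True/False judgement for one paper.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cluster_id, matches_label in annotations:
        totals[cluster_id] += 1
        hits[cluster_id] += bool(matches_label)
    return {cluster_id: hits[cluster_id] / totals[cluster_id] for cluster_id in totals}


# Made-up judgements for illustration only, not real results
toy_annotations = [(24, True), (24, True), (24, False), (25, True)]
for cluster_id, rate in sorted(label_match_rate(toy_annotations).items()):
    print(f'cluster {cluster_id}: {rate:.0%} of papers match the label')
```

Clusters with a low rate would be the first candidates for relabelling or for being split further.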

Finally, I'd like to add that I started this analysis because, after many years working on unrelated topics, I wanted to find out what the main current topics in the Computer Vision field were. The fact that CVPR 2018, the most important Computer Vision conference, made its published papers openly available on the Internet seemed like a very good opportunity to do so. If I were still part of academia, I would use this work as a first step towards identifying the papers I could be interested in among the roughly 1000 papers published at the conference.