In [10]:
import glob
import pickle
import pandas as pd

After successfully running a clustering pipeline on the CVPR 2018 papers data, this notebook manually assigns a label to each cluster to identify the topic of the papers it contains.

## Preparing the input data

In [11]:
# Load the fitted k-means model; the context manager closes the file handle
with open('models/kmeans.p', 'rb') as f:
    kmeans = pickle.load(f)

In [13]:
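def get_papers():
    """List the paper text files and drop those that are too short or too long."""
    papers = sorted(glob.glob('data/*.txt'))

    rows = []
    for paper in papers:
        with open(paper, 'r') as f:
            text = f.readlines()
        # Record the length (in characters) of the first line of each file
        rows.append([paper, len(text[0])])
    df = pd.DataFrame(rows, columns=['paper', 'len'])

    # Keep only papers whose length is between 5,000 and 80,000 characters
    df = df[(df['len'] >= 5000) & (df['len'] <= 80000)]

    return df


df = get_papers()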

In [18]:
paper_files = [p
                .replace('data/', '')
                .replace('.txt', '.pdf')
                for p in df.paper]

## Scraping the paper names

We parse the CVPR 2018 open access page again to map each PDF file name to its complete paper title.

In [20]:
import requests
from bs4 import BeautifulSoup

# Getting the page
response = requests.get("http://openaccess.thecvf.com/CVPR2018.py")
page_html = response.text
soup = BeautifulSoup(page_html, 'html.parser')

In [49]:
df = pd.DataFrame([], columns=['name', 'pdf'])

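# The first <dl> holds the paper list. Each <dt> (title) is followed by
# two <dd> tags, and the second one contains the link to the pdf.
dl_tag = soup.find_all('dl')[0]
dt_tag = dl_tag.find_all('dt')
dd_tag = dl_tag.findChildren('dd', recursive=False)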
for i in range(len(dt_tag)):
    dt = dt_tag[i]
    dd = dd_tag[i*2 + 1]
    
    name = dt.text
    pdf = dd.findChildren('a')[0]['href'].replace('content_cvpr_2018/papers/','')
    
    df.loc[i] = [name, pdf]
df.head()

Unnamed: 0,name,pdf
0,Embodied Question Answering,Das_Embodied_Question_Answering_CVPR_2018_pape...
1,Learning by Asking Questions,Misra_Learning_by_Asking_CVPR_2018_paper.pdf
2,Finding Tiny Faces in the Wild With Generative...,Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf
3,Learning Face Age Progression: A Pyramid Archi...,Yang_Learning_Face_Age_CVPR_2018_paper.pdf
4,PairedCycleGAN: Asymmetric Style Transfer for ...,Chang_PairedCycleGAN_Asymmetric_Style_CVPR_201...


In [50]:
# Filtering out the discarded pdf files
df = df[df.pdf.isin(paper_files)]
df.head()

Unnamed: 0,name,pdf
0,Embodied Question Answering,Das_Embodied_Question_Answering_CVPR_2018_pape...
1,Learning by Asking Questions,Misra_Learning_by_Asking_CVPR_2018_paper.pdf
2,Finding Tiny Faces in the Wild With Generative...,Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf
3,Learning Face Age Progression: A Pyramid Archi...,Yang_Learning_Face_Age_CVPR_2018_paper.pdf
4,PairedCycleGAN: Asymmetric Style Transfer for ...,Chang_PairedCycleGAN_Asymmetric_Style_CVPR_201...


In [52]:
df['cluster'] = kmeans.labels_
df.head()

Unnamed: 0,name,pdf,cluster
0,Embodied Question Answering,Das_Embodied_Question_Answering_CVPR_2018_pape...,1
1,Learning by Asking Questions,Misra_Learning_by_Asking_CVPR_2018_paper.pdf,13
2,Finding Tiny Faces in the Wild With Generative...,Bai_Finding_Tiny_Faces_CVPR_2018_paper.pdf,3
3,Learning Face Age Progression: A Pyramid Archi...,Yang_Learning_Face_Age_CVPR_2018_paper.pdf,16
4,PairedCycleGAN: Asymmetric Style Transfer for ...,Chang_PairedCycleGAN_Asymmetric_Style_CVPR_201...,20
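
The assignment above is positional, so it only gives the right cluster to each paper if the rows of `df` are in the same order as the documents the model was fitted on. The clustering notebook is not shown here, so the following is only a sketch under the assumption that `kmeans.labels_` follows the order of `paper_files` (the sorted file list). The helper `align_labels` is hypothetical and looks each label up by file name instead of by position.

```python
def align_labels(scraped_pdfs, paper_files, labels):
    """Return one cluster label per scraped pdf, matched by file name.

    Assumption: labels[i] belongs to paper_files[i].
    """
    if len(paper_files) != len(labels):
        raise ValueError("paper_files and labels must have the same length")
    label_by_pdf = dict(zip(paper_files, labels))
    return [label_by_pdf[pdf] for pdf in scraped_pdfs]


# Toy data: the scraped order differs from the sorted file order
toy_paper_files = ['a_paper.pdf', 'b_paper.pdf', 'c_paper.pdf']
toy_labels = [2, 0, 1]
toy_scraped = ['c_paper.pdf', 'a_paper.pdf', 'b_paper.pdf']

aligned = align_labels(toy_scraped, toy_paper_files, toy_labels)
print(aligned)
```

In the notebook this would correspond to something like `df['cluster'] = align_labels(df.pdf, paper_files, kmeans.labels_)`, again assuming the labels follow `paper_files`.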


## Assign topics

We looked at the paper titles in each cluster to manually assign category names. 
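
Reading every title by hand can be slow for the larger clusters. As an optional aid (not part of the original workflow), the sketch below counts the most frequent title words in a group of titles. The function `top_title_words` and the `STOP_WORDS` list are hypothetical names introduced for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop-word list; extend it to suit the corpus
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "via", "with"}


def top_title_words(titles, n=5):
    """Return the n most frequent non-stop-words across a list of titles."""
    counts = Counter()
    for title in titles:
        # Lower-case the title and split it into alphanumeric tokens
        words = re.findall(r"[a-z0-9]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


titles = [
    "Residual Dense Network for Image Super-Resolution",
    "Super-Resolving Very Low-Resolution Face Images With Supplementary Attributes",
]
print(top_title_words(titles, n=3))
```

Applied per cluster, for example to `df[df.cluster == c].name`, the most common words can hint at a candidate topic name, which still needs to be confirmed by reading the titles.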

In [61]:
cluster_size = df.groupby('cluster').count()['name'].sort_values()
cluster_size

cluster
25    12
24    12
22    16
18    17
7     22
4     24
11    25
21    27
2     29
16    30
8     31
20    32
5     34
19    38
23    44
6     46
15    47
9     47
10    48
0     49
14    49
1     51
12    51
13    60
17    62
3     63
Name: name, dtype: int64

In [65]:
for c in cluster_size.index.values.tolist():
    print('CLUSTER ' + str(c) + '\n')
    df_c = df[df.cluster == c]
    for name in df_c.name:
        print('  ' + name)
    print('==============\n')

CLUSTER 25

  PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation
  FaceID-GAN: Learning a Symmetry Three-Player GAN for Identity-Preserving Face Synthesis
  Super-Resolving Very Low-Resolution Face Images With Supplementary Attributes
  Salience Guided Depth Calibration for Perceptually Optimized Compressive Light Field 3D Display
  3D Human Sensing, Action and Emotion Recognition in Robot Assisted Therapy of Children With Autism
  Residual Dense Network for Image Super-Resolution
  Partially Shared Multi-Task Convolutional Neural Network With Local Constraint for Face Attribute Learning
  Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control
  MX-LSTM: Mixing Tracklets and Vislets to Jointly Forecast Trajectories and Head Poses
  FlipDial: A Generative Model for Two-Way Visual Dialogue
  Learning Compressible 360° Video Isomers
  Learning Spatial-Aware Regressions for Visual Tracking

CLUSTER 24

  Single View Stereo Matching
  Disentangling 3D Pose in a Dendr