### In this notebook, we'll apply the clustering of atom-ph articles (clustering-atom-ph.ipynb) to DAMOP 2016 abstracts and compare the clusters with the conference sessions

In [202]:
from collections import Counter
import json
from sklearn.externals import joblib

In [203]:
# First, load cluster predictor for atom-ph articles
clf = joblib.load('cluster-atom-ph.pkl') 

In [204]:
# Second, load articles from DAMOP
with open('../../damop data/damop2016.json') as f:
    damop = json.load(f)
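
The rest of the notebook depends on the layout of `damop2016.json`. Judging only from the keys read later on (`number`, `name`, `abstracts`, `abstract`), each entry appears to be one session holding a list of abstract records. A minimal sketch of that assumed layout follows; the `sample` variable and all values in it are made-up placeholders, not real DAMOP data.

```python
import json

# Hypothetical sample mirroring the keys this notebook reads from
# damop2016.json; the values are placeholders, not real DAMOP data.
sample = json.loads('''
[
  {
    "number": "X1",
    "name": "Example Session",
    "abstracts": [
      {"abstract": "First placeholder abstract."},
      {"abstract": "Second placeholder abstract."}
    ]
  }
]
''')

for session in sample:
    # Collect the abstract texts of one session
    texts = [a['abstract'] for a in session['abstracts']]
    # Build the "number: name" label used when printing sessions
    label = "{}: {}".format(session['number'], session['name'])
```

Any other fields the real file may contain are not used in this notebook.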

In [205]:
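# Sessions intended for exclusion. Note: the loop below never consults
# this list, so these sessions are still classified (see 1A in the output).
exclude_list = ['Graduate Student Symposium',
                'DAMOP Prize Session',
                'DAMOP Thesis Prize Session',
                ]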

In [206]:
sessions_all = 0
sessions_one_majority = 0
sessions_two_majority = 0

n_clusters = clf.get_params()['clf__n_clusters']
# Map each cluster index to the list of sessions assigned to it
cluster_to_session = {x: [] for x in range(n_clusters)}
sessions_unclassified = []

for session in damop:
    abstracts = [a['abstract'] for a in session['abstracts']]
    # Only classify sessions with more than 5 and fewer than 40 abstracts
    if 5 < len(abstracts) < 40:
        y = clf.predict(abstracts)
        count = Counter(y)
        session_number_name = "{}: {}".format(session['number'], session['name'])
        print session_number_name
        sessions_all += 1

        # Up to two most frequent (cluster, abstract count) pairs
        top = count.most_common(2)
        half = 0.5*len(abstracts)

        if top[0][1] >= half:
            # One cluster holds at least half of the session's abstracts
            print 'Majority cluster: {}'.format(top[0][0])
            sessions_one_majority += 1
            cluster_to_session[top[0][0]].append(session_number_name + ' (*)')

        elif top[0][1] + top[1][1] >= half:
            # The two largest clusters together hold at least half
            print 'Majority clusters: {}, {}'.format(top[0][0], top[1][0])
            sessions_two_majority += 1
            cluster_to_session[top[0][0]].append(session_number_name)
            cluster_to_session[top[1][0]].append(session_number_name)

        else:
            # No one or two clusters dominate: show the raw predictions
            print y
            sessions_unclassified.append(session_number_name)
        print ''
        
#         if session['number'] == 'N6':
#             break

1A: Graduate Student Symposium
Majority cluster: 8

B3: Quantum Gases with Dipolar Interactions
Majority cluster: 6

B4: Quantum Optics I
Majority cluster: 8

B5: Many-Body Localization and Disorder
Majority cluster: 8

B6: Progress in Spin-Orbit Coupling
Majority clusters: 8, 14

B7: Nonlinear Optics and Lasers
Majority cluster: 8

B9: Photoionization, Photodetachment and Photodissociation
Majority cluster: 16

C4: Hybrid Quantum Systems
Majority clusters: 14, 8

C5: BEC with Strong Interactions
Majority cluster: 5

C6: Quantum Gas Microscope
Majority cluster: 8

C7: Atomic Clocks
Majority cluster: 18

C9: Strong-Field Physics in Atoms, Molecules, and Clusters
Majority cluster: 16

G4: Quantum Measurement
Majority cluster: 8

G5: Atomic Magnetometers I
Majority cluster: 2

G6: One-Dimensional Gases and Nanofibers
Majority cluster: 5

G7: Interaction Effects in Spin-Orbit Coupled Gases
Majority cluster: 8

G8: Time-Resolved Electron Dynamics and Attosecond Spectroscopy
Majority cluster
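
The rule applied in the loop above can be restated as a small standalone function. This is only a sketch: `majority_clusters` is a hypothetical helper, not part of the notebook, and the label lists in the usage lines are invented for illustration.

```python
from collections import Counter

def majority_clusters(labels):
    """Return the one or two clusters covering at least half of `labels`.

    Hypothetical helper restating the notebook's rule; returns an empty
    tuple when even the two largest clusters fall short of half.
    """
    top = Counter(labels).most_common(2)
    half = 0.5 * len(labels)
    if not top:
        return ()
    if top[0][1] >= half:
        # A single cluster holds at least half of the labels
        return (top[0][0],)
    if len(top) == 2 and top[0][1] + top[1][1] >= half:
        # The two largest clusters together hold at least half
        return (top[0][0], top[1][0])
    return ()

# Invented label lists, for illustration only
single = majority_clusters([8, 8, 8, 14, 5, 2])
pair = majority_clusters([8, 8, 8, 14, 14, 5, 2, 3, 1, 0])
neither = majority_clusters([1, 2, 3, 4, 5, 6])
```

Here `single` is `(8,)`, `pair` is `(8, 14)`, and `neither` is `()`, matching the three branches of the loop.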

#### What fraction of the DAMOP sessions are covered by one or two clusters?

In [207]:
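# Fraction of sessions with a single majority cluster
print float(sessions_one_majority) / sessions_all
# Fraction of sessions covered by one or two majority clusters
print float(sessions_one_majority + sessions_two_majority) / sessions_all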

0.677966101695
0.966101694915


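#### Print DAMOP sessions that fall into each cluster.

Sessions marked `(*)` have a single majority cluster; unmarked sessions are split across two clusters and appear under both.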

In [208]:
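# For each cluster, feature indices sorted by descending centroid weight
order_centroids = clf.named_steps['clf'].cluster_centers_.argsort()[:, ::-1]

# Vectorizer vocabulary: terms[i] is the term for feature index i
terms = clf.named_steps['vect'].get_feature_names()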

for cluster, val in cluster_to_session.iteritems():
    print "Cluster {}: {}".format(cluster, ', '.join([terms[x] for x in order_centroids[cluster, :10]]))
    for session in val:
        print '    {}'.format(session)
    print ''

Cluster 0: probe, eit, atom, transparency, light, electromagnetically, electromagnetically induced, pump, single, induced transparency

Cluster 1: levels, calculations, ions, electron, relativistic, transitions, strengths, lines, fock, data
    N8: Electronic, Atomic, and Molecular Collisions
    T7: Spectroscopy, Lifetimes, Oscillator Strengths

Cluster 2: magnetic, magnetic field, field, fields, zeeman, atoms, magnetic fields, atomic, spin, magnetometer
    G5: Atomic Magnetometers I (*)
    T6: Atomic Magnetometers II (*)

Cluster 3: trap, cooling, ion, atoms, optical, ions, laser, mot, trapped, magneto
    M3: Focus: Cold and Ultracold Molecules
    N9: Cold and Ultracold Molecules II
    P6: Cooling Methods and Interacting BEC's (*)
    T5: New Techniques for Laser Cooling and Trapping (*)

Cluster 4: surface, atoms, atom, interaction, casimir, dipole, force, atomic, dielectric, temperature

Cluster 5: bose, condensate, gas, bose einstein, einstein, fermi, einstein condensate, bos
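
The cell above ranks vocabulary terms by their weight in each cluster centroid: `argsort()` returns feature indices in ascending order of weight, and `[:, ::-1]` reverses each row so the heaviest terms come first. A dependency-free sketch of the same ranking for a single centroid is shown here; `top_terms`, `vocab`, and the weights are hypothetical and made up for illustration.

```python
def top_terms(centroid, vocab, n=10):
    """Return the `n` terms with the largest weights in `centroid`."""
    # Indices sorted by descending weight; ties aside, this matches
    # reversing the result of an ascending argsort on one row.
    order = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [vocab[i] for i in order[:n]]

# Made-up vocabulary and centroid weights, for illustration only
vocab = ['atom', 'clock', 'laser', 'trap']
centroid = [0.10, 0.70, 0.05, 0.15]

ranked = top_terms(centroid, vocab, n=2)
```

With these weights, `ranked` is `['clock', 'trap']`.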

#### Which sessions were not classified?

In [209]:
print 'Sessions without clusters'
for session in sessions_unclassified:
    print session
    print ''
print ''
print 'Clusters without sessions'
for cluster, session in cluster_to_session.iteritems():
    if len(session) == 0:
        print "Cluster {}: {}".format(cluster, ', '.join([terms[x] for x in order_centroids[cluster, :10]]))

Sessions without clusters
J5: Precision Measurements

N5: Atom Interferometers


Clusters without sessions
Cluster 0: probe, eit, atom, transparency, light, electromagnetically, electromagnetically induced, pump, single, induced transparency
Cluster 4: surface, atoms, atom, interaction, casimir, dipole, force, atomic, dielectric, temperature
Cluster 9: antihydrogen, physics, particles, fundamental, mass, trap, precision, measurements, antiproton, gravitational
Cluster 10: alpha, mu, variation, fine structure, fine structure constant, fine, structure constant, constant, structure, constant alpha
Cluster 12: alpha, corrections, hydrogen, proton, nuclear, muonic, lamb shift, shift, lamb, correction
Cluster 13: parity, cluster, coupled cluster, nuclear, edm, dipole, relativistic, electric, coupled, relativistic coupled
Cluster 17: functions, wave, equation, method, body, function, states, bound, coulomb, energy
