We would like to compare clusterings of cases based on the opinion text (natural language processing) with clusterings based on the citation structure (network community detection).

Community detection on the network
- modularity
- walktrap
- stochastic block model, SBM (TODO)

Clustering on the opinion texts
- compute TF-IDF vectors of opinions
    - K-means on TF-IDF vectors
    - Gaussian mixture models (TODO)
    - spectral clustering on similarity matrix (TODO)
- topic modeling (TODO)
    - LDA
    - non-negative matrix factorization
    
Relational topic models (see Blei paper) (TODO)

### TODO
- match clusters from different algos
- find representatives for clusters
    - top TF-IDF words
    - 'most central case' in community (is this a thing?)
- more CD algos
    - fix number of communities same as number of NLP clusters 
- more NLP based clustering algos
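
One way to make the 'most central case' idea concrete is within-community degree: for each community, pick the vertex with the most neighbours inside that community. The sketch below works on a plain edge list and a membership dict. The names `edges`, `membership` and `most_central_per_community` are hypothetical and not part of the project code, and other centrality measures (for example PageRank restricted to the community) may suit citation data better.

```python
from collections import defaultdict


def most_central_per_community(edges, membership):
    """Return {community: vertex with the highest within-community degree}.

    edges: iterable of (u, v) pairs, treated as undirected.
    membership: dict mapping vertex -> community label.
    """
    within_degree = defaultdict(int)
    for u, v in edges:
        # only count edges whose endpoints share a community
        if u != v and membership[u] == membership[v]:
            within_degree[u] += 1
            within_degree[v] += 1

    best = {}
    for vertex in sorted(membership):  # sorted so ties break deterministically
        community = membership[vertex]
        if (community not in best
                or within_degree[vertex] > within_degree[best[community]]):
            best[community] = vertex
    return best


# toy example: two triangles joined by one bridge edge, plus a pendant vertex
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("f", "g")]
membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1, "g": 1}
central = most_central_per_community(edges, membership)
```

In the toy graph every vertex of the first triangle has within-community degree 2, so the tie goes to the first vertex in sorted order; in the second community `f` wins because of its extra pendant neighbour.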

# Notes

Borrowing some code from http://brandonrose.org/clustering

In [104]:
from __future__ import division  # must be the first statement in the cell

top_directory = '/Users/iaincarmichael/Dropbox/Research/law/law-net/'

import os
import sys
import time
from math import *
import copy
import cPickle as pickle

# data
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


# viz
import matplotlib.pyplot as plt


# graph
import igraph as ig


# NLP
from nltk.corpus import stopwords


# our code
sys.path.append(top_directory + 'code/')
from load_data import load_and_clean_graph, case_info
from pipeline.download_data import download_bulk_resource
from pipeline.make_clean_data import *
from viz import print_describe


sys.path.append(top_directory + 'explore/vertex_metrics_experiment/code/')
from make_snapshots import *
from make_edge_df import *
from attachment_model_inference import *
from compute_ranking_metrics import *
from pipeline_helper_functions import *
from make_case_text_files import *
from bag_of_words import *
from similarity_matrix import *

# directory set up
data_dir = top_directory + 'data/'
experiment_data_dir = data_dir + 'vertex_metrics_experiment/'

court_name = 'scotus'

# jupyter notebook settings
%load_ext autoreload
%autoreload 2
%matplotlib inline



In [105]:
# load the graph
G = load_and_clean_graph(data_dir, court_name)

# largest connected component

Restrict our attention to the largest connected component of the network. We are also missing some text files from 2016, so let's ignore 2016.

In [106]:
# limit ourselves to cases up to and including 2015 since we are missing some text files from 2016
G = G.subgraph(G.vs.select(year_le=2015))

# make an undirected copy of the graph (as_undirected returns a new graph)
Gud = G.as_undirected()

# get largest connected component
components = Gud.clusters(mode='STRONG')
g = components.subgraphs()[np.argmax(components.sizes())]

# CL ids of cases in largest connected component
CLids = g.vs['name']

# graph clustering

Run community detection on the citation network.

## modularity

In [107]:
# modularity clustering

%time cd_modularity = g.community_fastgreedy() # .as_clustering().membership

mod_clust = cd_modularity.as_clustering()

mod_clust.summary()

CPU times: user 1min 50s, sys: 1.54 s, total: 1min 52s
Wall time: 2min 6s


'Clustering with 27539 elements and 172 clusters'

In [108]:
graph_clusters = pd.Series(mod_clust.membership, index=g.vs['name'])
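
`community_fastgreedy` greedily merges communities so as to increase modularity, a score that compares the number of within-community edges with the number expected if edges were placed at random while keeping vertex degrees fixed. As a small illustration of what is being optimized, the score for an undirected, unweighted graph can be computed as below. This is a pure-Python sketch, and `modularity`, `edges` and the two membership dicts are hypothetical names, not part of the project code.

```python
from __future__ import division

from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  e_c / m - (d_c / (2 m)) ** 2
    where m is the number of edges, e_c the number of edges inside c
    and d_c the total degree of the vertices in c.
    """
    m = len(edges)
    internal_edges = defaultdict(int)
    total_degree = defaultdict(int)
    for u, v in edges:
        total_degree[membership[u]] += 1
        total_degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal_edges[membership[u]] += 1
    return sum(internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)


# two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
lumped = {v: "A" for v in range(6)}

q_split = modularity(edges, split)    # one community per triangle
q_lumped = modularity(edges, lumped)  # everything in one community
```

Splitting the toy graph into its two triangles gives Q = 5/14, while putting every vertex in a single community gives Q = 0, so the split is preferred.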

## walktrap

In [109]:
# %time cd_walktrap = g.community_walktrap()

# wt_clust = cd_walktrap.as_clustering()

# wt_clust.summary()

# NLP clustering

## make TF-IDF vectors

In [110]:
%time normalized_text_dict = get_normalized_text_dict(experiment_data_dir)

# only look at cases in the connected component
normalized_text_dict = {k: normalized_text_dict[k] for k in CLids}

CPU times: user 1min 31s, sys: 8 s, total: 1min 39s
Wall time: 2min 1s


In [111]:
min_df = .2
max_df = .8

%time tfidf_matrix, vocab, CLid_to_index = get_td_idf(normalized_text_dict, min_df, max_df)

CPU times: user 2min 58s, sys: 11 s, total: 3min 9s
Wall time: 3min 26s
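
The `min_df` and `max_df` arguments are presumably document-frequency cutoffs, as in scikit-learn's `TfidfVectorizer`: terms that appear in too small or too large a fraction of the opinions are dropped before weighting. The internals of `get_td_idf` are not shown here, so the following is only a sketch of the idea in pure Python, using one common TF-IDF variant (raw term count times `log(N / df)`). The function `tfidf_with_df_filter`, the toy documents and the cutoff of 0.4 are hypothetical.

```python
from __future__ import division

import math
from collections import Counter


def tfidf_with_df_filter(docs, min_df, max_df):
    """Return (vocab, vectors) keeping terms with min_df <= df / N <= max_df.

    docs: dict mapping document id -> list of tokens.
    vectors: dict mapping document id -> {term: tf * log(N / df)}.
    """
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vocab = sorted(t for t, df in doc_freq.items()
                   if min_df <= df / n_docs <= max_df)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}

    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = {t: counts[t] * idf[t] for t in vocab if t in counts}
    return vocab, vectors


docs = {
    "case1": ["court", "patent", "patent", "claim"],
    "case2": ["court", "tax", "income"],
    "case3": ["court", "patent", "appeal"],
    "case4": ["court", "tax", "appeal"],
    "case5": ["court", "habeas"],
}
# cutoffs chosen for the toy corpus, not the values used in the notebook
vocab, vectors = tfidf_with_df_filter(docs, min_df=0.4, max_df=0.8)
```

Here "court" appears in every document and is removed by `max_df`, while terms that occur in a single document are removed by `min_df`; only "appeal", "patent" and "tax" survive.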


## K-means clustering on TF-IDF

In [112]:
# set number of clusters
num_clusters = 30

# run kmeans
km = KMeans(n_clusters=num_clusters)
%time km.fit(tfidf_matrix)

nlp_clusters = km.labels_.tolist()

CPU times: user 7min 13s, sys: 13.1 s, total: 7min 26s
Wall time: 5min
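
To get a feel for what each K-means cluster is about, one option is to list the terms with the largest mean TF-IDF weight within each cluster. The following pure-Python sketch works on a small dense matrix. The function `top_terms_per_cluster` and the toy `rows`, `labels` and `vocab` are hypothetical, and on the real (likely sparse) matrix a NumPy or SciPy version would be needed.

```python
from __future__ import division


def top_terms_per_cluster(rows, labels, vocab, n_terms=3):
    """Return {cluster label: terms with the highest mean weight in that cluster}.

    rows: list of equal-length weight lists, one per document.
    labels: cluster label of each row (same order as rows).
    vocab: term for each column.
    """
    sums, sizes = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(vocab))
        for j, weight in enumerate(row):
            acc[j] += weight
        sizes[label] = sizes.get(label, 0) + 1

    top = {}
    for label, acc in sums.items():
        means = [total / sizes[label] for total in acc]
        # column indices sorted by decreasing mean weight
        order = sorted(range(len(vocab)), key=lambda j: -means[j])
        top[label] = [vocab[j] for j in order[:n_terms]]
    return top


vocab = ["appeal", "patent", "tax", "habeas"]
rows = [
    [0.1, 0.9, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.4],
    [0.3, 0.0, 0.6, 0.0],
]
labels = [0, 0, 1, 1]
top = top_terms_per_cluster(rows, labels, vocab, n_terms=2)
```

The per-cluster means computed here are the cluster centroids, so with a fitted scikit-learn `KMeans` object the same ranking could be read off the rows of `km.cluster_centers_`.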


# Compare NLP vs graph clustering

In [113]:
clusters = pd.DataFrame(index=normalized_text_dict.keys(), columns=['nlp', 'graph'])

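# add in NLP clusters
# NOTE: this assumes the rows of tfidf_matrix are in the same order as
# normalized_text_dict.keys(); if not, align them using CLid_to_index
clusters['nlp'] = nlp_clusters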


# add in communities 
clusters['graph'] = graph_clusters

# consider nodes not considered in CD to be their own cluster
# i.e. nodes outside the largest connected component
clusters['graph'] = clusters['graph'].fillna(max(graph_clusters) + 1)

# cast community labels to integers
clusters['graph'] = clusters['graph'].astype(int)

In [114]:
clusters

| CL id | nlp | graph |
|---|---|---|
| 145658 | 18 | 2 |
| 89370 | 14 | 0 |
| 89371 | 14 | 0 |
| 89372 | 29 | 0 |
| 89373 | 4 | 0 |
| 89374 | 19 | 0 |
| 89375 | 1 | 3 |
| 89376 | 19 | 0 |
| 89377 | 6 | 0 |
| 89378 | 5 | 0 |


In [115]:
# TODO: match clusters
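
As a possible starting point for the matching TODO, the two labelings can be cross-tabulated and each graph community paired with the NLP cluster it overlaps most, measured by Jaccard similarity; the adjusted Rand index then summarizes overall agreement in a single number. The sketch below is pure Python, and `pairs`, `adjusted_rand_index` and `best_matches` are hypothetical helpers rather than existing project functions. It assumes at least two items.

```python
from __future__ import division

from collections import Counter


def pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    joint = Counter(zip(labels_a, labels_b))
    size_a, size_b = Counter(labels_a), Counter(labels_b)
    same_both = sum(pairs(n) for n in joint.values())
    same_a = sum(pairs(n) for n in size_a.values())
    same_b = sum(pairs(n) for n in size_b.values())
    expected = same_a * same_b / pairs(len(labels_a))
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (same_both - expected) / (max_index - expected)


def best_matches(labels_a, labels_b):
    """For each cluster in labels_a: (best labels_b cluster, Jaccard overlap)."""
    joint = Counter(zip(labels_a, labels_b))
    size_a, size_b = Counter(labels_a), Counter(labels_b)
    best = {}
    for (a, b), overlap in sorted(joint.items()):
        jaccard = overlap / (size_a[a] + size_b[b] - overlap)
        if a not in best or jaccard > best[a][1]:
            best[a] = (b, jaccard)
    return best


# toy labelings of eight cases
graph_labels = [0, 0, 0, 1, 1, 1, 2, 2]
nlp_labels = [5, 5, 7, 7, 7, 7, 9, 9]

ari = adjusted_rand_index(graph_labels, nlp_labels)
matches = best_matches(graph_labels, nlp_labels)
```

With the notebook's DataFrame, the inputs would be `clusters['graph'].tolist()` and `clusters['nlp'].tolist()`. For real use, `pandas.crosstab` and `sklearn.metrics.adjusted_rand_score` provide the cross-tabulation and the index directly.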