# Classifying Reddit Boards Based on Word Embeddings

Here we build a word2vec model from the entire Reddit corpus, and then train and test a classifier on each pair of boards (disorders). The classifiers are SVMs, and each pair was tested with linear, RBF, and polynomial kernels. Note that the results dataframe is missing the "kernels" variable, so it does not record which kernel produced which row. We can of course find an optimal kernel if this approach turns out to be useful or interesting.
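The text does not say how each post is turned into a feature vector for the SVM. One common choice, assumed here purely for illustration, is to average the word2vec vectors of the words in a post. The `post_vector` helper and the toy `embeddings` dictionary below are hypothetical stand-ins for a trained model.

```python
def post_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that are in the vocabulary.

    Returns a zero vector of length `dim` when no token is known.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known words
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 2-dimensional embeddings; a real word2vec model would have
# hundreds of dimensions and a far larger vocabulary.
embeddings = {"memory": [1.0, 0.0], "loss": [0.0, 1.0]}

vec = post_vector(["memory", "loss", "unknownword"], embeddings, dim=2)
empty = post_vector(["unknownword"], embeddings, dim=2)
```

Vectors built this way live in the same space as the word embeddings, which is what makes it possible to look up the words nearest to a given post or support vector.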

In [11]:
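import pandas
# Show up to 400 rows so the whole results table is displayed
pandas.set_option('display.max_rows', 400)
# One row per trained classifier; the first column of the file is the row index
classifiers = pandas.read_csv("classifiers.tsv", sep="\t", index_col=0)
# DataFrame.sort(columns=...) was removed from pandas; sort_values(by=...) replaces it
classifiers = classifiers.sort_values(by=["accuracy", "TN", "TP"], ascending=False)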
classifiers

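| | accuracy | N | board1 | board2 | board1N | board2N | N_train | N_test | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | 1.0 | 746 | Alzheimers | rage | 737 | 9 | 746 | 150 | 150 | 0 | 0 | 0 |
| 106 | 1.0 | 746 | Alzheimers | rage | 737 | 9 | 746 | 150 | 150 | 0 | 0 | 0 |
| 107 | 1.0 | 746 | Alzheimers | rage | 737 | 9 | 746 | 150 | 150 | 0 | 0 | 0 |
| 108 | 1.0 | 746 | Alzheimers | rage | 737 | 9 | 746 | 150 | 150 | 0 | 0 | 0 |
| 369 | 0.997809 | 4564 | BPD | rage | 4555 | 9 | 4564 | 913 | 911 | 2 | 0 | 0 |
| 370 | 0.997809 | 4564 | BPD | rage | 4555 | 9 | 4564 | 913 | 911 | 2 | 0 | 0 |
| 371 | 0.997809 | 4564 | BPD | rage | 4555 | 9 | 4564 | 913 | 911 | 2 | 0 | 0 |
| 372 | 0.997809 | 4564 | BPD | rage | 4555 | 9 | 4564 | 913 | 911 | 2 | 0 | 0 |
| 238 | 0.985075 | 1005 | AvPD | rage | 996 | 9 | 1005 | 201 | 196 | 3 | 2 | 0 |
| 74 | 0.976757 | 9248 | Alzheimers | loseit | 737 | 8511 | 9248 | 1850 | 100 | 13 | 1707 | 30 |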


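It's not really meaningful to evaluate classifiers when either board1 or board2 contributes only a handful of posts (for example, "rage", with 9), so let's just nix those. In the Alzheimers vs. rage rows, for instance, TN is 0, so the perfect accuracy comes entirely from predicting board1. We could of course go back and select better train/test sets based on the composition of the groups, but a rough split (20% of the entire N for test, 80% for train) is reasonable to start with.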
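As a rough illustration of what selecting train/test sets "based on composition of groups" could look like, here is a minimal sketch of a stratified 80/20 split that samples the test fraction separately within each board. The function name `stratified_split`, the seed, and the toy labels are assumptions for the example, not part of the original analysis.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing test_fraction of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take at least one test item per class so no class is absent from the test set
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels mirroring the Alzheimers (737) vs. rage (9) imbalance
labels = ["Alzheimers"] * 737 + ["rage"] * 9
train_idx, test_idx = stratified_split(labels)
test_counts = {b: sum(labels[i] == b for i in test_idx) for b in set(labels)}
```

Even with stratification, a board with 9 posts yields only a couple of test examples, which is why dropping such boards is a sensible first step.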

In [15]:
# Keep only pairs in which both boards have more than 200 posts
classifiers = classifiers[(classifiers.board1N > 200) & (classifiers.board2N > 200)]
classifiers

| | accuracy | N | board1 | board2 | board1N | board2N | N_train | N_test | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 74 | 0.976757 | 9248 | Alzheimers | loseit | 737 | 8511 | 9248 | 1850 | 100 | 13 | 1707 | 30 |
| 98 | 0.97173 | 14677 | Alzheimers | politics | 737 | 13940 | 14677 | 2936 | 67 | 15 | 2786 | 68 |
| 114 | 0.966527 | 13142 | Alzheimers | relationships | 737 | 12405 | 13142 | 2629 | 64 | 7 | 2477 | 81 |
| 126 | 0.965414 | 8384 | Alzheimers | sex | 737 | 7647 | 8384 | 1677 | 126 | 15 | 1493 | 43 |
| 86 | 0.963383 | 5733 | Alzheimers | narcolepsy | 737 | 4996 | 5733 | 1147 | 94 | 10 | 1011 | 32 |
| 22 | 0.963303 | 5447 | Alzheimers | EOOD | 737 | 4710 | 5447 | 1090 | 123 | 8 | 927 | 32 |
| 310 | 0.961296 | 5551 | BPD | amnesia | 4555 | 996 | 5551 | 1111 | 898 | 34 | 170 | 9 |
| 206 | 0.960568 | 9507 | AvPD | loseit | 996 | 8511 | 9507 | 1902 | 168 | 31 | 1659 | 44 |
| 230 | 0.958501 | 14936 | AvPD | politics | 996 | 13940 | 14936 | 2988 | 95 | 30 | 2769 | 94 |
| 14 | 0.958333 | 4677 | Alzheimers | CompulsiveSkinPicking | 737 | 3940 | 4677 | 936 | 110 | 9 | 787 | 30 |
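Because the remaining pairs are still imbalanced, it may help to read accuracy alongside per-board recall and a majority-class baseline. The sketch below does this for row 74 (Alzheimers vs. loseit), assuming that TP and FN count board1 posts and TN and FP count board2 posts; the table does not state this, but the counts are consistent with it. The `summarize` helper is hypothetical.

```python
def summarize(tp, fp, tn, fn):
    """Accuracy, per-board recall, balanced accuracy, and majority baseline."""
    total = tp + fp + tn + fn
    n_board1 = tp + fn  # assumed: board1 is the "positive" class
    n_board2 = tn + fp
    recall_board1 = tp / n_board1 if n_board1 else float("nan")
    recall_board2 = tn / n_board2 if n_board2 else float("nan")
    return {
        "accuracy": (tp + tn) / total,
        "recall_board1": recall_board1,
        "recall_board2": recall_board2,
        "balanced_accuracy": (recall_board1 + recall_board2) / 2,
        # Accuracy obtained by always predicting the larger board
        "majority_baseline": max(n_board1, n_board2) / total,
    }


# Confusion counts copied from row 74 of the table
row74 = summarize(tp=100, fp=13, tn=1707, fn=30)
for name, value in row74.items():
    print(f"{name}: {value:.3f}")
```

For this row the recall is about 0.77 on Alzheimers and about 0.99 on loseit, against a majority baseline of about 0.93, so the headline accuracy of 0.977 is less impressive than it first appears.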


It's interesting that the classifier cannot distinguish Alzheimer's very well from the "psychoticreddit" board, but distinguishes it very well from something like "sex" or "narcolepsy." I think that if we draw a simple Venn diagram comparing the word embedding vectors closest to the support vectors with those closest to vectors in both classes, we could get insight into which words were meaningful in deriving the model.
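One possible reading of the Venn diagram idea is sketched below: look up the nearest vocabulary words (by cosine similarity) for the support vectors and for example vectors from each board, then compare the resulting word sets. This assumes the SVM features live in the same space as the word embeddings (for instance, averaged word vectors). All names and the toy 2-dimensional data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(vectors, embeddings, k=2):
    """Union of the k vocabulary words closest to each vector in `vectors`."""
    words = set()
    for vec in vectors:
        ranked = sorted(embeddings, key=lambda w: cosine(vec, embeddings[w]),
                        reverse=True)
        words.update(ranked[:k])
    return words


# Hypothetical toy embeddings and vectors, for illustration only
embeddings = {
    "memory": (1.0, 0.1),
    "caregiver": (0.9, 0.3),
    "sleep": (0.1, 1.0),
    "cataplexy": (0.2, 0.9),
    "doctor": (0.7, 0.7),
}
support_vectors = [(0.6, 0.6)]
board1_vectors = [(1.0, 0.2)]
board2_vectors = [(0.15, 1.0)]

sv_words = nearest_words(support_vectors, embeddings)
b1_words = nearest_words(board1_vectors, embeddings)
b2_words = nearest_words(board2_vectors, embeddings)

# Regions of the Venn diagram that involve the support vectors
venn = {
    "support_only": sv_words - b1_words - b2_words,
    "support_and_board1": (sv_words & b1_words) - b2_words,
    "support_and_board2": (sv_words & b2_words) - b1_words,
    "all_three": sv_words & b1_words & b2_words,
}
```

Words that land in the support-vector regions would be candidates for the vocabulary that sits near the decision boundary.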