# reddit Comment Ranker
ACL 2015 Submission, February 2015  

### Demonstration notebook
This notebook is a brief guide to running our rank-learning pipeline on reddit data. 

You'll need recent versions of `scikit-learn` and `pandas` installed, and your directory structure should match the listings below.

To run feature extraction, you'll also need `sqlalchemy`, `pyhyphen`, and `nltk`.

In [1]:
ls .

commentDB.py  evaluation.py  features.py    README.md    settings.py
Demo.ipynb    extractor.py*  rankmodel.py*  scraper.py*  text_pre.py


In [2]:
ls ../data

data-askreddit.h5   redditDB-askreddit.sqlite
data-askscience.h5  redditDB-askscience.sqlite


## Run a ranking model
`rankmodel.py` takes as input the name of an HDF5 (.h5) dataset file containing a table of comments and submissions with precomputed features, as output by the `extractor.py` script.

The dataset has `cid` and `sid` fields, for comment and submission ID, which are used to identify and group the rows.

In [3]:
import pandas as pd
df = pd.read_hdf("../data/data-askscience.h5", 'data')
print(len(df))
print("\t".join(df.columns))  # available fields
df.head()

61924
n_chars	n_words	n_sentences	n_paragraphs	n_uppercase	tok_n_links	tok_n_emph	tok_n_nums	tok_n_quote	SMOG	entropy	pos_n_noun	pos_n_nounproper	pos_n_verb	pos_n_adj	pos_n_adv	pos_n_inter	pos_n_wh	pos_n_particle	pos_n_numeral	timedelta	position_rank	parent_term_overlap	parent_jaccard_overlap	parent_tfidf_overlap	informativeness_thread	informativeness_global	num_reports	distinguished	gilded	num_replies	convo_depth	is_mod	is_gold	has_verified_email	user_local_comment_count	user_local_comment_pos_karma	user_local_comment_neg_karma	user_local_comment_net_karma	user_local_comment_avg_pos_karma	user_local_comment_avg_neg_karma	user_local_comment_avg_net_karma	user_local_sub_count	user_local_sub_pos_karma	user_local_sub_neg_karma	user_local_sub_net_karma	user_local_sub_avg_pos_karma	user_local_sub_avg_neg_karma	user_local_sub_avg_net_karma	user_global_comment_count	user_global_comment_pos_karma	user_global_comment_neg_karma	user_global_comment_net_karma	user_global_comment_avg_pos_karma	user

Unnamed: 0,n_chars,n_words,n_sentences,n_paragraphs,n_uppercase,tok_n_links,tok_n_emph,tok_n_nums,tok_n_quote,SMOG,...,pos_f_verb,pos_f_adj,pos_f_adv,pos_f_inter,pos_f_wh,pos_f_particle,pos_f_numeral,cid,sid,truncated_score
0,3851,745,33,8,34,1,3,1,0,11.022393,...,0.134228,0.069799,0.068456,0,0.012081,0.042953,0.001342,t1_cdyze99,t3_1slyfe,3422
1,236,49,4,0,0,0,0,0,0,8.841846,...,0.183673,0.040816,0.0,0,0.020408,0.142857,0.0,t1_cdz9c6g,t3_1slyfe,16
2,305,64,3,1,3,0,0,0,0,7.793538,...,0.109375,0.078125,0.09375,0,0.03125,0.046875,0.0,t1_cdz22rj,t3_1slyfe,37
3,224,47,3,1,3,1,0,0,0,6.427356,...,0.212766,0.042553,0.085106,0,0.021277,0.06383,0.0,t1_cdz1y5q,t3_1slyfe,13
4,321,66,4,0,6,0,0,1,0,7.168622,...,0.075758,0.090909,0.0,0,0.030303,0.045455,0.0,t1_cdzkkvy,t3_1slyfe,2
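
Because every row carries the `sid` of its thread, per-thread operations reduce to a `groupby`. The following is a minimal sketch on a hand-made frame: the first three rows reuse `cid`, `sid`, and `truncated_score` values from the table above, while the `t3_example` thread and its comment IDs are made up for illustration. It is not taken from `rankmodel.py`.

```python
import pandas as pd

# Toy stand-in for the real dataset. The first three rows reuse values shown
# above; the last two belong to a hypothetical second thread.
toy = pd.DataFrame({
    "cid": ["t1_cdyze99", "t1_cdz9c6g", "t1_cdz22rj", "t1_aaaaaaa", "t1_bbbbbbb"],
    "sid": ["t3_1slyfe", "t3_1slyfe", "t3_1slyfe", "t3_example", "t3_example"],
    "truncated_score": [3422, 16, 37, 5, 9],
})

# Number of comments in each thread.
comments_per_thread = toy.groupby("sid").size()

# Rank comments within their own thread, highest score first (rank 1 = best).
toy["rank_in_thread"] = (
    toy.groupby("sid")["truncated_score"]
       .rank(ascending=False, method="first")
       .astype(int)
)

print(comments_per_thread)
print(toy)
```

The same pattern should apply to the full frame loaded with `pd.read_hdf`, since it exposes the same `sid` column.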


Next, train a model on the *askscience* dataset, using `log_score` as the regression target, a random forest (`-c rf`) as the model, and the "stock" parameters given in `settings.py`. For the random forest, these are:

In [4]:
import settings
settings.STANDARD_PARAMS['rf']

{'criterion': 'mse',
 'max_depth': 5,
 'max_features': 'auto',
 'n_estimators': 100}

`--fg all` selects the `all` feature group, which is also specified in `settings.py`.

In [5]:
settings.FEATURE_GROUPS['all']

{'SMOG',
 'entropy',
 'has_verified_email',
 'informativeness_global',
 'informativeness_thread',
 'is_gold',
 'is_mod',
 'n_chars',
 'n_paragraphs',
 'n_sentences',
 'n_uppercase',
 'n_words',
 'parent_jaccard_overlap',
 'parent_term_overlap',
 'parent_tfidf_overlap',
 'pos_f_adj',
 'pos_f_adv',
 'pos_f_inter',
 'pos_f_noun',
 'pos_f_nounproper',
 'pos_f_numeral',
 'pos_f_particle',
 'pos_f_verb',
 'pos_f_wh',
 'position_rank',
 'timedelta',
 'tok_n_emph',
 'tok_n_links',
 'tok_n_nums',
 'tok_n_quote',
 'user_global_comment_avg_neg_karma',
 'user_global_comment_avg_net_karma',
 'user_global_comment_avg_pos_karma',
 'user_global_comment_count',
 'user_global_comment_neg_karma',
 'user_global_comment_net_karma',
 'user_global_comment_pos_karma',
 'user_global_sub_avg_neg_karma',
 'user_global_sub_avg_net_karma',
 'user_global_sub_avg_pos_karma',
 'user_global_sub_count',
 'user_global_sub_neg_karma',
 'user_global_sub_net_karma',
 'user_global_sub_pos_karma',
 'user_local_comment_avg_ne

Now we'll actually train and evaluate the model. The train/test split is handled internally by the code:

In [6]:
%run rankmodel.py ../data/data-askscience.h5 \
            --fg all -t log_score -c rf --stock \
            --savename TEST --limit-data 5000

Original data dims: (61924, 86)
Using features: 
  SMOG
  entropy
  has_verified_email
  informativeness_global
  informativeness_thread
  is_gold
  is_mod
  n_chars
  n_paragraphs
  n_sentences
  n_uppercase
  n_words
  parent_jaccard_overlap
  parent_term_overlap
  parent_tfidf_overlap
  pos_f_adj
  pos_f_adv
  pos_f_inter
  pos_f_noun
  pos_f_nounproper
  pos_f_numeral
  pos_f_particle
  pos_f_verb
  pos_f_wh
  position_rank
  timedelta
  tok_n_emph
  tok_n_links
  tok_n_nums
  tok_n_quote
  user_global_comment_avg_neg_karma
  user_global_comment_avg_net_karma
  user_global_comment_avg_pos_karma
  user_global_comment_count
  user_global_comment_neg_karma
  user_global_comment_net_karma
  user_global_comment_pos_karma
  user_global_sub_avg_neg_karma
  user_global_sub_avg_net_karma
  user_global_sub_avg_pos_karma
  user_global_sub_count
  user_global_sub_neg_karma
  user_global_sub_net_karma
  user_global_sub_pos_karma
  user_local_comment_avg_neg_karma
  user_local_comment_avg_net_ka
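
How `rankmodel.py` performs its internal train/test split is not shown in this notebook. For ranking tasks, one common choice is to split by thread, so that all comments sharing a `sid` land on the same side. The sketch below illustrates that idea only; the function name, `test_fraction`, and `seed` are hypothetical and are not options of `rankmodel.py`.

```python
import random

def split_by_submission(sids, test_fraction=0.2, seed=0):
    """Label each row 'train' or 'test' so that no thread is split.

    `sids` is the per-row submission ID column. `test_fraction` and `seed`
    are illustrative parameters, not flags of rankmodel.py.
    """
    unique_sids = sorted(set(sids))
    rng = random.Random(seed)
    rng.shuffle(unique_sids)
    # Hold out a fraction of the threads (at least one).
    n_test = max(1, int(round(test_fraction * len(unique_sids))))
    test_sids = set(unique_sids[:n_test])
    return ["test" if s in test_sids else "train" for s in sids]

# Hypothetical thread IDs, three comments each.
sids = [s for s in ["t3_a", "t3_b", "t3_c", "t3_d", "t3_e"] for _ in range(3)]
labels = split_by_submission(sids)
print(list(zip(sids, labels)))
```

Splitting at the thread level keeps the evaluation honest for per-thread metrics, because a thread's comments are never ranked against siblings seen in training.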

The script will output a pickled model, an HDF5 file containing a table of predicted scores for each comment, a list of top features, and a .csv file of the NDCG results.

In [7]:
ls TEST.*

TEST.model.pkl  TEST.results.csv  TEST.scores.h5  TEST.topfeatures.txt


In [8]:
cat TEST.results.csv

,k,method,ndcg_test,ndcg_train
0,1,target,0.7697299494368539,0.818583403121477
1,2,target,0.8609206839639917,0.8911167135487507
2,3,target,0.8928766805696992,0.9162161894496057
3,4,target,0.9020730057404682,0.9233317410578216
4,5,target,0.9065253563907629,0.9266787876186752
5,6,target,0.9086718071385317,0.9273836629504505
6,7,target,0.9096612823553226,0.9279466460735105
7,8,target,0.9101772365861643,0.9287020413672444
8,9,target,0.9105601560823124,0.9288611596983916
9,10,target,0.9107192415745402,0.9289862947954527
10,11,target,0.9109188425539624,0.929021950053897
11,12,target,0.911025280561519,0.9290178406610781
12,13,target,0.9111300390436914,0.9290578468411812
13,14,target,0.9112079561054302,0.9293083020096758
14,15,target,0.9112523815019522,0.9293700494107329
15,16,target,0.9112967900149902,0.9293721396966182
16,17,target,0.9113445655020902,0.9293819069427458
17,18,target,0.9113663199470534,0.9293917949059699
18,19,target,0.9113896417284142,0.9293974026041729
19
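
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k for a single thread. It assumes a linear gain (the true score itself) and a log2 position discount; the implementation in `evaluation.py` may use a different gain or discount. The example values are the `log_score` and `pred_log_score` of the five test rows printed in the next cell, which is only the head of that thread.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, in the given order."""
    return sum(rel / math.log(i + 2, 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_scores, predicted_scores, k):
    """NDCG@k for one thread: order by prediction, gain = true score."""
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_scores[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(true_scores, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# log_score and pred_log_score for five comments of thread t3_1slyfe.
true_scores = [8.138273, 2.833213, 1.94591, 1.098612, 0.0]
pred_scores = [4.512632, 0.744502, 2.039198, 1.329672, 0.525887]

for k in (1, 3, 5):
    print(k, ndcg_at_k(true_scores, pred_scores, k))
```

The model places the highest-scoring comment first, so NDCG@1 is 1.0 for these rows; the later positions are partly misordered, which pulls NDCG@3 below 1.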

In [9]:
scoredf = pd.read_hdf("TEST.scores.h5", 'data')
scoredf[scoredf.set == 'test'].head()

Unnamed: 0,self_id,parent_id,cid,sid,set,log_score,pred_log_score,truncated_score
2587,t1_cdyze99,t3_1slyfe,t1_cdyze99,t3_1slyfe,test,8.138273,4.512632,3422
2588,t1_cdz9c6g,t3_1slyfe,t1_cdz9c6g,t3_1slyfe,test,2.833213,0.744502,16
2589,t1_cdyytsi,t3_1slyfe,t1_cdyytsi,t3_1slyfe,test,1.94591,2.039198,6
2590,t1_cdz0g4s,t3_1slyfe,t1_cdz0g4s,t3_1slyfe,test,1.098612,1.329672,2
2591,t1_cgssgym,t3_1slyfe,t1_cgssgym,t3_1slyfe,test,0.0,0.525887,0
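
The `log_score` values in this table are consistent with the natural log of one plus `truncated_score`. That relationship is inferred from the five rows shown, not confirmed against the source code, so treat the check below as a sketch.

```python
import math

# (truncated_score, log_score) pairs copied from the table above.
rows = [(3422, 8.138273), (16, 2.833213), (6, 1.94591), (2, 1.098612), (0, 0.0)]

for score, log_score in rows:
    reconstructed = math.log1p(score)  # natural log of (1 + score)
    print(score, log_score, round(reconstructed, 6))

# Largest gap between the displayed and the reconstructed values.
max_abs_error = max(abs(math.log1p(s) - ls) for s, ls in rows)
print(max_abs_error)
```

Under that assumption, a predicted value can be mapped back to the score scale with `math.expm1(pred_log_score)`.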


## Generating your own features

The `extractor.py` script builds a table of features that is compatible with `rankmodel.py`. This can be quite slow, but need only be done once per dataset. See Section 3 of the paper for details on these features, or `features.py` for their implementation.

The input to this step is a SQLite database file emitted by our scraper, `scraper.py`. This consists of Comment, Submission, and User objects, the schema for which is defined using SQLAlchemy and described in `commentDB.py`.
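
To see what a database file actually contains before running the extractor, the standard-library `sqlite3` module is enough. The sketch below lists each table with its column names without assuming anything about the schema in `commentDB.py`. The helper `list_tables` and the `demo_comments` table are made up for illustration; the demo runs against an in-memory database so that it works without the data files.

```python
import sqlite3

def list_tables(conn):
    """Return {table_name: [column names]} for an open SQLite connection."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is the name.
        cols = conn.execute('PRAGMA table_info("%s")' % name).fetchall()
        schema[name] = [col[1] for col in cols]
    return schema

# Demo on an in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_comments (cid TEXT PRIMARY KEY, sid TEXT, body TEXT)")
schema = list_tables(conn)
conn.close()
print(schema)
```

For a real file, open it with `sqlite3.connect("../data/redditDB-askscience.sqlite")` and pass that connection to the helper instead.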

In [10]:
# --f_pos runs part-of-speech tagging (slow)
# --N_subs limits the amount of data to process, by number of threads
#     omit this flag to process the whole dataset
%run extractor.py --dbfile ../data/redditDB-askscience.sqlite \
        --savename test.dataset.h5 --f_pos --N_subs 100

== Loading comments for 100 submissions ==
  last 500: 1.17 s (500 loaded)
  last 500: 1.31 s (1000 loaded)
  last 500: 1.46 s (1500 loaded)
  last 500: 1.45 s (2000 loaded)
  last 500: 1.64 s (2500 loaded)
  [loaded 2651 comments in 7.5 seconds]
== Generating global VSM ==
  [completed in 8.1 seconds]
== Generating per-thread VSMs ==
  [completed in 0.52 seconds]

Loaded user for t1_cdyze99 (user therationalpi)
Loaded user for t1_cdz9c6g (user dragon_fiesta)
Loaded user for t1_cdz22rj (user TurboSexophonic)
Loaded user for t1_cdz1y5q (user lydocia)
Loaded user for t1_cdzkkvy (user Fikap4us)
Loaded user for t1_cdyytsi (user trabo)
Loaded user for t1_cdz0g4s (user Fritzkreig)
Loaded user for t1_cgssgym (user IGLOOICE9)
Loaded user for t1_cgna1hq (user science_com)
Loaded user for t1_cgn7ta8 (user ramennoodle)
Loaded user for t1_cgn9swn (user iamadogforreal)
Loaded user for t1_cgn8zad (user Bobby_Newmark)
Loaded user for t1_cgnjugj (user KoNP)
Loaded user for t1_cgnivu8 (user karmabaiter)
Loaded user for t1_cgnssvd (user championchilli)
Loaded user for t1_cgn871f (user dabbin)
Loaded user for t1_cgnc8ox (user Giz0311)
Loaded user for t1_cgn9rqy (user CarbonFiber_Funk)
Loaded user for t1_cgnb0rv (user hutima)
Loaded user for t1_cgnipvm (user bloonail)
Loaded user for t1_cgnxxl2 (user ehj)
Loaded user for t1_cgo11xg (user nfaguy)
Loaded user for t1_c


== Extracting general features: 2651 total comments ==
  -> last 265: 5.89 s (10.0% done)


Loaded user for t1_cgv5rh3 (user dragonczeck)
Loaded user for t1_cgvalhc (user VPPTrue)
Loaded user for t1_cgvd7fy (user bdbbdbss)
Loaded user for t1_cgvlpim (user EvOllj)
Loaded user for t1_cgw1tk2 (user DurantOKCThunder)
Loaded user for t1_cgtyaim (user LengthContracted)
Loaded user for t1_cgty29y (user tauneutrino9)
Loaded user for t1_cgu3572 (user DamnInteresting)
Loaded user for t1_cgu4h0j (user Aethermancer)
Loaded user for t1_cgu30ni (user g0kh4n)
Loaded user for t1_cgua1tk (user Coffeeshopman)
Loaded user for t1_cgugwn2 (user SurprisedPotato)
Loaded user for t1_cguc6z5 (user OlliG)
Loaded user for t1_cgu0z7u (user Wikiwnt)
Loaded user for t1_cgukn7c (user WhoCaresFck)
Loaded user for t1_cgu5ub9 (user keepthepace)
Loaded user for t1_cgu6ozk (user Karl_der_Geile)
Loaded user for t1_cguoyej (user g253)
Loaded user for t1_cgu97ul (user MythosJunkie)
Loaded user for t1_cgugrx2 (user upboatugboat)
Loaded user for t1_cgum36u (user reptomin)
Loaded user for t1_cgum8tx (user rmoonsong)


  -> last 265: 6.28 s (20.0% done)


Loaded user for t1_c7xyma3 (user craigers521)
Loaded user for t1_c7xzj2p (user tobiasfunke415)
Loaded user for t1_c7y5f38 (user Keyframe)
Loaded user for t1_c7y7bb3 (user EvOllj)
Loaded user for t1_c7xwxaf (user Spomajom)
Loaded user for t1_c7y2a6l (user BeforeTime)
Loaded user for t1_c7y2vuo (user insigmah)
Missing user data for t1_c7y4aof
Loaded user for t1_c7y4mq9 (user fruzz)
Loaded user for t1_c7y4q9b (user lemmywink_1)
Loaded user for t1_c7y56c8 (user cbrules3033)
Loaded user for t1_c7y6z8w (user DensityStrike)
Loaded user for t1_c7y7gll (user Boonsfarb)
Loaded user for t1_c7y8ofn (user seatharama)
Loaded user for t1_c7ybtdo (user hugemuffin)
Loaded user for t1_c7yewq8 (user cocoamix)
Loaded user for t1_c7yf673 (user blue_thorns)
Loaded user for t1_c7yfasv (user beard5000)
Loaded user for t1_c7yg82s (user baileyaye)
Loaded user for t1_c7yhe9b (user easterlingman)
Loaded user for t1_c9qo1s7 (user lolzforfun)
Loaded user for t1_cawtidq (user USpeame93)
Loaded user for t1_c7y1qcg (


  -> last 265: 5.85 s (30.0% done)


Loaded user for t1_cdlz6us (user ThatInternetGuy)
Loaded user for t1_cdlwc5k (user ArabianNightmare)
Loaded user for t1_cdmg52h (user Zeakk1)
Loaded user for t1_cdmh8d8 (user mcM4rk)
Loaded user for t1_cgmuep0 (user ahumananimation)
Loaded user for t1_cdlzvzx (user JohnPombrio)
Loaded user for t1_cdmelu9 (user Thalesian)
Loaded user for t1_c5j8yzo (user jbeta137)
Loaded user for t1_c5j6e8h (user singlewordedpoem)
Loaded user for t1_c5j7doi (user jecniencikn)
Loaded user for t1_c5j8wnx (user jenkel)
Loaded user for t1_c5jb6vp (user MostlyIronicLatinGuy)
Missing user data for t1_c5j8kbt
Loaded user for t1_c5jbovb (user jmpherso)
Loaded user for t1_c5j7ko5 (user Ikkath)
Loaded user for t1_c5j7f07 (user boonamobile)
Loaded user for t1_c5jai48 (user tute666)
Loaded user for t1_c5jkbdo (user Klashus)
Loaded user for t1_c5jkha3 (user Gingerbreadman_)
Loaded user for t1_c5j74tr (user Epilepep)
Loaded user for t1_c5j9u3h (user eleventhzeppelin)
Missing user data for t1_c5j6v2x
Loaded user for 


  -> last 265: 5.59 s (40.0% done)


Loaded user for t1_cecwj94 (user skrupa15)
Loaded user for t1_cecyb11 (user gunnk)
Loaded user for t1_cecyzmm (user trekyncc1701)
Loaded user for t1_cecz2f5 (user Philosopher1618)
Loaded user for t1_cectw0n (user thewetness)
Loaded user for t1_cecw8gn (user StarStealingScholar)
Loaded user for t1_ced2t01 (user otterbry)
Loaded user for t1_cecqxdc (user abracadabramonkey)
Loaded user for t1_cecl6b9 (user MDFrankey)
Loaded user for t1_cecli7f (user atad2much)
Loaded user for t1_cecm4ze (user sagequeen)
Loaded user for t1_cecmfqp (user jdepps113)
Loaded user for t1_cecsiqt (user baby_pool_lifeguard)
Loaded user for t1_ced8hvk (user MyRealSelfie)
Loaded user for t1_cecm0qp (user hsfrey)
Loaded user for t1_cd4iz9l (user Twigs2013)
Loaded user for t1_cd4i3nh (user chui101)
Loaded user for t1_cd4ovzl (user IhasAcellular)
Loaded user for t1_cd4jbev (user shiningPate)
Loaded user for t1_cd4t2gz (user NiTenIchiRyu)
Loaded user for t1_cd4kqso (user Hurricane1088)
Loaded user for t1_cd4kzdu (user


  -> last 265: 5.68 s (50.0% done)


Loaded user for t1_caukii6 (user MoneyIsTiming)
Loaded user for t1_caukqik (user Noob2Pro)
Loaded user for t1_cau8kg5 (user Atheist_Simon_Haddad)
Loaded user for t1_cauab6r (user boarhog)
Loaded user for t1_cau0eul (user S3XonWh33lz)
Loaded user for t1_cau79nz (user Venarius)
Loaded user for t1_caucrt5 (user FellOutOfTheWindow)
Loaded user for t1_caura7s (user tertl3)
Loaded user for t1_cavgpuf (user Lemmings1583)
Loaded user for t1_cau8s0c (user adomorn)
Loaded user for t1_c9w7gu6 (user AbramsLullaby)
Loaded user for t1_c9w81d5 (user MyRespectableAccount)
Loaded user for t1_c9wikny (user chrisamiller)
Loaded user for t1_c9wfxko (user SoulWager)
Missing user data for t1_c9w97hi
Loaded user for t1_c9wcryp (user darkenspirit)
Loaded user for t1_c9wk2u3 (user gojukebox)
Loaded user for t1_c9whk0k (user SpaceToaster)
Loaded user for t1_c9wi187 (user LNMagic)
Loaded user for t1_c9w6q8m (user derglow)
Loaded user for t1_c9w6cac (user intogamer)
Loaded user for t1_c9whteq (user lmnoonml)
Loa


  -> last 265: 5.52 s (60.0% done)


Loaded user for t1_c7mmotm (user lars332)
Loaded user for t1_c7mn50d (user kivarenn82)
Missing user data for t1_c7mnhdy
Loaded user for t1_c7mnypq (user ShakaUVM)
Loaded user for t1_c7moj3a (user SirUtnut)
Loaded user for t1_c7mokwl (user coolmichelle)
Loaded user for t1_c7moxpq (user joekuli)
Loaded user for t1_c7mp7bh (user calinet6)
Loaded user for t1_c7mpz6n (user DyslexicHobbit)
Loaded user for t1_c7mq25d (user thePlunger)
Missing user data for t1_c7mqkki
Missing user data for t1_c7mqz25
Loaded user for t1_c7mrctd (user The_Empty_Chair)
Loaded user for t1_c7mvrd8 (user YourConsciousness)
Loaded user for t1_c7mxqux (user SandwichSound)
Loaded user for t1_c7mljb0 (user divvd)
Loaded user for t1_c7mrhe9 (user EvM)
Loaded user for t1_c7nfj81 (user PLUR11)
Loaded user for t1_c95swro (user orbital1337)
Loaded user for t1_c95rqrz (user qdvision)
Loaded user for t1_c95ts7w (user T0mo)
Loaded user for t1_c95rq74 (user the_petman)
Loaded user for t1_c95wmqp (user Shmalculus)
Loaded user fo


  -> last 265: 5.39 s (70.0% done)


Loaded user for t1_cggea97 (user jimjamcunningham)
Loaded user for t1_cggdv6a (user UberSam)
Loaded user for t1_cggdiep (user Enolator)
Loaded user for t1_cggbdos (user Arthur_Boo_Radley)
Loaded user for t1_cggdejo (user Lonelyunderpants)
Loaded user for t1_cggdu57 (user Faptiludrop)
Loaded user for t1_cggkq3l (user totakreasonadam)
Loaded user for t1_cbeb09a (user TheDVille)
Loaded user for t1_cbe9wr1 (user verticalfuzz)
Loaded user for t1_cbe9ulu (user rupert1920)
Loaded user for t1_cbeomfp (user asking_science)
Loaded user for t1_cben64z (user Quarter_Twenty)
Loaded user for t1_cbecgr6 (user kotzwuerg)
Loaded user for t1_c9pldid (user adamsolomon)
Loaded user for t1_c9pl4jf (user iorgfeflkd)
Loaded user for t1_c9plrsh (user mc2222)
Loaded user for t1_c9pqy6n (user Strayphoenix6)
Loaded user for t1_c9pri97 (user carlinco)
Loaded user for t1_c9pp8pu (user jbeta137)
Loaded user for t1_c9przj5 (user Jake0024)
Loaded user for t1_c9pvkdl (user NuneShelping)
Loaded user for t1_c9pxjfj (us


  -> last 265: 5.28 s (80.0% done)


Loaded user for t1_ce4s025 (user impetus6)
Loaded user for t1_ce4uei8 (user some_generic_dude)
Missing user data for t1_ccv3gln
Loaded user for t1_ccvd57j (user jjsav)
Loaded user for t1_ccv8td6 (user iamfuzzydunlop)
Loaded user for t1_ccv6lc8 (user atomfullerene)
Loaded user for t1_ccv6wwo (user JimbobjoDahobo)
Loaded user for t1_ccve79g (user WhiteZoneShitAgain)
Loaded user for t1_ccvfb49 (user ataraxic89)
Loaded user for t1_ccvcdm3 (user zombie_eyes)
Loaded user for t1_ccvcn16 (user motsanciens)
Loaded user for t1_ccvbe10 (user Lonebeast)
Loaded user for t1_ccvcnpr (user grenya07)
Loaded user for t1_ccvcygk (user BravoFoxtrotDelta)
Loaded user for t1_ccvdfr2 (user SwanJumper)
Loaded user for t1_ccvifzp (user iamroth)
Loaded user for t1_ccvka8w (user jokoon)
Loaded user for t1_ccvkfg0 (user LARK12)
Loaded user for t1_ccvkj6i (user Thistookmedays)
Loaded user for t1_ccv8xg0 (user caliopy)
Loaded user for t1_ccv4ksy (user ballinlb)
Loaded user for t1_ccv5n91 (user mglee)
Loaded user f


  -> last 265: 5.20 s (90.0% done)


Loaded user for t1_caon9de (user Pneu6)
Loaded user for t1_caolp5l (user eosha)
Loaded user for t1_caomyyz (user Problem119V-0800)
Loaded user for t1_caolwqx (user z940912)
Loaded user for t1_caotpky (user dtkb)
Loaded user for t1_caourry (user olynyk)
Loaded user for t1_caomgn2 (user anthropophobe)
Loaded user for t1_caon5uq (user IanP23)
Loaded user for t1_caonqjh (user Braekyn)
Loaded user for t1_caos350 (user everycredit)
Loaded user for t1_caotry4 (user darkman41)
Loaded user for t1_caozt1h (user Mach10X)
Loaded user for t1_caorhc2 (user TheGrammarBolshevik)
Loaded user for t1_caosv7c (user Grimk)
Loaded user for t1_caovbmh (user darter22)
Loaded user for t1_caovvfz (user spinningmagnets)
Loaded user for t1_caowkl1 (user thecoyote23)
Loaded user for t1_caoz9q3 (user hotsizzlepancakes)
Loaded user for t1_caon4sd (user joshberry90)
Loaded user for t1_caowjni (user interputed)
Loaded user for t1_caownwk (user The_Realest_Realism)
Loaded user for t1_c8x9qt5 (user Funkit)
Loaded user 


  -> last 265: 5.23 s (100.0% done)
  [completed 2651 in 55.92 s]
  (21 ms per comment)
== Extracting local features: 2651 total comments ==
  -> last 100: 12.99 s (3.8% done)
  -> last 100: 5.23 s (7.5% done)
  -> last 100: 8.99 s (11.3% done)
  -> last 100: 5.13 s (15.1% done)
  -> last 100: 8.19 s (18.9% done)
  -> last 100: 9.84 s (22.6% done)
  -> last 100: 10.08 s (26.4% done)
  -> last 100: 11.76 s (30.2% done)
  -> last 100: 9.51 s (33.9% done)
  -> last 100: 14.03 s (37.7% done)
  -> last 100: 10.94 s (41.5% done)
  -> last 100: 6.61 s (45.3% done)
  -> last 100: 11.27 s (49.0% done)
  -> last 100: 6.75 s (52.8% done)
  -> last 100: 10.16 s (56.6% done)
  -> last 100: 8.86 s (60.4% done)
  -> last 100: 7.31 s (64.1% done)
  -> last 100: 9.49 s (67.9% done)
  -> last 100: 8.25 s (71.7% done)
  -> last 100: 8.84 s (75.4% done)
  -> last 100: 10.01 s (79.2% done)
  -> last 100: 9.45 s (83.0% done)
  -> last 100: 9.57 s (86.8% done)
  -> last 100: 8.45 s (90.5% done)
  -> last 10


Loaded user for t1_c8yjiqo (user myelination)
Loaded user for t1_c8yemd5 (user trout007)


In [11]:
testdf = pd.read_hdf("test.dataset.h5", 'data')
print(len(testdf))
testdf.head()

2651


Unnamed: 0,n_chars,n_words,n_sentences,n_paragraphs,n_uppercase,tok_n_links,tok_n_emph,tok_n_nums,tok_n_quote,SMOG,...,pos_f_nounproper,pos_f_verb,pos_f_adj,pos_f_adv,pos_f_inter,pos_f_wh,pos_f_particle,pos_f_numeral,cid,sid
0,3851,745,33,8,34,1,3,1,0,11.022393,...,0.012081,0.134228,0.069799,0.068456,0,0.012081,0.042953,0.001342,t1_cdyze99,t3_1slyfe
1,236,49,4,0,0,0,0,0,0,8.841846,...,0.0,0.183673,0.040816,0.0,0,0.020408,0.142857,0.0,t1_cdz9c6g,t3_1slyfe
2,305,64,3,1,3,0,0,0,0,7.793538,...,0.015625,0.109375,0.078125,0.09375,0,0.03125,0.046875,0.0,t1_cdz22rj,t3_1slyfe
3,224,47,3,1,3,1,0,0,0,6.427356,...,0.042553,0.212766,0.042553,0.085106,0,0.021277,0.06383,0.0,t1_cdz1y5q,t3_1slyfe
4,321,66,4,0,6,0,0,1,0,7.168622,...,0.060606,0.075758,0.090909,0.0,0,0.030303,0.045455,0.0,t1_cdzkkvy,t3_1slyfe
