This repository has been archived by the owner. It is now read-only.
testing out some trending algorithms, mostly written in hadoop pig
kalman_filter
lib
pig
site
tt
.gitignore
README
cheese_tweets.sample.tsv
rc

README

trying out some simple trending algorithms

notes on http://matpalm.com/blog/tag/e15

TODOS:
 - inverted index from trending terms back to the documents that use them
 - ability to facet; ie trends per forum as well as overall trends
 - include ignore_punc.rb (eg ["can","'","t"] -> ["can't"])
 - only count term once per document (?)
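
a rough sketch of the ignore_punc idea from the list above. this is not the repo's ignore_punc.rb; rejoin_apostrophes is a made-up name and the rule (word, "'", word becomes one token) is an assumption about what that script should do.

```ruby
# hypothetical helper, not the repo's ignore_punc.rb:
# glue back together tokens that a tokenizer split around an apostrophe
def rejoin_apostrophes(tokens)
  out = []
  i = 0
  while i < tokens.length
    # pattern: word, "'", word -> one token
    if tokens[i + 1] == "'" && i + 2 < tokens.length &&
       tokens[i] =~ /\A\w+\z/ && tokens[i + 2] =~ /\A\w+\z/
      out << "#{tokens[i]}'#{tokens[i + 2]}"
      i += 3
    else
      out << tokens[i]
      i += 1
    end
  end
  out
end

p rejoin_apostrophes(["i", "can", "'", "t", "sleep"])
# prints ["i", "can't", "sleep"]
```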

DATA PREP:

# need to sort by time, not id, since that's how we bucket into the timeslots
$ zcat ../tt/tt.posts.tsv.gz | head -n1000 | sort -t$'\t' -k2 -n | ../tt/extract_body.rb | split_into_chunks.rb
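
why the sort matters, as a sketch: if chunking walks the posts in order and starts a new chunk whenever the timeslot changes, the input has to arrive in time order. the code below assumes each post is [epoch_seconds, body] and a fixed slot width; the real split_into_chunks.rb may well chunk differently, and split_into_timeslots is a hypothetical name.

```ruby
# hypothetical sketch, not the repo's split_into_chunks.rb.
# assumes posts are [epoch_seconds, body] pairs already sorted by time.
def split_into_timeslots(posts, slot_seconds)
  # start a new chunk whenever two neighbouring posts fall in different slots
  posts.slice_when { |a, b| a[0] / slot_seconds != b[0] / slot_seconds }
       .map { |rows| rows.map { |_time, body| body } }
end

posts = [[0, "a"], [10, "b"], [3600, "c"], [3700, "d"], [7300, "e"]]
p split_into_timeslots(posts, 3600)
# prints [["a", "b"], ["c", "d"], ["e"]]
```

if the same posts arrive sorted by id rather than time, a single timeslot can end up split across several chunks.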

v1)

only consider a token's freq in the chunks where that token occurs

to run ruby version
bash> source rc
then see lib/run.sh for the end-to-end script that builds all the data for the project page graphs

to run pig version
cd pig
cat run.sh for info

trending score = fraction over twice sd

v3a)

combination; start considering a token when it is first seen, but from then on, if the token is not seen in a timeslice,
assume a zero value for that timeslice
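
to make the difference from v1 concrete, a small sketch. it assumes each timeslice has already been reduced to a {token => count} hash; sparse_series and zero_filled_series are made-up names, not functions from lib/.

```ruby
# v1 style: a token's series only has entries for slices where it occurs
def sparse_series(chunks)
  series = Hash.new { |h, k| h[k] = [] }
  chunks.each { |counts| counts.each { |token, n| series[token] << n } }
  series
end

# v3a style: series starts when the token is first seen,
# then every later slice adds a value (zero if the token is absent)
def zero_filled_series(chunks)
  series = Hash.new { |h, k| h[k] = [] }
  chunks.each do |counts|
    counts.each_key { |token| series[token] } # register first-seen tokens
    series.each { |token, values| values << counts.fetch(token, 0) }
  end
  series
end

chunks = [{ "cheese" => 2 }, { "toast" => 1 }, { "cheese" => 3 }]
p sparse_series(chunks)      # prints {"cheese"=>[2, 3], "toast"=>[1]}
p zero_filled_series(chunks) # prints {"cheese"=>[2, 0, 3], "toast"=>[1, 0]}
```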

forget about fft cases

consider trending value BEFORE folding chunk into model 
(makes huge difference to 1,2,3,2,2,3,4,20 style cases)

trending score = fraction of sd over the mean
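
a worked version of the 1,2,3,2,2,3,4,20 case. the score here is an assumption: (value - mean) / sd, ie how many standard deviations the new value sits above the mean, using the population sd. that is only one reading of the line above, and the ruby and pig versions may compute it differently.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# population standard deviation
def sd(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / xs.size)
end

# assumed score: number of sds the value sits above the model's mean
def score(model, value)
  (value - mean(model)) / sd(model)
end

series = [1, 2, 3, 2, 2, 3, 4, 20]
history, latest = series[0...-1], series[-1]

before = score(history, latest) # model has not seen the 20 yet
after  = score(series, latest)  # the 20 has already been folded in

puts format("before folding: %.2f, after folding: %.2f", before, after)
```

under that assumed score the spike comes out at about 19.4 sds when scored before folding but only about 2.6 once the 20 has inflated both the mean and the sd.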

if a token appears n times in a single document it counts for n in the chunk;
deciding whether to change this to counting for 1 per document in the chunk, since a post like 'shut up shut up shut up shut up shut up shut up' causes grief.
perhaps tf/idf would be better actually...
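
the two counting options side by side, as a sketch. chunk_counts and the once_per_doc flag are hypothetical, and each document is assumed to be an already tokenised array.

```ruby
# count tokens across the documents of one chunk.
# once_per_doc: true makes a token count at most 1 per document.
def chunk_counts(docs, once_per_doc: false)
  counts = Hash.new(0)
  docs.each do |tokens|
    tokens = tokens.uniq if once_per_doc
    tokens.each { |t| counts[t] += 1 }
  end
  counts
end

docs = [%w[shut up shut up shut up], %w[wake up]]
p chunk_counts(docs)                     # prints {"shut"=>3, "up"=>4, "wake"=>1}
p chunk_counts(docs, once_per_doc: true) # prints {"shut"=>1, "up"=>2, "wake"=>1}
```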