This repository has been archived by the owner. It is now read-only.
testing out some trending algorithms, mostly written in hadoop pig http://matpalm.com/blog
trying out some simple trending algorithms
notes on http://matpalm.com/blog/tag/e15

TODOS:
- inverted index from trending terms back to the documents that use them
- ability to facet; ie trends per forum as well as overall trends
- include ignore_punc.rb (eg ["can","'","t"] -> ["can't"])
- only count term once per document (?)

DATA PREP:
# need to sort by time, not id, since that's how we bucket into the timeslots
$ zcat ../tt/tt.posts.tsv.gz | head -n1000 | sort -t$'\t' -k2 -n | ../tt/extract_body.rb | split_into_chunks.rb

v1)
only consider a token's frequency in the timeslots where the token occurs

to run ruby version
bash> source rc
then see lib/run.sh for the end to end script to build all the data for generating the prj page graphs

to run pig version
cd pig
cat run.sh for info

trending score = fraction over twice sd

v3a)
combination; start considering a token when it is first seen, but from then on, if the token is not seen in a timeslice, assume a zero value for that timeslice
forget about fft cases
consider trending value BEFORE folding chunk into model (makes a huge difference to 1,2,3,2,2,3,4,20 style cases)
trending score = fraction of sd over the mean

if a token appears n times in a single document it counts for n in the chunk; deciding whether to change this to counting for 1 per chunk (since cases of a post like 'shut up shut up shut up shut up shut up shut up' cause grief)
perhaps tf/idf would be better actually...
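
a rough sketch of the v3a idea in ruby, to make the "score BEFORE folding" point concrete. this is not the code in lib/ or pig/; the names TokenModel and process_chunk are made up for illustration, and it assumes "fraction of sd over the mean" means (value - mean) / sd using the population sd of the counts seen so far. a chunk is assumed to be a hash of token => count for one timeslot.

```ruby
# Running mean/variance for one token (Welford's online algorithm).
# Hypothetical helper, not taken from the repository.
class TokenModel
  attr_reader :n, :mean

  def initialize
    @n = 0
    @mean = 0.0
    @m2 = 0.0 # running sum of squared deviations from the mean
  end

  # Fold one timeslot's count into the model.
  def fold(value)
    @n += 1
    delta = value - @mean
    @mean += delta / @n
    @m2 += delta * (value - @mean)
  end

  # Population standard deviation of the counts folded so far (assumption).
  def sd
    @n.zero? ? 0.0 : Math.sqrt(@m2 / @n)
  end

  # How many standard deviations the value sits above the mean;
  # nil while there is no spread to compare against.
  def score(value)
    s = sd
    s.zero? ? nil : (value - @mean) / s
  end
end

# Process one chunk (token => count for a single timeslot).
# Returns token => trending score for every token that could be scored.
def process_chunk(models, chunk_counts)
  scores = {}
  # Tokens seen in earlier chunks but absent from this one count as zero.
  (models.keys | chunk_counts.keys).each do |token|
    value = chunk_counts.fetch(token, 0)
    model = (models[token] ||= TokenModel.new)
    s = model.score(value) # score BEFORE folding the chunk into the model
    scores[token] = s unless s.nil?
    model.fold(value)
  end
  scores
end

models = {}
[1, 2, 3, 2, 2, 3, 4, 20].each do |count|
  scores = process_chunk(models, { "spike" => count })
  puts "count=#{count} score=#{scores['spike'].inspect}"
end
```

scoring the final 20 against the model built from 1,2,3,2,2,3,4 gives a much larger value than scoring it after the 20 has already been folded in, since the spike itself inflates both the mean and the sd.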