distance to philosophy
http://xkcd.com/903/
not such a big deal
questions
1) do all wikipedia articles link to philosophy?
2) what distribution do the distances take?
method:
1) get wikipedia dump from volume
2) parse to make a graph; term -> term
3) connected components; is it one? if not which one is philosophy in?
4) histogram of distances
part 1) get wikipedia dump from volume
go to snapshot snap-1781757e Wikipedia Extraction-WEX (Linux)
from it make a volume vol-? (in, say, us-east-1c)
make an instance; ebs backed
attach volume to instance i-71941410 (also in us-east-1c)
device /dev/sdk
copy from ebs to hdfs
mkdir wiki;
sudo mount /dev/xvdk wiki
hadoop fs -mkdir /full/articles
hadoop fs -copyFromLocal wiki/rawd/freebase-wex-2009-01-12-articles.tsv /full/articles_one_file # 7 min
hadoop fs -mkdir /full/redirects
hadoop fs -copyFromLocal wiki/rawd/freebase-wex-2009-01-12-redirects.tsv /full/redirects
and for testing...
hadoop fs -mkdir /sample/articles
head -n100 wiki/rawd/freebase-wex-2009-01-12-articles.tsv > sample
hadoop fs -copyFromLocal sample /sample/articles/freebase-wex-2009-01-12-articles.tsv
the interesting file is wiki/rawd/freebase-wex-2009-01-12-articles.tsv
which is 31G; 4,183,153 articles
http://wiki.freebase.com/wiki/WEX/Documentation
it has 5 columns
0 - id
1 - title
2 - date
3 - xml
4 - plain text
(maybe ignore this)
before going too far it'd be interesting to extract the actual list of titles..
hadoop jar ~/contrib/streaming/hadoop-streaming.jar -input /sample/articles/ -output /sample/titles -mapper '/usr/bin/cut -f2' -numReduceTasks 0
(maybe ignore this)
first need to split into chunks for in/out of s3 (with gzipping)
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-D mapred.min.split.size=419430400 \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /full/articles_one_file/ -output /full/articles \
-mapper /bin/cat -numReduceTasks 0
reduces it to 79 files, 75mb each, 6gb total
after playing with the data a bit (and refining the algorithm) the basic heuristic is
ignore until first <sentence>
find first <target> that isn't article name (as often, the first one is)
of course it's not that simple, there are lots of extra special cases...
eg [File:BSicon ABZvlr.svg] [Category:AB Castellón players] or [Template:Sharpness Branch Line]
what are the distinct values of these?
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /sample/articles/ -output /sample/article_types \
-mapper '/usr/bin/python metaArticleTypes.py' -file metaArticleTypes.py \
-reducer aggregate
of course, nothing is clean :D
$ hfs -cat /full/article_types/*|sort -k2 -t' ' -nr|head -n20
normal file 2684331
File 862604
Category 434783
Template 164138
Portal 15543
Portal talk 1175
File talk 903
Category:Wikipedians by alma mater 739
Meanings of minor planet names 218
List of minor planets 203
ISO 3166-2 198
List of United Kingdom locations 125
List of drugs 114
Star Wars 91
Star Trek 85
Template:Ph 83
Library of Congress Classification 83
Theme Time Radio Hour 81
Live Phish Downloads 74
Batman 66
so maybe ignore; File, Category, Template, Portal, Portal talk, File talk
curious now, what do these represent?
$ hfs -cat /full/articles/part-00232.gz | gunzip | cut -f2 | grep ^File | shuf | head
File:TriGeo Logo.JPG
File:Tuckerdorothy.jpg
File:Hippoquarium.jpg
File:Gbridge-cap.jpg
File:Twoc.jpg
File:Eurovision 81.jpg
File:BMW 003 jet engine.JPG
File:Lochailort.jpg
File:New Zealand General Service Medal - Iraq.jpg
File:Pixiesheadon.jpg
$ hfs -cat /full/articles/part-00232.gz | gunzip | cut -f2 | grep ^Category | shuf | head
Category:User bcl-3
Category:Sport in Hamilton, New Zealand
Category:People murdered in Norfolk Island
Category:Top-importance Old-time Base Ball articles
Category:Calgary Mustangs players
Category:NA-Class Japanese baseball articles
Category:Deaths by firearm in Nebraska
Category:Irish folk-song collectors
Category:University of Maryland, Baltimore County faculty
Category:Human death in Nebraska
$ hfs -cat /full/articles/part-00232.gz | gunzip | cut -f2 | grep ^Portal | shuf | head
Portal:Furry/Did you know/2
Portal:Western Australia/Selected article/September 2008
Portal:Edgar Allan Poe/Selected picture/October
Portal:Tropical cyclones/Featured article/Monsoon trough
Portal:Spaceflight/On This Day/5 September
Portal:Philadelphia/Philadelphia news/September 2008
Portal:BBC/Selected article/2
Portal:Greater Manchester/Did you know/archive
Portal:Japan/Did you know/56
Portal talk:Trains/Anniversaries/August 30
ignore until first <sentence>
find first <target> that
isn't article name (as often, the first one is)
doesn't start with File:, Category:, Template:, Portal:, Portal talk:, File talk:
sometimes we just need to ignore the "article" too.
i think when the 'plain text' (col[4]) is <100 characters it's probably a meta article too...
in fact this seems to be a more general case;
so...
ignore if plain_text < 100 chars
ignore until first sentence
find first <target> that isn't article name
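as a rough sketch in python (not the actual articleParser.py; the regexes and the 100-char threshold just illustrate the rules above):
import re

META_PREFIXES = ('File:', 'Category:', 'Template:', 'Portal:',
                 'Portal talk:', 'File talk:')

def first_link(title, xml, plain_text):
    if title.startswith(META_PREFIXES):
        return None                              # meta "article"
    if len(plain_text) < 100:
        return None                              # probably meta too
    m = re.search(r'<sentence>', xml)
    if m is None:
        return None                              # ignore until first <sentence>
    for target in re.findall(r'<target>(.*?)</target>', xml[m.start():]):
        if target != title and not target.startswith(META_PREFIXES):
            return target                        # first target that isn't the article name
    return None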
there is also another interesting file, freebase-wex-2009-01-12-redirects.tsv,
which i suspect will be required since sometimes the second target will require redirect dereference
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /sample/articles/ -output /sample/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py \
-numReduceTasks 0
agassi should be 'List of ATP number 1 ranked players'
(would be 'Kirk Kerkorian', his middle name, if we include synthetic links)
remove all templates
<template.*?</template>
remove all synthetic links
<link synthetic="true">.*?</link>
first target
andre agassi link removed since synthetic
structure was...
articles
article
paragraph
sentence
first target
though i think just targets after removing templates is enough...
comparing with ayn_rand
it would be 'American values' if we include 'link synthetic="true"'
but 'novelist' if we exclude 'link synthetic="true"'
sometimes there is text before the 1st paragraph, eg disambiguation info...
so need to trim to first paragraph too!!
final is then..
ignore "articles" that have a name starting with 'File:', 'Template:',etc
ignore "articles" that have plain text less than 30 chars
trim to first paragraph
remove all templates
<template.*?</template>
remove all synthetic links
<link synthetic="true">.*?</link>
first target
andre agassi link removed since synthetic
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /sample/articles/ -output /sample/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py \
-numReduceTasks 0
no_outbound_link_found 15,491
plain_text_too_short 61,863
metafile 1,480,805
Map input records 4,183,153
Map output records 2,624,994
looking through the diff of the from_nodes to the to_nodes in the edges
looks like we need to convert the to_nodes to their redirects
cut -f1 all.edges | sort > all.edges.from # 2624994 lines
cut -f2 all.edges | uniq | sort | uniq > all.edges.to # 497461 lines
the redirects file isn't huge (3.3e6 rows, 154mb) so i thought it'd be feasible to do this in memory.
eg see dereferenceToLinks.py
but it's a complete fail, even split into 16 chunks & run in parallel on a cc1.4xlarge.
(running 10+hrs, with each dereference dict lookup taking roughly 2s for every 3 records?! that's a 45hr runtime)
i obviously need to learn me some more python...
the main reason i did this was i thought there'd be multiple redirects, but looking at ~200e3 samples it's not the
case; if there is a redirect it's dereferenced directly
which means this is just a join; do it in pig
-- pig -p SET=sample|full -f edges_dereferenced.pig
edges = load '$SET/edges' as (from_node: chararray, to_node: chararray);
redirects_with_id = load '$SET/redirects' as (id:long, from_node: chararray, to_node: chararray);
redirects = foreach redirects_with_id generate from_node, to_node;
joined = join edges by to_node left outer, redirects by from_node;
edges_dereferenced = foreach joined generate
edges::from_node as from_node,
(redirects::to_node is null ? edges::to_node : redirects::to_node) as to_node;
store edges_dereferenced into '$SET/edges_dereferenced';
takes a minute. though of course this is not a fundamental pig vs python thing, it's an algorithm difference.
a simpler merge approach could have been done in python much faster too, i'm sure
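eg the obvious dict version (a sketch; file names assumed):
redirects = {}
for line in open('redirects.tsv'):               # id, from, to
    _id, from_node, to_node = line.rstrip('\n').split('\t')
    redirects[from_node] = to_node

for line in open('all.edges'):
    from_node, to_node = line.rstrip('\n').split('\t')
    # dict lookup is O(1); this should stream through in minutes
    print('%s\t%s' % (from_node, redirects.get(to_node, to_node)))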
cut -f1 all.edges_dereferenced | sort > all.edges_dereferenced.from # 2624994 lines (sanity)
cut -f2 all.edges_dereferenced | uniq | sort | uniq > all.edges_dereferenced.to # 455224 lines
now we can work on calculating the distance for each page from 'Philosophy', and this is simply a breadth first search
again, trying to be pragmatic, i wrote a version in python (distanceFromPhilosophy.py) but my god it's slow...
?seconds for ? lookups in a dict? what am i missing?
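(for the record, the BFS itself is tiny; a sketch, not the actual distanceFromPhilosophy.py. note it walks the edges in reverse, out from 'Philosophy'):
from collections import defaultdict, deque

graph = defaultdict(list)                        # to_node -> [from_node, ...]
for line in open('all.edges_dereferenced'):
    from_node, to_node = line.rstrip('\n').split('\t')
    graph[to_node].append(from_node)

distance = {'Philosophy': 0}
queue = deque(['Philosophy'])
while queue:
    node = queue.popleft()
    for neighbour in graph[node]:
        if neighbour not in distance:
            distance[neighbour] = distance[node] + 1
            queue.append(neighbour)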
Tue Aug 2 05:39:38 UTC 2011
=== move to newer version of the dump
mkfifo articles
hadoop fs -copyFromLocal articles /full/articles-2011-07-08/freebase-wex-2011-07-08-articles.tsv &
curl http://download.freebase.com/wex/latest/freebase-wex-2011-07-08-articles.tsv.bz2 | bunzip2 > articles &
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-D mapred.min.split.size=300000000 \
-input /full/2011-07-08/articles \
-output /full/2011-07-08/edges \
-mapper articleParser.py \
-file articleParser.py \
-numReduceTasks 0
Job Counters
Rack-local map tasks 0 0 189
Launched map tasks 0 0 190
Data-local map tasks 0 0 1
FileSystemCounters
HDFS_BYTES_READ 55,499,655,843 0 55,499,655,843
HDFS_BYTES_WRITTEN 129,764,139 0 129,764,139
parse
no_outbound_link_found 9,647 0 9,647
plain_text_too_short 87,599 0 87,599
metafile 1,896,275 0 1,896,275
Map-Reduce Framework
Map input records 5,596,834 0 5,596,834
Spilled Records 0 0 0
Map input bytes 55,487,987,179 0 55,487,987,179
Map output records 3,603,313 0 3,603,313
#articles => 5,596,834
#edges extracted from articles 3,603,313
# edges.from 3603313
# edges.from (uniq) 3603249
# edges.to (uniq) 684632
pig -p SET=/full/2011-07-08/ -f edges_dereferenced.pig
# edges_dereferenced.from 3603313 (sanity)
# edges_dereferenced.from (uniq) 3603249 (sanity)
# edges_dereferenced.to (uniq) 621604 (not as many as the 2009-01-12 dataset...)
would downcasing help? maybe, but it's drifting further away from the true data
-- run the fscker!
hfs -cat /full/2011-07-08/edges_dereferenced/* | java -classpath . com.Test "Philosophy" >edges 2>progress &
lots of examples that didn't work
eg truth, which is returning 'François Lemoyne' instead of 'Reality'
need another gold set to work with
under article.egs
17th_Delaware_General_Assembly.eg
1949_Coupe_de_France_Final.eg
1999_Japan_Open_Tennis_Championships_Womens_Singles.eg
BAFC.eg
Bird_Gets_the_Worm.eg
Category_1022_books.eg
File_Pasquale_Caggiano_png.eg
Fort_Baxter.eg
Jinxiang_dialect.eg
Truth.eg
$ cat article.egs/* | ./articleParser.py 2>/dev/null
17th Delaware General Assembly Delaware Senate
1949 Coupe de France Final soccer
1999 Japan Open Tennis Championships – Women's Singles Ai Sugiyama
Bird Gets the Worm Charlie Parker
Jinxiang dialect People's Republic of China
Truth François Lemoyne
should be....
17th Delaware General Assembly Delaware Senate
1949 Coupe de France Final soccer
1999 Japan Open Tennis Championships – Women's Singles Ai Sugiyama
Bird Gets the Worm Charlie Parker
Jinxiang dialect Taihu Wu dialects ** different
Truth Fact ** different
(note:
consider 'Jinxiang dialect'
$ cut -f4 article.egs/Jinxiang_dialect.eg | sed -es/\\\\n/\ /g | xmllint --format -
picked up target is from side bar...
.. <param name="states"><link><target>People's Republic of China</target></link> ..
correct target is later...
.. or a Northern <link synthetic="true"><target>Taihu Wu dialects</target><part>Wu dialect</part></link>, spoken in ..
$ cut -f5 article.egs/Jinxiang_dialect.eg | less
The<space/><bold><link synthetic="true"><target>1949 Coupe de France Final</target><part>Coupe de France Final</part></link> 1949</bold><space/>was a<space/><link><target>soccer</target><part>football</part></link>
plain text is
Jinxiang dialect (金鄉話), is a Taihu Wu dialect, or a Northern Wu dialect, spoken in ...
so perhaps a better parsing strategy is
extract all target links, href and link text
choose the link whose link text appears first in the plain text
as a sanity check consider '1949 Coupe de France Final'
plain text is
The Coupe de France Final 1949 was a football match held at Stade ...
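ie (a sketch; assumes the links have already been pulled out as (target, link text) pairs):
def best_target(links, plain_text):
    # pick the link whose link text occurs earliest in the plain text
    best, best_pos = None, len(plain_text)
    for target, text in links:
        pos = plain_text.find(text)
        if pos != -1 and pos < best_pos:
            best, best_pos = target, pos
    return best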
going to need beautiful soup
wget http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.0.tar.gz
tar zxf BeautifulSoup-3.2.0.tar.gz
cd BeautifulSoup-3.2.0
sudo python ./setup.py install
python
from BeautifulSoup import BeautifulStoneSoup
f = open('article.egs/1949_Coupe_de_France_Final.xml','r')
soup = BeautifulStoneSoup(f.read())
links = soup.findAll('link')
then for
<link synthetic="true"><target>1949 Coupe de France Final</target><part>Coupe de France Final</part></link>
>>> links[0].target
<target>1949 Coupe de France Final</target>
>>> links[0].target.string
u'1949 Coupe de France Final'
>>> links[0].part
<part>Coupe de France Final</part>
>>> links[0].part.string
u'Coupe de France Final'
and for
<link><target>Stade Olympique Yves-du-Manoir</target></link>
>>> links[2].target
<target>Stade Olympique Yves-du-Manoir</target>
>>> links[2].target.string
u'Stade Olympique Yves-du-Manoir'
>>> links[2].part == None
True
for template in soup.findAll('template'):
template.extract()
also needed to add another heuristic which was to only examine the first 10 links
(otherwise the link 'fact' deep in the truth article matched a 'fact' plain text at the start of the article)
feels dangerous...
also noticed that 1949_Coupe_de_France_Final.eg -> soccer
and not Soccer as it should, and there is no redirect, and the current live page is correctly Soccer
might need to handle this in graph redirecting: if a node is not present and is lower case, try upper casing it
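something like this sketch (known_titles is assumed to be the set of article titles):
def fix_case(node, known_titles):
    if node in known_titles:
        return node
    capitalised = node[:1].upper() + node[1:]    # eg soccer -> Soccer
    if capitalised in known_titles:
        return capitalised
    return node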
17th Delaware General Assembly Delaware Senate
1949 Coupe de France Final soccer
1999 Japan Open Tennis Championships – Women's Singles Ai Sugiyama
Bird Gets the Worm Charlie Parker
Brendan Foster Order of the British Empire
Garh More Jhang District
Harbour View, New Zealand Lower Hutt
Jinxiang dialect Taihu Wu dialects
Truth Reality
restart another cluster from scratch
elastic-mapreduce --create --alive \
--num-instances 5 --master-instance-type m1.large --slave-instance-type m1.large \
--bootstrap-action s3://mkelcey/wikipediaPhilosophy/install_beautiful_soup.sh
then on master
mkfifo articles
hadoop fs -copyFromLocal articles /full/2011-07-23/articles/freebase-wex-2011-07-23-articles.tsv &
curl -s http://download.freebase.com/wex/2011-07-23/freebase-wex-2011-07-23-articles.tsv.bz2 | bunzip2 > articles &
mkfifo redirects
hadoop fs -copyFromLocal redirects /full/2011-07-23/redirects/freebase-wex-2011-07-23-redirects.tsv &
curl -s http://download.freebase.com/wex/2011-07-23/freebase-wex-2011-07-23-redirects.tsv.bz2 | bunzip2 > redirects &
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-08/articles/ -output /full/2011-07-08/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
Job Counters
Launched reduce tasks 0 0 10
Rack-local map tasks 0 0 94
Launched map tasks 0 0 423
Data-local map tasks 0 0 329
FileSystemCounters
FILE_BYTES_READ 0 84,818,230 84,818,230
HDFS_BYTES_READ 55,520,558,984 0 55,520,558,984
FILE_BYTES_WRITTEN 104,093,878 84,818,230 188,912,108
HDFS_BYTES_WRITTEN 0 130,919,661 130,919,661
parse
10_links_examin 1,918,250 0 1,918,250
no_match 189,136 0 189,136
exception 37,768 0 37,768
metafile 1,896,275 0 1,896,275
Map-Reduce Framework
Reduce input groups 0 3,473,630 3,473,630
Combine output records 0 0 0
Map input records 5,596,834 0 5,596,834
Reduce shuffle bytes 0 103,747,842 103,747,842
Reduce output records 0 3,473,655 3,473,655
Spilled Records 3,473,655 3,473,655 6,947,310
Map output bytes 130,919,781 0 130,919,781
Map input bytes 55,487,987,179 0 55,487,987,179
Map output records 3,473,655 0 3,473,655
Combine input records 0 0 0
Reduce input records 0 3,473,655 3,473,655
examining a random attempt
( /mnt/var/log/hadoop/userlogs/attempt_201107310225_0047_m_000017_0 )
on one of the slaves we see this breakdown
cat stderr |sort|uniq -c
849 parse exception <class 'sre_constants.error'>
510 parse exception <type 'exceptions.TypeError'>
adjusted err output to include the article name and reran to grab some samples
hfs -cat /full/2011-07-08/articles/freebase-wex-2011-07-08-articles.tsv | ./articleParser.py 2>&1 >/dev/null | grep -v ^reporter
made some fixes (robustness for character) and kicked off again...
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-08/articles/ -output /full/2011-07-08/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
Job Counters
Launched reduce tasks 0 0 10
Rack-local map tasks 0 0 200
Launched map tasks 0 0 448
Data-local map tasks 0 0 248
FileSystemCounters
FILE_BYTES_READ 0 86,274,065 86,274,065
HDFS_BYTES_READ 55,520,558,984 0 55,520,558,984
FILE_BYTES_WRITTEN 105,741,514 86,274,065 192,015,579
HDFS_BYTES_WRITTEN 0 132,365,083 132,365,083
parse
10_links_examined_limit 1,971,706 0 1,971,706
no_match 188,725 0 188,725
metafile 1,896,275 0 1,896,275
Map-Reduce Framework
Reduce input groups 0 3,511,809 3,511,809
Combine output records 0 0 0
Map input records 5,596,834 0 5,596,834
Reduce shuffle bytes 0 105,382,939 105,382,939
Reduce output records 0 3,511,834 3,511,834
Spilled Records 3,511,834 3,511,834 7,023,668
Map output bytes 132,365,209 0 132,365,209
Map input bytes 55,487,987,179 0 55,487,987,179
Map output records 3,511,834 0 3,511,834
Combine input records 0 0 0
Reduce input records 0 3,511,834 3,511,834
and then again on the newer dataset
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-23/articles/ -output /full/2011-07-23/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
Job Counters
Launched reduce tasks 0 0 10
Rack-local map tasks 0 0 263
Launched map tasks 0 0 470
Data-local map tasks 0 0 207
FileSystemCounters
FILE_BYTES_READ 0 86,548,215 86,548,215
HDFS_BYTES_READ 55,858,012,523 0 55,858,012,523
FILE_BYTES_WRITTEN 106,092,612 86,548,215 192,640,827
HDFS_BYTES_WRITTEN 0 132,850,425 132,850,425
parse
10_links_examined_limit 1,979,326 0 1,979,326
no_match 189,558 0 189,558
metafile 1,902,795 0 1,902,795
Map-Reduce Framework
Reduce input groups 0 3,523,601 3,523,601
Combine output records 0 0 0
Map input records 5,615,981 0 5,615,981
Reduce shuffle bytes 0 105,796,399 105,796,399
Reduce output records 0 3,523,628 3,523,628
Spilled Records 3,523,628 3,523,628 7,047,256
Map output bytes 132,850,552 0 132,850,552
Map input bytes 55,825,702,860 0 55,825,702,860
Map output records 3,523,628 0 3,523,628
Combine input records 0 0 0
Reduce input records 0 3,523,628 3,523,628
rewrote article parser, much much simpler now. rerun from scratch
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-23/articles/ -output /full/2011-07-23/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
found a big problem, one of the major links re: philosophy, "Greeks",
is \N for xml and plain text in 2011-07-23. that sucks
in fact there are 13065 blank for 2011-07-08 & 13278 blank for 2011-07-23; wonder if greeks is also blank in 2011-07-08?
turns out it is... how about 2011-06-26? blank too..
so might give up on freebase dump, how about more official wiki dump?
wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
7gb; one big xml file
sampling a bit...
------------------------------
<page>
<title>AccessibleComputing</title>
<id>10</id>
<redirect />
<revision>
<id>381202555</id>
<timestamp>2010-08-26T22:38:36Z</timestamp>
<contributor>
<username>OlEnglish</username>
<id>7181920</id>
</contributor>
<minor />
<comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
</revision>
</page>
redirect from AccessibleComputing -> Computer accessibility
- want to collapse page -> /page into a single line for processing
- if <text> starts with #REDIRECT then process as redirect
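ie roughly (a sketch, not necessarily the eventual redirectParser.py; regexes are illustrative):
import re, sys

for page in sys.stdin:                           # one <page>..</page> per line
    title = re.search(r'<title>(.*?)</title>', page)
    text = re.search(r'<text[^>]*>(.*?)</text>', page)
    if not (title and text):
        continue
    m = re.match(r'#REDIRECT\s*\[\[(.*?)\]\]', text.group(1), re.I)
    if m:
        target = m.group(1).split('|')[0].split('#')[0]
        print('%s\t%s' % (title.group(1), target))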
----------------------
<page>
<title>Anarchism</title>
<id>12</id>
<revision>
<id>442817224</id>
<timestamp>2011-08-03T09:10:07Z</timestamp>
<contributor>
<username>Eduen</username>
<id>7527773</id>
</contributor>
<comment>Emma Goldman identifying anarchy as more than no state</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] which considers the [[state (polity)|state]] undesirable, unnecessary, and harmful, and instead promotes a [[stateless society]], or [[anarchy]].<ref name="definition">
- in text
-- remove all {{..}}
-- look for first instance of [[.*?]]
Anarchism -> political philosophy (though the page is 'Political philosophy')
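ie (a sketch; {{..}} can nest, so a single non-greedy pass is only an approximation):
import re

def first_wikilink(text):
    text = re.sub(r'\{\{.*?\}\}', '', text, flags=re.S)    # remove all {{..}}
    m = re.search(r'\[\[(.*?)\]\]', text)
    if m is None:
        return None
    return m.group(1).split('|')[0]              # [[target|display]] -> target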
---------------------------
<page>
<title>Autism</title>
<id>25</id>
<revision>
<id>440170653</id>
<timestamp>2011-07-18T19:18:21Z</timestamp>
<contributor>
<username>GrouchoBot</username>
<id>8453292</id>
</contributor>
<minor />
<comment>r2.6.4) (robot Adding: [[kk:Аутизм]]</comment>
<text xml:space="preserve">{{pp-semi-indef}}
{{dablink|This article is about the classic autistic disorder; some writers use the word ''autism'' when referring to the range of disorders on the [[autism spectrum]] or to the various [[pervasive developmental disorder]]s.<ref name=Caronna/>}}
{{pp-move-indef}}
<!-- NOTES:
1) Please follow the Wikipedia style guidelines for editing medical articles [[WP:MEDMOS]].
2) Use <ref> for explicitly cited references.
3) Reference anything you put here with notable references, as this subject tends to attract a lot of controversy.-->
{{pp-move-indef}}
{{Infobox Disease
| Name = Autism
| Image = Autism-stacking-cans 2nd edit.jpg
| Alt = Young red-haired boy facing away from camera, stacking a seventh can atop a column of six food cans on the kitchen floor. An open pantry contains many more cans.
| Caption = Repetitively stacking or lining up objects is a behavior occasionally associated with individuals with autism.
| DiseasesDB = 1142
| ICD10 = {{ICD10|F|84|0|f|80}}
| ICD9 = 299.00
| ICDO =
| OMIM = 209850
| MedlinePlus = 001526
| eMedicineSubj = med
| eMedicineTopic = 3202
| eMedicine_mult = {{eMedicine2|ped|180}}
| MeshID = D001321
| GeneReviewsID = autism-overview
| GeneReviewsName = Autism overview
}}
'''Autism''' is a [[Neurodevelopmental disorder|disorder of neural development]] characterized by impaired ....
Autism -> Neurodevelopmental disorder
- look for text
-- remove {{.*}}
-- look for first [[ ]], might include |
------------------------
<page>
<title>Alchemy</title>
<id>573</id>
<restrictions>move=:edit=</restrictions>
<revision>
<id>442807146</id>
<timestamp>2011-08-03T07:28:33Z</timestamp>
<contributor>
<username>Huntster</username>
<id>92632</id>
</contributor>
<minor />
<comment>Reverted 1 edit by [[Special:Contributions/75.65.177.88|75.65.177.88]] ([[User talk:75.65.177.88|talk]]) identified as [[WP:VAND|vandalism]] to last revision by Captainmighty. ([[WP:TW|TW]])</comment>
<text xml:space="preserve">{{Redirect|Alchemist}}
{{Other uses}}
[[File:Raimundus Lullus alchemic page.jpg|thumb|right|Page from alchemic treatise of [[Ramon Llull]], 16th century]]
'''Alchemy''' is an ancient [[tradition]], the primary objective of which was the
Alchemy -> tradition
- can't just look for first [[ since it would pick up this meta file
- do we need to look for '''? or ignore [[File: ?
---------------------------------------------
<page>
<title>A</title>
....
<comment>r2.7.1) (robot Adding: [[nap:A]]</comment>
<text xml:space="preserve">{{Dablink|Due to [[Wikipedia:Naming conventions (technical restrictions)#Forbidden characters|technical restrictions]], A# redirects here. For other uses, see [[A-sharp (disambiguation)]].}}
{{pp-move-indef}}
{{Two other uses|the letter|the indefinite article|A and an}}
{{Latin alphabet navbox|uc=A|lc=a}}
'''A''' ({{IPAc-en|En-us-A.ogg|eɪ}}; [[English_alphabet#Letter_names|named]] ''a'', plural ''aes'')<ref name="OED"/> is the first [[Letter (alphabet)|letter]] and a [[vowel]] in the [[basic modern Latin alphabet]]. It is similar to the Ancient Greek letter [[Alpha]], from which it derives.
plain text is
A (named a, plural aes) is the first letter and a vowel
unsure if i want English_alphabet or Letter
in either case it shows the need to understand the # in [[English_alphabet#Letter_names|named]]
--------------------------------
<page>
<title>Alabama</title>
...
<comment>Undid revision 442794985 by [[Special:Contributions/76.73.178.189|76.73.178.189]] ([[User talk:76.73.178.189|talk]])</comment>
<text xml:space="preserve">{{About|the U.S. state of Alabama|the river|Alabama River|other uses|Alabama (disambiguation)}}
{{pp-move-indef}}
{{Infobox U.S. state
|Name = Alabama
|Fullname = State of Alabama
|Flag = Flag of Alabama.svg
|Flaglink = [[Flag of Alabama|Flag
.....
|Route Marker = Alabama 67.svg
|Quarter = 2003 AL Proof.png
|QuarterReleaseDate = 2003
}}
'''Alabama''' ({{IPAc-en|en-us-Alabama.ogg|ˌ|æ|l|ə|ˈ|b|æ|m|ə}}) is a [[U.S. state|state]] located in the [[Southern United States|southeastern region]] of
Alabama -> U.S. state
-----------------------------
<page>
.....
<comment>Corrected Greek spelling to be consistent with more modern and graphically accurate rendering of kappa as 'k.'</comment>
<text xml:space="preserve">{{Redirect|Achilleus|the emperor with this name|Achilleus (emperor)|other uses|Achilles (disambiguation)}}
[[Image:Leon Benouville The Wrath of Achilles.jpg|thumb|''The Wrath of Achilles'', by [[François-Léon Benouville]] (1821–1859) ([[Musée Fabre]])]]
In [[Greek mythology]], '''Achilles''' ([[Ancient Greek]]: {{polytonic|Ἀχιλλεύς}}, ''Akhille
first case of a link _before_ the '''-bolded term
no guarantee there even will be a ''' term i guess...
feels like the simplest way is to parse the [['s one at a time, and ignore any [[??: by treating the : as a meta char?
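ie (a sketch of that; also splits off the # anchors and | display text noted above):
import re

def first_article_link(text):
    for m in re.finditer(r'\[\[(.*?)\]\]', text):
        target = m.group(1).split('|')[0].split('#')[0]
        if ':' in target:
            continue                             # File:, Category:, Image:, etc
        return target
    return None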
--------------------------------------
<page>
<title>Actrius</title>
<id>330</id>
<revision>
....
|country = [[Spain]]
|production_company = [[Els Films de la Rambla, S.A.]]
}}
'''''Actrius''''' ([[Catalan language|Catalan]]: ''Actresses'') is a [[1996 in film|1996 film]] directed by [[Ventura Pons]]. In the film, there are no male actors and the four leading actresses dubbed themselves in the Castilian version.
again, should it be 'Catalan language' or '1996 in film'
i think the second is better
- ignore things in brackets
--
elastic-mapreduce -c ~/security/credentials.json --create --alive \
--master-instance-type cc1.4xlarge --slave-instance-type cc1.4xlarge --instance-count 3 \
--bootstrap-action s3://beta.elasticmapreduce/bootstrap-actions/install-ganglia \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-action s3://mkelcey/wikipediaPhilosophy/install_beautiful_soup.sh \
--pig-interactive --name mkelcey_1313088555
walking from slayer i see...
http://en.wikipedia.org/wiki/Slayer
http://en.wikipedia.org/wiki/Thrash_metal
http://en.wikipedia.org/wiki/Heavy_metal_music
http://en.wikipedia.org/wiki/Rock_music
http://en.wikipedia.org/wiki/Popular_music
http://en.wikipedia.org/wiki/Musical_genre
http://en.wikipedia.org/wiki/Genres
http://en.wikipedia.org/wiki/Literature
http://en.wikipedia.org/wiki/Latin
http://en.wikipedia.org/wiki/Italic_language
http://en.wikipedia.org/wiki/Indo-European_languages
http://en.wikipedia.org/wiki/Language_family
http://en.wikipedia.org/wiki/Language
http://en.wikipedia.org/wiki/Human
http://en.wikipedia.org/wiki/Taxonomy
http://en.wikipedia.org/wiki/Ancient_Greek
http://en.wikipedia.org/wiki/Archaic_Greece
gets a bit fuzzy...
setup
sudo apt-get install emacs22-nox
get data
wget http://download.wikimedia.org/enwiki/20110722/enwiki-20110722-pages-articles.xml.bz2 # 7.1gb
flatten to single line
bzcat enwiki-20110722-pages-articles.xml.bz2 | ~/flattenToOnePagePerLine.py > enwiki-20110722-pages-articles.pageperline.xml # 30gb
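(flattenToOnePagePerLine.py is presumably something along these lines; a sketch, the real script may differ):
import sys

buf = []
for line in sys.stdin:
    line = line.strip()
    if line.startswith('<page>'):
        buf = [line]
    elif buf:
        buf.append(line)
        if line.startswith('</page>'):
            print(' '.join(buf))                 # one page per line
            buf = []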
split into redirects and articles
cat enwiki-20110722-pages-articles.pageperline.xml | grep \<redirect\ \/\> > enwiki-20110722-pages-redirects.xml &
cat enwiki-20110722-pages-articles.pageperline.xml | grep -v \<redirect\ \/\> > enwiki-20110722-pages-articles.xml &
wc -l redirects 5177302
wc -l articles 6307301
move xml for articles and redirects into hdfs
hadoop fs -mkdir /full/articles.xml
hadoop fs -copyFromLocal /mnt/enwiki-20110722-pages-articles.xml /full/articles.xml
hadoop fs -mkdir /full/redirects.xml
hadoop fs -copyFromLocal /mnt/enwiki-20110722-pages-redirects.xml /full/redirects.xml
parse redirects
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/redirects.xml -output /full/redirects \
-mapper redirectParser.py -file redirectParser.py
parse articles
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/articles.xml -output /full/edges \
-mapper articleParser.py -file articleParser.py
parse
exception_parsing_article=29
no_valid_links=18902
cant_find_any_links=2002
ignore_meta_article=2593294
dereference redirects
pig -p INPUT=/full/edges -p OUTPUT=/full/edges.dereferenced1 -f dereference_redirects.pig
and again (to check there are no double redirects)
pig -p INPUT=/full/edges.dereferenced1 -p OUTPUT=/full/edges.dereferenced2 -f dereference_redirects.pig
and again (to check there are no double redirects)
/* pig -p INPUT=/full/edges.dereferenced2 -p OUTPUT=/full/edges.dereferenced3 -f dereference_redirects.pig
looks like a job for iterative pig 0.9 :) !
get results locally to check them
hfs -cat /full/edges/* | sort > /mnt/edges &
hfs -cat /full/edges.dereferenced1/* | sort > /mnt/edges.dereferenced1 &
hfs -cat /full/edges.dereferenced2/* | sort > /mnt/edges.dereferenced2 &
/mnt/edges.dereferenced1 & /mnt/edges.dereferenced2 same size,
but md5sum different?
and diff shows "difference"? must be whitespace or unicode weirdness... looks good enough...
hadoop@ip-10-17-178-207:/mnt$ diff edges.dereferenced[12]
121977d121976
< 6₂ knot Knot theory
121978a121978
> 6₂ knot Knot theory
254926d254925
< A♭ (musical note) Semitone
254927a254927
> A♭ (musical note) Semitone
1295453d1295452
< G♭ (musical note) Semitone
-- do the walk!
time java -Xmx8g -cp . DistanceToPhilosophy Philosophy /mnt/edges.dereferenced3 \
>DistanceToPhilosophy.stdout 2>DistanceToPhilosophy.stderr
./explorer.py
Slayer -> Thrash metal -> Heavy metal music -> Rock music -> Popular music -> Music genre -> Genre -> Literature -> Letter (alphabet) -> Grapheme -> Writing system -> Symbolic system -> Psychology -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Beer -> Alcoholic beverage -> Drink -> Liquid -> State of matter -> Phase (matter) -> Outline of physical science -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Linux -> Unix-like -> Operating system -> Computer software -> Computer program -> Instruction set -> Computer architecture -> Computer science -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Parachuting -> Parachute -> Atmosphere -> Gas -> State of matter -> Phase (matter) -> Outline of physical science -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Bad Religion -> Punk rock -> Rock music -> Popular music -> Music genre -> Genre -> Literature -> Letter (alphabet) -> Grapheme -> Writing system -> Symbolic system -> Psychology -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Vegemite -> Yeast extract -> Yeast -> Eukaryote -> Organism -> Biology -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Hobart -> List of Australian capital cities -> States and territories of Australia -> Australia -> Southern Hemisphere -> Earth -> Planet -> Orbit -> Physics -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
some quirks;
Natural science -> Branch_(academia) in live
Fact -> Truth
having problems around Antwerp
path is actually: Antwerp -> Municipality -> Australia -> Southern Hemisphere -> Earth -> Planet -> Orbit -> Physics -> Natural science -> Science
-> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
but distance lists
didn't visit antwerp, Municipality, Australia or Southern Hemisphere
Earth however is visited, at distance 14
there is no edge, Southern Hemisphere -> Earth
the parser must be broken.
fixed it again, and all article.egs work (from testArticleParser)
run redirects against redirects
pig -p INPUT=/full/redirects -p OUTPUT=/full/redirects.dereferenced1 -f dereference_redirects.pig
pig -p INPUT=/full/redirects.dereferenced1 -p OUTPUT=/full/redirects.dereferenced2 -f dereference_redirects.pig
pig -p INPUT=/full/redirects.dereferenced2 -p OUTPUT=/full/redirects.dereferenced3 -f dereference_redirects.pig
pig -p INPUT=/full/redirects.dereferenced3 -p OUTPUT=/full/redirects.dereferenced4 -f dereference_redirects.pig
hfs -mv /full/redirects /full/redirects.original
hfs -mv /full/redirects.dereferenced4 /full/redirects
run extraction
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/articles.xml -output /full/edges \
-mapper articleParser.py -file articleParser.py
run redirects against edges
pig -p INPUT=/full/edges -p OUTPUT=/full/edges.dereferenced -f dereference_redirects.pig
pig -p INPUT=/full/edges.dereferenced -p OUTPUT=/full/edges.dereferenced2 -f dereference_redirects.pig # sanity check, should be no different
get to local filesystem
hadoop fs -cat /full/edges.dereferenced/* > data/edges
and run it up
time java -Xmx8g -cp . DistanceToPhilosophy Philosophy data/edges >DistanceToPhilosophy.stdout 2>DistanceToPhilosophy.stderr
work out which nodes we didn't visit
grep ^didnt DistanceToPhilosophy.stdout | sed -es/didnt\ visit\ // > didnt_visit
summarise why we didn't visit them
./walk_till_end.py < didnt_visit > walk_till_end.stdout
grep end\ of\ line$ walk_till_end.stdout | cut -f2 | sort | uniq -c | sort -nr | head
10397 List of United States cities by population
6282 Abnormality (behavior)
3447 Local development ministry * no article (live)
2062 West Slavs
1802 Azana (gnat) * no article (live)
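(walk_till_end.py is presumably along these lines; a sketch, assuming edges is a dict of from_node -> to_node):
def walk_till_end(node, edges):
    seen = set()
    while node in edges and node not in seen:    # follow the single outgoing edge
        seen.add(node)
        node = edges[node]
    return node, ('loop' if node in seen else 'end of line')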
Abnormality is one that i can't see how to fix,
perhaps just need these as special cases?
tried to fix with another hack but screw it; just need special cases
after...
hadoop fs -cat /full/edges.dereferenced/* > data/edges
run
cat special_edges_cases >> data/edges
with ' removal
6192 Abnormality (behavior)
4216 Special administrative region (People s Republic of China)
3718 People s Republic of China
3447 Local development ministry
2063 West Slavs
1963 Direct-controlled municipality of the People s Republic of China
1800 Azana (gnat)
1539 Earth s atmosphere
1486 Administrative divisions of the People s Republic of China
922 R&B;
-- perhaps time to try a mediawiki parser...
sudo apt-get install libxml2-dev libxslt-dev
wget http://pypi.python.org/packages/source/m/mwlib/mwlib-0.12.15.zip#md5=fae8cab1ef1421202c734c8c5f12b51a
unzip mwlib-0.12.15.zip
cd mwlib-0.12.15
sudo python setup.py install
from mwlib.uparser import simpleparse
from BeautifulSoup import BeautifulStoneSoup
from xml.sax.saxutils import unescape
file_contents = open('article.egs/arcology.eg','r').read()
xml = BeautifulStoneSoup(file_contents)
text = xml.find('text').string
text = unescape(text, {"&apos;": "'", "&quot;": '"'})
parsed = simpleparse(text)
from mwlib.uparser import parseString
from mwlib.xhtmlwriter import MWXHTMLWriter, preprocess
from BeautifulSoup import BeautifulStoneSoup
from xml.sax.saxutils import unescape
file_contents = open('article.egs/arcology.eg','r').read()
xml = BeautifulStoneSoup(file_contents)
wikitext = xml.find('text').string
wikitext = unescape(wikitext, {"&apos;": "'", "&quot;": '"'})
r = parseString(title='', raw=wikitext)
preprocess(r)
dbw = MWXHTMLWriter()
dbw.writeBook(r)
xml2 = BeautifulStoneSoup(dbw.asstring())
paras = xml2.findAll('div',{"class":"mwx.paragraph"})
paras