distance to philosophy
http://xkcd.com/903/
not such a big deal
questions
1) do all wikipedia articles link to philosophy?
2) what distribution do the distances take?
method:
1) get wikipedia dump from volume
2) parse to make a graph; term -> term
3) connected components; is it one? if not which one is philosophy in?
4) histogram of distances
part 1) get wikipedia dump from volume
go to snapshot snap-1781757e Wikipedia Extraction-WEX (Linux)
from it make a volume vol-? (in, say, us-east-1c)
make an instance; ebs backed
attach volume to instance i-71941410 (also in us-east-1c)
device /dev/sdk
copy from ebs to hdfs
mkdir wiki;
sudo mount /dev/xvdk wiki
hadoop fs -mkdir /full/articles
hadoop fs -copyFromLocal wiki/rawd/freebase-wex-2009-01-12-articles.tsv /full/articles_one_file # 7 min
hadoop fs -mkdir /full/redirects
hadoop fs -copyFromLocal wiki/rawd/freebase-wex-2009-01-12-redirects.tsv /full/redirects
and for testing...
hadoop fs -mkdir /sample/articles
head -n100 wiki/rawd/freebase-wex-2009-01-12-articles.tsv > sample
hadoop fs -copyFromLocal sample /sample/articles/freebase-wex-2009-01-12-articles.tsv
the interesting file is wiki/rawd/freebase-wex-2009-01-12-articles.tsv
which is 31G; 4,183,153 articles
http://wiki.freebase.com/wiki/WEX/Documentation
it has 5 columns
0 - id
1 - title
2 - date
3 - xml
4 - plain text
(maybe ignore this)
before going too far it'd be interesting to extract the actual list of titles..
hadoop jar ~/contrib/streaming/hadoop-streaming.jar -input /sample/articles/ -output /sample/titles -mapper '/usr/bin/cut -f2' -numReduceTasks 0
(maybe ignore this)
first need to split into chunks for in/out of s3 (with gzipping)
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-D mapred.min.split.size=419430400 \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /full/articles_one_file/ -output /full/articles \
-mapper /bin/cat -numReduceTasks 0
reduces it to 79 files, 75mb each, 6gb total
after playing with the data a bit (and refining the algorithm) the basic heuristic is
ignore until first <sentence>
find first <target> that isn't article name (as often, the first one is)
of course it's not that simple, there are lots of extra special cases...
eg [File:BSicon ABZvlr.svg] [Category:AB Castellón players] or [Template:Sharpness Branch Line]
what are the distinct values of these?
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /sample/articles/ -output /sample/article_types \
-mapper '/usr/bin/python metaArticleTypes.py' -file metaArticleTypes.py \
-reducer aggregate
of course, nothing is clean :D
$ hfs -cat /full/article_types/*|sort -k2 -t' ' -nr|head -n20
normal file 2684331
File 862604
Category 434783
Template 164138
Portal 15543
Portal talk 1175
File talk 903
Category:Wikipedians by alma mater 739
Meanings of minor planet names 218
List of minor planets 203
ISO 3166-2 198
List of United Kingdom locations 125
List of drugs 114
Star Wars 91
Star Trek 85
Template:Ph 83
Library of Congress Classification 83
Theme Time Radio Hour 81
Live Phish Downloads 74
Batman 66
so maybe ignore; File, Category, Template, Portal, Portal talk, File talk
curious now, what do these represent?
$ hfs -cat /full/articles/part-00232.gz | gunzip | cut -f2 | grep ^File | shuf | head
File:TriGeo Logo.JPG
File:Tuckerdorothy.jpg
File:Hippoquarium.jpg
File:Gbridge-cap.jpg
File:Twoc.jpg
File:Eurovision 81.jpg
File:BMW 003 jet engine.JPG
File:Lochailort.jpg
File:New Zealand General Service Medal - Iraq.jpg
File:Pixiesheadon.jpg
$ hfs -cat /full/articles/part-00232.gz | gunzip | cut -f2 | grep ^Category | shuf | head
Category:User bcl-3
Category:Sport in Hamilton, New Zealand
Category:People murdered in Norfolk Island
Category:Top-importance Old-time Base Ball articles
Category:Calgary Mustangs players
Category:NA-Class Japanese baseball articles
Category:Deaths by firearm in Nebraska
Category:Irish folk-song collectors
Category:University of Maryland, Baltimore County faculty
Category:Human death in Nebraska
$ hfs -cat /full/articles/part-00232.gz | gunzip | cut -f2 | grep ^Portal | shuf | head
Portal:Furry/Did you know/2
Portal:Western Australia/Selected article/September 2008
Portal:Edgar Allan Poe/Selected picture/October
Portal:Tropical cyclones/Featured article/Monsoon trough
Portal:Spaceflight/On This Day/5 September
Portal:Philadelphia/Philadelphia news/September 2008
Portal:BBC/Selected article/2
Portal:Greater Manchester/Did you know/archive
Portal:Japan/Did you know/56
Portal talk:Trains/Anniversaries/August 30
ignore until first <sentence>
find first <target> that
isn't article name (as often, the first one is)
doesn't start with File:, Category:, Template:, Portal:, Portal talk:, File talk:
sometimes we just need to ignore the "article" too.
i think when the 'plain text' (col[4]) is <100 characters it's probably a meta article too...
in fact this seems to be a more general case;
so...
ignore if plain_text < 100 chars
ignore until first sentence
find first <target> that isn't article name
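as a rough sketch in python (not the actual articleParser.py; the regexes and the 100-char threshold just illustrate the rules above):
import re

META_PREFIXES = ('File:', 'Category:', 'Template:', 'Portal:',
                 'Portal talk:', 'File talk:')

def first_link(title, xml, plain_text):
    if title.startswith(META_PREFIXES):
        return None                              # meta "article"
    if len(plain_text) < 100:
        return None                              # probably meta too
    m = re.search(r'<sentence>', xml)
    if m is None:
        return None                              # ignore until first <sentence>
    for target in re.findall(r'<target>(.*?)</target>', xml[m.start():]):
        if target != title and not target.startswith(META_PREFIXES):
            return target                        # first target that isn't the article name
    return None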
there is also another interesting file, freebase-wex-2009-01-12-redirects.tsv,
which i suspect will be required since sometimes the second target will require redirect dereference
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /sample/articles/ -output /sample/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py \
-numReduceTasks 0
agassi should be 'List of ATP number 1 ranked players'
(would be 'Kirk Kerkorian', his middle name, if we include synthetic links)
remove all templates
<template.*?</template>
remove all synthetic links
<link synthetic="true">.*?</link>
first target
andre agassi link removed since synthetic
structure was...
articles
article
paragraph
sentence
first target
though i think just targets after removing templates is enough...
comparing with ayn_rand
it would be 'American values' if we include 'link synthetic="true"'
but 'novelist' if we exclude 'link synthetic="true"'
sometimes there is text before the 1st paragraph, eg disambiguation info...
so need to trim to first paragraph too!!
final is then..
ignore "articles" that have a name starting with 'File:', 'Template:',etc
ignore "articles" that have plain text less than 30 chars
trim to first paragraph
remove all templates
<template.*?</template>
remove all synthetic links
<link synthetic="true">.*?</link>
first target
andre agassi link removed since synthetic
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /sample/articles/ -output /sample/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py \
-numReduceTasks 0
no_outbound_link_found 15,491
plain_text_too_short 61,863
metafile 1,480,805
Map input records 4,183,153
Map output records 2,624,994
looking through the diff of the from_nodes to the to_nodes in the edges
looks like we need to convert the to_nodes to their redirects
cut -f1 all.edges | sort > all.edges.from # 2624994 lines
cut -f2 all.edges | uniq | sort | uniq > all.edges.to # 497461 lines
the redirects file isn't huge (3.3e6 rows, 154mb) so i thought it'd be feasible to do this in memory.
eg see dereferenceToLinks.py
but it's a complete fail, even split into 16 chunks & run in parallel on a cc1.4xlarge.
(running 10+hrs, with each dereference dict lookup taking roughly 2s for every 3 records?! that's a 45hr runtime)
i obviously need to learn me some more python...
the main reason i did this was i thought there'd be multiple redirects, but looking at ~200e3 samples it's not the
case; if there is a redirect it's dereferenced directly
which means this is just a join; do it in pig
-- pig -p SET=sample|full -f edges_dereferenced.pig
edges = load '$SET/edges' as (from_node: chararray, to_node: chararray);
redirects_with_id = load '$SET/redirects' as (id:long, from_node: chararray, to_node: chararray);
redirects = foreach redirects_with_id generate from_node, to_node;
joined = join edges by to_node left outer, redirects by from_node;
edges_dereferenced = foreach joined generate
edges::from_node as from_node,
(redirects::to_node is null ? edges::to_node : redirects::to_node) as to_node;
store edges_dereferenced into '$SET/edges_dereferenced';
takes a minute. though of course this is not a fundamental pig vs python thing, it's an algorithm difference.
a simpler merge approach could have been done in python much faster too, i'm sure
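eg the obvious dict version (a sketch; file names assumed):
redirects = {}
for line in open('redirects.tsv'):               # id, from, to
    _id, from_node, to_node = line.rstrip('\n').split('\t')
    redirects[from_node] = to_node

for line in open('all.edges'):
    from_node, to_node = line.rstrip('\n').split('\t')
    # dict lookup is O(1); this should stream through in minutes
    print('%s\t%s' % (from_node, redirects.get(to_node, to_node)))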
cut -f1 all.edges_dereferenced | sort > all.edges_dereferenced.from # 2624994 lines (sanity)
cut -f2 all.edges_dereferenced | uniq | sort | uniq > all.edges_dereferenced.to # 455224 lines
now we can work on calculating the distance for each page from 'Philosophy', and this is simply a breadth first search
again, trying to be pragmatic, i wrote a version in python (distanceFromPhilosophy.py) but my god it's slow...
?seconds for ? lookups in a dict? what am i missing?
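(for the record, the BFS itself is tiny; a sketch, not the actual distanceFromPhilosophy.py. note it walks the edges in reverse, out from 'Philosophy'):
from collections import defaultdict, deque

graph = defaultdict(list)                        # to_node -> [from_node, ...]
for line in open('all.edges_dereferenced'):
    from_node, to_node = line.rstrip('\n').split('\t')
    graph[to_node].append(from_node)

distance = {'Philosophy': 0}
queue = deque(['Philosophy'])
while queue:
    node = queue.popleft()
    for neighbour in graph[node]:
        if neighbour not in distance:
            distance[neighbour] = distance[node] + 1
            queue.append(neighbour)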
Tue Aug 2 05:39:38 UTC 2011
=== move to newer version of the dump
mkfifo articles
hadoop fs -copyFromLocal articles /full/articles-2011-07-08/freebase-wex-2011-07-08-articles.tsv &
curl http://download.freebase.com/wex/latest/freebase-wex-2011-07-08-articles.tsv.bz2 | bunzip2 > articles &
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-D mapred.min.split.size=300000000 \
-input /full/2011-07-08/articles \
-output /full/2011-07-08/edges \
-mapper articleParser.py \
-file articleParser.py \
-numReduceTasks 0
Job Counters
Rack-local map tasks 0 0 189
Launched map tasks 0 0 190
Data-local map tasks 0 0 1
FileSystemCounters
HDFS_BYTES_READ 55,499,655,843 0 55,499,655,843
HDFS_BYTES_WRITTEN 129,764,139 0 129,764,139
parse
no_outbound_link_found 9,647 0 9,647
plain_text_too_short 87,599 0 87,599
metafile 1,896,275 0 1,896,275
Map-Reduce Framework
Map input records 5,596,834 0 5,596,834
Spilled Records 0 0 0
Map input bytes 55,487,987,179 0 55,487,987,179
Map output records 3,603,313 0 3,603,313
#articles => 5,596,834
#edges extracted from articles 3,603,313
# edges.from 3603313
# edges.from (uniq) 3603249
# edges.to (uniq) 684632
pig -p SET=/full/2011-07-08/ -f edges_dereferenced.pig
# edges_dereferenced.from 3603313 (sanity)
# edges_dereferenced.from (uniq) 3603249 (sanity)
# edges_dereferenced.to (uniq) 621604 (not as many as the 2009-01-12 dataset...)
would downcasing help? maybe, but it's drifting further away from the true data
-- run the fscker!
hfs -cat /full/2011-07-08/edges_dereferenced/* | java -classpath . com.Test "Philosophy" >edges 2>progress &
lots of examples that didn't work
eg truth, which is returning 'François Lemoyne' instead of 'Reality'
need another gold set to work with
under article.egs
17th_Delaware_General_Assembly.eg
1949_Coupe_de_France_Final.eg
1999_Japan_Open_Tennis_Championships_Womens_Singles.eg
BAFC.eg
Bird_Gets_the_Worm.eg
Category_1022_books.eg
File_Pasquale_Caggiano_png.eg
Fort_Baxter.eg
Jinxiang_dialect.eg
Truth.eg
$ cat article.egs/* | ./articleParser.py 2>/dev/null
17th Delaware General Assembly Delaware Senate
1949 Coupe de France Final soccer
1999 Japan Open Tennis Championships – Women's Singles Ai Sugiyama
Bird Gets the Worm Charlie Parker
Jinxiang dialect People's Republic of China
Truth François Lemoyne
should be....
17th Delaware General Assembly Delaware Senate
1949 Coupe de France Final soccer
1999 Japan Open Tennis Championships – Women's Singles Ai Sugiyama
Bird Gets the Worm Charlie Parker
Jinxiang dialect Taihu Wu dialects ** different
Truth Fact ** different
(note:
consider 'Jinxiang dialect'
$ cut -f4 article.egs/Jinxiang_dialect.eg | sed -es/\\\\n/\ /g | xmllint --format -
picked up target is from side bar...
.. <param name="states"><link><target>People's Republic of China</target></link> ..
correct target is later...
.. or a Northern <link synthetic="true"><target>Taihu Wu dialects</target><part>Wu dialect</part></link>, spoken in ..
$ cut -f5 article.egs/Jinxiang_dialect.eg | less
The<space/><bold><link synthetic="true"><target>1949 Coupe de France Final</target><part>Coupe de France Final</part></link> 1949</bold><space/>was a<space/><link><target>soccer</target><part>football</part></link>
plain text is
Jinxiang dialect (金鄉話), is a Taihu Wu dialect, or a Northern Wu dialect, spoken in ...
so perhaps a better parsing strategy is
extract all target links, href and link text
choose the link whose link text appears first in the plain text
as a sanity check consider '1949 Coupe de France Final'
plain text is
The Coupe de France Final 1949 was a football match held at Stade ...
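ie (a sketch; assumes the links have already been pulled out as (target, link text) pairs):
def best_target(links, plain_text):
    # pick the link whose link text occurs earliest in the plain text
    best, best_pos = None, len(plain_text)
    for target, text in links:
        pos = plain_text.find(text)
        if pos != -1 and pos < best_pos:
            best, best_pos = target, pos
    return best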
going to need beautiful soup
wget http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.2.0.tar.gz
tar zxf BeautifulSoup-3.2.0.tar.gz
cd BeautifulSoup-3.2.0
sudo python ./setup.py install
python
from BeautifulSoup import BeautifulStoneSoup
f = open('article.egs/1949_Coupe_de_France_Final.xml','r')
soup = BeautifulStoneSoup(f.read())
links = soup.findAll('link')
then for
<link synthetic="true"><target>1949 Coupe de France Final</target><part>Coupe de France Final</part></link>
>>> links[0].target
<target>1949 Coupe de France Final</target>
>>> links[0].target.string
u'1949 Coupe de France Final'
>>> links[0].part
<part>Coupe de France Final</part>
>>> links[0].part.string
u'Coupe de France Final'
and for
<link><target>Stade Olympique Yves-du-Manoir</target></link>
>>> links[2].target
<target>Stade Olympique Yves-du-Manoir</target>
>>> links[2].target.string
u'Stade Olympique Yves-du-Manoir'
>>> links[2].part == None
True
for template in soup.findAll('template'):
template.extract()
also needed to add another heuristic which was to only examine the first 10 links
(otherwise the link 'fact' deep in the truth article matched a 'fact' plain text at the start of the article)
feels dangerous...
also noticed that 1949_Coupe_de_France_Final.eg -> soccer
and not Soccer as it should, and there is no redirect, and the current live page is correctly Soccer
might need to handle this in graph redirecting: if a node is not present and is lower case, try upper casing it
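something like this sketch (known_titles is assumed to be the set of article titles):
def fix_case(node, known_titles):
    if node in known_titles:
        return node
    capitalised = node[:1].upper() + node[1:]    # eg soccer -> Soccer
    if capitalised in known_titles:
        return capitalised
    return node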
17th Delaware General Assembly Delaware Senate
1949 Coupe de France Final soccer
1999 Japan Open Tennis Championships – Women's Singles Ai Sugiyama
Bird Gets the Worm Charlie Parker
Brendan Foster Order of the British Empire
Garh More Jhang District
Harbour View, New Zealand Lower Hutt
Jinxiang dialect Taihu Wu dialects
Truth Reality
restart another cluster from scratch
elastic-mapreduce --create --alive \
--num-instances 5 --master-instance-type m1.large --slave-instance-type m1.large \
--bootstrap-action s3://mkelcey/wikipediaPhilosophy/install_beautiful_soup.sh
then on master
mkfifo articles
hadoop fs -copyFromLocal articles /full/2011-07-23/articles/freebase-wex-2011-07-23-articles.tsv &
curl -s http://download.freebase.com/wex/2011-07-23/freebase-wex-2011-07-23-articles.tsv.bz2 | bunzip2 > articles &
mkfifo redirects
hadoop fs -copyFromLocal redirects /full/2011-07-23/redirects/freebase-wex-2011-07-23-redirects.tsv &
curl -s http://download.freebase.com/wex/2011-07-23/freebase-wex-2011-07-23-redirects.tsv.bz2 | bunzip2 > redirects &
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-08/articles/ -output /full/2011-07-08/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
Job Counters
Launched reduce tasks 0 0 10
Rack-local map tasks 0 0 94
Launched map tasks 0 0 423
Data-local map tasks 0 0 329
FileSystemCounters
FILE_BYTES_READ 0 84,818,230 84,818,230
HDFS_BYTES_READ 55,520,558,984 0 55,520,558,984
FILE_BYTES_WRITTEN 104,093,878 84,818,230 188,912,108
HDFS_BYTES_WRITTEN 0 130,919,661 130,919,661
parse
10_links_examin 1,918,250 0 1,918,250
no_match 189,136 0 189,136
exception 37,768 0 37,768
metafile 1,896,275 0 1,896,275
Map-Reduce Framework
Reduce input groups 0 3,473,630 3,473,630
Combine output records 0 0 0
Map input records 5,596,834 0 5,596,834
Reduce shuffle bytes 0 103,747,842 103,747,842
Reduce output records 0 3,473,655 3,473,655
Spilled Records 3,473,655 3,473,655 6,947,310
Map output bytes 130,919,781 0 130,919,781
Map input bytes 55,487,987,179 0 55,487,987,179
Map output records 3,473,655 0 3,473,655
Combine input records 0 0 0
Reduce input records 0 3,473,655 3,473,655
examining a random attempt
( /mnt/var/log/hadoop/userlogs/attempt_201107310225_0047_m_000017_0 )
on one of the slaves we see this breakdown
cat stderr |sort|uniq -c
849 parse exception <class 'sre_constants.error'>
510 parse exception <type 'exceptions.TypeError'>
adjusted err output to include the article name and reran to grab some samples
hfs -cat /full/2011-07-08/articles/freebase-wex-2011-07-08-articles.tsv | ./articleParser.py 2>&1 >/dev/null | grep -v ^reporter
made some fixes (robustness for character) and kicked off again...
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-08/articles/ -output /full/2011-07-08/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
Job Counters
Launched reduce tasks 0 0 10
Rack-local map tasks 0 0 200
Launched map tasks 0 0 448
Data-local map tasks 0 0 248
FileSystemCounters
FILE_BYTES_READ 0 86,274,065 86,274,065
HDFS_BYTES_READ 55,520,558,984 0 55,520,558,984
FILE_BYTES_WRITTEN 105,741,514 86,274,065 192,015,579
HDFS_BYTES_WRITTEN 0 132,365,083 132,365,083
parse
10_links_examined_limit 1,971,706 0 1,971,706
no_match 188,725 0 188,725
metafile 1,896,275 0 1,896,275
Map-Reduce Framework
Reduce input groups 0 3,511,809 3,511,809
Combine output records 0 0 0
Map input records 5,596,834 0 5,596,834
Reduce shuffle bytes 0 105,382,939 105,382,939
Reduce output records 0 3,511,834 3,511,834
Spilled Records 3,511,834 3,511,834 7,023,668
Map output bytes 132,365,209 0 132,365,209
Map input bytes 55,487,987,179 0 55,487,987,179
Map output records 3,511,834 0 3,511,834
Combine input records 0 0 0
Reduce input records 0 3,511,834 3,511,834
and then again on the newer dataset
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-23/articles/ -output /full/2011-07-23/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
Job Counters
Launched reduce tasks 0 0 10
Rack-local map tasks 0 0 263
Launched map tasks 0 0 470
Data-local map tasks 0 0 207
FileSystemCounters
FILE_BYTES_READ 0 86,548,215 86,548,215
HDFS_BYTES_READ 55,858,012,523 0 55,858,012,523
FILE_BYTES_WRITTEN 106,092,612 86,548,215 192,640,827
HDFS_BYTES_WRITTEN 0 132,850,425 132,850,425
parse
10_links_examined_limit 1,979,326 0 1,979,326
no_match 189,558 0 189,558
metafile 1,902,795 0 1,902,795
Map-Reduce Framework
Reduce input groups 0 3,523,601 3,523,601
Combine output records 0 0 0
Map input records 5,615,981 0 5,615,981
Reduce shuffle bytes 0 105,796,399 105,796,399
Reduce output records 0 3,523,628 3,523,628
Spilled Records 3,523,628 3,523,628 7,047,256
Map output bytes 132,850,552 0 132,850,552
Map input bytes 55,825,702,860 0 55,825,702,860
Map output records 3,523,628 0 3,523,628
Combine input records 0 0 0
Reduce input records 0 3,523,628 3,523,628
rewrote article parser, much much simpler now. rerun from scratch
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/2011-07-23/articles/ -output /full/2011-07-23/edges \
-mapper '/usr/bin/python articleParser.py' -file articleParser.py
found a big problem, one of the major links re: philosophy, "Greeks",
is \N for xml and plain text in 2011-07-23. that sucks
in fact there are 13065 blank for 2011-07-08 & 13278 blank for 2011-07-23; wonder if greeks is also blank in 2011-07-08?
turns out it is... how about 2011-06-26? blank too..
so might give up on freebase dump, how about more official wiki dump?
wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
7gb; one big xml file
sampling a bit...
------------------------------
<page>
<title>AccessibleComputing</title>
<id>10</id>
<redirect />
<revision>
<id>381202555</id>
<timestamp>2010-08-26T22:38:36Z</timestamp>
<contributor>
<username>OlEnglish</username>
<id>7181920</id>
</contributor>
<minor />
<comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
</revision>
</page>
redirect from AccessibleComputing -> Computer accessibility
- want to collapse page -> /page into a single line for processing
- if <text> starts with #REDIRECT then process as redirect
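ie roughly (a sketch, not necessarily the eventual redirectParser.py; regexes are illustrative):
import re, sys

for page in sys.stdin:                           # one <page>..</page> per line
    title = re.search(r'<title>(.*?)</title>', page)
    text = re.search(r'<text[^>]*>(.*?)</text>', page)
    if not (title and text):
        continue
    m = re.match(r'#REDIRECT\s*\[\[(.*?)\]\]', text.group(1), re.I)
    if m:
        target = m.group(1).split('|')[0].split('#')[0]
        print('%s\t%s' % (title.group(1), target))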
----------------------
<page>
<title>Anarchism</title>
<id>12</id>
<revision>
<id>442817224</id>
<timestamp>2011-08-03T09:10:07Z</timestamp>
<contributor>
<username>Eduen</username>
<id>7527773</id>
</contributor>
<comment>Emma Goldman identifying anarchy as more than no state</comment>
<text xml:space="preserve">{{Redirect|Anarchist|the fictional character|Anarchist (comics)}}
{{Redirect|Anarchists}}
{{Anarchism sidebar}}
'''Anarchism''' is a [[political philosophy]] which considers the [[state (polity)|state]] undesirable, unnecessary, and harmful, and instead promotes a [[stateless society]], or [[anarchy]].<ref name="definition">
- in text
-- remove all {{..}}
-- look for first instance of [[.*?]]
Anarchism -> political philosophy (though the page is 'Political philosophy')
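ie (a sketch; {{..}} can nest, so a single non-greedy pass is only an approximation):
import re

def first_wikilink(text):
    text = re.sub(r'\{\{.*?\}\}', '', text, flags=re.S)    # remove all {{..}}
    m = re.search(r'\[\[(.*?)\]\]', text)
    if m is None:
        return None
    return m.group(1).split('|')[0]              # [[target|display]] -> target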
---------------------------
<page>
<title>Autism</title>
<id>25</id>
<revision>
<id>440170653</id>
<timestamp>2011-07-18T19:18:21Z</timestamp>
<contributor>
<username>GrouchoBot</username>
<id>8453292</id>
</contributor>
<minor />
<comment>r2.6.4) (robot Adding: [[kk:Аутизм]]</comment>
<text xml:space="preserve">{{pp-semi-indef}}
{{dablink|This article is about the classic autistic disorder; some writers use the word ''autism'' when referring to the range of disorders on the [[autism spectrum]] or to the various [[pervasive developmental disorder]]s.<ref name=Caronna/>}}
{{pp-move-indef}}
<!-- NOTES:
1) Please follow the Wikipedia style guidelines for editing medical articles [[WP:MEDMOS]].
2) Use <ref> for explicitly cited references.
3) Reference anything you put here with notable references, as this subject tends to attract a lot of controversy.-->
{{pp-move-indef}}
{{Infobox Disease
| Name = Autism
| Image = Autism-stacking-cans 2nd edit.jpg
| Alt = Young red-haired boy facing away from camera, stacking a seventh can atop a column of six food cans on the kitchen floor. An open pantry contains many more cans.
| Caption = Repetitively stacking or lining up objects is a behavior occasionally associated with individuals with autism.
| DiseasesDB = 1142
| ICD10 = {{ICD10|F|84|0|f|80}}
| ICD9 = 299.00
| ICDO =
| OMIM = 209850
| MedlinePlus = 001526
| eMedicineSubj = med
| eMedicineTopic = 3202
| eMedicine_mult = {{eMedicine2|ped|180}}
| MeshID = D001321
| GeneReviewsID = autism-overview
| GeneReviewsName = Autism overview
}}
'''Autism''' is a [[Neurodevelopmental disorder|disorder of neural development]] characterized by impaired ....
Autism -> Neurodevelopmental disorder
- look for text
-- remove {{.*}}
-- look for first [[ ]], might include |
------------------------
<page>
<title>Alchemy</title>
<id>573</id>
<restrictions>move=:edit=</restrictions>
<revision>
<id>442807146</id>
<timestamp>2011-08-03T07:28:33Z</timestamp>
<contributor>
<username>Huntster</username>
<id>92632</id>
</contributor>
<minor />
<comment>Reverted 1 edit by [[Special:Contributions/75.65.177.88|75.65.177.88]] ([[User talk:75.65.177.88|talk]]) identified as [[WP:VAND|vandalism]] to last revision by Captainmighty. ([[WP:TW|TW]])</comment>
<text xml:space="preserve">{{Redirect|Alchemist}}
{{Other uses}}
[[File:Raimundus Lullus alchemic page.jpg|thumb|right|Page from alchemic treatise of [[Ramon Llull]], 16th century]]
'''Alchemy''' is an ancient [[tradition]], the primary objective of which was the
Alchemy -> tradition
- can't just look for first [[ since it would pick up this meta file
- do we need to look for '''? or ignore [[File: ?
---------------------------------------------
<page>
<title>A</title>
....
<comment>r2.7.1) (robot Adding: [[nap:A]]</comment>
<text xml:space="preserve">{{Dablink|Due to [[Wikipedia:Naming conventions (technical restrictions)#Forbidden characters|technical restrictions]], A# redirects here. For other uses, see [[A-sharp (disambiguation)]].}}
{{pp-move-indef}}
{{Two other uses|the letter|the indefinite article|A and an}}
{{Latin alphabet navbox|uc=A|lc=a}}
'''A''' ({{IPAc-en|En-us-A.ogg|eɪ}}; [[English_alphabet#Letter_names|named]] ''a'', plural ''aes'')<ref name="OED"/> is the first [[Letter (alphabet)|letter]] and a [[vowel]] in the [[basic modern Latin alphabet]]. It is similar to the Ancient Greek letter [[Alpha]], from which it derives.
plain text is
A (named a, plural aes) is the first letter and a vowel
unsure if i want English_alphabet or Letter
in either case it shows the need to understand the # in [[English_alphabet#Letter_names|named]]
--------------------------------
<page>
<title>Alabama</title>
...
<comment>Undid revision 442794985 by [[Special:Contributions/76.73.178.189|76.73.178.189]] ([[User talk:76.73.178.189|talk]])</comment>
<text xml:space="preserve">{{About|the U.S. state of Alabama|the river|Alabama River|other uses|Alabama (disambiguation)}}
{{pp-move-indef}}
{{Infobox U.S. state
|Name = Alabama
|Fullname = State of Alabama
|Flag = Flag of Alabama.svg
|Flaglink = [[Flag of Alabama|Flag
.....
|Route Marker = Alabama 67.svg
|Quarter = 2003 AL Proof.png
|QuarterReleaseDate = 2003
}}
'''Alabama''' ({{IPAc-en|en-us-Alabama.ogg|ˌ|æ|l|ə|ˈ|b|æ|m|ə}}) is a [[U.S. state|state]] located in the [[Southern United States|southeastern region]] of
Alabama -> U.S. state
-----------------------------
<page>
.....
<comment>Corrected Greek spelling to be consistent with more modern and graphically accurate rendering of kappa as 'k.'</comment>
<text xml:space="preserve">{{Redirect|Achilleus|the emperor with this name|Achilleus (emperor)|other uses|Achilles (disambiguation)}}
[[Image:Leon Benouville The Wrath of Achilles.jpg|thumb|''The Wrath of Achilles'', by [[François-Léon Benouville]] (1821–1859) ([[Musée Fabre]])]]
In [[Greek mythology]], '''Achilles''' ([[Ancient Greek]]: {{polytonic|Ἀχιλλεύς}}, ''Akhille
first case of a link _before_ the '''-bolded term
no guarantee there even will be a ''' term i guess...
feels like the simplest way is to parse the [['s one at a time, and ignore any [[??: by treating the : as a meta char?
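ie (a sketch of that; also splits off the # anchors and | display text noted above):
import re

def first_article_link(text):
    for m in re.finditer(r'\[\[(.*?)\]\]', text):
        target = m.group(1).split('|')[0].split('#')[0]
        if ':' in target:
            continue                             # File:, Category:, Image:, etc
        return target
    return None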
--------------------------------------
<page>
<title>Actrius</title>
<id>330</id>
<revision>
....
|country = [[Spain]]
|production_company = [[Els Films de la Rambla, S.A.]]
}}
'''''Actrius''''' ([[Catalan language|Catalan]]: ''Actresses'') is a [[1996 in film|1996 film]] directed by [[Ventura Pons]]. In the film, there are no male actors and the four leading actresses dubbed themselves in the Castilian version.
again, should it be 'Catalan language' or '1996 in film'
i think the second is better
- ignore things in brackets
--
elastic-mapreduce -c ~/security/credentials.json --create --alive \
--master-instance-type cc1.4xlarge --slave-instance-type cc1.4xlarge --instance-count 3 \
--bootstrap-action s3://beta.elasticmapreduce/bootstrap-actions/install-ganglia \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-action s3://mkelcey/wikipediaPhilosophy/install_beautiful_soup.sh \
--pig-interactive --name mkelcey_1313088555
walking from slayer i see...
http://en.wikipedia.org/wiki/Slayer
http://en.wikipedia.org/wiki/Thrash_metal
http://en.wikipedia.org/wiki/Heavy_metal_music
http://en.wikipedia.org/wiki/Rock_music
http://en.wikipedia.org/wiki/Popular_music
http://en.wikipedia.org/wiki/Musical_genre
http://en.wikipedia.org/wiki/Genres
http://en.wikipedia.org/wiki/Literature
http://en.wikipedia.org/wiki/Latin
http://en.wikipedia.org/wiki/Italic_language
http://en.wikipedia.org/wiki/Indo-European_languages
http://en.wikipedia.org/wiki/Language_family
http://en.wikipedia.org/wiki/Language
http://en.wikipedia.org/wiki/Human
http://en.wikipedia.org/wiki/Taxonomy
http://en.wikipedia.org/wiki/Ancient_Greek
http://en.wikipedia.org/wiki/Archaic_Greece
gets a bit fuzzy...
setup
sudo apt-get install emacs22-nox
get data
wget http://download.wikimedia.org/enwiki/20110722/enwiki-20110722-pages-articles.xml.bz2 # 7.1gb
flatten to single line
bzcat enwiki-20110722-pages-articles.xml.bz2 | ~/flattenToOnePagePerLine.py > enwiki-20110722-pages-articles.pageperline.xml # 30gb
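(flattenToOnePagePerLine.py is presumably something along these lines; a sketch, the real script may differ):
import sys

buf = []
for line in sys.stdin:
    line = line.strip()
    if line.startswith('<page>'):
        buf = [line]
    elif buf:
        buf.append(line)
        if line.startswith('</page>'):
            print(' '.join(buf))                 # one page per line
            buf = []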
split into redirects and articles
cat enwiki-20110722-pages-articles.pageperline.xml | grep \<redirect\ \/\> > enwiki-20110722-pages-redirects.xml &
cat enwiki-20110722-pages-articles.pageperline.xml | grep -v \<redirect\ \/\> > enwiki-20110722-pages-articles.xml &
wc -l redirects 5177302
wc -l articles 6307301
move xml for articles and redirects into hdfs
hadoop fs -mkdir /full/articles.xml
hadoop fs -copyFromLocal /mnt/enwiki-20110722-pages-articles.xml /full/articles.xml
hadoop fs -mkdir /full/redirects.xml
hadoop fs -copyFromLocal /mnt/enwiki-20110722-pages-redirects.xml /full/redirects.xml
parse redirects
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/redirects.xml -output /full/redirects \
-mapper redirectParser.py -file redirectParser.py
parse articles
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/articles.xml -output /full/edges \
-mapper articleParser.py -file articleParser.py
parse
exception_parsing_article=29
no_valid_links=18902
cant_find_any_links=2002
ignore_meta_article=2593294
dereference redirects
pig -p INPUT=/full/edges -p OUTPUT=/full/edges.dereferenced1 -f dereference_redirects.pig
and again (to check there are no double redirects)
pig -p INPUT=/full/edges.dereferenced1 -p OUTPUT=/full/edges.dereferenced2 -f dereference_redirects.pig
and again (to check there are no double redirects)
/* pig -p INPUT=/full/edges.dereferenced2 -p OUTPUT=/full/edges.dereferenced3 -f dereference_redirects.pig
looks like a job for iterative pig 0.9 :) !
get results locally to check them
hfs -cat /full/edges/* | sort > /mnt/edges &
hfs -cat /full/edges.dereferenced1/* | sort > /mnt/edges.dereferenced1 &
hfs -cat /full/edges.dereferenced2/* | sort > /mnt/edges.dereferenced2 &
/mnt/edges.dereferenced1 & /mnt/edges.dereferenced2 same size,
but md5sum different?
and diff shows "difference"? must be whitespace or unicode weirdness... looks good enough...
hadoop@ip-10-17-178-207:/mnt$ diff edges.dereferenced[12]
121977d121976
< 6₂ knot Knot theory
121978a121978
> 6₂ knot Knot theory
254926d254925
< A♭ (musical note) Semitone
254927a254927
> A♭ (musical note) Semitone
1295453d1295452
< G♭ (musical note) Semitone
-- do the walk!
time java -Xmx8g -cp . DistanceToPhilosophy Philosophy /mnt/edges.dereferenced3 \
>DistanceToPhilosophy.stdout 2>DistanceToPhilosophy.stderr
./explorer.py
Slayer -> Thrash metal -> Heavy metal music -> Rock music -> Popular music -> Music genre -> Genre -> Literature -> Letter (alphabet) -> Grapheme -> Writing system -> Symbolic system -> Psychology -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Beer -> Alcoholic beverage -> Drink -> Liquid -> State of matter -> Phase (matter) -> Outline of physical science -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Linux -> Unix-like -> Operating system -> Computer software -> Computer program -> Instruction set -> Computer architecture -> Computer science -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Parachuting -> Parachute -> Atmosphere -> Gas -> State of matter -> Phase (matter) -> Outline of physical science -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Bad Religion -> Punk rock -> Rock music -> Popular music -> Music genre -> Genre -> Literature -> Letter (alphabet) -> Grapheme -> Writing system -> Symbolic system -> Psychology -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Vegemite -> Yeast extract -> Yeast -> Eukaryote -> Organism -> Biology -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
Hobart -> List of Australian capital cities -> States and territories of Australia -> Australia -> Southern Hemisphere -> Earth -> Planet -> Orbit -> Physics -> Natural science -> Science -> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
some quirks;
Natural science -> Branch_(academia) in live
Fact -> Truth
having problems around Antwerp
path is actually: Antwerp -> Municipality -> Australia -> Southern Hemisphere -> Earth -> Planet -> Orbit -> Physics -> Natural science -> Science
-> Knowledge -> Fact -> Information -> Sequence -> Mathematics -> Quantity -> Property (philosophy) -> Modern philosophy -> Philosophy
but distance lists
didn't visit antwerp, Municipality, Australia or Southern Hemisphere
Earth however is visited, at distance 14
there is no edge, Southern Hemisphere -> Earth
the parser must be broken.
fixed it again, and all article.egs work (from testArticleParser)
run redirects against redirects
pig -p INPUT=/full/redirects -p OUTPUT=/full/redirects.dereferenced1 -f dereference_redirects.pig
pig -p INPUT=/full/redirects.dereferenced1 -p OUTPUT=/full/redirects.dereferenced2 -f dereference_redirects.pig
pig -p INPUT=/full/redirects.dereferenced2 -p OUTPUT=/full/redirects.dereferenced3 -f dereference_redirects.pig
pig -p INPUT=/full/redirects.dereferenced3 -p OUTPUT=/full/redirects.dereferenced4 -f dereference_redirects.pig
hfs -mv /full/redirects /full/redirects.original
hfs -mv /full/redirects.dereferenced4 /full/redirects
run extraction
hadoop jar ~/contrib/streaming/hadoop-streaming.jar \
-input /full/articles.xml -output /full/edges \
-mapper articleParser.py -file articleParser.py
run redirects against edges
pig -p INPUT=/full/edges -p OUTPUT=/full/edges.dereferenced -f dereference_redirects.pig
pig -p INPUT=/full/edges.dereferenced -p OUTPUT=/full/edges.dereferenced2 -f dereference_redirects.pig # sanity check, should be no different
get to local filesystem
hadoop fs -cat /full/edges.dereferenced/* > data/edges
and run it up
time java -Xmx8g -cp . DistanceToPhilosophy Philosophy data/edges >DistanceToPhilosophy.stdout 2>DistanceToPhilosophy.stderr
work out which nodes we didn't visit
grep ^didnt DistanceToPhilosophy.stdout | sed -es/didnt\ visit\ // > didnt_visit
summarise why we didn't visit them
./walk_till_end.py < didnt_visit > walk_till_end.stdout
grep end\ of\ line$ walk_till_end.stdout | cut -f2 | sort | uniq -c | sort -nr | head
10397 List of United States cities by population
6282 Abnormality (behavior)
3447 Local development ministry * no article (live)
2062 West Slavs
1802 Azana (gnat) * no article (live)
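(walk_till_end.py is presumably along these lines; a sketch, assuming edges is a dict of from_node -> to_node):
def walk_till_end(node, edges):
    seen = set()
    while node in edges and node not in seen:    # follow the single outgoing edge
        seen.add(node)
        node = edges[node]
    return node, ('loop' if node in seen else 'end of line')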
Abnormality is one that i can't see how to fix,
perhaps just need these as special cases?
tried to fix with another hack but screw it; just need special cases
after...
hadoop fs -cat /full/edges.dereferenced/* > data/edges
run
cat special_edges_cases >> data/edges
with ' removal
6192 Abnormality (behavior)
4216 Special administrative region (People s Republic of China)
3718 People s Republic of China
3447 Local development ministry
2063 West Slavs
1963 Direct-controlled municipality of the People s Republic of China
1800 Azana (gnat)
1539 Earth s atmosphere
1486 Administrative divisions of the People s Republic of China
922 R&B;
-- perhaps time to try a mediawiki parser...
sudo apt-get install libxml2-dev libxslt-dev
wget http://pypi.python.org/packages/source/m/mwlib/mwlib-0.12.15.zip#md5=fae8cab1ef1421202c734c8c5f12b51a
unzip mwlib-0.12.15.zip
cd mwlib-0.12.15
sudo python setup.py install
from mwlib.uparser import simpleparse
from BeautifulSoup import BeautifulStoneSoup
from xml.sax.saxutils import unescape
file_contents = open('article.egs/arcology.eg','r').read()
xml = BeautifulStoneSoup(file_contents)
text = xml.find('text').string
text = unescape(text, {"&apos;": "'", "&quot;": '"'})
parsed = simpleparse(text)
from mwlib.uparser import parseString
from mwlib.xhtmlwriter import MWXHTMLWriter, preprocess
from BeautifulSoup import BeautifulStoneSoup
from xml.sax.saxutils import unescape
file_contents = open('article.egs/arcology.eg','r').read()
xml = BeautifulStoneSoup(file_contents)
wikitext = xml.find('text').string
wikitext = unescape(wikitext, {"&apos;": "'", "&quot;": '"'})
r = parseString(title='', raw=wikitext)
preprocess(r)
dbw = MWXHTMLWriter()
dbw.writeBook(r)
xml2 = BeautifulStoneSoup(dbw.asstring())
paras = xml2.findAll('div',{"class":"mwx.paragraph"})
paras