## Spider description

A spider was written to collect pages from Wikipedia ([spider code](https://github.com/matveinazaruk/mag_information_search/tree/master/wiki_spider)).
It started from two pages:
~~~~ python
start_urls = ['https://en.wikipedia.org/wiki/Los_Angeles',
              'https://en.wikipedia.org/wiki/Information_retrieval']
~~~~

Spider settings:

~~~~ python
# number of requests processed concurrently
CONCURRENT_REQUESTS = 32
# delay before each request, in seconds
DOWNLOAD_DELAY = 0.1
# stop condition: number of items
CLOSESPIDER_ITEMCOUNT = 10000
~~~~

Items were written to a JSON Lines (.jl) file in the following format:

~~~~ json
{
    "id": "...",
    "url": "...",
    "snippet": "...",
    "links": ["...", "..."]
}
~~~~
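
For illustration, here is a minimal sketch of writing and reading such a file with the standard `json` module. The file name `items_sample.jl`, the helper names `write_jl` and `read_jl`, and the sample values are assumptions made for this example; only the field names come from the format above.

```python
import json
import os
import tempfile

# Hypothetical sample items; only the field names follow the format above.
sample_items = [
    {"id": "0", "url": "https://en.wikipedia.org/wiki/A",
     "snippet": "A ...", "links": ["https://en.wikipedia.org/wiki/B"]},
    {"id": "1", "url": "https://en.wikipedia.org/wiki/B",
     "snippet": "B ...", "links": []},
]

def write_jl(path, items):
    """Write one JSON object per line."""
    with open(path, "w") as out:
        for item in items:
            out.write(json.dumps(item) + "\n")

def read_jl(path):
    """Yield items one by one, so the whole file never has to fit in memory."""
    with open(path) as src:
        for line in src:
            if line.strip():  # skip empty lines
                yield json.loads(line)

sample_path = os.path.join(tempfile.gettempdir(), "items_sample.jl")
write_jl(sample_path, sample_items)
loaded = list(read_jl(sample_path))
```

Because every line is a complete JSON object, the file can be processed as a stream, which is how the code further down reads it.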

It ran for about 4.5 hours and collected **100034** items. The [Scrapinghub](http://scrapinghub.com "scrapinghub") platform was used for this.
However, the items come not only from the English Wikipedia. We first look only at the English part ([en.wikipedia.org](https://en.wikipedia.org)) and compare with the overall results at the end.  
English articles: **42092**
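
One way to see how a crawl is split between language editions is to count pages per host. The sketch below is written for Python 3 and uses `urllib.parse` and `collections.Counter`. The helper `host_of` is a hypothetical name and the short URL list is only a sample, not the crawl itself.

```python
from collections import Counter
from urllib.parse import urlparse

def host_of(url):
    """Return the host part of a URL, e.g. 'en.wikipedia.org'."""
    return urlparse(url).hostname

# A small sample of URLs, used here instead of the real items file.
urls = [
    "https://en.wikipedia.org/wiki/World_War_II",
    "https://en.wikipedia.org/wiki/London",
    "https://it.wikipedia.org/wiki/2005",
    "https://de.wikipedia.org/wiki/Mathematik",
]

pages_per_host = Counter(host_of(u) for u in urls)
english_only = [u for u in urls if host_of(u) == "en.wikipedia.org"]

for host, count in pages_per_host.most_common():
    print(host, count)
```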

## Parsing the file and building the graph

In [189]:
import json
import networkx as nx

items_file = 'items_full.jl'
item_ids = {}
item_relations = {}

def is_english_wiki(url):
    return 'en.wiki' in url

def print_top_info(top10, pagerank, show_snippet=False):
    top10urls = {}
    with open(items_file) as item_json:
        for i, line in enumerate(item_json):
            if i in top10:
                item = json.loads(line)
                top10urls[i] = item

    for i in top10:
        print(top10urls[i]['url'])
        print('Pagerank: ', pagerank[i])
        if show_snippet:
            print(top10urls[i]['snippet'])
        print()

with open(items_file) as item_json:
    for line in item_json:
        item = json.loads(line)
        if is_english_wiki(item['url']):
            item_ids[item['url']] = item['id']

with open(items_file) as item_json:
    for line in item_json:
        item = json.loads(line)
        if is_english_wiki(item['url']):
            item_relations[item['id']] = [item_ids[url] for url in item['links'] if url in item_ids]

graph = nx.DiGraph()
graph.add_nodes_from(item_relations)
for from_id in item_relations:
    graph.add_edges_from([(from_id, to_id) for to_id in item_relations[from_id]])
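
The two passes above can be checked on a toy input. The sketch below repeats the same logic on three hypothetical items kept in memory instead of a file: the first pass maps URLs to ids, and the second keeps only those links whose target page was itself crawled. All names and values here are made up for the example.

```python
# Hypothetical items; "u/not_crawled" stands for a link to a page outside the crawl.
items = [
    {"id": 0, "url": "u/A", "links": ["u/B", "u/C", "u/not_crawled"]},
    {"id": 1, "url": "u/B", "links": ["u/A"]},
    {"id": 2, "url": "u/C", "links": []},
]

# Pass 1: URL -> id for every crawled page.
url_to_id = {item["url"]: item["id"] for item in items}

# Pass 2: id -> ids of linked pages, dropping links to pages that were not crawled.
relations = {
    item["id"]: [url_to_id[url] for url in item["links"] if url in url_to_id]
    for item in items
}

# Edge list in the (from_id, to_id) form that a directed graph expects.
edges = [(src, dst) for src, targets in relations.items() for dst in targets]
print(sorted(edges))
```

Two passes are needed because a page can link to another page that appears later in the file, so the full URL-to-id mapping must exist before the links are resolved.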

## PageRank with the default alpha (0.85)

In [160]:
pagerank = nx.pagerank(graph)
top10 = sorted(pagerank, key=lambda k: pagerank[k], reverse=True)[:10]
print_top_info(top10, pagerank, True)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.0147880292089
World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. It involved the vast majority of the world's countries—including all of the grea...
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.0116592539118
The Pacific Ocean is the largest and deepest of the Earth's oceanic divisions. It extends from the Arctic Ocean in the north to the Southern Ocean (or, depending on definition, to Antarctica) in the south and is bounded by Asia and Australia in the west a...
https://en.wikipedia.org/wiki/London
Pagerank:  0.00700272726047
London i/ˈlʌndən/ is the capital and most populous city of England and the United Kingdom.[7][8] Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Rom...
https://en.wikipe
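
The `alpha` parameter is the damping factor: with probability `alpha` the random surfer follows an outgoing link, otherwise it jumps to a random page. To make its role concrete, here is a pure-Python power-iteration sketch on a hypothetical four-node graph. It is an illustration under simple assumptions (uniform jumps, rank of pages without outgoing links spread evenly over all pages), not the networkx implementation. The name `pagerank_sketch` and the toy graph are made up for this example.

```python
def pagerank_sketch(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration; out_links maps node -> list of successor nodes."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n
                    for node in nodes}
        # Every node passes alpha * rank, split equally, to its successors.
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                new_rank[target] += alpha * rank[node] / len(targets)
        diff = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if diff < tol:
            break
    return rank

# Hypothetical toy graph: D links to C but nothing links to D.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

for alpha in (0.3, 0.85):
    ranks = pagerank_sketch(toy_graph, alpha)
    print(alpha, sorted(ranks, key=ranks.get, reverse=True))
```

As `alpha` decreases, more weight goes to the random jump, so all scores move towards the uniform value `1/n`. This is consistent with the smaller top scores seen below for low `alpha`.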


**From here on we print the results without snippets, since they take up a lot of space.  
Below are the top-10 lists for other values of alpha:**

## alpha = 0.3

In [163]:
pagerank = nx.pagerank(graph, 0.3)
top10 = sorted(pagerank, key=lambda k: pagerank[k], reverse=True)[:10]
print_top_info(top10, pagerank)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.00558947774906
https://en.wikipedia.org/wiki/London
Pagerank:  0.0026815674597
https://en.wikipedia.org/wiki/Los_Angeles
Pagerank:  0.00130647888071
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.00130075019907
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.00116745750868
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00090263100118
https://en.wikipedia.org/wiki/Portugal
Pagerank:  0.000845448870245
https://en.wikipedia.org/wiki/Barack_Obama
Pagerank:  0.000751013302733
https://en.wikipedia.org/wiki/Roman_Catholic
Pagerank:  0.000669814271271
https://en.wikipedia.org/wiki/South_America
Pagerank:  0.000409215329487


## alpha = 0.5

In [164]:
pagerank = nx.pagerank(graph, 0.5)
top10 = sorted(pagerank, key=lambda k: pagerank[k], reverse=True)[:10]
print_top_info(top10, pagerank)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.00935972579562
https://en.wikipedia.org/wiki/London
Pagerank:  0.00431633133757
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.0033975813142
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00251074307908
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.00234269869496
https://en.wikipedia.org/wiki/Los_Angeles
Pagerank:  0.00226282149776
https://en.wikipedia.org/wiki/Portugal
Pagerank:  0.00174639643687
https://en.wikipedia.org/wiki/Barack_Obama
Pagerank:  0.00123045252626
https://en.wikipedia.org/wiki/Pinko
Pagerank:  0.00122553148907
https://en.wikipedia.org/wiki/Roman_Catholic
Pagerank:  0.00106120698426


## alpha = 0.7

In [165]:
pagerank = nx.pagerank(graph, 0.7)
top10 = sorted(pagerank, key=lambda k: pagerank[k], reverse=True)[:10]
print_top_info(top10, pagerank)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.0127242172838
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.00701210355829
https://en.wikipedia.org/wiki/London
Pagerank:  0.00594005544866
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00464438844094
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.0040760425632
https://en.wikipedia.org/wiki/Barack_Obama
Pagerank:  0.00394129647248
https://en.wikipedia.org/wiki/Pinko
Pagerank:  0.00338953195705
https://en.wikipedia.org/wiki/Los_Angeles
Pagerank:  0.00327722732879
https://en.wikipedia.org/wiki/Portugal
Pagerank:  0.00306817384256
https://en.wikipedia.org/wiki/Cantar_de_mio_Cid
Pagerank:  0.00280821074113


## alpha = 0.9

In [173]:
pagerank = nx.pagerank(graph, 0.9)
top10 = sorted(pagerank, key=lambda k: pagerank[k], reverse=True)[:10]
print_top_info(top10, pagerank)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.0123480557859
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.00885810421104
https://en.wikipedia.org/wiki/London
Pagerank:  0.0057715699261
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00569036973362
https://en.wikipedia.org/wiki/Barack_Obama
Pagerank:  0.00539521048687
https://en.wikipedia.org/wiki/Pinko
Pagerank:  0.00505543367327
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.00474028056369
https://en.wikipedia.org/wiki/Cantar_de_mio_Cid
Pagerank:  0.00384546466103
https://en.wikipedia.org/wiki/Portugal
Pagerank:  0.00360313817537
https://en.wikipedia.org/wiki/Los_Angeles
Pagerank:  0.00336406350704


As we can see, https://en.wikipedia.org/wiki/World_War_II is always in first place.  
The remaining positions change with alpha, while the top-10 set itself changes little: between neighbouring values among 0.3, 0.5, 0.7 and 0.9 at most one page is replaced.
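
The stability of the list can be checked directly. The sketch below copies the page names from the four outputs above (alpha = 0.3, 0.5, 0.7, 0.9) and reports which pages enter and leave the top 10 between neighbouring values. The variable names are made up for this example.

```python
# Page names in rank order, copied from the outputs above.
top10_by_alpha = {
    0.3: ["World_War_II", "London", "Los_Angeles", "Pacific_Ocean",
          "Spanish_language", "Bolshevik", "Portugal", "Barack_Obama",
          "Roman_Catholic", "South_America"],
    0.5: ["World_War_II", "London", "Pacific_Ocean", "Bolshevik",
          "Spanish_language", "Los_Angeles", "Portugal", "Barack_Obama",
          "Pinko", "Roman_Catholic"],
    0.7: ["World_War_II", "Pacific_Ocean", "London", "Bolshevik",
          "Spanish_language", "Barack_Obama", "Pinko", "Los_Angeles",
          "Portugal", "Cantar_de_mio_Cid"],
    0.9: ["World_War_II", "Pacific_Ocean", "London", "Bolshevik",
          "Barack_Obama", "Pinko", "Spanish_language", "Cantar_de_mio_Cid",
          "Portugal", "Los_Angeles"],
}

alphas = sorted(top10_by_alpha)
changes = []
for prev, cur in zip(alphas, alphas[1:]):
    before, after = set(top10_by_alpha[prev]), set(top10_by_alpha[cur])
    # Pages that entered the top 10 and pages that dropped out of it.
    changes.append((prev, cur, sorted(after - before), sorted(before - after)))

for prev, cur, entered, left in changes:
    print(prev, "->", cur, "entered:", entered, "left:", left)
```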

Now let's look at the results for all pages together, all 100034 of them:

In [177]:
items_file = 'items_full.jl'
item_ids_full = {}
item_relations_full = {}


with open(items_file) as item_json:
    for line in item_json:
        item = json.loads(line)
        item_ids_full[item['url']] = item['id']

with open(items_file) as item_json:
    for line in item_json:
        item = json.loads(line)
        item_relations_full[item['id']] = [item_ids_full[url] for url in item['links'] if url in item_ids_full]

graph_full = nx.DiGraph()
graph_full.add_nodes_from(item_relations_full)
for from_id in item_relations_full:
    graph_full.add_edges_from([(from_id, to_id) for to_id in item_relations_full[from_id]])

## alpha = 0.85

In [181]:
pagerank_full = nx.pagerank(graph_full)
top10_full = sorted(pagerank_full, key=lambda k: pagerank_full[k], reverse=True)[:10]
print_top_info(top10_full, pagerank_full)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.00654265735122
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.00448297229308
https://en.wikipedia.org/wiki/London
Pagerank:  0.00309974345443
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00297036250961
https://en.wikipedia.org/wiki/Barack_Obama
Pagerank:  0.00279542454996
https://en.wikipedia.org/wiki/Pinko
Pagerank:  0.00264677650011
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.00239900662955
https://it.wikipedia.org/wiki/Aiuto:IPA
Pagerank:  0.00205708848643
https://en.wikipedia.org/wiki/Cantar_de_mio_Cid
Pagerank:  0.00189738262538
https://ru.wikipedia.org/wiki/%D0%9D%D0%B5%D0%BC%D0%B5%D1%86%D0%BA%D0%B8%D0%B9_%D1%8F%D0%B7%D1%8B%D0%BA
Pagerank:  0.00185135266378


In this case, too, the top documents did not change much.

## alpha = 0.3

In [185]:
pagerank_full = nx.pagerank(graph_full, 0.3)
top10_full = sorted(pagerank_full, key=lambda k: pagerank_full[k], reverse=True)[:10]
print_top_info(top10_full, pagerank_full)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.00235270596754
https://en.wikipedia.org/wiki/London
Pagerank:  0.00112874161576
https://it.wikipedia.org/wiki/2005
Pagerank:  0.000686833275541
https://en.wikipedia.org/wiki/Los_Angeles
Pagerank:  0.000549924486117
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.000547399815594
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.000491403845372
https://it.wikipedia.org/wiki/Aiuto:IPA
Pagerank:  0.000468742589922
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.000379832327959
https://de.wikipedia.org/wiki/Griechische_Sprache
Pagerank:  0.000370082462965
https://de.wikipedia.org/wiki/Mathematik
Pagerank:  0.000359606302297


## alpha = 0.5

In [187]:
pagerank_full = nx.pagerank(graph_full, 0.5)
top10_full = sorted(pagerank_full, key=lambda k: pagerank_full[k], reverse=True)[:10]
print_top_info(top10_full, pagerank_full)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.00400822008919
https://en.wikipedia.org/wiki/London
Pagerank:  0.00185928303391
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.00130903644264
https://it.wikipedia.org/wiki/2005
Pagerank:  0.00111005568607
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00100080818922
https://it.wikipedia.org/wiki/Aiuto:IPA
Pagerank:  0.000964753030048
https://en.wikipedia.org/wiki/Los_Angeles
Pagerank:  0.000942909856044
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.000857657479486
https://de.wikipedia.org/wiki/Griechische_Sprache
Pagerank:  0.000683462809011
https://en.wikipedia.org/wiki/Portugal
Pagerank:  0.000618978946147


## alpha = 0.9

In [188]:
pagerank_full = nx.pagerank(graph_full, 0.9)
top10_full = sorted(pagerank_full, key=lambda k: pagerank_full[k], reverse=True)[:10]
print_top_info(top10_full, pagerank_full)

https://en.wikipedia.org/wiki/World_War_II
Pagerank:  0.00679257020724
https://en.wikipedia.org/wiki/Pacific_Ocean
Pagerank:  0.00524352433964
https://en.wikipedia.org/wiki/Barack_Obama
Pagerank:  0.0035660010517
https://en.wikipedia.org/wiki/London
Pagerank:  0.00321833340381
https://en.wikipedia.org/wiki/Bolshevik
Pagerank:  0.00317274578262
https://en.wikipedia.org/wiki/Pinko
Pagerank:  0.00301996853632
https://en.wikipedia.org/wiki/Spanish_language
Pagerank:  0.0028105393512
https://ru.wikipedia.org/wiki/%D0%9D%D0%B5%D0%BC%D0%B5%D1%86%D0%BA%D0%B8%D0%B9_%D1%8F%D0%B7%D1%8B%D0%BA
Pagerank:  0.00263896221135
https://en.wikipedia.org/wiki/Cantar_de_mio_Cid
Pagerank:  0.00241045696902
https://de.wikipedia.org/wiki/Niedersachsen
Pagerank:  0.00227287255946


As we can see, at different values of alpha Italian, Russian and German articles make it into the top.  
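
To put a number on this, the sketch below counts the non-English entries in each full-graph top 10. The language codes are read off the host names in the four outputs above. The variable names are made up for this example.

```python
# Language code of each top-10 entry, in rank order, copied from the outputs above.
top10_langs = {
    0.3:  ["en", "en", "it", "en", "en", "en", "it", "en", "de", "de"],
    0.5:  ["en", "en", "en", "it", "en", "it", "en", "en", "de", "en"],
    0.85: ["en", "en", "en", "en", "en", "en", "en", "it", "en", "ru"],
    0.9:  ["en", "en", "en", "en", "en", "en", "en", "ru", "en", "de"],
}

# Number of entries per alpha whose page is not from the English edition.
non_english = {
    alpha: sum(1 for lang in langs if lang != "en")
    for alpha, langs in top10_langs.items()
}

for alpha in sorted(non_english):
    print(alpha, non_english[alpha])
```

In these four runs the count falls from 4 at alpha = 0.3 to 2 at alpha = 0.85 and 0.9, so the non-English pages are more visible at low alpha.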
