<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_articles_to_hyperlinks_network_(quick_and_dirty).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wikipedia articles to hyperlinks network (quick and dirty)

**Input:** a list of Wikipedia articles (CSV).

**Output:** a network of Wikipedia articles connected by hyperlinks (GEXF).

This script queries Wikipedia for each article in the input list and retrieves the hyperlinks found in each article. It then outputs a directed network where the nodes are exactly the articles of the input list, and the edges are the hyperlinks between these articles (if any).


## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take the output file from the notebook folder
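
If you do not have an input file yet, the sketch below shows one way to create a minimal one. It is only an illustration: the file name `example-articles.csv` and the three titles are placeholders, and you would need to point the `input_file` setting at whatever file you actually use. The column header matches the default `article_name_column` setting.

```python
import csv

# Placeholder titles, used only for illustration.
titles = ["Search engine privacy", "Real-name system", "Spam blog"]

# QUOTE_NONNUMERIC puts every text field in double quotes, which matches
# the quoting option the notebook passes to pd.read_csv.
with open("example-articles.csv", "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(["Article"])              # header = article_name_column
    writer.writerows([t] for t in titles)     # one article title per row
```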


# SETTINGS

In [2]:
# Input file
input_file = "wikipedia-articles.csv"

# Which column contains the article title?
article_name_column = "Article"

# Output file
output_file = "Wikipedia-articles-hyperlinks-network.GEXF"

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries.
You can ignore the installation output.

In [3]:
# Install (if needed)
!pip install wikipedia-api
!pip install pandas
!pip install networkx

# Import
import wikipediaapi
import pandas as pd
import networkx as nx
import csv

print("Done.")

Done.


### Read the input file

In [4]:
article_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df

Preview of the article list:


Unnamed: 0,Article
0,Search engine privacy
1,Member Berries
2,Real-name system
3,CSipSimple
4,Spam blog
...,...
539,Helix Kitten
540,Alternative Informatics Association
541,CyberSource
542,Flyposting


### Harvest Wikipedia

In [5]:
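# This is an object we use to connect to the API.
# Note that we configure it to use the English Wikipedia.
# Recent versions of wikipedia-api require a user agent that identifies your project.
# The value below is a placeholder: replace it with your own project name and contact.
wiki_wiki = wikipediaapi.Wikipedia(
  user_agent='mapping-controversies-notebook (your-email@example.com)',
  language='en',
  extract_format=wikipediaapi.ExtractFormat.WIKI
)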

seen = set()
network = {}
total = len(article_df.index)
print("Harvesting all links from "+str(total)+" wikipedia pages. This might take a while...")
count = 0

# Harvest each article one by one
for title in article_df[article_name_column]:
  if title not in seen: # Do not harvest the same article twice
    seen.add(title)
    try:
      page = wiki_wiki.page(title)
      # Keep the titles of all the pages this article links to
      network[title] = sorted(page.links.keys())
    except Exception as error:
      print('SKIPPED: '+str(title)+' (an error occurred: '+str(error)+')')
  count = count+1
  # Report progress once the page has actually been processed
  if count % 50 == 0:
    print("All links harvested from "+str(count)+" pages out of "+str(total)+". Continuing harvest...")

print("Done.")

Harvesting all links from 544 wikipedia pages. This might take a while...

All links harvested from 50 pages out of 544. Continuing harvest...
All links harvested from 100 pages out of 544. Continuing harvest...
All links harvested from 150 pages out of 544. Continuing harvest...
All links harvested from 200 pages out of 544. Continuing harvest...
All links harvested from 250 pages out of 544. Continuing harvest...
All links harvested from 300 pages out of 544. Continuing harvest...
All links harvested from 350 pages out of 544. Continuing harvest...
All links harvested from 400 pages out of 544. Continuing harvest...
All links harvested from 450 pages out of 544. Continuing harvest...
All links harvested from 500 pages out of 544. Continuing harvest...
Done.
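
The harvest skips an article as soon as a request fails. If failures are temporary (a network hiccup, for instance), retrying after a short pause may recover some of them. The sketch below is one possible approach, not part of this notebook: `fetch_with_retry` is a hypothetical helper that takes any function returning a list of link titles, and the retry count and delay are arbitrary defaults.

```python
import time

def fetch_with_retry(fetch, title, retries=3, delay=1.0):
    """Call fetch(title), pausing and retrying if it raises an exception.

    Returns the result of fetch(title), or None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(title)
        except Exception as error:
            print("Attempt "+str(attempt)+" failed for "+str(title)+": "+str(error))
            if attempt < retries:
                time.sleep(delay)  # wait before trying again
    return None
```

In the harvest loop, this could be called as `fetch_with_retry(lambda t: sorted(wiki_wiki.page(t).links.keys()), title)`, storing the result in `network` only when it is not `None`.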


### Build network

In [6]:
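membersonly_edges = []
members = network.keys()
print("Building network...")
for source in network:
  for target in network[source]:
    # Keep only the links that point to another article of the input list
    if target in members:
      membersonly_edges.append((source, target))
print("Saving network...")
G = nx.DiGraph()
# Add every harvested article as a node, so that articles without links are kept
G.add_nodes_from(members)
G.add_edges_from(membersonly_edges)
nx.write_gexf(G, output_file)

print('Done.')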

Building network...
Saving network...
Done.
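
To make the filtering step concrete, here is a small self-contained sketch in plain Python. The titles and links are made up for illustration. It keeps only the links whose target is also a harvested article, and it lists the articles left without any edge; a graph built from edges alone would not contain those unless they are added as nodes separately.

```python
# Toy harvest result: each input article maps to the titles it links to.
# These titles and links are invented for illustration.
network = {
    "A": ["B", "C", "Z"],   # "Z" is not in the input list
    "B": ["A"],
    "C": [],
    "D": ["Y"],             # "Y" is not in the input list either
}

members = set(network)

# Keep a link only if its target is also one of the input articles.
edges = [
    (source, target)
    for source, targets in network.items()
    for target in targets
    if target in members
]

# Articles that appear in no edge at all, as source or as target.
isolated = sorted(members - {node for edge in edges for node in edge})

print(edges)     # [('A', 'B'), ('A', 'C'), ('B', 'A')]
print(isolated)  # ['D']
```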
