
Wisteria Core

| Paper | Citation | License |

Project Wisteria's core Wikipedia link graph generation, serialisation, and analysis tools. These were developed for the paper "Large-Scale Analysis of Wikipedia’s Link Structure and its Applications in Learning Path Construction" by Y. Song and C. H. Leung, published at the 2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI), Bellevue, WA, USA [doi: 10.1109/IRI58017.2023.00051].

Table of contents:

  • Getting Started
  • Important Notes
  • Reusing Graphs
  • File Structure
  • Wikigraph Docs
  • Citation
  • License

Getting Started

  1. Make sure you have Julia installed and accessible via the command line. If not, install Julia. This code has been tested with Julia 1.7.3.

  2. If on Windows, make sure the curl and 7z commands are available on your command line. To provide the 7z command from 7-Zip, download 7zr.exe, rename it to 7z.exe, and add it to your PATH.

  3. Open a terminal and run the setup script to get started:

    julia setup.jl

    An Internet connection is required, as the necessary Wikipedia dump files are automatically downloaded and extracted for Wisteria to work correctly.

  4. To extract link relationships between Wikipedia articles, run the command:

    julia run.jl

    An Internet connection is required as Wikipedia dumps will be downloaded, unzipped, and deleted on the fly to save storage space.

    • If, for any reason, your link extraction is incomplete, you can check ./logs to find the last data dump that was parsed completely (there are 63 dumps to parse in total). Each dump's log files are saved under the dump's index (e.g. logs for dump index 1 are stored under ./logs/1).

      A completely parsed dump will generate something like the following at the end of title_errors.txt:

      Average number of links: 34.715
      Pages with links: 16226
      Number of pages (not counting redirects): 15704852
      

      If this is not generated, then parsing of that dump is incomplete, and you can instruct Wisteria to resume parsing from that dump.

      For example, suppose dump 21 is incomplete. To pick up progress from there, simply run

      julia run.jl 21
  5. We use PyPlot.jl to generate graphs for our experiments. This relies on an existing Python interpreter with matplotlib installed. To run our experiments, please first install Python and add the matplotlib package using pip install matplotlib (a short verification sketch follows this list).

  6. To run experiments/indexStats.jl, you will need the GitHub version of Pingouin.jl. Enter package manager mode in Julia by pressing ], and run:

    add https://github.com/clementpoiret/Pingouin.jl.git
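Before running the experiments, you can check that PyPlot.jl can drive your matplotlib installation (see step 5 above). The following is only a verification sketch, assuming PyPlot.jl was installed by setup.jl; the output filename is a placeholder:

# Verify that PyPlot.jl finds the Python matplotlib installation
using PyPlot

x = range(0, 2pi; length = 100)
plot(x, sin.(x))
savefig("pyplot_check.png")   # placeholder filename
println("PyPlot works; wrote pyplot_check.png")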
    

Important Notes

Since the Pageman object of our system uses a relative path to reference the list of titles on Wikipedia, please make sure that:

  • The file enwiki-20230101-all-titles-in-ns0 is present in ./data (this should be done automatically by setup.jl)
  • You are running any Julia scripts from the root of this repository (i.e. where you can see explore.jl, parser.jl, ./data, ./graph, etc.)

Otherwise, the relative path will not resolve and the title list will not be found.
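As a quick sanity check (not part of the repository's scripts), you can confirm both conditions from a Julia session started at the repository root; the messages are placeholders:

# Check that the title list is present and that we are at the repository root
@assert isfile("data/enwiki-20230101-all-titles-in-ns0") "Title list missing from ./data; rerun setup.jl"
@assert isfile("parser.jl") && isdir("graph") "This does not look like the repository root"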

Reusing Graphs

You can easily browse and reuse the graphs generated by someone else. Just place links.ria and pm.jld2 into the ./graph directory, and you should be able to load, serialise, and explore the graph without any problems.
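For instance, a minimal loading sketch (assuming setup.jl has already placed the title list in ./data; see the Wikigraph Docs section below for the full API):

# Load a graph generated elsewhere from ./graph, together with the title list
include("utils.jl")
include("wikigraph.jl")

wg = loadwg("graph/", "data/enwiki-20230101-all-titles-in-ns0")
println("Loaded ", wg.pm.totalpages, " pages (", wg.pm.numpages, " excluding redirects)")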

File Structure

  • parser.jl: Parses XML files for links and connection strengths (under development).
  • run.jl: Downloads all required Wikipedia dump files, extracts links, and stores graph in ./graph, with a list of unidentifiable titles stored in ./logs.
  • serialise.jl: Serialises graph into ./ser with all supported file formats.
  • setup.jl: Installs Julia packages, creates directories, checks commands, downloads data... If this runs without failure, you should be able to run the rest of Wisteria.
  • utils.jl: Utilities for saving links in the .RIA file format; defines the Pageman (page management) object for handling page IDs, titles, and redirects.
  • wikigraph.jl: Defines the Wikigraph object for capturing links and relationships between Wikipedia pages; functions to serialise Wikigraph into various file formats.

Wikigraph Docs

To load a Wikigraph:

# Include wisteria graph loading functions
include("utils.jl")
include("wikigraph.jl")

# Load a Wikigraph object
wg = loadwg("path/to/graph-directory", "path/to/all-titles-file")
# E.g. wg = loadwg("graph/", "data/enwiki-20230101-all-titles-in-ns0")

Attributes of a Wikigraph object:

  • wg.pm::Pageman

    Attributes of a Pageman object:

    • id2title::Vector{String}

      A vector mapping from Int32 IDs to String titles

    • title2id::Dict{String,Int32}

      A dictionary mapping from String titles to Int32 IDs

    • redirs::Vector{Int32}

      A vector mapping each Int32 ID to the Int32 ID it redirects to (an ID maps to itself if the page is not a redirect)

    • numpages::Int32

      Number of non-redirected pages

    • totalpages::Int32

      Total number of pages (including redirects)

  • wg.links::Vector{Vector{Pair{Int32, Int32}}}

    A vector mapping each Int32 ID to a vector of Pair{Int32, Int32} entries, each pairing a linked page ID with its link weight.
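As an illustration of how these attributes fit together, the short sketch below looks up a page by title, resolves its redirect, and reads back its canonical title. The title "Mathematics" is only a placeholder; substitute any title present in the dump.

# Map a title to its Int32 ID (placeholder title)
id = wg.pm.title2id["Mathematics"]

# Resolve redirects to obtain the canonical ID
canonical = traceRedir!(wg.pm, id)

# Read back the canonical title and count its outgoing links
println("Canonical title: ", wg.pm.id2title[canonical])
println("Outgoing links: ", length(wg.links[canonical]))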

Tying the above together, the following sample code extracts all links and their weights for node ID 1:

# Keep track of linked IDs
linked = Int32[]

# Loop through all IDs and weights connected to node 1
for (id, weight) in wg.links[1]

    # Handle redirected pages
    redirected_id = traceRedir!(wg.pm, id)

    # Check if ID is already linked
    if !(redirected_id in linked)

        # If not, add it to the vector of linked IDs
        push!(linked, redirected_id)

        # Print out ID and weight
        println("Connected to ", redirected_id, " with weight ", weight)
    end
end
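To present the same links by title rather than by numeric ID, the collected IDs can be passed back through id2title, continuing from the loop above:

# Print the titles of the linked pages collected above
for id in linked
    println(wg.pm.id2title[id])
end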

Citation

If you find our work helpful, please cite:

@INPROCEEDINGS{song2023large,
  author={Song, Yiding and Leung, Chun Hei},
  booktitle={2023 IEEE 24th International Conference on Information Reuse and Integration for Data Science (IRI)}, 
  title={Large-Scale Analysis of Wikipedia’s Link Structure and its Applications in Learning Path Construction}, 
  year={2023},
  pages={254-260},
  doi={10.1109/IRI58017.2023.00051}
}

License

All code in this repository is licensed under the GNU General Public License version 3.
