Introduction

I wrote this code to extract Turkish Wikipedia dumps for testing NLP algorithms. The other extractors I tried had character-handling issues and did not work well with the Turkish Wikipedia dump.

This module contains code for manipulating the Wikipedia dumps available at http://download.wikimedia.org/backup-index.html

It has been tested only with the Turkish Wikipedia dump at https://dumps.wikimedia.org/trwiki/20170601/trwiki-20170601-pages-articles.xml.bz2 and has not been tested with other dumps.

Installation

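The required libraries are re and string (both part of the Python standard library), plus mwxml and cleantext, which must be installed separately. The code is written for Python 3.6.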
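Before running the extractor, it may help to confirm that the listed modules can be imported. The following is a minimal sketch using only the standard library; the function name `missing_modules` is hypothetical and not part of this repository.

```python
import importlib.util

# Module names as listed in this README; re and string ship with Python.
REQUIRED = ["re", "string", "mwxml", "cleantext"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any name reported as missing would need to be installed with your package manager of choice before the program can run.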

The program expects the decompressed wiki dump to be named "wiki.xml" and placed in the root directory of the project.
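The dump is distributed as a .xml.bz2 archive, so it has to be decompressed and renamed first. The sketch below shows one way to prepare "wiki.xml" and then walk its pages with mwxml. It is an illustration under assumptions, not this repository's actual code: the helper names `decompress_dump`, `iter_article_texts` and `open_dump` are hypothetical, and restricting the output to namespace 0 (articles) is an assumed choice.

```python
import bz2
import shutil


def decompress_dump(bz2_path, xml_path="wiki.xml"):
    """Decompress a .xml.bz2 dump to the file name the program expects."""
    with bz2.open(bz2_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large dumps are fine
    return xml_path


def iter_article_texts(dump):
    """Yield (title, text) for main-namespace pages that have text.

    `dump` is any iterable of pages whose revisions can be iterated,
    such as the object returned by mwxml.Dump.from_file.
    """
    for page in dump:
        if page.namespace != 0:  # assumption: keep only article pages
            continue
        for revision in page:
            if revision.text:
                yield page.title, revision.text


def open_dump(xml_path="wiki.xml"):
    """Open the dump with mwxml (third-party, imported lazily)."""
    import mwxml

    # Explicit UTF-8 so Turkish characters are decoded consistently.
    return mwxml.Dump.from_file(open(xml_path, encoding="utf-8"))


if __name__ == "__main__":
    decompress_dump("trwiki-20170601-pages-articles.xml.bz2")
    for title, text in iter_article_texts(open_dump()):
        print(title, len(text))
```

Opening the file with an explicit UTF-8 encoding avoids depending on the platform default, which is one common source of broken Turkish characters such as ı, ğ and ş.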
