techtc-builder is software to build a dataset collection similar to
TechTC-300.


* For building the dataset collection

- Python 2.7 or above

- python-lxml

- dmoz RDF dump files structure.rdf.u8 and content.rdf.u8,
  downloadable from the dmoz RDF dump site

- One of the following text-based web browsers: w3m, lynx, elinks or links

- wget

* For converting the text files to CSV

- gnu parallel

- PyStemmer

* For filtering features by mutual information

- OpenCog (more specifically feature-selection)


** This is not required if you run the scripts from the project root **

Run the following script (most likely in superuser mode):

# ./install

You can specify the installation prefix with the --prefix option, for example

# ./install --prefix /usr


1) Download structure.rdf.u8 and content.rdf.u8 from the dmoz RDF dump site.

2) Strip the dmoz files of content the builder does not need:

$ structure.rdf.u8 -o structure_stripped.rdf.u8

$ content.rdf.u8 -o content_stripped.rdf.u8

This will take a few minutes but will greatly speed up the next step
and lower memory usage.

3) Create a techtc300 dataset collection:

$ ./ -c content_stripped.rdf.u8 -s structure_stripped.rdf.u8 -S 300

This will create a directory techtc300 with the dataset collection
inside. In fact these options are the defaults, so the following

$ ./

does the same thing.

There are multiple options; you can list them with --help. The
default (and recommended) tool used to convert HTML to text is w3m;
you can change that with the -H option, for example

$ -H html2text

here it will use html2text instead of w3m.
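Under the hood, the HTML-to-text step amounts to invoking one of the
text-based browsers in dump mode on each page. A minimal sketch (the
function names are illustrative and the flags are the common ones for
each tool; the real script may invoke them differently):

```python
import subprocess

# Map each supported converter to its dump-to-stdout invocation.
# The flags are the usual ones for each tool; the builder's actual
# invocation may differ.
CONVERTERS = {
    "w3m": ["w3m", "-dump"],
    "lynx": ["lynx", "-dump"],
    "links": ["links", "-dump"],
    "elinks": ["elinks", "-dump"],
    "html2text": ["html2text"],
}

def build_command(converter, html_file):
    """Return the command line converting html_file to plain text."""
    if converter not in CONVERTERS:
        raise ValueError("unsupported converter: %s" % converter)
    return CONVERTERS[converter] + [html_file]

def html_to_text(html_file, converter="w3m"):
    """Run the chosen converter and return its plain-text output."""
    result = subprocess.run(build_command(converter, html_file),
                            capture_output=True, text=True)
    return result.stdout
```

Any of the browsers listed in the requirements can be slotted in via
the -H option, since they all support a dump-to-stdout mode.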

4) Remove ill-formed directories (those missing positive or negative
text files). Here the collection directory is techtc300; replace as
appropriate:

$ ./strip_techtc techtc300
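This step amounts to walking the collection and deleting experiment
directories that are missing required files. A minimal sketch; the
required file names below are assumptions, so adapt them to what the
builder actually writes:

```python
import os
import shutil

def strip_collection(collection_dir,
                     required=("positive.txt", "negative.txt")):
    """Remove experiment directories missing any required text file.

    The names in `required` are illustrative placeholders for the
    positive/negative document files each Exp_XXXX_XXXX directory
    should contain.
    """
    for name in sorted(os.listdir(collection_dir)):
        exp_dir = os.path.join(collection_dir, name)
        if not os.path.isdir(exp_dir):
            continue
        files = set(os.listdir(exp_dir))
        if not all(r in files for r in required):
            shutil.rmtree(exp_dir)  # ill-formed: drop the whole directory
```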

5) -- optional -- Convert the text files into feature vectors in CSV
format. Each row corresponds to a document, and each column
corresponds to a word (0 if the word does not appear in the document,
1 otherwise). The first row holds the corresponding words. Just run

$ ./ techtc300

This will create a data.csv file under each Exp_XXXX_XXXX directory.
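The CSV layout described above can be sketched as a toy, in-memory
version (the real script also handles file I/O and presumably stems
words with PyStemmer, which this sketch skips):

```python
def texts_to_csv_rows(documents):
    """Turn tokenized documents into binary bag-of-words rows.

    Returns a header row of words followed by one row per document:
    column = word, cell = 1 if the word occurs in that document,
    0 otherwise -- the data.csv layout described above.
    """
    vocab = sorted({word for doc in documents for word in doc})
    rows = [vocab]  # first row holds the corresponding words
    for doc in documents:
        present = set(doc)
        rows.append([1 if word in present else 0 for word in vocab])
    return rows
```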

Then, if you no longer need the intermediate text files, you can run

$ ./ techtc300

Beware that this will permanently remove everything except the CSV files.

Since the web is full of noise, you may end up with CSV files
containing thousands of words, many of them gibberish. To filter that
out you can keep only the features whose mutual information with the
target is above a certain threshold; run

$ ./ techtc300 0.05

The script will copy the collection under a new directory (where 0.05
is the threshold), then rewrite all the data.csv files with the
filtering applied. This step uses feature-selection, a tool provided
with the OpenCog project.
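The filtering criterion can be sketched directly: compute the mutual
information between each binary feature column and the binary target,
and keep the columns above the threshold. This is a toy stand-in for
OpenCog's feature-selection tool, not its actual implementation:

```python
from math import log

def mutual_information(feature, target):
    """MI in bits between two binary variables given as 0/1 lists."""
    n = float(len(feature))
    mi = 0.0
    for f in (0, 1):
        pf = feature.count(f) / n
        for t in (0, 1):
            pt = target.count(t) / n
            # joint probability of (feature == f, target == t)
            pft = sum(1 for a, b in zip(feature, target)
                      if a == f and b == t) / n
            if pft > 0:  # 0 * log 0 contributes nothing
                mi += pft * log(pft / (pf * pt), 2)
    return mi

def filter_columns(header, rows, target, threshold):
    """Keep only feature columns with MI(feature; target) > threshold."""
    keep = [j for j in range(len(header))
            if mutual_information([row[j] for row in rows],
                                  target) > threshold]
    return ([header[j] for j in keep],
            [[row[j] for j in keep] for row in rows])
```

A column identical to the target carries 1 bit of mutual information,
while an independent column carries 0 bits, so a small threshold such
as 0.05 discards features that are essentially noise.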


Nil Geisweiller