
techtc-builder is software to build a dataset collection similar to
the techtc300


* For building the dataset collection

- Python 2.7 or above

- python-lxml

- dmoz RDF dump files structure.rdf.u8 and content.rdf.u8. Can be
  downloaded at

- One of the following text-based web browsers: w3m, lynx, elinks, links,

- wget

* For converting the datasets to CSV files

- gnu parallel

- PyStemmer

* For filtering the CSV files

- OpenCog (more specifically feature-selection)


* Install

** This is not required if you run the scripts from the project root **

Run the following script (most likely as superuser)

# ./

You can specify the installation prefix with the --prefix option, for example

# ./install --prefix /usr


* Usage

1) Download structure.rdf.u8 and content.rdf.u8 from

2) Strip the useless content from the dmoz files:

$ structure.rdf.u8 -o structure_stripped.rdf.u8

$ content.rdf.u8 -o content_stripped.rdf.u8

This will take a few minutes but greatly speeds up the next step and
lowers memory usage.

3) Create a techtc300 dataset collection

$ ./ -c content_stripped.rdf.u8 -s structure_stripped.rdf.u8 -S 300

This will create a directory techtc300 with the dataset collection
inside. In fact these options are the defaults, so the following

$ ./

does the same thing.

There are multiple options; you can get the list of them using
--help. The default (and recommended) tool used to convert HTML to
text is w3m; you can change it with the -H option, for example

$ -H html2text

here it will use html2text instead of w3m.

4) Remove ill-formed directories (where the positive or negative text
files are missing). Here the collection directory is techtc300;
replace it if appropriate

$ ./strip_techtc techtc300

5) -- optional -- convert the text files into feature vectors in CSV
format. Each row corresponds to a document, and each column
corresponds to a word (0 if it does not appear in the document, 1
otherwise). The first row contains the corresponding words. Just run

$ ./ techtc300

This will create data.csv files under each Exp_XXXX_XXXX directory.
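As an illustration of the format only (this is not the project's
conversion script, and the example documents are made up), a minimal
Python sketch producing such binary feature vectors could look like
this:

```python
import re

def to_feature_vectors(documents):
    """Binary bag-of-words: the first row is the sorted vocabulary,
    each following row holds 1 if the word occurs in the document,
    0 otherwise -- the same layout as the data.csv files."""
    tokenize = lambda text: set(re.findall(r"[a-z]+", text.lower()))
    token_sets = [tokenize(doc) for doc in documents]
    vocab = sorted(set().union(*token_sets))
    rows = [vocab]
    for tokens in token_sets:
        rows.append([1 if word in tokens else 0 for word in vocab])
    return rows

# Two toy documents (hypothetical; real rows come from the dmoz pages)
docs = ["the cat sat", "the dog barked"]
rows = to_feature_vectors(docs)
for row in rows:
    print(",".join(str(cell) for cell in row))
```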

Then, if you no longer need the intermediary text files, you can run

$ ./ techtc300

Beware that this will permanently remove everything except the CSV
files.

Since the web is full of mess you may end up with CSV files
containing thousands of words plus lots of gibberish. To filter that
out you can use a filter that keeps only the features whose mutual
information with the target is above a certain threshold; run

$ ./ techtc300 0.05

The script will copy the content under


where 0.05 is the threshold, then rewrite all data.csv files with the
filtered features. This step uses feature-selection, a tool provided
with the OpenCog project
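The actual filtering is done by feature-selection, but the criterion
itself is simple to sketch. Assuming binary features and targets as in
the data.csv files (an illustration only, not the tool's code), the
mutual information between a feature column and the target column can
be computed like this:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables
    given as parallel lists of values."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy columns: one feature perfectly predicts the target,
# the other is independent of it in this sample
target      = [1, 1, 1, 1, 0, 0, 0, 0]
informative = [1, 1, 1, 1, 0, 0, 0, 0]
noise       = [1, 1, 0, 0, 1, 1, 0, 0]

print(mutual_information(informative, target))  # 1.0 bit
print(mutual_information(noise, target))        # 0.0 bits
```

A threshold such as 0.05 then discards columns like `noise` whose
mutual information with the target is near zero.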


Nil Geisweiller