# Unpacker

This script decompresses and untars archive files from the Dark Net Market Archives (Branwen, 2017). The unpacked files are written to disk so that data on market behavior can later be extracted from the HTML files of the 11 biggest markets in the dataset.


In [1]:
import os
import pandas as pd
import tarfile
import time
import csv

In [2]:
MAIN_DIR = "/Volumes/Extreme SSD"
DATA_DIR = os.path.join(MAIN_DIR, "data")

os.chdir(os.path.join(DATA_DIR, "archive"))
files = os.listdir(os.path.join(DATA_DIR, "archive"))
files.sort()
file_name = os.path.join(os.path.join(DATA_DIR, "archive"), files[4])

## Function Definition

The following section defines the functions that decompress and untar a market archive. The main function, `untar_market`, does ... things:
1. It opens the tarball.
2. It selects the valid files to be untarred, skipping images, fonts and stylesheets.
3. It writes a logbook for data quality purposes, recording the name, modification time, mode, type and size of each selected file.
4. It extracts the valid files and saves them in a designated folder.
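
The selection and logging steps above can be tried on a small in-memory tarball before touching the real archives. The snippet below is only a sketch under stated assumptions: `build_demo_tar`, `select_members`, `log_rows` and the shortened `JUNK_EXTENSIONS` set are hypothetical names made up for illustration, and the member names are invented. It decodes the tar type flag to text for readability, which the notebook's own function does not do.

```python
import csv
import io
import tarfile
import time

# Illustrative subset of the extensions that are skipped (assumption).
JUNK_EXTENSIONS = {"jpg", "png", "css"}


def build_demo_tar():
    """Build a small in-memory tarball with two HTML pages and two assets."""
    buffer = io.BytesIO()
    names = ("demo/index.html", "demo/logo.PNG",
             "demo/style.css", "demo/listing.html")
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            payload = b"x" * 10
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def select_members(tar):
    """Keep members whose lower-cased extension is not in JUNK_EXTENSIONS."""
    return [member for member in tar.getmembers()
            if member.name.rsplit(".", 1)[-1].lower() not in JUNK_EXTENSIONS]


def log_rows(members):
    """Render the log as semicolon-separated CSV text, header first."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, delimiter=";")
    writer.writerow(["name", "time", "mode", "type", "size"])
    for member in members:
        writer.writerow([member.name, time.ctime(member.mtime),
                         oct(member.mode), member.type.decode(), member.size])
    return out.getvalue()


with tarfile.open(fileobj=build_demo_tar(), mode="r") as tar:
    kept = select_members(tar)
    kept_names = [member.name for member in kept]
    log_text = log_rows(kept)

print(kept_names)
print(log_text)
```

Comparing extensions in lower case means `logo.PNG` is filtered out just like `logo.png` would be, which mirrors the `.lower()` call in the notebook's filter.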

In [5]:
# get name of market
def market_name(file_name):
    """
    This function reads the market name from the tar-file name
    """
    return file_name.split('/')[-1].split('.')[0]


def extract_market(_tar, _path, _members):
    _tar.extractall(path=_path, members=_members)


def untar_market(file_name):
    # file extensions to skip (images, fonts, stylesheets)
    junk = ('img', 'jpg', 'jpeg', 'png', 'css', 'eot',
            'woff', 'svg', 'ttf', 'eot?', 'ico')

    # opening the tar file in read mode (compression is detected automatically)
    with tarfile.open(file_name) as tar:
        members = tar.getmembers()
        valid_files = [
            tarinfo for tarinfo in members
            if tarinfo.name.split('.')[-1].lower() not in junk
        ]

        # writing log file for file extraction; the file is opened once so
        # that the header and all rows share a single encoding
        log_file = os.path.join(DATA_DIR, "logs", "".join(
            ["log_", market_name(file_name), ".csv"]))
        with open(log_file, 'w', newline='', encoding='utf8',
                  errors="surrogateescape") as file:
            writer = csv.writer(
                file, quoting=csv.QUOTE_NONNUMERIC, delimiter=';')
            writer.writerow(['name', 'time', 'mode', 'type', 'size'])

            # one row per selected archive member
            for info in valid_files:
                writer.writerow([info.name, time.ctime(info.mtime),
                                 oct(info.mode), info.type, info.size])

        # extracting the selected files; the with-block closes the tarball
        extract_market(tar, os.path.join(DATA_DIR, "unpacked"), valid_files)

## Market Selection

The 11 biggest markets were selected to be unpacked. The code below lists the selected markets and then looks up each name in `files`, the directory listing of the archive folder. A name that is missing from the listing raises a `ValueError`.

In [6]:
markets = ["outlawmarket.tar.xz", 
           "agora.tar.xz", 
           "nucleus.tar.xz", 
           "silkroad2.tar.xz", 
           "abraxas.tar.xz", 
           "diabolus.tar.xz",
           "themarketplace.tar.xz",
           "cryptomarket.tar.xz",
           "cloudnine.tar.xz",
           "hydra.tar.xz",
           "alphabay.tar.xz"]

selected_markets = [files[i] for i in [files.index(m) for m in markets]]
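
Because `files.index` raises `ValueError` at the first name that is not in `files`, it can be handy to see all missing archives at once. The following is a minimal sketch of such a check; `split_available` and the demo lists are hypothetical and not part of the notebook.

```python
def split_available(wanted, available):
    """Return (found, missing), both in the order given by `wanted`."""
    available_set = set(available)
    found = [name for name in wanted if name in available_set]
    missing = [name for name in wanted if name not in available_set]
    return found, missing


# Made-up directory listing and selection, for illustration only.
demo_files = ["agora.tar.xz", "hydra.tar.xz", "readme.txt"]
demo_markets = ["hydra.tar.xz", "agora.tar.xz", "alphabay.tar.xz"]

found, missing = split_available(demo_markets, demo_files)
print("found:", found)
print("missing:", missing)
```

In the notebook this could be called as `split_available(markets, files)`, with the unpacking step run only on the first returned list.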

## Bulk Unpack

Unpacks every selected market that has not been unpacked yet, that is, whose folder does not already exist in the `unpacked` directory.
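
The skip logic can be illustrated in isolation with a temporary directory. This sketch assumes, as the notebook does, that each archive extracts into a top-level folder named after the part of the file name before the first dot; `pending_archives` is a hypothetical helper introduced here for illustration.

```python
import os
import tempfile


def pending_archives(archive_names, unpacked_dir):
    """Return the archives that have no matching folder in `unpacked_dir`."""
    pending = []
    for name in archive_names:
        market = name.split(".")[0]  # "agora.tar.xz" -> "agora"
        if not os.path.isdir(os.path.join(unpacked_dir, market)):
            pending.append(name)
    return pending


with tempfile.TemporaryDirectory() as unpacked_dir:
    # Simulate a market that was unpacked in an earlier run.
    os.mkdir(os.path.join(unpacked_dir, "agora"))
    todo = pending_archives(["agora.tar.xz", "hydra.tar.xz"], unpacked_dir)

print(todo)
```

A check of this kind only tests whether the folder exists, so a folder left behind by an interrupted extraction would also be treated as already unpacked.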

In [7]:
def unpack_all(list_of_files):
    for file in list_of_files:
        # checks whether the archive has already been unpacked
        if os.path.exists(os.path.join(DATA_DIR, "unpacked", file.split(".")[0])):
            print(f"{file} has already been loaded into memory \n")
        else:
            t = time.strftime("%a, %d %b %Y %H:%M:%S", time.localtime())
            print(f"{file} is being unpacked. \nstarted on {t}. \n")
            untar_market(os.path.join(DATA_DIR, "archive", file))

In [8]:
unpack_all(selected_markets)

outlawmarket.tar.xz has already been loaded into memory 

agora.tar.xz has already been loaded into memory 

nucleus.tar.xz has already been loaded into memory 

silkroad2.tar.xz has already been loaded into memory 

abraxas.tar.xz has already been loaded into memory 

diabolus.tar.xz has already been loaded into memory 

themarketplace.tar.xz has already been loaded into memory 

cryptomarket.tar.xz has already been loaded into memory 

cloudnine.tar.xz has already been loaded into memory 

hydra.tar.xz has already been loaded into memory 

alphabay.tar.xz has already been loaded into memory 

