**This script downloads all the files listed in "data_files_to_download.txt", extracts them, and places them in the "data" folder. It takes two minutes to run on my laptop.**

The "data_files_to_download.txt" file right now (5/19) only contains links to the datasets listed on this webpage: https://botometer.iuni.iu.edu/bot-repository/datasets.html. (The same link is in our Slack channel.)

You don't have to worry about deleting the old data files before re-running the notebook; they will be overwritten.

The "data" folder is listed in our git repository's ".gitignore" file, which means nothing inside it will get pushed to the online GitHub repository. This is because our data files total hundreds of megabytes, and we don't have enough room in the online repository to hold them.

The output includes some folders and zipped csv files whose source I haven't identified. They probably come from caverlee-2011, cresci-2017, or cresci-2015.
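One way to trace where those extra folders and zipped csv files come from is to list the top-level entries of each archive before extracting it. The sketch below is only an illustration: `top_level_entries` is a hypothetical helper that is not part of this notebook, and it assumes the archive has been saved locally instead of being deleted after extraction.

```python
import tarfile

def top_level_entries(archive_path):
    """Return the sorted top-level names inside a tar archive (hypothetical helper)."""
    entries = set()
    with tarfile.open(archive_path, 'r:*') as tar:
        for name in tar.getnames():
            # Keep only the first real path component, e.g. 'a/b.csv' -> 'a'
            parts = [p for p in name.split('/') if p not in ('', '.')]
            if parts:
                entries.add(parts[0])
    return sorted(entries)
```

Calling it on each archive in turn and comparing the results against the contents of the "data" folder should show which dataset produced the unexpected files.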

Hopefully this helps.

# Imports

In [1]:
# handle files and directories
import os
import shutil

# handle internet
import requests

# decompress files of various formats
import tarfile
import zipfile
import gzip

# Make a "data" Directory

In [2]:
if 'data' not in os.listdir():
    os.mkdir('data')
    print('Created data directory')
else:
    assert os.path.isdir('data'), "There is a file named 'data', but the program needs a directory named 'data'"
    print('There is already a data directory')

There is already a data directory


# Function Definitions

In [7]:
def dload_extract(url, dirpath):

    # Move into directory
    os.chdir(dirpath)

    try:
        # Download
        print('downloading...', end=' ')
        r = requests.get(url)
        r.raise_for_status()  # fail loudly instead of saving an HTTP error page
        compressed_filename = url.split('/')[-1]
        with open(compressed_filename, 'wb') as compressed_file:
            compressed_file.write(r.content)
        print('complete')

        # Extract
        print('extracting...', end=' ')
        if '.tar' in compressed_filename:
            with tarfile.open(compressed_filename, 'r:*') as tar:
                tar.extractall()
        elif compressed_filename.endswith('.zip'):
            with zipfile.ZipFile(compressed_filename, 'r') as zip_ref:
                zip_ref.extractall()
            # Drop the metadata folder that macOS adds to zip files
            shutil.rmtree('__MACOSX', ignore_errors=True)
        elif compressed_filename.endswith('.gz'):
            # A plain .gz holds a single file: decompress it next to the archive
            with gzip.open(compressed_filename, 'rb') as f_in, \
                    open(compressed_filename[:-len('.gz')], 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        print('complete')

        print('cleaning up...', end=' ')
        # Remove compressed file
        os.remove(compressed_filename)
        print('complete')
    finally:
        # Return to original directory, even if a step above failed
        os.chdir('..')
    
def return_to_project_dir():
    current_dir_name = os.path.basename(os.getcwd())
    assert current_dir_name in {'RonnieNickSanderChris', 'data'}, \
        'Program lost track of which folder it\'s in. Try restarting the notebook kernel.'
    if current_dir_name == 'data':
        os.chdir('..')
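
Both functions above rely on `os.chdir` and on the project folder being named `RonnieNickSanderChris`. As a possible alternative, the sketch below passes the destination folder to `extractall` directly, so the working directory never changes. `extract_archive` is a hypothetical name, not something this notebook defines, and the sketch leaves out the download and cleanup steps.

```python
import os
import shutil
import tarfile
import zipfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar* or .zip archive into dest_dir without calling os.chdir."""
    os.makedirs(dest_dir, exist_ok=True)
    name = os.path.basename(archive_path)
    if '.tar' in name:
        with tarfile.open(archive_path, 'r:*') as tar:
            tar.extractall(path=dest_dir)
    elif name.endswith('.zip'):
        with zipfile.ZipFile(archive_path, 'r') as zip_ref:
            zip_ref.extractall(path=dest_dir)
        # Drop the metadata folder that macOS adds to zip files
        shutil.rmtree(os.path.join(dest_dir, '__MACOSX'), ignore_errors=True)
    else:
        raise ValueError('unsupported archive type: ' + name)
```

With explicit paths, a failed extraction cannot leave the notebook in the wrong folder, so a helper like `return_to_project_dir` would no longer be needed.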

In [4]:
def try_to_dload_and_extract(url, dirpath):
    return_to_project_dir()
    print('extracting', url.split('/')[-1])
    print('source:', url)
    try:
        dload_extract(url, dirpath)
        print('extraction successful')
        print()
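    # Catch download and archive errors so one bad URL doesn't stop the whole loop
    except (requests.RequestException, tarfile.ReadError, zipfile.BadZipFile, OSError) as e: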
        print('extraction failed:', e)
        return_to_project_dir()

# Download and Extract Everything

In [5]:
return_to_project_dir()
with open('data_files_to_download.txt', 'rt') as file:
    lines_including_comments = file.read().splitlines()
    # We ignore blank lines and lines starting with the pound symbol in data_files_to_download.txt
    data_files_to_download = [line.strip() for line in lines_including_comments
                              if line.strip() and not line.strip().startswith('#')]
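
To make the expected file format concrete, here is a small standalone sketch of the same filtering applied to an in-memory string. `read_url_list` is a hypothetical helper, and the sample text is made up for illustration; only the URL in it is taken from the output log further down.

```python
def read_url_list(text):
    """Return the URLs in text, skipping blank lines and '#' comment lines."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            urls.append(line)
    return urls

# Made-up file contents: one comment, one blank line, one URL
sample = """# Bot Repository datasets

https://botometer.iuni.iu.edu/bot-repository/datasets/verified-2019/verified-2019.tar.gz
"""
print(read_url_list(sample))
```

Only the URL line survives the filter, so the comment and the blank line never reach the download loop.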

In [6]:
for url in data_files_to_download:
    try_to_dload_and_extract(url, 'data')

print()
print('DONE!')

extracting verified-2019.tar.gz
source: https://botometer.iuni.iu.edu/bot-repository/datasets/verified-2019/verified-2019.tar.gz
extraction successful

extracting botwiki-2019.tar.gz
source: https://botometer.iuni.iu.edu/bot-repository/datasets/botwiki-2019/botwiki-2019.tar.gz
extraction successful

extracting cresci-rtbust-2019.tar.gz
source: https://botometer.iuni.iu.edu/bot-repository/datasets/cresci-rtbust-2019/cresci-rtbust-2019.tar.gz
extraction successful

extracting political-bots-2019.tar.gz
source: https://botometer.iuni.iu.edu/bot-repository/datasets/political-bots-2019/political-bots-2019.tar.gz
extraction successful

extracting botometer-feedback-2019.tar.gz
source: https://botometer.iuni.iu.edu/bot-repository/datasets/botometer-feedback-2019/botometer-feedback-2019.tar.gz
extraction successful

extracting vendor-purchased-2019.tar.gz
source: https://botometer.iuni.iu.edu/bot-repository/datasets/vendor-purchased-2019/vendor-purchased-2019.tar.gz
extraction successful

extr