## Downloading schematics

I found this S3 bucket with some schematics in it. I think they might be useful for the model. The name of the bucket is `minecraft-schematics-raw`.

In [37]:
BUCKET_NAME = 'minecraft-schematics-raw'

### Logging

Before we get started, let's set up some logging. Advanced logging will be useful as we will be dealing with a lot of files and multiple threads.

In [36]:
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
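
Since several threads will be logging at once, it can help to see which thread produced each line. Here is a minimal sketch of that idea. The names `THREAD_AWARE_FORMAT` and `make_thread_aware_formatter` are made up for illustration and are not used elsewhere in this notebook; `%(threadName)s` is a standard `logging` record attribute.

```python
import logging

# Hypothetical format string: the same fields as the basic setup, plus the thread name.
THREAD_AWARE_FORMAT = '%(asctime)s - %(threadName)s - %(levelname)s - %(message)s'


def make_thread_aware_formatter():
    """Return a Formatter that includes the emitting thread's name in every record."""
    return logging.Formatter(THREAD_AWARE_FORMAT)
```

To try it, pass `THREAD_AWARE_FORMAT` as the `format=` argument of `logging.basicConfig` instead of the shorter string.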

### Download schematic

Let's first define a function that allows us to download a schematic from the bucket.

In [33]:
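import logging
import os

import boto3

# Create the client once; boto3 clients are thread-safe, so worker threads can share it
s3_client = boto3.client('s3')

def download_schematic(bucket_name, key):
    """Download one object from the bucket into the local schematics/ directory."""
    os.makedirs('schematics', exist_ok=True)
    s3_client.download_file(bucket_name, key, f'schematics/{key}')
    logging.info(f"Downloaded: {key}")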

### Listing all schematics

For some reason the Python API returns at most 1000 items per list call. There are easy workarounds for this (such as pagination), but I'm just going to use the AWS CLI, which paginates automatically, and save the output to a file.

In [35]:
!aws s3api list-objects-v2 --bucket minecraft-schematics-raw --output json | jq -r ".Contents[].Key" | tee schematics.list

1.schematic
10.schematic
100.schematic
1000.schematic
10001.schematic
10002.schematic
10003.schematic
10005.schematic
10006.schematic
10007.schematic
10008.schematic
10009.schematic
1001.schematic
10010.schematic
10011.schematic
10012.schematic
10013.schematic
10014.schematic
10015.schematic
10016.schematic
10017.schematic
10018.schematic
10019.schematic
1002.schematic
10020.schematic
10021.schematic
10022.schematic
10023.schematic
10024.schematic
10025.schematic
10026.schematic
10027.schematic
10028.schematic
10029.schematic
1003.schematic
10030.schematic
10031.schematic
10032.schematic
10033.schematic
10034.schematic
10035.schematic
10036.schematic
10037.schematic
10038.schematic
10039.schematic
1004.schematic
10040.schematic
10041.schematic
10042.schematic
10043.schematic
10044.schematic
10045.schematic
10046.schematic
10047.schematic
10048.schematic
10049.schematic
1005.schematic
10050.schematic
10051.schematic
10052.schematic
10053.schematic
10054.schematic
10055.schematic
10056.s
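
For reference, the pagination workaround mentioned above could look roughly like this. It is only a sketch and not what this notebook actually runs: `list_all_keys` is a hypothetical helper name, and the function takes an already-created S3 client so it can be tried without credentials. `get_paginator('list_objects_v2')` and `paginate(Bucket=...)` are standard boto3 client calls.

```python
def list_all_keys(s3_client, bucket_name):
    """Return every object key in the bucket, following pagination."""
    paginator = s3_client.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket_name):
        # 'Contents' is absent from a page that holds no objects
        for obj in page.get('Contents', []):
            keys.append(obj['Key'])
    return keys


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# keys = list_all_keys(boto3.client('s3'), 'minecraft-schematics-raw')
```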

Now we just read from the file and download the schematics. We'll use a thread pool (`concurrent.futures.ThreadPoolExecutor`) to speed things up. We will also set up some logic so that we can resume the download if it fails, without having to start from the beginning.

In [None]:
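from concurrent.futures import ThreadPoolExecutor, as_completed
from os import listdir, makedirs

# Retrieve the list of objects in the bucket from the file written by the CLI command
with open('schematics.list') as f:
    to_download = {line.strip() for line in f if line.strip()}

# Skip anything already on disk so that an interrupted run can be resumed
makedirs('schematics', exist_ok=True)
already_downloaded = set(listdir('schematics'))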
logging.info(f'Already downloaded {len(already_downloaded)} schematics')
to_download = to_download - already_downloaded
logging.info(f'Downloading {len(to_download)} schematics')

# Create a thread pool with a limited number of threads
max_threads = 8  # Set the desired maximum number of concurrent threads
with ThreadPoolExecutor(max_workers=max_threads) as executor:
    # Submit tasks to the thread pool
    futures = [executor.submit(
        download_schematic, BUCKET_NAME, schematic_name) for schematic_name in to_download]

    # Wait for all tasks to complete
    for future in as_completed(futures):
        future.result()

logging.info("All schematics downloaded.")
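
One thing to be aware of: `future.result()` re-raises the first exception it meets, so a single bad download stops the loop. Below is a sketch of a variant that records failures and keeps going. `download_all` and its `download_fn` parameter are hypothetical names introduced here for illustration, not part of the code above.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(keys, download_fn, max_workers=8):
    """Run download_fn(key) for every key; return (succeeded, failed) sets."""
    succeeded, failed = set(), set()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its key so failures can be reported by name
        future_to_key = {executor.submit(download_fn, key): key for key in keys}
        for future in as_completed(future_to_key):
            key = future_to_key[future]
            try:
                future.result()
            except Exception as exc:
                logging.warning(f'Failed to download {key}: {exc}')
                failed.add(key)
            else:
                succeeded.add(key)
    return succeeded, failed
```

In this notebook it could be called as `download_all(to_download, lambda key: download_schematic(BUCKET_NAME, key))`, and the returned `failed` set retried afterwards.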

## Processing

Now that all the schematics are downloaded, we can start processing them.

In [29]:
from os import listdir

schematics = set(listdir('schematics'))
print(f'Found {len(schematics)} schematics')
failed = set()
big = set()

Found 11003 schematics


### Separating good and bad schematics

We need to check the schematics for two things:
- Shape: We need to make sure that the schematics fit the 16x16x16 cube that we are using for our model.
- Errors: Sometimes the schematics are corrupted or come from a different version of Minecraft. Such schematics are not usable, as we cannot load them into our model.

In [30]:
from nbtschematic import SchematicFile
import logging


logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

length = len(schematics)

# Iterate over a copy so we can remove entries from the set as we go
for i, schematic in enumerate(list(schematics)):
    try:
        sf = SchematicFile.load(f'schematics/{schematic}')
        shape = sf.shape
    except Exception:
        # Corrupted file or unsupported format
        logging.info(f'{i}/{length}: Failed to load {schematic}')
        failed.add(schematic)
        schematics.remove(schematic)
        continue

    if shape[0] > 16 or shape[1] > 16 or shape[2] > 16:
        logging.info(f'{i}/{length}: {schematic} is too big')
        big.add(schematic)
        schematics.remove(schematic)
        continue

    logging.info(f'{i}/{length}: {schematic} is good shape')

2023-07-15 22:41:34,145 - INFO - 0/11003: 12921.schematic is too big
2023-07-15 22:41:34,146 - INFO - 1/11003: 12092.schematic is too big
2023-07-15 22:41:34,148 - INFO - 2/11003: 10770.schematic is good shape
2023-07-15 22:41:34,152 - INFO - 3/11003: 8235.schematic is too big
2023-07-15 22:41:34,155 - INFO - 4/11003: 11238.schematic is too big
2023-07-15 22:41:34,156 - INFO - 5/11003: 10610.schematic is good shape
2023-07-15 22:41:34,160 - INFO - 6/11003: 8156.schematic is too big
2023-07-15 22:41:34,161 - INFO - 7/11003: 7306.schematic is too big
2023-07-15 22:41:34,161 - INFO - 8/11003: 5634.schematic is good shape
2023-07-15 22:41:34,173 - INFO - 9/11003: 6923.schematic is too big
2023-07-15 22:41:34,174 - INFO - 10/11003: 4599.schematic is too big
2023-07-15 22:41:34,175 - INFO - 11/11003: 4728.schematic is good shape
2023-07-15 22:41:34,176 - INFO - 12/11003: 10226.schematic is too big
2023-07-15 22:41:34,179 - INFO - 13/11003: 14171.schematic is too big
2023-07-15 22:41:34,186 -
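
The size check in the loop above can also be written as a small helper, which makes the 16-block limit easier to change later. This is just a sketch: `fits_in_cube` is a hypothetical name, and it assumes `shape` is a sequence of three dimensions like the `sf.shape` value used above.

```python
def fits_in_cube(shape, limit=16):
    """Return True if every dimension of shape is at most limit blocks."""
    return all(dim <= limit for dim in shape)
```

With this, the condition in the loop would become `if not fits_in_cube(shape):`.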

### Moving the schematics

Now that we know which schematics are good and which are bad, we can move them to their respective folders.

In [32]:
import os

os.makedirs('schematics/valid', exist_ok=True)
os.makedirs('schematics/big', exist_ok=True)

for schematic in schematics:
    os.rename(f'schematics/{schematic}', f'schematics/valid/{schematic}')
    logging.info(f'Moved {schematic} to valid directory')

# We will also remove the schematics that give an error when loading
for schematic in failed:
    os.remove(f'schematics/{schematic}')
    logging.info(f'Removed {schematic}')

for schematic in big:
    os.rename(f'schematics/{schematic}', f'schematics/big/{schematic}')
    logging.info(f'Moved {schematic} to big directory')

2023-07-15 22:47:42,606 - INFO - Moved 10770.schematic to valid directory
2023-07-15 22:47:42,606 - INFO - Moved 10610.schematic to valid directory
2023-07-15 22:47:42,607 - INFO - Moved 5634.schematic to valid directory
2023-07-15 22:47:42,607 - INFO - Moved 4728.schematic to valid directory
2023-07-15 22:47:42,607 - INFO - Moved 2328.schematic to valid directory
2023-07-15 22:47:42,608 - INFO - Moved 2111.schematic to valid directory
2023-07-15 22:47:42,608 - INFO - Moved 5139.schematic to valid directory
2023-07-15 22:47:42,608 - INFO - Moved 7816.schematic to valid directory
2023-07-15 22:47:42,609 - INFO - Moved 9331.schematic to valid directory
2023-07-15 22:47:42,609 - INFO - Moved 5534.schematic to valid directory
2023-07-15 22:47:42,609 - INFO - Moved 15262.schematic to valid directory
2023-07-15 22:47:42,610 - INFO - Moved 4161.schematic to valid directory
2023-07-15 22:47:42,610 - INFO - Moved 4072.schematic to valid directory
2023-07-15 22:47:42,610 - INFO - Moved 14284.sch
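
If this cell is ever run a second time, `os.rename` will fail on files that were already moved. Below is a sketch of a re-runnable version of the move step. `move_into` is a hypothetical helper name, and it uses `shutil.move` in place of the `os.rename` calls above.

```python
import os
import shutil


def move_into(src_dir, subdir, names):
    """Move each named file from src_dir into src_dir/subdir; return how many were moved."""
    target = os.path.join(src_dir, subdir)
    os.makedirs(target, exist_ok=True)
    moved = 0
    for name in names:
        source = os.path.join(src_dir, name)
        if os.path.isfile(source):  # skip names that were already moved on an earlier run
            shutil.move(source, os.path.join(target, name))
            moved += 1
    return moved
```

For example, `move_into('schematics', 'valid', schematics)` and `move_into('schematics', 'big', big)` would replace the two move loops.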