This notebook uses the AWS CLI to download files programmatically. The entire database is 165 TB, which is too large to store on any single machine, so the goal is to select random sections of the census (random, to avoid whatever biases there might be in any given part of the data) and use them as your training set.

In [1]:
import numpy as np

In [2]:
#!aws s3 sync s3://nara-1950-census/1950census/43290879-Arizona/43290879-Arizona-185483 [destination] --no-sign-request --quiet

The above command adds a folder named '[destination]' to the current directory and fills it with one specific set of files: the Arizona census files for a single enumeration district.
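Outside a notebook, the same command can be assembled as an argument list before it is run. The sketch below only builds and prints the command; it does not call the AWS CLI. The helper name `build_sync_command` and the destination `arizona_sample` are hypothetical, chosen for illustration. The bucket path and flags are the ones used in the cell above.

```python
import shlex

BUCKET_PREFIX = "s3://nara-1950-census/1950census/"

def build_sync_command(state_folder, district_folder, destination):
    """Return the argument list for one `aws s3 sync` call (nothing is executed)."""
    source = f"{BUCKET_PREFIX}{state_folder}/{district_folder}"
    return ["aws", "s3", "sync", source, destination, "--no-sign-request", "--quiet"]

# 'arizona_sample' is a hypothetical destination folder.
cmd = build_sync_command("43290879-Arizona", "43290879-Arizona-185483", "arizona_sample")
print(shlex.join(cmd))
# To actually download, the list could be passed to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell quoting problems if a destination path ever contains spaces.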

In [2]:
!aws s3 ls s3://nara-1950-census/1950census/ --no-sign-request

                           PRE 43290879-Alabama/
                           PRE 43290879-Alaska/
                           PRE 43290879-American_Samoa/
                           PRE 43290879-Arizona/
                           PRE 43290879-Arkansas/
                           PRE 43290879-California/
                           PRE 43290879-Colorado/
                           PRE 43290879-Connecticut/
                           PRE 43290879-Delaware/
                           PRE 43290879-District_of_Columbia/
                           PRE 43290879-Florida/
                           PRE 43290879-Georgia/
                           PRE 43290879-Guam/
                           PRE 43290879-Hawaii/
                           PRE 43290879-Idaho/
                           PRE 43290879-Illinois/
                           PRE 43290879-Indiana/
                           PRE 43290879-Iowa/
                           PRE 43290879-Kansas/
                           PRE 43290879-Kentucky/

These are the census files, grouped by state and territory. Let's look at what is inside one of them.

In [7]:
ga_vars = !aws s3 ls s3://nara-1950-census/1950census/43290879-Georgia/ --no-sign-request

In [18]:
ga_vars = [x.strip()[4:] for x in ga_vars]

In [20]:
len(ga_vars)

5141
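The slice `x.strip()[4:]` relies on every line starting with `PRE ` after stripping. A slightly more defensive version, sketched below, splits each line and keeps only entries whose first field is `PRE`. The function name `parse_prefixes` is hypothetical, and the sketch assumes folder names contain no spaces, as in the listing above.

```python
def parse_prefixes(listing_lines):
    """Extract folder names from `aws s3 ls` lines of the form 'PRE name/'."""
    prefixes = []
    for line in listing_lines:
        parts = line.split()
        # Folder entries have exactly two fields: the literal 'PRE' and the name.
        if len(parts) == 2 and parts[0] == "PRE":
            prefixes.append(parts[1])
    return prefixes

sample = [
    "                           PRE 43290879-Alabama/",
    "                           PRE 43290879-Alaska/",
    "",  # blank lines are skipped
]
print(parse_prefixes(sample))  # ['43290879-Alabama/', '43290879-Alaska/']
```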

Now comes the important part. We know that the commented-out command above returns about 200 MB of data, so we don't want to go overboard here. Keep in mind that the sections of data we'll be looking at are rather small snippets of each page, so we don't yet have to worry much about the data being larger than RAM, but whole pages might still be larger than free memory.

With that said, the process is as follows: for every state/territory folder, we randomly pull 2 folders from its list, which should come out to around 20 GB. If more is needed, we will take more, but that seems unlikely. For reproducibility, we set a random seed and write out which folders we downloaded.
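The 20 GB figure is back-of-the-envelope arithmetic, which can be written out as a small sketch. The function name `estimate_download_gb` is hypothetical, and the count of 50 top-level folders is a placeholder assumption: the listing above is not shown in full, so the real number of state/territory folders may differ. The 200 MB per folder comes from the single Arizona folder mentioned earlier.

```python
def estimate_download_gb(n_state_folders, folders_per_state, mb_per_folder=200):
    """Rough download size in GB, treating 1 GB as 1000 MB."""
    return n_state_folders * folders_per_state * mb_per_folder / 1000

# 50 is an assumed count of state/territory folders, not a measured one.
print(estimate_download_gb(50, 2))  # 20.0
```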

In [8]:
def grab_fifty_census_section(rand,files_per_state,storage_folder):
    '''
    Downloads a portion of the population schedule of the 1950 Census from https://registry.opendata.aws/nara-1950-census/
    In particular, takes a set amount of folders per each state/territory and stores to a specified folder
    Also creates a text file in specified folder that lists out folders taken from 1950 Census for use by others
    
    Variables:
    ~~~~~~~~~~~
    rand: (int) seed for the random number generator, for reproducibility
    
    files_per_state: (int) Number of file folders you want per state- each folder corresponds to an enumeration district. Check maps on above site for further details on enumeration districts
    
    storage_folder: (str) folder (in current directory) where you want the files and associated folders to be saved. A '/' is required at the end of the storage folder string
    
    Returns:
    ~~~~~~~~
    None
    '''
    rng = np.random.default_rng(rand) #from https://towardsdatascience.com/stop-using-numpy-random-seed-581a9972805f
    state_folders = !aws s3 ls s3://nara-1950-census/1950census/ --no-sign-request
    state_folders = [x.strip()[4:] for x in state_folders]
    file_list = np.array([])
    for state in state_folders:
        new_files = grab_state_section(rng,files_per_state,state,storage_folder+state)
        file_list = np.append(file_list,new_files)
        
    np.savetxt(storage_folder+'filelist.txt',file_list,fmt='%s')
        

In [10]:
def grab_state_section(randgen,files_per_state,state,storage_folder):
    '''
    Sub-function that grabs specific folders per each state for grab_fifty_census_section.
    
    Variables:
    ~~~~~~~~~~
    randgen: RNG object created to make a reproducible state
    
    files_per_state: (int) number of folders you want per state
    
    state: (str) state/territory that's being dealt with- in particular, the folder name of said state
    
    storage_folder: (str) folder in which to store said subfolders- notably, includes state
    '''
    
    filepath = 's3://nara-1950-census/1950census/' + state
    state_vars = !aws s3 ls {filepath} --no-sign-request
    state_vars = [x.strip()[4:] for x in state_vars]
    selected_folders = randgen.choice(state_vars,files_per_state,replace=False)
    
    for folder in selected_folders:
        !aws s3 sync {filepath+folder} {storage_folder+folder} --no-sign-request --quiet  
        #thanks to https://stackoverflow.com/questions/35497069/passing-ipython-variables-as-arguments-to-bash-commands
        #for the tip on putting variables in command-line
    return np.array([filepath+y for y in selected_folders])
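The reproducibility claim rests on `np.random.default_rng` returning the same draws for the same seed. A minimal check, using hypothetical folder names in place of a real S3 listing:

```python
import numpy as np

# Hypothetical folder names standing in for one state's listing.
folders = [f"district-{i}/" for i in range(10)]

first = np.random.default_rng(815).choice(folders, 2, replace=False)
second = np.random.default_rng(815).choice(folders, 2, replace=False)

print(list(first) == list(second))  # True: same seed, same selection
print(len(set(first)))              # 2: replace=False gives distinct folders
```

Because a single generator is passed through every state in turn, the folders drawn for a given state also depend on the listings processed before it. If the bucket contents or their order change, the same seed may yield a different selection, which is one reason to keep the written file list.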
    

In [9]:
grab_fifty_census_section(815,2,'data/')

In total, 15.5 GB. The random folder we picked earlier was slightly larger than average, but there were also more folders than previously anticipated.
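One way to check a total like this from Python is to walk the download folder and sum the file sizes. This is a sketch, not necessarily how the figure above was obtained. The function name `folder_size_gb` is hypothetical, and `data` is the storage folder passed to `grab_fifty_census_section` above.

```python
import os

def folder_size_gb(root):
    """Sum the sizes of all files under root, in GB (1 GB = 1e9 bytes)."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes / 1e9

# Only report a size if the download folder exists on this machine.
if os.path.isdir("data"):
    print(f"{folder_size_gb('data'):.1f} GB")
```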

Further steps should reduce these pages to individual cells that we can actually train a model on, and then start the labeling process.