# CreateTrainingRandomSample #

**Author:** Andrew Larkin <br>
**Date Created:** October 21st, 2019 <br>
**Organization:** Oregon State University, College of Public Health and Human Sciences


**Summary** <br>
Our initial sampling plan for the training dataset was based on a 50m grid of Google Street View (GSV) imagery across all urban areas in Washington. However, the grid yielded more images than expected (~3 million). To reduce the number of images in the training dataset, we adopted a two-step sampling approach.

1) The number of images sampled from each urban area is proportional to the number of Twin Registry participants living within the urban area (5km buffer).  
2) Images will be sampled to maximize the uniform intra-urban distribution of features across the training sample for each urban area in Washington.

This script implements step 1. From the grid sampling dataset, the target number of images for each urban area is calculated. For each urban area, a random sample 100 times the target number of images is then drawn from the 50m grid of GSV images.
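The proportional allocation in step 1 can be illustrated with a small worked sketch. The toy participant table, the variable names, and the use of the conventional `pd` alias (the notebook itself imports pandas as `ps`) are assumptions for illustration only; the totals mirror the train, dev, and test targets defined later in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per participant, tagged with the
# urban area (FID) the participant lives in.
participants = pd.DataFrame({"FID": [1, 1, 1, 2]})

N_TOTAL = 1000 + 100 + 100   # train + dev + test targets
OVERSAMPLE = 100             # step 1 draws 100x the target number of images

# Share of participants living in each urban area.
prop_twins = participants["FID"].value_counts(normalize=True).sort_index()

# Number of grid points to draw per urban area in step 1.
n_download = (prop_twins * N_TOTAL * OVERSAMPLE).round().astype(int)
print(n_download.to_dict())
```

In this toy case, the urban area holding 75% of participants is allotted 90,000 candidate grid points and the other area 30,000.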

### import required libraries ###

In [None]:
import pandas as ps
import numpy as np
import globalConstants as gConst
import os

### define global constants ###

In [14]:
PARENT_FOLDER = gConst.PARENT_FOLDER
INPUT_FILE = PARENT_FOLDER + "Twin_Address_Training_Subset_10_17_19_reduced.csv"
OUTPUT_FILE = PARENT_FOLDER + "Twin_Proportional_Training_Samples_10_17_19.csv"
COLUMN_NAMES = ['FID','UATYP10','n_twins','prop_twins','n_train','n_test','n_dev','n_download']
REGION_FILE = PARENT_FOLDER + "Twin_Proportional_Training_Samples_10_17_19.csv"
GSV_GRID = PARENT_FOLDER + "pointsToSampleInt_10_23_19.csv"
N_TRAIN = 1000
N_DEV = 100
N_TEST = 100

### load data ###

In [None]:
columnVals = [[] for i in range(len(COLUMN_NAMES))]
rawData = ps.read_csv(INPUT_FILE)
numSamples = len(rawData['FID'])
uniqueFIDs = list(set(rawData['FID']))
rawData.head()

### calculate the number of test, train, and dev images for each urban area ###

In [None]:
for uniqueFID in uniqueFIDs:
    subsetData = rawData.loc[rawData['FID'] == uniqueFID]
    nSubsetRecords = len(subsetData['FID'])
    propRecords = nSubsetRecords/numSamples
    columnVals[COLUMN_NAMES.index('FID')].append(uniqueFID)
    columnVals[COLUMN_NAMES.index('UATYP10')].append(list(subsetData['UATYP10'])[0])
    columnVals[COLUMN_NAMES.index('n_twins')].append(nSubsetRecords)
    columnVals[COLUMN_NAMES.index('prop_twins')].append(propRecords)
    columnVals[COLUMN_NAMES.index('n_train')].append(propRecords*N_TRAIN)
    columnVals[COLUMN_NAMES.index('n_test')].append(propRecords*N_TEST)
    columnVals[COLUMN_NAMES.index('n_dev')].append(propRecords*N_DEV)
    columnVals[COLUMN_NAMES.index('n_download')].append(propRecords*100*(N_TRAIN+N_DEV+N_TEST))
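For comparison, the same per-area statistics could be computed without an explicit loop. The following is a minimal sketch using `groupby`, not the method used in this notebook; the toy `raw` table stands in for `rawData` and its values are hypothetical. Unlike the loop, which visits areas in set order, `groupby` returns areas sorted by `FID`.

```python
import pandas as pd

N_TRAIN, N_DEV, N_TEST = 1000, 100, 100

# Hypothetical stand-in for rawData: one row per participant.
raw = pd.DataFrame({
    "FID": [7, 7, 7, 9],
    "UATYP10": ["U", "U", "U", "C"],
})

# One row per urban area: first urban type seen and participant count.
stats = (
    raw.groupby("FID")
       .agg(UATYP10=("UATYP10", "first"), n_twins=("UATYP10", "size"))
       .reset_index()
)

# Proportion of all participants, then the scaled image targets.
stats["prop_twins"] = stats["n_twins"] / len(raw)
stats["n_train"] = stats["prop_twins"] * N_TRAIN
stats["n_test"] = stats["prop_twins"] * N_TEST
stats["n_dev"] = stats["prop_twins"] * N_DEV
stats["n_download"] = stats["prop_twins"] * 100 * (N_TRAIN + N_DEV + N_TEST)
print(stats)
```

The resulting columns appear in the same order as `COLUMN_NAMES`, so the frame could be written to csv directly.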

### save urban area image stats to csv ###

In [None]:
# pair each column name with its list of values
df = ps.DataFrame(dict(zip(COLUMN_NAMES, columnVals)))
df.to_csv(OUTPUT_FILE,index=False)

### load urban area metadata from csv ###
**Inputs**
- **inputFilepath** (str) - absolute filepath of csv file containing urban metadata

**Outputs**
- **subsetRecords** (pandas dataframe) - loaded records whose `status` column equals `'id'`

In [9]:
def getRegionsToSample(inputFilepath):
    rawData = ps.read_csv(inputFilepath)
    subsetRecords = rawData[rawData['status']=='id']
    return(subsetRecords)

### get number of images to sample for one urban area ###
**Inputs**
- **FID** (str) - unique urban area id
- **regionSampleRates** (pandas dataframe) - metadata for each urban area

**Outputs**
- **unnamed** (int) - number of images to sample for the urban area with unique id = FID

In [None]:
def getNumberToSample(FID,regionSampleRates):
    rawDataSubset = regionSampleRates[regionSampleRates['FID'] == FID]
    return(list(rawDataSubset['n_download'])[0])

### create a proportional random sample of GSV images from each urban area and save to csv ###

In [22]:
regionSampleMeta = getRegionsToSample(REGION_FILE)
regionIds = list(regionSampleMeta['FID'])
sampleGrid = ps.read_csv(GSV_GRID)
for region in regionIds:
    sampleSize = int(round(list(regionSampleMeta[regionSampleMeta['FID']==region]['n_download'])[0]))
    sampleGridSubset = sampleGrid[sampleGrid['NEAR_FID']==region]
    numImagesAvailable = len(sampleGridSubset['NEAR_FID'])
    sampleSize = min(sampleSize,numImagesAvailable)
    if(sampleSize>=1):
        # sample() returns a new DataFrame, so keep the result before saving
        sampleGridSubset = sampleGridSubset.sample(n=sampleSize)
        sampleGridSubset.to_csv(PARENT_FOLDER + "GSV_DOWNLOAD_" + str(region) + ".csv",index=False)
    else:
        print("region %i has sample size of 0" % region)

region 19 has sample size of 0
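`DataFrame.sample` returns a new DataFrame and leaves the original untouched, so the sampled rows have to be captured before they are written out. The sketch below shows the capped per-area draw on a hypothetical grid. The `targets` values, the `point_id` column, and the fixed `SEED` are assumptions; the notebook itself does not set a seed, so passing `random_state` is only a suggestion for making the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the 50m GSV grid; NEAR_FID is the id of the
# urban area each grid point is assigned to.
grid = pd.DataFrame({
    "point_id": range(10),
    "NEAR_FID": [1] * 6 + [2] * 4,
})
targets = {1: 3, 2: 10}   # hypothetical n_download values per urban area
SEED = 0                  # assumption: the notebook does not fix a seed

samples = {}
for region, target in targets.items():
    subset = grid[grid["NEAR_FID"] == region]
    n = min(target, len(subset))   # cannot draw more points than exist
    # Keep the returned frame; subset itself is not modified.
    samples[region] = subset.sample(n=n, random_state=SEED)

print({region: len(frame) for region, frame in samples.items()})
```

Area 2 asks for 10 points but only 4 exist, so the draw is capped at the number available, as in the cell above.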
