This notebook removes broken URLs from the list of image URLs, so that every row in the training CSV references an image that actually exists on disk during training.
It reads the CSV of subsetted image URLs, compares its image ids against the files downloaded by `downloadImages.py`, drops the rows whose images are missing from the download folder, and saves the corrected CSV under a new name.

Input: google-data/train_subset.csv

Output: google-data/df_final.csv (train_subset.csv with the broken URLs removed)

In [10]:
# Packages
import os
import pandas as pd

In [11]:
# Read in train_subset.csv
df = pd.read_csv("google-data/train_subset.csv")

# Create list of all image ids in the original csv
all_ids = df['id'].to_list()

In [12]:
len(df)

10979

In [13]:
# Create list of image ids that were actually downloaded

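image_files = os.listdir('google-data/images')  # set path to your images folder
# Use splitext rather than str.strip('.jpg'): strip removes any of the characters
# '.', 'j', 'p', 'g' from both ends of the name, not the '.jpg' suffix.
downloaded_ids = [os.path.splitext(name)[0] for name in image_files]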
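
One caveat on the extension handling: str.strip('.jpg') treats its argument as a set of characters to trim from both ends, not as a suffix. The sketch below contrasts it with os.path.splitext on a made-up file name (the id g12ab is hypothetical, not taken from the dataset). Ids made only of digits and the letters a-f happen to survive either approach, which may be why the problem does not show up in the counts here.

```python
import os

def id_via_strip(filename):
    # str.strip removes any of the characters '.', 'j', 'p', 'g' from both ends
    return filename.strip('.jpg')

def id_via_splitext(filename):
    # os.path.splitext splits off only the final extension
    return os.path.splitext(filename)[0]

name = "g12ab.jpg"  # hypothetical file name
print(id_via_strip(name))     # 12ab  (leading 'g' is lost)
print(id_via_splitext(name))  # g12ab
```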

In [14]:
# Checking that we have more ids in the original csv than were downloaded
print(len(all_ids))
print(len(downloaded_ids))

10979
10353


In [16]:
# Find ids that are in all_ids but not downloaded_ids
broken_ids = list(set(all_ids) - set(downloaded_ids))

In [19]:
# Sanity check: the counts match if ids are unique and every downloaded file is listed in the csv
print(len(broken_ids)) 
print(len(all_ids) - len(downloaded_ids))

626
626
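
The two counts agree only under two assumptions: the ids in the CSV are unique, and every downloaded file corresponds to an id in the CSV. A minimal sketch of a stricter check follows. The helper name check_id_sets and the toy ids are hypothetical, for illustration only.

```python
def check_id_sets(all_ids, downloaded_ids):
    """Summarise how the csv ids and the downloaded ids relate."""
    all_set, down_set = set(all_ids), set(downloaded_ids)
    return {
        # ids that appear more than once in the csv
        "duplicate_ids": len(all_ids) - len(all_set),
        # files on disk whose id is not in the csv
        "stray_files": sorted(down_set - all_set),
        # csv ids with no downloaded file
        "broken_ids": sorted(all_set - down_set),
    }

report = check_id_sets(["a", "b", "b", "c"], ["a", "d"])
print(report)
# {'duplicate_ids': 1, 'stray_files': ['d'], 'broken_ids': ['b', 'c']}
```

If duplicate_ids is 0 and stray_files is empty, then len(broken_ids) equals len(all_ids) - len(downloaded_ids), which is the equality checked above.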


As a final step, we remove the rows with broken image ids from train_subset and save the result as our final data CSV.

In [20]:
final_df = df[~df['id'].isin(broken_ids)]

In [22]:
len(final_df) # matches the number of downloaded images

10353

In [25]:
# Saving this under a new name
final_df.to_csv('google-data/df_final.csv', index=False)
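
As an optional last check, the saved file can be read back and compared against the image folder. The sketch below uses only the standard library. The helper missing_images is hypothetical, and it assumes the CSV has an id column and that every image is stored as <id>.jpg.

```python
import csv
import os

def missing_images(csv_path, image_dir, id_column="id", ext=".jpg"):
    """Return ids listed in the CSV that have no matching file in image_dir."""
    on_disk = set(os.listdir(image_dir))
    with open(csv_path, newline="") as f:
        return [row[id_column] for row in csv.DictReader(f)
                if row[id_column] + ext not in on_disk]

# Example call with the paths used in this notebook; the result should be
# an empty list if the filtering worked:
# missing_images("google-data/df_final.csv", "google-data/images")
```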