In [1]:
import wget
import zipfile
import pandas as pd
import json
import os
import shutil

Define the data paths and the number of images to use per class.

In [2]:
data_path = "data/"
data_coco = "data/coco/"
# Number of training images to pick from each category
num_images = 10000

For person detection, the [COCO dataset](https://cocodataset.org) is used. In the first step, the images and the label information are downloaded.<br>
The archives are then extracted to the `"data/coco"` folder and the zip files are deleted.<br>
Note that the COCO training dataset is over 18 GB (~120,000 images), so the download can take a long time.
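Because the archive is this large, it can be worth checking the available disk space before starting the download. The following is a minimal sketch, not part of the original workflow; the helper name `has_free_space` and the threshold of roughly twice the archive size (zip file plus extracted images) are assumptions.

```python
import shutil

def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

# Assumed threshold: the zip file and the extracted images exist side by side
# for a while, so roughly twice 18 GB is needed.
required = 2 * 18 * 1024**3
if not has_free_space(".", required):
    print("Warning: less than ~36 GB free; the COCO download may not fit.")
```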

In [3]:
if not os.path.isdir(data_path):
    os.mkdir(data_path)
if not os.path.isdir(data_coco):
    os.mkdir(data_coco)

In [4]:
if not os.path.isdir(data_coco + "train2017"):
    url = "http://images.cocodataset.org/zips/train2017.zip"
    wget.download(url, data_coco)

    with zipfile.ZipFile(data_coco + "train2017.zip","r") as zip_ref:
        zip_ref.extractall(data_coco)

In [5]:
if not os.path.isdir(data_coco + "annotations"):
    url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
    wget.download(url, data_coco)

    with zipfile.ZipFile(data_coco + "annotations_trainval2017.zip","r") as zip_ref:
        zip_ref.extractall(data_coco)

In [6]:
if os.path.isfile(data_coco + "train2017.zip"):
    os.remove(data_coco + "train2017.zip")
if os.path.isfile(data_coco + "annotations_trainval2017.zip"):
    os.remove(data_coco + "annotations_trainval2017.zip")
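The extract-then-delete steps above can also be bundled into one helper, so the archive is only removed after extraction has succeeded. This is a sketch using only the standard library; the function name `extract_and_remove` is hypothetical and the download itself is left out.

```python
import os
import zipfile

def extract_and_remove(zip_path, target_dir):
    """Extract `zip_path` into `target_dir`, then delete the archive.

    If extraction raises, the archive is kept so it does not have to be
    downloaded again.
    """
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(target_dir)
    os.remove(zip_path)
```

With the variables from this notebook, a call would look like `extract_and_remove(data_coco + "train2017.zip", data_coco)`.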

Load the annotation data from the file downloaded above.

In [7]:
# Use a context manager so the file handle is closed after loading
with open(data_coco + "annotations/instances_train2017.json") as f:
    json_data = json.load(f)

Only the annotations are needed here. They are stored in a DataFrame.

In [8]:
annotations = pd.DataFrame(json_data["annotations"])
annotations.head()

,segmentation,area,iscrowd,image_id,bbox,category_id,id
0,"[[239.97, 260.24, 222.04, 270.49, 199.84, 253....",2765.14865,0,558840,"[199.84, 200.46, 77.71, 70.88]",58,156
1,"[[247.71, 354.7, 253.49, 346.99, 276.63, 337.3...",1545.4213,0,200365,"[234.22, 317.11, 149.39, 38.55]",58,509
2,"[[274.58, 405.68, 298.32, 405.68, 302.45, 402....",5607.66135,0,200365,"[239.48, 347.87, 160.0, 57.81]",58,603
3,"[[296.65, 388.33, 296.65, 388.33, 297.68, 388....",0.0,0,200365,"[296.65, 388.33, 1.03, 0.0]",58,918
4,"[[251.87, 356.13, 260.13, 343.74, 300.39, 335....",800.41325,0,200365,"[251.87, 333.42, 125.94, 22.71]",58,1072


Only `image_id` and `category_id` are needed, so all other columns are removed from the DataFrame.

In [9]:
annotations.drop(labels=["segmentation", "area", "iscrowd", "bbox", "id"], axis=1, inplace=True)
annotations.head()

,image_id,category_id
0,558840,58
1,200365,58
2,200365,58
3,200365,58
4,200365,58


All annotations of the person class (`category_id` 1) are stored in a new DataFrame. Then duplicate entries for the same image are removed, so each image appears only once.<br>
Finally, only the first `num_images` images are kept for training the model.

In [10]:
df_person = annotations[annotations["category_id"]==1]
df_person = df_person.drop_duplicates(subset=["image_id"])
df_person.reset_index(drop=True, inplace=True)
df_person = df_person.loc[:num_images-1]
df_person.shape

(10000, 2)
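The filter above relies on `category_id` 1 being the person class. The annotation file also carries a `categories` list that maps ids to names, so the id can be looked up rather than hard-coded. The sketch below uses a two-entry toy list in place of `json_data["categories"]`; the helper name `category_id_by_name` is hypothetical.

```python
def category_id_by_name(categories, name):
    """Return the id of the first category whose name matches, else None."""
    for cat in categories:
        if cat["name"] == name:
            return cat["id"]
    return None

# Toy stand-in for json_data["categories"]; the real list is much longer.
toy_categories = [
    {"supercategory": "person", "id": 1, "name": "person"},
    {"supercategory": "vehicle", "id": 2, "name": "bicycle"},
]
person_id = category_id_by_name(toy_categories, "person")
```

In the notebook, `category_id_by_name(json_data["categories"], "person")` could replace the literal `1`.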

The same is done for the remaining classes, without limiting the number of images yet.

In [11]:
df_no_person = annotations[annotations["category_id"]!=1]
df_no_person = df_no_person.drop_duplicates(subset=["image_id"])
df_no_person = df_no_person.reset_index(drop=True)
df_no_person.shape

(116912, 2)

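An image can carry annotations of several classes, so the DataFrame of the remaining classes still contains images that also show a person.<br>
All images with at least one person annotation are removed from the DataFrame that contains the images without persons. The check uses every person image in `annotations`, not only the `num_images` kept in `df_person`; otherwise person images beyond that limit would end up in the "no person" class.<br>
After that, the DataFrame is reduced to `num_images`.

In [12]:
# Ids of all images that have at least one person annotation
person_ids = annotations.loc[annotations["category_id"] == 1, "image_id"]
cond = df_no_person["image_id"].isin(person_ids)
df_no_person = df_no_person[~cond].reset_index(drop=True)
df_no_person = df_no_person.loc[:num_images-1]
df_no_person.shape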

(10000, 2)
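The selection logic can be illustrated without pandas. The sketch below is an assumption-labeled rewrite in plain Python, not the notebook's method: an image counts as "no person" only if none of its annotations has the person category. The function name `split_image_ids` and the toy annotations are hypothetical.

```python
def split_image_ids(annotations, person_category=1):
    """Split image ids into those with and without a person annotation.

    Ids are returned in order of first appearance, each id once.
    """
    person_ids = set()
    seen = set()
    ordered_ids = []
    for ann in annotations:
        image_id = ann["image_id"]
        if ann["category_id"] == person_category:
            person_ids.add(image_id)
        if image_id not in seen:
            seen.add(image_id)
            ordered_ids.append(image_id)
    with_person = [i for i in ordered_ids if i in person_ids]
    without_person = [i for i in ordered_ids if i not in person_ids]
    return with_person, without_person

# Toy annotations: image 10 shows a person and another object.
toy = [
    {"image_id": 10, "category_id": 58},
    {"image_id": 10, "category_id": 1},
    {"image_id": 11, "category_id": 58},
    {"image_id": 12, "category_id": 1},
]
with_person, without_person = split_image_ids(toy)
```

Image 10 lands only in the person list even though its first annotation belongs to another class. Images without any annotation never appear in either list, which matches the DataFrame approach.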

Before the new dataset for person detection is assembled, the directories in which the images will be stored are created.

In [13]:
train_path = data_path + "train/"
person_path = train_path + "person/"
no_person_path = train_path + "no_person/"

# makedirs also creates the parent "train" folder and does not fail
# if a directory already exists (for example after a partial run)
os.makedirs(person_path, exist_ok=True)
os.makedirs(no_person_path, exist_ok=True)

The image IDs of all images that contain people and of those that do not are each stored in a list.

In [14]:
dfp = list(df_person["image_id"])
dfnp = list(df_no_person["image_id"])
print("Person images:", len(dfp))
print("No person images:", len(dfnp))

Person images: 10000
No person images: 10000


10,000 images (`num_images`) for each of the two classes are copied from the downloaded COCO dataset into the folders created previously.

In [15]:
for i in range(num_images):
    shutil.copy(data_coco + "train2017/" + str(dfp[i]).zfill(12) + ".jpg", person_path + str(dfp[i]).zfill(12) + ".jpg")
    shutil.copy(data_coco + "train2017/" + str(dfnp[i]).zfill(12) + ".jpg", no_person_path + str(dfnp[i]).zfill(12) + ".jpg")
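The file names in the copy loop are the image ID padded with zeros to 12 digits, plus the `.jpg` extension. A small helper makes this explicit and allows a check that every selected image actually arrived in its target folder. This is a sketch; the names `coco_image_name` and `missing_images` are hypothetical.

```python
import os

def coco_image_name(image_id):
    """Build the file name used in train2017: 12-digit zero-padded id + '.jpg'."""
    return str(image_id).zfill(12) + ".jpg"

def missing_images(image_ids, folder):
    """Return the ids whose image file is not present in `folder`."""
    return [
        image_id
        for image_id in image_ids
        if not os.path.isfile(os.path.join(folder, coco_image_name(image_id)))
    ]
```

After the copy, `missing_images(dfp, person_path)` and `missing_images(dfnp, no_person_path)` should both return an empty list.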