This notebook prepares a DataFrame for training a single-label classifier with the Inception model. It extracts all images in our data set that have exactly one label, discards those whose label is rare, and stores the result in a new DataFrame.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os

In [2]:
if os.path.exists('data/prcsd_labels_df'):
    df = pd.read_pickle('data/prcsd_labels_df')
else:
    # Stop here: every later cell needs df to be defined.
    raise FileNotFoundError('Please run the notebook for processing the labels first!')

Keep only the rows whose picture has exactly one label attached:

In [3]:
df = df.loc[df.labels.apply(lambda lst: len(lst) == 1)]
df.head(10)

Unnamed: 0,img_url,labels
0,https://blok-production.imgix.net/photos/e7a3e...,[3]
1,https://blok-production.imgix.net/photos/e7a3e...,[3]
2,https://blok-production.imgix.net/photos/e7a3e...,[11]
3,https://blok-production.imgix.net/photos/e7a3e...,[1]
4,https://blok-production.imgix.net/photos/e7a3e...,[1]
10,https://blok-production.imgix.net/photos/e7a3e...,[4]
17,https://blok-production.imgix.net/photos/899b2...,[11]
23,https://blok-production.imgix.net/photos/899b2...,[4]
24,https://blok-production.imgix.net/photos/899b2...,[4]
32,https://blok-production.imgix.net/photos/7172c...,[11]
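
The filter keeps the original row index, which is why the output above jumps from 4 to 10. A minimal sketch on made-up data (the URLs and label lists below are hypothetical, not taken from our data set) shows the same behaviour:

```python
import pandas as pd

# Toy stand-in for the processed labels DataFrame; all values are invented.
toy = pd.DataFrame({
    'img_url': ['a', 'b', 'c', 'd'],
    'labels': [[3], [1, 4], [], [11]],
})

# Boolean mask: True where exactly one label is attached.
single = toy.loc[toy['labels'].map(len) == 1]

# The surviving rows keep their original index values.
print(single.index.tolist())
```

Rows with zero or several labels are removed, and the remaining rows are not renumbered.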


Convert each single-element label list to a plain integer (values that are already integers are left unchanged):

In [4]:
df.labels = df.labels.apply(lambda lst: lst[0] if isinstance(lst, list) else lst)
df.head()

Unnamed: 0,img_url,labels
0,https://blok-production.imgix.net/photos/e7a3e...,3
1,https://blok-production.imgix.net/photos/e7a3e...,3
2,https://blok-production.imgix.net/photos/e7a3e...,11
3,https://blok-production.imgix.net/photos/e7a3e...,1
4,https://blok-production.imgix.net/photos/e7a3e...,1


Add a file path column derived from the row index, then drop the URLs:

In [5]:
df['paths'] = df.index.astype(str) + '.png'
df = df.drop(columns=['img_url'])
df.head()

Unnamed: 0,labels,paths
0,3,0.png
1,3,1.png
2,11,2.png
3,1,3.png
4,1,4.png


There are still 5501 pictures covering 21 distinct labels, but some of those labels are very infrequent:

In [6]:
len(df)

5501

In [7]:
sorted(list(df.labels.unique()))

[0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 16, 18, 19, 20, 21, 22, 23]

In [8]:
df.labels.value_counts()

4     1483
1     1120
11     856
3      532
0      531
6      386
7      213
14     134
5       92
2       52
18      40
9       20
13      15
23       8
21       4
10       4
16       4
20       3
12       2
19       1
22       1
Name: labels, dtype: int64

Drop the infrequent labels (those with 52 or fewer pictures):

In [9]:
rare_labels = [2, 18, 9, 13, 23, 21, 10, 16, 20, 12, 19, 22]
df = df.loc[~df.labels.isin(rare_labels)]
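
The list of rare labels above is written out by hand. As an alternative, it could be derived from the label counts with a cut-off. The sketch below uses invented toy labels, and `MIN_COUNT` is a hypothetical name and value that does not appear in the notebook:

```python
import pandas as pd

# Toy label column; the counts are made up for illustration.
labels = pd.Series([4, 4, 4, 1, 1, 1, 2, 19], name='labels')

MIN_COUNT = 3  # hypothetical threshold: labels seen fewer times are "rare"

counts = labels.value_counts()
rare_labels = sorted(counts[counts < MIN_COUNT].index.tolist())

# Keep only rows whose label is not in the rare list.
kept = labels[~labels.isin(rare_labels)]

print(rare_labels)
print(kept.tolist())
```

With the counts shown earlier, any threshold from 53 to 92 would select the same twelve labels as the hand-written list, since label 2 has 52 pictures and label 5 has 92.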

In [10]:
len(df)

5347

In [11]:
unique_labels = sorted(list(df.labels.unique()))
unique_labels

[0, 1, 3, 4, 5, 6, 7, 11, 14]

In [12]:
df.labels.value_counts()

4     1483
1     1120
11     856
3      532
0      531
6      386
7      213
14     134
5       92
Name: labels, dtype: int64

Lastly, convert the labels to one-hot vectors:

In [13]:
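def label_to_categorical(i):
    # One-hot vector indexed by the original label id; its length is
    # max(unique_labels) + 1, so dropped label ids stay as all-zero positions.
    ret = np.zeros(max(unique_labels)+1)
    ret[i] = 1
    return ret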


df.labels = df.labels.apply(label_to_categorical)
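
Because the vectors are indexed by the original label id, they have 15 entries although only 9 labels remain. If a compact encoding is preferred, the labels could first be mapped to contiguous class indices. This is only a sketch of that alternative; `label_to_index` and `label_to_compact_one_hot` are hypothetical names, and whether the training code expects 15 or 9 outputs is not stated here:

```python
import numpy as np

# The labels that survive the filtering, as printed by the notebook.
unique_labels = [0, 1, 3, 4, 5, 6, 7, 11, 14]

# Map each surviving label id to a contiguous class index 0..8.
label_to_index = {label: idx for idx, label in enumerate(unique_labels)}


def label_to_compact_one_hot(label):
    """Return a one-hot vector with one entry per surviving label."""
    vec = np.zeros(len(unique_labels))
    vec[label_to_index[label]] = 1
    return vec


print(label_to_compact_one_hot(11))
```

Keeping `label_to_index` around makes it possible to translate a predicted class index back to the original label id.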

In [14]:
df.head()

Unnamed: 0,labels,paths
0,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.png
1,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",1.png
2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",2.png
3,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",3.png
4,"[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",4.png


Nice! Write to file.

In [15]:
df.to_pickle('data/inception_df_v1')
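
A consumer of this file will probably want the label column as a single 2-D array. The sketch below shows one way to do that on a toy DataFrame written to a temporary directory; the data and the file name `toy_df` are invented for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with the same two columns as the stored DataFrame.
toy = pd.DataFrame(
    {
        'labels': [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
        'paths': ['0.png', '4.png'],
    },
    index=[0, 4],
)

# Round-trip through a pickle file, as the notebook does.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_df')
    toy.to_pickle(path)
    loaded = pd.read_pickle(path)

# Stack the per-row vectors into one (n_samples, n_classes) array.
y = np.stack(loaded['labels'].tolist())
print(y.shape)
```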