### Imports

In [17]:
from fastai.vision import *
from pathlib import Path
import numpy as np

In [2]:
!pwd

/Users/fazzl/herbs


In [3]:
!ls

README.md              image-links            names.xlsx
download-data.ipynb    lesson2-download.ipynb names2.xlsx
einsum.ipynb           names.csv


In [4]:
links_path = Path('image-links')
data_root_path = Path('/Users/fazzl/data/herbs')

In [5]:
links_path.ls()

[PosixPath('image-links/23.csv'),
 PosixPath('image-links/22.csv'),
 PosixPath('image-links/08.csv'),
 PosixPath('image-links/20.csv'),
 PosixPath('image-links/21.csv'),
 PosixPath('image-links/09.csv'),
 PosixPath('image-links/25.csv'),
 PosixPath('image-links/19.csv'),
 PosixPath('image-links/18.csv'),
 PosixPath('image-links/24.csv'),
 PosixPath('image-links/26.csv'),
 PosixPath('image-links/27.csv'),
 PosixPath('image-links/16.csv'),
 PosixPath('image-links/02.csv'),
 PosixPath('image-links/03.csv'),
 PosixPath('image-links/17.csv'),
 PosixPath('image-links/29.csv'),
 PosixPath('image-links/01.csv'),
 PosixPath('image-links/15.csv'),
 PosixPath('image-links/14.csv'),
 PosixPath('image-links/28.csv'),
 PosixPath('image-links/04.csv'),
 PosixPath('image-links/10.csv'),
 PosixPath('image-links/11.csv'),
 PosixPath('image-links/05.csv'),
 PosixPath('image-links/13.csv'),
 PosixPath('image-links/07.csv'),
 PosixPath('image-links/06.csv'),
 PosixPath('image-links/12.csv')]

In [6]:
folders = [p.stem for p in links_path.ls()]
files = [p.name for p in links_path.ls()]

In [7]:
folders[:5]

['23', '22', '08', '20', '21']

In [8]:
files[:5]

['23.csv', '22.csv', '08.csv', '20.csv', '21.csv']

## Making Folders for Data

Let's make folders for our data. We will keep all data in the directory `data_root_path`.

Inside, we will have one folder for each of the 29 classes of herbs. Each folder name is the two-digit class number. Later I will make a mapping from class number to name (Chinese, pinyin, or English).

Below is the list of classes:

| Class 	|  Chinese 	|     Pinyin    	|      English     	|           Note          	|
|:-----:	|:--------:	|:-------------:	|:----------------:	|:-----------------------:	|
|     1 	| 白菜     	| bái cài       	| Chinese cabbage  	|                         	|
|     2 	| 菠菜     	| bō cài        	| spinach          	|                         	|
|     3 	| 菜心     	| cài xīn       	| choy sum         	|                         	|
|     4 	| 儿菜     	| ér cài        	|        ---       	|                         	|
|     5 	| 盖菜     	| gài cài       	| leaf mustard     	|                         	|
|     6 	| 芥蓝     	| jiè lán       	| Chinese broccoli 	|                         	|
|     7 	| 蒿子杆   	| hāo zǐ gān    	| Tricolor daisy   	|                         	|
|     8 	| 黄心菜   	| huáng xīn cài 	|        ---       	|                         	|
|     9 	| 茴香     	| huí xiāng     	| fennel           	|                         	|
|    10 	| 鸡毛菜   	| jī máo cài    	|        ---       	|                         	|
|    11 	| 韭菜     	| jiǔ cài       	| garlic chives    	|                         	|
|    12 	| 空心菜   	| kōng xīn cài  	| water spinach    	| same as 蕹菜 (wèng cài) 	|
|    13 	| 快菜     	| kuài cài      	|        ---       	|                         	|
|    14 	| 苦菊     	| kǔ jú         	| endive           	|                         	|
|    15 	| 芦笋     	| lú sǔn        	| asparagus        	|                         	|
|    16 	| 芹菜     	| qín cài xīn   	| celery           	|                         	|
|    17 	| 蒜黄     	| suàn huáng    	|        ---       	|                         	|
|    18 	| 蒜苔     	| suàn tái      	| garlic shoots    	|                         	|
|    19 	| 茼蒿菜   	| tóng hāo cài  	| crown daisy      	|                         	|
|    20 	| 豌豆苗   	| wān dòu miáo  	| pea shoots       	|                         	|
|    21 	| 莴苣     	| wō jù         	| lettuce          	|                         	|
|    22 	| 香菜     	| xiāng cài     	| cilantro         	|                         	|
|    23 	| 香芹     	| xiāng qín     	| parsley          	|                         	|
|    24 	| 小白菜   	| xiǎo bái cài  	| bok choy         	|                         	|
|    25 	| 西洋菜   	| xī yáng cài   	| watercress       	|                         	|
|    26 	| 叶生菜   	| yè shēng cài  	| lettuce          	|                         	|
|    27 	| 油菜     	| yóu cài       	| oilseed rape     	|                         	|
|    28 	| 羽衣甘蓝 	| yǔ yī gān lán 	| kale             	|                         	|
|    29 	| 竹笋     	| zhú sǔn       	| bamboo shoot     	|                         	|

Let's check where the data root folder for this project points:

In [9]:
data_root_path

PosixPath('/Users/fazzl/data/herbs')

For now the folder doesn't exist, so let's create it:

In [10]:
data_root_path.mkdir(exist_ok=True, parents=True)

In [11]:
data_root_path.exists()

True
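
The two keyword arguments are what make this cell safe to re-run. A minimal sketch of that behaviour, using a throwaway temporary directory (`demo_root` is a hypothetical stand-in for `data_root_path`, so the real data folder is not touched):

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the real data root (assumption for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / 'data' / 'herbs'

    # parents=True also creates the missing intermediate 'data' folder.
    demo_root.mkdir(parents=True, exist_ok=True)
    created = demo_root.exists()

    # exist_ok=True turns a second call into a no-op instead of raising FileExistsError.
    demo_root.mkdir(parents=True, exist_ok=True)
    still_there = demo_root.is_dir()

print(created, still_there)
```

Without `exist_ok=True`, the second `mkdir` call would raise `FileExistsError`.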

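On a fresh run this is an empty folder. The output below comes from a later run, after the class folders had been created: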

In [12]:
data_root_path.ls()

[PosixPath('/Users/fazzl/data/herbs/03'),
 PosixPath('/Users/fazzl/data/herbs/04'),
 PosixPath('/Users/fazzl/data/herbs/05'),
 PosixPath('/Users/fazzl/data/herbs/02'),
 PosixPath('/Users/fazzl/data/herbs/20'),
 PosixPath('/Users/fazzl/data/herbs/18'),
 PosixPath('/Users/fazzl/data/herbs/27'),
 PosixPath('/Users/fazzl/data/herbs/11'),
 PosixPath('/Users/fazzl/data/herbs/29'),
 PosixPath('/Users/fazzl/data/herbs/16'),
 PosixPath('/Users/fazzl/data/herbs/28'),
 PosixPath('/Users/fazzl/data/herbs/17'),
 PosixPath('/Users/fazzl/data/herbs/10'),
 PosixPath('/Users/fazzl/data/herbs/19'),
 PosixPath('/Users/fazzl/data/herbs/26'),
 PosixPath('/Users/fazzl/data/herbs/21'),
 PosixPath('/Users/fazzl/data/herbs/07'),
 PosixPath('/Users/fazzl/data/herbs/09'),
 PosixPath('/Users/fazzl/data/herbs/08'),
 PosixPath('/Users/fazzl/data/herbs/01'),
 PosixPath('/Users/fazzl/data/herbs/06'),
 PosixPath('/Users/fazzl/data/herbs/24'),
 PosixPath('/Users/fazzl/data/herbs/23'),
 PosixPath('/Users/fazzl/data/herb

Let's populate it with class folders.

1. First, we create a folder for each class
2. Then, we download images into that folder

**Note**: you only need to do this ONCE. To run the download, uncomment the cell below.

In [37]:
# for folder, file in zip(folders, files):
#     dest = data_root_path/folder
#     dest.mkdir(parents=True, exist_ok=True)
#     download_images(links_path/file, dest)

Error https://cdn.101mediaimage.com/img/2017/08/14/17089665012derd.png HTTPSConnectionPool(host='cdn.101mediaimage.com', port=443): Read timed out. (read timeout=4)
Error https://images.freeimages.com/images/premium/previews/5146/5146263-cilantro.jpg HTTPSConnectionPool(host='images.freeimages.com', port=443): Max retries exceeded with url: /images/premium/previews/5146/5146263-cilantro.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6ca0f0>, 'Connection to images.freeimages.com timed out. (connect timeout=4)'))
Error http://www.liaotuo.org/uploadfile/2018/0621/20180621032123360.jpg HTTPConnectionPool(host='www.liaotuo.org', port=80): Max retries exceeded with url: /uploadfile/2018/0621/20180621032123360.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d0d1f98>, 'Connection to www.liaotuo.org timed out. (connect timeout=4)'))
Error http://www.liaotuo.org/uploadfile/uploads/allimg/131228/14-13122Q03256.jp

Error http://pic.qqtn.com/up/2017-3/201703271030597872740.png HTTPConnectionPool(host='pic.qqtn.com', port=80): Max retries exceeded with url: /up/2017-3/201703271030597872740.png (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d6e6518>, 'Connection to pic.qqtn.com timed out. (connect timeout=4)'))
Error https://cp1.douguo.com/upload/caiku/5/3/c/yuan_53a65197918e28a61e45b2e5b4e100fc.jpg HTTPSConnectionPool(host='cp1.douguo.com', port=443): Read timed out. (read timeout=4)
Error https://pic.pingguolv.com/uploads/allimg/151116/99-151116205R1.jpg HTTPSConnectionPool(host='pic.pingguolv.com', port=443): Max retries exceeded with url: /uploads/allimg/151116/99-151116205R1.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6d0828>, 'Connection to pic.pingguolv.com timed out. (connect timeout=4)'))
Error https://s.yimg.com/ny/api/res/1.2/SOd9jQz8APeUAc8VE0LFUw--~A/YXBwaWQ9aGlnaGxhbmRlcjtzbT0xO3c9ODAw/https://media-mb

Error https://i8.meishichina.com/attachment/recipe/201102/201102181153375.jpg?x-oss-process=style/p800 HTTPSConnectionPool(host='i8.meishichina.com', port=443): Max retries exceeded with url: /attachment/recipe/201102/201102181153375.jpg?x-oss-process=style/p800 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6fb940>, 'Connection to i8.meishichina.com timed out. (connect timeout=4)'))
Error https://img.ruten.com.tw/s1/a/9b/57/21717033494359_943.JPG ('Connection aborted.', OSError("(54, 'ECONNRESET')"))
Error https://i8.meishichina.com/attachment/recipe/2013/03/19/20130319105729375480972.jpg?x-oss-process=style/p800 HTTPSConnectionPool(host='i8.meishichina.com', port=443): Max retries exceeded with url: /attachment/recipe/2013/03/19/20130319105729375480972.jpg?x-oss-process=style/p800 (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d013cf8>, 'Connection to i8.meishichina.com timed out. (connect timeout=

Error https://pic.pingguolv.com/uploads/allimg/180202/77-1P202145935.jpg HTTPSConnectionPool(host='pic.pingguolv.com', port=443): Max retries exceeded with url: /uploads/allimg/180202/77-1P202145935.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6e6710>, 'Connection to pic.pingguolv.com timed out. (connect timeout=4)'))
Error http://images.meishij.net/p/20120601/de8c0b640f4857240ff5b8d432b60ac0_150x150.jpg HTTPConnectionPool(host='images.meishij.net', port=80): Max retries exceeded with url: /p/20120601/de8c0b640f4857240ff5b8d432b60ac0_150x150.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d6e65c0>, 'Connection to images.meishij.net timed out. (connect timeout=4)'))
Error http://qiniu.69cy.net/20170324093047815.jpg HTTPConnectionPool(host='qiniu.69cy.net', port=80): Max retries exceeded with url: /20170324093047815.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d6

Error http://www.hnseeds.com/images/bs0205.png HTTPConnectionPool(host='www.hnseeds.com', port=80): Read timed out. (read timeout=4)
Error http://images.meishij.net/p/20110715/87d85cbd4b0942fcd85207ac79f50c26_180x180.jpg HTTPConnectionPool(host='images.meishij.net', port=80): Max retries exceeded with url: /p/20110715/87d85cbd4b0942fcd85207ac79f50c26_180x180.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d013320>, 'Connection to images.meishij.net timed out. (connect timeout=4)'))
Error http://upload2.95171.cn/article/20111104/%E5%B0%8F%E7%99%BD%E8%8F%9C.jpg HTTPConnectionPool(host='upload2.95171.cn', port=80): Read timed out.
Error https://img14.360buyimg.com/n1/jfs/t3391/260/2028368875/79667/9096bc69/5843bf2eN28a01db3.jpg HTTPSConnectionPool(host='img14.360buyimg.com', port=443): Read timed out. (read timeout=4)
Error http://pic15.nipic.com/20110709/7895973_093431459167_2.jpg HTTPConnectionPool(host='pic15.nipic.com', port=80): Read timed out. (r

Error https://www.uooyoo.com/img2017/9/15/2017091562516141.jpg HTTPSConnectionPool(host='www.uooyoo.com', port=443): Read timed out. (read timeout=4)
Error https://www.uooyoo.com/img2016/7/9/2016070938674717.jpg HTTPSConnectionPool(host='www.uooyoo.com', port=443): Read timed out. (read timeout=4)
Error https://pic.pingguolv.com/uploads/allimg/151119/77-1511191I413.jpg HTTPSConnectionPool(host='pic.pingguolv.com', port=443): Max retries exceeded with url: /uploads/allimg/151119/77-1511191I413.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6ec048>, 'Connection to pic.pingguolv.com timed out. (connect timeout=4)'))
Error https://www.haocai777.com/Article/UploadFiles2012c/201703/2017031216572313.jpg HTTPSConnectionPool(host='www.haocai777.com', port=443): Max retries exceeded with url: /Article/UploadFiles2012c/201703/2017031216572313.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6ec2e8>, 'Con

Error http://images.meishij.net/p/20100402/04dc39ed8b1222c0ae5ad6eff8454794.jpg HTTPConnectionPool(host='images.meishij.net', port=80): Max retries exceeded with url: /p/20100402/04dc39ed8b1222c0ae5ad6eff8454794.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d6e3048>, 'Connection to images.meishij.net timed out. (connect timeout=4)'))
Error http://site.meishij.net/r/115/13/2253365/a2253365_33591.jpg HTTPConnectionPool(host='site.meishij.net', port=80): Max retries exceeded with url: /r/115/13/2253365/a2253365_33591.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d6f32b0>, 'Connection to site.meishij.net timed out. (connect timeout=4)'))
Error https://img.ruten.com.tw/s1/3/d2/1b/21747354962459_183.jpg ('Connection aborted.', OSError("(54, 'ECONNRESET')"))
Error http://pic10.nipic.com/20101027/5736135_095147001042_2.jpg HTTPConnectionPool(host='pic10.nipic.com', port=80): Read timed out. (read timeout=4)
Error htt

Error https://pic.pingguolv.com/uploads/allimg/140302/53-140302102004.jpg HTTPSConnectionPool(host='pic.pingguolv.com', port=443): Max retries exceeded with url: /uploads/allimg/140302/53-140302102004.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6f1b00>, 'Connection to pic.pingguolv.com timed out. (connect timeout=4)'))
Error https://static.cndzys.com/20160922/030b09f19e8e71e2b9635320d897becc.jpg HTTPSConnectionPool(host='static.cndzys.com', port=443): Max retries exceeded with url: /20160922/030b09f19e8e71e2b9635320d897becc.jpg (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1d6f1828>, 'Connection to static.cndzys.com timed out. (connect timeout=4)'))
Error http://www.forestry.gov.cn/html/main/main_72/20181016165445354223400/20181016165548957349704_1.jpg HTTPConnectionPool(host='www.forestry.gov.cn', port=80): Max retries exceeded with url: /html/main/main_72/20181016165445354223400/201810161655

Error https://pic.qyer.com/album/user/719/27/RkhcQB0CZg/index/680x HTTPSConnectionPool(host='pic.qyer.com', port=443): Read timed out. (read timeout=4)
Error https://www.yacook.org/sites/default/files/yacomt0x0000/image/yacook200905/R1-00001.jpg ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Error https://www.uooyoo.com/img2017/11/11/2017111137919469.png HTTPSConnectionPool(host='www.uooyoo.com', port=443): Read timed out. (read timeout=4)
Error http://pic34.nipic.com/20131019/4232515_013133317192_2.jpg HTTPConnectionPool(host='pic34.nipic.com', port=80): Read timed out. (read timeout=4)
Error http://img14.360buyimg.com/n7/jfs/t18853/363/855092958/256376/e4a9f62d/5aad4117Nc905d893.jpg HTTPConnectionPool(host='img14.360buyimg.com', port=80): Read timed out. (read timeout=4)
Error https://theblog.jessikerbakes.com/wp-content/uploads/2014/01/IMG_2558.jpg HTTPSConnectionPool(host='theblog.jessikerbakes.com', port=443): Max retries exceeded with

Error https://img.ruten.com.tw/s3/01d/991/2343455/5/2b/e1/21809720638433_789.jpg ('Connection aborted.', OSError("(54, 'ECONNRESET')"))
Error http://www.ddmeishi.com/uploads/allimg/170424/6-1F424095918.jpg HTTPConnectionPool(host='www.ddmeishi.com', port=80): Read timed out.
Error http://image.zhms.cn/2015-12/3e3bb68a9a914668bb4466a757afce3f.jpg?x-oss-process=image/format,jpg/interlace,1/resize,m_fill,h_270,w_270/watermark,image_RGVmYXVsdC9iLnBuZw==,t_35,g_se,x_10,y_10 HTTPConnectionPool(host='image.zhms.cn', port=80): Max retries exceeded with url: /2015-12/3e3bb68a9a914668bb4466a757afce3f.jpg?x-oss-process=image/format,jpg/interlace,1/resize,m_fill,h_270,w_270/watermark,image_RGVmYXVsdC9iLnBuZw==,t_35,g_se,x_10,y_10 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d0dff28>, 'Connection to image.zhms.cn timed out. (connect timeout=4)'))
Error https://img.ruten.com.tw/s2/3/12/38/21730777196088_473.JPG ('Connection aborted.', OSError("(54, 'ECONNRESET')")

Error http://img14.360buyimg.com/n7/jfs/t2680/203/542441041/964118/66007d3/5719b9ebNd158fb27.jpg HTTPConnectionPool(host='img14.360buyimg.com', port=80): Read timed out. (read timeout=4)
Error http://www.yydnc.com/data/upload/shop/store/goods/1/1_05708992051479736_360.jpg Exceeded 30 redirects.
Error http://r3.sinaimg.cn/201511/000/000/aHR0cDovL21tYml6LnFwaWMuY24vbW1iaXovN01odWVlSm81V0k1Sk1MaWN5dnd2Y2w5cjh3dWs0VUdYaWFpYUY1NDVpYXFmT0dvN1ZGQ2hic2ljaWJCVWFRMHYwRmZlZ2NvNVduV09JYXlLWUZublF1Y3U3WlEvMCtodHRwOi8vbXAud2VpeGluLnFxLmNvbS9zP19fYml6PU16QTNPRE16TURFeE5RPT0mbWlkPTQwMDQ3MTI1MyZpZHg9MSZzbj0xZGZlODhkNGRjOTJhZGYyNzgwNjU0ZTU3Y2FlNzRjMSYzcmQ9TXpBM01EVTROVFl6TXc9PSZzY2VuZT02.jpg HTTPConnectionPool(host='r3.sinaimg.cn', port=80): Max retries exceeded with url: /201511/000/000/aHR0cDovL21tYml6LnFwaWMuY24vbW1iaXovN01odWVlSm81V0k1Sk1MaWN5dnd2Y2w5cjh3dWs0VUdYaWFpYUY1NDVpYXFmT0dvN1ZGQ2hic2ljaWJCVWFRMHYwRmZlZ2NvNVduV09JYXlLWUZublF1Y3U3WlEvMCtodHRwOi8vbXAud2VpeGluLnFxLmNvbS9zP19fYml6PU16QTNPRE16TUR

Error http://img14.360buyimg.com/n7/jfs/t1/4701/28/9491/778833/5bad9020E3f1e05b9/504d3152ee59e4d4.png HTTPConnectionPool(host='img14.360buyimg.com', port=80): Read timed out.
Error http://www.foodqs.cn/memberpictures/2014/04/22/sxlzny_201404221122466493.JPG HTTPConnectionPool(host='www.foodqs.cn', port=80): Max retries exceeded with url: /memberpictures/2014/04/22/sxlzny_201404221122466493.JPG (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d0e3ac8>, 'Connection to www.foodqs.cn timed out. (connect timeout=4)'))
Error http://www.ny365.com.cn/bookpic/2011-1/20110104135848.jpg HTTPConnectionPool(host='www.ny365.com.cn', port=80): Max retries exceeded with url: /bookpic/2011-1/20110104135848.jpg (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x1a1d0e3ba8>, 'Connection to www.ny365.com.cn timed out. (connect timeout=4)'))
Error https://cp1.douguo.com/upload/caiku/b/9/f/yuan_b921048f9ec3373c94854220752bfc2f.jpg HTTPSConnectionPoo

Many links did not work, but we still got quite a few images. The smallest class has 185 images, and the largest has 398.

In [13]:
ns = [len(f.ls()) for f in data_root_path.ls()]
print(ns)
print(min(ns))
print(max(ns))

[195, 191, 188, 291, 328, 275, 295, 296, 379, 381, 398, 279, 294, 268, 385, 290, 193, 287, 290, 194, 191, 185, 200, 295, 289, 276, 299, 295, 297]
185
398
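
The list above is in whatever order the filesystem returns the folders, so it is hard to tell which count belongs to which class. A possible sketch that keys the counts by folder name and skips stray files, using only `pathlib` (the helper name `count_images_per_class` and the tiny fake data set are assumptions for illustration, not part of the notebook):

```python
import tempfile
from pathlib import Path

def count_images_per_class(root):
    """Return {class_folder_name: number_of_files}, ordered by folder name."""
    root = Path(root)
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in sorted(root.iterdir())
        if d.is_dir()  # skip stray files such as .DS_Store
    }

# Tiny fake data set so the sketch runs without the real images.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name, n in [('02', 3), ('01', 1)]:
        (root / name).mkdir()
        for i in range(n):
            (root / name / f'{i}.jpg').touch()
    (root / '.DS_Store').touch()
    counts = count_images_per_class(root)

smallest = min(counts, key=counts.get)
largest = max(counts, key=counts.get)
print(counts)
print(smallest, counts[smallest], largest, counts[largest])
```

On the real data this would be called as `count_images_per_class(data_root_path)`.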


Looking through the folders, I see that many images are not relevant: they show cooked dishes or something else entirely. It's time to clean them.

### Mapping From Class Numbers to Names

In [14]:
names_path = 'names.csv'

'''
c - 2-character class string
hz - hanzi (Chinese characters)
py - pinyin (Chinese romanization)
e - English
'''

# class to English
c_to_e = {}
c_to_hz = {}
c_to_py = {}

with open(names_path, 'r') as f:
    for l in f:
        c, hz, py, e = l.strip('\n').split(',')
        print(c, hz, py, e)
        c_to_e[c] = e
        c_to_hz[c] = hz
        c_to_py[c] = py

﻿Class Chinese Pinyin English
01 白菜 bái cài  Chinese cabbage
02 菠菜 bō cài  spinach
03 菜心 cài xīn  choy sum
04 儿菜 ér cài  ér cài 
05 盖菜 gài cài  leaf mustard
06 芥蓝 jiè lán  Chinese broccoli
07 蒿子杆 hāo zǐ gān  Tricolor daisy
08 黄心菜 huáng xīn cài  huáng xīn cài 
09 茴香 huí xiāng  fennel
10 鸡毛菜 jī máo cài  jī máo cài 
11 韭菜 jiǔ cài  garlic chives
12 空心菜 kōng xīn cài  water spinach
13 快菜 kuài cài  kuài cài 
14 苦菊 kǔ jú  endive
15 芦笋 lú sǔn  asparagus
16 芹菜 qín cài xīn  celery
17 蒜黄 suàn huáng  suàn huáng 
18 蒜苔 suàn tái  garlic shoots
19 茼蒿菜 tóng hāo cài  crown daisy
20 豌豆苗 wān dòu miáo  pea shoots
21 莴苣 wō jù  lettuce
22 香菜 xiāng cài  cilantro
23 香芹 xiāng qín  parsley
24 小白菜 xiǎo bái cài  bok choy
25 西洋菜 xī yáng cài  watercress
26 叶生菜 yè shēng cài  lettuce
27 油菜 yóu cài  oilseed rape
28 羽衣甘蓝 yǔ yī gān lán  kale
29 竹笋 zhú sǔn bamboo shoot
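
The printed output shows that the header row ends up in the dictionaries too, and its first field appears to carry a byte-order mark. One way to avoid both, sketched with the standard `csv` module: the `utf-8-sig` encoding strips a leading BOM if there is one, and `next(reader)` skips the header. The helper name `load_class_names` and the two-row demo file are assumptions for illustration, and the sketch assumes `names.csv` has exactly four comma-separated columns.

```python
import csv
import tempfile
from pathlib import Path

def load_class_names(csv_path):
    """Read class,hanzi,pinyin,english rows into three dicts keyed by class string."""
    c_to_hz, c_to_py, c_to_e = {}, {}, {}
    # 'utf-8-sig' drops a leading byte-order mark if the file has one.
    with open(csv_path, newline='', encoding='utf-8-sig') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for c, hz, py, e in reader:
            c_to_hz[c] = hz
            c_to_py[c] = py.strip()  # trim stray spaces around the fields
            c_to_e[c] = e.strip()
    return c_to_hz, c_to_py, c_to_e

# Two-row demo file written with a BOM, mimicking the layout seen above.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / 'names_demo.csv'
    demo_csv.write_text(
        'Class,Chinese,Pinyin,English\n'
        '01,白菜,bái cài ,Chinese cabbage\n'
        '02,菠菜,bō cài ,spinach\n',
        encoding='utf-8-sig',
    )
    c_to_hz, c_to_py, c_to_e = load_class_names(demo_csv)

print(c_to_e)
```

With the real file, the call would be `load_class_names(names_path)`.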


## Cleaning the Data

I need to clean the data for each of the 29 classes, because there are a lot of irrelevant images.

### View Data

In [22]:
old_ns = [195, 191, 188, 291, 328, 275, 295, 296, 379, 381, 
          398, 279, 294, 268, 385, 290, 193, 287, 290, 194, 
          191, 185, 200, 295, 289, 276, 299, 295, 297]

new_ns = [len(f.ls()) for f in data_root_path.ls() if f.is_dir()]

In [23]:
np.sum(old_ns)

8024

In [24]:
np.sum(new_ns)

2117

In [25]:
np.sum(new_ns) / np.sum(old_ns)

0.26383349950149554

After cleaning, only about 26.4% of the original images remain in the data set (2117 of 8024).

In [27]:
print(np.min(new_ns))
print(np.max(new_ns))
print(np.median(new_ns))
print(np.mean(new_ns))

17
156
73.0
73.0
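
As a quick cross-check of the numbers above, the same figures can be recomputed from the two totals alone. This is plain arithmetic on the values printed in this notebook; the variable names are only illustrative.

```python
old_total = 8024   # images downloaded (sum of old_ns)
new_total = 2117   # images left after cleaning (sum of new_ns)
n_classes = 29

kept = new_total / old_total            # fraction of images that survived cleaning
removed = old_total - new_total         # images deleted during cleaning
mean_per_class = new_total / n_classes  # average images per class after cleaning

print(f'kept {kept:.1%}, removed {removed}, mean per class {mean_per_class}')
```

The mean agrees with the `np.mean(new_ns)` output above.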
