### Data preparation

#### This notebook is part of a Kaggle competition.
#### The data can be found at https://www.kaggle.com/c/imaterialist-challenge-furniture-2018/data

In this scenario we are given JSON files that contain the URL and the label for each image, so the first step is to actually fetch the images.

The JSON files have the following structure:

```
{
    "images" : [image],
    "annotations" : [annotation],
}

image {
    "image_id" : int,
    "url": [string]
}

annotation {
    "image_id" : int,
    "label_id" : int
}
```
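
To make the structure concrete, here is a minimal sketch that parses a tiny hand-written sample of the same shape and pairs each image URL with its label. The sample values and the example.com URLs are made up for illustration; they are not taken from the dataset.

```python
import json

# Made-up sample that follows the structure above (not real dataset entries)
sample_text = """
{
    "images": [
        {"image_id": 1, "url": ["http://example.com/a.jpg"]},
        {"image_id": 2, "url": ["http://example.com/b.jpg"]}
    ],
    "annotations": [
        {"image_id": 1, "label_id": 5},
        {"image_id": 2, "label_id": 7}
    ]
}
"""

sample = json.loads(sample_text)

# Look up the label of every image by its image_id
label_of = {a["image_id"]: a["label_id"] for a in sample["annotations"]}

# "url" is a list of strings, so take its first element
pairs = [(img["url"][0], label_of[img["image_id"]]) for img in sample["images"]]
print(pairs)
```

The two lists are linked only through `image_id`, which is why a lookup table from `image_id` to `label_id` is built first.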

We have three files:

* test.json
* train.json
* validation.json

Ultimately we will store our data in a structure like this:
```
├── test
│   ├── 01
│   └── 02
├── train
│   ├── 01
│   └── 02
└── validation
    ├── 01
    └── 02

```

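There are quite a few different categories; this example only shows how the layout would look with two of them. Note that the download code later in this notebook names each directory after the raw `label_id` (for example `train/1/`), without zero padding.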
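
As a rough sketch of how such a layout could be created up front, the block below builds one directory per split and label inside a temporary directory. `make_layout` is a hypothetical helper that is not part of the notebook, and it names directories with `str(label_id)`, so the two categories appear as `1` and `2` rather than `01` and `02`.

```python
import os
import tempfile

def make_layout(root, splits, label_ids):
    """Create root/<split>/<label_id>/ for every split and label."""
    created = []
    for split in splits:
        for label_id in label_ids:
            path = os.path.join(root, split, str(label_id))
            os.makedirs(path, exist_ok=True)  # no error if it already exists
            created.append(path)
    return created

# Try it in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as root:
    make_layout(root, ["test", "train", "validation"], [1, 2])
    layout = {split: sorted(os.listdir(os.path.join(root, split)))
              for split in sorted(os.listdir(root))}
print(layout)
```

Creating directories lazily at download time, as the notebook does, works just as well; building them in advance only makes the target layout explicit.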

In [1]:
import json
import requests
import concurrent.futures
import shutil
import os

In [2]:
# Use a context manager so the file handle is closed after loading
with open('train.json') as f:
    train_raw = json.load(f)

In [3]:
len(train_raw["annotations"])

194828

### Almost 200k images! We will need some sort of threaded downloader.

In [4]:
def read_json(file_name):
    """Return a dict mapping each label_id to the list of image URLs with that label."""
    with open(file_name) as f:
        raw = json.load(f)
    images = raw["images"]
    annotations = raw["annotations"]

    results = {}  # Key is label_id, value is a list of URLs

    # Map every image_id to its label and create one empty list per label
    image_id_to_label_id = {}
    for annotation_dict in annotations:
        label_id = annotation_dict["label_id"]
        this_image_id = annotation_dict["image_id"]
        results.setdefault(label_id, [])
        image_id_to_label_id[this_image_id] = label_id

    for image_dict in images:
        image_id = image_dict["image_id"]
        url = image_dict["url"]  # "url" is a list of strings (see the structure above)

        this_label_id = image_id_to_label_id[image_id]
        results[this_label_id] += url  # extends the list, since url is itself a list

    return results

In [5]:
results = read_json("train.json")

In [6]:
total_samples = 0
for label_id, images in results.items():
    total_samples += len(images)
    print(label_id, " has ", len(images), " examples")

1  has  1254  examples
2  has  1521  examples
3  has  2368  examples
4  has  1500  examples
5  has  1599  examples
6  has  1115  examples
7  has  1609  examples
8  has  1365  examples
9  has  477  examples
10  has  1992  examples
11  has  1748  examples
12  has  2609  examples
13  has  1640  examples
14  has  1877  examples
15  has  1071  examples
16  has  1366  examples
17  has  1491  examples
18  has  1570  examples
19  has  855  examples
20  has  3996  examples
21  has  2577  examples
22  has  1445  examples
23  has  1202  examples
24  has  1774  examples
25  has  527  examples
26  has  1582  examples
27  has  2189  examples
28  has  1706  examples
29  has  1460  examples
30  has  1183  examples
31  has  2089  examples
32  has  1710  examples
33  has  1246  examples
34  has  716  examples
35  has  700  examples
36  has  1602  examples
37  has  2260  examples
38  has  2317  examples
39  has  1109  examples
40  has  2050  examples
41  has  625  examples
42  has  3973  examples
43  has
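
The counts above vary a lot between labels. A small sketch like the following could summarise that imbalance. `summarize_counts` is a hypothetical helper, and the sample dictionary is made up so that the block runs on its own; in the notebook you would pass `results` instead.

```python
def summarize_counts(results):
    """Return the smallest class, the largest class and the total number of URLs."""
    counts = {label_id: len(urls) for label_id, urls in results.items()}
    smallest = min(counts, key=counts.get)
    largest = max(counts, key=counts.get)
    return {
        "smallest": (smallest, counts[smallest]),
        "largest": (largest, counts[largest]),
        "total": sum(counts.values()),
    }

# Made-up stand-in for the real `results` dictionary
sample_results = {1: ["a", "b", "c"], 2: ["d"], 3: ["e", "f"]}
summary = summarize_counts(sample_results)
print(summary)
```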

In [11]:
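def download_image(image, label_id, path_prefix="train/"):
    """Download one image URL into <path_prefix><label_id>/."""
    try:
        # The timeout keeps a dead server from blocking a worker thread forever;
        # 30 seconds is an arbitrary choice
        request = requests.get(image, stream=True, timeout=30,
                               headers={'Connection': 'close'})
    except requests.exceptions.RequestException:
        # Without this, the executor would silently swallow connection errors
        print("Bummer! could not get image", image)
        return
    if request.status_code == 200:
        dir_to_download = path_prefix + str(label_id) + "/"
        os.makedirs(dir_to_download, exist_ok=True)
        # The file name is the last segment of the URL
        path = dir_to_download + image.split("/")[-1]
        with open(path, 'wb') as f:
            request.raw.decode_content = True
            shutil.copyfileobj(request.raw, f)
    else:
        print("Bummer! could not get image", image)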
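
Because `download_image` names each file after the last segment of the URL, two URLs that end in the same segment would overwrite each other, and any query string ends up in the file name. One possible alternative, sketched here, derives the name from a hash of the full URL. `safe_file_name` is a hypothetical helper, the example.com URLs are made up, and falling back to `.jpg` for unknown extensions is an assumption.

```python
import hashlib
import os
from urllib.parse import urlparse

KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def safe_file_name(url, default_ext=".jpg"):
    """Build a file name from a hash of the URL, keeping a known image extension."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Only look at the path, so "?id=1&w=1920" never reaches the file name
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in KNOWN_EXTENSIONS:
        ext = default_ext  # assumption: treat unknown files as JPEG
    return digest + ext

name_a = safe_file_name("http://example.com/shop-a/1.jpg")
name_b = safe_file_name("http://example.com/shop-b/1.jpg")
name_c = safe_file_name("http://example.com/pic.ashx?id=3711&w=1920")
print(name_a, name_b, name_c)
```

The trade-off is that hashed names are no longer human-readable, so this only makes sense if the original file names carry no useful information.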

In [None]:
images_to_go = total_samples  # counts tasks still to be submitted, not finished downloads
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as e:
    for label_id, images in results.items():
        for image in images:
            images_to_go -= 1
            e.submit(download_image, image, label_id)
            

Bummer! could not get image https://www.workinghouse.com.tw/FileArtPic.ashx?id=3711&cxtype=Product&w=1920&h=1440
Bummer! could not get image http://www.zsnews.cn/data/photo/Backup/2014/03/14/tw_201431410525397174.jpg
Bummer! could not get image http://pic2.cxtuku.com/00/10/90/b853eb9179a4.jpg
Bummer! could not get image http://www.chengdushuibei.com/upLoad/product/month_1405/201405211206142801.jpg
Bummer! could not get image http://pic3.nipic.com/20090624/540121_101428045_2.jpg
Bummer! could not get image http://www.cookpower.com.tw/wp-content/uploads/2014/11/VH-0020-0015-500.jpg
Bummer! could not get image http://pic12.nipic.com/20110113/913779_170623392130_2.jpg
Bummer! could not get image http://image6.huangye88.com/2014/09/30/088ed4e0f574d7ed.jpg
Bummer! could not get image http://img.alicdn.com/imgextra/i4/166264593/TB2Lpo4aGe5V1BjSspkXXcoqpXa_%21%21166264593.jpg
Bummer! could not get image http://opic.tbscache.com/manage/articles/2014/03-27/89547AB0-C734-DB59-8513-F250295F6949.jp

Bummer! could not get image http://www.fpshome365.com/upfile/proimage/20141125682892248.jpg
Bummer! could not get image http://banbao.chazidian.com/uploadfile/2016-02-17/s145567822888350.jpg
Bummer! could not get image http://pic1.cxtuku.com/00/03/30/b4661ba6425c.jpg
Bummer! could not get image http://img5.niutuku.com/phone/1212/1919/1919-niutuku.com-19682.jpg
Bummer! could not get image http://image5.huangye88.com/2013/06/22/5cc242a3f87b2b78.jpg
Bummer! could not get image http://memberpic.114my.cn/dgsjjj/uploadfile/image/20160623/20160623140517_1833891439.jpg
Bummer! could not get image http://image.gojiaju.com/userfile/cdgfsy/images/products/50787_1.jpg
Bummer! could not get image https://img.alicdn.com/imgextra/TB2O5r7go3IL1JjSZFMXXajrFXa_!!1647092695.jpg
Bummer! could not get image https://img.alicdn.com/imgextra/TB2gM0edtfJ8KJjy0FeXXXKEXXa_!!1103450587.jpg
Bummer! could not get image http://image.gojiaju.com/userfile/silandi/images/products/20090217172537.JPG
Bummer! could not ge

Bummer! could not get image http://images3.qianyan.biz/qy/1/11/90/2013102613495774608141.jpg
Bummer! could not get image http://www.isgo.com/userfiles/product/img/20140827/470/1409133198506.jpg
Bummer! could not get image http://l.b2b168.com/2016/07/28/15/201607281522400621274.jpg
Bummer! could not get image http://l.b2b168.com/2011/10/09/10/201110091055580413004.jpg
Bummer! could not get image http://l.b2b168.com/2017/05/16/07/201705160743329144724.jpg
Bummer! could not get image https://img.alicdn.com/imgextra/TB2_lGSe63z9KJjy0FmXXXiwXXa_!!355025641.jpg
Bummer! could not get image http://image.gojiaju.com/userfile/shuangma/images/products/53857_1.jpg
Bummer! could not get image http://image.gojiaju.com/userfile/shuangma/images/products/53854_1.gif
Bummer! could not get image http://image.gojiaju.com/userfile/hongdingxuan/images/products/55449_1.jpg
Bummer! could not get image http://image.gojiaju.com/userfile/hongdingxuan/images/products/55427_1.jpg
Bummer! could not get image http:/

Bummer! could not get image http://l.b2b168.com/2010/10/30/11/20101030114300217169.jpg
Bummer! could not get image http://www258com.b0.upaiyun.com/258com/20170810/d64fb2e4f5a45b925df00bf11c497305.jpg
Bummer! could not get image http://l.b2b168.com/2013/11/20/09/201311200934009639904.jpg
Bummer! could not get image http://memberpic.114my.cn/0343321/product/20144/2014041166644033.jpg
Bummer! could not get image https://img.alicdn.com/imgextra/TB2aW5yyOpnpuFjSZFkXXc4ZpXa_!!3319215549.jpg
Bummer! could not get image http://l.b2b168.com/2014/03/21/13/201403211329283410654.jpg
Bummer! could not get image http://piccdn.pptfans.cn/2016/12/29/pptfans_20161229120718547702948c.jpg
Bummer! could not get image https://pic1.zhimg.com/50/v2-168607fc842b99a598c1b244ecb26562_hd.jpg
Bummer! could not get image http://cultureguru.my/store/wp-content/uploads/sites/3/2016/08/3-9.jpg
Bummer! could not get image http://pic2.cxtuku.com/00/10/22/b567a3b46656.jpg
Bummer! could not get image http://www.homekoo.c
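
The log above is the only record of which downloads failed. One way to collect that information instead of printing it, sketched below, is to have each task return a success flag and gather the flags with `concurrent.futures.as_completed`. `run_all` and `fake_download` are hypothetical names; `fake_download` stands in for a version of `download_image` that returns `True` or `False`, so the block runs without network access.

```python
import concurrent.futures

def run_all(download, results, max_workers=50):
    """Run download(url, label_id) for every URL and return the URLs that failed."""
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which URL each future belongs to
        future_to_url = {
            executor.submit(download, url, label_id): url
            for label_id, urls in results.items()
            for url in urls
        }
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                ok = future.result()
            except Exception:
                ok = False  # an exception in the worker counts as a failure
            if not ok:
                failed.append(url)
    return failed

def fake_download(url, label_id):
    """Stand-in for a real download: 'bad' URLs fail, 'boom' URLs raise."""
    if "boom" in url:
        raise RuntimeError("simulated network error")
    return "bad" not in url

# Made-up stand-in for the real `results` dictionary
sample_results = {1: ["good-1", "bad-1"], 2: ["good-2", "boom-1"]}
failed = sorted(run_all(fake_download, sample_results, max_workers=4))
print(failed)
```

The returned list could then be written to a file and retried later, rather than being copied out of the notebook output.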