## 2. COCO Dataset
### 2.1. Dataset download

The COCO dataset is a large-scale object detection, segmentation, and captioning dataset. It offers a large number of images from various contexts, each with annotations, i.e. structured information on the contents of the image and, in particular, on the objects it contains.

You can download a filtered and preprocessed version of COCO (which we will refer to as “modified”) from the following URL:
https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/modified_coco.json

This dataset is a JSON file, which you can open with the json module introduced earlier. The file contains a list of images and, for each image, the annotations key lists all the available annotations. The following is the entry for one such image.

{
  "file_name": "000000465265.png",
  "image_id": 465265,
  "annotations": ["person", "person", "person", "fire hydrant", "handbag", "chair", "cell phone"]
}

This means that the image contains 3 people, a fire hydrant, a handbag, a chair and a cell phone.
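As a small illustration of reading such an entry with the json module, the sketch below parses the record shown above from a string and counts how often each label occurs. The variable names (`raw`, `record`, `label_counts`) are chosen here for illustration; for the real file you would use `json.load` on the opened file instead of `json.loads` on a string.

```python
import json
from collections import Counter

# The record shown above, written as a JSON string for illustration.
raw = """
{"file_name": "000000465265.png",
 "image_id": 465265,
 "annotations": ["person", "person", "person", "fire hydrant",
                 "handbag", "chair", "cell phone"]}
"""

record = json.loads(raw)

# Count how many times each label appears in this image.
label_counts = Counter(record["annotations"])

print(label_counts["person"])   # number of "person" annotations
print(sorted(label_counts))     # distinct labels in the image
```

Note that the same label can appear several times in one image, which matters later when the annotations are turned into transactions.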


### 2.2. Exercises

**1.** Implement your own version of Apriori. You can use the toy dataset below for initial troubleshooting and testing.

- a,b
- b,c,d 
- a,c,d,e
- a,d,e
- a,b,c
- a,b,c,d
- b,c
- a,b,c
- a,b,d
- b,c,e

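When run with a minimum support count greater than 1 (i.e. keeping itemsets whose relative support exceeds 0.1, since there are 10 transactions), the expected frequent itemsets, with their support counts, are: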

- a -> 7
- b -> 8
- c -> 7
- d -> 5
- e -> 3
- a,b -> 5
- a,c -> 4
- a,d -> 4
- a,e -> 2
- b,c -> 6
- b,d -> 3
- c,d -> 3
- c,e -> 2
- d,e -> 2
- a,b,c -> 3
- a,b,d -> 2
- a,c,d -> 2
- a,d,e -> 2
- b,c,d -> 2
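To make the numbers in this list easy to check by hand, here is a minimal sketch of how the support of a single itemset can be computed on the toy dataset. The helper names `support_count` and `relative_support` are illustrative and not part of the exercise.

```python
toy = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"],
       ["a", "d", "e"], ["a", "b", "c"], ["a", "b", "c", "d"],
       ["b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["b", "c", "e"]]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items.issubset(t))


def relative_support(itemset, transactions):
    """Support count divided by the total number of transactions."""
    return support_count(itemset, transactions) / len(transactions)


print(support_count(["a", "b"], toy))      # absolute support of {a, b}
print(relative_support(["b", "c"], toy))   # relative support of {b, c}
print(support_count(["b", "e"], toy))      # {b, e} occurs once, so it is not listed
```

Because the threshold is strict (count greater than 1), the pair b,e, which appears in a single transaction, is the only pair missing from the expected list.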

In [1]:
toy = [["a","b"],["b","c","d"],["a","c","d","e"],
       ["a","d","e"],["a","b","c"],["a","b","c","d"],
       ["b","c"],["a","b","c"],["a","b","d"],["b","c","e"]]

In [18]:
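from itertools import combinations


class Apriori:
    """Frequent-itemset miner that counts every k-combination of items (no candidate pruning)."""

    def __init__(self, dataset):
        self.ds = dataset  # list of transactions, each a list of items

    def comb_n(self, n):
        """Return all candidate itemsets of size n, in sorted order."""
        items = set()
        for transaction in self.ds:
            items.update(transaction)
        if n == 1:
            return sorted(items)
        # sorting the items first makes every combination come out already sorted
        return [list(c) for c in combinations(sorted(items), n)]

    def itemsets(self, n, minsup):
        """Return {itemset: count} for size-n itemsets with count > minsup, or None if there are none."""
        candidates = {}
        for itemset in self.comb_n(n):
            if n > 1:
                # a transaction supports the itemset only if it contains all of its items
                count = sum(all(item in t for item in itemset) for t in self.ds)
                key = tuple(itemset)
            else:
                count = sum(itemset in t for t in self.ds)
                key = itemset
            if count > minsup:  # strict threshold, as in the exercise text
                candidates[key] = count
        if candidates:  # a single frequent itemset is still a valid result
            return candidates
        print(f"There is no itemset with {n} elements that satisfy minsup={minsup} condition")
        return None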

In [5]:
maxlen = max(len(a) for a in toy)  # no itemset can be longer than the longest transaction
alg = Apriori(toy)
for k in range(1, maxlen + 1):
    result = alg.itemsets(k, 1)  # call once and reuse the result
    if result is None:
        break  # no frequent k-itemsets, so no larger ones either
    for itemset, count in result.items():
        print(itemset, "->", count)
    print("---------------------------")

a -> 7
b -> 8
c -> 7
d -> 5
e -> 3
---------------------------
('a', 'b') -> 5
('a', 'c') -> 4
('a', 'd') -> 4
('a', 'e') -> 2
('b', 'c') -> 6
('b', 'd') -> 3
('c', 'd') -> 3
('c', 'e') -> 2
('d', 'e') -> 2
---------------------------
('a', 'b', 'c') -> 3
('a', 'b', 'd') -> 2
('a', 'c', 'd') -> 2
('a', 'd', 'e') -> 2
('b', 'c', 'd') -> 2
---------------------------
There is no itemset with 4 elements that satisfy minsup=1 condition
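The class above counts the support of every k-combination of items. That reproduces the expected output, but it does not use the candidate pruning that characterises Apriori: a k-itemset can only be frequent if all of its (k-1)-subsets are frequent. The following is a hedged sketch of that level-wise scheme; the function name `apriori` and the use of frozensets as keys are choices made here for illustration, and the threshold is kept strict (count > minsup) to match the exercise.

```python
from itertools import combinations

toy = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"],
       ["a", "d", "e"], ["a", "b", "c"], ["a", "b", "c", "d"],
       ["b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["b", "c", "e"]]


def apriori(transactions, minsup):
    """Return {frozenset: count} for all itemsets with count > minsup."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c > minsup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: scan the transactions only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c > minsup}
        result.update(frequent)
        k += 1
    return result


frequent_itemsets = apriori(toy, 1)
for itemset, count in sorted(frequent_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(",".join(sorted(itemset)), "->", count)
```

On the toy data the prune step already discards candidates such as a,b,e before any counting, because the pair b,e is not frequent.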


**2.** Once you have implemented a working version of Apriori, you can load the modified COCO dataset from Subsection 2.1 into memory. Then transform the dataset into a format compatible with the input expected by your Apriori implementation.

In [19]:
import json

In [20]:
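# local copy of the modified COCO file (saved here as coco.json)
with open("coco.json", "r") as file:
    obj = json.load(file)

# one transaction per image: distinct labels only, in sorted order
l = [sorted(set(image["annotations"])) for image in obj]
l[:5]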

[['car', 'stop sign', 'train'],
 ['bench', 'chair', 'dining table', 'person', 'potted plant'],
 ['stop sign'],
 ['fire hydrant', 'person'],
 ['bicycle', 'fire hydrant']]
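The same transformation can be tried without the file on disk. The sketch below applies it to the single record shown in Subsection 2.1; `to_transactions` is a hypothetical helper name introduced only for this example.

```python
def to_transactions(records):
    """Turn a list of image records into a list of sorted, duplicate-free label lists."""
    return [sorted(set(record["annotations"])) for record in records]


# The example record from the dataset description.
records = [
    {
        "file_name": "000000465265.png",
        "image_id": 465265,
        "annotations": ["person", "person", "person", "fire hydrant",
                        "handbag", "chair", "cell phone"],
    },
]

transactions = to_transactions(records)
print(transactions)
```

Repeated labels collapse into one item, since for itemset mining only the presence of a label in an image matters, not how many times it occurs.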

**3.** You can now run your implementation on the modified COCO dataset.

In [23]:
maxlen = 1  # use max(len(a) for a in l) to explore every possible itemset size
coco = Apriori(l)
for k in range(1, maxlen + 1):
    result = coco.itemsets(k, 100)  # call once and reuse the result
    if result is None:
        break  # no frequent k-itemsets, so no larger ones either
    for itemset, count in result.items():
        print(itemset, "->", count)
    print("---------------------------")

backpack -> 426
baseball bat -> 167
baseball glove -> 150
bench -> 2169
bicycle -> 381
bottle -> 140
bus -> 456
car -> 1852
cell phone -> 177
chair -> 301
clock -> 166
cup -> 111
dining table -> 193
dog -> 138
fire hydrant -> 673
handbag -> 615
motorcycle -> 205
parking meter -> 277
person -> 2943
potted plant -> 209
skateboard -> 172
sports ball -> 184
stop sign -> 666
suitcase -> 101
tennis racket -> 107
traffic light -> 1615
train -> 246
truck -> 643
umbrella -> 237
---------------------------
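The run above is limited to itemsets of size 1, since enumerating every combination of labels grows quickly with the itemset size. One possible way to go one level further, sketched here on the toy dataset rather than on COCO, is to count pairs directly inside each transaction and to consider only items that are frequent on their own. `frequent_pairs` is a hypothetical helper, and the threshold is strict (count > minsup) as in the rest of this section.

```python
from collections import Counter
from itertools import combinations

toy = [["a", "b"], ["b", "c", "d"], ["a", "c", "d", "e"],
       ["a", "d", "e"], ["a", "b", "c"], ["a", "b", "c", "d"],
       ["b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["b", "c", "e"]]


def frequent_pairs(transactions, minsup):
    """Return {(x, y): count} for pairs of items with count > minsup."""
    # First pass: support of single items.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c > minsup}

    # Second pass: count only pairs made of frequent items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c > minsup}


pairs = frequent_pairs(toy, 1)
for pair, count in sorted(pairs.items()):
    print(pair, "->", count)
```

With transactions in the format built in exercise 2, the same call could be made as `frequent_pairs(l, 100)`; the work then depends on the number of labels per image rather than on the number of all possible label pairs.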
