We will look at the differences between the train set (2020) and the test set (2019).

More information about the train set and the test set, as well as about the differences between them, can be found in the Jupyter notebooks "train.ipynb" and "test.ipynb", respectively.

In [1]:
# imports for code 
import pandas as pd
import numpy as np
import copy 

In [2]:
# load the train data csv file as data frame 
url_train = 'https://raw.githubusercontent.com/matankleiner/ProjectB/master/data/train/train.csv'
train_df = pd.read_csv(url_train) 
train_df

Unnamed: 0,id,landmark_id
0,17660ef415d37059,1
1,92b6290d571448f6,1
2,cd41bf948edc0340,1
3,fb09f1e98c6d2f70,1
4,25c9dfc7ea69838d,7
...,...,...
1580465,72c3b1c367e3d559,203092
1580466,7a6a2d9ea92684a6,203092
1580467,9401fad4c497e1f9,203092
1580468,aacc960c9a228b5f,203092


In [3]:
# load the test data csv file as data frame
url_test ='https://raw.githubusercontent.com/matankleiner/ProjectB/master/data/test/recognition_solution_v2.1.csv'
test_df = pd.read_csv(url_test) 
test_df = test_df.drop("Usage", axis=1)
test_df

Unnamed: 0,id,landmarks
0,e324e0f3e6d9e504,
1,d9e17c5f3e0c47b3,
2,1a748a755ed67512,
3,537bf9bdfccdafea,
4,13f4c974274ee08b,
...,...,...
117572,e351c3e672c25fbd,190441
117573,5426472625271a4d,
117574,7b6a585405978398,
117575,d885235ba249cf5d,


As we saw in the "test.ipynb" file, most of the images in the test set are not landmarks at all. Therefore, for our purposes, we will drop them.

In [4]:
#test_df.dropna().to_csv('C:/Users/Matan/Desktop/projectB/data/test/test_dropna.csv', index = False)
test_df_dropna = test_df.dropna()
test_df_dropna

Unnamed: 0,id,landmarks
112,ed85edf01da02f26,179171
155,4d5d0e6264e6c7e0,124703
182,e153105026e18260,150977
234,db635e33c17229bb,92607
371,03b1294a0fa46763,184268
...,...,...
117154,4e4e7fdca971442f,95197
117242,efd80af423defb09,162786
117264,90e066e0d0ac2827,188823
117403,ee95080bf6187d9a,127232


As we saw, some images in the test set correspond to more than one class. For example:

In [5]:
test_df_dropna[test_df_dropna["landmarks"] == "118979 17049"]

Unnamed: 0,id,landmarks
691,aa55b28b8960b2e4,118979 17049
22772,0fcbd47ebba5eed7,118979 17049
64099,24c459d44b455532,118979 17049
64693,286d4ac3649c0f67,118979 17049
93802,561db31b28f29429,118979 17049


In the train set those classes, 118979 and 17049, are different classes:

In [6]:
train_df[train_df["landmark_id"] == 17049]

Unnamed: 0,id,landmark_id
130408,07e2a500f1ebc20e,17049
130409,43ae7a71affb63e9,17049
130410,47899b6ea49351fe,17049
130411,675841f3be3a43da,17049
130412,8c41453d5fbd68f3,17049
130413,9429b45bdc6661d0,17049
130414,b1fd8ec9e34141c7,17049
130415,b32c9005f1ad3852,17049
130416,c5975f7106245117,17049
130417,cca2ff4be64c4f31,17049


In [7]:
train_df[train_df["landmark_id"] == 118979]

Unnamed: 0,id,landmark_id
919855,12c2eac43e6cf187,118979
919856,1655195c59956224,118979
919857,18fb96898f600e7a,118979
919858,1b1ef98e45260ac0,118979
919859,1fc09f0ca51bcedd,118979
919860,22a6308acb8cab5e,118979
919861,24fcf5906fe14ad2,118979
919862,25aaa263d3d313a1,118979
919863,34e7c32d8da8c3eb,118979
919864,395a3a66f1ff30e8,118979


Those classes are nearby landmarks. In this case, the landmark is probably [this](https://www.google.com/maps/place/Triumphal+Arch/@50.8407278,4.3934422,16z/data=!4m5!3m4!1s0x47c3c4a5c4ce11d5:0x5d4a9cc8fc1faf04!8m2!3d50.8405283!4d4.3928857) or [this](https://www.google.com/maps/place/Art+%26+History+Museum/@50.8392659,4.3895678,16.75z/data=!4m5!3m4!1s0x0:0x76c671e867f1a1e7!8m2!3d50.8393481!4d4.391503). As one can see from the map, they are part of the same place, so it makes sense that the network won't distinguish between those two classes.
There are more images like this. 
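
As a minimal sketch of the idea, the space-separated label strings can be split into integer ids and the multi-class images picked out. The toy data frame `toy` and the column `label_list` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for test_df_dropna (ids are made up for illustration)
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "landmarks": ["118979 17049", "190441", "118979 17049"],
})

# Split each label string on whitespace into a list of integer class ids
toy["label_list"] = toy["landmarks"].str.split().apply(lambda ids: [int(i) for i in ids])

# Keep only the images that carry more than one class id
multi = toy[toy["label_list"].apply(len) > 1]

# Distinct groups of ids among the multi-class images
groups = sorted(set(map(tuple, multi["label_list"])))
print(f"{len(multi)} multi-class images, {len(groups)} distinct group(s): {groups}")
```

Splitting on whitespace avoids the empty matches produced by a `[0-9]*` pattern, at the cost of assuming the ids are always separated by spaces.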

In [8]:
test_series = test_df_dropna["landmarks"] 
test_sep = test_series.str.findall(r'[0-9]*') # separate the landmark ids into strings; the pattern also yields empty matches
test_sep = test_sep.reset_index()
test_sep = test_sep.drop("index", axis = 1)

# choose only the images with more than one id
# (a single id gives ['<id>', ''], i.e. length 2, because of the trailing empty match)
test_mult_classes = [] 
for i in range(test_sep["landmarks"].shape[0]):
    if len(test_sep["landmarks"].values[i]) == 2:
        continue 
    else:
        test_mult_classes.append(test_sep["landmarks"].values[i])
print(f'There are {len(test_mult_classes)} images that correspond to more than one class')

# filter out the empty strings
test_mult_classes_no_spaces = [] 
for i in range(len(test_mult_classes)):
    test_mult_classes_no_spaces.append(list(filter(lambda x: x != '', test_mult_classes[i])))
    
# keep each distinct group of ids once (recent NumPy versions reject ragged input to np.unique)
test_mult_classes_no_spaces = [list(t) for t in sorted(set(map(tuple, test_mult_classes_no_spaces)))]
print(f'Those {len(test_mult_classes)} images correspond to {len(test_mult_classes_no_spaces)} different '
        'class groups. The classes within each group are indistinguishable and practically the same class.')

There are 317 images that correspond to more than one class
Those 317 images correspond to 140 different class groups. The classes within each group are indistinguishable and practically the same class.


We would like to find the classes that are part of the test set but not part of the train set.

In [9]:
# first we will make a data frame of all the classes in the test set 
tmp = test_df_dropna["landmarks"].str.split(" ", n = 10, expand = True) 
test_sep_df = pd.concat([tmp[0], tmp[1], tmp[2], tmp[3]]).dropna()
test_sep_df = test_sep_df.to_frame('landmarks').reset_index()
test_sep_df = test_sep_df.drop('index', axis=1)

train_classes = train_df["landmark_id"].unique() # take only the unique class values in the train set 
test_classes = test_sep_df["landmarks"].unique() # take only the unique class values in the test set 
print(f"There are {test_classes.shape[0]} different classes in the test set.")
test_classes = test_classes.astype(int) # convert test set classes type to int (np.int was removed in NumPy 1.24)
# find all classes that are part of the test set but not the train set using mask 
mask1 = np.in1d(train_classes, test_classes) 
train_class_masked = train_classes[mask1]
print(f"Of those {test_classes.shape[0]} different classes, only {train_class_masked.shape[0]} are also in the train set.")
mask2 = np.in1d(test_classes, train_class_masked, invert=True)
only_test = test_classes[mask2]
print("\nThe classes that are only part of the test set:")
print(only_test)

There are 852 different classes in the test set.
Of those 852 different classes, only 718 are also in the train set.

The classes that are only part of the test set:
[129293  78530  58151 164395 132345 187645  10936 126370 193772 191475
  75799 144991   2247  19136  61105  11890 143354  17564  78038 100782
   7931 158276 164713  15223    556 148225 140701 190161  81735  19886
  37212 189289 163584  15857  62611  35747  72033  50000  16898 190128
 147940 124997  68864  90591  63195 202767  99920 102539  67392  85159
  16239 193638 178080  38924 140587  31765  16492 113456 168455  14170
 149453 171935 106079 159817  31536 113023  91775  60583  36854 153307
 124147 195074 137601  44537 109142 190520  55101 165325  40005  63823
 135791 146282 100549   6664  80481 105983 122085  48402  17480 149883
 147155  58337  15778 103178  57479  48727  96094  43259 119944 103243
 114931 101911  39110 144420  77431 105684 107779 134790  56653 101757
  28129  46306 119337 170731  96980 162591 196108  97

We now know that 134 classes (852 - 718) are part of the test set but not of the train set.
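
As a cross-check of the masking logic, the same split can be sketched with NumPy's set routines. The arrays `train_classes_toy` and `test_classes_toy` are small made-up stand-ins, not the real class lists.

```python
import numpy as np

# Toy stand-ins for the unique class ids of each set (values chosen for illustration)
train_classes_toy = np.array([1, 7, 17049, 118979])
test_classes_toy = np.array([7, 17049, 556, 2247])

# Classes that appear in the test set but not in the train set
only_test_toy = np.setdiff1d(test_classes_toy, train_classes_toy)

# Classes that appear in both sets
in_both_toy = np.intersect1d(test_classes_toy, train_classes_toy)

print("only in test:", only_test_toy)
print("in both:", in_both_toy)
```

Both functions return sorted unique values, so the two results together should account for every distinct test class.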

We would like to look into the multiple classes a little more. For each group of classes that labels a single image, we would like to know which of three cases applies:
* none of the classes is part of the train set 
* some, but not all, of the classes are part of the train set 
* all of the classes are part of the train set

Once we know that, we can choose how to handle these groups.
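
The three cases above can be sketched on toy data as follows. The helper `categorize` and the names `train_set_toy` and `groups_toy` are hypothetical and only illustrate the classification; they are not the notebook's own implementation.

```python
def categorize(group, train_classes):
    """Return 'none', 'some' or 'all' according to how many ids of `group` are in `train_classes`."""
    in_train = [c for c in group if c in train_classes]
    if not in_train:
        return "none"
    if len(in_train) == len(group):
        return "all"
    return "some"

# Toy data: a set of train class ids and three multi-class groups
train_set_toy = {7, 17049, 118979}
groups_toy = [[118979, 17049], [7, 556], [556, 2247]]

cases = [categorize(g, train_set_toy) for g in groups_toy]
print(cases)
```

Checking the empty case first keeps the three outcomes mutually exclusive.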

In [10]:
mult_class = [y for x in test_mult_classes_no_spaces for y in x] # flatten the list
mult_class = np.array(mult_class)
mult_class = mult_class.astype(int) # np.int was removed in NumPy 1.24
mult_class = np.unique(mult_class)

# convert all list items from str to int 
for i in range(len(test_mult_classes_no_spaces)): 
    test_mult_classes_no_spaces[i] = [int(k) for k in test_mult_classes_no_spaces[i]] 

# check if any of the mult classes are only part of the test set
mask3 = np.in1d(mult_class, only_test) 
only_test_mult = mult_class[mask3]

# if one of the mult classes are only part of the test set we would like to remove them from our test set 
classes_to_remove = []
for list_ in test_mult_classes_no_spaces: 
    for class_ in list_: 
        mask = np.in1d(only_test_mult, class_)
        if any(mask):
            classes_to_remove.append(class_)


classes_to_remove_tmp = copy.deepcopy(classes_to_remove)
test_mult_classes_no_spaces_tmp = copy.deepcopy(test_mult_classes_no_spaces)

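# note: list_.remove() inside "for class_ in list_" mutates the list being iterated,
# so the element right after a removed one is skipped; only the head of
# classes_to_remove_tmp is compared on each step
for list_ in test_mult_classes_no_spaces_tmp: 
    for class_ in list_:
        if classes_to_remove_tmp[0] == class_: 
            classes_to_remove_tmp.remove(classes_to_remove_tmp[0])
            list_.remove(class_)
            continue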
            
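# after removing the classes that are only in the test set from the multi-class groups,
# we would like to split the groups into 3 lists: empty, less_than, equal
# note: the len(list_) == 0 check comes after the "<" check, so a fully emptied group
# is counted under less_than rather than empty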
empty, less_than, equal = [], [], []
for i, list_ in enumerate(test_mult_classes_no_spaces_tmp): 
    if len(list_) == len(test_mult_classes_no_spaces[i]): 
        equal.append(test_mult_classes_no_spaces[i])
    elif len(list_) < len(test_mult_classes_no_spaces[i]):
        less_than.append(test_mult_classes_no_spaces[i])
    elif len(list_) == 0:
        empty.append(test_mult_classes_no_spaces[i])

print(f"From the multiple classes, {len(less_than)} of them are smaller because we removed the classes that "
         f"are only part of the test set. {len(equal)} of them are the same and {len(empty)} of them are now empty classes")

From the multiple classes, 17 of them are smaller because we removed the classes that are only part of the test set. 123 of them are the same and 0 of them are now empty classes


That means we now have 123 groups of classes in which every class is also part of the train set, and each such group labels the same image. We need to decide how to deal with them.

The classifier output is a probability vector. For images that correspond to more than one class, we will check the probabilities assigned to each of the given classes, and we may classify such an image as belonging to both classes.
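
One possible way to run such a check is sketched below, assuming the classifier returns one probability per class in a known class order. The function `top_k_hits`, the class order `class_ids_toy` and the probabilities are all made up for illustration; the notebook does not specify how the check is done.

```python
import numpy as np

def top_k_hits(probs, true_classes, class_ids, k=2):
    """Return the true class ids that are among the k most probable classes."""
    top_idx = np.argsort(probs)[::-1][:k]          # indices of the k largest probabilities
    top_classes = {class_ids[i] for i in top_idx}  # map indices back to class ids
    return sorted(top_classes & set(true_classes))

# Toy example: four classes, an image labelled with two of them
class_ids_toy = [0, 7, 17049, 118979]
probs_toy = np.array([0.05, 0.10, 0.40, 0.45])

hits = top_k_hits(probs_toy, [118979, 17049], class_ids_toy, k=2)
print(hits)
```

If both labelled classes show up among the top predictions, treating the prediction as correct for either class is one reasonable option.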

In [11]:
test_df = pd.read_csv(url_test) 
test_df = test_df.drop("Usage", axis=1)

# separate the string into a list 
landmarks_series = test_df["landmarks"] 
landmarks_sep = landmarks_series.str.findall(r'[0-9]*') # separate the landmark ids into strings (empty matches are filtered out below)
landmarks_sep = landmarks_sep.fillna(0)
landmarks_sep = landmarks_sep.reset_index()
landmarks_sep = landmarks_sep.drop("index", axis = 1)

# filter out the empty strings in each list  
landmarks_sep_no_spaces = [] 
for i in range(len(landmarks_sep['landmarks'])):
    if landmarks_sep['landmarks'][i] == 0:
        landmarks_sep_no_spaces.append(landmarks_sep['landmarks'][i]) 
    else:
        landmarks_sep_no_spaces.append(list(filter(lambda x: x != '', landmarks_sep['landmarks'][i])))
        
# convert list items type to int         
for i in range(len(landmarks_sep_no_spaces)):
    if isinstance(landmarks_sep_no_spaces[i],list):
        if len(landmarks_sep_no_spaces[i]) == 1: 
            landmarks_sep_no_spaces[i] = int(landmarks_sep_no_spaces[i][0])
        elif len(landmarks_sep_no_spaces[i]) > 1:
            for j in range(len(landmarks_sep_no_spaces[i])):
                landmarks_sep_no_spaces[i][j] = int(landmarks_sep_no_spaces[i][j])
        else: # not supposed to get here   
            print("Error!") 
              
# change the 'landmarks' column to landmarks as list of int or as int
test_df['landmarks'] = landmarks_sep_no_spaces
# deep copy of test_df 
test_df_tmp1 = copy.deepcopy(test_df)
test_df_tmp2 = copy.deepcopy(test_df)

In [12]:
def is_in_list(list_, item): 
    # True if item is one of the elements of list_
    return item in list_

# find the indices of the rows to drop and items to remove from lists 
rows_to_drop = []
items_remove = []
for i in range(len(test_df_tmp2['landmarks'])):
    if test_df_tmp2['landmarks'][i] != 0: # all out of domain classes are 0, don't remove them 
        if isinstance(test_df_tmp2['landmarks'][i], int): # images that correspond to only one class 
            if is_in_list(only_test, test_df_tmp2['landmarks'][i]):
                rows_to_drop.append(i)
        if isinstance(test_df_tmp2['landmarks'][i], list): # images that correspond to more than one class   
            for j in range(len(test_df_tmp1['landmarks'][i])): 
                if is_in_list(only_test, test_df_tmp1['landmarks'][i][j]): 
                    items_remove.append((i,j))

# if all items in a list need to be removed, append its row index to "rows_to_drop"
items_remove_df = pd.DataFrame(data = items_remove)
rows_to_remove = [] # indices in items_remove_df of the entries whose row was appended to "rows_to_drop" 
for i in range(len(items_remove)-1):
    if items_remove[i][0] == items_remove[i+1][0]:
        rows_to_drop.append(items_remove[i][0])
        rows_to_remove.append(i)
        rows_to_remove.append(i+1)
        
for row in rows_to_remove: 
    items_remove_df = items_remove_df.drop(row, axis = 0)
        
items_remove_df = items_remove_df.reset_index()
items_remove_df = items_remove_df.drop("index", axis = 1)

In [13]:
for i in range(items_remove_df.shape[0]):
    test_df_tmp2.landmarks[items_remove_df[0][i]].remove(test_df_tmp2.landmarks[items_remove_df[0][i]][items_remove_df[1][i]])
    
for row in rows_to_drop:
    test_df_tmp2 = test_df_tmp2.drop(row, axis = 0)

# reset indices  
test_df_tmp2 = test_df_tmp2.reset_index()
test_df_tmp2 = test_df_tmp2.drop("index", axis = 1)
test_df_tmp2

Unnamed: 0,id,landmarks
0,e324e0f3e6d9e504,0
1,d9e17c5f3e0c47b3,0
2,1a748a755ed67512,0
3,537bf9bdfccdafea,0
4,13f4c974274ee08b,0
...,...,...
117222,e351c3e672c25fbd,190441
117223,5426472625271a4d,0
117224,7b6a585405978398,0
117225,d885235ba249cf5d,0


In [14]:
# create a list of all the classes (i.e. 'landmarks') that are part of both the test set and the train set. 
landmarks = []
landmarks.append(0) # a class for all out of domain images 
for i in range(len(test_df_tmp1['landmarks'])):
    if test_df_tmp1['landmarks'][i] != 0:
            if isinstance(test_df_tmp1['landmarks'][i], int):
                landmarks.append(test_df_tmp1['landmarks'][i])
            else:
                for j in range(len(test_df_tmp1['landmarks'][i])):
                    landmarks.append(test_df_tmp1['landmarks'][i][j])

landmarks_np = np.array(landmarks)
landmarks_unique = np.unique(landmarks_np)
mask1 = np.in1d(landmarks_unique, only_test, invert=True) 
landmarks_unique = landmarks_unique[mask1]
landmarks_unique = list(landmarks_unique)

# ground_truth is a matrix of the true labels of each test set image. 
# the rows of ground_truth are the classes number in ascending order 
# the columns of ground_truth are the images in the test set
ground_truth = np.zeros((len(landmarks_unique), test_df_tmp2["id"].shape[0]))

# set to 1 every entry where an image (column) meets one of its classes (row)    
for i in range(test_df_tmp2["landmarks"].shape[0]): 
    if test_df_tmp2["landmarks"][i] == 0:
        ground_truth[0][i] = 1 
    else: 
        if isinstance(test_df_tmp2["landmarks"][i], int): 
            ground_truth[landmarks_unique.index(test_df_tmp2["landmarks"][i])][i] = 1
        if isinstance(test_df_tmp2["landmarks"][i], list): 
            for j in range(len(test_df_tmp2["landmarks"][i])): 
                ground_truth[landmarks_unique.index(test_df_tmp2["landmarks"][i][j])][i] = 1
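
The construction above can be sketched on toy data with a dictionary lookup in place of `list.index`, which avoids a linear search per label. The names `classes_toy`, `labels_toy`, `row_of` and `gt` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy class list (row order) and per-image labels: 0 means out of domain,
# an int is a single class, a list holds several classes for one image
classes_toy = [0, 48, 81, 17049, 118979]
labels_toy = [0, 48, [118979, 17049], 81]

# Map each class id to its row index once
row_of = {c: r for r, c in enumerate(classes_toy)}

# Rows are classes, columns are images, as in ground_truth
gt = np.zeros((len(classes_toy), len(labels_toy)), dtype=np.uint8)
for col, lab in enumerate(labels_toy):
    ids = lab if isinstance(lab, list) else [lab]
    for c in ids:
        gt[row_of[c], col] = 1

print(gt.sum(axis=0))  # number of classes marked for each image
```

A column that sums to more than 1 marks an image that corresponds to more than one class.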


In [21]:
pd.DataFrame(ground_truth).to_csv('C:/Users/Matan/Desktop/projectB/data/2019/ground_truth.csv', index = False)

In [23]:
landmarks_unique 

[0,
 48,
 81,
 1174,
 2053,
 2148,
 2220,
 2336,
 2473,
 2586,
 2872,
 3092,
 3232,
 3533,
 3705,
 3870,
 3887,
 4072,
 4177,
 4465,
 4987,
 5088,
 5292,
 5341,
 5522,
 6190,
 6363,
 6720,
 6798,
 7223,
 7428,
 7724,
 8078,
 8198,
 8552,
 9173,
 9463,
 9771,
 10155,
 10532,
 10763,
 10934,
 11153,
 11356,
 11362,
 11395,
 11755,
 11868,
 12151,
 12729,
 12731,
 12989,
 13975,
 14051,
 14098,
 14139,
 14160,
 14308,
 14859,
 14921,
 15310,
 15317,
 15622,
 15688,
 16275,
 16325,
 16748,
 16798,
 16892,
 16939,
 17049,
 17358,
 17716,
 17848,
 17960,
 18171,
 18391,
 18679,
 18926,
 20343,
 20772,
 21019,
 21232,
 21613,
 22056,
 22077,
 22363,
 22511,
 22813,
 23258,
 23777,
 24320,
 24958,
 24968,
 25109,
 25261,
 25863,
 26613,
 27609,
 28175,
 28364,
 28503,
 28604,
 30029,
 30923,
 30971,
 31136,
 31480,
 31793,
 31880,
 32004,
 32061,
 32576,
 32769,
 32998,
 33026,
 33636,
 34453,
 35027,
 35639,
 35855,
 35910,
 36006,
 36264,
 37134,
 37746,
 38055,
 38892,
 39012,
 39187,
 3927