# Training model on cartography classes

I want to see whether selecting easy-to-learn, hard-to-learn, or ambiguous data instances will also improve performance on our 2k curriculum.
I might make multiple curricula with:
- an equal mix of the three groups
- just ambiguous
- just hard-to-learn

Paper: Swayamdipta et al. (2020). Dataset cartography: Mapping and diagnosing datasets with training dynamics. arXiv preprint arXiv:2009.10795.

## Importing the datasets
### Importing cartography data

In [66]:
import pandas as pd

cartography_data = pd.read_json('snli_roberta_0_6_data_map_coordinates.jsonl', lines=True)

In [67]:
print(len(cartography_data))
print(cartography_data) # 4804607632

549367
                 guid   index  confidence  variability  correctness
0       4446013212032       0    0.675678     0.301201            4
1        568940932410       1    0.843770     0.314807            5
2       4107496324431       2    0.864639     0.302112            5
3        151853983012       3    0.872000     0.154454            6
4       6041167176312       4    0.653921     0.342495            4
...               ...     ...         ...          ...          ...
549362  5732659631411  549362    0.997563     0.002784            6
549363  3920105265011  549363    0.994982     0.010534            6
549364  6888801884010  549364    0.936412     0.080370            6
549365   189022647122  549365    0.777098     0.066182            6
549366  4506417051031  549366    0.913720     0.081541            6

[549367 rows x 5 columns]


### Importing SNLI data

In [32]:
# if assigntools not yet downloaded run line
# ! git clone https://github.com/kovvalsky/assigntools.git

# if zip file of SNLI data not yet downloaded run lines
# !wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip
# !unzip snli_1.0.zip

In [33]:
# original dataset from Stanford
data_snli = pd.read_json('snli_1.0/snli_1.0_train.jsonl', lines=True)
# data.columns = ["guid", "index", "confidence", "variability", "correctness"]
# print(data_snli["annotator_labels"])
print(data_snli.shape)

(550152, 10)


In [34]:
from assigntools.LoLa.read_nli import snli_jsonl2dict, sen2anno_from_nli_problems
from assigntools.LoLa.sen_analysis import spacy_process_sen2tok, display_doc_dep

In [35]:
# this is Lasha's code for loading SNLI (labels are cleaned by default)
SNLI, S2A = snli_jsonl2dict('snli_1.0')
print(f"Length of the clean SNLI dataset: {len(SNLI['train'])}")

Found .json files for ['dev', 'test', 'train'] parts
processing DEV:	

10000it [00:00, 22340.11it/s]


9842 problems read
0 problems have a wrong annotator label
processing TEST:	

10000it [00:00, 20434.60it/s]


9824 problems read
0 problems have a wrong annotator label
processing TRAIN:	

550152it [00:26, 20683.46it/s]


549169 problems read
198 problems have a wrong annotator label
Most common weird labels: //(198)
Length of the clean SNLI dataset: 549169


In [36]:
# this is Lasha's code for loading SNLI without cleaning the labels
SNLI_with_wrong, S2A = snli_jsonl2dict('snli_1.0', clean_labels=False)
print(f"Length of the SNLI dataset with the wrong labels: {len(SNLI_with_wrong['train'])}")

Found .json files for ['dev', 'test', 'train'] parts
processing DEV:	

10000it [00:00, 19459.90it/s]


9842 problems read
0 problems have a wrong annotator label
processing TEST:	

10000it [00:04, 2444.19it/s]


9824 problems read
0 problems have a wrong annotator label
processing TRAIN:	

550152it [00:26, 20979.15it/s]

549367 problems read
198 problems have a wrong annotator label
Most common weird labels: //(198)
Length of the SNLI dataset with the wrong labels: 549367





To train the model on the right sentences, the IDs in the cartography data have to be matched to the instances in the original SNLI dataset, so the instance counts need to line up first.
- The cartography dataset has *549367* instances
- The original SNLI training set downloaded from Stanford or Hugging Face has *550152*
- Lasha's code yields *549169* after cleaning, which drops the 198 instances with a wrong annotator label

Since 549367 - 198 = 549169, this suggests that deleting the wrong-label instances from the cartography dataset would leave exactly the instances in Lasha's processed dataset, so the IDs of the two datasets could then be matched up.

Task:
- Check whether the IDs of the two datasets (SNLI and cartography) match up. The code the cartography authors use to assign the IDs to their dataset can be reused for this.
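
One way to run this check is to compare the two sets of numeric IDs directly. The sketch below uses small made-up ID lists, and `compare_guid_sets` is a hypothetical helper, not part of assigntools or the cartography code. In this notebook the inputs would be the converted SNLI keys and `cartography_data['guid']`.

```python
def compare_guid_sets(snli_guids, carto_guids):
    """Summarise how two collections of numeric IDs overlap."""
    snli_list, carto_list = list(snli_guids), list(carto_guids)
    snli_set, carto_set = set(snli_list), set(carto_list)
    return {
        "shared": len(snli_set & carto_set),
        "only_in_snli": sorted(snli_set - carto_set),
        "only_in_cartography": sorted(carto_set - snli_set),
        # duplicated IDs would rule out a one-to-one match
        "snli_has_duplicates": len(snli_set) != len(snli_list),
        "cartography_has_duplicates": len(carto_set) != len(carto_list),
    }


# Toy IDs, made up for illustration only
toy_snli = [101, 102, 103]
toy_carto = [101, 102, 103, 104]

report = compare_guid_sets(toy_snli, toy_carto)
print(report)
```

If the hypothesis above holds, the cartography-only IDs should be exactly the 198 wrong-label instances, and both duplicate flags should be `False`.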

In [43]:
# keys present in the uncleaned dataset but missing from the cleaned one
wrong_keys = list(set(SNLI_with_wrong['train']) - set(SNLI['train']))
print(len(wrong_keys))

198


Now I need to delete the wrong-label IDs from the cartography dataset.\
First, see how to link the IDs from the SNLI dataset to those of the cartography data.

In [55]:
from data_utils_glue import convert_string_to_unique_number

# convert the first wrong key to its numeric cartography guid
number_ID = convert_string_to_unique_number(wrong_keys[0])
print(number_ID)  # 5696581092010

In [72]:
# the guid values for the data instances that have a wrong annotator label
wrong_indeces = [convert_string_to_unique_number(key) for key in wrong_keys]
print(wrong_indeces)

# print(list(cartography_data['guid']).index(number_ID))

[5696581092010, 375392855412, 371769150011, 4850614476111, 562928217311, 436608339411, 3228960484012, 200278000411, 246901891011, 3309042087411, 299193125211, 5716554915110, 4815080638312, 4045361947410, 300577374112, 2166946777412, 4482946929010, 888244173410, 4313861353310, 2862004252410, 4123816289310, 407278057211, 4792134256012, 4070658400011, 7616312438011, 2966552760211, 4622296311110, 1932314876412, 3532539748312, 2324374253012, 2915792034011, 5867606212, 4704939941012, 1752454466212, 449764037011, 2279380309211, 3670205710011, 418616992310, 4450153946210, 735787579112, 250699226210, 3423249426011, 470903027410, 434932657111, 4701385015011, 3650485497112, 3945002600010, 421762501012, 4418471031111, 4357061908311, 2894893895012, 2537692668112, 5903528077412, 4546867536111, 3758175529011, 6948564341412, 7803420092212, 158898445210, 4662376933112, 3247385464011, 7237669608111, 4587222385110, 5532294954011, 5117916560111, 4684510937312, 5609573810011, 4625272395210, 4552688825412, 

In [73]:
# dropping the rows whose guid belongs to a wrong-label instance
cartography_data = cartography_data[~cartography_data['guid'].isin(wrong_indeces)]

print(len(cartography_data))

549169


To connect each data instance in the cartography dataset to the right instance in the SNLI dataset, I'm adding the cartography guid to every SNLI problem, which makes the lookup easier.

In [75]:
display(SNLI['train'][list(SNLI['train'].keys())[0]])

{'g': 'neutral',
 'pid': '3416050480.jpg#4r1n',
 'cid': '3416050480.jpg#4',
 'lnum': 1,
 'lcnt': Counter({'neutral': 1}),
 'ltype': '010',
 'p': 'A person on a horse jumps over a broken down airplane.',
 'h': 'A person is training his horse for a competition.'}

In [76]:
for key in SNLI['train'].keys():
    SNLI['train'][key]['guid'] = convert_string_to_unique_number(key)

display(SNLI['train'][list(SNLI['train'].keys())[0]])

{'g': 'neutral',
 'pid': '3416050480.jpg#4r1n',
 'cid': '3416050480.jpg#4',
 'lnum': 1,
 'lcnt': Counter({'neutral': 1}),
 'ltype': '010',
 'p': 'A person on a horse jumps over a broken down airplane.',
 'h': 'A person is training his horse for a competition.',
 'guid': 3416050480412}
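
With `guid` on both sides, the lookup can also be done as a table join. This is a minimal sketch with made-up rows, assuming the SNLI problems are first flattened into a DataFrame. The `pid_key` column name and all toy values are illustrative, not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for SNLI['train'] and cartography_data (values made up)
toy_snli = {
    "a.jpg#0r1n": {"g": "neutral", "p": "premise A", "h": "hypothesis A", "guid": 11},
    "b.jpg#1r1e": {"g": "entailment", "p": "premise B", "h": "hypothesis B", "guid": 22},
}
toy_carto = pd.DataFrame(
    {
        "guid": [22, 11],
        "confidence": [0.9, 0.4],
        "variability": [0.1, 0.3],
        "correctness": [6, 2],
    }
)

# One row per problem; the original dictionary key becomes the 'pid_key' column
snli_df = (
    pd.DataFrame.from_dict(toy_snli, orient="index")
    .rename_axis("pid_key")
    .reset_index()
)

# validate="one_to_one" raises a MergeError if a guid is duplicated on either side
merged = snli_df.merge(toy_carto, on="guid", how="inner", validate="one_to_one")
print(merged[["pid_key", "g", "confidence", "variability"]])
```

An inner join silently drops unmatched rows, so comparing `len(merged)` with the lengths of both inputs is a cheap check that every instance found its partner.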

## Making the curricula
I want to build the different groups (hard-to-learn, easy-to-learn, ambiguous) from the values in the cartography dataset, then try the same training sets as used in the paper (just hard-to-learn and just ambiguous) and, if time allows, also a mix. This will let us answer the question: can we find the same patterns as the paper in our 2k curricula?

Tasks:
- Look into which values the paper used to distinguish the three groups. The paper doesn't mention values --> look at the code
- Decide how to take 2k datapoints from those bins: take them randomly, or make smaller bins inside each group to select a good sample of that particular group?
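
Both tasks could be prototyped along these lines. This is a sketch under assumptions, not the paper's procedure: it assumes the groups are defined by ranking (lowest `confidence` for hard-to-learn, highest `confidence` for easy-to-learn, highest `variability` for ambiguous) rather than by fixed thresholds, which still has to be checked against the paper's code. `REGION_SORT`, `select_region`, `equal_mix` and `pool_fraction` are hypothetical names, and the data is random.

```python
import numpy as np
import pandas as pd

# (column, ascending) used to rank each group: an ASSUMPTION to verify against the paper's code
REGION_SORT = {
    "hard-to-learn": ("confidence", True),
    "easy-to-learn": ("confidence", False),
    "ambiguous": ("variability", False),
}


def select_region(df, region, n, pool_fraction=None, seed=0):
    """Pick n rows for one region.

    pool_fraction=None takes the n most extreme rows; otherwise n rows are
    sampled at random from the top pool_fraction of the ranking.
    """
    column, ascending = REGION_SORT[region]
    ranked = df.sort_values(column, ascending=ascending)
    if pool_fraction is None:
        return ranked.head(n)
    pool = ranked.head(max(n, int(len(ranked) * pool_fraction)))
    return pool.sample(n=n, random_state=seed)


def equal_mix(df, n):
    """Take n // 3 rows from each region, skipping rows already chosen."""
    parts, used = [], set()
    for region in REGION_SORT:
        remaining = df[~df["guid"].isin(used)]
        part = select_region(remaining, region, n // 3)
        used.update(part["guid"])
        parts.append(part.assign(region=region))
    return pd.concat(parts)


# Random stand-in for cartography_data
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "guid": np.arange(30_000),
        "confidence": rng.random(30_000),
        "variability": rng.random(30_000) * 0.5,
    }
)

hard_2k = select_region(toy, "hard-to-learn", 2000)
ambiguous_2k = select_region(toy, "ambiguous", 2000, pool_fraction=0.33)
mix = equal_mix(toy, 2000)
print(len(hard_2k), len(ambiguous_2k), len(mix))
```

Because 2000 is not divisible by 3, `equal_mix` returns 666 rows per group (1998 in total). Setting `pool_fraction` covers the "take them randomly" option, while leaving it as `None` keeps only the most extreme instances of a group.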
