# Test Set ~~Shenanigans~~ Labeling

The Kaggle competition that serves as the basis for our project provides only an unlabeled test set (which makes a lot of sense for a competition).
For us, however, a labeled test set allows for a more detailed analysis of the results (e.g., by plotting a confusion matrix).

Luckily, some (hours of) digging around the internet brought up a [dataset](https://github.com/shrnik/Disater_Pred/blob/master/socialmedia-disaster-tweets-DFE.csv) which seems to be the origin of the data used for this competition.
This notebook takes the unlabeled competition test set and adds to each entry the label found for the corresponding entry in the original dataset.

In [59]:
import pandas as pd
import os

LABEL_TRUE = 'Relevant'
LABEL_FALSE = 'Not Relevant'

DIR_DATA = os.path.join("..", "data")
PATH_DATA_TEST = os.path.join(DIR_DATA, "test.csv")
PATH_DATA_FULL = os.path.join(DIR_DATA, "full.csv")

## Import

Interestingly enough, the full dataset contains characters that are not UTF-8 encoded.
For those to be interpreted the same way as in the Kaggle test set, the file has to be read with the `latin-1` encoding (spelled `latin_1` in the code below, which is an equivalent alias).
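
To illustrate why the encoding matters, here is a minimal sketch that decodes a short byte string both ways. The bytes are an assumption: they are chosen so that the Latin-1 result matches a fragment that shows up in the merged data further down, not copied from the actual file. Latin-1 maps every byte to exactly one character, so decoding with it never fails, whereas the same bytes are not valid UTF-8.

```python
# Assumption: the raw file stores these characters as single Latin-1 bytes.
raw = b"\xdb\xf7first\xdb\xaa wife"

# Latin-1 maps each byte to one code point, so this cannot raise.
decoded = raw.decode("latin-1")
print(decoded)

# The same bytes are not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(f"valid UTF-8: {utf8_ok}")
```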

In [60]:
def print_info(name, df: pd.DataFrame):
    print(f'Dataframe \'{name}\' has:')
    print(f'\t{len(df)} rows')
    print(f'\twith columns: {df.columns.to_list()}')
    print()

df_test = pd.read_csv(PATH_DATA_TEST)
df_full = pd.read_csv(PATH_DATA_FULL, encoding='latin_1')

len_test = len(df_test)

print_info("Kaggle Test Set", df_test)
print_info("Full Data Set", df_full)

Dataframe 'Kaggle Test Set' has:
	3263 rows
	with columns: ['id', 'keyword', 'location', 'text']

Dataframe 'Full Data Set' has:
	10876 rows
	with columns: ['_unit_id', '_golden', '_unit_state', '_trusted_judgments', '_last_judgment_at', 'choose_one', 'choose_one:confidence', 'choose_one_gold', 'keyword', 'location', 'text', 'tweetid', 'userid']



## Merging

For each entry in the test set, find the corresponding entry (the one with the same text) in the full dataset.
Some texts appear multiple times in the full dataset, so the left merge returns several rows for the affected test entries. For the `df_merged_uniques` dataframe, these are reduced to one row per `id` (the first match is kept).
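
The effect is easy to reproduce on toy data. The following sketch uses two invented frames (`toy_test` and `toy_full` are hypothetical names, and their contents are made up) to show how a text that occurs twice on the right side of a left merge duplicates the matching test row, and how `drop_duplicates(subset='id')` brings the row count back to that of the test set.

```python
import pandas as pd

# Invented stand-ins for the Kaggle test set and the full dataset.
toy_test = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})
toy_full = pd.DataFrame({
    "text": ["a", "a", "b"],  # text "a" occurs twice
    "choose_one": ["Relevant", "Not Relevant", "Relevant"],
})

# Left merge: every match in toy_full yields one output row.
toy_merged = pd.merge(toy_test, toy_full, on="text", how="left")

# Keep a single row per test id.
toy_uniques = toy_merged.drop_duplicates(subset="id")

print(len(toy_merged), len(toy_uniques))
```

Note that this keeps whichever label comes first for a duplicated text, which is why conflicting labels have to be checked separately.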

In [61]:
df_merged = pd.merge(df_test, df_full[['text', 'choose_one']], on='text', how='left')
df_merged_uniques = df_merged.drop_duplicates(subset='id')

print(len(df_merged_uniques))

print(f'Dropped {len(df_merged) - len(df_merged_uniques)} entries with duplicated ids')

3263
Dropped 214 entries with duplicated ids


## Adjustments

For some duplicated texts, the labels disagree (the same text is classified both as disaster and as no-disaster).
For those instances, manual adjustments are needed to find the labels that Kaggle expects (and that should therefore be used as the labels of the test set).

### Show Instances with Uncertain Target

In [62]:
df_merged_dups = df_merged[df_merged.duplicated(subset='text', keep=False)]
df_merged_dups_diff = df_merged_dups.groupby('text').filter(lambda x: x['choose_one'].nunique() > 1)
df_merged_dups_diff_uniques = df_merged_dups_diff.drop_duplicates(subset=['text', 'id'])
print(f'Found {len(df_merged_dups_diff_uniques)} instances with uncertain targets.')

display(df_merged_dups_diff_uniques)

Found 22 instances with uncertain targets.


| | id | keyword | location | text | choose_one |
|---|---|---|---|---|---|
| 296 | 922 | bioterrorism | | To fight bioterrorism sir. | Relevant |
| 302 | 924 | bioterrorism | | To fight bioterrorism sir. | Relevant |
| 622 | 1931 | burning%20buildings | Dublin City, Ireland | @RockBottomRadFM Is one of the challenges on T... | Can't Decide |
| 631 | 1964 | burning%20buildings | San Francisco | ? High Skies - Burning Buildings ? http://t.co... | Not Relevant |
| 970 | 3094 | deaths | | Bigamist and his Û÷firstÛª wife are charged ... | Not Relevant |
| 1062 | 3374 | demolition | | General News Û¢åÊ'Demolition of houses on wat... | Not Relevant |
| 1292 | 4053 | displaced | Pedophile hunting ground | .POTUS #StrategicPatience is a strategy for #G... | Relevant |
| 1298 | 4056 | displaced | Pedophile hunting ground | .POTUS #StrategicPatience is a strategy for #G... | Relevant |
| 1401 | 4371 | earthquake | in the Word of God | @GreenLacey GodsLove &amp; #thankU my sister f... | Relevant |
| 1463 | 4572 | emergency%20plan | | Do you have an emergency drinking water plan? ... | Relevant |


### Adjust Targets

This part is pretty much manual labor:
- look at the entries with uncertain targets
- think about which target they should have
- modify their target value
- upload the prediction to Kaggle and see whether the score improves

A single account is allowed to submit 5 predictions per day on Kaggle, so a multitude of accounts had to be used here.
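
One iteration of the steps above can be sketched as a small helper that builds a submission frame with a single label overridden, so that each candidate flip can be written to its own file and scored. This is only an illustration on invented data: `submission_variant` is a hypothetical helper, not part of this notebook, and the upload itself is left out.

```python
import pandas as pd

LABEL_TRUE = "Relevant"

def submission_variant(df, tweet_id, label):
    """Return an id/target frame with the label of one entry overridden.

    Hypothetical helper for illustration; the input frame is not modified.
    """
    variant = df.copy()
    variant.loc[variant["id"] == tweet_id, "choose_one"] = label
    out = variant[["id"]].copy()
    out["target"] = (variant["choose_one"] == LABEL_TRUE).astype(int)
    return out

# Invented example: two entries that share a text and start with the same label.
labels = pd.DataFrame({"id": [922, 924], "choose_one": ["Relevant", "Relevant"]})
variant = submission_variant(labels, 924, "Not Relevant")
print(variant)
```

Each variant could then be saved with `to_csv(..., index=False)` and compared against the score of the unmodified submission.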

In [63]:
adjustments = 0

def change_pred(tweet_id, pred):
    """Set the label of the test entry with the given id, counting actual changes."""
    global df_merged_uniques_adjusted
    global adjustments

    # Only touch (and count) the entry if its current label differs from pred
    mask = df_merged_uniques_adjusted['id'] == tweet_id
    if (df_merged_uniques_adjusted.loc[mask, 'choose_one'] != pred).any():
        adjustments += 1
        print(f'Changing prediction for id {tweet_id} to {pred}')
        df_merged_uniques_adjusted.loc[mask, 'choose_one'] = pred

df_merged_uniques_adjusted = df_merged_uniques.copy()

print(len(df_merged_uniques_adjusted))

change_pred(922, LABEL_TRUE)
change_pred(924, LABEL_FALSE)

change_pred(1931, LABEL_FALSE)  # No Change
change_pred(1964, LABEL_FALSE)  # No Change

change_pred(3094, LABEL_FALSE)  # apparently not true.. 

change_pred(4053, LABEL_TRUE) 
change_pred(4056, LABEL_TRUE)

change_pred(4371, LABEL_TRUE)   # apparently true...
change_pred(4572, LABEL_TRUE)  

change_pred(4837, LABEL_TRUE)   # :(
change_pred(4930, LABEL_TRUE)   # :(
change_pred(4949, LABEL_FALSE)  # yeeee

change_pred(5679, LABEL_TRUE)   # :(

change_pred(8011, LABEL_TRUE)  
change_pred(10232, LABEL_FALSE) # wooo this is actually an issue

print(f'Adjusted {adjustments} predictions')

3263
Changing prediction for id 924 to Not Relevant
Changing prediction for id 1931 to Not Relevant
Changing prediction for id 4949 to Not Relevant
Changing prediction for id 10232 to Not Relevant
Adjusted 4 predictions


## Export(s)

### Labeled Test Set
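
Before exporting, it may be worth checking how many rows carry a label other than the two expected ones, because the mapping in the cell below turns everything that is not `Relevant` into 0. A minimal sketch on toy data (the frame `toy` and its values are invented for illustration):

```python
import pandas as pd

LABEL_TRUE = "Relevant"
LABEL_FALSE = "Not Relevant"

# Toy stand-in for the merged frame; ids and labels are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "choose_one": ["Relevant", "Not Relevant", "Can't Decide", None],
})

# Rows whose label is neither of the two expected values (includes missing labels).
unexpected = toy[~toy["choose_one"].isin([LABEL_TRUE, LABEL_FALSE])]
print(f"{len(unexpected)} rows would silently become target 0")

# Same mapping as in the export: only 'Relevant' becomes 1.
target = (toy["choose_one"] == LABEL_TRUE).astype(int)
```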

In [64]:
df_test_labeled = df_merged_uniques_adjusted.copy()
df_test_labeled = df_test_labeled.rename(columns={'choose_one': 'target'})
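# Everything that is not 'Relevant' (including "Can't Decide" and missing labels) becomes 0
df_test_labeled['target'] = df_test_labeled['target'].apply(lambda x: 1 if x == LABEL_TRUE else 0)

assert len(df_test_labeled) == len_test

df_test_labeled.to_csv(os.path.join(DIR_DATA, "test-labeled.csv"), index=False)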

### Kaggle Submission

In [65]:
df_submission = df_merged_uniques_adjusted[['id']].copy()
df_submission['target'] = df_merged_uniques_adjusted['choose_one'].apply(lambda x: 1 if x == LABEL_TRUE else 0)

assert len(df_submission) == len_test

df_submission.to_csv(os.path.join(DIR_DATA, "submission2.csv"), index=False)