In [1]:
import pandas as pd
import numpy as np
import json
import glob
import ast

# Load the training data

- Tweets
    - A `Ground Truth Set` of tweets. 
- Labels
- Information Types

TREC-IS provides multiple Twitter datasets collected from a range of past wildfire, earthquake, flood, typhoon/hurricane, bombing and shooting events. Human annotators have manually labelled each tweet with one or more of 25 information types, based on the information the tweet contains, such as 'contains location' or 'search and rescue request'.

Each tweet is also assigned a **priority label** that indicates how critical the information within that tweet is for a response officer to see.



### Labels
TRECIS-2018-2020A-labels.json

```json
[
  {
    "eventID": "joplinTornado2011",
    "eventName": "2011 Joplin Tornado",
    "eventDescription": "The 2011 Joplin tornado was a catastrophic EF5-rated multiple-vortex tornado that struck Joplin, Missouri, late in the afternoon of Sunday, May 22, 2011. The user is a response officer in the Missouri command and control center responsible for impact to the state. <a href='https://en.wikipedia.org/wiki/2011_Joplin_tornado' target='_blank'>Wikipedia Page<a>",
    "eventType": "Unknown",
    "postID": "72676276212731904",
    "postCategories": [
      "Factoid",
      "Hashtags",
      "News"
    ],
    "postPriority": "Low"
  },
 ```
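The `postID` values in this file are 18-digit numbers stored as strings. Below is a minimal sketch, assuming a record shaped like the excerpt above, of parsing with the standard `json` module and converting `postID` to `int64` explicitly, so the conversion never passes through a float. The inline `sample` record and the name `sample_labels_df` are illustrative and are not read from the dataset.

```python
import json

import pandas as pd

# Illustrative record shaped like the excerpt above (not read from the dataset)
sample = json.dumps([{
    "eventID": "joplinTornado2011",
    "postID": "72676276212731904",
    "postCategories": ["Factoid", "Hashtags", "News"],
    "postPriority": "Low",
}])

# json.loads keeps postID as a Python string
records = json.loads(sample)
sample_labels_df = pd.DataFrame(records)

# Convert explicitly: str -> int is exact for 18-digit IDs
sample_labels_df["postID"] = sample_labels_df["postID"].map(int).astype("int64")

print(sample_labels_df["postID"].dtype, sample_labels_df.loc[0, "postID"])
```

Keeping the key column as an integer on both sides matters later, because `tweet_id` is built with `np.int64` and the two columns are merged on equality.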

In [2]:
labels_df = pd.read_json("../../../data/raw/data/2020/2020-A/labels/TRECIS-2018-2020A-labels.json")

### Topics

For each incident, we have a stream of related tweets, collected using hashtags, keyword, user, and geolocation monitoring. Each incident/event is accompanied by a brief "topic statement" in the TREC style:


TRECIS-2018-2020A.topics
```xml
<top>
<num>TRECIS-CTIT-H-Training-001</num>
<dataset>fireColorado2012</dataset>
<title>2012 Colorado wildfires</title>
<type>wildfire</type>
<url>https://en.wikipedia.org/wiki/2012_Colorado_wildfires</url>
<narr> The Colorado wildfires were an unusually devastating series of fires
in the US state of Colorado, which occurred throughout June, July, and
August 2012.
</narr>
</top>
```
- Load the event topics from `TRECIS-2020-A.topics`
- Map each dataset name to its topic number in the `event_ids` dictionary

```python
{'athensEarthquake2019': 'TRECIS-CTIT-H-Test-035',
 'baltimoreFloods2019': 'TRECIS-CTIT-H-Test-036',
```

In [3]:
event_ids = {}

with open("../../../data/raw/data/2020/2020-A/topics/TRECIS-2020-A.topics", "r") as in_file:
    topic_num = ""
    topic_id = ""
    
    for line in in_file:
        
        if line.strip() == "</top>":
            event_ids[topic_id] = topic_num
        
        if line.startswith("<num>"):
            topic_num = line.partition(">")[-1].partition("<")[0]
              
        if line.startswith("<dataset>"):
            topic_id = line.partition(">")[-1].partition("<")[0]

#event_ids
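
To check the parsing logic without the data file, the same line-by-line approach can be run on an inline sample. This is a sketch only: `parse_topics` and `sample_topics` are hypothetical names, and the sample contains just the fields from the topic excerpt shown earlier.

```python
import io

# Inline sample in the TREC topic format shown earlier
sample_topics = """<top>
<num>TRECIS-CTIT-H-Training-001</num>
<dataset>fireColorado2012</dataset>
<title>2012 Colorado wildfires</title>
</top>
"""

def parse_topics(lines):
    """Map each <dataset> name to the <num> found in the same <top> block."""
    mapping = {}
    topic_num = topic_id = ""
    for line in lines:
        line = line.strip()
        if line.startswith("<num>"):
            topic_num = line.partition(">")[-1].partition("<")[0].strip()
        elif line.startswith("<dataset>"):
            topic_id = line.partition(">")[-1].partition("<")[0].strip()
        elif line == "</top>":
            # End of a topic block: record the pair collected so far
            mapping[topic_id] = topic_num
    return mapping

sample_event_ids = parse_topics(io.StringIO(sample_topics))
print(sample_event_ids)
```

Stripping each line before testing it makes the parser tolerant of indented tags, which the original `startswith` checks are not.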


### Information Types (Categories)
TRECIS-2020-ITypes-Task1.json                                

This loads the ontology file containing the `Information Types` into a pandas DataFrame.

```json
"identifier": "TRECIS-ITR-H-Types-v4-T1",
	"description": "TREC-IS Incident Tweet Routing (High-Level) Information Types. Version four, Task 1, created for the 2020 edition of the track.",
	"informationTypes": [
		{
			"id": "Request-GoodsServices",
			"desc": "The user is asking for a particular service or physical good.",
			"level": "High-level",
			"intentType": "Request",
			"exampleLowLevelTypes": [ 
				"PsychiatricNeed", 
				"Equipment", 
				"ShelterNeeded", 
				"Vehicles"
			]
		},
```




In [4]:
df = pd.read_json("../../../data/raw/data/2020/2020-A/types/TRECIS-2020-ITypes-Task1.json", orient='columns')
df_split = df.join(pd.DataFrame(df.pop('informationTypes').tolist()))
df_split.head()

Unnamed: 0,identifier,description,id,desc,level,intentType,exampleLowLevelTypes
0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Request-GoodsServices,The user is asking for a particular service or...,High-level,Request,"[PsychiatricNeed, Equipment, ShelterNeeded, Ve..."
1,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Request-SearchAndRescue,The user is requesting a rescue (for themselve...,High-level,Request,"[SelfRescue, OtherRescue]"
2,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Request-InformationWanted,The user is requesting information,High-level,Request,"[PersonsNews, MissingPersons, EventStatus]"
3,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,CallToAction-Volunteer,The user is asking people to volunteer to help...,High-level,CallToAction,[RegisterNow]
4,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,CallToAction-Donations,The user is asking people to donate goods/money,High-level,CallToAction,"[DonateMoney, DonateGoods, PromoteFundRaising]"
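
The same flattening can also be sketched with `pd.json_normalize`, which expands a nested list of records and repeats the top-level fields alongside each one. The `sample_ontology` dictionary below is a cut-down stand-in for the ontology file, containing only keys that appear in the excerpt above.

```python
import pandas as pd

# Cut-down stand-in for the ontology file (illustrative, not the full file)
sample_ontology = {
    "identifier": "TRECIS-ITR-H-Types-v4-T1",
    "informationTypes": [
        {"id": "Request-GoodsServices", "level": "High-level", "intentType": "Request"},
        {"id": "Request-SearchAndRescue", "level": "High-level", "intentType": "Request"},
    ],
}

# One row per information type; the top-level identifier is repeated on each row
types_df = pd.json_normalize(
    sample_ontology,
    record_path="informationTypes",
    meta=["identifier"],
)

print(types_df[["id", "intentType", "identifier"]])
```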


## Annotated Data

Read the responses from each assessor file and append them to a list of assessor data.

In [5]:
annotators_data = []

def read_annotations(in_f_path):
    l_annotators_data = []
    
    with open(in_f_path, "rb") as in_file:
        annotator_content = in_file.read().decode("latin-1")
        trecis_training = json.loads(annotator_content)
        l_annotators_data.append(trecis_training)
        
    return l_annotators_data

# Read 2018 training file
in_f_path = "../../../data/raw/data/2018/training/TRECIS-CTIT-H-Training.json"
annotators_data.extend(read_annotations(in_f_path))
        
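# Read each assessor file
# NOTE: the two glob patterns below start with "../../", while every other path in
# this notebook starts with "../../../". iglob yields nothing, without raising, when
# a pattern matches no files, which is consistent with "Annotations: 1" below.
for in_f_path in glob.iglob("../../data/raw/data/2020/2020-A/labels/TRECIS-*.json"):
    annotators_data.extend(read_annotations(in_f_path))

# 2019 Labels
for in_f_path in glob.iglob("../../data/raw/data/2019A/2019ALabels/*assr*.json"):
    annotators_data.extend(read_annotations(in_f_path))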

    
print("Annotations:", len(annotators_data))
#print(annotators_data)

Annotations: 1
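
A count of 1 means that only the unconditional 2018 read contributed a file. Because a glob pattern that matches nothing is silent, it can help to count matches per pattern before loading. The sketch below uses a hypothetical helper, `count_matches`, and a temporary directory so that it runs without the dataset.

```python
import glob
import os
import tempfile

def count_matches(patterns):
    """Return {pattern: number of files the pattern matches}."""
    return {pattern: len(glob.glob(pattern)) for pattern in patterns}

# Demo in a throwaway directory so the sketch runs without the dataset
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "TRECIS-demo.json"), "w").close()
    demo_counts = count_matches([
        os.path.join(tmp, "TRECIS-*.json"),  # matches the file created above
        os.path.join(tmp, "*assr*.json"),    # matches nothing, and raises nothing
    ])

for pattern, n_files in demo_counts.items():
    print(n_files, "file(s):", pattern)
```

Running the helper on the real patterns would show immediately whether the relative path prefix resolves from the notebook's working directory.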


## Map the annotated tweets to our Dataset

Outputs:

**tweet_category_map**

```python
{'Advice': [243413475681001473,
  213689735082807296, ...
  ```
**tweet_id_to_priority**

```python
[{'tweet_id': 243377845072715777, 'priority': 'Low'},
 {'tweet_id': 243367720022855680, 'priority': 'Low'},
 {'tweet_id': 245776676586389504, 'priority': 'Low'},
```

**tweet_id_to_category**
```python
{243413475681001473: 1,
 213689735082807296: 1,
 212221690904723458: 1,
```


> This is where I'm currently stuck: I'm not finding matches between `postID` and `tweet_id`.

In [6]:
tweet_to_category = []
tweet_id_to_priority = []


# category_df
for annotator in annotators_data:
    local_events = annotator["events"]
    #print(local_events)
    for event in local_events:
        #print(event)
        for tweet in event["tweets"]:
            #print(tweet)
            
            # Pull out categories from the tweet dictionary
            for category in tweet["categories"]:
                #print(category)
                tweet_to_category.append({
                    "tweet_id": np.int64(tweet["postID"]),
                    "category": category
                })
                
            # Pull out priority, of which there should be only one
            tweet_id_to_priority.append({
                "tweet_id": np.int64(tweet["postID"]),
                "priority": tweet["priority"]
            })

print("Tweet to Category Map:", len(tweet_to_category))
print("Tweet ID to Priority Map:", len(tweet_id_to_priority))

category_df = pd.DataFrame(tweet_to_category)

print("Tweets with Category:", category_df["tweet_id"].nunique())

# Export to CSV
category_df.to_csv("tweet_to_category.csv", index=False)

# categoryMap()
# Maps a list of tweetIDs associated with categories

tweet_category_map = {}
category_df = pd.read_csv("tweet_to_category.csv")

cat_update_map = {
    "ContinuingNews": "News",
    "PastNews": "ContextualInformation",
    "KnownAlready": "OriginalEvent",
    "SignificantEventChange": "NewSubEvent",
}

category_df["category"] = category_df["category"].apply(lambda x: cat_update_map.get(x, x))

for category, tweets in category_df.groupby("category"):
    tweet_category_map[category] = list(tweets["tweet_id"])
    
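# "Unknown" was deleted in 2019; pop() avoids a KeyError if the key is absent
tweet_category_map.pop("Unknown", None)

# Assign each category a 1-based integer label (alphabetical, because groupby sorts its keys)
category_to_label = {c: i + 1 for i, c in enumerate(tweet_category_map.keys())}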

tweet_id_to_category = {}

for category, tweet_ids in tweet_category_map.items():
    for tweet_id in tweet_ids:
        # A tweet listed under several categories keeps only the last label assigned here
        tweet_id_to_category[np.int64(tweet_id)] = category_to_label[category]
        
#tweet_id_to_category
#tweet_category_map
#tweet_id_to_priority

Tweet to Category Map: 1335
Tweet ID to Priority Map: 1335
Tweets with Category: 1335


### Map priority labels to numerical values
> We can then use this to calculate the error against our run

In [7]:
priority_df = pd.DataFrame(tweet_id_to_priority)

priority_mapping = {
    "Critical" : 1,
    "High" : 0.75,
    "Medium" : 0.5,
    "Low" : 0.25,
    "Unknown" : 0,
}


# Average the numeric priority scores of each tweet across its annotations
temp_merged_priorities = []

for tweet_id, group in priority_df.groupby("tweet_id"):
    priority_list = list(group["priority"])
    p_scores = [priority_mapping[p] for p in priority_list]
    temp_merged_priorities.append({
        "tweet_id": tweet_id,
        "priority": np.mean(p_scores),
    })
    

# We then place these results in a dataframe to make them easier to work with 
priority_df = pd.DataFrame(temp_merged_priorities)

# zip keeps tweet_id as an integer; iterrows() upcasts each mixed row to float64
priority_map = dict(zip(priority_df["tweet_id"], priority_df["priority"]))
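
As a worked example of the averaging, a tweet labelled `High` by one annotator and `Low` by another scores (0.75 + 0.25) / 2 = 0.5. The sketch below computes the same per-tweet mean with `map` and `groupby` on a toy frame; the tweet IDs and the `demo_` names are illustrative.

```python
import pandas as pd

demo_priority_mapping = {
    "Critical": 1, "High": 0.75, "Medium": 0.5, "Low": 0.25, "Unknown": 0,
}

# Toy annotations: tweet 101 was labelled twice, tweet 102 once
demo_priority_df = pd.DataFrame({
    "tweet_id": [101, 101, 102],
    "priority": ["High", "Low", "Critical"],
})

# Map labels to scores, then average the scores per tweet
demo_scores = (
    demo_priority_df["priority"].map(demo_priority_mapping)
    .groupby(demo_priority_df["tweet_id"]).mean()
)

print(demo_scores.to_dict())
```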

In [8]:
# Build a DataFrame with one row per categorised tweet
eid = pd.DataFrame(list(tweet_id_to_category.items()), columns=['tweet_id', 'label_id'])
eid

Unnamed: 0,tweet_id,label_id
0,243413475681001473,1
1,213689735082807296,1
2,212221690904723458,1
3,217393586537377792,1
4,378071202931032064,1
...,...,...
1304,275577667875659776,22
1305,275540019781963777,22
1306,275720542621941760,22
1307,275877501895573504,22


In [9]:
tweet_to_category_priority_df = priority_df.join(category_df.set_index('tweet_id'), on='tweet_id')
tweet_to_category_id_priority_df = tweet_to_category_priority_df.join(eid.set_index('tweet_id'), on='tweet_id')

tweet_to_category_id_priority_df

Unnamed: 0,tweet_id,priority,category,label_id
0,211281973870727170,0.25,Irrelevant,11.0
1,211557401231495171,0.50,FirstPartyObservation,8.0
2,211565974422425600,0.75,ServiceAvailable,19.0
3,211607187653533697,0.50,Weather,22.0
4,211654415503990784,0.50,News,15.0
...,...,...,...,...
1330,396336012726525952,0.25,News,15.0
1331,396336079856345088,0.25,News,15.0
1332,396336243442589696,0.25,News,15.0
1333,396336297968562176,0.25,Factoid,7.0


In [10]:
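# NOTE: this join matches label_id against df_split's positional row index, not against
# the ontology `id`, so the ontology columns do not describe the tweet's category
# (in the merged output further down, `News` is paired with `Report-Official`).
df = tweet_to_category_id_priority_df.join(df_split, on="label_id")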
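
If the ontology columns are meant to describe each tweet's category, one option is to join on the category name rather than on a row position. The sketch below assumes that a category such as `News` corresponds to the ontology id ending in `-News`; that correspondence is an assumption suggested by ids like `Report-News` and would need checking against the full ontology. All `demo_` names are hypothetical.

```python
import pandas as pd

# Ontology ids taken from the outputs in this notebook
demo_types = pd.DataFrame({"id": ["Report-News", "Report-Official"]})

demo_tweets = pd.DataFrame({
    "tweet_id": [1, 2],
    "category": ["News", "Official"],
})

# Hypothetical helper column: the part of the ontology id after the last "-"
demo_types["category"] = demo_types["id"].str.rsplit("-", n=1).str[-1]

# Left merge keeps every tweet; categories with no ontology match get NaN in `id`
demo_joined = demo_tweets.merge(demo_types, on="category", how="left")

print(demo_joined[["tweet_id", "category", "id"]])
```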


In [11]:
print("Labels:", sum([len(v) for v in tweet_category_map.values()]))

Labels: 1309


## ID -> EventID

Returns a map of events with all identified tweet IDs

```
{'albertaFloods2013': [347686624563429376,
  347766337344503808,
  347783236191129600,
  347793432514801664,
```

In [12]:
merged_df = pd.merge(df, labels_df, left_on = 'tweet_id', right_on = 'postID', how = 'inner')
merged_df

Unnamed: 0,tweet_id,priority,category,label_id,identifier,description,id,desc,level,intentType,exampleLowLevelTypes,eventID,eventName,eventDescription,eventType,postID,postCategories,postPriority
0,211565974422425600,0.75,ServiceAvailable,19.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-OriginalEvent,A report of the original event occuring. This ...,High-level,Report,[],fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211565974422425600,[ServiceAvailable],High
1,211654415503990784,0.50,News,15.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-Official,An official report by a government or public s...,High-level,Report,"[OfficialStatement, RegionalWarning, PublicAle...",fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211654415503990784,[News],Medium
2,211681309368655872,0.25,News,15.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-Official,An official report by a government or public s...,High-level,Report,"[OfficialStatement, RegionalWarning, PublicAle...",fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211681309368655872,[News],Low
3,211685621125742592,0.25,Official,16.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-News,The post is a news report providing/linking to...,High-level,Other,"[NewsHeadline, NewsArticle]",fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211685621125742592,[Official],Low
4,211877049147736064,0.25,Factoid,7.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-ThirdPartyObservation,The user is reporting a information that they ...,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211877049147736064,[Factoid],Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,396336012726525952,0.25,News,15.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-Official,An official report by a government or public s...,High-level,Report,"[OfficialStatement, RegionalWarning, PublicAle...",laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336012726525952,[News],Low
939,396336079856345088,0.25,News,15.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-Official,An official report by a government or public s...,High-level,Report,"[OfficialStatement, RegionalWarning, PublicAle...",laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336079856345088,[News],Low
940,396336243442589696,0.25,News,15.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-Official,An official report by a government or public s...,High-level,Report,"[OfficialStatement, RegionalWarning, PublicAle...",laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336243442589696,[News],Low
941,396336297968562176,0.25,Factoid,7.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-ThirdPartyObservation,The user is reporting a information that they ...,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336297968562176,[Factoid],Low
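
The inner merge returns fewer rows than the annotated set. To see which `tweet_id` values have no matching `postID`, a left merge with `indicator=True` tags every row with its origin. The frames below are toy data and the `demo_` names are illustrative.

```python
import pandas as pd

demo_annotated = pd.DataFrame({"tweet_id": [11, 12, 13], "priority": [0.25, 0.5, 0.75]})
demo_labels = pd.DataFrame({"postID": [12, 13, 14], "eventID": ["eventA", "eventA", "eventB"]})

# how="left" keeps every annotated tweet; _merge records whether a label row matched
demo_check = demo_annotated.merge(
    demo_labels, left_on="tweet_id", right_on="postID",
    how="left", indicator=True,
)

print(demo_check["_merge"].value_counts())

# Annotated tweets that have no row in the labels file
unmatched_ids = demo_check.loc[demo_check["_merge"] == "left_only", "tweet_id"].tolist()
print(unmatched_ids)
```

Comparing the unmatched IDs with the events covered by each file would show whether the gap comes from the files describing different events or from a problem with the keys themselves.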


In [13]:
df = merged_df

# Inspect the rows for a single label (label_id 6 is EmergingThreats)
df.loc[df['label_id'] == 6.0]




Unnamed: 0,tweet_id,priority,category,label_id,identifier,description,id,desc,level,intentType,exampleLowLevelTypes,eventID,eventName,eventDescription,eventType,postID,postCategories,postPriority
46,212678790353137664,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,212678790353137664,[EmergingThreats],High
78,214012344139911168,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,214012344139911168,[EmergingThreats],High
203,243375546594119680,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",costaRicaEarthquake2012,CrisisLex26 2012 Costa Rica Earthquake,The 2012 Costa Rica earthquake occurred at 08:...,Unknown,243375546594119680,[EmergingThreats],High
207,243376377057902592,1.0,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",costaRicaEarthquake2012,CrisisLex26 2012 Costa Rica Earthquake,The 2012 Costa Rica earthquake occurred at 08:...,Unknown,243376377057902592,[EmergingThreats],Critical
430,275730885792378880,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",typhoonPablo2012,CrisisLex26 2012 Typhoon Pablo,"Typhoon Bopha, known locally in the Philippine...",Unknown,275730885792378880,[EmergingThreats],High
439,275753656660414464,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",typhoonPablo2012,CrisisLex26 2012 Typhoon Pablo,"Typhoon Bopha, known locally in the Philippine...",Unknown,275753656660414464,[EmergingThreats],High
445,275760120070279168,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",typhoonPablo2012,CrisisLex26 2012 Typhoon Pablo,"Typhoon Bopha, known locally in the Philippine...",Unknown,275760120070279168,[EmergingThreats],High
457,275784077943123968,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",typhoonPablo2012,CrisisLex26 2012 Typhoon Pablo,"Typhoon Bopha, known locally in the Philippine...",Unknown,275784077943123968,[EmergingThreats],High
462,275797919137939456,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",typhoonPablo2012,CrisisLex26 2012 Typhoon Pablo,"Typhoon Bopha, known locally in the Philippine...",Unknown,275797919137939456,[EmergingThreats],High
478,275855943164649472,0.75,EmergingThreats,6.0,TRECIS-ITR-H-Types-v4-T1,TREC-IS Incident Tweet Routing (High-Level) In...,Report-FirstPartyObservation,The user is giving an eye-witness account,High-level,Report,"[Group/IndividualMovement, PeopleEvacuating, D...",typhoonPablo2012,CrisisLex26 2012 Typhoon Pablo,"Typhoon Bopha, known locally in the Philippine...",Unknown,275855943164649472,[EmergingThreats],High


In [14]:
# Keep only the columns needed for training
df = df.drop(columns=[
    'postPriority', 'eventName', 'eventType', 'eventDescription', 'eventID',
    'identifier', 'description', 'category', 'id', 'desc', 'postID',
    'postCategories', 'level', 'exampleLowLevelTypes', 'intentType',
])

df

Unnamed: 0,tweet_id,priority,label_id
0,211565974422425600,0.75,19.0
1,211654415503990784,0.50,15.0
2,211681309368655872,0.25,15.0
3,211685621125742592,0.25,16.0
4,211877049147736064,0.25,7.0
...,...,...,...
938,396336012726525952,0.25,15.0
939,396336079856345088,0.25,15.0
940,396336243442589696,0.25,15.0
941,396336297968562176,0.25,7.0


In [15]:

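def clean_dataset(df):
    """Drop rows containing NaN or infinite values, in place, and return df."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Treat +/-inf as missing so that dropna removes those rows too
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    # Dtypes are left untouched: float64 cannot hold 18-digit tweet IDs exactly
    return df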
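
Casting a frame that contains `tweet_id` to `float64` is lossy, because a float64 has a 53-bit significand and tweet IDs of this size need more. A short check, using an ID that appears in the outputs above:

```python
import numpy as np

tweet_id = 243413475681001473     # an ID from the outputs above
as_float = np.float64(tweet_id)   # what a float64 column would store
round_trip = int(as_float)

# The round trip does not return the original ID
print(tweet_id, round_trip, tweet_id == round_trip)
```

For this reason it is safer to keep `tweet_id` as `int64` (or as a string) whenever the column is used as a merge key or written to CSV.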

## Export to CSV


In [16]:
clean_dataset(df)
# Export to CSV
#merged_df.to_csv("train.csv", index=False)
df.to_csv("train.csv", index=False)
df = pd.read_csv("train.csv")
df

Unnamed: 0,tweet_id,priority,label_id
0,211565974422425600,0.75,19.0
1,211654415503990784,0.50,15.0
2,211681309368655872,0.25,15.0
3,211685621125742592,0.25,16.0
4,211877049147736064,0.25,7.0
...,...,...,...
917,396336012726525952,0.25,15.0
918,396336079856345088,0.25,15.0
919,396336243442589696,0.25,15.0
920,396336297968562176,0.25,7.0
