# Load the training data

- Tweets
    - A `Ground Truth Set` of tweets. 
- Labels
    - `TRECIS-2018-2020A-labels.json`
- Information Types
    - `TRECIS-2020-ITypes-Task1.json`

> TREC-IS provides multiple Twitter datasets collected from a range of past wildfire, earthquake, flood, typhoon/hurricane, bombing and shooting events. Human annotators manually label this data into 25 information types based on the information each tweet contains, such as 'contains location' or is a 'search and rescue request'. 

Each tweet is also assigned a **priority label** that indicates how critical the information within that tweet is for a response officer to see.



In [1]:
import pandas as pd
import numpy as np
import json
import glob
import ast

# Information Types
``TRECIS-2020-ITypes-Task1.json``

The cell below loads the ontology file containing the **25** `Information Types` that we need to assign to the unlabelled tweets.

In [2]:
df = pd.read_json("../../../data/raw/data/2020/2020-A/types/TRECIS-2020-ITypes-Task1.json", orient='columns')
df_split = df.join(pd.DataFrame(df.pop('informationTypes').tolist()))
informationTypes = df_split.drop(['identifier','description','level'],axis =1) # drop irrelevant (level?)
# Export to CSV
informationTypes.to_csv("dataset_information_types.csv", index=False)
informationTypes.head()



Unnamed: 0,id,desc,intentType,exampleLowLevelTypes
0,Request-GoodsServices,The user is asking for a particular service or...,Request,"[PsychiatricNeed, Equipment, ShelterNeeded, Ve..."
1,Request-SearchAndRescue,The user is requesting a rescue (for themselve...,Request,"[SelfRescue, OtherRescue]"
2,Request-InformationWanted,The user is requesting information,Request,"[PersonsNews, MissingPersons, EventStatus]"
3,CallToAction-Volunteer,The user is asking people to volunteer to help...,CallToAction,[RegisterNow]
4,CallToAction-Donations,The user is asking people to donate goods/money,CallToAction,"[DonateMoney, DonateGoods, PromoteFundRaising]"
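
The join/pop pattern in the cell above can also be expressed with `pd.json_normalize`. The following is a minimal sketch on an in-memory sample rather than the real file: `sample_ontology` is hypothetical and only mirrors the field names visible in the output above (`id`, `desc`, `intentType`, `exampleLowLevelTypes`), and the real ontology may carry additional fields.

```python
import pandas as pd

# Hypothetical two-entry sample that mimics the ontology layout; the real
# file may contain additional fields.
sample_ontology = {
    "identifier": "sample",
    "description": "sample ontology",
    "informationTypes": [
        {"id": "Request-SearchAndRescue",
         "desc": "The user is requesting a rescue",
         "intentType": "Request",
         "exampleLowLevelTypes": ["SelfRescue", "OtherRescue"]},
        {"id": "CallToAction-Volunteer",
         "desc": "The user is asking people to volunteer",
         "intentType": "CallToAction",
         "exampleLowLevelTypes": ["RegisterNow"]},
    ],
}

# One row per information type; list-valued fields stay as Python lists.
sample_types = pd.json_normalize(sample_ontology, record_path="informationTypes")
print(sample_types[["id", "intentType"]])
```

Because only the records under `informationTypes` are expanded, the top-level `identifier` and `description` fields never become columns and do not need to be dropped afterwards.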


# Topic Statements

For each incident, we have a stream of related tweets, collected using hashtags, keyword, user, and geolocation monitoring. Each incident/event is accompanied by a brief "topic statement" in the TREC style:


`TRECIS-2018-2020A.topics`

```xml
<top>
<num>TRECIS-CTIT-H-Training-001</num>
<dataset>fireColorado2012</dataset>
<title>2012 Colorado wildfires</title>
<type>wildfire</type>
<url>https://en.wikipedia.org/wiki/2012_Colorado_wildfires</url>
<narr> The Colorado wildfires were an unusually devastating series of fires
in the US state of Colorado, which occurred throughout June, July, and
August 2012.
</narr>
</top>
```


 -> `topic_ids`
```python
{'athensEarthquake2019': 'TRECIS-CTIT-H-Test-035',
 'baltimoreFloods2019': 'TRECIS-CTIT-H-Test-036',
```

In [3]:
topic_ids = {}

with open("../../../data/raw/data/2020/2020-A/topics/TRECIS-2020-A.topics", "r") as in_file:
    topic_num = ""
    topic_id = ""
    
    for line in in_file:
        
        if line.strip() == "</top>":
            topic_ids[topic_id] = topic_num
        
        if line.startswith("<num>"):
            topic_num = line.partition(">")[-1].partition("<")[0]
              
        if line.startswith("<dataset>"):
            topic_id = line.partition(">")[-1].partition("<")[0]

#topic_ids
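
The line-based loop above keeps the previous `topic_num` or `topic_id` if a `<top>` block happens to lack one of the tags. As a hedged alternative, the sketch below parses each block on its own with regular expressions. The function name `parse_topic_ids` is hypothetical, and the sample text is an excerpt of the topic statement shown earlier.

```python
import re

# Sample text taken from the topic statement shown above.
sample_topics = """<top>
<num>TRECIS-CTIT-H-Training-001</num>
<dataset>fireColorado2012</dataset>
<title>2012 Colorado wildfires</title>
<type>wildfire</type>
</top>
"""

def parse_topic_ids(text):
    """Return {dataset: topic number} for every <top>...</top> block."""
    mapping = {}
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        num = re.search(r"<num>(.*?)</num>", block)
        dataset = re.search(r"<dataset>(.*?)</dataset>", block)
        # Skip blocks that lack either tag instead of reusing stale values.
        if num and dataset:
            mapping[dataset.group(1).strip()] = num.group(1).strip()
    return mapping

print(parse_topic_ids(sample_topics))
```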

# Labels
`TRECIS-2018-2020A-labels.json` -> `labels_df`

```json
[
  {
    "eventID": "joplinTornado2011",
    "eventName": "2011 Joplin Tornado",
    "eventDescription": "The 2011 Joplin tornado was a catastrophic EF5-rated multiple-vortex tornado that struck Joplin, Missouri, late in the afternoon of Sunday, May 22, 2011. The user is a response officer in the Missouri command and control center responsible for impact to the state. <a href='https://en.wikipedia.org/wiki/2011_Joplin_tornado' target='_blank'>Wikipedia Page<a>",
    "eventType": "Unknown",
    "postID": "72676276212731904",
    "postCategories": [
      "Factoid",
      "Hashtags",
      "News"
    ],
    "postPriority": "Low"
  },
 ```

In [4]:
labels_df = pd.read_json("../../../data/raw/data/2020/2020-A/labels/TRECIS-2018-2020A-labels.json")
labels_df.head()


Unnamed: 0,eventID,eventName,eventDescription,eventType,postID,postCategories,postPriority
0,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72676276212731904,"[Factoid, Hashtags, News]",Low
1,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72678400833228800,"[ServiceAvailable, Official, Hashtags, News]",Critical
2,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72682396750848000,"[Sentiment, Irrelevant]",Low
3,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72693931619528704,"[ThirdPartyObservation, Hashtags, News]",Medium
4,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72698562223407104,"[ThirdPartyObservation, Hashtags, Irrelevant]",Low


## Ground Truth

`TRECIS2020A-t12-assr*.json`

Read the responses from each assessor file and append them to a list of assessor data.


```json
"events": [
{"eventid": "siberianWildfires2020",
"tweets": [
{
  "postID" : "1157446798564306945",
  "timestamp" : "3 Mar 2020 12:28:12 GMT",
  "categories" : [ "Irrelevant" ],
  "indicatorTerms" : [ ],
  "priority" : "Low",
  "text" : "Trump offers Vladimir Putin help fighting forest fires in Siberia. (When California was consumed in biblical flames, Trump blamed the state's firefighters and slashed federal funding to stop wildfires.) https://t.co/TKTw3d0NLS"
},
```

In [5]:
training_data = []

def read_annotations(in_f_path):
    l_training_data = []
    
    with open(in_f_path, "rb") as in_file:
        annotator_content = in_file.read().decode("latin-1")
        trecis_training = json.loads(annotator_content)
        l_training_data.append(trecis_training)
        
    return l_training_data

# Read 2018 training file
in_f_path = "../../../data/raw/data/2018/training/TRECIS-CTIT-H-Training.json"
training_data.extend(read_annotations(in_f_path))
        
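# Read each assessor file
# NOTE: the two glob patterns below start with "../../", while every other path
# in this notebook starts with "../../../". The "Annotations: 1" output suggests
# that neither pattern matched any file, so the relative paths are worth checking.
for in_f_path in glob.iglob("../../data/raw/data/2020/2020-A/labels/TRECIS-*.json"):
    training_data.extend(read_annotations(in_f_path))
    
# 2019 Labels
for in_f_path in glob.iglob("../../data/raw/data/2019A/2019ALabels/*assr*.json"):
    training_data.extend(read_annotations(in_f_path))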

    
print("Annotations:", len(training_data))
#print(training_data)


Annotations: 1


## Map the annotated tweets

Outputs:

**tweet_category_map**

```python
{'Advice': [243413475681001473,
  213689735082807296, ...
  ```
**tweet_id_to_priority**

```python
[{'tweet_id': 243377845072715777, 'priority': 'Low'},
 {'tweet_id': 243367720022855680, 'priority': 'Low'},
 {'tweet_id': 245776676586389504, 'priority': 'Low'},
```

**tweet_id_to_category**
```python
{243413475681001473: 1,
 213689735082807296: 1,
 212221690904723458: 1,
```


In [19]:
tweet_to_category = []
tweet_id_to_priority = []

# tweet_to_category -> df[tweet_to_category] -> category_df -> tweet_category_map[for cat in category_df]

# category_df
for annotator in training_data:
    local_events = annotator["events"]
    for event in local_events:
        for tweet in event["tweets"]:
            # Pull out categories from the tweet dictionary
            for category in tweet["categories"]:
                tweet_to_category.append({
                    "tweet_id": np.int64(tweet["postID"]),
                    "category": category
                })
                
            # Pull out priority, of which there should be only one
            tweet_id_to_priority.append({
                "tweet_id": np.int64(tweet["postID"]),
                "priority": tweet["priority"]
            })

print("Tweet to Category Map:", len(tweet_to_category))
print("Tweet ID to Priority Map:", len(tweet_id_to_priority))

category_df = pd.DataFrame(tweet_to_category)

print("Tweets with Category:", category_df["tweet_id"].value_counts().index.shape[0])

# Export to CSV
category_df.to_csv("tweet_to_category.csv", index=False)
tweet_to_category

Tweet to Category Map: 1335
Tweet ID to Priority Map: 1335
Tweets with Category: 1335


[{'tweet_id': 243377845072715777, 'category': 'KnownAlready'},
 {'tweet_id': 243367720022855680, 'category': 'KnownAlready'},
 {'tweet_id': 245776676586389504, 'category': 'KnownAlready'},
 {'tweet_id': 247293697593597952, 'category': 'KnownAlready'},
 {'tweet_id': 243426079547723776, 'category': 'KnownAlready'},
 {'tweet_id': 243401194737901569, 'category': 'Official'},
 {'tweet_id': 243436699517071360, 'category': 'KnownAlready'},
 {'tweet_id': 243366331687256065, 'category': 'KnownAlready'},
 {'tweet_id': 243378847502962688, 'category': 'Official'},
 {'tweet_id': 243375215244095488, 'category': 'KnownAlready'},
 {'tweet_id': 243374590288592896, 'category': 'KnownAlready'},
 {'tweet_id': 243461806629203969, 'category': 'ContinuingNews'},
 {'tweet_id': 243394697765199872, 'category': 'ContinuingNews'},
 {'tweet_id': 243411898605907968, 'category': 'Factoid'},
 {'tweet_id': 243372316963254273, 'category': 'MovePeople'},
 {'tweet_id': 243478046978473986, 'category': 'KnownAlready'},
 ...

## categoryMap()
- `category_df`
- `tweet_category_map`
- `tweet_id_to_category`

In [21]:
# categoryMap()
# Maps a list of tweetIDs associated with categories

tweet_category_map = {}
category_df = pd.read_csv("tweet_to_category.csv")

cat_update_map = {
    "ContinuingNews": "News",
    "PastNews": "ContextualInformation",
    "KnownAlready": "OriginalEvent",
    "SignificantEventChange": "NewSubEvent",
}

category_df["category"] = category_df["category"].apply(lambda x: cat_update_map.get(x, x))
for i, (category, tweets) in enumerate(category_df.groupby("category")):
    print(i, category)
    tweet_category_map[category] = list(tweets["tweet_id"])
    
# Deleted in 2019
#del(tweet_category_map["Unknown"])

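# Assign each category a 1-based integer label (categories are in sorted order)
category_to_label = {c: i + 1 for i, c in enumerate(tweet_category_map.keys())}

# Map every tweet ID to the integer label of its category.
# A tweet with several categories keeps only the last label assigned here.
tweet_id_to_category = {}

for category, tweet_ids in tweet_category_map.items():
    for tweet_id in tweet_ids:
        tweet_id_to_category[np.int64(tweet_id)] = category_to_label[category]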
        
#tweet_id_to_category
#tweet_category_map
#tweet_id_to_priority

0 Advice
1 CleanUp
2 ContextualInformation
3 Discussion
4 Donations
5 EmergingThreats
6 Factoid
7 FirstPartyObservation
8 Hashtags
9 InformationWanted
10 Irrelevant
11 MovePeople
12 MultimediaShare
13 NewSubEvent
14 News
15 Official
16 OriginalEvent
17 Sentiment
18 ServiceAvailable
19 ThirdPartyObservation
20 Unknown
21 Volunteer
22 Weather



Maps each tweet ID in `tweet_id_to_priority` to its numerical priority value. We can then use this to calculate the error against our run.

Outputs `priority_df` and `priority_map` (identical)

In [8]:
priority_df = pd.DataFrame(tweet_id_to_priority)

priority_mapping = {
    "Critical" : 1,
    "High" : 0.75,
    "Medium" : 0.5,
    "Low" : 0.25,
    "Unknown" : 0,
}


temp_merged_priorities = []
for tweet_id, group in priority_df.groupby("tweet_id"):
    priority_list = list(group["priority"])
    p_scores = [priority_mapping[p] for p in priority_list]
    temp_merged_priorities.append({
        "tweet_id": tweet_id,
        "priority": np.mean(p_scores),
    })

priority_df = pd.DataFrame(temp_merged_priorities)
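# Build the map from the columns directly: iterrows() upcasts each row to float64,
# which would turn the int64 tweet IDs into lossy float keys.
priority_map = dict(zip(priority_df["tweet_id"], priority_df["priority"]))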

#priority_df.head()
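
The per-tweet averaging can also be written without an explicit Python loop. The sketch below is one possible vectorised form, shown on a hypothetical three-row sample in which the first tweet is assumed to have been labelled by two assessors; `sample` and `sample_priority_map` are illustrative names, not part of the notebook.

```python
import pandas as pd

priority_mapping = {"Critical": 1, "High": 0.75, "Medium": 0.5, "Low": 0.25, "Unknown": 0}

# Hypothetical sample: the first tweet carries two priority labels.
sample = pd.DataFrame([
    {"tweet_id": 243377845072715777, "priority": "Low"},
    {"tweet_id": 243377845072715777, "priority": "High"},
    {"tweet_id": 243367720022855680, "priority": "Critical"},
])

# Map labels to scores, then average the scores per tweet.
scores = sample["priority"].map(priority_mapping)
mean_priority = scores.groupby(sample["tweet_id"]).mean()

# Converting the Series to a dict keeps the tweet IDs as integer keys.
sample_priority_map = mean_priority.to_dict()
print(sample_priority_map)
```

One behavioural difference to keep in mind: `Series.map` yields `NaN` for a label missing from `priority_mapping`, whereas the list comprehension in the cell above raises a `KeyError`.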

#### ID -> EventID

Returns a map of events with all identified tweet IDs

```
{'albertaFloods2013': [347686624563429376,
  347766337344503808,
  347783236191129600,
  347793432514801664,
```
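
The cell that builds this map is not included in the notebook. A minimal sketch of how it could be built is shown below, assuming each entry of `training_data` has the `events` / `eventid` / `tweets` / `postID` layout shown in the Ground Truth example. The function name `build_event_to_tweet_ids` and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample with the same shape as one entry of `training_data`.
sample_training_data = [{
    "events": [
        {"eventid": "albertaFloods2013",
         "tweets": [{"postID": "347686624563429376"},
                    {"postID": "347766337344503808"}]},
        {"eventid": "siberianWildfires2020",
         "tweets": [{"postID": "1157446798564306945"}]},
    ]
}]

def build_event_to_tweet_ids(annotations):
    """Collect the integer tweet IDs listed under each event ID."""
    event_map = defaultdict(list)
    for annotator in annotations:
        for event in annotator["events"]:
            for tweet in event["tweets"]:
                event_map[event["eventid"]].append(int(tweet["postID"]))
    return dict(event_map)

event_to_tweet_ids = build_event_to_tweet_ids(sample_training_data)
print(event_to_tweet_ids)
```

If several assessors label the same tweet, its ID appears once per assessor in the list; wrapping each list in `set()` would remove those repeats.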

In [9]:
print("Labels:", sum([len(v) for v in tweet_category_map.values()]))

Labels: 1335


In [10]:
# Build a two-column DataFrame mapping each tweet ID to its numeric category label
cat_df = pd.DataFrame(list(tweet_id_to_category.items()), columns=['tweet_id', 'postCategories'])
#cat_df

## Merge the category and priority

In [11]:
tweet_to_category_priority_df = priority_df.join(category_df.set_index('tweet_id'), on='tweet_id')
tweet_to_category_id_priority_df = tweet_to_category_priority_df.join(cat_df.set_index('tweet_id'), on='tweet_id')
tweet_to_category_id_priority_df

Unnamed: 0,tweet_id,priority,category,postCategories
0,211281973870727170,0.25,Irrelevant,11
1,211557401231495171,0.50,FirstPartyObservation,8
2,211565974422425600,0.75,ServiceAvailable,19
3,211607187653533697,0.50,Weather,23
4,211654415503990784,0.50,News,15
...,...,...,...,...
1330,396336012726525952,0.25,News,15
1331,396336079856345088,0.25,News,15
1332,396336243442589696,0.25,News,15
1333,396336297968562176,0.25,Factoid,7


In [12]:
# Attach the event metadata from labels_df; the inner join keeps only tweets present in both tables
merged_df = pd.merge(tweet_to_category_id_priority_df, labels_df, left_on = 'tweet_id', right_on = 'postID', how = 'inner')

#merged_df['eventID'] = merged_df['eventID'].str[-2:]
#merged_df.loc[merged_df['postCategories_x'] == 8]

merged_df



Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventName,eventDescription,eventType,postID,postCategories_y,postPriority
0,211565974422425600,0.75,ServiceAvailable,19,fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211565974422425600,[ServiceAvailable],High
1,211654415503990784,0.50,News,15,fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211654415503990784,[News],Medium
2,211681309368655872,0.25,News,15,fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211681309368655872,[News],Low
3,211685621125742592,0.25,Official,16,fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211685621125742592,[Official],Low
4,211877049147736064,0.25,Factoid,7,fireColorado2012,DEMO: CrisisLex26 2012 Colorado wildfires,The 2012 Colorado wildfires were an unusually ...,Unknown,211877049147736064,[Factoid],Low
...,...,...,...,...,...,...,...,...,...,...,...
938,396336012726525952,0.25,News,15,laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336012726525952,[News],Low
939,396336079856345088,0.25,News,15,laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336079856345088,[News],Low
940,396336243442589696,0.25,News,15,laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336243442589696,[News],Low
941,396336297968562176,0.25,Factoid,7,laAirportShooting2013,CrisisLex26 2013 LA Airport Shooting,"On November 1, 2013, a shooting occurred at ar...",Unknown,396336297968562176,[Factoid],Low
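
The merged output above ends at row 941, while the table before the merge ends at row 1333, so the inner join discards tweets that have no matching `postID` in `labels_df`. One way to list the discarded tweets is a left join with `indicator=True`. The sketch below uses small hypothetical stand-in tables (`left`, `right`) rather than the real DataFrames.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables being merged.
left = pd.DataFrame({"tweet_id": [1, 2, 3], "priority": [0.25, 0.5, 0.75]})
right = pd.DataFrame({"postID": [2, 3, 4], "eventID": ["a", "a", "b"]})

# A left join with indicator=True records where each row found a match.
checked = pd.merge(left, right, left_on="tweet_id", right_on="postID",
                   how="left", indicator=True)

# Rows marked "left_only" have no label record and vanish in an inner join.
unmatched = checked.loc[checked["_merge"] == "left_only", "tweet_id"].tolist()
print(unmatched)
```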


In [13]:

'''
'Advice':1
Cleanup
ContextualInformation
Discussion
Donations
EmergingThreats
Factoid
FirstPartyObservation


'''
# Drop irrelevant columns
df = merged_df.drop(['postID','eventName','eventDescription','postPriority', 'postCategories_y'],axis =1)
df.to_csv("dataset_labelled_tweets.csv", index=False)
# Drop string
#df = df.drop(['category','eventID','eventType',],axis =1)

df


Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventType
0,211565974422425600,0.75,ServiceAvailable,19,fireColorado2012,Unknown
1,211654415503990784,0.50,News,15,fireColorado2012,Unknown
2,211681309368655872,0.25,News,15,fireColorado2012,Unknown
3,211685621125742592,0.25,Official,16,fireColorado2012,Unknown
4,211877049147736064,0.25,Factoid,7,fireColorado2012,Unknown
...,...,...,...,...,...,...
938,396336012726525952,0.25,News,15,laAirportShooting2013,Unknown
939,396336079856345088,0.25,News,15,laAirportShooting2013,Unknown
940,396336243442589696,0.25,News,15,laAirportShooting2013,Unknown
941,396336297968562176,0.25,Factoid,7,laAirportShooting2013,Unknown


In [14]:

df.sort_values(by=['postCategories_x'])


df.loc[df['postCategories_x'] == 15]

Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventType
1,211654415503990784,0.50,News,15,fireColorado2012,Unknown
2,211681309368655872,0.25,News,15,fireColorado2012,Unknown
12,212099863171710976,0.25,News,15,fireColorado2012,Unknown
13,212137133736075264,0.25,News,15,fireColorado2012,Unknown
24,212298199229149184,0.25,News,15,fireColorado2012,Unknown
...,...,...,...,...,...,...
937,396335979167903744,0.25,News,15,laAirportShooting2013,Unknown
938,396336012726525952,0.25,News,15,laAirportShooting2013,Unknown
939,396336079856345088,0.25,News,15,laAirportShooting2013,Unknown
940,396336243442589696,0.25,News,15,laAirportShooting2013,Unknown


## Export to CSV


In [15]:
#clean_dataset(df)

# Export to CSV
#merged_df.to_csv("train.csv", index=False)
df.to_csv("dataset.csv", index=False)
df = pd.read_csv("dataset.csv")
df

Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventType
0,211565974422425600,0.75,ServiceAvailable,19,fireColorado2012,Unknown
1,211654415503990784,0.50,News,15,fireColorado2012,Unknown
2,211681309368655872,0.25,News,15,fireColorado2012,Unknown
3,211685621125742592,0.25,Official,16,fireColorado2012,Unknown
4,211877049147736064,0.25,Factoid,7,fireColorado2012,Unknown
...,...,...,...,...,...,...
938,396336012726525952,0.25,News,15,laAirportShooting2013,Unknown
939,396336079856345088,0.25,News,15,laAirportShooting2013,Unknown
940,396336243442589696,0.25,News,15,laAirportShooting2013,Unknown
941,396336297968562176,0.25,Factoid,7,laAirportShooting2013,Unknown


In [16]:

df
#clean_dataset(df)
mapping = {'fireColorado2012': 1, 'costaRicaEarthquake2012': 2,
          'floodColorado2013': 3, 'typhoonPablo2012': 4,
          'laAirportShooting2013': 5, 'westTexasExplosion2013': 6,
          'guatemalaEarthquake2012': 7, 'italyEarthquakes2012': 8,
          'philipinnesFloods2012': 9, 'albertaFloods2013': 10, 
           'australiaBushfire2013': 11, 'bostonBombings2013': 12,
           'siberianWildfires2019': 46
          }
df = df.replace({'eventID': mapping})
df

Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventType
0,211565974422425600,0.75,ServiceAvailable,19,1,Unknown
1,211654415503990784,0.50,News,15,1,Unknown
2,211681309368655872,0.25,News,15,1,Unknown
3,211685621125742592,0.25,Official,16,1,Unknown
4,211877049147736064,0.25,Factoid,7,1,Unknown
...,...,...,...,...,...,...
938,396336012726525952,0.25,News,15,5,Unknown
939,396336079856345088,0.25,News,15,5,Unknown
940,396336243442589696,0.25,News,15,5,Unknown
941,396336297968562176,0.25,Factoid,7,5,Unknown
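
The hand-written `mapping` has to be extended for every new event. As a hedged alternative, the codes could be derived from the topic numbers in `topic_ids`, assuming the integer codes are meant to match the numeric suffix of each topic number (for example `fireColorado2012` is `TRECIS-CTIT-H-Training-001` and is mapped to 1 above). The source does not confirm that this holds for every event, and `topic_number_suffix` is a hypothetical helper.

```python
# Excerpt of `topic_ids` using the two entries that appear earlier in the notebook.
sample_topic_ids = {
    "fireColorado2012": "TRECIS-CTIT-H-Training-001",
    "athensEarthquake2019": "TRECIS-CTIT-H-Test-035",
}

def topic_number_suffix(topic_num):
    """Return the trailing integer of a topic number such as '...-001'."""
    return int(topic_num.rsplit("-", 1)[-1])

# Assumption: the event code equals the numeric suffix of the topic number.
event_codes = {event: topic_number_suffix(num)
               for event, num in sample_topic_ids.items()}
print(event_codes)
```

The resulting dictionary can be passed to `df.replace({'eventID': event_codes})` in the same way as `mapping`.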


In [17]:
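def clean_dataset(df):
    """Drop rows containing NaN or infinite values and cast the rest to float64."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df = df.dropna()
    # Pass axis by keyword; newer pandas versions reject the positional form
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

# The numeric category label lives in 'postCategories_x'; df has no 'label_id' column
df.sort_values(by=['postCategories_x'])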

In [None]:
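# 'id' and 'intentType' are not columns of df at this point, so ignore missing labels
df = df.drop(columns=['id', 'intentType', 'category'], errors='ignore')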

df