# Load the training data

- Tweets
    - A `Ground Truth Set` of tweets. 
- Labels
    - `TRECIS-2018-2020A-labels.json`
- Information Types
    - `TRECIS-2020-ITypes-Task1.json`

> TREC-IS provides multiple Twitter datasets collected from a range of past wildfire, earthquake, flood, typhoon/hurricane, bombing and shooting events. Human annotators manually label this data into 25 information types based on the information each tweet contains, such as 'contains location' or is a 'search and rescue request'. 

Each tweet is also assigned a **priority** label that indicates how critical the information in that tweet is for a response officer to see.



In [1]:
import pandas as pd
import numpy as np
import json
import glob
import ast

# Information Types
``TRECIS-2020-ITypes-Task1.json``

This loads the ontology file containing the **25** `Information Types` that we need to assign to the unlabelled tweets.

In [2]:
df = pd.read_json("../../../0-data/raw/data/2020/2020-A/types/TRECIS-2020-ITypes-Task1.json", orient='columns')
df_split = df.join(pd.DataFrame(df.pop('informationTypes').tolist()))
informationTypes = df_split.drop(['identifier','description','level'],axis =1) # drop irrelevant (level?)

# Topic Statements

For each incident, we have a stream of related tweets, collected using hashtag, keyword, user, and geolocation monitoring. Each incident/event is accompanied by a brief "topic statement" in the TREC style:


`TRECIS-2018-2020A.topics`

```xml
<top>
<num>TRECIS-CTIT-H-Training-001</num>
<dataset>fireColorado2012</dataset>
<title>2012 Colorado wildfires</title>
<type>wildfire</type>
<url>https://en.wikipedia.org/wiki/2012_Colorado_wildfires</url>
<narr> The Colorado wildfires were an unusually devastating series of fires
in the US state of Colorado, which occurred throughout June, July, and
August 2012.
</narr>
</top>
```


 -> `topic_ids`
```python
{'athensEarthquake2019': 'TRECIS-CTIT-H-Test-035',
 'baltimoreFloods2019': 'TRECIS-CTIT-H-Test-036',
```
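The sample topic above is enough to sketch how the `<dataset>` to `<num>` mapping is built. The snippet below is a minimal illustration rather than the notebook's loader: it parses an inline string instead of the topics file, and `parse_topics` and `sample_topics` are names introduced here for the example.

```python
import re

# Inline copy of the sample topic shown above (URL and narrative omitted).
sample_topics = """
<top>
<num>TRECIS-CTIT-H-Training-001</num>
<dataset>fireColorado2012</dataset>
<title>2012 Colorado wildfires</title>
<type>wildfire</type>
</top>
"""

def parse_topics(text):
    """Return {dataset: num} for every <top>...</top> block in `text`."""
    topic_ids = {}
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        num = re.search(r"<num>(.*?)</num>", block)
        dataset = re.search(r"<dataset>(.*?)</dataset>", block)
        if num and dataset:  # skip incomplete blocks
            topic_ids[dataset.group(1).strip()] = num.group(1).strip()
    return topic_ids

topic_ids_sample = parse_topics(sample_topics)
print(topic_ids_sample)
```

Matching whole `<top>` blocks keeps each `<num>` paired with the `<dataset>` from the same topic, even if the tag order inside a block changes.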

In [3]:
topic_ids = {}

with open("../../../0-data/raw/data/2020/2020-A/topics/TRECIS-2020-A.topics", "r") as in_file:
    topic_num = ""
    topic_id = ""
    
    for line in in_file:
        
        if line.strip() == "</top>":
            topic_ids[topic_id] = topic_num
        
        if line.startswith("<num>"):
            topic_num = line.partition(">")[-1].partition("<")[0]
              
        if line.startswith("<dataset>"):
            topic_id = line.partition(">")[-1].partition("<")[0]

#topic_ids

# Labels
`TRECIS-2018-2020A-labels.json` -> `labels_df`

```json
[
  {
    "eventID": "joplinTornado2011",
    "eventName": "2011 Joplin Tornado",
    "eventDescription": "The 2011 Joplin tornado was a catastrophic EF5-rated multiple-vortex tornado that struck Joplin, Missouri, late in the afternoon of Sunday, May 22, 2011. The user is a response officer in the Missouri command and control center responsible for impact to the state. <a href='https://en.wikipedia.org/wiki/2011_Joplin_tornado' target='_blank'>Wikipedia Page<a>",
    "eventType": "Unknown",
    "postID": "72676276212731904",
    "postCategories": [
      "Factoid",
      "Hashtags",
      "News"
    ],
    "postPriority": "Low"
  },
 ```

In [4]:
labels_df = pd.read_json("../../../0-data/raw/data/2020/2020-A/labels/TRECIS-2018-2020A-labels.json")
labels_df



Unnamed: 0,eventID,eventName,eventDescription,eventType,postID,postCategories,postPriority
0,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72676276212731904,"[Factoid, Hashtags, News]",Low
1,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72678400833228800,"[ServiceAvailable, Official, Hashtags, News]",Critical
2,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72682396750848000,"[Sentiment, Irrelevant]",Low
3,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72693931619528704,"[ThirdPartyObservation, Hashtags, News]",Medium
4,joplinTornado2011,2011 Joplin Tornado,The 2011 Joplin tornado was a catastrophic EF5...,Unknown,72698562223407104,"[ThirdPartyObservation, Hashtags, Irrelevant]",Low
...,...,...,...,...,...,...,...
42946,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1161999740080291840,[Irrelevant],Low
42947,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162004768904163328,"[Location, MultimediaShare, ContextualInformat...",Low
42948,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162005174468132864,"[Location, MultimediaShare]",Low
42949,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162005861075750912,"[Location, MultimediaShare, Hashtags]",Low


In [6]:
# Map the categories to numeric values
mymap = {'Advice':0, 'CleanUp':1, 'ContextualInformation':2, 'Discussion':3, 'Donations':4, 
        'EmergingThreats':5, 'Factoid':6, 'FirstPartyObservation':7, 'GoodsServices':8, 'Hashtags':9, 
        'InformationWanted':10,'Irrelevant':11, 'Location':12, 'MovePeople':13, 
         'MultimediaShare':14, 'NewSubEvent':15, 'News':16,
        'Official':17, 'OriginalEvent':18, 'SearchAndRescue':19, 'Sentiment':20, 'ServiceAvailable':21, 
         'ThirdPartyObservation':22,'Volunteer':23, 'Weather':24}

df2 = pd.DataFrame(labels_df["postCategories"].to_list(), columns=['cat1', 'cat2', 'cat3',
                                                                   'cat4', 'cat5', 'cat6',
                                                                   'cat7', 'cat8', 'cat9', 'cat10'])


#df['condition'] = df['condition'].map({1:'positive', -1:'negative', 0:'neutral'})

df2 = df2.applymap(lambda s: mymap.get(s) if s in mymap else s)
df2 = df2.fillna("0")
df = labels_df.join(df2)
df

labels_df = df
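One thing to watch in the cell above: `'Advice'` is encoded as 0 and empty category slots are also filled with `"0"`, so the two cannot be told apart afterwards. The sketch below shows one possible way around this on toy data, using -1 for empty slots. `toy_map` and `toy_labels` are hypothetical stand-ins, and -1 is an assumption made for this example, not the notebook's choice.

```python
import pandas as pd

# Hypothetical three-category subset of the full category map.
toy_map = {"Advice": 0, "Factoid": 6, "News": 16}
toy_labels = pd.DataFrame({"postCategories": [["Factoid", "News"], ["Advice"]]})

# One column per category slot; shorter lists are padded with missing values.
wide = pd.DataFrame(toy_labels["postCategories"].to_list())
wide.columns = [f"cat{i + 1}" for i in range(wide.shape[1])]

# Map names to ids column by column, then mark empty slots with -1.
encoded = wide.apply(lambda col: col.map(toy_map)).fillna(-1).astype(int)
print(encoded)
```

With this padding value, a 0 in any `cat` column always means `Advice`.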


## Ground Truth

`TRECIS2020A-t12-assr*.json`

Read the responses for each assessor and append them to an array of assessor data


```json
"events": [
{"eventid": "siberianWildfires2020",
"tweets": [
{
  "postID" : "1157446798564306945",
  "timestamp" : "3 Mar 2020 12:28:12 GMT",
  "categories" : [ "Irrelevant" ],
  "indicatorTerms" : [ ],
  "priority" : "Low",
  "text" : "Trump offers Vladimir Putin help fighting forest fires in Siberia. (When California was consumed in biblical flames, Trump blamed the state's firefighters and slashed federal funding to stop wildfires.) https://t.co/TKTw3d0NLS"
},
```

In [7]:
training_data = []

def read_annotations(in_f_path):
    l_training_data = []
    
    with open(in_f_path, "rb") as in_file:
        annotator_content = in_file.read().decode("latin-1")
        trecis_training = json.loads(annotator_content)
        l_training_data.append(trecis_training)
        
    return l_training_data

# Read 2018 training file
#in_f_path = "../../../data/raw/data/2018/training/TRECIS-CTIT-H-Training.json"
#training_data.extend(read_annotations(in_f_path))
        
# Read each assessor file
#for in_f_path in glob.iglob("../../data/raw/data/2020/2020-A/labels/TRECIS-*.json"):
#    training_data.extend(read_annotations(in_f_path))
# 2019 Labels/Users/pseudo/Documents/GitHub/HelpMe/data/TRECIS2020A-t12-assr1.json
for in_f_path in glob.iglob("../../../0-data/raw/data/2020/2020-A/ground-truth-set/*assr*.json"):
    training_data.extend(read_annotations(in_f_path))
    
print("Annotations:", len(training_data))
#print(training_data)


Annotations: 5
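Each assessor file nests tweets under events, as in the JSON excerpt above. As an illustration only, `pandas.json_normalize` can flatten that structure into one row per annotated tweet. The `toy_assessor` dictionary below is a hypothetical stand-in that keeps just a few of the fields from the excerpt.

```python
import pandas as pd

# Hypothetical, trimmed-down version of one parsed assessor file.
toy_assessor = {
    "events": [
        {
            "eventid": "siberianWildfires2020",
            "tweets": [
                {
                    "postID": "1157446798564306945",
                    "categories": ["Irrelevant"],
                    "priority": "Low",
                },
            ],
        }
    ]
}

# One row per tweet, with the parent event id carried along as a column.
flat = pd.json_normalize(toy_assessor["events"], record_path="tweets", meta=["eventid"])

# postID arrives as a string; convert it explicitly to a 64-bit integer.
flat["postID"] = flat["postID"].astype("int64")
print(flat[["eventid", "postID", "priority"]])
```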


## Map the annotated tweets

Outputs:

**tweet_category_map**

```python
{'Advice': [243413475681001473,
  213689735082807296, ...
  ```
**tweet_id_to_priority**

```python
[{'tweet_id': 243377845072715777, 'priority': 'Low'},
 {'tweet_id': 243367720022855680, 'priority': 'Low'},
 {'tweet_id': 245776676586389504, 'priority': 'Low'},
```

**tweet_id_to_category**
```python
{243413475681001473: 1,
 213689735082807296: 1,
 212221690904723458: 1,
```


In [8]:
tweet_to_category = []
tweet_id_to_priority = []

# tweet_to_category -> df[tweet_to_category] -> category_df -> tweet_category_map[for cat in category_df]

# category_df
for annotator in training_data:
    local_events = annotator["events"]
    for event in local_events:
        for tweet in event["tweets"]:
            # Pull out categories from the tweet dictionary
            for category in tweet["categories"]:
                #print(category)
                tweet_to_category.append({
                    "tweet_id": np.int64(tweet["postID"]),
                    "category": category
                })
                
            # Pull out priority, of which there should be only one
            tweet_id_to_priority.append({
                "tweet_id": np.int64(tweet["postID"]),
                "priority": tweet["priority"]
            })

print("Tweet to Category Map:", len(tweet_to_category))
print("Tweet ID to Priority Map:", len(tweet_id_to_priority))

category_df = pd.DataFrame(tweet_to_category)

print("Tweets with Category:", category_df["tweet_id"].nunique())

# Export to CSV
category_df.to_csv("../3-csv/tweet_to_category.csv", index=False)
#tweet_to_category

Tweet to Category Map: 27036
Tweet ID to Priority Map: 12227
Tweets with Category: 6658


## categoryMap()
- `category_df`
- `tweet_category_map`
- `tweet_id_to_category`

In [9]:
# categoryMap()
# Maps each category to the list of tweet IDs labelled with it

tweet_category_map = {}
category_df = pd.read_csv("../3-csv/tweet_to_category.csv")

#cat_update_map = {
#    "ContinuingNews": "News",
#    "PastNews": "ContextualInformation",
#    "KnownAlready": "OriginalEvent",
#    "SignificantEventChange": "NewSubEvent",
#}

#category_df["category"] = category_df["category"].apply(lambda x: cat_update_map.get(x, x))

i = 0
for category, tweets in category_df.groupby("category"):
    print(i, category)
    i += 1
    tweet_category_map[category] = list(tweets["tweet_id"])
    
# Deleted in 2019
#del(tweet_category_map["Unknown"])

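# Assign each category a 1-based integer label (alphabetical order, as produced by groupby)
category_to_label = {c:i+1 for i, c in enumerate(tweet_category_map.keys())}

tweet_id_to_category = {}

# A tweet with several categories keeps only the label written last,
# i.e. the label of its alphabetically last category
for category, tweet_ids in tweet_category_map.items():
    for tweet_id in tweet_ids:
        tweet_id_to_category[np.int64(tweet_id)] = category_to_label[category]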
        
#tweet_id_to_category
#tweet_category_map
#tweet_id_to_priority

0 Advice
1 CleanUp
2 ContextualInformation
3 Discussion
4 Donations
5 EmergingThreats
6 Factoid
7 FirstPartyObservation
8 GoodsServices
9 Hashtags
10 InformationWanted
11 Irrelevant
12 Location
13 MovePeople
14 MultimediaShare
15 NewSubEvent
16 News
17 Official
18 OriginalEvent
19 SearchAndRescue
20 Sentiment
21 ServiceAvailable
22 ThirdPartyObservation
23 Volunteer
24 Weather



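Maps each tweet ID in `tweet_id_to_priority` to its numerical priority value. When several assessors labelled the same tweet, their scores are averaged. We can then use this to calculate the error against our run.

Outputs `priority_df` and `priority_map` (the same data, as a DataFrame and as a dict).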

In [10]:
priority_df = pd.DataFrame(tweet_id_to_priority)

priority_mapping = {
    "Critical" : 1,
    "High" : 0.75,
    "Medium" : 0.5,
    "Low" : 0.25,
    "Unknown" : 0,
}


temp_merged_priorities = []
for tweet_id, group in priority_df.groupby("tweet_id"):
    priority_list = list(group["priority"])
    p_scores = [priority_mapping[p] for p in priority_list]
    temp_merged_priorities.append({
        "tweet_id": tweet_id,
        "priority": np.mean(p_scores),
    })

priority_df = pd.DataFrame(temp_merged_priorities)
# Build the lookup from the columns directly; iterrows() would upcast tweet_id to float64 and round large IDs
priority_map = dict(zip(priority_df["tweet_id"], priority_df["priority"]))

#priority_df.head()
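The averaging step can also be written without an explicit loop. The following is a sketch on toy data, assuming the same label-to-score mapping; `toy_priorities` and `toy_priority_map` are names introduced here. Building the lookup from the grouped index avoids row-wise iteration such as `iterrows`, which upcasts mixed integer/float rows to float and can round 64-bit tweet IDs.

```python
import pandas as pd

priority_mapping = {"Critical": 1, "High": 0.75, "Medium": 0.5, "Low": 0.25, "Unknown": 0}

# Toy input: two assessors disagree on the first tweet; the second has one label.
toy_priorities = pd.DataFrame({
    "tweet_id": [1128285482784366592, 1128285482784366592, 1128285665186197504],
    "priority": ["Low", "High", "Critical"],
})

# Convert labels to scores, then average per tweet.
scores = toy_priorities["priority"].map(priority_mapping)
mean_priority = scores.groupby(toy_priorities["tweet_id"]).mean()

# The integer index becomes the dict keys, so IDs keep full precision.
toy_priority_map = mean_priority.to_dict()
print(toy_priority_map)
```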

#### ID -> EventID

Returns a map of events with all identified tweet IDs

```python
{'albertaFloods2013': [347686624563429376,
  347766337344503808,
  347783236191129600,
  347793432514801664,
```
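The cells shown here do not build the event map described above, so the following is only a sketch of how it could be derived from the assessor structure (`events`, then `tweets`, then `postID`). `toy_training_data` and `event_to_tweet_ids` are names assumed for this example; the toy input mimics two assessors who labelled an overlapping tweet.

```python
from collections import defaultdict

# Toy stand-in for the list of parsed assessor files.
toy_training_data = [
    {"events": [{"eventid": "siberianWildfires2020",
                 "tweets": [{"postID": "1157446798564306945"},
                            {"postID": "1158482318081662976"}]}]},
    {"events": [{"eventid": "siberianWildfires2020",
                 "tweets": [{"postID": "1157446798564306945"}]}]},
]

# Collect IDs in sets so a tweet seen by several assessors is stored once.
event_to_ids = defaultdict(set)
for annotator in toy_training_data:
    for event in annotator["events"]:
        for tweet in event["tweets"]:
            event_to_ids[event["eventid"]].add(int(tweet["postID"]))

# Convert to plain sorted lists, matching the shape of the example above.
event_to_tweet_ids = {event: sorted(ids) for event, ids in event_to_ids.items()}
print(event_to_tweet_ids)
```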

In [11]:
print("Labels:", sum([len(v) for v in tweet_category_map.values()]))

Labels: 27036


In [12]:
# Build a two-column DataFrame of tweet IDs and their (1-based) category labels
cat_df = pd.DataFrame(list(tweet_id_to_category.items()), columns=['tweet_id', 'postCategories'])
cat_df

Unnamed: 0,tweet_id,postCategories
0,1155656085027487744,21
1,1155672988735270912,12
2,1155678291128246272,21
3,1155667255188959232,21
4,1155704526793564161,21
...,...,...
6653,1158164390669168640,17
6654,1158164612677799937,17
6655,1155388589427105792,17
6656,1157033482608107522,18


## Merge the category and priority

In [13]:
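# Attach every category name to its tweet's averaged priority (one row per tweet/category pair)
tweet_to_category_priority_df = priority_df.join(category_df.set_index('tweet_id'), on='tweet_id')
# Attach the single numeric label per tweet from cat_df
tweet_to_category_id_priority_df = tweet_to_category_priority_df.join(cat_df.set_index('tweet_id'), on='tweet_id')
# Shift the 1-based labels to 0-based so they match the ids in `mymap`
tweet_to_category_id_priority_df['postCategories'] = tweet_to_category_id_priority_df['postCategories'] - 1
tweet_to_category_id_priority_df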

Unnamed: 0,tweet_id,priority,category,postCategories
0,1128285482784366592,0.75,Location,12
0,1128285482784366592,0.75,Factoid,12
1,1128285665186197504,0.25,Location,12
1,1128285665186197504,0.25,Factoid,12
1,1128285665186197504,0.25,Hashtags,12
...,...,...,...,...
6657,1162006062867918848,0.25,Irrelevant,16
6657,1162006062867918848,0.25,Location,16
6657,1162006062867918848,0.25,MultimediaShare,16
6657,1162006062867918848,0.25,Hashtags,16


In [14]:
#df = tweet_to_category_id_priority_df
#df['category'] = df[['tweet_id','priority', 'category', 'postCategories']].groupby(['tweet_id','priority', 'postCategories'])['category'].transform(lambda x: ','.join(x))
#df[['tweet_id','priority', 'postCategories']].drop_duplicates()


In [15]:
#df = df.drop_duplicates()
#df

In [23]:


# Inner-join the annotated tweets with the label metadata on the tweet ID
merged_df = pd.merge(tweet_to_category_id_priority_df, labels_df, left_on='tweet_id', right_on='postID', how='inner')

#merged_df['eventID'] = merged_df['eventID'].str[-2:]
#merged_df.loc[merged_df['postCategories_x'] == 8]

# Number of categories assigned to each tweet in the labels file
merged_df['num'] = merged_df['postCategories_y'].str.len()
merged_df

Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventName,eventDescription,eventType,postID,postCategories_y,...,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,num
0,1128285482784366592,0.75,Location,12,papuaNewguineaEarthquake2020,papuaNewguineaEarthquake2020,Placeholder,Unknown,1128285482784366592,"[Location, Factoid]",...,6,0,0,0,0,0,0,0,0,2
1,1128285482784366592,0.75,Factoid,12,papuaNewguineaEarthquake2020,papuaNewguineaEarthquake2020,Placeholder,Unknown,1128285482784366592,"[Location, Factoid]",...,6,0,0,0,0,0,0,0,0,2
2,1128285665186197504,0.25,Location,12,papuaNewguineaEarthquake2020,papuaNewguineaEarthquake2020,Placeholder,Unknown,1128285665186197504,"[Location, Factoid, Hashtags]",...,6,9,0,0,0,0,0,0,0,3
3,1128285665186197504,0.25,Factoid,12,papuaNewguineaEarthquake2020,papuaNewguineaEarthquake2020,Placeholder,Unknown,1128285665186197504,"[Location, Factoid, Hashtags]",...,6,9,0,0,0,0,0,0,0,3
4,1128285665186197504,0.25,Hashtags,12,papuaNewguineaEarthquake2020,papuaNewguineaEarthquake2020,Placeholder,Unknown,1128285665186197504,"[Location, Factoid, Hashtags]",...,6,9,0,0,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15264,1162006062867918848,0.25,Irrelevant,16,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162006062867918848,"[Location, MultimediaShare, Hashtags, News]",...,14,9,16,0,0,0,0,0,0,4
15265,1162006062867918848,0.25,Location,16,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162006062867918848,"[Location, MultimediaShare, Hashtags, News]",...,14,9,16,0,0,0,0,0,0,4
15266,1162006062867918848,0.25,MultimediaShare,16,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162006062867918848,"[Location, MultimediaShare, Hashtags, News]",...,14,9,16,0,0,0,0,0,0,4
15267,1162006062867918848,0.25,Hashtags,16,typhoonKrosa2020,typhoonKrosa2020,Placeholder,Unknown,1162006062867918848,"[Location, MultimediaShare, Hashtags, News]",...,14,9,16,0,0,0,0,0,0,4


In [24]:
# Drop irrelevant columns
df = merged_df.drop(['postID','eventName','eventDescription','postPriority', 'postCategories_y'],axis =1)
df.to_csv("../3-csv/dataset_labelled_tweets.csv", index=False)

# Drop string
#df = df.drop(['category','eventID','eventType',],axis =1)

#df.drop['postCategories_x']

In [25]:
# Inspect the rows whose numeric label is 15 (NewSubEvent)
df.loc[df['postCategories_x'] == 15]

Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventType,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,num
6981,1157715110342930432,0.75,Location,15,elPasoWalmartShooting2020,Unknown,12,5,15,0,0,0,0,0,0,0,3
6982,1157715110342930432,0.75,Factoid,15,elPasoWalmartShooting2020,Unknown,12,5,15,0,0,0,0,0,0,0,3
6983,1157715110342930432,0.75,Location,15,elPasoWalmartShooting2020,Unknown,12,5,15,0,0,0,0,0,0,0,3
6984,1157715110342930432,0.75,EmergingThreats,15,elPasoWalmartShooting2020,Unknown,12,5,15,0,0,0,0,0,0,0,3
6985,1157715110342930432,0.75,NewSubEvent,15,elPasoWalmartShooting2020,Unknown,12,5,15,0,0,0,0,0,0,0,3
9689,1158482318081662976,0.25,NewSubEvent,15,siberianWildfires2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
9690,1158482318081662976,0.25,MultimediaShare,15,siberianWildfires2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
9691,1158482318081662976,0.25,ContextualInformation,15,siberianWildfires2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
9692,1158482318081662976,0.25,Irrelevant,15,siberianWildfires2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
9736,1158519738684923904,0.25,Location,15,siberianWildfires2020,Unknown,11,0,0,0,0,0,0,0,0,0,1


In [26]:

#clean_dataset(df)
# Numeric ids for known events; events missing from this mapping keep their string eventID
mapping = {'fireColorado2012': 1, 'costaRicaEarthquake2012': 2,
           'floodColorado2013': 3, 'typhoonPablo2012': 4,
           'laAirportShooting2013': 5, 'westTexasExplosion2013': 6,
           'guatemalaEarthquake2012': 7, 'italyEarthquakes2012': 8,
           'philipinnesFloods2012': 9, 'albertaFloods2013': 10,
           'australiaBushfire2013': 11, 'bostonBombings2013': 12,
           'siberianWildfires2019': 46
          }
df = df.replace({'eventID': mapping})
#df.describe()

In [27]:
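def clean_dataset(df):
    """Drop rows containing NaN or infinite values and cast the rest to float64."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Work on a copy instead of mutating the caller's frame
    df = df.dropna()
    # axis must be passed by keyword in recent pandas versions
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)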

df.sort_values(by=['eventID'])

Unnamed: 0,tweet_id,priority,category,postCategories_x,eventID,eventType,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,num
1959,1152222275346804736,0.25,Irrelevant,11,athensEarthquake2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
1849,1152192006208311296,0.50,Location,14,athensEarthquake2020,Unknown,12,14,9,2,0,0,0,0,0,0,4
1848,1152191535791902720,0.25,Irrelevant,11,athensEarthquake2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
1847,1152191136380985344,0.25,Irrelevant,11,athensEarthquake2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
1846,1152191119284944896,0.25,Irrelevant,11,athensEarthquake2020,Unknown,11,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4977,1156990922867073024,0.50,Factoid,14,whaleyBridgeCollapse2020,Unknown,12,5,14,6,0,0,0,0,0,0,4
4978,1156994205073285120,0.75,Location,16,whaleyBridgeCollapse2020,Unknown,12,5,14,16,0,0,0,0,0,0,4
4979,1156994205073285120,0.75,EmergingThreats,16,whaleyBridgeCollapse2020,Unknown,12,5,14,16,0,0,0,0,0,0,4
4973,1156990742725910528,0.50,ContextualInformation,16,whaleyBridgeCollapse2020,Unknown,12,5,14,16,2,0,0,0,0,0,5


In [28]:
df = df.drop(['eventType','category'],axis =1)

df

Unnamed: 0,tweet_id,priority,postCategories_x,eventID,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,num
0,1128285482784366592,0.75,12,papuaNewguineaEarthquake2020,12,6,0,0,0,0,0,0,0,0,2
1,1128285482784366592,0.75,12,papuaNewguineaEarthquake2020,12,6,0,0,0,0,0,0,0,0,2
2,1128285665186197504,0.25,12,papuaNewguineaEarthquake2020,12,6,9,0,0,0,0,0,0,0,3
3,1128285665186197504,0.25,12,papuaNewguineaEarthquake2020,12,6,9,0,0,0,0,0,0,0,3
4,1128285665186197504,0.25,12,papuaNewguineaEarthquake2020,12,6,9,0,0,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15264,1162006062867918848,0.25,16,typhoonKrosa2020,12,14,9,16,0,0,0,0,0,0,4
15265,1162006062867918848,0.25,16,typhoonKrosa2020,12,14,9,16,0,0,0,0,0,0,4
15266,1162006062867918848,0.25,16,typhoonKrosa2020,12,14,9,16,0,0,0,0,0,0,4
15267,1162006062867918848,0.25,16,typhoonKrosa2020,12,14,9,16,0,0,0,0,0,0,4


## Export to CSV


In [29]:
#clean_dataset(df)

# Export to CSV
#merged_df.to_csv("train.csv", index=False)
df.to_csv("../3-csv/dataset.csv", index=False)
df = pd.read_csv("../3-csv/dataset.csv")
df

Unnamed: 0,tweet_id,priority,postCategories_x,eventID,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,cat10,num
0,1128285482784366592,0.75,12,papuaNewguineaEarthquake2020,12.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
1,1128285482784366592,0.75,12,papuaNewguineaEarthquake2020,12.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,1128285665186197504,0.25,12,papuaNewguineaEarthquake2020,12.0,6.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
3,1128285665186197504,0.25,12,papuaNewguineaEarthquake2020,12.0,6.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
4,1128285665186197504,0.25,12,papuaNewguineaEarthquake2020,12.0,6.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15264,1162006062867918848,0.25,16,typhoonKrosa2020,12.0,14.0,9.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,4
15265,1162006062867918848,0.25,16,typhoonKrosa2020,12.0,14.0,9.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,4
15266,1162006062867918848,0.25,16,typhoonKrosa2020,12.0,14.0,9.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,4
15267,1162006062867918848,0.25,16,typhoonKrosa2020,12.0,14.0,9.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,4


In [30]:
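# NOTE: `sum` is referenced without parentheses, so the cell displays the bound method's
# repr (which embeds the whole frame) instead of a total. For per-category counts, call
# category_df['category'].value_counts().
category_df.sum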

<bound method DataFrame.sum of                   tweet_id           category
0      1155671863911051265         Irrelevant
1      1155716993351204864           Location
2      1155716993351204864    MultimediaShare
3      1155716993351204864               News
4      1155914024690814979  InformationWanted
...                    ...                ...
27031  1157292521376092160            Weather
27032  1157292521376092160           Location
27033  1157292521376092160    EmergingThreats
27034  1157292521376092160    MultimediaShare
27035  1157292521376092160               News

[27036 rows x 2 columns]>
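The output above is the repr of a bound method, because `sum` was referenced but not called. If per-category counts were the intent, `value_counts` is one way to get them. The sketch below uses a small made-up frame (`toy_category_df`, with labels chosen for illustration) in the same two-column shape as `category_df`.

```python
import pandas as pd

# Made-up rows in the same (tweet_id, category) layout as category_df.
toy_category_df = pd.DataFrame({
    "tweet_id": [1155671863911051265, 1155716993351204864,
                 1155716993351204864, 1155914024690814979],
    "category": ["Irrelevant", "Location", "News", "Location"],
})

# Number of annotations per category, most frequent first.
category_counts = toy_category_df["category"].value_counts()
print(category_counts)
```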

In [31]:
cat_to_eng = {
    0 : 'Advice',
    1 : 'CleanUp',
    2 : 'ContextualInformation',
    3 : 'Discussion',
    4 : 'Donations',
    5 : 'EmergingThreats',
    6 : 'Factoid',
    7 : 'FirstPartyObservation',
    8 : 'GoodsServices',
    9 : 'Hashtags',
    10 : 'InformationWanted',
    11 : 'Irrelevant',
    12 : 'Location',
    13 : 'MovePeople',
    14 : 'MultimediaShare',
    15 : 'NewSubEvent',
    16 : 'News',
    17 : 'Official',
    18 : 'OriginalEvent',
    19 : 'SearchAndRescue',
    20 : 'Sentiment',
    21 : 'ServiceAvailable',
    22 : 'ThirdPartyObservation',
    23 : 'Volunteer',
    24 : 'Weather',
}
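# `category` already holds names rather than numeric ids, so this replace leaves the frame unchanged
category_df2 = category_df.replace({'category': cat_to_eng})
category_df2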

Unnamed: 0,tweet_id,category
0,1155671863911051265,Irrelevant
1,1155716993351204864,Location
2,1155716993351204864,MultimediaShare
3,1155716993351204864,News
4,1155914024690814979,InformationWanted
...,...,...
27031,1157292521376092160,Weather
27032,1157292521376092160,Location
27033,1157292521376092160,EmergingThreats
27034,1157292521376092160,MultimediaShare
