## Prepare Training Data
### Sources: 
#### 1. Hurricanes Harvey, Irma and Maria
Firoj Alam, Ferda Ofli, Muhammad Imran. CrisisMMD: Multimodal Twitter Datasets from Natural Disasters. To appear at the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.

#### 2. Hurricane Sandy and Joplin Tornado
Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. Practical Extraction of Disaster-Relevant Information from Social Media. In Proceedings of the 22nd international conference on World Wide Web companion, May 2013, Rio de Janeiro, Brazil.

Data downloaded from https://crisisnlp.qcri.org/

In [110]:
import pandas as pd

### 1. Hurricanes Harvey, Irma and Maria
#### Import three separate hurricane tweet archives from original tab-separated files

In [111]:
harvey = pd.read_csv('../data/annotations/hurricane_harvey_final_data.tsv',delimiter='\t')
harvey['event'] = 'harvey'

irma = pd.read_csv('../data/annotations/hurricane_irma_final_data.tsv',delimiter='\t')
irma['event'] = 'irma'

maria = pd.read_csv('../data/annotations/hurricane_maria_final_data.tsv',delimiter='\t')
maria['event'] = 'maria'

#### Merge the three files and examine.

In [112]:
train = pd.concat([harvey,irma,maria], axis=0)
# this is the total number of images, not the total number of tweets
train.shape

(13530, 16)

In [113]:
train.event.value_counts()

maria     4562
irma      4525
harvey    4443
Name: event, dtype: int64
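
The load-tag-merge steps above can also be written as one loop. This is a minimal sketch, not the notebook's own code: the in-memory `io.StringIO` buffers and their toy rows are hypothetical stand-ins for the three TSV files, and `sources`, `frames`, and `events` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the three tab-separated files;
# in real use each value would be a file path instead.
sources = {
    "harvey": io.StringIO("tweet_id\ttweet_text\n1\tflooding downtown\n"),
    "irma": io.StringIO("tweet_id\ttweet_text\n2\tpower is out\n"),
    "maria": io.StringIO("tweet_id\ttweet_text\n3\troads blocked\n"),
}

frames = []
for event, source in sources.items():
    frame = pd.read_csv(source, sep="\t")
    frame["event"] = event  # tag each row with the event it came from
    frames.append(frame)

# Stack the per-event frames into one DataFrame with a fresh index
events = pd.concat(frames, axis=0, ignore_index=True)
print(events["event"].value_counts())
```

Tagging each frame before concatenating keeps the event label attached to every row, which is what makes the per-event counts possible afterwards.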

#### These files have one record per tweet image, not per tweet. Remove duplicates by dropping the image columns, then dropping duplicate rows and nulls.

In [114]:
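# keep only the tweet-level columns (this drops the image columns)
cols = ['tweet_id', 'text_info', 'text_human', 'tweet_text', 'event']
# .copy() so the in-place operations below act on an independent frame
train = train[cols].copy()

# drop duplicates - (due to tweets with multiple images)
train.drop_duplicates(inplace=True)
train.dropna(inplace=True)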

In [115]:
train.shape

(9465, 5)
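
The effect of this step can be illustrated on toy data. The sketch below is an assumption-laden miniature: the `images` frame, its `image_id` column, and its rows are invented for illustration and are not taken from the CrisisMMD files.

```python
import pandas as pd

# Toy data: tweet 1 has two images, so it appears twice;
# tweet 3 has no text and should be dropped as a null.
images = pd.DataFrame({
    "tweet_id": [1, 1, 2, 3],
    "image_id": ["1_0", "1_1", "2_0", "3_0"],
    "tweet_text": ["flooded street", "flooded street", "power out", None],
})

tweets = (
    images[["tweet_id", "tweet_text"]]  # drop the image-level column
    .drop_duplicates()                   # collapse one-row-per-image to one-row-per-tweet
    .dropna()                            # remove rows with missing values
    .reset_index(drop=True)
)

# After the cleanup each tweet should appear exactly once
print(tweets["tweet_id"].is_unique)
```

Checking `is_unique` on the tweet identifier is a cheap way to confirm that the image-level duplication is gone.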

#### Create Tweet Category Target Variable

In [116]:
# this field has the categories
train.text_human.value_counts()

other_relevant_information                4653
rescue_volunteering_or_donation_effort    2635
infrastructure_and_utility_damage          907
not_relevant_or_cant_judge                 717
affected_individuals                       329
injured_or_dead_people                     159
vehicle_damage                              50
missing_or_found_people                     15
Name: text_human, dtype: int64

In [117]:
# the not-informative tweets are the ones coded as "not_relevant_or_cant_judge" in "text_human"
train.text_info.value_counts()

informative        8748
not_informative     717
Name: text_info, dtype: int64
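
The matching counts (717 in both fields) suggest the two labels coincide, and a cross-tabulation can confirm it row by row. The sketch below runs on invented toy rows; only the column names and label strings come from the notebook.

```python
import pandas as pd

# Toy rows using the notebook's column names and label values
labels = pd.DataFrame({
    "text_info": ["informative", "informative",
                  "not_informative", "not_informative"],
    "text_human": ["vehicle_damage", "affected_individuals",
                   "not_relevant_or_cant_judge", "not_relevant_or_cant_judge"],
})

is_not_informative = labels["text_info"] == "not_informative"
is_not_relevant = labels["text_human"] == "not_relevant_or_cant_judge"

# True only if the two flags agree on every row
labels_agree = bool((is_not_informative == is_not_relevant).all())

# Category-by-informativeness counts for a visual check
table = pd.crosstab(labels["text_human"], labels["text_info"])
print(labels_agree)
print(table)
```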

In [118]:
# create numerical y variable coded 1 through 8
train['y'] = train.text_human.map(
    {"other_relevant_information":1,"rescue_volunteering_or_donation_effort":2, 
     "infrastructure_and_utility_damage":3, "not_relevant_or_cant_judge":4,
     "affected_individuals":5, 'injured_or_dead_people':6, 'vehicle_damage':7,
     'missing_or_found_people':8})

In [119]:
# class proportions rather than raw counts
train.y.value_counts(normalize=True)

1    0.491601
2    0.278394
3    0.095827
4    0.075753
5    0.034760
6    0.016799
7    0.005283
8    0.001585
Name: y, dtype: float64
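
`Series.map` with a dictionary silently produces `NaN` for any label that is missing from the dictionary, so it can be worth checking for unmapped labels after building `y`. A minimal sketch, assuming a deliberately incomplete mapping and a misspelled toy label (both hypothetical):

```python
import pandas as pd

# Subset of the category mapping, for illustration only
category_codes = {"other_relevant_information": 1, "vehicle_damage": 7}

# Toy labels; the last one differs in case and will not match any key
labels = pd.Series(
    ["other_relevant_information", "vehicle_damage", "Vehicle_damage"]
)

codes = labels.map(category_codes)

# Labels that did not match any dictionary key end up as NaN
unmapped = sorted(labels[codes.isna()].unique())
print(unmapped)
```

An empty `unmapped` list means every label received a code.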

In [120]:
train.to_pickle('../data/train.pkl')

### 2. Hurricane Sandy and Joplin Tornado
#### Read and merge tweet archives. 
##### Labeled tweets were saved in multiple files for each event, with higher-numbered files having more detailed coding about categories. We only need the first file, which gives us informative / not-informative, and the second file, which gives us the categories for the informative tweets.
### a. Joplin Tornado

In [121]:
# read first-level coding file for Joplin tweets
joplin = pd.read_csv('../data/Joplin_2011_labeled_data/01_personal-informative-other/a131709.csv',
                     index_col='_unit_id')

In [122]:
# this is the top-level coding
joplin.choose_one.value_counts()

Personal only                       794
Informative (Direct or Indirect)    762
Informative (Indirect)              469
Informative (Direct)                265
Other                                94
Name: choose_one, dtype: int64

In [123]:
# 2,384 coded tweets
joplin.shape

(2384, 10)

In [124]:
# read second-level file
joplin2 = pd.read_csv('../data/Joplin_2011_labeled_data/02_informative_caution-infosrc-donation-damage-other/a121571.csv',
                  index_col='_unit_id')

In [125]:
joplin2.shape

(1233, 12)

In [126]:
# this is the second-level coding
joplin2.choose_one.value_counts()

Caution and advice                       436
Information source                       280
Donations of money, goods or services    204
Casualties and damage                    137
Unknown                                  130
People missing, found or seen             46
Name: choose_one, dtype: int64

#### Combine first and second-level Joplin files

In [127]:
# take only the uninformative tweets from the first file, since the 
# informative ones are further coded in the second file (most of them were)
mask = ((joplin['choose_one'] == 'Personal only') | (joplin['choose_one'] == 'Other'))
joplin_uninformative = joplin[mask]

In [128]:
joplin_uninformative.shape

(888, 10)

#### Note: there are 263 more tweets coded as informative in the first-level file than the total number of tweets in the second-level file.
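
The figure of 263 follows directly from the counts printed earlier; the short check below reproduces the arithmetic using only numbers shown in the outputs above.

```python
# First-level Joplin counts, copied from the value_counts output
first_level_counts = {
    "Personal only": 794,
    "Informative (Direct or Indirect)": 762,
    "Informative (Indirect)": 469,
    "Informative (Direct)": 265,
    "Other": 94,
}

# Sum every label that starts with "Informative"
informative_first_level = sum(
    count
    for label, count in first_level_counts.items()
    if label.startswith("Informative")
)

second_level_rows = 1233  # rows in the second-level file (joplin2.shape)

gap = informative_first_level - second_level_rows
print(informative_first_level, gap)  # 1496 263
```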

In [129]:
joplin2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1233 entries, 191082764 to 191083996
Data columns (total 12 columns):
_golden                  1233 non-null bool
_unit_state              1233 non-null object
_trusted_judgments       1233 non-null int64
_last_judgment_at        1233 non-null object
choose_one               1233 non-null object
choose_one:confidence    1233 non-null float64
choose_one_gold          62 non-null object
id                       1233 non-null int64
retweetcount             649 non-null float64
screenname               1221 non-null object
text                     1233 non-null object
userid                   1221 non-null float64
dtypes: bool(1), float64(3), int64(2), object(6)
memory usage: 116.8+ KB


In [130]:
joplin.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2384 entries, 204690712 to 204693117
Data columns (total 10 columns):
_golden                  2384 non-null bool
_unit_state              2384 non-null object
_trusted_judgments       2384 non-null int64
_last_judgment_at        2384 non-null object
choose_one               2384 non-null object
choose_one:confidence    2384 non-null float64
choose_one_gold          78 non-null object
predicted                2383 non-null object
text_no_rt               2384 non-null object
tweet                    2384 non-null object
dtypes: bool(1), float64(1), int64(1), object(7)
memory usage: 188.6+ KB


In [157]:
# rename columns from second file
joplin2.rename(columns={'choose_one':'category','text':'tweet'}, inplace=True)

# combine the two files
# sort=True keeps the alphabetical column order seen below and avoids the FutureWarning
joplin_all = pd.concat([joplin_uninformative, joplin2], axis=0, sort=True)


In [132]:
# make an event identifier that we can use when we merge the files
joplin_all['event'] = 'joplin'

In [133]:
joplin_all.shape

(2121, 16)

In [134]:
joplin_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2121 entries, 204690718 to 191083996
Data columns (total 16 columns):
_golden                  2121 non-null bool
_last_judgment_at        2121 non-null object
_trusted_judgments       2121 non-null int64
_unit_state              2121 non-null object
category                 1233 non-null object
choose_one               888 non-null object
choose_one:confidence    2121 non-null float64
choose_one_gold          86 non-null object
event                    2121 non-null object
id                       1233 non-null float64
predicted                887 non-null object
retweetcount             649 non-null float64
screenname               1221 non-null object
text_no_rt               888 non-null object
tweet                    2121 non-null object
userid                   1221 non-null float64
dtypes: bool(1), float64(4), int64(1), object(10)
memory usage: 267.2+ KB
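
The non-null counts above follow from how `pd.concat` aligns frames whose columns differ: it takes the union of the columns and fills the gaps with `NaN`. A minimal sketch on invented two-row data (the frames `first` and `second` are hypothetical):

```python
import pandas as pd

# Two toy frames that share "tweet" but differ in their label column
first = pd.DataFrame({"tweet": ["a"], "choose_one": ["Other"]}, index=[10])
second = pd.DataFrame({"tweet": ["b"], "category": ["Unknown"]}, index=[20])

# Union of columns; sort=True orders the columns alphabetically
combined = pd.concat([first, second], axis=0, sort=True)

print(list(combined.columns))
print(combined["category"].isna().tolist())
```

Rows from the first frame have no `category`, and rows from the second have no `choose_one`, which mirrors the 1233 / 888 split in the listing.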


### b. Hurricane Sandy

In [135]:
# read first file for Sandy tweets
sandy = pd.read_csv('../data/sandy2012_labeled_data/01_personal-informative-other/a143145.csv',
                  index_col='_unit_id')

In [136]:
# 1,000 coded Sandy tweets
sandy.shape

(1000, 11)

In [137]:
sandy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 221934923 to 221941939
Data columns (total 11 columns):
_golden                  1000 non-null bool
_unit_state              1000 non-null object
_trusted_judgments       1000 non-null int64
_last_judgment_at        1000 non-null object
choose_one               1000 non-null object
choose_one:confidence    1000 non-null float64
choose_one_gold          51 non-null object
nil                      1 non-null object
text_no_rt               1000 non-null object
tweet                    1000 non-null object
user                     1000 non-null object
dtypes: bool(1), float64(1), int64(1), object(8)
memory usage: 86.9+ KB


In [138]:
sandy.choose_one.value_counts()

Informative (Indirect)              386
Personal Only                       296
Other                               161
Informative (Direct or Indirect)     79
Informative (Direct)                 78
Name: choose_one, dtype: int64

In [139]:
# read second-level Sandy file
sandy2 = pd.read_csv('../data/sandy2012_labeled_data/02_informative_caution-infosrc-donation-damage-other/a144267.csv',
                  index_col='_unit_id')

In [140]:
sandy2.choose_one.value_counts()

Casualties and damage                    170
Caution and advice                       144
Unknown                                  125
Information Source                        72
Donations of money, goods or services     32
Name: choose_one, dtype: int64

In [141]:
sandy2.shape

(543, 11)

In [142]:
sandy2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543 entries, 223607030 to 223607572
Data columns (total 11 columns):
_golden                  543 non-null bool
_unit_state              543 non-null object
_trusted_judgments       543 non-null int64
_last_judgment_at        543 non-null object
choose_one               543 non-null object
choose_one:confidence    543 non-null float64
choose_one_gold          41 non-null object
text_no_rt               543 non-null object
tweet                    543 non-null object
type                     543 non-null object
user                     543 non-null object
dtypes: bool(1), float64(1), int64(1), object(8)
memory usage: 47.2+ KB


#### Merge the two Sandy files

In [143]:
mask = ((sandy['choose_one'] == 'Personal Only') | (sandy['choose_one'] == 'Other'))

sandy_uninformative = sandy[mask]

In [144]:
sandy_uninformative.shape

(457, 11)
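
Note that the Joplin file spells the label `Personal only` while the Sandy file uses `Personal Only`. One way to make the same mask work for both, sketched here on toy rows as an alternative and not as the notebook's method, is to lowercase the labels and use `isin`:

```python
import pandas as pd

# Toy first-level labels covering both capitalizations
first_level = pd.DataFrame({
    "choose_one": ["Personal only", "Personal Only", "Other",
                   "Informative (Direct)"],
    "tweet": ["a", "b", "c", "d"],
})

# Lowercased labels that count as uninformative
uninformative_labels = {"personal only", "other"}

# Case-insensitive membership test
mask = first_level["choose_one"].str.lower().isin(uninformative_labels)
uninformative = first_level[mask]
print(uninformative)
```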

In [145]:
# rename columns of 2nd level Sandy file
sandy2.rename(columns={'choose_one':'category'}, 
                 inplace=True)

In [146]:
# merge the uninformative first-level Sandy tweets with the second-level Sandy file
# sort=True keeps the alphabetical column order seen below and avoids the FutureWarning
sandy_all = pd.concat([sandy_uninformative, sandy2], axis=0, sort=True)


In [147]:
# make an event identifier that we can use when we merge the files
sandy_all['event'] = 'sandy'

In [148]:
sandy_all.shape

(1000, 14)

In [149]:
sandy_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 221934923 to 223607572
Data columns (total 14 columns):
_golden                  1000 non-null bool
_last_judgment_at        1000 non-null object
_trusted_judgments       1000 non-null int64
_unit_state              1000 non-null object
category                 543 non-null object
choose_one               457 non-null object
choose_one:confidence    1000 non-null float64
choose_one_gold          65 non-null object
nil                      1 non-null object
text_no_rt               1000 non-null object
tweet                    1000 non-null object
type                     543 non-null object
user                     1000 non-null object
event                    1000 non-null object
dtypes: bool(1), float64(1), int64(1), object(11)
memory usage: 110.4+ KB


In [170]:
# make Sandy-only file (copy so that adding y does not modify sandy_all)
train_sandy = sandy_all.copy()

In [171]:
# create y variable from the Sandy second-level categories
train_sandy['y'] = train_sandy['category'].map(
    {"Casualties and damage":1,"Caution and advice":2, 
     "Unknown":3, "Information Source":4,"Information source":4,"Donations of money, goods or services":5,
    'People missing, found or seen':6})

In [173]:
# uninformative first-level tweets have no category; code them as 0
train_sandy['y'] = train_sandy['y'].fillna(0)

In [174]:
train_sandy.to_pickle('../data/train_sandy.pkl')

### Combine the Joplin and Sandy data into a single DataFrame

In [158]:
# sort=True keeps the alphabetical column order seen below and avoids the FutureWarning
train3 = pd.concat([joplin_all,sandy_all], axis=0, sort=True)


In [159]:
train3.shape

(3121, 19)

In [160]:
train3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3121 entries, 204690718 to 223607572
Data columns (total 19 columns):
_golden                  3121 non-null bool
_last_judgment_at        3121 non-null object
_trusted_judgments       3121 non-null int64
_unit_state              3121 non-null object
category                 1776 non-null object
choose_one               1345 non-null object
choose_one:confidence    3121 non-null float64
choose_one_gold          151 non-null object
event                    1000 non-null object
id                       1233 non-null float64
nil                      1 non-null object
predicted                887 non-null object
retweetcount             649 non-null float64
screenname               1221 non-null object
text_no_rt               1888 non-null object
tweet                    3121 non-null object
type                     543 non-null object
user                     1000 non-null object
userid                   1221 non-null float64
dtypes: bool(1), float64(4), int64(1), object(13)

In [161]:
train3.category.value_counts()

Caution and advice                       580
Casualties and damage                    307
Information source                       280
Unknown                                  255
Donations of money, goods or services    236
Information Source                        72
People missing, found or seen             46
Name: category, dtype: int64
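
The two events spell one category differently (`Information source` and `Information Source`), which is why it shows up twice above. Listing both spellings in the mapping works; an alternative, sketched here on toy data with lowercased dictionary keys (an assumption, not the notebook's mapping), is to normalize case before mapping:

```python
import pandas as pd

# Toy categories: both spellings, one other label, and a missing value
categories = pd.Series(
    ["Information source", "Information Source", "Unknown", None]
)

# Same codes as the notebook, keyed by lowercased label
category_codes = {
    "casualties and damage": 1,
    "caution and advice": 2,
    "unknown": 3,
    "information source": 4,
    "donations of money, goods or services": 5,
    "people missing, found or seen": 6,
}

# Lowercase, map to codes, and use 0 for rows without a category
y = categories.str.lower().map(category_codes).fillna(0).astype(int)
print(y.tolist())  # [4, 4, 3, 0]
```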

In [165]:
# numeric codes for the second-level categories; both spellings of
# "Information source" map to 4
train3['y'] = train3['category'].map(
    {"Casualties and damage":1,"Caution and advice":2, 
     "Unknown":3, "Information Source":4,"Information source":4,"Donations of money, goods or services":5,
    'People missing, found or seen':6})

In [166]:
# tweets without a second-level category (the uninformative ones) get y = 0
train3['y'] = train3['y'].fillna(0)

In [167]:
train3.y.value_counts()

0.0    1345
2.0     580
4.0     352
1.0     307
3.0     255
5.0     236
6.0      46
Name: y, dtype: int64

In [169]:
train3.to_pickle('../data/train3.pkl')