# Data Collection

Because there are no publicly available datasets created from tweets during the 2020 California wildfires, we created our own. We pulled just over 108,000 tweets between July 1, 2020 and December 31, 2020 using snscrape. The second half of the year had the largest fires, so we chose to focus on that period. We chose snscrape because we needed to search through old tweets and packages like Tweepy only allow standard users to pull from the last seven days. To scrape our tweets, we used a list of keywords that were related to the fires. This list was 'wildfire', 'fire', 'forest fire', 'smoke', 'burn', 'blaze', 'california', and 'warning'.<br >

Scraping by location was tricky. We set our search within 160km around Napa (coordinates: 38.502500, -122.265400) and collected the city names of where each tweet was pulled from. snscrape is not very clear about how it chooses the location information, but after some research we concluded that it was from one of two sources: the location in the user profile or the geolocation of the tweet.<br >

Tweets were pulled using scraper_script.py, found in the main repository.

In [1]:
# imports

import pandas as pd

# Merging, Preliminary Cleaning and Checks

Each of us scraped tweets between two-month periods. From the original raw data, we were interested in keeping the "date", "user", "content", "id" and "user_location". We each used slightly different codes to gather our datasets, so different adjustments were involved in standardizing our data in order to merge them seamlessly later.

#### July through August 2020

In [2]:
# importing individual datasets

blaze_78 = pd.read_csv('./datasets/tweets_78/blaze78.csv')
burn_78 = pd.read_csv('./datasets/tweets_78/burn78.csv')
california_78 = pd.read_csv('./datasets/tweets_78/california78.csv')
fire_78 = pd.read_csv('./datasets/tweets_78/fire78.csv')
forest_fire_78 = pd.read_csv('./datasets/tweets_78/forest_fire78.csv')
smoke_78 = pd.read_csv('./datasets/tweets_78/smoke78.csv')
warning_78 = pd.read_csv('./datasets/tweets_78/warning78.csv')
wildfire_78 = pd.read_csv('./datasets/tweets_78/wildfire78.csv')

In [3]:
# adding a column to each dataset with the associated keyword, so we remember after merging

blaze_78['keyword'] = 'blaze'
burn_78['keyword'] = 'burn'
california_78['keyword'] = 'california'
fire_78['keyword'] = 'fire'
forest_fire_78['keyword'] = 'forest fire'
smoke_78['keyword'] = 'smoke'
warning_78['keyword'] = 'warning'
wildfire_78['keyword'] = 'wildfire'

In [4]:
blaze_78.shape, burn_78.shape, california_78.shape, fire_78.shape, forest_fire_78.shape, smoke_78.shape, warning_78.shape, wildfire_78.shape

((64, 29),
 (1028, 29),
 (28934, 29),
 (8333, 29),
 (100, 6),
 (3375, 29),
 (617, 6),
 (1116, 29))

In [5]:
# combining all tweets from July and August

df_78 = blaze_78.append([burn_78, california_78, fire_78, forest_fire_78, smoke_78, warning_78, wildfire_78], ignore_index=True)

In [6]:
df_78.shape

(43567, 30)

In [7]:
# all original columns

df_78.columns

Index(['url', 'date', 'content', 'renderedContent', 'id', 'user', 'replyCount',
       'retweetCount', 'likeCount', 'quoteCount', 'conversationId', 'lang',
       'source', 'sourceUrl', 'sourceLabel', 'outlinks', 'tcooutlinks',
       'media', 'retweetedTweet', 'quotedTweet', 'inReplyToTweetId',
       'inReplyToUser', 'mentionedUsers', 'coordinates', 'place', 'hashtags',
       'cashtags', 'user_location', 'keyword', 'Unnamed: 0'],
      dtype='object')

In [8]:
# dropping columns

df_78.drop(columns = ['url', 'renderedContent', 'replyCount',
       'retweetCount', 'likeCount', 'quoteCount', 'conversationId', 'lang',
       'source', 'sourceUrl', 'sourceLabel', 'outlinks', 'tcooutlinks',
       'media', 'retweetedTweet', 'quotedTweet', 'inReplyToTweetId',
       'inReplyToUser', 'mentionedUsers', 'coordinates', 'place', 'hashtags',
       'cashtags', 'Unnamed: 0', 'id'], inplace=True)

In [9]:
# extract tweet ID
def twid(x):
    twid = x.split(",")[1][7:]
    return twid

# map to dataframe, create new column
df_78['id'] = df_78['user'].map(lambda x: twid(x))

#### September through October 2020

In [10]:
# reading in individual datasets

warning_910 = pd.read_csv('./datasets/tweets_910/warning_910.csv')
california_910 = pd.read_csv('./datasets/tweets_910/california_910.csv')
burn_910 = pd.read_csv('./datasets/tweets_910/burn_910.csv')
forest_fire_910 = pd.read_csv('./datasets/tweets_910/forest_fire_910.csv')
blaze_910 = pd.read_csv('./datasets/tweets_910/blaze_910.csv')
smoke_910 = pd.read_csv('./datasets/tweets_910/smoke_910.csv')
fire_910 = pd.read_csv('./datasets/tweets_910/fire_910.csv')
wildfire_910 = pd.read_csv('./datasets/tweets_910/wildfire_910.csv')

In [11]:
# adding column with keyword

warning_910['keyword'] = 'warning'
california_910['keyword'] = 'california'
burn_910['keyword'] = 'burn'
forest_fire_910['keyword'] = 'forest fire'
blaze_910['keyword'] = 'blaze'
smoke_910['keyword'] = 'smoke'
fire_910['keyword'] = 'fire'
wildfire_910['keyword'] = 'wildfire'

In [12]:
# checking sizes of datasets imported (tweets from Sept 1 - October 31)

warning_910.shape, california_910.shape, burn_910.shape, forest_fire_910.shape, blaze_910.shape, smoke_910.shape, fire_910.shape, wildfire_910.shape

((504, 6),
 (10000, 6),
 (952, 6),
 (235, 6),
 (92, 6),
 (3674, 6),
 (8330, 6),
 (1297, 6))

In [13]:
# join dataframes

df_910 = warning_910.append([california_910, burn_910, forest_fire_910, blaze_910, smoke_910, fire_910, wildfire_910], ignore_index=True)

In [14]:
# total rows

df_910.shape

(25084, 6)

In [15]:
# dropping extra index

df_910.drop(columns = ['Unnamed: 0'], inplace=True)

In [16]:
# extract tweet ID
def twid(x):
    twid = x.split(",")[1][7:]
    return twid

# map to dataframe, create new column
df_910['id'] = df_910['user'].map(lambda x: twid(x))

#### November through December 2020

In [17]:
# importing individual dataframes

blaze_1112 = pd.read_csv('./datasets/tweets_1112/Blaze.csv')
burn_1112 = pd.read_csv('./datasets/tweets_1112/Burn.csv')
california_1112 = pd.read_csv('./datasets/tweets_1112/California.csv')
fire_1112 = pd.read_csv('./datasets/tweets_1112/Fire.csv')
forest_fire_1112 = pd.read_csv('./datasets/tweets_1112/Forest_Fire.csv')
smoke_1112 = pd.read_csv('./datasets/tweets_1112/Smoke.csv')
warning_1112 = pd.read_csv('./datasets/tweets_1112/Warning.csv')
wildfire_1112 = pd.read_csv('./datasets/tweets_1112/Wildfire.csv')
wildfires_1112 = pd.read_csv('./datasets/tweets_1112/Wildfires.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [18]:
# adding keyword column

blaze_1112['keyword'] = 'blaze'
burn_1112['keyword'] = 'burn'
california_1112['keyword'] = 'california'
fire_1112['keyword'] = 'fire'
forest_fire_1112['keyword'] = 'forest fire'
smoke_1112['keyword'] = 'smoke'
warning_1112['keyword'] = 'warning'
wildfire_1112['keyword'] = 'wildfire'
wildfires_1112['keyword'] = 'wildfires'

In [19]:
# checking shapes

blaze_1112.shape, burn_1112.shape, california_1112.shape, fire_1112.shape, forest_fire_1112.shape, smoke_1112.shape, warning_1112.shape, wildfire_1112.shape, wildfires_1112.shape

((68, 31),
 (683, 31),
 (33745, 31),
 (3596, 31),
 (30, 31),
 (1410, 31),
 (360, 31),
 (141, 31),
 (141, 31))

In [20]:
df_1112 = blaze_1112.append([burn_1112, california_1112, fire_1112, forest_fire_1112, smoke_1112, warning_1112, wildfire_1112, wildfires_1112], ignore_index=True)

In [21]:
df_1112.drop(columns = ['Unnamed: 0', 'url', 'renderedContent',
       'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
       'conversationId', 'lang', 'source', 'sourceUrl', 'sourceLabel',
       'outlinks', 'tcooutlinks', 'media', 'retweetedTweet', 'quotedTweet',
       'inReplyToTweetId', 'inReplyToUser', 'mentionedUsers', 'hashtags', 'cashtags', 'Keyword', 'coordinates', 'place'], inplace=True)

In [22]:
df_1112.shape

(40174, 6)

# Data Cleaning

Since our data was manually collected, we had to clean it thoroughly to ensure that everything is uniform and will merge seamlessly.

In [23]:
# function used to extract tweet id from "user" column

def twid(x):
    twid = x.split(",")[1][7:]
    return twid

# Feature Engineering

In [24]:
# pulling just username
def username(x):
    username = x.split(",")[0][14:]
    username = username.replace("'", "")
    return username

# map to dataframe, create new column
df_78['username'] = df_78['user'].map(lambda x: username(x))
df_910['username'] = df_910['user'].map(lambda x: username(x))
df_1112['username'] = df_1112['user'].map(lambda x: username(x))

In [25]:
# this returns boolean value for whether or not a user is verified
def verified(x):
    verified = x.split("'verified':")[1].split(",")[0]
    verified = verified.strip()
    if verified == 'True':
        return int(1)
    else:
        return int(0)

# map to dataframe, create new column
df_78['verified'] = df_910['user'].map(lambda x: verified(x))
df_910['verified'] = df_910['user'].map(lambda x: verified(x))
df_1112['verified'] = df_910['user'].map(lambda x: verified(x))

In [26]:
# does this tweet appear more than once in the dataset?

df_78['duplicate'] = df_78.duplicated(subset = ['id'], keep=False)
df_910['duplicate'] = df_910.duplicated(subset = ['id'], keep=False)
df_1112['duplicate'] = df_1112.duplicated(subset = ['id'], keep=False)

df_78['duplicate'] = df_78['duplicate'].map({True: 1, False: 0})
df_910['duplicate'] = df_910['duplicate'].map({True: 1, False: 0})
df_1112['duplicate'] = df_1112['duplicate'].map({True: 1, False: 0})

# Final Checks

In [36]:
column_names = ["date", "user", "content", "id", "user_location", "keyword", "username", "verified", "duplicate"]
df_78 = df_78.reindex(columns=column_names)
df_78.head()

Unnamed: 0,date,user,content,id,user_location,keyword,username,verified,duplicate
0,2020-08-30 08:22:38+00:00,"{'username': 'mbvukutaphiri', 'id': 254791401,...",Happy listening folks. 🎻🎸🎺🎷🥁 Be and stay bless...,254791401,"Davis, California, USA",blaze,mbvukutaphiri,0.0,1
1,2020-08-28 23:20:42+00:00,"{'username': 'ChrispyKremeKim', 'id': 60667502...",Blaze it,606675029,,blaze,ChrispyKremeKim,0.0,0
2,2020-08-27 19:37:03+00:00,"{'username': 'DonovanTroi', 'id': 331170446, '...",#TBT when “The Voice” @DonovanTroi was serving...,331170446,"Antioch, California",blaze,DonovanTroi,1.0,0
3,2020-08-27 01:00:12+00:00,"{'username': 'UnderCoverToni', 'id': 108567669...",Koihime Enbu RyoRaiRai\nFighting EX Layer\nStr...,1085676697482014720,,blaze,UnderCoverToni,0.0,1
4,2020-08-25 19:33:02+00:00,"{'username': '_Victorres_', 'id': 298651233, '...",@BLAZE_4K_ You ain’t deleting shit,298651233,"Bay Area, CA",blaze,_Victorres_,0.0,1


In [28]:
column_names = ["date", "user", "content", "id", "user_location", "keyword", "username", "verified", "duplicate"]
df_910 = df_910.reindex(columns=column_names)
df_910.head()

Unnamed: 0,date,user,content,id,user_location,keyword,username,verified,duplicate
0,2020-10-31 21:55:45+00:00,"{'username': 'solarpaddy', 'id': 1456056540, '...",My pilot friend tells me that BA is warning al...,1456056540,Berkeley/Ohlone Territory,warning,solarpaddy,0,1
1,2020-10-31 21:02:40+00:00,"{'username': 'AronHunt', 'id': 16213424, 'disp...",@sacbee_news Warning: Paywall... And also left...,16213424,"West Coast, USA",warning,AronHunt,0,0
2,2020-10-31 18:42:33+00:00,"{'username': 'EASki', 'id': 22224654, 'display...","Warning ⚠️ @ Oakland, California https://t.co/...",22224654,"Oakland, California",warning,EASki,1,1
3,2020-10-31 13:30:05+00:00,"{'username': 'Journeyman15', 'id': 141393104, ...",I was actually just listening to a classical s...,141393104,"Newcastle, California",warning,Journeyman15,0,1
4,2020-10-31 03:59:02+00:00,"{'username': 'VirtuallyKim', 'id': 206525910, ...",My Canadian husband is disappointed that you c...,206525910,"Silicon Valley, CA",warning,VirtuallyKim,0,1


In [29]:
column_names = ["date", "user", "content", "id", "user_location", "keyword", "username", "verified", "duplicate"]
df_1112 = df_1112.reindex(columns=column_names)
df_1112.head()

Unnamed: 0,date,user,content,id,user_location,keyword,username,verified,duplicate
0,2020-12-27 21:59:11+00:00,"{'username': 'Karkarmari', 'id': 7215441369526...","An inspirational woman, who continues to blaze...",1343315517126238208,"Palo Alto, CA",blaze,Karkarmari,0.0,0
1,2020-12-27 02:54:32+00:00,"{'username': 'idontcosplay408', 'id': 10880060...",I have not a clue on a lot of guitar tones but...,1343027456425353222,"San Jose, CA",blaze,idontcosplay408,0.0,0
2,2020-12-27 01:57:06+00:00,"{'username': '_Victorres_', 'id': 298651233, '...",@Wario64 @BLAZE_4K_,1343013004007198720,"Bay Area, CA",blaze,_Victorres_,1.0,0
3,2020-12-26 21:44:13+00:00,"{'username': 'YunggsavageT', 'id': 972849282, ...",Kome Take Ah Blaze w/Me &amp; We gone see the ...,1342949361718878215,"Studewood ,TX",blaze,YunggsavageT,0.0,0
4,2020-12-26 02:51:54+00:00,"{'username': '_Victorres_', 'id': 298651233, '...",@BLAZE_4K_ Appreciate you. Merry Christmas to ...,1342664406845382656,"Bay Area, CA",blaze,_Victorres_,0.0,0


In [30]:
df_78.to_csv('./datasets/all_78.csv')
df_910.to_csv('./datasets/all_910.csv')
df_1112.to_csv('./datasets/all_1112.csv')

# Final Merge and Save

In [31]:
df_final = df_78.append([df_910, df_1112], ignore_index=True)

In [32]:
df_final.to_csv('./datasets/finaldata.csv')

In [33]:
df_final.shape

(108825, 9)

In [34]:
df_final.isnull().sum()

date                 0
user                 0
content              0
id                   0
user_location    10031
keyword              0
username             0
verified         33573
duplicate            0
dtype: int64

In [2]:
df_final = pd.read_csv('./datasets/finaldata.csv')

Since we made this dataframe from scratch, we needed to label the tweets by hand. We are interested in identifying tweets with relevant, helpful information for victims of the fire. To do this, we searched through a few thousand of the tweets and noted words that appeared in the sort of tweets we were hoping to identify. Additionally, we found a paper about mining intellgence from social media posts during disasters and used words from their list that were appropriate in this context.<br >

We created a list with those "important" words (imp_words below) and then wrote a function to assign a "1" to tweets with a helpful word and a "0" to tweets that are not helpful or relevant.

In [10]:
def relevant(x):
    imp_words = ['fires', 'burning', 'smoke', 'calfire', 'lightning', 'evacuation', 'haze', 'hazey', 'hazy', 'safe', 'pray', 'donation', 'donate', 'petition', 'power', 'blackout', 'scu', 'czu', 'lnu', 'prepare', 'firefighter', 'emergency', 'shelter', 'news', 'responder', 'red cross', '911', 'damage', 'displace', 'victim', 'impact', 'injury', 'structure', 'volunteer', 'smog', 'brush', 'update', 'near']
    if any(imp_word in x.lower() for imp_word in imp_words) == True:
        return int(1)
    else:
        return int(0)

# map to dataframe, create new column
df_final['relevant'] = df_final['content'].map(lambda x: relevant(x))
df_final.head(20)

Unnamed: 0.1,Unnamed: 0,date,user,content,id,user_location,keyword,username,verified,duplicate,relevant
0,0,2020-08-30 08:22:38+00:00,"{'username': 'mbvukutaphiri', 'id': 254791401,...",Happy listening folks. 🎻🎸🎺🎷🥁 Be and stay bless...,254791401,"Davis, California, USA",blaze,mbvukutaphiri,0.0,1,1
1,1,2020-08-28 23:20:42+00:00,"{'username': 'ChrispyKremeKim', 'id': 60667502...",Blaze it,606675029,,blaze,ChrispyKremeKim,0.0,0,0
2,2,2020-08-27 19:37:03+00:00,"{'username': 'DonovanTroi', 'id': 331170446, '...",#TBT when “The Voice” @DonovanTroi was serving...,331170446,"Antioch, California",blaze,DonovanTroi,1.0,0,0
3,3,2020-08-27 01:00:12+00:00,"{'username': 'UnderCoverToni', 'id': 108567669...",Koihime Enbu RyoRaiRai\nFighting EX Layer\nStr...,1085676697482014720,,blaze,UnderCoverToni,0.0,1,0
4,4,2020-08-25 19:33:02+00:00,"{'username': '_Victorres_', 'id': 298651233, '...",@BLAZE_4K_ You ain’t deleting shit,298651233,"Bay Area, CA",blaze,_Victorres_,0.0,1,0
5,5,2020-08-23 21:27:35+00:00,"{'username': 'CallMeDd_Ttv', 'id': 1453083032,...",@Blaze_4real U can play for 10 hours for $5,1453083032,,blaze,CallMeDd_Ttv,0.0,0,0
6,6,2020-08-23 19:52:52+00:00,"{'username': 'whoadoggie', 'id': 19524255, 'di...",@PSU_Blaze @notcapnamerica I was thinking the ...,19524255,"San Francisco, CA",blaze,whoadoggie,0.0,1,0
7,7,2020-08-23 18:28:11+00:00,"{'username': 'spncrmr', 'id': 1581999932, 'dis...",blaze but you can never be too prepared. I kno...,1581999932,"San Jose, CA",blaze,spncrmr,0.0,1,1
8,8,2020-08-22 03:53:53+00:00,"{'username': 'kentnish', 'id': 14879434, 'disp...",Firefighters from the Ben Lomond Fire Dept wor...,14879434,"Washington, DC",blaze,kentnish,0.0,1,1
9,9,2020-08-22 01:31:47+00:00,"{'username': 'MaggieAngst', 'id': 3325425641, ...",Friday evening update on the #CZUFire in Santa...,3325425641,"San Jose, CA",blaze,MaggieAngst,0.0,1,1


In [11]:
df_final.groupby(by = ['relevant']).count()

Unnamed: 0_level_0,Unnamed: 0,date,user,content,id,user_location,keyword,username,verified,duplicate
relevant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,83633,83633,83633,83633,83633,76054,83633,83633,61744,83633
1,25192,25192,25192,25192,25192,22740,25192,25192,13508,25192


In [13]:
df_final.drop(columns = ['Unnamed: 0'])

Unnamed: 0,date,user,content,id,user_location,keyword,username,verified,duplicate,relevant
0,2020-08-30 08:22:38+00:00,"{'username': 'mbvukutaphiri', 'id': 254791401,...",Happy listening folks. 🎻🎸🎺🎷🥁 Be and stay bless...,254791401,"Davis, California, USA",blaze,mbvukutaphiri,0.0,1,1
1,2020-08-28 23:20:42+00:00,"{'username': 'ChrispyKremeKim', 'id': 60667502...",Blaze it,606675029,,blaze,ChrispyKremeKim,0.0,0,0
2,2020-08-27 19:37:03+00:00,"{'username': 'DonovanTroi', 'id': 331170446, '...",#TBT when “The Voice” @DonovanTroi was serving...,331170446,"Antioch, California",blaze,DonovanTroi,1.0,0,0
3,2020-08-27 01:00:12+00:00,"{'username': 'UnderCoverToni', 'id': 108567669...",Koihime Enbu RyoRaiRai\nFighting EX Layer\nStr...,1085676697482014720,,blaze,UnderCoverToni,0.0,1,0
4,2020-08-25 19:33:02+00:00,"{'username': '_Victorres_', 'id': 298651233, '...",@BLAZE_4K_ You ain’t deleting shit,298651233,"Bay Area, CA",blaze,_Victorres_,0.0,1,0
...,...,...,...,...,...,...,...,...,...,...
108820,2020-11-06 01:13:59+00:00,"{'username': 'MeghanMacaluso', 'id': 212676843...",@OaklandFireCA Not two weeks ago you stopped a...,1324520370146213888,"Oakland, California",wildfires,MeghanMacaluso,,1,0
108821,2020-11-05 06:29:25+00:00,"{'username': 'SatansMlt', 'id': 13175785292157...",Welcome SATAN now and be safe clean water but ...,1324237363791290368,"Palo Alto, CA",wildfires,SatansMlt,,1,1
108822,2020-11-04 14:54:17+00:00,"{'username': 'k1sep1', 'id': 222673911, 'displ...",@n_th_n_ Like how Trump denied Californians fe...,1324002029874810881,"Hayward, CA",wildfires,k1sep1,,1,0
108823,2020-11-03 18:48:04+00:00,"{'username': 'InsuringCAL', 'id': 20453796, 'd...",Good news! #wildfires https://t.co/2OdeLTdqEx,1323698478111846401,California,wildfires,InsuringCAL,,1,1


In [14]:
df_final.to_csv('./datasets/finaldata_label.csv')