# Leveraging Social Media to Identify Major Historic Flood Events, Notebook 1/3

## Twitter Scraper

#### Contents:
- [Library import](#Library-import)
- [Variable declaration](#Variable-declaration)
- [Twitter scrape](#Twitter-scrape)
- [Save data to csv](#Save-data-to-csv)

### Library import

In [12]:
# Import Libraries
import pandas as pd
import numpy as np
from twitterscraper import query_tweets
import datetime as dt

# Widen column width so more tweet content is visible
pd.set_option('display.max_colwidth', 300)

### Variable declaration

In [42]:
# Set the date range and city of interest for the Twitter scrape here
start_date = dt.date(2014, 1, 1)
end_date = dt.date(2018, 1, 1)
lang = 'en'  # twitterscraper takes a language code (e.g. 'en'), not a language name
city = 'Manila'

### Twitter scrape

In [43]:
# Use the twitterscraper library's query_tweets to scrape all tweets of interest.
# The 'near' operator is not used because it only matches tweets from users who have
# location sharing turned on, and that is only a small number of users.
tweet_list = query_tweets(
    f' ("#flood" OR "#flooding" OR "floodwater" OR "floodwaters" OR "#floodwater" OR "#floodwaters" OR "#flooddamage" OR "#flooddamages" OR "#flooddeath" OR "#flooddeaths") AND ("{city}" OR "{city}flood" OR "{city}flooding" OR "#{city}" OR "#{city}flood" OR "#{city}flooding" OR "#{city}weather") -filter:retweets',
    # ADD TIMESTAMP
    begindate=start_date,
    enddate=end_date,
    lang=lang,
    poolsize=10,
)

INFO: queries: [' ("#flood" OR "#flooding" OR "floodwater" OR "floodwaters" OR "#floodwater" OR "#floodwaters" OR "#flooddamage" OR "#flooddamages" OR "#flooddeath" OR "#flooddeaths") AND ("Manila" OR "Manilaflood" OR "Manilaflooding" OR "#Manila" OR "#Manilaflood" OR "#Manilaflooding" OR "#Manilaweather") -filter:retweets since:2014-01-01 until:2014-05-27', ' ("#flood" OR "#flooding" OR "floodwater" OR "floodwaters" OR "#floodwater" OR "#floodwaters" OR "#flooddamage" OR "#flooddamages" OR "#flooddeath" OR "#flooddeaths") AND ("Manila" OR "Manilaflood" OR "Manilaflooding" OR "#Manila" OR "#Manilaflood" OR "#Manilaflooding" OR "#Manilaweather") -filter:retweets since:2014-05-27 until:2014-10-20', ' ("#flood" OR "#flooding" OR "floodwater" OR "floodwaters" OR "#floodwater" OR "#floodwaters" OR "#flooddamage" OR "#flooddamages" OR "#flooddeath" OR "#flooddeaths") AND ("Manila" OR "Manilaflood" OR "Manilaflooding" OR "#Manila" OR "#Manilaflood" OR "#Manilaflooding" OR "#Manilaweather") -f
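The log above shows the single search string being expanded into one query per date sub-range, each ending in `since:… until:…`. The sketch below reproduces that expansion in plain Python so the pattern is easier to see. The helper names (`build_query`, `split_date_range`, `sub_queries`) and the term lists are hypothetical, and the even split into `poolsize` chunks is only an assumption that happens to match the first two date boundaries visible in the log; it is not necessarily how twitterscraper does it internally.

```python
import datetime as dt

# Hypothetical term lists mirroring the hard-coded query string in the cell above.
FLOOD_TERMS = [
    "#flood", "#flooding", "floodwater", "floodwaters", "#floodwater",
    "#floodwaters", "#flooddamage", "#flooddamages", "#flooddeath", "#flooddeaths",
]
CITY_PATTERNS = [
    "{city}", "{city}flood", "{city}flooding",
    "#{city}", "#{city}flood", "#{city}flooding", "#{city}weather",
]


def build_query(city):
    """Assemble the boolean search string for one city (no leading space)."""
    flood_clause = " OR ".join(f'"{term}"' for term in FLOOD_TERMS)
    city_clause = " OR ".join(f'"{p.format(city=city)}"' for p in CITY_PATTERNS)
    return f"({flood_clause}) AND ({city_clause}) -filter:retweets"


def split_date_range(start, end, n_chunks):
    """Split [start, end] into n_chunks consecutive (since, until) date pairs."""
    total_days = (end - start).days
    bounds = [
        start + dt.timedelta(days=total_days * i // n_chunks)
        for i in range(n_chunks + 1)
    ]
    return list(zip(bounds[:-1], bounds[1:]))


def sub_queries(city, start, end, n_chunks):
    """Append a since:/until: window to the base query for every sub-range."""
    query = build_query(city)
    return [
        f"{query} since:{since} until:{until}"
        for since, until in split_date_range(start, end, n_chunks)
    ]


queries = sub_queries("Manila", dt.date(2014, 1, 1), dt.date(2018, 1, 1), 10)
print(len(queries))                                  # 10
print(queries[0].split(" -filter:retweets ")[1])     # since:2014-01-01 until:2014-05-27
```

Keeping the terms in lists makes it straightforward to rerun the same query for another city by changing only the `city` argument.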

In [44]:
# Save scraped tweets to a dataframe
df = pd.DataFrame(t.__dict__ for t in tweet_list)
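
The line above relies on each scraped tweet object exposing its attributes through `__dict__`, so that every tweet becomes one row and every attribute becomes one column. A minimal sketch of that conversion, using a hypothetical `FakeTweet` class with illustrative field names (not the library's actual schema) and no pandas dependency:

```python
class FakeTweet:
    """Hypothetical stand-in for a scraped tweet object; fields are illustrative."""

    def __init__(self, username, text, likes):
        self.username = username
        self.text = text
        self.likes = likes


tweets = [
    FakeTweet("user_a", "Floodwaters rising in #Manila", 3),
    FakeTweet("user_b", "#flood update for Manila", 0),
]

# vars(t) returns t.__dict__: a mapping of attribute name -> value.
rows = [vars(t) for t in tweets]
columns = list(rows[0])

print(columns)   # ['username', 'text', 'likes']
print(len(rows)) # 2
```

Passing a list of dictionaries like `rows` to `pd.DataFrame` yields one row per dictionary with the keys as column names, which is why the real scrape produces a frame whose column count equals the number of attributes on each tweet object.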

In [45]:
# How many tweets have we gotten?
df.shape

(702, 16)

**# of tweets per city:** <br>

Houston: 33,036 tweets <br>
Manila: 702 tweets <br>


Note: We originally ran the scrape on Bangkok, Kolkata, and Mumbai as well, as further examples of cities in developing countries. Manila returned a number of tweets comparable to those cities and had much more complete meteorological data, so we chose it. Even so, we collected only about 0.02x as many tweets for Manila as for Houston (702 vs. 33,036), which does influence our overall results.
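
The ratio quoted in the note can be checked directly from the two tweet counts listed above. This is only a quick arithmetic sketch; the `tweet_counts` dictionary is a hypothetical convenience, not part of the original notebook.

```python
# Tweet counts reported above for each city.
tweet_counts = {"Houston": 33036, "Manila": 702}

# Fraction of Houston's tweet volume that was collected for Manila.
ratio = tweet_counts["Manila"] / tweet_counts["Houston"]

print(f"{ratio:.3f}")  # 0.021
```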

### Save data to csv

In [47]:
# Save scraped tweets to a CSV file for analysis
df.to_csv(f'../data/tweetscrape_{city}_{start_date}_to_{end_date}.csv', index=False)
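
Because `start_date` and `end_date` are `datetime.date` objects, formatting them into the f-string produces ISO dates in the file name. The sketch below shows the resulting path; `output_path` is a hypothetical helper and the `../data` default simply mirrors the path used above.

```python
import datetime as dt
from pathlib import Path


def output_path(city, start_date, end_date, data_dir="../data"):
    """Build the CSV path used for one city's scrape (hypothetical helper)."""
    return Path(data_dir) / f"tweetscrape_{city}_{start_date}_to_{end_date}.csv"


path = output_path("Manila", dt.date(2014, 1, 1), dt.date(2018, 1, 1))
print(path.name)  # tweetscrape_Manila_2014-01-01_to_2018-01-01.csv

# Before writing, one could make sure the target folder exists:
# path.parent.mkdir(parents=True, exist_ok=True)
# df.to_csv(path, index=False)
```

Creating the directory first avoids an error from `to_csv` when `../data` does not yet exist.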