# Notebook for importing Twitter data for Hurricane Sandy

The Twitter dataset is from mdredze.

In [36]:
%matplotlib inline
import sys
import os
sys.path.append(os.path.abspath('../'))

import pandas as pd
import pymongo
import twitterinfrastructure.twitter_sandy as ts
import importlib
importlib.reload(ts)

#os.chdir('../')
print(os.getcwd())

C:\dev\research\socialsensing\notebooks


## Hydrate tweet IDs into tweets using Hydrator.
1. Run the following cell to convert the raw mdredze Sandy tweet IDs file into an interim file of tweet IDs in the format that Hydrator requires.
1. Use [Hydrator](https://github.com/DocNow/hydrator) to hydrate the "data/interim/sandy-tweetids.txt" file. Hydrating on 03-14-2018 created a 13.3 GB JSON file with ??? tweets.

In [37]:
# create interim file with only tweet ids for hydration using Hydrator 
# (6,554,744 tweet ids, 124.5 MB)
# takes ~1 min (3.1 GHz Intel Core i7, 16 GB 1867 MHz DDR3)
path = "data/raw/release-mdredze.txt"
write_path = "data/interim/sandy-tweetids.txt"
num_tweets = ts.create_hydrator_tweetids(path=path, write_path=write_path, 
                                         filter_sandy=False, progressbar=False, verbose=1)

2019-05-02 13:30:54 : Started converting tweet ids from data/raw/release-mdredze.txt to Hydrator format.



FileNotFoundError: [Errno 2] No such file or directory: 'data/interim/sandy-tweetids.txt'
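The cause of the error above is not confirmed in this notebook. One plausible cause is that the working directory is `notebooks` (see the `os.getcwd()` output), so the relative path `data/interim/...` points to a folder that does not exist. The following is a minimal sketch of resolving the write path against the project root and creating the folder before writing. The helper name `prepare_write_path` and the `project_root` argument are hypothetical and not part of `twitter_sandy`.

```python
from pathlib import Path


def prepare_write_path(write_path, project_root="."):
    """Resolve write_path against project_root and create its parent folder.

    Hypothetical helper (not part of twitter_sandy). Returns the absolute path.
    """
    full_path = Path(project_root).resolve() / write_path
    # Create data/interim (or any missing parent folders) if needed.
    full_path.parent.mkdir(parents=True, exist_ok=True)
    return full_path


# Example usage, assuming the notebook runs from the notebooks/ folder:
# write_path = str(prepare_write_path("data/interim/sandy-tweetids.txt",
#                                     project_root="../"))
```

The result is converted to `str` in the usage example because it is not known whether `ts.create_hydrator_tweetids` accepts `Path` objects.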

## Import hydrated tweets into a MongoDB database.

In [35]:
# import tweets (4799665 tweets out of 4799665 lines, 12.2 GB total doc size)
# takes ~ 40 mins (3.1 GHz Intel Core i7, 16 GB 1867 MHz DDR3)
# path = 'data/processed/sandy-tweets-20180314.json'
path = 'E:/Work/projects/twitterinfrastructure/data/processed/sandy-tweets-20180314.json'
collection = 'tweets'
db_name = 'sandy'
db_instance = 'mongodb://localhost:27017/'

insert_num = ts.insert_tweets(path, collection=collection, db_name=db_name, 
                              db_instance=db_instance, progressbar=True,
                              overwrite=True, verbose=1)


2019-05-02 13:18:14 : Started inserting tweets from "E:/Work/projects/twitterinfrastructure/data/processed/sandy-tweets-20180314.json" to tweets collection in sandy database.

2019-05-02 13:18:14 : Dropped tweets collection (if exists).



HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7105: character maps to <undefined>
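A `'charmap' codec` error like the one above typically appears on Windows when a file is opened without an explicit encoding, because Python then falls back to the locale code page. How `ts.insert_tweets` opens the file is not shown here, so the following is only a sketch of reading a line-delimited tweet file with an explicit UTF-8 encoding. It assumes one JSON tweet per line, and the function name `iter_tweets` is hypothetical.

```python
import json


def iter_tweets(path, encoding="utf-8"):
    """Yield one parsed tweet per non-empty line of a line-delimited JSON file.

    Hypothetical helper; assumes the file is UTF-8 with one JSON object per line.
    """
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Example usage (path taken from the cell above):
# for tweet in iter_tweets('data/processed/sandy-tweets-20180314.json'):
#     ...
```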

## Import taxi_zones GeoJSON into a MongoDB database.

1. Open terminal.
1. Change to the twitterinfrastructure project home directory. For example, run the following (based on my directory structure):

	$ cd Documents/projects/twitterinfrastructure

1. Use mongoimport to import the taxi_zones_crs4326_mod.geojson file into the database by running the following in a terminal (not the MongoDB shell). Note the double dashes in front of the db, collection, file, and jsonArray arguments.

	$ mongoimport --db sandy --collection taxi_zones --file "data/processed/taxi_zones_crs4326_mod.geojson" --jsonArray

1. Run the following cell to create a geosphere index in the taxi_zones collection.

In [2]:
# create geosphere index in taxi_zones collection
db_instance = 'mongodb://localhost:27017/'
db_name = 'sandy'
zones_collection = 'taxi_zones'
#db_name = 'sandy_test'
#zones_collection = 'taxi_zones_test'
client = pymongo.MongoClient(db_instance)
db = client[db_name]
db[zones_collection].create_index([("geometry", pymongo.GEOSPHERE)])

# count_documents replaces Cursor.count(), which was removed in PyMongo 4
num_zones = db[zones_collection].count_documents({})
print('{count} taxi zones found in imported taxi_zones GeoJSON file.'.format(
    count=num_zones))

263 taxi zones found in imported taxi_zones GeoJSON file.
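The geosphere (2dsphere) index created above supports geospatial queries such as finding the zone that contains a point. The following is a minimal sketch of building a `$geoIntersects` filter for that purpose. The function name `point_in_zone_query` and the example coordinates are hypothetical, and this is not necessarily how the project maps tweets to zones.

```python
def point_in_zone_query(longitude, latitude):
    """Build a MongoDB filter matching zones whose geometry contains a point.

    Hypothetical helper. GeoJSON coordinates are ordered [longitude, latitude].
    """
    return {
        "geometry": {
            "$geoIntersects": {
                "$geometry": {
                    "type": "Point",
                    "coordinates": [longitude, latitude],
                }
            }
        }
    }


# Example usage with the db object from the cell above (illustrative point):
# zone = db['taxi_zones'].find_one(point_in_zone_query(-73.9857, 40.7484))
```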


## Import nyiso_zones GeoJSON into a MongoDB database.

1. Open terminal.
1. Change to the twitterinfrastructure project home directory. For example, run the following (based on my directory structure):

	$ cd Documents/projects/twitterinfrastructure

1. Use mongoimport to import the 'nyiso-zones-crs4326-mod.geojson' file into the database by running the following in a terminal (not the MongoDB shell). Note the double dashes in front of the db, collection, file, and jsonArray arguments. Delete any existing nyiso_zones collection in the database first, because the command appends rather than overwrites.

	This GeoJSON was created by manually querying and copying the NYISO zone GeoJSONs from [here](https://services1.arcgis.com/Lsfphzk53dXVltQC/arcgis/rest/services/NYISO_Zones/FeatureServer/0/query?outFields=*&where=1%3D1) (linked from [here](https://hub.arcgis.com/items/3a510da542c74537b268657f63dc2ce4)) to the 'data/raw/nyiso/' directory. Those individual zone GeoJSONs were then combined into the 'nyiso.geojson' file and loaded into QGIS 3 (version 3.2) using the 'Add Vector Layer' option. Individual zones were visualized by setting the layer's symbology to categorized. The layer was then exported from QGIS to a GeoJSON file with the EPSG:4326 CRS.

	$ mongoimport --db sandy --collection nyiso_zones --file "data/processed/nyiso-zones-crs4326-mod.geojson" --jsonArray

1. Run the following cell to create a geosphere index in the nyiso_zones collection and add the properties.zone_id field to each zone in the collection.

In [4]:
# create geosphere index in nyiso_zones collection
db_instance = 'mongodb://localhost:27017/'
db_name = 'sandy'
zones_collection = 'nyiso_zones'
client = pymongo.MongoClient(db_instance)
db = client[db_name]
db[zones_collection].create_index([("geometry", pymongo.GEOSPHERE)])
# count_documents replaces Cursor.count(), which was removed in PyMongo 4
num_zones = db[zones_collection].count_documents({})
print('{count} nyiso zones found in imported nyiso_zones GeoJSON file.'.format(
    count=num_zones))

# add zone_id to nyiso_zones collection
zones_path = 'data/raw/nyiso/nyiso-zones.csv'
df = pd.read_csv(zones_path)
for abbrev, zone_id in zip(df['abbrev'], df['zone_id']):
    db[zones_collection].update_one(
        {"properties.Zone": abbrev},
        {"$set": {"properties.zone_id": zone_id}}
    )

11 nyiso zones found in imported nyiso_zones GeoJSON file.
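Both mongoimport commands above use `--jsonArray`, which expects the file to be a JSON array of documents, whereas a standard GeoJSON export is a single FeatureCollection object. The notebook does not say how the "mod" files were produced, so the following is only a sketch, under the assumption that they contain the bare array of features. The function name `feature_collection_to_array` is hypothetical.

```python
import json


def feature_collection_to_array(geojson_path, write_path):
    """Write the features of a GeoJSON FeatureCollection as a plain JSON array.

    Hypothetical helper; the output is suitable for `mongoimport --jsonArray`.
    Returns the number of features written.
    """
    with open(geojson_path, "r", encoding="utf-8") as f:
        collection = json.load(f)
    features = collection["features"]
    with open(write_path, "w", encoding="utf-8") as f:
        json.dump(features, f)
    return len(features)
```

Each feature then becomes one document with `geometry` and `properties` fields, which matches the field names used by the index and update cells above.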
