# Extracting location from tweets with _no_ geotag, translating location into coordinates, and mapping coordinates. 

Not all Tweets reveal their geo tags. Therefore, when road closures are tweeted and the metadata does _not_ include a geotag, what is the best way to extract location names from text? In this notebook, we experiment with the best ways to extract location from text in Tweets with no geotags. 

### Contents
- [**Import Libraries**](#Import-Libraries)
- [**Load Tweets**](#Load-Tweets)
- [**Pre-Processing: spaCy**](#Pre-Processing:-spaCy)
  - [Original Location](#Original-Location)
  - [Modified Text A](#Modified-Text-A)
  - [Modified Text B](#Modified-Text-B)
  - [Modified Text C](#Modified-Text-C)
  - [Modified Text D](#Modified-Text-D)
  - [Modified Text E](#Modified-Text-E)
- [**Extract Location from Text**](#Extract-Location-from-Text)
  - [Location-Extraction-Function](#Location-Extraction-Function)
  - [Extract Location from all modified columns](#Extract-Location-from-all-modified-columns)
  - [Combine all unique locations from all Modifications](#Combine-all-unique-locations-from-all-Modifications)
- [**Show-DataFrame-with-text-and-location-comparison**](#Show-DataFrame-with-text-and-location-comparison)
- [**Assess Performance of Location Extraction for Each Version of Modified Text**](#Assess-Performance-of-Location-Extraction-for-Each-Version-of-Modified-Text)
- [**spaCy Conclusion**](#spaCy-Conclusion)
- [**Convert Locations to Coordinates**](#Convert-Locations-to-Coordinates)
  - [Save data to csv](#save-data-to-csv)
- [**Mapping**](#Mapping)
  - [Load Data](#Load-Data)
  - [Convert Location to Coordinates](#Convert-Location-to-Coordiantes)
  - [Load Map](#Load-Map)

# Import Libraries

In [1]:
import pandas as pd
import re
import spacy
import string

# For mapping
import folium
import os
import geocoder

# Load Tweets
 - These are tweets that have the words "road" and "closed" in them.  

In [2]:
# Read in csv with Tweets
df_twitter_closures = pd.read_csv('../data/tweets_ngram_spacy.csv')

# Drop columns so that only tweets are shown
df_twitter_closures.drop(columns=['Unnamed: 0',
                                  'date',
                                  'keywords',
                                  'location',
                                  'source',
                                  'ngrams',
                                  'spacy_location',
                                  'spacy_location_no_ngram'], inplace=True)
# Print DF shape
print(df_twitter_closures.shape)

# Show head 
df_twitter_closures.head()

(236, 1)


Unnamed: 0,text
0,I-94 Clearwater to St. Cloud - both directions...
1,Hazardous driving conditions with ice covered ...
2,Hwy 95 at Fanny Lake Road just east of Cambrid...
3,Ramp from Stearns County Road 75 to eastbound ...
4,Correction/update - EB I-94 at Hwy 4 near Melr...


# Pre-Processing: spaCy
 - Modify tweets to make them more conducive to being parsed by Spacy to **extract location names from text**

In [3]:
# Create new columns to transfer modified tweet text. Five versions of tweets will be created.
df_twitter_closures['modified_text_A'] = ''
df_twitter_closures['modified_text_B'] = ''
df_twitter_closures['modified_text_C'] = ''
df_twitter_closures['modified_text_D'] = ''
df_twitter_closures['modified_text_E'] = ''

# Show modified DF
df_twitter_closures.head(2)

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,I-94 Clearwater to St. Cloud - both directions...,,,,,
1,Hazardous driving conditions with ice covered ...,,,,,


### Original Location

  - Modify original text with bare minimum modifications, to avoid duplicate locations showing up in set locations. 

In [4]:
# Modify text 
for i in range(len(df_twitter_closures)):
    modified_text = df_twitter_closures['text'].loc[i].replace("Hwy", 'Highway')
    modified_text = modified_text.replace("CR ", 'Country Road ')
    modified_text = modified_text.replace("I-", 'Interstate ')
    modified_text = modified_text.replace("EB ", 'Eastbound ')
    modified_text = modified_text.replace("WB ", 'Westbound ')
    modified_text = modified_text.replace("SB ", 'Southbound ')
    modified_text = modified_text.replace("NB ", 'Nouthbound ')
    modified_text.translate(str.maketrans('', '', string.punctuation))
    #modified_text = re.sub('#[A-Za-z0-9\-\.\_]+(?:\s|$)', '', modified_text)
    df_twitter_closures['text'].loc[i] = modified_text

# Show modified DF 
df_twitter_closures.head()

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,Interstate 94 Clearwater to St. Cloud - both d...,,,,,
1,Hazardous driving conditions with ice covered ...,,,,,
2,Highway 95 at Fanny Lake Road just east of Cam...,,,,,
3,Ramp from Stearns County Road 75 to eastbound ...,,,,,
4,Correction/update - Eastbound Interstate 94 at...,,,,,


### Modified Text A
 - Replace words "between" and "in" with "at" to make it even more clear that a _location_ is being referred to. 
   - Definition of "at": expressing location or arrival in a particular place or position.
 - Replace location abbreviations with complete words
 - Remove twitter hashtags as these were being parsed as locations in an initial run of Spacy on the DF

In [5]:
# Modify text 
for i in range(len(df_twitter_closures)):
    modified_text = df_twitter_closures['text'].loc[i].replace("between ", 'at ')
    modified_text = modified_text.replace("Between ", 'at ')
    modified_text = modified_text.replace("In ", ' at ')
    modified_text = modified_text.replace(" in ", ' at ')
    modified_text = modified_text.replace("CR ", 'Country Road ')
    modified_text = modified_text.replace("Hwy", 'Highway')
    modified_text = modified_text.replace("I-", 'Interstate ')
    modified_text = modified_text.replace("EB ", 'Eastbound ')
    modified_text = modified_text.replace("WB ", 'Westbound ')
    modified_text = modified_text.replace("SB ", 'Southbound ')
    modified_text = modified_text.replace("NB ", 'Nouthbound ')
    modified_text.translate(str.maketrans('', '', string.punctuation))
    #modified_text = re.sub('#[A-Za-z0-9\-\.\_]+(?:\s|$)', '', modified_text)
    df_twitter_closures['modified_text_A'].loc[i] = modified_text

# Show modified DF 
df_twitter_closures.head()

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,,,,
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,,,,
2,Highway 95 at Fanny Lake Road just east of Cam...,Highway 95 at Fanny Lake Road just east of Cam...,,,,
3,Ramp from Stearns County Road 75 to eastbound ...,Ramp from Stearns County Road 75 to eastbound ...,,,,
4,Correction/update - Eastbound Interstate 94 at...,Correction/update - Eastbound Interstate 94 at...,,,,


### Modified Text B
Modified Text A: 
 - Replace words "between" and "in" with "at" to make it even more clear that a _location_ is being referred to. 
   - Definition of "at": expressing location or arrival in a particular place or position.
 - Replace location abbreviations with complete words
 - Remove twitter hashtags as these were being parsed as locations in an initial run of Spacy on the DF

Including Above, Modified Text B: 
 - Add "at" to the beginning of all tweets, so that tweets that begin with a location name are more likely to be identified as a location by Spacy

In [6]:
# Modify text 
for i in range(len(df_twitter_closures)):
    modified_text = df_twitter_closures['text'].loc[i].replace("between ", '')
    modified_text = modified_text.replace("Between ", '')
    modified_text = modified_text.replace("In ", 'at ')
    modified_text = modified_text.replace(" in ", ' at ')
    modified_text = "At " + modified_text
    modified_text = modified_text.replace("CR ", 'Country Road ')
    modified_text = modified_text.replace("Hwy", 'Highway')
    modified_text = modified_text.replace("I-", 'Interstate ')
    modified_text = modified_text.replace("EB ", 'Eastbound ')
    modified_text = modified_text.replace("WB ", 'Westbound ')
    modified_text = modified_text.replace("SB ", 'Southbound ')
    modified_text = modified_text.replace("NB ", 'Nouthbound ')
    modified_text.translate(str.maketrans('', '', string.punctuation))
    #modified_text = re.sub('#[A-Za-z0-9\-\.\_]+(?:\s|$)', '', modified_text)
    df_twitter_closures['modified_text_B'].loc[i] = modified_text

    # Show modified DF 
df_twitter_closures.head()

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,At Interstate 94 Clearwater to St. Cloud - bot...,,,
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,At Hazardous driving conditions with ice cover...,,,
2,Highway 95 at Fanny Lake Road just east of Cam...,Highway 95 at Fanny Lake Road just east of Cam...,At Highway 95 at Fanny Lake Road just east of ...,,,
3,Ramp from Stearns County Road 75 to eastbound ...,Ramp from Stearns County Road 75 to eastbound ...,At Ramp from Stearns County Road 75 to eastbou...,,,
4,Correction/update - Eastbound Interstate 94 at...,Correction/update - Eastbound Interstate 94 at...,At Correction/update - Eastbound Interstate 94...,,,


### Modified Text C

Modified Text A: 
 - Replace words "between" and "in" with "at" to make it even more clear that a _location_ is being referred to. 
   - Definition of "at": expressing location or arrival in a particular place or position.
 - Replace location abbreviations with complete words
 - Remove twitter hashtags as these were being parsed as locations in an initial run of Spacy on the DF

Including Above, Modified Text B: 
 - Add "at" to the beginning of all tweets, so that tweets that begin with a location name are more likely to be identified as a location by Spacy
 
Including Above, Modified Text C: 
 - Ensure all instances of "east/westbound" are capitalized --> "Eastbound" and "Westbound"

In [7]:
# Modify text 
for i in range(len(df_twitter_closures)):
    modified_text = df_twitter_closures['text'].loc[i].replace("between ", '')
    modified_text = modified_text.replace("Between ", '')
    modified_text = modified_text.replace("In ", 'at ')
    modified_text = modified_text.replace(" in ", ' at ')
    modified_text = "At " + modified_text
    modified_text = modified_text.replace("CR ", 'Country Road ')
    modified_text = modified_text.replace("Hwy", 'Highway')
    modified_text = modified_text.replace("I-", 'Interstate ')
    modified_text = modified_text.replace("EB ", 'Eastbound ')
    modified_text = modified_text.replace("WB ", 'Westbound ')
    modified_text = modified_text.replace("SB ", 'Southbound ')
    modified_text = modified_text.replace("NB ", 'Nouthbound ')
    modified_text = modified_text.replace("eastbound", 'Eastbound')
    modified_text = modified_text.replace("westbound", 'Westbound')
    modified_text = modified_text.replace("northbound", 'Northbound')
    modified_text = modified_text.replace("southbound", 'Southbound')  
    modified_text.translate(str.maketrans('', '', string.punctuation))
    #modified_text = re.sub('#[A-Za-z0-9\-\.\_]+(?:\s|$)', '', modified_text)
    df_twitter_closures['modified_text_C'].loc[i] = modified_text

    # Show modified DF 
df_twitter_closures.head()

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,,
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,,
2,Highway 95 at Fanny Lake Road just east of Cam...,Highway 95 at Fanny Lake Road just east of Cam...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,,
3,Ramp from Stearns County Road 75 to eastbound ...,Ramp from Stearns County Road 75 to eastbound ...,At Ramp from Stearns County Road 75 to eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,,
4,Correction/update - Eastbound Interstate 94 at...,Correction/update - Eastbound Interstate 94 at...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,,


### Modified Text D

Modified Text A: 
 - Replace words "between" and "in" with "at" to make it even more clear that a _location_ is being referred to. 
   - Definition of "at": expressing location or arrival in a particular place or position.
 - Replace location abbreviations with complete words
 - Remove twitter hashtags as these were being parsed as locations in an initial run of Spacy on the DF

Including Above, Modified Text B: 
 - Add "at" to the beginning of all tweets, so that tweets that begin with a location name are more likely to be identified as a location by Spacy
 
Including Above, Modified Text C: 
 - Ensure all instances of "east/westbound" are capitalized --> "Eastbound" and "Westbound"
 
Including Above, Modified Text D: 
 - Remove all punctuation

In [8]:
# Modify text 
for i in range(len(df_twitter_closures)):
    modified_text = df_twitter_closures['text'].loc[i].replace("between ", '')
    modified_text = modified_text.replace("Between ", '')
    modified_text = modified_text.replace("In ", 'at ')
    modified_text = modified_text.replace(" in ", ' at ')
    modified_text = "At " + modified_text
    modified_text = modified_text.replace("CR ", 'Country Road ')
    modified_text = modified_text.replace("Hwy", 'Highway')
    modified_text = modified_text.replace("I-", 'Interstate ')
    modified_text = modified_text.replace("EB ", 'Eastbound ')
    modified_text = modified_text.replace("WB ", 'Westbound ')
    modified_text = modified_text.replace("eastbound", 'Eastbound')
    modified_text = modified_text.replace("westbound", 'Westbound')
    modified_text = modified_text.replace("SB ", 'Southbound ')
    modified_text = modified_text.replace("NB ", 'Nouthbound ')
    modified_text = modified_text.replace("eastbound", 'Eastbound')
    modified_text = modified_text.replace("westbound", 'Westbound')
    modified_text = modified_text.replace("northbound", 'Northbound')
    modified_text = modified_text.replace("southbound", 'Southbound') 
    modified_text.translate(str.maketrans('', '', string.punctuation))
    #modified_text = re.sub('#[A-Za-z0-9\-\.\_]+(?:\s|$)', '', modified_text)
    df_twitter_closures['modified_text_D'].loc[i] = modified_text

    # Show modified DF 
df_twitter_closures.head()

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,
2,Highway 95 at Fanny Lake Road just east of Cam...,Highway 95 at Fanny Lake Road just east of Cam...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,
3,Ramp from Stearns County Road 75 to eastbound ...,Ramp from Stearns County Road 75 to eastbound ...,At Ramp from Stearns County Road 75 to eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,
4,Correction/update - Eastbound Interstate 94 at...,Correction/update - Eastbound Interstate 94 at...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,


### Modified Text E
Modified Text A: 
 - Replace words "between" and "in" with "at" to make it even more clear that a _location_ is being referred to. 
   - Definition of "at": expressing location or arrival in a particular place or position.
 - Replace location abbreviations with complete words
 - Remove twitter hashtags as these were being parsed as locations in an initial run of Spacy on the DF

Including Above, Modified Text B: 
 - Add "at" to the beginning of all tweets, so that tweets that begin with a location name are more likely to be identified as a location by Spacy
 
Including Above, Modified Text E: 
 - Modify "Modification A" so that "between" is replaced with "at." 

In [9]:
# Modify text 
for i in range(len(df_twitter_closures)):
    modified_text = df_twitter_closures['text'].loc[i].replace("between ", 'at ')
    modified_text = modified_text.replace("Between ", 'at ')
    modified_text = modified_text.replace("In ", 'at ')
    modified_text = modified_text.replace(" in ", ' at ')
    modified_text = "At " + modified_text
    modified_text = modified_text.replace("CR ", 'Country Road ')
    modified_text = modified_text.replace("Hwy", 'Highway')
    modified_text = modified_text.replace("I-", 'Interstate ')
    modified_text = modified_text.replace("EB ", 'Eastbound ')
    modified_text = modified_text.replace("WB ", 'Westbound ')
    modified_text = modified_text.replace("SB ", 'Southbound ')
    modified_text = modified_text.replace("NB ", 'Nouthbound ')
    modified_text.translate(str.maketrans('', '', string.punctuation))
    #modified_text = re.sub('#[A-Za-z0-9\-\.\_]+(?:\s|$)', '', modified_text)
    df_twitter_closures['modified_text_E'].loc[i] = modified_text

    # Show modified DF 
df_twitter_closures.head()

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...
2,Highway 95 at Fanny Lake Road just east of Cam...,Highway 95 at Fanny Lake Road just east of Cam...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...
3,Ramp from Stearns County Road 75 to eastbound ...,Ramp from Stearns County Road 75 to eastbound ...,At Ramp from Stearns County Road 75 to eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,At Ramp from Stearns County Road 75 to eastbou...
4,Correction/update - Eastbound Interstate 94 at...,Correction/update - Eastbound Interstate 94 at...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...


# Extract Location from Text
  - Using Spacy, extract location names from each version of modified text and place location names into new columns.

In [10]:
# Create new columns to house extract locations for each version of text
df_twitter_closures['location_original'] = ''
df_twitter_closures['location_A'] = ''
df_twitter_closures['location_B'] = ''
df_twitter_closures['location_C'] = ''
df_twitter_closures['location_D'] = ''
df_twitter_closures['location_E'] = ''
df_twitter_closures['state'] = 'Minnesota'

# Show modified DF
df_twitter_closures.head(2)

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E,location_original,location_A,location_B,location_C,location_D,location_E,state
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,,,,,,,Minnesota
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,,,,,,,Minnesota


### Location Extraction Function

In [11]:
def get_loc(df, text_column, location_column):
    # Use Spacy to extract location names from `text` column
    for i in range(len(df)):
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(df_twitter_closures[text_column].iloc[i])
        locations = set()
        for ent in doc.ents:
            if (ent.label_=='GPE') or (ent.label_=='FAC'):
                locations.add(ent.text)
                df_twitter_closures[location_column].iloc[i] = locations

### Extract Location from all modified columns

In [12]:
# Extract location from `text` column
get_loc(df_twitter_closures, 'text', 'location_original')

# Extract location from `modified_text_A` column
get_loc(df_twitter_closures, 'modified_text_A', 'location_A')

# Extract location from `modified_text_B` column
get_loc(df_twitter_closures, 'modified_text_B', 'location_B')

# Extract location from `modified_text_C` column
get_loc(df_twitter_closures, 'modified_text_C', 'location_C')

# Extract location from `modified_text_D` column
get_loc(df_twitter_closures, 'modified_text_D', 'location_D')

# Extract location from `modified_text_E` column
get_loc(df_twitter_closures, 'modified_text_E', 'location_E')

### Combine all unique locations from all Modifications

In [13]:
# Create new column to house ALL locations mentioned in all columns
df_twitter_closures['all_locations'] = ''

In [14]:
# Add all unique locations to `all_locations` column
for i in range(len(df_twitter_closures)):
    original = set(df_twitter_closures['location_original'].loc[i])
    a = set(df_twitter_closures['location_A'].loc[i])
    b = set(df_twitter_closures['location_B'].loc[i])
    c = set(df_twitter_closures['location_C'].loc[i])
    d = set(df_twitter_closures['location_D'].loc[i])
    e = set(df_twitter_closures['location_E'].loc[i])
    f = set(df_twitter_closures['state'].loc[i])
    unique = original.union(a.union(b.union(c.union(d.union(e.union(f))))))
    df_twitter_closures['all_locations'].loc[i] = unique

# Show DataFrame with text and location comparison

In [15]:
df_twitter_closures.head(10)

Unnamed: 0,text,modified_text_A,modified_text_B,modified_text_C,modified_text_D,modified_text_E,location_original,location_A,location_B,location_C,location_D,location_E,state,all_locations
0,Interstate 94 Clearwater to St. Cloud - both d...,Interstate 94 Clearwater to St. Cloud - both d...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,At Interstate 94 Clearwater to St. Cloud - bot...,{St. Cloud},{St. Cloud},,,,,Minnesota,"{t, e, o, a, i, St. Cloud, s, M, n}"
1,Hazardous driving conditions with ice covered ...,Hazardous driving conditions with ice covered ...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,At Hazardous driving conditions with ice cover...,"{Interstate 94, Mille Lacs}","{Interstate 94, Mille Lacs}","{Interstate 94, Mille Lacs}","{Interstate 94, Mille Lacs}","{Interstate 94, Mille Lacs}","{Interstate 94, Mille Lacs}",Minnesota,"{t, Interstate 94, Mille Lacs, e, o, a, i, s, ..."
2,Highway 95 at Fanny Lake Road just east of Cam...,Highway 95 at Fanny Lake Road just east of Cam...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,At Highway 95 at Fanny Lake Road just east of ...,"{Cambridge, Fanny Lake Road}","{Cambridge, Fanny Lake Road}","{Highway 95, Cambridge, Fanny Lake Road}","{Highway 95, Cambridge, Fanny Lake Road}","{Highway 95, Cambridge, Fanny Lake Road}","{Highway 95, Cambridge, Fanny Lake Road}",Minnesota,"{t, n, e, o, a, Cambridge, i, s, Highway 95, M..."
3,Ramp from Stearns County Road 75 to eastbound ...,Ramp from Stearns County Road 75 to eastbound ...,At Ramp from Stearns County Road 75 to eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,At Ramp from Stearns County Road 75 to Eastbou...,At Ramp from Stearns County Road 75 to eastbou...,"{Stearns County Road 75, Interstate 94}","{Stearns County Road 75, Interstate 94}","{Stearns County Road 75, Interstate 94}","{Stearns County Road 75, Eastbound Interstate 94}","{Stearns County Road 75, Eastbound Interstate 94}","{Stearns County Road 75, Interstate 94}",Minnesota,"{Stearns County Road 75, t, Interstate 94, e, ..."
4,Correction/update - Eastbound Interstate 94 at...,Correction/update - Eastbound Interstate 94 at...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,At Correction/update - Eastbound Interstate 94...,"{Highway 4, Melrose, Interstate 94}","{Highway 4, Melrose, Interstate 94}","{Highway 4, Melrose, Interstate 94}","{Highway 4, Melrose, Interstate 94}","{Highway 4, Melrose, Interstate 94}","{Highway 4, Melrose, Interstate 94}",Minnesota,"{t, Melrose, Interstate 94, e, o, a, i, Highwa..."
5,Eastbound Interstate 94 is closed in Monticell...,Eastbound Interstate 94 is closed at Monticell...,At Eastbound Interstate 94 is closed at Montic...,At Eastbound Interstate 94 is closed at Montic...,At Eastbound Interstate 94 is closed at Montic...,At Eastbound Interstate 94 is closed at Montic...,,,{Eastbound Interstate 94},{Eastbound Interstate 94},{Eastbound Interstate 94},{Eastbound Interstate 94},Minnesota,"{t, e, o, a, i, Eastbound Interstate 94, s, M, n}"
6,Highway 210 Brainerd to Ironton will be closed...,Highway 210 Brainerd to Ironton will be closed...,At Highway 210 Brainerd to Ironton will be clo...,At Highway 210 Brainerd to Ironton will be clo...,At Highway 210 Brainerd to Ironton will be clo...,At Highway 210 Brainerd to Ironton will be clo...,{County Road 142},{County Road 142},{County Road 142},{County Road 142},{County Road 142},{County Road 142},Minnesota,"{t, e, o, a, County Road 142, i, s, M, n}"
7,Surface water/flooding continues on area roads...,Surface water/flooding continues on area roads...,At Surface water/flooding continues on area ro...,At Surface water/flooding continues on area ro...,At Surface water/flooding continues on area ro...,At Surface water/flooding continues on area ro...,"{Kanabec, Mille Lacs}",{Kanabec},{Kanabec},{Kanabec},{Kanabec},{Kanabec},Minnesota,"{t, Kanabec, Mille Lacs, e, o, a, i, s, M, n}"
8,Eastbound Interstate 94 closed just east of th...,Eastbound Interstate 94 closed just east of th...,At Eastbound Interstate 94 closed just east of...,At Eastbound Interstate 94 closed just east of...,At Eastbound Interstate 94 closed just east of...,At Eastbound Interstate 94 closed just east of...,{New Munich},{New Munich},"{New Munich, Eastbound Interstate 94}","{New Munich, Eastbound Interstate 94}","{New Munich, Eastbound Interstate 94}","{New Munich, Eastbound Interstate 94}",Minnesota,"{t, e, o, a, i, New Munich, Eastbound Intersta..."
9,Heading from #CentralMinnesota to the Twin Cit...,Heading from #CentralMinnesota to the Twin Cit...,At Heading from #CentralMinnesota to the Twin ...,At Heading from #CentralMinnesota to the Twin ...,At Heading from #CentralMinnesota to the Twin ...,At Heading from #CentralMinnesota to the Twin ...,,,,,,,Minnesota,"{t, e, o, a, i, s, M, n}"


# Assess Performance of Location Extraction for Each Version of Modified Text

In [16]:
# Create variables to house sum of locations in each location-column
location_original = 0
location_A = 0
location_B = 0
location_C = 0
location_D = 0
location_E = 0
all_locations = 0

# Add sum of locations in each column
for i in range(len(df_twitter_closures)):
    location_original += len(df_twitter_closures['location_original'].loc[i])
    location_A += len(df_twitter_closures['location_A'].loc[i])
    location_B += len(df_twitter_closures['location_B'].loc[i])
    location_C += len(df_twitter_closures['location_C'].loc[i])
    location_D += len(df_twitter_closures['location_D'].loc[i])
    location_E += len(df_twitter_closures['location_E'].loc[i])
    all_locations += len(df_twitter_closures['all_locations'].loc[i])

# Print totals 
print(f'Location_original has {location_original} locations.')
print(f'Location_A has {location_A} locations.')
print(f'Location_B has {location_B} locations.')
print(f'Location_C has {location_C} locations.')
print(f'Location_D has {location_D} locations.')
print(f'Location_E has {location_E} locations.')
print(f'All_locations has {all_locations} locations.')

Location_original has 316 locations.
Location_A has 308 locations.
Location_B has 368 locations.
Location_C has 368 locations.
Location_D has 368 locations.
Location_E has 381 locations.
All_locations has 2305 locations.


***Note***: `All_locations` has more locations because "Minnesota" was added.

# spaCy Conclusion

At its best, the text modifications were able to increase Spacy's ability to extract location from text by **98.4%**. However, if all unique locations are combined into one column (`all_locations`), Spacy's ability to extract location from text increases to **117.7%**. However, one of the pitfalls of combining unique location names into column is that there can still be duplicates -- for instance, "Interstate 94" and "Eastbound Interstate 94" show up twice. Interestingly enough, combining all unique locations into column does not necessarily produce optimal results.

Things to consider looking ahead: 
 - Is it possible to convert locations to longitude and latitude where duplicates exist (example: "Interstate 94" and "Eastbound Interstate 94", or "Hwy 4" and "Highway 4")? 
 - When extracting locations from live Tweets, perhaps the best option to go with is Modification E, since there are no duplicate locations whatsoever and it extracted the highest number of total locations.
 - Will these modifications work on tweets from other states' Dept. of Transportation Twitter accounts?

# Convert Locations to Coordinates

In [17]:
coordinates = []
for i in range(len(df_twitter_closures['all_locations'])):
    g = geocoder.arcgis(df_twitter_closures['all_locations'][i])
    coordinates.append(tuple(g.latlng))

In [18]:
all_tweets = []
for i in range(len(df_twitter_closures['text'])):
    all_tweets.append([df_twitter_closures['text'].loc[i]])

### Save data to csv

In [19]:
#df_twitter_closures.to_csv('../data/twitter_locations.csv')

# Mapping

## Load Data

In [20]:
# read in data from location extraction notebook

twitter_map_info = pd.read_csv('../data/twitter_locations.csv')

## Convert Location to Coordinates

In [21]:
# Get coordinates
coordinates = []
for i in range(len(twitter_map_info['all_locations'])):
    g = geocoder.arcgis(twitter_map_info['all_locations'][i])
    coordinates.append(tuple(g.latlng))

# Get tweets associated with coordinates
all_tweets = []
for i in range(len(twitter_map_info['text'])):
    all_tweets.append([twitter_map_info['text'].loc[i]])

# Make sure coordinate count is equal to road closure tweets count    
print(len(coordinates))
print(len(all_tweets))

236
236


## Load Map

In [22]:
# Read in json map file
state_geo = os.path.join('./maps', 'us-states.json')

# Set parameters to show map with pins where road closures are, and corresponding tweets as pop-ups
mapit = folium.Map( location=[44.9778, -93.2650], tiles="OpenStreetMap", zoom_start=13)
state_geo = os.path.join('../data', 'us-states.json')

for latlong, tweet in zip(coordinates, all_tweets):
    folium.Marker(location=[latlong[0], latlong[1]], popup=str(tweet)).add_to(mapit)
    
mapit 

***View real map when downloaded***
![title](../images/map_example2.png)
![title](../images/map_example.png)