## Requirements:

**python libraries:**
- pandas
- spacy
- geopy


**csv files:**
- sorted_articles.csv
- outlet_locations.csv

### Functions

- **spacy_location ( *string* text )**

    Finds most frequently mentioned location in a piece of text. Returns as a string. 


- **df_to_list_dict ( *DataFrame* df, *string* key )**

    Converts a pandas DataFrame into a dictionary keyed by the values of the column named by `key`; each value is a list of that row's remaining column values.


- **clean_loc ( *string* text )**

    Cleans location text for specific use cases. 
    

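- **geocode_articles ( *DataFrame* df, *DataFrame* outlet_locations )**

    Assigns geolocation information to each article (row) in `df`. The location comes from spacy_location() when the article text mentions one; otherwise the location of the article's outlet is taken from `outlet_locations`.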
    
    
    

### spacy_location ( *string* text ) :

**Input:**

- **text -** text to analyze 

**Output:**

- **final_loc -** the most frequently mentioned location (spaCy `GPE` entity) in the text, or an empty string if none is found

In [4]:
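_nlp = None   # spaCy pipeline, loaded once on first use

def spacy_location(text):
    global _nlp
    if _nlp is None:                       # Avoid reloading the model for every article
        _nlp = spacy.load("en_core_web_sm")
    doc = _nlp(text)
    final_loc = ''                         # Returned as-is if no location is found
    locations = []
    
    for ent in doc.ents:                   # Collect geopolitical entities (countries, cities, states)
        if ent.label_ == 'GPE':
            locations.append(ent.text)
    
    if len(locations) > 0:                 # Count mentions and keep the most frequent one
        loc_counts = pd.DataFrame([[x, locations.count(x)] for x in set(locations)])
        loc_counts.columns = ['location', 'count']
        loc_counts = loc_counts.sort_values(by=['count'], ascending=False)
    
        final_loc = loc_counts.iloc[0, 0]
        
    return final_loc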
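
The counting step of spacy_location() can be looked at separately from spaCy. The following is a minimal sketch, assuming the entities have already been extracted as `(text, label)` pairs. `most_frequent_gpe` is a hypothetical helper, not part of this notebook, and it swaps the DataFrame sort for `collections.Counter`.

```python
from collections import Counter

def most_frequent_gpe(entities):
    """Return the most common 'GPE' entity text, or '' if there is none.

    `entities` is an iterable of (text, label) pairs, standing in for
    (ent.text, ent.label_) from a spaCy doc (assumption for this sketch).
    """
    counts = Counter(text for text, label in entities if label == 'GPE')
    if not counts:
        return ''
    # most_common(1) returns a list holding the single top (text, count) pair
    return counts.most_common(1)[0][0]

sample = [('Vancouver', 'GPE'), ('Surrey', 'GPE'), ('Vancouver', 'GPE'), ('CBC', 'ORG')]
print(most_frequent_gpe(sample))
```

With `Counter`, entities that tie on count come back in first-seen order, so the earliest mention wins a tie.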

### df_to_list_dict ( *DataFrame* df , *string* key ) :

**Input:**

- **df -** pandas dataframe to transform to dictionary
- **key -** name of the column whose values become the dictionary keys

**Output:**

- **dict -** a dictionary with one entry per row: the key is the row's value in column `key`, and the value is a list of the row's remaining column values, in column order


In [5]:
def df_to_list_dict(df, key):
    return df.set_index(key).T.to_dict('list')
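
To show the shape of the result, here is a small sketch on made-up data. The outlet names, coordinates and the column names `location`, `lat` and `long` are assumptions for illustration; only the `outlet` key column is named in this notebook.

```python
import pandas as pd

def df_to_list_dict(df, key):
    return df.set_index(key).T.to_dict(orient='list')

# Hypothetical outlet table: one row per outlet
outlets = pd.DataFrame({
    'outlet': ['Outlet A', 'Outlet B'],
    'location': ['Vancouver', 'Surrey, British Columbia'],
    'lat': [49.28, 49.19],
    'long': [-123.12, -122.85],
})

loc_dict = df_to_list_dict(outlets, 'outlet')

# Each value lists the remaining columns in their original order
location, lat, long = loc_dict['Outlet A']
print(location, lat, long)
```

Because the values are plain lists, anything that reads them by position depends on the column order of the input dataframe.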

### clean_loc ( *string* text ) :


**Input:**

- **text -** location text to check against known special cases

**Output:**

- **text -** the cleaned location string (unchanged if no special case applies)

In [6]:
def clean_loc(text): 
    if text=='Surrey':
        text = 'Surrey, British Columbia'
    elif text =='B.C.':
        text = 'British Columbia'
    
    return text
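
If the list of special cases grows, a lookup table may be easier to maintain than a chain of `elif` branches. This is only a sketch of that alternative; `LOCATION_FIXES` and `clean_loc_table` are hypothetical names, and the table holds just the two cases handled above.

```python
# Hypothetical lookup table with the same two special cases as clean_loc()
LOCATION_FIXES = {
    'Surrey': 'Surrey, British Columbia',
    'B.C.': 'British Columbia',
}

def clean_loc_table(text):
    # Return the replacement if one is defined, otherwise the text unchanged
    return LOCATION_FIXES.get(text, text)

print(clean_loc_table('B.C.'))
```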

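### geocode_articles ( *DataFrame* df, *DataFrame* outlet_locations ) : 

**Input:** 
- **df -** dataframe of articles; must contain the columns `text` and `outlet`
- **outlet_locations -** dataframe of outlets and their geographic locations; must contain the column `outlet`, followed by the outlet's location name, latitude and longitude

**Output:** 
- **new_df -** the articles dataframe with `location`, `lat` and `long` columns filled in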


In [17]:
def geocode_articles(df, outlet_locations):
    
    new_df = df.reset_index(drop=True)      # New dataframe with a 0..n-1 index, so .loc[i] matches the loop counter
    gl = Nominatim(user_agent='newsworthy_ml')
    
    loc_dict = df_to_list_dict(outlet_locations, 'outlet')    # outlet -> [location, lat, long]
    
    
    for i in range(len(new_df)):          # Parses through each article in df
    
        text_loc = spacy_location(new_df.loc[i, 'text'])     # Find most frequently mentioned location in article text
        text_loc = clean_loc(text_loc)                            # Fix specific use cases for locations        
    
        if text_loc != '':                 # If location string from article text is not empty
            try:
                print(text_loc)
                new_df.loc[i, 'location'] = text_loc       # Assign article location to this text
                loc = gl.geocode(text_loc)                  # Find coordinates of location returned by spacy (None if not found)
                lat, long = loc.latitude, loc.longitude
                
                new_df.loc[i, 'lat'] = lat                       # Assign coordinates to article
                new_df.loc[i, 'long'] = long
                print(str([lat, long]))
            
            except Exception:                   # Geocoder error, or no match (loc is None)
                print("Error with extracting coordinates.")
                new_df.loc[i, 'lat'] = 0.0      # 0.0 marks coordinates that could not be resolved
                new_df.loc[i, 'long'] = 0.0
            
        else:                                   # Use location of outlet instead
            try:
                outlet_loc = loc_dict[new_df.loc[i, 'outlet']]
                new_df.loc[i, 'location'] = outlet_loc[0]
                new_df.loc[i, 'lat'] = outlet_loc[1]
                new_df.loc[i, 'long'] = outlet_loc[2]
                
            except (KeyError, IndexError):      # Outlet missing from outlet_locations, or row too short
                print("Error calling from dictionary.")
    
    return new_df
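
The per-article decision inside geocode_articles() can be sketched without any network access. The example below is an illustration under assumptions, not the notebook's implementation: `resolve_location` is a hypothetical helper, the `geocode` argument is a stand-in for `Nominatim.geocode` that returns a `(lat, long)` tuple or `None`, and the outlet data is made up.

```python
def resolve_location(text_loc, outlet, loc_dict, geocode):
    """Return (location, lat, long) for one article.

    text_loc : location found in the article text ('' if none)
    outlet   : name of the outlet that published the article
    loc_dict : outlet -> [location, lat, long]
    geocode  : callable, place name -> (lat, long) or None (stand-in geocoder)
    """
    if text_loc:
        coords = geocode(text_loc)
        if coords is None:
            # Same sentinel the notebook uses when geocoding fails
            return text_loc, 0.0, 0.0
        return text_loc, coords[0], coords[1]

    # No location in the text: fall back to the outlet's own location
    entry = loc_dict.get(outlet)
    if entry is None:
        return None, None, None   # assumption: unknown outlets stay empty
    return entry[0], entry[1], entry[2]

# Made-up data for the sketch
fake_coords = {'Vancouver': (49.28, -123.12)}
fake_geocode = fake_coords.get
loc_dict = {'Outlet A': ['Surrey, British Columbia', 49.19, -122.85]}

print(resolve_location('Vancouver', 'Outlet A', loc_dict, fake_geocode))
print(resolve_location('', 'Outlet A', loc_dict, fake_geocode))
```

Passing the geocoder in as an argument keeps the branching testable offline; in the notebook, that role is played by the `Nominatim` instance.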

## Script

In [12]:
import pandas as pd
import spacy
from geopy.geocoders import Nominatim

In [13]:
# Get necessary .csv files as pandas dataframes

df = pd.read_csv('complete.csv')
df = df.drop(columns=['Unnamed: 0'])

outlet_locations = pd.read_csv('outlet_locations.csv')
outlet_locations = outlet_locations.drop(columns=['Unnamed: 0'])


In [None]:
# Find locations of each article

new_df = geocode_articles(df, outlet_locations)

In [None]:
new_df.to_csv('final_df.csv')