## Import and Join Geocodes to DataFrames
This notebook picks up after 01-setup_provider_dataframes, where I created 'tennessee_providers' and 'tennessee_providers_geocodes' dataframes. 

- tennessee_providers contains names and addresses of all providers
- tennessee_providers_geocodes contains geocoded addresses of all providers
- There is a many-to-one relationship between the two: several providers can share one location, and tennessee_providers_geocodes holds each unique location from tennessee_providers once.
- The dataframes are joined here so that the providers can be plotted on a map of Nashville.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
# set up import variables

# import tennessee_providers and geocoded address files for joining
tennessee_providers_geocodes_csv = '../data/location/tennessee_providers_geocoded_addresses.csv'
tennessee_providers_csv = '../data/clean/tennessee_providers.csv' 


# import only the needed columns from tennessee_providers_geocodes
columns = ['address', 'city', 'state', 'zip', 'longitude', 'latitude'] 
tennessee_providers_geocodes = pd.read_csv(tennessee_providers_geocodes_csv, usecols=columns)
tennessee_providers = pd.read_csv(tennessee_providers_csv)


# confirm imports
print('This is tennessee_providers with geocodes')
print(tennessee_providers_geocodes.info())
print()
print('This is tennessee_providers w/o geocodes')
print(tennessee_providers.info())


This is tennessee_providers with geocodes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1229 entries, 0 to 1228
Data columns (total 6 columns):
address      1229 non-null object
city         1229 non-null object
state        1229 non-null object
zip          1229 non-null int64
longitude    938 non-null float64
latitude     938 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 57.7+ KB
None

This is tennessee_providers w/o geocodes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1997 entries, 0 to 1996
Data columns (total 7 columns):
Unnamed: 0    1997 non-null int64
address       1997 non-null object
address2      714 non-null object
city          1997 non-null object
state         1997 non-null object
zip           1997 non-null int64
full_name     1997 non-null object
dtypes: int64(2), object(5)
memory usage: 109.3+ KB
None


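Ready to combine the dataframes. Joining the geocodes onto tennessee_providers should give 1997 rows in the combined table, one per provider. Only 938 of the 1229 geocoded locations have a longitude and latitude, so some providers will end up without coordinates.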

- This code gives me the right number of rows, but `pd.concat` with `axis=1` only places the two tables side by side by index position. It does not match rows on address, city, state and zip:

```all_providers = pd.concat([tennessee_providers, tennessee_providers_geocodes], axis=1, join='outer')```

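- This code matches rows on address but returns 2313 rows, 316 more than expected:

```
all_providers = pd.merge(tennessee_providers, tennessee_providers_geocodes, how='outer', on='address', 
                         left_on=None, right_on=None, left_index=False, right_index=False, sort=True,
                         suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```

The `info()` output of the merged table further down accounts for the extra rows. Only 2306 rows carry provider columns such as `full_name`, so 7 rows are geocoded addresses that match no provider and are kept only because of `how='outer'`. The remaining 309 (2306 - 1997) are duplicates: the merge key is `address` alone, so a provider whose street address appears more than once in tennessee_providers_geocodes (most likely the same street address in different cities or ZIP codes) is repeated once per match.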
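To check where extra rows come from, it can help to look for repeated keys in the lookup table and to ask `pd.merge` to label the origin of each row. The sketch below uses small made-up tables (`providers` and `geocodes` are hypothetical stand-ins, not the real data) and only shows the technique.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up rows, not the real data).
providers = pd.DataFrame({
    "full_name": ["A", "B", "C"],
    "address": ["1 main st", "1 main st", "9 oak ave"],
    "city": ["Nashville", "Memphis", "Nashville"],
})
geocodes = pd.DataFrame({
    "address": ["1 main st", "1 main st", "5 elm rd"],
    "city": ["Nashville", "Memphis", "Bristol"],
    "longitude": [-86.78, -90.05, -82.19],
    "latitude": [36.16, 35.15, 36.60],
})

# Addresses that occur more than once in the lookup table fan out on merge.
is_repeated = geocodes["address"].duplicated(keep=False)
dup_addresses = geocodes.loc[is_repeated, "address"].unique()

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(providers, geocodes, how="outer", on="address", indicator=True)
counts = merged["_merge"].value_counts()

print(list(dup_addresses))
print(counts)
```

Rows labelled `right_only` exist only because of the outer join, and any address listed in `dup_addresses` multiplies the provider rows that share it.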

In [3]:
# pd.concat with axis=1 doesn't work here: I get the right # of rows, but two columns called 'address'
# all_providers = pd.concat([tennessee_providers, tennessee_providers_geocodes], axis=1, join='outer')

# outer merge on address; all other arguments keep their default values
all_providers = pd.merge(tennessee_providers, tennessee_providers_geocodes,
                         how='outer', on='address', sort=True)


# confirm join
all_providers.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2313 entries, 0 to 2312
Data columns (total 12 columns):
Unnamed: 0    2306 non-null float64
address       2313 non-null object
address2      815 non-null object
city_x        2306 non-null object
state_x       2306 non-null object
zip_x         2306 non-null float64
full_name     2306 non-null object
city_y        1787 non-null object
state_y       1787 non-null object
zip_y         1787 non-null float64
longitude     1430 non-null float64
latitude      1430 non-null float64
dtypes: float64(5), object(7)
memory usage: 234.9+ KB
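One possible way to keep one row per provider is a left merge on the full location key, with `validate='many_to_one'` so pandas raises an error if the lookup table repeats a key. This is a minimal sketch on made-up data. For the real tables it assumes that the combination of address, city, state and zip is unique in tennessee_providers_geocodes, which I have not verified, and ZIP values would need the same format in both tables to match.

```python
import pandas as pd

# Toy stand-ins for the two tables (made-up rows, not the real data).
providers = pd.DataFrame({
    "full_name": ["A", "B", "C"],
    "address": ["1 main st", "1 main st", "9 oak ave"],
    "city": ["Nashville", "Memphis", "Nashville"],
})
geocodes = pd.DataFrame({
    "address": ["1 main st", "1 main st"],
    "city": ["Nashville", "Memphis"],
    "longitude": [-86.78, -90.05],
    "latitude": [36.16, 35.15],
})

# On address alone the lookup table is not unique, so validation fails.
try:
    pd.merge(providers, geocodes, how="left", on="address", validate="many_to_one")
    address_is_unique = True
except pd.errors.MergeError:
    address_is_unique = False

# With the composite key every provider matches at most one location.
keys = ["address", "city"]
fixed = pd.merge(providers, geocodes, how="left", on=keys, validate="many_to_one")

print(address_is_unique)
print(fixed)
```

A left merge keeps every provider and drops geocoded addresses that match nobody, so the result has exactly as many rows as the provider table.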


In [4]:
all_providers.head()

| | Unnamed: 0 | address | address2 | city_x | state_x | zip_x | full_name | city_y | state_y | zip_y | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7588912.0 | #6 sixth street suite 205 | NaN | Bristol | TN | 37620.0 | Ronald Brizendine | Bristol | TN | 37620.0 | NaN | NaN |
| 1 | 3658055.0 | 1 innis brook lane | NaN | Brentwood | TN | 37027.0 | Tayebeh Asad sangabi | Brentwood | TN | 37027.0 | -86.731094 | 35.970486 |
| 2 | 3514867.0 | 1 medical center blvd | NaN | Cookeville | TN | 38501.0 | Tim Bongartz | Cookeville | TN | 385014294.0 | -85.50952 | 36.170155 |
| 3 | 4950698.0 | 1 medical center blvd | NaN | Cookeville | TN | 385014294.0 | Frances Thomason | Cookeville | TN | 385014294.0 | -85.50952 | 36.170155 |
| 4 | 5721680.0 | 1 medical center blvd | NaN | Cookeville | TN | 38501.0 | Jasmine Olive | Cookeville | TN | 385014294.0 | -85.50952 | 36.170155 |
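The rows for 1 medical center blvd show the same location with the ZIP stored both as 38501 and as 385014294, which looks like a ZIP+4 code without its hyphen. If zip is ever used as part of a join key, the two forms would not match. The helper below is a hypothetical sketch (`to_zip5` is my own name, not part of pandas) that reduces both forms to a 5-digit string.

```python
import pandas as pd

def to_zip5(value):
    """Return the 5-digit ZIP as a string, or None for missing values."""
    if pd.isna(value):
        return None
    digits = str(int(value))
    # Values longer than 5 digits are treated as ZIP+4 without the hyphen.
    # Pad to 9 digits first so a leading zero is not lost.
    if len(digits) > 5:
        digits = digits.zfill(9)[:5]
    return digits.zfill(5)

zips = pd.Series([37620.0, 385014294.0, None])
zip5 = zips.map(to_zip5)
print(zip5)
```

Keeping ZIP codes as strings also avoids the float formatting (37620.0) that appears once a column contains missing values.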


In [5]:
all_providers.tail()

| | Unnamed: 0 | address | address2 | city_x | state_x | zip_x | full_name | city_y | state_y | zip_y | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2308 | 3515565.0 | Vanderbilt student health ctr | Zerfoss building, station 17 | Nashville | TN | 372320001.0 | Camellia Koleyni | Nashville | TN | 372320001.0 | -86.80022 | 36.141613 |
| 2309 | 7035156.0 | Vanderbilt university department of medicine | D 3100, medical center north | Nashville | TN | 372320001.0 | Benjamin Tillman | Nashville | TN | 372320001.0 | -86.80022 | 36.141613 |
| 2310 | 4640539.0 | Vanderbilt university student ctr | Zerfoss building, station 17 | Nashville | TN | 372328710.0 | Rachel Aholt | Nashville | TN | 372328710.0 | -86.80022 | 36.141613 |
| 2311 | 9454455.0 | Vumc anesthesiology | 1301 medical center drive | Nashville | TN | 372325614.0 | Meredith Coleman | Nashville | TN | 372325614.0 | -86.80022 | 36.141613 |
| 2312 | 8146872.0 | Vumc dept of oto med ctr east south tower | 1215 21st avenue south, suite 7209 | Nashville | TN | 372328605.0 | John Heaphy | Nashville | TN | 372328605.0 | -86.80022 | 36.141613 |


In [6]:
# rename the provider-side city, state and zip columns, then keep only the needed columns
# (this drops 'Unnamed: 0' and the duplicate *_y columns from the geocode table)
all_providers = all_providers.rename(columns={'city_x': 'city', 'state_x': 'state', 'zip_x': 'zip'})
all_providers = all_providers[['full_name', 'address', 'address2', 'city', 'state', 'zip', 'longitude', 'latitude']]
all_providers

| | full_name | address | address2 | city | state | zip | longitude | latitude |
|---|---|---|---|---|---|---|---|---|
| 0 | Ronald Brizendine | #6 sixth street suite 205 | NaN | Bristol | TN | 37620.0 | NaN | NaN |
| 1 | Tayebeh Asad sangabi | 1 innis brook lane | NaN | Brentwood | TN | 37027.0 | -86.731094 | 35.970486 |
| 2 | Tim Bongartz | 1 medical center blvd | NaN | Cookeville | TN | 38501.0 | -85.509520 | 36.170155 |
| 3 | Frances Thomason | 1 medical center blvd | NaN | Cookeville | TN | 385014294.0 | -85.509520 | 36.170155 |
| 4 | Jasmine Olive | 1 medical center blvd | NaN | Cookeville | TN | 38501.0 | -85.509520 | 36.170155 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2308 | Camellia Koleyni | Vanderbilt student health ctr | Zerfoss building, station 17 | Nashville | TN | 372320001.0 | -86.800220 | 36.141613 |
| 2309 | Benjamin Tillman | Vanderbilt university department of medicine | D 3100, medical center north | Nashville | TN | 372320001.0 | -86.800220 | 36.141613 |
| 2310 | Rachel Aholt | Vanderbilt university student ctr | Zerfoss building, station 17 | Nashville | TN | 372328710.0 | -86.800220 | 36.141613 |
| 2311 | Meredith Coleman | Vumc anesthesiology | 1301 medical center drive | Nashville | TN | 372325614.0 | -86.800220 | 36.141613 |


In [7]:
# NOTE: all_providers does not include the three public clinics in Davidson County, so I add them manually.
# TODO: the Lentz row repeats the East address ('1015 East Trinity Lane') although its coordinates differ;
# this looks like a copy-paste slip and should be checked against the clinic's actual address.
new_names = [['East Public Health Center', '1015 East Trinity Lane', 'Nashville', 'TN', -86.745286, 36.204273],
             ['Woodbine Public Health Center', '224 Oriel Avenue', 'Nashville', 'TN', -86.743627, 36.122097],
             ['Lentz Public Health Center', '1015 East Trinity Lane', 'Nashville', 'TN', -86.812991, 36.155043]]

new_names_df = pd.DataFrame(new_names, columns=['full_name', 'address', 'city', 'state', 'longitude', 'latitude'])

#confirm add
new_names_df 



| | full_name | address | city | state | longitude | latitude |
|---|---|---|---|---|---|---|
| 0 | East Public Health Center | 1015 East Trinity Lane | Nashville | TN | -86.745286 | 36.204273 |
| 1 | Woodbine Public Health Center | 224 Oriel Avenue | Nashville | TN | -86.743627 | 36.122097 |
| 2 | Lentz Public Health Center | 1015 East Trinity Lane | Nashville | TN | -86.812991 | 36.155043 |


In [8]:
# Add new_names_df; this should give me 2316 rows (2313 + 3)
# DataFrame.append was removed in pandas 2.0; pd.concat returns the same result
df = pd.concat([all_providers, new_names_df], sort=False)

# confirm
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2316 entries, 0 to 2
Data columns (total 8 columns):
full_name    2309 non-null object
address      2316 non-null object
address2     815 non-null object
city         2309 non-null object
state        2309 non-null object
zip          2306 non-null float64
longitude    1433 non-null float64
latitude     1433 non-null float64
dtypes: float64(3), object(5)
memory usage: 162.8+ KB
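The index line above reads "0 to 2" because the three added rows keep their own labels 0, 1 and 2, so those labels now occur twice. If a unique index is wanted, `pd.concat` can renumber the rows with `ignore_index=True`. A small sketch with made-up frames (`base` and `extra` are hypothetical names):

```python
import pandas as pd

# Made-up frames standing in for all_providers and new_names_df.
base = pd.DataFrame({"full_name": ["A", "B"], "zip": [37620.0, 37027.0]})
extra = pd.DataFrame({"full_name": ["Clinic"]})

# Default behaviour keeps each frame's own index labels.
stacked = pd.concat([base, extra], sort=False)

# ignore_index=True renumbers the combined rows from 0.
renumbered = pd.concat([base, extra], sort=False, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Columns missing from one frame, such as `zip` for the added clinic, are filled with NaN in both variants.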


In [9]:
# Export final file for use in Tableau
df.to_csv('../data/location/all_providers_geocoded.csv')
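By default `to_csv` also writes the index as an unnamed first column, which is how a column like `Unnamed: 0` ends up in a frame when the file is read back. If the index carries no meaning, passing `index=False` leaves it out. The sketch below writes to in-memory buffers rather than real files, and `df_demo` is a made-up frame.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"full_name": ["A", "B"], "latitude": [36.1, 35.9]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```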
