### Improving American assignee geolocalization at county level


In this notebook I propose some code to create a crawler to improve the geolocalization of patents whose assignee address is in United States. The reason for sharing this lies in the fact that currently most of patent addresses are linked to GeoNames ZIPcodes, hence it is not possible to assign coordinates (or county) to all the  patents whose ZIP codes are not present in Geonames' list (https://www.geonames.org/postal-codes/postal-codes-us.html).
To improve this current situation, in my patent geolocalization effort I create a crawler to find geographic information for these unmatched ZIP codes through unitedstateszipcodes.org (https://www.unitedstateszipcodes.org/).

In my research work, I managed to assign to an American county almost 98% of all patents granted at USPTO (100% for green patents, defined as such according to WIPO Green Inventory or OECD Green tech definition), which have at least one assignee residing within the US. To create the complete geolocalized (at county level) dataset, starting from the application dataset of patentsview (https://patentsview.org/download/data-download-tables), I used a number of sources:
1- raw location data from patentsview (https://patentsview.org/download/data-download-tables)
2- USPTO assignee dataset (extracting zipcode from the addresses, where present) (https://assignment.uspto.gov/patent/index.html#/patent/search)
3- location dataset by Gaetan de Rassenfosse (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3425764)
Then, when even using this sources there wasn't the necessary detail to geolocalize the patent at county level, I used two different strategies:
1-in the first place, for those patents missing zipcodes, I recovered them using a public API (https://geo.fcc.gov/api/census/area) to get them using their latitude and longitude
2- eventually, for those still lacking county information despite having a zip code, I created a crawler to scrape zip codes from unitedstateszipcodes.org and to get in that way the county of the remaining patents.

The complete dataset of US assigned patents accounts for 4,067,011 obs, including patents granted between 1980 and 2020 and it is available at the following link: https://drive.google.com/drive/folders/1FsDDmwhfARg83SCSVcxeiHylW8h-Er73?usp=sharing
In the same folder you can also find a dataset about geolocation (US, county level) of green patents (263'630 obs) 


In [27]:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

from bs4 import BeautifulSoup as bs
import time, json
import pandas as pd

In [28]:
data = pd.read_csv("/allbut30K.csv")#last 30K patents with no county assigned

  exec(code_obj, self.user_global_ns, self.user_ns)


In [29]:
nacode = data[(data["county"].isna()) & (data["postcode"].notna())] #Looking for patents with no postcode

In [30]:
codes = list(set(nacode["postcode"]))

In [31]:
URL = "https://www.unitedstateszipcodes.org/" #Starting creating the actual crawler for unitedstateszipcodes

In [40]:
def webDriver(incognito: bool=True, headless: bool=False, 
              geckodriver: str = None,
              tor: bool = True,
              torrc: str = "/etc/tor/torrc"):
    
    options = Options()
    options.headless = headless
    
    if torrc is not None and tor is True:

        options.set_preference("network.proxy.type", 1)
        options.set_preference("network.proxy.socks_version", 5)
        options.set_preference("network.proxy.socks", "127.0.0.1")
        options.set_preference("network.proxy.socks_port", 9050)
        options.set_preference("network.proxy.socks_remote_dns", False)
    
    options.set_preference("network.proxy.type", 1)
    options.set_preference("dom.webdriver.enabled", False)
    options.set_preference("useAutomationExtension", False)
    options.set_preference("devtools.jsonview.enabled", False)

    
    web = webdriver.Firefox(
        options=options,
        executable_path=geckodriver,
    )
    
    web.get("https://check.torproject.org/")
    return web

In [41]:
d = webDriver(incognito=True, headless=False, geckodriver="./geckodriver_1")

  web = webdriver.Firefox(


In [34]:
def get_info(webdriver, zipcode):
    
    def city(dict_):
        post_office = dict_["Post Office City"]
        if "(View All Cities)" in post_office:
            post_office = post_office.replace("(View All Cities)", "")
        city, state = post_office.split(",")
        return city.strip(), state.strip()
    
    def county(dict_):
        county = dict_["County"]
        if "County" in county:
            county = county.replace("County", "")
        return county.strip()
    
    def geocode(dict_):
        geoc = dict_["Coordinates"]
        if "ZIP" in geoc:
            geoc = geoc.replace("ZIP", "***")
            a, b = geoc.split("***")

        lat, lon = a.split(",")
        return float(lat.strip()), float(lon.strip())
    
    
    webdriver.get(URL)
    
    input_, info = webdriver.find_element(By.ID, "q"), {}
    webdriver.execute_script('arguments[0].value = "";', input_)
    input_.send_keys(zipcode)
    input_.send_keys(Keys.ENTER)
    time.sleep(1)
    content = bs(webdriver.page_source, "html.parser")
    
    if content.find("div", {"class": "alert alert-danger text-center"}):
        print("Not Found")
        return {"city": "", "state": "", "county": "",
               "latitude": "", "longitude": ""}
    
    try:
        tab = content.find(
            {
                "table": {
                    "class": "table"
                }
            }
        )

        for c in tab.find_all("tr"):
            key = c.find("th")
            value = c.find("td")
            info.update(
                {
                    key.text.strip()[:-1]: value.text.strip()
                }
            )

        try:
            city_, state_ = city(info)
            county_ = county(info)
            lat, lon = geocode(info)    
            i =  {"city": city_, "state": state_, "county": county_,
                   "latitude": lat, "longitude": lon}
            print(f"OK postalcode {zipcode}: {i}")
            return i

        except (KeyError, Exception) as e:
            print(f"ERROR: {e}")
            return {"city": "", "state": "", "county": "",
                   "latitude": "", "longitude": ""}
    except (KeyError, Exception) as e:
        print(f"ERROR: {e}")
        return {"city": "", "state": "", "county": "",
           "latitude": "", "longitude": ""}

In [35]:
with open("./results.json", "r") as f:
    cod = json.load(f)

In [43]:
with open("./results.json", "w") as f:
    json.dump(cod, f)

In [42]:
for code in codes:
    if code not in cod.keys() or not cod[code]["county"]:
        info = get_info(d, code)
        cod.update({code: info})
    else:                   
        continue

Not Found
Not Found
Not Found
Not Found
ERROR: 'Post Office City'
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
ERROR: local variable 'a' referenced before assignment
Not Found
OK postalcode 01845-2649: {'city': 'North Andover', 'state': 'MA', 'county': 'Essex', 'latitude': 42.7, 'longitude': -71.13}
Not Found
ERROR: 'Post Office City'
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
ERROR: local variable 'a' referenced before assignment
ERROR: local variable 'a' referenced before assignment
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
OK postalcode 01983: {'city': 'Topsfield', 'state': 'MA', 'county': 'Essex', 'latitude': 42.64, 'longitude': -70.94}
Not Found
Not Found
Not Found
Not Found
Not Found
ERROR: local variable 'a' referenced before assignment
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
Not Found
ERROR: 'Post Office City'
Not Found
Not Found


OK postalcode 01746-2029: {'city': 'Holliston', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.2, 'longitude': -71.43}
OK postalcode 01702-5406: {'city': 'Framingham', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.28, 'longitude': -71.42}
OK postalcode 06831-5118: {'city': 'Greenwich', 'state': 'CT', 'county': 'Fairfield', 'latitude': 41.03, 'longitude': -73.67}
Not Found
OK postalcode 08234-4075: {'city': 'Egg Harbor Township', 'state': 'NJ', 'county': 'Atlantic', 'latitude': 39.4, 'longitude': -74.6}
OK postalcode 01590: {'city': 'Sutton', 'state': 'MA', 'county': 'Worcester', 'latitude': 42.14, 'longitude': -71.76}
ERROR: local variable 'a' referenced before assignment
OK postalcode 06468-2412: {'city': 'Monroe', 'state': 'CT', 'county': 'Fairfield', 'latitude': 41.34, 'longitude': -73.25}
OK postalcode 01748-9103: {'city': 'Hopkinton', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.23, 'longitude': -71.53}
OK postalcode 03070: {'city': 'New Boston', 'state': 'NH'

Not Found
OK postalcode 02460: {'city': 'Newtonville', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.35, 'longitude': -71.21}
Not Found
OK postalcode 01833: {'city': 'Georgetown', 'state': 'MA', 'county': 'Essex', 'latitude': 42.72, 'longitude': -70.98}
OK postalcode 06513: {'city': 'New Haven', 'state': 'CT', 'county': 'New Haven', 'latitude': 41.32, 'longitude': -72.86}
OK postalcode 06890: {'city': 'Southport', 'state': 'CT', 'county': 'Fairfield', 'latitude': 41.14, 'longitude': -73.29}
OK postalcode 06111: {'city': 'Newington', 'state': 'CT', 'county': 'Hartford', 'latitude': 41.69, 'longitude': -72.73}
Not Found
OK postalcode 03848: {'city': 'Kingston', 'state': 'NH', 'county': 'Rockingham', 'latitude': 42.91, 'longitude': -71.06}
OK postalcode 02461-1951: {'city': 'Newton Highlands', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.32, 'longitude': -71.21}
OK postalcode 02061-1605: {'city': 'Norwell', 'state': 'MA', 'county': 'Plymouth', 'latitude': 42.17, 'longitude'

OK postalcode 08318-4563: {'city': 'Elmer', 'state': 'NJ', 'county': 'Salem', 'latitude': 39.6, 'longitude': -75.2}
ERROR: 'NoneType' object has no attribute 'find_all'
OK postalcode 05862: {'city': 'Peacham', 'state': 'VT', 'county': 'Caledonia', 'latitude': 44.33, 'longitude': -72.16}
Not Found
Not Found
OK postalcode 03846: {'city': 'Jackson', 'state': 'NH', 'county': 'Carroll', 'latitude': 44.19, 'longitude': -71.18}
OK postalcode 03840: {'city': 'Greenland', 'state': 'NH', 'county': 'Rockingham', 'latitude': 43.03, 'longitude': -70.85}
OK postalcode 06416-2072: {'city': 'Cromwell', 'state': 'CT', 'county': 'Middlesex', 'latitude': 41.61, 'longitude': -72.68}
OK postalcode 07069: {'city': 'Watchung', 'state': 'NJ', 'county': 'Somerset', 'latitude': 40.64, 'longitude': -74.44}
Not Found
OK postalcode 02879: {'city': 'Wakefield', 'state': 'RI', 'county': 'Washington', 'latitude': 41.43, 'longitude': -71.54}
OK postalcode 02647: {'city': 'Hyannis Port', 'state': 'MA', 'county': 'Barns

OK postalcode 08817: {'city': 'Edison', 'state': 'NJ', 'county': 'Middlesex', 'latitude': 40.52, 'longitude': -74.4}
Not Found
OK postalcode 06905-5619: {'city': 'Stamford', 'state': 'CT', 'county': 'Fairfield', 'latitude': 41.1, 'longitude': -73.55}
OK postalcode 02053: {'city': 'Medway', 'state': 'MA', 'county': 'Norfolk', 'latitude': 42.16, 'longitude': -71.43}
Not Found
OK postalcode 05043: {'city': 'East Thetford', 'state': 'VT', 'county': 'Orange', 'latitude': 43.81, 'longitude': -72.22}
ERROR: 'NoneType' object has no attribute 'find_all'
OK postalcode 06371: {'city': 'Old Lyme', 'state': 'CT', 'county': 'New London', 'latitude': 41.4, 'longitude': -72.3}
OK postalcode 06901: {'city': 'Stamford', 'state': 'CT', 'county': 'Fairfield', 'latitude': 41.054, 'longitude': -73.538}
Not Found
Not Found
Not Found
Not Found
OK postalcode 02043-4312: {'city': 'Hingham', 'state': 'MA', 'county': 'Plymouth', 'latitude': 42.25, 'longitude': -70.92}
OK postalcode 02861-1900: {'city': 'Pawtucke

OK postalcode 02806: {'city': 'Barrington', 'state': 'RI', 'county': 'Bristol', 'latitude': 41.74, 'longitude': -71.32}
OK postalcode 95034-7200: {'city': 'Morgan Hill', 'state': 'CA', 'county': 'Santa Clara', 'latitude': 37.1, 'longitude': -121.6}
ERROR: local variable 'a' referenced before assignment
OK postalcode 01776: {'city': 'Sudbury', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.39, 'longitude': -71.42}
OK postalcode 08502: {'city': 'Belle Mead', 'state': 'NJ', 'county': 'Somerset', 'latitude': 40.44, 'longitude': -74.65}
OK postalcode 07014: {'city': 'Clifton', 'state': 'NJ', 'county': 'Passaic', 'latitude': 40.83, 'longitude': -74.14}
Not Found
OK postalcode 02127: {'city': 'South Boston', 'state': 'MA', 'county': 'Suffolk', 'latitude': 42.33, 'longitude': -71.04}
OK postalcode 05047: {'city': 'Hartford', 'state': 'VT', 'county': 'Windsor', 'latitude': 43.7, 'longitude': -72.4}
OK postalcode 01001: {'city': 'Agawam', 'state': 'MA', 'county': 'Hampden', 'latitude': 42.

Not Found
OK postalcode 08535: {'city': 'Millstone Township', 'state': 'NJ', 'county': 'Monmouth', 'latitude': 40.23, 'longitude': -74.45}
Not Found
Not Found
Not Found
OK postalcode 06450-7195: {'city': 'Meriden', 'state': 'CT', 'county': 'New Haven', 'latitude': 41.53, 'longitude': -72.79}
ERROR: 'Post Office City'
OK postalcode 07083: {'city': 'Union', 'state': 'NJ', 'county': 'Union', 'latitude': 40.69, 'longitude': -74.27}
Not Found
OK postalcode 04032: {'city': 'Freeport', 'state': 'ME', 'county': 'Cumberland', 'latitude': 43.86, 'longitude': -70.1}
Not Found
OK postalcode 01851: {'city': 'Lowell', 'state': 'MA', 'county': 'Middlesex', 'latitude': 42.62, 'longitude': -71.34}
OK postalcode 95314: {'city': 'Pinecrest', 'state': 'CA', 'county': 'Tuolumne', 'latitude': 38.3, 'longitude': -119.9}
OK postalcode 05363: {'city': 'Wilmington', 'state': 'VT', 'county': 'Windham', 'latitude': 42.9, 'longitude': -72.9}
OK postalcode 08810: {'city': 'Dayton', 'state': 'NJ', 'county': 'Middles

In [44]:
new_data = pd.DataFrame.from_dict(cod, orient="index")
new_data["postcode"] = new_data.index
new_data.reset_index(drop=True, inplace=True)

In [48]:
new_data.to_csv("/mxxx/new_county.csv", index=False)

In [47]:
pd.merge(data, new_data, on="postcode", how="left")

Unnamed: 0,patent_id,city_x,state_x,county_x,postcode,postcode_clean,lats_x,lngs_x,county1,city_y,state_y,county_y,latitude,longitude
0,10000015,,MASSACHUSETTS,MIDDLESEX,02421,2421.0,,,MIDDLESEX,Lexington,MA,Middlesex,42.44,-71.23
1,10000042,NEWARK,OHIO,LICKING,43055,43055.0,40.072399,-82.404602,,,,,,
2,10000042,COLUMBUS,OHIO,FRANKLIN,43215,43215.0,39.967098,-83.004402,,,,,,
3,10000042,COLUMBUS,OHIO,DELAWARE,43215,43215.0,39.967098,-83.004402,,,,,,
4,10000053,,NEW HAMPSHIRE,STRAFFORD,03824,3824.0,,,STRAFFORD,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
440852,RE48184,,MASSACHUSETTS,BRISTOL,02767,2767.0,,,BRISTOL,Raynham,MA,Bristol,41.94,-71.06
440853,RE48193,SCHENECTADY,NEW YORK,SCHENECTADY,12309,12309.0,42.809101,-73.869301,,,,,,
440854,RE48247,SAN DIEGO,CALIFORNIA,SAN DIEGO,92121,92121.0,32.891899,-117.203500,MIDDLESEX,,,,,
440855,RE48247,,MASSACHUSETTS,MIDDLESEX,02472,2472.0,,,MIDDLESEX,Watertown,MA,Middlesex,42.37,-71.18
