# Address Parsing NLP 
---


<h1> Introduction </h1>

Address parsing usually requires tedious manual work. Tools such as QAS (Oracle's Quick Address Search) can help, but QAS,
although a great resource, relies on rigid, predetermined algorithms. Addresses often need preprocessing before QAS can use them, especially when the address string contains additional information. Natural Language Processing (NLP) is well suited to parsing addresses out of long strings that carry additional information. NLP models are highly adaptable and can learn specific requirements from training data. The main requirement for fine-tuning is a large amount of data.


In this notebook, we will extract addresses from a large dataset in which the address strings come in various formats. The dataset was taken from the [Indiana Department of Environmental Management](<https://www.in.gov/idem/cleanups/investigation-and-cleanup-programs/emergency-response/>).


<h1>Prerequisites</h1>

- Data Preparations:
    - A cleaned dataset is needed. I have cleaned around 300 records from this dataset using regular expressions.
 
- Model Training Workflow & Library
    - We will be using [spaCy](<https://spacy.io/usage/spacy-101>), a library for processing large volumes of text and training NLP models.
    - The library supports Named Entity Recognition (NER), a method for extracting and categorizing information within text.
    - The model will extract subsections of the location string and categorize them.
 
  
Example:

<!-- Markdown -->
<div style="overflow-x:auto;">
    <table style="width:100%; margin-left: 0;">
        <thead>
            <tr>
                <th style="text-align:left;">Raw Location String</th>
                <th style="text-align:left;">Parsed Address</th>
                <th style="text-align:left;">Parsed City</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td style="text-align:left;">123 Something St. Tronto  </td>
                <td style="text-align:left;">123 Something St. </td>
                <td style="text-align:left;">Tronto</td>
            </tr>
        </tbody>
    </table>
</div>


- Despite 'Tronto' being misspelled, the parser only extracts a subsection of the text. This does not correct the spelling, but it guarantees that every parsed value is a substring of the raw data. In the future, we can build a separate NLP model to correct spelling and other common mistakes in city values.
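
As a rough illustration of that future spelling step, the sketch below uses Python's standard-library `difflib` instead of a trained NLP model, which is a deliberately simpler stand-in. The `KNOWN_CITIES` list and the `correct_city` helper are hypothetical names introduced only for this example.

```python
import difflib

# Hypothetical reference list; a real one would come from a gazetteer.
KNOWN_CITIES = ["TORONTO", "OTTAWA", "HAMILTON"]

def correct_city(parsed_city, known_cities=KNOWN_CITIES, cutoff=0.8):
    """Return the closest known city, or the input unchanged if none is close enough."""
    matches = difflib.get_close_matches(parsed_city.upper(), known_cities, n=1, cutoff=cutoff)
    return matches[0] if matches else parsed_city

print(correct_city("Tronto"))       # TORONTO
print(correct_city("Springfield"))  # Springfield (no close match, left as-is)
```

The `cutoff` value is a guess; a stricter cutoff leaves more values untouched, while a looser one risks replacing a correct but unlisted city.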

<h1>Necessary Imports</h1>

In [1]:
import spacy
from spacy.tokens import DocBin
import pandas as pd
import re

<h1>Data Preparation</h1>
Things I have learned working with spaCy:

1. The parsed data must be a substring of the original string.

Example:

<!-- Markdown -->
<div style="overflow-x:auto;">
    <table style="width:100%; margin-left: 0;">
        <thead>
            <tr>
                <th style="text-align:left;">Is it Problematic</th>
                <th style="text-align:left;">Raw Location String</th>
                <th style="text-align:left;">Parsed Address</th>
                <th style="text-align:left;">Parsed City</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <th style="text-align:left;">No</th>
                <td style="text-align:left;">123 Something St. Tronto  </td>
                <td style="text-align:left;">123 Something St. </td>
                <td style="text-align:left;">Tronto</td>
            </tr>
            <tr>
                <th style="text-align:left;">Yes</th>
                <td style="text-align:left;">123 Something St. Tronto  </td>
                <td style="text-align:left;">123 Something Street </td>
                <td style="text-align:left;">Toronto</td>
            </tr>
        </tbody>
    </table>
</div>


Notice that in the second row the Parsed Address column contains 'Street' instead of 'St.'. Since 'Street' does not appear in the original string, it cannot be located there and will cause an error. Likewise, in the Parsed City column 'Toronto' has been corrected, so 'Toronto' is not found in the original string either.
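
The substring rule can be checked before any training data is built. The sketch below is one possible check; `missing_components` is a hypothetical helper, not part of spaCy, and it reuses the two rows from the table above.

```python
def missing_components(location, components):
    """Return the labels whose value does not occur verbatim in the location string."""
    return [label for label, value in components.items()
            if value is not None and value.strip() not in location]

raw = "123 Something St. Tronto  "
ok_row = {"ADDRESS": "123 Something St. ", "CITY": "Tronto"}
bad_row = {"ADDRESS": "123 Something Street ", "CITY": "Toronto"}

print(missing_components(raw, ok_row))   # []
print(missing_components(raw, bad_row))  # ['ADDRESS', 'CITY']
```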

2. Two separate entities must not overlap.

<!-- Markdown -->
<div style="overflow-x:auto;">
    <table style="width:100%; margin-left: 0;">
        <thead>
            <tr>
                <th style="text-align:left;">Is it Problematic</th>
                <th style="text-align:left;">Raw Location String</th>
                <th style="text-align:left;">Parsed Address</th>
                <th style="text-align:left;">Parsed City</th>
                <th style="text-align:left;">Parsed Prov</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <th style="text-align:left;">No</th>
                <td style="text-align:left;">123 Onnis St. Tronto  Nt</td>
                <td style="text-align:left;">123 Onnis St. </td>
                <td style="text-align:left;">Tronto</td>
                <td style="text-align:left;">Nt</td>
            </tr>
            <tr>
                <th style="text-align:left;">Yes</th>
                <td style="text-align:left;">123 Onnis St. Tronto  On</td>
                <td style="text-align:left;">123 Onnis St. </td>
                <td style="text-align:left;">Tronto</td>
                <td style="text-align:left;">On</td>
            </tr>
        </tbody>
    </table>
</div>


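Notice in the second row that the string 'On' occurs in two places in the Raw Location String: 123 __On__ nis St. Tronto  __On__. A search for the province value finds the first occurrence, inside 'Onnis', so the province span overlaps the address span, and spaCy does not accept overlapping entities.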
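
To make the overlap concrete, the sketch below compares a plain first-match search with a search that skips occurrences overlapping spans that are already claimed. `find_free_span` is a hypothetical helper written for this illustration; the notebook itself does not use it.

```python
import re

def find_free_span(text, component, taken):
    """Return (start, end) of the first occurrence of `component` in `text`
    that does not overlap any (start, end) pair in `taken`, or None."""
    for match in re.finditer(re.escape(component), text):
        start, end = match.span()
        if all(end <= t_start or start >= t_end for t_start, t_end in taken):
            return (start, end)
    return None

raw = "123 Onnis St. Tronto  On"
address = find_free_span(raw, "123 Onnis St.", taken=[])
naive_prov = re.search("On", raw).span()           # first match, inside 'Onnis'
prov = find_free_span(raw, "On", taken=[address])  # skips the overlapping match

print(address, naive_prov, prov)  # (0, 13) (4, 6) (22, 24)
```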

<h1>Data Cleaning, Data Training, Validation & Test Subset</h1>

In [2]:
# Import data

df=pd.read_csv("ADDRESS_TRAINING_DATA_CLEANED.csv",sep=",",dtype=str)

# Example of data
print(df.head(15).to_markdown(index=False))



| LOCATION                                                                    | ADDRESS                | CITY         |   ZIP |   X |   Y |
|:----------------------------------------------------------------------------|:-----------------------|:-------------|------:|----:|----:|
| PEARL ST   LAUREL IN  FRANKLIN CO                                           | nan                    | LAUREL       |   nan | nan | nan |
| 200 Trowbridge Rd  Indianapolis, Marion Co  INDIANA RR YARD                 | 200 Trowbridge Rd      | Indianapolis |   nan | nan | nan |
| 487 Corn Creek Road  Bedford, KY                                            | 487 Corn Creek Road    | Bedford      |   nan | nan | nan |
| Ohio River, River Mile 475  Cincinnati, OH                                  | nan                    | Cincinnati   |   nan | nan | nan |
| 2403 US 31 LOT 51  PLYMOUTH  MARSHALL CO                                    | 2403 US 31             | PLYMOUTH     |   nan | nan | nan |
| Marion County  304

In [3]:
def file_specific_cleaning(df_train):

    # 1. uppercase everything

    # Convert all string columns to uppercase
    # df_train = df_train.applymap(lambda x: x.upper() if isinstance(x, str) else x)
    df_train = df_train.apply(lambda x: x.str.upper() if x.dtype == "object" else x)


    # 2. replace every & and @ in the ADDRESS field with AND (literal replacement, not a regex)

    df_train['ADDRESS'] = df_train['ADDRESS'].str.replace('&', 'AND', regex=False)
    df_train['ADDRESS'] = df_train['ADDRESS'].str.replace('@', 'AND', regex=False)


    # 3. escape characters such as + or (: they are special in regular expressions, so we have to escape them
    def escape_non_nan(value):
        if pd.isna(value):
            return value
        return re.escape(value)
    
    # Apply the function to the Y column (escaping the ADDRESS column is left disabled)
    # df_train['ADDRESS'] = df_train['ADDRESS'].apply(escape_non_nan)
    df_train['Y'] = df_train['Y'].apply(escape_non_nan)

    # 4. replace every @ and & in the LOCATION field with AND

    df_train['LOCATION'] = df_train['LOCATION'].str.replace('@', '&', regex=False)
    df_train['LOCATION'] = df_train['LOCATION'].str.replace('&', 'AND', regex=False)

    return df_train

df=file_specific_cleaning(df)
print(df.head(15).to_markdown(index=False))

| LOCATION                                                                    | ADDRESS                  | CITY         |   ZIP |   X |   Y |
|:----------------------------------------------------------------------------|:-------------------------|:-------------|------:|----:|----:|
| PEARL ST   LAUREL IN  FRANKLIN CO                                           | nan                      | LAUREL       |   nan | nan | nan |
| 200 TROWBRIDGE RD  INDIANAPOLIS, MARION CO  INDIANA RR YARD                 | 200 TROWBRIDGE RD        | INDIANAPOLIS |   nan | nan | nan |
| 487 CORN CREEK ROAD  BEDFORD, KY                                            | 487 CORN CREEK ROAD      | BEDFORD      |   nan | nan | nan |
| OHIO RIVER, RIVER MILE 475  CINCINNATI, OH                                  | nan                      | CINCINNATI   |   nan | nan | nan |
| 2403 US 31 LOT 51  PLYMOUTH  MARSHALL CO                                    | 2403 US 31               | PLYMOUTH     |   nan | nan | nan |
| MARI

In [4]:
# remove problematic records from above

# This span function outputs the indexes of the parsed data. We will do a check to see if any of these values are null

def get_address_span(address=None,address_component=None,label=None):
    '''Search for specified address component and get the span.
    Eg: get_address_span(address="221 B, Baker Street, London",address_component="221",label="BUILDING_NO") would return (0,3,"BUILDING_NO")'''
    
    if address_component is None or (pd.isna(address_component) or str(address_component).lower() == 'nan'):
        return None
    # The component is used as a regular expression pattern
    span=re.search(address_component,address)
    if span is None:
        # Component not found in the address string: there is no span to report
        return None
    return (span.start(),span.end(),label)




In [19]:

indexes = []

for i in range(df.shape[0]):
    LOC = df['LOCATION'][i]
    for thing in df:
        address_component = df[thing][i]
        if address_component is None or (pd.isna(address_component) or str(address_component).lower() == 'nan'):
            pass
        else:
            span = None  # reset so a failed search never reuses the previous match
            try:
                span=re.search(address_component,LOC)
            except re.error:
                # the component is not a valid regular expression
                pass
            if span is None:
                indexes.append(i)
                print(i)
                print(LOC)
                print(address_component)
print(indexes)

16
50980 ROUTE 13  MIDDLEBURRY, IN.  ELKHART COUNTY
MIDDLEBURY
19
6383 CRIMSON CIRCLE EAST DR (SOUTHPORT)  SPILL RAN OFF THIS PROPERTY, ALONG THE STREET CURB TOWARD SOUTH ABOUT 4-5 PROPERTIES AND INTO A STORM DRAIN
6383 CRIMSON CIRCLE EAST DR (SOUTHPORT)  SPILL RAN OFF THIS PROPERTY, ALONG THE STREET CURB TOWARD SOUTH ABOUT 4-5 PROPERTIES AND INTO A STORM DRAIN
91
ST RD 124 AND 200 E\  MT ETNA
ST RD 124 AND 200 E\  MT ETNA
100
1801 CRAWFORD ST  MIDDLETOWN, OH  45044
54044
164
491 SOUTH COUNTRY RD. 800 EAST/ AVON YARD  INDIANAPOLIS, IN    HENDRICKS COUNTY  (NRC STATED MARION COUNTY)
491 SOUTH COUNTRY RD. 800 EAST/ AVON YARD  INDIANAPOLIS, IN    HENDRICKS COUNTY  (NRC STATED MARION COUNTY)
164
491 SOUTH COUNTRY RD. 800 EAST/ AVON YARD  INDIANAPOLIS, IN    HENDRICKS COUNTY  (NRC STATED MARION COUNTY)
SR 327 AND CR 10
182
INTERSECTION OF 223 E AND 200 S  DANVILLE, IN
223 E AND 200 S '
191
GO LO  4321 E DUNES HIGHWAY (US12)  GARY
GO LO  4321 E DUNES HIGHWAY (US12)  GARY
193
SECTION 18, 6 SO
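
Several of the records listed above fail to match themselves because the component is passed to `re.search` as a pattern, where parentheses form a group instead of matching literally. One possible alternative to dropping such records, sketched here as an assumption and not as what the notebook does, is to escape the component first or to use a plain substring search.

```python
import re

location = "GO LO  4321 E DUNES HIGHWAY (US12)  GARY"
component = "GO LO  4321 E DUNES HIGHWAY (US12)  GARY"

# As a raw pattern, "(US12)" is a group that matches "US12" without parentheses.
raw_match = re.search(component, location)

# re.escape turns every metacharacter into a literal, so the text matches itself.
escaped_match = re.search(re.escape(component), location)

# A plain substring search avoids regular expressions altogether.
start = location.find(component)

print(raw_match)             # None
print(escaped_match.span())  # covers the whole string
print(start)                 # 0
```

Records where the component is genuinely different from the location text (for example the misspelled MIDDLEBURRY) would still be flagged by either approach.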

In [33]:
# We want to remove these indexes because they cause issues, sometimes because of a ( or a / in the string. Dropping these records is OK.

df_filtered = df.drop(indexes)
# print(df_filtered.head(30).to_markdown(index=False))


# Save the modified DataFrame to a CSV file
csv_filename = 'filtered_data.csv'
df_filtered.to_csv(csv_filename, index=False)


In [34]:
indexes_test = []

for index, row in df_filtered.iterrows():
    LOC = row['LOCATION']
    for thing in df_filtered:
        address_component = df_filtered[thing][index]
        if address_component is None or (pd.isna(address_component) or str(address_component).lower() == 'nan'):
            pass
        else:
            span = None  # reset so a failed search never reuses the previous match
            try:
                span=re.search(address_component,LOC)
            except re.error:
                # the component is not a valid regular expression
                pass
            if span is None:
                # record the row label of this DataFrame, not a leftover loop variable
                indexes_test.append(index)

print(indexes_test)

[]


<h2>Split data into Train, Validation and Test sets</h2>

In [35]:
# Get the number of rows in the DataFrame
num_rows = df_filtered.shape[0]

# Print the number of rows
print("Number of rows in the DataFrame:", num_rows)


# Training set: the first 182 rows (the 70% figure was worked out for 259 rows: 0.7 * 259 is about 182)

df_train = df_filtered.head(182)

# Validation set: the next 25 rows, positions 182 to 206 (roughly 10%)

df_val = df_filtered.iloc[182: 207]

# Test set: 51 rows, positions 208 to 258 (roughly 20%); position 207 and any rows from position 259 onward are not used
df_test = df_filtered.iloc[208: 259]


Number of rows in the DataFrame: 286
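
The hard-coded positions above were worked out for 259 rows, while the printed count is 286. A proportion-based split avoids that mismatch. The sketch below is an assumption about how it could be done: `split_positions` is a hypothetical helper, and shuffling with a fixed seed is a choice the original cell does not make (it takes rows in file order).

```python
import random

def split_positions(num_rows, train_frac=0.7, val_frac=0.1, seed=0):
    """Shuffle row positions and cut them into train/validation/test lists."""
    positions = list(range(num_rows))
    random.Random(seed).shuffle(positions)
    train_end = int(num_rows * train_frac)
    val_end = train_end + int(num_rows * val_frac)
    return positions[:train_end], positions[train_end:val_end], positions[val_end:]

train_pos, val_pos, test_pos = split_positions(286)
print(len(train_pos), len(val_pos), len(test_pos))  # 200 28 58
```

Each list can then be passed to `DataFrame.iloc`, for example `df_filtered.iloc[train_pos]`, to select the corresponding rows.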


In [36]:



def extend_list(entity_list,entity):
    if pd.isna(entity):
        return entity_list
    else:
        entity_list.append(entity)
        return entity_list




In [37]:

def create_entity_spans(df,tag_list):

    '''Create entity spans for training/test datasets'''
    df["AddressTag"]=df.apply(lambda row:get_address_span(address=row['LOCATION'],address_component=row['ADDRESS'],label='ADDRESS'),axis=1)
    df["CityTag"]=df.apply(lambda row:get_address_span(address=row['LOCATION'],address_component=row['CITY'],label='CITY'),axis=1)
    df["ZipTag"]=df.apply(lambda row:get_address_span(address=row['LOCATION'],address_component=row['ZIP'],label='ZIP'),axis=1)
    df["XTag"]=df.apply(lambda row:get_address_span(address=row['LOCATION'],address_component=row['X'],label='X'),axis=1)
    df["YTag"]=df.apply(lambda row:get_address_span(address=row['LOCATION'],address_component=row['Y'],label='Y'),axis=1)
    df['EmptySpan']=df.apply(lambda x: [], axis=1)

    for i in tag_list:
        df['EntitySpans']=df.apply(lambda row: extend_list(row['EmptySpan'],row[i]),axis=1)
        df['EntitySpans']=df[['EntitySpans','LOCATION']].apply(lambda x: (x.iloc[1], x.iloc[0]),axis=1)
    return df['EntitySpans']

In [38]:


def check_overlapping_spans(spans):
    """Utility function to check for overlapping spans"""
    sorted_spans = sorted(spans, key=lambda span: span.start)
    for i in range(1, len(sorted_spans)):
        if sorted_spans[i].start < sorted_spans[i - 1].end:
            return True
    return False


In [39]:




def get_doc_bin(training_data, nlp):
    
    '''Create DocBin object for building training/test corpus'''
    # the DocBin will store the example documents
    db = DocBin()
    for text, annotations in training_data:
        doc = nlp(text)  # Construct a Doc object
        ents = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)
            # else:
            #     print(span)
            #     print(text)
            #     print(annotations)
        # Check for overlapping spans
        if not check_overlapping_spans(ents):
            doc.ents = ents
            db.add(doc)
        else:
            # count += 1 #138
            print(f"Warning: Overlapping spans in text '{text}' with annotations '{annotations}'")

    return db
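
One detail of `get_doc_bin` worth illustrating: `doc.char_span` returns `None` when the character offsets do not line up with token boundaries, so such annotations are silently dropped. The sketch below uses a hypothetical location string to show this, along with the `alignment_mode="expand"` option that snaps the span outward to whole tokens. Whether expanding is appropriate depends on the data; it is shown only as an option.

```python
import spacy

nlp = spacy.blank("en")
text = "PEARL ST  LAURELTON IN"   # hypothetical location string
doc = nlp(text)

# The component "LAUREL" ends in the middle of the token "LAURELTON".
start = text.find("LAUREL")
end = start + len("LAUREL")

strict = doc.char_span(start, end, label="CITY")
expanded = doc.char_span(start, end, label="CITY", alignment_mode="expand")

print(strict)         # None: offsets do not match token boundaries
print(expanded.text)  # LAURELTON
```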


In [40]:
#Load blank English model. This is needed for initializing a Document object for our training/test set.
nlp = spacy.blank("en")

#Define custom entity tag list
tag_list=["AddressTag","CityTag","ZipTag","XTag","YTag"]

# Get entity spans
df_entity_spans= create_entity_spans(df_train.astype(str),tag_list)
training_data= df_entity_spans.values.tolist()


# # Get & Persist DocBin to disk
doc_bin_train= get_doc_bin(training_data,nlp)
doc_bin_train.to_disk("training.spacy")
# ######################################


# # Get entity spans for the validation set
df_entity_spans= create_entity_spans(df_val.astype(str),tag_list)
validation_data= df_entity_spans.values.tolist()

# # Get & Persist DocBin to disk (the validation set is stored in Test.spacy)
doc_bin_val= get_doc_bin(validation_data,nlp)
doc_bin_val.to_disk("Test.spacy")
# ##########################################



A DocBin packages all the documents into a single binary file.
This file stores the text and its annotations in a format optimized for speed and efficiency. It’s like putting all your documents into a compact, easy-to-carry container.
When you need to work with the data, you can quickly load the DocBin file, unpack the documents, and perform NLP tasks such as text classification, entity recognition, or sentiment analysis.
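
A minimal round trip, assuming a blank English pipeline and one record from the cleaned data, shows that entity annotations survive packing and unpacking. It serializes to bytes to stay self-contained; `to_disk` and `from_disk` work the same way with a file path.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
text = "200 TROWBRIDGE RD  INDIANAPOLIS"
doc = nlp(text)

# Annotate two entities by character offsets, as get_doc_bin does.
address_end = len("200 TROWBRIDGE RD")
city_start = text.find("INDIANAPOLIS")
doc.ents = [
    doc.char_span(0, address_end, label="ADDRESS"),
    doc.char_span(city_start, len(text), label="CITY"),
]

# Pack the document, then unpack it again with the same vocabulary.
data = DocBin(docs=[doc]).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))

entities = [(ent.text, ent.label_) for ent in restored[0].ents]
print(entities)
```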

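- Config file: training is driven by a spaCy config file, which the commands below fill in and pass to the trainer.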

In [41]:
# we can run some code through the command line to train our model

In [42]:
"""
python -m spacy init fill-config base_config.cfg config\config.cfg
python -m spacy train config\config.cfg --paths.train training.spacy --paths.dev Test.spacy --output output\models --training.eval_frequency 10 --training.max_steps 300
"""


In [43]:
"""
E    #       LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  --------  ------  ------  ------  ------
  0       0     65.07    6.90    5.33    9.76    0.07
  0      10    738.81    0.00    0.00    0.00    0.00
  1      20    295.99   45.28  100.00   29.27    0.45
  1      30    184.99   47.76   61.54   39.02    0.48
  2      40    569.01    9.52    9.30    9.76    0.10
  2      50    280.38   56.41   59.46   53.66    0.56
  3      60     97.77   59.74   63.89   56.10    0.60
  3      70    126.17   72.50   74.36   70.73    0.72
  4      80     71.42   73.68   80.00   68.29    0.74
  5      90     48.18   72.73   77.78   68.29    0.73
  5     100     33.35   74.70   73.81   75.61    0.75
  6     110     18.58   74.07   75.00   73.17    0.74
  7     120     24.04   76.71   87.50   68.29    0.77
  7     130     11.46   76.92   81.08   73.17    0.77
  8     140      7.35   72.29   71.43   73.17    0.72
  8     150     10.64   75.32   80.56   70.73    0.75
  9     160     10.74   77.50   79.49   75.61    0.77
 10     170      7.42   77.11   76.19   78.05    0.77
 10     180     11.04   78.57   76.74   80.49    0.79
 11     190      0.56   77.50   79.49   75.61    0.77
 12     200      8.41   77.50   79.49   75.61    0.77
 12     210     13.54   75.61   75.61   75.61    0.76
 13     220      7.26   79.01   80.00   78.05    0.79
 14     230     10.71   75.32   80.56   70.73    0.75
 14     240      0.83   76.54   77.50   75.61    0.77
 15     250      5.92   80.00   82.05   78.05    0.80
 16     260      5.97   80.00   82.05   78.05    0.80
 17     270      0.82   80.00   82.05   78.05    0.80
 17     280      3.36   77.92   83.33   73.17    0.78
 18     290      3.04   80.52   86.11   75.61    0.81
 19     300      4.51   81.82   76.60   87.80    0.82

"""


The model trains for 300 steps. By the end, the NER loss has dropped from a peak of 738.81 (step 10) to 4.51, and the entity F-score (ENTS_F) has risen to 81.82.
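
The ENTS_F column is the harmonic mean of ENTS_P (precision) and ENTS_R (recall). As a quick check, the small helper below (a name introduced here for illustration) reproduces the value logged at step 300.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Step 300 of the log: ENTS_P = 76.60, ENTS_R = 87.80
print(round(f_score(76.60, 87.80), 2))  # 81.82
```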


<h1>Test the model</h1>

In [44]:
# df_test


nlp=spacy.load(r"output\models\model-best")  # raw string so the backslashes are not treated as escapes

###Prediction output###



# location_list = df_test['LOCATION']
# correct_address = df_test['ADDRESS']
# correct_city = df_test['CITY']
# correct_ZIP = df_test['ZIP']
# correct_X = df_test['X']
# correct_Y = df_test['Y']

count_right_address = 0
count_right_city = 0
counter =0

for index, row in df_test.iterrows():
    LOC = row['LOCATION']
    doc=nlp(LOC)
    for ent in doc.ents:
        if ent.label_ == 'ADDRESS':
            print('this is the parsed address the model predicted: ' + ent.text)
            print('this is the pased address I manually parsed: ' + str(row['ADDRESS']))
            if ent.text ==str(row['ADDRESS']):
                count_right_address += 1
                print('yes')
            else:
                print('no')
            
        elif ent.label_ == 'CITY':
            # print('this is the parsed city the model predicted: ' + ent.text)
            # print('this is the pased city I manually parsed: ' + str(row['CITY']))
            if ent.text ==str(row['CITY']):
                count_right_city += 1
    counter+=1
                    
print(count_right_city)    
print(count_right_address)
print(counter)

this is the parsed address the model predicted: 87TH AND PULASKI RD
this is the pased address I manually parsed: 87TH AND PULASKI RD
yes
this is the parsed address the model predicted: 6795 E CR 600
this is the pased address I manually parsed: 6795 E CR 600
yes
this is the parsed address the model predicted: 600 EAST DALLAS ROAD
this is the pased address I manually parsed: 600 EAST DALLAS ROAD
yes
this is the parsed address the model predicted: 1030 E MARKET
this is the pased address I manually parsed: 1030 E MARKET
yes
this is the parsed address the model predicted: 301 N RANDOLPH ST
this is the pased address I manually parsed: 301 N RANDOLPH ST
yes
this is the parsed address the model predicted: 1119 S SR 3
this is the pased address I manually parsed: 1119 S SR 3
yes
this is the parsed address the model predicted: 8947 E DELAWARE PARKWAY
this is the pased address I manually parsed: 8947 E DELAWARE PARKWAY
yes
this is the parsed address the model predicted: DR. JAMES A DILLON PARK
thi

On the test set, the model gets about 51% of the addresses exactly right (26/51) and about 65% of the city values (33/51).
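
These percentages are exact-match rates: a prediction only counts when it equals the manually parsed value character for character. A small sketch of that metric follows; `exact_match_rate` is a hypothetical helper, and the third sample row is an invented miss added purely for illustration.

```python
def exact_match_rate(predicted, expected):
    """Share of rows where the predicted value equals the expected value exactly."""
    pairs = list(zip(predicted, expected))
    if not pairs:
        return 0.0
    hits = sum(1 for pred, exp in pairs if pred is not None and pred == exp)
    return hits / len(pairs)

predicted = ["87TH AND PULASKI RD", "6795 E CR 600", None]  # None: no ADDRESS entity found
expected = ["87TH AND PULASKI RD", "6795 E CR 600", "1030 E MARKET"]
print(f"{exact_match_rate(predicted, expected):.1%}")  # 66.7%

# The counts reported above, expressed the same way:
print(f"ADDRESS: {26 / 51:.1%}")  # 51.0%
print(f"CITY:    {33 / 51:.1%}")  # 64.7%
```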

<h1> Testing Raw data with the model:</h1>

In [45]:
# testing the model with raw data 

nlp=spacy.load(r"output\models\model-best")  # raw string so the backslashes are not treated as escapes


df_RAW=pd.read_csv("ADDRESS_DATA_RAW.csv",sep=",",dtype=str)


###Prediction output###



address_list = df_RAW['LOCATION'][1:100]

for address in address_list:
    doc=nlp(address)
    ent_list=[(ent.text, ent.label_) for ent in doc.ents]
    print("Address string -> "+address)
    print("Parsed address -> "+str(ent_list))
    print("******")


Address string -> 135th & New Ave, Lemont, IL  60439  
Parsed address -> [('135th & New', 'ADDRESS'), ('Lemont', 'CITY'), ('60439', 'ZIP')]
******
Address string -> 135th & New Ave  Lemont, IL  60439
Parsed address -> [('135th & New Ave', 'ADDRESS'), ('Lemont', 'CITY'), ('60439', 'ZIP')]
******
Address string -> Ranards Hauling, Removal and Recycling  2772 North State Road 157
Parsed address -> [('2772 North State Road 157', 'ADDRESS')]
******
Address string -> Milepost 298.3  Newell, IL
Parsed address -> [('Newell', 'CITY')]
******
Address string -> I 70 East Bound Mile Marker 112  Greenfield, IN
Parsed address -> [('Greenfield', 'CITY')]
******
Address string -> Toll Road MM 23 EB pavement and shoulder
Parsed address -> [('Toll Road MM 23', 'ADDRESS')]
******
Address string -> Lake Michigan & Lake Front Whihala Beach  Whiting, IN
Parsed address -> [('Lake Front Whihala Beach', 'CITY'), ('Whiting', 'CITY')]
******
Address string -> 6383 CRIMSON CIRCLE EAST DR (SOUTHPORT)  SPILL RAN OF

At a glance these values look pretty good, even though the model scored only about 50% accuracy on the test set. This shows that parsing the data reliably will require more training data with manually parsed addresses.
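
To turn the printed `(text, label)` pairs into columns like those in the example table at the top of this notebook, the pairs can be grouped by label. The sketch below uses a hypothetical helper, `ents_to_row`, and keeps every value when a label repeats, as happens with the two CITY predictions for the Whiting record.

```python
from collections import defaultdict

def ents_to_row(ent_list):
    """Group (text, label) pairs by label, keeping every value for repeated labels."""
    row = defaultdict(list)
    for text, label in ent_list:
        row[label].append(text)
    return dict(row)

print(ents_to_row([('135th & New Ave', 'ADDRESS'), ('Lemont', 'CITY'), ('60439', 'ZIP')]))
print(ents_to_row([('Lake Front Whihala Beach', 'CITY'), ('Whiting', 'CITY')]))
```

Rows with more than one value under a label are good candidates for manual review or for extra training examples.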

<h1>Next steps: The Future</h1>


ArcGIS has its own library, arcgis.learn, which also uses spaCy in combination with geocoding and ESRI tools. In the future we can incorporate it into the workflow as well.

Better training data = better results. In the future, training the model on a larger dataset should produce better outcomes.

Explore the parameters in the config file, such as the learning rate.

<h1>References</h1>
  I used many of the functions from the repository in this Medium article:
  
  [Building an Address Parser with spaCy](<https://medium.com/globant/building-an-address-parser-with-spacy-e3376b7cff>). 


  An example of the ArcGIS Learn module:
  
   [Information extraction from Madison city crime incident reports using Deep Learning](<https://developers.arcgis.com/python/samples/information-extraction-from-madison-city-crime-incident-reports-using-deep-learning/#References>). 

 

  
