Download and Unzip Stanford NER

---


This cell handles the initial setup by downloading the necessary files from Stanford's servers.

Define Paths: It starts by defining the URL for the Stanford NER software, the name of the directory it will be extracted to, and the name for the downloaded zip file.

Download: It checks if the zip file (stanford-ner.zip) already exists. If not, it downloads the file from the specified URL using urllib.request.urlretrieve.

Extract: After the download, it checks if the target directory (stanford-ner-2018-10-16) exists. If not, it unzips the downloaded file into the current directory using the zipfile library.

Essentially, this cell automates the process of getting the required third-party software onto your machine.

In [26]:
import os
import urllib.request
import zipfile

Initialize the NER Tagger

---


This cell prepares and loads the core NER tagger from the files that were just extracted.

Set File Paths: It creates the full paths to two critical files:

jar: The executable Java archive (.jar) file that contains the Stanford NER program logic.

model: A pre-trained machine learning model file (.ser.gz) that knows how to identify entities in English text. This specific model is trained to recognize four classes of entities: PERSON, ORGANIZATION, LOCATION, and MISC.

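Initialize Tagger: It checks that both the jar and model files actually exist at the specified paths. If they do, it initializes the StanfordNERTagger from the NLTK library, pointing it to these two files; if either is missing, it prints an error message instead. Because NLTK runs the JAR in a separate Java process, a Java runtime must also be installed and discoverable (for example, on the PATH) before the tagger can be used.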
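Since the tagger depends on an external Java runtime, a quick pre-flight check can save a confusing error later. The following is a minimal sketch, assuming `java` is expected on the PATH; the helper name `java_available` is hypothetical and not part of NLTK.

```python
import shutil

def java_available(executable="java"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None

if not java_available():
    print("Java was not found on the PATH; StanfordNERTagger may fail to run.")
```

This only inspects the PATH, so a Java installation configured some other way could still work even if the check reports False.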



In [27]:
# Define the download URL and target directory
stanford_ner_url = 'https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip'
stanford_ner_dir = 'stanford-ner-2018-10-16'
stanford_ner_zip = 'stanford-ner.zip'

In [28]:
# Download the Stanford NER library
if not os.path.exists(stanford_ner_zip):
    print(f"Downloading {stanford_ner_url}...")
    urllib.request.urlretrieve(stanford_ner_url, stanford_ner_zip)
    print("Download complete.")
else:
    print(f"{stanford_ner_zip} already exists.")

# Extract the downloaded zip file
if not os.path.exists(stanford_ner_dir):
    print(f"Extracting {stanford_ner_zip}...")
    with zipfile.ZipFile(stanford_ner_zip, 'r') as zip_ref:
        zip_ref.extractall('.')
    print("Extraction complete.")
else:
    print(f"{stanford_ner_dir} already exists.")

Downloading https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip...
Download complete.
Extracting stanford-ner.zip...
Extraction complete.
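One caveat of the existence checks above: an interrupted download can leave a partial stanford-ner.zip on disk, which the next run would treat as already downloaded. A minimal sketch of an integrity check using only the standard zipfile module is shown here; the helper name `zip_is_intact` is an assumption for illustration.

```python
import zipfile

def zip_is_intact(path):
    """Return True if `path` is a readable zip archive whose members pass their CRC checks."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path, "r") as zip_ref:
            # testzip() returns the name of the first corrupt member, or None if all are fine
            return zip_ref.testzip() is None
    except zipfile.BadZipFile:
        return False
```

If the check fails for stanford_ner_zip, deleting the file and re-running the download cell is one way to recover.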


In [29]:
from nltk.tag.stanford import StanfordNERTagger
import os

# Build the jar and model paths inside the extracted directory
jar = os.path.join(stanford_ner_dir, 'stanford-ner.jar')
model = os.path.join(stanford_ner_dir, 'classifiers', 'english.conll.4class.distsim.crf.ser.gz')

# Check if the files exist before initializing the tagger
if os.path.exists(jar) and os.path.exists(model):
    ner_tagger = StanfordNERTagger(model, jar, encoding='utf8')
    print("StanfordNERTagger initialized successfully.")
else:
    print(f"Error: Could not find {jar} or {model}. Please check the paths.")

StanfordNERTagger initialized successfully.


Initialize a Specific 4-Class Model (Redundant)

---


This cell is functionally identical to the previous one.

It re-defines the path to the same 4-class English model and initializes another StanfordNERTagger instance named st_4class. Since the variable ner_tagger was already created in the previous cell using the exact same model and JAR file, this cell is redundant but reinforces how the tagger is loaded.

In [30]:
import os

st_4class_model_name = 'english.conll.4class.distsim.crf.ser.gz'
st_4class_model_path = os.path.join(stanford_ner_dir, 'classifiers', st_4class_model_name)

# Check if the file exists before initializing the tagger
if os.path.exists(st_4class_model_path) and os.path.exists(jar):
    st_4class = StanfordNERTagger(st_4class_model_path, jar, encoding='utf8')
    print("StanfordNERTagger with 4-class model initialized successfully.")
else:
    print(f"Error: Could not find {st_4class_model_path} or {jar}. Please check the paths.")

StanfordNERTagger with 4-class model initialized successfully.
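Because the two initialization cells repeat the same path-building and existence checks, that logic could be factored into one helper. This is only a sketch under the directory layout used in this notebook (stanford-ner.jar at the top level, models under classifiers); the function name `resolve_ner_paths` is hypothetical.

```python
import os

def resolve_ner_paths(base_dir, model_name):
    """Return (jar_path, model_path) under base_dir, raising if either file is missing."""
    jar_path = os.path.join(base_dir, "stanford-ner.jar")
    model_path = os.path.join(base_dir, "classifiers", model_name)
    for path in (jar_path, model_path):
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Required Stanford NER file not found: {path}")
    return jar_path, model_path
```

With such a helper, a tagger could be created as StanfordNERTagger(model_path, jar_path, encoding='utf8') after a single call such as resolve_ner_paths(stanford_ner_dir, 'english.conll.4class.distsim.crf.ser.gz').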


Perform NER Tagging on a Sentence

---


This is the cell where the NER tagger is put to use.

Define Sentence: A sample sentence, "Barack Obama is the 44th President of the United States.", is created.

Tag Sentence: The .tag() method of ner_tagger is called. It expects a list of tokens, which here is produced with sentence.split(). Whitespace splitting leaves punctuation attached to the neighboring word, so the last token is "States." rather than "States". The tagger processes this list and assigns an entity tag to each token.

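Print Results: The output is a list of tuples. Each tuple contains a token from the original sentence and its predicted entity tag. For this sentence, you would expect to see "Barack" and "Obama" tagged as PERSON, "United" and "States." tagged as LOCATION, and every other token tagged as O (not part of an entity).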
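To keep punctuation out of the tokens, the sentence can be tokenized before tagging instead of split on whitespace. The sketch below uses a simple regular expression as a rough stand-in; `simple_tokenize` is a hypothetical helper, and a dedicated tokenizer such as nltk.word_tokenize (which needs additional NLTK data) would handle abbreviations and contractions better.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Barack Obama is the 44th President of the United States."
tokens = simple_tokenize(sentence)
print(tokens)
# ['Barack', 'Obama', 'is', 'the', '44th', 'President', 'of', 'the', 'United', 'States', '.']
```

The resulting list could be passed to ner_tagger.tag(tokens) in place of sentence.split(). Note that this regex would also break a token like "U.S." into pieces, so it is only suitable for simple text.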

In [31]:
sentence = "Barack Obama is the 44th President of the United States."

tagged_sentence = ner_tagger.tag(sentence.split())

print(tagged_sentence)

[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'), ('44th', 'O'), ('President', 'O'), ('of', 'O'), ('the', 'O'), ('United', 'LOCATION'), ('States.', 'LOCATION')]
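The tagger labels each token separately, so multi-word names such as "Barack Obama" appear as consecutive tokens with the same tag. A minimal sketch for merging them into whole entities follows; `group_entities` is a hypothetical helper, and it assumes that adjacent tokens with the same tag belong to one entity.

```python
from itertools import groupby

def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' tag into (text, tag) pairs."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(word for word, _ in group), tag))
    return entities

tagged = [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('the', 'O'),
          ('44th', 'O'), ('President', 'O'), ('of', 'O'), ('the', 'O'),
          ('United', 'LOCATION'), ('States.', 'LOCATION')]
print(group_entities(tagged))
# [('Barack Obama', 'PERSON'), ('United States.', 'LOCATION')]
```

Because the tags carry no begin/inside markers, two distinct entities of the same type that sit next to each other would be merged into one by this approach.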
