# Conditional Random Fields

CRFs are powerful structured prediction models, commonly used for sequence labeling tasks such as Named Entity Recognition (NER) in NLP.

The problem with an ordinary classifier such as logistic regression or a decision tree is that it labels every word independently, which makes it hard to capture dependencies between neighboring labels. CRFs are discriminative probabilistic models that score the whole label sequence at once, so they are particularly well suited to sequence labeling.

Let's take an example: we have a sequence of observations (the words of a sentence) and we have to assign a sequence of labels to them (such as POS tags or named entity tags).

For this example, we will use the CoNLL003 dataset from Kaggle.

### Importing Libraries and Loading the data

What we need:
- The CRF class and the metrics module from the sklearn-crfsuite package (imported as sklearn_crfsuite)
- Path constants that point to the training, validation, and test files of the dataset

In [2]:
import os
from sklearn_crfsuite import CRF, metrics
import nltk  # word_tokenize (used in the Prediction section) needs NLTK's Punkt tokenizer data

# Adjust DATA_DIR to wherever the dataset is stored on your machine.
DATA_DIR = '/home/ntejha/Projects/ML-Algo-Implementation/data/CoNLL003'
TRAIN_FILE = os.path.join(DATA_DIR, 'train.txt')
VALID_FILE = os.path.join(DATA_DIR, 'valid.txt')
TEST_FILE = os.path.join(DATA_DIR, 'test.txt')

### Understanding the data

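The data is stored in plain text files with one token per line and a blank line between sentences, so we read each file and collect its sentences in the all_sentences list. Each sentence is a list of (word, POS tag, chunk tag, NER tag) tuples.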

In [3]:
def read_story_data(filepath):
    """
    This function reads our special story files.
    It breaks them into sentences. Each 'word' in a sentence
    comes with its type (like Noun, Verb) and its 'name' tag (like B-PER for start of Person).
    """
    all_sentences = []
    current_sentence = [] 

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()

            # A blank line or a -DOCSTART- marker ends the current sentence.
            if line == '' or line.startswith('-DOCSTART-'):
                if current_sentence:
                    all_sentences.append(current_sentence)
                    current_sentence = []
            else:
                # Each token line holds four columns: word, POS tag, chunk tag, NER tag.
                parts = line.split()
                if len(parts) == 4:
                    current_sentence.append(tuple(parts))
                else:
                    print(f"Heads up! Skipping a weird line: {line}")

    if current_sentence:
        all_sentences.append(current_sentence)
    return all_sentences

print("Reading the detective's training manuals...")
training_stories = read_story_data(TRAIN_FILE)
validation_stories = read_story_data(VALID_FILE)
test_stories = read_story_data(TEST_FILE)

print(f"Found {len(training_stories)} training stories.")
print(f"Found {len(validation_stories)} validation stories.")
print(f"Found {len(test_stories)} test stories.")

print("\nHere's how a small part of a training story looks:")
print(training_stories[0][:5])

Reading the detective's training manuals...
Found 14041 training stories.
Found 3250 validation stories.
Found 3453 test stories.

Here's how a small part of a training story looks:
[('EU', 'NNP', 'B-NP', 'B-ORG'), ('rejects', 'VBZ', 'B-VP', 'O'), ('German', 'JJ', 'B-NP', 'B-MISC'), ('call', 'NN', 'I-NP', 'O'), ('to', 'TO', 'B-VP', 'O')]
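
The raw file layout is easier to see on a small in-memory sample. The sketch below is a compact variant of the reader, written as a generator, and it runs on a hand-written sample: the first sentence reuses tokens printed above, the second is invented. `iter_sentences` is a hypothetical name, and unlike read_story_data it skips malformed lines silently.

```python
import io

# Hand-written sample in the four-column layout: word, POS tag, chunk tag, NER tag.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Tokyo NNP B-NP B-LOC
"""

def iter_sentences(lines):
    """Yield one sentence at a time as a list of (word, pos, chunk, ner) tuples."""
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers separate sentences.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        parts = line.split()
        if len(parts) == 4:  # lines with another column count are skipped
            sentence.append(tuple(parts))
    if sentence:
        yield sentence

sample_sentences = list(iter_sentences(io.StringIO(SAMPLE)))
print(len(sample_sentences))   # 2
print(sample_sentences[0][0])  # ('EU', 'NNP', 'B-NP', 'B-ORG')
```

An open file handle can be passed in place of the StringIO object, since both yield lines when iterated.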


### Giving the features to the Model

Features are the clues the model uses to make its decisions. For NER, we extract clues from each word in a sentence, and from the words directly before and after it, to help the model decide whether that word is part of a person name, a location, an organization, or nothing special.

In [4]:
def get_clues_for_word(one_sentence_facts, word_position):
    """
    This function creates a 'clue list' (features) for a single word.
    It looks at the word itself, its tags (POS, Chunk), and its neighbors.
    """
    word, pos_tag, chunk_tag, _ = one_sentence_facts[word_position] 

    clues = {
        'always_on': 1.0, 
        'word_lowercase': word.lower(), 
        'last_3_letters': word[-3:],   
        'last_2_letters': word[-2:],   
        'is_all_caps': word.isupper(),  
        'starts_with_cap': word.istitle(), 
        'is_a_number': word.isdigit(),  
        'part_of_speech': pos_tag,       
        'first_2_pos': pos_tag[:2],      
        'chunk_tag': chunk_tag,          
        'first_2_chunk': chunk_tag[:2],  
    }

    
    if word_position > 0:
        word_before, pos_before, chunk_before, _ = one_sentence_facts[word_position-1] 
        clues.update({ 
            '-1_word_lowercase': word_before.lower(),
            '-1_starts_with_cap': word_before.istitle(),
            '-1_is_all_caps': word_before.isupper(),
            '-1_part_of_speech': pos_before,
            '-1_first_2_pos': pos_before[:2],
            '-1_chunk_tag': chunk_before,
            '-1_first_2_chunk': chunk_before[:2],
        })
    else:
        clues['is_start_of_sentence'] = True 

   
    if word_position < len(one_sentence_facts) - 1: 
        word_after, pos_after, chunk_after, _ = one_sentence_facts[word_position+1]
        clues.update({ 
            '+1_word_lowercase': word_after.lower(),
            '+1_starts_with_cap': word_after.istitle(),
            '+1_is_all_caps': word_after.isupper(),
            '+1_part_of_speech': pos_after,
            '+1_first_2_pos': pos_after[:2],
            '+1_chunk_tag': chunk_after,
            '+1_first_2_chunk': chunk_after[:2],
        })
    else:
        clues['is_end_of_sentence'] = True 

    return clues

def get_all_clues_for_sentence(sentence_data):
    """ Converts a whole sentence into a list of clue-lists for each word. """
    return [get_clues_for_word(sentence_data, i) for i in range(len(sentence_data))]

def get_all_answers_for_sentence(sentence_data):
    """ Extracts only the NER tags (the answers) for each word in a sentence. """
    return [ner_tag for word, pos_tag, chunk_tag, ner_tag in sentence_data]

### Preparing for Learning 

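This step runs the feature extraction over every sentence and collects the results in the format the CRF expects: each X_* variable is a list of sentences, where every sentence is a list of per-word clue dictionaries, and each y_* variable is the matching list of NER tag sequences.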

In [5]:
print("\nOrganizing the clues (features) and answers (NER tags) for the detective...")

X_train_clues = [get_all_clues_for_sentence(s) for s in training_stories]
X_valid_clues = [get_all_clues_for_sentence(s) for s in validation_stories]
X_test_clues = [get_all_clues_for_sentence(s) for s in test_stories]


y_train_answers = [get_all_answers_for_sentence(s) for s in training_stories]
y_valid_answers = [get_all_answers_for_sentence(s) for s in validation_stories]
y_test_answers = [get_all_answers_for_sentence(s) for s in test_stories]

print(f"Example of clues for the first word of the first training story:\n{X_train_clues[0][0]}")
print(f"Example of the correct answer for that word: {y_train_answers[0][0]}")


Organizing the clues (features) and answers (NER tags) for the detective...
Example of clues for the first word of the first training story:
{'always_on': 1.0, 'word_lowercase': 'eu', 'last_3_letters': 'EU', 'last_2_letters': 'EU', 'is_all_caps': True, 'starts_with_cap': False, 'is_a_number': False, 'part_of_speech': 'NNP', 'first_2_pos': 'NN', 'chunk_tag': 'B-NP', 'first_2_chunk': 'B-', 'is_start_of_sentence': True, '+1_word_lowercase': 'rejects', '+1_starts_with_cap': False, '+1_is_all_caps': False, '+1_part_of_speech': 'VBZ', '+1_first_2_pos': 'VB', '+1_chunk_tag': 'B-VP', '+1_first_2_chunk': 'B-'}
Example of the correct answer for that word: B-ORG
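
Note that in the output above, 'EU' gets 'starts_with_cap': False. That follows from how Python's str.istitle() works: it is only true for title-cased strings, so an all-caps word does not count. The short sketch below compares it with a plain first-character check; the word list is chosen for illustration and the alternative check is a suggestion, not what the notebook uses.

```python
# Compare str.istitle() with a plain "first character is uppercase" check.
words = ["EU", "German", "rejects", "McDonald", "U.S."]

for word in words:
    title_case = word.istitle()        # what the 'starts_with_cap' clue uses
    first_upper = word[:1].isupper()   # alternative: look only at the first character
    print(f"{word!r:12} istitle={title_case!s:6} first_upper={first_upper}")

istitle_flags = {w: w.istitle() for w in words}
first_upper_flags = {w: w[:1].isupper() for w in words}
```

Because the clue list also contains 'is_all_caps', the model can still tell 'EU' apart from lowercase words, so this is more a naming subtlety than a bug.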


### Training the CRF

This step fits the CRF on the training data so that it learns which clues, and which label-to-label transitions, signal each entity type.

In [6]:


print("\nNow, training our 'Detective Brain' (CRF Model). This takes a little time...")
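my_ner_detective = CRF(
    algorithm='lbfgs',              # gradient-based optimizer (L-BFGS)
    c1=0.1,                         # L1 regularization coefficient
    c2=0.1,                         # L2 regularization coefficient
    max_iterations=100,             # upper bound on optimizer iterations
    all_possible_transitions=True   # also create transition features for label pairs never seen in training
)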

try:
    my_ner_detective.fit(X_train_clues, y_train_answers)
    print("Detective brain training finished successfully!")
except Exception as e:
    print(f"An error occurred during training: {e}")
    print("This can sometimes happen with very small datasets or unusual feature sets.")


Now, training our 'Detective Brain' (CRF Model). This takes a little time...
Detective brain training finished successfully!
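
One way to see what the model learned about label order is to rank its transition weights. The sketch below assumes a dictionary keyed by (label_from, label_to), which is the shape of the transition_features_ attribute on a fitted sklearn-crfsuite CRF. The weights here are placeholders invented for illustration, and `rank_transitions` is a hypothetical helper.

```python
def rank_transitions(transition_weights, n=3):
    """Return the n highest-weighted and n lowest-weighted label transitions."""
    ranked = sorted(transition_weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n], ranked[-n:]

# Placeholder weights, invented for illustration only.
example_weights = {
    ("B-PER", "I-PER"): 6.1,
    ("B-ORG", "I-ORG"): 5.4,
    ("O", "O"): 3.0,
    ("O", "B-LOC"): 1.2,
    ("B-PER", "I-ORG"): -3.8,
    ("O", "I-PER"): -7.5,
}

most_likely, least_likely = rank_transitions(example_weights, n=2)
for (label_from, label_to), weight in most_likely + least_likely:
    print(f"{label_from:6} -> {label_to:6} {weight:+.2f}")
```

With the trained model from the cell above, the same call would be rank_transitions(my_ner_detective.transition_features_).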


### Evaluation

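We now check how well the model generalizes to data it was not trained on: first the validation set, then the test set. The scores are computed per token, and the label O is left out of the report so that they reflect only the entity labels.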

In [7]:
print("\nEvaluating the model's performance on the validation set...")

predictions_valid = my_ner_detective.predict(X_valid_clues)

all_known_name_labels = list(my_ner_detective.classes_)
if 'O' in all_known_name_labels:
    all_known_name_labels.remove('O') 

print("\n--- Detective's Report Card (Validation Set) ---")

print(metrics.flat_classification_report(
    y_true=y_valid_answers,          
    y_pred=predictions_valid, 
    labels=all_known_name_labels,
    digits=3                  
))


overall_f1_valid = metrics.flat_f1_score(
    y_true=y_valid_answers,
    y_pred=predictions_valid,
    average='weighted', 
    labels=all_known_name_labels
)
print(f"Overall Name-Finding Score (F1-score) on Validation Set: {overall_f1_valid:.3f}")


print("\nNow for the final, fair test on the 'test' manual (it's never seen this before!)...")
predictions_test = my_ner_detective.predict(X_test_clues)

print("\n--- Detective's Final Report Card (Test Set) ---")
print(metrics.flat_classification_report(
    y_true=y_test_answers,
    y_pred=predictions_test,
    labels=all_known_name_labels,
    digits=3
))

overall_f1_test = metrics.flat_f1_score(
    y_true=y_test_answers,
    y_pred=predictions_test,
    average='weighted',
    labels=all_known_name_labels
)
print(f"Overall Name-Finding Score (F1-score) on Test Set: {overall_f1_test:.3f}")


Evaluating the model's performance on the validation set...

--- Detective's Report Card (Validation Set) ---
              precision    recall  f1-score   support

       B-ORG      0.852     0.805     0.828      1341
      B-MISC      0.925     0.839     0.880       922
       B-PER      0.896     0.906     0.901      1842
       I-PER      0.936     0.956     0.946      1307
       B-LOC      0.915     0.878     0.896      1837
       I-ORG      0.816     0.832     0.824       751
      I-MISC      0.900     0.728     0.805       346
       I-LOC      0.892     0.805     0.847       257

   micro avg      0.895     0.868     0.882      8603
   macro avg      0.892     0.844     0.866      8603
weighted avg      0.896     0.868     0.881      8603

Overall Name-Finding Score (F1-score) on Validation Set: 0.881

Now for the final, fair test on the 'test' manual (it's never seen this before!)...

--- Detective's Final Report Card (Test Set) ---
              precision    recall  f1-sc
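
The averaged rows of the validation report can be reproduced by hand, which shows what average='weighted' means. The sketch below copies the per-label F1 scores and supports from the validation table above. Because those inputs are already rounded to three digits, the recomputed averages are approximate.

```python
# Per-label (F1, support) pairs copied from the validation report above.
report = {
    "B-ORG":  (0.828, 1341),
    "B-MISC": (0.880, 922),
    "B-PER":  (0.901, 1842),
    "I-PER":  (0.946, 1307),
    "B-LOC":  (0.896, 1837),
    "I-ORG":  (0.824, 751),
    "I-MISC": (0.805, 346),
    "I-LOC":  (0.847, 257),
}

total_support = sum(support for _, support in report.values())

# Macro average: every label counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every label counts in proportion to its number of tokens.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support)                       # 8603
print(f"macro F1:    {macro_f1:.3f}")      # 0.866
print(f"weighted F1: {weighted_f1:.3f}")   # 0.881
```

The weighted value is higher than the macro value because the frequent labels (B-PER, B-LOC, I-PER) score better than the rare ones (I-MISC, I-LOC).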

### Prediction

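Finally, we use the trained model to find the names in new sentences. The model expects POS and chunk tags, which raw text does not have, so the helper below guesses a POS tag with a few hand-written rules and sets every chunk tag to O. These rough tags differ from the ones seen in training, which likely contributes to mistakes such as Elon Musk being tagged as an organization.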

In [9]:
print("\n--- Let's give our detective a new case! ---")


def prepare_new_story_sentence(raw_text_input):
    """
    Takes a plain sentence and guesses its word parts (like POS and Chunk tags)
    to prepare it for our detective.
    """
    words = nltk.word_tokenize(raw_text_input)

    processed_sentence_facts = []
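    for word in words:
        # Very rough POS guess; the first matching rule wins.
        if word[0].isupper() and word.isalpha():
            pos = 'NNP'  # any capitalized word is treated as a proper noun
        elif word.isdigit():
            pos = 'CD'
        elif word == '.':
            pos = '.'
        elif word == ',':
            pos = ','
        elif word in ['the', 'a', 'an']:
            pos = 'DT'
        elif word in ['will', 'was', 'is', 'visits', 'announced', 'hold', 'have']:
            pos = 'VBD'  # one verb tag for all listed verbs, regardless of tense
        elif word in ['in', 'on', 'at', 'next']:
            pos = 'IN'
        elif word in ['today', 'month', 'week', 'Monday', 'July']:
            pos = 'NN'  # 'Monday' and 'July' are already caught by the capitalized-word rule
        else:
            pos = 'NN'

        # Chunk tags are not guessed; the NER tag is a placeholder that the clue functions ignore.
        chunk = 'O'
        dummy_ner = 'O'

        processed_sentence_facts.append((word, pos, chunk, dummy_ner))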
    return processed_sentence_facts

my_story_1 = "Elon Musk visited Tokyo on July 20, 2025."
processed_story_1 = prepare_new_story_sentence(my_story_1)
clues_for_story_1 = get_all_clues_for_sentence(processed_story_1)

predicted_name_tags_1 = my_ner_detective.predict([clues_for_story_1])[0] # Get the first (and only) sentence's tags

print(f"\nOriginal Story: \"{my_story_1}\"")
print("Detective's Name Findings:")
for i, (word, _, _, _) in enumerate(processed_story_1):
    print(f"  Word: '{word}' -> Name Type: {predicted_name_tags_1[i]}")

print("\n--- Another case for our detective! ---")

my_story_2 = "The United Nations will have a meeting in London next Monday."
processed_story_2 = prepare_new_story_sentence(my_story_2)
clues_for_story_2 = get_all_clues_for_sentence(processed_story_2)
predicted_name_tags_2 = my_ner_detective.predict([clues_for_story_2])[0]

print(f"\nOriginal Story: \"{my_story_2}\"")
print("Detective's Name Findings:")
for i, (word, _, _, _) in enumerate(processed_story_2):
    print(f"  Word: '{word}' -> Name Type: {predicted_name_tags_2[i]}")


--- Let's give our detective a new case! ---

Original Story: "Elon Musk visited Tokyo on July 20, 2025."
Detective's Name Findings:
  Word: 'Elon' -> Name Type: B-ORG
  Word: 'Musk' -> Name Type: I-ORG
  Word: 'visited' -> Name Type: O
  Word: 'Tokyo' -> Name Type: B-LOC
  Word: 'on' -> Name Type: O
  Word: 'July' -> Name Type: O
  Word: '20' -> Name Type: O
  Word: ',' -> Name Type: O
  Word: '2025' -> Name Type: O
  Word: '.' -> Name Type: O

--- Another case for our detective! ---

Original Story: "The United Nations will have a meeting in London next Monday."
Detective's Name Findings:
  Word: 'The' -> Name Type: B-ORG
  Word: 'United' -> Name Type: I-ORG
  Word: 'Nations' -> Name Type: I-ORG
  Word: 'will' -> Name Type: O
  Word: 'have' -> Name Type: O
  Word: 'a' -> Name Type: O
  Word: 'meeting' -> Name Type: O
  Word: 'in' -> Name Type: O
  Word: 'London' -> Name Type: B-LOC
  Word: 'next' -> Name Type: O
  Word: 'Monday' -> Name Type: O
  Word: '.' -> Name Type: O
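
Per-word tags are easier to read once they are grouped into entities. The sketch below merges B- and I- tags into (entity type, text) spans, using the words and predicted tags printed above as input. `bio_to_spans` is a hypothetical helper, and treating a stray I- tag as the start of a new entity is an assumption of this sketch.

```python
def bio_to_spans(words, tags):
    """Group BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity begins (a stray I- tag is treated as a start).
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = entity_type, [word]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O": close any open entity.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

# Words and predicted tags copied from the first example output above.
words_1 = ["Elon", "Musk", "visited", "Tokyo", "on", "July", "20", ",", "2025", "."]
tags_1 = ["B-ORG", "I-ORG", "O", "B-LOC", "O", "O", "O", "O", "O", "O"]

spans_1 = bio_to_spans(words_1, tags_1)
print(spans_1)  # [('ORG', 'Elon Musk'), ('LOC', 'Tokyo')]
```

With the variables from the notebook, the call would be bio_to_spans([w for w, _, _, _ in processed_story_1], predicted_name_tags_1).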
