<a href="https://colab.research.google.com/github/koad7/NLP_PYTORCH/blob/main/NER_Improved.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
try:
    from transformers import BertModel, BertTokenizer
    import nltk
except ImportError:
    !pip install -q transformers nltk
    from transformers import BertModel, BertTokenizer
    import nltk

import torch
import numpy as np


# Example text

paragraph = '''Police launched a battery investigation
Yahoo has confirmed that Spears filed a police report after the confrontation, which apparently identifies Spurs director of team security Damian Smith as the person who allegedly backhanded the singer.

"On July 5, 2023, at approximately 11 p.m., LVMPD officers responded to a property in the 3700 block of Las Vegas Boulevard regarding a battery investigation," a police spokesperson said on Thursday. "The incident has been documented on a police report and no arrest or citations have been issued."

A criminal investigation is being conducted. TMZ, meanwhile, has reported the case will "likely" be sent to the district attorney's office.

An eyewitness backs up Spears's version
A man who saw everything go down at the Aria hotel told TMZ he saw Spears approach Wembanyama. Purportedly, she leaned in and said, "Excuse me, sir ... excuse me, sir." Spears apparently touched Wembanyama's back and that's when the witness claimed to see a Spurs security guard hit her in the face causing Spears's sunglasses to fly off.

Sam Asghari watched the whole thing go down, too
Spears's husband blasted the "coward" who allegedly hit his wife, but said Wembanyama isn't at fault.

"The violent behavior of an out-of-control security guard should not cast a shadow on the accomplishment of a great young man on the rise. The blame should fall on the coward who did this, the people who hired him without proper vetting, and a systemic culture of disregard for women within sports and entertainment," he wrote, in part, on Instagram.

"I can't imagine a scenario where an unarmed female fan showing any kind of excitement or appreciation for a celebrity would cause her to be physically assaulted, much less being hit in the face for tapping someone on the shoulder," he added.'''




# EMBEDDINGS

In [3]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
try:
    from hmmlearn import hmm
except ImportError:
    !pip install  -q  hmmlearn
    from hmmlearn import hmm


# Tokenize the paragraph
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(paragraph)

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)

# Map the token strings to their vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)

# Convert indexed tokens to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Predict hidden states features for each layer
with torch.no_grad():
    outputs = model(tokens_tensor)
    hidden_states = outputs[2]  # The third output consists of hidden states

# Select the embeddings from the last layer
token_embeddings = hidden_states[-1]

# Collect the token-level embeddings as a NumPy array of shape (n_tokens, hidden_size)
embedding_data = token_embeddings[0].numpy()

# Create a Pandas DataFrame from the embedding data (one row per token)
named_entity_dict = pd.DataFrame(embedding_data, index=tokens)

# Normalize each word embedding to unit L2 norm
normalized_embeddings = embedding_data / np.linalg.norm(embedding_data, axis=1, keepdims=True)

# Perform K-means clustering with K=2
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(normalized_embeddings)

# Get the cluster labels and the cluster sizes
cluster_labels_ = kmeans.labels_
cluster_sizes = [len(cluster_labels_[cluster_labels_ == i]) for i in range(2)]

# Find the index of the larger cluster (0 or 1)
larger_cluster_index = cluster_sizes.index(max(cluster_sizes))

# Map label 0 to the larger cluster and label 1 to the smaller cluster:
# the XOR flips every label when the larger cluster is currently labelled 1
binary_tags = (cluster_labels_ ^ (larger_cluster_index == 1)).astype(int)



# Create a DataFrame to store the named entity dictionary
named_entity_dict = pd.DataFrame({'word': tokens, 'tag': binary_tags})
named_entity_dict['tag'] = named_entity_dict['tag'].astype(int)

# Create a coarse NE dictionary
coarse_ne_dict = {}
for i, row in named_entity_dict.iterrows():
    word = row['word']
    tag = row['tag']
    if tag == 1:
        coarse_ne_dict[word] = tag

# Save the named entity dictionary to a CSV file
named_entity_dict.to_csv('named_entity_dict.csv', index=False)
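The normalization and relabeling steps in the cell above can be checked in isolation. The following is a minimal sketch on made-up data: `toy_embeddings` and `toy_labels` are hypothetical stand-ins for the BERT embeddings and the K-means output, not values from this notebook.

```python
import numpy as np

# Hypothetical toy data: 5 "tokens" with 2-dimensional embeddings.
toy_embeddings = np.array(
    [[3.0, 4.0], [0.0, 2.0], [1.0, 0.0], [6.0, 8.0], [0.0, -5.0]]
)

# L2-normalize each row so that every embedding has unit length.
unit_embeddings = toy_embeddings / np.linalg.norm(
    toy_embeddings, axis=1, keepdims=True
)

# Hypothetical cluster labels, shaped like the output of K-means with K=2.
toy_labels = np.array([1, 1, 0, 1, 0])

# Count the members of each cluster and find the larger one.
sizes = np.bincount(toy_labels, minlength=2)
larger = int(np.argmax(sizes))

# Relabel so that the larger cluster becomes 0 and the smaller becomes 1.
toy_binary_tags = np.where(toy_labels == larger, 0, 1)
```

Here `np.where` gives the same result as the XOR expression used in the notebook, but states the intent (larger cluster becomes 0) more directly.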





# ENTITY SPAN PREDICTION


*Gaussian Hidden Markov Model (Gaussian-HMM) for Entity Span Detection*: In the first step, we use a Gaussian-HMM to learn the latent Markov process among the NE labels. Entity spans are represented with the IOB (Inside, Outside, Beginning) tagging scheme: B marks the first token of an entity, I marks a token inside an entity, and O marks a token outside any entity. The Gaussian-HMM captures the transitions between IOB labels, which helps identify the boundaries of named entities.
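To make the IOB scheme concrete, here is a small deterministic sketch that turns binary entity tags into IOB labels and then into token spans. It is only an illustration of the tagging scheme, not the HMM itself; the helper names `binary_to_iob` and `iob_to_spans` are hypothetical and do not appear elsewhere in this notebook.

```python
def binary_to_iob(tags):
    """Convert binary entity tags (1 = entity token) to IOB labels."""
    labels = []
    previous = 0
    for tag in tags:
        if tag == 1:
            # The first entity token after a non-entity token opens a span.
            labels.append("I" if previous == 1 else "B")
        else:
            labels.append("O")
        previous = tag
    return labels


def iob_to_spans(labels):
    """Return (start, end) token index pairs, end exclusive, one per span."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        elif label == "I" and start is None:
            start = i  # tolerate a span that begins with "I"
    if start is not None:
        spans.append((start, len(labels)))
    return spans


example_tags = [0, 1, 1, 0, 1, 0]
example_labels = binary_to_iob(example_tags)
example_spans = iob_to_spans(example_labels)
```

Binary tags alone cannot separate two entities that are directly adjacent, since there is no signal for where the second one begins. That limitation is one motivation for learning the label transitions with a sequence model.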

In [None]:
# Load the named entity dictionary
named_entity_dict = pd.read_csv('named_entity_dict.csv')

# Define the hidden states (IOB labels) and the transition weights between them
# (each row is normalized to sum to 1 when the transition matrix is built below)
observation_symbols = ["B", "I", "O"]
transition_probs = {
    "B": {"B": 0.7, "I": 0.3, "O": 0.0},
    "I": {"B": 0.4, "I": 0.6, "O": 0.0},
    "O": {"B": 0.0, "I": 0.0, "O": 0.5}
}


# Define the emission probabilities
emission_probs = {}
for _, row in named_entity_dict.iterrows():
    word = row['word']
    label = row['tag']
    if label == 1:
        emission_probs[word] = np.array([1.0, 0.0, 0.0])
    elif label == 0:
        emission_probs[word] = np.array([0.0, 1.0, 0.0])
    else:
        emission_probs[word] = np.array([0.0, 0.0, 1.0])

# Define the initial state probabilities
initial_state_probs = {"B": 0.5, "I": 0.3, "O": 0.2}

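# Define the number of components (i.e., the number of hidden states, one per IOB label)
num_components = 3

# Create the model and assign the hand-specified start and transition probabilities
# (no call to fit() is made, so nothing is learned from the data here)
model = hmm.GaussianHMM(n_components=num_components)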
model.startprob_ = np.array([initial_state_probs["B"], initial_state_probs["I"], initial_state_probs["O"]])
model.transmat_ = np.array([
    [transition_probs["B"]["B"], transition_probs["B"]["I"], transition_probs["B"]["O"]],
    [transition_probs["I"]["B"], transition_probs["I"]["I"], transition_probs["I"]["O"]],
    [transition_probs["O"]["B"], transition_probs["O"]["I"], transition_probs["O"]["O"]]
]) / np.array([
    [transition_probs["B"]["B"] + transition_probs["B"]["I"] + transition_probs["B"]["O"]],
    [transition_probs["I"]["B"] + transition_probs["I"]["I"] + transition_probs["I"]["O"]],
    [transition_probs["O"]["B"] + transition_probs["O"]["I"] + transition_probs["O"]["O"]]
])
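The row normalization in the cell above can be written more compactly with NumPy broadcasting. This is a minimal sketch that reuses the same weights in B, I, O order; the names `weights` and `transmat` are chosen here for illustration.

```python
import numpy as np

# Unnormalized transition weights in B, I, O order (values from the cell above).
weights = np.array([
    [0.7, 0.3, 0.0],
    [0.4, 0.6, 0.0],
    [0.0, 0.0, 0.5],
])

# Divide every row by its own sum so that each row is a probability distribution.
transmat = weights / weights.sum(axis=1, keepdims=True)
```

With these particular values, the O row normalizes to `[0, 0, 1]`, and neither B nor I ever moves to O, so a decoded sequence either stays in O throughout or never reaches it. If that is not the intended behavior, the weights may need revisiting.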



In [82]:
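# Predict the most likely state sequence given the observed data.
# hmmlearn expects a 2-D array of shape (n_samples, n_features); the Gaussian emission
# parameters (means_, covars_) must also be set, or learned with fit(), before predicting.
predicted_states = model.predict(named_entity_dict['tag'].values.reshape(-1, 1))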




AttributeError: ignored
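A likely cause of the error above is that the model was never fitted and has no Gaussian emission parameters, which decoding requires in addition to the start and transition probabilities. To show what decoding computes once those parameters exist, here is a NumPy-only sketch of Viterbi decoding for a one-dimensional Gaussian HMM. It is a stand-in for illustration, not hmmlearn's implementation, and every number in the example (`toy_transmat`, `toy_means`, `toy_variances`, `toy_obs`) is a hypothetical assumption rather than a value learned from this notebook's data.

```python
import numpy as np


def gaussian_log_pdf(x, mean, var):
    """Log density of a 1-D Gaussian, evaluated element-wise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)


def viterbi(obs, startprob, transmat, means, variances):
    """Most likely hidden-state path for 1-D observations under a Gaussian HMM."""
    n_states = len(startprob)
    n_obs = len(obs)
    with np.errstate(divide="ignore"):  # log(0) = -inf marks forbidden moves
        log_start = np.log(startprob)
        log_trans = np.log(transmat)
    # log_emit[t, s] = log p(obs[t] | state s)
    log_emit = gaussian_log_pdf(obs[:, None], means[None, :], variances[None, :])

    score = np.empty((n_obs, n_states))
    backptr = np.zeros((n_obs, n_states), dtype=int)
    score[0] = log_start + log_emit[0]
    for t in range(1, n_obs):
        # candidates[i, j] = best score ending in state i at t-1, then moving to j
        candidates = score[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(candidates, axis=0)
        score[t] = np.max(candidates, axis=0) + log_emit[t]

    # Trace the best path backwards from the best final state.
    path = np.empty(n_obs, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(n_obs - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path


# Hypothetical parameters, states in B, I, O order (indices 0, 1, 2).
toy_startprob = np.array([0.5, 0.3, 0.2])
toy_transmat = np.array([
    [0.1, 0.6, 0.3],
    [0.1, 0.5, 0.4],
    [0.3, 0.0, 0.7],  # O -> I is forbidden under IOB
])
toy_means = np.array([1.0, 2.0, 0.0])
toy_variances = np.array([0.1, 0.1, 0.1])
toy_obs = np.array([0.0, 1.0, 2.0, 2.0, 0.0])

toy_path = viterbi(toy_obs, toy_startprob, toy_transmat, toy_means, toy_variances)
toy_labels = ["BIO"[s] for s in toy_path]
```

In this toy setup the emission means are well separated, so the decoded path simply follows the observations while respecting the forbidden O to I transition.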

# ENTITY TYPES PREDICTIONS

The code performs the following tasks:

1. Tokenizes a given paragraph using the BERT tokenizer.
2. Loads a pre-trained BERT model and extracts the hidden states features for each token in the paragraph.
3. Selects the embeddings from the last layer of the BERT model.
4. Creates a DataFrame named_entity_dict to store the token-level embeddings.
5. Normalizes the word embeddings.
6. Performs K-means clustering with K=2 on the normalized embeddings.
7. Predicts cluster labels (binary tags) for each word using the K-means model.
8. Creates a DataFrame named_entity_dict to store the named entity dictionary, consisting of words and their binary tags.
9. Saves the named entity dictionary to a CSV file.
10. Loads the named entity dictionary from the CSV file.
11. Defines observation symbols and transition probabilities for the Hidden Markov Model (HMM).
12. Defines emission probabilities based on the binary tags from the named entity dictionary.
13. Defines initial state probabilities for the HMM.
14. Creates a GaussianHMM model and assigns it the defined initial and transition probabilities.
15. Predicts the most likely state sequence given the observed data.
16. Prints the predicted states along with their corresponding state names.

Issues in the code:
1. The variable `word_embeddings` is not defined when creating the binary tags.
2. The variable `observation_sequences` is not defined when predicting the most likely state sequence.

To resolve these issues, you can replace the lines:
```python
binary_tags = np.zeros(len(word_embeddings))
binary_tags[cluster_labels == 0] = 0
binary_tags[cluster_labels == 1] = 1
```
with:
```python
binary_tags = np.zeros(len(named_entity_dict))
binary_tags[cluster_labels == 0] = 0
binary_tags[cluster_labels == 1] = 1
```

and replace the line:
```python
predicted_states = model.predict(observation_sequences)
```
with:
```python
predicted_states = model.predict(named_entity_dict['tag'].values.reshape(-1, 1))
```

These changes ensure that the correct variables are used in the code.

In [80]:
named_entity_dict

| Unnamed: 0 | word | tag |
|---|---|---|
| 0 | police | 0 |
| 1 | launched | 1 |
| 2 | a | 1 |
| 3 | battery | 1 |
| 4 | investigation | 1 |
| ... | ... | ... |
| 390 | , | 0 |
| 391 | " | 0 |
| 392 | he | 0 |
| 393 | added | 0 |


Unnamed: 0,word,tag
0,police,0
1,launched,1
2,a,1
3,battery,1
4,investigation,1
...,...,...
390,",",0
391,"""",0
392,he,0
393,added,0
