## Exploring Named Entity Recognition with NLTK: A Beginner's Guide.
In the vast landscape of information available online, it can be challenging to extract meaningful insights from articles. That's where our project steps in, employing a clever tool called Named Entity Recognition (NER). Think of NER as a friendly guide that helps computers recognize and categorize important details like names of people, places, and organizations within a sea of words.

Our project focuses on reading news articles. Using the Natural Language Toolkit (NLTK), our system splits the text into words, tags each word's part of speech, and then picks out named entities such as people, places, and organizations. It's like having a language wizard that points out crucial details, making it easier to understand the main points of an article. This helps readers, researchers, and analysts digest information quickly and make more informed decisions.

In essence, our project is all about simplifying the complexity of text. By applying NER to news articles, we're providing a tool that enhances comprehension, making it accessible for anyone navigating through a barrage of information. Whether you're a student, researcher, or someone curious about current events, our project strives to make the wealth of online knowledge more understandable and user-friendly.

## Module 1
### Task 1: Importing Necessary Libraries and Resources.
Before understanding the process of Named Entity Recognition, we need to import the necessary libraries and view the news article. Let's take a look!

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

#--- Read in text file(text.txt) ----
with open('./text.txt', 'r') as file:
    my_text = file.read()

# ---WRITE YOUR CODE FOR TASK 1 ---
#--- Inspect data ---
my_text[:500]

### Task 2: Tokenizing the Sentence.

Fantastic!!! We have checked the resource. Now, we need to split the text into individual words (tokens), so that each one becomes a separate element of a list. By doing this, we enable our system to analyze and understand the structure of the text more effectively. Let's get it done, buddy!

# Run the download below once. After it has executed, you can comment it out.
# Note: newer NLTK releases may ask for a differently named resource
# (for example 'punkt_tab'); if a LookupError appears, download the one it names.
import nltk
nltk.download('punkt')


#--- WRITE YOUR CODE FOR TASK 2 ---
from nltk.tokenize import word_tokenize
tokenized_words = word_tokenize(my_text)

#--- Inspect data ---
tokenized_words
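To make the idea of tokenization concrete, here is a deliberately simplified tokenizer built on a regular expression. It is only a sketch of the concept and not NLTK's algorithm: `word_tokenize` handles contractions, abbreviations, and sentence boundaries far more carefully. The function name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A simplified stand-in for illustration; this is NOT how
    nltk.word_tokenize works internally.
    """
    # \w+      -> a run of letters, digits, or underscores (a "word")
    # [^\w\s]  -> any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical sample sentence, not taken from text.txt
sample = "Apple is opening a new office in Paris."
tokens = simple_tokenize(sample)
print(tokens)
# ['Apple', 'is', 'opening', 'a', 'new', 'office', 'in', 'Paris', '.']
```

Notice that the final period becomes its own token. The tagger and the entity chunker in the later tasks work on this kind of word-by-word list.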

### Task 3: Performing Part-of-Speech Tagging.
Great!!! We have converted the text to tokens. Now, we need to perform Part-of-speech tagging, which assigns grammatical categories (like nouns or verbs) to each word. This process helps us grasp the finer details of the sentence's language and contributes to the overall comprehension of the text. Let's work on it.

# Run the download below once. After it has executed, you can comment it out.
# Note: newer NLTK releases may ask for a differently named resource;
# if a LookupError appears, download the one it names.
import nltk
nltk.download('averaged_perceptron_tagger')

#--- WRITE YOUR CODE FOR TASK 3 ---
from nltk import pos_tag
pos_tags = pos_tag(tokenized_words)

#--- Inspect data ---
pos_tags
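The tagger returns a list of `(word, tag)` pairs, with tags from the Penn Treebank tagset by default (for example `NNP` for a singular proper noun and `VBZ` for a third-person singular verb). A long list of pairs is hard to scan, so a quick summary can help. The sketch below assumes a small hand-written list called `sample_pos_tags` in the same shape; it is illustrative only and not real tagger output for the article.

```python
from collections import Counter

# Hypothetical (word, tag) pairs in the shape returned by pos_tag;
# hand-written for illustration, not real tagger output.
sample_pos_tags = [
    ("Apple", "NNP"), ("is", "VBZ"), ("opening", "VBG"), ("a", "DT"),
    ("new", "JJ"), ("office", "NN"), ("in", "IN"), ("Paris", "NNP"),
    (".", "."),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in sample_pos_tags)

# Proper nouns (NNP, NNPS) are the usual candidates for named entities
proper_nouns = [word for word, tag in sample_pos_tags if tag in ("NNP", "NNPS")]

print(tag_counts.most_common(3))
print(proper_nouns)
```

Running the same two steps on `pos_tags` gives a rough feel for how many proper nouns the article contains before the entity chunker is applied.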

### Task 4: Named Entity Recognition and Continuous Extraction.
Wow!!! We have assigned grammatical categories to each word. Now, we need to identify and classify entities within the text. Let's get started!

# Run the downloads below once. After they have executed, you can comment them out.
# Note: newer NLTK releases may ask for differently named resources;
# if a LookupError appears, download the one it names.
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

#--- WRITE YOUR CODE FOR TASK 4 ---
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

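ner_tree = ne_chunk(pos_tags)

continuous_chunk = []  # finished (entity_type, entity) pairs
current_chunk = []     # words of the entity currently being assembled
prev = None            # type of the entity currently being assembled

for subtree in ner_tree:
    if isinstance(subtree, Tree):  # A subtree is a named entity
        entity_type = subtree.label()  # Get the type of the entity
        if prev is not None and prev != entity_type:
            # The entity type changed, so finish the previous entity first
            continuous_chunk.append((prev, " ".join(current_chunk)))
            current_chunk = []
        # Adjacent subtrees of the same type are merged into one entity
        current_chunk.extend(word for word, tag in subtree.leaves())
        prev = entity_type
    else:
        # An ordinary (word, tag) token ends the entity being assembled
        if current_chunk:
            continuous_chunk.append((prev, " ".join(current_chunk)))
        current_chunk = []
        prev = None

# Flush an entity that runs up to the very end of the text
if current_chunk:
    continuous_chunk.append((prev, " ".join(current_chunk)))

# Remove duplicates (note that set() does not preserve the original order)
continuous_chunk_with_types = list(set(continuous_chunk))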

#--- Inspect data ---
continuous_chunk_with_types
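A flat list of `(entity_type, entity)` pairs can be easier to read once it is grouped by type. The sketch below shows one way to do that, assuming a hand-written list called `sample_entities` in the same shape as `continuous_chunk_with_types`. The sample values and the helper name `group_by_type` are made up for illustration and are not real output from the article.

```python
from collections import defaultdict

# Hypothetical sample in the same (entity_type, entity) shape as
# continuous_chunk_with_types; it is not real output from text.txt.
sample_entities = [
    ("PERSON", "Barack Obama"),
    ("GPE", "Hawaii"),
    ("ORGANIZATION", "Harvard Law School"),
    ("GPE", "Chicago"),
]

def group_by_type(entities):
    """Return a dict mapping each entity type to a sorted list of names."""
    grouped = defaultdict(set)  # a set per type drops repeated names
    for entity_type, entity in entities:
        grouped[entity_type].add(entity)
    return {etype: sorted(names) for etype, names in grouped.items()}

grouped_entities = group_by_type(sample_entities)

# Print one line per entity type, in alphabetical order
for etype in sorted(grouped_entities):
    print(etype, "->", ", ".join(grouped_entities[etype]))
```

Calling `group_by_type(continuous_chunk_with_types)` on the real result should give the people, places, and organizations of the article side by side. Sorting also makes the output stable, which `list(set(...))` alone does not guarantee.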
