🚀 NLP Analysis of a News Report on a Medevac Crash

This Jupyter Notebook showcases Natural Language Processing (NLP) techniques applied to a real-world news report about a medevac plane crash. The analysis includes:
	•	Tokenization & Normalization: Breaking the text into sentences and words, converting to lowercase, and applying stemming.
	•	Part-of-Speech (POS) Tagging: Labeling each token with its grammatical category (noun, verb, determiner, and so on) to understand the text’s composition.
	•	N-Gram Analysis: Extracting common bi-grams and tri-grams to uncover key topics and trends.
	•	Insights & Interpretation: Examining word patterns, frequent phrases, and linguistic structures to determine the main focus of the news article.

This notebook demonstrates my ability to preprocess, analyze, and extract meaningful insights from text data, a crucial skill in NLP applications such as sentiment analysis, topic modeling, and text classification. 🚀

# Part 1

In [None]:
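import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

# Download only the resources this notebook uses instead of the full 'all' bundle:
# the Punkt sentence tokenizer and the perceptron POS tagger. Both the older and
# the newer resource names are requested so the cells work across NLTK versions.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)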

#  Sample News Excerpt
text = """A medevac plane crashed soon after takeoff in Philadelphia on Friday with a child 
and five others on board, the air ambulance company that operated it said, adding that it 
had not confirmed any survivors. Jet Rescue Air Ambulance, based in Mexico and licensed to 
operate in the U.S., said its aircraft crashed with four crew members, one pediatric medical 
patient and the patient's mother on board. "At this time we cannot confirm any survivors," 
the company said in a statement."""

#  Sentence Tokenization
sentences = sent_tokenize(text)
print("\n Sentence Tokenization:\n", sentences)

#  Word Tokenization
words = word_tokenize(text)
print("\n Word Tokenization:\n", words)

#  Normalization (Lowercasing + Stemming)
ps = PorterStemmer()
normalized_words = [ps.stem(word.lower()) for word in words]
print("\n Normalized Words (Lowercased + Stemmed):\n", normalized_words)


 Sentence Tokenization:
 ['A medevac plane crashed soon after takeoff in Philadelphia on Friday with a child \nand five others on board, the air ambulance company that operated it said, adding that it \nhad not confirmed any survivors.', "Jet Rescue Air Ambulance, based in Mexico and licensed to \noperate in the U.S., said its aircraft crashed with four crew members, one pediatric medical \npatient and the patient's mother on board.", '"At this time we cannot confirm any survivors," \nthe company said in a statement.']

 Word Tokenization:
 ['A', 'medevac', 'plane', 'crashed', 'soon', 'after', 'takeoff', 'in', 'Philadelphia', 'on', 'Friday', 'with', 'a', 'child', 'and', 'five', 'others', 'on', 'board', ',', 'the', 'air', 'ambulance', 'company', 'that', 'operated', 'it', 'said', ',', 'adding', 'that', 'it', 'had', 'not', 'confirmed', 'any', 'survivors', '.', 'Jet', 'Rescue', 'Air', 'Ambulance', ',', 'based', 'in', 'Mexico', 'and', 'licensed', 'to', 'operate', 'in', 'the', 'U.S.', ',', 
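
The normalization step above lowercases and stems every token, including punctuation marks such as ',' and '.'. As a minimal sketch (not part of the original notebook), the snippet below drops punctuation-only tokens before any stemming is applied. The helper name `is_punctuation` is hypothetical, and the token list is a small hand-picked sample of tokens that appear in the output above.

```python
import string

# A handful of tokens that appear in the word-tokenization output above.
tokens = ['A', 'medevac', 'plane', 'crashed', ',', 'the', 'U.S.', ',', 'survivors', '.']

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# Lowercase the tokens and keep only those that contain letters or digits.
content_tokens = [t.lower() for t in tokens if not is_punctuation(t)]
print(content_tokens)
```

Abbreviations such as 'U.S.' survive the filter because they contain letters, while standalone commas and periods are removed.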

# Part 2

In [38]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag

#  Sample News Excerpt
text = """A medevac plane crashed soon after takeoff in Philadelphia on Friday with a child 
and five others on board, the air ambulance company that operated it said, adding that it 
had not confirmed any survivors. Jet Rescue Air Ambulance, based in Mexico and licensed to 
operate in the U.S., said its aircraft crashed with four crew members, one pediatric medical 
patient and the patient's mother on board. "At this time we cannot confirm any survivors," 
the company said in a statement."""

#  Word Tokenization
words = word_tokenize(text)

#  POS Tagging
pos_tags = pos_tag(words)
print("\n✅POS Tagging:\n", pos_tags)

#  Identify POS Frequency Distribution
pos_freq = nltk.FreqDist(tag for (word, tag) in pos_tags)
print("\n POS Tag Frequency:\n", pos_freq.most_common())

# Display Analysis
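print("\nObservations & Patterns (for this run):")
print("✔ Frequent tags in the output: NN (noun), IN (preposition), NNP (proper noun), VBD (past-tense verb)")
print("✔ 'crashed' is tagged VBD (past tense) in the first sentence but VBN (past participle) in the second, where it is also a past-tense verb")
print("✔ Proper nouns such as Philadelphia, Mexico, and U.S. are tagged NNP")
print("✔ Some tags are debatable: 'licensed' is tagged VBD although it functions as a past participle (VBN) in 'based in Mexico and licensed to operate'")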


✅POS Tagging:
 [('A', 'DT'), ('medevac', 'NN'), ('plane', 'NN'), ('crashed', 'VBD'), ('soon', 'RB'), ('after', 'IN'), ('takeoff', 'NN'), ('in', 'IN'), ('Philadelphia', 'NNP'), ('on', 'IN'), ('Friday', 'NNP'), ('with', 'IN'), ('a', 'DT'), ('child', 'NN'), ('and', 'CC'), ('five', 'CD'), ('others', 'NNS'), ('on', 'IN'), ('board', 'NN'), (',', ','), ('the', 'DT'), ('air', 'NN'), ('ambulance', 'NN'), ('company', 'NN'), ('that', 'WDT'), ('operated', 'VBD'), ('it', 'PRP'), ('said', 'VBD'), (',', ','), ('adding', 'VBG'), ('that', 'IN'), ('it', 'PRP'), ('had', 'VBD'), ('not', 'RB'), ('confirmed', 'VBN'), ('any', 'DT'), ('survivors', 'NNS'), ('.', '.'), ('Jet', 'NNP'), ('Rescue', 'NNP'), ('Air', 'NNP'), ('Ambulance', 'NNP'), (',', ','), ('based', 'VBN'), ('in', 'IN'), ('Mexico', 'NNP'), ('and', 'CC'), ('licensed', 'VBD'), ('to', 'TO'), ('operate', 'VB'), ('in', 'IN'), ('the', 'DT'), ('U.S.', 'NNP'), (',', ','), ('said', 'VBD'), ('its', 'PRP$'), ('aircraft', 'NN'), ('crashed', 'VBN'), ('with', '
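
The fine-grained Penn Treebank tags above can be hard to read at a glance. One possible way to summarize them, sketched here under the assumption that a three-way split (noun, verb, other) is good enough for a quick overview, is to group tags by their prefix. The function `coarse_category` is a hypothetical helper, and the tagged pairs are the first sentence copied from the output above.

```python
from collections import Counter

# (word, tag) pairs for the first sentence, copied from the tagger output above.
first_sentence_tags = [
    ('A', 'DT'), ('medevac', 'NN'), ('plane', 'NN'), ('crashed', 'VBD'),
    ('soon', 'RB'), ('after', 'IN'), ('takeoff', 'NN'), ('in', 'IN'),
    ('Philadelphia', 'NNP'), ('on', 'IN'), ('Friday', 'NNP'), ('with', 'IN'),
    ('a', 'DT'), ('child', 'NN'), ('and', 'CC'), ('five', 'CD'),
    ('others', 'NNS'), ('on', 'IN'), ('board', 'NN'), (',', ','),
    ('the', 'DT'), ('air', 'NN'), ('ambulance', 'NN'), ('company', 'NN'),
    ('that', 'WDT'), ('operated', 'VBD'), ('it', 'PRP'), ('said', 'VBD'),
    (',', ','), ('adding', 'VBG'), ('that', 'IN'), ('it', 'PRP'),
    ('had', 'VBD'), ('not', 'RB'), ('confirmed', 'VBN'), ('any', 'DT'),
    ('survivors', 'NNS'), ('.', '.'),
]

def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse class based on its prefix."""
    if tag.startswith('NN'):
        return 'noun'
    if tag.startswith('VB'):
        return 'verb'
    return 'other'

# Count how many tokens fall into each coarse class.
coarse_counts = Counter(coarse_category(tag) for _, tag in first_sentence_tags)
print(coarse_counts.most_common())
```

Grouping by prefix works because the Penn Treebank tagset names all noun tags `NN*` and all verb tags `VB*`. Everything else, including punctuation, lands in 'other'.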

# Part 3

In [39]:
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from nltk.probability import FreqDist

#  Sample News Excerpt
text = """A medevac plane crashed soon after takeoff in Philadelphia on Friday with a child 
and five others on board, the air ambulance company that operated it said, adding that it 
had not confirmed any survivors. Jet Rescue Air Ambulance, based in Mexico and licensed to 
operate in the U.S., said its aircraft crashed with four crew members, one pediatric medical 
patient and the patient's mother on board. "At this time we cannot confirm any survivors," 
the company said in a statement."""

#  Tokenize the text into words
words = word_tokenize(text)

#  Generate Bi-Grams and Tri-Grams
bigrams = list(ngrams(words, 2))
trigrams = list(ngrams(words, 3))

#  Compute Frequency Distributions
bigram_freq = FreqDist(bigrams)
trigram_freq = FreqDist(trigrams)

#  Most Common Bi-Grams and Tri-Grams
top_bigrams = bigram_freq.most_common(5)
top_trigrams = trigram_freq.most_common(5)

#  Print Results
print("\n Most Common Bi-Grams:\n", top_bigrams)
print("\n Most Common Tri-Grams:\n", top_trigrams)


 Most Common Bi-Grams:
 [(('on', 'board'), 2), (('any', 'survivors'), 2), (('A', 'medevac'), 1), (('medevac', 'plane'), 1), (('plane', 'crashed'), 1)]

 Most Common Tri-Grams:
 [(('A', 'medevac', 'plane'), 1), (('medevac', 'plane', 'crashed'), 1), (('plane', 'crashed', 'soon'), 1), (('crashed', 'soon', 'after'), 1), (('soon', 'after', 'takeoff'), 1)]
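
The counts above are case-sensitive, so 'Air Ambulance' and 'air ambulance' are treated as different bi-grams, and with only three sentences every tri-gram occurs once. The sketch below shows one possible refinement, not the notebook's own method: lowercase the text first and discard any bi-gram that contains a punctuation token. To stay dependency-free it uses a simple regular expression instead of `word_tokenize`, which is an assumption of this example and splits some tokens differently (for instance 'U.S.' and 'cannot'). The helper `is_word` is hypothetical.

```python
import re
from collections import Counter

text = """A medevac plane crashed soon after takeoff in Philadelphia on Friday with a child
and five others on board, the air ambulance company that operated it said, adding that it
had not confirmed any survivors. Jet Rescue Air Ambulance, based in Mexico and licensed to
operate in the U.S., said its aircraft crashed with four crew members, one pediatric medical
patient and the patient's mother on board. "At this time we cannot confirm any survivors,"
the company said in a statement."""

# Lowercase, then split into word tokens (allowing an internal "." or "'")
# and single punctuation marks.
tokens = re.findall(r"\w+(?:[.']\w+)*|[^\w\s]", text.lower())

def is_word(token):
    """Return True for tokens that start with a letter or digit."""
    return token[0].isalnum()

# Build adjacent pairs and keep only those made of two word tokens, so that
# bi-grams never span a comma, period, or quotation mark.
bigram_counts = Counter(
    pair for pair in zip(tokens, tokens[1:]) if all(is_word(t) for t in pair)
)
print(bigram_counts.most_common(3))
```

With lowercasing, ('air', 'ambulance') reaches a count of 2 alongside ('on', 'board') and ('any', 'survivors'), which matches the article's focus on the air ambulance, the people on board, and the unconfirmed survivors.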
