# Airline Customer Review Analysis using NLP: Text Preprocessing, POS Tagging, Parsing & Ambiguity Resolution with NLTK

--------------------------------------------------------------------------------------------------------------------------------

In [1]:
# Importing libraries
import os
import re
import string
import time

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from IPython.display import HTML, Markdown, SVG, display
from nltk import CFG, ChartParser, RecursiveDescentParser, ShiftReduceParser, Tree, pos_tag
from nltk.corpus import stopwords
from nltk.parse.chart import BottomUpChartParser, TopDownChartParser
from nltk.tokenize import sent_tokenize, word_tokenize
from tabulate import tabulate

#  Download NLTK resources
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)

True

---------------------------------------------------------------------------------------------------------------------------------

# TASK 1

In [2]:
# Load data file from local machine and set as dataframe
file_path = r"C:\Users\BITSprem007\Downloads\NLP\capstone_airline_reviews3.xlsx"
df = pd.read_excel(file_path)

# Drop blank/null rows from 'customer_review'
df = df.dropna(subset=["customer_review"])
df = df[df["customer_review"].str.strip().astype(bool)]

# Define stopwords
stop_words = set(stopwords.words("english"))

# Step 1: Remove punctuations, special characters & stopwords
def clean_text_no_lower(text):
    text = str(text)
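    # Round-trip through UTF-8; this only drops characters that cannot be encoded
    # (e.g. lone surrogates). Mojibake such as "âœ…" is removed by the regex below.
    text = text.encode("utf-8", "ignore").decode("utf-8", "ignore")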
    # Remove punctuations & special characters
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [w for w in tokens if w.lower() not in stop_words]
    return " ".join(tokens)

df["removed_punct_stop"] = df["customer_review"].apply(clean_text_no_lower)

# Step 2: Convert to lowercase
df["lower_case_review"] = df["removed_punct_stop"].str.lower()

# Final output
display(HTML("<span style='font-size:100%; font-weight:bold; text-decoration:underline;'>TASK 1 :</span>"))

# Rename columns for display
subset = df.rename(columns={
    "customer_review": "Customer review in original form",
    "removed_punct_stop": "After removing punctuations, special characters & stopwords",
    "lower_case_review": "After converting to lowercase"
})[[
    "Customer review in original form",
    "After removing punctuations, special characters & stopwords",
    "After converting to lowercase"
]].head(5).reset_index(drop=True)

display(subset)

# Final Answer Statement
print("We downloaded the file to the local machine and loaded it into a DataFrame.")
print("We successfully cleaned punctuations, special characters, and stopwords from 'customer_review'.")
print("Finally, we converted the cleaned text to lowercase and displayed the first 5 processed rows in tabular form.")


| | Customer review in original form | After removing punctuations, special characters & stopwords | After converting to lowercase |
|---|---|---|---|
| 0 | âœ… Trip Verified \| London to Izmir via Istanb... | Trip Verified London Izmir via Istanbul First ... | trip verified london izmir via istanbul first ... |
| 1 | âœ… Trip Verified \| Istanbul to Bucharest. We ... | Trip Verified Istanbul Bucharest make check ai... | trip verified istanbul bucharest make check ai... |
| 2 | âœ… Trip Verified \| Rome to Prishtina via Ista... | Trip Verified Rome Prishtina via Istanbul flew... | trip verified rome prishtina via istanbul flew... |
| 3 | âœ… Trip Verified \| Flew on Turkish Airlines I... | Trip Verified Flew Turkish Airlines IAD IST KH... | trip verified flew turkish airlines iad ist kh... |
| 4 | âœ… Trip Verified \| Mumbai to Dublin via Istan... | Trip Verified Mumbai Dublin via Istanbul Never... | trip verified mumbai dublin via istanbul never... |


We downloaded the file to the local machine and loaded it into a DataFrame.
We successfully cleaned punctuations, special characters, and stopwords from 'customer_review'.
Finally, we converted the cleaned text to lowercase and displayed the first 5 processed rows in tabular form.
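
To make the two cleaning steps easier to follow on a single string, here is a minimal, dependency-free sketch. It is not the notebook's function: `clean_review`, `STOP_WORDS` and the sample sentence are hypothetical, the stopword set is a tiny stand-in for NLTK's English list, and `str.split()` stands in for `word_tokenize`.

```python
import re

# Assumption: a tiny stand-in stopword set; the notebook uses NLTK's full English list.
STOP_WORDS = {"to", "the", "we", "a", "of", "and", "is", "was"}


def clean_review(text: str) -> str:
    """Replace non-letters with spaces, drop stopwords, then lowercase."""
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", text)
    # str.split() stands in for word_tokenize; after the regex only
    # letters and whitespace remain, so splitting on whitespace is enough here.
    tokens = [w for w in letters_only.split() if w.lower() not in STOP_WORDS]
    return " ".join(tokens).lower()


# Hypothetical sample modelled on the first row of the table
sample = "✅ Trip Verified | London to Izmir via Istanbul."
print(clean_review(sample))
```

The check mark, the pipe and the full stop are replaced by spaces, "to" is dropped as a stopword, and "via" survives because it is not in the stopword set.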


# TASK 2

## Task 2.1 POS Tagging 

In [22]:
# Use the last 2 reviews from Task 1
reviews_column = df['lower_case_review'].tail(2).reset_index(drop=True)

# POS Tagging
display(Markdown("### POS tagging on the last 2 rows of customer_review"))

for review in reviews_column:
    # Display the full review as header
    display(Markdown(f"**Customer Review:** {review}"))
    
    # Tokenize review into words
    tokens = word_tokenize(review)
    tags = pos_tag(tokens)
    
    display(Markdown(f"**POS Tags:** {tags}"))

### POS tagging on the last 2 rows of customer_review

**Customer Review:** several flights kbp ams times one way lgw r zrh twice one way txl kbp mixed experience yelled tried correct agent pronunciation phoenix final destination day time misplaced onward zrh lax tickets wife two kids agent thought gave someone else mistake ran looking even though tried explaining must placed somewhere desk fact guys tried make giving us passes kbp business class lounge simply time go run straight gate delay otherwise average service newer planes even started smiling

**POS Tags:** [('several', 'JJ'), ('flights', 'NNS'), ('kbp', 'VBD'), ('ams', 'JJ'), ('times', 'NNS'), ('one', 'CD'), ('way', 'NN'), ('lgw', 'VBZ'), ('r', 'NN'), ('zrh', 'NN'), ('twice', 'RB'), ('one', 'CD'), ('way', 'NN'), ('txl', 'NN'), ('kbp', 'VB'), ('mixed', 'JJ'), ('experience', 'NN'), ('yelled', 'VBD'), ('tried', 'JJ'), ('correct', 'JJ'), ('agent', 'NN'), ('pronunciation', 'NN'), ('phoenix', 'IN'), ('final', 'JJ'), ('destination', 'NN'), ('day', 'NN'), ('time', 'NN'), ('misplaced', 'VBN'), ('onward', 'RB'), ('zrh', 'JJ'), ('lax', 'JJ'), ('tickets', 'NNS'), ('wife', 'NN'), ('two', 'CD'), ('kids', 'NNS'), ('agent', 'JJ'), ('thought', 'VBN'), ('gave', 'VBD'), ('someone', 'NN'), ('else', 'RB'), ('mistake', 'NN'), ('ran', 'VBD'), ('looking', 'VBG'), ('even', 'RB'), ('though', 'IN'), ('tried', 'JJ'), ('explaining', 'VBG'), ('must', 'MD'), ('placed', 'VBN'), ('somewhere', 'RB'), ('desk', 'JJ'), ('fact', 'NN'), ('guys', 'NNS'), ('tried', 'VBD'), ('make', 'VBP'), ('giving', 'VBG'), ('us', 'PRP'), ('passes', 'VBZ'), ('kbp', 'NN'), ('business', 'NN'), ('class', 'NN'), ('lounge', 'NN'), ('simply', 'RB'), ('time', 'NN'), ('go', 'VB'), ('run', 'RB'), ('straight', 'RB'), ('gate', 'JJ'), ('delay', 'NN'), ('otherwise', 'RB'), ('average', 'JJ'), ('service', 'NN'), ('newer', 'NN'), ('planes', 'NNS'), ('even', 'RB'), ('started', 'VBD'), ('smiling', 'VBG')]

**Customer Review:** kbp ams uia although relatively short flight good meal drinks served staff friendly seating ok looking price ticket think well worth money spend

**POS Tags:** [('kbp', 'NN'), ('ams', 'NNS'), ('uia', 'VBP'), ('although', 'IN'), ('relatively', 'RB'), ('short', 'JJ'), ('flight', 'NN'), ('good', 'JJ'), ('meal', 'NN'), ('drinks', 'NNS'), ('served', 'VBD'), ('staff', 'NN'), ('friendly', 'RB'), ('seating', 'VBG'), ('ok', 'RP'), ('looking', 'VBG'), ('price', 'NN'), ('ticket', 'NN'), ('think', 'VBP'), ('well', 'RB'), ('worth', 'JJ'), ('money', 'NN'), ('spend', 'NN')]
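
A tag tally gives a quick overview of what the tagger produced and makes oddities easier to spot. The sketch below counts the tags of the second review with `collections.Counter`; the tag list is copied from the output above, and the variable names are illustrative.

```python
from collections import Counter

# (word, tag) pairs copied from the second review's POS output
tags = [
    ('kbp', 'NN'), ('ams', 'NNS'), ('uia', 'VBP'), ('although', 'IN'),
    ('relatively', 'RB'), ('short', 'JJ'), ('flight', 'NN'), ('good', 'JJ'),
    ('meal', 'NN'), ('drinks', 'NNS'), ('served', 'VBD'), ('staff', 'NN'),
    ('friendly', 'RB'), ('seating', 'VBG'), ('ok', 'RP'), ('looking', 'VBG'),
    ('price', 'NN'), ('ticket', 'NN'), ('think', 'VBP'), ('well', 'RB'),
    ('worth', 'JJ'), ('money', 'NN'), ('spend', 'NN'),
]

# Count how often each tag occurs, most frequent first
tag_counts = Counter(tag for _, tag in tags)
for tag, count in tag_counts.most_common():
    print(f"{tag:<4} {count}")
```

Singular nouns (NN) account for 8 of the 23 tokens. The tokens 'kbp', 'ams' and 'uia', which look like airport or airline codes, are tagged as a noun, a plural noun and a verb respectively, so the tags on such tokens should be read with caution.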

## Task 2.2 & 2.3  Parsing + Visual Tree + Efficiency

In [42]:
# ==============================
# Task 2: POS Tagging and Parsing (Without Operation Count)
# ==============================

import nltk
from nltk import CFG
from nltk.parse.chart import TopDownChartParser, BottomUpChartParser
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from IPython.display import display, Markdown
import pandas as pd
import time

# Download necessary NLTK resources
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Pick last 2 reviews
sample_reviews = df["lower_case_review"].tail(2).reset_index(drop=True)
token_limits = [20, 15]  # Token limit for each review

display(Markdown("### Parsing Sentences Using Top-Down & Bottom-Up Methods"))

efficiency_results = []

for i, review in enumerate(sample_reviews, 1):
    display(Markdown(f"## Review {i} Parsing"))
    tokens = word_tokenize(review)
    tags = pos_tag(tokens)

    # Filter alphabetic tokens
    alpha_tags = [(w, p) for w, p in tags if w.isalpha()]
    if not alpha_tags:
        continue

    # Limit tokens according to predefined limits
    alpha_tags = alpha_tags[:token_limits[i-1]]

    # Build simple CFG: S -> POS POS ...
    pos_sequence = " ".join(p for _, p in alpha_tags)
    grammar_rules = [f"S -> {pos_sequence}"]
    for w, p in alpha_tags:
        grammar_rules.append(f"{p} -> '{w}'")
    grammar_text = "\n".join(grammar_rules)

    try:
        grammar = CFG.fromstring(grammar_text)
    except Exception as e:
        display(Markdown(f" Skipping review {i} due to CFG error: {e}"))
        continue

    # ---------------- Top-Down Parser ----------------
    # perf_counter() is a monotonic, high-resolution clock intended for timing short intervals
    td_parser = TopDownChartParser(grammar)
    t0 = time.perf_counter()
    td_trees = list(td_parser.parse([w for w, _ in alpha_tags]))
    td_time = time.perf_counter() - t0

    # ---------------- Bottom-Up Parser ----------------
    bu_parser = BottomUpChartParser(grammar)
    t1 = time.perf_counter()
    bu_trees = list(bu_parser.parse([w for w, _ in alpha_tags]))
    bu_time = time.perf_counter() - t1

    # Display CFG
    display(Markdown(f"### Review {i} CFG:\n```\n{grammar_text}\n```"))

    # Display Parse Trees
    if td_trees:
        display(Markdown("**Top-Down Parse Tree:**"))
        td_trees[0].pretty_print()
        # td_trees[0].draw()  # Uncomment to open tree GUI

    if bu_trees:
        display(Markdown("**Bottom-Up Parse Tree:**"))
        bu_trees[0].pretty_print()
        # bu_trees[0].draw()  # Uncomment to open tree GUI

    # Track efficiency (time only)
    efficiency_results.append({
        'Review': i,
        'Tokens': len(alpha_tags),
        'Top-Down Time (s)': td_time,
        'Bottom-Up Time (s)': bu_time
    })

# Display efficiency comparison
display(Markdown("### Efficiency Comparison"))
efficiency_df = pd.DataFrame(efficiency_results)
display(efficiency_df)


### Parsing Sentences Using Top-Down & Bottom-Up Methods

## Review 1 Parsing

### Review 1 CFG:
```
S -> JJ NNS VBD JJ NNS CD NN VBZ NN NN RB CD NN NN VB JJ NN VBD JJ JJ
JJ -> 'several'
NNS -> 'flights'
VBD -> 'kbp'
JJ -> 'ams'
NNS -> 'times'
CD -> 'one'
NN -> 'way'
VBZ -> 'lgw'
NN -> 'r'
NN -> 'zrh'
RB -> 'twice'
CD -> 'one'
NN -> 'way'
NN -> 'txl'
VB -> 'kbp'
JJ -> 'mixed'
NN -> 'experience'
VBD -> 'yelled'
JJ -> 'tried'
JJ -> 'correct'
```

**Top-Down Parse Tree:**

                                               S                                                             
    ___________________________________________|_________________________________________________________     
   JJ     NNS   VBD  JJ  NNS   CD  NN VBZ  NN  NN   RB   CD  NN  NN  VB   JJ      NN      VBD     JJ     JJ  
   |       |     |   |    |    |   |   |   |   |    |    |   |   |   |    |       |        |      |      |    
several flights kbp ams times one way lgw  r  zrh twice one way txl kbp mixed experience yelled tried correct



**Bottom-Up Parse Tree:**

                                               S                                                             
    ___________________________________________|_________________________________________________________     
   JJ     NNS   VBD  JJ  NNS   CD  NN VBZ  NN  NN   RB   CD  NN  NN  VB   JJ      NN      VBD     JJ     JJ  
   |       |     |   |    |    |   |   |   |   |    |    |   |   |   |    |       |        |      |      |    
several flights kbp ams times one way lgw  r  zrh twice one way txl kbp mixed experience yelled tried correct



## Review 2 Parsing

### Review 2 CFG:
```
S -> NN NNS VBP IN RB JJ NN JJ NN NNS VBD NN RB VBG RP
NN -> 'kbp'
NNS -> 'ams'
VBP -> 'uia'
IN -> 'although'
RB -> 'relatively'
JJ -> 'short'
NN -> 'flight'
JJ -> 'good'
NN -> 'meal'
NNS -> 'drinks'
VBD -> 'served'
NN -> 'staff'
RB -> 'friendly'
VBG -> 'seating'
RP -> 'ok'
```

**Top-Down Parse Tree:**

                                              S                                                
  ____________________________________________|______________________________________________   
 NN NNS VBP    IN        RB       JJ    NN    JJ   NN   NNS    VBD     NN     RB      VBG    RP
 |   |   |     |         |        |     |     |    |     |      |      |      |        |     |  
kbp ams uia although relatively short flight good meal drinks served staff friendly seating  ok



**Bottom-Up Parse Tree:**

                                              S                                                
  ____________________________________________|______________________________________________   
 NN NNS VBP    IN        RB       JJ    NN    JJ   NN   NNS    VBD     NN     RB      VBG    RP
 |   |   |     |         |        |     |     |    |     |      |      |      |        |     |  
kbp ams uia although relatively short flight good meal drinks served staff friendly seating  ok



### Efficiency Comparison

| | Review | Tokens | Top-Down Time (s) | Bottom-Up Time (s) |
|---|---|---|---|---|
| 0 | 1 | 20 | 0.003913 | 0.003944 |
| 1 | 2 | 15 | 0.002607 | 0.001802 |


##### In Task 2, the last two customer reviews were analyzed. Each review was first POS-tagged in full. A CFG was then built from the first 20 tokens of the first review and the first 15 tokens of the second: one flat rule `S -> ...` listing the tag sequence, plus one lexical rule per token. Because such a grammar can derive only that exact token sequence, the Top-Down and Bottom-Up chart parsers necessarily return the same flat tree, and the trees carry no phrase structure beyond the tags. The efficiency table records one timed run per parser: 0.003913 s (top-down) versus 0.003944 s (bottom-up) for Review 1, and 0.002607 s versus 0.001802 s for Review 2. Differences this small, taken from single runs lasting a few milliseconds, are within timing noise and do not show that either strategy is faster.
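
Single runs of a few milliseconds are sensitive to whatever else the machine is doing, so a steadier comparison repeats each call many times and reports a robust statistic. The helper below is a minimal sketch of that idea using only the standard library; `median_runtime`, its `repeats` default and the stand-in workload are hypothetical and not part of the notebook.

```python
import statistics
import time


def median_runtime(fn, repeats=200):
    """Call fn() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload (hypothetical); in the notebook this would be a parser call
def workload():
    return sorted(range(1000), reverse=True)


print(f"median: {median_runtime(workload):.6f} s")
```

Inside the Task 2 loop, the same helper could be applied to each parser, for example `median_runtime(lambda: list(td_parser.parse([w for w, _ in alpha_tags])))`, so that both strategies are compared on medians rather than on one run each.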

# TASK 3

### Ambiguity is a common challenge in parsing natural language sentences. Consider the sentence "Time flies like an arrow." This sentence can be ambiguous or misleading due to its multiple possible interpretations.

#### 1. Explain the source of ambiguity in this sentence and how it poses a challenge to standard CFGs.

The sentence “Time flies like an arrow” is ambiguous because three of its words can belong to more than one part of speech. “Time” can be a noun (the concept of time) or a verb (to measure how long something takes); “flies” can be a verb (moves quickly) or a noun (the insects); and “like” can be a preposition (similar to) or a verb (to be fond of). Each combination of choices that forms a grammatical sentence gives a different structure, so the sentence has several valid readings:

“Time moves quickly, in the way an arrow does” (the intended, idiomatic reading: noun, verb, preposition).

“Measure the speed of flies in the same way you would measure an arrow” (imperative reading: verb, noun, preposition).

“Insects called ‘time flies’ are fond of an arrow” (compound-noun reading: noun, noun, verb).

A standard context-free grammar (CFG) has no trouble producing these structures; the difficulty is that it cannot choose among them. If the lexical rules allow each word all of its categories, a parser returns one tree per reading, and nothing in the formalism marks which one is intended. If the rules are restricted to avoid that, the grammar may exclude the intended reading or license only an unintended one. Picking the right parse needs information that a plain CFG does not carry, such as rule probabilities (a PCFG) or semantic constraints.
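
To make the ambiguity concrete, the sketch below parses the sentence with a small CFG in which “Time”, “flies” and “like” each keep two categories. The grammar is an illustrative assumption written for this example, and `str.split()` is used instead of `word_tokenize` so that no NLTK data download is needed.

```python
from nltk import CFG, ChartParser

# Assumption: an illustrative grammar in which each ambiguous word keeps
# both of its categories; it is written only for this example.
ambiguous_grammar = CFG.fromstring("""
S -> NP VP | VP
NP -> N | N N | Det N
VP -> V PP | V NP | V NP PP
PP -> P NP
Det -> 'an'
N -> 'Time' | 'flies' | 'arrow'
V -> 'Time' | 'flies' | 'like'
P -> 'like'
""")

tokens = "Time flies like an arrow".split()

# A chart parser enumerates every tree the grammar licenses
all_trees = list(ChartParser(ambiguous_grammar).parse(tokens))

print(f"{len(all_trees)} parse trees")
for tree in all_trees:
    print(tree)
```

With this grammar the parser returns three trees: one with “flies” as the main verb, one with “Time flies” as a compound noun and “like” as the verb, and one imperative with “Time” as the verb. All three are equally valid as far as the CFG is concerned.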

#### 2. Propose a modification to a CFG that could correctly parse this sentence without leading to incorrect interpretations.

In [46]:
import nltk
from nltk import CFG, ChartParser
from nltk.tokenize import word_tokenize
from IPython.display import display, Markdown

# Sentence
sentence = "Time flies like an arrow"
tokens = word_tokenize(sentence)

# Modified CFG to handle multiple interpretations
grammar_text = """
S -> NP VP
NP -> N | N N | Det N
VP -> V NP | V NP PP | V
PP -> P NP
Det -> 'an'
N -> 'Time' | 'flies' | 'arrow'
V -> 'flies' | 'like'
P -> 'like'
"""

grammar = CFG.fromstring(grammar_text)

# Create parser
parser = ChartParser(grammar)

# Parse the sentence
trees = list(parser.parse(tokens))

# Display CFG
display(Markdown(f"### CFG used:\n```\n{grammar_text}\n```"))

# Display all parse trees
if trees:
    for i, tree in enumerate(trees, 1):
        display(Markdown(f"**Parse Tree {i}:**"))
        tree.pretty_print()
        # tree.draw()  # Uncomment to open the tree in a GUI window (blocks until the window is closed)
else:
    display(Markdown("No parse tree could be generated."))


### CFG used:
```

S -> NP VP
NP -> N | N N | Det N
VP -> V NP | V NP PP | V
PP -> P NP
Det -> 'an'
N -> 'Time' | 'flies' | 'arrow'
V -> 'flies' | 'like'
P -> 'like'

```

**Parse Tree 1:**

                S                
       _________|____             
      |              VP          
      |          ____|___         
      NP        |        NP      
  ____|____     |     ___|____    
 N         N    V   Det       N  
 |         |    |    |        |   
Time     flies like  an     arrow



In this code, the sentence “Time flies like an arrow” is split into tokens and parsed with a small CFG. The grammar defines noun phrases (NP), verb phrases (VP), and prepositional phrases (PP), and its lexical rules give two words a second category: “flies” can be a noun or a verb, and “like” can be a verb or a preposition. “Time” is listed only as a noun and there is no S -> VP rule, which rules out the imperative reading.

Compared with a minimal S -> NP VP grammar, the additions are:

NP -> N N, which allows the compound noun “Time flies”.

VP -> V NP PP | V, which allows a verb followed by an object and a prepositional phrase, or a verb on its own.

NLTK’s ChartParser returns every tree the grammar licenses, and here there is exactly one: “Time flies” is a compound-noun subject and “like” is the main verb with “an arrow” as its object. The intended reading, in which “flies” is the verb and “like an arrow” is a PP modifying it, is not derivable, because no VP rule has the form V PP; VP -> V NP PP requires an NP between the verb and the PP. The grammar therefore does yield a single, unambiguous parse, but it is the “time flies are fond of an arrow” reading. Adding VP -> V PP would admit the intended reading, at the cost of returning two trees unless the compound-noun rule or the verb entry for “like” is removed.
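
One way to keep every word's alternative categories and still return a single, intended tree is to attach probabilities to the rules and take the most probable parse. The sketch below does this with NLTK's `PCFG` and `ViterbiParser`. The rule set (including the `VP -> V PP` and `S -> VP` rules) and all probability values are illustrative assumptions chosen by hand, not estimates from a corpus.

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

# Assumption: hand-picked probabilities for illustration only; a real PCFG
# would estimate them from a treebank. Each left-hand side sums to 1.
pcfg = PCFG.fromstring("""
S -> NP VP [0.9] | VP [0.1]
NP -> N [0.5] | N N [0.1] | Det N [0.4]
VP -> V PP [0.5] | V NP [0.3] | V NP PP [0.2]
PP -> P NP [1.0]
Det -> 'an' [1.0]
N -> 'Time' [0.4] | 'flies' [0.2] | 'arrow' [0.4]
V -> 'Time' [0.1] | 'flies' [0.6] | 'like' [0.3]
P -> 'like' [1.0]
""")

tokens = "Time flies like an arrow".split()

# ViterbiParser yields only the single most probable tree
best_tree = next(ViterbiParser(pcfg).parse(tokens))
best_tree.pretty_print()
print(f"probability: {best_tree.prob():.6f}")
```

Under these assumed weights the selected tree has “Time” as the subject NP and “flies” as the verb, followed by the PP “like an arrow”. The compound-noun and imperative readings remain derivable but receive lower probabilities, so the choice among readings comes from the weights, not from deleting rules.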