# **Assignment 3 - NLP Project - Group 1 Wednesday Morning**

---

# **Introduction**

---

Effective communication is a critical skill in professional settings, especially during job interviews. Candidates are often evaluated not just on the content of their responses but also on their delivery: grammar, tone, clarity, and overall professionalism. However, giving objective, consistent feedback on interview responses is difficult for recruiters, and job seekers rarely have a reliable way to assess themselves.


**Why Smart Interview Assistance Matters**

Automated interview assistance tools can help candidates improve their speaking skills by providing constructive feedback on their responses. Such systems offer several key benefits:



*   **Objective Evaluation:** Reduces human bias by applying the same scoring criteria to every response.

*   **Personalised Feedback:** Highlights individual strengths and areas for improvement.
*   **Skill Development:** Helps candidates refine their responses, improve articulation, and reduce filler word usage.
*   **Scalable Solution:** Enables large-scale interview preparation without the need for human reviewers.


**The Problem We Are Solving**

Job seekers often struggle with articulating professional responses in interviews. Common issues include:
*   Overuse of filler words (e.g., "um", "like", "you know").

*   Poor grammar and sentence structure.
*   Negative or uncertain tone.
*   Lack of vocabulary diversity.


**Proposed Solution**

Our Smart Interview Assistance project leverages Natural Language Processing (NLP) and machine learning to analyse interview responses and provide professionalism scores. The core steps include:

* **Data Preprocessing:** Cleaning and preparing textual data for analysis.

* **Feature Extraction:** Extracting linguistic features such as sentiment scores, lexical diversity, grammar issues, and filler word counts.

* **Model Training:** Using machine learning algorithms (e.g., Logistic Regression, Random Forest) to predict professionalism scores.

* **Evaluation & Feedback:** Assessing model performance and generating actionable feedback for users.



In [2]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

# Import pandas for data manipulation
import pandas as pd

# Load the CSV dataset from Google Drive
df = pd.read_csv('/content/drive/MyDrive/RatedInterviewQuestionsDataset.csv')

# Drop index column if it exists (some CSVs export an unnamed index column)
df = df.drop(columns=['Unnamed: 0'], errors='ignore')

# Display dataset shape and basic info
print(df.shape)
print(df.columns)
print(df.head(3))
print(df['professionalism_rating'].value_counts())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(200, 3)
Index(['question', 'answer', 'professionalism_rating'], dtype='object')
                                            question  \
0  How do you stay motivated during repetitive ta...   
1  What skills do you hope to develop in your nex...   
2  How do you deal with feedback that you disagre...   

                                              answer  professionalism_rating  
0  Well, you know, repetitive tasks, they're just...                     0.0  
1  Well, that's a great question! Honestly, I hav...                     0.0  
2  When faced with conflict, I approach it calmly...                     1.0  
professionalism_rating
0.0    100
1.0    100
Name: count, dtype: int64


In [3]:
import re  # Import regular expressions library

# Function to preprocess text data
def preprocess_text(text: str) -> str:
    # Lowercase
    text = text.lower()

    # Replace newlines with space
    text = re.sub(r'[\n\r]+', ' ', text)

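    # Remove every character that is not a lowercase letter, digit, or whitespace.
    # Punctuation is deleted, not replaced, so "just...there" becomes "justthere".
    text = re.sub(r'[^a-z0-9\s]', '', text)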

    # Normalize multiple spaces to single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply preprocessing to dataset column 'answer' and store in new column 'clean_answer'
df['clean_answer'] = df['answer'].apply(preprocess_text)

# Print original and cleaned text for comparison
print(df['answer'].iloc[0], "\n--> \n", df['clean_answer'].iloc[0])

Well, you know, repetitive tasks, they're just...there. You gotta do them, right? I try to just, like, get them over with. Honestly, sometimes I zone out a bit. It's not ideal, I know, but it helps. I listen to music, usually something upbeat, to keep me from falling asleep.

And coffee! Lots of coffee. Gotta keep that caffeine flowing, haha! Sometimes, I just tell myself it's temporary, you know? Just a little blip on the radar. I think about the bigger picture, maybe, like the end result. Or I just daydream.

Honestly, I haven't really figured out the perfect system yet. I kinda wing it. I'm a pretty adaptable person though, so I manage. Hopefully, the task doesn't last too long. And if it does, well, I just push through. 
--> 
 well you know repetitive tasks theyre justthere you gotta do them right i try to just like get them over with honestly sometimes i zone out a bit its not ideal i know but it helps i listen to music usually something upbeat to keep me from falling asleep a
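
The cleaned output above contains `justthere` because punctuation is deleted outright, which fuses words that were separated only by punctuation. The sketch below shows one possible alternative, assuming it is acceptable to drop apostrophes (so contractions stay in one piece, as in the original output) and to turn all other punctuation into a space. `preprocess_text_spaced` is a hypothetical name and is not used elsewhere in the notebook.

```python
import re

def preprocess_text_spaced(text: str) -> str:
    """Hypothetical variant of preprocess_text that keeps word boundaries."""
    text = text.lower()
    # Drop apostrophes first so "they're" still becomes "theyre"
    text = re.sub(r"['’]", "", text)
    # Replace every other non-alphanumeric character with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "Well, you know, they're just...there."
print(preprocess_text_spaced(sample))
# well you know theyre just there
```

Whether this variant is preferable depends on the downstream features: it changes token counts, so lexical diversity and filler counts would shift slightly.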

In [4]:
# Update the list of available packages and their versions
!apt-get update

# Install OpenJDK 17 (Java Development Kit version 17) without asking for confirmation (-y flag)
!apt-get install -y openjdk-17-jdk

Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading

In [5]:
# Install and import needed libraries
# !pip install spacy nltk language_tool_python textstat vaderSentiment

import nltk, spacy, language_tool_python, textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Download spaCy English model and NLTK data
try:
    nlp = spacy.load('en_core_web_sm')  # Load small English model
except OSError:  # spacy.load raises OSError when the model is not installed
    spacy.cli.download('en_core_web_sm')  # Download if not present
    nlp = spacy.load('en_core_web_sm')
nltk.download('punkt')  # Download tokenizer data for NLTK if needed

# Initialize sentiment analyzer and grammar tool
sentiment_analyzer = SentimentIntensityAnalyzer()
tool = language_tool_python.LanguageTool('en-US')  # Grammar checker tool

# List of filler words to track
filler_words = ["um", "uh", "like", "you know", "actually", "basically",
                "literally", "i mean", "hmm", "ah", "ok so"]

# Function to extract features from text
def extract_features(text: str) -> dict:
    original_text = text  # Preserve original text
    clean_text = preprocess_text(text)  # Apply text preprocessing

    # 1. Sentiment score (compound value)
    compound = sentiment_analyzer.polarity_scores(original_text)['compound']

    # 2. Lexical diversity (unique words / total words)
    tokens = [tok for tok in clean_text.split() if tok.isalpha()]
    lex_div = len(set(tokens)) / len(tokens) if tokens else 0

    # 3. Grammar issue count
    matches = tool.check(original_text)
    gram_issues = len([m for m in matches if m.ruleId not in {"PUNCTUATION", "WHITESPACE_RULE"}])

    # 4. Flesch readability score (ease of reading metric)
    flesch_score = textstat.flesch_reading_ease(original_text)

    # 5. Filler word count
    filler_count = 0
    for f in filler_words:
        filler_count += len(re.findall(rf'\b{re.escape(f)}\b', clean_text))

    # 6. Part-of-Speech (POS) counts using spaCy
    doc = nlp(original_text)
    pos_counts = {"NOUN": 0, "VERB": 0, "ADJ": 0, "ADV": 0, "PRON": 0, "INTJ": 0}

    for token in doc:
        # Treat proper nouns as general nouns
        pos = "NOUN" if token.pos_ == "PROPN" else token.pos_
        if pos in pos_counts:
            pos_counts[pos] += 1

    # Aggregate all features into a dictionary
    features = {
        'sentiment': compound,
        'lexical_diversity': lex_div,
        'grammar_errors': gram_issues,
        'readability': flesch_score,
        'filler_count': filler_count,
        'noun_count': pos_counts['NOUN'],
        'verb_count': pos_counts['VERB'],
        'adj_count': pos_counts['ADJ'],
        'adv_count': pos_counts['ADV'],
        'pronoun_count': pos_counts['PRON'],
        'interjection_count': pos_counts['INTJ']
    }

    return features

# Apply feature extraction function to each answer in DataFrame
feature_rows = []
for ans in df['answer']:
    feature_rows.append(extract_features(ans))

# Convert extracted features to DataFrame
features_df = pd.DataFrame(feature_rows)

# Add professionalism rating column to features DataFrame
features_df['professionalism'] = df['professionalism_rating']

# Preview first 5 rows of the final features DataFrame
print(features_df.head(5))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


   sentiment  lexical_diversity  grammar_errors  readability  filler_count  \
0     0.9867           0.700000               2    78.680928             4   
1     0.9689           0.666667               0    82.550769             8   
2    -0.0772           0.896552               0    60.841638             0   
3     0.7964           0.904762               0    18.920357             0   
4     0.9941           0.554622               1    89.092660            10   

   noun_count  verb_count  adj_count  adv_count  pronoun_count  \
0          16          25         10         21             26   
1          14          24         12         11             27   
2           4           7          0          2              5   
3           3           3          2          2              3   
4          10          24         14         13             26   

   interjection_count  professionalism  
0                   3              0.0  
1                   7              0.0  
2          
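
To make two of the features above concrete, here is a small worked example of the lexical diversity ratio and the filler count on a single made-up sentence. It is only a sketch: `FILLERS` is a shortened copy of the notebook's list, and the helper names `lexical_diversity` and `count_fillers` are hypothetical.

```python
import re

# Shortened copy of the notebook's filler list, for illustration only
FILLERS = ["um", "like", "you know"]

def lexical_diversity(clean_text: str) -> float:
    # Unique alphabetic tokens divided by total alphabetic tokens
    tokens = [t for t in clean_text.split() if t.isalpha()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def count_fillers(clean_text: str) -> int:
    # Whole-word (or whole-phrase) matches for each filler, summed
    return sum(len(re.findall(rf"\b{re.escape(f)}\b", clean_text)) for f in FILLERS)

sample = "um i like you know i like teamwork"
print(lexical_diversity(sample))  # 6 unique / 8 total = 0.75
print(count_fillers(sample))      # um (1) + like (2) + you know (1) = 4
```

Note that both occurrences of "like" are counted even though they are used as a verb here. A pattern-based counter cannot tell a filler from a legitimate use of the same word, so `filler_count` is best read as an upper bound.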

In [15]:
# Install TensorFlow; `tensorflow-stubs` is dropped because pip finds no matching distribution for it
!pip install tensorflow

In [17]:
# Split data into training and test sets (80/20 split)
from sklearn.model_selection import train_test_split
X = df['clean_answer'].values
y = df['professionalism_rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenize text and convert to sequences of integers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)  # fit on training text to build vocabulary
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq  = tokenizer.texts_to_sequences(X_test)

# Pad sequences to a uniform length (use the max length from training data)
max_len = max(len(seq) for seq in X_train_seq)
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post')
X_test_pad  = pad_sequences(X_test_seq, maxlen=max_len, padding='post')

# Determine vocabulary size for the embedding layer
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary size:", vocab_size, "| Sequence length:", max_len)

Vocabulary size: 1306 | Sequence length: 145
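
The tokenize-and-pad step can be hard to picture, so the following pure-Python sketch mimics it on toy sentences. It is a simplified stand-in, not the Keras implementation: `build_vocab`, `to_sequences`, and `pad_post` are hypothetical helpers, and the sentences are made up. It assumes words are indexed by descending frequency, index 0 is reserved for padding, and words unseen during fitting are skipped.

```python
from collections import Counter

def build_vocab(texts):
    counts = Counter(word for text in texts for word in text.split())
    # Most frequent word gets index 1; index 0 is reserved for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, vocab):
    # Words that were not seen while building the vocabulary are skipped
    return [[vocab[w] for w in text.split() if w in vocab] for text in texts]

def pad_post(seqs, maxlen):
    # Keep the last `maxlen` ids of long sequences, then right-pad with zeros
    # (assumes maxlen >= 1)
    return [seq[-maxlen:] + [0] * (maxlen - len(seq[-maxlen:])) for seq in seqs]

train = ["i like teamwork", "i like clear goals and feedback"]
test = ["i like deadlines"]

vocab = build_vocab(train)                  # built from training text only
train_seq = to_sequences(train, vocab)
max_len = max(len(s) for s in train_seq)

print(pad_post(train_seq, max_len))         # [[1, 2, 3, 0, 0, 0], [1, 2, 4, 5, 6, 7]]
print(pad_post(to_sequences(test, vocab), max_len))  # [[1, 2, 0, 0, 0, 0]]
print(len(vocab) + 1)                       # 8
```

The unseen word "deadlines" simply disappears from the test sequence, which is why building the vocabulary on the training split only matters: the test set is encoded with whatever the model could have learned from.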


In [18]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build a simple LSTM regression model
embedding_dim = 100  # dimension for word embeddings
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))

# Compile the model with MSE loss and MAE metric
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
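
The summary output is not reproduced here, but the trainable parameter counts can be estimated by hand. The arithmetic below is a sketch that assumes the vocabulary size of 1306 printed earlier, the standard LSTM layout of four gates with input weights, recurrent weights, and one bias vector each, and a dense layer with bias.

```python
vocab_size, embedding_dim, lstm_units = 1306, 100, 64

# One embedding vector per vocabulary index
embedding_params = vocab_size * embedding_dim
# Four gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# Single output unit: one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 130600 42240 65 172905
```

Most of the capacity sits in the embedding layer, which is worth keeping in mind given that the training split holds only 160 answers.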



In [10]:
# Train the model on the training set
epochs = 10
batch_size = 16
history = model.fit(X_train_pad, y_train, epochs=epochs, batch_size=batch_size, verbose=2)

Epoch 1/10
10/10 - 4s - 413ms/step - loss: 0.2545 - mae: 0.5012
Epoch 2/10
10/10 - 1s - 79ms/step - loss: 0.2211 - mae: 0.4642
Epoch 3/10
10/10 - 1s - 52ms/step - loss: 0.1076 - mae: 0.2684
Epoch 4/10
10/10 - 1s - 53ms/step - loss: 0.0934 - mae: 0.2171
Epoch 5/10
10/10 - 1s - 51ms/step - loss: 0.0573 - mae: 0.1775
Epoch 6/10
10/10 - 1s - 54ms/step - loss: 0.0327 - mae: 0.1419
Epoch 7/10
10/10 - 1s - 52ms/step - loss: 0.0081 - mae: 0.0837
Epoch 8/10
10/10 - 1s - 53ms/step - loss: 0.0041 - mae: 0.0617
Epoch 9/10
10/10 - 1s - 61ms/step - loss: 0.0019 - mae: 0.0418
Epoch 10/10
10/10 - 1s - 54ms/step - loss: 0.0011 - mae: 0.0330


In [13]:
# Use the trained model to predict on the test set
y_pred = model.predict(X_test_pad).flatten()

# Calculate evaluation metrics: MAE, RMSE, and R^2
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Test MAE: {mae:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Test R^2: {r2:.4f}")

2/2 ━━━━━━━━━━━━━━━━━━━━ 1s 628ms/step
Test MAE: 0.0297
Test RMSE: 0.0301
Test R^2: 0.9963
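
For readers who want to see what the three metrics measure, the sketch below recomputes them from their definitions on four made-up predictions. `regression_metrics` is a hypothetical helper, and the numbers are illustrative, not taken from the model.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n            # mean absolute error
    mse = sum(e ** 2 for e in errors) / n            # mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)             # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, math.sqrt(mse), r2

y_true_demo = [0.0, 0.0, 1.0, 1.0]
y_pred_demo = [0.1, 0.0, 0.9, 1.0]

mae_demo, rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(round(mae_demo, 4), round(rmse_demo, 4), round(r2_demo, 4))
# 0.05 0.0707 0.98
```

Because the ratings in this dataset are only 0.0 or 1.0, thresholding the predictions at 0.5 and reporting accuracy would be a natural complement to these regression metrics.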


In [20]:
# 1. Build a small DataFrame of test answers with actual vs predicted
test_df = pd.DataFrame({
    'clean_answer': X_test,      # your test answers
    'actual_rating': y_test,     # true scores
    'predicted_rating': y_pred   # model predictions
})

# 2. Merge back to get the original question and original (uncleaned) answer
test_df = test_df.merge(
    df[['clean_answer', 'question', 'answer']],
    on='clean_answer',
    how='left'
)

# 3. Reorder/display the first few rows
display(
    test_df[['question', 'answer', 'actual_rating', 'predicted_rating']].head(10)
)

|   | question | answer | actual_rating | predicted_rating |
|---|---|---|---|---|
| 0 | How do you respond when a teammate isn't pulli... | Okay, so if someone's slacking, that's, like, ... | 0.0 | 0.038998 |
| 1 | How do you respond when a teammate isn't pulli... | Okay, so like, if someone's slacking, it's not... | 0.0 | 0.034559 |
| 2 | What are your expectations from leadership in ... | Well, I guess, like, leadership, right? It's i... | 0.0 | 0.035289 |
| 3 | How does this role fit into your career path? | My long-term goal is to grow into a senior rol... | 1.0 | 0.974177 |
| 4 | How does this role fit into your career path? | My long-term goal is to grow into a senior rol... | 1.0 | 0.974177 |
| 5 | Where do you see yourself in 5 years? | My long-term goal is to grow into a senior rol... | 1.0 | 0.974233 |
| 6 | Where do you see yourself in 5 years? | My long-term goal is to grow into a senior rol... | 1.0 | 0.974233 |
| 7 | Where do you see yourself in 5 years? | My long-term goal is to grow into a senior rol... | 1.0 | 0.974233 |
| 8 | Where do you see yourself in 5 years? | My long-term goal is to grow into a senior rol... | 1.0 | 0.974233 |
| 9 | What would you do if you disagreed with your m... | When faced with conflict, I approach it calmly... | 1.0 | 0.974408 |
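
The preview shows the same answer text attached to several questions, which suggests that identical answers could sit on both sides of the train/test split and inflate the test scores. The sketch below is one simple way to check for that. `split_overlap` is a hypothetical helper and the strings are made up for illustration.

```python
def split_overlap(train_texts, test_texts):
    """Return the test texts that also occur verbatim in the training texts."""
    train_set = set(train_texts)
    return [t for t in test_texts if t in train_set]

train_demo = ["my longterm goal is to grow", "okay so like if someones slacking"]
test_demo = ["my longterm goal is to grow", "when faced with conflict i stay calm"]

leaked = split_overlap(train_demo, test_demo)
print(len(leaked), "of", len(test_demo), "test answers also appear in training")
# 1 of 2 test answers also appear in training
```

In the notebook this would be called as `split_overlap(X_train, X_test)`. If the overlap is large, deduplicating on `clean_answer` before splitting would give a more honest estimate. Repeated answers may also explain the duplicated rows in the preview, since a merge on a non-unique `clean_answer` returns one row per match.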
