Milestone 1: Preprocessing the Text
Includes cleaning, tokenization, stopword removal, and lemmatization.


1.1: Loading the Dataset

In [2]:
import pandas as pd

df = pd.read_excel(r'C:\Users\jmhol\Documents\AngeloState\CS6399-Project\student_feedback_dataset.xlsx')
df.head()


Unnamed: 0,Course_ID,Course_Name,Course_Type,Delivery_Mode,Class_Size,Difficulty_Rating,Instructor_ID,Instructor_Tenure,Instructor_Field,Student_ID,Year,Enrollment_Status,Overall_GPA,Grade_in_Course,Feedback_Text,Date_Submitted,Numeric_Rating_Instructor,Numeric_Rating_Course,Overall_Satisfaction
0,CSE174,Balanced solution-oriented groupware,Lab,In-Person,100.0,3.0,I3947,No,Mathematics,S54300,Senior,Part-time,3.81,B,The professor was very engaging and the course...,2025-01-26,4.0,2.0,1.0
1,CSE418,Seamless tertiary encoding,Lab,Online,63.0,2.0,I1123,No,Computer Science,S43210,Sophomore,Full-time,3.21,D,I liked the assignments but found the lectures...,2025-03-08,1.0,5.0,5.0
2,CSE399,Balanced systemic instruction set,Seminar,In-Person,23.0,3.0,I1340,Yes,Psychology,S99523,Senior,Full-time,2.45,C,Poor structure and lack of feedback made this ...,2024-12-23,5.0,5.0,4.0
3,CSE205,Front-line zero tolerance workforce,Lecture,Online,59.0,2.0,I9773,Yes,History,S37128,Freshman,Full-time,3.83,B,I struggled with the pace and the material was...,2025-02-22,1.0,2.0,3.0
4,CSE349,Multi-lateral motivating superstructure,Lab,Hybrid,53.0,5.0,I4581,No,History,S30918,Sophomore,Part-time,3.99,D,Poor structure and lack of feedback made this ...,2025-02-09,2.0,1.0,3.0
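
The absolute Windows path in the cell above only resolves on the author's machine. A minimal sketch of one alternative, assuming a hypothetical `data` folder beside the notebook (the folder and the name `DATA_DIR` are illustrative and not part of the original project; only the file names come from the notebook):

```python
from pathlib import Path

# Hypothetical project layout: a "data" folder next to the notebook.
DATA_DIR = Path("data")

# File names are taken from the notebook; only the directory is assumed.
input_path = DATA_DIR / "student_feedback_dataset.xlsx"
cleaned_path = DATA_DIR / "student_feedback_cleaned.csv"

print(input_path.name)
print(cleaned_path.suffix)
```

`pd.read_excel` and `DataFrame.to_csv` both accept `Path` objects, so these could be passed in place of the raw strings.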


1.2: Cleaning the Feedback Text

In [3]:
import string

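def clean_text(text):
    # Lowercase and strip ASCII punctuation; missing (non-string) feedback becomes ''.
    # string.punctuation is ASCII-only, so characters such as the em dash survive.
    if not isinstance(text, str):
        return ''
    return text.lower().translate(str.maketrans('', '', string.punctuation))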

#creating a new column
df['Cleaned_Feedback'] = df['Feedback_Text'].apply(clean_text)

#looking at the new column and data
df[['Feedback_Text', 'Cleaned_Feedback']].head(10)


Unnamed: 0,Feedback_Text,Cleaned_Feedback
0,The professor was very engaging and the course...,the professor was very engaging and the course...
1,I liked the assignments but found the lectures...,i liked the assignments but found the lectures...
2,Poor structure and lack of feedback made this ...,poor structure and lack of feedback made this ...
3,I struggled with the pace and the material was...,i struggled with the pace and the material was...
4,Poor structure and lack of feedback made this ...,poor structure and lack of feedback made this ...
5,Great experience! I learned a lot and felt sup...,great experience i learned a lot and felt supp...
6,"It was an average experience—not too hard, not...",it was an average experience—not too hard not ...
7,I really enjoyed this class. It was well-organ...,i really enjoyed this class it was wellorganiz...
8,Great experience! I learned a lot and felt sup...,great experience i learned a lot and felt supp...
9,I liked the assignments but found the lectures...,i liked the assignments but found the lectures...
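
Two side effects show up in rows 6 and 7 of the preview: the em dash is left in place, and "well-organized" collapses to "wellorganized" because the hyphen is deleted rather than replaced. A possible variant, offered as a sketch rather than the notebook's method, replaces every non-word character with a space instead. The name `clean_text_spaced` is hypothetical.

```python
import re

def clean_text_spaced(text):
    """Lowercase, turn punctuation (ASCII or Unicode) into spaces, collapse whitespace."""
    if not isinstance(text, str):
        return ""
    # \w is Unicode-aware in Python 3, so dashes and quotes of any kind are replaced.
    # Underscores count as word characters and are kept.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

sample = "It was well-organized\u2014not too hard!"
print(clean_text_spaced(sample))  # it was well organized not too hard
```

Splitting on the hyphen yields two tokens ("well", "organized") instead of one fused token, which may or may not be preferable for the later lemmatization step.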


1.3: Tokenization 

In [4]:
import spacy

#spacy english model
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc if not token.is_space]

df['Tokens'] = df['Cleaned_Feedback'].apply(spacy_tokenizer)

df[['Cleaned_Feedback', 'Tokens']].head(10)


Unnamed: 0,Cleaned_Feedback,Tokens
0,the professor was very engaging and the course...,"[the, professor, was, very, engaging, and, the..."
1,i liked the assignments but found the lectures...,"[i, liked, the, assignments, but, found, the, ..."
2,poor structure and lack of feedback made this ...,"[poor, structure, and, lack, of, feedback, mad..."
3,i struggled with the pace and the material was...,"[i, struggled, with, the, pace, and, the, mate..."
4,poor structure and lack of feedback made this ...,"[poor, structure, and, lack, of, feedback, mad..."
5,great experience i learned a lot and felt supp...,"[great, experience, i, learned, a, lot, and, f..."
6,it was an average experience—not too hard not ...,"[it, was, an, average, experience, —, not, too..."
7,i really enjoyed this class it was wellorganiz...,"[i, really, enjoyed, this, class, it, was, wel..."
8,great experience i learned a lot and felt supp...,"[great, experience, i, learned, a, lot, and, f..."
9,i liked the assignments but found the lectures...,"[i, liked, the, assignments, but, found, the, ..."


1.4: Remove Stopwords

In [5]:
from spacy.lang.en.stop_words import STOP_WORDS

# remove stopwords
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in STOP_WORDS]

# Apply to Tokens column
df['Tokens_NoStop'] = df['Tokens'].apply(remove_stopwords)

#preview
df[['Tokens', 'Tokens_NoStop']].head(10)

Unnamed: 0,Tokens,Tokens_NoStop
0,"[the, professor, was, very, engaging, and, the...","[professor, engaging, course, material, clear]"
1,"[i, liked, the, assignments, but, found, the, ...","[liked, assignments, found, lectures, boring]"
2,"[poor, structure, and, lack, of, feedback, mad...","[poor, structure, lack, feedback, class, diffi..."
3,"[i, struggled, with, the, pace, and, the, mate...","[struggled, pace, material, unclear]"
4,"[poor, structure, and, lack, of, feedback, mad...","[poor, structure, lack, feedback, class, diffi..."
5,"[great, experience, i, learned, a, lot, and, f...","[great, experience, learned, lot, felt, suppor..."
6,"[it, was, an, average, experience, —, not, too...","[average, experience, —, hard, easy]"
7,"[i, really, enjoyed, this, class, it, was, wel...","[enjoyed, class, wellorganized, informative]"
8,"[great, experience, i, learned, a, lot, and, f...","[great, experience, learned, lot, felt, suppor..."
9,"[i, liked, the, assignments, but, found, the, ...","[liked, assignments, found, lectures, boring]"
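
Row 6 shows that the negation is lost: "not too hard not too easy" becomes "hard, easy". If the filtered tokens are later used for anything sentiment-related, it may be worth keeping negation words. The sketch below uses a small toy stopword set so it runs without spaCy; `TOY_STOP_WORDS`, `NEGATIONS`, and `remove_stopwords_keep_negation` are illustrative names, and in the notebook `STOP_WORDS` would be passed as `stop_words` instead.

```python
# Toy stand-in for spaCy's STOP_WORDS so the example is self-contained.
TOY_STOP_WORDS = {"it", "was", "an", "not", "too", "the", "and", "i"}

# Words to protect from removal (an assumption; extend as needed).
NEGATIONS = {"not", "no", "never"}

def remove_stopwords_keep_negation(tokens, stop_words=TOY_STOP_WORDS, keep=NEGATIONS):
    # Remove stopwords except those explicitly protected.
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in effective]

tokens = ["it", "was", "an", "average", "experience",
          "not", "too", "hard", "not", "too", "easy"]
result = remove_stopwords_keep_negation(tokens)
print(result)  # ['average', 'experience', 'not', 'hard', 'not', 'easy']
```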


1.5: Lemmatization - Reducing words to their base form

In [6]:
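# spaCy lemmatization
def lemmatize_tokens(tokens):
    # Re-parse the joined tokens so spaCy can assign lemmas,
    # then drop any remaining stopwords, punctuation, and whitespace.
    doc = nlp(" ".join(tokens))
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and not token.is_space]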

# application 
df['Lemmas'] = df['Tokens_NoStop'].apply(lemmatize_tokens)

#preview
df[['Tokens_NoStop', 'Lemmas']].head(10)


Unnamed: 0,Tokens_NoStop,Lemmas
0,"[professor, engaging, course, material, clear]","[professor, engage, course, material, clear]"
1,"[liked, assignments, found, lectures, boring]","[like, assignment, find, lecture, boring]"
2,"[poor, structure, lack, feedback, class, diffi...","[poor, structure, lack, feedback, class, diffi..."
3,"[struggled, pace, material, unclear]","[struggle, pace, material, unclear]"
4,"[poor, structure, lack, feedback, class, diffi...","[poor, structure, lack, feedback, class, diffi..."
5,"[great, experience, learned, lot, felt, suppor...","[great, experience, learn, lot, felt, support]"
6,"[average, experience, —, hard, easy]","[average, experience, hard, easy]"
7,"[enjoyed, class, wellorganized, informative]","[enjoy, class, wellorganize, informative]"
8,"[great, experience, learned, lot, felt, suppor...","[great, experience, learn, lot, felt, support]"
9,"[liked, assignments, found, lectures, boring]","[like, assignment, find, lecture, boring]"


1.6: Exporting the Data 

In [7]:
df.to_csv(r'C:\Users\jmhol\Documents\AngeloState\CS6399-Project\student_feedback_cleaned.csv', index=False)
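
One thing to keep in mind when reloading this CSV: the list-valued columns (`Tokens`, `Tokens_NoStop`, `Lemmas`) are written as their string representation, so they come back as plain strings rather than lists. The sketch below reproduces the effect with the standard-library `csv` module, which stringifies non-string cells in the same way, and restores the list with `ast.literal_eval`. The sample lemmas are copied from row 0 of the preview.

```python
import ast
import csv
import io

lemmas = ["professor", "engage", "course", "material", "clear"]

# Write one list-valued cell to an in-memory CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Lemmas"])
writer.writerow([lemmas])  # the list is written via str()

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]
print(type(raw_cell).__name__)

# literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(raw_cell)
print(restored == lemmas)
```

After `pd.read_csv`, the same conversion could be applied with `df['Lemmas'].apply(ast.literal_eval)`.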


Step 2: Sentiment Analysis Using VADER

2.1 Installing and Importing VADER

In [8]:
import nltk
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\jmhol\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

2.2 Initializing the VADER Analyzer

In [9]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()


2.3 Applying VADER to the DataFrame

In [10]:
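# Score the raw feedback (not the cleaned text): VADER uses capitalization
# and punctuation as intensity cues
df['VADER_Scores'] = df['Feedback_Text'].apply(vader.polarity_scores)

# Expand the score dicts into separate neg/neu/pos/compound columns
df = pd.concat([df, pd.DataFrame(df['VADER_Scores'].tolist(), index=df.index)], axis=1)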

#preview
df[['Feedback_Text', 'compound', 'pos', 'neu', 'neg']].head(10)


Unnamed: 0,Feedback_Text,compound,pos,neu,neg
0,The professor was very engaging and the course...,0.6478,0.37,0.63,0.0
1,I liked the assignments but found the lectures...,-0.2617,0.175,0.553,0.272
2,Poor structure and lack of feedback made this ...,-0.7845,0.0,0.47,0.53
3,I struggled with the pace and the material was...,-0.5267,0.0,0.614,0.386
4,Poor structure and lack of feedback made this ...,-0.7845,0.0,0.47,0.53
5,Great experience! I learned a lot and felt sup...,0.7712,0.572,0.428,0.0
6,"It was an average experience—not too hard, not...",-0.4226,0.0,0.678,0.322
7,I really enjoyed this class. It was well-organ...,0.5563,0.31,0.69,0.0
8,Great experience! I learned a lot and felt sup...,0.7712,0.572,0.428,0.0
9,I liked the assignments but found the lectures...,-0.2617,0.175,0.553,0.272
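
For context on the `compound` column: it is a normalized version of the summed word-level valence scores. The sketch below shows that normalization step as it appears, to the best of my knowledge, in the VADER reference implementation, where the raw sum x is mapped to x / sqrt(x^2 + alpha) with a default alpha of 15 and then clipped to [-1, 1]. The function name `normalize_compound` is illustrative.

```python
import math

def normalize_compound(raw_sum, alpha=15):
    """Map an unbounded valence sum into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clip defensively so the result never leaves [-1, 1].
    return max(-1.0, min(1.0, score))

for raw in (-4.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize_compound(raw), 4))
```

Because of this squashing, the compound score saturates for long or strongly worded feedback, which is one reason it is usually thresholded rather than read as a linear scale.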


2.4 VADER Score Thresholds

In [11]:
# Classify a compound score using the commonly used VADER cutoffs of +/-0.05
def get_sentiment_label(score):
    if score >= 0.05:
        return 'Positive'
    elif score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# apply to compound column of df
df['Sentiment_Label'] = df['compound'].apply(get_sentiment_label)

#preview
df[['Feedback_Text', 'compound', 'Sentiment_Label']].head(10)


Unnamed: 0,Feedback_Text,compound,Sentiment_Label
0,The professor was very engaging and the course...,0.6478,Positive
1,I liked the assignments but found the lectures...,-0.2617,Negative
2,Poor structure and lack of feedback made this ...,-0.7845,Negative
3,I struggled with the pace and the material was...,-0.5267,Negative
4,Poor structure and lack of feedback made this ...,-0.7845,Negative
5,Great experience! I learned a lot and felt sup...,0.7712,Positive
6,"It was an average experience—not too hard, not...",-0.4226,Negative
7,I really enjoyed this class. It was well-organ...,0.5563,Positive
8,Great experience! I learned a lot and felt sup...,0.7712,Positive
9,I liked the assignments but found the lectures...,-0.2617,Negative
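
As a quick sanity check on the thresholds, the sketch below applies the same rule to the ten compound scores shown in the preview and tallies the labels. It is self-contained and uses only the values printed above, so it says nothing about the full dataset.

```python
from collections import Counter

def get_sentiment_label(score):
    # Same cutoffs as the notebook cell above.
    if score >= 0.05:
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    return "Neutral"

# Compound scores copied from the ten preview rows.
preview_scores = [0.6478, -0.2617, -0.7845, -0.5267, -0.7845,
                  0.7712, -0.4226, 0.5563, 0.7712, -0.2617]

label_counts = Counter(get_sentiment_label(s) for s in preview_scores)
print(label_counts["Positive"], label_counts["Negative"], label_counts["Neutral"])  # 4 6 0
```

On the full frame, `df['Sentiment_Label'].value_counts()` gives the equivalent tally. Note that row 6 ("an average experience", neither too hard nor too easy) reads as neutral but scores -0.4226, so labels near mixed or hedged wording may deserve a manual spot check.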


2.5 Export

In [12]:
df.to_csv(r'C:\Users\jmhol\Documents\AngeloState\CS6399-Project\student_feedback_with_sentiment.csv', index=False)


In [13]:
# 20 most frequent lemmas across all feedback entries
df['Lemmas'].explode().value_counts().head(20)


Lemmas
class         168
professor     121
great          98
course         94
hard           93
content        76
lecture        69
experience     62
enjoy          61
material       61
fair           59
follow         56
engage         54
difficult      54
structure      48
lack           48
assignment     46
practical      42
time           41
recommend      41
Name: count, dtype: int64
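
For readers less familiar with `explode()`: it flattens the per-row lemma lists into one long column so that `value_counts()` can count individual words. A plain-Python sketch of the same idea with `collections.Counter`, using three lemma lists copied from the 1.5 preview (rows 0, 1, and 3):

```python
from collections import Counter
from itertools import chain

# Lemma lists copied from preview rows 0, 1, and 3.
lemma_lists = [
    ["professor", "engage", "course", "material", "clear"],
    ["like", "assignment", "find", "lecture", "boring"],
    ["struggle", "pace", "material", "unclear"],
]

# Flatten the nested lists (the "explode" step), then count occurrences.
lemma_counts = Counter(chain.from_iterable(lemma_lists))
top_lemma, top_count = lemma_counts.most_common(1)[0]
print(top_lemma, top_count)  # material 2
```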