# **Imports and importing the data**

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import re
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error


In [None]:
rateMyProfData = pd.read_csv('Cleaned_UW_RMP.csv')
rateMyProfData.head()

# **Data Exploration**

## **Further Data Checking and Cleaning**

In [None]:
# Count the null values in each column
rateMyProfData.isnull().sum()

In the above cell, we can see a significant number of null values in the columns that are optional for users when leaving a review. Because so many values are missing, these columns are not useful for our analysis, so we drop them.
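To make "a significant number" concrete, the share of missing values per column can be computed directly. The snippet below is a minimal sketch on a made-up toy frame (the values and the 0.5 threshold are our assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "Review-Body": ["great class", "tough exams", None, "fair grader"],
    "Quality": [5.0, 2.0, 4.0, 3.5],
    "Textbook": [None, None, "Yes", None],
})

# isnull() gives booleans, so the column mean is the fraction of missing values.
null_share = toy.isnull().mean()

# Assumed cutoff: flag columns where more than half of the values are missing.
mostly_empty = null_share[null_share > 0.5].index.tolist()

print(null_share)
print(mostly_empty)
```

Running the same two lines on `rateMyProfData` would show how sparse each optional column actually is before deciding which ones to drop.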

In [None]:
rateMyProfData.drop(columns=['For-Credit', 
                         'Attendance', 
                         'Take-Again', 
                         'Grade', 
                         'Textbook'], inplace=True)

Next, we need to deal with the null values that remain. The review body has only one null value, and since the review body is required for our analysis, we drop that row.

In [None]:
rateMyProfData = rateMyProfData.dropna(subset=['Review-Body'])

Finally, we chose to drop rows with a null course name or course number, so that every remaining row can be used for some interesting course-level analysis later.

In [None]:
rateMyProfData = rateMyProfData.dropna(subset=['Course-Name', 'Course-Number'])
rateMyProfData['Course-Number'] = [str(num)[0] for num in rateMyProfData['Course-Number']] 
# User-entered course numbers are inconsistent, so keep only the first
# character as the course level, e.g. 1 = 100 level, 4 = 400 level.
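Taking the first character only works when the entry starts with a digit. As a hedged alternative, the sketch below pulls out the first digit wherever it appears. The helper name `course_level` and the sample entries are hypothetical and only illustrate the kinds of input that might occur:

```python
import re

def course_level(raw):
    """Return the first digit found in raw as a string, or None if there is none."""
    match = re.search(r"\d", str(raw))
    return match.group(0) if match else None

# Hypothetical user entries, not taken from the dataset.
samples = [142, "311", "CSE142", " 403", "intro"]
levels = [course_level(s) for s in samples]
print(levels)  # ['1', '3', '1', '4', None]
```

With `str(num)[0]`, the last three samples would yield `'C'`, `' '`, and `'i'` instead. In pandas, something like `rateMyProfData['Course-Number'].astype(str).str.extract(r'(\d)', expand=False)` should give the same result column-wise.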


## **Data Analysis**

In [None]:
rateMyProfData['Quality'].plot(kind='hist', bins=9, title='Quality', align='mid', width=0.4)

plt.gca().spines[['top', 'right']].set_visible(False)
plt.xlabel('Quality')
plt.ylabel('Frequency')

plt.show()

In [None]:
rateMyProfData['Difficulty'].plot(kind='hist', bins=5, title='Difficulty', width=0.6)
plt.gca().spines[['top', 'right',]].set_visible(False)
plt.xticks(range(1, 6))

plt.show()

In [None]:
rateMyProfData['Course-Number'].value_counts().plot(kind='bar')
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for i, course_number in enumerate([1, 2, 3, 4]):
  ax = axes[i // 2, i % 2]
  course_data = rateMyProfData[rateMyProfData['Course-Number'] == str(course_number)]
  course_data['Quality'].plot(kind='hist', bins=9, title=f'Quality Distribution for {course_number}00 level courses', ax=ax, width=0.4)
  ax.spines[['top', 'right']].set_visible(False)
  ax.set_xlabel('Quality')
  ax.set_ylabel('Frequency')

plt.tight_layout()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

for i, course_number in enumerate([1, 2, 3, 4]):
    ax = axes[i // 2, i % 2]
    course_data = rateMyProfData[rateMyProfData['Course-Number'] == str(course_number)]
    difficulty_values = course_data['Difficulty']
    bin_edges = np.arange(0.5, 6, 1)  # Calculate bin edges manually to center bars
    ax.hist(difficulty_values, bins=bin_edges, align='mid', rwidth=0.8)
    ax.set_title(f'Difficulty Distribution for {course_number}00 level courses')
    ax.set_xlabel('Difficulty')
    ax.set_ylabel('Frequency')
    ax.set_xticks(range(1, 6))
    ax.set_xticklabels(range(1, 6))
    ax.spines[['top', 'right']].set_visible(False)

plt.tight_layout()
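The manual bin edges are what center each bar on an integer rating. A small sketch with made-up ratings shows what `np.arange(0.5, 6, 1)` produces and how the counts fall into the bins:

```python
import numpy as np

# Made-up difficulty ratings, for illustration only.
ratings = np.array([1, 1, 2, 3, 3, 3, 5])

# Six edges (0.5, 1.5, ..., 5.5) define five bins, one per integer rating.
bin_edges = np.arange(0.5, 6, 1)
counts, edges = np.histogram(ratings, bins=bin_edges)

# Each bin's midpoint is exactly the integer rating it holds.
centers = (edges[:-1] + edges[1:]) / 2

print(counts.tolist())   # [2, 1, 3, 0, 1]
print(centers.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Because every bin is centered on an integer, the tick positions from `range(1, 6)` line up with the bars.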

In [None]:
course_counts = rateMyProfData['Course-Name'].value_counts()
for course_name, count in course_counts.items():
  if count > 100:
    print(f"{course_name}: {count}")

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(12, 10))

for i, course_name in enumerate(course_counts.index[:9]):
  ax = axes[i // 3, i % 3]
  course_data = rateMyProfData[rateMyProfData['Course-Name'] == course_name]
  course_data['Quality'].plot(kind='hist', bins=9, title=f'{course_name} Quality', ax=ax, width=0.4)
  ax.spines[['top', 'right']].set_visible(False)

plt.tight_layout()

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(12, 10))

for i, course_name in enumerate(course_counts.index[:9]):
  ax = axes[i // 3, i % 3]
  course_data = rateMyProfData[rateMyProfData['Course-Name'] == course_name]
  bin_edges = np.arange(0.5, 6, 1)  # Calculate bin edges manually to center bars
  course_data['Difficulty'].plot(kind='hist', bins=bin_edges, title=f'{course_name} Difficulty', ax=ax, width=0.6)
  ax.spines[['top', 'right']].set_visible(False)
  ax.set_xlabel('Difficulty')
  ax.set_ylabel('Frequency')
  ax.set_xticks(range(1, 6))
  ax.set_xticklabels(range(1, 6))

plt.tight_layout()

Possible analysis with Course name, date, etc.

# **Prediction of Quality and Difficulty**

## **Basic Machine Learning**

To start creating a prediction of quality and difficulty for the reviews, we need to do some pre-processing on the review bodies. Our goal is to compare a basic Machine Learning model here to a more robust transformer. We have chosen to use a linear regression model, and to compare it to a BERT model.

### Review Pre-Processing for Linear Regression Model

In [None]:
# Send to lowercase
rateMyProfData['cleanReview'] = rateMyProfData['Review-Body'].str.lower()
# remove numbers
rateMyProfData['cleanReview'] = rateMyProfData['cleanReview'].apply(lambda x: re.sub(r'\d+', '', x)) 
# remove punctuation
rateMyProfData['cleanReview'] = rateMyProfData['cleanReview'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation))) 
# remove extra spaces
rateMyProfData['cleanReview'] = rateMyProfData['cleanReview'].apply(lambda x: ' '.join([token for token in x.split()]))
# remove stop words
stop = set(stopwords.words('english'))
rateMyProfData['cleanReview'] = rateMyProfData['cleanReview'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

rateMyProfData['cleanReview'].head()
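The same steps can be collected into one function, which makes the effect of their order easier to see. This is a minimal sketch using only the standard library; the function name `clean_review` and the tiny stop-word set are our assumptions, standing in for NLTK's English list:

```python
import re
import string

# Toy stop-word set for illustration only; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"i", "is", "the", "a", "don't"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_review(text, stop_words=TOY_STOP_WORDS):
    text = text.lower()                   # send to lowercase
    text = re.sub(r"\d+", "", text)       # remove numbers
    text = text.translate(PUNCT_TABLE)    # remove punctuation (including apostrophes)
    # split() also collapses extra whitespace before stop words are filtered out
    words = [w for w in text.split() if w not in stop_words]
    return " ".join(words)

cleaned = clean_review("I DON'T think CSE 142 is easy!!")
print(cleaned)  # dont think cse easy
```

Note that "don't" becomes "dont" once punctuation is stripped, so it no longer matches a stop-word entry that keeps its apostrophe. If the stop-word list stores contractions that way, as our toy set does, removing stop words before punctuation would filter them out instead.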

#### Word Tokenization

In [None]:
rateMyProfData['review_tokens'] = rateMyProfData['cleanReview'].apply(lambda x: word_tokenize(x))
rateMyProfData['review_tokens'].head()

#### POS Tagging

In [None]:
rateMyProfData['review_tokens'] = rateMyProfData['review_tokens'].apply(lambda x: nltk.pos_tag(x))
rateMyProfData['review_tokens'].head()

#### Lemmatizing

In [None]:
# need this to get correct POS tag for nltk lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [None]:
lemmatizer = WordNetLemmatizer()

rateMyProfData['review_tokens'] = rateMyProfData['review_tokens'].apply(lambda x: [lemmatizer.lemmatize(word, get_wordnet_pos(tag))
                                                                        if get_wordnet_pos(tag)
                                                                        else lemmatizer.lemmatize(word)
                                                                        for word, tag in x])
rateMyProfData['review_tokens'].head()

### Bag of Words

In [None]:
# Join each review's lemmatized tokens back into one string so BoW can operate correctly
rateMyProfData['cleanReview'] = rateMyProfData['review_tokens'].apply(lambda x: ' '.join(x))
rateMyProfData['cleanReview'].head()

### Creating the Linear Regression Models

In [None]:
# Split the data into test and train for both predictions
X_train_Qual, X_test_Qual, y_train_Qual, y_test_Qual = train_test_split(rateMyProfData['cleanReview'], 
                                                                        rateMyProfData['Quality'], 
                                                                        test_size=0.2, 
                                                                        random_state=42)
X_train_Diff, X_test_Diff, y_train_Diff, y_test_Diff = train_test_split(rateMyProfData['cleanReview'], 
                                                                        rateMyProfData['Difficulty'], 
                                                                        test_size=0.2, 
                                                                        random_state=42)

In [None]:
# Get the bodies for analysis later
X_test_Qual_Orig = X_test_Qual
X_test_Diff_Orig = X_test_Diff

In [None]:
vectorizer_Qual = CountVectorizer()
regressor_Qual = LinearRegression()

vectorizer_Diff = CountVectorizer()
regressor_Diff = LinearRegression()

# Convert the text to a bag-of-words representation
X_train_Qual = vectorizer_Qual.fit_transform(X_train_Qual)
X_test_Qual = vectorizer_Qual.transform(X_test_Qual)

regressor_Qual.fit(X_train_Qual, y_train_Qual)
y_pred_Qual = regressor_Qual.predict(X_test_Qual)

X_train_Diff = vectorizer_Diff.fit_transform(X_train_Diff)
X_test_Diff = vectorizer_Diff.transform(X_test_Diff)

regressor_Diff.fit(X_train_Diff, y_train_Diff)
y_pred_Diff = regressor_Diff.predict(X_test_Diff)
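To illustrate what the fit and transform steps do conceptually, here is a standard-library sketch of a bag-of-words encoding. It is not how `CountVectorizer` is implemented (for example, it skips that class's tokenization rules), and the documents are made up:

```python
from collections import Counter

# Made-up documents, for illustration only.
train_docs = ["great professor great lecture", "hard exam"]
test_docs = ["great exam unfair"]

# "Fit": build the vocabulary from the training documents only.
vocab = sorted({word for doc in train_docs for word in doc.split()})

def to_counts(doc):
    """'Transform': count each vocabulary word; unseen words are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

train_matrix = [to_counts(d) for d in train_docs]
test_matrix = [to_counts(d) for d in test_docs]

print(vocab)         # ['exam', 'great', 'hard', 'lecture', 'professor']
print(train_matrix)  # [[0, 2, 0, 1, 1], [1, 0, 1, 0, 0]]
print(test_matrix)   # [[1, 1, 0, 0, 0]]
```

The word "unfair" never appears in the training documents, so it contributes nothing to the test row. This is why the vectorizer is fit on the training split and only applied to the test split.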

### Analyzing the Linear Regression performances

#### Quality Analysis

In [None]:
mse = mean_squared_error(y_test_Qual, y_pred_Qual)
print("Quality Mean Squared Error:", mse)

In [None]:
example_index = 1
specific_instance = X_test_Qual[example_index]
prediction = regressor_Qual.predict(specific_instance)
actual_label = y_test_Qual.iloc[example_index]
print("Cleaned review body at index:", example_index)
print(X_test_Qual_Orig.iloc[example_index])
print(f"Actual label for test example {example_index}:", actual_label)
print(f"Prediction for test example {example_index}:", prediction)

In [None]:
example_index = 4
specific_instance = X_test_Qual[example_index]
prediction = regressor_Qual.predict(specific_instance)
actual_label = y_test_Qual.iloc[example_index]
print("Cleaned review body at index:", example_index)
print(X_test_Qual_Orig.iloc[example_index])
print(f"Actual label for test example {example_index}:", actual_label)
print(f"Prediction for test example {example_index}:", prediction)

As we can see, the performance is quite poor. For labels ranging from 0.5 to 5.0, a mean squared error of almost 47 is very bad: its square root is close to 7, which is larger than the width of the whole rating scale.

Furthermore, the prediction at index 1 above isn't too far off, but the prediction at index 4 is far out of range. The review at index 4 contains very positive words, so the regressor is likely weighting those words too heavily. A linear model also has no notion that the maximum value is 5.

One remedy we found while researching is to clip the predictions to between 0.5 and 5.0 for quality, which removes the over-estimates on highly praising reviews. Clipping does not improve the Machine Learning model itself, but it does make the predictions more usable and lowers the mean squared error. The error is much better now at 4.6, but 4.6 is still quite bad for a range of possible values between 0.5 and 5.0.

In [None]:
y_pred_Qual = np.clip(y_pred_Qual, 0.5, 5.0)

mse = mean_squared_error(y_test_Qual, y_pred_Qual)
print("Quality Mean Squared Error after clipping:", mse)
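A small worked example shows why clipping lowers the mean squared error, and adds a mean-prediction baseline as a sanity check. The labels and raw predictions below are made up, and the baseline is our suggestion rather than something measured in this notebook:

```python
import numpy as np

# Made-up labels and raw predictions, for illustration only.
y_true = np.array([5.0, 4.0, 1.0, 3.0])
y_raw = np.array([9.0, 4.5, -2.0, 3.0])

def mse(y, y_hat):
    """Mean squared error between two arrays."""
    return float(np.mean((y - y_hat) ** 2))

# Clipping pulls out-of-range predictions back to the nearest valid rating.
y_clipped = np.clip(y_raw, 0.5, 5.0)

# Baseline: always predict the mean label, ignoring the review text entirely.
baseline = np.full_like(y_true, y_true.mean())

mse_raw = mse(y_true, y_raw)
mse_clipped = mse(y_true, y_clipped)
mse_baseline = mse(y_true, baseline)

print(mse_raw)       # 6.3125
print(mse_clipped)   # 0.125
print(mse_baseline)  # 2.1875
```

Taking the square root puts the error back on the rating scale: an MSE of 4.6 corresponds to a root mean squared error of roughly 2.1 rating points. Comparing against the mean baseline on the real test set would show whether the model learns anything beyond the average rating.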

#### Difficulty Analysis

In [None]:
mse = mean_squared_error(y_test_Diff, y_pred_Diff)
print("Difficulty Mean Squared Error:", mse)

In [None]:
example_index = 7
specific_instance = X_test_Diff[example_index]
prediction = regressor_Diff.predict(specific_instance)
actual_label = y_test_Diff.iloc[example_index]
print("Cleaned review body at index:", example_index)
print(X_test_Diff_Orig.iloc[example_index])
print(f"Actual label for test example {example_index}:", actual_label)
print(f"Prediction for test example {example_index}:", prediction)

In [None]:
example_index = 6
specific_instance = X_test_Diff[example_index]
prediction = regressor_Diff.predict(specific_instance)
actual_label = y_test_Diff.iloc[example_index]
print("Cleaned review body at index:", example_index)
print(X_test_Diff_Orig.iloc[example_index])
print(f"Actual label for test example {example_index}:", actual_label)
print(f"Prediction for test example {example_index}:", prediction)

Just as with quality, the performance of the difficulty model is poor. This model has a mean squared error of about 46, which is only slightly better.

The prediction at index 7 above is pretty close, even though the word confuse is in the review, which we thought would increase the difficulty prediction. 

The review at index 6 above is even more interesting. Even though it contains negative phrasing such as "dont ask" and the word "hard", the model predicted a difficulty score of 1.5, meaning very easy, while the true label was 5 out of 5. This is likely due to the context of the word "easy". To a human reader, the author uses "easy" to contradict what others have said, but a bag-of-words model only sees that the word is present.
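The loss of context can be shown in a few lines. In the sketch below, two made-up reviews with opposite meanings produce identical bags of words, while counting adjacent word pairs (bigrams) tells them apart. The helper names are hypothetical:

```python
from collections import Counter

def bag(text):
    """Unordered word counts, as in a bag-of-words representation."""
    return Counter(text.split())

def bigrams(text):
    """Counts of adjacent word pairs, which keep some local word order."""
    words = text.split()
    return Counter(zip(words, words[1:]))

# Made-up cleaned reviews with opposite meanings.
a = "class easy not hard"
b = "class hard not easy"

same_bag = bag(a) == bag(b)
same_bigrams = bigrams(a) == bigrams(b)

print(same_bag)      # True
print(same_bigrams)  # False
```

One possible follow-up, which we have not tried here, is passing `ngram_range=(1, 2)` to `CountVectorizer` so that pairs such as "not easy" become features of their own.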

Clipping the predictions to between 1 and 5 for difficulty, to eliminate out-of-bounds predictions, helped here again. As before, it does not improve the prediction model, only its output and the mean squared error. The error is much better now at 3.7, but again 3.7 is still quite bad for a range of possible values between 1 and 5.

In [None]:
y_pred_Diff = np.clip(y_pred_Diff, 1, 5)

mse = mean_squared_error(y_test_Diff, y_pred_Diff)
print("Difficulty Mean Squared Error after clipping:", mse)

## **BERT NLP Model**