#### Final Proposal EDF 6938: Natural Language Processing

### Study Edge: Student Feedback and Recommendation Rates
> #### Author: Erin Mowry
> #### Date: Dec. 6th, 2022
> #### Email: enmowry@ufl.edu


#### 1. Introduction 

> Study Edge is a tutoring company based in Gainesville, FL. They provide review content for a variety of University of Florida classes, including live review sessions, study hours, and 1-on-1 tutoring. 
>
> Like many other businesses, Study Edge sends out a survey to its customers at the end of every semester. The survey collects, among other things, open-ended feedback on the quality of course materials and products. It also asks students to rate how likely they are to recommend the Study Edge product to a friend. This number is important because word of mouth is the main mode of advertisement for the company. 
>
> This year, Study Edge is preparing to receive additional funding from its new parent company. This leads to the question of which problems should be tackled with these funds. Logically, the problems which should be addressed first are those which cause customers to have a low recommendation rate for the Study Edge product, since this costs the company new customers.
>
>The goal of this project is to identify the most salient problems leading to low recommendation rates. To accomplish this goal, I will use a Naive Bayes classifier to identify the top features associated with three different sets of customer feedback, grouped by recommendation rate. These features can then be used to guide spending decisions. 




#### 2. Methods 

> **I. Data Collection and Cleaning**
>
> The data were sent to me in an Excel workbook containing the contents of Google surveys sent every semester from Fall 2013 to Spring 2022. Each semester had its own sheet in the workbook, but some of the recent years were duplicated on other sheets. Although each sheet had the same heading, the responses in some cells clearly did not match the heading. 
> 
> To clean the data, I first removed the columns on each sheet that held responses to multiple choice questions or to unidentifiable open-ended questions that were not relevant to this project. This left me with columns for the date of the survey response, the course code, the recommendation rating, and two to three columns of open-ended responses. I renamed the columns of the open-ended responses to "Website/App Improvements," "Non-App Improvements," and "Other Comments." I then manually shifted comments into the appropriate column if they were not already aligned. I also removed the column containing course code because it was not relevant. 
>
> Lastly, I had to manually go through and remove responses that were either entirely blank or duplicates of other responses. Most of the duplicates I removed were longer responses that were easily identifiable as someone submitting the survey twice by accident. Others were duplicated one-word responses, and I realized too late that I couldn't be confident whether these were duplicates from the same student or just many students saying that "Nothing" could be improved. In repeating this analysis, I would need to redo this data cleaning process more thoroughly. 
>
> I ended up with n = 2377 survey responses to be used. 
>
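> A scripted version of this step might look like the sketch below. It is an illustration under stated assumptions, not the procedure used for the results in this report: the helper name `drop_blank_and_long_duplicates` and the `min_words` threshold are hypothetical, and the column names are the renamed ones described above. It drops rows whose open-ended columns are all blank and removes repeated responses only when they are long enough that two students are unlikely to have written them independently.

```python
import pandas as pd

TEXT_COLS = ["Website/App Improvements", "Non-App Improvements", "Other Comments"]

def drop_blank_and_long_duplicates(df, text_cols=TEXT_COLS, min_words=4):
    """Drop all-blank rows and repeated long responses (hypothetical helper)."""
    # Join the open-ended columns into one lowercased, whitespace-normalized string per row.
    combined = (
        df[text_cols].fillna("").astype(str)
        .apply(lambda row: " ".join(row), axis=1)
        .str.lower()
        .str.split().str.join(" ")
    )
    is_blank = combined.eq("")
    is_long = combined.str.split().str.len() >= min_words
    # Only long responses are treated as accidental double submissions;
    # short ones such as "nothing" may come from different students.
    is_repeat = combined.duplicated(keep="first") & is_long
    return df[~is_blank & ~is_repeat].reset_index(drop=True)
```

> Keeping short repeats such as "Nothing" is a deliberate choice here, since the survey data alone cannot show whether they came from one student or many.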

---


> 
> **II. Pre-processing**
>
> Upon uploading the Excel file into Colaboratory, I had to remove a few stray columns that came in with the document. I also had to fill blank cells with an empty string. The other initial task was to create a new column called 'feedback,' which combined the text in the "Website/App Improvements" and "Non-App Improvements" columns. This was the larger text that I proceeded to normalize.
>
> My text normalization was based on the methods Dr. Shin provided in class. The cleaning method is meant to remove punctuation and expand contractions. However, I did not remove digits as I thought references to time might be relevant in this context. Additionally, when I actually called the method on my feedback data, punctuation was not removed, most likely because the method relies on `str.strip`, which only trims characters from the two ends of a string. I duplicated the punctuation removal section of the code as its own method, this time applied to the tokens after tokenization, and the second time it worked. 
> 
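> The difference between trimming the ends of a string and deleting punctuation everywhere can be seen in a small standalone sketch. The helper name `remove_punctuation` and the sample sentence are made up for illustration; this is not the cleaning method used for the results in this report.

```python
import string

def remove_punctuation(text):
    """Delete every ASCII punctuation character, wherever it occurs in the text."""
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "the app is slow, and videos don't load!"

# str.strip only trims the two ends, so the comma and apostrophe survive.
stripped = sample.strip(string.punctuation)

# str.translate removes punctuation at every position.
translated = remove_punctuation(sample)

print(stripped)    # the app is slow, and videos don't load
print(translated)  # the app is slow and videos dont load
```

> Because the apostrophe is deleted too, contractions would need to be expanded before a step like this, otherwise "don't" becomes "dont" instead of "do not".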
> The other methods I used to pre-process the data were word tokenization, lemmatization, and stopword removal. I used the NLTK resources for both the word tokenizer and lemmatizer. I also used the NLTK English stopwords set. However, after reviewing what the processed data looked like with stopwords removed, I decided that some of the default stopwords were actually relevant to my context and removed them from the list. The default stopwords that I kept in my feedback data are: 'more', 'less', 'should', 'not', and 'most.'
>
> After applying each of the methods described above to the feedback data, I was left with a new column of processed data, called 'p_feedback.' At this point, I created a column for the correct labels of each text. Each text with a recommendation rate of 10 was labeled as "Perfect" (n = 1520). Text with a recommendation rate of 8 or 9 was labeled as "High" (n = 612). All texts with a recommendation rate of 7 or below were labeled as "Low" (n = 245). Lastly, I rejoined the tokens in the 'p_feedback' column into one string of feedback. 
>
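> The three rating bands can also be expressed as bins. The sketch below uses `pandas.cut` on a handful of made-up ratings; it assumes integer ratings no higher than 10, and the lower bin edge of -1 is an arbitrary choice so that a rating of 0 would still fall in the Low band.

```python
import pandas as pd

ratings = pd.Series([10, 9, 8, 7, 3, 10])  # made-up ratings for illustration

# Right-closed bins: (-1, 7] -> Low, (7, 9] -> High, (9, 10] -> Perfect.
labels = pd.cut(ratings, bins=[-1, 7, 9, 10], labels=["Low", "High", "Perfect"])

# Class sizes, useful for checking how skewed the groups are.
class_sizes = labels.value_counts()
```

> One difference from an if/else rule with a catch-all branch: a missing rating becomes a missing label here instead of falling into the Low group.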

---


>**III. Text Vectorization**
>
> I chose to use an n-gram BOW vectorization method from SKLearn. With this tool, I would be able to test a range of n-grams in my model to see which had the best fit and most interpretable results. The n-gram range (a, b) works by creating all possible features of size a, a+1,..., b-1, b. Part of my data analysis was to determine which n-gram range was ideal for training the model. At the start of the project, BOW was the text vectorization method I felt most comfortable implementing independently. Additionally, the independence assumption of BOW vectorization seemed relevant given that my text was composed of answers to two different, somewhat unrelated survey questions. 
>
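> To make the range (a, b) concrete, the sketch below enumerates the n-grams of a single already-tokenized comment in plain Python. The helper `ngrams_in_range` is hypothetical and only mirrors the enumeration; the SKLearn vectorizer additionally applies its own tokenization and builds one vocabulary across all comments.

```python
def ngrams_in_range(tokens, a, b):
    """Return every contiguous n-gram of `tokens` for n = a, a+1, ..., b."""
    grams = []
    for n in range(a, b + 1):
        # A sequence of k tokens contains k - n + 1 n-grams (none if k < n).
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "more practice problem end packet".split()
features = ngrams_in_range(tokens, 3, 4)
for feature in features:
    print(feature)
```

> For this five-token comment the range (3, 4) yields three tri-grams and two four-grams, and a comment with fewer than three tokens contributes no features at all.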

---


> **IV. Model Selection**
>
> I selected my n-gram range after comparing the recall and f1-score of the model using various ranges. The ranges I initially tested were: (1,1), (1,2), (2,2), (1,3), (2,4), (2,5), and (1,5). In selecting the desired range, I wanted to prioritize better results on the Low data group, even at the expense of the other two groups. After considering these factors, and the interpretability of various n-gram sized features, I chose to use the n-gram range (2,4). The classification report from the initial test of this range is below. 

Sklearn's score on testing data : 0.43487394957983194

Classification report for testing data : 

              precision    recall  f1-score   support

        High       0.29      0.41      0.34       123
         Low       0.17      0.41      0.24        49
     Perfect       0.73      0.45      0.56       304

    accuracy                           0.43       476

>In most of the other models, values for recall and f1-score of the Low data were below 0.20. While the (2,5) range had comparable effectiveness to this model, I chose to use the (2,4) range for analysis in order to minimize the number of features being evaluated. 
>
>
> After completing the first round of data analysis with the n-gram range (2,4), I realized that the top weighted bi-gram features were not all interpretable. For example, I could not be sure whether "practice problem" refers to a need for *more* problems or *better* problems. 
>
>    *Top 10 features in (2,4) model, Low group*
>
>         -7.9020	exam review    
>         -8.0198	practice problem
>         -8.0843	study edge     
>         -8.7129	review video   
>         -8.8465	study hour     
>         -9.0006	review session 
>         -9.0006	more practice  
>         -9.0006	more candy     
>         -9.0006	formula sheet  
>         -9.0006	end packet     
   
>
> I did some more testing with n-gram ranges whose smallest feature was a tri-gram. I decided to proceed using the range (3,4) because its classification report was very similar to the one for the (2,4) range, while the features themselves would be more interpretable. For comparison, the classification report for the model using the (3,4) range is below.
>
Sklearn's score on testing data : 0.4117647058823529

Classification report for testing data : 

              precision    recall  f1-score   support

        High       0.26      0.36      0.30       123
         Low       0.16      0.35      0.22        49
     Perfect       0.68      0.44      0.54       304

    accuracy                           0.41       476

>
>
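> The comparison across ranges could be scripted roughly as follows. This is a sketch, not the code behind the reports above: the helper name `compare_ngram_ranges`, the `target` and `seed` parameters, and the toy comments in the usage example are all assumptions. Like the notebook, it fits the vectorizer on all texts before splitting; unlike the notebook, it fixes the random seed so every range is scored on the same test rows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def compare_ngram_ranges(texts, labels, ranges, target="Low", seed=0):
    """Return {ngram_range: (recall, f1)} for the `target` class (hypothetical helper)."""
    results = {}
    for ngram_range in ranges:
        X = CountVectorizer(ngram_range=ngram_range).fit_transform(texts)
        # The same seed and labels give every range the same train/test rows.
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, stratify=labels, random_state=seed)
        y_hat = MultinomialNB().fit(X_train, y_train).predict(X_test)
        # labels=[target] restricts both scores to the class of interest.
        recall = recall_score(y_test, y_hat, labels=[target],
                              average="macro", zero_division=0)
        f1 = f1_score(y_test, y_hat, labels=[target],
                      average="macro", zero_division=0)
        results[ngram_range] = (recall, f1)
    return results

# Usage on made-up comments (10 per class so the stratified split is possible).
toy_texts = (["need more practice problem please"] * 10
             + ["give more token each month"] * 10
             + ["study edge great nothing change"] * 10)
toy_labels = ["Low"] * 10 + ["High"] * 10 + ["Perfect"] * 10
scores = compare_ngram_ranges(toy_texts, toy_labels, [(1, 1), (2, 4), (3, 4)])
```

> Scoring only the Low class reflects the priority described above; the toy data are trivially separable and say nothing about the real survey.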

---


> **V. Language Model**
>
> The base task of this project was to classify testing data by recommendation rate. Since the labels are categorical, I trained a Multinomial Naive Bayes model on the data. The model I used was the SKLearn NB model presented in class by Dr. Shin. I did not make any modifications to the model itself. In splitting my data into training and testing sets, I did add a parameter to have the samples stratified by label, as the data set was very skewed. 
>
>
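> The effect of stratifying can be checked on the label counts alone. The sketch below rebuilds a label list with the class sizes reported in the pre-processing section (1520 Perfect, 612 High, 245 Low) and uses row numbers as stand-ins for the feedback texts; the `random_state` value is an arbitrary assumption.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

labels = ["Perfect"] * 1520 + ["High"] * 612 + ["Low"] * 245
rows = list(range(len(labels)))  # stand-ins for the feedback texts

# stratify=labels keeps each class's share the same in both splits.
_, _, _, y_test = train_test_split(
    rows, labels, test_size=0.2, stratify=labels, random_state=0)

test_counts = Counter(y_test)
print(test_counts)
```

> The resulting test set has 304 Perfect, 123 High, and 49 Low rows, which matches the support column of the classification reports above.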

---


> **VI. Analysis**
>
> The purpose of this project was to identify the highest-weighted features for the Low recommendation rate data. Hypothetically, this information can be used to inform actions taken to improve the Study Edge product. To identify the most important features, I used the log probability of features provided by the SKLearn NB classifier. Once I have the list of features, all I need to do is identify the topics that seem to be causing low recommendation rates and pass them along to the company to act on. 
>
> For this portion of the code, my limited knowledge of Python was making it difficult to print out the information I needed. I was grateful to find a method on Stack Overflow that magically called the data in the right way. The original source of the code I used can be found here: https://stackoverflow.com/questions/29867367/sklearn-multinomial-nb-most-informative-features
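>
> For reference, the log probabilities that scikit-learn's `MultinomialNB` reports are smoothed relative frequencies computed within each class. The formula below restates the library's documented estimate; the symbols $N_{c,w}$, $N_c$, and $|V|$ are notation chosen here for illustration.

$$
\log \hat{P}(w \mid c) = \log \frac{N_{c,w} + \alpha}{N_c + \alpha\,|V|}
$$

> Here $N_{c,w}$ is the number of times n-gram $w$ occurs in the training feedback of class $c$, $N_c = \sum_{w} N_{c,w}$ is the total n-gram count for that class, $|V|$ is the number of features, and $\alpha$ is the smoothing parameter (1 by default). Two consequences follow. N-grams with the same count in a class receive identical values, and an n-gram that is common in every class can rank near the top of every list, so the rankings describe what each group writes about most rather than what separates the groups.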



#### 3. Analysis Demonstration 

##### 3.1. Dependencies 

In [None]:
# Import the libraries needed for the analysis 
import pandas as pd 
import numpy as np 
import nltk 
# Tokenizer, lemmatizer, and stopword resources used during pre-processing
nltk.download(['punkt', 'wordnet', 'omw-1.4', 'stopwords'])
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

#############################################################

##### 3.2. Code

In [None]:
#### Your code for the analysis will be provided here 

#import data
filename = 'SE_feedback.xlsx'
data = pd.read_excel(filename)

data = data.drop(['Unnamed: 5'], axis = 1)
data = data.drop(['n = 2377'], axis = 1)
data = data.replace(np.nan, '')

#create new column of combined data
data['feedback'] = data['Website/App Improvements'] + ' ' + data['Non-App Improvements']

#preprocess data using methods from class
#and then apply them to a duplicate of feedback column so can restart if anything is weird.
def cleaning(self):  #clean whole comment first

    string = self
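    string  = string.lower() # step 1. lowercase

    # NOTE: str.strip only trims characters from the two ends of the string,
    # so punctuation inside a comment survives; rem_punc() below drops it after tokenization.
    punc = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' 
    string  = string.strip(punc) 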

    string  = string.replace("can't", 'cannot') 
    string = string.replace("n't", ' not')
    string  = string.replace("'ll", ' will')
    string  = string.replace("'m", ' am')
    string = string.replace("he's", "he is")
    string = string.replace("it's", 'it is')
    #did not remove digits bc communicate relevant information in some comments
    return string 

def tokenize(self):  #word tokenize second
    import nltk
    return nltk.word_tokenize(self)

def stopwords(self):  #stop words 3rd, to minimize num words through lemmatizer
    from nltk.corpus import stopwords

    # Keep the default stopwords that carry meaning in feedback (e.g. "more practice", "not helpful").
    stop_words = set(stopwords.words('english')) - {'more', 'less', 'should', 'not', 'most'}
    # The text was lowercased in cleaning(), so a direct membership test is enough.
    return [w for w in self if w not in stop_words]

def lemmatize(self):  #lemmatize fourth bc requires tokens
    import nltk
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in self]

#Applying preprocess methods here
data['p_feedback'] = data['feedback'].apply(cleaning) #processed feedback is copy of feedback initially
data['p_feedback'] = data['p_feedback'].apply(tokenize) 
data['p_feedback'] = data['p_feedback'].apply(stopwords)
data['p_feedback'] = data['p_feedback'].apply(lemmatize)


#text still has punctuation because cleaning() only strips the ends,
#so drop every token that is a single punctuation character
def rem_punc(self):
    punc = set('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
    filtered_feedback = []
    for w in self:
        if w not in punc:
            filtered_feedback.append(w)
    return filtered_feedback

data['p_feedback'] = data['p_feedback'].apply(rem_punc)

#add labels: 10 -> 'Perfect', 8 or 9 -> 'High', everything else -> 'Low'
def label_for(rating):
    if rating == 10:
        return 'Perfect'
    elif rating in (8, 9):
        return 'High'
    return 'Low'

#assigning the whole column at once avoids pandas chained-assignment warnings
data['labels'] = data['Recommendation'].apply(label_for)

#rejoin to one string
def rejoin(self):
    return ' '.join(self)

data['p_feedback'] = data['p_feedback'].apply(rejoin)

#n-gram bow to vectorize 
bow = CountVectorizer(ngram_range=(3,4))
X = bow.fit_transform(data['p_feedback'])

y = data['labels']

#Let's randomly shuffle and use 80% of the data as training and rest of it as testing 
#also tell it to stratify so each split keeps the same label proportions
#no random_state is set, so the split (and the scores below) vary slightly between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

clf_nb = MultinomialNB() # this is your new classifier 
clf_nb.fit(X_train, y_train) #let's fit the model 
y_hat = clf_nb.predict(X_test) #predict y_hat 

#precision, f-measure, recall
sklearn_score_test= clf_nb.score(X_test,y_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y_test, y_hat))

#results of how test data was categorized
print('The testing labels were:')
print(list(y_test))
print('The predicted labels were:')
print(list(y_hat))

#n-gram probabilities
print(list(bow.get_feature_names_out()))

print(list(clf_nb.classes_))

#The method below was found on Stack Overflow at https://stackoverflow.com/questions/29867367/sklearn-multinomial-nb-most-informative-features
#Original method was only able to access one group, so I added class_num parameter to be able to access coefficients for each group of data
#Magically this all worked and I dare not try to recreate it :)
#The original read clf.coef_, which is deprecated for naive Bayes and missing from recent scikit-learn releases;
#feature_log_prob_ holds the same per-class log probabilities
#This shows the n lowest (left) and n highest (right) weighted features side by side
def show_most_informative_features(vectorizer, clf, class_num, n=100):
  feature_names = vectorizer.get_feature_names_out()
  coefs_with_fns = sorted(zip(clf.feature_log_prob_[class_num], feature_names))
  top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
  for (coef_1, fn_1), (coef_2, fn_2) in top:
      print('\t%.4f\t%-15s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))

#class_num follows clf_nb.classes_, which is sorted alphabetically: High, Low, Perfect
print('Top 100 features for High recommendation rate')
show_most_informative_features(bow, clf_nb, 0)
print('Top 100 features for Low recommendation rate')
show_most_informative_features(bow, clf_nb, 1)
print('Top 100 features for Perfect recommendation rate')
show_most_informative_features(bow, clf_nb, 2)
####################################################

#### 4. Results 

> Model: n-gram range (3,4)
>
>**Classification Report**
>
>Sklearn's score on testing data : 0.41596638655462187
>
>Classification report for testing data : 
>
>                  precision    recall  f1-score   support
>
>            High       0.25      0.34      0.29       123
>             Low       0.17      0.35      0.23        49
>         Perfect       0.67      0.46      0.55       304
>
>        accuracy                           0.42       476
>
> There is nothing stellar about the efficacy of this model, but I am okay with that and expected to see results like this. The student feedback is very similar regardless of the recommendation rate. However, the students with low recommendation rates are more likely to be customers outside of Study Edge's main market (i.e., students not in Greek life), and thus it is very important to address the issues most important to this group if the company wants to grow. 
> 
>
> **Top 10 features for Low recommendation rate**
>
>        -8.8117	more practice problem
>        -8.8117	end packet question
>        -9.0348	more study hour
>        -9.3225	worksheet homework question
>        -9.3225	test phenomenal integration quiz
>        -9.3225	streaming buggy material integration
>        -9.3225	reserve seat review
>        -9.3225	question practice problem
>        -9.3225	put another website
>        -9.3225	problem could look more
>
> **Top 10 features for High recommendation rate**
>
>         -7.7661	more practice problem
>         -8.2769	more study hour
>         -8.8647	give more token
>         -9.0878	weekly exam review
>         -9.0878	study hour not 
>         -9.0878	short concept video
>         -9.0878	live review schedule
>         -9.3755	would nice practice question
>         -9.3755	would nice practice
>         -9.3755	worth 12 token need
>
> **Top 10 features for Perfect recommendation rate**
> 
>         -7.1764	more practice problem
>         -7.6583	more study hour
>         -8.7569	study edge great
>         -8.7569	not think anything
>         -8.9110	watch video without
>         -8.9110	video le token 
>         -8.9110	token per month
>         -8.9110	exam review video
>         -8.9110	end packet problem
>         -9.0933	think study edge
>
>
> These results show that the top features in the Low recommendation rate set of feedback are largely related to practice problems. I did not expect feedback related to practice problems to be so closely related to a low recommendation rate of the Study Edge product. I originally expected that review schedules and the video access/token payment system would be the most important features. While this is not the case for the Low data set, these topics do appear in the top features for the High and Perfect recommendation groups. 
> 

#### 5. Conclusion and Discussion

> The results of this analysis indicate the largest factor contributing to customers giving a low recommendation rate for Study Edge is dissatisfaction with practice problems. Practice problems are written by tutors, remote tutors, or class assistants. Most practice problems also have video solutions for students to view. If more practice problems need to be created, or a large number of practice problems need to be revised, it is reasonable to expect that additional labor is needed to complete that project. So, I might suggest to Study Edge that they use the additional funding they are receiving to hire additional staff to address this problem. 
>
> Over the course of this project, I have noticed a few issues that I would want to revise before continuing. Firstly, almost half of the survey data I used is from 2013-2015, and I know that Study Edge and its products have changed significantly since then. When I look at the top 100 features for each data set, I can see these irrelevant features being highly weighted. In the future, I would choose to work only with the most recent data, from 2019 onwards. 
> 
> Secondly, now that I have spent some time working with this model and identifying top features, I'm not sure that it is the most helpful analysis of the data. The top features are sometimes being pulled from a single comment, and that's not what I intended. I was hoping top features would be more repetitive in the data. I wonder if it would be better to look for larger clusters in the text as a whole, since there was so much overlap in the feedback between groups anyway. 
>
>
>
>
