# Introduction to Amazon Customer Reviews Project

# 1. The project
The current competition aims to predict customer sentiment regarding [Baby products purchased on Amazon.com](http://jmcauley.ucsd.edu/data/amazon/) on the basis of written reviews. Before starting the data analysis, we first answer three fundamental questions of any ML project:

**1. Where do the data come from? (To which population will results generalize?)**
* The data were compiled by Emily Fox, Ph.D., Carlos Guestrin, Ph.D., and Julian McAuley from Amazon reviews of baby products written between May 1996 and July 2014. Because the reviews were collected from an English-language site based in the USA, we can only generalize our results to reviews on American websites that sell baby products.

**2. What are candidate machine learning methods? (models? features?)**
* As the outcome is binary (satisfied or not satisfied), this is a binary classification problem. The candidates are:
<div style="display:flex; flex-direction: row; flex-wrap: nowrap; align-items: stretch; width:100%;">
    <div style="display:inline-block;width:45%;">
        <ul>
<h5> Model Candidates: </h5>          
<li> Ridge Regression
<li> Lasso Regression
<li> Partial Least Squares
<li> Principal Component Regression
<li> Smoothing
<li> KNN
        </ul>
    </div>
    <div style="display:inline-block;width:45%;">
        <ul>
<h5> Feature Candidates: </h5>  
<li> Sentiment Analysis using AFINN
<li> Emotion Words from NRC
<li> Lexical Diversity (amount of unique words)
<li> Amount of Articles and Fillerwords
<li> Average Word/ Sentence Length
<li> Stopword Proportions
<li> Adjective/ Linking/ Unique Words Proportions
<li> Total Word Count
<li> Average Number of Words per Sentence
<li> Bigram Analysis/ n_grams
<li> TF_IDF: relative importance of words
<li> Word Correlations
<li> Positive words related to reviews and ratings

For the final submission, we removed the following features: word correlations and filler word counts, as they make little sense as predictors for written and relatively short texts.
        </ul>
    </div>
</div>

**3. What is the Bayes' error bound?**
* Human performance on this task can be assumed to be well above chance. Predicting the exact rating (1 to 5 stars) would be harder, but for binary classification into satisfied and not satisfied we estimate human accuracy to be about 80%. Other literature on similar problems also suggests an accuracy of about 85% (Mtetwa, 2018; Deshpande, 2021).


----------

# 2. Read Data

This step loads the necessary packages and reads in the data set.

In [None]:
# Importing packages
suppressMessages(library(tidyverse))
suppressMessages(library(tidytext))
suppressMessages(library(caret))
suppressMessages(library(glmnet))
suppressMessages(library(pls))
suppressMessages(library(pROC))

# Data attached to this notebook
list.files(path = "../input")

In [None]:
dir("../input", recursive=TRUE)

In [None]:
# Find the right file path
csv_filepath = dir("..", pattern="amazon_baby.csv", recursive=TRUE, full.names = TRUE)

# Read in the csv file
amazon = read_csv(csv_filepath) %>%
    rownames_to_column('id') 

In [None]:
# Check how many train and test data are included:
trainidx = !is.na(amazon$rating)
table(trainidx)

From the above, there are 153,531 training samples and 30,000 test samples.

----------

# 3. Preprocessing

Prepend the product name to the review text: paste the `name` string and `review` string into a single string using the `unite()` function:

In [None]:
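# Paste name and review into a single string (the separator is given by `sep`).
# The new string replaces the original review.
# Also derive the binary target: satisfied = 1 if rating > 3, else 0 (NA for test rows).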
amazon = amazon %>% 
    unite(review, name, review, sep = " — ", remove = FALSE) %>%
     mutate(satisfied = ifelse(rating > 3, 1, 0))

The data frame contains both the train and test data. The test data are the reviews for which the rating is missing and need to be provided with a prediction. 

## 3.1 Tokenization

We use tidytext to break up the text into separate tokens and count the number of occurrences per review. To keep track of the review to which each token belongs, we added the row names as `id` above, which is simply the row number. Tokens can be single words, pairs of words called bigrams, or longer n-grams.

In [None]:
reviews = amazon %>% 

   # tokenize reviews at word level
   unnest_tokens(token, review) %>%

   # count tokens within reviews as 'n'
   # (keep id, name, and satisfied in the result)
   count(id, name, token, satisfied)

head(reviews)
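
To make the idea of word and bigram tokens concrete, here is a minimal base-R sketch on a made-up review. It is a simplification of what `unnest_tokens()` does (the splitting rule and the example sentence are assumptions for illustration only):

```r
# Toy review; lower-case it and split on anything that is not a letter or apostrophe
review <- "Great stroller. Not great wheels, but great value."
words <- strsplit(tolower(review), "[^a-z']+")[[1]]
words <- words[words != ""]

# Word-level token counts (the analogue of the `n` column in `reviews`)
word_counts <- table(words)

# Bigrams: each word pasted together with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_counts <- table(bigrams)

word_counts[["great"]]
bigram_counts[["not great"]]
```

Note how the bigram "not great" keeps the negation that is lost when "not" and "great" are counted separately.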


# 4. Feature engineering

## Features: Competition 1
### Word Features
Here we calculate the total number of words per review, as well as the average number of words per sentence in each review, as there might be differences in how much people write in their reviews depending on whether or not they rate the product highly. For example, some people might give a more in-depth explanation of what they did not like about the product, which in turn might also be related to how many words are used per sentence.

In [None]:
## Total Word Count
# `reviews` has one row per distinct token per review, so we sum the counts `n`
total_count <- reviews %>% 
    group_by(id) %>%
    summarize(word_count = sum(n), .groups = "drop") %>%
    replace(is.na(.), 0)

## Average number of words per sentence
sentence_tokens <- amazon %>%
    unnest_tokens(sentence, review, token = 'sentences') #tokenization into sentences

n_words_in_sentence <- sentence_tokens %>%
    mutate(sentence_indice = c(1:nrow(sentence_tokens))) %>% #adding indices to keep track of sentences
    unnest_tokens(word, sentence, token = "words") %>% #break up sentences into words
    mutate(tally = 1) %>% #add tallies for every word
    group_by(id, sentence_indice) %>% 
    summarise(words_per_sentence = sum(tally), .groups = "drop") #get the words per sentence

n_words_in_sentence <- n_words_in_sentence %>%
    group_by(id) %>%
    summarise(mean_words_sentence = mean(words_per_sentence)) %>%
    replace(is.na(.), 0)

We make the features into a format that is compatible with the design matrix later on:

In [None]:
total_count = total_count %>%
    mutate(token = "word_count", .before = word_count) %>%
    rename("value" = word_count)
n_words_in_sentence = n_words_in_sentence %>%
    mutate(token = "n_words_in_sentence", .before = mean_words_sentence) %>%
    rename("value" = mean_words_sentence)

### Features: Sentiment Analysis using AFINN

We chose to do a sentiment analysis using the AFINN dictionary. In this dictionary, each word is assigned a score between -5 and 5. Negative words are scored < 0, whereas positive words are scored > 0. We expect total scores < 0 for unsatisfied reviews and > 0 for satisfied reviews.

In [None]:
#import afinn dictionary
download.file("http://www2.imm.dtu.dk/pubdb/edoc/imm6010.zip","afinn.zip")
unzip("afinn.zip")
# AFINN-111.txt has no header row, so it must be read with header = FALSE
afinn = read.delim("AFINN/AFINN-111.txt", header = FALSE, sep="\t", col.names = c("word","score"), stringsAsFactors = FALSE)

In [None]:
#map words from afinn dict. with words in the reviews
sent <- reviews %>% 
    inner_join(afinn,by = c("token" = "word")) 

In [None]:
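# Sentiment score per review: sum over matched words of (count * AFINN score).
# The token label is added after summarise(), which would otherwise drop it.
sent = sent %>%
    group_by(id) %>%
    summarise(value = sum(n * score), .groups = "drop") %>%
    mutate(token = "sent_score", .before = value)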
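
As a worked illustration of the score computed above, the following base-R sketch applies the same "count times score, summed per review" rule to two made-up reviews. The mini-lexicon and the token counts are hypothetical and only mimic the AFINN layout:

```r
# Hypothetical mini-lexicon in the AFINN layout (word, score); values are illustrative
toy_lexicon <- data.frame(word  = c("love", "great", "broke", "terrible"),
                          score = c(3, 3, -1, -3))

# Token counts for two toy reviews, in the same shape as `reviews`
toy_counts <- data.frame(id    = c("1", "1", "2", "2", "2"),
                         token = c("love", "great", "broke", "terrible", "stroller"),
                         n     = c(1, 2, 1, 2, 1))

# Keep only tokens that appear in the lexicon (an inner join)
matched <- merge(toy_counts, toy_lexicon, by.x = "token", by.y = "word")

# Review-level score: sum over tokens of count * score
sent_score <- tapply(matched$n * matched$score, matched$id, sum)
sent_score
```

Review 1 scores 1·3 + 2·3 = 9 and review 2 scores 1·(-1) + 2·(-3) = -7; the word "stroller" is not in the lexicon and does not contribute.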

### Features: Lexical diversity

Lexical diversity can be defined as the number of unique words in a passage of text. We expect positive reviews to be better written and more deliberate in their word choice, so we expect lexical diversity to be higher for these reviews.

In [None]:
#we calculate the number of unique words for each review
unique_words <- reviews %>%
    group_by(id) %>%
    summarise(lex_diversity = n_distinct(token)) 

In [None]:
#rename col
unique_words = unique_words %>%
    mutate(token = "lex_diversity", .before = lex_diversity) %>%
    rename("value" = lex_diversity)

### Features: Emotion words from NRC

We included emotion words because the use of emotion is likely to be related to the review rating. Scores on negative emotions such as anger, sadness and disgust are likely to be higher in unsatisfied reviews than in satisfied reviews.

In [None]:
# Load emotion words
load_nrc = function() {
    # Download the lexicon only if it is not already present
    if (!file.exists('nrc.txt')) {
        download.file("https://www.dropbox.com/s/yo5o476zk8j5ujg/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt?dl=1","nrc.txt")
    }
    nrc = read.table("nrc.txt", 
                     col.names = c('word','sentiment','applies'), 
                     stringsAsFactors = FALSE)
    # Keep only the word-emotion pairs that apply
    nrc %>% 
        filter(applies == 1) %>% 
        select(-applies)
}

nrc = load_nrc()

In [None]:
# Keep only the tokens that appear in the NRC lexicon (one row per token-emotion pair)
emotion <- reviews %>%
    inner_join(nrc,by = c("token" = "word"))

# Emotion score: number of matched distinct tokens per review and emotion
sentiment_scores <- emotion %>%
    count(`id`, sentiment) %>%
    rename("token" = sentiment, "value" = n)

### Features: Use of Articles

The use of articles (the, a, an) may indicate how descriptive the reviewer is, which in turn may relate to how extreme the rating is.

In [None]:
# Total number of words per review (sum of the token counts `n`)
word_count <- reviews %>%
    count(id, wt = n, name = "word_count")

In [None]:
# Number of articles each reviewer uses (weighted by how often each occurs)
articles <- reviews %>%
    filter(token %in% c("the", "a", "an")) %>%
    count(id, wt = n, name = "articles_count")

In [None]:
# Calculate the amount of articles per word
count_words <- word_count %>% 
    full_join(articles, by = "id") %>%
    mutate(articles_score = articles_count / word_count) %>%
    replace(is.na(.), 0) %>%
    select(id, articles_score) %>%
    mutate(token = "articles_score", .before = articles_score) %>%
    rename("value" = articles_score)

### Features: Average Word/ Sentence Length
We calculated the average word length and the average sentence length in each review. These features can be useful because individuals with stronger opinions may write longer sentences and use longer words. A more elaborate description of the product may thus indicate an extremely high (5) or extremely low (1) rating.

Average word length was measured as the average number of characters per token, and average sentence length as the average number of tokens per sentence.

In [None]:
## Average word length 
word_length <- reviews %>% 
    group_by(id, satisfied) %>% 
    summarise(avg_word_len = nchar(token) %>%
              mean(), .groups = "drop") %>%
    select(id, avg_word_len) %>%
    mutate(token = "avg_word_len", .before = avg_word_len) %>%
    rename("value" = avg_word_len)

## Average sentence length
sentence_length <- amazon %>%
    unnest_tokens(sentences, review, token = "sentences") %>%
    mutate(sentence_num = row_number()) %>%
    unnest_tokens(words, sentences, token = "words") %>% 
    group_by(id, satisfied) %>%
    count(sentence_num) %>%
    summarise(avg_sen_len = mean(n), .groups = "drop") %>%
    select(id, avg_sen_len) %>%
    mutate(token = "avg_sen_len", .before = avg_sen_len) %>%
    rename("value" = avg_sen_len)

### Features: Proportions of the words

We calculated the proportions of several sets of words in the reviews. Raw counts of each word set seemed like a good indicator, but they are not comparable across reviews of different lengths. Thus, we decided to use the proportions of stopwords, adjectives, linking words, and unique words as features.
* Stopwords are usually considered uninformative and are often removed before analysis. However, we wanted to use the proportion of stopwords because it may contain information that is easily overlooked. More elaborate wording (and thus fewer stopwords) may mean that individuals are more eager to express their opinions and thus give either a very high or a very low rating.
* Adjectives also convey information because they add elaboration, showing how descriptive the review is.
* Unique words may say more about what kind of person is writing the review than about the rating the review receives. However, we still added this feature because it showed some relation with the sentiment analysis.

In [None]:
# Stopword Proportions
sw_count = reviews %>% 
    # Add the total number of tokens per review as 'N' (sum of the counts `n`)
    add_count(id, wt = n, name = "N") %>% 

    # Retain only tokens that are stopwords
    inner_join(get_stopwords(), by = c(token='word')) %>% 

    # Compute the total number of stopwords per review
    group_by(id, satisfied, N) %>% 
    summarise(n_stopwords = sum(n), .groups = "drop") %>%
    mutate(sw_prop = n_stopwords/N) %>%
    select(id, sw_prop) %>%
    mutate(token = "sw_prop", .before = sw_prop) %>%
    rename("value" = sw_prop)

# adjective list [List was acquired from this github repo: https://gist.github.com/hugsy/8910dc78d208e40de42deb29e62df913]
adjective_list <- read.delim(url("https://gist.github.com/hugsy/8910dc78d208e40de42deb29e62df913/raw/eec99c5597a73f6a9240cab26965a8609fa0f6ea/english-adjectives.txt"),
                             header = FALSE)
names(adjective_list) <- "word"

adj_count = reviews %>% 
    # Total number of tokens per review as 'N' (sum of the counts `n`)
    add_count(id, wt = n, name = "N") %>% 
    semi_join(adjective_list, by = c(token = "word")) %>%
    group_by(id, satisfied, N) %>% 
    summarise(n_adjectives = sum(n), .groups = "drop") %>%
    mutate(adj_prop = n_adjectives/N) %>%
    select(id, adj_prop) %>%
    mutate(token = "adj_prop", .before = adj_prop) %>%
    rename("value" = adj_prop)

# Unique word Proportions
unique_word_prop <- reviews %>%
    add_count(id, name = "N") %>% 
    anti_join(get_stopwords(), by = c(token='word')) %>% 
     group_by(id, satisfied, N) %>% 
    summarise(unique_count = length(unique(token)), .groups = "drop") %>%
    mutate(unique_prop = unique_count/ N) %>%
    select(id, unique_prop) %>%
    mutate(token = "unique_prop", .before = unique_prop) %>%
    rename("value" = unique_prop)

## Other Features

### Features: Positive Words

We also found a list of positive words commonly used in reviews (Songpan, 2017) and decided to add it as a predictor.
Because these positive words are associated with higher ratings, we expected the feature to improve predictive performance.

In [None]:
# Positive words to search for in the reviews.
# ("very good " from the original list is omitted: it is two words with a
# trailing space and can never match a single-word token.)
pos_words <- c("convenient", "good", "near", "comfortable", "cheap", "worth",
               "safe", "smile", "delicious", "beautiful", "luxurious")

# Positive-word score: occurrences of positive words divided by total words
count_pos <- reviews %>%
    group_by(id) %>%
    summarise(value = sum(n[token %in% pos_words]) / sum(n), .groups = "drop") %>%
    mutate(token = "pos_score", .before = value)

### Features: Bigrams
Bigrams are consecutive pairs of words (showing how often word X is followed by word Y), which lets us capture relationships between words. Because bigrams include combinations with words like "not" and "never", they can capture negations that single words miss, which may help to predict the rating more accurately.

In [None]:
# bigrams
bigrams = amazon %>% 
  unnest_tokens(bigram, review, "ngrams", n = 2) %>% 
  count(id, name, rating, bigram)

# Bigram tf_idf
tf_idf_bigrams = bigrams %>%
   bind_tf_idf(bigram, id, n) %>%
    # remove near-zero-variance bigrams (those occurring in fewer than 0.5% of the reviews)
   filter(idf <= -log(0.5/100)) %>%
    # replace any missing values
   replace_na(list(tf=0, idf=Inf, tf_idf=0)) %>%
   rename(token = bigram) %>%
   select(id, token, rating, tf_idf)

head(tf_idf_bigrams, 6)

# remove bigrams considering limitations in memory
rm(bigrams)
gc()

We select only the tf-idf values, since that is what we are interested in, and put them into a format that can be merged with all other features:

In [None]:
tf_idf_bigrams2 <- tf_idf_bigrams %>% 
    select(id, token, tf_idf) %>%
    mutate(token = token, .before = tf_idf) %>%
    rename("value" = tf_idf)

tf_idf_bigrams2$token <- paste(tf_idf_bigrams2$token, "bigrams", sep = "_") 

### Non-zero variance features

Features that have almost no variance across cases cannot provide a lot of information about the target variable. Variance across cases is the leading principle in any data context. For binary and count data as considered here the variance is determined by the average (that's a mathematical fact). Hence, for the current data we can look simply at document frequencies and do not need to compute variances. 

We will remove tokens that occur in less than 0.01% of the documents (there are ~180,000 reviews in the data set; 0.01% &times; 180,000 reviews = 18 reviews). The number 0.01% is quite arbitrary, but will remove idiosyncratic strings and misspellings that occur only in a handful of reviews. 

Since $IDF_t$, the column `idf`, which measures the surprise of a `token` $t$, is computed as 

$$IDF_t = -\log\left({\text{df}_t \over N}\right) = -\log(\text{proportion of document in which }t\text{ occurs})$$ 

we can filter the rows in `features` for which $-\log(\text{df}_t / N) \leq -\log(0.01\%)$ (i.e., the 'surprise' should be lower than $-\log(0.01/100)$).
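
The arithmetic behind this cut-off can be checked with a few lines of R. This is only a sketch of the calculation; the round figure of 180,000 reviews is the approximation used in the text, and `idf_of()` is a helper defined here for illustration:

```r
n_docs   <- 180000        # approximate number of reviews, as stated above
min_prop <- 0.01 / 100    # 0.01% expressed as a proportion

min_df     <- n_docs * min_prop   # smallest document frequency that is kept
idf_cutoff <- -log(min_prop)      # largest idf value that is kept

# Illustrative helper: idf of a token that occurs in `df` out of `N` documents
idf_of <- function(df, N) -log(df / N)

c(min_df = min_df, idf_cutoff = idf_cutoff)
idf_of(20, n_docs) <= idf_cutoff   # a token in 20 reviews is kept
idf_of(5, n_docs)  <= idf_cutoff   # a token in 5 reviews is removed
```

So the filter keeps tokens with an idf of at most about 9.21, which corresponds to tokens that occur in roughly 18 or more reviews.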


**We remove these near-zero-variance features in the code chunks that compute the tf-idf values (using a 0.5% threshold for bigrams and a 0.01% threshold for single words).**



### Features: TF-IDF

TF-IDF, short for term frequency-inverse document frequency, combines how often a word occurs within a specific document (the term frequency) with how rare that word is across all documents (the inverse document frequency). It is based on the reasoning that words that are important and characteristic for a specific document occur more often within that document than in the others.

The code for the tf-idf computation is based on the literature as well as the Kaggle notebook "Huge design matrices - Spooky author". When computing the tf-idf values, we decided not to remove stopwords since they turned out to be highly significant predictors of rating in earlier analyses.
First, we compute the tf-idf values for all words across all "documents" (reviews). The resulting design matrix can be used to fit a lasso regression later on.
Because the full data frame is very large, we keep the tf-idf features in long format (one row per review and token) rather than as wide columns, and only convert everything to a sparse matrix at the modelling stage.
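
The following base-R sketch computes tf-idf by hand for three made-up reviews, using the same definitions as `bind_tf_idf()` (term frequency as a within-review proportion and idf as the natural log of documents over document frequency). The toy counts are assumptions for illustration:

```r
# Toy token counts for three reviews, in the same long shape as `reviews`
toy <- data.frame(id    = c("1", "1", "2", "2", "3"),
                  token = c("great", "stroller", "great", "broke", "great"),
                  n     = c(2, 1, 1, 1, 1))
n_docs <- length(unique(toy$id))

# tf: share of a review's words accounted for by each token
toy$tf <- toy$n / ave(toy$n, toy$id, FUN = sum)

# df: number of reviews containing the token (each row is one review-token pair)
doc_freq <- table(toy$token)
toy$idf <- log(n_docs / as.numeric(doc_freq[toy$token]))

toy$tf_idf <- toy$tf * toy$idf
toy
```

The word "great" occurs in every review, so its idf and tf-idf are 0, whereas "broke" occurs in only one review and receives the highest weight.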

In [None]:
tf_idf_df = reviews %>% 
    # Compute TF-IDF for each word per review; `wt = n` carries over the
    # existing token counts instead of counting rows
    count(name, id, token, wt = n, name = "n") %>% 
    bind_tf_idf(token, id, n) %>% 

    # Replace any missing values
    replace_na(list(tf=0, idf=Inf, tf_idf=0))

# Sort by tf-idf and drop near-zero-variance predictors
tf_idf_df = tf_idf_df %>%
    arrange(desc(tf_idf)) %>%
    # Keep tokens that occur in at least 0.01% of the reviews
    filter(idf <= -log(0.01/100))

We select only the tf-idf values, since that is what we are interested in, and put them into a format that can be merged with all other features:

In [None]:
tf_idf_df1 <- tf_idf_df %>% 
    select(id, token, tf_idf) %>%
    mutate(token = token, .before = tf_idf) %>%
    rename("value" = tf_idf)

We add tf_idf to the end of all token words to differentiate between other features:

In [None]:
tf_idf_df1$token <- paste(tf_idf_df1$token, "tf_idf", sep = "_") 

## Merging all features into one dataframe

For the next step, we have merged all the features into one dataframe!


In [None]:
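# `unique_words` (lexical diversity) is included as well, since it is listed among the features
features_df = bind_rows(tf_idf_df1, tf_idf_bigrams2, total_count, n_words_in_sentence,
                        sent, unique_words, sentiment_scores, count_words,
                        word_length, sentence_length, sw_count, adj_count,
                        unique_word_prop, count_pos)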

As the features take up a lot of memory, we remove the large intermediate objects that are no longer needed.

In [None]:
rm(tf_idf_df, tf_idf_df1, tf_idf_bigrams2)


# 5. Models

## Not relying on manual feature selection

In the Personality competition we computed features by utilizing word lists that in previous research were found to be predictive of sentiment. This requires substantial input from experts on the subject. If such knowledge is not (yet) available a process of trial and error can be used. But with many thousands of features automation of this process is essential. 


In addition to forward and/or backward selection, automated methods that try to balance flexibility and predictive performance are:

1. Lasso and Ridge regression
2. Principal Components and Partial Least Squares regression
3. Smoothing 
4. Regression and Classification trees (CART)
5. Random Forests
6. Support Vector Machines

Methods (1) and (2) on this list are able to take many features while automatically reducing redundant flexibility to any desired level. Multicollinearity, the epitome of redundancy, is also automatically taken care of by these methods.

Number (3) on the list, smoothing, grants more flexibility by allowing for some non-linearity in the relations between features and the target variable, without the need to manually specify a specific mathematical form (as is necessary in polynomial regression).

Methods (4), (5), and (6) are not only able to remove redundant features, but also can automatically recognize interactions between  features.

Hence, all of these methods remove the necessity of finding the best features by hand. 

All of these methods are associated with a small set of 1 to 3 (or 4 in some cases) parameters that control the flexibility of the model in a more or less continuous way, much like the $k$ parameter in k-nearest neighbors. Like the $k$ parameter in k-NN, these parameters can and need to be adjusted (*'tuned'*) for optimal predictive performance. Tuning is best done on a validation set (a subset of the training data), or using cross-validation, depending on the size of the data set.
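
As a minimal sketch of what tuning by cross-validation looks like, the base-R code below picks $k$ for a hand-written k-NN classifier using 5-fold cross-validation. The simulated data, the candidate grid, and the helper `knn_predict()` are all assumptions made for illustration and are not part of the analysis in this notebook:

```r
set.seed(1)

# Simulated one-dimensional data: class 1 tends to have larger x (purely illustrative)
n <- 200
y <- rep(c(0, 1), each = n / 2)
x <- rnorm(n, mean = 2 * y)

# Predict the majority class among the k nearest training points
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]
    as.numeric(mean(y_train[nearest]) > 0.5)
  })
}

# Assign every observation to one of 5 folds at random
folds  <- sample(rep(1:5, length.out = n))
k_grid <- c(1, 5, 15, 45)   # odd values avoid tied votes

# Cross-validated accuracy for each candidate k
cv_acc <- sapply(k_grid, function(k) {
  mean(sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[!held_out], y[!held_out], x[held_out], k)
    mean(pred == y[held_out])
  }))
})
names(cv_acc) <- k_grid

cv_acc
best_k <- k_grid[which.max(cv_acc)]
best_k
```

The same pattern (fit on all folds but one, score on the held-out fold, average, pick the best setting) is what `cv.glmnet()` and `caret::train()` automate for their respective tuning parameters.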

## 5.1 Model fitting

Not all algorithms can deal with sparse matrices. For instance `lm()` can't. The package `glmnet`, which is extensively discussed in chapter 6 of ISLR, has a function with the same name `glmnet()` which can handle sparse matrices, and also allows you to reduce the model's flexibility by means of the Lasso penalty or the ridge regression penalty. Furthermore, like the standard `glm()` function, it can also handle a variety of dependent variable families, including gaussian (for linear regression), binomial (for logistic regression), multinomial (for multinomial logistic regression), Poisson (for contingency tables and counts), and a few others. It is also quite capable of dealing efficiently with the many features we have here.

> <span style=color:brown>The aim of this competition is to predict the probability that a customer is ***satisfied***. This is deemed to be the case if `rating > 3`.  Hence, you will need as a dependent variable `y` a factor that specifies whether this is the case. </span>

The performance of your submission will be evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. See chapter 4 in the ISLR book. See also the help file for how `cv.glmnet` can work with this measure.
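
To clarify what the AUC measures, here is a small base-R sketch that computes it from ranks: the AUC equals the probability that a randomly chosen satisfied review receives a higher predicted probability than a randomly chosen unsatisfied one. The function name `auc_rank()` and the toy values are assumptions for illustration; in the notebook itself the AUC comes from `cv.glmnet()` and `pROC`:

```r
auc_rank <- function(y, p) {
  # y: 0/1 labels; p: predicted probabilities (higher = more likely to be 1)
  r     <- rank(p)            # average ranks handle ties
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  # Rank-sum (Mann-Whitney) formulation of the AUC
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

y <- c(0, 0, 1, 1)
p <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(y, p)   # 0.75: three of the four positive-negative pairs are ordered correctly
```

Because the AUC depends only on the ordering of the predictions, it should be computed on predicted probabilities rather than on thresholded class labels.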

As said, `glmnet()` allows you to tune the flexibility of the model by means of _regularizing_ the regression coefficients. The type of regularization (i.e., the Lasso or ridge) that is used is controlled by the `alpha` parameter. Refer to the book for an explanation. The amount of regularization is specified by means of the `lambda` parameter. Read the warning in the `help(glmnet)` documentation about changing this parameter. To tune this parameter look at the `cv.glmnet()` function.


Here we turn the dataframe containing our features into a sparse data matrix.

In [None]:
#into sparse matrix for lasso regression later
sparse_matrix = features_df %>% 
    cast_sparse(row = id, column = token, value = value) %>% 
    # Remove rows that do not belong to cases
    .[!is.na(rownames(.)),]

sparse_matrix[1:8,20:25]
cat("rows, columns: ", dim(sparse_matrix))

We remove `features_df` to save memory:

In [None]:
rm(features_df)

Then we split the data into test and training data.
For now, we only use a sample of 5000 for training the model as the full training data took too long to load.

In [None]:
train_ids = amazon %>%
    filter(!is.na(satisfied)) %>%
    select(id) %>%
    pull()

test_ids = amazon %>%
    select(id) %>%
    filter(!id %in% train_ids) %>%
    pull()

### Lasso Regression
Lasso is an alternative to ridge regression that overcomes the disadvantage of including all p predictors in the final model. Lasso minimizes the model's loss plus a penalty on the sum of the absolute values of the coefficients, which shrinks some coefficients to exactly zero. It thereby performs variable selection and makes the model easier to interpret.
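
For reference, the penalized objective that `glmnet` minimizes for `family = "binomial"` can be written as follows, where $n$ is the number of training reviews, $x_i$ the feature vector of review $i$, and $y_i \in \{0, 1\}$ the `satisfied` label (the notation is ours and follows the `glmnet` documentation):

$$\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i\,(\beta_0 + x_i^\top\beta) - \log\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] \;+\; \lambda\Big[\,\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\,\Big]$$

Setting $\alpha = 1$ gives the lasso penalty $\lambda\lVert\beta\rVert_1$ and $\alpha = 0$ gives the ridge penalty $\tfrac{\lambda}{2}\lVert\beta\rVert_2^2$; larger values of $\lambda$ mean stronger shrinkage of the coefficients.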

First, we create a training data set and then a vector holding all outcome descriptions for that training data:

In [None]:
training_data <- sparse_matrix[rownames(sparse_matrix) %in% train_ids, ] 

In [None]:
y =
    data.frame(id=rownames(training_data)) %>% 
    inner_join(amazon, by = "id") %>% 

    # Extract 'satisfied' as a factor
    pull(satisfied) %>%
    as.factor()

Then, we fit the model:

In [None]:
doMC::registerDoMC(cores = 4) 

fit_lasso <- glmnet::cv.glmnet(training_data, y, alpha = 1, family = "binomial", 
                               standardize = TRUE,
                               parallel = TRUE,
                            type.measure = "auc")

Here we plot the AUC of the model as a function of the log of lambda:

In [None]:
fit_lasso
plot(fit_lasso)

### Ridge Regression
Ridge regression can be used when the number of predictors is high (even when it exceeds the number of observations) or when there is multicollinearity between predictors.

In [None]:
fit_ridge <- glmnet::cv.glmnet(training_data, y, alpha = 0, family = "binomial", 
                               standardize = TRUE,
                               parallel = TRUE,
                            type.measure = "auc")

Here we plot the AUC of the model as a function of the log of lambda:

In [None]:
fit_ridge
plot(fit_ridge)

## Other Models:
Other candidate models that we tried to fit to the data were: K-Nearest-Neighbours, Naive Bayes, Principal Component Regression. Unfortunately, these models do not work well with a sparse matrix, so we fit them on the remaining predictors. As the tf-idf features make up the majority of the predictors, we did not expect these models to perform well due to the loss of information.

We then create a dataframe containing reviewer ids in the rows and all remaining predictors in the columns:

In [None]:
total_count2 <- total_count %>% #
    select(id, value) %>%
    rename("word_count" = value) 
n_words_in_sentence2 <- n_words_in_sentence %>%
    select(id, value) %>%
    rename("n_words_in_sentence" = value)
# Note: `values_fill` is an argument of pivot_wider(), not of right_join().
# Missing values left by the joins are replaced by 0 in a later cell.
sent2 <- sent %>%
    select(id, value) %>%
    rename("sent" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, sent)
sentiment_scores2 <- emotion %>%
    count(`id`, sentiment) %>%
    pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
    right_join(amazon, by = "id") %>%
    select(id, all_of(unique(emotion$sentiment)))
count_words2 <- count_words %>%
    select(id, value) %>%
    rename("count_words" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, count_words)
word_length2 <- word_length %>%
    select(id, value) %>%
    rename("word_length" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, word_length)
sentence_length2 <- sentence_length %>%
    select(id, value) %>%
    rename("sentence_length" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, sentence_length)
sw_count2 <- sw_count %>%
    select(id, value) %>%
    rename("sw_count" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, sw_count)
adj_count2 <- adj_count %>%
    select(id, value) %>%
    rename("adj_count" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, adj_count)
unique_word_prop2 <- unique_word_prop %>%
    select(id, value) %>%
    rename("unique_word_prop" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, unique_word_prop)
count_pos2 <- count_pos %>%
    select(id, value) %>%
    rename("count_pos" = value) %>%
    right_join(amazon, by = "id") %>%
    select(id, count_pos)

We bring it together into a new dataframe:

In [None]:
features_df2 <- total_count2 %>%
    left_join(n_words_in_sentence2, by = "id") %>%
    left_join(sent2, by = "id") %>%
    left_join(sentiment_scores2, by = "id") %>%
    left_join(count_words2, by = "id") %>%
    left_join(word_length2, by = "id") %>%
    left_join(sentence_length2, by = "id") %>%
    left_join(sw_count2, by = "id") %>%
    left_join(adj_count2, by = "id") %>%
    left_join(unique_word_prop2, by = "id") %>%
    left_join(count_pos2, by = "id") #

We replace NA values by 0 and add the outcome variable to the dataframe and remove the test data:

In [None]:
features_df2 <- features_df2 %>%
    replace(is.na(.), 0)
features_df2 <- amazon %>%
    select(id, satisfied) %>%
    left_join(features_df2, by = "id") %>%
    filter(id %in% train_ids) %>%
    select(-id)

We remove the objects that are no longer needed:

In [None]:
rm(total_count, n_words_in_sentence, sent, sentiment_scores, count_words,
   word_length, sentence_length, sw_count, adj_count, unique_word_prop, count_pos,
   total_count2, n_words_in_sentence2, sent2, sentiment_scores2, count_words2,
   word_length2, sentence_length2, sw_count2, adj_count2, unique_word_prop2, count_pos2)

Then we fit the models:

### KNN
K-nearest neighbors (KNN) is a non-parametric approach, meaning it makes no assumption about the shape of the decision boundary. One advantage of non-parametric models is that they can perform better when the decision boundary is non-linear.

In [None]:
## KNN
set.seed(1)
trcntr = caret::trainControl('cv', number = 10) 

fit_knn <- caret::train(factor(satisfied) ~ ., data = features_df2, method="knn", trControl = trcntr, preProcess = 'scale')
fit_knn

# the plot shows the cross-validated accuracy for each value of k (number of neighbours)
# plot(fit_knn)

### Partial Least Squares Regression

Partial Least Squares Regression can be used for dimension reduction.
Here we fit the model with ten-fold crossvalidation, as well as using standardized values.

In [None]:
set.seed(1)
fit_pls <- plsr(satisfied ~ ., data = features_df2,
                scale = TRUE, validation = "CV")

In [None]:
validationplot(fit_pls, val.type = "MSEP")

### Principal Component Regression
PCR is a regression technique that can be used to analyze multiple regression data suffering from multicollinearity. Instead of original features, principal components are used as predictors. PCR needs to be fit on a datamatrix.

In [None]:
set.seed(1)
pcr_fit <- pcr(satisfied ~ . , data = features_df2,
               scale = TRUE, validation = "CV")

In [None]:
validationplot(pcr_fit, val.type = "MSEP")

## 5.2 Model evaluation



Given that AUC is the performance measure used to rate the submission in this competition, we compare the models by this measure.

First, we compute the Accuracy and AUC of all models:

In [None]:
#Lasso Regression:

# Performance on the training set (in-sample; the test labels are unknown)
pred_lasso = predict(fit_lasso, training_data, s='lambda.min', type='class') %>% factor()
caret::confusionMatrix(pred_lasso, y)

# evaluation statistics
Lasso <- c(mean(pred_lasso == y), max(fit_lasso$cvm))

#Ridge Regression:

# Performance on the training set (in-sample; the test labels are unknown)
pred_ridge = predict(fit_ridge, training_data, s='lambda.min', type='class') %>% factor()
caret::confusionMatrix(pred_ridge, y)

#evaluation statistics
Ridge <- c(mean(pred_ridge == y), max(fit_ridge$cvm)) 

For PCR and PLS the number of components chosen for making the predictions was based on the validationplots.

In [None]:
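# Outcome in the row order of `features_df2` (the data PCR and PLS were fit on);
# `y` follows the row order of the sparse matrix and is not aligned with these predictions
y_dense <- features_df2$satisfied

#PCR:
pcr_pred <- as.numeric(predict(pcr_fit, ncomp = 15, type = "response"))
pcr_acc <- mean(ifelse(pcr_pred < 0.5, 0, 1) == y_dense)

# AUC is computed on the continuous predictions rather than on thresholded classes
pcr_auc <- auc(y_dense, pcr_pred)

PCR <- c(pcr_acc, pcr_auc)

#PLS
pls_pred <- as.numeric(predict(fit_pls, ncomp = 4, type = "response"))
pls_acc <- mean(ifelse(pls_pred < 0.5, 0, 1) == y_dense)

pls_auc <- auc(y_dense, pls_pred)

PLS <- c(pls_acc, pls_auc)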

We obtain predictions from `fit_knn` to compute its accuracy and AUC:

In [None]:
#KNN

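# Compare against the outcome in `features_df2`, whose row order matches the KNN training data
knn_pred <- predict(fit_knn)
knn_acc <- mean(knn_pred == features_df2$satisfied)
# Note: this AUC is based on hard class predictions, not on predicted probabilities
knn_auc <- pROC::roc(response = features_df2$satisfied, predictor = as.numeric(knn_pred))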

Finally, we combine the accuracy and AUC scores for KNN:

In [None]:
KNN <- c(knn_acc, knn_auc$auc)

Then, we compare them visually:

In [None]:
perf_comp <- data.frame(Ridge = Ridge, Lasso = Lasso, KNN = KNN, PCR = PCR, PLS = PLS) 
row.names(perf_comp) <- c("Accuracy", "AUC")

perf_comp

AUC_model <- c(max(fit_ridge$cvm), max(fit_lasso$cvm), KNN[2], PCR[2], PLS[2])

barplot(AUC_model, 
        xlab="Model",
        ylab="AUC", 
        #horiz=TRUE,
        names.arg=c("Ridge", "Lasso", "KNN", "PCR", "PLS"),
        col=c("Skyblue","Pink", "Red", "Green", "Orange"))

## CONCLUSION

Based on the plots we can conclude that `lasso` resulted in the best performance. However, `ridge` came quite close, scoring only slightly below `lasso`. This is plausible: both methods penalize large coefficient values, but `lasso` additionally sets the coefficients of features that are not relevant to the model to exactly zero, which helps when there are many uninformative features.
The other three models, `KNN`, `PCR`, and `PLS`, performed much worse than `ridge` and `lasso`. A likely reason is that they were fit without the tf-idf features, which make up the majority of the predictors.


# 6. Submitting your predictions


We create a test data set, compute the predicted probabilities for it, and put them into a tibble.

Lasso and ridge regression perform similarly (though lasso scores slightly higher). Since lasso also yields a sparse solution by setting some coefficients to zero, we use lasso for the predictions.

In [None]:
test_data <- sparse_matrix[rownames(sparse_matrix) %in% test_ids, ]

test_preds <- predict(fit_lasso, 
               test_data, 
               s = "lambda.min", 
               type = "response")

submission = tibble(Id = rownames(test_preds), 
                    Prediction = test_preds[,1]) %>%
    arrange(as.integer(Id))

head(submission)

Then we write the predictions into a csv file for submission:

In [None]:
write_csv(submission, file="submission.csv")