# DSCI 100: Group Project

# TODO: Insert project title


## Classification of Facebook Posts

#### Introduction:

The rapid change of technology has greatly transformed the business world. Social media platforms have become the best place for businesses to advertise their brands through customer engagement. 
Our study focuses on the biggest social network worldwide, Facebook, with over 2.7 billion monthly active users (Statista, 2021). 

The dataset *Facebook performance metrics* (Moro et al., 2016) contains data related to posts published throughout the year 2014 on a renowned cosmetics brand's Facebook page. The dataset includes post information such as type (photo, status, link, or video), time posted (month, day of week, and hour), user engagement (comments, likes, and shares), impressions on each post (spread over several columns), and whether the post was paid or unpaid. This project proposes to use all relevant columns from this dataset to determine the type of a brand's Facebook post. We will determine which of these metrics are relevant during our cleaning and exploration of the data.

TODO: (from prop feedback) Please provide some background info, so that someone unfamiliar with it will be prepared to understand the rest of your proposal.


# TODO: EXPLICITLY STATE OUR RESEARCH QUESTION HERE!!!
**from proposal feedback, Please state your proposal question clearly and in a form of a "predictive question"**
Our study focuses on predicting post type given 

#### Preliminary exploratory data analysis:

We begin by loading the relevant libraries.

In [None]:
# Load libraries for preliminary data analysis:
library(tidyverse)
library(repr)
library(readxl)
library(tidymodels)
library(GGally)

The following cell loads the dataset into R. Because the dataset on the web is contained in a zip archive, the .csv file was manually extracted and pushed to the working GitHub repository.

The dataset is already in tidy format. The column headings were made more usable by removing spaces and shortening longer headings.

In [None]:
# displays first and last 8 rows of the dataset
options(repr.matrix.max.rows = 16)
# Set the seed
set.seed(123)

# Load the data in
fb_data_raw <- read_csv2("https://gist.githubusercontent.com/KolCrooks/691e5890b6747b4777d6032f019b2c0f/raw/20629a5da3d5a7683e3071798876f3e4b204fbbb/fb_data.csv",  col_types = cols())

fb_data_raw

In [None]:
# Count the NA values in the raw data:
sum(is.na(fb_data_raw))

# TODO: EXPLAIN WHY WE CONVERT COMMENT LIKE AND SHARE TO PERCENTS
Since we will treat type, category, post month, post weekday, and post hour as categorical variables, 
we convert them to factors using the function `as_factor`. We also found 6 NAs in our data set, so we use the `na.omit` function to remove every row that contains an NA.

In [None]:
# Clean the data:
fb_data_clean_cols <- fb_data_raw
colnames(fb_data_clean_cols) <- c("page_likes", "type", "category", "post_month", "post_weekday", "post_hour", "paid", "reach", 
      "impressions", "engaged_users", "post_consumers", "post_consumptions", "impressions_by_people_that_liked_page", 
      "reach_by_people_that_like_page", "people_liked_and_engaged", "comments", "likes", "shares", "interactions")
fb_data_clean <- fb_data_clean_cols %>% 
        mutate(type = as_factor(type)) %>% 
        mutate(category = as_factor(category)) %>% 
        mutate(post_month = as_factor(post_month)) %>% 
        mutate(post_weekday = as_factor(post_weekday)) %>% 
        mutate(post_hour = as_factor(post_hour)) %>% 
        na.omit()

fb_data_clean

In [None]:
# Check that all NA values were removed (this should return 0):
sum(is.na(fb_data_clean))

# TODO: EXPLAIN WHY THE STRATA IS "TYPE"
# TODO: EXPLAIN WHY WE CHOSE 75%
**"Why have you decided to use 75% of the dataset as the training data?"**
Here we split our data into training and testing sets using `initial_split`. We use 75% of the data for training so that the model has more observations to learn from, and keep the remaining 25% for testing so that we can estimate how the model performs on posts it has not seen. Because we want to classify Facebook post type, we pass `type` to the `strata` argument so that each post type appears in roughly the same proportion in both sets.

In [None]:
# Split the data:
fb_split <- initial_split(fb_data_clean, prop = 0.75, strata = type) # 75% training, 25% testing
fb_train <- training(fb_split)
fb_test <- testing(fb_split)

fb_train
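
To illustrate what stratifying by `type` does, here is a minimal base R sketch on made-up labels. The toy counts and the object names (`toy`, `toy_train`, `toy_test`) are assumptions for illustration only, and the sampling shown is a simplified stand-in rather than how `initial_split` is implemented.

```r
# Hypothetical toy labels, not the Facebook data: 80 photos, 12 statuses, 8 links
set.seed(1)
toy <- data.frame(id = 1:100,
                  type = factor(rep(c("Photo", "Status", "Link"), times = c(80, 12, 8))))

# Stratified 75/25 split: sample 75% of the row ids within each type
train_idx <- unlist(lapply(split(toy$id, toy$type), function(ids) {
  ids[sample.int(length(ids), size = round(0.75 * length(ids)))]
}))
toy_train <- toy[toy$id %in% train_idx, ]
toy_test  <- toy[!toy$id %in% train_idx, ]

# Class proportions in each part
round(prop.table(table(toy_train$type)), 2)
round(prop.table(table(toy_test$type)), 2)
```

Both parts keep the same mix of types, so a rare type cannot end up entirely in one of the two sets by chance.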

## TODO: include Plot title!
**The plot does not include a title. For the report, please make sure you are labelling your tables and plots.**

## Selecting our predictors

To figure out which predictors we want to use, we will use `ggpairs` to see how well each column can separate the post types. To do this we will look at how much the box plots of each predictor differ between post types. Before looking at the ggpairs plot, we take out the columns that we already know will not work. This reduces clutter in the plot and lets us look more closely at the remaining predictors.


The main predictors that we know we can't use are columns with factors, because K-nearest neighbours relies on distances between numeric values. This includes the time-based columns `post_month`, `post_weekday`, and `post_hour`, and also `category`.

In [None]:
# Columns before selection
colnames(fb_data_clean)

In [None]:
fb_data_selected1 <- fb_train %>% 
            select(-post_month, -post_weekday, -post_hour, -category) %>%
            select(type, page_likes, paid:interactions) # reorder the df so that type is first, so that we can display only that row
fb_data_selected1

#         mutate(comment_percent = comments / interactions) %>% 
#         mutate(like_percent = likes / interactions) %>% 
#         mutate(share_percent = shares / interactions) %>% 

In [None]:
options(repr.plot.height = 5, repr.plot.width = 30)
fb_select_plot <- fb_data_selected1 %>% 
    ggpairs() +
    ggtitle("Distribution of different factors") +
    theme(text = element_text(size=14))

# Keep only the top row, which compares type with every other column
fb_select_plot$nrow <- 1
fb_select_plot$yAxisLabels <- fb_select_plot$yAxisLabels[1]
fb_select_plot

Looking at this plot, it is easier to say which predictors would not be good for our classification:
- `impressions` will not be a great predictor because the box plots for most post types look the same, meaning it does little to separate them. While the video box plot does look different from the rest, the outliers of the other types occupy a similar range.
- `impressions_by_people_that_liked_page` looks very bad, with each box plot squashed into what looks like a single line.
- `comments`, `likes`, `shares`, and `interactions` appear to have the same problems as `impressions`.

We think that some of these might actually be promising, but the box plots are so compressed that it is hard to see any differences. Something we can do is scale a predictor based on another one.

The columns `comments`, `likes`, `shares`, and `interactions` are related: `interactions` is the sum of the `comments`, `likes`, and `shares` on each post. We can turn these into ratios by turning `comments`, `likes`, and `shares` into percentages of the total interactions. These are better than the raw values because the raw values measure the popularity of the page rather than characteristics of the post type. This normalizes the data, which should allow the model to classify posts for a page of any size. 
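
As a small worked example of this normalization, the sketch below uses two made-up posts (the `posts` data frame and its counts are hypothetical, not taken from the dataset) that differ a hundredfold in raw counts but have the same mix of interactions.

```r
# Hypothetical counts for two similar posts on a small page and on a large page
posts <- data.frame(page     = c("small", "large"),
                    comments = c(2, 200),
                    likes    = c(16, 1600),
                    shares   = c(2, 200))
posts$interactions <- posts$comments + posts$likes + posts$shares

# Share of each interaction type within the post's total interactions
posts$comment_percent <- posts$comments / posts$interactions
posts$like_percent    <- posts$likes / posts$interactions
posts$share_percent   <- posts$shares / posts$interactions

posts[, c("page", "comment_percent", "like_percent", "share_percent")]

# A post with no interactions gives 0 / 0, which R evaluates to NaN
0 / 0
```

Both rows end up with identical ratios even though their raw counts differ. Note that a post with zero interactions produces `NaN` for all three ratios, so such rows may need separate handling.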

# Talk about why the columns we picked are good

# MAYBE LOOK INTO `impressions` AND `impressions_by_people_that_liked_page`

In [None]:
fb_data_selected2 <- fb_data_selected1 %>% 
            select(-impressions, -impressions_by_people_that_liked_page) %>% 
            mutate(comment_percent = comments / interactions) %>% 
            mutate(like_percent = likes / interactions) %>% 
            mutate(share_percent = shares / interactions) %>% 
            select(-comments, -likes, -shares, -interactions) # We don't need these anymore because they have been scaled
fb_data_selected2


In [None]:
options(repr.plot.height = 5, repr.plot.width = 30)
fb_select_plot2 <- fb_data_selected2 %>% 
    ggpairs() +
    ggtitle("Distribution of different factors for Second Selection Set") +
    theme(text = element_text(size=14))

# Keep only the top row, which compares type with every other column
fb_select_plot2$nrow <- 1
fb_select_plot2$yAxisLabels <- fb_select_plot2$yAxisLabels[1]
fb_select_plot2

`comment_percent`, `like_percent`, and `share_percent` look very different now. `comment_percent` still does not look like it would be a good predictor, but `like_percent` and `share_percent` seem like they could be, as their box plots differ enough between post types that you can tell them apart.

In [None]:
fb_data_selected <- fb_data_selected2 %>% 
            select(-comment_percent)
fb_data_selected

Something else we want to consider is whether there are any hidden predictors that we can get by combining other predictors.
# Should we include this part?
When doing some research on this dataset, we found that some people created an engagement ratio, defined as `interactions` / `reach`. When we pass this to `ggpairs` together with `type`, we get this:

In [None]:
fb_select_plot3 <- fb_data_clean %>%
    mutate(engagement_ratio = interactions / reach) %>% 
    select(type, engagement_ratio) %>% 
    ggpairs() +
    ggtitle("Distribution of engagement ratio by post type") +
    theme(text = element_text(size=14))

# Keep only the top row, which compares type with engagement_ratio
fb_select_plot3$nrow <- 1
fb_select_plot3$yAxisLabels <- fb_select_plot3$yAxisLabels[1]
fb_select_plot3

This looks promising, so we will add it as a predictor that we want to look at.


### Preliminary Summary Tables:

Tables were constructed to gain an initial summary of the data. Table *summary_table1* groups posts by type and computes the total posts, total interactions (including all likes, comments, shares), maximum interactions, and number of paid posts for each type.

In [None]:
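summary_table1 <- fb_train %>% 
    group_by(type) %>%
    summarize(total_of_type = n(),                     # number of posts of each type
              total_interactions = sum(interactions),  # likes + comments + shares
              max_interactions = max(interactions),
              paid_posts = sum(paid == 1))             # number of paid posts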

summary_table1

Looking at the number of posts of each type, it is clear that we need to upsample the data. This does raise some concerns about how well the model will be able to predict some types (mainly the type `video`), but we are confident that we can still get good results.
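
To make the idea of upsampling concrete, here is a minimal base R sketch that resamples each class with replacement until it matches the largest class. The toy counts and names (`toy`, `toy_upsampled`) are hypothetical and are not the counts in *summary_table1*; it illustrates the idea rather than the exact procedure used by `step_upsample`.

```r
set.seed(1)
# Hypothetical imbalanced labels, not the real counts from the dataset
toy <- data.frame(type  = factor(rep(c("Photo", "Status", "Video"), times = c(50, 10, 4))),
                  value = seq_len(64))

# Size of the largest class: every class is resampled up to this many rows
target_n <- max(table(toy$type))

# Keep the original rows of each class and add random duplicates up to target_n
rows_by_type <- split(seq_len(nrow(toy)), toy$type)
upsampled_idx <- unlist(lapply(rows_by_type, function(idx) {
  extra <- idx[sample.int(length(idx), size = target_n - length(idx), replace = TRUE)]
  c(idx, extra)
}))
toy_upsampled <- toy[upsampled_idx, ]

table(toy_upsampled$type)
```

The extra rows are exact copies of existing posts, so upsampling balances the vote between classes without adding new information. This is why the rarest types may still be hard to predict.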

#### Selected Columns:

In [None]:
selected_cols <- tibble(col_name = append(colnames(fb_data_selected), 'engagement_ratio'))
selected_cols

### Preliminary visualizations:

# TODO: should we create visualizations? I think the ggpairs exploration should be enough but who knows. - Kol

#### Methods:

Our analysis will use the following columns of the original dataset: `type` (the class we want to predict), `page_likes`, `paid`, `reach`, `engaged_users`, `post_consumers`, `post_consumptions`, `reach_by_people_that_like_page`, and `people_liked_and_engaged`. We will also use the generated columns `like_percent`, `share_percent`, and `engagement_ratio`.


Our aim is to use the K-nearest neighbours algorithm to generate a classification model which will classify a post by type (photo, status, link, or video).
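
For readers unfamiliar with the algorithm, the following is a minimal base R sketch of K-nearest neighbours with an unweighted majority vote. The function `knn_predict` and the toy data are hypothetical and written only for illustration; the analysis itself relies on the `kknn` engine through tidymodels.

```r
# Minimal K-nearest neighbours classifier (illustrative only)
knn_predict <- function(train_x, train_y, new_x, k) {
  # Standardize both sets with the training means and standard deviations
  mu <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z <- scale(new_x, center = mu, scale = sigma)

  apply(new_z, 1, function(point) {
    # Euclidean distance from this point to every training row
    dists <- sqrt(rowSums(sweep(train_z, 2, point)^2))
    nearest <- order(dists)[seq_len(k)]
    # Majority vote among the k nearest labels (ties go to the first level)
    names(which.max(table(train_y[nearest])))
  })
}

# Hypothetical toy data: two well-separated groups of posts
train_x <- matrix(c(1, 2, 1.5, 8, 9, 8.5,
                    100, 120, 110, 900, 950, 920),
                  ncol = 2, dimnames = list(NULL, c("likes", "reach")))
train_y <- factor(c("Status", "Status", "Status", "Photo", "Photo", "Photo"))
new_x <- matrix(c(1.2, 8.7, 105, 930),
                ncol = 2, dimnames = list(NULL, c("likes", "reach")))

knn_predict(train_x, train_y, new_x, k = 3)
```

Standardizing first matters because otherwise the column with the largest raw values (here `reach`) would dominate the distance.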

To visualize our results, we plan to use a confusion matrix. This will display how often our classification model labels a post correctly, and how often each label gets confused with another. We will also use bar charts to visualize relevant and intermediate results; e.g., we will create a bar chart with post type on the x-axis and interactions on the y-axis, filling out the bars with proportional values of the type of each interaction. As part of the tuning step of creating the model, we can create a line chart to show us the optimal K value. 
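
As a sketch of what such a confusion matrix looks like, the example below builds one with base R's `table()` from made-up labels. The vectors `truth` and `predicted` are hypothetical and are not results from this analysis. In tidymodels, `conf_mat()` from yardstick produces the same kind of table from a data frame of predictions.

```r
# Hypothetical true and predicted labels, not results from this analysis
levels_type <- c("Link", "Photo", "Status", "Video")
truth <- factor(c("Photo", "Photo", "Photo", "Status", "Status", "Link", "Video", "Photo"),
                levels = levels_type)
predicted <- factor(c("Photo", "Photo", "Status", "Status", "Photo", "Link", "Photo", "Photo"),
                    levels = levels_type)

# Rows are predictions, columns are the true types
confusion <- table(prediction = predicted, truth = truth)
confusion

# Accuracy is the share of posts that fall on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

Off-diagonal cells show which types get mistaken for each other. In this toy example the single video is labelled as a photo.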

#### Expected outcomes and significance:

This analysis hopes to define a relationship between type of post (i.e., photo, status, link, or video) and ratio of the corresponding post’s interaction type. It is expected that videos and photos, for example, may have higher percentages of interactions that are comments and/or likes when compared to a link or status.

This classification application for labeling a post’s type could be helpful in identifying the types of reactions that a post might receive. It is possible that we find images get the most likes, while statuses get the most comments. Knowing how these metrics indicate the type could lead to better targeted ad campaigns that look for a certain type of user engagement.

Future questions following from this analysis may include:
- Do paid posts generate more traffic than unpaid posts?
- Does the category of a post (i.e., “action”, “product”, or “inspiration” classification) affect the overall number and/or ratio of interactions on a post?
- Do posts with more interactions overall correlate with increases in users liking a company’s Facebook page? 

In examining the data for classification, it is also expected that trends may emerge which could in the future be used to predict post engagement. This predictive knowledge could be used by companies looking to grow their social media reach, as they may more accurately tailor their posts to yield higher engagement before publishing.

## Create the model

In [None]:
# step_upsample() is provided by the themis package in current tidymodels releases
library(themis)

upsample_recipe <- recipe(type ~ ., data = fb_train) %>% 
                    step_upsample(type, over_ratio = 1, skip = FALSE) %>% 
                    prep()



fb_train_upsampled <- upsample_recipe %>% bake(fb_train)
fb_train_upsampled

In [None]:
# Create the tune spec
knn_spec_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% 
            set_engine("kknn") %>% 
            set_mode("classification")

In [None]:
# Create the recipe
fb_recipe <- recipe(type ~  page_likes +
                            paid +
                            reach +
                            engaged_users +
#                             post_consumers +
#                             post_consumptions +
#                             reach_by_people_that_like_page +
                            people_liked_and_engaged 
#                             likes +
#                             shares +
#                             interactions
#                             engagement_ratio
                    , data = fb_train_upsampled) %>%
#                 step_mutate(likes = likes / interactions,
#                             share_percent = shares / interactions) %>%  #,
#                             engagement_ratio = interactions / reach) %>%
#                 step_upsample(type, over_ratio = 1, skip = FALSE) %>%
                step_scale(all_predictors()) %>% 
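                step_center(all_predictors())

fb_recipe

## Show that we are using the correct columns

In [None]:
# Train the recipe only for this preview; the workflows below receive the untrained recipe
baked_fb <- fb_recipe %>% prep() %>% bake(new_data = fb_train_upsampled)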
baked_fb

## Show that data has been upsampled and balanced

In [None]:
baked_fb %>% 
    group_by(type) %>% 
    summarize(n = n())

## Tune the model:

# Why does output have neighbors 1..15 where it skips some?

In [None]:
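# Create 20 cross-validation folds, stratified by type
# (the workflow applies fb_recipe itself, so the folds use the unscaled data)
fb_vfold <- vfold_cv(fb_train_upsampled, v = 20, strata = type)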


In [None]:
gridvals = tibble(neighbors = 1:20)

fb_fit <- workflow() %>% 
        add_recipe(fb_recipe) %>% 
        add_model(knn_spec_tune) %>% 
        tune_grid(resamples = fb_vfold, grid = gridvals) %>% 
        collect_metrics()

fb_fit

In [None]:
fb_filtered <- fb_fit %>% filter(.metric == "accuracy")

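fb_filtered %>% ggplot(aes(x = neighbors, y = mean)) +
            geom_point() +
            geom_line() +
            scale_x_continuous(breaks = 1:20) +
            labs(x = "Number of neighbors (K)",
                 y = "Mean cross-validation accuracy",
                 title = "Estimated accuracy for each value of K")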

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 2) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_fit <- workflow() %>%
  add_recipe(fb_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = fb_train_upsampled) # the workflow applies fb_recipe itself, so pass the unscaled training data
knn_fit

In [None]:
fb_test_predictions <- predict(knn_fit, fb_test) %>%
  bind_cols(fb_test)
fb_test_predictions

In [None]:
fb_test_predictions %>%
  metrics(truth = type, estimate = .pred_class)

### References: 
Statista. (2021). Facebook: Monthly Active Users 2021. Retrieved on February 28, 2021 from http://www.statista.com.ezproxy.library.ubc.ca/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
