<img src="Header.png" style="width: 800px;">

# Executive Summary

### ` Context:` 
Facebook pages are a significant brand asset for thousands of companies and organisations worldwide. Companies invest heavily in developing social content to engage customers and prospects in a two-way 'conversation'. The impact of social media on business success is widely debated, often with sharply differing views. However, one thing most marketers agree on is that brand differentiation is a key aspect of any 'healthy' brand. Brands need to stand out from one another in their category, not just in terms of what they offer but also in how they communicate across all media channels, including social media. This leads to the focus of this project:

<h2><center>Are brands doing enough to differentiate their social content on Facebook?</center></h2>



### ` Goal:` 
Focussing on the seven biggest UK supermarket brands on Facebook and using natural language processing and supervised classification modelling, can we train a machine to distinguish the different supermarket brands' social content from one another on Facebook?

### ` Approach:`
Seven of the UK's leading supermarket brands were chosen for the study: Sainsbury's, Tesco, Lidl, ASDA, Morrisons, M&S and Waitrose. Their social content was scraped from Facebook with automated web scraping (Selenium). A total of 6350 posts were scraped, dating from c.2014 to early December 2019. After cleaning and debranding, the brands were distributed as follows, giving a baseline accuracy of 0.21 (the share of the largest class, Lidl):

    Lidl                 1353
    Tesco                1068
    Marks and Spencer     870
    Morrisons             798
    Waitrose              777
    ASDA                  770
    Sainsburys            714
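
As a quick check on the baseline figure, the sketch below recomputes it from the counts in the table above. It assumes the baseline means "always predict the most frequent brand", which matches the description of the baseline as the dominant class; the variable names are illustrative only.

```python
# Post counts per brand after cleaning, copied from the table above.
post_counts = {
    "Lidl": 1353,
    "Tesco": 1068,
    "Marks and Spencer": 870,
    "Morrisons": 798,
    "Waitrose": 777,
    "ASDA": 770,
    "Sainsburys": 714,
}

total_posts = sum(post_counts.values())
majority_brand = max(post_counts, key=post_counts.get)

# Baseline accuracy: the share of posts belonging to the most frequent brand,
# i.e. the accuracy of a model that always predicts that brand.
baseline = post_counts[majority_brand] / total_posts

print(total_posts, majority_brand, round(baseline, 2))  # 6350 Lidl 0.21
```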


### ` Results:`
After implementing term frequency-inverse document frequency (TF-IDF) vectorisation and applying a range of supervised classification models, a cross-validated and tuned logistic regression classifier performed best when classifying the unseen social content (test data). It achieved a final accuracy score of 0.69 on the test data, with similar precision (0.71) and recall (0.69).


# Walk-through

## `Acquiring the data:`

My main aim was to obtain the following data from each post at source (i.e. the data was readily available and 'scrapable' from an HTML element on the page):

    - Date: Date posted
    - Year: Year
    - Brand: Class label
    - Post_Content: the post (KEY PREDICTORS)
    - All_Responses: the aggregate of likes, haha, angry, sad, wow, love each post received
    - Comments: the total number of comments the post received
    - Shares: the total number of shares the post received
    - Views: the total number of views a video got (if post contained video, 0 if no video present)
    
The following metrics were engineered in some form:

    - Contains_Link : If the post contained a link in text form e.g. 'bit.ly/1236'
    - Contains_Video : If the post contained a video 
    - Has_Hashtag : If the post contained a hashtag e.g. #LidlSurprises 
    - Hashtag_Count : If the post contained a hashtag, how many hashtags did it contain?
    - Likes : Total page likes the page had when the post was made
    - Response_Rate : All_Responses / Likes - proxy for engagement
    - Comments_Rate : Comments / Likes - proxy for engagement
    - Shares_Rate : Shares / Likes - proxy for engagement
    - Video_Rate : Views / Likes - proxy for engagement
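
The engineered metrics listed above could be derived roughly as follows. This is a minimal sketch rather than the project's actual code: the regular expressions, the `engineer_features` helper and the example record are all hypothetical, `Contains_Video` is inferred here from `Views > 0`, and `Video_Rate` is assumed to mean `Views / Likes`.

```python
import re

# Hypothetical patterns: the project's actual regular expressions are not shown.
LINK_PATTERN = re.compile(
    r"(?:https?://|www\.)\S+|\b[\w-]+\.(?:ly|com|co\.uk)/\S*", re.IGNORECASE
)
HASHTAG_PATTERN = re.compile(r"#\w+")


def engineer_features(post):
    """Return a copy of a scraped post record with the engineered columns added."""
    text = post["Post_Content"]
    hashtags = HASHTAG_PATTERN.findall(text)
    likes = post["Likes"]

    def rate(count):
        # Guard against a missing or zero page-like total.
        return count / likes if likes else 0.0

    features = dict(post)
    features.update({
        "Contains_Link": bool(LINK_PATTERN.search(text)),
        "Contains_Video": post["Views"] > 0,  # assumption: no video means 0 views
        "Has_Hashtag": bool(hashtags),
        "Hashtag_Count": len(hashtags),
        "Response_Rate": rate(post["All_Responses"]),
        "Comments_Rate": rate(post["Comments"]),
        "Shares_Rate": rate(post["Shares"]),
        "Video_Rate": rate(post["Views"]),  # assumption: Views / Likes
    })
    return features


# Made-up record for illustration only; the numbers are not from the data set.
example_post = {
    "Post_Content": "Big on quality #LidlSurprises see bit.ly/1236",
    "All_Responses": 500, "Comments": 50, "Shares": 25, "Views": 0,
    "Likes": 1_000_000,
}
example_features = engineer_features(example_post)
print(example_features["Has_Hashtag"], example_features["Hashtag_Count"])
```

Dividing by the page's total likes at the time of posting is what makes the rates comparable between brands with very different audience sizes.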
    
The diagram below summarises the streams of data I used and where features were engineered.

<img src="Assets/Overview.jpg">


## `EDA & Observations:`

My final predictors for classification would ultimately be the vectorised Facebook posts, which supply the features for the predictor matrix. However, exploring the other metrics first gave an early read on whether the brands' content differed in other observable ways, and therefore whether building an accurate model was likely.

#### `EDA` - differences in the `types of content` shared (hashtags, videos)

Waitrose, Sainsburys and Morrisons are particularly likely to post videos, with at least half of their posts containing a video of some kind. Lidl and ASDA very rarely post videos (see below).

<img src="Assets/videos.png">

Morrisons use hashtags *frequently*, with over 60% of their posts containing a hashtag of some kind. All other brands use hashtags far more sparingly in their posts (see below).

<img src="Assets/hashtags.png">

Further analysis of the content of those hashtags revealed that they were often about very different things: the majority of Morrisons' hashtags were about Nutmeg (its clothing line), whereas the most used hashtag for ASDA was about Christmas (see below).

<img src="Assets/morrisons_hash.png">

<img src="Assets/asda_hash.png">


#### `EDA` - differences in `response to content` (i.e. engagement rates)

In addition to the different types of content being posted, brands differed in the engagement rates their posts achieved. That is, relative to the population of a page (i.e. the total number of people who have liked the page), some brands received more engagement than others. For example, Waitrose scores particularly well in terms of getting comments from its fanbase (see below).

<img src="Assets/waitrose_engagement.png">

#### `EDA` - differences in `what they're saying` (i.e. topics/words)

Analysis of the content by brand revealed some key words that scored particularly highly when vectorised using TF-IDF:

    Sainsburys : Recipes, magazine
    Tesco : Christmas, delicious
    Waitrose : Watch, Recipes
    Lidl : Prices, Stock availability
    M&S : Shop, Christmas, Summer
    Morrisons : Win, voucher, online
    ASDA : People

<img src="Assets/asda_cloud.png" width="400">


## `Data Preparation & Cleaning:`

I wanted to get to the point where I was only using the narrative of the post, i.e. keeping all the words and content that related to topics and 'tone of voice' but removing as many 'obvious' branding cues as possible. This was achieved through a range of regular expressions and string formatting, the main principles being:

    1) Remove as many direct branding cues as possible  
    2) Remove as many links and hashtags as possible (indirect branding cues)
    3) Leave the narrative / content
    
The diagram below summarises some examples of these transformations.

<img src="Assets/Dataprep.jpg">
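
The three debranding principles above can be sketched with a few regular expressions. The brand term list, the patterns and the `debrand` helper below are assumptions made for illustration; the project's real cleaning rules are only summarised in the diagram.

```python
import re

# Hypothetical list of direct branding cues; the project's real list is not shown.
BRAND_TERMS = [
    "sainsbury's", "sainsburys", "tesco", "lidl", "asda",
    "morrisons", "marks and spencer", "m&s", "waitrose",
]

# Longest terms first, so that multi-word names are matched before shorter ones.
BRAND_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(term) for term in sorted(BRAND_TERMS, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,
)
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)\S+|\b[\w-]+\.(?:ly|com|co\.uk)/\S*", re.IGNORECASE
)
HASHTAG_PATTERN = re.compile(r"#\w+")


def debrand(text):
    """Strip links, hashtags and brand names, keeping the narrative of the post."""
    text = URL_PATTERN.sub(" ", text)      # 2) indirect cues: links
    text = HASHTAG_PATTERN.sub(" ", text)  # 2) indirect cues: hashtags
    text = BRAND_PATTERN.sub(" ", text)    # 1) direct cues: brand names
    return re.sub(r"\s+", " ", text).strip()  # 3) tidy up what is left


cleaned = debrand(
    "Pop into Lidl this weekend for fresh bakes #LidlSurprises bit.ly/1236"
)
print(cleaned)  # Pop into this weekend for fresh bakes
```

Hashtags are removed before brand names because a branded hashtag such as `#LidlSurprises` would otherwise survive the whole-word brand match.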

# Statistical Analysis

### ` Training & Test splits:`
The content (X) and labels (y) were split into training and test sets, stratified to ensure the same distribution of classes in both splits. The training set comprised 75% of the data and the test set 25%.
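
A stratified 75/25 split of this kind can be sketched with scikit-learn's `train_test_split`. The toy posts, the three-brand label list and the `random_state` value are placeholders, not the project's data or settings.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for the cleaned posts (X) and brand labels (y).
posts = [f"post {i}" for i in range(40)]
brands = ["Lidl"] * 20 + ["Tesco"] * 12 + ["Waitrose"] * 8

# stratify=brands keeps each brand's share the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    posts, brands, test_size=0.25, stratify=brands, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(len(X_train), len(X_test))
print(train_counts, test_counts)
```

Without stratification, a smaller class could end up under-represented in the test set purely by chance, which would distort the per-brand precision and recall.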


### ` Feature generation:`
Once the data was cleaned and debranded, the training content (X_train) was vectorised using term frequency-inverse document frequency (TF-IDF) vectorisation, providing us with a predictor matrix. As we had already pre-processed our data to remove any obvious brand cues, the only stop words we needed to apply were the default 'english' stop word list. Once the vectorizer was fitted to the training set (X_train), we transformed the test set (X_test) with the same fitted vectorizer instance.
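
The fit-on-train, transform-on-test pattern described here looks roughly like this with scikit-learn's `TfidfVectorizer`. The example sentences are invented, and any vectorizer settings beyond `stop_words="english"` are not known from this write-up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for the debranded training and test posts.
X_train = [
    "fresh recipes for a family dinner",
    "win a voucher this weekend",
    "watch our new recipes video",
]
X_test = ["win a family dinner with a brand new voucher"]

vectorizer = TfidfVectorizer(stop_words="english")

# The vocabulary and IDF weights are learned from the training posts only.
X_train_tfidf = vectorizer.fit_transform(X_train)

# The test posts are mapped onto that same vocabulary; unseen words are dropped.
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape, X_test_tfidf.shape)
```

Fitting the vectorizer on the training set alone keeps information from the test posts out of the features, so the test score remains an honest estimate of performance on unseen content.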


###  ` Model Choice & Hyperparameters`
Six supervised classification models were chosen: Logistic Regression, K-Nearest Neighbours, Support Vector Machine, Random Forest, Multinomial Naive Bayes and Stochastic Gradient Descent. All models were cross-validated to guard against overfitting, and each had its hyperparameters tuned via grid search in order to find the strongest performer on the unseen test data.

Our baseline for this data set was 0.21, the proportion of the dominant class. All fitted models exceeded the baseline; however, Logistic Regression turned out to be the strongest performer overall, with an accuracy score of 0.69. Precision and recall were both approximately 0.7.

![image.png](attachment:image.png)
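
For readers who want to see the shape of the tuning step, here is a minimal sketch of a cross-validated grid search over a TF-IDF plus logistic regression pipeline. The toy corpus, the `brand_a`/`brand_b` labels, the parameter grid and the choice of three folds are all assumptions; the grid actually searched in the project is not listed in this write-up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented two-brand corpus standing in for the debranded training posts.
train_posts = [
    "try our new recipes for dinner tonight",
    "watch our chef cook festive recipes",
    "new recipes in this month's magazine",
    "fresh dinner recipes for the family",
    "watch how to cook a simple dinner",
    "festive recipes from our magazine",
    "win a voucher in our prize draw",
    "low prices on fresh fruit this week",
    "win big with our voucher giveaway",
    "great prices in store this weekend",
    "enter the prize draw to win a voucher",
    "low prices all week in store",
]
train_brands = ["brand_a"] * 6 + ["brand_b"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: regularisation strength and n-gram range.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

# Each parameter combination is scored by 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(train_posts, train_brands)

print(search.best_params_)
prediction = search.predict(["win a voucher with low prices"])[0]
```

Putting the vectorizer inside the pipeline means it is refitted within every cross-validation fold, so the validation posts never influence the vocabulary used to score them.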





# Looking ahead..

### ` Validating performance over time:`

The model is reasonably accurate, though not perfect. A key limitation is the volume of data the model was trained on (c.600 rows per class, after splitting into train/test). In theory, the more data I can train the model on, the better it should become at generalising to new data. To build a stronger, longer-lasting model, I would suggest a weekly scrape of new content (the code is in place in section XX), adding it to the training data incrementally.

### ` New models / angles of analysis:`

The main aim of this project was to develop a classifier. Another big question we could tackle is which kinds of content best predict engagement, i.e. what does a supermarket brand need to post about in order to get 'X' likes or 'X' shares? In theory we could create a feature matrix as we did here, but fit a regressor to predict a continuous variable such as shares, likes or comments.

### ` Productionize the model:`
As I've created a pipeline for the strongest model (covering pre-processing, vectorisation and model fitting) and already tested it on very new data, I've already taken steps to make the model ready for use in a production environment. Integrating this with an app or interface in order to make it usable and useful would be key (see below).

Another benefit of designing a data capture method on Facebook is transferability to other brands and their social content. Due to the consistent way Facebook is built across brand pages, it would be very easy to re-run this project for any brand page on Facebook, in any category, obtain new data in exactly the same format and re-train a model. For example, if I wanted to build a model to classify the UK mobile phone network brands on Facebook (EE, Vodafone, O2, Three), this could be done fairly easily. Whilst some time would be needed to develop a bespoke stop word dictionary, the model infrastructure should be exactly the same.
 
### ` Deploying the model publicly:`
Using Flask, I could create a web app through which the strongest model could be accessed. If I wanted it to be publicly available, hosting it on a cloud computing platform such as AWS would mean anyone, anywhere could test my model with new social media data. Dressing it up in some easy-on-the-eye HTML would be achievable too.

# Looking back... Risks, limitations & assumptions

### ` Approach:`

Scraping data from Facebook was 'interesting' to say the least, and presented many challenges. As Selenium mimics the web-navigating behaviour of an individual, scraping can be a slow process, taking up to a few hours just to gather content for one brand. Furthermore, the way pages load on Facebook can vary, meaning that some of the scraping syntax occasionally failed to work. Although I managed to write a script that got what I needed fairly consistently, repeating this data capture process for other brands or categories would be slow.

Furthermore, obtaining some metrics was very difficult, e.g. the breakdown of whether a post received like, haha, angry, sad, wow etc., so I aggregated them all together. This is acceptable, but we lose the detail.

### `Data - more of it in terms of volume:`

Although we had a reasonable amount of data, more content would help make our model more robust and possibly improve its accuracy. My hands are somewhat tied, however: because brands rarely post more than once a day, in any given year you're looking at c.300 posts available, fewer in some instances. Therefore if we want n=10K rows, we need to scrape back many, many years. As mentioned above, the scraping process was very slow, so obtaining greater volumes of data efficiently could be tricky.

### `Data - more of it in terms of granularity:`

Although my Selenium script allowed me to obtain my primary data source, I did use a secondary data source (Fanpage Karma https://www.fanpagekarma.com/ ) to provide time-stamped 'total likes' data, which I integrated with my initial data set. I subscribed to their free trial and took what I needed. However, when I was exploring this source, there were many other useful metrics available, such as the breakdown of like, haha, angry, sad, wow etc. This secondary data source could very well plug the gaps in my primary data source, so I would be happy to look into it in any future iterations of this project and see where I can combine the data sets further.

### ` NLP Approaches:`

Although TF-IDF vectorisation was successful in highlighting the differences in the posts and was a much better approach than a simple bag-of-words model (count vectorising), with TF-IDF we are still effectively using word counts in some form (although weighted) to provide features. It would be interesting, volume of text data permitting, to explore more nuanced forms of text analysis that can give us 'richer' features, e.g. sentiment analysis (using something like VADER) and exploring topics and themes (using something like LDA).

### ` One Vs. Rest Classification Evaluation:`

Deeper evaluation and optimisation of this classifier could be achieved by using a one-vs-rest classification strategy. In this case we would 'binarize' our seven classes into a 'one vs. rest' label system and effectively run seven models. By doing this we can use ROC AUC visualisations to see how each one-vs-rest model balances the true positive rate against the false positive rate, and how this varies as we change its decision threshold.
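
A one-vs-rest evaluation along these lines might look like the sketch below. It uses synthetic seven-class data from `make_classification` in place of the TF-IDF matrix and brand labels, so the AUC values it prints say nothing about the real model; the classifier settings are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Synthetic seven-class data standing in for the TF-IDF features and brands.
X, y = make_classification(
    n_samples=700, n_features=20, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One binary logistic regression is fitted per class ("this class vs. the rest").
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
scores = ovr.predict_proba(X_test)                      # one column per class
y_test_bin = label_binarize(y_test, classes=ovr.classes_)

aucs = {}
for index, label in enumerate(ovr.classes_):
    # fpr and tpr trace the ROC curve as the decision threshold is varied.
    fpr, tpr, thresholds = roc_curve(y_test_bin[:, index], scores[:, index])
    aucs[label] = roc_auc_score(y_test_bin[:, index], scores[:, index])
    print(f"class {label}: AUC = {aucs[label]:.2f}")
```

Plotting each `fpr` against `tpr` (for example with matplotlib) would give the seven ROC curves, one per brand, making it easy to see which brands are hardest to separate from the rest.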

# Technical Appendix & Useful Links 

Chromedriver - http://chromedriver.chromium.org/downloads
    
Fanpage Karma - https://www.fanpagekarma.com/
    
Final, clean & merged data from this project -  