# Appendix A) WCD BIG DATA PROJECT - TWITTER SENTIMENT ANALYSIS - INFLATION

## Multi-Layer Perceptron (MLP) Classifier (Neural Net)

**WeCloudData Bootcamp 2022 (Part-time Cohort)**<br>
By: Kevin Jeswani & Junaid Zafar <br>
The set of notebooks is segmented for clarity and convenience <br>
The following is the suggested order for running the scripts:
- '1_WCD_Twitter_Inflation_Classification' - The S3 bucket for inflation tweets is mounted, the Twitter data is copied over, and the tweets are cleaned. VADER and a Spark-NLP pre-trained model are used to apply labels to the inflation tweets. The data is then transformed with spark-ml. Logistic regression and random forest models are built and trained with gridsearchCV on the label and transformed token features.
- '2_WCD_Twitter_AllTopics_Clustering' - All topics in the WCD Twitter bucket are filtered, and custom transformers are built and inserted into an extensive pipeline to load raw data from Kinesis Firehose. Clustering with Latent Dirichlet Allocation is conducted using a custom grid search to perform topic modelling.<br>

**Appendices** - Please note these notebooks are included simply as supporting information and to show that other experiments and exercises were conducted. Less time and effort was spent on formatting these notebooks, whereas Notebooks 1) and 2) are the main submission documents.
- 'AppA_WCD_Twitter_Inflation_Classification_MLPOnly' **This Notebook** - Experimentation for classification with multi-layer perceptron models - originally at the end of Notebook 1)
- 'AppB_WCD_Twitter_Inflation_Clustering' - Inflation tweet data with Spark-NLP labels is imported, and a custom transformer for data cleaning is built and combined with standard NLP transformers in a pipeline. LDA clustering is implemented to model topics in the inflation dataset. An attempt was made with a GMM clustering model.
- 'AppC_WCD_Twitter_AllTopics_52mil_Clustering' - ALL streamed tweets (55mil+) are loaded from the WCD bucket, a transformation pipeline is built and all the data is transformed. An LDA clustering model is built to cluster all the topics.
- 'AppD_WCD_Twitter_AllTopics_Clustering_Evaluation' - An attempt was made to visualize the clustering using principal component analysis and t-SNE, but the data transformation required was too heavy to process and other issues occurred.

This notebook collects material that was taken out of Notebook 1: Classification on Inflation Tweets <br>

It is not meant to be run standalone; it serves only as a record of advanced experimentation.

**Note:** An attempt was made to use an MLP, but after hours of research and debugging a solution could not be found, as the structure of the features is not consistent for every element. The features seem to work with pyspark-ml's other classifiers, but not the MLP. The following section would need to be explored further in the future.

In [0]:
# Section Imports
from pyspark.ml.classification import MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel

In this section, a Spark-native MLP classifier (an artificial neural network) is trained using the processed tweet token 'features' (with all the feature transformers applied from Section 2)
- A perceptron is a mathematical representation of a neuron, in which the inputs are combined in a weighted sum. The weighted sum is passed through an activation function; when it exceeds a given threshold, the neuron is engaged and produces an output.
- The outputs can be binary, hence used in a binary classifier, which then defines a linear decision boundary (a hyperplane). The distance between misclassified points and the decision boundary is minimized using stochastic gradient descent.
- The Rectified Linear Unit (ReLU) is often used as the neuronal activation function; Spark's MLP classifier itself uses sigmoid activations in the hidden layers and softmax in the output layer
- The MLP has one input layer, one output layer, and 1+ hidden layers with multiple neurons stacked together; inputs are combined with initial weights in a weighted sum, then the activation function is applied. The result of each layer is propagated to the succeeding layer
- It uses backpropagation to compute how the cost function changes with respect to each weight, and iteratively adjusts the weights in the entire network to minimize the cost function <br>
<img src="https://miro.medium.com/max/1400/1*MF1q2Q3fbpYlXX8fZUiwpA.webp" width = "400"/> <br>
Source: https://towardsdatascience.com/multilayer-perceptron-explained-with-a-real-life-example-and-python-code-sentiment-analysis-cb408ee93141 <br>
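
The forward pass described in the bullets above can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: the weights, biases, and the helper names `relu`, `softmax`, `dense`, and `forward` are all made up for the example, and ReLU is used in the hidden layer only because it is the activation mentioned above.

```python
import math

def relu(values):
    """Rectified Linear Unit: negative inputs become 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(neuron_w, inputs)) + b
            for neuron_w, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input -> hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))

# Hypothetical toy network: 2 inputs, 2 hidden neurons, 3 output classes
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out_b = [0.0, 0.0, 0.0]

probs = forward([2.0, 1.0], hidden_w, hidden_b, out_w, out_b)
predicted_class = probs.index(max(probs))  # index of the most probable class
```

Training would repeat this forward pass, compare `probs` with the true label, and use backpropagation to nudge every weight in the direction that lowers the cost.
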
**Advantages:**
- Applicable to complex non-linear problems
- Works well with large input data
- Provides quick predictions after training
- The same accuracy ratio can be achieved with smaller datasets <br>
**Disadvantages:**
- Independent variable contributions are unknown (black-box-y)
- Computationally expensive
- Functionality of model depends on training set quality <br>
Source: https://www.researchgate.net/figure/Multilayer-Perceptron-Advantages-and-Disadvantages_tbl4_338950098
<br>
**Parameters of Interest:**
- Layers = list of neuron counts per layer, [#input layer, #hidden layer, #hidden layer, #output layer]; the number of output neurons must match the number of classes, and the number of input neurons must equal the length of the feature vector (assumed here to be around 12 words in the features col; hidden layer sizes were chosen arbitrarily) <br>
**More info here:**
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.MultilayerPerceptronClassifier.html#pyspark.ml.classification.MultilayerPerceptronClassifier <br>
https://medium.com/swlh/pysparks-multi-layer-perceptron-classifier-on-iris-dataset-dcf70d553cd8 <br>
https://medium.com/analytics-vidhya/spark-mllibs-multilayer-perceptron-classifier-mlpc-hands-on-32ac4014eee9 <br>
https://towardsdatascience.com/spark-multilayer-perceptron-classifier-for-poi-classification-99e5c68b4a77 <br>
https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
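
To make the `layers` parameter more concrete, the sketch below counts the trainable parameters implied by a layer specification and checks it against a feature size and class count. It is plain Python for illustration only; `count_mlp_parameters` and `check_layers` are hypothetical helpers, and the feature size of 5000 is an invented value, not one measured from the tweet data.

```python
def count_mlp_parameters(layers):
    """Trainable parameters of a fully connected MLP: weights plus one bias per neuron."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers[:-1], layers[1:]))

def check_layers(layers, feature_size, num_classes):
    """Return a list of problems with a layers spec; an empty list means it is consistent."""
    problems = []
    if layers[0] != feature_size:
        problems.append(
            f"input layer has {layers[0]} neurons but feature vectors have length {feature_size}"
        )
    if layers[-1] != num_classes:
        problems.append(
            f"output layer has {layers[-1]} neurons but there are {num_classes} classes"
        )
    return problems

layers = [12, 24, 24, 3]

# (12 + 1) * 24 + (24 + 1) * 24 + (24 + 1) * 3
n_params = count_mlp_parameters(layers)

# Hypothetical feature size, e.g. a vocabulary-sized vector from a count-based transformer
issues = check_layers(layers, feature_size=5000, num_classes=3)
```

A check of this kind, run before `fit`, would flag an input-layer mismatch early instead of surfacing it as an exception during training or prediction.
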

In [0]:
# Set up the MLP model and train it
# layers: 12 input neurons, two hidden layers of 24 neurons, 3 output neurons (pos, neg, neut)
# By default the classifier looks for the 'features' and 'label' columns
mlp_classifier = MultilayerPerceptronClassifier(layers=[12, 24, 24, 3], seed=123)
mlp_model = mlp_classifier.fit(trainSet)  # train

In [0]:
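# Spark DataFrames have no .shape attribute; use count() and columns instead
print(f"Input data shape: ({trainSet.count()}, {len(trainSet.columns)})")
print(f"Model layers: {mlp_model.getLayers()}")  # getLayers() requires Spark 3.0+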

In [0]:
# Save the model
mlp_model.save("~/mlmodels/models/Twitter_MLP")

In [0]:
# Load the model
mlp_model_in = MultilayerPerceptronClassificationModel.load("dbfs:/~/mlmodels/models/Twitter_MLP")

In [0]:
mlp_pred_test = mlp_model_in.transform(testSet)

In [0]:
display(mlp_pred_test)

In [0]:
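# trainSet['features']['values'] raises an exception: ML vectors are a user-defined type, not a struct
# vector_to_array (Spark 3.0+) converts the vector into a plain array column instead
from pyspark.ml.functions import vector_to_array
trainSet_ = trainSet.select('*', vector_to_array(trainSet['features']).alias("feat_explode"))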

In [0]:
len(trainSet.select("features").first()[0])
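
One possible explanation for the apparent inconsistency, offered here as an assumption rather than a confirmed diagnosis: if `features` holds sparse vectors (as count-based or hashing transformers typically produce), each row stores only its non-zero entries, so the number of stored values varies per tweet while the logical vector size stays constant. The sketch below mimics that layout in plain Python; `SparseRow` and `to_dense` are hypothetical names, not Spark classes, and the sizes are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SparseRow:
    size: int            # logical length of the vector (e.g. vocabulary size)
    indices: List[int]   # positions of the non-zero entries
    values: List[float]  # the non-zero entries themselves

def to_dense(row):
    """Expand a sparse row into a full-length list, filling gaps with zeros."""
    dense = [0.0] * row.size
    for i, v in zip(row.indices, row.values):
        dense[i] = v
    return dense

# Two hypothetical tweets over the same 8-term vocabulary
short_tweet = SparseRow(size=8, indices=[1, 5], values=[1.0, 2.0])
long_tweet = SparseRow(size=8, indices=[0, 2, 3, 7], values=[1.0, 1.0, 3.0, 1.0])

# The number of stored values differs per row...
stored_counts = [len(short_tweet.values), len(long_tweet.values)]
# ...but the logical size, which is what a model's input layer sees, is identical
logical_sizes = [len(to_dense(short_tweet)), len(to_dense(long_tweet))]
```

In PySpark, `len()` of an ML vector returns the logical size, so the value printed by the cell above is the number the first entry of `layers` has to match.
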

In [0]:
display(tweets_features)

In [0]:
# Try another transformer to standardize the number of features - STILL NOT WORKING
# (VectorIndexer indexes categorical features; it does not change the vector length)
from pyspark.ml.feature import VectorIndexer
vecindexer = VectorIndexer(maxCategories=12, inputCol="features", outputCol="features_")
vecindexer_model = vecindexer.fit(tweets_features)
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorIndexer.html#pyspark.ml.feature.VectorIndexer
# https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorSlicer.html#pyspark.ml.feature.VectorSlicer

In [0]:
tweets_features2 = vecindexer_model.transform(tweets_features)

**Note:** The MLP model developed above ran into issues with the size of the input features. Ideally, one could create a vector size hint and filter out vectors that do not match the specified vector size. The problem is that it tries to change the total feature size across the entire document, not the features selected at the element level. This section is included only as a preview of what else could be explored in the future. <br>
See: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorSizeHint.html <br>
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html
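
The filtering idea in the note can be illustrated without Spark. The sketch below is a plain-Python imitation of the three `handleInvalid` modes (`error`, `skip`, `optimistic`) applied to a list of feature lists; `apply_size_hint` is a hypothetical helper written for this example, and the row lengths are invented.

```python
def apply_size_hint(rows, size, handle_invalid="error"):
    """Imitate a size hint on a list of feature lists.

    error      - raise on the first row whose length differs from `size`
    skip       - silently drop rows whose length differs from `size`
    optimistic - perform no check and return every row
    """
    if handle_invalid not in ("error", "skip", "optimistic"):
        raise ValueError(f"unknown handle_invalid mode: {handle_invalid}")
    if handle_invalid == "optimistic":
        return list(rows)
    kept = []
    for row in rows:
        if len(row) == size:
            kept.append(row)
        elif handle_invalid == "error":
            raise ValueError(f"expected a vector of size {size}, got {len(row)}")
    return kept

# Hypothetical rows: two of length 10 and one of length 7
rows = [[0.1] * 10, [0.2] * 7, [0.3] * 10]

kept_rows = apply_size_hint(rows, size=10, handle_invalid="skip")  # drops the length-7 row
none_kept = apply_size_hint(rows, size=12, handle_invalid="skip")  # no row has length 12
```

A size hint only validates or filters; it never resizes a vector. If the declared size matches no rows, `skip` leaves an empty result, so the size has to come from the data rather than from the desired input-layer width.
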

Feature Assembly - With VectorSizeHint

In [0]:
from pyspark.ml.feature import VectorSizeHint
# handleInvalid="skip" drops rows whose 'features' vector is not exactly `size` long
sizehint = VectorSizeHint(inputCol="features", size=10, handleInvalid="skip")
tweets_features = sizehint.transform(tweets_features)