# Machine Learning Final Project
# Blog Content Characterization (Morality, Emotion Analysis, Topic Analysis)

Abstract
===============
Blogs have become part of daily life, whether we are reading newsletters and journals, documenting our own stories, or following the stories of others. 
Blogs are one of the important developments of Web 2.0, and blog-style publishing extends to other social media such as Twitter and Facebook posts. <br>The content of a blog also shapes how it influences both its audience and its writer. 
Audiences are influenced while writers become influential.<br> Characterizing blog content means analyzing a blog and labeling it with social metrics that are backed by a machine learning approach. With this in mind, this work applies available machine learning approaches and libraries to characterize blogs and their content. The resulting labels help users understand the influence of a blog or its author and decode an underlying message that the audience may not infer naturally.

## Introduction

Many people read blogs and news articles online; they typically bookmark their preferred blogs and subscribe to the RSS (Really Simple Syndication) feeds of these blogs. <br>
Readers are often unaware that they are being influenced by an author, while some authors have a limited understanding of how much their influence is growing. <br>
The text of a blog post often contains words that can classify it more reliably than the tags the author assigned when creating the website. <br>

Topic analysis is used to determine a text's topic structure, that is, which topics the text contains and how they change over time. <br>It consists of two main tasks: topic identification and text segmentation (based on topic changes). 

Emotions can be expressed verbally through emotional vocabulary or through nonverbal cues like intonation of voice, facial expressions, and gestures, all of which play an important role in human communication. <br>Most human-computer interaction (HCI) systems lack emotional intelligence and are incapable of interpreting emotions. For blog information retrieval, it is essential to characterize blog content using relevant, dependable, and distinguishing tags.

Although some authors deliberately set out to influence their audience, most blog authors consider themselves to be simply expressing their point of view and the things they are passionate about.<br> Hence, running social analysis and using machine learning to classify these blogs will help readers understand the author, and will help authors understand their content and why their following is increasing or decreasing. <br>This is why we introduce combined social analysis as a way of characterizing the blog data collected for this study.


## Methodology

To carry out this study, we briefly describe each method and tool we will use. 
First, we will collect about two years of blog data by crawling blogs of interest. We will not filter by keyword when collecting this data, because we intend to describe or label the blogs using the results of the morality, topic, and emotion analyses. This will then allow us to use a classification model to classify these blogs.

### Dataset Description

The dataset for this study was collected with a crawler tool focused on Indo-Pacific blogs, i.e., blogs that discuss key issues related to the Indo-Pacific region. The collected posts are stored in a CSV file and uploaded to the GitHub repository linked below. The repository will also house the project's source code. 

https://github.com/nakinnubis/inpacblogdata

### Algorithm

#### Topic Analysis

We will use Latent Semantic Analysis (LSA), an NLP technique, for topic analysis. LSA applies singular value decomposition to place documents and words in a shared semantic space for classification, so it fits our goal of characterizing blog post content and information.
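
As a rough illustration of the idea, the sketch below applies LSA to a toy corpus. It assumes scikit-learn's `TfidfVectorizer` and `TruncatedSVD`; the example documents, the choice of two components, and all variable names are made up for the illustration and are not taken from the project.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for blog posts (hypothetical data).
docs = [
    "naval exercises in the pacific region",
    "trade agreement boosts regional economy",
    "navy ships patrol the pacific waters",
    "economy grows after new trade deal",
]

# Build a TF-IDF weighted document-term matrix.
tfidf = TfidfVectorizer(stop_words="english")
doc_term = tfidf.fit_transform(docs)

# Truncated SVD projects the documents into a low-dimensional semantic space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(doc_term)

# One row per document, one column per latent dimension.
print(doc_topics.shape)
```

Each row of `doc_topics` is a document's position in the reduced space, and `svd.components_` holds the term weights of each latent dimension.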


#### Morality Analysis

For morality assessment, we will use Moral Foundations Theory (MFT) together with probabilistic inference to identify changes in moral content. Using MFT-based moral quantification, this NLP approach will let us assign each blog post its moral scores and classify it accordingly.
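
To make dictionary-based moral scoring concrete, here is a minimal sketch of bag-of-words scoring against a lexicon. The six-word `TOY_LEXICON` and the `score_document` helper are hypothetical and only illustrate the principle; this is not the scoring code of any particular library.

```python
from collections import Counter

# Tiny hypothetical lexicon: word -> moral foundation category.
TOY_LEXICON = {
    "protect": "care.virtue",
    "harm": "care.vice",
    "justice": "fairness.virtue",
    "cheat": "fairness.vice",
    "loyal": "loyalty.virtue",
    "betray": "loyalty.vice",
}

def score_document(text):
    """Return the share of tokens that fall into each lexicon category."""
    tokens = text.lower().split()
    counts = Counter(TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {category: n / total for category, n in counts.items()}

scores = score_document("Leaders must protect citizens and deliver justice not harm")
print(scores)
```

A real Moral Foundations Dictionary covers far more words per foundation, but the per-category share computed here is the kind of score that the later labeling steps compare.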

#### Emotion Assessment

To predict emotions, we will use a bidirectional LSTM with a CNN and represent human emotions with Plutchik’s Wheel of Emotions, which will supply the labels for each document in the corpus. We intend to use labels such as joy, anger, sadness, and fear.

## Result Discussion



We first install the Python libraries needed for the study in the notebook.

Step 1.
Install nbformat for notebook formatting
Step 2
Install wordcloud, which will be used to visualize the top words from the LDA results

Step 3.

Install the Python LDA visualization library (pyLDAvis), which allows us to visualize the topic model

In [None]:
%pip install nbformat
# Install the wordcloud and pyLDAvis libraries
%pip install wordcloud
%pip install pyldavis

Step 4
Import pandas library for data massaging and manipulation

In [None]:
# Importing modules
import pandas as pd


Step 5
Load the collected blog data, which contains the text and title of each blog post.
This is done with pandas read_csv, which reads a CSV file from the specified path.
The resulting data frame is assigned to blogPosts, and its head method is called to preview the first rows.


In [None]:
# Read data into blogPosts
blogPosts = pd.read_csv('Indo-pacific-blog-data.csv')

# Print head
# This shows us the content of the crawled blog data for analysis purpose
blogPosts.head()

Step 6
Drop the columns that will not be useful for this study, namely categories and comments_url
The remaining columns are then saved to a CSV file

In [None]:
import os

# Remove the columns that are not useful for this study: 'categories' and 'comments_url'
blogPosts = blogPosts.drop(columns=['categories', 'comments_url'])

# Make sure the output directory exists, then save the reduced dataframe
os.makedirs("./results/lda", exist_ok=True)
blogPosts.to_csv("./results/lda/Indo-pacific-blog-post.csv")

# Print the first rows of blogPosts, now without 'categories' and 'comments_url'
blogPosts.head()

Step 7
Import the regular expression library to remove unwanted characters from each post.
The result is stored in a new column called blog_post_processed, which contains the post text without the unwanted characters

Step 8

All blog_post_processed text is converted to lowercase

Step 9

The preferred columns are then written to a CSV file, which is made up of the following 
columns ['blogpost_id','title','date','blogger','tags','sentiment','location','blog_post_processed']


In [None]:
# Load the regular expression library
import re

# Remove punctuation so that the data is cleaner when we run LDA on it
# We read the 'post' column and write the result to a new blog_post_processed column
blogPosts['blog_post_processed'] = \
blogPosts['post'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert all text in the processed column to lowercase
blogPosts['blog_post_processed'] = \
blogPosts['blog_post_processed'].map(lambda x: x.lower())

# Print the first few rows of the processed column
blogPosts['blog_post_processed'].head()
blog_post_processed_header = ['blogpost_id','title','date','blogger','tags','sentiment','location','blog_post_processed']
blogPosts.to_csv("./results/lda/Indo-pacific-processed-post.csv", columns=blog_post_processed_header)
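
The cleaning in Steps 7 to 9 can also be written with pandas string methods, which tolerate missing posts. The sketch below runs on a made-up three-row frame; the `posts` frame and its contents are hypothetical, and only the punctuation set and the column names mirror the cell above.

```python
import pandas as pd

# Toy frame standing in for the crawled data (hypothetical rows).
posts = pd.DataFrame({"post": ["Hello, World!", "Is this clean? Yes.", None]})

# Fill missing posts, strip the same punctuation set, then lowercase.
posts["blog_post_processed"] = (
    posts["post"]
    .fillna("")
    .str.replace(r"[,.!?]", "", regex=True)
    .str.lower()
)

print(posts["blog_post_processed"].tolist())
```

Filling missing values first matters because a lambda that calls `re.sub` on a missing value would raise an error, while the string accessor handles the whole column in one pass.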

Step 10

Import the word cloud library to visualize the top words 

The processed posts are joined into a single string, which is passed to the wordcloud library to generate a word cloud of the most frequent words

The word cloud's to_image method is then called to display the image of the top words, which can be seen in the output figure below


In [None]:
from wordcloud import WordCloud

# Join the processed posts into one string.
long_string = ','.join(list(blogPosts['blog_post_processed'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=1000, width=800, contour_width=2, contour_color='steelblue', height=800)

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

Step 11
Import the libraries required for LDA and the language toolkits used for NLP processing

Download the stop word list using nltk. It is used to remove unwanted stop words from the texts

Since the blog posts in this study are written in English, we use the English stop words

Next, we remove the stop words

The processed text is then stored in a data frame and saved to a CSV file 

Finally, we print the words produced by the steps above

In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; tokenization itself drops punctuation
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) 
             if word not in stop_words] for doc in texts]


data = blogPosts['blog_post_processed'].values.tolist()
data_words = list(sent_to_words(data))

# remove stop words
data_words = remove_stopwords(data_words)
# data_words
print(data_words[:1][0][:len(data_words)-1])
words = pd.DataFrame(data_words[:1][0][:len(data_words)-1])
words.to_csv("results/lda/Indo-pacific-processed-words.csv")
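
The stop word filter can be tried out without downloading anything. This sketch uses a small hand-picked stop word set; `TOY_STOP_WORDS` and `remove_toy_stopwords` are hypothetical names for the illustration, and NLTK's English list is much longer.

```python
# Small hand-picked stop word set (an assumption; NLTK's list is much longer).
TOY_STOP_WORDS = {"the", "is", "in", "and", "of", "a"}

def remove_toy_stopwords(tokenized_docs):
    """Drop stop words from each already-tokenized document."""
    return [[w for w in doc if w not in TOY_STOP_WORDS] for doc in tokenized_docs]

tokenized = [["the", "region", "is", "stable"], ["trade", "in", "the", "pacific"]]
filtered = remove_toy_stopwords(tokenized)
print(filtered)
```

Storing the stop words in a set rather than a list keeps each membership check fast, which adds up over a large corpus.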

Step 12
Next we create the dictionary of words for LDA and compute the term frequencies 
These form the bag-of-words corpus (a document-term matrix), which is stored as a CSV file

In [None]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1][0][:len(corpus)-1])
corpus_terms_m = pd.DataFrame(corpus[:1][0][:len(corpus)-1])
corpus_terms_m.to_csv("results/lda/Indo-pacific-processed-corpus_terms_m.csv")
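
To show what the dictionary and bag-of-words corpus contain, the sketch below rebuilds both in plain Python for two toy documents. `build_vocab` and `to_bow` are hypothetical helpers that mimic the idea only; gensim may assign different ids to the same tokens.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_docs = [["trade", "pacific", "trade"], ["pacific", "security"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a short list of (token id, count) pairs, which is the sparse representation the LDA model consumes.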

Step 13

We specify the number of topics of interest. We set it to 10, but it can be reduced or increased

We build the LDA model using the gensim LDA library

We then print the keywords for the ten topics and store the LDA output in a CSV file

In [None]:
from pprint import pprint

# number of topics
num_topics = 10

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
doc_ld_df = pd.DataFrame(doc_lda)
doc_ld_df.to_csv("results/lda/Indo-pacific-processed-doc_lda.csv")
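
Indexing a gensim LDA model with a corpus yields, for each document, a list of (topic id, probability) pairs. One way to turn that into a tidy table is to keep the dominant topic per document, sketched below on hand-written values; `toy_doc_topics` is hypothetical data, not output from the model above.

```python
import pandas as pd

# Hypothetical per-document topic distributions in (topic_id, probability) form.
toy_doc_topics = [
    [(0, 0.7), (3, 0.3)],
    [(1, 0.2), (2, 0.8)],
]

rows = []
for doc_id, topic_probs in enumerate(toy_doc_topics):
    # Pick the topic with the highest probability for this document.
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    rows.append({"doc_id": doc_id, "dominant_topic": topic_id, "probability": prob})

dominant = pd.DataFrame(rows)
print(dominant)
```

A table like this has one row per post, so it can be joined to the blog post data as a topic label.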

Step 14
We generate an HTML visualization of the LDA results using the pyLDAvis library.
This makes the LDA results easier to understand. We can also increase or reduce the threshold in the visualization to see the words or terms 
that exist across topics.

The output is also stored in an HTML file

In [None]:
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

from pprint import pprint

import spacy

import os
import pickle
import re
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

import matplotlib.pyplot as plt

# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_data_filepath = os.path.join('.', 'results', 'lda', 'ldavis_prepared_' + str(num_topics))
# this is a bit time consuming - keep the condition True
# if you want to execute visualization prep yourself
if True:
    LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

pyLDAvis.save_html(LDAvis_prepared, './results/lda/ldavis_prepared_'+ str(num_topics) +'.html')

LDAvis_prepared

#### Morality Analysis

Step 15

To further characterize the data, we apply moral foundations scoring, which lets us generate more categories or labels for the data. 

The first step is to install the seaborn library

Then import pandas and the other libraries

In [None]:
%pip install seaborn

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

Step 16
We create an input template for parsing the data and read it into a dataframe using pandas
We then visualize the head, or top section, of the dataframe.

In [None]:
template_input = pd.read_csv('data/Indo-pacific-blog-data_morality.csv', header=None)
template_input.head()

Step 17
We use the emfdscore library to compute the morality scores for the data
We set DICT_TYPE to 'mfd' so that the original Moral Foundations Dictionary is used as the morality model
We set the scoring method to bag of words and specify the output path where the result is stored.

In [None]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'mfd'
PROB_MAP = ''
SCORE_METHOD = 'bow'
OUT_METRICS = ''
OUT_CSV_PATH = 'mfd.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

Step 18
We use the pandas library to read the mfd output
We then visualize the top section of the results

In [None]:
# Inspect output 
mfd = pd.read_csv('mfd.csv')
mfd.head()

Step 19
To label the data, we take the foundation with the highest score in each row as the label for the morality vice and virtue columns.
The vice and virtue labels will then be used to characterize the data. The helper methods below let pandas fill each new column with the corresponding label.

In [None]:
def label_morality_data(rowData):
    # Map each foundation to its vice score for this row
    vice = {
        'care': rowData["care.vice"],
        'fairness': rowData["fairness.vice"],
        'loyalty': rowData["loyalty.vice"],
        'authority': rowData["authority.vice"],
        'sanctity': rowData["sanctity.vice"]
    }
    # Map each foundation to its virtue score for this row
    virtue = {
        'care': rowData["care.virtue"],
        'fairness': rowData["fairness.virtue"],
        'loyalty': rowData["loyalty.virtue"],
        'authority': rowData["authority.virtue"],
        'sanctity': rowData["sanctity.virtue"]
    }
    # Return the names of the highest-scoring vice and virtue foundations
    return max(vice, key=vice.get), max(virtue, key=virtue.get)

def vice_label(rowData):
    (vice, virtue) = label_morality_data(rowData=rowData)
    return vice

def virtue_label(rowData):
    (vice, virtue) = label_morality_data(rowData=rowData)
    return virtue
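
The same labeling can be expressed with pandas' built-in `idxmax`, which returns the name of the highest-scoring column in each row. The sketch below is one possible alternative, shown on a made-up two-row frame that copies only the column layout; the values and the `toy` frame are hypothetical.

```python
import pandas as pd

FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

# Toy scores in the same column layout as the scored output (values are made up).
toy = pd.DataFrame({
    "care.vice": [0.1, 0.0], "fairness.vice": [0.3, 0.1], "loyalty.vice": [0.0, 0.2],
    "authority.vice": [0.2, 0.0], "sanctity.vice": [0.0, 0.1],
    "care.virtue": [0.5, 0.1], "fairness.virtue": [0.1, 0.4], "loyalty.virtue": [0.0, 0.0],
    "authority.virtue": [0.2, 0.3], "sanctity.virtue": [0.1, 0.0],
})

for kind in ("vice", "virtue"):
    cols = [f"{name}.{kind}" for name in FOUNDATIONS]
    # idxmax returns the winning column name; strip the ".vice"/".virtue" suffix.
    toy[kind] = toy[cols].idxmax(axis=1).str.replace(f".{kind}", "", regex=False)
    # max gives the score that belongs to that winning column.
    toy[f"{kind}_score"] = toy[cols].max(axis=1)

print(toy[["vice", "vice_score", "virtue", "virtue_score"]])
```

This produces both the label and its score in one loop and avoids calling a Python function once per row.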

Step 20
We apply the methods created above to the mfd dataframe to fill the new columns with the vice and virtue labels.

This gives the mfd dataframe two new columns, vice and virtue, with their appropriate labels

In [None]:
# care.vice	fairness.vice	loyalty.vice	authority.vice	sanctity.vice
mfd['vice'] = mfd.apply(lambda rowData: vice_label(rowData), axis=1)
mfd['virtue'] = mfd.apply(lambda rowData: virtue_label(rowData), axis=1)

Step 21
We visualize the head of the dataframe to show the first five data rows

In [None]:
mfd.head()

Step 22
To record the scores behind the vice and virtue labels, we take the highest vice score and the highest virtue score in each row.
These scores will then be used to characterize the data. The helper methods below let pandas fill each new column with the corresponding score.

In [None]:
def label_morality_score(rowData):
    # Map each foundation to its vice score for this row
    vice = {
        'care': rowData["care.vice"],
        'fairness': rowData["fairness.vice"],
        'loyalty': rowData["loyalty.vice"],
        'authority': rowData["authority.vice"],
        'sanctity': rowData["sanctity.vice"]
    }
    # Map each foundation to its virtue score for this row
    virtue = {
        'care': rowData["care.virtue"],
        'fairness': rowData["fairness.virtue"],
        'loyalty': rowData["loyalty.virtue"],
        'authority': rowData["authority.virtue"],
        'sanctity': rowData["sanctity.virtue"]
    }
    vice_key = max(vice, key=vice.get)
    virtue_key = max(virtue, key=virtue.get)
    # Return the highest vice score and the highest virtue score
    return vice[vice_key], virtue[virtue_key]

def vice_score(rowData):
    (vice, virtue) = label_morality_score(rowData=rowData)
    return vice

def virtue_score(rowData):
    (vice, virtue) = label_morality_score(rowData=rowData)
    return virtue

Step 23
We apply the methods created above to the mfd dataframe to fill the new columns with the top vice and virtue scores.

This gives the mfd dataframe two new columns, vice_score and virtue_score, with their appropriate values

In [None]:
mfd['vice_score'] = mfd.apply(lambda rowData: vice_score(rowData), axis=1)
mfd['virtue_score'] = mfd.apply(lambda rowData: virtue_score(rowData), axis=1)

Step 24
We visualize the head of the dataframe to show the first five data rows

In [None]:
mfd.head()

#### Emotion Analysis

Step 25
Install the libraries needed for emotion analysis, including tensorflow and keras.
We also run pip install for the EmoNet NLP library, which will be used to compute the emotion scores that label the blog data.

In [None]:
%pip install tensorflow
%pip install keras
%pip install torch torchvision torchaudio

%pip install git+https://github.com/UBC-NLP/EmoNet.git

Step 26
Import the EmoNet class from the installed emonet library
Import pandas for manipulating dataframe objects
We then create an instance of EmoNet

In [None]:
from emonet import EmoNet
import pandas as pd 
em = EmoNet()

Step 27
We read the blog post data that was processed and saved in the earlier steps
and store it in the blogPosts dataframe. We create a method called predict_labels, which returns the predicted emotion label so that we can label each row of the blog post data with an emotion description and, in the next step, an emotion score.


In [None]:

blogPosts = pd.read_csv('results/lda/Indo-pacific-processed-post.csv')
processedBlogPost =  blogPosts.blog_post_processed
def predict_labels(post):
    predictions = em.predict(post)
    predictions = predictions[0]
    (label, score) = predictions 
    return label

Step 28
The predict_scores method returns the score of the predicted emotion, which will be used to update the pandas dataframe

In [None]:
def predict_scores(post):
    predictions = em.predict(post)
    predictions = predictions[0]
    (label, score) = predictions 
    return score

Step 29
We generate the emotion label and emotion score fields by calling the predict_labels and predict_scores methods created above, combine the result with the morality data, and store it in the updated blog post table.

In [None]:
blogPosts['emotion'] = blogPosts.apply(lambda row: predict_labels(row['blog_post_processed']), axis=1)
blogPosts['emotions_score'] = blogPosts.apply(lambda row: predict_scores(row['blog_post_processed']), axis=1)
blogPosts_result =pd.concat([blogPosts, mfd], axis=1)
# blogPosts_result = pd.merge(blogPosts,mfd, how='outer', on =)
blogPosts_result=blogPosts_result.drop(columns=['Unnamed: 0'], axis=1)
blogPosts_result.to_csv('results/emos/Indo-pacific-processed-post-emotions.csv')
blogPosts_result.head()
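
Because the label and the score come from the same prediction, the model only needs to run once per post. The sketch below shows that pattern with a stand-in predictor; `fake_predict` is hypothetical and simply assumes the model returns a list of (label, score) pairs, as the unpacking in predict_labels implies.

```python
import pandas as pd

def fake_predict(text):
    """Hypothetical stand-in for an emotion model: returns [(label, score)]."""
    return [("joy", 0.9)] if "good" in text else [("sadness", 0.6)]

toy_posts = pd.DataFrame({"blog_post_processed": ["a good day", "a long wait"]})

# Run the predictor once per post and keep the top (label, score) pair.
top = toy_posts["blog_post_processed"].map(lambda text: fake_predict(text)[0])
toy_posts["emotion"] = top.map(lambda pair: pair[0])
toy_posts["emotions_score"] = top.map(lambda pair: pair[1])

print(toy_posts)
```

With a real neural model, halving the number of predictions can noticeably shorten this step on a large dataset.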

#### Blog Text Classification Using Logistic Regression


Step 30
This step uses logistic regression to characterize the blog data and generates the confusion matrix for each result.
We ran three different characterizations: emotion, vice, and virtue; the latter two are computed from the morality analysis using the moral foundations framework.

We used the following libraries for this purpose: numpy, pandas, sklearn, and nltk 

In [None]:
import numpy as np
import pandas as pd

import os

Step 31

This step imports the libraries required for plotting and visualizing the data, vectorization, and term frequency computation
We also import the regular expression library and download the appropriate stopwords

In [None]:
import re
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sqlite3 import Error
from sklearn.ensemble import RandomForestClassifier
import sqlite3
import pickle
import nltk
nltk.download('stopwords')
%matplotlib inline

Step 32
We read the processed blog post data, which has been labelled with the emotion analysis and morality results, drop the column that is not needed, and display the first five rows.

In [None]:
dataset = pd.read_csv('results/emos/Indo-pacific-processed-post-emotions.csv')
dataset=dataset.drop(columns=['Unnamed: 0'], axis=1)
dataset.head()


Step 33
We group the dataset by the emotion column and plot the emotions_score values for each group

The plot is drawn with matplotlib

In [None]:
dataset.groupby('emotion').emotions_score.plot.bar(ylim=0)
plt.figure(figsize=(10, 10))
# plt.show()

Step 34
This step downloads the stopwords; we choose English since the majority of the dataset is in English
We then apply a regular expression to remove characters that are not needed, drop the stopwords, and stem the remaining words

In [None]:
nltk.download('stopwords')
stemmer = PorterStemmer()
words = stopwords.words("english")
dataset['cleaned'] = dataset['blog_post_processed'].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).split() if i not in words]).lower())

Step 35
Show the first five rows of the dataset

In [None]:
dataset.head()

Step 36
This step vectorizes the cleaned text using TF-IDF (term frequency-inverse document frequency) over unigrams and bigrams, fitting and transforming the data to generate a feature matrix from the dataset

In [None]:
vectorizer = TfidfVectorizer(min_df= 3, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
final_features = vectorizer.fit_transform(dataset['cleaned']).toarray()
final_features.shape
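
To see what the vectorizer produces, the sketch below fits it on three short made-up texts. The `toy_texts` list is hypothetical, and `min_df=1` is used here only because the toy corpus is tiny; the other settings follow the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_texts = [
    "pacific trade deal",
    "pacific security talks",
    "trade talks resume",
]

# Unigrams and bigrams, as in the step above; min_df=1 because the corpus is tiny.
toy_vectorizer = TfidfVectorizer(min_df=1, sublinear_tf=True, norm="l2", ngram_range=(1, 2))
toy_features = toy_vectorizer.fit_transform(toy_texts)

# Rows are documents; columns are the distinct unigrams and bigrams.
print(toy_features.shape)
print(sorted(toy_vectorizer.vocabulary_)[:5])
```

The result is a sparse matrix, which can be passed to scikit-learn estimators directly; converting it with `toarray()` is only needed when a dense array is required.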

Step 37
We import logistic regression from sklearn
and use sklearn's confusion matrix utilities to generate the confusion matrix.
After importing the required libraries, we define a method, characterized_by_selected_fields.
It takes the dataset and the names of the columns to use as X (features) and y (label).
We split the data into training and test sets, with the test size set to 25% of our input data.

We then fit the pipeline on the training data.

We then generate the classification report and confusion matrix for the specified label.  

We also use matplotlib to plot the confusion matrix.

Here the method generates the blog text classification using the emotion label produced by the emotion analysis,
by calling characterized_by_selected_fields and passing the dataset and the columns of interest

In [None]:
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

def characterized_by_selected_fields(dataset, x_name, y_name):
    X = dataset[x_name]
    Y = dataset[y_name]
    # Hold out 25% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

    # TF-IDF features -> chi-squared feature selection -> logistic regression
    pipeline = Pipeline([('vect', vectorizer),
                        ('chi',  SelectKBest(chi2, k=1200)),
                        ('clf', LogisticRegression(random_state=0))])

    model = pipeline.fit(X_train, y_train)
    with open('LogisticRegression.pickle', 'wb') as f:
        pickle.dump(model, f)

    ytest = np.array(y_test)
    y_pred = model.predict(X_test)

    # confusion matrix and classification report (precision, recall, F1-score)
    print(classification_report(ytest, y_pred))
    print(confusion_matrix(ytest, y_pred))
    fig, ax = plt.subplots(figsize=(15, 15))
    # ConfusionMatrixDisplay is used because scikit-learn 1.2+ no longer provides plot_confusion_matrix
    ConfusionMatrixDisplay.from_estimator(model, X_test, ytest, ax=ax)
    plt.show()
    
characterized_by_selected_fields(dataset,'cleaned','emotion')
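
The split above is random, so the reported metrics vary from run to run, and a rare label can end up missing from the test set. A minimal sketch of a stratified, seeded split on made-up data follows; the `texts` and `labels` lists are hypothetical, and the feature selection step is omitted because the toy vocabulary is too small for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical labelled texts; the real run uses the 'cleaned' column.
texts = ["happy joyful day", "joyful happy news", "great happy win", "happy bright joy",
         "sad gloomy loss", "gloomy sad news", "terrible sad day", "sad dark grief"]
labels = ["joy"] * 4 + ["sadness"] * 4

# Stratify so both classes appear in train and test; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

toy_model = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=0)),
])
toy_model.fit(X_train, y_train)
print(toy_model.predict(X_test))
```

Passing `stratify` keeps the class proportions the same in both splits, and `random_state` makes the classification report reproducible across runs.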


Step 38
This is used to generate the blog text classification using the label of the vice which is generated from the morality data
by calling the characterized_by_selected_fields method and passing the dataset and the columns of interest

In [None]:
characterized_by_selected_fields(dataset,'cleaned','vice')

Step 39
This is used to generate the blog text classification using the label of the virtue which is generated from the morality data
by calling the characterized_by_selected_fields method and passing the dataset and the columns of interest

In [None]:
characterized_by_selected_fields(dataset,'cleaned','virtue')