# Machine Learning Final Project
# Blog Content Characterization (Morality, Emotion Analysis, Topic Analysis)

Abstract
===============
Blogs have become a part of daily life: we read newsletters and journals, document our own stories, and follow the stories of others. 
Blogs are among the important developments of Web 2.0, and the format extends to social media such as Twitter and Facebook posts. <br>Because blogs are so widely read, their content shapes both audience and writer every day. 
Audiences are influenced while writers become influential.<br> Characterizing blog content means analyzing a blog and labeling it with social metrics derived through machine learning. With this in mind, this work applies available machine learning approaches and libraries to characterize blogs and their content. The goal is to help users understand the influence of a blog or its author, and to attach labels to each blog that help readers decode an underlying message they might not infer on their own.

## Introduction

Many audiences read blogs and news articles online, and users usually bookmark their preferred blogs and subscribe to RSS (Really Simple Syndication) feeds from these blogs. <br>
Readers are often unaware that an author is influencing them, while some authors have little sense of how much their influence is growing. <br>
Posts on these blogs often contain words or information that can classify them more reliably than the tags the author assigned when creating the website. <br>

Topic analysis determines a text's topic structure: which topics appear in the text and how they change as the text progresses. <br>It consists of two main tasks: topic identification and text segmentation (based on topic changes). 
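
As an illustration of the segmentation task, the sketch below splits a list of sentences wherever the word overlap between neighbouring sentences drops. It is a toy example under stated assumptions, not the method used in this project: the function names, the example sentences, and the `threshold` value are all hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    # Lowercase the sentence and count its alphabetic tokens
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def bow_cosine(a, b):
    # Cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def segment(sentences, threshold=0.1):
    # Start a new segment whenever adjacent sentences share too few words
    # (threshold is an illustrative value, not a tuned one)
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if bow_cosine(bag_of_words(prev), bag_of_words(curr)) < threshold:
            segments.append([curr])
        else:
            segments[-1].append(curr)
    return segments

sentences = [
    "The navy sent ships to the strait.",
    "More navy ships arrived in the strait today.",
    "Trade tariffs on rice exports rose.",
    "Rice exports fell after the tariffs.",
]
segments = segment(sentences)
print(len(segments))
```

A real system would usually compare windows of sentences rather than single sentences and remove stop words first, since common words such as "the" can mask a topic change.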

Emotions can be expressed verbally through emotional vocabulary or through nonverbal cues like intonation of voice, facial expressions, and gestures, all of which play an important role in human communication. <br>Most human-computer interaction (HCI) systems lack emotional intelligence and are incapable of interpreting emotions. For blog information retrieval, it is essential to characterize blog content using relevant, dependable, and distinguishing tags.

Although some authors deliberately set out to influence their audience, most blog authors see themselves as simply expressing their point of view and writing about things they are passionate about.<br> Running social analysis and using machine learning to classify these blogs will therefore help readers understand the author, and help authors understand their own content and why their following is growing or shrinking. <br>This is why we introduce combined social analysis as a way of characterizing the blog data collected for this study.


## Methodology

This section briefly describes each method and tool used in the study. 
First, we will collect about two years of blog data by crawling blogs of interest. We will not filter the collection by keyword, because we intend to describe or label these blogs using the results of the morality, topic, and emotion analyses. These labels will then let us train a classification model on the blogs.

### Dataset Description

The dataset for this study was crawled with a crawler tool focused on Indo-Pacific blogs, i.e., blogs that discuss key issues related to the Indo-Pacific region. The collected posts are stored in a CSV file and uploaded to the GitHub repository linked below, which will also house the project's source code. 

https://github.com/nakinnubis/inpacblogdata

### Algorithm

#### Topic Analysis

We will use Latent Semantic Analysis (LSA), an NLP technique, for topic analysis. LSA applies singular value decomposition to a term-document matrix to place documents and words in a shared semantic space, which suits our goal of characterizing blog post content.
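
To make the role of singular value decomposition concrete, here is a minimal LSA sketch using only NumPy. The four-document corpus and the choice of `k = 2` dimensions are assumptions made for illustration; they are not taken from the project data.

```python
import numpy as np

# Toy corpus (hypothetical); each string stands in for one blog post
docs = [
    "navy ships patrol the sea",
    "navy ships visit the port",
    "trade tariffs hurt exports",
    "exports and trade grow",
]

# Term-document count matrix: rows are terms, columns are documents
vocab = sorted({word for doc in docs for word in doc.split()})
term_index = {term: i for i, term in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        counts[term_index[word], j] += 1

# Truncated SVD: keep only the k largest singular values
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

def cosine(a, b):
    # Cosine similarity between two document vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_vectors[0], doc_vectors[1]))  # two naval posts
print(cosine(doc_vectors[0], doc_vectors[2]))  # naval post vs. trade post
```

In the reduced space the two naval posts point in the same direction while the naval and trade posts are nearly orthogonal, which is the property that makes the space useful for grouping posts by topic.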


#### Morality Analysis

For morality assessment, we will use Moral Foundations Theory (MFT) together with probabilistic inference to identify changes in moral content. MFT-based moral quantification is an NLP approach that lets us assign each blog post a score for each moral foundation.
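
The idea of dictionary-based moral scoring can be sketched as follows. The lexicon here is a tiny hypothetical one invented for this example; it is not the MFD or eMFD dictionary, and the scoring rule (share of matched words per category) is a simplifying assumption rather than the exact procedure of any library.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon mapping words to foundation categories.
# It is NOT taken from the real MFD/eMFD dictionaries.
TOY_LEXICON = {
    "protect": "care.virtue",
    "harm": "care.vice",
    "justice": "fairness.virtue",
    "cheat": "fairness.vice",
    "ally": "loyalty.virtue",
    "betray": "loyalty.vice",
}

def score_post(text):
    # Count lexicon hits per category and normalize by the total number of hits
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = Counter(TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON)
    total = sum(hits.values())
    categories = sorted(set(TOY_LEXICON.values()))
    if total == 0:
        return {c: 0.0 for c in categories}
    return {c: hits[c] / total for c in categories}

scores = score_post("They betray every ally and harm civilians; no justice, only harm.")
print(scores)
```

Each post then gets one score per category, and the category with the highest score can serve as a label for the post.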

#### Emotion Assessment

To predict emotions, we will use a bidirectional LSTM combined with a CNN, and represent human emotions with Plutchik's Wheel of Emotions, whose categories form the labels for each document in the corpus. We intend to use labels such as joy, anger, sadness, and fear.

## Result Discussion



In [None]:
%pip install nbformat
# Import the wordcloud library
%pip install wordcloud
%pip install pyldavis

In [None]:
# Importing modules
import pandas as pd
# import os

# os.chdir('..')

In [None]:
# Read data into blogPosts
blogPosts = pd.read_csv('Indo-pacific-blog-data.csv')

# Print head
# This shows us the content of the crawled blog data for analysis purpose
blogPosts.head()

In [None]:
# Remove columns that are not useful for this study: 'categories' and 'comments_url'
blogPosts = blogPosts.drop(columns=['categories', 'comments_url'])

# Show the first rows of the updated DataFrame
blogPosts.head()

# Make sure the output directory exists before writing to it
import os
os.makedirs("./results/lda", exist_ok=True)
blogPosts.to_csv("./results/lda/Indo-pacific-blog-post.csv")

In [None]:
# Load the regular expression library
import re

# Strip punctuation so the text is cleaner before we run LDA on it.
# The cleaned text from the 'post' column goes into a new 'blog_post_processed' column.
blogPosts['blog_post_processed'] = \
blogPosts['post'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Convert the processed text to lowercase
blogPosts['blog_post_processed'] = \
blogPosts['blog_post_processed'].map(lambda x: x.lower())

# Show the first few rows of the processed column
blogPosts['blog_post_processed'].head()
blog_post_processed_header = ['blogpost_id','title','date','blogger','tags','sentiment','location','blog_post_processed']
blogPosts.to_csv("./results/lda/Indo-pacific-processed-post.csv", columns=blog_post_processed_header)

In [None]:
from wordcloud import WordCloud

# Join the processed posts into one long string.
long_string = ','.join(list(blogPosts['blog_post_processed'].values))

# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=1000, width=800, contour_width=2, contour_color='steelblue', height=800)

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

In [None]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess also lowercases
        # the text and drops punctuation while tokenizing
        yield simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts):
    # texts is a list of token lists, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]


data = blogPosts['blog_post_processed'].values.tolist()
data_words = list(sent_to_words(data))

# remove stop words
data_words = remove_stopwords(data_words)

# Preview and save the tokens of the first document
first_doc_words = data_words[0]
print(first_doc_words[:30])
words = pd.DataFrame(first_doc_words)
words.to_csv("results/lda/Indo-pacific-processed-words.csv")

In [None]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)

# Create Corpus
texts = data_words

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

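# View and save the bag-of-words (term id, count) pairs of the first document
first_doc_bow = corpus[0]
print(first_doc_bow[:30])
corpus_terms_m = pd.DataFrame(first_doc_bow, columns=['term_id', 'count'])
corpus_terms_m.to_csv("results/lda/Indo-pacific-processed-corpus_terms_m.csv")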

In [None]:
from pprint import pprint

# number of topics
num_topics = 10

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

# Print the keywords for each of the 10 topics
pprint(lda_model.print_topics())

# Topic distribution of each document as (topic id, probability) pairs
doc_lda = lda_model[corpus]
doc_ld_df = pd.DataFrame(list(doc_lda))
doc_ld_df.to_csv("results/lda/Indo-pacific-processed-doc_lda.csv")

In [None]:
import os
import pickle

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_data_filepath = os.path.join('./results/lda', 'ldavis_prepared_' + str(num_topics))

# Preparing the visualization is time consuming; set this flag to False
# to reuse the pickled result from a previous run
prepare_visualization = True
if prepare_visualization:
    LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

pyLDAvis.save_html(LDAvis_prepared, LDAvis_data_filepath + '.html')

LDAvis_prepared

#### Morality Analysis

In [None]:
%pip install seaborn

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
template_input = pd.read_csv('data/Indo-pacific-blog-data_morality.csv', header=None)
template_input.head()

In [None]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'emfd'
PROB_MAP = 'all'
SCORE_METHOD = 'bow'
OUT_METRICS = 'vice-virtue'
OUT_CSV_PATH = 'all-vv.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

In [None]:
# Inspect output 
all_vv = pd.read_csv('all-vv.csv')
all_vv.head()

In [None]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'emfd'
PROB_MAP = 'single'
SCORE_METHOD = 'bow'
OUT_METRICS = 'vice-virtue'
OUT_CSV_PATH = 'single-vv.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

In [None]:
# Inspect output 
single_vv = pd.read_csv('single-vv.csv')
single_vv.head()

In [None]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'mfd2'
PROB_MAP = ''
SCORE_METHOD = 'bow'
OUT_METRICS = ''
OUT_CSV_PATH = 'mfd2.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

In [None]:
# Inspect output 
mfd2 = pd.read_csv('mfd2.csv')
mfd2.head()

In [None]:
from emfdscore.scoring import score_docs 

num_docs = len(template_input)

DICT_TYPE = 'mfd'
PROB_MAP = ''
SCORE_METHOD = 'bow'
OUT_METRICS = ''
OUT_CSV_PATH = 'mfd.csv'

df = score_docs(template_input,DICT_TYPE,PROB_MAP,SCORE_METHOD,OUT_METRICS,num_docs)
df.to_csv(OUT_CSV_PATH, index=False)

In [51]:
# Inspect output 
mfd = pd.read_csv('mfd.csv')
mfd.head()

Unnamed: 0,care.virtue,fairness.virtue,loyalty.virtue,authority.virtue,sanctity.virtue,care.vice,fairness.vice,loyalty.vice,authority.vice,sanctity.vice,moral,moral_nonmoral_ratio,f_var
0,0.058824,0.117647,0.117647,0.117647,0.058824,0.352941,0.0,0.058824,0.0,0.058824,0.058824,0.045213,0.01015
1,0.2,0.0,0.133333,0.2,0.066667,0.233333,0.0,0.1,0.0,0.0,0.166667,0.041725,0.00884
2,0.333333,0.0,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.029703,0.025926
3,0.117647,0.0,0.0,0.058824,0.117647,0.294118,0.0,0.117647,0.176471,0.0,0.117647,0.046832,0.009419
4,0.027027,0.027027,0.162162,0.027027,0.0,0.540541,0.0,0.189189,0.0,0.0,0.027027,0.094148,0.029089


In [52]:
def label_morality_data(rowData):
    # Collect the vice and virtue scores of each moral foundation
    vice = {
        'care': rowData["care.vice"],
        'fairness': rowData["fairness.vice"],
        'loyalty': rowData["loyalty.vice"],
        'authority': rowData["authority.vice"],
        'sanctity': rowData["sanctity.vice"]
    }
    virtue = {
        'care': rowData["care.virtue"],
        'fairness': rowData["fairness.virtue"],
        'loyalty': rowData["loyalty.virtue"],
        'authority': rowData["authority.virtue"],
        'sanctity': rowData["sanctity.virtue"]
    }
    # Return the highest-scoring foundation on each side; ties go to the first key
    return max(vice, key=vice.get), max(virtue, key=virtue.get)

def vice_label(rowData):
   (vice, virtue) = label_morality_data(rowData=rowData)
   return vice

def virtue_label(rowData):
   (vice, virtue) = label_morality_data(rowData=rowData)
   return virtue

In [53]:
# Label each post with its dominant vice and virtue foundation
mfd['vice'] = mfd.apply(vice_label, axis=1)
mfd['virtue'] = mfd.apply(virtue_label, axis=1)

In [54]:
mfd.head()

Unnamed: 0,care.virtue,fairness.virtue,loyalty.virtue,authority.virtue,sanctity.virtue,care.vice,fairness.vice,loyalty.vice,authority.vice,sanctity.vice,moral,moral_nonmoral_ratio,f_var,vice,virtue
0,0.058824,0.117647,0.117647,0.117647,0.058824,0.352941,0.0,0.058824,0.0,0.058824,0.058824,0.045213,0.01015,care,fairness
1,0.2,0.0,0.133333,0.2,0.066667,0.233333,0.0,0.1,0.0,0.0,0.166667,0.041725,0.00884,care,care
2,0.333333,0.0,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.029703,0.025926,care,care
3,0.117647,0.0,0.0,0.058824,0.117647,0.294118,0.0,0.117647,0.176471,0.0,0.117647,0.046832,0.009419,care,care
4,0.027027,0.027027,0.162162,0.027027,0.0,0.540541,0.0,0.189189,0.0,0.0,0.027027,0.094148,0.029089,care,loyalty
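
The same dominant-foundation labelling can also be written without a row-wise `apply`. The sketch below is one possible alternative, not part of the original notebook; it uses `DataFrame.idxmax` on a small hypothetical frame whose values are a subset of the columns from rows 0 and 4 of the output above.

```python
import pandas as pd

# Small stand-in frame (subset of columns, values copied from rows 0 and 4 above)
scores = pd.DataFrame({
    "care.virtue": [0.058824, 0.027027],
    "fairness.virtue": [0.117647, 0.027027],
    "loyalty.virtue": [0.117647, 0.162162],
    "care.vice": [0.352941, 0.540541],
    "loyalty.vice": [0.058824, 0.189189],
})

virtue_cols = [c for c in scores.columns if c.endswith(".virtue")]
vice_cols = [c for c in scores.columns if c.endswith(".vice")]

# idxmax returns the column name of the row-wise maximum (first column on ties);
# stripping the suffix leaves just the foundation name
scores["virtue"] = scores[virtue_cols].idxmax(axis=1).str.replace(".virtue", "", regex=False)
scores["vice"] = scores[vice_cols].idxmax(axis=1).str.replace(".vice", "", regex=False)
print(scores[["vice", "virtue"]])
```

Note that in the first row the fairness and loyalty virtue scores are tied, so the label depends on column order. A post with tied scores arguably has no single dominant foundation, which is worth keeping in mind when interpreting the labels.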


In [None]:
def label_morality_score(rowData):
    # Placeholder: not implemented yet
    pass

# Emotion Analysis

In [None]:
%pip install tensorflow
%pip install keras
%pip install torch torchvision torchaudio

%pip install git+https://github.com/UBC-NLP/EmoNet.git

In [None]:
from emonet import EmoNet
import pandas as pd 
em = EmoNet()

In [None]:
# Load the processed posts (CSV) and predict an emotion for each one
blogPosts = pd.read_csv('results/lda/Indo-pacific-processed-post.csv')
processedBlogPost = blogPosts.blog_post_processed

def predict_labels(post):
    # The first prediction is treated as a (label, score) pair; keep the label
    predictions = em.predict(post)
    label, score = predictions[0]
    return label

In [None]:
def predict_scores(post):
    predictions = em.predict(post)
    predictions = predictions[0]
    (label, score) = predictions 
    return score

In [None]:
blogPosts['emotion'] = blogPosts.apply(lambda row: predict_labels(row['blog_post_processed']), axis=1)
blogPosts['emotions_score'] = blogPosts.apply(lambda row: predict_scores(row['blog_post_processed']), axis=1)
blogPosts.head()

blogPosts.to_csv('results/emos/Indo-pacific-processed-post-emotions.csv')

# Blog Text-Classification using Logistic Regression


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [None]:
import re
import pickle

import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

nltk.download('stopwords')
%matplotlib inline

In [None]:
dataset = pd.read_csv('data/Indo-pacific-processed-post-emotions.csv')
dataset.head()

In [None]:
# Drop the leftover index columns created by earlier CSV round trips
dataset = dataset.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'])

In [None]:
# Create the figure first so the bars are drawn on it at the requested size
plt.figure(figsize=(10, 10))
dataset.groupby('emotion').emotions_score.plot.bar(ylim=0)
plt.show()

In [None]:
stemmer = PorterStemmer()
stop_word_set = set(stopwords.words("english"))
# Keep letters only, lowercase, drop stop words, then stem each remaining word
dataset['cleaned'] = dataset['blog_post_processed'].apply(
    lambda x: " ".join(stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", str(x)).lower().split() if i not in stop_word_set))

In [None]:
dataset.head()

In [None]:
vectorizer = TfidfVectorizer(min_df= 3, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
final_features = vectorizer.fit_transform(dataset['cleaned']).toarray()
final_features.shape

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

X = dataset['cleaned']
Y = dataset['emotion']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

# TF-IDF features -> chi-squared selection of the 1200 best features -> logistic regression
pipeline = Pipeline([('vect', vectorizer),
                     ('chi',  SelectKBest(chi2, k=1200)),
                     ('clf', LogisticRegression(random_state=0))])

model = pipeline.fit(X_train, y_train)
with open('LogisticRegression.pickle', 'wb') as f:
    pickle.dump(model, f)

ytest = np.array(y_test)
y_pred = model.predict(X_test)

# classification report (precision, recall, F1-score) and confusion matrix
print(classification_report(ytest, y_pred))
print(confusion_matrix(ytest, y_pred))
fig, ax = plt.subplots(figsize=(15, 15))
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, ytest, ax=ax)
plt.show()
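
Once a pipeline like the one above has been pickled, it can be restored and applied to new posts. The sketch below shows the round trip on a toy dataset; the texts, labels, and the small `k=5` are hypothetical values chosen so the example runs on its own, and the pickle round trip happens in memory rather than through a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical); real input would be the 'cleaned' posts
texts = [
    "great joy and celebration in the city",
    "people celebrate with joy and happy songs",
    "happy crowds share joy at the festival",
    "anger and fury over the attack",
    "furious protest shows anger at leaders",
    "anger grows after furious speech",
]
labels = ["joy", "joy", "joy", "anger", "anger", "anger"]

# Same three stages as the notebook pipeline, with a much smaller k
toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("chi", SelectKBest(chi2, k=5)),
    ("clf", LogisticRegression(random_state=0)),
])
toy_pipeline.fit(texts, labels)

# Serialize and restore the fitted pipeline, then classify unseen posts
restored = pickle.loads(pickle.dumps(toy_pipeline))
new_posts = ["so much joy today", "their anger is rising"]
predictions = restored.predict(new_posts)
print(list(predictions))
```

Because the vectorizer and feature selector are stored inside the pipeline, new posts only need the same text cleaning as the training data before being passed to `predict`.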