<a href="https://colab.research.google.com/github/punkmic/unsupervised-Sentiment-Analysis---Comparisen-analysis/blob/master/Unsupervised_Sentiment_Analysis_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Intro**

## **Install Dependencies**

In [1]:
%%capture
# install dependencies here
#!pip install langdetect  # for language detection
#!pip install diagrams # for visualizing the workflow
#!pip install graphviz # for visualizing the workflow
#!pip install Pillow # for image manipulation
!pip install textblob # for unsupervised sentiment analysis
!pip install wordcloud # for word cloud plots
!pip install matplotlib # for plotting
!pip install nltk # for natural language preprocessing
!pip install enelvo # for fixing slang, abbreviations, and spelling errors
!pip install gensim # for topic modeling
!pip install tabulate # for printing tables
!pip install transformers # for machine learning
!pip install numpy==1.21.6 # for numerical computing
!pip install pyldavis # for model visualization
!pip install scikit-learn # for machine learning
!pip install bertopic # for topic modeling

## **Load Dependencies**

In [2]:
%%capture
# load dependencies here
#from langdetect import detect as dt
#from diagrams import Diagram as dg
#from PIL import Image
import pandas as pd
import os 
import matplotlib.pyplot as plt
from tabulate import tabulate
import numpy as np
from numpy import mean, median
import itertools
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
sw = nltk.corpus.stopwords.words('portuguese')
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("portuguese")
from nltk.tokenize import RegexpTokenizer
wnl = WordNetLemmatizer()
import scipy
from scipy import spatial
import re
from textblob import TextBlob
from textblob import Word
import gensim
from gensim.models import Word2Vec
import string
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import sklearn.cluster
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.cm as cm
from sklearn.cluster import MiniBatchKMeans
from enelvo.normaliser import Normaliser
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
pyLDAvis.enable_notebook()
import warnings
import IPython
warnings.filterwarnings("ignore", category=DeprecationWarning)



## **Load Dataset**

### **Clone GitHub repository**

In [3]:
# Files cloned from GitHub may not appear in the Files tab right away.
# If they are missing, right-click in the Files tab and refresh it.
!git clone https://github.com/punkmic/unsupervised-Sentiment-Analysis---Comparisen-analysis.git

fatal: destination path 'unsupervised-Sentiment-Analysis---Comparisen-analysis' already exists and is not an empty directory.


In [4]:
 #!git pull 

### **Load csv file**

In [5]:
PATH_TO_CSV = '/content/unsupervised-Sentiment-Analysis---Comparisen-analysis/results/web_scraping_results.csv'
docs = pd.read_csv(PATH_TO_CSV, encoding='utf-8')['body']

KeyError: ignored

### **EDA**

In [None]:
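print(f'Number of items: {len(docs)}')
# count distinct whitespace-separated words across all documents
unique_words = set(' '.join(docs.astype(str)).split())
print(f'Number of unique words: {len(unique_words)}')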

## **Plot wordcloud**

In [None]:
# print current directory
!pwd

In [None]:
# Create and generate a word cloud image:
wordcloud = WordCloud(max_font_size=50, max_words=100).generate(docs[0])

# Save wordcloud 
if not os.path.exists("wordclouds/raw/"):
  os.makedirs("wordclouds/raw/")
wordcloud.to_file('wordclouds/raw/body_wordcloud.png')

# Display wordcloud
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## **Text Pre-Processing**

Guide
* Lower Case conversion
* Removing Punctuations
* Stop Words Removal
* Rare Words Removal
* Spelling correction
* Tokenization
* Lemmatization
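
Several of the steps in the guide can be sketched with the standard library alone. The example below is a simplified illustration, not the notebook's pipeline: `simple_preprocess` and `TOY_STOPWORDS` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's, and rare-word removal, spelling correction, and lemmatization are left out.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only; the notebook uses NLTK's lists.
TOY_STOPWORDS = {"the", "is", "a", "and", "of", "to"}

def simple_preprocess(text, stop_words=TOY_STOPWORDS, min_len=3):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words and short tokens."""
    text = text.lower()                                               # lower case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    tokens = text.split()                                             # whitespace tokenization
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

print(simple_preprocess("The delivery is FAST, and the price of 2 items is great!"))
# → ['delivery', 'fast', 'price', 'items', 'great']
```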



### **Apply enelvo - Normalize noisy words, lowercase the words and remove punctuation.**
Enelvo is a tool for normalising noisy words in user-generated content written in Portuguese -- such as tweets, blog posts, and product reviews. It is capable of identifying and normalising spelling mistakes, internet slang, acronyms, proper nouns, and others.

### **Process text**

In [None]:
from nltk.corpus import stopwords
from string import punctuation

def preprocess_data(doc_set, language):
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # create a set of stop words for the requested language
    mystopwords = set(stopwords.words(language))
    # create a Snowball stemmer for the requested language
    s_stemmer = SnowballStemmer(language=language)
    # create a normaliser instance for the Portuguese language
    if language == 'portuguese':
        norm = Normaliser(tokenizer='readable', sanitize=True)
    # list of tokenized documents
    texts = []
    # loop through document list
    for doc in doc_set:
        # lowercase the document string
        raw = doc.lower()
        # remove numbers
        raw = re.sub(r'\d+', '', raw)
        # spelling correction
        if language == 'portuguese':
            raw = norm.normalise(raw)
        # tokenize words
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in mystopwords and
                            not token.isdigit() and token not in punctuation]
        # remove short words
        stopped_tokens = [token for token in stopped_tokens if len(token) > 2]
        # stem tokens
        stemmed_tokens = [s_stemmer.stem(token) for token in stopped_tokens]
        # add tokens to list
        texts.append(stemmed_tokens)
    return texts

In [None]:
doc_clean = preprocess_data(docs, 'english')
print(f'Before cleaning: {len(docs)}')
print(f'After cleaning: {len(doc_clean)}')
print()
print(f'First document before cleaning: {docs[0]}')
print()
print(f'First document after cleaning: {doc_clean[0]}')

## **Feature Engineering**

### **Count vectors**

In [None]:
def vectorize_as_count(doc_clean):
  cv = CountVectorizer()
  cv.fit(doc_clean)
  cv_tedfeatures = cv.transform(doc_clean)
  print(f"samples: {cv_tedfeatures.shape[0]}, features: {cv_tedfeatures.shape[1]}")
  print()
  df_bow_sklearn = pd.DataFrame(cv_tedfeatures.toarray(),columns=cv.get_feature_names_out())
  df_bow_sklearn.head()
  return cv, cv_tedfeatures

### **TF-IDF Vectors**

In [None]:
def vectorize_as_tfidf(doc_clean):
  tv = TfidfVectorizer()
  tv.fit(doc_clean)
  tv_tedfeatures = tv.transform(doc_clean)
  print(f"samples: {tv_tedfeatures.shape[0]}, features: {tv_tedfeatures.shape[1]}")
  print()
  # convert sparse matrix to dense
  dense = tv_tedfeatures.todense()
  denselist = dense.tolist()
  tfid_df = pd.DataFrame(denselist,columns=tv.get_feature_names_out())
  tfid_df.head()
  return tv, tv_tedfeatures
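
To make the TF-IDF weights less of a black box, here is a small hand-rolled sketch. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. `tfidf_rows` and `toy_docs` are hypothetical names, and pre-tokenized lists stand in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows), where each row is an L2-normalized TF-IDF vector."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = [["good", "service"], ["bad", "service"], ["good", "good", "price"]]
vocab, rows = tfidf_rows(toy_docs)
print(vocab)  # → ['bad', 'good', 'price', 'service']
```

In the second toy document, "bad" appears in fewer documents than "service", so it receives the larger weight.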

## **K-means Clustering**

In [None]:
def plot_elbow(vector_features):
  # Elbow method: plot the SSE (inertia) for k = 1..9
  elbow_method = {}
  for k in range(1, 10):
    kmeans_elbow = KMeans(n_clusters=k).fit(vector_features)
    elbow_method[k] = kmeans_elbow.inertia_
  plt.figure()
  plt.plot(list(elbow_method.keys()), list(elbow_method.values()))
  plt.xlabel("Number of clusters")
  plt.ylabel("SSE")
  plt.show()
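
The SSE on the y-axis is what scikit-learn exposes as `inertia_`: the sum of squared distances from each sample to its closest cluster center. A minimal sketch of that quantity on toy 2-D points follows. The `sse` helper and the hand-picked centroids are illustrative assumptions, not values produced by `KMeans`.

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single centroid at the overall mean
print(sse(points, [0, 0, 0, 0], [(5.0, 1.0)]))               # → 104.0
# k = 2: one centroid per visible group
print(sse(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)]))  # → 4.0
```

The sharp drop from k = 1 to k = 2 is the kind of "elbow" the plot is meant to reveal.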

In [None]:
def calculate_silhouette_score(vector_features):
  # Silhouette method: score each candidate number of clusters
  for n_cluster in range(2, 10):
    kmeans = KMeans(n_clusters=n_cluster).fit(vector_features)
    label = kmeans.labels_
    sil_coeff = silhouette_score(vector_features, label, metric='euclidean')
    print(f"For n_clusters={n_cluster}, The Silhouette Coefficient is {sil_coeff}")

### **Count Vectors as Features**

In [None]:
cv, cv_tedfeatures = vectorize_as_count(docs)
plot_elbow(cv_tedfeatures)

In [None]:
calculate_silhouette_score(cv_tedfeatures)

### **TF-IDF as Features**

In [None]:
tv, tv_tedfeatures = vectorize_as_tfidf(docs)
plot_elbow(tv_tedfeatures)

In [None]:
calculate_silhouette_score(tv_tedfeatures)

### **Clustering Model**

In [None]:
# define how many clusters K-means will generate
num_topics = 2
RANDOM_STATE = 20
segments = KMeans(n_clusters=num_topics, random_state=RANDOM_STATE).fit(tv_tedfeatures)
cluster_ids, cluster_sizes = np.unique(segments.labels_, return_counts=True)
print(f"Number of clusters: {segments.n_clusters} \nNumber of elements assigned to each cluster: {cluster_sizes}")

In [None]:
segments = KMeans(n_clusters=num_topics, random_state=RANDOM_STATE)
segments.fit(tv_tedfeatures)
clusters = segments.labels_
# cluster label for each document
output = segments.labels_.tolist()
ted_segmentation = {'text': docs, 'cluster': output}
output_df = pd.DataFrame(ted_segmentation)
# number of documents per cluster
output_df['cluster'].value_counts()

if not os.path.exists("wordclouds/clusters/"):
    os.makedirs("wordclouds/clusters/")

num_clusters = segments.n_clusters
for index in range(num_clusters):
  cluster = output_df[output_df.cluster == index]
  #wordcloud = WordCloud(width = 1000, height = 500,collocations = False).generate_from_text(' '.join(cluster['text']))

In [None]:
from sklearn.decomposition import PCA

# initialize PCA with 2 components
pca = PCA(n_components=2, random_state=42)
# pass our X to the pca and store the reduced vectors into pca_vecs
pca_vecs = pca.fit_transform(tv_tedfeatures.toarray())
# save our two dimensions into x0 and x1
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]
# assign clusters and pca vectors to our dataframe 
output_df['x0'] = x0
output_df['x1'] = x1

In [None]:
output_df.head()

In [None]:
def get_top_keywords(n_terms):
    """This function returns the keywords for each centroid of the KMeans"""
    df = pd.DataFrame(tv_tedfeatures.todense()).groupby(clusters).mean() # groups the TF-IDF vector by cluster
    terms = tv.get_feature_names_out() # access tf-idf terms
    for i,r in df.iterrows():
        print('\nCluster {}'.format(i))
        print(','.join([terms[t] for t in np.argsort(r)[-n_terms:]])) # for each row of the dataframe, find the n terms that have the highest tf idf score
            
get_top_keywords(10)

## **Topic Modeling**

### **Create a dictionary using the bag of words model**
- Document: some text.
- Corpus: a collection of documents.
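
A dependency-free sketch of the bag-of-words idea follows: each distinct token gets an integer ID, and each document becomes a list of `(token_id, count)` pairs, the same shape gensim's `Dictionary.doc2bow` returns. `build_dictionary` and `to_bow` are hypothetical helpers, and the IDs they assign may differ from the ones gensim would choose.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer ID, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

corpus = [["good", "price", "good"], ["bad", "price"]]
token2id = build_dictionary(corpus)
bow = [to_bow(doc, token2id) for doc in corpus]
print(token2id)  # → {'good': 0, 'price': 1, 'bad': 2}
print(bow)       # → [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```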

In [None]:
def prepare_corpus(doc_clean):
    # associate each word in the corpus with a unique integer ID.
    dictionary = corpora.Dictionary(doc_clean)
    
    # create the bag-of-word representation for documents (corpus)
    doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

    return dictionary,doc_term_matrix

In [None]:
def save_model(dir, cluster_id, model, passes, iterations):
   # Save models so they aren't lost
  if not os.path.exists(f"{dir}model_{iterations}i{passes}p_cluster{cluster_id}"):
    os.makedirs(f"{dir}model_{iterations}i{passes}p_cluster{cluster_id}")

  model.save(f"{dir}model_{iterations}i{passes}p_cluster{cluster_id}/model_{iterations}i{passes}p.model")
    
  print(f'Model saved at: {dir}model_{iterations}i{passes}p_cluster{cluster_id}/model_{iterations}i{passes}p.model')

In [None]:
def save_topics_plot(dir, cluster_id, model, doc_clean):
  dictionary, doc_term_matrix = prepare_corpus(doc_clean)
  vis = gensimvis.prepare(model, doc_term_matrix, dictionary)
  
  if not os.path.exists(f'{dir}'):
    os.makedirs(f'{dir}')
  # name the file after the cluster passed in, not a global loop variable
  pyLDAvis.save_html(vis, f'{dir}topics_cluster_{cluster_id}.html')
  print(f'Visualization saved at: {dir}topics_cluster_{cluster_id}.html')

### **Latent Dirichlet Allocation (LDA) model**

In [None]:
from gensim import models
from gensim import corpora
from gensim.models.callbacks import PerplexityMetric, ConvergenceMetric, CoherenceMetric

def create_gensim_lda_model(doc_clean, number_topics, words, iterations, passes):
  
  dictionary, doc_term_matrix = prepare_corpus(doc_clean) 

  # train model with the requested number of topics
  lda_model = models.ldamodel.LdaModel(doc_term_matrix,
            id2word=dictionary,
            num_topics=number_topics,
            iterations=iterations,
            passes=passes)
  
  print(lda_model.print_topics(num_topics=number_topics, num_words=words))

  return lda_model

In [None]:
def compute_coherence_values(dictionary, doc_term_matrix, doc_clean, stop, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        model = models.LdaModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary) 
        model_list.append(model)
        coherencemodel = models.CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

In [None]:
def plot_graph(doc_clean, start, stop, step):
    dictionary, doc_term_matrix=prepare_corpus(doc_clean)
    # use the LDA coherence helper defined above
    model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix,doc_clean,
                                                            stop, start, step)
    # Show graph
    x = range(start, stop, step)
    plt.plot(x, coherence_values)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence score")
    plt.legend(["coherence_values"], loc='best')
    plt.show()
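
A coherence curve is often read by picking the topic count with the highest score. That is a common heuristic rather than something this notebook prescribes. The sketch below shows the selection step. `best_num_topics` is a hypothetical helper and `example_scores` are made-up numbers, not results from this dataset.

```python
def best_num_topics(coherence_values, start, stop, step):
    """Return (num_topics, coherence) for the highest coherence score."""
    candidates = list(range(start, stop, step))
    if len(candidates) != len(coherence_values):
        raise ValueError("expected one coherence value per candidate topic count")
    best_index = max(range(len(candidates)), key=lambda i: coherence_values[i])
    return candidates[best_index], coherence_values[best_index]

# Made-up scores for illustration; real values come from the coherence computation.
example_scores = [0.31, 0.42, 0.39, 0.35]
print(best_num_topics(example_scores, start=2, stop=6, step=1))  # → (3, 0.42)
```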

In [None]:
iterations = 100
passes = 2
words=10

for index in range(segments.n_clusters):
  cluster = output_df[output_df.cluster == index]
  doc_clean = preprocess_data(cluster.text, 'english')
  model = create_gensim_lda_model(doc_clean, num_topics, words, iterations, passes)
  save_model('content/models/lda/', index, model, passes, iterations)

  # plot coherence score
  start,stop,step=2,12,1
  plot_graph(doc_clean, start, stop, step)

  # generate topic plot and save it
  save_topics_plot('content/topics/lda/', index, model, doc_clean)



In [None]:
# show topics for the first cluster (index 0)
IPython.display.HTML(filename='content/topics/lda/topics_cluster_0.html')

### **Latent Semantic Analysis (LSA) model** 

In [None]:
def create_gensim_lsa_model(doc_clean, numb_topics, words):
    dictionary, doc_term_matrix = prepare_corpus(doc_clean)
    lsamodel = models.LsiModel(doc_term_matrix, num_topics=numb_topics, id2word = dictionary)  
    print(lsamodel.print_topics(num_topics=numb_topics, num_words=words))
    return lsamodel

In [None]:
def compute_lsa_coherence_values(dictionary, doc_term_matrix, doc_clean, stop, start=2, step=3):
    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        model = models.LsiModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary)  
        model_list.append(model)
        coherencemodel = models.CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

In [None]:
def plot_lsa_graph(doc_clean, start, stop, step):
    dictionary, doc_term_matrix=prepare_corpus(doc_clean)
    model_list, coherence_values = compute_lsa_coherence_values(dictionary, doc_term_matrix,doc_clean,
                                                            stop, start, step)
    # Show graph
    x = range(start, stop, step)
    plt.plot(x, coherence_values)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence score")
    plt.legend(["coherence_values"], loc='best')
    plt.show()

In [None]:
iterations = 100
passes = 2
words=10
for index in range(segments.n_clusters):
  cluster = output_df[output_df.cluster == index]
  doc_clean = preprocess_data(cluster.text, 'english')
  model = create_gensim_lsa_model(doc_clean, num_topics, words)
  save_model('content/models/lsa/', index, model, passes, iterations)

  # plot coherence score
  start,stop,step=2,12,1
  plot_lsa_graph(doc_clean, start, stop, step)

  # generate topic plot and save it
  #save_topics_plot('topics/lsa/', index, model)

### **Hierarchical Dirichlet Process, HDP** 

In [None]:
def create_gensim_hdp_model(doc_clean, num_topics, words):
    dictionary, doc_term_matrix = prepare_corpus(doc_clean)
    model = models.HdpModel(doc_term_matrix, id2word=dictionary)
    return model

In [None]:
def compute_hdp_coherence_values(doc_clean):
    dictionary, doc_term_matrix = prepare_corpus(doc_clean)
    model = models.HdpModel(doc_term_matrix, id2word=dictionary)
    coherencemodel = models.CoherenceModel(model=model, texts=doc_clean, dictionary=dictionary, coherence='c_v')
    coherence_value = coherencemodel.get_coherence()
    return model, coherence_value

In [None]:
words = 10
for index in range(segments.n_clusters):
  cluster = output_df[output_df.cluster == index]
  doc_clean = preprocess_data(cluster.text, 'english')
  model = create_gensim_hdp_model(doc_clean, num_topics, words)
  save_model('content/models/hdp/', index, model, 'na', 'na')
  print(model.show_topics())

  # compute and report the coherence score for this cluster
  _, coherence_value = compute_hdp_coherence_values(doc_clean)
  print(f'Cluster {index} coherence: {coherence_value}')

  # generate topic plot and save it
  #save_topics_plot('content/topics/hdp/', index, model, doc_clean)

In [None]:
# show topics for the first cluster (index 0); requires the HDP topic plot to have been saved first
IPython.display.HTML(filename='content/topics/hdp/topics_cluster_0.html')

### **BERTopic**

In [None]:
from bertopic import BERTopic

In [None]:
# create model 
model = BERTopic(verbose=False, language='english')

In [None]:
# train model
topics, probabilities = model.fit_transform(docs)

In [None]:
print(topics)

In [None]:
# get the topic frequency
model.get_topic_freq().head(11)

In [None]:
# get one topic
model.get_topic(0)

In [None]:
# plot the top words of the 10 most frequent topics
model.visualize_barchart(top_n_topics=10)

### **Topic similarities**

In [None]:
# similarity matrix across all topics found by the model
model.visualize_heatmap()

### **BERTopic prediction**

In [None]:
# new_docs must be a list of unseen documents; it is not defined in this notebook
topics, probs = model.transform(new_docs)

### **TextBlob**

In [None]:
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

In [None]:
sid = SentimentIntensityAnalyzer()

In [None]:
def get_blob_sentiment(sentence):
  blob = TextBlob(sentence).sentiment
  return blob.polarity

### **Vader**

In [None]:
nltk.download('vader_lexicon')

In [None]:
def get_vader_sentiment(sentence):
  vader = sid.polarity_scores(sentence)
  return vader['compound']

In [None]:
# build a DataFrame with a 'body' column from the loaded documents
df = docs.to_frame(name='body')
df['TextBlob'] = df['body'].apply(get_blob_sentiment)
df['Vader'] = df['body'].apply(get_vader_sentiment)

A negative score indicates negative sentiment, and a positive score indicates positive sentiment. Both the TextBlob polarity and the VADER compound score fall between -1 and 1; the larger the absolute value, the stronger the sentiment the tool detected.
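
Scores like these are often turned into discrete labels. The sketch below uses a cut-off of ±0.05, the convention suggested in the VADER documentation for the compound score. Applying the same cut-off to TextBlob polarity is an assumption, and `label_sentiment` is a hypothetical helper.

```python
def label_sentiment(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

for score in (0.62, -0.4, 0.01):
    print(score, label_sentiment(score))
# → 0.62 positive
# → -0.4 negative
# → 0.01 neutral
```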

In [None]:
df.head(10)

### **Clustering sentences with K-Means**

In [None]:
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.probability import FreqDist
from sklearn.model_selection import train_test_split

## **Save Models to Google Cloud Storage**

In [None]:
# import Google Cloud dependencies
#from google.colab import auth
#import uuid # to generate a unique identifier for the Google Cloud Storage bucket
# Define a project id in Google Cloud
#project_id = '<project_ID>'

#auth.authenticate_user()
# configure gsutil
## !gcloud config set project {project_id}
# set bucket name
##bucket_name = f'sample-bucket-{uuid.uuid1()}'
## !gsutil mb gs://{bucket_name}

In [None]:
# upload model to Google Cloud Storage
#!gsutil cp /tmp/name_of_file.txt gs://{bucket_name}/

# location of model
#download_location = f"https://console.cloud.google.com/storage/browser?project={project_id}"

# download model from Google Cloud Storage
#!gsutil cp gs://{bucket_name}/{filename} {download_location}

## **References**


[LangDetect](https://pypi.org/project/langdetect/) <br/>
[Diagrams](https://pypi.org/project/diagrams/) <br/>
[Graphviz](https://pypi.org/project/graphviz/) <br/>
[Beautiful Soup 4](https://pypi.org/project/beautifulsoup4/) <br/>
[OpLexicon](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/oplexicon/)