<a href="https://colab.research.google.com/github/punkmic/unsupervised-Sentiment-Analysis---Comparisen-analysis/blob/master/Unsupervised_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Intro**

## **Install Dependencies**

In [None]:
# install dependencies here
!pip install langdetect  # for language detection
!pip install diagrams # for visualizing the workflow
!pip install graphviz # for visualizing the workflow
!pip install textblob # for unsupervised sentiment analysis
!pip install wordcloud # for word cloud plots
!pip install matplotlib # for plotting
!pip install Pillow # for image manipulation (provides the PIL module)
!pip install nltk # for natural language preprocessing
!pip install enelvo # for fixing slang, abbreviations, and spelling errors

## **Load Dependencies**

In [None]:
# load dependencies here
from langdetect import detect as dt
from diagrams import Diagram as dg
import pandas as pd
from PIL import Image
import os 
import matplotlib.pyplot as plt

## **Load Dataset**

### **Clone Github repository** 

In [None]:
# Files cloned from GitHub may not appear in the Files tab automatically.
# If they are missing, right-click inside the Files tab and choose "Refresh".
!git clone https://github.com/punkmic/unsupervised-Sentiment-Analysis---Comparisen-analysis.git
%cd /content/unsupervised-Sentiment-Analysis---Comparisen-analysis
!ls

In [None]:
# !git pull 

### **Load csv file**

In [None]:
PATH_TO_CSV = '/content/unsupervised-Sentiment-Analysis---Comparisen-analysis/results/web_scraping_results.csv'
df = pd.read_csv(PATH_TO_CSV, encoding='utf-8')
df.head()

### **Show summary statistics of the text**

In [None]:
df.describe()

### **Plot wordcloud**

In [None]:
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
# print the current working directory
!pwd

In [None]:
# create a new directory for wordclouds, relative to the repository root we changed into above
wordclouds = 'wordclouds'
!mkdir -p {wordclouds}  # -p avoids an error if the directory already exists

In [None]:
# Create and generate a word cloud image:
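# Join every title into one string; str(df['title']) would only give the truncated Series printout
text = ' '.join(df['title'].astype(str)).lower()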
wordcloud = WordCloud(max_font_size=50, max_words=100,  stopwords=STOPWORDS).generate(text)

# Save wordcloud 
wordcloud.to_file('wordclouds/title_wordcloud.png')

# Display wordcloud
plt.figure()
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()



In [None]:

# Create and generate a word cloud image:
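# Join every review body into one string; str(df['body']) would only give the truncated Series printout
text = ' '.join(df['body'].astype(str)).lower()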
wordcloud = WordCloud(max_font_size=50, max_words=100).generate(text)

# Save wordcloud 
wordcloud.to_file('wordclouds/body_wordcloud.png')

# Display wordcloud
plt.figure()
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()



## **Text Pre-Processing**

Guide
* Lower Case conversion
* Removing Punctuations
* Stop Words Removal
* Rare Words Removal
* Spelling correction
* Tokenization
* Lemmatization
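
The first steps of the guide can be sketched with the standard library alone. This is a minimal illustration, not the notebook's pipeline: `preprocess`, `STOP_WORDS`, `min_count`, and the sample reviews are hypothetical names and data, whitespace splitting stands in for a real tokenizer, and spelling correction and lemmatization are left out.

```python
import string
from collections import Counter

# Hypothetical mini stop list for illustration only; the notebook uses NLTK's Portuguese list
STOP_WORDS = {"a", "o", "e", "de", "do", "da", "que", "é", "muito"}

def preprocess(texts, stop_words=STOP_WORDS, min_count=2):
    """Lowercase, strip punctuation, split on whitespace, then drop stop words and rare words."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokenized = [text.lower().translate(strip_punct).split() for text in texts]
    tokenized = [[t for t in tokens if t not in stop_words] for tokens in tokenized]
    # Count over the whole corpus so that "rare" means rare across all reviews
    counts = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in tokenized]

# Made-up reviews, not taken from the dataset
reviews = ["O produto é muito bom!", "Produto bom, entrega rápida.", "Entrega atrasou..."]
cleaned = preprocess(reviews)
print(cleaned)
```

Counting word frequencies over the whole corpus rather than per review keeps a word that appears once in each of several reviews from being treated as rare.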



### **Tokenizing**

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

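# download the stop word lists and the Punkt tokenizer models
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this package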

In [None]:
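# Join all reviews with a space so that words at review boundaries do not merge,
# then lowercase the text and split it into tokens
body = ' '.join(df['body'].astype(str))
tokens = word_tokenize(body.lower(), language='portuguese')
# Drop tokens that are a single punctuation character
filtered_tokens = [token for token in tokens if token not in string.punctuation]
filtered_tokens[:10]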

### **Remove stop words**

In [None]:
# Remove Portuguese stop words
# A set makes each membership test constant time
portuguese_stop_words = set(stopwords.words('portuguese'))
tokens_wo_stop_words = [word for word in filtered_tokens if word not in portuguese_stop_words]
tokens_wo_stop_words[:10]

### **Word Stemming**

In [None]:
from nltk.stem import SnowballStemmer

In [None]:
# Use SnowballStemmer; NLTK's RSLPStemmer is an alternative for Portuguese text
# Initialize stemmer with portuguese
stemmer = SnowballStemmer('portuguese') 
# Stem the words
stemmed_words = [stemmer.stem(word) for word in tokens_wo_stop_words]
stemmed_words[1:10]

### **Using Enelvo to normalise text (optional)**

In [None]:
from enelvo.normaliser import Normaliser

In [None]:
norm = Normaliser(tokenizer='readable', sanitize=True)

### **TextBlob**

In [None]:
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
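# The analyzer loads the VADER lexicon when it is constructed, so download it first
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()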

In [None]:
def get_blob_sentiment(sentence):
  blob = TextBlob(sentence).sentiment
  return blob.polarity

### **Vader**

In [None]:
nltk.download('vader_lexicon')

In [None]:
def get_vader_sentiment(sentence):
  vader = sid.polarity_scores(sentence)
  return vader['compound']

In [None]:
# Score every review with both analyzers
df['TextBlob'] = df['body'].apply(get_blob_sentiment)
df['Vader'] = df['body'].apply(get_vader_sentiment)

A negative score indicates negative sentiment, and a positive score indicates positive sentiment. Both the TextBlob polarity and the VADER compound score range from -1 to 1; the larger the absolute value, the stronger the sentiment the analyzer detected.
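
To compare the two analyzers it can help to turn each score into a label. The sketch below is one possible mapping: `label_sentiment` is a hypothetical helper, the 0.05 cutoff is the value commonly suggested for VADER's compound score, and reusing the same cutoff for TextBlob polarity is an assumption.

```python
def label_sentiment(score, threshold=0.05):
    """Map a score in [-1, 1] to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    # Scores close to zero carry too little signal to call either way
    return "neutral"

# Example scores chosen for illustration
for score in (0.6, -0.3, 0.01):
    print(score, label_sentiment(score))
```

With the DataFrame from this notebook, the same function could be applied column-wise, for example `df['Vader'].apply(label_sentiment)`.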

In [None]:
df.head(10)

### **Clustering sentences with K-Means**

In [None]:
import re
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.probability import FreqDist
from sklearn.model_selection import train_test_split

In [None]:
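# Split the reviews into train and test sets
# test_size and random_state are illustrative choices, not values fixed by the dataset
train, test = train_test_split(df, test_size=0.2, random_state=42)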
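
The clustering step itself could look like the sketch below, which uses the `TfidfVectorizer` and `KMeans` classes imported above. The toy sentences stand in for `df['body']`, and `n_clusters=2`, `n_init=10`, and `random_state=42` are assumptions made for the example rather than tuned values.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences standing in for df['body']; they are not from the dataset
sentences = [
    "produto excelente recomendo",
    "produto excelente qualidade",
    "entrega atrasada pedido",
    "entrega atrasada demais",
]

# Turn each sentence into a TF-IDF weighted bag-of-words vector
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(sentences)

# Group the vectors into two clusters; the cluster ids themselves are arbitrary
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(features)

for sentence, label in zip(sentences, labels):
    print(label, sentence)
```

K-Means only groups similar sentences; deciding which cluster corresponds to positive or negative sentiment still requires inspecting the sentences in each cluster.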

## **Save Models to Google Cloud Storage**

In [None]:
# import google cloud dependencies
from google.colab import auth
import uuid # for generating a unique identifier for the Google Cloud Storage bucket
# Define a project id in google cloud
project_id = '<project_ID>'

auth.authenticate_user()
# configure gsutil (uncomment after setting project_id)
## !gcloud config set project {project_id}
# set bucket name
## bucket_name = f'sample-bucket-{uuid.uuid1()}'
## !gsutil mb gs://{bucket_name}

In [None]:
# upload model to Google Cloud Storage (bucket_name must be defined first)
filename = 'name_of_file.txt'
!gsutil cp /tmp/{filename} gs://{bucket_name}/

# Cloud Console page where the uploaded files can be browsed
console_url = f"https://console.cloud.google.com/storage/browser?project={project_id}"

# download model from Google Cloud Storage into a local directory
download_location = '/tmp/'
!gsutil cp gs://{bucket_name}/{filename} {download_location}

## **References**


[LangDetect](https://pypi.org/project/langdetect/) <br/>
[Diagrams](https://pypi.org/project/diagrams/) <br/>
[Graphviz](https://pypi.org/project/graphviz/) <br/>
[BeautifulSoup4](https://pypi.org/project/beautifulsoup4/) <br/>
[OpLexicon](https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/oplexicon/)