<a href="https://colab.research.google.com/github/mojtabaSefidi/DataScience-SmallProjects/blob/master/Sentiment_Analysis_Using_DeepLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q wordcloud
!pip install -q tqdm

In [4]:
import os
import pandas as pd
import numpy as np
import string
import nltk
import re
from keras.utils import pad_sequences
from keras.preprocessing.text import Tokenizer
from tqdm import tqdm  # the cells below call tqdm(...) directly

## **Dataset**

### **Downloading the Dataset from Kaggle**

In [5]:
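# Fetch the Kaggle API token (kaggle.json) from Google Drive
!gdown 1R8waoO4GA-0SiyfadnSDcY4FeuNkTV3A
!pip install -q kaggle
# -p avoids an error when ~/.kaggle already exists (e.g. when the cell is re-run)
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d kritanjalijain/amazon-reviews
!unzip /content/amazon-reviews.zip
# Give the extracted files descriptive names and delete the downloaded archive
os.rename('test.csv', 'Amazon_Review_Test.csv')
os.rename('train.csv', 'Amazon_Review_Train.csv')
os.remove("amazon-reviews.zip")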

Downloading...
From: https://drive.google.com/uc?id=1R8waoO4GA-0SiyfadnSDcY4FeuNkTV3A
To: /content/kaggle.json
100% 73.0/73.0 [00:00<00:00, 176kB/s]
Downloading amazon-reviews.zip to /content
100% 1.29G/1.29G [00:40<00:00, 37.0MB/s]
100% 1.29G/1.29G [00:40<00:00, 34.0MB/s]
Archive:  /content/amazon-reviews.zip
  inflating: amazon_review_polarity_csv.tgz  
  inflating: test.csv                
  inflating: train.csv               


### **Introduction**

In [6]:
train_dataset = pd.read_csv('Amazon_Review_Train.csv', names=['Polarity', 'Review Heading', 'Review Body'], dtype={'Polarity':np.int8, 'Review Heading':str, 'Review Body':str})
test_dataset = pd.read_csv('Amazon_Review_Test.csv', names=['Polarity', 'Review Heading', 'Review Body'], dtype={'Polarity':np.int8, 'Review Heading':str, 'Review Body':str})
print(f'We have {len(train_dataset)} samples for training and {len(test_dataset)} samples for evaluation.')

We have 3600000 samples for training and 400000 samples for evaluation.


In [7]:
train_dataset['Polarity'] = train_dataset['Polarity'].map({2:'positive', 1:'negative'})
test_dataset['Polarity'] = test_dataset['Polarity'].map({2:'positive', 1:'negative'})

In [8]:
train_dataset.head()

| | Polarity | Review Heading | Review Body |
|---|---|---|---|
| 0 | positive | Stuning even for the non-gamer | This sound track was beautiful! It paints the ... |
| 1 | positive | The best soundtrack ever to anything. | I'm reading a lot of reviews saying that this ... |
| 2 | positive | Amazing! | This soundtrack is my favorite music of all ti... |
| 3 | positive | Excellent Soundtrack | I truly like this soundtrack and I enjoy video... |
| 4 | positive | Remember, Pull Your Jaw Off The Floor After He... | If you've played the game, you know how divine... |


In [9]:
test_dataset.head()

| | Polarity | Review Heading | Review Body |
|---|---|---|---|
| 0 | positive | Great CD | My lovely Pat has one of the GREAT voices of h... |
| 1 | positive | One of the best game music soundtracks - for a... | Despite the fact that I have only played a sma... |
| 2 | negative | Batteries died within a year ... | I bought this charger in Jul 2003 and it worke... |
| 3 | positive | works fine, but Maha Energy is better | Check out Maha Energy's website. Their Powerex... |
| 4 | positive | Great for the non-audiophile | Reviewed quite a bit of the combo players and ... |


### **Analysis**

## **Pre-Processing**

In [10]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [11]:
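def process(text, remove_stopwords=False):
  """Lowercase, expand common contractions, strip punctuation, collapse spaces."""
  text = text.lower()
  # Expand contractions. Order matters: specific forms ("what's", "can't")
  # must be handled before the generic suffix rules ("'s", "n't").
  text = re.sub(r"what's", "what is ", text)
  text = re.sub(r"'s", " ", text)  # drops possessives; "it's" becomes "it"
  text = re.sub(r"'ve", " have ", text)
  text = re.sub(r"can't", "can not ", text)
  text = re.sub(r"n't", " not ", text)
  text = re.sub(r"i'm", "i am ", text)
  text = re.sub(r"'re", " are ", text)
  text = re.sub(r"'d", " would ", text)
  text = re.sub(r"'ll", " will ", text)
  # Remove all ASCII punctuation characters
  text = text.translate(str.maketrans('', '', string.punctuation))
  if remove_stopwords:
    # Uses the NLTK English stop-word list loaded in the previous cell
    text = ' '.join([word for word in text.split() if word not in stopwords])
  # Collapse runs of spaces left behind by the substitutions
  text = re.sub(' +', ' ', text)
  return text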
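
The rules in `process` are order-sensitive: `can't` has to be rewritten before the generic `n't` rule, and `what's` before the generic `'s` rule. A minimal sketch of the same cleaning steps written as an ordered rule table may make that easier to see. The names `CONTRACTION_RULES` and `normalize` are hypothetical and not part of the notebook, and the stop-word branch is left out for brevity.

```python
import re
import string

# Ordered (pattern, replacement) pairs mirroring the substitutions in process().
# Specific forms come before the generic suffix rules that would also match them.
CONTRACTION_RULES = [
    (r"what's", "what is "),
    (r"'s", " "),
    (r"'ve", " have "),
    (r"can't", "can not "),
    (r"n't", " not "),
    (r"i'm", "i am "),
    (r"'re", " are "),
    (r"'d", " would "),
    (r"'ll", " will "),
]

def normalize(text):
    """Hypothetical table-driven variant of the cleaning steps (no stop-word removal)."""
    text = text.lower()
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    # Remove ASCII punctuation, then collapse repeated spaces and trim the ends
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(' +', ' ', text).strip()

sample = "I can't believe it's this good! They've outdone themselves."
print(normalize(sample))
# i can not believe it this good they have outdone themselves
```

In this sample `it's` becomes `it` rather than `it is`, because the generic `'s` rule only strips the suffix.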

In [12]:
train_dataset = train_dataset[train_dataset['Review Heading'].notnull() & train_dataset['Review Body'].notnull()]
test_dataset = test_dataset[test_dataset['Review Heading'].notnull() & test_dataset['Review Body'].notnull()]

In [13]:
x_train_text = train_dataset['Review Heading'] + '. ' + train_dataset['Review Body']
x_train_text = x_train_text.map(process)
y_train = train_dataset['Polarity']
x_test_text = test_dataset['Review Heading'] + '. ' + test_dataset['Review Body']
x_test_text = x_test_text.map(process)
y_test = test_dataset['Polarity']

## **Text Representation**

### **Text2Sequence**

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train_text)
X_train = tokenizer.texts_to_sequences(x_train_text)
X_test = tokenizer.texts_to_sequences(x_test_text)
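
`pad_sequences` is imported at the top of the notebook but is not used yet, and the integer sequences produced above have different lengths, so they would need a common length before being batched. The pure-Python sketch below illustrates on toy data what the indexing and padding steps amount to. It does not call Keras: the helper names (`build_word_index`, `to_sequences`, `pad_left`), the toy reviews, and `maxlen=3` are assumptions made for illustration. Padding and truncating at the front is chosen to mirror the default `padding='pre'` and `truncating='pre'` behavior of Keras `pad_sequences`.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace words with their indices; words unseen during fitting are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_left(sequences, maxlen):
    """Truncate from the front and left-pad with zeros so every row has length maxlen."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

train_texts = ["great sound great price", "battery died", "great battery"]
word_index = build_word_index(train_texts)
train_seqs = to_sequences(train_texts, word_index)
test_seqs = to_sequences(["great charger"], word_index)  # "charger" was never seen

print(word_index)
print(pad_left(train_seqs, maxlen=3))
print(pad_left(test_seqs, maxlen=3))
```

In the notebook itself, the equivalent step would presumably be a call to `pad_sequences` on `X_train` and `X_test` with a `maxlen` chosen from the review-length distribution.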

### **Continuous Bag of Words (CBOW)**

### **GloVe**

In [2]:
# Uncomment on the first run to download the GloVe archive (about 2.18 GB).
# !gdown https://huggingface.co/stanfordnlp/glove/resolve/main/glove.840B.300d.zip
!unzip /content/glove.840B.300d.zip

Downloading...
From: https://huggingface.co/stanfordnlp/glove/resolve/main/glove.840B.300d.zip
To: /content/glove.840B.300d.zip
100% 2.18G/2.18G [00:53<00:00, 40.7MB/s]


In [None]:
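embedding_vector_glove = {}
# Adjust this path if the archive was extracted to a different location.
with open('embeddings/glove.840B.300d/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in tqdm(f):
        value = line.rstrip().split(' ')
        # Some glove.840B tokens contain spaces, so the last 300 fields are
        # taken as the vector and everything before them as the word.
        word = ' '.join(value[:-300])
        coef = np.asarray(value[-300:], dtype='float32')
        embedding_vector_glove[word] = coef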

In [None]:
vocab_size = len(tokenizer.word_index)+1
embedding_matrix_glove = np.zeros((vocab_size,300))
for word, i in tqdm(tokenizer.word_index.items()):
    embedding_value = embedding_vector_glove.get(word)
    if embedding_value is not None:
        embedding_matrix_glove[i] = embedding_value
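
Words without a GloVe entry keep an all-zero row in `embedding_matrix_glove`, so it can be worth checking how much of the vocabulary is actually covered. A small sketch on toy data, where `embedding_coverage` and the sample dictionaries are hypothetical stand-ins for `tokenizer.word_index` and `embedding_vector_glove`:

```python
def embedding_coverage(word_index, embedding_vectors):
    """Return the fraction of vocabulary words with a pretrained vector, plus the misses."""
    if not word_index:
        return 0.0, []
    missing = [word for word in word_index if word not in embedding_vectors]
    covered = len(word_index) - len(missing)
    return covered / len(word_index), missing

# Toy stand-ins: a misspelled word has no pretrained vector
toy_word_index = {"great": 1, "battery": 2, "soundtrak": 3, "price": 4}
toy_vectors = {"great": [0.1, 0.2], "battery": [0.3, 0.4], "price": [0.5, 0.6]}

coverage, missing = embedding_coverage(toy_word_index, toy_vectors)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

A low coverage figure would suggest revisiting the preprocessing, for example how misspellings or lowercasing interact with the pretrained vocabulary.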

### **Word2vec**

### **BERT Pretrained Embedding**

### **Visualization**

## **Baseline Model**

### **Model Architecture**

### **Model Training & Evaluation**

### **Comparison Study**

### **Comparison with a Pre-trained Model**

## **Inference**