<a href="https://colab.research.google.com/github/harnalashok/deeplearning-sequences/blob/main/sentimentAnalysis_for_DLS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 1st Feb, 2024
# Objective: To prepare csv file from tweets data
#            for experimenting in deeplearning studio (DLS)
#
# My Github reference:
#    https://github.com/harnalashok/deeplearning-sequences/blob/main/sentimentAnalysis_for_DLS.ipynb
#
# Ref: https://community.deepcognition.ai/t/text-processing-and-tutorial-video-for-uploading-text-dataset/238

## How to?<br>
> 1.0 Your csv file must have at least two columns. Extra columns are not a problem.<br>
> 2.0 The tweets (text) column must have the header <b>'text'</b> and the class column should have the header <b>'label'</b>. The example data used in this notebook keeps its classes in a column named 'class_label', so adjust the column name in the code to match your file.  
> 3.0 Upload the csv file to your gdrive.<br>
> 4.0 After processing is complete, the saved file will be in the folder '/content'. You can download the processed file directly from this folder: right-click on it and click <b>Download</b>.
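
Before uploading, it may help to confirm that the csv header really contains the two required names. The following is only a minimal sketch using the standard library; the function name `missing_columns` and the default `utf-8` encoding are assumptions made for illustration, not part of the original notebook.

```python
import csv

# Header names required by the steps above
REQUIRED_COLUMNS = {"text", "label"}

def missing_columns(csv_path, encoding="utf-8"):
    """Return the set of required header names that are absent from the csv file."""
    with open(csv_path, newline="", encoding=encoding) as f:
        # The first row is taken as the header; an empty file yields an empty header
        header = next(csv.reader(f), [])
    return REQUIRED_COLUMNS - {name.strip() for name in header}
```

An empty set means both columns are present. Pass a different `encoding` if the file is not UTF-8.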

## Call libraries

In [1]:
# 1.0
import tensorflow
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import csv

In [2]:
# 1.1
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


## Read data
Change the csv file location below to match your own file.

In [3]:
# 2.0 Change this path as per your file location in gdrive:
#datapath = "/gdrive/MyDrive/sentiment/combined.csv"
datapath = "/gdrive/MyDrive/Colab_data_files/disaster_tweets/socialmedia_relevant_cols.csv"

In [4]:
# 2.1 If there is a problem in reading, try different encodings.
#      For other encodings, see: https://stackoverflow.com/a/18172249/3282777#

data = pd.read_csv(datapath, encoding = "ISO-8859-1")
data.head()

Unnamed: 0,text,choose_one,class_label
0,Just happened a terrible car crash,Relevant,1
1,Our Deeds are the Reason of this #earthquake M...,Relevant,1
2,"Heard about #earthquake is different cities, s...",Relevant,1
3,"there is a forest fire at spot pond, geese are...",Relevant,1
4,Forest fire near La Ronge Sask. Canada,Relevant,1
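
If the right encoding is not known, one option is to try a few candidates in order. The sketch below is an assumption-laden helper, not part of the original notebook: the name `first_working_encoding` and the candidate list are illustrative. ISO-8859-1 is listed last because it can decode any byte sequence, so it never fails even when it is the wrong choice.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)      # raises UnicodeDecodeError if the bytes do not fit
            return enc
        except UnicodeDecodeError:
            continue
    return None

# Possible usage with the variables of this notebook:
#   data = pd.read_csv(datapath, encoding=first_working_encoding(datapath))
```

A successful decode only shows that the bytes are valid in that encoding. It is still worth looking at a few rows to check that accented characters appear correctly.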


In [5]:
# 2.2 How many classes does this data have?

data['class_label'].value_counts()  # Three: 0,1,2

0    6186
1    4673
2      16
Name: class_label, dtype: int64

In [6]:
# 2.3 If three, then remove the class with label 2:

data = data.loc[data['class_label'] != 2, :]

In [7]:
# 2.4 Check again now:

data['class_label'].value_counts()

0    6186
1    4673
Name: class_label, dtype: int64

In [8]:
# 2.5 Target should not have any NULL values:

data['class_label'].isnull().sum()

0

In [None]:
#text_file = open("reviews.txt", "r")
#lines = text_file.readlines()

In [9]:
# 2.6 Get the text column from data:

lines = data['text']

## Tokenize data

In [10]:
# 3.0 Select relevant parameters:

maxlen = 500                  # Maximum length of a tweet, in tokens. A tweet longer than
                              #  maxlen will be truncated
max_words = 10000             # We will only consider the top (most frequent) max_words in the dataset

In [11]:
# 3.1 Instantiate Tokenizer class
tokenizer = Tokenizer(num_words=max_words)

In [12]:
# 3.2 Fit it on text

tokenizer.fit_on_texts(lines)   # tokenizer.index_word

In [15]:
# 3.3 Transform text, tweet by tweet, as a list of numbers:

sequences = tokenizer.texts_to_sequences(lines)
print(sequences[:3])  # Print the first 3 tweets as sequences

[[34, 831, 5, 1518, 133, 97], [114, 5934, 25, 4, 877, 8, 22, 255, 154, 1821, 3834, 90, 43], [380, 56, 255, 11, 1316, 1822, 658, 1519, 275]]


In [16]:
# 3.4 How many tweets have been read:

len(sequences)


10859
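
To see roughly what the tokenizer does, here is a simplified pure-Python sketch. It is not the Keras implementation: it only lowercases and splits on whitespace (Keras also strips punctuation), and the function names are hypothetical. Words are ranked by frequency starting at 1, and, as far as I understand the Keras behaviour, only indices below `num_words` are kept when converting to sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index, dropping indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

sample = ["fire near the forest", "forest fire spreading"]
index = build_word_index(sample)
print(to_sequences(sample, index, num_words=4))
```

Rare words simply disappear from the sequences, which is why a tweet can produce fewer numbers than it has words.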

## Pad all sequences to same length

In [18]:
# 4.0 Transform sequences to sameLengthSequences:
sameLengthSequences = pad_sequences(sequences, maxlen=maxlen)
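
By default `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` values. The sketch below imitates that default behaviour in plain Python for illustration only; `pad_pre` is a hypothetical name and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                        # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4))
```

With `maxlen = 500`, every row has 500 entries, so a short tweet becomes mostly zeros followed by its few word indices.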

In [19]:
# 4.1 Join every number in a sequence using semicolon:
#     The three sequences: [[0,0,23,45], [89,76,33,44], [49,98,34,22]]
#     become the strings:  ['0;0;23;45', '89;76;33;44', '49;98;34;22']

sequencesToStrings = []
for row in sameLengthSequences:
    sequencesToStrings.append(';'.join(str(col) for col in row))

## Save file

In [20]:
# 4.2 Our csv file:

csvfile = "processed.csv"

In [None]:
#with open(csvfile, "w") as output:
#    writer = csv.writer(output, lineterminator='\n')
#    for val in sequencesToStrings:
#        writer.writerow([val])

In [21]:
# 4.3 Create a blank dataframe
s = pd.DataFrame()

In [22]:
# 4.4 One column is text column
s['text'] = sequencesToStrings

In [23]:
# 4.5 The other is label column:
s['Label'] = data['class_label'].values


In [24]:
# 4.6 Save the file appropriately:
#     It is saved to /content folder:

s.to_csv(csvfile, index = False)
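
As a sanity check, the saved file can be read back and the semicolon-joined strings split into integers again. This is only a sketch with the standard `csv` module; the function name `parse_processed_rows` is an assumption, and it expects the two headers written above ('text' and 'Label').

```python
import csv

def parse_processed_rows(lines):
    """Yield (token_ids, label) pairs from the lines of a processed csv file."""
    for row in csv.DictReader(lines):
        token_ids = [int(tok) for tok in row["text"].split(";")]
        yield token_ids, row["Label"]

# Possible usage in Colab, where the file is saved under /content:
#   with open("processed.csv", newline="") as f:
#       rows = list(parse_processed_rows(f))
#   print(len(rows), len(rows[0][0]))   # number of tweets, length of one sequence
```

Labels come back as strings because the csv format carries no type information.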

## Steps to take now:

DLS for tabular data
====================
 Use the Chrome browser.

 Steps:<br>
 1.0 Rename your processed .csv file to 'train.csv'  
 2.0 Place your train.csv in an empty folder  
 3.0 The name of this folder will be the name of your dataset in DLS, so rename the folder if needed.  
 4.0 Zip the folder  
 5.0 Upload the zipped folder to DLS. While uploading, select the DLS Native dataset option.<br>
 6.0 Create a project and develop a model as usual. To start with, you can use the same model as in DLS for the IMDB dataset. The IMDB dataset is very large (15000 training samples) compared to ours of just 2000 samples. Try to make this sample large as well.
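
Steps 1.0 to 4.0 can also be done in code. The following is a sketch under assumptions: `package_for_dls` and the dataset name in the usage comment are hypothetical, and the zip is built with the folder itself at the top level, following the "zip the folder" step. Whether DLS needs exactly this layout is not confirmed here.

```python
import shutil
from pathlib import Path

def package_for_dls(processed_csv, dataset_name, work_dir="."):
    """Copy the csv to <dataset_name>/train.csv and zip that folder."""
    work_dir = Path(work_dir).resolve()
    dataset_dir = work_dir / dataset_name
    dataset_dir.mkdir(parents=True, exist_ok=False)   # fails if the folder already exists
    shutil.copy(processed_csv, dataset_dir / "train.csv")
    # root_dir/base_dir keep the folder name as the top-level entry in the zip
    return shutil.make_archive(
        str(work_dir / dataset_name), "zip",
        root_dir=str(work_dir), base_dir=dataset_name,
    )

# Possible usage in Colab:
#   package_for_dls("/content/processed.csv", "disaster_tweets", work_dir="/content")
```

The function returns the path of the zip file, which can then be downloaded and uploaded to DLS.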




In [None]:
########## DONE ############