## Let's create our own sentiment analysis model

In [1]:
import pandas as pd
from tqdm.notebook import tqdm

## Loading and formatting dataset

According to the creators of the dataset:

"Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search"

In [2]:
sentiment140_file = "/home/guscarrian@GU.GU.SE/ML_22/project/training.1600000.processed.noemoticon.csv"

We will give the columns a header to remember the data in each of them.

According to the Kaggle dataset (https://www.kaggle.com/datasets/kazanova/sentiment140?resource=download):

- target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

- ids: the id of the tweet (2087)

- date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

- flag: The query (lyx). If there is no query, then this value is NO_QUERY.

- user: the user that tweeted (robotickilldozr)

- text: the text of the tweet (Lyx is cool)

These will be our column names. We'll use 'sentiment' instead of 'target', and 'query' instead of 'flag'.

In [3]:
columns = ['sentiment','id','date','query','user','text']

In [4]:
#Loading the training data as a pandas dataframe
dataset = pd.read_csv(sentiment140_file, encoding='latin-1', names=columns)

In [5]:
dataset.head()

Unnamed: 0,sentiment,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


Although the Kaggle dataset lists neutral (2 = neutral) as a class, there are no examples of neutral sentiment in the training set. We find 800000 tweets with negative polarity (0 = negative) and 800000 tweets with positive polarity (4 = positive), which means the dataset is balanced.

In [6]:
dataset["sentiment"].value_counts()

0    800000
4    800000
Name: sentiment, dtype: int64

Not all columns are relevant to the task in this project, so we drop the ones we don't need: id, date, query and user.

In [7]:
dataset = dataset.drop(columns=['id','date','query','user'])

In [8]:
dataset.shape

(1600000, 2)

In [9]:
dataset

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
...,...,...
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...


In [15]:
#dataset['pre_clean_len'] = [len(t) for t in dataset.text]

In [18]:
#dataset[dataset.pre_clean_len > 140].head(10)

Unnamed: 0,sentiment,text,pre_clean_len
213,0,Awwh babs... you look so sad underneith that s...,142
226,0,Tuesdayï¿½ll start with reflection ï¿½n then a...,141
279,0,Whinging. My client&amp;boss don't understand ...,145
343,0,@TheLeagueSF Not Fun &amp; Furious? The new ma...,145
400,0,#3 woke up and was having an accident - &quot;...,144
464,0,"My bathtub drain is fired: it haz 1 job 2 do, ...",146
492,0,"pears &amp; Brie, bottle of Cabernet, and &quo...",150
747,0,Have an invite for &quot;Healthy Dining&quot; ...,141
957,0,Damnit I was really digging this season of Rea...,141
1064,0,Why do I keep looking...I know that what I rea...,141


In [10]:
#dataset['&quot' in dataset.text].head(10)


## Data preparation 1: HTML decoding

HTML entities have not been decoded, so they ended up in the text field as '&amp;', '&quot;', etc.

Note: I would put this HTML decoding step into the data preparation pipeline.

In [10]:
quot = []
amp = []
todo = []
for item in dataset.text:
    if '&quot;' in item:
        quot.append(item)
    if '&amp;' in item:
        amp.append(item)
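    # Check each entity separately: the original `'&quot;' and '&amp;' in item`
    # only tested for '&amp;', which is why the third count below equals the second.
    if '&quot;' in item and '&amp;' in item:
        todo.append(item)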
        
        
print(len(quot))
print(len(amp))
print(len(todo))

34141
42927
42927


In [11]:
import lxml.html

def decode_html(html_string):
    # Use the lxml.html module's HTML parser to parse the HTML
    parsed_html = lxml.html.fromstring(html_string)

    # Use the text_content() method to extract the text from the parsed HTML
    return parsed_html.text_content()


In [12]:
html_string = dataset['text'][492]
decoded_text = decode_html(html_string)

print(decoded_text)


pears & Brie, bottle of Cabernet, and "Win a Date With Tad Hamilton"... oh gawwd my life flashed forward to when I'm 40 with my 75 cats 
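As a lighter alternative to parsing each tweet with lxml, the standard library's `html.unescape` decodes entities directly. This is only a sketch of an option, not what the notebook uses; the sample string is a shortened copy of `dataset['text'][492]` shown above.

```python
import html

# Shortened copy of dataset['text'][492] from this notebook
raw = "pears &amp; Brie, bottle of Cabernet, and &quot;Win a Date With Tad Hamilton&quot;..."

# html.unescape converts named and numeric character references to text
decoded = html.unescape(raw)
print(decoded)

# Unlike an HTML parser, it also accepts an empty string without raising
empty_decoded = html.unescape("")
```

One design difference worth keeping in mind: `html.unescape` only decodes entities, whereas `text_content()` also drops anything that looks like an HTML tag.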


In [13]:
dataset['text'][747]

'Have an invite for &quot;Healthy Dining&quot; session at Ashok Hotel today with Exec Chef R.Chopra but damn workload - will have to skip it! '

In [14]:
dataset['text'][343]

'@TheLeagueSF Not Fun &amp; Furious? The new mantra for the Bay 2 Breakers? It was getting 2 rambunctious;the city overreacted &amp; clamped down '

In [15]:
dataset['text'][492]

"pears &amp; Brie, bottle of Cabernet, and &quot;Win a Date With Tad Hamilton&quot;... oh gawwd my life flashed forward to when I'm 40 with my 75 cats "

Removing mentions (user names) and URLs

Mentions and URLs are neither positive nor negative, so they add no value to our sentiment analysis model. The simple function below removes Twitter handles (the text following an '@' symbol) from a tweet.

In [16]:
import re

def remove_handles(tweet):
    # Use a regular expression to find Twitter handles in the tweet
    pattern = r'@\w+'
    return re.sub(pattern, '', tweet)

In [17]:
tweet = dataset['text'][343]
modified_tweet = remove_handles(tweet)

print(modified_tweet)  # the '@TheLeagueSF' handle is removed; HTML entities are left untouched

 Not Fun &amp; Furious? The new mantra for the Bay 2 Breakers? It was getting 2 rambunctious;the city overreacted &amp; clamped down 


This function removes both mentions and URLs.

In [18]:
def remove_urls_and_mentions(tweet):
    # Use a regular expression to match URLs starting with 'http' or 'www'
    tweet = re.sub(r"(?:http|www)\S+", "", tweet)
    
    # Remove mentions
    tweet = re.sub(r"@\S+", "", tweet)
    
    return tweet

In [19]:
tweet = "Check out this cool article I found: http://example.com you can also find it here: www.example.com #fun #article @friend"
cleaned_tweet = remove_urls_and_mentions(tweet)
print(cleaned_tweet)

Check out this cool article I found:  you can also find it here:  #fun #article 


In [20]:
dataset.text[226]

'Tuesdayï¿½ll start with reflection ï¿½n then a lecture in Stress reducing techniques. That sure might become very useful for us accompaniers '

I have not found a way to get the decoding part to work. It stays pending in case I can do it later; otherwise I will mention it in the section on possible future improvements.

In [21]:
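def decoding(tweet):
    # str.replace returns a new string (strings are immutable), so the result
    # must be returned. The original call discarded it, which is why the
    # output printed below is unchanged.
    return tweet.replace('ï¿½', "????")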

In [22]:
tweet = dataset.text[226]
decoded_tweet = decoding(tweet)
print(decoded_tweet)

Tuesdayï¿½ll start with reflection ï¿½n then a lecture in Stress reducing techniques. That sure might become very useful for us accompaniers 


In [23]:
# This seems to work!
raro = []
for item in dataset.text:
    if 'ï¿½' in item:
        twt = item.replace('ï¿½', "'")
        raro.append(twt)
        #raro.append(item)
        
print(len(raro))
print(raro[:11])


4051
["@JonathanRKnight I hate the limited letters,too.Hope you and the guys are fine?I pray for my dog,she's not well ", "Tuesday'll start with reflection 'n then a lecture in Stress reducing techniques. That sure might become very useful for us accompaniers ", "@DonnieWahlberg ooh I'm excited and not even going 2 be there  long love YOUTUBE!", "Time to move my posterior  and lose some fat. My articulation are creaking so no more running  but I'm drool for some swimming", "@Sofii_Noel that's bad ", "Tumblr: This is exactly how it feels wearing a 'tie'  http://tinyurl.com/c8bvqh", "@simX Yeah. I always slow down at the end  ''also, take that! I win.", "Still in bed and don't want to do anything else. University is callung too loud ", "Chi?u nay h?p chu?n b? t? ch?c m?y s? ki?n ? tr??ng ! Bao nhi'u vi?c ", "was super lucky to get a seat on the train. We pay '40 for this 25 min journey. ", "I NEVER THOUGHT THAT I COULD  HATE SOMBODY, BUT I REALLY HATE YOU 'TOBE D....', I ONLY GAVE YOU AL

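A note on where 'ï¿½' may come from. The three characters match the UTF-8 bytes of the Unicode replacement character (U+FFFD) when those bytes are read as latin-1. That suggests the original character was already lost before we loaded the file, so replacing it with an apostrophe is a heuristic. In the output above, "We pay '40 for this 25 min journey" was probably a currency symbol rather than an apostrophe. The sketch below only reproduces the byte-level effect.

```python
# U+FFFD is the Unicode replacement character; its UTF-8 encoding is EF BF BD
replacement_char = "\ufffd"

# Reading those three bytes as latin-1 yields the sequence seen in the tweets
mojibake = replacement_char.encode("utf-8").decode("latin-1")
print(mojibake)

# Heuristic used in this notebook: treat the lost character as an apostrophe
sample = "Tuesday" + mojibake + "ll start with reflection"
repaired = sample.replace(mojibake, "'")
print(repaired)
```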
In [None]:
#Remove double spacing and a space as the last character in the string, as in dataset['text'][249]

In [24]:
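def remove_double_spaces_and_trailing_whitespace(string):
    # Collapse any run of two or more spaces into a single space
    # (a single str.replace("  ", " ") pass leaves a double space behind for runs of three or more)
    string = re.sub(r" {2,}", " ", string)

    # Remove any trailing whitespace
    string = string.rstrip()
    return string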

In [25]:
dataset['text'][279]

"Whinging. My client&amp;boss don't understand English well. Rewrote some text unreadable. It's written by v. good writer&amp;reviewed correctly. "

In [26]:
new_string = remove_double_spaces_and_trailing_whitespace(dataset['text'][279])
print(new_string)

Whinging. My client&amp;boss don't understand English well. Rewrote some text unreadable. It's written by v. good writer&amp;reviewed correctly.


Before removing punctuation, I need to deal with contractions (I'll, she'll, etc.), because once the apostrophe (') is removed they can no longer be fixed. For that I need a function that uses a regular expression to find all instances of a word followed by "'ll" in a given string of text, and replaces them with the word followed by " will".


In [27]:
#This function uses a regular expression to match any word that ends in "'ll" 
#(e.g., "I'll", "you'll", "he'll"), and replaces it with the same word followed 
#by " will". The regular expression uses a capture group (i.e., the parentheses) 
#to capture the word before "'ll", and then uses the \1 backreference to refer 
#to this captured word in the replacement string.

def find_and_replace_coincidence(text):
    pattern = r'(\b\w+)\'ll\b'
    return re.sub(pattern, r'\1 will', text)

#Keep in mind that the pattern is case-sensitive: it matches any word before 
#a lowercase "'ll", so an uppercase "I'LL" is left unchanged.

In [28]:
text = "I'll see you at the movie theater. She'll be there too. I'm sure you'll have a great time."
replaced_text = find_and_replace_coincidence(text)
print(replaced_text)

I will see you at the movie theater. She will be there too. I'm sure you will have a great time.


Next, I will try to do the same for negations: shouldn't -> should not, etc.

There are special cases such as won't (will not) and can't (cannot).

The function below uses three regular expressions to find and replace contractions in the input text. The first matches "won't" specifically and replaces it with "will not". The second matches "can't" and replaces it with "cannot". The third matches any other word ending in "n't" and replaces it with the word followed by " not".

In [29]:
def find_and_replace_negation(text):
    pattern = r'(\bw)on\'t\b'
    text = re.sub(pattern, r'\1ill not', text)
    pattern = r'(\bcan)\'t\b'
    text = re.sub(pattern, r'\1not', text)
    pattern = r'(\b\w+)n\'t\b'
    return re.sub(pattern, r'\1 not', text)

In [30]:
text = "I shouldn't have eaten that last slice of pizza. You won't believe how full I am now. I can't even breathe. I didn't eat pizza since I haven't got my money."
replaced_text = find_and_replace_negation(text)
print(replaced_text)

I should not have eaten that last slice of pizza. You will not believe how full I am now. I cannot even breathe. I did not eat pizza since I have not got my money.
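One limitation of the generic "n't" rule is that irregular forms are split incorrectly: "shan't" becomes "sha not" and "ain't" becomes "ai not". A possible way around this, sketched below, is a small lookup table applied before the generic rule. The table `IRREGULAR_NEGATIONS` and the function name `expand_negations` are hypothetical, and mapping "ain't" to "is not" is a simplification.

```python
import re

# Hypothetical lookup table for irregular negations; extend as needed
IRREGULAR_NEGATIONS = {
    "won't": "will not",
    "can't": "cannot",
    "shan't": "shall not",
    "ain't": "is not",  # simplification: "ain't" can also stand for "am not", "are not", ...
}

# One alternation covering every key in the table
IRREGULAR_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in IRREGULAR_NEGATIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_negations(text):
    # Irregular forms first; the replacement is always lowercase,
    # which is fine here because the text is lowercased later anyway
    text = IRREGULAR_PATTERN.sub(
        lambda m: IRREGULAR_NEGATIONS[m.group(0).lower()], text
    )
    # Then the generic "<word>n't" -> "<word> not" rule used above
    return re.sub(r"(\b\w+)n't\b", r"\1 not", text)

expanded = expand_negations("I shan't go and you won't either. She didn't call.")
print(expanded)
```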


In [31]:
#Remove some punctuation such as '#'. The reason why we don't remove the text in hashtags (e.g., #hello)
#is that it might add useful information regarding the sentiment of the tweet (#good #bad #happy #awful)

It can be beneficial to remove punctuation when performing sentiment analysis, as punctuation may not always carry a lot of meaning in determining the sentiment of a piece of text. For example, an exclamation point (!) may indicate strong emotion, but it is not necessary for understanding the overall sentiment of a text.

On the other hand, punctuation can also convey important information and context, and removing it may result in the loss of some important features of the text. For example, a question mark (?) may indicate that the speaker is unsure or seeking clarification, which could affect the sentiment of the text.

It may be helpful to try both approaches and see which one performs better on our dataset, and then make a decision based on the results.

In [32]:
#This function uses a regular expression to match any character that is not a word 
#character (i.e., a letter, digit, or underscore) or a whitespace character, and replaces it 
#with an empty string. This removes punctuation and symbols such as emoji. Note that in 
#Python 3, \w also matches letters from non-Latin scripts, so those are kept, as are digits.

def remove_punctuation(text):
    text = re.sub(r'[^\w\s]', '', text)
    return text

In [33]:
texto = "The movie was really #good! I can't wait to see it again!!! I'll go to the cinema next week :)"
no_punc = remove_punctuation(texto)
print(no_punc)

The movie was really good I cant wait to see it again Ill go to the cinema next week 
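To make the behaviour of `[^\w\s]` concrete, the sketch below compares it with a stricter ASCII-letters-only pattern. The second pattern and the sample string are assumptions for illustration only; which one is preferable depends on whether accented letters, digits and underscores should survive this step.

```python
import re

def remove_punctuation(text):
    # Same pattern as above: keep word characters and whitespace
    return re.sub(r"[^\w\s]", "", text)

def keep_ascii_letters_only(text):
    # Hypothetical stricter variant: keep only a-z, A-Z and whitespace
    return re.sub(r"[^a-zA-Z\s]", "", text)

sample = "café_2009 is #good :)"

unicode_aware = remove_punctuation(sample)    # keeps the accented letter, underscore and digits
ascii_only = keep_ascii_letters_only(sample)  # drops them as well

print(repr(unicode_aware))
print(repr(ascii_only))
```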


Removing numbers:

It is often helpful to remove numbers when performing sentiment analysis, since numbers typically carry little sentiment on their own and can distract the model from the sentiment-bearing words and phrases in the text.

Other common preprocessing steps include lowercasing the text, removing punctuation, and stemming or lemmatizing words. These steps can help the LSTM learn the underlying patterns in the data.

In [34]:
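num = []
numeros = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
for item in dataset.text:
    for n in numeros:
        if n in item:
            num.append(item)
            # Stop at the first digit found so each tweet is counted once. Without this
            # break, a tweet is appended once per distinct digit it contains, which
            # inflates the count (the figure printed below came from that version).
            break

print(len(num))
#print(num[:11])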

711185


Attempt at a single function that does all of the above

In [200]:
import re

def data_preprocessing(text):
    
    # Using the lxml.html module's HTML parser to parse the HTML
    parsed_html = lxml.html.fromstring(text)
    # Using the text_content() method to extract the text from the parsed HTML
    tweet = parsed_html.text_content()
    
    # Replacing weird characters (probably because of the latin-1 encoding) -- sheï¿½s = she's
    tweet = tweet.replace('ï¿½', "'")
    
    # Use the sub function to replace all email addresses with an empty string
    tweet = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', tweet)
    

    # Using a regular expression to match URLs starting with 'http' or 'www'
    tweet = re.sub(r"(?:http|www)\S+", "", tweet)

    # Removing mentions / usernames
    tweet = re.sub(r"@\S+", "", tweet)
    
    # Add a space after every comma, period, question mark and exclamation mark,
    # so that words joined only by punctuation (e.g. "there,my") stay separate
    # once punctuation is removed further down.
    tweet = re.sub(r'[,\.\?!]', lambda x: x.group() + ' ', tweet)
    
    # Replacing I'm / i'm with I am
    tweet = re.sub(r'(I|i)\'m', r'\1 am', tweet)
    
    # Replacing won't with will not (EXPLANATION BELOW)
    tweet = re.sub(r'(W|w)on\'t', r'\g<1>ill not', tweet)
    
    # Replacing Can't / can't with cannot
    tweet = re.sub(r'(can|Can)\'t', r'\1not', tweet)
    
    # Replacing instances of a character followed by "n't" contractions with the character followed by " not"
    tweet = re.sub(r'(\b\w+)n\'t\b', r'\1 not', tweet)
    
    # Replacing 'll with will
    pattern_will = r'(\b\w+)\'ll\b'
    tweet =  re.sub(pattern_will, r'\1 will', tweet)
    
    # Replacing 've with have
    pattern_ve = r'(\b\w+)\'ve\b'
    tweet = re.sub(pattern_ve, r'\1 have', tweet)
    
    # Removing every character that is neither a word character nor whitespace
    tweet = re.sub(r'[^\w\s]', '', tweet)
    
    # Removing digits
    tweet = re.sub(r'\d', '', tweet)
    
    # Collapsing runs of two or more spaces into a single space
    # (tokenization may make this step unnecessary)
    tweet = re.sub(r' {2,}', ' ', tweet)
  
    # Removing any trailing whitespace
    tweet = tweet.rstrip()
    
    # Lowercase
    tweet = tweet.lower()
    
    
    return tweet


Explanation: we expand negations (replacing won't with will not, can't with cannot, and shouldn't, couldn't, etc. with should not, could not, etc.) because punctuation is removed later in the preprocessing. Without the expansion, can't would end up as "cant" (or as "can t" if the apostrophe were replaced by a space rather than deleted). In the second case, stopword removal would then drop the "t", since a single character adds no value, leaving just "can". Since our purpose is to analyse the sentiment of tweets, this matters: something that originally was "can't", and possibly carried a negative sentiment, would turn into "can" after preprocessing, which could be interpreted as positive. We would end up with a false positive.
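The effect of the ordering can be checked on a single tweet. This is a minimal sketch that copies only two of the patterns from `data_preprocessing`; the helper names `strip_punctuation` and `expand_cant` are hypothetical. With the pattern used here the apostrophe is deleted, so the unexpanded form comes out as "cant".

```python
import re

def strip_punctuation(text):
    # Same pattern as the punctuation step in data_preprocessing
    return re.sub(r"[^\w\s]", "", text)

def expand_cant(text):
    # Same pattern as the can't step in data_preprocessing
    return re.sub(r"(can|Can)'t", r"\1not", text)

tweet = "I can't be there"

# Punctuation removed without expanding first: the negation is hidden in "cant"
without_expansion = strip_punctuation(tweet)

# Expanding first keeps an explicit negation
with_expansion = strip_punctuation(expand_cant(tweet))

print(without_expansion)
print(with_expansion)
```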

In [140]:
example = "I can't be there,my cat is sick. @example I pray for my dog, sheï¿½s not well. Have an invite for &quot;Healthy Dining&quot;. She'll be ok. I've had a great time. Double space here:  #example #testing Mira este link: https://ejemplo.com or here too: www.hola.es. I won't be there. My email is anagzl@student.gu.se I can't be there. You shouldn't be there. Today is my 29th birthday! "

In [90]:
example2 = "I'm silly @NKDreamer did you see Donnie's tweet stats? almost 700 @ replies...and no JRK "

In [92]:
example3 = "i'm ok,just found out I was working for nothing!how are you doing?hope well!"

In [149]:
example4 = "@BobMetcalfe I had the same experience this week. New web anita@gmail.com Won't work. I won't go home. I can't make it today.I was at 200 and am back to 204. The bicycle is calling me already.  Can't until 2nite@ "

In [202]:
result = data_preprocessing("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D")
print(result)

 a thats a bummer you shoulda got david carr of third day to do it d
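Note that "Awww," in the input became "a" in the output above. The most likely cause is the URL pattern `(?:http|www)\S+`, which also matches the "www," inside "Awww,". The sketch below compares it with a stricter pattern that must start at a word boundary and look like a URL prefix. `STRICT_URL` is an assumption for illustration, not the pattern used to produce the results in this notebook, and the sample text is made up from pieces of the tweet above.

```python
import re

# Pattern used in data_preprocessing
LOOSE_URL = re.compile(r"(?:http|www)\S+")

# Assumed stricter variant: word boundary, then "http://", "https://" or "www."
STRICT_URL = re.compile(r"\b(?:https?://|www\.)\S+")

text = "Awww, that's a bummer. See http://twitpic.com/2y1zl or www.example.com"

loose = LOOSE_URL.sub("", text)
strict = STRICT_URL.sub("", text)

print(repr(loose))   # "Awww," is truncated
print(repr(strict))  # "Awww," is preserved, both URLs are removed
```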


In [174]:
%%time

print('Preprocessing data...\n')
clean_data = []
num = 0
for item in dataset['text']:
    clean_data.append(data_preprocessing(item))
    num += 1
    if num % 100000 == 0:
        print(f"Tweets {num} of 1600000 have been processed")

Preprocessing data...

Tweets 100000 of 1600000 have been processed
Tweets 200000 of 1600000 have been processed
Tweets 300000 of 1600000 have been processed
Tweets 400000 of 1600000 have been processed
Tweets 500000 of 1600000 have been processed
Tweets 600000 of 1600000 have been processed
Tweets 700000 of 1600000 have been processed
Tweets 800000 of 1600000 have been processed
Tweets 900000 of 1600000 have been processed
Tweets 1000000 of 1600000 have been processed
Tweets 1100000 of 1600000 have been processed
Tweets 1200000 of 1600000 have been processed
Tweets 1300000 of 1600000 have been processed
Tweets 1400000 of 1600000 have been processed
Tweets 1500000 of 1600000 have been processed
Tweets 1600000 of 1600000 have been processed
CPU times: user 2min 37s, sys: 248 ms, total: 2min 38s
Wall time: 2min 39s


trying stuff

%%time

print("Preprocessing data...\n")
clean_data = []

for i in range(0, 800000):
    if (i+1) % 10000 == 0:
        print(f"Tweets {i+1} of 800000 have been processed")
    # Using the at[] method of the DataFrame to access the i-th element of the 'text' column
    clean_data.append(data_preprocessing(dataset.at[i, "text"]))

In [176]:
# Creating a Pandas DataFrame from the list of preprocessed tweets
clean_df = pd.DataFrame(clean_data,columns=['text'])
clean_df['sentiment'] = dataset.sentiment
clean_df.head(10)

Unnamed: 0,text,sentiment
0,a thats a bummer you shoulda got david carr o...,0
1,is upset that he cannot update his facebook by...,0
2,i dived many times for the ball managed to sa...,0
3,my whole body feels itchy and like its on fire,0
4,no its not behaving at all i am mad why am i ...,0
5,not the whole crew,0
6,need a hug,0
7,hey long time no see yes rains a bit only a b...,0
8,nope they did not have it,0
9,que me muera,0


In [179]:
# Save the DataFrame to a CSV file
clean_df.to_csv('clean_data.csv', encoding='utf-8')

In [180]:
#Loading the preprocessed data as a pandas dataframe
data_df = pd.read_csv('clean_data.csv', index_col=0)

In [181]:
data_df.head()

Unnamed: 0,text,sentiment
0,a thats a bummer you shoulda got david carr o...,0
1,is upset that he cannot update his facebook by...,0
2,i dived many times for the ball managed to sa...,0
3,my whole body feels itchy and like its on fire,0
4,no its not behaving at all i am mad why am i ...,0


In [205]:
data_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1600000 entries, 0 to 1599999
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   text       1596712 non-null  object
 1   sentiment  1600000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 36.6+ MB


In [207]:
data_df[data_df.isnull().any(axis=1)]


Unnamed: 0,text,sentiment
208,,0
249,,0
282,,0
398,,0
430,,0
...,...,...
1597326,,4
1597684,,4
1598272,,4
1599494,,4
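A likely explanation for these null rows: some tweets are empty strings after preprocessing (for example, a tweet that contained only a mention or a URL), and `pd.read_csv` reads empty fields back as NaN by default. The sketch below reproduces this on a toy DataFrame and shows one option, dropping those rows with `dropna`. The toy data is made up.

```python
import io
import pandas as pd

# Toy frame: the second tweet is empty after cleaning
toy = pd.DataFrame({"text": ["need a hug", ""], "sentiment": [0, 4]})

# Round-trip through CSV in memory, as done with clean_data.csv above
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# The empty string comes back as NaN
n_missing = int(reloaded["text"].isnull().sum())
print(n_missing)

# One option: drop the empty rows before tokenization and training
non_empty = reloaded.dropna(subset=["text"]).reset_index(drop=True)
print(len(non_empty))
```

Dropping the rows is only one choice; passing `keep_default_na=False` to `pd.read_csv` would keep them as empty strings instead.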


Next steps: stop words, tokenization, stemming/lemmatization (check SA_model_LSTM)

## Attention: following the step-by-step tutorial in PDF (SA_with_LSTM_tutorial.pdf) saved locally.

We'll be following some of these steps but applied to tweets and not Amazon reviews:

1. Load in and visualize the data
2. Data Processing — convert to lower case
3. Data Processing — Remove punctuation
4. Data Processing — Create list of reviews
5. Tokenize — Create Vocab to Int mapping dictionary
6. Tokenize — Encode the words
7. Tokenize — Encode the labels
8. Analyze Reviews Length
9. Removing Outliers — Getting rid of extremely long or short reviews
10. Padding / Truncating the remaining data
11. Training, Validation, Test Dataset Split
12. Dataloaders and Batching
13. Define the LSTM Network Architecture
14. Define the Model Class
15. Training the Network
16. Testing (on Test data and User-generated data)
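Steps 5, 6 and 10 of the list can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the tutorial's code: the three toy tweets are made up (two mirror cleaned tweets shown earlier), index 0 is reserved for padding, and sequences are left-padded.

```python
from collections import Counter

# Toy cleaned tweets (hypothetical sample)
tweets = ["need a hug", "not the whole crew", "i need a break"]

# Step 5: vocab-to-int mapping; most frequent words get the smallest indices,
# starting at 1 so that 0 can be used for padding
counts = Counter(word for tweet in tweets for word in tweet.split())
vocab_to_int = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Step 6: encode each tweet as a list of integers
encoded = [[vocab_to_int[word] for word in tweet.split()] for tweet in tweets]

# Step 10: truncate long sequences and left-pad short ones with zeros
def pad_or_truncate(sequence, seq_length):
    sequence = sequence[:seq_length]
    return [0] * (seq_length - len(sequence)) + sequence

padded = [pad_or_truncate(seq, seq_length=4) for seq in encoded]
print(padded)
```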

### what about stopwords?

1. Load in and visualize the data - it's already done above

Next: 2. Data Processing — convert to lower case

In [57]:
dataset

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
...,...,...
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...


In [56]:
def data_cleaner(data):
    # Lowercase the text column (note: this modifies the DataFrame passed in, in place)
    data['text'] = data['text'].str.lower()
    return data

In [42]:
clean = data_cleaner(dataset)

In [45]:
clean

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - awww, t..."
1,0,is upset that he can't update his facebook by ...
2,0,@kenichan i dived many times for the ball. man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
...,...,...
1599995,4,just woke up. having no school is the best fee...
1599996,4,thewdb.com - very cool to hear old walt interv...
1599997,4,are you ready for your mojo makeover? ask me f...
1599998,4,happy 38th birthday to my boo of alll time!!! ...


In [60]:
dataset

Unnamed: 0,sentiment,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."
...,...,...
1599995,4,Just woke up. Having no school is the best fee...
1599996,4,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,Happy 38th Birthday to my boo of alll time!!! ...


## Removing mentions - @


Although mentions (@username) carry some information, they don't add value to the sentiment analysis model.

In [59]:
import re

In [61]:
dataset['text'] = re.sub(r"@[A-Za-z0-9]+","",dataset['text'])

TypeError: expected string or bytes-like object
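The TypeError arises because `re.sub` expects a single string, while `dataset['text']` is a pandas Series. A vectorized equivalent is `Series.str.replace` with `regex=True`, sketched below on a toy Series (the two sample tweets are copied from the first rows of the dataset).

```python
import pandas as pd

# Toy Series standing in for dataset['text']
toy = pd.Series(["@Kwesidei not the whole crew", "Need a hug"])

# Series.str.replace applies the pattern to every element of the Series
without_mentions = toy.str.replace(r"@[A-Za-z0-9]+", "", regex=True)
print(without_mentions.tolist())
```

Note that `[A-Za-z0-9]+` stops at an underscore, so a handle such as "@Tatiana_K" would leave "_K" behind; `@\w+` avoids that.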

In [63]:
testing = dataset.text[:10]

In [65]:
def cleaning(text):
    lower_case = text.lower()
    return lower_case

In [66]:
testing

0    @switchfoot http://twitpic.com/2y1zl - Awww, t...
1    is upset that he can't update his Facebook by ...
2    @Kenichan I dived many times for the ball. Man...
3      my whole body feels itchy and like its on fire 
4    @nationwideclass no, it's not behaving at all....
5                        @Kwesidei not the whole crew 
6                                          Need a hug 
7    @LOLTrish hey  long time no see! Yes.. Rains a...
8                 @Tatiana_K nope they didn't have it 
9                            @twittera que me muera ? 
Name: text, dtype: object

In [68]:
test_result = []
for t in testing:
    test_result.append(cleaning(t))

In [69]:
test_result

["@switchfoot http://twitpic.com/2y1zl - awww, that's a bummer.  you shoulda got david carr of third day to do it. ;d",
 "is upset that he can't update his facebook by texting it... and might cry as a result  school today also. blah!",
 '@kenichan i dived many times for the ball. managed to save 50%  the rest go out of bounds',
 'my whole body feels itchy and like its on fire ',
 "@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because i can't see you all over there. ",
 '@kwesidei not the whole crew ',
 'need a hug ',
 "@loltrish hey  long time no see! yes.. rains a bit ,only a bit  lol , i'm fine thanks , how's you ?",
 "@tatiana_k nope they didn't have it ",
 '@twittera que me muera ? ']

Not sure about this:

Next step: convert the sentiment (categorical) values into numerical values so that the model can work with them. We will use...



"The labels for this dataset are categorical. Machines understand only numeric data. So, convert the categorical values to numeric using the factorize() method. This returns an array of numeric values and an Index of categories."

https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair
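For reference, here is how `factorize()` behaves on labels like ours. In Sentiment140 the labels are already integers (0 and 4), so this step may not be strictly needed. The sketch below also shows an explicit mapping to 0/1, an alternative I am assuming here rather than something taken from the quoted article. The toy labels are made up.

```python
import pandas as pd

# Toy labels: 0 = negative, 4 = positive
labels = pd.Series([0, 0, 4, 4, 0])

# factorize assigns codes in order of first appearance and returns the categories
codes, uniques = pd.factorize(labels)
print(codes)
print(uniques)

# Explicit mapping: the result does not depend on the order of the rows
binary = labels.map({0: 0, 4: 1})
print(binary.tolist())
```

Because `factorize()` numbers categories in order of appearance, the explicit mapping is the more predictable choice if the data is ever shuffled before this step.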