<a href="https://colab.research.google.com/github/prarthanaVengurlekar5/NLP/blob/main/TWEETER_SENTIMENT_ANALYSIS_DATASET.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets are annotated with a polarity label (0 = negative, 4 = positive) and can be used to train a sentiment classifier.

**Content**

It contains the following 6 fields:

**target**: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

**ids**: the id of the tweet (2087)

**date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)

**flag**: the query (lyx). If there is no query, this value is NO_QUERY.

**user**: the user that tweeted (robotickilldozr)

**text**: the text of the tweet (Lyx is cool)

## Step 1: Read the Data

In [None]:
import pandas as pd
import re

import warnings
warnings.filterwarnings('ignore')

In [None]:
columns=['target','ids','date','flag','user','text']

In [None]:
path='/content/drive/MyDrive/NLP/training.1600000.processed.noemoticon.csv'
df=pd.read_csv(path,encoding='ISO-8859-1',names=columns)
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [None]:
df.tail(1)

Unnamed: 0,target,ids,date,flag,user,text
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


In [None]:
# Take an explicit copy so later column assignments do not act on a view of df
dataset=df[['text','target']].copy()
dataset.head()

Unnamed: 0,text,target
0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",0
1,is upset that he can't update his Facebook by ...,0
2,@Kenichan I dived many times for the ball. Man...,0
3,my whole body feels itchy and like its on fire,0
4,"@nationwideclass no, it's not behaving at all....",0


## Step 2: Remap the Target Column

In [None]:
dataset.target.unique()

array([0, 4])

In [None]:
dataset['target']=dataset['target'].replace(4,1)
dataset.target.unique()

array([0, 1])

## Step 3: Handling the Missing Values

In [None]:
dataset.isna().sum()

text      0
target    0
dtype: int64

## Step 4: Text Processing

### Step 4.1: Remove URLs

In [None]:
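# First attempt: this pattern requires a literal "https", so the http:// link below is left untouched
pattern=re.compile(r'https.*?(?=\s)')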
pattern.sub(r'',str(dataset['text'][0]))

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [None]:
pattern=re.compile(r'https?:\/\/\S+')
pattern.sub(r'',str(dataset['text'][0]))

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"

In [None]:
def remove_url(text):
  pattern=re.compile(r'https?:\/\/\S+')
  return pattern.sub(r'',text)

In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_url(x))

In [None]:
dataset['text'].head()

0    @switchfoot  - Awww, that's a bummer.  You sho...
1    is upset that he can't update his Facebook by ...
2    @Kenichan I dived many times for the ball. Man...
3      my whole body feels itchy and like its on fire 
4    @nationwideclass no, it's not behaving at all....
Name: text, dtype: object
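
The pattern above only catches links that start with `http://` or `https://`. As a minimal sketch, assuming bare `www.` links should be dropped as well (the name `strip_links` and the extra `www.` branch are illustrative and not part of the notebook), the pattern can be compiled once and reused:

```python
import re

# Compiled once at import time; matches http://, https:// and bare www. links.
# Treating www. prefixes as links is an assumption made for this sketch.
LINK_RE = re.compile(r'(?:https?://|www\.)\S+')

def strip_links(text):
    """Return text with every matched link removed."""
    return LINK_RE.sub('', text)

print(strip_links('@switchfoot http://twitpic.com/2y1zl - Awww'))
print(strip_links('see www.example.com now'))
```

Compiling at module level avoids rebuilding the pattern for each of the 1.6 million rows.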

### Step 4.2: Remove HTML Tags

In [None]:
def remove_tags(text):
  pattern=re.compile(r'<.*?>')
  return pattern.sub(r'',text)

In [None]:
text='<p>Save the document by pressing <kbd>Ctrl + S</kbd></p>'
remove_tags(text)

'Save the document by pressing Ctrl + S'

In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_tags(x))
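
Tweets rarely contain literal tags, but they often contain HTML entities. The token `amp` appears among the most frequent entries of the Keras word index later in this notebook, which suggests that `&amp;` entities were still present at this point and lost their `&` and `;` in the punctuation step. A minimal sketch, assuming entities should be decoded right after tags are stripped (`strip_html` is a hypothetical helper, not the notebook's function):

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def strip_html(text):
    """Drop tags first, then turn entities such as &amp; back into characters."""
    without_tags = TAG_RE.sub('', text)
    return html.unescape(without_tags)

print(strip_html('Tom &amp; Jerry <b>rock</b>'))  # Tom & Jerry rock
```

Stripping tags before decoding keeps escaped angle brackets such as `&lt;3` as text instead of treating them as markup.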

### Step 4.3: Handling Emoticons

In [None]:
# Emoticon-to-label lookup table
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad',
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed',
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink',
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat',';D':'laughing'}


In [None]:
'Emoji'+emojis[':)']

'Emojismile'

In [None]:
def remove_emoticons(text):
  for emoji in emojis:
    text=text.replace(emoji, 'Emoji'+emojis[emoji])
  return text

In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_emoticons(x))
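
Because `remove_emoticons` applies the replacements in dictionary order, a short emoticon can be replaced inside a longer one. For example, `O:-)` contains `:-)`, which is handled earlier, so the text becomes `OEmojismile` and the `angel` entry never fires. One possible alternative, sketched here with a small illustrative subset of the table, is a single regular expression that tries the longest emoticons first:

```python
import re

# A small, illustrative subset of the emoticon table.
EMOTICONS = {':)': 'smile', ':-)': 'smile', 'O:-)': 'angel', ';D': 'laughing'}

# Longest keys first so 'O:-)' wins over the ':-)' it contains.
_keys = sorted(EMOTICONS, key=len, reverse=True)
EMOTICON_RE = re.compile('|'.join(re.escape(k) for k in _keys))

def replace_emoticons(text):
    """Replace each emoticon in a single left-to-right pass."""
    return EMOTICON_RE.sub(lambda m: 'Emoji' + EMOTICONS[m.group(0)], text)

print(replace_emoticons('bless you O:-)'))  # bless you Emojiangel
print(replace_emoticons('do it. ;D'))       # do it. Emojilaughing
```

A single pass also avoids scanning every tweet once per emoticon.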

In [None]:
dataset['text'][0]

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. Emojilaughing"

### Step 4.4: Handling Emojis

In [None]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Installing collected packages: emoji
Successfully installed emoji-2.10.1


In [None]:
text='Business: We open at 10. 😀'

import emoji
print(type(emoji.demojize(text)))

<class 'str'>


In [None]:
def remove_emoji(text):
  return emoji.demojize(text)

In [None]:
remove_emoji(''' Business: Hi Jane, I am so sorry to hear this. 😯 Please tell me how I can help. ''')

' Business: Hi Jane, I am so sorry to hear this. :hushed_face: Please tell me how I can help. '

In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_emoji(x))

In [None]:
dataset['text'][0]

"@switchfoot  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. Emojilaughing"

### Step 4.5: Handling User Mentions

In [None]:
def handle_user(text):
  pattern=re.compile(r'@[^\s]+')
  text=pattern.sub('TUSER', text)

  return text

In [None]:
handle_user(dataset['text'][0])

"TUSER  - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. Emojilaughing"

In [None]:
dataset['text']=dataset['text'].apply(lambda x: handle_user(x))

In [None]:
dataset['text']

0          TUSER  - Awww, that's a bummer.  You shoulda g...
1          is upset that he can't update his Facebook by ...
2          TUSER I dived many times for the ball. Managed...
3            my whole body feels itchy and like its on fire 
4          TUSER no, it's not behaving at all. i'm mad. w...
                                 ...                        
1599995    Just woke up. Having no school is the best fee...
1599996    TheWDB.com - Very cool to hear old Walt interv...
1599997    Are you ready for your MoJo Makeover? Ask me f...
1599998    Happy 38th Birthday to my boo of alll time!!! ...
1599999             happy #charitytuesday TUSER TUSER TUSER 
Name: text, Length: 1600000, dtype: object
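
The pattern `@[^\s]+` matches an `@` and everything up to the next whitespace, so it also swallows trailing punctuation and fires inside e-mail addresses. A minimal sketch of a stricter variant, assuming a handle consists only of letters, digits, and underscores (`mask_mentions` is a hypothetical name):

```python
import re

# '@' must not follow a word character, so 'me@example.com' is left alone.
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def mask_mentions(text, placeholder='TUSER'):
    """Replace each @handle with a fixed placeholder token."""
    return MENTION_RE.sub(placeholder, text)

print(mask_mentions('@Kenichan, mail me@example.com'))
# TUSER, mail me@example.com
```

Whether e-mail addresses matter in this corpus is not checked here; the sketch only shows how the boundary condition could be expressed.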

### Step 4.6: Remove Punctuation

In [None]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [None]:
punc=string.punctuation

def remove_punc(text):
  return text.translate(str.maketrans("","",punc))

In [None]:
remove_punc('Hi ! How are you ?')

'Hi  How are you '

In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_punc(x))

In [None]:
dataset['text']

0          TUSER   Awww thats a bummer  You shoulda got D...
1          is upset that he cant update his Facebook by t...
2          TUSER I dived many times for the ball Managed ...
3            my whole body feels itchy and like its on fire 
4          TUSER no its not behaving at all im mad why am...
                                 ...                        
1599995    Just woke up Having no school is the best feel...
1599996    TheWDBcom  Very cool to hear old Walt intervie...
1599997    Are you ready for your MoJo Makeover Ask me fo...
1599998    Happy 38th Birthday to my boo of alll time Tup...
1599999              happy charitytuesday TUSER TUSER TUSER 
Name: text, Length: 1600000, dtype: object
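
Deleting punctuation outright glues neighbouring fragments together, as `TheWDB.com` becoming `TheWDBcom` in the output above shows. The sketch below builds the translation table once and also shows an alternative that maps punctuation to spaces. Both helper names are hypothetical, and which variant works better for the classifier is not evaluated here.

```python
import string

# Build each table once instead of on every call.
DELETE_TABLE = str.maketrans('', '', string.punctuation)
SPACE_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def remove_punc_fast(text):
    """Delete punctuation, same result as remove_punc."""
    return text.translate(DELETE_TABLE)

def punc_to_space(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return ' '.join(text.translate(SPACE_TABLE).split())

print(remove_punc_fast('Hi ! How are you ?'))   # 'Hi  How are you '
print(punc_to_space('TheWDB.com - Very cool'))  # 'TheWDB com Very cool'
```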

### Step 4.7: Expand Chat Words and Slang

In [None]:
slang='/content/drive/MyDrive/NLP/slang.txt'

In [None]:
slang

'/content/drive/MyDrive/NLP/slang.txt'

In [None]:
with open(slang, 'r') as f:
  lines=f.readlines()

In [None]:
lines[0]

'AFAIK=As Far As I Know\n'

In [None]:
lines[0].split('=')

['AFAIK', 'As Far As I Know\n']

In [None]:
lines[0].split('=')[0]

'AFAIK'

In [None]:
lines[0].split('=')[1]

'As Far As I Know\n'

In [None]:
lines[0].split('=')[1][:-1]

'As Far As I Know'

In [None]:
slang_dict={}
for line in lines:
  if '=' not in line:
    continue  # skip blank or malformed lines
  key,value=line.split('=',1)
  slang_dict[key]=value.rstrip('\n')

In [None]:
slang_dict

{'AFAIK': 'As Far As I Know',
 'AFK': 'Away From Keyboard',
 'ASAP': 'As Soon As Possible',
 'ATK': 'At The Keyboard',
 'ATM': 'At The Moment',
 'A3': 'Anytime, Anywhere, Anyplace',
 'BAK': 'Back At Keyboard',
 'BBL': 'Be Back Later',
 'BBS': 'Be Back Soon',
 'BFN': 'Bye For Now',
 'B4N': 'Bye For Now',
 'BRB': 'Be Right Back',
 'BRT': 'Be Right There',
 'BTW': 'By The Way',
 'B4': 'Before',
 'CU': 'See You',
 'CUL8R': 'See You Later',
 'CYA': 'See You',
 'FAQ': 'Frequently Asked Questions',
 'FC': 'Fingers Crossed',
 'FWIW': "For What It's Worth",
 'FYI': 'For Your Information',
 'GAL': 'Get A Life',
 'GG': 'Good Game',
 'GN': 'Good Night',
 'GMTA': 'Great Minds Think Alike',
 'GR8': 'Great!',
 'G9': 'Genius',
 'IC': 'I See',
 'ICQ': 'I Seek you (also a chat program)',
 'ILU': 'ILU: I Love You',
 'IMHO': 'In My Honest/Humble Opinion',
 'IMO': 'In My Opinion',
 'IOW': 'In Other Words',
 'IRL': 'In Real Life',
 'KISS': 'Keep It Simple, Stupid',
 'LDR': 'Long Distance Relationship',
 'LM

In [None]:
def remove_chatwords(text):
  new_text=[]
  for w in text.split():
    if w.upper() in slang_dict:
      new_text.append(slang_dict[w.upper()])
    else:
      new_text.append(w)

  return " ".join(new_text)



In [None]:
remove_chatwords('rofl :This is so funny')

'Rolling On The Floor Laughing :This is so funny'

In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_chatwords(x))

In [None]:
dataset['text']

0          TUSER Awww thats a bummer You shoulda got Davi...
1          is upset that he cant update his Facebook by t...
2          TUSER I dived many times for the ball Managed ...
3             my whole body feels itchy and like its on fire
4          TUSER no its not behaving at all im mad why am...
                                 ...                        
1599995    Just woke up Having no school is the best feel...
1599996    TheWDBcom Very cool to hear old Walt interview...
1599997    Are you ready for your MoJo Makeover Ask me fo...
1599998    Happy 38th Birthday to my boo of alll Tears in...
1599999               happy charitytuesday TUSER TUSER TUSER
Name: text, Length: 1600000, dtype: object
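
Because the lookup is case-insensitive, an ordinary word that happens to match an abbreviation is expanded too. Row 1599998 above shows this: the word `time` was replaced by an expansion beginning with `Tears in`. A minimal sketch of one possible guard, using a hypothetical two-entry table and an assumed list of protected words:

```python
# Hypothetical two-entry table; the real one is loaded from slang.txt.
SLANG = {'BRB': 'Be Right Back', 'TIME': 'Tears In My Eyes'}

# Assumed list of ordinary words that must never be expanded.
PROTECTED = {'time'}

def expand_chatwords(text, slang=SLANG, protected=PROTECTED):
    """Expand slang tokens unless the lowercase token is protected."""
    out = []
    for w in text.split():
        if w.lower() not in protected and w.upper() in slang:
            out.append(slang[w.upper()])
        else:
            out.append(w)
    return ' '.join(out)

print(expand_chatwords('brb in no time'))  # Be Right Back in no time
```

Another option would be to expand only tokens that were typed in upper case, at the cost of missing lower-case slang such as `brb`.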

### Step 4.8: Make Lower Case

In [None]:
dataset['text']=dataset['text'].str.lower()

In [None]:
dataset['text']

0          tuser awww thats a bummer you shoulda got davi...
1          is upset that he cant update his facebook by t...
2          tuser i dived many times for the ball managed ...
3             my whole body feels itchy and like its on fire
4          tuser no its not behaving at all im mad why am...
                                 ...                        
1599995    just woke up having no school is the best feel...
1599996    thewdbcom very cool to hear old walt interview...
1599997    are you ready for your mojo makeover ask me fo...
1599998    happy 38th birthday to my boo of alll tears in...
1599999               happy charitytuesday tuser tuser tuser
Name: text, Length: 1600000, dtype: object

### Step 4.9: Spelling Correction

In [None]:
!pip install textblob



In [None]:
from textblob import TextBlob

str(TextBlob('I lvoe my INDIA').correct())

'I love my INDIA'

In [None]:
from textblob import TextBlob

str(TextBlob('I lvoe pizza').correct())

'I love penza'

In [None]:
text='this is not ture'
tl=text.split()

In [None]:
" ".join([str(TextBlob(i).correct()) for i in tl])

'this is not true'

In [None]:
!pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.6.1-py3-none-any.whl size=622364 sha256=92ae931c00b818e39c37421bcdc64ef1794a0e16698d49919f7e6da8254a328b
  Stored in directory: /root/.cache/pip/wheels/b5/7b/6d/b76b29ce11ff8e2521c8c7dd0e5bfee4fb1789d76193124343
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.6.1


In [None]:
from autocorrect import Speller
spell = Speller(lang='en')
print([spell(i) for i in tl])


['this', 'is', 'not', 'true']


In [None]:
!pip install pyspellchecker==0.5.6

Collecting pyspellchecker==0.5.6
  Downloading pyspellchecker-0.5.6-py2.py3-none-any.whl (2.5 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.5.6


In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

happenning
{'hapening', 'happenning'}


In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['i', 'maje', 'pizza', 'in'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Note: None of the spell-correction modules tried above gave reliable results, so spelling correction is not applied to the dataset.

### Step 4.10: Tokenization

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [None]:
sent_tokenize('''All work and no play makes jack a dull boy, all work and no play''')

['All work and no play makes jack a dull boy, all work and no play']

In [None]:
type(word_tokenize('I love Pizza'))

list

In [None]:
def word_token(text):
  return word_tokenize(text)

In [None]:
dataset_copy=dataset.copy()

In [None]:
dataset_copy.head()

Unnamed: 0,text,target
0,tuser awww thats a bummer you shoulda got davi...,0
1,is upset that he cant update his facebook by t...,0
2,tuser i dived many times for the ball managed ...,0
3,my whole body feels itchy and like its on fire,0
4,tuser no its not behaving at all im mad why am...,0


In [None]:
dataset['text']=dataset['text'].apply(lambda x: word_token(x))

In [None]:
dataset['text']

0          [tuser, awww, thats, a, bummer, you, shoulda, ...
1          [is, upset, that, he, cant, update, his, faceb...
2          [tuser, i, dived, many, times, for, the, ball,...
3          [my, whole, body, feels, itchy, and, like, its...
4          [tuser, no, its, not, behaving, at, all, im, m...
                                 ...                        
1599995    [just, woke, up, having, no, school, is, the, ...
1599996    [thewdbcom, very, cool, to, hear, old, walt, i...
1599997    [are, you, ready, for, your, mojo, makeover, a...
1599998    [happy, 38th, birthday, to, my, boo, of, alll,...
1599999         [happy, charitytuesday, tuser, tuser, tuser]
Name: text, Length: 1600000, dtype: object

### Step 4.11: Stop Word Removal

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
print(len(stopwords.words('english')))

179


In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [None]:
stop_w=stopwords.words('english')

In [None]:
text_list=word_tokenize('i love pizza')
clean_text=[word for word in text_list if word not in stop_w]

In [None]:
from functools import lru_cache

# Build the stop-word set once so each membership test is O(1)
stop_set=set(stop_w)

def remove_stopword(text):
  # Split on whitespace and keep only the non-stop words
  return [word for word in text.split() if word not in stop_set]

In [None]:
remove_stopword('i love pizza')

['love', 'pizza']

In [None]:
dataset=dataset_copy.copy()

In [None]:
dataset.head()

Unnamed: 0,text,target
0,tuser awww thats a bummer you shoulda got davi...,0
1,is upset that he cant update his facebook by t...,0
2,tuser i dived many times for the ball managed ...,0
3,my whole body feels itchy and like its on fire,0
4,tuser no its not behaving at all im mad why am...,0


In [None]:
dataset['text']=dataset['text'].apply(lambda x: remove_stopword(x))

In [None]:
dataset['text']

0          [tuser, awww, thats, bummer, shoulda, got, dav...
1          [upset, cant, update, facebook, texting, might...
2          [tuser, dived, many, times, ball, managed, sav...
3                    [whole, body, feels, itchy, like, fire]
4                      [tuser, behaving, im, mad, cant, see]
                                 ...                        
1599995                  [woke, school, best, feeling, ever]
1599996    [thewdbcom, cool, hear, old, walt, interviews,...
1599997                [ready, mojo, makeover, ask, details]
1599998    [happy, 38th, birthday, boo, alll, tears, eyes...
1599999         [happy, charitytuesday, tuser, tuser, tuser]
Name: text, Length: 1600000, dtype: object
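
The NLTK list printed above includes `no`, `nor`, and `not`, so negation is removed along with the filler words: row 4 goes from `tuser no its not behaving ...` to `[tuser, behaving, ...]`. For sentiment tasks it may be worth keeping negations. A minimal sketch, using a small stand-in stop list and a hypothetical helper name:

```python
# Stand-in for stopwords.words('english'); only a few entries for illustration.
STOP_WORDS = {'i', 'its', 'is', 'this', 'no', 'nor', 'not', 'at', 'all'}

# Negations carry polarity, so they are kept (an assumption, not a tested result).
NEGATIONS = {'no', 'nor', 'not'}
STOP_SET = STOP_WORDS - NEGATIONS

def remove_stopwords_keep_negation(text):
    """Drop stop words but leave negation words in place."""
    return [w for w in text.split() if w not in STOP_SET]

print(remove_stopwords_keep_negation('tuser no its not behaving at all'))
# ['tuser', 'no', 'not', 'behaving']
```

Whether this improves accuracy on this dataset would need to be measured; the notebook's reported scores use the full stop list.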

In [None]:
len(dataset['text'][0])

11

In [None]:
len(dataset_copy['text'][0])

88

In [None]:
len(dataset_copy['text'][1])

104

### Step 4.12: Stemming

In [None]:
from nltk.stem.porter import PorterStemmer

st=PorterStemmer()
stem=lru_cache(maxsize=50000)(st.stem)
def stemming_on_data(list_words):
  text=[stem(word) for word in list_words]

  return text

In [None]:
dataset['text']=dataset['text'].apply(lambda x: stemming_on_data(x))

In [None]:
dataset.head()

Unnamed: 0,text,target
0,"[tuser, awww, that, bummer, shoulda, got, davi...",0
1,"[upset, cant, updat, facebook, text, might, cr...",0
2,"[tuser, dive, mani, time, ball, manag, save, 5...",0
3,"[whole, bodi, feel, itchi, like, fire]",0
4,"[tuser, behav, im, mad, cant, see]",0


### Step 4.13: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
def list_tosent(list_words):
  return ' '.join(list_words)

list_tosent(dataset['text'][0])

'tuser awww that bummer shoulda got david carr third day emojilaugh'

In [None]:
dataset['text']=dataset['text'].apply(lambda x: list_tosent(x))

In [None]:
lm=WordNetLemmatizer()
# Cache per word, mirroring the stemmer above
lemmatize=lru_cache(maxsize=50000)(lm.lemmatize)
def lemmatization_on_data(sentence):
  # Takes a space-joined string and returns a list of lemmas
  return [lemmatize(word) for word in sentence.split()]

In [None]:
dataset['text']=dataset['text'].apply(lambda x: lemmatization_on_data(x))

In [None]:
new_dataset=dataset.copy()

In [None]:
dataset['text']=dataset['text'].apply(lambda x: list_tosent(x))

In [None]:
dataset.head()

Unnamed: 0,text,target
0,tuser awww that bummer shoulda got david carr ...,0
1,upset cant updat facebook text might cri resul...,0
2,tuser dive mani time ball manag save 50 rest g...,0
3,whole bodi feel itchi like fire,0
4,tuser behav im mad cant see,0


## Step 5: Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(dataset['text'],dataset['target'],test_size=0.2,random_state=42)

tfidf=TfidfVectorizer(max_features=500000,ngram_range=(1,3),stop_words='english')

X_train_tfidf=tfidf.fit_transform(X_train)
X_test_tfif=tfidf.transform(X_test)



In [None]:
X_train_tfidf.shape

(1280000, 500000)
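
To make the numbers inside that matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The sketch handles unigrams only and ignores the `stop_words`, `ngram_range`, and `max_features` arguments used above. The function name `tfidf_rows` is hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: tf[t] * idf[t] for t in tf}
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

rows = tfidf_rows(['good day', 'bad day'])
print(rows[0])
```

In this toy corpus `day` occurs in both documents, so it receives a lower weight than `good` or `bad`, which occur in one document each.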

In [None]:
for i,f in enumerate(tfidf.get_feature_names_out()):
  print(i,f)

Streaming output truncated to the last 5000 lines.
495000 yeahw
495001 yeahwel
495002 yeahwer
495003 yeahyeah
495004 yeahyou
495005 yeahyour
495006 yeai
495007 yeaim
495008 yeait
495009 yeap
495010 yeap got
495011 yeap im
495012 yeap yeap
495013 year
495014 year 10
495015 year 11
495016 year 11 left
495017 year 12
495018 year 13
495019 year 1st
495020 year 2008
495021 year 2010
495022 year 2011
495023 year 2nd
495024 year 3000
495025 year 40
495026 year 40 year
495027 year activ
495028 year actual
495029 year afford
495030 year age
495031 year ago
495032 year ago amp
495033 year ago awesom
495034 year ago didnt
495035 year ago dont
495036 year ago fail
495037 year ago feel
495038 year ago good
495039 year ago got
495040 year ago great
495041 year ago havent
495042 year ago hope
495043 year ago horribl
495044 year ago im
495045 year ago laugh
495046 year ago long
495047 year ago love
495048 year ago make
495049 year ago miss
495050 year ago nice
495051 year ago realli
4950

## Step 6: Apply the Algorithm and Predict the Sentiment

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nb_model=MultinomialNB()
nb_model.fit(X_train_tfidf,y_train)

y_pred=nb_model.predict(X_test_tfif)
print(accuracy_score(y_test,y_pred))

0.773321875
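
Accuracy is a single number and does not show which class the errors fall on. A minimal sketch of precision, recall, and F1 for the positive class, written in plain Python with toy labels (not the notebook's predictions) and a hypothetical function name:

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': (tp + tn) / len(pairs),
            'precision': precision, 'recall': recall, 'f1': f1}

# Toy labels for illustration only.
print(binary_report([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

With the notebook's variables the same function could be called as `binary_report(list(y_test), list(y_pred))`.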


In [None]:
def sentiment(list_of_tweets):
  # Vectorize with the fitted TF-IDF model and classify the first tweet only
  new_tweet=tfidf.transform(list_of_tweets)
  if nb_model.predict(new_tweet)[0]==1:
    return 'Happy'

  else:
    return 'Unhappy'

In [None]:
new_tweet=['i am unhappy']
sentiment(new_tweet)

'Unhappy'

In [None]:
def cleaner(text):
  pattern=re.compile(r'http[s]?:\/\/\S+')
  text= pattern.sub(r'',text)
  text=text.translate(str.maketrans("","",punc))

  return text

In [None]:
new=[(cleaner(new_tweet[0]))]

In [None]:
sentiment(new)

'Unhappy'
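
`cleaner` repeats only the URL and punctuation steps, while the training text was also lowercased, stop-word filtered, and stemmed, so a new tweet can reach the vectorizer in a different form than the training data. One way to keep both paths aligned is to chain the steps once and reuse the chain. The sketch below is a minimal illustration with only a few steps; `compose` and `preprocess` are hypothetical names.

```python
import re
import string
from functools import reduce

URL_RE = re.compile(r'https?://\S+')
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def compose(*steps):
    """Chain text -> text functions left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

# Only some of the notebook's steps are shown; others would be appended the same way.
preprocess = compose(
    lambda t: URL_RE.sub('', t),
    lambda t: t.translate(PUNC_TABLE),
    str.lower,
    str.strip,
)

print(preprocess('I am UNHAPPY!! http://t.co/x'))  # i am unhappy
```

The same `preprocess` callable could then be applied both to the training column and to new tweets before calling `sentiment`.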

# Section 2: Sentiment Analysis Using RNN

### Step 2.1: Find the unique words

In [None]:
new_dataset.head()

Unnamed: 0,text,target
0,"[tuser, awww, that, bummer, shoulda, got, davi...",0
1,"[upset, cant, updat, facebook, text, might, cr...",0
2,"[tuser, dive, mani, time, ball, manag, save, 5...",0
3,"[whole, bodi, feel, itchi, like, fire]",0
4,"[tuser, behav, im, mad, cant, see]",0


In [None]:
words=set()

for data in new_dataset['text']:
  for word in data:
    words.add(word)

In [None]:
number_of_words=len(words)
number_of_words

396196

In [None]:
new_dataset['text']=new_dataset['text'].apply(lambda x: list_tosent(x))

In [None]:
new_dataset['text']

0          tuser awww that bummer shoulda got david carr ...
1          upset cant updat facebook text might cri resul...
2          tuser dive mani time ball manag save 50 rest g...
3                            whole bodi feel itchi like fire
4                                tuser behav im mad cant see
                                 ...                        
1599995                           woke school best feel ever
1599996           thewdbcom cool hear old walt interview √¢¬ô¬´
1599997                         readi mojo makeov ask detail
1599998    happi 38th birthday boo alll tear eye tupac am...
1599999               happi charitytuesday tuser tuser tuser
Name: text, Length: 1600000, dtype: object

In [None]:
new_dataset.to_csv('/content/drive/MyDrive/NLP/processed_tweets.csv',index=False)

### Step 2.2: Import Libraries and data

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

In [None]:
max_features=396196               #number_of_words

In [None]:
new_dataset=pd.read_csv('/content/drive/MyDrive/NLP/processed_tweets.csv')
new_dataset.head(1)

Unnamed: 0,text,target
0,tuser awww that bummer shoulda got david carr ...,0


In [None]:
new_dataset['text']=new_dataset['text'].astype('str')

In [None]:
(new_dataset['text']).head()

0    tuser awww that bummer shoulda got david carr ...
1    upset cant updat facebook text might cri resul...
2    tuser dive mani time ball manag save 50 rest g...
3                      whole bodi feel itchi like fire
4                          tuser behav im mad cant see
Name: text, dtype: object

In [None]:
new_dataset['text'].values

array(['tuser awww that bummer shoulda got david carr third day emojilaugh',
       'upset cant updat facebook text might cri result school today also blah',
       'tuser dive mani time ball manag save 50 rest go bound', ...,
       'readi mojo makeov ask detail',
       'happi 38th birthday boo alll tear eye tupac amaru shakur',
       'happi charitytuesday tuser tuser tuser'], dtype=object)

In [None]:
tokenizer_keras=Tokenizer(num_words=max_features,split=' ')

In [None]:
tokenizer_keras.fit_on_texts(new_dataset['text'].values)

In [None]:
X=tokenizer_keras.texts_to_sequences(new_dataset['text'].values)
X

[[1, 385, 52, 1078, 3041, 15, 721, 7461, 1663, 5, 1820],
 [607, 13, 228, 452, 372, 212, 243, 978, 84, 11, 195, 1073],
 [1, 3658, 229, 249, 879, 711, 515, 1159, 360, 3, 2803],
 [343, 668, 25, 2552, 8, 891],
 [1, 4071, 2, 470, 13, 24],
 [1, 343, 1936],
 [32, 401],
 [1, 91, 101, 17, 14, 24, 97, 105, 176, 176, 12, 21, 2, 435, 16, 685],
 [1, 687, 62],
 [1, 2125, 114911],
 [1308, 346, 2634, 489, 1428],
 [16535, 794],
 [1, 327, 1318, 31, 154, 12488, 1380, 2822],
 [1, 776, 741, 389, 99, 123, 324],
 [1, 2003, 106, 62, 2372, 27, 73, 3543, 25177, 114912],
 [1, 48, 15, 31, 20, 1, 2089],
 [2448, 929, 1608, 139, 1452, 31, 651, 25178, 4510, 456],
 [1246, 2285],
 [1, 597, 74, 127, 19, 24, 1537, 9, 3172],
 [1, 40, 522, 281, 2328, 1465, 281],
 [1, 5, 62, 4, 42, 125],
 [22, 65, 119, 302, 222, 3251, 3173, 7194, 74, 17, 14, 529],
 [1, 1270, 643, 573],
 [59, 3, 29],
 [30555, 294, 467, 47],
 [76, 119, 321, 83],
 [3, 243, 51, 31, 3018],
 [2, 55, 114913],
 [2804, 12, 21, 7281, 116, 121, 7281, 121, 4, 470],
 [1

In [None]:
type(X)

list

In [None]:
new_dataset['text'][0]

'tuser awww that bummer shoulda got david carr third day emojilaugh'

In [None]:
tokenizer_keras.word_index

{'tuser': 1,
 'im': 2,
 'go': 3,
 'get': 4,
 'day': 5,
 'good': 6,
 'work': 7,
 'like': 8,
 'love': 9,
 'dont': 10,
 'today': 11,
 'laugh': 12,
 'cant': 13,
 'eye': 14,
 'got': 15,
 'thank': 16,
 'tear': 17,
 'back': 18,
 'want': 19,
 'miss': 20,
 'loud': 21,
 'one': 22,
 'know': 23,
 'see': 24,
 'feel': 25,
 'think': 26,
 'realli': 27,
 'well': 28,
 'hope': 29,
 'night': 30,
 'watch': 31,
 'need': 32,
 'still': 33,
 'make': 34,
 'new': 35,
 'amp': 36,
 'home': 37,
 'look': 38,
 'come': 39,
 'oh': 40,
 '2': 41,
 'much': 42,
 'last': 43,
 'twitter': 44,
 'morn': 45,
 'great': 46,
 'tomorrow': 47,
 'wish': 48,
 'wait': 49,
 'ill': 50,
 'sleep': 51,
 'that': 52,
 'haha': 53,
 'way': 54,
 'sad': 55,
 'fun': 56,
 'tri': 57,
 'right': 58,
 'week': 59,
 'follow': 60,
 'happi': 61,
 'didnt': 62,
 'bad': 63,
 'would': 64,
 'friend': 65,
 'thing': 66,
 'sorri': 67,
 'tonight': 68,
 'say': 69,
 'take': 70,
 'nice': 71,
 'gonna': 72,
 'though': 73,
 'ive': 74,
 'better': 75,
 'hate': 76,
 'even': 
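
The word index above is ordered by frequency: the most common token gets index 1, and index 0 is left free for padding. A minimal sketch of that idea in plain Python, ignoring the `num_words` cap and the tokenizer's own lowercasing and character filtering (both function names are hypothetical):

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank, 1 being the most frequent."""
    counts = Counter(w for t in texts for w in t.split())
    # most_common sorts by descending count; index 0 stays reserved for padding.
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace each known word with its index, skipping unknown words."""
    return [[word_index[w] for w in t.split() if w in word_index] for t in texts]

texts = ['tuser im sad', 'tuser im happi', 'tuser go']
index = fit_word_index(texts)
print(index)                       # {'tuser': 1, 'im': 2, 'sad': 3, 'happi': 4, 'go': 5}
print(to_sequences(texts, index))  # [[1, 2, 3], [1, 2, 4], [1, 5]]
```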

In [None]:
y=pd.get_dummies(new_dataset['target']).values

In [None]:
y[:2]

array([[1, 0],
       [1, 0]], dtype=uint8)

### Step 2.3: Pad Sequences

In [None]:
len(X)

1600000

In [None]:
X=pad_sequences(X)

In [None]:
X[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    1,  385,   52, 1078, 3041,   15,
         721, 7461, 1663,    5, 1820],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,  607,   13,  228,  452,  372,  212,  243,
         978,   84,   11,  195, 1073],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    1, 3658,  229,  249,  879,  711,
         515, 1159,  360,    3, 2803],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  
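
With its default arguments `pad_sequences` pads on the left with zeros up to the length of the longest sequence, which is why each row above starts with a run of zeros. A minimal pure-Python sketch of that behaviour (`pad_pre` is a hypothetical name, and it returns nested lists rather than a NumPy array):

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        # Keep only the last maxlen items when a sequence is too long.
        s = list(s)[max(0, len(s) - maxlen):]
        padded.append([value] * (maxlen - len(s)) + s)
    return padded

print(pad_pre([[1, 385, 52], [607, 13]]))  # [[1, 385, 52], [0, 607, 13]]
print(pad_pre([[1, 2, 3, 4]], maxlen=2))   # [[3, 4]]
```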

### Step 2.4: Split The Data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)

In [None]:
X_train.shape,X_test.shape

((1120000, 38), (480000, 38))

In [None]:
valid_size=240000
X_valid=X_test[-valid_size:]
y_valid=y_test[-valid_size:]
X_test=X_test[:-valid_size]
y_test=y_test[:-valid_size]

In [None]:
X_test.shape

(240000, 38)

### Step 2.5: Create the RNN Architecture

In [None]:
from keras.models import Sequential
from keras.layers import Dense,Embedding,SimpleRNN,SpatialDropout1D
from keras.optimizers import Adam
from keras.regularizers import L2

In [None]:
embed_dim=128

In [None]:
# Detect the TPU and connect to it
tpu=tf.distribute.cluster_resolver.TPUClusterResolver.connect()

# Create a distribution strategy that places the model on the TPU
tpu_strategy=tf.distribute.TPUStrategy(tpu)

with tpu_strategy.scope():
  model=Sequential()
  model.add(Embedding(max_features,embed_dim,input_length=X.shape[1]))
  model.add(SpatialDropout1D(0.4))
  model.add(SimpleRNN(196,dropout=0.2,recurrent_dropout=0.2))
  model.add(Dense(2,activation='softmax',kernel_regularizer=L2(0.001)))

  model.compile(loss='categorical_crossentropy',optimizer=Adam(learning_rate=0.0001),metrics=['accuracy'])

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 38, 128)           50713088  
                                                                 
 spatial_dropout1d (SpatialD  (None, 38, 128)          0         
 ropout1D)                                                       
                                                                 
 simple_rnn (SimpleRNN)      (None, 196)               63700     
                                                                 
 dense (Dense)               (None, 2)                 394       
                                                                 
Total params: 50,777,182
Trainable params: 50,777,182
Non-trainable params: 0
_________________________________________________________________
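
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used in this notebook; the variable names are chosen for illustration.

```python
vocab_size = 396196   # max_features
embed_dim = 128
units = 196
classes = 2

embedding = vocab_size * embed_dim
# Input weights + recurrent weights + one bias per unit.
simple_rnn = units * (embed_dim + units + 1)
dense = units * classes + classes
total = embedding + simple_rnn + dense

print(embedding, simple_rnn, dense, total)  # 50713088 63700 394 50777182

# An LSTM layer has four gates, each shaped like the SimpleRNN weights.
lstm = 4 * simple_rnn
print(lstm)  # 254800
```

Almost all parameters sit in the embedding table, so the vocabulary size dominates the model size far more than the choice between SimpleRNN and LSTM.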


In [None]:
from keras import callbacks

earlystopping=callbacks.EarlyStopping(monitor='val_loss',
                                      mode='min',
                                      patience=5,
                                      restore_best_weights=True)

model.fit(X_train,y_train,epochs=20,batch_size=512,verbose=1,
          validation_data=(X_valid,y_valid),
          callbacks=[earlystopping])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20


<keras.callbacks.History at 0x7c8955559840>

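Training stopped after epoch 10 of 20. With `patience=5` that would be consistent with the validation loss having last improved at epoch 5, although the log above does not show the loss values. The sketch below illustrates the stopping rule in plain Python with made-up losses; it is a simplified model of the callback, not its actual implementation.

```python
def early_stop_epoch(val_losses, patience=5):
    """Return (last_epoch_run, best_epoch), both 1-based."""
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses, not the notebook's.
losses = [0.60, 0.52, 0.49, 0.48, 0.47, 0.48, 0.48, 0.49, 0.48, 0.50, 0.46]
print(early_stop_epoch(losses))  # (10, 5)
```

With `restore_best_weights=True`, the model would then be rolled back to the weights from the best epoch rather than keeping those from the last one.
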
### Step 2.6: Create the LSTM Architecture

In [None]:
from keras.layers import LSTM

In [None]:
# Reuse tpu_strategy from the RNN cell above; connecting again would reinitialize the TPU system

with tpu_strategy.scope():
  model_LSTM=Sequential()
  model_LSTM.add(Embedding(max_features,embed_dim,input_length=X.shape[1]))
  model_LSTM.add(SpatialDropout1D(0.4))
  model_LSTM.add(LSTM(196,dropout=0.2,recurrent_dropout=0.2))
  model_LSTM.add(Dense(2,activation='softmax',kernel_regularizer=L2(0.001)))

  model_LSTM.compile(loss='categorical_crossentropy',optimizer=Adam(learning_rate=0.0001),metrics=['accuracy'])



In [None]:
from keras import callbacks

earlystopping=callbacks.EarlyStopping(monitor='val_loss',
                                      mode='min',
                                      patience=5,
                                      restore_best_weights=True)

model_LSTM.fit(X_train,y_train,epochs=20,batch_size=512,verbose=1,
          validation_data=(X_valid,y_valid),
          callbacks=[earlystopping])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20


<keras.callbacks.History at 0x7c8951221660>