<div style="color:white; background-color: black; padding: 20px; border-radius:8px; font-size:26px"><b style="font-weight: 700;"><center>Coronavirus Tweets - Text Classification</center></b></div>

<div style="background-color:  #eddcd2; padding: 10px;">

### The Data

</div>

Data collected from [here](https://www.kaggle.com/datasets/lakshmi25npathi/coronavirus-tweets-dataset)

**About this Dataset**:

Perform text classification on the data. The tweets were pulled from Twitter and then tagged manually.
Names and usernames have been replaced with codes to avoid privacy concerns.

Columns:
1) Location
2) Tweet At
3) Original Tweet
4) Label

<div style="background-color:  #eddcd2; padding: 10px;">

### Text Classification

</div>


The goal is to predict the sentiment label of each tweet (e.g. positive or negative) using:
 - **Classification ML algorithms**
 - **Deep learning algorithms**

### **Brief Summary of regular expression functions**

- `re.compile()`: It is used to compile a regular expression pattern into a regular expression object. This compiled pattern can be reused for multiple operations, making it more efficient, especially when working with complex or frequently used regular expressions. Benefits of re.compile():
    - *Improved Efficiency*: It can improve the performance of your code, especially if you are using the same regular expression in multiple places.
    - *Readability*: It can make your code more readable by separating the pattern definition from its usage.
    - *Facilitates Code Maintenance*: If you need to modify the regular expression later, you only have to do it in one place (where it's compiled), rather than multiple places where it's used.
    - *Pre-Compile Complex Patterns*: For complex regular expressions, pre-compiling can help manage the complexity and make your code more organized.
---

- `re.sub(pattern, replacement, string)`: It performs a search-and-replace operation where:
    - `pattern`: The regular expression pattern to search for in the string.
    - `replacement`: The string to replace the matched patterns.
    - `string`: The input string where the search and replace operation will be performed.

---

- `re.findall(pattern, string, flags=0)`: It is a function in Python's **re** module (regular expressions) that allows you to find all non-overlapping occurrences of a pattern in a string. It returns a list of all the matching substrings or elements.
    - `pattern`: This is the regular expression pattern you want to search for in the given string.
    - `string`: The input string where you want to search for matches.
    - `flags`: Optional. Flags can be used to modify the behavior of the regular expression. For example, you can use re.IGNORECASE to perform a case-insensitive search.
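
As a minimal sketch of the three functions summarised above, the snippet below compiles one pattern and reuses it with `findall` and `sub`, then passes a flag to `re.findall`. The hashtag pattern and the sample string are invented for illustration and are not part of the dataset.

```python
import re

# Compile once, reuse many times (pattern and sample text are made up for illustration)
hashtag_pattern = re.compile(r"#\w+")

sample = "Stay safe! #COVID19 #coronavirus #StayHome"

# findall returns every non-overlapping match as a list of strings
hashtags = hashtag_pattern.findall(sample)
print(hashtags)

# sub replaces every match with the replacement string (here: nothing)
no_hashtags = hashtag_pattern.sub("", sample).strip()
print(no_hashtags)

# flags change matching behaviour, e.g. a case-insensitive search
covid_mentions = re.findall(r"covid", sample, flags=re.IGNORECASE)
print(covid_mentions)
```

Calling methods on the compiled object and calling the module-level functions give the same results; the compiled form mainly keeps the pattern definition in one place.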

<div class="list-group" id="list-tab" role="tablist">

## TABLE OF CONTENTS

- <a href='#1'>1. IMPORTING LIBRARIES</a>
- <a href='#2'>2. READING DATA</a>
- <a href='#3'>3. DATA CLEANING</a>
    - <a href='#3-1'>3.1 Lowercasing</a>
    - <a href='#3-2'>3.2 Remove HTML Tags</a>
    - <a href='#3-3'>3.3 Remove URLs</a>
    - <a href='#3-4'>3.4 Remove Punctuation</a>
    - <a href='#3-5'>3.5 Chat word treatment</a>
    - <a href='#3-6'>3.6 Spelling Correction</a>
    - <a href='#3-7'>3.7 Handling Emojis</a>
- <a href='#4'>4. TEXT PREPROCESSING</a>
    - <a href='#4-1'>4.1 Remove Stop Words</a>
    - <a href='#4-2'>4.2 Tokenization</a>
    - <a href='#4-3'>4.3 Stemming</a>
    - <a href='#4-4'>4.4 Lemmatization</a>
- <a href='#5'>5. DATA SPLITTING</a>
- <a href='#6'>6. FEATURE EXTRACTION FROM TEXT</a>
    - <a href='#6-1'>6.1 One Hot Encoding</a>
    - <a href='#6-2'>6.2 Bag of Words</a>
    - <a href='#6-3'>6.3 N-grams</a>
    - <a href='#6-4'>6.4 Term frequency - Inverse document frequency </a>
    - <a href='#6-5'>6.5 Word2Vec </a>
</div>

# <a id='1'>1. Importing Libraries </a>


In [81]:
import pandas as pd
import numpy as np                          # for working with arrays and matrices

pd.set_option('display.max_rows', 500)      # Set max number of rows displayed
pd.set_option('display.max_columns', 500)   # Set max number of columns displayed
pd.set_option('display.width', 1000)

# Regex pkg
import re

# String and time module
import string, time

# Visualizations
import matplotlib.pyplot as plt             # for creating plots
from matplotlib.colors import ListedColormap
%matplotlib inline
import seaborn as sns
import plotly

# Split pkgs
from sklearn.model_selection import train_test_split

from scipy.stats import skew
import statsmodels.api as sm

# Save and load pkgs
from pickle import dump, load

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# <a id='2'>2. Reading Data </a>


In [3]:
# load the data
df = pd.read_excel("D:/Data Science/Datasets/NLP/Covid_tweets/Corona_NLP_train.xlsx")
display(df.head(10))

print(df.shape)

print(df.iloc[1,:])

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
| 5 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region_x0092_s first confirmed ... | Positive |
| 6 | 3805 | 48757 | 35.926541,-78.753267 | 16-03-2020 | Cashier at grocery store was sharing his insig... | Positive |
| 7 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | Neutral |
| 8 | 3807 | 48759 | Atlanta, GA USA | 16-03-2020 | Due to COVID-19 our retail store and classroom... | Positive |
| 9 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | Negative |


(41157, 6)
UserName                                                      3800
ScreenName                                                   48752
Location                                                        UK
TweetAt                                                 16-03-2020
OriginalTweet    advice Talk to your neighbours family to excha...
Sentiment                                                 Positive
Name: 1, dtype: object


In [4]:
df['OriginalTweet'][3]

"My food stock is not the only one which is empty...\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \n\nStay calm, stay safe.\n\n\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j"

# <a id='3'>3. Data Cleaning </a>


## <a id='3-1'>3.1 Lowercasing</a>

In [5]:
df['OriginalTweet'][3].lower()

"my food stock is not the only one which is empty...\n\n\n\nplease, don't panic, there will be enough food for everyone if you do not take more than you need. \n\nstay calm, stay safe.\n\n\n\n#covid19france #covid_19 #covid19 #coronavirus #confinement #confinementotal #confinementgeneral https://t.co/zrlg0z520j"

In [6]:
df['OriginalTweet'] = df['OriginalTweet'].str.lower()

In [7]:
df

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @menyrbie @phil_gahan @chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | coronavirus australia: woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | | 16-03-2020 | my food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | | 16-03-2020 | me, ready to go at supermarket during the #cov... | Extremely Negative |
| ... | ... | ... | ... | ... | ... | ... |
| 41152 | 44951 | 89903 | Wellington City, New Zealand | 14-04-2020 | airline pilots offering to stock supermarket s... | Neutral |
| 41153 | 44952 | 89904 | | 14-04-2020 | response to complaint not provided citing covi... | Extremely Negative |
| 41154 | 44953 | 89905 | | 14-04-2020 | you know it_x0092_s getting tough when @kamero... | Positive |
| 41155 | 44954 | 89906 | | 14-04-2020 | is it wrong that the smell of hand sanitizer i... | Neutral |


## <a id='3-2'>3.2 Remove HTML Tags</a>

In [8]:
def remove_html_tags(text):

    # Define the regex pattern to match HTML tags
    pattern = re.compile('<.*?>')

    # Use re.sub() to remove HTML tags from the text
    cleaned_text = pattern.sub(r'', text)

    return cleaned_text

In [9]:
df['OriginalTweet'] = df['OriginalTweet'].apply(remove_html_tags)

## <a id='3-3'>3.3 Remove URLs </a>

In [10]:
def remove_url(text):

    pattern = re.compile(r'https?://\S+|www\.\S+')

    cleaned_text = pattern.sub(r'', text)

    return cleaned_text

In [11]:
df['OriginalTweet'] = df['OriginalTweet'].apply(remove_url)
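
The two cleaning patterns used so far can be checked on a single short string before running them over the whole column. This is only a sketch: the sample string is made up (the `<b>` tag is added artificially, since the tweets rarely contain HTML), and the variable names are hypothetical.

```python
import re

# Same patterns as in remove_html_tags and remove_url
html_pattern = re.compile(r"<.*?>")
url_pattern = re.compile(r"https?://\S+|www\.\S+")

# Invented sample: a lowercased tweet fragment with an artificial HTML tag and a link
sample = "stay calm, stay safe. <b>#covid19</b> https://t.co/zrlg0z520j"

without_html = html_pattern.sub("", sample)               # drop anything between < and >
without_urls = url_pattern.sub("", without_html).strip()  # drop links, trim the trailing space
print(without_urls)
```

The non-greedy `.*?` matters here: a greedy `<.*>` would remove everything from the first `<` to the last `>`, including the hashtag between the two tags.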

## <a id='3-4'>3.4 Remove Punctuation </a>

In [12]:
string.punctuation             # <--- contains a collection of punctuation characters

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
def remove_punc(text):

    exclude = string.punctuation

    for char in exclude:
        text = text.replace(char,'')       # <--- It deletes the punctuation character (replaces it with an empty string)

    return text

In [14]:
df['OriginalTweet'] = df['OriginalTweet'].apply(remove_punc)

In [15]:
df['OriginalTweet'][20]

'with 100  nations inficted with  covid  19  the world must  not  play fair with china  100 goverments must demand  china  adopts new guilde  lines on food safty  the  chinese  goverment  is guilty of  being  irosponcible   with life  on a global scale'
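
The loop in `remove_punc` scans the text once per punctuation character. A possible alternative, sketched here under the assumption that only the ASCII characters in `string.punctuation` need removing, is a translation table that deletes them all in one pass. The helper name `remove_punc_fast` is hypothetical.

```python
import string

# Translation table that maps every punctuation character to None (i.e. deletes it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_fast(text):
    """Delete all ASCII punctuation in a single pass (hypothetical helper name)."""
    return text.translate(PUNCT_TABLE)

sample = "please, don't panic! #covid_19"
cleaned = remove_punc_fast(sample)
print(cleaned)
```

On a column of tens of thousands of tweets this would be applied the same way as before, with `.apply(remove_punc_fast)`.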

## <a id='3-5'>3.5 Chat Word Treatment </a>

*Chat word treatment* refers to the handling or processing of abbreviated or informal language commonly used in online chat conversations, social media posts, and text messaging. It involves converting or normalizing these abbreviated forms into their full, standard English equivalents.

- Examples of abbreviated or informal language: *lmao*, *imho*, *fyi*, *asap*, *gn*



In [16]:
chat_words = {'AFAIK': 'As Far As I Know',
              'AFK': 'Away From Keyboard',
              'ASAP': 'As Soon As Possible',
              'ATK': 'At The Keyboard',
              'ATM': 'At The Moment',
              'A3': 'Anytime, Anywhere, Anyplace',
              'BAK': 'Back At Keyboard',
              'BBL': 'Be Back Later',
              'BBS': 'Be Back Soon',
              'BFN': 'Bye For Now',
              'B4N': 'Bye For Now',
              'BRB': 'Be Right Back',
              'BRT': 'Be Right There',
              'BTW': 'By The Way',
              'B4': 'Before',
              'CU': 'See You',
              'CUL8R': 'See You Later',
              'CYA': 'See You',
              'FAQ': 'Frequently Asked Questions',
              'FC': 'Fingers Crossed',
              'FWIW': "For What It's Worth",
              'FYI': 'For Your Information',
              'GAL': 'Get A Life',
              'GG': 'Good Game',
              'GN': 'Good Night',
              'GMTA': 'Great Minds Think Alike',
              'GR8': 'Great!',
              'G9': 'Genius',
              'IC': 'I See',
              'ICQ': 'I Seek you (also a chat program)',
              'ILU': 'I Love You',
              'IMHO': 'In My Honest/Humble Opinion',
              'IMO': 'In My Opinion',
              'IOW': 'In Other Words',
              'IRL': 'In Real Life',
              'KISS': 'Keep It Simple, Stupid',
              'LDR': 'Long Distance Relationship',
              'LMAO': 'Laugh My A.. Off',
              'LOL': 'Laughing Out Loud',
              'LTNS': 'Long Time No See',
              'L8R': 'Later',
              'MTE': 'My Thoughts Exactly',
              'M8': 'Mate',
              'NRN': 'No Reply Necessary',
              'OIC': 'Oh I See',
              'PITA': 'Pain In The A..',
              'PRT': 'Party',
              'PRW': 'Parents Are Watching',
              'QPSA?': 'Que Pasa?',
              'ROFL': 'Rolling On The Floor Laughing',
              'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
              'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
              'SK8': 'Skate',
              'STATS': 'Your sex and age',
              'ASL': 'Age, Sex, Location',
              'THX': 'Thank You',
              'TTFN': 'Ta-Ta For Now!',
              'TTYL': 'Talk To You Later',
              'U': 'You',
              'U2': 'You Too',
              'U4E': 'Yours For Ever',
              'WB': 'Welcome Back',
              'WTF': 'What The F...',
              'WTG': 'Way To Go!',
              'WUF': 'Where Are You From?',
              'W8': 'Wait...',
              '7K': 'Sick',
              ':-D': 'Laughter',
              'TFW': 'That feeling when',
              'MFW': 'My face when',
              'MRW': 'My reaction when',
              'IFYP': 'I feel your pain',
              'TNTL': 'Trying not to laugh',
              'JK': 'Just kidding',
              'IDC': 'I don’t care',
              'ILY': 'I love you',
              'IMU': 'I miss you',
              'ADIH': 'Another day in hell',
              'ZZZ': 'Sleeping, bored, tired',
              'WYWH': 'Wish you were here',
              'TIME': 'Tears in my eyes',
              'BAE': 'Before anyone else',
              'FIMH': 'Forever in my heart',
              'BSAAW': 'Big smile and a wink',
              'BWL': 'Bursting with laughter',
              'BFF': 'Best friends forever',
              'CSL': 'Can’t stop laughing'
              }

In [17]:
def chat_conversion(text):

    new_text = []

    for w in text.split():                               # <--- for each word in text
        if w.upper() in chat_words:                      # <--- checking if the word (in capital letters) is a key of chat_words
            new_text.append(chat_words[w.upper()])       # <--- adding the extended text of the chat word
        else:
            new_text.append(w)

    return ' '.join(new_text)                            # <--- return "text" with the "new_text" instead of chat words

In [18]:
print(chat_conversion('IMHO he is the best'))
print(chat_conversion('FYI La Habana is the capital of Cuba'))

In My Honest/Humble Opinion he is the best
For Your Information La Habana is the capital of Cuba


In [19]:
df['OriginalTweet'].apply(chat_conversion)

0                      menyrbie philgahan chrisitv and and
1        advice talk to your neighbours family to excha...
2        coronavirus australia woolworths to give elder...
3        my food stock is not the only one which is emp...
4        me ready to go at supermarket during the covid...
                               ...                        
41152    airline pilots offering to stock supermarket s...
41153    response to complaint not provided citing covi...
41154    you know itx0092s getting tough when kameronwi...
41155    is it wrong that the smell of hand sanitizer i...
41156    tartiicat well newused rift s are going for 70...
Name: OriginalTweet, Length: 41157, dtype: object

In [20]:
df['OriginalTweet'] = df['OriginalTweet'].apply(chat_conversion)
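
One caveat of a dictionary lookup on upper-cased words is that ordinary English words can collide with acronym keys: after lowercasing, "time" matches `'TIME'` and "atm" matches `'ATM'`. The sketch below uses a three-entry subset of the dictionary and a hypothetical `skip` parameter to show the effect and one possible guard; it is an illustration, not the conversion used in this notebook.

```python
# A tiny subset of the chat-word dictionary, enough to show the problem
chat_words_subset = {
    'FYI': 'For Your Information',
    'TIME': 'Tears in my eyes',
    'ATM': 'At The Moment',
}

def chat_conversion_subset(text, mapping, skip=frozenset()):
    """Expand chat words, leaving any key listed in `skip` untouched (hypothetical helper)."""
    new_text = []
    for w in text.split():
        key = w.upper()
        if key in mapping and key not in skip:
            new_text.append(mapping[key])
        else:
            new_text.append(w)
    return ' '.join(new_text)

tweet = 'fyi this is not the time to queue at the atm'

naive = chat_conversion_subset(tweet, chat_words_subset)
guarded = chat_conversion_subset(tweet, chat_words_subset, skip={'TIME', 'ATM'})
print(naive)     # ordinary words are expanded as if they were acronyms
print(guarded)   # ambiguous keys are left alone
```

Which keys count as ambiguous depends on the corpus, so the `skip` set would need to be chosen by inspecting the data.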

## <a id='3-6'>3.6 Spelling Correction (very long running time!)</a>

**TextBlob** is a Python library that provides a simple API for common natural language processing (NLP) tasks. It's built on top of NLTK (Natural Language Toolkit) and Pattern. It abstracts away many of the complexities of lower-level NLP tasks, allowing users to perform, among others, the following tasks:

- *Sentiment Analysis*: **TextBlob** can determine the sentiment (polarity and subjectivity) of a piece of text. It can tell you whether a text expresses a positive, negative, or neutral sentiment.

- *Part-of-Speech Tagging*: **TextBlob** can identify the grammatical parts of a sentence, such as nouns, verbs, adjectives, etc.

- *Noun Phrase Extraction*: It can extract noun phrases from text. This is useful for tasks like information extraction.

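- *Translation and Language Detection*: Older **TextBlob** releases could translate between languages and detect the language of a given text through the Google Translate API; these features were deprecated and are no longer available in recent releases.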

- *Word Inflection and Lemmatization*: It can convert words into their base or root form, which can be helpful for tasks like text classification.

- *Spell Checking*: **TextBlob** can correct spelling mistakes in a given text.

- *Tokenization*: It can break a text into individual words or sentences, a necessary step for many NLP tasks.

In [21]:
df.shape

(41157, 6)

In [22]:
from textblob import TextBlob

In [23]:
def spelling_correction(text, exceptions):

    # Tokenize the text
    words = text.split()

    # Initialize an empty list to store corrected words
    corrected_words = []

    # Correct each word, except for the specified exceptions
    for word in words:
        if word in exceptions:
            corrected_words.append(word)
        else:

            # Create a TextBlob object using the word
            textBlb = TextBlob(word)

            # correct() attempts to fix the spelling of the text (it does not check grammar);
            # the .string property returns the corrected text as a plain str
            corrected_words.append(textBlb.correct().string)

    # Join the corrected words back into a sentence
    corrected_text = ' '.join(corrected_words)

    return corrected_text

In [24]:
# Example:
incorrect_text = 'with 100  nations inficted with  covid  19  the world must  not  play fair with china  100 goverments must demand  china  adopts new guilde  lines on food safty  the  chinese  goverment  is guilty of  being  irosponcible   with life  on a global scale'

print(spelling_correction(incorrect_text, ['covid', 'covid19']))

with 100 nations infected with covid 19 the world must not play fair with china 100 governments must demand china adopt new guide lines on food safety the chinese government is guilty of being irosponcible with life on a global scale


In [25]:
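# start = time.time()
# df['OriginalTweet'] = df['OriginalTweet'].apply(spelling_correction, exceptions=['covid', 'covid19'])   # <--- extra arguments for the function must be passed by keyword (or via args=)
# elapsed = time.time() - start
# print(elapsed)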
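
Most of the running time comes from correcting the same words over and over. One way to reduce it, sketched below, is to cache the result per word so that each distinct word is corrected only once. To keep the sketch runnable without TextBlob, `correct_word` uses an invented lookup table as a stand-in for `TextBlob(word).correct().string`; the names `correct_word`, `spelling_correction_cached`, and `TOY_CORRECTIONS` are all hypothetical.

```python
from functools import lru_cache

# Stand-in for an expensive corrector such as TextBlob(word).correct().string;
# the tiny lookup table is invented purely so the sketch runs without TextBlob.
TOY_CORRECTIONS = {'goverment': 'government', 'safty': 'safety'}
calls = {'count': 0}

@lru_cache(maxsize=None)
def correct_word(word):
    calls['count'] += 1          # counts how often the "expensive" step really runs
    return TOY_CORRECTIONS.get(word, word)

def spelling_correction_cached(text, exceptions=()):
    # Words listed in `exceptions` are passed through unchanged
    return ' '.join(w if w in exceptions else correct_word(w) for w in text.split())

tweets = ['goverment food safty', 'goverment covid safty', 'food safty']
corrected = [spelling_correction_cached(t, exceptions=('covid',)) for t in tweets]
print(corrected)
print(calls['count'])   # number of distinct words that actually reached the corrector
```

With a real corrector the saving depends on how repetitive the vocabulary is; tweets about a single topic tend to reuse many words.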

## <a id='3-7'>3.7 Handling Emojis</a>

Emojis can be:
- Removed
- Replaced by their meaning

Remove emojis (also other symbols):

In [26]:
def remove_emoji(text):

    emoji_pattern = re.compile('['
                               u"\U0001F600-\U0001F64F"     # emoticons
                               u"\U0001F300-\U0001F5FF"     # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"     # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"     # flags (iOS)
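                               u'\U0001F700-\U0001F77F'     # alchemical symbols
                               u'\U0001F780-\U0001F7FF'     # geometric shapes extended
                               u'\U0001F800-\U0001F8FF'     # supplemental arrows-C
                               u'\U0001F900-\U0001F9FF'     # supplemental symbols & pictographs
                               u'\U0001FA00-\U0001FA6F'     # chess symbols
                               u'\U0001FA70-\U0001FAFF'     # symbols & pictographs extended-A
                               u"\U00002600-\U000026FF"     # miscellaneous symbols
                               u"\U00002702-\U000027B0"     # dingbats
                               u"\U000024C2"                # circled M
                               u"\U0001F200-\U0001F251"     # enclosed ideographic supplement (a single 24C2-1F251 range would also match CJK and most other scripts)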
                               ']+', flags = re.UNICODE)

    new_text = emoji_pattern.sub(r'', text)
    return new_text

In [27]:
text = 'Loved the movie. It was \U0001F60D\U0001F929 '
print(text)

print(remove_emoji(text))

Loved the movie. It was 😍🤩 
Loved the movie. It was  


Replace emojis with their meaning (there was a problem with the emoji-Django module):

In [28]:
# import emoji
#
# print(emoji.demojize('Python is 🔥'))

# <a id='4'>4. Text Preprocessing </a>

## <a id='4-1'>4.1 Removing Stop words</a>

In [69]:
import nltk
# nltk.download('stopwords')

In [30]:
from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [31]:
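# Build the stop-word set once; calling stopwords.words('english') for every word is very slow
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):

    new_text = []

    for word in text.split():

        if word in english_stopwords:
            new_text.append('')            # <--- stop words become empty strings, so extra spaces remain in the output
        else:
            new_text.append(word)

    return ' '.join(new_text)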


In [35]:
text = df['OriginalTweet'][100]
print(text)

df['OriginalTweet'] = df['OriginalTweet'].apply(remove_stopwords)

i hate grocery shopping in general but i swear ix0092m doing it online next shop can not deal with the swathes of panic buyers at all covid19 coronavirus coronavirusuk anxiety panicbuyinguk morons
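
The function above replaces each stop word with an empty string, which leaves runs of spaces in the cleaned text. A possible variant, sketched here, drops the stop words entirely. The small `STOP_WORDS` set is hard-coded only so the example runs without the NLTK data; in the notebook it would be `set(stopwords.words('english'))`. The function name is hypothetical.

```python
# A handful of English stop words, hard-coded so the sketch runs without NLTK data;
# in the notebook this would be set(stopwords.words('english')).
STOP_WORDS = {'i', 'in', 'but', 'it', 'the', 'of', 'at', 'all', 'with', 'not', 'can'}

def remove_stopwords_compact(text, stop_words=STOP_WORDS):
    """Drop stop words entirely so no empty slots or double spaces are left behind."""
    return ' '.join(word for word in text.split() if word not in stop_words)

sample = 'i hate grocery shopping in general but i swear'
result = remove_stopwords_compact(sample)
print(result)
```

Membership tests on a set take roughly constant time, whereas testing against the list returned by `stopwords.words('english')` scans it word by word.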


## <a id='4-2'>4.2 Tokenization</a>

**Tokenization** is a process in Natural Language Processing (NLP) that involves breaking a text into individual units, or "tokens." These tokens can be *words*, *sentences*, or even *subwords*, depending on the level of granularity required for the specific NLP task.

Tokenization is a crucial preprocessing step in NLP pipelines because it converts raw text into a format that can be processed by machines. It enables the extraction of meaningful information and features from text, making it suitable for various NLP tasks such as sentiment analysis, part-of-speech tagging, named entity recognition, and more. Moreover, tokenization is not limited to English; it's a universal concept that applies to text in any language. Different languages may have specific tokenization rules based on their unique linguistic characteristics.

Depending on the type of project under analysis, there are three common types of tokenization:

- *Word Tokenization*:
    - This type of tokenization breaks a text into individual words. For example, the sentence "Chatbots are amazing!" would be tokenized into the following words: "Chatbots", "are", "amazing", and "!". Punctuation marks are usually treated as separate tokens.
    - Word tokenization is one of the most fundamental steps in NLP, as many subsequent NLP tasks rely on individual words.

- *Sentence Tokenization*:
    - Sentence tokenization involves splitting a text into individual sentences. For example, the paragraph "This is the first sentence. This is the second sentence." would be tokenized into two sentences: "This is the first sentence." and "This is the second sentence."
    - Sentence tokenization is important for tasks that require analysis at the sentence level, such as sentiment analysis or machine translation.

- *Subword Tokenization*:
    - Subword tokenization breaks words into smaller units, which may be meaningful in some languages or for specific tasks. For instance, "chatbots" might be tokenized into "chat" and "bots". This is particularly useful for handling rare or out-of-vocabulary words.
    - Subword tokenization is often used in tasks like machine translation and text generation.
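
The three granularities can be sketched with the standard library alone. The functions below are toy illustrations, not the tokenizers used later in this notebook: the regular expressions are simple heuristics, and the subword splitter is a greedy longest-match over an invented vocabulary rather than a trained subword model.

```python
import re

def word_tokens(text):
    # Words are runs of word characters; every other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Split on whitespace that follows ., ! or ? so the terminator stays with its sentence
    return re.split(r"(?<=[.!?])\s+", text.strip())

def subword_tokens(word, vocab):
    # Greedy longest-match against a toy vocabulary; unknown characters fall through one by one
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

words = word_tokens("Chatbots are amazing!")
sentences = sentence_tokens("This is the first sentence. This is the second sentence.")
subwords = subword_tokens("chatbots", {"chat", "bots", "bot"})
print(words)
print(sentences)
print(subwords)
```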

**Notes**:
- *Prefix*: Character(s) at the beginning ---> $("
- *Suffix*: Character(s) at the end ---> km),.!?"
- *Infix*: Character(s) in between ---> - -- / ...
- *Exception*: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ---> let's U.S.
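
The prefix and suffix categories in the notes can be illustrated with a toy splitter that peels known affixes off a whitespace-delimited chunk. This is a simplified sketch, not the algorithm of any particular library: the `PREFIXES` and `SUFFIXES` tuples just reuse the examples from the notes, infixes and exceptions are not handled, and `split_affixes` is a hypothetical name.

```python
PREFIXES = ('$', '(', '"')
SUFFIXES = ('km', ')', ',', '.', '!', '?', '"')

def split_affixes(chunk):
    """Peel known prefixes and suffixes off one whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    changed = True
    while chunk and changed:
        changed = False
        for p in PREFIXES:
            if chunk.startswith(p) and len(chunk) > len(p):
                prefixes.append(p)
                chunk = chunk[len(p):]
                changed = True
                break
        for s in SUFFIXES:
            if chunk.endswith(s) and len(chunk) > len(s):
                suffixes.insert(0, s)      # keep suffixes in their original left-to-right order
                chunk = chunk[:-len(s)]
                changed = True
                break
    return prefixes + [chunk] + suffixes

print(split_affixes('$10.50'))
print(split_affixes('5km'))
print(split_affixes('(hello!)'))
```

The length checks stop a chunk from being consumed completely, so a bare `km` or `!` stays a single token.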

**1. Using the split function**

In [33]:
# Word tokenization
text1 = 'I am going to La Habana'
text1.split()

['I', 'am', 'going', 'to', 'La', 'Habana']

In [34]:
# sentence tokenization
text2 = 'I am going to La Habana. I will stay for 3 days. Let\'s hope the trip to be great'
text2.split('.')

['I am going to La Habana',
 ' I will stay for 3 days',
 " Let's hope the trip to be great"]

**Split function in Tweet data**

In [55]:
tokenized_tweets = df['OriginalTweet'].apply(lambda x: x.split())

tokenized_tweets.head()

0                      [menyrbie, philgahan, chrisitv]
1    [advice, talk, neighbours, family, exchange, p...
2    [coronavirus, australia, woolworths, give, eld...
3    [food, stock, one, empty, please, dont, panic,...
4    [ready, go, supermarket, covid19, outbreak, im...
Name: OriginalTweet, dtype: object

**2. Using Regular Expression**

In [37]:
# Word tokenization
text3 = 'I am going to La Habana'
tokens = re.findall(r"[\w']+", text3)      # <--- raw string, so \w is not treated as a string escape
tokens

['I', 'am', 'going', 'to', 'La', 'Habana']

In [40]:
# Sentence tokenization
text4 = """With these selections, players will take the reins of the Japanese Empire? With dense rainforests to the north, mountains to the west, and the sea to the south, enemies will find it massively difficult to approach one`s starting location. It`s a strong early-game position with a lot of potential to grow in peace. Players won`t be left wanting for resources on this map, as there are a plethora of useful goodies scattered close together. Initial access to the sea won`t be a problem either. There`s also a nearby Natural Wonder that can be claimed in the first few turns, granting bonuses to adjacent tiles."""
sentences = re.compile('[.!?] ').split(text4)
sentences

['With these selections, players will take the reins of the Japanese Empire',
 'With dense rainforests to the north, mountains to the west, and the sea to the south, enemies will find it massively difficult to approach one`s starting location',
 'It`s a strong early-game position with a lot of potential to grow in peace',
 'Players won`t be left wanting for resources on this map, as there are a plethora of useful goodies scattered close together',
 'Initial access to the sea won`t be a problem either',
 'There`s also a nearby Natural Wonder that can be claimed in the first few turns, granting bonuses to adjacent tiles.']

**Regex tokenization in Tweet data**

In [56]:
tokenized_tweets = df['OriginalTweet'].apply(lambda x: re.findall(r"[\w']+", x))

tokenized_tweets.head()

0                      [menyrbie, philgahan, chrisitv]
1    [advice, talk, neighbours, family, exchange, p...
2    [coronavirus, australia, woolworths, give, eld...
3    [food, stock, one, empty, please, dont, panic,...
4    [ready, go, supermarket, covid19, outbreak, im...
Name: OriginalTweet, dtype: object

**3. Using NLTK**

**NLTK** stands for **Natural Language Toolkit**. It's a comprehensive library in Python that provides tools to work with human language data (text). It's particularly useful for tasks related to linguistic analysis and natural language processing (**NLP**).
In the context of tokenization, NLTK provides a module called `nltk.tokenize` which contains various tokenizers for splitting text into tokens. Some of the tokenizers available in NLTK include:

- *Word Tokenizer* (`nltk.tokenize.word_tokenize`): This tokenizer breaks text into individual words. It's particularly useful for tasks where you need to analyze the text at the word level.
<br>

- *Sentence Tokenizer* (`nltk.tokenize.sent_tokenize`): This tokenizer splits text into sentences. It's used when you want to analyze text at the sentence level, such as in tasks like sentiment analysis.
<br>

- *Whitespace Tokenizer* (`nltk.tokenize.WhitespaceTokenizer`): This tokenizer splits text based on whitespace characters (spaces, tabs, etc.). It can be useful for specific cases where whitespace serves as a clear delimiter.
<br>

- *Regexp Tokenizer* (`nltk.tokenize.RegexpTokenizer`): This tokenizer allows you to define your own custom tokenization rules using regular expressions. It's highly flexible and can be tailored to specific tokenization needs.
<br>

- *Treebank Tokenizer* (`nltk.tokenize.TreebankWordTokenizer`): This is a tokenizer that is designed to work with the Penn Treebank dataset, which is a large corpus of English text. It follows the tokenization conventions used in this dataset.

In [41]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [47]:
# word tokenizer

text1 = 'I have a Ph.D in A.I.'
text2 = "We're here to help mail us at nks@gmail.com"
text3 = 'A 5km ride cost $10.50'

print(word_tokenize(text1))
print(word_tokenize(text2))
print(word_tokenize(text3))

['I', 'have', 'a', 'Ph.D', 'in', 'A.I', '.']
['We', "'re", 'here', 'to', 'help', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.50']


In [45]:
# sentence tokenizer

sent_tokenize(text4)

['With these selections, players will take the reins of the Japanese Empire?',
 'With dense rainforests to the north, mountains to the west, and the sea to the south, enemies will find it massively difficult to approach one`s starting location.',
 'It`s a strong early-game position with a lot of potential to grow in peace.',
 'Players won`t be left wanting for resources on this map, as there are a plethora of useful goodies scattered close together.',
 'Initial access to the sea won`t be a problem either.',
 'There`s also a nearby Natural Wonder that can be claimed in the first few turns, granting bonuses to adjacent tiles.']

**NLTK tokenization in Tweet data**

In [59]:
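# Note: punctuation was removed in section 3.4, so sent_tokenize returns each tweet as a single sentence;
# use word_tokenize instead to get word-level tokens
tokenized_tweets = df['OriginalTweet'].apply(lambda x: sent_tokenize(x))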

tokenized_tweets.head()

0                        [menyrbie philgahan chrisitv]
1    [advice talk   neighbours family  exchange pho...
2    [coronavirus australia woolworths  give elderl...
3    [ food stock     one   empty please dont panic...
4    [ ready  go  supermarket   covid19 outbreak   ...
Name: OriginalTweet, dtype: object

**4. Using spaCy**

**spaCy** is an open-source natural language processing (NLP) library designed for high-performance NLP tasks. It is known for its speed, accuracy, and ease of use. spaCy provides a wide range of NLP functionalities, including tokenization.

In the context of tokenization, spaCy offers a tokenization component that is highly efficient and capable of handling multiple languages. spaCy's tokenizer not only splits text into words, but also handles more complex tokenization tasks, such as splitting off punctuation that appears at the beginning or end of a word.

spaCy is widely used in both research and industry for various NLP tasks, including part-of-speech tagging, named entity recognition, dependency parsing, and more. Its efficient implementation and pre-trained models make it a popular choice for a wide range of NLP applications.

spaCy uses an object-oriented approach and usually returns document objects with their own attributes and methods. Many users find spaCy to be more time and memory efficient than NLTK and therefore more suitable for production.

`spacy.load()`: Function used to load a pre-trained spaCy model. It takes a model name as an argument and returns a loaded model object, which can be used for NLP tasks like tokenization, named entity recognition, dependency parsing, etc.

In [63]:
import spacy
nlp = spacy.load('en_core_web_sm')   # <--- 'en_core_web_sm' is a small pre-trained English pipeline provided by spaCy; it includes the tokenizer and is suitable for a wide range of NLP tasks.

type(nlp)    # <--- nlp is a Language object, which holds all the components needed to process English text.

spacy.lang.en.English

In [62]:
doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)

for token in doc3:       # <--- Change the number of 'doc' to see tokenization of doc1 to doc3
    print(token)

A
5
km
ride
cost
$
10.50


**spaCy tokenization in Tweet data** (It takes longer than the other tokenizers)

In [60]:
def tokenize_text(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]     # <--- token.text: Attribute containing the text of the token 'token'
    return tokens

tokenized_tweets = df['OriginalTweet'].apply(tokenize_text)

tokenized_tweets.head()

0                   [menyrbie, philgahan, chrisitv,  ]
1    [advice, talk,   , neighbours, family,  , exch...
2    [coronavirus, australia, woolworths,  , give, ...
3    [ , food, stock,     , one,   , empty, please,...
4    [ , ready,  , go,  , supermarket,   , covid19,...
Name: OriginalTweet, dtype: object

In [None]:
df['OriginalTweet'] == tokenized_tweets

df.head()

## <a id='4-3'>4.3 Stemming</a>

*"In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood"*

"**Stemming** *is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language*"

**Stemming** is a text processing technique in Natural Language Processing (NLP) that involves **reducing words to their base or root form**. The resulting form may not always be a valid word, but it can help in tasks like text analysis, information retrieval, and text mining.

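For example, the Porter stemmer maps "connect", "connected", "connecting", and "connection" to the same stem, "connect". By reducing these words to a common base form, we can treat them as the same word in terms of their core meaning. Irregular forms are not covered by suffix rules: "ran" is left as "ran" rather than being mapped to "run".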

There are different algorithms for stemming, and one of the most widely used is the *Porter stemming algorithm*. The goal of stemming algorithms is **to strip affixes (in the Porter stemmer's case, suffixes) from words to obtain a common stem**.

Stemming is useful in scenarios where you want to reduce the complexity of text data without being overly concerned about linguistic correctness. It's often used in tasks like information retrieval, search engines, and text mining, where reducing the variety of words can lead to more effective processing and analysis. However, **in contexts where linguistic precision is critical, more advanced techniques like lemmatization may be preferred**.

Here are some key points about stemming:

- *Reduces Dimensionality*: Stemming helps in reducing the dimensionality of the feature space in NLP tasks. It reduces the number of unique words or tokens in a text, making it easier to process.
<br>

- *Speeds Up Processing*: It can improve the efficiency of text processing tasks since it reduces the number of distinct words that need to be handled.
<br>

- *May Produce Non-Standard Words*: The resulting stems may not always be real words. For instance, the Porter stemmer reduces "movie" to "movi" and "probably" to "probabl", neither of which is a standard English word.
<br>

- *May Produce Ambiguous Results*: Stemming can map words with different meanings to the same stem. For example, the Porter stemmer reduces "universe", "university", and "universal" all to "univers", so the distinction between them is lost.
<br>

- *Less Contextually Sensitive*: Stemming is a rule-based approach and doesn't take context into account. It applies predefined rules to trim affixes, which means it might not always capture the correct root word.
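To show what "predefined rules" look like in practice, the sketch below implements only step 1a of the Porter algorithm (the plural-suffix rules from Porter's original description). It is a small fragment, not the full stemmer: the function name `porter_step_1a` is hypothetical, and NLTK's `PorterStemmer` applies several further steps and may treat some short words differently.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural suffixes)."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress (left unchanged)
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats", "movies"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried in order and only the first match is applied, which is why "caress" is protected by the `ss` rule before the plain `s` rule can strip it. It also shows how non-words such as "poni" and "movi" arise from purely mechanical suffix handling.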

In [64]:
from nltk.stem.porter import PorterStemmer

In [65]:
ps = PorterStemmer()

def stem_words(text):
    return " ".join(ps.stem(word) for word in text.split())

In [68]:
# Examples:
# 1)
sample = 'walk walks walking walked'
print(stem_words(sample))

# 2)
text = 'probably my alltime favorite movie a story of selflessness sacrifice and dedication to a noble cause but its not preachy or boring it just never gets'
print(stem_words(text))

walk walk walk walk
probabl my alltim favorit movi a stori of selfless sacrific and dedic to a nobl caus but it not preachi or bore it just never get


## <a id='4-4'>4.4 Lemmatization</a>

"**Lemmatization**, *unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the* **lemma**. *A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.*"

**Lemmatization** is a text processing technique in Natural Language Processing (NLP) that **involves reducing words to their base or root form, while still ensuring that the root form belongs to the language**. Unlike stemming, which may produce non-standard or even non-existent words, lemmatization ensures that the root word belongs to the language's dictionary.

For example, the lemma (base form) of the verb forms "running", "runs", and "ran" is "run". Similarly, the lemma of the adjective "better" is "good", and the lemma of "meeting" is "meet" when it is used as a verb (as a noun, it stays "meeting").

Lemmatization is commonly used in applications where preserving linguistic correctness and semantic meaning is important, such as in machine translation, question-answering systems, chatbots, and other contexts where precise understanding of the text is crucial. It's also used in tasks that require word sense disambiguation, as it retains the original meaning of words.

Here are some key points about lemmatization:

- *Retains Valid Words*: Unlike stemming, lemmatization produces valid words that exist in the language's dictionary. This makes it more linguistically accurate.
<br>

- *Contextually Sensitive*: Lemmatization takes into account the context and part-of-speech (POS) of a word. For instance, the word "better" could be a comparative adjective or a verb. Lemmatization identifies the correct base form based on the context.
<br>

- *Slower Than Stemming*: Lemmatization is typically slower than stemming because it involves dictionary lookups and morphological analysis to determine the correct lemma.
<br>

- *Use of POS Tags*: Lemmatization often requires part-of-speech (POS) tags to accurately identify the lemma. For example, the lemma of "better" as an adjective is different from its lemma as a verb.
<br>

- *More Precise for Tasks Requiring Linguistic Accuracy*: Lemmatization is preferred in tasks where linguistic accuracy is critical, such as in language translation, sentiment analysis, or text summarization.
<br>

- *Larger Resource Requirements*: Lemmatization may require access to a larger linguistic resource, like a comprehensive dictionary or corpus, in order to perform accurately.
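The role of the dictionary and of POS tags can be sketched with a toy lookup table. Everything here is an assumption made for illustration: the table `LEMMA_TABLE` is hand-written and tiny, and `toy_lemmatize` is a hypothetical helper, whereas a real lemmatizer consults a full lexical resource such as WordNet.

```python
# Tiny hand-written lookup table (hypothetical). Keys are (word, POS) pairs,
# using 'a' for adjective, 'v' for verb, and 'n' for noun.
LEMMA_TABLE = {
    ("better", "a"): "good",     # comparative adjective
    ("better", "v"): "better",   # verb, as in "to better oneself"
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word, pos in [("better", "a"), ("better", "v"), ("meeting", "n"), ("meeting", "v")]:
    print(f"{word:10}{pos:4}{toy_lemmatize(word, pos)}")
```

The same surface form yields different lemmas depending on the POS key, and every result is a real word, which are the two properties that separate lemmatization from stemming.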



The **WordNetLemmatizer** is a class provided by the Natural Language Toolkit (NLTK) library in Python. It is **used for lemmatization**, which is the process of reducing words to their base or root form while ensuring that the resulting form belongs to the language's dictionary.

Here are the key points about WordNetLemmatizer:

- *Based on WordNet*: WordNet is a lexical database for the English language. It provides a hierarchy of words and their relationships, including synonyms, antonyms, and more. The WordNetLemmatizer uses WordNet as a reference to perform lemmatization.

- *Linguistically Accurate*: The lemmas produced by the WordNetLemmatizer are valid words that exist in the English language. It ensures that the base form belongs to the language's dictionary.

- *POS-Dependent*: `lemmatize()` does not infer context on its own. It takes an optional `pos` argument that defaults to `'n'` (noun), so lemmatization is more accurate when accompanied by POS tagging, as different parts of speech may have different lemmas.

- *Part of the NLTK Library*: The WordNetLemmatizer is part of the NLTK library, which is a widely-used library for natural language processing tasks in Python.

- *Simple to Use*: It is easy to use. You instantiate the WordNetLemmatizer class and then apply the lemmatize() method to words.

In [74]:
from nltk.stem import WordNetLemmatizer

In [75]:
# Example 1:

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize a word
lemma = wordnet_lemmatizer.lemmatize("running", pos = "v")
print(lemma)

run


In [77]:
# Example 2:

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Setting up the Sentence and Punctuations
sentence = 'He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun.'
punctuations = '?:!.,;'

# Tokenize the sentence into a list of words
sentence_words = nltk.word_tokenize(sentence)

# Remove the punctuation marks (punctuations) from the list of words (sentence_words).
# A new list is built because removing items from a list while iterating over it can skip elements.
sentence_words = [word for word in sentence_words if word not in punctuations]

# Print the Words and their Lemmas
print('{0:20}{1:20}'.format('Word','Lemma'))       # <--- This line prints the headers for the Word and Lemma columns, formatted to take up 20 characters each.

for word in sentence_words:
    print('{0:20}{1:20}'.format(word,wordnet_lemmatizer.lemmatize(word, pos = 'v')))     # <--- For each word, it prints the word and its lemma using wordnet_lemmatizer.lemmatize(word). The output is formatted to take up 20 characters for each column. The pos='v' argument specifies that the lemmatization should be performed assuming the word is a verb. This means it's finding the base form of each word assuming it's a verb.

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 
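Every word in the loop above is lemmatized as a verb, which is why the noun "hours" stays "hours". One common refinement is to tag each token first and pass a matching `pos` value. The helper below is a sketch of that mapping from Penn Treebank tags (the tag set `nltk.pos_tag` returns for English) to the single-letter codes WordNet uses; the function name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"   # adverbs: RB, RBR, RBS
    return "n"       # nouns, and the fallback for every other tag

for tag in ["VBG", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Assuming NLTK's tagger data is available, each `(word, tag)` pair from `nltk.pos_tag(sentence_words)` could then be lemmatized with `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.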


**Applying WordNetLemmatizer to the Tweet Data**

In [79]:
# Initialize the Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Define a Function for Lemmatization:
def lemmatize_text(text):
    return ' '.join([wordnet_lemmatizer.lemmatize(word, pos = 'v') for word in text.split()])

# Apply the Lemmatization Function to df:
df['LemmatizedTweets'] = df['OriginalTweet'].apply(lemmatize_text)


In [80]:
df.head(20)

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment | LemmatizedTweets |
|---|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | menyrbie philgahan chrisitv | Neutral | menyrbie philgahan chrisitv |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice talk neighbours family exchange phon... | Positive | advice talk neighbour family exchange phone nu... |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | coronavirus australia woolworths give elderly... | Positive | coronavirus australia woolworths give elderly ... |
| 3 | 3802 | 48754 | | 16-03-2020 | food stock one empty please dont panic ... | Positive | food stock one empty please dont panic enough ... |
| 4 | 3803 | 48755 | | 16-03-2020 | ready go supermarket covid19 outbreak i... | Extremely Negative | ready go supermarket covid19 outbreak im paran... |
| 5 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | news regionx0092s first confirmed covid19 c... | Positive | news regionx0092s first confirm covid19 case c... |
| 6 | 3805 | 48757 | 35.926541,-78.753267 | 16-03-2020 | cashier grocery store sharing insights cov... | Positive | cashier grocery store share insights covid19 p... |
| 7 | 3806 | 48758 | Austria | 16-03-2020 | supermarket today didnt buy toilet paper re... | Neutral | supermarket today didnt buy toilet paper rebel... |
| 8 | 3807 | 48759 | Atlanta, GA USA | 16-03-2020 | due covid19 retail store classroom atlanta... | Positive | due covid19 retail store classroom atlanta ope... |
| 9 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | corona preventionwe stop buy things cash ... | Negative | corona preventionwe stop buy things cash use o... |


## <a id='5'>5 Train-Test Split of Data</a>

In [82]:
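# SPLIT DATA
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(df.drop('Sentiment', axis = 1),           # <--- features: every column except the label
                                                    df['Sentiment'],                          # <--- target: the sentiment label
                                                    train_size=0.8,                            # <--- 80% train and 20% test
                                                    random_state=42)                           # <--- fixed seed for a reproducible split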

print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(32925, 6)
(32925,)
(8232, 6)
(8232,)
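The shapes above are consistent with rounding the training size down and giving the remainder to the test set. A quick arithmetic check, assuming the DataFrame has 41157 rows (the sum of the two printed row counts):

```python
import math

n_samples = 32925 + 8232          # total rows, inferred from the printed shapes
train_size = 0.8

n_train = math.floor(train_size * n_samples)   # 0.8 * 41157 = 32925.6, rounded down
n_test = n_samples - n_train                   # the test set takes the remainder

print(n_samples, n_train, n_test)
```

This is why the split is not exactly 80/20: the row count is not divisible by five, so the test set ends up one row larger than a rounded-to-nearest split would give.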


## <a id='6'>6 Feature Extraction from Text</a>

**Feature Extraction from Text** involves **converting raw text data into a numerical format** that can be used as input for machine learning models. This is a crucial step in natural language processing (NLP) tasks, as most machine learning algorithms require numerical input. The extracted features are then used as input for machine learning models to perform tasks like classification, regression, clustering, and more. The choice of feature extraction technique depends on the specific NLP task, the nature of the data, and the characteristics of the text corpus.

Here are **some common techniques** for feature extraction from text:

- *Bag of Words (BoW)*:
    - BoW represents text data as a collection of words, disregarding grammar and word order. It creates a vocabulary of all unique words in a corpus and counts the frequency of each word in a document. Each document is then represented as a vector where each element corresponds to the frequency of a word in the vocabulary.
<br>

- *Term Frequency-Inverse Document Frequency (TF-IDF)*:
    - TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It takes into account both the frequency of a word in a document (Term Frequency) and the rarity of the word in the entire corpus (Inverse Document Frequency).
<br>

- *Word Embeddings*:
    - Word embeddings are dense, low-dimensional vectors that represent words in a continuous vector space. Techniques like Word2Vec, GloVe, and FastText learn these embeddings by considering the context in which words appear. They capture semantic relationships between words and are effective in capturing word similarity and analogy.
<br>

- *Word Counts and Character Counts*:
    - Simple features like the total number of words in a document, average word length, or frequency of specific characters can also be used as features.
<br>

- *N-grams*:
    - N-grams are sequences of $N$ consecutive words. For example, bigrams consist of pairs of adjacent words. By considering sequences of words, N-grams can capture more local context than single-word (unigram) BoW features.
<br>

- *Part-of-Speech (POS) Tagging*:
    - POS tagging assigns a grammatical label to each word in a sentence (e.g., noun, verb, adjective). These tags can be used as features to capture linguistic information.
<br>

- *Sentiment Scores*:
    - Sentiment analysis tools can be used to assign sentiment scores to text, indicating the sentiment (positive, negative, neutral) expressed in the text.
<br>

- *Topic Modeling*:
    - Topic modeling techniques like Latent Dirichlet Allocation (LDA) can be used to extract topics from a collection of documents. The distribution of topics in a document can be used as features.
<br>

- *Syntactic Features*:
    - Features related to sentence structure, such as the presence of specific grammatical constructs (e.g., passive voice, conditional clauses), can be used.
<br>

- *Dependency Parsing*:
    - Features based on syntactic relationships between words in a sentence can be used to capture structural information.
<br>

- *Lexical Diversity Measures*:
    - Metrics like type-token ratio or TTR (ratio of unique words to total words) can be used to measure the richness and diversity of vocabulary in a document.
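Two of the simpler features in the list above, N-grams and the type-token ratio, can be sketched in a few lines of plain Python. The helper names and the sample tweet are hypothetical, and the text is assumed to be already cleaned and split on whitespace.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def type_token_ratio(tokens):
    """Ratio of unique tokens to total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical cleaned tweet, split on whitespace.
tokens = "please dont panic there is enough food please stay calm".split()

print(ngrams(tokens, 2)[:3])       # first three bigrams
print(len(tokens))                 # word count
print(type_token_ratio(tokens))    # lexical diversity
```

Representing each n-gram as a tuple keeps it hashable, so the results can be counted with `collections.Counter` in the same way as single words.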



### <a id='6-1'>6.1 Bag of Words</a>

**The Bag of Words (BoW)** method is a fundamental technique in Natural Language Processing (NLP) used for text processing and feature extraction. It's called "bag" because it **involves treating text data as an unordered collection or bag of words, disregarding grammar, word order, and context**. BoW is widely used in various NLP tasks like sentiment analysis, document classification, and information retrieval.

Here are the **key steps** and concepts in the Bag of Words method:

- *Tokenization*:
    - The first step is to break down a piece of text into individual words or tokens. This process may involve removing punctuation and handling special cases like contractions.

- *Vocabulary Building*:
    - Once the text is tokenized, a vocabulary is constructed. This vocabulary consists of all unique words (or tokens) that appear in the corpus (collection of documents).

- *Word Frequency Count*:
    - For each document in the corpus, a vector is created where each element represents the frequency of a word in the document. These vectors can be very high-dimensional, with each dimension corresponding to a word in the vocabulary.

- *Sparse Matrix Representation*:
    - The result of BoW is often represented as a sparse matrix. A sparse matrix is a data structure that only stores non-zero elements, which are the counts of words in this case. This is efficient in terms of memory.

- *Normalization (Optional)*:
    - Depending on the specific task, the frequency counts can be normalized to make them more comparable across different documents. Common normalization techniques include TF-IDF (Term Frequency-Inverse Document Frequency).

- *Feature Vectors*:
    - Each document is represented as a feature vector where each element corresponds to the frequency of a specific word in the vocabulary. The order of the words does not matter, hence the term "bag of words".

- *Loss of Contextual Information*:
    - One limitation of BoW is that it completely ignores the order of words and any contextual information. For example, "not good" and "good not" would be represented the same way.

- *High Dimensionality*:
    - BoW can lead to high-dimensional feature spaces, especially for large vocabularies and extensive documents. This can impact the efficiency of some machine learning algorithms.

- *Application in Machine Learning*:
    - BoW vectors are commonly used as input features for various machine learning models. For example, in sentiment analysis, these vectors can be fed into a classifier to predict the sentiment of a document.
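The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a replacement for a library vectorizer: the three-document corpus is hypothetical, tokenization is a plain whitespace split, and the normalization step uses the textbook TF-IDF form `tf * log(N / df)`. Library implementations typically use smoothed or rescaled variants, so their numbers will differ.

```python
import math
from collections import Counter

docs = ["not good", "good not", "very good food"]    # hypothetical corpus

# 1) Tokenization: plain whitespace split (assumes the text is already cleaned).
tokenized = [doc.split() for doc in docs]

# 2) Vocabulary building: sorted list of the unique tokens in the corpus.
vocab = sorted({tok for doc in tokenized for tok in doc})

# 3) Word frequency count: one count vector per document, aligned with vocab.
def bow_vector(tokens):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

bow = [bow_vector(doc) for doc in tokenized]

# 4) Optional normalization: textbook TF-IDF, tf * log(N / df).
n_docs = len(docs)
doc_freq = [sum(1 for vec in bow if vec[j] > 0) for j in range(len(vocab))]
tfidf = [[tf * math.log(n_docs / df) for tf, df in zip(vec, doc_freq)] for vec in bow]

print(vocab)
print(bow)
print(bow[0] == bow[1])   # "not good" and "good not" get identical vectors
```

Because "good" occurs in every document, its inverse document frequency is `log(3 / 3) = 0`, so TF-IDF removes it as a distinguishing feature, while rarer words such as "food" keep a positive weight.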

<u>Note: </u>
BoW is a powerful and versatile technique, but it may not be suitable for tasks where word order or context is crucial (e.g., language translation or tasks requiring understanding of semantics). In such cases, more advanced techniques like word embeddings or deep learning models may be more appropriate.