In [97]:
#! /usr/bin/python3

# INTRODUCTION

Welcome to this Jupyter Notebook where we will delve into various Natural Language Processing (NLP) techniques. Through hands-on exploration, we aim to gain a deeper understanding of this evolving NLP paradigm and its practical applications. Join me on this journey as we unravel the intricacies of language processing and analysis.

# 1. Classify the spam using Naive Bayes

In this section, we'll dive into the fascinating task of spam classification using the Naive Bayes algorithm. Spam classification is a classic problem in the realm of Natural Language Processing (NLP) and machine learning. By employing the Naive Bayes approach, we aim to build a model that can effectively distinguish between spam and non-spam messages based on their textual content. Let's walk through the process of preprocessing the data, training the Naive Bayes classifier, and evaluating its performance. By the end of this section, you'll have a solid grasp of how Naive Bayes can be harnessed for practical NLP tasks like spam detection.

In [98]:
# import the libraries

import requests
import pandas as pd
import zipfile

In [99]:
# get the data

url = 'https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip'
data_file = 'SMSSpamCollection'

response = requests.get(url)
filename = url.split("/")[-1]

with open(filename, 'wb') as file:
    file.write(response.content)

with zipfile.ZipFile(filename, 'r') as zip:
    zip.extractall('')

data = pd.read_table(data_file,
                    header = 0,
                    names = ['type', 'message'])


In [100]:
# see a sample of our data

data.sample(3)

Unnamed: 0,type,message
4478,ham,I anything lor.
159,spam,You are a winner U have been specially selecte...
270,ham,"Come to mu, we're sorting out our narcotics si..."


In [101]:
# description of our data

data.describe()

Unnamed: 0,type,message
count,5571,5571
unique,2,5168
top,ham,"Sorry, I'll call later"
freq,4824,30


As we can see, we have 5571 rows with two columns each. One column is the message, and the other is the type of the message, spam or ham (no spam).

Now, we will transform our data, so we will be able to work with it.

In [102]:
# we tokenize our message

import nltk

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import *

nltk.download('punkt')
nltk.download('stopwords')

stop = stopwords.words('english')

data['tokens'] = data['message'].apply(lambda x: nltk.word_tokenize(x))

data['tokens'] = data['tokens'].apply(lambda x: [palabra for palabra in x if palabra not in stop])

stemmer = PorterStemmer()

data['tokens'] = data['tokens'].apply(lambda x: [stemmer.stem(item) for item in x])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Perrosato\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Perrosato\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [103]:
# see our data again

data.sample(5)

Unnamed: 0,type,message,tokens
1620,ham,"Fuck babe, I miss you sooooo much !! I wish yo...","[fuck, babe, ,, i, miss, sooooo, much, !, !, i..."
3385,ham,Ok can...,"[ok, ...]"
5217,ham,I accidentally brought em home in the box,"[i, accident, brought, em, home, box]"
873,ham,Ugh its been a long day. I'm exhausted. Just w...,"[ugh, long, day, ., i, 'm, exhaust, ., just, w..."
2191,ham,Thankyou so much for the call. I appreciate yo...,"[thankyou, much, call, ., i, appreci, care, .]"


Once we have the dataset, we need to transform it into a matrix format. A common representation is the term-document matrix or feature matrix, where each row corresponds to a document (data point) and each column represents a unique feature. In your case, where you're dealing with text data for spam classification, you might use techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into numerical features.

In [104]:
# import the functions

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# re-join the strings

data['tokens'] = data['tokens'].apply(lambda x: ' '.join(x))

# split the data

x_train, x_test, y_train, y_test = train_test_split(
    data['tokens'],
    data['type'],
    test_size = 0.2
)

# create the vectorizer

vectorizer = CountVectorizer(
    strip_accents = 'ascii',
    lowercase = True
)

# fit vectorizer

vectorizer_fit = vectorizer.fit(x_train)
x_train_transformed = vectorizer_fit.transform(x_train)
x_test_transformed = vectorizer_fit.transform(x_test)


With the matrix prepared, we can apply the Naive Bayes algorithm for classification. Naive Bayes is a probabilistic algorithm that makes predictions based on the calculated probabilities of a data point belonging to different classes.

In [105]:
# import the functions

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

# train the model

naive_bayes = MultinomialNB()
naive_bayes_fit = naive_bayes.fit(x_train_transformed, y_train)

# make the predictions

train_predict = naive_bayes_fit.predict(x_train_transformed)
test_predict = naive_bayes_fit.predict(x_test_transformed)

# get the results

print(f"The train has {balanced_accuracy_score(y_train, train_predict)} of accuracy, with the confusion matrix:\n",
        f"{confusion_matrix(y_train, train_predict)}")

print(f"\nThe test has {balanced_accuracy_score(y_test, test_predict)} of accuracy, with the confusion matrix:\n",
        f"{confusion_matrix(y_test, test_predict)}")


The train has 0.9849953658640458 of accuracy, with the confusion matrix:
 [[3856   11]
 [  16  573]]

The test has 0.9583382934539635 of accuracy, with the confusion matrix:
 [[956   1]
 [ 13 145]]


In [106]:
# we will create a function that is able to predict if a message is spam:

def spam_detector(message_vect = None):

    if message_vect == None:
        message = str(input("Insert your message: "))
    else:
        message = message_vect
    message_token = nltk.word_tokenize(message)
    message_clean = [word for word in message_token if word not in stop]
    message_clean = [stemmer.stem(item) for item in message_clean]

    message = ' '.join(message_clean)

    message_vect = vectorizer_fit.transform([message])
    
    return(naive_bayes_fit.predict(message_vect)[0])


In [107]:
# example of the use of the detector

spam_detector('You just won a brand new car. To get if for free, just click in this link')

'spam'

# 2. Sentimetal analysis

In the upcoming section, we're set to explore the captivating realm of sentiment analysis using advanced machine learning techniques. Sentiment analysis, a pivotal application of Natural Language Processing (NLP), involves discerning the emotional tone underlying textual content. Our focus turns towards unraveling the sentiments expressed within text, ranging from positive and neutral to negative. Armed with cutting-edge methods, we'll delve into the process of preparing data, constructing robust sentiment analysis models, and evaluating their effectiveness. By immersing ourselves in this section, you'll gain a profound understanding of how to wield these techniques for real-world NLP tasks, enabling you to develop models adept at deciphering and classifying sentiments within text.

In [108]:
# install kaggle and download the data

!pip install kaggle

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

!kaggle datasets download -d arkhoshghalb/twitter-sentiment-analysis-hatred-speech

twitter-sentiment-analysis-hatred-speech.zip: Skipping, found more recently modified local copy (use --force to force download)


In [109]:
# get the two datasets

train = pd.read_csv("train.csv", index_col = 'id')
test = pd.read_csv("test.csv", index_col = 'id')

In [110]:
# see an example of the train

train.sample(5)

Unnamed: 0_level_0,label,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1
11181,0,@user ahhhh might have guessed #euro2016
19640,0,(1/3) #samanthabee may be the only #woman in #...
7059,0,it's arrived!! my free scarf from @user for re...
19128,0,i finally found a way how to delete old tweets...
9746,0,sunday morning beach run #weekend #running #...


In [111]:
# see an example of the test 

test.sample(5)

Unnamed: 0_level_0,tweet
id,Unnamed: 1_level_1
44081,hot new release! swiftly sharpens the fang #r...
37592,#playinggames buffalo simulation: buffalo fo...
35404,@user happy it's friday #tgif #kitten #catsof...
46521,@user i always think that at 1st .. i'm sure w...
46743,@user danish #imam tried for #antijewish incit...


In this NLP project, our goal is to develop a model that can determine whether tweets are racist or not. The dataset we're working with contains labeled tweets, where the label is assigned as follows:

- **Label 1**: Tweets that are identified as racist.
- **Label 0**: Tweets that are not racist.

We'll be using a machine learning approach to build and train a classifier for this task.


In [112]:
# import some libraries 

!pip install tweet-preprocessor
import re
import preprocessor as p

REPLACE_NO_SPACE = re.compile("(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\|)|(\()|(\))|(\[)|(\])|(\%)|(\$)|(\>)|(\<)|(\{)|(\})")
REPLACE_WITH_SPACE = re.compile("(<br\s/><br\s/?)|(-)|(/)|(:).")



In [113]:
# function to clean the tweets

def clean_tweets(df):
  tempArr = []
  for line in df:
    tmpL = p.clean(line)
    tmpL = REPLACE_NO_SPACE.sub("", tmpL.lower()) 
    tmpL = REPLACE_WITH_SPACE.sub(" ", tmpL)
    tempArr.append(tmpL)
  return tempArr

In [114]:
# clean the tweets

train['tweet'] = clean_tweets(train['tweet'])


In [115]:
# split the data

x_train, x_test, y_train, y_test = train_test_split(train['tweet'],
                                                    train['label'],
                                                    test_size = 0.2,
                                                    shuffle = True 
                                                    )

In [116]:
# create and train the vectorizer

vectorizer = CountVectorizer(stop_words = 'english', binary = True)

vectorizer.fit(train['tweet'])

x_train_vec = vectorizer.transform(x_train)
x_test_vec = vectorizer.transform(x_test)

In [117]:
# import the algorithm and create the model

from sklearn import svm

svm = svm.SVC(kernel = 'linear')

svm.fit(x_train_vec, y_train)

y_train_pred = svm.predict(x_train_vec)
y_test_pred = svm.predict(x_test_vec)

In [118]:
# evaluate the model

from sklearn.metrics import accuracy_score

print("The accuracy score is: ", round(accuracy_score(y_true = y_test, y_pred = y_test_pred) * 100, 2), '%.')


The accuracy score is:  95.15 %.


Now that we have our model trained, we will take the test data to make an example of how to predict if a tweet or a group of them are or not racists.

In [132]:
# we take a random tweet to do the example

import random

num = random.randint(test.index.min(), test.index.max())

tweet = test.loc[num]['tweet']

tweet_cleaned = clean_tweets([tweet])
tweet_vec = vectorizer.transform(tweet_cleaned)
tweet_pred = svm.predict(tweet_vec)

if tweet_pred[0] == 1:
    predict = "is"
else:
    predict = "is not"

print(f'The {num} tweet {predict} racist.')

The 42346 tweet is not racist.


We will also evaluate our model with the test data.

In [133]:
# create a column with the predict

test['tweet'] = clean_tweets(test['tweet'])
test_vect = vectorizer.transform(test['tweet'])
test['predict'] = svm.predict(test_vect)

test.head(5)

Unnamed: 0_level_0,tweet,predict
id,Unnamed: 1_level_1,Unnamed: 2_level_1
31963,to find,0
31964,want everyone to see the new and heres why,0
31965,safe ways to heal your,0
31966,is the hp and the cursed child book up for res...,0
31967,rd to my amazing hilarious eli ahmir uncle dav...,0


In [135]:
# print the accuracy

print(f"There are {len(test.loc[test['predict'] == 1])} racist and {len(test) - len(test.loc[test['predict'] == 1])} not racists.")

There are 873 racist and 16324 not racists.


In [139]:
# some of the racist tweets 

test.loc[test['predict'] == 1].sample(5)

Unnamed: 0_level_0,tweet,predict
id,Unnamed: 1_level_1,Unnamed: 2_level_1
43803,get familiar w the elite roots of richard spe...,1
40214,the impression given was the houses would have...,1
39172,is not an racism is people based on race to j...,1
44497,im genuinely disappointed that a performer as ...,1
33911,thank fuck for that i was running out of vids ...,1
