Build a text classification model using the Naive Bayes algorithm to categorize the blog posts accurately. Furthermore, perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts. 

# Data Exploration and Preprocessing

In [1]:
import warnings                   # import library warnings
warnings.filterwarnings('ignore')  # it will ignore warnings in the code

In [2]:
import pandas as pd
data = pd.read_csv(r'blogs.csv', header=0)  # load dataset, 0th row as a header
data.head() # display top 5 rows

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,alt.atheism
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,alt.atheism
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,alt.atheism
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,alt.atheism


In [3]:
data.info() # gives info about null values and data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Data    2000 non-null   object
 1   Labels  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [4]:
data.isnull().sum()  # there is no null value 

Data      0
Labels    0
dtype: int64

In [5]:
data.dtypes # data type in each column

Data      object
Labels    object
dtype: object

In [6]:
data.shape  # rows and columns

(2000, 2)

In [7]:
data[data.duplicated()] # display the duplicated row

Unnamed: 0,Data,Labels


there is no duplicate row in the dataset

In [8]:
data['Labels'].unique()

array(['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
       'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
       'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
       'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt',
       'sci.electronics', 'sci.med', 'sci.space',
       'soc.religion.christian', 'talk.politics.guns',
       'talk.politics.mideast', 'talk.politics.misc',
       'talk.religion.misc'], dtype=object)

# convert the Labels into numbers using LabelEncoding

In [9]:
colname = ['Labels']

In [10]:
from sklearn.preprocessing import LabelEncoder # import LabelEncoder function from preprocessing sublibrary
le=LabelEncoder()                              # save LabelEncoder function in a variable le
for x in colname:                             
    data[x]=le.fit_transform(data[x]) # it assigns numbers to all values of categorical column
    le_name_mapping = dict(zip(le.classes_,le.transform(le.classes_)))  # represent in a dictionary
    print('Feature',x)
    print('mapping',le_name_mapping)

Feature Labels
mapping {'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9, 'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13, 'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16, 'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}


In [11]:
data.head()

Unnamed: 0,Data,Labels
0,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,0
1,Newsgroups: alt.atheism\nPath: cantaloupe.srv....,0
2,Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...,0
3,Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...,0
4,Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...,0


In [12]:
#Labels count: 
data['Labels'].value_counts() # our dataset is balanced

0     100
1     100
18    100
17    100
16    100
15    100
14    100
13    100
12    100
11    100
10    100
9     100
8     100
7     100
6     100
5     100
4     100
3     100
2     100
19    100
Name: Labels, dtype: int64

In [13]:
data['Data']

0       Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...
1       Newsgroups: alt.atheism\nPath: cantaloupe.srv....
2       Path: cantaloupe.srv.cs.cmu.edu!das-news.harva...
3       Path: cantaloupe.srv.cs.cmu.edu!magnesium.club...
4       Xref: cantaloupe.srv.cs.cmu.edu alt.atheism:53...
                              ...                        
1995    Xref: cantaloupe.srv.cs.cmu.edu talk.abortion:...
1996    Xref: cantaloupe.srv.cs.cmu.edu talk.religion....
1997    Xref: cantaloupe.srv.cs.cmu.edu talk.origins:4...
1998    Xref: cantaloupe.srv.cs.cmu.edu talk.religion....
1999    Xref: cantaloupe.srv.cs.cmu.edu sci.skeptic:43...
Name: Data, Length: 2000, dtype: object

# 	Preprocess the data by cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing, and removing stopwords.

Let's remove the non-alphanumeric characters i.e. special characters (like @, #, $ etc.) from the dataset using regex function

In [14]:
import re # regex function

In [15]:
def remove_tags(string):
    result = re.sub('','',string)          #remove HTML tags 
    result = re.sub('https://.*','',result)   #remove URLs
    result = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", result) # remove special characters in result & replace them with a blank space
    result = result.lower()  # convert text to lowercase
    return result # after execution of return statement, interpreter will come out of the function and go to the location where the func. is called

data['Data']=data['Data'].apply(lambda cw : remove_tags(cw)) # apply func used to apply lambda func to a dataframe, remove_tags()func is called & cw will copy over string 

In [16]:
data['Data']   # we can see now Data column doesn't have special characters 

0       path  cantaloupe srv cs cmu edu magnesium club...
1       newsgroups  alt atheism path  cantaloupe srv c...
2       path  cantaloupe srv cs cmu edu das news harva...
3       path  cantaloupe srv cs cmu edu magnesium club...
4       xref  cantaloupe srv cs cmu edu alt atheism 53...
                              ...                        
1995    xref  cantaloupe srv cs cmu edu talk abortion ...
1996    xref  cantaloupe srv cs cmu edu talk religion ...
1997    xref  cantaloupe srv cs cmu edu talk origins 4...
1998    xref  cantaloupe srv cs cmu edu talk religion ...
1999    xref  cantaloupe srv cs cmu edu sci skeptic 43...
Name: Data, Length: 2000, dtype: object

# Remove stop words . Stop words don't hold any special meaning in a sentence like 'and', 'the' etc. So, we should remove them using nltk library which has stop words list

In [17]:
import nltk  # nltk lib. (Natural Language Toolkit)
nltk.download('stopwords') # download stopwords package to sublib. corpus 
from nltk.corpus import stopwords # import stopwords func. from sublib. corpus 
stop_words = set(stopwords.words('english')) # list of stop words in english

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
stop_words # list of stop words in english 

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [19]:
# remove stop words 
# join words in Data column with a blank space if they are not in stopwords list 
data['Data'] = data['Data'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In [20]:
data['Data']  # now Data column doesn't have stop words 

0       path cantaloupe srv cs cmu edu magnesium club ...
1       newsgroups alt atheism path cantaloupe srv cs ...
2       path cantaloupe srv cs cmu edu das news harvar...
3       path cantaloupe srv cs cmu edu magnesium club ...
4       xref cantaloupe srv cs cmu edu alt atheism 534...
                              ...                        
1995    xref cantaloupe srv cs cmu edu talk abortion 1...
1996    xref cantaloupe srv cs cmu edu talk religion m...
1997    xref cantaloupe srv cs cmu edu talk origins 41...
1998    xref cantaloupe srv cs cmu edu talk religion m...
1999    xref cantaloupe srv cs cmu edu sci skeptic 435...
Name: Data, Length: 2000, dtype: object

# Now, we perform lemmatization on the text column. Lemmatization is used to find the root form of words in NLP, for ex: root form of the words: reading, reads, read is read. This save unnecessary computational cost in decoding the entire words.

In lemmatization, text convert into tokens/words and then each token convert into root form

In [21]:
nltk.download('wordnet') # download package wordnet to sublib. stem
nltk.download('omw-1.4') # download package omw-1.4 to sublib. tokenize 
w_tokenizer = nltk.tokenize.WhitespaceTokenizer() # save function in a variable 
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(sentence): # define a func and pass sentence in it 
    st = ""    # empty string 
    for w in w_tokenizer.tokenize(sentence):   # convert text into tokens using WhitespaceTokenizer() func saved in var. w_tokenizer 
        st = st + lemmatizer.lemmatize(w) + " "   # convert token into root form using WordNetLemmatizer() func saved in var. lemmatizer  
    return st
data['Data'] = data['Data'].apply(lambda t: lemmatize_text(t)) # func. is called, t will copy over sentence 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [22]:
data['Data']  # we can see Data column has root form of words 

0       path cantaloupe srv c cmu edu magnesium club c...
1       newsgroups alt atheism path cantaloupe srv c c...
2       path cantaloupe srv c cmu edu da news harvard ...
3       path cantaloupe srv c cmu edu magnesium club c...
4       xref cantaloupe srv c cmu edu alt atheism 5348...
                              ...                        
1995    xref cantaloupe srv c cmu edu talk abortion 12...
1996    xref cantaloupe srv c cmu edu talk religion mi...
1997    xref cantaloupe srv c cmu edu talk origin 4103...
1998    xref cantaloupe srv c cmu edu talk religion mi...
1999    xref cantaloupe srv c cmu edu sci skeptic 4356...
Name: Data, Length: 2000, dtype: object

classify the blog posts into different categories 

In [23]:
data.head()

Unnamed: 0,Data,Labels
0,path cantaloupe srv c cmu edu magnesium club c...,0
1,newsgroups alt atheism path cantaloupe srv c c...,0
2,path cantaloupe srv c cmu edu da news harvard ...,0
3,path cantaloupe srv c cmu edu magnesium club c...,0
4,xref cantaloupe srv c cmu edu alt atheism 5348...,0


In [24]:
# define X and Y
X = data['Data']
Y = data['Labels']

In [25]:
X

0       path cantaloupe srv c cmu edu magnesium club c...
1       newsgroups alt atheism path cantaloupe srv c c...
2       path cantaloupe srv c cmu edu da news harvard ...
3       path cantaloupe srv c cmu edu magnesium club c...
4       xref cantaloupe srv c cmu edu alt atheism 5348...
                              ...                        
1995    xref cantaloupe srv c cmu edu talk abortion 12...
1996    xref cantaloupe srv c cmu edu talk religion mi...
1997    xref cantaloupe srv c cmu edu talk origin 4103...
1998    xref cantaloupe srv c cmu edu talk religion mi...
1999    xref cantaloupe srv c cmu edu sci skeptic 4356...
Name: Data, Length: 2000, dtype: object

In [26]:
Y

0        0
1        0
2        0
3        0
4        0
        ..
1995    19
1996    19
1997    19
1998    19
1999    19
Name: Labels, Length: 2000, dtype: int32

# Split the data into training and test sets.

In [27]:
# split the dataset 
# 75% training set, 25% testing set 
# stratify = Y will make sure that random split has same proportion of classes in both training(Y_train) & testing set(Y_test) 

from sklearn.model_selection import train_test_split 
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,stratify=Y, test_size=0.25,random_state=42) 

In [28]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(1500,)
(500,)
(1500,)
(500,)


# Perform feature extraction to convert text data into a format that can be used by the Naive Bayes model

# vectorize text to numbers using func. CountVectorizer

In [29]:
from sklearn.feature_extraction.text import CountVectorizer # import CountVectorizer func
vec = CountVectorizer(stop_words='english')

In [30]:
vec

CountVectorizer(stop_words='english')

In [31]:
X_train = vec.fit_transform(X_train).toarray() # convert X_train into numbers 

In [32]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [33]:
X_test = vec.transform(X_test).toarray() # convert X_test into numbers 

In [34]:
X_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

# Implement a Naive Bayes classifier to categorize the blog posts into their respective categories. We can use libraries like scikit-learn for this purpose.

In [35]:
# Now, we will fit the Naive Bayes model to the training data
from sklearn.naive_bayes import MultinomialNB   # multinomial for multiple classes, Gaussian for binary classes
classifier = MultinomialNB()

In [36]:
classifier

MultinomialNB()

# Train the model on the training set and make predictions on the test set.

In [37]:
classifier.fit(X_train,Y_train)  # train the data

MultinomialNB()

In [38]:
classifier.score(X_train, Y_train) # score of the model on training data

0.9966666666666667

In [39]:
# we will predict the test data 
Y_pred = classifier.predict(X_test) # predict the class of Y for the given testing data

In [40]:
Y_pred

array([ 9, 19, 11, 10,  5, 19,  8,  1,  8,  1, 15, 17, 16,  7, 17,  8,  6,
       10,  4,  7, 12,  2,  0,  3,  2,  3, 13,  9, 11,  9, 18, 12,  2, 13,
        8, 12,  5,  2, 14, 11, 14, 13,  2, 13, 17, 15, 10, 14,  3,  2, 17,
       19, 15, 18,  7, 19, 17,  9, 17,  1,  7, 18, 19, 16,  4,  2, 16, 15,
       17, 15, 14, 18,  9, 13,  5, 18, 16, 14,  5,  8,  1,  9,  5, 14, 19,
        0, 11,  4, 15, 14,  6, 11,  3,  1, 14, 19,  5,  7, 15,  4, 18, 17,
        1, 17,  3,  9,  7,  9, 19, 10, 16, 10,  7, 19,  3,  6, 16,  2, 19,
       19,  2, 14, 14,  3, 14, 11, 16, 17,  2,  5,  3,  6,  9,  2, 11,  7,
        1,  3,  7,  4, 13,  3,  0, 11, 13,  6,  3,  4,  5,  3, 12, 19, 11,
        3, 18,  0,  6, 11,  2, 18, 10,  0,  4,  9, 13,  1, 11, 19, 13, 13,
       12,  8,  3,  2,  7, 13, 13, 18, 10, 19,  1,  4, 13,  4, 16, 17, 11,
       16,  5, 14, 15, 14, 18,  3, 10, 12,  8,  3,  0,  6,  4, 14, 10,  3,
       16, 14, 18, 10,  5, 15, 18,  7,  0, 15, 19,  6,  2, 10, 11, 15,  3,
        1,  8, 11, 15,  5

In [41]:
print(list(zip(Y_test, Y_pred))) # compare actual Y with predicted Y

[(9, 9), (19, 19), (12, 11), (10, 10), (5, 5), (19, 19), (8, 8), (13, 1), (8, 8), (1, 1), (15, 15), (18, 17), (16, 16), (8, 7), (17, 17), (8, 8), (6, 6), (12, 10), (4, 4), (7, 7), (12, 12), (2, 2), (0, 0), (3, 3), (6, 2), (3, 3), (13, 13), (9, 9), (11, 11), (9, 9), (18, 18), (12, 12), (2, 2), (13, 13), (8, 8), (12, 12), (5, 5), (2, 2), (14, 14), (11, 11), (14, 14), (13, 13), (2, 2), (13, 13), (17, 17), (15, 15), (10, 10), (14, 14), (3, 3), (2, 2), (17, 17), (19, 19), (15, 15), (16, 18), (7, 7), (0, 19), (17, 17), (9, 9), (18, 17), (4, 1), (7, 7), (16, 18), (19, 19), (16, 16), (4, 4), (2, 2), (16, 16), (15, 15), (17, 17), (15, 15), (14, 14), (19, 18), (9, 9), (13, 13), (5, 5), (18, 18), (16, 16), (14, 14), (5, 5), (8, 8), (1, 1), (9, 9), (1, 5), (14, 14), (19, 19), (0, 0), (11, 11), (4, 4), (15, 15), (14, 14), (6, 6), (11, 11), (3, 3), (1, 1), (14, 14), (19, 19), (5, 5), (7, 7), (15, 15), (4, 4), (10, 18), (17, 17), (1, 1), (17, 17), (3, 3), (9, 9), (7, 7), (9, 9), (19, 19), (10, 10), (

In [42]:
# model evaluation 
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report # import these functions from metrics sublib.
cfm = confusion_matrix(Y_test,Y_pred)  #confusion matrix
print(cfm)

print('classification report')  # classification report
print(classification_report(Y_test,Y_pred))

acc = accuracy_score(Y_test,Y_pred)  # accuracy of the model
print('Multinomial Naive Bayes model accuracy:',acc)

[[15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  8]
 [ 0 20  1  1  0  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0 23  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  4 19  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  1  1  2 20  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0]
 [ 0  1  2  2  0 19  1  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  1  1  0  0 17  0  1  0  0  0  1  1  0  0  1  0  2  0]
 [ 0  0  0  0  0  0  0 20  0  0  1  0  0  0  0  0  2  0  2  0]
 [ 0  0  0  0  0  0  0  2 21  0  0  0  0  0  0  0  2  0  0  0]
 [ 0  0  0  0  0  0  0  0  1 21  3  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0 24  0  0  0  0  0  0  0  1  0]
 [ 0  0  0  0  0  0  0  0  0  0  0 25  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  1  0  1  1  0  0  1  1 16  0  2  0  0  0  1  0]
 [ 0  1  0  0  0  0  0  0  0  0  0  0  0 22  0  0  0  0  2  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  1  0 23  0  0  0  1  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 25  0  0

accuracy of our model on testing data is 0.82 which is very good

In [43]:
# model validating 
classifier.predict(vec.transform(['Newsgroups: alt.atheism']).toarray())  # we can pass any text for testing our model 

array([0])

0 means category 'alt.atheism' which is also confirmed by our dataset 

In [44]:
classifier.predict(vec.transform(['Xref: cantaloupe.srv.cs.cmu.edu comp.graphics:38728']).toarray())

array([1])

1 means category 'comp.graphics' which is also confirmed by our dataset 

In [45]:
classifier.predict(vec.transform(['Association of the United States has alerted the Defense Base']).toarray())

array([11])

11 means category 'sci.crypt' which is also confirmed by our dataset

Our model is making correct predictions. So, our model can classify an unseen data as one of the above given categories.

# perform sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts i.e. classify the blog post as positive, negative or neutral using SentimentIntensityAnalyzer

In [46]:
data.head()

Unnamed: 0,Data,Labels
0,path cantaloupe srv c cmu edu magnesium club c...,0
1,newsgroups alt atheism path cantaloupe srv c c...,0
2,path cantaloupe srv c cmu edu da news harvard ...,0
3,path cantaloupe srv c cmu edu magnesium club c...,0
4,xref cantaloupe srv c cmu edu alt atheism 5348...,0


In [47]:
# nltk lib. has vader lexicon sub lib. which has SentimentIntensityAnalyzer func. 
#SentimentIntensityAnalyzer func runs the entire NLP pipeline to do sentiment analysis of any sentence  
nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()  # initialise SentimentIntensityAnalyzer func 

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [48]:
# apply lambda function to a dataframe
# sid method has polarity_scores function that returns a dictionary of sentiment scores when we pass review as input 
# add a new column score that has dictionary of sentiment scores

data["score"] = data["Data"].apply(lambda review:sid.polarity_scores(review))
data.head()

Unnamed: 0,Data,Labels,score
0,path cantaloupe srv c cmu edu magnesium club c...,0,"{'neg': 0.178, 'neu': 0.68, 'pos': 0.142, 'com..."
1,newsgroups alt atheism path cantaloupe srv c c...,0,"{'neg': 0.015, 'neu': 0.871, 'pos': 0.115, 'co..."
2,path cantaloupe srv c cmu edu da news harvard ...,0,"{'neg': 0.18, 'neu': 0.747, 'pos': 0.074, 'com..."
3,path cantaloupe srv c cmu edu magnesium club c...,0,"{'neg': 0.257, 'neu': 0.59, 'pos': 0.153, 'com..."
4,xref cantaloupe srv c cmu edu alt atheism 5348...,0,"{'neg': 0.027, 'neu': 0.859, 'pos': 0.114, 'co..."


In [49]:
# Since, overall sentiment of the text given by compound score. So, retrieve the compound score from the score column
# add a new column compound which has compound scores 
data["compound"] = data["score"].apply(lambda d:d["compound"])

In [50]:
data.head()

Unnamed: 0,Data,Labels,score,compound
0,path cantaloupe srv c cmu edu magnesium club c...,0,"{'neg': 0.178, 'neu': 0.68, 'pos': 0.142, 'com...",-0.9904
1,newsgroups alt atheism path cantaloupe srv c c...,0,"{'neg': 0.015, 'neu': 0.871, 'pos': 0.115, 'co...",0.9251
2,path cantaloupe srv c cmu edu da news harvard ...,0,"{'neg': 0.18, 'neu': 0.747, 'pos': 0.074, 'com...",-0.99
3,path cantaloupe srv c cmu edu magnesium club c...,0,"{'neg': 0.257, 'neu': 0.59, 'pos': 0.153, 'com...",-0.9997
4,xref cantaloupe srv c cmu edu alt atheism 5348...,0,"{'neg': 0.027, 'neu': 0.859, 'pos': 0.114, 'co...",0.9778


In [51]:
# define a function to classify the blog posts as positive, negative or neutral post using compound score
def blog_sentiment(compound_score):  

    if compound_score > 0:
        return 'positive'
    elif compound_score == 0:
        return 'neutral'
    else:
        return 'negative'

data["sentiment"] = data["compound"].apply(lambda s : blog_sentiment(s)) # function is called and s will copy over compound_score

In [52]:
data.head()

Unnamed: 0,Data,Labels,score,compound,sentiment
0,path cantaloupe srv c cmu edu magnesium club c...,0,"{'neg': 0.178, 'neu': 0.68, 'pos': 0.142, 'com...",-0.9904,negative
1,newsgroups alt atheism path cantaloupe srv c c...,0,"{'neg': 0.015, 'neu': 0.871, 'pos': 0.115, 'co...",0.9251,positive
2,path cantaloupe srv c cmu edu da news harvard ...,0,"{'neg': 0.18, 'neu': 0.747, 'pos': 0.074, 'com...",-0.99,negative
3,path cantaloupe srv c cmu edu magnesium club c...,0,"{'neg': 0.257, 'neu': 0.59, 'pos': 0.153, 'com...",-0.9997,negative
4,xref cantaloupe srv c cmu edu alt atheism 5348...,0,"{'neg': 0.027, 'neu': 0.859, 'pos': 0.114, 'co...",0.9778,positive


In [53]:
del data['Labels']   # delete Labels column
del data["score"]    # delete score column
del data["compound"]  # delete compound column 

In [54]:
data.head()

Unnamed: 0,Data,sentiment
0,path cantaloupe srv c cmu edu magnesium club c...,negative
1,newsgroups alt atheism path cantaloupe srv c c...,positive
2,path cantaloupe srv c cmu edu da news harvard ...,negative
3,path cantaloupe srv c cmu edu magnesium club c...,negative
4,xref cantaloupe srv c cmu edu alt atheism 5348...,positive


In [55]:
data['sentiment'].unique()

array(['negative', 'positive', 'neutral'], dtype=object)

# apply LabelEncoding on sentiment column

In [56]:
colname = ['sentiment']
from sklearn.preprocessing import LabelEncoder # import LabelEncoder function from preprocessing sublibrary
le=LabelEncoder()                              # save LabelEncoder function in a variable le
for x in colname:                             
    data[x]=le.fit_transform(data[x]) # it assigns numbers to all values of categorical column
    le_name_mapping = dict(zip(le.classes_,le.transform(le.classes_)))  # represent in a dictionary
    print('Feature',x)
    print('mapping',le_name_mapping)

Feature sentiment
mapping {'negative': 0, 'neutral': 1, 'positive': 2}


In [57]:
data.head()

Unnamed: 0,Data,sentiment
0,path cantaloupe srv c cmu edu magnesium club c...,0
1,newsgroups alt atheism path cantaloupe srv c c...,2
2,path cantaloupe srv c cmu edu da news harvard ...,0
3,path cantaloupe srv c cmu edu magnesium club c...,0
4,xref cantaloupe srv c cmu edu alt atheism 5348...,2


In [58]:
# define X and Y
X = data['Data']
Y = data['sentiment']

In [59]:
X

0       path cantaloupe srv c cmu edu magnesium club c...
1       newsgroups alt atheism path cantaloupe srv c c...
2       path cantaloupe srv c cmu edu da news harvard ...
3       path cantaloupe srv c cmu edu magnesium club c...
4       xref cantaloupe srv c cmu edu alt atheism 5348...
                              ...                        
1995    xref cantaloupe srv c cmu edu talk abortion 12...
1996    xref cantaloupe srv c cmu edu talk religion mi...
1997    xref cantaloupe srv c cmu edu talk origin 4103...
1998    xref cantaloupe srv c cmu edu talk religion mi...
1999    xref cantaloupe srv c cmu edu sci skeptic 4356...
Name: Data, Length: 2000, dtype: object

In [60]:
Y

0       0
1       2
2       0
3       0
4       2
       ..
1995    2
1996    2
1997    2
1998    2
1999    0
Name: sentiment, Length: 2000, dtype: int32

In [61]:
# split the dataset 
# 75% training set, 25% testing set 
# stratify = Y will make sure that random split has same proportion of 0's, 1's, 2's in both training(Y_train) & testing set(Y_test) 

from sklearn.model_selection import train_test_split 
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,stratify=Y, test_size=0.25,random_state=42) 

In [62]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(1500,)
(500,)
(1500,)
(500,)


In [63]:
#vectorize text to numbers using func. CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words='english')

In [64]:
vec

CountVectorizer(stop_words='english')

In [65]:
X_train = vec.fit_transform(X_train).toarray() # convert X_train into numbers 

In [66]:
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [67]:
X_test = vec.transform(X_test).toarray() # convert X_test into numbers 

In [68]:
X_test

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [69]:
# Now, we will fit the Naive Bayes model to the training data
from sklearn.naive_bayes import MultinomialNB   # multinomial for multiple classes, Gaussian for binary classes
classifier = MultinomialNB()

In [70]:
classifier.fit(X_train,Y_train)  # train the data

MultinomialNB()

In [71]:
classifier.score(X_train, Y_train) # score of the model on training data

0.9433333333333334

In [72]:
# we will predict the test data 
Y_pred = classifier.predict(X_test) # predict the class of Y for the given testing data

In [73]:
Y_pred

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2, 2,
       2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 0,
       0, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2,
       2, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 0,
       2, 2, 2, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2,
       2, 2, 2, 0, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2,

In [74]:
print(list(zip(Y_test, Y_pred))) # compare actual Y with predicted Y

[(0, 2), (0, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 0), (2, 2), (2, 0), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 0), (0, 2), (2, 2), (2, 2), (2, 2), (0, 2), (2, 2), (0, 2), (2, 2), (1, 2), (2, 2), (2, 2), (2, 2), (2, 0), (2, 2), (0, 2), (0, 2), (2, 2), (0, 0), (0, 2), (2, 2), (0, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 2), (2, 2), (2, 0), (2, 2), (0, 2), (0, 2), (0, 2), (2, 2), (2, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 2), (0, 2), (2, 2), (2, 2), (0, 2), (2, 2), (2, 2), (2, 2), (0, 2), (2, 2), (2, 2), (0, 2), (0, 2), (2, 0), (2, 0), (2, 2), (2, 2), (2, 2), (0, 0), (0, 0), (2, 0), (0, 0), (2, 2), (0, 0), (2, 2), (2, 2), (0, 2), (2, 2), (0, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 2), (0, 0), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 2), (2, 2), (2, 2), (1, 2), (0, 0), (2, 2), (0, 2), (2, 2), (2, 2), (2, 2), (2, 2), (2, 2), (0, 2),

In [75]:
# model evaluation 
from sklearn.metrics import confusion_matrix, accuracy_score,classification_report # import these functions from metrics sublib.
cfm = confusion_matrix(Y_test,Y_pred)  #confusion matrix
print(cfm)

print('classification report')  # classification report
print(classification_report(Y_test,Y_pred))

acc = accuracy_score(Y_test,Y_pred)  # accuracy of the model
print('Multinomial Naive Bayes model accuracy:',acc)

[[ 59   0  90]
 [  0   0   7]
 [ 39   0 305]]
classification report
              precision    recall  f1-score   support

           0       0.60      0.40      0.48       149
           1       0.00      0.00      0.00         7
           2       0.76      0.89      0.82       344

    accuracy                           0.73       500
   macro avg       0.45      0.43      0.43       500
weighted avg       0.70      0.73      0.70       500

Multinomial Naive Bayes model accuracy: 0.728


In [76]:
# model validating 
classifier.predict(vec.transform(['because a person can see the positives']).toarray())  # we can pass any text for testing our model 

array([2])

2 means positive sentiment which is also confirmed by our dataset

In [77]:
classifier.predict(vec.transform(['negatives of that culture']).toarray()) #0 means negative sentiment which is also confirmed by our dataset 

array([0])

0 means negative sentiment which is also confirmed by our dataset

In [78]:
classifier.predict(vec.transform(['terrorism']).toarray())  #negative sentiment

array([0])

Conclusion:

Our model correctly perform the sentiment analysis to understand the general sentiment (positive, negative, neutral) expressed in these posts i.e. our model can classify the unseen blog post as positive, negative or neutral 