# Text Classification with Naïve Bayes and NLP Techniques

## 1. Prepare

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
df_train = pd.read_csv("Text Classification/raw_train.csv")
df_train.head()

Unnamed: 0,type,posts
0,INFP,'One stereotype I disagree with is that INFPs ...
1,INTP,'The fridge and the buzzing of my roommates ph...
2,INFP,"'The thing is, the mbti is so much more than d..."
3,INFP,'Almost never. The only other results I got ot...
4,ENFP,'She was curious of how many others didn't mat...


In [3]:
df_test = pd.read_csv("Text Classification/raw_test.csv")
df_test.head()

Unnamed: 0.1,Unnamed: 0,posts,ID
0,5443,'Captain America: ISFJ Iron Man: ENTP Thor: ES...,1
1,4886,'Is a X-Files fan. (What else is there to say?...,2
2,7127,'Thank you!|||This exactly. I think my SO is a...,3
3,3206,"'As stressful as school is, I'm happy to say t...",4
4,3528,Orthodox Iconoclast Yummy Donuts do you guys h...,5


In [4]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2169 entries, 0 to 2168
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  2169 non-null   int64 
 1   posts       2169 non-null   object
 2   ID          2169 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 51.0+ KB


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6506 entries, 0 to 6505
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    6506 non-null   object
 1   posts   6506 non-null   object
dtypes: object(2)
memory usage: 101.8+ KB


In [6]:
df_train.type.value_counts()

INFP    1374
INFJ    1078
INTP    1023
INTJ     811
ENFP     510
ENTP     499
ISTP     265
ISFP     210
ENTJ     161
ENFJ     154
ISTJ     154
ISFJ     119
ESTP      61
ESFP      32
ESTJ      30
ESFJ      25
Name: type, dtype: int64

**NOTE: Imbalanced dataset. The largest class (INFP) has 1374 posts, the smallest (ESFJ) only 25.**
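To put a number on the imbalance, here is a small sketch that uses only the counts printed above. The variable names are illustrative and not part of the notebook. It computes the accuracy a classifier would reach on the training set by always predicting the most frequent type, which is a useful floor when reading the accuracy figures later on.

```python
# Class counts copied from the value_counts() output above.
type_counts = {
    "INFP": 1374, "INFJ": 1078, "INTP": 1023, "INTJ": 811,
    "ENFP": 510, "ENTP": 499, "ISTP": 265, "ISFP": 210,
    "ENTJ": 161, "ENFJ": 154, "ISTJ": 154, "ISFJ": 119,
    "ESTP": 61, "ESFP": 32, "ESTJ": 30, "ESFJ": 25,
}

total = sum(type_counts.values())
majority_label = max(type_counts, key=type_counts.get)

# Accuracy of a "always predict the majority class" baseline on the training set.
majority_baseline = type_counts[majority_label] / total

# Ratio between the most and least frequent classes.
imbalance_ratio = max(type_counts.values()) / min(type_counts.values())

print(f"total posts: {total}")
print(f"majority class: {majority_label} ({majority_baseline:.2%})")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The baseline is measured on the training distribution only; the test set may be distributed differently.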

## 2. Data Preprocessing

### Modify dataframe

In [4]:
df_test.drop('Unnamed: 0', axis=1, inplace=True)

### Check duplicated

In [8]:
df_train.duplicated().sum()

0

In [9]:
df_test.duplicated().sum()

0

### Clean text

In [9]:
import regex
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
translator = str.maketrans('', '', string.punctuation + string.digits)

def clean_process(text):
    # Lowercasing
    document = text.lower()
    document = document.replace("’",'')
    document = regex.sub(r'\.+', ".", document)
    
    new_sentence = ''
    for sentence in sent_tokenize(document):
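        # Keep alphabetic tokens only.
        # NOTE: this filter runs before the URL and e-mail patterns below, so by
        # then links are already split into plain words and are no longer removed
        # as a whole (the sample output below still contains "http static flickr com").
        pattern = r'(?i)\b[a-z]+\b'
        sentence = ' '.join(regex.findall(pattern,sentence))
        sentence = regex.sub(r'http\S+', '', sentence)
        sentence = regex.sub(r'[A-Za-z0-9]*@[A-Za-z]*\.?[A-Za-z0-9]*', '', sentence)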

        # Tokenization
        tokens = word_tokenize(sentence)
    
        # Stopword Removal
        filtered_tokens = [token for token in tokens if token not in stop_words]

        # Stemming
        stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

        # Special Character and Number Removal
        cleaned_tokens = [token.translate(translator) for token in stemmed_tokens]
        
        ## Lemmatization
        lemmatized_tokens = [lemmatizer.lemmatize(token) for token in cleaned_tokens]

        ## join to make a sentence
        sentence = ' '.join(lemmatized_tokens)

        ## append
        new_sentence = new_sentence + sentence + '. ' 
        
    document = new_sentence
    document = regex.sub(r'\s+', ' ', document).strip()
    return document
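In `clean_process` the alphabetic filter is applied before the URL and e-mail patterns, so link fragments survive as ordinary words. A minimal sketch of the alternative ordering follows, assuming the goal is to drop links and addresses entirely. It uses the standard-library `re` module, and the helper names and patterns are illustrative rather than the notebook's own.

```python
import re

URL_PATTERN = re.compile(r'http\S+')
EMAIL_PATTERN = re.compile(r'\S+@\S+')
WORD_PATTERN = re.compile(r'\b[a-z]+\b')

def letters_only(text):
    """Alphabetic filter alone: URLs get split into word fragments."""
    return WORD_PATTERN.findall(text.lower())

def strip_links_then_letters(text):
    """Remove URLs and e-mail addresses first, then keep alphabetic tokens."""
    text = text.lower()
    text = URL_PATTERN.sub(' ', text)
    text = EMAIL_PATTERN.sub(' ', text)
    return WORD_PATTERN.findall(text)

sample = "I like weird http://farm6.static.flickr.com/x.jpg ok"
print(letters_only(sample))
print(strip_links_then_letters(sample))
```

Switching the notebook to this ordering would change the vocabulary, so the scores reported further down would need to be recomputed.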

In [10]:
s = '''
	posts
1	'One stereotype I disagree with is that INFPs all have this one giant Cause. I don't have A Cause: I have several smaller causes. I've actual felt at one time that there might be something wrong...|||Inspired by DiaphinisedBat's first image, some chiaroscuro:  http://farm6.static.flickr.com/5131/5470706131_8fc2d0646f.jpg ...|||GROWNUPS http://imgs.xkcd.com/comics/grownups.png  MYSPACE http://imgs.xkcd.com/comics/join_myspace.png  DELICIOUS http://imgs.xkcd.com/comics/delicious.png  DREAMS|||I present: The Tiniest Snorfer! :happy: :happy: :happy:     http://farm6.static.flickr.com/5129/5304355986_78e4751dba.jpg   http://farm5.static.flickr.com/4027/5145776103_6296c27044_z.jpg|||http://point001percent.files.wordpress.com/2009/05/steven-klein-3.jpg?w=500&h=323      I like weird.|||INTx: That doesn't make any sense. That kitten has the body of various foods. INFP: Yum..pop tarts!!!|||I convinced myself too. :D|||Some pet peeves:  1. People who think that since they're in a parking lot, common driving laws no longer apply.  2. Cats. I'm an animal lover. I try to love the little feline devils, but they're...|||...when you accidentally forget to recharge your cellphone.|||Where's my keys?|||I really really liked LOST. I cried for 2 hours straight during the series finale. LOST fueled my imagination and my emotions. It helped me understand other people even better.  And Hurley was...|||That page isn't the the best source, imho. A lot of that list seems weaknesses of the human race, not just INFPs and INFJs. I wouldn't use it myself. Why are you reconsidering your type, if I may ask?|||I think NFs fall in love easily. Lots of unrequited love throughout life. Sucks muchly. :sigh:|||Aw, man, you've got it bad, baby! :D My mother is a T, but she would never say something like that, and I know she can appreciate the beauty of a song without having to analyze it.|||Dear Powers That Be,  I demand more ENFJ women. Thank you in advance.  
Sincerely, INFP male|||When I'm in stream of consciousness mode, I can be really eccentric, that's for sure. But I rarely let that side show to people I don't know well. So I guess I don't know why. Maybe because I talk...|||20210  here's mine.|||But you can't quantify the human condition. Think of it this way... You buy a copy of the sheet music for Moonlight Sonata. You have an idea to explain it to the world, so that the world is filled...|||Another Simmer! yays for us!|||I write (and rewrite) a lot of character sketches. I've something really amazing floating around in my mental ionosphere, but I need to flesh the characters out a lot. It's set in multiple time...|||There are some things that cannot be said, they must be sung, because if said the magic is lost.  I hope this helps some. :shrug:|||http://www.youtube.com/watch?v=5AhU12zC8fc  Just browsing YouTube and found this gem. :happy:  Love the video too, INFP male that I am, also highly symbolic, love it, love it, love it.|||I also couldn't get very far into the description. Come on, man, throw in a purple monkey or something every once in while so I don't fall asleep! I'm just glad people quoted the article so I knew...|||@ sparkle: wondering when someone would put some Natalie Merchant! About time!   Here's another Tori Amos. A direct quote from Tori herself: This song describes the irreversible damage and pain...|||Lot's of good stuff in this thread! Like double-stuffed oreos! Chocolate-covered double-stuffed oreos! :nom nom nom:|||...when you want to be noticed, but don't want to notice that you're being noticed.|||I think my brain just flipped upside down.|||I'm not a part of your system!|||And also there should be clowns and pandas with umbrellas. Why? Because I like clowns and pandas. That's why. (The umbrellas are there only because I suspect things will get messy.)|||Not grape vinaigrette: raspberry vinaigrette.|||The only thing is that I can't decide what kind of juice. And am I limited to fruit? 
And if not, wouldn't soup be acceptable as well? And if soup, why not milk-based soups. Then again, fruit juice is...|||INFPs, they're too moody, lol, j/k.|||J is work first, play later. P is play first, work later. A better description: if you put off making decisions till the last possible moment, you're a P.   For example, you go to the bookstore to...|||Paaaaaaaaaaaaaaaaartyyyyyyyyyyyyyyyyyy!!!!!!!!!!!!!!! Woooooooooooooooooooo!  (wait...maybe that should be in all caps?)|||I'm a 2w1, 7w6, 9w1. So basically I'll be single forever. :sigh: Coffee helps.|||I think we just seem unhappy because we dwell on our feelings so much. Maybe the sense of unhappiness comes from looking at ourselves and not being the ideal we want for ourselves? Feeling unhappy...|||We're like dew on flower petals: we're there, but you have to get close to see our true beauty.  We like doing things by ourselves a lot, so we can be hard to find. We'll be anywhere we can find...|||I think he's a P. The two types that I think are most like him are ESFP and ENFP, and possibly ESTP. I don't think I'm ever going to know for sure, but I'm narrowing it down, at least.   Thanks for...|||Nice.   I'm new here too. :)|||That sounds like something he'd say.|||About 15 years ago, I worked at a seasonal job at a hotel about 200 miles from all my friends and family. I had no phone (so communication was solely through letter writing), and I lived where the...|||I need some advice...  I get this a lot when I finally pester a woman enough to get her to go out with me at least once: You're actually a normal person.|||Try to identify what areas of your life are causing you the most anxiety and then set concrete goals to remedy that anxiety. I know exactly what you're feeling, and this is what I do. Having an ideal...|||Ah, man, now I want to swim in a pool of juice!|||...when you don't understand why you don't understand why you don't understand.  ...when you play video games for the storyline. 
(me)  ...when you don't know what love is but you do know what is...|||What! no Tori Amos?! Or do I have the wrong idea of INFP? Or not enough have heard of her? Here's one of my favorite songs of hers (a very INFP song, too).    Silent All These Years  Excuse...|||Hi, everyone. Just saying hi, new here, etc. :tongue: Interesting place you have here, think I'll pull up a chair. Ok, start talking. I'll just interject a wise and occasionally witty comment...'
'''
clean_process(s)

'post one stereotyp disagre infp one giant caus. caus sever smaller caus. actual felt one time might someth wrong inspir diaphinisedbat first imag chiaroscuro http static flickr com jpg grownup http img xkcd com comic grownup png myspac http img xkcd com comic png delici http img xkcd com comic delici png dream present tiniest snorfer. happi happi happi http static flickr com jpg http static flickr com jpg http file wordpress com steven klein jpg w h like weird intx make sen. kitten bodi variou food. infp yum pop tart. convinc. pet peev peopl think sinc park lot common drive law longer appli. cat. anim lover. tri love littl felin devil accident forget recharg cellphon key realli realli like lost. cri hour straight seri final. lost fuel imagin emot. help understand peopl even better. hurley page best sourc imho. lot list seem weak human race infp infj. use. reconsid type may ask think nf fall love easili. lot unrequit love throughout life. suck muchli. sigh aw man got bad babi. mother w

In [11]:
df_train['cleaned'] = df_train.posts.apply(lambda x: clean_process(x))
df_test['cleaned'] = df_test.posts.apply(lambda x: clean_process(x))

## 3. Train test split

In [12]:
X_train = df_train['cleaned']
y_train = df_train['type']

In [13]:
X_test = df_test['cleaned']

In [14]:
print(X_train.shape)
print(X_test.shape)

(6506,)
(2169,)


## 4. Data Transformation

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
vectorizer = CountVectorizer()

In [17]:
X_train_T = vectorizer.fit_transform(X_train).toarray()
X_test_T = vectorizer.transform(X_test).toarray()

In [19]:
X_train_T[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [20]:
X_test_T[:5]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
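The split between `fit_transform` on the training text and `transform` on the test text matters: the vocabulary is learned from the training set only, and words seen only in the test set are ignored. The following pure-Python sketch illustrates that idea. It is not scikit-learn's implementation (it splits on whitespace instead of using `CountVectorizer`'s token pattern), and the function names are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Turn documents into count vectors; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocab)
        row = [0] * len(vocab)
        for tok, n in counts.items():
            row[vocab[tok]] = n
        rows.append(row)
    return rows

train_docs = ["love cat love", "cat dog"]
test_docs = ["dog dog bird"]  # "bird" never appears in training

vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
test_matrix = transform(test_docs, vocab)

print(vocab)
print(train_matrix)
print(test_matrix)
```

Most entries are zero even in this toy case, which is why the slices printed above show only zeros.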

## 5. Build Model

In [18]:
from sklearn.naive_bayes import MultinomialNB

In [19]:
model = MultinomialNB()

In [20]:
# Training phase
model.fit(X_train_T, y_train)
model.score(X_train_T, y_train) # accuracy of model on the training set

0.6847525361205041

In [21]:
# Testing phase
preds = model.predict(X_test_T)
preds

array(['INFJ', 'INTJ', 'INFP', ..., 'INFP', 'INFP', 'INFP'], dtype='<U4')
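For intuition about what the classifier does with the count vectors, here is a from-scratch sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, the same default value `MultinomialNB` uses). It is a simplified illustration on toy data, not scikit-learn's code, and the function names and example documents are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(labels)
    log_prior = {c: math.log(k / n_docs) for c, k in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: all word occurrences in class c plus alpha per vocabulary word.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(doc, log_prior, log_likelihood, vocab):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in vocab
        )
    return max(scores, key=scores.get)

toy_docs = ["idea theory logic", "feel heart love",
            "theory logic system", "love feel dream"]
toy_labels = ["INTP", "INFP", "INTP", "INFP"]

log_prior, log_likelihood, vocab = train_nb(toy_docs, toy_labels)
print(predict_nb("logic theory", log_prior, log_likelihood, vocab))
print(predict_nb("love dream", log_prior, log_likelihood, vocab))
```

Because the log prior is part of every score, a heavily imbalanced training set pushes predictions toward the frequent classes, which is consistent with the many `INFP` labels in the output above.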

## 6. Evaluate Model

### Load label ground truth

In [22]:
df_solution = pd.read_csv("Text Classification/solution.csv")
df_solution.head()

Unnamed: 0,Id,Category
0,1,INFJ
1,2,INTJ
2,3,ENTJ
3,4,ISFP
4,5,ENTP


In [23]:
df_test2 = df_test.copy()
df_test2.rename({"ID":"Id"}, axis="columns", inplace=True)
df_test2 = pd.merge(df_test2, df_solution, on="Id")
y_test = df_test2['Category']
y_test

0       INFJ
1       INTJ
2       ENTJ
3       ISFP
4       ENTP
        ... 
2164    INFJ
2165    INTJ
2166    INFJ
2167    ENFP
2168    INFP
Name: Category, Length: 2169, dtype: object

### Evaluate Model

In [24]:
from sklearn.metrics import classification_report

In [25]:
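# NOTE: classification_report expects (y_true, y_pred). With the arguments in
# this order, the precision and recall columns below are swapped and "support"
# counts predictions per class rather than true labels; accuracy is unaffected.
print(classification_report(preds, y_test, digits=4))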

              precision    recall  f1-score   support

        ENFJ     0.0000    0.0000    0.0000         0
        ENFP     0.0545    0.5000    0.0984        18
        ENTJ     0.0000    0.0000    0.0000         0
        ENTP     0.0484    0.4737    0.0878        19
        ESFJ     0.0000    0.0000    0.0000         0
        ESFP     0.0000    0.0000    0.0000         0
        ESTJ     0.0000    0.0000    0.0000         0
        ESTP     0.0000    0.0000    0.0000         0
        INFJ     0.5689    0.4313    0.4906       517
        INFP     0.8253    0.3743    0.5150      1010
        INTJ     0.3107    0.6259    0.4153       139
        INTP     0.6548    0.3966    0.4940       464
        ISFJ     0.0000    0.0000    0.0000         0
        ISFP     0.0000    0.0000    0.0000         1
        ISTJ     0.0000    0.0000    0.0000         0
        ISTP     0.0139    1.0000    0.0274         1

    accuracy                         0.4108      2169
   macro avg     0.1548   

**NOTE: Train accuracy: 68.48%, Test accuracy: 41.08%. The gap suggests the model overfits the training set.**
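The report above was produced with `classification_report(preds, y_test, ...)`, while scikit-learn's signature is `(y_true, y_pred)`. The sketch below shows on toy labels what passing the arguments in the other order does. The helper `precision_recall` and the label lists are illustrative and not taken from the notebook.

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall computed from paired label lists."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    n_predicted = sum(p == label for p in y_pred)
    n_actual = sum(t == label for t in y_true)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall

y_true = ["INFP", "INFP", "INTJ", "ENTP", "INTJ", "INFP"]
y_pred = ["INFP", "INFP", "INFP", "INFP", "INTJ", "INFP"]

correct_order = precision_recall(y_true, y_pred, "INFP")
swapped_order = precision_recall(y_pred, y_true, "INFP")

print("correct (precision, recall):", correct_order)
print("swapped (precision, recall):", swapped_order)
```

Swapping the arguments exchanges the two metrics. Read that way, the `INFP` row above (0.8253, 0.3743, support 1010) says the model labelled 1010 test posts as INFP, recovered about 83% of the true INFP posts, and was right on about 37% of its INFP predictions. Classes with support 0 are ones the model never predicted.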