# Document Classification with Naive Bayes - Lab
https://github.com/learn-co-students/dsc-document-classification-with-naive-bayes-lab-online-ds-pt-100719/tree/solution

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm yourself.

## Objectives

In this lab, you will:

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [1]:
# Your code here
from fsds_100719.imports import *

df = pd.read_csv('SMSSpamCollection',sep='\t',names=['label','text'])
df.head()

fsds_1007219  v0.7.17 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


| Handle | Package | Description |
|---|---|---|
| dp | IPython.display | Display modules with helpful display and clearing commands. |
| fs | fsds_100719 | Custom data science bootcamp student package |
| mpl | matplotlib | Matplotlib's base OOP module with formatting artists |
| plt | matplotlib.pyplot | Matplotlib's matlab-like plotting module |
| np | numpy | scientific computing with Python |
| pd | pandas | High performance data structures and tools |
| sns | seaborn | High-level data visualization library based on matplotlib |


[i] Pandas .iplot() method activated.


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [2]:

df['label'].value_counts(normalize=False).idxmin()

'spam'

In [3]:
# Your code here
df['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

In [4]:
def undersample_df(df, col='label', random_state=42):
    """Undersample dataframe to match minority class count in col"""
    # Size of the smallest class in `col`
    minority_count = df[col].value_counts().min()

    # Sample the same number of rows from each class
    samples = []
    for grp, idx in df.groupby(col).groups.items():
        samples.append(df.loc[idx].sample(n=minority_count, random_state=random_state))

    # Concatenate once instead of growing a DataFrame inside the loop
    return pd.concat(samples, axis=0)
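
As a possible alternative to the loop above, recent pandas versions (1.1 and later) expose `sample` directly on a groupby object. The sketch below uses a small made-up dataframe (`toy`, not the SMS data) to show the idea; treat it as one way to undersample, not the lab's required approach.

```python
import pandas as pd

# Hypothetical toy data standing in for the SMS dataframe
toy = pd.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "text": [f"message {i}" for i in range(8)],
})

# Size of the smallest class
minority_count = toy["label"].value_counts().min()

# Draw the same number of rows from every class
balanced = toy.groupby("label").sample(n=minority_count, random_state=42)

print(balanced["label"].value_counts())
```

Each class ends up with `minority_count` rows, so the minority class is kept whole and the majority class is randomly reduced.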

In [5]:
df2 = undersample_df(df)
df2.head()

| | label | text |
|---|---|---|
| 3714 | ham | If i not meeting ü all rite then i'll go home ... |
| 1311 | ham | I.ll always be there, even if its just in spir... |
| 548 | ham | Sorry that took so long, omw now |
| 1324 | ham | I thk 50 shd be ok he said plus minus 10.. Did... |
| 3184 | ham | Dunno i juz askin cos i got a card got 20% off... |


In [6]:
df2['label'].value_counts()

spam    747
ham     747
Name: label, dtype: int64

In [7]:
p_classes=dict(df2['label'].value_counts(normalize=True))
p_classes

{'spam': 0.5, 'ham': 0.5}

## Train-test split

Now implement a train-test split on the dataset: 

In [8]:
# Your code here
from sklearn.model_selection import train_test_split
X = df2['text'].copy()
y = df2['label'].copy()
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=17)
train_df = pd.concat([X_train,y_train],axis=1)
test_df = pd.concat([X_test,y_test],axis=1)

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [9]:
## my way
from collections import Counter

class_word_freq = {}
for class_ in  train_df['label'].unique():
    text = train_df[train_df['label']==class_]['text']
    text = ' '.join(text)

    # split() with no argument splits on any whitespace and drops empty tokens
    class_word_freq[class_] = Counter(text.split())
class_word_freq.keys()

dict_keys(['spam', 'ham'])

In [10]:
## Solution way
class_word_freq = {} 
classes = train_df['label'].unique()
for class_ in classes:
    temp_df = train_df[train_df['label'] == class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag
class_word_freq.keys()

dict_keys(['spam', 'ham'])

## Count the total corpus words
Calculate V, the vocabulary size: the number of unique words across the whole training corpus (not the total number of word occurrences).

In [11]:
## My way
def join_freq_dicts(*ds):
    """Combines any number of frequency word dicts"""
    D = {}
    for d in ds:
        for k,v in d.items():
            cur_v = D.get(k,0)
            D[k] = cur_v+v
    return D

vocab = join_freq_dicts(*list(class_word_freq.values()))
V = len(vocab)
V

5958

In [12]:
## Solution way
vocabulary = set()
for text in train_df['text']:
    for word in text.split():
        vocabulary.add(word)
V = len(vocabulary)
V

5958
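
The difference between the vocabulary size and the total token count is easy to mix up. A minimal sketch on a made-up three-document corpus (the `docs` list is purely illustrative):

```python
from collections import Counter

# Hypothetical toy corpus
docs = ["free entry to win", "ok see you", "free free prize"]

# Count every whitespace-separated token across all documents
token_counts = Counter(word for doc in docs for word in doc.split())

V = len(token_counts)                   # number of distinct words
n_tokens = sum(token_counts.values())   # number of word occurrences

print(V, n_tokens)
```

Here `V` counts `free` once even though it occurs three times, while `n_tokens` counts every occurrence.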

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [13]:
# My way
def bag_it(doc):
    return Counter(doc.split())
bag_it(X_train.iloc[0])

Counter({'25p': 1,
         '4': 1,
         'alfie': 1,
         "Moon's": 1,
         'Children': 1,
         'in': 1,
         'need': 1,
         'song': 1,
         'on': 1,
         'ur': 2,
         'mob.': 1,
         'Tell': 1,
         'm8s.': 1,
         'Txt': 1,
         'Tone': 1,
         'charity': 2,
         'to': 1,
         '8007': 1,
         'for': 2,
         'Nokias': 1,
         'or': 1,
         'Poly': 1,
         'polys:': 1,
         'zed': 1,
         '08701417012': 1,
         'profit': 1,
         '2': 1,
         'charity.': 1})

In [14]:
## Solution way
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

In [15]:
bag_it(X_train.iloc[0])

{'25p': 1,
 '4': 1,
 'alfie': 1,
 "Moon's": 1,
 'Children': 1,
 'in': 1,
 'need': 1,
 'song': 1,
 'on': 1,
 'ur': 2,
 'mob.': 1,
 'Tell': 1,
 'm8s.': 1,
 'Txt': 1,
 'Tone': 1,
 'charity': 2,
 'to': 1,
 '8007': 1,
 'for': 2,
 'Nokias': 1,
 'or': 1,
 'Poly': 1,
 'polys:': 1,
 'zed': 1,
 '08701417012': 1,
 'profit': 1,
 '2': 1,
 'charity.': 1}

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.
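
To see why logarithms matter, consider multiplying many small probabilities. The sketch below uses 100 made-up per-word probabilities of `1e-5` each; the numbers are illustrative only.

```python
import math

# 100 hypothetical per-word probabilities
probs = [1e-5] * 100

# Multiplying directly underflows to zero in floating point
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the value representable
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The product collapses to `0.0`, so every class would tie, whereas the log sum stays a finite negative number that can still be compared across classes.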

In [16]:
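# Solution way
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    
    for class_ in class_word_freq.keys():
        # Start from the log prior of the class
        p = np.log(p_classes[class_])
        
        for word in bag.keys():
            # Numerator: the word's count in this document, plus 1
            num = bag[word]+1
            # Denominator: the word's count in the class, plus the vocabulary size.
            # Note: textbook Laplace smoothing instead uses
            # (class count + 1) / (total words in class + V)
            denom = class_word_freq[class_].get(word, 0) + V
            # Sum logs instead of multiplying probabilities
            p += np.log(num/denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    # Return the class with the largest log posterior
    return classes[np.argmax(posteriors)]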
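
The cell above puts the word's count in the *document* in the numerator and the word's count in the *class* in the denominator, so words that are frequent in a class lower that class's score. For comparison, here is a minimal sketch of the standard Laplace-smoothed estimate, P(word | class) = (count of word in class + 1) / (total words in class + V). The function name `classify_doc_laplace` and the toy dictionaries are assumptions for illustration, not part of the lab solution.

```python
import math

def classify_doc_laplace(doc, class_word_freq, p_classes, V):
    """Return the class with the highest Laplace-smoothed log posterior."""
    scores = {}
    for class_, freqs in class_word_freq.items():
        total_words = sum(freqs.values())       # all word occurrences in this class
        log_p = math.log(p_classes[class_])     # log prior
        for word in doc.split():                # repeated words contribute repeatedly
            num = freqs.get(word, 0) + 1        # smoothed count of the word in the class
            denom = total_words + V             # smoothed class size
            log_p += math.log(num / denom)
        scores[class_] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy frequencies, priors, and vocabulary size
toy_freq = {
    "spam": {"free": 3, "win": 2, "prize": 1},
    "ham": {"ok": 2, "see": 1, "you": 3},
}
toy_priors = {"spam": 0.5, "ham": 0.5}
toy_V = 6

print(classify_doc_laplace("free prize", toy_freq, toy_priors, toy_V))
print(classify_doc_laplace("see you", toy_freq, toy_priors, toy_V))
```

With this form, a word that appears often in a class raises that class's score, which is the behavior Naive Bayes relies on.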

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.
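
As one small example of the kind of preprocessing mentioned above, the sketch below lowercases tokens and strips surrounding punctuation before counting, so that `charity` and `charity.` land in the same bucket. The helper name `bag_it_normalized` is hypothetical, and real pipelines typically go further (stop words, stemming, and so on).

```python
import string
from collections import Counter

def bag_it_normalized(doc):
    """Lowercase each token, strip leading/trailing punctuation, then count."""
    tokens = (word.strip(string.punctuation).lower() for word in doc.split())
    # Drop tokens that were punctuation only and are now empty
    return Counter(token for token in tokens if token)

bag = bag_it_normalized("Tell ur m8s. Txt Tone charity to 8007 for charity.")
print(bag)
```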

In [17]:
class_word_freq

{'spam': {'25p': 6,
  '4': 71,
  'alfie': 1,
  "Moon's": 2,
  'Children': 2,
  'in': 46,
  'need': 3,
  'song': 2,
  'on': 109,
  'ur': 75,
  'mob.': 2,
  'Tell': 2,
  'm8s.': 2,
  'Txt': 45,
  'Tone': 8,
  'charity': 3,
  'to': 461,
  '8007': 15,
  'for': 138,
  'Nokias': 1,
  'or': 149,
  'Poly': 1,
  'polys:': 1,
  'zed': 3,
  '08701417012': 2,
  'profit': 2,
  '2': 116,
  'charity.': 1,
  'New': 12,
  'TEXTBUDDY': 2,
  'Chat': 6,
  'horny': 4,
  'guys': 3,
  'area': 3,
  'just': 41,
  'Free': 23,
  'receive': 23,
  'Search': 2,
  'postcode': 3,
  'at': 21,
  'gaytextbuddy.com.': 2,
  'TXT': 3,
  'ONE': 3,
  'name': 5,
  '89693': 1,
  'all': 20,
  'the': 126,
  'lastest': 1,
  'from': 87,
  'Stereophonics,': 1,
  'Marley,': 1,
  'Dizzee': 1,
  'Racal,': 1,
  'Libertines': 1,
  'and': 90,
  'The': 16,
  'Strokes!': 1,
  'Win': 9,
  'Nookii': 1,
  'games': 3,
  'with': 72,
  'Flirt!!': 1,
  'Click': 2,
  'TheMob': 1,
  'WAP': 6,
  'Bookmark': 1,
  'text': 53,
  '82468': 2,
  'Marvel':

In [18]:
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)

False    0.733036
True     0.266964
dtype: float64

In [19]:
y_hat_test= X_test.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_test== y_hat_test
residuals.value_counts(normalize=True)

False    0.708556
True     0.291444
dtype: float64
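
The `True` share in the output above is the accuracy. A minimal sketch of the same calculation in plain Python, plus a simple confusion tally, on made-up labels (`y_true` and `y_pred` here are illustrative lists, not the lab's results):

```python
from collections import Counter

# Hypothetical true labels and predictions
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion tally keyed by (true label, predicted label)
confusion = Counter(zip(y_true, y_pred))

print(accuracy)
print(confusion)
```

On a balanced two-class dataset, an accuracy well below 0.5 usually suggests the scoring is inverted rather than merely weak.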

## With `sklearn`


In [None]:
# from nltk.corpus import stopwords
# from string import punctuation
# stopwords_list = stopwords.words('english')
# stopwords_list += punctuation


# from nltk.tokenize import word_tokenize
# def clean_text(txt):
#     txt = word_tokenize(txt)
#     txt = [w.lower() for w in txt if w not in stopwords_list]
#     return ' '.join(txt)

In [None]:
df2['label'].value_counts()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = df2['text']
y = df2['label'].copy()

X_train, X_test, y_train,y_test =train_test_split(X, y,random_state=17)

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

In [None]:
# My way
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_vec,y_train)

y_hat_test = model.predict(X_test_vec)

print(metrics.classification_report(y_test,y_hat_test))
# Row-normalized confusion matrix (requires scikit-learn >= 1.0)
metrics.ConfusionMatrixDisplay.from_estimator(model,X_test_vec,y_test,cmap='Blues', normalize='true')
y_test.value_counts(normalize=True)

In [None]:
model.class_count_

In [None]:
model.classes_

In [None]:
vocab = vectorizer.vocabulary_
len(vocab)

In [None]:
model.n_features_in_

In [None]:
# Log probability of each feature given the second class in model.classes_
model.feature_log_prob_[1]

In [None]:
len(model.feature_log_prob_[1])

In [None]:
model.feature_count_ 

In [None]:
X_test_vec[1]

In [None]:
vectorizer.inverse_transform(X_test_vec[1])

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
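
One possible shape for such a class is sketched below. The class name `NaiveBayesTextClassifier`, its `fit`/`predict` methods, and the toy messages are all assumptions chosen for illustration; it uses whitespace tokenization and the standard Laplace-smoothed estimate, so it is a starting point, not a reference solution.

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens with Laplace smoothing."""

    def fit(self, docs, labels):
        # Per-class word counts
        self.word_counts_ = {}
        for doc, label in zip(docs, labels):
            self.word_counts_.setdefault(label, Counter()).update(doc.split())
        # Log priors from class frequencies
        label_counts = Counter(labels)
        n_docs = len(labels)
        self.log_priors_ = {c: math.log(n / n_docs) for c, n in label_counts.items()}
        # Vocabulary across all classes and total word count per class
        self.vocab_ = set().union(*self.word_counts_.values())
        self.totals_ = {c: sum(cnt.values()) for c, cnt in self.word_counts_.items()}
        return self

    def predict_one(self, doc):
        V = len(self.vocab_)
        scores = {}
        for c, counts in self.word_counts_.items():
            score = self.log_priors_[c]
            for word in doc.split():
                # Counter returns 0 for words unseen in this class
                score += math.log((counts[word] + 1) / (self.totals_[c] + V))
            scores[c] = score
        return max(scores, key=scores.get)

    def predict(self, docs):
        return [self.predict_one(doc) for doc in docs]

# Hypothetical toy training data
train_docs = ["free prize win", "win free cash", "see you soon", "ok see you"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
preds = clf.predict(["free cash", "see you"])
print(preds)
```

Keeping the fitted state in attributes (`word_counts_`, `log_priors_`, `vocab_`) means the same object can be trained on any list of documents and labels and then reused for prediction.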

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!