# Linear Discriminant Analysis
We start with a simplified LDA classifier to gain intuition.

In [2]:
import pandas as pd
from nlpia.data.loaders import get_data
pd.options.display.width = 120

sms = get_data('sms-spam')
index = ['sms{}{}'.format(i, '!'*j) for (i, j) in zip(range(len(sms)), sms.spam)]
index[:10]

['sms0',
 'sms1',
 'sms2!',
 'sms3',
 'sms4',
 'sms5!',
 'sms6',
 'sms7',
 'sms8!',
 'sms9!']

Here we build the index labels by appending '!' to the label of every message that is spam.

In [4]:
sms

Unnamed: 0,spam,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
4832,1,This is the 2nd time we have tried 2 contact u...
4833,0,Will ü b going to esplanade fr home?
4834,0,"Pity, * was in mood for that. So...any other s..."
4835,0,The guy did some bitching but I acted like i'd...


In [6]:
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms

Unnamed: 0,spam,text
sms0,0,"Go until jurong point, crazy.. Available only ..."
sms1,0,Ok lar... Joking wif u oni...
sms2!,1,Free entry in 2 a wkly comp to win FA Cup fina...
sms3,0,U dun say so early hor... U c already then say...
sms4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
sms4832!,1,This is the 2nd time we have tried 2 contact u...
sms4833,0,Will ü b going to esplanade fr home?
sms4834,0,"Pity, * was in mood for that. So...any other s..."
sms4835,0,The guy did some bitching but I acted like i'd...


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize

tfidf_model = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf_model.fit_transform(raw_documents=sms.text).toarray()

print('TF-IDF shape:', tfidf_docs.shape)
print('Spam sum:', sms.spam.sum())

TF-IDF shape: (4837, 9232)
Spam sum: 638


Naive Bayes does not work well when the vocabulary is much larger than the number of labeled samples in the dataset. Here the vocabulary has 9,232 terms but there are only 4,837 messages, 638 of them labeled spam, so we use the LDA algorithm instead.

In [10]:
mask = sms.spam.astype(bool).values
print(mask)

[False False  True ... False False False]


In [11]:
tfidf_docs[mask]

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.17598105, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.26091803, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.08705223, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.08933439, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [12]:
tfidf_docs[~mask]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [14]:
spam_centroid = tfidf_docs[mask].mean(axis=0)
ham_centroid = tfidf_docs[~mask].mean(axis=0)

spamminess_score = tfidf_docs.dot(spam_centroid - ham_centroid)
print(spamminess_score.round(2))

[-0.01 -0.02  0.04 ... -0.01 -0.    0.  ]


In [15]:
from sklearn.preprocessing import MinMaxScaler
sms['lda_score'] = MinMaxScaler().fit_transform(spamminess_score.reshape(-1, 1))
sms['lda_predict'] = (sms.lda_score > 0.5).astype(int)
sms['spam lda_predict lda_score'.split()].round(2).head(6)

Unnamed: 0,spam,lda_predict,lda_score
sms0,0,0,0.23
sms1,0,0,0.18
sms2!,1,1,0.72
sms3,0,0,0.18
sms4,0,0,0.29
sms5!,1,1,0.55


In [18]:
(1. - (sms.spam - sms.lda_predict).abs().sum() / len(sms))

0.9774653710977879

The simple LDA classifier reaches about 97.7% accuracy. Note that this is measured on the same messages that were used to compute the centroids.
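The centroid approach used in the cells above can be reproduced on a tiny example. This is a minimal sketch, assuming a made-up 6 x 3 TF-IDF matrix and labels (the `docs` and `labels` values are hypothetical, not taken from the SMS data). Like the notebook, it only projects each document onto the line between the two class centroids and leaves out the covariance term of a full LDA.

```python
import numpy as np

# Hypothetical toy TF-IDF matrix: 6 documents x 3 terms (made-up values).
docs = np.array([
    [0.9, 0.1, 0.0],   # spam
    [0.8, 0.0, 0.2],   # spam
    [0.1, 0.7, 0.3],   # ham
    [0.0, 0.9, 0.1],   # ham
    [0.2, 0.6, 0.4],   # ham
    [0.7, 0.2, 0.1],   # spam
])
labels = np.array([1, 1, 0, 0, 0, 1])
mask = labels.astype(bool)

# Average vector (centroid) of each class.
spam_centroid = docs[mask].mean(axis=0)
ham_centroid = docs[~mask].mean(axis=0)

# Project every document onto the line from the ham centroid to the spam centroid.
raw_score = docs.dot(spam_centroid - ham_centroid)

# Min-max scale the projection to [0, 1] and threshold at 0.5.
score = (raw_score - raw_score.min()) / (raw_score.max() - raw_score.min())
predict = (score > 0.5).astype(int)
accuracy = (predict == labels).mean()

print(score.round(2))
print(predict, accuracy)
```

The toy classes are well separated, so every document lands on the correct side of the threshold. Real data, as in the SMS example, will not separate this cleanly.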

In [20]:
from pugnlp.stats import Confusion
Confusion(sms['spam lda_predict'.split()])


spam \ lda_predict,0,1
0,4135,64
1,45,593


The false positives (64) and false negatives (45) are both low relative to the 4,837 messages. If they are out of balance, we can adjust the 0.5 classification threshold.
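To see how moving the threshold trades false positives against false negatives, here is a small sketch. The `scores` and `labels` arrays are hypothetical stand-ins for `sms.lda_score` and `sms.spam`, and `confusion_counts` is a helper introduced only for this illustration.

```python
import numpy as np

# Hypothetical scaled scores in [0, 1] and true labels (1 = spam).
scores = np.array([0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.40, 0.90])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])


def confusion_counts(scores, labels, threshold):
    """Return (false_positives, false_negatives) for a given threshold."""
    predict = (scores > threshold).astype(int)
    false_pos = int(((predict == 1) & (labels == 0)).sum())
    false_neg = int(((predict == 0) & (labels == 1)).sum())
    return false_pos, false_neg


# A lower threshold flags more messages as spam (more false positives),
# a higher one flags fewer (more false negatives).
for threshold in (0.35, 0.5, 0.7):
    fp, fn = confusion_counts(scores, labels, threshold)
    print(f'threshold={threshold}: false positives={fp}, false negatives={fn}')
```

Which direction to move the threshold depends on which error is more costly for the application, for example whether losing a legitimate message is worse than letting a spam message through.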

# Latent semantic analysis

In [1]:
from nlpia.book.examples.ch04_catdog_lsa_3x6x16 import word_topic_vectors


In [2]:
word_topic_vectors.T.round(1)

Unnamed: 0,cat,dog,apple,lion,nyc,love
top0,-0.6,-0.4,0.5,-0.3,0.4,-0.1
top1,-0.1,-0.3,-0.4,-0.1,0.1,0.8
top2,-0.3,0.8,-0.1,-0.5,0.0,0.1


These are the word-topic vectors produced by SVD: each row is a topic and each column holds one term's score for that topic. SVD has no notion of "cityness", but it gives both "apple" and "nyc" positive scores in `top0`, so `top0` loosely represents "cityness".
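One way to read the table is to use it to score a document: multiply each topic row by the document's term counts and sum. The sketch below assumes the rounded values printed above and a hypothetical helper `topic_vector`. It illustrates the idea and is not how `nlpia` computes its vectors.

```python
import numpy as np

terms = ['cat', 'dog', 'apple', 'lion', 'nyc', 'love']

# Rounded topic-by-term scores copied from the table above (rows: top0, top1, top2).
topic_word = np.array([
    [-0.6, -0.4,  0.5, -0.3, 0.4, -0.1],
    [-0.1, -0.3, -0.4, -0.1, 0.1,  0.8],
    [-0.3,  0.8, -0.1, -0.5, 0.0,  0.1],
])


def topic_vector(tokens):
    """Weight each topic row by the term counts of a tokenized document."""
    bow = np.array([tokens.count(term) for term in terms], dtype=float)
    return topic_word.dot(bow)


city_doc = topic_vector(['nyc', 'apple'])
pet_doc = topic_vector(['cat', 'dog'])
print(city_doc.round(1))
print(pet_doc.round(1))
```

The city-like document scores positive on `top0` while the pet-like document scores negative, which matches the reading of `top0` as a rough "cityness" axis.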

## Singular value decomposition

In [3]:
from nlpia.book.examples.ch04_catdog_lsa_sorted import lsa_models, prettify_tdm
bow_svd, tfidf_svd = lsa_models()
prettify_tdm(**bow_svd)


Unnamed: 0,cat,dog,apple,lion,nyc,love,text
0,,,1.0,,1.0,,NYC is the Big Apple.
1,,,1.0,,1.0,,NYC is known as the Big Apple.
2,,,,,1.0,1.0,I love NYC!
3,,,1.0,,1.0,,I wore a hat to the Big Apple party in NYC.
4,,,1.0,,1.0,,Come to NYC. See the Big Apple!
5,,,1.0,,,,Manhattan is called the Big Apple.
6,1.0,,,,,,New York is a big city for a small cat.
7,1.0,,,1.0,,,"The lion, a big cat, is the king of the jungle."
8,1.0,,,,,1.0,I love my pet cat.
9,,,,,1.0,1.0,I love New York City (NYC).


This is called the *document-term matrix*. Each row is a document, and each column gives the number of times a term appears in that document. An empty cell means the term does not occur.
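A document-term matrix like this can be built by hand for a fixed vocabulary. The following is a minimal sketch using three of the sentences above. The regular-expression tokenizer and the `count_terms` helper are assumptions made for illustration and are not what `prettify_tdm` does internally.

```python
import re
from collections import Counter

vocabulary = ['cat', 'dog', 'apple', 'lion', 'nyc', 'love']
docs = [
    'NYC is the Big Apple.',
    'I love NYC!',
    'The lion, a big cat, is the king of the jungle.',
]


def count_terms(text):
    """Count how often each vocabulary term occurs in a lowercased text."""
    counts = Counter(re.findall(r'[a-z]+', text.lower()))
    return [counts[term] for term in vocabulary]


# One row per document, one column per vocabulary term.
doc_term = [count_terms(text) for text in docs]

# Transposing gives one row per term and one column per document.
term_doc = [list(column) for column in zip(*doc_term)]

for row in doc_term:
    print(row)
```

The three rows agree with rows 0, 2 and 7 of the table, with zeros where the table shows empty cells.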

In [4]:
tdm = bow_svd['tdm']
tdm

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
cat,0,0,0,0,0,0,1,1,1,0,1
dog,0,0,0,0,0,0,0,0,0,0,1
apple,1,1,0,1,1,1,0,0,0,0,0
lion,0,0,0,0,0,0,0,1,0,0,0
nyc,1,1,1,1,1,0,0,0,0,1,0
love,0,0,1,0,0,0,0,0,1,1,0


This is the transpose of the *document-term matrix*, called the *term-document matrix* (TDM). SVD takes the TDM as its input and groups related terms together into topics.

SVD factors a matrix into three matrices:
$$W_{m \times n} = U_{m \times p} \, S_{p \times p} \, V^T_{p \times n}$$
Here $m$ is the number of terms, $n$ the number of documents and $p$ the number of topics.
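The factorization can be checked numerically by multiplying the three factors back together. This sketch copies the term-document counts from the table above into a plain NumPy array and passes `full_matrices=False` so that the shapes match the equation with $p = \min(m, n)$. That choice is an assumption of this example; the notebook's own call uses NumPy's default.

```python
import numpy as np

# Term-document counts copied from the table above
# (rows: cat, dog, apple, lion, nyc, love; columns: documents 0-10).
tdm = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

# Reduced SVD: U is m x p, s holds the p singular values, Vt is p x n.
U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
print(U.shape, s.shape, Vt.shape)

# Multiplying the factors back together recovers the original matrix.
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(reconstructed, tdm))
```

With the default `full_matrices=True`, `Vt` is square (n x n) and the singular values have to be placed on the diagonal of an m x n matrix of zeros before multiplying.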

In [6]:
import numpy as np
U, s, Vt = np.linalg.svd(tdm)

import pandas as pd
pd.DataFrame(U, index=tdm.index).round(2)

Unnamed: 0,0,1,2,3,4,5
cat,-0.04,0.83,-0.38,-0.0,0.11,-0.38
dog,-0.0,0.21,-0.18,-0.71,-0.39,0.52
apple,-0.62,-0.21,-0.51,0.0,0.49,0.27
lion,-0.0,0.21,-0.18,0.71,-0.39,0.52
nyc,-0.75,0.0,0.24,-0.0,-0.52,-0.32
love,-0.22,0.42,0.69,0.0,0.41,0.37


In [12]:
print(s.round(1))
S = np.zeros((len(U), len(Vt)))
np.fill_diagonal(S, s)  # use numpy directly; the pd.np alias is deprecated
pd.DataFrame(S).round(1)

[3.1 2.2 1.8 1.  0.8 0.5]


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,3.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,2.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.8,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0


In [9]:
pd.DataFrame(Vt).round(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,-0.44,-0.44,-0.31,-0.44,-0.44,-0.2,-0.01,-0.01,-0.08,-0.31,-0.01
1,-0.09,-0.09,0.19,-0.09,-0.09,-0.09,0.37,0.47,0.56,0.19,0.47
2,-0.16,-0.16,0.52,-0.16,-0.16,-0.29,-0.22,-0.32,0.17,0.52,-0.32
3,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,0.71,0.0,-0.0,-0.71
4,-0.04,-0.04,-0.14,-0.04,-0.04,0.58,0.13,-0.33,0.62,-0.14,-0.33
5,-0.09,-0.09,0.1,-0.09,-0.09,0.51,-0.73,0.27,-0.01,0.1,0.27
6,-0.57,0.21,0.11,0.33,-0.31,0.34,0.34,-0.0,-0.34,0.23,0.0
7,-0.32,0.47,0.25,-0.63,0.41,0.07,0.07,0.0,-0.07,-0.18,0.0
8,-0.5,0.29,-0.2,0.41,0.16,-0.37,-0.37,-0.0,0.37,-0.17,0.0
9,-0.15,-0.15,-0.59,-0.15,0.42,0.04,0.04,-0.0,-0.04,0.63,-0.0


# Principal component analysis