<b>Topic Extraction</b> can be done using <i>non-negative matrix factorization (NMF)</i> or <i>latent semantic
analysis (LSA)</i>, which is based on <b>Singular Value Decomposition (SVD)</b>.<br>
These are decomposition techniques that reduce the data to a given number of components.
<p></p>
Let's apply <i>TfidfVectorizer</i> to the IMDB dataset.

In [5]:
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = pd.read_csv("input/imdb.csv", nrows=10000)
corpus = corpus.review.values

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)

tfv.fit(corpus)

corpus_transformed = tfv.transform(corpus)

svd = decomposition.TruncatedSVD(n_components=10)

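# fit() returns the fitted estimator itself, so corpus_svd is the SVD model,
# not the transformed corpus
corpus_svd = svd.fit(corpus_transformed)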

sample_index = 0
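# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
feature_scores = dict(
    zip(
        tfv.get_feature_names_out(),
        corpus_svd.components_[sample_index]
    )
)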

N = 5
print(sorted(feature_scores, key=feature_scores.get, reverse=True)[:N])

['the', ',', '.', 'a', 'and']


In [6]:
N = 5
for sample_index in range(5):
    feature_scores = dict(
        zip(
            tfv.get_feature_names_out(),
            corpus_svd.components_[sample_index]
        )
    )
    print(
        sorted(
            feature_scores,
            key=feature_scores.get,
            reverse=True
        )[:N]
    )

['the', ',', '.', 'a', 'and']
['br', '<', '>', '/', '-']
['i', 'movie', '!', 'it', 'was']
[',', '!', "''", '``', 'you']
['!', 'the', "''", '``', '...']


As you can see, these topics don't make any sense: they are dominated by punctuation, stopwords, and HTML tag fragments. Let's clean the text and see whether the results make more sense after that.

In [7]:
import re
import string

def clean_text(s):
    # collapse runs of whitespace into single spaces
    s = s.split()
    s = " ".join(s)
    # remove all ASCII punctuation characters
    s = re.sub(f'[{re.escape(string.punctuation)}]', '', s)
    return s
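
One thing this function does not handle is HTML markup: removing punctuation turns `<br />` into the token `br`, which is one of the noisy tokens seen in the output above. A minimal sketch of a variant that drops tag-like text first is shown below. The name `clean_text_strip_tags`, the regex, and the sample string are assumptions made for illustration, not part of the original notebook.

```python
import re
import string

def clean_text(s):
    # same behavior as the notebook cell: normalize whitespace, drop punctuation
    s = " ".join(s.split())
    return re.sub(f"[{re.escape(string.punctuation)}]", "", s)

def clean_text_strip_tags(s):
    # hypothetical variant: replace anything that looks like an HTML tag
    # with a space before the usual cleaning
    s = re.sub(r"<[^>]+>", " ", s)
    return clean_text(s)

sample = "Great   movie!<br /><br />Loved it."   # made-up review text
print(clean_text(sample))             # Great moviebr br Loved it
print(clean_text_strip_tags(sample))  # Great movie Loved it
```

The simple regex is only a rough heuristic; text that contains a literal `<` followed later by `>` would also be removed.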

In [8]:
import pandas as pd
from nltk.tokenize import word_tokenize
from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = pd.read_csv("input/imdb.csv", nrows=10000)
corpus.loc[:,"review"] = corpus.review.apply(clean_text)

tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None)

tfv.fit(corpus)

corpus_transformed = tfv.transform(corpus)

svd = decomposition.TruncatedSVD(n_components=10)

corpus_svd = svd.fit(corpus_transformed)

sample_index = 0
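# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement
feature_scores = dict(
    zip(
        tfv.get_feature_names_out(),
        corpus_svd.components_[sample_index]
    )
)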

N = 5
print(sorted(feature_scores, key=feature_scores.get, reverse=True)[:N])

ValueError: n_components must be < n_features; got 10 >= 2