<a href="https://colab.research.google.com/github/kunal077/Natural-Language-Processing/blob/main/Natural%20Language%20Processing%20Series%20/%20TopicModelingWithSVDandNMF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In topic modelling, the goal is to find the topics that occur in a piece of text. A paragraph, for example, can contain several words or phrases that group naturally under one topic. Topic modelling is the task of discovering these groupings and understanding which topic is which.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline
np.set_printoptions(suppress = True)

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train')
print(list(newsgroups_train.target_names))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [None]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
#Strip headers, footers and quoted replies so that a classifier cannot overfit on this metadata
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [None]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [None]:
print(newsgroups_train.filenames); print()
print(newsgroups_train.target)

['/root/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38816'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/talk.religion.misc/83741'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/sci.space/61092'
 ...
 '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38737'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/53237'
 '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38269']

[1 3 2 ... 1 0 1]


In [None]:
print(np.array(newsgroups_train.target_names)[newsgroups_train.target[:10]])

['comp.graphics' 'talk.religion.misc' 'sci.space' 'alt.atheism'
 'sci.space' 'alt.atheism' 'sci.space' 'comp.graphics' 'sci.space'
 'comp.graphics']


In [None]:
print(newsgroups_train.target[:10])

[1 3 2 0 2 0 2 1 2 1]


In [None]:
#Now we set the number of topics that we want.
#There is no ground truth here, since this is purely a case of
#unsupervised learning, so we choose the number of topics ourselves.
num_topics, num_top_words = 6, 8
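As a rough sketch of how these two numbers are typically used: once a decomposition has produced a topic-by-word weight matrix, `num_top_words` can control how many of the highest-weighted words are shown for each topic. The helper name `show_topics`, the toy vocabulary and the toy weights below are all hypothetical and only illustrate the idea.

```python
import numpy as np

def show_topics(topic_word, vocab, num_top_words):
    """Return one string per topic holding its highest-weighted words."""
    vocab = np.asarray(vocab)
    top = []
    for weights in topic_word:
        # argsort is ascending, so walk backwards over the last num_top_words indices
        idx = np.argsort(weights)[:-num_top_words - 1:-1]
        top.append(" ".join(vocab[idx]))
    return top

# Hypothetical toy data: 2 topics over a 6-word vocabulary
vocab = ["space", "orbit", "god", "faith", "image", "pixel"]
topic_word = np.array([
    [0.9, 0.7, 0.0, 0.1, 0.0, 0.2],
    [0.0, 0.1, 0.8, 0.6, 0.0, 0.0],
])
topics = show_topics(topic_word, vocab, num_top_words=2)
print(topics)
```

In a real run the weight matrix would have `num_topics` rows and one column per vocabulary term.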


Stop Words


---

Some extremely common words appear to be of little value in helping select documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words.

---

The general trend in IR systems over time has been from the standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms). Web search engines now generally do not use stop lists.
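A minimal sketch of stop-word filtering, assuming a deliberately tiny, hypothetical stop list and plain whitespace tokenization (real lists, such as the scikit-learn one printed in the next cell, are much longer):

```python
# Hypothetical stop list for illustration only
TINY_STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})

def remove_stop_words(text, stop_words=TINY_STOP_WORDS):
    """Lowercase the text, split on whitespace and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

kept = remove_stop_words("The orbit of the satellite is stable")
print(kept)  # ['orbit', 'satellite', 'stable']
```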

In [None]:
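#The old sklearn.feature_extraction.stop_words module is gone from newer scikit-learn releases;
#ENGLISH_STOP_WORDS is available from sklearn.feature_extraction.text.
#The alias keeps the name stop_words that later cells rely on.
from sklearn.feature_extraction import text as stop_words
print(len(stop_words.ENGLISH_STOP_WORDS))
print(sorted(stop_words.ENGLISH_STOP_WORDS)[:100])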

318
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former']


Stemming and Lemmatization


---

Both generate the root form of a word.
Lemmatization uses rules about the language, and the resulting tokens are all actual words.

Stemming is a poor man's lemmatization: a crude heuristic that chops the ends off words, so the resulting tokens may not be actual words. Stemming is faster.
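A toy sketch may make the contrast concrete. Neither function below is the Porter algorithm or WordNet. The suffix list and the lookup table are hypothetical and exist only to show that suffix chopping can yield non-words and miss irregular forms, while a dictionary lookup returns real words.

```python
# Hypothetical suffix list and lemma table, for illustration only
SUFFIXES = ("ing", "es", "s")
LEMMAS = {"feet": "foot", "organizes": "organize", "organizing": "organize"}

def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)

words = ["feet", "organizes", "organizing"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['feet', 'organiz', 'organiz']
print(lemmas)  # ['foot', 'organize', 'organize']
```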

In [None]:
import nltk
nltk.download('wordnet')
#WordNet is a lexical database of English

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from nltk import stem
#We are using nltk because it has both a lemmatizer and a stemmer
#spaCy only has a lemmatizer

In [None]:
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [None]:
word_list = ['feet', 'foot', 'foots', 'footing']
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

In [None]:
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']

In [None]:
word_list = ['organize', 'organizes', 'organizing']
word_list2 = ['universe', 'university']
print([wnl.lemmatize(word) for word in word_list])
print([porter.stem(word) for word in word_list])
print([wnl.lemmatize(word) for word in word_list2])
print([porter.stem(word) for word in word_list2])
#Lemmatizing makes more sense than stemming here
#lemmatize() treats words as nouns by default, which is why 'organizes' is left unchanged
#Languages with richer morphology benefit more from lemmatization and stemming

['organize', 'organizes', 'organizing']
['organ', 'organ', 'organ']
['universe', 'university']
['univers', 'univers']


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
inSpacy = nlp.Defaults.stop_words - stop_words.ENGLISH_STOP_WORDS
inSklearn = stop_words.ENGLISH_STOP_WORDS - nlp.Defaults.stop_words
print(inSpacy, "\n", inSklearn)

{'‘re', '‘ll', '‘s', "'s", "'m", "'d", '’ll', 'n’t', 'regarding', 'did', 'doing', "'ll", 'just', '‘d', '’s', 'does', 'make', "'ve", '’d', 'unless', 'used', '’m', "n't", '’ve', '‘m', 'using', '’re', 'n‘t', 'say', 'really', 'ca', "'re", '‘ve', 'various', 'quite'} 
 frozenset({'fill', 'de', 'un', 'sincere', 'etc', 'hasnt', 'system', 'found', 'con', 'cry', 'co', 'describe', 'bill', 'interest', 'mill', 'eg', 'couldnt', 'thick', 'inc', 'fire', 'cant', 'amoungst', 'ie', 'detail', 'thin', 'ltd', 'find'})


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
nltk.download('punkt')

In [None]:
from nltk import word_tokenize

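class LemmaTokenizer(object):
  #Callable tokenizer: splits a document into tokens and lemmatizes each one
  def __init__(self):
    self.wnl = stem.WordNetLemmatizer()

  def __call__(self, doc):
    #The method name is lemmatize (lowercase)
    return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]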
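
A callable tokenizer like this can be handed to a vectorizer through its `tokenizer` parameter. The sketch below assumes a stand-in class, `ToyLemmaTokenizer`, with a regex tokenizer and a tiny hypothetical lemma table, so that it runs without the NLTK downloads. With the real `LemmaTokenizer` only the constructor call would change.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

class ToyLemmaTokenizer:
    """Stand-in for LemmaTokenizer: regex tokens plus a tiny hypothetical lemma table."""
    lemmas = {"feet": "foot", "orbits": "orbit"}

    def __call__(self, doc):
        # CountVectorizer lowercases the document before calling the tokenizer
        return [self.lemmas.get(t, t) for t in re.findall(r"[a-z]+", doc.lower())]

docs = ["Feet and foot", "The satellite orbits; one orbit takes hours"]
vectorizer = CountVectorizer(tokenizer=ToyLemmaTokenizer())
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix
vocab = sorted(vectorizer.vocabulary_)    # vocabulary_ maps term -> column index
print(vocab)
print(counts.toarray())
```

Because 'feet' and 'foot' map to the same lemma, they share a single column in the document-term matrix.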
