In [2]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

In [3]:
%matplotlib inline
np.set_printoptions(suppress=True)

In [4]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [5]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [6]:
# looking at some of the data
print("\n".join(newsgroups_train.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

In [7]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space'], dtype='<U18')

In [8]:
newsgroups_train.target[:10]

array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])

In [9]:
num_topics, num_top_words = 6, 8

# Stop words, stemming, lemmatization
## Stop words

Some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.

The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.

In [14]:
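# The old sklearn.feature_extraction.stop_words module is gone in recent scikit-learn releases;
# the same frozenset is importable from sklearn.feature_extraction.text.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
sorted(ENGLISH_STOP_WORDS)[:20]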

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']

## There is no single universal list of stop words!
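
To make the idea concrete, here is a minimal sketch of stop-word filtering in plain Python. The stop list `TOY_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names made up for this illustration; a real list, such as the scikit-learn one shown above, is much longer, and real tokenization is more careful than splitting on whitespace.

```python
# Hypothetical, tiny stop list for illustration only; real lists are much longer.
TOY_STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "to", "in", "and"})


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop tokens in stop_words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stop_words]


filtered = remove_stop_words("The orbit of the shuttle is stable")
print(filtered)  # ['orbit', 'shuttle', 'stable']
```

Passing a different `stop_words` set changes which tokens survive, which is exactly why the choice of list matters.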


# Stemming and Lemmatization
From the Information Retrieval textbook:

Are the words below the same?

organize, organizes, and organizing

democracy, democratic, and democratization

Stemming and lemmatization both reduce a word to a root form.

Lemmatization uses the rules of a language. The resulting tokens are all actual words.

"Stemming is the poor-man’s lemmatization." (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off of words. The resulting tokens may not be actual words. Stemming is faster.

In [15]:
import nltk 
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tomasresendiz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
from nltk import stem

In [17]:
wnl = stem.WordNetLemmatizer()
porter = stem.porter.PorterStemmer()

In [18]:
word_list = ['feet', 'foot', 'foots', 'footing']

In [19]:
[wnl.lemmatize(word) for word in word_list]

['foot', 'foot', 'foot', 'footing']

In [20]:
[porter.stem(word) for word in word_list]

['feet', 'foot', 'foot', 'foot']
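
The Porter output above is what you would expect from suffix stripping: 'foots' and 'footing' lose their endings, while 'feet' has no suffix to strip and is left alone. A toy sketch of that idea follows. It is not the Porter algorithm, which applies many more rules in several passes; `TOY_SUFFIXES` and `toy_stem` are hypothetical names used only here.

```python
# Toy suffix list, ordered so that longer suffixes are tried before "s".
TOY_SUFFIXES = ("ing", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["feet", "foot", "foots", "footing"]])
# ['feet', 'foot', 'foot', 'foot']
print([toy_stem(w) for w in ["organize", "organizes", "organizing"]])
# ['organize', 'organiz', 'organiz']
```

The second result shows the crude side of the heuristic: 'organiz' is not a word, and 'organize' ends up with a different stem than its inflected forms.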

In [22]:
fly_list = ['fly', 'flies', 'flying']
org_list = ['organize', 'organizes', 'organizing']
uni_list = ['universe', 'university']

In [27]:
print("Original List: ", fly_list)
print("Lemmatized: ", [wnl.lemmatize(word) for word in fly_list])
print("Porter: ", [porter.stem(word) for word in fly_list])

Original List:  ['fly', 'flies', 'flying']
Lemmatized:  ['fly', 'fly', 'flying']
Porter:  ['fli', 'fli', 'fli']


In [29]:
print("Original List: ", org_list)
print("Lemmatized: ", [wnl.lemmatize(word) for word in org_list])
print("Porter: ", [porter.stem(word) for word in org_list])

Original List:  ['organize', 'organizes', 'organizing']
Lemmatized:  ['organize', 'organizes', 'organizing']
Porter:  ['organ', 'organ', 'organ']


In [28]:
print("Original List: ", uni_list)
print("Lemmatized: ", [wnl.lemmatize(word) for word in uni_list])
print("Porter: ", [porter.stem(word) for word in uni_list])

Original List:  ['universe', 'university']
Lemmatized:  ['universe', 'university']
Porter:  ['univers', 'univers']
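
In the outputs above, the WordNet lemmatizer leaves 'flying', 'organizes', and 'organizing' unchanged. A likely reason is that NLTK's `WordNetLemmatizer.lemmatize` takes a `pos` argument that defaults to `'n'` (noun), so verb forms are looked up as nouns. The sketch below imitates that behavior with a small hand-written table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical and stand in for a real dictionary such as WordNet.

```python
# Hypothetical lemma table keyed by (word, part of speech): "n" = noun, "v" = verb.
TOY_LEMMAS = {
    ("feet", "n"): "foot",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("flying", "v"): "fly",
    ("organizes", "v"): "organize",
    ("organizing", "v"): "organize",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)


words = ["flies", "flying", "organizes"]
print([toy_lemmatize(w) for w in words])           # looked up as nouns
# ['fly', 'flying', 'organizes']
print([toy_lemmatize(w, pos="v") for w in words])  # looked up as verbs
# ['fly', 'fly', 'organize']
```

With the real lemmatizer, the analogous experiment would be to pass `pos='v'` for words known to be verbs.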


Stemming and lemmatization are language dependent. Languages with more complex morphologies may show bigger benefits. For example, Sanskrit has a very large number of verb forms.



In [32]:
import spacy

In [39]:
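# Note: this is the spaCy v2 API; the spacy.lemmatizer module is not available in spaCy v3.
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
# An empty Lookups object holds no lemma tables, so lookup() returns each word unchanged.
lookups = Lookups()
lemmatizer = Lemmatizer(lookups)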

In [43]:
print("Original List: ", word_list)
print("Spacy Lemm: ", [lemmatizer.lookup(word) for word in word_list])

Original List:  ['feet', 'foot', 'foots', 'footing']
Spacy Lemm:  ['feet', 'foot', 'foots', 'footing']


spaCy doesn't offer a stemmer (since lemmatization is considered better).

Stop words vary from library to library.
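
One way to see how two stop lists differ is to compare them as sets. The sketch below uses two short, made-up lists (`list_a` and `list_b`) as stand-ins for the lists shipped by different libraries; `compare_stop_lists` is a hypothetical helper, and the same function would work on real lists once they are loaded as sets.

```python
# Two hypothetical stop lists standing in for lists from different libraries.
list_a = {"a", "about", "the", "is", "amongst"}
list_b = {"a", "the", "is", "was", "i"}


def compare_stop_lists(a, b):
    """Summarize which words two stop lists share and which are unique to each."""
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


summary = compare_stop_lists(list_a, list_b)
print(summary["shared"])  # ['a', 'is', 'the']
print(summary["only_a"])  # ['about', 'amongst']
print(summary["only_b"])  # ['i', 'was']
```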