# Unsupervised Learning
## Dataset - 20 Newsgroups

### About the Dataset

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The collection has become a popular data set for experiments in text applications of machine learning, such as text classification and text clustering.



### Importing Required Libraries

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

## K-means
K-means clustering aims to partition n observations into k clusters so that each observation belongs to the cluster with the nearest mean (the centroid), which serves as the prototype of that cluster.
[View the animation at full size](http://shabal.in/visuals/kmeans/random.gif)
![k-means animation](http://shabal.in/visuals/kmeans/random.gif)
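
For reference, the same idea can be written as an objective. This is the standard within-cluster sum of squares; the symbols $C_j$ (the set of points in cluster $j$) and $\mu_j$ (its mean) are notation introduced here, not taken from the text above.

$$
\underset{C_1,\dots,C_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
$$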

K-means algorithm:

1. Choose a value for K
2. Randomly select K data points (feature vectors) as the initial centroids
3. Calculate the distance from every other data point to each centroid
4. Assign each data point to the cluster of its closest centroid
5. Take the mean of the data points in each cluster and make that mean the new centroid
6. Repeat steps 3-5 until convergence (the centroids no longer move)
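
The steps above can be sketched in plain Python. This is a minimal illustration, not the implementation scikit-learn uses: the function names `squared_distance` and `kmeans`, the toy `points`, and the `seed` parameter are all assumptions made for the example, and convergence is detected by checking that cluster assignments stop changing (which implies the centroids stop moving).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples (illustrative sketch).

    Returns (centroids, labels).
    """
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3-4: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once the assignments (and so the centroids) are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 5: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

On data this cleanly separated, the first three points should share one label and the last three the other; which group is labelled 0 depends on the random initialisation.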

In [2]:

newsgroups_train = fetch_20newsgroups(subset='train')


In [3]:
# Vectorize the documents into TF-IDF features
vectorizer = TfidfVectorizer(max_df=0.5,
                             min_df=2,
                             stop_words='english')
X = vectorizer.fit_transform(newsgroups_train.data)



max_df : float in range [0.0, 1.0] or int (default=1.0)

When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold (corpus-specific stop words). A float is interpreted as a proportion of documents; an integer as an absolute count. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int (default=1)

When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. This value is also called the cut-off in the literature. A float is interpreted as a proportion of documents; an integer as an absolute count. This parameter is ignored if vocabulary is not None.

With the settings used here (max_df=0.5, min_df=2), terms that appear in more than half of the documents, or in fewer than two documents, are dropped from the vocabulary.
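
To make the two thresholds concrete, here is a small standard-library sketch of the filtering rule described above. It is not scikit-learn's code: `filter_vocabulary` and the toy `docs` are hypothetical, and tokens are split on whitespace, which is simpler than the vectorizer's default tokenization.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Return the sorted terms kept under the min_df / max_df thresholds.

    Integers are absolute document counts; floats are proportions of
    the number of documents.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    # Drop terms strictly below `low` or strictly above `high`.
    return sorted(t for t, count in df.items() if low <= count <= high)


# Hypothetical four-document corpus.
docs = [
    "the drive controller failed",
    "the scsi drive is new",
    "the shuttle reached orbit",
    "the orbit of the moon",
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.5)
print(kept)  # ['drive', 'orbit']
```

Here "the" appears in all four documents (above the 50% cap) and is dropped, as are the terms that appear in only one document; only "drive" and "orbit", each present in exactly two documents, survive.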



## Clustering

In [4]:
km = KMeans(n_clusters=20, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

KMeans(max_iter=100, n_clusters=20, n_init=1)

## Checking your output clusters

In [5]:
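# For each centroid, sort term indices by weight in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead
terms = vectorizer.get_feature_names_out()

k = 20 # number of clusters to print (at most n_clusters)
for i in range(k):
    print("cluster %d:" % (i+1))
    # Print the 20 highest-weighted terms of this cluster's centroid
    for ind in order_centroids[i, :20]:
        print('%s' % terms[ind])
    print()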

cluster 1:
com
university
posting
host
nntp
thanks
ca
distribution
article
know
mail
computer
like
does
new
cs
uk
reply
sale
just

cluster 2:
drive
scsi
ide
controller
drives
hard
disk
floppy
bus
hd
mac
problem
pc
boot
tape
cd
isa
rom
computer
problems

cluster 3:
key
clipper
encryption
chip
keys
escrow
government
com
crypto
intercon
algorithm
nsa
des
secure
security
amanda
public
secret
privacy
phone

cluster 4:
israel
israeli
jews
arab
arabs
lebanese
adam
israelis
policy
cpr
lebanon
jewish
peace
palestinian
palestinians
hernlem
igc
gaza
org
apc

cluster 5:
gun
people
com
turkish
guns
keith
government
armenian
caltech
armenians
don
livesey
stratus
weapons
sgi
armenia
think
right
article
like

cluster 6:
__
___
_____
baalke
____
jpl
petch
grass
kelvin
valley
_______
______
gov
nasa
ac
ca
bnr
propulsion
uk
nick

cluster 7:
nasa
space
gov
henry
alaska
toronto
moon
ax
zoo
launch
spencer
orbit
lunar
article
shuttle
larc
earth
aurora
nsmca
zoology

cluster 8:
cwru
cleveland
freenet
reserve


## Trying the same with CountVectorizer

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
# Stop words are not removed here; consider CountVectorizer(stop_words='english') before running this
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(newsgroups_train.data)

In [7]:
km = KMeans(n_clusters=20, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

KMeans(max_iter=100, n_clusters=20, n_init=1)

In [8]:
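# For each centroid, sort term indices by count in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead
terms = vectorizer.get_feature_names_out()

k = 8 # number of clusters to print (at most n_clusters)
for i in range(k):
    print("cluster %d:" % (i+1))
    # Print the 20 most frequent terms of this cluster's centroid
    for ind in order_centroids[i, :20]:
        print('%s' % terms[ind])
    print()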

cluster 1:
the
to
of
and
in
is
that
it
you
for
edu
from
on
this
be
have
not
are
with
com

cluster 2:
ax
max
75u
pl
b8f
2j
7u
45
giz
1t
75
bhj
a86
0t
6ei
145
g9v
au
76
as

cluster 3:
ax
max
b8f
a86
145
0t
1t
g9v
pl
giz
1d9
bhj
bxn
3t
75u
2di
34u
7ey
wm
6ei

cluster 4:
the
of
to
and
in
is
that
it
for
on
with
this
you
are
be
not
was
as
by
they

cluster 5:
ax
max
a86
b8f
pl
1t
as
qq
bhj
qax
bj
giz
0q
gk
1d9
wwiz
i4
7ey
6ei
1f

cluster 6:
ax
max
g9v
b8f
a86
1d9
75u
bhj
pl
2di
mg9v
giz
1t
145
7ey
0d
gk
b4q
b8e
3t

cluster 7:
the
to
and
of
is
in
for
on
it
are
from
you
by
with
that
or
be
this
file
can

cluster 8:
the
to
of
and
in
that
is
it
you
for
not
be
are
this
have
as
on
with
from
but

