CLUSTERING PROBLEMS
Types 
	- K-Means Clustering
	- Hierarchical Clustering
	- Density-based Clustering
	- Distribution-based Clustering

Clustering has no pre-defined categories
Unlike Classification, which assigns items to pre-defined categories
Clustering is a form of unsupervised learning (no labelled training data is needed)
  
What: Group items based on measured similarity
	- Maximise Intragroup Similarity
	- Minimise Intergroup Similarity 

How: Every user can be represented by a set of features
	- Age
	- Location
	- Frequency of usage for each topic
Each user can then be represented by a point in N-Dimensional space
Similarity is then represented by the distance between users
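A minimal sketch of the idea above, assuming three made-up users described by (age, sports reads per week, tech reads per week). The names and numbers are hypothetical, and Euclidean distance is only one possible choice of distance measure.

```python
import math

# Hypothetical users: (age, sports reads per week, tech reads per week)
users = {
    "alice": (25, 10, 1),
    "bob": (27, 9, 2),
    "carol": (60, 0, 12),
}

def distance(a, b):
    """Euclidean distance between two feature tuples of equal length."""
    return math.dist(a, b)

d_ab = distance(users["alice"], users["bob"])
d_ac = distance(users["alice"], users["carol"])

# A smaller distance means the two users are more similar
print("alice-bob:", round(d_ab, 2))
print("alice-carol:", round(d_ac, 2))
```

Note that features on different scales (age in years vs. reads per week) contribute unequally to the distance, so in practice features are usually rescaled first.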

Large Dataset -> Features represent datapoints numerically -> Clustering algorithm
Clustering and Classification can work together
Clustering can provide training data for a Classifier 

CLUSTERING USING K-MEANS ALGORITHM 
(UNSUPERVISED)

Example: Document Clustering around themes 
Represent Text using 'Term Frequency Representation'

Term Frequency - Inverse Document Frequency (TF-IDF)
Weight the term frequency to account for word rarity 
Words which are rare across documents differentiate a doc - upweight these words
Words which are common across documents do not differentiate a doc - downweight these words
WEIGHT = 1 / # docs the word appears in (simplified; in practice a log-scaled variant is common)
Result is that each document becomes a tuple of N numbers
A tuple of N numbers -> A point in an N-Dimensional Hypercube
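A small worked sketch of the weighting described above, using the simplified rule from these notes (term count divided by the number of docs containing the term). The three sentences are made up for illustration. This is not what scikit-learn's TfidfVectorizer computes exactly: it uses a smoothed, log-scaled inverse document frequency and normalises each vector.

```python
from collections import Counter

# Hypothetical mini-corpus
docs = [
    "the film was great",
    "the film was dull",
    "the acting was great",
]
tokenized = [d.split() for d in docs]

# N = size of the vocabulary; each document becomes a tuple of N numbers
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: number of docs each word appears in
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(tokens):
    """Term frequency weighted by 1 / document frequency (simplified rule)."""
    tf = Counter(tokens)
    return tuple(tf[w] / df[w] for w in vocab)

vectors = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
for v in vectors:
    print([round(x, 2) for x in v])
```

Words such as 'the' and 'was' appear in every doc and end up with a low weight, while 'dull' and 'acting' appear in one doc each and keep their full weight.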

K-Means divides data into K clusters where K is specified by the user (i.e. # of clusters)
Start by initializing a set of points as the 'K Means' (Centroids of the clusters)
	1. The user specifies the number of clusters/means/centroids (K) and K initial means are chosen
	2. Each data point is assigned to the cluster of the nearest Mean
	3. Find the new means/centroids of the clusters
	4. Repeat steps 2 and 3 until the means no longer change
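The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not the scikit-learn implementation: it assumes the first K points are used as the initial means (the demo below uses 'k-means++' initialization instead), and the sample points are made up.

```python
import math

def k_means(points, k, max_iter=100):
    # Step 1 (assumption): use the first k points as the initial means
    means = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster of the nearest mean
        labels = [min(range(k), key=lambda j: math.dist(p, means[j]))
                  for p in points]
        # Step 3: recompute each mean as the average of its members
        new_means = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_means.append(tuple(sum(c) / len(members)
                                       for c in zip(*members)))
            else:
                # An empty cluster keeps its previous mean
                new_means.append(means[j])
        # Step 4: stop once the means no longer change
        if new_means == means:
            break
        means = new_means
    return labels, means

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
labels, means = k_means(points, k=2)
print(labels)
print(means)
```

The result depends on the initial means, which is why library implementations offer smarter initialization and multiple restarts.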

Large Dataset -> TF-IDF -> K-Means Clustering  

DEMO
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
Implement K-Means Clustering on IMDB reviews


In [8]:
import csv

In [9]:
with open("/pluralsight/sentiment labelled sentences/imdb_labelled.txt", "r") as text_file:
    lines = text_file.read().split('\n')
# keep only well-formed rows: review text, a tab, then a non-empty label
lines = [line.split("\t") for line in lines if len(line.split("\t")) == 2 and line.split("\t")[1] != '']

In [10]:
train_documents = [line[0] for line in lines]

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# transform each review into a tuple of numbers - word frequency weighted by TF-IDF
# stop_words removes junk words such as 'the', 'a', 'and' etc.
# max_df and min_df can be changed for tuning
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
train_documents = tfidf_vectorizer.fit_transform(train_documents)

In [13]:
from sklearn.cluster import KMeans

In [16]:
# instantiate the K-Means clustering algorithm with the number of clusters set to 3
# after running this every document will be assigned to a cluster
km = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 100, n_init = 1, verbose = True)
km.fit(train_documents)

Initialization complete
Iteration  0, inertia 1901.607
Iteration  1, inertia 965.607
Iteration  2, inertia 962.818
Iteration  3, inertia 962.496
Iteration  4, inertia 962.461
Converged at iteration 4: center shift 0.000000e+00 within tolerance 1.002596e-07


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=3, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=True)

In [17]:
# now it is necessary to examine each cluster to identify its theme
# this loop prints the first four reviews with the label 0 (count > 3 stops after the fourth)
count=0
for i in range(len(lines)):
    if count>3:
        break
    if km.labels_[i]==0:
        print(lines[i])
        count+=1

['This review is long overdue, since I consider A Tale of Two Sisters to be the single greatest film ever made.  ', '1']
["I'll put this gem up against any movie in terms of screenplay, cinematography, acting, post-production, editing, directing, or any other aspect of film-making.  ", '1']
['" The structure of this film is easily the most tightly constructed in the history of cinema.  ', '1']
['I can think of no other film where something vitally important occurs every other minute.  ', '1']
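Besides reading sample reviews, another way to get a feel for a cluster's theme is to look at the terms with the largest weight in its centroid. The sketch below is self-contained and uses made-up values; top_terms is a hypothetical helper, not part of scikit-learn. In the notebook, the centroid would presumably be one row of km.cluster_centers_ and the term-to-column mapping would come from tfidf_vectorizer.vocabulary_.

```python
def top_terms(centroid, vocabulary, n=3):
    """Return the n terms with the largest weight in a centroid vector.

    vocabulary maps each term to its column index in the centroid.
    """
    index_to_term = {index: term for term, index in vocabulary.items()}
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [index_to_term[i] for i in ranked[:n]]

# Hypothetical vocabulary and centroid, for illustration only
vocabulary = {"acting": 0, "bad": 1, "film": 2, "great": 3, "plot": 4}
centroid = [0.05, 0.01, 0.40, 0.30, 0.10]

print(top_terms(centroid, vocabulary))
```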
