## **K Means Clustering**

k-means clustering is a method of vector quantization, originally from signal processing, that partitions n observations into k clusters. Each observation is assigned to the cluster with the nearest mean (the cluster centroid), and that mean serves as the prototype of the cluster.
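
The definition above can be sketched as the usual alternating procedure (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a minimal pure-Python illustration, not scikit-learn's implementation; the function names `assign`, `update` and `k_means`, the sample points, and the starting centroids are all assumptions made for the example.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        dims = len(members[0])
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dims)))
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Hypothetical 2-D data: two visibly separate groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

In the rest of this notebook the points are tf-idf vectors of sentences rather than 2-D coordinates, and `sklearn.cluster.KMeans` does the work.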

## **Importing Libraries**

In [17]:
import collections
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Defining a function tokenizer(text)**



In [0]:
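def tokenizer(text):
  # Split the text into word tokens (uses the NLTK 'punkt' data downloaded above)
  tokens = word_tokenize(text)
  stemmer = PorterStemmer()
  # Build the stop-word set once instead of reloading the list for every token
  stop_words = set(stopwords.words('english'))
  # Drop stop words, then reduce each remaining token to its stem
  tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
  return tokens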
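
To make the order of operations in `tokenizer(text)` concrete (tokenize, drop stop words, then stem), here is a dependency-free sketch of the same pipeline. Everything in it is a stand-in chosen for illustration: `STOP_WORDS` is a tiny hypothetical list, `naive_stem` is a crude suffix stripper and not the Porter algorithm, and the regex replaces `word_tokenize`. The sketch lowercases the text itself, which in the notebook is done by `TfidfVectorizer(lowercase=True)` before it calls the tokenizer.

```python
import re

# Hypothetical stand-ins, for illustration only
STOP_WORDS = {"is", "a", "the", "for", "in", "on", "and"}

def naive_stem(token):
    """Strip a few common suffixes; far cruder than PorterStemmer."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def sketch_tokenizer(text):
    """Tokenize, drop stop words, then stem: the same order as tokenizer(text)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(sketch_tokenizer("Snooker is a billiards sport for normally two players."))
```

Filtering before stemming matters: the stop-word list contains unstemmed words, so a token has to be compared against it in its original form.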

## **Defining a function cluster_sentences(sentences, k)**

### ***(KMeans Clustering)***



In [0]:
def cluster_sentences(sentences, k):
  # Create the tf-idf vectorizer; stop words (I, my, the, and, ...) are filtered out
  tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, stop_words=stopwords.words('english'), lowercase=True)
  # Build a tf-idf matrix with one row per sentence
  tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)
  # Fit k-means with k clusters on the tf-idf rows
  kmeans = KMeans(n_clusters=k)
  kmeans.fit(tfidf_matrix)
  # Group sentence indices by their assigned cluster label
  clusters = collections.defaultdict(list)
  for i, label in enumerate(kmeans.labels_):
    clusters[label].append(i)
  return dict(clusters)
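
As a rough picture of what the tf-idf matrix holds, the sketch below computes tf-idf weights by hand for already-tokenized documents. It assumes the smoothed idf formula documented for scikit-learn's defaults, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The function name `tfidf_rows` and the three toy documents are hypothetical, and the result is a list of dicts rather than the sparse matrix `TfidfVectorizer` returns.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed idf (assumed to match scikit-learn's default smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Hypothetical, already-tokenized toy documents
docs = [["snooker", "sport"], ["snooker", "table"], ["love", "blind"]]
rows = tfidf_rows(docs)
for row in rows:
    print({term: round(w, 3) for term, w in row.items()})
```

A term that appears in fewer documents gets a larger weight, so in the first toy document "sport" outweighs "snooker". Sentences sharing such distinctive terms end up close together, which is what k-means exploits.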

## **Main Body**

In [27]:
if __name__ == "__main__":
  sentences= ["Graphics designers are most creative people",
              "Snooker is a billiards sport for normally two players.",
              "Snooker is played on a large (12 feet by 6 feet) table that is covered with a smooth green material.",
              "FOREX is the stock market for trading currencies",
              "Software Engineering is hotter and hotter topic in Silicon Valley",
              "Love is blind",
              "Snooker is popular in the United Kingdom and many other countries.",
              "The flying or operating of aircraft is known as aviation.",
              "Falling in love is like being on drugs.",
              "Warren Buffet is famous for making good investments.He knows stock markets",
              "The biggest of the many uses of aviation are in air travel and military aircraft.",
              "All giant majors in Silicon Valley is focusing AI for their business productivity",
              "Investing in stocks and trading with them are not that easy",
              "Being in love is the number one reason why people wed.",
              "Aviation refers to flying using an aircraft, like an aeroplane.",
              "Graphics Designing is high rated freelance subject",
              "Loving from a long distance actually strengthens a relationship."
              ]
  k = 6
  clusters = cluster_sentences(sentences, k)
  # Print each cluster label followed by its numbered sentences
  for cluster in range(k):
    print("CLUSTER ", cluster, ":")
    for i, sentence in enumerate(clusters[cluster]):
      print("\t", (i+1), ": ", sentences[sentence])


CLUSTER  0 :
	 1 :  Software Engineering is hotter and hotter topic in Silicon Valley
	 2 :  All giant majors in Silicon Valley is focusing AI for their business productivity
CLUSTER  1 :
	 1 :  Snooker is a billiards sport for normally two players.
	 2 :  Snooker is played on a large (12 feet by 6 feet) table that is covered with a smooth green material.
	 3 :  Snooker is popular in the United Kingdom and many other countries.
CLUSTER  2 :
	 1 :  Love is blind
	 2 :  Falling in love is like being on drugs.
	 3 :  Being in love is the number one reason why people wed.
	 4 :  Loving from a long distance actually strengthens a relationship.
CLUSTER  3 :
	 1 :  FOREX is the stock market for trading currencies
	 2 :  Warren Buffet is famous for making good investments.He knows stock markets
	 3 :  Investing in stocks and trading with them are not that easy
CLUSTER  4 :
	 1 :  Graphics designers are most creative people
	 2 :  Graphics Designing is high rated freelance subject
CLUSTER  5 :


