# Project Extension 1: Gaussian Mixture Model 

Shortly after officially finishing my capstone, I came across the idea of using a Gaussian mixture model as an unsupervised learning technique for clustering data. In the primary project I used a KMeans model to achieve my desired clustering. According to the sklearn documentation:  

>"A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians."  

This notebook explores Gaussian mixture models using the same science standards corpora.

In [67]:
import pandas as pd 
import pickle  
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [68]:
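#import text data (the context manager closes the file handle after loading)
with open("Pickles/standards_corpi.pkl", "rb") as f:
    df = pickle.load(f)
#drop the last row of the frame
df.drop(df.tail(1).index, inplace=True)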
df

Unnamed: 0,level_0,state,corpus
0,0,TXTfiles/alabama,"'information', 'regarding', 'course', 'study',..."
1,1,TXTfiles/alaska,"'dept', 'education', 'early', 'development', '..."
2,2,TXTfiles/arizona,"'department', 'education', 'academic', 'introd..."
3,3,TXTfiles/colorado,"'review', 'revision', 'committee', 'chairperso..."
4,4,TXTfiles/flordia,"'specifications', 'florida', 'state', 'adoptio..."
5,5,TXTfiles/georgia,"'excellence', 'first', 'excellence', 'designed..."
6,6,TXTfiles/idaho,"'content', 'state', 'superintendent', 'public'..."
7,7,TXTfiles/indiana,"'physics', 'engineering', 'process', 'seps', '..."
8,8,TXTfiles/louisiana,"'shifts', 'following', 'key', 'shifts', 'calle..."
9,9,TXTfiles/mass,"'massachusetts', 'technology', 'framework', 'e..."


In [71]:
#set features & labels to list
corpi_list = df['corpus'].values.tolist() 
state_list = df['state'].values.tolist()

The text data needs to be prepared in the same way it was for the KMeans model, by vectorizing the terms. The data was already cleaned and tokenized prior to pickling, so I am choosing not to apply additional processing for the moment, but that could be revisited and refined.

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity  

In [72]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(corpi_list)

In [73]:
tfidf_matrix.shape

(30, 22482)

In [74]:
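#values needed later for kmeans
#get_feature_names() was removed in scikit-learn 1.2; older versions may need that name instead
terms = tfidf_vectorizer.get_feature_names_out()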
dist = 1 - cosine_similarity(tfidf_matrix) 

In the original KMeans project the clusters produced were elliptical rather than spherical (as reproduced below), and at least one point was assigned to a different cluster than the points located directly next to it. The data formed non-circular groupings, so the circular groupings that KMeans favors were a poor fit. A further limitation of KMeans is that it gives only hard assignments: there is no way to determine the probability that a specific data point belongs to one cluster rather than another. A Gaussian mixture model makes it possible to find the probability that a point belongs to each cluster. I will use the same number of groups that I used for the KMeans clusters in the initial model.
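To make the idea of a membership probability concrete, here is a minimal sketch of how a mixture model turns densities into per-cluster probabilities using Bayes' rule. It uses a one-dimensional, two-component mixture whose weights, means, and standard deviations are made up for illustration; none of these values come from the standards data, and the helper names `normal_pdf` and `responsibilities` are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability of each component given x (Bayes' rule)."""
    weighted = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]


# Hypothetical mixture parameters, chosen only for illustration
weights = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

# A point near one center, a point midway, and a point nearer the other center
for x in (0.0, 2.0, 3.5):
    point_probs = responsibilities(x, weights, means, stds)
    print(x, [round(p, 3) for p in point_probs])
```

A point sitting on one center is assigned almost entirely to that component, while a point midway between the two centers is split evenly. KMeans would give each of these points a single label with no indication of how close the call was.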

In [75]:
from sklearn import mixture

The sparse TF-IDF matrix needs to be converted to a dense array before being passed into the model.

In [76]:
dense_matrix = tfidf_matrix.toarray()

In [77]:
#fit GMM Model
model = mixture.GaussianMixture(n_components=3, covariance_type='full')  
gmm = model.fit(dense_matrix)
labels = gmm.predict(dense_matrix)
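The `covariance_type` argument controls how many covariance parameters the model has to estimate. As a rough sketch, the counts below use the standard formulas for the four options sklearn accepts (`'full'`, `'tied'`, `'diag'`, `'spherical'`) with the 3 components and 22482 features used in this notebook. The helper name `covariance_param_count` is hypothetical, and the count covers covariance parameters only, not means or mixture weights.

```python
def covariance_param_count(n_components, n_features, covariance_type):
    """Number of free covariance parameters for each sklearn covariance_type."""
    k, d = n_components, n_features
    per_full_matrix = d * (d + 1) // 2  # a symmetric d x d matrix
    counts = {
        "full": k * per_full_matrix,   # one full matrix per component
        "tied": per_full_matrix,       # one full matrix shared by all components
        "diag": k * d,                 # one variance per feature per component
        "spherical": k,                # one variance per component
    }
    return counts[covariance_type]


n_components = 3      # same number of groups as the KMeans model
n_features = 22482    # number of columns in tfidf_matrix

for cov_type in ("full", "tied", "diag", "spherical"):
    print(cov_type, covariance_param_count(n_components, n_features, cov_type))
```

With only 30 documents, the `'full'` setting asks the model to estimate far more parameters than there are observations, so comparing it against `'diag'` or `'spherical'` may be worth trying.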

Below is a matrix of membership probabilities: each row is a document and each column is the probability that the document belongs to that cluster.

In [78]:
#compute the probability of each document belonging to each cluster
probs = gmm.predict_proba(dense_matrix)
print(probs[:5].round(3))

[[0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
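One way to read a probability matrix like this is to pull out the most likely cluster for each row and check how confident each assignment is. The sketch below works on the five rows printed above, copied by hand into an array named `probs_head` (a hypothetical name). The 0.9 cutoff for "uncertain" is an arbitrary choice for illustration.

```python
import numpy as np

# First five rows of the probability matrix, copied from the printed output
probs_head = np.array([
    [0., 1., 0.],
    [1., 0., 0.],
    [0., 1., 0.],
    [1., 0., 0.],
    [0., 0., 1.],
])

# Most likely cluster per document (for these rows, the same answer predict() gives)
hard_labels = probs_head.argmax(axis=1)

# Probability of the winning cluster; values near 1 mean a confident assignment
confidence = probs_head.max(axis=1)

# Rows whose best probability falls below an arbitrary 0.9 cutoff
uncertain = np.flatnonzero(confidence < 0.9)

print(hard_labels)
print(confidence)
print(uncertain)
```

For these five rows every assignment has probability 1 after rounding, so none are flagged as uncertain. That pattern may reflect how few documents there are relative to the number of features, rather than clusters that are truly that well separated.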
