# Project Extension 1: Gaussian Mixture Model 

Shortly after officially finishing my capstone, I came across the idea of using a Gaussian mixture model as an unsupervised learning technique for clustering data. In the primary project I used a KMeans model to achieve the clustering I wanted. According to the sklearn documentation:  

>"A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians."  

This notebook explores Gaussian mixture models using the same science standards corpora.
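
As a minimal sketch of what the quoted description means in practice, the toy example below fits both models to synthetic two-dimensional data. The data, the variable names, and the choice of two clusters are assumptions made for illustration; none of it comes from the standards corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data (illustrative assumption, not the standards corpus):
# one tight blob and one wide blob.
rng = np.random.default_rng(0)
tight = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
wide = rng.normal(loc=[10.0, 10.0], scale=2.0, size=(100, 2))
X = np.vstack([tight, wide])

# KMeans assigns each point exactly one hard label.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A Gaussian mixture also estimates a covariance matrix per component
# and returns a membership probability for every component.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
gmm_labels = gmm.predict(X)
gmm_probs = gmm.predict_proba(X)  # one row per point, one column per component

print(gmm.means_.shape)        # (2, 2): one center per component
print(gmm.covariances_.shape)  # (2, 2, 2): one 2x2 covariance per component
```

The soft assignments in `gmm_probs` and the fitted `covariances_` are the extra information the mixture model provides beyond the cluster centers that KMeans reports.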

In [2]:
import pandas as pd 
import pickle 

In [6]:
# import text data
with open("Pickles/standards_corpi.pkl", "rb") as f:
    df = pickle.load(f)
df.head()

| Unnamed: 0 | level_0 | state | corpus |
|---|---|---|---|
| 0 | 0 | TXTfiles/alabama | 'information', 'regarding', 'course', 'study',... |
| 1 | 1 | TXTfiles/alaska | 'dept', 'education', 'early', 'development', '... |
| 2 | 2 | TXTfiles/arizona | 'department', 'education', 'academic', 'introd... |
| 3 | 3 | TXTfiles/colorado | 'review', 'revision', 'committee', 'chairperso... |
| 4 | 4 | TXTfiles/flordia | 'specifications', 'florida', 'state', 'adoptio... |


Text data needs to be prepared the same way it was for the KMeans model: by vectorizing the terms. The data was already cleaned and tokenized before pickling, so I am not applying additional processing for the moment, though that could be revisited and refined.
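
To illustrate what vectorizing does, here is a small sketch on a made-up three-document corpus. The documents and the `toy_` names are hypothetical stand-ins; the point is only to show that `TfidfVectorizer` expects an iterable of strings (one per document) and that float values of `min_df` and `max_df` are read as proportions of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the per-state documents.
toy_docs = [
    "students investigate energy transfer",
    "students model energy and matter",
    "students analyze earth processes",
]

# Float min_df / max_df are proportions of documents: drop terms that
# appear in fewer than 20% or in more than 80% of the documents.
toy_vectorizer = TfidfVectorizer(min_df=0.2, max_df=0.8, stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

print(toy_matrix.shape)                    # (n_documents, n_retained_terms)
print(sorted(toy_vectorizer.vocabulary_))  # terms that survived the filters
```

Because "students" occurs in every document, it exceeds the `max_df` threshold and is dropped, while "energy" (two of three documents) is kept. The result is a sparse matrix with one row per document.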

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 

In [14]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, ngram_range=(1,3))

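# fit on the text column (assumes each 'corpus' entry is a single string);
# passing the whole DataFrame would iterate over its column names instead
tfidf_matrix = tfidf_vectorizer.fit_transform(df['corpus'])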

In [15]:
#values needed later for kmeans 
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tfidf_vectorizer.get_feature_names_out()
dist = 1 - cosine_similarity(tfidf_matrix)