Julian Pulido  
STAT 129   
clustering  
Goal: Apply unsupervised machine learning methods to group text documents into clusters.


In [2]:
#import necessary libraries
from collections import Counter
import random

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
import scipy as sp
import scipy.sparse  #load the sparse submodule explicitly so sp.sparse.load_npz is available
import pandas as pd

**1. What percentage of the total variation in the data do the first 50 principal components explain?**


In [3]:
#read in sparse TF-IDF matrix
sparseMatrix = sp.sparse.load_npz("C:\\Users\\Julian\\Downloads\\tfidf.npz")

#reading in terms
dataTerms = pd.read_csv("C:\\Users\\Julian\\Downloads\\terms.txt", delimiter= " ", header = None)

terms = dataTerms.iloc[:,0].tolist()

#sparse matrix has 6454 rows, 3166 columns.
print("Shape of sparse matrix: ", sparseMatrix.shape)

#reduce to this dimension
ndim = 100

#create a TruncatedSVD object that keeps ndim components
svd = TruncatedSVD(n_components=ndim, random_state =1)

#fit it according to our sparse matrix
svd.fit(sparseMatrix)

#transform 
Xpc = svd.transform(sparseMatrix)


#Xpc holds each document's coordinates on the first ndim components
#(TruncatedSVD does not center the data, so this is PCA-like rather than exact PCA)
print("Shape of principal components:" , Xpc.shape)

#print(100 * sum(svd.explained_variance_ratio_))
#print(svd.explained_variance_ratio_)
#add up the explained variance ratios of the first 50 components
summation = 0

for i in range(50):
    summation += svd.explained_variance_ratio_[i]
print("Percentage of total variance explained by first 50 principal components: " , 100*summation)

Shape of sparse matrix:  (6454, 3166)
Shape of principal components: (6454, 100)
Percentage of total variance explained by first 50 principal components:  23.437367238814474
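
The running total above can also be read off a cumulative sum, which gives the explained share for every number of components at once. The following is a minimal sketch on a small synthetic sparse matrix, not the TF-IDF data; the names `toy`, `toy_svd`, and `cumulative` are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the TF-IDF matrix (assumption: NOT the real data).
rng = np.random.default_rng(1)
dense = rng.random((200, 80))
dense[dense < 0.9] = 0.0          # keep roughly 10% of the entries nonzero
toy = sparse.csr_matrix(dense)

toy_svd = TruncatedSVD(n_components=20, random_state=1)
toy_svd.fit(toy)

# cumulative[k - 1] is the share of variance explained by the first k components.
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

first_10 = 100 * cumulative[9]
# Same quantity computed the way the loop above does it, for comparison.
loop_total = 100 * sum(toy_svd.explained_variance_ratio_[:10])
print(f"First 10 components explain {first_10:.2f}% of the variance")
```

With the real `svd` object, `np.cumsum(svd.explained_variance_ratio_)[49]` should agree with the loop's total up to floating-point rounding.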


**2. Fit the K means clustering model.**  
**Experiment with using both the original data and the principal components.**  
**Experiment with values of K = 2, 3, 4, 5.**  
**Pick your favorite model.**  

The model I am picking is k = 3 on the principal components.

In [4]:
#different k values
values = [2,3,4,5]

#create a list of k-means models, one per value of k
kmModels = [KMeans(n_clusters=value, random_state=1) for value in values]

#fit each model on the original TF-IDF data
clustersOriginal = []
for model in kmModels:
    clustersOriginal.append(model.fit_predict(sparseMatrix))

#how many documents are in each cluster?
print("Experimenting on original data where dimension is " ,sparseMatrix.shape[1])
for i in range (len(clustersOriginal)):
    print("k value:" , kmModels[i].n_clusters , " clusters: " , Counter(clustersOriginal[i]))


#refit the same model objects on the principal components
#note: this overwrites the fits above, so kmModels now holds the principal component fits
clusters = []
for model in kmModels:
    clusters.append(model.fit_predict(Xpc))

#how many documents are in each cluster?
print("\nExperimenting on principal components where dimension is " ,ndim)
for i in range (len(clusters)):
    print("k value:" , kmModels[i].n_clusters , " clusters: " , Counter(clusters[i]))

Experimenting on original data where dimension is  3166
k value: 2  clusters:  Counter({0: 3332, 1: 3122})
k value: 3  clusters:  Counter({0: 3383, 2: 1770, 1: 1301})
k value: 4  clusters:  Counter({0: 2559, 2: 1787, 1: 1296, 3: 812})
k value: 5  clusters:  Counter({4: 2047, 2: 1452, 1: 1301, 0: 920, 3: 734})

Experimenting on principal components where dimension is  100
k value: 2  clusters:  Counter({0: 4981, 1: 1473})
k value: 3  clusters:  Counter({0: 3307, 1: 1868, 2: 1279})
k value: 4  clusters:  Counter({0: 2505, 3: 1882, 2: 1271, 1: 796})
k value: 5  clusters:  Counter({0: 2423, 3: 1859, 2: 1271, 4: 619, 1: 282})
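
Cluster sizes alone do not say how well separated the clusters are. One possible complement, which the analysis above does not use, is the silhouette score from `sklearn.metrics`. The sketch below runs on synthetic blobs with three known groups rather than the mission data; `toy_X`, `scores`, and `best_k` are hypothetical names, and `n_init=10` is set explicitly as an assumption.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumption: NOT the real data).
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=1,
)

scores = {}
for k in [2, 3, 4, 5]:
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    labels = model.fit_predict(toy_X)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(toy_X, labels)
    print("k value:", k, " clusters:", Counter(labels), " silhouette:", round(scores[k], 3))

# The k with the highest silhouette score on this toy data.
best_k = max(scores, key=scores.get)
```

On the real data the same loop could be run on `Xpc`. The score is only a heuristic, so reading sample documents from each cluster remains the more direct check.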


**3. For the model you picked, print and comment on the 5 terms that have the highest coefficients in the cluster centers.**

I picked the model with k==3 on the principal components. For each of the three clusters, the five (stemmed) terms with the highest coefficients in the cluster center are listed below.

Cluster 0: `ill provid servic health mental`. These nonprofits aim to provide mental health services to the community.

Cluster 1: `organ educ communiti refuge immigr`. These nonprofits aim to deliver educational and community services to immigrants or refugees.

Cluster 2: `preparatori educ school colleg student`. These nonprofits likely help students prepare for higher education through scholarships, internships, or other assistance.

In [13]:
#choosing the model with k = 3 on the principal components (kmModels[1])

#the cluster centers live in the reduced space, so map them back
#to the original term space with the SVD inverse transform
Xcenters = svd.inverse_transform(kmModels[1].cluster_centers_)

#argsort each row in ascending order, keeping only the term indexes
bigIndexes = np.argsort(Xcenters)

print("Index of our indexes after argsort: ", bigIndexes.shape)
for i in range(kmModels[1].n_clusters):
    print("Cluster " , i, ":" , end= " ")
    #prints the 5 terms with the largest coefficients in each cluster center
    for j in range(-5, 0):
        print(terms[bigIndexes[i,j]] , end= " ")
    print("\n")

Index of our indexes after argsort:  (3, 3166)
Cluster  0 : ill provid servic health mental 

Cluster  1 : organ educ communiti refuge immigr 

Cluster  2 : preparatori educ school colleg student 



**4. Print out a few random descriptions (from the original mission description data) in each cluster. Comment on what the clusters found. Did clustering do something reasonable?**

With k==3, each cluster centers on a clear topic, so the clustering looks reasonable. Choosing too high a value of k made the clusters overlap with one another.

Cluster 0 contains nonprofits that aim to assist people who have mental health problems. Each nonprofit assists in a different way, such as providing temporary housing or economic assistance, but the target group is people who suffer from mental health problems.

Cluster 1 contains organizations that aim to help refugees or immigrants. Some of them provide legal or financial assistance to families detained by immigration authorities.

Cluster 2 contains organizations that aim to help university/college students through scholarships or housing assistance.
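
Pulling sample descriptions for each cluster comes down to a boolean mask over the cluster labels. Below is a minimal sketch on a made-up table; the column names `ein`, `name`, and `mission`, the labels in `toy_labels`, and the dictionary `samples` are hypothetical. It samples without replacement so the same description cannot appear twice.

```python
import numpy as np
import pandas as pd

# Made-up descriptions table (assumption: column names and rows are hypothetical).
toy_df = pd.DataFrame({
    "ein": [1, 2, 3, 4, 5, 6],
    "name": ["A", "B", "C", "D", "E", "F"],
    "mission": [
        "mental health counseling",
        "housing for people with mental illness",
        "substance use prevention",
        "college scholarships",
        "college preparatory programs",
        "student housing assistance",
    ],
})

# One cluster label per row, in the same order as the rows of toy_df.
toy_labels = np.array([0, 0, 0, 1, 1, 1])

rng = np.random.default_rng(1)
samples = {}
for cluster_id in np.unique(toy_labels):
    # Boolean mask: True where the document belongs to this cluster.
    in_cluster = toy_labels == cluster_id
    missions = toy_df.loc[in_cluster, "mission"].to_numpy()
    # Sample without replacement; cap the size at the cluster's document count.
    size = min(2, len(missions))
    samples[int(cluster_id)] = rng.choice(missions, size=size, replace=False)
    print("Cluster:", cluster_id)
    print(samples[int(cluster_id)])
```

The mask only lines up if the rows of the table are in the same order as the rows of the matrix that was clustered.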

In [12]:

#read our descriptions dataset
descriptionsDF = pd.read_csv("C:\\Users\\Julian\\Downloads\\descriptions.csv")

#we chose k==3 (kmModels[1]), so loop over its 3 clusters
for i in range(kmModels[1].n_clusters):
    #boolean array: True where a document is in cluster i
    indexInCluster = clusters[1] == i

    #get the subset of documents that are in this cluster
    gdocs = descriptionsDF[indexInCluster]

    print("Cluster: " ,i)


    #the column at position 2 (the third column) holds the mission statements
    #draw 3 random descriptions; replace=True means a description can repeat
    print(np.random.choice(gdocs.iloc[:,2], size =3, replace = True))
    print("\n")


Cluster:  0
['TO ASSIST PEOPLE AFFECTED BY OR AT RISK FOR HIV/AIDS, SUBSTANCE ABUSE AND MENTAL HEALTH THROUGH CULTURALLY APPROPRIATE COUNSELING, EDUCATION, TRAINING, AND ADVOCACY, WHICH RESULTS IN MORE INFORMED CHOICES THAT MAXIMIZE AVAILABLE BENEFITS ANDEMPLOYMENT OPPORTUNITIES FOR CLIENTS.'
 'LEADING WITH PREVENTION AND INTERVENTION FOR SUBSTANCE USE AND MENTAL HEALTH CONCERNS'
 'ASSISTING INDIVIDUALS WITH HISTORIES OF MENTAL ILLNESS TO OBTAIN AND RETAIN AFFORDABLE HOUSING.']


Cluster:  1
['The Corporation is organized primarily for the purpose of providing legal and financial assistance to low-income individuals and their families who have been detained by immigration authorities, are charged with or suspected of immigration-related violations, or are seeking to ac'
 'TO CARRY OUT THE CHARITABLE PURPOSES OF LUTHERAN IMMIGRATION & REFUGEE SVC (LIRS) AND LUTHERAN WORLD RELIEF (LWR) BY OPERATING THE LUTHERAN CENTER. THE PRIMARY PURPOSE OF THE LCC IS TO MAINTAIN AND OPERATE THE LUTHERA