## word matrix
Goal: Convert unstructured text into a document-term matrix  
Julian Pulido
STAT 129

In [94]:
#import necessary libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import sys

In [95]:
#read the dataset
data = pd.read_csv("C:\\Users\\Julian\\Downloads\\descriptions.csv")
data["mission"]

0       FDC IS DEDICATED TO MEETING THE NEEDS OF A DIV...
1       Offer unique opportunity for the high-school s...
2       TO PROVIDE TRANSITIONAL HOUSING, PERMANENT HOU...
3       NATIONAL HONOR SOCIETY FOR COLLEGE STUDENTS IN...
4       PROVIDE PERMANENT SUPPORTIVE HOUSING FOR INDIV...
                              ...                        
6449    THE CENTRAL IOWA TRAUMA RECOVERY CENTER (CITRC...
6450    Accompany Capital creates a pathway to prosper...
6451    TO ENSURE ALL MICHIGAN CHILDREN AND YOUTH WITH...
6452    Advocacy Unlimited is a CT based peer-led orga...
6453    "GUIDED BY THE TEACHINGS OF JESUS CHRIST, SAIN...
Name: mission, Length: 6454, dtype: object

### Show an example of how a word was stemmed.  
I reused the function from the last homework to stem words and remove stop words.

In [96]:
#create stemmer object
stemmer = EnglishStemmer()

#build the default CountVectorizer analyzer (lowercases and tokenizes the text)
analyzer = CountVectorizer().build_analyzer()


from nltk.corpus import stopwords

#load the built-in English stop word list once instead of on every call
stop_words = set(stopwords.words('english'))

#function to stem words and remove stop words
def stemmed_words1(doc):
    docwords = [stemmer.stem(w) for w in analyzer(doc)]
    return [x for x in docwords if x.lower() not in stop_words]

#create a new count vectorizer object that uses the function above as its analyzer
cv1 = CountVectorizer(analyzer=stemmed_words1)


example = "This is my homework for big data and an this is an example sentence. Today is a hot day."
print("Sentence example: " , example)

print("Stemmed and stop words removed:" , stemmed_words1(example))

Sentence example:  This is my homework for big data and an this is an example sentence. Today is a hot day.
Stemmed and stop words removed: ['homework', 'big', 'data', 'exampl', 'sentenc', 'today', 'hot', 'day']


### What are the dimensions of your matrix?
6454 rows, 8257 columns

In [97]:

#Fit the count vectorizer (CV) to our data
cv1.fit(data["mission"])


# What does each column in the matrix (feature) represent?
words = cv1.get_feature_names_out()

# The actual matrix
m = cv1.transform(data["mission"])

print(m.shape)

#Tfidf
#re-weight the count features into floating-point TF-IDF values
V = TfidfVectorizer(analyzer=stemmed_words1)
V.fit(data["mission"])

#transform into a TF-IDF sparse matrix. each row is scaled to Euclidean (L2) norm 1
tfidf = V.transform(data["mission"])
print(tfidf.shape)

(6454, 8257)
(6454, 8257)


### How many bytes of memory are used to represent the matrix in sparse form?
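`sys.getsizeof(m)` reports 48 bytes, but that figure covers only the Python object that wraps the matrix, not the `data`, `indices`, and `indptr` arrays that actually store it. The real sparse footprint is `m.data.nbytes + m.indices.nbytes + m.indptr.nbytes`.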

In [98]:
print("Size of our matrix in sparse form in bytes:" , sys.getsizeof(m))

Size of our matrix in sparse form in bytes: 48
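As a minimal sketch of how the sparse footprint could be measured, the example below sums the sizes of the three arrays behind a CSR matrix. It assumes the matrix is in CSR format (which is what `CountVectorizer.transform` returns); `csr_nbytes` and the `toy` matrix are hypothetical names and data used only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_nbytes(mat):
    """Bytes held by the three arrays that back a CSR matrix."""
    return mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes

# Toy 3 x 4 count matrix with four nonzero entries (illustrative data only).
toy = csr_matrix(np.array([[1, 0, 0, 2],
                           [0, 0, 3, 0],
                           [0, 4, 0, 0]], dtype=np.int64))

print("nonzero entries:", toy.nnz)
print("bytes in CSR arrays:", csr_nbytes(toy))
```

The same helper could be applied to `m`; the result grows with the number of nonzero entries rather than with rows times columns.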


### How many bytes of memory would be used to represent the matrix in dense form? (Don't actually convert it to dense; it will be too large!)

In [99]:
#rows * columns * 8 bytes per entry (assuming 64-bit values)
denseFormSize = 6454 * 8257 * 8
print("Bytes in dense form:" , denseFormSize)
print("MB of matrix in dense form:", denseFormSize/10**6)

Bytes in dense form: 426325424
MB of matrix in dense form: 426.325424
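The dense size depends on the element type, so a small helper can make the 8-bytes-per-entry assumption explicit. This is only a sketch: `dense_nbytes` is a hypothetical helper, and it computes the size from the shape without allocating anything.

```python
import numpy as np

def dense_nbytes(shape, dtype=np.int64):
    """Bytes a dense array of this shape and dtype would occupy (nothing is allocated)."""
    rows, cols = shape
    return rows * cols * np.dtype(dtype).itemsize

size64 = dense_nbytes((6454, 8257))            # 64-bit entries, as assumed above
size32 = dense_nbytes((6454, 8257), np.int32)  # half the size with 32-bit entries

print("int64 dense bytes:", size64)
print("int32 dense bytes:", size32)
```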


### What happens if you try to transform a document that contains some words that were not in the original corpus? Does it work?
It still works, but a word that was not in the original corpus is not counted. The resulting sparse row contains values in only two locations ('day' and 'word'); it does not account for the word 'brusque', which does not appear in the original corpus.

In [100]:
#example doc (define it before printing it)
doc = ["brusque is the word of the day"]
print("Example document:" , doc)
print("Is the word 'brusque' in the original corpus?" , "brusque" in cv1.get_feature_names_out())

test = cv1.transform(doc)
print("Word at 2196 index: ", words[2196] , "\nWord at 8152 index:" , words[8152])
print(test)

Example document: ['brusque is the word of the day']
Is the word 'brusque' in the original corpus? False
Word at 2196 index:  day 
Word at 8152 index: word
  (0, 2196)	1
  (0, 8152)	1


### Which are the top 10 most frequent words? Are they meaningful, or should they be removed?
The 10 most popular stems (such as 'mental', 'health', 'provid', and 'servic') describe what the nonprofits do, so they are meaningful and should be kept.

In [101]:
#sum the counts for each unique word. axis=1 sums across rows;
#axis=0 sums down each column, giving one total per word
wordCounts = m.sum(axis=0)

#wordCounts is a 2D matrix with a single row. check with shape
print("Shape of wordCounts ", wordCounts.shape)

#flatten with np.ravel to get a 1D contiguous array
wordCounts = np.ravel(wordCounts)

#argsort returns the indexes that would sort the word counts in ascending order
rank = np.argsort(wordCounts)
print("Shape of original count vectorized matrix:", m.shape)

#words is an array of all the unique words. index it with rank (an array of indexes)
#to get the words sorted by count,
#so the last word is the one that appeared most often
popularWords = words[rank]


print("Most popular words: ", popularWords[-10:])
print("Least popular words: ", popularWords[:10])


Shape of wordCounts  (1, 8257)
Shape of original count vectorized matrix: (6454, 8257)
Most popular words:  ['organ' 'student' 'famili' 'support' 'educ' 'communiti' 'servic' 'provid'
 'health' 'mental']
Least popular words:  ['jeopard' 'educationon' 'educationw' 'educatyion' 'educt' 'edutain'
 'edututorva' 'persian' 'efec' 'perpetr']
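The ranking step can be checked by hand on a small matrix. In this sketch `toy_words` and `toy_counts` are made-up illustration data, not values from the nonprofit dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_words = np.array(["educ", "health", "mental", "servic"])
toy_counts = csr_matrix(np.array([[1, 2, 0, 1],
                                  [0, 3, 1, 0],
                                  [0, 1, 4, 1]]))

# Sum down each column: one total per word, flattened to a 1D array.
totals = np.asarray(toy_counts.sum(axis=0)).ravel()

# argsort is ascending, so reverse it to list the most frequent words first.
order = np.argsort(totals)[::-1]
top2 = toy_words[order[:2]]

print(totals)
print(top2)
```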



### Experiment with the min_df, max_df arguments to CountVectorizer. Which values did you choose, and why?

I created a function and set `min_df` to 5. This means that the CountVectorizer ignores words that appear in fewer than 5 documents. I chose this value because a word that appears in fewer than 5 of the 6454 documents is likely either misspelled or a very unusual word that other nonprofits don't use.

In [103]:
#https://stackoverflow.com/questions/27697766/understanding-min-df-and-max-df-in-scikit-countvectorizer
#using min_df=0.01 ignores all words that appear in fewer than 1% of documents

def removeSparseWords(lowerBound = 1):
    """
    Fit a count vectorizer that drops sparse words (those that appear in fewer
    than lowerBound documents) and print the 10 least and most popular words.
    """
    cv2 = CountVectorizer(analyzer=stemmed_words1, min_df=lowerBound)

    #fit the new model on the mission column
    cv2.fit(data["mission"])

    #vocabulary that survives the min_df filter
    words2 = cv2.get_feature_names_out()

    # The actual matrix
    m2 = cv2.transform(data["mission"])

    print("Shape of count vectorizer matrix", m2.shape)

    wordCounts = m2.sum(axis=0)

    #wordCounts is a 2D matrix with a single row. check with shape
    print("Shape of wordCounts ", wordCounts.shape)

    #flatten with np.ravel to get a 1D contiguous array
    wordCounts = np.ravel(wordCounts)

    print(wordCounts.shape)

    #argsort returns the indexes that would sort the word counts in ascending order
    rank = np.argsort(wordCounts)

    #index cv2's own vocabulary with rank. the global words array belongs to cv1,
    #whose columns do not line up with this smaller matrix
    popularWords = words2[rank]

    print("Most popular words: ", popularWords[-10:])
    print("Least popular words: ", popularWords[:10])

removeSparseWords(lowerBound= 5)



Shape of count vectorizer matrix (6454, 2302)
Shape of wordCounts  (1, 2302)
(2302,)


### Use Python to pick a random nonprofit. Verify that the bag of words approach does the correct operation on this particular description, both in the counts and the TF-IDF.

From a random nonprofit's mission statement, I count the stemmed words that remain after stop word removal and compare that count to the sum of the matching row in my count matrix. If the two agree, the bag of words counts are correct for this statement. I then compare the sum of the TF-IDF weights for the same statement.

In [116]:
import random

random.seed(1)
#get a random row position. randint includes both endpoints, so the upper bound is the last valid position
randomIndex = random.randint(1, data.shape[0] - 1)

#get the mission statement at the random index
missionStatement = data["mission"].iloc[randomIndex]
print("mission statement" , missionStatement)

#stem the words using our function
stemmedStatement = stemmed_words1(missionStatement)
print("Stemmed mission statement" , stemmedStatement)

#compare the number of words and see if that matches what we have in our matrix
print("Sum of words after stemmed: "  , len(stemmedStatement))
print("Same sum compared to my matrix?" , m[randomIndex].sum() == len(stemmedStatement))

#stemmedStatement is a list of words. transform expects a list of documents,
#so join the words back into one string and wrap it in a one-element list
combinedStatment  = ' '.join(stemmedStatement)
stemmedStatement = [combinedStatment]


#transform the stemmed statement into a one-row sparse matrix of TF-IDF values.
#note that V applies stemmed_words1 again, so these already-stemmed words are stemmed a second time,
#which can change the tokens and the resulting weights
tfidfList = V.transform(stemmedStatement)
print(tfidfList)
print(tfidfList.shape)
print(tfidfList[0].sum())
print("Same TF-IDF sum for the same mission statement?" )
print(tfidf[randomIndex].sum())

mission statement DISBURSE PAYMENTS TO OR ON BEHALF OF PARTICIPANTS RELATED TO MEDICAL, DENTAL, MENTAL HEALTH AND PRESCRIPTION DRUGS.
Stemmed mission statement ['disburs', 'payment', 'behalf', 'particip', 'relat', 'medic', 'dental', 'mental', 'health', 'prescript', 'drug']
Sum of words after stemmed:  11
Same sum compared to my matrix? True
  (0, 6258)	0.2611507965641592
  (0, 5881)	0.4662346282216208
  (0, 5608)	0.4308232712431329
  (0, 5563)	0.28834383153918935
  (0, 4766)	0.0989896286383229
  (0, 4739)	0.25946449952181866
  (0, 3533)	0.11045303485675782
  (0, 2512)	0.30732909856923896
  (0, 2282)	0.34177120729796506
  (0, 1052)	0.3809137667572531
(1, 8257)
2.945473763209459
Same TF-IDF sum for the same mission statement?
3.093906843500359
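One way to verify TF-IDF values is to recompute a row by hand and to transform the raw (not pre-stemmed) text, so the analyzer runs only once. The sketch below does this on a made-up three-document corpus with the default analyzer; it assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization). Names such as `toy_texts` and `by_hand` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration only.
toy_texts = [
    "mental health services",
    "health education",
    "mental health support for youth",
]

cv = CountVectorizer().fit(toy_texts)
tv = TfidfVectorizer().fit(toy_texts)

counts = cv.transform(toy_texts).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

i = 2                                           # document to check
raw = counts[i] * idf                           # term frequency times idf
by_hand = raw / np.linalg.norm(raw)             # scale the row to unit L2 norm

stored = tv.transform(toy_texts)[i].toarray().ravel()     # row from the full matrix
fresh = tv.transform([toy_texts[i]]).toarray().ravel()    # raw text transformed again

print(np.allclose(by_hand, stored))
print(np.allclose(stored, fresh))
```

In the notebook, the analogous check would compare `V.transform([missionStatement])` against `tfidf[randomIndex]`, passing the original statement rather than the joined stems.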
