# Unguided LDA

## Initial Setup

After uploading your dataset, use train_test_split() to choose the size of your testing data.
Since we will eventually print the best-matching clusters for each project in the testing data, we have chosen a small test size for this notebook (0.2%, or slightly over 50 projects).

In [1]:
import json
import sklearn
import pprint

# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
import joblib
from sklearn.model_selection import train_test_split

from UnguidedLDA import NGOUnguidedLDA

# Using project summaries instead of scraped data for improved clustering results.
with open("project_summaries.json", "r") as datafile:
    input_data = json.load(datafile)

# Set up training and testing data
# Clusters won't match NGO categories directly, so there is no need for a large test set to compute an "accuracy statistic".
training_data, testing_data = train_test_split(input_data, test_size=0.002)

## Process Project Text

We will tokenize, stem, and remove stopwords from the text provided for each project. 
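
The internals of `process_projects()` are not shown in this notebook. As a rough illustration of what tokenizing, stopword removal, and stemming involve, here is a minimal standard-library sketch. The `STOPWORDS` set, `naive_stem()`, and `process_text()` are hypothetical names introduced for this example, and the suffix rules are far cruder than a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"the", "of", "this", "is", "to", "for", "and", "a", "in"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer or lemmatizer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"          # families -> family
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]                # projects -> project
    return token


def process_text(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


example = "The main purpose of this project is to provide food and shelter for families."
print(process_text(example))
# ['main', 'purpose', 'project', 'provide', 'food', 'shelter', 'family']
```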

In [2]:
clusterer = NGOUnguidedLDA(training_data)
processed_projects = clusterer.process_projects()

In [3]:
# Print out the processed text for the first 10 projects
first10_projects = [processed_projects[i] for i in range(0,10)]
pprint.pprint(first10_projects)

[['main',
  'purpose',
  'project',
  'provide',
  'food',
  'proper',
  'home',
  'shelter',
  'clean',
  'clothing',
  'medical',
  'care',
  'education',
  'needy',
  'child',
  'several',
  'slum',
  'macedonia',
  'child',
  'family',
  'victim',
  'extreme',
  'poverty',
  'social',
  'exclusion',
  'mainly',
  'abandoned',
  'forgotten',
  'institution',
  'project',
  'support',
  'socially',
  'excluded',
  'economically',
  'marginalized',
  'family',
  'child',
  'providing',
  'food',
  'school',
  'fee',
  'school',
  'kit',
  'uniform',
  'one',
  'year'],
 ['purpose',
  'project',
  'provide',
  'early',
  'childhood',
  'education',
  'health',
  'care',
  'recreation',
  'underprivileged',
  'child',
  'mine',
  'worker',
  'kolar',
  'gold',
  'field',
  'kgf',
  'state',
  'karnataka',
  'india',
  'mine',
  'worker',
  'lost',
  'job',
  'year',
  'back',
  'mine',
  'shut',
  'government',
  'without',
  'making',
  'alternative',
  'arrangement',
  'rehabilitation

## Create Training Dictionary

The purpose of the training dictionary is to contain the most significant words appearing in the training data.

Set how many top words you want in the training dictionary (in this case, 40,000) as well as the maximum proportion of the training data (in this case, 40%) a candidate top word can appear in.

This maximum proportion matters because a word that appears in the majority of projects does little to differentiate or describe any one of them, and is therefore ineffective for clustering those projects.
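
To make the two settings concrete, the sketch below builds a small dictionary from tokenized documents using only the standard library. It is an illustration under assumptions, not the code behind `create_training_dict()`: the function name `build_training_dict()`, the tie-breaking rule, and the alphabetical ID assignment are all hypothetical choices made for this example.

```python
from collections import Counter


def build_training_dict(docs, max_doc_fraction, keep_n):
    """Return {token_id: token} for the keep_n most widespread tokens
    that appear in at most max_doc_fraction of the documents."""
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    limit = max_doc_fraction * len(docs)
    candidates = [(tok, df) for tok, df in doc_freq.items() if df <= limit]

    # Most frequent first; ties broken alphabetically (an arbitrary choice).
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    kept = sorted(tok for tok, _ in candidates[:keep_n])

    # IDs are just labels; they carry no information about significance.
    return {token_id: tok for token_id, tok in enumerate(kept)}


docs = [
    ["child", "school", "food"],
    ["child", "water"],
    ["child", "school", "tree"],
    ["child", "farmer"],
]
# "child" appears in 100% of documents, so a 50% cap removes it.
print(build_training_dict(docs, 0.5, 3))
# {0: 'farmer', 1: 'food', 2: 'school'}
```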

In [4]:
# Keeping 40000 top words in the training dictionary, each of which cannot appear in over 40% of corpus
training_dict = clusterer.create_training_dict(0.4, 40000)

Note that dictionary keys are simply IDs corresponding to each token; they have no relation to the word's significance or count.

In [5]:
# Print out the first ten pairs from the dictionary.
# Note that dictionary keys are token IDs, values are the words/tokens themselves.
first10_pairs = {k: training_dict[k] for k in list(training_dict)[:10]}
pprint.pprint(first10_pairs)

{0: 'abandoned',
 1: 'care',
 2: 'child',
 3: 'clean',
 4: 'clothing',
 5: 'economically',
 6: 'education',
 7: 'excluded',
 8: 'exclusion',
 9: 'extreme'}


## LDA Modeling with Bag of Words Corpus

We will now create the LDA Model, whose clusters should represent the variety of projects in the dataset.

Set the number of clusters for the model as well as the number of processes for parallelization.

The Bag of Words corpus contains, for each project, the words that can be found in the training dictionary.
For example, the representation of a particular project within the corpus may look like: 
[(token_id_1, count), (token_id_2, count)]. This corpus is then used to create the LDA Model.
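
As a small illustration of that representation, the sketch below converts one tokenized project into `(token_id, count)` pairs. The helper `to_bow()` is a hypothetical name, not part of `NGOUnguidedLDA`; the three token IDs are taken from the dictionary excerpt printed earlier in this notebook, and the token list is made up.

```python
from collections import Counter


def to_bow(tokens, token_to_id):
    """Count occurrences of each known token and return sorted (id, count) pairs.
    Tokens missing from the dictionary are ignored."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


# Subset of the training dictionary shown above, inverted to token -> id.
token_to_id = {"care": 1, "child": 2, "education": 6}

tokens = ["child", "care", "child", "education", "child", "macedonia"]
print(to_bow(tokens, token_to_id))
# [(1, 1), (2, 3), (6, 1)]
```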

In [6]:
# Create LDA Model with 10 clusters, 3 extra processes for parallelization, and no tf-idf.
clusterer.create_lda_model(10, 3, False)

# Print out the first 10 projects from the Bag of Words Corpus
first10_bow = [clusterer.word_corpus[i] for i in range(0,10)]
pprint.pprint(first10_bow)

[[(0, 1),
  (1, 1),
  (2, 3),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 2),
  (11, 1),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 2),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1)],
 [(1, 1),
  (2, 2),
  (6, 1),
  (24, 1),
  (26, 1),
  (28, 1),
  (29, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 3),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 2)],
 [(1, 1),
  (12, 1),
  (26, 1),
  (31, 1),
  (35, 1),
  (48, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 4),
  (68, 1),
  (69, 1),
  (70, 2),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),


Use print_lda_topics() to display the top words for each cluster.

In [7]:
# Print out the top words from each LDA Cluster.
clusterer.print_lda_topics()

Topic: 0 
Words: 0.027*"child" + 0.025*"water" + 0.018*"community" + 0.013*"health" + 0.010*"provide" + 0.008*"people" + 0.008*"support" + 0.007*"access" + 0.007*"education" + 0.007*"family"
Topic: 1 
Words: 0.014*"community" + 0.009*"help" + 0.007*"rural" + 0.007*"family" + 0.006*"woman" + 0.005*"child" + 0.005*"area" + 0.005*"forest" + 0.005*"farmer" + 0.005*"tree"
Topic: 2 
Words: 0.029*"woman" + 0.020*"training" + 0.015*"skill" + 0.010*"business" + 0.009*"family" + 0.009*"help" + 0.009*"provide" + 0.009*"life" + 0.008*"machine" + 0.007*"sewing"
Topic: 3 
Words: 0.063*"school" + 0.031*"girl" + 0.030*"child" + 0.022*"education" + 0.012*"provide" + 0.009*"year" + 0.008*"support" + 0.008*"community" + 0.008*"help" + 0.007*"book"
Topic: 4 
Words: 0.019*"student" + 0.016*"school" + 0.011*"program" + 0.011*"youth" + 0.009*"community" + 0.008*"provide" + 0.008*"education" + 0.008*"child" + 0.007*"skill" + 0.007*"support"
Topic: 5 
Words: 0.011*"computer" + 0.010*"community" + 0.008*"people

Since this is a topic clustering algorithm (as opposed to topic classification), we cannot directly compare the clusters we have found to predefined topics.

Thus, our testing will consist of finding the best-matching clusters for each project in the testing data.
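
The selection step can be sketched as follows: given one project's topic scores, keep the highest-scoring topics, up to a maximum. This is an assumption about how "up to 3 closest clusters" might work, not the actual body of `test_lda_model()`. The function `closest_topics()`, the `min_score` cutoff, and the example scores and topic IDs are all hypothetical.

```python
def closest_topics(topic_scores, max_topics, min_score=0.01):
    """Return up to max_topics (topic_id, score) pairs, highest score first.
    Topics scoring below min_score are dropped, so fewer may be returned."""
    ranked = sorted(topic_scores, key=lambda pair: pair[1], reverse=True)
    return [(tid, score) for tid, score in ranked if score >= min_score][:max_topics]


# Made-up topic distribution for one project.
scores = [(0, 0.05), (2, 0.202), (4, 0.203), (7, 0.426), (9, 0.117)]
print(closest_topics(scores, 3))
# [(7, 0.426), (4, 0.203), (2, 0.202)]

# A project dominated by a single topic yields fewer than three matches.
print(closest_topics([(1, 0.98)], 3))
# [(1, 0.98)]
```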

In [8]:
# For each project, print the top 5 words from (up to) 3 "closest" clusters to this project.
clusterer.test_lda_model(testing_data, 3, 5)

Project ID: 23435, Theme: Women and Girls
Score: 0.4256046712398529	 Topic: 0.015*"woman" + 0.014*"life" + 0.010*"people" + 0.010*"help" + 0.009*"support"
Score: 0.20305265486240387	 Topic: 0.019*"student" + 0.016*"school" + 0.011*"program" + 0.011*"youth" + 0.009*"community"
Score: 0.20191477239131927	 Topic: 0.029*"woman" + 0.020*"training" + 0.015*"skill" + 0.010*"business" + 0.009*"family"
Project ID: 30326, Theme: Climate Change
Score: 0.42823344469070435	 Topic: 0.029*"child" + 0.017*"food" + 0.014*"education" + 0.012*"provide" + 0.012*"year"
Score: 0.3474957048892975	 Topic: 0.014*"community" + 0.009*"help" + 0.007*"rural" + 0.007*"family" + 0.006*"woman"
Score: 0.13547582924365997	 Topic: 0.063*"school" + 0.031*"girl" + 0.030*"child" + 0.022*"education" + 0.012*"provide"
Project ID: 26171, Theme: Children
Score: 0.7823265194892883	 Topic: 0.063*"school" + 0.031*"girl" + 0.030*"child" + 0.022*"education" + 0.012*"provide"
Score: 0.20026028156280518	 Topic: 0.064*"child" + 0.017*

Score: 0.6485225558280945	 Topic: 0.015*"woman" + 0.014*"life" + 0.010*"people" + 0.010*"help" + 0.009*"support"
Score: 0.32778656482696533	 Topic: 0.014*"community" + 0.009*"help" + 0.007*"rural" + 0.007*"family" + 0.006*"woman"
Project ID: 20537, Theme: Women and Girls
Score: 0.4350419342517853	 Topic: 0.064*"child" + 0.017*"help" + 0.014*"education" + 0.009*"family" + 0.009*"school"
Score: 0.3102205693721771	 Topic: 0.029*"child" + 0.017*"food" + 0.014*"education" + 0.012*"provide" + 0.012*"year"
Score: 0.13961444795131683	 Topic: 0.063*"school" + 0.031*"girl" + 0.030*"child" + 0.022*"education" + 0.012*"provide"
Project ID: 24590, Theme: Women and Girls
Score: 0.9812462329864502	 Topic: 0.014*"community" + 0.009*"help" + 0.007*"rural" + 0.007*"family" + 0.006*"woman"
Project ID: 14823, Theme: Economic Development
Score: 0.7532786726951599	 Topic: 0.027*"child" + 0.025*"water" + 0.018*"community" + 0.013*"health" + 0.010*"provide"
Score: 0.22850124537944794	 Topic: 0.019*"student" +

## LDA Modeling with TF-IDF

We will set the TF-IDF attribute to True to update the LDA model to use a TF-IDF representation of the corpus.

[TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) stands for term frequency-inverse document frequency, a weighting that reflects how important a word is to the projects it appears in.

Term frequency (TF) measures how often a word occurs within a given project. Inverse document frequency (IDF) weights "rare" words: a word that occurs in few projects across the corpus (which increases its significance to the projects it DOES belong to) receives a high IDF score. TF-IDF is the product of these two values.

The benefit of TF-IDF is that it gives more weight to words that are uncommon in the overall corpus, increasing their significance when clustering the projects they belong to.
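
To show the arithmetic, the sketch below applies one common TF-IDF variant to a toy Bag of Words corpus: raw count times `log2(N / document_frequency)`, followed by scaling each project's vector to unit length. Whether `create_lda_model()` uses exactly this weighting is an assumption; `tfidf_corpus()` and the toy corpus are hypothetical.

```python
import math


def tfidf_corpus(bow_corpus):
    """Convert [(token_id, count), ...] documents to unit-length TF-IDF vectors."""
    n_docs = len(bow_corpus)

    # Document frequency of each token ID.
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in bow_corpus:
        vec = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        vec = [(tid, w) for tid, w in vec if w > 0]   # tokens in every doc get weight 0
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted


# Token 0 appears in every document; tokens 2 and 3 appear in only one each.
bow = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 1), (1, 1), (3, 1)],
    [(0, 1)],
]
for doc in tfidf_corpus(bow):
    print(doc)
```

In the third toy document, the rare token 3 ends up with twice the weight of token 1, even though both occur once; token 0 disappears because it occurs everywhere.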

In [9]:
# Create LDA Model, but with tf-idf this time.
clusterer.create_lda_model(10, 3, True)

# Print out the first 10 projects from the TF-IDF corpus
first10_tfidf = [clusterer.word_corpus[i] for i in range(0,10)]
pprint.pprint(first10_tfidf)

[[(0, 0.16489479325009543),
  (1, 0.08658567935625132),
  (2, 0.11577116037208579),
  (3, 0.13990418769006008),
  (4, 0.17115730581877878),
  (5, 0.1738187608689246),
  (6, 0.05596087831864857),
  (7, 0.23696491687083726),
  (8, 0.24018164672621503),
  (9, 0.17561070381185384),
  (10, 0.13514976208381227),
  (11, 0.14672305874354322),
  (12, 0.17803153292514579),
  (13, 0.25089311888244137),
  (14, 0.09663801939463718),
  (15, 0.19112408256868096),
  (16, 0.1717466456991798),
  (17, 0.31919429922661635),
  (18, 0.1703467128146759),
  (19, 0.20894202560225106),
  (20, 0.15663400983256495),
  (21, 0.11683809460321949),
  (22, 0.17846073087227549),
  (23, 0.08790299945155791),
  (24, 0.09681614570504711),
  (25, 0.16910557323553024),
  (26, 0.054938591942249114),
  (27, 0.08560553756852136),
  (28, 0.1835605721856881),
  (29, 0.10238095901473095),
  (30, 0.1851948532756534),
  (31, 0.1354891257734888),
  (32, 0.140654366876962),
  (33, 0.10541759570777393),
  (34, 0.1971875190728986),
  (

In [10]:
# Print out the top words from each LDA Cluster.
clusterer.print_lda_topics()

Topic: 0 
Words: 0.005*"child" + 0.004*"family" + 0.003*"woman" + 0.003*"care" + 0.003*"school" + 0.003*"water" + 0.003*"health" + 0.003*"provide" + 0.003*"support" + 0.003*"life"
Topic: 1 
Words: 0.006*"school" + 0.006*"water" + 0.004*"community" + 0.004*"woman" + 0.004*"student" + 0.004*"girl" + 0.003*"child" + 0.003*"education" + 0.003*"training" + 0.003*"help"
Topic: 2 
Words: 0.005*"child" + 0.004*"family" + 0.004*"person" + 0.003*"people" + 0.003*"provide" + 0.003*"home" + 0.003*"homeless" + 0.003*"need" + 0.003*"food" + 0.003*"age"
Topic: 3 
Words: 0.004*"child" + 0.004*"youth" + 0.004*"skill" + 0.004*"woman" + 0.003*"community" + 0.003*"training" + 0.003*"program" + 0.003*"student" + 0.003*"people" + 0.003*"school"
Topic: 4 
Words: 0.008*"child" + 0.008*"school" + 0.008*"girl" + 0.007*"education" + 0.004*"support" + 0.004*"orphan" + 0.004*"woman" + 0.003*"community" + 0.003*"provide" + 0.003*"help"
Topic: 5 
Words: 0.008*"microproject" + 0.008*"attend" + 0.007*"girl" + 0.007*"s

In [11]:
# For each project, print the top 5 words from (up to) 3 "closest" clusters to this project.
clusterer.test_lda_model(testing_data, 3, 5)

Project ID: 23435, Theme: Women and Girls
Score: 0.7495407462120056	 Topic: 0.004*"child" + 0.004*"youth" + 0.004*"skill" + 0.004*"woman" + 0.003*"community"
Score: 0.14842475950717926	 Topic: 0.005*"child" + 0.004*"family" + 0.003*"woman" + 0.003*"care" + 0.003*"school"
Score: 0.08739704638719559	 Topic: 0.005*"child" + 0.005*"water" + 0.004*"health" + 0.004*"school" + 0.004*"woman"
Project ID: 30326, Theme: Climate Change
Score: 0.5938747525215149	 Topic: 0.006*"woman" + 0.005*"child" + 0.003*"family" + 0.003*"food" + 0.003*"community"
Score: 0.20081591606140137	 Topic: 0.005*"child" + 0.005*"water" + 0.004*"health" + 0.004*"school" + 0.004*"woman"
Score: 0.19007012248039246	 Topic: 0.006*"school" + 0.006*"water" + 0.004*"community" + 0.004*"woman" + 0.004*"student"
Project ID: 26171, Theme: Children
Score: 0.3763737678527832	 Topic: 0.006*"child" + 0.006*"school" + 0.005*"year" + 0.004*"education" + 0.004*"care"
Score: 0.3115409314632416	 Topic: 0.005*"child" + 0.004*"family" + 0.00

Project ID: 6376, Theme: Climate Change
Score: 0.9470373392105103	 Topic: 0.006*"woman" + 0.005*"child" + 0.003*"family" + 0.003*"food" + 0.003*"community"
Project ID: 22308, Theme: Education
Score: 0.3389875292778015	 Topic: 0.006*"child" + 0.006*"school" + 0.005*"year" + 0.004*"education" + 0.004*"care"
Score: 0.26382479071617126	 Topic: 0.008*"child" + 0.008*"school" + 0.008*"girl" + 0.007*"education" + 0.004*"support"
Score: 0.1587633341550827	 Topic: 0.004*"child" + 0.004*"youth" + 0.004*"skill" + 0.004*"woman" + 0.003*"community"
Project ID: 28800, Theme: Children
Score: 0.6707894206047058	 Topic: 0.008*"child" + 0.008*"school" + 0.008*"girl" + 0.007*"education" + 0.004*"support"
Score: 0.17680984735488892	 Topic: 0.005*"child" + 0.004*"family" + 0.003*"woman" + 0.003*"care" + 0.003*"school"
Score: 0.13647723197937012	 Topic: 0.006*"woman" + 0.005*"child" + 0.003*"family" + 0.003*"food" + 0.003*"community"
Project ID: 19055, Theme: Women and Girls
Score: 0.6125174760818481	 Topic