<font color = green >

## Home task: Topic Modeling 

</font>


<font color = green >

### Load data 

</font>

[voted-kaggle-dataset](https://www.kaggle.com/canggih/voted-kaggle-dataset/version/2#voted-kaggle-dataset.csv)

In [1]:
import pandas as pd

TEXT_FILE = 'voted-kaggle-dataset.csv'

In [2]:
# Read text data into a DataFrame
texts_df = pd.read_csv(TEXT_FILE)

print('Length of texts_df = {:,}\n'.format(len(texts_df)))
print('Example of 11th document:')
texts_df.loc[10, 'Description']

Length of texts_df = 2,150

Example of 11th document:


'These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file. k'

In [3]:
# Remove the NA values from documents
texts = texts_df[~texts_df['Description'].isna()]['Description']

print('Length of cleared text data = {:,}'.format(len(texts)))

Length of cleared text data = 2,145


In [4]:
HOLDOUT = 3

# Select 3 random documents for further topic modeling
holdout = texts.sample(HOLDOUT, random_state=0)
texts.drop(holdout.index)
holdout = holdout.tolist()

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Transform documents into a sparse matrix of words counts (select words having 3 or more characters and exclude stop words)
vectorizer = CountVectorizer(stop_words='english', min_df=20, max_df=200, token_pattern=r'\b\w{3,}\b')
texts_vectorized = vectorizer.fit_transform(texts)

In [6]:
# Review the features of tha matrix
features = vectorizer.get_feature_names_out()

print('Length of features = {:,}\n'.format(len(features)))
print('First 50 features:')
print(features[:50])

Length of features = 1,692

First 50 features:
['100' '1000' '1995' '1st' '200' '2000' '2001' '2003' '2004' '2005' '2006'
 '2007' '2008' '2009' '2010' '2011' '2012' '2013' '2014' '2018' '2nd'
 '300' '400' '500' '600' 'ability' 'able' 'abs' 'abstract' 'academic'
 'access' 'accessed' 'accessible' 'accompanying' 'according' 'account'
 'accounts' 'accuracy' 'accurate' 'accurately' 'achieve' 'achieved'
 'acknowledge' 'acknowledgement' 'acknowledgments' 'acquired' 'act'
 'action' 'active' 'activities']


In [7]:
from gensim.matutils import Sparse2Corpus

# Build a corpus based on the matrix
corpus = Sparse2Corpus(texts_vectorized, documents_columns=False)

print('Example of corpus entries:')
for i, doc in enumerate(corpus):
    if i == 5:
        break
    print(f'{i + 1}:', doc, end='\n\n')

Example of corpus entries:
1: [(1553, 4), (385, 1), (210, 1), (1379, 1), (17, 1), (539, 1), (1034, 1), (401, 1), (699, 1), (1133, 1), (255, 3), (35, 1), (764, 1), (1612, 1), (1304, 1), (1580, 1), (796, 1), (1191, 1), (585, 3), (147, 1), (311, 1), (1032, 1), (584, 4), (1359, 1), (1552, 3), (364, 1), (1300, 1), (1611, 1), (1504, 1), (1609, 1), (213, 1), (1224, 1), (37, 2), (114, 1), (924, 1), (257, 2), (679, 1), (166, 1), (963, 2), (432, 1), (393, 1), (1081, 1), (1185, 1), (1260, 1), (1543, 1), (249, 1), (1164, 1), (315, 1), (777, 1), (728, 1)]

2: [(210, 1), (539, 1), (255, 1), (585, 1), (213, 1), (1164, 1), (921, 4), (1116, 7), (367, 1), (842, 1), (1356, 1), (12, 1), (1513, 4), (134, 5), (1418, 2), (1435, 3), (1623, 1), (642, 5), (1381, 2), (1650, 1), (1592, 1), (1512, 4), (865, 1), (351, 1), (430, 1), (920, 1), (543, 3), (662, 1), (1573, 1), (390, 1), (1498, 1), (339, 1), (490, 1), (1604, 1), (1648, 1), (274, 1), (1172, 1), (860, 1), (489, 1), (288, 1), (612, 2), (96, 1), (1347, 1), (

In [8]:
# Build a vocabulary mapping the word ids to the words
id2word_map = {i: word for word, i in vectorizer.vocabulary_.items()}

import random
ids = random.sample(list(id2word_map), 10)

print('Example of 10 random id2word_map entries:')
for id in ids:
    print(f'{id}: {id2word_map[id]}')

Example of 10 random id2word_map entries:
1175: product
444: difference
354: core
1502: tags
587: federal
908: management
1200: purpose
1056: outcomes
591: field
995: near


In [9]:
from gensim.models.ldamodel import LdaModel

# Create a LDA model with 5 topics using the corpus and the vocabulary
model = LdaModel(corpus, num_topics=5, id2word=id2word_map, passes=25, random_state=0)

In [10]:
# Show the top 20 significant words for each topic
for i, words in model.print_topics(num_topics=-1, num_words=20):
    print(f'Significant words for topic {i}:')
    print(words, end='\n\n')

Significant words for topic 0:
0.011*"column" + 0.010*"class" + 0.010*"instances" + 0.009*"images" + 0.009*"activity" + 0.008*"cell" + 0.007*"features" + 0.007*"recognition" + 0.007*"image" + 0.007*"group" + 0.006*"labels" + 0.006*"human" + 0.006*"body" + 0.006*"values" + 0.006*"results" + 0.005*"solar" + 0.005*"size" + 0.005*"mean" + 0.005*"paper" + 0.005*"attribute"

Significant words for topic 1:
0.012*"game" + 0.012*"player" + 0.011*"team" + 0.011*"price" + 0.009*"news" + 0.008*"company" + 0.008*"players" + 0.007*"games" + 0.006*"score" + 0.006*"market" + 0.006*"non" + 0.006*"integer" + 0.006*"reviews" + 0.006*"music" + 0.006*"match" + 0.005*"companies" + 0.005*"percentage" + 0.005*"python" + 0.005*"season" + 0.005*"just"

Significant words for topic 2:
0.006*"department" + 0.006*"age" + 0.006*"health" + 0.005*"gov" + 0.005*"united" + 0.005*"variables" + 0.005*"census" + 0.005*"survey" + 0.005*"country" + 0.005*"location" + 0.005*"row" + 0.004*"education" + 0.004*"crime" + 0.004*"p

### Topic naming

Now we can name the topics using the most significant words. Considering these words:

- let name of topic 0 is `Science/Technology`
- let name of topic 1 is `Games/Intertainment`
- let name of topic 2 is `Healthcare/Demographics`
- let name of topic 3 is `Education/Social media`
- let name of topic 4 is `Machine learning/Linguistics`

In [11]:
# Mapping the topic indexes to the topic names
TOPICS = {
    0: 'Science/Technology',
    1: 'Games/Intertainment',
    2: 'Healthcare/Demographics',
    3: 'Education/Social media',
    4: 'Machine learning/Linguistics'
}

def get_topics(text, vectorizer, lda_model):
    '''
    Builds a topic distribution for document `text` using the count
    vectorizer `vectorizer` and the LDA algorithm model `lda_model`
    :param text: text of document
    :type text: str
    :param vectorizer: count vectorizer
    :type vectorizer: CountVectorizer
    :param lda_model: LDA model
    :type lda_model: LdaModel
    :return: a dictionary mapping the topic names to the probabilities
    the document is related to a specific topic
    :rtype: dict[str, float]
    '''
    # Trasform the document text into a sparse matrix of words counts
    text_vectorized = vectorizer.transform([text])
    text_corpus = Sparse2Corpus(text_vectorized, documents_columns=False)

    # Costruct a topic distribution for the document using the LDA model
    topics, = list(lda_model.get_document_topics(text_corpus))
    return {TOPICS[t]: p for t, p in topics}

### Topic modeling for document samples

In [12]:
doc_1, doc_2, doc_3 = holdout

#### Document 1

In [13]:
print(doc_1)

Story
This is a set of 13,000 images from the site https://prnt.sc/. It is a site that enables users to easily upload images, either through the web interface, or, most commonly, through the downloadable screen cap tool which enables easy selection and uploading of an area of your screen. As you can see on their homepage, at the point of posting this, they have almost a billion images uploaded. The amount of information in there will be incredible, it’s an information enthusiast dream. Around 2 years ago I discovered this, and I thought it was interesting to mass download these images with a tool I created, but I was manually looking at every single image. As I became more interested in machine learning, I figured experimenting with the 20,000 or so images I had downloaded at the time from the site would be interesting, especially since because of the nature of the site and its ease of access, it gets used for a few very specific purposes which is very useful for image categorisation. 

In [14]:
# Determine which topic Document 1 is related to
topics = get_topics(doc_1, vectorizer, model)

print('Topic distribution for doc_1:', end='\n\n')
for topic in topics:
    print(f'{topic}: {topics[topic]:.4f}')

Topic distribution for doc_1:

Science/Technology: 0.3741
Games/Intertainment: 0.3479
Healthcare/Demographics: 0.2139
Machine learning/Linguistics: 0.0627


The highest probabilities have topics `Science/Technology` and `Games/Intertainment`, so Document 1 can be related to both these topics. 

#### Document 2

In [15]:
print(doc_2)

Context
Safebooru (safebooru.org) is a tag-based image archive maintained by anime enthusiasts. It allows users to post images and add tags, annotation, translations and comments. It's derived from Danbooru, and differs from it in that it disallows explicit content. It's quite popular, and there are more than 2.3 million posts as of January 24, 2018.
Content
The data was scraped via Safebooru's online API, then converted from XML to CSV (some attributes were discarded during the conversion to make the whole csv a little smaller). There are 1,934,214 rows of the metadata. Contains images uploaded to safebooru.org in the time range of 2010-01-29 through 2016-11-20.
Acknowledgements
Banner image taken from https://safebooru.org/index.php?page=post&s=view&id=1514244
Inspiration
What tags are highly correlated? Can you predict missing tags? Can you predict the score of an image based on its tags?


In [16]:
# Determine which topic Document 2 is related to
topics = get_topics(doc_2, vectorizer, model)

print('Topic distribution for doc_2:', end='\n\n')
for topic in topics:
    print(f'{topic}: {topics[topic]:.4f}')

Topic distribution for doc_2:

Science/Technology: 0.1588
Games/Intertainment: 0.2370
Healthcare/Demographics: 0.2819
Machine learning/Linguistics: 0.3181


Topic `Machine learning/Linguistics` has the highest probability value in distribution above. Then we can say that Document 2 most probably is related to topic `Machine learning/Linguistics`.

#### Document 3

In [17]:
print(doc_3)

Context
Open Payments is a national disclosure program created by the Affordable Care Act (ACA) and managed by Centers for Medicare & Medicaid Services (CMS). The purpose of the program is to promote transparency into the financial relationships between pharmaceutical and medical device industries, and physicians and teaching hospitals. The financial relationships may include consulting fees, research grants, travel reimbursements, and payments from industry to medical practitioners.
Content
There are 3 datasets that represent 3 different payment types:
General Payments: Payments not made in connection with a research agreement. This dataset contains 65 variables.
Research Payments: Payments made in connection with a research agreement. This dataset contains 166 variables.
Physician Ownership or Investment Interest: Information about physicians who hold ownership or investment interest in the manufacturer/GPO or who have an immediate family member holding such interest. This dataset co

In [18]:
# Determine which topic Document 3 is related to
topics = get_topics(doc_3, vectorizer, model)

print('Topic distribution for doc_3:', end='\n\n')
for topic in topics:
    print(f'{topic}: {topics[topic]:.4f}')

Topic distribution for doc_3:

Science/Technology: 0.1549
Games/Intertainment: 0.0671
Healthcare/Demographics: 0.7290
Education/Social media: 0.0453


Document 3 is related to topic `Healthcare/Demographics`, which have a much higher probability in the distribution.