# Part 2 | NLP Real Estate Description LDA Model
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1dMK6VoeAQbpV59hDO4vBp2R7rOLn7ErB?usp=sharing)

## Overview
| Detail Tag            | Information                                                                                        |
|-----------------------|----------------------------------------------------------------------------------------------------|
| Originally Created By | Ariel Herrera arielherrera@analyticsariel.com                                                      |
| External References   |  |
| Input Datasets        | Training datasets Nov 2020, Mar 2021                                                    |
| Output Datasets       | Topics |
| Input Data Source     | Dataframe |
| Output Data Source    | Dataframe, Visual |

## History
| Date         | Developed By  | Reason                                                |
|--------------|---------------|-------------------------------------------------------|
| 5th Apr 2021 | Ariel Herrera | Create LDA model |

## Getting Started
1. Copy this notebook -> File -> Save a Copy in Drive

## Useful Resources
- [Google Colab Cheat Sheet](https://towardsdatascience.com/cheat-sheet-for-google-colab-63853778c093)

## <font color="blue">Install Packages</font>

In [1]:
!pip install pyLDAvis==2.1.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
  Preparing metadata (setup.py) ... done
Collecting funcy (from pyLDAvis==2.1.2)
  Downloading funcy-2.0-py2.py3-none-any.whl (30 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97721 sha256=3428c3872104f20b82d7bb644e6a42398d384950c4bd7c3bd704e14b564a9bc7
  Stored in directory: /root/.cache/pip/wheels/d9/93/d6/16c95da19c32f037fd75135ea152d0df37254c25cd1a8b4b6c
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-2.0 pyLDAvis-2.1.2


## <font color="blue">Imports</font>

In [2]:
# data transformations
import os
import re
import numpy as np
import pandas as pd
from pprint import pprint

# nlp
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaModel
from gensim.test.utils import datapath
import nltk

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## <font color="blue">Functions</font>

In [4]:
def read_training_data(file_dir):
  """
  Read in both training datasets (Nov 2020 & March 2021)
  """
  _df1 = pd.read_csv(file_dir + 'READI_NLP_Training_Dataset_v1.csv')
  _df2 = pd.read_csv(file_dir + 'READI_NLP_Training_Dataset_v2.csv')

  # label dataset types
  _df1['dataset_type'] = 'v1'
  _df2['dataset_type'] = 'v2'

  # union
  _df = pd.concat([_df1, _df2])
  _df.columns = [c.replace(" ", "_") for c in _df.columns] # format cols

  # clean up labels, replace nulls with "unknown"
  _df_null = _df.loc[_df['human_label'].isnull()].copy()  # copy so the assignment below does not write to a slice
  _df_null['human_label'] = "unknown"
  _df_not_null = _df.loc[~(_df['human_label'].isnull())]

  # union, sort values
  df = pd.concat([_df_null, _df_not_null])
  return df.sort_values(by=['dataset_type', 'state_code', 'city'])

In [5]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # tokenizing drops punctuation; deacc=True strips accent marks

def remove_stopwords(texts):
    keep_list = ["as", "is", "no", "only"]
    for w in keep_list:
      if w in stop_words:
        stop_words.remove(w)
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'], allowed_words=[]):
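    """Lemmatize tokens, keeping allowed POS tags, allowed words, and phrases (tokens containing "_").
    POS tag reference: https://spacy.io/api/annotation"""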
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        # keep words in allowed list and if they are bigrams
        texts_out.append([token.lemma_ for token in doc if (token.pos_ in allowed_postags) or (str(token) in allowed_words) or ("_" in str(token))])
    return texts_out

In [6]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,  # use the function argument, not the global id2word
                                           num_topics=num_topics, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [7]:
def format_topics_sentences(ldamodel, corpus, texts):
  # Init output: collect one row per document
  # (DataFrame.append was removed in pandas 2.0, so build the frame once at the end)
  rows = []

  # Get main topic in each document
  for i, row in enumerate(ldamodel[corpus]):
      # with per_word_topics=True, row[0] holds the (topic, probability) pairs
      row = sorted(row[0], key=lambda x: (x[1]), reverse=True)
      # Get the Dominant topic, Perc Contribution and Keywords for each document
      for j, (topic_num, prop_topic) in enumerate(row):
          if j == 0:  # => dominant topic
              wp = ldamodel.show_topic(topic_num)
              topic_keywords = ", ".join([word for word, prop in wp])
              rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
          else:
              break
  sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

  # Add original text to the end of the output
  contents = pd.Series(texts)
  sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
  return(sent_topics_df)

## <font color="blue">Local & Constants</font>

In [8]:
from google.colab import output, drive
# mount drive
drive.mount("/content/drive", force_remount=False)

Mounted at /content/drive


In [9]:
# data location
base_dir = '/content/drive/My Drive/Colab Notebooks/READI/data/'
file_dir = base_dir + 'input/'

# remove warnings
warnings.filterwarnings('ignore')

## <font color="blue">Data</font>

In [10]:
# read data
df = read_training_data(file_dir='https://raw.githubusercontent.com/analyticsariel/public-data/main/')
print('Length of training data:', len(df))
df.head()

Length of training data: 12637


|     | line | city | state_code | postal_code | property_id | rdc_web_url | original_description | normalized_description | keyword_label | human_label | dataset_type |
|-----|------|------|------------|-------------|-------------|-------------|----------------------|------------------------|---------------|-------------|--------------|
| 234 | 3705 7th Ave | Birmingham | AL | 35224 | M7358062309 | https://www.realtor.com/realestateandhomes-det... | Calling All Investors!! Home sold AS IS This i... | call investors home sell nice split level home... | distressed | distressed | v1 |
| 261 | 1148 1st St N | Birmingham | AL | 35204 | M7067072160 | https://www.realtor.com/realestateandhomes-det... | Solid investment property with great bones and... | solid investment property great bone fantastic... | distressed | distressed | v1 |
| 294 | 1229 15th Way SW | Birmingham | AL | 35211 | M7396479920 | https://www.realtor.com/realestateandhomes-det... | Investment property currently rented at $795 p... | investment property currently rent 795 per mon... | distressed | not-distressed | v1 |
| 427 | 426 80th St S | Birmingham | AL | 35206 | M8264613223 | https://www.realtor.com/realestateandhomes-det... | This 4 sides brick home is the ideal investmen... | 4 side brick home ideal investment property wh... | distressed | distressed | v1 |
| 461 | 914 Knoxville Pl | Birmingham | AL | 35224 | M7604129328 | https://www.realtor.com/realestateandhomes-det... | Don't miss out on this four sided brick home! ... | dont miss four side brick home home would grea... | distressed | distressed | v1 |


In [11]:
# group by label
df.groupby(['human_label'])['line'].count()

human_label
distressed        1740
not-distressed    1440
remove             135
undecided          178
unknown           9144
Name: line, dtype: int64

## <font color="blue">Transformations</font>
Resources


*   [Topic Modeling with Gensim (Python)](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#14computemodelperplexityandcoherencescore)
*   [MALLET from UMASS](http://mallet.cs.umass.edu/)



### <font color="green">Preprocessing</font>
The following are key factors in obtaining well-separated topics:

1.   The quality of text processing.
2.   The variety of topics the text talks about.
3.   The choice of topic modeling algorithm.
4.   The number of topics fed to the algorithm.
5.   The algorithm's tuning parameters.

In [12]:
# set list of text
text = df.original_description.values.tolist()
text[:10]

['Calling All Investors!! Home sold AS IS This is a nice split level home corned lot home. Great investment property! Just need some TLC.',
 'Solid investment property with great bones and fantastic location. All brick home with tons of space. Fix and flip or add to your rental properties. You will not want to let this one get away! Bedrooms are large, living room and separate den/living area. Spacious kitchen and separate dining area.',
 'Investment property currently rented at $795 per month. Recently renovated and professionally managed. Terrific street. Step right in to cashflow! Lease runs until January. 48hr notice required for showings.',
 'This 4 sides brick home is the ideal investment property. Whether you are new to the investment world or you are a seasoned veteran, this property is cash flowing with a tenant in place. No immediate expenditures necessary. The market in this area is hot and ripe! So run your numbers and take advantage of this deal today!',
 "Don't miss out o

In [13]:
# prepare stop words
stop_words = nltk.corpus.stopwords.words('english')
stop_words.extend([]) # add stop words here

In [14]:
# tokenize and clean up text
data_words = list(sent_to_words(text))
print(data_words[:1])

[['calling', 'all', 'investors', 'home', 'sold', 'as', 'is', 'this', 'is', 'nice', 'split', 'level', 'home', 'corned', 'lot', 'home', 'great', 'investment', 'property', 'just', 'need', 'some', 'tlc']]
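
To make the tokenization step concrete, here is a rough standard-library approximation of what `simple_preprocess(..., deacc=True)` does to a description: lowercase, strip accent marks, keep alphabetic tokens of 2 to 15 characters. The helper names `strip_accents` and `rough_preprocess` are hypothetical, and this is only a sketch; gensim's own tokenizer may differ in edge cases.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose characters, drop combining marks, then recompose
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def rough_preprocess(text, min_len=2, max_len=15):
    # Lowercase, remove accents, keep alphabetic runs of acceptable length
    text = strip_accents(text.lower())
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "Calling All Investors!! Home sold AS IS This is a nice split level home."
tokens = rough_preprocess(sample)
print(tokens)
```

Punctuation disappears because only alphabetic runs are kept, and the single-letter token "a" is dropped by the minimum length filter.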


In [15]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=30) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=10)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['calling_all', 'investors', 'home', 'sold_as_is', 'this', 'is', 'nice', 'split_level', 'home', 'corned', 'lot', 'home', 'great_investment', 'property', 'just', 'need_some_tlc']
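
The `threshold` argument decides which word pairs get joined. As a minimal sketch, assuming gensim's documented default scoring formula, a pair "a b" is scored from its counts and accepted when the score exceeds the threshold. The function name `phrase_score` and all counts below are hypothetical and chosen only for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Default-style score for a candidate bigram "a b":
    # (count(a b) - min_count) * vocabulary size / (count(a) * count(b))
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts, not taken from the training data
score = phrase_score(count_a=20, count_b=15, count_ab=12, vocab_size=1000, min_count=5)

print(round(score, 2))
print("accepted at threshold=10:", score > 10)
print("accepted at threshold=30:", score > 30)
```

The same pair can pass a threshold of 10 and fail a threshold of 30, which is why a higher threshold yields fewer phrases.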


In [16]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Load spaCy's small English model; lemmatization relies on its tagger
# Install it first if needed: python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
# Optional, for efficiency: skip unused components
# nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_words=['tlc', 'rental', 'cashflow', 'rehab', 
                                                                  #  'fix_flip', 'hud_case', 'insured_escrow',
                                                                  #  'last_long', 'fixer_upper', 'cash_flowing',
                                                                  #  'rental_portfolio', 'attention_investors', 'turn_key',
                                                                  #  'highest_best', 'sold_as'
                                                                   ])

print(data_lemmatized[:1])

[['call', 'investor', 'home', 'sold_as', 'nice', 'split', 'level', 'home', 'corn', 'lot', 'home', 'great', 'investment', 'property', 'need', 'tlc']]


In [17]:
# view_example
idx = 23
print("ORIGINAL:")
print(df['original_description'].iloc[idx])
print("NO STOP WORDS:")
print(' '.join(data_words_nostops[idx]))
print("BIGRAM:")
print(' '.join(data_words_bigrams[idx]))
print("LEMMATIZE:")
print(' '.join(data_lemmatized[idx]))

ORIGINAL:
Green Acres-Central Park area, Great Price for this fixer upper, easy to view, All Information Should Be Independently Verified.
NO STOP WORDS:
green acres central park area great price fixer upper easy view information independently verified
BIGRAM:
green acres central park area great price fixer_upper easy view information independently_verified
LEMMATIZE:
acre central park area great price fixer_upper easy view information independently_verifie
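
The keep list in `remove_stopwords` matters for this dataset because words such as "as" and "is" carry signal in phrases like "sold as is". A small sketch of the idea, using a hypothetical, much shorter stop word list than NLTK's:

```python
# Hypothetical, tiny stop word list; the notebook uses NLTK's English list
stop_words = ["a", "as", "is", "this", "the", "no", "only", "just", "some"]
keep_list = ["as", "is", "no", "only"]

# Drop the keep-list entries from the stop words
active_stop_words = [w for w in stop_words if w not in keep_list]

tokens = ["home", "sold", "as", "is", "this", "is", "a", "nice", "home"]
filtered = [t for t in tokens if t not in active_stop_words]
print(filtered)
```

Here "this" and "a" are removed while "as" and "is" survive for the later phrase and lemmatization steps.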


In [18]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
# print(corpus[:1])
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('call', 1),
  ('corn', 1),
  ('great', 1),
  ('home', 3),
  ('investment', 1),
  ('investor', 1),
  ('level', 1),
  ('lot', 1),
  ('need', 1),
  ('nice', 1),
  ('property', 1),
  ('sold_as', 1),
  ('split', 1),
  ('tlc', 1)]]
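
For intuition, the bag-of-words conversion can be sketched with the standard library alone: each distinct token gets an integer id, and a document becomes a list of `(token_id, count)` pairs. This is an illustration, not gensim's implementation; the names `token2id`, `id2token`, and `bow` are hypothetical, and the alphabetical id assignment is chosen here for readability.

```python
from collections import Counter

# First lemmatized document from the output above
doc = ['call', 'investor', 'home', 'sold_as', 'nice', 'split', 'level', 'home',
       'corn', 'lot', 'home', 'great', 'investment', 'property', 'need', 'tlc']

# Assign each distinct token an integer id (alphabetical here for readability)
token2id = {token: i for i, token in enumerate(sorted(set(doc)))}
id2token = {i: t for t, i in token2id.items()}

# Bag of words: sorted (token_id, count) pairs
bow = sorted(Counter(token2id[t] for t in doc).items())

# Human-readable view of the same pairs
print([(id2token[i], n) for i, n in bow])
```

Word order is discarded; only the counts per token remain, which is the input LDA works from.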

## <font color="blue">Modeling</font>

### <font color="green">Unsupervised</font>

#### <font color="purple">Topic Modeling: LDA</font>

In [19]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=4, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [20]:
# Print the keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.039*"buyer" + 0.028*"seller" + 0.027*"offer" + 0.023*"pm" + '
  '0.019*"verify" + 0.017*"due" + 0.017*"obtain" + 0.016*"property" + '
  '0.016*"multiple_offer" + 0.012*"agent"'),
 (1,
  '0.036*"room" + 0.034*"bedroom" + 0.029*"home" + 0.024*"large" + '
  '0.018*"kitchen" + 0.018*"bath" + 0.016*"space" + 0.014*"living" + '
  '0.012*"area" + 0.012*"yard"'),
 (2,
  '0.058*"home" + 0.032*"great" + 0.016*"lot" + 0.015*"property" + '
  '0.014*"close" + 0.014*"location" + 0.012*"make" + 0.011*"potential" + '
  '0.010*"downtown" + 0.010*"bath"'),
 (3,
  '0.123*"new" + 0.038*"update" + 0.029*"floor" + 0.023*"roof" + '
  '0.021*"window" + 0.020*"kitchen" + 0.019*"appliance" + 0.017*"paint" + '
  '0.016*"home" + 0.015*"bathroom"')]


In [21]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # log_perplexity returns the per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -6.835746133224248

Coherence Score:  0.5415260122688559
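
The value printed as "Perplexity" is the per-word likelihood bound returned by `log_perplexity`. As a short sketch, assuming gensim's convention that the perplexity estimate is 2 raised to the negative bound, the conversion looks like this:

```python
# Per-word likelihood bound printed by the cell above
bound = -6.835746133224248

# Perplexity estimate: 2 ** (-bound); lower perplexity is better
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

A bound closer to zero therefore corresponds to a lower, better perplexity.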


In [22]:
# Can take a long time to run.
limit=30
step=1
%time model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=limit, step=step)

CPU times: user 29min 45s, sys: 6.39 s, total: 29min 52s
Wall time: 30min 3s


In [23]:
x = list(range(2, limit, step))
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=coherence_values))
fig.update_layout(title='Choose Optimal Model with Coherence Scores',
                   xaxis_title='Num Topics',
                   yaxis_title='Coherence Score')
fig.show()

In [24]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

Num Topics = 2  has Coherence Value of 0.5285
Num Topics = 3  has Coherence Value of 0.4328
Num Topics = 4  has Coherence Value of 0.5415
Num Topics = 5  has Coherence Value of 0.5305
Num Topics = 6  has Coherence Value of 0.5123
Num Topics = 7  has Coherence Value of 0.4868
Num Topics = 8  has Coherence Value of 0.4757
Num Topics = 9  has Coherence Value of 0.4483
Num Topics = 10  has Coherence Value of 0.4661
Num Topics = 11  has Coherence Value of 0.4552
Num Topics = 12  has Coherence Value of 0.4201
Num Topics = 13  has Coherence Value of 0.4185
Num Topics = 14  has Coherence Value of 0.4129
Num Topics = 15  has Coherence Value of 0.4428
Num Topics = 16  has Coherence Value of 0.4001
Num Topics = 17  has Coherence Value of 0.3871
Num Topics = 18  has Coherence Value of 0.3875
Num Topics = 19  has Coherence Value of 0.3823
Num Topics = 20  has Coherence Value of 0.375
Num Topics = 21  has Coherence Value of 0.3594
Num Topics = 22  has Coherence Value of 0.3728
Num Topics = 23  has C
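
The choice of `model_list[2]` in the next cell can be checked directly from the scores. This sketch uses only the coherence values visible in the printout above (2 through 22 topics; the remaining values are cut off in the output), picks the topic count with the highest score, and maps it back to its position in `model_list`. The names `coherence_by_k`, `best_k`, and `best_index` are hypothetical.

```python
# Coherence values copied from the printout above (2 to 22 topics)
coherence_by_k = {
    2: 0.5285, 3: 0.4328, 4: 0.5415, 5: 0.5305, 6: 0.5123, 7: 0.4868,
    8: 0.4757, 9: 0.4483, 10: 0.4661, 11: 0.4552, 12: 0.4201, 13: 0.4185,
    14: 0.4129, 15: 0.4428, 16: 0.4001, 17: 0.3871, 18: 0.3875, 19: 0.3823,
    20: 0.375, 21: 0.3594, 22: 0.3728,
}
start, step = 2, 1  # same values passed to compute_coherence_values

# Topic count with the highest coherence among the listed values
best_k = max(coherence_by_k, key=coherence_by_k.get)

# Position of that model in model_list, given range(start, limit, step)
best_index = (best_k - start) // step

print("best num_topics:", best_k, "| model_list index:", best_index)
```

Among the listed values, coherence falls off steadily beyond five or six topics, so the highest score is a reasonable starting point, though topic interpretability is worth checking by eye as well.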

In [25]:
# Select the model and print the topics
optimal_model = model_list[2] # 4 topics
# Save model to disk.
temp_file = datapath(base_dir + 'output/models/lda_v1')
optimal_model.save(temp_file)
optimal_model = LdaModel.load(temp_file)
# show topics
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

[(0,
  '0.039*"buyer" + 0.028*"seller" + 0.027*"offer" + 0.023*"pm" + '
  '0.019*"verify" + 0.017*"due" + 0.017*"obtain" + 0.016*"property" + '
  '0.016*"multiple_offer" + 0.012*"agent"'),
 (1,
  '0.036*"room" + 0.034*"bedroom" + 0.029*"home" + 0.024*"large" + '
  '0.018*"kitchen" + 0.018*"bath" + 0.016*"space" + 0.014*"living" + '
  '0.012*"area" + 0.012*"yard"'),
 (2,
  '0.058*"home" + 0.032*"great" + 0.016*"lot" + 0.015*"property" + '
  '0.014*"close" + 0.014*"location" + 0.012*"make" + 0.011*"potential" + '
  '0.010*"downtown" + 0.010*"bath"'),
 (3,
  '0.123*"new" + 0.038*"update" + 0.029*"floor" + 0.023*"roof" + '
  '0.021*"window" + 0.020*"kitchen" + 0.019*"appliance" + 0.017*"paint" + '
  '0.016*"home" + 0.015*"bathroom"')]


In [26]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(optimal_model, corpus, id2word)
# vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
# save to html file
pyLDAvis.save_html(vis, base_dir + 'output/lda_vis_topics.html')
vis

In [27]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=text)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

|   | Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Text |
|---|-------------|----------------|--------------------|----------|------|
| 0 | 0 | 2 | 0.6928 | home, great, lot, property, close, location, m... | Calling All Investors!! Home sold AS IS This i... |
| 1 | 1 | 1 | 0.5828 | room, bedroom, home, large, kitchen, bath, spa... | Solid investment property with great bones and... |
| 2 | 2 | 2 | 0.7335 | home, great, lot, property, close, location, m... | Investment property currently rented at $795 p... |
| 3 | 3 | 2 | 0.6636 | home, great, lot, property, close, location, m... | This 4 sides brick home is the ideal investmen... |
| 4 | 4 | 2 | 0.4878 | home, great, lot, property, close, location, m... | Don't miss out on this four sided brick home! ... |
| 5 | 5 | 2 | 0.6101 | home, great, lot, property, close, location, m... | Great rental property or beginner home! Corner... |
| 6 | 6 | 1 | 0.4642 | room, bedroom, home, large, kitchen, bath, spa... | Little Darling has 3 bedrooms, 1 full bath and... |
| 7 | 7 | 1 | 0.7849 | room, bedroom, home, large, kitchen, bath, spa... | 3 bedrooms 2 bath with a extra lot. Walk in th... |
| 8 | 8 | 0 | 0.6827 | buyer, seller, offer, pm, verify, due, obtain,... | You must see this 3 bedroom 1 bathroom home. A... |
| 9 | 9 | 2 | 0.7071 | home, great, lot, property, close, location, m... | Investors: looking for a quiet house to conver... |
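
The `Dominant_Topic` column is simply the topic with the largest share in a document's topic distribution. A minimal sketch with a hypothetical distribution for one document (the probabilities are made up for illustration):

```python
# Hypothetical topic distribution for one document: (topic_id, probability)
doc_topics = [(0, 0.05), (1, 0.20), (2, 0.69), (3, 0.06)]

# The dominant topic is the pair with the highest probability
dominant_topic, share = max(doc_topics, key=lambda pair: pair[1])

print("dominant topic:", dominant_topic, "| contribution:", share)
```

A contribution well below 1.0 means the description mixes several topics, so the dominant label alone can hide a sizeable secondary theme.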


In [28]:
# Keep the most representative document (highest contribution) for each topic
sent_topics_sorteddf = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf = pd.concat([sent_topics_sorteddf, 
                grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], 
                axis=0)

# Reset Index    
sent_topics_sorteddf.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf

|   | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|-----------|--------------------|----------|------|
| 0 | 0 | 0.9161 | buyer, seller, offer, pm, verify, due, obtain,... | This property is sold as a bundle of 4 with 14... |
| 1 | 1 | 0.9612 | room, bedroom, home, large, kitchen, bath, spa... | Beautiful, impeccably maintained split floorpl... |
| 2 | 2 | 0.9272 | home, great, lot, property, close, location, m... | Instant Equity in HOT Atlanta neighborhood! Mi... |
| 3 | 3 | 0.8859 | new, update, floor, roof, window, kitchen, app... | Renovated home better than new.All new Kitchen... |


In [29]:
sent_topics_sorteddf.loc[sent_topics_sorteddf['Topic_Num'] == 2]['Text'].iloc[0]

'Instant Equity in HOT Atlanta neighborhood! Minutes to Grant Park, East Atlanta Village, The Beacon Atlanta, Beltline, Zoo Atlanta! 4 Sided Brick centrally located close to Freeways, Downtown, Airport and Stadiums! Perfect for first time home buyer, FHA, VA, Conventional financing ok! Great opportunity for savvy investor, add this gem to your portfolio! Choose your exit, MULTIPLE EXIT strategies! Tenant in place (instant Cash Flow), renovate and sell OR tear down and go from the ground up! ARV $400k+ INSTANT EQUITY!!!!'

In [30]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords: one entry per topic, indexed by topic number
# (concatenating the per-document frame with the per-topic counts would align
# document positions with topic numbers and mix up the rows)
topic_keywords = (df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]
                  .drop_duplicates()
                  .set_index('Dominant_Topic')['Topic_Keywords'])

# Combine column wise, aligned on topic number
df_dominant_topics = pd.DataFrame({
    'Dominant_Topic': topic_counts.index,
    'Topic_Keywords': topic_keywords.reindex(topic_counts.index).values,
    'Num_Documents': topic_counts.values,
    'Perc_Documents': topic_contribution.values,
})

# Show
df_dominant_topics.sort_values(by=['Perc_Documents']).head()

|   | Dominant_Topic | Topic_Keywords | Num_Documents | Perc_Documents |
|---|----------------|----------------|---------------|----------------|
| 3 | 0 | buyer, seller, offer, pm, verify, due, obtain,... | 443 | 0.0351 |
| 2 | 3 | new, update, floor, roof, window, kitchen, app... | 1435 | 0.1136 |
| 1 | 2 | home, great, lot, property, close, location, m... | 3347 | 0.2649 |
| 0 | 1 | room, bedroom, home, large, kitchen, bath, spa... | 7412 | 0.5865 |


### <font color="green">Supervised</font>

In [31]:
# in part 3

# End Notebook