In [1]:
import os
import pandas as pd

all_text_samples = []
labels = []
all_files = []
# List all files inside the "clean_data" directory
file_list = os.listdir("clean_data/")

for filename in file_list:
    # Construct the full path to the file
    file_path = os.path.join("clean_data", filename)
    # Read the whole file; the with-block closes it afterwards
    with open(file_path, encoding="utf8") as my_text_file:
        all_text_samples.append(my_text_file.read())

# One row per text file, in a single column called "Text"
dataframe = pd.DataFrame(all_text_samples, columns=["Text"])

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
tfidf = TfidfVectorizer(max_df=0.90, min_df=4, stop_words="english")

In [23]:
dtm_with_tfidf = tfidf.fit_transform(dataframe["Text"])

In [24]:
dtm_with_tfidf

<7911x45783 sparse matrix of type '<class 'numpy.float64'>'
	with 3482007 stored elements in Compressed Sparse Row format>
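
To make the effect of `max_df`, `min_df` and `stop_words` more concrete, here is a minimal sketch on a made-up four-document corpus. The documents and the names `toy_docs`, `toy_tfidf` and `toy_dtm` are illustrative assumptions, not part of the dataset above, and `min_df` is lowered to 2 so that something survives in such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not the clean_data files)
toy_docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the dog chased the ball",
    "a bird sang in the tree",
]

# min_df=2: drop terms that appear in fewer than 2 documents
# max_df=0.9: drop terms that appear in more than 90% of documents
# stop_words="english": drop common English words such as "the"
toy_tfidf = TfidfVectorizer(max_df=0.9, min_df=2, stop_words="english")
toy_dtm = toy_tfidf.fit_transform(toy_docs)

# Terms that survived the filtering, and the matrix shape
# (number of documents, number of surviving terms)
vocabulary = sorted(toy_tfidf.vocabulary_)
print(vocabulary)
print(toy_dtm.shape)
```

Only terms shared by at least two documents remain, which is why the real matrix above has far fewer columns than the raw vocabulary would have.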

In [8]:
from sklearn.decomposition import NMF

In [25]:
nmf_model = NMF(n_components=30, random_state=1)

In [26]:
nmf_model.fit(dtm_with_tfidf)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
  n_components=30, random_state=1, shuffle=False, solver='cd', tol=0.0001,
  verbose=0)
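
As a rough illustration of what fitting NMF produces, the sketch below factorises a small random non-negative matrix into a document-by-topic matrix and a topic-by-term matrix. The matrix and the names `toy_matrix`, `toy_nmf`, `doc_topic` and `topic_term` are assumptions made for the example; the `init` and `max_iter` values are chosen only to keep the toy run stable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix: 6 "documents" x 5 "terms"
rng = np.random.RandomState(0)
toy_matrix = rng.rand(6, 5)

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=1, max_iter=500)
doc_topic = toy_nmf.fit_transform(toy_matrix)  # documents x topics
topic_term = toy_nmf.components_               # topics x terms

# The product of the two factors approximates the original matrix
reconstruction = doc_topic @ topic_term
print(doc_topic.shape, topic_term.shape)
print(np.linalg.norm(toy_matrix - reconstruction))
```

In the notebook, `nmf_model.components_` plays the role of `topic_term`, with 30 rows (topics) and one column per vocabulary term.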

As with LDA, I can look up any word in the vectorizer's vocabulary by its index.

In [27]:
tfidf.get_feature_names()[5000]

'benchwarmer'

In [28]:
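word_list = []
probability_list = []

top_number = 15
# Look up the vocabulary once instead of on every loop iteration
feature_names = tfidf.get_feature_names()

for count, probability_number in enumerate(nmf_model.components_):
    text_message = f"Top words for topic {count} are : "
    print(text_message)
    # argsort() sorts ascending, so the strongest word is printed last
    for number in probability_number.argsort()[-top_number:]:
        print([feature_names[number]], end="")
        word_list.append([feature_names[number]])
        # Store the word's weight in this topic, not its vocabulary index
        probability_list.append(probability_number[number])
    #show_chart(word_list, probability_list, text_message)
    print("\n")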

Top words for topic 0 are : 
['day']['life']['make']['think']['ll']['ve']['work']['know']['want']['things']['like']['just']['time']['don']['people']

Top words for topic 1 are : 
['el']['más']['este']['la']['se']['si']['habilitada']['continuación']['automáticamente']['reproducirá']['siguiente']['está']['vídeo']['automática']['reproducción']

Top words for topic 2 are : 
['customers']['online']['posts']['blog']['audience']['post']['share']['instagram']['brand']['twitter']['facebook']['marketing']['content']['media']['social']

Top words for topic 3 are : 
['white']['donald']['voters']['party']['mr']['house']['political']['republicans']['government']['election']['republican']['obama']['congress']['president']['trump']

Top words for topic 4 are : 
['flight']['flights']['app']['rankings']['engines']['searching']['keyword']['engine']['searches']['seo']['results']['want']['term']['google']['search']

Top words for topic 5 are : 
['diets']['health']['sugar']['body']['protein']['paleo']['heal
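
The word lists above are ordered from weakest to strongest because `argsort()` sorts in ascending order. A small sketch with made-up terms and weights (`toy_terms` and `toy_weights` are hypothetical) shows the ordering and how to reverse it:

```python
import numpy as np

# Hypothetical topic row: one weight per vocabulary term
toy_terms = np.array(["apple", "bread", "cheese", "dates", "eggs"])
toy_weights = np.array([0.10, 0.70, 0.05, 0.40, 0.00])

top_n = 3
# Indices of the top_n largest weights, weakest of them first
ascending = toy_weights.argsort()[-top_n:]
# Reverse the slice to list the strongest term first
descending = ascending[::-1]

print(list(toy_terms[ascending]))
print(list(toy_terms[descending]))
```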

### Add topic number to original dataframe

Just as we did with LDA, now I would like to add the relevant topic number to the original dataframe.

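We can view how strongly each text file is associated with each topic as follows. Note that, unlike LDA, NMF returns non-negative weights rather than probabilities, so the values in a row do not sum to 1: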

In [30]:
textfile_topics = nmf_model.transform(dtm_with_tfidf)

In [31]:
textfile_topics

array([[0.00000000e+00, 6.06487900e-04, 3.16513011e-04, ...,
        0.00000000e+00, 0.00000000e+00, 3.91064411e-05],
       [1.48472845e-02, 0.00000000e+00, 4.80045461e-03, ...,
        0.00000000e+00, 0.00000000e+00, 5.33331022e-03],
       [1.59862823e-02, 0.00000000e+00, 0.00000000e+00, ...,
        1.39454281e-03, 0.00000000e+00, 0.00000000e+00],
       ...,
       [7.06973122e-03, 0.00000000e+00, 0.00000000e+00, ...,
        7.53119750e-03, 0.00000000e+00, 3.40153233e-03],
       [1.21099398e-03, 0.00000000e+00, 1.99435698e-04, ...,
        0.00000000e+00, 9.19710790e-04, 2.09084194e-03],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 1.28626807e-02, 0.00000000e+00]])

Let's view the topic weights for the first text file:

In [32]:
textfile_topics[0]

array([0.00000000e+00, 6.06487900e-04, 3.16513011e-04, 9.72943769e-04,
       1.65906253e-04, 8.20678945e-04, 1.42364380e-03, 1.49145528e-03,
       4.49516115e-04, 0.00000000e+00, 3.91425916e-03, 0.00000000e+00,
       4.66834930e-04, 3.25802743e-03, 0.00000000e+00, 8.13547793e-04,
       8.19956243e-04, 0.00000000e+00, 5.90019433e-04, 0.00000000e+00,
       4.01814767e-04, 1.47614114e-03, 8.54672078e-02, 1.18697821e-03,
       3.39248863e-03, 2.28920823e-03, 4.04963068e-04, 0.00000000e+00,
       0.00000000e+00, 3.91064411e-05])
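
Because these weights are not on a fixed scale, it can be convenient to rescale each row so that it sums to 1 and read the result as each topic's share of the document. This is an optional step that the notebook does not perform; the sketch below uses a made-up array (`toy_topic_weights`) and guards against rows that are entirely zero.

```python
import numpy as np

# Hypothetical document-by-topic weights
toy_topic_weights = np.array([
    [0.00, 0.02, 0.08],
    [0.03, 0.00, 0.01],
    [0.00, 0.00, 0.00],  # a document with no weight on any topic
])

row_sums = toy_topic_weights.sum(axis=1, keepdims=True)
# Divide each row by its sum; all-zero rows are left as zeros
topic_shares = np.divide(
    toy_topic_weights,
    row_sums,
    out=np.zeros_like(toy_topic_weights),
    where=row_sums > 0,
)
print(topic_shares.round(2))
```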

In [33]:
# One row per text file and one column per topic,
# so there are 7911 text files and 30 topics
textfile_topics.shape

(7911, 30)

As with LDA, rounding the values to two decimal places makes the dominant topic easier to spot. This example shows the rounded weight of each topic for the first text file.

In [34]:
textfile_topics[0].round(2)

array([0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
       0.09, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ])

Topic 22 is the only topic with a non-zero weight after rounding (0.09), so it is the best match for this document.

We can use the method `argmax()` to get the position of the highest weight within the array for the first text file.

In [35]:
textfile_topics[0].argmax()

22

And just as we did previously, we can add a column called **Topic number** to the text file dataframe.

In [36]:
topic_list = []
# textfile_topics has one row per text file; each row holds
# that file's weight for each of the 30 topics
for popular_index_pos in textfile_topics:
    # argmax() returns the index of the largest weight,
    # which is the number of the dominant topic
    topic_list.append(popular_index_pos.argmax())

# Add a new column to the dataframe
dataframe["Topic number"] = topic_list
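
The same assignment can also be written without an explicit loop. The sketch below is one possible alternative on made-up data (`toy_topics` and `toy_frame` are hypothetical); it also keeps the winning weight in an extra column, which the notebook itself does not do.

```python
import numpy as np
import pandas as pd

# Hypothetical document-by-topic weights for three text files
toy_topics = np.array([
    [0.00, 0.09, 0.01],
    [0.05, 0.00, 0.02],
    [0.00, 0.01, 0.03],
])
toy_frame = pd.DataFrame({"Text": ["first", "second", "third"]})

# argmax(axis=1) finds the dominant topic of every row at once
toy_frame["Topic number"] = toy_topics.argmax(axis=1)
# Keep the winning weight so weak assignments can be spotted later
toy_frame["Topic weight"] = toy_topics.max(axis=1)
print(toy_frame)
```

One caveat: `argmax()` returns 0 for a row that is entirely zero, so a document with no weight on any topic would be labelled topic 0. Checking the stored weight helps to catch such cases.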

In [38]:
dataframe

| | Text | Topic number |
|---|---|---|
| 0 | 2016 Update: Whether you enjoy myth busting, P... | 22 |
| 1 | Let's start with the truth. The 3-point shot w... | 7 |
| 2 | Media playback is not supported on this device... | 19 |
| 3 | Krampus with babies postcard (via riptheskull/... | 6 |
| 4 | Last week, Michael Dorf published a long and c... | 25 |
| 5 | "Eva Braun was the "first lady" of the Third R... | 6 |
| 6 | Reproducción automática Si la reproducción aut... | 1 |
| 7 | Journal reference:\n\nIn C. Freksa, ed., Found... | 22 |
| 8 | 1. Keep makeup remover next to your bed so you... | 23 |
| 9 | Here, we refrain from providing another genera... | 27 |
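
Once every text file has a topic number, a natural next check is how many files fall into each topic. This is a minimal sketch on a made-up dataframe (`toy_frame` and its values are hypothetical); on the real data the same call would be made on `dataframe["Topic number"]`.

```python
import pandas as pd

# Hypothetical dataframe with the same column layout as above
toy_frame = pd.DataFrame({
    "Text": ["a", "b", "c", "d", "e"],
    "Topic number": [22, 7, 22, 6, 6],
})

# Count the files per topic and order the result by topic number
topic_sizes = toy_frame["Topic number"].value_counts().sort_index()
print(topic_sizes)
```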
