# Topic Modeling with scikit-learn

In this notebook we create a topic model from our corpus using the scikit-learn library. We save the results here and explore them in a separate notebook.

# Set Up

## Imports

In [1]:
import pandas as pd
import numpy as np
import lib.tapi as tapi

## Configuration

In [2]:
tapi.list_corpora()

['airbnb',
 'anphoblacht',
 'arxiv',
 'covid19',
 'jstor_hyperparameter',
 'novels',
 'okcupid',
 'tamilnet',
 'winereviews',
 'yelp',
 'zuboff']

In [3]:
data_prefix = 'winereviews'

In [4]:
db = tapi.Edition(data_prefix)

## Parameters

In [5]:
n_terms = 4000 # Vocabulary size
ngram_range = (1,4) # ngram min and max lengths
n_topics = 20 # Number of topics
max_iter = 5 # Number of iterations for topic model

In [6]:
topic_cols = list(range(n_topics))

## Create Tables Object

These tables constitute a "digital critical edition."

# Import Corpus Data

We import a corpus in our standard format.

In [7]:
corpus = db.get_corpus()

## Inspect contents

In [8]:
corpus.head()

doc_id,doc_key,doc_title,doc_label,doc_province,doc_points,doc_price,doc_content,doc_original,doc_variety,doc_taster,doc_place
0,86023,Lange 2011 Three Hills Cuvée Pinot Noir (Willa...,US,Oregon,88,40.0,"A tart, astringent Pinot, it needs a bit more ...","A tart, astringent Pinot, it needs a bit more ...",Pinot Noir,Paul Gregutt,US Oregon Willamette Valley Willamette Valley
1,45852,Finca Casa Lo Alto 2008 Reserva Red (Utiel-Req...,Spain,Levante,84,35.0,The cola and licorice aromas are candied and e...,The cola and licorice aromas are candied and e...,Red Blend,Michael Schachner,Spain Levante Utiel-Requena
2,32297,Plantagenet 2004 Omrah Cabernet Sauvignon (Wes...,Australia,Western Australia,88,15.0,"A good value, this starts off a little shaky t...","A good value, this starts off a little shaky t...",Cabernet Sauvignon,Joe Czerwinski,Australia Western Australia Western Australia
3,43293,Bougrier 2012 Rosé d'Anjou (Rosé) by Roger Voss,France,Loire Valley,84,13.0,"Typical, light and sweet rosé, fruity with bri...","Typical, light and sweet rosé, fruity with bri...",Rosé,Roger Voss,France Loire Valley Rosé d'Anjou
4,118523,Bolla 2007 Le Poiane (Valpolicella Classico S...,Italy,Veneto,87,14.0,If you aren't familiar with Ripasso (a hybrid ...,If you aren't familiar with Ripasso (a hybrid ...,"Corvina, Rondinella, Molinara",,Italy Veneto Valpolicella Classico Superiore R...


In [9]:
corpus.shape

(10000, 11)

In [10]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   doc_key       10000 non-null  int64  
 1   doc_title     10000 non-null  object 
 2   doc_label     9993 non-null   object 
 3   doc_province  9993 non-null   object 
 4   doc_points    10000 non-null  int64  
 5   doc_price     9301 non-null   float64
 6   doc_content   10000 non-null  object 
 7   doc_original  10000 non-null  object 
 8   doc_variety   10000 non-null  object 
 9   doc_taster    7945 non-null   object 
 10  doc_place     10000 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 859.5+ KB


# Create Bag-of-Words 

i.e., a __Count Vector Space__

We use scikit-learn's `CountVectorizer` to convert our F1 corpus of paragraphs into a document-term vector space of term counts, where a term is an n-gram of one to four words.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [12]:
count_engine = CountVectorizer(max_features=n_terms, stop_words='english', ngram_range=ngram_range)
count_model = count_engine.fit_transform(corpus.doc_content)
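
To make the effect of `stop_words` and `ngram_range` concrete, here is a minimal sketch on two made-up documents (the `toy_*` names and texts are illustrative assumptions, not part of the corpus). Note that stop words are dropped before n-grams are formed, so a bigram can join words that were not adjacent in the original text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents standing in for wine reviews (invented for illustration)
toy_docs = ["ripe black cherry", "black cherry and ripe plum"]

toy_engine = CountVectorizer(stop_words='english', ngram_range=(1, 2))
toy_counts = toy_engine.fit_transform(toy_docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; columns are in sorted term order
toy_terms = sorted(toy_engine.vocabulary_)
print(toy_terms)
print(toy_counts.toarray())
```

In this sketch "and" is removed as a stop word, which is why "cherry ripe" shows up as a bigram for the second document.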

## Get Generated VOCAB

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
db.VOCAB = pd.DataFrame(count_engine.get_feature_names_out(), columns=['term_str'])
db.VOCAB = db.VOCAB.set_index('term_str')
db.VOCAB['ngram_len'] = None # To be added later
# VOCAB.index.name = 'term_id' # For convenience, we'll use strings for IDs

## Get Generated BOW

We do this just to show what the count vectorizer produced. `DTM` stands for document-term matrix. We convert this sparse matrix into a "thin" dataframe, `BOW`, that keeps only the terms with nonzero counts in each document.

In [14]:
db.DTM = pd.DataFrame(count_model.toarray(), index=corpus.index, columns=db.VOCAB.index)
db.BOW = db.DTM.stack().to_frame('n')
db.BOW = db.BOW[~(db.BOW.n == 0)]
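
Calling `toarray()` materializes every zero cell, which is why the dense `DTM` takes hundreds of megabytes. If only the thin `BOW` table is needed, one possible alternative is to build it straight from the sparse matrix. The sketch below assumes a toy count matrix and term list (both invented) and uses SciPy's COO format, which lists only the nonzero cells.

```python
import pandas as pd
from scipy import sparse

# Toy 2x3 count matrix (values invented for illustration)
toy_counts = sparse.csr_matrix([[2, 0, 1],
                                [0, 3, 0]])
toy_terms = ['cherry', 'oak', 'plum']

coo = toy_counts.tocoo()  # row, col, and data arrays for the nonzero cells only
toy_bow = pd.DataFrame({
    'doc_id': coo.row,
    'term_str': [toy_terms[j] for j in coo.col],
    'n': coo.data,
}).set_index(['doc_id', 'term_str']).sort_index()
print(toy_bow)
```

The result has the same shape as `db.BOW`: one row per (document, term) pair with a nonzero count.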

In [15]:
db.DTM.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 4000 entries, 000 to zippy acidity
dtypes: int64(4000)
memory usage: 305.2 MB


In [16]:
db.BOW.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 267755 entries, (0, 'accented') to (9999, 'wine')
Columns: 1 entries, n to n
dtypes: int64(1)
memory usage: 3.3+ MB


## Compute TF-IDF

In [17]:
tfidf_engine = TfidfTransformer()
tfidf_model = tfidf_engine.fit_transform(count_model)
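
With its default settings, `TfidfTransformer` multiplies raw counts by a smoothed inverse document frequency and then scales each document row to unit L2 length. The following sketch reproduces that by hand on a small invented count matrix; it is meant as a check on the formula, not as a replacement for the cell above.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 2 terms (values invented for illustration)
counts = np.array([[1, 1],
                   [2, 0],
                   [1, 0]])

tfidf = TfidfTransformer().fit_transform(sparse.csr_matrix(counts)).toarray()

# Reproduce the defaults by hand: smoothed IDF, then L2-normalize each row
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # formula used when smooth_idf=True
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(tfidf, manual))
```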

In [18]:
db.TFIDF = pd.DataFrame(tfidf_model.toarray(), index=corpus.index, columns=db.VOCAB.index)

In [19]:
db.BOW['tfidf'] = db.TFIDF.stack()

In [20]:
db.BOW

doc_id,term_str,n,tfidf
0,accented,1,0.212521
0,astringent,1,0.219259
0,barrel,1,0.183510
0,bit,1,0.164003
0,bottle,1,0.198999
...,...,...,...
9999,showing,1,0.160109
9999,stone,1,0.161591
9999,stone fruit,1,0.177589
9999,texture,1,0.118262


## Add Features to VOCAB

In [21]:
db.VOCAB['ngram_len'] = db.VOCAB.apply(lambda x: len(x.name.split()), 1)
db.VOCAB['n'] = db.DTM.sum()
db.VOCAB['tfidf_mean'] = db.TFIDF.mean()

In [22]:
db.VOCAB.ngram_len.value_counts()

1    2099
2    1722
3     172
4       7
Name: ngram_len, dtype: int64

# Generate Topic Models

We run scikit-learn's [LatentDirichletAllocation algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) and extract the THETA (document-topic) and PHI (topic-term) tables.

In [23]:
from sklearn.decomposition import LatentDirichletAllocation as LDA, NMF

## Using LDA

In [24]:
lda_engine = LDA(n_components=n_topics, max_iter=max_iter, learning_offset=50., random_state=0)

### THETA

In [25]:
db.THETA = pd.DataFrame(lda_engine.fit_transform(count_model), index=corpus.index)
db.THETA.index.name = 'doc_id'
db.THETA.columns.name = 'topic_id'
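
Each row of THETA is a probability distribution over topics, so its values sum to 1. A quick way to see this is the sketch below, which fits a small LDA model on random counts (the data and the `toy_*` names are assumptions made up for illustration).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(30, 12))  # random toy document-term counts

toy_lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
toy_theta = toy_lda.fit_transform(toy_counts)

# Each row is a distribution over the 3 topics
print(toy_theta.shape)
print(np.allclose(toy_theta.sum(axis=1), 1.0))
```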

In [26]:
db.THETA.sample(20).style.background_gradient()

doc_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
8155,0.002778,0.002778,0.002778,0.002778,0.002778,0.132343,0.002778,0.002778,0.002778,0.002778,0.002778,0.385319,0.002778,0.002778,0.002778,0.435116,0.002778,0.002778,0.002778,0.002778
8298,0.001613,0.001613,0.001613,0.001613,0.089914,0.001613,0.001613,0.001613,0.001613,0.001613,0.001613,0.001613,0.001613,0.001613,0.135222,0.001613,0.137885,0.001613,0.001613,0.611172
3272,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.002083,0.406074,0.002083,0.002083,0.002083,0.002083,0.556426,0.002083
1262,0.003846,0.286451,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.003846,0.316838,0.003846,0.003846,0.003846,0.331326,0.003846
7451,0.000962,0.000962,0.000962,0.000962,0.000962,0.068426,0.877835,0.000962,0.000962,0.000962,0.000962,0.000962,0.000962,0.000962,0.000962,0.000962,0.037392,0.000962,0.000962,0.000962
8251,0.002381,0.002381,0.002381,0.002381,0.002381,0.746307,0.002381,0.002381,0.002381,0.002381,0.002381,0.210835,0.002381,0.002381,0.002381,0.002381,0.002381,0.002381,0.002381,0.002381
7751,0.001667,0.001667,0.001667,0.610767,0.058474,0.302426,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667,0.001667
7639,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.956818,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273,0.002273
746,0.001515,0.001515,0.001515,0.868663,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.001515,0.104064
7450,0.002632,0.002632,0.002632,0.431175,0.002632,0.521457,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632,0.002632


### PHI

In [27]:
db.PHI = pd.DataFrame(lda_engine.components_, columns=db.VOCAB.index)
db.PHI.index.name = 'topic_id'
db.PHI.columns.name  = 'term_str'
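
The values in `components_` are not probabilities. They can be read as pseudo-counts of how often a term was assigned to a topic, with the topic-word prior as a floor. To compare topics on a common scale, each row can be divided by its total. A minimal sketch with invented numbers:

```python
import pandas as pd

# Toy topic-term pseudo-counts (values invented for illustration)
toy_phi = pd.DataFrame([[17.6, 0.05, 2.35],
                        [0.05, 9.95, 0.0]],
                       columns=['cases', 'merlot', 'years'])
toy_phi.index.name = 'topic_id'

# Divide each row by its total so every topic becomes a distribution over terms
toy_phi_norm = toy_phi.div(toy_phi.sum(axis=1), axis=0)
print(toy_phi_norm.round(3))
```

Applied to `db.PHI`, the same `div` call would give a table whose rows each sum to 1.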

In [28]:
db.PHI.T.head().style.background_gradient()

term_str,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
000,0.099775,0.05,0.05,1.05,0.131313,0.05,0.05,0.05,0.05,0.05,17.609169,0.05,0.05,0.05,0.05,0.05,0.050046,1.05,0.05,0.359698
000 cases,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,13.65166,0.05,0.05,0.05,0.05,0.05,0.050111,0.05,0.05,0.448229
10,3.022868,6.050108,0.086162,12.000268,2.211543,0.101013,22.331032,1.045292,10.185422,1.16038,0.050408,1.294196,30.28955,39.21027,2.655304,0.616758,0.05,0.05,5.396295,4.19313
10 merlot,0.05,0.05,0.05,0.074871,0.05,0.05,3.677772,0.05,0.05,0.05,0.060776,0.05001,10.148418,0.05,0.05,0.05,0.05,0.05,0.05,0.288153
10 years,0.05,0.050028,0.246771,4.402638,0.05,0.05,0.239903,0.05,9.034005,0.05,0.05,0.05,0.05,9.282589,0.714246,0.05,0.05,0.05,6.429821,0.05


### Create Topic Glosses

In [29]:
n_top_words = 7

In [30]:
db.TOPICS = db.PHI.stack()\
    .to_frame('weight')\
    .groupby('topic_id')\
    .apply(lambda x: x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)
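
The chain above picks the `n_top_words` heaviest terms per topic. An alternative that some readers may find easier to follow is to sort the weight matrix with NumPy and look up the term labels. This is only a sketch on an invented 2-topic table, with `k` standing in for `n_top_words`.

```python
import numpy as np
import pandas as pd

# Toy topic-term weights (values invented for illustration)
toy_phi = pd.DataFrame([[0.1, 5.0, 2.0, 0.3],
                        [4.0, 0.2, 0.1, 3.0]],
                       columns=['oak', 'cherry', 'plum', 'lemon'])
toy_phi.index.name = 'topic_id'
k = 2

# argsort on the negated weights gives column positions from heaviest to lightest
order = np.argsort(-toy_phi.values, axis=1)[:, :k]
toy_topics = pd.DataFrame(toy_phi.columns.values[order], index=toy_phi.index)
print(toy_topics)
```

As in `db.TOPICS`, the columns are rank positions (0 is the heaviest term) and the rows are topics.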

In [31]:
db.TOPICS

topic_id,0,1,2,3,4,5,6
0,nose,fruit,palate,shows,bottling,aromas,flavors
1,fruit,wine,flavors,stone,stone fruit,aromas,blend
2,ripe,acidity,fruit,wine,bright,flavors,drink
3,wine,drink,acidity,fruits,ripe,tannins,fruit
4,flavors,acidity,wine,finish,peach,lemon,palate
5,wine,fruit,acidity,blanc,flavors,texture,sauvignon blanc
6,palate,cherry,tannins,black,aromas,alongside,offers
7,fruit,flavors,finish,wine,black,fruit flavors,oak
8,wine,tannins,black,plum,flavors,berry,blackberry
9,fruit,aromas,palate,wine,white,finish,red


In [32]:
db.TOPICS['topwords'] = db.TOPICS.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

### Add Doc Weights

In [33]:
db.TOPICS['doc_weight_sum'] = db.THETA.sum()

In [48]:
db.TOPICS[['topwords', 'doc_weight_sum']].sort_values('doc_weight_sum', ascending=False).style.bar()

topic_id,topwords,doc_weight_sum
3,"3 wine, drink, acidity, fruits, ripe, tannins, fruit",1328.340014
4,"4 flavors, acidity, wine, finish, peach, lemon, palate",1090.53897
19,"19 flavors, finish, aromas, berry, palate, plum, oak",709.035556
14,"14 cherry, flavors, wine, pinot, aromas, black, noir",607.757173
18,"18 wine, flavors, ripe, tannins, rich, fruit, drink",578.143584
15,"15 apple, palate, finish, citrus, aromas, green, fresh",562.286731
13,"13 wine, fruit, oak, cabernet, blend, tannins, black",555.768099
6,"6 palate, cherry, tannins, black, aromas, alongside, offers",538.424108
9,"9 fruit, aromas, palate, wine, white, finish, red",434.146039
8,"8 wine, tannins, black, plum, flavors, berry, blackberry",426.987877


## Using NMF

In [34]:
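# Note: `alpha` was removed in scikit-learn 1.2; newer versions use `alpha_W` and `alpha_H` instead
nmf_engine = NMF(n_components=n_topics, init='nndsvd', random_state=1, alpha=.1, l1_ratio=.5)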

### THETA

In [35]:
db.THETA_NMF = pd.DataFrame(nmf_engine.fit_transform(tfidf_model), index=corpus.index)
db.THETA_NMF.columns.name = 'topic_id'
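
NMF factors the non-negative TF-IDF matrix into a document-topic matrix and a topic-term matrix. Unlike LDA's THETA, the document rows are not normalized to sum to 1, so weights from the two models are not directly comparable. The sketch below shows the two factors on a random non-negative matrix (all names and data here are illustrative assumptions).

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
toy_X = rng.random((20, 8))  # non-negative toy matrix standing in for TF-IDF

toy_nmf = NMF(n_components=3, init='nndsvd', random_state=1, max_iter=500)
W = toy_nmf.fit_transform(toy_X)  # document-topic weights (THETA analogue)
H = toy_nmf.components_           # topic-term weights (PHI analogue)

print(W.shape, H.shape)
print(bool((W >= 0).all() and (H >= 0).all()))
```

The product `W @ H` approximates `toy_X`, which is the sense in which the topics "explain" the documents.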

In [36]:
db.THETA_NMF.sample(20).style.background_gradient()

doc_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
2883,0.005742,0.038395,0.0,0.0,0.013442,0.0,0.005331,0.0,0.023039,0.006168,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001461,0.005605,0.01861
8050,0.017422,0.0,0.0,0.0,0.007969,0.0,0.002846,0.01535,0.0,0.009183,0.0,0.001558,0.017325,0.0,0.022189,0.0,0.057078,0.0,0.0,0.0
2667,0.0,0.006896,0.057787,0.007193,0.022397,0.0,0.0,0.0,0.0,0.0,0.0,0.002846,0.0,0.0,0.0,0.040372,0.0,0.007698,0.0,0.0
3797,0.0,0.0,0.0,0.075665,0.0,0.0,0.005716,0.020949,0.0,0.0,0.0,0.008345,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5317,0.002177,0.0,0.0,0.030987,0.0,0.0,0.008499,0.0,0.005159,0.0,0.0,0.0,0.015043,0.0,0.0,0.0,0.0,0.0,0.000215,0.032829
6550,0.00222,0.016728,0.0,0.0,0.0,0.0,0.0,0.0,6.4e-05,0.000788,0.0,0.003011,0.0,0.0,0.0,0.0,0.02436,0.0,0.019732,0.0
7699,0.0,0.0,0.000779,0.028253,0.0,0.0,0.0,0.0,0.007054,0.0,0.0,0.0099,0.0,0.0,0.0,0.032041,0.0,0.011566,0.0,0.0
836,0.009282,0.014784,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005435,0.0,0.0,0.080562,0.0,0.009809,0.0
1241,0.001396,0.0,0.0,0.006764,0.0,0.0,0.001216,0.001287,0.0,0.000416,0.054944,0.0,0.0,0.0,0.0,0.0,0.015178,0.0,0.007406,0.00153
646,0.005025,0.004308,0.0,0.0,0.0,0.0,0.001238,0.0,0.0,0.0,0.0,0.0,0.0,0.004464,0.022112,0.0,0.073465,0.0,0.0,0.0


### PHI

In [37]:
db.PHI_NMF = pd.DataFrame(nmf_engine.components_, columns=db.VOCAB.index)

In [38]:
db.PHI_NMF.index.name = 'topic_id'
db.PHI_NMF.columns.name = 'term_str'

In [39]:
db.PHI_NMF.T.head().style.background_gradient()

term_str,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000 cases,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.043333,0.0,0.0,0.055016,0.160916,0.0,0.0,0.022803,0.0,0.0,0.0,0.0,0.017203,0.0,0.0,0.007809,0.024557,0.0,0.0,0.0
10 merlot,0.0,0.0,0.0,0.004434,0.036964,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10 years,0.023723,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04473,0.0,0.0,0.0


### Topics

In [40]:
db.TOPICS_NMF = db.PHI_NMF.stack()\
    .to_frame('weight')\
    .groupby('topic_id')\
    .apply(lambda x: 
           x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)

In [41]:
db.TOPICS_NMF

topic_id,0,1,2,3,4,5,6
0,wine,fruits,ripe,rich,drink,tannins,wood
1,crisp,light,acidity,fresh,bright,wine,fruity
2,lemon,lime,lemon lime,grapefruit,zest,riesling,orange
3,tannins,alongside,palate,aromas,cherry,offers,palate offers
4,cabernet,sauvignon,cabernet sauvignon,blend,merlot,franc,cabernet franc
5,red,red berry,red cherry,red fruit,red fruits,berry,red currant
6,berry,finish,plum,flavors,aromas,herbal,feels
7,black,black cherry,cherry,pepper,black pepper,blackberry,currant
8,fruit,fruit flavors,flavors,tropical fruit,black fruit,tropical,aromas
9,sweet,flavors,vanilla,like,pineapple,honey,soft


In [42]:
db.TOPICS_NMF['topwords'] = db.TOPICS_NMF.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

### Add Doc Weights

In [43]:
db.TOPICS_NMF['doc_weight_sum'] = db.THETA_NMF.sum()

In [44]:
db.TOPICS_NMF[['topwords', 'doc_weight_sum']].sort_values('doc_weight_sum', ascending=False).style.bar()

topic_id,topwords,doc_weight_sum
0,"0 wine, fruits, ripe, rich, drink, tannins, wood",89.181919
6,"6 berry, finish, plum, flavors, aromas, herbal, feels",70.376952
3,"3 tannins, alongside, palate, aromas, cherry, offers, palate offers",67.308991
1,"1 crisp, light, acidity, fresh, bright, wine, fruity",65.148885
11,"11 nose, palate, notes, nose palate, finish, shows, bottling",64.25942
9,"9 sweet, flavors, vanilla, like, pineapple, honey, soft",60.523884
7,"7 black, black cherry, cherry, pepper, black pepper, blackberry, currant",57.334993
8,"8 fruit, fruit flavors, flavors, tropical fruit, black fruit, tropical, aromas",55.332295
15,"15 white, peach, white peach, stone, flower, citrus, stone fruit",49.729342
4,"4 cabernet, sauvignon, cabernet sauvignon, blend, merlot, franc, cabernet franc",44.589535


# Save the Model

## Keep Corpus Label Info

This is effectively the LIB table.

In [45]:
# Drop the columns we don't need; this keeps column order (indexing with a set is unordered and fails in pandas 2.x)
db.LABELS = corpus.drop(columns=['doc_key', 'doc_content', 'doc_original'])

## Save each dataframe

Here `db.save_tables()` handles this, writing each table to its own CSV file under `./db/`, as the listing below shows.

In [46]:
db.save_tables()
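
The implementation of `save_tables()` lives in `lib.tapi` and is not shown in this notebook. As a rough idea of what such a method might do, here is a hypothetical helper (`save_tables_sketch` and its arguments are assumptions, not the library's API) that writes a dictionary of dataframes to `<prefix>-<NAME>.csv` files.

```python
import os
import tempfile
import pandas as pd

def save_tables_sketch(tables, prefix, out_dir):
    """Write each DataFrame to <out_dir>/<prefix>-<NAME>.csv (hypothetical helper)."""
    paths = []
    for name, df in tables.items():
        path = os.path.join(out_dir, f"{prefix}-{name}.csv")
        df.to_csv(path)  # keeps the index, so doc_id / term_str are written too
        paths.append(path)
    return paths

# Toy table and a temporary directory, both invented for illustration
toy_tables = {
    'VOCAB': pd.DataFrame({'n': [3, 1]},
                          index=pd.Index(['cherry', 'oak'], name='term_str'))
}
out_dir = tempfile.mkdtemp()
saved = save_tables_sketch(toy_tables, 'toy', out_dir)
print([os.path.basename(p) for p in saved])
```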

In [47]:
!ls -l ./db/{data_prefix}*.csv

-rw-r--r--@ 1 rca2t1  staff   9347007 Jun 13 20:54 ./db/winereviews-BOW.csv
-rw-r--r--@ 1 rca2t1  staff  80091190 Jun 13 20:55 ./db/winereviews-DTM.csv
-rw-r--r--@ 1 rca2t1  staff   1832690 Jun 13 20:54 ./db/winereviews-LABELS.csv
-rw-r--r--@ 1 rca2t1  staff   1387059 Jun 13 20:55 ./db/winereviews-PHI.csv
-rw-r--r--@ 1 rca2t1  staff    544306 Jun 13 20:55 ./db/winereviews-PHI_NMF.csv
-rw-r--r--@ 1 rca2t1  staff   4322057 Jun 13 20:55 ./db/winereviews-THETA.csv
-rw-r--r--@ 1 rca2t1  staff   1967524 Jun 13 20:55 ./db/winereviews-THETA_NMF.csv
-rw-r--r--@ 1 rca2t1  staff      2611 Jun 13 20:54 ./db/winereviews-TOPICS.csv
-rw-r--r--@ 1 rca2t1  staff      3015 Jun 13 20:55 ./db/winereviews-TOPICS_NMF.csv
-rw-r--r--@ 1 rca2t1  staff    150779 Jun 13 20:54 ./db/winereviews-VOCAB.csv
