# News ETA: Text Preparation
Becky Desrosiers | rn7ena@virginia.edu | DS5001 F24 | 13 December, 2024

## 0 Setup (F0)

In this phase, I import my packages and data as well as the raw text.

### 0.1 Imports

In [1]:
import pandas as pd
import numpy as np
import nltk
from scipy.linalg import eigh
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from gensim.models.word2vec import Word2Vec
from helper import MyDB, display_tables

### 0.2 Data ingest

In [2]:
SAMPLE_SIZE = 10000
F0 = pd.read_csv('newzy.zip', sep = '|', index_col = 'doc_id', compression = 'infer')

# sample SAMPLE SIZE non-empty rows
F0 = F0.loc[~F0.doc_content.isna()].sample(SAMPLE_SIZE, random_state = 5001)

display_tables(F0, 0)

F0 table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_content,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
287034,Drudge Report,Goat-Herding Dog Refuses To Abandon Flock Amid...,Goat-Herding Dog Refuses To Abandon Flock Amid...,10/16/2017,http://feedproxy.google.com/~r/DrudgeReportFee...
198924,US News,"Iowa Hospital Unwittingly Posts 5,300 Patients...",The University of Iowa Hospitals and Clinics s...,07/12/2017,https://www.usnews.com/news/best-states/iowa/a...
772036,CNN,Former California congresswoman Ellen Tauscher...,Former California Democratic Rep. Ellen Tausch...,04/30/2019,http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/d...


## 1 Machine Learning Corpus Format (F1)

For F1, I create a DataFrame without the metadata, just the content of each document and a document index.

### 1.1 F1 Code

In [3]:
F1 = F0.loc[:, 'doc_content'].to_frame()
F1.doc_content = F1.doc_content.astype('str')

### 1.2 F1 Results

In [4]:
display_tables(F1, 1, table = 'Document')

F1 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
743857,What happened in Chicago was an absolute trave...
62533,REPORT: El Chapo's sons say ex-ally trying to ...
763145,"Since the 1999 Columbine High School shooting,..."


## 2 Standard Text Analytic Data Model (F2)

For F2, I add a Library table with document metadata as well as a simplet Token table with OHCO index and a Vocabulary table mapping terms to term IDs.

### 2.1 F2 Code

#### 2.1.1 Library table

In [5]:
F2_LIBRARY = F0.loc[:, ['doc_source', 'doc_title', 'doc_date', 'doc_url']]

#### 2.1.2 Document table

My texts are mostly very short. I considered breaking them up into sentences, but some of them are only one sentence and breaking them up doesn't make much sense when you think about it like one chunk is supposed to be a unit of discourse or a "message," because sentences are often very related. In the end, I chose to leave my OHCO at the document level.

In [6]:
# show that some texts are very short
for i in F0.sample(3).index:
    print(F0.loc[i].doc_content, end = '\n\n')

I have three teenage children – an 18-year-old and two 16-year-old twins, and they’re all learning to drive right at the same time.

New MALE boss starts restyling BRITISH VOGUE...            (First column, 10th story, link)

Southwest says cracks in engine fan blades like the flaw blamed in a fatal accident have turned up at other airlines.



In [7]:
F2_DOC = F1.copy()

#### 2.1.3 Token table

I considered splitting into paragraphs, but there were no newlines in the corpus, so tokens will be indexed by document ID, sentence number, and token number.

In [8]:
# split into paragraphs
# TOKEN.doc_content.str.split(r'\n+', expand = True) -- no effect

# TOKEN.query("doc_content == ''") -- none
# TOKEN.query("doc_content == ' '") -- no empty lines

# split into sentences
TOKEN = F1.doc_content.apply(lambda content: pd.Series(nltk.sent_tokenize(content)))\
        .stack().to_frame().rename(columns = {0: 'sentence'})
TOKEN.index.names = ['doc_id', 'sent_num']

# split into tokens
TOKEN = TOKEN.sentence.apply(lambda sent: pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(sent))))
TOKEN = TOKEN.stack().to_frame().rename(columns = {0 : 'pos_tuple'})
TOKEN.index.names = ['doc_id', 'sent_num', 'token_id']
TOKEN['token'] = TOKEN.pos_tuple.apply(lambda x: x[0])
TOKEN['pos'] = TOKEN.pos_tuple.apply(lambda x: x[1])
TOKEN['term'] = TOKEN.token.str.extract(r'([\w-]+)').squeeze().str.lower()
TOKEN['punctuation'] = TOKEN.term.isna() # flag tokens that are just punctuation
TOKEN.loc[TOKEN.term.isna(), 'term'] = TOKEN.loc[TOKEN.term.isna(), 'token'] # copy over tokens that are just punctuation

F2_TOKEN = TOKEN.token.to_frame().dropna()

#### 2.1.4 Vocabulary table

In [9]:
VOCAB = TOKEN.term.value_counts().to_frame().sort_index().reset_index()
VOCAB.index.name = 'term_id'

F2_VOCAB = VOCAB.copy()

### 2.2 F2 Results

In [10]:
f2_tables = [F2_LIBRARY, F2_DOC, F2_TOKEN, F2_VOCAB]
display_tables(f2_tables, 2, table = 'all')

F2 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
82361,US News,AP FACT CHECK: McCaskill Wrong About Contacts ...,03/03/2017,https://www.usnews.com/news/politics/articles/...
890822,US News,Bodies Found in Connection With California Amb...,09/23/2019,https://www.usnews.com/news/best-states/califo...
38680,UPI Latest,Kansas City Chiefs vs San Diego Chargers: pred...,12/30/2016,http://www.upi.com/Sports_News/NFL/2016/12/30/...


F2 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
216165,Jennifer Lawrence is creeped out in the first ...
386014,The new ice skating ribbon in Spokane's Riverf...
55934,It’s so patently absurd to handcuff ANY of the...


F2 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token
doc_id,sent_num,token_id,Unnamed: 3_level_1
874751,0,11,illuminating
990959,0,32,clergy
41799,10,24,Morgan


F2 Vocabulary table sample:


Unnamed: 0_level_0,term,count
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1
9818,farts,1
15210,lima,5
28811,xeveryone,1


In [11]:
f2_db = MyDB('tables/F2')
f2_db.save_table(f2_tables, 'all')

## 3 NLP Annotated STADM (F3)

For F3, I add some annotations to my Token and Vocabulary tables.

### 3.1 F3 Code

#### 3.1.0 Library and Document tables

The Library and Document tables will not change from F2 onward.

#### 3.1.1 Token table

For F3, we add the following to our token table:
- term ID
- part of speech (POS)

In [12]:
term_to_id_map = F2_VOCAB.reset_index().set_index('term').to_dict()['term_id']

TOKEN['term_id'] = TOKEN.term.map(term_to_id_map)
TOKEN.term_id = TOKEN.term_id.astype('int')

F3_TOKEN = TOKEN.drop(columns = ['pos_tuple', 'punctuation'])

#### 3.1.2 Vocabulary table

For F3, we add the following to our vocabulary table:
- Flags:
    - numeric
    - punctuation
    - stopwords
    
- Annotations:
  - stems
  - max POS

In [13]:
# add flags
VOCAB['punctuation'] = VOCAB.term.map(TOKEN.reset_index().set_index('term').to_dict()['punctuation'])
VOCAB['numeric'] = VOCAB.term.str.match('\d+').astype('bool')
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns = ['term']).set_index('term')
sw['stopword'] = True
VOCAB['stopword'] = VOCAB.term.map(sw.stopword)
VOCAB.stopword = VOCAB.stopword.fillna(0).astype('bool')

# add annotations
VOCAB['stem'] = VOCAB.term.apply(nltk.stem.porter.PorterStemmer().stem)
max_pos = TOKEN[['pos', 'term']].groupby('term').value_counts().to_frame().reset_index().groupby('term').max().pos
VOCAB['pos_max'] = VOCAB.term.map(max_pos)

F3_VOCAB = VOCAB.copy()

### 3.3 F3 Resutls

In [14]:
f3_tables = [F2_LIBRARY, F2_DOC, F3_TOKEN, F3_VOCAB]
display_tables(f3_tables, 3, table = 'all')

F3 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
314612,US News,Slovakia to Build Army Personnel Carriers With...,11/15/2017,https://www.usnews.com/news/world/articles/201...
692883,Reuters,North Korea unlikely to give up nuclear weapon...,01/29/2019,http://feeds.reuters.com/~r/Reuters/PoliticsNe...
22398,US News,"Economy, Russia top issues as Bulgarians pick ...",11/13/2016,http://www.usnews.com/news/world/articles/2016...


F3 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
569907,"By Aref MohammedBASRA, Iraq (Reuters) - Iraqi ..."
268711,A judge has declared a mistrial in the case of...
1006767,Police evacuated portions of London's busy and...


F3 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token,pos,term,term_id
doc_id,sent_num,token_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
280857,1,84,And,CC,and,2067
923704,0,17,at,IN,at,2679
776033,0,6,vowed,VBD,vowed,27960


F3 Vocabulary table sample:


Unnamed: 0_level_0,term,count,punctuation,numeric,stopword,stem,pos_max
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1805,alerted,1,False,False,False,alert,VBD
18948,pass-through,1,False,False,False,pass-through,JJ
25351,swayed,1,False,False,False,sway,VBN


In [15]:
f3_db = MyDB('tables/F3')
f3_db.save_table(f3_tables, 'all')

## 4 STADM with Vector Space Models (F4)

For F4, I add a vector representation of each term in the form of TFIDF.

### 4.1 F4 Code

#### 4.1.1 Token table

In [16]:
# document-term count matrix without stopwords and proper nouns
TOKEN['stopword'] = TOKEN.term_id.map(VOCAB.stopword)
bow_tokens = TOKEN.query("punctuation == False & stopword == False & pos != 'NNP' & pos != 'NNPS'")
BOW = bow_tokens.groupby(['doc_id','term_id']).term_id.value_counts().to_frame()
BOW.columns = ['count']
dt_matrix = BOW['count'].unstack().fillna(0).astype('int')
    
# summed tf
tf = ( dt_matrix.T / dt_matrix.T.sum() ).T
        
# standard idf
df = dt_matrix[dt_matrix > 0].count()
N = dt_matrix.shape[0]
idf = np.log10(N / df)

TFIDF = (tf * idf).T
TFIDF.columns = [f'tfidf[{column}]' for column in TFIDF.columns]

#### 4.1.2 Vocabulary table

In [17]:
VOCAB['tfidf_sum'] = TFIDF.sum(axis = 1)
F4_VOCAB = VOCAB.copy()

### 4.2 F4 Results

In [18]:
f4_tables = [F2_LIBRARY, F2_DOC, F3_TOKEN, F4_VOCAB]
display_tables([f4_tables, TFIDF], 4, table = ['all', 'TFIDF'])

F4 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
276961,US News,"Syria Fighting Worst Since Aleppo, Civilian Ca...",10/05/2017,https://www.usnews.com/news/world/articles/201...
163729,US News,'Wonder Woman' Director Sought Lynda Carter's ...,05/31/2017,https://www.usnews.com/news/best-states/califo...
830409,US News,Democratic Lawmakers Renew Push for Parole Cha...,07/09/2019,https://www.usnews.com/news/best-states/new-me...


F4 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
347018,GREAT: Consumer Spending Jumps... (...
832885,Vermont police say they found dozens of cannab...
810143,"1 Dead, 3 Wounded In Costco Shooting; Suspect ..."


F4 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token,pos,term,term_id
doc_id,sent_num,token_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
860965,0,5,have,VBP,have,11979
133387,0,17,say,VBP,say,22743
155281,0,12,Medicaid,NNP,medicaid,16206


F4 Vocabulary table sample:


Unnamed: 0_level_0,term,count,punctuation,numeric,stopword,stem,pos_max,tfidf_sum
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
24779,sticks,1,False,False,False,stick,NNS,0.235084
20645,puts,11,False,False,False,put,VBZ,0.372733
16283,memorably,1,False,False,False,memor,RB,0.018083


F4 TFIDF table sample:


Unnamed: 0_level_0,tfidf[41],tfidf[233],tfidf[679],tfidf[747],tfidf[795],tfidf[843],tfidf[943],tfidf[974],tfidf[1195],tfidf[1247]
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
25505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21447,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28196,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
f4_db = MyDB('tables/F4')
f4_db.save_table(f4_tables, 'all')
TFIDF.to_csv('tables/TFIDF.csv')

## 5 STADM with Analytical Models (F5)

For F5, I add additional vector representations of terms.

### 5.1 F5 Code

#### 5.1.1 Principal components

In [20]:
# remove common words and proper nouns
VOCAB_reduced = VOCAB.query("pos_max != 'NNP' & pos_max != 'NNPS'")
VOCAB_reduced = VOCAB_reduced.tfidf_sum.sort_values().dropna().tail(VOCAB_reduced.shape[0]//2) # top 50%
reduced_term_ids = VOCAB_reduced.index.tolist()
dt_reduced = dt_matrix.loc[:, reduced_term_ids]

# get eigen values and vectors
eig_vals, eig_vecs = eigh(dt_reduced.cov())
eig = pd.DataFrame(eig_vecs, index = dt_reduced.columns, columns = dt_reduced.columns)
eig.insert(0, 'eigenvalue', eig_vals)
eig.insert(0, 'explained_variance', eig.eigenvalue / eig_vals.sum())
eig = eig.sort_values('explained_variance', ascending = False)
eig.index = ['PC' + str(i)  for i in range(len(eig.index))]
eig.columns.name = None

# get loadings
LOADINGS = eig.iloc[:10, 2:].T
LOADINGS.columns = [col + '_loading' for col in LOADINGS.columns]
LOADINGS.index.name = 'term_id'

# save PCS
PCS = eig.iloc[:10, :2]
PCS.index.name = 'principal_component'

#### 5.1.2 Topic models (LDA)

In [21]:
# get F1 table with only regular nouns
doc_strings = TOKEN[TOKEN.pos.str.match(r'^NNS?$')]\
                .groupby('doc_id').term.apply(lambda x: ' '.join(x))\
                .to_frame().rename(columns = {'term' : 'doc_string'})

tfv = CountVectorizer(max_features = None, stop_words = list(sw))
tf = tfv.fit_transform(doc_strings.doc_string)
lda_terms = tfv.get_feature_names_out()

lda = LDA(n_components = 15, # number of topics
          max_iter = 50,
          learning_offset = 50.,
          random_state = 5001)

# doc-topic table (distribution of topic preference over each document)
THETA = pd.DataFrame(lda.fit_transform(tf), index = doc_strings.index)
THETA.columns.name = 'topic_id'

# sanity check
if not ((THETA.sum(axis = 1) < 1.000001).all() and
        (THETA.sum(axis = 1) > 0.999999).all()):
    raise ValueError('Documents do not sum to 1 over all topics')

# topic-word table (distribution of words over each topic)
PHI = pd.DataFrame(lda.components_, columns = lda_terms).T
# some terms were vectorized differently with CountVectorizer, causing problematic NAs
PHI = PHI.loc[~PHI.index.map(term_to_id_map).isna(), :]
PHI.index.name = 'term'
PHI.columns.name = 'topic_id'

# define topics by top 3 terms
TOPICS = PHI.stack().to_frame().rename(columns = {0 : 'weight'})\
            .groupby('topic_id')\
            .apply(lambda x:
                       x.weight.sort_values(ascending = False)\
                       .head(3)\
                       .reset_index()\
                       .drop('topic_id', axis = 1)\
                       .term)
TOPICS['label'] = TOPICS.apply(lambda x: str(x.name) + ' ' + ' '.join(x), 1)
TOPICS.drop(columns = [0, 1, 2], inplace = True)
TOPICS['doc_weight_sum'] = THETA.sum()
TOPICS.index.name = 'topic_id'

# make PHI have the same index as VOCAB
PHI = PHI.reset_index().rename(columns = {'index' : 'term'})
PHI['term_id'] = PHI.term.map(term_to_id_map).astype('int')
PHI = PHI.set_index('term_id')
PHI.drop(columns = 'term', inplace = True)

max_topic_id = pd.Series(np.array(PHI).argmax(axis = 1), index = PHI.index)
VOCAB['max_topic_id'] = VOCAB.index.map(max_topic_id)
F5_VOCAB = VOCAB

#### 5.1.3 Word Embeddings (Word2Vec)

In [22]:
# remove proper nouns
corpus = TOKEN.loc[~TOKEN.pos.str.match('NNPS?'), 'term']\
                .groupby('doc_id')\
                .apply(lambda x: x.tolist())\
                .reset_index()\
                .term\
                .tolist()
vec_model = Word2Vec(corpus,
                     vector_size = 50,
                     window = 5,
                     min_count = 2,
                     workers = 4)

vec = {}
for l in corpus:
    for element in l:
        try:
            vec[element] = vec_model.wv[element]
        except KeyError:
            pass
VECTORS = pd.DataFrame(vec).T.reset_index().rename(columns = {'index' : 'term'})
VECTORS.index = VECTORS.term.map(term_to_id_map)
VECTORS.index.name = 'term_id'
VECTORS.drop(columns = 'term', inplace = True)
VECTORS.columns = [f'word_vector[{i}]' for i in VECTORS.columns]

vec_model.save('word2vec.model')

#### 5.1.4 Sentiment analysis

In [23]:
# add sentiments from salex_nrc.csv used in class
SENTIMENTS = pd.read_csv('salex_nrc.csv', index_col = 'term_str')
SENTIMENTS.columns = [col.replace('nrc_', '') for col in SENTIMENTS.columns]
SENTIMENTS['polarity'] = SENTIMENTS.positive - SENTIMENTS.negative

# fix index to match VOCAB table
SENTIMENTS = SENTIMENTS.reset_index().rename(columns = {'term_str' : 'term'})
SENTIMENTS['term_id'] = SENTIMENTS.term.map(term_to_id_map)
SENTIMENTS = SENTIMENTS.dropna()
SENTIMENTS.term_id = SENTIMENTS.term_id.astype('int')
SENTIMENTS = SENTIMENTS.set_index('term_id')
SENTIMENTS.drop(columns = 'term', inplace = True)

### 2.5 F5 Results

In [24]:
f5_tables = [[F2_LIBRARY, F2_DOC, F3_TOKEN, F5_VOCAB], PCS, LOADINGS, THETA, PHI, TOPICS, VECTORS, SENTIMENTS]
f5_table_names = ['all', 'Principal_Components', 'Loadings', 'Theta', 'Phi', 'Topics', 'Vectors', 'Sentiments']
display_tables(f5_tables, 5, table = f5_table_names)

F5 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
47239,US News,Poll: Young Americans fear they will be worse ...,01/17/2017,http://www.usnews.com/news/politics/articles/2...
259650,US News,"With 2 in 3 Months, Ohio Executions Could Be B...",09/16/2017,https://www.usnews.com/news/best-states/ohio/a...
183651,US News,"Trump, Pence to Attend Treasury Secretary Mnuc...",06/24/2017,https://www.usnews.com/news/entertainment/arti...


F5 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
874666,OUAGADOUGOU (Reuters) - Two senior allies of B...
339590,The family of an 8-year-old girl who was tortu...
479419,U.S. government forecasters are set to release...


F5 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token,pos,term,term_id
doc_id,sent_num,token_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
99711,0,0,Dozens,NNS,dozens,8312
616376,0,1,-,:,-,18
533546,0,9,shuttle,VB,shuttle,23622


F5 Vocabulary table sample:


Unnamed: 0_level_0,term,count,punctuation,numeric,stopword,stem,pos_max,tfidf_sum,max_topic_id
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
20899,ramped,2,False,False,False,ramp,VBN,0.349896,
1106,88-year-old,1,False,True,False,88-year-old,JJ,0.363311,
23951,slovakia,4,False,False,False,slovakia,NNP,0.333035,


F5 Principal_Components table sample:


Unnamed: 0_level_0,explained_variance,eigenvalue
principal_component,Unnamed: 1_level_1,Unnamed: 2_level_1
PC9,0.00623,0.097251
PC7,0.006953,0.108546
PC8,0.006459,0.100825


F5 Loadings table sample:


Unnamed: 0_level_0,PC0_loading,PC1_loading,PC2_loading,PC3_loading,PC4_loading,PC5_loading,PC6_loading,PC7_loading,PC8_loading,PC9_loading
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
460,0.001419,0.000916,0.005541,-0.012512,0.000519,-0.000172,-0.000392,0.000111,-0.002238,-0.002003
22244,-0.001557,8.7e-05,0.003011,-0.002693,-0.000456,0.000493,0.001616,-0.00033,-0.00029,-0.000485
19166,-0.004223,0.000328,0.003496,0.002965,-0.00106,-0.000677,0.000678,-2.8e-05,0.000202,0.000122


F5 Theta table sample:


topic_id,0,1,2,3,4,5,6,7,8,9
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
70363,0.136998,0.004762,0.004762,0.004762,0.004762,0.004762,0.004762,0.004762,0.330774,0.004762
811062,0.196735,0.006061,0.006061,0.006061,0.006061,0.006061,0.334478,0.006061,0.006061,0.006061
13933,0.216729,0.013333,0.013333,0.013333,0.013333,0.013333,0.013333,0.013333,0.013333,0.609938


F5 Phi table sample:


topic_id,0,1,2,3,4,5,6,7,8,9
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
8381,0.066667,5.769753,4.113679,2.316569,0.066667,0.066667,0.066667,2.066666,0.066667,0.066667
19057,0.066667,0.066667,10.519004,0.066667,7.814068,5.478541,0.066667,0.066667,0.066667,1.22496
7475,0.066667,0.066667,0.066667,1.121363,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667


F5 Topics table sample:


term,label,doc_weight_sum
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1 story link stories,727.476
5,5 points film service,491.149553
10,10 people world groups,537.369749


F5 Vectors table sample:


Unnamed: 0_level_0,word_vector[0],word_vector[1],word_vector[2],word_vector[3],word_vector[4],word_vector[5],word_vector[6],word_vector[7],word_vector[8],word_vector[9]
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
3842,0.004434,0.007598,-0.000672,0.025868,0.021455,-0.013984,0.044028,0.040068,-0.023773,-0.020132
26084,-0.00548,-0.054595,-0.037831,0.03373,-0.007217,-0.125992,0.240865,0.290666,-0.178819,-0.140063
16193,-0.021661,-0.002886,0.017216,-0.003641,-0.014461,-0.031649,0.041497,0.042879,-0.017796,-0.028819


F5 Sentiments table sample:


Unnamed: 0_level_0,anger,anticipation,disgust,fear,joy,negative,positive,sadness,surprise,trust
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
21684,1,0,1,0,0,1,0,1,0,0
13067,1,0,0,1,0,1,0,0,1,0
27471,1,1,0,1,0,1,0,0,0,0


In [25]:
f5_db = MyDB('tables/F5')
f5_db.save_table(f5_tables, f5_table_names)

## Next up: Statistical and Visual Exploration will be addressed in [rn7ena_news_eta.ipynb](rn7ena_news_eta.ipynb)