# DS 5001 Final Project: Text Preparation
Becky Desrosiers | rn7ena@virginia.edu | DS5001 F24 | 13 December, 2024

## 0 Setup (F0)

In this phase, I import my packages and data as well as the raw text.

### 0.1 Imports

In [1]:
import pandas as pd
import numpy as np
import nltk
from scipy.linalg import eigh
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from gensim.models.word2vec import Word2Vec
from helper import MyDB, display_tables

### 0.2 Data ingest

In [2]:
SAMPLE_SIZE = 10000
F0 = pd.read_csv('newzy.zip', sep = '|', index_col = 'doc_id', compression = 'infer')

# sample SAMPLE SIZE non-empty rows
F0 = F0.loc[~F0.doc_content.isna()].sample(SAMPLE_SIZE, random_state = 5001)

display_tables(F0, 0)

F0 table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_content,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
880014,US News,Australia Regulator Accused of Harming Competi...,By Sonali PaulMELBOURNE (Reuters) - Australia'...,09/10/2019,https://www.usnews.com/news/technology/article...
428184,US News,Six Tanzanian Opposition Leaders Charged With ...,DAR ES SALAAM (Reuters) - Tanzania has charged...,03/28/2018,https://www.usnews.com/news/world/articles/201...
556227,UPI Latest,New England Patriots WR Kenny Britt hampered b...,New England Patriots wide receiver Kenny Britt...,08/20/2018,https://www.upi.com/Sports_News/NFL/2018/08/20...


## 1 Machine Learning Corpus Format (F1)

For F1, I create a DataFrame without the metadata, just the content of each document and a document index.

### 1.1 F1 Code

In [3]:
F1 = F0.loc[:, 'doc_content'].to_frame()
F1.doc_content = F1.doc_content.astype('str')

### 1.2 F1 Results

In [4]:
display_tables(F1, 1, table = 'Document')

F1 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
433118,Retaliatory tariffs are a blow to exporters in...
662974,SEARS DOWN TO LAST 24 HOURS? (Top h...
934273,Maren Morris hit the CMA red carpet after anno...


## 2 Standard Text Analytic Data Model (F2)

For F2, I add a Library table with document metadata as well as a simplet Token table with OHCO index and a Vocabulary table mapping terms to term IDs.

### 2.1 F2 Code

#### 2.1.1 Library table

In [5]:
F2_LIBRARY = F0.loc[:, ['doc_source', 'doc_title', 'doc_date', 'doc_url']]

#### 2.1.2 Document table

My texts are mostly very short. I considered breaking them up into sentences, but some of them are only one sentence and breaking them up doesn't make much sense when you think about it like one chunk is supposed to be a unit of discourse or a "message," because sentences are often very related. In the end, I chose to leave my OHCO at the document level.

In [6]:
# show that some texts are very short
for i in F0.sample(3).index:
    print(F0.loc[i].doc_content, end = '\n\n')

An increasing number of South Korean women are choosing to freeze their eggs, a survey showed on Tuesday.

As dozens of Catholic dioceses around the country have released lists of priests who have been credibly accused of child sex abuse, the Charlotte diocese remains undecided about whether to join what its spokesman calls the "stampede.".

OTTAWA (Reuters) - Canada's Foreign Minister Chrystia Freeland said on Friday that she met her Chinese counterpart, Wang Yi, to discuss tensions...



In [7]:
F2_DOC = F1.copy()

#### 2.1.3 Token table

I considered splitting into paragraphs, but there were no newlines in the corpus, so tokens will be indexed by document ID, sentence number, and token number.

In [8]:
# split into paragraphs
# TOKEN.doc_content.str.split(r'\n+', expand = True) -- no effect

# TOKEN.query("doc_content == ''") -- none
# TOKEN.query("doc_content == ' '") -- no empty lines

# split into sentences
TOKEN = F1.doc_content.apply(lambda content: pd.Series(nltk.sent_tokenize(content)))\
        .stack().to_frame().rename(columns = {0: 'sentence'})
TOKEN.index.names = ['doc_id', 'sent_num']

# split into tokens
TOKEN = TOKEN.sentence.apply(lambda sent: pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(sent))))
TOKEN = TOKEN.stack().to_frame().rename(columns = {0 : 'pos_tuple'})
TOKEN.index.names = ['doc_id', 'sent_num', 'token_id']
TOKEN['token'] = TOKEN.pos_tuple.apply(lambda x: x[0])
TOKEN['pos'] = TOKEN.pos_tuple.apply(lambda x: x[1])
TOKEN['term'] = TOKEN.token.str.extract(r'([\w-]+)').squeeze().str.lower()
TOKEN['punctuation'] = TOKEN.term.isna() # flag tokens that are just punctuation
TOKEN.loc[TOKEN.term.isna(), 'term'] = TOKEN.loc[TOKEN.term.isna(), 'token'] # copy over tokens that are just punctuation

F2_TOKEN = TOKEN.token.to_frame().dropna()

#### 2.1.4 Vocabulary table

In [9]:
VOCAB = TOKEN.term.value_counts().to_frame().sort_index().reset_index()
VOCAB.index.name = 'term_id'

F2_VOCAB = VOCAB.copy()

### 2.2 F2 Results

In [10]:
f2_tables = [F2_LIBRARY, F2_DOC, F2_TOKEN, F2_VOCAB]
display_tables(f2_tables, 2, table = 'all')

F2 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
469844,Politico Magazine,‘Americans are Being Held Hostage and Terroriz...,05/13/2018,https://www.politico.com/magazine/story/2018/0...
832510,US News,Ohio Supreme Court to Hear School 'Takeovers' ...,07/12/2019,https://www.usnews.com/news/best-states/ohio/a...
654573,US News,10 Things to Know About Maryland,12/18/2018,https://www.usnews.com/news/best-states/articl...


F2 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
499719,Preservationists are outraged that someone cov...
523068,Italian prosecutors on Friday investigated all...
977073,TEAR GAS FIRED AT PRO-IRAN MOB... (...


F2 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token
doc_id,sent_num,token_id,Unnamed: 3_level_1
775191,0,10,to
193412,0,6,hatred
612073,0,6,said


F2 Vocabulary table sample:


Unnamed: 0_level_0,term,count
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1
6225,conservationist,1
24571,stalin,2
11770,hamad,1


In [11]:
f2_db = MyDB('tables/F2')
f2_db.save_table(f2_tables, 'all')

## 3 NLP Annotated STADM (F3)

For F3, I add some annotations to my Token and Vocabulary tables.

### 3.1 F3 Code

#### 3.1.0 Library and Document tables

The Library and Document tables will not change from F2 onward.

#### 3.1.1 Token table

For F3, we add the following to our token table:
- term ID
- part of speech (POS)

In [12]:
term_to_id_map = F2_VOCAB.reset_index().set_index('term').to_dict()['term_id']

TOKEN['term_id'] = TOKEN.term.map(term_to_id_map)
TOKEN.term_id = TOKEN.term_id.astype('int')

F3_TOKEN = TOKEN.drop(columns = ['pos_tuple', 'punctuation'])

#### 3.1.2 Vocabulary table

For F3, we add the following to our vocabulary table:
- Flags:
    - numeric
    - punctuation
    - stopwords
    
- Annotations:
  - stems
  - max POS

In [13]:
# add flags
VOCAB['punctuation'] = VOCAB.term.map(TOKEN.reset_index().set_index('term').to_dict()['punctuation'])
VOCAB['numeric'] = VOCAB.term.str.match('\d+').astype('bool')
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns = ['term']).set_index('term')
sw['stopword'] = True
VOCAB['stopword'] = VOCAB.term.map(sw.stopword)
VOCAB.stopword = VOCAB.stopword.fillna(0).astype('bool')

# add annotations
VOCAB['stem'] = VOCAB.term.apply(nltk.stem.porter.PorterStemmer().stem)
max_pos = TOKEN[['pos', 'term']].groupby('term').value_counts().to_frame().reset_index().groupby('term').max().pos
VOCAB['pos_max'] = VOCAB.term.map(max_pos)

F3_VOCAB = VOCAB.copy()

### 3.3 F3 Resutls

In [14]:
f3_tables = [F2_LIBRARY, F2_DOC, F3_TOKEN, F3_VOCAB]
display_tables(f3_tables, 3, table = 'all')

F3 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
294845,US News,Lawsuit: Louisiana Sheriff Rejected Applicant ...,10/24/2017,https://www.usnews.com/news/best-states/louisi...
263822,US News,Special Report: More Power and a Quiet Exit fo...,09/21/2017,https://www.usnews.com/news/top-news/articles/...
761093,US News,FCC Urges RI to Stop Diverting 911 Fees to Oth...,04/17/2019,https://www.usnews.com/news/best-states/rhode-...


F3 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
980808,Actor George Lopez appeared to accept an Irani...
723234,BEIJING (Reuters) - China on Monday accused de...
306588,The Kansas Democratic Party has agreed to pay ...


F3 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token,pos,term,term_id
doc_id,sent_num,token_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
228209,2,8,...and,NN,and,2067
572664,0,15,kidnapped,VBD,kidnapped,14402
635819,0,5,States,NNPS,states,24677


F3 Vocabulary table sample:


Unnamed: 0_level_0,term,count,punctuation,numeric,stopword,stem,pos_max
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
18911,part-time,1,False,False,False,part-tim,JJ
11347,governs,1,False,False,False,govern,VBZ
14326,kellyseoul,1,False,False,False,kellyseoul,NNP


In [15]:
f3_db = MyDB('tables/F3')
f3_db.save_table(f3_tables, 'all')

## 4 STADM with Vector Space Models (F4)

For F4, I add a vector representation of each term in the form of TFIDF.

### 4.1 F4 Code

#### 4.1.1 Token table

In [30]:
# document-term count matrix without stopwords and proper nouns
TOKEN['stopword'] = TOKEN.term_id.map(VOCAB.stopword)
bow_tokens = TOKEN.query("punctuation == False & stopword == False & pos != 'NNP' & pos != 'NNPS'")
BOW = bow_tokens.groupby(['doc_id','term_id']).term_id.value_counts().to_frame()
BOW.columns = ['count']
dt_matrix = BOW['count'].unstack().fillna(0).astype('int')
    
# summed tf
tf = ( dt_matrix.T / dt_matrix.T.sum() ).T
        
# standard idf
df = dt_matrix[dt_matrix > 0].count()
N = dt_matrix.shape[0]
idf = np.log10(N / df)

TFIDF = (tf * idf).T
TFIDF.columns = [f'tfidf[{column}]' for column in TFIDF.columns]

#### 4.1.2 Vocabulary table

In [31]:
VOCAB['tfidf_sum'] = TFIDF.sum(axis = 1)
F4_VOCAB = VOCAB.copy()

### 4.2 F4 Results

In [32]:
f4_tables = [F2_LIBRARY, F2_DOC, F3_TOKEN, F4_VOCAB]
display_tables([f4_tables, TFIDF], 4, table = ['all', 'TFIDF'])

F4 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
916043,Drudge Report,'LYNCHING'...,10/22/2019,https://tracking.feedpress.it/link/20202/12915342
13205,Real Clear Politics,"Obamacare Hikes Leave Clinton, Democrats Exposed",10/26/2016,http://www.realclearpolitics.com/2016/10/26/ob...
927562,US News,BP Invests in City Transportation App Whim,11/06/2019,https://www.usnews.com/news/technology/article...


F4 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
959242,The Greenland ice sheet's losses have accelera...
900603,KUALA LUMPUR (Reuters) - Malaysian Prime Minis...
158718,A New York City man who pleaded guilty in a do...


F4 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token,pos,term,term_id
doc_id,sent_num,token_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
348687,0,12,execs,NN,execs,9451
126705,0,17,man,NN,man,15791
32908,6,3,my,PRP$,my,17146


F4 Vocabulary table sample:


Unnamed: 0_level_0,term,count,punctuation,numeric,stopword,stem,pos_max,tfidf_sum,max_topic_id
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
20454,protecting,10,False,False,False,protect,VBG,1.34476,13.0
17650,newssarah,1,False,False,False,newssarah,NNP,,
22978,seagal,1,False,False,False,seagal,NNP,,


F4 TFIDF table sample:


Unnamed: 0_level_0,tfidf[41],tfidf[233],tfidf[679],tfidf[747],tfidf[795],tfidf[843],tfidf[943],tfidf[974],tfidf[1195],tfidf[1247]
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
10362,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
f4_db = MyDB('tables/F4')
f4_db.save_table(f4_tables, 'all')
TFIDF.to_csv('tables/TFIDF.csv')

## 5 STADM with Analytical Models (F5)

For F5, I add additional vector representations of terms.

### 5.1 F5 Code

#### 5.1.1 Principal components

In [56]:
# remove common words and proper nouns
VOCAB_reduced = VOCAB.query("pos_max != 'NNP' & pos_max != 'NNPS'")
VOCAB_reduced = VOCAB_reduced.tfidf_sum.sort_values().dropna().tail(VOCAB_reduced.shape[0]//2) # top 50%
reduced_term_ids = VOCAB_reduced.index.tolist()
dt_reduced = dt_matrix.loc[:, reduced_term_ids]

# get eigen values and vectors
eig_vals, eig_vecs = eigh(dt_reduced.cov())
eig = pd.DataFrame(eig_vecs, index = dt_reduced.columns, columns = dt_reduced.columns)
eig.insert(0, 'eigenvalue', eig_vals)
eig.insert(0, 'explained_variance', eig.eigenvalue / eig_vals.sum())
eig = eig.sort_values('explained_variance', ascending = False)
eig.index = ['PC' + str(i)  for i in range(len(eig.index))]
eig.columns.name = None

# get loadings
LOADINGS = eig.iloc[:10, :].T
LOADINGS.columns = [col + '_loading' for col in LOADINGS.columns]
LOADINGS.index.name = 'term_id'

# save PCS
PCS = eig.iloc[:10, :2]
PCS.index.name = 'principal_component'

#### 5.1.2 Topic models (LDA)

In [35]:
# get F1 table with only regular nouns
doc_strings = TOKEN[TOKEN.pos.str.match(r'^NNS?$')]\
                .groupby('doc_id').term.apply(lambda x: ' '.join(x))\
                .to_frame().rename(columns = {'term' : 'doc_string'})

tfv = CountVectorizer(max_features = None, stop_words = list(sw))
tf = tfv.fit_transform(doc_strings.doc_string)
lda_terms = tfv.get_feature_names_out()

lda = LDA(n_components = 15, # number of topics
          max_iter = 50,
          learning_offset = 50.,
          random_state = 5001)

# doc-topic table (distribution of topic preference over each document)
THETA = pd.DataFrame(lda.fit_transform(tf), index = doc_strings.index)
THETA.columns.name = 'topic_id'

# sanity check
if not ((THETA.sum(axis = 1) < 1.000001).all() and
        (THETA.sum(axis = 1) > 0.999999).all()):
    raise ValueError('Documents do not sum to 1 over all topics')

# topic-word table (distribution of words over each topic)
PHI = pd.DataFrame(lda.components_, columns = lda_terms).T
# some terms were vectorized differently with CountVectorizer, causing problematic NAs
PHI = PHI.loc[~PHI.index.map(term_to_id_map).isna(), :]
PHI.index.name = 'term'
PHI.columns.name = 'topic_id'

# define topics by top 3 terms
TOPICS = PHI.stack().to_frame().rename(columns = {0 : 'weight'})\
            .groupby('topic_id')\
            .apply(lambda x:
                       x.weight.sort_values(ascending = False)\
                       .head(3)\
                       .reset_index()\
                       .drop('topic_id', axis = 1)\
                       .term)
TOPICS['label'] = TOPICS.apply(lambda x: str(x.name) + ' ' + ' '.join(x), 1)
TOPICS.drop(columns = [0, 1, 2], inplace = True)
TOPICS['doc_weight_sum'] = THETA.sum()
TOPICS.index.name = 'topic_id'

# make PHI have the same index as VOCAB
PHI = PHI.reset_index().rename(columns = {'index' : 'term'})
PHI['term_id'] = PHI.term.map(term_to_id_map).astype('int')
PHI = PHI.set_index('term_id')
PHI.drop(columns = 'term', inplace = True)

max_topic_id = pd.Series(np.array(PHI).argmax(axis = 1), index = PHI.index)
VOCAB['max_topic_id'] = VOCAB.index.map(max_topic_id)
F5_VOCAB = VOCAB

#### 5.1.3 Word Embeddings (Word2Vec)

In [36]:
# remove proper nouns
corpus = TOKEN.loc[~TOKEN.pos.str.match('NNPS?'), 'term']\
                .groupby('doc_id')\
                .apply(lambda x: x.tolist())\
                .reset_index()\
                .term\
                .tolist()
vec_model = Word2Vec(corpus,
                     vector_size = 50,
                     window = 5,
                     min_count = 2,
                     workers = 4)

vec = {}
for l in corpus:
    for element in l:
        try:
            vec[element] = vec_model.wv[element]
        except KeyError:
            pass
VECTORS = pd.DataFrame(vec).T.reset_index().rename(columns = {'index' : 'term'})
VECTORS.index = VECTORS.term.map(term_to_id_map)
VECTORS.index.name = 'term_id'
VECTORS.drop(columns = 'term', inplace = True)
VECTORS.columns = [f'word_vector[{i}]' for i in VECTORS.columns]

vec_model.save('word2vec.model')

#### 5.1.4 Sentiment analysis

In [109]:
# add sentiments from salex_nrc.csv used in class
SENTIMENTS = pd.read_csv('salex_nrc.csv', index_col = 'term_str')
SENTIMENTS.columns = [col.replace('nrc_', '') for col in SENTIMENTS.columns]
SENTIMENTS['polarity'] = SENTIMENTS.positive - SENTIMENTS.negative

# fix index to match VOCAB table
SENTIMENTS = SENTIMENTS.reset_index().rename(columns = {'term_str' : 'term'})
SENTIMENTS['term_id'] = SENTIMENTS.term.map(term_to_id_map)
SENTIMENTS = SENTIMENTS.dropna()
SENTIMENTS.term_id = SENTIMENTS.term_id.astype('int')
SENTIMENTS = SENTIMENTS.set_index('term_id')
SENTIMENTS.drop(columns = 'term', inplace = True)

### 2.5 F5 Results

In [38]:
f5_tables = [[F2_LIBRARY, F2_DOC, F3_TOKEN, F5_VOCAB], PCS, LOADINGS, THETA, PHI, TOPICS, VECTORS, SENTIMENTS]
f5_table_names = ['all', 'Principal_Components', 'Loadings', 'Theta', 'Phi', 'Topics', 'Vectors', 'Sentiments']
display_tables(f5_tables, 5, table = f5_table_names)

F5 Library table sample:


Unnamed: 0_level_0,doc_source,doc_title,doc_date,doc_url
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
671873,US News,Tokuda to Take Helm of Maui's Nisei Veterans M...,01/07/2019,https://www.usnews.com/news/best-states/hawaii...
368927,US News,Official: Law Enforcement Officers Hurt Servin...,01/18/2018,https://www.usnews.com/news/best-states/pennsy...
852197,US News,AP Top U.S. News at 12:53 A.m. EDT,08/04/2019,https://www.usnews.com/news/us/articles/2019-0...


F5 Document table sample:


Unnamed: 0_level_0,doc_content
doc_id,Unnamed: 1_level_1
698157,(Reuters) - Virginia lawmakers were due to mee...
1013601,Thailand's government and civil servants were ...
551578,CNNExclusive: Pentagon spokeswoman under inves...


F5 Token table sample:


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,token,pos,term,term_id
doc_id,sent_num,token_id,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
112176,0,4,in,IN,in,13035
264573,0,1,House,NNP,house,12605
165983,1,20,is,VBZ,is,13694


F5 Vocabulary table sample:


Unnamed: 0_level_0,term,count,punctuation,numeric,stopword,stem,pos_max,tfidf_sum,max_topic_id
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
5773,cohn,3,False,False,False,cohn,NNP,,
9379,ex-wife,8,False,False,False,ex-wif,NN,2.005143,
7930,discovery,14,False,False,False,discoveri,NNP,2.290573,13.0


F5 Loadings table sample:


Unnamed: 0_level_0,PC0_loading,PC1_loading,PC2_loading,PC3_loading,PC4_loading,PC5_loading,PC6_loading,PC7_loading,PC8_loading,PC9_loading
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
852,0.040887,-0.006666,-0.056283,0.014502,0.001703,0.015287,-0.042758,-0.016136,-0.006303,-0.021897
8392,0.000486,-0.001571,0.004573,-0.005497,0.003662,0.000105,-0.001151,0.000966,-0.000144,0.001474
3354,-0.003162,8.1e-05,0.003151,-0.001884,-0.002077,0.001159,-0.000544,-0.000606,-0.001107,-0.001726


F5 Theta table sample:


topic_id,0,1,2,3,4,5,6,7,8,9
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
366653,0.011111,0.011111,0.011111,0.011111,0.011111,0.011111,0.844444,0.011111,0.011111,0.011111
701016,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.766666,0.016667,0.016667
886766,0.006061,0.006061,0.006061,0.277776,0.006061,0.006061,0.006061,0.006061,0.006061,0.643436


F5 Phi table sample:


topic_id,0,1,2,3,4,5,6,7,8,9
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
22975,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
19062,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667
1446,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667,0.066667


F5 Topics table sample:


term,label,doc_weight_sum
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1
7,7 tax state plan,603.526244
1,1 story link stories,727.476
12,12 fire oil earnings,518.561565


F5 Vectors table sample:


Unnamed: 0_level_0,word_vector[0],word_vector[1],word_vector[2],word_vector[3],word_vector[4],word_vector[5],word_vector[6],word_vector[7],word_vector[8],word_vector[9]
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
19515,-0.118554,0.000284,-0.021293,0.014367,-0.013035,-0.321132,0.465697,0.432389,-0.297004,-0.285431
5778,-0.003832,-0.010398,0.016112,-0.007914,0.013224,-0.019965,0.033323,0.067391,-0.03352,-0.034333
8739,-0.018079,-0.00584,-0.02156,-0.001365,-0.007031,-0.026461,0.076192,0.063439,-0.028263,-0.029759


F5 Sentiments table sample:


Unnamed: 0_level_0,anger,anticipation,disgust,fear,joy,sadness,surprise,trust,polarity
term_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
11437,0,0,1,0,0,1,0,0,0
28705,1,0,0,1,0,1,0,0,-1
18958,0,1,0,0,1,0,0,1,1


In [39]:
f5_db = MyDB('tables/F5')
f5_db.save_table(f5_tables, f5_table_names)

## Next up: Statistical and Visual Exploration will be addressed in [rn7ena-final-project-eta.ipynb](rn7ena-final-project-eta.ipynb)