## Overview
The goal of your final project is to apply what you have learned in this course to create a digital analytical edition of a corpus that will support exploration of the social, historical, or cultural contents of that corpus. These contents are broadly conceived—they may be about language use, social events, cultural categories, sentiments, identity, taste, etc., and these may be described synchronically or diachronically, i.e. as structures or as trends over time.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- Convert the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- Annotate these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- Produce a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- Extend the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- Explore your results using statistical and visual methods.
- Present conclusions about patterns observed in the corpus by means of these operations.


## Deliverables
To receive full credit for the assignment, you will produce a digital analytical edition of a corpus, which will include a written report and be hosted on a dedicated GitHub repository.

This edition should include the following deliverables.

### Data Files
A collection of source files hosted on your UVA Box account. If these are too large to download conveniently, you should compress them as archive files (e.g., zip or tar.gz).

A collection of data files, each in CSV format, containing the F2 through F5 data you extracted from the corpus. These files should include, at a minimum, the following core tables:

- LIB.csv — Metadata for the source files.
- CORPUS.csv — This is a tokens table annotated with statistical and linguistic features, such as TFIDF. It should include an index that represents the OHCO of the documents in your corpus.
- VOCAB.csv — Annotated with statistical and linguistic features, such as DFIDF.

In addition, you should include the following data sets, either as features in the appropriate core table or as separate tables. Note that all tables should have an appropriate index and, where appropriate, an OHCO index.

#### Principal Components (PCA)

- Table of documents and components.
- Table of components and word counts (i.e., the “loadings”), either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.


#### Topic Models (LDA)

- Table of document and topic concentrations.
- Table of topics and term counts, either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.

#### Word Embeddings (word2vec)

- Terms and embeddings, either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.

#### Sentiment Analysis

- Sentiment and emotion values as features in VOCAB or as a separate table with a shared index with the VOCAB table.
- Sentiment polarity and emotions for each document.
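
The tables listed above are expected to carry their index, including the OHCO index, into the CSV files. A minimal sketch of how a multi-level index might survive a round trip through CSV is given here; the miniature table and its values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical miniature TOKEN table with a four-level OHCO index
OHCO = ['company_id', 'link_num', 'sent_num', 'token_num']
tokens = pd.DataFrame({
    'company_id': [1, 1, 1, 5],
    'link_num':   [0, 0, 1, 0],
    'sent_num':   [0, 0, 0, 0],
    'token_num':  [0, 1, 0, 0],
    'term_str':   ['metal', 'fasteners', 'buckles', 'castings'],
}).set_index(OHCO)

# to_csv writes the index levels as leading columns by default
buffer = io.StringIO()
tokens.to_csv(buffer)

# On load, name the same columns in index_col to rebuild the MultiIndex
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=OHCO)
```

Without `index_col`, the OHCO levels would come back as ordinary columns and would have to be restored with `set_index`.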

### Code Files
The Jupyter notebooks used to perform all operations that produced the data in your tables.

Any Jupyter notebooks used to explore and visualize the data in preparation for your final report.

Any Python files (e.g., .py files) you wrote to support your work.

Any other assets — e.g., images, stylesheets, JavaScript libraries, etc. — required by your notebooks.

### Report Document
A Jupyter notebook called FINAL_REPORT.ipynb describing your work and interpreting its results, along with links to all the files listed above. This report should be written using Markdown text cells and embedded graphics from your other notebooks to illustrate points. Do not reference images that are not listed in the notebook. You may embed saved image files to show figures in the notebook if you don't want to include the code that produced them there. Include citations for any references made in the notebook.

This notebook should contain the following five sections:

1. Introduction. Describe the nature of your corpus and the question(s) you've asked of the data.

2. Source Data. Provide a description of all relevant source files and describe the following features for each source file:

- Provenance: Where did they come from? Describe the website or other source and provide relevant URLs.
- Location: Provide a link to the source files in UVA Box.
- Description: What is the general subject matter of the corpus? How many observations are there? What is the average document length?
- Format: A description of both the file formats of the source files, e.g., plaintext, XML, CSV, etc., and the internal structure where applicable. For example, if XML then specify document type (e.g., TEI or XHTML).

3. Data Model. Describe the analytical tables you generated in the process of tokenization, annotation, and analysis of your corpus. Provide a list of tables with field names and their definitions, along with URLs to each associated CSV file.

4. Exploration. Describe each of your explorations, such as PCA and topic models. For each, include the relevant parameters and hyperparameters used to generate each model and visualization. For your visualizations, you should use at least three (but likely more) of the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps showing correlations
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots

5. Interpretation. Provide your interpretation of the results of exploration, and any conclusions if you are comfortable making them.

Regarding number of pages, a rule of thumb would be a six-page exported PDF. The question of length is secondary to the requirement that you complete all the sections.



### Form Level Description
- F0 Source Format. The initial source format of a text, which varies by collection, e.g. XML (e.g. TEI and RSS), HTML, plain text (e.g. Gutenberg), JSON, and CSV.
- F1 Machine Learning Corpus Format (MLCF). Ideally a table of minimum discursive units indexed by document content hierarchy.
- F2 Standard Text Analytic Data Model (STADM). A normalized set of tables including DOC, TOKEN, and TERM tables. Produced by the tokenization of F1 data.
- F3 NLP Annotated STADM. STADM with annotations added to token and term records indicating stopwords, parts-of-speech, stems and lemmas, named entities, grammatical dependencies, sentiments, etc.
- F4 STADM with Vector Space models. Vector space representations of TOKEN data and resulting statistical data, such as term frequency and TFIDF.
- F5 STADM with analytical models. STADM with columns and tables added for outputs of fitting and transforming models with the data.
- F6 STADM converted into interactive visualization. STADM represented as a database-driven application with interactive visualization, e.g. Jupyter notebooks and web applications.

In [30]:
import pandas as pd
import seaborn as sns
import nltk
nltk.download("stopwords")
import numpy as np
import re
from numpy.linalg import norm
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import plotly.express as px
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from gensim.models import word2vec

[nltk_data] Downloading package stopwords to /home/npm5ct/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
# OHCO levels mapped onto the book-based model used in the course:
# company_id (company_num in the raw data) = BOOKS
# link_num = CHAPTERS
# text = PARAS

OHCO = ['company_id', 'link_num', 'sent_num', 'token_num']
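
Because the OHCO list runs from the largest unit to the smallest, a prefix of it can serve as a "bag" that decides what counts as a document. The sketch below illustrates this with a hypothetical four-token table; it uses the level names that the tables later in this notebook carry (`company_id`, `link_num`, and so on), and the names `PAGES` and `COMPANIES` are introduced here only for illustration.

```python
import pandas as pd

OHCO = ['company_id', 'link_num', 'sent_num', 'token_num']
PAGES = OHCO[:2]      # bag = one scraped page
COMPANIES = OHCO[:1]  # bag = one company

# Hypothetical token table, one row per token
tokens = pd.DataFrame(
    {'term_str': ['metal', 'fasteners', 'buckles', 'castings']},
    index=pd.MultiIndex.from_tuples(
        [(1, 0, 0, 0), (1, 0, 0, 1), (1, 1, 0, 0), (5, 0, 0, 0)],
        names=OHCO),
)

# Grouping by a prefix of the OHCO aggregates at that level of the hierarchy
tokens_per_page = tokens.groupby(PAGES).term_str.count()
tokens_per_company = tokens.groupby(COMPANIES).term_str.count()
```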

### F0

#### Source Format. The initial source format of a text, which varies by collection, e.g. XML (e.g. TEI and RSS), HTML, plain text (e.g. Gutenberg), JSON, and CSV.

In [32]:
df = pd.read_csv('data/RAW.tar.gz', compression='gzip', lineterminator='\n')
df

Unnamed: 0,company_num,Text,characters,words
0,1,Manufacturer ofMetal FastenersandGeneral Hardw...,1025,150
1,1,404-Page Not Found Please check the URL for pr...,269,39
2,1,Buckles CMC specializes in design and manufact...,917,145
3,1,Clasps CMC specializes in design and manufactu...,664,105
4,1,Loop-Rings CMC specializes in design and manuf...,808,123
...,...,...,...,...
2265,1219,PROTOTYPE TO PRODUCTION Magnesium Typical AZ...,88,13
2266,1219,Contact What Are You Waiting For? Protocast I...,130,21
2267,1222,HOME ABOUT PENTACAST SERVICES CONTACT More Pen...,352,47
2268,1222,HOME ABOUT PENTACAST SERVICES CONTACT More SER...,5105,769


In [33]:
df.to_csv('./data/filtered.csv', index=False)

### F1

#### Machine Learning Corpus Format (MLCF). Ideally a table of minimum discursive units indexed by document content hierarchy.

In [34]:
df = pd.read_csv('data/filtered.csv', lineterminator='\n')

In [35]:
df

Unnamed: 0,company_num,Text,characters,words
0,1,Manufacturer ofMetal FastenersandGeneral Hardw...,1025,150
1,1,404-Page Not Found Please check the URL for pr...,269,39
2,1,Buckles CMC specializes in design and manufact...,917,145
3,1,Clasps CMC specializes in design and manufactu...,664,105
4,1,Loop-Rings CMC specializes in design and manuf...,808,123
...,...,...,...,...
2265,1219,PROTOTYPE TO PRODUCTION Magnesium Typical AZ...,88,13
2266,1219,Contact What Are You Waiting For? Protocast I...,130,21
2267,1222,HOME ABOUT PENTACAST SERVICES CONTACT More Pen...,352,47
2268,1222,HOME ABOUT PENTACAST SERVICES CONTACT More SER...,5105,769


# Create LIB

In [36]:
LIB = df[['company_num', 'words']].groupby('company_num').agg(['sum', 'count'])['words'].reset_index()\
.rename(columns={'sum':'total_words', 'count':'total_links'})
LIB

Unnamed: 0,company_num,total_words,total_links
0,1,2796,26
1,5,368,9
2,10,845,11
3,12,1194,7
4,14,682,3
...,...,...,...
295,1211,968,7
296,1212,372,3
297,1216,1948,9
298,1219,2084,12


In [37]:
LIB.to_csv('./data/LIB.csv', index=False)
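
The same per-company summary can also be written with pandas named aggregation, which sets the output column names directly instead of renaming them afterwards. This is only an alternative sketch on hypothetical rows, not the cell that produced the table above.

```python
import pandas as pd

# Hypothetical page-level rows: one row per scraped page
pages = pd.DataFrame({
    'company_num': [1, 1, 5],
    'words': [150, 39, 368],
})

# Each keyword names an output column as (source column, aggregation)
LIB = (
    pages.groupby('company_num')
    .agg(total_words=('words', 'sum'), total_links=('words', 'count'))
    .reset_index()
)
```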

# Create CORPUS

In [38]:
# CORPUS indexed by company_id
df['link_num'] = df.groupby('company_num').cumcount()

CORPUS = df[["company_num", "link_num" ,"Text", "words"]]
CORPUS = CORPUS.rename(columns={'company_num': 'company_id'})
CORPUS = CORPUS.rename(columns={'Text': 'text'})
CORPUS = CORPUS.set_index(["company_id", "link_num"])
CORPUS

Unnamed: 0_level_0,Unnamed: 1_level_0,text,words
company_id,link_num,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,Manufacturer ofMetal FastenersandGeneral Hardw...,150
1,1,404-Page Not Found Please check the URL for pr...,39
1,2,Buckles CMC specializes in design and manufact...,145
1,3,Clasps CMC specializes in design and manufactu...,105
1,4,Loop-Rings CMC specializes in design and manuf...,123
...,...,...,...
1219,10,PROTOTYPE TO PRODUCTION Magnesium Typical AZ...,13
1219,11,Contact What Are You Waiting For? Protocast I...,21
1222,0,HOME ABOUT PENTACAST SERVICES CONTACT More Pen...,47
1222,1,HOME ABOUT PENTACAST SERVICES CONTACT More SER...,769


In [39]:
CORPUS.to_csv('./data/CORPUS.csv')

### F2
: Convert the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model

#### Standard Text Analytic Data Model (STADM). A normalized set of tables including DOC, TOKEN, and TERM tables. Produced by the tokenization of F1 data.

# Create SENTS

In [40]:
%%time
sent_pat = r'[.?!;:]+'
SENTS = CORPUS['text'].str.split(sent_pat, expand=True).stack().to_frame('sent_str')
SENTS.index.names = ["company_id", "link_num", "sent_num"]
SENTS

CPU times: user 90.6 ms, sys: 10.1 ms, total: 101 ms
Wall time: 97.1 ms


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sent_str
company_id,link_num,sent_num,Unnamed: 3_level_1
1,0,0,Manufacturer ofMetal FastenersandGeneral Hardw...
1,0,1,"(CMC) specializes in the design, manufacture a..."
1,0,2,"Items are available in Zinc, Brass, and Steel..."
1,0,3,"Most items are made in the USA, are Berry Ame..."
1,0,4,Ask us aboutCustom Fabricated Partsmade to yo...
...,...,...,...
1222,2,6,© 2017 by PentaCast Inc
1222,2,7,Tel
1222,2,8,519
1222,2,9,245


In [41]:
SENTS.to_csv('./data/SENTS.csv')
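
The split, expand, and stack pattern used for SENTS can be traced on two short strings. The sketch below uses hypothetical page text; its last three lines are an optional clean-up that the notebook does not perform, which is why fragments such as `Tel` and `519` appear as separate "sentences" in the output above.

```python
import pandas as pd

sent_pat = r'[.?!;:]+'
pages = pd.Series(
    ['Quality castings. Call us today!', 'Tel: 519'],
    index=pd.Index([0, 1], name='link_num'),
    name='text',
)

# expand=True gives one column per fragment; stack() moves those columns
# into a new innermost index level
sents = pages.str.split(sent_pat, expand=True).stack().to_frame('sent_str')
sents.index.names = ['link_num', 'sent_num']

# Optional clean-up (an assumption, not part of the notebook):
# drop padding cells, trim whitespace, and remove empty fragments
sents = sents.dropna()
sents['sent_str'] = sents.sent_str.str.strip()
sents = sents[sents.sent_str != '']
```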

# Create TOKENS

In [42]:
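# NOTE: despite its name, True selects nltk.word_tokenize (punctuation becomes separate tokens);
# False selects NLTK's WhitespaceTokenizer.
keep_whitespace = True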

In [43]:
%%time
if keep_whitespace:
    TOKENS = SENTS.sent_str\
            .apply(lambda x: pd.Series(nltk.pos_tag(nltk.word_tokenize(x))))\
            .stack()\
            .to_frame('pos_tuple')
else:
    TOKENS = SENTS.sent_str\
            .apply(lambda x: pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x))))\
            .stack()\
            .to_frame('pos_tuple')



CPU times: user 28.5 s, sys: 1.14 s, total: 29.7 s
Wall time: 29.8 s


In [44]:
TOKENS.index.names = ["company_id", "link_num", "sent_num", "token_num"]
TOKENS

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple
company_id,link_num,sent_num,token_num,Unnamed: 4_level_1
1,0,0,0,"(Manufacturer, NNP)"
1,0,0,1,"(ofMetal, JJ)"
1,0,0,2,"(FastenersandGeneral, NNP)"
1,0,0,3,"(Hardware, NNP)"
1,0,0,4,"(Custom, NNP)"
...,...,...,...,...
1222,2,6,4,"(Inc, NNP)"
1222,2,7,0,"(Tel, NN)"
1222,2,8,0,"(519, CD)"
1222,2,9,0,"(245, CD)"


### F3 
: NLP Annotated STADM. STADM with annotations added to token and term records indicating stopwords, parts-of-speech, stems and lemmas, named entities, grammatical dependencies, sentiments, etc.

In [45]:
def clean_tokens(df):
    """Keep alphabetic noun tokens of at most 15 characters."""
    df = df.copy()  # work on a copy so the caller's table is not modified

    # Unpack the (token, POS tag) tuples produced by nltk.pos_tag
    df['token'] = df['pos_tuple'].apply(lambda x: x[0])
    df['pos'] = df['pos_tuple'].apply(lambda x: x[1])

    # Keep alphabetic, non-numeric tokens whose tag starts with 'N' (nouns)
    keep = (
        df['token'].str.isalpha()
        & df['pos'].str.startswith('N')
        & ~df['token'].str.isnumeric()
        & (df['token'].str.len() <= 15)
    )

    # Boolean filtering preserves the OHCO MultiIndex, so no re-indexing is needed
    return df[keep].drop(columns=['token'])

TOKENS = clean_tokens(TOKENS)
TOKENS

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple,pos
company_id,link_num,sent_num,token_num,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0,0,0,"(Manufacturer, NNP)",NNP
1,0,0,3,"(Hardware, NNP)",NNP
1,0,0,4,"(Custom, NNP)",NNP
1,0,0,5,"(Metal, NNP)",NNP
1,0,0,6,"(Crafters, NNP)",NNP
...,...,...,...,...,...
1222,2,4,0,"(Success, NN)",NN
1222,2,5,1,"(message, NN)",NN
1222,2,6,3,"(PentaCast, NNP)",NNP
1222,2,6,4,"(Inc, NNP)",NNP


In [46]:
%%time
TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
TOKENS['token_str'] = TOKENS.pos_tuple.apply(lambda x: x[0])
TOKENS['term_str'] = TOKENS.token_str.str.lower()
TOKENS

CPU times: user 137 ms, sys: 1.08 ms, total: 138 ms
Wall time: 137 ms


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple,pos,token_str,term_str
company_id,link_num,sent_num,token_num,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,0,0,0,"(Manufacturer, NNP)",NNP,Manufacturer,manufacturer
1,0,0,3,"(Hardware, NNP)",NNP,Hardware,hardware
1,0,0,4,"(Custom, NNP)",NNP,Custom,custom
1,0,0,5,"(Metal, NNP)",NNP,Metal,metal
1,0,0,6,"(Crafters, NNP)",NNP,Crafters,crafters
...,...,...,...,...,...,...,...
1222,2,4,0,"(Success, NN)",NN,Success,success
1222,2,5,1,"(message, NN)",NN,message,message
1222,2,6,3,"(PentaCast, NNP)",NNP,PentaCast,pentacast
1222,2,6,4,"(Inc, NNP)",NNP,Inc,inc


In [47]:
# SAVE TOKENS TABLE
TOKENS.to_csv("./data/TOKENS.csv")

# Create DOCS

In [48]:
BAG = ['company_id', 'link_num']

DOCS = TOKENS\
    .groupby(BAG).term_str\
    .apply(lambda x: ' '.join(x))\
    .to_frame()\
    .rename(columns={'term_str':'doc_str'})
DOCS

Unnamed: 0_level_0,Unnamed: 1_level_0,doc_str
company_id,link_num,Unnamed: 2_level_1
1,0,manufacturer hardware custom metal crafters wa...
1,1,found please url spelling capitalization troub...
1,2,buckles cmc design manufacturing standard cust...
1,3,clasps cmc design manufacturing standard custo...
1,4,cmc design manufacturing standard custom loop ...
...,...,...
1219,10,prototype to production magnesium typical prop...
1219,11,are protocast inc p e avecommerce city made in...
1222,0,home about pentacast services contact more pen...
1222,1,home about pentacast services contact more ser...


In [49]:
# SAVE DOCS TABLE
DOCS.to_csv("./data/DOCS.csv")

# Create VOCAB

In [50]:
%%time
VOCAB = TOKENS.term_str.value_counts().to_frame('n')
VOCAB.index.name = 'term_str'
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = -np.log2(VOCAB.p)
VOCAB['n_chars'] = VOCAB.index.str.len()
VOCAB['max_pos'] = TOKENS[['term_str','pos']].value_counts().unstack(fill_value=0).idxmax(1)
VOCAB

CPU times: user 129 ms, sys: 3.95 ms, total: 133 ms
Wall time: 133 ms


Unnamed: 0_level_0,n,p,i,n_chars,max_pos
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
quality,1369,0.009497,6.718315,7,NN
castings,1228,0.008519,6.875126,8,NNS
products,1199,0.008318,6.909605,8,NNS
casting,1194,0.008283,6.915634,7,NNP
contact,944,0.006549,7.254578,7,NNP
...,...,...,...,...,...
beane,1,0.000007,17.137221,5,NNP
lucas,1,0.000007,17.137221,5,NNP
hernandez,1,0.000007,17.137221,9,NNP
enrique,1,0.000007,17.137221,7,NNP
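
The `p` and `i` columns are the relative frequency of each term and its self-information in bits, i = -log2(p). In the table above, a term that occurs once has i of about 17.14, which corresponds to a corpus of roughly 2^17.14, or about 144,000, retained tokens; the ratio n/p for `quality` (1369 / 0.009497) gives the same figure. A small sketch on a hypothetical term stream:

```python
import numpy as np
import pandas as pd

# Hypothetical term stream standing in for TOKENS.term_str
terms = pd.Series(['casting', 'casting', 'quality', 'foundry'], name='term_str')

vocab = terms.value_counts().to_frame('n')
vocab['p'] = vocab.n / vocab.n.sum()   # relative frequency
vocab['i'] = -np.log2(vocab.p)         # self-information in bits
```

Halving a term's probability adds exactly one bit: here `casting` (p = 0.5) carries 1 bit and `quality` (p = 0.25) carries 2.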


### F4 
: STADM with Vector Space models. Vector space representations of TOKEN data and resulting statistical data, such as term frequency and TFIDF.

In [51]:
VOCAB['n_pos'] = TOKENS[['term_str','pos']].value_counts().unstack().count(1)
VOCAB['cat_pos'] = TOKENS[['term_str','pos']].value_counts().to_frame('n').reset_index()\
    .groupby('term_str').pos.apply(lambda x: set(x))

In [52]:
VOCAB

Unnamed: 0_level_0,n,p,i,n_chars,max_pos,n_pos,cat_pos
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
quality,1369,0.009497,6.718315,7,NN,3,"{NNS, NNP, NN}"
castings,1228,0.008519,6.875126,8,NNS,3,"{NNS, NNP, NNPS}"
products,1199,0.008318,6.909605,8,NNS,4,"{NNS, NNP, NN, NNPS}"
casting,1194,0.008283,6.915634,7,NNP,2,"{NNP, NN}"
contact,944,0.006549,7.254578,7,NNP,2,"{NNP, NN}"
...,...,...,...,...,...,...,...
beane,1,0.000007,17.137221,5,NNP,1,{NNP}
lucas,1,0.000007,17.137221,5,NNP,1,{NNP}
hernandez,1,0.000007,17.137221,9,NNP,1,{NNP}
enrique,1,0.000007,17.137221,7,NNP,1,{NNP}


In [53]:
# SAVE VOCAB TABLE
VOCAB.to_csv("./data/VOCAB.csv")

# Create BOW

In [54]:
def create_bow(CORPUS, bag, item_type='term_str'):
    BOW = CORPUS.groupby(bag+[item_type])[item_type].count().to_frame('n')
    return BOW

BOW = create_bow(TOKENS, ['company_id'])
BOW

Unnamed: 0_level_0,Unnamed: 1_level_0,n
company_id,term_str,Unnamed: 2_level_1
1,access,4
1,accessories,1
1,accuracy,1
1,act,1
1,actions,1
...,...,...
1222,wall,1
1222,works,1
1222,x,3
1222,years,1


In [55]:
BOW.to_csv('./data/BOW.csv')

# Create TFIDF and DFIDF

In [56]:
def get_tfidf_dfidf(BOW, tf_method='max', df_method='standard', item_type='term_str'):            
    DTCM = BOW.n.unstack() # Create Doc-Term Count Matrix
    
    if tf_method == 'sum':
        TF = (DTCM.T / DTCM.T.sum()).T
    elif tf_method == 'max':
        TF = (DTCM.T / DTCM.T.max()).T
    elif tf_method == 'log':
        TF = (np.log2(DTCM.T + 1)).T
    elif tf_method == 'raw':
        TF = DTCM
    elif tf_method == 'bool':
        TF = DTCM.astype('bool').astype('int')
    else:
        raise ValueError(f"TF method {tf_method} not found.")

    DF = DTCM.count() # Document frequency: count() skips the NaNs that unstack() leaves for absent terms
    N_docs = len(DTCM)
    
    if df_method == 'standard':
        IDF = np.log10(N_docs/DF) # This is what the students were asked to use
    elif df_method == 'textbook':
        IDF = np.log10(N_docs/(DF + 1))
    elif df_method == 'sklearn':
        IDF = np.log10(N_docs/DF) + 1
    elif df_method == 'sklearn_smooth':
        IDF = np.log10((N_docs + 1)/(DF + 1)) + 1
    else:
        raise ValueError(f"DF method {df_method} not found.")
    
    TFIDF = TF * IDF
    
    DFIDF = DF * IDF
    
    TFIDF = TFIDF.fillna(0)

    return TFIDF, DFIDF
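
To make the default path through this function concrete (`tf_method='max'`, `df_method='standard'`), the sketch below repeats the same steps on a hypothetical two-document bag of words. The counts are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical bag-of-words counts for two documents
BOW = pd.DataFrame(
    {'n': [2, 1, 4]},
    index=pd.MultiIndex.from_tuples(
        [(1, 'casting'), (1, 'zinc'), (2, 'casting')],
        names=['company_id', 'term_str']),
)

DTCM = BOW.n.unstack()            # absent doc-term pairs become NaN
TF = (DTCM.T / DTCM.T.max()).T    # scale by each document's most frequent term
DF = DTCM.count()                 # number of documents containing each term
IDF = np.log10(len(DTCM) / DF)    # 'standard' IDF
TFIDF = (TF * IDF).fillna(0)
```

Here `casting` appears in both documents, so its IDF is log10(2/2) = 0 and its TFIDF is 0 everywhere, while `zinc` in document 1 gets 0.5 * log10(2).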

In [57]:
TFIDF, DFIDF = get_tfidf_dfidf(BOW)

In [58]:
TFIDF

term_str,a,aa,aaa,aac,aact,aaron,aashto,ab,abb,abbeville,...,ዃ,䭲,薑,齺,𝐂𝐨𝐦𝐩𝐚𝐜𝐭,𝐆𝐥𝐨𝐛𝐚𝐥,𝐍𝐚𝐭𝐢𝐨𝐧𝐬,𝐔𝐧𝐢𝐭𝐞𝐝,𝘾𝙧𝙤𝙣𝙞𝙩𝙚,𝙂𝙧𝙤𝙪𝙥
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.055800,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,0.000000,0.0,0.0,2.176091,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1212,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1216,0.007672,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1219,0.035074,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
VOCAB['dfidf'] = DFIDF
VOCAB['mean_tfidf'] = TFIDF.mean()

In [60]:
VOCAB.sort_values('mean_tfidf', ascending=False)

Unnamed: 0_level_0,n,p,i,n_chars,max_pos,n_pos,cat_pos,dfidf,mean_tfidf
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
die,563,0.003906,8.000230,3,NNP,3,"{NNS, NNP, NN}",44.436266,0.057658
castings,1228,0.008519,6.875126,8,NNS,3,"{NNS, NNP, NNPS}",44.741763,0.056828
casting,1194,0.008283,6.915634,7,NNP,2,"{NNP, NN}",45.154499,0.051122
foundry,717,0.004974,7.651392,7,NNP,3,"{NNS, NNP, NN}",47.789150,0.048962
inc,805,0.005584,7.484376,3,NNP,2,"{NNP, NN}",47.868939,0.044545
...,...,...,...,...,...,...,...,...,...
nortbrook,1,0.000007,17.137221,9,NNP,1,{NNP},2.477121,0.000067
beach,1,0.000007,17.137221,5,NNP,1,{NNP},2.477121,0.000067
crowne,1,0.000007,17.137221,6,NNP,1,{NNP},2.477121,0.000067
lawai,1,0.000007,17.137221,5,NNP,1,{NNP},2.477121,0.000067


In [61]:
# TOP 30 meaningful words based on mean_tfidf
VOCAB.sort_values('mean_tfidf', ascending=False).index[:30]

Index(['die', 'castings', 'casting', 'foundry', 'inc', 'aluminum', 'bronze',
       'brass', 'sand', 'com', 'website', 'cnc', 'alloys', 'zinc', 'rights',
       'copyright', 'machining', 'investment', 'steel', 'parts', 'precision',
       'cast', 'site', 'information', 'metal', 'alloy', 'us', 'copper',
       'email', 'cookies'],
      dtype='object', name='term_str')

In [62]:
VOCAB.to_csv('./data/VOCAB.csv')

# Create VIDX and MT

In [65]:
VIDX = VOCAB.sort_values('dfidf', ascending=False)\
    .head(1000).index

In [66]:
VIDX

Index(['today', 'machine', 'work', 'steel', 'variety', 'email', 'cast',
       'engineering', 'inc', 'range',
       ...
       'oems', 'forge', 'july', 'gate', 'jet', 'footprint', 'j', 'flange',
       'resume', 'contractors'],
      dtype='object', name='term_str', length=1000)

In [67]:
MT = TFIDF[VIDX].groupby('company_id').mean().fillna(0) # MUST FILLNA

In [68]:
MT

term_str,today,machine,work,steel,variety,email,cast,engineering,inc,range,...,oems,forge,july,gate,jet,footprint,j,flange,resume,contractors
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.000000,0.000000,0.000000,0.168568,0.134454,0.192747,0.331080,0.000000,0.262604,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.043507
5,0.000000,0.000000,0.000000,0.061129,0.000000,0.000000,0.000000,0.021515,0.000000,0.039301,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
10,0.059958,0.000000,0.000000,0.000000,0.000000,0.000000,0.191007,0.020537,0.000000,0.056272,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
12,0.000000,0.000000,0.000000,0.000000,0.060504,0.134923,0.076403,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
14,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.024719,0.026577,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,0.011884,0.011884,0.011884,0.000000,0.000000,0.000000,0.000000,0.036634,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
1212,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.045182,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
1216,0.000000,0.000000,0.000000,0.005349,0.000000,0.090109,0.000000,0.000000,0.005158,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.017947,0.0,0.0,0.0,0.000000
1219,0.025125,0.113064,0.025125,0.036677,0.025354,0.000000,0.048025,0.000000,0.058952,0.011790,...,0.0,0.0,0.0,0.0,0.0,0.041021,0.0,0.0,0.0,0.000000


# Create L0, L1, L2

In [69]:
L0 = MT.astype('bool').astype('int') # Binary (Pseudo L)
L1 = MT.apply(lambda x: x / x.sum(), 1) # Manhattan (Probabilistic)
L2 = MT.apply(lambda x: x / norm(x), 1) # Euclidean
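
The three normalizations can be checked on a tiny matrix: after L1 scaling every row sums to 1, and after L2 scaling every row has Euclidean length 1. The sketch below uses hypothetical values and a vectorized `div` in place of `apply`; it is an illustration, not the cell above.

```python
import numpy as np
import pandas as pd
from numpy.linalg import norm

# Hypothetical document-term weights
M = pd.DataFrame([[3.0, 4.0, 0.0], [1.0, 0.0, 1.0]], columns=['a', 'b', 'c'])

L0 = M.astype('bool').astype('int')    # presence / absence
L1 = M.div(M.sum(axis=1), axis=0)      # rows sum to 1
L2 = M.div(norm(M, axis=1), axis=0)    # rows have unit Euclidean length
```

A row that is entirely zero would produce NaN under both L1 and L2 scaling, so such rows may need to be dropped or filled first.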

# Create PAIRS and CORR_MATRIX

In [70]:
PAIRS = 1 - MT.T.corr().stack().to_frame('correl')
PAIRS.index.names = ['doc_a','doc_b']
PAIRS = PAIRS.query("doc_a < doc_b") # Remove identities and reverse duplicates; the upper triangle matches the order of pdist() output

general_method = 'weighted' # single, complete, average, weighted 
euclidean_method = 'ward' # ward, centroid, median
combos  = [
    (L2, 'euclidean', 'euclidean', euclidean_method),
    (MT,  'cosine', 'cosine', euclidean_method),
    (MT,  'cityblock', 'cityblock', general_method),
    (L0, 'jaccard', 'jaccard', general_method),
    (L1, 'jensenshannon', 'js', general_method),
]

for X, metric, label, _ in combos:
    PAIRS[label] = pdist(X, metric)
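
Assigning a condensed distance vector to a pair table only lines up if the rows are in the same order as the vector. SciPy documents the condensed output of `pdist` as running over pairs (i, j) with i < j, row by row. The sketch below rebuilds that order with `itertools.combinations` and computes the distances by hand with NumPy, so the alignment is visible; the document vectors are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical document vectors, indexed by company_id
X = pd.DataFrame([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]],
                 index=pd.Index([1, 5, 10], name='company_id'))

# Pairs (i, j) with i < j, row by row: the order of a condensed distance vector
pair_index = pd.MultiIndex.from_tuples(
    list(combinations(X.index, 2)), names=['doc_a', 'doc_b'])

pairs = pd.DataFrame(index=pair_index)
pairs['euclidean'] = [np.linalg.norm(X.loc[a] - X.loc[b]) for a, b in pair_index]
```

A pair table filtered to the lower triangle (doc_a > doc_b) has the same number of rows but lists them in a different order, so a positional assignment would not raise an error even though the values would be attached to the wrong pairs.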

In [71]:
corr_type = 'kendall'
CORR_MATRIX = MT.T.corr(corr_type)

#LIB['kendall_sum'] = CORR_MATRIX.sum()

In [72]:
np.fill_diagonal(CORR_MATRIX.values, 0)

In [73]:
CORR_MATRIX

company_id,1,5,10,12,14,15,22,37,43,44,...,1188,1191,1203,1208,1209,1211,1212,1216,1219,1222
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.000000,0.132190,0.154811,0.121842,0.081024,0.111586,0.155693,0.196721,0.087955,0.079855,...,0.224599,0.131308,0.099751,0.176767,0.175804,0.138649,0.066773,0.127660,0.087683,0.015685
5,0.132190,0.000000,0.159047,0.080203,0.150023,0.101994,0.101329,0.115747,0.154385,0.148781,...,0.137387,0.094860,0.183032,0.110708,0.161698,0.173293,0.017827,0.154018,0.154032,0.114125
10,0.154811,0.159047,0.000000,0.173576,0.159651,0.134446,0.140818,0.157772,0.158581,0.133862,...,0.125101,0.106830,0.196684,0.174179,0.238319,0.195241,0.129415,0.162665,0.162228,0.159862
12,0.121842,0.080203,0.173576,0.000000,0.155825,0.082150,0.094252,0.114499,0.175814,0.058866,...,0.157394,0.037605,0.082551,0.104782,0.222849,0.080639,0.097196,0.103341,0.064106,0.033138
14,0.081024,0.150023,0.159651,0.155825,0.000000,0.098556,0.166978,0.231023,0.115923,0.142436,...,0.016248,0.024190,0.147507,0.101357,0.173361,0.211132,0.130177,0.072651,0.221168,0.115353
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,0.138649,0.173293,0.195241,0.080639,0.211132,0.093203,0.191716,0.223455,0.123391,0.095404,...,0.038929,0.108147,0.248318,0.167760,0.168234,0.000000,0.044832,0.131283,0.182738,0.144750
1212,0.066773,0.017827,0.129415,0.097196,0.130177,0.092788,0.060987,0.025501,0.044571,0.075567,...,0.054363,0.009751,0.122212,0.071265,0.018342,0.044832,0.000000,0.054411,0.054750,0.078275
1216,0.127660,0.154018,0.162665,0.103341,0.072651,0.057524,0.142452,0.108100,0.120375,0.068407,...,0.119055,0.101428,0.163248,0.147754,0.265168,0.131283,0.054411,0.000000,0.117881,0.118311
1219,0.087683,0.154032,0.162228,0.064106,0.221168,0.182750,0.163457,0.173549,0.088934,0.190853,...,0.039948,0.073408,0.221648,0.138196,0.112833,0.182738,0.054750,0.117881,0.000000,0.183312


# Explore CORR_MATRIX

In [74]:
max_corr = CORR_MATRIX.max(axis=0)

In [75]:
max_corr_idx = CORR_MATRIX.idxmax(axis=0)

In [76]:
corr_pairs = pd.concat([max_corr_idx, max_corr], axis=1).rename(columns={0:'Max_id', 1:'Max_correlation'})
corr_pairs

Unnamed: 0_level_0,Max_id,Max_correlation
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,340,0.444475
5,655,0.230997
10,730,0.283224
12,1105,0.389843
14,37,0.231023
...,...,...
1211,777,0.261281
1212,919,0.172608
1216,783,0.277148
1219,536,0.299074
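
The nearest-neighbour lookup above can be sketched on a small correlation matrix. The values are hypothetical, and the diagonal is zeroed on a copy of the array rather than in place, which is an assumption made here to avoid depending on whether `.values` returns a writable view.

```python
import numpy as np
import pandas as pd

ids = pd.Index([1, 5, 10], name='company_id')
corr = pd.DataFrame(
    [[1.0, 0.2, 0.4],
     [0.2, 1.0, 0.1],
     [0.4, 0.1, 1.0]], index=ids, columns=ids)

# Mask self-correlations so a document is never its own best match
values = corr.to_numpy(copy=True)
np.fill_diagonal(values, 0)
masked = pd.DataFrame(values, index=ids, columns=ids)

nearest = pd.DataFrame({
    'Max_id': masked.idxmax(axis=0),
    'Max_correlation': masked.max(axis=0),
})
```

Zeroing the diagonal works when each document has at least one positive correlation; if all of a document's correlations were negative, the zero on the diagonal would win and the document would be matched to itself.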


In [77]:
TFIDF_SMALL = TFIDF[VIDX]
TFIDF_SMALL

term_str,today,machine,work,steel,variety,email,cast,engineering,inc,range,...,oems,forge,july,gate,jet,footprint,j,flange,resume,contractors
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.000000,0.000000,0.000000,0.168568,0.134454,0.192747,0.331080,0.000000,0.262604,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.043507
5,0.000000,0.000000,0.000000,0.061129,0.000000,0.000000,0.000000,0.021515,0.000000,0.039301,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
10,0.059958,0.000000,0.000000,0.000000,0.000000,0.000000,0.191007,0.020537,0.000000,0.056272,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
12,0.000000,0.000000,0.000000,0.000000,0.060504,0.134923,0.076403,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
14,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.024719,0.026577,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,0.011884,0.011884,0.011884,0.000000,0.000000,0.000000,0.000000,0.036634,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
1212,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.045182,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000
1216,0.000000,0.000000,0.000000,0.005349,0.000000,0.090109,0.000000,0.000000,0.005158,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.017947,0.0,0.0,0.0,0.000000
1219,0.025125,0.113064,0.025125,0.036677,0.025354,0.000000,0.048025,0.000000,0.058952,0.011790,...,0.0,0.0,0.0,0.0,0.0,0.041021,0.0,0.0,0.0,0.000000


In [78]:
# normalize doc vector lengths
TFIDF_L2 = (TFIDF_SMALL.T / norm(TFIDF_SMALL, 2, axis=1)).T

# center term vectors
TFIDF_L2 = TFIDF_L2 - TFIDF_L2.mean()

TFIDF_L2

term_str,today,machine,work,steel,variety,email,cast,engineering,inc,range,...,oems,forge,july,gate,jet,footprint,j,flange,resume,contractors
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,-0.011544,-0.017719,-0.012737,0.071572,0.066224,0.088042,0.162926,-0.015846,0.113212,-0.011010,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,0.021561
5,-0.011544,-0.017719,-0.012737,0.086841,-0.008526,-0.019116,-0.021139,0.022513,-0.032783,0.059059,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,-0.002626
10,0.054641,-0.017719,-0.012737,-0.022144,-0.008526,-0.019116,0.189705,0.006824,-0.032783,0.051106,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,-0.002626
12,-0.011544,-0.017719,-0.012737,-0.022144,0.029331,0.065305,0.026666,-0.015846,-0.032783,-0.011010,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,-0.002626
14,-0.011544,-0.017719,-0.012737,-0.022144,-0.008526,-0.019116,0.000031,0.006916,-0.032783,-0.011010,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,-0.002626
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1211,0.003521,-0.002653,0.002329,-0.022144,-0.008526,-0.019116,-0.021139,0.030597,-0.032783,-0.011010,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,-0.002626
1212,-0.011544,-0.017719,-0.012737,-0.022144,-0.008526,-0.019116,-0.021139,0.012510,-0.032783,-0.011010,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,-0.001975,-0.002617,-0.008426,-0.002221,-0.002626
1216,-0.011544,-0.017719,-0.012737,-0.014488,-0.008526,0.109865,-0.021139,-0.015846,-0.025399,-0.011010,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,0.023714,-0.002617,-0.008426,-0.002221,-0.002626
1219,0.006242,0.062321,0.005050,0.003820,0.009423,-0.019116,0.012859,-0.015846,0.008950,-0.002664,...,-0.003239,-0.015122,-0.002513,-0.004676,-0.003063,0.027064,-0.002617,-0.008426,-0.002221,-0.002626


In [79]:
TFIDF_L2.to_csv('./data/TFIDF.csv')
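
The two preparation steps above (unit-length rows, then mean-centered columns) can be verified on a toy matrix. The last line of the sketch is an assumed next step that this section does not show: the covariance of the centered term vectors, which is the usual input to an eigendecomposition-based PCA. All values are hypothetical.

```python
import numpy as np
import pandas as pd
from numpy.linalg import norm

# Hypothetical TFIDF values for three documents and two terms
X = pd.DataFrame([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]], columns=['steel', 'cast'])

unit = (X.T / norm(X, 2, axis=1)).T    # every row now has length 1
centered = unit - unit.mean()          # every column now has mean 0

# Assumed next step (not part of this section): sample covariance of the terms
COV = centered.T.dot(centered) / (len(centered) - 1)
```

Note that centering shifts the rows, so they are no longer exactly unit length afterwards; the table saved as TFIDF.csv is therefore the centered version rather than a plain L2-normalized TFIDF matrix.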