## Overview
The goal of your final project is to apply what you have learned in this course to create a digital analytical edition of a corpus that will support exploration of the social, historical, or cultural contents of that corpus. These contents are broadly conceived—they may be about language use, social events, cultural categories, sentiments, identity, taste, etc., and these may be described synchronically or diachronically, i.e. as structures or as trends over time.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- Convert the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- Annotate these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- Produce a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- Extend the annotated and vectorized model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- Explore your results using statistical and visual methods.
- Present conclusions about patterns observed in the corpus by means of these operations.


## Deliverables
To receive full credit for the assignment, you will produce a digital analytical edition of a corpus, which will include a written report and be hosted on a dedicated GitHub repository.

This edition should include the following deliverables.

### Data Files
A collection of source files hosted on your UVA Box account. If these are too large to download conveniently, you should compress them as archive files (e.g., zip or tar.gz).

A collection of data files, each in CSV format, containing the F2 through F5 data you extracted from the corpus. These files should include, at a minimum, the following core tables:

- LIB.csv — Metadata for the source files.
- CORPUS.csv — This is a tokens table annotated with statistical and linguistic features, such as TFIDF. It should include an index that represents the OHCO of the documents in your corpus.
- VOCAB.csv — Annotated with statistical and linguistic features, such as DFIDF.

In addition, you should include the following data sets, either as features in the appropriate core table or as separate tables. Note that all tables should have an appropriate index and, where appropriate, an OHCO index.

#### Principal Components (PCA)

- Table of documents and components.
- Table of components and word counts (i.e., the “loadings”), either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.


#### Topic Models (LDA)

- Table of document and topic concentrations.
- Table of topics and term counts, either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.

#### Word Embeddings (word2vec)

- Terms and embeddings, either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.

#### Sentiment Analysis

- Sentiment and emotion values as features in VOCAB or as a separate table with a shared index with the VOCAB table.
- Sentiment polarity and emotions for each document.

### Code Files
The Jupyter notebooks used to perform all operations that produced the data in your tables.

Any Jupyter notebooks used to explore and visualize the data in preparation for your final report.

Any Python files (e.g., .py files) you wrote to support your work.

Any other assets — e.g., images, stylesheets, JavaScript libraries, etc. — required by your notebooks.

### Report Document
A Jupyter notebook called FINAL_REPORT.ipynb describing your work and interpreting its results, along with links to all the files listed above. This report should be written using Markdown text cells, with embedded graphics from your other notebooks to illustrate points. Do not reference images that are not included in the notebook. You may embed saved images in the notebook in place of plots if you don't want to include the plotting code there. Include citations for any references made in the notebook.

This notebook should contain the following five sections:

1. Introduction. Describe the nature of your corpus and the question(s) you've asked of the data.

2. Source Data. Provide a description of all relevant source files and describe the following features for each source file:

- Provenance: Where did they come from? Describe the website or other source and provide relevant URLs.
- Location: Provide a link to the source files in UVA Box.
- Description: What is the general subject matter of the corpus? How many observations are there? What is the average document length?
- Format: A description of both the file formats of the source files, e.g., plaintext, XML, CSV, etc., and the internal structure where applicable. For example, if XML, then specify the document type (e.g., TEI or XHTML).

3. Data Model. Describe the analytical tables you generated in the process of tokenization, annotation, and analysis of your corpus. Provide a list of tables with field names and their definitions, along with URLs to each associated CSV file.

4. Exploration. Describe each of your explorations, such as PCA and topic models. For each, include the relevant parameters and hyperparameters used to generate each model and visualization. For your visualizations, you should use at least three (but likely more) of the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps showing correlations
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots

5. Interpretation. Provide your interpretation of the results of your exploration, and any conclusions, if you are comfortable making them.

Regarding the number of pages, a rule of thumb would be a six-page exported PDF. The question of length is secondary to the requirement that you complete all the sections.



### Form Level Description
- F0 Source Format. The initial source format of a text, which varies by collection, e.g. XML (e.g. TEI and RSS), HTML, plain text (e.g. Gutenberg), JSON, and CSV.
- F1 Machine Learning Corpus Format (MLCF). Ideally a table of minimum discursive units indexed by document content hierarchy.
- F2 Standard Text Analytic Data Model (STADM). A normalized set of tables including DOC, TOKEN, and TERM tables. Produced by the tokenization of F1 data.
- F3 NLP Annotated STADM. STADM with annotations added to token and term records indicating stopwords, parts-of-speech, stems and lemmas, named entities, grammatical dependencies, sentiments, etc.
- F4 STADM with Vector Space models. Vector space representations of TOKEN data and resulting statistical data, such as term frequency and TFIDF.
- F5 STADM with analytical models. STADM with columns and tables added for outputs of fitting and transforming models with the data.
- F6 STADM converted into interactive visualization. STADM represented as a database-driven application with interactive visualization, e.g. Jupyter notebooks and web applications.

In [178]:
import pandas as pd
import seaborn as sns
import nltk
import numpy as np
import re
from numpy.linalg import norm
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import plotly.express as px

In [2]:
# company_num = BOOKS
# link_num = CHAPTERS
# text = PARAS

OHCO = ['company_num', 'link_num', 'sent_num', 'token_num']
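As a minimal sketch of how an OHCO list is typically used, prefixes of the list name progressively finer units of analysis. The variable names `COMPANY`, `LINK`, and `SENT` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
# Illustrative only: slices of the OHCO list select coarser or finer grouping levels.
OHCO = ['company_num', 'link_num', 'sent_num', 'token_num']

COMPANY = OHCO[:1]  # one unit per company
LINK = OHCO[:2]     # one unit per page (link) within a company
SENT = OHCO[:3]     # one unit per sentence within a page

print(COMPANY, LINK, SENT)
```

A list such as `LINK` can then be passed to `groupby` to aggregate token-level rows to that level.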

### F0

#### Source Format. The initial source format of a text, which varies by collection, e.g. XML (e.g. TEI and RSS), HTML, plain text (e.g. Gutenberg), JSON, and CSV.

In [8]:
df = pd.read_csv('./data/old_data/CORPUS.tar.gz', compression='gzip', lineterminator='\n')
df

Unnamed: 0,company_num,Text,characters
0,0,"Ahresty, with more than 60 years of experienc...",1709
1,0,"PRODUCTS Ahresty, with more than 60 years of e...",754
2,0,ENVIRONMENTAL,16
3,0,CONTACT Address Ahresty Wilmington Corporation...,439
4,1,Manufacturer ofMetal FastenersandGeneral Hardw...,1025
...,...,...,...
90628,1225,"Home•Careers Together, we build the future We...",2524
90629,1225,Privacy The protection of your personal data i...,12706
90630,1225,Signicast acquires European based CIREX 02.15....,5160
90631,1225,Email Protection You are unable to access this...,558


# Subset data

In [9]:
char_per_comp = df.groupby('company_num').sum('characters').sort_values('characters')
char_per_comp
filtered_comps = char_per_comp[(char_per_comp['characters']<20000) & (char_per_comp['characters']>1000) & ~(char_per_comp['characters'].isna())]
filtered_comps

Unnamed: 0_level_0,characters
company_num,Unnamed: 1_level_1
138,1033
858,1035
1093,1035
766,1051
276,1061
...,...
882,19599
356,19617
639,19806
615,19918


In [12]:
companies = filtered_comps.sample(150, random_state=12341).index
filtered_data = df[df['company_num'].isin(companies)]
len(filtered_data)

1105

In [14]:
filtered_data.groupby('company_num').first()

Unnamed: 0_level_0,Text,characters
company_num,Unnamed: 1_level_1,Unnamed: 2_level_1
3,"3D Prototyping With 3D printing technology, pr...",3279
10,"Congress Drives, established in 1915, is the l...",734
33,1 Single Source ProviderFor A Complete Solutio...,904
34,Engineering Excellence in Zinc Die Cast Manu...,1572
49,The Heavy Metal Company Limited has been casti...,10199
...,...,...
1191,Open The Door Products Soft Close Magnetic Cat...,987
1200,Canterbury Aluminium | Exceptional Windows & D...,1907
1201,About Pioneer Venture Our Products Why US? Pro...,59
1216,"Our Company For over 35 years, the Houston An...",1469


In [15]:
filtered_data.to_csv('./data/raw_text.csv', index=False)

### F1

#### Machine Learning Corpus Format (MLCF). Ideally a table of minimum discursive units indexed by document content hierarchy.

In [26]:
df = pd.read_csv('./data/raw_text.csv', lineterminator='\n')

In [27]:
df

Unnamed: 0,company_num,Text,characters
0,3,"3D Prototyping With 3D printing technology, pr...",3279
1,10,"Congress Drives, established in 1915, is the l...",734
2,10,Variable Pitch Pulleys Congress Drives Variabl...,284
3,10,"Custom Die Cast Components Today, we serve Nor...",336
4,10,Manufacturing Capabilities Congress Drives con...,429
...,...,...,...
1100,1216,Quality Control Quality Control Houston state...,1759
1101,1216,"Testimonials Testimonials ""The business ethic...",570
1102,1222,HOME ABOUT PENTACAST SERVICES CONTACT More Pen...,352
1103,1222,HOME ABOUT PENTACAST SERVICES CONTACT More SER...,5105


# Create LIB

In [39]:
LIB = df[['company_num', 'characters']].groupby('company_num').agg(['sum', 'count'])['characters'].reset_index()\
.rename(columns={'sum':'total_characters', 'count':'total_links'})
LIB.to_csv('./data/LIB.csv', index=False)

# Create DOCS

In [40]:
# Add link count column
df['link_num'] = df.groupby('company_num').cumcount()

DOCS = df[["company_num", "link_num" ,"Text", "characters"]]
DOCS = DOCS.rename(columns={'company_num': 'company_id'})
DOCS = DOCS.rename(columns={'Text': 'text'})
DOCS = DOCS.set_index(["company_id"])
DOCS

Unnamed: 0_level_0,link_num,text,characters
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3,0,"3D Prototyping With 3D printing technology, pr...",3279
10,0,"Congress Drives, established in 1915, is the l...",734
10,1,Variable Pitch Pulleys Congress Drives Variabl...,284
10,2,"Custom Die Cast Components Today, we serve Nor...",336
10,3,Manufacturing Capabilities Congress Drives con...,429
...,...,...,...
1216,7,Quality Control Quality Control Houston state...,1759
1216,8,"Testimonials Testimonials ""The business ethic...",570
1222,0,HOME ABOUT PENTACAST SERVICES CONTACT More Pen...,352
1222,1,HOME ABOUT PENTACAST SERVICES CONTACT More SER...,5105


In [275]:
DOCS.to_csv('./data/DOCS.csv')

### F2
: Convert the collection from their source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model

#### Standard Text Analytic Data Model (STADM). A normalized set of tables including DOC, TOKEN, and TERM tables. Produced by the tokenization of F1 data.

# Create SENTS

#### 2. SENTS

In [43]:
## SENTS
#%%time
sent_pat = r'[.?!;:]+'
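# DOCS is indexed by company_id only; append link_num so each page is its own unit
SENTS = DOCS.set_index('link_num', append=True)['text'].str.split(sent_pat, expand=True).stack().to_frame('sent_str')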
SENTS.index.names = ["company_id", "link_num", "sent_num"]
SENTS

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sent_str
company_id,link_num,sent_num,Unnamed: 3_level_1
3,0,0,"3D Prototyping With 3D printing technology, pr..."
3,0,1,Utilizing rapid prototyping through 3D printi...
3,0,2,Tooling Equipped with onsite tool room facili...
3,0,3,Production Our unique and custom built Zinc d...
3,0,4,Our highly skilled production staff inspect a...
...,...,...,...
1222,2,6,© 2017 by PentaCast Inc
1222,2,7,Tel
1222,2,8,519
1222,2,9,245


In [274]:
SENTS.to_csv('./data/SENTS.csv')
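The split-and-stack pattern used for SENTS can be seen on a toy table. This is a sketch under stated assumptions: the two texts are invented, and the extra `dropna`, `strip`, and empty-string filter are optional cleanup steps that the notebook itself does not apply.

```python
import pandas as pd

# Toy stand-in for the DOCS table; the texts are invented for illustration.
toy_docs = pd.DataFrame(
    {"text": ["We cast zinc. We machine parts!", "Contact us: call today"]},
    index=pd.MultiIndex.from_tuples([(3, 0), (10, 0)], names=["company_id", "link_num"]),
)

sent_pat = r"[.?!;:]+"

toy_sents = (
    toy_docs["text"]
    .str.split(sent_pat, expand=True)  # one column per sentence position
    .stack()                           # column labels become a third index level
    .to_frame("sent_str")
)
toy_sents.index.names = ["company_id", "link_num", "sent_num"]

# Optional cleanup: drop missing cells, trim whitespace, remove empty strings
# left behind by trailing punctuation.
toy_sents = toy_sents.dropna()
toy_sents["sent_str"] = toy_sents["sent_str"].str.strip()
toy_sents = toy_sents[toy_sents["sent_str"] != ""]
print(toy_sents)
```

Because the pattern includes `;` and `:`, it also splits inside phone numbers and labels, which may explain fragments such as `Tel`, `519`, and `245` in the SENTS output.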

# Create TOKENS/CORPUS

#### 3. TOKENS

In [46]:
## TOKENS
## TOKENIZING THE TABLE TAKES AROUND 5 MINS.
# YOU CAN INSTEAD LOAD THE ALREADY SAVED TOKENS TABLE.
TOKENS = pd.read_csv('./data/TOKENS.tar.gz', compression='gzip', lineterminator='\n')
TOKENS

Unnamed: 0,company_id,link_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
0,0,0,0,0,"('Ahresty', 'NNP')",NNP,Ahresty,ahresty
1,0,0,0,1,"(',', ',')",",",",",","
2,0,0,0,2,"('with', 'IN')",IN,with,with
3,0,0,0,3,"('more', 'JJR')",JJR,more,more
4,0,0,0,4,"('than', 'IN')",IN,than,than
...,...,...,...,...,...,...,...,...
3383121,199,2,5,1,"('Designed', 'VBN')",VBN,Designed,designed
3383122,199,2,5,2,"('byElegant', 'JJ')",JJ,byElegant,byelegant
3383123,199,2,5,3,"('Themes|', 'NNP')",NNP,Themes|,themes|
3383124,199,2,5,4,"('Powered', 'NNP')",NNP,Powered,powered


### F3 
: NLP Annotated STADM. STADM with annotations added to token and term records indicating stopwords, parts-of-speech, stems and lemmas, named entities, grammatical dependencies, sentiments, etc.

In [47]:
# NOTE: despite its name, True selects nltk.word_tokenize and False selects WhitespaceTokenizer
keep_whitespace = True

In [48]:
%%time
if keep_whitespace:
    TOKENS = SENTS.sent_str\
            .apply(lambda x: pd.Series(nltk.pos_tag(nltk.word_tokenize(x))))\
            .stack()\
            .to_frame('pos_tuple')
else:
    TOKENS = SENTS.sent_str\
            .apply(lambda x: pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x))))\
            .stack()\
            .to_frame('pos_tuple')



CPU times: user 12.8 s, sys: 965 ms, total: 13.7 s
Wall time: 14.1 s


In [49]:
TOKENS.index.names = ["company_id", "link_num", "sent_num", "token_num"]
TOKENS

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple
company_id,link_num,sent_num,token_num,Unnamed: 4_level_1
3,0,0,0,"(3D, CD)"
3,0,0,1,"(Prototyping, VBG)"
3,0,0,2,"(With, IN)"
3,0,0,3,"(3D, CD)"
3,0,0,4,"(printing, VBG)"
...,...,...,...,...
1222,2,6,4,"(Inc, NNP)"
1222,2,7,0,"(Tel, NN)"
1222,2,8,0,"(519, CD)"
1222,2,9,0,"(245, CD)"


In [50]:
%%time
TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
TOKENS['token_str'] = TOKENS.pos_tuple.apply(lambda x: x[0])
TOKENS['term_str'] = TOKENS.token_str.str.lower()
TOKENS

CPU times: user 117 ms, sys: 10.3 ms, total: 127 ms
Wall time: 127 ms


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,pos_tuple,pos,token_str,term_str
company_id,link_num,sent_num,token_num,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3,0,0,0,"(3D, CD)",CD,3D,3d
3,0,0,1,"(Prototyping, VBG)",VBG,Prototyping,prototyping
3,0,0,2,"(With, IN)",IN,With,with
3,0,0,3,"(3D, CD)",CD,3D,3d
3,0,0,4,"(printing, VBG)",VBG,printing,printing
...,...,...,...,...,...,...,...
1222,2,6,4,"(Inc, NNP)",NNP,Inc,inc
1222,2,7,0,"(Tel, NN)",NN,Tel,tel
1222,2,8,0,"(519, CD)",CD,519,519
1222,2,9,0,"(245, CD)",CD,245,245


In [51]:
# SAVE TOKENS TABLE
TOKENS.to_csv("./data/TOKENS.csv")

# Create VOCAB

In [99]:
%%time
VOCAB = TOKENS.term_str.value_counts().to_frame('n')
VOCAB.index.name = 'term_str'
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = -np.log2(VOCAB.p)
VOCAB['n_chars'] = VOCAB.index.str.len()
VOCAB['max_pos'] = TOKENS[['term_str','pos']].value_counts().unstack(fill_value=0).idxmax(1)
VOCAB

CPU times: user 132 ms, sys: 14.8 ms, total: 147 ms
Wall time: 151 ms


Unnamed: 0_level_0,n,p,i,n_chars,max_pos
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
",",8918,0.046287,4.433260,1,","
and,5814,0.030176,5.050450,3,CC
the,5331,0.027669,5.175574,3,DT
to,4440,0.023045,5.439421,2,TO
of,3567,0.018514,5.755269,2,IN
...,...,...,...,...,...
flangebolt,1,0.000005,17.555765,10,NNP
ratingmollength,1,0.000005,17.555765,15,NN
class1525354669current,1,0.000005,17.555765,22,NNP
specificationsvoltage,1,0.000005,17.555765,21,NNP


In [100]:
# SAVE VOCAB TABLE
VOCAB.to_csv("./data/VOCAB.csv")
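To make the `p` and `i` columns concrete, here is a small worked example on an invented token list. It mirrors the VOCAB construction above; the entropy `H` at the end is an addition for illustration and is not computed for VOCAB in this notebook.

```python
import numpy as np
import pandas as pd

# Invented token list; in the notebook this role is played by TOKENS.term_str.
terms = pd.Series(["die", "casting", "die", "zinc", "die", "casting", "zinc", "die"])

toy_vocab = terms.value_counts().to_frame("n")
toy_vocab.index.name = "term_str"
toy_vocab["p"] = toy_vocab.n / toy_vocab.n.sum()  # relative frequency
toy_vocab["i"] = -np.log2(toy_vocab.p)            # self-information in bits

# Entropy of the term distribution: expected information per token.
H = (toy_vocab.p * toy_vocab.i).sum()
print(toy_vocab)
print(f"entropy = {H:.3f} bits")
```

Here `die` accounts for half the tokens, so it carries 1 bit, while the two rarer terms carry 2 bits each.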

### F4 
: STADM with Vector Space models. Vector space representations of TOKEN data and resulting statistical data, such as term frequency and TFIDF.

In [101]:
VOCAB['n_pos'] = TOKENS[['term_str','pos']].value_counts().unstack().count(1)
VOCAB['cat_pos'] = TOKENS[['term_str','pos']].value_counts().to_frame('n').reset_index()\
    .groupby('term_str').pos.apply(lambda x: set(x))

In [102]:
VOCAB

Unnamed: 0_level_0,n,p,i,n_chars,max_pos,n_pos,cat_pos
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
",",8918,0.046287,4.433260,1,",",1,"{,}"
and,5814,0.030176,5.050450,3,CC,3,"{VBP, CC, NNP}"
the,5331,0.027669,5.175574,3,DT,2,"{DT, NNP}"
to,4440,0.023045,5.439421,2,TO,2,"{TO, NNP}"
of,3567,0.018514,5.755269,2,IN,2,"{IN, NNP}"
...,...,...,...,...,...,...,...
flangebolt,1,0.000005,17.555765,10,NNP,1,{NNP}
ratingmollength,1,0.000005,17.555765,15,NN,1,{NN}
class1525354669current,1,0.000005,17.555765,22,NNP,1,{NNP}
specificationsvoltage,1,0.000005,17.555765,21,NNP,1,{NNP}


# Add Stopwords to VOCAB

In [103]:
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1

VOCAB['stop'] = VOCAB.index.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')
VOCAB = VOCAB[VOCAB['stop'] == 0]

# Create POS_GROUP

In [104]:
with open('data/upenn_tagset.txt', 'r') as f:
    tags_csv = [(line.split()[0], ' '.join(line.split()[1:])) for line in f]

POS = pd.DataFrame(tags_csv)
POS.columns = ['pos_code','pos_def']
POS = POS.set_index('pos_code')
POS['n'] = TOKENS.pos.value_counts()
POS['n'] = POS['n'].fillna(0).astype('int')
POS['pos_group'] = POS.apply(lambda x: x.name[:2], 1)
POS['punc'] = POS.apply(lambda x: bool(re.match(r"^\W", x.name)), 1)

POS_GROUP = POS.groupby('pos_group').n.sum().to_frame('n')
POS_GROUP = POS_GROUP[POS_GROUP.n > 0]
POS_GROUP['pos_def'] = POS.groupby('pos_group').apply(lambda x: '; '.join(x['pos_def']))
POS_GROUP['p'] = POS_GROUP.n / POS_GROUP.n.sum()
POS_GROUP['i'] = np.log2(1/POS_GROUP.p)
POS_GROUP['h'] = POS_GROUP.p * POS_GROUP.i

POS_GROUP['n_terms'] = VOCAB.max_pos.value_counts() 
POS_GROUP['n_tokens'] = VOCAB.groupby('max_pos').n.sum()

In [105]:
POS_GROUP

Unnamed: 0_level_0,n,pos_def,p,i,h,n_terms,n_tokens
pos_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
$,108,dollar,0.000561,10.800413,0.006056,9.0,79.0
'',201,closing quotation mark,0.001044,9.904249,0.010336,1.0,193.0
(,1230,opening parenthesis,0.006386,7.290858,0.04656,2.0,1230.0
),1233,closing parenthesis,0.006402,7.287344,0.046651,2.0,1233.0
",",8918,comma,0.046302,4.432796,0.205245,1.0,8918.0
:,311,colon or ellipsis,0.001615,9.27453,0.014975,2.0,311.0
CC,7501,"conjunction, coordinating",0.038945,4.682433,0.182355,3.0,686.0
CD,6955,"numeral, cardinal",0.03611,4.791466,0.173019,1424.0,6874.0
DT,10618,determiner,0.055128,4.181076,0.230494,4.0,150.0
EX,83,existential there,0.000431,11.180261,0.004818,,


# Create BOW

In [106]:
def create_bow(CORPUS, bag, item_type='term_str'):
    
    BOW = CORPUS.groupby(bag+[item_type])[item_type].count().to_frame('n')
    return BOW

BOW = create_bow(TOKENS, ['company_id'])

In [107]:
BOW.to_csv('./data/BOW.csv')

# Create TFIDF and DFIDF

In [108]:
def get_tfidf_dfidf(BOW, tf_method='max', df_method='standard', item_type='term_str'):
    '''
    The purpose of this function is to calculate TFIDF and DFIDF for a given BOW representation of a CORPUS.
    
    INPUT:
        BOW           dataframe of a bag of words representation of a corpus
        tf_method     method for calculating term frequency, string
        df_method     method for calculating document frequency, string
        item_type     item type
        
    OUTPUT:
        TFIDF         dataframe of term frequency inverse document frequency for the corpus
        DFIDF         dataframe of document frequency inverse document frequency for the corpus
    '''
            
    DTCM = BOW.n.unstack() # Create Doc-Term Count Matrix
    
    if tf_method == 'sum':
        TF = (DTCM.T / DTCM.T.sum()).T
    elif tf_method == 'max':
        TF = (DTCM.T / DTCM.T.max()).T
    elif tf_method == 'log':
        TF = (np.log2(DTCM.T + 1)).T
    elif tf_method == 'raw':
        TF = DTCM
    elif tf_method == 'bool':
        TF = DTCM.astype('bool').astype('int')
    else:
        raise ValueError(f"TF method {tf_method} not found.")

    DF = DTCM.count() # Assumes NULLs 
    N_docs = len(DTCM)
    
    if df_method == 'standard':
        IDF = np.log10(N_docs/DF) # This is what the students were asked to use
    elif df_method == 'textbook':
        IDF = np.log10(N_docs/(DF + 1))
    elif df_method == 'sklearn':
        IDF = np.log10(N_docs/DF) + 1
    elif df_method == 'sklearn_smooth':
        IDF = np.log10((N_docs + 1)/(DF + 1)) + 1
    else:
        raise ValueError(f"DF method {df_method} not found.")
    
    TFIDF = TF * IDF
    
    DFIDF = DF * IDF
    
    TFIDF = TFIDF.fillna(0)

    return TFIDF, DFIDF

In [109]:
TFIDF, DFIDF = get_tfidf_dfidf(BOW)
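The default path through `get_tfidf_dfidf` (`tf_method='max'`, `df_method='standard'`) can be traced by hand on a two-document example. The counts below are invented, and the `toy_` names are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Invented bag-of-words counts: (company_id, term_str) -> n.
toy_bow = pd.DataFrame(
    {"n": [2, 1, 1, 3]},
    index=pd.MultiIndex.from_tuples(
        [(1, "casting"), (1, "zinc"), (2, "casting"), (2, "foundry")],
        names=["company_id", "term_str"],
    ),
)

DTCM = toy_bow.n.unstack()      # absent terms become NaN, not 0
TF = (DTCM.T / DTCM.T.max()).T  # 'max': divide by each document's largest count
DF = DTCM.count()               # count() skips NaN, giving document frequency
IDF = np.log10(len(DTCM) / DF)  # 'standard' IDF

toy_tfidf = (TF * IDF).fillna(0)
toy_dfidf = DF * IDF
print(toy_tfidf.round(3))
print(toy_dfidf.round(3))
```

Note that `casting` appears in both documents, so its IDF is `log10(2/2) = 0` and it drops out of both TFIDF and DFIDF regardless of its counts.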

In [110]:
TFIDF

term_str,,֭g,,,,֩,8g,l,89sg-s4-c,rt,...,⠀⠀we,⬇|contact,ꇥw,5045,address,888,phone,e-mail,email,﻿
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1200,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1201,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1216,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [111]:
TFIDF.to_csv('./data/TFIDF.csv')

In [112]:
VOCAB = VOCAB.copy() # VOCAB is a filtered slice; copy it to avoid SettingWithCopyWarning
VOCAB['dfidf'] = DFIDF
VOCAB['mean_tfidf'] = TFIDF.mean()


In [116]:
VOCAB.sort_values('mean_tfidf', ascending=False)

Unnamed: 0_level_0,n,p,i,n_chars,max_pos,n_pos,cat_pos,stop,dfidf,mean_tfidf
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
casting,942,0.004889,7.676182,7,NNP,4,"{NN, JJ, VBG, NNP}",0,22.950631,0.035022
die,403,0.002092,8.901129,3,NNP,6,"{VB, JJ, NN, NNS, VBP, NNP}",0,23.613338,0.028115
castings,513,0.002663,8.552950,8,NNS,4,"{NNS, VBZ, NNPS, NNP}",0,23.735295,0.022809
foundry,309,0.001604,9.284302,7,NNP,4,"{NNS, NN, JJ, NNP}",0,23.436098,0.022693
inc,296,0.001536,9.346312,3,NNP,1,{NNP},0,23.894575,0.021041
...,...,...,...,...,...,...,...,...,...,...
gf,1,0.000005,17.555765,2,NNP,1,{NNP},0,2.176091,0.000061
magnetrol,1,0.000005,17.555765,9,NNP,1,{NNP},0,2.176091,0.000061
leser,1,0.000005,17.555765,5,NNP,1,{NNP},0,2.176091,0.000061
mod,1,0.000005,17.555765,3,NNP,1,{NNP},0,2.176091,0.000061


# Create VIDX and MT

In [220]:
pos_list = "NN NNS VB VBD VBG VBN VBP VBZ JJ JJR JJS RB RBR RBS".split()

VIDX = VOCAB.loc[VOCAB.max_pos.isin(pos_list)]\
    .sort_values('dfidf', ascending=False)\
    .head(1000).index

In [221]:
VIDX

Index(['small', 'world', 'technology', 'wide', 'industrial', 'control', 'need',
       'required', 'today', 'form',
       ...
       'internationally', 'durability', 'drawing', 'floor', 'keeps',
       'immediate', 'reasons', 'sections', 'return', '<'],
      dtype='object', name='term_str', length=1000)

In [222]:
MT = TFIDF[VIDX].groupby('company_id').mean().fillna(0) # MUST FILLNA

In [223]:
MT

term_str,small,world,technology,wide,industrial,control,need,required,today,form,...,internationally,durability,drawing,floor,keeps,immediate,reasons,sections,return,<
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3,0.000000,0.031123,0.046685,0.000000,0.000000,0.000000,0.000000,0.000000,0.015846,0.000000,...,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
10,0.000000,0.014056,0.014056,0.041410,0.000000,0.013803,0.013803,0.000000,0.042938,0.000000,...,0.0,0.039414,0.00000,0.000000,0.000000,0.000000,0.039414,0.0,0.0,0.000000
33,0.013617,0.000000,0.013617,0.000000,0.000000,0.013372,0.000000,0.013866,0.000000,0.000000,...,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
34,0.005586,0.000000,0.039104,0.005486,0.000000,0.005486,0.010972,0.000000,0.017065,0.005688,...,0.0,0.015665,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
49,0.000000,0.000000,0.000000,0.000000,0.011804,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.023482,0.000000,0.000000,0.000000,...,0.0,0.000000,0.00745,0.022351,0.007450,0.000000,0.000000,0.0,0.0,0.000000
1200,0.003252,0.000000,0.000000,0.009580,0.003193,0.006387,0.003193,0.003311,0.019867,0.003311,...,0.0,0.000000,0.00000,0.000000,0.009118,0.018237,0.009118,0.0,0.0,0.000000
1201,0.000000,0.000000,0.000000,0.030565,0.061129,0.122258,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.000000
1216,0.004841,0.009683,0.014524,0.000000,0.000000,0.038036,0.000000,0.024650,0.000000,0.009860,...,0.0,0.000000,0.00000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.013576


# Create L0, L1, L2

In [224]:
L0 = MT.astype('bool').astype('int') # Binary (Pseudo L)
L1 = MT.apply(lambda x: x / x.sum(), 1) # Manhattan (Probabilistic)
L2 = MT.apply(lambda x: x / norm(x), 1) # Euclidean
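The three normalizations can be checked on a small matrix. This is a sketch with invented values; the `toy_` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from numpy.linalg import norm

# Invented document-term weights: two documents, three terms.
toy_mt = pd.DataFrame(
    [[3.0, 4.0, 0.0], [0.0, 1.0, 1.0]],
    index=[1, 2],
    columns=["casting", "die", "zinc"],
)

toy_L0 = toy_mt.astype("bool").astype("int")          # presence/absence
toy_L1 = toy_mt.apply(lambda x: x / x.sum(), axis=1)  # each row sums to 1
toy_L2 = toy_mt.apply(lambda x: x / norm(x), axis=1)  # each row has unit length

print(toy_L1.sum(axis=1).tolist())
print(norm(toy_L2.values, axis=1).tolist())
```

One caveat: a document whose row is all zeros divides by zero under L1 and L2 and produces NaN, so such rows may need to be dropped or filled before computing distances.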

# Create PAIRS and CORR_MATRIX

In [225]:
PAIRS = 1 - MT.T.corr().stack().to_frame('correl')
PAIRS.index.names = ['doc_a','doc_b']
PAIRS = PAIRS.query("doc_a < doc_b") # Remove identities and reverse duplicates; i < j matches pdist's condensed order

general_method = 'weighted' # single, complete, average, weighted 
euclidean_method = 'ward' # ward, centroid, median
combos  = [
    (L2, 'euclidean', 'euclidean', euclidean_method),
    (MT,  'cosine', 'cosine', euclidean_method),
    (MT,  'cityblock', 'cityblock', general_method),
    (L0, 'jaccard', 'jaccard', general_method),
    (L1, 'jensenshannon', 'js', general_method),
]

for X, metric, label, _ in combos:
    PAIRS[label] = pdist(X, metric)
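Since `pdist` returns a flat vector, it helps to see which pair each entry belongs to. The sketch below uses three invented points and builds the pair index with `itertools.combinations`, which is an illustrative choice and not how PAIRS is built in this notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Three invented documents in a two-dimensional space.
toy_X = pd.DataFrame([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]], index=[3, 10, 33])
toy_X.index.name = "company_id"

# pdist yields one distance per unordered pair, ordered (0,1), (0,2), (1,2), ...
pair_index = pd.MultiIndex.from_tuples(
    list(combinations(toy_X.index, 2)), names=["doc_a", "doc_b"]
)
toy_pairs = pd.DataFrame({"euclidean": pdist(toy_X.values, "euclidean")}, index=pair_index)

# Cross-check each pair against the full square distance matrix.
square = pd.DataFrame(
    squareform(pdist(toy_X.values, "euclidean")), index=toy_X.index, columns=toy_X.index
)
print(toy_pairs)
```

The condensed vector lists pairs with the first row index smaller than the second, so any pair index attached to `pdist` output needs the same ordering for distances to land on the right rows.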

In [226]:
corr_type = 'kendall'
CORR_MATRIX = MT.T.corr(corr_type)

#LIB['kendall_sum'] = CORR_MATRIX.sum()

In [227]:
np.fill_diagonal(CORR_MATRIX.values, 0)

In [228]:
CORR_MATRIX

company_id,3,10,33,34,49,58,63,66,77,81,...,1129,1142,1172,1180,1181,1191,1200,1201,1216,1222
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3,0.000000,0.159349,0.146537,0.131635,0.110409,0.070868,0.037816,0.071675,0.125767,0.068812,...,0.047679,0.015428,0.122031,0.101043,0.050848,0.054188,0.074448,0.085950,0.038597,0.138107
10,0.159349,0.000000,0.092076,0.167307,0.059006,0.067753,0.001826,0.141123,0.138894,0.108029,...,0.099429,0.077938,0.129839,0.107361,0.072021,0.078233,0.073718,0.064473,0.068487,0.079883
33,0.146537,0.092076,0.000000,0.175527,0.055557,0.088896,0.008007,0.058511,0.134023,0.126665,...,0.109010,0.099983,0.089656,0.112505,0.034404,0.021161,0.071129,0.073877,0.104039,0.105933
34,0.131635,0.167307,0.175527,0.000000,0.095546,0.139430,0.032919,0.149848,0.157738,0.123961,...,0.125274,0.147193,0.130622,0.038329,0.042311,0.054726,0.078546,0.072473,0.135309,0.124704
49,0.110409,0.059006,0.055557,0.095546,0.000000,0.068522,0.068892,0.083794,0.113773,0.054757,...,0.089703,0.066512,0.088512,0.069542,0.065609,-0.000573,0.078734,0.052266,0.037746,0.056171
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191,0.054188,0.078233,0.021161,0.054726,-0.000573,0.010773,0.045261,0.131647,0.040264,0.014294,...,0.069907,0.133072,0.032156,0.136286,0.090617,0.000000,0.155138,-0.025887,0.087204,0.012463
1200,0.074448,0.073718,0.071129,0.078546,0.078734,0.023428,0.065143,0.068488,0.039785,0.035677,...,0.099923,0.060085,0.073755,0.110597,0.051931,0.155138,0.000000,0.064449,0.055833,0.044950
1201,0.085950,0.064473,0.073877,0.072473,0.052266,0.086231,0.012233,0.096992,0.034654,0.137539,...,0.073804,0.042450,0.066881,0.011425,0.012951,-0.025887,0.064449,0.000000,0.046180,0.063417
1216,0.038597,0.068487,0.104039,0.135309,0.037746,0.092793,0.037118,0.018584,0.105758,0.121284,...,0.096723,0.184894,0.115332,0.087673,0.139542,0.087204,0.055833,0.046180,0.000000,0.082297


# Explore CORR_MATRIX

In [229]:
max_corr = CORR_MATRIX.max(axis=0)

In [230]:
max_corr_idx = CORR_MATRIX.idxmax(axis=0)

In [231]:
pd.concat([max_corr_idx, max_corr], axis=1).rename(columns={0:'Max_id', 1:'Max_correlation'})

Unnamed: 0_level_0,Max_id,Max_correlation
company_id,Unnamed: 1_level_1,Unnamed: 2_level_1
3,567,0.183606
10,730,0.217421
33,711,0.238519
34,730,0.214948
49,211,0.147221
...,...,...
1191,797,0.189507
1200,503,0.206058
1201,299,0.188862
1216,794,0.203915


# PCA

In [232]:
pos_list = "NN NNS VB VBD VBG VBN VBP VBZ JJ JJR JJS RB RBR RBS".split()

VIDX = VOCAB.loc[VOCAB.max_pos.isin(pos_list)]\
    .sort_values('dfidf', ascending=False)\
    .head(1000).index

In [233]:
# setup

norm_docs = True # This has the effect of exaggerating variance when False
center_term_vectors = True # This has the effect of demoting authorship when False

colors = "Spectral"

sns.set(style='ticks')

TFIDF_SMALL = TFIDF[VIDX]

In [234]:
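# normalize doc vector lengths (L2), controlled by norm_docs
TFIDF_L2 = TFIDF_SMALL.copy()
if norm_docs:
    TFIDF_L2 = (TFIDF_L2.T / norm(TFIDF_L2, 2, axis=1)).T

# center term vectors, controlled by center_term_vectors
if center_term_vectors:
    TFIDF_L2 = TFIDF_L2 - TFIDF_L2.mean()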

In [235]:
COV = TFIDF_L2.T.dot(TFIDF_L2) / (TFIDF_L2.shape[0] - 1)

In [236]:
eig_vals, eig_vecs = eigh(COV)
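
The normalize, center, covariance, and decomposition steps can be checked end to end on a small random matrix. This is a minimal sketch, not the notebook's data: the 6x4 matrix and its column names are hypothetical, and `norm` and `eigh` are taken from `numpy.linalg` as an assumption, since the notebook's imports are not shown in this section.

```python
import numpy as np
import pandas as pd
from numpy.linalg import norm, eigh

# Hypothetical document-term weights: 6 documents, 4 terms
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((6, 4)), columns=list("abcd"))

# L2-normalize each row (document), then center each column (term)
X_l2 = X.div(norm(X.values, 2, axis=1), axis=0)
X_c = X_l2 - X_l2.mean()

# Sample covariance between terms
cov = X_c.T.dot(X_c) / (X_c.shape[0] - 1)

# eigh returns eigenvalues in ascending order, eigenvectors as columns
vals, vecs = eigh(cov.values)

# Sanity checks: matches NumPy's covariance, and cov @ v = lambda * v
print(np.allclose(cov.values, np.cov(X_c.values, rowvar=False)))
print(np.allclose(cov.values @ vecs, vecs * vals))
```

Because the eigenvalues come back in ascending order, the strongest components are the last columns of `vecs`, which is why a descending sort is needed before keeping the top ones.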

In [237]:
EIG_VEC = pd.DataFrame(eig_vecs, index=COV.index, columns=COV.index)
EIG_VAL = pd.DataFrame(eig_vals, index=COV.index, columns=['eig_val'])
EIG_VAL.index.name = 'term_str'

In [238]:
EIG_PAIRS = EIG_VAL.join(EIG_VEC.T)

In [239]:
COMPS = EIG_PAIRS.sort_values('eig_val', ascending=False).head(10).reset_index(drop=True)
COMPS.index = ["PC{}".format(i) for i in COMPS.index.tolist()]
COMPS.index.name = 'pc_id'
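
A component table like COMPS can also carry the share of variance each component explains, which helps when deciding how many components to keep. The sketch below uses a hypothetical 3x3 covariance matrix; the `exp_var` column and the term names are illustrative additions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical covariance matrix for three terms
terms = ["alpha", "beta", "gamma"]
cov = pd.DataFrame([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 0.2]], index=terms, columns=terms)

vals, vecs = np.linalg.eigh(cov.values)   # ascending eigenvalues
order = np.argsort(vals)[::-1]            # positions in descending order

# One row per component: eigenvalue, then the eigenvector over the terms
comps = pd.DataFrame(vecs[:, order].T, columns=terms)
comps.index = [f"PC{i}" for i in range(len(order))]
comps.index.name = "pc_id"
comps.insert(0, "eig_val", vals[order])

# Share of total variance explained by each component
comps["exp_var"] = comps["eig_val"] / vals.sum()
print(comps[["eig_val", "exp_var"]])
```

The eigenvalues sum to the trace of the covariance matrix (3.2 here), so the `exp_var` column sums to 1.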

In [240]:
TFIDF_L2

company_id,small,world,technology,wide,industrial,control,need,required,today,form,...,internationally,durability,drawing,floor,keeps,immediate,reasons,sections,return,<
3,-0.010922,0.059745,0.096675,-0.011998,-0.018063,-0.017254,-0.015579,-0.012899,0.022445,-0.012175,...,-0.003582,-0.003208,-0.005011,-0.003239,-0.001917,-0.004375,-0.003115,-0.002757,-0.004803,-0.005927
10,-0.010922,0.009028,0.008623,0.058567,-0.018063,0.006267,0.007943,-0.012899,0.057597,-0.012175,...,-0.003582,0.063957,-0.005011,-0.003239,-0.001917,-0.004375,0.064050,-0.002757,-0.004803,-0.005927
33,0.010282,-0.014924,0.005874,-0.011998,-0.018063,0.003568,-0.015579,0.008692,-0.015573,-0.012175,...,-0.003582,-0.003208,-0.005011,-0.003239,-0.001917,-0.004375,-0.003115,-0.002757,-0.004803,-0.005927
34,0.009268,-0.014924,0.126001,0.007830,-0.018063,0.002573,0.024076,-0.012899,0.046105,0.008384,...,-0.003582,0.053408,-0.005011,-0.003239,-0.001917,-0.004375,-0.003115,-0.002757,-0.004803,-0.005927
49,-0.010922,-0.014924,-0.015329,-0.011998,0.031573,-0.017254,-0.015579,-0.012899,-0.015573,-0.012175,...,-0.003582,-0.003208,-0.005011,-0.003239,-0.001917,-0.004375,-0.003115,-0.002757,-0.004803,-0.005927
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1191,-0.010922,-0.014924,-0.015329,-0.011998,-0.018063,-0.017254,0.060532,-0.012899,-0.015573,-0.012175,...,-0.003582,-0.003208,0.019137,0.069204,0.022230,-0.004375,-0.003115,-0.002757,-0.004803,-0.005927
1200,-0.000687,-0.014924,-0.015329,0.018155,-0.008012,0.002847,-0.005528,-0.002478,0.046958,-0.001754,...,-0.003582,-0.003208,-0.005011,-0.003239,0.026782,0.053024,0.025585,-0.002757,-0.004803,-0.005927
1201,-0.010922,-0.014924,-0.015329,0.077080,0.160092,0.339056,-0.015579,-0.012899,-0.015573,-0.012175,...,-0.003582,-0.003208,-0.005011,-0.003239,-0.001917,-0.004375,-0.003115,-0.002757,-0.004803,-0.005927
1216,0.001501,0.009921,0.021938,-0.011998,-0.018063,0.080341,-0.015579,0.050349,-0.015573,0.013124,...,-0.003582,-0.003208,-0.005011,-0.003239,-0.001917,-0.004375,-0.003115,-0.002757,-0.004803,0.028908


In [243]:
# get Document Component Matrix
DCM = TFIDF_L2.dot(COMPS[COV.index].T)

# add metadata for display purposes
# LIB_COLS = LIB.columns
# DCM = DCM.join(LIB[LIB_COLS], on='company_id')

# # define doc field to name each chapter
# DCM['doc'] = DCM.apply(lambda x: f"{x.title} {str(x.name[1]).zfill(2)}", 1)

In [248]:
DCM = DCM.reset_index()

In [249]:
LOADINGS = COMPS[COV.index].T
LOADINGS.index.name = 'term_str'
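
The relationship between the loadings and the document-component matrix can be verified on toy data: projecting centered documents onto the eigenvectors gives scores whose variance on each component equals that component's eigenvalue. This is a sketch under stated assumptions; the random data, the term names, and the lowercase `loadings` and `dcm` names are hypothetical stand-ins for LOADINGS and DCM.

```python
import numpy as np
import pandas as pd

# Hypothetical centered document-term matrix: 50 documents, 3 terms
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["alpha", "beta", "gamma"])
X = X - X.mean()

cov = X.T.dot(X) / (len(X) - 1)
vals, vecs = np.linalg.eigh(cov.values)
order = np.argsort(vals)[::-1][:2]        # keep the top two components

# Terms x components (loadings), then documents x components (scores)
loadings = pd.DataFrame(vecs[:, order], index=X.columns, columns=["PC0", "PC1"])
dcm = X.dot(loadings)

# Variance of the scores on each component equals its eigenvalue
print(np.allclose(dcm.var(ddof=1).values, vals[order]))
```

The scores on different components are also uncorrelated, which is what makes each axis in a PC scatter plot independently interpretable.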

# Explore LOADINGS

In [250]:
LOADINGS

term_str,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9
small,-0.002965,-0.012143,-0.001677,-0.008403,-0.005370,0.001083,0.027700,-0.015795,0.025798,0.002957
world,0.001195,0.032745,0.023359,0.026443,-0.008521,-0.003662,-0.042093,-0.037384,-0.016262,-0.012480
technology,0.021972,-0.011125,0.097873,0.051153,0.039648,0.021411,0.006687,0.005313,0.000378,0.024521
wide,0.008420,-0.006042,-0.001656,0.033535,0.019811,-0.000644,-0.015614,-0.009406,0.007628,0.003319
industrial,-0.029376,0.021530,-0.048029,0.070021,0.028716,-0.014202,-0.052232,0.007404,-0.071543,0.003332
...,...,...,...,...,...,...,...,...,...,...
immediate,0.002133,0.005592,-0.001527,0.014171,0.007185,0.029599,0.028360,-0.018830,-0.015290,0.015170
reasons,-0.008055,-0.000038,0.016649,-0.009791,-0.004026,-0.010203,-0.005417,-0.002978,-0.014328,0.007028
sections,0.005768,-0.009968,-0.009432,-0.001370,-0.012068,-0.014429,0.003957,-0.008323,0.007816,0.008559
return,-0.007369,0.004792,-0.005812,-0.031139,-0.016567,-0.030938,0.006619,-0.019307,-0.011011,-0.001185


In [257]:
LOADINGS.sort_values('PC2', ascending=False)[:50]

term_str,PC0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9
data,-0.225801,-0.089065,0.165746,-0.101804,-0.08905,-0.030747,-0.046782,0.053998,0.040989,0.062745
privacy,-0.17931,-0.074547,0.160152,-0.071876,-0.087537,-0.055677,-0.072801,-0.005001,0.033966,0.053946
solutions,0.062885,-0.04844,0.137326,0.127228,-0.008407,-0.014075,-0.116605,0.033377,-0.121818,-0.037873
personal,-0.171446,-0.074413,0.132292,-0.084634,-0.072177,-0.036616,-0.028856,0.006895,0.02396,0.032448
precision,0.046271,-0.073172,0.114919,0.211776,-0.023735,0.092895,0.068993,0.097661,0.021189,0.109474
technology,0.021972,-0.011125,0.097873,0.051153,0.039648,0.021411,0.006687,0.005313,0.000378,0.024521
policy,-0.112156,-0.060872,0.093072,-0.002645,-0.059482,-0.027146,-0.059809,0.01827,-0.00396,0.057081
development,0.011806,-0.004452,0.091518,0.071711,0.072691,-0.004208,-0.029319,-0.007263,-0.048714,0.095831
metal,-0.007002,-0.040181,0.085383,-0.069957,-0.101833,-0.107757,0.075635,0.051153,-0.089514,-0.072964
tooling,0.065946,-0.065915,0.07787,0.053104,-0.003392,-0.023664,0.098377,0.048175,0.070355,-0.013047


# Visualize LOADINGS

In [254]:
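def vis_loadings(a=0, b=1, hover_name='term_str'):
    """Scatter the term loadings of PC{a} against PC{b}."""
    X = LOADINGS.join(VOCAB)
    return px.scatter(X.reset_index(), f"PC{a}", f"PC{b}",
                      text='term_str', size='i', color='max_pos',
                      hover_name=hover_name,
                      marginal_x='box', height=800)

def vis_pcs(M, a, b, label='company_id', hover_name='company_id', symbol=None, size=None):
    """Scatter the documents in M on PC{a} against PC{b} and show the figure."""
    # NOTE: `label` is currently unused; it is kept so existing calls still work
    fig = px.scatter(M, f"PC{a}", f"PC{b}", hover_name=hover_name,
                     symbol=symbol, size=size,
                     marginal_x='box', height=800)
    fig.show()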

In [258]:
vis_pcs(DCM, 0, 2)

# Read raw sentences from DOCS

In [279]:
DOCS[DOCS.index==565]['text'].values

array(['Gibbs Expect Excellence 50 yearsof leadership in die casting, machining, and assembly Experience Innovation Gibbs is a go-to-provider trusted by its long term Tier One and OEM customers.\xa0 As the speed of innovation escalates, Gibbs continues to out perform the competition with unique and innovative cast, machined, and assembled components. Extraordinary Capabilities We do what others can’t. Gibbs makes aggressive investments in our engineering, processes, and infrastructure to execute all of your projects on time, every time. From advanced engineering tools for lightweighting to multiple aluminum alloys or complex machining and assembly with advanced automation, Gibbs gets the job done. This unique combination of capabilities makes Gibbs a strategic partner. Engineering Tool + Die Die Casting Machining + Assembly Who We Serve News Careers  Photo Credits to Collin Floyd Privacy Policy ',
       'Capabilities 50 years of leadership in die casting, machining, and assembly makes