## Overview
The goal of your final project is to apply what you have learned in this course to create a digital analytical edition of a corpus that will support exploration of the social, historical, or cultural contents of that corpus. These contents are broadly conceived—they may be about language use, social events, cultural categories, sentiments, identity, taste, etc., and these may be described synchronically or diachronically, i.e. as structures or as trends over time.

Specifically, you will acquire a collection of long-form texts and perform the following operations:

- Convert the collection from its source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model (F2).
- Annotate these tables with statistical and linguistic features using NLP libraries such as NLTK (F3).
- Produce a vector representation of the corpus to generate TFIDF values to add to the TOKEN (aka CORPUS) and VOCAB tables (F4).
- Extend the annotated and vectorized data model with tables and features derived from the application of unsupervised methods, including PCA, LDA, and word2vec (F5).
- Explore your results using statistical and visual methods.
- Present conclusions about patterns observed in the corpus by means of these operations.


## Deliverables
To receive full credit for the assignment, you will produce a digital analytical edition of a corpus, which will include a written report and be hosted on a dedicated GitHub repository.

This edition should include the following deliverables.

### Data Files
A collection of source files hosted on your UVA Box account. If these are too large to download conveniently, you should compress them as archive files (e.g., zip or tar.gz).

A collection of data files, each in CSV format, containing the F2 through F5 data you extracted from the corpus. These files should include, at a minimum, the following core tables:

- LIB.csv — Metadata for the source files.
- CORPUS.csv — This is a tokens table annotated with statistical and linguistic features, such as TFIDF. It should include an index that represents the OHCO of the documents in your corpus.
- VOCAB.csv — Annotated with statistical and linguistic features, such as DFIDF.

In addition, you should include the following data sets, either as features in the appropriate core table or as separate tables. Note that all tables should have an appropriate index and, where appropriate, an OHCO index.
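
As a minimal sketch of what an index-preserving CSV could look like, the example below builds a toy token table, writes it out, and reads it back with its hierarchical index restored. The level names, the data, and the in-memory buffer standing in for a file are all assumptions made for illustration; use whatever levels fit your own corpus.

```python
import io
import pandas as pd

# Hypothetical OHCO levels for illustration only.
OHCO = ['book_id', 'chap_num', 'sent_num', 'token_num']

# Toy token table (made-up data).
rows = [
    (1, 1, 0, 0, 'The'),
    (1, 1, 0, 1, 'Quick'),
    (1, 1, 1, 0, 'Brown'),
    (1, 2, 0, 0, 'Fox'),
    (2, 1, 0, 0, 'Jumps'),
]
CORPUS_demo = pd.DataFrame(rows, columns=OHCO + ['token_str']).set_index(OHCO)
CORPUS_demo['term_str'] = CORPUS_demo.token_str.str.lower()

# Write to CSV. An in-memory buffer is used here; a real run would pass a path.
buffer = io.StringIO()
CORPUS_demo.to_csv(buffer)

# Read back, restoring the OHCO MultiIndex from the named columns.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=OHCO)
```

Passing the level names to `index_col` on load is what keeps the saved table and the in-memory table aligned; without it, the OHCO levels come back as ordinary columns.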

#### Principal Components (PCA)

- Table of documents and components.
- Table of components and word counts (i.e., the “loadings”), either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.
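
The two PCA tables can be sketched as follows. This is an illustrative outline using a toy document-term matrix and a plain NumPy eigendecomposition, not a required implementation; the table names `DCM` and `LOADINGS`, the toy counts, and the choice of two components are assumptions.

```python
import numpy as np
import pandas as pd

# Toy document-term matrix (made-up counts); rows are documents, columns are terms.
DTM = pd.DataFrame(
    [[3, 0, 1, 0],
     [2, 0, 2, 1],
     [0, 4, 0, 2],
     [0, 3, 1, 3]],
    index=pd.Index(['d0', 'd1', 'd2', 'd3'], name='doc_id'),
    columns=pd.Index(['casting', 'cart', 'alloy', 'shipping'], name='term_str'),
    dtype=float,
)

n_comps = 2  # number of components to keep (a hyperparameter)

# Center each term column, then eigendecompose the term covariance matrix.
centered = DTM - DTM.mean()
cov = centered.cov()
eig_vals, eig_vecs = np.linalg.eigh(cov.values)

# eigh returns eigenvalues in ascending order; keep the largest n_comps.
order = np.argsort(eig_vals)[::-1][:n_comps]
comp_names = [f'PC{i}' for i in range(n_comps)]

# Loadings: one row per term, one column per component (shares the VOCAB index).
LOADINGS = pd.DataFrame(eig_vecs[:, order], index=DTM.columns, columns=comp_names)

# Document-component matrix: project the centered documents onto the loadings.
DCM = centered.dot(LOADINGS)
```

Because `LOADINGS` is indexed by `term_str`, it can be joined to a VOCAB table or saved on its own, and `DCM` carries the document index.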


#### Topic Models (LDA)

- Table of document and topic concentrations.
- Table of topics and term counts, either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.

#### Word Embeddings (word2vec)

- Terms and embeddings, either added to the VOCAB table or as a separate table with a shared index with the VOCAB table.

#### Sentiment Analysis

- Sentiment and emotion values as features in VOCAB or as a separate table with a shared index with the VOCAB table.
- Sentiment polarity and emotions for each document.

### Code Files
The Jupyter notebooks used to perform all operations that produced the data in your tables.

Any Jupyter notebooks used to explore and visualize the data in preparation for your final report.

Any Python files (e.g., .py files) you wrote to support your work.

Any other assets — e.g., images, stylesheets, JavaScript libraries, etc. — required by your notebooks.

### Report Document
A Jupyter notebook called FINAL_REPORT.ipynb describing your work and interpreting its results, along with links to all the files listed above. This report should be written using Markdown text cells and embedded graphics from your other notebooks to illustrate points. Do not reference images that are not included in the notebook. You may embed saved images to show figures in the notebook if you don't want to include the plotting code there. Include citations for any references made in the notebook.

This notebook should contain the following five sections:

1. Introduction. Describe the nature of your corpus and the question(s) you've asked of the data.

2. Source Data. Provide a description of all relevant source files and describe the following features for each source file:

- Provenance: Where did they come from? Describe the website or other source and provide relevant URLs.
- Location: Provide a link to the source files in UVA Box.
- Description: What is the general subject matter of the corpus? How many observations are there? What is the average document length?
- Format: A description of both the file formats of the source files, e.g., plaintext, XML, CSV, etc., and the internal structure where applicable. For example, if XML, then specify the document type (e.g., TEI or XHTML).

3. Data Model. Describe the analytical tables you generated in the process of tokenization, annotation, and analysis of your corpus. Provide a list of tables with field names and their definitions, along with URLs to each associated CSV file.

4. Exploration. Describe each of your explorations, such as PCA and topic models. For each, include the relevant parameters and hyperparameters used to generate each model and visualization. For your visualizations, you should use at least three (but likely more) of the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps showing correlations
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots

5. Interpretation. Provide your interpretation of the results of your exploration, and any conclusions if you are comfortable making them.

As a rule of thumb, aim for a six-page exported PDF. The question of length is secondary to the requirement that you complete all the sections.



### Form Level Description
- F0 Source Format. The initial source format of a text, which varies by collection, e.g. XML (e.g. TEI and RSS), HTML, plain text (e.g. Gutenberg), JSON, and CSV.
- F1 Machine Learning Corpus Format (MLCF). Ideally a table of minimum discursive units indexed by document content hierarchy.
- F2 Standard Text Analytic Data Model (STADM). A normalized set of tables including DOC, TOKEN, and TERM tables. Produced by the tokenization of F1 data.
- F3 NLP Annotated STADM. STADM with annotations added to token and term records indicating stopwords, parts-of-speech, stems and lemmas, named entities, grammatical dependencies, sentiments, etc.
- F4 STADM with Vector Space models. Vector space representations of TOKEN data and resulting statistical data, such as term frequency and TFIDF.
- F5 STADM with analytical models. STADM with columns and tables added for outputs of fitting and transforming models with the data.
- F6 STADM converted into interactive visualization. STADM represented as a database-driven application with interactive visualization, e.g., Jupyter notebooks and web applications.

In [None]:
import pandas as pd
import seaborn as sns
import nltk
import numpy as np

In [15]:
# OHCO levels, by analogy with a book corpus:
# company_id (renamed from company_num below) = BOOKS
# link_num = CHAPTERS
# text = PARAS

OHCO = ['company_id', 'link_num', 'sent_num', 'token_num']

### F0

#### Source Format. The initial source format of a text, which varies by collection, e.g. XML (e.g. TEI and RSS), HTML, plain text (e.g. Gutenberg), JSON, and CSV.

In [16]:
df = pd.read_csv('CORPUS.tar.gz', compression='gzip', lineterminator='\n')
df

Unnamed: 0,company_num,Text,characters
0,0,"Ahresty, with more than 60 years of experienc...",1709
1,0,"PRODUCTS Ahresty, with more than 60 years of e...",754
2,0,ENVIRONMENTAL,16
3,0,CONTACT Address Ahresty Wilmington Corporation...,439
4,1,Manufacturer ofMetal FastenersandGeneral Hardw...,1025
...,...,...,...
90628,1225,"Home•Careers Together, we build the future We...",2524
90629,1225,Privacy The protection of your personal data i...,12706
90630,1225,Signicast acquires European based CIREX 02.15....,5160
90631,1225,Email Protection You are unable to access this...,558


In [17]:
# Since this CORPUS is too big, I only included a certain number of companies.
# Otherwise, it crashes when running the sentence separator cell (> 300 companies).
# Tokenization doesn't crash when we limit it to 200 companies.
df = df[df["company_num"] < 200]
df.sample(5)

Unnamed: 0,company_num,Text,characters
7086,181,Shopping Cart Your shopping cart is empty 2020...,177
3475,71,The Disadvantages of having Porosity in Die Ca...,2692
3792,71,What is an Aluminum Die Casting Company? What ...,2447
1928,24,Sound Absorption Panels KOCH Finishing Systems...,305
4960,123,Your bag is currently empty. James Coppell Lee...,1563


### F1

#### Machine Learning Corpus Format (MLCF). Ideally a table of minimum discursive units indexed by document content hierarchy.

In [20]:
# Add link count column: the running index of each page within its company.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# on the filtered slice.
df = df.assign(link_num=df.groupby('company_num').cumcount())

DOCS = df[["company_num", "link_num", "Text", "characters"]]
DOCS = DOCS.rename(columns={'company_num': 'company_id', 'Text': 'text'})
DOCS = DOCS.set_index(["company_id"])
DOCS


company_id,link_num,text,characters
0,0,"Ahresty, with more than 60 years of experienc...",1709
0,1,"PRODUCTS Ahresty, with more than 60 years of e...",754
0,2,ENVIRONMENTAL,16
0,3,CONTACT Address Ahresty Wilmington Corporation...,439
1,0,Manufacturer ofMetal FastenersandGeneral Hardw...,1025
...,...,...,...
198,15,Phone:+55 41 3341 1900 Sitemap Coming Soon… He...,576
198,16,Phone:+55 41 3341 1900 Author:Daniel WHB Autom...,3545
199,0,Committed toQuality MaterialsQuality Workmansh...,332
199,1,Our Services Specializing in: Sandblasting San...,809


### F2

Convert the collection from its source formats (F0) into a set of tables that conform to the Standard Text Analytic Data Model.

#### Standard Text Analytic Data Model (STADM). A normalized set of tables including DOC, TOKEN, and TERM tables. Produced by the tokenization of F1 data.

#### 1. CHAPS

In [21]:
# CHAPTERS
CHAPS = DOCS.reset_index().set_index(["company_id", "link_num"])
CHAPS

company_id,link_num,text,characters
0,0,"Ahresty, with more than 60 years of experienc...",1709
0,1,"PRODUCTS Ahresty, with more than 60 years of e...",754
0,2,ENVIRONMENTAL,16
0,3,CONTACT Address Ahresty Wilmington Corporation...,439
1,0,Manufacturer ofMetal FastenersandGeneral Hardw...,1025
...,...,...,...
198,15,Phone:+55 41 3341 1900 Sitemap Coming Soon… He...,576
198,16,Phone:+55 41 3341 1900 Author:Daniel WHB Autom...,3545
199,0,Committed toQuality MaterialsQuality Workmansh...,332
199,1,Our Services Specializing in: Sandblasting San...,809


#### 2. SENTS

In [22]:
%%time
## SENTS
sent_pat = r'[.?!;:]+'
SENTS = CHAPS['text'].str.split(sent_pat, expand=True).stack().to_frame('sent_str')
SENTS.index.names = ["company_id", "link_num", "sent_num"]
SENTS

CPU times: user 1.46 s, sys: 25 ms, total: 1.48 s
Wall time: 1.49 s
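
The pattern above splits on every period and colon, including those inside clock times or decimal numbers. One possible alternative, sketched here on made-up text, splits only where sentence punctuation is followed by whitespace and leaves the colon out of the delimiter set. The names `CHAPS_demo`, `SENTS_demo`, and `alt_sent_pat` are hypothetical, and whether this pattern suits the real corpus is an assumption that would need checking.

```python
import pandas as pd

# Toy stand-in for the CHAPS table (made-up text).
CHAPS_demo = pd.DataFrame(
    {'text': ['Hours: Monday-Friday 7:30 am - 4:00 pm. Call us today! We ship v2.5 parts.']},
    index=pd.MultiIndex.from_tuples([(0, 0)], names=['company_id', 'link_num']),
)

# Split only where sentence punctuation is followed by whitespace,
# so "7:30" and "2.5" stay intact. The matched punctuation is consumed.
alt_sent_pat = r'[.?!;]+\s+'

SENTS_demo = (
    CHAPS_demo['text']
    .str.split(alt_sent_pat, expand=True)
    .stack()
    .dropna()  # guard against padding cells from shorter documents
    .to_frame('sent_str')
)
SENTS_demo.index.names = ['company_id', 'link_num', 'sent_num']
```

The final sentence keeps its trailing period because no whitespace follows it; stripping that is left as a separate step.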


#### 3. TOKENS

In [23]:
## TOKENS
## TOKENIZING THE TABLE TAKES AROUND 5 MINS,
# SO YOU CAN JUST LOAD THE ALREADY SAVED TOKENS TABLE INSTEAD.
TOKENS = pd.read_csv('TOKENS.tar.gz', compression='gzip', lineterminator='\n')
TOKENS

company_id,link_num,sent_num,sent_str
0,0,0,"Ahresty, with more than 60 years of experienc..."
0,0,1,Industry leadingmanufacturing technology Ahre...
0,0,2,We utilize leak testing on all machine lines...
0,0,3,We currently have 28 fully automated High Pre...
0,0,4,Global leaderhere at home The Ahresty Wilming...
...,...,...,...
199,2,1,Hours
199,2,2,Monday-Friday 7
199,2,3,30 am - 4
199,2,4,00 pm 618-753-3188 Web Design by Novel Designs...


### F3

#### NLP Annotated STADM. STADM with annotations added to token and term records indicating stopwords, parts-of-speech, stems and lemmas, named entities, grammatical dependencies, sentiments, etc.

In [24]:
# keep_whitespace = True

In [25]:
# %%time
# if keep_whitespace:
#     TOKENS = SENTS.sent_str\
#             .apply(lambda x: pd.Series(nltk.pos_tag(nltk.word_tokenize(x))))\
#             .stack()\
#             .to_frame('pos_tuple')
# else:
#     TOKENS = SENTS.sent_str\
#             .apply(lambda x: pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x))))\
#             .stack()\
#             .to_frame('pos_tuple')



CPU times: user 4min 11s, sys: 11.4 s, total: 4min 22s
Wall time: 4min 23s


In [26]:
# TOKENS.index.names = ["company_id", "link_num", "sent_num", "token_num"]
# TOKENS

company_id,link_num,sent_num,token_num,pos_tuple
0,0,0,0,"(Ahresty, NNP)"
0,0,0,1,"(,, ,)"
0,0,0,2,"(with, IN)"
0,0,0,3,"(more, JJR)"
0,0,0,4,"(than, IN)"
...,...,...,...,...
199,2,5,1,"(Designed, VBN)"
199,2,5,2,"(byElegant, JJ)"
199,2,5,3,"(Themes|, NNP)"
199,2,5,4,"(Powered, NNP)"


In [27]:
# %%time
# TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
# TOKENS['token_str'] = TOKENS.pos_tuple.apply(lambda x: x[0])
# TOKENS['term_str'] = TOKENS.token_str.str.lower()
# TOKENS

CPU times: user 2.37 s, sys: 76.9 ms, total: 2.45 s
Wall time: 2.45 s


company_id,link_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
0,0,0,0,"(Ahresty, NNP)",NNP,Ahresty,ahresty
0,0,0,1,"(,, ,)",",",",",","
0,0,0,2,"(with, IN)",IN,with,with
0,0,0,3,"(more, JJR)",JJR,more,more
0,0,0,4,"(than, IN)",IN,than,than
...,...,...,...,...,...,...,...
199,2,5,1,"(Designed, VBN)",VBN,Designed,designed
199,2,5,2,"(byElegant, JJ)",JJ,byElegant,byelegant
199,2,5,3,"(Themes|, NNP)",NNP,Themes|,themes|
199,2,5,4,"(Powered, NNP)",NNP,Powered,powered


In [28]:
# SAVE TOKENS TABLE
# TOKENS.to_csv("TOKENS.csv")

#### VOCAB

In [31]:
%%time
VOCAB = TOKENS.term_str.value_counts().to_frame('n')
VOCAB.index.name = 'term_str'
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = -np.log2(VOCAB.p)
VOCAB['n_chars'] = VOCAB.index.str.len()
VOCAB['max_pos'] = TOKENS[['term_str','pos']].value_counts().unstack(fill_value=0).idxmax(1)
VOCAB

term_str,n,p,i,n_chars,max_pos
",",140656,4.157575e-02,4.588114,1,","
the,108291,3.200915e-02,4.965372,3,DT
and,77276,2.284160e-02,5.452193,3,CC
to,68675,2.029927e-02,5.622428,2,TO
die,62309,1.841758e-02,5.762773,3,NNP
...,...,...,...,...,...
ephgrave,1,2.955846e-07,21.689925,8,NNP
bastian,1,2.955846e-07,21.689925,7,NNP
atentamente,1,2.955846e-07,21.689925,11,NNP
desarrolle,1,2.955846e-07,21.689925,10,NN
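
As a quick sanity check on the `p` and `i` columns, the relationship between them can be verified from a single row of the table above. The sketch below takes the `p` value shown for a term that occurs once and recomputes its self-information `i = -log2(p)`; it also recovers the approximate total token count as `n / p`. The variable names are hypothetical.

```python
import numpy as np

# Values copied from the VOCAB output above for a term with n = 1.
n_single = 1
p_single = 2.955846e-07

# Self-information in bits, as defined in the VOCAB cell.
i_single = -np.log2(p_single)

# Since p = n / total, the total token count is roughly n / p.
approx_total_tokens = n_single / p_single
```

Here `i_single` comes out close to the 21.689925 reported in the table, and `approx_total_tokens` is on the order of 3.4 million.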


### F4

#### STADM with Vector Space models. Vector space representations of TOKEN data and resulting statistical data, such as term frequency and TFIDF.
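
One way this step might look is sketched below on a toy token table. The bag level `doc_id`, the table names, the made-up tokens, and the particular TF and IDF variants (length-normalized counts and a base-2 log) are all assumptions; other weighting schemes are equally valid.

```python
import numpy as np
import pandas as pd

# Toy TOKENS table (made-up data); 'doc_id' is a hypothetical bag level.
TOKENS_demo = pd.DataFrame({
    'doc_id':   [0, 0, 0, 1, 1, 1, 2, 2],
    'term_str': ['die', 'casting', 'die', 'die', 'cart', 'cart', 'die', 'alloy'],
})

# Bag-of-words counts, then a document-term count matrix.
BOW = TOKENS_demo.groupby(['doc_id', 'term_str']).size().to_frame('n')
DTCM = BOW['n'].unstack(fill_value=0)

# Term frequency: counts normalized by document length (one of several variants).
TF = DTCM.div(DTCM.sum(axis=1), axis=0)

# Document frequency and inverse document frequency (log base 2).
DF = (DTCM > 0).sum()
IDF = np.log2(len(DTCM) / DF)

# TFIDF: each term column of TF is scaled by that term's IDF.
TFIDF = TF * IDF

# Aggregate per term to add features to a VOCAB-style table.
VOCAB_demo = pd.DataFrame({'df': DF, 'idf': IDF, 'tfidf_mean': TFIDF.mean()})
```

A term that appears in every document, like `die` in this toy data, gets an IDF of zero and so contributes nothing to TFIDF, which is the intended effect of the weighting.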