<a href="https://colab.research.google.com/github/hlapin/DHTeaching/blob/master/Getting_Started_With_Text_Mining_Stylometry%2C_TF_IDF%2C_Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This colab notebook was prepared for teaching purposes for a session on text mining in a course on digital tools in historical research.  
As a text source the examples use the text of the *Federalist Papers*.   
Much of what follows is derivative of resources available on the web, although I have revised material for display or pedagogical purposes. My dependence is especially visible with the tutorial on stylometry at _The Programming Historian_:
> https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python   

(The goal has been to provide "plug-and-play" demonstrations, leaving space open for discussion of methods and concepts.) 
For those of you new to Colab/Jupyter, there are text blocks, code blocks, and--after running code--output blocks. Subsections may be hidden by default, but `ctrl+[` should expand all sections and `ctrl+]` collapse them.    
You should not need to write any code (although you are welcome to experiment): just click the run (play arrow) button at the top left of each code block. You generally ***do*** need to run the code blocks in order.  

# Getting the text
We are going to be working with the *Federalist Papers*.  
We need to:
1. Download a zip file from GitHub (the Programming Historian repository)
2. Open the zip archive in memory, and
3. Extract the files into a local folder [local to Colab]


In [None]:
import requests, io, zipfile, os
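os.chdir('/content/') # changes working directory on the Colab virtual machine
r = requests.get('https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/introduction-to-stylometry-with-python/stylometry-federalist.zip?raw=true')
r.raise_for_status() # stop here with a clear error if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content)) # open the archive in memory
z.extractall() # unpack the files into the working directory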



There is now a local folder called `data`, containing the files, on the virtual machine you are using.   
You can check that in the Files tool in the left-hand panel.   
However, let's set the working directory to that folder and confirm by listing its contents programmatically.
 

In [None]:
os.chdir('/content/data/')
os.listdir()

## Text to Dataframe
Now we are going to read each of the individual papers into a table (a pandas dataframe, `dfPapers`) in which each row is a document, with the column headings `file_name` and `text`.

For later use we are going to create a column `num` that has only the numerical part of the paper's name (`8` for `Federalist_8`) and we are going to extract an approximation of the document's title (the third line in each document; that will go into `title`).

In [None]:
import pandas as pd # pandas is a data structure library
import glob         # finds all pathnames matching a specific pattern


# create a dataframe with each document as a row
# we start by creating a "dictionary" to represent the columns
results = {"file_name":[],"num":[],"title":[],"text":[]}

for item in glob.glob('*[0-9].txt'):  # read only files with names ending 
                                      # with numerals. Why?
   
   # each `item` is a path that matches the pattern, e.g., 'Federalist_8.txt'

   # below, the function split() splits a string into substrings based on 
   # a specified separator. 
   # [In Python, the first item in a list is indexed as 0 rather than 1]
   
   short = item.split('.')[0]           # grab filename without '.txt'
   paperNum = int(short.split('_')[1])  # grab the numeral after '_'
   with open(item, "r") as file_open:
     txt = file_open.read()
     results["file_name"].append(short)
     results["num"].append(paperNum)
     results["title"].append(txt.split('\n')[2])
     results["text"].append(txt.replace('\n', ' '))

# pandas has built in abilities to convert dictionaries to dataframes
dfPapers = pd.DataFrame(results)
dfPapers = dfPapers.sort_values(["num"])
dfPapers = dfPapers.reset_index(drop=True)

#let's check that we got the documents into shape
dfPapers   

## A Bit of Cleanup
Let's remove punctuation and convert all upper case to lower case, and then print a sample of our data to see if we got it right.  
> *Regular expressions* (often `regex`) are patterns that define operations on text. For example, a valid email address is an unbroken string, followed by '@', followed by a domain and one of a number of valid suffixes (.org, .edu, .ac.uk). For an example of how complex a regex may need to be to capture "all" emails see: http://emailregex.com/

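To make the idea concrete, here is a small sketch using Python's `re` module. The email pattern is deliberately simplified and the sample strings are invented for illustration; a pattern meant to catch "all" addresses would be far longer.

```python
import re

# A deliberately simplified email pattern (an assumption for illustration):
# an unbroken string, an '@', a domain, and one of three suffixes.
simple_email = re.compile(r"[^@\s]+@[^@\s]+\.(?:org|edu|ac\.uk)")

sample = "Write to publius@example.org or to jay@history.ac.uk; not to madison@"
found_emails = simple_email.findall(sample)
print(found_emails)

# The same module can strip digits and punctuation and tidy the spacing.
sentence = "To the People of the State of New York: (No. 10) 'Faction'!"
no_punct = re.sub(r"[\d'\"():;,.!?]", "", sentence)       # drop digits, punctuation
cleaned = re.sub(r"\s+", " ", no_punct).strip().lower()   # single spaces, lower case
print(cleaned)
```

The square brackets define a set of characters to match, `\d` stands for any digit, and `\s+` stands for one or more whitespace characters.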



In [None]:
import re #re is the module that does regular expression operations

# note that pandas allows us to operate on all the cells in a column
# of a dataframe by filtering by column label: dfPapers['text'] 

# regularize spacing: 
# replace one or more line breaks or spaces with single space
dfPapers['text'] = dfPapers['text'].map(lambda x: re.sub(r"\s+", ' ', x))

# remove punctuation, numerals, etc. This time replace with no space
dfPapers['text'] = dfPapers['text'].map(lambda x: re.sub(r"[\d\'\"\(\)\:;,\.!?‘’“”]", '', x))

# convert characters to lower case
dfPapers['text'] = dfPapers['text'].map(lambda x: x.lower())

# again, let's check that we got the documents into proper shape
dfPapers

# Some Exploratory Analysis
First we are going to do some exploratory text analysis by making a word cloud.  
How does the model select words to present?

## Frequency vs significance
In a word cloud, frequency determines the size of the word, but:
* Is it really modelling the ***most frequent*** words?
* What is the problem with words like may, will?

The python library that does the work for us has a default set of `stopwords`: very common words that it filters out.
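
What stopword filtering does to a frequency ranking can be sketched with the standard library alone. The sentence and the tiny stopword set below are invented for illustration; they are not the list that the word cloud library uses.

```python
from collections import Counter

# toy text and a tiny, hypothetical stopword list (not wordcloud's own)
toy_text = ("the power of the union may be the safety of the people "
            "and the people may judge the power")
toy_stopwords = {"the", "of", "and", "may", "be"}

toy_tokens = toy_text.split()

# raw frequency: function words dominate the ranking
raw_counts = Counter(toy_tokens)
print(raw_counts.most_common(3))

# frequency after filtering: content words come to the surface
filtered_counts = Counter(t for t in toy_tokens if t not in toy_stopwords)
print(filtered_counts.most_common(3))
```

Whether a word like "may" belongs on the stopword list is a judgment call, which is part of the point of the questions above.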

In [None]:
from wordcloud import WordCloud

# code adapted from https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
# See also: https://towardsdatascience.com/generate-meaningful-word-clouds-in-python-5b85f5668eeb

# Join the different processed papers together into one long text.
long_string = ','.join(list(dfPapers['text'].values))

# Create a WordCloud object
# You can change the parameters below
wordcloud = WordCloud(background_color="white", 
                        max_words=100, 
                        contour_width=3, 
                        contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()


## Let's run the same function without any stopwords

In [None]:
stopwords = set() # define the list of stopwords as an empty set 

# Join the different processed papers together into one long text.
long_string = ','.join(list(dfPapers['text'].values))

# Create a WordCloud object
# You can change the parameters below
wordcloud = WordCloud(background_color="white", 
                        max_words=100, 
                        contour_width=3, 
                        stopwords = stopwords,
                        collocations=False, #only single tokens
                        contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(long_string)

# Visualize the word cloud
wordcloud.to_image()

# Stylometry
The approach we are taking to stylometry is based on most frequent ("function") words. We could refine this further, but the basic observation is that no two authors use very common words (in English) like "the" or "and" or "of" identically (or punctuate quite the same way). In principle, we should be able to create a "fingerprint" for a given author based on the distinctive use of these stopwords. 
From there, techniques can become quite complicated, but the basic idea is that if we take a number of features (in our case, most frequent words) and see how these are used by each author, we can use this to measure distance between sample texts or authors' corpora. Before we do that we want to standardize our measurements (so that a larger corpus does not outweigh a smaller one) and decide on how to weight the features. (If the word "the" is about 10% of all the words it is clearly potentially distinctive; but how much should it count against the next most frequent words?)

# Stylometry I Author Identification
> Adapted from "Programming Historian"   
https://doi.org/10.46430/phen0078   

This section uses the same texts as before (the Federalist Papers) to illustrate the use of **Burrows's Delta**. What this tries to do is to measure how each writer in a corpus uses "function words" (in English, the very common words like "the" and "is" and "to").       
`Delta` seeks to aggregate the observations about each of the features we are testing for (we have used the 30 most common words in the document set).  
For each of the features we compare the frequency of the word in each of the texts we are examining in comparison with that of the corpus as a whole, and standardize the measurements across the corpus (so that more prolific authors [Hamilton] or more common words do not outweigh all the other authors or features).   
We also hold out Federalist 64 as a test case, and calculate `Delta` between that essay, and the other authors. A smaller `Delta` between texts means that two texts are "closer" to each other. 

## First let's modify the dataframe we created to add attribution
This follows the canonical division, plus a test case, as in PH.

In [None]:
import nltk

%unload_ext google.colab.data_table
# A "canonical" division into authors plus one test case, as in PH
papers = {
    'Madison': [10, 14, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48],
    'Hamilton': [1, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 21, 22, 23, 24,
                 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 59, 60,
                 61, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
                 78, 79, 80, 81, 82, 83, 84, 85],
    'Jay': [2, 3, 4, 5],
    'Shared': [18, 19, 20],
    'Disputed': [49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 62, 63],
    'TestCase': [64]
}
k, v = list(papers.keys()), list(papers.values())
def return_attrib (num):
  """ checks for the document no in lists of values
      returns first letter of attribution
      to use as label. """
  for i in v:
    if num in i: 
      return k[v.index(i)][0]

# insert attribution to datatable in first position if it does not exist
# if not "attrib" in dfPapers.columns:
#   dfPapers.insert(loc=0, column='attrib',value='')


dfPapers["attrib"] = dfPapers["num"].apply(return_attrib)
dfPapers.head(5)    


## Feature Selection
We are choosing 30 features, as in the PH example.

Create a composite feature set for all texts (except test case)
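
The two steps in the next cell (pick the most frequent words of the pooled corpus, then measure each author's use of them) can be sketched on a miniature scale. The two "authors" and their texts are invented, and `collections.Counter` stands in here for the `nltk` frequency distribution.

```python
from collections import Counter

# hypothetical mini-corpus: two "authors" with one short text each
toy_texts = {
    "A": "the state and the union and the people",
    "B": "of the people by the people for the people of the union",
}

# composite corpus: pool every author's tokens
all_tokens = " ".join(toy_texts.values()).split()

# the n most frequent words in the pooled corpus become the features
n_features = 2
toy_features = [word for word, count in Counter(all_tokens).most_common(n_features)]

# for each author: the share of that author's tokens taken up by each feature
toy_profile = {}
for name, txt in toy_texts.items():
    toks = txt.split()
    counts = Counter(toks)
    toy_profile[name] = {w: counts[w] / len(toks) for w in toy_features}

print(toy_features)
print(toy_profile)
```

Dividing by the author's total token count is what keeps a prolific author from outweighing the others.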

In [None]:
# filter dataframe to exclude testcase
# this is a variation of the operation we used above for the word cloud
# create a single string
corpus = ' '.join(dfPapers["text"][dfPapers["attrib"] != "T"].values)

# separate into a list of individual words ('tokens'). 
# These are our 'features'
corpus_tokens = corpus.split()

# create frequency list using built in nltk library
# this will give us the n (30) most common words and how often they appear
whole_corpus_freq_dist = list(nltk.FreqDist(corpus_tokens).most_common(30))

# # uncomment and run again to see the first 10
# print(whole_corpus_freq_dist[ :10 ])

# data structure to contain our statistical information
dfFeatures = pd.DataFrame( columns=["feats"])
dfFeatures["feats"] = [w for w, freq in whole_corpus_freq_dist]
dfFeatures["corpus"] = [freq for w, freq in whole_corpus_freq_dist]

# calculate frequency for each of the "authors"
# authors to test
authors = ("H","M","J","S","D","T")
for author in authors:
  author_corpus = ' '.join(dfPapers["text"][dfPapers["attrib"] == author].values)

  #separate into a list of values
  author_tokens = author_corpus.split()

  # total number of tokens for this author
  author_length = len(author_tokens)

  # copy the features to a list
  # for each feature, calculate its share of the author's total tokens
  # and store the result in a new column named for the author
  features = dfFeatures.feats.to_list()
  author_features = [author_tokens.count(x)/author_length for x in features]
  dfFeatures[author] = author_features

dfFeatures.head()



## Means, Standard Deviation, z-scores
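The cell below computes the standard deviation with an explicit loop. As a cross-check, the same quantities can be sketched with pandas' built-in row-wise methods. The frequencies in this toy table are invented, and the names `toy`, `toy_means`, `toy_stdev` and `toy_z` are placeholders rather than part of the notebook's data.

```python
import pandas as pd

# hypothetical relative frequencies of two features for three "authors"
toy = pd.DataFrame(
    {"A": [0.10, 0.04], "B": [0.08, 0.05], "C": [0.06, 0.03]},
    index=["the", "of"],
)

# row-wise mean and sample standard deviation
# (pandas divides by n - 1 by default, like the loop in the next cell)
toy_means = toy.mean(axis=1)
toy_stdev = toy.std(axis=1)

# z-score: how many standard deviations each value lies from its feature mean
toy_z = toy.sub(toy_means, axis=0).div(toy_stdev, axis=0)
print(toy_z.round(3))
```

A z-score of 0 means an author uses the word exactly as often as the average author; positive and negative values mean more and less often.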

In [None]:
import math
# mean of the mean frequency of each feature
# exclude testcase from means
authors_no_T = ["H","M", "J", "S", "D"]

#calculate the means of columns
dfFeatures["means"] = dfFeatures[authors_no_T].mean(axis=1)

# dfFeatures

# calculate the sample stdev of each feature across the authors
# formula stdev = sqrt(sum((x[i] - mean)^2) / (n - 1))
# there should be a more efficient way of doing this in Pandas but
# (a) I am a newbie
# (b) this makes the process explicit

n = len(authors_no_T)
stdev = list([0]*len(features))

for i in range(len(features)):
  sum_squ_diff = 0
  # select by column label rather than by position
  author_feature_values = dfFeatures.loc[i, authors_no_T].values
  feature_mean = dfFeatures.loc[i, "means"]
  
  for j in range(len(authors_no_T)):
    squ_diff_fr_mean = (author_feature_values[j] - feature_mean)**2
    sum_squ_diff = sum_squ_diff + squ_diff_fr_mean
  # take the square root once, after all squared differences are summed
  stdev[i] = math.sqrt(sum_squ_diff/(n - 1))

dfFeatures["stdev"] = stdev

# z-scores
# formula z[j] = (Observed[j] - mean[j])/stdev[j]

# dataframe to hold z scores
z_cols = list(authors)
z_cols.extend(["means", "stdev"])
#z_cols
# calculate z-scores
dfZ = dfFeatures[z_cols].copy()
for author in authors:
  dfZ[author] = (dfZ[author] - dfZ["means"])/dfZ["stdev"]
dfZ.head(7)


Sample values in PH for T (Federalist 64) are:
```
Test case z-score for feature the is -0.7692828380408238
Test case z-score for feature of is -1.8167784558461264
Test case z-score for feature to is 1.032705844508835
Test case z-score for feature and is 1.0268752924746058
Test case z-score for feature in is 0.6085448501260903
Test case z-score for feature a is -0.9341289591084886
Test case z-score for feature be is 1.0279650702511498
```
Our values are close to those calculated there.

## Calculate Burrows's Delta  
> (from PH)  
Finally, calculate a delta score comparing the anonymous paper with each candidate’s subcorpus. To do this, take the average of the ***absolute values of the differences between the z-scores for each feature between the anonymous paper and the candidate’s subcorpus.*** (Read that twice!) This gives equal weight to each feature, no matter how often the words occur in the texts; otherwise, the top 3 or 4 features would overwhelm everything else.

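A worked miniature may help with the "read that twice" sentence. The z-scores below are invented for illustration, and the candidate names `X` and `Y` are placeholders.

```python
# hypothetical z-scores for three features (illustrative numbers only)
z_test = [0.5, -1.0, 1.5]            # the anonymous paper
z_candidates = {
    "X": [0.0, -0.5, 1.0],
    "Y": [-1.5, 1.0, 0.0],
}

def burrows_delta(z_a, z_b):
    """Mean of the absolute differences between two lists of z-scores."""
    return sum(abs(a - b) for a, b in zip(z_a, z_b)) / len(z_a)

toy_deltas = {name: burrows_delta(z_test, z) for name, z in z_candidates.items()}

# the candidate with the smallest delta is the "closest" to the test paper
closest = min(toy_deltas, key=toy_deltas.get)
print(toy_deltas, closest)
```

For candidate `X` the three absolute differences are each 0.5, so the delta is 0.5; for `Y` they are 2.0, 2.0 and 1.5, giving a much larger delta.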

In [None]:
# formula Delta = sum(abs(z_T[i] - z_author[i])) / number_of_features

# new data frame for delta
dfDelta = dfZ.copy()
col_keys = list(authors_no_T)
col_vals = list(authors_no_T)
for idx, v in enumerate(col_vals):
  col_vals[idx] = "T_to_" + col_vals[idx]
col_labels = dict(zip(col_keys, col_vals))
dfDelta = dfDelta.rename(index=str, columns=col_labels)

for v in col_vals:
  dfDelta[v] = abs(dfDelta[v] - dfDelta["T"])

# add rows for the sum of the differences and for delta
# (delta = sum / number of features)
dfDelta.loc["sum"] = dfDelta.sum(axis=0)
dfDelta.loc["delta"] = dfDelta.loc["sum"]/len(features)

# blank out the sum and delta cells in columns that are not comparisons
all_col_labels = list(dfDelta.columns)
for col in all_col_labels:
  if col not in col_vals:
    dfDelta.loc["sum",col] = ""
    dfDelta.loc["delta", col] = ""

dfDelta.tail(2)

In [None]:
# report out

for col in col_vals:
  print("Delta for " + col + " is: ",dfDelta.loc["delta",col])


This was the conclusion/output in PH:
```
Delta score for candidate Hamilton is 1.768470453004334
Delta score for candidate Madison is 1.6089724119682816
**Delta score for candidate Jay is 1.5345768956569326**
Delta score for candidate Disputed is 1.5371768107570636
Delta score for candidate Shared is 1.846113566619675
```

This was Laramée's concluding paragraph:

> As expected, Delta identifies John Jay as Federalist 64’s most likely author. It is interesting to note that, according to Delta, Federalist 64 is more similar to the disputed papers than to those known to have been written by Hamilton or by Madison; why that might be, however, is a question for another day.

Has our work confirmed this?

# Stylometry II: Visualizing Similarity and Difference
Here we are essentially repeating what we did when applying Burrows's Delta, with two important changes: (1) we are applying it to each publication separately; and (2) we are comparing every work to every other work to gauge "closeness" or "distance."   
As with topic models or word vector embeddings, we can think of each sample as a vector and use a distance measurement to position them in multidimensional space.
Here there's a problem: how do we examine or visualize these relationships, and how do we find the important ones? (With 30 features, we need to allow up to 29 dimensions to describe the variation. Humans have difficulty imagining more than three.)   
We are going to demo two techniques for visualizing multi-dimensional data: **Principal Component Analysis** (PCA) and **t-distributed stochastic neighbor embedding**.
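
Treating each sample as a vector and measuring distance can be sketched as follows. The three "papers" and their frequencies are invented, and Euclidean distance is only one of several measures that could be used here.

```python
import math

# hypothetical relative frequencies of three function words in three papers
toy_vectors = {
    "paper_1": [0.10, 0.05, 0.03],
    "paper_2": [0.09, 0.06, 0.03],
    "paper_3": [0.04, 0.02, 0.08],
}

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# compare every paper with every other paper (each pair once)
toy_names = list(toy_vectors)
toy_distances = {
    (a, b): euclidean(toy_vectors[a], toy_vectors[b])
    for i, a in enumerate(toy_names)
    for b in toy_names[i + 1:]
}
for pair, dist in toy_distances.items():
    print(pair, round(dist, 4))
```

With three features the vectors can still be pictured as points in a room; with 30 features the arithmetic is identical, but the picture is lost, which is why we turn to the techniques below.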

   
## PCA
**Principal Component Analysis** attempts to calculate the directions of the dimensions and their weights. When we are looking at a two dimensional plot of principal components, we are looking at two specific dimensions calculated by the model. The analysis also can tell us how much of the variation among our test cases is explained by the specific components.   
In our case, using the same number of features as before, the first two components account for a little under a quarter of the variation among the documents, but it looks like the first component (the horizontal or x axis) is where a good deal of the author differentiation is happening.
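
What "directions" and "share of variation explained" mean can be sketched with NumPy on a toy table. This is a minimal illustration of the centering and singular value decomposition that PCA rests on, with invented data; it is not meant as a replacement for the scikit-learn class used in this notebook.

```python
import numpy as np

# hypothetical data: 6 documents (rows) x 3 features (columns)
toy_X = np.array([
    [0.10, 0.05, 0.03],
    [0.09, 0.06, 0.03],
    [0.11, 0.05, 0.02],
    [0.04, 0.02, 0.08],
    [0.05, 0.03, 0.07],
    [0.04, 0.03, 0.08],
])

# 1. center each feature (column) on its mean
toy_centered = toy_X - toy_X.mean(axis=0)

# 2. singular value decomposition: the rows of Vt are the component directions
U, S, Vt = np.linalg.svd(toy_centered, full_matrices=False)

# 3. share of the total variance captured by each component
toy_explained = S**2 / np.sum(S**2)

# 4. coordinates of each document on the first two components
toy_coords = toy_centered @ Vt[:2].T

print(toy_explained.round(3))
print(toy_coords.round(3))
```

In this toy table the first three documents resemble one another and the last three resemble one another, so nearly all of the variation falls on the first component and the two groups land on opposite sides of it.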

## Prepare the data as for Burrows's Delta
Here we go over the steps that we followed for Burrows's Delta to get a frequency table for each individual publication in the Federalist Papers.    
We then reformat the data in a transposed table for PCA. (Without transposing, PCA would calculate how much the features differ from one another over the 85-dimensional space of the papers. A very different question.)

In [None]:
# Repeat earlier steps, but now assigning each paper, not each author, to a column 
# copy features and corpus data

# repeating here allows us to experiment with feature values
# see the explanations above

corpus_tokens = corpus.split()

# create frequency list using built in nltk function
whole_corpus_freq_dist = list(nltk.FreqDist(corpus_tokens).most_common(30))

# # uncomment and run again to check the first 10
# print(whole_corpus_freq_dist[ :10 ])

# data structure to contain our statistical information
dfFeatures_pca = pd.DataFrame( columns=["feats","corpus"])
dfFeatures_pca["feats"] = [w for w, freq in whole_corpus_freq_dist]
dfFeatures_pca["corpus"] = [freq for w, freq in whole_corpus_freq_dist]

# the last time we did this we bundled the docs together by attribution 
# This time, we create frequency table by publication rather than author

for p in range(1,len(dfPapers)+1):  # iterate over all the files
  paper_corpus = ' '.join(dfPapers["text"][dfPapers["num"] == p].values)

  #separate into a list of tokens (features)
  paper_tokens = paper_corpus.split()
  
  paper_length = len(paper_tokens)

  # copy the features to a list
  # for each feature, calculate its share of the paper's total tokens
  # and store the result in a new column named for the paper number
  features = dfFeatures_pca.feats.to_list()
  
  # # raw numbers
  # paper_features = [paper_tokens.count(x) for x in features]

  # proportions
  paper_features = [paper_tokens.count(x)/paper_length for x in features]
  dfFeatures_pca[p] = paper_features


# transpose
dfFeats_transp = dfFeatures_pca.transpose()
dfFeats_transp.drop(["feats", "corpus"],inplace=True)

for f in features:
  dfFeats_transp.rename(columns={features.index(f):f},inplace=True)
	

# check the data
dfFeats_transp.head()



## Calculate PCA
We are using tools built into the scikit-learn library

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA



pca = PCA(n_components=3)
# # # to scale (each feature: mean = 0, stdev = 1)
# # # balances the weights of more and less frequent words
# scaled = StandardScaler().fit_transform(dfFeats_transp)
# principalComponents = pca.fit_transform(scaled)

# to apply scaling uncomment above and comment next line
principalComponents = pca.fit_transform(dfFeats_transp)
principalDf = pd.DataFrame(data = principalComponents,
                           columns = ['principal component 1', 
                           'principal component 2', 
                           'principal component 3'])

# add attrib labels
principalDf["attrib"] = dfPapers["attrib"]

principalDf.head()

## Plot a 2-Dimensional Grid for two Components.
Plot a 2D scatter chart for the two first principal components.  
Federalist 64 is marked in red on the plot. Does this confirm our earlier observation using Burrows's Delta?  
This was Laramée's concluding paragraph:
>As expected, Delta identifies John Jay as Federalist 64’s most likely author. It is interesting to note that, according to Delta, Federalist 64 is more similar to the disputed papers than to those known to have been written by Hamilton or by Madison; why that might be, however, is a question for another day.   

Is this still true?  
What happens when we increase the number of features?
Decrease?


In [None]:
import matplotlib.pyplot as plt
## plotting code from:
## https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

# print out the share of variance explained by each of the three components
print(pca.explained_variance_ratio_)


fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ["H","M","J","S","D","T"]
colors = ['purple', 'green', 'black','yellow', 'blue', 'red']
for target, color in zip(targets,colors):
  indicesToKeep = principalDf['attrib'] == target
  ax.scatter(principalDf.loc[indicesToKeep, 'principal component 1']
               , principalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()


## t-SNE
**t-distributed stochastic neighbor embedding** is another widely used method for visualizing high-dimensional data in 2- to 3-dimensional space. 
It assigns a probability that each datapoint is similar to every other datapoint, and then attempts to map those relationships in a lower dimensional space. 

A frequent description is that PCA attempts to preserve the overall structure of the data as a whole, while t-SNE preserves the relationship with neighbors.
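
The first half of that description, turning distances into neighbor probabilities, can be sketched with NumPy. This is a simplification with invented points and a single fixed bandwidth (`sigma`, an assumption made here); the full method tunes a bandwidth for each point from the perplexity setting and then searches for a low-dimensional layout, which this sketch does not attempt.

```python
import numpy as np

# hypothetical points: two near each other, one far away
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])

# squared Euclidean distance between every pair of points
diff = toy_points[:, None, :] - toy_points[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# Gaussian similarity with one fixed bandwidth (a simplifying assumption)
sigma = 1.0
affinity = np.exp(-sq_dist / (2 * sigma ** 2))
np.fill_diagonal(affinity, 0.0)          # a point is not its own neighbor

# row i: probability that point i would pick each other point as its neighbor
toy_probs = affinity / affinity.sum(axis=1, keepdims=True)
print(toy_probs.round(3))
```

The first point assigns almost all of its probability to its close neighbor and almost none to the distant point, which is the sense in which the method concentrates on local neighborhoods.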

## Calculating t-SNE
Although the math of the two approaches is different, in scikit-learn the procedure is more or less the same. 

Our data is the same `dfFeats_transp` we created for PCA: a table that presents each of the words (features) as columns and each of the documents as rows.

* set the parameters for the t-SNE model we are using

* fit and transform our data to the model to generate an array of points

* plot those points in 2-dimensional space


In [None]:
# https://github.com/olekscode/Examples-PCA-tSNE/blob/master/Python/Visualizing%20Iris%20Dataset%20using%20PCA%20and%20t-SNE.ipynb

# We set the parameters for t-SNE
from sklearn.manifold import TSNE
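tsne = TSNE(n_components=2, 
            # the number of iterations is left at the library default (1000);
            # the keyword for it was renamed in recent scikit-learn releases
            random_state=0, 
            perplexity=15, 
            learning_rate='auto',
            init='random')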
points = tsne.fit_transform(dfFeats_transp[features])

In [None]:
tsneDF = pd.DataFrame(data = points,
                           columns = ['eigenvector 1', 
                           'eigenvector 2'])
tsneDF["attrib"] = dfPapers["attrib"]
tsneDF

In [None]:
fig_tsne = plt.figure(figsize = (8,8))
ax = fig_tsne.add_subplot(1,1,1) 
ax.set_xlabel('t-SNE dimension 1', fontsize = 15)
ax.set_ylabel('t-SNE dimension 2', fontsize = 15)
ax.set_title('2 component t-SNE', fontsize = 20)
targets = ["H","M","J","S","D","T"]
colors = ['purple', 'green', 'black','yellow', 'blue', 'red']
for target, color in zip(targets,colors):
  indicesToKeep = tsneDF['attrib'] == target
  ax.scatter(tsneDF.loc[indicesToKeep, 'eigenvector 1'], 
             tsneDF.loc[indicesToKeep, 'eigenvector 2'],
             c = color, 
             s = 50)
ax.legend(targets)
ax.grid()

# Semantic extraction: TF-IDF and LDA Topic Modelling

# TF-IDF
In our WordCloud experiment, we have already seen that term frequency (how often a word appears) may not be the most helpful if we are trying to extract meaning. In that experiment we excluded stopwords (really frequent words) in order to get at words that are more indicative of what a text is about. **Term frequency-inverse document frequency** (TF-IDF) is a method for weighting the "important" words in a given document. 

The intuition behind TF-IDF is pretty straightforward. In the sample of five titles from the Federalist Papers below, the word 'the' is very common in general and it appears in every title at least once. This means that it is not likely to be a good indicator of the contents of the individual document. However, each title also has at least one word that appears in no other title (e.g., `'foreign,' 'territory,' 'considered,' 'department,' 'senate'`). Most of these (maybe not 'considered', or 'method' in Federalist 49) point to subject matter. 

|index|file\_name|title|
|---|---|---|
|2|federalist\_3|The Same Subject Continued \(Concerning Dangers From Foreign Force and|
|13|federalist\_14|Objections to the Proposed Constitution From Extent of Territory|
|41|federalist\_42|The Powers Conferred by the Constitution Further Considered|
|48|federalist\_49|Method of Guarding Against the Encroachments of Any One Department of|
|63|federalist\_64|The Powers of the Senate|

In the title of Federalist 3 (index 2 in the table), the **term frequency** of 'the' and of 'foreign' is the same: 1 out of 10, or 0.1.

**Document frequency** refers to the proportion of documents in which the word appears. For 'the' that is 5/5 or 1. For 'foreign', which appears only in the title of Federalist 3, the document frequency is 1/5 or 0.2.

**In principle**, for any given term in any given document we get the TF-IDF score by dividing the term frequency by the document frequency (multiplying it by the inverse of the document frequency). For 'territory' in the title of Federalist 3, TF-IDF is 0, since the term frequency of 'territory' in that text is 0:

>TF-IDF = 0/0.2 = 0

The word  'the' has a TF-IDF of:

>TF-IDF = 0.1/1 = 0.1

For 'foreign':
>TF-IDF = 0.1/0.2 = 0.5


By taking into account the document frequency of 'the' and 'foreign' we count 'foreign' as five times as "important" for this text as 'the.'

Note that **in practice** implementations of TF-IDF typically calculate the IDF with a formula based on the natural logarithm of the inverse document frequency, which dampens the weight given to rare words.
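
The arithmetic above can be reproduced in a few lines of plain Python. The titles are typed in from the table (lower-cased, punctuation removed), and the helper names are placeholders. The function `smoothed_idf` follows the log-based form given in the scikit-learn documentation for its default settings, as far as I can tell; scikit-learn also uses raw counts for the term frequency and normalizes each document's vector, so its scores will not match the "in principle" numbers.

```python
import math

# the five sample titles from the table above, lower-cased, punctuation removed
toy_titles = {
    3: "the same subject continued concerning dangers from foreign force and",
    14: "objections to the proposed constitution from extent of territory",
    42: "the powers conferred by the constitution further considered",
    49: "method of guarding against the encroachments of any one department of",
    64: "the powers of the senate",
}
toy_docs = {num: title.split() for num, title in toy_titles.items()}

def term_freq(term, num):
    """Share of the tokens in title `num` that are `term`."""
    return toy_docs[num].count(term) / len(toy_docs[num])

def doc_count(term):
    """Number of titles that contain `term` at least once."""
    return sum(term in toks for toks in toy_docs.values())

def simple_tfidf(term, num):
    """The 'in principle' score: term frequency / document frequency.
    Assumes `term` occurs in at least one title."""
    return term_freq(term, num) / (doc_count(term) / len(toy_docs))

def smoothed_idf(term):
    """Log-based IDF: ln((1 + n_docs) / (1 + doc_count)) + 1."""
    return math.log((1 + len(toy_docs)) / (1 + doc_count(term))) + 1

for term in ("the", "foreign", "territory"):
    print(term, simple_tfidf(term, 3), round(smoothed_idf(term), 3))
```

With the logarithm, 'foreign' still outweighs 'the', but by a factor of roughly two rather than five.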

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# the n most frequent features (words) across the corpus
# experiment with different values and see the results in the next cell

n = 1000

# set up the constraints for the structure that will
# represent our text as mathematical vectors
tfidf = TfidfVectorizer(
    min_df = 5,             # ONLY select words appearing in at least 5 docs
    max_df = 0.95,          # DO NOT select words appearing in more than 95% of docs
    max_features = n,       # the n most frequent words across the corpus
    stop_words = 'english'  # remove very common english words
)

# calculate the tf/idf values for each word in all the documents
# sklearn allows us to do both steps in one command

# `fit` calculates the df and idf values
# and the parameters to create a standardized array
tfidf.fit(dfPapers.text)

# transform creates a (sparse) array of values for each feature for each doc
text = tfidf.transform(dfPapers.text)


In [None]:
# run this cell to get the first 100 feature names (words) 
# based on different values of n
tfidf.get_feature_names_out()[:100]

In [None]:
dfTFIDF = pd.DataFrame(text.todense())
dfTFIDF.columns = tfidf.get_feature_names_out()
dfTFIDF

In [None]:
n = 10 # number of top keywords to return
dfPapers['tfidf_keywords'] = dfTFIDF.apply(lambda x: 
                                      ', '.join(x.nlargest(n).index.tolist()), 
                                      axis=1)
dfPapers[['num','title','tfidf_keywords']]

# LDA topic modeling
***To do***: rewrite exercise using sklearn.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# (imported by name so that the variable `text`, which holds the TF-IDF
# matrix from the previous section, is not overwritten)

# min_df: ignore words occurring in fewer than `n` documents 
#         (if decimal, proportion of docs)
# max_df: ignore words occurring in more than `m` documents
#         (if decimal, proportion of docs)
# stop_words: ignore very common words ("the", "and", "or", "to", ...)
vec = CountVectorizer(min_df=5, 
                      max_df=0.5, 
                      stop_words='english')
dtm = vec.fit_transform(dfPapers['text'])

print(f'Shape of document-term matrix: {dtm.shape}. '
      f'Number of tokens {dtm.sum()}')


In [None]:
# parameter estimation: using sklearn's default estimation method in the 
# class LatentDirichletAllocation: variational inference

# THIS CAN TAKE A WHILE

import sklearn.decomposition as decomposition

# define model
model = decomposition.LatentDirichletAllocation(
    n_components=10, learning_method='online', random_state=1)

# does this iterate over and fit each row of dtm?
# mixing weights for each document
document_topic_distributions = model.fit_transform(dtm)

In [None]:
#get array of "columns" (unique words)
vocabulary = vec.get_feature_names_out()

In [None]:
vocabulary

In [None]:
# create recognizable topic names
# add one to topic names to align with pyLDAvis (below)

topic_names = [f'Topic {str(k + 1)}' for k in range(10)]

# save topic word distributions and document topic distributions in separate dfs
topic_word_distributions = pd.DataFrame(
    model.components_, columns=vocabulary, index=topic_names)
document_topic_distributions = pd.DataFrame(
    document_topic_distributions, columns=topic_names, index=dfPapers.index)

document_topic_distributions.loc[9]

In [None]:
# sort to find the topics with most weight in a specific document
topic_dist_test = document_topic_distributions.loc[11]
topic_dist_test.sort_values(ascending=False)[:5]

In [None]:
# topic_word_distributions

In [None]:
# create df with topic names as index and joined top n words as content
# function to get the top n words sorted
def word_dist_for_topic(topic,n):
    """
    returns the top n words for specified topic
    """
    twd = topic_word_distributions.loc[topic].sort_values(ascending=False).head(n)
    return twd
    
n = 10 # number of words to return

# word_dist_for_topic = word_dist_for_topic(topic,n)
# topic_top_words_joined = ", ".join(word_dist_for_topic.index)
# print(topic, ': ',topic_top_words_joined)

topics_dict = dict()
for t in topic_names:
    topic_words = word_dist_for_topic(t,n)
    topic_top_words_joined = ", ".join(topic_words.index)
    topics_dict[t]=topic_top_words_joined

topics_df = pd.DataFrame(topics_dict.items(),columns=['topic','topic words'])  
# ('topic' stays an ordinary column; a later cell looks topic words up by it)
# pd.set_option('display.max_colwidth', None)
# # reset_option('display.max_colwidth') #remember to turn on and off
topics_df

In [None]:
# topics (and topic words) most closely associated with document 
df = pd.DataFrame(document_topic_distributions.idxmax(axis=1))
df['lda_keywords'] = df[0].apply(lambda x: 
                           topics_df.loc[topics_df['topic'] == x, 'topic words'].item())

df

In [None]:
dfPapers.join(df)[['num','title','tfidf_keywords',0,'lda_keywords']]

In [None]:
# documents most strongly associated with each topic
d = document_topic_distributions.T.eq(document_topic_distributions.T.max(axis=1), axis=0)
d.dot(d.columns)

# Vector Space