<a href="https://colab.research.google.com/github/jtallison/LDLFest-workshop/blob/master/workbook-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import pandas as pd

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 300

# Who’s Afraid of ChatGPT?

## A Gentle Introduction to Text Analytics

John Laudun  
Department of English  
University of Louisiana at Lafayette  
johnlaudun@gmail.com  
https://johnlaudun.net/  
@johnlaudun@hcommons.social  

All of today's materials, and this notebook, are available at: http://github.com/johnlaudun/workshop

## Workshop Agenda

1. Understanding what AI/ML are and how they are involved in text analytics
2. Doing some "analytics" with a few texts
3. Exploring the possibilities

## Focus on code as a form of inquiry 
## (*versus categorical thinking*)

![Books and Research](images/books.png)

## Artificial Intelligence, Machine Learning, & You

- Machine learning used to be, and sometimes still is (in math departments), called **statistical learning**.
- Most machine learning is a mix of statistical operations with **calculus** and **linear algebra**.

### What that looks like:

In [None]:
# our "data"
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# scatterplot
plt.plot(x, y, 'o')

In [None]:
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, 'o')
plt.plot(x, m*x+b, color='red')

In [None]:
plt.plot(x, y, '-o')

In [None]:
plt.plot(x, m*x+b, color='red')
plt.plot(x, y, '-o')
plt.plot(6.5, 17, 'g*')

In [None]:
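# Fit a degree-5 polynomial to the same points
poly = np.polyfit(x, y, deg=5)
plt.plot(x, y, 'o')
# Plot the fitted curve against x (not the default 0-based index)
plt.plot(x, np.polyval(poly, x))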

In [None]:
print(f""" \n
y = {m:.2f} x  + {b:.2f}\n\n\n
versus \n\n\n
{np.poly1d(poly)}\n """)

## The Math Involved

- **Linear Algebra** to handle lots of "dimensions"
- **Statistics** and **calculus** to determine possible relationships between the dimensions.
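
To make the division of labor concrete, here is a minimal sketch that refits the straight line from the scatterplot above as a linear algebra problem. The data are the same `x` and `y` as before; the names `A`, `m_ls`, `b_ls`, `m_pf`, and `b_pf` are introduced only for this illustration. Least squares picks the slope and intercept at which the derivative of the squared error is zero (the calculus), and `np.linalg.lstsq` solves the resulting system (the linear algebra).

```python
import numpy as np

# The same "data" used in the scatterplot above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
y = np.array([13, 14, 17, 12, 23, 24, 25, 25, 24, 28, 32, 33])

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x, np.ones(len(x))])

# Solve for the slope and intercept that minimize the squared error
(m_ls, b_ls), *_ = np.linalg.lstsq(A, y, rcond=None)

# Compare with the polyfit call used earlier
m_pf, b_pf = np.polyfit(x, y, 1)
print(f"least squares: y = {m_ls:.2f} x + {b_ls:.2f}")
print(f"polyfit:       y = {m_pf:.2f} x + {b_pf:.2f}")
```

Both routes should print the same line, which suggests that "fitting a model" here is ordinary matrix arithmetic.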

## Dimensions?!

```
Text 1 = "Mary had a little lamb whose fleece was white as snow."
Text 2 = "And everywhere that Mary went, the lamb was sure to go."

t1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
t2 = [12, 13, 14, 1, 15, 16, 5, 8, 17, 18, 19]
```
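
The numbering above can be reproduced in a few lines. This is a minimal sketch, not necessarily how the lists were produced: `tokenize`, `encode`, and `vocabulary` are names made up for illustration, and the tokenizing rule is borrowed from the regular expression this notebook uses to split the poem into words.

```python
import re

text_1 = "Mary had a little lamb whose fleece was white as snow."
text_2 = "And everywhere that Mary went, the lamb was sure to go."

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.sub("[^a-zA-Z']", " ", text).lower().split()

vocabulary = {}  # maps each word to the ID it received when first seen

def encode(text):
    """Replace each word with its ID, assigning new IDs as new words appear."""
    ids = []
    for word in tokenize(text):
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary) + 1
        ids.append(vocabulary[word])
    return ids

t1 = encode(text_1)
t2 = encode(text_2)
print(t1)
print(t2)
print(f"{len(vocabulary)} unique words across both texts")
```

Every unique word becomes its own dimension, so even two short sentences already need a vocabulary larger than either sentence.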

But the full poem has 12 lines, so the resulting matrix for the whole poem would be ...

In [None]:
# First we load our file into a string (the with-block closes the file for us)
with open('texts/marys-lamb.txt', 'r') as the_file:
    lamb = the_file.read()

# Then we turn that string into a list of words
lamb_words = re.sub("[^a-zA-Z']"," ", lamb).lower().split()

# Now let's determine the unique words & count them
unique_words = set(lamb_words)
print(f"There are {len(unique_words)} unique words in this text.\n")
print(f"Those words are: {unique_words}.\n")
print("The matrix for this text would be 12 x 96.")

Imagine a table 12 rows deep by 96 columns wide. Most scientific disciplines make each example, or observation, a row and each feature a column. Here each line of the poem is a row and each word is a column:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

with open('texts/marys-lamb.txt', 'r') as the_file:
    lines = the_file.readlines()

vectors = vectorizer.fit_transform(lines)
# One row per line of the poem, one column per word
td = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
td.head()

This is where linear algebra comes into play: we can factor this matrix, called the **term-document matrix**, into two smaller matrices in a way that, potentially, reveals key word clusters. (This is one form of *topic modeling*.)
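
As a rough illustration of what "factoring" means, the sketch below splits a tiny count matrix into two smaller matrices with a truncated singular value decomposition. Everything in it is an assumption made for the example: the counts in `X` and the words in `terms` are invented toy data rather than the poem, and SVD is only one of several factorizations used for this purpose (topic modeling tools often use other methods).

```python
import numpy as np

# Hypothetical toy counts: 4 short "documents" (rows) by 6 words (columns)
terms = ["lamb", "fleece", "snow", "school", "teacher", "rule"]
X = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [1, 0, 0, 1, 1, 0],
], dtype=float)

# Full singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest components to get two smaller matrices
k = 2
W = U[:, :k] * s[:k]   # documents x components
H = Vt[:k, :]          # components x words

print("W shape:", W.shape, "H shape:", H.shape)

# The words with the largest weights (by magnitude) in each component
for i, row in enumerate(H):
    top = np.argsort(-np.abs(row))[:3]
    print(f"component {i}:", [terms[j] for j in top])

# How much of the original matrix the two factors fail to capture
error = np.linalg.norm(X - W @ H)
print(f"reconstruction error: {error:.3f}")
```

Each row of `H` weights the words that tend to occur together, and each row of `W` says how strongly a document draws on each of those word groups.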

In [None]:
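# Transpose so that each row is a word and each column is a line
tdm = td.T
# Sum across the lines to get each word's total count
tdm['total_count'] = tdm.sum(axis=1)
# Keep the 20 most frequent words and plot their counts
tdm = tdm.sort_values(by='total_count', ascending=False)[:20]
tdm['total_count'].plot.bar()
# tdm.head()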

## But are all the words worth counting?