# <center>Critical AI</center>
<center>ENGL 54.41</center>
<center>Dartmouth College</center>
<center>Fall 2024</center>
<pre>Created: 08/23/2019; Revised: 09/18/2024</pre>

In [None]:
# This is a Jupyter code cell. This line is a comment;
# comments are not executed by the interpreter.
 
# Here we assign the value 'Critical AI' to the variable 'course_title'.
# Python infers the type from the value, so 'course_title' is automatically a string (str).
course_title = 'Critical AI'

In [None]:
# To see the type assigned, we can use the type() function:
type(course_title)

In [None]:
# To see or display the value of a String (especially a short one), we can use the print() function:
print(course_title)

## Exploring Datatypes: Strings 
 
You can query the Python interpreter for documentation on what can be done with a 
particular datatype using the help() function. Just as you can learn what can be 
done with the print() function by executing help(print), you can ask for some basic 
documentation on the string datatype with help(str), which gives us help for the class str (string). 
Try that below by creating a new cell and executing that function.

In [None]:
# Strings enable us to store and manipulate data, especially text-based data.
# They are sequences of characters and they are perfect for the manipulation of text.
# We can easily access individual characters by indexing the string by character position:

print(course_title[0])
print(course_title[1])
print(course_title[2])
print(course_title[3])
print(course_title[4])
print(course_title[5])
print(course_title[6])
print(course_title[7])

In [None]:
# But that was a silly way of obtaining those characters:
print(course_title[:8])

# Here we've just exposed something potentially confusing: strings, as sequences, begin at
# index 0, so the first character is element zero. A slice such as [:8] stops just
# before index 8, which gives us indices 0 through 7, that is,
# the first eight characters.
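
A minimal sketch of how zero-based indexing and slicing fit together. The string `book_title` is a hypothetical example, not a variable used elsewhere in this notebook. A slice `[start:stop]` includes `start` and stops just before `stop`, so its length is `stop - start`.

```python
# Hypothetical example string, chosen only for illustration.
book_title = 'Machine Learners'

# Indexing starts at 0, so index 0 is the first character.
first_char = book_title[0]

# Negative indices count backwards from the end of the string.
last_char = book_title[-1]

# A slice [start:stop] includes start but excludes stop.
first_seven = book_title[:7]   # indices 0 through 6
middle = book_title[3:7]       # indices 3 through 6

print(first_char, last_char)
print(first_seven, len(first_seven))
print(middle, len(middle))
```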

## Try It: Print the second word of the course title
In the cell below, print just the second word of the course title

## More String Methods
There is a *lot* more that you can do with strings. Review the help(str) documentation for a full list and examples. We'll just quickly look at a few examples now.

In [None]:
# There are so many methods that we can perform on strings. 

# Here's a longer string:
sentence = """To care for poetry in this way does not make one a poet, but it 
does make one feel blessedly rich, and quite indifferent to many things which
are usually looked upon as desirable possessions"""

# We can count the occurrences of a substring.
sentence.count("one")
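
One detail worth keeping in mind: `count()` matches substrings, not whole words. The sketch below uses a hypothetical string (`phrase`) to show the difference. Splitting on whitespace is only one simple way to approximate word counting and ignores punctuation.

```python
# Hypothetical example string, chosen only for illustration.
phrase = "one stone alone"

# str.count() counts non-overlapping substring matches,
# so "stone" and "alone" each contribute a match.
substring_count = phrase.count("one")

# Splitting on whitespace first counts whole words instead.
word_count = phrase.split().count("one")

print(substring_count, word_count)
```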

In [None]:
# We can change the case of strings.
sentence.upper()

## Try It: Make it all lowercase
In the cell below, display this variable in lowercase

In [None]:
# We can remove those newline characters represented as '\n'
sentence.replace('\n',' ')

In [None]:
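# Notice that strings are immutable: the methods above returned new strings and left
# 'sentence' unchanged. Trying to modify a character in place raises a TypeError.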

In [None]:
sentence[0] = "Z"
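
The assignment above fails because strings do not support item assignment. Here is a small sketch, using a hypothetical string (`text`), that catches the error and then builds a new string instead of changing the old one.

```python
# Hypothetical example string, chosen only for illustration.
text = "care for poetry"

try:
    text[0] = "Z"   # strings do not support item assignment
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {error_message}")

# Building a new string works; the original is left untouched.
modified = "Z" + text[1:]
print(modified)
print(text)
```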

In [None]:
# We can, however, assign a new string to the same variable, replacing its present content:
sentence = sentence.replace('\n',' ')
print(sentence)

## Exploring Datatypes: Lists 


In [None]:
# Lists store values. They can combine different datatypes. They can even contain other lists.
# Lists can be created easily enough and are trivially modified.

# Here is a simple list of numbers
example_list = [1,2,3,4,5]

# You can also do that with range()
example_list = list(range(1,6))

example_list

In [None]:
# But what about using strings in lists?
example_list = ["one","two","three","four","five","six"]

# display the first item in our list
example_list[0]

## Try It: Add items to a list
Add another word to the list (example_list) with the append method:

In [None]:
# Display the result:
example_list

In [None]:
# Iteration is simple, but list comprehension--that's cool!
[w.upper() for w in example_list]

In [None]:
# This did not actually modify the list. You could overwrite the 
# entire list with variable assignment or modify item by item.

for i, w in enumerate([w.upper() for w in example_list]):
    example_list[i] = w

# The enumerate() function yields each item of a list paired with its index, which is really handy.
example_list
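
To make the behavior of `enumerate()` concrete, here is a short sketch on a hypothetical list (`words`). It also shows slice assignment, one alternative way to overwrite a list's contents in place. This is an illustration, not the only idiom.

```python
# Hypothetical list, chosen only for illustration.
words = ["one", "two", "three"]

# enumerate() yields (index, item) pairs.
pairs = list(enumerate(words))
print(pairs)

# Slice assignment replaces the contents of the same list object in place.
same_list = words
words[:] = [w.upper() for w in words]
print(words)
print(same_list is words)
```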

## Part II: Vectors, Matrices, and Tensors

In [None]:
# Python supports a very large number of libraries that can be loaded
# or imported as needed. We only import the libraries that we need in 
# order to reduce the memory requirements and (possibly) prevent 
# collisions in the namespace used by various functions.

import numpy as np
import torch

In [None]:
# PyTorch (torch) is a very popular library for building neural networks and 
# for deep learning. https://pytorch.org/
# 
# It is widely used in industry and research for the kinds of AI & ML technologies
# that we will be studying this year. 
# 
# It introduces a new datatype called a tensor. Tensors are similar to the
# arrays and matrices used by NumPy (numpy) but are designed to also run on faster
# processing devices called GPUs (graphics processing units) that we will
# hopefully be using later this term. They can also keep a record, some history,
# of the transformations that created them. 

![Pytorch Tensor Types](../img/pytorch-tensor-types.png)

PyTorch tensor types, from Eli Stevens et al., *Deep Learning with PyTorch* (Manning, 2020)

## Vectors and Vectorization

The sociologist Adrian Mackenzie writes in [*Machine Learners: Archaeology of a Data Practice*](https://mitpress.mit.edu/author/adrian-mackenzie-8915/) (MIT Press, 2017) of the function of vectorization as a remapping of space:

"Machine learning locates data practice in an expanding epistemic space. The expansion
derives, I will suggest, from a specific operational diagram that maps data into a vector
space. It vectorizes data according to axes, coordinates, and scales. Machine learners, in
turn, inhabit a vectorized space, and their operations vectorize data...Often data are represented as a homogenous set of numbers or a continuous flowing stream. We need, however, to archaeologically examine some of the transformations that allow different shapes and densities of data, whether in the form of numbers,
words, or images, to become machine learnable. Data in their local complexes space
out in many different density shapes, depending on how the changes, signals, propensities,
and norms have been generated or configured."

In [None]:
# A vector is typically thought of as a one-dimensional sequence of values, not unlike a list.
#
# This variable is a list of floating point numbers. What's a floating point number? It's a numerical value 
# with a fractional part, unlike an integer (for example, the int 4 vs. the float 4.3). 
vec = [4.3, 3.0, 1.1, 0.1]
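
A brief sketch of the int/float distinction. The variable names are hypothetical. It also shows that floats are stored as binary approximations, which is why comparisons are usually made within a tolerance.

```python
import math

whole = 4          # an int
fractional = 4.3   # a float

print(type(whole).__name__, type(fractional).__name__)

# Floats are binary approximations, so some decimal sums are not exact.
total = 0.1 + 0.2
exactly_equal = (total == 0.3)
print(exactly_equal)

# math.isclose() compares floats within a tolerance instead.
close_enough = math.isclose(total, 0.3)
print(close_enough)
```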

In [None]:
# If we try to perform an operation on every element of the list, we will 
# most likely not get the result we want:
vec * 3
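
Why is that result unexpected? For a plain Python list, `*` means repetition rather than arithmetic. The sketch below uses a hypothetical variable (`values`) holding the same numbers and contrasts repetition with an explicit element-by-element calculation.

```python
values = [4.3, 3.0, 1.1, 0.1]   # same numbers as the list above

# Multiplying a list by an integer repeats the list.
repeated = values * 3
print(len(repeated))

# An element-by-element result needs an explicit loop or comprehension.
scaled = [v * 3 for v in values]
print(scaled)
```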

In [None]:
# Converting this list to an array (and treating it as a vector) enables us to apply a 
# transformation to the entire vector at once:
vec = np.array(vec)
vec * 3

In [None]:
# We can do other fun stuff with this vector. For example, here are some basic 
# summary statistics:
vec.mean()

In [None]:
vec.min()

In [None]:
vec.max()

In [None]:
# Now display all these at once:
vec_mean = vec.mean()
vec_min = vec.min()
vec_max = vec.max()
print(f'mean: {vec_mean}, min: {vec_min}, max: {vec_max}')

In [None]:
# We are now going to use PyTorch tensors rather than NumPy arrays. This is because
# this datatype is especially good for the sort of work we are going to do in
# Critical AI.

vec1 = torch.tensor([4.3,3.0,1.1,0.1])
vec2 = torch.tensor([6.3,2.8,5.1,1.5])

In [None]:
# Let's say that these are representations of two kinds of flowers (because they are).
# The values, let's call them features, are measures of the length and width of two 
# types of flower appendages (sepal and petal). 
#
# How might we answer the question of how similar are these two flowers?
#
# One way might be to find the difference across all four feature dimensions. We can
# take the absolute value of that difference to get a sense of how similar these two
# samples are to each other:
torch.abs(vec1 - vec2)

In [None]:
# We can also combine these 1D vectors into a 2D tensor (a matrix):
matrix = torch.vstack([vec1,vec2])
matrix

In [None]:
# Tell us about this matrix: what is its shape? How many rows and columns do we have?
matrix.shape

In [None]:
# Display the mean value of each of the four feature dimensions of the matrix:
matrix.mean(axis=0)

In [None]:
# Display the standard deviation of each of the four feature dimensions of the matrix:
matrix.std(axis=0)
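
A note on what "standard deviation" means here: by default, PyTorch's `std` applies Bessel's correction (it divides by n - 1), while NumPy's `std` divides by n, so the two libraries can return different numbers for the same data. The sketch below illustrates the two definitions with Python's standard `statistics` module on the first feature of our two vectors. The variable names are hypothetical.

```python
import statistics

# First feature of the two flower vectors used above.
column = [4.3, 6.3]

sample_sd = statistics.stdev(column)        # divides by n - 1
population_sd = statistics.pstdev(column)   # divides by n

print(sample_sd, population_sd)
```

With only two samples the gap between the two definitions is large; it shrinks as the number of rows grows.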

## A Better Way: The Distance Matrix

Reconceptualizing our data as features in a standardized space (via vectorization) allows us to measure distances between points, where each point is a multidimensional value.

In [None]:
# Euclidean distance is the length of a straight line between two points
# https://en.wikipedia.org/wiki/Euclidean_distance
from sklearn.metrics import euclidean_distances

# Cosine similarity is the cosine of the angle between two vectors
# https://en.wikipedia.org/wiki/Cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# This will look a bit funny, but here we are measuring the distance
# on a straight line between 4 and 10. We are seeing these distances
# as pairs. The first row displays the distance between 4 and 4
# and then between 4 and 10. The second row displays the distance
# between 10 and 4 and then between 10 and 10.
#
# This is known as the distance matrix. As we add values, we can compare 
# distances among all the rows and columns. The top and bottom triangles 
# mirror each other and are separated by the diagonal, which measures the
# distance between each item and itself and is therefore all zeros.
#
euclidean_distances([[4],[10]])
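
The structure of a distance matrix can be sketched without any library. The example below assumes three one-dimensional points: 4 and 10 match the cell above, and 7 is a hypothetical addition. It builds the matrix with a nested list comprehension.

```python
# 4 and 10 match the cell above; 7 is added only for illustration.
points = [4, 10, 7]

# Entry [i][j] holds the distance between point i and point j.
dist_matrix = [[abs(a - b) for b in points] for a in points]

for row in dist_matrix:
    print(row)
```

Each row and column corresponds to one point, so the matrix is square, symmetric, and zero along the diagonal.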

In [None]:
# Here is the distance matrix for our separate vectors:
euclidean_distances([vec1,vec2])

In [None]:
# Now processing the matrix composed of those two stacked vectors:
euclidean_distances(matrix)
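
The off-diagonal value in these matrices can be checked by hand. The sketch below uses plain lists with the same numbers as `vec1` and `vec2` (the names `a` and `b` are hypothetical). It applies the usual Euclidean formula and compares the result with `math.dist()` from the standard library.

```python
import math

# Same values as vec1 and vec2 above, written as plain lists.
a = [4.3, 3.0, 1.1, 0.1]
b = [6.3, 2.8, 5.1, 1.5]

# Square each per-feature difference, sum them, and take the square root.
by_hand = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# math.dist() (Python 3.8+) computes the same quantity.
built_in = math.dist(a, b)

print(by_hand, built_in)
```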

In [None]:
# Observe the differences with cosine similarity:
cosine_similarity([vec1,vec2])

In [None]:
cosine_similarity(matrix)
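
One way to see how the two measures differ: cosine similarity depends only on the direction of the vectors, while Euclidean distance also depends on their length. The sketch below defines a hypothetical helper, `cosine()`, in plain Python. It compares our two flower vectors and then compares the first vector with a tripled copy of itself.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Same values as vec1 and vec2 above, written as plain lists.
a = [4.3, 3.0, 1.1, 0.1]
b = [6.3, 2.8, 5.1, 1.5]
tripled = [3 * x for x in a]   # same direction as a, three times the length

print(cosine(a, b))
print(cosine(a, tripled))      # very close to 1: the direction is unchanged
print(math.dist(a, tripled))   # clearly non-zero: the points are far apart
```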

## Adding Real Data

We are going to look at and use a very well known dataset in machine learning from Ronald A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” *Annals of Eugenics* 7, no. 2 (1936): 179–88. This dataset contains measurements of 150 Iris flowers, fifty samples each from three different kinds (classes, in the language of machine learning) using four measurements:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm

What are these measurements or features? Why these?

![Museum of Natural History](../img/parts-of-a-flower_full_amnh.png)

"Parts of a Flower," American Museum of Natural History
https://www.amnh.org/learn-teach/curriculum-collections/biodiversity-counts/plant-identification/plant-morphology/parts-of-a-flower

The three classes represented in the Fisher Iris dataset are: 
1. Iris Setosa
2. Iris Versicolour
3. Iris Virginica
      

In [None]:
# We have a local copy of the dataset and we'll read it with a Python
# package called Pandas, assigning short names to each of the columns
# that represent the four features.

import pandas as pd
df = pd.read_csv('../data/iris.data',names=["sl","sw","pl","pw","class"])

In [None]:
df

In [None]:
# Drop the class column for now and convert the four feature columns into a matrix
matrix = torch.tensor(df.iloc[:, : 4].to_numpy())

In [None]:
# mean values?
matrix.mean(axis=0)

In [None]:
# standard deviation?
matrix.std(axis=0)

## The dataset curators tell us a little more about the dataset here:

8. Missing Attribute Values: None
<pre>
Summary Statistics:
                 Min  Max   Mean    SD   Class Correlation
   sepal length: 4.3  7.9   5.84  0.83    0.7826   
    sepal width: 2.0  4.4   3.05  0.43   -0.4194
   petal length: 1.0  6.9   3.76  1.76    0.9490  (high!)
    petal width: 0.1  2.5   1.20  0.76    0.9565  (high!)
</pre>

9. Class Distribution: 33.3% for each of 3 classes.

In [None]:
# Now let's look at the distance matrix. How does this look?
euclidean_distances(matrix)

In [None]:
# But there are better ways to view this!

# assign dist variable to our distance matrix
dist = euclidean_distances(matrix)

# import what we need to visualize
import matplotlib.pyplot as plt
%matplotlib inline

# show it!
plt.imshow(dist)
plt.show()

In [None]:
# We'll create a cosine DISsimilarity plot by subtracting the similarity from 1
# (making similar items closer to 0 rather than 1)

dist = 1 - cosine_similarity(matrix)

# show it!
plt.imshow(dist)
plt.show()