# Lab 01: Setup computing environment and introduction

Welcome to ANLY 580!

We'll use *Jupyter Notebooks* to write and run our Python code. In Python, a *package* is a collection of Python files (called *modules*) containing functions, classes, and other objects that can be installed and used. A Jupyter Notebook is a convenient, web-based document made up of cells that run Python code within a specified environment. An *environment* is a collection of packages pinned to specific versions. Environments make it easy to manage software versions: you can maintain several separate environments at once, and others can reproduce your Python ecosystem exactly.

The Notebook page is not static; it's interactive and allows you to execute code in your browser. Although in practice we typically organize projects in Git repositories containing Python modules and packages, Jupyter Notebooks are well-suited for certain programming tasks, pedagogical purposes, and integrating executable code with other media (images, LaTeX, and more). For example, cells can be executed out-of-order and plots can be displayed in-line. Crucially, with Google Colaboratory, you can also request an environment that leverages GPUs, which are critical to modern deep learning.

Often, the packages of interest in AI software development include some subset of the following:
- *matplotlib* and *plotly*: for plotting data
- *numpy*: for efficient manipulation of arrays
- *pandas*: for working with tabular data
- *pytorch* and *tensorflow*: two popular deep learning frameworks
- *sklearn*: for statistical learning (e.g., regression, classification, and others)
- *spacy*: for natural language processing (NLP) experimentation
- *transformers*: for cutting-edge deep learning-based NLP models

In this class, we'll be managing our compute packages and environments using the *Anaconda distribution*, which contains some of the most common packages for data science software development.

Follow the instructions in `computing-setup.md` to create an environment for this class.
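
Once the environment is active, one way to confirm which package versions it contains is to query the installed distribution metadata. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_versions` and the package names passed to it are illustrative assumptions, not part of `computing-setup.md`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical check; distribution names can differ from import names
# (for example, the sklearn module is distributed as scikit-learn).
report = installed_versions(["numpy", "scikit-learn", "no-such-package-xyz"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A `None` entry for a package you expected to find usually means the wrong environment is active.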

Let's get started!

**Problem 1.** Run this block of code to ensure your dependencies are installed.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow
import torch

In [2]:
#%matplotlib inline

**Problem 2.** For this problem, use the text of Moby Dick (`moby_dick.txt`) throughout.

(A) Write a function `count_words(infile)` that takes a file as input and returns the total number of words in the file.

(B) Modify your `count_words` function to accept a second argument `unique` that, if true, returns the number of unique words in the file. If `unique` is false, its behavior should be unchanged from Part (A).

(C) Within a function `get_ranks_and_frequencies(infile)`, use the `Counter` data structure from the `collections` module to compute word ranks and frequencies for every word in the text. The function should output a list that stores tuples, where the first element of each tuple is the ranking of the word (counting one-up, so the word that is most frequent receives rank=1, the word that is second most frequent receives rank=2, and so on) and the second element is the frequency value.

(D) With descending order of frequency, plot the logarithm of the frequency versus the logarithm of the rank for each word. What rough shape is the graph? 
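
Part (C) relies on `Counter`, so it may help to see it on a toy input first. The following is a minimal sketch, assuming a short made-up sentence in place of the novel; the variable names are illustrative only.

```python
from collections import Counter

# Toy input standing in for the novel's word list
words = "the whale and the sea and the ship".split()

counts = Counter(words)
# most_common() returns (word, count) pairs sorted by descending count
ranked = counts.most_common()

# Pair each count with a one-up rank (most frequent word gets rank 1)
ranks_and_freqs = [(rank, count) for rank, (_, count) in enumerate(ranked, start=1)]
print(ranks_and_freqs)
```

Words with equal counts still receive distinct, consecutive ranks here, which matches the one-up ranking described in Part (C).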

In [3]:
# Part (A)
import re

def count_words(infile):
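    # Read the whole file; the with-block closes it automatically
    with open(infile, "r") as f:
        content = f.read()
    # Treat each maximal run of ASCII letters as one word
    word_list = re.findall(r'[a-zA-Z]+', content)
    return len(word_list)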

print(count_words('moby_dick.txt'))

220481


In [4]:
# Part (B)

def count_words(infile, unique=False):
    # unique defaults to False so count_words(infile) behaves as in Part (A)
    with open(infile, "r") as f:
        content = f.read()
    word_list = re.findall(r'[a-zA-Z]+', content)
    if unique:
        # A set keeps one copy of each distinct (case-sensitive) word
        return len(set(word_list))
    return len(word_list)

print(count_words('moby_dick.txt',True))

19319


In [5]:
# Part (C)

from collections import Counter


def get_ranks_and_frequencies(infile):
    with open(infile, "r") as f:
        content = f.read()
    word_list = re.findall(r'[a-zA-Z]+', content)
    stat = Counter(word_list)
    # Sort the counts from most to least frequent
    freq = sorted(stat.values(), reverse=True)
    # Ranks count up from 1, so the most frequent word gets rank 1
    rank_and_freq = [(i + 1, count) for i, count in enumerate(freq)]
    return rank_and_freq

print(get_ranks_and_frequencies('moby_dick.txt'))

[(1, 13784), (2, 6597), (3, 6065), (4, 4605), (5, 4595), (6, 3933), (7, 2992), (8, 2459), (9, 2227), (10, 2125), (11, 1746), (12, 1716), (13, 1666), (14, 1661), (15, 1635), (16, 1627), (17, 1470), (18, 1434), (19, 1311), (20, 1243), (21, 1161), (22, 1115), (23, 1110), (24, 1067), (25, 1058), (26, 1042), (27, 1019), (28, 925), (29, 906), (30, 901), (31, 883), (32, 767), (33, 763), (34, 734), (35, 716), (36, 706), (37, 681), (38, 648), (39, 642), (40, 628), (41, 625), (42, 625), (43, 612), (44, 598), (45, 587), (46, 586), (47, 580), (48, 573), (49, 564), (50, 554), (51, 538), (52, 529), (53, 521), (54, 508), (55, 507), (56, 507), (57, 506), (58, 501), (59, 488), (60, 471), (61, 460), (62, 443), (63, 436), (64, 433), (65, 432), (66, 428), (67, 423), (68, 422), (69, 415), (70, 405), (71, 387), (72, 384), (73, 375), (74, 369), (75, 364), (76, 364), (77, 340), (78, 335), (79, 335), (80, 335), (81, 330), (82, 329), (83, 328), (84, 320), (85, 319), (86, 312), (87, 311), (88, 311), (89, 308), (

In [6]:
# Part (D)
ranks_and_frequencies = get_ranks_and_frequencies('moby_dick.txt')

# Your code goes here

In [7]:
rank = [k[0] for k in ranks_and_frequencies]
freq = [k[1] for k in ranks_and_frequencies]

In [8]:
log_rank = [np.log(k[0]) for k in ranks_and_frequencies]
log_freq = [np.log(k[1]) for k in ranks_and_frequencies]
#plt.plot(log_rank, log_freq)
#plt.title("Logarithm of Frequency vs. Logarithm of Rank")
#plt.xlabel("log rank")
#plt.ylabel("log freq")
#plt.show()

In [None]:
plt.plot(np.array([0,1]),np.array([0,1]))

In [None]:
plt.plot(log_rank[0:50],log_freq[0:50])

**Problem 3.** In this problem, we'll work with *regular expressions*, which are specifications of matching string patterns. We'll work using the introductory content in Chapter 1 of *Moby Dick*. This is short enough for you to look at the text before and after specific operations are performed. Follow through the basic examples to see how you can match against specified patterns and process the text from there in useful ways. You should review the `re` module from the Python documentation; here, we will survey just a few examples.

(A) After experimenting with `re.sub`, use `re.compile` and `re.search` to write a function that checks whether its input is a valid Georgetown NetID. For this problem, let's assume that a valid ID consists of 2-4 lowercase letters followed by no more than 4 digits. For instance, "abc123" is a valid NetID but "ab12345" and "x36" are not.
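
Before writing the NetID checker, it may help to see `re.compile` and `re.search` applied to a different, made-up format. The sketch below assumes a hypothetical "course code" of four uppercase letters, one space, and three digits; `COURSE_CODE` and `is_course_code` are illustrative names only.

```python
import re

# Hypothetical format: four uppercase letters, one space, three digits
COURSE_CODE = re.compile(r"^[A-Z]{4} \d{3}$")


def is_course_code(s):
    # ^ and $ anchor the pattern to the start and end of the string,
    # so extra characters on either side prevent a match
    return COURSE_CODE.search(s) is not None


print(is_course_code("ANLY 580"))   # True
print(is_course_code("ANLY 5800"))  # False
print(is_course_code("anly 580"))   # False
```

Without the anchors, `re.search` would accept any string that merely contains a matching substring. Note that `$` still tolerates a single trailing newline; `re.fullmatch` is stricter if that matters.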

In [None]:
import re

text = "CHAPTER 1. Loomings. Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

# \d is a special symbol that matches digits [0-9]
# the + symbol means to match one or more of the preceding element
# this substitution matches digits, replaces them with the empty string
# i.e. omits them, and is applied over the text variable
re.sub(r"\d+", "", text)


# You can specify your own character classes, too
# Here we replace semicolons, periods, commas, and apostrophes with a space
# (inside [...] these characters need no backslash escapes)
re.sub(r"[;.,']", " ", text)


def is_valid_netid(s):
    # Your code goes here
    pass

**Problem 4.** Let's take a brief look at the `NumPy` library. The basic object in the library is the multidimensional array, where each dimension is called an *axis*. All of its elements are of the same type and usually hold numeric data. NumPy arrays are typed as `ndarray`.

You can create an array in various ways. For instance, the `array` function can take a Python list and convert it into an `ndarray`:

In [None]:
a = np.array([1, 2, 3, 4]) #1d array
b = np.array([[1.0, 2, 3, 4], [5, 6, 7, 8]]) #2d array
c = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]) #2d array

print("shape of a", a.shape) # an array with one axis of length 4
print("shape of b", b.shape) # an array with two axes; the first of length 2, the second of length 4
print("shape of c", c.shape) # an array with two axes; the first of length 3, the second of length 4

print("a.dtype", a.dtype) # integer
print("b.dtype", b.dtype) # float

Indexing operations behave as expected:

In [None]:
print("b[0, 0]", b[0, 0]) # fetch the value at the 0th position of the 1st axis and the 0th position of the 2nd axis
print("b[-1, -1]", b[-1, -1]) # fetch the value at the last position of the 1st axis and the last position of the 2nd axis

b[1,1] = 10 # updates array value; set the value at index [1,1] to 10
print("b", b)

print("b[:, 1]", b[:, 1]) # fetch all the values along the 1st position in the 2nd axis (i.e., the second column of the matrix)

print("b[-1]", b[-1]) # equivalent to b[-1, :] and b[-1, ...] i.e. the last row
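
One detail worth knowing about the indexing above: basic indexing and slicing return a *view* onto the original data rather than a copy. The following is a small sketch with a made-up array `m`, assuming only standard NumPy behavior.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

row_view = m[0]          # basic indexing returns a view onto m's data
row_view[0] = 99         # so this write is visible through m
print(m[0, 0])

row_copy = m[1].copy()   # an explicit copy owns its own data
row_copy[0] = -1         # so this write leaves m untouched
print(m[1, 0])
```

Calling `.copy()` is the usual way to get an independent array when you intend to modify a slice.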

Here is a brief survey of some typical functionalities:

In [None]:
a = np.arange(10) # an array of 10 integers, 0-9 (analogous to Python's range)
print("a", a)
print("a reshaped", a.reshape(2, 5)) # reshape array
print("a", a) # original array
print("a dim", a.ndim) # 1 axis: reshape returned a new array and left a unchanged

print("zeros", np.zeros((2, 3))) # an array of all 0s
print("ones", np.ones((2, 3))) # an array of all 1s

np.random.randn(3, 4) # the random module creates ndarrays of random values; randn takes each dimension as a separate argument

print("linspace", np.linspace(0, 1, 100)) # an array with 100 evenly-spaced numbers, from 0 to 1
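
Because `reshape` returns a new array object instead of modifying the original in place, it is easy to lose track of which array has which shape. A minimal sketch, with `r` as an illustrative name:

```python
import numpy as np

a = np.arange(10)
r = a.reshape(2, -1)    # -1 asks NumPy to infer that axis (here, 5)

print(a.shape, a.ndim)  # a itself is unchanged: one axis of length 10
print(r.shape, r.ndim)  # r has two axes: 2 rows, 5 columns
```

To keep the reshaped result, assign it to a name as above.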

Here are some additional examples. Take time to understand them, and feel free to test each one.

In [None]:
A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3,4]])

print("A", A)
print("Hadamard product", A*B) # Hadamard/element-wise product

print("matrix product", A@B) # matrix product

print("matrix product", A.dot(B)) # matrix product

print("sum", A.sum()) # sum over all elements
print("min", A.min()) # min over all elements
print("max", A.max()) # max over all elements


print("sum over axis 0", A.sum(axis=0)) # sum of each column: axis 0 (the rows) is collapsed

print("sum over axis 1", A.sum(axis=1)) # sum of each row: axis 1 (the columns) is collapsed

print("concat", np.concatenate((A, B), axis=0)) # concatenate along rows
print("concat shape", np.concatenate((A, B), axis=0).shape)

q1 = np.empty((3, 4)) # allocates a (3,4) array without initializing it, so its values are arbitrary
q2 = np.empty((3, 4))
print("stack", np.stack((q1, q2))) # stack two arrays of the same shape to create a (2, 3, 4) array

B = np.arange(3)
print("exponential applied to each element", np.exp(B)) # exponential function applied to each element

**Problem 5.** Explain in words what the following code does.

(A) 

```
C = np.arange(24).reshape(2,3,4)
C.sum(axis=0)
```

(B) 

```
C = np.arange(24).reshape(2,3,4)
C.sum(axis=1)
```

(C)

```
C = np.arange(24).reshape(2,3,4)
C.sum(axis=2)
```

In [None]:
# Your explanation goes here

**Problem 6.** Complete the function `linear_transformation()` by implementing a matrix multiplication between the data matrix `X`, of size `M x N`, and the matrix `W`, of size `N x N` (in the test below, an identity matrix).

In [None]:
def linear_transformation(X, W):
    """
    Parameters
    ----------
    X: np.array (M x N)
        Data matrix
    W: np.array(N x N)
        Linear transformation
    Returns
    -------
    X_prime: np.array (M x N)
    """
    # Your code goes here
    pass

In [None]:
W = np.identity(10)
X = np.random.random((300, 10))

In [None]:
X_tx = linear_transformation(X, W)
assert (X_tx == X).all()
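
The exact `==` comparison above works because multiplying by an identity matrix introduces no rounding. For a general transformation, comparing within a tolerance is usually safer. The sketch below assumes a hypothetical scaling matrix and its inverse; `data`, `scale`, and `scale_inv` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator so the run is repeatable
data = rng.random((300, 10))

scale = np.diag(np.full(10, 3.0))        # hypothetical transformation: multiply every feature by 3
scale_inv = np.diag(np.full(10, 1 / 3))  # its inverse

data_back = data @ scale @ scale_inv

# Rounding can make data_back differ from data in the last bits,
# so compare within a tolerance instead of with ==
print(np.allclose(data_back, data))
```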