# 2023 COMP 4446 / 5046 Assignment 1

Assingment 1 is an **individual** assessment. Please note the University's [Academic dishonesty and plagiarism policy](https://www.sydney.edu.au/students/academic-dishonesty.html).

Submission Deadline: Friday, March 17th, 2023, 11:59pm

Submit via Canvas:
- Your notebook
- Run all cells before saving the notebook, so we can see your output

In this assignment, we will explore ways to predict the length of a Wikipedia article based on the first 100 tokens in the article. Such a model could be used to explore whether there are systematic biases in the types of articles that get more detail.

If you are working in another language, please make sure to clearly indicate which part of your code is running which section of the assignment and produce output that provides all necessary information. Submit your code, example outputs, and instructions for executing it.

Note: This assignment contains topics that are not covered at the time of release. Each section has information about which lectures and/or labs covered the relevant material. We are releasing it now so you can (1) start working on some parts early, and (2) know what will be in the assignment when you attend the relevant labs and lectures.

# **TODO: Copy and Name this File**
Make a copy of this notebook in your own Google Drive (File -> Save a Copy in Drive) and change the filename, replacing `YOUR-UNIKEY`. For example, for a person with unikey `mcol1997`, the filename should be:

`COMP-4446-5046_Assignment1_mcol1997.ipynb`

# Readme
*If there is something you want to tell the marker about your submission, please mention it here.* 

# Data Download [DO NOT MODIFY THIS]

We have already constructed a dataset for you using a recent dump of data from Wikipedia. Both the training and test datasets are provided in the form of csv files (training_data.csv, test_data.csv) and can be downloaded from Google Drive using the code below. Each row of the data contains:

- The length of the article
- The title of the article
- The first 100 tokens of the article

In case you are curious, we constructed this dataset as follows:
1. Downloaded [a recent dump](https://dumps.wikimedia.org/) of English wikipedia.
2. Ran [WikiExtractor](https://github.com/attardi/wikiextractor) to get the contents of the pages.
3. Filtered out very short pages.
4. Ran [SpaCy](https://spacy.io/) with the `en_core_web_lg` model to tokenise the pages (Note, SpaCy's development is led by an alumnus of USyd!).
5. Counted the tokens and saved the relevant data in the format described above.

This code will download the data. **DO NOT MODIFY IT**

In [1]:
## DO NOT MODIFY THIS CODE
# Code to download files into Colaboratory

# Install the PyDrive library
!pip install -U -q PyDrive

# Import libraries for accessing Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Function to read the file, save it on the machine this colab is running on, and then read it in
import csv
def read_file(file_id, filename):
  downloaded = drive.CreateFile({'id':file_id})
  downloaded.GetContentFile(filename)
  with open(filename) as src:
    reader = csv.reader(src)
    data = [r for r in reader]
  return data

# Calls to get the data
# If you need to access the data directly (e.g., you are running experiments on a local machine), use these links:
# - Training, https://drive.google.com/file/d/1-UGFS8D-qglAX-czU38KaM4jQVCoNe0W/view?usp=share_link
# - Dev, https://drive.google.com/file/d/1RWMEf0mdJMTkWc7dvN0ioks8bjujqZaN/view?usp=share_link
# - Test, https://drive.google.com/file/d/1YVPNzdIFSMmVPeLBP-gf5DOIed3oRFyB/view?usp=share_link
training_data = read_file('1-UGFS8D-qglAX-czU38KaM4jQVCoNe0W', "/content/training_data.csv")
dev_data = read_file('1RWMEf0mdJMTkWc7dvN0ioks8bjujqZaN', "/content/dev_data.csv")
test_data = read_file('1YVPNzdIFSMmVPeLBP-gf5DOIed3oRFyB', "/content/test_data.csv")

print("------------------------------------")
print("Size of training data: {0}".format(len(training_data)))
print("Size of development data: {0}".format(len(dev_data)))
print("Size of test data: {0}".format(len(test_data)))
print("------------------------------------")

print("------------------------------------")
print("Sample Data")
print("LABEL: {0} / SENTENCE: {1}".format(training_data[0][0], training_data[0][1:]))
print("------------------------------------")

# Preview of the data in the csv file, which has three columns: 
# (1) length of article, (2) title of the article, (3) first 100 words in the article
for v in training_data[:10]:
  print("{}\n{}\n{}\n".format(v[0], v[1], v[2][:100] + "..."))

# Store the data in lists and mofidy the length value to be in [0, 1]
training_lengths = [min(1.0, int(r[0])/10000) for r in training_data]
training_text = [r[2] for r in training_data]

dev_lengths = [min(1.0, int(r[0])/10000) for r in dev_data]
dev_text = [r[2] for r in dev_data]

test_lengths = [min(1.0, int(r[0])/10000) for r in test_data]
test_text = [r[2] for r in test_data]

------------------------------------
Size of training data: 9859
Size of development data: 994
Size of test data: 991
------------------------------------
------------------------------------
Sample Data
LABEL: 6453 / SENTENCE: ['Anarchism', 'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy , typically including , though not necessarily limited to , governments , nation states , and capitalism . Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations . As a historically left - wing movement , usually placed on the farthest left of the political spectrum , it is usually described alongside communalism and libertarian Marxism as the libertarian wing ( libertarian socialism )']
------------------------------------
6453
Anarchism
Anarchism is a political philosophy and movement that is ske

# 1 - Predicting article length from initial content

This section relates to content from **the week 1 lecture and the week 2 lab**.

In this section, you will implement training and evaluation of a linear model (as seen in the week 2 lab) to predict the length of a wikipedia article from its first 100 words. You will represent the text using a Bag of Words model (as seen in the week 1 lecture).

## 1.1 Word Mapping [2pt]

In the code block below, write code to go through the training data and for any word that occurs at least 10 times:
- Assign it a unique ID (consecutive, starting at 0)
- Place it in a dictionary that maps from the word to the ID

In [2]:
# importing packages used in the assignment
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
from collections import Counter
import re
import numpy
nltk.download('stopwords')
from nltk.corpus import stopwords as sw
import string
import torch
print(torch.__version__)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


1.13.1+cu116


In [3]:
# this function takes an input of a list of strings
# returns text_tokenized: an array of words for each element in input
# returns tokenized_set: an dictionary(word,occurence of word in string element) for each element in input
def tokenization(data):
  tokenized_set = []
  text_tokenized = []
  for x in data:
    x_tokenized = [text.lower() for text in word_tokenize(x)]
    text_tokenized.append(x_tokenized)
    tokenized_set.append(dict(Counter(x_tokenized)))
  return text_tokenized,tokenized_set

In [4]:
# (dataset)_text_tokenized : each element of dataset stored as list of tokens
# (dataset)_text_tokenized_set : each element of data stored as (word,occurences of word in element)
training_text_tokenized, training_text_tokenized_set = tokenization(training_text)
dev_text_tokenized, dev_text_tokenized_set = tokenization(dev_text)
test_text_tokenized, test_text_tokenized_set = tokenization(test_text)

####################################### Code for counting occurences of words in the training data set #######################################

# training_text_tokenized is flattened so that it be used as a long list of word that will be treated as a vocabulary in this assignment 
training_text_tokenized_flat = [item for sublist in training_text_tokenized  for item in sublist]
# Counter used to count the occurence of every unique token in above created vocabulary
CounterDict = dict(Counter(training_text_tokenized_flat))


####################################### Code for removing words occuring less than 10 times in the dataset #######################################

# list comprehension to fiter and remove words that have occured less than 10 times in the training set
CounterMap = {key:value for key, value in CounterDict.items() if value>10}


####################################### Code for place words in a dictionary that maps from the word to its ID #######################################

# list comprehension for assigning a index starting from 0 for each word in the vocabulary
Vocabulary = {key:list(CounterMap.keys()).index(key) for key in CounterMap.keys()}
Vocabulary

{'anarchism': 0,
 'is': 1,
 'a': 2,
 'political': 3,
 'philosophy': 4,
 'and': 5,
 'movement': 6,
 'that': 7,
 'of': 8,
 'all': 9,
 'for': 10,
 'authority': 11,
 'seeks': 12,
 'to': 13,
 'the': 14,
 'institutions': 15,
 'it': 16,
 'claims': 17,
 'maintain': 18,
 'hierarchy': 19,
 ',': 20,
 'typically': 21,
 'including': 22,
 'though': 23,
 'not': 24,
 'necessarily': 25,
 'limited': 26,
 'governments': 27,
 'nation': 28,
 'states': 29,
 'capitalism': 30,
 '.': 31,
 'advocates': 32,
 'replacement': 33,
 'state': 34,
 'with': 35,
 'societies': 36,
 'or': 37,
 'other': 38,
 'forms': 39,
 'free': 40,
 'associations': 41,
 'as': 42,
 'historically': 43,
 'left': 44,
 '-': 45,
 'wing': 46,
 'usually': 47,
 'placed': 48,
 'on': 49,
 'spectrum': 50,
 'described': 51,
 'alongside': 52,
 'libertarian': 53,
 '(': 54,
 ')': 55,
 ';': 56,
 'measure': 57,
 'reflection': 58,
 'solar': 59,
 'radiation': 60,
 'out': 61,
 'total': 62,
 'measured': 63,
 'scale': 64,
 'from': 65,
 '0': 66,
 'corresponding'

## 1.2 Data to Bag-of-Words Tensors [2pt]

In the code block below, write code to prepare the data in PyTorch tensors.

The text should be converted to a bag of words (ie., a vector the length of the vocabulary in the mapping in the previous step, with counts of the words in the text).

In [5]:
# input for this function : list of ( list of tokens )
# this function converts each element of the input to a bag of words ( by first creating an array of occurence of each vocabulary word and then coveting this array to a tensor)
def createTensorforDocs(data):
  vector_for_docs = []
  words = Vocabulary.keys()
  for x in data:
    vector = [x[word] if word in x.keys() else 0 for word in words ]
    vector_for_docs.append(vector)
  return torch.from_numpy(numpy.asarray(vector_for_docs))

In [6]:

####################################### Code for text being converted to bag of words #######################################

# populating tensors used in training and testing model
x_training_data_tensor = createTensorforDocs(training_text_tokenized_set)
x_dev_data_tensor = createTensorforDocs(dev_text_tokenized_set)
x_test_data_tensor = createTensorforDocs(test_text_tokenized_set)


####################################### Code for changing the shape of Y vectors from shape(size_of_dataset) to shape([size_of_dataset],1) #######################################

# populating tensors used in training and testing model
y_training_data_tensor = torch.from_numpy(numpy.asarray([[x] for x in training_lengths]))
y_dev_data_tensor = torch.from_numpy(numpy.asarray([[x] for x in dev_lengths]))
y_test_data_tensor = torch.from_numpy(numpy.asarray([[x] for x in test_lengths]))

In [7]:
print('x_training_data_tensor',x_training_data_tensor)
print('x_dev_data_tensor',x_dev_data_tensor)
print('x_test_data_tensor',x_test_data_tensor)

x_training_data_tensor tensor([[2, 3, 2,  ..., 0, 0, 0],
        [0, 3, 4,  ..., 0, 0, 0],
        [0, 3, 5,  ..., 0, 0, 0],
        ...,
        [0, 0, 1,  ..., 0, 0, 0],
        [0, 2, 4,  ..., 0, 0, 0],
        [0, 3, 2,  ..., 0, 0, 0]])
x_dev_data_tensor tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 2,  ..., 0, 0, 0],
        [0, 0, 2,  ..., 0, 0, 0],
        ...,
        [0, 2, 6,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 1, 1,  ..., 0, 0, 0]])
x_test_data_tensor tensor([[0, 2, 0,  ..., 0, 0, 0],
        [0, 0, 1,  ..., 0, 0, 0],
        [0, 0, 1,  ..., 0, 0, 0],
        ...,
        [0, 2, 3,  ..., 0, 0, 0],
        [0, 3, 1,  ..., 0, 0, 0],
        [0, 1, 3,  ..., 0, 0, 0]])


In [8]:
print('y_training_data_tensor)',y_training_data_tensor)
print('y_dev_data_tensor',y_dev_data_tensor)
print('y_test_data_tensor',y_test_data_tensor)

y_training_data_tensor) tensor([[0.6453],
        [0.3528],
        [0.1265],
        ...,
        [0.5215],
        [0.0191],
        [0.1101]], dtype=torch.float64)
y_dev_data_tensor tensor([[0.0552],
        [0.0999],
        [0.1331],
        [0.0461],
        [0.5329],
        [0.3803],
        [0.7429],
        [0.3737],
        [0.1130],
        [0.4459],
        [0.8221],
        [0.5769],
        [0.0800],
        [0.0039],
        [0.8192],
        [0.2340],
        [0.3241],
        [0.4813],
        [0.1308],
        [0.1061],
        [0.0964],
        [0.4800],
        [0.3005],
        [0.4446],
        [0.6819],
        [0.0624],
        [0.0023],
        [0.6516],
        [0.1084],
        [0.0338],
        [0.1668],
        [1.0000],
        [0.0378],
        [0.7162],
        [1.0000],
        [0.4729],
        [0.1461],
        [0.0339],
        [0.1397],
        [0.0467],
        [0.0139],
        [0.1381],
        [0.0618],
        [0.6992],
        [0.0825],
     

## 1.3 Model Creation [2pt]

Construct a linear model with an SGD optimiser (we recommend a learning rate of `1e-4`) and mean squared error as the loss.

In [9]:
import torch.nn as nn
# model and optimiser for prediction of article lenght 
linearRegression =  nn.Linear(len(Vocabulary),1)
optimizer = torch.optim.SGD(linearRegression.parameters(), lr=1e-4)

In [10]:
# function used for calculating cost
def mse(x1, x2):
  diff = x1 - x2
  return torch.sum(diff*diff)/diff.numel()

## 1.4 Training [2pt]

Write a loop to train your model for 100 epochs, printing performance on the dev set every 10 epochs.

In [11]:
# this function trains specified model using specified optimizer for the number of epochs specified
def training(model,optimizerForModel,no_of_epochs):
  display_interval = no_of_epochs/10

  for epoch in range(no_of_epochs):
    predictions = model(x_data)
    loss = mse(predictions, y_data)
    loss.backward()
    optimizerForModel.step() 
    optimizerForModel.zero_grad() 
    if epoch % display_interval == 0 :
      # calculate the loss of the current model
      predictions = model(x_dev_data)
      loss = mse(predictions, y_dev_data)          
      print("Epoch:", '%04d' % (epoch), "dev loss=", "{:.8f}".format(loss))

  print("=========================================================")
  training_loss = mse(model(x_data), y_data)   
  print("Optimised:", "training loss=", "{:.9f}".format(training_loss.data))
  training_loss = mse(model(x_dev_data), y_dev_data)   
  print("Optimised:", "dev loss=", "{:.9f}".format(training_loss.data))
  print("=========================================================")

In [12]:
# populating variables used in training the model
x_data = x_training_data_tensor.float()
y_data = y_training_data_tensor.float()
x_dev_data = x_dev_data_tensor.float()
y_dev_data = y_dev_data_tensor.float()
x_test_data = x_test_data_tensor.float()
y_test_data = y_test_data_tensor.float()

In [13]:
####################################### Code for training model for 100 epochs, printing performance on the dev set every 10 epochs #######################################
# model will run for 100 epochs
%time training(linearRegression,optimizer,100)

Epoch: 0000 dev loss= 0.16157503
Epoch: 0010 dev loss= 0.12982070
Epoch: 0020 dev loss= 0.11151215
Epoch: 0030 dev loss= 0.10091813
Epoch: 0040 dev loss= 0.09475143
Epoch: 0050 dev loss= 0.09112658
Epoch: 0060 dev loss= 0.08896209
Epoch: 0070 dev loss= 0.08763759
Epoch: 0080 dev loss= 0.08679716
Epoch: 0090 dev loss= 0.08623658
Optimised: training loss= 0.087500580
Optimised: dev loss= 0.085873276
CPU times: user 4.82 s, sys: 28.3 ms, total: 4.85 s
Wall time: 4.97 s


## 1.1 Measure Accuracy [2pt]

In the code block below, write code to evaluate your model on the test set.

In [14]:
# this function evaluates specified model on test set
def testing(model):
  print("=========================================================")
  training_loss = mse(model(x_data), y_data)   
  print("Optimised:", "training loss=", "{:.9f}".format(training_loss.data))
  training_loss = mse(model(x_dev_data), y_dev_data)   
  print("Optimised:", "dev loss=", "{:.9f}".format(training_loss.data))
  print("=========================================================")

  # Calculating testing loss
  testing_loss = mse(model(x_test_data), y_test_data) 
  print("Testing loss=", "{:.9f}".format(testing_loss.data))
  print("Absolute mean square loss difference:", "{:.9f}".format(abs(training_loss.data - testing_loss.data)))

In [15]:
####################################### code to evaluate model on the test set. #######################################

%time testing(linearRegression)

Optimised: training loss= 0.087500580
Optimised: dev loss= 0.085873276
Testing loss= 0.079499044
Absolute mean square loss difference: 0.006374232
CPU times: user 39.1 ms, sys: 854 µs, total: 40 ms
Wall time: 33.3 ms


## 1.2 Analyse the Model [2pt]

In the code block below, write code to identify the 50 words with the highest weights and the 50 words with the lowest weights.

In [16]:
# Code to associate vocabulary element with respective weights 
weights_without_gradient = linearRegression.weight.detach()
weights = weights_without_gradient.numpy()[0]
dtype = [('word',numpy.object_),('weight',float)]
values = []
for i in range(0,len(Vocabulary)):
  values.append((list(Vocabulary.keys())[i],weights[i]))
important_words_with_weights_sorted = numpy.sort(numpy.array(values,dtype=dtype),order='weight')[::-1]

In [17]:
####################################### 50 words with the highest weights identified #######################################

#words with highest weights
important_words_with_weights_sorted[:50]

array([('of', 0.01702806), ('the', 0.01616932), ('in', 0.01288139),
       ('at', 0.01243216), ('–', 0.01215785), ('all', 0.01212496),
       ('creating', 0.01209114), ('confederate', 0.01208129),
       ('fossils', 0.01208048), ('capacity', 0.012078  ),
       ('centers', 0.01207101), ('thriller', 0.01206873),
       ('confederation', 0.01206743), ('theatrical', 0.01205799),
       ('attempts', 0.01205662), ('mya', 0.01205641),
       ('ottoman', 0.01205078), ('straits', 0.01204906),
       ('show', 0.0120482 ), ('long', 0.01204792), ('firm', 0.01204728),
       ('measure', 0.01204004), ('count', 0.01203752),
       ('davis', 0.01203716), ('improving', 0.01202694),
       ('cathedral', 0.01201396), ('houses', 0.01199976),
       ('economies', 0.01199922), ('america', 0.01199831),
       ('basque', 0.01199717), ('taking', 0.01198024),
       ('training', 0.01198005), ('dry', 0.01197861),
       ('heart', 0.01197722), ('jerusalem', 0.01197718),
       ('marshall', 0.01197248), ('capita'

In [18]:
####################################### 50 words with the lowest weights identified #######################################

#words with lowest weights
important_words_with_weights_sorted[-50:][::-1]

array([('organised', -0.01206932), ('francesco', -0.01206626),
       ('reserves', -0.01205936), ('2010', -0.01205783),
       ('ritual', -0.01205778), ('1927', -0.01205746),
       ('juice', -0.01205644), ('biography', -0.01205084),
       ('1932', -0.01203747), ('emperors', -0.01203547),
       ('operate', -0.01203311), ('laser', -0.01202521),
       ('architecture', -0.01202202), ('becoming', -0.01202058),
       ('projects', -0.01201517), ('wider', -0.01201157),
       ('titled', -0.01200502), ('passing', -0.01199951),
       ('holidays', -0.01199939), ('centres', -0.01199607),
       ('advocates', -0.0119927 ), ('headquartered', -0.01198541),
       ('caribbean', -0.01197631), ('missionary', -0.01197587),
       ('theatre', -0.01197538), ('things', -0.01197077),
       ('exploration', -0.01197059), ('question', -0.01197045),
       ('harbours', -0.01196087), ('center', -0.01195694),
       ('matches', -0.01195681), ('spaceflight', -0.01194841),
       ('volume', -0.01194496), ('d.

# 2 - Compare Data Storage Methods

This section relates to content from **the week 1 lecture and the week 2 lab**.

Implement a variant of the model with a sparse vector for your input bag of words (See https://pytorch.org/docs/stable/sparse.html for how to switch a vector to be sparse). Use the default sparse vector type (COO).

In [19]:
# populating data for training model
x_data = x_data.to_sparse()
y_data = y_data.to_sparse()
x_dev_data = x_dev_data.to_sparse()
y_dev_data = y_dev_data.to_sparse()
x_test_data = x_test_data.to_sparse()
y_test_data = y_test_data.to_sparse()

In [20]:
print('x_data_tensor',x_data)
print('x_dev_data',x_dev_data)
print('x_test_data',x_test_data)

x_data_tensor tensor(indices=tensor([[   0,    0,    0,  ..., 9858, 9858, 9858],
                       [   0,    1,    2,  ..., 5188, 5702, 6707]]),
       values=tensor([2., 3., 2.,  ..., 1., 1., 1.]),
       size=(9859, 6853), nnz=527943, layout=torch.sparse_coo)
x_dev_data tensor(indices=tensor([[   0,    0,    0,  ...,  993,  993,  993],
                       [   5,    7,    8,  ..., 4240, 5386, 6623]]),
       values=tensor([4., 1., 5.,  ..., 2., 1., 3.]),
       size=(994, 6853), nnz=52662, layout=torch.sparse_coo)
x_test_data tensor(indices=tensor([[   0,    0,    0,  ...,  990,  990,  990],
                       [   1,    5,    8,  ..., 4802, 4803, 6057]]),
       values=tensor([ 2.,  1., 10.,  ...,  1.,  1.,  1.]),
       size=(991, 6853), nnz=52583, layout=torch.sparse_coo)


In [21]:
print('y_data',y_data)
print('y_dev_data',y_dev_data)
print('y_test_data',y_test_data)

y_data tensor(indices=tensor([[   0,    1,    2,  ..., 9856, 9857, 9858],
                       [   0,    0,    0,  ...,    0,    0,    0]]),
       values=tensor([0.6453, 0.3528, 0.1265,  ..., 0.5215, 0.0191, 0.1101]),
       size=(9859, 1), nnz=9859, layout=torch.sparse_coo)
y_dev_data tensor(indices=tensor([[  0,   1,   2,  ..., 991, 992, 993],
                       [  0,   0,   0,  ...,   0,   0,   0]]),
       values=tensor([0.0552, 0.0999, 0.1331, 0.0461, 0.5329, 0.3803, 0.7429,
                      0.3737, 0.1130, 0.4459, 0.8221, 0.5769, 0.0800, 0.0039,
                      0.8192, 0.2340, 0.3241, 0.4813, 0.1308, 0.1061, 0.0964,
                      0.4800, 0.3005, 0.4446, 0.6819, 0.0624, 0.0023, 0.6516,
                      0.1084, 0.0338, 0.1668, 1.0000, 0.0378, 0.7162, 1.0000,
                      0.4729, 0.1461, 0.0339, 0.1397, 0.0467, 0.0139, 0.1381,
                      0.0618, 0.6992, 0.0825, 0.1998, 0.3397, 0.3242, 0.8384,
                      0.6603, 0.1526, 0.

In [22]:
# creation of model that will be trained using sparse vectors
linearRegressionForSparce =  nn.Linear(len(Vocabulary),1)
optimizerForSparce = torch.optim.SGD(linearRegressionForSparce.parameters(), lr=1e-4)

## 2.1 Training and Test Speed [2pt]
Compare the time it takes to train and test the new model with the time it takes to train and test the old model.

You can time the execution of a line of code using `%time`.
See [this guide](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.07-Timing-and-Profiling.ipynb#scrollTo=z1gyaC_PNZUB) for more on timing.

In [23]:
# recording time to train a model with saprce vectors as input
%time training(linearRegressionForSparce,optimizerForSparce,100)

Epoch: 0000 dev loss= 0.30211979
Epoch: 0010 dev loss= 0.21018898
Epoch: 0020 dev loss= 0.15715803
Epoch: 0030 dev loss= 0.12649871
Epoch: 0040 dev loss= 0.10871670
Epoch: 0050 dev loss= 0.09835556
Epoch: 0060 dev loss= 0.09227756
Epoch: 0070 dev loss= 0.08867680
Epoch: 0080 dev loss= 0.08651288
Epoch: 0090 dev loss= 0.08518549
Optimised: training loss= 0.084280558
Optimised: dev loss= 0.084415913
CPU times: user 3.02 s, sys: 9.89 ms, total: 3.03 s
Wall time: 3.04 s


In [24]:
# recording time to test a model with sparce vectors as input
%time testing(linearRegressionForSparce)

Optimised: training loss= 0.084280558
Optimised: dev loss= 0.084415913
Testing loss= 0.075273253
Absolute mean square loss difference: 0.009142660
CPU times: user 22.1 ms, sys: 1.96 ms, total: 24.1 ms
Wall time: 24.4 ms


In [25]:
# populating variables used in training the model the model in section 1
x_data = x_training_data_tensor.float()
y_data = y_training_data_tensor.float()
x_dev_data = x_dev_data_tensor.float()
y_dev_data = y_dev_data_tensor.float()
x_test_data = x_test_data_tensor.float()
y_test_data = y_test_data_tensor.float()

In [26]:
# reinitializing the model is used in section 1
linearRegression =  nn.Linear(len(Vocabulary),1)
optimizer = torch.optim.SGD(linearRegression.parameters(), lr=1e-4)

In [27]:
# recording time to train a model in section 1
%time training(linearRegression,optimizer,100)

Epoch: 0000 dev loss= 0.26912254
Epoch: 0010 dev loss= 0.19174317
Epoch: 0020 dev loss= 0.14729714
Epoch: 0030 dev loss= 0.12173622
Epoch: 0040 dev loss= 0.10700449
Epoch: 0050 dev loss= 0.09848249
Epoch: 0060 dev loss= 0.09352167
Epoch: 0070 dev loss= 0.09060355
Epoch: 0080 dev loss= 0.08885767
Epoch: 0090 dev loss= 0.08778510
Optimised: training loss= 0.087428689
Optimised: dev loss= 0.087156363
CPU times: user 4.72 s, sys: 21 ms, total: 4.75 s
Wall time: 4.74 s


In [28]:
# recording time to test a model in section 1
%time testing(linearRegression)

Optimised: training loss= 0.087428689
Optimised: dev loss= 0.087156363
Testing loss= 0.078067541
Absolute mean square loss difference: 0.009088822
CPU times: user 36.8 ms, sys: 1.97 ms, total: 38.7 ms
Wall time: 33.5 ms


On comparing the training time and testing time recorded for model created section 1 (model using dense vectors) and the training and testing time recorded for model created in section 2 (model using sparce vectors) we see that the model using sparse vector takes lesser time to train, same goes for time taken to test as well.

In this case training the model using sparse vectors takes 1.72s less time to train and 14.6ms less time to test. this is because sparse vectors have more efficient storage and faster access techniques

# 3 - Switch to Word Embeddings

This section relates to content from **the week 2 lecture and the week 3 lab**.

In this section, you will implement a model based on word2vec.

1. Use word2vec to learn embeddings for the words in your data.
2. Represent each input document as the average of the word vectors for the words it contains.
3. Train a linear regression model.

In [29]:
import gensim
print(gensim.__version__)

3.6.0


In [30]:
####################################### Code for learning embeddings from words in training dataset #######################################

# model to learn word embeddings from training data
from gensim.models import Word2Vec
word_embeddings_model = Word2Vec(sentences=training_text_tokenized, size=100, window=5, min_count=10, workers=2, sg=0)

In [31]:
# this function uses specified word embedding model to generate average vector for each element of specified dataset
def getAvgVectors(data,word_embedding_model):
  # list of arrays that is treated as a list of vectors 
  list_of_word_vectors = []
  for text in data:
    # generating a vector for each word in the text
    vector_for_texts = [word_embedding_model.wv[word] for word in text if word_embedding_model.wv.__contains__(word) ]
    # reducing the list of word vectors to a mean vector for the dataset element
    mean_vector = [sum(i)/len(text) for i in zip(*vector_for_texts)]
    list_of_word_vectors.append(mean_vector)
  # retruning the list of vectors as tensor
  return torch.from_numpy(numpy.asarray(list_of_word_vectors))
  #return list_of_word_vectors

In [32]:
####################################### Code for representing each input document as the average of the word vectors for the words it contains #######################################

# generating input data that will be used in training model
x_training_word_embedding_tensor = getAvgVectors(training_text_tokenized,word_embeddings_model)
x_dev_word_embedding_tensor = getAvgVectors(dev_text_tokenized,word_embeddings_model)
x_test_word_embedding_tensor = getAvgVectors(test_text_tokenized,word_embeddings_model)

In [33]:
# populating variables used in training model
x_data = x_training_word_embedding_tensor.float()
y_data = y_training_data_tensor.float()
x_dev_data = x_dev_word_embedding_tensor.float()
y_dev_data = y_dev_data_tensor.float()
x_test_data = x_test_word_embedding_tensor.float()
y_test_data = y_test_data_tensor.float()

In [34]:
print('x_data',x_data)
print('x_dev_data',x_dev_data)
print('x_test_data',x_test_data)

x_data tensor([[-0.5877, -0.0584, -0.0620,  ...,  0.0216, -0.3667, -0.2914],
        [-0.5651, -0.1119, -0.1609,  ...,  0.1444, -0.2572, -0.1452],
        [-0.5016, -0.1134, -0.2075,  ...,  0.2529, -0.4345, -0.3639],
        ...,
        [-0.5559, -0.1558, -0.1314,  ...,  0.3902, -0.5177, -0.2062],
        [-0.5544, -0.0438, -0.2238,  ...,  0.0254, -0.3075, -0.3114],
        [-0.4939,  0.0416, -0.1102,  ...,  0.0725, -0.3594, -0.1712]])
x_dev_data tensor([[-0.6061, -0.3074, -0.2971,  ...,  0.3092, -0.4251, -0.3105],
        [-0.5857, -0.0894, -0.1847,  ...,  0.1451, -0.1954, -0.2376],
        [-0.5944, -0.1709, -0.2255,  ...,  0.3798, -0.5226, -0.3171],
        ...,
        [-0.5528, -0.2319, -0.1300,  ...,  0.1162, -0.5047, -0.3705],
        [-0.5757, -0.1193, -0.4067,  ...,  0.3365, -0.3388, -0.2778],
        [-0.5173, -0.1462, -0.2788,  ...,  0.3027, -0.3760, -0.3550]])
x_test_data tensor([[-0.6377, -0.0882,  0.0557,  ...,  0.4014, -0.5563, -0.3310],
        [-0.6294, -0.2011, -0.24

In [35]:
print('y_data',y_data)
print('y_dev_data',y_dev_data_tensor)
print('y_test_data',y_test_data)

y_data tensor([[0.6453],
        [0.3528],
        [0.1265],
        ...,
        [0.5215],
        [0.0191],
        [0.1101]])
y_dev_data tensor([[0.0552],
        [0.0999],
        [0.1331],
        [0.0461],
        [0.5329],
        [0.3803],
        [0.7429],
        [0.3737],
        [0.1130],
        [0.4459],
        [0.8221],
        [0.5769],
        [0.0800],
        [0.0039],
        [0.8192],
        [0.2340],
        [0.3241],
        [0.4813],
        [0.1308],
        [0.1061],
        [0.0964],
        [0.4800],
        [0.3005],
        [0.4446],
        [0.6819],
        [0.0624],
        [0.0023],
        [0.6516],
        [0.1084],
        [0.0338],
        [0.1668],
        [1.0000],
        [0.0378],
        [0.7162],
        [1.0000],
        [0.4729],
        [0.1461],
        [0.0339],
        [0.1397],
        [0.0467],
        [0.0139],
        [0.1381],
        [0.0618],
        [0.6992],
        [0.0825],
        [0.1998],
        [0.3397],
        [0.324

In [36]:
# model specification that will trained using word2vec vectors
linearRegressionForWordEmbedding =  nn.Linear(100,1)
optimizerForWordEmbedding = torch.optim.SGD(linearRegressionForWordEmbedding.parameters(), lr=1e-4)

In [37]:
####################################### Code for training linear model #######################################
# train model with 100 epochs
%time training(linearRegressionForWordEmbedding,optimizerForWordEmbedding,100)

Epoch: 0000 dev loss= 0.16285279
Epoch: 0010 dev loss= 0.16140001
Epoch: 0020 dev loss= 0.15997556
Epoch: 0030 dev loss= 0.15857889
Epoch: 0040 dev loss= 0.15720941
Epoch: 0050 dev loss= 0.15586668
Epoch: 0060 dev loss= 0.15455012
Epoch: 0070 dev loss= 0.15325919
Epoch: 0080 dev loss= 0.15199341
Epoch: 0090 dev loss= 0.15075228
Optimised: training loss= 0.144883811
Optimised: dev loss= 0.149655968
CPU times: user 112 ms, sys: 19 ms, total: 131 ms
Wall time: 126 ms


## 3.1 Accuracy [1pt]

Calculate the accuracy of your model.

In [38]:
%time testing(linearRegressionForWordEmbedding)

Optimised: training loss= 0.144883811
Optimised: dev loss= 0.149655968
Testing loss= 0.128177151
Absolute mean square loss difference: 0.021478817
CPU times: user 1.98 ms, sys: 0 ns, total: 1.98 ms
Wall time: 1.99 ms


## 3.2 Speed [1pt]

Calcualte how long it takes your model to be evaluated.

In [39]:
# recording time taken to evaluate model
%time mse(linearRegressionForWordEmbedding(x_test_data), y_test_data)

CPU times: user 300 µs, sys: 1.01 ms, total: 1.31 ms
Wall time: 795 µs


tensor(0.1282, grad_fn=<DivBackward0>)

Total time taken is 1.31 milliseconds

# 4 - Open-Ended Improvement

This section relates to content from **the week 1, 2, and 3 lectures and the week 1, 2, and 3 labs**.

This section is an open-ended opportunity to find ways to make your model more accurate and/or faster (e.g., use WordNet to generalise words, try different word features, other optimisers, etc).

We encourage you to try several ideas to provide scope for comparisons.

If none of your ideas work you can still get full marks for this section. You just need to justify the ideas and discuss why they may not have improved performance.


## 4.1 Ideas and Motivation [1pt]

In **this** box, describe your ideas and why you think they will improve accuracy and/or speed.

*   Approch 1 :  removing stopwords and punctuation + increasing epochs to 1000 


> Aim of removing stopwords is to remove tokens that have no meaning and no importance from the vocabulary. this would amount in the model being more meaningful hence more accurate. As model uses vectors of length 100 the time taken to train is around 100ms this indicates that a even better result can be acheived by training more and still gain results within reasonable time.


*   Approch 2 :  Using the Fast Text instead of Word2Vec + increasing the learning rate


> Aim to better the model by learning embeddings at an n-gram level instead of at the word level to increase accuracy + Aim to achieve better accuracy by increasing the gradiaent displacement in each learning cycle











## 4.2 Implementation [2pt]

Implement your ideas

### Approch 1

In [40]:
# creating list of tokens that need to be removed from the corpus
stopwordsAndPunctuation = []
stopwordsAndPunctuation.extend(string.punctuation)
stopwordsAndPunctuation.extend(sw.words())

In [41]:
# function to remove stopwords and punctuation from the provided list of tokens
def removeStopwordsAndPunctuation(words): 
  return list(filter(lambda x:not x in stopwordsAndPunctuation, words))

In [42]:
# code for removing stopwords and punctuation from all datasets
training_text_tokenized_new = [removeStopwordsAndPunctuation(tokens) for tokens in training_text_tokenized]
dev_text_tokenized_new = [removeStopwordsAndPunctuation(tokens) for tokens in dev_text_tokenized]
test_text_tokenized_new = [removeStopwordsAndPunctuation(tokens) for tokens in test_text_tokenized]

In [43]:
# specification of model being trained on dataset with stopwords and punctuation removed
linearRegressionForAp1 =  nn.Linear(100,1)
optimizerForAp1 = torch.optim.SGD(linearRegressionForAp1.parameters(), lr=1e-4)

In [44]:
# word2vec model learning word embedding from dataset with stopwords and punctuation removed
word_embeddings_model_ap1 = Word2Vec(sentences=training_text_tokenized_new, size=100, window=5, min_count=10, workers=2, sg=0)

In [45]:
# generating a mean word2vec vector for documents in each dataset
x_training_word_embeddings_tensor_ap1 = getAvgVectors(training_text_tokenized_new,word_embeddings_model_ap1)
x_dev_word_embeddings_tensor_ap1 = getAvgVectors(dev_text_tokenized_new,word_embeddings_model_ap1)
x_test_word_embeddings_tensor_ap1 = getAvgVectors(test_text_tokenized_new,word_embeddings_model_ap1)

In [46]:
# populating data that will be used in training and testing of model
x_data = x_training_word_embeddings_tensor_ap1.float()
y_data = y_training_data_tensor.float()
x_dev_data = x_dev_word_embeddings_tensor_ap1.float()
y_dev_data = y_dev_data_tensor.float()
x_test_data = x_test_word_embeddings_tensor_ap1.float()
y_test_data = y_test_data_tensor.float()

In [47]:
print('x_data',x_data)
print('x_dev_data',x_dev_data)
print('x_test_data',x_test_data)

x_data tensor([[-0.1702,  0.1000,  0.0085,  ...,  0.0720, -0.1330,  0.1746],
        [-0.2561,  0.1278, -0.0843,  ...,  0.1324, -0.0095,  0.1355],
        [-0.3705,  0.1042, -0.2515,  ...,  0.3621,  0.0111,  0.1001],
        ...,
        [-0.3415, -0.0035,  0.0352,  ...,  0.1716, -0.2742,  0.2558],
        [-0.2405,  0.1128, -0.1940,  ...,  0.2752, -0.0928,  0.1537],
        [-0.1679,  0.1122,  0.0311,  ...,  0.0565, -0.0539,  0.0962]])
x_dev_data tensor([[-0.0047,  0.1255,  0.1215,  ..., -0.1362,  0.2402,  0.1910],
        [-0.0896,  0.1010, -0.0394,  ...,  0.0113,  0.0367,  0.0370],
        [-0.0291,  0.1147,  0.0927,  ..., -0.0941,  0.1721,  0.1373],
        ...,
        [-0.4617,  0.0448, -0.1124,  ...,  0.4191,  0.1006,  0.0458],
        [ 0.0059,  0.1472,  0.1590,  ..., -0.1388,  0.2910,  0.1228],
        [-0.0785,  0.0955,  0.1452,  ..., -0.0463,  0.2217,  0.1459]])
x_test_data tensor([[-0.2756,  0.0254,  0.1546,  ...,  0.1448, -0.1768,  0.4000],
        [-0.3451,  0.0397, -0.10

In [48]:
print('y_data',y_data)
print('y_dev_data',y_dev_data)
print('y_test_data',y_test_data)

y_data tensor([[0.6453],
        [0.3528],
        [0.1265],
        ...,
        [0.5215],
        [0.0191],
        [0.1101]])
y_dev_data tensor([[0.0552],
        [0.0999],
        [0.1331],
        [0.0461],
        [0.5329],
        [0.3803],
        [0.7429],
        [0.3737],
        [0.1130],
        [0.4459],
        [0.8221],
        [0.5769],
        [0.0800],
        [0.0039],
        [0.8192],
        [0.2340],
        [0.3241],
        [0.4813],
        [0.1308],
        [0.1061],
        [0.0964],
        [0.4800],
        [0.3005],
        [0.4446],
        [0.6819],
        [0.0624],
        [0.0023],
        [0.6516],
        [0.1084],
        [0.0338],
        [0.1668],
        [1.0000],
        [0.0378],
        [0.7162],
        [1.0000],
        [0.4729],
        [0.1461],
        [0.0339],
        [0.1397],
        [0.0467],
        [0.0139],
        [0.1381],
        [0.0618],
        [0.6992],
        [0.0825],
        [0.1998],
        [0.3397],
        [0.324

In [49]:
%time training(linearRegressionForAp1,optimizerForAp1,1000)

Epoch: 0000 dev loss= 0.16549142
Epoch: 0100 dev loss= 0.15135138
Epoch: 0200 dev loss= 0.13972367
Epoch: 0300 dev loss= 0.13015531
Epoch: 0400 dev loss= 0.12227534
Epoch: 0500 dev loss= 0.11578003
Epoch: 0600 dev loss= 0.11042059
Epoch: 0700 dev loss= 0.10599336
Epoch: 0800 dev loss= 0.10233136
Epoch: 0900 dev loss= 0.09929790
Optimised: training loss= 0.093802556
Optimised: dev loss= 0.096803851
CPU times: user 1.05 s, sys: 8.72 ms, total: 1.06 s
Wall time: 1.13 s


In [50]:
%time testing(linearRegressionForAp1)

Optimised: training loss= 0.093802556
Optimised: dev loss= 0.096803851
Testing loss= 0.082667157
Absolute mean square loss difference: 0.014136694
CPU times: user 5.28 ms, sys: 3 µs, total: 5.28 ms
Wall time: 9.44 ms


### Approch 2

In [51]:
# fast text vector to learn n-gram embbeding from dataset(stopwords and punctuation removed)
from gensim.models import FastText
word_embeddings_model_ft = FastText(sentences=training_text_tokenized_new, size=100, window=5, min_count=10, workers=2, sg=0)

In [52]:
# specification of model being trained using fast text vectors
linearRegressionForAp2 =  nn.Linear(100,1)
optimizerForAp2 = torch.optim.SGD(linearRegressionForAp2.parameters(), lr=1e-3)

In [53]:
# generating a mean fast text vector for documents in each dataset
x_training_fasttext_tensor_ap2 = getAvgVectors(training_text_tokenized_new,word_embeddings_model_ft)
x_dev_fasttext_tensor_ap2 = getAvgVectors(dev_text_tokenized_new,word_embeddings_model_ft)
x_test_fasttext_tensor_ap2 = getAvgVectors(test_text_tokenized_new,word_embeddings_model_ft)

In [54]:
# populating data that will be used in training and testing of model
x_data = x_training_fasttext_tensor_ap2.float()
y_data = y_training_data_tensor.float()
x_dev_data = x_dev_fasttext_tensor_ap2.float()
y_dev_data = y_dev_data_tensor.float()
x_test_data = x_test_fasttext_tensor_ap2.float()
y_test_data = y_test_data_tensor.float()

In [55]:
print('x_data',x_data)
print('x_dev_data',x_dev_data)
print('x_test_data',x_test_data)

x_data tensor([[-1.6640e-01,  2.9206e-01, -4.2247e-01,  ...,  3.4840e-01,
          1.5143e-01,  4.3527e-01],
        [ 4.6573e-01,  1.9362e-01, -3.8290e-01,  ...,  3.8650e-01,
         -1.4027e-01,  5.4531e-01],
        [ 8.2157e-01,  6.2583e-02, -3.3494e-01,  ...,  4.3443e-01,
          5.2901e-01,  8.3186e-01],
        ...,
        [-2.6867e-01,  6.0831e-01, -4.3429e-01,  ...,  4.0811e-01,
          6.1082e-01,  6.6706e-01],
        [ 5.3086e-01,  1.7527e-01, -4.0932e-01,  ...,  5.0237e-01,
          1.8469e-01,  6.7536e-01],
        [ 2.1045e-01,  2.2887e-01, -4.7201e-01,  ...,  2.8749e-01,
          3.0694e-04,  1.6014e-01]])
x_dev_data tensor([[-0.2387,  0.6382, -0.2250,  ...,  0.0288,  0.6768,  0.2630],
        [ 0.2172,  0.2514, -0.3168,  ...,  0.1638,  0.0610,  0.1788],
        [-0.1816,  0.5649, -0.2809,  ...,  0.0792,  0.5635,  0.2617],
        ...,
        [ 1.5462,  0.3827, -0.4995,  ...,  0.4265,  1.0051,  0.8364],
        [-0.0868,  0.5643, -0.3162,  ...,  0.0463,  0.674

In [56]:
print('y_data',y_data)
print('y_dev_data',y_dev_data)
print('y_test_data',y_test_data)

y_data tensor([[0.6453],
        [0.3528],
        [0.1265],
        ...,
        [0.5215],
        [0.0191],
        [0.1101]])
y_dev_data tensor([[0.0552],
        [0.0999],
        [0.1331],
        [0.0461],
        [0.5329],
        [0.3803],
        [0.7429],
        [0.3737],
        [0.1130],
        [0.4459],
        [0.8221],
        [0.5769],
        [0.0800],
        [0.0039],
        [0.8192],
        [0.2340],
        [0.3241],
        [0.4813],
        [0.1308],
        [0.1061],
        [0.0964],
        [0.4800],
        [0.3005],
        [0.4446],
        [0.6819],
        [0.0624],
        [0.0023],
        [0.6516],
        [0.1084],
        [0.0338],
        [0.1668],
        [1.0000],
        [0.0378],
        [0.7162],
        [1.0000],
        [0.4729],
        [0.1461],
        [0.0339],
        [0.1397],
        [0.0467],
        [0.0139],
        [0.1381],
        [0.0618],
        [0.6992],
        [0.0825],
        [0.1998],
        [0.3397],
        [0.324

In [57]:
%time training(linearRegressionForAp2,optimizerForAp2,1000)

Epoch: 0000 dev loss= 0.14912276
Epoch: 0100 dev loss= 0.10009476
Epoch: 0200 dev loss= 0.08687051
Epoch: 0300 dev loss= 0.08225392
Epoch: 0400 dev loss= 0.08056862
Epoch: 0500 dev loss= 0.07993797
Epoch: 0600 dev loss= 0.07970515
Epoch: 0700 dev loss= 0.07962994
Epoch: 0800 dev loss= 0.07961952
Epoch: 0900 dev loss= 0.07963561
Optimised: training loss= 0.078452975
Optimised: dev loss= 0.079661146
CPU times: user 824 ms, sys: 11 ms, total: 835 ms
Wall time: 838 ms


In [58]:
%time testing(linearRegressionForAp2)

Optimised: training loss= 0.078452975
Optimised: dev loss= 0.079661146
Testing loss= 0.069427423
Absolute mean square loss difference: 0.010233723
CPU times: user 7.44 ms, sys: 1 ms, total: 8.44 ms
Wall time: 8.24 ms


## 4.3 Evaluation [1pt]

Evaluate the speed and accuracy of the model with your ideas

In **this** text box, briefly describe the results. Did your improvement work? Why / Why not?

Approach 1 : 

Removing stopwords from the vocabulary lead to and inreasing number of epochs lead to a significant decrease in tesing loss 0.12817 -> 0.08266

Training takes more time than before 131ms -> 1.06ms but this increase is very reasonable

The improvement seen is because removal of stopwords improved the quality of training data in turn providing a better model which produced a better result because it trained more which let the model keep adjusting its parameters to gain better results


Approch 2 : 

Changing to Fasttext model and increasing the learning rate led to a slight improvement in testing cost 0.08266 -> 0.069427

There is a slight drop in the training time 1.06s -> 835ms

This improvement has been done in addition to Approch 1.The slight improvement we see is because fast text uses n-gram to learn embeddings, a larger learning rate was required for this model to provide improved results because the training loss for the model was initially very large. Both contributed in the improvement of the model.
