*-- Author: Jose Camacho Collados (Cardiff University, Lecturer) --*

This Jupyter notebook is intended as a Python refresher. It also introduces some basic functions that will be used later in the module.

Let's start with some basic text (string) preprocessing.





## TEXT PREPROCESSING WITH NLTK


---

First, we import the libraries that we need using "import".

**Note:** All these libraries need to be installed beforehand if you are not using Google Colab. Check their official websites for installation instructions.

In [0]:
import numpy as np
import nltk

We can also download any required dependencies. For example, for nltk we will need "punkt" for tokenization and "wordnet" for lemmatization (alternatively, you can download all dependencies with "nltk.download('all')"). You only need to do this once.







In [0]:
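nltk.download('punkt')
nltk.download('wordnet')
#nltk.download('punkt_tab')  # assumption: recent NLTK releases may ask for this resource in addition to 'punkt'
#nltk.download('all')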

Text is represented as strings. Strings can be concatenated using "+". For instance, if we are given three sentences, we can join them to form a paragraph. Recall that in Python 3 we use the function "print()" to print to the console.



In [0]:
sentence1="Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. "
sentence2="It is seen as a subset of artificial intelligence. "
sentence3="Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to perform the task. "
paragraph=sentence1+sentence2+sentence3
print (paragraph)
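
When there are many sentences, chaining "+" quickly becomes unwieldy. As a minimal sketch (the short strings and the names `sentences`, `paragraph_plus` and `paragraph_join` are illustrative and not part of the notebook), `str.join` concatenates a whole list in one call:

```python
# Illustrative sentences (without trailing spaces), not the ones used above
sentences = ["Text is data.", "Strings can be joined.", "This forms a paragraph."]

# Concatenation with "+" needs the separator added by hand
paragraph_plus = sentences[0] + " " + sentences[1] + " " + sentences[2]

# str.join inserts the separator between every pair of elements
paragraph_join = " ".join(sentences)

print(paragraph_join)
print(paragraph_plus == paragraph_join)
```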

Let's now preprocess this paragraph. First we need to *tokenize* the text, which means creating a list where each element is a word (or a *token*). To do this, we can use the *nltk* library, which is very convenient for dealing with text data.

In [0]:
list_tokens=nltk.tokenize.word_tokenize(paragraph)
print (list_tokens)
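
To see why a dedicated tokenizer is useful, the sketch below compares plain whitespace splitting with a simple regular expression that separates punctuation. The regular expression is only an illustrative stand-in: it is much cruder than what `nltk.tokenize.word_tokenize` does, and the variable names are made up for this example.

```python
import re

text = "It is seen as a subset of artificial intelligence."

# Whitespace splitting leaves the full stop glued to the last word
whitespace_tokens = text.split()

# Rough stand-in for a tokenizer: runs of word characters, or single
# punctuation marks (this is NOT the algorithm NLTK uses)
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[-1])
print(regex_tokens[-2:])
```

With whitespace splitting, "intelligence." and "intelligence" would count as two different words, which is usually not what we want when counting frequencies.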

We can also split the text into sentences, as we had it originally.



In [0]:
sentence_split=nltk.tokenize.sent_tokenize(paragraph)
print (sentence_split)

Now we can tokenize each sentence separately and store the results in a new list.

In [0]:
list_sentence_tokens=[]
for sentence in sentence_split:
  list_sentence_tokens.append(nltk.tokenize.word_tokenize(sentence))
for sentence_tokens in list_sentence_tokens:
  print (sentence_tokens)

Now that our whole text is tokenized and split into sentences, we can check, for example, how many sentences contain the word "learning".

In [0]:
count_word=0
for sentence_tokens in list_sentence_tokens:
  if "learning" in sentence_tokens:
    count_word+=1
print ("Number of sentences containing 'learning': "+str(count_word))
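
The same count can be written more compactly with `sum` and a generator expression. This is a sketch that uses hand-written token lists (`toy_sentence_tokens`, a made-up name) in place of `list_sentence_tokens`, so that it runs on its own:

```python
# Hand-written token lists standing in for list_sentence_tokens
toy_sentence_tokens = [
    ["Machine", "learning", "is", "fun", "."],
    ["It", "is", "a", "subset", "of", "AI", "."],
    ["Machine", "learning", "algorithms", "build", "models", "."],
]

# Each membership test yields True/False, and True counts as 1 in a sum
count_word = sum("learning" in tokens for tokens in toy_sentence_tokens)
print("Number of sentences containing 'learning': " + str(count_word))
```

Note that the membership test is case-sensitive: a sentence containing only "Learning" would not be counted.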

Sometimes we may want to preprocess the text further. For example, we can get the *lemma* of each word, which is its canonical or dictionary form (e.g. "models" -> "model"), or convert each lemma to lowercase (e.g. "Machine" -> "machine").





In [0]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [0]:
list_sentence_lemmas_lower=[]
for sentence_tokens in list_sentence_tokens:
  list_lemmas=[]
  for token in sentence_tokens:
    list_lemmas.append(lemmatizer.lemmatize(token).lower())
  list_sentence_lemmas_lower.append(list_lemmas)
print (list_sentence_lemmas_lower)
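
The nested loop above can also be written as a nested list comprehension. The sketch below uses a hypothetical `toy_lemmatize` function, backed by a tiny hand-made lookup table, in place of the WordNet lemmatizer so that it runs without any downloads. It only illustrates the structure of the loop, not real lemmatization.

```python
# Tiny lookup table standing in for a real lemmatizer (illustration only)
TOY_LEMMAS = {"models": "model", "algorithms": "algorithm", "systems": "system"}

def toy_lemmatize(token):
    # Return the dictionary form if known, otherwise the token unchanged
    return TOY_LEMMAS.get(token, token)

toy_sentence_tokens = [
    ["Machine", "learning", "algorithms", "build", "models", "."],
    ["Computer", "systems", "use", "them", "."],
]

# Outer comprehension: one list per sentence; inner: one lemma per token
toy_lemmas_lower = [
    [toy_lemmatize(token).lower() for token in sentence_tokens]
    for sentence_tokens in toy_sentence_tokens
]
print(toy_lemmas_lower)
```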

**Exercise (optional):**
Repeat the same text preprocessing procedure with the Python library [spacy](https://spacy.io/). spaCy is an advanced library for text preprocessing and natural language processing. It tends to be more accurate than NLTK on standard text preprocessing tasks (e.g. lemmatization).

It is sometimes hard to keep track of all the operations we perform. For this, we make use of "functions". In our case, it would be useful to have a function that takes a text (string) as input and returns a list of lemmas as output.

In [0]:
def get_list_tokens(string):
  sentence_split=nltk.tokenize.sent_tokenize(string)
  list_tokens=[]
  for sentence in sentence_split:
    list_tokens_sentence=nltk.tokenize.word_tokenize(sentence)
    for token in list_tokens_sentence:
      list_tokens.append(lemmatizer.lemmatize(token).lower())
  return list_tokens

Let's check how this function works with our running example.

In [0]:
print (get_list_tokens(paragraph))

Note that with this function you don't keep the information about the individual sentences.

**Exercise (optional):** Modify this function slightly so that the sentences are kept separate, as in the previous examples.




## VECTOR MANIPULATION WITH NUMPY

---


Machine learning models typically take *vectors* as input, which can be viewed as arrays of numbers. For this the *numpy* library is essential, as it provides many useful tools for manipulating vectors. Let's initialize some vectors and perform a basic operation on them (e.g. a sum).



In [0]:
a=np.zeros(3)
b=np.arange(3)
print (a)
print (b)
print (a+b)
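
Beyond addition, numpy applies most arithmetic operators position by position. A short sketch of a few other common operations (the variable names are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.arange(3)              # array([0, 1, 2])

elementwise_product = a * b   # multiplies position by position
scaled = 2 * a                # multiplies every entry by a scalar
dot_product = np.dot(a, b)    # sum of the element-wise products
total = a.sum()               # sum of all entries

print(elementwise_product, scaled, dot_product, total)
```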

Now we are going to create a vector based on the vocabulary of the three sentences we used earlier. Each word corresponds to a position in the vector, where the value is defined by its frequency (i.e. number of occurrences). 

---





In [0]:
dict_freq_tokens={}
for sentence in list_sentence_lemmas_lower:
  for token in sentence:
    if token in dict_freq_tokens: dict_freq_tokens[token]+=1
    else: dict_freq_tokens[token]=1
vector_paragraph=np.zeros(len(dict_freq_tokens))
list_tokens=list(dict_freq_tokens.keys())
print (list_tokens)
for i in range(len(list_tokens)):
  vector_paragraph[i]=dict_freq_tokens[list_tokens[i]]
print (vector_paragraph)
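
The counting loop above can also be expressed with `collections.Counter` from the standard library. This is a minimal sketch with hand-written lemma lists (`toy_sentence_lemmas`, a made-up name) in place of `list_sentence_lemmas_lower`:

```python
from collections import Counter

toy_sentence_lemmas = [
    ["machine", "learning", "is", "a", "field", "."],
    ["machine", "learning", "build", "a", "model", "."],
]

# Count every token across all sentences
freq = Counter(token for sentence in toy_sentence_lemmas for token in sentence)

# The vocabulary keeps the order in which tokens were first seen
vocab = list(freq.keys())
vector = [freq[token] for token in vocab]

print(vocab)
print(vector)
```

Position `i` of `vector` holds the frequency of `vocab[i]`, which is the same correspondence as between `list_tokens` and `vector_paragraph`.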

We can also compute a vector for each of the sentences, using the same vocabulary ("list_tokens").

In [0]:
dict_freq_tokens={}
count_sent=0
for sentence in list_sentence_lemmas_lower:
  count_sent+=1
  for token in sentence:
    if token in dict_freq_tokens: dict_freq_tokens[token]+=1
    else: dict_freq_tokens[token]=1
  for i in range(len(list_tokens)):
    token_vocab=list_tokens[i]
    if token_vocab in dict_freq_tokens: vector_paragraph[i]=dict_freq_tokens[token_vocab]
    else: vector_paragraph[i]=0
  print ("Sentence "+str(count_sent)+": "+str(vector_paragraph))
  dict_freq_tokens.clear()
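
The cell above overwrites `vector_paragraph` for each sentence, so only the last sentence vector survives. One possible alternative, sketched here with a toy vocabulary and toy sentences that are not taken from the notebook, is to store one row per sentence in a two-dimensional array:

```python
import numpy as np

toy_vocab = ["machine", "learning", "model"]
toy_sentence_lemmas = [
    ["machine", "learning", "is", "fun"],
    ["a", "model", "is", "a", "model"],
]

# One row per sentence, one column per vocabulary entry
matrix = np.zeros((len(toy_sentence_lemmas), len(toy_vocab)))
for row, sentence in enumerate(toy_sentence_lemmas):
    for col, word in enumerate(toy_vocab):
        # list.count returns how often the word occurs in the sentence
        matrix[row, col] = sentence.count(word)

print(matrix)
```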

More generally, we may be interested in having a function that, given a predefined vocabulary, computes a frequency vector for any given text.

**Exercise 1:** Create a function that takes a vocabulary (as a list of words/lemmas) and a text (string) and outputs its corresponding vector. Test the function with the following inputs: 

1.   vocabulary=['cat', 'dog', 'machine', 'field']
2.   string="Machine learning is a field where we study how machines learn."

Hint: the correct output for this example should be the vector (0,0,2,1)





In [0]:
def get_vector_text(list_vocab,string):
  #To complete here
  pass

Sometimes we may be interested in how close or similar certain vectors are. For this we can use either the Euclidean distance or the cosine similarity.
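
For reference, for two vectors $a$ and $b$ with $n$ components each, the standard definitions of these two measures are:

$$\cos(a,b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}, \qquad d(a,b) = \lVert a - b \rVert = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$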



In [0]:
#Cosine similarity
def cos_sim(a,b):
  return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

#Euclidean distance
def euc_dist(a,b):
  return np.linalg.norm(a-b)


In [0]:
a=np.array([1, 2, 3])
b=np.array([10, 21, 32])

print ("Cosine similarity: "+str(cos_sim(a,b)))
print ("Euclidean distance: "+str(euc_dist(a,b)))


*Reminder:* The cosine similarity takes values between -1 and 1, and only measures the angle between the vectors (the magnitude of the vectors is irrelevant).
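
The following sketch checks these properties numerically. The two functions are repeated from the cell above so that the example runs on its own; the test vectors are arbitrary.

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euc_dist(a, b):
    return np.linalg.norm(a - b)

a = np.array([1.0, 2.0, 3.0])

# Same direction, different length: similarity stays at 1, distance grows
same_direction = cos_sim(a, 10 * a)
distance_scaled = euc_dist(a, 10 * a)

# Opposite direction gives -1; perpendicular vectors give 0
opposite = cos_sim(a, -a)
perpendicular = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, perpendicular, distance_scaled)
```

One caveat: if either vector is all zeros (for instance, a text containing no word from the vocabulary), the denominator is zero and the cosine similarity is undefined.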

**Exercise 2:** With the function defined in Exercise 1 and with the vocabulary ('cat','dog','machine','field'), compute the cosine similarity of the following string pairs:

1.   "Machine learning is a field where we study how machines learn." and "The machine is not working."
2.   "My favorite animals are dogs and cats" and "The machine is not working."
3.   "My favorite animals are dogs and cats" and "What can we do with the cat and the dog? The cat is always fighting with the dog."

In [0]:
#To complete here
