<a href="https://colab.research.google.com/github/joemcl81/google_colab/blob/main/collabquiz_20224885.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Joseph Mc Laughlin, 20224885

Collaborative Quiz of The Week

Q1: 

Using an example, explain the difference between Type I and Type II Errors in the context of Regular expressions.

No statistical hypothesis test is ever 100% certain, because its conclusions rest on probabilities.

When online marketers and scientists run hypothesis tests, both look for statistically significant results. This means that the observed result would be unlikely if the null hypothesis (H0) were true, judged against a chosen confidence level (typically 95%). A statistically significant result cannot prove that a research hypothesis is correct (as this implies 100% certainty). Because a p-value is based on probabilities, there is always a chance of reaching the wrong conclusion about whether or not to reject the null hypothesis.

Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur.

**Type 1**

Type 1 errors, often equated with false positives, happen in hypothesis testing when the null hypothesis is true but is rejected. The null hypothesis is a general statement or default position that there is no relationship between two measured phenomena.

Simply put, type 1 errors are “false positives”: they happen when the tester reports a statistically significant difference even though there isn’t one.

**Type 2**


If type 1 errors are commonly referred to as “false positives”, type 2 errors are referred to as “false negatives”. Type 2 errors happen when you wrongly conclude that there is no winner between a control version and a variation although there actually is one. In more statistically accurate terms, type 2 errors happen when the null hypothesis is false and you fail to reject it.

If the probability of making a type 1 error is denoted by “α”, the probability of a type 2 error is “β”. Beta is tied to the power of the test (i.e. the probability of not committing a type 2 error), which is equal to 1-β.
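The relationship between α, β and power can be checked with a small simulation. This is only a sketch under assumptions that are not in the original answer: a two-sided z-test with known standard deviation 1, samples of size 30, and a true mean of 0.5 when H0 is false. The function names are illustrative.

```python
import random
import statistics


def rejects_h0(sample, sigma=1.0, z_crit=1.96):
    """Two-sided z-test of H0: mean == 0, known sigma, alpha = 0.05."""
    z = statistics.fmean(sample) / (sigma / len(sample) ** 0.5)
    return abs(z) > z_crit


def rejection_rate(true_mean, n=30, trials=2000, seed=0):
    """Fraction of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = sum(
        rejects_h0([rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return rejections / trials


# H0 is true here, so every rejection is a Type 1 error (rate close to alpha)
type1_rate = rejection_rate(true_mean=0.0)

# H0 is false here, so every rejection is correct (this rate is the power)
power = rejection_rate(true_mean=0.5, seed=1)
type2_rate = 1 - power  # beta: how often the real effect is missed

print("Type 1 rate:", type1_rate)
print("Power:", power, "Type 2 rate:", type2_rate)
```

With these settings the Type 1 rate should land near 0.05, while the Type 2 rate depends on the sample size and the size of the true effect.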

**Type 1 and Type 2 Errors in Regex**

The code below demonstrates how Type 1 and Type 2 errors can occur when using Regex. The aim is to use regex to match all instances of the word "the" in a text. 


In [1]:
#Demonstration
import re
text = "The other one there, the blithe one."

x = re.findall("the", text) #Misses the capitalized "The"
y = re.findall("[tT]he", text) #Incorrectly matches inside other, there and blithe
z = re.findall("[^A-Za-z][tT]he[^A-Za-z]", text) #Only matches 'the' with a non-alphabetic character on both sides
print(x)
print(y)
print(z)

['the', 'the', 'the', 'the']
['The', 'the', 'the', 'the', 'the']
[' the ']


This code demonstrates both Type 1 and Type 2 errors in regex, along with an attempt to fix them. y shows Type 1 (false positive) errors: the pattern matches "the" inside words we are not looking for (other, there and blithe). x shows a Type 2 (false negative) error: the pattern misses "The" because of the capital letter (it also returns the same false positives as y). z tries to address both by only matching 'the' when it has a non-alphabetic character on each side. This removes the false positives, but because the pattern needs a character before the word, it still misses the sentence-initial "The", and the returned match includes the surrounding spaces.
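A word-boundary anchor is one possible way to also catch a "The" that has no character in front of it, as at the start of this text. The sketch below is an assumption about what a fuller solution could look like, not part of the original answer: `\b` matches at the edge of a word without consuming a character, so no surrounding spaces are returned. The variable names are illustrative.

```python
import re

text = "The other one there, the blithe one."

# \b asserts a word boundary without consuming any character
bounded = re.findall(r"\b[tT]he\b", text)

# Similar idea with a case-insensitive flag (this would also accept "THE")
bounded_ignorecase = re.findall(r"\bthe\b", text, flags=re.IGNORECASE)

print(bounded)             # ['The', 'the']
print(bounded_ignorecase)  # ['The', 'the']
```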

These types of errors must always be dealt with when developing NLP solutions. There are two antagonistic approaches to reducing the error rates: minimising false positives by increasing accuracy/precision, and minimising false negatives by increasing coverage/recall.
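To make the trade-off concrete, precision and recall can be computed for the three patterns from the demonstration. This is a minimal sketch: the `gold` spans are hand-labelled here as an assumption (the two standalone occurrences of the word), and each pattern is wrapped in a capture group so that only the span of the word itself is compared.

```python
import re

text = "The other one there, the blithe one."

# Hand-labelled character spans of the standalone "The" and "the" (assumed gold standard)
gold = {(0, 3), (21, 24)}


def precision_recall(pattern):
    """Compare the spans captured by group 1 against the gold spans."""
    found = {m.span(1) for m in re.finditer(pattern, text)}
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold)
    return precision, recall


patterns = {
    "the": r"(the)",
    "[tT]he": r"([tT]he)",
    "[^A-Za-z][tT]he[^A-Za-z]": r"[^A-Za-z]([tT]he)[^A-Za-z]",
}
results = {name: precision_recall(p) for name, p in patterns.items()}

for name, (precision, recall) in results.items():
    print(name, "precision:", precision, "recall:", recall)
```

On this text the first pattern scores 0.25 precision and 0.5 recall, the second 0.4 and 1.0, and the third 1.0 and 0.5, which shows how tightening a pattern trades recall for precision.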

Q2:

Count the number of Types and Tokens in the following sentence: “NLP is the art of analyzing and understanding human languages by machines.”

In [2]:
#Solution
import keras
from keras.preprocessing.text import Tokenizer

num_words = 50
text = ["NLP is the art of analyzing and understanding human languages by machines."]
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(text)
word_index = tokenizer.word_index    #One entry per distinct word (type)
word_counts = tokenizer.word_counts  #Number of occurrences of each word

print("Word index:\n", word_index)
print('There are %s tokens in the text.' % sum(word_counts.values()))  #Tokens: every occurrence counts
print('There are %s types in the text.' % len(word_index))  #Types: distinct words only



Word index:
 {'nlp': 1, 'is': 2, 'the': 3, 'art': 4, 'of': 5, 'analyzing': 6, 'and': 7, 'understanding': 8, 'human': 9, 'languages': 10, 'by': 11, 'machines': 12}
There are 12 tokens in the text.
There are 12 types in the text.
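The two counts only differ when a word repeats, which this sentence does not show. As a minimal sketch without Keras, assuming that tokens are lowercased alphabetic words and punctuation is not counted (other conventions exist), the helper below (a hypothetical name) counts both and adds a second sentence with a repeated word for contrast.

```python
import re
from collections import Counter


def count_types_and_tokens(sentence):
    """Return (types, tokens) for lowercased alphabetic words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    counts = Counter(words)
    return len(counts), len(words)


quiz_sentence = "NLP is the art of analyzing and understanding human languages by machines."
print(count_types_and_tokens(quiz_sentence))          # (12, 12)

# A repeated word adds a token but not a type
print(count_types_and_tokens("the cat saw the dog"))  # (4, 5)
```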


Q3:

Explain the difference between lemmatization and stemming and provide an example for each.

**What is Stemming?**

Stemming is a kind of normalization for words. Normalization is a technique in which the different surface forms of a word are reduced to one common form, so that they can be looked up and counted as the same item. Words that have the same meaning but vary according to the context or sentence are normalized to that form.

In other words, there is one root word but many variations of it. For example, the root word is "wait" and its variations are "waits", "waiting" and "waited". With the help of stemming, we can reduce any of these variations to the root word.

In [3]:
#Stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()
e_words = ["wait", "waiting", "waited", "waits"]
for w in e_words:
    rootWord = ps.stem(w)  #Strip the suffix to get the stem
    print("Stemming for {} is {}".format(w, rootWord))

Stemming for wait is wait
Stemming for waiting is wait
Stemming for waited is wait
Stemming for waits is wait


**Discussion of Output:**

Stemming is a data-preprocessing step. The English language has many variations of a single word, e.g. wait. These variations create ambiguity in machine learning training and prediction. To create a successful model, it's vital to filter such words and convert them to the same form using stemming. It is also an important technique for extracting raw data from a set of sentences and removing redundant data, which is also known as normalization.

**What is Lemmatization?** 

Lemmatization is the algorithmic process of finding the lemma of a word depending on its meaning. Lemmatization usually refers to the morphological analysis of words, which aims to remove inflectional endings. It returns the base or dictionary form of a word, which is known as the lemma. The NLTK lemmatization method is based on WordNet's built-in morphy function. Text preprocessing can include both stemming and lemmatization.

In [4]:
#Lemmatization
import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
wl = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wl.lemmatize(w)))  #No POS tag given, so each word is treated as a noun

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


**Discussion of output:**

If you stem studies and studying, the output is the same for both (studi), which is not a dictionary word. The lemmatizer instead returns real words: study for studies and studying for studying. Studying is left unchanged because lemmatize treats every word as a noun unless a part-of-speech tag is passed. So when we need to build a feature set to train a model, lemmatization is often the better choice.

**Stemming vs Lemmatizer**

Many people find the two terms confusing. Some treat them as the same, but there is a difference between the two. A stemming algorithm works by cutting the suffix from the word; in a broader sense, it cuts either the beginning or the end of the word. A lemmatizer instead maps each word to its dictionary form, which is why lemmatization is often preferred over stemming. For example, bicycle and bicycles are both converted to the base word bicycle. Basically, it converts all words that have the same meaning but different representations to their base form. This minimizes text ambiguity, reduces the number of distinct words in the given text, and helps in preparing accurate features for training a model. The cleaner the data, the more accurate your machine learning model will be. A smaller vocabulary also saves memory and computational cost.
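The contrast between cutting suffixes and looking up a dictionary form can be sketched with two toy functions. These are illustrative assumptions only: `toy_stem` is a crude suffix stripper, not the Porter algorithm, and `toy_lemmatize` uses a tiny hand-written table rather than WordNet.

```python
def toy_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    for suffix in ("ing", "ies", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written lookup table standing in for a real dictionary (assumption)
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "cries": "cry",
    "bicycles": "bicycle",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "studying", "cries", "bicycles"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems here (stud, study, crie, bicycle) are not always real words, while the looked-up lemmas are; the price is that a lemmatizer needs a dictionary to consult.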