# <font color='SEAGREEN'>Day 5</font>
# <font color='MEDIUMSEAGREEN'>Improving the Accuracy</font>

Two learning algorithms have been implemented so far.

Now, let's see whether we can improve the accuracy by working with cleaner data!

## Data
Going back to the data: from our previous research, we know that we can use stemming and lemmatization to clean up our collection of words by merging different forms of the same word.

### Stemming and Lemmatization

Remember that the goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

A key difference between stemming and lemmatization is that stemming looks at the current word only, while lemmatization also takes the context (such as the part of speech) into consideration. Either way, this pre-processing step can be tedious to do by hand. Luckily, the powerful `nltk` provides tools for both.
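
To make the difference concrete, here is a toy sketch. It is not how `nltk` works internally: the lookup table `TOY_LEMMAS` and the functions `toy_stem` and `toy_lemmatize` are made up for illustration. The stemmer only sees the string, while the lemmatizer is also told the part of speech.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative only.
TOY_LEMMAS = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_stem(word):
    # Looks at the word alone: strip a trailing "ing" if present.
    return word[:-3] if word.endswith("ing") else word

def toy_lemmatize(word, pos):
    # Uses context (the part of speech) to pick the base form.
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("meeting"))            # same answer whatever the context
print(toy_lemmatize("meeting", "n"))  # noun reading
print(toy_lemmatize("meeting", "v"))  # verb reading
print(toy_lemmatize("saw", "v"))      # irregular form needs a lookup
```

Notice that the toy stemmer cannot tell "a meeting" from "they are meeting", and no suffix rule would ever turn "saw" into "see".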

#### Stemming using the Porter stemmer
*Porter's algorithm*, published by Martin Porter in 1980, is one of the most commonly used stemmers.
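
Porter's algorithm works by applying suffix-rewriting rules in a fixed sequence of steps. As a rough illustration, the sketch below implements only the very first step (known as step 1a, which handles plural endings). The function name `porter_step_1a` is ours, not part of `nltk`, and the full algorithm has several more steps and conditions, so expect `nltk`'s results to differ on many words.

```python
def porter_step_1a(word):
    """Apply only step 1a of Porter's algorithm (plural suffixes)."""
    # Rules are tried in order; the first suffix that matches wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    # No rule matched: leave the word unchanged.
    return word

for w in ["caresses", "ponies", "caress", "cats", "feet"]:
    print("{:s} --> {:s}".format(w, porter_step_1a(w)))
```

Because the rules only look at the letters at the end of the word, an irregular plural such as "feet" passes through untouched.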

In [None]:
import nltk
from nltk.stem.porter import PorterStemmer

# Get the Porter stemmer
stemmer = PorterStemmer()

# Let's try stemming on plurals
plurals = ['apples', 'batteries', 'generators', 'medicines', 'tests', 'feet']
print('plurals:')
for plural in plurals:
    print('{:s} --> {:s}'.format(plural, stemmer.stem(plural)))
print()
    
# and variations of verbs
verbs = ['studies', 'thinks', 'goes', 'played', 'bought', 'went', 'ran', 'drew', ]
print('verbs:')
for verb in verbs:
    print('{:s} --> {:s}'.format(verb, stemmer.stem(verb)))

You can add more words to `plurals` and see what the stemming results look like.  
The results may look a bit mechanical. This is because Porter's algorithm is essentially a sequential application of a set of rules. To get better-looking results, let's try out a lemmatizer.

In [None]:
# Uncomment and run the following line when you run this cell for the first time:
# nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer

# Get the lemmatizer
lmtzr = WordNetLemmatizer()

# Lemmatize the plurals
print('plurals:')
for plural in plurals:
    print('{:s} --> {:s}'.format(plural, lmtzr.lemmatize(plural)))
print()

# Lemmatize the verbs
# (lemmatize() treats every word as a noun unless you pass pos, e.g. pos='v')
print('verbs:')
for verb in verbs:
    print('{:s} --> {:s}'.format(verb, lmtzr.lemmatize(verb)))

Not yet perfect, but much better, especially for the plurals. Hooray! :)

Let's check whether stemming or lemmatization would improve our classification accuracy or not.

Load the data.

In [None]:
# Your code

Find the tf values, using either lemmatization or stemming to normalize the words first.
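
If you are unsure how to plug a stemmer or lemmatizer into the counting step, here is one possible shape for it. This is a sketch under assumptions, not the required solution: it assumes each document is a plain string, splits tokens on whitespace, and defines tf as the relative frequency of a term within the document (use whichever definition you used earlier). The names `term_frequencies` and `strip_plural_s` are hypothetical, and `strip_plural_s` is only a stand-in for `stemmer.stem` or `lmtzr.lemmatize`.

```python
from collections import Counter

def term_frequencies(document, normalize=str.lower):
    """Return {term: relative frequency} for one document."""
    # Normalize every token before counting, so word variants are merged.
    tokens = [normalize(tok) for tok in document.split()]
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def strip_plural_s(token):
    # Stand-in normalizer: lowercase and drop a single trailing "s".
    token = token.lower()
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

tf = term_frequencies("Cats chase cats and a cat sleeps", normalize=strip_plural_s)
print(tf)
```

With normalization, "Cats", "cats" and "cat" are counted as the same term, which is exactly the effect we hope will help the classifier.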

In [None]:
# Your code

Find the idf values.

In [None]:
# Your code

Find the tf-idf values.

In [None]:
# Your code

Export your result table to ``X2.csv``

In [None]:
# Your code

Implement the Naive Bayes Classifier on the new data and report the accuracy.

In [None]:
# Your code

**Advanced Exercise:** Synonyms

It would be really cool to be able to check if two words are synonyms when comparing them. NLTK's WordNet (a corpus) allows us to find synsets (synonym sets) which we can use to do just that. Let's try to write a function to check whether two words are synonyms. 

For help, see [this tutorial](https://www.geeksforgeeks.org/get-synonymsantonyms-nltk-wordnet-python/).

In [None]:
from nltk.corpus import wordnet 

def synonym_check(word1, word2):
    # TODO: get all synsets from word 1
    
        # TODO: get lemmas for this synset
        
            # TODO: compare the name for this lemma to word 2
            
                
                # TODO: return True if the same 
    
    # TODO: otherwise, return False

print(synonym_check("intelligence", "understanding"))
print(synonym_check("machine", "automation"))
print(synonym_check("robot", "human"))

You should get False False False. 

Hmm... we can see that this method isn't as robust as we might like. Another way to check for synonyms is to compute a similarity score between the two words and call them synonyms when the score reaches a chosen threshold.

In [None]:
from nltk.corpus import wordnet

def synonym_check_2(word1, word2):
    # the maximum similarity found so far
    max_wup = 0
    # gets all possible synsets for word1
    w1 = wordnet.synsets(word1)
    # gets all possible synsets for word2
    w2 = wordnet.synsets(word2)
    # TODO: for synset in w1
    
        # TODO: for synset in w2
        
            # TODO: get wup_similarity between the two
            
            
            # TODO: if wup_similarity is greater than the previous maximum, update it
            # (note: wup_similarity may return None when no common ancestor is found)

    threshold = # TODO: set threshold
    if max_wup >= threshold:
        return max_wup, True
    else:
        return max_wup, False

In [None]:
print(synonym_check_2("intelligence", "understanding"))
print(synonym_check_2("machine", "automation"))
print(synonym_check_2("robot", "human"))

**Advanced Exercise:**

Try to improve the data using the `synonym_check_2` function, then report the accuracy with the Naive Bayes Classifier.

In [None]:
# Your code

**Advanced Exercise (Optional):**

Some English words occur together more frequently than others, for example: sky high, best performance, heavy rain. Identifying such pairs of words in a text document can help with sentiment analysis. First, we need to generate word pairs from each sentence while keeping the words in their original order. Such pairs of adjacent words are called bigrams. The NLTK library has a `bigrams` function that helps us generate these pairs.
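
To see what a bigram list looks like before working on the real data, here is a small pure-Python sketch. The helper name `make_bigrams` is ours; it assumes the sentence has already been split into a list of tokens.

```python
def make_bigrams(tokens):
    """Return adjacent token pairs, preserving their original order."""
    # Pair each token with the one that follows it.
    return list(zip(tokens, tokens[1:]))

sentence = "heavy rain fell from the sky"
pairs = make_bigrams(sentence.split())
print(pairs)
```

`nltk.bigrams(tokens)` produces the same pairs, but as a generator, so wrap it in `list(...)` if you want to print or reuse the result.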

Find the bigrams in your data.

In [None]:
# Your code

In [None]:
print("Now, you can call me a scientist! Drop the MIC!!!")