## Instructions

This notebook consists of 5 parts. The point value of each part is indicated in the section header.

Some executable cells contain code, while other executable cells have a comment 'Challenge Cell' and a point value.

The executable cells with code are not worth any points, but they must be run so that the challenge cells execute successfully.

In [None]:
# load the necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Part 1: Pre-Midterm (Modules 01-06)
Total Part Value: 10 Points

## Challenge: NLP Pipeline (8 Points)

The twenty newsgroups dataset will be used for this lab. It is already loaded in the sklearn library.

In [None]:
# load the necessary data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
count_vect = CountVectorizer()
print("twenty_train ready")

In [None]:
# Fit the Data (2.0 Points)
# Fit the twenty_train.data to CountVectorizer and store it in a variable X_train_counts, and print the shape
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)

Now, let us obtain the tfidf features from the text.

In [None]:
# Calculate TF-IDF (2.0 Points)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

# Fit the data to TfidfTransformer and store it in a variable X_train_tfidf, and print the shape
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)
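As a rough illustration of what the TF-IDF step computes, the sketch below reproduces the weighting on a toy corpus in plain Python. It assumes the scikit-learn default settings (smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The toy corpus, the whitespace tokenization, and the function name `tfidf_rows` are made up for this example and are simpler than what CountVectorizer does.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # smoothed idf, assumed to match TfidfTransformer(smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row; guard against an all-zero row
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

toy_docs = ["the cat sat", "the dog sat", "the bird flew"]  # hypothetical corpus
toy_vocab, toy_rows = tfidf_rows(toy_docs)
for term, weight in zip(toy_vocab, toy_rows[0]):
    print(f"{term:>5}: {weight:.3f}")
```

In the first toy document, the rare word "cat" ends up with a larger weight than "the", which appears in every document.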

Now, we will use the Naive Bayes classifier to classify text. It is also available in the sklearn library.

In [None]:
# Create Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
clf

The Pipeline class is available in the sklearn library. It takes a list of named steps and applies them in order:

- ('vect', CountVectorizer())
- ('tfidf', TfidfTransformer())
- ('clf', MultinomialNB())
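
To make the idea of chaining steps concrete, here is a minimal sketch of what a pipeline does when fitting and predicting. `MiniPipeline`, `Lowercase`, and `KeywordClassifier` are hypothetical stand-ins written for this example, not sklearn's implementation: every step except the last transforms the data, and the last step is the estimator.

```python
class MiniPipeline:
    """Hypothetical, simplified stand-in for sklearn.pipeline.Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # fit and apply every transformer, then fit the final estimator
        for _, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # apply the already-fitted transformers, then predict
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


class Lowercase:
    """Toy transformer: lowercases every text."""

    def fit_transform(self, X):
        return self.transform(X)

    def transform(self, X):
        return [text.lower() for text in X]


class KeywordClassifier:
    """Toy estimator: remembers the label of each training word."""

    def fit(self, X, y):
        self.word_label = {}
        for text, label in zip(X, y):
            for word in text.split():
                self.word_label.setdefault(word, label)
        return self

    def predict(self, X):
        out = []
        for text in X:
            labels = [self.word_label[w] for w in text.split() if w in self.word_label]
            out.append(max(set(labels), key=labels.count) if labels else None)
        return out


pipe = MiniPipeline([("lower", Lowercase()), ("clf", KeywordClassifier())])
pipe.fit(["Hockey puck", "Rocket orbit"], ["sport", "space"])
toy_predictions = pipe.predict(["ROCKET launch", "hockey game"])
print(toy_predictions)
```

The same raw text goes in at both fit and predict time; the pipeline takes care of running the preprocessing steps first.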

In [None]:
# Create Pipeline for MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
text_clf

In [None]:
# Fit Pipeline
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
text_clf

In [None]:
# Evaluate Bayes Predictions
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

You are given the following steps of the pipeline, in order:

- ('vect', CountVectorizer())
- ('tfidf', TfidfTransformer())
- ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42))

Now, your task is to design the <b>pipeline</b>.

In [None]:
# Create the Pipeline for SVM using SGD (2.0 Points)
from sklearn.linear_model import SGDClassifier
# Similar to 'Create Pipeline for MultinomialNB', design the pipeline using SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42)
# and store it in a variable text_clf_svm

text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # step name matches the 'clf-svm' step given in the instructions above
    ('clf-svm', SGDClassifier(
        loss='hinge',
        penalty='l2',
        alpha=1e-3,
        max_iter=5,
        random_state=42
    ))
])

In [None]:
# Evaluate SVM Predictions (1.0 Points)
text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
# Similar to 'Evaluate Bayes Predictions' (above), predict the labels for twenty_test data
# Specifically, use the function text_clf_svm.predict, the test data can be referenced by twenty_test.data
predicted_svm = text_clf_svm.predict(twenty_test.data)

# prediction accuracy
np.mean(predicted_svm == twenty_test.target)

## Challenge: Define a sequential model

In [None]:
# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import plot_model

In [None]:
# The model you need to make (2 Points)
# fill in the parameters so that you can create the model above
model = Sequential() # initialize sequential model
model.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'tanh', input_dim = 15)) # Dense layer with 10 Neurons
model.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'tanh')) # Dense layer with 32 Neurons
model.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'tanh')) # Dense layer with 32 Neurons
model.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid')) # Dense output layer with 1 neuron, sigmoid activation
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
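
As a sanity check on the architecture above, the trainable parameter count of each Dense layer can be worked out by hand: inputs times units for the weights, plus one bias per unit. A small sketch follows, assuming the layer sizes used in the cell above (15 inputs, then 10, 32, 32 and 1 units); `dense_params` is a helper name made up here.

```python
def dense_params(n_inputs, units):
    """Weights (n_inputs * units) plus biases (units) for one Dense layer."""
    return n_inputs * units + units

layer_units = [10, 32, 32, 1]   # units per Dense layer, as in the model above
n_inputs = 15                   # input_dim of the first layer

per_layer = []
for units in layer_units:
    per_layer.append(dense_params(n_inputs, units))
    n_inputs = units            # this layer's output feeds the next layer

print(per_layer, sum(per_layer))
```

These numbers can be compared against the per-layer counts that `model.summary()` prints.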


# Part 2: Recurrent Neural Network (Module 08)
Total Part Value: 10 Points

## Challenge: Bag of Words


Generate the vectors for the following sentences using the Bag of Words approach:
*   Here we go again
*   Go and play baseball
*   Baseball and tennis are popular here

In [None]:
# create the data and helper functions

docs = [
    'Here we go again',
    'Go and play baseball',
    'Baseball and tennis are popular here'
]
tokenized_docs = [sentence.split() for sentence in docs]
# vocabulary of unique lowercase words, in order of first appearance
# (duplicates would otherwise create columns that are always zero)
vocab = list(dict.fromkeys(word.lower() for sentence in tokenized_docs for word in sentence))

def vectorize(sentence):
  # one count per vocabulary word
  vectorized = [0] * len(vocab)
  for token in sentence.split():
    vectorized[vocab.index(token.lower())] += 1
  return vectorized

In [None]:
# Vectorize the sentences (2.0 Points)
vectorized_list = []
for sentence in docs:
  vectorized = vectorize(sentence)
  vectorized_list.append(vectorized)

vectorized_list
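
As a cross-check, the same kind of count vectors can be built with `collections.Counter`. This is a sketch under the assumption that each distinct lowercase word gets exactly one column, ordered by first appearance; `bow_vectors` and the `sample_` names are chosen for this example only.

```python
from collections import Counter

def bow_vectors(sentences):
    """Return (vocabulary, count vectors) for whitespace-tokenized sentences."""
    tokenized = [s.lower().split() for s in sentences]
    # unique words, in order of first appearance
    vocabulary = list(dict.fromkeys(tok for toks in tokenized for tok in toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

sample_docs = [
    'Here we go again',
    'Go and play baseball',
    'Baseball and tennis are popular here',
]
sample_vocab, sample_vectors = bow_vectors(sample_docs)
print(sample_vocab)
for vector in sample_vectors:
    print(vector)
```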

## Designing Neural Networks


We have given you the image of a model and its layers. The code is partially written for you. Your task is to fill in the blanks in the code block to create exactly the same model.

In [None]:
from keras.layers import SimpleRNN, Dense, LSTM
from keras.layers import Bidirectional

In [None]:
# Complete the Neural Network Model (4.0 Points)
# The model you need to make
model = Sequential() # initialize sequential model
model.add(LSTM(126, input_shape=(70,1), return_sequences=True)) # LSTM layer with 126 neurons
model.add(LSTM(63, return_sequences=True)) # LSTM layer with 63 neurons
model.add(LSTM(63)) # LSTM layer with 63 neurons
model.add(Dense(26,activation='relu')) # Dense layer with 26 neurons
model.add(Dense(18,activation='relu')) # Dense layer with 18 neurons, relu activation
model.add(Dense(1,activation='relu')) # Dense output layer with 1 neuron, relu activation
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)


Bidirectional RNN
- Design a sequential model that takes an input vector of shape (10,20)
- Add a bidirectional LSTM layer of 25 neurons
- Add another bidirectional LSTM layer of 15 neurons
- Add a dense layer of 10 neurons
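
The layer sizes listed above determine the output shapes and parameter counts, which can be worked out by hand before building the model. The sketch below assumes Keras defaults (an LSTM with biases and four gates, and Bidirectional merging by concatenation, so the output width is twice the unit count); the helper names are invented for this example.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm(input_dim, units):
    """Return (parameter count, output width) for a concat-merged BiLSTM."""
    return 2 * lstm_params(input_dim, units), 2 * units

timesteps, features = 10, 20
p1, width1 = bidirectional_lstm(features, 25)   # first layer, returns sequences
p2, width2 = bidirectional_lstm(width1, 15)     # second layer, last state only
p3 = width2 * 10 + 10                           # Dense(10): weights + biases

print("after layer 1:", (timesteps, width1), p1)
print("after layer 2:", (width2,), p2)
print("after dense:  ", (10,), p3, "total:", p1 + p2 + p3)
```

The first bidirectional layer has to return full sequences so that the second recurrent layer still receives one vector per timestep.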

In [None]:
# Implement the Neural Network Model (4.0 Points)
model = Sequential()
model.add(Bidirectional(LSTM(25, return_sequences=True), input_shape=(10,20)))
model.add(Bidirectional(LSTM(15)))
model.add(Dense(10))
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

# Part 3: Model Deployment (Module 09)
Total Part Value: 5 Points

## Simulating Streamlit
In Module 09, we used Streamlit to select a classification algorithm.
Your task is to use the code snippets below to implement the same functionality in this Jupyter Notebook, but using input provided by the 'input()' method instead of the Streamlit User Interface elements.

```
  # Code Snippet #1
  trainData = fetch_20newsgroups(subset='train', shuffle=True)
  print("SVM selected")
  classificationPipeline = Pipeline([('bow', CountVectorizer()), ('vector', TfidfTransformer()), ('classifier', SGDClassifier(loss='hinge', penalty='l1', alpha=0.0005, l1_ratio=0.17))])
  #https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
  classificationPipeline = classificationPipeline.fit(trainData.data, trainData.target)
  test_set = fetch_20newsgroups(subset='test', shuffle=True)
  dataPrediction = classificationPipeline.predict(test_set.data)
  print("SVM:")    
  print(np.mean(dataPrediction == test_set.target))
```



```
  # Code Snippet #2
  trainData = fetch_20newsgroups(subset='train', shuffle=True)
  print("Naive Bayes selected")
  classificationPipeline = Pipeline([('bow', CountVectorizer()), ('vector', TfidfTransformer()), ('classifier', MultinomialNB())])
  classificationPipeline = classificationPipeline.fit(trainData.data, trainData.target)
  test_set = fetch_20newsgroups(subset='test', shuffle=True)
  dataPrediction = classificationPipeline.predict(test_set.data)
  print("Accuracy of Naive Bayes:")
  print(np.mean(dataPrediction == test_set.target))
```
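
One way to structure the selection logic is a lookup table from option name to a function that runs the matching snippet. The sketch below is only an outline: `run_naive_bayes` and `run_svm` are placeholder functions standing in for Code Snippet #2 and Code Snippet #1, and the input normalization (ignoring case and surrounding spaces) is an assumption about how forgiving the prompt should be.

```python
def run_naive_bayes():
    # placeholder: the body of Code Snippet #2 would go here
    return "Naive Bayes selected"

def run_svm():
    # placeholder: the body of Code Snippet #1 would go here
    return "SVM selected"

# option name (lowercase) -> function that runs the matching snippet
CLASSIFIERS = {"naive bayes": run_naive_bayes, "svm": run_svm}

def dispatch(choice):
    """Run the function registered for the user's choice."""
    key = choice.strip().lower()
    if key not in CLASSIFIERS:
        raise ValueError(f"Unknown classification method: {choice!r}")
    return CLASSIFIERS[key]()

print(dispatch("  SVM "))
```

Adding a third classifier then only requires one more entry in the table rather than another branch.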



In [None]:
# retrieve the user's desired classification method
cs = ["Naive Bayes","SVM"]
print("Pick a classification method:")
for option in cs:
  print(f"\t {option}")

classification_space = input()

In [None]:
# Implement the correct Code Snippet based on the user's choice (5.0 Points)
if classification_space == "Naive Bayes":
  trainData = fetch_20newsgroups(subset='train', shuffle=True)
  print("Naive Bayes selected")
  classificationPipeline = Pipeline([('bow', CountVectorizer()), ('vector', TfidfTransformer()), ('classifier', MultinomialNB())])
  classificationPipeline = classificationPipeline.fit(trainData.data, trainData.target)
  test_set = fetch_20newsgroups(subset='test', shuffle=True)
  dataPrediction = classificationPipeline.predict(test_set.data)
  print("Accuracy of Naive Bayes:")
  print(np.mean(dataPrediction == test_set.target))

elif classification_space == "SVM":
  trainData = fetch_20newsgroups(subset='train', shuffle=True)
  print("SVM selected")
  classificationPipeline = Pipeline([('bow', CountVectorizer()), ('vector', TfidfTransformer()), ('classifier', SGDClassifier(loss='hinge', penalty='l1', alpha=0.0005, l1_ratio=0.17))])
  #https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
  classificationPipeline = classificationPipeline.fit(trainData.data, trainData.target)
  test_set = fetch_20newsgroups(subset='test', shuffle=True)
  dataPrediction = classificationPipeline.predict(test_set.data)
  print("SVM:")
  print(np.mean(dataPrediction == test_set.target))

else:
  # the input matched neither option
  print(f"Unrecognized classification method: {classification_space}")


# Part 4: Chatbots (Module 10)
Total Part Value: 5 Points

## Customize a Chatbot

In [None]:
# create a chatbot
import nltk
import numpy as np
import random
import string # to process standard python strings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

f=!wget https://ethaneldridge.github.io/cisd410/M16-FinalExam/chatbots.txt
f=open('chatbots.txt','r',errors = 'ignore')

raw=f.read()
raw=raw.lower() # converts to lowercase
nltk.download('punkt') # first-time use only
nltk.download('wordnet') # first-time use only
nltk.download('punkt_tab') # Download punkt_tab resource
sent_tokens = nltk.sent_tokenize(raw) # converts to list of sentences
word_tokens = nltk.word_tokenize(raw) # converts to list of words

lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

# This removes the punctuation from sentences
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

def response(user_request):
    robo_response=''
    sent_tokens.append(user_request)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, token_pattern=None, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx=vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if(req_tfidf==0):
        robo_response=robo_response+"I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response+sent_tokens[idx]
        return robo_response
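
The `response` function above picks the corpus sentence most similar to the user's request and falls back to an apology when nothing overlaps. The following sketch shows the same retrieval idea in plain Python. It is a simplification that uses raw word counts instead of TF-IDF weights and skips lemmatization and stop-word removal; `cosine`, `best_match`, and the toy corpus are names and data invented for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(query, sentences):
    """Return the most similar sentence, or None when nothing overlaps."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(s.lower().split())), s) for s in sentences]
    score, sentence = max(scored, key=lambda pair: pair[0])
    return sentence if score > 0 else None

toy_corpus = [
    "a chatbot is a computer program",
    "chatbots simulate human conversation",
    "eliza was an early chatbot",
]
print(best_match("what is a chatbot", toy_corpus))
print(best_match("weather tomorrow", toy_corpus))
```

A similarity of exactly zero means the request shares no words with any sentence, which is the case the fallback message is meant for.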

In [None]:
# run the chatbot
flag=True
print("Hi. I will answer your queries about Chatbots. To exit, type Bye!")
while(flag==True):
    user_request = input()
    user_request=user_request.lower()
    if(user_request!='bye'):
        if(user_request=='thanks' or user_request=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_request)!=None):
                print("ROBO: "+greeting(user_request))
            else:
                print("ROBO: ",end="")
                print(response(user_request))
                sent_tokens.remove(user_request)
    else:
        flag=False
        print("ROBO: Bye! ")

We have successfully built our first chatbot. Your challenge is to now change this chatbot. For our example, we used the Wikipedia page for chatbots as our corpus. Now use the information from this page: https://www.chatcompose.com/what-are-chatbots.html as the chatbot corpus and retrain your chatbot.
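
Because that page is HTML rather than plain text, the markup would end up in the corpus unless it is stripped first. A minimal sketch using only the standard library's `html.parser` follows. The class name `TextExtractor`, the helper `html_to_text`, and the choice to skip script and style blocks are assumptions for this example, and a real page may need more cleanup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside skipped elements
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Chatbots</h1><p>A chatbot answers <b>questions</b>.</p></body></html>"
)
sample_text = html_to_text(sample_html)
print(sample_text)
```

The resulting string could then be lowercased and passed to the sentence tokenizer in place of the raw file contents.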

In [None]:
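# Customize the chatbot (5.0 Points)
# save the new page under its own name so the old chatbots.txt corpus is not reused
# note: the downloaded file is raw HTML, so its markup is part of the text that gets tokenized
!wget -O what-are-chatbots.html https://www.chatcompose.com/what-are-chatbots.html
f=open('what-are-chatbots.html','r',errors = 'ignore')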

raw=f.read()
raw=raw.lower() # converts to lowercase
nltk.download('punkt') # first-time use only
nltk.download('wordnet') # first-time use only
nltk.download('punkt_tab') # Download punkt_tab resource
sent_tokens = nltk.sent_tokenize(raw) # converts to list of sentences
word_tokens = nltk.word_tokenize(raw) # converts to list of words

lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

# This removes the punctuation from sentences
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up","hey",)
GREETING_RESPONSES = ["Howdy, what's up?", "Hey, how may I be of assistance?", "Hi, how's it going?", "Hello, how can I help you?", "I am glad to be chatting!"]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

def response(user_request):
    robo_response=''
    sent_tokens.append(user_request)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, token_pattern=None, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx=vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if(req_tfidf==0): # cosine similarity of TF-IDF vectors is never negative, so test for zero overlap
        robo_response=robo_response+"I'm not sure I understand. Can you elaborate or rephrase that please."
        return robo_response
    else:
        robo_response = robo_response+sent_tokens[idx]
        return robo_response

flag=True
print("Hi. I can answer your questions about Chatbots. To exit, type Bye!")
while(flag==True):
    user_request = input()
    user_request=user_request.lower()
    if(user_request!='bye'):
        if(user_request=='thanks' or user_request=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_request)!=None):
                print("ROBO: "+greeting(user_request))
            else:
                print("ROBO: ",end="")
                print(response(user_request))
                sent_tokens.remove(user_request)
    else:
        flag=False
        print("ROBO: Bye! ")

# Part 5: Translation (Module 11)
Total Part Value: 10 Points

## Use Argos Translate to perform Language Translation

In [None]:
# install the library
!pip install argostranslate


In [None]:
# import the libraries and create the helper functions
import argostranslate.package
import argostranslate.translate
argostranslate.package.update_package_index()
available_packages = argostranslate.package.get_available_packages()

for package in available_packages:
  if package.from_code == "en":
    print(f"Package {package} package.from_code {package.from_code}, package.to_code {package.to_code}, package.code {package.code}")

def translate(text, target_language):
  from_code = "en"
  package_to_install = next(
    filter(
      lambda x: x.from_code == from_code and x.to_code == target_language, available_packages
    )
  )
  argostranslate.package.install_from_path(package_to_install.download())
  return argostranslate.translate.translate(text, from_code, target_language)
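
Note that `next(filter(...))` raises `StopIteration` when no package matches the requested language pair. The sketch below shows one way to make that case explicit. It works on any objects that have `from_code` and `to_code` attributes; the `find_package` helper and the stand-in package objects are hypothetical and not part of Argos Translate.

```python
from types import SimpleNamespace

def find_package(packages, from_code, to_code):
    """Return the first package for the language pair or raise LookupError."""
    match = next(
        (p for p in packages if p.from_code == from_code and p.to_code == to_code),
        None,
    )
    if match is None:
        raise LookupError(f"No package found for {from_code} -> {to_code}")
    return match

# stand-ins for the objects returned by get_available_packages()
fake_packages = [
    SimpleNamespace(from_code="en", to_code="es"),
    SimpleNamespace(from_code="en", to_code="az"),
]
print(find_package(fake_packages, "en", "es").to_code)
```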

In [None]:
# Translate the sentence 'A robot may not injure a human being or, through inaction, allow a human being to come to harm.' into Spanish (3.0 Points)
sentence = 'A robot may not injure a human being or, through inaction, allow a human being to come to harm.'
translate(sentence, 'es')

In [None]:
# Translate a sentence of your choice into a language of your choice (2.0 Points)
personal_sentence = "Hello, my name is Oliver and I am a high school student who is super interested in AI and Computer Science. I am so happy to meet you."
translate(personal_sentence, 'az')

## Using MT5

In [None]:
# process the article_text
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# raw strings avoid invalid-escape warnings for the regex patterns
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

article_text = """Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company said.  The policy includes the termination of accounts of anti-vaccine influencers.  Tech giants have been criticised for not doing more to counter false health information on their sites.  In July, US President Joe Biden said social media platforms were largely responsible for people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue.  YouTube, which is owned by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation about Covid vaccines.  In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B.  "We're expanding our medical misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and effective by local health authorities and the WHO," the post said, referring to the World Health Organization."""

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)


In [None]:
# Question01: What NLP Task is performed by the MT5 LLM? (1 Point)
"The LLM is summarizing and condensing information"

In [None]:
# Modify the MT5 LLM code so that it will instead operate on this text:(4 Points)
# (from Andrew Ng https://aifund.ai/insights-written-statement-of-andrew-ng-before-the-u-s-senate-ai-insight-forum/):
article_text = """
AI technology is used in applications in healthcare, underwriting, self-driving, social media, and other sectors. With some applications, there are risks of significant harm. We want:

Medical devices to be safe
Underwriting software to be fair, and not discriminate based on protected characteristics
Self-driving cars to be safe
Social media to be governed in a way that respects freedom of speech but also does not subject us to foreign actors’ disinformation campaigns
When we think about specific AI applications, we can figure out what outcomes we do want (such as improved healthcare) and do not want (such as medical products that make false claims) and regulate accordingly.

A fundamental distinction in decisions about how to regulate AI is between applications vs. technology.

Nikola Tesla’s invention of the AC (alternating current) electric motor was a technology. When this technology is incorporated into either a blender or an electric car, the blender or car is an application. Electric motors are useful for so many things it is hard to effectively regulate them separately from thinking about concrete use cases. But when we look at blenders and electric cars, we can systematically identify benefits and risks and work to enable the benefits while limiting risks.

Whereas motors help us with physical work, AI helps us with intellectual work.

In the case of AI, engineers and scientists will typically write software and have it learn from a lot of data. This AI system may live in a company’s datacenter and be subject to testing and experimentation, but it is not yet made available to any end-user. This is AI technology — essentially a piece of math that can be used for many different applications. When engineers then use this technology to build a piece of software for a particular purpose, such as medical diagnosis, it then becomes an application.
"""
# process the article_text
import re
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# raw strings avoid invalid-escape warnings for the regex patterns
WHITESPACE_HANDLER = lambda k: re.sub(r'\s+', ' ', re.sub(r'\n+', ' ', k.strip()))

model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer(
    [WHITESPACE_HANDLER(article_text)],
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512
)["input_ids"]

output_ids = model.generate(
    input_ids=input_ids,
    max_length=84,
    no_repeat_ngram_size=2,
    num_beams=4
)[0]

summary = tokenizer.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(summary)