<a href="https://colab.research.google.com/github/kvinne-anc/Keras-and-Tensor/blob/main/Major_NN_Architecture_S.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for the correct answers. Follow the instructions for each Task carefully.

### Instructions

- **Download** this notebook as you would any other ipynb file 
- **Upload** to Google Colab or work locally (if you have that set-up)
- **Delete** `raise NotImplementedError()`

- **Write** your code in the `# YOUR CODE HERE` space


- **Execute** the Test cells that contain assert statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)

- **Save** your notebook when you are finished
- **Download** as an ipynb file (if working in Colab)
- **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)



# Major Neural Network Architectures Challenge
## *Data Science Unit 4 Sprint 3 Challenge*

In this sprint challenge, you'll explore some of the cutting edge of Deep Learning. This week we studied several famous neural network architectures: recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional neural networks (CNNs), and autoencoders. In this sprint challenge, you will revisit these models. Remember, we are testing your knowledge of these architectures, not your ability to fit a model with high accuracy. 

__*Caution:*__  these approaches can be pretty heavy computationally. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime locally, on AWS SageMaker, on Colab or on a comparable environment. If something is running longer, double check your approach!

__*GridSearch:*__ CodeGrade will likely break if it is asked to run a gridsearch for a deep learning model (CodeGrade instances run on a single processor). So while you may choose to run a gridsearch locally to find the optimum hyper-parameter values for your model, please delete (or comment out) the gridsearch code and simply instantiate a model with the optimum parameter values to get the performance that you want out of your model prior to submission. 


## Challenge Objectives
*You should be able to:*
* <a href="#p1">Part 1</a>: Train an LSTM classification model
* <a href="#p2">Part 2</a>: Utilize a pre-trained CNN for object detection
* <a href="#p3">Part 3</a>: Describe a use case for an autoencoder
* <a href="#p4">Part 4</a>: Describe yourself as a Data Scientist and elucidate your vision of AI

____

# (CodeGrade) Before you submit your notebook you must first

1) Restart your notebook's Kernel

2) Run all cells sequentially, from top to bottom, so that cell numbers are sequential numbers (i.e. 1,2,3,4,5...)
- Easiest way to do this is to click on the **Cell** tab at the top of your notebook and select **Run All** from the drop down menu. 

3) If you have gridsearch code, now is when you either delete it or comment out that code so CodeGrade doesn't run it and crash. 

4) Read the directions in **Part 2** of this notebook for specific instructions on how to prep that section for CodeGrade.

____

<a id="p1"></a>
## Part 1 - LSTMs

Use an LSTM to fit a multi-class classification model on Reuters news articles to distinguish topics of articles. The data is already encoded properly for use in an LSTM model. 

Your Tasks: 
- Use Keras to fit a predictive model, classifying news articles into topics. 
- Name your model as `model`
- Use a `single hidden layer`
- Use `sparse_categorical_crossentropy` as your loss function
- Use `accuracy` as your metric
- Report your overall score and accuracy
- Due to resource concerns on CodeGrade, `set your model's epochs=1`

For reference, the LSTM code we used in class will be useful. 

__*Note:*__  Focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done! 

In [None]:
# Import data (don't alter the code in this cell)
from tensorflow.keras.datasets import reuters

# Suppress some warnings from deprecated reuters.load_data
import warnings
warnings.filterwarnings('ignore')

# Load data
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=None,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=723812,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

# Due to limited computational resources on CodeGrade, take the following subsample 
train_size = 1000
X_train = X_train[:train_size]
y_train = y_train[:train_size]

In [None]:
# Demo of encoding
word_index = reuters.get_word_index(path="reuters_word_index.json")

print(f"Iran is encoded as {word_index['iran']} in the data")
print(f"London is encoded as {word_index['london']} in the data")
print("Words are encoded as numbers in our dataset.")

Iran is encoded as 779 in the data
London is encoded as 544 in the data
Words are encoded as numbers in our dataset.


In [None]:
len(X_train[1])

125

In [None]:
len(X_test[1])

66

In [None]:
# Imports (don't alter this code)
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

# DO NOT CHANGE THESE VALUES 
# Keras docs say that the + 1 is needed: https://keras.io/api/layers/core_layers/embedding/
MAX_FEATURES = len(word_index.values()) + 1

# maxlen is the length of each sequence (i.e. document length)
MAXLEN = 200

In [None]:
X_train.shape

(1000,)

In [None]:
y_train.shape

(1000,)

In [None]:
X_test.shape

(2246,)

In [None]:
y_test.shape

(2246,)

In [None]:
import tensorflow as tf
import random
import sys
import os
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, LSTM, Dropout, Activation, Embedding, Bidirectional

In [None]:

# Pre-process your data by creating sequences
def print_text_from_seq(x):
    # print('=================================================')
    # reuters.load_data was called with index_from=3, so every word id in the
    # data is shifted by 3 relative to word_index. Build a shifted copy so the
    # shared word_index dict is not mutated.
    word_to_id = {k: (v + 3) for k, v in word_index.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    word_to_id["<UNUSED>"] = 3

    id_to_word = {value: key for key, value in word_to_id.items()}

    print(f'Length = {len(x)}')
    print(' '.join(id_to_word[i] for i in x))
    print('=================================================')

# Save your transformed data to the same variable name:
# example: X_train = some_transformation(X_train)

In [None]:
for i in range(0, 6):
    print(X_train[i])

[1, 248, 409, 166, 265, 1537, 1662, 8, 24, 4, 1222, 2771, 7, 227, 236, 40, 85, 944, 10, 531, 176, 8, 4, 176, 1613, 24, 1662, 297, 5157, 6, 10, 103, 5, 231, 215, 8, 7, 2889, 6, 10, 1202, 69, 4, 1222, 329, 2771, 24, 944, 23, 944, 1662, 40, 2509, 1592, 907, 69, 4, 113, 997, 762, 2539, 7, 227, 236, 17, 12]
[1, 4665, 1183, 413, 381, 7, 1134, 1664, 62, 729, 7, 4, 121, 273, 93, 109, 28, 2115, 72, 11, 428, 4, 387, 989, 558, 3956, 8, 7, 25, 1213, 427, 1969, 223, 4, 213, 5, 387, 580, 8, 1145, 413, 62, 410, 451, 18, 428, 7, 4, 121, 6, 3106, 19, 11, 428, 9, 1283, 317, 65, 413, 138, 59, 12, 11, 428, 6, 6118, 63, 11, 4, 3956, 8, 3640, 1183, 413, 202, 251, 18, 428, 6, 546, 19, 11, 428, 9, 317, 65, 413, 7, 4, 1721, 427, 409, 7145, 138, 19, 19, 11, 428, 6, 3843, 70, 11, 4, 135, 5, 137, 317, 1833, 542, 9, 7145, 413, 138, 72, 47, 11, 428, 6, 19, 5106, 19, 16, 8, 17, 12]
[1, 56, 14065, 65, 9, 249, 149, 8, 4, 347, 5, 25, 65, 9, 249, 282, 333, 27, 258, 20, 6, 644, 59, 11, 15, 22, 653, 32, 11, 15, 257, 28, 2

In [None]:
print('Pad Sequences (samples x time)')
x_train = sequence.pad_sequences(X_train, maxlen=MAXLEN)
x_test = sequence.pad_sequences(X_test, maxlen=MAXLEN)
print('x_train shape: ', x_train.shape)
print('x_test shape: ', x_test.shape)

Pad Sequences (samples x time)
x_train shape:  (1000, 200)
x_test shape:  (2246, 200)


In [None]:
# Visible tests
assert x_train.shape[1] == MAXLEN, "Your train input sequences are the wrong length. Did you use the sequence import?"
assert x_test.shape[1] == MAXLEN, "Your test input sequences are the wrong length. Did you use the sequence import?"

### Create your model

Make sure to follow these instructions (also listed above):
- Name your model as `model`
- Use a `single hidden layer`
- Use `sparse_categorical_crossentropy` as your loss function
- Use `accuracy` as your metric

**Additional considerations**

The number of nodes in your output layer should be equal to the number of **unique** topic labels in the targets you are training and testing on (`y_train` and `y_test`). For this dataset, that value is 46.

- Set the number of nodes in your output layer equal to 46
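
To make the loss choice concrete, here is a minimal pure-Python sketch of what a sparse categorical cross-entropy computes for a single sample. It is an illustration only, not Keras's implementation: the helper names `softmax` and `sparse_cce` are hypothetical, and three classes stand in for the 46 topics.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cce(probs, label):
    """Cross-entropy for one sample whose true class is the integer `label`."""
    return -math.log(probs[label])

# Toy example: 3 classes stand in for the 46 Reuters topics
probs = softmax([2.0, 1.0, 0.1])
loss_if_class_0 = sparse_cce(probs, 0)  # the highest-scoring class is the true one
loss_if_class_2 = sparse_cce(probs, 2)  # the lowest-scoring class is the true one
print(loss_if_class_0 < loss_if_class_2)
```

Because the labels stay as integers, `y_train` can be passed to `fit` as-is, with no one-hot encoding step.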

In [None]:
model = Sequential()
model.add(Embedding(MAX_FEATURES, 46))
model.add(Dropout(0.5))
model.add(Bidirectional(LSTM(46)))
model.add(Dense(46, activation='softmax'))

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, None, 46)          1425080   
_________________________________________________________________
dropout_6 (Dropout)          (None, None, 46)          0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 92)                34224     
_________________________________________________________________
dense_5 (Dense)              (None, 46)                4278      
Total params: 1,463,582
Trainable params: 1,463,582
Non-trainable params: 0
_________________________________________________________________


In [None]:
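# `learning_rate` replaces the deprecated `lr` alias; note that newer Keras
# optimizer releases may also reject the `decay` argument
opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-6)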
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'],
)

In [None]:
# Build and compile your model here
#from tensorflow.keras.layers import Dropout

#model = Sequential()
#model.add(Embedding(MAX_FEATURES, 46))  
#model.add(Dropout(0.1))
#model.compile(loss='sparse_categorical_crossentropy',
 #             optimizer='adam', 
  #            metrics=['accuracy'])
#model.summary()


In [None]:
# Visible Test
assert model.get_config()["layers"][1]["class_name"] == "Embedding", "Layer 1 should be an Embedding layer."

In [None]:
# Hidden Test

### Fit your model

Now, fit the model that you built and compiled in the previous cells. Remember to set your `epochs=1`! 

In [None]:
# Fit your model here
# REMEMBER to set epochs=1
output = model.fit(x_train, 
                   y_train, 
                   batch_size=46, 
                   epochs=1, 
                   validation_data=(x_test, y_test))




In [None]:
# Visible Test 
n_epochs = len(model.history.history["loss"])
assert n_epochs == 1, "Verify that you set epochs to 1."

## Sequence Data Question
#### *Describe the `pad_sequences` method used on the training dataset. What does it do? Why do you need it?*

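`pad_sequences` handles the variable-length problem: the articles contain different numbers of words, but the model needs every input sequence to be the same length. 
It pads shorter sequences with a placeholder value (0 by default) and truncates longer ones to `maxlen`, at the beginning or the end depending on the `'pre'` or `'post'` setting of the `padding` and `truncating` arguments. 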
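
As a rough illustration of that behavior, the sketch below mimics the default `'pre'` padding and `'pre'` truncation in plain Python. The helper `pad_pre` is hypothetical and is not the Keras implementation; it only shows the idea on two short toy sequences.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)  # pad at the front
    return padded

docs = [[1, 248, 409], [1, 56, 14065, 65, 9, 249]]
print(pad_pre(docs, maxlen=4))
# [[0, 1, 248, 409], [14065, 65, 9, 249]]
```

Every row now has the same length, which is what lets the batch be stacked into a single `(samples, maxlen)` array.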

## RNNs versus LSTMs
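#### *What are the primary motivations behind using Long Short-Term Memory (LSTM) cell units over traditional Recurrent Neural Networks?*

An LSTM gives us more control over how information flows through the network: its learned gates decide what to keep from earlier time steps, what to forget, and how much of the new input to mix in. This lets it retain context over longer sequences than a traditional RNN, which tends to lose early information to vanishing gradients. 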
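
For reference, the gating idea can be written out with the standard textbook LSTM cell equations. This is the general form, not something specific to this notebook; the symbols $W$, $U$, and $b$ for the learned weights and biases are notation assumed here, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The forget gate $f_t$ and input gate $i_t$ control the mix of old memory and new input in the cell state $c_t$, and the output gate $o_t$ controls how much of that state is exposed as $h_t$.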

## RNN / LSTM Use Cases
#### *Name and Describe 3 Use Cases of LSTMs or RNNs and why they are suited to that use case*

RNN/LSTM use cases: 
- Speech recognition (Siri/Cortana)
- Text autofill
- Translation
- Stock prices
- DNA sequencing

RNNs and LSTMs are very useful for processing text and speech data because each word in a sequence depends on the words before it. They use the context of each word, i.e. when predicting a word, the words before it already provide a great deal of information about what is likely to follow. Stock prices and DNA are likewise ordered sequences in which earlier values inform later ones. LSTMs are generally preferable to plain RNNs. 

<a id="p2"></a>
## Part 2 - CNNs

### Find the Frog

Time to play "find the frog!" Use Keras and [ResNet50v2](https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet_v2) (pre-trained) to detect which of the images in the `frog_images` subdirectory has a frog in it.

<img align="left" src="https://d3i6fh83elv35t.cloudfront.net/newshour/app/uploads/2017/03/GettyImages-654745934-1024x687.jpg" width=400>

The skimage function below will help you read in all the frog images into memory at once. You should use the preprocessing functions that come with ResnetV2, and you should also resize the images using scikit-image.

### Reading in the images

The code in the following cell will download the images to your notebook (either in your local Jupyter notebook or in Google Colab).

### Run ResNet50v2

Your goal is to validly run ResNet50v2 on the input images - don't worry about tuning or improving the model. You can print out or view the predictions in any way you see fit. In order to receive credit, you need to have made predictions at some point in the following cells.

*Hint* - ResNet 50v2 doesn't just return "frog". The three labels it has for frogs are: `bullfrog, tree frog, tailed frog`
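
One possible way to act on this hint is sketched below, assuming the decoded predictions for one image arrive as a list of `(class_id, label, score)` tuples with underscores in the label names. The helper `frog_probability` and the example values are hypothetical; the class ids and scores are made up for illustration.

```python
FROG_LABELS = {"bullfrog", "tree_frog", "tailed_frog"}

def frog_probability(decoded):
    """Sum the scores of any frog labels among one image's decoded predictions."""
    return sum(score for _, label, score in decoded if label in FROG_LABELS)

# Hypothetical decoded output for a single image (ids and scores are made up)
example = [
    ("id_a", "tree_frog", 0.61),
    ("id_b", "bullfrog", 0.22),
    ("id_c", "green_lizard", 0.05),
]
print(frog_probability(example) > 0.5)
```

Summing across the three labels avoids missing a frog just because the model splits its confidence between two frog species.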

**Autograded tasks**

* Instantiate your ResNet 50v2 and save to a variable named `resnet_model`

**Other tasks**
* Re-size your images
* Use `resnet_model` to predict if each image contains a frog
* Decode your predictions
* Hint: the lesson on CNNs will have some helpful code

**Stretch goals**
* Check for other things such as fish
* Print out the image with its predicted label
* Wrap everything nicely in well documented functions

## Important note!

To increase the chances that your notebook will run in CodeGrade, when you **submit** your notebook:

* comment out the code where you load the images
* comment out the code where you make the predictions
* comment out any plots or image displays you create

**MAKE SURE YOUR NOTEBOOK RUNS COMPLETELY BEFORE YOU SUBMIT!**

In [None]:
# Prep to import images (don't alter the code in this cell)
import urllib.request

# Text file of image URLs
text_file = "https://raw.githubusercontent.com/LambdaSchool/data-science-canvas-images/main/unit_4/sprint_challenge_files/frog_image_url.txt"
data = urllib.request.urlopen(text_file)

# Create list of image URLs
url_list = [] 
for line in data:
    url_list.append(line.decode('utf-8'))

In [None]:
# Import images (don't alter the code in this cell)

from skimage.io import imread
from skimage.transform import resize 

# instantiate list to hold images
image_list = []

### UNCOMMENT THE FOLLOWING CODE TO LOAD YOUR IMAGES

#loop through URLs and load each image
for url in url_list:
    image_list.append(imread(url))

## UNCOMMENT THE FOLLOWING CODE TO VIEW AN EXAMPLE IMAGE SIZE
#What is an "image"?
print(type(image_list[0]), end="\n\n")

print("Each of the Images is a Different Size")
print(image_list[0].shape)
print(image_list[1].shape)

<class 'numpy.ndarray'>

Each of the Images is a Different Size
(2137, 1710, 3)
(3810, 2856, 3)


In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.keras.applications.resnet_v2 import ResNet50V2 # <-- pre-trained model 
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet_v2 import preprocess_input, decode_predictions

In [None]:
resnet_model = tf.keras.applications.ResNet50V2(
    include_top=True, weights='imagenet', input_tensor=None,
    input_shape=None, pooling=None, classes=1000,
    classifier_activation='softmax'
)

In [None]:
# Code from the CNN lecture might come in handy here! 
#Instantiate your ResNet 50v2 and save to a variable named resnet_model

# Labels ResNet50v2 uses for frogs, as spelled in the decoded predictions
FROG_LABELS = {'bullfrog', 'tree_frog', 'tailed_frog'}

def process_img_path(img_path):
  return image.load_img(img_path, target_size=(224, 224))

def img_recognition_pretrain(img):
  x = image.img_to_array(img)
  x = np.expand_dims(x, axis=0)
  x = preprocess_input(x)
  # Reuse the resnet_model instantiated above instead of building a new model per call
  features = resnet_model.predict(x)
  results = decode_predictions(features, top=4)[0]
  print(results)
  # Return the score of the first frog label found among the top predictions
  for entry in results:
    if entry[1] in FROG_LABELS:
      return entry[2]
  return 0.0


In [None]:
#Genuinely no idea whatsoever and I'm out of time

In [None]:
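# Define the batch size:
batch_size = 32

# flow_from_directory expects a directory path, but image_list is a list of
# in-memory arrays, so resize each array directly and stack them into one batch.
# preserve_range=True keeps pixel values on the 0-255 scale preprocess_input expects;
# [..., :3] drops an alpha channel if one is present.
resized = np.stack([
    resize(img, (224, 224), preserve_range=True)[..., :3]
    for img in image_list
])

# Run the pre-trained model and decode the top predictions for each image
preds = resnet_model.predict(preprocess_input(resized), batch_size=batch_size)
frog_labels = {'bullfrog', 'tree_frog', 'tailed_frog'}
for i, top in enumerate(decode_predictions(preds, top=3)):
    has_frog = any(label in frog_labels for _, label, _ in top)
    print(i, has_frog, top)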

In [None]:
#Re-size your images
#Use resnet_model to predict if each image contains a frog

In [None]:
# Visible test
assert resnet_model.get_config()["name"] == "resnet50v2", "Did you instantiate the resnet model?"

<a id="p3"></a>
## Part 3 - Autoencoders

**Describe a use case for an autoencoder given that an autoencoder tries to predict its own input.**

Reverse image search and recommendation systems (content-based filtering). 

Autoencoders, in my simplified interpretation, learn a compressed representation of the input and then reconstruct the input from it, so they can fill in blanks or find close matches rather than predict a separate label, e.g. search suggestions, misspelled words, alternative wording, and related material or concepts. 
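
To sketch how the reverse image search use case could work once an encoder has been trained, the example below compares made-up encodings with cosine similarity. Everything here is hypothetical: the helper names, the catalog keys, and the 3-dimensional vectors are toy values, and a real encoder would produce many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, catalog):
    """Return the catalog key whose encoding is closest to the query encoding."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))

# Made-up encodings standing in for the bottleneck output of an autoencoder
catalog = {
    "frog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.0, 0.2, 0.9],
}
print(most_similar([0.8, 0.2, 0.1], catalog))
# frog_photo
```

The design choice is that similarity is measured in the compressed space, so images that reconstruct from similar codes are treated as similar even if their raw pixels differ.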

<a id="p4"></a>
## Part 4 - More...

**Answer the following questions, with a target audience of a fellow Data Scientist:**

- What do you consider your strongest area as a Data Scientist?
- What area of Data Science would you most like to learn more about, and why?
- Where do you think Data Science will be in 5 years?

A few sentences per answer is fine - only elaborate if time allows.

- Strongest: I am good at doing the research to figure out what data to join in, and I am particularly good at working with geospatial data, which I really enjoy and taught myself while working on the waterpump challenge.
- I would like to learn more about real-world use cases. I want to know how to use this in the fields of biotech, genetics, and biopharma, really anything more interesting than these largely boring datasets we look at. I feel like this stuff could be really interesting and I know it's useful, but I really can't connect the dots between the technical parts and the potential applications when the subject material that I am applying it to is so mundane that I can't understand why anyone would bother (see lecture on creating classes and objects for polo shirt colors/used car inventory). 
- Hopefully in 5 years I will have had the time to re-learn everything from Units 3 and 4 and found a way to take the interesting elements and really do something interesting with them that furthers our understanding of biological processes/ecosystems/evolution/molecular structure/DNA/longevity, and hopefully I will have managed to escape the hell that would be applying this to consumer insights/entertainment/tv/music/HR (my personal hell).
- I think I peaked in Unit 2 and the structural shift threw me off, so I very much intend to go learn a ton of this again. 

## Congratulations! 

Thank you for your hard work, and [congratulations](https://giphy.com/embed/26xivLqkv86uJzqWk)!!! You've learned a lot, and you should proudly call yourself a Data Scientist.
