<a href="https://colab.research.google.com/github/pragmatizt/DS-Unit-4-Sprint-3-Deep-Learning/blob/master/ira_Unit_4_Sprint_3_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Major Neural Network Architectures Challenge
## *Data Science Unit 4 Sprint 3 Challenge*

In this sprint challenge, you'll explore some of the cutting edge of Data Science. This week we studied several famous neural network architectures: 
recurrent neural networks (RNNs), long short-term memory (LSTMs), convolutional neural networks (CNNs), and Autoencoders. In this sprint challenge, you will revisit these models. Remember, we are testing your knowledge of these architectures not your ability to fit a model with high accuracy. 

__*Caution:*__  these approaches can be pretty heavy computationally. All problems were designed so that you should be able to achieve results within at most 5-10 minutes of runtime on SageMaker, Colab or a comparable environment. If something is running longer, doublecheck your approach!

## Challenge Objectives
*You should be able to:*
* <a href="#p1">Part 1</a>: Train a LSTM classification model
* <a href="#p2">Part 2</a>: Utilize a pre-trained CNN for objective detection
* <a href="#p3">Part 3</a>: Describe the components of an autoencoder
* <a href="#p4">Part 4</a>: Describe yourself as a Data Science and elucidate your vision of AI

<a id="p1"></a>
## Part 1 - RNNs

Use an RNN/LSTM to fit a multi-class classification model on reuters news articles to distinguish topics of articles. The data is already encoded properly for use in an RNN model. 

Your Tasks: 
- Use Keras to fit a predictive model, classifying news articles into topics. 
- Report your overall score and accuracy

For reference, the [Keras IMDB sentiment classification example](https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py) will be useful, as well the RNN code we used in class.

__*Note:*__  Focus on getting a running model, not on maxing accuracy with extreme data size or epoch numbers. Only revisit and push accuracy if you get everything else done!

In [1]:
from tensorflow.keras.datasets import reuters

(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=None,
                                                         skip_top=0,
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=723812,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

In [2]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((8982,), (8982,), (2246,), (2246,))

In [3]:
# Demo of encoding

word_index = reuters.get_word_index(path="reuters_word_index.json")

print(f"Iran is encoded as {word_index['iran']} in the data")
print(f"London is encoded as {word_index['london']} in the data")
print("Words are encoded as numbers in our dataset.")

Iran is encoded as 779 in the data
London is encoded as 544 in the data
Words are encoded as numbers in our dataset.


In [4]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

batch_size = 46
max_features = len(word_index.values())
maxlen = 200

print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)


print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(1, activation='sigmoid'))



8982 train sequences
2246 test sequences
Pad sequences (samples x time)
X_train shape: (8982, 200)
X_test shape: (2246, 200)
Build model...
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [5]:
# You should only run this cell once your model has been properly configured

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='nadam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=1,
          validation_data=(X_test, y_test))

score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 8982 samples, validate on 2246 samples
Test score: nan
Test accuracy: 0.03873553


In [6]:
# Reference for the Sequence Data Question below:
X_train.shape, X_test.shape

((8982, 200), (2246, 200))

## Sequence Data Question
#### *Describe the `pad_sequences` method used on the training dataset. What does it do? Why do you need it?*

**Answer**: Pad sequences transforms a list into a 2D numpy array.  As we can see in the first time we split the data, it was 1 dimensional.

To transform it into a 2D array, we used the the pad_sequences() method. Also, the maxlen indicates the maximum length of each sequence.

So as we transform these into 2D arrays, we ensure that they are the same shape by using the maxlen parameter.

**References** (for my future self reviewing this sprint challenge): 
- [Keras Documentation](https://keras.io/preprocessing/sequence/)
- [Stack Overflow](https://stackoverflow.com/questions/42943291/what-does-keras-io-preprocessing-sequence-pad-sequences-do)

## RNNs versus LSTMs
#### *What are the primary motivations behind using Long-ShortTerm Memory Cell unit over traditional Recurrent Neural Networks?*

**Answer**: Simply put, LSTM's can remember information for long periods of time.  

In non-technical terms it can bring up "context" from the past to present & future information.  

LSTM's have the ability to add or remove information to the cell state by structures called **gates**.

*"Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation."* - Blog post referenced below.

![alt text](https://i.stack.imgur.com/Iv3nU.png)

**Reference Links**:
- [StackOverflow](https://i.stack.imgur.com/Iv3nU.png)
- [Blog Post on LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [Link](https://arxiv.org/ftp/arxiv/papers/1604/1604.04573.pdf): Saw on my google searches; looked like an interesting article on multi-label image classification

## RNN / LSTM Use Cases
#### *Name and Describe 3 Use Cases of LSTMs or RNNs and why they are suited to that use case*

**Answer**: 
- Unsegmented, connected handwriting recognition 
- Speech recognition
- Anomaly detection in network traffic

They're best suited for the cases mentioned above because LSTM excels in classifying, processing, and making predictions on *time series* data.

In each of the cases above, there could be an unspecified period of time between events.  

The fact that LSTM "remembers" makes it an excellent tool for these sort of problems.  

**Reference Link**:
- [Wikipedia](https://en.wikipedia.org/wiki/Long_short-term_memory)


<a id="p2"></a>
## Part 2- CNNs

### Find the Frog

Time to play "find the frog!" Use Keras and ResNet50 (pre-trained) to detect which of the following images contain frogs:

<img align="left" src="https://d3i6fh83elv35t.cloudfront.net/newshour/app/uploads/2017/03/GettyImages-654745934-1024x687.jpg" width=400>


In [7]:
!pip install google_images_download



In [8]:
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
arguments = {"keywords": "lilly frog pond", "limit": 5, "print_urls": True}
absolute_image_paths = response.download(arguments)

# One error below.  Looks like the fifth image is returning a 404 error.


Item no.: 1 --> Item name = lilly frog pond
Evaluating...
Starting Download...
Image URL: http://www.slrobertson.com/images/usa/georgia/atlanta/atl-botanical-gardens/frog-lily-pond-2-b.jpg
Completed Image ====> 1.frog-lily-pond-2-b.jpg
Image URL: https://cdn.pixabay.com/photo/2017/07/14/17/44/frog-2504507_960_720.jpg
Completed Image ====> 2.frog-2504507_960_720.jpg
Image URL: https://storage.needpix.com/rsynced_images/bull-frog-2526024_1280.jpg
Completed Image ====> 3.bull-frog-2526024_1280.jpg
Image URL: https://i.pinimg.com/originals/9a/49/08/9a49083d4d7458a194a451eea757a444.jpg
Completed Image ====> 4.9a49083d4d7458a194a451eea757a444.jpg
Image URL: https://www.maxpixel.net/static/photo/1x/Frog-Pond-Lily-Pad-Water-Nature-Animal-4336943.jpg
URLError on an image...trying next one... Error: HTTP Error 404: Page not found
Image URL: https://image.shutterstock.com/image-photo/green-frogs-pond-lilly-pads-260nw-50197960.jpg
Completed Image ====> 5.green-frogs-pond-lilly-pads-260nw-50197960

At time of writing at least a few do, but since the Internet changes - it is possible your 5 won't. You can easily verify yourself, and (once you have working code) increase the number of images you pull to be more sure of getting a frog. Your goal is to validly run ResNet50 on the input images - don't worry about tuning or improving the model.

*Hint* - ResNet 50 doesn't just return "frog". The three labels it has for frogs are: `bullfrog, tree frog, tailed frog`

*Stretch goals* 
- Check for fish or other labels
- Create a matplotlib visualizations of the images and your prediction as the visualization label

In [9]:
import numpy as np

from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

def process_img_path(img_path):
  return image.load_img(img_path, target_size=(224, 224))

def img_contains_frog(img):
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    model = ResNet50(weights='imagenet')
    features = model.predict(x)
    frog_results = decode_predictions(features, top=3)[0]
    print(frog_results)
    frog_results.append
    for entry in frog_results:
        if 'frog'in entry[1]:
            return True
    # Else:
    return False 

def img_contains_fish(img):
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    model = ResNet50(weights='imagenet')
    features = model.predict(x)
    fish_results = decode_predictions(features, top=3)[0]
    fish_results.append(decode_predictions(features, top=10)[0])
    print(fish_results)
    fish_results.append
    for entry in fish_results:
        if 'fish'in entry[1]:
            return True
    #Else:
    return False


# Frogs
for x in absolute_image_paths[0]['lilly frog pond']:
    x = process_img_path(x)
    print(img_contains_frog(x))
    
# Fish
for x in absolute_image_paths[0]['lilly frog pond']:
    x = process_img_path(x)
    print(img_contains_fish(x))

[('n03991062', 'pot', 0.7243723), ('n01641577', 'bullfrog', 0.045519695), ('n01667778', 'terrapin', 0.04263503)]
True
[('n01641577', 'bullfrog', 0.35860497), ('n01644900', 'tailed_frog', 0.30636418), ('n01737021', 'water_snake', 0.15603636)]
True
[('n01644373', 'tree_frog', 0.7534928), ('n01641577', 'bullfrog', 0.13603099), ('n01644900', 'tailed_frog', 0.10173123)]
True
[('n04476259', 'tray', 0.6419137), ('n03485794', 'handkerchief', 0.18807994), ('n01644373', 'tree_frog', 0.01435911)]
True
[('n03991062', 'pot', 0.3353414), ('n04522168', 'vase', 0.13951766), ('n02280649', 'cabbage_butterfly', 0.08299894)]
False
[('n03991062', 'pot', 0.7243723), ('n01641577', 'bullfrog', 0.045519695), ('n01667778', 'terrapin', 0.04263503), [('n03991062', 'pot', 0.7243723), ('n01641577', 'bullfrog', 0.045519695), ('n01667778', 'terrapin', 0.04263503), ('n02877765', 'bottlecap', 0.0326819), ('n01737021', 'water_snake', 0.020111015), ('n04409515', 'tennis_ball', 0.018943807), ('n07753113', 'fig', 0.0169714

#### Stretch Goal: Displaying Predictions

In [0]:
## Couldn't get code to work.  Good "code challenge" for myself once Winter Break starts.

<a id="p3"></a>
## Part 3 - Autoencoders

Describe a use case for an autoencoder given that an autoencoder tries to predict its own input. 

*Answer:* Given that it tries to predict its own input, one novel way of using autoencoders is image denoising.  Oftentimes images contain noise in the data -- autoencoders can get rid of that noise!

- [Medium](https://medium.com/datadriveninvestor/deep-learning-autoencoders-db265359943e), a decent blog post overview on autoencoders.
- [Kaggle](https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases), I love this Kaggle post on AutoEncoders.  Putting this on my list for winter reading.

<a id="p4"></a>
## Part 4 - More...

Answer the following questions, with a target audience of a fellow Data Scientist:

- **What do you consider your strongest area, as a Data Scientist?**

**Answer**: I would actually consider my non-technical experience as something that will help me in the long term.  

I have a background in sales, operations, and entrepreneurship.  So I'm comfortable with *storytelling* (selling a product or service), I understand business (which will help me share the data with different stakeholders: whether C-suite, finance, marketing, or customers, etc.), and I have an undergrad background in Economics (a general understanding of data and visualizations which can help me tell the story).

- **What area of Data Science would you most like to learn more about, and why?**

**Answer**: You know, I was more inspired by the data anlytics and visualizations part of Data Science, but since starting Unit 4, and learning about all the cool things that we can do with images, text, ... anything(!), I kind of want to spend some time looking into this deeper.

But I would be happy if my starting job in this field is as a Data Analyst or as a Business Intelligence analyst (plays on the strengths I mentioned above)

- **Where do you think Data Science will be in 5 years?**

**Answer**: Able to process more data (5G, stronger hardware), maybe one or two groundbreaking algorithms, more ubiquitous, more unintimidating to the general population.

Fully integrated with industries like energy, agriculture, finance, tech (of course), practically every industry will see the value in Data Science.

- **What are the threats posed by AI to our society?**

**Answer**: The social and economic changes are the biggest and most obvious ones.  There will be a massive job displacement for people all over the world.  Like every technological revolution in the past: agricultural, industrial, digital.  

- **How do you think we can counteract those threats? **

**Answer**: We need to have our brightest minds look into how to best "catch" these massive amounts of people that will see their jobs become obsolete due to A.I.

To have support structures that will allow them to reskill.  Whether that's financial support during the time that they're reskilling, as well as the educational support that allows them to get the best education while reskilling.

(Lambda is doing a great job as a solution to this, which is already happening)

- **Do you think achieving General Artifical Intelligence is ever possible?**

**Answer**: Yes, I do.  As hardware, algorithms, and internet speeds (meaning improved pipelines) improves, I think it's only a matter of time.

A few sentences per answer is fine - only elaborate if time allows.

## Congratulations! 

Thank you for your hard work, and congratulations! You've learned a lot, and you should proudly call yourself a Data Scientist.


In [11]:
from IPython.display import HTML

HTML("""<iframe src="https://giphy.com/embed/26xivLqkv86uJzqWk" width="480" height="270" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/mumm-champagne-saber-26xivLqkv86uJzqWk">via GIPHY</a></p>""")

In [12]:
print("Woohoo!  We did it.  Survived four units of Lambda School.  Onto labs!")

Woohoo!  We did it.  Survived four units of Lambda School.  Onto labs!
