# Topic 42 - Tuning Neural Networks + Deep NLP (with Google Colab!)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1F8BYCuuUI3Jcbnp70W3Uj-kSvG58uQEe?usp=sharing)


- 06/09/21
- onl01-dtsc-ft-022221

## Colab Notebook Key

- 📚: Info sections
- 🕹: Activity sections
    - 🎛: hyperparameters to tune
    - 🏋️: fitting models
    - 🤔: New Things to Potentially Try 
- Use the Table of Contents view on the left sidebar to find the relevant sections (button looks like a bulleted list)

## 📚 Google Colab Overview

**Google Colab - Quick Notes**
 1. **Open the sidebar!**
    - Use `Table of Contents` to jump between the 3 major sections.
    - Mount your Google Drive via the `Files` tab.
    - Note: **to make a section appear in the Table of Contents, create a NEW text cell for the *header only*.** This will also let you collapse all of the cells in the section to reduce clutter.

2. **Google Colab already has most common python packages.**
    - You can pip install anything by prepending an exclamation point
    - You can use the `IPython.display` function `clear_output` to programmatically clean up the displays from pip installation.
    ``` python
    !pip install fsds_100719
    !pip install fake_useragent
    !pip install lxml
    from IPython.display import clear_output
    clear_output()

    %conda install ....
    ```
    


3. **Using GPUs/TPUs**
    - `Runtime > Change Runtime Type > Hardware Acceleration`

4. **Run-Before and Run-After**
    - Go to `Runtime` and select `Run before` to run all cells up to the currently active cell
    - Go to `Runtime` and select `Run after` to run all cells that follow the currently active cell 

5. **Cloud Files with Colab**
    - **Open .csv's stored in a github repo directly with Pandas**:
        - Go to the repo on GitHub, click on the csv file, then click `Download` or `Raw` to display the raw text. Copy the link from your address bar (it should start with `https://raw.githubusercontent.com`).
        - In your notebook, do `df=pd.read_csv(url)` to load in the data.
    - **Google Drive: Open sidebar > Files> click Mount Drive**
        - or use this function (also available from file sidebar): 
        ```python 
        ## Mount Google Drive
        from google.colab import drive
        drive.mount('/gdrive',force_remount=True)
        ```
        - Then access files by file path like usual.
        
    - Dropbox Files: (like images or csv)
        - Copy and paste the share link.
        - Change the end of the link from `dl=0` to `dl=1`
    - Note: for some data types (like `.sqlite`) the only option is to store them in Google Drive and then mount Google Drive using the Files tab of the sidebar.





6. **Keyboard Shortcuts**
    - A lot of the keyboard shortcuts for Colab are different.
        - Auto-Complete is Control+Space
        - Control+Shift+Space for docstrings.
    - Most keyboard shortcuts can be changed
        - go to `Tools`>`Keyboard Shortcuts`

    - Some of the keyboard shortcuts are the same, BUT you first have to type `Cmd/Ctrl + M` and THEN the keyboard shortcut.
        - e.g. `Cmd/Ctrl+M  Y` will change a cell to a code cell
        - `Cmd/Ctrl+M  M` will change a code cell to a Markdown cell.


7. **GitHub Integration**
    
    - Open a notebook from GitHub or save one to GitHub using `File > Upload Notebook` and `File > Save a copy in GitHub`, respectively.
        - Notebooks saved to GitHub can optionally have an "Open in Colab" link inserted at the top of the notebook.
    - You can open notebooks contained in GitHub repositories using the File menu.
        - `File`>`Open`> `GitHub tab`
    - You can save a copy of notebooks in GitHub repositories 
        - `File`> `Save a copy in GitHub`
    - When you are done working for the day/want to back up the current state of your notebook:
        - `File` > `Save and Pin Revision`
        - This will save a revision that you could revert back to later (like a commit)
    - **See the following example notebook from Google:**
        - https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb 

    

8. **You cannot easily clone an entire repository.**
    - It is possible, but you have to do it from WITHIN a Colab notebook. 
    - **See the following resources for additional info on using Colab + GitHub**:
        - https://towardsdatascience.com/google-drive-google-colab-github-dont-just-read-do-it-5554d5824228 

                    
9. **Load in images stored in a GitHub Repo for Markdown cells:**
    - Go to Repo on GitHub.com, click on image file name.
    - On the next page for the file, there should be a `Download` button. Click this. 
    - A new tab should open up with the raw image and the url should now read `raw.githubusercontent.com`. 
    - Copy this url, it can be used with Markdown cells using img tags. 

10. **Consider paying for Colab Pro if you need faster processing and more RAM**
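
The link rewrites described in item 5 can be sketched in a few lines. This is only an illustration: the helper names `github_blob_to_raw` and `dropbox_direct_link` and the example URL are hypothetical, and the functions assume the usual `github.com/<owner>/<repo>/blob/<branch>/<path>` and `dl=0` link shapes.

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a github.com 'blob' page URL into a raw.githubusercontent.com URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        return url  # not a blob page URL, so leave it unchanged
    path = url[len(prefix):].split("?")[0]  # drop any ?raw=true query string
    owner_repo, rest = path.split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{rest}"


def dropbox_direct_link(url: str) -> str:
    """Switch a Dropbox share link from preview (dl=0) to direct download (dl=1)."""
    return url.replace("dl=0", "dl=1")


# Hypothetical repo path, used only to show the rewrite
raw_url = github_blob_to_raw(
    "https://github.com/some-user/some-repo/blob/master/data/example.csv"
)
print(raw_url)
```

The resulting URL can then be passed to `pd.read_csv(url)` as described above.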

___

# Original Topic 42 Notebook Continued

## Learning Objectives

- Learn about Word Embeddings.
    - Discuss word Embeddings and their advantages
    - Training Word2Vec models
    - ~~Using pretrained word embeddings~~ [Another time]
    
- Learn about Sequence Models and Recurrent Neural Networks
    - LSTMs with word embeddings. 
    
    
- Activity: Predicting Stack Overflow post quality. 

- Learn about Tuning Neural Networks
    - Discuss the different options available for tuning neural networks

    - Discuss some rules of thumb for tuning Neural Networks

    - Learn how to use GridSearchCV with Keras neural networks.

    - ~~Learn how to create your own custom scorer for sklearn (and why you'd want to)~~

## Questions

- What is an exploding Gradient?
- Do we have the ability to set max weights?


# Appendix Topic: Deep Natural Language Processing

## Review: NLP & Word Vectorization

> - As a reminder, machine learning models need text to be converted to numbers ("vectorization") before training the model. 
    - We used frequency counts or tf-idf values to produce numeric values for each word. 
    - We trained the models to look for the presence/absence of words to classify texts.
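
As a quick refresher, count vectorization can be sketched in plain Python. The two example documents and the helper name `count_vector` are made up for illustration; in practice we used sklearn's vectorizers.

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["the dog chased the cat", "the cat slept"]

# The vocabulary is every unique word, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc, vocab):
    """Return how many times each vocabulary word appears in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [count_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes one row of counts, and nothing in that row records the order in which the words appeared.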

# 📚 Word Embeddings

- Word embeddings are vector representations of words that capture their **semantic meaning**.
- Each vector has a fixed length that we choose (typically around 100 dimensions).


- Convert words into a vector space
    - Each word gets its own unique vector. 
    - Vectors capture how similar various words are.
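
"How similar" is usually measured with cosine similarity between two word vectors. The sketch below uses tiny 3-dimensional vectors with made-up values (real embeddings are learned and much longer), so only the mechanics carry over.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "error": [0.1, 0.0, 0.9],
}

print(cosine_similarity(toy_vectors["dog"], toy_vectors["cat"]))
print(cosine_similarity(toy_vectors["dog"], toy_vectors["error"]))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0.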
   

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-word-embeddings-online-ds-ft-100719/master/images/vectors.png">

>- Once we have word embeddings, we can actually identify related words based on meaning. 

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-word-embeddings-online-ds-ft-100719/master/images/embeddings.png">

## Resources

- [How Embeddings are Created](https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/natural-language-processing/word2vec)
- [Creating Word Embeddings: Coding the Word2Vec Algorithm in Python using Deep Learning](https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8)
- Kaggle Tutorial:  https://www.kaggle.com/learn/embeddings
- Google Embedding Crash Course: https://developers.google.com/machine-learning/crash-course/embeddings

## Word2Vec

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-using-word2vec-online-ds-ft-100719/master/images/training_data.png">

### Skip-Gram Model

- Train a shallow MLP to find the weights that best map each word to the words around it
- Because words that appear close to one another usually share context, we're _really_ teaching it context in those weights
- Gut check: words that appear in similar contexts can be exchanged
    + EX: "A fluffy **dog** is a great pet" <--> "A fluffy **cat** is a great pet"

- By training a text-generation model, we wind up with a lookup table where each word has its own vector 

- Resource: 
    - [skip-gram vs CBOW methods](https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314)

#### How to create an embedding:
- Resources:
    - [How Embeddings are Created](https://calvinfeng.gitbook.io/machine-learning-notebook/supervised-learning/natural-language-processing/word2vec)
    - [Creating Word Embeddings: Coding the Word2Vec Algorithm in Python using Deep Learning](https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8)


- To create a word embedding, we train a shallow neural network for a fake task. 

    - The fake task is to use one-hot-encoded text data to then predict the probability of seeing every other word in the corpus within the same context as the one-hot-encoded word.
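
The training data for that fake task is just (target word, context word) pairs taken from a sliding window. A minimal sketch, assuming a window of 1 word on each side and a hypothetical helper name `skipgram_pairs`:

```python
def skipgram_pairs(tokens, window=2):
    """Build (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "a fluffy dog is a great pet".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
```

The network sees the one-hot-encoded target as input and is trained to assign high probability to its context words. Once trained, we keep the hidden-layer weights as the embeddings.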
    

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-using-word2vec-online-ds-ft-100719/master/images/new_skip_gram_net_arch.png">


<img src="https://raw.githubusercontent.com/learn-co-students/dsc-using-word2vec-online-ds-ft-100719/master/images/new_word2vec_weight_matrix_lookup_table.png">

## GloVe - Global Vectors for Word Representation

### Transfer Learning

- Embeddings usually have hundreds of dimensions
- Instead of training our own, we can reuse word embeddings that were already learned on another corpus!
    + Unless the text uses very specific terminology, context will likely carry over within a language
- Comparable to CNN transfer learning

# 🕹 **Activity: Creating our own word embeddings**

### Data: Stack Overflow Questions

- Stack Overflow Questions: https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate


- Kaggle Description:
    - We collected 60,000 Stack Overflow questions from 2016-2020 and classified them into three categories:

        - HQ: High-quality posts with a total of 30+ score and without a single edit.
        - LQ_EDIT: Low-quality posts with a negative score, and multiple community edits. However, they still remain open after those changes.
        - LQ_CLOSE: Low-quality posts that were closed by the community without a single edit.

In [None]:
!pip install -U fsds
from fsds.imports import *

In [None]:
from tensorflow.random import set_seed
set_seed(321)

import tensorflow as tf
print(tf.__version__)

import numpy as np 
np.random.seed(321)

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os,sys

plt.rcParams['figure.figsize'] = [8,4]

In [None]:
## Get data from github
url ='https://github.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/blob/master/Phase_4/topic_42_tuning_neural_networks/data/stack_overflow.csv.gz?raw=true'
df = pd.read_csv(url,compression='gzip')

In [None]:
df['Y'].value_counts()

### Dealing with HTML Tags

- First, should we remove them?
- If yes, use beautiful soup?

In [None]:
## Getting text example for dealing with html tags
test_body = df.loc[4,'Body']
test_body

In [None]:
from bs4 import BeautifulSoup
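## name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
test_soup = BeautifulSoup(test_body, 'html.parser')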
test_soup.text

In [None]:

df['soups'] = df['Body'].map(lambda x: BeautifulSoup(x, 'html.parser').text)
df

In [None]:
## join together title and body. 
df['text'] = df['Title']+'; '+df['soups']
df

## Creating Word Embeddings with `Word2Vec`

### Resources:

- Two Part Word2Vec Tutorial  (linked from Canvas)
    - [Part 1: The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
    - [Part 2: Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)
    
- White Paper on word2vec (downloads file):
    - https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf


- `sentences`: dataset to train on (an iterable of tokenized documents)
- `size`: length of each word vector (renamed `vector_size` in gensim 4.x)
- `window`: how many words on either side of the target word count as context
- `min_count`: minimum number of times a word must appear in the corpus to be kept; rarely used words are dropped
- `workers`: number of threads (individual task "workers")

```python
from gensim.models import Word2Vec

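## passing the data when initializing builds the vocab and runs an initial round of training
model = Word2Vec(data, size=100, window=5, min_count=1, workers=4)

## .train() requires an explicit number of epochs
model.train(data, total_examples=model.corpus_count, epochs=model.epochs)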

```

<!-- 
#### Word2Vec params

```python
## For initializing model
sentences=None,
    size=100,
    alpha=0.025,
    window=5,
    min_count=5,
    max_vocab_size=None,
    sample=0.001,
    seed=1,
    workers=3,
    min_alpha=0.0001,
    sg=0,
    hs=0,
    negative=5,
    cbow_mean=1,
    hashfxn=<built-in function hash>,
    iter=5,
    null_word=0,
    trim_rule=None,
    sorted_vocab=1,
    batch_words=10000,
    compute_loss=False,
    callbacks=(),
    
    
## For training 
    sentences,
    total_examples=None,
    total_words=None,
    epochs=None,
    start_alpha=None,
    end_alpha=None,
    word_count=0,
    queue_factor=2,
    report_delay=1.0,
    compute_loss=False,
    callbacks=(),
``` -->

In [None]:
## NLP imports
from nltk import word_tokenize, TweetTokenizer, regexp_tokenize
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec
import gensim
print(gensim.__version__)

from sklearn.model_selection import train_test_split

In [None]:
## use gensim's simple preprocess
df['cleaned-text'] = df['text'].map(lambda x: simple_preprocess(x, deacc=True))
df

### 🎛 Setting the Embedding Size, Training Word2Vec

In [None]:
## set the embedding size (normally I'd do 100, but doing 50 for time)
EMBEDDING_SIZE = 50
# EMBEDDING_SIZE = 100
## initialize the w2v model
w2v_model = Word2Vec(df['cleaned-text'], size=EMBEDDING_SIZE, window=5,
                     min_count=3, workers=4, seed=321)

w2v_model.corpus_count

In [None]:
## Train w2v model
w2v_model.train(df['cleaned-text'],total_words=w2v_model.corpus_total_words,
                epochs=w2v_model.epochs)

In [None]:
## w2v saves word vectors as .wv
wv = w2v_model.wv
wv.index2word

In [None]:
type(wv)

In [None]:
## wv's vocab contains all words it learned


In [None]:
## wv can be used as a dictionary to extract word vectors
wv['python']

In [None]:
## Get the 20 words most similar to 'python'
wv.most_similar('python', topn=20)

In [None]:
## Can get words that are similar or dissimilar to a specific word
# +python, -error, top 20
wv.most_similar(positive=['python'], 
                negative=['error'],topn=20)

In [None]:
##  can also do math on vectors 
# creating "frustrating error" from frustrating and error
frustrating_error = wv['frustrating'] + wv['error']
frustrating_error

In [None]:
## Can also get most_similar to a word vector
# get the most similar words to our calculated frustrating_error

wv.most_similar([frustrating_error])

In [None]:
# del w2v_model, df

___

## Using Embeddings in Classification

### Embedding Layers
You should make note of a couple caveats that come with using embedding layers in your neural network -- namely:

* The embedding layer must always be the first layer of the network, meaning that it should immediately follow the `Input()` layer 
* All words in the text should be integer-encoded, with each unique word encoded as its own unique integer  
* The input size of the embedding layer must always be greater than the total vocabulary size of the dataset (vocabulary size + 1, since index 0 is reserved for padding)! The first parameter (`input_dim`) denotes the vocabulary size, while the second (`output_dim`) denotes the size of the actual word vectors
* The size of the sequences passed in as data must be set when creating the layer (all data will be converted to padded sequences of the same size during the preprocessing step) 
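
Conceptually, an embedding layer is a lookup table indexed by those integers. The pure-Python sketch below is not the Keras implementation: the sentences are made up, the vectors are random instead of learned, and zeroing the padding row is a simplification.

```python
import random

random.seed(0)

# Toy tokenized texts (made up for illustration)
sentences = [["how", "to", "sort", "a", "list"], ["sort", "a", "dict"]]

# Integer-encode: index 0 is reserved for padding, so real words start at 1
word_index = {}
for sentence in sentences:
    for word in sentence:
        word_index.setdefault(word, len(word_index) + 1)

vocab_size = len(word_index)
embedding_dim = 4

# Lookup table with vocab_size + 1 rows; row 0 is the padding row
embedding_table = [[0.0] * embedding_dim] + [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(sequence):
    """Replace each integer in the sequence with its row of the table."""
    return [embedding_table[i] for i in sequence]


encoded = [word_index[word] for word in sentences[1]]
print(encoded)
print(embed(encoded))
```

This is why the table needs one more row than there are words: the largest word index equals the vocabulary size.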


[Keras Documentation for Embedding Layers](https://keras.io/layers/embeddings/).

# 📚 Sequence Models - Recurrent Neural Networks

- One of the main disadvantages of classical machine learning NLP is that the models only assess the presence or absence of words. 
- They are not analyzing the words in the context of the sentence.
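
A toy sketch of why this matters: two sentences with the same words in a different order look identical to a bag-of-words model, but a recurrent update that carries a hidden state from word to word ends up in different states. The 1-dimensional "embeddings" in `word_values` and the weights `w_in` and `w_rec` are made-up numbers, not a trained model.

```python
import math
from collections import Counter

# Made-up 1-dimensional "embeddings", for illustration only
word_values = {"dog": 0.5, "bites": -0.3, "man": 0.8}


def rnn_final_state(tokens, w_in=1.0, w_rec=0.5):
    """Run a one-unit recurrent update over the tokens and return the last state."""
    h = 0.0
    for token in tokens:
        # The new state depends on the current word AND the previous state
        h = math.tanh(w_in * word_values[token] + w_rec * h)
    return h


a = "dog bites man".split()
b = "man bites dog".split()

print(Counter(a) == Counter(b))                 # same bag of words
print(rnn_final_state(a), rnn_final_state(b))   # different final states
```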

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-understanding-recurrent-neural-networks-online-ds-ft-100719/master/images/unrolled.gif">

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-sequence-model-use-cases-online-ds-ft-100719/master/images/rnn.gif">

## LSTMs & GRUs

- GRUs (Gated Recurrent Units)
    - Reset Gate
    - Update Gate
    
- LSTMs (Long Short-Term Memory cells)
   - Input Gate
   - Forget Gate
   - Output Gate

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-sequence-model-use-cases-online-ds-ft-100719/master/images/RNN-unrolled.png">

Each word will have a vector of contexts: the embeddings!

# 🕹 **Activity Part 2: Text Classification with Embeddings & Sequences**

### Defining The Target

In [None]:
## What is the distribution of classes in our target?
df['Y'].value_counts()

In [None]:
## Remapping target
target_map = {"LQ_CLOSE":0, 
              'LQ_EDIT': 1,
              "HQ":2}

In [None]:
## map targets
df['target'] = df['Y'].replace(target_map)
df['target'].value_counts()

#### *About that multi-classification*...
- After a couple hours of fighting to improve the metrics for the 3-class task, I decided to create a Hot-Dog/Not-Hot-Dog classifier. 

<img src="https://github.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/blob/master/Phase_4/topic_42_tuning_neural_networks/images/hot_dog_not_hot_dog.png?raw=1" width=30%>

In [None]:
## Making our hot-dog/not-dog target
target_map_binary = {"LQ_CLOSE":0, 
                      'LQ_EDIT': 0,
                      "HQ":1}
df['target_binary'] = df['Y'].map(target_map_binary)

df['target_binary'].value_counts()

In [None]:
from tensorflow.keras import layers,optimizers,callbacks, models
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing import text,sequence

# from keras.preprocessing import text,sequence
from sklearn.model_selection import train_test_split
from nltk import word_tokenize
from gensim.utils import simple_preprocess
from sklearn import metrics

### 🎛 Defining X,y + train test split

In [None]:
## Make X and y_t
X = df['cleaned-text'].copy()

# y_t = to_categorical(df['target'])
y_t = to_categorical(df['target_binary'])

y_t

In [None]:
X_train, X_test, y_train, y_test =train_test_split(X,y_t,test_size=0.3,
                                                   random_state=123) 
print(X_train.shape,y_test.shape)
X_test

### Tokenizing with Keras

In [None]:
## Keras has its own Tokenizer
tokenizer = text.Tokenizer(num_words=50000)
tokenizer.fit_on_texts(X_train)

In [None]:
## tokenizer has assigned integer lookup value for each word
tokenizer.word_index

In [None]:
# lookup the words using their integer
tokenizer.index_word

In [None]:
## Use tokenizer to convert texts to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

## what does 1 sequence look like?
print(X_train_seq[0])

In [None]:
## can lookup the words via the tokenizer's index_word 
print(' '.join([tokenizer.index_word[w] for w in X_train_seq[0]]))

In [None]:
## We need to get all sequences to the same length
# what is the len of each sequence? (using the training set so the cutoff isn't picked from test data)
seq_lens = [len(x) for x in X_train_seq]
seq_lens[:5]

In [None]:
## what is the longest sequence?
max(seq_lens)

In [None]:
## visualize distribution of lengths
sns.histplot(seq_lens)

In [None]:
## What would be an approx cutoff for outliers?
sns.boxplot(x=seq_lens)
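
One way to turn the boxplot into a number is the 1.5 × IQR rule that boxplot whiskers use by default. The lengths in `example_lens` below are made up; with the real data you would pass `seq_lens` instead. The exact quartile values can differ slightly between libraries, so treat the result as approximate.

```python
import statistics

# Made-up sequence lengths with one extreme outlier
example_lens = [12, 18, 25, 30, 33, 41, 47, 52, 60, 75, 90, 400]

# Quartiles, then the upper "fence" beyond which points count as outliers
q1, _, q3 = statistics.quantiles(example_lens, n=4)
upper_fence = q3 + 1.5 * (q3 - q1)

# Share of sequences that would fit without truncation at that cutoff
coverage = sum(length <= upper_fence for length in example_lens) / len(example_lens)

print(f"upper fence: {upper_fence:.1f}, coverage: {coverage:.0%}")
```

A cutoff near the fence keeps most texts intact while preventing a few very long posts from dictating the padded length.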

### 🎛 Setting Max Sequence Length

In [None]:
## Defining a max sequence length
MAX_SEQUENCE_LENGTH = 100
MAX_SEQUENCE_LENGTH

In [None]:
## Plot our cutoff
## visualize distribution of lengths
ax = sns.histplot(seq_lens)
ax.axvline(MAX_SEQUENCE_LENGTH)

In [None]:
## pad X_train_seq and X_test_seq
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=MAX_SEQUENCE_LENGTH)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=MAX_SEQUENCE_LENGTH)

In [None]:
seq_lens = [len(x) for x in X_test_pad]
seq_lens[:10]
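
To see what padding does to a single sequence, here is a plain-Python sketch of the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`). The helper name `pad_pre` is hypothetical and it assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the LAST maxlen items of long ones."""
    seq = list(seq)[-maxlen:]                    # pre-truncation: drop from the front
    return [value] * (maxlen - len(seq)) + seq   # pre-padding: zeros go in front


print(pad_pre([1, 2, 3], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

With these defaults a long post keeps its ending, not its beginning. If the start of the text matters more, `pad_sequences` also accepts `truncating='post'`.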

#### Making Our Neural Networks

In [None]:
## Set the max words equal to tokenizer's word index
MAX_WORDS = len(tokenizer.word_index)
MAX_WORDS

In [None]:
## Save num classes for final layer
n_classes = y_train.shape[1]
n_classes

## 🏋️ Fitting Our First Model

In [None]:
EMBEDDING_SIZE

In [None]:
def make_model():
    """Make a neural network with a new embedding layer, 
    an LSTM layer with 25 units, and a final Dense layer appropriate for the task"""
    model = models.Sequential()
    model.add(layers.Embedding(MAX_WORDS+1, EMBEDDING_SIZE))
    model.add(layers.LSTM(25, return_sequences=False))
    model.add(layers.Dense(n_classes, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                 metrics=['accuracy', tf.keras.metrics.Recall(name='recall')])
    model.summary()  # summary() prints itself and returns None
    return model

In [None]:
%%time
## make model and fit 
model= make_model()
## (`workers` only applies to generator inputs, so it is omitted here)
history = model.fit(X_train_pad, y_train, batch_size=256, epochs=3,
                    validation_split=0.2)

### Model Evaluation Functions

>- Below I broke down the larger evaluation function introduced in study group last week. 
    - I've created several helper functions to simplify the code for the evaluation function.
    - Additionally, we can now use those smaller functions when we don't need a full model evaluation

In [None]:
### BREAKING OUR BIG FUNCTION UP INTO HELPER FUNCTIONS
def plot_history(history,model,figsize=(8,4)):
    """Takes a keras history and model and plots 
    all metrics in separate plots for each metric"""
#     print(header,'\t[i] MODEL HISTORY',header,sep='\n')

    ## Make a dataframe out of history
    res_df = pd.DataFrame(history.history)#.plot()

    ## Plot Losses
    plot_kws = dict(marker='o',ls=':',lw=2,figsize=figsize)

    ## Plot all metrics
    metrics_list = model.metrics_names

    for metric in metrics_list:
        ax = res_df[[col for col in res_df.columns if metric in col]].plot(**plot_kws)
        ax.set(xlabel='Epoch',ylabel=metric,title=metric)
        ax.grid()
        ax.xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True))
        plt.show()


## testing function
plot_history(history, model)

In [None]:
def evaluate_scores(model,X_train,y_train,label='Training',verbose=0):
    """Evaluates a keras model and prints the scores using the provided label."""
    train_scores  = model.evaluate(X_train,y_train,verbose=verbose)#score()
    for i,metric in enumerate(model.metrics_names):
        print(f"\t{label} {metric}: {train_scores[i]:.3f}")

evaluate_scores(model, X_train_pad,y_train,verbose=1)
evaluate_scores(model, X_test_pad,y_test,verbose=1,label='Test')

In [None]:
def classification_report_cm(model, X_train,y_train,label='TRAINING DATA',
                             cm_figsize=(6,6),normalize='true',cmap='Greens'):
    """Gets predictions from a Keras neural network and get 
    classification report and confusion matrix."""
    ## Print report header, get preds, get class report, and conf matrix
    header =  '==='*24
    print(header,f"\t[i] CLASSIFICATION REPORT - {label}",header,sep='\n')
    print()
    
    ## Get predictions
    y_hat_train = model.predict(X_train)
    
    ## convert to 1D targets
    y_train_class =y_train.argmax(axis=1)
    y_hat_train_class = y_hat_train.argmax(axis=1)
    
    
    ## Get classification report 
    print(metrics.classification_report(y_train_class,y_hat_train_class))
    print()
    
    
    ## Plot the confusion Matrix
    cm = metrics.confusion_matrix(y_train_class, y_hat_train_class,
                                  normalize=normalize)
    
    fig,ax = plt.subplots(figsize=cm_figsize)
    sns.heatmap(cm, cmap=cmap, annot=True,square=True,ax=ax)
    ax.set(ylabel='True Class',xlabel='Predicted Class',
           title=f'Confusion Matrix - {label}')
    plt.show()

    
classification_report_cm(model,X_test_pad, y_test, label='TEST DATA')   

In [None]:
def evaluate_network(model, X_test, y_test, history=None, 
                        X_train = None, y_train = None,
                        history_figsize = (8,4), cm_figsize=(8,8),
                        cmap='Greens', normalize='true'):
    """Gets predictions and evaluates a classification model using
    sklearn.

    Args:
        model (classifier): a fit keras classification model.
        X_test (tensor/array): X data
        y_test (tensor/array): y data
        history (History object): model history from .fit
        X_train (tensor/array): If provided, compare model.score 
                                for train and test. Defaults to None.
        y_train (Series or Array, optional): If provided, compare model.score 
                                for train and test. Defaults to None.
                                
        history_figsize (tuple): figsize for each metric's history plot.
        cm_figsize (tuple): figsize for confusion matrix plot
      
        cmap (str, optional): Colormap for confusion matrix. Defaults to 'Greens'.
        normalize (str, optional): normalize argument for plot_confusion_matrix. 
                                    Defaults to 'true'.  
    """
    
    header =  '==='*24
    
    ## First, Plot History, if provided.
    if history is not None:
        print(header,'\t[i] MODEL HISTORY',header,sep='\n')
        plot_history(history,model,figsize=history_figsize)
        
        
    ## Evaluate Network for loss/acc scores
    print(header,"\t[i] EVALUATING MODEL",header,sep='\n')
    print()
    if X_train is not None:
        try:
            evaluate_scores(model,X_train,y_train,label='Training')
            print()

        except Exception as e:
            print("Error evaluating for accuracy for training data:")
            print(e)
        

    ## Evaluate test data
    evaluate_scores(model,X_test,y_test,label='Test')
    print("\n")

    
    ## Report for training data
    if X_train is not None:
        classification_report_cm(model, X_train, y_train, cmap=cmap,
                                 normalize=normalize,
                                 label='TRAINING DATA',cm_figsize=cm_figsize)       
        print('\n'*2)
    ## Report for test data
    classification_report_cm(model,X_test,y_test, cmap=cmap,
                             normalize=normalize,
                             label='TEST DATA',cm_figsize=cm_figsize)

    

In [None]:
## make, fit, and evaluate the model
model = make_model()
history = model.fit(X_train_pad, y_train, epochs=3,
                    batch_size=128, validation_split=0.2)
evaluate_network(model,X_test_pad,y_test,history,
                X_train = X_train_pad,y_train=y_train)

>- **Q: What's one thing we haven't addressed as part of the classification-modeling workflow?**
    - Dealing with class imbalance.

### Compute Class Weights

In [None]:
## check class balance
y_tr_classes = pd.Series(y_train.argmax(axis=1))
y_tr_classes.value_counts(1)

> Neural networks accept class weights, but cannot calculate them on their own the way sklearn models do with `class_weight="balanced"`.
    - We can use sklearn's function to calculate the class weights to use in our neural network
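
For intuition, the "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`. A plain-Python sketch with made-up labels (the helper name `balanced_class_weights` is hypothetical):

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Made-up labels: 6 examples of class 0 and 2 examples of class 1
toy_labels = [0] * 6 + [1] * 2
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

The rarer class gets the larger weight, so its mistakes count more in the loss.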

In [None]:
from sklearn.utils.class_weight import compute_class_weight

## Get the array of weights for each unique class
## classes and y are passed as keywords (required by newer sklearn versions)
weights= compute_class_weight(
           'balanced',
            classes=np.unique(y_tr_classes),
            y=y_tr_classes)
weights

In [None]:
## Turn the weights into a dict with the class name as the key
weights_dict = dict(zip( np.unique(y_tr_classes),weights))
weights_dict

In [None]:
## make, fit, and evaluate the model (now with class weights)
model = make_model()
history = model.fit(X_train_pad, y_train, epochs=3,
                    batch_size=128, validation_split=0.2,
                    class_weight=weights_dict)
evaluate_network(model,X_test_pad,y_test,history,
                X_train = X_train_pad,y_train=y_train)

## 🏋️ Using our Previously Trained Word2Vec Embeddings in an Embedding Layer
- https://ppasumarthi-69210.medium.com/word-embeddings-in-keras-be6bb3092831

In [None]:
## Saving the total number of words as vocab size
vocab_size = len(tokenizer.index_word)

## Double check current embedding size and vocab size
vocab_size, EMBEDDING_SIZE

In [None]:
### make a matrix of embedding weights
embedding_matrix = np.zeros((vocab_size+1, EMBEDDING_SIZE))

## for each item in the word index
for word, i in tokenizer.word_index.items():

    ## if word in w2vec model, fill in the embedding matrix
    if word in wv:
        embedding_vector = wv[word]
        embedding_matrix[i] = embedding_vector
embedding_matrix

In [None]:
## Make a keras embedding layer with embedding_matrix
embedding_layer = layers.Embedding(vocab_size+1,EMBEDDING_SIZE,
                                  weights=[embedding_matrix],
                                  input_length=MAX_SEQUENCE_LENGTH,
                                  trainable=False)

In [None]:
## update our make_model func to use the pretrained embedding layer
def make_w2v_model():
    """Make a neural network with the pretrained (frozen) Word2Vec embedding layer, 
    an LSTM layer with 25 units, and a final Dense layer appropriate for the task"""
    model = models.Sequential()
    embedding_layer = layers.Embedding(vocab_size+1,EMBEDDING_SIZE,
                                  weights=[embedding_matrix],
                                  input_length=MAX_SEQUENCE_LENGTH,
                                  trainable=False)
    model.add(embedding_layer)
#     model.add(layers.Embedding(MAX_WORDS+1, EMBEDDING_SIZE))
    model.add(layers.LSTM(25, return_sequences=False))
    model.add(layers.Dense(n_classes, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                 metrics=['accuracy', tf.keras.metrics.Recall(name='recall')])
    model.summary()  # summary() prints itself and returns None
    return model

## make, fit, and evaluate the model
model = make_w2v_model()
history = model.fit(X_train_pad, y_train, epochs=3,
                    batch_size=128, validation_split=0.2,
                    class_weight=weights_dict)
evaluate_network(model,X_test_pad,y_test,history,
                X_train = X_train_pad,y_train=y_train)

# 📚 Overview  - Neural Network Tuning 

## Helpful Resources

- [Medium: Simple Guide to Hyperparameter Tuning in Neural Networks](https://towardsdatascience.com/simple-guide-to-hyperparameter-tuning-in-neural-networks-3fe03dad8594)
- [Medium: A guide to an efficient way to build neural network architectures- Part I:](https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-i-hyper-parameter-8129009f131b)
- [Medium: Optimizers for Neural Networks](https://medium.com/@sdoshi579/optimizers-for-training-neural-network-59450d71caf6)

## Rules of Thumb for Training Neural Networks

- **Always use a train-test-validation split.**
    - **Train-test-val splits:**
        - Training set: for training the algorithm
        - Validation set: used during training to monitor performance and compare hyperparameter choices
        - Testing set: after choosing the final model, use the test set for an unbiased estimate of performance.
    - **Set sizes:**
        - With big data, your val and test sets don't necessarily need to be 20-30% of all the data. 
        - You can choose validation (hold-out) and test sets that are 1-5% of the data each. 
            - e.g. 96% train, 2% validation (hold-out), 2% test set.
            
            
- Consider using a `np.random.seed` for reproducibility/comparing models


- **Use cross validation of some sort to compare Networks**


- Normalize/Standardize features
    
    
- **Add EarlyStopping and ModelCheckpoint [callbacks](https://keras.io/api/callbacks/)**
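
The idea behind early stopping fits in a few lines: stop once the monitored validation loss has gone `patience` epochs without improving, and remember which epoch was best. This is a plain-Python sketch of that logic, not the Keras callback itself; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop training here
    return len(val_losses) - 1, best_epoch  # never triggered: ran every epoch


# Made-up validation losses: improvement stalls after epoch 2
val_losses = [0.70, 0.55, 0.50, 0.52, 0.53, 0.51]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In Keras, `callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` covers the stopping part, and `callbacks.ModelCheckpoint` with `save_best_only=True` saves the model from the best epoch. Both are passed to `model.fit` through the `callbacks` argument.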

    

#### Dealing with Bias/Variance

- Balancing Bias/Variance:
    - High Bias models are **underfit**
    - High Variance models are **overfit**



- **Rules of thumb re: bias/variance trade-off:**

| High bias? (poor training performance) | High variance? (poor validation performance) |
|---------------|-------------|
| Use a bigger network | More data |
| Train longer | Regularization |
| Look for other existing NN architectures | Look for other existing NN architectures |


## Rules of Thumb - Hyperparameters to Tune 


- This section is partially based on the blog post below. 
- However, I ordered the steps with my recommended order of importance/what-to-tune-first
- [Blog Post](https://towardsdatascience.com/a-guide-to-an-efficient-way-to-build-neural-network-architectures-part-i-hyper-parameter-8129009f131b)

### Hyperparameters 

- Note: outline below is meant for Dense layers but will also generally be true for other layer types.

1. Number of layers (depends on the size of training data)


2. Number of neurons (depends on the size of training data)


3. Activation functions
    - Popular choices:
        - relu / leaky-relu
        - sigmoid / tanh (for shallow networks)
        
        
4. Optimizer:
    - Popular choices:
        - SGD (works well for shallow but gets stuck in local minima/saddle-points - if so use RMSProp)
        - RMSProp
        - Adam (general favorite)
        
        
5. Learning Rate
    - Try in powers of 10 (0.001, 0.01, 0.1, 1.0)
    - The best learning rate depends on the optimizer, so start from the typical values below (but try the others too).
        - SGD: 0.1
        - Adam: 0.001/0.01
    - Can use the `decay` parameter to reduce the learning rate over time (though it is better to use an adaptive optimizer than to adjust this).

6. Batch Size
    - Finding the "right" size is important
        - Too small = weights update too often on noisy batches and convergence is difficult
        - Too large = weights update too rarely (plus PC RAM issues)
    - Try batch sizes that are powers of 2 (for memory optimization)
    - When in doubt, larger is better than smaller.
    
    
7. Number of Epochs:
    - Important parameter to tune
    - Use EarlyStopping callback to prevent overfitting
    

8. Adding Dropout
    - Usually use dropout rate of 0.2 to 0.5
    
    

9. Adding regularization (L1,L2)
    - Use when the model continues to over-fit even after adding Dropout
    
    
10. Initialization
    - Not as important, since the default (glorot-uniform) works well, but:
        - Use He-normal/uniform initialization when using ReLU
        - Use Glorot-normal/uniform when using Sigmoid
    - Avoid using all zeros or any constant for all neurons
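
The "powers of 10" and "powers of 2" suggestions above can be turned into a list of candidate settings. This is a minimal sketch: the specific ranges and the dictionary keys are assumptions chosen for illustration, and each candidate would still need to be passed to your own model-building and fitting code.

```python
from itertools import product

# Candidate values following the rules of thumb (ranges are illustrative)
learning_rates = [10 ** p for p in range(-3, 1)]   # 0.001, 0.01, 0.1, 1
batch_sizes = [2 ** p for p in range(5, 9)]        # 32, 64, 128, 256
dropout_rates = [0.2, 0.35, 0.5]

# Every combination of the three hyperparameters
candidates = [
    {"learning_rate": lr, "batch_size": bs, "dropout": dr}
    for lr, bs, dr in product(learning_rates, batch_sizes, dropout_rates)
]
print(len(candidates))  # 48
```

Even this small grid means 48 separate training runs, which is why it helps to tune the most important hyperparameters first.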



### Easy-to-Add options to fight overfitting

#### Dropout 
<img src="https://raw.githubusercontent.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/master/Phase_4/topic_40-41_neural_networks/CL%20Repos/ds-neural_network_architecture-video/img/drop_out.png">
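
To make the picture concrete, here is a small sketch of "inverted" dropout on a list of activations: each value is zeroed with probability `rate`, and the survivors are scaled up so the expected total stays the same. The function name and the toy inputs are assumptions for illustration; in practice you would just add a `Dropout` layer.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale survivors by 1/(1-rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)                 # seeded so the mask is repeatable
activations = [1.0] * 10
print(dropout(activations, rate=0.5, rng=rng))   # a mix of 0.0 and 2.0
print(dropout(activations, rate=0.0, rng=rng))   # unchanged: nothing is dropped
```

Dropout is only active during training; at prediction time all units are kept.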

#### Early Stopping

- Monitor a metric (e.g. validation loss) during training and stop once it has gotten worse or plateaued for a set number of epochs.

- **In Keras:**
    - Can be applied using the [callbacks function](https://keras.io/callbacks/)
```python    
from tensorflow.keras.callbacks import EarlyStopping
EarlyStopping(monitor='val_loss', patience=5)
```
    - `monitor` denotes the quantity to check
    - `'val_loss'` denotes the validation loss
    - `patience` denotes the # of epochs without improvement before stopping.
        - Be careful, as sometimes models _will_ continue to improve after a stagnant period
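
The effect of `patience` can be seen with a simplified, pure-Python imitation of the stopping rule. This is a sketch of the idea only, not Keras's actual implementation, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch where training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0                                  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0              # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

val_losses = [1.0, 0.8, 0.9, 0.85, 0.7]       # improves again at epoch 5
print(early_stopping_epoch(val_losses, patience=2))  # 4
print(early_stopping_epoch(val_losses, patience=3))  # None
```

With `patience=2` training stops at epoch 4 and never reaches the better loss at epoch 5, which is exactly the "stagnant period" risk noted above.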


### Hyperparameter Details

#### Kernel Initialization
- Kernel Initializers
```python
# define the grid search parameters
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero', 
             'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
```


#### Loss Functions
- MSE (regression)
- categorical cross-entropy (multi-class classification with one-hot encoded, 2D labels)
    - sparse categorical cross-entropy (multi-class classification with integer, 1D labels)
- binary cross-entropy (binary classification)
    - 2 categories
- **can also use custom loss functions**
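
The difference between the cross-entropy variants is only in how the labels are encoded. The sketch below computes each loss by hand for a single sample; the function names mirror the Keras loss names, but these are simplified stand-ins written for illustration, and the probabilities are made up.

```python
import math

def categorical_crossentropy(y_onehot, y_prob):
    """Label is a one-hot vector, e.g. [0, 1, 0]."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_prob))

def sparse_categorical_crossentropy(y_index, y_prob):
    """Label is the integer class index, e.g. 1."""
    return -math.log(y_prob[y_index])

def binary_crossentropy(y, p):
    """Label is 0 or 1; p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

probs = [0.1, 0.7, 0.2]   # predicted probabilities for 3 classes
print(categorical_crossentropy([0, 1, 0], probs))
print(sparse_categorical_crossentropy(1, probs))   # same value, different label format
print(binary_crossentropy(1, 0.7))
```

The first two calls return the same loss, so the choice between them depends only on whether your `y` is one-hot encoded or a column of integer class labels.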

## Using Regularization

### L1 & L2 Regularization
- These methods regularize a model by penalizing large coefficients (regression) or weights (neural networks).
    - L1 & L2 exist in regression models as well. There, L1 = 'Lasso regression' and L2 = 'Ridge regression'.
    
<!--     

- **L1 & L2 regularization add a term to the cost function.**

$$Cost function = Loss (say, binary cross entropy) + Regularization term$$

$$ J (w^{[1]},b^{[1]},...,w^{[L]},b^{[L]}) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y^{(i)}, y^{(i)})+ \dfrac{\lambda}{2m}\sum^L_{l=1}||w^{[l]}||^2$$
- where $\lambda$ is the regularization parameter. 

- **The difference between  L1 vs L2 is that L1 is just the sum of the weights whereas L2 is the sum of the _square_of the weights.**  
 -->

<br><br>
- **L1 Regularization:**
    $$ \text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum ||w||$$
    - Uses the absolute value of the weights and may reduce some weights all the way down to 0. 
    
        
- **L2 Regularization:**
    $$ \text{Cost function} = \text{Loss} + \frac{\lambda}{2m} \sum ||w||^2$$
    - Also known as weight decay, as it forces weights to decay towards zero, but never exactly 0.
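
The two penalty terms can be computed by hand to see how they differ. This sketch follows the formulas above literally; the loss value, weights, $\lambda$ and $m$ are made-up numbers.

```python
def l1_cost(loss, weights, lam, m):
    """Loss plus lambda/(2m) times the sum of absolute weights."""
    return loss + lam / (2 * m) * sum(abs(w) for w in weights)

def l2_cost(loss, weights, lam, m):
    """Loss plus lambda/(2m) times the sum of squared weights."""
    return loss + lam / (2 * m) * sum(w ** 2 for w in weights)

weights = [0.5, -2.0, 0.0]      # toy weights
loss, lam, m = 0.3, 0.01, 10    # toy loss, regularization strength, sample count

print(l1_cost(loss, weights, lam, m))   # penalty grows linearly with |w|
print(l2_cost(loss, weights, lam, m))   # the large weight (-2.0) dominates the penalty
```

Note that Keras's built-in `regularizers.l1` / `regularizers.l2` add `factor * sum(|w|)` or `factor * sum(w**2)` directly, without the $\frac{1}{2m}$ scaling, so the factor you pass to Keras is not numerically identical to the $\lambda$ in these formulas.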

#### L1/L2 Regularization

- **CHOOSING L1 OR L2:**
    - L1 is very useful when trying to compress a model (since weights can decrease to 0).
    - L2 is generally preferred otherwise.
    
    
- **USING L1/L2 IN KERAS:**
    - Add a `kernel_regularizer` to a layer.
```python 
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense
# `model` is an existing Sequential model
model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01)))
```
    - here 0.01 = $\lambda$

# 🕹 Activity Part 2: Tuning Our Neural Network

## Adding Callbacks

### Keras Callbacks



- [Official Callback documentation](https://keras.io/callbacks/)
- Callbacks You'll Definitely Want to Use
    - `tensorflow.keras.callbacks.EarlyStopping` [ALWAYS!]
    - `tensorflow.keras.callbacks.ModelCheckpoint` [Always, if on Colab]

- Callbacks worth further exploration
    - `tensorflow.keras.callbacks.LearningRateScheduler`
        - Still available in tf 2.x under this path

In [None]:
## import callbacks
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, CSVLogger

In [None]:
## Make folder for models
model_folder = './models/'
os.makedirs(model_folder,exist_ok=True)
os.listdir(model_folder)

In [None]:
## make checkpoints - early stopping and modelcheckpoint
early_stop = EarlyStopping(monitor='val_accuracy',patience=2,verbose=1,
                          restore_best_weights=False)

checkpoint = ModelCheckpoint(model_folder,verbose=0,save_best_only=True)

In [None]:
### You can use {} to insert values in your checkpoint names
# filepath=model_folder+"weights-improvement-{epoch:02d}-{val_loss:.2f}.hdf5"
# checkpoint = ModelCheckpoint(filepath,verbose=1,save_best_only=True,
#                             save_weights_only=True)
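
The `{}` placeholders in a checkpoint name are ordinary Python format fields that get filled in with the epoch number and the logged metrics. A quick way to preview the resulting filename, using made-up metric values:

```python
# Same template as the commented-out checkpoint above
template = "weights-improvement-{epoch:02d}-{val_loss:.2f}.hdf5"

# Made-up values standing in for what is logged at the end of an epoch
logs = {"loss": 0.4123, "val_loss": 0.4567}

filename = template.format(epoch=3, **logs)
print(filename)  # weights-improvement-03-0.46.hdf5
```

Any metric name used in the template has to be one that is actually logged during training (for example `val_loss` only exists if validation data is provided).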

In [None]:
## paste in our prior model function and fitting/eval code, but add callbacks

## update our make_model_w2v func with embedding layer
def make_w2v_model():
    """Make a neural network with a new embedding layer, 
    an LSTM layer with 25 units, and a final Dense layer appropriate for the task"""
    model = models.Sequential()
    embedding_layer = layers.Embedding(vocab_size+1,EMBEDDING_SIZE,
                                  weights=[embedding_matrix],
                                  input_length=MAX_SEQUENCE_LENGTH,
                                  trainable=False)
    model.add(embedding_layer)
    model.add(layers.LSTM(25, return_sequences=False))
    model.add(layers.Dense(n_classes, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                 metrics=['accuracy', tf.keras.metrics.Recall(name='recall')])
    model.summary()  # summary() prints itself and returns None
    return model

## make, fit, and evaluate the model
model = make_w2v_model()
history = model.fit(X_train_pad, y_train, epochs=5,
                    batch_size=128, validation_split=0.2,
                   workers=-1,class_weight=weights_dict,callbacks=[early_stop])
evaluate_network(model,X_test_pad,y_test,history,
                X_train = X_train_pad,y_train=y_train)

# 📚 Gridsearching with Keras & Scikit-Learn

## HyperParameter Tuning with GridSearchCV & Keras

Original Source: https://chrisalbon.com/deep_learning/keras/tuning_neural_network_hyperparameters/
<br><br>

- To use `GridSearchCV` or other similar functions in scikit-learn with a Keras neural network, we need to wrap our Keras model in `tensorflow.keras.wrappers.scikit_learn`'s `KerasClassifier` or `KerasRegressor`.

1. To do this, we need to write a build function (`build_fn`) that creates our model, such as `create_model`.
    - This function must accept whatever parameters you wish to tune. 
    - It also must have a default argument for each parameter.
    - This function must return the model (and only the model).
    
<!-- 
```python

## Define the build function
def create_model(n_units=(50,25,7), activation='relu',final_activation='softmax',
                optimizer='adam'):
    
    ## Pro tip:save the local variables now so you can print out the parameters used to create the model.
    params_used = locals()
    print('Parameters for model:\n',params_used)
    
   
    from keras.models import Sequential
    from keras import layers
    
    model=Sequential()
    model.add(layers.Dense(n_units[0], activation=activation, input_shape=(2000,)))
    model.add(layers.Dense(n_units[1], activation=activation))
    model.add(layers.Dense(n_units[2], activation=final_activation))
    model.compile(optimizer=optimizer, loss='categorical_crossentropy',metrics=['accuracy'])
    
    display(model.summary())
    return model 
```     -->

2. We then create our model using the Keras wrapper:

```python
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
neural_network =  KerasClassifier(build_fn=create_model,verbose=1)
```

3. Now, set up the hyperparameter space for grid search. (Remember, your `create_model` function must accept the parameter you want to tune)

```python
params_to_test = {'n_units':[(50,25,7),(100,50,7)],
                  'optimizer':['adam','rmsprop','adadelta'],
                  'activation':['linear','relu','tanh'],
                  'final_activation':['softmax']}
```

4. Now instantiate your GridSearch function

```python
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(estimator=neural_network,param_grid=params_to_test)
grid_result = grid.fit(X_train, y_train)
best_params = grid_result.best_params_
```
5. And that's it!
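
Before launching a grid search it is worth counting how many models will be trained. The sketch below enumerates the combinations in `params_to_test` from step 3 with `itertools`; the 5-fold figure assumes `GridSearchCV`'s default `cv` in recent scikit-learn versions.

```python
from itertools import product

params_to_test = {'n_units': [(50, 25, 7), (100, 50, 7)],
                  'optimizer': ['adam', 'rmsprop', 'adadelta'],
                  'activation': ['linear', 'relu', 'tanh'],
                  'final_activation': ['softmax']}

# One dict per combination of parameter values
combos = [dict(zip(params_to_test, values))
          for values in product(*params_to_test.values())]

n_folds = 5  # assumed default number of cross-validation folds
print(len(combos))            # 18 parameter combinations
print(len(combos) * n_folds)  # 90 model fits, plus one final refit on the best params
```

Since every fit trains a full neural network, keeping the grid small (or lowering `cv`) makes a large difference in run time.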

### 🎛 Defining Build Function and Params Grid

In [None]:
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
# make a new make_tune_model function to tune the # of units
# and if embeddings are trainable
def make_tune_model(n_units_lstm=25, trainable=False):
    """Make a neural network with an embedding layer (optionally trainable), 
    an LSTM layer with `n_units_lstm` units, and a final Dense layer appropriate for the task"""
    model = models.Sequential()
    embedding_layer = layers.Embedding(vocab_size+1,EMBEDDING_SIZE,
                                  weights=[embedding_matrix],
                                  input_length=MAX_SEQUENCE_LENGTH,
                                  trainable=trainable)
    model.add(embedding_layer)
    model.add(layers.LSTM(n_units_lstm, return_sequences=False))
    model.add(layers.Dense(n_classes, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                 metrics=['accuracy', tf.keras.metrics.Recall(name='recall')])
    model.summary()  # summary() prints itself and returns None
    return model


In [None]:
# ## make a model with sklearn wrapper
# wrapped_model = KerasClassifier(make_tune_model)
# wrapped_model.fit(X_train_pad,y_train, 
#                   epochs=3, validation_split=0.2)

In [None]:
## make a new early stopping with shorter patience, do not restore best weights
early_stop = EarlyStopping(monitor='val_accuracy',patience=0,verbose=1,
                          restore_best_weights=False)

In [None]:
## Set up params grid for 25 or 50 LSTM units and trainable True/False
params = {'n_units_lstm':[25,50],
         'trainable':[True,False]}


## Make and fit grid, check best params
grid = GridSearchCV(KerasClassifier(make_tune_model), 
                    params,cv=2,n_jobs=-1,verbose=1)

grid.fit(X_train_pad,y_train, epochs=3, 
         callbacks=[early_stop],
         validation_split=0.2)

grid.best_params_

In [None]:
## refit the best estimator to get a training history for evaluation
best_ann = grid.best_estimator_
history = best_ann.fit(X_train_pad,y_train, epochs=3, 
         callbacks=[early_stop],
         validation_split=0.2)

In [None]:
best_ann.model

In [None]:
evaluate_network(best_ann.model,X_test_pad,y_test,history=history,
                 X_train=X_train_pad,y_train = y_train)

In [None]:
raise Exception('The following cells are not guaranteed to run!')

# 🗄 APPENDIX

## 🤔 Tensorboard Callback
- https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks

> Add tensorboard callback

In [None]:
%load_ext tensorboard
logdir = './logs/'
os.makedirs(logdir,exist_ok=True)
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)


In [None]:
## Test using function to train and evaluate
model = make_model()
history = model.fit(X_train_pad, y_train, epochs=3,
                    batch_size=64, validation_split=0.2,
                   workers=-1,callbacks=[tensorboard_callback])
evaluate_network(model,X_test_pad,y_test,history,
                X_train = X_train_pad,y_train=y_train)

In [None]:
# %tensorboard --logdir logs

## 🤔 Using Pre-Trained Vectors

### On Colab

>- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/) - official website with documentation. 
    - There are several different pretrained vectors available. We will use the default/starter set of vectors, but you could select a different link to download an alternative option
        - Recommended link: http://nlp.stanford.edu/data/glove.6B.zip

>- [Tutorial on using Glove with Colab](https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/pretrained_word_embeddings.ipynb#scrollTo=b_H-URXmROE6)

In [None]:
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
!wget {glove_url}
!unzip -q glove.6B.zip

In [None]:
## Delete the zip file
!rm glove.6B.zip

In [None]:
# path_to_glove_file = os.path.join(
#     os.path.expanduser("~"), ".keras/datasets/glove.6B.100d.txt"
# )
path_to_glove_file = "glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

### Local Code

In [None]:
# import os
# folder = '/Users/jamesirving/Datasets/'#glove.twitter.27B/'
# # print(os.listdir(folder))
# glove_file = folder+'glove.6B/glove.6B.50d.txt'#'glove.twitter.27B.50d.txt'
# glove_twitter_file = folder+'glove.twitter.27B/glove.twitter.27B.50d.txt'
# print(glove_file)
# print(glove_twitter_file)

#### Keeping only the vectors needed

In [None]:
# ## This line of code for getting all words bugs me
# total_vocabulary = set(word for tweet in data_lower for word in tweet)
# len(total_vocabulary)

In [None]:
# glove = {}
# with open(glove_file,'rb') as f:#'glove.6B.50d.txt', 'rb') as f:
#     for line in f:
#         parts = line.split()
#         word = parts[0].decode('utf-8')
#         if word in total_vocabulary:
#             vector = np.array(parts[1:], dtype=np.float32)
#             glove[word] = vector

In [None]:
# glove['republican']

### Converting Glove to Word2Vec format

- Getting glove into w2vec format:
    - https://radimrehurek.com/gensim/scripts/glove2word2vec.html
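
The main difference between the two text formats is that word2vec files start with a header line giving the number of vectors and their dimensionality, which GloVe files lack. The sketch below shows that idea on two made-up GloVe-style lines; the function name is hypothetical, it assumes tokens contain no spaces, and for real files the gensim script linked above is the tool to use.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend the '<n_vectors> <n_dims>' header that word2vec text format expects."""
    lines = [ln.rstrip("\n") for ln in glove_lines if ln.strip()]
    n_dims = len(lines[0].split()) - 1      # first field is the word itself
    return [f"{len(lines)} {n_dims}"] + lines

# Two made-up 3-dimensional vectors in GloVe's "word v1 v2 v3" layout
toy_glove = ["the 0.1 0.2 0.3\n", "cat -0.5 0.0 1.5\n"]
for line in glove_to_word2vec_lines(toy_glove):
    print(line)
```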

In [None]:
glove_folder = folder+'glove.twitter.27B'
os.listdir(glove_folder)

In [None]:
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove_file = datapath(glove_twitter_file)
tmp_file = get_tmpfile(glove_folder+'glove_to_w2vec.txt')
_ = glove2word2vec(glove_file, tmp_file)
model_glove = KeyedVectors.load_word2vec_format(tmp_file)

In [None]:
model_glove.wv

## LSTM vs GRU

In [None]:
## GRU Model
from keras import models, layers, optimizers, regularizers
modelG = models.Sequential()

## Get and add embedding_layer
# embedding_layer = ji.make_keras_embedding_layer(wv, X_train)
modelG.add(layers.Embedding(MAX_WORDS, EMBEDDING_SIZE))

# modelG.add(layers.SpatialDropout1D(0.5))
# modelG.add(layers.Bidirectional(layers.GRU(units=100, dropout=0.5, recurrent_dropout=0.2,return_sequences=True)))
modelG.add(layers.Bidirectional(layers.GRU(units=100, dropout=0.5, recurrent_dropout=0.2)))
modelG.add(layers.Dense(2, activation='softmax'))

modelG.compile(loss='categorical_crossentropy',optimizer="adam",metrics=['acc'])#,'val_acc'])#, callbacks=callbacks)
modelG.summary()

In [None]:

history = modelG.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

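# class index with the highest predicted probability (replaces the older predict_classes)
y_hat_test = np.argmax(modelG.predict(X_test), axis=1)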
kg.evaluate_model(y_test,y_hat_test,history)

## Using Embeddings in Classification Models - scikit-learn

- Embeddings can be used in Artificial Neural Networks as an input Embedding Layer
- Embeddings can be used in scikit-learn models by taking the mean of the word vectors in a text/document and using that mean vector as the input into the model. 

### Creating Mean Embeddings

In [None]:
## This line of code for getting all words bugs me
total_vocabulary = set(word for tweet in data_lower for word in tweet)
len(total_vocabulary)

In [None]:
glove = {}
with open(glove_file,'rb') as f:#'glove.6B.50d.txt', 'rb') as f:
    for line in f:
        parts = line.split()
        word = parts[0].decode('utf-8')
        if word in total_vocabulary:
            vector = np.array(parts[1:], dtype=np.float32)
            glove[word] = vector

In [None]:
from sklearn.model_selection import train_test_split
from nltk import word_tokenize

y = pd.get_dummies(df['is_trump'],drop_first=True).values
X = df['text'].str.lower().map(word_tokenize)

X_idx = list(range(len(X)))
train_idx,test_idx = train_test_split(X_idx,random_state=123)

X[train_idx]

In [None]:
def train_test_split_idx(X, y, train_idx,test_idx):
    """Split X and y into train/test sets using precomputed row indices."""
    X_train = X[train_idx].copy()
    y_train = y[train_idx].copy()
    X_test = X[test_idx].copy()
    y_test = y[test_idx].copy()
    return X_train, X_test,y_train, y_test

X_train, X_test,y_train, y_test = train_test_split_idx(X,y,train_idx,test_idx)

In [None]:
# df['combined_text'] = df['headline'] + ' ' + df['short_description']
# data = df['combined_text'].map(word_tokenize).values

In [None]:
class W2vVectorizer(object):
    
    def __init__(self, w2v):
        # Takes in a dictionary of words and vectors as input
        self.w2v = w2v
        if len(w2v) == 0:
            self.dimensions = 0
        else:
            self.dimensions = len(w2v[next(iter(w2v))])
    
    # Note: Even though it doesn't do anything, it's required that this object implement a fit method or else
    # it can't be used in a scikit-learn pipeline  
    def fit(self, X, y=None):
        return self
            
    def transform(self, X):
        return np.array([
            np.mean([self.w2v[w] for w in words if w in self.w2v]
                   or [np.zeros(self.dimensions)], axis=0) for words in X])
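
To make the averaging in `transform` concrete, here is a dependency-free sketch of the same idea: average the vectors of the known words in a document, and fall back to all zeros when none are known. The helper name and the 3-dimensional toy vectors are made up for illustration.

```python
def mean_embedding(tokens, vectors, dims):
    """Average the vectors of known tokens; return all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dims
    # zip(*known) groups the values by dimension so each can be averaged
    return [sum(col) / len(known) for col in zip(*known)]

toy_vectors = {"good": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}

print(mean_embedding(["good", "movie", "zzz"], toy_vectors, dims=3))  # [2.0, 1.0, 1.0]
print(mean_embedding(["zzz"], toy_vectors, dims=3))                   # [0.0, 0.0, 0.0]
```

Every document ends up as one fixed-length vector regardless of how many words it has, which is what lets scikit-learn models consume it.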