1. You can work with `preprocessed_data.csv` for the assignment. You can get the data from - <a href='https://drive.google.com/drive/u/0/folders/1CJnItndeSSJu7aragQoXWZS9-0apN6pp'>data folder</a>.
2. Load the data in your notebook.
3. After step 2 you have to train 3 types of models as discussed below. 
4. For all the model use <a href='https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics'>'auc'</a> as a metric. Check <a  href='https://stackoverflow.com/a/46844409'>this</a> and <a  href='https://www.kaggle.com/c/santander-customer-transaction-prediction/discussion/80807'>this</a> for using auc as a metric.
5. You are free to choose any number of layers/hiddden units but you have to use same type of architectures shown below.
6. You can use any one of the optimizers and choice of Learning rate and momentum.
7. For all the model's use <a href='https://www.youtube.com/watch?v=2U6Jl7oqRkM'>TensorBoard</a> and plot the Metric value and Loss with epoch. While submitting, take a screenshot of plots and include those images in a separate pad and write your observations about them.
8. Make sure that you are using GPU to train the given models.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount(mountpoint='/content/drive')

Mounted at /content/drive


In [3]:
from IPython.display import display

In [4]:
from matplotlib import pyplot as plt
from matplotlib import style
style.use(style='seaborn-deep')

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_extraction.text import (
    TfidfVectorizer,
    CountVectorizer
)
from sklearn import metrics

In [6]:
from scipy.sparse import hstack

In [7]:
from tqdm import tqdm

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [9]:
from tensorflow.keras.layers import (
    Flatten, 
    Dense, 
    Dropout, 
    concatenate, 
    Embedding, 
    Conv1D, 
    Input, 
    LSTM, 
    BatchNormalization, 
    LeakyReLU
)

In [10]:
from tensorflow.keras.callbacks import (
    EarlyStopping, 
    TensorBoard, 
    ModelCheckpoint, 
    Callback
)

In [11]:
from tensorflow.keras.models import (
    Sequential,
    Model
)

In [12]:
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
import pickle
import os
import datetime

In [13]:
base_path = '/content/drive/MyDrive/Applied-AI/Assignment-26/'

In [14]:
data_file = os.path.join(base_path, 'preprocessed_data.csv')
print(data_file)

/content/drive/MyDrive/Applied-AI/Assignment-26/preprocessed_data.csv


In [15]:
data_df = pd.read_csv(filepath_or_buffer=data_file)
display(data_df.head())

Unnamed: 0,school_state,teacher_prefix,project_grade_category,teacher_number_of_previously_posted_projects,project_is_approved,clean_categories,clean_subcategories,essay,price
0,ca,mrs,grades_prek_2,53,1,math_science,appliedsciences health_lifescience,i fortunate enough use fairy tale stem kits cl...,725.05
1,ut,ms,grades_3_5,4,1,specialneeds,specialneeds,imagine 8 9 years old you third grade classroo...,213.03
2,ca,mrs,grades_prek_2,10,1,literacy_language,literacy,having class 24 students comes diverse learner...,329.0
3,ga,mrs,grades_prek_2,2,1,appliedlearning,earlydevelopment,i recently read article giving students choice...,481.04
4,wa,mrs,grades_3_5,2,1,literacy_language,literacy,my students crave challenge eat obstacles brea...,17.74


In [16]:
X = data_df.drop(columns=['project_is_approved'])
y = data_df['project_is_approved'].values

In [17]:
numcols = X.select_dtypes('number').columns
catcols = X.select_dtypes('object').columns
numcols = list(set(numcols))
catcols = list(set(catcols))
catcols.remove('essay')
print(numcols)
print(catcols)

['price', 'teacher_number_of_previously_posted_projects']
['school_state', 'clean_subcategories', 'project_grade_category', 'clean_categories', 'teacher_prefix']


In [18]:
print(X.shape, y.shape)

(109248, 8) (109248,)


In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

In [20]:
print(X_train.shape, y_train.shape)
print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)

(69918, 8) (69918,)
(17480, 8) (17480,)
(21850, 8) (21850,)


In [21]:
num_classes = len(data_df['project_is_approved'].value_counts())
print(num_classes)

2


In [22]:
y_train_ohe = to_categorical(y=y_train, num_classes=num_classes)
y_cv_ohe = to_categorical(y=y_cv, num_classes=num_classes)
y_test_ohe = to_categorical(y=y_test, num_classes=num_classes)

<font color='red'> Model-1 </font>

Build and Train deep neural network as shown below.

<img src='https://i.imgur.com/w395Yk9.png'>

ref: https://i.imgur.com/w395Yk9.png

- __Input_seq_total_text_data__
    - You have to give total text data columns. After this use the Embedding layer to get word vectors. Use given predefined glove word vectors, don't train any word vectors. After this use LSTM and get the LSTM output and flatten that output. 
- __Input_school_state__
    - Give `school_state` column as input to embedding layer and train the Keras embedding layer. 
- __Project_grade_category__
    - Give `project_grade_category` column as input to embedding layer and train the Keras embedding layer.
- __Input_clean_categories__
    - Give `input_clean_categories` column as input to embedding layer and train the Keras embedding layer.
- __Input_clean_subcategories__
    - Give `input_clean_subcategories` column as input to embedding layer and train the Keras embedding layer.
- __Input_clean_subcategories__
     - Give `input_teacher_prefix` column as input to embedding layer and train the Keras embedding layer.
- __Input_remaining_teacher_number_of_previously_posted_projects._resource_summary_contains_numerical_digits._price._quantity__
    - Concatenate remaining columns and add a dense layer after that. 



Below is an example of embedding layer for a categorical columns. In below code all are dummy values, we gave only for referance. 

1. Go through this blog, if you have any doubt on using predefined Embedding values in Embedding layer - https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
2. Please go through this link https://keras.io/getting-started/functional-api-guide/ and check the 'Multi-input and multi-output models' then you will get to know how to give multiple inputs. 

1.1 Text vectorization

In [23]:
X_train_essay = X_train['essay']
X_cv_essay = X_cv['essay']
X_test_essay = X_test['essay']

In [24]:
token = Tokenizer()
token.fit_on_texts(texts=X_train_essay)

In [25]:
vocab_size = len(token.word_index) + 1
print(vocab_size)

47528


In [26]:
X_train_essay_seq = token.texts_to_sequences(texts=X_train_essay)
X_cv_essay_seq = token.texts_to_sequences(texts=X_cv_essay)
X_test_essay_seq = token.texts_to_sequences(texts=X_test_essay)

In [27]:
max_len = 500

In [28]:
X_train_essay_padded = pad_sequences(sequences=X_train_essay_seq, maxlen=max_len, padding='post')
X_cv_essay_padded = pad_sequences(sequences=X_cv_essay_seq, maxlen=max_len, padding='post')
X_test_essay_padded = pad_sequences(sequences=X_test_essay_seq, maxlen=max_len, padding='post')

In [29]:
glove_vectors_file = os.path.join(base_path, 'glove_vectors')
print(glove_vectors_file)

/content/drive/MyDrive/Applied-AI/Assignment-26/glove_vectors


In [30]:
with open(file=glove_vectors_file, mode='rb') as gvf:
    gmodel = pickle.load(gvf)

In [31]:
vocab_words = list(token.word_index.keys())

In [32]:
embedding_matrix = np.zeros(shape=(vocab_size, 300))
for (i, word) in enumerate(vocab_words):
    e_vector = gmodel.get(word, None)
    if e_vector is not None:
        embedding_matrix[i] = e_vector

In [33]:
print(embedding_matrix.shape)

(47528, 300)


1.2 Categorical feature vectorization

Reference: https://stackoverflow.com/a/65538195/7579443

In [34]:
def data_column_ordinal_encoder(X_train, X_cv, X_test, col):
    """
    This function encodes the categorical column.
    """
    X_train_col = X_train[col].values.reshape(-1, 1)
    X_cv_col = X_cv[col].values.reshape(-1, 1)
    X_test_col = X_test[col].values.reshape(-1, 1)

    ord_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    ord_encoder.fit(X=X_train_col)

    X_train_col_vect = ord_encoder.transform(X=X_train_col)
    X_cv_col_vect = ord_encoder.transform(X=X_cv_col)
    X_test_col_vect = ord_encoder.transform(X=X_test_col)
    
    return X_train_col_vect, X_cv_col_vect, X_test_col_vect

school_state

In [35]:
(X_train_school_state_vect, 
 X_cv_school_state_vect, 
 X_test_school_state_vect) = data_column_ordinal_encoder(X_train=X_train,
                                                         X_cv=X_cv,
                                                         X_test=X_test,
                                                         col='school_state')

clean_categories

In [36]:
(X_train_clean_categories_vect,
 X_cv_clean_categories_vect,
 X_test_clean_categories_vect) = data_column_ordinal_encoder(X_train=X_train,
                                                                   X_cv=X_cv,
                                                                   X_test=X_test,
                                                                   col='clean_categories')

clean_subcategories

In [37]:
(X_train_clean_subcategories_vect,
 X_cv_clean_subcategories_vect,
 X_test_clean_subcategories_vect) = data_column_ordinal_encoder(X_train=X_train,
                                                                X_cv=X_cv,
                                                                X_test=X_test,
                                                                col='clean_subcategories')

project_grade_category

In [38]:
(X_train_project_grade_category_vect,
 X_cv_project_grade_category_vect,
 X_test_project_grade_category_vect) = data_column_ordinal_encoder(X_train=X_train,
                                                                   X_cv=X_cv,
                                                                   X_test=X_test,
                                                                   col='project_grade_category')

teacher_prefix

In [39]:
(X_train_teacher_prefix_vect,
 X_cv_teacher_prefix_vect,
 X_test_teacher_prefix_vect) = data_column_ordinal_encoder(X_train=X_train,
                                                           X_cv=X_cv,
                                                           X_test=X_test,
                                                           col='teacher_prefix')

1.3 Numerical feature vectorization

In [40]:
def stack_values(df, col_name_1, col_name_2):
    """
    Stacks the standardized columns.
    """
    col_list_1 = ((df[col_name_1] - df[col_name_1].mean()) / df[col_name_1].std()).to_list()
    col_list_2 = ((df[col_name_2] - df[col_name_2].mean()) / df[col_name_2].std()).to_list()
    return np.matrix(list(zip(col_list_1, col_list_2)))

In [41]:
X_train_numerical_vect = stack_values(df=X_train, col_name_1='teacher_number_of_previously_posted_projects', col_name_2='price')
X_cv_numerical_vect = stack_values(df=X_cv, col_name_1='teacher_number_of_previously_posted_projects', col_name_2='price')
X_test_numerical_vect = stack_values(df=X_test, col_name_1='teacher_number_of_previously_posted_projects', col_name_2='price')

In [42]:
print(X_train_essay_padded.shape)
print(X_train_school_state_vect.shape)
print(X_train_teacher_prefix_vect.shape)
print(X_train_project_grade_category_vect.shape)
print(X_train_clean_categories_vect.shape)
print(X_train_clean_subcategories_vect.shape)
print(X_train_numerical_vect.shape)

(69918, 500)
(69918, 1)
(69918, 1)
(69918, 1)
(69918, 1)
(69918, 1)
(69918, 2)


In [43]:
print(X_cv_essay_padded.shape)
print(X_cv_school_state_vect.shape)
print(X_cv_teacher_prefix_vect.shape)
print(X_cv_project_grade_category_vect.shape)
print(X_cv_clean_categories_vect.shape)
print(X_cv_clean_subcategories_vect.shape)
print(X_cv_numerical_vect.shape)

(17480, 500)
(17480, 1)
(17480, 1)
(17480, 1)
(17480, 1)
(17480, 1)
(17480, 2)


In [44]:
print(X_test_essay_padded.shape)
print(X_test_school_state_vect.shape)
print(X_test_teacher_prefix_vect.shape)
print(X_test_project_grade_category_vect.shape)
print(X_test_clean_categories_vect.shape)
print(X_test_clean_subcategories_vect.shape)
print(X_test_numerical_vect.shape)

(21850, 500)
(21850, 1)
(21850, 1)
(21850, 1)
(21850, 1)
(21850, 1)
(21850, 2)


Consolidating the data

In [45]:
X_train_all = [
    X_train_essay_padded, 
    X_train_school_state_vect, 
    X_train_teacher_prefix_vect, 
    X_train_project_grade_category_vect, 
    X_train_clean_categories_vect, 
    X_train_clean_subcategories_vect, 
    X_train_numerical_vect
]

X_cv_all = [
    X_cv_essay_padded, 
    X_cv_school_state_vect, 
    X_cv_teacher_prefix_vect, 
    X_cv_project_grade_category_vect, 
    X_cv_clean_categories_vect, 
    X_cv_clean_subcategories_vect, 
    X_cv_numerical_vect
]

X_test_all = [
    X_test_essay_padded, 
    X_test_school_state_vect, 
    X_test_teacher_prefix_vect, 
    X_test_project_grade_category_vect, 
    X_test_clean_categories_vect, 
    X_test_clean_subcategories_vect, 
    X_test_numerical_vect
]

In [46]:
catsize_dict = {col : X_train[col].nunique() for col in catcols}
print(catsize_dict)

{'school_state': 51, 'clean_subcategories': 387, 'project_grade_category': 4, 'clean_categories': 51, 'teacher_prefix': 5}


1.4 Defining the model

In [47]:
%load_ext tensorboard

# Clear any logs from previous runs
!rm -rf ./logs/

In [48]:
tf.keras.backend.clear_session()

In [49]:
# essay
input_text_seq = Input(shape=(max_len, ), name='Input_Text_Sequences')
embedding_text_seq = Embedding(input_dim=embedding_matrix.shape[0],
                               output_dim=embedding_matrix.shape[1],
                               weights=[embedding_matrix],
                               input_length=max_len,
                               trainable=False,
                               name='Embedding_Text_Sequences')(input_text_seq)
lstm_text_seq = LSTM(units=100, return_sequences=True, name='LSTM')(embedding_text_seq)
flatten_text_seq = Flatten(name='Flatten_Text_Sequences')(lstm_text_seq)

# school_state
input_school_state = Input(shape=(1, ), name='Input_School_State')
embedding_school_state = Embedding(input_dim=catsize_dict['school_state'],
                                   output_dim=10,
                                   input_length=1,
                                   name='Embedding_School_State')(input_school_state)
flatten_school_state = Flatten(name='Flatten_School_State')(embedding_school_state)

# project_grade_category
input_project_grade_category = Input(shape=(1, ), name='Input_Project_Grade_Category')
embedding_project_grade_category = Embedding(input_dim=catsize_dict['project_grade_category'],
                                             output_dim=10,
                                             input_length=1,
                                             name='Embedding_Project_Grade_Category')(input_project_grade_category)
flatten_project_grade_category = Flatten(name='Flatten_Project_Grade_Category')(embedding_project_grade_category)

# clean_categories
input_clean_categories = Input(shape=(1, ), name='Input_Clean_Categories')
embedding_clean_categories = Embedding(input_dim=catsize_dict['clean_categories'],
                                       output_dim=10,
                                       input_length=1,
                                       name='Embedding_Clean_Categories')(input_clean_categories)
flatten_clean_categories = Flatten(name='Flatten_Clean_Categories')(embedding_clean_categories)

# clean_subcategories
input_clean_subcategories = Input(shape=(1, ), name='Input_Clean_SubCategories')
embedding_clean_subcategories = Embedding(input_dim=catsize_dict['clean_subcategories'],
                                          output_dim=10,
                                          input_length=1,
                                          name='Embedding_Clean_SubCategories')(input_clean_subcategories)
flatten_clean_subcategories = Flatten(name='Flatten_Clean_SubCategories')(embedding_clean_subcategories)

# teacher_prefix
input_teacher_prefix = Input(shape=(1, ), name='Input_Teacher_Prefix')
embedding_teacher_prefix = Embedding(input_dim=catsize_dict['teacher_prefix'],
                                     output_dim=10,
                                     input_length=1,
                                     name='Embedding_Teacher_Prefix')(input_teacher_prefix)
flatten_teacher_prefix = Flatten(name='Flatten_Teacher_Prefix')(embedding_teacher_prefix)

# numericals
input_numericals = Input(shape=(2, ), name='Input_Numericals')
dense_numericals = Dense(units=16,
                         kernel_initializer=tf.keras.initializers.GlorotNormal(),
                         kernel_regularizer=tf.keras.regularizers.L2(0.0001))(input_numericals)
dense_numericals = LeakyReLU(name='Dense_Numericals')(dense_numericals)

# concatenate
inputs = [flatten_text_seq, flatten_school_state, flatten_project_grade_category, 
          flatten_clean_categories, flatten_clean_subcategories, 
          flatten_teacher_prefix, dense_numericals]
concatenate_layer = concatenate(inputs=inputs, name='Concatenation')

# dense
dense_1 = Dense(units=32,
                kernel_initializer=tf.keras.initializers.GlorotNormal(),
                kernel_regularizer=tf.keras.regularizers.L2(0.001),
                name='Dense_1_1')(concatenate_layer)
dense_1 = LeakyReLU(name='Dense_1_2')(dense_1)
drop_1 = Dropout(rate=0.3, name='Dropout_1')(dense_1)

# dense
dense_2 = Dense(units=32,
                kernel_initializer=tf.keras.initializers.GlorotNormal(),
                kernel_regularizer=tf.keras.regularizers.L2(0.001),
                name='Dense_2_1')(drop_1)
dense_2 = LeakyReLU(name='Dense_2_2')(dense_2)
drop_2 = Dropout(rate=0.2, name='Dropout_2')(dense_2)

# dense
dense_3 = Dense(units=16,
                kernel_initializer=tf.keras.initializers.GlorotNormal(),
                kernel_regularizer=tf.keras.regularizers.L2(0.001),
                name='Dense_3_1')(drop_2)
dense_3 = LeakyReLU(name='Dense_3_2')(dense_3)

output_layer = Dense(units=2, activation='softmax', name='Output')(dense_3)

In [50]:
inputs_layers = [input_text_seq, input_school_state, input_teacher_prefix, input_project_grade_category, 
                 input_clean_categories, input_clean_subcategories, input_numericals]

In [51]:
model_1 = Model(inputs=inputs_layers, outputs=output_layer)

Reference: https://www.tensorflow.org/api_docs/python/tf/numpy_function

In [52]:
def tf_auc(y_true, y_pred):
    """
    AUC function for tf.
    """
    func = lambda y_true, y_pred : metrics.roc_auc_score(y_true=y_true, y_score=y_pred)
    score = tf.numpy_function(func=func, inp=[y_true, y_pred], Tout=tf.double, name='custom_auc')
    return score

In [53]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
model_1.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy', tf_auc])

In [54]:
model_1.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 Input_Text_Sequences (InputLay  [(None, 500)]       0           []                               
 er)                                                                                              
                                                                                                  
 Embedding_Text_Sequences (Embe  (None, 500, 300)    14258400    ['Input_Text_Sequences[0][0]']   
 dding)                                                                                           
                                                                                                  
 Input_School_State (InputLayer  [(None, 1)]         0           []                               
 )                                                                                            

In [55]:
filepath_1 = "model_save/best_model_1.h5"
model_save_callback_1 = ModelCheckpoint(filepath=filepath_1, monitor='val_tf_auc', verbose=1, save_best_only=True, mode='max')

early_stop_callback_1 = EarlyStopping(monitor='val_loss', min_delta=0.01, patience=5, verbose=1)

log_dir_1 = os.path.join('logs', 'fits', datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback_1 = TensorBoard(log_dir=log_dir_1, histogram_freq=1)

callbacks_1 = [model_save_callback_1, early_stop_callback_1, tensorboard_callback_1]

1.5 Fitting your model

In [56]:
model_1.fit(x=X_train_all,
            y=y_train_ohe,
            validation_data=(X_cv_all, y_cv_ohe),
            epochs=50,
            batch_size=400,
            callbacks=callbacks_1)

Epoch 1/50
Epoch 1: val_tf_auc improved from -inf to 0.62818, saving model to model_save/best_model_1.h5
Epoch 2/50
Epoch 2: val_tf_auc improved from 0.62818 to 0.63751, saving model to model_save/best_model_1.h5
Epoch 3/50
Epoch 3: val_tf_auc improved from 0.63751 to 0.66111, saving model to model_save/best_model_1.h5
Epoch 4/50
Epoch 4: val_tf_auc improved from 0.66111 to 0.69374, saving model to model_save/best_model_1.h5
Epoch 5/50
Epoch 5: val_tf_auc improved from 0.69374 to 0.71091, saving model to model_save/best_model_1.h5
Epoch 6/50
Epoch 6: val_tf_auc improved from 0.71091 to 0.71849, saving model to model_save/best_model_1.h5
Epoch 7/50
Epoch 7: val_tf_auc improved from 0.71849 to 0.71911, saving model to model_save/best_model_1.h5
Epoch 8/50
Epoch 8: val_tf_auc improved from 0.71911 to 0.72509, saving model to model_save/best_model_1.h5
Epoch 9/50
Epoch 9: val_tf_auc improved from 0.72509 to 0.73015, saving model to model_save/best_model_1.h5
Epoch 10/50
Epoch 10: val_tf_au

<keras.callbacks.History at 0x7f8a3777e4d0>

In [61]:
%tensorboard --logdir logs/fits

TensorBoard images

![](https://user-images.githubusercontent.com/63338657/198864076-d2f8e8d7-7093-40bd-947f-26b2bad0af83.png)

![](https://user-images.githubusercontent.com/63338657/198864113-9e7bdde4-2393-499a-bb0a-4a40f49fd2aa.png)

![](https://user-images.githubusercontent.com/63338657/198864135-f98129c8-ccbc-46ef-8ef7-0022751a46fe.png)

In [58]:
model_1.load_weights(filepath=filepath_1)

In [59]:
test_loss_1, test_accuracy_1, test_auc_1 = model_1.evaluate(x=X_test_all, y=y_test_ohe, batch_size=60)



In [60]:
print("Test Loss: {}.".format(test_loss_1))
print("Test Accuracy: {}.".format(test_accuracy_1 * 100))
print("Test AUC: {}.".format(test_auc_1 * 100))

Test Loss: 0.4474007487297058.
Test Accuracy: 85.08467078208923.
Test AUC: 73.10256958007812.


End of the file.