<a href="https://colab.research.google.com/github/pavanramadass/machine-learning-projects/blob/main/Assignment4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Task 1

https://www.investing.com/currencies/eur-usd-historical-data

Dataset: EUR/USD historical data, downloaded from investing.com (link above).

In [None]:
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, LSTM
from sklearn.metrics import mean_absolute_error
from google.colab import files
import io

# Data Preprocessing
uploaded = files.upload()
dataset = pd.read_csv(io.BytesIO(uploaded['eur_usd.csv']))

filtered_data = dataset.filter(['Price'])
data = filtered_data.values 

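index = filtered_data.columns.get_loc("Price")

# Chronological 70/30 split (no shuffling, since the data is time dependent)
train_data_len = math.ceil(data.shape[0] * 0.7)

# Fit the scaler on the training rows only so test-set statistics do not leak into training
mmscaler = MinMaxScaler(feature_range=(0, 1))
mmscaler.fit(data[:train_data_len])
transformed_data = mmscaler.transform(data)

train_data = transformed_data[0:train_data_len, :]
# Start 5 rows early so the first test window predicts the first held-out row
test_data = transformed_data[train_data_len - 5:, :]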
  
# Build 5-step sliding windows manually, since the dataset is time dependent.
# The windows come from the scaled train/test arrays (not the raw `data`),
# so errors computed against y_test are in scaled units.
x_train, y_train = [], []
for i in range(5, train_data_len):
  x_train.append(train_data[i - 5:i, :])
  y_train.append(train_data[i, index])
x_train = np.array(x_train)
y_train = np.array(y_train)

x_test, y_test = [], []
test_data_len = test_data.shape[0]
for i in range(5, test_data_len):
  x_test.append(test_data[i - 5:i, :])
  y_test.append(test_data[i, index])
x_test = np.array(x_test)
y_test = np.array(y_test)

# Recurrent Neural Network Model
model_rnn = Sequential()

model_rnn.add(SimpleRNN(5, return_sequences=True, input_shape=(x_train.shape[1], 1))) 
model_rnn.add(SimpleRNN(5, return_sequences=False))
model_rnn.add(Dense(25, activation='relu'))
model_rnn.add(Dense(1))

# Compile and train the model
model_rnn.compile(optimizer='adam', loss='mean_squared_error')
model_rnn.fit(x_train, y_train, batch_size=16, epochs=25)

# Testing
preds_rnn = model_rnn.predict(x_test)

# scikit-learn expects the arguments in the order (y_true, y_pred)
mae_rnn = mean_absolute_error(y_test, preds_rnn)
print('RNN MAE: ' + str(round(mae_rnn, 5)))

# LSTM Model
model_lstm = Sequential()

model_lstm.add(LSTM(5, return_sequences=True, input_shape=(x_train.shape[1], 1))) 
model_lstm.add(LSTM(5, return_sequences=False))
model_lstm.add(Dense(25, activation='relu'))
model_lstm.add(Dense(1))

# Compile and train the model
model_lstm.compile(optimizer='adam', loss='mean_squared_error')
model_lstm.fit(x_train, y_train, batch_size=16, epochs=25)

# Testing
preds_lstm = model_lstm.predict(x_test)

# scikit-learn expects the arguments in the order (y_true, y_pred)
mae_lstm = mean_absolute_error(y_test, preds_lstm)
print('LSTM MAE: ' + str(round(mae_lstm, 5)))
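
The windowing used in the cell above can be sketched as a small standalone helper. This is only an illustration: `make_windows`, the toy series, and the split point are made up for the example and are not part of the assignment code.

```python
def make_windows(series, window=5):
    """Turn a 1-D sequence into (inputs, targets) pairs.

    Each input is `window` consecutive values, and its target is the
    value that immediately follows them.
    """
    inputs, targets = [], []
    for i in range(window, len(series)):
        inputs.append(list(series[i - window:i]))
        targets.append(series[i])
    return inputs, targets


# Toy series standing in for scaled prices (made-up values).
series = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Chronological split. The test slice starts `window` steps early so that
# the first test target is the first value after the training range.
window = 5
split = 7
train_series = series[:split]
test_series = series[split - window:]

x_train_demo, y_train_demo = make_windows(train_series, window)
x_test_demo, y_test_demo = make_windows(test_series, window)

print(y_train_demo)  # targets drawn from the training range
print(y_test_demo)   # targets drawn from the held-out range
```

The point of the sketch is that both the windows and the targets come from the same slice, so the test targets never overlap the training targets.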

I chose the EUR/USD forex dataset to predict future currency prices on a monthly timeframe. Forex markets tend to be more cyclical than stock markets, so a forecast of where the currency market is heading is arguably more useful than one for the stock market.

For the neural network framework, I chose Keras because it is the one I know best.

For the RNN, I used Keras's SimpleRNN. The model has two SimpleRNN layers with 5 units each, followed by a dense layer with 25 units and ReLU activation, and then a single-unit dense output layer.
The metric I used is mean absolute error (MAE), which is the average absolute difference between the predicted and actual values. A lower MAE therefore means a better model.
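
As a small worked example of the metric, MAE is the mean of |actual - predicted| over all samples. The sketch below computes it by hand; the function name and the price values are made up for illustration and are not results from the models above.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up actual and predicted prices.
y_true_demo = [1.10, 1.12, 1.08, 1.15]
y_pred_demo = [1.12, 1.11, 1.10, 1.11]

# Absolute errors are roughly 0.02, 0.01, 0.02 and 0.04, so the mean is about 0.0225.
mae_demo = mean_absolute_error_manual(y_true_demo, y_pred_demo)
print(round(mae_demo, 4))
```

Because the errors are not squared, MAE is expressed in the same units as the target, which makes it easy to read as "typical size of a miss".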

After implementing the LSTM, the main difference I noticed was the MAE. In my run, the LSTM's MAE was 0.03081 and the RNN's was 0.03328, which suggests that the LSTM is the better model for this dataset.

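The likely reason the LSTM had a lower MAE is its gating. An LSTM cell has input, forget, and output gates that control how much new input is written to the cell state, how much of the old state is kept, and how much is exposed as output. This control over the flow and mixing of inputs lets the network retain useful information across time steps, which can reduce the error in its predictions.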
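
To make the gating idea concrete, here is a minimal sketch of one LSTM time step using the textbook equations with a scalar input and a scalar state. It is not how Keras implements the layer internally; `lstm_step`, the parameter names, and the weight values are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step with scalar input, hidden state and cell state."""
    # Forget gate: how much of the old cell state to keep.
    f = sigmoid(params["wf"] * x + params["uf"] * h_prev + params["bf"])
    # Input gate: how much of the new candidate to write.
    i = sigmoid(params["wi"] * x + params["ui"] * h_prev + params["bi"])
    # Output gate: how much of the cell state to expose.
    o = sigmoid(params["wo"] * x + params["uo"] * h_prev + params["bo"])
    # Candidate value computed from the current input and previous output.
    g = math.tanh(params["wg"] * x + params["ug"] * h_prev + params["bg"])

    c = f * c_prev + i * g   # mix old state and new candidate
    h = o * math.tanh(c)     # gated output
    return h, c


# Arbitrary illustrative weights (hypothetical, not trained values).
params = {
    "wf": 0.5, "uf": 0.1, "bf": 1.0,
    "wi": 0.6, "ui": 0.2, "bi": 0.0,
    "wo": 0.4, "uo": 0.3, "bo": 0.0,
    "wg": 0.9, "ug": 0.5, "bg": 0.0,
}

# Run the cell over one 5-step window of made-up scaled prices.
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.1, 0.5, 0.3]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

A SimpleRNN has no separate cell state or gates: it overwrites its hidden state at every step, which is the difference the paragraph above refers to.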

Task 2

In [None]:
!pip install --upgrade gensim

NOTE:
I was unsure where to find a good dataset for NLP, and after some research I found Gensim. Gensim has its own implementation of
Word2Vec (I assume it is similar to the one shown in class), and it provides many datasets that can be used to train a Word2Vec
model. Overall I found Gensim to be a useful and resourceful NLP and topic-modelling toolkit.

So, I decided to use Gensim for both the dataset (its text8 corpus) and the Word2Vec neural network.
I implemented cosine similarity as its own function, and for dissimilarity I used one minus the cosine similarity, because
by my understanding, if two words are 80% similar, then they are 20% dissimilar. This quantity is commonly called cosine distance; because cosine similarity can be negative, it ranges from 0 to 2 rather than 0 to 1.
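
The following sketch shows how one minus the cosine similarity behaves on a few hand-picked vectors. The function names and vectors are made up for illustration; the zero-vector check is an extra safeguard that the assignment code does not include.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    return 1 - cosine_similarity(a, b)


# Same direction, perpendicular, and opposite direction.
same = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])

print(same, orthogonal, opposite)
```

Vectors pointing the same way give a distance near 0, perpendicular vectors give 1, and opposite vectors give a distance near 2, so the "percent dissimilar" reading only holds when the similarity is non-negative.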

In [None]:
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec
import math 

corpus = api.load('text8')

model = Word2Vec(corpus) 

def cos_similarity_func(vec_word1, vec_word2):
  """Return the cosine similarity between two equal-length vectors."""
  numerator = 0  # dot product of the two vectors
  part_1 = 0     # squared length of vec_word1
  part_2 = 0     # squared length of vec_word2
  for num1, num2 in zip(vec_word1, vec_word2):
    numerator += num1 * num2
    part_1 += num1 ** 2
    part_2 += num2 ** 2

  denominator = math.sqrt(part_1) * math.sqrt(part_2)

  cos_similarity = numerator / denominator

  return cos_similarity



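# text8 is all lowercase, so normalise the input before looking it up
word1 = input("Enter first word: ").strip().lower()
word2 = input("Enter second word: ").strip().lower()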

vec_word1 = model.wv[word1]
vec_word2 = model.wv[word2]

similarity = cos_similarity_func(vec_word1, vec_word2)
dissimilarity = 1 - similarity

print("Similarity")
print(similarity)

print("Dissimilarity")
print(dissimilarity)
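
One practical caveat with the cell above: a word that is not in the trained vocabulary makes the `model.wv[...]` lookup raise a `KeyError`, and Word2Vec's default `min_count` of 5 drops rare words. A minimal guard might look like the sketch below. `lookup_vectors` is a hypothetical helper, and a plain dict stands in for `model.wv` so the example runs without training a model; it assumes the real object supports `in` and `[]` in the same way.

```python
def lookup_vectors(vectors, *words):
    """Return the vector for each word, or raise one clear error listing the misses."""
    missing = [w for w in words if w not in vectors]
    if missing:
        raise KeyError("not in vocabulary: " + ", ".join(missing))
    return [vectors[w] for w in words]


# A plain dict standing in for model.wv (made-up 2-D vectors).
toy_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
}

vec_a, vec_b = lookup_vectors(toy_vectors, "king", "queen")
print(vec_a, vec_b)

try:
    lookup_vectors(toy_vectors, "king", "zzzz")
except KeyError as err:
    print("Lookup failed:", err)
```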