# Main Architecture

For now, `Tweets` and `Price History` are our primary data sources.

![Main Architecture](main_architecture.jpg)

### Filter

The `Tweets` are first `Pre-Processed` and then `Classified` using a simple `Neural Network`, which will later be converted into a `Convolutional Neural Network`. The `Tweets` are filtered in real time and stored in the database.

Steps in Pre-Process:
1. remove tweets with `#Tags > 3`
2. lowercase
3. remove stopwords
4. remove urls
5. remove any symbols
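
The pre-processing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` function, the tiny `STOPWORDS` set, and the regular expressions are all assumptions. Stopwords are removed last here so that punctuation attached to a word does not prevent a match.

```python
import re

# Small illustrative stopword list (an assumption; a real list would be much larger).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
SYMBOL_RE = re.compile(r"[^a-z0-9\s]")

MAX_HASHTAGS = 3


def preprocess(tweet):
    """Return the cleaned tweet text, or None if the tweet is dropped."""
    # 1. drop tweets with more than MAX_HASHTAGS hashtags
    if len(HASHTAG_RE.findall(tweet)) > MAX_HASHTAGS:
        return None
    # 2. lowercase
    text = tweet.lower()
    # 4. remove URLs (before symbol removal, so each URL is matched intact)
    text = URL_RE.sub(" ", text)
    # 5. remove every character that is not a lowercase letter, digit, or whitespace
    text = SYMBOL_RE.sub(" ", text)
    # 3. remove stopwords
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)
```

A tweet that returns `None` is discarded before classification.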

Steps in Classification:
1. vectorize with `Bag of Words`
2. vectorize with `Term Frequency - Inverse Document Frequency (TFIDF)`
3. filter with `Neural Network`
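
As a rough sketch of the two vectorization steps, the following builds count vectors and then reweights them with TF-IDF using only the standard library. The function names and the smoothed IDF formula are assumptions chosen for illustration; the `Neural Network` filter itself is not shown.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of cleaned tweets."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc.split()).items():
            row[index[tok]] = count
        vectors.append(row)
    return vocab, vectors


def tfidf(count_vectors):
    """Reweight count vectors by inverse document frequency."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0]) if count_vectors else 0
    # document frequency: number of tweets that contain each term
    df = [sum(1 for row in count_vectors if row[j] > 0) for j in range(n_terms)]
    # smoothed IDF (an assumption), so a term present in every tweet keeps weight 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    return [[count * weight for count, weight in zip(row, idf)] for row in count_vectors]
```

Terms that appear in fewer tweets end up with larger weights than terms shared by every tweet.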

### Cluster

The `vectorized Tweets` are clustered per day. For now, `Spectral Clustering` is used (`Hierarchical Clustering` will also be tested). The clustered dataset is then fed into a `Neural Network` to produce a `Good Tweets Cluster`, a `Bad Tweets Cluster`, or both, as applicable.

Steps in `Pre-Process`:
1. remove tweets with `#Tags > 3`
2. lowercase
3. remove stopwords
4. remove urls
5. remove any symbols
6. sort in ascending order
7. get tweets for a day

Steps in Clustering:
1. vectorize with `Bag of Words`
2. vectorize with `Term Frequency - Inverse Document Frequency (TFIDF)`
3. <s>reduce vector dimension with `Latent Semantic Analysis (LSA)` via `Singular Value Decomposition (SVD)`</s>. Not required, as the `Neural Network` is capable of identifying text patterns, so more data is better.
4. normalize vector
5. cluster with `Spectral Clustering`
6. filter cluster with `Neural Network`
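
To make the normalize and cluster steps concrete, here is a simplified sketch that assumes NumPy is available. It covers only the two-cluster case, splitting the tweets by the sign of the second eigenvector of the normalized graph Laplacian; the two-way split, the cosine-similarity affinity, and the function names are all assumptions, and a library implementation of `Spectral Clustering` would handle more clusters.

```python
import numpy as np


def l2_normalize(X):
    """Scale each row vector to unit length (all-zero rows are left as zeros)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms


def spectral_bipartition(X):
    """Split row vectors into two clusters; returns an array of 0/1 labels."""
    Xn = l2_normalize(X)
    # cosine similarity as affinity (non-negative for TF-IDF vectors)
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)
    degree = W.sum(axis=1)
    degree[degree == 0] = 1.0  # isolated points: avoid division by zero
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector
    _, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs[:, 1] > 0).astype(int)
```

Which cluster receives label 0 and which receives label 1 is arbitrary, so only the grouping is meaningful.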

### Sentiment Analysis

`Good/Bad Clustered Tweets` are fed into a `Neural Network`. Tweets from both the `Good` and `Bad` groups are clustered. The tweets in each cluster are counted and the top tweets of each category are picked. The counts are then summed for each group and normalized, and the `absolute difference` between the two groups is returned.
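
One possible reading of this scoring step is sketched below. It assumes "top" means the largest clusters in each group and that normalization divides by the combined count; `sentiment_score` and `top_k` are hypothetical names, not part of the project.

```python
def sentiment_score(good_cluster_sizes, bad_cluster_sizes, top_k=3):
    """Absolute difference between the normalized Good and Bad tweet counts."""
    # keep only the top_k largest clusters of each group (top_k is an assumption)
    good = sum(sorted(good_cluster_sizes, reverse=True)[:top_k])
    bad = sum(sorted(bad_cluster_sizes, reverse=True)[:top_k])
    total = good + bad
    if total == 0:
        return 0.0
    # normalize both sums by the combined count, then take the absolute difference
    return abs(good / total - bad / total)
```

With this normalization the score lies between 0 (balanced groups) and 1 (only one group present).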

##### This can be extended to predict the `Delta Price` for the next day.
`Delta Price` is calculated with a moving average of the previous 10 days.
The vectors are fed into a `Neural Network`, which should produce an output equivalent to `Delta Price`.
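
The text does not spell out the exact formula, so the sketch below shows just one plausible interpretation: `Delta Price` as the difference between a day's closing price and the mean close of the 10 days before it. The function name and the use of closing prices are assumptions.

```python
def delta_prices(closes, window=10):
    """Difference between each close and the mean of the previous `window` closes."""
    deltas = []
    for i in range(window, len(closes)):
        moving_avg = sum(closes[i - window:i]) / window
        deltas.append(closes[i] - moving_avg)
    return deltas
```

The first `window` days have no value because they lack a full history.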

Steps in `Pre-Process`:
1. remove tweets with `#Tags > 3`
2. lowercase
3. remove stopwords
4. remove urls
5. remove any symbols

Steps in Clustering:
1. vectorize with `Bag of Words`
2. vectorize with `Term Frequency - Inverse Document Frequency (TFIDF)`
3. <s>reduce vector dimension with `Latent Semantic Analysis (LSA)` via `Singular Value Decomposition (SVD)`</s>. Not required, as the `Neural Network` is capable of identifying text patterns, so more data is better.
4. cluster with `Spectral Clustering`
5. count items in cluster and normalize the value
6. return difference of the two groups.

May extend the functionality:

7. regression to predict `Delta Price`.

### Trend Prediction

Price ranges of various cryptocurrencies, including `High`, `Low`, `Open`, and `Close`, along with the `Sentiment` value and the `Delta Price` obtained by training on the tweets, are fed into the model. The model keeps predicting prices for the near future until there is a considerable rise or dip in the price plot, which indicates a change in trend.

Steps in Data Preprocessing:
1. Normalize dataset
2. Prepare output labels
3. Split into Train and Test
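
The three preprocessing steps could look roughly like this. It is a sketch under assumptions: min-max scaling as the normalization, next-day features as the labels, and an unshuffled 80/20 split. None of these choices or function names are confirmed by the project.

```python
def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; also return the (min, max) pairs."""
    columns = list(zip(*rows))
    bounds = [(min(col), max(col)) for col in columns]
    scaled = [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]
    return scaled, bounds


def make_supervised(rows):
    """Pair each day's features with the next day's features as the label."""
    return rows[:-1], rows[1:]


def chronological_split(inputs, labels, train_fraction=0.8):
    """Split without shuffling so the test set is strictly later in time."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], labels[:cut]), (inputs[cut:], labels[cut:])
```

Following the listed order normalizes before splitting; computing the bounds on the training portion only would avoid leaking test-set ranges into training.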

Steps in defining the model:
1. Set the `input size` [No. of Features]
2. Set the `output size` [No. of Features predicted to feed next]
3. Set the `num_size` [Length of sequence fed into the model, which is 1 in our case as we feed only one day's data]
4. Set the `lstm_size` [The number of units in the LSTM cell]
5. Fetch processed Data
6. Send data to model for training

Steps after model training:
1. Train the model
2. Plot graph between predicted and actual values
3. Take `test_pred` and feed it back into the model to predict the prices for the next day
4. Keep repeating step 3 until the predicted value falls outside the `Delta Price`
5. Return the final prediction
6. Save the model
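
The feedback loop in steps 3 to 5 might be sketched as below. Here `predict_next` is a hypothetical stand-in for the trained model, and "falls outside the `Delta Price`" is interpreted, as an assumption, to mean that the predicted close has moved further from the starting close than `delta_price`. The `max_steps` cap is also an addition to keep the loop finite.

```python
def rollout_until_trend_change(predict_next, last_features, delta_price,
                               close_index=0, max_steps=30):
    """Feed each prediction back into the model until the price leaves the band.

    predict_next: hypothetical callable mapping one day's feature list to the
                  predicted feature list for the following day.
    """
    start_close = last_features[close_index]
    current = last_features
    predictions = []
    for _ in range(max_steps):
        current = predict_next(current)
        predictions.append(current)
        # stop once the move away from the starting close exceeds delta_price
        if abs(current[close_index] - start_close) > delta_price:
            break
    return predictions
```

The last element of the returned list is the final prediction, and the length of the list gives the number of days until the predicted trend change.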