# Predictive Analytics Assignment Group 17
##### Group Members:
| Name | SNR |
| :---- | :--- |
|Nadya Hagen | 2049115 |
| Kenéz Kovács | 2040678 |
| Aryo Bimo Nugroho | 2039696 |
| Sheva Sonia Rahmani | 2075109 |

## The following is a model that aims to predict the direction of the movement of the price of a dogecoin based on Elon Musk's tweets ###

In [2]:
# Loading libraries, getting resources:
import pandas as pd
import nltk
import random
import csv
import random
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.probability import FreqDist

[nltk_data] Downloading package stopwords to /Users/nadya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data initialization
We expect the user of the model to know their dataset well, the code itself does not have data preparative features. In our case the raw data was processed using mostly pandas and excel features, we felt like the code needn't be  included here as it is not remotely reproducible. We have however included our pandas manipulations in another file

In [3]:
data = pd.read_csv('data_fluctuations.csv', delimiter = ',') # Here we import the dataset

### Creating the functions that classify the data
We created a total of 4 functions which help us to classify the data.

In [4]:
# We create the function that makes a list of lists of words of the tweets 

def into_words(tweetcolumn):
    """This function creates a list of lists of the words in the tweets."""
    wordlist = []
    for tweet_sentence in tweetcolumn:
            tweet_sentence_string_lower = str(tweet_sentence).lower()
            tweet_words = tweet_sentence_string_lower.split()
            wordlist.append(tweet_words)
    return(wordlist)

# We create the function that removes stopwords from the list, thus creating a stopword-free list of words

def filterstopwords(wordlist):
    """This function removes stopwords from the list of words in the tweets to create a 
    stopword-free list of words."""
    filtered_wordlist = []
    stop_words = list(stopwords.words('english'))
    for lists in wordlist:
        for words in lists:
            if  words.lower() not in stop_words:
                
                filtered_wordlist.append(words.lower())
    return(filtered_wordlist)

# We create the function that makes a list of the top 3000 words used by Elon Musk in his tweets

def top3kwordizer(filtered_wordlist):
    """This function creates a list of the top 3000 words which Elon Musk uses in his tweets."""
    all_words = nltk.FreqDist(filtered_wordlist)
    all_features = list(all_words)[:3000]
    return all_features

# We create the function that makes the list of lists that contain dictionaries with the features

def list_of_dicts(tweet_sentence, all_features):
    """This function creates a list of lists which contain dictionaries with the features which will
    be used for the model."""
    tweet_words = set(tweet_sentence)
    features = {}
    for word in all_features:
        if word in tweet_words:
            features['contains({})'.format(word)] = True
        else:
            features['contains({})'.format(word)] = False
    return features

## Dataset to classified set of data
Here we use the functions which we have defined above to go from our original dataset to a classified set of data.

In [5]:
# Running the functions one by one to classify the data
wordized_list = into_words(data['tweet']) # we have the list of lists of words
filteredwords = filterstopwords(wordized_list) # we have the list of all words filtered free of stopwords
top3kwords = top3kwordizer(filteredwords) # we have the top 3000 words from the list above
dictslist = list() # we get the list of list of dictionaries containing the features and if theyre in the tweet or not
for tweet in wordized_list: 
        dictslist.append(list_of_dicts(tweet, top3kwords))
        
# Here we use pandas' dataframe feature in order to build our classified dataset

workingdataframe = pd.DataFrame()
workingdataframe['features'] = (pd.Series(list(dictslist)))
workingdataframe['direction'] = data['direction']

# Creating the list of lists that the model will train on

finaldata = workingdataframe.values.tolist()

### Training the model
All that is left to do is to train the model. We first create a training data set to train our model upon and a test data set to test the model's accuracy on. Because our data is descending in its date, we have decided to create the training set based on the data at the bottom of the file and vice versa for the test set. We made this decision as Elon Musk's tweets had more traction on Dogecoin when he first started tweeting about it.

In [6]:
# Training the model
train_set = finaldata[500:] # Here we select the bottom 3479 tweets/rows in the dataset
test_set = finaldata[:500] # Here we select the top 500 tweets/rows in the dataset
classifier = nltk.NaiveBayesClassifier.train(train_set)
acc = nltk.classify.accuracy(classifier, test_set) # Here we define the accuracy measure of the model
print("The accuracy of the model is : ", acc)
classifier.show_most_informative_features()

The accuracy of the model is :  0.538
Most Informative Features
     contains(@tashaark) = True                0 : 1      =     11.1 : 1.0
         contains(tests) = True                0 : 1      =      8.1 : 1.0
            contains(🔥🔥) = True                1 : 0      =      7.8 : 1.0
contains(@teslaratiteam) = True                1 : 0      =      7.0 : 1.0
      contains(however,) = True                1 : 0      =      7.0 : 1.0
contains(@justpaulinelol) = True                0 : 1      =      6.9 : 1.0
           contains(pcr) = True                0 : 1      =      6.3 : 1.0
         contains(turns) = True                0 : 1      =      6.3 : 1.0
     contains(@aarons5_) = True                1 : 0      =      6.3 : 1.0
contains(@caspar_stanley) = True                0 : 1      =      5.7 : 1.0
