# Bitcoin Prediction based on Sentiment Analysis


## Introduction
For our project, we use Twitter data to perform a sentiment analysis of users' opinions about cryptocurrencies, in order to create a predictive model that relies on that sentiment to predict how different cryptocurrencies may behave.


## Description

### Data
First, we selected the datasets we would use for the analysis as well as to train and test our models later on.
We have chosen the following datasets for our analysis:
- Covid-19 Twitter chatter dataset
  > The data can be obtained directly from the publisher's repository. A copy is also included with our project, as permitted by the terms of use stipulated by the publisher and by Twitter: https://github.com/thepanacealab/covid19_twitter
- Apple Twitter dataset
  > This dataset can be obtained from: https://socialgrep.com/datasets/five-years-of-aapl-on-reddit
- UCC (The Unhealthy Comments Corpus)
  > This dataset can be obtained from: https://github.com/conversationai/unhealthy-conversations

### Research Question 
We chose to investigate how the price of Bitcoin may be affected by Twitter sentiment about the currency. To do so, we use a sentiment analysis model trained on the UCC corpus and a final prediction model built on top of the sentiment model.


To begin, we must first install some modules required to access the data, as per the publisher (Banda et al., 2021):
- Twarc
- Tweepy (v. 3.8.0)
- Argparse (v 3.2)
- Xtract (v 0.1 a3)
- Wget (v 3.2)


In [1]:
from IPython.display import clear_output
!pip install twarc 
!pip install tweepy 
!pip install argparse 
!pip install xtract 
!pip install ipywidgets
!pip install wget
clear_output()


## Setup
Before we run the analysis, we need to filter and process our data to extract only the English tweets.
To access the data, we follow the instructions provided by Banda et al. in their tutorial: https://github.com/thepanacealab/covid19_twitter/blob/master/COVID_19_dataset_Tutorial.ipynb

In [2]:
import gzip
import shutil
import os
import csv
import wget
import linecache
from shutil import copyfile
import ipywidgets as widgets
import numpy as np
import pandas as pd

## Filtering dataset by Language

In [3]:
#Unzips the dataset and gets the TSV dataset
with gzip.open('clean-dataset.tsv.gz', 'rb') as f_in:
    with open('clean-dataset.tsv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)



#Gets all possible languages from the dataset
df = pd.read_csv('clean-dataset.tsv',sep="\t")
lang_list = df.lang.unique()
lang_list= sorted(np.append(lang_list,'all'))
lang_picker = widgets.Dropdown(options=lang_list, value="en")
lang_picker

Dropdown(index=14, options=('all', 'am', 'ar', 'bg', 'bn', 'bo', 'ca', 'ckb', 'cs', 'cy', 'da', 'de', 'dv', 'e…

In [4]:
#Creates a new clean dataset with the English tweets and prints out the first 5 tweets from the filtered dataset.
filtered_language = lang_picker.value

#If 'all' (or no language) is selected, it will keep all records from the dataset
if filtered_language in ("", "all"):
    copyfile('clean-dataset.tsv', 'clean-dataset-filtered.tsv')

#If a language is specified, it will create another tsv file with the filtered records
else:
    with open('clean-dataset.tsv') as f_input, open('clean-dataset-filtered.tsv', 'w') as f_output:
        #Copy the header row unchanged
        f_output.write(f_input.readline())
        for line in f_input:
            #The language code is the fourth tab-separated column
            if line.rstrip('\n').split('\t')[3] == filtered_language:
                f_output.write(line)

#Read the filtered file back so the preview works for both branches
with open('clean-dataset-filtered.tsv') as f_filtered:
    filtered_tw = f_filtered.readlines()

#Skip the header row (index 0) and show at most five tweets
print('\033[1mShowing first 5 tweets from the filtered dataset\033[0m')
print(filtered_tw[1:6])

Showing first 5 tweets from the filtered dataset
['1351757442653294592\t2021-01-20\t05:04:23\ten\tNULL\n', '1351757444033069056\t2021-01-20\t05:04:23\ten\tNULL\n', '1351757446860083202\t2021-01-20\t05:04:24\ten\tNULL\n', '1351757447619375106\t2021-01-20\t05:04:24\ten\tNULL\n', '1351757448219140105\t2021-01-20\t05:04:24\ten\tNULL\n']
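The cell above selects rows by column position. As a minimal sketch of an alternative, the filter below looks the language column up by its header name and streams the file row by row instead of holding every tweet in memory. The function name `filter_tsv_by_language` is hypothetical, and the sketch assumes the TSV has a header row containing a `lang` column, as the `df.lang` access earlier suggests.

```python
import csv


def filter_tsv_by_language(src_path, dst_path, language, lang_column="lang"):
    """Copy rows whose language column equals `language`; return how many were kept.

    Passing "all" keeps every row. The header row is always written.
    """
    kept = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        for row in reader:
            if language == "all" or row[lang_column] == language:
                writer.writerow(row)
                kept += 1
    return kept
```

Calling `filter_tsv_by_language('clean-dataset.tsv', 'clean-dataset-filtered.tsv', 'en')` would produce the same kind of filtered file as the cell above.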


## Authentication 
Accessing the Twitter APIs requires a set of credentials that you must pass with each request. These credentials can come in different forms depending on the type of authentication that is required by the specific endpoint that you are using. More information: https://developer.twitter.com/en/docs/apps/overview

The credentials can be obtained from the developer portal (https://developer.twitter.com/en/portal/dashboard).
**Ensure that your developer account has elevated access, as the basic essential credentials will not allow you to access the data.**


In [5]:
import json
import tweepy
from tweepy import OAuthHandler

# Authenticate
# Replace the placeholders with your own credentials; never commit real keys to a shared notebook or repository.
CONSUMER_KEY = "YOUR_CONSUMER_KEY" #@param {type:"string"}
CONSUMER_SECRET_KEY = "YOUR_CONSUMER_SECRET" #@param {type:"string"}
ACCESS_TOKEN_KEY = "YOUR_ACCESS_TOKEN" #@param {type:"string"}
ACCESS_TOKEN_SECRET_KEY = "YOUR_ACCESS_TOKEN_SECRET" #@param {type:"string"}

#Creates a JSON Files with the API credentials
with open('api_keys.json', 'w') as outfile:
    json.dump({
    "consumer_key":CONSUMER_KEY,
    "consumer_secret":CONSUMER_SECRET_KEY,
    "access_token":ACCESS_TOKEN_KEY,
    "access_token_secret": ACCESS_TOKEN_SECRET_KEY
     }, outfile)
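Credentials typed directly into a notebook are easy to leak when the notebook is shared. One possible alternative, sketched here under stated assumptions, reads the four values from environment variables and writes the same `api_keys.json` structure. The environment variable names and the `write_api_keys` helper are hypothetical; only the JSON field names come from the cell above.

```python
import json
import os

# Hypothetical environment variable names; rename them to match your own setup.
ENV_TO_JSON_KEY = {
    "TWITTER_CONSUMER_KEY": "consumer_key",
    "TWITTER_CONSUMER_SECRET": "consumer_secret",
    "TWITTER_ACCESS_TOKEN": "access_token",
    "TWITTER_ACCESS_TOKEN_SECRET": "access_token_secret",
}


def write_api_keys(path="api_keys.json", environ=None):
    """Write the credentials file from environment variables and return the dict."""
    environ = os.environ if environ is None else environ
    # Fail early with a clear message rather than writing empty credentials.
    missing = [name for name in ENV_TO_JSON_KEY if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    keys = {json_key: environ[name] for name, json_key in ENV_TO_JSON_KEY.items()}
    with open(path, "w") as outfile:
        json.dump(keys, outfile)
    return keys
```

With the variables exported in the shell that launches the notebook, `write_api_keys()` creates the key file without any secret appearing in the notebook itself.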

## Hydrating the Tweets

Before parsing the dataset, a hydration process is required. It is done using the following social media mining tool: https://github.com/thepanacealab/SMMT

To perform this action, a Python file from that repository is required (get_metadata.py).
Once obtained, this utility takes a file that meets one of the following requirements:

- A csv file which either contains one tweet id per line or contains at least one column of tweet ids
- A text file which contains one tweet id per line
- A tsv file which either contains one tweet id per line or contains at least one column of tweet ids

In this case, the filtered dataset generated before (clean-dataset-filtered.tsv), which is in TSV format, will be used for the hydration process.

The arguments for this utility (get_metadata.py) are the following:
- i: input file name
- o: output file name
- k: key file name (a JSON file containing your API keys)
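The three arguments listed above can also be assembled from Python, which makes it easier to check that the input file exists before starting a long hydration run. This is only a sketch: the helper names `build_hydration_command` and `hydrate` are hypothetical, and it assumes `get_metadata.py` sits in the current working directory.

```python
import os
import subprocess
import sys


def build_hydration_command(input_file, output_prefix, key_file, script="get_metadata.py"):
    """Return the argument list for one run of the hydration script."""
    return [sys.executable, script, "-i", input_file, "-o", output_prefix, "-k", key_file]


def hydrate(input_file, output_prefix, key_file):
    """Run the hydration script, raising if the input is missing or the run fails."""
    if not os.path.exists(input_file):
        raise FileNotFoundError(input_file)
    command = build_hydration_command(input_file, output_prefix, key_file)
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a file name contains spaces.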

In [None]:
!python get_metadata.py -i clean-dataset-filtered.tsv -o hydrated_tweets -k api_keys.json

Running get_metadata.py as shown above generates four files:

- A hydrated_tweets.json file which contains the full json object for each of the hydrated tweets
- A hydrated_tweets.CSV file which contains partial fields extracted from the tweets.
- A hydrated_tweets.zip file which contains a zipped version of the tweets_full.json file.
- A hydrated_tweets_short.json which contains a shortened version of the hydrated tweets.
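Hydration can stop early, for example when the API rate limit is hit, so it may be worth confirming that all four files exist before moving on. The check below is a small sketch; the helper name `missing_hydration_outputs` is hypothetical, and file names are compared case-insensitively because the exact capitalisation of the CSV extension may differ on disk.

```python
from pathlib import Path

# Suffixes of the four output files listed above, in lower case.
EXPECTED_SUFFIXES = (".json", ".csv", ".zip", "_short.json")


def missing_hydration_outputs(prefix="hydrated_tweets", directory="."):
    """Return the expected output file names that are not present in `directory`."""
    present = {path.name.lower() for path in Path(directory).iterdir()}
    expected = [prefix + suffix for suffix in EXPECTED_SUFFIXES]
    return [name for name in expected if name.lower() not in present]
```

An empty list from `missing_hydration_outputs()` means every expected file was found.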

## Parsing Tweets
To parse the tweets, we need the following files from the SMMT data preprocessing tools:
- https://raw.githubusercontent.com/thepanacealab/SMMT/master/data_preprocessing/parse_json_lite.py
- https://raw.githubusercontent.com/thepanacealab/SMMT/master/data_preprocessing/fields.py

In [None]:
from IPython.display import clear_output

!pip install emot --upgrade
!pip install emoji --upgrade

clear_output()


In [None]:
with open("hydrated_tweets_short.json", "r") as myfile:
    list_tweets = list(myfile)

#Write the tweets to a new file; the with-block closes it, so the data is flushed to disk before parsing
with open("sample_data.json", "w") as file:
    file.writelines(list_tweets)

### The following code uses the utility above to parse the data and preprocess it.

parse_json_lite.py: The first argument is the JSON file. The second argument is optional; if it is given, the script also preprocesses the JSON file. The preprocessing removes URLs, Twitter-specific URLs, emojis and emoticons.

In [None]:
!python parse_json_lite.py sample_data.json p

clear_output()
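To illustrate the kind of cleaning described above, here is a rough regex-based sketch. It is not the SMMT implementation: the patterns are simplified assumptions that cover only plain `http(s)`/`www` URLs, a few common Unicode emoji blocks, and basic western emoticons such as `:)` or `;-(`.

```python
import re

# Assumption: URLs start with http://, https:// or www. and run to the next space.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Assumption: a few common emoji blocks; real emoji coverage is broader than this.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)
# Eyes, optional nose, mouth; must stand alone between whitespace.
EMOTICON_PATTERN = re.compile(r"(?<!\S)[:;=][\-o\*']?[\)\]\(\[dDpP/\\|](?!\S)")


def clean_tweet(text):
    """Strip URLs, emojis and simple emoticons, then collapse repeated whitespace."""
    for pattern in (URL_PATTERN, EMOJI_PATTERN, EMOTICON_PATTERN):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("Bitcoin to the moon 🚀 https://t.co/abc123 :)"))
# prints: Bitcoin to the moon
```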

## Preprocessing 
Now that the data is hydrated and organised into the four files, we can begin preprocessing it.

> For our project, we perform a sentiment analysis on tweets related to cryptocurrencies and use this analysis to predict how the currencies will vary depending on the sentiment.

> Since we are only interested in tweets related to Bitcoin, we specify a list of filter words and discard the tweets that do not contain any of them.

> After that, we perform a sentiment analysis using pre-trained models to see whether we can accurately predict the sentiment of the tweets.

> The models used are trained on the UCC (The Unhealthy Comments Corpus) mentioned before, which contains over 40,000 online comments that have been tagged with sentiment values.
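The keyword filter described above could look roughly like the following sketch. The keyword list is a hypothetical example, since the words actually used in the filter are not stated here, and the helper names are illustrative. Matching is case-insensitive and whole-word, with an optional leading `#` or `$` so that hashtags and cashtags are caught.

```python
import re

# Hypothetical keyword list; adjust it to the words the project actually filters on.
BITCOIN_KEYWORDS = ("bitcoin", "btc")


def build_keyword_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern for all keywords."""
    alternatives = "|".join(re.escape(word) for word in keywords)
    return re.compile(r"(?<!\w)[#$]?(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def filter_tweets(tweets, keywords=BITCOIN_KEYWORDS):
    """Keep only the tweets that mention at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    return [tweet for tweet in tweets if pattern.search(tweet)]
```

Whole-word matching avoids keeping tweets where a keyword only appears inside a longer, unrelated word.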

## Apple-Twitter Data

In [None]:
ucc_train = pd.read_csv("UCC/train.csv")
aapl = pd.read_csv("data/Apple-Twitter-Sent.csv", encoding="ISO-8859-1")

In [None]:
aapl = aapl.drop(columns="_golden _unit_state _last_judgment_at date id query sentiment_gold".split())
aapl

In [None]:
import nltk
import re
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, TweetTokenizer

tweets = list(zip(aapl["text"], aapl["sentiment"]))

ttk = TweetTokenizer()

tokens = [(ttk.tokenize(tweet), sentiment) for tweet, sentiment in tweets]

#Keep alphabetic tokens plus any token containing '#' or '@' (hashtags and mentions); everything else is dropped
filtered = []
for tweet in tokens:
    toks = []
    for tok in tweet[0]:
        if tok.isalpha() or "#" in tok or "@" in tok:
            toks.append(tok)
    filtered.append((toks, tweet[1]))

tagged = [(nltk.pos_tag(tweet), sentiment) for tweet, sentiment in filtered]

tagged[2]
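The imports above include `WordNetLemmatizer`, which expects WordNet part-of-speech codes rather than the Penn Treebank tags returned by `nltk.pos_tag`. Assuming lemmatization is the intended next step, a conversion along these lines could bridge the two. The helper names are hypothetical; the single-letter codes are the values of `wn.ADJ`, `wn.VERB`, `wn.ADV` and `wn.NOUN`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective, same value as wn.ADJ
    if tag.startswith("VB"):
        return "v"  # verb, same value as wn.VERB
    if tag.startswith("RB"):
        return "r"  # adverb, same value as wn.ADV
    return "n"      # noun, same value as wn.NOUN


def with_wordnet_pos(tagged_tokens):
    """Convert (token, penn_tag) pairs into (token, wordnet_pos) pairs."""
    return [(token, penn_to_wordnet(tag)) for token, tag in tagged_tokens]
```

Each resulting pair can then be passed to `WordNetLemmatizer().lemmatize(token, pos)`.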