# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.
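The step described above (turning the `articles` list returned by NewsAPI into per-article sentiment rows) can be sketched as one reusable helper. This is a minimal sketch rather than the notebook's own code: `build_sentiment_rows`, `fake_scores`, and `sample_articles` are hypothetical names, and `fake_scores` stands in for a real scorer such as VADER's `polarity_scores`, which returns a dict with `neg`, `neu`, `pos`, and `compound` keys.

```python
def build_sentiment_rows(articles, score_fn):
    """Turn a list of article dicts into sentiment rows, skipping empty content."""
    rows = []
    for article in articles:
        text = article.get("content")
        if not text:
            # Some articles may come back without content; skip them
            continue
        scores = score_fn(text)
        rows.append({
            "text": text,
            "date": article["publishedAt"][:10],  # keep only YYYY-MM-DD
            "compound": scores["compound"],
            "positive": scores["pos"],
            "negative": scores["neg"],
            "neutral": scores["neu"],
        })
    return rows


def fake_scores(text):
    """Hypothetical stand-in scorer that treats every text as neutral."""
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


# Hypothetical sample data shaped like the `articles` list in a NewsAPI response
sample_articles = [
    {"content": "Bitcoin news text", "publishedAt": "2022-02-17T12:00:00Z"},
    {"content": None, "publishedAt": "2022-02-18T08:30:00Z"},
]
rows = build_sentiment_rows(sample_articles, fake_scores)
print(rows)
```

Passing a real scorer as `score_fn` and wrapping the result in `pd.DataFrame(rows)` should give one table per coin with these columns, without repeating the loop for each coin.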

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?

In [1]:
# Initial imports
import os
import pandas as pd
from datetime import datetime, timedelta
from dotenv import load_dotenv
import nltk
import alpaca_trade_api as tradeapi
from newsapi.newsapi_client import NewsApiClient
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/mrose/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Read your api key environment variable
load_dotenv("example.env")

# Set Alpaca API key and secret
alpaca_key = os.getenv("ALPACA_KEY")
alpaca_secret = os.getenv("ALPACA_SECRET")

api = tradeapi.REST(alpaca_key, alpaca_secret, api_version='v2')

In [3]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=os.environ["NEWS_KEY"])

In [4]:
# Fetch the Bitcoin news articles
btc_news_en = newsapi.get_everything(
    q="bitcoin AND Bitcoin",
    language="en",
    page_size = 100,
    sort_by = 'relevancy'
)

# Show the total number of news articles
btc_news_en["totalResults"]

7485

In [5]:
# Fetch the Ethereum news articles
eth_news_en = newsapi.get_everything(
    q="ethereum AND Ethereum",
    language="en",
    page_size = 100,
    sort_by = 'relevancy'
)

# Show the total number of news articles
eth_news_en["totalResults"]

3551

In [11]:
# Create the Bitcoin sentiment scores DataFrame
btc_sent = []

for article in btc_news_en['articles']:
    try:
        text = article['content']
        date = article['publishedAt'][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment['compound']
        pos = sentiment['pos']
        neu = sentiment['neu']
        neg = sentiment['neg']
        
        btc_sent.append({
            'text':text,
            'date':date,
            'compound':compound,
            'positive':pos,
            'negative':neg,
            'neutral':neu})
        
    except AttributeError:
        pass

btc_sent_df = pd.DataFrame(btc_sent)
btc_sent_df

| | text | date | compound | positive | negative | neutral |
|---|---|---|---|---|---|---|
| 0 | Even in cyberspace, the Department of Justice ... | 2022-02-17 | 0.7351 | 0.147 | 0.000 | 0.853 |
| 1 | When Russia invaded Ukraine, Niki Proshin was ... | 2022-03-17 | 0.0000 | 0.000 | 0.000 | 1.000 |
| 2 | "Bitcoin was seen by many of its libertarian-l... | 2022-03-12 | -0.7713 | 0.000 | 0.169 | 0.831 |
| 3 | Feb 22 (Reuters) - Bitcoin miners are feeling ... | 2022-02-22 | -0.1779 | 0.046 | 0.067 | 0.887 |
| 4 | March 1 (Reuters) - Bitcoin has leapt since Ru... | 2022-03-01 | 0.0000 | 0.000 | 0.000 | 1.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | DENVER (KDVR) — Gov. Jared Polis says Colorado... | 2022-02-27 | 0.3818 | 0.071 | 0.000 | 0.929 |
| 96 | March 15 (Reuters) - Blockchain technology fir... | 2022-03-15 | 0.0000 | 0.000 | 0.000 | 1.000 |
| 97 | As Canadian protests against vaccine mandates ... | 2022-03-15 | -0.5719 | 0.035 | 0.134 | 0.831 |
| 98 | Bitcoin and other cryptocurrencies are payment... | 2022-03-13 | 0.6124 | 0.152 | 0.000 | 0.848 |


In [12]:
# Create the Ethereum sentiment scores DataFrame
eth_sent = []

for article in eth_news_en['articles']:
    try:
        text = article['content']
        date = article['publishedAt'][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment['compound']
        pos = sentiment['pos']
        neu = sentiment['neu']
        neg = sentiment['neg']
        
        eth_sent.append({
            'text':text,
            'date':date,
            'compound':compound,
            'positive':pos,
            'negative':neg,
            'neutral':neu})
        
    except AttributeError:
        pass

eth_sent_df = pd.DataFrame(eth_sent)
eth_sent_df

| | text | date | compound | positive | negative | neutral |
|---|---|---|---|---|---|---|
| 0 | Technical analysis isnt a perfect tool, but it... | 2022-02-17 | -0.2498 | 0.000 | 0.059 | 0.941 |
| 1 | In February, shit hit the fan in the usual way... | 2022-03-01 | -0.3182 | 0.059 | 0.093 | 0.848 |
| 2 | Coinbase reported that the share of trading vo... | 2022-02-25 | 0.6705 | 0.188 | 0.000 | 0.812 |
| 3 | Illustration by James Bareham / The Verge\r\n\... | 2022-02-26 | -0.4588 | 0.000 | 0.083 | 0.917 |
| 4 | It seems that in 2022, you cant escape from th... | 2022-03-03 | -0.1326 | 0.000 | 0.044 | 0.956 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Some investors have a lot of money in cryptocu... | 2022-02-24 | 0.0000 | 0.000 | 0.000 | 1.000 |
| 96 | Many brands are starting to see a recovery des... | 2022-02-19 | 0.3903 | 0.087 | 0.037 | 0.877 |
| 97 | In Bitcoins proof of work, that investment is ... | 2022-03-04 | 0.4019 | 0.135 | 0.070 | 0.795 |
| 98 | Relatively unheralded cryptocurrency Waves( WA... | 2022-03-18 | 0.0000 | 0.000 | 0.000 | 1.000 |


In [13]:
# Describe the Bitcoin Sentiment
btc_sent_df.describe()

| | compound | positive | negative | neutral |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.05654 | 0.07141 | 0.0514 | 0.87717 |
| std | 0.441233 | 0.070129 | 0.063819 | 0.087267 |
| min | -0.8957 | 0.0 | 0.0 | 0.627 |
| 25% | -0.28445 | 0.0 | 0.0 | 0.83 |
| 50% | 0.0 | 0.065 | 0.0165 | 0.88 |
| 75% | 0.4068 | 0.10425 | 0.084 | 0.9385 |
| max | 0.91 | 0.301 | 0.265 | 1.0 |


In [14]:
# Describe the Ethereum Sentiment
eth_sent_df.describe()

| | compound | positive | negative | neutral |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.139883 | 0.07699 | 0.04231 | 0.88068 |
| std | 0.427214 | 0.069638 | 0.061696 | 0.086568 |
| min | -0.9136 | 0.0 | 0.0 | 0.627 |
| 25% | -0.0129 | 0.0 | 0.0 | 0.83525 |
| 50% | 0.1531 | 0.0695 | 0.0 | 0.887 |
| 75% | 0.5052 | 0.1235 | 0.06625 | 0.9415 |
| max | 0.8625 | 0.29 | 0.312 | 1.0 |


### Questions:

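Q: Which coin had the highest mean positive score?

A: Ethereum had the highest mean positive score (0.07699 versus 0.07141 for Bitcoin).

Q: Which coin had the highest compound score?

A: Bitcoin had the highest maximum compound score (0.91 versus 0.8625 for Ethereum), although Ethereum had the higher mean compound score (0.139883 versus 0.05654).

Q: Which coin had the highest positive score?

A: Bitcoin had the highest maximum positive score (0.301 versus 0.29 for Ethereum).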
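
The answers above can be double-checked mechanically. The following is a minimal sketch that copies the relevant values from the two `describe()` outputs into a plain dictionary; the `summary` structure and the `highest` helper are hypothetical names introduced only for this illustration.

```python
# Values copied from the describe() outputs for each coin
summary = {
    "Bitcoin": {
        "mean_positive": 0.07141,
        "max_positive": 0.301,
        "max_compound": 0.91,
        "max_negative": 0.265,
    },
    "Ethereum": {
        "mean_positive": 0.07699,
        "max_positive": 0.29,
        "max_compound": 0.8625,
        "max_negative": 0.312,
    },
}


def highest(stat):
    """Return the coin with the largest value for the given statistic."""
    return max(summary, key=lambda coin: summary[coin][stat])


for stat in ("mean_positive", "max_compound", "max_positive", "max_negative"):
    print(stat, "->", highest(stat))
```

With the DataFrames still in memory, the same comparison could be made directly on `btc_sent_df.describe()` and `eth_sent_df.describe()` instead of hand-copied values.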

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove punctuation.
3. Remove stop words.
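
The three steps above can be sketched with the standard library alone. This is an illustrative sketch, not the NLTK-based tokenizer the notebook builds: `simple_tokenize` is a hypothetical helper, and `STOPWORDS` is a deliberately tiny, hypothetical stop word set standing in for NLTK's full English list.

```python
import re
from string import punctuation

# Hypothetical, deliberately tiny stop word set for illustration only;
# the notebook uses NLTK's full English list instead.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}


def simple_tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, replace punctuation with spaces, and drop stop words."""
    lowered = text.lower()
    no_punct = re.sub(f"[{re.escape(punctuation)}]", " ", lowered)
    return [word for word in no_punct.split() if word not in stopwords]


tokens = simple_tokenize("Bitcoin is the talk of the town, and Ethereum is rising!")
print(tokens)  # ['bitcoin', 'talk', 'town', 'ethereum', 'rising']
```

Lowercasing before the stop word check matters because stop word lists are usually lowercase, so "The" would otherwise slip through.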

In [15]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [18]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /Users/mrose/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""

    # Remove the punctuation from text
    for punc in punctuation:
        text = text.replace(punc, ' ')

    # Create a tokenized list of the words
    # (word_tokenize requires NLTK's punkt tokenizer data to be downloaded)
    words = word_tokenize(text)

    # Lemmatize words into root words
    # (the lemmatizer requires NLTK's wordnet data to be downloaded)
    lem = [lemmatizer.lemmatize(word) for word in words]

    # Convert the words to lowercase
    lower = [word.lower() for word in lem]

    # Remove the stop words
    sw = set(stopwords.words('english'))
    tokens = [word for word in lower if word not in sw]

    return tokens

In [22]:
# Create a new tokens column for Bitcoin
btc_sent_df['Tokens'] = btc_sent_df['text'].apply(tokenizer)

In [23]:
# Create a new tokens column for Ethereum
eth_sent_df['Tokens'] = eth_sent_df['text'].apply(tokenizer)

---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequency for each coin.

1. Use NLTK to produce the n-grams for N = 2.
2. List the top 10 words for each coin.
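
Both steps above reduce to counting: bigrams are adjacent token pairs, and the top words are the most frequent tokens. The following is a minimal sketch using only the standard library; `bigrams` is a hypothetical helper that pairs tokens with `zip` (the same pairs `nltk.ngrams(tokens, 2)` yields), and the token list is made up for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical token list for illustration
tokens = ["bitcoin", "price", "rise", "bitcoin", "price", "fall"]

bigram_counts = Counter(bigrams(tokens))
top_words = Counter(tokens).most_common(2)

print(bigram_counts.most_common(1))  # [(('bitcoin', 'price'), 2)]
print(top_words)                     # [('bitcoin', 2), ('price', 2)]
```

`Counter.most_common(N)` returns ties in first-seen order, so the ranking among equally frequent words depends on where they first appear in the text.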

In [24]:
from collections import Counter
from nltk import ngrams

In [27]:
# Generate the Bitcoin N-grams where N=2
def process_text(doc):
    """Clean, tokenize, lemmatize, lowercase, and remove stop words."""
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', doc)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return output

# Join every Bitcoin article into one string before processing
btc_processed = process_text(' '.join(btc_sent_df['text']))

btc_bigram = Counter(ngrams(btc_processed, n=2))
print(dict(btc_bigram))

In [17]:
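# Generate the Ethereum N-grams where N=2
# Join every Ethereum article into one string before processing
eth_processed = process_text(' '.join(eth_sent_df['text']))

eth_bigram = Counter(ngrams(eth_processed, n=2))
print(dict(eth_bigram))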

In [18]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [19]:
# Use token_count to get the top 10 words for Bitcoin
token_count(process_text(' '.join(btc_sent_df['text'])), N=10)

In [20]:
# Use token_count to get the top 10 words for Ethereum
token_count(process_text(' '.join(eth_sent_df['text'])), N=10)

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [28]:
# Requires the wordcloud package (pip install wordcloud)
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [29]:
# Generate the Bitcoin word cloud from all of the Bitcoin article text
btc_wc = WordCloud().generate(' '.join(btc_sent_df['text']))
plt.imshow(btc_wc)

In [30]:
# Generate the Ethereum word cloud from all of the Ethereum article text
eth_wc = WordCloud().generate(' '.join(eth_sent_df['text']))
plt.imshow(eth_wc)

---
## 3. Named Entity Recognition

In this section, you will run named entity recognition on the Bitcoin and Ethereum text, then visualize the tags using spaCy.
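
Once a document has been tagged, each entity exposes its text and its label, so a flat entity listing can be summarized by grouping on the label. The following is a minimal sketch of that grouping using only the standard library: `group_entities` is a hypothetical helper, and `sample_entities` is made-up data shaped like `(ent.text, ent.label_)` pairs rather than real model output.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label) pairs by label, dropping repeated texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Hypothetical (text, label) pairs shaped like (ent.text, ent.label_)
sample_entities = [
    ("Reuters", "ORG"),
    ("Ukraine", "GPE"),
    ("Russia", "GPE"),
    ("Reuters", "ORG"),
]
grouped = group_entities(sample_entities)
print(grouped)  # {'ORG': ['Reuters'], 'GPE': ['Ukraine', 'Russia']}
```

With a real spaCy document, the input could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.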

In [1]:
import spacy
from spacy import displacy

In [2]:
# Download the language model for spaCy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 4.3 MB/s eta 0:00:01
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together
btc_text = ' '.join(btc_sent_df['text'])

In [4]:
# Run the NER processor on all of the text
btc_doc = nlp(btc_text)

# Add a title to the document
btc_doc.user_data['title'] = 'Bitcoin NER'

In [5]:
# Render the visualization
displacy.render(btc_doc, style='ent')

In [6]:
# List all Entities
for ent in btc_doc.ents:
    print(ent.text, ent.label_)

---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together
eth_text = ' '.join(eth_sent_df['text'])

In [7]:
# Run the NER processor on all of the text
eth_doc = nlp(eth_text)

# Add a title to the document
eth_doc.user_data['title'] = 'Ethereum NER'

In [8]:
# Render the visualization
displacy.render(eth_doc, style='ent')

In [9]:
# List all Entities
for ent in eth_doc.ents:
    print(ent.text, ent.label_)

---