# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [1]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk as nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
from newsapi import NewsApiClient
load_dotenv("jfk.env")

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\johnf\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Read your api key environment variable
api_key = os.getenv("NEWS_API_KEY")

In [3]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=api_key)

In [4]:
# Fetch the Bitcoin news articles
btc_headlines = newsapi.get_everything(
    q='Bitcoin',
    language='en',
    page_size=100,
    sort_by='relevancy'
)

btc_headlines['totalResults']

10381

In [5]:
# Fetch the Ethereum news articles
eth_headlines = newsapi.get_everything(
    q='Ethereum',
    language='en',
    page_size=100,
    sort_by='relevancy'
)

eth_headlines['totalResults']

3999

In [6]:
# Create the Bitcoin sentiment scores DataFrame
btc_sentiments = []

In [7]:
for article in btc_headlines['articles']:
    try:
        text = article['content']
        date = article['publishedAt'][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment['compound']
        pos = sentiment['pos']
        neu = sentiment['neu']
        neg = sentiment['neg']
        
        btc_sentiments.append({
        'text': text,
        'date': date,
        'compound': compound,
        'positive' : pos,
        'neutral' : neu,
        'negative': neg
        }
        )
    except AttributeError:
        pass
btc_df = pd.DataFrame(btc_sentiments)

In [8]:
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
btc_df = btc_df[cols]

btc_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2021-04-27,Tesla’s relationship with bitcoin is not a dal...,0.0,0.0,0.0,1.0
1,2021-04-20,Cryptocurrency continues to gain mainstream ac...,0.7506,0.171,0.0,0.829
2,2021-04-23,Cryptocurrency prices continued to tumble Frid...,0.0,0.0,0.0,1.0
3,2021-04-13,The crypto industry as a whole has seen a mome...,0.6124,0.135,0.0,0.865
4,2021-04-27,image copyrightGetty Images\r\nimage captionEl...,0.7003,0.167,0.0,0.833


In [9]:
# Create the Ethereum sentiment scores DataFrame
eth_sentiments = []

In [10]:
for article in eth_headlines['articles']:
    try:
        text = article['content']
        date = article['publishedAt'][:10]
        sentiment = analyzer.polarity_scores(text)
        compound = sentiment['compound']
        pos = sentiment['pos']
        neu = sentiment['neu']
        neg = sentiment['neg']
        
        eth_sentiments.append({
        'text': text,
        'date': date,
        'compound': compound,
        'positive' : pos,
        'neutral' : neu,
        'negative': neg
        }
        )
    except AttributeError:
        pass
eth_df = pd.DataFrame(eth_sentiments)

In [11]:
cols = ["date", "text", "compound", "positive", "negative", "neutral"]
eth_df = eth_df[cols]

eth_df.head()

Unnamed: 0,date,text,compound,positive,negative,neutral
0,2021-05-04,Their investors call them disruptive innovator...,-0.4019,0.072,0.15,0.778
1,2021-04-20,Cryptocurrency continues to gain mainstream ac...,0.7506,0.171,0.0,0.829
2,2021-04-20,Venmo is jumping aboard the cryptocurrency ban...,0.0258,0.034,0.0,0.966
3,2021-05-01,New York (CNN Business)Bitcoin prices continue...,0.0,0.0,0.0,1.0
4,2021-05-03,"The creators behind CryptoPunks, one of the mo...",0.4754,0.091,0.0,0.909


In [12]:
# Describe the Bitcoin Sentiment
btc_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,100.0,100.0,100.0,100.0
mean,0.10841,0.05893,0.02922,0.91186
std,0.348701,0.067328,0.046492,0.083839
min,-0.6808,0.0,0.0,0.662
25%,0.0,0.0,0.0,0.8435
50%,0.0,0.048,0.0,0.922
75%,0.34,0.10075,0.06125,1.0
max,0.8176,0.269,0.219,1.0


In [13]:
# Describe the Ethereum Sentiment
eth_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,100.0,100.0,100.0,100.0
mean,0.105502,0.04986,0.0233,0.92685
std,0.341558,0.066301,0.045337,0.079832
min,-0.9186,0.0,0.0,0.694
25%,0.0,0.0,0.0,0.87425
50%,0.0,0.0,0.0,0.9435
75%,0.32365,0.07975,0.0405,1.0
max,0.8271,0.256,0.289,1.0


### Questions:

Q: Which coin had the highest mean positive score?

A: Bitcoin has the highest mean positive score of 0.589.

Q: Which coin had the highest compound score?

A: Ethereum has the highest compound score of 0.829.

Q. Which coin had the highest positive score?

A: Bitcoin has the highest positive score of 0.269.

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.

In [14]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [15]:
# Instantiate the lemmatizer
wnl = WordNetLemmatizer()

# Create a list of stopwords
print(stopwords.words('english'))

# Expand the default stopwords list if necessary


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [16]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)
    words = word_tokenize(text)
    words = list(filter(lambda w: w.lower(), words))
    words = list(filter(lambda t: t not in punctuation, words))
    words = list(filter(lambda t: t.lower() not in text, words))
    lemmatizer = WordNetLemmatizer()
    tokens = [wnl.lemmatize(word) for word in words]
    print(words)
    return tokens
    
    # Remove the punctuation from text

   
    # Create a tokenized list of the words
    
    
    # Lemmatize words into root words

   
    # Convert the words to lowercase
    
    
    # Remove the stop words
    
    


In [17]:
# Create a new tokens column for Bitcoin
btc_df['tokens']=btc_df.text.apply(tokenizer)

['Tesla', 'CFO', 'Zach', 'Kirkhorn', 'Monday', 'Instead']
['Cryptocurrency', 'PayPal', 'Bitcoin', 'Ethereum', 'Bitcoin', 'Cash', 'Litecoin', 'Venmo', 'With']
['Cryptocurrency', 'Friday', 'Bitcoin', 'March', 'Bitcoin']
[]
['copyrightGetty', 'Images', 'captionElon', 'Musk', 'Tesla', 'Tesla', 'Bitcoin']
['Earth', 'Day', 'So', 'Jack', 'Dorsey']
['Venmo', 'Tuesday', 'Venmo', 'Four']
['New', 'York', 'CNN', 'Business', 'Bitcoin', 'Saturday', 'But']
['Its', 'Robinhood', 'Christine', 'Brown', 'Robinhoods']
['By', 'Reuters', 'Staff', 'Reuters', '-MicroStrategy', 'Inc', 'Monday', 'Shares']
['Crusoe']
['We', 'Dogecoin', 'DOGE']
['By', 'Reuters', 'Staff', 'April', 'Reuters', 'MicroStrategy', 'Inc', 'Monday', 'R…']
['Jack', 'Dorsey', 'CEO', 'Twitter', 'Wednesday', '``', "''", 'Elon', 'Musk', '``', 'True', '``', 'BBC', '``']
['Coinswitch', 'Kuber', 'India', 'Thursday', 'Indi…']
['By', 'Reuters', 'Staff', 'FILE', 'PHOTO', 'Tesla', 'Santa', 'Clarita', 'California', 'U.S.', 'October', 'REUTERS/Mike', 'B

In [18]:
btc_df

Unnamed: 0,date,text,compound,positive,negative,neutral,tokens
0,2021-04-27,Tesla’s relationship with bitcoin is not a dal...,0.0000,0.000,0.000,1.000,"[Tesla, CFO, Zach, Kirkhorn, Monday, Instead]"
1,2021-04-20,Cryptocurrency continues to gain mainstream ac...,0.7506,0.171,0.000,0.829,"[Cryptocurrency, PayPal, Bitcoin, Ethereum, Bi..."
2,2021-04-23,Cryptocurrency prices continued to tumble Frid...,0.0000,0.000,0.000,1.000,"[Cryptocurrency, Friday, Bitcoin, March, Bitcoin]"
3,2021-04-13,The crypto industry as a whole has seen a mome...,0.6124,0.135,0.000,0.865,[]
4,2021-04-27,image copyrightGetty Images\r\nimage captionEl...,0.7003,0.167,0.000,0.833,"[copyrightGetty, Images, captionElon, Musk, Te..."
...,...,...,...,...,...,...,...
95,2021-04-27,"(Reuters) - Evolving rules, environmental conc...",-0.4767,0.052,0.148,0.800,"[Reuters, Evolving]"
96,2021-04-12,The bank will not facilitate the buying or exc...,0.0000,0.000,0.000,1.000,"[HSBC, InvestDirect, MicroStrategy]"
97,2021-04-25,"Bitcoin ""fell dramatically in late April,"" wri...",-0.2732,0.000,0.058,0.942,"[Bitcoin, ``, April, '', The, Street, ``, '', ..."
98,2021-04-28,Mastercard and BNY Mellon warmed to bitcoin on...,0.2960,0.148,0.083,0.769,"[Mastercard, BNY, Mellon, Thursday, XRP, Yurik..."


In [19]:
# Create a new tokens column for Ethereum
eth_df['tokens']=eth_df.text.apply(tokenizer)

['Their', 'Detractors', 'North', 'Carolina', 'Attorney', 'General', 'Josh', 'Stein', 'But', 'Leda', 'Health', 'Madison', 'Campbell', 'Liesel', 'Vaidya']
['Cryptocurrency', 'PayPal', 'Bitcoin', 'Ethereum', 'Bitcoin', 'Cash', 'Litecoin', 'Venmo', 'With']
['Venmo', 'Tuesday', 'Venmo', 'Four']
['New', 'York', 'CNN', 'Business', 'Bitcoin', 'Saturday', 'But']
['CryptoPunks', 'NFT', 'Meebits', '3D']
['ConsenSys', 'Ethereum', 'J.P.', 'Morgan', 'Mastercard', 'UBS', 'AG']
['By', 'Reuters', 'Staff', 'FILE', 'PHOTO', 'Ethereum', 'February', 'REUTERS/Dado', 'Ruvic/Illustration', 'SINGAPORE…']
['Ethereum', 'February', 'REUTERS/Dado', 'Ruvic/IllustrationCryptocurrency']
['Its', 'Robinhood', 'Christine', 'Brown', 'Robinhoods']
['This', 'StockMarketUS', 'Edge', 'Monday', 'Morning', 'U.S.', 'Monday']
['Spanish', 'AI', 'Errors', 'Cryptocurrencies', 'Vitalik', 'Buterin']
['Non-fungible', 'NFTs', 'Topps', 'NFT-based']
['By', 'Reuters', 'Staff', 'SINGAPORE', 'May', 'Reuters', 'Cryptocurrency', 'Monday', 'Eu

In [20]:
eth_df

Unnamed: 0,date,text,compound,positive,negative,neutral,tokens
0,2021-05-04,Their investors call them disruptive innovator...,-0.4019,0.072,0.150,0.778,"[Their, Detractors, North, Carolina, Attorney,..."
1,2021-04-20,Cryptocurrency continues to gain mainstream ac...,0.7506,0.171,0.000,0.829,"[Cryptocurrency, PayPal, Bitcoin, Ethereum, Bi..."
2,2021-04-20,Venmo is jumping aboard the cryptocurrency ban...,0.0258,0.034,0.000,0.966,"[Venmo, Tuesday, Venmo, Four]"
3,2021-05-01,New York (CNN Business)Bitcoin prices continue...,0.0000,0.000,0.000,1.000,"[New, York, CNN, Business, Bitcoin, Saturday, ..."
4,2021-05-03,"The creators behind CryptoPunks, one of the mo...",0.4754,0.091,0.000,0.909,"[CryptoPunks, NFT, Meebits, 3D]"
...,...,...,...,...,...,...,...
95,2021-04-19,Yuriko Nakao/Getty Images\r\nWall Street Bets ...,0.6369,0.157,0.000,0.843,"[Yuriko, Nakao/Getty, Images, Wall, Street, Be..."
96,2021-04-20,In this photo illustration a Venmo mobile paym...,0.0000,0.000,0.000,1.000,"[In, Venmo, Igor, Golovniov/SOPA, Images/Light..."
97,2021-04-30,U.S. one hundred dollar notes are seen in this...,0.2263,0.058,0.000,0.942,"[U.S., Seoul, February, REUTERS/Lee, Jae-WonTh..."
98,2021-04-30,The U.S. dollar skidded toward a fourth straig...,0.4404,0.129,0.053,0.818,"[U.S., Friday, Federal, Reserve]"


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 

In [21]:
from collections import Counter
from nltk import ngrams

In [25]:
N = 2
btc_corpus = btc_df.text.str.cat()
btc_bigram = ngrams(tokenizer(btc_corpus), N)
btc_bigram_counts(btc_bigram).most_common(20)

['Tesla', 'CFO', 'Zach', 'Kirkhorn', 'Monday', 'PayPal', 'Litecoin', 'Venmo', 'Friday', 'copyrightGetty', 'Images', 'captionElon', 'Musk', 'Tesla', 'Tesla', 'Earth', 'Jack', 'Dorsey', 'Venmo', 'Tuesday', 'Venmo', 'Four', 'York', 'CNN', 'Business', 'Saturday', 'Robinhood', 'Christine', 'Brown', 'Robinhoods', 'Reuters', 'Staff', 'Reuters', '-MicroStrategy', 'Monday', 'Crusoe', 'Dogecoin', 'DOGE', 'Reuters', 'Staff', 'April', 'Reuters', 'MicroStrategy', 'Monday', 'Jack', 'Dorsey', 'CEO', 'Twitter', 'Wednesday', '``', "''", 'Elon', 'Musk', '``', 'True', '``', 'BBC', '``', 'Coinswitch', 'Kuber', 'India', 'Thursday', 'Indi…', 'Reuters', 'Staff', 'FILE', 'PHOTO', 'Tesla', 'Santa', 'Clarita', 'California', 'U.S.', 'October', 'REUTERS/Mike', 'Blake', 'Reuters', 'Tesla', 'Reuters', 'Staff', 'April', 'Reuters', 'Sunday', 'Reuters', 'Staff', 'LONDON', 'April', 'Reuters', 'Tuesday', 'Reuters', 'Staff', 'FILE', 'PHOTO', 'REUTERS/Dado', 'Ruvic/Illustration/File', 'Photo', 'LON…', 'Reuters', 'Staff', 

NameError: name 'btc_bigram_counts' is not defined

In [30]:
# Generate the Bitcoin N-grams where N=2
btc_bigram_counts = Counter(ngrams(btc_df.text.str.cat(), n=2))
print(dict(btc_bigram_counts))

{('T', 'e'): 13, ('e', 's'): 198, ('s', 'l'): 19, ('l', 'a'): 62, ('a', '’'): 1, ('’', 's'): 3, ('s', ' '): 364, (' ', 'r'): 78, ('r', 'e'): 276, ('e', 'l'): 36, ('a', 't'): 132, ('t', 'i'): 118, ('i', 'o'): 92, ('o', 'n'): 219, ('n', 's'): 40, ('s', 'h'): 36, ('h', 'i'): 74, ('i', 'p'): 4, ('p', ' '): 17, (' ', 'w'): 91, ('w', 'i'): 35, ('i', 't'): 208, ('t', 'h'): 248, ('h', ' '): 79, (' ', 'b'): 119, ('b', 'i'): 49, ('t', 'c'): 90, ('c', 'o'): 157, ('o', 'i'): 114, ('i', 'n'): 390, ('n', ' '): 382, (' ', 'i'): 195, ('i', 's'): 102, (' ', 'n'): 30, ('n', 'o'): 28, ('o', 't'): 28, ('t', ' '): 210, (' ', 'a'): 266, ('a', ' '): 106, (' ', 'd'): 66, ('d', 'a'): 59, ('a', 'l'): 90, ('l', 'l'): 80, ('l', 'i'): 57, ('i', 'a'): 21, ('a', 'n'): 163, ('n', 'c'): 111, ('c', 'e'): 69, ('e', ','): 10, (',', ' '): 160, ('a', 'c'): 43, ('c', 'c'): 12, ('o', 'r'): 135, ('r', 'd'): 27, ('d', 'i'): 57, ('n', 'g'): 120, ('g', ' '): 93, (' ', 't'): 314, ('t', 'o'): 165, ('o', ' '): 100, ('h', 'e'): 199,

In [29]:
# Generate the Ethereum N-grams where N=2
eth_bigram_counts = Counter(ngrams(eth_df['text'], n=2))
print(dict(eth_bigram_counts))

{('Their investors call them disruptive innovators. Detractors like North Carolina Attorney General Josh Stein call them dirty scammers. But Leda Health co-founders Madison Campbell and Liesel Vaidya th… [+8679 chars]', 'Cryptocurrency continues to gain mainstream acceptance, as PayPal announced that they have added Bitcoin, Ethereum, Bitcoin Cash, and Litecoin to its Venmo app. With a user base of 70 million, the mo… [+2782 chars]'): 1, ('Cryptocurrency continues to gain mainstream acceptance, as PayPal announced that they have added Bitcoin, Ethereum, Bitcoin Cash, and Litecoin to its Venmo app. With a user base of 70 million, the mo… [+2782 chars]', 'Venmo is jumping aboard the cryptocurrency bandwagon.\xa0\r\nThe social payment service announced on Tuesday that its customers can now buy, hold, and sell cryptocurrencies within the Venmo app.\xa0\r\nFour cr… [+1524 chars]'): 1, ('Venmo is jumping aboard the cryptocurrency bandwagon.\xa0\r\nThe social payment service announced on Tues

In [25]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(10), columns=['word', 'count'])

SyntaxError: invalid syntax (<ipython-input-25-0535ac1ae92d>, line 4)

In [None]:
# Define the counter function
def word_counter(corpus): 
    # Combine all articles in corpus into one large string
    big_string = ' '.join(corpus)
    processed = process_text(big_string)
    top_10 = dict(Counter(processed).most_common(10))
    return pd.DataFrame(list(top_10.items()), columns=['word', 'count'])

In [19]:
# Use token_count to get the top 10 words for Bitcoin
# YOUR CODE HERE!

In [20]:
# Use token_count to get the top 10 words for Ethereum
# YOUR CODE HERE!

---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin

In [21]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [22]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [23]:
# Generate the Ethereum word cloud
# YOUR CODE HERE!

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using SpaCy.

In [24]:
import spacy
from spacy import displacy

In [25]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm

In [26]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together
# YOUR CODE HERE!

In [28]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [29]:
# Render the visualization
# YOUR CODE HERE!

In [30]:
# List all Entities
# YOUR CODE HERE!

---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together
# YOUR CODE HERE!

In [32]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [33]:
# Render the visualization
# YOUR CODE HERE!

In [34]:
# List all Entities
# YOUR CODE HERE!

---