# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [4]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\12152\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [5]:
# Read your api key environment variable

load_dotenv("classkeys.env")
api_key = os.getenv("news_api")

In [6]:
# Check that the key loaded, without printing the secret itself
assert api_key, "news_api was not found in classkeys.env"


In [7]:
# Create a newsapi client

from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=api_key)

In [8]:
# Fetch the Bitcoin news articles

bitcoin_articles = newsapi.get_everything(
    q="bitcoin",
    language="en",
    page_size=100,
    sort_by="relevancy"
)

In [9]:
# Fetch the Ethereum news articles

ethereum_articles = newsapi.get_everything(
    q="ethereum",
    language="en",
    page_size=100,
    sort_by="relevancy"
)

In [10]:
# Create the Bitcoin sentiment scores DataFrame
bitcoin_sentiments = []

for article in bitcoin_articles["articles"]:
    try:
        text = article["content"]
        sentiment = analyzer.polarity_scores(text)

        bitcoin_sentiments.append({
            "Text": text,
            "Compound": sentiment["compound"],
            "Positive": sentiment["pos"],
            "Negative": sentiment["neg"],
            "Neutral": sentiment["neu"],
        })
    except AttributeError:
        # Skip articles whose content is missing (None) and cannot be scored
        pass

# Create DataFrame
btc_df = pd.DataFrame(bitcoin_sentiments)

# Reorder DataFrame columns
cols = ["Compound", "Negative", "Neutral", "Positive", "Text"]
btc_df = btc_df[cols]

btc_df.head()

| | Compound | Negative | Neutral | Positive | Text |
|---|---|---|---|---|---|
| 0 | 0.7351 | 0.0 | 0.853 | 0.147 | Even in cyberspace, the Department of Justice ... |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | When Russia invaded Ukraine, Niki Proshin was ... |
| 2 | -0.7713 | 0.169 | 0.831 | 0.0 | "Bitcoin was seen by many of its libertarian-l... |
| 3 | -0.1779 | 0.067 | 0.887 | 0.046 | Feb 22 (Reuters) - Bitcoin miners are feeling ... |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | March 1 (Reuters) - Bitcoin has leapt since Ru... |


In [11]:
# Create the Ethereum sentiment scores DataFrame

ethereum_sentiments = []

for article in ethereum_articles["articles"]:
    try:
        text = article["content"]
        sentiment = analyzer.polarity_scores(text)

        ethereum_sentiments.append({
            "Text": text,
            "Compound": sentiment["compound"],
            "Positive": sentiment["pos"],
            "Negative": sentiment["neg"],
            "Neutral": sentiment["neu"],
        })
    except AttributeError:
        # Skip articles whose content is missing (None) and cannot be scored
        pass

# Create DataFrame
eth_df = pd.DataFrame(ethereum_sentiments)

# Reorder DataFrame columns
cols = ["Compound", "Negative", "Neutral", "Positive", "Text"]
eth_df = eth_df[cols]

eth_df.head()


| | Compound | Negative | Neutral | Positive | Text |
|---|---|---|---|---|---|
| 0 | -0.3182 | 0.093 | 0.848 | 0.059 | In February, shit hit the fan in the usual way... |
| 1 | 0.6705 | 0.0 | 0.812 | 0.188 | Coinbase reported that the share of trading vo... |
| 2 | -0.4588 | 0.083 | 0.917 | 0.0 | Illustration by James Bareham / The Verge\r\n\... |
| 3 | 0.834 | 0.05 | 0.713 | 0.236 | If it sounds too good to be true, youre not wr... |
| 4 | -0.1326 | 0.044 | 0.956 | 0.0 | It seems that in 2022, you cant escape from th... |
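The Bitcoin and Ethereum cells repeat the same scoring loop. One possible way to share it is a small helper like the sketch below. The name `build_sentiment_rows` and the stand-in scorer `fake_score` are hypothetical and not part of the notebook; the stand-in exists only so the sketch runs without NLTK.

```python
from typing import Callable, Dict, Iterable, List, Optional


def build_sentiment_rows(
    articles: Iterable[Dict[str, Optional[str]]],
    score: Callable[[str], Dict[str, float]],
) -> List[Dict[str, object]]:
    """Score each article's text and return one row per usable article."""
    rows = []
    for article in articles:
        text = article.get("content")
        if not text:  # skip articles with missing or empty content
            continue
        s = score(text)
        rows.append({
            "Compound": s["compound"],
            "Negative": s["neg"],
            "Neutral": s["neu"],
            "Positive": s["pos"],
            "Text": text,
        })
    return rows


# Hypothetical stand-in scorer that labels everything neutral.
# In the notebook you would pass analyzer.polarity_scores instead.
def fake_score(text: str) -> Dict[str, float]:
    return {"compound": 0.0, "neg": 0.0, "neu": 1.0, "pos": 0.0}


sample = [{"content": "Bitcoin rallied."}, {"content": None}]
rows = build_sentiment_rows(sample, fake_score)
print(len(rows))  # the article without content is skipped
```

In the notebook, the equivalent call would be something like `pd.DataFrame(build_sentiment_rows(bitcoin_articles["articles"], analyzer.polarity_scores))`. Because the keys are inserted in display order, the separate column-reordering step should no longer be needed.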


In [12]:
# Describe the Bitcoin Sentiment

btc_df.describe()

| | Compound | Negative | Neutral | Positive |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.069736 | 0.04744 | 0.8826 | 0.06996 |
| std | 0.430337 | 0.060644 | 0.083611 | 0.069317 |
| min | -0.8957 | 0.0 | 0.694 | 0.0 |
| 25% | -0.25 | 0.0 | 0.8375 | 0.0 |
| 50% | 0.0 | 0.0 | 0.8915 | 0.065 |
| 75% | 0.4019 | 0.083 | 0.94075 | 0.0965 |
| max | 0.91 | 0.265 | 1.0 | 0.301 |


In [13]:
# Describe the Ethereum Sentiment

eth_df.describe()

| | Compound | Negative | Neutral | Positive |
|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 0.141145 | 0.04127 | 0.88297 | 0.07573 |
| std | 0.42946 | 0.060426 | 0.083793 | 0.069882 |
| min | -0.9136 | 0.0 | 0.688 | 0.0 |
| 25% | 0.0 | 0.0 | 0.83525 | 0.0 |
| 50% | 0.1531 | 0.0 | 0.887 | 0.069 |
| 75% | 0.5052 | 0.06625 | 0.943 | 0.11175 |
| max | 0.8625 | 0.312 | 1.0 | 0.29 |


## Questions:
Q: Which coin had the highest mean positive score?

A: Ethereum had the highest mean positive score (0.07573, versus 0.06996 for Bitcoin).

Q: Which coin had the highest compound score?

A: Bitcoin had the highest compound score (a maximum of 0.91, versus 0.8625 for Ethereum).

Q. Which coin had the highest positive score?

A: Bitcoin had the highest positive score (a maximum of 0.301, versus 0.29 for Ethereum).
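The three questions can be checked mechanically against the `describe()` summaries. The sketch below hard-codes the relevant values from the two summary tables; the `summary` dictionary layout is an illustrative assumption, not something the notebook defines.

```python
# Values copied from the btc_df.describe() and eth_df.describe() outputs
summary = {
    "mean positive": {"BTC": 0.06996, "ETH": 0.07573},
    "max compound": {"BTC": 0.91, "ETH": 0.8625},
    "max positive": {"BTC": 0.301, "ETH": 0.29},
}

# For each metric, pick the coin with the larger value
winners = {
    metric: max(scores, key=scores.get)
    for metric, scores in summary.items()
}

for metric, coin in winners.items():
    print(f"{metric}: {coin} ({summary[metric][coin]})")
```

With the live DataFrames, the same comparison could be made directly on `btc_df.describe()` and `eth_df.describe()` rather than on copied numbers.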

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.
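The three steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's tokenizer: `simple_tokens` and the tiny `STOPWORDS` set are hypothetical, and the sketch splits on whitespace instead of using NLTK's `word_tokenize` and skips lemmatization.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's English list instead
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}


def simple_tokens(text):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    lowered = text.lower()                          # 1. lowercase
    letters_only = re.sub(r"[^a-z ]", "", lowered)  # 2. remove punctuation
    return [w for w in letters_only.split() if w not in STOPWORDS]  # 3. stopwords


tokens = simple_tokens("The price of Bitcoin is rising, and fast!")
print(tokens)  # ['price', 'bitcoin', 'rising', 'fast']
```

Lowercasing first means the stopword comparison only ever sees lowercase words, so capitalized entries in a stopword list would never match.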

In [15]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [16]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
sw = set(stopwords.words('english'))

# Add stopwords to the default list
sw_addon = {
    "Reuters",
    "reuters",
    "Getty Images",
    "Getty",
    "AP",
    "Dec",
    "Nov",
    "char",
    "ha",
    "day",
    "dec",
    "wa",
    "charswhat"
}
print(sw.union(sw_addon))

{'is', 'under', 'these', 'there', 'i', 'such', 'while', 'off', 'any', 'once', 'it', 'ma', 'more', "wouldn't", 'about', 'dec', "hadn't", 'its', 'being', 'with', 'needn', "shan't", 'your', 'ourselves', 'down', 'and', 'reuters', 'should', 'Getty Images', "you'd", 'doing', 'just', 'ain', 've', "hasn't", 'here', "weren't", 'against', "you'll", 'am', "couldn't", 'hadn', 'very', 'he', 'into', 'above', 'because', 'my', 'you', "shouldn't", 'now', 'wouldn', 'had', 'wa', 'has', 'aren', 'myself', 'who', 'theirs', "didn't", 'are', 'll', "aren't", 'both', 'hasn', 'as', 'couldn', 'mightn', 'the', 'yourselves', 'me', 'whom', 'in', 'when', 'further', 'why', "should've", 'own', 're', 'weren', 'does', 'Dec', 'a', 'itself', 'nor', 'haven', 'shouldn', 'at', 'which', 'if', 'an', 'd', "it's", 'won', "wasn't", 'having', 'himself', 'of', "mightn't", 'few', 'too', 'after', 'their', 'his', 'charswhat', 'yours', 'hers', "needn't", "she's", 'we', 's', 'our', "mustn't", 'o', 'will', 'up', 'on', 'where', "haven't", 

In [17]:
# Create the tokenizer function; it returns a list of cleaned tokens
def tokenizer(text):
    """Tokenizes text."""

    # Build the full stopword set once per call instead of once per word
    stop_words = sw.union(sw_addon)

    # Remove everything except letters and spaces (punctuation and digits)
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    words = word_tokenize(re_clean)

    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in words]

    # Convert the words to lowercase
    lower = [word.lower() for word in lem]

    # Remove the stop words
    tokens = [word for word in lower if word not in stop_words]

    return tokens

In [18]:
# Create a new tokens column for Bitcoin

btc_df['tokens'] = btc_df['Text'].apply(tokenizer)
btc_df.head()

| | Compound | Negative | Neutral | Positive | Text | tokens |
|---|---|---|---|---|---|---|
| 0 | 0.7351 | 0.0 | 0.853 | 0.147 | Even in cyberspace, the Department of Justice ... | [even, cyberspace, department, justice, able, ... |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | When Russia invaded Ukraine, Niki Proshin was ... | [russia, invaded, ukraine, niki, proshin, alre... |
| 2 | -0.7713 | 0.169 | 0.831 | 0.0 | "Bitcoin was seen by many of its libertarian-l... | [bitcoin, seen, many, libertarianleaning, fan,... |
| 3 | -0.1779 | 0.067 | 0.887 | 0.046 | Feb 22 (Reuters) - Bitcoin miners are feeling ... | [feb, bitcoin, miner, feeling, heat, pain, rip... |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | March 1 (Reuters) - Bitcoin has leapt since Ru... | [march, bitcoin, leapt, since, russias, invasi... |


In [19]:
# Create a new tokens column for Ethereum

eth_df['tokens'] = eth_df['Text'].apply(tokenizer)
eth_df.head()

| | Compound | Negative | Neutral | Positive | Text | tokens |
|---|---|---|---|---|---|---|
| 0 | -0.3182 | 0.093 | 0.848 | 0.059 | In February, shit hit the fan in the usual way... | [february, shit, hit, fan, usual, way, old, tw... |
| 1 | 0.6705 | 0.0 | 0.812 | 0.188 | Coinbase reported that the share of trading vo... | [coinbase, reported, share, trading, volume, e... |
| 2 | -0.4588 | 0.083 | 0.917 | 0.0 | Illustration by James Bareham / The Verge\r\n\... | [illustration, james, bareham, verge, million,... |
| 3 | 0.834 | 0.05 | 0.713 | 0.236 | If it sounds too good to be true, youre not wr... | [sound, good, true, youre, wrong, yield, farmi... |
| 4 | -0.1326 | 0.044 | 0.956 | 0.0 | It seems that in 2022, you cant escape from th... | [seems, cant, escape, metaversefrom, facebook,... |


---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequencies for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
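For N = 2, an n-gram is just each token paired with its successor, so the counting can be sketched without NLTK. The helper name `bigram_counts` and the short token list are made up for illustration; in the notebook the tokens would come from the tokenizer.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a list of tokens."""
    # zip pairs each token with the one that follows it
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical token list, not taken from the fetched articles
tokens = ["russia", "invaded", "ukraine", "since", "russia", "invaded", "crimea"]
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [(('russia', 'invaded'), 2)]
```

`nltk.ngrams(tokens, n=2)` yields the same pairs; the NLTK function is mainly more convenient when N varies.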

In [20]:
from collections import Counter
from nltk import ngrams

In [21]:
# Generate the Bitcoin N-grams where N=2

btc_string = ' '.join(btc_df.Text)
btc_processed = tokenizer(btc_string)
btc_bigrams = ngrams(btc_processed, n=2)
btc_top_10 = dict(Counter(btc_bigrams).most_common(10))
pd.DataFrame(list(btc_top_10.items()), columns=['bigram', 'count'])

| | bigram | count |
|---|---|---|
| 0 | (new, york) | 5 |
| 1 | (invasion, ukraine) | 4 |
| 2 | (since, russia) | 4 |
| 3 | (london, feb) | 4 |
| 4 | (russia, invaded) | 3 |
| 5 | (invaded, ukraine) | 3 |
| 6 | (march, bitcoin) | 3 |
| 7 | (russias, invasion) | 3 |
| 8 | (joe, biden) | 3 |
| 9 | (central, american) | 3 |


In [22]:
# Generate the Ethereum N-grams where N=2

eth_string = ' '.join(eth_df.Text)
eth_processed = tokenizer(eth_string)
eth_bigrams = ngrams(eth_processed, n=2)
eth_top_10 = dict(Counter(eth_bigrams).most_common(10))
pd.DataFrame(list(eth_top_10.items()), columns=['bigram', 'count'])

| | bigram | count |
|---|---|---|
| 0 | (venture, capital) | 5 |
| 1 | (hit, billion) | 4 |
| 2 | (nonfungible, token) | 4 |
| 3 | (cryptocurrency, boom) | 3 |
| 4 | (boom, past) | 3 |
| 5 | (past, year) | 3 |
| 6 | (year, helped) | 3 |
| 7 | (helped, propel) | 3 |
| 8 | (propel, newer) | 3 |
| 9 | (newer, market) | 3 |


In [23]:
# Function token_count returns the N most common tokens for a given coin
# (the default N=3 applies to the calls in this notebook; pass N=10 for the top 10)
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [24]:
# Use token_count to get the most common words for Ethereum (top 3 with the default N)

pd.DataFrame(token_count(eth_processed), columns=['word', 'count'])

| | word | count |
|---|---|---|
| 0 | cryptocurrency | 16 |
| 1 | cryptocurrencies | 16 |
| 2 | ethereum | 15 |


In [25]:
# Use token_count to get the most common words for Bitcoin (top 3 with the default N)

pd.DataFrame(token_count(btc_processed), columns=['word', 'count'])

| | word | count |
|---|---|---|
| 0 | bitcoin | 35 |
| 1 | cryptocurrency | 19 |
| 2 | ukraine | 18 |


---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [30]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Note: Matplotlib 3.6+ renames this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [None]:
# Generate the Bitcoin word cloud

wc = WordCloud().generate(btc_string)
plt.title('Bitcoin Word Cloud', fontsize=25, fontweight='bold')
plt.imshow(wc)


In [None]:
# Generate the Ethereum word cloud

wc = WordCloud().generate(eth_string)
plt.title('Ethereum Word Cloud', fontsize=25, fontweight='bold')
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will run named entity recognition on the Bitcoin and Ethereum text, then visualize the tags using spaCy.

In [1]:
import spacy
from spacy import displacy


In [2]:
# Download the language model for spaCy
# !python -m spacy download en_core_web_sm

In [26]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together

bitcoin_text = btc_df['Text'].str.cat()
bitcoin_text

'Even in cyberspace, the Department of Justice is able to use a tried and true investigative technique, following the money, Ms. Monaco said. Its what led us to Al Capone in the 30s. It helped us dest… [+1176 chars]When Russia invaded Ukraine, Niki Proshin was already a year into making a living as a vlogger — he had a YouTube channel, a TikTok channel, and an Instagram. He also ran an online Russian club for a… [+5883 chars]"Bitcoin was seen by many of its libertarian-leaning fans as a kind of doomsday insurance," argues a columnist in the New York Times, "a form of \'digital gold\' that would be a source of stability as … [+3914 chars]Feb 22 (Reuters) - Bitcoin miners are feeling the heat - and the pain\'s rippling downstream to pressure prices.\r\nThe cryptocurrency\'s spectacular rally in 2021 drew thousands of entrants into mining,… [+4196 chars]March 1 (Reuters) - Bitcoin has leapt since Russia\'s invasion of Ukraine, bolstered by people in those countries looking to store and mo
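The concatenated string shows two things that can confuse entity recognition: each article ends with a truncation marker such as `… [+1176 chars]`, and `str.cat()` joins articles with no separator, so the marker runs straight into the next article's first word. A minimal sketch of one way to clean this up before running NER follows. The names `TRUNCATION` and `join_articles` are hypothetical, and the pattern assumes every marker has the `[+N chars]` shape seen in the output.

```python
import re

# Matches a truncation marker such as "… [+1176 chars]" (the ellipsis is optional)
TRUNCATION = re.compile(r"…?\s*\[\+\d+ chars\]")


def join_articles(texts):
    """Strip truncation markers and join articles with a space between them."""
    cleaned = (TRUNCATION.sub("", t).strip() for t in texts)
    return " ".join(cleaned)


# Shortened excerpts of two article texts from the output
texts = [
    "It helped us dest… [+1176 chars]",
    "When Russia invaded Ukraine",
]
joined = join_articles(texts)
print(joined)  # It helped us dest When Russia invaded Ukraine
```

In the notebook this could be applied as `join_articles(btc_df['Text'])`. Note that the entity list the model returns would then differ from the one recorded in this notebook.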

In [28]:
# Run the NER processor on all of the text

btc_doc = nlp(bitcoin_text)

# Add a title to the document

btc_doc.user_data["title"] = "Bitcoin NER"

In [29]:
# Render the visualization

displacy.render(btc_doc, style='ent')

In [30]:
# List all Entities

[ent.text for ent in btc_doc.ents]

['the Department of Justice',
 'Monaco',
 'Al Capone',
 'Russia',
 'Ukraine',
 'Niki Proshin',
 'a year',
 'YouTube',
 'TikTok',
 'Instagram',
 'Russian',
 'the New York Times',
 '22',
 'Reuters',
 '2021',
 'thousands',
 'Reuters',
 'Russia',
 'Ukraine',
 'Satoshi Nakamoto',
 '2008',
 'Nonfungible Tidbits',
 'this week',
 'Russia',
 'Ukraine',
 'Ukrainians',
 'Russian',
 'YouTube',
 'Alex Castro',
 'Verge',
 'BitConnect',
 'Getty',
 'Russia',
 'last Thursday',
 'Ukranian',
 'Mexico City',
 'Telegr',
 'March 4',
 'Reuters',
 'Russia',
 'Ukraine',
 'Russia',
 '15',
 'Reuters',
 'U.S.',
 'Joe Biden',
 'Reuters',
 'BITCOIN',
 '+6882 chars]<ul><li>',
 'Summary</li><li>',
 'Law firms</li><li>\r\n',
 'documents</li></ul',
 'Ukrainian',
 'Tom Lee',
 'Ukraine',
 'Shark Tank',
 "Kevin O'Leary's",
 '100,000',
 '200,000',
 '300,000',
 'two-week',
 'Tuesday',
 'Russians',
 'Ukrainians',
 'March 11',
 'Reuters',
 'El Salvador',
 'between March 15 and 20',
 'Central American',
 'Feb 20',
 'Reuters',
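The raw entity list repeats many names, so a frequency count is often easier to read. The sketch below counts a short excerpt of the entity strings from the output; in the notebook the input would be `[ent.text for ent in btc_doc.ents]`. The variable names are illustrative.

```python
from collections import Counter

# Excerpt of entity strings taken from the Bitcoin entity list
entities = [
    "Russia", "Ukraine", "Reuters", "Russia",
    "Ukraine", "Satoshi Nakamoto", "Russia",
]

# Count how often each entity string appears and keep the two most frequent
top_entities = Counter(entities).most_common(2)
print(top_entities)  # [('Russia', 3), ('Ukraine', 2)]
```

Counting `(ent.text, ent.label_)` pairs instead of bare strings would keep the entity type alongside each name.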


---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together

ethereum_text = eth_df['Text'].str.cat()
ethereum_text

'In February, shit hit the fan in the usual way: An old tweet resurfaced. Brantly Millegan, director of operations at Ethereum Name Service (ENS), a web3 business, had written the following in May 201… [+3096 chars]Coinbase reported that the share of trading volume for ethereum and other altcoins increased last year, while bitcoin\'s share dropped dramatically.\xa0\r\nBetween 2020 and 2021, ethereum trading volume in… [+1187 chars]Illustration by James Bareham / The Verge\r\n\n \n\n\n More than $15 million has been donated so far More than $15 million in cryptocurrency has been donated to Ukrainian groups since Russia attacked the c… [+7442 chars]If it sounds too good to be true, youre not wrong. Yield farming is riskier than staking. The tokens that are offering such high interest rates and fee yields are also the ones most likely to take a … [+2371 chars]It seems that in 2022, you cant escape from the metaverse.\xa0From Facebook to Microsoft, seemingly every centralized tech firm is 

In [32]:
# Run the NER processor on all of the text

eth_doc = nlp(ethereum_text)
# Add a title to the document

eth_doc.user_data["title"] = "Ethereum NER"

In [33]:
# Render the visualization

displacy.render(eth_doc, style='ent')

In [34]:
# List all Entities

[ent.text for ent in eth_doc.ents]

['February',
 'Ethereum Name Service',
 'ENS',
 'May 201',
 'last year',
 'Between 2020 and 2021',
 'James Bareham',
 'More than $15 million',
 'More than $15 million',
 'Ukrainian',
 'Russia',
 '2022',
 'Facebook',
 'Microsoft',
 'two hours',
 'YouTube',
 'Waka Flacka Fla',
 'the past few years',
 'NFT',
 '$23 billion',
 'TIME',
 'weekly',
 'Biden',
 'Wednesday',
 'first',
 'the past few years',
 'NFT',
 '$23 billion',
 'the past few years',
 'NFT',
 '$23 billion',
 'Ethereum',
 '$450 million',
 'Series',
 'US',
 'over $7 billion',
 'Russia',
 'Ukraine',
 'days',
 'March 11',
 'Yuga Labs',
 'Meebits',
 'Larva Labs',
 'more than a dozen',
 'Ethereum',
 'American Express',
 'Tuesday',
 'Ukrainian',
 'more than $4 million',
 'Russia',
 'Elliptic',
 'Ukraine',
 'Russia',
 'Elliptic',
 'nearly $55 million',
 'Russia',
 'Ukraine',
 'Bloomberg',
 'Getty Images',
 'Bitcoin, Ethereum',
 'DogeCoin',
 'Binance',
 'June 28, 2021',
 'REUTERS',
 'Dado Ruvic/Illus',
 'only one',
 'NFT',
 'Russia',
 

---