# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?

In [1]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\miker\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
# Read your api key environment variable
load_dotenv()  # load the variables defined in the local .env file into the environment
news_api_key = os.getenv("news_api")
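
`os.getenv` returns `None` when the variable is missing, so a misnamed key only surfaces later, at the first API call. A minimal sketch of an early check follows. The helper name `require_env` is hypothetical and is not part of `dotenv` or `newsapi`.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value
```

After `load_dotenv()`, the key could then be read with `news_api_key = require_env("news_api")`.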

In [3]:
# Create a newsapi client
from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=news_api_key)

In [4]:
# Fetch the Bitcoin news articles
bitcoin_articles = newsapi.get_everything(q='bitcoin', language='en', sort_by='relevancy')
bitcoin_articles['totalResults']

11733

In [5]:
# Fetch the Ethereum news articles
ethereum_articles = newsapi.get_everything(q='ethereum', language='en', sort_by='relevancy')
ethereum_articles['totalResults']

3338

In [64]:
# Create the Bitcoin sentiment scores DataFrame
import numpy as np

# Keep the title alongside the content to compare headline and article sentiment
btc_df = pd.DataFrame(bitcoin_articles['articles'])[['title', 'content']]

# Pre-create the score columns
for col in ['title_sentiment', 'content_sentiment', 'compound', 'negative', 'neutral', 'positive']:
    btc_df[col] = np.nan

for idx, row in btc_df.iterrows():
    # Score each text once and reuse the result
    title_scores = analyzer.polarity_scores(row['title'])
    content_scores = analyzer.polarity_scores(row['content'])
    btc_df.loc[idx, 'title_sentiment'] = title_scores['compound']
    btc_df.loc[idx, 'content_sentiment'] = content_scores['compound']
    btc_df.loc[idx, 'compound'] = content_scores['compound']  # same value as content_sentiment
    btc_df.loc[idx, 'negative'] = content_scores['neg']
    btc_df.loc[idx, 'neutral'] = content_scores['neu']
    btc_df.loc[idx, 'positive'] = content_scores['pos']

btc_df.head(10)

| | title | content | title_sentiment | content_sentiment | compound | negative | neutral | positive |
|---|---|---|---|---|---|---|---|---|
| 0 | El Salvador becomes the first countr... | El Salvador's President Nayib Bukele... | 0.128 | 0.8402 | 0.8402 | 0.0 | 0.718 | 0.282 |
| 1 | El Salvador Becomes First Country to... | El Salvador has become the first cou... | 0.0 | 0.128 | 0.128 | 0.0 | 0.957 | 0.043 |
| 2 | Elon Musk says Tesla will resume Bit... | It's all about clean energy, it seem... | 0.0 | 0.6908 | 0.6908 | 0.0 | 0.831 | 0.169 |
| 3 | Bitcoin: El Salvador makes cryptocur... | image captionThe move means bitcoin ... | 0.128 | 0.2732 | 0.2732 | 0.0 | 0.94 | 0.06 |
| 4 | Miami's Bitcoin Conference May Be th... | Several crypto fans that descended o... | 0.5994 | 0.5574 | 0.5574 | 0.0 | 0.893 | 0.107 |
| 5 | In search of a new crypto deity | Hello friends, and welcome back to W... | 0.0 | 0.75 | 0.75 | 0.0 | 0.846 | 0.154 |
| 6 | PayPal Will Let Users Transfer Bitco... | In spite of the environmental and re... | 0.0 | -0.5267 | -0.5267 | 0.096 | 0.904 | 0.0 |
| 7 | Donald Trump calls Bitcoin 'a scam a... | By Mary-Ann RussonBusiness reporter,... | -0.5719 | 0.34 | 0.34 | 0.0 | 0.93 | 0.07 |
| 8 | Iran bans cryptocurrency mining for ... | The ban affects licensed and unlicen... | 0.0 | -0.5574 | -0.5574 | 0.107 | 0.893 | 0.0 |
| 9 | PayPal will soon let you exchange Bi... | After years of hesitation, PayPal co... | 0.0 | -0.2732 | -0.2732 | 0.062 | 0.938 | 0.0 |


In [7]:
analyzer.polarity_scores(row['content'])

{'neg': 0.075, 'neu': 0.841, 'pos': 0.084, 'compound': 0.0772}
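
The `compound` value is a single normalized score between -1 and 1, which makes it convenient for labelling an article as positive, neutral, or negative. A minimal sketch of that mapping follows. The ±0.05 cut-offs are a commonly used convention for VADER rather than anything set in this notebook, and the function name `label_compound` is hypothetical.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The scores dictionary shown in the cell output above
scores = {'neg': 0.075, 'neu': 0.841, 'pos': 0.084, 'compound': 0.0772}
print(label_compound(scores['compound']))  # positive
```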

In [8]:
btc_df.describe()

| | title_sentiment | content_sentiment | compound | negative | neutral | positive |
|---|---|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.02357 | 0.17992 | 0.17992 | 0.0269 | 0.89825 | 0.0748 |
| std | 0.340676 | 0.383615 | 0.383615 | 0.039178 | 0.06832 | 0.069831 |
| min | -0.6868 | -0.5574 | -0.5574 | 0.0 | 0.718 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.84475 | 0.03225 |
| 50% | 0.0 | 0.16515 | 0.16515 | 0.0 | 0.9105 | 0.0655 |
| 75% | 0.128 | 0.40105 | 0.40105 | 0.059 | 0.94375 | 0.1055 |
| max | 0.5994 | 0.8402 | 0.8402 | 0.107 | 1.0 | 0.282 |


In [9]:
print(f"The max compound score for Bitcoin is {btc_df['compound'].max()}")
print(f"The min compound score for Bitcoin is {btc_df['compound'].min()}")

print(f"The max positive score for Bitcoin is {btc_df['positive'].max()}")
print(f"The min positive score for Bitcoin is {btc_df['positive'].min()}")

The max compound score for Bitcoin is 0.8402
The min compound score for Bitcoin is -0.5574
The max positive score for Bitcoin is 0.282
The min positive score for Bitcoin is 0.0


In [10]:
# Out of interest: the article with the highest compound score for BTC
df = btc_df.loc[[btc_df['compound'].idxmax()]]  # double brackets keep the row as a DataFrame
pd.set_option("display.max_colwidth", 500)
print(df['content'])

0    El Salvador's President Nayib Bukele has made good on his promise to adopt Bitcoin as legal tender. Officials in the Central American country's congress voted to accept the cryptocurrency by a majori… [+1414 chars]
Name: content, dtype: object


In [11]:
df.to_csv('btc_highest_compound_sentiment.csv', encoding='utf-8', index=False)

In [29]:
# Reset the column width after widening it above
pd.set_option("display.max_colwidth", 40)

In [101]:
# Create the Ethereum sentiment scores DataFrame
ethereum_df = pd.DataFrame(ethereum_articles['articles'])[['title', 'content']]

# Pre-create the score columns
for col in ['title_sentiment', 'content_sentiment', 'compound', 'negative', 'neutral', 'positive']:
    ethereum_df[col] = np.nan

for idx, row in ethereum_df.iterrows():
    # Score each text once and reuse the result
    title_scores = analyzer.polarity_scores(row['title'])
    content_scores = analyzer.polarity_scores(row['content'])
    ethereum_df.loc[idx, 'title_sentiment'] = title_scores['compound']
    ethereum_df.loc[idx, 'content_sentiment'] = content_scores['compound']
    ethereum_df.loc[idx, 'compound'] = content_scores['compound']  # same value as content_sentiment
    ethereum_df.loc[idx, 'negative'] = content_scores['neg']
    ethereum_df.loc[idx, 'neutral'] = content_scores['neu']
    ethereum_df.loc[idx, 'positive'] = content_scores['pos']

ethereum_df.head(10)

| | title | content | title_sentiment | content_sentiment | compound | negative | neutral | positive |
|---|---|---|---|---|---|---|---|---|
| 0 | This blockchain development course b... | TL;DR: The Cryptocurrency with Ether... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | Classic memes that have sold as NFTs | It wasn't long ago that your average... | 0.0 | -0.296 | -0.296 | 0.061 | 0.939 | 0.0 |
| 2 | Ethereum extends gains to rise 8%; b... | A representation of virtual currency... | 0.4404 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Ethereum creator Vitalik Buterin mad... | This article was translated from our... | 0.4404 | -0.34 | -0.34 | 0.066 | 0.934 | 0.0 |
| 4 | Norton 360 Antivirus Now Lets You Mi... | This new mining feature is called 'N... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 5 | GameStop Is Building An NFT Platform... | "We are building a team" the page de... | 0.0 | 0.6705 | 0.6705 | 0.0 | 0.812 | 0.188 |
| 6 | SafeMoon: New Dogecoin or Ponzi scheme? | Opinions expressed by Entrepreneur c... | 0.0 | 0.128 | 0.128 | 0.0 | 0.949 | 0.051 |
| 7 | 8 Money Toilets | CryptoPunks represent the only histo... | 0.0 | -0.4588 | -0.4588 | 0.151 | 0.782 | 0.067 |
| 8 | Buying a pink NFT cat was a crypto n... | By Cristina CriddleTechnology report... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 9 | Two ether bulls break down 3 reasons... | In March 2020, blockchain protocol s... | 0.5413 | 0.2023 | 0.2023 | 0.0 | 0.909 | 0.091 |


In [102]:
ethereum_df['content'][0]

'TL;DR: The Cryptocurrency with Ethereum and Solidity Blockchain Developer Bundle is on sale for £21.25 as of June 17, saving you 97% on list price.\r\nIs everyone you know investing in cryptocurrency? … [+949 chars]'

In [14]:
ethereum_df.describe()

| | title_sentiment | content_sentiment | compound | negative | neutral | positive |
|---|---|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.10304 | 0.11026 | 0.11026 | 0.01785 | 0.93405 | 0.04805 |
| std | 0.228527 | 0.319522 | 0.319522 | 0.040227 | 0.071475 | 0.060335 |
| min | -0.4019 | -0.4588 | -0.4588 | 0.0 | 0.782 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.91275 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.937 | 0.0255 |
| 75% | 0.32895 | 0.3612 | 0.3612 | 0.0 | 1.0 | 0.073 |
| max | 0.5413 | 0.7783 | 0.7783 | 0.151 | 1.0 | 0.191 |


In [15]:
print(f"The max compound score for Ethereum is {ethereum_df['compound'].max()}")
print(f"The min compound score for Ethereum is {ethereum_df['compound'].min()}")

print(f"The max positive score for Ethereum is {ethereum_df['positive'].max()}")
print(f"The min positive score for Ethereum is {ethereum_df['positive'].min()}")

The max compound score for Ethereum is 0.7783
The min compound score for Ethereum is -0.4588
The max positive score for Ethereum is 0.191
The min positive score for Ethereum is 0.0


In [16]:
# Describe the Bitcoin Sentiment
# Bitcoin sentiment is close to neutral with a slight positive lean: the mean title sentiment is 0.024 and the mean content sentiment is 0.180.
# For BTC, article content is noticeably more positive than the titles.

In [17]:
# Describe the Ethereum Sentiment
# Ethereum sentiment is also close to neutral with a slight positive lean: the mean title sentiment is 0.103 and the mean content sentiment is 0.110.
# For ETH, content sentiment closely matches title sentiment.

### Questions:

Q: Which coin had the highest mean positive score?

A: For article content, BTC had the highest mean positive score at 0.0748 (ETH: 0.04805).

Q: Which coin had the highest compound score?

A: For article content, BTC had the highest compound score at 0.8402 (ETH: 0.7783). The top-scoring article covered El Salvador making BTC legal tender.

Q: Which coin had the highest positive score?

A: For article content, BTC had the highest positive score at 0.282 (ETH: 0.191).
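
The same comparison can be made programmatically rather than by reading the two `describe()` tables side by side. The sketch below is one possible approach: it copies the relevant summary values from the outputs above into a plain dictionary, and the names `summary` and `leader` are hypothetical.

```python
# Summary values copied from the describe() outputs above
summary = {
    "BTC": {"mean_positive": 0.0748, "max_compound": 0.8402, "max_positive": 0.282},
    "ETH": {"mean_positive": 0.04805, "max_compound": 0.7783, "max_positive": 0.191},
}


def leader(metric):
    """Return the coin with the largest value for `metric`."""
    return max(summary, key=lambda coin: summary[coin][metric])


for metric in ("mean_positive", "max_compound", "max_positive"):
    print(metric, "->", leader(metric))
```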

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove punctuation.
3. Remove stopwords.
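
The three steps above can be sketched with the standard library alone, which helps show what each step does before NLTK is involved. This is an illustration under stated assumptions, not the notebook's tokenizer: the stopword set is a tiny hypothetical stand-in for NLTK's English list, the function name `simple_tokenize` and the example sentence are made up, and there is no lemmatization.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "has", "as"}


def simple_tokenize(text):
    """Lowercase, replace non-letters with spaces, split on whitespace, drop stopwords."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [word for word in letters_only.split() if word not in STOPWORDS]


print(simple_tokenize("El Salvador has become the first country to adopt Bitcoin!"))
# ['el', 'salvador', 'become', 'first', 'country', 'adopt', 'bitcoin']
```

Replacing punctuation with a space, rather than deleting it, keeps words on either side of a dash or slash from being fused together.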

In [34]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [35]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
sw = stopwords.words('english')

# Expand the default stopwords list if necessary
# No additions

In [36]:
# Complete the tokenizer function
def tokenizer(text):
    """Return lowercase, lemmatized tokens with punctuation and stopwords removed."""

    # Remove everything except letters and spaces (punctuation and digits)
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    words = word_tokenize(re_clean)

    # Lemmatize words into root words
    lem_words = [lemmatizer.lemmatize(word) for word in words]

    # Convert the words to lowercase and remove the stop words
    tokens = [word.lower() for word in lem_words if word.lower() not in sw]

    return tokens

In [66]:
# Create a new tokens column for Bitcoin
btc_df['tokens'] = btc_df['content'].apply(tokenizer)

# Keep only the content sentiment scores and the tokens
btc_df.drop(columns=['title', 'content', 'title_sentiment', 'content_sentiment'], inplace=True)
btc_df.head(5)

| | compound | negative | neutral | positive | tokens |
|---|---|---|---|---|---|
| 0 | 0.8402 | 0.0 | 0.718 | 0.282 | [el, salvadors, president, nayib, bu... |
| 1 | 0.128 | 0.0 | 0.957 | 0.043 | [el, salvador, ha, become, first, co... |
| 2 | 0.6908 | 0.0 | 0.831 | 0.169 | [clean, energy, seemselon, musk, tes... |
| 3 | 0.2732 | 0.0 | 0.94 | 0.06 | [image, captionthe, move, mean, bitc... |
| 4 | 0.5574 | 0.0 | 0.893 | 0.107 | [several, crypto, fan, descended, mi... |


In [51]:
# Create a new tokens column for Ethereum
ethereum_df['tokens'] = ethereum_df['content'].apply(tokenizer)

# Keep only the content sentiment scores and the tokens
ethereum_df.drop(columns=['title', 'content', 'title_sentiment', 'content_sentiment'], inplace=True)
ethereum_df.head(5)

| | compound | negative | neutral | positive | tokens |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | [tldr, cryptocurrency, ethereum, sol... |
| 1 | -0.296 | 0.061 | 0.939 | 0.0 | [wasnt, long, ago, average, person, ... |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | [representation, virtual, currency, ... |
| 3 | -0.34 | 0.066 | 0.934 | 0.0 | [article, wa, translated, spanish, e... |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | [new, mining, feature, called, norto... |


In [88]:
# Build one string per coin from the tokens column.
# Note: str() on a Series returns its printed form, so long token lists are cut off at "..."
btc_str = str(btc_df['tokens'])
eth_str = str(ethereum_df['tokens'])
eth_str

'0     [tldr, cryptocurrency, ethereum, sol...\n1     [wasnt, long, ago, average, person, ...\n2     [representation, virtual, currency, ...\n3     [article, wa, translated, spanish, e...\n4     [new, mining, feature, called, norto...\n5     [building, team, page, declares, sta...\n6     [opinions, expressed, entrepreneur, ...\n7     [cryptopunks, represent, historicall...\n8     [cristina, criddletechnology, report...\n9     [march, blockchain, protocol, solana...\n10    [sir, tim, bernerslee, credited, inv...\n11    [steven, ferdmangetty, imagesbillion...\n12    [two, ethereumbased, protocol, keep,...\n13    [youve, likely, seen, headline, surr...\n14    [along, stock, market, cryptocurrenc...\n15    [yao, qian, former, head, chinas, di...\n16    [visual, representation, digital, cr...\n17    [disclosure, goal, feature, product,...\n18    [consumer, us, digital, yuan, red, e...\n19    [bitcoin, ethereum, price, making, c...\nName: tokens, dtype: object'

In [89]:
btc_str = tokenizer(btc_str)
eth_str = tokenizer(eth_str)
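
Calling `str()` on a pandas Series yields its printed form, in which long cells are truncated with `...`, so tokens beyond the display width are lost and the last visible token may be cut short (for example `sol` or `norto` in the output above). An alternative is to flatten the per-article token lists directly. The sketch below uses plain lists as a hypothetical stand-in for the `tokens` column.

```python
from itertools import chain

# Hypothetical per-article token lists, standing in for a DataFrame's tokens column
token_lists = [
    ["el", "salvador", "president"],
    ["el", "salvador", "become", "first"],
    ["clean", "energy"],
]

# Flatten the nested lists into one list of tokens, preserving order
all_tokens = list(chain.from_iterable(token_lists))
print(len(all_tokens))  # 9
```

Iterating over a Series yields its values, so `chain.from_iterable(btc_df['tokens'])` should work the same way on the real column and needs no second pass through `tokenizer`.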

---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequency for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
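
For N = 2, an n-gram is simply each pair of adjacent tokens, so the idea can be illustrated without NLTK. The sketch below is a standard-library approximation of what `nltk.ngrams` plus `Counter` produce. The function name `bigram_counts` and the token list are hypothetical.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (n-grams with N = 2)."""
    return Counter(zip(tokens, tokens[1:]))


tokens = ["el", "salvador", "bitcoin", "el", "salvador", "law"]
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [(('el', 'salvador'), 2)]
```

The same `Counter(...).most_common(10)` call applied to the token list itself gives the top 10 words.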

In [39]:
from collections import Counter
from nltk import ngrams

In [104]:
# Top 10 Bitcoin bigrams, built from the token list created above
N = 2
grams = ngrams(btc_str, N)
btc_top_bigrams = Counter(grams).most_common(10)

In [90]:
# Generate the Bitcoin N-grams where N=2   
btc_ngram = Counter(ngrams(btc_str,n=2)).most_common()
btc_ngram

[(('el', 'salvador'), 4),
 (('june', 'reuters'), 3),
 (('reuters', 'el'), 3),
 (('welcome', 'back'), 2),
 (('london', 'june'), 2),
 (('salvador', 'president'), 1),
 (('president', 'nayib'), 1),
 (('nayib', 'bu'), 1),
 (('bu', 'el'), 1),
 (('salvador', 'ha'), 1),
 (('ha', 'become'), 1),
 (('become', 'first'), 1),
 (('first', 'co'), 1),
 (('co', 'clean'), 1),
 (('clean', 'energy'), 1),
 (('energy', 'seemselon'), 1),
 (('seemselon', 'musk'), 1),
 (('musk', 'te'), 1),
 (('te', 'image'), 1),
 (('image', 'captionthe'), 1),
 (('captionthe', 'move'), 1),
 (('move', 'mean'), 1),
 (('mean', 'bitc'), 1),
 (('bitc', 'several'), 1),
 (('several', 'crypto'), 1),
 (('crypto', 'fan'), 1),
 (('fan', 'descended'), 1),
 (('descended', 'mi'), 1),
 (('mi', 'hello'), 1),
 (('hello', 'friend'), 1),
 (('friend', 'welcome'), 1),
 (('back', 'week'), 1),
 (('week', 'spite'), 1),
 (('spite', 'environmental'), 1),
 (('environmental', 'regulatory'), 1),
 (('regulatory', 'maryann'), 1),
 (('maryann', 'russonbusiness

In [91]:
# Generate the Ethereum N-grams where N=2
eth_ngram = Counter(ngrams(eth_str,n=2)).most_common()
eth_ngram

[(('tldr', 'cryptocurrency'), 1),
 (('cryptocurrency', 'ethereum'), 1),
 (('ethereum', 'sol'), 1),
 (('sol', 'wasnt'), 1),
 (('wasnt', 'long'), 1),
 (('long', 'ago'), 1),
 (('ago', 'average'), 1),
 (('average', 'person'), 1),
 (('person', 'representation'), 1),
 (('representation', 'virtual'), 1),
 (('virtual', 'currency'), 1),
 (('currency', 'article'), 1),
 (('article', 'wa'), 1),
 (('wa', 'translated'), 1),
 (('translated', 'spanish'), 1),
 (('spanish', 'e'), 1),
 (('e', 'new'), 1),
 (('new', 'mining'), 1),
 (('mining', 'feature'), 1),
 (('feature', 'called'), 1),
 (('called', 'norto'), 1),
 (('norto', 'building'), 1),
 (('building', 'team'), 1),
 (('team', 'page'), 1),
 (('page', 'declares'), 1),
 (('declares', 'sta'), 1),
 (('sta', 'opinion'), 1),
 (('opinion', 'expressed'), 1),
 (('expressed', 'entrepreneur'), 1),
 (('entrepreneur', 'cryptopunks'), 1),
 (('cryptopunks', 'represent'), 1),
 (('represent', 'historicall'), 1),
 (('historicall', 'cristina'), 1),
 (('cristina', 'criddl

In [92]:
# token_count returns the top N words for a given coin (N defaults to 3; pass N=10 for the top 10)
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [97]:
# Use token_count to get the top words for Bitcoin (default N=3)
token_count(btc_str)

[('salvador', 6), ('el', 5), ('reuters', 4)]

In [94]:
# Use token_count to get the top words for Ethereum (default N=3)
token_count(eth_str)

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [98]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [99]:
# Join the Bitcoin tokens into a single string for the word cloud
processed_topic = ' '.join(btc_str)

In [None]:
# Generate the Bitcoin word cloud
wc = WordCloud().generate(processed_topic)
plt.imshow(wc)

In [None]:
# Generate the Ethereum word cloud
wc = WordCloud().generate(' '.join(eth_str))
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using spaCy.

In [None]:
import spacy
from spacy import displacy

In [None]:
# Download the language model for spaCy
# !python -m spacy download en_core_web_sm

In [None]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [65]:
# Concatenate all of the Bitcoin text together
# (read from the raw API response, since btc_df['content'] is dropped in the tokenizer section)
btc_text = ' '.join(article['content'] for article in bitcoin_articles['articles'] if article['content'])

In [None]:
# Run the NER processor on all of the text
btc_doc = nlp(btc_text)

# Add a title to the document
btc_doc.user_data['title'] = 'Bitcoin NER'

In [None]:
# Render the visualization
displacy.render(btc_doc, style='ent')

In [None]:
# List all Entities
print([ent.text for ent in btc_doc.ents])

---

### Ethereum NER

In [None]:
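# Concatenate all of the Ethereum text together
# (read from the raw API response, since ethereum_df['content'] is dropped in the tokenizer section)
eth_text = ' '.join(article['content'] for article in ethereum_articles['articles'] if article['content'])

In [None]:
# Run the NER processor on all of the text
eth_doc = nlp(eth_text)

# Add a title to the document
eth_doc.user_data['title'] = 'Ethereum NER'

In [None]:
# Render the visualization
displacy.render(eth_doc, style='ent')

In [45]:
# List all Entities
print([ent.text for ent in eth_doc.ents])
# To keep only places and organizations, filter on the label:
# entities = [ent.text for ent in eth_doc.ents if ent.label_ in ['GPE', 'ORG']]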
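
A flat entity list is easier to scan once it is grouped by label. The sketch below assumes the entities have already been pulled out of a spaCy document as `(ent.text, ent.label_)` pairs. The pairs shown are hypothetical examples rather than output from this notebook, and `group_by_label` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical (text, label) pairs, shaped like (ent.text, ent.label_) from spaCy
entities = [
    ("El Salvador", "GPE"),
    ("Nayib Bukele", "PERSON"),
    ("PayPal", "ORG"),
    ("El Salvador", "GPE"),
]


def group_by_label(pairs):
    """Group unique entity texts under their label, preserving first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


print(group_by_label(entities))
# {'GPE': ['El Salvador'], 'PERSON': ['Nayib Bukele'], 'ORG': ['PayPal']}
```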

---