# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?

In [59]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
load_dotenv()
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/jcmignon/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [60]:
# Read your api key environment variable
api_key = os.getenv("NEWS_API_KEY")

In [61]:
# Create a newsapi client
from newsapi import NewsApiClient
newsapi = NewsApiClient(api_key=api_key)

In [62]:
# Fetch the Bitcoin news articles (result pages 1 to 5)
btc_pages = [
    pd.DataFrame(newsapi.get_everything(q="bitcoin", language="en", sort_by="relevancy", page=page)["articles"])
    for page in range(1, 6)
]

# Combine the pages into a single DataFrame
btc_news = pd.concat(btc_pages)
btc_news

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': 'engadget', 'name': 'Engadget'}",https://www.engadget.com/about/editors/saqib-shah,El Salvador becomes the first country to appro...,El Salvador has voted to adopt Bitcoin as lega...,https://www.engadget.com/el-salvador-bitcoin-l...,https://s.yimg.com/os/creatr-uploaded-images/2...,2021-06-09T12:04:40Z,El Salvador's President Nayib Bukele has made ...
1,"{'id': None, 'name': 'Gizmodo.com'}",Matt Novak,El Salvador Becomes First Country to Recognize...,El Salvador has become the first country in th...,https://gizmodo.com/el-salvador-becomes-first-...,https://i.kinja-img.com/gawker-media/image/upl...,2021-06-09T10:00:00Z,El Salvador has become the first country in th...
2,"{'id': 'mashable', 'name': 'Mashable'}",Stan Schroeder,Elon Musk says Tesla will resume Bitcoin purch...,"It's all about clean energy, it seems. \nElon ...",https://mashable.com/article/tesla-bitcoin-pur...,https://mondrian.mashable.com/2021%252F06%252F...,2021-06-14T07:15:49Z,"It's all about clean energy, it seems. \r\nElo..."
3,"{'id': 'bbc-news', 'name': 'BBC News'}",https://www.facebook.com/bbcnews,Bitcoin: El Salvador makes cryptocurrency lega...,It is the first country in the world to make t...,https://www.bbc.co.uk/news/world-latin-america...,https://ichef.bbci.co.uk/news/1024/branded_new...,2021-06-09T08:27:58Z,image captionThe move means bitcoin will be ac...
4,"{'id': None, 'name': 'Gizmodo.com'}",Alyse Stanley,Miami's Bitcoin Conference May Be the Latest C...,"Several crypto fans that descended on Miami, F...",https://gizmodo.com/miamis-bitcoin-conference-...,https://i.kinja-img.com/gawker-media/image/upl...,2021-06-11T00:45:00Z,"Several crypto fans that descended on Miami, F..."
...,...,...,...,...,...,...,...,...
15,"{'id': 'reuters', 'name': 'Reuters'}",Reuters,Square to invest $5 mln in Blockstream's solar...,Blockchain technology company Blockstream Mini...,https://www.reuters.com/technology/square-inve...,,2021-06-05T18:22:00Z,Blockchain technology company Blockstream Mini...
16,"{'id': 'reuters', 'name': 'Reuters'}",Tom Wilson,"EXCLUSIVE El Salvador bitcoin transfers soar, ...",Small transfers of bitcoin to El Salvador jump...,https://www.reuters.com/business/finance/exclu...,https://www.reuters.com/resizer/AUSGmfvZnVBr2N...,2021-06-14T14:24:00Z,: Bitcoin banners are seen outside of a small ...
17,"{'id': 'reuters', 'name': 'Reuters'}",Reuters Staff,"IMF sees legal, economic issues with El Salvad...",The International Monetary Fund said on Thursd...,https://www.reuters.com/article/us-el-salvador...,https://static.reuters.com/resources/r/?m=02&d...,2021-06-10T14:30:00Z,By Reuters Staff\r\nFILE PHOTO: The Internatio...
18,"{'id': 'reuters', 'name': 'Reuters'}",Gertrude Chavez-dreyfuss,Crypto sees 2nd week of outflows; ether posts ...,Cryptocurrency investment products and funds s...,https://www.reuters.com/technology/crypto-sees...,https://www.reuters.com/resizer/tZ8a3w6k8k2N_N...,2021-06-14T21:26:00Z,Cryptocurrency investment products and funds s...


In [63]:
# Fetch the Ethereum news articles (result pages 1 to 5)
eth_pages = [
    pd.DataFrame(newsapi.get_everything(q="ethereum", language="en", sort_by="relevancy", page=page)["articles"])
    for page in range(1, 6)
]

# Combine the pages into a single DataFrame
eth_news = pd.concat(eth_pages)
eth_news.head()

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': 'mashable', 'name': 'Mashable'}",Joseph Green,This blockchain development course bundle is o...,TL;DR: The Cryptocurrency with Ethereum and So...,https://mashable.com/uk/shopping/june-17-crypt...,https://mondrian.mashable.com/2021%252F06%252F...,2021-06-17T04:05:00Z,TL;DR: The Cryptocurrency with Ethereum and So...
1,"{'id': 'mashable', 'name': 'Mashable'}",Tim Marcin,Classic memes that have sold as NFTs,It wasn't long ago that your average person ha...,https://mashable.com/article/classic-memes-sol...,https://mondrian.mashable.com/2021%252F06%252F...,2021-06-20T19:28:07Z,It wasn't long ago that your average person ha...
2,"{'id': None, 'name': 'Entrepreneur'}",Entrepreneur en Español,Ethereum creator Vitalik Buterin made more tha...,Ethereum billionaire founder Vitalik Buterin b...,https://www.entrepreneur.com/article/374264,https://assets.entrepreneur.com/content/3x2/20...,2021-06-10T19:12:00Z,This article was translated from our Spanish e...
3,"{'id': 'business-insider', 'name': 'Business I...",cshumba@insider.com (Camomile Shumba ),Ethereum now has more active addresses than bi...,Ethereum overtook bitcoin in the number of act...,https://markets.businessinsider.com/news/stock...,https://images2.markets.businessinsider.com/60...,2021-07-02T15:49:56Z,Bitcoin and Ethereum\r\nYuriko Nakao\r\nEther ...
4,"{'id': None, 'name': 'Gizmodo.com'}",Whitney Kimball,8 Money Toilets,A ferocious bidding war over a misshapen chick...,https://gizmodo.com/8-money-toilets-1847036893,https://i.kinja-img.com/gawker-media/image/upl...,2021-06-04T22:20:00Z,CryptoPunks represent the only historically re...


In [64]:
# Create the Bitcoin sentiment scores DataFrame
btc_sentiment = []

for content in btc_news['content']:
    sentiment = analyzer.polarity_scores(content)
    compound = sentiment["compound"]
    pos = sentiment["pos"]
    neu = sentiment["neu"]
    neg = sentiment["neg"]
        
    btc_sentiment.append({
            "text": content,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu   
        })

# Create DataFrame
btc_df = pd.DataFrame(btc_sentiment)

# Reorder DataFrame columns
cols = ["text", "compound", "positive", "negative", "neutral"]
btc_df = btc_df[cols]

btc_df.head()

Unnamed: 0,text,compound,positive,negative,neutral
0,El Salvador's President Nayib Bukele has made ...,0.8402,0.282,0.0,0.718
1,El Salvador has become the first country in th...,0.128,0.043,0.0,0.957
2,"It's all about clean energy, it seems. \r\nElo...",0.6908,0.169,0.0,0.831
3,image captionThe move means bitcoin will be ac...,0.2732,0.06,0.0,0.94
4,"Several crypto fans that descended on Miami, F...",0.5574,0.107,0.0,0.893


In [65]:
# Create the Ethereum sentiment scores DataFrame
eth_sentiment = []

for content in eth_news['content']:
    sentiment = analyzer.polarity_scores(content)
    compound = sentiment["compound"]
    pos = sentiment["pos"]
    neu = sentiment["neu"]
    neg = sentiment["neg"]
        
    eth_sentiment.append({
            "text": content,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu   
        })

# Create DataFrame
eth_df = pd.DataFrame(eth_sentiment)

# Reorder DataFrame columns
cols = ["text", "compound", "positive", "negative", "neutral"]
eth_df = eth_df[cols]

eth_df.head()

Unnamed: 0,text,compound,positive,negative,neutral
0,TL;DR: The Cryptocurrency with Ethereum and So...,0.0,0.0,0.0,1.0
1,It wasn't long ago that your average person ha...,-0.296,0.0,0.061,0.939
2,This article was translated from our Spanish e...,-0.34,0.0,0.066,0.934
3,Bitcoin and Ethereum\r\nYuriko Nakao\r\nEther ...,0.3612,0.11,0.041,0.849
4,CryptoPunks represent the only historically re...,-0.4588,0.067,0.151,0.782


In [66]:
# Describe the Bitcoin Sentiment
btc_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,100.0,100.0,100.0,100.0
mean,0.061507,0.04704,0.0282,0.92475
std,0.352836,0.063612,0.043421,0.070162
min,-0.7783,0.0,0.0,0.718
25%,-0.081925,0.0,0.0,0.871
50%,0.0,0.0,0.0,0.929
75%,0.32365,0.084,0.0605,1.0
max,0.8402,0.282,0.178,1.0


In [67]:
# Describe the Ethereum Sentiment
eth_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,100.0,100.0,100.0,100.0
mean,0.150449,0.05848,0.02215,0.91939
std,0.362227,0.06918,0.036668,0.071027
min,-0.5719,0.0,0.0,0.71
25%,0.0,0.0,0.0,0.87525
50%,0.05135,0.046,0.0,0.927
75%,0.411525,0.09725,0.05775,1.0
max,0.8481,0.29,0.151,1.0


### Questions:

Q: Which coin had the highest mean positive score?

*__A: Ethereum had the highest mean positive score, 5.85% vs 4.70% for Bitcoin.__*

Q: Which coin had the highest compound score?

*__A: Ethereum had the highest mean compound score (0.1504 vs 0.0615 for Bitcoin), as well as the highest compound score for an individual article (0.8481 vs 0.8402 for Bitcoin).__*

Q: Which coin had the highest positive score?

*__A: Ethereum had the highest positive score (29% vs 28.2% for Bitcoin).__*
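
To read individual compound scores more easily, they can be bucketed into coarse labels. The sketch below assumes the commonly cited VADER cut-offs of ±0.05; the helper name `label_from_compound` and the `threshold` parameter are hypothetical and not part of the notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the first five Ethereum articles shown in eth_df.head()
eth_head = [0.0, -0.296, -0.34, 0.3612, -0.4588]
print([label_from_compound(c) for c in eth_head])
# ['neutral', 'negative', 'negative', 'positive', 'negative']
```

The threshold is a convention rather than a property of the data, so it may be worth adjusting for news text.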

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.
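
The three steps can be sketched without NLTK as follows. This is a minimal illustration, not the notebook's tokenizer: `STOPWORDS` is a tiny stand-in for `stopwords.words('english')`, and `simple_tokens` is a hypothetical name. Punctuation and line breaks are replaced with a space rather than removed, which avoids fusing neighbouring words (compare `seemselon` in the Bitcoin tokens).

```python
import re

# Tiny stand-in stopword set (an assumption for this sketch);
# the notebook uses nltk.corpus.stopwords.words('english') instead.
STOPWORDS = {"the", "a", "as", "has", "to", "of", "in"}

def simple_tokens(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    lowered = text.lower()                            # 1. lowercase
    letters_only = re.sub(r"[^a-z ]", " ", lowered)   # 2. punctuation -> space
    return [w for w in letters_only.split() if w not in STOPWORDS]  # 3. stopwords

print(simple_tokens("El Salvador has voted to adopt Bitcoin as legal tender."))
# ['el', 'salvador', 'voted', 'adopt', 'bitcoin', 'legal', 'tender']
```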

In [68]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import reuters, stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [69]:
btc_string=" ".join(btc_news['content'].astype(str))

In [70]:
# Tokenize the combined Bitcoin text and filter out the English stopwords
sentence = sent_tokenize(btc_string)
words = word_tokenize(btc_string)
sw = set(stopwords.words('english'))
first_result = [word.lower() for word in words if word.lower() not in sw]

# Candidate additions to the stopword list: 'char' and 'reuters'

In [71]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

In [72]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    
    # Remove the punctuation from text
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Create a tokenized list of the words
    words = word_tokenize(re_clean)    
    
    # Define the stop words
    sw = stopwords.words('english')
    # Additional words as they appear in the top 10 most common words and add little to no value
    sw_addon = {'char', 'reuters'}
    
    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in words]
   
    # Convert the words to lowercase and drop the stopwords
    output = [word.lower() for word in lem if word.lower() not in sw]
    
    # Note: sw_addon is not applied yet. stopwords.words() returns a list, which has no .union();
    # filtering against set(sw).union(sw_addon) instead would exclude those words as well.
    
    return output
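
The `.union` error mentioned in the comment most likely comes from calling it on a list. A minimal sketch of the fix, using a short stand-in list instead of `stopwords.words('english')` (the sample tokens are made up for illustration):

```python
# Stand-in for stopwords.words('english'), which returns a plain list
base_stopwords = ["the", "a", "of", "to"]
sw_addon = {"char", "reuters"}

# Lists have no .union() method; only sets do
assert not hasattr(base_stopwords, "union")

# Converting to a set first makes the union available
sw = set(base_stopwords).union(sw_addon)

tokens = ["bitcoin", "char", "the", "reuters", "salvador"]
filtered = [t for t in tokens if t not in sw]
print(filtered)  # ['bitcoin', 'salvador']
```

Applying this inside `tokenizer` would change the word counts shown later in the notebook, since `char` and `reuters` would no longer appear.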

In [73]:
# Create a new tokens column for Bitcoin
btc_token = btc_news['content'].apply(tokenizer)
btc_df['token'] = btc_token.values
btc_df.head()

Unnamed: 0,text,compound,positive,negative,neutral,token
0,El Salvador's President Nayib Bukele has made ...,0.8402,0.282,0.0,0.718,"[el, salvadors, president, nayib, bukele, ha, ..."
1,El Salvador has become the first country in th...,0.128,0.043,0.0,0.957,"[el, salvador, ha, become, first, country, wor..."
2,"It's all about clean energy, it seems. \r\nElo...",0.6908,0.169,0.0,0.831,"[clean, energy, seemselon, musk, tesla, caused..."
3,image captionThe move means bitcoin will be ac...,0.2732,0.06,0.0,0.94,"[image, captionthe, move, mean, bitcoin, accep..."
4,"Several crypto fans that descended on Miami, F...",0.5574,0.107,0.0,0.893,"[several, crypto, fan, descended, miami, flori..."


In [74]:
# Create a new tokens column for Ethereum
eth_token = eth_news['content'].apply(tokenizer)
eth_df['token'] = eth_token.values
eth_df.head()

Unnamed: 0,text,compound,positive,negative,neutral,token
0,TL;DR: The Cryptocurrency with Ethereum and So...,0.0,0.0,0.0,1.0,"[tldr, cryptocurrency, ethereum, solidity, blo..."
1,It wasn't long ago that your average person ha...,-0.296,0.0,0.061,0.939,"[wasnt, long, ago, average, person, clue, nft,..."
2,This article was translated from our Spanish e...,-0.34,0.0,0.066,0.934,"[article, wa, translated, spanish, edition, us..."
3,Bitcoin and Ethereum\r\nYuriko Nakao\r\nEther ...,0.3612,0.11,0.041,0.849,"[bitcoin, ethereumyuriko, nakaoether, overtook..."
4,CryptoPunks represent the only historically re...,-0.4588,0.067,0.151,0.782,"[cryptopunks, represent, historically, relevan..."


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
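
For intuition, bigrams are just each token paired with its successor. The sketch below builds them with `zip` from the standard library, which should yield the same pairs as `nltk.ngrams(tokens, n=2)`; the token list is a made-up sample.

```python
from collections import Counter

tokens = ["el", "salvador", "bitcoin", "legal", "tender", "el", "salvador"]

# Pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)

print(bigram_counts.most_common(1))  # [(('el', 'salvador'), 2)]
```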

In [75]:
from collections import Counter
from nltk import ngrams

In [76]:
# Join the per-article token lists into one string (this overwrites the earlier btc_string)
btc_string = " ".join(btc_df['token'].astype(str))
# Run the tokenizer again to get a single flat list of tokens
btc_string_processed = tokenizer(btc_string)

In [77]:
# Generate the Bitcoin N-grams where N=2
btc_counts = Counter(ngrams(btc_string_processed, n=2))
print(dict(btc_counts))

{('el', 'salvador'): 21, ('salvador', 'president'): 2, ('president', 'nayib'): 7, ('nayib', 'bukele'): 6, ('bukele', 'ha'): 1, ('ha', 'made'): 1, ('made', 'good'): 1, ('good', 'promise'): 1, ('promise', 'adopt'): 1, ('adopt', 'bitcoin'): 3, ('bitcoin', 'legal'): 12, ('legal', 'tender'): 11, ('tender', 'official'): 1, ('official', 'central'): 1, ('central', 'american'): 3, ('american', 'country'): 1, ('country', 'congress'): 1, ('congress', 'voted'): 1, ('voted', 'accept'): 1, ('accept', 'cryptocurrency'): 1, ('cryptocurrency', 'majori'): 1, ('majori', 'char'): 1, ('char', 'el'): 5, ('salvador', 'ha'): 6, ('ha', 'become'): 6, ('become', 'first'): 5, ('first', 'country'): 6, ('country', 'world'): 6, ('world', 'recognize'): 1, ('recognize', 'cryptocurrency'): 1, ('cryptocurrency', 'bitcoin'): 4, ('legal', 'currency'): 1, ('currency', 'according'): 1, ('according', 'president'): 1, ('bukele', 'tweet'): 1, ('tweet', 'wednesday'): 1, ('wednesday', 'citizen'): 1, ('citizen', 'able'): 1, ('abl

In [78]:
# Generate the Ethereum N-grams where N=2
# Join the per-article token lists into one string
eth_string = " ".join(eth_df['token'].astype(str))
# Run the tokenizer again to get a single flat list of tokens
eth_string_processed = tokenizer(eth_string)
eth_counts = Counter(ngrams(eth_string_processed, n=2))
print(dict(eth_counts))



In [79]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)
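
A small sketch of how the input shape affects the result of `token_count`: a flat list of strings gives plain string keys, while 1-grams give one-element tuples. The sample tokens are made up for illustration.

```python
from collections import Counter

def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

tokens = ["char", "bitcoin", "char", "reuters", "bitcoin", "char"]

# Flat list of strings in, plain string keys out
print(token_count(tokens, N=2))  # [('char', 3), ('bitcoin', 2)]

# Wrapping each token in a 1-tuple, as ngrams(..., n=1) does, gives tuple keys
one_grams = [(t,) for t in tokens]
print(token_count(one_grams, N=2))  # [(('char',), 3), (('bitcoin',), 2)]
```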

In [80]:
# Use token_count to get the top 10 words for Bitcoin

# Count the individual tokens (1-grams)
btc_counts_indiv = Counter(ngrams(btc_string_processed, n=1))
# Apply token_count function
token_count(btc_counts_indiv)

#Alternative solution
# print(dict(btc_counts_indiv.most_common(10)))

[(('char',), 97),
 (('bitcoin',), 96),
 (('reuters',), 57),
 (('june',), 38),
 (('cryptocurrency',), 31),
 (('salvador',), 28),
 (('el',), 26),
 (('world',), 24),
 (('monday',), 19),
 (('seen',), 19)]

In [35]:
# Use token_count to get the top 10 words for Ethereum

# Count the individual tokens (1-grams)
eth_counts_indiv = Counter(ngrams(eth_string_processed, n=1))
# Apply token_count function
token_count(eth_counts_indiv)

#Alternative solution
# print(dict(eth_counts_indiv.most_common(10)))

[(('char',), 99),
 (('bitcoin',), 41),
 (('reuters',), 34),
 (('cryptocurrency',), 33),
 (('world',), 26),
 (('june',), 23),
 (('ethereum',), 14),
 (('biggest',), 14),
 (('representation',), 14),
 (('week',), 13)]

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [81]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [82]:
# Generate the Bitcoin word cloud

# btc_string_processed is a list of tokens, while WordCloud.generate() expects a single string,
# so join the tokens first
wc = WordCloud().generate(" ".join(btc_string_processed))
plt.imshow(wc)

In [23]:
# Generate the Ethereum word cloud

# Join the token list into one string, which is what WordCloud.generate() expects
wc = WordCloud().generate(" ".join(eth_string_processed))
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using spaCy.

In [89]:
import spacy
from spacy import displacy

In [91]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm

In [92]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [103]:
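# Concatenate all of the Bitcoin text together
# Note: btc_string was reassigned in the n-gram section, so it now holds the joined token lists
# rather than the raw article text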

In [96]:
# Run the NER processor on all of the text
btc_NER = nlp(btc_string)

# Add a title to the document
btc_NER.user_data["title"] = "Bitcoin NER"

In [97]:
# Render the visualization
displacy.render(btc_NER, style='ent')

In [101]:
# List the geopolitical entities (label GPE) found in the text
print([ent.text for ent in btc_NER.ents if ent.label_ == 'GPE'])

['florida', 'china', 'germany', 'china', "'york'", 'china', 'china', 'paris', "'salvador'", "'york'", "'york'", 'london', "'york'", "'york'", 'china', 'america', "'york'", 'london', 'china', "'beach'", 'washington', "'york'"]
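
One way to summarize this list is to tally how often each place appears. The sketch below copies the printed GPE list and strips the stray quote characters before counting; it is an illustration only and not part of the original notebook.

```python
from collections import Counter

# GPE list copied from the Bitcoin output above
gpe = ['florida', 'china', 'germany', 'china', "'york'", 'china', 'china',
       'paris', "'salvador'", "'york'", "'york'", 'london', "'york'", "'york'",
       'china', 'america', "'york'", 'london', 'china', "'beach'", 'washington',
       "'york'"]

# Drop the leftover single quotes so "'york'" and "york" count as the same name
cleaned = [name.strip("'") for name in gpe]
counts = Counter(cleaned)

print(counts.most_common(3))  # [('york', 7), ('china', 6), ('london', 2)]
```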


---

### Ethereum NER

In [104]:
# Concatenate all of the Ethereum text together
# Note: eth_string was built from the token lists in the n-gram section, not from the raw article text

In [105]:
# Run the NER processor on all of the text
eth_NER = nlp(eth_string)

# Add a title to the document
eth_NER.user_data["title"] = "Ethereum NER"

In [106]:
# Render the visualization
displacy.render(eth_NER, style='ent')

In [107]:
# List the geopolitical entities (label GPE) found in the text
print([ent.text for ent in eth_NER.ents if ent.label_ == 'GPE'])

["'york'", 'china', 'london', 'britain', 'london', 'london', 'staffmumbai', "'york'", 'london', 'london', 'shanghai', 'china', 'beijing', 'india']


---