# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [2]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk as nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
from newsapi.newsapi_client import NewsApiClient

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\mksta\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [3]:
# Read your api key environment variable && # Create a newsapi client
# YOUR CODE HERE!
load_dotenv()
newsapi = NewsApiClient(api_key=os.environ["news_api"])
sid = SentimentIntensityAnalyzer()

In [4]:
# Fetch the Bitcoin news articles
# YOUR CODE HERE!
bitcoin = newsapi.get_everything(q="bitcoin", language = "en")

In [5]:
# Fetch the Ethereum news articles
# YOUR CODE HERE!
ethereum = newsapi.get_everything(q="ethereum", language = "en")

In [68]:
bitcoin

{'status': 'ok',
 'totalResults': 9638,
 'articles': [{'source': {'id': 'wired', 'name': 'Wired'},
   'author': 'Arielle Pardes',
   'title': 'Miami’s Bitcoin Conference Left a Trail of Harassment',
   'description': 'For some women, inappropriate conduct from other conference-goers continued to haunt them online.',
   'url': 'https://www.wired.com/story/bitcoin-2022-conference-harassment/',
   'urlToImage': 'https://media.wired.com/photos/627a89e3e37e715cb7d760d2/191:100/w_1280,c_limit/Bitcoin_Miami_Biz_GettyImages-1239817123.jpg',
   'publishedAt': '2022-05-10T16:59:46Z',
   'content': 'Now, even though there are a number of women-focused crypto spaces, Odeniran says women are still underrepresented. Ive been in spaces where Im the only Black person, or the only woman, or the only B… [+3828 chars]'},
  {'source': {'id': 'the-verge', 'name': 'The Verge'},
   'author': 'Justine Calma',
   'title': 'Why fossil fuel companies see green in Bitcoin mining projects',
   'description': 'Exxo

In [30]:
# Create the Bitcoin sentiment scores DataFrame
# YOUR CODE HERE!
bitcoin_sentiments = []
for article in bitcoin["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:3]
        sentiment = sid.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        bitcoin_sentiments.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
        })
    except AttributeError:
        pass
btc_df = pd.DataFrame(bitcoin_sentiments)

In [31]:
# Create the Ethereum sentiment scores DataFrame
# YOUR CODE HERE!
ethereum_sentiments = []
for article in ethereum["articles"]:
    try:
        text = article["content"]
        date = article["publishedAt"][:3]
        sentiment = sid.polarity_scores(text)
        compound = sentiment["compound"]
        pos = sentiment["pos"]
        neu = sentiment["neu"]
        neg = sentiment["neg"]
        ethereum_sentiments.append({
            "text": text,
            "date": date,
            "compound": compound,
            "positive": pos,
            "negative": neg,
            "neutral": neu
        })
    except AttributeError:
        pass
eth_df = pd.DataFrame(ethereum_sentiments)

In [32]:
# Describe the Bitcoin Sentiment
# YOUR CODE HERE!
btc_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,20.0,20.0,20.0,20.0
mean,-0.09339,0.05945,0.08045,0.86005
std,0.389782,0.062439,0.07613,0.104336
min,-0.8593,0.0,0.0,0.557
25%,-0.36635,0.0,0.0535,0.827
50%,-0.1901,0.048,0.063,0.888
75%,0.152575,0.085,0.08425,0.93025
max,0.7506,0.202,0.3,0.964


In [33]:
# Describe the Ethereum Sentiment
# YOUR CODE HERE!
eth_df.describe()

Unnamed: 0,compound,positive,negative,neutral
count,20.0,20.0,20.0,20.0
mean,-0.02918,0.0459,0.04945,0.9048
std,0.402413,0.059923,0.043592,0.052498
min,-0.6908,0.0,0.0,0.822
25%,-0.28445,0.0,0.0,0.85875
50%,-0.1897,0.0,0.059,0.9255
75%,0.2887,0.073,0.069,0.937
max,0.6908,0.178,0.178,1.0


### Questions:

Q: Which coin had the highest mean positive score?

A: 

Q: Which coin had the highest compound score?

A: 

Q. Which coin had the highest positive score?

A: 

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.

In [34]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [35]:
# Instantiate the lemmatizer
# YOUR CODE HERE!
lemmatizer = WordNetLemmatizer()
# Create a list of stopwords
# YOUR CODE HERE!
sw = set(stopwords.words('english'))

# Expand the default stopwords list if necessary
# YOUR CODE HERE!
sw_addon=[]

In [61]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    
    # Remove the punctuation from text
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)
    # Create a tokenized list of the words
    words = word_tokenize(re_clean)
    
    # Lemmatize words into root words
    result = [lemmatizer.lemmatize(word) for word in words]
       
    # Convert the words to lowercase
    token = [word.lower() for word in result if word.lower() not in sw.union(sw_addon)]
    # Remove the stop words
    tokens=token
    
    return tokens

In [65]:
# Create a new tokens column for Bitcoin
# YOUR CODE HERE!
# from collections import Counter
# def word_counter(corpus): 
#     # Combine all articles in corpus into one large string
#     #big_string = ' '.join(corpus)
#     processed = tokenizer(corpus)
#     top_10 = dict(Counter(processed).most_common())
#     return pd.DataFrame(list(top_10.items()), columns=['word', 'count'])

text=''
for content in bitcoin['articles']:
     text = text + str(content["content"])

token_test = tokenizer(text)
btcTest=btc_df
btcTest=btcTest.append(pd.DataFrame(token_test))

In [67]:
btc_df

Unnamed: 0,text,date,compound,positive,negative,neutral
0,"Now, even though there are a number of women-f...",202,0.0772,0.036,0.0,0.964
1,A Bitcoin mining site powered by otherwise los...,202,-0.0516,0.056,0.061,0.882
2,Warren Buffett has always been a bitcoin skept...,202,-0.3269,0.085,0.143,0.772
3,"As a kid, I remember when my father tried to u...",202,0.3818,0.114,0.052,0.833
4,"Image source, Getty Images\r\nThe value of Bit...",202,0.34,0.072,0.0,0.928
5,If youve ever felt like introducing some Vegas...,202,0.7506,0.193,0.0,0.807
6,Cryptocurrency mixers are sometimes used to he...,202,-0.4404,0.202,0.241,0.557
7,Posted \r\nFrom Bitcoin's dramatic drop to a n...,202,-0.3612,0.0,0.123,0.877
8,"May 11 (Reuters) - Bitcoin fell 7.23% to $28,7...",202,-0.3818,0.0,0.077,0.923
9,"May 4 (Reuters) - Bitcoin rose 5.7% to $39,862...",202,-0.2732,0.0,0.063,0.937


In [14]:
# Create a new tokens column for Ethereum
# YOUR CODE HERE!
eth_df['eth_tokens'] = tokenizer('ethereum')

---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 

In [15]:
from collections import Counter
from nltk import ngrams

In [16]:
# Generate the Bitcoin N-grams where N=2
# YOUR CODE HERE!

In [17]:
# Generate the Ethereum N-grams where N=2
# YOUR CODE HERE!

In [18]:
# Function token_count generates the top 10 words for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [19]:
# Use token_count to get the top 10 words for Bitcoin
# YOUR CODE HERE!

In [20]:
# Use token_count to get the top 10 words for Ethereum
# YOUR CODE HERE!

---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin

In [21]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [22]:
# Generate the Bitcoin word cloud
# YOUR CODE HERE!

In [23]:
# Generate the Ethereum word cloud
# YOUR CODE HERE!

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using SpaCy.

In [24]:
import spacy
from spacy import displacy

In [25]:
# Download the language model for SpaCy
# !python -m spacy download en_core_web_sm

In [26]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together
# YOUR CODE HERE!

In [28]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [29]:
# Render the visualization
# YOUR CODE HERE!

In [30]:
# List all Entities
# YOUR CODE HERE!

---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together
# YOUR CODE HERE!

In [32]:
# Run the NER processor on all of the text
# YOUR CODE HERE!

# Add a title to the document
# YOUR CODE HERE!

In [33]:
# Render the visualization
# YOUR CODE HERE!

In [34]:
# List all Entities
# YOUR CODE HERE!

---