# Vader Sentiment Analysis
<p>There are two types of sentiment analysis approaches: polarity based and valence based.</p>
<p>VADER is a valence-based sentiment analyzer.
A valence-based approach takes into consideration the "intensity" of a word as opposed to only its polarity (positive or negative). For example, "great" is treated as more positive than "good".</p>
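
To make the distinction concrete, here is a minimal sketch that scores two sentences both ways. The tiny lexicon `TOY_VALENCE` and its values are illustrative assumptions for this example, not the VADER lexicon, and the function names are hypothetical.

```python
# Toy lexicon: word -> valence (sign = polarity, magnitude = intensity).
# Values are illustrative assumptions, not taken from a real lexicon file.
TOY_VALENCE = {"good": 1.9, "great": 3.1, "bad": -2.5}


def polarity_score(tokens):
    """Polarity only: every known word counts as +1 or -1."""
    return sum(1 if TOY_VALENCE[t] > 0 else -1 for t in tokens if t in TOY_VALENCE)


def valence_score(tokens):
    """Valence: every known word contributes its signed intensity."""
    return sum(TOY_VALENCE[t] for t in tokens if t in TOY_VALENCE)


good = "the service was good".split()
great = "the service was great".split()

print(polarity_score(good), polarity_score(great))  # 1 1
print(valence_score(good), valence_score(great))    # 1.9 3.1
```

A polarity-only score cannot tell the two sentences apart, while the valence score ranks "great" above "good".
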
<p>Sentiment analysis is well suited to sifting through and responding to unstructured data such as social media posts and product reviews.</p>
<p>Ideal scale for classification based on the compound value:</p>
<ol>
<li>Neutral: -0.5 < compound <= 0.5
<li>Positive: compound > 0.5
<li>Negative: compound <= -0.5
</ol>
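
The scale above can be written as a small helper. This is a minimal sketch: the function name `classify_compound` and the `threshold` parameter are assumptions for illustration and are not part of VADER.

```python
def classify_compound(compound, threshold=0.5):
    """Map a compound score in [-1, 1] to a label using the scale above."""
    if compound > threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    # Everything in (-threshold, threshold] is treated as neutral.
    return "Neutral"


for score in (0.8225, 0.1027, 0.5, -0.5):
    print(score, classify_compound(score))
```

Keeping the cutoff as a parameter makes it easy to experiment; the VADER documentation itself commonly uses a much tighter cutoff of 0.05.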

# Importing Libraries and Data Loading

In [19]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [37]:
df = pd.read_csv('Bitcoin News.csv')

In [38]:
df.shape

(6601, 2)

In [39]:
df.columns

Index(['Date', 'Body'], dtype='object')

In [40]:
df.head()

Unnamed: 0,Date,Body
0,6/23/2020,Bitcoin’s price action has been incredibly lac...
1,6/23/2020,"The United States is in turmoil, and the dolla..."
2,6/23/2020,An over 300 million active register accounts o...
3,6/24/2020,Bitcoin started a downside correction from the...
4,6/24/2020,"Throughout 2017 and 2018, many altcoin-advocat..."


# Text Preprocessing


In [41]:
# Dictionary of common English contractions
contraction_dict = {
    "can't": "cannot",
    "couldn't": "could not",
    "didn't": "did not",
    "don't": "do not",
    "doesn't": "does not",
    "won't": "will not",
    "wouldn't": "would not",
    "haven't": "have not",
    "hasn't": "has not",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "it's": "it is",
    "he's": "he is",
    "she's": "she is",
    "they're": "they are",
    "we're": "we are",
    "I'm": "I am"
}

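# Function to expand contractions
def expand_contractions(text, contraction_dict):
    # Normalize typographic apostrophes so that "don’t" matches "don't"
    text = text.replace("’", "'")
    for contraction, expansion in contraction_dict.items():
        # The text is lowercased before this call, so match lowercased keys ("I'm" -> "i'm")
        text = text.replace(contraction.lower(), expansion.lower())
    return text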

# Function to preprocess text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Expand contractions
    text = expand_contractions(text, contraction_dict)

    # Remove punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    text = ' '.join(tokens)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stop words, but keep negation and direction words that carry sentiment
    stop_words = set(stopwords.words('english'))
    words_to_keep = {'against', 'no', 'down', 'not', 'below', 'out', 'above', 'won'}
    stop_words -= words_to_keep
    tokens = [word for word in tokens if word not in stop_words]

    # # Stemming
    # stemmer = PorterStemmer()
    # stemmed_tokens = [stemmer.stem(word) for word in tokens]

    # # Lemmatization
    # lemmatizer = WordNetLemmatizer()
    # lemmatized_tokens = [lemmatizer.lemmatize(word) for word in stemmed_tokens]
    # Note: stemming and lemmatization are deliberately skipped.
    # The general rule is to lemmatize only if it improves performance. Not lemmatizing is
    # the conservative approach and should be favored unless there is a significant gain.
    # VADER rates different forms of a word differently, so its input should not be
    # stemmed or lemmatized.

    # Join tokens back into a single string
    preprocessed_text = ' '.join(tokens)

    return preprocessed_text

In [None]:
# Apply preprocessing to the 'Body' column
df['body_preprocessed'] = df['Body'].apply(preprocess_text)



In [44]:
# Display the original and preprocessed text
print("Original Text:")
print(df['Body'].head(5))
print("\nPreprocessed Text:")
print(df['body_preprocessed'].head(5))

Original Text:
0    Bitcoin’s price action has been incredibly lac...
1    The United States is in turmoil, and the dolla...
2    An over 300 million active register accounts o...
3    Bitcoin started a downside correction from the...
4    Throughout 2017 and 2018, many altcoin-advocat...
Name: Body, dtype: object

Preprocessed Text:
0    bitcoin price action incredibly lackluster rec...
1    united states turmoil dollar weakening slowly ...
2    300 million active register accounts paypal hi...
3    bitcoin started downside correction 9 780 swin...
4    throughout 2017 2018 many altcoin advocates po...
Name: body_preprocessed, dtype: object
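
The preprocessing function takes negation and direction words out of the stop-word set so that they survive filtering. Below is a toy sketch of why that matters. It uses a small hypothetical stop-word set (`BASE_STOP_WORDS`) instead of the NLTK list, and the helper name is an assumption for illustration.

```python
# Hypothetical mini stop-word set standing in for the NLTK English list.
BASE_STOP_WORDS = {"the", "is", "a", "was", "not", "no"}

# Words that flip or shift sentiment and should survive filtering.
WORDS_TO_KEEP = {"not", "no"}


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the given stop-word set."""
    return [t for t in tokens if t not in stop_words]


tokens = "the outlook is not good".split()

print(remove_stop_words(tokens, BASE_STOP_WORDS))
# ['outlook', 'good']  (the negation is lost)
print(remove_stop_words(tokens, BASE_STOP_WORDS - WORDS_TO_KEEP))
# ['outlook', 'not', 'good']  (the negation is preserved)
```

With the unmodified set, "not good" collapses to "good", which a sentiment scorer would read as positive.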


# Vader Polarity Scores

In [45]:
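# Optional: installs the standalone vaderSentiment package; the next cell uses NLTK's bundled VADER instead
%pip install vaderSentiment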



In [None]:
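import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The NLTK implementation of VADER needs its lexicon downloaded once
nltk.download('vader_lexicon')

# Initialize Vader sentiment analyzer
analyzer = SentimentIntensityAnalyzer()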

# Function to get sentiment scores using Vader
def get_sentiment_scores(text):
    # Get sentiment scores
    scores = analyzer.polarity_scores(text)
    return pd.Series([scores, scores['compound']])

# Apply Vader sentiment analysis to the preprocessed text
df[['sentiment_scores', 'compound_scores']] = df['body_preprocessed'].apply(get_sentiment_scores)

In [51]:
df.head(5)

Unnamed: 0,Date,Body,body_preprocessed,"(sentiment_scores, compound_scores)",sentiment_scores,compound_scores
0,6/23/2020,Bitcoin’s price action has been incredibly lac...,bitcoin price action incredibly lackluster rec...,"({'neg': 0.056, 'neu': 0.878, 'pos': 0.066, 'c...","{'neg': 0.057, 'neu': 0.874, 'pos': 0.068, 'co...",0.1027
1,6/23/2020,"The United States is in turmoil, and the dolla...",united states turmoil dollar weakening slowly ...,"({'neg': 0.171, 'neu': 0.618, 'pos': 0.211, 'c...","{'neg': 0.171, 'neu': 0.618, 'pos': 0.211, 'co...",0.4019
2,6/23/2020,An over 300 million active register accounts o...,300 million active register accounts paypal hi...,"({'neg': 0.0, 'neu': 0.752, 'pos': 0.248, 'com...","{'neg': 0.0, 'neu': 0.758, 'pos': 0.242, 'comp...",0.8225
3,6/24/2020,Bitcoin started a downside correction from the...,bitcoin started downside correction 9 780 swin...,"({'neg': 0.075, 'neu': 0.674, 'pos': 0.251, 'c...","{'neg': 0.079, 'neu': 0.655, 'pos': 0.266, 'co...",0.8555
4,6/24/2020,"Throughout 2017 and 2018, many altcoin-advocat...",throughout 2017 2018 many altcoin advocates po...,"({'neg': 0.055, 'neu': 0.812, 'pos': 0.133, 'c...","{'neg': 0.055, 'neu': 0.817, 'pos': 0.127, 'co...",0.4588
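
The `compound_scores` column holds VADER's normalized score in the range [-1, 1]. As a rough sketch, assuming the normalization used in the reference VADER implementation (a constant `alpha` of 15), a raw valence sum is squashed as shown below. The function name and the input values are illustrative; they are example raw sums, not values computed from this dataset.

```python
import math


def normalize(raw_score, alpha=15.0):
    """Squash an unbounded raw valence sum into [-1, 1]."""
    norm = raw_score / math.sqrt(raw_score * raw_score + alpha)
    # Guard against floating-point drift outside the valid range.
    return max(-1.0, min(1.0, norm))


print(round(normalize(1.9), 4))  # 0.4404
print(round(normalize(3.1), 4))  # 0.6249
print(normalize(0.0))            # 0.0
```

Because the curve flattens as the raw sum grows, long texts with many sentiment-bearing words tend to land close to -1 or 1.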
