### Using NLP for evidence-enhanced disinformation detection in crypto news

This algorithm implements a content-based recommendation system for crypto news articles using NLP. It works on a dataset of crypto news articles stored as a CSV file with columns for year, title, text, source, and URL. Given a user-supplied text, it returns the most relevant articles as [text, source, url] together with a relevancy score, which helps judge how authentic or relevant each article is. It also reports analysis parameters that summarize the retrieved results.


In [9]:
from textblob import TextBlob

def get_sentiment_score(text):
    """
    Returns the sentiment score for a given text: TextBlob's polarity,
    a float in [-1.0, 1.0].
    """
    blob = TextBlob(text)
    sentiment_score = blob.sentiment.polarity
    return sentiment_score

def get_polarity_score(text):
    """
    Returns what this notebook calls the polarity score for a given text:
    TextBlob's subjectivity, a float in [0.0, 1.0].
    """
    blob = TextBlob(text)
    polarity_score = blob.sentiment.subjectivity
    return polarity_score

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import numpy as np
import re
from nltk.stem import PorterStemmer
from nltk.sentiment import SentimentIntensityAnalyzer

def preprocess_text(text):
    """
    Preprocesses text by removing non-alphabetic characters, converting to lowercase, 
    tokenizing, and stemming.
    """
    ps = PorterStemmer()
    text = re.sub('[^a-zA-Z]', ' ', str(text))
    text = text.lower()
    text = text.split()
    text = [ps.stem(word) for word in text]
    text = ' '.join(text)
    return text

def get_similar_articles(query, df, n=10):
    """
    Returns (similar_articles, analysis_params): a list of up to n
    (text, source, url, score) tuples sorted by descending relevancy score,
    and a dict of summary statistics for those articles.
    """
    # preprocess query
    query = preprocess_text(query)
    
    # preprocess text on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['text'] = df['text'].apply(preprocess_text)
    
    # compute TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(df['text'])
    
    # compute cosine similarity between query and all documents
    query_vector = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    
    # get indices of top n similar documents
    similar_indices = cosine_similarities.argsort()[:-n-1:-1]
    
    # create list of tuples containing [text, source, url] and relevancy score
    similar_articles = []
    for index in similar_indices:
        # argsort yields row positions, so use positional indexing
        row = df.iloc[index]
        score = cosine_similarities[index]
        similar_articles.append((row['text'], row['source'], row['url'], score))
    
    # calculate sentiment and polarity scores
    sentiment_scores = [get_sentiment_score(text) for text, _, _, _ in similar_articles]
    polarity_scores = [get_polarity_score(text) for text, _, _, _ in similar_articles]
    
    # calculate mean and standard deviation of sentiment and polarity scores
    mean_sentiment_score = np.mean(sentiment_scores)
    std_sentiment_score = np.std(sentiment_scores)
    mean_polarity_score = np.mean(polarity_scores)
    std_polarity_score = np.std(polarity_scores)
    
    # return dictionary with analysis parameters
    analysis_params = {
        "num_similar_articles": len(similar_articles),
        "mean_sentiment_score": mean_sentiment_score,
        "std_sentiment_score": std_sentiment_score,
        "mean_polarity_score": mean_polarity_score,
        "std_polarity_score": std_polarity_score
    }
    
    return similar_articles, analysis_params


# df is assumed to be the articles DataFrame loaded in an earlier cell
query = "Walmart and Litecoin Payment News Debunked by Walmart Spokesperson, LTC Prices Shudder from Fake News"
similar_articles, analysis_params = get_similar_articles(query, df)
# print list of similar articles
for article in similar_articles:
    text, source, url, score = article
    print(f"Text: {text}\nSource: {source}\nURL: {url}\nRelevancy Score: {score}\n")
    
print(f"Number of similar articles: {analysis_params['num_similar_articles']}")
print(f"Mean sentiment score: {analysis_params['mean_sentiment_score']}")
print(f"Standard deviation of sentiment score: {analysis_params['std_sentiment_score']}")
print(f"Mean polarity score: {analysis_params['mean_polarity_score']}")
print(f"Standard deviation of polarity score: {analysis_params['std_polarity_score']}")


Text: week ago i had the pleasur of write an articl i d dream of write for month and month on end gyft add walmart gift card ala all good thing are not meant to last for the last week bitcoin around the countri have had the pleasur of buy walmart gift card with bitcoin receiv back in the form of gyft point and in essenc spend bitcoin at walmart on ga and groceri make bitcoin a cheaper option to buy ga ha been a long stand dream of the bitcoin commun thi wa the first step toward realiz that dream and said step ha now been rever despit the loss of walmart gyft is gear to provid more to smaller busi with the launch of gyft cloud gyft inform custom via email that due to reason outsid of gyft s control they are unabl to stock walmart gift card ani longer the bitcoin commun know full well that vinni at gyft would not unlist walmart gift card from gyft s impress registri unless he wa forc to there is no doubt in my mind that the initi deci that ha culmin in today s end of gyft s support of wa

#### Output Result Understanding

The output analysis parameters suggest the following:

Number of similar articles: 10. This is the number of articles returned for the query, here the default of n = 10.

Mean sentiment score: 0.06501299226163085. This is the average sentiment score (TextBlob's polarity) of the retrieved articles. The score ranges from -1 (most negative) to 1 (most positive), with 0 being neutral. A mean of 0.065 is close to neutral, with a very slight positive lean.

Standard deviation of sentiment score: 0.060099302742572205. This measures how much the sentiment scores of the retrieved articles vary. A lower standard deviation means the scores cluster closely around the mean, while a higher one means they are more spread out. A value of 0.060 on a scale spanning -1 to 1 is low, which suggests that the retrieved articles are consistent in sentiment.

Mean polarity score: 0.40925166040176963. This is the average of what the code calls the polarity score, which is TextBlob's subjectivity measure. It ranges from 0 (most objective) to 1 (most subjective). A mean of 0.409 indicates moderately subjective text that leans toward the objective end of the scale.

Standard deviation of polarity score: 0.09215631159297351. This measures how much the polarity (subjectivity) scores of the retrieved articles vary, and is read the same way as the standard deviation of the sentiment score. A value of 0.092 is larger than the spread of the sentiment scores but still small on a 0 to 1 scale, so the retrieved articles differ only modestly in subjectivity.
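
One detail worth keeping in mind when reading the two standard deviations: they come from np.std, which by default divides by N (the population standard deviation) rather than N - 1. The sketch below illustrates the difference with the standard library's statistics module; the score values are hypothetical and are not taken from the dataset.

```python
import statistics

# Hypothetical per-article scores, for illustration only (not from the dataset).
scores = [0.02, 0.05, 0.07, 0.12]

# Divides by N, which is what np.std(scores) does by default.
population_std = statistics.pstdev(scores)

# Divides by N - 1, the equivalent of np.std(scores, ddof=1).
sample_std = statistics.stdev(scores)

print(f"population std: {population_std:.4f}")
print(f"sample std:     {sample_std:.4f}")
```

With only 10 retrieved articles, the sample version would be about 5% larger than the reported values, which does not change the interpretation above.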

#### Code Explanation

The function get_similar_articles takes a query string, a dataframe df containing the news articles, and an optional parameter n for the number of similar articles to return (default 10). The function first preprocesses the query string and the text of the news articles in the same way: it replaces non-alphabetic characters with spaces, converts the text to lowercase, splits it into tokens on whitespace, and reduces each token to its stem with the Porter stemmer.
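
To make the cleaning steps concrete, here is a minimal sketch of the regex, lowercase, and split steps using only the standard library. The Porter stemming step is deliberately left out so the example has no NLTK dependency; clean_only and the sample headline are hypothetical names and data introduced for illustration.

```python
import re

def clean_only(text):
    """Same cleaning as preprocess_text, but WITHOUT the stemming step."""
    text = re.sub('[^a-zA-Z]', ' ', str(text))  # digits and punctuation become spaces
    tokens = text.lower().split()               # lowercase, then split on whitespace
    return ' '.join(tokens)

# Hypothetical headline, for illustration only.
headline = "Gyft's LTC price: $60,000!"
cleaned = clean_only(headline)
print(cleaned)
```

Note that the regex removes digits as well as punctuation, so prices and dates do not contribute to the similarity score, and possessives such as "Gyft's" are split into two tokens. This matches the "gyft s" fragments visible in the output above.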

Then it computes the TF-IDF (term frequency-inverse document frequency) matrix, in which each row is an article and each entry is a weight that reflects how important a word is to that article relative to the whole collection. The query is transformed with the same fitted vectorizer, and the function computes the cosine similarity between the query vector and every article vector. Cosine similarity is the cosine of the angle between two non-zero vectors; for non-negative TF-IDF weights it ranges from 0 (no shared terms) to 1 (same direction).
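
The following is a small, self-contained sketch of the TF-IDF and cosine similarity idea in plain Python. It is not the notebook's implementation, which relies on scikit-learn's TfidfVectorizer; it only mirrors the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults, and it tokenizes on whitespace as a simplification. The helper names and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (as dicts) and a function for new text."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

    def vectorise(text):
        # Terms never seen in the corpus are ignored, as with a fitted vectorizer.
        counts = Counter(t for t in text.split() if t in idf)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}

    return [vectorise(doc) for doc in docs], vectorise

def cosine(u, v):
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy, already-preprocessed documents (hypothetical).
docs = [
    "walmart litecoin payment fake",
    "bitcoin price rise",
    "walmart gift card bitcoin",
]
doc_vecs, vectorise = tfidf_vectors(docs)
query_vec = vectorise("walmart litecoin news")
scores = [cosine(query_vec, d) for d in doc_vecs]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

The first document shares two terms with the query, the third shares one, and the second shares none, so they rank in that order and the second scores exactly zero.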

It then builds a list of n tuples, where each tuple contains the text of the article, its source, its URL, and its relevancy score, sorted by relevancy score in descending order. The function also calculates sentiment and polarity scores for each retrieved article using TextBlob, a Python library for processing textual data. These scores are computed on the preprocessed (stemmed) text. The sentiment score is TextBlob's polarity and measures how positive or negative the article is, while the score the code calls polarity is TextBlob's subjectivity and measures how subjective the article is.
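
The descending order comes from the slice [:-n-1:-1] applied to the result of argsort. As a rough illustration of that idiom, the sketch below reproduces it with a plain Python list instead of a NumPy array; top_n_indices and the scores are hypothetical.

```python
def top_n_indices(scores, n):
    """Indices of the n largest scores, best first."""
    # Indices sorted by ascending score, like numpy's argsort().
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    # Walk backwards from the end and stop after n items.
    return ascending[:-n - 1:-1]

# Hypothetical relevancy scores, for illustration only.
scores = [0.10, 0.80, 0.30, 0.55]
best_two = top_n_indices(scores, 2)
everything = top_n_indices(scores, 10)
print(best_two, everything)
```

When n is larger than the number of articles, the slice simply returns every index, so fewer than n results come back.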

Finally, alongside the list of articles, the function returns a dictionary of analysis parameters: the number of similar articles, the mean and standard deviation of the sentiment score, and the mean and standard deviation of the polarity score.

In the main code, a sample query is passed to the get_similar_articles function, which returns the list of similar articles and the analysis parameters. Each article's text, source, URL, and relevancy score is printed to the console, followed by the analysis parameters.

Overall, this code implements a content-based recommendation system for news articles that combines text preprocessing, TF-IDF weighting, and cosine similarity to retrieve articles similar to a query string. It also summarizes the sentiment and polarity scores of the recommended articles as means and standard deviations.