### KNN for evidence-based disinformation detection in the crypto space

In [4]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

# Load data
df = pd.read_csv("crypto_news_dataset.csv")
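# Drop rows without text and reset the index so row labels match the positions returned by kneighbors
df = df.dropna(subset=['text']).reset_index(drop=True)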

# Define function to preprocess text
def preprocess_text(text):
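    # Cast to str first so non-string values do not break .lower()
    text = str(text).lower()
    # Strip punctuation, keeping word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text)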
    return text

# Preprocess text in the dataset
df['text'] = df['text'].apply(preprocess_text)

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(df['text'])

# Fit the nearest-neighbor index (unsupervised; brute-force search with cosine distance)
knn = NearestNeighbors(n_neighbors=10, algorithm='brute', metric='cosine')
knn.fit(tfidf_matrix)

# Define function to get similar articles
def get_similar_articles(query, k=10):
    # Preprocess the query
    query = preprocess_text(query)
    
    # Vectorize the query
    query_vector = vectorizer.transform([query])
    
    # Find the k nearest neighbors
    distances, indices = knn.kneighbors(query_vector, n_neighbors=k)
    
    # Create a list of tuples containing the relevant information
    similar_articles = []
    for index, distance in zip(indices[0], distances[0]):
        row = df.iloc[index]  # kneighbors returns positional indices, so use iloc
        score = 1 - distance  # convert cosine distance to cosine similarity
        similar_articles.append((row['text'], row['source'], row['url'], score))
    
    # Calculate analysis parameters
    similarity_scores = [score for _, _, _, score in similar_articles]
    mean_similarity_score = sum(similarity_scores) / len(similarity_scores)
    std_similarity_score = np.std(similarity_scores)
    
    # Print the list of similar articles
    for article in similar_articles:
        text, source, url, score = article
        print(f"Text: {text}\nSource: {source}\nURL: {url}\nSimilarity Score: {score}\n")
    
    # Print analysis parameters
    print(f"Number of similar articles: {k}")
    print(f"Mean similarity score: {mean_similarity_score}")
    print(f"Standard deviation of similarity score: {std_similarity_score}")
    
# Example usage
query = "Walmart and Litecoin Payment News Debunked by Walmart Spokesperson, LTC Prices Shudder from Fake News"
get_similar_articles(query)


Text: two days ago technology giant microsoft started accepting bitcoin as a means of payment for users to fund their microsoft accounts with this move microsoft became the largest company in the world to accept bitcoin reddit users from rbitcoin eagerly began posting to rxboxone and other microsoftrelated subreddits about this addition explaining what bitcoin is and tipping people many xbox fans quickly caught on and started asking questions on rbitcoin such as what is the benefit of switching from cashcredit card to bitcoin and what are some recommended wallets bitcoiners have been tipping users in all subreddits relevant to microsoft trying to increase knowledge of microsofts acceptance of bitcoin and bitcoin in general also read first reactions to microsofts adoption of bitcoinchange of opinionsome of the recent posts on rbitcoin about rxboxonereddit user uatheist_saint posted links to two posts on rxboxone one from a year ago and one from two days ago a year ago a user asked what 

This code uses a TF-IDF vectorizer to turn each article into a sparse numerical feature vector, then fits scikit-learn's `NearestNeighbors` on those vectors with cosine distance as the metric. No labels are involved, so this is an unsupervised retrieval step rather than a trained classifier. Given a query, the code preprocesses it the same way as the dataset, vectorizes it with the already-fitted vectorizer, and retrieves the `k` most similar articles (10 by default), reporting each article's cosine similarity as 1 minus its cosine distance. Finally, it prints summary statistics: the number of retrieved articles, the mean similarity score, and the standard deviation of the similarity scores.
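To make the distance-to-similarity conversion concrete, here is a minimal sketch of the same pipeline on a four-sentence toy corpus. The corpus and query strings are made up for illustration and are not taken from `crypto_news_dataset.csv`. It checks that `1 - distance` from `NearestNeighbors` agrees with the value `cosine_similarity` computes directly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the 'text' column of the dataset
corpus = [
    "walmart denies litecoin payment partnership",
    "microsoft starts accepting bitcoin payments",
    "litecoin price drops after fake walmart news",
    "ethereum developers announce network upgrade",
]

# Same configuration as the notebook cell, with a smaller neighbor count
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
knn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine")
knn.fit(tfidf_matrix)

# Vectorize an illustrative query with the fitted vocabulary
query_vector = vectorizer.transform(["walmart litecoin fake news"])
distances, indices = knn.kneighbors(query_vector)

# Cosine similarity is 1 minus cosine distance
similarities = 1 - distances[0]

# Cross-check against similarities computed directly for every document
direct = cosine_similarity(query_vector, tfidf_matrix)[0]

for idx, sim in zip(indices[0], similarities):
    print(f"{sim:.3f}  {corpus[idx]}")
```

In this toy setup the sentence that shares the most query terms is ranked first. The returned indices are positions in the fitted matrix, which is why the lookup into the source data has to be positional as well.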