#### Models We Found
The following code cells contain a selection of misinformation classifier models that are published on huggingface.co and can be used directly in our setup. To inspect a model's details (training set, performance metrics, etc.), follow the provided link to the respective model card. To load a model into the Tweet-Buster application, execute the corresponding code cell.

In [45]:
import numpy as np
import torch
import sys
import warnings
import time
from transformers import AutoTokenizer, AutoModelForSequenceClassification

Model: https://huggingface.co/XSY/albert-base-v2-fakenews-discriminator 

In [46]:
tokenizer = AutoTokenizer.from_pretrained('XSY/albert-base-v2-fakenews-discriminator')
model = AutoModelForSequenceClassification.from_pretrained('XSY/albert-base-v2-fakenews-discriminator')

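def get_veracity(text):
    # Returns the model's logits for `text`, or -1 if the text cannot be processed.
    text = str(text)
    while True:
        try:
            tokens = tokenizer.encode(text, return_tensors='pt')
            result = model(tokens)
            break
        except Exception:
            # The model call failed (e.g. the input is too long): drop 10 characters and retry.
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1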
    return result.logits
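
Every `get_veracity` cell in this notebook repeats the same retry loop: if the model call raises, the last 10 characters of the text are dropped and the call is retried, until the text is 20 characters or shorter. The sketch below isolates that pattern in plain Python. `shrink_until_ok` and the stand-in `fake_model` are hypothetical names used only for illustration; they are not part of Tweet-Buster.

```python
def shrink_until_ok(text, fn, step=10, min_len=20):
    """Call fn(text); on failure drop `step` trailing characters and retry.

    Returns fn's result, or -1 once the text is too short to shrink further
    (the same sentinel value the notebook cells return).
    """
    text = str(text)
    while True:
        try:
            return fn(text)
        except Exception:
            if len(text) > min_len:
                text = text[:-step]
            else:
                return -1


def fake_model(text):
    # Hypothetical stand-in for tokenizer + model: rejects long inputs.
    if len(text) > 50:
        raise ValueError('input too long')
    return len(text)


# 75 characters fail, as do 65 and 55; the call succeeds at 45 characters.
result_long = shrink_until_ok('x' * 75, fake_model)
# A short text that always fails cannot be shrunk, so the sentinel is returned.
result_short = shrink_until_ok('short', lambda t: 1 / 0)
print(result_long, result_short)
```

If the failures are caused by over-long inputs, asking the tokenizer to truncate (for example `truncation=True` in the `encode` call) might make the loop unnecessary, but that is an assumption and has not been verified against the models listed here.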

Model: https://huggingface.co/hamzab/roberta-fake-news-classification

In [11]:
tokenizer = AutoTokenizer.from_pretrained('hamzab/roberta-fake-news-classification')
model = AutoModelForSequenceClassification.from_pretrained('hamzab/roberta-fake-news-classification')

def get_veracity(text):
    while True:
        try:
            tokens = tokenizer.encode(str(text), return_tensors='pt')
            result = model(tokens)
            break
        except Exception as e:
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1
    return result.logits

Model: https://huggingface.co/roupenminassian/TwHIN-BERT-Misinformation-Classifier   (2.25 GB!!)

In [12]:
tokenizer = AutoTokenizer.from_pretrained('roupenminassian/TwHIN-BERT-Misinformation-Classifier')
model = AutoModelForSequenceClassification.from_pretrained('roupenminassian/TwHIN-BERT-Misinformation-Classifier')

def get_veracity(text):
    while True:
        try:
            tokens = tokenizer.encode(str(text), return_tensors='pt')
            result = model(tokens)
            break
        except Exception as e:
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1
    return result.logits

Model: https://huggingface.co/dlentr/lie_detection_distilbert/tree/main

In [113]:
tokenizer = AutoTokenizer.from_pretrained('dlentr/lie_detection_distilbert')
model = AutoModelForSequenceClassification.from_pretrained('dlentr/lie_detection_distilbert')

def get_veracity(text):
    while True:
        try:
            tokens = tokenizer.encode(str(text), return_tensors='pt')
            result = model(tokens)
            break
        except Exception as e:
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1
    return result.logits

Model: https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis (3 Classes)

In [29]:
tokenizer = AutoTokenizer.from_pretrained('mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis')
model = AutoModelForSequenceClassification.from_pretrained('mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis')

def get_veracity(text):
    while True:
        try:
            tokens = tokenizer.encode(str(text), return_tensors='pt')
            result = model(tokens)
            break
        except Exception as e:
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1
    return result.logits

Model: https://huggingface.co/vectara/hallucination_evaluation_model (Probability)

In [51]:
tokenizer = AutoTokenizer.from_pretrained('vectara/hallucination_evaluation_model')
model = AutoModelForSequenceClassification.from_pretrained('vectara/hallucination_evaluation_model')

def get_veracity(text):
    while True:
        try:
            tokens = tokenizer.encode(str(text), return_tensors='pt')
            result = model(tokens)
            break
        except Exception as e:
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1
    tensor = result.logits
    return tensor

Model: https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier (Probability / 5)

In [101]:
tokenizer = AutoTokenizer.from_pretrained('HuggingFaceFW/fineweb-edu-classifier')
model = AutoModelForSequenceClassification.from_pretrained('HuggingFaceFW/fineweb-edu-classifier')

def get_veracity(text):
    while True:
        try:
            tokens = tokenizer.encode(str(text), return_tensors='pt')
            result = model(tokens)
            break
        except Exception as e:
            if len(text) > 20:
                text = text[:-10]
            else:
                return -1
    tensor = result.logits
    value = tensor.item()
    value = int(round(max(0, min(value, 5))))
    value = value/5
    tensor = tensor.detach()
    tensor[0, 0] = value
    tensor.requires_grad_(True)
    return tensor

Model: RANDOM

In [53]:
def get_veracity(text):
    return torch.rand(1, 1, requires_grad=True)

Model: ADD YOUR OWN

Your function should return a tensor in one of the following formats:

For probability classification (a single value between 0 and 1):
<code>tensor([[0.5]], requires_grad=True)</code>

For class classification (one score per class):
<code>tensor([[0.01,  0.99]], requires_grad=True)</code>

In [None]:
def get_veracity(text):
    
    # Your model code here

    tensor = torch.rand(1, 1, requires_grad=True)  # <-- Store your result in the tensor variable
    return tensor

#### Test the Model
The following code cell lets you test the loaded model before deploying it on live-streamed Twitter (X) data. The test sample is the text of a tweet by Donald J. Trump, but you can change it to your liking. If the model works as intended, the cell output shows the number of possible classes and the outcome of the model.

In [52]:
text = 'The 75,000,000 great American Patriots who voted for me, AMERICA FIRST, and MAKE AMERICA GREAT AGAIN, will have a GIANT VOICE long into the future. They will not be disrespected or treated unfairly in any way, shape or form!!!'
classification = get_veracity(text)
if len(classification.tolist()[0]) > 1:
    print('Number of possible Classes: ' + str(len(classification.tolist()[0])))
    print('Outcome: ' + str(int(torch.argmax(classification))))
else:
    print('Number of possible Classes: 1')
    print('Outcome: ' + str(classification.item()))

Number of possible Classes: 1
Outcome: 0.23633378744125366
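
For class models, the test cell reports only the index of the largest logit. To also see how strongly the model prefers that class, the logits can be passed through a softmax. The sketch below does this in plain Python on a flat list such as `classification.tolist()[0]`; the example logits are made up for illustration and do not come from any of the models above.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [-1.2, 2.3]  # made-up values, shaped like a 2-class output
probs = softmax(example_logits)
predicted = probs.index(max(probs))  # same class that an arg-max would pick
print('Outcome:', predicted)
print('Probabilities:', [round(p, 3) for p in probs])
```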


#### Setup Text Sanitization and Response Classes
The following code imports the Flask libraries, defines a function for tweet-text sanitization, and initializes a list storing the response classes. A more transparent example of how the tweet-text sanitization works is given at the end of this notebook. You can optionally reverse the response classes and display the classification output in your browser as emojis only. To do so, set the respective variable (reverse_resp_classes, emojis_only) in the following code cell to 1.

In [49]:
from flask import Flask, request, jsonify
from flask_cors import CORS
import time
import re

# Set your display preferences here!
reverse_resp_classes = 0
emojis_only = 0


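def get_tweet_text(select_tweet):
    # Tweets start with "<name>@<handle>·"; anything else is treated as an ad.
    if re.match(r'^[^@]+@[^@]+·', select_tweet):
        # Drop the "<name>@<handle>·" header.
        text = re.sub(r'^[^@]+@[^@]+·', '', select_tweet)
        # Drop the leading timestamp ("Jan 6" or "10h" / "5m") and the year (", 2021").
        text = re.sub(r'^(?:[A-Za-z]{3}\s\d+|\d+[hm])', '', text)
        text = re.sub(r', \d{4}', '', text)
        # Tweets that only embed another source carry no text of their own.
        if text.startswith('From'):
            return 'SHORT'
        # Remove the "From <source>" footer and the view-count label.
        text = re.sub(r'\dFrom.*', '', text)
        text = text.replace('views', '')
        words = text.split()
        if not words:
            return ''
        last_word = words[-1]
        # Engagement counters are glued to the last word; cut it at its first digit.
        if bool(re.search(r'\d', last_word)):
            text = text.replace(last_word, '')
            last_word = last_word[:re.search(r'\d', last_word).start()]
            text = text + last_word
        # Require at least one run of five letters, otherwise treat the tweet as empty.
        if not bool(re.search(r'[a-zA-Z]{5,}', text)):
            text = ''
        return text
    else:
        return 'THIS_TWEET_IS_AN_AD'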


response_classes = [['This is an AD! &#128566;'],
                        ['Bli Bla Blup!'],
                        ['Likely FALSE! &#128533;', 'Likely TRUE! &#128512;'],
                        ['Likely FALSE! &#128533;', 'Not entirely clear &#129488;', 'Likely TRUE! &#128512;'],
                        ['Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;'],
                        ['Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Not entirely clear &#129488;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;'],
                        ['Quite WRONG! &#128549;', 'Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;', 'Quite TRUE! &#128513;'],
                        ['Quite WRONG! &#128549;', 'Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Not entirely clear &#129488;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;', 'Quite TRUE! &#128513;'],
                        ['Liar, Liar! &#129396;', 'Quite WRONG! &#128549;', 'Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;', 'Quite TRUE! &#128513;', 'Super TRUE! &#129299;'],
                        ['Liar, Liar! &#129396;', 'Quite WRONG! &#128549;', 'Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Not entirely clear &#129488;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;', 'Quite TRUE! &#128513;', 'Super TRUE! &#129299;'],
                        ['PANTS ON FIRE &#128086;&#128293;', 'Liar, Liar! &#129396;', 'Quite WRONG! &#128549;', 'Likely FALSE! &#128533;', 'Perhaps not quite correct! &#129320;', 'Perhaps a little bit correct! &#128524;', 'Likely TRUE! &#128512;', 'Quite TRUE! &#128513;', 'Super TRUE! &#129299;', 'Spreading the GOSPEL OF TRUTH! &#129321;']]
# Optionally flip the label order of every multi-label row.
if reverse_resp_classes == 1:
    for i, val in enumerate(response_classes):
        if i >= 2:
            response_classes[i] = list(reversed(val))
# Optionally strip the wording and keep only the emoji HTML entities.
if emojis_only == 1:
    for i, val in enumerate(response_classes):
        if i >= 2:
            for j in range(len(val)):
                # findall keeps labels with two emojis (PANTS ON FIRE) intact
                response_classes[i][j] = ''.join(re.findall(r'&#\d{6};', val[j]))



#### Start Flask Instance
Executing the following code cell starts the Flask instance. If everything works as intended, the instance listens for requests on port 5000 of your local machine. After executing the cell, leave it running and start browsing Twitter (X). Once you are done browsing, stop the execution of the cell manually.

In [None]:
tweet_texts = []
app = Flask(__name__)
CORS(app)

@app.route('/process', methods=['POST'])
def process_text():
    data = request.get_json()
    tweet_text = data['text']

    processed_text = get_tweet_text(tweet_text)

    if len(processed_text) > 10:
        if processed_text == 'THIS_TWEET_IS_AN_AD':
            return_text = response_classes[0][0]
        else:
            classification = get_veracity(processed_text)
            if not torch.is_tensor(classification):
                # get_veracity returns -1 when the model could not process the text
                return_text = 'Not enough text! &#128683;'
            elif len(classification.tolist()[0]) > 1:
                # Class model: pick the label list matching the number of classes
                n_classes = len(classification.tolist()[0])
                return_text = response_classes[n_classes][int(torch.argmax(classification))]
            else:
                # Probability model: map the value onto the 10-label scale
                print('Outcome: ' + str(classification.item()))
                return_text = response_classes[10][max(0, min(round(classification.item() * 10), 9))]
    else:
        return_text = 'Not enough text! &#128683;'

    # Keep raw and cleaned text in alternating order for later inspection
    tweet_texts.append(tweet_text)
    tweet_texts.append(processed_text)
    return jsonify(return_text)

if __name__ == '__main__':
    app.run(port=5000)
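
The way `process_text` turns a model output into a position in `response_classes` can be sketched in isolation. The helper below is a hypothetical illustration (`pick_label_index` is not defined in the notebook) and works on a flat Python list instead of a tensor: class models select the row that matches their number of classes, while probability models always use the 10-label row.

```python
def pick_label_index(scores):
    """Return (row, column) into response_classes for one tweet.

    `scores` is the flat list of model outputs: several values for a
    class model, a single value for a probability model.
    """
    if len(scores) > 1:
        # Class model: row = number of classes, column = index of the largest score.
        return len(scores), scores.index(max(scores))
    # Probability model: scale to tenths and clamp the column to 0..9.
    column = max(0, min(round(scores[0] * 10), 9))
    return 10, column


two_class = pick_label_index([0.01, 0.99])
three_class = pick_label_index([0.1, 0.7, 0.2])
probability = pick_label_index([0.23633378744125366])
certain = pick_label_index([1.0])
print(two_class, three_class, probability, certain)
```

With the default (non-reversed) labels, the probability 0.236 from the test cell above lands in column 2 of the 10-label row, which is 'Quite WRONG!'. A value of exactly 1.0 would round to 10 and is clamped to the last column, 9.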

After completing your Twitter (X) session, the following cell gives you the opportunity to view all the tweets that you have come across during your browsing session.

In [55]:
# tweet_texts alternates between raw and cleaned entries
for idx, txt in enumerate(tweet_texts):
    if idx % 2 == 0:
        print('RAW TWEET:')
    else:
        print('CLEANED TWEET:')
    print(txt)
    print('--------------------')
    print('\n')

RAW TWEET:
Saruei @Saruei_·12hTrust is the most dangerous weapon  #MGS35332K28K413K
--------------------


CLEANED TWEET:
Trust is the most dangerous weapon  #MGS
--------------------


RAW TWEET:
The_Real_Fly@The_Real_Fly·2h23 years later they blame Saudi Arabia for 9/11From CBS Evening News38371358.1K
--------------------


CLEANED TWEET:
23 years later they blame Saudi Arabia for 
--------------------


RAW TWEET:
Catizen@CatizenAI·22h#CatizenVibe - Heal the World  
Calling all Catizens!
Vote, Share, Participate to rescue stray cats! 

Post your heartwarming story with stray Kitty and get $Ton rewards! 

Share Your Story with Stray Cats on Twitter and Telegram

Format: 
-Video or Picture with TextShow more4K58K30K399K
--------------------


CLEANED TWEET:
#CatizenVibe - Heal the World  
Calling all Catizens!
Vote, Share, Participate to rescue stray cats! 

Post your heartwarming story with stray Kitty and get $Ton rewards! 

Share Your Story with Stray Cats on Twitter and Telegram



#### Demonstration of Tweet Text Sanitization
The following code cell illustrates the different stages of removing miscellaneous string content that comes with scraped tweets.

In [22]:
select_tweet = 'Historic Vids@historyinmemes·10hRussian hockey is different man1446315.9K1.4M'
#select_tweet = tweet_texts[0]
#select_tweet = 'Donald J. Trump@realDonaldTrump·Jan 6, 2021RSBN TV·23.2M views3:32:03 / 4:55:211:23:1723.2M views3:32:02From pscp.tv34K52K234K'
#select_tweet = 'Donald J. Trump@realDonaldTrump·Jan 5, 2021From Team Trump (Text TRUMP to 88022)14K17K87K'
print(select_tweet)
print('--------------------')

if re.match(r'^[^@]+@[^@]+·', select_tweet):
    text = re.sub(r'^[^@]+@[^@]+·', '', select_tweet)
    print(text)
    print('--------------------')

    text = re.sub(r'^(?:[A-Za-z]{3}\s\d+|\d+[hm])', '', text)
    print(text)
    print('--------------------')

    text = re.sub(r', \d{4}', '', text)
    print(text)
    print('--------------------')

    if text.startswith('From'):
        print('RETURNING SHORT!')
        print('--------------------')
        sys.exit()

    text = re.sub(r'\dFrom.*', '', text)
    print(text)
    print('--------------------')

    text = text.replace('views', '')
    print(text)
    print('--------------------')

    last_word = text.split()
    last_word = last_word[-1]
    if bool(re.search(r'\d', last_word)):
        text = text.replace(last_word, '')
        last_word = last_word[:re.search(r'\d', last_word).start()]
        text = text + last_word
        print(text)
        print('--------------------')

    if not bool(re.search(r'[a-zA-Z]{5,}', text)):
        text = ''
        print(text)
        print('--------------------')

else:
    print('Tweet is an AD!')

Historic Vids@historyinmemes·10hRussian hockey is different man1446315.9K1.4M
--------------------
10hRussian hockey is different man1446315.9K1.4M
--------------------
Russian hockey is different man1446315.9K1.4M
--------------------
Russian hockey is different man1446315.9K1.4M
--------------------
Russian hockey is different man1446315.9K1.4M
--------------------
Russian hockey is different man1446315.9K1.4M
--------------------
Russian hockey is different man
--------------------
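
The first two stages shown above rely on two regular expressions. As a small sketch, they can be compiled under descriptive names and applied to the same sample tweet; the names `HEADER` and `TIMESTAMP` are introduced here for illustration only and do not appear in the notebook.

```python
import re

# "<display name>@<handle>·" prefix; the second [^@]+ is greedy, so the match
# extends to the last "·" that follows the handle.
HEADER = re.compile(r'^[^@]+@[^@]+·')
# Relative age ("10h", "5m") or a date such as "Jan 6" right after the header.
TIMESTAMP = re.compile(r'^(?:[A-Za-z]{3}\s\d+|\d+[hm])')

sample = 'Historic Vids@historyinmemes·10hRussian hockey is different man1446315.9K1.4M'

header = HEADER.match(sample)
body = sample[header.end():]
stamp = TIMESTAMP.match(body)
remainder = body[stamp.end():]

print(header.group(0))  # Historic Vids@historyinmemes·
print(stamp.group(0))   # 10h
print(remainder)        # Russian hockey is different man1446315.9K1.4M
```

A text without the `name@handle·` prefix makes `HEADER.match` return `None`, which is the case the notebook treats as an ad.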
