# TwitterFakeNews
#### Large-scale dataset of 200k+ tweets that were automatically labelled as true or false. All tweets issued by accounts known for spreading false news are considered false, and all tweets issued by trustworthy accounts are considered true

https://www.mdpi.com/1999-5903/13/5/114

#### Necessary imports

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.options.display.max_colwidth = 200


## 1. Loading the dataset
tweet__fake - 1 (Fake), 0 (True)

In [2]:
picked_columns = ['tweet__text', 'tweet__contains_hashtags', 'tweet__nr_of_hashtags', 'tweet__sent_tokenized_text', 'tweet__tokenized_um_url_removed', 'tweet__nr_of_sentences', 'tweet__fake', 'tweet__retweet_count', 'tweet__favorite_count']
Auto_df = pd.read_csv('Initial_datasets/data.csv', sep=';', usecols = picked_columns , low_memory=False)

Auto_df.head(3)

Unnamed: 0,tweet__retweet_count,tweet__favorite_count,tweet__fake,tweet__sent_tokenized_text,tweet__text,tweet__nr_of_sentences,tweet__tokenized_um_url_removed,tweet__nr_of_hashtags,tweet__contains_hashtags
0,32,71,0,"[""Joe Buck ruined Brooks Koepka's U.S. Open kiss with a super-awkward mistake ""]",Joe Buck ruined Brooks Koepka's U.S. Open kiss with a super-awkward mistake https://t.co/UO1GcIUPU6,1,"['Joe', 'Buck', 'ruined', 'Brooks', ""Koepka's"", 'U', '.', 'S', '.', 'Open', 'kiss', 'with', 'a', 'super-awkward', 'mistake']",0,False
1,67,55,0,"['Two engaged doctors found bound and slain in luxury Boston penthouse.', 'Man in custody after shootout. ']",Two engaged doctors found bound and slain in luxury Boston penthouse. Man in custody after shootout. https://t.co/A22ivfzonQ,2,"['Two', 'engaged', 'doctors', 'found', 'bound', 'and', 'slain', 'in', 'luxury', 'Boston', 'penthouse', '.', 'Man', 'in', 'custody', 'after', 'shootout', '.']",0,False
2,0,1,1,"['DROP EVERYTHING and watch these four urgent, must-see #Documentries ']","DROP EVERYTHING and watch these four urgent, must-see #Documentries https://t.co/TiFhU40nWq https://t.co/1dk8PZWFYC",1,"['drop', 'everything', 'and', 'watch', 'these', 'four', 'urgent', ',', 'must-see']",1,True


## 2. Claim Inspection & Formatting

### 2.1. URLs

In [3]:
# Tweets that start with a link are dropped (41 entries)
Auto_df = Auto_df.drop(Auto_df.loc[Auto_df['tweet__text'].str.lower().str.startswith('http')].index)

# Tweets whose last token is not a link (e.g. the link sits in the middle) are also dropped (8500) (too risky - most of them are faulty/weird)
Auto_df = Auto_df.drop(Auto_df.loc[Auto_df['tweet__text'].str.lower().str.split().str[-1].str.contains('http') == False].index)

# Tweets that end with a link (180,000+) have their link(s) removed
f = lambda x: ' '.join([item for item in x.split() if 'http' not in item])
Auto_df["tweet__text"] = Auto_df["tweet__text"].apply(f)

print(Auto_df.shape[0])

188773
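The three URL rules above can be sketched on single strings. This is a minimal plain-Python illustration, not the notebook's pandas code: the helper name `clean_tweet` and the example URLs are hypothetical, and discarding empty tweets is an assumption made only for this sketch.

```python
from typing import Optional


def clean_tweet(text: str) -> Optional[str]:
    """Return the tweet without links, or None if the row would be dropped."""
    tokens = text.split()
    if not tokens:
        return None  # assumption: empty tweets are discarded in this sketch
    if tokens[0].lower().startswith("http"):
        return None  # rule 1: the tweet starts with a link
    if "http" not in tokens[-1].lower():
        return None  # rule 2: the last token is not a link
    # rule 3: keep every token that is not a link
    return " ".join(token for token in tokens if "http" not in token)


examples = [
    "Man in custody after shootout. https://example.com/a",
    "https://example.com/b breaking news",
    "Read https://example.com/c for details",
]
for example in examples:
    print(repr(clean_tweet(example)))
```

Only the first example survives, with its trailing link removed. The other two return `None`, which corresponds to a dropped row.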


### 2.2. Hashtags - Dropping hashtags
Because the dataset is big, we can drop the entries with hashtags and still end up with a very large dataset. The hashtag entries are not reliable, hence the decision not to examine them any further

In [4]:
print("size of entries with hashtags: ", Auto_df[Auto_df['tweet__contains_hashtags'] == True].shape[0])
hash_df = Auto_df[Auto_df['tweet__contains_hashtags'] == True]

Auto_df = Auto_df.drop(Auto_df[Auto_df['tweet__contains_hashtags'] == True].index)

size of entries with hashtags:  25098


### 2.3. Duplicates

In [5]:
# Dropping duplicate entries (over 25,000 rows)
Auto_df = Auto_df.drop_duplicates(subset='tweet__text', keep='first')

### 2.4. Mentions (@)

In [6]:
# Dropping entries with mentions (@) as there is no real way to transform them to normal text (around 15,000)
Auto_df = Auto_df.drop(Auto_df.loc[Auto_df['tweet__text'].str.lower().str.contains('@')].index)

### 2.5. Other Special Characters
- &amp
- ;
- non ascii characters (like emojis for example)

In [7]:
# Drop entries containing &amp (1500)
Auto_df = Auto_df.drop(Auto_df[Auto_df['tweet__text'].str.contains('&amp')].index)
# Replace semicolons with commas in every column
Auto_df = Auto_df.replace(';',',', regex=True)
# Remove non-ASCII characters (e.g. emojis)
Auto_df['tweet__text'] = Auto_df['tweet__text'].astype(str).apply(lambda x: x.encode('ascii', 'ignore').decode('ascii'))
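The semicolon replacement and the ASCII round trip can be checked on one string. This is a small sketch with a made-up sample sentence, and `strip_special` is a hypothetical helper name. Semicolons are presumably replaced because the CSV files in this notebook use `;` as the separator.

```python
def strip_special(text: str) -> str:
    """Replace semicolons with commas, then drop every non-ASCII character."""
    text = text.replace(";", ",")
    # encode(..., "ignore") silently skips characters that have no ASCII form
    return text.encode("ascii", "ignore").decode("ascii")


# "\u00e9" is an accented e and "\u2615" is a coffee-cup emoji
sample = "Caf\u00e9 reopens; crowds cheer \u2615"
print(repr(strip_special(sample)))
```

Note that removed characters leave their neighbouring spaces behind, so the result can contain doubled or trailing spaces.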

### 2.6. Nr of Sentences
- Deleting everything that is over 2 sentences long (around 1500 entries)

In [8]:
Auto_df = Auto_df.drop(Auto_df[Auto_df['tweet__nr_of_sentences'] > 2].index)

### 2.7. Tweet Favorite/Retweet Count
- Dropping the tweets that have a favorite count or a retweet count of 0 (not really worth checking them). A stricter option would be to keep only tweets above some minimum favorite count

In [9]:
Auto_df = Auto_df.drop(Auto_df[Auto_df['tweet__favorite_count'] == 0].index)
Auto_df = Auto_df.drop(Auto_df[Auto_df['tweet__retweet_count'] == 0].index)

### 2.8. VIDEO entries
- Many entries have '- video' at the end of their claims. Drop them

In [10]:
Auto_df = Auto_df.drop(Auto_df[Auto_df['tweet__text'].str.contains('- video')].index)

## 3. True/Fake Distribution

In [None]:
fig = px.histogram(Auto_df, x='tweet__fake')
fig.update_layout(bargap=0.2)
fig.show()

## 4. Saving new data

In [12]:
# Dropping unnecessary columns
Auto_df.drop(['tweet__retweet_count', 'tweet__favorite_count', 'tweet__sent_tokenized_text', 'tweet__nr_of_sentences', 'tweet__tokenized_um_url_removed', 'tweet__nr_of_hashtags', 'tweet__contains_hashtags'], axis=1, inplace=True)

- Splitting the formatted data into fake and true entries that will be fed to the Google verification algorithm, where each entry gets a Levenshtein distance score based on entering the claim into Google Search

In [13]:
true_df = Auto_df[Auto_df['tweet__fake'] == 0]
fake_df = Auto_df[Auto_df['tweet__fake'] == 1]

true_df.to_csv('Auto_Format_True.csv', sep=';', encoding='utf-8')
fake_df.to_csv('Auto_Format_Fake.csv', sep=';', encoding='utf-8')

In [14]:
Auto_df.to_csv('TwitterFakeNews_Final.csv', encoding='utf-8')

## 5. Examining Google Verified Data
- These datasets went through a "Google verification" method: each claim was entered into the Google search engine, and the Levenshtein distance between the claim and the search results was calculated. The idea is that entries with a higher average Levenshtein score are less trustworthy than entries with a lower one. In other words, if a query has many similar results on Google it should be considered true, and false otherwise. This was applied to this noisy dataset as an additional layer of verification on top of its automatic labelling
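The verification code itself is not part of this notebook, so the following is only a sketch of the underlying metric: a standard dynamic-programming Levenshtein distance, plus a hypothetical `average_distance` helper that averages the distance between a claim and a list of result strings. How the search results were actually fetched and compared is not shown in the source.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def average_distance(claim: str, results: List[str]) -> float:
    """Hypothetical score: mean distance between a claim and result titles."""
    return sum(levenshtein(claim, result) for result in results) / len(results)


print(levenshtein("kitten", "sitting"))
print(average_distance("abc", ["abc", "abd"]))
```

Under this reading, a claim that appears almost verbatim in the search results gets a score near 0, while a claim with no close match gets a score close to its own length.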

### 5.1. Loading the datasets

In [15]:
Auto_true_score = pd.read_csv('Formatted_datasets/Auto_Format_True_Score.csv', index_col=0, low_memory=False)
Auto_false_score = pd.read_csv('Formatted_datasets/Auto_Format_False_Score.csv', index_col=0, low_memory=False)

# Delete the entries with leven__score <= 0
Auto_true_score = Auto_true_score.drop(Auto_true_score.loc[Auto_true_score['leven__score'] <= 0].index)
Auto_false_score = Auto_false_score.drop(Auto_false_score.loc[Auto_false_score['leven__score'] <= 0].index)

### 5.2. Sentence Length correlation to score

- For True entries:

In [16]:
Auto_true_score['tweet__count'] = Auto_true_score['tweet__text'].str.len()
Auto_true_score['tweet__count'].corr(Auto_true_score['leven__score'])

0.9132956515728649

- For False entries:

In [17]:
Auto_false_score['tweet__count'] = Auto_false_score['tweet__text'].str.len()
Auto_false_score['tweet__count'].corr(Auto_false_score['leven__score'])

0.9053585456499308

As we can see, there is a strong correlation between the tweet length in characters (tweet__count) and the Levenshtein score calculated by the Google verifier. To adjust for that, a new column is created

### 5.3. Adding Leven_Score/Tweet_Count column
- The Levenshtein score is divided by the length of the tweet, because leven__score alone was biased towards longer tweets

In [18]:
Auto_true_score['adjusted_leven'] = Auto_true_score['leven__score'] / Auto_true_score['tweet__count']
Auto_false_score['adjusted_leven'] = Auto_false_score['leven__score'] / Auto_false_score['tweet__count']

In [19]:
print(Auto_true_score['tweet__count'].corr(Auto_true_score['adjusted_leven']))
print(Auto_false_score['tweet__count'].corr(Auto_false_score['adjusted_leven']))

-0.08080815730036466
-0.385536888505841


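As we can see, the strong positive correlation is now gone (about -0.08 for the true entries, with a moderate negative correlation of about -0.39 remaining for the false entries), making the results more trustworthy. The lower the adjusted_leven, the higher the chance that the tweet__text is actually true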
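The effect of dividing by length can be reproduced on a handful of made-up numbers. The sketch below uses a hand-written Pearson correlation so that it runs without the dataset. The lengths and scores are invented for illustration and are not taken from the real files.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented example: raw scores grow roughly in proportion to tweet length
lengths = [20, 40, 60, 80, 100]
raw_scores = [16, 28, 54, 60, 85]
adjusted = [score / length for score, length in zip(raw_scores, lengths)]

raw_r = pearson(lengths, raw_scores)
adjusted_r = pearson(lengths, adjusted)
print(round(raw_r, 2), round(adjusted_r, 2))
```

On this toy data the raw score is almost perfectly correlated with length, while the per-character score is only weakly correlated, which mirrors the change seen in the notebook output.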

### 5.4. Picking Best Entries Based On Sentence Length

### - For True entries:

Pick the 3500 entries with the best (lowest) adjusted_leven score -> a low score indicates that there is similarity between the claim and the Google search results

In [None]:
# Take the 3500 best rated entries from all
Final_Auto_True = Auto_true_score.sort_values('adjusted_leven').head(3500)
Final_Auto_True

fig = px.histogram(Final_Auto_True, x='tweet__count', title="Best 3500 claims and their characters length")
fig.update_layout(bargap=0.2)
fig.show() 

fig = px.histogram(Auto_true_score, x='tweet__count', title="Characters length distribution of all true entries")
fig.update_layout(bargap=0.2)
fig.show() 

##### Conclusion
We can see that the selected claims are mostly around 60 characters long, with few claims over 80 characters. However, the distribution of the whole dataset shows that a large portion of the claims are longer entries of around 100+ characters. Although on manual inspection shorter claims appear to make more sense and be more reliable, I decided to include longer claims as well

In [21]:
# Remaining dataset (dropping rows that are already considered valid)
Reamaining_df = Auto_true_score.drop(Final_Auto_True.index)

print("4 different ranges of characters length that split the dataset into 4 groups: \n\n", 
      pd.cut(Reamaining_df['tweet__count'], 4).unique())

Reamaining_df['char__category'] = pd.cut(Reamaining_df['tweet__count'], 4, labels=[0, 1, 2, 3])

4 different ranges of characters length that split the dataset into 4 groups: 

 [(89.75, 118.0], (61.5, 89.75], (33.25, 61.5], (4.887, 33.25]]
Categories (4, interval[float64]): [(4.887, 33.25] < (33.25, 61.5] < (61.5, 89.75] < (89.75, 118.0]]
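The interval edges printed above follow from how `pd.cut` builds equal-width bins: it splits the range between the minimum and maximum into equal parts and lowers the first edge by 0.1% of the range so that the minimum value is included. The sketch below re-derives the edges in plain Python. The minimum of 5 and maximum of 118 are assumptions inferred from the printed intervals, and both helper names are hypothetical.

```python
from typing import List


def equal_width_edges(lowest: float, highest: float, bins: int) -> List[float]:
    """Bin edges for equal-width, right-closed bins, mimicking pd.cut."""
    width = (highest - lowest) / bins
    edges = [lowest + i * width for i in range(bins + 1)]
    edges[-1] = highest  # guard against floating-point drift
    edges[0] -= 0.001 * (highest - lowest)  # so the minimum falls in bin 0
    return edges


def category(value: float, edges: List[float]) -> int:
    """Label of the right-closed bin (edges[k], edges[k + 1]] containing value."""
    for label in range(len(edges) - 1):
        if edges[label] < value <= edges[label + 1]:
            return label
    raise ValueError(f"{value} lies outside the bin range")


edges = equal_width_edges(5, 118, 4)  # assumed min and max tweet length
print([round(edge, 3) for edge in edges])
print(category(60, edges), category(118, edges))
```

Because the bins have equal width rather than equal counts, the four categories can contain very different numbers of tweets.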


#### Category Inspection conclusions:
- **0 (4 - 33 characters)** -> Entries don't tell anything and **should not** be classified as true or false. They are mostly empty headlines. Do not include
- **1 (34 - 61 characters)** -> Entries seem very reliable and the ones with the best score show most potential. Include best 500
- **2 (62 - 89 characters)** -> These also show promise and look reliable upon initial inspection. Include best 500
- **3 (90 - 118 characters)** -> Same with these. Include best 500

In [22]:
# Adding an additional 1500 entries from categories 1-3 (500 best in each category)
for category in [1,2,3]:
    top500_df = Reamaining_df[Reamaining_df['char__category'] == category].sort_values('adjusted_leven').head(500)
    Final_Auto_True = pd.concat([Final_Auto_True, top500_df])

### - For Fake entries:
For fake entries we pick the claims with the largest Levenshtein distance, i.e. the entries with the biggest discrepancy between the claim and the Google search results

In [None]:
fig = px.histogram(Auto_false_score, x='tweet__count', title="Characters length distribution of all fake entries")
fig.update_layout(bargap=0.2)
fig.show() 

Upon inspection, we can't simply take the worst values of leven__score or adjusted_leven, as they are biased by length: the highest leven__score values favour long claims (correlation of about 0.91 with length), while the highest adjusted_leven values favour short claims (correlation of about -0.39). Therefore, as with the true instances, the entries are grouped by character length beforehand.

In [24]:
print("4 different ranges of characters length that split the dataset into 4 groups: \n\n", 
      pd.cut(Auto_false_score['tweet__count'], 4).unique())

Auto_false_score['char__category'] = pd.cut(Auto_false_score['tweet__count'], 4, labels=[0, 1, 2, 3])

4 different ranges of characters length that split the dataset into 4 groups: 

 [(88.0, 116.0], (32.0, 60.0], (60.0, 88.0], (3.888, 32.0]]
Categories (4, interval[float64]): [(3.888, 32.0] < (32.0, 60.0] < (60.0, 88.0] < (88.0, 116.0]]


#### Category Inspection conclusions:
- **0 (4 - 32 characters)** -> Entries don't tell anything and **should not** be classified as true or false. They are mostly empty headlines. Do not include
- **1 (33 - 60 characters)** -> Entries seem very reliable and the ones with the best score show most potential. Include best 20%
- **2 (61 - 88 characters)** -> These also show promise and look reliable upon initial inspection. Include best 20%
- **3 (89 - 116 characters)** -> Same with these. Include best 20%

In [25]:
dfs = []
for category in [1,2,3]:
    category_df = Auto_false_score[Auto_false_score['char__category'] == category].sort_values('adjusted_leven', ascending=False)
    dfs.append(category_df.head(int(len(category_df) * (0.20))))
    
Final_Auto_False = pd.concat(dfs)
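The per-category selection can be traced on toy records. This plain-Python sketch mirrors the loop above under the assumption that each row is a dictionary with `category` and `adjusted_leven` keys. The function name and the sample values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def top_fraction_per_category(rows: List[Dict], fraction: float) -> List[Dict]:
    """Keep the highest-scoring fraction of rows within each category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row)
    picked = []
    for label in sorted(groups):
        ranked = sorted(groups[label], key=lambda r: r["adjusted_leven"], reverse=True)
        # int() truncates, so a category with fewer than 1 / fraction rows adds nothing
        picked.extend(ranked[: int(len(ranked) * fraction)])
    return picked


rows = [{"category": 1, "adjusted_leven": s} for s in (0.1, 0.5, 0.3, 0.2, 0.4)]
rows += [{"category": 2, "adjusted_leven": s / 10} for s in range(1, 11)]

picked = top_fraction_per_category(rows, 0.20)
print([row["adjusted_leven"] for row in picked])
```

Selecting a fraction per category, rather than a fixed count as for the true entries, keeps the relative size of each length group in the final fake set.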

## 6. Saving Final Dataset

In [15]:
# Switch the tweet__fake labels so that they reflect whether the tweet is true (1) rather than whether it is fake
Final_Auto_False = Final_Auto_False.assign(tweet__fake=0)
Final_Auto_True = Final_Auto_True.assign(tweet__fake=1)

In [16]:
# Merging True and False datasets together
Final_df = pd.concat([Final_Auto_False, Final_Auto_True])

In [17]:
# Shuffling the rows, dropping unnecessary columns, renaming and rearranging the remaining ones
Final_df = Final_df.sample(frac=1).reset_index(drop=True).drop(['leven__score', 'tweet__count', 'adjusted_leven', 'char__category'], axis=1)
Final_df.rename(columns={'tweet__fake': 'claim_veracity', 'tweet__text': 'claim'}, inplace=True)
Final_df = Final_df[['claim', 'claim_veracity']]

In [18]:
Final_df.to_csv('TwitterFakeNews_Picked.csv', encoding='utf-8')