# Finding Fact-checkable Tweets with Machine Learning


## Credits

This notebook was based on one originally created by Jeremy Howard and the other folks at [fast.ai](https://fast.ai) as part of [this fantastic class](https://course.fast.ai/). Specifically, it comes from Lesson 4. You can [see the lesson video](https://course.fast.ai/videos/?lesson=4) and [the original class notebook](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb).

The idea for the project came from Dan Keemahill at the Austin American-Statesman newspaper. Dan, Madlin Mekelburg, and others at the paper hand-coded the tweets used for the classification training.

For more information about this project, and details about how to use this work in the wild, check out our [Quartz AI Studio blog post about the checkable-tweets project](https://qz.ai/?p=89).

-- John Keefe, [Quartz](https://qz.com), October 2019

## Setup

### For those using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish (*poof*) when you close them, after they sit idle for a while (currently 90 minutes), or once you've used one for more than 12 hours.

If you're using Google Colaboratory, be sure to set your runtime to "GPU," which speeds up machine learning in your notebook:

![change runtime](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/change_runtime_2.jpg)
![pick gpu](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/pick_gpu_2.jpg)

Then run this cell:

In [0]:
## ALL GOOGLE COLAB USERS RUN THIS CELL

## This runs a script that installs fast.ai
!curl -s https://course.fast.ai/setup/colab | bash

Updating fastai...
Done.


### For those _not_ using Google Colaboratory ...

This section is just for people who decide to use one of the notebooks on a system other than Google Colaboratory.

Those people should run the cell below.

In [0]:
## NON-COLABORATORY USERS SHOULD RUN THIS CELL
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### Everybody do this ...

Everyone needs to run the next cell, which initializes the Python libraries we'll use in this notebook.

In [0]:
## AND *EVERYBODY* SHOULD RUN THIS CELL
import warnings
warnings.filterwarnings('ignore')
from fastai.text import *
import fastai
print(f'fastai: {fastai.__version__}')
print(f'cuda: {torch.cuda.is_available()}')

fastai: 1.0.61
cuda: True


## The Data

We're going to be using two sets of tweets for this project:

- A CSV (comma-separated values file) containing a bunch of #txlege tweets
- A CSV of #txlege tweets that have been hand-coded as "fact-checkable" or "not fact-checkable"


In [0]:
# Run this cell to download the data we'll use for this exercise
#!wget -N https://s3.amazonaws.com/media.johnkeefe.net/newmark-investigations/unclassified_tweets.zip --quiet
#!unzip -q unclassified_tweets.zip

!wget -N https://s3.amazonaws.com/media.johnkeefe.net/newmark-investigations/TWITTER_CSV_EXPORTS.zip --quiet
!unzip -q TWITTER_CSV_EXPORTS.zip

print('Done!')

Done!


Let's take a look.

In [0]:
%ls

[0m[01;36mdata[0m@      [01;34mgdrive[0m/  [01;34mTWITTER_CSV_EXPORTS[0m/     [01;34munclassified_tweets[0m/
first.csv  [01;36mmodels[0m@  TWITTER_CSV_EXPORTS.zip  unclassified_tweets.zip


In [0]:
import glob
import pandas as pd

# file_list = sorted(glob.glob('./unclassified_tweets/*.csv'))
file_list = sorted(glob.glob('./TWITTER_CSV_EXPORTS/*.csv'))
file_list

['./TWITTER_CSV_EXPORTS/AZ-01-Shedd - shedd_tweets.csv',
 './TWITTER_CSV_EXPORTS/AZ-02 - stauz_tweets.csv',
 './TWITTER_CSV_EXPORTS/AZ-02.2 - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/CA-25 - garcia_tweets.csv',
 './TWITTER_CSV_EXPORTS/CA-39-Kim - kim_tweets.csv',
 './TWITTER_CSV_EXPORTS/CA-45 - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/CA-48-Steel - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/CO-06-House - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/IA-01-Hinson - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/IA-02-Miller-Meeks - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/IA-03-Young - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/IL-06-Ives - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/IL-14-Oberweis - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/IL-17-Joy-King - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/KS-03-Adkins - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/ME-02-Bennett - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/MI-11-Bentivoglio - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/MI-11-Esshaki - Sheet1.csv',
 './TWITTER_CSV_EXPORTS/NJ-02-VanDrew - Sheet1.csv',
 './TW

In [0]:
!head './unclassified_tweets/Twitter_Batch 1 - GeorgeBuck (1).csv'

handle,date,content,url,covid_1,covid_2,covid_3,covid_final,other_1,other_2,other_3,other_final
,2020-04-15 14:47:47,https://t.co/oC0bKhoYdm,,,FALSE,,,,FALSE,,
,2020-04-10 23:28:58,"Pinellas GOP Congressional Candidates on Coronavirus, Vote-by-Mail and President Trump https://t.co/tmRnXlZpQU",,,FALSE,,,,FALSE,,
,2020-03-18 23:55:05,"Attorney General William Barr Busts Out Bagpipes Alongside NYPD, Wows Cr... https://t.co/qJiGLJJNXw via @YouTube",,,FALSE,,,,FALSE,,
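
Note that the header above lists `handle` before `date` and calls the link column `url`. Other exports in the collection order or name these columns differently, as the next file shows. Reading rows by column name rather than by position sidesteps that difference. The following is a minimal sketch under stated assumptions: `tweet_texts` is a hypothetical helper, and the two sample exports are made-up rows whose headers merely mirror the two layouts.

```python
import csv
import io

# Made-up sample exports; only the header layouts mirror the real files.
EXPORT_A = """handle,date,content,url
,2020-04-15 14:47:47,First tweet,
,2020-04-10 23:28:58,,
"""

EXPORT_B = """date,handle,content,link
"April 05, 2020 at 03:49PM",@example,Second tweet,http://example.com/1
"""

def tweet_texts(csv_file):
    """Return the non-empty 'content' values, whatever the column order."""
    reader = csv.DictReader(csv_file)
    return [row['content'] for row in reader if row['content'] != ""]

texts_a = tweet_texts(io.StringIO(EXPORT_A))
texts_b = tweet_texts(io.StringIO(EXPORT_B))
print(texts_a, texts_b)
```

Because `csv.DictReader` keys each row by the header names, the same lookup of `row['content']` works for both layouts, and rows with no tweet text are dropped.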

In [0]:
!head './unclassified_tweets/VA-10-dove - Sheet1.csv'

date,handle,content,link,covid_1,covid_2,covid_3,covid_final,other_1,other_2,other_3,other_final
"April 05, 2020 at 03:49PM",@JefferyADoveJr,Enjoy a safe and blessed #PalmSunday . Take time to think those of those in need. #VA10,http://twitter.com/JefferyADoveJr/status/1246887755810066438,,,,,,,,
"April 08, 2020 at 07:14AM",@JefferyADoveJr,I saw the Representative tell her story. Its interesting that she was almost unable to try this medication that ended up saving her when she was startimg to experience very labored breathing. We need more testing but this looks promising. #VA10 https://t.co/havF1OpHiJ,http://twitter.com/JefferyADoveJr/status/1247845189886136322,,,,,,,,
"April 08, 2020 at 07:48PM",@JefferyADoveJr,#HappyPassover to all those celebrating around the world. Have a blessed and safe celebration. #VA10 https://t.co/L6YpNRz7Ks,http://twitter.com/JefferyADoveJr/status/1248035102011121664,,,,,,,,
"April 09, 2020 at 12:12PM",@JefferyADoveJr,"We give our profound thanks to th

## Using our saved model

In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'ai-workshop/candidate_tweets/'
save_path = Path(base_dir)
save_path.mkdir(parents=True, exist_ok=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
# load the model from the 'export-tweetmodel.pkl' file on your Google Drive
my_model = load_learner(save_path, file="export-tweetmodel.pkl")  

## Getting every candidate's stats

In [0]:
import csv
import os

In [0]:
threshold = 0.80
summary_data = []

In [0]:
# loop through all the file names
for file in file_list: 

  # open csv
  with open(file, newline='') as csvfile:
    reader = csv.DictReader(csvfile)

    # establish fresh values
    count_all = 0
    count_true = 0
    count_threshold = 0
    pct_true = 0.0
    pct_threshold = 0.0

    # loop through all the rows in the csv
    for row in reader:

      # skip this row if there's no content
      if row['content'] == "":
        continue 

      # we have content, count this row
      count_all += 1

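      # make the prediction; the third item holds the class probabilities,
      # and index 1 is the probability of the positive ("true") class
      prediction = my_model.predict(row['content'])
      prob_true = float(prediction[2][1])

      # did the model predict the positive class?
      if prob_true > 0.50:
        count_true += 1

      # did it exceed the stricter threshold?
      if prob_true > threshold:
        count_threshold += 1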

    # calculate percentages, avoiding division by zero
    if count_all > 0:
      pct_true = count_true / count_all
      pct_threshold = count_threshold / count_all

    short_file = os.path.basename(file)
    
    this_item = [short_file, count_all, count_true, round(pct_true, 2), count_threshold, round(pct_threshold, 2)]
    print(this_item)

    summary_data.append(this_item)

print(summary_data)

 

['AZ-01-Shedd - shedd_tweets.csv', 36, 1, 0.03, 0, 0.0]
['AZ-02 - stauz_tweets.csv', 8, 0, 0.0, 0, 0.0]
['AZ-02.2 - Sheet1.csv', 39, 6, 0.15, 4, 0.1]
['CA-25 - garcia_tweets.csv', 82, 7, 0.09, 4, 0.05]
['CA-39-Kim - kim_tweets.csv', 30, 0, 0.0, 0, 0.0]
['CA-45 - Sheet1.csv', 69, 6, 0.09, 6, 0.09]
['CA-48-Steel - Sheet1.csv', 19, 0, 0.0, 0, 0.0]
['CO-06-House - Sheet1.csv', 38, 5, 0.13, 1, 0.03]
['IA-01-Hinson - Sheet1.csv', 77, 0, 0.0, 0, 0.0]
['IA-02-Miller-Meeks - Sheet1.csv', 37, 1, 0.03, 0, 0.0]
['IA-03-Young - Sheet1.csv', 49, 2, 0.04, 1, 0.02]
['IL-06-Ives - Sheet1.csv', 53, 34, 0.64, 27, 0.51]
['IL-14-Oberweis - Sheet1.csv', 31, 10, 0.32, 5, 0.16]
['IL-17-Joy-King - Sheet1.csv', 23, 5, 0.22, 3, 0.13]
['KS-03-Adkins - Sheet1.csv', 25, 3, 0.12, 3, 0.12]
['ME-02-Bennett - Sheet1.csv', 51, 18, 0.35, 13, 0.25]
['MI-11-Bentivoglio - Sheet1.csv', 1, 0, 0.0, 0, 0.0]
['MI-11-Esshaki - Sheet1.csv', 25, 6, 0.24, 3, 0.12]
['NJ-02-VanDrew - Sheet1.csv', 14, 1, 0.07, 1, 0.07]
['NJ-07-Keane-Jr
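
The tallying logic in the loop above can be checked without loading the model by separating it from the prediction call. The sketch below is one way to do that, not the notebook's own method: `summarize` and `FAKE_SCORES` are hypothetical names, and the scores are invented stand-ins for model probabilities.

```python
def summarize(texts, predict_proba, threshold=0.80):
    """Tally how many texts score above 0.50 and above `threshold`.

    `predict_proba` is any callable that maps a text to the probability
    of the positive class (a stand-in for the real model here).
    """
    count_all = count_true = count_threshold = 0
    for text in texts:
        # skip empty content, as the loop above does
        if text == "":
            continue
        count_all += 1
        prob = predict_proba(text)
        if prob > 0.50:
            count_true += 1
        if prob > threshold:
            count_threshold += 1

    # avoid division by zero when a file has no usable rows
    pct_true = count_true / count_all if count_all else 0.0
    pct_threshold = count_threshold / count_all if count_all else 0.0
    return [count_all, count_true, round(pct_true, 2),
            count_threshold, round(pct_threshold, 2)]

# Invented scores standing in for model output.
FAKE_SCORES = {"tweet a": 0.92, "tweet b": 0.61, "tweet c": 0.10, "tweet d": 0.85}
stats = summarize(["tweet a", "", "tweet b", "tweet c", "tweet d"], FAKE_SCORES.get)
print(stats)
```

With the real model, `predict_proba` could be something like `lambda text: float(my_model.predict(text)[2][1])`, which mirrors the indexing used in the loop above.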

In [0]:
summary_data

[['AZ-01-Shedd - shedd_tweets.csv', 36, 1, 0.03, 0, 0.0],
 ['AZ-02 - stauz_tweets.csv', 8, 0, 0.0, 0, 0.0],
 ['AZ-02.2 - Sheet1.csv', 39, 6, 0.15, 4, 0.1],
 ['CA-25 - garcia_tweets.csv', 82, 7, 0.09, 4, 0.05],
 ['CA-39-Kim - kim_tweets.csv', 30, 0, 0.0, 0, 0.0],
 ['CA-45 - Sheet1.csv', 69, 6, 0.09, 6, 0.09],
 ['CA-48-Steel - Sheet1.csv', 19, 0, 0.0, 0, 0.0],
 ['CO-06-House - Sheet1.csv', 38, 5, 0.13, 1, 0.03],
 ['IA-01-Hinson - Sheet1.csv', 77, 0, 0.0, 0, 0.0],
 ['IA-02-Miller-Meeks - Sheet1.csv', 37, 1, 0.03, 0, 0.0],
 ['IA-03-Young - Sheet1.csv', 49, 2, 0.04, 1, 0.02],
 ['IL-06-Ives - Sheet1.csv', 53, 34, 0.64, 27, 0.51],
 ['IL-14-Oberweis - Sheet1.csv', 31, 10, 0.32, 5, 0.16],
 ['IL-17-Joy-King - Sheet1.csv', 23, 5, 0.22, 3, 0.13],
 ['KS-03-Adkins - Sheet1.csv', 25, 3, 0.12, 3, 0.12],
 ['ME-02-Bennett - Sheet1.csv', 51, 18, 0.35, 13, 0.25],
 ['MI-11-Bentivoglio - Sheet1.csv', 1, 0, 0.0, 0, 0.0],
 ['MI-11-Esshaki - Sheet1.csv', 25, 6, 0.24, 3, 0.12],
 ['NJ-02-VanDrew - Sheet1.csv', 1

In [0]:
output_csv_df = pd.DataFrame(summary_data, columns=['file', 'count_all', 'count_true', 'pct_true', 'count_threshold', 'pct_threshold'])

In [0]:
output_csv_df

| Unnamed: 0 | file | count_all | count_true | pct_true | count_threshold | pct_threshold |
|---|---|---|---|---|---|---|
| 0 | CA-10 - Sheet1 (1).csv | 53 | 2 | 0.04 | 0 | 0.0 |
| 1 | CA-10 - Sheet1 (1).csv | 53 | 2 | 0.04 | 0 | 0.0 |
| 2 | CA-50-Najjar - Sheet1.csv | 106 | 6 | 0.06 | 2 | 0.02 |
| 3 | GA-07-Bordeaux - Sheet1.csv | 46 | 1 | 0.02 | 0 | 0.0 |
| 4 | IA-04-Scholten - Sheet1.csv | 565 | 18 | 0.03 | 3 | 0.01 |
| 5 | MI-08-Detmer - Sheet1.csv | 50 | 4 | 0.08 | 0 | 0.0 |
| 6 | NC-13-Huffman - Sheet1.csv | 193 | 19 | 0.1 | 3 | 0.02 |
| 7 | NJ-11-Becchi - Sheet1.csv | 82 | 14 | 0.17 | 6 | 0.07 |
| 8 | NY-02-Gordon - Sheet1.csv | 39 | 0 | 0.0 | 0 | 0.0 |
| 9 | NY-11 - malliotakis_tweets.csv | 291 | 15 | 0.05 | 5 | 0.02 |


In [0]:
output_csv_name = f'{save_path}/summary2.csv'
output_csv_df.to_csv(output_csv_name, index=False)
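
Where pandas isn't available, the same six-column summary could be written with the standard library's `csv` module. This is only a sketch: `write_summary` and `COLUMNS` are hypothetical names, and the single sample row is copied from the printed results earlier in this notebook. It writes to an in-memory buffer so it can be tried without touching Google Drive.

```python
import csv
import io

# Column names match the DataFrame built above.
COLUMNS = ['file', 'count_all', 'count_true', 'pct_true',
           'count_threshold', 'pct_threshold']

def write_summary(rows, out):
    """Write one header line followed by one line per summary row."""
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    writer.writerows(rows)

buffer = io.StringIO()
write_summary([['AZ-01-Shedd - shedd_tweets.csv', 36, 1, 0.03, 0, 0.0]], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, an open file handle (opened with `newline=''`) can be passed in place of the buffer.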

In [0]:
!head summary.csv

file,count_all,count_true,pct_true,count_threshold,pct_threshold
file,count_all,count_true,pct_true,count_threshold,pct_threshold
CA-10 - Sheet1 (1).csv,53,2,0.04,0,0.0
CA-50-Najjar - Sheet1.csv,106,6,0.06,2,0.02


In [0]:
!rm summary.csv