# Finding Fact-checkable Tweets with Machine Learning


## Credits

This notebook is based on one originally created by Jeremy Howard and the other folks at [fast.ai](https://fast.ai) as part of [this fantastic class](https://course.fast.ai/). Specifically, it comes from Lesson 4. You can [see the lesson video](https://course.fast.ai/videos/?lesson=4) and [the original class notebook](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb).

The idea for the project came from Dan Keemahill at the Austin American-Statesman newspaper. Dan, Madlin Mekelburg, and others at the paper hand-coded the tweets used for the classification training.

For more information about this project, and details about how to use this work in the wild, check out our [Quartz AI Studio blog post about the checkable-tweets project](https://qz.ai/?p=89).

-- John Keefe, [Quartz](https://qz.com), October 2019

## Setup

### For those using Google Colaboratory ...

Be aware that Google Colab instances are ephemeral -- they vanish *Poof* when you close them, or after a period of sitting idle (currently 90 minutes), or if you use one for more than 12 hours.

If you're using Google Colaboratory, be sure to set your runtime to "GPU", which speeds up your notebook for machine learning:

![change runtime](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/change_runtime_2.jpg)
![pick gpu](https://qz-aistudio-public.s3.amazonaws.com/workshops/notebook_images/pick_gpu_2.jpg)

Then run this cell:

In [0]:
## ALL GOOGLE COLAB USERS RUN THIS CELL

## This runs a script that installs fast.ai
!curl -s https://course.fast.ai/setup/colab | bash

Updating fastai...
Done.


### For those _not_ using Google Colaboratory ...

This section is just for people who decide to use one of the notebooks on a system other than Google Colaboratory.

Those people should run the cell below.

In [0]:
## NON-COLABORATORY USERS SHOULD RUN THIS CELL
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### Everybody do this ...

Everyone needs to run the next cell, which initializes the Python libraries we'll use in this notebook.

In [0]:
## AND *EVERYBODY* SHOULD RUN THIS CELL
import warnings
warnings.filterwarnings('ignore')
from fastai.text import *
import fastai
print(f'fastai: {fastai.__version__}')
print(f'cuda: {torch.cuda.is_available()}')

fastai: 1.0.61
cuda: False


## The Data

We're going to be using two sets of tweets for this project:

- A CSV (comma-separated values file) containing a bunch of #txlege tweets
- A CSV of #txlege tweets that have been hand-coded as "fact-checkable" or "not fact-checkable"


In [0]:
# Run this cell to download the data we'll use for this exercise
#!wget -N https://s3.amazonaws.com/media.johnkeefe.net/newmark-investigations/unclassified_tweets.zip --quiet
#!unzip -q unclassified_tweets.zip

# !wget -N https://s3.amazonaws.com/media.johnkeefe.net/newmark-investigations/TWITTER_CSV_EXPORTS.zip --quiet
# !unzip -q TWITTER_CSV_EXPORTS.zip

!wget -N https://www.dropbox.com/s/rwt23g38d2krdae/output.zip
!unzip -q output.zip

print('Done!')

--2020-05-11 15:31:07--  https://www.dropbox.com/s/rwt23g38d2krdae/output.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.82.1, 2620:100:6032:1::a27d:5201
Connecting to www.dropbox.com (www.dropbox.com)|162.125.82.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/rwt23g38d2krdae/output.zip [following]
--2020-05-11 15:31:07--  https://www.dropbox.com/s/raw/rwt23g38d2krdae/output.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc862f15ad788c9667332a832e23.dl.dropboxusercontent.com/cd/0/inline/A3j9hl7nXs14Bfdc3fPy5Vjpv3wIYzpu6FZdIP2_6hPV0WBZM9rSPiX7hphhsleC61GH14xK3lcQwf9OYucDuNOdd4oMvVkL5NR3TxbEYj_yqQ/file# [following]
--2020-05-11 15:31:07--  https://uc862f15ad788c9667332a832e23.dl.dropboxusercontent.com/cd/0/inline/A3j9hl7nXs14Bfdc3fPy5Vjpv3wIYzpu6FZdIP2_6hPV0WBZM9rSPiX7hphhsleC61GH14xK3lcQwf9OYucDuNOdd4oMvVkL5NR3TxbEYj_yqQ/file
Resolving uc862f15ad78

Let's take a look.

In [0]:
%ls

[0m[01;36mdata[0m@  [01;36mmodels[0m@  [01;34moutput[0m/  output.zip


In [0]:
import glob
import pandas as pd

# file_list = sorted(glob.glob('./unclassified_tweets/*.csv'))
# file_list = sorted(glob.glob('./TWITTER_CSV_EXPORTS/*.csv'))
file_list = sorted(glob.glob('./output/*.csv'))
file_list

['./output/A. Donald McEachin.csv',
 './output/A. Drew Ferguson IV.csv',
 './output/Abby Finkenauer.csv',
 './output/Abigail Davis Spanberger.csv',
 './output/Adam B. Schiff.csv',
 './output/Adam Kinzinger.csv',
 './output/Adam Smith.csv',
 './output/Adrian Smith.csv',
 './output/Adriano Espaillat.csv',
 './output/Al Green.csv',
 './output/Al Lawson, Jr..csv',
 './output/Alan S. Lowenthal.csv',
 './output/Albio Sires.csv',
 './output/Alcee L. Hastings.csv',
 './output/Alexander X. Mooney.csv',
 './output/Alexandria Ocasio-Cortez.csv',
 './output/Alma S. Adams.csv',
 './output/Ami Bera.csv',
 './output/Amy Klobuchar.csv',
 './output/André Carson.csv',
 './output/Andy Barr.csv',
 './output/Andy Biggs.csv',
 './output/Andy Harris.csv',
 './output/Andy Kim.csv',
 './output/Andy Levin.csv',
 './output/Angie Craig.csv',
 './output/Angus S. King, Jr..csv',
 './output/Ann Kirkpatrick.csv',
 './output/Ann M. Kuster.csv',
 './output/Ann Wagner.csv',
 './output/Anna G. Eshoo.csv',
 './output/Anth
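
Each file is named after the person whose tweets it contains, so the name can be recovered from the path. A minimal sketch, using two paths copied from the listing above (the variable name `names` is just illustrative):

```python
from pathlib import Path

# Two paths copied from the listing above
paths = ['./output/A. Donald McEachin.csv', './output/Al Lawson, Jr..csv']

# .stem drops only the final ".csv", so dots inside a name are kept
names = [Path(p).stem for p in paths]
print(names)
```

Using `.stem` rather than splitting on the first dot matters here, because several names contain initials or a trailing "Jr." with periods of their own.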

In [0]:
!head './output/Veronica Escobar.csv'

tweet,dates
"Happy #MothersDay to all the wonderful moms in El Paso and to my rock and inspiration - my mom! 

It’s been hard not being able to hug or share a meal with her for weeks, but today I dropped off breakfast and flowers from a safe distance to celebrate her and keep her safe! https://t.co/e1PSYzh6Yt",2020-05-10 17:25:08
"Latino entrepreneurs are among the majority of small-business owners directly affected by the economic crisis caused by the #COVID19 pandemic.

We must work to protect vulnerable small businesses and ensure they are able to fully access relief funds.
 https://t.co/3GFOi56G0o",2020-05-09 17:00:01
"Laws allowing “citizens arrests” &amp; easy access to guns coupled w/ our country’s deep roots of racism put a target on the backs of African Americans like Ahmaud.
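
Because tweets can contain line breaks, a single CSV record may span several lines of text, as the `head` output above shows. Here is a minimal sketch of how Python's `csv.DictReader` keeps such a record in one piece. The sample text is hypothetical and only mimics the `tweet,dates` layout of the real files:

```python
import csv
import io

# Hypothetical sample in the same layout as the files above:
# a "tweet" column and a "dates" column, with one quoted multi-line tweet.
sample = (
    'tweet,dates\n'
    '"First line of a tweet.\n'
    '\n'
    'Second paragraph of the same tweet.",2020-05-10 17:25:08\n'
    '"A single-line tweet.",2020-05-09 17:00:01\n'
)

# newline='' leaves line breaks inside quoted fields untouched
rows = list(csv.DictReader(io.StringIO(sample, newline='')))

print(len(rows))               # number of tweets, not number of text lines
print(repr(rows[0]['tweet']))  # the embedded line breaks are preserved
print(rows[0]['dates'])
```

This is why counting lines in a file (for example with `wc -l`) would overstate the number of tweets, while counting parsed rows would not.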



## Using our saved model

In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'ai-workshop/candidate_tweets/'
save_path = Path(base_dir)
save_path.mkdir(parents=True, exist_ok=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [0]:
# load the model from the 'export-tweetmodel.pkl' file on your Google Drive
my_model = load_learner(save_path, file="export-tweetmodel.pkl")
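
In fastai v1, `predict` returns a tuple of three items: the predicted category, its class index, and a tensor of per-class probabilities. The sketch below shows how that tuple can be unpacked. It uses a hand-written stand-in made of plain Python values rather than a real model result, and it assumes the class at index 1 is the "fact-checkable" label:

```python
# Stand-in for what my_model.predict(text) returns in fastai v1:
# (category, class index, per-class probabilities).
# These values are made up for illustration only.
fake_prediction = ("True", 1, [0.15, 0.85])

category, class_index, probabilities = fake_prediction

# Assumption: index 1 holds the probability of the "fact-checkable" class
prob_checkable = float(probabilities[1])

print(category, class_index, prob_checkable)
```

With a real model, the third item is a tensor, so wrapping the value in `float()` gives an ordinary number that is easier to compare and print.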

## Getting every candidate's stats

In [0]:
import csv
import os

In [0]:
threshold = 0.80
summary_data = []

In [0]:
# loop through all the file names
for file in file_list: 

  # open csv
  with open(file, newline='') as csvfile:
    reader = csv.DictReader(csvfile)

    # establish fresh values
    count_all = 0
    count_true = 0
    count_threshold = 0
    pct_true = 0.0
    pct_threshold = 0.0

    # loop through all the rows in the csv
    for row in reader:

      # skip this row if there's no content
      if row['tweet'] == "":
        continue 

      # we have content, count this row
      count_all += 1

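      # make the prediction
      prediction = my_model.predict(row['tweet'])
      # probability of class 1, which this notebook treats as "fact-checkable"
      prob_checkable = float(prediction[2][1])

      # did the model classify this tweet as fact-checkable?
      if prob_checkable > 0.50:
        count_true += 1

      # did it exceed the threshold?
      if prob_checkable > threshold:
        count_threshold += 1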

    # calculate percentages, avoiding division by zero
    if (count_all > 0):
      pct_true = count_true / count_all
      pct_threshold = count_threshold / count_all

    short_file = os.path.basename(file)
    
    this_item = [short_file, count_all, count_true, round(pct_true, 2), count_threshold, round(pct_threshold, 2)]
    print(this_item)

    summary_data.append(this_item)

print(summary_data)

 

['A. Donald McEachin.csv', 118, 6, 0.05, 1, 0.01]
['A. Drew Ferguson IV.csv', 133, 12, 0.09, 5, 0.04]
['Abby Finkenauer.csv', 151, 5, 0.03, 1, 0.01]
['Abigail Davis Spanberger.csv', 153, 8, 0.05, 1, 0.01]
['Adam B. Schiff.csv', 195, 76, 0.39, 26, 0.13]
['Adam Kinzinger.csv', 115, 16, 0.14, 5, 0.04]
['Adam Smith.csv', 131, 6, 0.05, 0, 0.0]
['Adrian Smith.csv', 187, 7, 0.04, 1, 0.01]
['Adriano Espaillat.csv', 146, 3, 0.02, 0, 0.0]
['Al Green.csv', 146, 18, 0.12, 10, 0.07]
['Al Lawson, Jr..csv', 140, 2, 0.01, 1, 0.01]
['Alan S. Lowenthal.csv', 165, 11, 0.07, 0, 0.0]
['Albio Sires.csv', 102, 3, 0.03, 1, 0.01]
['Alcee L. Hastings.csv', 103, 5, 0.05, 1, 0.01]
['Alexander X. Mooney.csv', 92, 7, 0.08, 1, 0.01]
['Alexandria Ocasio-Cortez.csv', 136, 6, 0.04, 0, 0.0]
['Alma S. Adams.csv', 113, 3, 0.03, 0, 0.0]
['Ami Bera.csv', 103, 3, 0.03, 0, 0.0]
['Amy Klobuchar.csv', 173, 13, 0.08, 3, 0.02]
['André Carson.csv', 151, 12, 0.08, 1, 0.01]
['Andy Barr.csv', 124, 5, 0.04, 3, 0.02]
['Andy Biggs.csv',
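
The counting logic in the loop above can be separated from the model and the files, which makes it easier to check by hand. A minimal sketch, assuming the per-tweet scores are already available as a list of numbers (the function name `tally` and the example scores are hypothetical):

```python
def tally(scores, threshold=0.80):
    """Count scores above 0.50 and above a stricter threshold."""
    count_all = len(scores)
    count_true = sum(1 for s in scores if s > 0.50)
    count_threshold = sum(1 for s in scores if s > threshold)

    # avoid division by zero for files with no usable tweets
    pct_true = count_true / count_all if count_all else 0.0
    pct_threshold = count_threshold / count_all if count_all else 0.0

    return [count_all, count_true, round(pct_true, 2),
            count_threshold, round(pct_threshold, 2)]

# Hypothetical scores for four tweets
example = tally([0.10, 0.55, 0.90, 0.30])
print(example)

# An empty file yields zeros rather than an error
empty = tally([])
print(empty)
```

The returned list has the same order as the summary rows above, minus the file name.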

In [0]:
summary_data

In [0]:
output_csv_df = pd.DataFrame(summary_data, columns=['file', 'count_all', 'count_true', 'pct_true', 'count_threshold', 'pct_threshold'])

In [0]:
output_csv_df

In [0]:
output_csv_name = f'{save_path}/summary-output-5-11-2020.csv'
output_csv_df.to_csv(output_csv_name, index=False)
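
Once the summary exists, one way to use it is to rank the files by the share of tweets that cleared the threshold. A minimal sketch using three rows copied from the printed output above (the variable name `ranked` is just illustrative):

```python
# Three rows copied from the printed summary above:
# [file, count_all, count_true, pct_true, count_threshold, pct_threshold]
rows = [
    ['Adam Smith.csv', 131, 6, 0.05, 0, 0.0],
    ['Adam B. Schiff.csv', 195, 76, 0.39, 26, 0.13],
    ['Al Green.csv', 146, 18, 0.12, 10, 0.07],
]

# Sort by pct_threshold (the last column), highest first
ranked = sorted(rows, key=lambda row: row[5], reverse=True)

for row in ranked:
    print(row[0], row[5])
```

With the DataFrame built earlier, the same ordering should be available through `output_csv_df.sort_values('pct_threshold', ascending=False)`.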

In [0]:
!head summary.csv

file,count_all,count_true,pct_true,count_threshold,pct_threshold
file,count_all,count_true,pct_true,count_threshold,pct_threshold
CA-10 - Sheet1 (1).csv,53,2,0.04,0,0.0
CA-50-Najjar - Sheet1.csv,106,6,0.06,2,0.02


In [0]:
!rm summary.csv