## Scraping politifact.com TRUTH-O-METER
* Gets 'Latest Fact-Checks' from PolitiFact.com
    - Each statement is categorized as 'true', 'mostly true', 'half true', 'mostly false', 'false' or 'absolutely false' based on PolitiFact.com's rating ('absolutely false' is the label used here for PolitiFact's "Pants on Fire" rating)

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import string
import pandas as pd

In [3]:
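def clean_bs4_result(bs4_find_all_ResultSet):
    """Return the text of each tag in a bs4 ResultSet, stripped of newlines and punctuation."""
    clean_words = []
    texts = [i.text for i in bs4_find_all_ResultSet]
    for text in texts:
        # Remove newlines
        text = re.sub("\n", "", text)
        # Remove ASCII punctuation
        text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
        # Remove curly quotes and ellipses, which string.punctuation does not cover
        text = re.sub('[‘’“”…]', '', text)
        # Optional: remove words containing digits
        # text = re.sub(r'\w*\d\w*', '', text)
        clean_words.append(text)
    return clean_words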
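
To see what each substitution in `clean_bs4_result` does, the sketch below applies the same three steps to plain strings. The helper name `clean_text` and the sample inputs are hypothetical, chosen for illustration; they are not taken from the scraped pages.

```python
import re
import string


def clean_text(text):
    """Apply the same three substitutions as clean_bs4_result to one string."""
    # Remove newlines
    text = re.sub("\n", "", text)
    # Remove ASCII punctuation such as $ . , % -
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # Remove curly quotes and ellipses
    text = re.sub("[‘’“”…]", "", text)
    return text


# Hypothetical sample inputs (assumptions, not scraped data)
samples = [
    "\n“Voted for over $1 billion in new taxes.”\n",
    "hard-earned income",
]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

Because every ASCII punctuation character is stripped, currency and percent signs disappear and hyphenated words are fused, which is why tokens such as `hardearned` and `70000000` show up in the cleaned statements.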

### Web Scraping

In [4]:
# Assigning url variables
true = "https://www.politifact.com/factchecks/list/?ruling=true"
mostly_true = "https://www.politifact.com/factchecks/list/?ruling=mostly-true"
half_true = "https://www.politifact.com/factchecks/list/?ruling=half-true"
mostly_false = "https://www.politifact.com/factchecks/list/?ruling=barely-true"
false = "https://www.politifact.com/factchecks/list/?ruling=false"
absolutely_false = "https://www.politifact.com/factchecks/list/?ruling=pants-fire"

# Creating list of urls
urls = [true, mostly_true, half_true, mostly_false, false, absolutely_false]

# Scraping
soup_list = []
for url in urls:
    page = requests.get(url).text
    # Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
    soup = BeautifulSoup(page, "html.parser")
    soup_list.append(soup)
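
The six URLs differ only in the `ruling` query parameter, so they could also be generated from a mapping of labels to slugs. This is a minimal sketch with the slugs copied from the URLs in the cell above; `BASE_URL`, `RULING_SLUGS` and `build_urls` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.politifact.com/factchecks/list/"

# Notebook label -> slug used in the `ruling` query parameter
RULING_SLUGS = {
    "true": "true",
    "mostly_true": "mostly-true",
    "half_true": "half-true",
    "mostly_false": "barely-true",
    "false": "false",
    "absolutely_false": "pants-fire",
}


def build_urls(base_url=BASE_URL, slugs=RULING_SLUGS):
    """Return a dict mapping each label to its fact-check list URL."""
    return {
        label: f"{base_url}?{urlencode({'ruling': slug})}"
        for label, slug in slugs.items()
    }


ruling_urls = build_urls()
# Same order as the hand-written list, since dicts keep insertion order
urls = list(ruling_urls.values())
print(urls[3])
```

Keeping the mapping in one place makes the non-obvious slugs (`barely-true` for "mostly false", `pants-fire` for "absolutely false") easy to spot.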

In [5]:
clean_statements = []
clean_names = []

for soup in soup_list:
    # Finding desired text in soup
    statement = soup.find_all(class_="m-statement__quote")
    # Cleaning text
    clean_statement = clean_bs4_result(statement)
    clean_statements.append(clean_statement)
    # Finding more desired text in soup
    name = soup.find_all(class_="m-statement__name")
    # Cleaning text
    clean_name = clean_bs4_result(name)
    clean_names.append(clean_name)

clean_statements[0][1]

'Cal Cunningham voted for over 1 billion in new taxes'

### Truth-Meter DataFrames

In [6]:
# Constructing truth-meter DataFrames
true_df = pd.DataFrame({"name":clean_names[0], "true":clean_statements[0]})
mostly_true_df = pd.DataFrame({"name":clean_names[1], "mostly_true":clean_statements[1]})
half_true_df = pd.DataFrame({"name":clean_names[2], "half_true":clean_statements[2]})
mostly_false_df = pd.DataFrame({"name":clean_names[3], "mostly_false":clean_statements[3]})
false_df = pd.DataFrame({"name":clean_names[4], "false":clean_statements[4]})
absolutely_false_df = pd.DataFrame({"name":clean_names[5], "absolutely_false":clean_statements[5]})
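
`pd.DataFrame` raises a `ValueError` when the name and statement lists for a rating differ in length, which can happen if a page yields an unequal number of matches for the two CSS classes. One possible alternative to six separate frames is a single list of records with a `rating` field, checked for length mismatches first. The sketch below uses only the standard library; `RATINGS`, `to_records` and the demo data are hypothetical.

```python
RATINGS = ["true", "mostly_true", "half_true",
           "mostly_false", "false", "absolutely_false"]


def to_records(names_by_rating, statements_by_rating, ratings=RATINGS):
    """Pair names with statements and tag each pair with its rating."""
    records = []
    for rating, names, statements in zip(ratings, names_by_rating, statements_by_rating):
        # Fail early with a readable message instead of a pandas error later
        if len(names) != len(statements):
            raise ValueError(
                f"{rating}: {len(names)} names but {len(statements)} statements"
            )
        for name, statement in zip(names, statements):
            records.append({"rating": rating, "name": name, "statement": statement})
    return records


# Placeholder data standing in for clean_names / clean_statements
demo_names = [["Speaker A", "Speaker B"], ["Speaker C"]]
demo_statements = [["first statement", "second statement"], ["third statement"]]
records = to_records(demo_names, demo_statements)
print(records[2])
```

With the scraped lists, `pd.DataFrame(to_records(clean_names, clean_statements))` would give one long-form frame that can be filtered by `rating`.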

In [16]:
true_df.head()

,name,true
0,Michelle Obama,In one of the states that determined the outco...
1,National Republican Senatorial Committee,Cal Cunningham voted for over 1 billion in new...
2,Cory Booker,Says the US Senate is dominated by millionaire...
3,Instagram posts,The GOP lawyer who helped Kanye West get on th...
4,MJ Hegar,Says 1 in 5 Texans did not have health insuran...


### Truth-Meter Strings

In [8]:
# Constructing Truth-Meter string objects: all statements for a rating joined into one string
true_string = " ".join(true_df["true"])
mostly_true_string = " ".join(mostly_true_df["mostly_true"])
half_true_string = " ".join(half_true_df["half_true"])
mostly_false_string = " ".join(mostly_false_df["mostly_false"])
false_string = " ".join(false_df["false"])
absolutely_false_string = " ".join(absolutely_false_df["absolutely_false"])

In [9]:
true_string

'In one of the states that determined the outcome of the 2016 presidential race the winning margin averaged out to just two votes per precinct two votes Cal Cunningham voted for over 1 billion in new taxes Says the US Senate is dominated by millionaires and that he is not one of them The GOP lawyer who helped Kanye West get on the ballot in Wisconsin is actively working for Donald Trumps campaign Says 1 in 5 Texans did not have health insurance coverage before the pandemic and now nearly 1 in 3 Texans under the age of 65 dont have access to health care insurance On home health workers low pay and limited benefits 40 are still on SNAP or Medicaid Murders this year have spiked 27 in Philadelphia Taxpayers spent 70000000 to develop this drug  remdesivir The average family in America forks over more of their hardearned income to their local hospital than to the IRS Homicides are intraracial When you look at whiteonwhite crime its 84 Right Homicides White people kill white people Higher edu

### Document Term Matrix - True Statements

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english',ngram_range=(1,2))
data_cv = cv.fit_transform(true_df.true)
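# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())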
data_dtm.index=true_df.name
# Dropping undesired numeric columns; errors="ignore" skips any that are absent from the current scrape
data_dtm = data_dtm.drop(["12","14","17","1942","1960s","1965","20","200","2016","201819","27","40","65","70000000","84"], axis=1, errors="ignore")
data_dtm.head()

name,12 states,14 wisconsin,17 billion,1942 milwaukee,20 african,2016 presidential,201819 school,27 philadelphia,40 billion,40 snap,...,worn public,year,year 17,year contributed,year spiked,year year,ymca,ymca brochure,york,york 12
Michelle Obama,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
National Republican Senatorial Committee,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Cory Booker,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Instagram posts,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MJ Hegar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
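
The hard-coded list of columns to drop depends on whichever statements happened to be scraped, so it goes stale when the 'Latest Fact-Checks' pages change. A possible alternative is to select the columns by pattern. The sketch below picks out single-token features that start with a digit; `numeric_unigrams` is a hypothetical helper, and the pattern is an assumption about what counts as "undesired", inferred from the names in the drop list.

```python
import re


def numeric_unigrams(feature_names):
    """Return single-token features that start with a digit, e.g. '12' or '1960s'."""
    # \w does not match a space, so bigrams such as '12 states' are kept
    return [name for name in feature_names if re.fullmatch(r"\d\w*", name)]


# A few feature names taken from the matrix columns and the drop list
features = ["12", "12 states", "1960s", "201819 school", "70000000", "year", "york 12"]
to_drop = numeric_unigrams(features)
print(to_drop)
```

Applied to the matrix, this would read `data_dtm.drop(columns=numeric_unigrams(data_dtm.columns))`.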
