# Remove Persons from the Title and Text Fields

Running this notebook end to end is beyond the resources available to me.

On Colab it would take approximately 14 hours, which exceeds the maximum runtime of 12 hours.
On my local computer it would take about 125 hours (roughly 5 days).

# Prerequisites

This notebook should be run after eda.ipynb and text_pre_processing.ipynb.

# Imports and Constants

In [2]:
!pip install stanza



In [3]:
import pandas as pd
import numpy as np
import ast
import re  # needed for the re.sub calls in the cleaning cells below
import stanza
import random
from nltk.tokenize import word_tokenize

In [4]:
DATA_PATH = '../data/'
PRE_PROCESSED_DATA_FILE_NAME = 'news_dataset_pre_processed.csv'
RANDOM_STATE = 42
SAVE_FILE = False

In [5]:
random.seed(a=RANDOM_STATE)

# Load Data

In [6]:
df = pd.read_csv(DATA_PATH + PRE_PROCESSED_DATA_FILE_NAME, 
                 low_memory = False)

In [7]:
# although these tokens will not be used in this notebook they need to be
# converted to lists of strings so that they can be saved in the updated file
df.clean_text_tokens = df.clean_text_tokens.map(ast.literal_eval)
df.clean_title_tokens = df.clean_title_tokens.map(ast.literal_eval)

In [8]:
df.head()

Unnamed: 0,title,text,subject,date,label,title_len,text_len,caps_in_title,norm_caps_in_title,caps_in_text,norm_caps_in_text,text_tokens,text_urls,clean_text,title_urls,twitter_handles,clean_title,clean_text_tokens,clean_title_tokens
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,2017-12-31,fake,79,2893,11,0.139241,138,0.047701,"['Donald', 'Trump', 'just', 'couldn', 't', 'wi...",['pic.twitter.com/4FPAe2KypA'],donald trump just couldn t wish all americans ...,[],"['@realDonaldTrump', '@TalbertSwan', '@calvins...",donald trump sends out embarrassing new year’s...,"[donald, trump, just, couldn, wish, all, ameri...","[donald, trump, sends, out, embarrassing, new,..."
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,2017-12-31,fake,69,1898,8,0.115942,88,0.046365,"['House', 'Intelligence', 'Committee', 'Chairm...",[],house intelligence committee chairman devin nu...,[],[],drunk bragging trump staffer started russian c...,"[house, intelligence, committee, chairman, dev...","[drunk, bragging, trump, staffer, started, rus..."
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,2017-12-30,fake,90,3597,15,0.166667,308,0.085627,"['On', 'Friday', 'it', 'was', 'revealed', 'tha...","['pic.twitter.com/XtZW5PdU2b', 'pic.twitter.co...","on friday, it was revealed that former milwauk...",[],"['@SheriffClarke', '@SheriffClarke', '@KeithLe...",sheriff david clarke becomes an internet joke ...,"[on, it, was, revealed, that, former, milwauke...","[sheriff, david, clarke, becomes, an, internet..."
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,2017-12-29,fake,78,2774,19,0.24359,123,0.04434,"['On', 'Christmas', 'day', 'Donald', 'Trump', ...","['https://t.co/Fg7VacxRtJ', 'pic.twitter.com/5...","on christmas day, donald trump announced that ...",[],"['@pbump', '@_cingraham', '@_cingraham', '@_ci...",trump is so obsessed he even has obama’s name ...,"[on, christmas, day, donald, trump, announced,...","[trump, is, so, obsessed, he, even, has, obama..."
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,2017-12-25,fake,70,2346,11,0.157143,63,0.026854,"['Pope', 'Francis', 'used', 'his', 'annual', '...",[],pope francis used his annual christmas day mes...,[],[],pope francis just called out donald trump duri...,"[pope, francis, used, his, annual, christmas, ...","[pope, francis, just, called, out, donald, tru..."


# Remove Persons

Persons identified by [Stanza's](https://stanfordnlp.github.io/stanza) Named Entity Recognition (NER) model will be removed from the raw title and text fields. The other pre-processing steps from text_pre_processing.ipynb will then be applied to the resulting fields. They have to be repeated here because the NER is less effective when run on the already pre-processed text.

In [9]:
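def find_and_remove_persons(text, verbose=False):
    """Using the Stanza NER, remove all entities identified as PERSONs from the text."""

    doc = nlp(text)

    # Stanza tags tokens with a BIOES scheme. Only the first (B-) and last (E-)
    # tokens of a PERSON entity are dropped here; tokens tagged I-PERSON or
    # S-PERSON (single-token names) are kept.
    clean_text = ' '.join([token.text for sent in doc.sentences for token in sent.tokens
                           if token.ner not in ['B-PERSON', 'E-PERSON']])

    if verbose:
        # collect the distinct PERSON entities so they can be inspected
        people_set = set()

        for ent in doc.entities:
            if ent.type == 'PERSON':
                people_set.add(ent.text)

        print(people_set)
        print('')
        print(clean_text)
        print('')

    return clean_text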
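As a rough sketch of what filtering on all four PERSON tags of the BIOES scheme would look like, the snippet below works on hand-written `(token, tag)` pairs instead of a real Stanza document. The names `PERSON_TAGS` and `drop_person_tokens` and the sample tags are assumptions made for illustration; they are not part of this notebook or of Stanza.

```python
# Hypothetical helper: drop every token that belongs to a PERSON entity.
# BIOES prefixes: B = begin, I = inside, E = end, S = single-token entity.
PERSON_TAGS = {'B-PERSON', 'I-PERSON', 'E-PERSON', 'S-PERSON'}


def drop_person_tokens(tagged_tokens):
    """Join the tokens whose NER tag is not one of the PERSON tags."""
    return ' '.join(text for text, tag in tagged_tokens if tag not in PERSON_TAGS)


# Hand-written sample, not real Stanza output.
sample = [
    ('George', 'B-PERSON'), ('H.W.', 'I-PERSON'), ('Bush', 'E-PERSON'),
    ('met', 'O'), ('Profumo', 'S-PERSON'), ('in', 'O'), ('London', 'S-GPE'),
]

print(drop_person_tokens(sample))
```

With the `['B-PERSON', 'E-PERSON']` filter, `H.W.` and `Profumo` would stay in the output of this sample.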

In [10]:
# initialize the stanza object
stanza.download('en')
nlp = stanza.Pipeline('en')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 1.55MB/s]                    
2020-10-14 08:41:35 INFO: Downloading default packages for language: en (English)...
2020-10-14 08:41:42 INFO: File exists: /Users/freethrall/stanza_resources/en/default.zip.
2020-10-14 08:41:50 INFO: Finished downloading models and saved to /Users/freethrall/stanza_resources.
2020-10-14 08:41:50 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| sentiment | sstplus   |
| ner       | ontonotes |

2020-10-14 08:41:50 INFO: Use device: cpu
2020-10-14 08:41:50 INFO: Loading: tokenize
2020-10-14 08:41:51 INFO: Loading: pos
2020-10-14 08:41:52 INFO: Loading: lemma
2020-10-14 08:41:52 INFO: Loading: depparse
2020-10-14 08:41:53 INFO: Loading: sentiment
2020-10-14 08:41:54 INFO: Loading: 
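The log above shows the default English pipeline loading six processors, while `find_and_remove_persons` only reads tokens and their NER tags. A possible way to cut the runtime, which I have not timed, is to load only the `tokenize` and `ner` processors. The function name `build_ner_pipeline` is an assumption for this sketch, and it presumes the English models have already been fetched with `stanza.download('en')`.

```python
def build_ner_pipeline(use_gpu=True):
    """Build a Stanza pipeline limited to tokenization and NER (untimed sketch)."""
    # imported here so the sketch can be read without stanza installed
    import stanza

    # use_gpu only has an effect when a CUDA device is available
    return stanza.Pipeline(lang='en', processors='tokenize,ner', use_gpu=use_gpu)
```

Usage would be `nlp = build_ner_pipeline()` in place of `stanza.Pipeline('en')`.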

Let's do a test run to see what PERSONs are found and which ones aren't.

## Test Run

In [11]:
for i in range(10):
    index = random.randrange(len(df))  # randint(0, len(df)) is inclusive and could index past the last row
    clean_text = find_and_remove_persons(df.iloc[index].text, verbose=True)

{'Abre  Conner', 'Novella Coleman'}

Two African - American women who went for what they hoped would be a fun night out at a karaoke bar in Fresno ended up in a scene from a 1980 s after school special about racism . The women had ordered some drinks already and were waiting to sing Waterfalls by TLC when the bartenders began treating them like they were the only two black people in a bar full of racists . As it turned out , they were the only two black women in a bar full of racists . According to the women , a bartender told them they couldn t sing karaoke unless they ordered drinks , which they had already done . Another rather large bartender got right in their faces and screamed buy drinks at them . Ultimately the women were asked to leave . The bar , which is described as a dive on Facebook , unfortunately bit off more than it could chew as the two women turned out to be lawyers for the American Civil Liberties Union . and , the two women treated so horribly , wrote an article on

{'Christine Keeler', 'Profumo', 'Harold Macmillan', 'Seymour Platt', 'Chris', 'Arne Jacobsen', 'Yevgeny Ivanov', 'Stephen Ward', 'Mandy Rice-Davies', 'Stephen', 'Keeler', 'William Astor', 'Platt', 'John Profumo'}

LONDON ( Reuters ) - , the model and dancer whose liaisons with a British minister and a Soviet diplomat at the height of the Cold War shocked Britain and embroiled the government in a notorious political sex scandal , has died aged 75 . Keeler s relationship with married Minister of War , whom she met , aged 19 , while swimming naked at the grand Buckinghamshire estate of his colleague , shocked socially conservative Britain in the early 1960s . Front - page revelations that she was also having an affair with a Soviet naval attache , , titillated the public and shone a light on the social and sexual mores of Britain s secretive ruling establishment . Profumo was forced to resign after lying to parliament about their relationship . The political and diplomatic firestorm helpe

{'Mike Lee', 'Larry Nichols', 'Bobby Jindal', 'Tulsi Gabbard', 'Pete Hoekstra', 'Chris Christie', 'Stephen Hadley', 'George H.W. Bush', 'Linda McMahon', 'Richard Grenell', 'James Connaughton', 'Carol Comer', 'William Pryor', 'Mike Pompeo', 'Ben Carson', 'Robert Grady', 'Tom Price', 'Jeb Hensarling', 'Rick Perry', 'Mitt Romney', 'Reince Priebus', 'Dan DiMicco', 'Jeff Holmstead', 'Zalmay Khalilzad', 'Jeff Sessions', 'Trump', 'Jamie Dimon', 'John Bolton', 'Jan Brewer', 'Steve Bannon', 'Jim Talent', 'Michael McCaul', 'George W. Bush', 'Tom Barrack', 'Joe Arpaio', 'Peter King', 'Nikki Haley', 'Bob Corker', 'Elaine Chao', 'Leslie Rutledge', 'Rich Bagger', 'Sarah Palin', 'Harold Hamm', 'Michael Flynn', 'Mary Fallin', 'Rudy Giuliani', 'Donald Trump', 'Robert Cardillo', 'Tom Cotton', 'Jonathan Gray', 'Duncan Hunter', 'Jon Kyl', 'James Mattis', 'Wilbur Ross', 'Mike Rogers', 'Chao', 'David Clarke', 'David Petraeus', 'Mitch McConnell', 'Steven Mnuchin', 'Andrew Puzder', 'Kris Kobach', 'Forrest Luc

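The NER does not do a perfect job, but appears to do a pretty good one. Some names are still left in the cleaned text even though they appear in the detected set (for example, Profumo and Keeler in the second sample). These look like single-token names, which the token filter in find_and_remove_persons does not drop because it only matches the B-PERSON and E-PERSON tags. At least in these samples, no non-PERSONs are identified as PERSONs.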

## Remove PERSONs

In [13]:
df['title_no_persons'] = df['title'].apply(find_and_remove_persons)
df['text_no_persons'] = df['text'].apply(find_and_remove_persons)

## Remove URLs

In [None]:
df['title_no_persons'] = df['title_no_persons'].apply(lambda x: re.sub(URL_REGEX, '{link}', x))
df['text_no_persons'] = df['text_no_persons'].apply(lambda x: re.sub(URL_REGEX, '{link}', x))
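`URL_REGEX` is not defined in this notebook; it presumably comes from text_pre_processing.ipynb. The pattern below is only a stand-in so the substitution can be tried in isolation. It is an assumption, modeled on the two URL shapes visible in the `text_urls` column above, and is not the original expression.

```python
import re

# Hypothetical stand-in for URL_REGEX: http(s) links and bare pic.twitter.com links.
URL_REGEX = r'https?://\S+|pic\.twitter\.com/\S+'

sample = 'see https://t.co/Fg7VacxRtJ and pic.twitter.com/4FPAe2KypA now'
print(re.sub(URL_REGEX, '{link}', sample))
```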

## Anonymise Twitter Handles

In [None]:
df['title_no_persons'] = df['title_no_persons'].apply(lambda x: re.sub(TWITTER_HANDLE_REGEX, '@twitter-handle', x))
df['text_no_persons'] = df['text_no_persons'].apply(lambda x: re.sub(TWITTER_HANDLE_REGEX, '@twitter-handle', x))
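`TWITTER_HANDLE_REGEX` is likewise not defined in this notebook. The pattern below is an assumed stand-in, not the original: it matches an `@` followed by word characters, and the lookbehind skips an `@` that directly follows a word character, as in an e-mail address.

```python
import re

# Hypothetical stand-in for TWITTER_HANDLE_REGEX.
TWITTER_HANDLE_REGEX = r'(?<!\w)@\w+'

sample = 'thanks @realDonaldTrump and @_cingraham!'
print(re.sub(TWITTER_HANDLE_REGEX, '@twitter-handle', sample))
```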

## Lowercase except for All Caps

In [None]:
def lower_unless_all_caps(string_):
    """
    Make all words in the input string lowercase unless that 
    word is in all caps
    """
    words = string_.split()
    processed_words = [w.lower() if not (w.isupper() and len(w) > 1) else w for w in words]
    return ' '.join(processed_words)

In [None]:
df['title_no_persons'] = df['title_no_persons'].apply(lower_unless_all_caps)
df['text_no_persons'] = df['text_no_persons'].apply(lower_unless_all_caps)
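A small check of the lowercasing rule on a made-up headline. The function is repeated so the snippet runs on its own, and the sample string is invented for illustration.

```python
def lower_unless_all_caps(string_):
    """Lowercase every word unless it is all caps and longer than one character."""
    words = string_.split()
    processed_words = [w.lower() if not (w.isupper() and len(w) > 1) else w for w in words]
    return ' '.join(processed_words)


# 'BREAKING' and 'NO' stay as they are; the single letter 'A' is lowercased.
print(lower_unless_all_caps('BREAKING News: A Senator Said NO'))
```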

## Numbers

Numbers in general do not seem likely to indicate fake news, although specific dates or numbers might. The only one I've come across that may carry significant meaning is 9/11, so I will change it to nine-eleven before the remaining numbers are removed.

I will replace each number with a space rather than an empty string, because some sentences run together where one ends with a number. Replacing the number with a space splits them.
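The order of the two substitutions matters, which the following sketch illustrates on an invented sentence. The helper name `replace_numbers` is an assumption for this example; the notebook applies the two `re.sub` calls in separate cells.

```python
import re


def replace_numbers(text):
    """Protect 9/11 first, then replace every remaining run of digits with a space."""
    text = re.sub(r'9/11', 'nine-eleven', text)
    return re.sub(r'\d+', ' ', text)


# '2016The' is a run-together sentence boundary; the space separates the words.
print(replace_numbers('after 9/11 he won in 2016The vote').split())

# If digits were removed first, only the slash of 9/11 would survive.
print(re.sub(r'\d+', ' ', 'after 9/11').split())
```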

In [None]:
df['title_no_persons'] = df['title_no_persons'].apply(lambda x: re.sub(r'9\/11', 'nine-eleven', x))
df['text_no_persons'] = df['text_no_persons'].apply(lambda x: re.sub(r'9\/11', 'nine-eleven', x))

In [None]:
df['title_no_persons'] = df['title_no_persons'].apply(lambda x: re.sub(r'\d+', ' ', x))
df['text_no_persons'] = df['text_no_persons'].apply(lambda x: re.sub(r'\d+', ' ', x))

## Remove (reuters) from news stories.

Keeping (reuters) in the news text would let a model key on the source tag rather than on the content, so it would overfit and generalize poorly to data outside the current dataset.

In [None]:
df['text_no_persons'] = df['text_no_persons'].apply(lambda x: re.sub(r'\(\s*reuters\s*\)', ' ', x))
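A quick look at what this pattern does and does not match, on invented strings. The wrapper name `strip_reuters` is an assumption for this sketch. Since `lower_unless_all_caps` leaves all-caps words untouched, an all-caps tag would not be matched by this case-sensitive pattern; I have not checked whether that form occurs in the data.

```python
import re


def strip_reuters(text):
    """Replace a lowercase '(reuters)' tag, with optional inner spaces, by a space."""
    return re.sub(r'\(\s*reuters\s*\)', ' ', text)


print(strip_reuters('london ( reuters ) - the model').split())
print(strip_reuters('washington (reuters) - talks').split())
print(strip_reuters('WASHINGTON (REUTERS) - talks'))  # left unchanged
```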

## Tokenize Title and Text

In [None]:
df['tokens_title_no_persons'] = df['title_no_persons'].apply(word_tokenize)
df['tokens_text_no_persons'] = df['text_no_persons'].apply(word_tokenize)

## Remove Punctuation and Single Letter Tokens from Text

I will remove all punctuation tokens except the exclamation point, which seems like it may be an indicator of fake news. I will also remove all single-character tokens except i.

In [None]:
def remove_single_characters(word_list, exception_list):
    """Remove all the single characters, except those on the exception list"""
    return [w for w in word_list if (len(w) > 1 or w in exception_list)]

In [None]:
df['tokens_title_no_persons'] = df['tokens_title_no_persons'].apply(lambda x: remove_single_characters(x, ['i', '!']))
df['tokens_text_no_persons'] = df['tokens_text_no_persons'].apply(lambda x: remove_single_characters(x, ['i', '!']))
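To make the effect of the filter concrete, here is the same function applied to an invented token list. Note that it only looks at token length, so a multi-character punctuation token such as `...` passes through.

```python
def remove_single_characters(word_list, exception_list):
    """Remove all the single characters, except those on the exception list"""
    return [w for w in word_list if (len(w) > 1 or w in exception_list)]


# Invented tokens: ',', 'u' and '.' are dropped; 'i', '!' and '...' are kept.
tokens = ['i', 'said', ',', 'no', '!', 'u', '.', '...', 'ok']
print(remove_single_characters(tokens, ['i', '!']))
```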

## Remove 's

In the fake news articles the apostrophe was frequently, perhaps always, removed from 's, but that does not appear to have been done to the true news articles. The 's token therefore needs to be removed so that it doesn't become an indicator of true news.
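`remove_words` is used in the following cells but is not defined in this notebook; it presumably lives in text_pre_processing.ipynb. A minimal sketch of what such a helper could look like follows. The body is an assumption and may differ from the original.

```python
def remove_words(word_list, words_to_remove):
    """Hypothetical helper: return word_list without any token found in words_to_remove."""
    to_remove = set(words_to_remove)  # set membership keeps the per-token check cheap
    return [w for w in word_list if w not in to_remove]


print(remove_words(['trump', "'s", 'tweet'], ["'s"]))
```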

In [None]:
df['tokens_title_no_persons'] = df['tokens_title_no_persons'].apply(lambda x: remove_words(x, ["'s"]))
df['tokens_text_no_persons'] = df['tokens_text_no_persons'].apply(lambda x: remove_words(x, ["'s"]))

## Remove Date Words

After running some initial models, I noticed that Wednesday came up as an important feature. To help the models generalize, I will remove all day and month names.

In [None]:
date_words = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday',
              'saturday', 'sunday', 'january', 'february', 'march', 'april',
              'may', 'june', 'july', 'august', 'september', 'october',
              'november', 'december']

In [None]:
df['tokens_title_no_persons'] = df['tokens_title_no_persons'].apply(lambda x: remove_words(x, date_words))
df['tokens_text_no_persons'] = df['tokens_text_no_persons'].apply(lambda x: remove_words(x, date_words))

## Save Data

In [None]:
if SAVE_FILE:
    df.to_csv(DATA_PATH + PRE_PROCESSED_DATA_FILE_NAME, index=False)