[![Image from Gyazo](https://i.gyazo.com/2ea8e65b2ca18b372527bc2a1e3838c0.png)](https://gyazo.com/2ea8e65b2ca18b372527bc2a1e3838c0)

# NLP - Data Cleaning

## Fake News Detection

                                             Pau Roger Puig-Sureda
                                             15/05/2018

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

### Import & Explore train data

In [2]:
train = pd.read_csv('fake_or_real_news_training.csv')
train.head(3)

| | ID | title | text | label | X1 | X2 |
|---|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | | |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | | |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL | | |


Sample:

In [3]:
train.loc[0]

ID                                                    8476
title                         You Can Smell Hillary’s Fear
text     Daniel Greenfield, a Shillman Journalism Fello...
label                                                 FAKE
X1                                                     NaN
X2                                                     NaN
Name: 0, dtype: object

In [4]:
train.loc[0]['title']

'You Can Smell Hillary’s Fear'

In [5]:
train.loc[0]['text']



In [6]:
train.loc[0]['label']

'FAKE'

Shape of the dataset:

In [7]:
train.shape

(3999, 6)

In [8]:
train.describe(include='all')

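| | ID | title | text | label | X1 | X2 |
|---|---|---|---|---|---|---|
| count | 3999.0 | 3999 | 3999 | 3999 | 33 | 2 |
| unique | | 3968 | 3839 | 35 | 4 | 2 |
| top | | OnPolitics \| 's politics blog | Killing Obama administration rules, dismantlin... | REAL | REAL | FAKE |
| freq | | 4 | 41 | 1990 | 17 | 1 |
| mean | 5288.250063 | | | | | |
| std | 3045.011156 | | | | | |
| min | 3.0 | | | | | |
| 25% | 2696.0 | | | | | |
| 50% | 5250.0 | | | | | |
| 75% | 7907.5 | | | | | |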


#### Variables

**ID**: ID of the news report.

**Title**: Title of the news report.

**Text**: Textual content of the news report.

**Label**: Target Variable - [FAKE, REAL].

**X1, X2**: Additional fields.

In [9]:
train['X1'].describe()

count       33
unique       4
top       REAL
freq        17
Name: X1, dtype: object

In [10]:
Counter(train['X1'])

Counter({nan: 3966,
         'REAL': 17,
         'FAKE': 14,
         'PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE.\xa0Congress may have spent August away from Washington but Planned Parenthood’s campaign to convince lawmakers to protect the group’s funding followed them back to their home states. Power Post has more.\n\n“Lawmakers will raise the stakes when Congress returns next week by threatening to defund the group through the federal appropriations process. Planned Parenthood’s counter-offensive is widespread and varied and is unfolding inside and outside the Beltway. The group has been\xa0organizing rallies, flooding lawmakers’ town hall meetings, commissioning polls, shelling\xa0out six figures for television\xa0ads and\xa0hiring forensics experts to try to discredit undercover video footage that sparked the controversy. The success of these lobbying efforts will be tested when Congress returns and must move a short-term spending bill to keep the government open. Some conserv

This variable is probably useless.
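A pandas-native way to get the same kind of tally is `value_counts(dropna=False)`, which also reports how many entries are missing. The sketch below uses a small made-up Series as a stand-in for the `X1` column; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'X1' column: mostly missing, a few stray values.
x1 = pd.Series([np.nan, "REAL", np.nan, "FAKE", "REAL", np.nan], name="X1")

# dropna=False keeps the missing values as their own category.
counts = x1.value_counts(dropna=False)
print(counts)
```

Compared with `Counter`, the result is a Series sorted by frequency, which is convenient when a column holds a few long stray strings.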

In [11]:
df = train[train['X1'].notnull()]
df.shape

(33, 6)

I will check whether the non-null values of the X1 variable match the values of the label variable.

In [12]:
Counter(train['label'])

Counter({'FAKE': 1976,
         'REAL': 1990,
         'Election Day: No Legal Pot In Ohio; Democrats Lose In The South\n\nTuesday is "off year" Election Day in parts of the country. Legalizing marijuana is on the ballot in Ohio, Houston voters will decide on an equal rights ordinance and San Francisco weighs short-term rentals in what\'s being called the "Airbnb Initiative."\n\nElsewhere, eyes are on governor races in Kentucky and Louisiana, and whether Democrats can make any progress in the South.\n\nHere\'s a look at some of the races:\n\nHouston voters will decide whether to keep an equal rights ordinance that was approved by the City Council last year. The Houston Equal Rights Ordinance (HERO) would ban discrimination based on sexual orientation and gender identity — criteria not covered by national anti-discrimination laws. The ordinance is hotly debated, particularly after some opposition ads were released. The ads claim that the ordinance would allow men who identify as women t

In [13]:
df1 = train[(train['label'] != 'REAL') & (train['label'] != 'FAKE')]
df1.shape

(33, 6)
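Both selections contain 33 rows, but equal counts alone do not prove that they are the same rows. A minimal sketch of a direct comparison, on a made-up frame that mimics the shifted columns (the frame `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the problem: row 1 has article text in 'label'
# and its real label shifted into 'X1'.
toy = pd.DataFrame({
    "label": ["FAKE", "Some article body ...", "REAL"],
    "X1": [np.nan, "REAL", np.nan],
})

has_x1 = toy["X1"].notnull()
bad_label = ~toy["label"].isin(["REAL", "FAKE"])

# True only if both conditions select exactly the same rows.
same_rows = bool((has_x1 == bad_label).all())
print(same_rows)
```

Applied to `train`, the same comparison would confirm or refute that every row with a value in `X1` is also a row with an invalid `label`.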

In [14]:
df1

| | ID | title | text | label | X1 | X2 |
|---|---|---|---|---|---|---|
| 192 | 599 | Election Day: No Legal Pot In Ohio | Democrats Lose In The South | Election Day: No Legal Pot In Ohio; Democrats ... | REAL | |
| 308 | 10194 | Who rode it best? Jesse Jackson mounts up to f... | Leonardo DiCaprio to the rescue? | Who rode it best? Jesse Jackson mounts up to f... | FAKE | |
| 382 | 356 | Black Hawk crashes off Florida | human remains found | (CNN) Thick fog forced authorities to suspend ... | REAL | |
| 660 | 2786 | Afghanistan: 19 die in air attacks on hospital | U.S. investigating | (CNN) Aerial bombardments blew apart a Doctors... | REAL | |
| 889 | 3622 | Al Qaeda rep says group directed Paris magazin... | US issues travel warning | A member of Al Qaeda's branch in Yemen said Fr... | REAL | |
| 911 | 7375 | Shallow 5.4 magnitude earthquake rattles centr... | shakes buildings in Rome | 00 UTC © USGS Map of the earthquake's epicent... | FAKE | |
| 1010 | 9097 | ICE Agent Commits Suicide in NYC | Leaves Note Revealing Gov’t Plans to Round-up... | Email Print After writing a lengthy suicide no... | FAKE | |
| 1043 | 9203 | Political Correctness for Yuengling Brewery | What About Our Opioid Epidemic? | We Are Change \n\nIn today’s political climate... | FAKE | |
| 1218 | 1602 | Poll gives Biden edge over Clinton against GOP... | VP meets with Trumka | A new national poll shows Vice President Biden... | REAL | |
| 1438 | 4562 | Russia begins airstrikes in Syria | U.S. warns of new concerns in conflict | Russian warplanes began airstrikes in Syria on... | REAL | |


It appears that X1 holds the labels that are missing from the label variable, except in 2 rows, whose labels might be in the X2 variable.

In [15]:
train.loc[3706]

ID                                                    9954
title    Incredible smoke haze seen outside NDTV office...
text                    bursting of firecrackers suspected
label    Incredible smoke haze seen outside NDTV office...
X1                                                    FAKE
X2                                                     NaN
Name: 3706, dtype: object

In [16]:
train.loc[3706]['title']

'Incredible smoke haze seen outside NDTV office after Arnab quits'

In [17]:
train.loc[3706]['text']

' bursting of firecrackers suspected'

In [18]:
train.loc[3706]['label']

'Incredible smoke haze seen outside NDTV office after Arnab quits; bursting of firecrackers suspected Posted on Tweet (Image via shutterstock.com) \nAn incredible smoke haze was spotted seen outside the NDTV office on Tuesday. Onlookers claimed that the reason was and uninhibited bursting of firecrackers by people in the building. \nOne onlooker claimed he also heard loud firecrackers-like noise near the NDTV office area followed by fumes curling up to form a V-sign. \nExperts say it will be difficult to ascertain the source of the emission. A leading pollution expert opined, “These days such peculiar fumes can be due to firecrackers during the Diwali season or because of pure human emotions giving rise to intense celebrations, revelry, etc.” \nThis incident, according to the onlooker, happened on Tuesday evening, minutes after the news of journalist Arnab Goswami quitting Times Now surfaced. \nThe UnReal Times could not verify whether the two events were correlated as the onlooker tri

In addition, the variable 'label' seems to contain the information that should be in the 'text' feature, and the variable 'text' appears to contain the end of the 'title'. When the dataset was created, the ';' in the headline was probably treated as the end of the 'title'. Let's try to confirm this hypothesis with another instance:

In [19]:
train.loc[2920]

ID                                                     496
title                     Nearly 300K New Jobs In February
text                      Unemployment Dips To 5.5 Percent
label    Nearly 300K New Jobs In February; Unemployment...
X1                                                    REAL
X2                                                     NaN
Name: 2920, dtype: object

In [20]:
train.loc[2920]['title']

'Nearly 300K New Jobs In February'

In [21]:
train.loc[2920]['text']

' Unemployment Dips To 5.5 Percent'

In [22]:
train.loc[2920]['label']

'Nearly 300K New Jobs In February; Unemployment Dips To 5.5 Percent\n\nThe U.S. economy added 295,000 jobs last month, according to the Labor Department\'s monthly survey, and the unemployment rate dropped to 5.5 percent. The latest strong data beat expectations and follow a robust jump the previous month — a sign that the nation\'s economy is finally picking up steam.\n\nEconomists had predicted the economy would add 240,000 jobs in February and that the unemployment rate would notch back down to 5.6 percent, where it stood for December. The slight increase in the rate last month was attributed to strong growth in the labor force.\n\nThe average workweek for nonfarm payrolls was 34.6 hours, a figure that has held steady for five months. The average hourly wage rose 3 cents, to $24.78.\n\nAs NPR\'s John Ydstie reported this morning ahead of this morning\'s release by the department\'s Bureau of Labor Statistics, the report for January "was stellar on almost every count. It revealed a m

Also, in the instances saved in 'df1', the title appears to be repeated at the beginning of the article text (which currently sits in 'label').
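Rather than inspecting instances one by one, the hypothesis can be tested on every shifted row at once: rebuild the headline from 'title' and 'text' and check whether the article stored in 'label' starts with it. This is a sketch on made-up rows; the frame `shifted` and the second row are assumptions added to show a non-matching case.

```python
import pandas as pd

# Toy rows shaped like the shifted records: the headline was cut in two
# and the full article (headline included) ended up in 'label'.
shifted = pd.DataFrame({
    "title": ["Nearly 300K New Jobs In February", "Unrelated headline"],
    "text": [" Unemployment Dips To 5.5 Percent", " second half"],
    "label": [
        "Nearly 300K New Jobs In February; Unemployment Dips To 5.5 Percent\n\nThe U.S. economy ...",
        "A body that does not repeat the headline",
    ],
})

# Rebuild the headline with ';' and test whether 'label' starts with it.
rebuilt = shifted["title"].str.cat(shifted["text"], sep=";")
matches = [body.startswith(head) for body, head in zip(shifted["label"], rebuilt)]
print(matches)
```

On the real data, the share of `True` values over `df1` would indicate how well the explanation holds.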

First, I will concatenate the variables 'title' and 'text' to restore the complete title in the proper feature:

In [23]:
# Rejoin the two halves of each split title; a single .loc call avoids chained assignment.
train.loc[(train['label'] != 'REAL') & (train['label'] != 'FAKE'), 'title'] = df1['title'].str.cat(df1['text'], sep=':')
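Mask-based updates like this one rely on index alignment: the Series on the right-hand side is matched to the selected rows by index label, not by position. A minimal sketch on made-up data, written as a single `.loc` call (the frame `toy` and its index labels are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["a", "b", "c"], "text": [" x", " y", " z"], "label": ["REAL", "oops", "FAKE"]},
    index=[10, 20, 30],
)

mask = ~toy["label"].isin(["REAL", "FAKE"])
subset = toy[mask]

# Single .loc call: rows come from the mask, the column is named explicitly,
# and the right-hand Series is matched to those rows by index label.
toy.loc[mask, "title"] = subset["title"].str.cat(subset["text"], sep=":")
print(toy["title"].tolist())
```

Writing the row selection and the column in one `.loc` call keeps the assignment on the original frame rather than on a temporary copy.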

In [24]:
train.loc[889]['title']



**In df1, the title of each article appears both in the variable 'title' and at the beginning of the article text. I think doing the same for the remaining rows is a good idea for 2 reasons:**

**The models should work better, since only one text variable has to be considered.**

**All the rows will have the same weight, since the title will appear once in the text of every article.**
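Combining the title and the body into a single field can be sketched with `Series.str.cat`. One detail worth knowing is that a missing value on either side makes the result missing unless `na_rep` is given. The training data here has no missing titles or texts, so this is a general caution rather than a fix; the frame `toy` is made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Headline one", "Headline two"],
    "text": ["Body of the first article.", np.nan],
})

# By default a missing value on either side makes the result missing.
combined = toy["title"].str.cat(toy["text"], sep=". ")

# na_rep substitutes a placeholder so the title alone survives.
combined_safe = toy["title"].str.cat(toy["text"], sep=". ", na_rep="")
print(combined.tolist())
print(combined_safe.tolist())
```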

In [25]:
df_big = train[(train['label'] == 'REAL') | (train['label'] == 'FAKE')]
df_big.shape

(3966, 6)

In [26]:
# Prepend the title to the article text for the correctly parsed rows.
train.loc[(train['label'] == 'REAL') | (train['label'] == 'FAKE'), 'text'] = df_big['title'].str.cat(df_big['text'], sep='. ')

In [27]:
train.loc[3333]['title']

"Donald Trump's lost month in Iowa"

In [28]:
train.loc[3333]['text']

'Donald Trump\'s lost month in Iowa. West Des Moines, Iowa (CNN) Donald Trump had every reason to feel optimistic Monday. His poll numbers were up; he had secured two prominent endorsements in the space of a week; and even the weather seemed to be cooperating, with a snowstorm coming in from the west expected to hold off until after midnight.\n\nAnd then, he lost, coming in second to Ted Cruz\n\nTrump spent January in full-on attack mode against Cruz. Trump questioned whether Cruz qualifies as a "natural-born citizen" eligible to serve as president. He went after the evangelical vote, winning the endorsement of Liberty University President Jerry Falwell, Jr., but didn\'t back up the appeals with a ground game. And then he sat out the final debate, opting for a rally across town at the same time.\n\nHe acknowledged Tuesday that may have backfired. "I think some people were disappointed that I didn\'t go into the debate," Trump said in New Hampshire.\n\nTrump and his advisers don\'t have

Second, I will move the text of the news report into the proper variable.

In [29]:
# For the shifted rows, the full article text was stored in 'label'.
train.loc[(train['label'] != 'REAL') & (train['label'] != 'FAKE'), 'text'] = df1['label']

In [30]:
train.loc[192]['title']

'Election Day: No Legal Pot In Ohio: Democrats Lose In The South'

In [31]:
train.loc[192]['text']

'Election Day: No Legal Pot In Ohio; Democrats Lose In The South\n\nTuesday is "off year" Election Day in parts of the country. Legalizing marijuana is on the ballot in Ohio, Houston voters will decide on an equal rights ordinance and San Francisco weighs short-term rentals in what\'s being called the "Airbnb Initiative."\n\nElsewhere, eyes are on governor races in Kentucky and Louisiana, and whether Democrats can make any progress in the South.\n\nHere\'s a look at some of the races:\n\nHouston voters will decide whether to keep an equal rights ordinance that was approved by the City Council last year. The Houston Equal Rights Ordinance (HERO) would ban discrimination based on sexual orientation and gender identity — criteria not covered by national anti-discrimination laws. The ordinance is hotly debated, particularly after some opposition ads were released. The ads claim that the ordinance would allow men who identify as women to assault women and young girls in bathrooms.\n\nHillar

Third, the content of 'X1' goes into the variable 'label'.

In [32]:
# For the shifted rows, the real label was stored in 'X1'.
train.loc[(train['label'] != 'REAL') & (train['label'] != 'FAKE'), 'label'] = df1['X1']

In [33]:
train.loc[192]

ID                                                     599
title    Election Day: No Legal Pot In Ohio: Democrats ...
text     Election Day: No Legal Pot In Ohio; Democrats ...
label                                                 REAL
X1                                                    REAL
X2                                                     NaN
Name: 192, dtype: object

There are still 2 rows whose information is not in the proper variables:

In [34]:
df2 = df1[(df1['X1'] != 'REAL') & (df1['X1'] != 'FAKE')]
df2

| | ID | title | text | label | X1 | X2 |
|---|---|---|---|---|---|---|
| 2184 | 9 | Planned Parenthood’s lobbying effort | pay raises for federal workers | and the future Fed rates | PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE.... | REAL |
| 3537 | 6268 | Chart Of The Day: Since 2009—–Recovery For The 5% | Stagnation for the 95% | Chart Of The Day: Since 2009 Recovery For The 5% | Stagnation for the 95% | FAKE |


In [35]:
train.loc[3537]

ID                                                    6268
title    Chart Of The Day: Since 2009—–Recovery For The...
text      Chart Of The Day: Since 2009 Recovery For The 5%
label                             Stagnation for the 95%  
X1                                Stagnation for the 95%  
X2                                                    FAKE
Name: 3537, dtype: object

In [36]:
train.loc[3537]['title']

'Chart Of The Day: Since 2009—–Recovery For The 5%: Stagnation for the 95%'

In [37]:
train.loc[3537]['text']

'Chart Of The Day: Since 2009 Recovery For The 5%'

In [38]:
train.loc[3537]['label']

' Stagnation for the 95%  '

In [39]:
train.loc[3537]['X1']

' Stagnation for the 95%  '

In this instance, the variables 'label' and 'X1' hold the same value, and the variable 'X2' contains the information that should be in 'label'.

Let's fix it:

In [40]:
train.loc[3537, 'label'] = df1.loc[3537, 'X2']
train.loc[3537]

ID                                                    6268
title    Chart Of The Day: Since 2009—–Recovery For The...
text      Chart Of The Day: Since 2009 Recovery For The 5%
label                                                 FAKE
X1                                Stagnation for the 95%  
X2                                                    FAKE
Name: 3537, dtype: object

In [41]:
train = train.drop([3537])
train.shape

(3998, 6)

In [42]:
train.loc[2184, 'label'] = train.loc[2184, 'X2']
train.loc[2184]

ID                                                       9
title    Planned Parenthood’s lobbying effort: pay rais...
text                              and the future Fed rates
label                                                 REAL
X1       PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....
X2                                                    REAL
Name: 2184, dtype: object

In [43]:
train.loc[2184]['X1']

'PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE.\xa0Congress may have spent August away from Washington but Planned Parenthood’s campaign to convince lawmakers to protect the group’s funding followed them back to their home states. Power Post has more.\n\n“Lawmakers will raise the stakes when Congress returns next week by threatening to defund the group through the federal appropriations process. Planned Parenthood’s counter-offensive is widespread and varied and is unfolding inside and outside the Beltway. The group has been\xa0organizing rallies, flooding lawmakers’ town hall meetings, commissioning polls, shelling\xa0out six figures for television\xa0ads and\xa0hiring forensics experts to try to discredit undercover video footage that sparked the controversy. The success of these lobbying efforts will be tested when Congress returns and must move a short-term spending bill to keep the government open. Some conservatives in both chambers are pushing to defund Planned Parenthood, even 

The text of this particular instance appears to be in the X1 variable.

In [44]:
train.loc[2184, 'text'] = df1.loc[2184, 'X1']
train.loc[2184]

ID                                                       9
title    Planned Parenthood’s lobbying effort: pay rais...
text     PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....
label                                                 REAL
X1       PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....
X2                                                    REAL
Name: 2184, dtype: object

All the rows from 'df1' should be clean now.
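A claim like this can be backed by a few explicit checks. The helper below is a hypothetical sketch, not part of the original notebook: `check_clean` is an assumed name, and it only tests that every label is REAL or FAKE and that the three main columns have no missing values.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    if not frame["label"].isin(["REAL", "FAKE"]).all():
        problems.append("unexpected value in 'label'")
    if frame[["title", "text", "label"]].isnull().any().any():
        problems.append("missing value in title/text/label")
    return problems

toy_ok = pd.DataFrame({"title": ["t"], "text": ["body"], "label": ["REAL"]})
toy_bad = pd.DataFrame({"title": ["t"], "text": ["body"], "label": ["Some article text"]})
print(check_clean(toy_ok))
print(check_clean(toy_bad))
```

Calling `check_clean(train)` at this point would be one way to verify the cleaning before dropping the helper columns.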

In [45]:
train.describe(include='all')

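| | ID | title | text | label | X1 | X2 |
|---|---|---|---|---|---|---|
| count | 3998.0 | 3998 | 3998 | 3998 | 32 | 1 |
| unique | | 3967 | 3989 | 2 | 3 | 1 |
| top | | OnPolitics \| 's politics blog | OnPolitics \| 's politics blog. Who has Trump a... | REAL | REAL | REAL |
| freq | | 4 | 2 | 2008 | 17 | 1 |
| mean | 5288.005003 | | | | | |
| std | 3045.352604 | | | | | |
| min | 3.0 | | | | | |
| 25% | 2695.5 | | | | | |
| 50% | 5249.5 | | | | | |
| 75% | 7907.75 | | | | | |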


Now that I have the dataset cleaned, I can safely delete **X1** and **X2**.

In [46]:
del train['X1']
del train['X2']

## Import test data

In [47]:
test = pd.read_csv('fake_or_real_news_test.csv')
test.head(3)

| | ID | title | text |
|---|---|---|---|
| 0 | 10498 | September New Homes Sales Rise——-Back To 1992 ... | September New Homes Sales Rise Back To 1992 Le... |
| 1 | 2439 | Why The Obamacare Doomsday Cult Can't Admit It... | But when Congress debated and passed the Patie... |
| 2 | 864 | Sanders, Cruz resist pressure after NY losses,... | The Bernie Sanders and Ted Cruz campaigns vowe... |


In [48]:
test.shape

(2321, 3)

Add target variable.

In [49]:
test['label'] = np.nan

In [50]:
test.describe(include='all')

| | ID | title | text | label |
|---|---|---|---|---|
| count | 2321.0 | 2321 | 2321 | 0.0 |
| unique | | 2310 | 2246 | |
| top | | All Governments Lie, The Movie | Killing Obama administration rules, dismantlin... | |
| freq | | 2 | 17 | |
| mean | 5261.05601 | | | |
| std | 3028.488104 | | | |
| min | 2.0 | | | |
| 25% | 2630.0 | | | |
| 50% | 5300.0 | | | |
| 75% | 7888.0 | | | |


As with the train data, I will prepend the title to the article text.

In [51]:
test['text'] = test['title'].str.cat(test['text'], sep='. ')

In [52]:
test.loc[864]['title']

'Bombs Ready: The American Blob Is Already Oozing Into Syria - Ryan Cooper'

In [53]:
test.loc[864]['text']

"Bombs Ready: The American Blob Is Already Oozing Into Syria - Ryan Cooper. Politics Bombs Ready: The American Blob Is Already Oozing Into Syria \nThe US foreign policy establishment is laying the groundwork for Hillary to send the US military against the Syrian state Originally appeared at The Week \nSyria is in absolute ruins. The ongoing civil war, a disorganized melee involving the Assad regime, various rebel groups, Russia, Iran, ISIS and other Islamists, Turkey, Kurdish forces, and the U.S., has been stuck in stalemate for month after month. Much of the country is in a state of utter collapse, hundreds of thousands have died, and refugees continue to pour into neighboring states and Europe. \nWith the election of a militarist-inclined Hillary Clinton looking all but certain, the Blob — White House aide Ben Rhodes' apt name for the permanent D.C. foreign policy establishment — is quickly coalescing around a new consensus that existing U.S. intervention should be dramatically scale

## Join datasets

In [54]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
data = pd.concat([train, test], ignore_index=True)
data.shape

(6319, 4)
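Because the test rows carry no label, the combined frame can later be split back into its two parts by checking which labels are missing. A minimal sketch on made-up frames (`toy_train` and `toy_test` are assumptions for illustration), using `pd.concat` for the stacking step:

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"ID": [1, 2], "text": ["a", "b"], "label": ["REAL", "FAKE"]})
toy_test = pd.DataFrame({
    "ID": [3],
    "text": ["c"],
    "label": pd.Series([np.nan], dtype="object"),
})

# Stack the two frames and renumber the index from 0.
combined = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows with a label came from train; rows without one came from test.
train_part = combined[combined["label"].notnull()]
test_part = combined[combined["label"].isnull()]
print(combined.shape, len(train_part), len(test_part))
```

This split only works as long as every training row has a label, which is why the cleaning of 'label' has to be finished before the two sets are joined.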

In [55]:
data.describe(include='all')

| | ID | title | text | label |
|---|---|---|---|---|
| count | 6319.0 | 6319 | 6319 | 3998 |
| unique | | 6240 | 6290 | 2 |
| top | | OnPolitics \| 's politics blog | OnPolitics \| 's politics blog. Who has Trump a... | REAL |
| freq | | 5 | 3 | 2008 |
| mean | 5278.106504 | | | |
| std | 3038.957018 | | | |
| min | 2.0 | | | |
| 25% | 2670.5 | | | |
| 50% | 5268.0 | | | |
| 75% | 7898.5 | | | |


In [56]:
data.isnull().sum()

ID          0
title       0
text        0
label    2321
dtype: int64

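The only missing values are the 2321 labels of the test rows, which is expected because the test set has no target. There are no NA's in 'ID', 'title' or 'text'.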

### Save cleaned dataset

In [57]:
data.to_csv("cdata.csv", index = False)
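When the saved file is read back, the missing test labels should come back as NaN, and `index=False` keeps an extra unnamed index column out of the file. The sketch below checks this round trip on a made-up frame, writing to an in-memory buffer instead of `cdata.csv`:

```python
import io
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ID": [1, 2], "text": ["a", "b"], "label": ["REAL", np.nan]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.columns.tolist())
print(restored["label"].isnull().tolist())
```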