# Introduction

In this notebook, I try to identify the factors that make a news article fake. I first analyse the dataset and then build a rule-based approach for news classification.

# No ML/DL is used.

# Dataset

* train.csv: A full training dataset with the following attributes:
    * id: unique id for a news article
    * title: the title of a news article
    * author: author of the news article
    * text: the text of the article; could be incomplete
    * label: a label that marks the article as potentially unreliable (1: unreliable, 0: reliable)

# Contents

* Introduction
* Dataset
* Importing important libraries
* Reading dataset
* Analysis
* Rules
* Submission file
* Conclusion

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/fake-news/submit.csv
/kaggle/input/fake-news/train.csv
/kaggle/input/fake-news/test.csv


# Importing Libraries

In [2]:
import numpy as np 
import pandas as pd
from collections import Counter
import math
import random

# Load Dataset

In [3]:
#reading all the files
df = pd.read_csv("/kaggle/input/fake-news/train.csv")
df_test = pd.read_csv("/kaggle/input/fake-news/test.csv")
df_sub = pd.read_csv("/kaggle/input/fake-news/submit.csv")

# Start Analysis

In [4]:
df.label.value_counts()

1    10413
0    10387
Name: label, dtype: int64

# The dataset is balanced (10413 unreliable vs 10387 reliable articles)

In [5]:
len(df.author.unique()),len(df.author[df.label == 1].unique()),len(df.author[df.label == 0].unique())

(4202, 1982, 2226)

In [6]:
df1 = df[df.label==1]
fake_news_author = Counter(df1.author)
df2 = df[df.label==0]
reliable_news_authors = Counter(df2.author)
len(fake_news_author),len(reliable_news_authors)

(1982, 2226)

In [7]:
unreliable_authors = fake_news_author.keys() & reliable_news_authors.keys()
len(unreliable_authors)

6

# Only 6 authors appear under both reliable and fake news

In [8]:
for key in unreliable_authors:
    print("fake news count for author ", key, " : ", fake_news_author[key])
    print("Reliable news count for author ", key, " : ", reliable_news_authors[key])

fake news count for author  nan  :  1931
Reliable news count for author  nan  :  26
fake news count for author  Reuters  :  2
Reliable news count for author  Reuters  :  4
fake news count for author  Pam Key  :  1
Reliable news count for author  Pam Key  :  242
fake news count for author  Ann Coulter  :  5
Reliable news count for author  Ann Coulter  :  16
fake news count for author  AFP  :  1
Reliable news count for author  AFP  :  2
fake news count for author  Pamela Geller  :  4
Reliable news count for author  Pamela Geller  :  1


In [9]:
df.author.isna().sum()

1957

In [10]:
df.author.isna().sum()/len(df.author)

0.09408653846153846

# About 9% of the author values are missing, which is a lot. Otherwise we could have built a rule-based classifier on author names alone.
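
Missing-value fractions like the one above can be computed for several columns at once and split by label. This is a sketch under the assumption that the frame has `title`, `author`, `text` and `label` columns; `missing_share_by_label` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

def missing_share_by_label(frame, columns=("title", "author", "text")):
    """Fraction of missing values in each column, computed separately per label."""
    missing = frame[list(columns)].isna()
    # The mean of a boolean column is the share of True values, i.e. missing rows.
    return missing.groupby(frame["label"]).mean()

# Toy data (hypothetical): all missing values sit in the label-1 rows.
toy = pd.DataFrame({
    "title":  ["t1", None, "t3", "t4"],
    "author": [None, None, "B", "C"],
    "text":   ["x", None, "y", "z"],
    "label":  [1, 1, 0, 0],
})
shares = missing_share_by_label(toy)
print(shares)
```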

In [11]:
df.title.isna().sum(), df.text.isna().sum()

(558, 39)

In [12]:
print("missing data in title : ",df.title.isna().sum()/len(df.title), "\nmissing data in text :",df.text.isna().sum()/len(df.text))

missing data in title :  0.026826923076923078 
missing data in text : 0.001875


# About 0.19% of the text values are missing, so we could think about dropping those rows. Let's check the authors of those articles and which category the articles with missing text fall into.

In [13]:
df3 = df[df.text.isna()]
len(df3.label), df3.label.value_counts()

(39,
 1    39
 Name: label, dtype: int64)

# Here we can create a rule: if the text is missing, the news is fake, and a missing author further supports that.

In [14]:
df4 = df[df.title.isna()]
df4.label.value_counts()

1    558
Name: label, dtype: int64

# Here we get a similar insight: if the title is missing, the news is fake
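
Each candidate rule can be summarised by how many rows it covers and how often those rows are actually fake. A minimal sketch, assuming a `label` column with 1 for fake; `rule_stats` and the toy frame are illustrative names, not part of the notebook.

```python
import pandas as pd

def rule_stats(frame, mask):
    """Return (rows covered by the rule, share of those rows labelled fake)."""
    covered = int(mask.sum())
    # Avoid taking the mean of an empty selection.
    fake_share = float(frame.loc[mask, "label"].mean()) if covered else float("nan")
    return covered, fake_share

# Toy data (hypothetical): every row with a missing title or text is fake.
toy = pd.DataFrame({
    "title": [None, "t", "t", None],
    "text":  ["x", None, "y", "z"],
    "label": [1, 1, 0, 1],
})
title_rule = rule_stats(toy, toy["title"].isna())
text_rule = rule_stats(toy, toy["text"].isna())
print("missing title:", title_rule)
print("missing text:", text_rule)
```

On the training data, `rule_stats(df, df.title.isna())` would summarise the missing-title rule in one call.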

# Let's try to create a rule-based system to classify the test data

In [15]:
len(df_sub),len(df_test)

(5200, 5200)

In [16]:
df_test.title.isna().sum(), df_test.author.isna().sum(), df_test.text.isna().sum()

(122, 503, 7)

# The test data also has null values, so let's write rules to classify it.

In [17]:
df_test.columns

Index(['id', 'title', 'author', 'text'], dtype='object')

# Rules based on the prior observations

In [18]:

# Apply the rules row by row; this assumes df_sub and df_test list the articles in the same order.
for i in range(len(df_sub)):
    author = df_test.loc[i, 'author']
    if author in unreliable_authors:
        # Author was seen under both labels in the training data:
        # a missing text or title marks the article as fake, otherwise guess at random.
        if pd.isna(df_test.loc[i, 'text']) or pd.isna(df_test.loc[i, 'title']):
            df_sub.loc[i, 'label'] = 1
        else:
            df_sub.loc[i, 'label'] = random.randint(0, 1)
    elif author in fake_news_author:
        # Author was only seen with fake news in the training data.
        df_sub.loc[i, 'label'] = 1
    else:
        # Author seen only with reliable news, or not seen at all.
        df_sub.loc[i, 'label'] = 0
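
The same rules can be expressed without a Python loop. The sketch below is one possible vectorised form, not the notebook's own implementation. It assumes that a missing author is meant to fall in the "seen under both labels" group, and it swaps `random.randint` for a seeded NumPy generator so that runs are repeatable. `apply_rules`, its parameters and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def apply_rules(test, fake_authors, mixed_authors, seed=0):
    """Label rows 1 (fake) or 0 (reliable) using the author and missing-field rules."""
    rng = np.random.default_rng(seed)  # seeded generator: an assumption for repeatability
    author = test["author"]
    # Assumption: a missing author is treated like an author seen under both labels.
    is_mixed = (author.isna() | author.isin(mixed_authors)).to_numpy()
    missing_field = (test["text"].isna() | test["title"].isna()).to_numpy()
    is_fake_author = author.isin(fake_authors).to_numpy()
    random_guess = rng.integers(0, 2, size=len(test))
    # Conditions are checked in order; the first one that holds decides the label.
    labels = np.select(
        [is_mixed & missing_field, is_mixed, is_fake_author],
        [1, random_guess, 1],
        default=0,
    )
    return pd.Series(labels, index=test.index, name="label")

# Toy data (hypothetical authors).
toy_test = pd.DataFrame({
    "author": ["Mixed", "Mixed", "Faker", "Unknown", None],
    "title":  ["t", None, "t", "t", "t"],
    "text":   ["x", "y", "z", "w", None],
})
labels = apply_rules(toy_test, fake_authors={"Faker", "Mixed"}, mixed_authors={"Mixed"})
print(labels.tolist())
```

Only the first toy row is decided by the random guess; the other four follow directly from the rules.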
        

In [19]:
df_sub.head()

      id  label
0  20800      0
1  20801      1
2  20802      0
3  20803      0
4  20804      1


In [20]:
df_sub.to_csv('submission.csv', index = False)

# Conclusion

* The rule-based approach should be able to classify about 85% of the train data correctly.
* We now know the factors that indicate fake news, viz.,
    * the set of authors who spread fake news,
    * the absence of text,
    * the absence of title.
* Let's try to improve on the accuracy of this kernel by using ML/DL approaches.
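
An accuracy figure such as the 85% above can be checked by holding out part of the labelled data and scoring the rules on it. The following is a sketch only: `holdout_accuracy` and `missing_field_rule` are hypothetical helpers, and the rule used here is a simplified stand-in (fake when author, title or text is missing), not the full rule set from this notebook.

```python
import pandas as pd

def holdout_accuracy(train, predict_fn, valid_frac=0.2, seed=0):
    """Score predict_fn on a random held-out slice of the labelled data."""
    valid = train.sample(frac=valid_frac, random_state=seed)
    fit = train.drop(valid.index)
    # predict_fn sees the fitting rows with labels and the validation rows without them.
    predicted = predict_fn(fit, valid.drop(columns="label"))
    return float((predicted.to_numpy() == valid["label"].to_numpy()).mean())

def missing_field_rule(fit, rows):
    """Simplified stand-in rule: fake (1) if author, title or text is missing, else 0."""
    missing = rows[["author", "title", "text"]].isna().any(axis=1)
    return missing.astype(int)

# Toy data (hypothetical): the label is 1 exactly where a field is missing.
toy = pd.DataFrame({
    "author": ["A", None, "B", "C", None, "D", "E", "F", None, "G"],
    "title":  ["t", "t", "t", None, "t", "t", "t", "t", "t", "t"],
    "text":   ["x"] * 10,
    "label":  [0, 1, 0, 1, 1, 0, 0, 0, 1, 0],
})
accuracy = holdout_accuracy(toy, missing_field_rule, valid_frac=0.5)
print(accuracy)
```

With the real data, `train` would be `df`, and `predict_fn` would build the author sets from `fit` before labelling `rows`.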