# Task 1

%
If you haven't already done so, the first task is to form a group of 2-4 people. There is an announcement on Absalon describing how to do this. Make sure that you list the names of all members of the group at the top of the jupyter notebook along with your group number.
%

## Group members:
* Oleksandr Kryshtalov [mdq842]  
* Ulrik Bjørn Meelby [na]  


# Task 2

%
For our fake news predictor, we will be using the FakeNewsCorpus dataset as our primary dataset. 
It is available from this GitHub repository, where you can also find information about how the data is collected, the available fields, etc. In this first milestone, we will work only on a small subset of the FakeNewsCorpus dataset. 
Your first task is to retrieve this subset from https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv and structure/process/clean it. Describe which procedures (and which libraries) you used and why they are appropriate.
%

To clean the data, we first need to know what the cleaning is for. 
In Task 3 we have chosen to analyze the following questions:
* Are there words that have a higher frequency in fake-news articles than in trustworthy ones, and what are those?
* Which meta_keywords have the highest inclination to be fake?
* Observe Zipf's law on English text using the given dataset.

All observations require tokenizing the text, plus a way to clean and stem the tokens so that inflected forms of the same word, e.g. word and words, are not counted separately.

For cleaning the text we will be using cleantext.
nltk will be used for tokenization via wordpunct_tokenize, and for stemming via the Porter stemmer.

In [1]:
# imports:
import requests
import re
import numpy as np
import pandas as pd
import string

from cleantext import clean
import nltk
#from nltk.corpus import stopwords
#from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [2]:
#We get the data from the web 
csvdata = requests.get("https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv")
csvdata

<Response [200]>

In [3]:
# Write csvfile
with open("rawdata.csv","wb") as file:
    file.write(csvdata.content)

# Parse csv into dataframe
dfRaw = pd.read_csv ('rawdata.csv')

# drop the first column and some other columns for brevity:
df = dfRaw.iloc[:, 1:]
df = df.drop(columns=["url", "scraped_at", "inserted_at", "updated_at", "keywords", "summary"])

# have a peek at the data:
#df

## Cleaning:

In [4]:
# A pd.Series -> pd.Series function that cleans all the text:
# lowercases everything, removes URLs, emails, etc.

def clean_text(textSeries:pd.Series) -> pd.Series:
    def clean_func (text):
        return clean(text,
            fix_unicode=True,            
            to_ascii=True,               
            lower=True,                  
            no_line_breaks=True,        
            no_urls=True,                
            no_emails=True,              
            no_phone_numbers=False,      
            no_numbers=True,             
            no_punct=True,              
            replace_with_punct="",       
            replace_with_url="oURL", # cleantext lowercases after inserting the tags, so they end up as "ourl", "onum", ...
            replace_with_email="oEMAIL",
            replace_with_phone_number="oPHONE",
            replace_with_number="oNUM",
            replace_with_digit="0",
            lang="en")
    return textSeries.apply(clean_func)
            
df.content = clean_text(df.content)

## Tokenization:

In [5]:
nltk.download('punkt') # fix

# Same as before a pd.Series -> pd.Series function.
def tokenize(textSeries:pd.Series) -> pd.Series:
    return pd.Series(textSeries.apply(nltk.tokenize.wordpunct_tokenize),name="tokens")
df["tokens"] = tokenize(df.content)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Oleks\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
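
As a side note on the tokenizer used in this cell: `wordpunct_tokenize` is a regex-based tokenizer. The sketch below mimics it with the standard library only, assuming NLTK's documented pattern `\w+|[^\w\s]+`. The helper name `wordpunct_like` is hypothetical and only serves to illustrate how cleaned text is split.

```python
import re

# Pattern assumed to match NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text: str) -> list:
    """Hypothetical stand-in that mimics nltk.tokenize.wordpunct_tokenize."""
    return _WORDPUNCT.findall(text)

# Toy sentence, not taken from the corpus.
tokens = wordpunct_like("former us president onum, said: hello-world")
print(tokens)
```

Punctuation that survives cleaning becomes its own token, so it is worth checking that `no_punct=True` has removed it before counting words.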


## Removal of stopwords:

## Stemming:

In [6]:
# stems all the tokens
def stem(data: pd.Series) -> pd.Series:
    ps = PorterStemmer()
    # function to be applied on each element in the series
    def stem_tokens(tokens):
        return [ps.stem(token) for token in tokens]
    return pd.Series(data.apply(stem_tokens))

# removal of duplicates
def dupeRemove (tokens:pd.Series) -> pd.Series:
    def unDupe (tokens):
        return list(set(tokens))
    return tokens.apply(unDupe)



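df["stemmed_with_dupes"] = stem(df.tokens)
# de-duplicate the stemmed tokens (not the raw ones), so that "stemmed" really is stemmed
df["stemmed"] = dupeRemove(df["stemmed_with_dupes"])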

# Task 3
%
Now try to explore the FakeNewsCorpus dataset. Make at least three non-trivial observations/discoveries about the data. These observations could be related to outliers, artefacts, or even better: genuinely interesting patterns in the data that could potentially be used for fake-news detection. Examples of simple observations could be how many missing values there are in particular columns - or what the distribution over domains is. Be creative! :).  
%

In Task 2 we listed the observations we wanted to look into: 
1. Are there words that have a higher frequency in fake-news articles than in trustworthy ones, and what are those?
2. Which meta_keywords have the highest inclination to be fake?
3. Observe Zipf's law on English text using the given dataset.

These tasks can then be split into smaller tasks:
### 1
* Generate a BOW (bag of words) for fake and credible articles separately. 
* For each word find credible/fake ratio of the frequencies. 
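
The two steps above could be sketched as follows, using only the standard library. This is an illustration under assumptions, not the notebook's implementation: the function name `frequency_ratio` is hypothetical, the ratio is oriented fake/reliable so that large values flag fake-leaning words, and the add-one smoothing is our own choice to keep the ratio finite for words that occur in only one class.

```python
from collections import Counter

def frequency_ratio(reliable_tokens, fake_tokens, smoothing=1.0):
    """Return {word: relative frequency in fake / relative frequency in reliable}.

    `smoothing` (add-one by default) is an assumption of this sketch; it keeps
    the ratio finite for words that occur in only one of the two classes.
    """
    reliable, fake = Counter(reliable_tokens), Counter(fake_tokens)
    vocab = set(reliable) | set(fake)
    reliable_total = sum(reliable.values()) + smoothing * len(vocab)
    fake_total = sum(fake.values()) + smoothing * len(vocab)
    return {
        word: ((fake[word] + smoothing) / fake_total)
        / ((reliable[word] + smoothing) / reliable_total)
        for word in vocab
    }

# Toy data, not taken from the corpus.
ratios = frequency_ratio(
    reliable_tokens=["the", "report", "said", "the"],
    fake_tokens=["the", "shock", "shock", "truth"],
)
top = sorted(ratios, key=ratios.get, reverse=True)[:2]
print(top)
```

Without smoothing, every word that appears only in fake articles would have an undefined ratio, which on a 250-article sample would affect a large share of the vocabulary.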


### 2
* Create a BOW for the meta keywords of fake articles.
* Sort it by frequency.
* Visualize this coefficient.
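
A possible sketch of the keyword counting step, assuming that each `meta_keywords` cell is a string holding a Python list literal such as `"['politics', 'election']"`. That format is an assumption and should be verified against the CSV. The helper `parse_keywords` and the toy rows are hypothetical.

```python
import ast
from collections import Counter

def parse_keywords(cell):
    """Parse one meta_keywords cell into a list of lowercase keywords.

    Assumes the cell is a string holding a Python list literal; anything
    that does not parse that way (including missing values) yields [].
    """
    if not isinstance(cell, str):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [str(k).strip().lower() for k in parsed if str(k).strip()]

# Toy rows (label, meta_keywords), not taken from the corpus.
rows = [
    ("fake", "['Vaccines', 'cover-up']"),
    ("fake", "['vaccines']"),
    ("reliable", "['economy', 'vaccines']"),
    ("fake", "['']"),
    ("fake", None),
]
fake_keywords = Counter(
    kw for label, cell in rows if label == "fake" for kw in parse_keywords(cell)
)
print(fake_keywords.most_common(2))
```

Counting the same keywords for reliable articles as well would allow a comparison between classes rather than a raw count for fakes only.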


### 3
* Create a vocabulary for the entire dataset. 
* Calculate a bag of words for all text. 
* Sort the BOW by frequency. 
* Plot it on a scatterplot. 
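
Zipf's law predicts that a word's frequency is roughly proportional to 1/rank, which shows up as a straight line with slope close to -1 on a log-log plot. Besides the scatterplot, the slope can be estimated numerically. The sketch below is a minimal standard-library version; `zipf_slope` is a hypothetical helper and the input is synthetic, built so that the counts follow 120/rank exactly.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic tokens whose counts are 120, 60, 40, 30, 24, 20 (i.e. 120 / rank).
synthetic = [f"w{rank}" for rank in (1, 2, 3, 4, 5, 6) for _ in range(120 // rank)]
slope = zipf_slope(synthetic)
print(round(slope, 3))
```

On real text the fit is usually worse in the tail, where many words occur exactly once, so the slope is best read together with the plot.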

## Creating BOWs

In [7]:
# The following section contains the BOW code, which generates bags of words
# and combines them. As before, we will be using pd.Series as our datatype.
# The code takes a long time to complete; you might want to skip this section and
# unpickle the result instead.


# generate a BOW for each list of tokens (with dupes) in the series.
def BOWGen(tokenss: pd.Series) -> pd.Series:  # each element is a (vocab, freqs) pair
    
    # generate a BOW for one list of tokens
    def _BOWGen(tokens: list) -> (list, list):  # parallel lists: words and their counts
        vocab = []
        freqs = []
        for token in tokens:
            if token not in vocab:
                vocab.append(token)
                freqs.append(1)
            else:
                # increment frequency counter by one.
                freqs[vocab.index(token)] += 1
        return vocab, freqs
    return tokenss.apply(_BOWGen)


def BOWGenGlobal(BOWs:pd.Series) -> pd.DataFrame:
    def mergeBOWs(a:(list,list), b):
        vocab, freqs = a
        for (word, count) in zip(b[0],b[1]):
            if word not in vocab:
                vocab.append(word)
                freqs.append(count)
            else:
                freqs[vocab.index(word)] += count
        return vocab, freqs
    
    BOWList = [i for (_, i) in BOWs.items()]
    
    # accumulator
    _BOW = ([],[])
    for idx, BOW in enumerate(BOWList):
        _BOW = mergeBOWs(BOW, _BOW)
        if (idx) % 20 == 0:
            print(f"{idx} out of {len(BOWList)} ")
    out = pd.DataFrame((_BOW)).transpose()
    out.columns = ["word", "count"]
    out["count"] = out["count"].astype(int)
    print("Done")
    return out
      
# test 
BOWs = BOWGenGlobal(BOWGen(df.tokens.iloc[0:10]))
BOWs.sort_values("count",ascending=False).iloc[0:5]

0 out of 10 
Done


Unnamed: 0,word,count
19,the,299
94,onum,175
39,of,155
1,to,112
109,in,105
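
The list-based approach scans the whole vocabulary list for every token (`token not in vocab`, `vocab.index`), which is the main reason it is slow. One possible faster alternative is `collections.Counter`, sketched below. The names `bow_counter` and `bow_global` are hypothetical and not part of the notebook.

```python
from collections import Counter

def bow_counter(tokens):
    """Bag of words for one document as a Counter (word -> count)."""
    return Counter(tokens)

def bow_global(documents):
    """Sum the per-document bags of words into one corpus-wide Counter."""
    total = Counter()
    for tokens in documents:
        total.update(bow_counter(tokens))
    return total

# Toy documents, not taken from the corpus.
docs = [["the", "cat", "the"], ["the", "dog"]]
corpus_bow = bow_global(docs)
print(corpus_bow.most_common(1))
```

If the two-column layout is needed, `pd.DataFrame(corpus_bow.most_common(), columns=["word", "count"])` should produce a comparable table, already sorted by count.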


## Generating all the BOWs we need

In [12]:
# BOW Whole text
BOW_Whole_text = BOWGenGlobal(BOWGen(df.tokens))
BOW_Whole_text

0 out of 250 
20 out of 250 
40 out of 250 
60 out of 250 
80 out of 250 
100 out of 250 
120 out of 250 
140 out of 250 
160 out of 250 
180 out of 250 
200 out of 250 
220 out of 250 
240 out of 250 
Done


            word  count
0         former     57
1             us    447
2      president    175
3           bill     47
4        clinton     75
...          ...    ...
16671    potency      1
16672    pitched      1
16673     waffle      4
16674      split      2
16675    empathy      1

[16676 rows x 2 columns]

In [18]:
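# code for saving/reading dataframe to/from disk:
import pickle

# Save the BOW. Comment this block out when you only want to load:
# opening the file in "wb" mode truncates it even if nothing is written,
# and loading an empty file raises "EOFError: Ran out of input".
with open("FullTextBOW.pickle", "wb") as file:
    pickle.dump(BOW_Whole_text, file)

with open("FullTextBOW.pickle", "rb") as file:
    BOW_Whole_text = pickle.load(file)
# dont lose your pickle
BOW_Whole_text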

In [None]:
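# BOW keywords fake
# NOTE: meta_keywords is read from the CSV as a string, so it has to be parsed
# into a list of keywords first; as written, BOWGen would count characters.
BOW_keywords_fake = BOWGenGlobal(BOWGen(df[df["type"] == "fake"].meta_keywords))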

In [None]:
# BOW whole text fake
BOW_Whole_text_fake = BOWGenGlobal(BOWGen(df[df["type"] == "fake"].stemmed))
# BOW whole text reliable
BOW_Whole_text_reliable = BOWGenGlobal(BOWGen(df[df["type"] == "reliable"].stemmed))

## Task 3.1

As our final result we will calculate the frequency delta between fake and reliable articles
and plot it.
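
One way to define that delta is the difference between a word's relative frequency in fake articles and in reliable ones, so that positive values lean fake and negative values lean reliable. This definition and the helper name `frequency_delta` are assumptions of the sketch below, which expects both token lists to be non-empty.

```python
from collections import Counter

def frequency_delta(fake_tokens, reliable_tokens):
    """Return {word: relative frequency in fake - relative frequency in reliable}."""
    fake, reliable = Counter(fake_tokens), Counter(reliable_tokens)
    fake_total, reliable_total = sum(fake.values()), sum(reliable.values())
    return {
        word: fake[word] / fake_total - reliable[word] / reliable_total
        for word in set(fake) | set(reliable)
    }

# Toy data, not taken from the corpus.
delta = frequency_delta(
    fake_tokens=["shock", "the", "shock", "truth"],
    reliable_tokens=["the", "report", "the", "said"],
)
ranked = sorted(delta.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])
```

Normalizing by the class totals matters here because the sample does not contain equally many fake and reliable articles; raw count differences would mostly reflect class size.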

In [13]:
BOW_Whole_text_fake

NameError: name 'BOW_Whole_text_fake' is not defined