## Assignment:
Read the CSV file.  
Manually inspect the data to get an idea of potential problems that need to be fixed.  



## Imports: 

In [1]:
import requests
import re
import numpy as np
import pandas as pd
import string

## Downloading csv from internet

In [2]:
csvdata = requests.get("https://raw.githubusercontent.com/several27/FakeNewsCorpus/master/news_sample.csv")
csvdata

<Response [200]>

## Saving csv as a file

In [3]:
# Write csvfile 
with open("rawdata.csv","wb") as file:
    file.write(csvdata.content)

In [4]:
# Parse csv into dataframe
df = pd.read_csv ('rawdata.csv')
print(df)

     Unnamed: 0     id                domain        type  \
0             0    141               awm.com  unreliable   
1             1    256     beforeitsnews.com        fake   
2             2    700           cnnnext.com  unreliable   
3             3    768               awm.com  unreliable   
4             4    791  bipartisanreport.com   clickbait   
..          ...    ...                   ...         ...   
245         245  39259     beforeitsnews.com        fake   
246         246  39468     beforeitsnews.com        fake   
247         247  39477       www.newsmax.com         NaN   
248         248  39550       www.newsmax.com         NaN   
249         249  39558       www.newsmax.com         NaN   

                                                   url  \
0    http://awm.com/church-congregation-brings-gift...   
1    http://beforeitsnews.com/awakening-start-here/...   
2    http://www.cnnnext.com/video/18526/never-hike-...   
3    http://awm.com/elusive-alien-of-the-sea-ca

### First takeaways:
* The data is in CSV format, so we have to parse it.
* Newlines separate rows and commas separate fields.
* Some commas occur inside the text, but those should be enclosed in quotation marks.
* There are a lot of empty fields.
* There is a lot of duplicate whitespace and many repeated newlines.
* Browsing the data shows that summary is always empty.
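
The point about quoted commas can be checked on a small made-up sample. The sketch below uses Python's standard `csv` module rather than pandas, and the string `raw` is a hypothetical example, not an excerpt from news_sample.csv. It shows that a comma or newline inside quotation marks stays within one field:

```python
import csv
import io

# Hypothetical sample: the content field of row 1 holds a comma and a newline,
# both protected by the surrounding quotation marks.
raw = 'id,content\n1,"Hello, world\nsecond line"\n2,plain text\n'

# csv.reader keeps reading lines until the quoted field is closed.
rows = list(csv.reader(io.StringIO(raw)))

for row in rows:
    print(row)
```

A header row plus two data rows come out, even though the raw text spans four physical lines.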

In [5]:
# summary is float64 because it is unspecified in every row.
df.dtypes

Unnamed: 0            int64
id                    int64
domain               object
type                 object
url                  object
content              object
scraped_at           object
inserted_at          object
updated_at           object
title                object
authors              object
keywords            float64
meta_keywords        object
meta_description     object
tags                 object
summary             float64
dtype: object

Clean the data. First, we'll try to do this manually, by writing our own clean_text() function that uses regular expressions. The function should take raw text as input and return a version of the text with the following modifications:  
* all words must be lowercased
* it should not contain multiple white spaces, tabs or new lines
* numbers, dates, emails and urls should be replaced by "<NUM>", "<DATE>", "<EMAIL>" and "<URL>", respectively. Note that replacing dates with <DATE> is particularly tricky as dates can be expressed in many forms. It's ok to just choose one or a few common date formats present in the data set and only replace those. (Be careful about tokenizing <> symbols because these are punctuation in most tokenizers.)
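
Since dates come in many forms, one option is to keep a short list of patterns and join them into a single regular expression. The sketch below is only an illustration: the three formats, the names `MONTHS`, `DATE_PATTERNS` and `tag_dates`, and the assumption that the text is already lowercased are ours, not part of the exercise.

```python
import re

# Month names with optional abbreviations; assumes the text is already lowercased.
MONTHS = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|"
          r"july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)")

# A few hypothetical formats; extend the list with whatever the data set contains.
DATE_PATTERNS = [
    rf"\b{MONTHS}\.? \d{{1,2}}, \d{{4}}\b",  # january 5, 2018
    r"\b\d{4}-\d{2}-\d{2}\b",                 # 2018-01-05
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",           # 14/12/01 or 1/5/2018
]
DATE_RE = re.compile("|".join(DATE_PATTERNS))

def tag_dates(text: str) -> str:
    """Replace every date matching one of the listed formats with <DATE>."""
    return DATE_RE.sub("<DATE>", text)

print(tag_dates("posted january 5, 2018, updated 2018-01-05"))
```

Keeping the formats in a list makes it easy to add or drop one after inspecting the data.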
   

In [6]:
# each aspect of clean_text can be split up into separate functions.

# function that takes string and returns lowercased string
def str2Lower(inputStr:str):
    # "repl" substitution function for re.sub()
    def upper2Lower(match:re.Match):
        Ls = string.ascii_lowercase
        Us = string.ascii_uppercase
        return Ls[Us.index(match.group(0))]
    # returns string after substitution operation
    return re.sub(r"([A-Z])", upper2Lower, inputStr)

# function that collapses each run of two or more whitespace characters (spaces, tabs, newlines) into a single space.
def stripStr(inputStr:str):
    def repl(match:re.Match):
        return " "
    return re.sub(r"(\s{2,})", repl, inputStr)


# function which inserts <tags>
def insertSymbols(inputStr:str):
    text = inputStr
    
    # insert <DATE>
    # DD-MM-YY or DD/MM/YY format
    text = re.sub(r"[0-3]\d[/-][01]\d[/-]\d\d", (lambda _ : "<DATE>"), text)
    
    # insert <NUM>
    text = re.sub(r"([\d]+[\d,.]*)", (lambda _ : "<NUM>"), text)
    
    # insert <EMAIL>
    # RFC 5322 compliant regex
    emailPat = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
    text = re.sub(emailPat, (lambda _ : "<EMAIL>"), text)
    
    # insert <URL>
    # URL pattern, requires http/https (raw string so the backslashes reach re unchanged)
    URLPat = r"""https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"""
    text = re.sub(URLPat, (lambda _ : "<URL>"), text)
    return text

def clean_text(inputStr:str):
    text = inputStr
    text = str2Lower(text)
    text = stripStr(text)
    text = insertSymbols(text)
    return text

In [7]:
# test
testStr = r"absd   AbsD :;';01932,./' howdy@hello.dk date: 14/12/01 02-03-22, https://regexr.com/"
print("test:", str2Lower("Abasda;l3;'123l1'"))
print("test:", str2Lower(testStr))


print("test:", stripStr("Ab  asd  a;l3   ;'1  23l1'"))
print("test:", stripStr(testStr))

print("test:", insertSymbols(testStr))

print("test:", clean_text(testStr))

test: abasda;l3;'123l1'
test: absd   absd :;';01932,./' howdy@hello.dk date: 14/12/01 02-03-22, https://regexr.com/
test: Ab asd a;l3 ;'1 23l1'
test: absd AbsD :;';01932,./' howdy@hello.dk date: 14/12/01 02-03-22, https://regexr.com/
test: absd   AbsD :;';<NUM>/' <EMAIL> date: <DATE> <DATE>, <URL>
test: absd absd :;';<NUM>/' <EMAIL> date: <DATE> <DATE>, <URL>
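
One thing the test string does not cover is an email address or URL that contains digits. insertSymbols replaces numbers before emails and URLs, so an address such as bob42@example.com (a made-up example) has its digits rewritten first and then no longer matches the email pattern. A possible fix is to run the most specific patterns first. The sketch below illustrates that ordering with deliberately simplified, hypothetical patterns; they are not the RFC 5322 and URL patterns used above, and `insert_symbols_ordered` is our name.

```python
import re

# Simplified, hypothetical patterns for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")
DATE_RE = re.compile(r"\b[0-3]\d[/-][01]\d[/-]\d\d\b")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def insert_symbols_ordered(text: str) -> str:
    """Tag URLs and emails first, then dates, and plain numbers last."""
    text = URL_RE.sub("<URL>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = DATE_RE.sub("<DATE>", text)
    text = NUM_RE.sub("<NUM>", text)
    return text

sample = "mail bob42@example.com on 14/12/01 or see https://example.com/2021/post and 3 more"
print(insert_symbols_ordered(sample))
```

Because the inserted tags contain no digits, the later number pass leaves them untouched.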


### Better clean_text(str)
Now, let's try to use a library for cleaning the data. The clean-text module (https://pypi.org/project/clean-text/) provides out-of-the-box functionality for much of the cleaning we did in the previous exercise (pip install clean-text). Use it to implement the same cleaning steps as in your own clean_text implementation.

In [8]:
# new imports
from cleantext import clean
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /home/olekkr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
def clean_text_new(text:str):
    return clean(text,
        fix_unicode=True,               # fix various unicode errors
        to_ascii=True,                  # transliterate to closest ASCII representation
        lower=True,                     # lowercase text
        no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
        no_urls=True,                  # replace all URLs with a special token
        no_emails=True,                # replace all email addresses with a special token
        no_phone_numbers=False,         # replace all phone numbers with a special token
        no_numbers=True,               # replace all numbers with a special token
        no_digits=False,                # replace all digits with a special token
        no_currency_symbols=False,      # replace all currency symbols with a special token
        no_punct=False,                 # remove punctuations
        replace_with_punct="",          # instead of removing punctuations you may replace them
        replace_with_url="<URL>",
        replace_with_email="<EMAIL>",
        replace_with_phone_number="<PHONE>",
        replace_with_number="<NUM>",
        replace_with_digit="0",
        replace_with_currency_symbol="<CUR>",
        lang="en"                       # set to 'de' for German special handling
    )
(df.content[0])[0:40] # example ...

'Sometimes the power of Christmas will ma'

In [10]:
# Apply function on our dataset
df.content = df.content.apply(clean_text_new)
df.content[1]

'awakening of <num> strands of dna - "reconnecting with you" movie\n% of readers think this story is fact. add your two cents.\nheadline: bitcoin & blockchain searches exceed trump! blockchain stocks are next!\n[january <num>, <num> - zurichtimes.net]\nas miles johnston was giving update, it was another case of strange synchronicities of goodness hidden inside of tests and trials, like a follow the whiterabbit down the rabbit hole type of exercise.\nin researching the <num> strands of dna we came across some articles, one in particular was as a strange synchronicity written exactly <num> year ago on the same topic.\n<url>\n<url>\nwhat are the <num> strands of our dna and why is a war against our dna?\ntrailer for awakening of <num> strands\nthe full video is only available as a paid video on vimeo.\nawakening of <num> strands - "reconnecting with you"\nvimeo.com/ondemand/awakeningof12strands\nawakening of <num> strands - "reconnecting with you". from sandra daroy on vimeo.\nwe have not

Now that we are done cleaning, we can start processing the text. The nltk library (https://www.nltk.org/) has built-in support for many of the most common operations. Try to:
Tokenize the text.


In [11]:
testSentence = df.content[0][0:76]
testSentence 

'sometimes the power of christmas will make you do wild and wonderful things.'
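
The exercise warns that tokenizers tend to treat `<` and `>` as punctuation, which would split a tag such as `<num>` into three tokens. As a rough illustration of one way around that, here is a small regex tokenizer that matches placeholder tags before anything else. It is a stand-in written with the standard `re` module, not nltk's `word_tokenize`, and the name `tokenize_keep_tags` is ours.

```python
import re

# Order matters: placeholder tags first, then words (allowing an inner ' or -),
# then any single remaining non-space symbol.
TOKEN_RE = re.compile(r"<[a-z]+>|\w+(?:['-]\w+)*|[^\w\s]", re.IGNORECASE)

def tokenize_keep_tags(text: str) -> list:
    """Split text into tokens while keeping tags like <num> or <URL> whole."""
    return TOKEN_RE.findall(text)

print(tokenize_keep_tags("awakening of <num> strands - see <url>."))
# → ['awakening', 'of', '<num>', 'strands', '-', 'see', '<url>', '.']
```

The pattern is case-insensitive, so it handles both the lowercased tags that clean-text produces and the uppercase tags from our own clean_text.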

## Stopword removal

In [12]:
def genStopwordStats (data:pd.Series):
    def countTokens (tokens:list[str]):
        if tokens is None:
            return 0
        return len(tokens)
    
    def tokenize (text:str):
        return list(set(word_tokenize(text)))
    
    # build the stopword set once; set lookup avoids rescanning the list for every token
    sWords = set(stopwords.words('english'))
    def removeStopWords(tokens:list[str]):
        return list(set([token for token in tokens if token not in sWords]))
    
    tokensWithStops = pd.Series(data.apply(tokenize), name = "tokensWithStops")
    countWithStops = pd.Series(tokensWithStops.apply(countTokens), name = "countWithStops")
    tokensNoStops = pd.Series(tokensWithStops.apply(removeStopWords), name = "tokensNoStops")
    countNoStops = pd.Series(tokensNoStops.apply(countTokens), name = "countNoStops")
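    # note: this divides by the count after removal; dividing by countWithStops instead gives the reduction relative to the original size
    reductionRate = pd.Series((countWithStops-countNoStops)/countNoStops, name="reductionRate")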
    
    newData = pd.concat([
        tokensWithStops, countWithStops, 
        tokensNoStops, countNoStops,reductionRate], axis=1)

    return newData
stopWordStats = genStopwordStats(df.content)

Remove stopwords and compute the size of the vocabulary. 
Compute the reduction rate of the vocabulary size after removing stopwords.
Remove word variations with stemming and compute the size of the vocabulary. Compute the reduction rate of the vocabulary size after stemming.
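
The statistics in this notebook are computed per document. If the vocabulary is instead taken to be the set of distinct tokens over the whole corpus, the reduction rate can be measured relative to the original vocabulary size. The sketch below shows that reading on a toy corpus. The documents, the stopword set and the helper names `vocabulary` and `reduction_rate` are hypothetical, and nltk is left out so the example runs on its own.

```python
def vocabulary(docs):
    """Return the set of distinct tokens over all tokenized documents."""
    return {token for doc in docs for token in doc}

def reduction_rate(size_before: int, size_after: int) -> float:
    """Fraction of the original vocabulary that was removed."""
    if size_before == 0:
        return 0.0
    return (size_before - size_after) / size_before

# Toy corpus and stopword set, for illustration only.
docs = [["the", "cat", "sat", "on", "the", "mat"], ["the", "dog", "sat"]]
stop_words = {"the", "on"}

before = vocabulary(docs)
after = vocabulary([[t for t in doc if t not in stop_words] for doc in docs])
rate = reduction_rate(len(before), len(after))

print(len(before), len(after), round(rate, 3))
```

The same two helpers could be applied to stemmed tokens to get the reduction rate after stemming.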

## Stemming

In [13]:

ps = PorterStemmer()
def stemStats (data:pd.Series):
    
    def countTokens (tokens:list[str]):
        if tokens is None:
            return 0
        return len(tokens)
    
    # function to be applied on each element in series
    def stem (tokens:list[str]):
        return list(set([ps.stem(token) for token in tokens]))
 
    
    tokenLenUnstemmed = pd.Series(data.apply(countTokens), name="tokenLenUnstemmed")
    stemmedTokens = pd.Series(data.apply(stem), name="stemmedTokens")
    tokenLenStemmed = pd.Series(stemmedTokens.apply(countTokens), name="tokenLenStemmed")
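    # note: this divides by the count after stemming; dividing by tokenLenUnstemmed instead gives the reduction relative to the original size
    stemmingReductionR = pd.Series((tokenLenUnstemmed-tokenLenStemmed)/(tokenLenStemmed), name="reductionRate")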
    
    return pd.concat([data,stemmedTokens,tokenLenUnstemmed,tokenLenStemmed,stemmingReductionR],axis=1)

In [14]:

stemmedStats = stemStats(stopWordStats.tokensNoStops)
len(stopWordStats.tokensNoStops[0])
#len(
#[ps.stem(token) for token in stopWordStats.tokensNoStops[0]]
#)

169

In [15]:
stemmedStats

Unnamed: 0,tokensNoStops,stemmedTokens,tokenLenUnstemmed,tokenLenStemmed,reductionRate
0,"[still, preacher, fashion, -, lived, room, shi...","[gener, still, preacher, fashion, famili, -, r...",169,162,0.043210
1,"[-, exercise, articles, within, next, dna, vim...","[-, avail, reader, within, next, dna, research...",91,90,0.011111
2,"[crew, covington, studios, wardrobe, j.d, disa...","[still, crew, disclaim, co-produc, enchilada, ...",232,224,0.035714
3,"[fashion, fisheries, -, vessel, extendable, kn...","[uniqu, fashion, -, blunder, vessel, known, da...",196,184,0.065217
4,"[spanish, angerer, saw, greatest, random, arou...","[imag, spanish, abl, ask, million, saw, greate...",122,118,0.033898
...,...,...,...,...,...
245,"[....., flagged, pussy, room, ratings, insuran...","[....., room, eunuch, civil, allegedli, outlin...",494,454,0.088106
246,"[rely, stretch, drying, degree, set, worth, we...","[basi, stretch, set, room, extra, dri, worth, ...",223,207,0.077295
247,"[example, newspaper, idea, judgment, immigrati...","[immigr, restor, ask, idea, angel, judgment, u...",100,98,0.020408
248,"[nobody, 'me, 26th, room, -, ask, plans, crowd...","[gener, 'me, 26th, room, -, ask, avail, crowd,...",197,188,0.047872


In [16]:
Final = pd.concat([stopWordStats,stemmedStats], axis=1)