# Preprocessing the news dataset

We searched the web for additional German fake news datasets so that we could improve our model and test it on data from a different source.  
We found this one: https://www.kaggle.com/astoeckl/fake-news-dataset-german

# Import necessary libraries

In [2]:
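import re

import nltk
import numpy as np  # used below as np when building the undersampling arrays
import pandas as pd
import requests
from imblearn.under_sampling import RandomUnderSampler
from nltk.corpus import stopwords

# Download the NLTK stopword lists (does nothing if they are already present)
nltk.download('stopwords')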

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Data preprocessing

## Load news.csv

In [3]:
df = pd.read_csv('../../Datasets/news/news.csv', index_col=0)

In [39]:
df

id,url,Titel,Body,Kategorie,Datum,Quelle,Fake,Art
773233,http://www.der-postillon.com/2018/01/grokoleak...,Exklusiv! Das geheime WhatsApp-Chat-Protokoll ...,Die Sondierungsgespräche zwischen Union und SP...,wirtschaft,2018-01-18 00:00:00,Postillion,1,
773234,http://www.der-postillon.com/2018/01/trump-san...,"Trump droht, jeden zu verspeisen, der an seine...",Nun ist es auch medizinisch offiziell bestätig...,wirtschaft,2018-01-17 00:00:00,Postillion,1,
773235,http://www.der-postillon.com/2018/01/fdp-sondi...,"Soli runter, keine Steuererhöhungen, kein Klim...","Es waren zähe Verhandlungen, doch die Freien D...",wirtschaft,2018-01-12 00:00:00,Postillion,1,
773236,http://www.der-postillon.com/2018/01/joachim-s...,Hat sie eine Affäre? Joachim Sauer glaubt Ange...,Wo treibt sie sich immer bis spät in die Nacht...,wirtschaft,2018-01-09 00:00:00,Postillion,1,
773237,http://www.der-postillon.com/2018/01/halb-so-s...,"""Er hat ja nur HALBneger gesagt"": So begründet...",Der Parteivorstand drückt nochmal ein Auge zu:...,wirtschaft,2018-01-08 00:00:00,Postillion,1,
...,...,...,...,...,...,...,...,...
838144,http://www.kleinezeitung.at//international/537...,Lehrer entging durch Hochzeit mit Schülerin Ve...,55-Jähriger muss nach Sex mit damals 15-Jährig...,International,2018-02-26 00:00:00,Kleine,0,
838145,http://www.kleinezeitung.at//wirtschaft/wirtsc...,Warum die Taiwaner Toilettenpapier bunkern,Aus Angst vor Preiserhöhungen bei Klopapier ka...,Wirtschaft,2018-02-26 00:00:00,Kleine,0,
838146,http://www.kleinezeitung.at//wirtschaft/wirtsc...,Warum die Taiwaner Toilettenpapier bunkern,Aus Angst vor Preiserhöhungen bei Klopapier ka...,Wirtschaft,2018-02-26 00:00:00,Kleine,0,
838147,http://www.kleinezeitung.at//wirtschaft/wirtsc...,\r\nDie neue Premium-Klasse von Samsung\r\n ...,Am Vorabend der Eröffnung des Mobile World Con...,Wirtschaft,2018-02-25 00:00:00,Kleine,0,


## Create new dataset using Undersampling

We built a new dataset from the Titel, Body and Fake columns, renamed to Title, Text and Fake-Real. Because, according to the EDA, there are far fewer fake articles than real ones, we performed random undersampling to balance the dataset.
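To make the idea concrete, here is a minimal, library-free sketch of what random undersampling does to the class counts. It is not the imbalanced-learn implementation; the function name `random_undersample` and the toy data are hypothetical and only serve as an illustration.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest class."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is cut down to the size of the smallest one
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (hypothetical): 6 real articles (0) and 2 fake ones (1)
rows = [f"article {i}" for i in range(8)]
labels = [0, 0, 0, 1, 0, 0, 1, 0]
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))
```

All minority-class rows survive, and the majority class is cut down to the same size. The price is that the discarded real articles are no longer available for training.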

In [60]:
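# Features: title and body text; label: the Fake flag
X_train = np.array([df['Titel'], df['Body']]).T
y_train = np.array(df['Fake'])

# Randomly drop majority-class rows until both classes have the same size
rus = RandomUnderSampler(random_state=0)
X_resampled_under, y_resampled_under = rus.fit_resample(X_train, y_train)

# Re-attach the labels as a third column; reshape(-1, 1) avoids hard-coding the row count
x_final = np.append(X_resampled_under, y_resampled_under.reshape(-1, 1), axis=1)

df_new = pd.DataFrame(x_final)

df_new.columns = ['Title', 'Text', 'Fake-Real']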

In [81]:
df_new

## Stemming

We used CISTEM, a German stemmer available on GitHub: https://github.com/LeonieWeissweiler/CISTEM  
We also wrote a function named stemmer that removes German stopwords from a dataset entry and stems the remaining words.
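The cleaning and stopword-removal step can be sketched on its own, with stemming left out. The stopword set below is a small hypothetical stand-in for NLTK's full German list, and `clean_tokens` is an illustrative name, not a function from this notebook.

```python
import re

# Hypothetical mini stopword set; the notebook uses NLTK's full German list
STOPWORDS = {"der", "die", "das", "und", "ist", "in", "zu", "zwischen"}


def clean_tokens(text):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    # Anything that is not a letter (including umlauts and ß) becomes a space
    letters_only = re.sub('[^a-zA-ZäöüÄÖÜß]', ' ', text)
    return [tok for tok in letters_only.lower().split() if tok not in STOPWORDS]


tokens = clean_tokens("Die Sondierungsgespräche zwischen Union und SPD, 2018!")
print(tokens)  # ['sondierungsgespräche', 'union', 'spd']
```

Keeping the stopwords in a set makes each membership check constant time, which matters when every token of every article body is tested.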

In [90]:
stripge = re.compile(r"^ge(.{4,})")
replxx = re.compile(r"(.)\1")
replxxback = re.compile(r"(.)\*")
stripemr = re.compile(r"e[mr]$")
stripnd = re.compile(r"nd$")
stript = re.compile(r"t$")
stripesn = re.compile(r"[esn]$")


def stem(word, case_insensitive = False):
    if len(word) == 0:
        return word

    upper = word[0].isupper()
    word = word.lower()

    word = word.replace("ü","u")
    word = word.replace("ö","o")
    word = word.replace("ä","a")
    word = word.replace("ß","ss")

    word = stripge.sub(r"\1", word)
    word = word.replace("sch","$")
    word = word.replace("ei","%")
    word = word.replace("ie","&")
    word = replxx.sub(r"\1*", word)

    while len(word) > 3:
        if len(word) > 5:
            (word, success) = stripemr.subn("", word)
            if success != 0:
                continue

            (word, success) = stripnd.subn("", word)
            if success != 0:
                continue

        if not upper or case_insensitive:
            (word, success) = stript.subn("", word)
            if success != 0:
                continue

        (word, success) = stripesn.subn("", word)
        if success != 0:
            continue
        else:
            break

    word = replxxback.sub(r"\1\1", word)
    word = word.replace("%","ei")
    word = word.replace("&","ie")
    word = word.replace("$","sch")

    return word

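# Build the stopword set once; calling stopwords.words() for every token is very slow
GERMAN_STOPWORDS = set(stopwords.words('german'))

def stemmer(title):
    # Keep only letters (including German umlauts and ß), then lowercase and tokenize
    review = re.sub('[^a-zA-ZäöüÄÖÜß]', ' ', title)
    review = review.lower().split()
    # Drop stopwords and stem the remaining tokens
    review = [stem(word) for word in review if word not in GERMAN_STOPWORDS]
    return ' '.join(review)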

def create_dataset_fake_news(df):
    # Stem the title (column 0) and the body (column 1) of every row in place
    for i in range(len(df)):
        df.iloc[i, 0] = stemmer(df.iloc[i, 0])
        df.iloc[i, 1] = stemmer(df.iloc[i, 1])
    return df


## Creating new dataset

In [None]:
df_preprocessed = create_dataset_fake_news(df_new)

In [None]:
# Number of real (0) and fake (1) articles after undersampling
print(df_preprocessed[df_preprocessed['Fake-Real'] == 0].count())
print(df_preprocessed[df_preprocessed['Fake-Real'] == 1].count())

## Saving our dataset

In [28]:
df_preprocessed.to_csv('../../Datasets/news/df_preprocessed_news')
