# Data Science Task <a class='tocSkip'>

The *olx_spam_data__training_set.csv* contains ads data from the Real Estate category with a label indicating if it’s spam.

**Your task is to build and train a model that identifies spam content based on the provided ad parameters.**

The output of your work should be a script that we can run in a Jupyter notebook and that lets us run your model on our validation set.


## Libraries import

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import morfeusz2
import nltk
import string

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, accuracy_score

## Loading data

### For automatic validation, place the file named ``olx_spam_data__validation_set.csv`` in the ``data`` folder.

In [None]:
# Training Set
training_set = 'data/olx_spam_data__training_set.csv'
df_train = pd.read_csv(training_set, sep=None, engine='python', encoding='utf-8')
print('Training set ready.')

# Validation Set
try:
    validation_set = 'data/olx_spam_data__validation_set.csv'
    df_val = pd.read_csv(validation_set, sep=None, engine='python', encoding='utf-8')
    print('Validation set ready.')
except FileNotFoundError:
    print('Validation set not found. Check the name and path of the file.')

In [None]:
df_train.head()

In [None]:
# Dropping two useless columns
df_train.drop(['ID', 'Unnamed: 5'], axis=1, inplace=True)

In [None]:
df_train.head()

In [None]:
# Checking distribution of classes to assess the baseline for classifier
print('Mean value:', df_train['SPAM_FLAG'].mean())
print('Distribution:')
print(df_train['SPAM_FLAG'].value_counts())

The classes are balanced.
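
The baseline mentioned in the cell above can be made explicit as the accuracy of a classifier that always predicts the most frequent class. The following is a minimal sketch on made-up labels; the `labels` series is a hypothetical stand-in for `df_train['SPAM_FLAG']`, not the real data.

```python
import pandas as pd

# Hypothetical toy labels standing in for df_train['SPAM_FLAG']
labels = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='SPAM_FLAG')

# Share of each class in the data
class_share = labels.value_counts(normalize=True)

# A classifier that always predicts the most frequent class
# reaches an accuracy equal to the largest class share
baseline_accuracy = class_share.max()

print(class_share)
print('Majority-class baseline accuracy:', baseline_accuracy)
```

With perfectly balanced classes this baseline sits at 0.5, so any useful model should score clearly above that.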

## Lemmatization using Morfeusz library

In [None]:
# Using SGJP dictionary
morf = morfeusz2.Morfeusz(dict_path='./data', dict_name='sgjp')
morf.dict_id()

**Helper function for automatic lemmatization**

In [None]:
def lemmatizer(df, col):
    '''
    Arguments:
    df: DataFrame with columns to be lemmatized,
    col: Column name to be lemmatized, e.g. 'DESCRIPTION'
    '''
    
    
    # Generating list of strings from column values
    # (str(art), not str([art]), which would add brackets and quotes to the text)
    raw_ = [str(art) for art in df[col]]

    # Lowercasing
    lower_ = [art.lower() for art in raw_]

    # Tokenization
    tokenized_ = [nltk.word_tokenize(art) for art in lower_]

    # Removing punctuation chars
    no_punc_ = [[token for token in art if token not in string.punctuation] for art in tokenized_]

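    # Removing stopwords (the file is assumed to hold one stopword per line)
    with open('./data/polishstopwords.txt', encoding='utf-8') as f:
        stopwords = {line.strip() for line in f if line.strip()}
    no_stopw_ = [[token for token in art if token not in stopwords] for art in no_punc_]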

    # Lemmatization
    morf = morfeusz2.Morfeusz(expand_tags=False, dict_path='./data/', dict_name='sgjp')
    lemmatized__ = [[morf.analyse(token)[0][2][1] for token in art] for art in no_stopw_]

    # Removing artifacts after lemmatization
    lemmatized_ = [[token.split(':')[0] for token in art] for art in lemmatized__]

    # Joining tokens back together into corpus
    cleaned_ = [' '.join(tokens) for tokens in lemmatized_]

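    # Returning new dataframe with lemmatized values, keeping the original index
    # so that a later pd.concat(axis=1) aligns rows with the source dataframe
    return pd.DataFrame(cleaned_, index=df.index)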
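
The cleaning stages that precede lemmatization (lowercasing, tokenization, punctuation removal, stopword removal) can be tried in isolation on a single string. This is a simplified sketch, not the helper itself: whitespace splitting stands in for `nltk.word_tokenize`, the `clean_tokens` function is a hypothetical name, and the stopword set is made up for illustration instead of being read from `polishstopwords.txt`.

```python
import string


def clean_tokens(text, stopwords):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = text.lower().split()
    # Strip punctuation characters from both ends of every token
    tokens = [token.strip(string.punctuation) for token in tokens]
    # Drop empty tokens and stopwords (set membership, exact match)
    return [token for token in tokens if token and token not in stopwords]


# Made-up stopword set, for illustration only
sample_stopwords = {'i', 'w', 'na', 'do'}

cleaned = clean_tokens('Mieszkanie na sprzedaż, w centrum!', sample_stopwords)
print(cleaned)
```

Keeping the stopwords in a set makes the membership test an exact word match, which is the behaviour a stopword filter needs.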

In [None]:
desc = lemmatizer(df_train, 'DESCRIPTION')

In [None]:
title = lemmatizer(df_train, 'TITLE')

In [None]:
# Concatenation of lemmatized columns
df_adult = pd.concat([title, desc, df_train['PRICE'], df_train['SPAM_FLAG']], axis=1)
df_adult.columns = ['TITLE','DESCRIPTION','PRICE','SPAM_FLAG']

## Classification
Using a TF-IDF vectorizer and Multinomial Naive Bayes as the predictor.

In [None]:
# Splitting data into train and test sets
train_df, test_df = train_test_split(df_adult, test_size=0.3, random_state=42)

# Defining pipeline steps and fitting classifier
steps = [('tfidf', TfidfVectorizer()), ('cls', MultinomialNB())]
pipe = Pipeline(steps=steps)
pipe.fit(train_df['DESCRIPTION'], train_df['SPAM_FLAG'])

y_pred = pipe.predict(test_df['DESCRIPTION'])
y_true = test_df['SPAM_FLAG']

# Returning results
plt.figure(figsize=(2,2))
sns.heatmap(confusion_matrix(y_true, y_pred), square=True, annot=True, cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');

print(classification_report(y_true, y_pred))
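
The notebook imports `roc_auc_score` and `accuracy_score` but does not call them. One possible way to use them with the same kind of pipeline is sketched below. The corpus, labels, and the names `toy_pipe`, `spam_proba`, and `toy_pred` are invented for illustration; on the real data the fitted `pipe` and the `test_df` split would take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus (1 = spam, 0 = not spam), for illustration only
train_texts = [
    'tani kredyt bez bik zadzwoń teraz',
    'szybka pożyczka online bez zaświadczeń',
    'zarabiaj w domu kliknij link',
    'mieszkanie dwa pokoje centrum balkon',
    'dom z ogrodem blisko szkoła',
    'kawalerka do wynajęcia przy metro',
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ['pożyczka bez bik online', 'mieszkanie z balkon przy metro']
test_labels = [1, 0]

# Same two steps as in the notebook: TF-IDF features + Multinomial Naive Bayes
toy_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('cls', MultinomialNB())])
toy_pipe.fit(train_texts, train_labels)

# Column 1 of predict_proba holds the probability of class 1 (spam),
# because the fitted classes are ordered as [0, 1]
spam_proba = toy_pipe.predict_proba(test_texts)[:, 1]
toy_pred = toy_pipe.predict(test_texts)

toy_accuracy = accuracy_score(test_labels, toy_pred)
toy_auc = roc_auc_score(test_labels, spam_proba)
print('Accuracy:', toy_accuracy)
print('ROC AUC:', toy_auc)
```

ROC AUC is computed from the predicted probabilities, not the hard labels, so it also reflects how well the model ranks ads by spam likelihood.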

## Validation

In [None]:
try:
    description_val = lemmatizer(df_val, 'DESCRIPTION')
    title_val = lemmatizer(df_val, 'TITLE')
    
    df_adult_val = pd.concat([title_val, description_val, df_val['PRICE'], df_val['SPAM_FLAG']], axis=1)
    df_adult_val.columns = ['TITLE','DESCRIPTION','PRICE','SPAM_FLAG']
    
    y_pred = pipe.predict(df_adult_val['DESCRIPTION'])
    y_true = df_adult_val['SPAM_FLAG']

    plt.figure(figsize=(2,2))
    sns.heatmap(confusion_matrix(y_true, y_pred), square=True, annot=True, cbar=False)
    plt.xlabel('predicted value')
    plt.ylabel('true value');
    print(classification_report(y_true, y_pred))
    
except NameError:
    print('Validation dataframe not found. Check the name and path of the validation file.')

## Keras with word2vec embeddings

TBD