# Data Cleaning and EDA

The project task is to build a model that can classify posts by the subreddit they come from.

The raw data scraped from Reddit can be found in the [scraped data folder](../data/scraped).

This notebook covers cleaning the scraped data and saving a cleaned dataset, doing EDA on the datasets, and trying out some light models to get a sense of how the subreddits differ.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import regex as re
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

## Data Cleaning of Scraped Reddit Data

### Cleaning of the Language Technology Scraped Data

In [None]:
# load the Language Technology dataset
lt = pd.read_csv('../data/scraped/lt_data_1_160621.csv')

In [None]:
lt = lt.drop('Unnamed: 0', axis=1)

In [None]:
lt.head()

Drop irrelevant columns.

In [None]:
lt = lt[['title', 'selftext', 'url', 'subreddit', 'name']]

Check whether it is sensible to dedupe on the `name` field alone for this dataset.

In [None]:
len(lt[lt.duplicated(subset=['title','name'], keep=False)])

In [None]:
len(lt[lt.duplicated(subset=['name'], keep=False)])

In [None]:
lt = lt.drop_duplicates('name')

In [None]:
# drop=True discards the old index instead of keeping it as an extra column
lt.reset_index(drop=True, inplace=True)

In [None]:
lt.shape

In [None]:
lt.isna().sum()

There are many entries with no selftext because the post is a shared embedded video. These entries with a title but no selftext are still useful, as the titles are quite descriptive.

Combine the title and selftext to a new field called 'content'. 

In [None]:
lt.loc[:,'selftext'] = lt['selftext'].fillna('')

In [None]:
lt.isna().sum()

In [None]:
lt['content'] = lt['title'] + ' ' + lt['selftext']

In [None]:
lt['content'].head()

In [None]:
lt = lt.drop_duplicates('content')
lt.reset_index(drop=True, inplace=True)

In [None]:
lt.shape

Get a sense of how the posts vary by length.

In [None]:
lt['len'] = lt['content'].map(lambda x: len(x))

In [None]:
lt['len'].hist()
plt.title('Distribution of Post Character Length')
plt.xlabel('Character Length of Language Tech Posts')
plt.ylabel('Number of Posts')

In [None]:
lt[lt['len'] > 10000]['len'].hist()
plt.title('Posts Above 10000 Characters')
plt.xlabel('Character Length of Language Tech Posts')
plt.ylabel('Number of Posts')

In [None]:
lt[lt['len'] < 1000]['len'].hist()
plt.title('Posts Below 1000 Characters')
plt.xlabel('Character Length of Language Tech Posts')
plt.ylabel('Number of Posts')

As expected, there will be many posts that are very short, especially given that around 200 posts had words in the title only and not in selftext.

Next, some entries from title and content are printed to get a sense of any weird characters or patterns to look out for when cleaning.

In [None]:
for t in lt['title'][:100]:
    print(t, '\n')

In [None]:
def clean_sqb(text):
    """
    Removes square brackets
    and text between them from a string of text.
    Allows for larger window within a url-like pattern.
    """
    return re.sub(r'\[.{1,20}\]|\[.{1,20}\..{1,8}/.{0,50}\]', '', text)

From the titles, one pattern we may want to remove, as it does not carry useful information, is words in square brackets, such as `[d]` and `[Video]`.
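As a quick illustration of what the bracket pattern does, the sketch below applies the same regular expression to a few made-up titles. It uses the standard library `re` rather than the `regex` package imported in this notebook, and the titles are hypothetical, not taken from the scraped data.

```python
import re

def clean_sqb(text):
    """Same pattern as the notebook's clean_sqb, repeated here to be self-contained."""
    return re.sub(r'\[.{1,20}\]|\[.{1,20}\..{1,8}/.{0,50}\]', '', text)

# Hypothetical titles for illustration only
titles = [
    "[D] What is the best tokenizer?",
    "Sentiment analysis walkthrough [Video]",
    "No tags in this title",
    "[D] BERT [R]",
]

cleaned = [clean_sqb(t) for t in titles]
for before, after in zip(titles, cleaned):
    print(repr(before), '->', repr(after))
```

The first two titles lose only their tag and the third is unchanged. The last one comes back empty: because `.` also matches `]`, two tags that sit within 20 characters of each other are removed together with the text between them. If that is unwanted, a character class such as `[^\]]{1,20}` would be one way to stop at the first closing bracket.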

In [None]:
for t in lt['selftext'][:10]:
    print(repr(t), '\n')

Looking through the content, one observation is that there are a lot of URLs, but the URLs are not totally useless information. It is worth keeping substrings of the URL such as 'github' or the constituent words in 'facebook-ai-open-source-the-flores-101-data-set-for-better-translation-systems-around-the-world'.

Perhaps the URLs can be broken down instead of totally removed. To address this, we can remove all slashes and dashes from the texts, and remove http/https and common top-level domain names.
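The breakdown described above can be sketched on a single example. The pattern is the one used in this notebook's `melt_urls`; the URL and the test word are illustrative inputs chosen for this sketch, and the standard library `re` is used so the snippet runs on its own.

```python
import re

# Same alternatives as the notebook's melt_urls, split over lines for readability
URL_PARTS = (r'/|-|_|http:|https:|html|en|www|google|facebook|reddit'
             r'|\.com|\.co|\.net|\.info|\.org|\.us|\.uk|\.eu|\.ru|\.de|\.fr'
             r'|\.au|\.cn|\.in|\.jp|\.ca|\.tk|\.ly|\.io')

def melt_urls(text):
    """Replace URL punctuation, schemes and common domain parts with spaces."""
    return re.sub(URL_PARTS, ' ', text)

# Illustrative URL: the scheme, slashes, '.com' and 'facebook' all become spaces
url_tokens = melt_urls('https://github.com/facebookresearch/flores-101').split()
print(url_tokens)   # ['github', 'research', 'flores', '101']

# The alternatives are not anchored to URLs, so 'en' also matches inside words
word_tokens = melt_urls('sentence').split()
print(word_tokens)  # ['s', 't', 'ce']
```

The second call shows a side effect to keep in mind: short unanchored alternatives such as `en` also split ordinary words. Whether that matters depends on how much of the vocabulary is affected; restricting the short alternatives with word boundaries would be one option to consider.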

In [None]:
def melt_urls(text):
    """
    Breaks down urls in text
    into constituent information.
    """
    return re.sub(r'/|-|_|http:|https:|html|en|www|google|facebook|reddit|\.com|\.co|\.net|\.info|\.org|\.us|\.uk|\.eu|\.ru|\.de|\.fr|\.au|\.cn|\.in|\.jp|\.ca|\.tk|\.ly|\.io', ' ',
                  text)

Note that long, rare token strings left over from URLs will mostly be filtered out by the `min_df` threshold in the vectorizer.
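To make the filtering effect concrete, here is a minimal pure-Python sketch of document-frequency filtering, assuming an integer `min_df` is read as an absolute document count and a float `max_df` as a proportion of documents. The tokenised documents are hypothetical.

```python
from collections import Counter

# Hypothetical tokenised documents
docs = [
    ['github', 'nlp', 'model'],
    ['nlp', 'model', 'averyveryverylongurltoken'],
    ['nlp', 'course'],
    ['nlp', 'github'],
]

# Document frequency: number of documents that contain each token at least once
doc_freq = Counter(tok for doc in docs for tok in set(doc))

min_df = 2      # keep tokens appearing in at least 2 documents
max_df = 0.95   # drop tokens appearing in more than 95% of documents
n_docs = len(docs)

kept = sorted(
    tok for tok, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)  # ['github', 'model']
```

The long one-off token is dropped because it appears in a single document, and 'nlp' is dropped because it appears in every document.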

In [None]:
def stem(text):
    pstem = PorterStemmer()
    return ' '.join([pstem.stem(w) for w in text.split(' ')])

In [None]:
def remove_stopwords(text, stpwrds = stopwords.words('english')):
    return ' '.join([w for w in text.split(' ') if w not in stpwrds])

Numbers will also be removed during preprocessing, as previous iterations of the Count Vectorizer vocabulary showed numbers to be unhelpful features.

In [None]:
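def preproc(raw_text):
    """Clean, lowercase, remove stopwords from and stem a raw post."""
    # newlines and tabs become spaces
    processed = re.sub(r'\n|\t', ' ', raw_text)
    # strip bracketed tags, then break URLs into their parts
    processed = melt_urls(clean_sqb(processed))
    # drop digits
    processed = re.sub(r'\d+', '', processed)
    processed = processed.lower()
    processed = remove_stopwords(processed)
    return stem(processed)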

### Visualising Vocab of the Language Technology Scraped Data

In [None]:
cvec = CountVectorizer(
    preprocessor = preproc,
    ngram_range = (1,4),
    min_df = 2,
    max_df = 0.95,
)

In [None]:
lt_cvec = cvec.fit(lt['content'])

In [None]:
lt_cvec_tr = lt_cvec.transform(lt['content'])

In [None]:
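# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available from 1.0)
lt_vocab = lt_cvec.get_feature_names_out()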

In [None]:
lt_counts = np.asarray(lt_cvec_tr.sum(axis=0))

In [None]:
lt_word_counts_dict = dict(sorted(zip(lt_vocab, lt_counts[0,:]), key=lambda x:x[1], reverse=True))

The vocab dictionary ordered by count gives a sense of the most common words and phrases found in the Language Technology corpus.

In [None]:
lt_word_counts_dict

In [None]:
# lt.to_csv('../data/clean/lt_data.csv', index=False)

### Cleaning of the Neuro Linguistic Programming Scraped Data

In [None]:
# load the NeuroLingPro dataset
nl = pd.read_csv('../data/scraped/ud_nl_data_1_160621.csv')

In [None]:
nl.head()

In [None]:
nl.shape

In [None]:
nl = nl[['title', 'selftext', 'url', 'subreddit', 'name']]

Check whether it is sensible to dedupe on the `name` field alone for this dataset.

In [None]:
len(nl[nl.duplicated(subset=['title','name'], keep=False)])

In [None]:
len(nl[nl.duplicated(subset=['name'], keep=False)])

All the duplicates by title are same as duplicates by name. The data can be deduped by name.

In [None]:
nl = nl.drop_duplicates('name')

In [None]:
# drop=True discards the old index instead of keeping it as an extra column
nl.reset_index(drop=True, inplace=True)

In [None]:
nl.shape

In [None]:
nl.isna().sum()

In [None]:
nl[nl.selftext.isna()][10:20]

Although there are quite a lot of entries without selftext, the title contains quite a lot of descriptive vocabulary that could be used to train the model. This is data we should preserve.

Combine the title and selftext to a new field called 'content'. 

In [None]:
nl.loc[:,'selftext'] = nl['selftext'].fillna('')

In [None]:
nl.isna().sum()

In [None]:
nl['content'] = nl['title'] + ' ' + nl['selftext']

In [None]:
nl['content'].head()

In [None]:
nl.shape

In [None]:
nl.drop_duplicates('content').shape

In [None]:
nl = nl.drop_duplicates('content')
nl.reset_index(drop=True, inplace=True)

In [None]:
nl.shape

Get a sense of how the posts vary by length.

In [None]:
nl['len'] = nl['content'].map(lambda x: len(x))

In [None]:
nl['len'].hist()
plt.title('Distribution of NLP Posts by Character Length')
plt.xlabel('Character Length of NLP Posts')
plt.ylabel('Number of Posts')

In [None]:
nl[nl['len'] > 10000]['len'].hist()
plt.title('Posts Above 10000 Characters')
plt.xlabel('Character Length of NLP Posts')
plt.ylabel('Number of Posts')

In [None]:
nl[nl['len'] < 1000]['len'].hist()
plt.title('Posts Below 1000 Characters')
plt.xlabel('Character Length of NLP Posts')
plt.ylabel('Number of Posts')

There are only 2 posts above 10,000 characters long in this dataset, and most of the posts are under 1000 characters. A large majority are between 0-100 characters, as these are the posts where the title makes up most of the text content.

Next, samples of the text will be looked through to see if there are any patterns to clean that the above cleaning functions do not take care of.

In [None]:
for t in nl['title'][:100]:
    print(t, '\n')

In [None]:
for t in nl[nl['selftext']!='']['selftext'][:10]:
    print(repr(t), '\n')

There seem to be numbers and words in square brackets that require cleaning, which the above functions will take care of. URLs do not seem to be as prevalent in this dataset. Nonetheless, they will be broken down as with the first dataset, as and when they occur.

In [None]:
nl_cvec = CountVectorizer(
    preprocessor = preproc,
    ngram_range = (1,4),
    min_df = 2,
    max_df = 0.95,
)

In [None]:
nl_cvec.fit(nl['content'])

In [None]:
nl_cvec_tr = nl_cvec.transform(nl['content'])

In [None]:
nl_vocab = nl_cvec.get_feature_names_out()

In [None]:
nl_counts = np.asarray(nl_cvec_tr.sum(axis=0))

In [None]:
nl_word_counts_dict = dict(sorted(zip(nl_vocab, nl_counts[0,:]), key=lambda x:x[1], reverse=True))

Next, the top vocabulary counts will be inspected for anything that requires further cleaning.

In [None]:
nl_word_counts_dict

In [None]:
# nl.to_csv('../data/clean/nl_data.csv', index=False)

The vocabulary output generally looks clean for now, enough to move on to combine the datasets and train a prototype model.

### Combine the Datasets

In [None]:
# load cleaned data
# nl = pd.read_csv('../data/clean/nl_data.csv')
# lt = pd.read_csv('../data/clean/lt_data.csv')

In [None]:
df = pd.concat([nl, lt], axis=0, ignore_index=True)

In [None]:
df = df[['content', 'len', 'subreddit']]

In [None]:
df.head(2)

In [None]:
df.shape

### Compare Vocabulary

A Count Vectorizer will be used on the combined data to get the word counts for each feature. After that, the resultant matrix will be converted to a dataframe, with some aggregations and transformations made to it so that the feature counts for the two topics can be compared.
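The aggregation described above can be sketched without pandas to show what the final comparison table holds. The posts below are hypothetical and already tokenised; the column names mirror the ones used in this notebook.

```python
from collections import Counter

# Hypothetical, already-preprocessed posts as (subreddit, tokens) pairs
posts = [
    ('LanguageTechnology', ['nlp', 'python', 'model']),
    ('LanguageTechnology', ['nlp', 'learn', 'python']),
    ('NLP', ['nlp', 'coach', 'learn']),
    ('NLP', ['nlp', 'coach']),
]

# Sum token counts per subreddit
counts = {'LanguageTechnology': Counter(), 'NLP': Counter()}
for subreddit, tokens in posts:
    counts[subreddit].update(tokens)

# One row per feature, with the per-topic counts, their total and absolute difference
features = sorted(set(counts['LanguageTechnology']) | set(counts['NLP']))
table = {}
for feature in features:
    lt_n = counts['LanguageTechnology'][feature]
    nl_n = counts['NLP'][feature]
    table[feature] = {
        'LanguageTechnology': lt_n,
        'NLP': nl_n,
        'total count': lt_n + nl_n,
        'diff': abs(lt_n - nl_n),
    }

for feature, row in table.items():
    print(feature, row)
```

A feature such as 'nlp' ends up with a large total but a `diff` of zero, while 'python' and 'coach' each occur in only one topic.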

In [None]:
combined_cvec = CountVectorizer(
    preprocessor=preproc,
    ngram_range = (1,4),
    min_df = 2,
    max_df = 0.95,
)

In [None]:
combined_cvec_matrix = combined_cvec.fit_transform(df['content'])

In [None]:
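# toarray() returns a plain ndarray; get_feature_names_out() replaces the removed get_feature_names()
cvec_df = pd.DataFrame(combined_cvec_matrix.toarray(), columns=combined_cvec.get_feature_names_out())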

In [None]:
# keep only the label column next to the counts, so the text in 'content' is not summed in the groupby
count_df = pd.concat([df[['subreddit']], cvec_df], axis=1)

In [None]:
count_df = count_df.groupby('subreddit').sum().T

In [None]:
count_df['total count'] = count_df['LanguageTechnology'] + count_df['NLP']

In [None]:
count_df['diff'] = np.abs(count_df['LanguageTechnology'] - count_df['NLP'])

In [None]:
count_df.rename_axis('feature', inplace=True)

In [None]:
count_df.describe()

Get the top occurring words/phrases in the Language Technology data.

In [None]:
count_df.sort_values(by='LanguageTechnology', ascending=False)[:20]

Get the top occurring words/phrases in the Neuro Linguistic Programming data.

In [None]:
count_df.sort_values(by='NLP', ascending=False)[:20]

Examine top words that occur in one topic but not the other.

In [None]:
count_df[count_df['NLP']==0].sort_values(by='LanguageTechnology', ascending=False)[:20]

In [None]:
count_df[count_df['LanguageTechnology']==0].sort_values(by='NLP', ascending=False)[:20]

Examine words that occur a similar number of times in both topics' datasets.

In [None]:
count_df[(count_df['LanguageTechnology'] > 30)
         &(count_df['NLP'] > 30)
         &(count_df['diff'] < 10)
         &(count_df['total count'] < 800)]

Examine words that differ largely in the number of counts between topics.

In [None]:
count_df.sort_values(by='diff', ascending=False)[:20]

Based on background knowledge of the topics, the distributions of some words across the two topics can be visualised. Words that one would guess to be prominent in both topics, or distinctive of one, are used to subset the dataframe and plot their counts.

In [None]:
ax = count_df.loc[['nlp', 'program', 'languag', 'learn', 'cours', 'linguist', 'search', 
              'expert', 'coach', 'practition', 'scam', 'scienc', 'python', 'gpt'],
             ['LanguageTechnology','NLP']].plot(kind='barh',
                                                title='Word Count Comparison',
                                                figsize=(8,12),
                                                fontsize=14,
                                                ylabel='Word Count Log Scale',
                                                #xlim=(0,100),
                                                logx=True
                                               )

In [None]:
# fig = ax.get_figure()

In [None]:
# fig.savefig('../img/word_dist_barh.png')

Next, get an overall sense of how many terms are unique to each topic.

In [None]:
count_df.shape[0]

In [None]:
count_df[(count_df['LanguageTechnology'] > 0)
         &(count_df['NLP'] > 0)].shape[0]

In [None]:
count_df[(count_df['LanguageTechnology'] == 0)
         &(count_df['NLP'] != 0)].shape[0]

In [None]:
count_df[(count_df['LanguageTechnology'] != 0)
         &(count_df['NLP'] == 0)].shape[0]

Although there are quite a lot of commonly occurring words/phrases between the topics (5350), there are also many words/phrases that occur in one topic's data and not the other (about 10000). This suggests that the classification task has good potential to succeed, as there are many potential distinguishing features.

Next, some further clean up will be done and the first model prototype will be tried.

## Train a Prototype Model

Now, the combined dataset will be used to build a prototype model to get a better feel of the classification task.

The main goal is to get a rough sense of how successful this classification task will be, and what further data cleaning is needed before the data moves to the next stage, where many different models and hyperparameters will be tested.

In [None]:
df['subreddit'].value_counts(normalize=True)

There is a close to 50/50 split. The baseline accuracy, from always predicting the majority class, is about 51%.
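The baseline figure is simply the share of the larger class. A minimal sketch, using hypothetical label counts that give a 51/49 split (which class is the larger one here is an assumption of the example, not a result from the data):

```python
from collections import Counter

# Hypothetical labels with a 51/49 split
labels = ['LanguageTechnology'] * 51 + ['NLP'] * 49

# A classifier that always predicts the most common label is right this often
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)
```

Any model worth keeping should score clearly above this value on held-out data.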

Next, the subreddit will be mapped to numeric class labels.
The positive label (1) will be Language Technology, and the 'NLP' (Neuro Linguistic Programming) subreddit will be the negative label (0).

In [None]:
df['LT'] = df['subreddit'].map(lambda x: 0 if x == 'NLP' else 1)

In [None]:
df['LT'].value_counts(normalize=True)

In [None]:
# df.to_csv('../data/clean/nl_lt_data.csv', index=False)

### Separate Data into Predictors and Target

In [None]:
# X is a series as it will be featurised with a Count Vectorizer later
X = df['content']
y = df['LT']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

### Removing Obvious Vocabulary from the Features
The model will not be useful if it simply learns the subreddit name or synonyms for the subreddit name.
These will be removed from featurisation by being added to stopwords. 

However, for the task of classifying between Language Technology and Neuro Linguistic Programming posts, the term 'NLP' will not be removed from the features. It is highly prevalent in both topics, since 'NLP' also stands for 'Natural Language Processing', a popular synonym for Language Technology, so it should not be a distinguishing feature between them. It is in fact better to leave the term in, precisely because the two topics confusingly use 'NLP' to refer to two different things, and the utility of the classifier lies in telling them apart.

In [None]:
def check_vocab(wc_dict, substr):
    """
    Check for a certain term in the
    word count dictionary 
    based on a substring match.
    """
    return {k:v for k,v in wc_dict.items() if substr in k}

In [None]:
check_vocab(lt_word_counts_dict, 'program')

In [None]:
check_vocab(nl_word_counts_dict, 'neurolinguist')

In [None]:
# not removing 'nlp', 'linguist', 'programming' as they are commonly used terms in both sets
stpwrds_lt_nl = stopwords.words('english') + ['language', 'technology',
                                              'natural', 'processing',
                                              'neuro', 'neurolinguist', 'neurolinguistic']

Update the preprocessing function to account for the extended stopwords.

In [None]:
def preproc_lt_nl(raw_text):
    processed = re.sub(r'\n|\t', ' ', raw_text)
    processed = melt_urls(clean_sqb(processed))
    processed = re.sub(r'\d+', '', processed)
    processed = processed.lower()
    processed = remove_stopwords(processed, stpwrds=stpwrds_lt_nl)
    return stem(processed)

### Create the Count Vectorizer 

In [None]:
lt_nl_cvec = CountVectorizer(
    preprocessor = preproc_lt_nl,
    ngram_range = (1,4),
    min_df = 2,
    max_df = 0.95,
)

In [None]:
lt_nl_cvec.fit(X_train)

In [None]:
lt_nl_cvec_tr = lt_nl_cvec.transform(X_train)

In [None]:
lt_nl_vocab = lt_nl_cvec.get_feature_names_out()

In [None]:
lt_nl_counts = np.asarray(lt_nl_cvec_tr.sum(axis=0))

In [None]:
word_counts_dict = dict(sorted(zip(lt_nl_vocab, lt_nl_counts[0,:]), key=lambda x:x[1], reverse=True))

Inspect the combined vocabulary.

In [None]:
word_counts_dict

### Train a Logistic Regression Model

In [None]:
model = Pipeline([
    ('cvec', CountVectorizer(
        preprocessor = preproc_lt_nl,
        ngram_range = (1,4),
        min_df = 2,
        max_df = 0.95,
    )),
    ('logreg', LogisticRegression()),
])

In [None]:
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

A simple logistic regression classifier seems to work remarkably well in distinguishing between the two topics.
The features' coefficients will be inspected to see whether the prominent features seem sensible, in the sense that they would generalise to more examples.
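When reading coefficients, each one has to be paired with the feature in the same column. `CountVectorizer.vocabulary_` is a mapping from term to column index, and the order of its keys is not the column order, so the terms need to be sorted by index first. The sketch below shows the idea with hypothetical stand-ins for the vocabulary and the coefficient vector; the values are made up.

```python
import math

# Hypothetical stand-ins for model['cvec'].vocabulary_ and model['logreg'].coef_[0]
vocabulary = {'python': 2, 'coach': 0, 'hypnosi': 1}   # term -> column index
coefs = [-1.2, -0.7, 1.5]                              # one coefficient per column

# Order the terms by column index so that position i holds the term for column i
terms_by_column = [term for term, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])]

# exp(coefficient) is the multiplicative change in the odds of the positive class
odds_ratios = {term: math.exp(c) for term, c in zip(terms_by_column, coefs)}

# Rank by absolute coefficient so strong features of either class surface
ranked = sorted(zip(terms_by_column, coefs), key=lambda tc: abs(tc[1]), reverse=True)
print(ranked)  # [('python', 1.5), ('coach', -1.2), ('hypnosi', -0.7)]
```

Zipping the raw dictionary keys with the coefficients would have attached 1.5 to 'hypnosi' here, which is the kind of mismatch that can make unrelated words look like strong features.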

### Inspect Coefficients for Features

In [None]:
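# vocabulary_ maps term -> column index and its keys are not in column order,
# so take the names from get_feature_names_out() to keep them aligned with coef_
dict(sorted(zip(model['cvec'].get_feature_names_out(), np.exp(model['logreg'].coef_[0])), key=lambda x:np.abs(x[1]), reverse=True))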

Some names are surfacing in the vocabulary. Some investigation is warranted to ensure these are not reddit users, but actual people associated with the topic.

In [None]:
def inspect_pattern(phrase, text):
    if phrase in text.lower():
        print(re.findall(r'.{1,20}' + phrase.replace(' ', r'\s') + r'.{1,20}', text, flags=re.I))

In [None]:
for text in nl['content']:
    inspect_pattern('james pesch', text)

A Google search shows that James Pesch is a seemingly popular personality in Neuro Linguistic Programming, so the name would be a sensible feature to retain to distinguish between the topics.

In [None]:
for text in nl['content']:
    inspect_pattern('blow', text)

In [None]:
for text in lt['content']:
    inspect_pattern('specif', text)

It seems weird that 'blow' is a significant feature to distinguish between the two topics.

Next, to further test the robustness of this preliminary classifier, it will be evaluated on a more difficult task: distinguishing between topics closely related to Language Technology and Neuro Linguistic Programming. These closely related topics are Deep Learning (similar to Language Technology) and Hypnosis (similar to Neuro Linguistic Programming).

## Evaluation Using Related Topics 
### Deep Learning and Hypnosis Data
This data of related topics will be used to see if the model can generalise well to unseen data of closely related topics.

### Load and Clean the Deep Learning and Hypnosis Data

In [None]:
dl = pd.read_csv('../data/scraped/ud_dl_data_1_160621.csv')
hy = pd.read_csv('../data/scraped/ud_hy_data_1_160621.csv')

In [None]:
dl.head(2)

In [None]:
hy.head(2)

In [None]:
def clean_and_dedupe(data):
    """
    Converts raw scraped data into
    format ready to be featurised.
    Follows the dataframe transformations
    in the top half of this nb.
    """
    data = data[['name', 'title', 'selftext', 'url', 'subreddit']]
    # copy so the column assignments below act on an independent frame
    data = data.drop_duplicates('name').copy()
    data['selftext'] = data['selftext'].fillna('')
    data['content'] = data['title'] + ' ' + data['selftext']
    data = data.drop_duplicates('content')
    # drop=True avoids keeping the old index as an extra column
    return data.reset_index(drop=True)

In [None]:
dl = clean_and_dedupe(dl)

In [None]:
dl.shape

In [None]:
hy = clean_and_dedupe(hy)

In [None]:
hy.shape

In [None]:
hy_dl = pd.concat([hy,dl], axis=0, ignore_index=True)

In [None]:
hy_dl = hy_dl[['content', 'subreddit']]

In [None]:
hy_dl['subreddit'].value_counts(normalize=True)

Ensure that the positive label (1) in this set lines up with the analogous positive label of the model training data. So deep learning posts will be labelled '1', as they are the analogous posts to the language technology posts labelled '1' above.

In [None]:
hy_dl['DL'] = hy_dl['subreddit'].map(lambda x: 1 if x=='deeplearning' else 0)

In [None]:
hy_dl['DL'].value_counts(normalize=True)

In [None]:
# hy_dl.to_csv('../data/clean/hy_dl_data.csv', index=False)

### Evaluate Logistic Regression Model on Related Topics

Now, with the data from the related topics, Hypnosis and Deep Learning, prepared for prediction, the logistic regression model above trained on the Neuro Linguistic Programming and Language Technology data will be evaluated for how well it can generalise to distinguish between closely related topics.

In [None]:
X_dl = hy_dl['content']
y_dl = hy_dl['DL']

In [None]:
model.score(X_dl, y_dl)

The model trained to distinguish Language Technology posts from Neuro Linguistic Programming posts generalises quite well to distinguishing between Deep Learning and Hypnosis content.

In [None]:
print(classification_report(y_dl, model.predict(X_dl)))

The model's performance on the Deep Learning and Hypnosis set is decent; however, some analysis can be done on where the model could improve.

Firstly, the model suffers on precision (78%) for the negative label. This means that for every 10 posts the model classifies as 'Hypnosis', about 2 are wrongly classified and are actually about 'Deep Learning'.

The model also has a relatively low recall (75%) for the positive label. It fails to detect about one quarter of the posts on deep learning.
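The two figures above come from different slices of the confusion matrix, which a short calculation makes explicit. The counts below are hypothetical, chosen only so that the ratios land near the quoted rates; they are not the notebook's actual counts.

```python
# Hypothetical confusion counts (label 1 = Deep Learning, label 0 = Hypnosis)
tp, fn = 75, 25   # Deep Learning posts predicted as 1 / predicted as 0
tn, fp = 90, 10   # Hypnosis posts predicted as 0 / predicted as 1

# Recall for the positive label: share of true Deep Learning posts that were found
recall_pos = tp / (tp + fn)

# Precision for the negative label: share of 'Hypnosis' predictions that were right
precision_neg = tn / (tn + fn)

print(round(recall_pos, 2), round(precision_neg, 2))  # 0.75 0.78
```

Both numbers are driven by the same error, the Deep Learning posts predicted as Hypnosis (`fn`), which is why they move together.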

There are several ways one could try to improve the score on unseen data. Different models or different regularization parameters can be tried, which will be done in the next notebook where there is a more thorough exploration of different models.

Another method to try is to change the preprocessor to get better features. Above, stemming was tried, but not lemmatization, which will be quickly tried next.

### Try Lemmatization
Lemmatization will be tested to see if it can outperform stemming as a preprocessing step that contributes to better model performance. In theory, lemmatization should consolidate the different forms of a word into its base form more reliably, thereby helping the model learn from more distinct and consolidated word features.

In [None]:
def get_pos(word):
    pos_key = pos_tag([word])[0][1][0]
    pos_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'R': wordnet.ADV,
        'V': wordnet.VERB
    }   
    return pos_dict.get(pos_key, wordnet.NOUN)

def lem(text):
    lemm = WordNetLemmatizer()
    return ' '.join([lemm.lemmatize(w, get_pos(w)) for w in text.split(' ')])

In [None]:
def preproc_lem(raw_text):
    processed = re.sub(r'\n|\t', ' ', raw_text)
    processed = melt_urls(clean_sqb(processed))
    processed = re.sub(r'\d+', '', processed)
    processed = processed.lower()
    processed = remove_stopwords(processed, stpwrds=stpwrds_lt_nl)
    return lem(processed)

In [None]:
model_lem = Pipeline([
    ('cvec', CountVectorizer(
        preprocessor = preproc_lem,
        ngram_range = (1,4),
        min_df = 2,
        max_df = 0.95,
    )),
    ('logreg', LogisticRegression()),
])

In [None]:
model_lem.fit(X_train, y_train)

In [None]:
model_lem.score(X_train, y_train)

In [None]:
model_lem.score(X_test, y_test)

In [None]:
model_lem.score(X_dl, y_dl)

Lemmatization does not seem to make a difference to the model accuracy on either the train or the test set. The model with lemmatization preprocessing also scores slightly lower on the related topics. Considering that lemmatization is also a more expensive technique than stemming, it will be disregarded in the subsequent model testing.

## Next Notebook: Trying Different Classification Models

**In the [subsequent notebook](02_Models.ipynb), different featurisers and models will be evaluated for their performance on the original dataset and the related dataset.**