# Exploratory Data Analysis
This notebook computes summary statistics and visualizations to better understand the corpus after it has been created.

In [145]:
import pandas as pd
import numpy as np
import requests
from dotenv import load_dotenv
import os
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/davidzhu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/davidzhu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/davidzhu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [146]:
DATA_PATH = os.path.join("..", 'data')
CORPUS_FILENAME = 'dataset-v1.pkl'

CORPUS_FILE = os.path.join(DATA_PATH, CORPUS_FILENAME)

## Import Corpus

In [147]:
input_df = pd.read_pickle(CORPUS_FILE)

In [148]:
# label an issue "spam" if its labels contain "spam"; every other issue becomes "not-spam"
df = pd.DataFrame(columns=['label', 'text'])
df['label'] = input_df['labels'].apply(lambda x: 'spam' if 'spam' in x else 'not-spam')
df['text'] = input_df['title'] + " " + input_df['body']

## Preprocess
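Before analyzing the text data, it is best practice to preprocess it. For reference, read [this article](https://medium.com/@maleeshadesilva21/preprocessing-steps-for-natural-language-processing-nlp-a-beginners-guide-d6d9bf7689c9). This notebook applies the following steps in order:
- Lowercase
- Clean with regular expressions (code, URLs, punctuation, numbers)
- Tokenize
- Remove stop words
- Lemmatize (used here in place of stemming)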

### Define regex that will be used to clean the text

In [149]:
code_block_pattern = r"```([\w\W]*?)```" # matches fenced code blocks enclosed in triple backticks ```
code_pattern = r"`([\w\W]*?)`" # matches inline code enclosed in single backticks `
url_pattern = r"http\S+" # matches URLs (any run of non-whitespace characters starting with "http")
new_line_pattern = r'[\r\n]' # matches new lines and carriage returns (a "|" inside [] would match a literal pipe)
non_word_pattern = r'[^\w\s]' # matches punctuation and symbols (retains alphanumerics, underscores, and whitespace)
number_pattern = r'\S*\d+\S*' # matches every whitespace-delimited word that contains a digit

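def clean(text):
    # order matters: strip fenced code blocks before inline code, and URLs before punctuation
    # every match is replaced by a space so that neighboring words do not merge
    text = re.sub(code_block_pattern, ' ', text)
    text = re.sub(code_pattern, ' ', text)
    text = re.sub(url_pattern, ' ', text)
    text = re.sub(new_line_pattern, ' ', text)
    text = re.sub(non_word_pattern, ' ', text)
    text = re.sub(number_pattern, ' ', text)
    return text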

### Lowercasing and cleaning the text
Every string value is lowercased first, then the text column is cleaned with the regexes defined above.

In [150]:
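# lowercase every string cell (DataFrame.map requires pandas 2.1+; older versions call it applymap)
df = df.map(lambda x: x.lower() if isinstance(x, str) else x)

# strip code, URLs, punctuation, and number-bearing words from the text column
df['text'] = df['text'].apply(clean)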
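
To make the effect of the cleaning step concrete, here is a minimal, self-contained sketch that runs the same sequence of substitutions on one made-up issue. The `PATTERNS` list is only a condensed restatement of the regexes above, and `FENCE` and `sample` are hypothetical names and data introduced for illustration; none of them come from the corpus.

```python
import re

# Three backticks, built programmatically so the example can live inside a fenced block.
FENCE = "`" * 3

# Condensed restatement of the notebook's patterns, in the order clean() applies them.
PATTERNS = [
    FENCE + r"[\w\W]*?" + FENCE,  # fenced code blocks
    r"`[\w\W]*?`",                # inline code
    r"http\S+",                   # URLs
    r"[\r\n]",                    # new lines and carriage returns
    r"[^\w\s]",                   # punctuation and symbols
    r"\S*\d+\S*",                 # words that contain a digit
]

def clean(text):
    """Replace every match of every pattern with a single space."""
    for pattern in PATTERNS:
        text = re.sub(pattern, " ", text)
    return text

# Hypothetical issue text (not taken from the dataset).
sample = (
    "crash in `foo()` see https://example.com/issues/12\n"
    + FENCE + "\nprint(1)\n" + FENCE
    + "\nversion 2.0 fails!"
)

tokens = clean(sample.lower()).split()
print(tokens)
```

Only the plain words survive: the inline code, the URL, the fenced block, and the version number are all dropped. Note that `2.0` disappears in two stages: the punctuation pattern first splits it into `2` and `0`, and the number pattern then removes both.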

### Tokenization

In [151]:
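# word_tokenize needs NLTK's 'punkt' tokenizer data ('punkt_tab' on newer NLTK releases), which is not downloaded above
df['text'] = df['text'].apply(word_tokenize)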

### Removing stop words

In [152]:
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: [word for word in x if word not in stop_words])
df

index,label,text
0,not-spam,"[missing, migrations, libavif, pinned, conda, forge, pinnings, comment, debugging, strange, resolutions, boiled, c, users, straversaro, conda, create, n, test, pygame, ffmpeg, libavif, channels, robostack, staging, conda, forge, platform, win, collecting, package, metadata, repodata, json, done, solving, environment, failed, libmambaunsatisfiableerror, encountered, problems, solving, package, pygame, requires, none, providers, installed, could, solve, environment, specs, following, packages, incompatible, ffmpeg, installable, potential, options, ffmpeg, would, require, aom, installed, svt, installed, ffmpeg, would, require, aom, conflicts, installable, versions, previously, reported, ffmpeg, would, require, aom, conflicts, installable, versions, previously, reported, ffmpeg, would, require, aom, conflicts, installable, versions, previously, reported, pygame, installable, requires, viable, options, would, require, ...]"
1,not-spam,"[ui, feedback, uag, search, button, need, click, search, button, click, area, write, type, enter, keyboard, enough, understand, highlight, done, several, lenses, applied, difficult, understand, available, local, language, lack, consistency, appearance, font, capital, letters, functionality, basic]"
2,not-spam,"[issues, resolved, welcome, repository]"
3,not-spam,"[create, account, rules, broken, error, prompt]"
4,not-spam,"[nan, search, results, table, general, question, something, make, less, brittle, passing, nan, search, result, table, functions, operations, fine, seem, sensible, missing, data, values, want, easily, maintainable, robust, reasonably, process, working, discovered, nans, added, pandas, dataframe, search, result, table, broke, things, unexpected, ways, particularly, nan, datauri, got, passed, astroquery, mast, observations, get_cloud_uris, nans, get, introduced, query, columns, empty, use, pd, concat, join, tables, together, astroquery, mast, observations, outer, join, preserve, much, information, columns, possible, recent, example, example, issue, mainline, lightkurve, concatenate, dataframe, tesscut, information, main, self, table, however, main, self, table, also, lots, nans, similar, operation, concatenate, tables, astroquery, mast, observations, ...]"
...,...,...
1816,spam,"[want, explicit, sex, secs, ring, costs, min, gsex, pobox]"
1817,spam,"[asked, chatlines, inclu, free, mins, india, cust, servs, sed, yes, got, mega, bill, dont, giv, shit, bailiff, due, days, å, want, å]"
1818,spam,"[contract, mobile, mnths, latest, motorola, nokia, etc, free, double, mins, amp, text, orange, tariffs, text, yes, callback, remove, records]"
1819,spam,"[reminder, get, pounds, free, call, credit, details, great, offers, pls, reply, text, valid, name, house, postcode]"


### Lemmatization

In [154]:
lemmatizer = WordNetLemmatizer()

# define function to lemmatize tokens
def lemmatize_tokens(tokens):
    # map the first letter of the Penn Treebank POS tag to a WordNet POS constant
    def get_wordnet_pos(word):
        # each token is tagged on its own, so the tagger sees no sentence context
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        # fall back to noun for any other tag
        return tag_dict.get(tag, wordnet.NOUN)

    # lemmatize each token using its POS
    lemmas = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]

    # return lemmatized tokens as a list
    return lemmas

# apply lemmatization function to column of dataframe
df['lemmatized_text'] = df['text'].apply(lemmatize_tokens)
df

index,label,text,lemmatized_text
0,not-spam,"[missing, migrations, libavif, pinned, conda, forge, pinnings, comment, debugging, strange, resolutions, boiled, c, users, straversaro, conda, create, n, test, pygame, ffmpeg, libavif, channels, robostack, staging, conda, forge, platform, win, collecting, package, metadata, repodata, json, done, solving, environment, failed, libmambaunsatisfiableerror, encountered, problems, solving, package, pygame, requires, none, providers, installed, could, solve, environment, specs, following, packages, incompatible, ffmpeg, installable, potential, options, ffmpeg, would, require, aom, installed, svt, installed, ffmpeg, would, require, aom, conflicts, installable, versions, previously, reported, ffmpeg, would, require, aom, conflicts, installable, versions, previously, reported, ffmpeg, would, require, aom, conflicts, installable, versions, previously, reported, pygame, installable, requires, viable, options, would, require, ...]","[miss, migration, libavif, pin, conda, forge, pinning, comment, debug, strange, resolution, boil, c, user, straversaro, conda, create, n, test, pygame, ffmpeg, libavif, channel, robostack, stag, conda, forge, platform, win, collect, package, metadata, repodata, json, do, solve, environment, fail, libmambaunsatisfiableerror, encounter, problem, solve, package, pygame, require, none, provider, instal, could, solve, environment, spec, follow, package, incompatible, ffmpeg, installable, potential, option, ffmpeg, would, require, aom, instal, svt, instal, ffmpeg, would, require, aom, conflict, installable, version, previously, report, ffmpeg, would, require, aom, conflict, installable, version, previously, report, ffmpeg, would, require, aom, conflict, installable, version, previously, report, pygame, installable, require, viable, option, would, require, ...]"
1,not-spam,"[ui, feedback, uag, search, button, need, click, search, button, click, area, write, type, enter, keyboard, enough, understand, highlight, done, several, lenses, applied, difficult, understand, available, local, language, lack, consistency, appearance, font, capital, letters, functionality, basic]","[ui, feedback, uag, search, button, need, click, search, button, click, area, write, type, enter, keyboard, enough, understand, highlight, do, several, lens, apply, difficult, understand, available, local, language, lack, consistency, appearance, font, capital, letter, functionality, basic]"
2,not-spam,"[issues, resolved, welcome, repository]","[issue, resolve, welcome, repository]"
3,not-spam,"[create, account, rules, broken, error, prompt]","[create, account, rule, broken, error, prompt]"
4,not-spam,"[nan, search, results, table, general, question, something, make, less, brittle, passing, nan, search, result, table, functions, operations, fine, seem, sensible, missing, data, values, want, easily, maintainable, robust, reasonably, process, working, discovered, nans, added, pandas, dataframe, search, result, table, broke, things, unexpected, ways, particularly, nan, datauri, got, passed, astroquery, mast, observations, get_cloud_uris, nans, get, introduced, query, columns, empty, use, pd, concat, join, tables, together, astroquery, mast, observations, outer, join, preserve, much, information, columns, possible, recent, example, example, issue, mainline, lightkurve, concatenate, dataframe, tesscut, information, main, self, table, however, main, self, table, also, lots, nans, similar, operation, concatenate, tables, astroquery, mast, observations, ...]","[nan, search, result, table, general, question, something, make, less, brittle, passing, nan, search, result, table, function, operation, fine, seem, sensible, miss, data, value, want, easily, maintainable, robust, reasonably, process, work, discover, nan, add, panda, dataframe, search, result, table, broke, thing, unexpected, way, particularly, nan, datauri, get, pass, astroquery, mast, observation, get_cloud_uris, nan, get, introduce, query, column, empty, use, pd, concat, join, table, together, astroquery, mast, observation, outer, join, preserve, much, information, column, possible, recent, example, example, issue, mainline, lightkurve, concatenate, dataframe, tesscut, information, main, self, table, however, main, self, table, also, lot, nan, similar, operation, concatenate, table, astroquery, mast, observation, ...]"
...,...,...,...
1816,spam,"[want, explicit, sex, secs, ring, costs, min, gsex, pobox]","[want, explicit, sex, sec, ring, cost, min, gsex, pobox]"
1817,spam,"[asked, chatlines, inclu, free, mins, india, cust, servs, sed, yes, got, mega, bill, dont, giv, shit, bailiff, due, days, å, want, å]","[ask, chatlines, inclu, free, min, india, cust, servs, sed, yes, get, mega, bill, dont, giv, shit, bailiff, due, day, å, want, å]"
1818,spam,"[contract, mobile, mnths, latest, motorola, nokia, etc, free, double, mins, amp, text, orange, tariffs, text, yes, callback, remove, records]","[contract, mobile, mnths, late, motorola, nokia, etc, free, double, min, amp, text, orange, tariff, text, yes, callback, remove, record]"
1819,spam,"[reminder, get, pounds, free, call, credit, details, great, offers, pls, reply, text, valid, name, house, postcode]","[reminder, get, pound, free, call, credit, detail, great, offer, pls, reply, text, valid, name, house, postcode]"


## Statistics
We can learn a lot about our text data through standard statistical metrics. For reference, read [this article](https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools). Compute each statistic separately for spam and not-spam issues:
- Type-token ratio (number of unique tokens divided by total number of tokens)
- etc
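
As one possible way to compute the type-token ratio per class, the sketch below pools all tokens that share a label and divides the number of distinct tokens by the total. The helper names (`type_token_ratio`, `ttr_by_label`) and the toy rows are assumptions made for illustration; the toy rows stand in for the `label` and `lemmatized_text` columns of `df`.

```python
from collections import defaultdict

def type_token_ratio(tokens):
    """Return unique tokens / total tokens, or 0.0 for an empty list."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ttr_by_label(rows):
    """Pool the tokens of all rows per label, then compute one ratio per label."""
    pooled = defaultdict(list)
    for label, tokens in rows:
        pooled[label].extend(tokens)
    return {label: type_token_ratio(tokens) for label, tokens in pooled.items()}

# Hypothetical toy rows (label, token list); not taken from the dataset.
toy_rows = [
    ("spam", ["free", "call", "free", "text"]),
    ("spam", ["free", "min"]),
    ("not-spam", ["create", "account", "rule", "broken"]),
]

ratios = ttr_by_label(toy_rows)
print(ratios)
```

On the notebook's dataframe, something like `df.groupby('label')['lemmatized_text'].apply(lambda s: type_token_ratio([t for toks in s for t in toks]))` should give the same per-class figures. Keep in mind that the type-token ratio tends to fall as the amount of text grows, so pooled ratios for classes of very different sizes are not directly comparable; averaging per-issue ratios is an alternative worth considering.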

## Visualizations
Visualize each of the following separately for spam and not-spam issues. [Reference](https://medium.com/@melody.zapotoczny/a-quick-easy-guide-to-text-analysis-seaborn-4c1a20addba3)
- Most frequent unigrams (words), bigrams (pairs of words), and trigrams (triples of words)
- Average issue content length (issue title + issue body)
- Histogram of issue content length
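
The numbers behind these plots can be prepared before any plotting library is involved. The following is a rough sketch using only the standard library; it assumes that "content length" means the number of tokens left after preprocessing (character counts would work just as well), and the function names and toy documents are hypothetical.

```python
from collections import Counter
from statistics import mean

def ngrams(tokens, n):
    """Return consecutive n-token tuples from a single document."""
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(docs, n, k=10):
    """Count n-grams inside each document (never across document boundaries)."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

def length_histogram(docs, bin_width=5):
    """Map the lower edge of each bin to the number of documents falling in it."""
    return Counter((len(tokens) // bin_width) * bin_width for tokens in docs)

# Hypothetical toy documents standing in for the spam rows of df['lemmatized_text'].
toy_spam = [
    ["free", "call", "credit", "free", "call"],
    ["free", "call", "offer"],
]

unigram_counts = dict(top_ngrams(toy_spam, 1))
top_bigrams = top_ngrams(toy_spam, 2, k=1)
avg_length = mean(len(tokens) for tokens in toy_spam)
hist = length_histogram(toy_spam)

print(top_bigrams, avg_length, dict(hist))
```

In the notebook, the same functions could be fed `df.loc[df['label'] == 'spam', 'lemmatized_text']` and its not-spam counterpart. For the actual charts, a bar plot of the `top_ngrams` output and seaborn's `histplot` on `df['lemmatized_text'].str.len()` are reasonable starting points.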