#### **Natural Language Processing**

**1. What is a Corpus?**

A **corpus** is a collection of authentic text or audio gathered for a particular research project. A corpus is made up of **documents** (a document can be as short as a single sentence or tweet), and its **vocabulary** is the set of all unique words found across those documents. By analogy, the corpus is like a **paragraph** and its documents are like the **sentences** in it.
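
To make the terminology concrete, here is a minimal sketch that builds a vocabulary from a toy corpus. The three example documents and the whitespace-based `tokenize` helper are assumptions made for illustration; they are not part of the SemEval data or of the tokenizer used later in this notebook.

```python
# Toy corpus: each string is one document (hypothetical examples)
corpus = [
    "the food was great",
    "the service was slow",
    "great food and great service",
]

def tokenize(document):
    """Naive whitespace tokenizer, used only for illustration."""
    return document.lower().split()

# The vocabulary is the set of unique tokens across all documents
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

print(len(corpus), "documents")
print(len(vocabulary), "unique words:", vocabulary)
```

Note that "great" appears three times in the corpus but only once in the vocabulary.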

A corpus can be built from anything from newspapers, novels, recipes, and radio broadcasts to television shows, movies, and tweets. In natural language processing, a corpus is transformed into a **dataset** of text or speech data that can be used to train AI and machine learning systems. For supervised learning, each example in the dataset also carries a label.

Features of a good corpus:

* large - large quantities of specialized data are vital for training algorithms, such as those designed to perform sentiment analysis.
* high-quality - because of the large volume of data required for a corpus, even minuscule errors in the training data can lead to large-scale errors in the machine learning system’s output.
* clean - free of errors and duplicate data, which makes the corpus more reliable for NLP.
* balanced - if the data collection process is not streamlined and structured, the dataset can end up unbalanced and less relevant to the task.
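
Two of these properties, freedom from duplicates and class balance, are easy to check programmatically. The sketch below uses only the standard library; the `records` list is made-up data standing in for a labeled corpus.

```python
from collections import Counter

# Hypothetical (label, text) pairs standing in for a labeled corpus
records = [
    ("positive", "love this phone"),
    ("negative", "battery died again"),
    ("positive", "love this phone"),   # exact duplicate
    ("neutral", "new phone arrives monday"),
    ("positive", "great camera"),
]

# dict.fromkeys keeps the first occurrence of each record and preserves order
unique_records = list(dict.fromkeys(records))

# Share of each label among the deduplicated records
label_counts = Counter(label for label, _ in unique_records)
total = len(unique_records)
label_share = {label: count / total for label, count in label_counts.items()}

print("removed", len(records) - total, "duplicate(s)")
print(label_share)
```

A strongly skewed `label_share` is a hint that the corpus may need rebalancing or that evaluation metrics should account for the imbalance.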

Challenges of creating a corpus:
* deciding the type of data needed to solve the problem statement
* availability of data
* quality of the data
* adequacy of data in terms of amount

**An example of a corpus**

This corpus is a collection of selected English tweets from 2013 to 2016, compiled as part of the **SemEval-2017** Task 4 competition: *Sentiment Analysis on Twitter.*

The tweets are already labeled by sentiment (**negative**, **neutral**, or **positive**), which makes the collection a dataset suitable for supervised learning.

In [1]:
import os, re, csv
import numpy as np
import pandas as pd

In [2]:
data_dir = '../datasets/2017_English_final/Subtask_A/'
train_files = [
    'twitter-2013train-A.txt',
    'twitter-2013dev-A.txt',
    'twitter-2013test-A.txt',
    'twitter-2014sarcasm-A.txt', 
    'twitter-2014test-A.txt',
    'twitter-2015train-A.txt',
    'twitter-2015test-A.txt',
    'twitter-2016train-A.txt',
    'twitter-2016dev-A.txt',
    'twitter-2016devtest-A.txt',
    'twitter-2016test-A.txt',
]

In [3]:
def load_dataframe(file_path):
    return pd.read_csv(
        file_path, 
        sep='\t',
        quoting=csv.QUOTE_NONE,
        usecols=[0,1,2],
        names=['id', 'label', 'message'],
        index_col=0,
        dtype={'label': 'category'})

train_dfs = []
for f in train_files:
    train_dfs.append(load_dataframe(os.path.join(data_dir, f)))
tweets_train = pd.concat(train_dfs)
# Dropping duplicates, as mentioned in its README there are 665 duplicate annotations across and within the files of Subtask_A
tweets_train.drop_duplicates(inplace=True)
# Dropping null records, either without label, or without message
tweets_train.dropna(inplace=True)
# Randomizing the arrangement of the records
tweets_train = tweets_train.sample(frac=1.0, random_state=42)


# Clean and prepare messages:
def preprocess_messages(messages):
    
    messages = messages.str.decode('unicode_escape', errors='ignore')
    messages = messages.str.strip('"')  # remove left-most and right-most quotation mark
    messages = messages.str.replace('""', '"', regex=False)  # collapse doubled quotation marks ("") into a single one (")
    
    return messages

tweets_train['message'] = preprocess_messages(tweets_train['message'])

print('Total number of examples for training: {}\nDistribution of classes:\n{}'.format(
    len(tweets_train),
    tweets_train['label'].value_counts() / len(tweets_train),
))

tweets_train.head()

Total number of examples for training: 49675
Distribution of classes:
neutral     0.448032
positive    0.395994
negative    0.155974
Name: label, dtype: float64


| id | label | message |
|---|---|---|
| 640329403277438976 | neutral | [ARIRANG] SIMPLY KPOP - Kim Hyung Jun - Cross ... |
| 640810454730833920 | neutral | @TyTomlinson just read a politico article abou... |
| 111344128507392000 | neutral | I just typed in "the Bazura Project" into goog... |
| 641414049083691009 | neutral | Fast Lerner: Subpoenaed tech guy who worked on... |
| 637666734300905472 | negative | Sony rewards app is like a lot of 19 y.o femal... |


In [4]:
tweets_train.message.iloc[4]

'Sony rewards app is like a lot of 19 y.o female singers and a non retro sale. 2nd one with no info'

In [5]:
# Mapping the labels 'negative', 'neutral' and 'positive' into 0, 1, 2
tweets_train_y = tweets_train['label'].cat.codes
labels = tweets_train.label.cat.categories.tolist()
labels

['negative', 'neutral', 'positive']

In [6]:
# Build a label -> integer code lookup in the same order as the categories
labels_codes = {label: i for i, label in enumerate(labels)}

labels_codes

{'negative': 0, 'neutral': 1, 'positive': 2}
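
When model predictions come back as integer codes, the inverse lookup is handy. The following is a small sketch; `codes_labels` and the `predicted_codes` list are hypothetical names and values introduced here for illustration.

```python
# Same mapping as produced above
labels_codes = {'negative': 0, 'neutral': 1, 'positive': 2}

# Invert the mapping to go from integer codes back to label names
codes_labels = {code: label for label, code in labels_codes.items()}

predicted_codes = [1, 1, 0, 2]  # hypothetical model output
predicted_labels = [codes_labels[code] for code in predicted_codes]
print(predicted_labels)
```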

However, we cannot simply give these sentences to a machine learning model and ask it to tell us whether a tweet is positive, negative, or neutral. The text must first go through preprocessing steps that turn it into numeric vectors.

**Text Processing** - from **Text** to **Vectors**
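
As a rough picture of where this section is heading, the sketch below turns two already-tokenized messages into count vectors (a bag-of-words representation). This is only one possible vectorization and is an assumption made for illustration; the notebook does not state which vectorizer is used downstream. The example tokens and the `to_count_vector` helper are hypothetical.

```python
from collections import Counter

# Hypothetical messages, already lowercased and tokenized
tokenized = [
    ["sony", "reward", "app", "no", "info"],
    ["watch", "game", "tonight", "no", "no", "info"],
]

# Fixed, sorted vocabulary so every vector has the same layout
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each message becomes a vector whose length equals the vocabulary size, which is the kind of numeric input a classifier can consume.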

**I. Tokenization**

In [7]:
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class TweetTokenizer(BaseEstimator, TransformerMixin):
    """
    Scikit-learn compatible transformer that uses spaCy for tokenization and lemmatization.
    Inherits from BaseEstimator and TransformerMixin; TransformerMixin supplies fit_transform
    on top of the fit and transform methods defined here.
    """

    def __init__(self):
        # initializing spacy pipeline
        self.nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser', 'textcat'])
        # Copy the default stop words so the shared spaCy defaults are not mutated
        self.stops = set(self.nlp.Defaults.stop_words)
        # Removing negation words from the copied stop word set
        # This is so we can keep the negative sentiment of a tweet brought by these words
        negation_words = ['not','cannot','no', 'never', 'nothing','none','without','nor','neither','nobody','nowhere']
        
        # difference_update ignores missing words, so creating the tokenizer twice does not raise KeyError
        self.stops.difference_update(negation_words)
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, messages):
    
        # Replace all whitespace characters by only one space
        messages = messages.str.replace(r'\s+', ' ',regex=True)
        messages = messages.str.strip()
        messages = messages.str.lower()

        # keep the lemma of each token that is purely alphabetic and whose lemma is not a stop word
        return messages.apply(lambda msg: " ".join([token.lemma_ for token in self.nlp(msg) if token.lemma_.lower() not in self.stops and token.is_alpha]))

# let's see some examples:
tweets_train_tokenized = TweetTokenizer().fit_transform(tweets_train['message'])
tweets_train_tokenized[:10]

id
640329403277438976    arirang simply kpop kim hyung jun cross ha yeo...
640810454730833920    read politico article donald trump running mat...
111344128507392000    type bazura project google image image photo d...
641414049083691009    fast lerner subpoena tech guy work hillary pri...
637666734300905472    sony reward app like lot female singer non ret...
264185448358875136    watch brooklyn nets new york knick tonight pos...
636407569108586496                   guy open gate naruto save ass goat
633549773337964545    triple h never ric flair bitch sunday no press...
622833484571254784    joint leader amateur paul dunne win open champ...
522787955216482304             glenn beck owner box redskin game sunday
Name: message, dtype: object
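
To see why the negation words are taken out of the stop word set, consider the simplified sketch below. It uses a tiny made-up stop word list and plain whitespace splitting instead of spaCy, so it only illustrates the idea behind `TweetTokenizer` and is not its actual behavior.

```python
# Tiny hypothetical stop word list; spaCy's real list is much larger
stop_words = {"the", "is", "was", "a", "not", "no", "never"}
negation_words = {"not", "no", "never"}

# Stop words with the negations taken out, mirroring the idea in TweetTokenizer
stops_keep_negation = stop_words - negation_words

def remove_stops(text, stops):
    """Drop stop words from a lowercased, whitespace-tokenized text."""
    return " ".join(tok for tok in text.lower().split() if tok not in stops)

tweet = "the movie was not good"
print(remove_stops(tweet, stop_words))           # negation is lost
print(remove_stops(tweet, stops_keep_negation))  # negation is kept
```

With the full stop word list the tweet collapses to "movie good", which reads as positive; keeping "not" preserves the original negative sentiment.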

In [8]:
os.makedirs("csvs", exist_ok=True)  # make sure the output directory exists
tweets_train_tokenized.to_csv("csvs/tweets_train_tokens.csv", index=False)

In [9]:
tweets_train_y

id
640329403277438976    1
640810454730833920    1
111344128507392000    1
641414049083691009    1
637666734300905472    0
                     ..
264260341070954497    0
641411364641206277    1
636722845599469568    1
264084248057765888    1
276099025340612608    1
Length: 49675, dtype: int8

In [10]:
tweets_train_y.to_csv("csvs/tweets_train_y.csv", index=False)

#### **End. Thank you!**