<a href="https://colab.research.google.com/github/pranav-kural/deep-learning-textual-analysis/blob/main/Deep_Learning_Textual_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning Applied to Textual Data

In this notebook, we will perform an empirical classification study using real textual data.

This is part of **Assignment 4** of the course CSI4106 at University of Ottawa.

Author: Pranav Kural

Group: 28 | Student number: 300241227

## Data

Let's begin by having a closer look at our data.

Dataset selected: **Airline Passenger Reviews**

Reason for selection:

- Enough training data is available and the class imbalance is not too severe.

- Did not choose the **4000 CNN articles dataset** because it has too many classes, and because overlapping topics make articles hard to classify. For example, an article about the US and politics has to be assigned to only one category, even though it may relate to both equally. Also, less training data is available than in the other datasets.

- Did not choose the **UCI Drug Review dataset** because of the substantial class imbalance in the dataset. One dominant condition, "birth control", accounts for more reviews than all other conditions combined (based on the original dataset).

## 1. Generate two additional datasets

We will generate two additional datasets from our selected original dataset, so that we have three datasets to work with at the end:

- Original Dataset
- Derived-Dataset-1: contains a subset of POS (part-of-speech) tags
- Derived-Dataset-2: contains a subset of the named entities found in the text, plus some important POS tags

For example:

- Original Dataset
- Derived-Dataset-1 with only the verbs and adjectives lemmatized
- Derived-Dataset-2 with 3 types of named entities (organizations, money and dates) and with adjectives lemmatized
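
The example above can be sketched as follows. This is a minimal illustration that assumes each review has already been annotated by a POS tagger and a named-entity recognizer (the notebook does not yet name a library for this step). The `Token` tuple, the helper names `derive_pos_text` and `derive_entity_text`, and the hand-written annotations are all hypothetical.

```python
from typing import Iterable, NamedTuple


class Token(NamedTuple):
    """One annotated token; the fields mimic what a typical NLP tagger provides."""
    text: str      # surface form as written in the review
    lemma: str     # dictionary form of the word
    pos: str       # coarse POS tag, e.g. "VERB", "ADJ", "NOUN"
    ent_type: str  # named-entity label, "" when the token is not part of an entity


def derive_pos_text(tokens: Iterable[Token],
                    keep_pos=frozenset({"VERB", "ADJ"})) -> str:
    """Derived-Dataset-1 style: keep only the lemmas of tokens with selected POS tags."""
    return " ".join(t.lemma.lower() for t in tokens if t.pos in keep_pos)


def derive_entity_text(tokens: Iterable[Token],
                       keep_ents=frozenset({"ORG", "MONEY", "DATE"}),
                       keep_pos=frozenset({"ADJ"})) -> str:
    """Derived-Dataset-2 style: keep entity tokens as written, plus lemmas of selected POS."""
    kept = []
    for t in tokens:
        if t.ent_type in keep_ents:
            kept.append(t.text)
        elif t.pos in keep_pos:
            kept.append(t.lemma.lower())
    return " ".join(kept)


# Hand-annotated toy review (illustrative annotations, not real tagger output).
review = [
    Token("Delta", "Delta", "PROPN", "ORG"),
    Token("charged", "charge", "VERB", ""),
    Token("$", "$", "SYM", "MONEY"),
    Token("50", "50", "NUM", "MONEY"),
    Token("for", "for", "ADP", ""),
    Token("terrible", "terrible", "ADJ", ""),
    Token("seats", "seat", "NOUN", ""),
]

derived_1 = derive_pos_text(review)     # verbs and adjectives, lemmatized
derived_2 = derive_entity_text(review)  # selected entities plus lemmatized adjectives
```

Keeping the filtering logic separate from the tagger makes it easy to change the selected POS tags or entity types without re-running the annotation step.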

In our classification experiments, we will test with these 3 datasets and compare results.

### Original Dataset

We will be using a reduced version of the original dataset for this study. The dataset being used is openly available at: https://github.com/baharin/CSI4106-Assignment4-Datasets/blob/main/reduced_file_AirPassengerReviews.csv
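
A minimal sketch of loading the CSV with pandas. The link above points to a GitHub page rather than the file itself, so the helper below rewrites it into a raw-file URL; that rewrite, the helper names, and the column names in the offline stand-in are assumptions, not something stated in the assignment.

```python
import io

import pandas as pd

BLOB_URL = (
    "https://github.com/baharin/CSI4106-Assignment4-Datasets/"
    "blob/main/reduced_file_AirPassengerReviews.csv"
)


def to_raw_url(blob_url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching raw-file URL."""
    raw = blob_url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
    return raw.replace("/blob/", "/", 1)


def load_reviews(source) -> pd.DataFrame:
    """Read the CSV from a URL, path, or file-like object and drop fully empty rows."""
    return pd.read_csv(source).dropna(how="all")


# In the notebook (requires network access):
#   original_df = load_reviews(to_raw_url(BLOB_URL))

# Offline demonstration with a stand-in CSV; these column names are hypothetical.
sample = io.StringIO("review_text,label\nGreat crew,positive\nLost my bag,negative\n")
sample_df = load_reviews(sample)
```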



In [None]:
# original dataset

### Derived Dataset 1

To generate our first derived dataset, we will focus on the following POS:

**Reasoning**



In [1]:
# generate derived-ds-1


### Derived Dataset 2

To generate our second derived dataset, we will focus on the following named entities and POS:

**Reasoning**



In [None]:
# generate derived-ds-2




---



## 2. Classification Empirical Study


### 1. Encode text as input features with associated values

Using `scikit-learn`, we encode the text data as features which can then be used in our deep learning model. The text becomes a [bag-of-words](#) where each word becomes an independent feature.

We will also remove [stopwords](#) to reduce the vocabulary size and use TF-IDF scores as the feature values.

**Why use TF-IDF?**

TF-IDF weighs how often a word occurs in a document against how many documents in the corpus contain that word. This identifies the words that are important and meaningful to a particular document better than raw frequency in the corpus alone. For example, in a corpus of six documents, a word that occurs often in only two of them is likely more important to those two documents than a word that occurs frequently in all six. <sup>[3](#)</sup>

<img src="https://drive.google.com/uc?id=1qiq4wWu4078sHKhDdeW6-5YgM-qwnbmy" alt="TF-IDF" width=400 />
<p id="img-source">source: <a href="https://youtu.be/fIYSi41f1yg?si=-oFATPFxyop7RnmE">NLP Demystified 6: TF-IDF and Simple Document Search</a></p>
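
The six-document example can be reproduced with scikit-learn's `TfidfVectorizer`. This is a small sketch on made-up sentences (not the airline reviews): "flight" appears in all six documents, while "delayed" appears in only two, so "delayed" should receive the higher inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "flight" is in every document, "delayed" in only two of them.
docs = [
    "the flight was delayed for hours",
    "the flight was delayed again",
    "the flight had friendly crew",
    "the flight had good food",
    "the flight landed early",
    "the flight was smooth",
]

# stop_words="english" drops common function words such as "the" and "was".
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_      # maps each word to its column index
idf = vectorizer.idf_               # inverse document frequency per column

idf_flight = idf[vocab["flight"]]
idf_delayed = idf[vocab["delayed"]]

# TF-IDF weights of both words inside the first document.
weight_flight = X[0, vocab["flight"]]
weight_delayed = X[0, vocab["delayed"]]
```

Both words occur once in the first document, so the difference between `weight_delayed` and `weight_flight` comes entirely from the IDF term.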

In [None]:
# remove stopwords

# generate bag-of-words using tf-idf


### 2. Two models using default parameters

We will define two models using default parameters: one classical machine learning model ([Logistic Regression Model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)) and one deep learning model ([MLP classifier model](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)). Both are supervised learners.
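
A minimal sketch of instantiating both models with their defaults. The variable names `lg` and `mlp_1` follow the cells further down; the `defaults_of_interest` dictionary is only an illustrative way to record the MLP defaults that are varied later.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Both estimators are created without arguments, i.e. with default parameters.
lg = LogisticRegression()
mlp_1 = MLPClassifier()

# Record the MLP defaults that later experiments modify.
defaults_of_interest = {
    name: mlp_1.get_params()[name]
    for name in ("hidden_layer_sizes", "activation", "learning_rate_init")
}
```

With the defaults, `MLPClassifier` uses a single hidden layer of 100 units, ReLU activation, and an initial learning rate of 0.001.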

In [None]:
# Logistic Regression model

# MLP classifier model


### 3. Train Test Evaluate

Now, we will train, test and evaluate the two previously defined models (using default parameters) on the three datasets (original dataset, derived-dataset-1 and derived-dataset-2).

In [None]:
# define models with default parameters
lg = None

mlp_1 = None

#### 1. Train

We will be using [4-fold cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) for each of the models.
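
One possible training pipeline, sketched under assumptions: the helper name `cross_validated_predictions`, the shuffling seed, and the toy reviews and labels are hypothetical. The vectorizer sits inside the pipeline so that the TF-IDF vocabulary is re-fitted on the training part of each fold only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline


def cross_validated_predictions(model, texts, labels, n_splits=4, seed=0):
    """Return one out-of-fold prediction per document using k-fold cross-validation."""
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), model)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_predict(pipeline, texts, labels, cv=folds)


# Toy stand-in data (hypothetical reviews and labels).
texts = [
    "friendly crew and comfortable seats",
    "great food and smooth flight",
    "quick boarding and helpful staff",
    "clean cabin and pleasant service",
    "excellent legroom and kind attendants",
    "on time departure and tasty meals",
    "lost luggage and rude staff",
    "long delay and cold food",
    "dirty cabin and broken seat",
    "cancelled flight and no refund",
    "cramped seats and noisy cabin",
    "late arrival and missed connection",
]
labels = ["positive"] * 6 + ["negative"] * 6

predictions = cross_validated_predictions(LogisticRegression(), texts, labels)
```

Every document is predicted exactly once, by a model that never saw it during training, so the predictions can be passed directly to the precision and recall metrics.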

In [None]:
# define cross validation

# training pipeline

# train each of the models


#### 2. Evaluate

We will now evaluate the precision/recall measures of each of the models. Since we are working with a multi-class classification problem, we will compare both micro and macro averages.
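
To make the difference between the two averages concrete, here is a small sketch on made-up, imbalanced labels (the class names and counts are hypothetical). The classifier always predicts the majority class, which micro averaging rewards and macro averaging penalizes.

```python
from sklearn.metrics import precision_score, recall_score

# Imbalanced toy labels: 7 "good", 2 "neutral", 1 "bad".
y_true = ["good"] * 7 + ["neutral"] * 2 + ["bad"] * 1
# A degenerate classifier that always predicts the majority class.
y_pred = ["good"] * 10


def precision_recall(y_true, y_pred, average):
    """Return (precision, recall) for the given averaging mode."""
    # zero_division=0 scores classes that were never predicted as 0 instead of warning.
    precision = precision_score(y_true, y_pred, average=average, zero_division=0)
    recall = recall_score(y_true, y_pred, average=average, zero_division=0)
    return precision, recall


# Micro: pool all decisions, so frequent classes dominate the score.
micro_p, micro_r = precision_recall(y_true, y_pred, "micro")
# Macro: average the per-class scores, so every class counts equally.
macro_p, macro_r = precision_recall(y_true, y_pred, "macro")
```

Here the micro averages are both 0.7, while macro precision drops to about 0.23 and macro recall to about 0.33, because the two minority classes each contribute a score of 0.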

In [None]:
# function to generate visual graphs for comparison of precision and recall


In [None]:
# compare micro precision and recall of our models on the three datasets

# original dataset

# derived-dataset-1

# derived-dataset-2


In [None]:
# compare macro precision and recall of our models on the three datasets

# original dataset

# derived-dataset-1

# derived-dataset-2


In [None]:
# compare micro and macro precision of our models on the three datasets

# original dataset

# derived-dataset-1

# derived-dataset-2


In [None]:
# compare micro and macro recall of our models on the three datasets

# original dataset

# derived-dataset-1

# derived-dataset-2


**Observations:**


**How does the class imbalance impact the micro/macro results?**



## 4. Improving our MLP classifier model

Now that we have examined the performance of our models with default parameters, we will modify some parameters of our MLP classifier model, then train, test and evaluate it again to see how the changes affect its performance.

We will do this **two times**.
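
A sketch of how a modified MLP could be defined and compared against the default one. The parameter values below (`tanh`, two hidden layers of 50 units, learning rate 0.01) are placeholders chosen purely for illustration, not the values selected in this study, and `changed_params` is a hypothetical helper.

```python
from sklearn.neural_network import MLPClassifier

mlp_1 = MLPClassifier()  # default parameters

# Placeholder values for illustration only.
mlp_2 = MLPClassifier(
    activation="tanh",
    hidden_layer_sizes=(50, 50),
    learning_rate_init=0.01,
)


def changed_params(baseline, variant):
    """Map each parameter that differs to its (baseline, variant) values."""
    base, var = baseline.get_params(), variant.get_params()
    return {name: (base[name], var[name]) for name in base if base[name] != var[name]}


diff = changed_params(mlp_1, mlp_2)
```

Listing the differences this way helps keep the reasoning for each attempt tied to exactly the parameters that were changed.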

### 1. Attempt 1 - MLP model #2

Parameters being changed:

- Activation function:
- Hidden layer size:
- Learning rate:

**Reasoning:**





In [None]:
# parameters

# instantiate the new MLP model
mlp_2 = None

# train the new model


### 2. Attempt 2 - MLP model #3

Parameters being changed:

- Activation function:
- Hidden layer size:
- Learning rate:

**Reasoning:**




In [None]:
# parameters

# instantiate the new MLP model
mlp_3 = None

# train the new model


## 5. Comparative Analysis

We will now quantitatively and visually compare the precision and recall measures of our 12 results. The 12 results come from 4 models (Logistic Regression + 3 variations of MLP), each applied to 3 datasets (Original + Derived1 + Derived2).
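
One way to collect all results into a single table is to loop over every (model, dataset) pair. The sketch below is an assumption about how this could be organized: `evaluate_grid` is a hypothetical helper, the toy texts are made up, and a `DummyClassifier` stands in for the MLP variants to keep the demonstration fast. With the 4 models and 3 datasets of this study, the same loop would yield 12 rows.

```python
from sklearn.base import clone
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline


def evaluate_grid(models, datasets, n_splits=4, seed=0):
    """Score every (model, dataset) pair with out-of-fold micro/macro precision and recall."""
    rows = []
    for data_name, (texts, labels) in datasets.items():
        for model_name, model in models.items():
            # clone() gives each run a fresh, unfitted copy of the model.
            pipeline = make_pipeline(TfidfVectorizer(), clone(model))
            folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
            predicted = cross_val_predict(pipeline, texts, labels, cv=folds)
            row = {"model": model_name, "dataset": data_name}
            for avg in ("micro", "macro"):
                row[f"precision_{avg}"] = precision_score(
                    labels, predicted, average=avg, zero_division=0)
                row[f"recall_{avg}"] = recall_score(
                    labels, predicted, average=avg, zero_division=0)
            rows.append(row)
    return rows


# Toy stand-ins: two tiny "datasets" and two fast models (all hypothetical).
labels = ["positive"] * 4 + ["negative"] * 4
datasets = {
    "original": (
        ["friendly crew today", "great food served", "smooth flight overall",
         "helpful staff onboard", "rude staff again", "cold food served",
         "long delay today", "lost luggage again"],
        labels,
    ),
    "derived": (
        ["friendly", "great", "smooth", "helpful",
         "rude", "cold", "long", "lost"],
        labels,
    ),
}
models = {
    "LG": LogisticRegression(),
    "stand-in": DummyClassifier(strategy="most_frequent"),
}

rows = evaluate_grid(models, datasets)
```

Each row is a plain dictionary, so the list can be passed to `pandas.DataFrame(rows)` for tabular comparison or plotting.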


### Comparison 1

- LG applied to Original Dataset
- MLP (with default parameters) applied to Original Dataset

In [None]:
# compare micro precision and recall of our models

# compare macro precision and recall of our models


#### Observations



### Comparison 2

- MLP (with default parameters) applied to Original Dataset
- MLP (with modified parameters) applied to Original Dataset

In [None]:
# compare micro precision and recall of our models

# compare macro precision and recall of our models


#### Observations



### Comparison 3

- MLP (with default parameters) applied to Derived Dataset 1
- MLP (with modified parameters) applied to Derived Dataset 1

In [None]:
# compare micro precision and recall of our models

# compare macro precision and recall of our models


#### Observations



### Comparison 4

- MLP (with default parameters) applied to Derived Dataset 2
- MLP (with modified parameters) applied to Derived Dataset 2

In [None]:
# compare micro precision and recall of our models

# compare macro precision and recall of our models


#### Observations



## 6. Conclusion



## 7. References

