 # 67-212 Homework 4
## Summarization of Customer Reviews

In this homework, you will work with customer reviews of products sold on online marketplaces. Managing and mining these reviews can be tricky and cumbersome. Suppose you are an NLP expert at an online marketplace, such as an apparel company, that accepts customer reviews. There are too many reviews to read, and some are very long, so going through all of them by hand is impractical. Your task is to build an automatic title (summary) generator for customer reviews. Below is a sample of reviews and their corresponding titles.

<center><img src='example2.png' width='300' height='150'></center>


In this notebook, you will build a deep neural network that functions as part of an end-to-end text summarization pipeline, using the Python version of the [OpenNMT toolkit](https://opennmt.net/). Your completed system will accept a product review as input and output a title summarizing it.

This homework is organized into 4 main parts:

1. **Preprocess** - Clean the text and split it into training, development, and testing sets.
2. **Modeling** - Use OpenNMT to create a model that accepts a review (a sequence of words) as input and returns a title summarizing it.
3. **Prediction** - Run the model on the test set.
4. **Evaluation** - Evaluate the quality of the system using the ROUGE metric.

In [41]:
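import os

# A helper method for loading data from files
def load_data(path):
    """
    Load dataset: return the lines of the file at `path` as a list of strings
    """
    input_file = os.path.join(path)
    # The context manager closes the file once all lines have been read
    with open(input_file, "r") as f:
        return f.readlines()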

## Dataset
We begin by investigating the dataset that will be used to train and evaluate your pipeline. We'll be using an e-commerce clothing reviews dataset with over 23,000 customer reviews. Apart from the `reviews`, the other key features are the `titles` and the `ratings` that customers assigned to each review.

### Load Data
The data is located in `data/women_clothing_ecommerce_reviews.txt` and `data/women_clothing_ecommerce_review_title.txt`. The first file contains the full reviews; the second contains the corresponding title for each review. Run the cell below to load the reviews and their titles from these files.

In [42]:
# Load Reviews data
reviews = load_data('data/women_clothing_ecommerce_reviews.txt')
# Load Titles data
titles = load_data('data/women_clothing_ecommerce_review_title.txt')

print('Dataset Loaded')

Dataset Loaded


### Files
Each line in `reviews` contains a review, and the same line in `titles` contains its title. Let's view the first three lines from each file.

In [44]:
for sample_i in range(3):
    print('TITLE Line {}:  {}'.format(sample_i + 1, titles[sample_i]))
    print('REVIEW Line {}:  {}'.format(sample_i + 1, reviews[sample_i]))
    print("****************************")


TITLE Line 1:  Some major design flaws

REVIEW Line 1:  I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c

****************************
TITLE Line 2:  My favorite buy!

REVIEW Line 2:  I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!

****************************
TITLE Line 3:  Flattering shirt

REVIEW Line 3:  This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with 

# 1. Preprocess (10 points)
Before getting started with text summarization, let's preprocess the text of the titles and reviews. The objective is to make the text suitable for modeling by removing as much noise as possible.

In [45]:
# define a dictionary of all possible contractions and their expanded forms
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}

## 1.1. Clean (4 points)

Define a function `clean(text)` that will preprocess and clean the text `text`.

We would like to carry out the following preprocessing operations:

- Convert text to lowercase
- Expand the contractions (for example, "isn't" to "is not"). A dictionary of contractions and their expanded forms (`contraction_mapping`) is provided for you.
- Remove everything from the text except letters, '.' and ','
- Remove single-character tokens




In [46]:
def clean(text):
    """
    Clean text
    :param text: text to be cleaned
    :return: cleaned text as per the guidelines specified above
    """
    # TODO: Implement

    return ""



'hello you do not like this. too sxasxas much complex so much wkqda'

Now preprocess the text in the lists __reviews__ and __titles__.

In [47]:
# preprocess review and title text
cleaned_reviews = [clean(r) for r in reviews]
cleaned_titles = [clean(t) for t in titles]

In [50]:
cleaned_titles[1], cleaned_reviews[1]
# this should return: 
# ('my favorite buy','love, love, love this jumpsuit. it is fun, flirty, and fabulous every time wear it, get nothing but great compliments')

('my favorite buy',
 'love, love, love this jumpsuit. it is fun, flirty, and fabulous every time wear it, get nothing but great compliments')

## 1.2. Explore the Vocabulary  (6 points)
The complexity of the problem is determined by the complexity of the vocabulary: a larger and more varied vocabulary makes the problem harder. Let's look at the complexity of the dataset we'll be working with.
- Calculate the total number of words in the reviews
- Calculate the number of unique words in the reviews
- Calculate the total number of words in the titles
- Calculate the number of unique words in the titles
- Find the 10 most common words in the reviews and in the titles
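As a rough illustration of these counts, the sketch below uses `collections.Counter` on a toy list of strings. The helper name `vocab_stats` is hypothetical, and it assumes words are separated by whitespace; the exact numbers you get on the real data depend on how you tokenize (for example, whether punctuation stays attached to words), so this sketch is not guaranteed to reproduce the expected output.

```python
from collections import Counter

def vocab_stats(texts, top_n=10):
    """Return (total words, unique words, top_n most common words)."""
    # Assumption: whitespace tokenization of already-cleaned text.
    words = [word for line in texts for word in line.split()]
    counts = Counter(words)
    most_common = [word for word, _ in counts.most_common(top_n)]
    return len(words), len(counts), most_common

# Toy data standing in for cleaned_titles.
toy_titles = ["great dress", "great top and great fit"]
total, unique, common = vocab_stats(toy_titles, top_n=1)
print('{} Title words.'.format(total))
print('{} unique Title words.'.format(unique))
print(' '.join('"{}"'.format(w) for w in common))
```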


In [1]:
# TODO: Implement

Your code should print the following:
```
1125528 Review words.
22628 unique Review words.
10 Most common words in the Reviews dataset:
"the" "and" "it" "is" "this" "to" "in" "not" "but" "on"

64407 Title words.
4575 unique Title words.
10 Most common words in the Title dataset:
"and" "great" "love" "dress" "but" "cute" "beautiful" "not" "for" "top"
```

# 2. Train, validation, test splitting (10 points)

You will have to **manually** split the reviews dataset into training, validation (development), and testing sets. 
Use 80% of the data for training the model, 10% for validation, and 10% for testing. Keep each review paired with its title in every split.
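One way to do such a split by hand is to shuffle a list of indices once and slice it, so that each review stays aligned with its title. The sketch below is an illustration under assumptions: the helper name `split_pairs`, the fixed seed, and the choice to shuffle before splitting are not specified by the assignment.

```python
import random

def split_pairs(sources, targets, val_frac=0.1, test_frac=0.1, seed=0):
    """Split aligned lists into (train, val, test); each is a (sources, targets) pair."""
    if len(sources) != len(targets):
        raise ValueError("sources and targets must have the same length")
    indices = list(range(len(sources)))
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_frac)
    n_test = int(len(indices) * test_frac)

    def pick(ids):
        # Index both lists with the same ids so pairs stay aligned.
        return [sources[i] for i in ids], [targets[i] for i in ids]

    val = pick(indices[:n_val])
    test = pick(indices[n_val:n_val + n_test])
    train = pick(indices[n_val + n_test:])
    return train, val, test
```

For example, `split_pairs(cleaned_reviews, cleaned_titles)` would return three pairs of lists whose sizes are roughly 80%, 10%, and 10% of the data.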

In [2]:
# TODO: Implement

# 3. Build a model using OpenNMT (60 points)

[OpenNMT](https://opennmt.net/) is an open source ecosystem for neural machine translation and neural sequence learning. Started in December 2016 by the [Harvard NLP group](https://nlp.seas.harvard.edu/) and [SYSTRAN](https://translate.systran.net/), the project has since been used in several research and industry applications. It is currently maintained by SYSTRAN and Ubiqus.

OpenNMT provides implementations in two popular deep learning frameworks: [OpenNMT-py](https://opennmt.net/OpenNMT-py/), a PyTorch-based implementation that includes the encoder-decoder architecture, and [OpenNMT-tf](https://opennmt.net/OpenNMT-tf), an implementation based on TensorFlow.

In this homework, you will get started with OpenNMT and its components, then use them to train and test the title generation system.

Take a look at the OpenNMT-py documentation and its [quickstart](https://opennmt.net/OpenNMT-py/quickstart.html) page to familiarize yourself with the main training workflow within OpenNMT.
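The quickstart trains from plain-text parallel files: one example per line, with line *i* of the source file aligned to line *i* of the target file. The sketch below shows one way to write a split to disk in that shape. The helper `write_parallel` and the `src-<name>.txt` / `tgt-<name>.txt` file names are assumptions made for illustration; check the quickstart for the file names and configuration that your OpenNMT-py version expects.

```python
import os
import tempfile

def write_parallel(out_dir, name, sources, targets):
    """Write aligned source/target lines to src-<name>.txt and tgt-<name>.txt."""
    if len(sources) != len(targets):
        raise ValueError("sources and targets must have the same length")
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, "src-{}.txt".format(name))
    tgt_path = os.path.join(out_dir, "tgt-{}.txt".format(name))
    for path, lines in ((src_path, sources), (tgt_path, targets)):
        with open(path, "w", encoding="utf-8") as f:
            for line in lines:
                # Collapse internal whitespace so each example stays on one line.
                f.write(" ".join(line.split()) + "\n")
    return src_path, tgt_path

# Demo in a temporary directory with toy reviews (sources) and titles (targets).
with tempfile.TemporaryDirectory() as demo_dir:
    src_path, tgt_path = write_parallel(
        demo_dir, "train",
        ["this shirt is very flattering", "love this jumpsuit"],
        ["flattering shirt", "my favorite buy"],
    )
    with open(src_path, encoding="utf-8") as f:
        demo_src_lines = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        demo_tgt_lines = f.read().splitlines()
```

Here the reviews are the source side and the titles are the target side, since the model reads a review and generates a title.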

You will evaluate the quality of the system using ROUGE, the de facto standard metric for summarization. It compares the output of your system with the gold (reference) summaries and produces a score based on their overlap. 
You can use this [ROUGE python implementation](https://pypi.org/project/rouge/).
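To make the idea of the metric concrete, here is a simplified ROUGE-1 computation (unigram overlap between a generated title and its gold title). It is a hand-written illustration, not the implementation used by the `rouge` package: the function name `rouge_1` is hypothetical, tokenization is a plain whitespace split, and no stemming is applied.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Simplified ROUGE-1: unigram precision, recall, and F1."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

# The hypothesis shares 2 of its 2 words with the 3-word reference:
# precision = 2/2, recall = 2/3, F1 = 0.8.
scores = rouge_1("great dress", "great summer dress")
```

Recall rewards covering the words of the gold title, precision penalizes extra words in the generated title, and F1 balances the two.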



In [74]:
# Preprocess (10 points)
# TODO: Implement

In [None]:
# Train (25 points)
# TODO: Implement

In [75]:
# Test (25 points)
# TODO: Implement

# 4. Evaluate the model (20 points)

In [None]:
# Evaluate
# TODO: Implement