# Solution: Text Extraction and Imputing Missing Data

For this second NLP project, I'll use several NLP techniques to impute missing data, specifically the `BULLET_POINTS` and `DESCRIPTION` columns. First I'll use `spaCy` to extract keywords from each product title, then use `GPT-J` (or an equivalent GPT-style model) to combine those keywords into a description. This could help Amazon vendors fill out listings more quickly, since they would get a succinct description from the product title alone.

In [1]:
# Modules
import pandas as pd
import numpy as np
from pathlib import Path
import os
from plotnine import *
import re
from transformers import GPT2Tokenizer, GPT2Model, pipeline, set_seed
import spacy
from string import punctuation
from collections import Counter
from gpt_j.Basic_api import simple_completion
from gpt_j.gptj_api import Completion
from transformers import GPTJForCausalLM, AutoTokenizer
import torch

## Setup

First, we'll need to do some general setup. Because the dataset is large, I'll only use a subset of 100 rows from the Amazon reviews dataset, sampled from the rows where both `BULLET_POINTS` and `DESCRIPTION` are missing.

In [2]:
# Load in data
try:
    reviews = pd.read_csv(Path(os.getcwd()).parents[0].joinpath("data", "amazon_reviews_clean.csv"))
except FileNotFoundError:
    reviews = pd.read_csv(Path(os.getcwd()).parents[0].joinpath("data", "train.csv"))

In [3]:
# Use a smaller dataset due to memory size concerns and constraints
rev_sub = (reviews[(reviews["BULLET_POINTS"].isnull()) & 
                  (reviews["DESCRIPTION"].isnull())]
           .sample(n=100, random_state=5))
del reviews

In [4]:
torch.cuda.is_available()

True

In [5]:
# Setup GPU
device = "cuda"
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
).to(device)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

## Plan

First, we'll use `spaCy` to extract keywords from each title and store them as `BULLET_POINTS`. Then a GPT model will combine the title and keywords into a `DESCRIPTION`, and both imputed columns will be saved with the dataset.

In [6]:
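# First, apply spaCy
# Adapted from https://towardsdatascience.com/keyword-extraction-process-in-python-with-natural-language-processing-nlp-d769a9069d5c
# Load the pipeline once, rather than on every call to get_keywords
nlp = spacy.load("en_core_web_sm")
POS_TAGS = ["PROPN", "ADJ", "NOUN"]

def get_keywords(text: str, top_n: int = 8) -> str:
    """Return the top_n most frequent proper nouns, adjectives, and nouns as a comma-separated string."""
    result = []
    doc = nlp(text.lower())
    for token in doc:
        # Skip stop words and punctuation
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue

        # Keep only the parts of speech that tend to carry product information
        if token.pos_ in POS_TAGS:
            result.append(token.text)

    return ", ".join([x[0] for x in Counter(result).most_common(top_n)])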

In [7]:
# Fill `BULLET_POINTS` with the keywords extracted from `TITLE`
rev_sub["BULLET_POINTS"] = rev_sub["TITLE"].apply(get_keywords)
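The ranking step inside `get_keywords` can be tried without downloading a spaCy model. The sketch below is a simplified stand-in, not the function above: it skips part-of-speech filtering entirely, and `STOP_WORDS`, `rank_keywords`, and the sample title are hypothetical names and data chosen for illustration.

```python
from collections import Counter
from string import punctuation

# Hypothetical, tiny stop-word list; spaCy's nlp.Defaults.stop_words is much larger.
STOP_WORDS = {"a", "an", "and", "the", "for", "of", "with", "in"}

def rank_keywords(text: str, top_n: int = 8) -> str:
    """Return the top_n most frequent non-stop-word tokens, comma-separated."""
    # Whitespace tokenization with punctuation stripped from each token's edges
    tokens = [tok.strip(punctuation) for tok in text.lower().split()]
    kept = [tok for tok in tokens if tok and tok not in STOP_WORDS]
    # most_common orders by count; ties keep first-seen order
    return ", ".join(word for word, _ in Counter(kept).most_common(top_n))

example = rank_keywords("Funny journal for pastry chef - pastry chef notebook", top_n=3)
print(example)
```

Because ties keep first-seen order, words that repeat in a title float to the front, and everything else stays in title order.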

In [8]:
def create_description(row):
    # Prompt engineering 
    prompt = (f"Create a short product description for the Amazon product \"{row['TITLE']}\". "
              "Use the following keywords to write the description."
              f"\nKeywords: {row['BULLET_POINTS']}"
              f"\nDescription: ")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    
    gen_tokens = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.75,
        max_length=200,  # total length: counts the prompt tokens as well as the generated ones
        pad_token_id=tokenizer.eos_token_id
    )
    
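    # Decode only the newly generated tokens. This drops the prompt by position
    # instead of relying on an exact string match, and strips special tokens.
    new_tokens = gen_tokens[0, input_ids.shape[1]:]
    gen_text = tokenizer.decode(new_tokens, skip_special_tokens=True)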
    
    return gen_text

In [22]:
# Now, create a `DESCRIPTION` given the newfound `BULLET_POINTS`
descriptions = []

for index, row in rev_sub.iterrows():
    descriptions.append(create_description(row))

In [14]:
print(descriptions[1])


The Book of Ulster Surnames is a new genealogy product, aimed at beginners and intermediate users who want to trace their Ulster ancestors easily.

As shown above, the key to this question is to know how to use the keywords and, more importantly, how to write a proper description.
In order to create a proper description, you must first know what you are writing about.
Think about what you are writing about. 
You've chosen a unique and interesting product.
You are writing a description for it, not a generic description for any product.
I want to trace my Ulster ancestors easily.
That's the focus of the description.
In a nutshell:

You are writing a description for a product that is aimed at beginners.
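In this sample, the usable description is the first paragraph and everything after it is drift. One possible cleanup step, which is not part of the original pipeline, is to keep only the first non-empty paragraph. This is a minimal sketch that assumes paragraphs are separated by blank lines; `first_paragraph` is a hypothetical helper, and `raw` is a shortened copy of the output above.

```python
def first_paragraph(generated: str) -> str:
    """Return the first non-empty paragraph of a generated string, or "" if there is none."""
    # Assumption: paragraphs are separated by a blank line ("\n\n")
    for block in generated.split("\n\n"):
        cleaned = block.strip()
        if cleaned:
            return cleaned
    return ""

raw = (
    "\n\nThe Book of Ulster Surnames is a new genealogy product, aimed at beginners "
    "and intermediate users who want to trace their Ulster ancestors easily."
    "\n\nAs shown above, the key to this question is to know how to use the keywords.\n"
)
cleaned_description = first_paragraph(raw)
print(cleaned_description)
```

This heuristic only helps when the model starts with the description. It would not help with outputs that open with unrelated text.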



## Analysis and Results

Now that we have our descriptions, let's check out some of these results.

In [35]:
# Print the first ten generated descriptions
for description in descriptions[:10]:
    print(description)
    print("-----------------")



A:

You could use freeform text, as you won't be using any of the structured fields. 
If you were to use the structured fields, the best you could do is to use the ProductTitle field, but you cannot have more than one ProductTitle per product, so you will be limited to using only one of these:

How to create the future you want: getting from where you are to where you ought to be
How to create the future you want: the step by step guide to getting from where you are to where you ought to be
The step by step manual to creating the future you want

You can only use the first 7 characters. You could also use the ProductDescription field,
-----------------


A:

You could use the Amazon Product API to fetch the data for the product and build the short description using that data. 
This would allow you to create an API endpoint that looks something like this:
http://api.amazon.com/product/product-attributes.json?ASIN=B000N6VH8G&Operation=ItemLookup

Which returns a JSON document that look

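From the outputs above, it's clear that more work is needed: the model often drifts into forum-style answers instead of writing a product description. I tried modifying the prompt (prompt engineering) and providing context with one-shot and few-shot examples, but neither seemed to improve the results.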

However, we do see some instances of success. For example, for the product with `PRODUCT_ID` <>, the title is "i had a life but my pastry chef job ate it: hilarious and funny journal for pastry chef - funny christmas and birthday gift idea for pastry chef - pastry chef notebook - 100 pages 6x9 inch" and the generated description is "A funny journal for pastry chefs! Includes 6x9 inch pages and 100 pages."
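The one-shot and few-shot prompts mentioned earlier are not shown in this notebook. The sketch below is one way such a prompt could be assembled, reusing the pastry chef result as the worked example. `build_prompt`, its arguments, the `###` separator, and both keyword strings are assumptions made for illustration, not the prompts that were actually tried.

```python
def build_prompt(title, keywords, examples=()):
    """Assemble a few-shot prompt; `examples` holds (title, keywords, description) tuples."""
    parts = []
    for ex_title, ex_keywords, ex_description in examples:
        # Each worked example ends with a separator line so the model can see where it stops
        parts.append(
            f"Title: {ex_title}\nKeywords: {ex_keywords}\nDescription: {ex_description}\n###"
        )
    # The final block leaves the description empty for the model to complete
    parts.append(f"Title: {title}\nKeywords: {keywords}\nDescription:")
    return "\n".join(parts)

# Illustrative data: the keyword strings are made up, not real get_keywords output
one_shot = [(
    "pastry chef notebook - 100 pages 6x9 inch",
    "funny, journal, pastry, chef",
    "A funny journal for pastry chefs! Includes 6x9 inch pages and 100 pages.",
)]
prompt = build_prompt("The Book of Ulster Surnames", "book, ulster, surnames", one_shot)
print(prompt)
```

Keeping the same `Title` / `Keywords` / `Description` layout in every block gives the model a pattern to continue, and the separator could double as a cut-off marker when trimming the generated text.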

In [31]:
# Save new df
rev_sub["DESCRIPTION"] = descriptions
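# Drop the leftover `DESCRIPTIONS` column if it exists; errors="ignore" avoids a KeyError when it does not
rev_sub.drop(columns="DESCRIPTIONS", errors="ignore", inplace=True)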

rev_sub.to_csv(Path(os.getcwd()).parents[0].joinpath("data", "reviews_imputed.csv"))