# Sentiment Extraction
- Name: Minh T. Nguyen
- Date: 11/24/2023
- About:
    - **Description Sentiment Analysis**: Use a pretrained model to perform sentiment analysis on the listing descriptions and create a new feature from the result.
    - This notebook only runs the first 5 rows to show that it works. The actual run was performed on Kaggle's GPU accelerator.

In [1]:
!ls ../data

images_sample		sentimental_extraction_kaggle.csv   train.json
Kaggle-renthop.torrent	sentimental_extraction_kaggle.json
note.md			sentimental_extraction_sample.csv


In [2]:
!pip install -q transformers

**Note:** The datasets can be found [here](https://www.kaggle.com/competitions/two-sigma-connect-rental-listing-inquiries/data?select=train.json.zip).
- train.json: the training set.
- images_sample.zip: listing images organized by listing_id (a sample of 100 listings)
- Kaggle-renthop.7z: listing images organized by listing_id. Total size: 78.5 GB compressed.

In [None]:
# import libraries
import numpy as np
import pandas as pd
from transformers import pipeline
from transformers import AutoTokenizer
import re

import warnings
warnings.filterwarnings('ignore')



## 1. Import dataset

In [None]:
# import the dataset
df = pd.read_json("../data/train.json")
df.head(5)

In [None]:
# outlier removal
upper_bound = np.percentile(df["price"].values, 99)
df_filtered = df[df["price"] <= upper_bound]
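
To illustrate what the 99th-percentile cap does, here is a minimal sketch on made-up prices. The `percentile` helper is hypothetical and written in plain Python; it assumes the linear-interpolation rule that `np.percentile` uses by default, and the toy values are not taken from the dataset.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0   # fractional rank
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# toy prices (made up): four typical rents and one extreme value
prices = [1000, 2000, 3000, 4000, 100000]

upper_bound = percentile(prices, 99)
kept = [p for p in prices if p <= upper_bound]

print(upper_bound)  # cap sits just below the extreme value
print(kept)         # the extreme price is dropped
```

Only values above the cap are removed, so the bulk of the price distribution is left untouched.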

In [None]:
# use only the first 5 rows for a quick test; .copy() avoids SettingWithCopyWarning when adding columns later
df_filtered = df_filtered.head(5).copy()

In [None]:
df_filtered.head(5)

## 2. Sentiment Analysis With Pretrained Model
- The model used is "distilbert-base-uncased-finetuned-sst-2-english", a small and fast version of BERT. It is fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset, which consists of sentences from movie reviews labeled with their sentiment.
- DistilBERT itself is a transformer-based model distilled from BERT, designed to be faster and lighter while retaining most of BERT's performance. It follows the BERT architecture, an attention-based neural network that uses self-attention to weigh the importance of each word in a sentence relative to the others.
- Since there is no pretrained BERT model specific to apartment-listing vocabulary, this general-purpose sentiment analysis model is a reasonable choice.
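
The self-attention idea mentioned above can be sketched in a few lines of plain Python. This is a simplified illustration, not DistilBERT's implementation: the real model applies learned query/key/value projections and several attention heads, whereas this sketch uses the toy embeddings directly as queries, keys, and values. The function names and the 2-dimensional embeddings are made up for the example.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # similarity of this token's query to every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # output is the attention-weighted average of the value vectors
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights

# toy 2-d embeddings (made up) for three tokens
tokens = ["bright", "spacious", "apartment"]
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

out, attn = self_attention(X, X, X)
for token, row in zip(tokens, attn):
    print(token, [round(w, 3) for w in row])
```

Each row of `attn` shows how strongly one token attends to the others; tokens with similar embeddings receive larger weights.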

### Resources
- [BERT Neural Network - EXPLAINED!](https://www.youtube.com/watch?v=xI0HHN5XKDo)
- [What is BERT and how does it work? | A Quick Review](https://www.youtube.com/watch?v=6ahxPTLZxU8)

In [None]:
# get the first description
test_des = df_filtered.description.iloc[0]
print(test_des)

In [None]:
# function to clean HTML tags and whitespace
def preprocess_text(text):
    text = re.sub(r'<[^>]+>', '', text)  # remove HTML tags
    text = re.sub(r'\s+', ' ', text)     # replace multiple whitespaces with single space
    return text.strip()

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# function to truncate text to a max length
def truncate_text(text, max_length=500):
    # encode the text, truncating so the input (including special tokens) does not exceed max_length tokens
    inputs = tokenizer.encode_plus(
        text, 
        add_special_tokens=True, 
        max_length=max_length, 
        truncation=True
    )
    # decode back to a string, without the special tokens
    truncated_text = tokenizer.decode(inputs['input_ids'], skip_special_tokens=True)
    return truncated_text

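# initialize sentiment analysis pipeline (model named explicitly; it is the default for this task)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# function to get the sentiment label and score for one description
def get_sentiment(text):
    print("Processed")
    return sentiment_pipeline(text)[0]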

# apply preprocessing to the descriptions
df_filtered['clean_description'] = df_filtered['description'].apply(preprocess_text)

# truncate descriptions to max_length
df_filtered['truncated_description'] = df_filtered['clean_description'].apply(truncate_text)

# perform sentiment analysis
df_filtered['sentiment'] = df_filtered['truncated_description'].apply(get_sentiment)

# define thresholds on the probability of the POSITIVE class
positive_threshold = 0.75
negative_threshold = 0.25

# function to classify sentiment
def classify_sentiment(sentiment):
    # the pipeline's score is the confidence of the predicted label (>= 0.5 for
    # two classes), so convert it to P(POSITIVE) before applying the thresholds
    score = sentiment['score']
    p_positive = score if sentiment['label'] == 'POSITIVE' else 1 - score
    if p_positive >= positive_threshold:
        return 1
    elif p_positive <= negative_threshold:
        return -1
    else:
        return 0

# apply sentiment classification to the dataframe
df_filtered['sentiment_label'] = df_filtered['sentiment'].apply(classify_sentiment)
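
As a worked example of the thresholding step, the sketch below maps a few hand-written pipeline-style results to the three classes. It assumes that `score` is the probability of the predicted label, which is how the sentiment pipeline reports a two-class result. The helper names `positive_probability` and `to_label` and the sample results are hypothetical, not output from the actual run.

```python
def positive_probability(result):
    """Convert a {'label', 'score'} result into P(POSITIVE)."""
    score = result["score"]
    return score if result["label"] == "POSITIVE" else 1.0 - score

def to_label(result, positive_threshold=0.75, negative_threshold=0.25):
    """Return 1 (positive), -1 (negative), or 0 (uncertain)."""
    p = positive_probability(result)
    if p >= positive_threshold:
        return 1
    if p <= negative_threshold:
        return -1
    return 0

# hand-written results in the pipeline's output format (made up)
examples = [
    {"label": "POSITIVE", "score": 0.98},  # confident positive
    {"label": "NEGATIVE", "score": 0.91},  # confident negative
    {"label": "POSITIVE", "score": 0.60},  # weak positive, treated as neutral
]

labels = [to_label(r) for r in examples]
print(labels)
```

Working with the positive-class probability keeps both thresholds on one scale, so a confident negative prediction ends up at the low end rather than the high end.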

In [None]:
# check dataset
df_filtered.head(5)

In [None]:
# save dataset
df_filtered.to_csv("../data/sentimental_extraction_sample.csv")