# Notebook 2: Data Cleaning and Feature Engineering

_For USD-599 Capstone Project by Hunter Blum, Kyle Esteban Dalope, and Nicholas Lee (Summer 2023)_

***

**Content Overview:**
1. Text Sentiment Feature Creation

**Note: Some Cleaning and Engineering Steps Already Performed in Notebook 1: Data Exploration**
1. Dropped "source" column.
2. Removed duplicates, keeping the most recent (by date) observation.
3. Removed unneeded columns such as pictures, host IDs, etc.
4. Filled missing values for bathrooms.
5. Added zipcodes for neighborhood categories.


In [1]:
# Library Imports
import pandas as pd
import numpy as np

# Note: we needed to install an older version (4.28.0)
from transformers import pipeline



In [2]:
# Import data from last notebook
eda_df = pd.read_csv("../Data/eda.csv.gz", compression = "gzip")
eda_df.head(1)

| Unnamed: 0 | id | last_scraped | name | description | neighborhood_overview | host_neighbourhood | host_listings_count | host_total_listings_count | neighbourhood | neighbourhood_cleansed | ... | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | zipcode | median_income_dollars | property_type_binary | private |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52582829 | 2022-06-15 | Huge Oceanview decks+RooftopDeck☀sleeps 10☀Gar... | 10 steps from the boardwalk! Beautiful beach h... | | | 52.0 | 52.0 | | Mission Bay | ... | t | 30 | 30 | 0 | 0 | 3.77 | 92109 | 95170.0 | house | 1 |


## Sentiment Based Feature Creation
To capture the sentiment of our text-based variables, we will use transfer learning: a pre-trained sentiment model scores each listing's text without any additional training on our data.

First, we'll combine all of our text-based columns into one.

In [16]:
# To combine the text columns, we first fill any NAs with empty strings
eda_df[['name', 'description', 'neighborhood_overview']] = eda_df[['name', 'description', 'neighborhood_overview']].fillna('')

# Combine
eda_df['text'] = eda_df['name'] + eda_df['description']
eda_df['text'] = eda_df['text'] + eda_df['neighborhood_overview']

# Our model can handle at most 512 tokens, so we truncate each text to its first 511 characters
eda_df['text_trunc'] = eda_df['text'].str.slice(0, 511)

# How many observations were affected? By word count it looks like none, but we still got an error,
# which could be because the model's tokenizer splits text differently than whitespace does
eda_df['word_counts'] = eda_df['text'].apply(lambda n: len(n.split()))
len(eda_df[eda_df['word_counts'] < 511])

# Save our observations as a list
corpus = eda_df['text_trunc'].to_list()
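
The slice above cuts by characters, while the check counts whitespace-separated words, so the two do not measure the same thing (and neither is a token count). A minimal sketch on a made-up string, not taken from the dataset, shows how a text can be well under the limit by words yet over it by characters:

```python
# Toy text for illustration only: few words, but many characters
text = "Oceanview " * 60

char_limit = 511  # same cut-off as the character slice above

n_chars = len(text)            # number of characters
n_words = len(text.split())    # number of whitespace-separated words
truncated = text[:char_limit]  # character-level truncation

print(n_words, n_chars, len(truncated))  # prints: 60 600 511
```

Here the word-count check would report the text as unaffected, even though the character slice shortens it.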


In [17]:
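# Load the default sentiment-analysis pipeline
sent_clf = pipeline('sentiment-analysis')

# Score each text; store NaN if the model raises an error for an input
preds = []
for i in corpus:
    try:
        pred = sent_clf(i)
        preds.append(pred)
    except Exception:
        preds.append(np.nan)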


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
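
Because the loop stores NaN silently, it can help to record which inputs failed. The following is a minimal sketch, not the approach used in this notebook: `predict_all` and `toy_clf` are hypothetical names introduced for illustration, and `toy_clf` is a stand-in for `sent_clf` that simply rejects long inputs.

```python
import math

def predict_all(texts, clf):
    """Apply clf to each text, storing NaN and the index for inputs that raise."""
    preds, failed = [], []
    for idx, text in enumerate(texts):
        try:
            preds.append(clf(text))
        except Exception:
            preds.append(math.nan)
            failed.append(idx)
    return preds, failed

# Hypothetical stand-in for the real pipeline: rejects inputs over 20 characters
def toy_clf(text):
    if len(text) > 20:
        raise ValueError("input too long")
    return [{"label": "POSITIVE", "score": 0.75}]

preds_demo, failed_demo = predict_all(["cozy beach house", "x" * 50], toy_clf)
print(failed_demo)  # prints: [1]
```

With the real pipeline, one would presumably pass `sent_clf` in place of `toy_clf` and inspect the failed indices before building the feature.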


In [30]:
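# Un-nest the list: each successful prediction is a one-element list holding a dict,
# while failed predictions were stored as NaN
preds2 = [p[0] if isinstance(p, list) else None for p in preds]

# Use the probability of the positive class as our variable (NaN if the prediction failed)
probs = [
    np.nan if x is None
    else (x['score'] if x['label'].startswith('P') else 1 - x['score'])
    for x in preds2
]

# Save as a variable in our df
eda_df['sentiment'] = probs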
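
The pipeline reports the score of the predicted label only, so for a two-class model a NEGATIVE prediction with score s corresponds to a positive-class probability of 1 - s. A small worked sketch of that conversion on made-up predictions (`positive_prob` is a hypothetical helper, not part of the notebook):

```python
def positive_prob(pred):
    """Convert a {'label', 'score'} prediction into the probability of the positive class."""
    if pred["label"].startswith("P"):
        return pred["score"]
    return 1 - pred["score"]

# Made-up predictions for illustration
examples = [
    {"label": "POSITIVE", "score": 0.75},
    {"label": "NEGATIVE", "score": 0.75},
]

print([positive_prob(p) for p in examples])  # prints: [0.75, 0.25]
```

Expressing both labels on one scale gives a single numeric feature in [0, 1] rather than a label plus a confidence.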

In [31]:
# Write the new df so we don't need to rerun the model every time
eda_df.to_csv("../Data/sentiment.csv.gz", compression= "gzip", index=False)

## Part 2