# Notebook 2: Data Cleaning and Feature Engineering

_For USD-599 Capstone Project by Hunter Blum, Kyle Esteban Dalope, and Nicholas Lee (Summer 2023)_

***

**Content Overview:**
1. Text Sentiment Feature Creation
2. Feature Removal and Selection

**Note: Some Cleaning and Engineering Steps Already Performed in Notebook 1: Data Exploration**
1. Dropped "source" column.
2. Removed duplicates, keeping the most recent (by date) observation.
3. Removed unneeded columns such as pictures, host IDs, etc.
4. Filled missing values for bathrooms.
5. Added zipcodes for neighborhood categories.
6. Created a binary variable based on property type descriptions.
7. Created a binary variable based on if a property is private or shared with host.

In [None]:
# Library Imports
import pandas as pd
import numpy as np

# Note: we needed to install an older version of transformers (4.28.0)
from transformers import pipeline

In [3]:
# Import data from last notebook
eda_df = pd.read_csv("../Data/eda.csv.gz", compression = "gzip")
eda_df.head(1)

Unnamed: 0,id,last_scraped,name,description,neighborhood_overview,host_neighbourhood,host_listings_count,host_total_listings_count,neighbourhood,neighbourhood_cleansed,...,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,zipcode,median_income_dollars,property_type_binary,private
0,52582829,2022-06-15,Huge Oceanview decks+RooftopDeck☀sleeps 10☀Gar...,10 steps from the boardwalk! Beautiful beach h...,,,52.0,52.0,,Mission Bay,...,t,30,30,0,0,3.77,92109,95170.0,house,1


## Sentiment Based Feature Creation
In order to capture the sentiment from our text-based variables, we will use transfer learning with a pre-trained model.

First we'll combine all of our text-based columns into one.

In [16]:
# In order to combine, we need to fill any NAs with blank strings
eda_df[['name', 'description', 'neighborhood_overview']] = eda_df[['name', 'description', 'neighborhood_overview']].fillna('')

# Combine
eda_df['text'] = eda_df['name'] + eda_df['description']
eda_df['text'] = eda_df['text'] + eda_df['neighborhood_overview']

# Our model can only handle fewer than 512 tokens, so we truncate each text to its first 511 characters
eda_df['text_trunc'] = eda_df['text'].str.slice(0, 511)

# How many observations were affected? Judging by whitespace word counts, none, but we still got an error.
# This could be because the model's tokenization differs from a simple word split.
eda_df['word_counts'] = eda_df['text'].apply(lambda n : len(n.split()))
# Number of observations below the limit by word count
len(eda_df[eda_df['word_counts'] < 511])

# Save our observations as a list
corpus = eda_df['text_trunc'].to_list()
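
Note that `str.slice(0, 511)` cuts at 511 characters, while `word_counts` measures whitespace-separated words, so the two numbers are not directly comparable. The following is a minimal sketch on made-up strings (the `texts` series and the `MAX_CHARS` name are illustrative, not taken from the project data) showing how a listing can be well under 511 words and still be shortened by the character slice:

```python
import pandas as pd

# Hypothetical example texts, not from the listings data
texts = pd.Series(["short listing", "word " * 200])

MAX_CHARS = 511  # same limit as the slice in the cell above

char_lengths = texts.str.len()              # characters per text
word_counts = texts.str.split().str.len()   # whitespace-separated words per text
affected = char_lengths > MAX_CHARS         # True where the slice removes text

truncated = texts.str.slice(0, MAX_CHARS)

summary = pd.DataFrame({
    "chars": char_lengths,
    "words": word_counts,
    "shortened_by_slice": affected,
})
print(summary)
```

Counting `affected` on the real `text` column would show how many observations the character slice actually shortens.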


In [None]:
# Load our model
sent_clf = pipeline('sentiment-analysis')

preds = []
for i in corpus:
    try:
        pred = sent_clf(i)
        preds.append(pred)
    except Exception:
        # Keep a placeholder so preds stays aligned with corpus
        preds.append(np.nan)
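
The loop above does not record which rows failed or why. A possible variant is sketched here with a hypothetical `fake_classifier` standing in for `sent_clf`, so that it runs without downloading a model. The helper name `predict_all` and the example texts are also assumptions made for illustration:

```python
def fake_classifier(text):
    """Hypothetical stand-in for sent_clf: fails on empty text."""
    if not text:
        raise ValueError("empty text")
    label = "POSITIVE" if "beautiful" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 0.9}]


def predict_all(classifier, texts):
    """Run classifier on each text, keeping one result slot per input."""
    preds, failures = [], []
    for idx, text in enumerate(texts):
        try:
            # The classifier returns a one-element list; keep the dict inside
            preds.append(classifier(text)[0])
        except Exception as err:
            preds.append(None)                 # placeholder keeps alignment
            failures.append((idx, repr(err)))  # remember where and why
    return preds, failures


preds, failures = predict_all(
    fake_classifier, ["Beautiful beach house", "", "Noisy street"]
)
print(failures)
```

Inspecting `failures` afterwards makes it easier to tell a token-limit error from some other problem.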


In [30]:
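# Un-nest the list: each successful prediction is a one-element list holding a dict,
# while failed predictions were stored as np.nan
preds2 = [p[0] if isinstance(p, list) else None for p in preds]

# Create probability of positive as our variable (NaN where the model failed)
probs = [
    np.nan if x is None
    else x['score'] if x['label'].startswith('P')
    else 1 - x['score']
    for x in preds2
]

# Save as variable in our df
eda_df['sentiment'] = probs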
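
As a worked example of the conversion above: each prediction carries one label and its score, so with two classes a `NEGATIVE` prediction with score 0.8 corresponds to a positive probability of 0.2. A small sketch, where `positive_probability` is a hypothetical helper name and the dictionaries are made up:

```python
import math


def positive_probability(pred):
    """Return P(positive) from a {'label', 'score'} dict, or NaN if pred is None."""
    if pred is None:
        return math.nan
    if pred["label"].upper().startswith("P"):
        return pred["score"]
    # Two-class case: the other class holds the remaining probability
    return 1 - pred["score"]


# Made-up predictions for illustration
examples = [
    {"label": "POSITIVE", "score": 0.9},
    {"label": "NEGATIVE", "score": 0.8},
    None,  # a failed prediction
]
example_probs = [positive_probability(p) for p in examples]
print(example_probs)
```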

In [31]:
# Write new df so we don't need to rerun the model every time
eda_df.to_csv("../Data/sentiment.csv.gz", compression= "gzip", index=False)

## Feature Removal

In [13]:
# Read back in the data with sentiments
sent_df = pd.read_csv("../Data/sentiment.csv.gz", compression = "gzip")
sent_df.head(1)

Unnamed: 0,id,last_scraped,name,description,neighborhood_overview,host_neighbourhood,host_listings_count,host_total_listings_count,neighbourhood,neighbourhood_cleansed,...,calculated_host_listings_count_shared_rooms,reviews_per_month,zipcode,median_income_dollars,property_type_binary,private,text,text_trunc,word_counts,sentiment
0,52582829,2022-06-15,Huge Oceanview decks+RooftopDeck☀sleeps 10☀Gar...,10 steps from the boardwalk! Beautiful beach h...,,,52.0,52.0,,Mission Bay,...,0,3.77,92109,95170.0,house,1,Huge Oceanview decks+RooftopDeck☀sleeps 10☀Gar...,Huge Oceanview decks+RooftopDeck☀sleeps 10☀Gar...,171,0.900577


There are columns that we don't need, such as id, last_scraped, the text columns, and those flagged by findings in the EDA. We'll remove them here, keeping the reasons separate for easier understanding and possible future adjustments.

EDA Findings:
1. host_listings_count and host_total_listings_count are the same.
2. 24 feature pairs had correlations above 0.75, so we removed one feature from each pair. Some features appeared in multiple pairs (high correlation with several features), so fewer than 24 distinct columns are dropped.
3. Longitude and latitude were the only features with very low variance. This could likely be fixed by scaling, but since we have zipcodes representing neighborhoods, we will remove them.
4. License was missing 14,007 values out of the 18,000 records. With so many missing values, it would not be feasible or practical to impute them.
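
One common way to pick which member of each correlated pair to drop is to scan the upper triangle of the absolute correlation matrix. This is only a sketch on a toy frame and not necessarily how the columns dropped in this notebook were chosen. The function name `correlated_columns_to_drop` and the demo data are hypothetical:

```python
import numpy as np
import pandas as pd


def correlated_columns_to_drop(df, threshold=0.75):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data for illustration only
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with a
    "c": [5, 1, 4, 2, 3],   # weakly correlated with a and b
})
to_drop = correlated_columns_to_drop(demo)
print(to_drop)
```

Because a column is flagged once no matter how many partners it has, the number of dropped columns can be smaller than the number of correlated pairs.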

**Note (open question):**
calendar_last_scraped, first_review, and last_review could be included, for example as a "days since" feature built from one of them. Whether to do so depends on whether we want to include time components.
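
If a time component were wanted, a "days since" feature could be derived along these lines. This is a sketch under assumptions: the toy `demo` frame, the choice of `last_scraped` as the reference date, and the new column name `days_since_last_review` are all illustrative and not part of the current pipeline:

```python
import pandas as pd

# Toy rows for illustration; the second listing has no reviews yet
demo = pd.DataFrame({
    "last_scraped": ["2022-06-15", "2022-06-15"],
    "last_review": ["2022-06-01", None],
})

scraped = pd.to_datetime(demo["last_scraped"])
reviewed = pd.to_datetime(demo["last_review"])  # missing dates become NaT

# Whole days between the last review and the scrape; NaN where no review exists
demo["days_since_last_review"] = (scraped - reviewed).dt.days
print(demo)
```

Listings without a review end up with a missing value, which would then need its own imputation decision.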

In [14]:
# Remove unneeded columns
sent_df = sent_df.drop(columns = [
    'id', 'last_scraped', 'host_neighbourhood', 'neighbourhood', 'neighbourhood_cleansed',
    'property_type', 'amenities', 'license', 'calendar_last_scraped', 'first_review', 'last_review']
    )

# Remove the text columns
sent_df = sent_df.drop(columns = [
    'name', 'description', 'neighborhood_overview',
    'text', 'text_trunc', 'word_counts']
    )

# Remove host_total_listings_count
sent_df = sent_df.drop(columns=['host_total_listings_count'])

# Remove highly correlated features
sent_df = sent_df.drop(columns = [
    'minimum_nights_avg_ntm', 'calculated_host_listings_count_entire_homes',
    'availability_90', 'maximum_nights_avg_ntm', 'availability_60',
    'review_scores_accuracy', 'maximum_minimum_nights', 'review_scores_value', 'beds',
    'review_scores_cleanliness', 'accommodates', 'review_scores_communication',
    'number_of_reviews_ltm', 'minimum_maximum_nights', 'number_of_reviews_l30d']
    )

# Remove low variance features
sent_df = sent_df.drop(columns=['longitude', 'latitude'])
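
An alternative way to keep the removal reasons separate is to store them in a dictionary and drop everything in one pass. The sketch below uses a toy frame and a shortened, hypothetical `drop_reasons` mapping. It also removes duplicate names and reports columns that are not present before dropping:

```python
import pandas as pd

# Hypothetical, shortened mapping of reason -> columns to drop
drop_reasons = {
    "identifiers": ["id", "last_scraped"],
    "highly_correlated": ["beds", "accommodates", "beds"],  # duplicate on purpose
    "low_variance": ["longitude", "latitude"],
}

# Toy frame with only column names, for illustration
demo = pd.DataFrame(columns=[
    "id", "last_scraped", "beds", "accommodates",
    "longitude", "latitude", "price",
])

# Flatten the mapping and remove duplicates while preserving order
to_drop = list(dict.fromkeys(
    col for cols in drop_reasons.values() for col in cols
))

# Columns named for removal that the frame does not contain
missing = [col for col in to_drop if col not in demo.columns]

cleaned = demo.drop(columns=to_drop)
print(list(cleaned.columns), missing)
```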


## Data Types
Let's make sure all the columns have the correct data type.

In [17]:
sent_df.dtypes

host_listings_count                             float64
room_type                                        object
bathrooms                                       float64
bedrooms                                        float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                          float64
maximum_maximum_nights                          float64
has_availability                                 object
availability_30                                   int64
availability_365                                  int64
number_of_reviews                                 int64
review_scores_rating                            float64
review_scores_checkin                           float64
review_scores_location                          float64
instant_bookable                                 object
calculated_host_listings_count                  
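
The output above lists `has_availability` and `instant_bookable` as `object` columns. If they are to be treated as true/false flags, one option is to map them to a boolean dtype. This sketch assumes the values are the strings `'t'` and `'f'` (only `'t'` is visible in the sample row shown earlier) and uses a toy frame:

```python
import pandas as pd

# Toy data for illustration; 'f' and the missing value are assumptions
demo = pd.DataFrame({
    "has_availability": ["t", "t", "f"],
    "instant_bookable": ["t", "f", None],
})

flag_map = {"t": True, "f": False}
for col in ["has_availability", "instant_bookable"]:
    # Unmapped or missing values become <NA> in the nullable boolean dtype
    demo[col] = demo[col].map(flag_map).astype("boolean")

print(demo.dtypes)
```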

In [18]:
# Write out the clean df
sent_df.to_csv("../Data/clean_df.csv.gz", compression= "gzip", index=False)