## 1) Exploratory Data Analysis

To gain insight into the dataset, I first examined the correlation between the numerical predictors and the target variable using a correlation matrix, which required encoding the host_is_superhost feature as a 0/1 variable. Review-related attributes such as review scores, the length of time the host has been active, and several availability metrics showed notable correlations with superhost status.
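The correlation check described above can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis: the `toy` frame and its values are made up, and only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "host_is_superhost": rng.choice(["f", "t"], size=n),
    "review_scores_rating": rng.normal(4.7, 0.3, size=n),
    "availability_30": rng.integers(0, 31, size=n),
})

# Encode the 't'/'f' flag as 1/0 so it can enter a correlation matrix.
toy["host_is_superhost"] = toy["host_is_superhost"].map({"f": 0, "t": 1})

# Correlation of each numeric predictor with the target, strongest first.
corr_with_target = (
    toy.select_dtypes("number")
    .corr()["host_is_superhost"]
    .drop("host_is_superhost")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

On the real data the same pattern would apply after the non-numeric columns have been converted or excluded.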

I also observed that several predictors, particularly the review-related ones, followed similar trends and were highly correlated with each other. This redundancy points to potential multicollinearity, which calls for careful feature selection or regularization during model development.
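One simple way to surface such redundant predictors is to list the predictor pairs whose absolute correlation exceeds a cutoff. The sketch below uses synthetic data, and the 0.8 cutoff is an arbitrary assumption rather than a value taken from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: two review scores share a common component, one does not.
rng = np.random.default_rng(1)
n = 300
base = rng.normal(size=n)
toy = pd.DataFrame({
    "review_scores_rating": base + rng.normal(scale=0.1, size=n),
    "review_scores_accuracy": base + rng.normal(scale=0.1, size=n),
    "availability_30": rng.normal(size=n),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Assumed cutoff for "highly correlated"; tune to taste.
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one member, combining them, or relying on regularization.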

While examining the class distribution in the training data, I noted a slight imbalance: 2,793 rows (about 56%) belong to class 0 (not a superhost) and 2,184 rows (about 44%) belong to class 1. With an imbalance of this kind, techniques such as oversampling the minority class or using a weighted loss function can reduce the effect of the skew during model training.
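As an illustration of the weighted-loss option, the sketch below computes "balanced" class weights from the class counts reported for the training data and passes the same setting to scikit-learn's `LogisticRegression`. The feature `X` is synthetic and exists only so the fit runs; this is not the model used in this report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from value_counts() on the training data.
y = np.array([0] * 2793 + [1] * 2184)

# Balanced weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)

# Synthetic single feature (an assumption), only to show the weights in a fit.
rng = np.random.default_rng(0)
X = (y + rng.normal(scale=1.0, size=y.size)).reshape(-1, 1)
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

With an imbalance this mild the two weights stay close to 1, so the effect on the fit is expected to be small.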

## 2) Data Cleaning/Preparation

Data cleaning steps taken to prepare the data are included below.

In [36]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [37]:
data_train = pd.read_csv('train_classification.csv')

data_train['host_is_superhost'] = data_train['host_is_superhost'].replace({'f': 0, 't': 1})

# Look at predictors
data_train.info()

data_train.host_is_superhost.value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4977 entries, 0 to 4976
Data columns (total 53 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            4977 non-null   int64  
 1   host_id                                       4977 non-null   int64  
 2   host_since                                    4977 non-null   object 
 3   host_location                                 4019 non-null   object 
 4   host_response_time                            4560 non-null   object 
 5   host_response_rate                            4560 non-null   object 
 6   host_acceptance_rate                          4687 non-null   object 
 7   host_is_superhost                             4977 non-null   int64  
 8   host_neighbourhood                            4855 non-null   object 
 9   host_listings_count                           4977 non-null   i

0    2793
1    2184
Name: host_is_superhost, dtype: int64

In [38]:
# impute missing values: numeric columns with their column mean,
# remaining (non-numeric) columns by forward fill
data_train = data_train.fillna(data_train.mean(numeric_only=True))
data_train = data_train.ffill()

print(data_train.isnull().sum())

id                                              0
host_id                                         0
host_since                                      0
host_location                                   0
host_response_time                              0
host_response_rate                              0
host_acceptance_rate                            0
host_is_superhost                               0
host_neighbourhood                              0
host_listings_count                             0
host_total_listings_count                       0
host_verifications                              0
host_has_profile_pic                            0
host_identity_verified                          0
neighbourhood_cleansed                          0
latitude                                        0
longitude                                       0
property_type                                   0
room_type                                       0
accommodates                                    0


In [39]:
# bathrooms text (missing values): create 2 different predictors: extract number, and binary column if the word "shared" is in it

data_train['bathrooms'] = data_train['bathrooms_text'].str.extract(r'(\d*\.?\d+)', expand=False).astype(float)

# Detect if the word "shared" is present
data_train['is_shared'] = data_train['bathrooms_text'].str.contains('shared', case=False)
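To see what the two expressions above produce, here is a small check on hand-written strings. The sample labels are assumptions about what `bathrooms_text` may contain; in particular, whether "half-bath" style labels occur in this dataset is not confirmed here. The 0.5 fallback at the end is one possible treatment, not part of the original preparation.

```python
import pandas as pd

# Hypothetical examples of bathrooms_text values.
toy = pd.Series(["1 bath", "1.5 shared baths", "2 baths", "Half-bath", "Shared half-bath"])

# Same extraction as in the cell above: first number in the text, as a float.
bathrooms = toy.str.extract(r'(\d*\.?\d+)', expand=False).astype(float)

# Same flag as above: True when the word "shared" appears, ignoring case.
is_shared = toy.str.contains('shared', case=False)

# Labels without any digit yield NaN; one option is to map "half" labels to 0.5.
is_half = toy.str.contains('half', case=False)
bathrooms_fixed = bathrooms.mask(bathrooms.isna() & is_half, 0.5)

print(pd.DataFrame({"text": toy, "bathrooms": bathrooms,
                    "is_shared": is_shared, "bathrooms_fixed": bathrooms_fixed}))
```

Text with no digits therefore yields a missing `bathrooms` value, which matters because the mean imputation ran in an earlier cell.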

In [40]:
# pick fancy neighborhoods for predictor
fancy_neighborhoods = [
    "Cambridge", "River North", "Logan Square", "West Town", "Lincoln Park",
    "Lake View East", "Near North Side", "Streeterville", "Chicago Loop", "Bucktown",
    "Lakeview", "Lincoln Square", "West Loop/Greektown", "Andersonville", "West Loop",
    "North Center", "Old Town", "Portage Park", "Pilsen", "Uptown", "Lower West Side"
]

data_train['fancy'] = data_train['host_neighbourhood'].isin(fancy_neighborhoods)

In [41]:
# convert to numeric
data_train.host_response_rate = data_train.host_response_rate.str.replace('%', '').str.replace(',','').astype(float)
data_train.host_acceptance_rate = data_train.host_acceptance_rate.str.replace('%', '').str.replace(',','').astype(float)

In [42]:
# host location
data_train['host_location_category'] = data_train['host_location'].apply(lambda x: x if x in ['Chicago, IL', 'New York, NY'] else 'Other')

dummy_variables = pd.get_dummies(data_train['host_location_category'], prefix='host_location')

In [43]:
# convert to datetime, then to a day count relative to a fixed reference date
# (counts are negative for dates after the reference date)
reference_date = pd.to_datetime('2015-01-01')

data_train['host_since'] = pd.to_datetime(data_train['host_since'])
data_train['days_since_host'] = (reference_date - data_train['host_since']).dt.days

data_train['first_review'] = pd.to_datetime(data_train['first_review'])
data_train['days_since_first'] = (reference_date - data_train['first_review']).dt.days

data_train['last_review'] = pd.to_datetime(data_train['last_review'])
data_train['days_since_last'] = (reference_date - data_train['last_review']).dt.days

In [44]:
# apply the same preparation steps to the test data

data_test = pd.read_csv('test_classification.csv')

# convert the rate columns to numeric before imputing, so that missing rates
# can be filled with the numeric training means
data_test.host_response_rate = data_test.host_response_rate.str.replace('%', '').str.replace(',','').astype(float)
data_test.host_acceptance_rate = data_test.host_acceptance_rate.str.replace('%', '').str.replace(',','').astype(float)

# impute numeric columns with the training means, then fill the rest forward and backward
data_test = data_test.fillna(data_train.mean(numeric_only=True))
data_test = data_test.ffill()
data_test = data_test.bfill()

data_test['bathrooms'] = data_test['bathrooms_text'].str.extract(r'(\d*\.?\d+)', expand=False).astype(float)

data_test['is_shared'] = data_test['bathrooms_text'].str.contains('shared', case=False)

data_test['fancy'] = data_test['host_neighbourhood'].isin(fancy_neighborhoods)

data_test['host_location_category'] = data_test['host_location'].apply(lambda x: x if x in ['Chicago, IL', 'New York, NY'] else 'Other')

dummy_variables = pd.get_dummies(data_test['host_location_category'], prefix='host_location')

# same reference date as for the training data
reference_date = pd.to_datetime('2015-01-01')

data_test['host_since'] = pd.to_datetime(data_test['host_since'])
data_test['days_since_host'] = (reference_date - data_test['host_since']).dt.days

data_test['first_review'] = pd.to_datetime(data_test['first_review'])
data_test['days_since_first'] = (reference_date - data_test['first_review']).dt.days

data_test['last_review'] = pd.to_datetime(data_test['last_review'])
data_test['days_since_last'] = (reference_date - data_test['last_review']).dt.days



## 3) Developing the Model

In building the final model, I combined the insights from the data exploration with domain reasoning. I prioritized predictors that correlated strongly with superhost status: review scores, host response and acceptance rates, tenure as a host, and the number of listings hosted. These factors match the expectation that superhosts are experienced and hospitable, so it is reasonable that they influence the designation.

For transformations, I reused approaches that had worked well in my previous regression modeling. I selected predictors by strength and added higher-order terms one at a time, stopping once a new term was statistically insignificant (p-value above 0.05), in order to capture nonlinear relationships while limiting the risk of overfitting. I also explored interactions, guided by both statistical and intuitive considerations: interactions between highly correlated terms, such as the availability metrics, and between intuitively related predictors, such as days since becoming a host and the host's listing count, which tend to increase together.
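The p-value check for a higher-order term can be sketched as below. The data are synthetic, with log-odds that are quadratic in `x` by construction, so the squared term is expected to pass the 0.05 threshold. The variable names are illustrative and not taken from the model in this report.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic binary outcome whose true log-odds include a squared term.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 0.5 * x - 1.0 * x**2)))
toy = pd.DataFrame({"y": rng.binomial(1, p), "x": x})

# Fit with the candidate higher-order term included.
fit = smf.logit("y ~ x + I(x**2)", data=toy).fit(disp=0)

# Look up the squared term by prefix, since the formula parser may reformat its name.
sq_name = [name for name in fit.pvalues.index if name.startswith("I(")][0]
p_sq = fit.pvalues[sq_name]

# Keep the term only if it is significant at the 5% level.
keep_squared = p_sq < 0.05
print(sq_name, p_sq, keep_squared)
```

Repeating this for a cubic term, and stopping at the first insignificant one, mirrors the iterative procedure described above.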

After fitting the model, I applied a technique introduced in class to improve its performance. Superhost status is an attribute of the host rather than of the listing, so for every test listing whose host ID also appears in the training data, I replaced the model's prediction with that host's known status from the training set. The model's prediction is used only for hosts that do not appear in the training data.

## 4) Model

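The final model is the logistic regression fit below. The fit stopped after 35 iterations without converging (converged: False in the summary), so the coefficient estimates should be read with caution.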

In [45]:
formula = ('host_is_superhost~'
           'host_acceptance_rate+host_response_rate+beds+host_listings_count+instant_bookable+'
           'calculated_host_listings_count_entire_homes+calculated_host_listings_count_private_rooms+'
           'calculated_host_listings_count_shared_rooms+reviews_per_month+latitude+'
           'host_location_category+number_of_reviews*reviews_per_month+number_of_reviews_l30d+review_scores_rating+'
           'I(review_scores_rating**2)+review_scores_accuracy+review_scores_cleanliness+review_scores_communication+review_scores_location+'
           'review_scores_value+instant_bookable*beds+I(latitude**2)+days_since_host+I(days_since_host**2)+'
           'days_since_host*host_listings_count+availability_30+days_since_last+I(days_since_last**2)+'
           'I(host_acceptance_rate**2)+fancy+instant_bookable*host_acceptance_rate+'
           'fancy*host_acceptance_rate+host_has_profile_pic*host_listings_count+host_has_profile_pic*days_since_host')

model = smf.logit(formula=formula, data=data_train).fit()

model.summary()

         Current function value: 0.457186
         Iterations: 35




| Dep. Variable: | host_is_superhost | No. Observations: | 4977 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 4939 |
| Method: | MLE | Df Model: | 37 |
| Date: | Tue, 27 Feb 2024 | Pseudo R-squ.: | 0.3332 |
| Time: | 12:46:50 | Log-Likelihood: | -2275.4 |
| converged: | False | LL-Null: | -3412.4 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---:|---:|---:|---:|---:|---:|
| Intercept | 2.284e+04 | 1.17e+04 | 1.944 | 0.052 | -184.742 | 4.59e+04 |
| instant_bookable[T.t] | -9.0326 | 1.879 | -4.807 | 0.000 | -12.715 | -5.350 |
| host_location_category[T.New York, NY] | -0.2016 | 0.262 | -0.770 | 0.441 | -0.715 | 0.311 |
| host_location_category[T.Other] | -0.4289 | 0.120 | -3.569 | 0.000 | -0.664 | -0.193 |
| fancy[T.True] | 3.1342 | 0.737 | 4.253 | 0.000 | 1.690 | 4.579 |
| host_has_profile_pic[T.t] | -1.1982 | 1.556 | -0.770 | 0.441 | -4.249 | 1.852 |
| host_acceptance_rate | 0.2664 | 0.034 | 7.786 | 0.000 | 0.199 | 0.333 |
| instant_bookable[T.t]:host_acceptance_rate | 0.0847 | 0.019 | 4.423 | 0.000 | 0.047 | 0.122 |
| fancy[T.True]:host_acceptance_rate | -0.0316 | 0.008 | -4.082 | 0.000 | -0.047 | -0.016 |
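The next cell uses a variable named `y_pred` whose definition does not appear in this notebook. The sketch below shows one common way such predictions are obtained from a fitted statsmodels logit model: predict probabilities for the test rows and apply a cutoff. The 0.5 cutoff, the toy data, and the names `toy_model` and `toy_pred` are assumptions for illustration, not the settings actually used.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def make_toy(n):
    """Synthetic data with one predictor and a binary outcome."""
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-2 * x))
    return pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

toy_train = make_toy(500)
toy_test = make_toy(100).drop(columns="y")

toy_model = smf.logit("y ~ x", data=toy_train).fit(disp=0)

# predict() returns the estimated probability of class 1 for each test row.
probs = toy_model.predict(toy_test)

# Assumed decision threshold; the value used in the original run is not shown.
threshold = 0.5
toy_pred = probs >= threshold
print(toy_pred.value_counts())
```

A threshold other than 0.5 may be preferable when the classes are imbalanced or when the two error types carry different costs.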


In [46]:
# y_pred holds the model's predictions for data_test (its definition is not shown in this notebook)
data_train['host_is_superhost'] = data_train['host_is_superhost'].map({0: False, 1: True}).astype(bool)

# one row per host, with the superhost status known from the training data
unique_host_ids = data_train[['host_id', 'host_is_superhost']].drop_duplicates(subset=['host_id'])

# take a copy so that adding a column does not modify a slice of data_test
test_subset = data_test[['host_id', 'id']].copy()
test_subset['predicted'] = y_pred

# left merge keeps every test listing; hosts not seen in training get NaN
merged_data = pd.merge(test_subset, unique_host_ids, on='host_id', how='left')

# Create a new column "final_predictions" and fill it with host_is_superhost where available
merged_data['final_predictions'] = merged_data.groupby('host_id')['host_is_superhost'].transform('first')

# Replace missing values in "final_predictions" with predictions from the model
merged_data['final_predictions'] = merged_data['final_predictions'].fillna(merged_data['predicted'])

# Keep only the listing id and the final prediction
final_output = pd.DataFrame({'id': merged_data.id, 'predicted': merged_data.final_predictions})
# Save final output
final_output.to_csv('final_classification_predictions.csv', index=False)



In [47]:
final_output.shape

(3324, 2)