### To determine feature importance in relation to price, I first need to ensure that all features are correctly encoded and that NaN values have been handled.

In [2]:
import pandas as pd
import gzip
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate

In [84]:
# Listings: full descriptions and average review score
# Reviews: unique id for each reviewer and detailed comments
# Calendar: listing id plus the price and availability for each day

In [85]:
names = ['calendar', 'listings', 'reviews']
dataframes = {}

for name in names:
    # Define the file path
    file_path = 'data/' + name + '.csv.gz'
    # Use gzip.open to decompress the file and then read it with Pandas
    with gzip.open(file_path, 'rt', encoding='utf-8') as file:
        data = pd.read_csv(file)

    dataframes[name] = data
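
As a side note on the loading loop above: pandas can usually decompress `.csv.gz` files on its own, because `read_csv` infers the compression from the file suffix. The sketch below writes a tiny throwaway file (the path and contents are made up for illustration) and reads it back without `gzip.open`.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a tiny gzip-compressed CSV so the sketch is self-contained
# (hypothetical demo file, not one of the real files under data/).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write("id,price\n1,$100.00\n2,$85.00\n")

# compression="infer" is the default, so the .gz suffix is enough for
# pandas to decompress the file while reading it.
demo = pd.read_csv(demo_path)
print(demo.shape)
```

Both approaches should give the same DataFrame; the explicit `gzip.open` version is only needed when the suffix does not reveal the compression.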

In [86]:
listings = dataframes['listings']

## Features that should be removed

#### listing_url, scrape_id, last_scraped, source, host_url, host_name, host_thumbnail_url, host_picture_url, license (a license is required in some places, but customers do not see it), calendar_last_scraped, calendar_updated (NaN only), name, picture_url



In [87]:
listings = listings.drop(columns=['listing_url', 'scrape_id', 'last_scraped', 'source', 'host_url', 
    'host_name', 'host_thumbnail_url', 'host_picture_url', 'license', 'calendar_last_scraped', 
    'calendar_updated', 'name', 'picture_url'])

## Features that require text search
#### description, neighborhood_overview, host_about




In [88]:
listings = listings.drop(columns=["description", "neighborhood_overview", "host_about"])

#### Save listings_revised so it can be loaded quickly in other notebooks

In [89]:
listings.to_csv("listings_revised.csv", index=False)

## Features that should be removed because of many NaN values:
#### neighbourhood (can be replaced by clustering on lat/lon), bathrooms (NaN imputation may be possible from bathrooms_text), bedrooms (NaN imputation possible?)

In [90]:
listings = listings.drop(columns=["neighbourhood", "bathrooms", "bedrooms"])
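
The imputation idea mentioned for bathrooms could look roughly like the sketch below, which recovers a numeric value from `bathrooms_text`. The example strings ("1 bath", "1.5 shared baths", "Half-bath") are assumptions about how the column is formatted, and `bathrooms_from_text` is a hypothetical helper, so check the real values before relying on it.

```python
import re

import numpy as np
import pandas as pd


def bathrooms_from_text(text):
    """Return the number of bathrooms described in a bathrooms_text value."""
    if pd.isna(text):
        return np.nan
    lowered = str(text).lower()
    # Entries such as "Half-bath" carry no digit (assumed format).
    if "half" in lowered and not re.search(r"\d", lowered):
        return 0.5
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    return float(match.group()) if match else np.nan


# Made-up rows standing in for the real column.
demo = pd.DataFrame(
    {"bathrooms_text": ["1 bath", "1.5 shared baths", "Half-bath", None]}
)
demo["bathrooms"] = demo["bathrooms_text"].apply(bathrooms_from_text)
print(demo)
```

Rows where the text is missing or contains no recognisable number stay NaN, so they can still be dropped or imputed later.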

### Remove all max/min nights features except maximum_nights and minimum_nights. They are not strictly useless, but they will undoubtedly correlate strongly with maximum_nights and minimum_nights, and it is unclear how much extra explanatory power they add.

### Keep maximum_nights_avg_ntm and minimum_nights_avg_ntm (ntm = next twelve months) for now, because they capture something slightly different; I think they reflect the longest and shortest planned/booked stays over the next twelve months.

In [91]:
listings = listings.drop(columns=["minimum_minimum_nights", "maximum_minimum_nights", 
        "minimum_maximum_nights", "maximum_maximum_nights"])

In [58]:
listings.columns

Index(['id', 'host_id', 'host_since', 'host_location', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates',
       'bathrooms_text', 'beds', 'amenities', 'price', 'minimum_nights',
       'maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
       'has_availability', 'availability_30', 'availability_60',
       'availability_90', 'availability_365', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scor

## We need to deal with NaN values:

- NaN values for host_location, host_since and host_neighbourhood are likely due to the host simply not wanting to share this information in the listing. They can be replaced by a simple "Not Disclosed" value.

- NaN values for host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified and host_is_superhost are likely plain missing values caused by errors when the data was created. Since each of these has NaN values in at most 0.66 percent of rows, we simply discard those rows.

- As seen in eda_response_acceptance_rate, I did not manage to find good explanations for most of the missing values in host_response_rate, host_response_time and host_acceptance_rate. However, from eda_reviews, it is evident that these features have almost no correlation with any of the rating features. To my mind, the only way in which acceptance and response rates could affect price would be through offering "superior service". That effect would only be realistic if better response and acceptance rates actually affected ratings, especially communication and check-in. My conclusion is therefore that these rates are unlikely to affect the price of a listing, and for that reason (together with the large, unexplained number of NaN values) I will remove these columns.

- From eda_reviews, we found that NaN values for the overall rating are likely caused by a lack of reviews (number_of_reviews == 0), whilst NaN values for the more detailed ratings, such as cleanliness or accuracy, are likely caused by the tenant choosing to give only a general review rather than an in-depth one. Therefore, a NaN value for the overall rating can be replaced by the median of all ratings, and an additional binary variable can be added to indicate that the value was imputed (i.e. "Has Been Reviewed 1/0"). The same can be done for all detailed rating columns ("Detailed Rating 1/0"). However, as was uncovered in eda_response_acceptance_rate, there was no clear correlation between response_rate/acceptance_rate and the ratings for communication and check-in. This suggests that tenants give a score corresponding to their overall enjoyment rather than rating the host on separate merits for each rating feature. Therefore, the easiest and perhaps best solution is to remove the finer-granularity ratings and simply retain the overall score with imputed values.

- last_review, as discussed in eda_reviews, is likely not that impactful and is essentially covered by other features. first_review is the closest proxy we have for the creation date of the listing (unless we go into the calendar data and find the first booking), so it may be useful to keep this feature. It is logical to assume that the lack of a review is often tied to the listing not having been active for very long, so the best NaN imputation strategy may be to simply use the latest date in the first_review column as the placeholder value. NaN values in reviews_per_month can be replaced by 0, since we discovered that they are caused by a lack of reviews.
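
The claim that missing overall ratings coincide with listings that have no reviews can be checked with a cross-tabulation. This is a minimal sketch on made-up data: the column names match the listings table, but the values and the `share_explained` name are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy data standing in for listings; the values are made up.
demo = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12, 5],
    "review_scores_rating": [np.nan, 4.8, np.nan, 4.5, 5.0],
})

no_reviews = demo["number_of_reviews"].eq(0)
rating_missing = demo["review_scores_rating"].isna()

# Cross-tabulate "has no reviews" against "rating is missing".
check = pd.crosstab(
    no_reviews.rename("no_reviews"),
    rating_missing.rename("rating_missing"),
)
print(check)

# Share of missing ratings that belong to listings without reviews.
share_explained = (no_reviews & rating_missing).sum() / rating_missing.sum()
print(f"Share of missing ratings explained by zero reviews: {share_explained:.2f}")
```

A share close to 1 on the real data would support imputing the median and keeping an indicator column, as described above.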

In [92]:
# Calculate the percentage of NaN values in each column
nan_percentage = (listings.isna().mean() * 100).round(2)

# Convert the nan_percentage Series to a DataFrame for formatting
nan_percentage_df = nan_percentage.reset_index()
nan_percentage_df.columns = ['Column', 'NaN Percentage']

# Print the nan_percentage as a nicely formatted table
print(tabulate(nan_percentage_df, headers='keys', tablefmt='pretty'))

print("Number of rows: " + str(listings.shape[0]))


+----+----------------------------------------------+----------------+
|    |                    Column                    | NaN Percentage |
+----+----------------------------------------------+----------------+
| 0  |                      id                      |      0.0       |
| 1  |                   host_id                    |      0.0       |
| 2  |                  host_since                  |      0.01      |
| 3  |                host_location                 |     21.34      |
| 4  |              host_response_time              |     33.96      |
| 5  |              host_response_rate              |     33.96      |
| 6  |             host_acceptance_rate             |     30.69      |
| 7  |              host_is_superhost               |      0.66      |
| 8  |              host_neighbourhood              |     19.98      |
| 9  |             host_listings_count              |      0.01      |
| 10 |          host_total_listings_count           |      0.01      |
| 11 |
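
Since the same NaN summary is computed several times in this notebook, it could be wrapped in a small helper. The sketch below is one possible version; `nan_report` is a hypothetical name, and it prints with pandas instead of tabulate only to stay self-contained.

```python
import numpy as np
import pandas as pd


def nan_report(df):
    """Return the percentage of NaN values per column, highest first."""
    report = (df.isna().mean() * 100).round(2)
    report = report.rename("NaN Percentage").rename_axis("Column").reset_index()
    return report.sort_values("NaN Percentage", ascending=False, ignore_index=True)


# Made-up frame: column "a" has one missing value out of four.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})
report = nan_report(demo)
print(report.to_string(index=False))
```

Sorting by percentage puts the columns that need attention at the top, which is easier to scan than the original column order.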

First I replace host information NaN values

In [93]:
columns_to_replace = ['host_since', 'host_location', 'host_neighbourhood']

listings[columns_to_replace] = listings[columns_to_replace].fillna('Not Disclosed')


Remove columns we deemed not relevant and having high NaN percentage

In [94]:
listings = listings.drop(columns=['host_response_time', 'host_response_rate', 'host_acceptance_rate'])


Replacing NaN values for review_scores_rating by the median of the column and adding a binary variable where "1" means the value was imputed. 

In [95]:
listings['overall_rating_is_imputed'] = listings['review_scores_rating'].isna().astype(int)


In [96]:
median_score = listings['review_scores_rating'].median()

print("Median score: "+ str(median_score))


Median score: 4.83


In [97]:
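# Assign the result back; inplace=True on a column selection triggers a warning in recent pandas versions
listings['review_scores_rating'] = listings['review_scores_rating'].fillna(median_score)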


Removing all finer granularity ratings: communication, accuracy, cleanliness, checkin, value, location

In [98]:
listings = listings.drop(columns=['review_scores_communication', 'review_scores_accuracy', 
        'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_value', 'review_scores_location'])


We remove last_review as discussed and impute first_review by choosing the latest date existing in the first_review column (want this column to serve as a proxy for creation date of the listing). Replace reviews_per_month NaN values by 0.

In [99]:
listings = listings.drop(columns=['last_review'])

In [100]:
listings['first_review'] = pd.to_datetime(listings['first_review'])
listings['first_review'] = listings['first_review'].dt.date



In [101]:
most_recent_date = listings['first_review'].dropna().max()

print("Most recent date: " + str(most_recent_date))


Most recent date: 2023-10-01


In [102]:
listings['first_review'] = listings['first_review'].fillna(most_recent_date)


In [103]:
listings['reviews_per_month'] = listings['reviews_per_month'].fillna(0)
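
Because first_review is meant to act as a proxy for the listing's creation date, one way to make it usable as a numeric feature is to convert it to an age in days. This is only a sketch on made-up dates, and `days_since_first_review` is a hypothetical column name that does not appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up dates; None stands in for a listing without any review.
demo = pd.DataFrame({"first_review": ["2019-05-01", None, "2023-10-01"]})
demo["first_review"] = pd.to_datetime(demo["first_review"])

# Same imputation as above: missing dates become the latest observed date.
latest = demo["first_review"].max()
demo["first_review"] = demo["first_review"].fillna(latest)

# Numeric encoding: days between the first review and the latest date.
demo["days_since_first_review"] = (latest - demo["first_review"]).dt.days
print(demo)
```

With this encoding, imputed listings get an age of 0 days, which matches the assumption that unreviewed listings are new.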


We now see that the remaining features with NaN values account for only a bit over 1 % of the total dataset. Since we haven't found clever ways to impute these columns, and the gaps are likely just missing data, the last step in the NaN handling is to remove every row that still has a NaN value in any of the remaining features.

In [104]:
# Calculate the percentage of NaN values in each column
nan_percentage = (listings.isna().mean() * 100).round(2)

# Convert the nan_percentage Series to a DataFrame for formatting
nan_percentage_df = nan_percentage.reset_index()
nan_percentage_df.columns = ['Column', 'NaN Percentage']

# Print the nan_percentage as a nicely formatted table
print(tabulate(nan_percentage_df, headers='keys', tablefmt='pretty'))

+----+----------------------------------------------+----------------+
|    |                    Column                    | NaN Percentage |
+----+----------------------------------------------+----------------+
| 0  |                      id                      |      0.0       |
| 1  |                   host_id                    |      0.0       |
| 2  |                  host_since                  |      0.0       |
| 3  |                host_location                 |      0.0       |
| 4  |              host_is_superhost               |      0.66      |
| 5  |              host_neighbourhood              |      0.0       |
| 6  |             host_listings_count              |      0.01      |
| 7  |          host_total_listings_count           |      0.01      |
| 8  |              host_verifications              |      0.01      |
| 9  |             host_has_profile_pic             |      0.01      |
| 10 |            host_identity_verified            |      0.01      |
| 11 |

In [105]:
listings_cleaned = listings.dropna()
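
The number of rows lost to `dropna()` can be measured directly by comparing lengths before and after. A minimal sketch on made-up data (`rows_lost` and `share_lost` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up frame: two of the four rows contain a missing value.
demo = pd.DataFrame({
    "beds": [1.0, np.nan, 2.0, 3.0],
    "host_is_superhost": ["t", "f", None, "f"],
})

cleaned = demo.dropna()
rows_lost = len(demo) - len(cleaned)
share_lost = 100 * rows_lost / len(demo)
print(f"Dropped {rows_lost} of {len(demo)} rows ({share_lost:.1f} %)")
```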


### Here is the final result, with all NaN values handled. We lost only roughly 1000 rows from a dataset of 38'000, not too shabby!

In [106]:
# Calculate the percentage of NaN values in each column
nan_percentage = (listings_cleaned.isna().mean() * 100).round(2)

# Convert the nan_percentage Series to a DataFrame for formatting
nan_percentage_df = nan_percentage.reset_index()
nan_percentage_df.columns = ['Column', 'NaN Percentage']

# Print the nan_percentage as a nicely formatted table
print(tabulate(nan_percentage_df, headers='keys', tablefmt='pretty'))


print("Number of rows: " + str(listings_cleaned.shape[0]))


+----+----------------------------------------------+----------------+
|    |                    Column                    | NaN Percentage |
+----+----------------------------------------------+----------------+
| 0  |                      id                      |      0.0       |
| 1  |                   host_id                    |      0.0       |
| 2  |                  host_since                  |      0.0       |
| 3  |                host_location                 |      0.0       |
| 4  |              host_is_superhost               |      0.0       |
| 5  |              host_neighbourhood              |      0.0       |
| 6  |             host_listings_count              |      0.0       |
| 7  |          host_total_listings_count           |      0.0       |
| 8  |              host_verifications              |      0.0       |
| 9  |             host_has_profile_pic             |      0.0       |
| 10 |            host_identity_verified            |      0.0       |
| 11 |

Write it to csv

In [108]:
listings_cleaned.to_csv("listings_cleaned.csv", index=False)

Load it so we can start from here (no need to repeat all the imputations above every time)

In [16]:
listings_cleaned = pd.read_csv("listings_cleaned.csv")

### Now, to counter the effects of the curse of dimensionality, it is necessary to examine which features can be removed:

- host_id may be interesting because it accounts for differences in host behaviour, but this effect is likely already captured by the rating feature. Therefore, given the very large number of unique hosts (23'000), we remove this feature. The listing id (id) should also be removed, since it is only an identifier.

- Without having checked it, it is very likely that the four availability columns for different time horizons are strongly correlated; keeping 30 days and 365 days seems the most obvious choice.

- We don't need three different number_of_reviews columns for different time horizons. The best choice is probably to remove the 30-day one, since it covers such a short time span. We keep last twelve months (ltm) and total.

- Having both host_total_listings_count and host_listings_count, as well as listings counts for entire homes, private rooms and shared rooms seems excessive. Having a total_listings_count column could be a good indicator of the host's experience as a host, and should be kept. The remaining columns are likely not that vital.
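
The assumption that the availability columns are strongly correlated was not verified above; a correlation matrix would be a quick way to check it. The sketch below uses synthetic data in which each longer horizon is built from the shorter one, which is itself an assumption about the real columns, so only the method carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: each longer horizon adds available days to the
# shorter one (an assumption about how the real columns relate).
a30 = rng.integers(0, 31, size=500)
a60 = a30 + rng.integers(0, 31, size=500)
a90 = a60 + rng.integers(0, 31, size=500)
a365 = a90 + rng.integers(0, 276, size=500)

demo = pd.DataFrame({
    "availability_30": a30,
    "availability_60": a60,
    "availability_90": a90,
    "availability_365": a365,
})

# Pairwise Pearson correlation between the four horizons.
corr = demo.corr()
print(corr.round(2))
```

Running `listings_cleaned[[...]].corr()` on the real availability and review-count columns would show whether dropping the intermediate horizons loses much information.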


In [17]:
listings_cleaned = listings_cleaned.drop(columns=[
    'id', 'host_id', 'host_listings_count', 'calculated_host_listings_count',
    'calculated_host_listings_count_entire_homes',
    'calculated_host_listings_count_private_rooms',
    'calculated_host_listings_count_shared_rooms',
    'availability_60', 'availability_90', 'number_of_reviews_l30d'])


In [18]:
# Calculate the percentage of NaN values in each column
nan_percentage = (listings_cleaned.isna().mean() * 100).round(2)

# Convert the nan_percentage Series to a DataFrame for formatting
nan_percentage_df = nan_percentage.reset_index()
nan_percentage_df.columns = ['Column', 'NaN Percentage']

# Print the nan_percentage as a nicely formatted table
print(tabulate(nan_percentage_df, headers='keys', tablefmt='pretty'))


print("Number of rows: " + str(listings_cleaned.shape[0]))


+----+------------------------------+----------------+
|    |            Column            | NaN Percentage |
+----+------------------------------+----------------+
| 0  |          host_since          |      0.0       |
| 1  |        host_location         |      0.0       |
| 2  |      host_is_superhost       |      0.0       |
| 3  |      host_neighbourhood      |      0.0       |
| 4  |  host_total_listings_count   |      0.0       |
| 5  |      host_verifications      |      0.0       |
| 6  |     host_has_profile_pic     |      0.0       |
| 7  |    host_identity_verified    |      0.0       |
| 8  |    neighbourhood_cleansed    |      0.0       |
| 9  | neighbourhood_group_cleansed |      0.0       |
| 10 |           latitude           |      0.0       |
| 11 |          longitude           |      0.0       |
| 12 |        property_type         |      0.0       |
| 13 |          room_type           |      0.0       |
| 14 |         accommodates         |      0.0       |
| 15 |    

In [21]:
listings_cleaned.to_csv("listings_cleaned_cleaved.csv", index=False)