# Insights into Urban Dynamics:<br>Analyzing Airbnb Reviews and Neighborhood Metrics

## Project Overview

This study aims to investigate the correlation between subjective Airbnb reviews and objective neighborhood metrics in select US cities. It involves analyzing crime statistics, demographics, socioeconomic indicators, and environmental quality to understand how they relate to sentiments expressed in Airbnb reviews.

---
### Objectives

- Investigate correlations between subjective reviews and quantifiable neighborhood attributes in targeted US cities.
- Understand how guest experiences align with tangible neighborhood characteristics.

---
### Methodologies and Tools

- **Data Collection**: Utilize Python libraries (e.g., Pandas, Requests) for data collection and preprocessing.
- **Sentiment Analysis**: Implement TextBlob for sentiment analysis of Airbnb reviews and consider numerical ratings.
- **Correlation Techniques**: Employ regression analysis, correlation coefficients, and other techniques (cluster analysis, principal component analysis, etc.).
- **Visualization**: Use Matplotlib, Plotly, or Seaborn to visualize the relationships.
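
As a minimal sketch of the correlation step listed above, the snippet below compares an average sentiment score with one neighborhood metric. The table, its column names (`avg_sentiment_score`, `crime_rate_per_1000`), and every value in it are hypothetical placeholders, not project data. It reports Pearson's r and a Spearman-style rank correlation, computed here as Pearson's r on the ranks.

```python
import pandas as pd

# Hypothetical neighborhood-level table; all names and values are made up for illustration.
neighborhoods = pd.DataFrame({
    "neighbourhood": ["A", "B", "C", "D", "E"],
    "avg_sentiment_score": [0.42, 0.35, 0.30, 0.22, 0.15],
    "crime_rate_per_1000": [12.0, 18.0, 25.0, 31.0, 40.0],
})

sentiment = neighborhoods["avg_sentiment_score"]
crime = neighborhoods["crime_rate_per_1000"]

# Pearson's r measures linear association between the two columns.
pearson_r = sentiment.corr(crime)

# Rank correlation: Pearson's r applied to the ranks of each column.
spearman_r = sentiment.rank().corr(crime.rank())

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```

With real data, the same two calls would run on a listings table aggregated to the neighborhood level; a correlation found this way describes association only and says nothing about cause.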

---

### Import dependencies

In [35]:
import pandas as pd
import requests

In [36]:
#install for sentiment analysis
!pip install textblob



In [37]:
#install corpora
!python -m textblob.download_corpora

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

### Pull files for review listings

In [99]:
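# Read the listings CSV file into a pandas DataFrame
# NOTE: this URL points to the Montreal snapshot even though the variable is named for Seattle;
# the outputs below therefore show Montreal listings
seattle_listings = pd.read_csv('http://data.insideairbnb.com/canada/qc/montreal/2023-10-08/data/listings.csv.gz')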

In [100]:
seattle_listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,29059,https://www.airbnb.com/rooms/29059,20231008213543,2023-10-09,city scrape,Rental unit in Montreal · ★4.67 · 1 bedroom · ...,CITQ 267153<br />Lovely studio with 1 closed r...,,https://a0.muscache.com/pictures/736399/fa6c31...,125031,...,4.76,4.81,4.69,"267153, expires: 2024-03-31",f,2,2,0,0,2.71
1,29061,https://www.airbnb.com/rooms/29061,20231008213543,2023-10-09,city scrape,Home in Montreal · ★4.72 · 2 bedrooms · 2 beds...,Lovely historic house with plenty of period ch...,,https://a0.muscache.com/pictures/9e59d417-4b6a...,125031,...,4.81,4.87,4.72,"267153, expires: 2024-03-31",f,2,2,0,0,0.9
2,34715,https://www.airbnb.com/rooms/34715,20231008213543,2023-10-08,city scrape,Rental unit in Montreal · ★4.89 · 2 bedrooms ·...,Welcome to Montreal<br /><br />Looking for an ...,,https://a0.muscache.com/pictures/1209820/5968a...,149769,...,5.0,4.67,4.89,"261026, expires: 2023-10-31",f,1,1,0,0,0.06
3,36301,https://www.airbnb.com/rooms/36301,20231008213543,2023-10-08,city scrape,Rental unit in Montréal · ★4.88 · 1 bedroom · ...,"Enjoy the best of Montreal in this romantic, ...",The neighborhood is very lively while the stre...,https://a0.muscache.com/pictures/26c20544-475f...,381468,...,4.9,4.88,4.78,,f,7,7,0,0,0.48
4,38118,https://www.airbnb.com/rooms/38118,20231008213543,2023-10-08,city scrape,Rental unit in Montreal · ★4.50 · 3 bedrooms ·...,Nearest metro Papineau.<br /><br /><b>The spac...,,https://a0.muscache.com/pictures/213997/763ec1...,163569,...,4.81,4.63,4.38,,f,1,0,1,0,0.12


In [101]:
seattle_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8396 entries, 0 to 8395
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            8396 non-null   int64  
 1   listing_url                                   8396 non-null   object 
 2   scrape_id                                     8396 non-null   int64  
 3   last_scraped                                  8396 non-null   object 
 4   source                                        8396 non-null   object 
 5   name                                          8396 non-null   object 
 6   description                                   8349 non-null   object 
 7   neighborhood_overview                         4517 non-null   object 
 8   picture_url                                   8396 non-null   object 
 9   host_id                                       8396 non-null   i

In [102]:
#Check for null values in the listing dataset
null_percentage = (seattle_listings.isna().mean() * 100).round(2).sort_values(ascending=False)
print(null_percentage)

neighbourhood_group_cleansed    100.00
bathrooms                       100.00
calendar_updated                100.00
host_neighbourhood               57.68
host_about                       46.83
                                 ...  
maximum_minimum_nights            0.00
minimum_maximum_nights            0.00
maximum_maximum_nights            0.00
host_thumbnail_url                0.00
id                                0.00
Length: 75, dtype: float64


In [103]:
def clean_listings_file(df, columns_to_drop):
    # Drop columns with >50% NA
    threshold = len(df) * 0.5  # 50% threshold
    df = df.dropna(thresh=threshold, axis=1)

    # Identify columns containing 'host' in their names
    host_columns = [col for col in df.columns if 'host' in col]

    # Combine all columns to drop
    cols_to_drop = host_columns + columns_to_drop

    # Filter the columns that actually exist in the DataFrame
    cols_to_drop = [col for col in cols_to_drop if col in df.columns]

    # Drop columns from the DataFrame
    df = df.drop(cols_to_drop, axis=1)
    
    return df

# Use on Seattle:
specified_columns = [
    'listing_url', 'scrape_id', 'last_scraped', 'source',
    'host_thumbnail_url', 'host_picture_url', 'picture_url', 'calendar_last_scraped'
]

cleaned_seattle_listings = clean_listings_file(seattle_listings.copy(), specified_columns)


In [104]:
cleaned_seattle_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8396 entries, 0 to 8395
Data columns (total 44 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           8396 non-null   int64  
 1   name                         8396 non-null   object 
 2   description                  8349 non-null   object 
 3   neighborhood_overview        4517 non-null   object 
 4   neighbourhood                4517 non-null   object 
 5   neighbourhood_cleansed       8396 non-null   object 
 6   latitude                     8396 non-null   float64
 7   longitude                    8396 non-null   float64
 8   property_type                8396 non-null   object 
 9   room_type                    8396 non-null   object 
 10  accommodates                 8396 non-null   int64  
 11  bathrooms_text               8390 non-null   object 
 12  bedrooms                     6500 non-null   float64
 13  beds              

### Create review dataframes from compressed .gz files from url

In [None]:
#DON'T USE THIS BECAUSE THE RESULTING DF IS TOO LARGE
#function that uses gzip to pull compressed file from url
'''import requests
import gzip
import io

def fetch_csv_gz_from_url(url):
    try:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            compressed_file = io.BytesIO(response.content)
            with gzip.GzipFile(fileobj=compressed_file, mode='rb') as gz_file:
                with io.TextIOWrapper(gz_file, encoding='utf-8') as file:
                    df = pd.read_csv(file)
            print("DataFrame created successfully")
            return df
        else:
            print("Failed to download the file")
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage:
url = 'http://data.insideairbnb.com/united-states/wa/seattle/2023-09-18/data/reviews.csv.gz'
seattle_reviews = fetch_csv_gz_from_url(url)'''

### Create filtered review dataframes for 2022 and 2023 only

In [49]:
#function that uses gzip to pull compressed file from url, but you can filter based on the year of the 'date' column
import gzip
import io

def fetch_filtered_csv_gz_from_url(url, date_column, years):
    try:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            compressed_file = io.BytesIO(response.content)
            with gzip.GzipFile(fileobj=compressed_file, mode='rb') as gz_file:
                with io.TextIOWrapper(gz_file, encoding='utf-8') as file:
                    df = pd.read_csv(file)
            
            # Convert specified 'date_column' to datetime type
            df[date_column] = pd.to_datetime(df[date_column])
            
            # Filter rows based on the provided years in the specified 'date_column'
            # (.copy() so that adding columns later does not raise SettingWithCopyWarning)
            filtered_df = df[df[date_column].dt.year.isin(years)].copy()
            
            print(f"Filtered DataFrame for {years} created successfully based on '{date_column}' column")
            return filtered_df
        else:
            print("Failed to download the file")
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

#create dfs for our project
target_years = [2022, 2023]  # List of years to filter
date_column_name = 'date'  # Name of the date column in the reviews files

seattle_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/wa/seattle/2023-09-18/data/reviews.csv.gz', date_column_name, target_years)
la_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ca/los-angeles/2023-09-03/data/reviews.csv.gz', date_column_name, target_years)
oakland_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ca/oakland/2023-09-18/data/reviews.csv.gz', date_column_name, target_years)
boston_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ma/boston/2023-09-16/data/reviews.csv.gz', date_column_name, target_years)
nyc_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ny/new-york-city/2023-11-01/data/reviews.csv.gz', date_column_name, target_years)
neworleans_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/la/new-orleans/2023-09-03/data/reviews.csv.gz', date_column_name, target_years)
austin_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/tx/austin/2023-09-10/data/reviews.csv.gz', date_column_name, target_years)
chicago_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/il/chicago/2023-09-12/data/reviews.csv.gz', date_column_name, target_years)
nashville_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/tn/nashville/2023-09-16/data/reviews.csv.gz', date_column_name, target_years)


Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
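
Because the unfiltered reviews DataFrame was too large to work with, one possible alternative is to filter while reading instead of after loading everything. The sketch below is an assumption-laden variant, not the notebook's method: `read_reviews_for_years` is a hypothetical helper that uses the `chunksize` option of `pd.read_csv` so that only the matching rows of each chunk are kept. It runs on a tiny made-up file so that it works offline.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_reviews_for_years(path_or_url, date_column, years, chunksize=100_000):
    """Read a gzipped CSV in chunks, keeping only rows whose date falls in `years`.

    Hypothetical helper for illustration; assumes at least one chunk is read.
    """
    kept = []
    reader = pd.read_csv(
        path_or_url,
        compression="gzip",
        parse_dates=[date_column],
        chunksize=chunksize,
    )
    for chunk in reader:
        # Keep only the rows of this chunk that fall in the requested years.
        mask = chunk[date_column].dt.year.isin(years)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)


# Tiny made-up reviews file; the rows are illustrative only.
sample_csv = (
    "listing_id,date,comments\n"
    "1,2019-05-01,Old review\n"
    "1,2022-03-10,Great stay\n"
    "2,2023-07-04,Noisy street\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "reviews.csv.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        handle.write(sample_csv)
    recent_sample = read_reviews_for_years(sample_path, "date", [2022, 2023], chunksize=2)

print(recent_sample)
```

Peak memory then depends on the chunk size plus the rows that survive the filter, rather than on the full file.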


### Apply TextBlob to calculate sentiment of the comments and create new column with results
**Note**: This can take some time because TextBlob scores each row individually.

In [51]:
from textblob import TextBlob

#this can take a bit since TextBlob processes each text entry individually
def add_sentiment_column(df):
    """
    Apply TextBlob's sentiment analysis to a specified text column in a DataFrame
    and create a new column for sentiment scores.

    Returns:
    - Updated DataFrame with the new column for sentiment scores
    """
    df['sentiment_score'] = df['comments'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
    return df

#List of filtered dataframes
list_of_dataframes = [
    seattle_reviews_recent,
    la_reviews_recent,
    oakland_reviews_recent,
    boston_reviews_recent,
    nyc_reviews_recent,
    neworleans_recent,
    austin_recent,
    chicago_recent,
    nashville_reviews_recent
]

# Apply add_sentiment_column function to each DataFrame in the list
for i, df in enumerate(list_of_dataframes):
    list_of_dataframes[i] = add_sentiment_column(df)
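
One detail worth checking in the function above: `str(x)` turns a missing comment into the literal text `'nan'`, which is then scored like any other review and enters later averages. The sketch below shows one way to leave missing comments unscored. `add_sentiment_skip_missing` and `toy_scorer` are hypothetical names; the toy scorer stands in for TextBlob only so that the example runs without extra dependencies.

```python
import numpy as np
import pandas as pd


def add_sentiment_skip_missing(df, scorer, text_column="comments"):
    """Score only rows that have text; rows with missing text keep NaN."""
    result = df.copy()
    has_text = result[text_column].notna()
    result["sentiment_score"] = np.nan
    result.loc[has_text, "sentiment_score"] = result.loc[has_text, text_column].map(scorer)
    return result


def toy_scorer(text):
    # Stand-in scorer for illustration only. With TextBlob installed, one could
    # pass `lambda t: TextBlob(t).sentiment.polarity` instead.
    return 1.0 if "great" in text.lower() else -1.0


# Made-up reviews, including one missing comment.
reviews = pd.DataFrame({"comments": ["Great location", None, "Dirty room"]})
scored = add_sentiment_skip_missing(reviews, toy_scorer)
print(scored)
```

Since `groupby(...).mean()` skips NaN by default, unscored rows would then drop out of a per-listing average instead of pulling it toward whatever score the text `'nan'` receives.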

### Add the average sentiment score to the listings DataFrame (**not working yet**)

In [72]:
def merge_avg_sentiment_score(left_df, right_df, left_key, right_key, column_name):
    """
    Merge the average of a specific column from the left DataFrame to the right DataFrame based on specified key columns.

    Parameters:
    - left_df: The left DataFrame from which the average column will be calculated.
    - right_df: The right DataFrame to which the average column will be added.
    - left_key: The key column in the left DataFrame for grouping and merging.
    - right_key: The key column in the right DataFrame for merging.
    - column_name: The name of the column for which the average will be calculated and merged.

    Returns:
    - Updated DataFrame with the merged average column.
    """
    avg_sentiment = left_df.groupby(left_key)[column_name].mean().reset_index()
    right_df = right_df.merge(avg_sentiment, how='left', left_on=right_key, right_on=left_key)
    right_df.rename(columns={column_name: f'avg_{column_name}'}, inplace=True)
    return right_df

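# Example usage:
# NOTE: the merged columns are all NaN in the output below, most likely because the listings
# loaded earlier come from the Montreal file while these reviews are for Seattle,
# so no review listing_id matches a listing id
seattle_listings = merge_avg_sentiment_score(seattle_reviews_recent, cleaned_seattle_listings, 'listing_id', 'id', 'sentiment_score')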


In [73]:
# Inspect the merged result (the merge output was assigned to seattle_listings)
seattle_listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,listing_id,avg_sentiment_score
0,29059,https://www.airbnb.com/rooms/29059,20231008213543,2023-10-09,city scrape,Rental unit in Montreal · ★4.67 · 1 bedroom · ...,CITQ 267153<br />Lovely studio with 1 closed r...,,https://a0.muscache.com/pictures/736399/fa6c31...,125031,...,4.69,"267153, expires: 2024-03-31",f,2,2,0,0,2.71,,
1,29061,https://www.airbnb.com/rooms/29061,20231008213543,2023-10-09,city scrape,Home in Montreal · ★4.72 · 2 bedrooms · 2 beds...,Lovely historic house with plenty of period ch...,,https://a0.muscache.com/pictures/9e59d417-4b6a...,125031,...,4.72,"267153, expires: 2024-03-31",f,2,2,0,0,0.9,,
2,34715,https://www.airbnb.com/rooms/34715,20231008213543,2023-10-08,city scrape,Rental unit in Montreal · ★4.89 · 2 bedrooms ·...,Welcome to Montreal<br /><br />Looking for an ...,,https://a0.muscache.com/pictures/1209820/5968a...,149769,...,4.89,"261026, expires: 2023-10-31",f,1,1,0,0,0.06,,
3,36301,https://www.airbnb.com/rooms/36301,20231008213543,2023-10-08,city scrape,Rental unit in Montréal · ★4.88 · 1 bedroom · ...,"Enjoy the best of Montreal in this romantic, ...",The neighborhood is very lively while the stre...,https://a0.muscache.com/pictures/26c20544-475f...,381468,...,4.78,,f,7,7,0,0,0.48,,
4,38118,https://www.airbnb.com/rooms/38118,20231008213543,2023-10-08,city scrape,Rental unit in Montreal · ★4.50 · 3 bedrooms ·...,Nearest metro Papineau.<br /><br /><b>The spac...,,https://a0.muscache.com/pictures/213997/763ec1...,163569,...,4.38,,f,1,0,1,0,0.12,,
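
When a left merge returns only NaN in the new columns, as in the output above, a quick check is whether the two key columns share any values at all. The sketch below uses made-up toy frames, not the project data, together with the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column recording where each row found a match.

```python
import pandas as pd

# Toy frames with deliberately non-overlapping keys; all values are illustrative.
listings = pd.DataFrame({"id": [101, 102, 103]})
avg_scores = pd.DataFrame({
    "listing_id": [201, 202],
    "avg_sentiment_score": [0.4, 0.1],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
check = listings.merge(
    avg_scores, how="left", left_on="id", right_on="listing_id", indicator=True
)
match_counts = check["_merge"].value_counts()

# Direct overlap check on the key columns.
shared_ids = set(listings["id"]) & set(avg_scores["listing_id"])

print(match_counts)
print(f"{len(shared_ids)} of {len(listings)} listing ids have review scores")
```

If every row is `left_only`, the keys never match, which points to the inputs (for example, listings and reviews from different cities) rather than to the merge logic.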
