# Insights into Urban Dynamics:<br>Analyzing Airbnb Reviews and Neighborhood Metrics

## Project Overview

This study aims to investigate the correlation between subjective Airbnb reviews and objective neighborhood metrics in select US cities. It involves analyzing crime statistics, demographics, socioeconomic indicators, and environmental quality to understand how they relate to sentiments expressed in Airbnb reviews.

---
### Objectives

- Investigate correlations between subjective reviews and quantifiable neighborhood attributes in targeted US cities.
- Understand how guest experiences align with tangible neighborhood characteristics.
---
### Methodologies and Tools

- **Data Collection**: Utilize Python libraries (e.g., Pandas, Requests) for data collection and preprocessing.
- **Sentiment Analysis**: Implement TextBlob for sentiment analysis of Airbnb reviews and consider numerical ratings.
- **Correlation Techniques**: Employ regression analysis, correlation coefficients, and other techniques (e.g., cluster analysis, principal component analysis).
- **Visualization**: Use Matplotlib, Plotly, or Seaborn to visualize the relationships.
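
As a rough illustration of the correlation step, the sketch below computes Pearson and Spearman coefficients between a per-neighborhood mean sentiment and a single neighborhood metric. The table, its column names (`mean_sentiment`, `crime_rate`), and all values are made-up placeholders, not project data. Spearman is computed here as Pearson on ranks, which is one standard way to obtain it.

```python
import pandas as pd

# Hypothetical neighborhood-level table; names and values are placeholders only.
neighborhood_stats = pd.DataFrame({
    "neighbourhood": ["A", "B", "C", "D", "E"],
    "mean_sentiment": [0.42, 0.35, 0.30, 0.22, 0.15],
    "crime_rate": [10.0, 14.0, 19.0, 30.0, 41.0],
})

def correlate(df, x, y):
    """Return Pearson and Spearman correlations between columns x and y."""
    pearson = df[x].corr(df[y])  # Pearson is the default method
    # Spearman's rho equals Pearson's r computed on the ranked values.
    spearman = df[x].rank().corr(df[y].rank())
    return {"pearson": pearson, "spearman": spearman}

result = correlate(neighborhood_stats, "mean_sentiment", "crime_rate")
print(result)
```

A coefficient near zero would suggest little linear (Pearson) or monotonic (Spearman) relationship; neither coefficient says anything about causation.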

---

### Import dependencies

In [35]:
import pandas as pd
import requests

In [36]:
#install for sentiment analysis
!pip install textblob



In [37]:
#install corpora
!python -m textblob.download_corpora

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\12039\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

### Pull files for review listings

In [38]:
# Read the Seattle listings CSV (including location info) into a pandas DataFrame
seattle_listings = pd.read_csv('http://data.insideairbnb.com/united-states/wa/seattle/2023-09-18/visualisations/listings.csv')

In [39]:
seattle_listings.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6606 | Guesthouse in Seattle · ★4.60 · 1 bedroom · 1 ... | 14942 | Joyce | Other neighborhoods | Wallingford | 47.65444 | -122.33629 | Entire home/apt | 99 | 30 | 160 | 2023-08-05 | 0.93 | 2 | 177 | 1 | str-opli-19-002622 |
| 1 | 9419 | Rental unit in Seattle · ★4.72 · 1 bedroom · 1... | 30559 | Angielena | Other neighborhoods | Georgetown | 47.55017 | -122.31937 | Private room | 85 | 2 | 193 | 2023-09-04 | 1.21 | 9 | 365 | 23 | STR-OPLI-19-003039 |
| 2 | 9531 | Home in Seattle · ★4.96 · 2 bedrooms · 3 beds ... | 31481 | Cassie | West Seattle | Fairmount Park | 47.55495 | -122.38663 | Entire home/apt | 185 | 3 | 77 | 2023-09-09 | 0.54 | 2 | 318 | 12 | STR-OPLI-19-002182 |
| 3 | 9534 | Guest suite in Seattle · ★4.99 · 2 bedrooms · ... | 31481 | Cassie | West Seattle | Fairmount Park | 47.55627 | -122.38607 | Entire home/apt | 155 | 2 | 75 | 2023-05-28 | 0.53 | 2 | 230 | 7 | STR-OPLI-19-002182 |
| 4 | 9596 | Rental unit in Seattle · ★4.56 · 1 bedroom · 4... | 14942 | Joyce | Other neighborhoods | Wallingford | 47.65608 | -122.33602 | Entire home/apt | 130 | 30 | 97 | 2020-09-28 | 0.65 | 2 | 0 | 0 | STR -OPLI-19-002622 |


### Create DataFrames from compressed .gz files fetched from a URL

In [None]:
# Disabled (wrapped in a string): do not use this, because the resulting unfiltered DataFrame is too large
# Function that downloads a gzip-compressed CSV from a URL and loads it into a DataFrame
'''import requests
import gzip
import io

def fetch_csv_gz_from_url(url):
    try:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            compressed_file = io.BytesIO(response.content)
            with gzip.GzipFile(fileobj=compressed_file, mode='rb') as gz_file:
                with io.TextIOWrapper(gz_file, encoding='utf-8') as file:
                    df = pd.read_csv(file)
            print("DataFrame created successfully")
            return df
        else:
            print("Failed to download the file")
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example usage:
url = 'http://data.insideairbnb.com/united-states/wa/seattle/2023-09-18/data/reviews.csv.gz'
seattle_reviews = fetch_csv_gz_from_url(url)'''

### Create filtered review dataframes for 2022 and 2023 only

In [49]:
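# gzip and io must be imported here because the cell above is disabled (its imports never run)
import gzip
import io

# Download a gzip-compressed CSV from a URL and keep only the rows whose 'date_column' year is in 'years'
def fetch_filtered_csv_gz_from_url(url, date_column, years):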
    try:
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            compressed_file = io.BytesIO(response.content)
            with gzip.GzipFile(fileobj=compressed_file, mode='rb') as gz_file:
                with io.TextIOWrapper(gz_file, encoding='utf-8') as file:
                    df = pd.read_csv(file)
            
            # Convert specified 'date_column' to datetime type
            df[date_column] = pd.to_datetime(df[date_column])
            
            # Filter rows based on the provided years in the specified 'date_column'
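            # .copy() makes the result an independent DataFrame, so adding columns later
            # does not trigger pandas' SettingWithCopyWarning
            filtered_df = df[df[date_column].dt.year.isin(years)].copy()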
            
            print(f"Filtered DataFrame for {years} created successfully based on '{date_column}' column")
            return filtered_df
        else:
            print("Failed to download the file")
            return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Create filtered review DataFrames for each city in the project
target_years = [2022, 2023]  # Years of reviews to keep
date_column_name = 'date'  # Name of the review date column in each reviews.csv.gz file

seattle_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/wa/seattle/2023-09-18/data/reviews.csv.gz', date_column_name, target_years)
la_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ca/los-angeles/2023-09-03/data/reviews.csv.gz', date_column_name, target_years)
oakland_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ca/oakland/2023-09-18/data/reviews.csv.gz', date_column_name, target_years)
boston_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ma/boston/2023-09-16/data/reviews.csv.gz', date_column_name, target_years)
nyc_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/ny/new-york-city/2023-11-01/data/reviews.csv.gz', date_column_name, target_years)
neworleans_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/la/new-orleans/2023-09-03/data/reviews.csv.gz', date_column_name, target_years)
austin_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/tx/austin/2023-09-10/data/reviews.csv.gz', date_column_name, target_years)
chicago_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/il/chicago/2023-09-12/data/reviews.csv.gz', date_column_name, target_years)
nashville_reviews_recent = fetch_filtered_csv_gz_from_url('http://data.insideairbnb.com/united-states/tn/nashville/2023-09-16/data/reviews.csv.gz', date_column_name, target_years)


Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
Filtered DataFrame for [2022, 2023] created successfully based on 'date' column
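
The function above still loads each full reviews file into memory before filtering. A lower-memory variant, sketched here as an assumption rather than the approach used in this notebook, reads the compressed CSV in chunks and filters each chunk as it arrives. The function name `read_filtered_in_chunks`, the default chunk size, and the tiny demo file (including its `listing_id` column) are all hypothetical; the demo file exists only so the example runs without network access.

```python
import gzip
import os
import tempfile

import pandas as pd

def read_filtered_in_chunks(path_or_url, date_column, years, chunksize=100_000):
    """Read a gzip-compressed CSV chunk by chunk, keeping only rows from 'years'."""
    kept = []
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(path_or_url, compression="gzip", chunksize=chunksize)
    for chunk in reader:
        chunk[date_column] = pd.to_datetime(chunk[date_column])
        kept.append(chunk[chunk[date_column].dt.year.isin(years)])
    return pd.concat(kept, ignore_index=True)

# Tiny local demo file (placeholder rows, not real reviews).
demo_path = os.path.join(tempfile.mkdtemp(), "reviews_demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("listing_id,date,comments\n")
    f.write("1,2021-06-01,Nice place\n")
    f.write("1,2022-03-15,Great host\n")
    f.write("2,2023-08-20,Very clean\n")

demo_recent = read_filtered_in_chunks(demo_path, "date", [2022, 2023], chunksize=2)
print(demo_recent)
```

`pd.read_csv` also accepts a URL in place of the local path, so the same function could be pointed at one of the `reviews.csv.gz` links above. Peak memory is then bounded roughly by one chunk plus the rows that survive the filter.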


### Apply TextBlob to calculate sentiment of the comments and create new column with results
**Note**: This can take some time because TextBlob scores each row individually.

In [None]:
from textblob import TextBlob

# This can take a while because TextBlob processes each text entry individually
def add_sentiment_column(df):
    """
    Apply TextBlob's sentiment analysis to the 'comments' column of a DataFrame
    and store the polarity score (from -1.0 to 1.0) in a new 'sentiment_score' column.

    Returns:
    - The same DataFrame, updated in place with the new 'sentiment_score' column
    """
    df['sentiment_score'] = df['comments'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)
    return df

#List of filtered dataframes
list_of_dataframes = [
    seattle_reviews_recent,
    la_reviews_recent,
    oakland_reviews_recent,
    boston_reviews_recent,
    nyc_reviews_recent,
    neworleans_recent,
    austin_recent,
    chicago_recent,
    nashville_reviews_recent
]

# Apply add_sentiment_column function to each DataFrame in the list
for i, df in enumerate(list_of_dataframes):
    list_of_dataframes[i] = add_sentiment_column(df)
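
Because `add_sentiment_column` wraps every value in `str(x)`, a missing comment is converted to the string `'nan'` and scored as if it were review text. The sketch below is one possible variant that leaves such rows as `NaN` instead. The function name `add_sentiment_column_skip_missing` and its `scorer` parameter are hypothetical, and `toy_scorer` is a stand-in used only so the example runs without TextBlob; it has no linguistic meaning.

```python
import pandas as pd

def add_sentiment_column_skip_missing(df, scorer, text_column="comments"):
    """Return a copy of df with 'sentiment_score'; rows with missing text stay NaN."""
    out = df.copy()
    has_text = out[text_column].notna()
    out["sentiment_score"] = float("nan")
    # Score only the rows that actually contain text.
    out.loc[has_text, "sentiment_score"] = (
        out.loc[has_text, text_column].astype(str).map(scorer)
    )
    return out

# Stand-in scorer for this demo only; it is NOT TextBlob.
def toy_scorer(text):
    return 1.0 if "great" in text.lower() else 0.0

demo = pd.DataFrame({"comments": ["Great stay", None, "It was fine"]})
scored = add_sentiment_column_skip_missing(demo, toy_scorer)
print(scored)
```

With TextBlob, the scorer would be `lambda text: TextBlob(text).sentiment.polarity`. Unlike the in-place function above, this variant returns a copy, so the result has to be assigned back to a variable such as `chicago_recent`.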

In [None]:
chicago_recent.head()
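
A possible next step toward the project's objective is to roll review sentiment up to the neighbourhood level so it can be compared with neighborhood metrics. The sketch below joins reviews to listings and averages `sentiment_score` per `neighbourhood`. It assumes the reviews table has a `listing_id` column that matches `id` in the listings table, which is not shown in this notebook. The function name is hypothetical, and the sentiment values in the demo are made up; only the two listing IDs and neighbourhood names come from the listings output above.

```python
import pandas as pd

def mean_sentiment_by_neighbourhood(reviews, listings):
    """Join reviews to listings and average sentiment per neighbourhood."""
    # Rename listings 'id' so it cannot collide with any 'id' column in reviews.
    lookup = listings[["id", "neighbourhood"]].rename(columns={"id": "listing_id"})
    merged = reviews.merge(lookup, on="listing_id", how="inner")
    return (
        merged.groupby("neighbourhood")["sentiment_score"]
        .agg(["mean", "count"])
        .rename(columns={"mean": "mean_sentiment", "count": "n_reviews"})
        .reset_index()
    )

# Demo frames: the sentiment scores are placeholders, not computed values.
demo_listings = pd.DataFrame({
    "id": [6606, 9419],
    "neighbourhood": ["Wallingford", "Georgetown"],
})
demo_reviews = pd.DataFrame({
    "listing_id": [6606, 6606, 9419],
    "sentiment_score": [0.5, 0.3, 0.2],
})

by_hood = mean_sentiment_by_neighbourhood(demo_reviews, demo_listings)
print(by_hood)
```

Keeping the review count next to the mean makes it easier to spot neighbourhoods whose average rests on only a handful of reviews.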