## Dataset : Airbnb Singapore Dataset from InsideAirbnb
Dataset from Airbnb : **"Singapore, 29 December 2022"**  
Source: http://insideairbnb.com/get-the-data/

### EDA on Middle 25 Variables in dataset  
The purpose of this file is to conduct exploratory data analysis on the middle 25 variables in our dataset.  
### Done by: <b>Isaac Chun</b>

---

### Essential Libraries

Import essential libraries such as numpy, pandas, matplotlib and seaborn.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [435]:
import numpy as np
import pandas as pd 
import seaborn as sb 
import matplotlib.pyplot as plt 

### Additional Libraries

Import additional libraries

> Wordcloud : Library to create tag clouds in Python  

In [436]:
from wordcloud import WordCloud

## Features Description

This dataset contains 75 features about Airbnb listings in Singapore. This notebook covers the cleaning and exploratory data analysis (EDA) of the middle 25 features.  
Below are the features with their respective descriptions:

1.  id: The unique identifier for each Airbnb listing.
2.  listing_url: The URL of the Airbnb listing.
3.  scrape_id: The unique identifier for the data scraping process.
4.  last_scraped: The date when the data was last scraped.
5.  source: The source of the data.
6.  name: The name or title of the Airbnb listing.
7.  description: The description of the Airbnb listing.
8.  neighborhood_overview: A brief overview of the neighborhood where the Airbnb listing is located.
9.  picture_url: The URL of the primary picture of the Airbnb listing.
10.  host_id: The unique identifier for the host of the Airbnb listing.
11.  host_url: The URL of the host's profile.
12.  host_name: The name of the host of the Airbnb listing.
13.  host_since: The date when the host joined Airbnb.
14.  host_location: The location of the host.
15.  host_about: A brief description of the host.
16.  host_response_time: The average response time of the host to messages.
17.  host_response_rate: The percentage of messages that the host responds to.
18.  host_acceptance_rate: The percentage of reservation requests that the host accepts.
19.  host_is_superhost: A binary variable indicating if the host is a superhost.
20.  host_thumbnail_url: The URL of the thumbnail-sized version of the host's profile picture.
21.  host_picture_url: The URL of the host's profile picture.
22.  host_neighbourhood: The neighborhood where the host is located.
23.  host_listings_count: The number of listings that the host has.
24.  host_total_listings_count: The total number of listings that the host has, including inactive listings.
25.  host_verifications: A list of verification methods that the host has completed.

--- 
## Visual Data Cleaning

##### In the context of maximizing host profit, the following features can be dropped from the dataset as they provide no relevant insights for our predictions.

1. **id**
2. **listing_url**
3. **scrape_id**
4. **last_scraped**
5. **source** 
6. **name**
7. **picture_url**
8. **host_id**
9. **host_url**
10. **host_name**
11. **host_since**
12. **host_about**
13. **host_thumbnail_url**
14. **host_picture_url**
15. **host_neighbourhood**
16. **host_listings_count**
17. **host_total_listings_count**
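
The drop step described above could be sketched as follows. This is a minimal illustration rather than the notebook's actual cleaning cell: the helper name `drop_uninformative` and the toy frame are assumptions, and the column names are written with underscores as they appear in the dataset schema.

```python
import pandas as pd

# Column names from the list above, spelled as in the dataset schema.
COLUMNS_TO_DROP = [
    "id", "listing_url", "scrape_id", "last_scraped", "source", "name",
    "picture_url", "host_id", "host_url", "host_name", "host_since",
    "host_about", "host_thumbnail_url", "host_picture_url",
    "host_neighbourhood", "host_listings_count", "host_total_listings_count",
]


def drop_uninformative(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns listed above (hypothetical helper)."""
    # errors="ignore" skips names that are absent instead of raising KeyError.
    return df.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# Toy frame standing in for the real listings data (values are made up).
toy = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["A", "B"],
    "description": ["Cosy room", "Entire flat"],
    "host_is_superhost": ["t", "f"],
})
cleaned = drop_uninformative(toy)
print(list(cleaned.columns))
```

Because `drop` returns a new frame, the original data stays available if a dropped column turns out to be useful later.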

---
### Below are features that we think might be informative for analyzing factors that impact host profit, so we shall conduct our EDA on them and gather insights:

1. **description** : The description highlights the unique features and amenities of a listing, which can attract more potential guests and increase bookings, thereby impacting host profit.

2. **neighborhood_overview** : Similar to description, the surrounding neighborhood can be an important factor for guests.

3. **host_location** : Where the host lives may affect their profit on Airbnb; we will perform EDA on this variable to check whether that is the case.

4. **host_response_time** : Hosts with faster response times may be more likely to secure bookings and receive positive reviews, which can impact host profit.

5. **host_response_rate** : Similar to host_response_time, hosts with higher response rates may be more likely to secure bookings and receive positive reviews, which can impact host profit.

6. **host_acceptance_rate** : The percentage of guest requests that a host accepts can impact booking rates and guest satisfaction. Hosts with higher acceptance rates may be more likely to secure bookings and receive positive reviews, which can impact host profit.

7. **host_is_superhost** : The "superhost" designation on Airbnb is given to experienced and highly-rated hosts. This can be an important factor in attracting guests and increasing booking rates, which can impact host profit.

8. **host_verifications** : The number of verification methods that a host has completed might help the host earn more profit, as guests may feel safer.
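
Several of these features are likely stored as text and would need converting before any numeric EDA. The sketch below assumes rates are written like `"85%"`, the superhost flag as `"t"`/`"f"` (the format that `instant_bookable` shows in the data preview), and verifications as a list written out as a string. These formats are assumptions about the raw CSV, and the helper names are hypothetical.

```python
import ast

import pandas as pd

# Toy rows; the string formats here are assumptions about the raw CSV.
toy = pd.DataFrame({
    "host_response_rate": ["100%", "85%", None],
    "host_acceptance_rate": ["97%", None, "50%"],
    "host_is_superhost": ["t", "f", None],
    "host_verifications": ["['email', 'phone']", "['phone']", "[]"],
})


def percent_to_fraction(col: pd.Series) -> pd.Series:
    """Turn strings such as '85%' into floats such as 0.85; missing values stay NaN."""
    return pd.to_numeric(col.str.rstrip("%"), errors="coerce") / 100


def count_verifications(value) -> int:
    """Count entries in a list written as a string; unparsable input counts as 0."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return 0
    return len(parsed) if isinstance(parsed, (list, tuple)) else 0


converted = pd.DataFrame({
    "host_response_rate": percent_to_fraction(toy["host_response_rate"]),
    "host_acceptance_rate": percent_to_fraction(toy["host_acceptance_rate"]),
    # Unmapped or missing flags become NaN rather than being guessed.
    "host_is_superhost": toy["host_is_superhost"].map({"t": True, "f": False}),
    "n_verifications": toy["host_verifications"].apply(count_verifications),
})
print(converted)
```

Keeping missing values as NaN, instead of filling them with 0, avoids treating "no data" as "never responds" in later plots.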

In [437]:
def remove_outliers(df, columns, factor=1.5):
    # loop through each column and remove outliers based on the IQR method
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        upper_bound = q3 + factor * iqr
        lower_bound = q1 - factor * iqr
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
    return df
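
A quick check of this filter on made-up data may help show what it keeps. The function is repeated in condensed form so the snippet runs on its own; the column name `value` and the numbers are illustrative only.

```python
import numpy as np
import pandas as pd


def remove_outliers(df, columns, factor=1.5):
    # Same IQR filter as the cell above, condensed so this snippet is standalone.
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        df = df[(df[col] >= q1 - factor * iqr) & (df[col] <= q3 + factor * iqr)]
    return df


# Toy data: five ordinary values, one extreme value, one missing value.
toy = pd.DataFrame({"value": [50, 60, 70, 80, 90, 5000, np.nan]})
filtered = remove_outliers(toy, ["value"])
print(len(toy), "->", len(filtered))
print(filtered["value"].tolist())
```

Two behaviours are worth keeping in mind. Rows with a missing value in a filtered column are dropped as well, because comparisons against NaN evaluate to False. When several columns are passed, each column's quartiles are computed on the rows that survived the earlier columns, so the column order can affect the result.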

In [438]:
def countOutliers(df):
    # Count the values of a numeric Series that fall outside the boxplot whiskers.
    # Take the 25th and 75th percentiles, compute the interquartile range (IQR),
    # then place the whiskers 1.5 * IQR beyond each quartile.
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    # Interquartile range
    iqr = q3 - q1
    # Calculate whiskers
    leftWhisker = q1 - (1.5 * iqr)
    rightWhisker = q3 + (1.5 * iqr)
    # Count the points lying outside the whiskers (NaN values are never counted)
    outliers = ((df < leftWhisker) | (df > rightWhisker)).sum()

    return int(outliers)
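
The same whisker rule can be applied to several numeric columns at once. The sketch below is one possible way to do that, using a hypothetical helper `count_iqr_outliers` and a made-up two-column frame; it is not part of the original notebook.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, factor: float = 1.5) -> int:
    """Count values outside [Q1 - factor*IQR, Q3 + factor*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - factor * iqr) | (series > q3 + factor * iqr)
    return int(outside.sum())


# Toy frame: column "a" has one extreme value, column "b" has none.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 100],
    "b": [10, 11, 12, 13, 14, 15],
})
# DataFrame.apply calls the helper once per column and returns a Series of counts.
counts = toy.apply(count_iqr_outliers)
print(counts.to_dict())
```

For column `a` the quartiles are 2.25 and 4.75, so the whiskers sit at -1.5 and 8.5 and only the value 100 falls outside them.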

---

>## Import the Dataset

In [439]:
airDF = pd.read_csv("listings.csv")
airDF.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,71609.0,https://www.airbnb.com/rooms/71609,20221200000000.0,12/29/2022,city scrape,Ensuite Room (Room 1 & 2) near EXPO,For 3 rooms.Book room 1&2 and room 4<br /><br ...,,https://a0.muscache.com/pictures/24453191/3580...,367042,...,4.78,4.26,4.32,,f,6,0,6,0,0.15
1,71896.0,https://www.airbnb.com/rooms/71896,20221200000000.0,12/29/2022,city scrape,B&B Room 1 near Airport & EXPO,<b>The space</b><br />Vocational Stay Deluxe B...,,https://a0.muscache.com/pictures/2440674/ac4f4...,367042,...,4.43,4.17,4.04,,t,6,0,6,0,0.17
2,71903.0,https://www.airbnb.com/rooms/71903,20221200000000.0,12/29/2022,city scrape,Room 2-near Airport & EXPO,"Like your own home, 24hrs access.<br /><br /><...",Quiet and view of the playground with exercise...,https://a0.muscache.com/pictures/568743/7bc623...,367042,...,4.64,4.5,4.36,,f,6,0,6,0,0.33
3,275343.0,https://www.airbnb.com/rooms/275343,20221200000000.0,12/29/2022,city scrape,Amazing Room with window 10min to Redhill,Awesome location and host <br />Room near INSE...,,https://a0.muscache.com/pictures/miso/Hosting-...,1439258,...,4.42,4.53,4.63,S0399,f,46,2,44,0,0.19
4,275344.0,https://www.airbnb.com/rooms/275344,20221200000000.0,12/29/2022,city scrape,15 mins to Outram MRT Single Room,Lovely home for the special guest !<br /><br /...,Bus stop <br />Food center <br />Supermarket,https://a0.muscache.com/pictures/miso/Hosting-...,1439258,...,4.54,4.62,4.46,S0399,f,46,2,44,0,0.11


In [440]:
airDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3035 entries, 0 to 3034
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            3035 non-null   float64
 1   listing_url                                   3035 non-null   object 
 2   scrape_id                                     3035 non-null   float64
 3   last_scraped                                  3035 non-null   object 
 4   source                                        3035 non-null   object 
 5   name                                          3035 non-null   object 
 6   description                                   2985 non-null   object 
 7   neighborhood_overview                         1973 non-null   object 
 8   picture_url                                   3035 non-null   object 
 9   host_id                                       3035 non-null   i

In [441]:
print(airDF.dtypes)

id                                              float64
listing_url                                      object
scrape_id                                       float64
last_scraped                                     object
source                                           object
                                                 ...   
calculated_host_listings_count                    int64
calculated_host_listings_count_entire_homes       int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               float64
Length: 75, dtype: object
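
The `info()` output above shows that some columns are incomplete: `description` has 2985 non-null values out of 3035 (50 missing) and `neighborhood_overview` has 1973 (1062 missing). A compact way to tabulate this for every column might look like the sketch below, shown on a made-up frame rather than the real data.

```python
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "description": ["Cosy room", None, "Entire flat", "Studio"],
    "neighborhood_overview": [None, None, "Near MRT", None],
    "host_id": [1, 2, 3, 4],
})

summary = pd.DataFrame({
    "missing": toy.isna().sum(),                         # count of NaN/None per column
    "missing_pct": (toy.isna().mean() * 100).round(1),   # share of rows missing, in percent
    "dtype": toy.dtypes.astype(str),                     # column type, for quick reference
})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the missing count puts the columns that need the most attention at the top.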


---
### 1. EDA on FIRST VARIABLE
<b>FIRST VARIABLE NAME</b>: First Variable Description
