# Project 1: Data Analysis

### Data Source
I use the **Inside Airbnb** dataset for New York City.

Dataset URL: https://insideairbnb.com/get-the-data.html  
File: `listings.csv` (listings data)  
Downloaded: (write today's date)

This dataset includes information such as Airbnb listing prices, location, room type, and number of reviews.  
It has more than **30,000 rows**, which satisfies the project requirement of having **at least 1,000 rows**.


In [1]:
import pandas as pd

df = pd.read_csv("listings.csv")
len(df), df.head()

(36111,
          id                            listing_url       scrape_id  \
 0  40824219  https://www.airbnb.com/rooms/40824219  20251001171547   
 1  40833186  https://www.airbnb.com/rooms/40833186  20251001171547   
 2  40837137  https://www.airbnb.com/rooms/40837137  20251001171547   
 3  40838018  https://www.airbnb.com/rooms/40838018  20251001171547   
 4  40839416  https://www.airbnb.com/rooms/40839416  20251001171547   
 
   last_scraped           source                                         name  \
 0   2025-10-02      city scrape   Room close to  Manhattan for FEMALE guests   
 1   2025-10-02  previous scrape  Soho LES East village private room downtown   
 2   2025-10-02  previous scrape     Sunset Park - Quiet and close to subway!   
 3   2025-10-02  previous scrape             Cozy One Bedroom in Clinton Hill   
 4   2025-10-02      city scrape    🪴XL dojo 🌾 shared green yogi palace apt 🌿   
 
                                          description  \
 0  This c

We will analyze the **price** variable, because it is numeric, relevant, and varies across neighborhoods. 
Understanding price distribution helps us identify how location affects Airbnb rental costs.

In [13]:
s = pd.to_numeric(df["price"], errors="coerce").dropna()
s = s[(s > 0) & (s < s.quantile(0.99))]  # remove extreme outliers
len(s)

0
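
The count of 0 above most likely arises because `price` is stored as text with a currency symbol and thousands separators, so `pd.to_numeric(..., errors="coerce")` turns every value into NaN and `dropna()` then removes them all. A minimal pure-Python sketch of the same idea follows. `parse_price` is a hypothetical helper (not part of pandas), and the sample strings are made up for illustration.

```python
import math


def parse_price(text):
    """Return a float for strings like "$1,250.00", or None if parsing fails."""
    cleaned = str(text).replace("$", "").replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return None  # plays the role of NaN under errors="coerce"
    return None if math.isnan(value) else value


samples = ["$66.00", "$1,250.00", "not a price"]  # made-up example values

# Without cleaning, float() rejects every currency-formatted string.
raw_ok = []
for text in samples:
    try:
        raw_ok.append(float(text))
    except ValueError:
        pass

parsed = [parse_price(text) for text in samples]
print(len(raw_ok))  # how many raw strings parsed directly
print(parsed)       # values after stripping "$" and ","
```

Stripping the formatting before conversion is the same approach the next cell takes with `str.replace`.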

In [14]:
import pandas as pd

df = pd.read_csv("listings.csv")

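# Data cleaning: price is stored as text formatted with "$" and ","
s = df["price"].astype(str)              # treat every value as a string
s = s.str.replace("$", "", regex=False)  # strip the dollar sign
s = s.str.replace(",", "", regex=False)  # strip thousands separators
s = pd.to_numeric(s, errors="coerce")    # convert to float; unparseable values become NaN
s = s.dropna()                           # drop values that could not be converted
s = s[(s > 0) & (s < s.quantile(0.99))]  # keep positive prices below the 99th percentile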

len(s), s.head()

(21114,
 0     66.0
 4     76.0
 5     97.0
 7     60.0
 8    425.0
 Name: price, dtype: float64)
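
The last cleaning step keeps prices strictly below the 99th percentile. As a rough illustration of what that cutoff means, here is a pure-Python sketch of a linearly interpolated quantile, which is the default interpolation in `Series.quantile`. The helper names `quantile` and `trim_upper` and the sample prices are assumptions made for this example.

```python
def quantile(vals, q):
    """Linearly interpolated quantile of vals for 0 <= q <= 1."""
    ordered = sorted(vals)
    if not ordered:
        raise ValueError("quantile of empty data")
    pos = (len(ordered) - 1) * q          # fractional index into the sorted data
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def trim_upper(vals, q=0.99):
    """Keep positive values strictly below the q-th quantile."""
    cutoff = quantile(vals, q)
    return [v for v in vals if 0 < v < cutoff]


prices = [60.0, 66.0, 76.0, 97.0, 425.0, 9999.0]  # made-up sample
print(quantile(prices, 0.5))
print(trim_upper(prices))
```

Because the comparison is strict, the largest value in a sample is always dropped, even when it is not an outlier.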

In [15]:
mean_p = s.mean()
median_p = s.median()
mode_p = s.mode().iloc[0]   

mean_p, median_p, mode_p

(np.float64(237.0799943165672), np.float64(152.0), np.float64(150.0))

The mean price is higher than the median price, which suggests the distribution is **right-skewed**.  
This means that most Airbnb listings are moderately priced, but a small number of high-priced listings raise the average.

In [16]:
# convert the cleaned pandas Series into a plain Python list
values = list(s)

# define mean, median, mode without pandas
def mean(vals):
    return sum(vals) / len(vals)

def median(vals):
    vals = sorted(vals)
    n = len(vals)
    mid = n // 2
    if n % 2 == 1:
        return vals[mid]
    else:
        return (vals[mid - 1] + vals[mid]) / 2

def mode(vals):
    counts = {}
    best = None
    max_count = 0
    for v in vals:
        counts[v] = counts.get(v, 0) + 1
        if counts[v] > max_count:
            best = v
            max_count = counts[v]
    return best

mean_std = mean(values)
median_std = median(values)
mode_std = mode(values)

mean_std, median_std, mode_std

(237.0799943165672, 152.0, 150.0)
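
One detail worth noting: when several values tie for the highest count, the `mode` function above returns whichever value reaches that count first, while `s.mode().iloc[0]` returns the smallest tied value, since pandas returns tied modes in sorted order. The two agree on this data, but they can differ on other inputs. A possible variant that breaks ties by smallest value is sketched below, with `mode_smallest` as a hypothetical name.

```python
from collections import Counter


def mode_smallest(vals):
    """Most frequent value; ties are broken by taking the smallest value."""
    counts = Counter(vals)
    top = max(counts.values())  # highest frequency in the data
    return min(v for v, c in counts.items() if c == top)


print(mode_smallest([150.0, 152.0, 150.0]))
print(mode_smallest([200.0, 100.0, 100.0, 200.0]))  # tie between 100.0 and 200.0
```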

### Summary Statistics Interpretation

We computed three measures of central tendency for the price of Airbnb listings in New York City:

- **Mean:** ~\$237  
- **Median:** ~\$152  
- **Mode:** ~\$150  

The mean price is noticeably higher than the median price, which indicates that the price distribution is **right-skewed**. In other words, while most Airbnb listings are moderately priced, there are some very expensive listings that raise the average price.

The median and mode are both around \$150, suggesting that the **typical nightly price** for an Airbnb listing in NYC is around \$150, which is what most travelers are likely to pay.

Overall, the gap between mean and median reflects the presence of a **small number of luxury or high-demand listings** that significantly increase the average.
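
To see why a few expensive listings pull the mean above the median, consider a small made-up sample (these numbers are illustrative and not taken from the dataset), computed with Python's standard `statistics` module:

```python
import statistics

# Six moderately priced listings plus one very expensive one (made-up values).
prices = [100, 120, 150, 150, 180, 200, 1500]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
print(round(mean_price, 2), median_price)

# Removing the single expensive listing brings the mean back to the median.
without_top = prices[:-1]
print(statistics.mean(without_top), statistics.median(without_top))
```

The median does not change when the expensive listing is removed, while the mean drops sharply. The same pattern appears in the gap between the mean and median reported above.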


In [17]:
# Use neighbourhood_cleansed instead of neighbourhood_group
group_stats = (
    pd.DataFrame({
        "price": s, 
        "neighbourhood": df.loc[s.index, "neighbourhood_cleansed"].values
    })
    .dropna(subset=["neighbourhood"])
    .groupby("neighbourhood")["price"]
    .median()
    .sort_values(ascending=False)
)

labels = list(group_stats.index)
vals = list(group_stats.values)

def ascii_bar(labels, vals, width=40):
    max_v = max(vals)
    scale = width / max_v
    for lab, v in zip(labels, vals):
        bar = "█" * int(v * scale)
        print(f"{lab:<20} | {bar} {v:.1f}")

ascii_bar(labels, vals)

Riverdale            | ████████████████████████████████████████ 1043.0
Fort Wadsworth       | ███████████████████████ 600.0
Tribeca              | █████████████████ 462.0
NoHo                 | █████████████████ 458.5
SoHo                 | █████████████████ 458.0
Battery Park City    | ████████████████ 424.0
Theater District     | ███████████████ 404.0
West Village         | ██████████████ 378.0
Financial District   | ██████████████ 378.0
Flatiron District    | ██████████████ 368.0
Midtown              | █████████████ 362.0
Greenwich Village    | █████████████ 361.5
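
Each bar length in the chart is the median price scaled to the chart width and truncated to a whole number of characters. A short sketch of that arithmetic follows, using the `width=40` default and the largest median from the output. `bar_length` is a hypothetical helper name introduced only for this example.

```python
def bar_length(value, max_value, width=40):
    """Number of block characters drawn for value on a chart of the given width."""
    # Equivalent to int(v * scale) with scale = width / max_v,
    # up to floating-point rounding; int() truncates toward zero.
    return int(value * width / max_value)


max_median = 1043.0  # Riverdale, the largest value shown
rows = [("Riverdale", 1043.0), ("Fort Wadsworth", 600.0), ("Tribeca", 462.0)]
for name, median in rows:
    print(f"{name:<20} | {'█' * bar_length(median, max_median)} {median:.1f}")
```

Because of the truncation, medians that differ by less than about \$26 (1043 / 40) can produce bars of the same length, as with Tribeca, NoHo, and SoHo.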


### ASCII Visualization Interpretation

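The ASCII chart shows that median nightly prices differ substantially across neighborhoods. In the rows shown above, Riverdale (\$1,043) and Fort Wadsworth (\$600) rank highest, followed by Manhattan neighborhoods such as Tribeca, NoHo, and SoHo at roughly \$460. The two top values may reflect neighborhoods with very few listings, where a single expensive property can set the median, so they should be read with caution. Neighborhoods such as Harlem and Astoria fall further down the ranking, outside the rows shown, indicating more affordable options. Overall, this supports the conclusion that location is a major factor influencing Airbnb pricing in NYC.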
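
A neighborhood with only one or two listings can top a median ranking on the strength of a single property, so it may help to require a minimum number of listings before comparing medians. The following is a minimal sketch under that assumption. The `(neighbourhood, price)` pairs, the helper `median_by_group`, and the `min_count` threshold are all hypothetical and not taken from the dataset.

```python
import statistics
from collections import defaultdict


def median_by_group(pairs, min_count=5):
    """Median price per group, skipping groups with fewer than min_count rows."""
    groups = defaultdict(list)
    for name, price in pairs:
        groups[name].append(price)
    medians = {
        name: statistics.median(prices)
        for name, prices in groups.items()
        if len(prices) >= min_count
    }
    # Highest median first, matching the ordering used in the chart.
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Made-up rows: group "A" has three listings, group "B" has only one.
rows = [("A", 100.0), ("A", 150.0), ("A", 200.0), ("B", 1000.0)]
print(median_by_group(rows, min_count=1))
print(median_by_group(rows, min_count=2))
```

With `min_count=1` the single-listing group ranks first. Raising the threshold removes it and leaves only the group with enough listings to give a stable median.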