# Project 1 — Data Analysis

### Data Source
This project uses the **Inside Airbnb** dataset for New York City.

Dataset URL: https://insideairbnb.com/get-the-data.html  
File: `project1data.csv` (listings data)  
Downloaded: (write today’s date)

This dataset includes information such as Airbnb listing prices, location, room type, and number of reviews.  
It has more than **30,000 rows**, which satisfies the project requirement of having **at least 1,000 rows**.


In [None]:
import pandas as pd

df = pd.read_csv("listings (2).csv")
# confirm that the price column exists and count the rows
"price" in df.columns, len(df)

(True, 36111)

We will analyze the **price** variable because it is numeric, relevant, and varies across neighborhoods.
Understanding the price distribution is a first step toward seeing how location affects Airbnb rental costs.

In [3]:
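# First attempt: coerce price straight to numbers (non-numeric entries become NaN and are dropped)
s = pd.to_numeric(df["price"], errors="coerce").dropna()
s = s[(s > 0) & (s < s.quantile(0.99))]  # keep positive prices below the 99th percentile
len(s)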

0
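
A count of 0 suggests that every value in `price` was coerced to `NaN`. That is the expected result if the column holds text such as `$66.00` or `$1,200.00` rather than plain numbers, which is what the cleaning step in the next cell assumes. A minimal sketch of the effect, using made-up sample strings (not rows from the dataset):

```python
import pandas as pd

# Hypothetical sample values, formatted the way the cleaning step assumes
raw = pd.Series(["$66.00", "$1,200.00", "97"])

# Currency symbols and thousands separators cannot be parsed, so they become NaN
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.isna().tolist())  # [True, True, False]

# Stripping "$" and "," first lets every value convert
cleaned = pd.to_numeric(
    raw.str.replace("$", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",
)
print(cleaned.tolist())  # [66.0, 1200.0, 97.0]
```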

In [6]:
import pandas as pd

df = pd.read_csv("listings (2).csv")

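# Data cleaning: strip currency formatting from price before converting
s = df["price"].astype(str)              # treat every value as a string
s = s.str.replace("$", "", regex=False)  # remove the dollar sign
s = s.str.replace(",", "", regex=False)  # remove thousands separators
s = pd.to_numeric(s, errors="coerce")    # convert to numbers; failures become NaN
s = s.dropna()                           # drop values that could not be converted
s = s[(s > 0) & (s < s.quantile(0.99))]  # keep positive prices below the 99th percentile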

len(s), s.head()

(21114,
 0     66.0
 4     76.0
 5     97.0
 7     60.0
 8    425.0
 Name: price, dtype: float64)

In [9]:
mean_p = s.mean()
median_p = s.median()
mode_p = s.mode().iloc[0]  # mode() returns sorted values; take the smallest if there is a tie

mean_p, median_p, mode_p

(np.float64(237.0799943165672), np.float64(152.0), np.float64(150.0))

The mean price is higher than the median price, which suggests the distribution is **right-skewed**.  
This means that most Airbnb listings are moderately priced, but a small number of high-priced listings raise the average.
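
To see how a single expensive listing pulls the mean above the median, the sketch below reuses the five cleaned prices printed by `s.head()` above. It is only an illustration on five values, not a statement about the full distribution.

```python
import statistics

# The five cleaned prices shown in the s.head() output
sample = [66.0, 76.0, 97.0, 60.0, 425.0]

sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
print(sample_mean, sample_median)  # 144.8 76.0

# Dropping the one large value brings the two measures close together
without_high = [v for v in sample if v < 400]
print(statistics.mean(without_high), statistics.median(without_high))  # 74.75 71.0
```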

In [11]:
# convert the cleaned pandas Series into a plain Python list
values = list(s)

# define mean, median, mode without pandas
def mean(vals):
    return sum(vals) / len(vals)

def median(vals):
    vals = sorted(vals)
    n = len(vals)
    mid = n // 2
    if n % 2 == 1:
        return vals[mid]
    else:
        return (vals[mid - 1] + vals[mid]) / 2

def mode(vals):
    counts = {}
    best = None
    max_count = 0
    for v in vals:
        counts[v] = counts.get(v, 0) + 1
        if counts[v] > max_count:
            best = v
            max_count = counts[v]
    return best

mean_std = mean(values)
median_std = median(values)
mode_std = mode(values)

mean_std, median_std, mode_std

(237.0799943165672, 152.0, 150.0)
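
As a cross-check, Python's standard `statistics` module provides the same three measures. One detail worth keeping in mind is tie handling: when several values share the highest count, the hand-written `mode` above returns whichever value reaches that count first, whereas `s.mode().iloc[0]` returns the smallest tied value. The sketch below uses a small made-up list (not the dataset), and `first_to_peak_mode` is a hypothetical name for a compact copy of the logic above.

```python
import statistics

def first_to_peak_mode(vals):
    """Return the value that first reaches the highest count (same logic as mode above)."""
    counts, best, max_count = {}, None, 0
    for v in vals:
        counts[v] = counts.get(v, 0) + 1
        if counts[v] > max_count:
            best, max_count = v, counts[v]
    return best

# Made-up values in which 100.0 and 150.0 each appear twice
toy = [100.0, 150.0, 150.0, 100.0, 200.0]

print(statistics.mean(toy))       # 140.0
print(statistics.median(toy))     # 150.0
print(statistics.multimode(toy))  # [100.0, 150.0], every tied mode
print(first_to_peak_mode(toy))    # 150.0, the first value to appear twice
```

On the cleaned price data both approaches returned 150.0, so the tie rule did not change the reported mode.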

### Summary Statistics Interpretation

We computed three measures of central tendency for the price of Airbnb listings in New York City:

- **Mean:** ~\$237  
- **Median:** ~\$152  
- **Mode:** ~\$150  

The mean price is noticeably higher than the median price, which indicates that the price distribution is **right-skewed**. In other words, while most Airbnb listings are moderately priced, there are some very expensive listings that raise the average price.

The median and mode are both around \$150, suggesting that the **typical nightly price** for an Airbnb listing in NYC is around \$150. By definition of the median, at least half of the listings in the cleaned data cost \$152 or less per night.

Overall, the gap between mean and median reflects the presence of a **small number of luxury or high-demand listings** that significantly increase the average.
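
The size of that gap can be expressed relative to the median. A short sketch using the rounded values reported above:

```python
# Values reported in the summary above, rounded to two decimals
mean_price = 237.08
median_price = 152.0

gap = mean_price - median_price
relative_gap = gap / median_price

print(f"gap: ${gap:.2f}, relative: {relative_gap:.1%}")  # gap: $85.08, relative: 56.0%
```

In other words, the mean sits roughly 56% above the median in the cleaned data.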


In [13]:
# Use neighbourhood_cleansed instead of neighbourhood_group
group_stats = (
    pd.DataFrame({
        "price": s, 
        "neighbourhood": df.loc[s.index, "neighbourhood_cleansed"].values
    })
    .dropna(subset=["neighbourhood"])
    .groupby("neighbourhood")["price"]
    .median()
    .sort_values(ascending=False)
)

labels = list(group_stats.index)
vals = list(group_stats.values)

def ascii_bar(labels, vals, width=40):
    max_v = max(vals)
    scale = width / max_v
    for lab, v in zip(labels, vals):
        bar = "█" * int(v * scale)
        print(f"{lab:<20} | {bar} {v:.1f}")

ascii_bar(labels, vals)

Riverdale            | ████████████████████████████████████████ 1043.0
Fort Wadsworth       | ███████████████████████ 600.0
Tribeca              | █████████████████ 462.0
NoHo                 | █████████████████ 458.5
SoHo                 | █████████████████ 458.0
Battery Park City    | ████████████████ 424.0
Theater District     | ███████████████ 404.0
West Village         | ██████████████ 378.0
Financial District   | ██████████████ 378.0
Flatiron District    | ██████████████ 368.0
Midtown              | █████████████ 362.0
Greenwich Village    | █████████████ 361.5
Civic Center         | █████████████ 341.5
Nolita               | ███████████ 300.0
Vinegar Hill         | ███████████ 298.0
DUMBO                | ███████████ 295.0
Holliswood           | ███████████ 289.5
New Springville      | ██████████ 276.0
Grymes Hill          | ██████████ 272.0
Neponsit             | ██████████ 272.0
Greenpoint           | ██████████ 267.0
Downtown Brooklyn    | █████████ 255.0
Chelsea             

### ASCII Visualization Interpretation

The ASCII chart shows that median Airbnb prices differ substantially across neighborhoods. Among the neighborhoods listed in the chart, Tribeca, NoHo, and SoHo have median nightly prices of roughly \$460, consistent with their desirability and central location. Riverdale (\$1,043) and Fort Wadsworth (\$600) rank even higher, although medians this extreme may be driven by a small number of listings. Meanwhile, neighborhoods like Harlem and Astoria show lower median prices, indicating more affordable options. Overall, the chart is consistent with location being a major factor in Airbnb pricing in NYC.
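
A chart of group medians does not show how many listings sit behind each bar. A minimal sketch of one way to check this reports the listing count next to each median and filters out small groups. The toy frame and the `min_listings` threshold are assumptions made for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy data: neighbourhood names and prices are illustrative only
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "C", "C", "C", "C"],
    "price": [100.0, 120.0, 140.0, 900.0, 80.0, 90.0, 100.0, 110.0],
})

min_listings = 3  # hypothetical cutoff; choose one that suits the data

stats = (
    toy.groupby("neighbourhood")["price"]
    .agg(["median", "count"])
    .sort_values("median", ascending=False)
)

# Keep only groups with at least min_listings rows
reliable = stats[stats["count"] >= min_listings]
print(stats)
print(reliable)
```

Here group `B` tops the unfiltered ranking on the strength of a single listing and drops out once the threshold is applied.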