# 1. **Univariate Analysis** 📊

- **Definition:** Analysis of **one variable** at a time.
- **Purpose:** Understand distribution, central tendency, spread, and patterns of a single variable.
- **Techniques/Tools:**
    - Mean, median, mode
    - Standard deviation, variance
    - Histograms, box plots, frequency tables
- **Example:**
    - Analyzing **student grades** in a class to find the average and distribution.
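The measures listed above can be computed directly with pandas. The following is a minimal sketch on a small made-up set of grades; the `grades` values are illustrative and do not come from the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical grades for ten students (illustrative values only)
grades = pd.Series([55, 62, 70, 70, 74, 78, 81, 85, 90, 95], name="grade")

mean_grade = grades.mean()            # central tendency: arithmetic mean
median_grade = grades.median()        # central tendency: middle value
mode_grade = grades.mode().iloc[0]    # most frequent value (first one if tied)
std_grade = grades.std()              # spread: sample standard deviation (ddof=1)
var_grade = grades.var()              # spread: sample variance (ddof=1)

# Frequency table over three equal-width bins, kept in bin order
freq_table = grades.value_counts(bins=3, sort=False)

print(mean_grade, median_grade, mode_grade)
print(freq_table)
```

Here the mean and the median coincide, which is one quick hint that the distribution is roughly symmetric.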

In [None]:
import pandas as pd 

In [7]:
# Load the listings data and preview the first five rows
df = pd.read_csv("AB_NYC_2019.csv")
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [9]:
df.shape

(48895, 16)

In [13]:
# Drop every row that has a missing value in any column, then drop exact duplicate rows
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

In [17]:
df.shape

(38821, 16)
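The row count fell from 48895 to 38821 because `dropna()` removes a row if any of its columns is null. Before dropping, it can help to see which columns hold the missing values. A sketch on a tiny made-up frame (the column names mirror the dataset, but the values and the `toy` frame itself are invented for illustration):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; column names mirror the dataset, values are invented
toy = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "price": [100, 150, 80, 100],
    "last_review": ["2019-01-01", None, None, "2019-05-21"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
})

missing_per_column = toy.isna().sum()      # null count for each column
rows_before = len(toy)
cleaned = toy.dropna().drop_duplicates()   # returns a new frame; toy is unchanged
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

If most of the nulls sit in one or two columns, filling or ignoring those columns may preserve more rows than a blanket `dropna()`; whether that is appropriate depends on the analysis.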

In [23]:
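# Identifiers and coordinates are not quantities to summarise, so cast them
# to strings; describe() and corr(numeric_only=True) will then skip them
df["id"] = df["id"].astype(str)
df["host_id"] = df["host_id"].astype(str)
df["latitude"] = df["latitude"].astype(str)
df["longitude"] = df["longitude"].astype(str)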

In [25]:
df.describe()

Unnamed: 0,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,38821.0,38821.0,38821.0,38821.0,38821.0,38821.0
mean,142.332526,5.86922,29.290255,1.373229,5.166611,114.886299
std,196.994756,17.389026,48.1829,1.680328,26.302954,129.52995
min,0.0,1.0,1.0,0.01,1.0,0.0
25%,69.0,1.0,3.0,0.19,1.0,0.0
50%,101.0,2.0,9.0,0.72,1.0,55.0
75%,170.0,4.0,33.0,2.02,2.0,229.0
max,10000.0,1250.0,629.0,58.5,327.0,365.0
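The gap between the 75th percentile of `price` (170) and its maximum (10000) in the summary above suggests a long right tail. One common rule of thumb, which the original notebook does not apply, flags values more than 1.5 × IQR beyond the quartiles. A sketch of that arithmetic using the quartiles reported above:

```python
# Quartiles of price copied from the describe() output above
q1, q3 = 69.0, 170.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this would be flagged
upper_fence = q3 + 1.5 * iqr    # values above this would be flagged

print(iqr, lower_fence, upper_fence)
```

The lower fence is negative, so under this rule only high prices can be flagged; on the real frame the corresponding mask would be `df["price"] > upper_fence`.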


In [29]:
df.nunique()

id                                38821
name                              38253
host_id                           30232
host_name                          9885
neighbourhood_group                   5
neighbourhood                       218
latitude                          17436
longitude                         13639
room_type                             3
price                               581
minimum_nights                       89
number_of_reviews                   393
last_review                        1764
reviews_per_month                   937
calculated_host_listings_count       47
availability_365                    366
dtype: int64

# Categorical

In [39]:
df["neighbourhood_group"].value_counts()
# Counts how many times each unique value appears in the column, most frequent first.

neighbourhood_group
Manhattan        16621
Brooklyn         16439
Queens            4572
Bronx              875
Staten Island      314
Name: count, dtype: int64

In [36]:
df["neighbourhood_group"].value_counts(normalize=True)
# normalize=True returns each value's share of the total as a fraction between 0 and 1 (the fractions sum to 1) instead of raw counts.

neighbourhood_group
Manhattan        0.428145
Brooklyn         0.423456
Queens           0.117771
Bronx            0.022539
Staten Island    0.008088
Name: proportion, dtype: float64

In [48]:
df["room_type"].value_counts(normalize=True)

room_type
Entire home/apt    0.523454
Private room       0.454754
Shared room        0.021792
Name: proportion, dtype: float64
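Counts and shares are often easier to read side by side, with the shares expressed as percentages. A small sketch on made-up room types (the `rooms` values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up room types (illustrative only)
rooms = pd.Series(
    ["Entire home/apt", "Private room", "Entire home/apt", "Shared room",
     "Private room", "Entire home/apt", "Entire home/apt", "Private room"],
    name="room_type",
)

counts = rooms.value_counts()
shares = rooms.value_counts(normalize=True)   # fractions that sum to 1

# Both Series share the same index, so they align row by row
summary = pd.DataFrame({
    "count": counts,
    "percent": (shares * 100).round(1),
})
print(summary)
```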

In [54]:
df["neighbourhood"].value_counts().reset_index()

Unnamed: 0,neighbourhood,count
0,Williamsburg,3163
1,Bedford-Stuyvesant,3141
2,Harlem,2204
3,Bushwick,1942
4,Hell's Kitchen,1528
...,...,...
213,Holliswood,2
214,New Dorp Beach,2
215,Richmondtown,1
216,Rossville,1


In [62]:
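# Each row of df is one listing, so number_of_hotels is the number of listings per neighbourhood.
# On pandas 2.x, reset_index() already yields the columns "neighbourhood" and "count".
df_n = df["neighbourhood"].value_counts().reset_index().rename(columns={"count": "number_of_hotels"})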

In [64]:
df_n

Unnamed: 0,neighbourhood,number_of_hotels
0,Williamsburg,3163
1,Bedford-Stuyvesant,3141
2,Harlem,2204
3,Bushwick,1942
4,Hell's Kitchen,1528
...,...,...
213,Holliswood,2
214,New Dorp Beach,2
215,Richmondtown,1
216,Rossville,1


In [66]:
df_n[df_n["number_of_hotels"]>1000]

Unnamed: 0,neighbourhood,number_of_hotels
0,Williamsburg,3163
1,Bedford-Stuyvesant,3141
2,Harlem,2204
3,Bushwick,1942
4,Hell's Kitchen,1528
5,East Village,1489
6,Upper West Side,1482
7,Upper East Side,1405
8,Crown Heights,1265
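The same filter can be applied to the counts Series directly, before turning it into a DataFrame. A sketch on made-up labels; the threshold of 2 and the column name `number_of_listings` are assumptions chosen for this illustration:

```python
import pandas as pd

# Made-up neighbourhood labels (illustrative only)
hoods = pd.Series(
    ["Harlem"] * 5 + ["Bushwick"] * 3 + ["Rossville"] * 1,
    name="neighbourhood",
)

counts = hoods.value_counts()    # Series indexed by neighbourhood
popular = counts[counts > 2]     # boolean mask on the counts themselves

# Name the index and the values explicitly so the result does not depend
# on how a given pandas version labels the output of value_counts()
popular_df = (
    popular.rename_axis("neighbourhood")
           .rename("number_of_listings")
           .reset_index()
)
print(popular_df)
```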


# Numerical

In [69]:
df["price"].value_counts(bins=5)
# bins=5 splits the full price range (0 to 10000) into 5 equal-width intervals
# and counts how many prices fall into each one, instead of counting exact values.
# pandas pads the lowest edge slightly so the minimum price is included,
# which is why the first interval starts below 0.

(-10.001, 2000.0]    38786
(2000.0, 4000.0]        20
(4000.0, 6000.0]         8
(8000.0, 10000.0]        5
(6000.0, 8000.0]         2
Name: count, dtype: int64

In [71]:
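# Custom, unequal-width bin edges: nearly all prices fall below 2000,
# so finer bins are used at the low end
bins = (0, 50, 100, 200, 500, 2000, 10000)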

In [73]:
df["price"].value_counts(bins=bins)

(50.0, 100.0]        14212
(100.0, 200.0]       13544
(200.0, 500.0]        5267
(-0.001, 50.0]        5176
(500.0, 2000.0]        587
(2000.0, 10000.0]       35
Name: count, dtype: int64
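When the binned variable is needed as a column rather than just a table of counts, `pd.cut` accepts the same edges and can attach readable labels. A sketch on made-up prices; the label strings are hypothetical and assume whole-number prices:

```python
import pandas as pd

# Made-up prices (illustrative only)
prices = pd.Series([0, 45, 50, 99, 150, 250, 800, 3000], name="price")

edges = [0, 50, 100, 200, 500, 2000, 10000]   # same edges as the cell above
labels = ["0-50", "51-100", "101-200", "201-500", "501-2000", "2001-10000"]

# Intervals are closed on the right; include_lowest=True also keeps
# a price of exactly 0 in the first bin
price_band = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)

# sort_index() orders the result by bin rather than by frequency
band_counts = price_band.value_counts().sort_index()
print(band_counts)
```

Because the intervals are right-closed, a price of exactly 50 lands in the first band, matching the `(-0.001, 50.0]` interval shown above.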

In [75]:
df["price"].mean()

142.33252621004095

In [77]:
df["price"].std()

196.99475591833985

In [79]:
df["price"].skew()

23.673594295123014

In [81]:
df["price"].kurt()

953.4807356344944
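A skewness of about 23.7 and a kurtosis of about 953 (pandas reports excess kurtosis, where a normal distribution scores 0) both point to a few extreme prices dominating the tail. A log transform is one common way to compress such a tail; the original notebook does not use it. A sketch on made-up prices with a single extreme value:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prices with one extreme value (illustrative only)
prices = pd.Series([40, 55, 60, 75, 90, 100, 120, 150, 200, 10000], name="price")

raw_skew = prices.skew()         # dominated by the single 10000 value
log_prices = np.log1p(prices)    # log(1 + x), defined even for a price of 0
log_skew = log_prices.skew()

print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

`np.log1p` is used rather than `np.log` because the dataset contains a minimum price of 0, for which a plain logarithm is undefined.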

In [87]:
# Pairwise Pearson correlation between the numeric columns
df.corr(numeric_only=True)


Unnamed: 0,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
price,1.0,0.025501,-0.035924,-0.030623,0.052895,0.078276
minimum_nights,0.025501,1.0,-0.069366,-0.121712,0.073474,0.101658
number_of_reviews,-0.035924,-0.069366,1.0,0.549699,-0.059796,0.193409
reviews_per_month,-0.030623,-0.121712,0.549699,1.0,-0.009442,0.185896
calculated_host_listings_count,0.052895,0.073474,-0.059796,-0.009442,1.0,0.182981
availability_365,0.078276,0.101658,0.193409,0.185896,0.182981,1.0
