# Data Cleaning & Exploration of Airbnb Listings

## Objective

Clean and explore an Airbnb dataset to understand pricing patterns, popular locations, and listing quality.

* Handling missing & duplicate values
* Detecting & fixing outliers
* Transforming & formatting data
* Exploratory Data Analysis (EDA)

In [16]:
import pandas as pd


In [17]:
df = pd.read_csv(r"C:\Users\K A L K I D A N\OneDrive\Desktop\projects\data set\Airbnb_Data.csv")
df.head()  # Show first 5 rows



| | id | NAME | host id | host_identity_verified | host name | neighbourhood group | neighbourhood | lat | long | country | ... | service fee | minimum nights | number of reviews | last review | reviews per month | review rate number | calculated host listings count | availability 365 | house_rules | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001254 | Clean & quiet apt home by the park | 80014485718 | unconfirmed | Madaline | Brooklyn | Kensington | 40.64749 | -73.97237 | United States | ... | $193 | 10.0 | 9.0 | 10/19/2021 | 0.21 | 4.0 | 6.0 | 286.0 | Clean up and treat the home the way you'd like... | |
| 1 | 1002102 | Skylit Midtown Castle | 52335172823 | verified | Jenna | Manhattan | Midtown | 40.75362 | -73.98377 | United States | ... | $28 | 30.0 | 45.0 | 5/21/2022 | 0.38 | 4.0 | 2.0 | 228.0 | Pet friendly but please confirm with me if the... | |
| 2 | 1002403 | THE VILLAGE OF HARLEM....NEW YORK ! | 78829239556 | | Elise | Manhattan | Harlem | 40.80902 | -73.9419 | United States | ... | $124 | 3.0 | 0.0 | | | 5.0 | 1.0 | 352.0 | I encourage you to use my kitchen, cooking and... | |
| 3 | 1002755 | | 85098326012 | unconfirmed | Garry | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | United States | ... | $74 | 30.0 | 270.0 | 7/5/2019 | 4.64 | 4.0 | 1.0 | 322.0 | | |
| 4 | 1003689 | Entire Apt: Spacious Studio/Loft by central park | 92037596077 | verified | Lyndon | Manhattan | East Harlem | 40.79851 | -73.94399 | United States | ... | $41 | 10.0 | 9.0 | 11/19/2018 | 0.1 | 3.0 | 1.0 | 289.0 | Please no smoking in the house, porch or on th... | |


In [18]:
# Get basic info about the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   NAME                            102349 non-null  object 
 2   host id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host name                       102193 non-null  object 
 5   neighbourhood group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  object 
 12  cancellation_pol

In [19]:
df.isnull().sum()

id                                     0
NAME                                 250
host id                                0
host_identity_verified               289
host name                            406
neighbourhood group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country code                         131
instant_bookable                     105
cancellation_policy                   76
room type                              0
Construction year                    214
price                                247
service fee                          273
minimum nights                       409
number of reviews                    183
last review                        15893
reviews per month                  15879
review rate number                   326
calculated host listings count       319
availability 365                     448
house_rules     
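
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch on a small made-up frame (the `toy` values are illustrative and not taken from the dataset) that ranks columns by the fraction of missing values:

```python
import pandas as pd

# Toy frame standing in for the listings data (values are made up)
toy = pd.DataFrame({
    "price": ["$100 ", None, "$250 ", "$80 "],
    "last review": ["10/19/2021", None, None, None],
    "room type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the notebook's frame the same expression would be `df.isnull().mean()`. For instance, 15893 missing `last review` values out of 102599 rows is roughly 15%.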

In [24]:
df.duplicated().sum()

541
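
`duplicated().sum()` only counts rows that exactly repeat an earlier row; it does not remove them. A minimal sketch of dropping such rows, shown on a made-up frame rather than the real data:

```python
import pandas as pd

# Toy frame with one fully duplicated row (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "NAME": ["A", "B", "B", "C"],
})

n_dupes = toy.duplicated().sum()  # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep the first occurrence
print(n_dupes, len(deduped))
```

Applied to the notebook's frame, this would be `df = df.drop_duplicates().reset_index(drop=True)`.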

In [29]:
# Remove dollar signs, thousands separators (e.g. '$1,060 ', if present), and
# whitespace, then convert to numeric; anything still unparseable becomes NaN
df['price'] = df['price'].replace({r'[\$,\s]': ''}, regex=True)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
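
A quick way to check a currency-cleaning step is to run it on a few hand-written strings. The sketch below uses a hypothetical helper, `clean_currency`, and made-up sample values. It also strips commas, on the assumption that some values carry a thousands separator.

```python
import pandas as pd

def clean_currency(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then convert to numbers (NaN if unparseable)."""
    stripped = series.str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

# Made-up samples: plain value, thousands separator, missing, non-numeric text
sample = pd.Series(["$193 ", "$1,060 ", None, "n/a"])
cleaned = clean_currency(sample)
print(cleaned.tolist())
```

With `errors="coerce"`, values that cannot be parsed become NaN without raising, so it is worth comparing the NaN count before and after cleaning.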


In [31]:
# Fill missing values for text columns.
# Assign the result back to the column: calling fillna(..., inplace=True) on a
# column selection raises a pandas FutureWarning and will stop working in pandas 3.0.
df['NAME'] = df['NAME'].fillna("Unknown")
df['host_identity_verified'] = df['host_identity_verified'].fillna("Unknown")
df['host name'] = df['host name'].fillna("Unknown")
df['neighbourhood'] = df['neighbourhood'].fillna("Unknown")
df['neighbourhood group'] = df['neighbourhood group'].fillna("Unknown")

# Fill latitude and longitude with the mean (or remove rows with missing lat/long)
df['lat'] = df['lat'].fillna(df['lat'].mean())
df['long'] = df['long'].fillna(df['long'].mean())

# Fill country and country code with "Unknown" or drop
df['country'] = df['country'].fillna("Unknown")
df['country code'] = df['country code'].fillna("Unknown")

# 'service fee' still holds strings such as '$193 ', so taking its median raises
# "TypeError: Cannot convert [...] to numeric". Strip the formatting first.
df['service fee'] = pd.to_numeric(
    df['service fee'].replace({r'[\$,\s]': ''}, regex=True), errors='coerce')

# Fill 'price' and 'service fee' with the median
df['price'] = df['price'].fillna(df['price'].median())
df['service fee'] = df['service fee'].fillna(df['service fee'].median())

# Fill 'minimum nights' with 1 (or a reasonable default)
df['minimum nights'] = df['minimum nights'].fillna(1)

# Fill missing review-related fields
df['reviews per month'] = df['reviews per month'].fillna(df['reviews per month'].mean())
df['review rate number'] = df['review rate number'].fillna(df['review rate number'].mean())
# 'last review' is a date string (e.g. 10/19/2021), so it has no mean;
# parse it as a date and leave missing values as NaT
df['last review'] = pd.to_datetime(df['last review'], format='%m/%d/%Y', errors='coerce')

# Fill 'availability 365' with the mean value
df['availability 365'] = df['availability 365'].fillna(df['availability 365'].mean())

# Drop 'house_rules' and 'license' due to many missing values (if not critical)
df = df.drop(columns=['house_rules', 'license'])
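
Several columns can also be filled in a single call by passing a dict to `DataFrame.fillna`, which avoids calling `fillna(..., inplace=True)` on a column selection. A minimal sketch on a made-up frame (the `toy` values and the `fill_values` name are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are made up)
toy = pd.DataFrame({
    "NAME": ["Loft", None, "Studio"],
    "minimum nights": [2.0, np.nan, 30.0],
    "availability 365": [100.0, 300.0, np.nan],
})

# One fill value per column; the mean is computed before filling
fill_values = {
    "NAME": "Unknown",
    "minimum nights": 1,
    "availability 365": toy["availability 365"].mean(),
}
toy = toy.fillna(fill_values)
print(toy)
```

Keeping the fill rules in one dict makes the choice for each column easy to review and change.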

In [22]:
# Display column names to check for any discrepancies
print(df.columns)


Index(['id', 'NAME', 'host id', 'host_identity_verified', 'host name',
       'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
       'country code', 'instant_bookable', 'cancellation_policy', 'room type',
       'Construction year', 'price', 'service fee', 'minimum nights',
       'number of reviews', 'last review', 'reviews per month',
       'review rate number', 'calculated host listings count',
       'availability 365', 'house_rules', 'license'],
      dtype='object')
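
The column names mix upper case, underscores, and spaces (`NAME`, `host id`, `host_identity_verified`), which makes typos in column lookups easy. One possible way to normalize them, sketched here on an empty made-up frame that uses a few of the names above:

```python
import pandas as pd

# Empty toy frame that only carries a few of the original column names
toy = pd.DataFrame(columns=[
    "NAME", "host id", "Construction year",
    "availability 365", "host_identity_verified",
])

# Trim, lower-case, and replace runs of whitespace with a single underscore
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(r"\s+", "_", regex=True)
)
print(list(toy.columns))
```

If the names are normalized this way, every later cell has to refer to the new names (for example `host_id` in place of `host id`).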
