# Analysis of Listings - Inside Airbnb Santiago

This notebook details the process of cleaning and preparing the `listings.csv` dataset from the Inside Airbnb project for further analysis. The focus is on:

- Removing redundant or irrelevant columns
- Handling missing values
- Converting and standardizing key features (especially pricing)
- Flagging outliers
- Preparing a clean dataset for use in the final analysis

## 1. Load and Inspect Data
- Load the dataset.
- Display shape and basic structure.
- Brief overview of columns and data types.


In [29]:
import pandas as pd

# Read the CSV file
df_listings = pd.read_csv('santiago/listings.csv.gz', compression='gzip')

# Display the first few rows of the DataFrame
display(df_listings.head())

# Check data types
print(df_listings.dtypes)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,49392,https://www.airbnb.com/rooms/49392,20241227033155,2024-12-27,city scrape,Share my Flat in Providencia,,,https://a0.muscache.com/pictures/3740612/b1850...,224592,...,,,,,f,1,0,1,0,
1,52811,https://www.airbnb.com/rooms/52811,20241227033155,2024-12-27,city scrape,Suite Providencia 1 Santiago Chile,Apartment located on the subway station Manuel...,Building located on the access to the Manuel M...,https://a0.muscache.com/pictures/miso/Hosting-...,244792,...,4.59,4.64,4.36,,t,3,3,0,0,0.26
2,53494,https://www.airbnb.com/rooms/53494,20241227033155,2024-12-27,city scrape,depto centro ski el colorado chile,,,https://a0.muscache.com/pictures/310936/ff7d53...,249097,...,4.88,4.79,4.69,,f,1,1,0,0,0.46
3,787045,https://www.airbnb.com/rooms/787045,20241227033155,2024-12-27,city scrape,right at home,"A few steps from metro station ""FERNANDO CASTI...","Metro Station "" FERNANDO CASTILLO"" (LINE 3)"" i...",https://a0.muscache.com/pictures/airflow/Hosti...,4134987,...,4.93,4.66,4.85,,f,2,0,2,0,1.01
4,795701,https://www.airbnb.com/rooms/795701,20241227033155,2024-12-27,city scrape,Lindo Depto 2 dormitorios,Nice and comfortable two-bedroom apartment. Fu...,"Centrally located by day works commercially, a...",https://a0.muscache.com/pictures/14703811/def2...,4191304,...,4.86,4.55,4.69,,f,2,2,0,0,0.2


id                                                int64
listing_url                                      object
scrape_id                                         int64
last_scraped                                     object
source                                           object
                                                 ...   
calculated_host_listings_count                    int64
calculated_host_listings_count_entire_homes       int64
calculated_host_listings_count_private_rooms      int64
calculated_host_listings_count_shared_rooms       int64
reviews_per_month                               float64
Length: 75, dtype: object


## 2. 🧹 Data Cleaning

* The DataFrame has 75 columns, so the default `dtypes` output is truncated and doesn't show every column. Looping over the columns prints the data type of each one.
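As an alternative to a loop, `Series.to_string()` renders every row without truncation. A minimal sketch on a made-up 80-column frame (the `col_0`, `col_1`, ... names are placeholders, not columns of the real dataset):

```python
import pandas as pd

# Toy frame with 80 columns standing in for df_listings (hypothetical names)
toy = pd.DataFrame({f"col_{i}": [i] for i in range(80)})

# to_string() renders the whole Series instead of the truncated default view
full_dtypes = toy.dtypes.to_string()
print(full_dtypes)
```

The loop and `to_string()` show the same information; the loop is easier to adapt if the output format needs to change.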



In [30]:
# Check data types of the columns using a loop
for col, dtype in df_listings.dtypes.items():
    print(f"{col}: {dtype}")

id: int64
listing_url: object
scrape_id: int64
last_scraped: object
source: object
name: object
description: object
neighborhood_overview: object
picture_url: object
host_id: int64
host_url: object
host_name: object
host_since: object
host_location: object
host_about: object
host_response_time: object
host_response_rate: object
host_acceptance_rate: object
host_is_superhost: object
host_thumbnail_url: object
host_picture_url: object
host_neighbourhood: object
host_listings_count: int64
host_total_listings_count: int64
host_verifications: object
host_has_profile_pic: object
host_identity_verified: object
neighbourhood: object
neighbourhood_cleansed: object
neighbourhood_group_cleansed: float64
latitude: float64
longitude: float64
property_type: object
room_type: object
accommodates: int64
bathrooms: float64
bathrooms_text: object
bedrooms: float64
beds: float64
amenities: object
price: object
minimum_nights: int64
maximum_nights: int64
minimum_minimum_nights: int64
maximum_minimum_night

### 2.1 Drop Irrelevant Columns
Remove columns that are too verbose, redundant, or not useful for our analysis (e.g., scrape metadata, picture URLs, host thumbnails).


In [31]:
# Drop unnecessary columns
columns_to_drop = [
    'scrape_id',
    'last_scraped',
    'source',
    'picture_url',
    'neighborhood_overview', # Unstructured text, not useful for analysis
    'neighbourhood_group_cleansed', # This column has no data
    'host_has_profile_pic',
    'host_location',
    'host_about',
    'host_thumbnail_url',
    'host_picture_url',
    'host_neighbourhood',
    'host_verifications',
    'calendar_updated',
    'calendar_last_scraped',
    'license'
]
# Drop the columns
df_listings.drop(columns=columns_to_drop, inplace=True)

### 2.2 Min and max nights columns

There are multiple columns regarding the minimum and maximum nights:

- **`minimum_nights`**: Minimum nights required to book. Useful for filtering listings that only accept longer stays.

- **`maximum_nights`**: Maximum nights allowed per booking. Helps understand stay limits.

- **`minimum_minimum_nights`**: Smallest value of `minimum_nights` across 12 months. Very specific and probably not needed unless doing a time-series analysis.

- **`maximum_minimum_nights`**: Largest value of `minimum_nights` across 12 months. Same as above.

- **`minimum_maximum_nights`**: Smallest value of `maximum_nights` across 12 months. Too granular for most analyses.

- **`maximum_maximum_nights`**: Largest value of `maximum_nights` across 12 months. Same as above.

- **`minimum_nights_avg_ntm`**: Average `minimum_nights` over the next twelve months (`ntm`). Very specific.

- **`maximum_nights_avg_ntm`**: Average `maximum_nights` over the next twelve months. Same as above.


So I'll keep only the first two columns and drop the other six.
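Before dropping these columns, one could check how often the calendar-derived values actually differ from the base column. A small sketch on made-up rows (the numbers are invented; only the column names come from the dataset):

```python
import pandas as pd

# Made-up rows; column names match the dataset
toy = pd.DataFrame({
    "minimum_nights":         [1, 2, 30, 3],
    "minimum_minimum_nights": [1, 2, 30, 2],
    "maximum_minimum_nights": [1, 2, 30, 5],
})

# Share of listings where each calendar-derived column differs from the base value
calendar_cols = ["minimum_minimum_nights", "maximum_minimum_nights"]
share_different = {
    col: float((toy[col] != toy["minimum_nights"]).mean())
    for col in calendar_cols
}
print(share_different)
```

A low share on the real data would support the decision to keep only `minimum_nights` and `maximum_nights`.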

In [32]:
# Drop min/max night columns
columns_to_drop_min_max = [
    'minimum_minimum_nights',
    'maximum_minimum_nights',
    'minimum_maximum_nights',
    'maximum_maximum_nights',
    'minimum_nights_avg_ntm',
    'maximum_nights_avg_ntm'
]
# Drop the columns
df_listings.drop(columns=columns_to_drop_min_max, inplace=True)

### 2.3 Convert Data Types

Convert price and boolean fields to appropriate types.
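To illustrate the price conversion, here is a small sketch on made-up strings. It assumes the raw values look like `$45,000.00`, and it uses `pd.to_numeric(..., errors="coerce")`, which is a more tolerant variant than a plain `astype(float)` because unparseable entries become `NaN` instead of raising an error:

```python
import pandas as pd

# Made-up price strings ('$' prefix, ',' as thousands separator); the format is an assumption
raw_price = pd.Series(["$45,000.00", "$7,762.00", None, "n/a"])

clean_price = pd.to_numeric(
    raw_price.str.replace(r"[$,]", "", regex=True),  # strip '$' and ','
    errors="coerce",                                  # unparseable values become NaN
)
print(clean_price.tolist())
```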



In [33]:
# Check dtypes in columns

for col in df_listings.select_dtypes(include='object').columns:
    print(f"==> {col}")
    print(df_listings[col].map(type).value_counts())
    print()


==> listing_url
listing_url
<class 'str'>    15051
Name: count, dtype: int64

==> name
name
<class 'str'>    15051
Name: count, dtype: int64

==> description
description
<class 'str'>      14533
<class 'float'>      518
Name: count, dtype: int64

==> host_url
host_url
<class 'str'>    15051
Name: count, dtype: int64

==> host_name
host_name
<class 'str'>    15051
Name: count, dtype: int64

==> host_since
host_since
<class 'str'>    15051
Name: count, dtype: int64

==> host_response_time
host_response_time
<class 'str'>      12180
<class 'float'>     2871
Name: count, dtype: int64

==> host_response_rate
host_response_rate
<class 'str'>      12180
<class 'float'>     2871
Name: count, dtype: int64

==> host_acceptance_rate
host_acceptance_rate
<class 'str'>      12636
<class 'float'>     2415
Name: count, dtype: int64

==> host_is_superhost
host_is_superhost
<class 'str'>      14719
<class 'float'>      332
Name: count, dtype: int64

==> host_identity_verified
host_identity_verified
<cl

In [34]:
# Change data types of specific columns

df_listings['listing_url'] = df_listings['listing_url'].astype('string')
df_listings['name'] = df_listings['name'].astype('string')
df_listings['description'] = df_listings['description'].astype('string')
df_listings['host_url'] = df_listings['host_url'].astype('string')
df_listings['host_name'] = df_listings['host_name'].astype('string')
df_listings['host_since'] = pd.to_datetime(df_listings['host_since'], format='%Y-%m-%d')
df_listings['host_response_time'] = df_listings['host_response_time'].astype('string')
df_listings['host_response_rate'] = (
    df_listings['host_response_rate']
    .str.rstrip('%')         # Remove the '%' symbol
    .astype(float) / 100     # Convert to float and scale to 0-1
)
df_listings['host_acceptance_rate'] = (
    df_listings['host_acceptance_rate']
    .str.rstrip('%')         # Remove the '%' symbol
    .astype(float) / 100     # Convert to float and scale to 0-1
)
df_listings['host_is_superhost'] = (
    df_listings['host_is_superhost']
    .map({'t': True, 'f': False, 'True': True, 'False': False})
    .astype('boolean')
)
df_listings['host_identity_verified'] = (
    df_listings['host_identity_verified']
    .map({'t': True, 'f': False, 'True': True, 'False': False})
    .astype('boolean')
)
df_listings['neighbourhood'] = df_listings['neighbourhood'].astype('string')
df_listings['neighbourhood_cleansed'] = df_listings['neighbourhood_cleansed'].astype('string')
df_listings['property_type'] = df_listings['property_type'].astype('category')
df_listings['room_type'] = df_listings['room_type'].astype('category')
df_listings['bathrooms_text'] = df_listings['bathrooms_text'].astype('string')

df_listings['price'] = (
    df_listings['price']
    .str.replace('[$,]', '', regex=True)  # Remove '$' and ',' symbols
    .astype(float)                        # Convert to float
)
df_listings['has_availability'] = (
    df_listings['has_availability']
    .map({'t': True, 'f': False, 'True': True, 'False': False})
    .astype('boolean')
)
df_listings['first_review'] = pd.to_datetime(df_listings['first_review'], format='%Y-%m-%d')
df_listings['last_review'] = pd.to_datetime(df_listings['last_review'], format='%Y-%m-%d')
df_listings['instant_bookable'] = (
    df_listings['instant_bookable']
    .map({'t': True, 'f': False, 'True': True, 'False': False})
    .astype('boolean')
)
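The same `'t'`/`'f'` mapping is repeated for four columns above. A possible way to reduce the repetition is a small helper; `to_boolean` and `FLAG_MAP` are hypothetical names introduced here for illustration, and the rows are made up:

```python
import pandas as pd

# Hypothetical helper: map 't'/'f' flags to pandas' nullable boolean dtype
FLAG_MAP = {"t": True, "f": False, "True": True, "False": False}

def to_boolean(series: pd.Series) -> pd.Series:
    """Return a nullable boolean Series; unmapped or missing values become <NA>."""
    return series.map(FLAG_MAP).astype("boolean")

toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", None],
    "instant_bookable": ["f", "f", "t"],
})
for col in ["host_is_superhost", "instant_bookable"]:
    toy[col] = to_boolean(toy[col])

print(toy.dtypes)
```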

#### Observations:

Some columns hold lists (like `amenities`), but in the DataFrame they are stored as strings with dtype `object`. I learned that `ast.literal_eval` can convert them to real Python lists.

In [35]:
# Convert 'amenities' column from string to list using ast.literal_eval
import ast
df_listings['amenities'] = df_listings['amenities'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else [])
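A quick sketch of what `ast.literal_eval` does with a single cell. The string below is made up and assumes the cell is a list serialized as text with double-quoted items, in which case `json.loads` gives the same result:

```python
import ast
import json

# Made-up cell value: a list serialized as text (format is an assumption)
raw_amenities = '["Wifi", "Kitchen", "Free parking"]'

# literal_eval only evaluates Python literals, never arbitrary code
parsed = ast.literal_eval(raw_amenities)
print(type(parsed).__name__, len(parsed))

# The same text is also valid JSON, so json.loads agrees
same_as_json = json.loads(raw_amenities) == parsed
```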


In [36]:
# Check again the data types after conversion
for col, dtype in df_listings.dtypes.items():
    print(f"{col}: {dtype}")

id: int64
listing_url: string
name: string
description: string
host_id: int64
host_url: string
host_name: string
host_since: datetime64[ns]
host_response_time: string
host_response_rate: float64
host_acceptance_rate: float64
host_is_superhost: boolean
host_listings_count: int64
host_total_listings_count: int64
host_identity_verified: boolean
neighbourhood: string
neighbourhood_cleansed: string
latitude: float64
longitude: float64
property_type: category
room_type: category
accommodates: int64
bathrooms: float64
bathrooms_text: string
bedrooms: float64
beds: float64
amenities: object
price: float64
minimum_nights: int64
maximum_nights: int64
has_availability: boolean
availability_30: int64
availability_60: int64
availability_90: int64
availability_365: int64
number_of_reviews: int64
number_of_reviews_ltm: int64
number_of_reviews_l30d: int64
first_review: datetime64[ns]
last_review: datetime64[ns]
review_scores_rating: float64
review_scores_accuracy: float64
review_scores_cleanliness: fl

### 2.4 Handle missing values

Drop rows or columns with too many missing values and fill or impute others as needed.
First I'll check the columns with missing values.


In [37]:
# Check for missing values
missing_values = df_listings.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])

Missing values in each column:
description                     518
host_response_time             2871
host_response_rate             2871
host_acceptance_rate           2415
host_is_superhost               332
neighbourhood                  9463
bathrooms                      1998
bathrooms_text                   24
bedrooms                        794
beds                           2007
price                          1989
has_availability               1081
first_review                   3244
last_review                    3244
review_scores_rating           3244
review_scores_accuracy         3244
review_scores_cleanliness      3244
review_scores_checkin          3244
review_scores_communication    3244
review_scores_location         3245
review_scores_value            3244
reviews_per_month              3244
dtype: int64
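Raw counts are easier to compare as a share of all rows. A minimal sketch on a made-up frame (the values are invented; the technique would apply unchanged to `df_listings`):

```python
import numpy as np
import pandas as pd

# Made-up frame; only the technique carries over to df_listings
toy = pd.DataFrame({
    "price": [50000.0, np.nan, 42000.0, np.nan],
    "beds": [1.0, 2.0, np.nan, 1.0],
    "accommodates": [2, 4, 2, 3],
})

# isna().mean() gives the fraction of missing rows per column
missing_pct = toy.isna().mean().mul(100).round(1)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```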


#### Observation

There are some relevant columns with missing values:
- **`price`**: 1989 nulls
- **review_scores** and **dates of reviews**: 3244 nulls (probably corresponding to new listings without reviews)
- **beds** and **bathrooms**: This data is entered by the host, and in some cases they may not fill it in correctly, so the missing data in these columns is not necessarily relevant for the analysis.
- **host_response_time**, **host_response_rate**, **host_acceptance_rate**: These may also correspond to new hosts, which would explain why they are empty.

*Price is relevant for the analysis, so let's look at it more closely.*

In [38]:
# Check listings with missing price
missing_price_listings = df_listings[df_listings['price'].isnull()]
print("Listings with missing price:")
print(missing_price_listings[['listing_url', 'price', 'has_availability']])

Listings with missing price:
                                            listing_url  price  \
9                   https://www.airbnb.com/rooms/844274    NaN   
11                   https://www.airbnb.com/rooms/65058    NaN   
12                   https://www.airbnb.com/rooms/73752    NaN   
13                   https://www.airbnb.com/rooms/80482    NaN   
17                   https://www.airbnb.com/rooms/95139    NaN   
...                                                 ...    ...   
14315  https://www.airbnb.com/rooms/1299604714927560324    NaN   
14321  https://www.airbnb.com/rooms/1298759807866003043    NaN   
14337  https://www.airbnb.com/rooms/1298830859711843500    NaN   
14339  https://www.airbnb.com/rooms/1298876065698878881    NaN   
14341  https://www.airbnb.com/rooms/1298884979272174903    NaN   

       has_availability  
9                  <NA>  
11                 <NA>  
12                 <NA>  
13                 <NA>  
17                 <NA>  
...                 ..

#### Observations

After opening some of the listings with a missing price, I can see that they don't have any dates available, so they are probably not active listings.

Considering that price is an important value for future analysis, and that some of these listings are probably not available, I will filter those rows out of the DataFrame.
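The manual check could be complemented with a cross-tabulation of "price is missing" against "no available nights". This is a sketch on made-up rows; it assumes `availability_365 == 0` is a reasonable proxy for a listing with no open dates:

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the dataset
toy = pd.DataFrame({
    "price": [np.nan, np.nan, 45000.0, 60000.0, np.nan],
    "availability_365": [0, 0, 120, 0, 30],
})

# Rows: is the price missing? Columns: does the listing have zero available nights?
check = pd.crosstab(
    toy["price"].isna().rename("price_missing"),
    (toy["availability_365"] == 0).rename("no_availability"),
)
print(check)
```

If most rows with a missing price also fall in the `no_availability` column on the real data, that would back up the decision to drop them.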

In [39]:
# drop rows with missing price
df_listings.dropna(subset=['price'], inplace=True)

For the rest of the columns with missing values:

- Fill with 'No description' or 'Unknown':
    - description
    - neighbourhood
    - host_response_time
- Fill with 0.0:
    - host_response_rate
    - host_acceptance_rate

The remaining columns are left as they are, because their missing values don't affect the analysis.
For example, bathrooms and beds are entered by the host.
For the review columns, the listings are probably new and don't have reviews yet.


In [40]:
# Fill missing values in description and neighbourhood with placeholder text
df_listings['description'] = df_listings['description'].fillna('No description')
df_listings['neighbourhood'] = df_listings['neighbourhood'].fillna('Unknown')

# Fill missing response time with 'Unknown' and missing rates with 0.0
df_listings['host_response_time'] = df_listings['host_response_time'].fillna('Unknown')
df_listings['host_response_rate'] = df_listings['host_response_rate'].fillna(0.0)
df_listings['host_acceptance_rate'] = df_listings['host_acceptance_rate'].fillna(0.0)

# Fill missing values in host name with 'Unknown'
df_listings['host_name'] = df_listings['host_name'].fillna('Unknown')
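Filling the rates with 0.0 makes a missing rate indistinguishable from a host who never responds. One option, sketched here on made-up data, is to record which rows were missing before filling them. The column name `host_response_rate_missing` is hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up rows; 'host_response_rate_missing' is a hypothetical column name
toy = pd.DataFrame({"host_response_rate": [1.0, np.nan, 0.8, np.nan]})

# Record which rows were missing before overwriting them with 0.0
toy["host_response_rate_missing"] = toy["host_response_rate"].isna()
toy["host_response_rate"] = toy["host_response_rate"].fillna(0.0)

# Average over hosts that actually have a rate, compared with the filled column
observed_mean = toy.loc[~toy["host_response_rate_missing"], "host_response_rate"].mean()
filled_mean = toy["host_response_rate"].mean()
print(observed_mean, filled_mean)
```

The gap between the two means shows how much the filled zeros can pull an average down.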


## 3. Analysis of prices

Here I'll start a general analysis of the prices to check if some extra transformations are needed.

### 3.1 Summary

Here we will explore the price range.

In [41]:
# Summary statistics for price
print(df_listings['price'].describe())
print("99th percentile:", df_listings['price'].quantile(0.99))

count    1.306200e+04
mean     8.535579e+04
std      9.882005e+05
min      7.762000e+03
25%      3.156425e+04
50%      4.385700e+04
75%      6.500000e+04
max      9.692788e+07
Name: price, dtype: float64
99th percentile: 500000.0


#### Observations

- Prices range from 7,762 CLP to about 97 million CLP. The lowest price seems reasonable, but the highest is definitely an outlier or a data entry mistake.
- The average price is about 85,356 CLP, almost twice the median of 43,857 CLP, so the mean is pulled up by a few very expensive listings.
- The standard deviation is about 988,200 CLP, more than ten times the mean. This shows a huge spread in prices, driven by very large outliers.
- 99% of the listings have a price of 500,000 CLP or less.
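For comparison with the 99th-percentile cutoff used in this notebook, a common alternative is Tukey's IQR rule. This is not the method applied here, just a sketch on made-up prices to show how such a fence is computed:

```python
import pandas as pd

# Made-up nightly prices in CLP, including one extreme value
prices = pd.Series([30000, 40000, 45000, 50000, 65000, 97000000], dtype="float64")

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule; 1.5 is the conventional multiplier

outliers = prices[prices > upper_fence]
print(f"median={prices.median():,.0f}  mean={prices.mean():,.0f}  upper fence={upper_fence:,.0f}")
```

On skewed price data the IQR fence is usually much stricter than a 99th-percentile cutoff, so it would flag many more listings.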


### 3.2 Visualizing Price Distribution

We'll plot a histogram of prices up to the 99th percentile (500,000 CLP).

In [None]:
# Histogram of the 'price' column (max 500,000 CLP)

import matplotlib.pyplot as plt

# Filter prices to a maximum of 500,000 CLP
reasonable_prices = df_listings[df_listings['price'] <= 500000]['price']

plt.figure(figsize=(10, 6))
plt.hist(reasonable_prices, bins=100, color='blue', alpha=0.7)
plt.title('Distribution of Prices in Santiago Listings (up to 500,000 CLP)')
plt.xlabel('Price')
plt.ylabel('Number of Listings')
plt.yscale('log')  # Use logarithmic scale for better visibility
plt.show()


And now a histogram of prices up to the 75th percentile:

In [None]:
# Histogram of the 'price' column (75% of listings)

import matplotlib.pyplot as plt

# Filter prices to a maximum of the 75th percentile
prices_up_to_p75 = df_listings[df_listings['price'] <= df_listings['price'].quantile(0.75)]['price']

plt.figure(figsize=(10, 6))
plt.hist(prices_up_to_p75, bins=50, color='blue', alpha=0.7)
plt.title('Distribution of Prices in Santiago Listings (up to 65,000 CLP)')
plt.xlabel('Price')
plt.ylabel('Number of Listings')
plt.show()
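Because prices span several orders of magnitude, another option is to bin them on a logarithmic scale instead of cutting the range. A sketch with made-up prices, using NumPy only; the bin range of 1,000 to 100,000,000 CLP is an assumption chosen to cover the values seen above:

```python
import numpy as np

# Made-up nightly prices in CLP spanning several orders of magnitude
prices = np.array([8000, 30000, 45000, 65000, 500000, 97000000], dtype=float)

# 20 bins with edges evenly spaced on a log scale (4 bins per factor of 10)
edges = np.geomspace(1e3, 1e8, num=21)
counts, _ = np.histogram(prices, bins=edges)
print(counts)
```

The same `edges` array can be passed as `bins=edges` to `plt.hist`, together with `plt.xscale('log')`, to draw the full distribution in a single plot.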

Here I should add some observations about the histograms.

I was curious to see what kind of listings had prices over 500,000 CLP per night, which is very expensive for Santiago, even for a luxury place.
So I printed the listing URLs to check some of them on the Airbnb website.

In [42]:
# Check listings with price greater than 500000 

high_price_listings = df_listings[df_listings['price'] > 500000]
print("Listings with price greater than 500000:")
print(high_price_listings[['listing_url', 'price']])

Listings with price greater than 500000:
                                            listing_url       price
144                 https://www.airbnb.com/rooms/529207    593436.0
237                https://www.airbnb.com/rooms/4539250  18609607.0
298                https://www.airbnb.com/rooms/3694354    593436.0
313                https://www.airbnb.com/rooms/3826466   1186872.0
368                https://www.airbnb.com/rooms/5418047    593436.0
...                                                 ...         ...
13266  https://www.airbnb.com/rooms/1259753279420806992    767520.0
13798  https://www.airbnb.com/rooms/1282609732909070860   1000000.0
14013  https://www.airbnb.com/rooms/1288340241426105476    653907.0
14125  https://www.airbnb.com/rooms/1291230327012921884   3485714.0
14479  https://www.airbnb.com/rooms/1304327791338910289    642857.0

[125 rows x 2 columns]


#### Observations

There are 125 listings where the price is over 500,000 CLP per night.
I randomly chose a few of them.

For example:
Row 237: The price is also high on the website, but the description says the host is looking for a roommate to rent a room in the apartment for 280,000 CLP per month. The price per night, however, is listed as 18 million CLP.

Row 13266: This is an apartment in Valle Nevado, an upscale area up in the mountains next to a ski center.

Row 14125: This one must be a mistake, because the listing on the website has a price per night of around 65,000 CLP.

Since there are 125 rows over 500,000 CLP, I can't check them one by one. Some of them may be real prices (like a luxury apartment in Valle Nevado), while others are mistakes.
For further analysis, I will flag these rows, but I won't delete them from the DataFrame.


### 3.3 Flag outliers

To preserve the integrity of the data, I decided not to delete the rows with extremely high prices, but to flag them instead. Some may be real prices, so I want to keep them in the analysis.

In [44]:
# Flag listings with price greater than 500,000 CLP
df_listings['high_price_flag'] = df_listings['price'] > 500000

# Print the number of listings with high price flag
print(f"Number of listings with high price flag: {df_listings['high_price_flag'].sum()}")

Number of listings with high price flag: 125
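One way the flag could be used later is to compare statistics with and without the flagged rows. A sketch on made-up prices, using the same 500,000 CLP threshold:

```python
import pandas as pd

# Made-up rows using the notebook's column names
toy = pd.DataFrame({"price": [30000.0, 45000.0, 65000.0, 18609607.0]})
toy["high_price_flag"] = toy["price"] > 500000

# Compare the mean with and without the flagged rows
mean_all = toy["price"].mean()
mean_unflagged = toy.loc[~toy["high_price_flag"], "price"].mean()
print(f"mean (all rows): {mean_all:,.0f}")
print(f"mean (unflagged): {mean_unflagged:,.0f}")
```

Keeping the rows and filtering on the flag at analysis time leaves both views available.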


## 4. Final transformations

The DataFrame still contains many columns, so now I will go through them to decide which to keep or drop. I also need to review the column names.

In [379]:
# check neighbourhoods and neighbourhood_cleansed
print("Unique neighbourhoods:")
print(df_listings['neighbourhood'].unique())
print("Unique neighbourhood_cleansed:")
print(df_listings['neighbourhood_cleansed'].unique())

Unique neighbourhoods:
<StringArray>
[                                                    'Unknown',
                    'Providencia, Región Metropolitana, Chile',
               'Santiago, Santiago Metropolitan Region, Chile',
                       'Santiago, Región Metropolitana, Chile',
          'Providencia, Santiago, Región Metropolitana, Chile',
                     'Las Condes, Región Metropolitana, Chile',
                       'Recoleta, Región Metropolitana, Chile',
                       'La Reina, Región Metropolitana, Chile',
             'Las Condes Santiago, Metropolitan Region, Chile',
                                             'La Parva, Chile',
 ...
 'Las Condes, Santiago de Chile , Región Metropolitana, Chile',
                'Calera de Tango, Región Metropolitana, Chile',
                'Ñuñoa, Santiago, Región Metropolitana, Chile',
            'Pedro Aguirre Cerda, Región Metropolitana, Chile',
                      'Cerrillos, Región Metropolitana, Chile'

After checking the `neighbourhood` and `neighbourhood_cleansed` columns, I will keep only the cleansed one, because it is easier to work with. All the information in the dataset corresponds to Santiago, so I don't need the region and country.

In [380]:
# drop neighbourhood column
df_listings.drop(columns=['neighbourhood'], inplace=True)


Based on the comparison below, I'll keep `calculated_host_listings_count` and drop the other host listing counts.


| Column                           | Description                                                                                   |
| -------------------------------- | --------------------------------------------------------------------------------------------- |
| `host_listings_count`            | Taken from the host's Airbnb profile; how it is calculated is not documented                  |
| `calculated_host_listings_count` | Counted from the listings in this scrape, so likely more reliable as it's derived from actual data |


In [381]:
# Drop host_listings_count, host_total_listings_count
df_listings.drop(columns=['host_listings_count', 'host_total_listings_count'], inplace=True)


In [382]:
# Check the DataFrame after dropping columns
print("DataFrame after dropping unnecessary columns:")
print(df_listings.info())

DataFrame after dropping unnecessary columns:
<class 'pandas.core.frame.DataFrame'>
Index: 13062 entries, 0 to 15050
Data columns (total 51 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   id                                            13062 non-null  int64         
 1   listing_url                                   13062 non-null  string        
 2   name                                          13062 non-null  string        
 3   description                                   13062 non-null  string        
 4   host_id                                       13062 non-null  int64         
 5   host_url                                      13062 non-null  string        
 6   host_name                                     13062 non-null  string        
 7   host_since                                    13062 non-null  datetime64[ns]
 8   host_response_time       

Lastly, I will rename some columns.

In [383]:
# Rename columns for consistency

df_listings.rename(columns={
    "id": "listing_id",
    "price": "price_clp",
    "host_is_superhost": "is_superhost",
    "calculated_host_listings_count": "host_total_listings_count",
    "calculated_host_listings_count_entire_homes": "host_entire_home_count",
    "calculated_host_listings_count_private_rooms": "host_private_room_count",
    "calculated_host_listings_count_shared_rooms": "host_shared_room_count"
}, inplace=True)


One last thing I just remembered: I will add a calculated price in US dollars, so the prices are easier to understand for people who are not familiar with the Chilean peso.

In [384]:
# Add a price_usd column right after price_clp

# Fixed CLP-to-USD rate; approximate, update as needed
conversion_rate = 0.0011

df_listings['price_usd'] = df_listings['price_clp'] * conversion_rate

# Reorder columns to place price_usd after price_clp
columns = df_listings.columns.tolist()
price_clp_index = columns.index('price_clp')
columns.insert(price_clp_index + 1, columns.pop(columns.index('price_usd')))
df_listings = df_listings[columns]
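A shorter way to get the same column order is `DataFrame.insert`, which adds the column at a given position in one step. A sketch on made-up rows; `CLP_TO_USD` is a hypothetical constant that mirrors the fixed rate above:

```python
import pandas as pd

# Made-up rows; CLP_TO_USD mirrors the fixed rate used in the notebook
CLP_TO_USD = 0.0011
toy = pd.DataFrame({
    "listing_id": [1, 2],
    "price_clp": [50000.0, 100000.0],
    "minimum_nights": [1, 3],
})

# Insert price_usd immediately after price_clp
position = toy.columns.get_loc("price_clp") + 1
toy.insert(position, "price_usd", (toy["price_clp"] * CLP_TO_USD).round(2))
print(toy.columns.tolist())
```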


In [387]:
# Check DataFrame after renaming columns
print("DataFrame after renaming columns:")

print(df_listings.info())

DataFrame after renaming columns:
<class 'pandas.core.frame.DataFrame'>
Index: 13062 entries, 0 to 15050
Data columns (total 52 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   listing_id                   13062 non-null  int64         
 1   listing_url                  13062 non-null  string        
 2   name                         13062 non-null  string        
 3   description                  13062 non-null  string        
 4   host_id                      13062 non-null  int64         
 5   host_url                     13062 non-null  string        
 6   host_name                    13062 non-null  string        
 7   host_since                   13062 non-null  datetime64[ns]
 8   host_response_time           13062 non-null  string        
 9   host_response_rate           13062 non-null  float64       
 10  host_acceptance_rate         13062 non-null  float64       
 11  is_superhost

In [388]:
# Save the cleaned DataFrame to a new CSV file (not compressed)
import os
output_dir = 'cleaned_data'
os.makedirs(output_dir, exist_ok=True)
df_listings.to_csv(os.path.join(output_dir, 'listings_cleaned.csv'), index=False)
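CSV stores everything as text, so the dtypes set earlier (datetimes, categories, lists) are not preserved when the file is read back. A sketch of one way to restore them on load, using a made-up frame and an in-memory buffer in place of `listings_cleaned.csv`:

```python
import ast
import io

import pandas as pd

# Made-up frame with the kinds of columns the cleaned dataset contains
toy = pd.DataFrame({
    "host_since": pd.to_datetime(["2015-03-01", "2020-11-15"]),
    "room_type": pd.Categorical(["Entire home/apt", "Private room"]),
    "amenities": [["Wifi", "Kitchen"], []],
})

buffer = io.StringIO()  # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore dtypes explicitly when reading the file back
reloaded = pd.read_csv(
    buffer,
    parse_dates=["host_since"],
    dtype={"room_type": "category"},
    converters={"amenities": ast.literal_eval},
)
print(reloaded.dtypes)
```

A format that keeps dtypes, such as Parquet, would avoid this step, at the cost of an extra dependency.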
