# Capstone 3: Data Wrangling
## By: Pedro Rodriguez

I will import the listings data obtained from Kaggle to see which features it contains and to determine which ones are needed to understand how the cleaning fee affects monthly bookings.

In [1]:
import pandas as pd
import numpy as np

airbnb = pd.read_csv('/Users/pedrorodriguez/Desktop/Springboard/Capstone_3/Raw Data/listings.csv')

airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 92 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                3818 non-null   int64  
 1   listing_url                       3818 non-null   object 
 2   scrape_id                         3818 non-null   int64  
 3   last_scraped                      3818 non-null   object 
 4   name                              3818 non-null   object 
 5   summary                           3641 non-null   object 
 6   space                             3249 non-null   object 
 7   description                       3818 non-null   object 
 8   experiences_offered               3818 non-null   object 
 9   neighborhood_overview             2786 non-null   object 
 10  notes                             2212 non-null   object 
 11  transit                           2884 non-null   object 
 12  thumbn

### Let's identify which features are necessary to determine how the cleaning fee impacts the monthly booking, and then create a new DataFrame. 

In [2]:
df = airbnb[['id', 'zipcode', 'latitude','longitude', 'property_type',
             'price', 'cleaning_fee', 'availability_30',
             'review_scores_rating', 'review_scores_cleanliness']]
df.head()

|   | id | zipcode | latitude | longitude | property_type | price | cleaning_fee | availability_30 | review_scores_rating | review_scores_cleanliness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | 98119 | 47.636289 | -122.371025 | Apartment | $85.00 | NaN | 14 | 95.0 | 10.0 |
| 1 | 953595 | 98119 | 47.639123 | -122.365666 | Apartment | $150.00 | $40.00 | 13 | 96.0 | 10.0 |
| 2 | 3308979 | 98119 | 47.629724 | -122.369483 | House | $975.00 | $300.00 | 1 | 97.0 | 10.0 |
| 3 | 7421966 | 98119 | 47.638473 | -122.369279 | Apartment | $100.00 | NaN | 0 | NaN | NaN |
| 4 | 278830 | 98119 | 47.632918 | -122.372471 | House | $450.00 | $125.00 | 30 | 92.0 | 9.0 |


## Let's start cleaning the new DataFrame. 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         3818 non-null   int64  
 1   zipcode                    3811 non-null   object 
 2   latitude                   3818 non-null   float64
 3   longitude                  3818 non-null   float64
 4   property_type              3817 non-null   object 
 5   price                      3818 non-null   object 
 6   cleaning_fee               2788 non-null   object 
 7   availability_30            3818 non-null   int64  
 8   review_scores_rating       3171 non-null   float64
 9   review_scores_cleanliness  3165 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 298.4+ KB
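
The non-null counts above can also be expressed directly as missing-value counts and shares. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not the real listings data); on the notebook's `df` the same two calls would apply.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the selected columns (values are illustrative only).
toy = pd.DataFrame({
    "cleaning_fee": ["$40.00", np.nan, "$300.00", np.nan],
    "review_scores_rating": [95.0, 96.0, np.nan, 92.0],
    "availability_30": [14, 13, 1, 0],
})

# isna() marks missing cells; sum() counts them and mean() gives their share.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean().round(2),
})
print(summary)
```

Looking at the share alongside the count makes it easier to judge whether filling with a constant is reasonable for a given column.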


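The `cleaning_fee` column has missing values: only 2,788 of 3,818 rows are non-null. I assume a missing value means the host does not charge for cleaning, so I treat it as a fee of 0. Note that `df.fillna(0)` fills every column, so missing zip codes and review scores also become 0. After filling, I remove the dollar sign from `cleaning_fee` and convert it to float for feature analysis.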

In [6]:
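# Fill missing values in every column with 0 (a missing cleaning fee is treated as no fee)
df = df.fillna(0)
# Strip the literal '$'; the numeric 0 placeholders come back as NaN from .str, so fill again
df['cleaning_fee'] = df['cleaning_fee'].str.replace('$', '', regex=False).fillna(0).astype(float)
df['cleaning_fee'].unique()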

array([  0.,  40., 300., 125.,  25.,  15., 150.,  95.,  85.,  89.,  35.,
       250., 200.,  65., 100.,  80.,  99.,  50.,  20.,  55.,  75.,  30.,
        60., 120.,  78.,  12.,  45.,  10., 264., 180.,  90.,   7., 131.,
         8.,   5., 185., 199., 175., 110., 155., 111.,  72., 105., 160.,
        13., 275.,  28.,  70., 209.,  82., 195., 145.,  22., 225., 169.,
       119.,  29., 140.,  61.,  49., 108.,   6.,  26.,  83.,  18.,  19.,
       117., 112.,  58.,  16., 170.,  64., 113.,  79., 130.,  96., 149.,
       164., 159.,  32., 184., 109., 107., 274., 143.,  88., 229.,  38.,
        69., 135.,  59., 101.,  67., 240., 137., 134.,  21., 189.,   9.,
        17., 106.,  24., 165.,  39.,  68.,  27.,  87.,  42.,  71., 194.,
       129., 210., 178.,  76.,  97., 179.,  52., 142., 230.])

In [7]:
df.head()

|   | id | zipcode | latitude | longitude | property_type | price | cleaning_fee | availability_30 | review_scores_rating | review_scores_cleanliness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | 98119 | 47.636289 | -122.371025 | Apartment | $85.00 | 0.0 | 14 | 95.0 | 10.0 |
| 1 | 953595 | 98119 | 47.639123 | -122.365666 | Apartment | $150.00 | 40.0 | 13 | 96.0 | 10.0 |
| 2 | 3308979 | 98119 | 47.629724 | -122.369483 | House | $975.00 | 300.0 | 1 | 97.0 | 10.0 |
| 3 | 7421966 | 98119 | 47.638473 | -122.369279 | Apartment | $100.00 | 0.0 | 0 | 0.0 | 0.0 |
| 4 | 278830 | 98119 | 47.632918 | -122.372471 | House | $450.00 | 125.0 | 30 | 92.0 | 9.0 |
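
Since both `cleaning_fee` and `price` are stored as currency strings, the conversion could be wrapped in one reusable function. This is a sketch under assumptions, not the notebook's own code: `money_to_float` and the sample `fees` series are hypothetical names and values introduced for illustration.

```python
import numpy as np
import pandas as pd

def money_to_float(series: pd.Series, fill: float = 0.0) -> pd.Series:
    """Convert strings such as '$1,250.00' to floats; missing values become `fill`."""
    cleaned = (
        series.str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
              .str.replace(",", "", regex=False)   # thousands separator
    )
    # errors="coerce" turns anything that is still not a number into NaN before filling
    return pd.to_numeric(cleaned, errors="coerce").fillna(fill).astype(float)

# Hypothetical sample values, including one missing fee
fees = pd.Series(["$40.00", np.nan, "$1,250.00", "$300.00"])
print(money_to_float(fees).tolist())  # [40.0, 0.0, 1250.0, 300.0]
```

Passing `regex=False` keeps `$` from being read as the end-of-string anchor, which matters because the default for `regex` has differed between pandas versions.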


## Let's now clean the price feature. 

In [8]:
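# Remove the literal '$' and the thousands separator, then convert to float
# (price has no missing values, so no fill is needed)
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False).astype(float)
df.head()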

|   | id | zipcode | latitude | longitude | property_type | price | cleaning_fee | availability_30 | review_scores_rating | review_scores_cleanliness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 241032 | 98119 | 47.636289 | -122.371025 | Apartment | 85.0 | 0.0 | 14 | 95.0 | 10.0 |
| 1 | 953595 | 98119 | 47.639123 | -122.365666 | Apartment | 150.0 | 40.0 | 13 | 96.0 | 10.0 |
| 2 | 3308979 | 98119 | 47.629724 | -122.369483 | House | 975.0 | 300.0 | 1 | 97.0 | 10.0 |
| 3 | 7421966 | 98119 | 47.638473 | -122.369279 | Apartment | 100.0 | 0.0 | 0 | 0.0 | 0.0 |
| 4 | 278830 | 98119 | 47.632918 | -122.372471 | House | 450.0 | 125.0 | 30 | 92.0 | 9.0 |


## Let's now clean the zip code feature. 

In [10]:
df['zipcode'].unique()

array(['98119', '98109', '98107', '98117', 0, '98103', '98105', '98115',
       '98101', '98122', '98112', '98144', '99\n98122', '98121', '98102',
       '98199', '98104', '98134', '98136', '98126', '98146', '98116',
       '98177', '98118', '98108', '98133', '98106', '98178', '98125'],
      dtype=object)

In [11]:
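# Fix the one malformed entry; the numeric 0 placeholders come back as NaN from .str, so fill again
df['zipcode'] = df['zipcode'].str.replace('99\n98122', '98122', regex=False).fillna(0)
df['zipcode'] = df['zipcode'].astype(int)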
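
The replacement above targets the single malformed value seen in `unique()`. A more general alternative, sketched here on a made-up series, is to extract the five digits that end each entry. The names `zips`, `five_digit`, and `zip_int` are hypothetical, and unlike the cell above this sketch keeps missing zip codes as `<NA>` instead of 0, which is a different design choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up sample covering a clean value, the malformed value, and a missing one
zips = pd.Series(["98119", "99\n98122", np.nan, "98101"])

# Capture the five digits at the end of each entry; rows without a match stay missing
five_digit = zips.str.extract(r"(\d{5})\s*$", expand=False)

# The nullable Int64 dtype can hold integers and <NA> in the same column
zip_int = pd.to_numeric(five_digit, errors="coerce").astype("Int64")
print(zip_int)
```

Keeping missing zip codes as `<NA>` avoids treating 0 as if it were a real zip code in later grouping or mapping steps.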

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         3818 non-null   int64  
 1   zipcode                    3818 non-null   int64  
 2   latitude                   3818 non-null   float64
 3   longitude                  3818 non-null   float64
 4   property_type              3818 non-null   object 
 5   price                      3818 non-null   float64
 6   cleaning_fee               3818 non-null   float64
 7   availability_30            3818 non-null   int64  
 8   review_scores_rating       3818 non-null   float64
 9   review_scores_cleanliness  3818 non-null   float64
dtypes: float64(6), int64(3), object(1)
memory usage: 298.4+ KB
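
What the `df.info()` output shows by eye can also be asserted in code, so a rerun on new data fails loudly if a column regresses. This is a minimal sketch: `check_clean` and the two-row `cleaned` frame are hypothetical and stand in for the notebook's `df`.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame, numeric_cols: list) -> None:
    """Raise AssertionError if values are missing or a listed column is not numeric."""
    assert not frame.isna().any().any(), "missing values remain"
    for col in numeric_cols:
        assert pd.api.types.is_numeric_dtype(frame[col]), f"{col} is not numeric"

# Two illustrative rows in the cleaned layout
cleaned = pd.DataFrame({
    "zipcode": [98119, 98119],
    "property_type": ["Apartment", "House"],
    "price": [85.0, 150.0],
    "cleaning_fee": [0.0, 40.0],
})

check_clean(cleaned, ["zipcode", "price", "cleaning_fee"])
print("all checks passed")
```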


## Summary

As the output shows, every feature now has the same number of non-null values (3,818). The selected features also have the appropriate data types for analyzing how the cleaning fee impacts booking frequency.