# Numerical Data

### Introduction

### Working with AirBnb

For this lesson, we'll work with [AirBnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mhan1/Data-Science/master/listings_summary.csv')

Some of these columns hold data that we do not yet know how to handle, mainly free text and image URLs. Many others are simply stored in the wrong format. Let's take a look.

In [2]:
pd.set_option('display.max_rows',100)

In [3]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [4]:
find_object_features(df)

['listing_url',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'picture_url',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'is_location_exact',
 'property_type',
 'room_type',
 'bed_type',
 'amenities',
 'price',
 'weekly_price',
 'monthly_price',
 'security_deposit',
 'cleaning_fee',
 'extra_people',
 'calendar_updated',
 'has_availability',
 'calendar_last_scraped',
 'first_review',
 'last_review',
 'requires_license',
 'license',
 'instant_

In [5]:
len(find_object_features(df))

62

### Feature engineering

Let's try to capture as much of this object data as possible. In this lesson, we'll start with numeric values that are stored as strings, such as prices and percentages.

In [6]:
def find_object_feature_values(df):
    object_features = find_object_features(df)
    return df[object_features][:2].values

In [7]:
find_object_feature_values(df)

array([['https://www.airbnb.com/rooms/2015', '2018-11-07',
        'Berlin-Mitte Value! Quiet courtyard/very central',
        'Great location!  30 of 75 sq meters. This wood floored/high ceiling typical Berlin "Altbau" section of an apartment consists of 1 simple large room, a small kitchen and a bathroom + shower. The apartment is in Mitte, close to Prenzlauer Berg/Mauerpark. Perfect for short visits, singles or couples. Your section is closed from the rest of the bigger flat wich is not noticeable. You will not be sharing your space.',
        'A+++ location! This „Einliegerwohnung“ is an extention of a larger apartment with a separate entrance, bathroom and kitchen. The door to the rest of the apartment is soundproof, hidden, locked and barely noticable (behind mirror in pictures). Your 30 sq meters are facing a quiet courtyard. This wood floored/high ceiling typical Berlin "Altbau" apartment consists of 1 large room with a large double bed, optionally with an extra matress for a 3

In [8]:
def contains_numbers(column):
    # Matches strings that contain a digit but no 'www', hyphen, slash,
    # letter, or space, which rules out URLs, dates, and free text.
    regex_string = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'
    return column.str.contains(regex_string).all()

We wrote the regex so that it finds columns whose values contain digits but skips over date columns (which contain hyphens), URLs, and free text.
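To see what the pattern accepts and rejects, here is a minimal sketch that applies the same regex to a few hand-picked strings with Python's `re` module. The sample strings are illustrative, modelled on values that appear in this dataset.

```python
import re

# Same pattern as in contains_numbers: reject anything containing "www",
# a hyphen, a slash, a letter, or a space, then require at least one digit.
pattern = re.compile(r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*')

samples = [
    '$60.00',                             # price
    '96%',                                # percentage
    '10119',                              # plain digits
    '2018-11-07',                         # date: rejected because of the hyphens
    'https://www.airbnb.com/rooms/2015',  # URL: rejected (www, slashes, letters)
    '2 weeks ago',                        # free text: rejected (letters, spaces)
]

results = {s: bool(pattern.match(s)) for s in samples}
for value, matched in results.items():
    print(f'{value!r}: {matched}')
```

Only the first three strings match. Because `contains_numbers` calls `.all()`, a single rejected value in a column is enough to make the whole column return `False`.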

In [9]:
contains_numbers(df.last_scraped)

False

In [10]:
contains_numbers(df.listing_url)


False

In [11]:
contains_numbers(df.price)

True

In [12]:
contains_numbers(df.zipcode)
# We would expect True here; a False result means at least one value fails the regex

False

In [13]:
df['zipcode'].head(2)

0    10119
1    10437
Name: zipcode, dtype: object
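The first two zipcodes look clean, so the `False` result probably comes from values further down the column. One way to investigate is to list the distinct values that the pattern rejects. The sketch below uses a hypothetical helper, `find_non_matching`, and a toy Series whose messy entries are invented for illustration. The actual offending values in `df['zipcode']` are not shown in this lesson.

```python
import pandas as pd

pattern = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'

def find_non_matching(column):
    """Return the distinct non-null values that the pattern rejects."""
    values = column.dropna().astype(str)
    mask = values.str.contains(pattern)
    return values[~mask].unique()

# Toy stand-in for df['zipcode']; the messy entries are made up.
toy_zipcodes = pd.Series(['10119', '10437', '10 115', None, '1011A'])
offenders = find_non_matching(toy_zipcodes)
print(list(offenders))
```

Running `find_non_matching(df['zipcode'])` on the real data should show which entries need cleaning before the column can be treated as numeric.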

In [14]:
def find_numeric_features(df):
    series_contains_number = df.apply(contains_numbers)
    return series_contains_number.index[series_contains_number.values]

In [15]:
numeric_features = find_numeric_features(df)

In [16]:
numeric_features

Index(['id', 'scrape_id', 'thumbnail_url', 'medium_url', 'xl_picture_url',
       'host_id', 'host_response_rate', 'host_acceptance_rate',
       'host_listings_count', 'host_total_listings_count', 'latitude',
       'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
       'square_feet', 'price', 'weekly_price', 'monthly_price',
       'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people',
       'minimum_nights', 'maximum_nights', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'jurisdiction_names',
       'calculated_host_listings_count', 'reviews_per_month'],
      dtype='object')

In [17]:
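# Note: this checks the Series object itself, which is never a float,
# so the result says nothing about the dtype of the column.
isinstance(df['zipcode'], float)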

False
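To ask about the type of the values in a column rather than the type of the Series object, pandas provides dtype helpers in `pandas.api.types`. A minimal sketch, using toy Series as stand-ins for the real columns:

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_object_dtype

# Toy stand-ins: zipcodes stored as strings, ratings stored as floats.
zipcodes = pd.Series(['10119', '10437'], dtype=object)
ratings = pd.Series([4.5, None])

series_is_float = isinstance(zipcodes, float)  # a Series is never a float
zip_is_float = is_float_dtype(zipcodes)        # checks the column dtype
zip_is_object = is_object_dtype(zipcodes)
rating_is_float = is_float_dtype(ratings)

print(series_is_float, zip_is_float, zip_is_object, rating_is_float)
```

Here only `zip_is_object` and `rating_is_float` are `True`, which matches the `dtype: object` shown for `zipcode`.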

In [18]:
df['zipcode'].head(2)

0    10119
1    10437
Name: zipcode, dtype: object

In [19]:
df.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms',

While the `find_numeric_features` function captures the columns that hold numeric data, we want to narrow these down to the ones that are still stored as strings, since those are the features we need to fix.

In [20]:
def numeric_to_fix(df):
    numeric_features = find_numeric_features(df)
    return df[numeric_features].select_dtypes(exclude=['int64', 'float64'])[0:9]

In [21]:
numeric_to_fix(df)

| | host_response_rate | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|---|
| 0 | 96% | $60.00 | NaN | NaN | $200.00 | $30.00 | $28.00 |
| 1 | NaN | $17.00 | NaN | NaN | $0.00 | $0.00 | $0.00 |
| 2 | 100% | $90.00 | $520.00 | $1,900.00 | $200.00 | $50.00 | $20.00 |
| 3 | NaN | $26.00 | $175.00 | $599.00 | $250.00 | $30.00 | $18.00 |
| 4 | 100% | $42.00 | NaN | NaN | $0.00 | $0.00 | $24.00 |
| 5 | 100% | $180.00 | $650.00 | NaN | $400.00 | $80.00 | $10.00 |
| 6 | 100% | $70.00 | $420.00 | $820.00 | $500.00 | $0.00 | $0.00 |
| 7 | NaN | $120.00 | NaN | NaN | NaN | NaN | $13.00 |
| 8 | 100% | $90.00 | $520.00 | $1,440.00 | $500.00 | $50.00 | $20.00 |


In [22]:
numeric_to_fix(df).columns

Index(['host_response_rate', 'price', 'weekly_price', 'monthly_price',
       'security_deposit', 'cleaning_fee', 'extra_people'],
      dtype='object')

In [23]:
len(numeric_to_fix(df).columns)

7

### Modifying Values

In [24]:
price_features = ['price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']

def price_to_float(price):
    # Convert a string such as '$1,900.00' to 1900.0; anything else
    # (including NaN) falls through and returns None.
    if isinstance(price, str) and price.startswith('$'):
        return float(price[1:].replace(',', ''))

def prices_to_floats(df, price_features):
    prices_df = pd.DataFrame({})
    for feature in price_features:
        prices_df[feature] = df[feature].map(price_to_float)
    return prices_df

In [25]:
prices_df = prices_to_floats(df, price_features)

In [26]:
prices_to_floats(df, price_features)

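| | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|
| 0 | 60.0 | NaN | NaN | 200.0 | 30.0 | 28.0 |
| 1 | 17.0 | NaN | NaN | 0.0 | 0.0 | 0.0 |
| 2 | 90.0 | 520.0 | 1900.0 | 200.0 | 50.0 | 20.0 |
| 3 | 26.0 | 175.0 | 599.0 | 250.0 | 30.0 | 18.0 |
| 4 | 42.0 | NaN | NaN | 0.0 | 0.0 | 24.0 |
| 5 | 180.0 | 650.0 | NaN | 400.0 | 80.0 | 10.0 |
| 6 | 70.0 | 420.0 | 820.0 | 500.0 | 0.0 | 0.0 |
| 7 | 120.0 | NaN | NaN | NaN | NaN | 13.0 |
| 8 | 90.0 | 520.0 | 1440.0 | 500.0 | 50.0 | 20.0 |
| 9 | 45.0 | 281.0 | 955.0 | 0.0 | 18.0 | 26.0 |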
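The element-wise `map` above is easy to follow. As an alternative, the same cleanup can be sketched with pandas' vectorized string methods. The function name `prices_to_floats_vectorized` and the toy data are assumptions for illustration. Note one behavioural difference: this version also converts strings without a leading `$`, whereas `price_to_float` leaves them as missing.

```python
import pandas as pd

def prices_to_floats_vectorized(df, price_features):
    """Strip '$' and ',' from each price column, then convert to float."""
    cleaned = df[price_features].apply(
        lambda col: col.str.replace(r'[$,]', '', regex=True)
    )
    # errors='coerce' turns anything that still is not a number into NaN
    return cleaned.apply(pd.to_numeric, errors='coerce')

# Toy stand-in for the price columns of df
toy = pd.DataFrame({
    'price': ['$60.00', '$17.00'],
    'monthly_price': [None, '$1,900.00'],
})
toy_prices = prices_to_floats_vectorized(toy, ['price', 'monthly_price'])
print(toy_prices)
```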


In [27]:
prices_df[0:1]

| | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|
| 0 | 60.0 | NaN | NaN | 200.0 | 30.0 | 28.0 |


In [28]:
df.head(1)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,2015,https://www.airbnb.com/rooms/2015,20181107122246,2018-11-07,Berlin-Mitte Value! Quiet courtyard/very central,Great location! 30 of 75 sq meters. This wood...,A+++ location! This „Einliegerwohnung“ is an e...,Great location! 30 of 75 sq meters. This wood...,none,It is located in the former East Berlin area o...,...,t,,,f,f,strict_14_with_grace_period,f,f,4,3.76


In [29]:
def percentage_to_num(percentage):
    # Expects a single string such as '96%'; returns None for anything else
    if isinstance(percentage, str):
        return float(percentage[:-1])

In [30]:
df['host_response_rate'].head()

0     96%
1     NaN
2    100%
3     NaN
4    100%
Name: host_response_rate, dtype: object

In [31]:
percentage_to_num(df['host_response_rate'])

In [32]:
type(df['host_response_rate'])

pandas.core.series.Series

In [33]:
(df['host_response_rate']).dtypes

dtype('O')

In [34]:
df['host_response_rate'].head()

0     96%
1     NaN
2    100%
3     NaN
4    100%
Name: host_response_rate, dtype: object

In [35]:
def convert_to_num(percentage):
    if (percentage).dtypes == object:
        return float(percentage[:-1])

In [36]:
convert_to_num(df['host_response_rate'])

TypeError: cannot convert the series to <class 'float'>

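Neither attempt converted the `host_response_rate` column. `percentage_to_num` silently returned `None`, because it was handed the whole Series rather than a single string, so its string check failed. `convert_to_num` passed its dtype check but then called `float()` on a Series, which raises the `TypeError` shown above. Hence, I tried a new function below that maps over the column one value at a time: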

In [37]:
percentage_feature = ['host_response_rate']

def percentage_to_numb(percentage):
    if isinstance(percentage, str):
        return float(percentage[:-1])

def percentage_to_number(df, percentage_feature):
    percent_df = pd.DataFrame({})
    for p_feature in percentage_feature:
        percent_df[p_feature] = df[p_feature].map(percentage_to_numb)
    # Return after the loop so that every listed feature is converted
    return percent_df

In [38]:
percentage_to_number(df, percentage_feature).head()

| | host_response_rate |
|---|---|
| 0 | 96.0 |
| 1 | NaN |
| 2 | 100.0 |
| 3 | NaN |
| 4 | 100.0 |
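Mapping a scalar function over the column works. As an alternative, the Series can be converted in one step with pandas' string accessor, which avoids the problem of passing a whole Series to `float()`. A minimal sketch, with a hypothetical function name and a toy Series standing in for `df['host_response_rate']`:

```python
import pandas as pd

def percentage_series_to_float(series):
    """Convert strings like '96%' to 96.0, leaving missing values missing."""
    return pd.to_numeric(series.str.rstrip('%'), errors='coerce')

# Toy stand-in for df['host_response_rate']
toy_rates = pd.Series(['96%', None, '100%'])
converted = percentage_series_to_float(toy_rates)
print(converted)
```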


Once we have modified our columns, we can merge the new columns into our original dataframe.

In [39]:
def merge_dfs(original_df, new_dfs):
    if not isinstance(new_dfs, list):
        new_dfs = [new_dfs]
    copied_original = original_df.copy()
    for new_df in new_dfs:
        copied_original[new_df.columns] = new_df
    return copied_original

In [40]:
prices_df.head(1)

| | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|
| 0 | 60.0 | NaN | NaN | 200.0 | 30.0 | 28.0 |


In [41]:
[prices_df]

[       price  weekly_price  monthly_price  security_deposit  cleaning_fee  \
 0       60.0           NaN            NaN             200.0          30.0   
 1       17.0           NaN            NaN               0.0           0.0   
 2       90.0         520.0         1900.0             200.0          50.0   
 3       26.0         175.0          599.0             250.0          30.0   
 4       42.0           NaN            NaN               0.0           0.0   
 5      180.0         650.0            NaN             400.0          80.0   
 6       70.0         420.0          820.0             500.0           0.0   
 7      120.0           NaN            NaN               NaN           NaN   
 8       90.0         520.0         1440.0             500.0          50.0   
 9       45.0         281.0          955.0               0.0          18.0   
 10      49.0         290.0          990.0               0.0          50.0   
 11     129.0         920.0         3200.0             500.0    

In [42]:
merge_dfs(df, prices_df)['weekly_price'].head(7)               

0      NaN
1      NaN
2    520.0
3    175.0
4      NaN
5    650.0
6    420.0
Name: weekly_price, dtype: float64

In [43]:
df['weekly_price'].head(7)

0        NaN
1        NaN
2    $520.00
3    $175.00
4        NaN
5    $650.00
6    $420.00
Name: weekly_price, dtype: object
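The two outputs above show that the merged dataframe holds floats while `df` still holds the original strings. The same behaviour can be checked on a small example. The sketch below repeats the `merge_dfs` logic so that it runs on its own, and uses made-up toy frames.

```python
import pandas as pd

def merge_dfs(original_df, new_dfs):
    # Same logic as merge_dfs above, repeated so this block is self-contained
    if not isinstance(new_dfs, list):
        new_dfs = [new_dfs]
    copied_original = original_df.copy()
    for new_df in new_dfs:
        copied_original[new_df.columns] = new_df
    return copied_original

# Toy frames: the original keeps prices as strings, the new one as floats
toy_original = pd.DataFrame({'id': [1, 2], 'price': ['$60.00', '$17.00']})
toy_prices = pd.DataFrame({'price': [60.0, 17.0]})

merged = merge_dfs(toy_original, toy_prices)
print(merged['price'].tolist())        # converted values in the merged copy
print(toy_original['price'].tolist())  # original strings are left untouched
```

Because `merge_dfs` works on a copy, the conversion can be rerun or adjusted without reloading the CSV.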

### Resources

[Fixing messy col names](https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd)