# Numerical Data

### Introduction

### Working with Airbnb

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mhan1/Data-Science/master/listings_summary.csv')

Some of these columns hold data that we do not yet know how to handle, mainly text and images. Many others hold values that are simply in the wrong format, such as numbers stored as strings. Let's count how many columns pandas reads as the `object` type.

In [2]:
pd.set_option('display.max_rows',100)

In [3]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [4]:
len(find_object_features(df))

62
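To illustrate why a column can land in the wrong format, here is a minimal sketch on a made-up two-row frame (the `sample` data is hypothetical, not taken from the listings file). A price written with a currency symbol is read as text, so pandas does not treat it as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical two-row sample, not taken from the listings file.
sample = pd.DataFrame({'price': ['$60.00', '$17.00'], 'beds': [2, 1]})

for name in sample.columns:
    # is_numeric_dtype is True for integer and float columns only
    print(name, is_numeric_dtype(sample[name]))
```

Here `price` is reported as non-numeric and `beds` as numeric, which is the kind of gap the rest of the lesson tries to close.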

### Feature engineering

Let's try to capture as much of this object data as possible.  In this lesson, we'll start with numeric values that are stored as strings, such as prices and percentages.

In [5]:
def find_object_feature_values(df):
    object_features = find_object_features(df)
    return df[object_features][:2].values

In [6]:
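def contains_numbers(column):
    # Matches values that contain a digit but no letters, spaces, hyphens,
    # slashes, or 'www', so prices and percentages pass while dates and URLs fail.
    regex_string = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'
    # Drop missing values and cast to str so the .str accessor also works on
    # numeric columns; a column with no non-null values passes vacuously.
    return column.dropna().astype(str).str.contains(regex_string).all()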

We wrote the regex so that it matches values containing digits but rejects dates and URLs, which contain digits too.
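To see how the pattern behaves on individual values, the sketch below runs it with Python's `re` module on a few made-up strings (the `samples` list is hypothetical and only resembles values in this dataset). The negative lookahead rejects any value containing `www`, a hyphen, a slash, a letter, or a space; whatever survives must still contain a digit.

```python
import re

# Same pattern as in contains_numbers.
NUMERIC_PATTERN = re.compile(r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*')

# Hypothetical sample values, chosen to resemble columns in the listings data.
samples = ['$60.00', '96%', '10119', '2018-11-07',
           'https://www.airbnb.com/rooms/2015']

for value in samples:
    # re.search is what Series.str.contains applies to each element
    print(value, bool(NUMERIC_PATTERN.search(value)))
```

The first three values match; the date fails on its hyphens and the URL fails on its letters and slashes.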

In [7]:
contains_numbers(df.last_scraped)

False

In [8]:
contains_numbers(df.listing_url)


False

In [9]:
contains_numbers(df.price)

True

In [10]:
contains_numbers(df.zipcode)
# this should be True

False
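When a column that looks numeric is rejected, it helps to list the values that fail the pattern. The helper below is a hypothetical addition (`non_matching_values` is not part of the lesson's code), shown on made-up zip codes; the real `zipcode` column may fail for other reasons.

```python
import pandas as pd

NUMERIC_PATTERN = r'^(?!.*www|.*-|.*\/|.*[A-Za-z]|.* ).*\d.*'

def non_matching_values(column):
    """Return the distinct non-null values that fail the numeric pattern."""
    values = column.dropna().astype(str)
    matches = values.str.contains(NUMERIC_PATTERN)
    return values[~matches].unique()

# Hypothetical zip codes, not taken from the listings file.
zipcodes = pd.Series(['10119', '10437', '10 119', 'Berlin 10405', None])
print(non_matching_values(zipcodes))
```

On this sample, the two entries containing a space or letters are returned. Running the same helper on `df.zipcode` would show which entries need cleaning before the column can be treated as numeric.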

In [11]:
def find_numeric_features(df):
    series_contains_number = df.apply(contains_numbers)
    return series_contains_number.index[series_contains_number.values]

In [12]:
numeric_features = find_numeric_features(df)

In [13]:
numeric_features

Index(['id', 'scrape_id', 'thumbnail_url', 'medium_url', 'xl_picture_url',
       'host_id', 'host_response_rate', 'host_acceptance_rate',
       'host_listings_count', 'host_total_listings_count', 'latitude',
       'longitude', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
       'square_feet', 'price', 'weekly_price', 'monthly_price',
       'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people',
       'minimum_nights', 'maximum_nights', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'jurisdiction_names',
       'calculated_host_listings_count', 'reviews_per_month'],
      dtype='object')

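While the `find_numeric_features` function captures the numeric features, many of them are already stored as integers or floats. We want to narrow the list down to the features that still need fixing: the ones stored as strings.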

In [14]:
def numeric_to_fix(df):
    numeric_features = find_numeric_features(df)
    return df[numeric_features].select_dtypes(exclude=['int64', 'float64'])[0:2]

In [15]:
numeric_to_fix(df)

| | host_response_rate | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|---|
| 0 | 96% | $60.00 | NaN | NaN | $200.00 | $30.00 | $28.00 |
| 1 | NaN | $17.00 | NaN | NaN | $0.00 | $0.00 | $0.00 |


### Modifying Values

In [16]:
price_features = ['price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']
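def price_to_float(price):
    # Convert a string such as '$1,200.00' to 1200.0; anything else
    # (for example a missing value) falls through and returns None.
    if isinstance(price, str) and price.startswith('$'):
        return float(price[1:].replace(',', ''))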

def prices_to_floats(df, price_features):
    prices_df = pd.DataFrame({})
    for feature in price_features:
        prices_df[feature] = df[feature].map(price_to_float)
    return prices_df

In [17]:
prices_df = prices_to_floats(df, price_features)

In [18]:
prices_df[0:1]

| | price | weekly_price | monthly_price | security_deposit | cleaning_fee | extra_people |
|---|---|---|---|---|---|---|
| 0 | 60.0 | NaN | NaN | 200.0 | 30.0 | 28.0 |
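As an alternative to mapping a Python function over every value, the same conversion can be sketched with pandas string methods. This is not the lesson's approach: `prices_to_floats_vectorized` is a hypothetical name, and the sample frame is made up.

```python
import pandas as pd

def prices_to_floats_vectorized(df, price_features):
    # Strip '$' and ',' from each column, then let pandas parse the numbers.
    # Missing values stay missing.
    return df[price_features].apply(
        lambda col: pd.to_numeric(col.str.replace(r'[$,]', '', regex=True))
    )

# Hypothetical sample rows, not taken from the listings file.
sample_prices = pd.DataFrame({
    'price': ['$60.00', '$1,200.00', None],
    'cleaning_fee': ['$30.00', None, '$0.00'],
})
converted = prices_to_floats_vectorized(sample_prices, ['price', 'cleaning_fee'])
print(converted)
```

One design difference: `pd.to_numeric` raises an error on a value it cannot parse, while `price_to_float` silently returns `None` for anything that does not start with `$`.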


In [19]:
def percentage_to_num(percentage):
    # Convert a string such as '96%' to 96.0; missing values return None.
    if isinstance(percentage, str):
        return float(percentage[:-1])
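A short usage sketch, assuming the function is mapped over a column the same way `price_to_float` is. The `rates` series is made-up sample data standing in for `host_response_rate`.

```python
import pandas as pd

def percentage_to_num(percentage):
    # Same logic as the lesson's function: strip the trailing '%' from strings.
    if isinstance(percentage, str):
        return float(percentage[:-1])

# Hypothetical sample values, not taken from the listings file.
rates = pd.Series(['96%', None, '100%'], name='host_response_rate')
rates_num = rates.map(percentage_to_num)
print(rates_num)
```

The missing entry stays missing, and the other two become the floats 96.0 and 100.0.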

Once we have modified our columns, we can merge the new columns into our original dataframe.

In [20]:
def merge_dfs(original_df, new_dfs):
    if not isinstance(new_dfs, list):
        new_dfs = [new_dfs]
    copied_original = original_df.copy()
    for new_df in new_dfs:
        copied_original[new_df.columns] = new_df
    return copied_original
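A minimal usage sketch of `merge_dfs` on made-up data (the `listings` and `clean_prices` frames are hypothetical). The function is repeated here only so the example runs on its own.

```python
import pandas as pd

def merge_dfs(original_df, new_dfs):
    # Accept either a single dataframe or a list of dataframes.
    if not isinstance(new_dfs, list):
        new_dfs = [new_dfs]
    copied_original = original_df.copy()
    for new_df in new_dfs:
        # Overwrite (or add) the columns present in new_df.
        copied_original[new_df.columns] = new_df
    return copied_original

# Hypothetical sample frames, not taken from the listings file.
listings = pd.DataFrame({'id': [1, 2], 'price': ['$60.00', '$17.00']})
clean_prices = pd.DataFrame({'price': [60.0, 17.0]})

merged = merge_dfs(listings, clean_prices)
print(merged)
```

Because the function works on a copy, `listings` keeps its original string prices while `merged` holds the converted values.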

### Resources

[Fixing messy col names](https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd)