# Testing and integrating the preprocessing and missing-data functions from the previous notebook

Table of contents <a id='top'></a>

1. [Prices](#prices)
2. [Boolean Values](#booleans)
3. [Dates](#dates)
4. [Percent](#percent)
5. [Missing Data](#missing)
6. [Integration](#integration)

In [4]:
import numpy as np
import pandas as pd
import lightgbm as lgb
import os
from jupyterthemes import jtplot
from datetime import datetime
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline
jtplot.style(theme='solarizedd')
plt.rcParams['figure.figsize'] = (20.0, 10.0)

%load_ext autoreload
%autoreload 2

ROOT_DIR = '..'
DATA_DIR = os.path.join(ROOT_DIR, 'data')
DATA_RAW = os.path.join(DATA_DIR, 'raw')
DATA_INTERIM = os.path.join(DATA_DIR, 'interim')
DATA_EXTERNAL = os.path.join(DATA_DIR, 'external')

SRC_DIR = os.path.join(ROOT_DIR, 'src')

SEATTLE_CALENDAR = os.path.join(DATA_RAW, 'seattle', 'calendar.csv')
SEATTLE_LISTINGS = os.path.join(DATA_RAW, 'seattle', 'listings.csv')
SEATTLE_REVIEWS = os.path.join(DATA_RAW, 'seattle', 'reviews.csv')

SEATTLE_LISTINGS_COLS = os.path.join(
    DATA_INTERIM, 'seattle', 'listings_cols_df.pkl')

import sys
sys.path.append(SRC_DIR)
sys.path.append(os.path.join(SRC_DIR, 'data'))

import preprocessing as pp
import missing_data as md

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
calendar = pd.read_csv(SEATTLE_CALENDAR)
listings = pd.read_csv(SEATTLE_LISTINGS)
reviews = pd.read_csv(SEATTLE_REVIEWS)

listings_cols_df = pd.read_pickle(SEATTLE_LISTINGS_COLS)

## 1. Prices <a id='prices'></a>
[Top](#top)

In [6]:
calendar, listings, reviews = pp.transform_prices(calendar, listings, reviews)

In [7]:
calendar.head()

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,t,85.0
1,241032,2016-01-05,t,85.0
2,241032,2016-01-06,f,
3,241032,2016-01-07,f,
4,241032,2016-01-08,f,


In [8]:
pp.find_str(listings.head()).sum().sum()

0.0

In [9]:
price_cols = pp.get_column_by_kind(listings_cols_df, 'price_cols')
listings[price_cols].head()

Unnamed: 0,price,weekly_price,monthly_price,security_deposit,cleaning_fee,extra_people
0,85.0,,,,,5.0
1,150.0,1000.0,3000.0,100.0,40.0,0.0
2,975.0,,,1000.0,300.0,25.0
3,100.0,650.0,2300.0,,,0.0
4,450.0,,,700.0,125.0,15.0


OK. Looks good: the price columns in `calendar` and `listings` now hold floats, with missing values left as NaN.
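
For reference, a minimal sketch of the kind of conversion this step performs. It is not the code in `pp.transform_prices`; the helper name `price_to_float` and the raw format (`'$1,000.00'`) are assumptions made for illustration.

```python
import pandas as pd


def price_to_float(series):
    """Hypothetical helper: turn strings like '$1,000.00' into floats.

    Assumes the raw values are strings with a dollar sign and thousands
    separators; anything that cannot be parsed becomes missing.
    """
    cleaned = series.str.replace(r'[$,]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')


raw = pd.Series(['$85.00', '$1,000.00', None])
prices = price_to_float(raw)
print(prices)
```

Using `errors='coerce'` keeps missing or malformed prices as missing values instead of raising, which matches the empty cells seen in the tables above.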

## 2. Boolean Values <a id='booleans'></a>
[Top](#top)

In [10]:
calendar, listings, reviews = pp.transform_booleans(calendar, listings, reviews)

In [11]:
calendar.head()

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,True,85.0
1,241032,2016-01-05,True,85.0
2,241032,2016-01-06,False,
3,241032,2016-01-07,False,
4,241032,2016-01-08,False,


In [12]:
tf_cols = pp.get_column_by_kind(listings_cols_df, 'tf_cols')
listings[tf_cols].head()

Unnamed: 0,host_is_superhost,host_has_profile_pic,host_identity_verified,is_location_exact,has_availability,requires_license,instant_bookable,require_guest_profile_picture,require_guest_phone_verification
0,False,True,True,True,True,False,False,False,False
1,True,True,True,True,True,False,False,True,True
2,False,True,True,True,True,False,False,False,False
3,False,True,True,True,True,False,False,False,False
4,False,True,True,True,True,False,False,False,False


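Looks good to me: `available` in `calendar` changed from `t`/`f` to `True`/`False`, and the `tf_cols` columns in `listings` hold booleans as well.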
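
As an illustration only, the `t`/`f` conversion could be done with a simple mapping. The helper `tf_to_bool` below is hypothetical and is not taken from `preprocessing.py`.

```python
import pandas as pd


def tf_to_bool(series):
    """Hypothetical helper: map 't' to True and 'f' to False.

    Values other than 't' or 'f' are not in the mapping and become NaN.
    """
    return series.map({'t': True, 'f': False})


raw = pd.Series(['t', 'f', 't'])
flags = tf_to_bool(raw)
print(flags)
```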

## 3. Dates <a id='dates'></a>
[Top](#top)

In [13]:
calendar, listings, reviews = pp.transform_dates(calendar, listings, reviews)

In [14]:
calendar.head()

Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,True,85.0
1,241032,2016-01-05,True,85.0
2,241032,2016-01-06,False,
3,241032,2016-01-07,False,
4,241032,2016-01-08,False,


In [15]:
type(calendar.date[0])

pandas._libs.tslibs.timestamps.Timestamp

In [16]:
date_cols = pp.get_column_by_kind(listings_cols_df, 'date_cols')
listings[date_cols].head()

Unnamed: 0,last_scraped,host_since,calendar_last_scraped,first_review,last_review
0,2016-01-04,2011-08-11,2016-01-04,2011-11-01,2016-01-02
1,2016-01-04,2013-02-21,2016-01-04,2013-08-19,2015-12-29
2,2016-01-04,2014-06-12,2016-01-04,2014-07-30,2015-09-03
3,2016-01-04,2013-11-06,2016-01-04,NaT,NaT
4,2016-01-04,2011-11-29,2016-01-04,2012-07-10,2015-10-24


In [17]:
listings[date_cols].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 5 columns):
last_scraped             3818 non-null datetime64[ns]
host_since               3816 non-null datetime64[ns]
calendar_last_scraped    3818 non-null datetime64[ns]
first_review             3191 non-null datetime64[ns]
last_review              3191 non-null datetime64[ns]
dtypes: datetime64[ns](5)
memory usage: 149.2 KB


In [18]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


In [19]:
type(reviews.date[0])

pandas._libs.tslibs.timestamps.Timestamp

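Seems to be OK: the `date` values in `calendar` and `reviews` are `Timestamp` objects, and all five date columns in `listings` are `datetime64[ns]`, with missing review dates shown as `NaT`.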
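
A small sketch of how date strings can be parsed with `pd.to_datetime`. This is an assumption about the approach, not the body of `pp.transform_dates`; the sample values are made up.

```python
import pandas as pd

# Sample strings in the ISO format seen in the tables above, plus a
# missing value and an unparseable one.
raw = pd.Series(['2016-01-04', None, 'not a date'])

# errors='coerce' turns anything that cannot be parsed into NaT.
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)
print(type(parsed.iloc[0]))
```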

## 4. Percent <a id='percent'></a>
[Top](#top)

In [20]:
calendar, listings, reviews = pp.transform_percent(calendar, listings, reviews)

In [21]:
percent_cols = pp.get_column_by_kind(listings_cols_df, 'percent_cols')
listings[percent_cols].head()

Unnamed: 0,host_response_rate,host_acceptance_rate
0,96.0,100.0
1,98.0,100.0
2,67.0,100.0
3,,
4,100.0,


Looks good: both rate columns are now numeric, with missing values left as NaN.
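
A hedged sketch of a percent conversion, assuming the raw values look like `'96%'` and that non-numeric placeholders such as `'N/A'` may appear. The helper `percent_to_float` is hypothetical and not the actual `pp.transform_percent`.

```python
import pandas as pd


def percent_to_float(series):
    """Hypothetical helper: turn strings like '96%' into 96.0.

    Unparseable entries (for example 'N/A') become missing values.
    """
    return pd.to_numeric(series.str.rstrip('%'), errors='coerce')


raw = pd.Series(['96%', '100%', 'N/A', None])
rates = percent_to_float(raw)
print(rates)
```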

## 5. Missing Data <a id='missing'></a>
[Top](#top)

In [22]:
calendar, listings, reviews = md.fill_missing(calendar, listings, reviews)

In [23]:
print(calendar.isnull().sum().sum())
calendar.head()

0


Unnamed: 0,listing_id,date,available,price
0,241032,2016-01-04,True,85.0
1,241032,2016-01-05,True,85.0
2,241032,2016-01-06,False,85.0
3,241032,2016-01-07,False,85.0
4,241032,2016-01-08,False,85.0


In [24]:
print(reviews.isnull().sum().sum())
reviews.head()

0


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


In [25]:
print('There is {} missing data!'.format(listings.isnull().sum().sum()))
listings.head(2)

There is 0 missing data!


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,calculated_host_listings_count,reviews_per_month,review_scores_rating_missing,review_scores_accuracy_missing,review_scores_cleanliness_missing,review_scores_checkin_missing,review_scores_communication_missing,review_scores_location_missing,review_scores_value_missing,reviews_per_month_missing
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,2,4.07,0,0,0,0,0,0,0,0
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,6,1.48,0,0,0,0,0,0,0,0
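
The `*_missing` columns at the end of `listings` suggest an indicator-plus-fill pattern. A minimal sketch of that pattern follows; filling with the median is an assumption, and `fill_with_indicator` is a hypothetical helper rather than the code in `missing_data.py`.

```python
import numpy as np
import pandas as pd


def fill_with_indicator(df, col):
    """Hypothetical helper: record where `col` was missing in a 0/1
    column named `<col>_missing`, then fill the gaps with the median."""
    out = df.copy()
    out[col + '_missing'] = out[col].isnull().astype(int)
    out[col] = out[col].fillna(out[col].median())
    return out


demo = pd.DataFrame({'reviews_per_month': [4.0, np.nan, 2.0]})
filled = fill_with_indicator(demo, 'reviews_per_month')
print(filled)
```

Keeping the indicator column lets a downstream model distinguish observed values from filled ones.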


## 6. Integration <a id='integration'></a>
[Top](#top)

In [26]:
# All together now...

calendar, listings, reviews = pp.load_data()
calendar, listings, reviews = pp.transform_all(calendar,
                                               listings,
                                               reviews,
                                               save_results=True)
calendar, listings, reviews = md.fill_missing(calendar,
                                              listings,
                                              reviews,
                                              save_results=True)

In [27]:
def show_data(data):
    """ Shows shape, missing data, and head of a pandas DataFrame. """
    print('The data has shape: {}\n'.format(data.shape))
    print('There is {} missing data!\n'.format(data.isnull().sum().sum()))
    print(data.head())

In [28]:
show_data(calendar)

The data has shape: (1393570, 4)

There is 0 missing data!

   listing_id       date  available  price
0      241032 2016-01-04       True   85.0
1      241032 2016-01-05       True   85.0
2      241032 2016-01-06      False   85.0
3      241032 2016-01-07      False   85.0
4      241032 2016-01-08      False   85.0


In [29]:
show_data(listings)

The data has shape: (3816, 98)

There is 0 missing data!

        id                           listing_url       scrape_id last_scraped  \
0   241032   https://www.airbnb.com/rooms/241032  20160104002432   2016-01-04   
1   953595   https://www.airbnb.com/rooms/953595  20160104002432   2016-01-04   
2  3308979  https://www.airbnb.com/rooms/3308979  20160104002432   2016-01-04   
3  7421966  https://www.airbnb.com/rooms/7421966  20160104002432   2016-01-04   
4   278830   https://www.airbnb.com/rooms/278830  20160104002432   2016-01-04   

                                  name  \
0         Stylish Queen Anne Apartment   
1   Bright & Airy Queen Anne Apartment   
2  New Modern House-Amazing water view   
3                   Queen Anne Chateau   
4       Charming craftsman 3 bdm house   

                                             summary  \
0                                                      
1  Chemically sensitive? We've removed the irrita...   
2  New modern house built in 2013.

In [30]:
show_data(reviews)

The data has shape: (84831, 6)

There is 0 missing data!

   listing_id        id       date  reviewer_id reviewer_name  \
0     7202016  38917982 2015-07-19     28943674        Bianca   
1     7202016  39087409 2015-07-20     32440555         Frank   
2     7202016  39820030 2015-07-26     37722850           Ian   
3     7202016  40813543 2015-08-02     33671805        George   
4     7202016  41986501 2015-08-10     34959538          Ming   

                                            comments  
0  Cute and cozy place. Perfect location to every...  
1  Kelly has a great room in a very central locat...  
2  Very spacious apartment, and in a great neighb...  
3  Close to Seattle Center and all it has to offe...  
4  Kelly was a great host and very accommodating ...  
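
The visual checks in this notebook could also be expressed as programmatic checks. The sketch below is one way to do that for `calendar`; the function name `check_calendar` and the specific checks are assumptions based on the outputs above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (is_bool_dtype, is_datetime64_any_dtype,
                              is_float_dtype)


def check_calendar(cal):
    """Hypothetical helper: return the names of the checks that fail."""
    checks = {
        'no missing values': cal.isnull().sum().sum() == 0,
        'date is datetime': is_datetime64_any_dtype(cal['date']),
        'available is bool': is_bool_dtype(cal['available']),
        'price is float': is_float_dtype(cal['price']),
    }
    return [name for name, passed in checks.items() if not passed]


# A frame shaped like the processed calendar: every check passes.
good = pd.DataFrame({
    'listing_id': [241032, 241032],
    'date': pd.to_datetime(['2016-01-04', '2016-01-05']),
    'available': [True, False],
    'price': [85.0, 85.0],
})
problems_good = check_calendar(good)

# A frame shaped like unprocessed data: strings and a missing price.
bad = pd.DataFrame({
    'listing_id': [241032, 241032],
    'date': ['2016-01-04', '2016-01-05'],
    'available': ['t', 'f'],
    'price': [85.0, np.nan],
})
problems_bad = check_calendar(bad)
print(problems_good, problems_bad)
```

Returning the list of failed checks, rather than stopping at the first failure, shows every problem in a single run.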
