# Splitting the Data

Now it's time to select a training and test set.  Remember that we want the test set to consist of our most recent data, so that we do not train a model that is good at looking backwards but not so good at projecting forwards.

### Loading our Data

We can start by loading our data from our CSV file.

In [11]:
import pandas as pd
listings_df = pd.read_csv('./price_listings_ten_k.csv')

### Choosing a Split

The next step is to choose a split for our data.  The main thing here is to set up our model for the same situation it will face once it is deployed.  Our model will be tasked with taking past data and predicting future results, so we should split the data by datetime if appropriate.

Let's look through our dataset to see if there are any datetime columns that indicate the date of a listing.  Then we can split the listings by date.
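
As a rough illustration of what a split by date would look like, here is a minimal sketch on a made-up DataFrame.  The column names `listed_on` and `price`, and the 80/20 cutoff, are assumptions for illustration, not columns from our dataset.

```python
import pandas as pd

# Toy data: `listed_on` and `price` are hypothetical columns, not from the listings file.
toy_df = pd.DataFrame({
    'listed_on': pd.to_datetime(['2018-03-01', '2018-01-15', '2018-05-20',
                                 '2018-02-10', '2018-04-05']),
    'price': [80, 60, 120, 75, 95],
})

# Sort oldest to newest, then hold out the most recent 20 percent as the test set.
toy_sorted = toy_df.sort_values('listed_on').reset_index(drop=True)
cutoff = int(len(toy_sorted) * 0.8)
toy_train, toy_test = toy_sorted.iloc[:cutoff], toy_sorted.iloc[cutoff:]

print(toy_train['listed_on'].max(), toy_test['listed_on'].min())
```

Every row in the test set is then more recent than every row in the training set, which mirrors how the model would be used after deployment.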

> If we select for datetime columns, we see that we currently don't have any.

In [4]:
listings_df.select_dtypes(include = 'datetime').shape

(22547, 0)
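
An empty result here is not surprising: by default `pd.read_csv` loads date-like text as plain strings.  The sketch below shows the difference on a tiny in-memory CSV.  The data is made up, and `parse_dates` is shown only to illustrate the behavior.

```python
import io
import pandas as pd

csv_text = "id,last_scraped\n1,2018-11-07\n2,2018-11-09\n"

# Default load: the date column is read as text, so no datetime columns are found.
as_text = pd.read_csv(io.StringIO(csv_text))
print(as_text.select_dtypes(include='datetime').shape)

# With parse_dates, pandas converts the column while reading.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['last_scraped'])
print(as_dates.select_dtypes(include='datetime').shape)
```

On real data, a column with stray non-date values may stay as text even with `parse_dates`, which is one reason to inspect the strings directly.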

But we can use the `contains_date` function below to search for strings that might be dates.

In [5]:
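def contains_date(column):
    # Match whole-string dates such as 7-11-2018, 2018-11-07, 7/11/2018 or 2018/11/07.
    regex_string = (r'^\d{1,2}-\d{1,2}-\d{4}$|^\d{4}-\d{1,2}-\d{1,2}$'
                    r'|^\d{1,2}/\d{1,2}/\d{4}$|^\d{4}/\d{1,2}/\d{1,2}$')
    # Cast to str so non-string columns do not raise; na=False counts missing values as non-matches.
    return column.astype(str).str.contains(regex_string, na=False).any()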

In [6]:
# Store the result under a new name so the contains_date function is not overwritten.
date_columns = listings_df.apply(contains_date)


In [7]:
listings_df[date_columns[date_columns == True].index][:3]

| | last_scraped | host_since | latitude | is_location_exact | property_type | calendar_last_scraped | first_review | last_review |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-11-07 | 2008-08-18 | 52.5345 | f | Guesthouse | 2018-11-07 | 2016-04-11 | 2018-10-28 |
| 1 | 2018-11-07 | 2008-09-16 | 52.5485 | t | Apartment | 2018-11-07 | 2018-07-04 | 2018-10-01 |
| 2 | 2018-11-07 | 2008-10-19 | 52.535 | t | Apartment | 2018-11-07 | 2009-06-20 | 2017-03-20 |
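
Because `contains_date` uses `.any()`, a single date-like value is enough to flag a column, which may explain why `latitude`, `is_location_exact`, and `property_type` show up above.  A possible refinement, sketched here on made-up data, is to measure the share of values that look like dates.  The helper name `date_fraction` and the simplified pattern are assumptions for illustration.

```python
import pandas as pd

# Simplified ISO-style pattern (an assumption; narrower than the regex above).
DATE_PATTERN = r'^\d{4}-\d{1,2}-\d{1,2}$'

def date_fraction(column):
    """Share of non-null values in `column` that look like dates."""
    values = column.dropna().astype(str)
    if values.empty:
        return 0.0
    return values.str.contains(DATE_PATTERN).mean()

# Made-up columns: one mostly dates, one with a single stray date.
toy_df = pd.DataFrame({
    'last_scraped': ['2018-11-07', '2018-11-07', '2018-11-09', 't'],
    'property_type': ['Apartment', 'Loft', '2018-11-07', 'Guesthouse'],
})

fractions = toy_df.apply(date_fraction)
print(fractions)
print(fractions[fractions > 0.5].index.tolist())
```

With a threshold on the fraction, a column holding one stray date would no longer be treated as a date column.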


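Of the columns flagged, five actually hold dates: `last_scraped`, `host_since`, `calendar_last_scraped`, `first_review`, and `last_review`.  But do any of these columns tell us which listing prices are more recent?  Do you see any?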

Perhaps the best candidate is `last_scraped`.  If the data was scraped on a later date, then it was likely listed a little before then.  Let's take a look at that column.

In [8]:
listings_df['last_scraped'].value_counts()

2018-11-07    22541
2018-11-09        3
t                 2
f                 1
Name: last_scraped, dtype: int64
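
The `t` and `f` entries are not dates, so they would need handling before the column could be used for a date-based split.  One option, sketched on made-up values, is `pd.to_datetime` with `errors='coerce'`, which turns unparseable entries into `NaT`.

```python
import pandas as pd

# Made-up sample mirroring the kinds of values seen above.
raw = pd.Series(['2018-11-07', '2018-11-09', 't', 'f', '2018-11-07'])

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(parsed.isna().sum())          # number of entries that were not valid dates
print(parsed.min(), parsed.max())   # date range of the valid entries
```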

Ok, so it looks like nearly all of these listing prices were scraped on the same date, which means `last_scraped` cannot separate older listings from newer ones.  Perhaps the other column to look at is `last_review`, which may tell us if there are any out-of-date listings.

In [9]:
listings_df['last_review'].value_counts().sort_index()[:10]

2010-09-16    1
2011-01-26    1
2011-11-14    1
2012-07-08    1
2012-07-17    1
2012-07-27    1
2012-12-17    1
2013-06-14    1
2013-08-13    1
2013-09-01    1
Name: last_review, dtype: int64

Here it does appear that there are some stale listings.  We may need to do more research into this, but if a listing has not received a review in five years, its price may not have been updated in a long time.

This would be a good candidate to split our data by, but if we look at the number of null values, we see there are quite a lot.

In [10]:
listings_df['last_review'].isna().sum()

3912

In [11]:
listings_df.shape

(22547, 94)
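
Dividing the number of missing values by the number of rows gives the share of listings without a review date.  Here is a minimal sketch using the two numbers above, plus the equivalent pandas idiom on a made-up Series:

```python
import pandas as pd

# Counts taken from the two cells above.
missing, total = 3912, 22547
share = missing / total
print(round(share, 4))

# Equivalent idiom: the mean of a boolean mask is the fraction of True values.
toy = pd.Series(['2018-10-28', None, '2017-03-20', None, '2018-10-01'])
print(toy.isna().mean())
```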

So roughly 17 percent of the data is missing a `last_review` value.  We probably do not want to bias our training, validation, and test sets by having all of the na values land in just one of them.  So let's leave it alone for now.  Instead, we'll do a random split of the training and test data.

In [12]:
X = listings_df

In [13]:
X.shape

(22547, 94)

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test = train_test_split(X, random_state = 1, test_size = .2)

In [16]:
X_test.shape

(4510, 94)

In [17]:
X_train.shape

(18037, 94)
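
These shapes are consistent with `train_test_split` rounding the test size up when `test_size` is a fraction: 0.2 × 22547 = 4509.4 becomes 4510, and the remaining 18037 rows go to training.  Since the split is random, missing `last_review` values should also be spread across both sets in similar proportions.  The sketch below checks both points on synthetic data; the column values and the 17 percent missing rate are made up to resemble our dataset.

```python
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Test and train sizes for our 22547 rows with test_size=.2.
n_test = math.ceil(22547 * 0.2)
print(n_test, 22547 - n_test)

# Synthetic stand-in: about 17 percent of `last_review` values are missing.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'last_review': np.where(rng.random(1000) < 0.17, None, '2018-10-28')})

toy_train, toy_test = train_test_split(toy_df, random_state=1, test_size=.2)
print(toy_train['last_review'].isna().mean(), toy_test['last_review'].isna().mean())
```

If the two printed fractions are close, the random split has not concentrated the missing values in one set.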

In [22]:
# X_test.to_csv('x_test.csv')

In [21]:
# X_train.reset_index().to_feather('X_train_list.feather')

### Summary

In this lesson, we explored how best to split our data.  Because we want to train our model on the same task it will encounter when deployed, we want it to use past data to predict future outcomes.  To this end, we checked whether we could sort listings by the date of the listing.  We saw that `last_scraped` had almost the same date throughout.  And we saw that `last_review`, while perhaps indicative of older listings, had many null values.  So we resorted to a random split of the data.
