# Splitting the Data

Now it's time to select a training set and a test set. Remember that we want the test set to consist of our most recent data, so that we do not train a model that is good at looking backwards but not so good at projecting forwards.

### Loading our Data

We can start by loading our data from the zipped CSV file.

In [3]:
import pandas as pd
listing_url = "./listings_summary.csv.zip"

listings_df = pd.read_csv(listing_url, index_col = 0)

### Choosing a Split

To split by time we first need a date column, so let's check whether any columns were loaded with a datetime dtype.

In [4]:
listings_df.select_dtypes(include = 'datetime').shape

(22552, 0)

The shape `(22552, 0)` shows that no column has a datetime dtype. But we can use the `contains_date` function below to search the string columns for values that look like dates.

In [6]:
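def contains_date(column):
    # Only string (object) columns can hold date-like text
    if column.dtype != 'object':
        return False
    # Match d-m-yyyy or yyyy-m-d, with either '-' or '/' as the separator
    regex_string = (r'^\d{1,2}-\d{1,2}-\d{4}$|^\d{4}-\d{1,2}-\d{1,2}$'
                    r'|^\d{1,2}/\d{1,2}/\d{4}$|^\d{4}/\d{1,2}/\d{1,2}$')
    # Drop missing values first, then flag the column if any value matches
    return bool(column.dropna().astype(str).str.contains(regex_string).any())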
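
To see which strings the pattern accepts, here is a small sketch that runs the same four alternatives against a handful of made-up values. The `samples` series and the name `date_pattern` are illustrative assumptions, not part of the listings data.

```python
import pandas as pd

# Hypothetical sample values, not taken from the listings data
samples = pd.Series(["2018-11-07", "11/07/2018", "7-11-2018", "2018/1/5",
                     "November 7, 2018", "20181107", None])

# Same four alternatives as in contains_date
date_pattern = (r'^\d{1,2}-\d{1,2}-\d{4}$|^\d{4}-\d{1,2}-\d{1,2}$'
                r'|^\d{1,2}/\d{1,2}/\d{4}$|^\d{4}/\d{1,2}/\d{1,2}$')

# na=False makes missing values count as non-matches
matches = samples.str.contains(date_pattern, na=False)
print(matches.tolist())
```

Only the purely numeric forms with `-` or `/` separators match, so a column that stores dates as written-out month names or as undelimited digits would be missed by this check.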

In [8]:
contains_date_ser = listings_df.apply(contains_date)

In [1]:
# contains_date_ser.values

In [10]:
potential_date_df = listings_df.iloc[:, contains_date_ser.values]

In [12]:
potential_date_df.dtypes

last_scraped             object
host_since               object
calendar_last_scraped    object
first_review             object
last_review              object
dtype: object

In [16]:
# Recent pandas versions require an explicit unit such as [ns]
date_df = potential_date_df.astype('datetime64[ns]')

In [18]:
date_df.dtypes

last_scraped             datetime64[ns]
host_since               datetime64[ns]
calendar_last_scraped    datetime64[ns]
first_review             datetime64[ns]
last_review              datetime64[ns]
dtype: object

In [22]:
updated_date_df = potential_date_df.apply(pd.to_datetime)
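
The conversion above assumes every non-missing value parses as a date. As a hedged sketch on made-up values (the `raw_dates` series is hypothetical), `pd.to_datetime` with `errors="coerce"` shows what happens when a column contains missing or malformed entries:

```python
import pandas as pd

# Hypothetical column with one missing and one malformed entry
raw_dates = pd.Series(["2018-11-07", None, "not a date"])

# errors="coerce" turns anything unparsable into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many values ended up without a usable date
n_missing = parsed.isna().sum()
print(parsed)
print(n_missing)
```

Counting the `NaT` values after conversion is a quick way to see how many rows could not be placed in time if that column were used for the split.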

In [27]:
updated_date_df[:3]

| id | last_scraped | host_since | calendar_last_scraped | first_review | last_review |
|---|---|---|---|---|---|
| 2015 | 2018-11-07 | 2008-08-18 | 2018-11-07 | 2016-04-11 | 2018-10-28 |
| 2695 | 2018-11-07 | 2008-09-16 | 2018-11-07 | 2018-07-04 | 2018-10-01 |
| 3176 | 2018-11-07 | 2008-10-19 | 2018-11-07 | 2009-06-20 | 2017-03-20 |


In [26]:
updated_date_df.columns

Index(['last_scraped', 'host_since', 'calendar_last_scraped', 'first_review',
       'last_review'],
      dtype='object')

In [25]:
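# Assign by column name so the datetime columns replace the old object columns
# (in recent pandas, .loc[:, cols] = ... sets values in place and keeps the object dtype)
listings_df[updated_date_df.columns] = updated_date_df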

In [29]:
listings_df.select_dtypes('object').shape

(22552, 57)

In [33]:
listings_df.select_dtypes('datetime64')[:5]

| id | last_scraped | host_since | calendar_last_scraped | first_review | last_review |
|---|---|---|---|---|---|
| 2015 | 2018-11-07 | 2008-08-18 | 2018-11-07 | 2016-04-11 | 2018-10-28 |
| 2695 | 2018-11-07 | 2008-09-16 | 2018-11-07 | 2018-07-04 | 2018-10-01 |
| 3176 | 2018-11-07 | 2008-10-19 | 2018-11-07 | 2009-06-20 | 2017-03-20 |
| 3309 | 2018-11-07 | 2008-11-07 | 2018-11-07 | 2013-08-12 | 2018-08-16 |
| 7071 | 2018-11-07 | 2009-05-16 | 2018-11-07 | 2009-08-18 | 2018-11-04 |


In [40]:
# listings_df['last_review'].value_counts().sort_index()

In [42]:
# listings_df['last_review']

* Candidate columns to explore for ordering the split
    * last_scraped
    * last_review
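
One way to compare the candidates is to summarize each date column: how many distinct dates it has, how many values are missing, and what range it covers. The sketch below uses a small stand-in frame (`dates` is hypothetical; its values are borrowed from the first rows shown above, plus one made-up missing value) rather than the full `listings_df`.

```python
import pandas as pd

# Hypothetical stand-in for the candidate date columns of listings_df
dates = pd.DataFrame({
    "last_scraped": pd.to_datetime(["2018-11-07"] * 4),
    "last_review": pd.to_datetime(
        ["2018-10-28", "2018-10-01", "2017-03-20", None]),
})

# For each candidate: distinct dates, missing values, and the range covered
summary = pd.DataFrame({
    "n_unique": dates.nunique(),
    "n_missing": dates.isna().sum(),
    "earliest": dates.min(),
    "latest": dates.max(),
})
print(summary)
```

A column with a single distinct date cannot order the rows, and a column with many missing values would force us to drop or otherwise handle those rows, so both counts are worth checking on the real data before committing to a split column.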

In [43]:
from sklearn.model_selection import train_test_split

# listings_df_train, listings_df_test = train_test_split(listings_df)

In [44]:
# listings_df_train.to_csv('./')

### Test train split
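
By default `train_test_split` shuffles rows at random, which does not give us a test set made of the most recent data. A minimal sketch of a chronological split follows, assuming a date column has been chosen. The helper `chronological_split`, the `toy` frame, and the 40% test fraction are all illustrative assumptions, not the lesson's final choice.

```python
import pandas as pd

def chronological_split(df, date_col, test_frac=0.2):
    """Hold out the most recent test_frac of rows (by date_col) as the test set."""
    # Rows with no date cannot be placed in time, so this sketch drops them
    ordered = df.dropna(subset=[date_col]).sort_values(date_col)
    n_test = int(round(len(ordered) * test_frac))
    split_at = len(ordered) - n_test
    # Earlier rows become the training set, later rows the test set
    return ordered.iloc[:split_at], ordered.iloc[split_at:]

# Hypothetical toy frame standing in for listings_df
toy = pd.DataFrame({
    "price": [50, 60, 70, 80, 90],
    "last_review": pd.to_datetime(
        ["2018-10-28", "2018-10-01", "2017-03-20", "2018-08-16", "2018-11-04"]),
})

train, test = chronological_split(toy, "last_review", test_frac=0.4)
print(train)
print(test)
```

Every date in `test` is at least as recent as every date in `train`. Calling `train_test_split` with `shuffle=False` on a frame already sorted by date would be another way to get a similar result.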

### Summary
