## Data Splitting in Python

This tutorial shows how to split a data set into training and test sets.

First, import pandas and load the data set we'll use.

In [1]:
import pandas as pd

In [2]:
housing_dat = pd.read_csv("~/Desktop/Python/data_splitting/houses_to_rent.csv")

Preview the data set.

In [3]:
housing_dat

Unnamed: 0.1,Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,0,1,240,3,3,4,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121,"R$9,121"
1,1,0,64,2,1,1,10,acept,not furnished,R$540,R$820,R$122,R$11,"R$1,493"
2,2,1,443,5,5,4,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89,"R$12,680"
3,3,1,73,2,2,1,12,acept,not furnished,R$700,"R$1,250",R$150,R$16,"R$2,116"
4,4,1,19,1,1,0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16,"R$1,257"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6075,6075,1,50,2,1,1,2,acept,not furnished,R$420,"R$1,150",R$0,R$15,"R$1,585"
6076,6076,1,84,2,2,1,16,not acept,furnished,R$768,"R$2,900",R$63,R$37,"R$3,768"
6077,6077,0,48,1,1,0,13,acept,not furnished,R$250,R$950,R$42,R$13,"R$1,255"
6078,6078,1,160,3,2,2,-,not acept,not furnished,R$0,"R$3,500",R$250,R$53,"R$3,803"


Create an X DataFrame with the predictor variables and a y Series with the outcome variable.

In [4]:
X = housing_dat.iloc[:,1:13]

In [5]:
X

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance
0,1,240,3,3,4,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121
1,0,64,2,1,1,10,acept,not furnished,R$540,R$820,R$122,R$11
2,1,443,5,5,4,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89
3,1,73,2,2,1,12,acept,not furnished,R$700,"R$1,250",R$150,R$16
4,1,19,1,1,0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16
...,...,...,...,...,...,...,...,...,...,...,...,...
6075,1,50,2,1,1,2,acept,not furnished,R$420,"R$1,150",R$0,R$15
6076,1,84,2,2,1,16,not acept,furnished,R$768,"R$2,900",R$63,R$37
6077,0,48,1,1,0,13,acept,not furnished,R$250,R$950,R$42,R$13
6078,1,160,3,2,2,-,not acept,not furnished,R$0,"R$3,500",R$250,R$53
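
The positional slice `iloc[:, 1:13]` above depends on the column order of the file. As a minimal sketch of an alternative, predictors can be selected by dropping the outcome column by name. The small `toy` DataFrame below is a made-up stand-in for `housing_dat`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for housing_dat with only a few columns and rows.
toy = pd.DataFrame({
    "city": [1, 0, 1],
    "area": [240, 64, 443],
    "rooms": [3, 2, 5],
    "total": ["R$9,121", "R$1,493", "R$12,680"],
})

# Predictors: every column except the outcome, selected by name.
X_toy = toy.drop(columns=["total"])

# Outcome: the single "total" column as a Series.
y_toy = toy["total"]

print(list(X_toy.columns))
print(y_toy.name)
```

Selecting by name keeps working if columns are reordered, although any extra index-like columns in the real file would still need to be dropped explicitly.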


In [6]:
y = housing_dat.loc[:,"total"]

In [7]:
y

0        R$9,121
1        R$1,493
2       R$12,680
3        R$2,116
4        R$1,257
          ...   
6075     R$1,585
6076     R$3,768
6077     R$1,255
6078     R$3,803
6079     R$2,414
Name: total, Length: 6080, dtype: object
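
The `dtype: object` in the output above indicates that the totals are stored as strings such as `R$9,121` rather than numbers. Splitting works either way, but most models need a numeric outcome. The sketch below shows one possible conversion, assuming every value follows the `R$` plus comma-separated integer pattern seen in the preview, with no missing or decimal values. The name `y_toy` is illustrative.

```python
import pandas as pd

# A few values copied from the preview above.
y_toy = pd.Series(["R$9,121", "R$1,493", "R$12,680"], name="total")

# Strip the "R$" prefix and the thousands separators, then convert to int.
y_numeric = (
    y_toy.str.replace("R$", "", regex=False)
         .str.replace(",", "", regex=False)
         .astype(int)
)

print(y_numeric.tolist())
```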

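Use scikit-learn's `train_test_split` to split the data. With `test_size=.2`, 20% of the rows are held out as the test set. The function returns a list of four objects: the training and test predictors, followed by the training and test outcomes.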

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
train_test_split(X,y,test_size=.2)

[      city  area  rooms  bathroom  parking spaces floor     animal  \
 908      1   240      3         1               2     8      acept   
 3393     1   300      4         4               4    12      acept   
 3658     0    58      2         1               1     2      acept   
 2166     1   380      5         7               4    21      acept   
 2217     1   192      3         3               3    16      acept   
 ...    ...   ...    ...       ...             ...   ...        ...   
 1049     1    38      1         1               0    16  not acept   
 492      1    48      2         1               0     -      acept   
 979      0    57      1         1               1     2      acept   
 3610     1   260      2         3               4     -      acept   
 360      1   400      3         3               2     -  not acept   
 
           furniture      hoa rent amount property tax fire insurance  
 908   not furnished  R$2,088     R$4,000      R$1,449           R$51  
 ... (output truncated)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2)
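
A quick way to confirm the split behaved as expected is to check the shapes of the returned pieces and that predictors and outcomes still share the same row labels. The sketch below uses small made-up data (`X_toy`, `y_toy`) rather than the housing data so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 predictor columns, 1 outcome.
X_toy = pd.DataFrame({"area": np.arange(10), "rooms": np.arange(10) % 3})
y_toy = pd.Series(np.arange(10) * 100, name="total")

X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=.2
)

# With 10 rows and test_size=.2, 8 rows go to training and 2 to testing.
print(X_train_toy.shape, X_test_toy.shape)

# Rows are shuffled, but X and y stay aligned on the same index labels.
print(X_train_toy.index.equals(y_train_toy.index))
```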

Add the optional `random_state` parameter to set a seed, so the same rows land in the training and test sets every time the cell is run.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1337)
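
To see the effect of the seed, the sketch below splits the same made-up arrays twice with `random_state=1337` and checks that both calls return identical pieces. The arrays `X_toy` and `y_toy` are placeholders, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same seed.
first = train_test_split(X_toy, y_toy, test_size=.2, random_state=1337)
second = train_test_split(X_toy, y_toy, test_size=.2, random_state=1337)

# Compare the four returned arrays pairwise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```

Without `random_state`, each call shuffles differently, so results from repeated runs are generally not comparable.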