# **Data Cleaning & Preparation**

### Splitting Data
### **Session Objectives**

<hr>

**General Instructional Objective:** Participants are able to prepare data for building machine learning models.

**Session Objective:** Participants are able to split data.

<hr>

## **Splitting Dataset**

When you’re working on a model and want to train it, you obviously have a dataset. But after training, the model has to be tested on a test dataset. For this, you’ll need a dataset that is different from the training set you used earlier. However, it might not always be possible to have that much data during the development phase.

In such cases, the obvious solution is to split the dataset you have into two sets, one for training and the other for testing, and to do this before you start training your model.

But the question is, how do you split the data? Splitting a dataset by hand is impractical, and you also have to make sure the split is random. To help with this task, scikit-learn provides the `model_selection` module, which contains a function aptly named `train_test_split`. Using it, we can easily split a dataset into training and testing sets in various proportions.
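
To make the idea of a random split concrete, here is a minimal sketch that does it by hand with NumPy. It is not how scikit-learn implements `train_test_split`; the toy arrays `X` and `y`, the 80/20 proportion, and the seed are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: 10 samples with one feature and one target each.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

# Shuffle the row indices with a fixed seed so the split is reproducible.
rng = np.random.default_rng(seed=42)
indices = rng.permutation(len(X))

# Use the first 80% of the shuffled indices for training, the rest for testing.
n_train = int(0.8 * len(X))
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```

Indexing `X` and `y` with the same index arrays keeps every feature row paired with its target, which is the part that is easy to get wrong when splitting manually.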

<img src = "a_img.png" style="width:600px;height:200px"/>

### **What is Overfitting/Underfitting a Model?**

As mentioned, in statistics and machine learning we usually split our data into two subsets, training data and testing data (and sometimes into three: train, validate and test). We fit the model on the training data and then make predictions on the test data. When we do that, one of two things might happen: we overfit the model or we underfit it. We want to avoid both, because they hurt the predictive performance of the model: we might end up with a model that has lower accuracy and/or does not generalize (meaning its predictions do not carry over to other data). Let’s see what underfitting and overfitting actually mean:
#### - **Overfitting**

Overfitting means that the model has been trained “too well” and now fits the training dataset too closely. This usually happens when the model is too complex (i.e. too many features/variables compared to the number of observations). Such a model will be very accurate on the training data but will probably be much less accurate on unseen or new data. This is because the model is not generalized (or not as generalized), meaning you cannot generalize its results or make inferences on other data, which is ultimately what you are trying to do. Basically, when this happens, the model learns or describes the “noise” in the training data instead of the actual relationships between variables in the data. This noise is specific to the training set, so what the model learned from it does not apply to any new dataset.
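
The effect described above can be sketched with a small experiment: fit a simple and a very flexible polynomial to the same noisy training points and compare their errors. The data, the noise level, and the degrees 1 and 9 are assumptions made up for this illustration, and the exact numbers depend on the random seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical data: a linear trend plus random noise.
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.3, size=x.size)

# Random split: 20 points for training, 10 held out for testing.
idx = rng.permutation(x.size)
train, test = idx[:20], idx[20:]

def mse(degree):
    """Fit a polynomial of the given degree on the training points and
    return (training MSE, test MSE)."""
    p = Polynomial.fit(x[train], y[train], degree)
    train_err = np.mean((p(x[train]) - y[train]) ** 2)
    test_err = np.mean((p(x[test]) - y[test]) ** 2)
    return train_err, test_err

# Degree 1 matches the true trend; degree 9 has enough freedom to chase noise.
results = {degree: mse(degree) for degree in (1, 9)}
for degree, (train_err, test_err) in results.items():
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The degree-9 fit can never have a higher training error than the degree-1 fit, because it contains the straight line as a special case. A test error that is clearly worse than the training error is the usual sign that the extra flexibility went into fitting noise.
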
#### - **Underfitting**

In contrast to overfitting, an underfitted model does not fit even the training data well and therefore misses the trends in the data. It also cannot be generalized to new data. As you probably guessed (or figured out!), this is usually the result of a model that is too simple (not enough predictors/independent variables). It can also happen when, for example, we fit a linear model (like linear regression) to data that is not linear. Such a model will have poor predictive ability, both on the training data and on any other data.
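
As a rough illustration of the linear-model case mentioned above, the sketch below fits scikit-learn's `LinearRegression` to data that follows a quadratic curve, once with only `x` as a feature and once with `x` and `x**2`. The data and the helper `train_and_test_r2` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical non-linear data: y depends on x quadratically.
x = np.linspace(-3, 3, 200)
y = x ** 2

X_linear = x.reshape(-1, 1)                 # feature: x only
X_quadratic = np.column_stack([x, x ** 2])  # features: x and x^2

def train_and_test_r2(X):
    """Split, fit a linear regression, and return (train R^2, test R^2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

linear_scores = train_and_test_r2(X_linear)
quadratic_scores = train_and_test_r2(X_quadratic)

print("x only:    train R^2 = %.3f, test R^2 = %.3f" % linear_scores)
print("x and x^2: train R^2 = %.3f, test R^2 = %.3f" % quadratic_scores)
```

The model with only `x` scores poorly even on the data it was trained on, which is what distinguishes underfitting from overfitting. Giving the model the missing `x**2` feature lets it capture the trend.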

<img src = "b_img.png" style="width:500px;height:200px"/>

## **Train/Test Split**
### **Parameters**

- **test_size**: This parameter sets the size of the test dataset. Passed as a float between 0 and 1, it is the fraction of the data to put in the test set. For example, if you pass 0.5, 50% of the dataset becomes the test dataset. (An integer is interpreted as an absolute number of test samples.) If you specify this parameter, you can leave out the next one.
- **train_size**: You only have to specify this parameter if you are not specifying test_size. It works the same way as test_size, but instead tells the function what fraction of the dataset to use as the training set.
- **random_state**: Here you pass an integer, which acts as the seed for the random number generator during the split, so the same seed reproduces the same split. You can also pass an instance of the RandomState class, which will then be used as the number generator. If you don’t pass anything, the RandomState instance used by np.random is used instead.
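
The three parameters above can be tried out on a small array. This is a minimal sketch; the toy data and the specific values (0.25, 0.75, the seeds) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# test_size as a fraction: 25% of 20 samples gives 5 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 15 5

# train_size instead of test_size: 75% of 20 samples gives 15 training samples.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, train_size=0.75, random_state=0
)
print(len(X_train2), len(X_test2))  # 15 5

# The same random_state reproduces exactly the same split.
first = train_test_split(X, y, test_size=0.25, random_state=42)
second = train_test_split(X, y, test_size=0.25, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

Fixing `random_state` is mostly useful while developing or teaching, because results stay comparable between runs.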

##### **Note**: 
Train/test split does have its dangers: what if the split we make isn’t random? What if one subset of our data has only people from a certain state, only employees at a certain income level, only women, or only people of a certain age (imagine a file ordered by one of these)? The model is then trained and evaluated on unrepresentative subsets, which can result in overfitting even though we’re trying to avoid it! This is where cross validation comes in.
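
The ordered-file problem from the note can be reproduced in a few lines. The sketch below uses the `shuffle` and `stratify` parameters of `train_test_split`; the sorted labels (80 samples of one class followed by 20 of another) are a made-up example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical file sorted by class: 80 rows of class 0, then 20 rows of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# Without shuffling, the last 20 rows become the test set,
# so training sees only class 0 and testing sees only class 1.
_, _, y_train_ordered, y_test_ordered = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(np.unique(y_train_ordered), np.unique(y_test_ordered))  # [0] [1]

# stratify=y shuffles and keeps the class proportions equal in both subsets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train_strat.mean(), y_test_strat.mean())
```

`train_test_split` shuffles by default, so the first case only occurs when shuffling is turned off or the split is done by slicing. Stratifying is an additional safeguard when one class is much rarer than the other.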

In [4]:
# Splitting data with sklearn: train_test_split
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd

In [5]:
# Toy dataset: 'massa' (mass) and 'harga' (price), both running from 1 to 100
data = {
    'massa': np.arange(1, 101),
    'harga': np.arange(1, 101)
}

df = pd.DataFrame(data)
df

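|     | massa | harga |
|-----|-------|-------|
| 0   | 1     | 1     |
| 1   | 2     | 2     |
| 2   | 3     | 3     |
| 3   | 4     | 4     |
| 4   | 5     | 5     |
| ... | ...   | ...   |
| 95  | 96    | 96    |
| 96  | 97    | 97    |
| 97  | 98    | 98    |
| 98  | 99    | 99    |
| 99  | 100   | 100   |

100 rows × 2 columns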


In [19]:
xTrain, xTest, yTrain, yTest = train_test_split(
    df[['massa']], df['harga'],
    train_size = .85, # 85% of the data is used for training
#     test_size = .15 # 15% of the data is used for testing
)
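
To check that the split above did what was intended, the resulting objects can be inspected. This sketch rebuilds the same toy DataFrame so it runs on its own; the `random_state=42` argument is an addition for reproducibility and is not part of the original cell.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same toy DataFrame as above: 100 rows, 'massa' (mass) and 'harga' (price).
df = pd.DataFrame({'massa': np.arange(1, 101), 'harga': np.arange(1, 101)})

xTrain, xTest, yTrain, yTest = train_test_split(
    df[['massa']], df['harga'],
    train_size=0.85,
    random_state=42,  # assumption: fixed seed, added only for reproducibility
)

# 85% of 100 rows go to training, the remaining 15 rows to testing.
print(xTrain.shape, xTest.shape)  # (85, 1) (15, 1)
print(yTrain.shape, yTest.shape)  # (85,) (15,)

# Row labels are preserved, so features and targets stay aligned...
print(xTrain.index.equals(yTrain.index))  # True
# ...and no row appears in both the training and the test set.
print(xTrain.index.intersection(xTest.index).empty)  # True
```

Because `df[['massa']]` is a DataFrame and `df['harga']` is a Series, the feature outputs are two-dimensional and the target outputs are one-dimensional, which is the layout scikit-learn estimators expect.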

## **Take Home Exercise**

- The data used is cancer_data.csv. A description of the dataset can be found at *Breast Cancer Wisconsin (Diagnostic) Data Set, https://www.kaggle.com/uciml/breast-cancer-wisconsin-data*
- Run feature selection and conclude which features are important for the target (whether the cancer is malignant or benign)!
- Split the data into 85% training data and 15% test data!

#### **Reference**:
* Adi Bronshtein, "Train/Test Split and Cross Validation in Python", https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
* Sunny Srinidhi, "How to split your dataset to train and test datasets using SciKit Learn", https://medium.com/@contactsunny/how-to-split-your-dataset-to-train-and-test-datasets-using-scikit-learn-e7cf6eb5e0d
