## The Importance of the Train-Validation-Test Split
```In this exercise you will experiment with an important and often neglected issue in the data scientist's work: the train-validation-test split. We will work with a specific dataset and see that the naïve splitting method does not yield good results. We will then examine different splitting methods that take the structure of the dataset into consideration and consequently yield better results.```

```~ Ittai Haran```

In [0]:
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

```First, load the dataset, given to you as data.csv, using pandas. (Note that you are also given a separate dataset, the file data_test.csv, which we will use later for testing.)```

```Take a moment to explore the dataset. Notice that the dataset is made up of pairs of objects: each row has two ids, one for each object in the pair (the objects are labeled 'left' and 'right'). Each row also includes the features of both objects. Finally, each row has a target that we would like to predict.```

```How many different objects are there?```
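A minimal sketch of loading the data and counting distinct objects. The column names (`left_id`, `right_id`, `left_f0`, `right_f0`, `target`) and the tiny inline CSV are assumptions made for illustration; the real data.csv may use different names.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv; the real column names may differ.
toy_csv = io.StringIO(
    "left_id,right_id,left_f0,right_f0,target\n"
    "a,x,0.1,1.0,3\n"
    "a,y,0.1,2.0,4\n"
    "b,x,0.5,1.0,5\n"
    "c,z,0.9,3.0,7\n"
)
data = pd.read_csv(toy_csv)  # in the exercise: pd.read_csv("data.csv")

# Distinct objects on each side of the pair.
n_left = data["left_id"].nunique()
n_right = data["right_id"].nunique()

# Distinct objects overall; the union guards against ids that appear on both sides.
n_objects = pd.concat([data["left_id"], data["right_id"]]).nunique()
print(n_left, n_right, n_objects)
```

If the same id value can denote different objects on the left and on the right, the union count would undercount, so it is worth checking that first.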

```We would like to describe the objects and the data using an appropriate data structure. Which data structure can best describe this dataset?```


```You will now use a library called networkx, which is used for manipulating graphs in Python. Use it to create a graph from the pandas dataset you just loaded.```
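One possible way to build the graph, sketched on toy data: each object becomes a node and each row becomes an edge. The column names are hypothetical, and the sketch assumes left and right ids never collide (otherwise two different objects would be merged into one node).

```python
import networkx as nx
import pandas as pd

# Toy pairs with assumed column names; replace with the loaded dataset.
data = pd.DataFrame({
    "left_id": ["a", "a", "b", "c"],
    "right_id": ["x", "y", "x", "z"],
    "target": [3, 4, 5, 7],
})

# Every row becomes an edge between its left and right object;
# the target is stored as an edge attribute.
G = nx.from_pandas_edgelist(
    data, source="left_id", target="right_id", edge_attr="target"
)
print(G.number_of_nodes(), G.number_of_edges())
```

Note that `nx.Graph` keeps a single edge per pair, so repeated (left, right) rows would collapse into one edge.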

```How many connected components does the graph have? Draw a histogram of their sizes (use networkx).```
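A small sketch of counting connected components and tabulating their sizes, again on toy data with assumed column names. The size counts are what a histogram would display.

```python
from collections import Counter

import networkx as nx
import pandas as pd

# Toy pairs with assumed column names.
data = pd.DataFrame({
    "left_id": ["a", "a", "b", "c"],
    "right_id": ["x", "y", "x", "z"],
})
G = nx.from_pandas_edgelist(data, source="left_id", target="right_id")

# Each component is a set of node ids.
components = list(nx.connected_components(G))
sizes = [len(c) for c in components]  # number of nodes per component

# How many components there are of each size.
size_counts = Counter(sizes)
print(len(components), dict(size_counts))
```

Passing `sizes` to `plt.hist` would draw the histogram; a logarithmic y-axis may help if a few components are much larger than the rest.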

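```Are there any edges that do not connect a left object to a right object? A graph in which every edge connects a node from one set (here, the left objects) to a node from another, disjoint set (the right objects) is called a bipartite graph.```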
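A hedged sketch of the bipartiteness check. Since every row joins a left id to a right id, the left/right partition is valid exactly when no id appears on both sides; `bipartite.is_bipartite` from networkx is shown as a weaker sanity check (it only tests whether some two-colouring exists). Column names are assumptions.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

# Toy pairs with assumed column names.
data = pd.DataFrame({
    "left_id": ["a", "a", "b", "c"],
    "right_id": ["x", "y", "x", "z"],
})
G = nx.from_pandas_edgelist(data, source="left_id", target="right_id")

# Ids that occur on both sides would break the left/right partition.
left_ids = set(data["left_id"])
right_ids = set(data["right_id"])
shared = left_ids & right_ids

# Weaker check: is the graph two-colourable at all?
is_bip = bipartite.is_bipartite(G)
print(shared, is_bip)
```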

```In order to create a baseline model we will try to make predictions using only one object from each row.```

```Create a dataset containing only the features of the left objects. Drop duplicates, so that every object appears only once.```
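A possible sketch of the left-only dataset. It assumes that left columns share a `left_` prefix and that the target is the same in every row of a given left object; both are assumptions about the data, not facts taken from it.

```python
import pandas as pd

# Toy data with assumed column names; the target is constant per left object here.
data = pd.DataFrame({
    "left_id": ["a", "a", "b", "c"],
    "right_id": ["x", "y", "x", "z"],
    "left_f0": [0.1, 0.1, 0.5, 0.9],
    "right_f0": [1.0, 2.0, 1.0, 3.0],
    "target": [3, 3, 5, 7],
})

# Keep only the left object's columns plus the target.
left_cols = [c for c in data.columns if c.startswith("left_")]
left_data = (
    data[left_cols + ["target"]]
    .drop_duplicates(subset="left_id")  # one row per left object
    .reset_index(drop=True)
)
print(left_data)
```

If the target varies between rows of the same left object, `drop_duplicates(subset="left_id")` silently keeps the first one, so that assumption is worth verifying on the real data.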

```We will now use the regular, naïve splitting method. Split your data randomly into train and validation segments with a 0.7/0.3 ratio. Train a simple model (a random forest, for example) to predict the target. Make sure your model isn't overfitted, and try to get the best score you can on the validation segment. Compare your results to a simple baseline: the mean of the target computed on the train segment.```
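A minimal sketch of the naïve split with a mean baseline, using scikit-learn on synthetic data. It assumes the target is numeric (regression) and uses mean squared error as the loss; the exercise does not fix either choice, and the hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Naive random 70/30 split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A larger min_samples_leaf is one simple way to limit overfitting.
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)

model_mse = mean_squared_error(y_val, model.predict(X_val))

# Baseline: always predict the mean target of the train segment.
baseline_pred = np.full_like(y_val, y_train.mean())
baseline_mse = mean_squared_error(y_val, baseline_pred)
print(model_mse, baseline_mse)
```

Comparing the train loss with `model_mse` gives a quick indication of overfitting.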

```Use the model you got to compute loss on the separate test dataset given to you (data_test.csv). Did you get similar results for the validation and test segments?```

```Repeat that process, only this time use all features, and not just those of the left objects. Accordingly, you don't have to drop any duplicates. Again, use the naïve splitting method.```

```Did you get a good score on both train and validation? Why (or why not)?```

```Do you think the score you got on the validation segment corresponds to the "real" error? Compute the loss on the separate test segment to get the "real" error. Is there a gap between the validation error and the test error? Why?```

```We will now use a different splitting method. In this method, every connected component is contained entirely in either the train segment or the validation segment. To do so, implement the following algorithm (use networkx):```

```
while length(train_segment) < 0.7 * length(data):
    randomly choose a row r (an edge in the graph) from the data that is not in train_segment
    add the connected component containing r to train_segment

validation_segment = data - train_segment
```

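*```Useful networkx functions: set_edge_attributes, connected_components (connected_component_subgraphs was removed in networkx 2.4; in newer versions, call G.subgraph(c) on each component c returned by connected_components)```*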
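The row-driven component split can be sketched as follows. This is one possible implementation, not the reference solution: the function name `split_by_random_row`, the column names, and the toy data are all hypothetical, and it assumes a unique DataFrame index and non-colliding left and right ids.

```python
import random

import networkx as nx
import pandas as pd


def split_by_random_row(data, left="left_id", right="right_id", frac=0.7, seed=0):
    """Grow the train segment by whole components, picked via a random row."""
    G = nx.from_pandas_edgelist(data, source=left, target=right)

    # Map every node to the index of its connected component.
    comp_of = {}
    for i, comp in enumerate(nx.connected_components(G)):
        for node in comp:
            comp_of[node] = i
    row_comp = data[left].map(comp_of)  # component index of each row

    rng = random.Random(seed)
    in_train = pd.Series(False, index=data.index)
    while in_train.sum() < frac * len(data):
        # Pick a random row that is not yet in the train segment ...
        candidates = list(in_train.index[~in_train.to_numpy()])
        r = rng.choice(candidates)
        # ... and add every row of its component.
        in_train |= row_comp == row_comp.loc[r]
    return data[in_train], data[~in_train]


# Toy graph: one component with 10 rows and ten components with 1 row each.
rows = [("L0", f"R{j}") for j in range(10)]
rows += [(f"L{i}", f"R{i + 9}") for i in range(1, 11)]
data = pd.DataFrame(rows, columns=["left_id", "right_id"])

train, val = split_by_random_row(data)
print(len(train), len(val))
```

Because a row is drawn uniformly, a component is picked with probability proportional to its number of rows, so large components tend to enter the train segment first.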

```Train a model using your new train segment.```

```What is the best score you can get on your validation segment? Compare it to the test segment. ```

```What is the problem with this new splitting method? Hint: How many connected components are there in the train segment, and how many in the validation segment? Compare the distribution of the target in the train segment with its distribution in the validation segment (you can create a histogram). Do they look the same?```
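One way to compare the two target distributions numerically is to histogram both segments on shared bin edges. The sketch below uses synthetic placeholder targets (not values from the dataset) and a total-variation-style gap as a hypothetical summary number.

```python
import numpy as np

# Synthetic placeholder targets; replace with the target columns of your segments.
rng = np.random.default_rng(0)
train_target = rng.normal(0.0, 1.0, size=700)
val_target = rng.normal(0.5, 1.0, size=300)

# Shared bin edges make the two histograms directly comparable.
edges = np.histogram_bin_edges(np.concatenate([train_target, val_target]), bins=20)
train_hist, _ = np.histogram(train_target, bins=edges, density=True)
val_hist, _ = np.histogram(val_target, bins=edges, density=True)

# Gap between the two binned distributions: 0 means identical, 1 means disjoint.
bin_width = np.diff(edges)
gap = 0.5 * np.sum(np.abs(train_hist - val_hist) * bin_width)
print(gap)
```

Plotting both segments with `plt.hist(..., bins=edges, density=True, alpha=0.5)` shows the same comparison visually.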

```Can we safely use this validation error to estimate the error on the test segment?```

```We will now try a third method for splitting the dataset. This time make sure you have around 70% of the connected components in your train segment. That is, implement the following algorithm:```

```
while length(train_segment) < 0.7 * length(data):
    randomly choose a connected component c from the graph
    add c to train_segment

validation_segment = data - train_segment
```
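The component-driven split can be sketched in the same style. The function name `split_by_random_component`, the column names, and the toy data are hypothetical. Shuffling the component list is used here in place of repeated random draws; it has the same effect as ignoring components that were already chosen.

```python
import random

import networkx as nx
import pandas as pd


def split_by_random_component(data, left="left_id", right="right_id",
                              frac=0.7, seed=0):
    """Add uniformly chosen components until the train segment is large enough."""
    G = nx.from_pandas_edgelist(data, source=left, target=right)
    components = list(nx.connected_components(G))

    # Map every node to the index of its connected component.
    comp_of = {node: i for i, comp in enumerate(components) for node in comp}
    row_comp = data[left].map(comp_of)

    # Visit the components in a uniformly random order.
    order = list(range(len(components)))
    random.Random(seed).shuffle(order)

    in_train = pd.Series(False, index=data.index)
    n_train_components = 0
    for c in order:
        if in_train.sum() >= frac * len(data):
            break
        in_train |= row_comp == c
        n_train_components += 1

    share = n_train_components / len(components)
    return data[in_train], data[~in_train], share


# Toy graph: one component with 10 rows and ten components with 1 row each.
rows = [("L0", f"R{j}") for j in range(10)]
rows += [(f"L{i}", f"R{i + 9}") for i in range(1, 11)]
data = pd.DataFrame(rows, columns=["left_id", "right_id"])

train, val, share = split_by_random_component(data)
print(len(train), len(val), share)
```

The returned `share` is the fraction of components that ended up in the train segment, which is the quantity the next question asks about.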

```What fraction of the connected components do you have in your train segment this time? Look again at the distribution of the target in the two segments (train and validation).```

```Train a model using the new train segment.```

```What is the best score you can get on your validation and test segments? Did you get a better score? Can you now use the validation segment to estimate the real error? ```

```Bonus: the data for this exercise was uniquely generated, using MNIST (what? how???). Can you generate a similar dataset? What parameters control this problem? Explain how it can be done.```