# Part One of the Course Project

In this project you will split a sample `X` into smaller subsamples, which we will call the training, validation, and testing samples. We will later learn why these subsamples are needed and how a machine learning model can be trained on the training sample and then validated or tested on the other samples.

Here a sample `X` can be any *iterable*, that is, any object that allows iteration over its elements: a range, a list, a tuple, a string, a NumPy array, and so on. The elements can themselves be iterables or even more complex data structures.
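
As a small illustration of what "iterable" means here, the sketch below loops over several built-in iterables and materializes each one as a list. The helper name `to_list` is hypothetical and used only for this example.

```python
def to_list(X):
    """Materialize any iterable as a list (hypothetical helper for illustration)."""
    return [element for element in X]

# A range, a list, a tuple, a string, and a list whose elements are themselves iterables
samples = [range(3), [10, 20], ('a', 'b'), 'hi', [[1, 2], [3, 4]]]
for X in samples:
    print(type(X).__name__, to_list(X))
```

Converting a sample with `list()` in this way is one option when a function needs `len()` or indexing, which not every iterable supports.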

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete this part of the course project. 

In [1]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import pandas as pd, numpy as np, nltk
from numpy.testing import assert_equal as eq
from sklearn.model_selection import train_test_split
pd.set_option('max_colwidth', 100, 'display.max_rows', 10)
import unittest
from colorunittest import run_unittest

_ = nltk.download(info_or_id=['names'], quiet=True)
LsM = nltk.corpus.names.words(fileids='male.txt')
LsF = nltk.corpus.names.words(fileids='female.txt')
LsF = [n for n in LsF if n not in LsM] # for simplicity: remove ~360 female names found among the male names
print(f'{len(LsM)} male names:  ', LsM[:10])
print(f'{len(LsF)} female names:', LsF[:10])

2943 male names:   ['Aamir', 'Aaron', 'Abbey', 'Abbie', 'Abbot', 'Abbott', 'Abby', 'Abdel', 'Abdul', 'Abdulkarim']
4636 female names: ['Abagael', 'Abagail', 'Abbe', 'Abbi', 'Abigael', 'Abigail', 'Abigale', 'Abra', 'Acacia', 'Ada']


# Task 1

Your first task is to create a simple wrapper function around `train_test_split()` to better understand the existing powerful and convenient tools for splitting samples. Typically, the samples are randomly shuffled, but we will avoid this so that your results are deterministic and can be easily checked. Complete the function `tts_60_40()` after carefully reading its [docstring](https://www.python.org/dev/peps/pep-0257/) and evaluating the examples in the test code cell that follows, which tests your function's output.

Note: Docstrings are descriptive string literals that appear as the first statement in a function body, right after the `def` line. Often, docstrings are multi-line strings quoted with `'''` on each side, as in the example below. In the notebook, all red text is a docstring. Typically, it describes what the function does, what it takes as arguments, and what it outputs.
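
To make the note concrete, here is a minimal sketch of a function with a docstring and one way to read it back at run time. The function `halve` is a hypothetical example and not part of the project.

```python
def halve(x):
    '''Returns half of x.
    Inputs:  x: a number
    Returns: x / 2 '''
    return x / 2

# The docstring is stored on the function object and is also what help() displays
print(halve.__doc__)
```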

In [2]:
def tts_60_40(X:'iterable'=range(10)) -> [[],[]]:
    ''' Splits a list X into 60% train sample and 40% validation sample without shuffling.
        Use SKLearn's train_test_split() function with appropriate parameters in your implementation.
    Inputs:       X: iterable of observations to partition
    Returns: tX, vX: train and validation sets '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return tX, vX

In [3]:
def tts_60_40(X:'iterable'=range(10)) -> [[],[]]:
    # shuffle=False keeps the original order; test_size=0.4 leaves 60% for training
    tX, vX = train_test_split(X, test_size=0.4, shuffle=False)
    return tX, vX

In [4]:

tts_60_40()

([0, 1, 2, 3, 4, 5], [6, 7, 8, 9])
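
Because no shuffling is involved, the split above amounts to cutting the sample at one position. The sketch below reproduces that idea with plain slicing. The name `split_60_40_by_slicing` is hypothetical, and the rounding rule (the validation part is rounded up) is an assumption chosen to agree with the sample sizes used in the tests, such as 1765 and 1178 for 2943 names.

```python
def split_60_40_by_slicing(X):
    """Hypothetical pure-Python counterpart of an unshuffled 60/40 split."""
    X = list(X)
    n = len(X)
    n_val = -(-2 * n // 5)   # ceiling of 0.4 * n, computed with integer arithmetic
    n_train = n - n_val      # the training part gets whatever remains
    return X[:n_train], X[n_train:]

print(split_60_40_by_slicing(range(10)))  # ([0, 1, 2, 3, 4, 5], [6, 7, 8, 9])
```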

Use the following test cases as examples to debug your function.

In [5]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_tts_60_40(unittest.TestCase):
    def test_00(self): eq(tts_60_40(),                             ([0, 1, 2, 3, 4, 5], [6, 7, 8, 9]))
    def test_01(self): eq([len(x) for x in tts_60_40(range(100))], [60, 40])
    def test_02(self): eq([len(x) for x in tts_60_40(LsM)],        [1765, 1178])
    def test_03(self): eq([len(x) for x in tts_60_40(LsF)],        [2781, 1855])
    def test_04(self): eq(tts_60_40(LsF)[1][:5],                   ['Libbey', 'Libbi', 'Libbie', 'Libby', 'Licha'])

Ran 5 tests in 0.005s

OK
test_00 (__main__.test_tts_60_40) ... ok
test_01 (__main__.test_tts_60_40) ... ok
test_02 (__main__.test_tts_60_40) ... ok
test_03 (__main__.test_tts_60_40) ... ok
test_04 (__main__.test_tts_60_40) ... ok

----------------------------------------------------------------------



# Task 2

Next, complete a relatively simple `Split2()` function below.

In [6]:
def Split2(X:'iterable'=range(10), j=2) -> ([],[]):
    ''' Splits a list X into training and validation lists, tX and vX, as follows:
        Elements with indices divisible by j are added to vX. Others go to tX.
        If j is not in the range of integers from 1 to len(X), reset it to 2 (for even split).
        Hint: You can use the modulo operation, %, to check divisibility. See documentation.
    Inputs: 
        X: iterable containing observations to partition
        j: positive integer index so that X is split at each jᵗʰ element into training and validation lists
    Returns:
        tX, vX: two lists containing all elements of X with vX containing each jᵗʰ element of X 
                and tX containing the rest of the elements (without changing their original order)
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return tX, vX

In [7]:
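def Split2(X:'iterable'=range(10), j=2) -> ([],[]):
    X = list(X)  # allow len() and indexing for any iterable

    # Reset j unless it is an integer from 1 to len(X)
    if not isinstance(j, (int, np.integer)) or j < 1 or j > len(X):
        j = 2
    
    tX = []  # Training list
    vX = []  # Validation list

    for index in range(len(X)):
        if index % j == 0:
            vX.append(X[index])  # Add to validation set
        else:
            tX.append(X[index])  # Add to training set
    
    return tX, vX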
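
The same divisibility rule can also be sketched with `enumerate()`, which supplies the index while iterating and therefore also works for iterables that support neither `len()` nor indexing, such as generators. The name `split2_enumerate` is hypothetical, and the sketch assumes `j` is already a valid positive integer (it has no failsafe).

```python
def split2_enumerate(X, j=2):
    """Hypothetical variant of the split: indices divisible by j go to vX, others to tX."""
    tX, vX = [], []
    for index, element in enumerate(X):
        if index % j == 0:
            vX.append(element)   # validation list
        else:
            tX.append(element)   # training list
    return tX, vX

print(split2_enumerate(range(10), 2))              # ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8])
print(split2_enumerate((c for c in 'abcdef'), 3))  # (['b', 'c', 'e', 'f'], ['a', 'd'])
```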


In [8]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_Split2(unittest.TestCase):
    def test_00(self): eq(Split2(),                 ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8]))
    def test_01(self): eq(Split2([]),               ([],[]))
    def test_02(self): eq(Split2([1]),              ([],[1]))
    def test_03(self): eq(Split2(j=0.1),            ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8]))
    def test_04(self): eq(Split2(j=-1.7),           ([1, 3, 5, 7, 9], [0, 2, 4, 6, 8]))
    def test_05(self): eq(Split2(np.arange(12), 3), ([1, 2, 4, 5, 7, 8, 10, 11], [0, 3, 6, 9]))
    def test_06(self): eq(Split2(range(-10,10), 4), ([-9, -8, -7, -5, -4, -3, -1, 0, 1, 3, 4, 5, 7, 8, 9], [-10, -6, -2, 2, 6]))
    def test_07(self): eq([len(x) for x in Split2(LsM, 4)], [2207, 736])
    def test_08(self): eq(Split2(LsM, 4)[0][:5], ['Aaron', 'Abbey', 'Abbie', 'Abbott', 'Abby'])

Ran 9 tests in 0.005s

OK
test_00 (__main__.test_Split2) ... ok
test_01 (__main__.test_Split2) ... ok
test_02 (__main__.test_Split2) ... ok
test_03 (__main__.test_Split2) ... ok
test_04 (__main__.test_Split2) ... ok
test_05 (__main__.test_Split2) ... ok
test_06 (__main__.test_Split2) ... ok
test_07 (__main__.test_Split2) ... ok
test_08 (__main__.test_Split2) ... ok

----------------------------------------------------------------------



# Task 3

Next, complete a `Split3()` function below. Although it seems more complicated than `Split2()`, it is actually easier, if you apply `Split2()` twice. 
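
The "apply a two-way split twice" idea can be sketched as follows. Both names, `every_jth` and `split3_sketch`, are hypothetical, and the sketch assumes `i` and `j` are valid positive integers, so it omits the failsafe that the real function needs.

```python
def every_jth(X, j):
    """Hypothetical helper: returns (rest, picked), where picked holds the
    elements of X at 0-based indices divisible by j."""
    X = list(X)
    picked = X[::j]
    rest = [x for index, x in enumerate(X) if index % j != 0]
    return rest, picked

def split3_sketch(X, i=2, j=3):
    tvX, sX = every_jth(X, j)    # first pass: set aside the testing sample
    tX, vX = every_jth(tvX, i)   # second pass: split the remainder
    return tX, vX, sX

print(split3_sketch(range(10)))  # ([2, 5, 8], [1, 4, 7], [0, 3, 6, 9])
```

Note that the second pass counts positions within the remainder `tvX`, not within the original `X`.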

In [9]:
def Split3(X:'iterable'=range(10), i=2, j=3) -> ([],[],[]):
    '''Splits a list X into training, validation, and testing lists, tX, vX, sX, as follows:
        All jᵗʰ elements are collected in sX and the remaining elements go to tvX (train and validation).
        Then we split tvX so that all iᵗʰ elements go to vX and the rest go to tX.
        If i or j is not in the range of integers from 1 to len(X), reset it to 2 (for even split).
        Hint: consider applying Split2() twice.
    Inputs: 
        X: iterable containing observations to partition
        i: positive integer index so that tvX is split at each iᵗʰ element into training and validation lists
        j: positive integer index so that X is split at each jᵗʰ element into the testing list and tvX
    Returns:
        tX, vX, sX: three lists containing all elements of X with sX containing each jᵗʰ element of X,  
                vX containing each iᵗʰ element of tvX, and tX containing the rest of the elements 
                (without changing their original order)
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return tX, vX, sX

In [10]:
def Split3(X:'iterable'=range(10), i=2, j=3) -> ([],[],[]):
    # Ensure i and j are within the valid range
    if i < 1 or i > len(X):
        i = 2
    if j < 1 or j > len(X):
        j = 2

    # Step 1: Split into testing set (sX) and temporary set (tvX)
    sX = []   # Testing list
    tvX = []  # Temporary list for training and validation

    for index in range(len(X)):
        if index % j == 0:        # 0-based index divisible by j, as in Split2()
            sX.append(X[index])   # Add to testing set
        else:
            tvX.append(X[index])  # Add to temporary set

    # Step 2: Use Split2() to get training (tX) and validation (vX) from tvX
    tX, vX = Split2(tvX, i)
    return tX, vX, sX

In [11]:
def Split3(X:'iterable'=range(10), i=2, j=3) -> ([],[],[]):
    # Convert X to a list to allow indexing
    X = list(X)
    
    # Validate indices
    if not (1 <= i <= len(X)):
        i = 2
    if not (1 <= j <= len(X)):
        j = 2
    
    # Step 1: Split into sX and tvX (0-based indices divisible by j go to sX)
    sX = [X[k] for k in range(0, len(X), j)]              # Every j-th element
    tvX = [X[k] for k in range(len(X)) if k % j != 0]     # Remaining elements
    
    # Step 2: Split tvX into vX and tX (0-based indices divisible by i go to vX)
    vX = [tvX[k] for k in range(0, len(tvX), i)]          # Every i-th element
    tX = [tvX[k] for k in range(len(tvX)) if k % i != 0]  # Remaining elements
    
    return tX, vX, sX

In [12]:
def Split3(X:'iterable'=range(10), i=2, j=3) -> ([],[],[]):
    X = list(X)
    
    # Validate indices
    if not (1 <= i <= len(X)):
        i = 2
    if not (1 <= j <= len(X)):
        j = 2
        
    # Step 1: Split into sX and tvX (0-based indices divisible by j go to sX)
    sX = [X[k] for k in range(0, len(X), j)]           # Every j-th element
    tvX = [X[k] for k in range(len(X)) if k % j != 0]  # Remaining elements
    
    # Step 2: Use Split2 to split tvX; it returns (training, validation) in that order
    tX, vX = Split2(tvX, i)

    return tX, vX, sX

In [14]:
def Split3(X:'iterable'=range(10), i=2, j=3) -> ([],[],[]):

    if j < 1 or j > len(X):
        j = 2
    
    if i < 1 or i > len(X):
        i = 2
    
    sX = []  # Testing set
    tvX = []  # Temporary set for training and validation
    vX = []  # Validation set
    tX = []  # Training set

    # First loop to fill sX and tvX
    for index in range(len(X)):
        if index % j == 0:
            sX.append(X[index])  # Every j-th element goes to sX
        else:
            tvX.append(X[index])  # The rest go to tvX

    # Second loop to fill vX and tX from tvX
    for index in range(len(tvX)):
        if index % i == 0:
            vX.append(tvX[index])  # Every i-th element from tvX goes to vX
        else:
            tX.append(tvX[index])  # The rest go to tX

    return tX, vX, sX

In [15]:
Split3()

([2, 5, 8], [1, 4, 7], [0, 3, 6, 9])

In [16]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_Split3(unittest.TestCase):
    def test_00(self): eq(Split3(), ([2, 5, 8], [1, 4, 7], [0, 3, 6, 9]))
    def test_01(self): eq(Split3(i=3, j=3), ([2, 4, 7, 8], [1, 5], [0, 3, 6, 9]))
    def test_02(self): eq([len(x) for x in Split3(range(1,100), i=2, j=3)], [33, 33, 33])
    def test_03(self): eq([sum(x) for x in Split3(range(-100,100), i=2, j=4)], [50, -50, -100])
    def test_04(self): eq([len(x) for x in Split3(LsM, 4)], [1471, 491, 981])

Ran 5 tests in 0.003s

OK
test_00 (__main__.test_Split3) ... ok
test_01 (__main__.test_Split3) ... ok
test_02 (__main__.test_Split3) ... ok
test_03 (__main__.test_Split3) ... ok
test_04 (__main__.test_Split3) ... ok

----------------------------------------------------------------------



# Task 4

Next, complete the `KSplit()` function below. It is similar to SKLearn's [`KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) cross-validator, which splits a sample into `K` subsamples of approximately equal size. You will learn more about K Folds in later modules.

`KSplit()` splits `X` slightly differently. Instead of splitting `X` into blocks of contiguous observations, it cyclically redistributes elements among `K` lists. So, if `X=[0,1,2,3]` and `K=2`, then 2 lists are created, say, `L0` and `L1`. In the loop over elements of `X`, the 0th element is appended to `L0`, then the 1st element is appended to `L1`, then recycling of lists `L0` and `L1` begins. The 2nd element is appended to `L0` again and the 3rd element is appended to `L1` again. The trick is to create a list of `K` sublists first and then to use the [modulo operator](https://docs.python.org/3/reference/expressions.html#binary-arithmetic-operations), `%`, to decide on which of the sublists receives the next element of `X`. There are other ways to implement this requirement as well.
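
The difference between cyclic and block-wise distribution can be sketched in a few lines. Both function names are hypothetical. `cyclic_split` uses slicing with a step, which is one of the "other ways" mentioned above, and `block_split` is only a simple stand-in for splitting into contiguous blocks, not SKLearn's implementation. The sketch assumes `K` is a valid positive integer and has no failsafe.

```python
def cyclic_split(X, K):
    """Element at index i goes to sublist i % K (written here as a step-K slice)."""
    X = list(X)
    return [X[k::K] for k in range(K)]

def block_split(X, K):
    """Contiguous blocks; the first len(X) % K blocks receive one extra element."""
    X = list(X)
    size, extra = divmod(len(X), K)
    blocks, start = [], 0
    for k in range(K):
        stop = start + size + (1 if k < extra else 0)
        blocks.append(X[start:stop])
        start = stop
    return blocks

print(cyclic_split([0, 1, 2, 3], 2))  # [[0, 2], [1, 3]]
print(block_split([0, 1, 2, 3], 2))   # [[0, 1], [2, 3]]
```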

In [17]:
def KSplit(X:'iterable'=range(10), K=2):
    '''Splits X across K lists. Use the modulo operator to assign X[i] to the (i % K)ᵗʰ list.
    Implement failsafe: If K is not an integer in [1,...,len(X)-1], set K to 2 (for even split). For example, 
        if len(X) = 6, K should be an integer between 1 and 5, or else set K=2.
    Inputs: 
        X: iterable containing observations to partition
        K: positive integer specifying the number of resulting lists
    Returns:
        LSamples: list of samples
    '''
    # Hints:
    # Create a list of K empty lists named LSamples. Ex: if K = 3, LSamples = [[],[],[]].
    # There are many possible ways to populate the individual lists in LSamples.
    # The end result, for example, where X was a list of numbers = [0,1,2,3,4,5]
    # and the value for K was 3, LSamples would equal [[0,3],[1,4],[2,5]]. 
    # In this example, LSamples[0] = [0,3], LSamples[1] = [1,4], and LSamples[2] = [2,5].
    # Note that in this numeric example the first value of each sublist equals that sublist's index.
    # In other words, LSamples[0][0] = 0, LSamples[1][0] = 1, and LSamples[2][0] = 2.
    # If X was a list of letters = ['a','b','c','d','e','f'] and K = 3 then
    # LSamples would equal [['a','d'],['b','e'],['c','f']], and the value for 
    # LSamples[0][0] would be 'a', or the 0th index position of the 0th list. 
    # Using modulus and appending by some counting loop logic is one way, but not the only 
    # way to do this.
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return LSamples

In [36]:
def KSplit(X:'iterable'=range(10), K=2):
    # Convert X to a list so that len() and indexing are available
    X = list(X)

    # Failsafe: K must be an integer in [1, ..., len(X)-1]; otherwise use K=2
    if not isinstance(K, int) or K < 1 or K > len(X) - 1:
        K = 2

    # Create K empty lists
    LSamples = [[] for _ in range(K)]

    # Distribute elements cyclically: X[index] goes to sublist index % K
    for index in range(len(X)):
        LSamples[index % K].append(X[index])

    return LSamples

In [37]:
KSplit(range(3), 3)  # failsafe resets K to 2, since K must be at most len(X)-1

[[0, 2], [1]]

In [35]:
test = range(3)
for i in test:
    print(i)

0
1
2


In [38]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_KSplit(unittest.TestCase):
    def test_00(self): eq(KSplit([], 2),         [[], []])
    def test_01(self): eq(KSplit([1], 2),        [[1], []])
    def test_02(self): eq(KSplit(range(3), -1),  [[0, 2], [1]])
    def test_03(self): eq(KSplit(range(3), 1),   [[0, 1, 2]])
    def test_04(self): eq(KSplit(range(3), 1.5), [[0, 2], [1]])
    def test_05(self): eq(KSplit(range(3), 2),   [[0, 2], [1]])
    def test_06(self): eq(KSplit(range(3), 3),   [[0, 2], [1]]) # Failsafe kicks in with a default of K=2
    def test_07(self):  # two checks in one method, so that neither definition shadows the other
        eq(KSplit(range(4), 3),   [[0, 3], [1], [2]])
        eq(KSplit(range(3), 4),   [[0, 2], [1]]) # Failsafe kicks in with a default of K=2
    def test_08(self): eq([len(x) for x in KSplit(range(1000), 6)], [167, 167, 167, 167, 166, 166])
    def test_09(self): eq([len(x) for x in KSplit(LsM, 6)],         [491, 491, 491, 490, 490, 490])
    def test_10(self): eq(KSplit(LsM, 7)[0][-3:],['Zary', 'Zed', 'Zollie'])

Ran 11 tests in 0.004s

OK
test_00 (__main__.test_KSplit) ... ok
test_01 (__main__.test_KSplit) ... ok
test_02 (__main__.test_KSplit) ... ok
test_03 (__main__.test_KSplit) ... ok
test_04 (__main__.test_KSplit) ... ok
test_05 (__main__.test_KSplit) ... ok
test_06 (__main__.test_KSplit) ... ok
test_07 (__main__.test_KSplit) ... ok
test_08 (__main__.test_KSplit) ... ok
test_09 (__main__.test_KSplit) ... ok
test_10 (__main__.test_KSplit) ... ok

----------------------------------------------------------------------

