# Instructions

1. Add your name and HW Group Number below.
2. Complete each question. Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE".
3. Where applicable, run the test cases *below* each question to check your work. **Note**: In addition to the test cases you can see, the instructor may run additional test cases, including using *other datasets*, to validate your code.
4. Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). You can also use the **Validate** button to run all test cases.
5. Turn in your homework by going to the main screen in JupyterHub, clicking the Assignments menu, and submitting. **Also** make sure to turn in your homework on Moodle (so we have a backup copy).



In [1]:
"""
Name: Sagnik
HW Group Number: 32 
"""

## Homework 1: Problem 2: Discretization

### Import all necessary libraries

In [2]:
# you should be familiar with numpy from HW0
import numpy as np
import pandas as pd

## 2.1 Equal-width discretization

In the following exercises, you will be writing code to discretize a defined numpy vector into equal-width bins (no other libraries are allowed). The $n$ bins should all be of size $(max - min) / n$. If a value falls directly on a break, you should put it in the higher bin.
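
To make the break points concrete, the following sketch computes the bin width and the bin edges for the test vector used further down. It is only an illustration, not the required solution; the names `bin_size` and `edges` are chosen here for the example.

```python
import numpy as np

v = np.array([1, 6, 13, 40, 56, 7, 23, 43])
n = 5

# Each bin has width (max - min) / n.
bin_size = (v.max() - v.min()) / n

# n bins need n + 1 break points, evenly spaced from min to max.
edges = np.linspace(v.min(), v.max(), n + 1)

print(bin_size)  # 11.0
print(edges)     # break points at 1, 12, 23, 34, 45, 56
```

A value such as 12 or 23 sits exactly on a break, so under the rule above it belongs to the bin that starts at that break.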

In [3]:
def equal_width(v, n):
    """
    Input: v: the vector to be discretized (np.array).
           n: is the number of equal-width bins (intervals) for discretization.
    Output: an np.array of the same length as v, where each item indicates the
            bin number (int value) of the corresponding item in v, starting at 0 for the
            first bin.
    Allowed Libraries: None (other than np).
    """
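    assert n > 0
    v = np.asarray(v)
    min_val = np.amin(v)
    max_val = np.amax(v)
    bin_size = (max_val - min_val) / n

    # Degenerate case: every value is identical, so everything goes in bin 0.
    if bin_size == 0:
        return np.zeros(len(v), dtype=int)

    # Shift by min_val so the first bin starts at the minimum. Floor division
    # puts a value that falls exactly on a break into the higher bin.
    bins = ((v - min_val) // bin_size).astype(int)
    # The maximum sits on the final break, so clamp it into the last bin.
    return np.minimum(bins, n - 1)

In [4]:
# Test your function!
v = np.array([1, 6, 13, 40, 56, 7, 23, 43])
# Bin size should be 11 = (56-1)/5, so we get [1, 12), [12, 23), etc.
count = equal_width(v, 5)
count

array([0, 0, 1, 3, 4, 0, 2, 3])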

In [5]:
v = np.array([1, 6, 13, 40, 56, 7, 23, 43])
# Note, the "or" here is to support the original, incorrect test cases
assert np.array_equal(equal_width(v, 5), np.array([0, 0, 1, 3, 4, 0, 2, 3])) or \
       np.array_equal(equal_width(v, 5), np.array([0, 0, 1, 3, 5, 0, 2, 3]))
assert np.array_equal(equal_width(v, 3), np.array([0, 0, 0, 2, 2, 0, 1, 2])) or \
       np.array_equal(equal_width(v, 3), np.array([0, 0, 0, 2, 3, 0, 1, 2]))
assert np.array_equal(equal_width(v, 1), np.array([0, 0, 0, 0, 0, 0, 0, 0])) or \
       np.array_equal(equal_width(v, 1), np.array([0, 0, 0, 0, 1, 0, 0, 0]))
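
As an optional cross-check of the expected bin numbers, the sketch below gets the same assignment from `np.digitize`. It assumes the edges come out exact, as they do for this vector; with other data, floating-point rounding in the edges could shift a value that lies very close to a break. The names `edges` and `bins` are illustrative.

```python
import numpy as np

v = np.array([1, 6, 13, 40, 56, 7, 23, 43])
n = 5

edges = np.linspace(v.min(), v.max(), n + 1)

# Only the interior break points are passed, so the minimum maps to bin 0
# and the maximum maps to bin n - 1 without extra clamping.
# With the default right=False, a value equal to a break goes to the higher bin.
bins = np.digitize(v, edges[1:-1])

print(bins)  # [0 0 1 3 4 0 2 3]
```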

## 2.2 Equal-depth discretization

In the following exercises, you will be writing code to discretize a defined numpy vector into equal-depth bins (each bin has the same number of elements). This time, you will use pandas' `cut` and `qcut` functions to do it.

In [6]:
# Is the cut function using equal-width or equal-frequency?
pd.cut([1, 1, 1, 2, 3, 4, 100], 2, labels=["low", "high"])

['low', 'low', 'low', 'low', 'low', 'low', 'high']
Categories (2, object): ['low' < 'high']

In [7]:
# What about the qcut function?
pd.qcut([1, 1, 1, 2, 3, 4, 100], 2, labels=["low", "high"])

['low', 'low', 'low', 'low', 'high', 'high', 'high']
Categories (2, object): ['low' < 'high']

In [8]:
# You can use these functions without "labels" as well:
pd.qcut([1, 1, 1, 2, 3, 4, 100], 2, labels=False)

array([0, 0, 0, 0, 1, 1, 1])
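
One way to see the difference between the two functions is to count how many items each one puts in every bin. The sketch below does this for the same list as the cells above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = [1, 1, 1, 2, 3, 4, 100]

# pd.cut splits the value range [min, max] into equal-width intervals.
width_bins = pd.cut(data, 2, labels=False)
# pd.qcut places the breaks at quantiles, so bins hold similar counts.
depth_bins = pd.qcut(data, 2, labels=False)

# Count how many items land in each bin.
width_counts = np.bincount(width_bins, minlength=2)
depth_counts = np.bincount(depth_bins, minlength=2)

print(width_counts)  # [6 1]
print(depth_counts)  # [4 3]
```

The outlier 100 stretches the range, so `pd.cut` leaves almost everything in the first bin, while `pd.qcut` keeps the counts as balanced as the ties allow.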

In [9]:
def equal_depth(v, n):
    """
    Input: v: the vector to be discretized (np.array).
           n: is the number of equal-frequency bins for discretization.
    Output: an array of the same length as v, where each item indicates the
            bin number (int value) of the corresponding item in v, starting at 0 for the
            first bin.
    Allowed Libraries: pd.cut(), pd.qcut() functions
    """
    
    return pd.qcut(v, n, labels=False)

In [10]:
# Test your function!
v = np.array([1,6,13,40,56,7,23, 43])
bin_index = equal_depth(v, 4)
bin_index

array([0, 0, 1, 2, 3, 1, 2, 3])

In [11]:
np.testing.assert_array_equal(equal_depth(np.array([1,6,13,40,56,7,23,43]),4),np.array([0, 0, 1, 2, 3, 1, 2, 3]))
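
Since other datasets may be used for validation, it can help to know how `pd.qcut` behaves when many values are tied. The sketch below uses a made-up vector (`tied`) in which several quantile edges coincide; whether the assignment expects such inputs to be handled is an assumption, not something stated above.

```python
import pandas as pd

tied = [1, 1, 1, 1, 2]

# With the default duplicates="raise", repeated quantile edges are an error.
try:
    pd.qcut(tied, 4, labels=False)
    raised = False
except ValueError:
    raised = True

# duplicates="drop" merges the repeated edges, giving fewer bins than requested.
merged = pd.qcut(tied, 4, labels=False, duplicates="drop")

print(raised)
print(merged)
```

Here only the edges 1 and 2 remain after dropping duplicates, so every item ends up in bin 0 even though four bins were requested.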

**Remember**: Make sure to complete all problems (.ipynb files) in this assignment. When you finish, double-check the submission instructions at the top of this file, and submit on JupyterHub and Moodle.