# Homework 1: Python for basic data analysis


Name: Thompson, Heath

Department: Computer Science




This homework aims to help you practice basic Python programming skills using the Breast Cancer Wisconsin dataset.

![breast image](breastimg.png)

| *Fig. 1. Cell nuclei in a breast histopathology image* | 
|---|
|Fine Needle Aspiration (FNA) biopsy: https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/breast-biopsy/fine-needle-aspiration-biopsy-of-the-breast.html|
|H&E stain: https://en.wikipedia.org/wiki/H%26E_stain|


Tasks:

    [Task 1](#section1)

    [Task 2](#section2)

    [Task 3](#section3)

    [Task 4](#section4)

    [Task 5](#section5)

## Dataset

    - Number of data samples: 569
    
    - Each data sample has 30 numeric features/attributes. The first 10 features are the mean values of each measurement, computed over all nuclei in an image
    
    - Class labels
        : 212 Malignant (0)
        : 357 Benign (1)
        
    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [1]:
import sklearn.datasets as ds
import numpy as np

In [2]:
breast_ds = ds.load_breast_cancer()
print('Data fields in breast_ds: \n', dir(breast_ds))

print('\n Dataset description:\n', breast_ds['DESCR'])

Data fields in breast_ds: 
 ['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']

 Dataset description:
 .. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst

In [3]:
# we are going to use the first 10 features in this assignment.
ftrs = breast_ds['data'][:, :10].tolist() # np array to list
tgt = (breast_ds['target']).tolist()

print('size of ftrs: ', len(ftrs), len(ftrs[0]))
print('size of tgt: ', len(tgt))

size of ftrs:  569 10
size of tgt:  569


## Task 1: Count and print out the number of malignant samples. 10 points <a id = "section1"/>

In [4]:
count = 0
for i in range(0, len(tgt)):
    if tgt[i] == 0:
        count = count+1

print("Number of Malignant Samples:", count)

Number of Malignant Samples: 212
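
For comparison, the same count can be obtained without an explicit index loop. This is only a minimal sketch: `labels` is a short hypothetical stand-in for `tgt` (0 = malignant, 1 = benign), not the real dataset.

```python
# Hypothetical stand-in for tgt: 0 = malignant, 1 = benign.
labels = [0, 1, 1, 0, 1, 0, 0]

# list.count returns how many elements equal the given value.
malignant_count = labels.count(0)

# Equivalent generator-based form, handy when the condition is more complex.
malignant_count_alt = sum(1 for label in labels if label == 0)

print("Number of malignant samples:", malignant_count)
```

With the real data, replacing `labels` by `tgt` should reproduce the loop-based count above.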


## Task 2: data search. 20 points.  <a id = "section2"/>

Let the user enter a sample index (1 to 569); your code should then print that sample's features and its class label.

Extra 5 points for dealing with abnormal input.


In [6]:
# tip: use the input() function and while loop.
# Valid inputs are the strings "1" to "569"; anything else is rejected.
valid = [str(i) for i in range(1, 570)]

while True:
    idx = input("Enter a sample idx from 1 to 569 or type \"quit\" to exit: ")
    if idx == 'quit':
        break

    elif idx in valid:
        idx = int(idx)
        # Print each of the 10 feature names with its value for this sample.
        for i in range(0,10):
            print(breast_ds['feature_names'][i], ':', ftrs[idx-1][i])
        # tgt holds 0 (malignant) or 1 (benign), which indexes target_names.
        print('class label:', breast_ds['target_names'][tgt[idx-1]])
    else:
        print("Error: Input must be an integer between 1 and 569")

Enter a sample idx from 1 to 569 or type "quit" to exit: 570
Error: Input must be an integer between 1 and 569
Enter a sample idx from 1 to 569 or type "quit" to exit: 569
mean radius : 7.76
mean texture : 24.54
mean perimeter : 47.92
mean area : 181.0
mean smoothness : 0.05263
mean compactness : 0.04362
mean concavity : 0.0
mean concave points : 0.0
mean symmetry : 0.1587
mean fractal dimension : 0.05884
class label: benign
Enter a sample idx from 1 to 569 or type "quit" to exit: heat
Error: Input must be an integer between 1 and 569
Enter a sample idx from 1 to 569 or type "quit" to exit: 45
mean radius : 13.17
mean texture : 21.81
mean perimeter : 85.42
mean area : 531.5
mean smoothness : 0.09714
mean compactness : 0.1047
mean concavity : 0.08259
mean concave points : 0.05252
mean symmetry : 0.1746
mean fractal dimension : 0.06177
class label: malignant
Enter a sample idx from 1 to 569 or type "quit" to exit: 0
Error: Input must be an integer between 1 and 569
Enter a sample idx fro
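
The abnormal-input handling can also be isolated in a small helper, which makes it testable without calling `input()`. This is a sketch under assumptions: `parse_sample_idx` and its `n_samples` parameter are hypothetical names, and the `int()` conversion is slightly more lenient than a string lookup (it accepts inputs such as `" 45 "` or `"045"`).

```python
def parse_sample_idx(text, n_samples=569):
    """Return the 0-based index for a valid 1-based input, or None otherwise.

    Hypothetical helper; n_samples defaults to the dataset size used here.
    """
    try:
        idx = int(text.strip())
    except ValueError:
        # Non-numeric input such as "heat" or "3.5" lands here.
        return None
    if 1 <= idx <= n_samples:
        return idx - 1
    # Integers outside 1..n_samples (for example 0 or 570) are rejected.
    return None


for raw in ["569", "570", "heat", "0"]:
    print(repr(raw), "->", parse_sample_idx(raw))
```

The interactive loop would then only need to check whether the helper returned `None` before indexing into `ftrs` and `tgt`.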

## Task 3. 30 points  <a id = "section3"/>

Task 3.1: Calculate and print out the mean, min and max values of the feature 'concave points' (column index 7) for all benign samples.

Tip: use a for loop.


In [None]:
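# Start min/max at +/- infinity so the first benign value always replaces them.
sumb = 0
minb = float('inf')
maxb = float('-inf')
countb = 0
for i in range(0, len(ftrs)):
    if tgt[i] == 1:  # benign
        elemb = ftrs[i][7]  # column 7 is 'mean concave points'
        countb = countb + 1
        sumb = sumb + elemb
        if elemb < minb:
            minb = elemb
        if elemb > maxb:
            maxb = elemb
print("Mean: ", sumb/countb)
print("Min: ", minb)
print("Max: ", maxb)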
    


Task 3.2: Calculate and print out the mean, min and max values of the feature 'concave points' for all malignant samples.


In [None]:
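# Start min/max at +/- infinity so the first malignant value always replaces them.
summ = 0
minm = float('inf')
maxm = float('-inf')
countm = 0
for i in range(0, len(ftrs)):
    if tgt[i] == 0:  # malignant
        elemm = ftrs[i][7]  # column 7 is 'mean concave points'
        countm = countm + 1
        summ = summ + elemm
        if elemm < minm:
            minm = elemm
        if elemm > maxm:
            maxm = elemm
print("Mean: ", summ/countm)
print("Min: ", minm)
print("Max: ", maxm)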
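
Tasks 3.1 and 3.2 differ only in the class label, so the shared logic can be factored into one function. The sketch below is illustrative only: `summarize`, `rows`, and `labels` are hypothetical names, and the numbers are made-up toy values rather than dataset results.

```python
# Hypothetical toy data: each row is one sample, and column 0 plays the role
# of the 'concave points' feature. Labels use 0 = malignant, 1 = benign.
rows = [[0.10], [0.02], [0.15], [0.03], [0.01]]
labels = [0, 1, 0, 1, 1]


def summarize(rows, labels, target_class, col):
    """Return (mean, min, max) of column `col` over samples of one class."""
    values = [row[col] for row, label in zip(rows, labels)
              if label == target_class]
    if not values:
        raise ValueError("no samples with the requested class label")
    return sum(values) / len(values), min(values), max(values)


benign_mean, benign_min, benign_max = summarize(rows, labels, 1, 0)
malignant_mean, malignant_min, malignant_max = summarize(rows, labels, 0, 0)

print("Benign:", benign_mean, benign_min, benign_max)
print("Malignant:", malignant_mean, malignant_min, malignant_max)
```

On the real data, the equivalent calls would presumably be `summarize(ftrs, tgt, 1, 7)` and `summarize(ftrs, tgt, 0, 7)`.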

## Task 4: count the number of benign samples that have 'concave points' values less than 0.17. 20 points  <a id = "section4"/>



In [None]:
count = 0
for i in range(0, len(ftrs)):
    if tgt[i] == 1:
        elem = ftrs[i][7]
        if elem < 0.17:
            count = count + 1
print("Number of Benign Samples that have concave point values less than 0.17: ", count)
        

## Task 5. 20 points <a id = "section5"/>

Define a function that calculates the Euclidean distance between any two given data samples
 

In [28]:
import math

def eucdist(vec1, vec2):
    """Return the Euclidean distance between two equal-length samples.

    Usage: eucdist(arr1, arr2), where arr1 and arr2 are Python lists or tuples.
    """
    if len(vec1) != len(vec2):
        raise ValueError("vec1 and vec2 must have the same length")
    # Accumulate the squared differences over every feature.
    sq_sum = 0
    for a, b in zip(vec1, vec2):
        sq_sum = sq_sum + (a - b)**2
    return math.sqrt(sq_sum)

#example usage below:
#eucdist(ftrs[0], ftrs[1])
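
As a cross-check, the Euclidean distance $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ is also available in the standard library as `math.dist` (Python 3.8 and later). The sketch below compares a manual computation against it; the vectors `p` and `q` are hypothetical values chosen so the result is easy to verify by hand.

```python
import math

# Hypothetical 2-D points forming a 3-4-5 right triangle with the origin.
p = [0.0, 0.0]
q = [3.0, 4.0]

# Manual computation: square root of the summed squared differences.
manual = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Standard-library equivalent (requires Python 3.8+).
builtin = math.dist(p, q)

print("manual: ", manual)
print("builtin:", builtin)
```

Both values should agree up to floating-point rounding, which gives a quick sanity check for any hand-written distance function.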