# Module 4 Guidance

This notebook is a template for modules 4b and 4c. It will be tested in Google Colab, so your code needs to run there.
The structure is provided to improve consistency and make it easier for markers to understand your code, while still giving you the flexibility to be creative. You need to populate the required functions to solve this problem. All dependencies should be documented in the next cell.

You can:
    add further cells or text blocks to extend or further explain your solution
    add further functions

Don't:
    rename functions
   

In [128]:
# Fixed dependencies - do not remove or change.
import pytest
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from google.colab import drive
drive.mount('/content/gdrive/')
# Import your dependencies




Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [129]:
# Import data
def import_local_data(file_path):
    """Import the data file into Colab and return a pandas DataFrame.
    """
    raw_df = pd.read_csv(file_path)
    return raw_df

In [130]:
local_file_path = "/content/gdrive/MyDrive/Colab Notebooks/breast-cancer.csv"

In [131]:
# Don't change
raw_data = import_local_data(local_file_path)

### Conduct exploratory data analysis and explain your key findings - Examine the data, explain its key features and what they look like.  Highlight any fields that are anomalous.

In [132]:
raw_data.head()

Unnamed: 0,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,Class
0,40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events
1,50-59,ge40,15-19,0-2,no,1,right,central,no,no-recurrence-events
2,50-59,ge40,35-39,0-2,no,2,left,left_low,no,recurrence-events
3,40-49,premeno,35-39,0-2,yes,3,right,left_low,yes,no-recurrence-events
4,40-49,premeno,30-34,03-May,yes,2,left,right_up,no,recurrence-events


In [133]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          286 non-null    object
 1   menopause    286 non-null    object
 2   tumor-size   286 non-null    object
 3   inv-nodes    286 non-null    object
 4   node-caps    286 non-null    object
 5   deg-malig    286 non-null    int64 
 6   breast       286 non-null    object
 7   breast-quad  286 non-null    object
 8   irradiat     286 non-null    object
 9   Class        286 non-null    object
dtypes: int64(1), object(9)
memory usage: 22.5+ KB


In [134]:
for column in raw_data:
    unique_val = raw_data[column].unique()
    print(unique_val)

['40-49' '50-59' '60-69' '30-39' '70-79' '20-29']
['premeno' 'ge40' 'lt40']
['15-19' '35-39' '30-34' '25-29' '40-44' 'Oct-14' '0-4' '20-24' '45-49'
 '50-54' '05-Sep']
['0-2' '03-May' '15-17' '06-Aug' '09-Nov' '24-26' 'Dec-14']
['yes' 'no' '?']
[3 1 2]
['right' 'left']
['left_up' 'central' 'left_low' 'right_up' 'right_low' '?']
['no' 'yes']
['recurrence-events' 'no-recurrence-events']


In [135]:
question_mark_counts = (raw_data == '?').sum()
print(question_mark_counts)

age            0
menopause      0
tumor-size     0
inv-nodes      0
node-caps      8
deg-malig      0
breast         0
breast-quad    1
irradiat       0
Class          0
dtype: int64


age - ordinal (ordered ranges)
menopause - categorical
tumor-size - ordinal (ordered ranges), inconsistent formatting: date-like values (e.g. 'Oct-14') need cleaning back into number ranges
inv-nodes - ordinal (ordered ranges), inconsistent formatting: date-like values (e.g. '03-May') need cleaning back into number ranges
node-caps - binary, 8 missing values recorded as '?'; remove these rows as only a small number are affected
deg-malig - categorical (integers 1, 2, 3)
breast - binary
breast-quad - categorical, 1 missing value recorded as '?'; remove this row
irradiat - binary
Class - categorical (two values), convert values to 1 and 0
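
The date-like values in the notes above ('Oct-14', '05-Sep', '03-May') look like numeric ranges that a spreadsheet reformatted as dates. That cause is an assumption; the data itself does not confirm it. Under that assumption, a generic repair could be sketched as follows, where `repair_range` and `MONTH_NUMBERS` are hypothetical names that are not part of the template:

```python
# Hypothetical helper: assumes date-like labels are ranges whose numbers were
# turned into English month abbreviations (e.g. 'Oct-14' was '10-14').
MONTH_NUMBERS = {
    'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
    'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12,
}

def repair_range(value):
    """Return 'low-high' with month abbreviations converted back to numbers."""
    parts = value.split('-')
    if len(parts) != 2:
        return value  # not a two-part range label (e.g. '?'), leave untouched
    numbers = []
    for part in parts:
        if part in MONTH_NUMBERS:
            numbers.append(MONTH_NUMBERS[part])
        elif part.isdigit():
            numbers.append(int(part))
        else:
            return value  # unrecognised token, leave the label untouched
    return f"{numbers[0]}-{numbers[1]}"

for raw in ['Oct-14', '05-Sep', '03-May', '06-Aug', '09-Nov', 'Dec-14', '15-19']:
    print(raw, '->', repair_range(raw))
```

Applied with `raw_data['tumor-size'].apply(repair_range)`, this would avoid listing each mangled value by hand, at the cost of relying on the month-abbreviation assumption.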

In [136]:
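# cleaning
def clean_tumor(value):
    # Restore tumor-size ranges that were stored as date-like strings
    if value == 'Oct-14':
        return '10-14'
    elif value == '05-Sep':
        return '5-9'
    else:
        return value
raw_data['tumor-size'] = raw_data['tumor-size'].apply(clean_tumor)

def clean_inv_nodes(value):
    # Restore inv-nodes ranges that were stored as date-like strings
    if value == '03-May':
        return '3-5'
    elif value == '06-Aug':
        return '6-8'
    elif value == '09-Nov':
        return '9-11'
    elif value == 'Dec-14':
        return '12-14'
    else:
        return value
raw_data['inv-nodes'] = raw_data['inv-nodes'].apply(clean_inv_nodes)

def clean_node_caps(data):
    # Drop every row containing '?' in any column (affects node-caps and breast-quad).
    # .copy() avoids chained-assignment warnings when columns are modified later.
    data = data[~data.isin(["?"]).any(axis=1)].copy()
    return data
raw_data = clean_node_caps(raw_data)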

In [137]:
for column in raw_data:
    unique_val = raw_data[column].unique()
    print(unique_val)

['40-49' '50-59' '60-69' '30-39' '70-79' '20-29']
['premeno' 'ge40' 'lt40']
['15-19' '35-39' '30-34' '25-29' '40-44' '10-14' '0-4' '20-24' '45-49'
 '50-54' '5-9']
['0-2' '3-5' '15-17' '6-8' '9-11' '24-26' '12-14']
['yes' 'no']
[3 1 2]
['right' 'left']
['left_up' 'central' 'left_low' 'right_up' 'right_low']
['no' 'yes']
['recurrence-events' 'no-recurrence-events']


In [139]:
# Explain your key findings

Create any data pre-processing that you will conduct on seen and unseen data. Regardless of the model you use, this dataframe must contain only numeric features and have a strategy for any expected missing values. Any objects needed to handle the test data that depend on the training data can be stored in the model class. You are recommended to use sklearn Pipelines or similar functionality to ensure reproducibility.
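
As one possible reading of the requirement above, the sketch below builds an sklearn `Pipeline` inside a `ColumnTransformer` that imputes missing categories and one-hot encodes them. The column subset, the `mark_missing` helper, the most-frequent imputation strategy, and the unseen rows in `test_df` are all illustrative assumptions; the training rows are taken from the `raw_data.head()` output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_columns = ['age', 'node-caps', 'breast-quad']  # subset for brevity

# Fill missing categories with the most frequent training value, then one-hot encode.
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer(
    [('categorical', categorical_steps, categorical_columns)],
    remainder='passthrough',  # deg-malig is already numeric
    sparse_threshold=0,       # always return a dense array
)

train_df = pd.DataFrame({
    'age': ['40-49', '50-59', '50-59', '40-49'],
    'node-caps': ['yes', 'no', 'no', 'yes'],
    'breast-quad': ['left_up', 'central', 'left_low', 'left_low'],
    'deg-malig': [3, 1, 2, 3],
})
# Hypothetical unseen rows, including a '?' and categories absent from training.
test_df = pd.DataFrame({
    'age': ['60-69', '40-49'],
    'node-caps': ['?', 'no'],
    'breast-quad': ['right_up', 'central'],
    'deg-malig': [2, 1],
})

def mark_missing(df):
    """Hypothetical helper: turn the '?' placeholder into NaN for the imputer."""
    return df.replace('?', np.nan)

# Fit on training data only, then reuse the fitted object on unseen data.
train_features = preprocessor.fit_transform(mark_missing(train_df))
test_features = preprocessor.transform(mark_missing(test_df))
print(train_features.shape, test_features.shape)
```

Because the encoder is fitted once and then only applied, the test data always gets the same column layout as the training data, and categories never seen in training are encoded as all zeros.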

In [140]:
# Split your data so that you can test the effectiveness of your model
X = raw_data.drop('Class', axis=1)
y = raw_data['Class']
# Lowercase names are used because the later "Don't change" cells expect x_train and x_test
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [141]:
# Populate preprocess_training_data and preprocess_test_data to preprocess data.
# You must process test and train data separately so your model does not accidentally gain information it wouldn't have in reality and therefore produce overly optimistic predictions

In [144]:
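# Binary mapping (note: 'yes' is encoded as 0 and 'no' as 1)
# This updates raw_data only; the train/test frames split off earlier are unchanged.
binary_mapping = {'yes': 0, 'no': 1}
raw_data['node-caps'] = raw_data['node-caps'].map(binary_mapping)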

"""
# Function to extract the lower bound of the age range
def extract_lower_bound(age_range):
    return int(age_range.split('-')[0])

# Sort the age ranges by their lower bound
df['lower_bound'] = df['age_range'].apply(extract_lower_bound)
df = df.sort_values(by='lower_bound').reset_index(drop=True)

# Create ordinal mapping
df['age_range'] = range(1, len(df) + 1)  # Replace the original column with ordinal values

# Drop the helper column if not needed
df = df.drop(columns=['lower_bound'])

print(df.head())
"""


In [145]:
raw_data.head()

Unnamed: 0,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,Class
0,40-49,premeno,15-19,0-2,0,3,right,left_up,no,recurrence-events
1,50-59,ge40,15-19,0-2,1,1,right,central,no,no-recurrence-events
2,50-59,ge40,35-39,0-2,1,2,left,left_low,no,recurrence-events
3,40-49,premeno,35-39,0-2,0,3,right,left_low,yes,no-recurrence-events
4,40-49,premeno,30-34,3-5,0,2,left,right_up,no,recurrence-events
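
The age, tumor-size and inv-nodes columns hold ordered ranges, so one way to make them numeric is to rank each label by its lower bound, which is what the disabled snippet in the earlier cell appears to aim for. A minimal sketch, assuming every label has the form 'low-high'; `lower_bound` and `build_ordinal_mapping` are hypothetical names:

```python
def lower_bound(range_label):
    """Return the integer lower bound of a 'low-high' label."""
    return int(range_label.split('-')[0])

def build_ordinal_mapping(labels):
    """Map each distinct label to its rank when sorted by lower bound."""
    ordered = sorted(set(labels), key=lower_bound)
    return {label: rank for rank, label in enumerate(ordered)}

# Unique age values as printed by the exploration cell above
age_labels = ['40-49', '50-59', '60-69', '30-39', '70-79', '20-29']
age_mapping = build_ordinal_mapping(age_labels)
print(age_mapping)
# {'20-29': 0, '30-39': 1, '40-49': 2, '50-59': 3, '60-69': 4, '70-79': 5}
```

The mapping could then be applied with `raw_data['age'].map(age_mapping)`. Building it from the training data only means a range that first appears in unseen data would map to NaN and needs its own handling.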


In [None]:
class Module4_Model:

    def __init__(self):
        self.model = None

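    def preprocess_training_data(self, training_df):
        """
        This function should process the training data and store any features required in the class
        """
        # Placeholder so the cell runs: replace with your own preprocessing
        processed_df = training_df.copy()
        return processed_df

    def preprocess_test_data(self, test_df):
        """
        This function should process the test data using only what was stored from the training data
        """
        # Placeholder so the cell runs: replace with your own preprocessing
        processed_df = test_df.copy()
        return processed_df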
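
One way the two methods could store training-derived state is sketched below, using pandas only. `SketchPreprocessor` is a hypothetical stand-in for the class above, and the choices it makes (most frequent value for '?', one-hot encoding via `pd.get_dummies`) are assumptions, not the required solution. The training rows come from the `raw_data.head()` output; the test rows are invented for illustration.

```python
import numpy as np
import pandas as pd

class SketchPreprocessor:
    """Hypothetical stand-in for Module4_Model's two preprocessing methods."""

    def __init__(self):
        self.fill_values = None      # most frequent training value per column
        self.feature_columns = None  # encoded column layout seen in training

    def preprocess_training_data(self, training_df):
        marked = training_df.replace('?', np.nan)
        self.fill_values = marked.mode().iloc[0]
        encoded = pd.get_dummies(marked.fillna(self.fill_values), dtype=int)
        self.feature_columns = encoded.columns
        return encoded

    def preprocess_test_data(self, test_df):
        marked = test_df.replace('?', np.nan).fillna(self.fill_values)
        encoded = pd.get_dummies(marked, dtype=int)
        # Align to the training layout: unseen categories are dropped,
        # categories absent from the test data become all-zero columns.
        return encoded.reindex(columns=self.feature_columns, fill_value=0)

train_df = pd.DataFrame({
    'age': ['40-49', '50-59', '50-59', '40-49', '40-49'],
    'node-caps': ['yes', 'no', 'no', 'yes', 'yes'],
    'deg-malig': [3, 1, 2, 3, 2],
    'breast-quad': ['left_up', 'central', 'left_low', 'left_low', 'right_up'],
})
# Hypothetical unseen rows with a missing value and an unseen age range.
test_df = pd.DataFrame({
    'age': ['60-69', '40-49'],
    'node-caps': ['?', 'no'],
    'deg-malig': [2, 1],
    'breast-quad': ['central', 'left_up'],
})

sketch = SketchPreprocessor()
train_out = sketch.preprocess_training_data(train_df)
test_out = sketch.preprocess_test_data(test_df)
print(list(test_out.columns) == list(train_out.columns))
```

Nothing in `preprocess_test_data` is learned from the test rows themselves, which is the property the template asks for when it says train and test must be processed separately.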



In [None]:
raw_data.head()

In [None]:
# Don't change
my_model = Module4_Model()

In [None]:
# Don't change
x_train_processed = my_model.preprocess_training_data(x_train)

In [None]:
# Create a model

In [None]:
# Don't change
x_test_processed = my_model.preprocess_test_data(x_test)

In [None]:
# Train your model

In [None]:
# use your model to make a prediction on unseen data

In [None]:
# Assess the accuracy of your model and explain your key findings
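
When assessing accuracy, it can help to compare the score against a majority-class baseline, since a model that always predicts the most common class can already look accurate on imbalanced labels. A minimal sketch with toy labels follows; the labels are illustrative only and are not results from the breast-cancer data.

```python
from collections import Counter
from sklearn.metrics import accuracy_score

# Toy labels for illustration only; not results from the breast-cancer data.
y_true = ['no-recurrence-events'] * 7 + ['recurrence-events'] * 3
y_pred = (['no-recurrence-events'] * 6 + ['recurrence-events']
          + ['recurrence-events'] * 2 + ['no-recurrence-events'])

model_accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the most common class in y_true.
majority_count = Counter(y_true).most_common(1)[0][1]
baseline_accuracy = majority_count / len(y_true)

print(f"model: {model_accuracy:.2f}, baseline: {baseline_accuracy:.2f}")
# model: 0.80, baseline: 0.70
```

In the notebook itself, `y_true` would be `y_test` and `y_pred` the predictions made on `x_test_processed`.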

### Unit tests:

#### Checking training and test data for null values. This will work for both pd dataframes and np arrays, and ensures no null values exist.

In [None]:
def test_no_nulls(data):
    """ Assert no null values within pd dataframe or np array """

    # if data is numpy array, handle accordingly
    if isinstance(data, np.ndarray):
        assert not np.isnan(np.min(data))

    # if not np array, assume data is pandas dataframe
    else:
        assert data.isna().sum().sum() == 0

In [None]:
# run null data unit test on both training and test data
test_no_nulls(x_train_processed)
test_no_nulls(x_test_processed)
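
The template also requires that the processed data contain only numeric features, which the null check alone does not verify. A companion check could be sketched as below; `check_all_numeric` is a hypothetical helper, not part of the provided unit tests.

```python
import numpy as np
import pandas as pd

def check_all_numeric(data):
    """Assert every feature is numeric, for a pandas DataFrame or a numpy array."""
    if isinstance(data, np.ndarray):
        assert np.issubdtype(data.dtype, np.number), f"Non-numeric dtype: {data.dtype}"
    else:
        non_numeric = [
            column for column in data.columns
            if not pd.api.types.is_numeric_dtype(data[column])
        ]
        assert not non_numeric, f"Non-numeric columns: {non_numeric}"

# Both of these pass silently; a frame with a string column would raise AssertionError.
check_all_numeric(np.array([[0.0, 1.0], [1.0, 0.0]]))
check_all_numeric(pd.DataFrame({'deg-malig': [3, 1, 2]}))
```

It would be called in the same way as the null check, for example `check_all_numeric(x_train_processed)`.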