In [3]:
## Data Wrangling in Pandas

#[Pandas docs](https://pandas.pydata.org/)

#[Breast Cancer Wisconsin (Original) Data Set](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original))

## Data Exploration

Let us begin by reading in our dataset (csv file) into pandas and displaying the column names along with their data types. Also take a moment to view the entire dataset.

In the data we have the following columns as described by the source — Patient ID: id number, Clump Thickness: 1–10, Uniformity of Cell Size: 1–10, Uniformity of Cell Shape: 1–10, Marginal Adhesion: 1–10, Single Epithelial Cell Size: 1–10, Bare Nuclei: 1–10, Bland Chromatin: 1–10, Normal Nucleoli: 1–10, Mitoses: 1–10, Class: malignant or benign, Doctor name: 4 different doctors.

Based on this, we can assume that patient_id is a unique identifier, class is going to tell us whether the tumor is malignant (cancerous) or benign (not cancerous). The remaining columns are numeric medical descriptions of the tumor, except for the doctor_name which is a categorical feature.

_Things to keep in mind: if our goal is to predict whether a tumor is cancerous based on the remaining features, we will have to one-hot encode the categorical data and clean up the numerical data._

From our first output we see that bare_nuclei was read as an object data type although the description is numeric. Therefore we will need to change this.
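One way to make that change is `pd.to_numeric`. The sketch below uses a toy column rather than the real file, and it assumes the non-numeric entries are a placeholder such as `'?'`; check `df['bare_nuclei'].unique()` to see what the data actually contains.

```python
import pandas as pd

# Toy stand-in for the bare_nuclei column; the '?' placeholder is an assumption
raw = pd.DataFrame({"bare_nuclei": ["1", "10", "?", "4"]})

# errors="coerce" converts anything that cannot be parsed as a number into NaN
raw["bare_nuclei"] = pd.to_numeric(raw["bare_nuclei"], errors="coerce")

print(raw.dtypes)
print(raw["bare_nuclei"].isna().sum())
```

Coercing turns the placeholder into a regular missing value, so it can then be handled together with the other missing data.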

To verify that our data matches up with the source we can use the describe option in pandas:

```df.describe()```

This neatly summarizes some statistical data for all numerical columns. It seems that all of the numeric feature columns fall within the expected 1–10 range. For categorical data we can handle this by grouping values together:

```df.groupby(by=['class', 'doctor_name']).size()```
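As a small illustration of what this grouping returns, here is a sketch on a made-up four-row frame (the rows are invented for the example, not taken from the dataset). Calling `unstack` on the result is optional and simply lays the counts out as a class-by-doctor table.

```python
import pandas as pd

# Invented rows, only to show the shape of the result
toy = pd.DataFrame({
    "class": ["benign", "benign", "malignant", "benign"],
    "doctor_name": ["Dr. Doe", "Dr. Lee", "Dr. Doe", "Dr. Doe"],
})

# size() counts the rows in each (class, doctor_name) combination
counts = toy.groupby(by=["class", "doctor_name"]).size()
print(counts)

# Pivot the inner index level into columns; missing combinations become 0
table = counts.unstack(fill_value=0)
print(table)
```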

In [4]:
import pandas as pd
pd.options.display.max_columns =None
pd.options.display.max_rows =40

filename = 'data/breast_cancer_data.csv'

df = pd.read_csv(filename)

In [5]:
# Start off by actually looking at your data set
df

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Doe
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,Dr. Smith
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,Dr. Lee
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,Dr. Smith
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,Dr. Wong
...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,Dr. Lee
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Smith
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,Dr. Lee
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,Dr. Lee


In [6]:
# What is the size of our dataset?
df.shape

(699, 12)

In [7]:
# Here we see the column names and their data types
df.dtypes

patient_id                 int64
clump_thickness          float64
cell_size_uniformity     float64
cell_shape_uniformity      int64
marginal_adhesion          int64
single_ep_cell_size        int64
bare_nuclei               object
bland_chromatin          float64
normal_nucleoli          float64
mitoses                    int64
class                     object
doctor_name               object
dtype: object

In [8]:
# It's good to inspect your unique key identifier
df.nunique()

patient_id               645
clump_thickness           10
cell_size_uniformity      10
cell_shape_uniformity     10
marginal_adhesion         10
single_ep_cell_size       10
bare_nuclei               11
bland_chromatin           10
normal_nucleoli           10
mitoses                    9
class                      2
doctor_name                4
dtype: int64

In [9]:
# Here we list all columns
df.columns

Index(['patient_id', 'clump_thickness', 'cell_size_uniformity',
       'cell_shape_uniformity', 'marginal_adhesion', 'single_ep_cell_size',
       'bare_nuclei', 'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class',
       'doctor_name'],
      dtype='object')

In [10]:
# This provides some statistics on the numerical data
df.describe()

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bland_chromatin,normal_nucleoli,mitoses
count,699.0,698.0,698.0,699.0,699.0,699.0,695.0,698.0,699.0
mean,1071704.0,4.416905,3.137536,3.207439,2.793991,3.216023,3.447482,2.868195,1.589413
std,617095.7,2.817673,3.052575,2.971913,2.843163,2.2143,2.441191,3.055647,1.715078
min,61634.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,870688.5,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0
50%,1171710.0,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0
75%,1238298.0,6.0,5.0,5.0,3.5,4.0,5.0,4.0,1.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0


In [11]:
# This aggregates the data by the given columns, then applies the aggregation function (size = row count per group)
df.groupby(by =['class', 'doctor_name']).size()

class      doctor_name
benign     Dr. Doe        127
           Dr. Lee        121
           Dr. Smith      102
           Dr. Wong       108
malignant  Dr. Doe         58
           Dr. Lee         60
           Dr. Smith       74
           Dr. Wong        49
dtype: int64

## Data Preprocessing

*Dealing with missing values*

With every dataset it is vital to evaluate the missing values. How many are there? Is it an error? Are there too many missing values? Does a missing value have a meaning relative to its context?

We can sum up the total missing values using the following:

```df.isna().sum()```

Now that we have identified our missing values, we have a few options. We can fill them in with a certain value (zero, mean/max/median by column, string) or drop them by row. Since there are few missing values, we can drop the rows to avoid skewing the data in further analysis.

```df = df.dropna(axis = 0, how = 'any')```

This allows us to drop rows with any missing values in them.
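To make the two options concrete, the following sketch compares dropping rows with filling by the column median on a toy frame. The values are invented, and the median is only one of the fill choices mentioned above.

```python
import numpy as np
import pandas as pd

# Invented values with two missing entries
toy = pd.DataFrame({
    "clump_thickness": [5.0, np.nan, 3.0, 6.0],
    "bland_chromatin": [3.0, 3.0, np.nan, 1.0],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna(axis=0, how="any")

# Option 2: fill each column's gaps with that column's median
filled = toy.fillna(toy.median(numeric_only=True))

print(len(toy), len(dropped))
print(filled)
```

Dropping shrinks the dataset, while filling keeps every row at the cost of introducing values that were never measured.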

*Inspecting duplicates*

To view repeating rows we can start off by looking at the number of unique values in each column.

```df.nunique()```

We see here that although there are 699 rows, there are only 645 unique patient_id’s. This could mean that some patients appear more than once in the dataset. To isolate these patients and view their data, we use the following:

```df[df.duplicated(subset = 'patient_id', keep =False)].sort_values('patient_id')```

This line displays all the duplicated patient_id’s in order. The number of times a patient shows up in the dataset can also be viewed.

```repeat_patients = df.groupby(by = 'patient_id').size().sort_values(ascending =False)```

This shows that one patient shows up in the data 6 times.
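The difference between exact duplicate rows and repeated IDs is easy to mix up, so here is a small sketch on invented data. The column names mirror the dataset, but the values do not come from it.

```python
import pandas as pd

# Patient 101 appears three times; only rows 0 and 2 are identical in every column
toy = pd.DataFrame({
    "patient_id": [101, 102, 101, 103, 101],
    "clump_thickness": [5, 3, 5, 4, 2],
})

# Exact duplicates: every column matches an earlier row
exact_dupes = toy[toy.duplicated(keep="first")]

# Repeated IDs: keep=False marks every occurrence, not just the later ones
repeat_ids = toy[toy.duplicated(subset="patient_id", keep=False)]

# Number of rows per patient, most frequent first
visits = toy.groupby("patient_id").size().sort_values(ascending=False)

print(exact_dupes)
print(repeat_ids.sort_values("patient_id"))
print(visits)
```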

*Filtering data*

If we want to remove patients that show up more than 2 times in the data set, we can first collect their IDs and then filter them out:

```filtered_patients = repeat_patients[repeat_patients > 2].to_frame().reset_index()```

```filtered_df = df[~df.patient_id.isin(filtered_patients.patient_id)]```

If we did not have the tilde (“~”) we would get all individuals that repeat more than twice. By adding a tilde the pandas boolean series is reversed and thus the resulting data frame is of those that do NOT repeat more than twice.
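The effect of the tilde can be checked on a toy frame. This is a sketch with invented IDs; the names `mask`, `kept` and `removed` are chosen only for this illustration.

```python
import pandas as pd

# Patient 101 appears three times, 102 twice, 103 once
toy = pd.DataFrame({"patient_id": [101, 102, 101, 103, 101, 102]})

visits = toy.groupby("patient_id").size()
frequent_ids = visits[visits > 2].index  # IDs that appear more than twice

# Boolean Series: True where the row belongs to a frequent patient
mask = toy["patient_id"].isin(frequent_ids)

removed = toy[mask]   # rows of patients that repeat more than twice
kept = toy[~mask]     # ~ flips every True/False, keeping everyone else

print(kept)
```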

In [12]:
#Dealing with missing values? How many np.nan per column?

df.isna().sum() 

patient_id               0
clump_thickness          1
cell_size_uniformity     1
cell_shape_uniformity    0
marginal_adhesion        0
single_ep_cell_size      0
bare_nuclei              2
bland_chromatin          4
normal_nucleoli          1
mitoses                  0
class                    0
doctor_name              0
dtype: int64

In [13]:
# # fill with zero
# df = df.fillna(0) 

In [14]:
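# Note: axis=1 with how='all' only drops columns that are entirely NaN, so no rows are removed here.
# To drop every row that contains a missing value, use df.dropna(axis=0, how='any') instead.
df = df.dropna(axis = 1, how = 'all')

# Rename columns: rename() returns a new DataFrame, and this mapping is a placeholder
# that leaves 'patient_id' unchanged (index=str also converts the row labels to strings)
df.rename(index =str, columns = {'patient_id':'patient_id'})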

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Doe
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,Dr. Smith
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,Dr. Lee
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,Dr. Smith
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,Dr. Wong
...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,Dr. Lee
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Smith
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,Dr. Lee
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,Dr. Lee


In [15]:
# It's good to inspect unique key identifiers
df.nunique()

patient_id               645
clump_thickness           10
cell_size_uniformity      10
cell_shape_uniformity     10
marginal_adhesion         10
single_ep_cell_size       10
bare_nuclei               11
bland_chromatin           10
normal_nucleoli           10
mitoses                    9
class                      2
doctor_name                4
dtype: int64

In [16]:
# This shows rows that appear more than once with exactly the same values in every column.
df[df.duplicated(keep = 'last')]

# # This shows all instances where patient_id appears more than once, possibly with different values in other columns
# df[df.duplicated(subset = 'patient_id', keep =False)].sort_values('patient_id')

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
168,1198641,3.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Lee


In [17]:
# Now that we have seen some exact duplicates, remove any fully duplicated rows
# (rows whose values all occur twice), keeping the first occurrence

df = df.drop_duplicates(subset = None, keep ='first')

In [18]:
repeat_patients = df.groupby(by = 'patient_id').size().sort_values(ascending =False)
repeat_patients

patient_id
1182404     6
1276091     5
769612      2
1339781     2
385103      2
           ..
1079304     1
1080185     1
1080233     1
1081791     1
13454352    1
Length: 645, dtype: int64

In [19]:
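# How to reverse a condition?
# For plain Python booleans use `not`; `~` is the element-wise NOT for pandas boolean Series.
# (`~1==1` parses as `(~1)==1`, i.e. `-2==1`, so it is False for an unrelated reason.)
print(1==1)
print(not (1==1))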

True
False


In [20]:
filtered_patients = repeat_patients[repeat_patients > 2].to_frame().reset_index()
filtered_df = df[~df.patient_id.isin(filtered_patients.patient_id)]
filtered_df

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Doe
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,Dr. Smith
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,Dr. Lee
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,Dr. Smith
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,Dr. Wong
...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,Dr. Lee
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Smith
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,Dr. Lee
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,Dr. Lee


In [21]:
# These are the details of all patients that appear more than twice

df[df.patient_id.isin(filtered_patients.patient_id)]

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name
136,1182404,4.0,1.0,1,1,2,1,2.0,1.0,1,benign,Dr. Lee
241,1276091,3.0,1.0,1,3,1,1,3.0,1.0,1,benign,Dr. Wong
256,1182404,3.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Wong
257,1182404,3.0,1.0,1,1,2,1,2.0,1.0,1,benign,Dr. Doe
265,1182404,5.0,1.0,4,1,2,1,3.0,2.0,1,benign,Dr. Lee
429,1276091,2.0,1.0,1,1,2,1,2.0,1.0,1,benign,Dr. Doe
430,1276091,1.0,3.0,1,1,2,1,2.0,2.0,1,benign,Dr. Wong
431,1276091,5.0,1.0,1,3,4,1,3.0,2.0,1,benign,Dr. Wong
448,1182404,1.0,1.0,1,1,1,1,1.0,1.0,1,benign,Dr. Lee
462,1276091,6.0,1.0,1,3,2,1,1.0,1.0,1,benign,Dr. Lee


In [22]:
# How to view the data by aggregating on more than one column

df.groupby('class').agg({'cell_size_uniformity': ['min', 'max'], 'normal_nucleoli': 'mean', 'class': 'count'})

Unnamed: 0_level_0,cell_size_uniformity,cell_size_uniformity,normal_nucleoli,class
Unnamed: 0_level_1,min,max,mean,count
class,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
benign,1.0,9.0,1.289474,457
malignant,1.0,10.0,5.863071,241


### One Hot Encoding Categorical Data

*Reshaping data*

The dataset has categorical data in the “doctor_name” column. To feed this data into a machine learning pipeline, we will need to convert it into one-hot encoded columns. This can be done with scikit-learn; however, we will do it in pandas to demonstrate the pivoting and merging functionality. Start off by creating a new dataframe with the categorical data.

```categorical_df = df[['patient_id','doctor_name']]```
```categorical_df['doctor_count'] = 1```

We add an extra column to identify which doctor a patient deals with. Pivot this table so that the cells contain only numerical values and the columns become the doctors’ names. Then fill in the empty cells with 0.

```doctors_one_hot_encoded = pd.pivot_table( categorical_df, index = categorical_df.index, columns = ['doctor_name'], values = ['doctor_count'] )```
```doctors_one_hot_encoded = doctors_one_hot_encoded.fillna(0)```

Then drop the multiIndex columns:

```doctors_one_hot_encoded.columns = doctors_one_hot_encoded.columns.droplevel()```

We can now join this back to our main table. Typically a left join in pandas looks like this:

```leftJoin_df = pd.merge(df1, df2, on ='col_name', how='left')```

However, we are joining on the index, so we pass the “left_index” and “right_index” options to specify that the join key is the index of both tables:

```combined_df = pd.merge(df, doctors_one_hot_encoded, left_index=True, right_index=True, how='left')```
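A minimal sketch of an index-based left join on two invented frames shows what happens to rows that have no match on the right. The frames `left` and `right` are hypothetical stand-ins for the main table and the encoded table.

```python
import pandas as pd

# Hypothetical stand-ins: the right frame has no row for index 1
left = pd.DataFrame({"patient_id": [101, 102, 103]}, index=[0, 1, 2])
right = pd.DataFrame({"Dr. Doe": [1.0, 0.0]}, index=[0, 2])

# Join on the row labels of both frames; how="left" keeps every row of `left`
joined = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(joined)
```

Rows of the left frame without a partner are kept and receive NaN in the joined columns, so it is worth checking for new missing values after a left join.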

We can drop the column that we no longer need with the following:

```combined_df = combined_df.drop(columns=['doctor_name'])```
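For comparison, pandas also offers `pd.get_dummies`, which encodes the column and removes the original in one call. This is an alternative to the pivot-and-merge route described here, not the approach this notebook uses, and the frame below is invented for the sketch.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "doctor_name": ["Dr. Doe", "Dr. Smith", "Dr. Doe"],
})

# Encodes doctor_name into one column per doctor and drops the original column
encoded = pd.get_dummies(toy, columns=["doctor_name"], dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

By default the new columns are prefixed with the source column name, for example `doctor_name_Dr. Doe`.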

In [23]:
categorical_df = df[['patient_id', 'doctor_name']]

In [24]:
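# This specifies all rows (':') and the column name 'doctor_count'.
# categorical_df is a slice of df, so pandas emits the SettingWithCopyWarning shown below;
# building it with df[['patient_id', 'doctor_name']].copy() avoids the warning.
categorical_df.loc[:,'doctor_count'] = 1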

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  categorical_df.loc[:,'doctor_count'] = 1


In [25]:
categorical_df

Unnamed: 0,patient_id,doctor_name,doctor_count
0,1000025,Dr. Doe,1
1,1002945,Dr. Smith,1
2,1015425,Dr. Lee,1
3,1016277,Dr. Smith,1
4,1017023,Dr. Wong,1
...,...,...,...
694,776715,Dr. Lee,1
695,841769,Dr. Smith,1
696,888820,Dr. Lee,1
697,897471,Dr. Lee,1


In [26]:
doctors_one_hot_encoded  = pd.pivot_table(categorical_df
                                  ,index = categorical_df.index, 
                                  columns = ['doctor_name'], values = ['doctor_count'])

In [27]:
doctors_one_hot_encoded = doctors_one_hot_encoded.fillna(0)
doctors_one_hot_encoded

Unnamed: 0_level_0,doctor_count,doctor_count,doctor_count,doctor_count
doctor_name,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong
0,1.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
694,0.0,1.0,0.0,0.0
695,0.0,0.0,1.0,0.0
696,0.0,1.0,0.0,0.0
697,0.0,1.0,0.0,0.0


In [28]:
doctors_one_hot_encoded.columns = doctors_one_hot_encoded.columns.droplevel()
doctors_one_hot_encoded

doctor_name,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong
0,1.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0
...,...,...,...,...
694,0.0,1.0,0.0,0.0
695,0.0,0.0,1.0,0.0
696,0.0,1.0,0.0,0.0
697,0.0,1.0,0.0,0.0


In [29]:
combined_df = pd.merge(df, doctors_one_hot_encoded, left_index = True,right_index =True, how ='left')
combined_df

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Doe,1.0,0.0,0.0,0.0
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,Dr. Smith,0.0,0.0,1.0,0.0
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,Dr. Lee,0.0,1.0,0.0,0.0
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,Dr. Smith,0.0,0.0,1.0,0.0
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,Dr. Wong,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,Dr. Lee,0.0,1.0,0.0,0.0
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,Dr. Smith,0.0,0.0,1.0,0.0
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,Dr. Lee,0.0,1.0,0.0,0.0
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,Dr. Lee,0.0,1.0,0.0,0.0


## Making new columns and conducting element-wise operations

*Row-wise Operations*

Another key component in data wrangling is the ability to conduct row-wise or column-wise operations. Examples include renaming elements within a column based on their value, and creating a new column whose value depends on multiple attributes within the row.

For this example, let's create a new column that categorizes a patient's cell as normal or abnormal based on its attributes. We first define our function and the operation that it will perform.

```
def celltypelabel(x):
    # Label the row 'normal' only when both uniformity scores exceed 5
    if (x['cell_size_uniformity'] > 5) and (x['cell_shape_uniformity'] > 5):
        return 'normal'
    else:
        return 'abnormal'
```

Then we use the pandas apply function to run the celltypelabel(x) function on the dataframe.

```combined_df['cell_type_label'] = combined_df.apply(lambda x: celltypelabel(x), axis=1)```
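For a simple two-way rule like this, a vectorized version avoids calling a Python function once per row. The sketch below applies the same thresholds and labels as the function above to an invented frame; `np.where` is a swapped-in alternative to `apply`, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented rows covering the different cases
toy = pd.DataFrame({
    "cell_size_uniformity": [1.0, 8.0, 10.0, 4.0],
    "cell_shape_uniformity": [1, 8, 3, 6],
})

# Element-wise AND of two boolean Series (here & is the right operator)
both_high = (toy["cell_size_uniformity"] > 5) & (toy["cell_shape_uniformity"] > 5)

# Pick the first label where the condition holds, the second elsewhere
toy["cell_type_label"] = np.where(both_high, "normal", "abnormal")

print(toy)
```

`apply` with `axis=1` remains the more flexible choice when the rule needs arbitrary Python logic per row.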

In [30]:
#Randomly sampling 10 rows
combined_df.sample(n=10)

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong
494,1155967,5.0,1.0,2,10,4,5,2.0,1.0,1,benign,Dr. Doe,1.0,0.0,0.0,0.0
453,1230994,4.0,5.0,5,8,6,10,10.0,7.0,1,malignant,Dr. Lee,0.0,1.0,0.0,0.0
359,873549,10.0,3.0,5,4,3,7,,5.0,3,malignant,Dr. Doe,1.0,0.0,0.0,0.0
649,1318671,3.0,1.0,1,1,2,1,2.0,1.0,1,benign,Dr. Doe,1.0,0.0,0.0,0.0
482,1318169,9.0,10.0,10,10,10,5,10.0,10.0,10,malignant,Dr. Smith,0.0,0.0,1.0,0.0
572,183936,3.0,1.0,1,1,2,1,2.0,1.0,1,benign,Dr. Wong,0.0,0.0,0.0,1.0
95,1164066,1.0,1.0,1,1,2,1,3.0,1.0,1,benign,Dr. Lee,0.0,1.0,0.0,0.0
251,191250,10.0,4.0,4,10,2,10,5.0,3.0,3,malignant,Dr. Doe,1.0,0.0,0.0,0.0
493,1142706,5.0,10.0,10,10,6,10,6.0,5.0,2,malignant,Dr. Lee,0.0,1.0,0.0,0.0
286,529329,10.0,10.0,10,10,10,10,4.0,10.0,10,malignant,Dr. Doe,1.0,0.0,0.0,0.0


In [31]:
combined_df.drop(columns=['doctor_name'])

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,benign,1.0,0.0,0.0,0.0
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,benign,0.0,0.0,1.0,0.0
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,benign,0.0,1.0,0.0,0.0
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,benign,0.0,0.0,1.0,0.0
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,benign,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,benign,0.0,1.0,0.0,0.0
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,benign,0.0,0.0,1.0,0.0
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,malignant,0.0,1.0,0.0,0.0
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,malignant,0.0,1.0,0.0,0.0


In [32]:
# Making a new column based on a numerical calculation of other columns in the df
df['new_col_name'] = df.clump_thickness*df.cell_size_uniformity


In [33]:
# How to convert benign & malignant to 0 and 1

class_to_numerical_dictionary = {'benign':0, 'malignant':1}

combined_df['class'] = combined_df['class'].map(class_to_numerical_dictionary)

combined_df


Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong
0,1000025,5.0,1.0,1,1,2,1,3.0,1.0,1,0,Dr. Doe,1.0,0.0,0.0,0.0
1,1002945,5.0,4.0,4,5,7,10,3.0,2.0,1,0,Dr. Smith,0.0,0.0,1.0,0.0
2,1015425,3.0,1.0,1,1,2,2,3.0,1.0,1,0,Dr. Lee,0.0,1.0,0.0,0.0
3,1016277,6.0,8.0,8,1,3,4,3.0,7.0,1,0,Dr. Smith,0.0,0.0,1.0,0.0
4,1017023,4.0,1.0,1,3,2,1,3.0,1.0,1,0,Dr. Wong,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
694,776715,3.0,1.0,1,1,3,2,1.0,1.0,1,0,Dr. Lee,0.0,1.0,0.0,0.0
695,841769,2.0,1.0,1,1,2,1,1.0,1.0,1,0,Dr. Smith,0.0,0.0,1.0,0.0
696,888820,5.0,10.0,10,3,7,3,8.0,10.0,2,1,Dr. Lee,0.0,1.0,0.0,0.0
697,897471,4.0,8.0,6,4,3,4,10.0,6.0,1,1,Dr. Lee,0.0,1.0,0.0,0.0


In [34]:
# Feature building: 

def celltypelabel(x):
    # Label the row 'normal' only when both uniformity scores exceed 5
    if (x['cell_size_uniformity'] > 5) and (x['cell_shape_uniformity'] > 5):
        return 'normal'
    else:
        return 'abnormal'


combined_df['cell_type_label'] = combined_df.apply(lambda x: celltypelabel(x), axis=1)

        

In [35]:
combined_df[['patient_id', 'cell_type_label']]

Unnamed: 0,patient_id,cell_type_label
0,1000025,abnormal
1,1002945,abnormal
2,1015425,abnormal
3,1016277,normal
4,1017023,abnormal
...,...,...
694,776715,abnormal
695,841769,abnormal
696,888820,normal
697,897471,normal


In [36]:
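# Sanity check: rows where cell_size_uniformity is NOT > 5 but cell_shape_uniformity is > 5.
# The tilde negates only the first condition; every such row should be labelled 'abnormal'.
combined_df[~(combined_df.cell_size_uniformity >5) & (combined_df.cell_shape_uniformity >5)]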

Unnamed: 0,patient_id,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_ep_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class,doctor_name,Dr. Doe,Dr. Lee,Dr. Smith,Dr. Wong,cell_type_label
15,1047630,7.0,4.0,6,4,6,1,4.0,3.0,1,1,Dr. Lee,0.0,1.0,0.0,0.0,abnormal
50,1108370,9.0,5.0,8,1,2,3,2.0,1.0,5,1,Dr. Smith,0.0,0.0,1.0,0.0,abnormal
52,1110102,10.0,3.0,6,2,3,5,4.0,10.0,2,1,Dr. Doe,1.0,0.0,0.0,0.0,abnormal
68,1120559,8.0,3.0,8,3,4,9,8.0,9.0,8,1,Dr. Wong,0.0,0.0,0.0,1.0,abnormal
84,1147699,3.0,5.0,7,8,8,9,7.0,10.0,7,1,Dr. Lee,0.0,1.0,0.0,0.0,abnormal
86,1148278,3.0,3.0,6,4,5,8,4.0,4.0,1,1,Dr. Smith,0.0,0.0,1.0,0.0,abnormal
99,1166630,7.0,5.0,6,10,5,10,7.0,9.0,4,1,Dr. Lee,0.0,1.0,0.0,0.0,abnormal
124,1175937,5.0,4.0,6,7,9,7,8.0,10.0,1,1,Dr. Smith,0.0,0.0,1.0,0.0,abnormal
186,1206695,1.0,5.0,8,6,5,8,7.0,10.0,1,1,Dr. Doe,1.0,0.0,0.0,0.0,abnormal
187,1206841,10.0,5.0,6,10,6,10,7.0,7.0,10,1,Dr. Wong,0.0,0.0,0.0,1.0,abnormal
