# DSI Instructor Challenge

* Hank Butler

* Data Scientist - Grabit Logistics

* 2/10/2021

---

## Part 1. Modeling



In [3]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
# Import breast-cancer data set.
df = pd.read_csv('breast-cancer.csv', header = None)

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [4]:
# Read .txt file with column names
with open('field_names.txt') as f:
    col_names = f.read().split('\n')
    
col_names

['ID',
 'diagnosis',
 'radius_mean',
 'radius_sd_error',
 'radius_worst',
 'texture_mean',
 'texture_sd_error',
 'texture_worst',
 'perimeter_mean',
 'perimeter_sd_error',
 'perimeter_worst',
 'area_mean',
 'area_sd_error',
 'area_worst',
 'smoothness_mean',
 'smoothness_sd_error',
 'smoothness_worst',
 'compactness_mean',
 'compactness_sd_error',
 'compactness_worst',
 'concavity_mean',
 'concavity_sd_error',
 'concavity_worst',
 'concave_points_mean',
 'concave_points_sd_error',
 'concave_points_worst',
 'symmetry_mean',
 'symmetry_sd_error',
 'symmetry_worst',
 'fractal_dimension_mean',
 'fractal_dimension_sd_error',
 'fractal_dimension_worst']

In [6]:
# Combine breast-cancer data with column names

df.columns = col_names

df.head()

Unnamed: 0,ID,diagnosis,radius_mean,radius_sd_error,radius_worst,texture_mean,texture_sd_error,texture_worst,perimeter_mean,perimeter_sd_error,...,concavity_worst,concave_points_mean,concave_points_sd_error,concave_points_worst,symmetry_mean,symmetry_sd_error,symmetry_worst,fractal_dimension_mean,fractal_dimension_sd_error,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [7]:
# Initial Look at DF
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          569 non-null    int64  
 1   diagnosis                   569 non-null    object 
 2   radius_mean                 569 non-null    float64
 3   radius_sd_error             569 non-null    float64
 4   radius_worst                569 non-null    float64
 5   texture_mean                569 non-null    float64
 6   texture_sd_error            569 non-null    float64
 7   texture_worst               569 non-null    float64
 8   perimeter_mean              569 non-null    float64
 9   perimeter_sd_error          569 non-null    float64
 10  perimeter_worst             569 non-null    float64
 11  area_mean                   569 non-null    float64
 12  area_sd_error               569 non-null    float64
 13  area_worst                  569 non

ID is an integer.

diagnosis is an object.

The rest are floats.

In [12]:
# Quick look at 'diagnosis'

df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

Only two values for diagnosis. B is 'benign', M is 'malignant'.

From the task we know that we are trying to predict whether a cell is benign or malignant. Thus, we can use diagnosis as our target variable.

This is a classification problem with two classes. 

Further EDA will be conducted before data pre-processing and modeling.

In [8]:
df.describe()

Unnamed: 0,ID,radius_mean,radius_sd_error,radius_worst,texture_mean,texture_sd_error,texture_worst,perimeter_mean,perimeter_sd_error,perimeter_worst,...,concavity_worst,concave_points_mean,concave_points_sd_error,concave_points_worst,symmetry_mean,symmetry_sd_error,symmetry_worst,fractal_dimension_mean,fractal_dimension_sd_error,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [10]:
# check for any null values

df.isnull().sum()

ID                            0
diagnosis                     0
radius_mean                   0
radius_sd_error               0
radius_worst                  0
texture_mean                  0
texture_sd_error              0
texture_worst                 0
perimeter_mean                0
perimeter_sd_error            0
perimeter_worst               0
area_mean                     0
area_sd_error                 0
area_worst                    0
smoothness_mean               0
smoothness_sd_error           0
smoothness_worst              0
compactness_mean              0
compactness_sd_error          0
compactness_worst             0
concavity_mean                0
concavity_sd_error            0
concavity_worst               0
concave_points_mean           0
concave_points_sd_error       0
concave_points_worst          0
symmetry_mean                 0
symmetry_sd_error             0
symmetry_worst                0
fractal_dimension_mean        0
fractal_dimension_sd_error    0
fractal_

There are no null values. All columns other than ID and diagnosis are floats, so if any values had been missing, median imputation would have been used to fill them. The median is typically preferred to the mean here because the mean is more sensitive to outliers.
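Since the real data has no missing values, the median imputation described above never runs. A minimal sketch of what it could look like, using a made-up toy frame (the values and the `toy` name are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric feature columns (values are invented).
toy = pd.DataFrame({
    "radius_mean": [10.0, 12.0, np.nan, 14.0, 100.0],
    "texture_mean": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Per-column medians; NaN values are skipped by default.
medians = toy.median()

# Fill each column's gaps with that column's median.
imputed = toy.fillna(medians)

print(medians)
print(imputed)
```

The outlier of 100.0 in `radius_mean` pulls that column's mean far above most of its values, while the median stays close to the typical ones, which is the reason for preferring the median.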

---

### Mean / Median Smoothness and Compactness for Benign and Malignant Tumors

A 'benign' tumor is non-cancerous: it does not invade nearby tissue or spread to other parts of the body. A 'malignant' tumor is cancerous and can invade nearby tissue and spread. The diagnosis column can be transformed to a dummy variable where 1 means malignant and 0 means benign.

In [None]:
df['diagnosis'] = df['diagnosis'].apply(lambda x: 1 if x == 'M' else 0)

df['diagnosis'].value_counts()
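With diagnosis encoded as 0/1, the mean and median of smoothness and compactness per class can be computed with a group-by. The sketch below uses a small invented frame rather than the real file; the column names `smoothness_mean` and `compactness_mean` come from field_names.txt, and it assumes those names line up with the right data columns:

```python
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "diagnosis": [1, 1, 0, 0, 0],  # 1 = malignant, 0 = benign
    "smoothness_mean": [0.12, 0.10, 0.08, 0.09, 0.10],
    "compactness_mean": [0.30, 0.20, 0.05, 0.10, 0.15],
})

# One row per diagnosis class, with mean and median for each feature.
summary = (
    toy.groupby("diagnosis")[["smoothness_mean", "compactness_mean"]]
       .agg(["mean", "median"])
)

print(summary)
```

On the real data the same call would be made on `df` instead of `toy`.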

---

## Part 2. Feedback

In [1]:
part2_csv = 'https://gist.githubusercontent.com/jeff-boykin/9e1a450ef152604e6830ce70f4fc1be8/raw/4d42aebc2c2d3f7528a7769248720918e14f2e03/part-2-data.train.csv'

In [5]:
data = pd.read_csv(part2_csv)

data.head()

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500,cv-library.co.uk
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk


In [6]:
data.shape

(10000, 12)

___
### Student Sample 1.

Below is the full script for Student 1. 

If there is feedback for a chunk of code, the code will be in a separate cell with the feedback in a markdown cell below it.

In [None]:
import pandas as pd
import numpy as np
from sklearn import LinearRegression
from sklearn.cross_validation import cross_val_score

# Load data
d = pd.read_csv('../data/train.csv')


# Setup data for prediction
x1 = data.SalaryNormalized
x2 = pd.get_dummies(data.ContractType)

# Setup model
model = LinearRegression()

# Evaluate model
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
scores = cross_val_score(model, x2, x1, cv=1, scoring='mean_absolute_error')
print(scores.mean())

In [None]:
import pandas as pd
import numpy as np
from sklearn import LinearRegression
from sklearn.cross_validation import cross_val_score

You may want to hold off on importing specific modules from sklearn until you're about to use them. This will help you keep track of which model you're using and avoids loading things you never use.

Also, cross_validation is no longer a module in sklearn: it was deprecated in version 0.18 and removed in 0.20, and cross_val_score now lives in sklearn.model_selection. Likewise, LinearRegression can't be imported from the top-level sklearn package; it lives in sklearn.linear_model. Sklearn has great documentation, so make sure you're importing from the correct places. Don't worry about remembering all the different modules. Sklearn is a large library, so it's okay if you forget which module something is located in, but be sure you know how to use the documentation. This will help you down the road.
https://scikit-learn.org/stable/index.html

In [None]:
d = pd.read_csv('../data/train.csv')

Pick a better name for your dataframe than simply 'd'. It doesn't tell you, or whoever reads your code, what the data represents, and it could be mistaken for a dictionary in other parts of your code.

df is a common convention. You use 'data' later in your file, which is also a fine name, but since the dataframe was loaded as 'd', 'data' is never defined and those lines will raise a NameError. Pick one name, use it consistently, and avoid names that could be confusing or vague.

Also, it's typically a good idea to call the .head() method after importing data to see if the import went well and to get a look at the data. .info(), .shape, and .describe() are also good to call right after importing.
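As a rough illustration of those checks, here is a sketch that reads a tiny in-memory CSV instead of '../data/train.csv' (the CSV contents and the `salary_df` name are made up for the example):

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real file (contents are invented).
csv_text = "Id,Title,SalaryNormalized\n1,Analyst,25000\n2,Engineer,30000\n"
salary_df = pd.read_csv(io.StringIO(csv_text))

print(salary_df.head())      # first rows: did the columns parse as expected?
print(salary_df.shape)       # (rows, columns)
salary_df.info()             # dtypes and non-null counts per column
print(salary_df.describe())  # summary statistics for numeric columns
```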

In [None]:
# Setup data for prediction
x1 = data.SalaryNormalized
x2 = pd.get_dummies(data.ContractType)

Let's get some better names for your variables!

You can do data.VariableName but typically you'll see:
data['VariableName'] which is more conventional.

The typical convention is X for predictor variables and y for the target variable. 

This is derived from the notation y = f(X) which should help explain the convention. We're trying to predict y based on a function of X!

If we're assuming that you're trying to predict salary here's what your variables should look like:

y = data['SalaryNormalized']

and

X = data['ContractType']

Additionally, the dataset has 12 columns, but your data preparation uses only two of them: SalaryNormalized as the target and ContractType as the sole predictor. You can start with a few predictor variables and iteratively add more, but it may be easier to start with all plausible features and then drop those with little predictive power after testing the model. A single predictor variable will probably give you a poor model.

You should handle the missing values in data['ContractType']. From a quick look at the dataset, the column has missing values, and you should decide how to deal with them before converting it to dummy variables; by default pd.get_dummies() does not flag missing values, so those rows end up as all zeros. The column only has two distinct values (full_time or part_time), so it may even be simpler to convert it to a single binary variable, since pd.get_dummies() adds a column per category.
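Two possible ways to deal with the missing ContractType values are sketched below on an invented toy frame. The "unknown" label and the variable names are assumptions for illustration, not something the dataset prescribes:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as ContractType (values are invented).
toy = pd.DataFrame({
    "ContractType": ["full_time", np.nan, "part_time", np.nan],
    "SalaryNormalized": [25000, 30000, 18000, 27500],
})

# Option A: treat missingness as its own category, then one-hot encode.
filled = toy["ContractType"].fillna("unknown")
dummies = pd.get_dummies(filled, prefix="contract")

# Option B: a single binary flag. Missing rows become 0 here, so they are
# indistinguishable from part_time; that is a modeling choice to be aware of.
is_full_time = (toy["ContractType"] == "full_time").astype(int)

print(dummies)
print(is_full_time)
```

Option A keeps the information that a value was missing, at the cost of an extra column; Option B is more compact but discards it.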

In [None]:
# Setup model
model = LinearRegression()

Consider a more specific name for your model that indicates what kind of model you are using. This will help you later on if you're using different types of models and need to keep track of which models you have used.

It can be something simple like lin_reg = LinearRegression().

In [None]:
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split

This is a better place to import the cross_val_score. Just make sure you're importing it from the right place! Also, cross_val_score only needs to be imported once!

train_test_split is in sklearn.model_selection. You also didn't specify a train_test_split for your model. Typically you only want to import a class or module if you're going to use it.

In [None]:
scores = cross_val_score(model, x2, x1, cv=1, scoring='mean_absolute_error')
print(scores.mean())

Firstly, this ties back to why you need to have better names for your variables. It's not clear which one is the target and it could be in the wrong spot in the call of cross_val_score.

cv = 1. This needs to be a larger number. In k-fold cross-validation the data is split into k folds; each fold takes one turn as the test set while the model is trained on the rest, so every data point appears in a test set exactly once. With a single fold there is nothing left to train on, and sklearn raises a ValueError for cv=1. You must set cv to at least 2; 3, 5, or 10 are common choices.
Here's a link to read up more on cross-validation if you're struggling to grasp it.

https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

Calling scores.mean() will give you the mean score of all iterations. Consider calling scores by itself to see the scores across the different cv folds in addition to calling just the mean.

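Mean absolute error is a good metric for seeing how a regression model performs, so that choice is fine. Note that in current versions of sklearn the scoring string is 'neg_mean_absolute_error'; it returns negated errors so that a higher score is always better, so flip the sign when reporting the result.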
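Putting the cross-validation points together, a corrected evaluation call might look like the sketch below. It runs on synthetic data generated in the block (not the salary data), and assumes a recent scikit-learn where the scorer is named 'neg_mean_absolute_error':

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, invented purely so the example runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lin_reg = LinearRegression()

# Five folds; the scorer returns negated MAE so that higher is better.
scores = cross_val_score(lin_reg, X, y, cv=5, scoring="neg_mean_absolute_error")

print(scores)          # one negated MAE per fold
print(-scores.mean())  # average MAE across folds, sign flipped back
```

Printing the per-fold scores alongside the mean shows whether performance is stable or driven by one unusual fold.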

Here's how you can use train_test_split to set up and fit the data to your model.

In [None]:
# Suggested code for fitting and training the model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

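# creating X and y ('predictor_variables' and 'target' are placeholder column names)
X = df[['predictor_variables']]  # double brackets keep X two-dimensional, as sklearn expects
y = df['target']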

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)

# Initialize the model
lin_reg = LinearRegression()

lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_test)

# See how model performs (.score() returns R^2 for regressors)
from sklearn.metrics import mean_squared_error, r2_score
print("Training Score: {:.3f}".format(lin_reg.score(X_train, y_train)))
print("Test Score: {:.3f}".format(lin_reg.score(X_test, y_test)))
# coef_ is an array, so round it instead of using a scalar format spec
print("Coefficients: {}".format(lin_reg.coef_.round(3)))
print("MSE: {:.3f}".format(mean_squared_error(y_test, y_pred)))
print("R2 Score: {:.3f}".format(r2_score(y_test, y_pred)))

---

### Student Sample 2.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score

# Load data
data = pd.read_csv('../data/train.csv')


# Setup data for prediction
y = data.SalaryNormalized
X = pd.get_dummies(data.ContractType)

# Setup model
model = LinearRegression()

# Evaluate model
scores = cross_val_score(model, X, y, cv=5, scoring='mean_absolute_error')
print(scores.mean())

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import cross_val_score

sklearn.cross_validation is no longer a module in sklearn (it was removed in version 0.20). cross_val_score should be imported from sklearn.model_selection.

Don't feel bad about this stuff. It's very common to forget all the different modules in sklearn. The documentation for sklearn is very good and helpful!

https://scikit-learn.org/stable/index.html

In [None]:
data = pd.read_csv('../data/train.csv')

After importing data, consider calling .head() on your data frame to get a quick look. This will help you ensure the data was imported correctly. 
.shape, .info(), and .describe() can also provide some good insight into what the data looks like.

In [None]:
y = data.SalaryNormalized
X = pd.get_dummies(data.ContractType)

data.SalaryNormalized is perfectly valid but it's usually preferred to use
data['SalaryNormalized']. It makes it clearer that you're selecting a column and not calling an attribute or method on your data frame.

There appear to be more features in the data set than you've included in X. You may want to consider adding in all features initially, then removing features that have little to no predictive power. With only one variable, your model may not have strong predictive power.
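One way to start with more than one predictor is to one-hot encode several categorical columns at once. The sketch below uses an invented toy frame; ContractTime and Category are column names from the dataset, but the rows (including the "IT Jobs" category) are made up:

```python
import pandas as pd

# Toy frame with two categorical predictors and the target (rows are invented).
toy = pd.DataFrame({
    "ContractTime": ["permanent", "contract", "permanent"],
    "Category": ["Engineering Jobs", "IT Jobs", "Engineering Jobs"],
    "SalaryNormalized": [25000, 30000, 27500],
})

y = toy["SalaryNormalized"]

# get_dummies on a multi-column selection encodes each column,
# prefixing the new columns with the original column name.
X = pd.get_dummies(toy[["ContractTime", "Category"]])

print(X.columns.tolist())
print(X.shape)
```

From there, features can be dropped one at a time to see which ones the model can do without.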

You should handle the missing values in data['ContractType']. From a quick look at the dataset, the column has missing values, and you should decide how to deal with them before converting it to dummy variables; by default pd.get_dummies() does not flag missing values, so those rows end up as all zeros. The column only has two distinct values (full_time or part_time), so it may even be simpler to convert it to a single binary variable, since pd.get_dummies() adds a column per category.

In [None]:
model = LinearRegression()

This is more of a personal preference, but you may want to be more specific with the name for your model in case you try multiple models or variations of the same model. 

I typically use an abbreviation of the model I'm using such as:

lin_reg = LinearRegression()

In [None]:
scores = cross_val_score(model, X, y, cv=5, scoring='mean_absolute_error')
print(scores.mean())

Setting cv=5 is a good choice. With five folds, every data point is used for testing exactly once and for training in the other four rounds. Mean absolute error is a good metric for evaluating a regression as well, though in current versions of sklearn the scoring string is 'neg_mean_absolute_error'.

Printing scores.mean() gives you a good idea of how the model is performing overall. It may also be helpful to print(scores) to see how the model performed across the different folds.

If it's performing poorly you may want to consider adding more features, using a different type of model, and/or doing a train_test_split instead of just CV. 

You could even do a GridSearchCV, but it will take a long time to run. 
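For reference, a very small GridSearchCV might look like the sketch below. LinearRegression has few tunable settings, so the grid here only toggles `fit_intercept`; the data is synthetic and generated in the block, so nothing here reflects results on the salary data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with a clear non-zero intercept (invented for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

# Try each parameter combination with 5-fold cross-validation.
search = GridSearchCV(
    LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # best mean MAE across folds
```

Run time grows with the number of parameter combinations times the number of folds, which is why larger grids on bigger models can be slow.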

Good work, hopefully you have a good model!