Ever since I started off on my own as a freelance patent illustrator I've wanted to find a better way to provide my clients with accurate estimates for projects. A typical project starts off with the client sending me disclosure about a new patent idea. This disclosure can be in a variety of different formats: photos, prototypes, rough sketches, detailed sketches, written descriptions, and full-blown 3D models. And depending on the type of patent case it is, a project can consist of one single page, or it can consist of over 100 pages. Some pages will consist of only simple diagrams that don't take long at all compared, for example, to a cross section of an engine drawn out over multiple pages. In some instances, depending on the detail level of the disclosure, I will actually build a 3D model of the invention in order to have all the detail required.

So as you can imagine, a lot goes into estimating a project. It starts out with me getting an understanding of the invention and what we're trying to communicate with the drawings. So when a new project comes in, work that I'm doing on other projects has to be halted so that I can send out an estimate on this new project. This is where Data Science and specifically Machine Learning comes in. 

After discovering the world of Data Science I started to think about my business differently. I started collecting data on anything that I could think of that might be of value. I ended up finding a few good things that I could record about my various projects: the type of case (it's always either a **design** or **utility** patent), the **number of pages** in the final package, and whether or not I had to **build a 3D model**. I collected this data just so I could analyze things whenever I became curious about a particular aspect of my business.

Then I got into machine learning and realized that I could potentially leverage this data to not only look at the past, but to predict the future of my projects. I got excited about the idea of building a model that would allow me to provide clients with accurate estimates that would take a fraction of the time. So that's what this first post is about. I thought it might be interesting to document my journey with this project as I collect more data and improve the model. As more data comes in, I will update with a new post. Thanks for checking it out.

# Loading the Libraries for EDA

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import pandas as pd
import numpy as np
import seaborn as sns
from plotnine import *
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Importing the Data

In [None]:
df = pd.read_csv('../input/projects.csv')

In [None]:
df.shape

# Exploratory Data Analysis

## Variable Identification

In [None]:
df.head()

Here we can see the variables of the dataset. Our dependent variable will be hours as this is what we're trying to predict. Let's take a look at the datatypes for each variable.

In [None]:
df.info()

We have 3 categorical independent variables, and 1 continuous independent variable.

## Univariate Analysis

#### Case Type Variable

In [None]:
df['case_type'].value_counts()

In [None]:
(df['case_type'].value_counts() / len(df)).plot.bar()

Here we can see that utility cases make up about 70% of this dataset.
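As a side note, the proportions plotted above can also be obtained directly from pandas. This is a minimal sketch on a made-up four-row Series (not the real dataset), showing that `value_counts(normalize=True)` gives the same shares as dividing the counts by the number of rows.

```python
import pandas as pd

# Made-up values for illustration only; the real column is df['case_type']
case_type = pd.Series(['utility', 'utility', 'design', 'utility'])

# Share of each case type, equivalent to value_counts() / len(series)
shares = case_type.value_counts(normalize=True)
manual = case_type.value_counts() / len(case_type)

print(shares)
```

On the real data this would be `df['case_type'].value_counts(normalize=True).plot.bar()`.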

#### Number of Pages

In [None]:
df['number_pages'].value_counts().sort_index().plot.line()

Here we can see that the majority of the projects in this dataset have under 10 pages.

#### 3D Modeling

In [None]:
df['3d_modeling'].value_counts()

In [None]:
(df['3d_modeling'].value_counts() / len(df)).plot.bar()

Here we see that for this dataset only about 20% of the projects require me to create a 3D model.

#### Target Variable (hours)

In [None]:
df['hours'].value_counts().sort_index().plot.line()

Here we can see that most of the projects in this dataset take around 7 hours to complete. It is pretty rare to have a case that goes over 7 hours.

## Bi-variate Analysis

#### Case Type vs Number of Pages (Categorical vs. Categorical)

In [None]:
type_pages = pd.crosstab(index = df['number_pages'],
                        columns = df['case_type'])
type_pages

In [None]:
type_pages.plot(kind = 'bar',
               figsize = (8,8),
               stacked=True)

Here we can see that for both utility and design cases the majority seem to be in the 5 - 8 page range. Also, utility projects appear to be more spread out, whereas design projects tend to mainly fall within that 5 - 8 range. This makes sense to me as someone with domain knowledge about this kind of work.

For example, I know that for design projects the end goal is to have a package that describes the design of a new invention from every angle. Typically there will be an orthographic (isometric) perspective drawing, and then a front view elevation drawing, rear, left and right side, and top and bottom. Each view tends to get its own page -- with occasional exceptions -- and so I'll usually end up with 7 pages.

Utility projects, on the other hand, can really be any number of pages. I was actually interested to see the distribution of the number of pages for utility cases, as without this visual I would not have been able to guess its shape.

#### Case Type vs 3D Modeling (Categorical vs. Categorical)

In [None]:
type_modeling = pd.crosstab(index = df['3d_modeling'],
                        columns = df['case_type'])
type_modeling

In [None]:
type_modeling.plot(kind = 'bar',
               figsize = (8,8),
               stacked=True)

Here we can see that 3D modeling is more often used for design projects. Again, because of my domain knowledge this makes complete sense to me. As I mentioned above, design projects are all about seeing a new invention from all angles, and sometimes the disclosure I receive from the client is lacking in its descriptiveness. This is where modeling things in 3D first will allow me to work much faster, and more accurately, than if I tried to wing it.

As utility matters really only have to work on a 2D plane, it's pretty rare that I need to incorporate any 3D modeling in those projects. It is interesting to see just how much (proportionally within this sample) I'm actually using this option.

#### Case Type vs Hours (Categorical vs. Continuous)

In [None]:
df.boxplot(column="hours",
           by= "case_type",
           figsize= (8,8))

Here we get a sense of the time required to finish each type of project. We also see that there are some outliers, but given my knowledge of those matters I know that they were simply random projects that turned out to be quite lengthy. Because I want to be able to estimate these larger projects, it makes sense to keep them in the set. Coming up with a way to accurately and quickly estimate projects is actually even more important when it comes to large cases, as it can take quite a while to go through each page and try to guess how long it will take to complete.

#### Number of Pages vs 3D Modeling (Categorical vs. Categorical)

In [None]:
pages_modeling = pd.crosstab(index = df['number_pages'],
                        columns = df['3d_modeling'])
pages_modeling

In [None]:
pages_modeling.plot(kind = 'bar',
               figsize = (8,8),
               stacked=True)

Everything here is generally skewed to the right, but we see that a high number of the projects requiring 3D modeling have around 5 to 7 pages. This can be explained by what I mentioned above about design matters being the main type of project to require 3D modeling, and we know that design projects tend to fall in that page count range.

#### Number of Pages vs Hours (Categorical vs. Continuous)

In [None]:
df.boxplot(column="hours",
           by= "number_pages",
           figsize= (8,8))

There seems to be an obvious positive relationship between the **number_pages** and **hours** variables, but there is quite a bit of variance. This illustrates the fact that it isn't necessarily the length of a project, in terms of the number of pages, that dictates the amount of time required to finish because some drawings can be much more complex than others.

#### 3D Modeling vs Hours (Categorical vs. Continuous)

In [None]:
df.boxplot(column="hours",
           by= "3d_modeling",
           figsize= (8,8))

This looks like another effect of the utility versus design relationship I've been discussing. Design projects tend to be more predictable in terms of hours to complete.

## Feature Engineering

Because there is so much variance in the length of time needed to complete a project, I'd like to try some feature engineering. Hopefully this will not only help our model's accuracy, it will also make it easier on me when I'm entering information about a new project that needs estimating. 

Given the fact that the drawing complexity will vary from project to project, it will take longer to complete pages for more complex projects. I'd like to create a new variable called **difficulty** that will assign levels of difficulty based on how long it takes to finish a page for each project. I'll start by creating a temporary vector to hold the values of a project's **hours** divided by its **number_pages**.

In [None]:
df.columns

In [None]:
# Create a new variable that records 'hours' / 'number_pages'
df['hour_page'] = df['hours'] / df['number_pages']

In [None]:
df.hour_page.describe()

Here we can take a look at the spread again. I'd like to divide this range into four equal-width bins that will represent the four difficulty levels.

In [None]:
# Create variables to store location of bin boundaries
hp_min = df.hour_page.min()
hp_max = df.hour_page.max()
hp_range = hp_max - hp_min
hp_bin = hp_range / 4


# Create variables to store location of difficulty bins
level_one = hp_min + hp_bin
level_two = level_one + hp_bin
level_three = level_two + hp_bin

In [None]:
# Create a function that will assign a difficulty to each project
# based on 'hour_page'

def get_difficulty(row):
    hp = row.hour_page
    # Check the bin boundaries from lowest to highest
    if hp < level_one:
        return 1
    elif hp < level_two:
        return 2
    elif hp < level_three:
        return 3
    elif hp >= level_three:
        return 4
    # Fallback for a missing hour_page (NaN fails every comparison above)
    return 0

In [None]:
df['difficulty'] = df.apply(get_difficulty, axis=1)

In [None]:
df.difficulty.value_counts()

In [None]:
df.head()

Having the four options for difficulty will make it easier for me to take a better guess when estimating a new project. Now instead of trying to guess an exact amount of time-per-page, I simply have to narrow my guess to four options. After creating this variable I think I'm ready to get into the modeling stage.
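For comparison, the same equal-width binning can be sketched with `pd.cut`. This is not the code used above, just an alternative under a few assumptions: `assign_difficulty` is a hypothetical helper name, the demo values are made up, and `right=False` is chosen so that each bin includes its lower boundary, like the `>=` comparisons in `get_difficulty`.

```python
import pandas as pd

def assign_difficulty(hour_page, n_levels=4):
    """Split the min-to-max range of hour_page into equal-width bins
    and label them 1..n_levels (hypothetical helper, for illustration)."""
    levels = list(range(1, n_levels + 1))
    # right=False makes the bins left-closed: [lower, upper)
    binned = pd.cut(hour_page, bins=n_levels, labels=levels, right=False)
    # Assumes there are no missing values; NaN cannot be cast to int
    return binned.astype(int)

# Made-up hours-per-page values, not taken from the real dataset
demo = pd.Series([0.5, 1.0, 1.4, 2.2, 3.1, 4.5])
print(assign_difficulty(demo).tolist())  # [1, 1, 1, 2, 3, 4]
```

On the real data this would be called as `assign_difficulty(df['hour_page'])`. Values that fall exactly on a boundary could land in a different bin than with the manual thresholds because of floating point rounding.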

# Modeling

Since I'm working with such a small amount of data, I know it's important to pick the right model. I'm referencing scikit-learn's cheat sheet here http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html to find a good starting point. Based on my data, I've decided to go with Support Vector Regression.

In [None]:
# First I'll import some libraries I know I'll need 
from xgboost import XGBClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error

In [None]:
# I'll make a copy of the dataset so I can refer back to it
df_train = df.copy()

Let's remove some variables that we won't need. **project_id** is simply an identifier, so it won't help the model. **hour_page** can be removed because it is computed directly from the **number_pages** and **hours** variables, which we already have.

In [None]:
# Delete unneeded variables
del df_train['project_id']
del df_train['hour_page']

print('Data shape:', df_train.shape)

### Data Preprocessing

We have two categorical variables that need to be encoded. I'll use LabelEncoder to take care of this.

In [None]:
# Use label encoder on the 'case_type' and '3d_modeling' variables
labelencoder_X = LabelEncoder()

df_train['case_type'] = labelencoder_X.fit_transform(df_train['case_type'])
df_train['3d_modeling'] = labelencoder_X.fit_transform(df_train['3d_modeling'])
# utility = 1, yes = 1

In [None]:
df_train.head()

Here we can confirm that the label encoding correctly converted the categorical variables: a **case_type** of 1 = **utility** (0 = design), and a 1 for **3d_modeling** = **yes**. Next we'll separate our *independent* variables and our *dependent* variable.
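As an aside, the mapping can also be read off the encoder itself rather than from the table. The sketch below assumes the raw values are the strings `'design'`/`'utility'` and `'no'`/`'yes'` (the `'no'` label is a guess), and it uses one encoder per column, with hypothetical names, so that each mapping stays available after fitting.

```python
from sklearn.preprocessing import LabelEncoder

# One encoder per column (hypothetical names); the string values are assumed
case_encoder = LabelEncoder().fit(['utility', 'design', 'utility'])
modeling_encoder = LabelEncoder().fit(['no', 'yes', 'no'])

# classes_ is sorted, and a label's position in it is its integer code
case_mapping = {str(label): code
                for code, label in enumerate(case_encoder.classes_)}
modeling_mapping = {str(label): code
                    for code, label in enumerate(modeling_encoder.classes_)}

print(case_mapping)      # {'design': 0, 'utility': 1}
print(modeling_mapping)  # {'no': 0, 'yes': 1}

# Codes can be turned back into labels when reading predictions
print(case_encoder.inverse_transform([1, 0]))
```

Keeping a separate encoder per column is a design choice: reusing a single `LabelEncoder` works, but each new `fit_transform` call overwrites the previous column's classes.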

In [None]:
# Create X and y arrays for the dataset
X = df_train[['case_type', 'number_pages', '3d_modeling', 'difficulty']].copy()
y = df_train['hours'].values

In [None]:
y.shape

We'll have to reshape the *y* vector into a single column, because the **StandardScaler** we apply to it later expects a 2D array.

In [None]:
y = y.reshape(-1, 1)
y.shape

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Now we have the dataset split into training and test sets.

In [None]:
print('Training set size: {}'.format(len(X_train)))
print('Test set size: {}'.format(len(X_test)))

### Fitting the Model

In [None]:
# PIPELINE
my_pipeline = make_pipeline(StandardScaler(), SVR(kernel = 'linear'))

my_pipeline.fit(X_train, y_train)
y_pred = my_pipeline.predict(X_test)

In [None]:
svr_pipeline_score = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error for SVR: {}'.format(svr_pipeline_score))
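For reference, the mean absolute error is just the average of the absolute differences between the actual and predicted hours. Here is a minimal sketch of the calculation with NumPy, using made-up numbers; `mae` is a hypothetical helper, not part of scikit-learn. Both arrays are flattened first because `y_test` here is a column vector while the predictions are 1D, and subtracting those directly would broadcast to a full matrix.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error computed by hand (illustrative helper)."""
    # Flatten so a (n, 1) column and a (n,) vector line up element by element
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up values: actual hours as a column vector, predictions as 1D
demo_true = np.array([[4.0], [7.0], [10.0]])
demo_pred = np.array([5.0, 7.0, 8.5])

# Absolute errors are 1.0, 0.0 and 1.5, so the mean is 2.5 / 3
print(mae(demo_true, demo_pred))
```

An MAE of 1.0 would mean the estimates are off by one hour on average, which makes the score easy to relate to a client quote.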

Let's see if we can improve on the model by using **Grid Search** to find the best hyperparameters to use.

### Grid Search

In [None]:
# Feature Scaling
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# SVR expects a 1D target, so flatten the scaled column vector
y_train = sc_y.fit_transform(y_train).ravel()

regressor = SVR(kernel = 'linear')

# Applying Grid Search to find the best model and the best parameters
parameters = [{'C': [1, 5, 10, 15, 20], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = regressor,
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)

# best_score_ is the negated MAE, measured on the standardized target
best_score = grid_search.best_score_
best_parameters = grid_search.best_params_

print('Best Score (negative MAE, scaled units): {}'.format(best_score))
print('Best Parameters: {}'.format(best_parameters))
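One thing to keep in mind when reading the grid search output: because `y_train` was standardized before the search, the best score is a negated MAE in standardized units, not in hours. A rough sketch of converting it back follows, assuming the target scaler is a fitted `StandardScaler` like `sc_y`. The helper name and the demo numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scaled_neg_mae_to_hours(neg_mae_scaled, target_scaler):
    """Convert a 'neg_mean_absolute_error' score computed on a standardized
    target back into the target's original units (illustrative helper)."""
    # Standardizing divides every error by the standard deviation, so
    # multiplying by scale_ undoes it; the sign flip undoes the negation
    return -neg_mae_scaled * float(target_scaler.scale_[0])

# Made-up target values, only to have a fitted scaler for the demo
demo_hours = np.array([[2.0], [4.0], [6.0], [8.0]])
demo_scaler = StandardScaler().fit(demo_hours)

# A hypothetical best_score_ of -0.5 in scaled units
print(scaled_neg_mae_to_hours(-0.5, demo_scaler))
```

With the notebook's variables this would be `scaled_neg_mae_to_hours(grid_search.best_score_, sc_y)`, which makes the cross-validated score comparable to the MAE printed for the pipeline.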

### Applying New Hyperparameters

In [None]:
# Create X and y arrays for the dataset
X = df_train[['case_type', 'number_pages', '3d_modeling', 'difficulty']].copy()
y = df_train['hours'].values
y = y.reshape(-1, 1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


# PIPELINE
my_pipeline = make_pipeline(StandardScaler(), SVR(kernel = 'linear', C = 10))

my_pipeline.fit(X_train, y_train)
y_pred = my_pipeline.predict(X_test)

In [None]:
# Print out the mean absolute error of the updated model
svr_updated_pipeline_score = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error for updated SVR: {}'.format(svr_updated_pipeline_score))

Selecting these new hyperparameters has improved our model by reducing the **mean absolute error** from around **0.97** to **0.68**. Let's see if we can improve the model even more by using the popular **XGBoost** method.

### XGBoost

In [None]:
# Fitting XGBoost to the Training set
# 'hours' is a continuous target, so use the regressor rather than the classifier
from xgboost import XGBRegressor
xgb_regressor = XGBRegressor()
xgb_regressor.fit(X_train, y_train.ravel())

# Predicting the Test set results
y_pred = xgb_regressor.predict(X_test)

# Mean absolute error of the predicted 'hours' on the test set
xg_score = mean_absolute_error(y_test, y_pred)
print('Mean Absolute Error for XGBoost: {}'.format(xg_score))

The **XGBoost** model was not as effective as our updated **SVR** model.

# Conclusion

That was a lot of fun! It is extremely satisfying to use some data that I've collected about my business to actually improve efficiency. Based on the scores of the model, I feel confident enough to begin using this to assist me on future project estimations. However, it is important to note that I'm dealing with quite a small dataset here. Even though I've tested the model with some hypothetical new project parameters and haven't really gotten any estimates that I found unhelpful, I will always take the model's prediction with a grain of salt. I will continue to tweak and improve my model as I collect more data. Hopefully after I've collected enough data, I can begin to use **k-Fold Cross Validation** to get a better sense of my model's performance.
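To give an idea of what that k-fold evaluation might look like, here is a minimal sketch using scikit-learn's `KFold` and `cross_val_score`. The data is synthetic and generated only so the example runs on its own; the column names mirror the ones in this notebook, and the number of folds and the formula behind the fake target are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data, NOT the real projects dataset
rng = np.random.default_rng(0)
n = 40
X_demo = pd.DataFrame({
    'case_type': rng.integers(0, 2, n),
    'number_pages': rng.integers(1, 21, n),
    '3d_modeling': rng.integers(0, 2, n),
    'difficulty': rng.integers(1, 5, n),
})
# Arbitrary made-up relationship plus noise, just to have a target
y_demo = (0.3 * X_demo['number_pages'] * X_demo['difficulty']
          + rng.normal(0, 0.5, n))

pipeline = make_pipeline(StandardScaler(), SVR(kernel='linear', C=10))
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports the negated MAE, so flip the sign for readability
neg_mae = cross_val_score(pipeline, X_demo, y_demo,
                          scoring='neg_mean_absolute_error', cv=folds)
fold_mae = -neg_mae

print('MAE per fold:', fold_mae.round(2))
print('Mean MAE: {:.2f}'.format(fold_mae.mean()))
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no information from the held-out fold leaks into the scaling.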

#### Next Steps

At the time of this writing I'm using the model through a simple Python script that prompts me to enter the four pieces of information about a new project. Going forward I'd like to learn about using something like Flask to create an application based on the model.
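The core of such a script could look roughly like the sketch below. It is not my actual script: `estimate_hours` and `FEATURES` are hypothetical names, the training rows are made up purely so the example runs, and the interactive prompts are left out so the logic is easy to test.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

FEATURES = ['case_type', 'number_pages', '3d_modeling', 'difficulty']

def estimate_hours(pipeline, case_type, number_pages, needs_3d, difficulty):
    """Predict hours for one new project (illustrative helper)."""
    # Same encoding as in the notebook: utility = 1, design = 0, yes = 1
    row = pd.DataFrame([{
        'case_type': 1 if case_type == 'utility' else 0,
        'number_pages': number_pages,
        '3d_modeling': 1 if needs_3d else 0,
        'difficulty': difficulty,
    }], columns=FEATURES)
    return float(pipeline.predict(row)[0])

# Made-up training rows, NOT real project data
toy_X = pd.DataFrame([[1, 3, 0, 1], [1, 12, 0, 3], [0, 7, 1, 2],
                      [0, 7, 0, 1], [1, 20, 0, 4], [0, 6, 1, 2]],
                     columns=FEATURES)
toy_y = [2.0, 14.0, 7.0, 4.0, 30.0, 6.5]
toy_pipeline = make_pipeline(StandardScaler(),
                             SVR(kernel='linear', C=10)).fit(toy_X, toy_y)

# Estimate a 7-page design project that needs a 3D model, difficulty 2
print(estimate_hours(toy_pipeline, 'design', 7, True, 2))
```

In an interactive version, the four arguments would come from `input()` prompts, and the fitted pipeline would be the one trained on the real data.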

Thanks for taking the time to check out this kernel and dataset. Please feel free to leave a comment if you have any feedback about anything I could do differently or of ways to improve. Cheers!