# Modeling

The [RealWaste](https://archive.ics.uci.edu/dataset/908/realwaste) dataset already includes a "Miscellaneous" class, but for the sake of demonstration we will also group textiles into this "other" class to show how to configure Class Aggregation with the DataRobot API.

In [2]:
import datarobot as dr 

# Create your client
dr.Client(config_path="/Users/luke.shulman/.config/datarobot/drconfig.yml")



<datarobot.rest.RESTClientObject at 0x106b1b040>

In [3]:


project = dr.Project.create("data/real_waste_data_sampled.csv", project_name="Multiclass Example", max_wait=1000)


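We create a `ClassMappingAggregationSettings` helper object and then pass it to `set_target`. With these settings, the five classes listed in `excluded_from_aggregation` are kept as distinct classes and the remaining classes are combined into a final class, "DR_OTHER".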

In [4]:
class_mapping = dr.helpers.ClassMappingAggregationSettings(
    max_unaggregated_class_values=6,
    excluded_from_aggregation=["Food Organics", "Glass", "Metal", "Paper", "Plastic"],
    aggregation_class_name="DR_OTHER",
)

project.set_target(
    target="main_class",
    target_type="Multiclass",
    class_mapping_aggregation_settings=class_mapping,
    mode="quick",
)



Project(Multiclass Example)
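
To make the effect of these settings concrete, here is a minimal local sketch of the relabeling that class aggregation amounts to: any label outside the kept list is replaced with the aggregation class name. It is a plain-Python illustration, not DataRobot's implementation. The `aggregate_labels` helper and the sample labels are hypothetical.

```python
from collections import Counter

# Classes kept as-is (mirrors excluded_from_aggregation above).
KEEP = ["Food Organics", "Glass", "Metal", "Paper", "Plastic"]
OTHER = "DR_OTHER"  # mirrors aggregation_class_name above


def aggregate_labels(labels, keep=KEEP, other=OTHER):
    """Hypothetical helper: relabel every class not in `keep` as `other`."""
    keep_set = set(keep)
    return [label if label in keep_set else other for label in labels]


# Made-up sample of raw labels, for illustration only.
raw = ["Glass", "Textile Trash", "Paper", "Vegetation", "Cardboard", "Glass"]
aggregated = aggregate_labels(raw)
print(Counter(aggregated))
```

The counts show the kept classes unchanged and everything else collected under the single "DR_OTHER" label.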

In [5]:
# Optional: blocks execution until Autopilot is complete
project.wait_for_autopilot()

In progress: 3, queued: 0 (waited: 0s)
In progress: 3, queued: 0 (waited: 1s)
In progress: 3, queued: 0 (waited: 2s)
In progress: 3, queued: 0 (waited: 3s)
In progress: 3, queued: 0 (waited: 4s)
In progress: 3, queued: 0 (waited: 7s)
In progress: 3, queued: 0 (waited: 11s)
In progress: 2, queued: 0 (waited: 18s)
In progress: 2, queued: 0 (waited: 31s)
In progress: 1, queued: 0 (waited: 52s)
In progress: 8, queued: 4 (waited: 72s)
In progress: 8, queued: 4 (waited: 93s)
In progress: 8, queued: 1 (waited: 113s)
In progress: 2, queued: 0 (waited: 134s)
In progress: 0, queued: 0 (waited: 154s)
In progress: 0, queued: 0 (waited: 175s)
In progress: 0, queued: 0 (waited: 196s)
In progress: 0, queued: 0 (waited: 217s)
In progress: 1, queued: 0 (waited: 237s)
In progress: 1, queued: 0 (waited: 258s)
In progress: 1, queued: 0 (waited: 278s)
In progress: 1, queued: 0 (waited: 299s)
In progress: 1, queued: 0 (waited: 320s)
In progress: 0, queued: 0 (waited: 340s)
In progress: 0, queued: 0 (waited:

## Evaluation of the Models

With the models built, we can examine how the first model returned by `project.get_models()` performs on the validation data.


In [9]:
best_model = project.get_models()[0]


con_chart = best_model.get_confusion_chart(source=dr.enums.CHART_DATA_SOURCE.VALIDATION,  fallback_to_parent_insights=True)

In [10]:
import altair as alt
import pandas as pd 

dtc = pd.DataFrame([{'class_name': c['class_name'], 'actual_count':c['actual_count'], 'predicted_count':c['predicted_count']} for c in  con_chart.class_metrics])
dtc = dtc.melt(id_vars=['class_name'], value_vars=['actual_count', 'predicted_count'], var_name="Test")
dtc


alt.Chart(dtc, title="Actual vs. Predicted").mark_bar().encode(
    alt.Column('class_name:N'),
    alt.Color('Test:N'),
    alt.X('Test:N', axis=alt.Axis(labels=False, ticks=False), title=None),
    alt.Y('value:Q', title='Number of Instances'),
)
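
The two bars per class compare how often a class actually occurs with how often the model predicts it. As a reminder of where such counts come from, here is a small sketch that derives them from a raw confusion matrix (rows are actual classes, columns are predicted classes). The matrix values and the `counts_from_confusion_matrix` helper are hypothetical and are not taken from this project.

```python
def counts_from_confusion_matrix(classes, matrix):
    """Hypothetical helper: per-class actual and predicted counts.

    matrix[i][j] is the number of rows whose actual class is classes[i]
    and whose predicted class is classes[j].
    """
    result = []
    for idx, name in enumerate(classes):
        actual = sum(matrix[idx])                    # row sum
        predicted = sum(row[idx] for row in matrix)  # column sum
        result.append({
            "class_name": name,
            "actual_count": actual,
            "predicted_count": predicted,
        })
    return result


# Made-up 3-class example, for illustration only.
classes = ["Glass", "Paper", "DR_OTHER"]
matrix = [
    [8, 1, 1],   # actual Glass
    [0, 7, 3],   # actual Paper
    [2, 1, 17],  # actual DR_OTHER
]
counts = counts_from_confusion_matrix(classes, matrix)
for row in counts:
    print(row)
```

A class whose predicted count is well above its actual count is being over-predicted, and one whose predicted count is well below is being under-predicted.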



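Let's examine the class probabilities to understand how we might apply thresholding. The summary below multiplies every numeric column by 100 so that the probabilities read as percentages. Note that this also scales `row_id`, so the statistics for that column are not meaningful.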


In [11]:
try:
    prediction_job = best_model.request_training_predictions(data_subset='validation')
    train_preds = prediction_job.get_result_when_complete() 
except dr.errors.ClientError:
    all_training_predictions = dr.TrainingPredictions.list(project.id)
    train_preds = [tp for tp in all_training_predictions if tp.model_id == best_model.id][0]


df = train_preds.get_all_as_dataframe()

In [12]:
(df * 100).describe()

| | row_id | class_DR_OTHER | class_Food Organics | class_Glass | class_Metal | class_Paper | class_Plastic |
|---|---|---|---|---|---|---|---|
| count | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 |
| mean | 25658.75 | 36.354732 | 11.570927 | 13.468319 | 12.543796 | 17.039046 | 9.023179 |
| std | 14451.040817 | 40.01291 | 28.782546 | 27.668728 | 24.714265 | 29.796766 | 19.128345 |
| min | 2100.0 | 0.010644 | 0.002624 | 0.00266 | 0.00031 | 0.005502 | 0.001873 |
| 25% | 12100.0 | 2.385992 | 0.156653 | 0.141372 | 0.107705 | 0.380689 | 0.216106 |
| 50% | 25800.0 | 13.737556 | 0.618354 | 1.321964 | 0.828448 | 2.173581 | 1.122535 |
| 75% | 40075.0 | 77.595617 | 3.645029 | 6.935243 | 9.934916 | 13.554327 | 4.68512 |
| max | 48700.0 | 99.938464 | 99.845302 | 98.895675 | 98.957574 | 99.932086 | 84.018409 |
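
One way to choose a threshold is to check how many rows would still receive a confident prediction at each candidate value. The sketch below does this for a few rows of class probabilities. `coverage_at` is an illustrative helper, and the probabilities are made up rather than taken from the validation predictions.

```python
def coverage_at(rows, threshold):
    """Hypothetical helper: fraction of rows whose top probability exceeds threshold."""
    confident = sum(1 for probs in rows if max(probs.values()) > threshold)
    return confident / len(rows)


# Made-up class probabilities (each row sums to 1), for illustration only.
rows = [
    {"DR_OTHER": 0.90, "Glass": 0.06, "Paper": 0.04},
    {"DR_OTHER": 0.55, "Glass": 0.40, "Paper": 0.05},
    {"DR_OTHER": 0.10, "Glass": 0.80, "Paper": 0.10},
    {"DR_OTHER": 0.34, "Glass": 0.33, "Paper": 0.33},
]

for threshold in (0.5, 0.75, 0.9):
    print(f"threshold={threshold:.2f} coverage={coverage_at(rows, threshold):.2f}")
```

Raising the threshold trades coverage for confidence: fewer rows pass, but those that do carry a higher top-class probability.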


For illustration, let's set a threshold of 75% and flag every row where at least one class probability exceeds it.

In [15]:
THRESHOLD = 0.75
class_columns = [col for col in df.columns if col.startswith('class_')]


df['high_confidence'] = (df[class_columns] > THRESHOLD).any(axis=1).astype(int)

# Display the result (note: this covers only the validation subset)
df[['prediction', 'high_confidence']].groupby('prediction').sum()

| prediction | high_confidence |
|---|---|
| DR_OTHER | 22 |
| Food Organics | 8 |
| Glass | 7 |
| Metal | 5 |
| Paper | 9 |
| Plastic | 2 |
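
A threshold like this is often used to decide which predictions to accept automatically and which to send for manual review. The following sketch illustrates that idea on hypothetical rows. The `route_prediction` helper and the "NEEDS_REVIEW" label are assumptions made for illustration and are not part of the DataRobot client.

```python
THRESHOLD = 0.75
REVIEW_LABEL = "NEEDS_REVIEW"  # hypothetical label for low-confidence rows


def route_prediction(probs, threshold=THRESHOLD):
    """Hypothetical helper: return the top class if confident, else REVIEW_LABEL."""
    top_class = max(probs, key=probs.get)
    return top_class if probs[top_class] > threshold else REVIEW_LABEL


# Made-up class probabilities, for illustration only.
sample_rows = [
    {"DR_OTHER": 0.92, "Glass": 0.05, "Metal": 0.03},
    {"DR_OTHER": 0.48, "Glass": 0.47, "Metal": 0.05},
    {"DR_OTHER": 0.08, "Glass": 0.12, "Metal": 0.80},
]
decisions = [route_prediction(probs) for probs in sample_rows]
print(decisions)
```

The second row is nearly a tie between two classes, so it is routed to review instead of being assigned its marginally higher class.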


### Class Names

The following cell outputs the class names, which are needed for the next step.



In [14]:
import json 
class_names = df.prediction.value_counts().index.tolist()
print(json.dumps(class_names, indent=1))

[
 "DR_OTHER",
 "Paper",
 "Glass",
 "Food Organics",
 "Metal",
 "Plastic"
]
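
Note that `value_counts()` only returns classes that were predicted at least once in this subset. If every class is needed regardless of whether it was predicted, one alternative is to derive the names from the `class_` probability columns instead. The sketch below uses a hard-coded column list, copied from the summary table above, as a stand-in for `df.columns`. The `class_names_from_columns` helper is hypothetical.

```python
import json

PREFIX = "class_"


def class_names_from_columns(columns, prefix=PREFIX):
    """Hypothetical helper: strip the prefix from every class probability column."""
    return [col[len(prefix):] for col in columns if col.startswith(prefix)]


# Stand-in for list(df.columns).
columns = [
    "row_id",
    "prediction",
    "class_DR_OTHER",
    "class_Food Organics",
    "class_Glass",
    "class_Metal",
    "class_Paper",
    "class_Plastic",
]
all_class_names = class_names_from_columns(columns)
print(json.dumps(all_class_names, indent=1))
```

Unlike the frequency-ordered list above, this version follows column order and does not depend on which classes happen to appear in the predictions.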
