### Calculating Errors

Here are two datasets that represent two of the examples you have seen in this lesson.  

One dataset is based on the parachute example, and the second is based on the judicial example.  Neither of these datasets is based on real people.

Use the exercises below to assist in answering the quiz questions at the bottom of this page.

In [19]:
import numpy as np
import pandas as pd

jud_data = pd.read_csv('judicial_dataset_predictions.csv')
par_data = pd.read_csv('parachute_dataset.csv')

In [20]:
jud_data.head()

| Unnamed: 0 | defendant_id | actual | predicted |
|---|---|---|---|
| 0 | 22574 | innocent | innocent |
| 1 | 35637 | innocent | innocent |
| 2 | 39919 | innocent | innocent |
| 3 | 29610 | guilty | guilty |
| 4 | 38273 | innocent | innocent |


`1.` Each dataset has an `actual` and a `predicted` column, as in the preview above.  Using **jud_data**, find the overall proportion of errors, as well as the percentage of errors of each type.  Use the results to answer the questions in quiz 1 below.  

**Hint for quiz:** an error is any case where the prediction doesn't match the actual value.  Errors come in two types, Type I and Type II, and we can drive one type to zero at the cost of maximizing the other.  If we predict all individuals as innocent, how many of the guilty are incorrectly labeled?  Similarly, if we predict all individuals as guilty, how many of the innocent are incorrectly labeled?
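Before working with the real data, you can check the definition from the hint on a small made-up table. The sketch below assumes a hypothetical five-row frame named `toy` (it is not part of either dataset) with the same `actual` and `predicted` columns. Comparing the two columns gives a Boolean mask, and the mean of that mask is the proportion of errors.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'actual':    ['innocent', 'innocent', 'guilty', 'guilty', 'innocent'],
    'predicted': ['innocent', 'guilty', 'guilty', 'innocent', 'innocent'],
})

# An error is any row where the prediction differs from the actual label
toy_error_rate = (toy['actual'] != toy['predicted']).mean()

# Type I: actually innocent, predicted guilty
toy_type1 = ((toy['actual'] == 'innocent') & (toy['predicted'] == 'guilty')).mean()
# Type II: actually guilty, predicted innocent
toy_type2 = ((toy['actual'] == 'guilty') & (toy['predicted'] == 'innocent')).mean()

print(toy_error_rate, toy_type1, toy_type2)
```

The two error types partition the mismatched rows, so their proportions should add up to the overall error proportion.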

In [21]:
# type 1 error: [actual *innocent*] [predicted *guilty*]
# type 2 error: [actual *guilty*] [predicted *innocent*]

type1 = jud_data.query('actual == "innocent" & predicted == "guilty"').actual.count()
type2 = jud_data.query('actual == "guilty" & predicted == "innocent"').actual.count()
correct = jud_data.query('actual == "guilty" & predicted == "guilty" or \
            actual == "innocent" & predicted == "innocent"').actual.count()
innocent = jud_data.query('actual == "innocent"').actual.count()
guilty = jud_data.query('actual == "guilty"').actual.count()


In [23]:
# each count as a proportion of all rows in the dataset

type1_rate = type1/jud_data.shape[0]
type2_rate = type2/jud_data.shape[0]
innocent_rate = innocent/jud_data.shape[0]
guilty_rate = guilty/jud_data.shape[0]

print('Total Percentage of Errors: ', (type1_rate + type2_rate)*100 )
print('Percentage of Type I Error: ', type1_rate * 100 )
print('Percentage of Type II Error: ', type2_rate * 100 )
print('\nIf everyone was predicted to be guilty') 
print('\tthe percentage of Type I Errors made : ', innocent_rate*100,'\n')
print('\nIf everyone was predicted to be guilty') 
print('\tthe percentage of Type II Errors made : ', 0 )

Total Percentage of Errors:  4.21529589455
Percentage of Type I Error:  0.151036660717
Percentage of Type II Error:  4.06425923383

If everyone was predicted to be guilty
	the percentage of Type I Errors made :  45.1599615543 


If everyone was predicted to be guilty
	the percentage of Type II Errors made :  0


* If everyone were predicted as guilty, every innocent individual would be a Type I Error. What proportion of the dataset is innocent?

* A Type II Error occurs when a guilty individual is predicted as innocent. If everyone is predicted as guilty, how often would we commit a Type II Error?
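The two questions above can be explored with a constant prediction on made-up labels. This is only a sketch: `toy_actual` and the helper `constant_prediction_errors` are hypothetical names introduced here, not part of the exercise data or solution.

```python
import pandas as pd

# Hypothetical toy labels, invented for illustration only
toy_actual = pd.Series(['innocent', 'guilty', 'innocent', 'guilty', 'guilty'])

def constant_prediction_errors(actual, label):
    """Return (Type I, Type II) proportions when every row is predicted as `label`."""
    predicted = pd.Series([label] * len(actual), index=actual.index)
    # Type I: actually innocent, predicted guilty
    type1_prop = ((actual == 'innocent') & (predicted == 'guilty')).mean()
    # Type II: actually guilty, predicted innocent
    type2_prop = ((actual == 'guilty') & (predicted == 'innocent')).mean()
    return type1_prop, type2_prop

all_guilty = constant_prediction_errors(toy_actual, 'guilty')
all_innocent = constant_prediction_errors(toy_actual, 'innocent')
print(all_guilty, all_innocent)
```

Predicting everyone as guilty leaves no Type II Errors but turns every innocent row into a Type I Error; predicting everyone as innocent does the reverse.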

`2.` Using **par_data**, find the overall proportion of errors, as well as the proportion of errors of each type.  Use the results to answer the questions in quiz 2 below.

These should be very similar operations to those you performed in the previous question.

In [3]:
par_data.head()

| Unnamed: 0 | parachute_id | actual | predicted |
|---|---|---|---|
| 0 | 3956 | opens | opens |
| 1 | 2147 | opens | opens |
| 2 | 2024 | opens | opens |
| 3 | 8325 | opens | opens |
| 4 | 6598 | opens | opens |


In [17]:
# type 1 error: [actual: *fails* predicted: *opens*]
# type 2 error: [actual: *opens* predicted: *fails*]

type1 = par_data.query('actual == "fails" & predicted == "opens"').actual.count()
type2 = par_data.query('actual == "opens" & predicted == "fails"').actual.count()
correct = par_data.query('actual == "fails" & predicted == "fails" or \
            actual == "opens" & predicted == "opens"').actual.count()
opens = par_data.query('actual == "opens"').actual.count()
fails = par_data.query('actual == "fails"').actual.count()  # count from par_data, not jud_data

In [25]:
# each count as a proportion of all rows in the dataset

type1_rate = type1/par_data.shape[0]
type2_rate = type2/par_data.shape[0]
opens_rate = opens/par_data.shape[0]
fails_rate = fails/par_data.shape[0]

print('Total Proportion of Errors: ', (type1_rate + type2_rate) )
print('Proportion of Type I Error: ', type1_rate )
print('Proportion of Type II Error: ', type2_rate )
print('\nIf every parachute was predicted to not open,'+
      '\n\tthe proportion of Type I Errors made', 0,'\n')
print('If every parachute was predicted to not open,'+
       '\n\tthe proportion of Type II Errors made', opens_rate)

Total Proportion of Errors:  0.0526676960027
Proportion of Type I Error:  0.00188711614342
Proportion of Type II Error:  0.0507805798593

If every parachute was predicted to not open,
	the proportion of Type I Errors made 0 

If every parachute was predicted to not open,
	the proportion of Type II Errors made 0.991765311374


* If we predict all of the parachutes to fail, we would never commit a Type I Error.

* If we predict all of the parachutes to fail, we would commit a Type II Error on every parachute that actually opens.
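As a cross-check on the query-based counts, the same proportions can be read off a cross-tabulation. The sketch below assumes a hypothetical six-row frame named `toy_par` (invented here, not drawn from `parachute_dataset.csv`) and uses `pd.crosstab` with `normalize='all'`, so each cell is a proportion of all rows.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy_par = pd.DataFrame({
    'actual':    ['opens', 'opens', 'fails', 'opens', 'fails', 'opens'],
    'predicted': ['opens', 'fails', 'opens', 'opens', 'fails', 'opens'],
})

# Rows are actual labels, columns are predicted labels;
# normalize='all' divides every cell by the total number of rows
toy_table = pd.crosstab(toy_par['actual'], toy_par['predicted'], normalize='all')

# Type I: actually fails, predicted opens
toy_par_type1 = toy_table.loc['fails', 'opens']
# Type II: actually opens, predicted fails
toy_par_type2 = toy_table.loc['opens', 'fails']

print(toy_table)
print(toy_par_type1, toy_par_type2)
```

The off-diagonal cells are the two error types, and the diagonal cells are the correct predictions, so all four cells sum to 1.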