In [1]:
import pandas as pd

For this project, you will classify angry comments into their respective categories of anger. The process that you'll need to follow is (roughly):
<ol>
<li> Use NLP techniques to process the training data. 
<li> Train model(s) to predict which class(es) each comment is in.
    <ul>
    <li> A comment can belong to any number of classes, including none. 
    </ul>
<li> Generate predictions for each of the comments in the test data. 
<li> Write your test data predictions to a CSV file, which will be scored. 
</ol>
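The steps above can be sketched end to end as follows. This is a minimal sketch and not the expected solution: it assumes scikit-learn is installed, picks TF-IDF features with one logistic regression per label as just one of many possible approaches, and uses an invented toy dataset with a hypothetical comment_text column purely for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for the real training data. The rows and the "comment_text"
# column name are assumptions made for this illustration.
train = pd.DataFrame({
    "comment_text": [
        "thanks for the helpful edit",
        "you are a stupid idiot",
        "i will find you and hurt you",
        "what a load of filthy crap",
        "great article, well sourced",
        "you disgusting idiot, i will hurt you",
        "people like you should not exist",
        "please see the talk page",
    ],
    "toxic":         [0, 1, 1, 1, 0, 1, 1, 0],
    "severe_toxic":  [0, 0, 0, 0, 0, 1, 0, 0],
    "obscene":       [0, 0, 0, 1, 0, 1, 0, 0],
    "threat":        [0, 0, 1, 0, 0, 1, 0, 0],
    "insult":        [0, 1, 0, 0, 0, 1, 0, 0],
    "identity_hate": [0, 0, 0, 0, 0, 0, 1, 0],
})

# Steps 1 and 2: turn text into TF-IDF features, then fit one binary
# classifier per label so a comment can receive any number of labels.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train["comment_text"], train[LABELS].to_numpy())

# Step 3: predict a 0/1 value for every label of every test comment.
test = pd.DataFrame({
    "id": ["a1", "b2"],  # hypothetical ids
    "comment_text": ["you idiot", "nice work on the page"],
})
pred = model.predict(test["comment_text"])

# Step 4 starts from a frame shaped like this: id first, then one column per label.
predictions = pd.DataFrame(pred, columns=LABELS)
predictions.insert(0, "id", test["id"].to_numpy())
print(predictions)
```

With only eight training rows the predictions themselves mean very little; the point is the shape of the workflow and of the resulting frame.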

You can use any models and NLP libraries you'd like. Think about the problem, look back through what we've covered to see if there's anything that might help, give it a try, and see if it improves your results. We've regularly said we have a "toolkit" of things that we can use, and we generally don't know in advance which ones we'll need. Here, though, you have a pretty simple goal: if it makes your predictions more accurate, it helps. There's no single correct solution; there are lots of things that you could do.

## Output Details, Submission Info, and Example Submission

For this project, please output your predictions in a CSV file. The structure of the CSV file should match the structure of the example below. 

The output should contain one row for each row of test data, complete with the columns for ID and each classification.

Into Moodle please submit:
<ul>
<li> Your notebook file(s). I'm not going to run them, just look. 
<li> Your sample submission CSV. This will be evaluated for accuracy against the real labels; only a subset of the predictions will be scored. 
</ul>

It is REALLY, REALLY, REALLY important that the structure of your output matches the specifications. The accuracies will be calculated by a script, and it is expecting a specific format. 
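A quick self-check before submitting can catch most format problems. The sketch below is only an illustration: the function name find_format_problems is hypothetical, and the rules it enforces (all seven columns present, unique ids, label values limited to 0 and 1) are assumptions inferred from the example submission rather than an official specification.

```python
import pandas as pd

EXPECTED_COLUMNS = ["id", "toxic", "severe_toxic", "obscene",
                    "threat", "insult", "identity_hate"]

def find_format_problems(df):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Without the right column names, the remaining checks cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    label_cols = EXPECTED_COLUMNS[1:]
    if not df[label_cols].isin([0, 1]).all().all():
        problems.append("label values other than 0 and 1")
    return problems

# A well-formed one-row frame, and a copy with a misspelled column name.
good = pd.DataFrame([["dfasdf234", 0, 0, 0, 0, 0, 0]], columns=EXPECTED_COLUMNS)
bad = good.rename(columns={"toxic": "Toxic"})

print(find_format_problems(good))
print(find_format_problems(bad))
```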

### Sample Evaluator

The file prediction_evaluator.ipynb contains an example scoring function, scoreChecker. This function takes a submission and an answer key, loops through the rows, and evaluates the accuracy. You can use it to verify the format of your submission. I'm going to use the same function to evaluate the accuracy of your submission against the answer key (unless I made some mistake in this counting).

### Evaluator of Comment Accuracy

In [2]:
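def scoreChecker(df_sub, df_correct):
    # Pair each submitted row with its answer-key row by id. Shared column names
    # get the pandas default suffixes: _x for the submission, _y for the key.
    check_full = pd.merge(df_sub, df_correct, how='inner', left_on='id', right_on='id')
    corr_cols = ["toxic_y", "severe_toxic_y", "obscene_y", "threat_y", "insult_y", "identity_hate_y"]
    sub_cols = ["toxic_x", "severe_toxic_x", "obscene_x", "threat_x", "insult_x", "identity_hate_x"]
    first_val = "toxic_y"
    correct = 0
    total = 0
    for index, sub_row in check_full.iterrows():
        # Only rows whose key value is non-negative are scored; the rest are skipped.
        if sub_row[first_val] >= 0:
            total += len(corr_cols)
            # One point for each label that matches the key exactly.
            for i in range(len(corr_cols)):
                if sub_row[sub_cols[i]] == sub_row[corr_cols[i]]:
                    correct += 1

    # Matching labels, labels scored, and the merged frame for inspection.
    return correct, total, check_full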

In [3]:
submission = pd.read_csv("out.csv")
key = pd.read_csv("sample_correct.csv")

In [4]:
correct, total, df_res = scoreChecker(submission, key)
print(correct, total, (correct*100/total))
df_res.head()

21 24 87.5


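| | id | toxic_x | severe_toxic_x | obscene_x | threat_x | insult_x | identity_hate_x | toxic_y | severe_toxic_y | obscene_y | threat_y | insult_y | identity_hate_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dfasdf234 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | asdfgw43r52 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | asdgtawe4 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| 3 | wqtr215432 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |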
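The merged frame returned by scoreChecker can also show which labels a model struggles with. The sketch below is an optional extra, not part of the grading: the helper name per_label_accuracy is hypothetical, and the two small frames simply reproduce the sample values shown above.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def per_label_accuracy(check_full):
    """Fraction of scored rows where the submission matches the key, per label."""
    # Same filter as scoreChecker: only rows with a non-negative key are scored.
    scored = check_full[check_full["toxic_y"] >= 0]
    return pd.Series({
        label: (scored[label + "_x"] == scored[label + "_y"]).mean()
        for label in LABELS
    })

ids = ["dfasdf234", "asdfgw43r52", "asdgtawe4", "wqtr215432"]
sub = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 1], [0, 0, 1, 0, 1, 1], [0, 0, 0, 1, 0, 0]],
    columns=LABELS,
)
sub.insert(0, "id", ids)
key = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1], [0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 0, 0]],
    columns=LABELS,
)
key.insert(0, "id", ids)

# Merging on id adds the _x (submission) and _y (key) suffixes, as in scoreChecker.
merged = pd.merge(sub, key, how="inner", on="id")
accuracy = per_label_accuracy(merged)
print(accuracy)
```

For the sample data, toxic, obscene, and threat each have one mismatch out of four rows, which accounts for the three misses in the 21 of 24 total.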


In [None]:
#Construct dummy data for a sample output. 
#You won't need this step: you have real predictions, and I'm faking them here. 
#Your data should have the same structure, so that the CSV output matches.
dummy_ids = ["dfasdf234", "asdfgw43r52", "asdgtawe4", "wqtr215432"]
dummy_toxic = [0,0,0,0]
dummy_severe = [0,0,0,0]
dummy_obscene = [0,1,1,0]
dummy_threat = [0,1,0,1]
dummy_insult = [0,0,1,0]
dummy_ident = [0,1,1,0]
columns = ["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
sample_out = pd.DataFrame( list(zip(dummy_ids, dummy_toxic, dummy_severe, dummy_obscene, dummy_threat, dummy_insult, dummy_ident)),
                    columns=columns)
sample_out.head()

In [None]:
#Write DF to CSV. Please keep the "out.csv" filename. Moodle will auto-preface it with an identifier when I download it. 
#This command should work with your dataframe of predictions. 
sample_out.to_csv('out.csv', index=False)
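Because scoreChecker compares submitted values to the key with ==, the CSV needs hard 0/1 labels rather than probabilities. If your model produces probabilities, they can be thresholded before writing the file. The sketch below is a hedged illustration: the probabilities are made up, and the 0.5 cutoff is an assumption that you may want to tune per label.

```python
import numpy as np
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
THRESHOLD = 0.5  # assumed cutoff, not a requirement of the assignment

ids = ["dfasdf234", "asdfgw43r52"]
# Made-up probabilities, one row per comment and one column per label.
probs = np.array([
    [0.91, 0.10, 0.40, 0.02, 0.55, 0.01],
    [0.05, 0.00, 0.03, 0.01, 0.02, 0.00],
])

# Convert probabilities to hard 0/1 labels, then put the id column first.
hard = (probs >= THRESHOLD).astype(int)
out = pd.DataFrame(hard, columns=LABELS)
out.insert(0, "id", ids)
print(out)
```

A frame built this way can be written with the same to_csv call shown above.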