## Appendix 2: Some Frustrations

Briefly, there were parts of this competition which rubbed me the wrong way. 

Kaggle is an excellent concept, a wonderful group of folks, and a justly renowned educational platform. But it's not perfect. I really wish they had better explanations, tutorials, and the like for navigating their tools from the perspective of someone who didn't major in computer science. It's not at all obvious, for example, how to proceed when the feedback you get is 'CUDA is out of memory', 'notebook threw exception' (with no further details), or 'internet must be disabled' (even though Kaggle doesn't otherwise make the module available). Multiple times I simply had to give up on model variations because, despite running cleanly in Jupyter, they wouldn't run in Kaggle; or, despite running in Kaggle, they wouldn't run in the submission. Based on the discussion boards, mods' contributions, and site updates, they seem to be aware, at least, that some of these things are adversely affecting a lot of users, but so far none of the answers provided have worked for me (or else I was applying them incorrectly -- a very real possibility, but if I, a motivated and well-educated user, can't figure them out, doesn't that just reinforce my point?).

In any event, my biggest frustration was with the dataset itself. I'll explore that briefly here, and please let me know what you think!

In [1]:
import pandas as pd
gt = pd.read_csv('train.csv')

In [2]:
gt.head(5)

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
0,0013cc385424,007ACE74B050,"Hi, i'm Isaac, i'm going to be writing about h...",Lead,Adequate
1,9704a709b505,007ACE74B050,"On my perspective, I think that the face is a ...",Position,Adequate
2,c22adee811b6,007ACE74B050,I think that the face is a natural landform be...,Claim,Adequate
3,a10d361e54e4,007ACE74B050,"If life was on Mars, we would know by now. The...",Evidence,Adequate
4,db3e453ec4e2,007ACE74B050,People thought that the face was formed by ali...,Counterclaim,Adequate


So, here we see the blocks of text, the discourse type, and the corresponding score assigned. Nothing appears off. Given that this dataset is a subset of another Kaggle dataset from a previous competition, it has presumably been checked and double-checked by hundreds of people. But looking more closely we find: 

In [3]:
gt['length'] = gt.discourse_text.apply(len)  # character count, including spaces

In [4]:
gt['words'] = gt.discourse_text.apply(lambda x: len(x.split()))  # whitespace-separated word count

In [5]:
Characters = 'This sentence, including spaces and punctuation marks, is of length:'
char_count = len(Characters)
Alternate = 'The shorter example.'
cc2 = len(Alternate)
print(Characters, char_count)
print(f"'{Alternate}' (Excluding this annotation and the quotation marks, this sentence has length {cc2}) ")

This sentence, including spaces and punctuation marks, is of length: 68
'The shorter example.' (Excluding this annotation and the quotation marks, this sentence has length 20) 


Okay, so a sentence of fairly low complexity runs to about 70 characters, while a sentence of near-maximal brevity -- one that is still complete and grammatically correct -- is around 20. 

My instinct is that, given the rubric's criteria, it would be hard to be considered adequate using fewer characters than my example sentence above. Well...

In [6]:
gt[gt.length < 20]

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,length,words
11,cc921c5cfda4,00944C693682,stress.,Claim,Adequate,8,1
37,7a01d9cb379a,013B9AA6B9DB,well it is not.,Rebuttal,Ineffective,16,4
46,f1d3e589dbbe,0158970BC5D2,easy plagiarism,Claim,Effective,16,2
409,dfbcba259ead,09133053C474,It's fun,Claim,Adequate,9,2
452,1ab1030c639a,0A5B8761B187,Disagree,Position,Ineffective,9,1
...,...,...,...,...,...,...,...
36323,fc80beab495d,E018497ED277,low experience,Claim,Ineffective,15,2
36324,5db66047d029,E018497ED277,less communicate,Claim,Adequate,17,2
36549,a54c4c66b7cb,EDDFFD34DBD4,opinions.,Claim,Ineffective,10,1
36591,e751b1501bda,F52B9A0882BB,learn what to do.,Claim,Adequate,18,4


265 example texts have fewer than 20 characters! As we can see above, in many cases these are not even complete *phrases*, much less sentences remotely resembling the examples given. 
Nevertheless, these were largely rated 'Adequate'! For example: 

In [7]:
print(gt['discourse_effectiveness'][11], '\n')

print(f"The entirety of observation eleven's discourse Text: '{gt['discourse_text'][11]}'  ")

Adequate 

The entirety of observation eleven's discourse Text: 'stress. '  


The entirety of this item's discourse_text is the word 'stress.' (Period in original). Let's recall what the rubric has to say about a 'claim' and how it is assessed: 

>Adequate: 
>- Description: The claim relates to the position but may simply repeat part of the position or state a claim without support. The claim is moderately valid and acceptable.
>- Example: Position: "Every individual owes it to themselves to think seriously about important matters, no matter the difficulty". Claim: "It is important to think seriously about important matters although some people do not do this".
    
Note that the claim is contextually defined: it should be interpreted in terms of its relationship to the preceding position statement. Below, we can use the essay ID to reconstruct the original essay in its entirety:

In [8]:
import simple_colors
exampleEssay = gt.loc[gt['essay_id']== '00944C693682']
for index, row in exampleEssay.iterrows():
    if row.discourse_text.startswith('With') or row.discourse_text.startswith('stress'):
        row.discourse_text = simple_colors.yellow(row.discourse_text, 'bold')
    print(f'(Discourse Type: {row.discourse_type})','\n', row.discourse_text,'\n', f'(Score= {row.discourse_effectiveness})', '\n')

(Discourse Type: Lead) 
 Limiting the usage of cars has personal and professional support all across the globe and yet it has yet to be embraced everywhere. Statistical proof show where it may help and real life examples of some of the effects of reducing, or getting rid of altogether, cars in one's daily life. While "recent studies suggest that Americans are buying fewer cars, driving less and getting fewer licenses as each year goes by" (Source 4), is that really enough or for the right reason? There are plenty of reasons to stop, or limit, the amount of cars being driven on the roads for every kind of person, from the hippie to the businessman, from the mom to the college student.  
 (Score= Effective) 

(Discourse Type: Position) 
 With so many things in this world that few people agree on, this is a nice change to see in regards the removal of so many cars. Why would they all agree, one might ask. Well, there are plenty of reasons. 
 (Score= Effective) 

(Discourse Ty

So, here we can see that the text is complete in itself. The student wrote: 
'Why would they all agree, one might ask. Well, there are plenty of reasons. Stress.' 

Presumably, they intended something to the effect of a list or heading: 
- 'Why do people agree? Many reasons: stress, exhaustion, ...'

Or:

- 'There are many reasons people agree. Reason one: Stress.'

Or something like that. 

'Well, there are plenty of reasons' is the actual 'claim'; 'stress' is basically a heading. But the former was labeled as part of the 'Position,' the single word 'stress' as the 'Claim,' and the subsequent discussion of car-less communities as the 'Evidence.' 

This, in conjunction with the many examples that follow, leads me to think that the supervisory inputs are contradictory and incompatible with the rubric used to guide this dataset's development. These kinds of errors mislead both the analyst and the learner. In order for any algorithm to learn meaningfully, there must be genuine and consistent distinctions. When a supervised learner gets: 

- Height | Description
- 5'1"   | Tall
- 5'2"   | Tall
- 5'1"   | Short
- 6'1"   | Tall
- ...

there's no 'there' there. 
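
To make that concrete, here is a minimal sketch on toy data that mirrors the height list above (it is not the competition data, and the helper name `label_ceiling` is my own invention). It computes the best accuracy any rule mapping a height to a label could possibly reach once identical inputs carry different labels.

```python
import pandas as pd

# Toy data mirroring the height list above (hypothetical, for illustration only).
toy = pd.DataFrame({
    "height": ["5'1\"", "5'2\"", "5'1\"", "6'1\""],
    "label":  ["Tall",  "Tall",  "Short", "Tall"],
})

def label_ceiling(df, feature, label):
    """Best accuracy a deterministic feature -> label rule can reach on df.

    For each distinct feature value, the best possible guess is its most
    common label, so we count how many rows that guess gets right.
    """
    best_hits = df.groupby(feature)[label].agg(lambda s: s.value_counts().iloc[0])
    return best_hits.sum() / len(df)

ceiling = label_ceiling(toy, "height", "label")
print(f"Best achievable accuracy on the toy data: {ceiling:.2f}")
```

Because 5'1" is labeled both 'Tall' and 'Short', one of those two rows must be wrong no matter what the rule says, so the ceiling here is three out of four.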

And indeed for a human learner it's much the same: a person learning English would infer that the word 'height' must mean *some other feature*, but certainly not how tall people are. The instances being called 'tall' have no relationship with the instances' distance from head to ground. *If* some tangential or spurious link can be discovered, that may be even worse: it would mean the machine's outputs suggest that it shares our meaning in some sense, while in fact it has merely latched onto some completely unrelated factor, invisible without the synchronous computation of a million partial derivatives.

In [9]:
gt[gt['length'] < 60]

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,length,words
11,cc921c5cfda4,00944C693682,stress.,Claim,Adequate,8,1
21,c20937683442,00BD97EA4041,"No because, why should a computer know how you...",Position,Adequate,57,10
33,ed3a833a2f49,013B9AA6B9DB,What is that thing on Mars?,Lead,Adequate,28,6
37,7a01d9cb379a,013B9AA6B9DB,well it is not.,Rebuttal,Ineffective,16,4
40,244544e584aa,013B9AA6B9DB,but in 2001 a newer image was taken,Rebuttal,Adequate,36,8
...,...,...,...,...,...,...,...
36748,1e92924f6555,FF9E0379CD98,dont have partner when are you take the classe...,Claim,Adequate,55,10
36751,f6a04e6a32cc,FF9E0379CD98,when you take the class online you dont have o...,Claim,Adequate,55,10
36756,e8241b9934b7,FF9E0379CD98,but is not bad idea take the class,Rebuttal,Adequate,35,8
36757,74c58fcc7ef8,FF9E0379CD98,you cant work or cant study after school with ...,Evidence,Adequate,55,10


Over 12% of the entire corpus consists of sentences, phrases, fragments, and single words of fewer than 60 characters, which is shorter than my example sentence. Of these, as you can see below, the majority were assessed as having adequately met the description (811 were even 'Effective'!).
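
For anyone who wants to reproduce this kind of figure, here is a rough sketch. The helper `short_text_report` and the tiny `demo` frame are hypothetical stand-ins so the snippet runs without `train.csv`; only the column names come from the dataset.

```python
import pandas as pd

def short_text_report(df, max_chars=60):
    """Return (share of rows shorter than max_chars, label counts among them)."""
    lengths = df["discourse_text"].str.len()
    short = df[lengths < max_chars]
    share = len(short) / len(df)
    return share, short["discourse_effectiveness"].value_counts()

# Tiny hypothetical frame so the sketch runs on its own.
demo = pd.DataFrame({
    "discourse_text": ["stress. ", "It's fun ", "x" * 80, "y" * 120],
    "discourse_effectiveness": ["Adequate", "Adequate", "Effective", "Ineffective"],
})
share, counts = short_text_report(demo)
print(f"{share:.0%} of rows are under 60 characters")
print(counts)
```

On the real data the call would presumably be `short_text_report(gt)`, with `max_chars` adjusted to taste.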

Now, given this is a subset of a former Kaggle competition's dataset, I guessed that perhaps something had been explained in that previous competition that I was not aware of. But that page is still live (https://www.kaggle.com/competitions/feedback-prize-2021/overview/description), and it suggests very plainly that there, too, complete sentences (at least) were expected to make up a discourse element.

In [10]:
len(gt.loc[(gt['length']<60) & (gt['discourse_effectiveness']=='Adequate')])

2989

In [11]:
len(gt.loc[(gt['length']<60) & (gt['discourse_effectiveness']=='Effective')])

811

In [12]:
gt.loc[(gt['length']<60) & (gt['discourse_effectiveness']=='Effective') & (gt['discourse_type']=='Evidence')]

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,length,words
10149,6097449f6081,B6F03D0CEABE,Let them grow up.,Evidence,Effective,18,4
13465,8cfde4ea838e,F2B86853339B,They polute the air and they can be very dange...,Evidence,Effective,52,10
17456,fadb350e65c2,2B568E2031B1,It will make our earth cleaner,Evidence,Effective,31,6


I don't see how 'Let them grow up.' can have effectively met the criteria described above.

In any case, these aren't just oddballs.

Things don't look great when we shift from characters to words, either.

In [13]:
Words = 'This sentence, excluding punctuation marks, contains the following number of words:'
# Count the whitespace-separated words in the sentence itself.
word_count = len(Words.split())
print(Words, word_count)

This sentence, excluding punctuation marks, contains the following number of words: 11


In [14]:
gt[gt['words'] < 10]

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,length,words
11,cc921c5cfda4,00944C693682,stress.,Claim,Adequate,8,1
33,ed3a833a2f49,013B9AA6B9DB,What is that thing on Mars?,Lead,Adequate,28,6
37,7a01d9cb379a,013B9AA6B9DB,well it is not.,Rebuttal,Ineffective,16,4
40,244544e584aa,013B9AA6B9DB,but in 2001 a newer image was taken,Rebuttal,Adequate,36,8
45,e713e8b20cf3,0158970BC5D2,summer projects should be teacher-designed,Position,Adequate,43,5
...,...,...,...,...,...,...,...
36733,3414073ceef9,FEF42864AE28,the lack of socialization. \n,Claim,Effective,28,4
36746,aa1e0cd8e3a9,FF9E0379CD98,dont have a teacher,Claim,Adequate,20,4
36747,0ae154c2131f,FF9E0379CD98,not organized,Claim,Adequate,14,2
36756,e8241b9934b7,FF9E0379CD98,but is not bad idea take the class,Rebuttal,Adequate,35,8


10% of all discourse_texts have fewer than 10 words; most are rated 'Adequate' and, at least among those visible, consist of decontextualized fragments. 
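
To see which discourse types these short texts fall under, one option is a cross-tabulation. The sketch below is only illustrative: `short_by_type` is a hypothetical helper, and the `demo` frame borrows three rows from the outputs above plus one made-up long row.

```python
import pandas as pd

def short_by_type(df, max_words=10):
    """Cross-tabulate discourse type against rating for texts under max_words words."""
    n_words = df["discourse_text"].str.split().str.len()
    short = df[n_words < max_words]
    return pd.crosstab(short["discourse_type"], short["discourse_effectiveness"])

# Three rows taken from the outputs above, plus one invented 11-word row.
demo = pd.DataFrame({
    "discourse_text": ["stress. ", "easy plagiarism ", "well it is not. ",
                       "one two three four five six seven eight nine ten eleven"],
    "discourse_type": ["Claim", "Claim", "Rebuttal", "Evidence"],
    "discourse_effectiveness": ["Adequate", "Effective", "Ineffective", "Adequate"],
})
table = short_by_type(demo)
print(table)
```

The invented 11-word row is filtered out, so only the 'Claim' and 'Rebuttal' rows appear in the table.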

In [15]:
#Example of an effective type Claim: 
gt.discourse_text[46]

'easy plagiarism '

As shown below, none of these are without a corresponding rating, since no ratings are absent from the dataset.

In [16]:
gt.discourse_effectiveness.value_counts()

Adequate       20977
Effective       9326
Ineffective     6462
Name: discourse_effectiveness, dtype: int64

In [17]:
print(len(gt))
print(gt.discourse_effectiveness.value_counts().sum())

36765
36765


Likewise, we can see below that there are duplicates of the discourse_text: 76 to be precise. Since that number is so small, perhaps these are just goofs.

In [18]:
duplicates = gt[gt.discourse_text.duplicated(keep=False)].sort_values(by="discourse_text")
duplicates.head(10)

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness,length,words
26691,7f9c3500259d,A602D45D22B2,"""That's a lava dome that takes the form of an ...",Evidence,Adequate,104,21
27350,d628a6adda3a,ADB68BCD2874,"""That's a lava dome that takes the form of an ...",Evidence,Adequate,104,21
25391,781452d9404c,942ECB176B3A,"At the most basic level, the electoral college...",Position,Adequate,68,12
28835,6fa171a95540,C2BAF4ADA2CA,"At the most basic level, the electoral college...",Claim,Adequate,68,12
28436,9e12ec699196,BB3A6C2D0B65,Big States,Claim,Adequate,11,2
20121,35bf70c4a673,4CA37D113612,Big States,Claim,Ineffective,11,2
3933,c5b2ecb3888e,44E2726DA1B3,I agree,Position,Adequate,8,2
11285,5e4022e93247,CB66B685DAF6,I agree,Position,Adequate,8,2
17087,99782ca26927,2714214F7D9E,I think students should be required to perform...,Position,Adequate,66,10
29590,33d6bbba823c,CE64FA08E4CF,I think students should be required to perform...,Position,Adequate,66,10


But that doesn't seem to be the case. Take a look at the third and fourth entries: they contain exactly the same text, but it has been classified as two different discourse types ('Position' and 'Claim'). Entries five and six show that 'Big States' is both 'Adequate' *and* 'Ineffective'! The last four entries show that 'I agree' is rated as being of equal sophistication to 'I think students should be required to perform...', which (to me) is hard to understand.  
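
Rather than eyeballing the first ten rows, the conflicting duplicates could be pulled out directly. This is a sketch under my own assumptions: `conflicting_duplicates` is a hypothetical helper, and the `demo` frame is a hand-built miniature of the table above, not the real data.

```python
import pandas as pd

def conflicting_duplicates(df, text_col="discourse_text",
                           label_cols=("discourse_type", "discourse_effectiveness")):
    """Return rows whose exact text appears more than once with differing labels."""
    grouped = df.groupby(text_col)
    conflict = pd.Series(False, index=df.index)
    for col in label_cols:
        # True where identical texts carry more than one distinct value in col.
        conflict |= grouped[col].transform("nunique") > 1
    return df[conflict].sort_values(text_col)

# Hand-built miniature of the duplicates table above.
demo = pd.DataFrame({
    "discourse_text": ["Big States ", "Big States ", "I agree ", "I agree "],
    "discourse_type": ["Claim", "Claim", "Position", "Position"],
    "discourse_effectiveness": ["Adequate", "Ineffective", "Adequate", "Adequate"],
})
conflicts = conflicting_duplicates(demo)
print(conflicts)
```

Only the two 'Big States' rows come back, since the two 'I agree' rows agree with each other on both labels.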

Now, we could, I suppose, just remove all of these and continue by dismissing these several thousand cases and focusing on the more intact texts. 

In [19]:
#only effective texts with more than 60 characters.
longer_texts = gt.loc[(gt['length'] > 60) & (gt['discourse_effectiveness']=='Effective') & (gt['discourse_type']== 'Counterclaim')]

#Let's take a look at a couple of (non-random) examples to illustrate my point.
#for i in longer_texts.discourse_text[43]:
 #   print(i, '\n')
    
print(longer_texts.discourse_text.values[36], '\n')    
print(longer_texts.discourse_text.values[26])

Some can argue that teachers can pick just as fascinating topics as students.  

Some people may claim that giving students the ability to design their own project is not "giving" but "forcing". Students may need guidelines in order to work properly and receive a fair grade for their effort. Or worst case, they simply won't do the work. If a student needs a step-by-step process on how to recount their learning, then how will they expect to fend for themselves once they graduate? 


But...

According to the grading rubric, an Effective counterclaim is: 
>...reasonable and relevant. It represents a valid objection to the position.

While an Adequate one is: 

>...not quite a reasonable opposing opinion, or it is not closely relevant to the position. 

And this gets to the second point: who decides what is 'reasonable,' what is 'relevant,' and what is 'valid'?

To me, these two excerpts are just apples and oranges. The first one is obviously shorter, less informative, less thoughtful, rhetorically weaker, and barely constitutes a counterclaim (or at least a counter-*argument*) at all -- it's closer in my mind to a 'Claim.' The latter argument builds a case, legitimizes the point of view, and discusses the legitimate and plausible consequences that worry opposing points-of-view. 

Consider another excerpt, and contrast it with the first example above: 

In [20]:
Adequate_Comparison = gt.loc[(gt['length'] > 60) & (gt['discourse_effectiveness']=='Adequate') & (gt['discourse_type']== 'Counterclaim')]

print(Adequate_Comparison.discourse_text.values[199])

Although these views are correct, there are some who disagree. Some people believe that students don't learn when creating their own projects to do. They also believe that students aren't capable of doing their own project without the assistance of a teacher. 


Again, it's hard for me to see why this text is less effective than the example above. It is longer, more descriptive, and more complex. It is a little clunky with the 'projects to do'-bit, but the above one is clunky as well ('can argue'). In terms of ambiguity, validity, relevance, informativeness, and so on, *I* would judge this example to be a better element than the first. 

Now compare *that* Adequate text with this one: 

In [21]:
print(Adequate_Comparison.discourse_text.values[52])

. annother thing is that there are about 335 horses, so if you are affraid of horses then this program isn't for you. 


This is incorrectly punctuated, uses incorrect capitalization, has multiple misspellings, introduces irrelevance ('335 horses'), is informal ('isn't for you') and -- most importantly-- *isn't a counterclaim*. It's a position, or maybe a claim, but unless the preceding sentence was describing some melange of things a critic might point out, this appears to be an opinion held by the author. How is this comparable to the item above?

I won't beat this dead horse further. My point, though, is that there appears to be a good bit of the messiness and impulsivity that human graders are prone to: being impressed, being in a good mood, being hungry, liking the student, fearing the parent, not wanting the headache... (or the opposite). When you're trying to meet your friends at the bar in time for the game, and you've got a stack of 30,000 essays to label and grade, you just default to 'Adequate'; then, when you see something that is even mildly impressive (or the name of a student you're fond of, etc.), you throw in an occasional 'Effective.' When you're tired or bored or spiteful, or you come across a particularly awful text, you throw it in the 'Ineffective' pile.

What that results in is really 5 strata: 

- 1) Truly bad essays
- 2) Truly good essays
- 3) Truly average essays
- 4) Truly average essays marked good or bad willy-nilly
- 5) Truly good or bad essays marked average by default

And the effect of those last two groups is that the waters become muddied. The machine can pick between 1 and 2, at least pretty well, but since it's trained to recognize as 'Average' examples which are truly effective or ineffective, it just defaults basically *everything* to adequate. Only the most extreme cases of good/bad are labeled as such.
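
One way to see why 'Adequate' is such a tempting default is to compute how often the laziest possible rule, always predicting the most common label, would be right. The sketch below uses only the label counts printed earlier in this appendix; it says nothing about what any particular model actually did.

```python
# Label counts copied from the value_counts() output earlier in this appendix.
counts = {"Adequate": 20977, "Effective": 9326, "Ineffective": 6462}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting '{majority_label}' is right {baseline_accuracy:.1%} of the time")
```

That works out to roughly 57%, so any model that can't find a consistent signal in the other two classes has a strong pull toward this default.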

Notice the indices in the examples above (`values[36]`, `values[199]`, `values[52]`): I haven't gone beyond three digits. Recall that there are over 36,000 texts in this dataset...

I suppose you could make the case that the ambiguity, inconsistency, and so on are part and parcel of this type of data or this type of project, and that the goal isn't necessarily to make the model *good* at evaluating essays, only to make it *comparable to* the average of the graders' judgments. But as a former teacher, I can tell you that humans are quite prone to sloppiness, incoherence, arbitrariness, and inconsistency. Ignoring the 'no there there' problem (i.e., that there *is* no meaningful pattern here, and thus any model will be useless in terms of generalizability), imagine what it would look like to deploy such a model. Do we really want -- or consider it in any way to be progress -- a machine that informs schools, publishing houses, standardized exam makers, and so on, and that 'correctly' marks both dangling clauses full of misspellings *and* reasonable (if stylistically modest) counterclaims as adequate compositions? 

Put differently, it seems to me that any model fit to this dataset would accomplish the exact opposite of what we'd want in such a product. That is, we'd presumably want to outsource to a computer work that humans would do sloppily, lazily, in a biased manner, in an inconsistent manner, and so forth. If my eyes have glazed over after essay 400 and my mind has turned to such a soup that I'm unable to distinguish or discriminate any better than what's shown in these excerpts, no computer should be told to try and capture exactly how I'm thinking in that context! When we've trained, in the past, models on text that is racist, sexist, or otherwise discriminatory, we've ended up with racist and sexist AI. If we train models on text that reflects human errors or inconsistencies or whatever, then we'll have a grader that's just as bad as the error-prone human it's meant to replace. Of course, a neural network *can* discover *any* function -- even a chaotic, disorderly one-- but that doesn't mean it's any *good*. Any deployable model needs to have consistent parameters for identifying good from bad in a generalizable way, and if it is fit to such an idiosyncratic, slapdash dataset, it will yield a correspondingly chaotic framework. Consistently outputting nonsense may be consistent, but it's still nonsense. 

It seems, further, that AI trained on anything other than the best of human judgement will be merely a faster and less accountable source of errors, not a solution to them. We source the substance of AI's 'thought' from human-created data. Only the very best of this data ought to be captured in the algorithms we generate, and those algorithms should only be deployed after we've carefully confirmed this to be the case. That's the only way to ensure consistency, accuracy, and unbiased evaluation. Having mined this dataset for the better part of two months, and having had expectation after expectation disconfirmed in model-testing, I'm not convinced this is that sort of dataset, and I'm skeptical that any model derived from it would be useful in deployment. I certainly wouldn't want to be graded by it. 

I think the intention of this project is very noble and worthwhile, but without better data, I doubt it will do much to ameliorate the limitations of present writing-feedback tools. The task requires both normative and aesthetic judgments, and Thanksgiving dinner with your family is all that's required to prove that there is no objective definition of what constitutes 'soundness', 'validity', 'effectiveness', 'logical coherence', 'evidence', and so on. What's more, even when agreement can be achieved about what works conceptually, there is an almost infinite number of ways in which it may be conveyed. Hemingway and Shakespeare may both be literary geniuses, but there's hardly a single quantifiable thing in their work that indicates that shared capacity. It is not the instantiation on paper, but the affective state brought about in human minds that distinguishes -- and links -- their work.

Effective natural language processing is as much about human minds -- intentions, feelings, inclinations, sarcasm, bias -- as it is about the actual words that are present. This requires sensitivity not only to the structural elements of analysis (linguistics, statistical methods) -- the "processing," if you will -- but also to the "natural": the way humans actually use language, the way one individual's thoughtful remark is another's inconsiderate snarkiness, or the way one person's common sense is another's blatant falsehood. When you pair this with the way that language is inherently ambiguous, contextually dependent, and inseparable from models of the world and one's experiences therein, it seems unlikely that statistical sophistication will be sufficient on its own to achieve what this project sets out to do.

I mentioned above that I thought web-scraping for essays would be a better approach to building what the competition seems to be looking for. I think this would ameliorate at least some of the dataset's issues. For one thing, when an essay is selected to be *featured*, it is likely because that essay is an indisputable example of the score it is meant to illustrate. That level of reliability -- graded once by the tired, overworked grader, then implicitly graded again by the author of the website's post -- would ensure that the fickleness and caprice of the human grader weren't mixed in with the reliable examples. Additionally, the diversity of topic prompts represented in this way would ensure that topical noise didn't confuse the learner, forcing it to focus on structural elements of the essay. Lastly, a 'segment identifier' algorithm could be built alongside the scoring algorithm, allowing the learner to 'recognize' complete instances of essay portions, rather than having the dataset be peppered with single-word elements. 