## EDA on NPOV Edits of Wikipedia Articles

The NPOV edits are the collection of edits extracted from the NPOV
corpus, which consists of 7464 Wikipedia articles in the "NPOV
disputes" category together with their history of revisions. The edits
include changes involving strings of up to five words that occurred
between a pair of consecutive revisions.
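To make the idea of a short edit between two consecutive revisions concrete, here is a minimal sketch built on Python's standard `difflib`. It is not the procedure used to build the corpus: the function name `short_edits`, the whitespace tokenization, and the restriction to replaced spans are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

MAX_WORDS = 5  # the corpus description mentions strings of up to five words


def short_edits(before, after, max_words=MAX_WORDS):
    """Return (before_span, after_span) pairs for replaced word spans.

    Hypothetical helper: splits on whitespace and keeps only replacements
    where both sides are at most `max_words` words long.
    """
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) <= max_words and (j2 - j1) <= max_words:
            edits.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return edits


# Made-up sentence pair, not taken from the corpus.
old_revision = "officials moved to control the moth"
new_revision = "officials moved to eradicate the moth"
print(short_edits(old_revision, new_revision))
```

Each returned pair corresponds to what the column description calls the before form and the after form of an edit.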

The edits are released as tab-separated-value (TSV) files. Each line
contains an edit with the following information split across 10
tab-delimited columns: 

1. Title of the Wikipedia article the edit comes from. 
2. Revision number.
3. True if the revision text contains an NPOV tag (e.g., {{POV}}), and
false otherwise.
4. True if the edit comment contains the string "POV", and false
otherwise.
5. ID of the editor responsible for that revision.
6. Size of the revision. It can take three values: minor, major or
unknown (if unspecified).
7. String modified by the edit (i.e., before form).
8. String resulting from the edit (i.e., after form).
9. Original sentence containing the before form (i.e., the string in column 7).
10. Original sentence containing the after form (i.e., the string in column 8).
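The ten columns listed above can be given descriptive names at load time, which makes later cells easier to read than positional indices. The sketch below is only an illustration: the column names are hypothetical (the files themselves have no header row), the two sample rows are made up, and `quoting=csv.QUOTE_NONE` rests on the assumption that quote characters in the files are literal text rather than CSV-style quoting.

```python
import csv
import io

import pandas as pd

# Hypothetical column names, one per item in the list above.
COLUMN_NAMES = [
    "title", "revision_id", "has_npov_tag", "comment_has_pov", "editor_id",
    "revision_size", "before", "after", "before_sentence", "after_sentence",
]

# Two made-up rows in the same 10-column layout (not taken from the corpus).
sample_tsv = (
    "Example article\t1001\tFalse\tFalse\t192.0.2.1\tUNKNOWN\t"
    "control\teradication\tOfficials chose control.\tOfficials chose eradication.\n"
    "Example article\t1002\tTrue\tFalse\t192.0.2.1\tMINOR\t"
    "that\tthan\tMore that half.\tMore than half.\n"
)

edits = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=COLUMN_NAMES,
    quoting=csv.QUOTE_NONE,  # assumption: quotes in the file are literal text
)
print(edits[["title", "revision_id", "before", "after"]])
```

With named columns, an expression such as `edits["revision_size"].value_counts()` reads more clearly than the positional equivalent. The cells that follow keep the default integer column labels.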

In [1]:
import pandas as pd

In [30]:
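# Load the headerless TSV file; malformed rows are skipped silently.
# on_bad_lines="skip" (pandas >= 1.3) replaces error_bad_lines/warn_bad_lines, which were removed in pandas 2.0.
df = pd.read_csv('B:/Dataset/npov-edits/npov-edits/5gram-edits-dev.tsv', delimiter='\t', header=None, engine="c", on_bad_lines="skip")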

In [31]:
df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Light brown apple moth controversy | 194566002 | False | False | 66.81.112.123 | UNKNOWN | contain its spread, | eradicate the moth, | Quarantine measures and aerial spraying were i... | Quarantine measures and aerial spraying of cit... |
| 1 | Light brown apple moth controversy | 194568060 | False | False | 66.81.112.123 | UNKNOWN | control | eradication | Following the DNA confirmation of LBAM, offici... | Following the DNA confirmation of LBAM, offici... |
| 2 | Light brown apple moth controversy | 194568881 | False | False | 66.81.112.123 | UNKNOWN | pheromone | pheromones | In 2007 and 2008, an aerial eradication progra... | In 2007 and 2008, an aerial eradication progra... |
| 3 | Light brown apple moth controversy | 195808797 | False | False | 66.81.112.206 | UNKNOWN | purposes to catch. | purposes. | These are the same traps used for population m... | These are the same traps used for population s... |
| 4 | Light brown apple moth controversy | 195809017 | False | False | 66.81.112.206 | UNKNOWN | that | than | `<ref name="titleMoth-eaten plans">{{cite web \|...` | `<ref name="titleMoth-eaten plans">{{cite web \|...` |


In [32]:
df.shape

(146204, 10)

In [33]:
df.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [34]:
# True if the revision text contains an NPOV tag, False otherwise (column 3 in the list above)
df[2].value_counts() 

False    107995
True      38209
Name: 2, dtype: int64

In [35]:
# True if the edit comment contains the string "POV", False otherwise (column 4 in the list above)
df[3].value_counts()

False    144707
True       1497
Name: 3, dtype: int64
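The two flag counts above are easier to compare as proportions. A minimal sketch of that arithmetic, using the counts printed above rather than the file itself; the helper name `share_true` is hypothetical:

```python
# Counts copied from the two value_counts() outputs above.
npov_tag_counts = {False: 107995, True: 38209}
pov_comment_counts = {False: 144707, True: 1497}


def share_true(counts):
    """Fraction of rows where the flag is True."""
    total = sum(counts.values())
    return counts[True] / total


print(f"NPOV tag in revision text: {share_true(npov_tag_counts):.1%}")
print(f'"POV" in edit comment:     {share_true(pov_comment_counts):.1%}')
```

Both dictionaries sum to 146204, the row count reported by `df.shape`. Roughly 26% of the edits come from revisions that carry an NPOV tag, while only about 1% have "POV" in the edit comment. On the DataFrame itself, `df[2].value_counts(normalize=True)` returns the same proportions directly.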

In [37]:
# revision size (column 6 in the list above)
df[5].value_counts()

UNKNOWN    104406
MINOR       41798
Name: 5, dtype: int64
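The column description lists three possible sizes, but only UNKNOWN and MINOR appear in the output above. A quick way to make such gaps explicit is to compare the documented values with the observed ones. This is a sketch on a toy Series standing in for `df[5]`; the upper-case spelling `MAJOR` is an assumption based on how the other two values are spelled.

```python
import pandas as pd

# Assumption: documented sizes are upper case, like the values observed above.
DOCUMENTED_SIZES = {"MINOR", "MAJOR", "UNKNOWN"}

# Toy stand-in for df[5]; the real column comes from the TSV file.
sizes = pd.Series(["UNKNOWN", "MINOR", "UNKNOWN", "MINOR", "UNKNOWN"])

observed = set(sizes.unique())
missing = DOCUMENTED_SIZES - observed      # documented but never seen
unexpected = observed - DOCUMENTED_SIZES   # seen but not documented

print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```

An empty `unexpected` set suggests the column was parsed cleanly; a non-empty one would hint at shifted or malformed rows.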

In [38]:
# number of edits recorded per revision ID
df[1].value_counts()

35324420     394
35677539     384
200235132    175
240082821    133
148385466    132
219887237    131
471407576    126
41055397     122
318329425    121
366002319    113
289451479    106
289455782    106
366504618    105
366496192    101
32872800      99
461544755     92
461536868     91
148197168     90
148196769     90
405595488     89
465268474     88
100409279     87
213818890     79
142402981     74
489803277     73
417809954     73
142401801     73
458255112     72
245630504     72
99793743      72
            ... 
233671053      1
365224308      1
416917897      1
365724024      1
362580345      1
438380877      1
454765017      1
44507404       1
8710689        1
201853501      1
128188988      1
172148367      1
429367580      1
231624994      1
320239911      1
388529949      1
499972643      1
429780447      1
480962869      1
387914207      1
394904068      1
229511678      1
398082548      1
263427724      1
297185770      1
414280005      1
191283684      1
98000458      