In [24]:
import pandas as pd
import nltk
import spacy

from tqdm import tqdm

tqdm.pandas()

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
parse = nlp("Don't blink if you don't think.")

In [5]:
[word.tag_ for word in parse]

['VB', 'RB', 'VB', 'IN', 'PRP', 'VBP', 'RB', 'VB', '.']

In [28]:
df = pd.read_excel("realec_errors1.xlsx", sheet_name='контексты', index_col='id')

In [29]:
df['Wrong answer'] = df['Sentence'].apply(lambda x: x[x.find('<b>')+3:x.rfind('</b>')])
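
The slicing above assumes every sentence contains a `<b>…</b>` pair. When the markup is missing, `find` and `rfind` return -1 and the expression silently yields `x[2:-1]`. A minimal sketch of a more defensive extraction follows; the helper name `extract_marked` is hypothetical and not part of the original notebook.

```python
import re

# Greedy match: from the first <b> to the last </b>, mirroring find/rfind above.
MARKED = re.compile(r"<b>(.*)</b>", re.DOTALL)

def extract_marked(sentence):
    """Return the text between the first <b> and the last </b>, or None if absent."""
    if not isinstance(sentence, str):
        return None
    match = MARKED.search(sentence)
    return match.group(1) if match else None

print(extract_marked("I want a <b>bisquit</b>"))  # bisquit
print(extract_marked("no markup here"))           # None
```

It could be plugged in as `df['Sentence'].apply(extract_marked)`, leaving rows without markup as missing values instead of wrong substrings.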

In [30]:
df.head()

id,Sentence,Right answer,Error type,Wrong answer
0,"Will Gardner, a founder of a law firm, inform...",demanded that,that_clause,demanded to
1,"Will Gardner, a founder of a law firm, inform...",she deal,Absence_comp_sent,deal
2,He <b>noticed</b> that it was a very sensitiv...,noted,Often_confused,noticed
3,He noted that it <b>is</b> a very sensitive i...,was,Tense_choice,is
4,He noted that it was a very sensitive issue f...,the defendant,Articles,a defendant


## Choice of tense:

In [6]:
tense_errors = df.loc[df['Error type'] == 'Tense_choice']

In [7]:
tense_errors.head()

id,Sentence,Right answer,Error type,Wrong answer
3,He noted that it <b>is</b> a very sensitive i...,was,Tense_choice,is
5,He noted that it was a very sensitive issue f...,was,Tense_choice,is
10,She explained that she <b>had</b> a Shabbat t...,had had,Tense_choice,had
12,She explained that she had had a Shabbat the ...,happened,Tense_choice,had happened
112,"The bar chart illustrates amount of children, ...",did,Tense_choice,do


In [8]:
tense_errors.shape

(9023, 4)

We keep only those examples whose correct variant contains at least one verb:

In [11]:
def contains_pos(string, pos, parser):
    if string and isinstance(string, str):
        parse = parser(string)
        pos_string = [token.pos_ for token in parse]
        return pos in pos_string
    return False

In [12]:
tense_errors = tense_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
tense_errors['Contains verb'] = tense_errors['Right answer'].apply(lambda x: contains_pos(x, 'VERB', nlp))


In [13]:
tense_errors = tense_errors.loc[tense_errors['Contains verb'] == True]

In [14]:
tense_errors.shape

(8632, 5)

In [15]:
tense_errors.to_excel("Tense_choice1.xlsx")

## Form of tense:

In [16]:
form_tense_errors = df.loc[df["Error type"] == "Tense_form"]

In [17]:
form_tense_errors = form_tense_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
form_tense_errors["Contains verb"] = form_tense_errors["Right answer"].apply(lambda x: contains_pos(x, 'VERB', nlp))


In [18]:
form_tense_errors = form_tense_errors.loc[form_tense_errors['Contains verb'] == True]

In [19]:
form_tense_errors.shape

(1771, 5)

In [20]:
form_tense_errors.to_excel("Tense_form1.xlsx")

## Prepositions:

In [31]:
prep_errors = df.loc[df["Error type"] == "Prepositions"]

In [32]:
prep_errors.shape

(8725, 4)

In [34]:
prep_errors = prep_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
prep_errors["Contains prep"] = prep_errors.progress_apply(lambda x: contains_pos(x['Right answer'], 'ADP', nlp) or
                                                 contains_pos(x['Wrong answer'], 'ADP', nlp),
                                                 axis=1)

100%|█████████████████████████████████████████████████████████████████████████████| 8725/8725 [00:44<00:00, 197.06it/s]


In [35]:
bad_preps = prep_errors.loc[prep_errors["Contains prep"] == False]
prep_errors = prep_errors.loc[prep_errors["Contains prep"] == True]

In [48]:
def pos_seq(string, parser):
    parse = parser(string)
    return [token.pos_ for token in parse]

In [57]:
pos_seq("if", nlp)

['SCONJ']

In [49]:
parse = nlp("You decided to cut the tree")

In [55]:
parse[1].morph.to_json()

['VerbForm_fin']

In [36]:
bad_preps.to_excel("Bad prepositions.xlsx")

In [25]:
prep_errors.shape

(8585, 5)

In [26]:
prep_errors

id,Sentence,Right answer,Error type,Wrong answer,Contains prep
34,Considerable social and financial resources a...,to,Prepositions,on,True
56,5) No sooner had Alex knocked <b>the door</b>...,at the door,Prepositions,the door,True
63,"For example, Russian company “Aeroflot” under...",per,Prepositions,in,True
65,There is the map of current flights <b>on</b>...,in,Prepositions,on,True
81,The given bar chart compares data <b>on</b> ac...,about,Prepositions,on,True
110,"Overall, there is a downward trend in numbers...",access to primary education,Prepositions,access primary education,True
120,This difference estimates 10 million in compa...,with,Prepositions,of,True
124,It is a fact that budget airlines companies s...,with,Prepositions,from,True
126,It is underiable that travelling activities b...,to,Prepositions,for,True
151,In Africa from 2000 <b>till</b> 2012 there wa...,to,Prepositions,till,True


In [27]:
prep_errors.to_excel("Prepositions1.xlsx")

## Prepositional noun:

We use the spaCy model en_core_web_sm, which is based on the <a href="https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md">ClearNLP dependency label list</a>. Unlike Universal Dependencies, where a preposition cannot have children, CLEAR treats the preposition as the head of the prepositional phrase.

We need a noun that has a preposition as a dependent:

In [28]:
def is_prepositional_noun(string, parser):
    if string and isinstance(string, str):
        parse = parser(string)
        for nphrase in parse.noun_chunks:
            children_pos = [child.pos_ for child in nphrase.root.children]
            if 'ADP' in children_pos:
                return True
    return False

In [29]:
is_prepositional_noun("This is my favourite way of travelling", nlp)

True

In [30]:
parse = nlp("This is my favourite way of travelling")

In [31]:
for token in parse:
    print(token.i, token.text, token.pos_, token.dep_, token.head.i)

0 This DET nsubj 1
1 is VERB ROOT 1
2 my DET poss 4
3 favourite ADJ amod 4
4 way NOUN attr 1
5 of ADP prep 4
6 travelling VERB pcomp 5
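
To make the check concrete without loading a model, the parse printed above can be written out as plain tuples. This is only a sketch: the helper name `heads_with_adp_child` is hypothetical, and it tests the head's POS tag directly, whereas `is_prepositional_noun` looks at the roots of spaCy noun chunks.

```python
# Tokens copied from the parse printed above: (index, text, POS, dependency, head index).
TOKENS = [
    (0, "This", "DET", "nsubj", 1),
    (1, "is", "VERB", "ROOT", 1),
    (2, "my", "DET", "poss", 4),
    (3, "favourite", "ADJ", "amod", 4),
    (4, "way", "NOUN", "attr", 1),
    (5, "of", "ADP", "prep", 4),
    (6, "travelling", "VERB", "pcomp", 5),
]

def heads_with_adp_child(tokens, head_pos):
    """Return texts of tokens tagged `head_pos` that have a direct ADP child."""
    pos_by_index = {i: pos for i, _, pos, _, _ in tokens}
    text_by_index = {i: text for i, text, _, _, _ in tokens}
    found = []
    for i, _, pos, _, head in tokens:
        # A preposition whose head carries the requested POS tag.
        if pos == "ADP" and head != i and pos_by_index[head] == head_pos:
            if text_by_index[head] not in found:
                found.append(text_by_index[head])
    return found

print(heads_with_adp_child(TOKENS, "NOUN"))  # ['way']
print(heads_with_adp_child(TOKENS, "VERB"))  # []
```

Because "of" attaches to "way" under the CLEAR scheme, the noun is found directly as the preposition's head.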


In [33]:
is_prepositional_noun("This is my favourite fruit", nlp)

False

In [34]:
prep_noun_errors = df.loc[df["Error type"] == "Prepositional_noun"]

In [35]:
prep_noun_errors.shape

(502, 4)

In [36]:
prep_noun_errors = prep_noun_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
prep_noun_errors["Is prepositional noun"] = prep_noun_errors.apply(lambda x: is_prepositional_noun(x['Right answer'],nlp) or
                                                                  is_prepositional_noun(x['Wrong answer'],nlp),
                                                                  axis=1)


In [37]:
prep_noun_errors = prep_noun_errors.loc[prep_noun_errors["Is prepositional noun"] == True]

In [38]:
prep_noun_errors.shape

(421, 5)

In [39]:
prep_noun_errors.to_excel("Prepositional_noun1.xlsx")

## Prepositional verb:

We need a verb that has a preposition among its direct dependents (in the CLEAR scheme the preposition attaches to the verb and heads its own object):

In [40]:
parse = nlp("I would like to go to Spain")

In [41]:
for token in parse:
    print(token.i, token.text, token.pos_, token.dep_, token.head.i)

0 I PRON nsubj 2
1 would VERB aux 2
2 like VERB ROOT 2
3 to PART aux 4
4 go VERB xcomp 2
5 to ADP prep 4
6 Spain PROPN pobj 5


In [42]:
def is_prepositional_verb(string, parser):
    if string and isinstance(string, str):
        parse = parser(string)
        for token in parse:
            if token.pos_ == 'VERB':
                if 'ADP' in [child.pos_ for child in token.children]:
                    return True
    return False

In [43]:
prep_verb_errors = df.loc[df["Error type"] == "Prepositional_verb"]

In [44]:
prep_verb_errors.shape

(795, 4)

In [45]:
prep_verb_errors = prep_verb_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
prep_verb_errors["Is prepositional verb"] = prep_verb_errors.apply(lambda x: is_prepositional_verb(x['Right answer'], nlp) or
                                                                  is_prepositional_verb(x['Wrong answer'], nlp),
                                                                  axis=1)


In [46]:
prep_verb_errors = prep_verb_errors.loc[prep_verb_errors["Is prepositional verb"] == True]

In [47]:
prep_verb_errors.to_excel("Prepositional_verb1.xlsx")

## Standard word order:

The correct and incorrect variants should contain the same lemmas and differ only in their order, and the correction should contain more than 3 words:

In [48]:
std_word_order_errors = df.loc[df["Error type"] == "Standard"]

In [49]:
std_word_order_errors.shape

(2847, 4)

In [50]:
def n_tokens(string, parser):
    if string and isinstance(string, str):
        return len(parser(string))
    return 0

In [51]:
std_word_order_errors = std_word_order_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
std_word_order_errors["Correction length"] = std_word_order_errors["Right answer"].apply(lambda x: n_tokens(x, nlp))


In [52]:
std_word_order_errors = std_word_order_errors.loc[std_word_order_errors["Correction length"]>=4]

In [53]:
std_word_order_errors.shape

(1876, 5)

In [54]:
std_word_order_errors.to_excel("Standard word order1.xlsx")
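
The cells above enforce only the length condition. The other condition from the start of this section (same lemmas, different order) might be sketched as follows. As an assumption, lowercased whitespace tokens stand in for lemmas here. With spaCy one would compare `token.lemma_` values instead. The function name `is_pure_reordering` is hypothetical.

```python
from collections import Counter

def is_pure_reordering(wrong, right, min_tokens=4):
    """True if both strings hold the same multiset of tokens in a different order.

    Lowercased whitespace tokens are a stand-in for lemmas (an assumption).
    """
    wrong_tokens = wrong.lower().split()
    right_tokens = right.lower().split()
    if len(right_tokens) < min_tokens:
        return False
    same_bag = Counter(wrong_tokens) == Counter(right_tokens)
    return same_bag and wrong_tokens != right_tokens

print(is_pure_reordering("always he is late here", "he is always late here"))  # True
print(is_pure_reordering("he is late here", "he was late here"))               # False
```

The default `min_tokens=4` matches the `>= 4` threshold used for "Correction length".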

## Word order:

In [55]:
word_order_errors = df.loc[df["Error type"] == "Word_order"]

In [56]:
word_order_errors.shape

(741, 4)

In [57]:
word_order_errors = word_order_errors.copy()  # explicit copy, so the assignment below does not trigger SettingWithCopyWarning
word_order_errors["Correction length"] = word_order_errors["Right answer"].apply(lambda x: n_tokens(x, nlp))


In [58]:
word_order_errors = word_order_errors.loc[word_order_errors["Correction length"] >= 4]

In [59]:
word_order_errors.shape

(413, 5)

In [60]:
word_order_errors.to_excel("Word_order1.xlsx")

## Lex item choice & Word choice

We keep the examples in which the correction is a single word and record the POS tag of that word:

In [18]:
wrong_words = df[(df['Error type'] == 'Word_choice')|(df['Error type'] == 'lex_item_choice')]

In [6]:
wrong_words.head()

id,Sentence,Right answer,Error type,Wrong answer
7,Alicia and her colleague Kalinda <b>imagined<...,thought,lex_item_choice,imagined
9,Alicia and her colleague Kalinda thought that...,simple,lex_item_choice,usual
17,He <b>expected</b> Alicia to go away and not ...,urged,lex_item_choice,expected
22,Alprin offered to use the First Amendment in ...,refer,lex_item_choice,apply
25,In a court Alprin persuaded the judge to <b>u...,place the decision with,Word_choice,use a help of


In [7]:
def n_words(string):
    if string and isinstance(string, str):
        return len(nltk.word_tokenize(string))
    return 0

In [19]:
wrong_words = wrong_words.copy()  # explicit copy, so the assignments below do not trigger SettingWithCopyWarning
wrong_words['N words right'] = wrong_words['Right answer'].apply(n_words)
wrong_words['N words wrong'] = wrong_words['Wrong answer'].apply(n_words)


In [21]:
wrong_words = wrong_words[(wrong_words['N words right'] == 1) & (wrong_words['N words wrong'] == 1)]

In [22]:
wrong_words.shape

(13965, 6)

In [23]:
def correct(x):
    start_id = x['Sentence'].find('<b>')
    end_id = x['Sentence'].rfind('</b>') + len('</b>')
    corrected = x['Sentence'][:start_id]+x['Right answer']+x['Sentence'][end_id:]
    return corrected

In [24]:
def get_err_word_id(x):
    tokenized = nltk.word_tokenize(x['Sentence'])
    for word_id, word in enumerate(tokenized[:-2]):
        if word == '<' and tokenized[word_id+1] == 'b' and tokenized[word_id+2] == '>':
            return word_id

In [25]:
def pos_tag_error(row):
    word_index = get_err_word_id(row)
    corrected = correct(row)
    erroneous = row['Sentence'].replace('<b>','').replace('</b>','')
    try:
        error_pos_tag = nltk.pos_tag(nltk.word_tokenize(erroneous))[word_index][1]
    except (IndexError, TypeError):  # TypeError: word_index is None when no <b> tag was found
        print(row['Sentence'], nltk.word_tokenize(erroneous), word_index)
        error_pos_tag = None
    try:
        correction_pos_tag = nltk.pos_tag(nltk.word_tokenize(corrected))[word_index][1]
    except (IndexError, TypeError):  # same fallback for the corrected sentence
        print(row['Sentence'], nltk.word_tokenize(corrected), word_index)
        correction_pos_tag = None
    return error_pos_tag, correction_pos_tag

In [26]:
nltk.pos_tag(nltk.word_tokenize("I want a <b>bisquit</b>"))

[('I', 'PRP'),
 ('want', 'VBP'),
 ('a', 'DT'),
 ('<', 'NN'),
 ('b', 'NN'),
 ('>', 'NNP'),
 ('bisquit', 'NN'),
 ('<', 'NNP'),
 ('/b', 'NNP'),
 ('>', 'NN')]

In [27]:
# progress bar
from tqdm import tqdm

# register DataFrame.progress_apply
tqdm.pandas()

In [28]:
wrong_words[['error POS', 'correction POS']] = wrong_words.progress_apply(pos_tag_error, axis=1,
                                                                                   result_type="expand")

100%|███████████████████████████████████████████████████████████████████████████| 13965/13965 [00:39<00:00, 350.18it/s]


In [29]:
wrong_words

id,Sentence,Right answer,Error type,Wrong answer,N words right,N words wrong,error POS,correction POS
7,Alicia and her colleague Kalinda <b>imagined<...,thought,lex_item_choice,imagined,1,1,VBD,VBD
9,Alicia and her colleague Kalinda thought that...,simple,lex_item_choice,usual,1,1,JJ,JJ
17,He <b>expected</b> Alicia to go away and not ...,urged,lex_item_choice,expected,1,1,VBD,VBD
22,Alprin offered to use the First Amendment in ...,refer,lex_item_choice,apply,1,1,VB,VB
28,But after a break in a trial the prosecutor <...,submitted,lex_item_choice,revealed,1,1,VBD,VBD
31,But after a break in a trial the prosecutor s...,negligent,lex_item_choice,indifferent,1,1,JJ,JJ
33,Considerable social and financial resources a...,allocated,lex_item_choice,sent,1,1,VBN,VBN
38,That's why there should be intercept car park...,underground,lex_item_choice,subway,1,1,VB,NN
41,An other thing we have to mention is that the...,number,lex_item_choice,large,1,1,JJ,NN
44,But the main thing is just to start doing som...,risen,lex_item_choice,come,1,1,VBN,VBN


In [30]:
wrong_words.to_excel("Wrong words.xlsx")