This is a simple exploratory approach to identifying the words that only one of the three authors uses. I have not yet figured out whether this is a reasonable and valid step before any more sophisticated modeling; it is just my first attempt to get familiar with the data. So let's import the necessary libraries first:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer

Get the training set:

In [None]:
train_df = pd.read_csv("../input/train.csv")

Build subsets for each author:

In [None]:
eap = train_df[train_df["author"] == "EAP"]
hpl = train_df[train_df["author"] == "HPL"]
mws = train_df[train_df["author"] == "MWS"]

Combine all text snippets into one large string per author:

In [None]:
# Join with a space so that the last word of one snippet and the
# first word of the next are never glued into a single token.
all_eap = " ".join(eap["text"])
all_hpl = " ".join(hpl["text"])
all_mws = " ".join(mws["text"])
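The choice of separator matters whenever a snippet does not end in punctuation. The following is a minimal sketch on two invented snippets (not taken from the data); it uses `re.findall` with the same `\w+` pattern that the tokenizer further down uses.

```python
import re

# Two invented snippets; the first one has no closing punctuation
snippets = ["He opened the door", "Nothing was there."]

glued = "".join(snippets)    # no separator between snippets
spaced = " ".join(snippets)  # single space between snippets

tokens_glued = re.findall(r"\w+", glued)
tokens_spaced = re.findall(r"\w+", spaced)

print(tokens_glued)   # contains the artificial token 'doorNothing'
print(tokens_spaced)  # 'door' and 'Nothing' stay separate
```

A glued token like `doorNothing` would later look like a word that only one author uses, which is why a space separator is the safer choice.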

Define a simple NLTK tokenizer:

In [None]:
tokenizer = RegexpTokenizer(r'\w+')

Tokenize the complete texts for each author separately:

In [None]:
tokens_eap = tokenizer.tokenize(all_eap)
tokens_hpl = tokenizer.tokenize(all_hpl)
tokens_mws = tokenizer.tokenize(all_mws)
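To get a feeling for what the `\w+` pattern keeps and what it drops, here is a small sketch on an invented sentence. It uses Python's built-in `re.findall` rather than NLTK so that it runs without extra dependencies; for this particular pattern I would expect `tokenizer.tokenize` to return the same list, but treat that as an assumption.

```python
import re

# Invented sample sentence, not taken from the training data
sample = "It was a dark, stormy night; the raven's cry didn't stop."

sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
# ['It', 'was', 'a', 'dark', 'stormy', 'night', 'the',
#  'raven', 's', 'cry', 'didn', 't', 'stop']
```

Punctuation is dropped, case is preserved, and apostrophes split words into fragments such as `didn` and `t`. These fragments end up in the token lists like any other word.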

Build token sets for all possible author pairs:

In [None]:
set_eap_hpl = set(tokens_eap + tokens_hpl)
set_eap_mws = set(tokens_eap + tokens_mws)
set_hpl_mws = set(tokens_hpl + tokens_mws)

For each author, keep only the tokens that neither of the other two authors uses:

In [None]:
specific_eap = [ token for token in tokens_eap if token not in set_hpl_mws ]
specific_hpl = [ token for token in tokens_hpl if token not in set_eap_mws ]
specific_mws = [ token for token in tokens_mws if token not in set_eap_hpl ]
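The filtering step can be checked on a toy example. The token lists below are invented for illustration. The sketch also shows one caveat: since the tokens are not lower-cased, `The` and `the` count as different words, so a capitalized word can look author-specific although the word itself is not.

```python
# Toy token lists, invented for illustration (not taken from train.csv)
toy_a = ["The", "raven", "the", "night", "raven"]
toy_b = ["the", "night", "abyss"]
toy_c = ["the", "creature", "night"]

# Same pattern as above: drop every token that any other author uses
others = set(toy_b + toy_c)
specific_a = [t for t in toy_a if t not in others]
print(specific_a)  # ['The', 'raven', 'raven']

# Lower-casing first treats "The" and "the" as the same word
others_lower = {t.lower() for t in toy_b + toy_c}
specific_a_lower = [t.lower() for t in toy_a if t.lower() not in others_lower]
print(specific_a_lower)  # ['raven', 'raven']
```

Whether lower-casing is desirable depends on the goal; capitalization may itself carry some signal about an author's style.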

Create a data frame for each author from these author-specific tokens. Each token still appears once per occurrence, i.e. duplicates have not been removed yet.

In [None]:
df_eap = pd.DataFrame({"Token": specific_eap, "Author": "EAP"})
df_hpl = pd.DataFrame({"Token": specific_hpl, "Author": "HPL"})
df_mws = pd.DataFrame({"Token": specific_mws, "Author": "MWS"})

Count the specific tokens in each data frame and add a column for the counts. I could have done this in one step without subsetting by author, and then filtered for a fixed number of most frequent tokens. However, I wanted each subset to have the same size, independent of the overall frequency counts, so I decided to count tokens for each author individually and then keep the same number of top rows (see below). More elegant solutions are of course very welcome.

In [None]:
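# Select the "Token" column explicitly so that transform returns a Series
# (one count per row) rather than a one-column data frame.
df_eap["Counts"] = df_eap.groupby("Token")["Token"].transform("count")
df_hpl["Counts"] = df_hpl.groupby("Token")["Token"].transform("count")
df_mws["Counts"] = df_mws.groupby("Token")["Token"].transform("count")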

Sort by 'Counts' in descending order:

In [None]:
df_eap = df_eap.sort_values(['Counts'], ascending=False)
df_hpl = df_hpl.sort_values(['Counts'], ascending=False)
df_mws = df_mws.sort_values(['Counts'], ascending=False)

Now drop duplicate rows:

In [None]:
df_eap = df_eap.drop_duplicates()
df_hpl = df_hpl.drop_duplicates()
df_mws = df_mws.drop_duplicates()
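One possible shortcut for the filter, count, sort and de-duplicate steps is `collections.Counter` from the standard library. This is only a sketch of an alternative, not what the rest of this notebook uses; the helper name `top_specific` and the toy token lists are made up for illustration.

```python
from collections import Counter

def top_specific(own_tokens, other_tokens, n=10):
    """Return the n most frequent tokens of own_tokens that never occur
    in other_tokens, as (token, count) pairs in descending order."""
    others = set(other_tokens)
    counts = Counter(t for t in own_tokens if t not in others)
    return counts.most_common(n)

# Toy data, invented for illustration
toy_own = ["raven", "the", "raven", "tarn", "raven", "tarn", "night"]
toy_other = ["the", "night", "abyss"]

top = top_specific(toy_own, toy_other, n=2)
print(top)  # [('raven', 3), ('tarn', 2)]
```

With the real data the call would look like `top_specific(tokens_eap, tokens_hpl + tokens_mws)`, and the resulting pairs could be passed to `pd.DataFrame(..., columns=["Token", "Counts"])` if a data frame is needed for plotting.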

Have a look at the top 10 words for each author:

In [None]:
df_eap.head(10) # Edgar Allan Poe

In [None]:
df_hpl.head(10) # H. P. Lovecraft

In [None]:
df_mws.head(10) # Mary Wollstonecraft Shelley

Take the head of each data frame, here the first n=10 rows, and combine the three into one data frame:

In [None]:
n = 10

df_top = pd.concat([
    df_eap.head(n),
    df_hpl.head(n),
    df_mws.head(n)
])

df_top.set_index(['Token'], inplace=True)

Set some plotting parameters:

In [None]:
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}

plt.rcParams.update(params)

Make a plot. All subplots have identical y-limits so that the counts can be compared across authors:

In [None]:
fig, ax = plt.subplots(1, 3)

ymax = df_top["Counts"].max() + 10

ax[0].set_ylim([0, ymax])
ax[1].set_ylim([0, ymax])
ax[2].set_ylim([0, ymax])

group_a = df_top[df_top.Author=='EAP']
group_b = df_top[df_top.Author=='HPL']
group_c = df_top[df_top.Author=='MWS']

group_a.plot(kind='bar', rot=45, title = "EAP", ax=ax[0])
group_b.plot(kind='bar', rot=45, title = "HPL", ax=ax[1])
group_c.plot(kind='bar', rot=45, title = "MWS", ax=ax[2])

fig.tight_layout()