# Using `pandas` to explore texts

The first line of code below is a Jupyter Notebook "magic" command. It places any graphical output in the page itself rather than in a separate window or file. (We can still save output to a file if we want.)

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 6)

In [1]:
# Open the file and read its whole contents into a single string.
# The two steps (open, then read) are chained here into one line.
mdg = open('texts/mdg.txt', 'r').read()

In [2]:
# We need the regular expression library
import re

mdg_words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()
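
To see what the substitution above does, here is a minimal sketch that runs the same pattern on a short sample string. The sample sentence is made up for illustration and is not read from the story file.

```python
import re

# Hypothetical sample text, used only to illustrate the pattern.
sample = "It's rather a mystery. What island is it? Ship-Trap Island!"

# Replace every character that is not a letter or an apostrophe with a space,
# then lowercase the result and split it on whitespace.
sample_words = re.sub("[^a-zA-Z']", " ", sample).lower().split()
print(sample_words)
```

Note that the apostrophe in "It's" survives, while the hyphen in "Ship-Trap" is replaced with a space, so the name splits into two words.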

In [3]:
import numpy as np
import pandas as pd

mdg_series = pd.Series(mdg_words)

print(mdg_series)

0             off
1           there
2              to
3             the
4           right
5       somewhere
6              is
7               a
8           large
9          island
10           said
11        whitney
12           it's
13         rather
14              a
15        mystery
16           what
17         island
18             is
19             it
20      rainsford
21          asked
22            the
23            old
24         charts
25           call
26             it
27          'ship
28           trap
29         island
          ...    
7987           is
7988           to
7989      furnish
7990            a
7991       repast
7992          for
7993          the
7994       hounds
7995          the
7996        other
7997         will
7998        sleep
7999           in
8000         this
8001         very
8002    excellent
8003          bed
8004           on
8005        guard
8006    rainsford
8007           he
8008          had
8009        never
8010        slept
8011      

We can take that sequence of words and count them:

In [None]:
mdg_counts = mdg_series.value_counts()
print(mdg_counts)
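
As a small illustration of what `value_counts()` returns, the sketch below counts a toy list of words. The list is invented for the example. The result is a new Series indexed by word and sorted from most to least frequent.

```python
import pandas as pd

# Toy word list, invented for illustration.
toy_words = ["the", "island", "is", "the", "trap", "the", "island"]

# Count how many times each distinct word appears.
toy_counts = pd.Series(toy_words).value_counts()
print(toy_counts)

# Passing normalize=True gives relative frequencies instead of raw counts.
toy_shares = pd.Series(toy_words).value_counts(normalize=True)
print(toy_shares)
```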

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Let's graph the 50 most frequent words:
# =-=-=-=-=-=-=-=-=-=-= 

mdg_counts.iloc[0:50].plot(kind='bar')

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Save these results to a CSV file (handy for anyone who would rather work in Excel)
# =-=-=-=-=-=-=-=-=-=-= 

mdg_counts.to_csv('../data/mdg_word_freq.csv')
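
One possible way to give the saved file labeled columns, and to check that it reads back cleanly, is sketched below. It writes to an in-memory buffer instead of a real file, and the column names `Word` and `Frequency` are assumptions chosen for the example.

```python
import io
import pandas as pd

# Toy counts, invented for illustration.
toy_counts = pd.Series(["the", "island", "the"]).value_counts()

# Turn the Series into a two-column table with explicit column names.
toy_table = toy_counts.rename_axis("Word").reset_index(name="Frequency")

# Write the table as CSV to an in-memory buffer, then read it back.
buffer = io.StringIO()
toy_table.to_csv(buffer, index=False)
buffer.seek(0)
toy_df = pd.read_csv(buffer)
print(toy_df)
```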

In [None]:
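import matplotlib as mpl
import matplotlib.pyplot as plt

# Build a two-column table from the counts. Keeping only the 50 most
# frequent words is an assumption, made so the chart stays readable.
df = mdg_counts.iloc[:50].rename_axis('Word').reset_index(name='Frequency')

mpl.style.use('ggplot')
ax = df.plot(kind='bar', x='Word', y='Frequency',
             title="Frequency of Words in MDG",
             figsize=(20, 10),
             legend=True)
ax.set_xlabel("Word")
ax.set_ylabel("Occurrences")
plt.show()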

In [None]:
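import nltk
# concordance() is a method of nltk.Text, not of a string. It prints its
# results itself (and returns None), so no print() is needed.
mdg_text = nltk.Text(nltk.word_tokenize(mdg))
mdg_text.concordance("dangerous")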

In [None]:
# These are methods of an nltk.Text object, here mdg_text
mdg_text.similar("love")
mdg_text.common_contexts(["husband", "wife"])
mdg_text.collocations()

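In [None]:
import nltk

# Tokenize with NLTK (this needs NLTK's tokenizer data to be downloaded)
# and wrap the tokens in an nltk.Text for the methods used below.
mdg_tokens = nltk.word_tokenize(mdg)
mdg_text = nltk.Text(mdg_tokens)
len(mdg_tokens)

In [None]:
import re

# Alternatively, build the word list with a regular expression,
# this time keeping hyphens as well as apostrophes.
mdg_raw = open("./mdg.txt").read()
mdg_clean = re.sub("[^a-zA-Z'-]", " ", mdg_raw)
mdg_case = mdg_clean.lower()

mdg_word_list = mdg_case.split()
print(mdg_word_list)

In [None]:
# The distinct words, in alphabetical order
sorted(set(mdg_word_list))

In [None]:
# How many distinct words are there?
len(set(mdg_word_list))

In [None]:
# Lexical diversity of MDG, measured here as the average number
# of times each distinct word is used:
len(mdg_word_list) / len(set(mdg_word_list))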

In [None]:
len(mdg_tokens) / len(set(mdg_tokens))

On average, each distinct word occurs about four times in "The Most Dangerous Game."
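
The arithmetic behind that figure can be checked by hand on a toy example. The token list below is invented for illustration. Dividing the number of tokens by the number of distinct words gives the average number of uses per word; the inverse ratio is often reported as the type-token ratio.

```python
# Toy token list, invented for illustration.
toy_tokens = ["the", "island", "is", "the", "trap", "the", "island", "is"]

n_tokens = len(toy_tokens)      # every word occurrence: 8
n_types = len(set(toy_tokens))  # distinct words only: 4

# Average number of times each distinct word is used.
average_uses = n_tokens / n_types

# The inverse measure: distinct words as a share of all tokens.
type_token_ratio = n_types / n_tokens

print(average_uses, type_token_ratio)
```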

Out of curiosity, how many words occur four times?

In [None]:
wordfrequency = nltk.FreqDist(mdg_tokens)
four_times = [word for word in wordfrequency.keys() if wordfrequency[word] == 4]
print(four_times)
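
The same filtering idea can be sketched without NLTK, using `collections.Counter` from the standard library in place of `FreqDist`. The token list is again invented for illustration, and the target count is 2 rather than 4 so that the toy data has matches.

```python
from collections import Counter

# Toy token list, invented for illustration.
toy_tokens = ["the", "island", "is", "the", "trap", "the", "island", "is"]

# Counter maps each distinct token to its number of occurrences.
toy_freq = Counter(toy_tokens)

# Keep only the words that occur exactly twice, sorted alphabetically.
twice = sorted(word for word, count in toy_freq.items() if count == 2)
print(twice)
```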

In [None]:
mdg_text.count("dangerous")

In [None]:
mdg_text.concordance("dangerous")
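
To make clear what a concordance is, here is a minimal keyword-in-context sketch in plain Python. It is not how NLTK implements `concordance()`. The function name `kwic`, its `width` parameter, and the sample tokens are all hypothetical, introduced only for this example.

```python
def kwic(tokens, target, width=3):
    """Return each occurrence of target with `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively so "Dangerous" also matches "dangerous".
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical sample tokens, not taken from the story file.
sample_tokens = "hunting is the most dangerous game said rainsford".split()
sample_lines = kwic(sample_tokens, "dangerous", width=2)
print(sample_lines)
```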

Where does "dangerous" occur within the larger text?

In [None]:
mdg_text.dispersion_plot(["dangerous", "danger", "game", "fear"])
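
A dispersion plot is, in essence, a picture of the token positions at which each word occurs. The sketch below computes those positions directly. The helper name `word_offsets` and the sample tokens are hypothetical, and this is only an illustration of the idea, not NLTK's own code.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    return {
        target: [i for i, token in enumerate(tokens) if token == target]
        for target in targets
    }


# Hypothetical sample tokens, not taken from the story file.
sample_tokens = "the game is a dangerous game and danger is near".split()
offsets = word_offsets(sample_tokens, ["dangerous", "danger", "game", "fear"])
print(offsets)
```

A word that never occurs, such as "fear" in this sample, maps to an empty list, which corresponds to an empty row in the plot.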

In [None]:
wordfrequency.plot()