## Explore movie description  
This notebook takes a first look at the text data and makes some plots and figures to help understand what's in it.

To execute a cell, click on it and press `Shift-Enter`. You can insert
another code or markdown cell above or below any cell by clicking `Insert` in the menu bar at the top.

In [1]:
# first a function to read in a file given a filename

def read_the_file(fname):
    '''Reads the plain-text file at fname.

       Returns a text string containing the entire description.
    '''
    # the with-block closes the file even if reading fails
    with open(fname, 'r') as f:
        textstring = f.read()
    return textstring

In [2]:
# relative path to description
path_including_filename = "../data/train_to_busan_description.csv"

text = read_the_file(path_including_filename)

In [3]:
text

'Seok-woo, a divorced fund manager, is a workaholic and absentee father to his \nyoung daughter, Su-an. For her birthday the next day, she wishes for her father\nto take her to Busan to see her mother. They board the KTX at Seoul Station. \nOthers on the same train are tough working-class husband Sang-hwa and his \npregnant wife Seong-kyeong, a high school baseball team, rich-yet-egotistical \nCOO Yon-suk, elderly sisters In-gil and Jon-gil, and a homeless man who is \nexperiencing post-traumatic stress disorder.  As the train departs, a convulsing\nyoung woman boards the train with a bite wound on her leg. The woman soon becomes\na zombie and attacks a train attendant, who then also turns into a zombie. The \ninfection quickly spreads throughout the train. Baseball player Yong-guk, a girl\nnamed Jin-hee who has a crush on him, and several passengers manage to escape to \nanother car. News broadcasts report zombie outbreaks throughout the country. The \ntrain stops at Daejeon, but the 

In [4]:
# the text has \n (newlines) in it.  If we print it, the newlines will render
print(text)

Seok-woo, a divorced fund manager, is a workaholic and absentee father to his 
young daughter, Su-an. For her birthday the next day, she wishes for her father
to take her to Busan to see her mother. They board the KTX at Seoul Station. 
Others on the same train are tough working-class husband Sang-hwa and his 
pregnant wife Seong-kyeong, a high school baseball team, rich-yet-egotistical 
COO Yon-suk, elderly sisters In-gil and Jon-gil, and a homeless man who is 
experiencing post-traumatic stress disorder.  As the train departs, a convulsing
young woman boards the train with a bite wound on her leg. The woman soon becomes
a zombie and attacks a train attendant, who then also turns into a zombie. The 
infection quickly spreads throughout the train. Baseball player Yong-guk, a girl
named Jin-hee who has a crush on him, and several passengers manage to escape to 
another car. News broadcasts report zombie outbreaks throughout the country. The 
train stops at Daejeon, but the surviving pas

### Clean the text

In [5]:
# specify how many characters to show after each step
nc = 160

# lowercase it
text_lc = text.lower()
text_lc[:nc]  # show first nc characters

'seok-woo, a divorced fund manager, is a workaholic and absentee father to his \nyoung daughter, su-an. for her birthday the next day, she wishes for her father\nt'

In [6]:
# remove punctuation
from string import punctuation

text_np = ''.join([ch for ch in text_lc if ch not in punctuation])
text_np[:nc]

'seokwoo a divorced fund manager is a workaholic and absentee father to his \nyoung daughter suan for her birthday the next day she wishes for her father\nto take '
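Note that stripping punctuation also joins hyphenated names: `seok-woo` becomes `seokwoo`. As a minimal sketch, the same step can be written with `str.translate`. This is an equivalent alternative to the join in the cell above, not what the notebook runs, and `sample` is a made-up string for illustration.

```python
from string import punctuation

# hypothetical sample string, not read from the data file
sample = "seok-woo, a divorced fund manager, is a workaholic."

# build a translation table that deletes every punctuation character
delete_punct = str.maketrans('', '', punctuation)
sample_np = sample.translate(delete_punct)

# same result as the character-by-character join
assert sample_np == ''.join(ch for ch in sample if ch not in punctuation)
print(sample_np)  # seokwoo a divorced fund manager is a workaholic
```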

In [8]:
# remove newline characters
text_nnl = text_np.replace('\n', ' ')
text_nnl[:nc]

'seokwoo a divorced fund manager is a workaholic and absentee father to his  young daughter suan for her birthday the next day she wishes for her father to take '

In [9]:
# split into words
words = text_nnl.split(' ')
words

['seokwoo',
 'a',
 'divorced',
 'fund',
 'manager',
 'is',
 'a',
 'workaholic',
 'and',
 'absentee',
 'father',
 'to',
 'his',
 '',
 'young',
 'daughter',
 'suan',
 'for',
 'her',
 'birthday',
 'the',
 'next',
 'day',
 'she',
 'wishes',
 'for',
 'her',
 'father',
 'to',
 'take',
 'her',
 'to',
 'busan',
 'to',
 'see',
 'her',
 'mother',
 'they',
 'board',
 'the',
 'ktx',
 'at',
 'seoul',
 'station',
 '',
 'others',
 'on',
 'the',
 'same',
 'train',
 'are',
 'tough',
 'workingclass',
 'husband',
 'sanghwa',
 'and',
 'his',
 '',
 'pregnant',
 'wife',
 'seongkyeong',
 'a',
 'high',
 'school',
 'baseball',
 'team',
 'richyetegotistical',
 '',
 'coo',
 'yonsuk',
 'elderly',
 'sisters',
 'ingil',
 'and',
 'jongil',
 'and',
 'a',
 'homeless',
 'man',
 'who',
 'is',
 '',
 'experiencing',
 'posttraumatic',
 'stress',
 'disorder',
 '',
 'as',
 'the',
 'train',
 'departs',
 'a',
 'convulsing',
 'young',
 'woman',
 'boards',
 'the',
 'train',
 'with',
 'a',
 'bite',
 'wound',
 'on',
 'her',
 'leg'

In [10]:
def print_word_stats(words):
    num_words = len(words)
    unique_words = set(words)
    num_unique_words = len(unique_words)
    print(f"The number of words in the description is {num_words}.")
    print(f"The number of unique words in the description is {num_unique_words}.")

In [11]:
# before removing stopwords
print_word_stats(words)

The number of words in the description is 655.
The number of unique words in the description is 266.


In [12]:
# remove stopwords
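# ENGLISH_STOP_WORDS is exposed by the text module; the old
# sklearn.feature_extraction.stop_words module was removed in newer releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stopwords = ENGLISH_STOP_WORDS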

words_nsw = [word for word in words if word not in stopwords]



In [13]:
# after removing stopwords
print_word_stats(words_nsw)

The number of words in the description is 347.
The number of unique words in the description is 199.


In [14]:
words_nsw

['seokwoo',
 'divorced',
 'fund',
 'manager',
 'workaholic',
 'absentee',
 'father',
 '',
 'young',
 'daughter',
 'suan',
 'birthday',
 'day',
 'wishes',
 'father',
 'busan',
 'mother',
 'board',
 'ktx',
 'seoul',
 'station',
 '',
 'train',
 'tough',
 'workingclass',
 'husband',
 'sanghwa',
 '',
 'pregnant',
 'wife',
 'seongkyeong',
 'high',
 'school',
 'baseball',
 'team',
 'richyetegotistical',
 '',
 'coo',
 'yonsuk',
 'elderly',
 'sisters',
 'ingil',
 'jongil',
 'homeless',
 'man',
 '',
 'experiencing',
 'posttraumatic',
 'stress',
 'disorder',
 '',
 'train',
 'departs',
 'convulsing',
 'young',
 'woman',
 'boards',
 'train',
 'bite',
 'wound',
 'leg',
 'woman',
 'soon',
 'zombie',
 'attacks',
 'train',
 'attendant',
 'turns',
 'zombie',
 '',
 'infection',
 'quickly',
 'spreads',
 'train',
 'baseball',
 'player',
 'yongguk',
 'girl',
 'named',
 'jinhee',
 'crush',
 'passengers',
 'manage',
 'escape',
 '',
 'car',
 'news',
 'broadcasts',
 'report',
 'zombie',
 'outbreaks',
 'country'

In [15]:
# it looks like '' occurs as a word in many places - remove it
# (compare with !=, not `is not`, which tests identity rather than equality)
words_cleaned = [word for word in words_nsw if word != '']
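The empty strings come from splitting on a single space: wherever a newline was replaced next to an existing space, two spaces sit side by side and `split(' ')` returns an empty string between them. A small sketch with a made-up sample string shows the difference from `split()` with no argument, which splits on any run of whitespace:

```python
# hypothetical sample with a double space, like the one left after replacing '\n'
sample = 'father to his  young daughter suan'

with_empty = sample.split(' ')   # splits on every single space
without_empty = sample.split()   # splits on any run of whitespace

print(with_empty)     # ['father', 'to', 'his', '', 'young', 'daughter', 'suan']
print(without_empty)  # ['father', 'to', 'his', 'young', 'daughter', 'suan']
```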

In [16]:
words_cleaned

['seokwoo',
 'divorced',
 'fund',
 'manager',
 'workaholic',
 'absentee',
 'father',
 'young',
 'daughter',
 'suan',
 'birthday',
 'day',
 'wishes',
 'father',
 'busan',
 'mother',
 'board',
 'ktx',
 'seoul',
 'station',
 'train',
 'tough',
 'workingclass',
 'husband',
 'sanghwa',
 'pregnant',
 'wife',
 'seongkyeong',
 'high',
 'school',
 'baseball',
 'team',
 'richyetegotistical',
 'coo',
 'yonsuk',
 'elderly',
 'sisters',
 'ingil',
 'jongil',
 'homeless',
 'man',
 'experiencing',
 'posttraumatic',
 'stress',
 'disorder',
 'train',
 'departs',
 'convulsing',
 'young',
 'woman',
 'boards',
 'train',
 'bite',
 'wound',
 'leg',
 'woman',
 'soon',
 'zombie',
 'attacks',
 'train',
 'attendant',
 'turns',
 'zombie',
 'infection',
 'quickly',
 'spreads',
 'train',
 'baseball',
 'player',
 'yongguk',
 'girl',
 'named',
 'jinhee',
 'crush',
 'passengers',
 'manage',
 'escape',
 'car',
 'news',
 'broadcasts',
 'report',
 'zombie',
 'outbreaks',
 'country',
 'train',
 'stops',
 'daejeon',
 'surv
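The cleaning steps so far could be collected into one function so they are easy to rerun on another description. This is only a sketch: `clean_words`, `toy_stopwords`, and `toy_text` are hypothetical names, and the tiny stopword set stands in for the scikit-learn list used in the notebook.

```python
from string import punctuation

def clean_words(text, stopwords):
    '''Lowercase, strip punctuation, split into words, drop stopwords.'''
    lowered = text.lower()
    no_punct = ''.join(ch for ch in lowered if ch not in punctuation)
    # split() with no argument splits on spaces and newlines alike
    return [word for word in no_punct.split() if word not in stopwords]

# hypothetical stand-ins for the real stopword list and description
toy_stopwords = {'a', 'is', 'and', 'to', 'his'}
toy_text = 'Seok-woo, a divorced fund manager, is a workaholic\nand absentee father.'

toy_cleaned = clean_words(toy_text, toy_stopwords)
print(toy_cleaned)
```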

In [17]:
# let's get a count of each word in the description
from collections import Counter

word_counts = Counter(words_cleaned)

In [18]:
word_counts

Counter({'seokwoo': 6,
         'divorced': 1,
         'fund': 1,
         'manager': 1,
         'workaholic': 1,
         'absentee': 1,
         'father': 3,
         'young': 2,
         'daughter': 3,
         'suan': 8,
         'birthday': 1,
         'day': 1,
         'wishes': 1,
         'busan': 3,
         'mother': 1,
         'board': 1,
         'ktx': 1,
         'seoul': 1,
         'station': 2,
         'train': 18,
         'tough': 1,
         'workingclass': 1,
         'husband': 1,
         'sanghwa': 3,
         'pregnant': 1,
         'wife': 1,
         'seongkyeong': 7,
         'high': 1,
         'school': 1,
         'baseball': 2,
         'team': 1,
         'richyetegotistical': 1,
         'coo': 1,
         'yonsuk': 7,
         'elderly': 1,
         'sisters': 1,
         'ingil': 4,
         'jongil': 1,
         'homeless': 3,
         'man': 3,
         'experiencing': 1,
         'posttraumatic': 1,
         'stress': 1,
         'disorder': 

In [19]:
# let's find the 20 most common, and plot them
num = 20
most_common = word_counts.most_common(num)
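`Counter.most_common(n)` returns the `n` highest counts as a list of `(word, count)` tuples, ordered from most to least frequent. A toy sketch with made-up words (not the notebook's data):

```python
from collections import Counter

# hypothetical word list for illustration
toy_words = ['train', 'zombie', 'train', 'busan', 'zombie', 'train']
toy_counts = Counter(toy_words)

top_two = toy_counts.most_common(2)
print(top_two)  # [('train', 3), ('zombie', 2)]
```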

In [20]:
most_common

[('train', 18),
 ('suan', 8),
 ('seongkyeong', 7),
 ('yonsuk', 7),
 ('seokwoo', 6),
 ('zombie', 5),
 ('passengers', 5),
 ('zombies', 5),
 ('ingil', 4),
 ('yongguk', 4),
 ('jinhee', 4),
 ('conductor', 4),
 ('father', 3),
 ('daughter', 3),
 ('busan', 3),
 ('sanghwa', 3),
 ('homeless', 3),
 ('man', 3),
 ('attendant', 3),
 ('escape', 3)]

### Make a bar chart of the most common words

In [None]:
# pre-processing: split the (word, count) tuples into separate lists for plotting
labels = [tup[0] for tup in most_common]
counts = [tup[1] for tup in most_common]
print(labels)
print(counts)
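The two list comprehensions can also be written as a single `zip(*...)` call. A sketch using three of the pairs from the output above; the `toy_` names are hypothetical. Note that this gives tuples rather than lists, which should work equally well for the plotting calls.

```python
# three (word, count) pairs taken from the most_common output
toy_most_common = [('train', 18), ('suan', 8), ('zombie', 5)]

# zip(*pairs) regroups the first items together and the second items together
toy_labels, toy_counts = zip(*toy_most_common)

print(toy_labels)  # ('train', 'suan', 'zombie')
print(toy_counts)  # (18, 8, 5)
```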

In [None]:
# use matplotlib to make a bar chart
import matplotlib.pyplot as plt

# choose a nice matplotlib style
plt.style.use('ggplot')

In [None]:
# change the default text size (it's usually too small)
plt.rcParams.update({'font.size': 14})

In [None]:
# from "A simple bar chart" in your Matplotlib.pdf cheatsheet

N = len(labels)
fig, ax = plt.subplots(figsize=(12, 6))
width = 0.8
ticklocations = list(range(N))
ax.bar(ticklocations, counts, width, linewidth=4.0, align='center')
ax.set_xticks(ticks=ticklocations)
ax.set_xticklabels(labels, rotation=90)
ax.set_xlim(min(ticklocations)-0.6, max(ticklocations)+0.6)
ymax = max(counts) + 2  # scale the y-axis to the data, not to the number of bars
ax.set_yticks(range(ymax))
ax.set_ylim((0, ymax))
ax.yaxis.grid(True)
ax.set_xlabel('word')
ax.set_ylabel('counts');

### Wordcloud visualization
A wordcloud is another nice way to visualize the frequency or importance of
words in text data. Alt-tab to your Unix/Linux terminal and install a wordcloud
utility for Python from the command line:
```bash
$ conda install -c conda-forge wordcloud
```

In [None]:
# join the cleaned words into a single string for the wordcloud utility
cleaned_text = ' '.join(words_cleaned)
cleaned_text[:nc]

In [None]:
from wordcloud import WordCloud

wordcloud = WordCloud(background_color="white", width=960, height=960, margin=8).generate(cleaned_text)
fig, ax = plt.subplots(figsize=(8,8))
ax.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()