# <font color='royalblue'><center>Homework: *State Your Assumptions*</center></font>

### Get the Data

In [2]:
# load packages
from dvc.api import read,get_url
import pandas as pd
import re

# store text file
txt = read('resources/data/shakespeare/shakespeare.txt', 
           repo='https://github.com/TLP-COI/text-data-course')

# view
print(txt[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



### Part 1

Split the text file into a table, such that

- each row is a single line of dialogue
- there are columns for
    - the speaker
    - the line number
    - the line dialogue (the text)

Hint: you will need to use RegEx to do this rapidly. See the in-class "markdown" example!

Question(s):

What assumptions have you made about the text that allowed you to do this?

In [3]:
########################

The text is structured as follows:

`Speaker:\nText\n\n`

Assumptions:
- the speaker's name is always followed by a colon and a line break
- the speaker's dialogue starts after that colon and line break and ends with a double line break (`\n\n`)
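
The two assumptions above can be written as a single regular expression. The following is a minimal sketch on a short excerpt of the text; the pattern and the names `sample`, `pattern` and `pairs` are illustrative assumptions, not the approach used in the cells that follow.

```python
import re

# short excerpt in the same "Speaker:\nText\n\n" layout as the corpus
sample = (
    "First Citizen:\nBefore we proceed any further, hear me speak.\n\n"
    "All:\nSpeak, speak.\n\n"
    "First Citizen:\nYou are all resolved rather to die than to famish?"
)

# speaker  = everything before a colon that ends its line
# dialogue = everything up to the next blank line (or the end of the text)
pattern = re.compile(r"^([^:\n]+):\n(.*?)(?=\n\n|\Z)", flags=re.MULTILINE | re.DOTALL)

pairs = [(speaker, text.strip()) for speaker, text in pattern.findall(sample)]
for speaker, text in pairs:
    print(f"{speaker!r} -> {text!r}")
```

If either assumption fails for a block (for example, a speaker line without a trailing colon), the pattern simply skips it, so comparing `len(pairs)` with the number of blocks is a quick sanity check.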

**Extract each dialogue spoken and the speaker's name i.e. 'speaker_name:\ndialogue'**

In [4]:
# split on double line breaks
first_split = re.split("\n\n", txt)

# check length of list
len(first_split)

7222

**Separate speaker name and the dialogue**

Assumption:<br>
Speaker name is always followed by a colon. <br>
Since each item in `first_split` contains the speaker's name and the dialogue spoken, splitting on the *first* colon in the string will separate the speaker name and the dialogue.

In [5]:
# create list to store split results
second_split = []

# iterate through each item in the list and split on the first colon
for line in first_split:
    second_split.append(line.split(":", 1))
    
# check length of list
len(second_split)

7222
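
The `maxsplit=1` argument matters because a colon can also occur inside the dialogue. A small sketch, using a made-up block rather than one taken from the corpus:

```python
# hypothetical block: the dialogue itself contains a colon
block = "First Citizen:\nOne word: listen."

# splitting on every colon breaks the dialogue apart
pieces = block.split(":")
print(len(pieces))

# splitting on the first colon only keeps the dialogue whole
speaker, dialogue = block.split(":", 1)
print(speaker)
print(repr(dialogue))

# a block without any colon yields a single-element list,
# so indexing [1] on it would raise an IndexError
no_colon = "no colon here".split(":", 1)
print(len(no_colon))
```

Note that the length check above cannot detect blocks without a colon, because `split` returns one list per block either way; only the later `[1]` lookup would fail on such a block.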

#### Store speakers and dialogues

In [6]:
# collect the speaker (first element) of each split result
speakers = [parts[0] for parts in second_split]

In [7]:
# collect the dialogue (second element) of each split result
dialogues = [parts[1] for parts in second_split]

In [8]:
# create empty dataframe
df = pd.DataFrame()

# store speakers and dialogues in dataframe
df['speaker']  = speakers
df['dialogue'] = dialogues

# add line number column starting from 1
df['line_number'] = df.index + 1

# view df
df.head()

| | speaker | dialogue | line_number |
|---|---|---|---|
| 0 | First Citizen | \nBefore we proceed any further, hear me speak. | 1 |
| 1 | All | \nSpeak, speak. | 2 |
| 2 | First Citizen | \nYou are all resolved rather to die than to f... | 3 |
| 3 | All | \nResolved. resolved. | 4 |
| 4 | First Citizen | \nFirst, you know Caius Marcius is chief enemy... | 5 |
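
Every dialogue in the table above still starts with the `\n` that followed the speaker's colon. A minimal sketch of one way to remove it, shown on a toy frame (`toy` is a stand-in for `df`, not the real data):

```python
import pandas as pd

# toy stand-in for df, with the same leading "\n" left over from the split
toy = pd.DataFrame({
    "speaker": ["First Citizen", "All"],
    "dialogue": [
        "\nBefore we proceed any further, hear me speak.",
        "\nSpeak, speak.",
    ],
})

# strip leading/trailing whitespace from both text columns
toy["speaker"] = toy["speaker"].str.strip()
toy["dialogue"] = toy["dialogue"].str.strip()

print(toy["dialogue"].tolist())
```

Stripping the speaker column as well guards against names that pick up stray whitespace wherever the text has an extra blank line.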


### Part 2

You have likely noticed that the lines are not all from the same play! Now, we will add some useful metadata to our table:

- Determine a likely source title for each line
- add the title as a 'play' column in the data table.
- make sure to document your decisions, assumptions, external data sources, etc.

This is fairly open-ended, and you are not being judged completely on accuracy. Instead, think outside the box a bit as to how you might accomplish this, and attempt to justify whatever approximations or assumptions you felt were appropriate.
___

For the source title, I am going to use a Kaggle dataset (https://www.kaggle.com/kingburrito666/shakespeare-plays/version/4) on Shakespeare's plays (including character names and dialogues). 

In [9]:
# read in Shakespeare play data
df_shakespeare = pd.read_csv('Shakespeare_data.csv')

# view
df_shakespeare.head()

| | Dataline | Play | PlayerLinenumber | ActSceneLine | Player | PlayerLine |
|---|---|---|---|---|---|---|
| 0 | 1 | Henry IV | | | | ACT I |
| 1 | 2 | Henry IV | | | | SCENE I. London. The palace. |
| 2 | 3 | Henry IV | | | | Enter KING HENRY, LORD JOHN OF LANCASTER, the ... |
| 3 | 4 | Henry IV | 1.0 | 1.1.1 | KING HENRY IV | So shaken as we are, so wan with care, |
| 4 | 5 | Henry IV | 1.0 | 1.1.2 | KING HENRY IV | Find we a time for frighted peace to pant, |


In [10]:
# make speaker names uppercase (in order to match) and store in df
speaker_df = pd.DataFrame(df['speaker'].str.upper())
df_shakespeare['Player'] = df_shakespeare['Player'].str.upper()

# rename column to match the speaker column name ('Player') in df_shakespeare
speaker_df.rename(columns = {'speaker':'Player'}, inplace = True)

For each speaker, I collect the play of every line attributed to that speaker in the Kaggle data (e.g. all lines spoken by 'First Citizen') and take the most frequent play as the source title. The assumption is that the play in which a character has the most lines is the most likely source of that character's dialogue here.

Limitation of this assumption: character names are reused across plays, so some dialogue may come from a play in which the speaker has only a minor part, i.e. not the most common play for that name. Generic names such as 'All' are especially prone to this.
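
The "most common play per speaker" rule can also be expressed without an explicit loop. The following is a sketch under stated assumptions: `reference` is a tiny made-up stand-in for the Kaggle table (the line counts are invented for illustration), and `most_common_play` is a hypothetical helper name.

```python
import pandas as pd

# made-up stand-in for the Kaggle data: one row per spoken line
reference = pd.DataFrame({
    "Player": ["FIRST CITIZEN", "FIRST CITIZEN", "FIRST CITIZEN",
               "ANTONIO", "ANTONIO", "ANTONIO"],
    "Play": ["Coriolanus", "Coriolanus", "Julius Caesar",
             "Merchant of Venice", "Merchant of Venice", "The Tempest"],
})

# for each speaker, keep the play that occurs most often
most_common_play = (
    reference.groupby("Player")["Play"]
    .agg(lambda plays: plays.value_counts().idxmax())
)

# look up each speaker; names missing from the reference become 'none'
speakers = pd.Series(["First Citizen", "Antonio", "Ghost of Prince Edward"])
plays = speakers.str.upper().map(most_common_play).fillna("none")

print(plays.tolist())
```

Computing the lookup once per distinct speaker, rather than once per row, avoids rescanning the reference table for every line of dialogue.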

In [14]:
# create empty list to store source titles
plays = []

# store the most common play for each speaker
for i in range(len(speaker_df)):
    plays_list = list(df_shakespeare.loc[df_shakespeare['Player'] == speaker_df.Player[i], 'Play'])
    if len(plays_list) == 0:
        plays.append('none')
    else:
        plays.append(max(set(plays_list), key = plays_list.count))
    
# check length of plays
len(plays)

7222

In [16]:
# add 'play' column 
df['play'] = plays

# view dataframe
df

| | speaker | dialogue | line_number | play |
|---|---|---|---|---|
| 0 | First Citizen | \nBefore we proceed any further, hear me speak. | 1 | Coriolanus |
| 1 | All | \nSpeak, speak. | 2 | macbeth |
| 2 | First Citizen | \nYou are all resolved rather to die than to f... | 3 | Coriolanus |
| 3 | All | \nResolved. resolved. | 4 | macbeth |
| 4 | First Citizen | \nFirst, you know Caius Marcius is chief enemy... | 5 | Coriolanus |
| ... | ... | ... | ... | ... |
| 7217 | ANTONIO | \nNor I; my spirits are nimble.\nThey fell tog... | 7218 | Merchant of Venice |
| 7218 | SEBASTIAN | \nWhat, art thou waking? | 7219 | Twelfth Night |
| 7219 | ANTONIO | \nDo you not hear me speak? | 7220 | Merchant of Venice |
| 7220 | SEBASTIAN | \nI do; and surely\nIt is a sleepy language an... | 7221 | Twelfth Night |


In [18]:
# view rows where there was no match 
df[df['play'] == 'none']

| | speaker | dialogue | line_number | play |
|---|---|---|---|---|
| 530 | Senators, &C | \nWe'll surety him. | 531 | none |
| 538 | Senators, &C | \nWeapons, weapons, weapons!\n'Tribunes!' 'Pat... | 539 | none |
| 2142 | Ghost of Prince Edward | | 2143 | none |
| 2143 | Ghost of King Henry VI | | 2144 | none |
| 2150 | Ghosts of young Princes | | 2151 | none |
| 2152 | Ghost of BUCKINGHAM | | 2153 | none |
| 2750 | \nSAMPSON | \nGregory, o' my word, we'll not carry coals. | 2751 | none |
| 5704 | \nMARIANA | \nBreak off thy song, and haste thee quick awa... | 5705 | none |
| 6569 | ALL SERVING-MEN | \nHere, here, sir; here, sir. | 6570 | none |


Assumption: a speaker who was not found in the external dataset, and so has no play information, is assumed to be in the same play as the previous speaker. In other words, for rows without an exact speaker match, the play is copied from the preceding row.

Limitation: if the first row had no play, there would be no preceding row to copy from, and this approach would have to be modified.
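
A minimal sketch of this fill rule on a toy column, including one possible way to handle the first-row limitation (falling back to the next known play via `bfill` is an assumption of this sketch, not something the notebook does):

```python
import pandas as pd

# toy column: 'none' marks rows where no play was found
toy = pd.DataFrame({
    "play": ["none", "Coriolanus", "none", "Richard III", "none"],
})

# turn 'none' into missing values, copy the previous row's play forward,
# then fill a missing first row from the next known play
filled = toy["play"].mask(toy["play"].eq("none")).ffill().bfill()

print(filled.tolist())
```

Here the first row takes its play from the row after it, while every other gap is filled from the row before it.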

In [30]:
# mark rows where no play was found as missing,
# then fill each one with the play from the previous row
df["play"] = df["play"].mask(df["play"].eq("none")).ffill()

# check
df[df['play'] == 'none']


| | speaker | dialogue | line_number | play |
|---|---|---|---|---|


### Part 3

Pick one or more of the techniques described in this chapter:

- keyword frequency
- entity relationships
- markov language model
- bag-of-words, TF-IDF
- semantic embedding

Make a case for a technique to measure how important or interesting a speaker is. The measure does not have to capture both importance and interest, and you are welcome to come up with another term that represents "useful content" or tells a story (happiest speaker, worst speaker, etc.).

Whatever you choose, you must
- document how your technique was applied
- describe why you believe the technique is a valid approximation or exploration of how important, interesting, etc., a speaker is.
- list some possible weaknesses of your method, or ways you expect your assumptions could be violated within the text.

This is mostly about learning to transparently document your decisions and iterate on a method for operationalizing useful analyses on text. Your explanations should be understandable; homework will be peer-reviewed by your fellow students.