# Imports

In [1]:
import pandas as pd
import numpy as np
import re

# Set up display area to show dataframe in jupyter qtconsole
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 100)
# pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)

In [2]:
subtitles = pd.read_csv('house_subs.csv', sep=';')
print(len(subtitles))
subtitles.iloc[np.r_[0:100, -100:0]]

284492


Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,- We are condemned to useless labor.
1,2004,606018,180472,2,- Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,- Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
5,2004,606018,180472,11,Better stop or it'll stick that way.
6,2004,606018,180472,12,You have a patient in exam one.
7,2004,606018,180472,13,"Yeah, see I'm--I'm off at 12:00, and it's already five of."
8,2004,606018,180472,17,Hi.
9,2004,606018,180472,19,- What seems to be the problem?


# Clean dataset

Remove sentences such as:

(BELL RINGS) 

(CHILDREN GIGGLING) 

\[ Seasonal Music \] 

\[ Wheezing, Gasping \] 


In [3]:
subtitles = subtitles[~(subtitles.sentence.str.contains("\(" and "\)") & subtitles.sentence.str.isupper())]
subtitles = subtitles[~(subtitles.sentence.str.startswith("[") & subtitles.sentence.str.endswith("]"))]
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,- We are condemned to useless labor.
1,2004,606018,180472,2,- Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,- Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284481,2012,2121965,6371178,559,"Oh, come on!"
284484,2012,2121965,6371178,564,This is embarrassing.
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.
284486,2012,2121965,6371178,568,How...


Remove parts of sentences such as:

\[ House \] We are condemned to useless labor. 

\[ Music Continues \] Still out by 12:00. 

{\It means}Whatever they got is from the donor's blood. 

She was his high school{\sweetheart)} sweetie. 

{\I know }You told us it was none of our business, but if{\ House thinks that} your Huntington's is affecting you, 

{\The surgery}It was on her bowel, not her brain.

In [4]:
subtitles['sentence'] = subtitles['sentence'].map(lambda x: re.sub("[(\[{)].*[(\]})]", "", x).strip())
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,- We are condemned to useless labor.
1,2004,606018,180472,2,- Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,- Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284481,2012,2121965,6371178,559,"Oh, come on!"
284484,2012,2121965,6371178,564,This is embarrassing.
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.
284486,2012,2121965,6371178,568,How...


Remove sentences like:

~ \[Piano: 

Ouse over speakerphone\] how'd the bubble test go?

\[hissing sound are you kidding me?

In [5]:
subtitles = subtitles[~(subtitles.sentence.str.contains("\[") | subtitles.sentence.str.contains("\]"))]
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,- We are condemned to useless labor.
1,2004,606018,180472,2,- Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,- Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284481,2012,2121965,6371178,559,"Oh, come on!"
284484,2012,2121965,6371178,564,This is embarrassing.
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.
284486,2012,2121965,6371178,568,How...


Remove parts of sentences like:

{\pos}Carl got a new heart and lung. 

{\pos(194,215}He can wait till I finish slaying a guy in a skullcap and a pair of tights. 

Remove 10 sentences with  {y: i} like:

{y: i}Hi, this is Blake Hanson calling for Dr. Wilson. 

Remove excessive characters with re.sub('[^!-~]+',' ',x).strip()


In [6]:
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: re.sub(r'[\{\\pos].*[\}]',' ',x).strip())
subtitles = subtitles[~subtitles.sentence.str.contains(r"\{y: i\}")] #10 sentences with  {y: i}, e.g., {y: i}Hi, this is Blake Hanson calling for Dr. Wilson.
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: re.sub('[^!-~]+',' ',x).strip())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Remove '-' at the beginning of sentences:

\- We are condemned to useless labor. \
\- She take the pill?


In [7]:
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace("-", "").strip() if x.startswith("-") else x)
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284481,2012,2121965,6371178,559,"Oh, come on!"
284484,2012,2121965,6371178,564,This is embarrassing.
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.
284486,2012,2121965,6371178,568,How...


Replace '"' with "''"

In [8]:
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace('"', "''"))
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284481,2012,2121965,6371178,559,"Oh, come on!"
284484,2012,2121965,6371178,564,This is embarrassing.
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.
284486,2012,2121965,6371178,568,How...


Drop duplicated sentences:

In [9]:
subtitles = subtitles.drop_duplicates(subset=['sentence'])
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284474,2012,2121965,6371178,549,He claimed to be on some heroic quest for truth.
284475,2012,2121965,6371178,550,But the truth is he was a bitter jerk who liked making people miserable.
284480,2012,2121965,6371178,557,A million times he needed me and the one time that I needed...
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.


In [10]:
subtitles = subtitles.loc[subtitles.sentence.str.len() > 4]
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284474,2012,2121965,6371178,549,He claimed to be on some heroic quest for truth.
284475,2012,2121965,6371178,550,But the truth is he was a bitter jerk who liked making people miserable.
284480,2012,2121965,6371178,557,A million times he needed me and the one time that I needed...
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.


Remove sentences like:

{ o one goes anywhere! 

You're not trying to cure her,{\. 

{\You don't need one. 

} There are plenty of other people you can't shoot. 

In [11]:
subtitles = subtitles[~(subtitles.sentence.str.contains(r"{") | subtitles.sentence.str.contains(r"}") | subtitles.sentence.str.contains(r"\\"))] #7 sentences
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284474,2012,2121965,6371178,549,He claimed to be on some heroic quest for truth.
284475,2012,2121965,6371178,550,But the truth is he was a bitter jerk who liked making people miserable.
284480,2012,2121965,6371178,557,A million times he needed me and the one time that I needed...
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.


Remove sentences like:

\*(Don't worry. 

\*Click* 5 p.m., Dr. House checks out. 

Then the infection lowers her blood pressure... 50 over *** at one point.

In [12]:
subtitles = subtitles[~subtitles.sentence.str.contains(r"\*")]
subtitles

Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284474,2012,2121965,6371178,549,He claimed to be on some heroic quest for truth.
284475,2012,2121965,6371178,550,But the truth is he was a bitter jerk who liked making people miserable.
284480,2012,2121965,6371178,557,A million times he needed me and the one time that I needed...
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.


In [339]:
subtitles.to_csv('clean_house_subs.csv', sep=';', index=False)

In [13]:
# Additional cleaning for subtitles
# 68 rows less


subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace('- -', '-- '))
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace('---', '--'))
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace(' -- ', '-- '))
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace('- ', '-- ') if ('- ' in x) and ('-- ' not in x) else x)
# subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace('-- ', ' -- '))
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace('-', '') if x.endswith('--') or x.endswith('-') else x)
subtitles['sentence'] = subtitles['sentence'].apply(lambda x: x.replace("''", '"') if "''" in x else x)
subtitles = subtitles.drop_duplicates(subset=['sentence'])
print(len(subtitles))
subtitles

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/p

204047


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


Unnamed: 0,year,imdb_id,sub_id,sent_id,sentence
0,2004,606018,180472,1,We are condemned to useless labor.
1,2004,606018,180472,2,Fourth circle of hell.
2,2004,606018,180472,5,I'm sure Dante would agree that qualifies as useless.
3,2004,606018,180472,7,Oops.
4,2004,606018,180472,10,Could an eight-year-old do this?
...,...,...,...,...,...
284474,2012,2121965,6371178,549,He claimed to be on some heroic quest for truth.
284475,2012,2121965,6371178,550,But the truth is he was a bitter jerk who liked making people miserable.
284480,2012,2121965,6371178,557,A million times he needed me and the one time that I needed...
284485,2012,2121965,6371178,565,I'd sworn I'd turned this off.
