# Preprocessing - "The Office" dataset
This notebook uses the parameterizable functions from `preprocessing_nlp` to preprocess the "The Office" script dataset for downstream NLP analysis.

In [1]:
import pandas as pd

PATH = "../data/"
FILE = "the-office-lines_scripts.csv"

In [2]:
df = pd.read_csv(PATH+FILE, sep=",", index_col="id")

In [3]:
# Parameters
param_dict = {
    "concat_scenes": True,
    "extract_direc": False,
    "rmv_stopwords": True,
    "exp_contractions": True,
    "tokenizer": ("TreeBankWord", True)
}
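
The internals of `preprocess` are not shown in this notebook, so the following is only a minimal sketch of how `exp_contractions` and `rmv_stopwords` might interact. The lookup tables and both helper functions are hypothetical and deliberately tiny. The sketch assumes contractions are expanded before stopwords are removed and that stopword matching is case-sensitive, which would be consistent with fragments such as "I sorry" in the preprocessed output further down.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only;
# a real pipeline would use much larger lists.
CONTRACTIONS = {"I'm": "I am", "don't": "do not", "it's": "it is"}
STOPWORDS = {"i", "am", "do", "not", "it", "is", "the", "a", "to"}


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def remove_stopwords(tokens):
    """Drop tokens found in STOPWORDS (case-sensitive comparison)."""
    return [t for t in tokens if t not in STOPWORDS]


line = "I'm sorry , it's the fax"
tokens = expand_contractions(line).split()  # whitespace split as a stand-in tokenizer
cleaned = remove_stopwords(tokens)
print(cleaned)  # ['I', 'sorry', ',', 'fax']
```

Because the comparison is case-sensitive, the capitalized "I" survives while the lowercase "am" is dropped. Lowercasing first (as the `lwr` option used later presumably does) would remove both.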


In [4]:
from preprocessing_nlp import preprocess

preprocessed_df = preprocess(df, **param_dict)

pd.set_option("display.max_colwidth", None)
preprocessed_df

Unnamed: 0,season,episode,scene,line_text,season_episode
0,1,1,1,"[ Michael ] : All right Jim . Your quarterlies look good . How things library ? [ Jim ] : Oh , I told . I could close . So ... [ Michael ] : So come master guidance ? Is saying , grasshopper ? [ Jim ] : Actually , called , yeah . [ Michael ] : All right . Well , let show done .",101
1,1,1,2,"[ Michael ] : [ phone ] Yes , I would like speak office manager , please . Yes , hello . This Michael Scott . I Regional Manager Dunder Mifflin Paper Products . Just wanted talk manager-a-manger . [ quick cut scene ] All right . Done deal . Thank much , sir . You gentleman scholar . Oh , I sorry . OK . I sorry . My mistake . [ hangs ] That woman I talking , ... She low voice . Probably smoker , ... [ Clears throat ] So way done .",101
2,1,1,3,"[ Michael ] : I , uh , I Dunder Mifflin 12 years , last four Regional Manager . If want come ... See entire floor . So kingdom , far eye see . This receptionist , Pam . Pam ! Pam-Pam ! Pam Beesly . Pam us ... forever . Right , Pam ? [ Pam ] : Well . I know . [ Michael ] : If think cute , seen couple years ago . [ growls ] [ Pam ] : What ? [ Michael ] : Any messages ? [ Pam ] : Uh , yeah . Just fax . [ Michael ] : Oh ! Pam , Corporate . How many times I told ? There special filing cabinet things corporate . [ Pam ] : You told . [ Michael ] : It called wastepaper basket ! Look ! Look face .",101
3,1,1,4,"[ Michael ] : People say I best boss . They go , 'God never worked place like . You hilarious . ' 'And get best us . ' [ shows camera WORLD 'S BEST BOSS mug ] I think pretty much sums . I found Spencer Gifts .",101
4,1,1,5,[ Dwight ] : [ singing ] Shall I play ? Pa rum pump um pum [ Imitates heavy drumming ] I gifts . Pa rum pump um pum [ Imitates heavy drumming ],101
...,...,...,...,...,...
8844,9,23,112,"[ Creed ] : It seems arbitrary . I applied job company hiring . I took desk back empty . But [ chuckles ] matter get end , human beings miraculous gift make place home . [ standing two cops ] Let .",923
8845,9,23,113,[ Meredith ] : I feel lucky I got chance share crummy story anyone thinks one take dump paper shredder . You alone sister . Let get beer sometime .,923
8846,9,23,114,[ Phyllis ] : I happy filmed I remember everyone . I worked paper company years I never wrote anything .,923
8847,9,23,115,"[ Jim ] : I sold paper company 12 years . My job speak clients phone quantities types copier paper . Even I love every minute , everything I , I owe job . This stupid wonderful boring amazing job .",923
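
With `concat_scenes` enabled, the output above holds one row per scene, with each speaker's lines prefixed by a bracketed name. A minimal sketch of that kind of aggregation in pandas follows. The raw column names `speaker` and `line_text` are assumptions (the raw schema is not shown here), and the `season_episode` formula is just one computation that reproduces the values 101 and 923 seen in the table.

```python
import pandas as pd

# Toy stand-in for the raw data; the column names are assumed.
raw = pd.DataFrame({
    "season": [1, 1, 1],
    "episode": [1, 1, 1],
    "scene": [1, 1, 2],
    "speaker": ["Michael", "Jim", "Michael"],
    "line_text": ["All right Jim.", "Oh, I told you.", "Yes, hello."],
})

# Prefix each line with its speaker tag, then join all lines of a scene.
tagged = "[" + raw["speaker"] + "]: " + raw["line_text"]
scenes = (
    raw.assign(line_text=tagged)
       .groupby(["season", "episode", "scene"], sort=False)["line_text"]
       .agg(" ".join)
       .reset_index()
)

# Combined key, e.g. 101 for season 1, episode 1 (assumed formula).
scenes["season_episode"] = scenes["season"] * 100 + scenes["episode"]
print(scenes)
```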


In [7]:
# save preprocessed data
preprocessed_df.to_csv(PATH+"preprocessed_for_modeling_"+FILE, sep=";")
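
Since the file is written with `sep=";"`, it has to be read back with the same separator. The round trip below is a small sketch using an in-memory buffer and a toy frame (both stand-ins, not the real data) so that it runs without touching the disk.

```python
import io

import pandas as pd

# Toy frame standing in for preprocessed_df; the text contains commas on purpose.
toy = pd.DataFrame({
    "scene": [1, 2],
    "line_text": ["Oh , I told .", "Yes , hello ."],
})

buffer = io.StringIO()
toy.to_csv(buffer, sep=";")  # same separator as in the cell above
buffer.seek(0)

# The separator must match on read; index_col=0 restores the saved index.
restored = pd.read_csv(buffer, sep=";", index_col=0)
print(restored)
```

Reading the same data with the default `sep=","` would split the text on its commas instead of the semicolons, so the columns would no longer line up.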

In [5]:
# Feature extraction
from preprocessing_nlp import extract_features
param_dict = {
    "concat_scenes": False,
    "extract_direc": False, 
    "remove_punct": True, 
    "rmv_stopwords": False,
    "lwr": True, 
    "exp_contractions": True,
    "conversion": "lemmatize"
}
test = preprocess(df, **param_dict)
# Note: features are extracted from `df`; `test` is not used below.
feature_df = extract_features(df, "count")
feature_df.shape

(59911, 20866)
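
Assuming `"count"` selects plain bag-of-words counts, the shape above would read as 59911 lines by 20866 vocabulary terms. How `extract_features` builds this matrix is not shown, so the sketch below only illustrates the general idea with the standard library and three toy documents.

```python
from collections import Counter

docs = ["i sell paper", "paper is my job", "i love my job"]

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = sorted({tok for doc in docs for tok in doc.split()})
col = {tok: j for j, tok in enumerate(vocab)}

# One row per document, one column per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    row = [0] * len(vocab)
    for tok, n in counts.items():
        row[col[tok]] = n
    matrix.append(row)

print((len(matrix), len(vocab)))  # (3, 7), analogous to feature_df.shape
```

Most entries in such a matrix are zero, so libraries typically store it in a sparse format once the vocabulary grows to thousands of terms.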

In [6]:
# save the preprocessed data
# df.to_csv(PATH+"preprocessed_"+FILE, sep=",", index=True)
# feature_df.to_csv(PATH+"feature_"+FILE, sep=",", index=True)