# Emotion Classification: Data Cleaning

Analysis by Frank Flavell

## Business Case

The goal of this project is to develop a Natural Language Understanding (NLU) algorithm for classifying the underlying emotion associated with a chat message so that chatbots and other programs can use this information to deliver a better experience to users. 

People have goals.  Some goals are explicit and others are implicit.  Explicit goals are usually easy to identify because a person can clearly articulate them: buying groceries, resolving a billing issue, traveling to the beach, updating software, and so on.  Implicit goals, on the other hand, are more difficult to identify.  These are emotional goals that aren't always articulated, even though achieving them is often more valuable than achieving the explicit goals.  Not only does a person want to buy groceries, they also want to feel good about the experience.

It requires emotional intelligence to recognize emotions and act in ways that will address these emotions in a healthy and harmonious way.  As we all know from customer service interactions, not all people have this level of emotional intelligence.  If a program can automate this emotional classification process, then organizations could use this information to dramatically improve a wide variety of service encounters across industries.

The specific application of this classification algorithm will be the NLU pipeline of an emotionally intelligent chatbot.


## Dataset

I will be using the [DailyDialog](http://yanran.li/dailydialog) dataset, compiled for the International Joint Conference on Natural Language Processing (IJCNLP) in Taipei, Taiwan by Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu.

This is a very thorough dataset that includes over 13,000 conversations and over 100,000 utterances.  Each conversation has been manually categorized by topic, and each utterance is additionally labeled with an emotion and a statement type.
* ***Dialogue: string.  One utterance per row.***
* ***Emotion: int. The emotion associated with the text.***
    * 0: No emotion
    * 1: Anger
    * 2: Disgust
    * 3: Fear
    * 4: Happiness
    * 5: Sadness
    * 6: Surprise
* ***Type: int. The type of utterance.***
    * 1: Inform
    * 2: Question
    * 3: Directive
    * 4: Commissive
* ***Topic: The general topic of the conversation.***
    * 1: Ordinary Life
    * 2: School Life
    * 3: Culture & Education
    * 4: Attitude & Emotion
    * 5: Relationship
    * 6: Tourism
    * 7: Health
    * 8: Work
    * 9: Politics
    * 10: Finance
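
For reference when reading the integer labels later in the notebook, the codes above can be kept as plain dictionaries. This is only a convenience sketch: the names `EMOTIONS`, `ACTS`, `TOPICS`, and `describe` are my own and are not part of the dataset.

```python
# Label codes copied from the lists above.
EMOTIONS = {0: "No emotion", 1: "Anger", 2: "Disgust", 3: "Fear",
            4: "Happiness", 5: "Sadness", 6: "Surprise"}
ACTS = {1: "Inform", 2: "Question", 3: "Directive", 4: "Commissive"}
TOPICS = {1: "Ordinary Life", 2: "School Life", 3: "Culture & Education",
          4: "Attitude & Emotion", 5: "Relationship", 6: "Tourism",
          7: "Health", 8: "Work", 9: "Politics", 10: "Finance"}


def describe(emotion, act, topic):
    """Return the human-readable names for one utterance's integer labels."""
    return EMOTIONS[emotion], ACTS[act], TOPICS[topic]


print(describe(2, 3, 1))
```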


## Table of Contents<span id="0"></span>

1. [**Import Dialogue**](#1)
2. [**Import Conversation Topics & Merge**](#2)
3. [**Import Emotion Classification**](#3)
4. [**Import Dialogue Act**](#4)
5. [**Compare DF to Emotions and Acts**](#5)
6. [**Explode Conversations and Topics to Utterances**](#6)
7. [**Explode Emotions and Merge**](#7)
8. [**Explode Dialogue Acts and Merge**](#8)
9. [**Update Datatypes**](#9)

### Pre-Processing Cleaning Pipeline for Future Inputs

10. [**Remove Unnecessary Spaces**](#10)
11. [**Lowercase**](#11)
12. [**Expand Contractions**](#12)
13. [**Lemmatization**](#13)
14. **Export Cleaned Data**




## Package Import

In [3]:
# import external libraries

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import re #regex

%pip install -U spacy
# The English model is a separate download: python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')

# Configure matplotlib for jupyter.
%matplotlib inline

## Data Import & Cleaning

The data comes in 4 different .txt files, one for each feature: dialogue, topic, emotion, and dialogue act.  In all files, each line contains one conversation containing several utterances between two people.  I needed to 'explode' each conversation so each utterance had its own row in the dataframe.

I also discovered that the conversation at index 672 was missing one emotion and dialogue act classification.  I investigated the strings at this index in the emotion and act dataframes and updated the values to include the appropriate classification.  With this update, I could effectively merge the features together into a master df.

The result is a dataframe containing 102,980 non-null string utterances in the dialogue column and integer objects labeling each utterance in the topic, emotion, and type columns.

## <span id="1"></span>1. Import Dialogue
#### [Return Contents](#0)

In [261]:
df = pd.read_csv("data/dialogues_text.txt", delimiter="\t", header=None)
df.columns = ["dialogue"]

In [262]:
df.head()

Unnamed: 0,dialogue
0,The kitchen stinks . \t I'll throw out the gar...
1,"So Dick , how about getting some coffee for to..."
2,Are things still going badly with your housegu...
3,"Would you mind waiting a while ? \t Well , how..."
4,Are you going to the annual party ? I can give...


In [263]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 1 columns):
dialogue    13118 non-null object
dtypes: object(1)
memory usage: 102.6+ KB


In [264]:
df['dialogue'][0]

"The kitchen stinks . \\t I'll throw out the garbage . \\t"

In [265]:
type(df['dialogue'][0])

str
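
The doubled backslash in the output of `df['dialogue'][0]` suggests that the separator stored in the file is a literal backslash followed by `t`, not a tab character. That would explain why `delimiter="\t"` left each conversation in a single column. A small sketch of the difference, using the first conversation as the sample string:

```python
# A raw string keeps the backslash, mimicking the text shown above.
line = r"The kitchen stinks . \t I'll throw out the garbage . \t"

# Splitting on a real tab character finds nothing to split on.
by_tab = line.split("\t")

# Splitting on backslash + "t" separates the utterances.
by_literal = [part.strip() for part in line.split("\\t") if part.strip()]

print(len(by_tab))
print(by_literal)
```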

No null values hiding as an empty string.

In [266]:
df[df['dialogue'] == '']

Unnamed: 0,dialogue


## <span id="2"></span>2. Import Conversation Topics & Merge
#### [Return Contents](#0)

In [267]:
# Select the single column explicitly so a Series is assigned
df['topic'] = pd.read_csv('data/dialogues_topic.txt', header=None)[0]

In [268]:
df.head()

Unnamed: 0,dialogue,topic
0,The kitchen stinks . \t I'll throw out the gar...,1
1,"So Dick , how about getting some coffee for to...",1
2,Are things still going badly with your housegu...,1
3,"Would you mind waiting a while ? \t Well , how...",1
4,Are you going to the annual party ? I can give...,1


In [269]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 2 columns):
dialogue    13118 non-null object
topic       13118 non-null int64
dtypes: int64(1), object(1)
memory usage: 205.1+ KB


No null values hiding as an empty string.

In [270]:
df[df['dialogue'] == '']

Unnamed: 0,dialogue,topic


## <span id="3"></span>3. Import Emotion Classification
#### [Return Contents](#0)

In [271]:
# Import the txt file.
emotions = pd.read_csv('data/dialogues_emotion.txt', header=None)
# Label the emotions df column name
emotions.columns = ["emotion"]

In [272]:
emotions.head()

Unnamed: 0,emotion
0,2 0
1,4 2 0 1 0
2,0 1 0 0
3,0 0 0 4
4,0 4 4


We can see that the number of rows in the emotions dataframe matches the number of rows in the dialogue and topic dataframe: 13,118.

In [273]:
emotions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 1 columns):
emotion    13118 non-null object
dtypes: object(1)
memory usage: 102.6+ KB


No null values hiding as an empty string.

In [274]:
emotions[emotions['emotion'] == '']

Unnamed: 0,emotion


Each conversation emotion classification string contains an extra space at the end, which means it will create an extra row when we explode the conversations to utterances.  We will need to deal with this when the time comes.

In [275]:
emotions.emotion[0]

'2 0 '
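
To see why the trailing space matters, here is a minimal illustration in plain Python using the first label string shown above:

```python
labels = "2 0 "

# Splitting on a single space keeps the empty piece after the last space.
on_space = labels.split(" ")

# Splitting on any run of whitespace drops it.
on_whitespace = labels.split()

print(on_space)
print(on_whitespace)
```

The explode step later in this notebook splits on a single space, which is why the empty rows have to be filtered out afterwards.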

## <span id="4"></span>4. Import Dialogue Act
#### [Return Contents](#0)

In [276]:
# Import the txt file.
acts = pd.read_csv('data/dialogues_act.txt', header=None)
# Label the acts df column name
acts.columns = ['type']

In [277]:
acts.head()

Unnamed: 0,type
0,3 4
1,3 4 3 1 1
2,2 1 3 4
3,3 2 1 1
4,3 4 1


We can also see that the number of rows in the dialogue acts dataframe matches the number of rows in the dialogue & topic as well as emotions dataframes: 13,118.

In [278]:
acts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13118 entries, 0 to 13117
Data columns (total 1 columns):
type    13118 non-null object
dtypes: object(1)
memory usage: 102.6+ KB


No null values hiding as an empty string.

In [279]:
acts[acts['type'] == '']

Unnamed: 0,type


## <span id="5"></span>5. Compare DF to Emotions & Acts
#### [Return Contents](#0)

Unfortunately, when I initially exploded the rows of the emotion and acts dataframes, they were both one row short of the dialogue dataframe, which meant there was one label missing!  The conversations also weren't labeled with an ID number that I could use to match them with the labels from the other .txt files.  If the utterances and labels didn't line up, it would completely undermine my ability to predict the emotional classification of utterances.

Since the original conversation-per-row dataframes all contain the same number of rows, I made a list containing the number of utterances per row as well as a list containing the number of emotion labels per row.  I compared the two lists and identified that the row at index 672 was the only row where the number of utterances (12) differed from the number of emotion labels (11).  The same was true for the acts dataframe.  I examined the contents of row 672 and updated the values to contain the correct classifications.
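
The comparison described above can be sketched on toy data. The two Series below are made-up stand-ins for the real conversation-level columns, with the second row deliberately missing a label:

```python
import pandas as pd

# Toy stand-ins for the conversation-level columns; row 1 is missing a label.
dialogue = pd.Series(["Hi . \\t Hello . \\t", "Bye . \\t See you . \\t"])
emotion = pd.Series(["0 4 ", "0 "])

# Every utterance ends with the separator, so the piece count minus one
# equals the number of utterances (same idea for the trailing space).
n_utterances = dialogue.apply(lambda s: len(s.split("\\t")) - 1)
n_labels = emotion.apply(lambda s: len(s.split(" ")) - 1)

# Index positions where the two counts disagree.
mismatched = n_utterances.index[n_utterances != n_labels].tolist()
print(mismatched)
```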

In [280]:
num_utter = df.dialogue.apply(lambda x: len(x.split('\\t'))-1)

In [281]:
num_emo = emotions.emotion.apply(lambda x: len(x.split(' '))-1)

In [282]:
compare = num_utter == num_emo

In [283]:
compare.index[~compare]

Int64Index([672], dtype='int64')

In [284]:
num_emo.iloc[672]

11

In [285]:
num_utter.iloc[672]

12

In [286]:
df.iloc[672, 0]

"Sam , can we stop at this bicycle shop ? \\t Do you want to buy a new bicycle ? \\t Yes , and they have a sale on now . \\t What happened to your old one ? \\t I left it at my parent's house , but I need one here as well . \\t I've been using Jim's old bike but he needs it back . \\t Let's go then . \\t Look at this mountain bike . It is only £ 330 . Do you like it ? \\t I prefer something like this one - a touring bike , but it is more expensive . \\t How much is it ? \\t The price on the tag says £ 565 but maybe you can get a discount . \\t OK , let's go and ask . \\t"

In [287]:
emotions.iloc[672]

emotion    0 0 0 0 0 0 0 0 0 0 0 
Name: 672, dtype: object

In [288]:
type(emotions.emotion[672])

str

In [290]:
# Use .loc to avoid chained assignment when writing the corrected labels
emotions.loc[672, 'emotion'] = '0 0 0 0 0 0 0 0 0 0 0 0'

In [291]:
emotions.emotion[672]

'0 0 0 0 0 0 0 0 0 0 0 0'

In [292]:
# Use .loc to avoid chained assignment when writing the corrected labels
acts.loc[672, 'type'] = '2 2 1 2 1 1 3 2 1 2 1 3'

In [293]:
acts.type[672]

'2 2 1 2 1 1 3 2 1 2 1 3'

## <span id="6"></span>6. Explode Conversations & Topics into Utterances
#### [Return Contents](#0)

Once each row of each dataframe contained the same number of values, I 'exploded' the conversations into utterances, making a dataframe with one utterance per row.

The function below is courtesy of [James Allen](https://gist.github.com/jlln/338b4b0b55bd6984f883).

In [294]:
def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split
    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pd.DataFrame(new_rows)
    return new_df

In [295]:
df = splitDataFrameList(df,'dialogue','\\t')
df = df[df.dialogue != '']
df.reset_index(drop=True, inplace=True)
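
As an aside, pandas 0.25 and later ship a built-in `DataFrame.explode` that can stand in for the helper function. The sketch below is an alternative under that assumption, not the method used in this notebook, and the `toy` frame is made up for illustration:

```python
import pandas as pd

# Two toy conversations using the same literal backslash-t separator.
toy = pd.DataFrame({
    "dialogue": ["Hi . \\t Hello . \\t", "Bye . \\t"],
    "topic": [1, 2],
})

# Turn each conversation into a list of pieces, then give each piece a row.
exploded = toy.assign(
    dialogue=toy["dialogue"].apply(lambda s: s.split("\\t"))
).explode("dialogue")

# Trim the padding and drop the empty piece left by the trailing separator.
exploded["dialogue"] = exploded["dialogue"].str.strip()
exploded = exploded[exploded["dialogue"] != ""].reset_index(drop=True)

print(exploded)
```

The other columns (here `topic`) are repeated for every new row, which matches the behavior of `splitDataFrameList`.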

In [296]:
df.head()

Unnamed: 0,dialogue,topic
0,The kitchen stinks .,1
1,I'll throw out the garbage .,1
2,"So Dick , how about getting some coffee for to...",1
3,Coffee ? I don ’ t honestly like that kind of...,1
4,"Come on , you can at least try a little , bes...",1


## <span id="7"></span>7. Explode Emotions & Merge
#### [Return Contents](#0)

In [297]:
# Split the strings in each row and expand them into their own rows
emotions = splitDataFrameList(emotions,'emotion',' ')
# Remove any rows that contain an empty quote
emotions = emotions[emotions.emotion != '']
# Reset Index
emotions.reset_index(drop=True, inplace=True)

In [298]:
emotions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 1 columns):
emotion    102980 non-null object
dtypes: object(1)
memory usage: 804.7+ KB


Correct number of rows!  We're good to go.

In [299]:
df['emotion'] = emotions['emotion']
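
Because the column assignment relies on the two frames lining up row by row, it can be worth asserting that before merging. A minimal sketch with made-up frames (`utterances` and `labels` are illustrative names):

```python
import pandas as pd

utterances = pd.DataFrame(
    {"dialogue": ["The kitchen stinks .", "I'll throw out the garbage ."]}
)
labels = pd.DataFrame({"emotion": ["2", "0"]})

# Column assignment aligns on the index, so both frames need the same
# length and the same 0..n-1 index for labels to land on the right rows.
assert len(utterances) == len(labels)
assert utterances.index.equals(labels.index)

utterances["emotion"] = labels["emotion"]
print(utterances)
```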

In [300]:
df.head()

Unnamed: 0,dialogue,topic,emotion
0,The kitchen stinks .,1,2
1,I'll throw out the garbage .,1,0
2,"So Dick , how about getting some coffee for to...",1,4
3,Coffee ? I don ’ t honestly like that kind of...,1,2
4,"Come on , you can at least try a little , bes...",1,0


## <span id="8"></span>8. Explode Dialogue Acts & Merge
#### [Return Contents](#0)

In [301]:
# Split the strings in each row and expand them into their own rows
acts = splitDataFrameList(acts,'type',' ')
# Remove any rows that contain an empty quote
acts = acts[acts.type != '']
# Reset Index
acts.reset_index(drop=True, inplace=True)

In [302]:
acts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 1 columns):
type    102980 non-null object
dtypes: object(1)
memory usage: 804.7+ KB


Correct number of rows!  We're good to go.

In [303]:
df['type'] = acts['type']

In [304]:
df.head()

Unnamed: 0,dialogue,topic,emotion,type
0,The kitchen stinks .,1,2,3
1,I'll throw out the garbage .,1,0,4
2,"So Dick , how about getting some coffee for to...",1,4,3
3,Coffee ? I don ’ t honestly like that kind of...,1,2,4
4,"Come on , you can at least try a little , bes...",1,0,3


## <span id="9"></span>9. Update Datatypes
#### [Return Contents](#0)

In [305]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 4 columns):
dialogue    102980 non-null object
topic       102980 non-null int64
emotion     102980 non-null object
type        102980 non-null object
dtypes: int64(1), object(3)
memory usage: 3.1+ MB


In [306]:
df.emotion = df.emotion.astype('int')

In [307]:
df.type = df.type.astype('int')

In [308]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102980 entries, 0 to 102979
Data columns (total 4 columns):
dialogue    102980 non-null object
topic       102980 non-null int64
emotion     102980 non-null int64
type        102980 non-null int64
dtypes: int64(3), object(1)
memory usage: 3.1+ MB
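
The two conversions can also be written as a single call by passing a column-to-dtype mapping to `astype`. A small sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "dialogue": ["a", "b"],
    "topic": [1, 1],
    "emotion": ["2", "0"],
    "type": ["3", "4"],
})

# Convert both label columns in one call; astype returns a new frame.
toy = toy.astype({"emotion": "int64", "type": "int64"})
print(toy.dtypes)
```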


## <span id="10"></span>10. Remove Unnecessary Spaces
#### [Return Contents](#0)

Removing spaces at the beginning and end of each utterance.

In [309]:
df.dialogue = df.dialogue.str.strip()

Removing spaces surrounding punctuation marks.

In [310]:
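def pun_before(text):
    # Remove whitespace before punctuation (and collapse repeated whitespace)
    updated = re.sub(r"\s+([?.!’,:\s])", r"\1", text)
    return updated

def pun_after(text):
    # Remove whitespace after a typographic apostrophe, e.g. "don’ t" -> "don’t"
    updated = re.sub(r"’\s", r"’", text)
    return updated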

In [311]:
def fix_pun(text):
    before = pun_before(text)
    after = pun_after(before)
    return after

In [312]:
fix_pun(str(df.iloc[5,0]))

'What’s wrong with that? Cigarette is the thing I go crazy for.'
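
To make the two substitutions easier to follow, here is a self-contained sketch of the same idea on a single utterance. The function name `tidy_punctuation` is my own, and the pattern is a simplified variant rather than the exact one used in the cells above:

```python
import re


def tidy_punctuation(text):
    """Remove the spaces the corpus puts around punctuation marks."""
    # Drop whitespace that sits directly before a punctuation mark.
    text = re.sub(r"\s+([?.!’,:])", r"\1", text)
    # Drop whitespace that follows a typographic apostrophe ("don’ t").
    return re.sub(r"’\s+", "’", text)


print(tidy_punctuation("Coffee ? I don ’ t honestly like that kind of stuff ."))
```

One caveat of this approach: a plural possessive followed by a space (for example "parents’ house") would also lose its space, so the rule is only safe for text tokenized the way this corpus is.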

In [313]:
df['dialogue'] = df.dialogue.apply(fix_pun)

In [314]:
df.head()

Unnamed: 0,dialogue,topic,emotion,type
0,The kitchen stinks.,1,2,3
1,I'll throw out the garbage.,1,0,4
2,"So Dick, how about getting some coffee for ton...",1,4,3
3,Coffee? I don’t honestly like that kind of stuff.,1,2,4
4,"Come on, you can at least try a little, beside...",1,0,3


## <span id="11"></span>11. Lowercase
#### [Return Contents](#0)

It is easier to make additional pre-processing updates to the text if all of it is lowercase.  For example, the next step expands contractions using a list of contractions that is mostly lowercase, so the utterances need to be lowercase as well for the matches to succeed.

In [315]:
df['dialogue'] = df['dialogue'].apply(lambda x: x.lower())

In [316]:
df.head()

Unnamed: 0,dialogue,topic,emotion,type
0,the kitchen stinks.,1,2,3
1,i'll throw out the garbage.,1,0,4
2,"so dick, how about getting some coffee for ton...",1,4,3
3,coffee? i don’t honestly like that kind of stuff.,1,2,4
4,"come on, you can at least try a little, beside...",1,0,3


## <span id="12"></span>12. Expand Contractions
#### [Return Contents](#0)

Now that the spaces surrounding punctuation are eliminated, we can deal with contractions by expanding them into their original form so they are easier to vectorize.

Contraction map and expander function courtesy of [Dipanjan Sarkar](https://www.kdnuggets.com/author/dipanjan-sarkar).

In [317]:
cList = {
"ain’t": "is not",
"aren’t": "are not",
"can’t": "cannot",
"can’t’ve": "cannot have",
"’cause": "because",
"could’ve": "could have",
"couldn’t": "could not",
"couldn’t’ve": "could not have",
"didn’t": "did not",
"doesn’t": "does not",
"don’t": "do not",
"hadn’t": "had not",
"hadn’t’ve": "had not have",
"hasn’t": "has not",
"haven’t": "have not",
"he’d": "he would",
"he’d’ve": "he would have",
"he’ll": "he will",
"he’ll’ve": "he will have",
"he’s": "he is",
"how’d": "how did",
"how’d’y": "how do you",
"how’ll": "how will",
"how’s": "how is",
"I’d": "I would",
"I’d’ve": "I would have",
"I’ll": "I will",
"I’ll’ve": "I will have",
"I’m": "I am",
"I’ve": "I have",
"i’d": "i would",
"i’d’ve": "i would have",
"i’ll": "i will",
"i’ll’ve": "i will have",
"i’m": "i am",
"i’ve": "i have",
"isn’t": "is not",
"it’d": "it would",
"it’d’ve": "it would have",
"it’ll": "it will",
"it’ll’ve": "it will have",
"it’s": "it is",
"let’s": "let us",
"ma’am": "madam",
"mayn’t": "may not",
"might’ve": "might have",
"mightn’t": "might not",
"mightn’t’ve": "might not have",
"must’ve": "must have",
"mustn’t": "must not",
"mustn’t’ve": "must not have",
"needn’t": "need not",
"needn’t’ve": "need not have",
"o’clock": "of the clock",
"oughtn’t": "ought not",
"oughtn’t’ve": "ought not have",
"shan’t": "shall not",
"sha’n’t": "shall not",
"shan’t’ve": "shall not have",
"she’d": "she would",
"she’d’ve": "she would have",
"she’ll": "she will",
"she’ll’ve": "she will have",
"she’s": "she is",
"should’ve": "should have",
"shouldn’t": "should not",
"shouldn’t’ve": "should not have",
"so’ve": "so have",
"so’s": "so as",
"that’d": "that would",
"that’d’ve": "that would have",
"that’s": "that is",
"there’d": "there would",
"there’d’ve": "there would have",
"there’s": "there is",
"they’d": "they would",
"they’d’ve": "they would have",
"they’ll": "they will",
"they’ll’ve": "they will have",
"they’re": "they are",
"they’ve": "they have",
"to’ve": "to have",
"wasn’t": "was not",
"we’d": "we would",
"we’d’ve": "we would have",
"we’ll": "we will",
"we’ll’ve": "we will have",
"we’re": "we are",
"we’ve": "we have",
"weren’t": "were not",
"what’ll": "what will",
"what’ll’ve": "what will have",
"what’re": "what are",
"what’s": "what is",
"what’ve": "what have",
"when’s": "when is",
"when’ve": "when have",
"where’d": "where did",
"where’s": "where is",
"where’ve": "where have",
"who’ll": "who will",
"who’ll’ve": "who will have",
"who’s": "who is",
"who’ve": "who have",
"why’s": "why is",
"why’ve": "why have",
"will’ve": "will have",
"won’t": "will not",
"won’t’ve": "will not have",
"would’ve": "would have",
"wouldn’t": "would not",
"wouldn’t’ve": "would not have",
"y’all": "you all",
"y’all’d": "you all would",
"y’all’d’ve": "you all would have",
"y’all’re": "you all are",
"y’all’ve": "you all have",
"you’d": "you would",
"you’d’ve": "you would have",
"you’ll": "you will",
"you’ll’ve": "you will have",
"you’re": "you are",
"you’ve": "you have"
}

In [318]:
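# Try longer contractions first so "can’t’ve" is not matched as "can’t" + "’ve"
contractions_re = re.compile('(%s)' % '|'.join(
    re.escape(k) for k in sorted(cList, key=len, reverse=True)))

def expand_contractions(s, contractions_dict=cList):
    # The pattern is built from cList, so contractions_dict must share its keys
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)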

In [319]:
expand_contractions(df.dialogue[5], contractions_dict=cList)

'what is wrong with that? cigarette is the thing i go crazy for.'

In [320]:
df['dialogue'] = df.dialogue.apply(lambda x: expand_contractions(x, contractions_dict=cList))


In [321]:
df.head(10)

Unnamed: 0,dialogue,topic,emotion,type
0,the kitchen stinks.,1,2,3
1,i'll throw out the garbage.,1,0,4
2,"so dick, how about getting some coffee for ton...",1,4,3
3,coffee? i do not honestly like that kind of st...,1,2,4
4,"come on, you can at least try a little, beside...",1,0,3
5,what is wrong with that? cigarette is the thin...,1,1,1
6,"not for me, dick.",1,0,1
7,are things still going badly with your housegu...,1,0,2
8,getting worse. now he is eating me out of hous...,1,1,1
9,"leo, i really think you are beating around the...",1,0,3
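
Row 1 in the output above still reads `i'll`. The keys of the contraction map use the typographic apostrophe (’), while that utterance uses the straight one ('), so the pattern never matches it. One possible remedy, sketched here with a hypothetical three-entry map (`mini_map`) instead of the full `cList`, is to normalize apostrophes before expanding:

```python
import re

# Hypothetical mini-map; the notebook's full cList would be used in practice.
mini_map = {"i’ll": "i will", "can’t": "cannot", "can’t’ve": "cannot have"}

# Longest keys first, so "can’t’ve" wins over its prefix "can’t".
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(mini_map, key=len, reverse=True))
)


def expand(text, mapping=mini_map):
    # Assumption: every straight apostrophe in the text marks a contraction.
    text = text.replace("'", "’")
    return pattern.sub(lambda m: mapping[m.group(0)], text)


print(expand("i'll throw out the garbage."))
print(expand("you can’t’ve seen it"))
```

The blanket replacement would also rewrite straight single quotes used as quotation marks, so it is only a reasonable shortcut if the corpus does not rely on them.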


In [322]:
df.to_pickle("data/dialogue_master.pickle")

## <span id="13"></span>13. Lemmatization
#### [Return Contents](#0)

In [329]:
df = pd.read_pickle("data/dialogue_master.pickle")

In [325]:
df.head()

Unnamed: 0,dialogue,topic,emotion,type
0,the kitchen stinks.,1,2,3
1,i'll throw out the garbage.,1,0,4
2,"so dick, how about getting some coffee for ton...",1,4,3
3,coffee? i do not honestly like that kind of st...,1,2,4
4,"come on, you can at least try a little, beside...",1,0,3


In [330]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Needs the NLTK tokenizer, POS tagger, and WordNet data (see nltk.download)
lemmatizer = WordNetLemmatizer()

def nltk2wn_tag(nltk_tag):
    # Map a Penn Treebank tag to the matching WordNet POS constant
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    # Tag each token, then lemmatize it with its WordNet POS when one exists
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    wn_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), nltk_tagged)
    res_words = []
    for word, tag in wn_tagged:
        if tag is None:
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))
    return " ".join(res_words)
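
The tag-mapping step can be illustrated without NLTK installed, since WordNet's part-of-speech constants are plain single letters. The dictionary and function names below are my own; the logic mirrors `nltk2wn_tag`:

```python
# WordNet's POS constants are single letters:
# wordnet.ADJ == 'a', wordnet.VERB == 'v', wordnet.NOUN == 'n', wordnet.ADV == 'r'.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag such as 'VBZ' to a WordNet POS letter, or None."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])


for tag in ["JJ", "VBZ", "NNS", "RB", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as determiners (`DT`), map to `None`, and the corresponding words are passed through unchanged.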

In [331]:
# Apply the sentence-level function. nltk2wn_tag expects a single POS tag,
# so applying it to whole utterances would replace each one with None.
df['dialogue'] = df.dialogue.apply(lemmatize_sentence)

In [332]:
df.head()

In [253]:
pip install -U spacy


In [255]:
import spacy 
# Load English tokenizer, tagger, 
# parser, NER and word vectors 
nlp = spacy.load("en_core_web_sm")

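OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

The load fails because spaCy models are installed separately from the library; running `python -m spacy download en_core_web_sm` first makes the model available.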

In [326]:
import nltk

lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Lemmatize word by word. Looping over the string itself would visit
    # single characters, and returning inside the loop would stop at the first.
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

In [327]:
df['dialogue'] = df.dialogue.apply(lemmatize_text)

In [328]:
df.head(50)

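
A `for` loop over a Python string visits one character at a time, and a `return` inside the loop exits on the very first one. A lemmatizer that loops directly over an utterance therefore hands back a single letter. The short illustration below uses a sample utterance from this dataset:

```python
text = "the kitchen stinks."

# Iterating over a string yields single characters ...
first_item = next(iter(text))

# ... so a word-level function needs the text split into tokens first.
tokens = text.split()

print(first_item)
print(tokens)
```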


In [None]:
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer() 
  
print("rocks :", lemmatizer.lemmatize("rocks"))

In [None]:
def lemmatize_text(text):
    # spaCy alternative: requires the `nlp` pipeline to be loaded first.
    # spaCy 2.x lemmatizes pronouns to '-PRON-', so keep their original text.
    doc = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])

## <span id="14"></span>14. Export Cleaned Data
#### [Return Contents](#0)

Save cleaned data to a pickle for easy access in the future.

In [54]:
df.to_pickle("data/dialogue_master.pickle")
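
A quick way to confirm that the export round-trips cleanly is to read the pickle back and compare it with the original. The sketch below does this on a made-up one-row frame in a temporary directory, so it does not touch the real `data/` folder:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"dialogue": ["the kitchen stinks."], "emotion": [2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dialogue_master.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle preserves dtypes, so the integer labels come back as integers.
print(restored.equals(toy))
```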