# Data Cleaning

## Introduction

This notebook goes through a necessary step of any data science project - data cleaning.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll use data from Kaggle
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data into a form that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in standard text format: **Corpus** (a collection of text).

## Problem Statement

Our goal is to look at famous anime quotes and note their similarities and differences. Specifically, I'd like to know whether their popularity comes from seeking a fixed truth or from seeking a personal opinion. Why are ‘Itachi’, ‘Luffy’ and ‘Eren’ among the most beloved characters of all time? Is Levi's attractiveness due to his words or his actions?

## Getting The Data

We will use the 'Anime Quotes' dataset downloaded from Kaggle.

In [1]:
# import pandas module under alias pd
import pandas as pd
# show up to 150 characters per column so long quotes are readable
pd.set_option('display.max_colwidth', 150)

# data import using pandas module : data_df
data_df = pd.read_csv('AnimeQuotes.csv')

# see first 5 observations
data_df.head()

,Quote,Character,Anime
0,"People’s lives don’t end when they die, it ends when they lose faith.",Itachi Uchiha,Naruto
1,"If you don’t take risks, you can’t create a future!",Monkey D Luffy,One Piece
2,"If you don’t like your destiny, don’t accept it.",Naruto Uzumaki,Naruto
3,"When you give up, that’s when the game ends.",Mitsuyoshi Anzai,Slam Dunk
4,All we can do is live until the day we die. Control what we can…and fly free.,Deneil Young,Uchuu Kyoudai or Space Brothers


## Cleaning The Data

First, we will merge the individual quotes of each character into a single text, so that we can analyze each character as a whole.

In [2]:
# Concatenate all quotes per character and rename the 'Quote' column to 'Quotes'
quotes = data_df.groupby(['Character'])['Quote'].apply(' '.join).reset_index().rename(columns={'Quote':'Quotes'})

# Join it to the original dataset
data = data_df.merge(quotes, on='Character')

# Drop the original 'Quote' column, then drop the resulting duplicate rows
data = data.drop(columns=['Quote']).drop_duplicates()

# Set the character name as our dataset index
data = data.set_index('Character')

# See the first 5 observations
data.head()

Character,Anime,Quotes
Itachi Uchiha,Naruto,"People’s lives don’t end when they die, it ends when they lose faith."
Monkey D Luffy,One Piece,"If you don’t take risks, you can’t create a future! Forgetting is like a wound. The wound may heal, but it has already left a scar. Being lonely i..."
Naruto Uzumaki,Naruto,"If you don’t like your destiny, don’t accept it. Hard work is worthless for those that don’t believe in themselves. If you don’t like your destiny..."
Mitsuyoshi Anzai,Slam Dunk,"When you give up, that’s when the game ends."
Deneil Young,Uchuu Kyoudai or Space Brothers,All we can do is live until the day we die. Control what we can…and fly free.
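
The merge-then-deduplicate steps above can also be written as a single aggregation. The following is a minimal sketch on a tiny hand-made frame (the rows and the names `toy_df` and `toy_data` are illustrative, not taken from the Kaggle file). It assumes each character belongs to exactly one anime, so keeping the first `Anime` value per group loses nothing.

```python
import pandas as pd

# Tiny illustrative frame with the same columns as the dataset (toy rows)
toy_df = pd.DataFrame({
    'Quote': ['Quote one.', 'Quote two.', 'Quote three.'],
    'Character': ['A', 'B', 'A'],
    'Anime': ['Show X', 'Show Y', 'Show X'],
})

# One row per character: keep the first anime name and join all quotes with spaces.
# sort=False keeps characters in order of first appearance.
toy_data = (
    toy_df.groupby('Character', sort=False)
          .agg(Anime=('Anime', 'first'), Quotes=('Quote', ' '.join))
)
print(toy_data)
```

The result is already indexed by `Character`, so no separate `set_index` call is needed in this variant.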


In [3]:
# Let's take a look at the quotes for Naruto Uzumaki
data.Quotes.loc['Naruto Uzumaki']

'If you don’t like your destiny, don’t accept it. Hard work is worthless for those that don’t believe in themselves. If you don’t like your destiny, don’t accept it. Instead, have the courage to change it the way you want it to be.'

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (such as the newline character `\n`)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Part-of-speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
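
This notebook only applies the first few steps from these lists. For orientation, here is a rough sketch of what tokenization, stop-word removal and bi-gram creation could look like in plain Python. The stop-word set and the helper names (`tokenize`, `remove_stop_words`, `make_bigrams`) are hypothetical and deliberately tiny; a real project would more likely rely on a library-provided tokenizer and stop-word list.

```python
# Hypothetical, deliberately tiny stop-word set (an assumption for illustration)
STOP_WORDS = {'the', 'is', 'a', 'if', 'you', 'it', 'when', 'that', 'they'}

def tokenize(text):
    '''Split already-cleaned text into tokens on whitespace.'''
    return text.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    '''Drop tokens that appear in the stop-word set.'''
    return [tok for tok in tokens if tok not in stop_words]

def make_bigrams(tokens):
    '''Pair each token with the token that follows it.'''
    return list(zip(tokens, tokens[1:]))

sample = 'when you give up thats when the game ends'
tokens = tokenize(sample)
content_tokens = remove_stop_words(tokens)
bigrams = make_bigrams(content_tokens)
print(content_tokens)
print(bigrams)
```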

In [4]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # raw string so that \w and \d are read as regex classes, not string escapes
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [5]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data.Quotes.apply(round1))
data_clean

Character,Quotes
Itachi Uchiha,people’s lives don’t end when they die it ends when they lose faith
Monkey D Luffy,if you don’t take risks you can’t create a future forgetting is like a wound the wound may heal but it has already left a scar being lonely is mor...
Naruto Uzumaki,if you don’t like your destiny don’t accept it hard work is worthless for those that don’t believe in themselves if you don’t like your destiny do...
Mitsuyoshi Anzai,when you give up that’s when the game ends
Deneil Young,all we can do is live until the day we die control what we can…and fly free
...,...
Tobio Kageyama,being the best decoy ever is as cool as being the ace you can fly even higher if they adjust to me i have to adjust in turn whoever stops adjustin...
Yuu Nishinoya,life s a bore if you don t challenge yourself
Tanaka Saeko,there are some flowers you only see when you take detours
Ittetsu Takeda,being weak means that there is room to grow
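
The curly apostrophes (’) and the ellipsis (…) are still present in the output above. That is expected: `string.punctuation` only contains ASCII punctuation, so the pattern built from it never matches typographic characters. A quick check, as a sketch with a made-up sample string:

```python
import re
import string

# string.punctuation holds ASCII punctuation only
ascii_pattern = '[%s]' % re.escape(string.punctuation)

# Made-up sample mixing a curly apostrophe, an ASCII apostrophe and an ellipsis
sample = "don’t stop, don't quit…"
stripped = re.sub(ascii_pattern, '', sample)

print(stripped)
print('’' in string.punctuation, "'" in string.punctuation)
```

Only the comma and the ASCII apostrophe are removed; the curly apostrophe and the ellipsis need to be targeted explicitly.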


Note that we don't need the name of the anime each character belongs to right now, so it is fine to drop the 'Anime' column.

In [6]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2

In [7]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.Quotes.apply(round2))
data_clean

Character,Quotes
Itachi Uchiha,peoples lives dont end when they die it ends when they lose faith
Monkey D Luffy,if you dont take risks you cant create a future forgetting is like a wound the wound may heal but it has already left a scar being lonely is more ...
Naruto Uzumaki,if you dont like your destiny dont accept it hard work is worthless for those that dont believe in themselves if you dont like your destiny dont a...
Mitsuyoshi Anzai,when you give up thats when the game ends
Deneil Young,all we can do is live until the day we die control what we canand fly free
...,...
Tobio Kageyama,being the best decoy ever is as cool as being the ace you can fly even higher if they adjust to me i have to adjust in turn whoever stops adjustin...
Yuu Nishinoya,life s a bore if you don t challenge yourself
Tanaka Saeko,there are some flowers you only see when you take detours
Ittetsu Takeda,being weak means that there is room to grow
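
One side effect is visible in the output above: deleting the ellipsis outright fuses `can…and` into `canand`. A possible variant, offered as a sketch rather than as this notebook's method, replaces the ellipsis and newlines with a space and then collapses repeated whitespace. The function name `clean_text_round2_spaced` is hypothetical.

```python
import re

def clean_text_round2_spaced(text):
    '''Variant of round 2: drop curly quotes, turn ellipses and newlines into spaces.'''
    text = re.sub('[‘’“”]', '', text)          # curly quotes are deleted, as in round 2
    text = re.sub('[…\n]', ' ', text)          # word separators become a single space
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

spaced = clean_text_round2_spaced('control what we can…and fly free')
print(spaced)
```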


**NOTE:** This data cleaning step (also known as text pre-processing) could go on for a while, but we are going to stop here for now. If the results of later analysis don't make sense or could be improved, we can come back and make more edits.

## Organizing The Data

I mentioned earlier that the output of this notebook will be clean, organized data in a standard text format: a **Corpus** (a collection of text). We already created the corpus in an earlier step: it is the original dataframe, `data_df`.

In [8]:
# Let's take a look at our original dataframe
data_df

,Quote,Character,Anime
0,"People’s lives don’t end when they die, it ends when they lose faith.",Itachi Uchiha,Naruto
1,"If you don’t take risks, you can’t create a future!",Monkey D Luffy,One Piece
2,"If you don’t like your destiny, don’t accept it.",Naruto Uzumaki,Naruto
3,"When you give up, that’s when the game ends.",Mitsuyoshi Anzai,Slam Dunk
4,All we can do is live until the day we die. Control what we can…and fly free.,Deneil Young,Uchuu Kyoudai or Space Brothers
...,...,...,...
116,Life s a bore if you don t challenge yourself,Yuu Nishinoya,Haikyuu
117,There are some flowers you only see when you take detours,Tanaka Saeko,Haikyuu
118,Being weak means that there is room to grow,Ittetsu Takeda,Haikyuu
119,Today might be the chance to grasp the chance to let your talent bloom,Tooru Oikawa,Haikyuu


In [9]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

# Let's also pickle the cleaned data
data_clean.to_pickle('data_clean.pkl')
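
To pick the data up again in a later notebook, the pickled files can be read back with `pd.read_pickle`. The sketch below round-trips a one-row stand-in for `data_clean` through a temporary directory, so it does not touch the real files; the frame `toy_clean` is illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for data_clean (single illustrative row)
toy_clean = pd.DataFrame(
    {'Quotes': ['being weak means that there is room to grow']},
    index=pd.Index(['Ittetsu Takeda'], name='Character'),
)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data_clean.pkl')
    toy_clean.to_pickle(path)
    restored = pd.read_pickle(path)

# The index name and contents survive the round trip
print(restored.equals(toy_clean))
```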