# Preprocessing

Dataset: Quotes-500K (https://github.com/ShivaliGoel/Quotes-500K)

Download links: 
- https://www.mediafire.com/file/elzplaxcnpf91qs/quotes.csv/file
- https://drive.google.com/file/d/1M5TTqsLw7uVZfJfvH-fNpRxL9Mw0Iky8/view?usp=sharing
- https://archive.org/details/quotes_20230625

In [89]:
import pandas as pd
import re

In [90]:
quotes_df = pd.read_csv('quotes.csv')
quotes_df

Unnamed: 0,quote,author,category
0,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe,"attributed-no-source, best, life, love, mistak..."
1,You've gotta dance like there's nobody watchin...,William W. Purkey,"dance, heaven, hurt, inspirational, life, love..."
2,You know you're in love when you can't fall as...,Dr. Seuss,"attributed-no-source, dreams, love, reality, s..."
3,A friend is someone who knows all about you an...,Elbert Hubbard,"friend, friendship, knowledge, love"
4,Darkness cannot drive out darkness: only light...,"Martin Luther King Jr., A Testament of Hope: T...","darkness, drive-out, hate, inspirational, ligh..."
...,...,...,...
499704,I do believe the most important thing I can do...,John C. Stennis,"Past, Believe, Help"
499705,I'd say I'm a bit antimadridista although I do...,Isco,"Team, Humility, Know"
499706,The future is now.,Nam June Paik,Now
499707,"In all my life and in the future, I will alway...",Norodom Sihamoni,"Life, My Life, Servant"


In [91]:
# Remove category column - unnecessary for this project
quotes_df = quotes_df.drop(columns=['category'])

# Remove rows where quote is empty
quotes_df = quotes_df.dropna(subset=['quote'])

# Some rows have no author - we keep these and label the author 'Unknown'
quotes_df = quotes_df.fillna('Unknown')
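
The null handling above can be checked on a tiny frame. This is a minimal sketch with made-up placeholder rows (the `toy` data is not from the dataset), showing that rows with a missing quote are dropped while rows with a missing author are kept and labelled 'Unknown'.

```python
import pandas as pd

# Hypothetical placeholder rows, not taken from Quotes-500K
toy = pd.DataFrame({
    "quote": ["first placeholder quote", None, "third placeholder quote"],
    "author": ["Some Author", "Another Author", None],
    "category": ["life", "love", None],
})

cleaned = (
    toy.drop(columns=["category"])   # category is not needed
       .dropna(subset=["quote"])     # a row without a quote is useless
       .fillna("Unknown")            # keep quotes whose author is missing
)
print(cleaned)
```

The order matters: calling `fillna` before `dropna` would turn missing quotes into the string 'Unknown' and they would no longer be dropped.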

In [92]:
# Some entries in 'author' have NAME, BOOK TITLE (separated by comma): just get the NAME and discard BOOK TITLE
quotes_df['author'] = quotes_df['author'].apply(lambda x: x.split(',')[0] if isinstance(x, str) else x)
# Replace multiple spaces with single space
quotes_df['author'] = quotes_df['author'].apply(lambda x: re.sub(r' +', r' ', x) if isinstance(x, str) else x)
quotes_df['author'] = quotes_df['author'].str.strip()

def get_mismatched_rows(df):
    """
    Returns the rows of the dataframe where the data was entered incorrectly
    across the 'quote', 'author', and 'category' columns. Often, part of the
    quote ends up in 'author' and the author ends up in 'category'.

    We select rows where the 'author' value is more than 5 words long and
    does not start with a capital letter. Such a value is unlikely to be the
    actual author, so we presume it is the continuation of the text in
    'quote'. This justifies concatenating the 'quote' and 'author' values
    after calling this function.

    Side effect: adds the helper columns 'author_len' and 'author_caps' to df.
    """
    df['author_len'] = df['author'].apply(lambda x: len(x.split(' ')))
    df['author_caps'] = df['author'].apply(lambda x: x[0].isupper() if x != '' else False)

    rows_to_fix = df[(df['author_len'] > 5) & (df['author_caps'] == False)]
    return rows_to_fix
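
To see which rows the heuristic catches, here is a minimal sketch on three rows. `flag_split_quotes` is a hypothetical, side-effect-free variant of the function above (it does not add helper columns), and the toy rows reuse values that appear in the outputs of this notebook.

```python
import pandas as pd

def flag_split_quotes(df, max_words=5):
    """Hypothetical variant: flag rows whose 'author' looks like a quote fragment."""
    word_count = df["author"].str.split(" ").str.len()
    # str[:1] yields '' for an empty string, and ''.isupper() is False
    starts_upper = df["author"].str[:1].str.isupper()
    return df[(word_count > max_words) & ~starts_upper]

toy = pd.DataFrame({
    "quote": [
        "If you love something so much let it go.",
        "The future is now.",
        "Don't touch me",
    ],
    "author": [
        "if it doesn't it never was",   # 6 words, lowercase start: flagged
        "Nam June Paik",                # short, capitalised: a real author
        "I'll die if you touch me.",    # 6 words, capitalised: NOT flagged
    ],
})

flagged = flag_split_quotes(toy)
print(flagged)
```

The third row shows the limit of the heuristic: a fragment that happens to start with a capital letter slips through.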

In [93]:
rows_to_fix = get_mismatched_rows(quotes_df)

# Work on an explicit copy so the assignment below does not raise a
# SettingWithCopyWarning
rows_fixed = rows_to_fix.copy()

# Append the 'author' fragment to 'quote'. 'author' itself is left unchanged:
# the real author would have been in 'category', which we dropped earlier.
rows_fixed['quote'] = rows_fixed['quote'] + " " + rows_fixed['author']
rows_fixed.head()

Unnamed: 0,quote,author,author_len,author_caps
807,If you love something so much let it go. If it...,if it doesn't it never was,6,False
1236,Submission is not about authority and it is no...,it is all about relationships of love and resp...,9,False
1917,Emotion without reason lets people walk all ov...,reason without emotion is a mask for cruelty.,8,False
2730,We all are so deeply interconnected we have no...,we have no option but to love all. Be kind and...,34,False
2882,I'd run my whole life long to reach you paddle...,paddle my way across Atlantic and Pacific,7,False


In [94]:
# Replace the incorrect rows with the newly fixed rows
cleaned_df = pd.concat([quotes_df.drop(rows_fixed.index), rows_fixed])

get_mismatched_rows(cleaned_df)
# These rows are still flagged because the heuristic only inspects 'author',
# which still holds the quote fragment. The real author is unknown for these
# rows because the quote was split across the columns.

Unnamed: 0,quote,author,author_len,author_caps
807,If you love something so much let it go. If it...,if it doesn't it never was,6,False
1236,Submission is not about authority and it is no...,it is all about relationships of love and resp...,9,False
1917,Emotion without reason lets people walk all ov...,reason without emotion is a mask for cruelty.,8,False
2730,We all are so deeply interconnected we have no...,we have no option but to love all. Be kind and...,34,False
2882,I'd run my whole life long to reach you paddle...,paddle my way across Atlantic and Pacific,7,False
...,...,...,...,...
424406,Being is always becoming people change and sta...,people change and stay the same.,6,False
424693,We normally know we're getting older when the ...,unless you're a cancer survivor! Then we love ...,11,False
424814,Cole was meticulous to a fault office scuttleb...,office scuttlebut had it that he never went ou...,17,False
424825,Naivete in grownups is often charming but when...,but when coupled with vanity it is indistingui...,10,False


In [95]:
rows_to_fix = get_mismatched_rows(cleaned_df)

# 'author' was not changed by the first pass, so the same rows are selected
# again. Append the 'author' fragment only where 'quote' does not already end
# with it, so the fragment is never appended twice.
rows_fixed = rows_to_fix.copy()
already_merged = pd.Series(
    [q.endswith(a) for q, a in zip(rows_fixed['quote'], rows_fixed['author'])],
    index=rows_fixed.index, dtype=bool)
needs_merge = ~already_merged
rows_fixed.loc[needs_merge, 'quote'] = (
    rows_fixed.loc[needs_merge, 'quote'] + " " + rows_fixed.loc[needs_merge, 'author'])
rows_fixed.head()

Unnamed: 0,quote,author,author_len,author_caps
807,If you love something so much let it go. If it...,if it doesn't it never was,6,False
1236,Submission is not about authority and it is no...,it is all about relationships of love and resp...,9,False
1917,Emotion without reason lets people walk all ov...,reason without emotion is a mask for cruelty.,8,False
2730,We all are so deeply interconnected we have no...,we have no option but to love all. Be kind and...,34,False
2882,I'd run my whole life long to reach you paddle...,paddle my way across Atlantic and Pacific,7,False


In [96]:
# Replace the incorrect rows with the newly fixed rows
cleaned_df = pd.concat([cleaned_df.drop(rows_fixed.index), rows_fixed])

In [97]:
get_mismatched_rows(cleaned_df)
# The heuristic still flags these rows because their 'author' value is
# unchanged; the fragment has been appended to their 'quote' value.

Unnamed: 0,quote,author,author_len,author_caps
807,If you love something so much let it go. If it...,if it doesn't it never was,6,False
1236,Submission is not about authority and it is no...,it is all about relationships of love and resp...,9,False
1917,Emotion without reason lets people walk all ov...,reason without emotion is a mask for cruelty.,8,False
2730,We all are so deeply interconnected we have no...,we have no option but to love all. Be kind and...,34,False
2882,I'd run my whole life long to reach you paddle...,paddle my way across Atlantic and Pacific,7,False
...,...,...,...,...
424406,Being is always becoming people change and sta...,people change and stay the same.,6,False
424693,We normally know we're getting older when the ...,unless you're a cancer survivor! Then we love ...,11,False
424814,Cole was meticulous to a fault office scuttleb...,office scuttlebut had it that he never went ou...,17,False
424825,Naivete in grownups is often charming but when...,but when coupled with vanity it is indistingui...,10,False
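
One way to make the merge safe to re-run is to reset 'author' once its fragment has been appended, so the heuristic no longer matches the row. The sketch below is an alternative to the cells above, not what this notebook does: `merge_fragments` and its `placeholder` parameter are hypothetical names, and replacing the fragment with 'Unknown' is an assumption about the desired result.

```python
import pandas as pd

def merge_fragments(df, max_words=5, placeholder="Unknown"):
    """Hypothetical helper: append fragment-like 'author' values to 'quote'."""
    out = df.copy()
    word_count = out["author"].str.split(" ").str.len()
    starts_upper = out["author"].str[:1].str.isupper()
    mask = (word_count > max_words) & ~starts_upper

    out.loc[mask, "quote"] = out.loc[mask, "quote"] + " " + out.loc[mask, "author"]
    # Resetting 'author' means a second call finds nothing left to merge
    out.loc[mask, "author"] = placeholder
    return out

toy = pd.DataFrame({
    "quote": ["Being is always becoming", "The future is now."],
    "author": ["people change and stay the same.", "Nam June Paik"],
})

once = merge_fragments(toy)
twice = merge_fragments(once)
print(once)
```

With this variant the merged rows would also survive a later filter on author length, because their author is now the single word 'Unknown'.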


In [98]:
# Replace all empty strings with whitespace before converting to CSV.
# Otherwise, when we read the CSV back in, the value will be registered as NULL.
cleaned_df = cleaned_df.replace('', ' ')

# Remove indicator columns we no longer need
output_df = cleaned_df.drop(['author_len', 'author_caps'], axis=1)

# Export
# output_df.to_csv('quotes_clean.csv', index=False)
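
The empty-string behaviour described above can be reproduced in memory. This is a minimal sketch on a made-up two-row frame; `round_trip` is a hypothetical helper. It also shows `keep_default_na=False`, an existing `read_csv` option that is an alternative to padding with a space, with the caveat that it disables all default NA parsing for the file.

```python
import io
import pandas as pd

def round_trip(df, **read_kwargs):
    """Hypothetical helper: write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer, **read_kwargs)

demo = pd.DataFrame({
    "quote": ["The future is now.", "placeholder quote"],
    "author": ["Nam June Paik", ""],   # second author is an empty string
})

default_read = round_trip(demo)                          # '' comes back as NaN
padded_read = round_trip(demo.replace("", " "))          # ' ' survives as text
literal_read = round_trip(demo, keep_default_na=False)   # '' stays ''

print(default_read["author"].isna().tolist())
```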

### An aside...

In [99]:
# There are still 2000+ rows where the quote is split, but these are harder to
# handle: the 'author' value is longer than 5 words yet starts with a capital
# letter, so it could be a quote fragment, an author, or an author + book title.
# Even 2,500 such rows would be only 0.5% of our 500,000-quote dataset, so
# fixing them is not worth the effort.
cleaned_df[cleaned_df['author_len'] > 5]

Unnamed: 0,quote,author,author_len,author_caps
2757,Don't touch me,I'll die if you touch me.,6,True
2885,Just in case you ever foolishly forget,I'm never not thinking of you.,6,True
2918,I'm not afraid of death,I just don't want to be there when it happens.,10,True
3418,I don't want to achieve immortality through my...,I want to achieve immortality through not dyin...,20,True
3965,Life is but a day,A fragile dew-drop on its perilous wayFrom a t...,10,True
...,...,...,...,...
424406,Being is always becoming people change and sta...,people change and stay the same.,6,False
424693,We normally know we're getting older when the ...,unless you're a cancer survivor! Then we love ...,11,False
424814,Cole was meticulous to a fault office scuttleb...,office scuttlebut had it that he never went ou...,17,False
424825,Naivete in grownups is often charming but when...,but when coupled with vanity it is indistingui...,10,False


### Make quotes_clean file smaller!

In [100]:
# Build a normalised key (lowercase letters a-z only) so that quotes differing
# only in case, punctuation, or spacing count as duplicates
output_df['quote_stripped'] = output_df['quote'].apply(lambda x: re.sub(r'[^a-z]', '', x.lower()))
output_df = output_df.drop_duplicates(subset='quote_stripped', keep='first')
output_df = output_df.drop(columns=['quote_stripped'])
output_df


Unnamed: 0,quote,author
0,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe
1,You've gotta dance like there's nobody watchin...,William W. Purkey
2,You know you're in love when you can't fall as...,Dr. Seuss
3,A friend is someone who knows all about you an...,Elbert Hubbard
4,Darkness cannot drive out darkness: only light...,Martin Luther King Jr.
...,...,...
424406,Being is always becoming people change and sta...,people change and stay the same.
424693,We normally know we're getting older when the ...,unless you're a cancer survivor! Then we love ...
424814,Cole was meticulous to a fault office scuttleb...,office scuttlebut had it that he never went ou...
424825,Naivete in grownups is often charming but when...,but when coupled with vanity it is indistingui...
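
The deduplication key is easier to judge on a few rows. This is a minimal sketch with a toy frame (the second row is a made-up variant of a quote from the dataset); `dedup_key` is a hypothetical name for the same regular expression used above.

```python
import re
import pandas as pd

def dedup_key(text):
    """Lowercase the text and keep only the letters a-z."""
    return re.sub(r"[^a-z]", "", text.lower())

toy = pd.DataFrame({
    "quote": [
        "The future is now.",
        "the future is NOW!",     # same quote, different case and punctuation
        "A different placeholder quote",
    ],
    "author": ["Nam June Paik", "Unknown", "Some Author"],
})

keys = toy["quote"].map(dedup_key)
deduped = toy[~keys.duplicated(keep="first")]
print(deduped)
```

Because the key also discards digits and non-ASCII letters, two quotes that differ only in a number or an accented character would be treated as duplicates.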


In [101]:
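# Drop the rows whose 'author' is longer than 5 words: these are most likely
# quote fragments or author + book title rather than an author name
output_df = output_df[output_df['author'].apply(lambda x: len(x.split()) <= 5)]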
output_df

Unnamed: 0,quote,author
0,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe
1,You've gotta dance like there's nobody watchin...,William W. Purkey
2,You know you're in love when you can't fall as...,Dr. Seuss
3,A friend is someone who knows all about you an...,Elbert Hubbard
4,Darkness cannot drive out darkness: only light...,Martin Luther King Jr.
...,...,...
499704,I do believe the most important thing I can do...,John C. Stennis
499705,I'd say I'm a bit antimadridista although I do...,Isco
499706,The future is now.,Nam June Paik
499707,"In all my life and in the future, I will alway...",Norodom Sihamoni


In [102]:
# Sample rows to keep the CSV under GitHub's 100 MB file size limit
final_output_df = output_df.sample(n=450000, random_state=1)

final_output_df.to_csv('quotes_clean.csv', index=False)
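
Before committing, the sampled size can be checked rather than guessed. This is a minimal sketch: `sample_to_fit` and `csv_size_bytes` are hypothetical helpers, and the limit is assumed to be 100 MiB per file. Capping `n` at `len(df)` avoids the `ValueError` that `DataFrame.sample` raises when more rows are requested than exist (without replacement).

```python
import pandas as pd

# Assumption: GitHub rejects single files larger than 100 MiB
SIZE_LIMIT_BYTES = 100 * 1024 * 1024

def sample_to_fit(df, n, seed=1):
    """Hypothetical helper: sample at most n rows, never more than df holds."""
    return df.sample(n=min(n, len(df)), random_state=seed)

def csv_size_bytes(df):
    """Size of the frame when serialised as UTF-8 CSV without the index."""
    return len(df.to_csv(index=False).encode("utf-8"))

toy = pd.DataFrame({
    "quote": ["The future is now.", "placeholder quote", "another placeholder"],
    "author": ["Nam June Paik", "Unknown", "Unknown"],
})

sampled = sample_to_fit(toy, n=450000)
size = csv_size_bytes(sampled)
print(len(sampled), size, size < SIZE_LIMIT_BYTES)
```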